I'm writing a spider in Python, using the lxml library to parse HTML and the gevent library for async I/O. I found that after running for some time, the lxml parser starts eating memory, up to 8 GB (all of the server's memory). But I only have 100 async greenlets, and each of them parses a document of at most 300 KB.
I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it in isolation.
The problem is in this line of code:
HTML = lxml.html.fromstring(htmltext)
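Here is roughly how it is called (a minimal sketch, not the actual spider; fetch() and the URLs are placeholders just to make it self-contained):

    import gevent
    from gevent.pool import Pool
    import lxml.html

    def fetch(url):
        # Placeholder: in the real spider the HTML comes over the network,
        # at most ~300 KB per document.
        return "<html><body><a href='/next'>next</a></body></html>"

    def worker(url):
        htmltext = fetch(url)
        html = lxml.html.fromstring(htmltext)  # memory grows here over time
        return html.xpath("//a/@href")

    if __name__ == "__main__":
        pool = Pool(100)  # 100 concurrent greenlets
        urls = ["http://example.com/%d" % i for i in range(1000)]
        for links in pool.imap_unordered(worker, urls):
            pass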
Does anyone know what this could be, or how to fix it?
Thanks for any help.
P.S.
Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
UPDATE:
I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.
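For reference, I believe this is roughly equivalent to setting the soft limits from inside the process with the resource module (just a sketch; I actually set them with the shell ulimit before starting the process, and ulimit takes kilobytes while setrlimit takes bytes):

    import resource

    soft_vmem = 500000 * 1024  # ulimit -Sv 500000 -> RLIMIT_AS
    soft_rss = 615000 * 1024   # ulimit -Sm 615000 -> RLIMIT_RSS

    _, hard_as = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (soft_vmem, hard_as))

    _, hard_rss = resource.getrlimit(resource.RLIMIT_RSS)
    resource.setrlimit(resource.RLIMIT_RSS, (soft_rss, hard_rss))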
Now, after some time, they start writing the following to the error log:
"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored"
I can't catch this exception, so the message keeps being written to the log over and over until there is no free space left on the disk.
How can I catch this exception so I can kill the process and let the daemon spawn a new one?
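This is roughly what I have tried (a sketch; as far as I can tell the except clause never fires, because the MemoryError is raised and "ignored" inside lxml's error-log callback rather than propagated to my code):

    import os
    import sys
    import lxml.html

    def parse(htmltext):
        try:
            return lxml.html.fromstring(htmltext)
        except MemoryError:
            # Never reached in my case: the MemoryError happens inside
            # lxml.etree._BaseErrorLog._receive and is ignored there.
            sys.stderr.write("out of memory, exiting\n")
            os._exit(1)  # let the daemon restart the worker process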