python - lxml parser eats all memory

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm writing a spider in Python, using the lxml library to parse HTML and the gevent library for async I/O. After running for some time, the lxml parser starts eating memory, up to 8 GB (all of the server's memory). But I only have 100 async threads, and each of them parses a document of at most 300 KB.

I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce the problem.

The problem is in this line of code:

HTML = lxml.html.fromstring(htmltext)

Maybe someone knows what this could be, or how to fix it?

Thanks for any help.

P.S.

Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64    GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Update:

I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.

Now, after some time, they start writing this to the error log:

"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored".

I can't catch this exception, so it writes this message to the log over and over until the disk runs out of free space.

How can I catch this exception and kill the process so that the daemon can create a new one?

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:52:27+0000

You might be keeping some references that keep the documents alive. Be careful with string results from XPath evaluation, for example: by default they are "smart" strings, which provide access to the containing element and thus keep the tree in memory if you keep a reference to them. See the docs on XPath return values:

There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.

(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))
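
To illustrate the difference, here is a minimal sketch, assuming a current Python 3 and lxml install. The sample markup and the variable names are made up for the example; only `smart_strings` comes from the docs quoted above.

```python
from lxml import etree, html

# Hypothetical sample document standing in for a fetched page.
htmltext = "<html><body><p>hello</p><p>world</p></body></html>"
doc = html.fromstring(htmltext)

# Default behaviour: text results are "smart" strings. Each one can reach
# its parent element via getparent(), so holding on to the string also
# holds on to the whole parsed tree.
smart = doc.xpath("//p/text()")
print(hasattr(smart[0], "getparent"))

# With smart_strings=False the results are plain strings with no link
# back to the tree, so the tree can be freed once `doc` goes away.
plain = doc.xpath("//p/text()", smart_strings=False)
print(hasattr(plain[0], "getparent"))

# The same keyword works on a precompiled XPath object, which is handy
# when the same expression is run against many documents.
find_text = etree.XPath("//p/text()", smart_strings=False)
texts = find_text(doc)
```

If you already have smart strings stored somewhere long-lived (a queue, a cache, a results list), switching those lookups to `smart_strings=False` is one thing worth trying. Whether it explains the growth in your spider is still a guess.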
