When you use multiprocessing, everything you pass to a worker has to be pickled. Unfortunately, many BeautifulSoup trees can't be pickled.
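(You can see the pickling requirement directly, with no BeautifulSoup involved; here's a minimal sketch with a toy double function:)

import pickle
from multiprocessing import Pool

def double(x):
    return x * 2

if __name__ == '__main__':
    with Pool() as pool:
        # Each argument is pickled in the parent and unpickled in the worker.
        print(pool.map(double, [1, 2, 3]))
    # Anything pickle rejects can never reach a worker:
    try:
        pickle.dumps(lambda: None)
    except Exception as e:
        print(type(e).__name__, e)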
There are a few different reasons why these trees fail to pickle. Some of them are bugs that have since been fixed, so you could try making sure you have the latest bs4 version, and some are specific to particular parsers or tree builders… but there's a good chance none of that will help.
But the fundamental problem is that many elements in the tree contain references back out to the rest of the tree. Occasionally, this leads to an actual infinite loop, because the circular references are too indirect for pickle's cycle detection. But that's usually a bug that gets fixed. More importantly, even when the loop isn't infinite, pickling a single element can still drag in more than 1000 other elements from all over the tree, and that's already enough to cause a RecursionError.
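Here's a sketch of what those references look like, using toy markup (the same links exist in any bs4 tree):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><div>hi</div><p>bye</p></body></html>',
                     'html.parser')
div = soup.find('div')

print(div.parent.name)               # 'body': the div points back up the tree
print(div.contents[0].next_element)  # the text node links forward into <p>
# Pickling `div` follows these links, so it walks far more of the document
# than just the div itself.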
And I think the RecursionError case is what's happening here. If I take your code and try to pickle divList[0], it fails. (If I bump the recursion limit way up and count the frames, it needs a depth of 23080, way past the default of 1000.) But if I take that exact same div and parse it separately, it pickles with no problem.
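In code, that experiment looks roughly like this (divList is whatever your original code built, so treat it as a placeholder):

import pickle
from bs4 import BeautifulSoup

try:
    pickle.dumps(divList[0])       # still attached to the page's whole tree
except RecursionError as e:
    print('in-tree div failed:', e)

# The exact same markup, parsed on its own, is a small standalone tree:
standalone = BeautifulSoup(str(divList[0]), 'html.parser')
print('re-parsed div pickles to', len(pickle.dumps(standalone)), 'bytes')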
So, one possibility is to just call sys.setrecursionlimit(25000). That will solve the problem for this exact page, but a slightly different page might need even more than that. (Plus, it's usually not a great idea to set the recursion limit that high: not so much because of the wasted memory, but because actual infinite recursion then takes 25x as long, and wastes 25x the resources, before it's detected.)
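If you do go that route, it's a single call, made before anything gets pickled:

import sys
sys.setrecursionlimit(25000)  # default is 1000; this page needed ~23080 frames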
Another trick is to write code that "prunes the tree", eliminating any upward links from the div before/as you pickle it. This is a great solution, except that it might be a lot of work, and requires diving into the internals of how BeautifulSoup works, which I doubt you want to do.
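If you do want to try it, the closest off-the-shelf tool is extract(), which detaches an element from its parent and rewires the next_element/previous_element chain so the subtree stops pointing back into the document. Whether that severs every outward link for your particular parser and bs4 version is something to verify, so this is only a sketch:

import pickle

pruned = [div.extract() for div in divList]  # divList from your original code

# If extract() really cut all the outward links, this now succeeds:
pickle.dumps(pruned[0])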
The easiest workaround is a bit clunky, but… you can convert each div to a string, pass the string to the child, and have the child re-parse it:
from multiprocessing import Pool
from bs4 import BeautifulSoup

def get_info(each):
    # Re-parse in the worker; this small, self-contained tree is all it needs
    div = BeautifulSoup(each, 'html.parser')
    # ... process div exactly as before ...

def new_check():
    divTexts = [str(div) for div in divList]  # divList from your existing code
    with Pool() as pool:
        pool.map(get_info, divTexts)

if __name__ == '__main__':
    new_check()
The performance cost for doing this is probably not going to matter; the bigger worry is that if you had imperfect HTML, converting to a string and re-parsing it might not be a perfect round trip. So, I'd suggest that you do some tests without multiprocessing first to make sure this doesn't affect the results.
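One quick sanity check along those lines (again with divList standing in for your own data):

from bs4 import BeautifulSoup

def round_trips(div):
    # If the parser "corrects" anything on re-parse, the strings will differ.
    return str(BeautifulSoup(str(div), 'html.parser')) == str(div)

bad = [div for div in divList if not round_trips(div)]
print(len(bad), 'divs do not survive the round trip')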