I've built a crawler that had to run on about 5M pages (by increasing the url ID) and then parses the pages which contain the info' I need.
after using an algorithm which run on the urls (200K) and saved the good and bad results I found that the I'm wasting a lot of time. I could see that there are a a few returning subtrahends which I can use to check the next valid url.
you can see the subtrahends quite fast (a little ex' of the few first "good IDs") -
510000011 # +8
510000029 # +18
510000037 # +8
510000045 # +8
510000052 # +7
510000060 # +8
510000078 # +18
510000086 # +8
510000094 # +8
510000102 # +8
510000110 # etc'
510000128
510000136
510000144
510000151
510000169
510000177
510000185
510000193
510000201
after crawling about 200K urls which gave me only 14K good results I knew I was wasting my time and need to optimize it, so I run some statistics and built a function that will check the urls while increasing the id with 818178 (top returning subtrahends ) etc'.
this is the function -
def checkNextID(ID):
global numOfRuns, curRes, lastResult
while ID < lastResult:
try:
numOfRuns += 1
if numOfRuns % 10 == 0:
time.sleep(3) # sleep every 10 iterations
if isValid(ID + 8):
parseHTML(curRes)
checkNextID(ID + 8)
return 0
if isValid(ID + 18):
parseHTML(curRes)
checkNextID(ID + 18)
return 0
if isValid(ID + 7):
parseHTML(curRes)
checkNextID(ID + 7)
return 0
if isValid(ID + 17):
parseHTML(curRes)
checkNextID(ID + 17)
return 0
if isValid(ID+6):
parseHTML(curRes)
checkNextID(ID + 6)
return 0
if isValid(ID + 16):
parseHTML(curRes)
checkNextID(ID + 16)
return 0
else:
checkNextID(ID + 1)
return 0
except Exception, e:
print "somethin went wrong: " + str(e)
what is basically does is -checkNextID(ID) is getting the first id I know that contain the data minus 8 so the first iteration will match the first "if isValid" clause (isValid(ID + 8) will return True).
lastResult is a variable which saves the last known url id, so we'll run until numOfRuns is
isValid() is a function that gets an ID + one of the subtrahends and returns True if the url contains what I need and saves a soup object of the url to a global varibale named - 'curRes', it returns False if the url doesn't contain the data I need.
parseHTML is a function that gets the soup object (curRes), parses the data I need and then saves the data to a csv, then returns True.
if isValid() returns True, we'll call parseHTML() and then try to check the next ID+the subtrahends (by calling checkNextID(ID + subtrahends), if none of them will return what I'm looking for I'll increase it with 1 and check again until I'll find the next valid url.
you can see the rest of the code here
after running the code I got about 950~ good results and suddenly an exception had raised -
"somethin went wrong: maximum recursion depth exceeded while calling a
Python object"
I could see on WireShark that the scipt stuck on id - 510009541 (I started my script with 510000003), the script tried getting the url with that ID a few times before I noticed the error and stopped it.
I was really exciting to see that I got the same results but 25x-40x times faster then my old script, with fewer HTTP requests, it's very precise, I have missed only 1 result for 1000 good results, which is find by me, it's impossible to rum 5M times, I had my old script running for 30 hours and got 14-15K results when my new script gave me 960~ results in 5-10 minutes.
I read about stack limitations, but there must be a solution for the algorithm I'm trying to implement in Python (I can't go back to my old "algorithm", it will never end).
Thanks!
See Question&Answers more detail:
os