Following my previous question, i'm now trying to scrape multiple pages of a url (all the pages with games in a given season). I'm also trying to scrape multiple parent urls (seasons):
from selenium import webdriver
import pandas as pd
import time
url = ['http://www.oddsportal.com/hockey/austria/ebel-2014-2015/results/#/page/',
'http://www.oddsportal.com/hockey/austria/ebel-2013-2014/results/#/page/']
data = []
for i in url:
for j in range(1,8):
print i+str(j)
driver = webdriver.PhantomJS()
driver.implicitly_wait(10)
driver.get(i+str(j))
for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text
if " - " in date:
date, event = date.split(" - ")
else:
event = "Not specified"
data.append({
"home": home.strip(),
"away": away.strip(),
"date": date.strip(),
"event": event.strip()
})
driver.close()
time.sleep(3)
print str(j)+" was ok"
df = pd.DataFrame(data)
print df
# ok for six results then socket.error: [Errno 10054] An existing connection was forcibly closed by the remote host
# ok for two results, then infinite load
# added time.sleep(3)
# ok for first result, infinite load after that
# added implicitly wait
# no result, infinite load
At first I tried the code twice without either the implicit wait on line 14 or the sleep on 35. First result gave the socket error. Second result stalled with no error after two good scraped pages.
Then added the time waits as noted above and they haven't helped.
Since the results are not consistent, my guess is connection be reset between the end of the loop & next run. I'd like to know if that's a likely solution and how to implement. I checked the robots.txt of the site and can't see anything that prevents scraping after a set interval.
Secondly, say the scraper gets 90% of the pages, then stalls (infinite wait). Is there a way to have it retry that loop after x seconds so as to save what you've got and retry from the stalled point again?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…