First, I'm not sure which Python version you are using, but the way you construct BeautifulSoup is incorrect, at least in my version. BeautifulSoup strongly recommends passing a parser here. Your code:
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)
table = soup.find(id='table')
should be:
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find(id='table')
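For example, passing 'html.parser' explicitly works on any snippet and avoids BeautifulSoup's missing-parser warning (the HTML below is a made-up snippet, not the real page):

```python
from bs4 import BeautifulSoup

# hypothetical minimal HTML, just to show the explicit-parser call
html = "<table id='table'><tr><td>Movie</td></tr></table>"
soup = BeautifulSoup(html, 'html.parser')  # parser named explicitly, no warning
table = soup.find(id='table')
print(table.td.text)  # Movie
```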
Your actual issue is how you define url inside the for-loop. I managed to loop through all the elements, and how you define url there is specifically the problem: the way you define url inside the for-loop returns an empty string.
You say it returns just the last item. When the loop gets to the last item, it fetches url inside the for-loop. But that url is just an empty string, and the key already exists in movies, so each iteration overwrites the data already stored there.
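To see why only the last row survives, here's a minimal sketch of what happens when every iteration writes to the same (empty) key:

```python
movies = {}
movies[""] = ["first row"]   # url evaluates to an empty string
movies[""] = ["second row"]  # same empty key: the previous entry is overwritten
print(len(movies))   # 1
print(movies[""])    # ['second row']
```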
I'm not sure how you wanted url defined, but this code does what you intend: fetch all the movies, their names, and their href values, and return the first 10. The only differences should be how you define url and movies[url], but be careful not to trip up on url again.
Also, since you redefine url within the for-loop to represent a unique ID, the name should reflect that: call it unique_id (or, in this example, uid). I also included print statements to demonstrate that it goes through the entire loop and gets the first 10 values.
import requests
import pandas as pd
from bs4 import BeautifulSoup

def Get_Top_List_GR(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]
    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        uid = url_string.split("/")[-2]
        print("{0} - {1} - {2}".format(url, title, uid))
        # set serial number as key to avoid duplication in any other category - especially title
        movies[uid] = [url_string] + [i.text for i in items]
    movie_page = pd.DataFrame(movies).T  # transpose
    return movie_page
df_test_ = Get_Top_List_GR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')
print(df_test_.head(10))
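As a standalone illustration of the uid extraction, here's how the split behaves on a hypothetical Box Office Mojo href (the path below is made up for the example):

```python
# hypothetical href value from a table row's <a> tag
url_string = "/release/rl1182631425/?ref_=bo_yld_table_1"

# split("/") -> ['', 'release', 'rl1182631425', '?ref_=bo_yld_table_1']
# index -2 picks out the unique serial number
uid = url_string.split("/")[-2]
print(uid)  # rl1182631425
```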