python - Obtaining just the last row when using beautiful soup

I have the following code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

def Get_Top_List_BR(url):
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page)
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]

    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        url = url_string.split('?', 1)[0].split('t', 4)[-1].split('/', 1)[0]
        # set serial number as key to avoid duplication in any other category - especially title
        movies[url] = [url_string] + [i.text for i in items]

    movie_page = pd.DataFrame(movies).T  # transpose
    movie_page.columns = ['URL', 'Rank', 'Title', 'Genre', 'Budget', 'Running Time', 'Gross',
                          'Theaters', 'Total_Gross', 'Release_Date', 'Distributor', 'Estimated']

    return movie_page

df_test_BR = Get_Top_List_BR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')

df_test_BR.head(10)

Problem: I am only getting the last row. Question: How can I fix it to return all the rows?

question from: https://stackoverflow.com/questions/65867400/obtaining-just-the-last-row-when-using-beautiful-soup


1 Reply


First, whatever Python version you are using, the way you instantiate BeautifulSoup is not ideal: the documentation strongly recommends specifying a parser explicitly, because otherwise bs4 guesses the best parser installed on your system (which can vary between machines) and emits a warning. This part of your code:

response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)
table = soup.find(id='table')

should be:

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find(id='table')
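
As a quick illustration of why the parser argument matters, here is a minimal sketch against an inline HTML string rather than the live page:

import warnings
from bs4 import BeautifulSoup

html = "<table id='table'><tr><td>demo</td></tr></table>"

# Without an explicit parser, recent bs4 versions guess the best parser
# installed on the system (html.parser, lxml, html5lib) and emit a
# "no parser was explicitly specified" warning; the guess can differ
# from machine to machine.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup(html)
print(len(caught) > 0)  # True on versions that warn

# Naming the parser keeps the behaviour deterministic everywhere:
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='table').td.text)  # demo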

Your actual issue, however, is how you define url inside the for-loop. The loop itself visits every row; the problem is that the chained split calls you use to derive url return an empty string every time.

That is why you only see the last row: the chained splits leave a string that begins with '/', so the final .split('/', 1)[0] is an empty string. Every iteration therefore produces the same blank key, each row overwrites the previous entry in movies, and only the final row survives.
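
Here is a minimal sketch of what those chained splits actually produce, using a hypothetical href of the shape the year-page table links to (the exact value is illustrative):

url_string = '/release/rl1182631425/?ref_=bo_yld_table_1'  # hypothetical href

step1 = url_string.split('?', 1)[0]  # '/release/rl1182631425/'
step2 = step1.split('t', 4)[-1]      # no 't' in the path, so it is unchanged
step3 = step2.split('/', 1)[0]       # the path starts with '/', so this is ''

# The same empty key on every row, so each assignment to movies['']
# clobbers the previous one.
print(repr(step3))  # ''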

I'm not sure exactly how you wanted the key defined, but the code below does what you intend - it fetches all the movies, their names and href values, and returns the first 10. The only differences are how the key is derived and how movies is indexed; just be careful not to trip over the key derivation again.

Also, the value you rebuild inside the for-loop is a unique ID, not a URL, so its name should say so - call it unique_id (or, as in this example, uid). I also included print statements to demonstrate that the loop runs through every row and that the first 10 values come back.

# same imports as in the question's script
from bs4 import BeautifulSoup
import requests
import pandas as pd

def Get_Top_List_GR(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]

    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split the href into the unique movie serial number
        uid = url_string.split("/")[-2]
        print("{0} - {1} - {2}".format(url, title, uid))
        # set the serial number as key to avoid duplication in any other category - especially title
        movies[uid] = [url_string] + [i.text for i in items]

    movie_page = pd.DataFrame(movies).T  # transpose
    return movie_page

df_test_ = Get_Top_List_GR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')
print(df_test_.head(10))
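
If you still want the named columns from your original version, reattach them after the transpose; this reuses the column list from your own question (the href plus one name per td cell):

df_test_.columns = ['URL', 'Rank', 'Title', 'Genre', 'Budget', 'Running Time', 'Gross',
                    'Theaters', 'Total_Gross', 'Release_Date', 'Distributor', 'Estimated']

Using the release ID as the dictionary key still de-duplicates rows, which was the point of your original key in the first place.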
