python - Obtaining just the last row when using beautiful soup

I have the following code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

def Get_Top_List_BR(url):
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page)
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]

    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        url = url_string.split('?', 1)[0].split('t', 4)[-1].split('/', 1)[0]
        # set serial number as key to avoid duplication in any other category - especially title
        movies[url] = [url_string] + [i.text for i in items]

    movie_page = pd.DataFrame(movies).T  # transpose
    movie_page.columns = ['URL', 'Rank', 'Title', 'Genre', 'Budget', 'Running Time', 'Gross',
                          'Theaters', 'Total_Gross', 'Release_Date', 'Distributor', 'Estimated']

    return movie_page

df_test_BR = Get_Top_List_BR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')

df_test_BR.head(10)

Problem: I am only getting the last row. Question: How can I fix it to return all the rows?

question from: https://stackoverflow.com/questions/65867400/obtaining-just-the-last-row-when-using-beautiful-soup


1 Reply


First, whatever Python version you are using, the way you instantiate BeautifulSoup is not ideal: the documentation strongly recommends specifying a parser explicitly, because otherwise bs4 guesses the best parser installed on your system (which can vary between machines) and emits a warning. This part of your code:

response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)
table = soup.find(id='table')

should be:

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find(id='table')
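
As a quick illustration of why the parser argument matters, here is a minimal sketch against an inline HTML string rather than the live page:

import warnings
from bs4 import BeautifulSoup

html = "<table id='table'><tr><td>demo</td></tr></table>"

# Without an explicit parser, recent bs4 versions guess the best parser
# installed on the system (html.parser, lxml, html5lib) and emit a
# "no parser was explicitly specified" warning; the guess can differ
# from machine to machine.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup(html)
print(len(caught) > 0)  # True on versions that warn

# Naming the parser keeps the behaviour deterministic everywhere:
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='table').td.text)  # demo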

Your actual issue, however, is how you define url inside the for-loop. The loop itself visits every row; the problem is that the chained split calls you use to derive url return an empty string every time.

That is why you only see the last row: the chained splits leave a string that begins with '/', so the final .split('/', 1)[0] is an empty string. Every iteration therefore produces the same blank key, each row overwrites the previous entry in movies, and only the final row survives.
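
Here is a minimal sketch of what those chained splits actually produce, using a hypothetical href of the shape the year-page table links to (the exact value is illustrative):

url_string = '/release/rl1182631425/?ref_=bo_yld_table_1'  # hypothetical href

step1 = url_string.split('?', 1)[0]  # '/release/rl1182631425/'
step2 = step1.split('t', 4)[-1]      # no 't' in the path, so it is unchanged
step3 = step2.split('/', 1)[0]       # the path starts with '/', so this is ''

# The same empty key on every row, so each assignment to movies['']
# clobbers the previous one.
print(repr(step3))  # ''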

I'm not sure exactly how you wanted the key defined, but the code below does what you intend - it fetches all the movies, their names and href values, and returns the first 10. The only differences are how the key is derived and how movies is indexed; just be careful not to trip over the key derivation again.

Also, the value you rebuild inside the for-loop is a unique ID, not a URL, so its name should say so - call it unique_id (or, as in this example, uid). I also included print statements to demonstrate that the loop runs through every row and that the first 10 values come back.

# same imports as in the question's script
from bs4 import BeautifulSoup
import requests
import pandas as pd

def Get_Top_List_GR(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]

    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split the href into the unique movie serial number
        uid = url_string.split("/")[-2]
        print("{0} - {1} - {2}".format(url, title, uid))
        # set the serial number as key to avoid duplication in any other category - especially title
        movies[uid] = [url_string] + [i.text for i in items]

    movie_page = pd.DataFrame(movies).T  # transpose
    return movie_page

df_test_ = Get_Top_List_GR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')
print(df_test_.head(10))
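
If you still want the named columns from your original version, reattach them after the transpose; this reuses the column list from your own question (the href plus one name per td cell):

df_test_.columns = ['URL', 'Rank', 'Title', 'Genre', 'Budget', 'Running Time', 'Gross',
                    'Theaters', 'Total_Gross', 'Release_Date', 'Distributor', 'Estimated']

Using the release ID as the dictionary key still de-duplicates rows, which was the point of your original key in the first place.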
