python - How to extract elements that share the same tag name in web scraping?

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to scrape the rows of streamers on each page, which use the tag name "tr". Each row has multiple columns that I want to include in my output. I was able to include almost all of them, but two columns that share the same tag name and class (the two follower columns) are giving me trouble. I tried indexing them to pick only the odd or even entries, but as the second picture shows, that did not work: the numbers keep repeating instead of advancing from row to row. Is there a way to get the "followers gained" column into the output correctly?

It's my first time asking here, so I'm not sure whether this is enough information. I'm happy to add more details if needed.

for i in range(30):      # Number of pages plus one 
    url = "https://twitchtracker.com/channels/viewership?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    headers = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content)
    
    channels = soup.find_all('tr')
    for idx, channel in enumerate(channels):
        if idx % 2 == 1:    
            idx += 1
        Name = ", ".join([p.get_text(strip=True) for p in channel.find_all('a', attrs={'style': 'color:inherit'})])
        Avg = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_ = 'color-viewers')])
        Time = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_ = 'color-streamed')])
        All = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_ = 'color-viewersMax')])
        HW = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_ = 'color-watched')])
        FG = ", ".join([soup.find_all('td', class_ = 'color-followers hidden-sm')[idx].get_text(strip=True)])

Question from: https://stackoverflow.com/questions/65895834/how-to-extract-elements-belong-to-the-same-tag-name-in-web-scraping

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:16:45+0000

Maybe an alternative approach?

It uses pandas to read the tables directly, so you only have to filter out the ad rows afterwards.

I also used time.sleep() to pause between iterations and be gentle on the server.
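To see the ad clean-up in isolation, here is a minimal sketch on a hand-made frame. The rows and names are invented for illustration; the only thing taken from the real page is the `.ads { display:` marker the filter looks for. Passing `regex=False` is my own addition, so that the marker is matched as plain text rather than as a regular expression.

```python
import pandas as pd

# Toy stand-in for a scraped table; all values here are made up.
df = pd.DataFrame({
    "Rank": ["1", "2", ".ads { display: block; }", "3"],
    "Name": ["alpha", "beta", ".ads { display: block; }", "gamma"],
})

# Flag rows whose Rank cell contains the ad marker as a literal substring.
is_ad = df["Rank"].astype(str).str.contains(".ads { display:", regex=False)

# Keep everything that is not an ad row and renumber the index.
clean = df[~is_ad].reset_index(drop=True)
print(clean["Name"].tolist())
```

The `astype(str)` call is only a safeguard in case a page yields a purely numeric Rank column.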

Example

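import time
from io import StringIO

import pandas as pd
import requests

df_list = []
headers = {'User-agent': 'Mozilla/5.0'}

for i in range(30):      # Number of pages plus one
    url = "https://twitchtracker.com/channels/viewership?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    page = requests.get(url, headers=headers)
    # Take the first table on the page; StringIO avoids the deprecation
    # warning newer pandas versions raise for literal HTML strings.
    df_list.append(pd.read_html(StringIO(page.text))[0])

    time.sleep(1.5)      # pause between requests to be gentle on the server

# Combine all pages into one frame with a fresh index.
df = pd.concat(df_list).reset_index(drop=True)
df.rename(columns={'Unnamed: 2': 'Name'}, inplace=True)
# Drop the first two columns.
df.drop(df.columns[[0, 1]], axis=1, inplace=True)
# Ad rows carry CSS text in the Rank column; match it literally and drop them.
df[~df.Rank.astype(str).str.contains(".ads { display:", regex=False)].to_csv('table.csv', mode='w', index=False)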
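If you would rather stay with BeautifulSoup, the repeating numbers most likely come from indexing into a page-wide `soup.find_all(...)` list: bumping `idx` on odd rows produces the indices 0, 2, 2, 4, 4, ... so each value shows up twice. A minimal sketch of the alternative is to search inside each row instead. The HTML below is a hypothetical, trimmed-down stand-in for the real page, `parse_rows` is a helper name I made up, and I am assuming the first follower cell is "followers gained" and the second is the total; swap the indices if the page orders them the other way.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class names from the question; values are made up.
HTML = """
<table>
  <tr><th>Name</th><th>Avg</th><th>Gained</th><th>Total</th></tr>
  <tr>
    <td><a style="color:inherit">alpha</a></td>
    <td class="color-viewers">100</td>
    <td class="color-followers hidden-sm">+50</td>
    <td class="color-followers hidden-sm">5,000</td>
  </tr>
  <tr>
    <td><a style="color:inherit">beta</a></td>
    <td class="color-viewers">80</td>
    <td class="color-followers hidden-sm">+30</td>
    <td class="color-followers hidden-sm">3,000</td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row, reading both follower cells from that row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Search within this row only, so the matches belong to this streamer.
        followers = tr.find_all("td", class_="color-followers")
        if len(followers) < 2:
            continue  # header, ad, or otherwise incomplete row
        name = tr.find("a", attrs={"style": "color:inherit"})
        rows.append({
            "name": name.get_text(strip=True) if name else "",
            # Assumed order: gained first, total second.
            "followers_gained": followers[0].get_text(strip=True),
            "followers_total": followers[1].get_text(strip=True),
        })
    return rows


rows = parse_rows(HTML)
for row in rows:
    print(row)
```

Because every lookup is scoped to `tr`, no index arithmetic across rows is needed, and rows without both follower cells are simply skipped.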
