Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to extract elements that share the same tag name in web scraping?

[Image: the web page being scraped]

[Image: the wrong output I get]

So basically I was trying to scrape the rows of streamers on each page, which use the tag name "tr". Each row has multiple columns that I want to include in my output. I was able to include almost all of them, but two columns that share the same tag name (the two columns about followers) frustrated me a lot. I tried indexing them to take only the odd or even ones, but as the second picture shows, it did not work out well: the numbers just keep repeating and do not advance row by row as they should. So is there some way to get the "followers gained" column into the output correctly?

It's my first time asking here, so I am not sure if this is enough. I am glad to add more info later if needed.

import requests
from bs4 import BeautifulSoup

for i in range(30):      # Number of pages plus one
    url = "https://twitchtracker.com/channels/viewership?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    headers = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content)

    channels = soup.find_all('tr')
    for idx, channel in enumerate(channels):
        # Attempt to take every other follower cell: odd indices are bumped to the next even one
        if idx % 2 == 1:
            idx += 1
        Name = ", ".join([p.get_text(strip=True) for p in channel.find_all('a', attrs={'style': 'color:inherit'})])
        Avg = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_ = 'color-viewers')])
        Time = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_ = 'color-streamed')])
        All = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_ = 'color-viewersMax')])
        HW = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_ = 'color-watched')])
        # This lookup searches the whole page (soup), not the current row (channel)
        FG = ", ".join([soup.find_all('td', class_ = 'color-followers hidden-sm')[idx].get_text(strip=True)])
Question from: https://stackoverflow.com/questions/65895834/how-to-extract-elements-belong-to-the-same-tag-name-in-web-scraping

Fight the dragon too long and you become the dragon yourself; gaze into the abyss too long and the abyss will gaze back…

1 Reply

Maybe an alternative approach?

It uses pandas to read the tables; you just have to filter out the ad rows.

I also used time.sleep() to delay each loop iteration and be gentle on the server.
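To show what filtering out the ad rows looks like in isolation, here is a minimal sketch on a made-up DataFrame. The ad text and the two streamer names are invented for illustration; only the "Rank" column name and the ".ads { display:" marker come from this answer.

```python
import pandas as pd

# Made-up data: the middle row stands in for an ad row that leaked into the table.
df = pd.DataFrame({
    "Rank": ["1", ".ads { display: block; }", "2"],
    "Name": ["streamer_a", ".ads { display: block; }", "streamer_b"],
})

# regex=False treats the pattern as plain text rather than a regular expression;
# na=False counts missing values as "no match" so the mask is purely boolean.
is_ad = df["Rank"].str.contains(".ads { display:", regex=False, na=False)

# Keep everything that is not an ad row and renumber the index.
clean = df[~is_ad].reset_index(drop=True)
print(clean)
```

Building the mask as a separate variable makes it easy to inspect how many rows would be dropped before writing anything to disk.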

Example

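import time
from io import StringIO

import pandas as pd
import requests

df_list = []
for i in range(30):      # Number of pages plus one
    url = "https://twitchtracker.com/channels/viewership?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    headers = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    # read_html returns a list of tables; take the first one.
    # StringIO avoids the deprecation warning newer pandas versions raise for literal HTML strings.
    df_list.append(pd.read_html(StringIO(page.text))[0])

    time.sleep(1.5)      # pause between requests to be gentle on the server

df = pd.concat(df_list).reset_index(drop=True)
df.rename(columns={'Unnamed: 2': 'Name'}, inplace=True)
df.drop(df.columns[[0, 1]], axis=1, inplace=True)
# Drop the ad rows before writing; regex=False matches the marker text literally
df[~df.Rank.str.contains(".ads { display:", regex=False, na=False)].to_csv('table.csv', mode='w', index=False)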
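If you would rather stay with BeautifulSoup, the repeating numbers can probably be avoided by searching inside each row instead of indexing into every follower cell on the page. This is only a sketch on made-up HTML: the sample markup, the `extract_rows` helper, and the assumption that "followers gained" comes before "total followers" within a row are mine, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the assumed layout: two cells per row share one class.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Followers gained</th><th>Total followers</th></tr>
  <tr>
    <td><a style="color:inherit">streamer_a</a></td>
    <td class="color-followers hidden-sm">+1,200</td>
    <td class="color-followers hidden-sm">50,000</td>
  </tr>
  <tr>
    <td><a style="color:inherit">streamer_b</a></td>
    <td class="color-followers hidden-sm">+300</td>
    <td class="color-followers hidden-sm">9,000</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Hypothetical helper: return one dict per data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Search within this row only, so positions 0 and 1 always belong to it.
        cells = tr.find_all("td", class_="color-followers")
        if len(cells) < 2:
            continue  # skip header rows or anything without both follower cells
        link = tr.find("a")
        rows.append({
            "name": link.get_text(strip=True) if link else "",
            "followers_gained": cells[0].get_text(strip=True),  # assumed to come first
            "total_followers": cells[1].get_text(strip=True),   # assumed to come second
        })
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because the lookup is scoped to `tr`, no odd/even arithmetic on a page-wide list is needed, and rows that lack the follower cells are simply skipped.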
