Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

python - Accessing a webpage within a webpage using BeautifulSoup?

I have written a Python script that parses the data of a webpage using BeautifulSoup. What I want to do next is click the name of each person on the page, open their profile, follow the website link on that profile, and scrape the email address (if available) from that website. Can anyone help me with this? I am new to BeautifulSoup and Python, so I am unable to proceed further. Any help is appreciated. Thanks! The kind of link I am working with is: https://www.realtor.com/realestateagents/agentname-john
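For context, "a webpage within a webpage" just means collecting the links from the first parsed page and issuing a second request for each one. A minimal sketch of that pattern (the class name `agent-name` and the URLs are hypothetical, for illustration only):

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a search-results page; the class name 'agent-name'
# is hypothetical -- the real site uses different markup.
listing_html = '''
<a class="agent-name" href="https://example.com/agents/john">John</a>
<a class="agent-name" href="https://example.com/agents/jane">Jane</a>
'''

soup = BeautifulSoup(listing_html, 'html.parser')
profile_urls = [a['href'] for a in soup.select('a.agent-name')]
print(profile_urls)
# Each URL would then be fetched with its own requests.get() call and the
# returned HTML parsed into a second BeautifulSoup object, repeating the
# same idea for the website link found on each profile page.
```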

Here is my code:

from bs4 import BeautifulSoup
import requests
import csv




#####################  Website
#####################           URL

w_url = 'https://www.' + input('Please Enter Website URL :')





####################### Number of
#######################           Pages

pages = int(input(' Please specify number of pages: '))




#######################  Range
#######################         Specified
page_range = list(range(0,pages))




#######################  WebSite
#######################          Name ( in case of multiple websites )
#site_name = int(input('Enter the website name ( IN CAPITALS ) :'))



#######################  Empty
#######################        List
agent_info= []




#######################   Creating
#######################            CSV File
csv_file = open(r'D:\Webscraping\real_estate_agents.csv', 'w', newline='')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Name and Number'])





####################### FOR
#######################    LOOP
for k in page_range:
    website = requests.get(w_url+'/pg-'+'{}'.format(k)).text
    soup = BeautifulSoup(website,'lxml')


    class1 = 'jsx-1448471805 agent-name text-bold'
    class2 = 'jsx-1448471805 agent-phone hidden-xs hidden-xxs'



    for i in soup.find_all('div', class_=[class1, class2]):

        w = i.text
        agent_info.append(w)





#####################  Removing
#####################            Duplicates

updated_info= list(dict.fromkeys(agent_info))





#####################   Writing Data
#####################               to CSV

for t in updated_info:
    print(t)
    csv_writer.writerow([t])
    print('\n')




csv_file.close()
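The duplicate-removal step in the script works because `dict.fromkeys` keeps only the first occurrence of each key and, in Python 3.7+, preserves insertion order. A quick demonstration with made-up entries:

```python
# dict.fromkeys keeps the first occurrence of each key and (in Python 3.7+)
# preserves insertion order, so converting back to a list drops duplicates
# without reshuffling the names and numbers.
agent_info = ['John Doe', '(555) 123-4567', 'John Doe', 'Jane Roe']
updated_info = list(dict.fromkeys(agent_info))
print(updated_info)  # ['John Doe', '(555) 123-4567', 'Jane Roe']
```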
question from:https://stackoverflow.com/questions/65941169/accessing-a-webpage-within-a-webpage-using-beautifulsoup

1 Reply


It would be more efficient (and fewer lines of code) to grab the data from the API. The website emails appear to be in there too, so if needed there's no reason to visit each of the 30,000+ websites for the email; you can get it all in a fraction of the time.

The API also has all the data you'd want/need. For example, here's everything on just one agent:

{'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'advertiser_id': 2121274, 'agent_rating': 5, 'background_photo': {'href': 'https://ap.rdcpix.com/1223152681/cc48579b6a0fe6ccbbf44d83e8f82145g-c0o.jpg'}, 'broker': {'fulfillment_id': 3860509, 'designations': [], 'name': 'BRIDGE REALTY, LLC.', 'accent_color': '', 'photo': {'href': ''}, 'video': ''}, 'description': 'As a professional real estate agent licensed in the State of Minnesota, I am committed to providing only the highest standard of care as I assist you in navigating the twists and turns of home ownership. Whether you are buying or selling your home, I will do everything it takes to turn your real estate goals and desires into a reality. If you are looking for a real estate Agent who will put your needs first and go above and beyond to help you reach your goals, I am the agent for you.', 'designations': [], 'first_month': 0, 'first_name': 'John', 'first_year': 2010, 'has_photo': True, 'href': 'http://www.twincityhomes4sale.com', 'id': '56b63efd7e54f7010021459d', 'is_realtor': True, 'languages': [], 'last_name': 'Palomino', 'last_updated': 'Mon, 04 Jan 2021 18:46:12 GMT', 'marketing_area_cities': [{'city_state': 'Columbus_MN', 'name': 'Columbus', 'state_code': 'MN'}, {'city_state': 'Blaine_MN', 'name': 'Blaine', 'state_code': 'MN'}, {'city_state': 'Circle Pines_MN', 'name': 'Circle Pines', 'state_code': 'MN'}, {'city_state': 'Lino Lakes_MN', 'name': 'Lino Lakes', 'state_code': 'MN'}, {'city_state': 'Lexington_MN', 'name': 'Lexington', 'state_code': 'MN'}, {'city_state': 'Forest Lake_MN', 'name': 'Forest Lake', 'state_code': 'MN'}, {'city_state': 'Chisago City_MN', 'name': 'Chisago City', 'state_code': 'MN'}, {'city_state': 'Wyoming_MN', 'name': 'Wyoming', 'state_code': 'MN'}, {'city_state': 'Centerville_MN', 'name': 'Centerville', 'state_code': 'MN'}, {'city_state': 'Hugo_MN', 'name': 'Hugo', 
'state_code': 'MN'}, {'city_state': 'Grant_MN', 'name': 'Grant', 'state_code': 'MN'}, {'city_state': 'St. Anthony_MN', 'name': 'St. Anthony', 'state_code': 'MN'}, {'city_state': 'Arden Hills_MN', 'name': 'Arden Hills', 'state_code': 'MN'}, {'city_state': 'New Brighton_MN', 'name': 'New Brighton', 'state_code': 'MN'}, {'city_state': 'Mounds View_MN', 'name': 'Mounds View', 'state_code': 'MN'}, {'city_state': 'White Bear Township_MN', 'name': 'White Bear Township', 'state_code': 'MN'}, {'city_state': 'Vadnais Heights_MN', 'name': 'Vadnais Heights', 'state_code': 'MN'}, {'city_state': 'Shoreview_MN', 'name': 'Shoreview', 'state_code': 'MN'}, {'city_state': 'Little Canada_MN', 'name': 'Little Canada', 'state_code': 'MN'}, {'city_state': 'Columbia Heights_MN', 'name': 'Columbia Heights', 'state_code': 'MN'}, {'city_state': 'Hilltop_MN', 'name': 'Hilltop', 'state_code': 'MN'}, {'city_state': 'Fridley_MN', 'name': 'Fridley', 'state_code': 'MN'}, {'city_state': 'Linwood_MN', 'name': 'Linwood', 'state_code': 'MN'}, {'city_state': 'East Bethel_MN', 'name': 'East Bethel', 'state_code': 'MN'}, {'city_state': 'Spring Lake Park_MN', 'name': 'Spring Lake Park', 'state_code': 'MN'}, {'city_state': 'North St. Paul_MN', 'name': 'North St. Paul', 'state_code': 'MN'}, {'city_state': 'Maplewood_MN', 'name': 'Maplewood', 'state_code': 'MN'}, {'city_state': 'St. Paul_MN', 'name': 'St. 
Paul', 'state_code': 'MN'}], 'mls': [{'member': {'id': '506004321'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'A', 'primary': True}], 'nar_only': 1, 'nick_name': '', 'nrds_id': '506004321', 'office': {'name': 'Bridge Realty, Llc', 'mls': [{'member': {'id': '10982'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'O', 'primary': True}], 'phones': [{'ext': '', 'number': '(952) 368-0021', 'type': 'Home'}], 'phone_list': {'phone_1': {'type': 'Home', 'number': '(952) 368-0021', 'ext': ''}}, 'photo': {'href': ''}, 'slogan': '', 'website': None, 'video': None, 'fulfillment_id': 3027311, 'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'email': '[email protected]', 'nrds_id': None}, 'party_id': 23115328, 'person_name': 'John Palomino', 'phones': [{'ext': '', 'number': '(763) 458-0788', 'type': 'Mobile'}], 'photo': {'href': 'https://ap.rdcpix.com/900899898/cc48579b6a0fe6ccbbf44d83e8f82145a-c0o.jpg'}, 'recommendations_count': 2, 'review_count': 7, 'role': 'agent', 'served_areas': [{'name': 'Circle Pines', 'state_code': 'MN'}, {'name': 'Forest Lake', 'state_code': 'MN'}, {'name': 'Hugo', 'state_code': 'MN'}, {'name': 'St. 
Paul', 'state_code': 'MN'}, {'name': 'Minneapolis', 'state_code': 'MN'}, {'name': 'Wyoming', 'state_code': 'MN'}], 'settings': {'share_contacts': False, 'full_access': False, 'recommendations': {'realsatisfied': {'user': 'John-Palomino', 'id': '1073IJk', 'linked': '3d91C', 'updated': '1529551719'}}, 'display_listings': True, 'far_override': True, 'show_stream': True, 'terms_of_use': True, 'has_dotrealtor': False, 'display_sold_listings': True, 'display_price_range': True, 'display_ratings': True, 'loaded_from_sb': True, 'broker_data_feed_opt_out': False, 'unsubscribe': {'autorecs': False, 'recapprove': False, 'account_notify': False}, 'new_feature_popup_closed': {'agent_left_nav_avatar_to_profile': False}}, 'slogan': 'Bridging the gap between buyers & sellers', 'specializations': [{'name': '1st time home buyers'}, {'name': 'Residential Listings'}, {'name': 'Rental/Investment Properties'}, {'name': 'Move Up Buyers'}], 'title': 'Agent', 'types': 'agent', 'user_languages': [], 'web_url': 'https://www.realtor.com/realestateagents/John-Palomino_BLOOMINGTON_MN_2121274_876599394', 'zips': ['55014', '55025', '55038', '55112', '55126', '55421', '55449', '55092', '55434', '55109'], 'email': '[email protected]', 'full_name': 'John Palomino', 'name': 'John Palomino, Agent', 'social_media': {'facebook': {'type': 'facebook', 'href': 'https://www.facebook.com/Johnpalominorealestate'}}, 'for_sale_price': {'count': 1, 'min': 299900, 'max': 299900, 'last_listing_date': '2021-01-29T11:10:24Z'}, 'recently_sold': {'count': 35, 'min': 115000, 'max': 460000, 'last_sold_date': '2020-12-18'}, 'agent_team_details': {'is_team_member': False}}

Code:

import requests
import pandas as pd
import math

# Function to pull the data
def get_agent_info(jsonData, rows):
    agents = jsonData['agents']
    for agent in agents:
        name = agent['person_name']

        if 'email' in agent.keys():
            email = agent['email']
        else:
            email = 'N/A'
        
        if 'href' in agent.keys():
            website = agent['href']
        else:
            website = 'N/A'
            
        try:
            office_email = agent['office']['email']
        except (KeyError, TypeError):
            office_email = 'N/A'
        
        row = {'name':name, 'email':email, 'website':website, 'office_email':office_email}
        rows.append(row)
    return rows

rows = []   
url = 'https://www.realtor.com/realestateagents/api/v3/search'
headers= {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'}
payload = {'nar_only': '1','offset': '','limit': '300','marketing_area_cities':  '_',
           'postal_code': '','is_postal_search': 'true','name': 'john','types': 'agent',
           'sort': 'recent_activity_high','far_opt_out': 'false','client_id': 'FAR2.0',
           'recommendations_count_min': '','agent_rating_min': '','languages': '',
           'agent_type': '','price_min': '','price_max': '','designations': '',
           'photo': 'true'}

# Get the 1st page, find how many pages you'll need to go through, and parse the data
jsonData = requests.get(url, headers=headers, params=payload).json()
total_matches = jsonData['matching_rows']
total_pages = math.ceil(total_matches/300)
rows = get_agent_info(jsonData, rows)
print ('Completed: %s of %s' %(1,total_pages))

# Iterate through next pages
for page in range(1,total_pages):
    payload.update({'offset':page*300})
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = get_agent_info(jsonData, rows)
    print ('Completed: %s of %s' %(page+1,total_pages))

df = pd.DataFrame(rows)

Output: Just the first 10 rows of 30,600

print(df.head(10).to_string())
                name                       email                                 website                   office_email
0       John Croteau           [email protected]  https://www.facebook.com/JCtherealtor/      [email protected]
1  Stephanie St John       [email protected]   https://stephaniestjohn.shorewest.com     [email protected]
2     Johnine Larsen     [email protected]               http://realestategals.com  [email protected]
3    Leonard Johnson  [email protected]                 http://www.adrhomes.net     [email protected]
4  John C Fitzgerald           [email protected]                 http://www.JCFHomes.com                               
5  John Vrsansky  Jr     [email protected]           http://www.OnTargetRealty.com        [email protected]
6      John Williams    [email protected]        http://www.johnwilliamsidaho.com               [email protected]
7        John Zeiter          [email protected]                                                         [email protected]
8      Mitch Johnson  [email protected]                                            [email protected]
9          John Lowe           [email protected]                http://johnlowegroup.com  [email protected]
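If you want the final table persisted the way the original script wrote its CSV, pandas can do it in one call (the row shown and the filename are just examples):

```python
import pandas as pd

# Hypothetical one-row subset of the scraped data, for illustration only.
rows = [{'name': 'John Palomino', 'email': 'N/A',
         'website': 'http://www.twincityhomes4sale.com', 'office_email': 'N/A'}]
df = pd.DataFrame(rows)
df.to_csv('real_estate_agents.csv', index=False)  # header row, no index column
```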
