Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Web scraping - Find or select table elements to scrape with Python and BeautifulSoup

I am not sure how to select the items below, which sit inside the table with class="table-info".

Using Python and BeautifulSoup, I want to extract the:

  1. phone

  2. email

  3. website

  4. main activity (the li element's text without the nested div), here "Computer consultancy activities".

     <table class="table-info">
     <tbody>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Business name</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">Company XYZ</div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Register code:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">112233558</div>
             </td>
         </tr>
    
    
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Operating address:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location"
                         class="link-location">Some location strt. 233</a></div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Legal address</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">
                     <a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some
                         location
                     </a>
                 </div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">VAT No:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this, '12345678')">Get VAT
                         liability</a></div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Age:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">1 year&nbsp;3 months</div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Founded:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">20/09/2019</div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Capital:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">2000 USD</div>
             </td>
         </tr>
         <tr>
             <td colspan="2" class="sep"></td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Phone:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">123456789</div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">E-mail:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text"><a href="mailto:[email protected]">[email protected]</a></div>
             </td>
         </tr>
         <tr>
             <td colspan="2" class="sep"></td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Representatives:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">
                     <div class="box-message">
                         <p class="desc">To access information, please</p>
                         <p>
                             <a href="#" onclick="return loginClicked(this, '#');"
                                 class="btn btn-small btn-purple link-login">Log in</a>
                         </p>
                     </div>
                 </div>
             </td>
         </tr>
         <tr>
             <td colspan="2" class="sep"></td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">
                     Main activity:
                     <span class="tip info" title=""
                         data-original-title="Activities are classified according to EMTAK 2008"></span>
                 </div>
             </td>
             <td class="col-2">
                 <div class="col-2-text" id="activity_top5ffe2eab23d13">
                     <ul>
                         <li>
                             Computer consultancy activities
                             <div class="main_activities_top_link_wrapper">
                                 <a href="https://www.somesite.com/" target="_blank"
                                     onclick="ga('send', 'event', 'check', 'top_btn', 'Anonym');"
                                     class="btn btn-simple btn-open-graph">
                                     <span>Open TOP 20</span> </a>
                             </div>
                         </li>
                     </ul>
    
                 </div>
             </td>
         </tr>
    
    
     </tbody>
     </table>
    

Note: The HTML above is one example query result. Some results (companies) have no email or no website, so it is important that the code does not raise an error when it cannot find the HTML content it is looking for. I would rather select by class names or ids than count how deep the table/div nesting goes (as with XPath).

I have some code, but it is not working well at the moment:

import csv
import requests
import datetime
import time
 
from requests import get
from bs4 import BeautifulSoup
 
 
with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)
 
    count = 0
     
    for row in reader:
         
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
 
        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
         
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers,  cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")
 
        table_info = soup.select_one('.table-info')
 
        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]
 
        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href') 
         
        collected_data = row[1], mail_clean, website, timestamp
 
        data_list = [["Regcode", "Email", "Website", "Timestamp"],collected_data]
        with open('extracted.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerows(data_list)
 
        print(row[1], "|", mail_clean,"|", website,"|", timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1
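
Two things in the script above are worth pointing out. `extracted.csv` is opened in 'w' mode inside the loop, so every iteration truncates the file and only the last company's row survives. The lookups also call `.get('href')` directly on the result of `select_one`, which returns `None` when nothing matches, so a missing email or website raises `AttributeError`. Below is a minimal sketch of the output side only, assuming a hypothetical `write_rows` helper and made-up sample records, with an in-memory buffer standing in for the real file so it runs on its own:

```python
import csv
import io

HEADER = ["Regcode", "Email", "Website", "Timestamp"]


def write_rows(out, records):
    """Write the header once, then one line per record.

    `records` is an iterable of (regcode, email, website, timestamp) tuples;
    missing values (None) are written as empty cells.
    """
    writer = csv.writer(out, delimiter=";")
    writer.writerow(HEADER)
    for record in records:
        writer.writerow(["" if value is None else value for value in record])


# Hypothetical sample data: the second company has no email and no website.
records = [
    ("112233558", "info@example.com", "https://www.example.com", "2021-01-13 10:00:00"),
    ("998877665", None, None, "2021-01-13 10:00:03"),
]

# io.StringIO stands in for open('extracted.csv', 'w', newline='').
buffer = io.StringIO(newline="")
write_rows(buffer, records)
lines = buffer.getvalue().splitlines()
print(lines)
```

With a real file, the same helper would be called once, outside the per-company loop, for example inside `with open('extracted.csv', 'w', newline='', encoding='utf8') as f:`. Passing a generator as `records` would let each row be written as soon as it has been scraped.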
     
  

1 Reply


Have you considered using CSS selectors that count the table's children? If your table always mirrors the example code, it might be easier to use the :nth-child pseudo-class.

  • Phone: tr:nth-child(10) .col-2-text
  • Email: tr:nth-child(11) a
  • Website: span
  • Main Activity: li
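
As a rough sketch of how the positional selectors for phone and email could be used with BeautifulSoup's `select_one`, the snippet below rebuilds a shortened version of the question's table with the same row order and simplified values. The names `build_table` and `text_or_none` and the `info@example.com` address are placeholders, not part of the original page. It also shows the main caveat: the indices only hold while every row is present.

```python
from bs4 import BeautifulSoup

# Rows in the same order as the question's sample table (values simplified).
# None stands for a separator row: <td colspan="2" class="sep"></td>.
ROWS = [
    ("Business name", "Company XYZ"),
    ("Register code:", "112233558"),
    ("Operating address:", "Some location strt. 233"),
    ("Legal address", "Some location"),
    ("VAT No:", "Get VAT liability"),
    ("Age:", "1 year 3 months"),
    ("Founded:", "20/09/2019"),
    ("Capital:", "2000 USD"),
    None,
    ("Phone:", "123456789"),
    # Placeholder address; the real one is not visible in the question.
    ("E-mail:", '<a href="mailto:info@example.com">info@example.com</a>'),
]


def build_table(rows):
    """Render the rows as HTML shaped like the question's table."""
    parts = ['<table class="table-info"><tbody>']
    for row in rows:
        if row is None:
            parts.append('<tr><td colspan="2" class="sep"></td></tr>')
        else:
            label, value = row
            parts.append(
                f'<tr><td class="col-1"><div class="col-1-text">{label}</div></td>'
                f'<td class="col-2"><div class="col-2-text">{value}</div></td></tr>'
            )
    parts.append("</tbody></table>")
    return "".join(parts)


def text_or_none(table, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = table.select_one(selector)
    return node.get_text(strip=True) if node else None


table = BeautifulSoup(build_table(ROWS), "html.parser").select_one("table.table-info")
phone = text_or_none(table, "tr:nth-child(10) .col-2-text")
email = text_or_none(table, "tr:nth-child(11) a")

# Caveat: drop one earlier row (here "Capital:") and every later index shifts,
# so the "phone" selector now lands on the e-mail row.
fewer_rows = [r for r in ROWS if r is None or r[0] != "Capital:"]
table2 = BeautifulSoup(build_table(fewer_rows), "html.parser").select_one("table.table-info")
shifted = text_or_none(table2, "tr:nth-child(10) .col-2-text")

print(phone, email, shifted)
```

Returning `None` from `text_or_none` keeps a missing row from raising an error, but it cannot detect a shifted row, so positional selectors are only a good fit when the table layout is fixed.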

I used SelectorGadget to grab these selectors. You might want to run it on your page directly to see whether any other selectors are easier to use.
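
If rows can be missing or reordered, one alternative to positional selectors is to find each row by its label text and read the value cell next to it, returning `None` when the label is absent. The following is a minimal sketch under that assumption. `find_value`, `direct_text`, `extract`, and the trimmed sample HTML (including the `info@example.com` address) are illustrative names and data, and the shape of a "Website:" row is a guess, since the sample in the question does not contain one.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed sample in the shape of the question's table; there is no Website row.
HTML = """
<table class="table-info"><tbody>
<tr><td class="col-1"><div class="col-1-text">Phone:</div></td>
    <td class="col-2"><div class="col-2-text">123456789</div></td></tr>
<tr><td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="mailto:info@example.com">info@example.com</a></div></td></tr>
<tr><td class="col-1"><div class="col-1-text">Main activity: <span class="tip info"></span></div></td>
    <td class="col-2"><div class="col-2-text"><ul><li>
        Computer consultancy activities
        <div class="main_activities_top_link_wrapper"><a href="https://www.somesite.com/" target="_blank"><span>Open TOP 20</span></a></div>
    </li></ul></div></td></tr>
</tbody></table>
"""


def find_value(table, label):
    """Return the .col-2-text element of the row whose label starts with `label`, else None."""
    for label_div in table.select("td.col-1 .col-1-text"):
        if label_div.get_text(" ", strip=True).startswith(label):
            return label_div.find_parent("tr").select_one("td.col-2 .col-2-text")
    return None


def direct_text(tag):
    """Join only the text nodes that are direct children of `tag`, skipping nested tags."""
    return "".join(c for c in tag.children if isinstance(c, NavigableString)).strip()


def extract(table):
    phone_cell = find_value(table, "Phone")
    email_cell = find_value(table, "E-mail")
    # Assumption: a website row, when present, is labelled "Website:" and holds a link.
    site_cell = find_value(table, "Website")
    activity_cell = find_value(table, "Main activity")

    email_link = email_cell.select_one('a[href^="mailto:"]') if email_cell else None
    site_link = site_cell.select_one("a[href]") if site_cell else None
    activity_li = activity_cell.select_one("li") if activity_cell else None

    return {
        "phone": phone_cell.get_text(strip=True) if phone_cell else None,
        "email": email_link["href"].split(":", 1)[1] if email_link else None,
        "website": site_link["href"] if site_link else None,
        "main_activity": direct_text(activity_li) if activity_li else None,
    }


soup = BeautifulSoup(HTML, "html.parser")
result = extract(soup.select_one("table.table-info"))
print(result)
```

Reading only the direct text nodes of the `li` is what leaves out the "Open TOP 20" link text, and every lookup falls back to `None`, so a company without an email or website does not raise an error.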

