Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
3.8k views
in Technique[技术] by (71.8m points)

python - Selecting text element having specific style color

I have a scraping task in which I have to collect some articles. I only need the paragraphs that are in red (#FF0000). Is there a way to use the Selenium WebDriver to extract only the text in that colour? Across all the pages that I have to scrape, the only attribute that is always the same is the text color.

For example, in the following URL: https://www.boatos.org/saude/ivermectina-mata-covid-dois-dias-dose-unica.html

I want the webdriver to return just the following paragraph, which is originally painted in red:

Versão 1: “IVERMECTINA REALMENTE MATA COVID-19 EM 2 DIAS COMPROVA ESTUDO”. Versão 2: “Cientistas descobriram que dose única de ivermectina pode remover todo o RNA do novo coronavírus em um período de 48 horas. Mesmo no primeiro dia, a redução do material genético do vírus é significativo”.


Fight with dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss will gaze back…

1 Reply

0 votes
by (71.8m points)

To print the text Versão 1: “IVERMECTINA REALMENTE MATA COVID-19 EM... you can use either of the following locator strategies:

  • Using CSS_SELECTOR and the text attribute:

    driver.get("https://www.boatos.org/saude/ivermectina-mata-covid-dois-dias-dose-unica.html")
    # find_element_by_css_selector() is gone from recent Selenium 4 releases; use find_element(By.CSS_SELECTOR, ...)
    print(driver.find_element(By.CSS_SELECTOR, "span[style] > em").text)
    
  • Using XPATH and get_attribute("innerHTML"):

    driver.get("https://www.boatos.org/saude/ivermectina-mata-covid-dois-dias-dose-unica.html")
    # Likewise, find_element_by_xpath() becomes find_element(By.XPATH, ...)
    print(driver.find_element(By.XPATH, "//span[@style]/em").get_attribute("innerHTML"))
    

Ideally, wrap the lookup in WebDriverWait with visibility_of_element_located() so that the element is visible before you read it. You can use either of the following locator strategies:

  • Using CSS_SELECTOR and get_attribute():

    driver.get("https://www.boatos.org/saude/ivermectina-mata-covid-dois-dias-dose-unica.html")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span[style] > em"))).get_attribute("innerHTML"))
    
  • Using XPATH and text attribute:

    driver.get("https://www.boatos.org/saude/ivermectina-mata-covid-dois-dias-dose-unica.html")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@style]/em"))).text)
    
  • Console Output:

    Versão 1: “IVERMECTINA REALMENTE MATA COVID-19 EM 2 DIAS COMPROVA ESTUDO”. Versão 2: “Cientistas descobriram que dose única de ivermectina pode remover todo o RNA do novo coronavírus em um período de 48 horas. Mesmo no primeiro dia, a redução do material genético do vírus é significativo”.
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
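
The selectors above match any span that carries an inline style attribute, not red text specifically. If the color itself is the only stable property, one possible approach is to read each element's computed color with value_of_css_property("color") and compare it to #FF0000. The sketch below assumes the browser reports the color as an rgb(...) or rgba(...) string (the exact format may vary by browser); parse_rgb and texts_with_color are hypothetical helper names, not part of Selenium.

```python
import re

RED = (255, 0, 0)  # #FF0000, the color named in the question


def parse_rgb(css_color):
    """Return (r, g, b) for 'rgb(...)', 'rgba(...)' or '#rrggbb'; None otherwise."""
    value = css_color.strip().lower()
    hex_match = re.fullmatch(r"#([0-9a-f]{6})", value)
    if hex_match:
        digits = hex_match.group(1)
        return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))
    rgb_match = re.fullmatch(
        r"rgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*(?:,\s*[\d.]+\s*)?\)", value
    )
    if rgb_match:
        return tuple(int(part) for part in rgb_match.groups())
    return None


def texts_with_color(elements, target=RED):
    """Collect the non-empty text of elements whose computed color equals target.

    `elements` is any iterable of objects exposing value_of_css_property()
    and .text, such as the WebElements returned by driver.find_elements().
    """
    matches = []
    for element in elements:
        color = parse_rgb(element.value_of_css_property("color"))
        text = element.text.strip()
        if color == target and text:
            matches.append(text)
    return matches
```

With a live driver this could be called as texts_with_color(driver.find_elements(By.CSS_SELECTOR, "span[style]")). Because color is inherited, a red span and its child em would both match, so choose a selector that targets one level of nesting to avoid duplicate text.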

