python - Getting all visible text from a webpage using Selenium

Question

Welcome To Ask or Share your Answers For Others

python - Getting all visible text from a webpage using Selenium

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Getting all visible text from a webpage using Selenium

I've been googling this all day with out finding the answer, so apologies in advance if this is already answered.

I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.

After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text, with Selenium, unfortunately the same text is being grabbed multiple times:

from selenium import webdriver
import codecs

filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')

driver = webdriver.Firefox()

driver.get("http://www.examplepage.com")

allelements = driver.find_elements_by_xpath("//*")

ferdigtxt = []

for i in allelements:

      if i.text in ferdigtxt:
          pass
  else:
         ferdigtxt.append(i.text)
         filen.writelines(i.text)

filen.close()

driver.quit()

The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times - it does not however, only work as planned on some webpages. (it also makes the script A LOT slower)

I'm guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.

Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I'm out of ideas for this one.

Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is because I wanted JavaScript tendered text

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:59:26+0000

Using lxml, you might try something like this:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url) # Load page
    content=browser.page_source
    cleaner=clean.Cleaner()
    content=cleaner.clean_html(content)    
    with open('/tmp/source.html','w') as f:
       f.write(content.encode('utf-8'))
    doc=LH.fromstring(content)
    with open('/tmp/result.txt','w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags: continue
            text=elt.text or ''
            tail=elt.tail or ''
            words=' '.join((text,tail)).strip()
            if words:
                words=words.encode('utf-8')
                f.write(words+'
')

This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).

Categories

python - Getting all visible text from a webpage using Selenium

python - Getting all visible text from a webpage using Selenium

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags