Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
330 views
in Technique[技术] by (71.8m points)

python - Getting all visible text from a webpage using Selenium

I've been googling this all day with out finding the answer, so apologies in advance if this is already answered.

I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.

After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text, with Selenium, unfortunately the same text is being grabbed multiple times:

from selenium import webdriver
import codecs

filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')

driver = webdriver.Firefox()

driver.get("http://www.examplepage.com")

allelements = driver.find_elements_by_xpath("//*")

ferdigtxt = []

for i in allelements:

      if i.text in ferdigtxt:
          pass
  else:
         ferdigtxt.append(i.text)
         filen.writelines(i.text)

filen.close()

driver.quit()

The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times - it does not however, only work as planned on some webpages. (it also makes the script A LOT slower)

I'm guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.

Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I'm out of ideas for this one.

Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is because I wanted JavaScript tendered text

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Using lxml, you might try something like this:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url) # Load page
    content=browser.page_source
    cleaner=clean.Cleaner()
    content=cleaner.clean_html(content)    
    with open('/tmp/source.html','w') as f:
       f.write(content.encode('utf-8'))
    doc=LH.fromstring(content)
    with open('/tmp/result.txt','w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags: continue
            text=elt.text or ''
            tail=elt.tail or ''
            words=' '.join((text,tail)).strip()
            if words:
                words=words.encode('utf-8')
                f.write(words+'
') 

This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...