Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Extract text and nodes from <p> using lxml in the same array index

Hello, I need to get all the text and other content inside a paragraph, something like this:

<div>
<p>
Whatever you want type <strong>here is great</strong>
</p>
<p>
Whatever you want type <strong>here is great</strong>
</p>
</div>

I am using the XPath expression below to get all the text and the strong text from the paragraphs. The problem is that the paragraph text and the strong text come back as separate items, so I get an array like ['Whatever you want type','here is great']. I need each paragraph's text and child nodes in the same array index, something like ['Whatever you want type here is great'].

content = html.xpath('.//p/text() | .//p/strong/text()')

I found a way to extract the text inside them:

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.

https://lxml.de/lxmlhtml.html
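
A minimal sketch of how .text_content() could be applied per paragraph, assuming the markup from the question is parsed with lxml.html.fromstring. The names html_string, tree and content are illustrative, and the split()/join() whitespace cleanup is an addition that the question does not ask for explicitly:

```python
import lxml.html

# Sample markup copied from the question.
html_string = """
<div>
<p>
Whatever you want type <strong>here is great</strong>
</p>
<p>
Whatever you want type <strong>here is great</strong>
</p>
</div>
"""

tree = lxml.html.fromstring(html_string)

# One list entry per <p>: text_content() joins the paragraph's own text with
# the text of its children; split()/join() collapses newlines and extra spaces.
content = [" ".join(p.text_content().split()) for p in tree.xpath(".//p")]
print(content)
```

Selecting the <p> elements themselves, rather than their text() nodes, is what keeps each paragraph in a single array index.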

question from:https://stackoverflow.com/questions/65830421/extract-text-and-nodes-from-p-using-lxml-in-the-same-array-index


1 Reply

You could use BeautifulSoup for this.

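from bs4 import BeautifulSoup

html_string = """<p>
 Whatever you want type <strong>here is great</strong>
</p>
"""

soup = BeautifulSoup(html_string, 'html.parser')
# get_text() joins the paragraph's own text with the text of its children;
# strip() removes the leading and trailing whitespace.
mytext = [soup.find('p').get_text().strip()]
# ['Whatever you want type here is great']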
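
The snippet above handles a single paragraph, while the markup in the question contains two. A possible extension, sketched under the assumption that every <p> should produce one list entry, loops over find_all('p'). The name paragraphs and the whitespace normalisation are illustrative choices, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html_string = """
<div>
<p>
Whatever you want type <strong>here is great</strong>
</p>
<p>
Whatever you want type <strong>here is great</strong>
</p>
</div>
"""

soup = BeautifulSoup(html_string, "html.parser")

# One list entry per <p>; split()/join() collapses newlines and extra spaces.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
print(paragraphs)
```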


...