Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
172 views
in Technique[技术] by (71.8m points)

python - Difference between .string and .text BeautifulSoup

I noticed something odd about when working with BeautifulSoup and couldn't find any documentation to support this so I wanted to ask over here.

Say we have a tags like these that we have parsed with BS:

<td>Some Table Data</td>
<td></td>

The official documented way to extract the data is soup.string. However this extracted a NoneType for the second <td> tag. So I tried soup.text (because why not?) and it extracted an empty string exactly as I wanted.

However I couldn't find any reference to this in the documentation and am worried that something is a miss. Can anyone let me know if this is acceptable to use or will it cause problems later?

BTW I am scraping table data from a web page and mean to create CSVs from the data so I do actually need empty strings rather than NoneTypes.

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

.string on a Tag type object returns a NavigableString type object. On the other hand, .text gets all the child strings and return concatenated using the given separator. Return type of .text is unicode object.

From the documentation, A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree.

From the documentation on .string, we can see that, If the html is like this,

<td>Some Table Data</td>
<td></td>

Then, .string on the second td will return None. But .text will return and empty string which is a unicode type object.

For more convenience,

string

  • Convenience property of a tag to get the single string within this tag.
  • If the tag has a single string child then the return value is that string.
  • If the tag has no children or more than one child then the return value is None
  • If this tag has one child tag then the return value is the 'string' attribute of the child tag, recursively.

And text

  • Get all the child strings and return concatenated using the given separator.

If the html is like this:

<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>

.string on the four td will return,

some text
None
more text
None

.text will give result like this,

some text

more text
even more text

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...