You can use extract() to remove the unwanted tag before you get the text.
But it keeps all the '\n' characters and spaces, so you will need some extra work to remove them.
data = '''<span>
I Like
<span class='unwanted'> to punch </span>
your face
<span>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser')
external_span = soup.find('span')
print("1 HTML:", external_span)
print("1 TEXT:", external_span.text.strip())
unwanted = external_span.find('span')
unwanted.extract()
print("2 HTML:", external_span)
print("2 TEXT:", external_span.text.strip())
Result
1 HTML: <span>
I Like
<span class="unwanted"> to punch </span>
your face
<span></span></span>
1 TEXT: I Like
to punch
your face
2 HTML: <span>
I Like
your face
<span></span></span>
2 TEXT: I Like
your face
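The leftover newlines and spaces mentioned above can be collapsed with plain string methods. This is only a minimal sketch: `normalize_whitespace` is a hypothetical helper name, and the sample string is an assumption about what the text looks like after `extract()` (two lines separated by an empty line).

```python
def normalize_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into one space."""
    # str.split() with no arguments splits on any whitespace run
    # and drops leading/trailing whitespace, so no strip() is needed.
    return " ".join(text.split())


# Assumed shape of external_span.text after the unwanted tag is extracted
raw = "\nI Like\n\nyour face\n"
print(normalize_whitespace(raw))
```

Beautiful Soup's `get_text(" ", strip=True)` should give a similar result in one call, but it only strips each text piece and does not collapse whitespace inside a piece.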
Alternatively, you can skip every Tag object inside the external span and keep only the NavigableString objects (the plain text in the HTML).
data = '''<span>
I Like
<span class='unwanted'> to punch </span>
your face
<span>'''
from bs4 import BeautifulSoup as BS
import bs4
soup = BS(data, 'html.parser')
external_span = soup.find('span')
text = []
for x in external_span:
    # keep only direct text children, skip nested Tag objects
    if isinstance(x, bs4.element.NavigableString):
        text.append(x.strip())
print(" ".join(text))
Result
I Like your face
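If you need this in more than one place, the same loop can be wrapped in a function. The sketch below is one possible variant, not the only way to do it: `own_text` is a hypothetical name, the sample HTML is made up for illustration, and it additionally drops whitespace-only strings so that they do not produce double spaces in the result.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Join the direct text children of tag, ignoring nested tags."""
    parts = []
    for child in tag.children:
        # Nested Tag objects are skipped; only direct strings are kept.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:  # ignore whitespace-only strings between tags
                parts.append(stripped)
    return " ".join(parts)


# Made-up sample: the text inside <b> and <i> should be left out
html = "<div>Hello <b>big</b> <i>wide</i> world</div>"
soup = BeautifulSoup(html, "html.parser")
print(own_text(soup.find("div")))
```

Note that comments in the HTML are also NavigableString objects in Beautiful Soup, so this sketch would include their text as well.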