python - BeautifulSoup returns unexpected extra spaces

Question

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I am trying to extract some text from HTML documents with BeautifulSoup. In one case that matters a lot to me, it produces a strange result: after a certain point, the soup is full of extra spaces within the text (a space separates every letter from the next one). I searched the web for a reason, but I only found reports of the opposite bug (no spaces at all).

Do you have any suggestions or hints on why this happens, and how to solve it?

This is the very basic code that I wrote:

from bs4 import BeautifulSoup

import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova)
print soup

And this is a line taken from the output, the line where the problem starts to appear:

value="Giuseppe labbate ogm? non vorremmo nuovi uccelli chiamati lontre"><input onmouseover="Tip('<cen t e r c l a s s = ' t i t l e _ v i d e o ' > < b > G i u s e p p e l a b b a t e o g m ? n o n v o r r e m m o n u o v i u c c e l l i c h i a m a t i l o n t r e <

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:48:39+0000

I believe this is a bug in lxml's HTML parser. Try:

from bs4 import BeautifulSoup

import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova.replace('ISO-8859-1', 'utf-8'))
print soup

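This is a workaround for the problem: it replaces the ISO-8859-1 charset name that appears in the page with utf-8 before the markup is parsed. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it could be worth checking whether you need to upgrade to a newer version.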
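For readers on Python 3, where the page content arrives as bytes, the same substitution can be sketched as follows. This is only an illustration of the workaround, not a tested fix for the bug: the helper name `replace_declared_charset` and the sample markup are made up for the example, and the case-insensitive match is an addition that the original `str.replace` call does not have.

```python
import re


def replace_declared_charset(raw_html, old=b"ISO-8859-1", new=b"utf-8"):
    """Return raw_html with every occurrence of the charset label `old`
    replaced by `new`. (Hypothetical helper, not part of BeautifulSoup.)

    Matching is case-insensitive because pages may spell the label
    as "iso-8859-1". Only the label is changed; the bytes themselves
    are not transcoded.
    """
    pattern = re.compile(re.escape(old), flags=re.IGNORECASE)
    # A function is used as the replacement so `new` is inserted literally.
    return pattern.sub(lambda match: new, raw_html)


# Made-up sample standing in for the bytes returned by the HTTP request.
sample = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
patched = replace_declared_charset(sample)
print(patched)
```

The patched bytes can then be passed to `BeautifulSoup(patched, "lxml")` in place of the raw response body.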

If you want more information on the bug, it was originally filed here:

https://bugs.launchpad.net/beautifulsoup/+bug/972466
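To see whether an installed lxml predates the releases mentioned above, a rough check along these lines may help. It is a sketch under assumptions: `parse_version` and `has_fix` are hypothetical helpers, only the leading numeric parts of the version string are compared, and as a result a 3.0 build older than alpha 2 would be reported as fixed even though it might not be.

```python
from importlib import metadata

# First 2.x release the answer names as containing the fix.
FIXED_IN = (2, 3, 6)


def parse_version(text):
    """Turn '2.3.6' or '3.0alpha2' into a tuple of its leading integers,
    for example (2, 3, 6) or (3, 0). Pre-release suffixes are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            # A suffix such as 'alpha2' follows; stop at the numeric prefix.
            break
    return tuple(parts)


def has_fix(version_text):
    """Return True if the version is 2.3.6 or later by numeric comparison."""
    return parse_version(version_text) >= FIXED_IN


try:
    installed = metadata.version("lxml")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("lxml is not installed")
elif has_fix(installed):
    print("lxml", installed, "is at or after 2.3.6")
else:
    print("lxml", installed, "predates 2.3.6; consider upgrading")
```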

Hope this helps,

Hayden
