I am trying to extract text from HTML documents with BeautifulSoup. In one case that is very relevant for me, it produces a strange result: after a certain point, the soup is full of extra spaces within the text (a space separates every letter from the next one). I searched the web for a reason, but I only found reports of the opposite bug (no spaces at all).
Do you have any suggestion or hint as to why this happens and how to solve it?
This is the very basic code that I wrote:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova)
print soup
And this is a line taken from the output, the line where the problem starts to appear:
value="Giuseppe labbate ogm? non vorremmo nuovi uccelli chiamati lontre"><input onmouseover="Tip('<cen t e r c l a s s = ' t i t l e _ v i d e o ' > < b > G i u s e p p e l a b b a t e o g m ? n o n v o r r e m m o n u o v i u c c e l l i c h i a m a t i l o n t r e <
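To narrow down where the output goes wrong, a small helper along these lines could report the offset at which the letter-by-letter spacing begins. This is only a diagnostic sketch, not an explanation of the cause: the function name `first_letter_spaced_offset` and the `min_run` threshold are hypothetical choices made for illustration, and it only assumes that the soup has been converted to a string first.

```python
import re


def first_letter_spaced_offset(text, min_run=6):
    """Return the index where the first run of single characters
    separated by single spaces begins, or -1 if no such run exists.

    min_run is an assumed threshold: the number of consecutive
    "character + space" pairs required before a run is reported.
    """
    pattern = re.compile(r"(?:\S ){%d,}" % min_run)
    match = pattern.search(text)
    return match.start() if match else -1


# Shortened sample modelled on the output line above.
sample = "value=\"lontre\"><input onmouseover=\"Tip('<cen t e r c l a s s = ' t i t l e"
offset = first_letter_spaced_offset(sample)
print(offset)
# Show some context around the point where the spacing starts.
print(repr(sample[max(0, offset - 20):offset + 20]))
```

Running the same function on the string form of the soup would show how far into the document the parsing still looks normal, which may help when comparing different parsers or encodings.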