I'm trying to remove all the HTML and JavaScript from a page using bs4, but it doesn't get rid of the JavaScript; I still see it mixed in with the text. How can I get around this?
I tried using nltk, which works fine; however, clean_html and clean_url will be removed moving forward. Is there a way to use BeautifulSoup's get_text and get the same result?
I tried looking at these other pages:
BeautifulSoup get_text does not strip all tags and JavaScript
Currently I'm using nltk's deprecated functions.
EDIT
Here's an example:
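import urllib.request
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
# Download the page and parse it
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
# get_text() still returns the contents of <script> tags
print(soup.get_text())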
I still see the following for CNN:
$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});
/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});
How can I remove the JavaScript?
The only other option I found is:
https://github.com/aaronsw/html2text
The problem with html2text is that it's really slow at times and creates noticeable lag, whereas speed is one thing nltk was always very good at.
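For reference, one approach that might work with bs4 alone is to delete the script and style elements from the parsed tree before calling get_text. The following is only a minimal sketch under that assumption, not a confirmed replacement for clean_html; the function name visible_text and the sample string are hypothetical and used purely for illustration.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML string with script/style contents dropped.

    Hypothetical helper for illustration; not part of bs4 or nltk.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are code rather than readable text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Put each text fragment on its own line, then drop the empty lines.
    text = soup.get_text(separator="\n")
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample input, not taken from cnn.com.
sample = (
    "<html><head><script>var x = 1;</script>"
    "<style>p {color: red}</style></head>"
    "<body><p>Hello</p><p>World</p></body></html>"
)
print(visible_text(sample))
```

This only covers inline script and style elements; JavaScript embedded in attributes such as onclick never appears in get_text output anyway, and whether the speed is comparable to nltk would need to be measured.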