The Story:
When you parse HTML with BeautifulSoup, the class attribute is considered a multi-valued attribute and is handled in a special manner:
Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes.
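To make the quoted behavior concrete, here is a small sketch, assuming bs4 is installed and using a made-up HTML snippet. It shows that class comes back as a list, that an ordinary attribute such as id stays a string, and that searching by one class matches the tag:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = '<p class="body strikeout" id="my id">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

class_value = p["class"]  # a list with one entry per CSS class
id_value = p["id"]        # a plain string, spaces included

# Searching by a single class matches a tag that carries several classes
found = soup.find("p", class_="strikeout")
```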
Also, a quote from the built-in HTMLTreeBuilder, which BeautifulSoup uses as a base for other tree builder classes such as HTMLParserTreeBuilder:
# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'. When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
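The split-on-parse, join-on-output idea described in that comment can be sketched in plain Python. This is only an illustration of the concept, not BeautifulSoup's actual code: the MULTI_VALUED set and both helper names are hypothetical, and the real list of attributes is longer and also depends on the tag name.

```python
# Hypothetical, shortened set of attributes treated as space-separated lists
MULTI_VALUED = {"class", "rel"}


def split_attrs(attrs):
    """On parse: turn multi-valued attribute strings into lists of values."""
    return {
        name: value.split() if name in MULTI_VALUED else value
        for name, value in attrs.items()
    }


def join_attrs(attrs):
    """On output: turn lists back into space-separated strings."""
    return {
        name: " ".join(value) if isinstance(value, list) else value
        for name, value in attrs.items()
    }


parsed = split_attrs({"class": "foo bar", "id": "my id"})
serialized = join_attrs(parsed)
```

Note that a round trip of this kind normalizes whitespace: a value such as "foo   bar" would be written back as "foo bar".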
The Question:
How can I configure BeautifulSoup to handle class as a usual single-valued attribute? In other words, I don't want it to handle class specially; it should be treated as a regular attribute.
FYI, here is one of the use-cases when it can be helpful:
What I've tried:
I've actually made it work by writing a custom tree builder class and removing class from the list of specially handled attributes:
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder

class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()
        # BeautifulSoup, please don't treat "class" specially
        self.cdata_list_attributes["*"].remove("class")

# data holds the HTML string to parse
soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())
What I don't like about this approach is that it is quite "unnatural" and "magical", and it involves importing the "private" internal _htmlparser module. I hope there is a simpler way.
NOTE: I want to keep all other HTML parsing features, meaning I don't want to parse the HTML with "xml"-only features (which could have been another workaround).