python - Is it possible for lxml to work in a case-insensitive manner?

Question

Welcome To Ask or Share your Answers For Others

python - Is it possible for lxml to work in a case-insensitive manner?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Is it possible for lxml to work in a case-insensitive manner?

I'm trying to scrape META keywords and description tags from arbitrary websites. I obviusly have no control over said website, so have to take what I'm given. They have a variety of casings for the tag and attributes, which means I need to work case-insensitively. I can't believe that the lxml authors are as stubborn as to insist on full forced standards-compliance when it excludes much of the use of their library.

I'd like to be able to say doc.cssselect('meta[name=description]') (or some XPath equivalent) but this will not catch <meta name="Description" Content="..."> tags due othe captial D.

I'm currently using this as a workaround, but it's horrible!

for meta in doc.cssselect('meta'):
    name = meta.get('name')
    content = meta.get('content')

    if name and content:
        if name.lower() == 'keywords':
            keywords = content
        if name.lower() == 'description':
            description = content

It seems that the tag name meta is treated case-insensitively, but the attributes are not. It would be even more annoying meta was case-sensitive too!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:53:47+0000

Values of attributes must be case-sensitive.

You can use arbitrary regular expression to select an element:

#!/usr/bin/env python
from lxml import html

doc = html.fromstring('''
    <meta name="Description">
    <meta name="description">
    <META name="description">
    <meta NAME="description">
''')
for meta in doc.xpath('//meta[re:test(@name, "^description$", "i")]',
                      namespaces={"re": "http://exslt.org/regular-expressions"}):
    print html.tostring(meta, pretty_print=True),

Output:

<meta name="Description">
<meta name="description">
<meta name="description">
<meta name="description">

Categories

python - Is it possible for lxml to work in a case-insensitive manner?

python - Is it possible for lxml to work in a case-insensitive manner?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags