Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Parsing nested HTML list with BeautifulSoup

I need to parse a nested HTML list and convert it to a parent-child dict. Given this list:

<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul>
          <li>Debian</li>
          <li>Fedora</li>
          <li>Ubuntu</li>
        </ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul>
      <li>Python</li>
      <li>C#</li>
      <li>Ruby</li>
    </ul>
  </li>
</ul>

I want to convert it to a dict like this:

{
    'Operating System': {
        'Linux': {
            'Debian': None,
            'Fedora': None,
            'Ubuntu': None,
        },
        'Windows': None,
        'OS X': None,
    },
    'Programming Languages': {
        'Python': None,
        'C#': None,
        'Ruby': None,
    }
}

My initial attempt uses find_all('li', recursive=False). It returns the top-level items (Operating System and Programming Languages), but each item still includes its children.

How can I do it with BeautifulSoup?


1 Reply


Here's one way:

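def dictify(ul):
    """Convert a <ul> Tag into a nested dict of {item text: children or None}."""
    result = {}
    # recursive=False yields only the direct <li> children of this <ul>.
    for li in ul.find_all("li", recursive=False):
        # The first stripped string is the item's own label.
        key = next(li.stripped_strings)
        # Use a separate name so the `ul` argument is not shadowed.
        child_ul = li.find("ul")
        if child_ul:
            result[key] = dictify(child_ul)
        else:
            result[key] = None
    return result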
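
The key step is reading only the first string of each item. A minimal sketch of why that matters, using a shortened, made-up list (the `html` string and variable names here are illustrative, not from the question): `recursive=False` already limits the search to direct children, but the text of each `<li>` still contains its nested items.

```python
from bs4 import BeautifulSoup

# Shortened illustrative input; "html.parser" is the parser bundled with Python.
html = "<ul><li>Linux<ul><li>Debian</li><li>Fedora</li></ul></li><li>Windows</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# recursive=False restricts the search to direct children of the outer <ul>.
items = soup.ul.find_all("li", recursive=False)
print(len(items))                       # 2

first = items[0]
# get_text() joins every descendant string, so the children leak into the text.
print(first.get_text(" ", strip=True))  # Linux Debian Fedora
# stripped_strings yields strings in document order; the first one is the label.
print(next(first.stripped_strings))     # Linux
```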

Example use of dictify:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <ul>
...   <li>Operating System
...     <ul>
...       <li>Linux
...         <ul>
...           <li>Debian</li>
...           <li>Fedora</li>
...           <li>Ubuntu</li>
...         </ul>
...       </li>
...       <li>Windows</li>
...       <li>OS X</li>
...     </ul>
...   </li>
...   <li>Programming Languages
...     <ul>
...       <li>Python</li>
...       <li>C#</li>
...       <li>Ruby</li>
...     </ul>
...   </li>
... </ul>
... """, "html.parser")
>>> ul = soup.ul
>>> from pprint import pprint
>>> pprint(dictify(ul), width=1)
{'Operating System': {'Linux': {'Debian': None,
                                'Fedora': None,
                                'Ubuntu': None},
                      'OS X': None,
                      'Windows': None},
 'Programming Languages': {'C#': None,
                           'Python': None,
                           'Ruby': None}}
