Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Python, How to use lxml XPath?

In Python I had:

response = s.get(url, allow_redirects=False, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
reg_cart = soup.find('form', attrs={"name": "regCart"})
registered_courses = [i.a.text for i in reg_cart.find_all('div', attrs={"class": "course-number"})]

Now I want to replace BeautifulSoup with lxml, reading this:

https://timber.io/blog/an-intro-to-web-scraping-with-lxml-and-python/

I tried to implement what they used there and got:

import lxml.html
doc = lxml.html.fromstring(response.content)
registered_courses = doc.xpath('//div[@class="course-number"]/text()')

But for some reason my output is:

['\n', '\n', '\n', '\n', '\n', '\n', '\n']

Previously it correctly showed the course numbers.

How can I fix this? Also, how can I edit my XPath to return only the div tags under the form regCart, rather than those in the whole response?

For example the html code looks something like:

        <form name="regCart" ....>
        </div><div class="entry-spacer"></div><div class="cart-entry">
                <div class="course-number">
                <a href="https://university.com/rishum/course/236756">236756</a>
            </div>
            <div class="course-name">
                ???? ??????? ??????              
            </div>
            <div class="course-points">
                3.0 ??'
            </div>
            <div class="entry-group">
                ????? 13
            </div>

Here I want to return 236756.



1 Reply

The issue is in your XPath expression: //div[@class="course-number"]/text()

<div class="course-number">
  <a href="https://university.com/rishum/course/236756">236756</a>
</div>

This selects the text nodes directly under each matching div, but the only text there is the whitespace around the <a> tag, which is why you get a list of newline strings. The course number is actually inside the <a> tag, so the correct XPath is: //div[@class="course-number"]/a/text()
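
To also restrict the match to the regCart form, as asked in the question, one option is to prefix the expression with a predicate on the form's name attribute. Below is a minimal sketch, assuming the page is shaped like the sample in the question. SAMPLE_HTML is made up for illustration, and its 000000 entry outside the form is hypothetical, added only to show that the scoped expression skips it.

```python
import lxml.html

# Hypothetical page modelled on the HTML in the question; the 000000
# entry outside the form is invented to demonstrate the scoping.
SAMPLE_HTML = """
<html><body>
  <div class="course-number"><a href="https://university.com/rishum/course/000000">000000</a></div>
  <form name="regCart">
    <div class="cart-entry">
      <div class="course-number">
        <a href="https://university.com/rishum/course/236756">236756</a>
      </div>
      <div class="course-points">3.0</div>
    </div>
  </form>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)

# text() directly under the div only sees the whitespace around <a>.
whitespace_only = doc.xpath('//div[@class="course-number"]/text()')

# Step into <a>, and search only inside the form named regCart.
registered_courses = doc.xpath(
    '//form[@name="regCart"]//div[@class="course-number"]/a/text()'
)
print(registered_courses)
```

With your real response you would build doc from response.content as you already do. If the link text can carry surrounding whitespace, calling .strip() on each result is a cheap safeguard.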

