Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

How do I parse this data structure returned by Nokogiri in Ruby?

I am iterating over a collection, and this is what each element looks like when inspected:

[nil, [#<Nokogiri::XML::Element:0x835386d4 name="a" attributes=[#<Nokogiri::XML::Attr:0x835385f8 name="href" value="http://bham.craigslist.org/web/2961573018.html">] children=[#<Nokogiri::XML::Text:0x835381c0 "Web Designer Full time">]>

What I would like to do is access the href value, and then the text value. How do I do that?

I tried this:

puts i[:href]

But that generates this error:

TypeError: Symbol as array index

By the way, I am accessing i as an element in the array via each like this:

contents.each do |i|
    puts i.inspect
    puts i[:href]
end

Edit 1:

This is how I am generating the contents array. There is no need to rename it, because it can get confusing :)

contents = {}
first_items.each do |link|
    content_url = link
    content_page = Nokogiri::HTML(open(content_url))
    contents[link[:href]] = content_page.css("p a")
end

puts contents.inspect

This is what gets output:

{nil=>[#<Nokogiri::XML::Element:0x85fee914 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee838 name="href" value="http://bham.craigslist.org/web/2961573018.html">] children=[#<Nokogiri::XML::Text:0x85fee400 "Web Designer Full time">]>, #<Nokogiri::XML::Element:0x85fee298 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee1bc name="href" value="http://bham.craigslist.org/web/2959813303.html">] children=[#<Nokogiri::XML::Text:0x85fedd84 "Once in a lifetime opportunity...">]>, #<Nokogiri::XML::Element:0x85fedc1c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fedb40 name="href" value="http://bham.craigslist.org/web/2925485723.html">] children=[#<Nokogiri::XML::Text:0x85fed708 "Website Designer and Blogging Internship!">]>, #<Nokogiri::XML::Element:0x85fed5a0 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fed4c4 name="href" value="http://bham.craigslist.org/web/2918424652.html">] children=[#<Nokogiri::XML::Text:0x85fed08c "Excellent Java Developer Opportunity!">]>, #<Nokogiri::XML::Element:0x85fecf24 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fece48 name="href" value="http://bham.craigslist.org/web/2888669703.html">] children=[#<Nokogiri::XML::Text:0x85feca10 "Freelance Graphic Design">]>, #<Nokogiri::XML::Element:0x85fec8a8 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec7cc name="href" value="http://bham.craigslist.org/web/2900256461.html">] children=[#<Nokogiri::XML::Text:0x85fec394 "GWT/GXT Developer">]>, #<Nokogiri::XML::Element:0x85fec22c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec150 name="href" value="http://bham.craigslist.org/web/2897641463.html">] children=[#<Nokogiri::XML::Text:0x85febd18 "Website hiring!">]>]}

Here is the full value of the output for i:

--------------------
This is the value of i: 
[nil, [#<Nokogiri::XML::Element:0x85fee914 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee838 name="href" value="http://bham.craigslist.org/web/2961573018.html">] children=[#<Nokogiri::XML::Text:0x85fee400 "Web Designer Full time">]>, #<Nokogiri::XML::Element:0x85fee298 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee1bc name="href" value="http://bham.craigslist.org/web/2959813303.html">] children=[#<Nokogiri::XML::Text:0x85fedd84 "Once in a lifetime opportunity...">]>, #<Nokogiri::XML::Element:0x85fedc1c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fedb40 name="href" value="http://bham.craigslist.org/web/2925485723.html">] children=[#<Nokogiri::XML::Text:0x85fed708 "Website Designer and Blogging Internship!">]>, #<Nokogiri::XML::Element:0x85fed5a0 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fed4c4 name="href" value="http://bham.craigslist.org/web/2918424652.html">] children=[#<Nokogiri::XML::Text:0x85fed08c "Excellent Java Developer Opportunity!">]>, #<Nokogiri::XML::Element:0x85fecf24 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fece48 name="href" value="http://bham.craigslist.org/web/2888669703.html">] children=[#<Nokogiri::XML::Text:0x85feca10 "Freelance Graphic Design">]>, #<Nokogiri::XML::Element:0x85fec8a8 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec7cc name="href" value="http://bham.craigslist.org/web/2900256461.html">] children=[#<Nokogiri::XML::Text:0x85fec394 "GWT/GXT Developer">]>, #<Nokogiri::XML::Element:0x85fec22c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec150 name="href" value="http://bham.craigslist.org/web/2897641463.html">] children=[#<Nokogiri::XML::Text:0x85febd18 "Website hiring!">]>]]
--------------------
This is the value of i.href: 

Edit 2:

By the way, this is how I generate the actual HTML output:

builder = Nokogiri::HTML::Builder.new do |doc|
    doc.html {
        doc.body {
            contents.each do |el|
                if !el.nil?
                    puts "-" * 20
                    puts "This is the value of el: "
                    puts el.inspect

                    puts "-" * 20
                    puts "This is the value of el.href: "
                    puts el[:href]
                end

                doc.p {
                    doc.a el, :href => el
                }
            end
        }
    }
end

puts "*" * 50
puts "This is the HTML generated"

puts builder.to_html

This is how it looks:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p><a href="&lt;a%20href=%22http://bham.craigslist.org/web/2961573018.html%22&gt;Web%20Designer%20Full%20time&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2959813303.html%22&gt;Once%20in%20a%20lifetime%20opportunity...&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2925485723.html%22&gt;Website%20Designer%20and%20Blogging%20Internship!&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2918424652.html%22&gt;Excellent%20Java%20Developer%20Opportunity!&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2888669703.html%22&gt;Freelance%20Graphic%20Design&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2900256461.html%22&gt;GWT/GXT%20Developer&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2897641463.html%22&gt;Website%20hiring!&lt;/a&gt;">&lt;a href="http://bham.craigslist.org/web/2961573018.html"&gt;Web Designer Full time&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2959813303.html"&gt;Once in a lifetime opportunity...&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2925485723.html"&gt;Website Designer and Blogging Internship!&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2918424652.html"&gt;Excellent Java Developer Opportunity!&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2888669703.html"&gt;Freelance Graphic Design&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2900256461.html"&gt;GWT/GXT Developer&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2897641463.html"&gt;Website hiring!&lt;/a&gt;</a></p></body></html>

1 Reply

I think it can be a lot simpler. Nokogiri already parses the document and provides convenient ways to access the content. Rather than looping, storing Nokogiri objects, then trying to extract them, why not try a more direct approach?

Try this code:

content_page.search('//a[@href]').map { |el| [el[:href], el.text] }

This creates a 2D array containing the href and text of each link in the document, which, according to your follow-up comment, is what you are actually working toward.
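
As for the TypeError in the question: contents is a Hash, so each yields [key, value] pairs, and i is a two-element Array (the [nil, [...]] in your inspect output). Indexing an Array with a Symbol is what raises the error. Here is a rough sketch of that, and of unpacking the pair instead. It assumes nothing beyond plain Ruby: FakeLink is a hypothetical stand-in for Nokogiri::XML::Element that only mimics el[:href] and el.text, and the two links are copied from your output.

```ruby
# FakeLink is a hypothetical stand-in for Nokogiri::XML::Element so that this
# sketch runs without Nokogiri; it only mimics el[:href] and el.text.
class FakeLink
  attr_reader :text

  def initialize(href, text)
    @attrs = { "href" => href }
    @text = text
  end

  # Attribute lookup by name, accepting a Symbol or a String.
  def [](name)
    @attrs[name.to_s]
  end
end

# Same shape as the question's `contents`: a Hash whose value is a list of links.
contents = {
  nil => [
    FakeLink.new("http://bham.craigslist.org/web/2961573018.html", "Web Designer Full time"),
    FakeLink.new("http://bham.craigslist.org/web/2959813303.html", "Once in a lifetime opportunity...")
  ]
}

# Hash#each yields [key, value] pairs, so a single block argument is an Array,
# and indexing an Array with a Symbol raises TypeError.
begin
  contents.each { |i| i[:href] }
rescue TypeError => e
  puts "TypeError: #{e.message}"
end

# Unpack the pair in the block arguments, then walk the links inside it.
pairs = contents.flat_map do |_key, links|
  links.map { |el| [el[:href], el.text] }
end

pairs.each { |href, text| puts "#{text} -> #{href}" }
```

With real Nokogiri objects the inner map should look the same, since each element in the node set responds to [:href] and text.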

