Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
550 views
in Technique[技术] by (71.8m points)

ruby - Nokogiri, open-uri, and Unicode Characters

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")

At this point, the title looks like this:

Rag303271

Instead of:

Ragù

How can I have nokogiri return the proper character (e.g. ù in this case)?

Here's an example URL:

http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read and pass the resulting string to Nokogiri.

Analysis: If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment on the Ruby file and setting the doc encoding, it's no good:

# encoding: UTF-8
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1]
puts h52.text, h52.text.encoding
#=> Genealog? a de Jesucristo
#=> UTF-8

We can see that this is not the fault of open-uri:

html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
gene = html.read[/GeneS+/]
puts gene, gene.encoding
#=> Genealogía
#=> UTF-8

This is a Nokogiri issue when dealing with open-uri, it seems. This can be worked around by passing the HTML as a raw string to Nokogiri:

# encoding: UTF-8
require 'nokogiri'
require 'open-uri'

html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
doc = Nokogiri::HTML(html.read)
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1].text
puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
#=> Genealogía de Jesucristo
#=> UTF-8
#=> true

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...