Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Convert in utf16

I am crawling several websites and extracting the product names. Some of the names contain errors like these:

Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior
Bols Watermelon Lik\u00f6r 0,7l
Hayman\u00b4s Sloe Gin
Ron Zacapa Edici\u00f3n Negra
Havana Club A\u00f1ejo Especial
Caol Ila 13 Jahre (G&amp;M Discovery)

How can I fix that? I am using xpath and re.search to get the names.

Every Python file starts with this line: # -*- coding: utf-8 -*-

Edit:

This is the source code I use to get the information:

if '"articleName":' in details:
    closer_to_product = details.split('"articleName":', 1)[1]
    closer_to_product_2 = closer_to_product.split('"imageTitle', 1)[0]
    if debug_product == 1:
        print('product before try:' + repr(closer_to_product_2))
    try:
        # Capture the text between the first '"' and the following '",'.
        found_product = re.search(r'"(.*?)",', closer_to_product_2).group(1)
    except AttributeError:
        # re.search returned None, so no quoted value was found.
        found_product = ''
    if debug_product == 1:
        print('cleared product: ', '>>>' + repr(found_product) + '<<<')
    if not found_product:
        print(product_detail_page, found_product)
        items['products'] = 'default'
    else:
        items['products'] = found_product

Details

product_details = information.xpath('/*').extract()
product_details = [details.strip() for details in product_details]

1 Reply

Where is the problem? This works for me in Python 3.8.3:

import html

strings = [
  # Literal backslash + 'u' sequences, as they appear in the scraped text
  'Bols Watermelon Lik\\u00f6r 0,7l',
  'Hayman\\u00b4s Sloe Gin',
  'Ron Zacapa Edici\\u00f3n Negra',
  'Havana Club A\\u00f1ejo Especial',
  # HTML entity
  'Caol Ila 13 Jahre (G&amp;M Discovery)',
  # Ordinary escapes that Python already resolves in the string literal
  'Old Pulteney \u00b7 12 Years \u00b7 40% vol',
  'Killepitsch Kr\u00e4uterlik\u00f6r 42% 0,7 L']

for s in strings:
  print( html.unescape(s).
                encode('raw_unicode_escape').
                decode('unicode_escape') )

Output:

Bols Watermelon Likör 0,7l
Hayman´s Sloe Gin
Ron Zacapa Edición Negra
Havana Club Añejo Especial
Caol Ila 13 Jahre (G&M Discovery)
Old Pulteney · 12 Years · 40% vol
Killepitsch Kräuterlikör 42% 0,7 L

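Edit: Use .encode('raw_unicode_escape').decode('unicode_escape') when the backslashes (reverse solidi) are doubled, i.e. when the text contains a literal backslash followed by uXXXX. See "Python Specific Encodings" in the codecs documentation.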
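
The same chain can be wrapped in a small helper and applied to each extracted name before it is stored. This is only a sketch: the function name fix_product_name is hypothetical and does not come from the question's code, and it assumes the scraped names contain no other backslash sequences (such as a literal backslash followed by n), because unicode_escape would interpret those as well.

```python
import html


def fix_product_name(raw: str) -> str:
    """Decode HTML entities and literal \\uXXXX sequences in a scraped name.

    Hypothetical helper for illustration; not part of the original code.
    """
    # html.unescape turns entities such as &amp; into the characters they stand for.
    unescaped = html.unescape(raw)
    # raw_unicode_escape leaves existing backslashes untouched, so a literal
    # backslash-u sequence survives as bytes; unicode_escape then interprets it.
    return unescaped.encode('raw_unicode_escape').decode('unicode_escape')


if __name__ == '__main__':
    print(fix_product_name('Bols Watermelon Lik\\u00f6r 0,7l'))
    print(fix_product_name('Caol Ila 13 Jahre (G&amp;M Discovery)'))
```

In the question's code this could be applied to the regex result, for example items['products'] = fix_product_name(found_product).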

