Reading Unicode file data with BOM chars in Python

Question

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm reading a series of source code files using Python and running into a Unicode BOM error. Here's my code:

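import os
import chardet

# filename and mode are defined elsewhere; mode is a text mode such as 'r'.
# Sample at most the first 32 bytes so chardet can guess the encoding.
sample_size = min(32, os.path.getsize(filename))
with open(filename, 'rb') as rawfile:
    raw = rawfile.read(sample_size)
result = chardet.detect(raw)
encoding = result['encoding']

# Re-open the file as text using the detected encoding.
with open(filename, mode, encoding=encoding) as infile:
    data = infile.read()

print(data)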

As you can see, I'm detecting the encoding using chardet, then reading the file into memory and attempting to print it. The print statement fails on Unicode files containing a BOM, with this error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>

I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?

1 Reply

深蓝 · Answer 1 · 2021-10-16T22:16:41+0000

There is no need to check whether a BOM exists. The utf-8-sig codec handles that for you and behaves exactly like utf-8 when no BOM is present:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM-encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see that utf-8-sig correctly decodes the given bytes whether or not a BOM is present. If you think there is even a small chance that a BOM might exist in the UTF-8 files you are reading, just use utf-8-sig and don't worry about it.
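Applied to reading files rather than byte literals, the same idea might look like the sketch below. It assumes the files are UTF-8 (with or without a BOM). The helper name read_text_skipping_bom and the temporary demo files are made up for illustration and are not part of the original question.

```python
import os
import tempfile


def read_text_skipping_bom(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present.

    Hypothetical helper for illustration only.
    """
    # 'utf-8-sig' strips a leading UTF-8 BOM while decoding and otherwise
    # behaves like plain 'utf-8'.
    with open(path, 'r', encoding='utf-8-sig') as infile:
        return infile.read()


# Demo with two temporary files: one starting with a BOM, one without.
with tempfile.TemporaryDirectory() as tmpdir:
    with_bom = os.path.join(tmpdir, 'with_bom.py')
    without_bom = os.path.join(tmpdir, 'without_bom.py')

    with open(with_bom, 'wb') as f:
        f.write(b'\xef\xbb\xbfprint("hi")\n')
    with open(without_bom, 'wb') as f:
        f.write(b'print("hi")\n')

    text_with_bom = read_text_skipping_bom(with_bom)
    text_without_bom = read_text_skipping_bom(without_bom)
```

Both reads should return the same text, with no leading '\ufeff'. Files in other encodings such as UTF-16 or UTF-32 need a different codec, so this is not a replacement for encoding detection in general.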
