Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How do I properly create custom text codecs?

I'm digging through some old binaries that contain (among other things) text. Their text frequently uses custom character encodings for Reasons, and I want to be able to read and rewrite them.

It seems to me that the appropriate way to do this is to create a custom codec using the standard codecs library. Unfortunately, its documentation is both colossal and entirely bereft of examples. Google turns up a few, but only for Python 2, and I'm using Python 3.

I'm looking for a minimal example of how to use the codecs library to implement a custom character encoding.

1 Reply

You asked for minimal!

  • Write an encode function and a decode function.
  • Write a "search function" that returns a CodecInfo object constructed from the above encoder and decoder.
  • Use codecs.register to register that search function.

Here is an example in which decoding converts the lowercase letters a-z to the numbers 0-25, in order.

import codecs
import string

from typing import Tuple

# prepare map from numbers to letters
_encode_table = {str(number): bytes(letter, 'ascii') for number, letter in enumerate(string.ascii_lowercase)}

# prepare inverse map
_decode_table = {ord(v): k for k, v in _encode_table.items()}


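def custom_encode(text: str, errors: str = 'strict') -> Tuple[bytes, int]:
    # example encoder that converts digit characters to letters
    # `errors` is accepted to match Codec.encode but is ignored in this minimal example
    # note: input is read one character at a time, so only '0'-'9' (a-j) are encodable
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
    return b''.join(_encode_table[x] for x in text), len(text)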


def custom_decode(binary: bytes, errors: str = 'strict') -> Tuple[str, int]:
    # example decoder that converts letters to numbers
    # `errors` is accepted to match Codec.decode but is ignored in this minimal example
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.decode
    return ''.join(_decode_table[x] for x in binary), len(binary)


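def custom_search_function(encoding_name):
    # codecs.lookup passes the normalized (lowercased) encoding name;
    # return None for any other name so unrelated lookups still fail properly
    if encoding_name != 'reasons':
        return None
    return codecs.CodecInfo(custom_encode, custom_decode, name='Reasons')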


def main():

    # register the search function for your custom codec
    # note that codecs.lookup passes it the normalized (lowercased) encoding name
    codecs.register(custom_search_function)

    binary = b'abcdefg'
    # decode letters to numbers
    text = codecs.decode(binary, encoding='Reasons')
    print(text)
    # encode numbers to letters
    binary2 = codecs.encode(text, encoding='Reasons')
    print(binary2)
    # encode(decode(...)) should be an identity function
    assert binary == binary2

if __name__ == '__main__':
    main()

Running this prints

$ python codec_example.py
0123456
b'abcdefg'

See https://docs.python.org/3/library/codecs.html#codec-objects for details on the Codec interface. In particular, the decode function

... decodes the object input and returns a tuple (output object, length consumed).

whereas the encode function

... encodes the object input and returns a tuple (output object, length consumed).
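
To make the shape of those return values concrete, here is a small sketch that calls the built-in UTF-8 codec directly. UTF-8 is used only as a stand-in (it is not part of the example above); the point is that "length consumed" counts input units, so it is characters for encode and bytes for decode.

```python
import codecs

# The built-in UTF-8 codec is used here only to illustrate the return shape.
info = codecs.lookup('utf-8')

# encode: str in, (bytes, number of characters consumed) out
encoded, consumed = info.encode('héllo')
print(encoded, consumed)

# decode: bytes in, (str, number of bytes consumed) out
decoded, consumed_bytes = info.decode(encoded)
print(decoded, consumed_bytes)
```

Here 'é' takes two bytes in UTF-8, so encode reports 5 characters consumed while decode reports 6 bytes consumed.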

Note that you should also worry about handling streams, incremental encoding/decoding, as well as error handling. For a more complete example, refer to the hexlify codec that @krs013 mentioned.
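
As a rough sketch of what that fuller version might look like for a single-byte character set, the following follows the pattern used by the charmap-based codecs in the standard library's encodings package. The table contents and the name 'reasons_charmap' are made up for illustration; substitute the mapping your binaries actually use.

```python
import codecs
import io

# Hypothetical 256-entry table: byte value -> character.
# 0x00-0x7F are plain ASCII, 0x80 is 'é', everything else is undefined.
_UNDEFINED = '\ufffe'
_decoding_table = ''.join(chr(i) for i in range(128)) + 'é' + _UNDEFINED * 127
_encoding_table = codecs.charmap_build(_decoding_table)


class Codec(codecs.Codec):
    # stateless encode/decode; `errors` is honoured by the charmap helpers
    def encode(self, input, errors='strict'):
        return codecs.charmap_encode(input, errors, _encoding_table)

    def decode(self, input, errors='strict'):
        return codecs.charmap_decode(input, errors, _decoding_table)


class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return codecs.charmap_encode(input, self.errors, _encoding_table)[0]


class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        return codecs.charmap_decode(input, self.errors, _decoding_table)[0]


class StreamWriter(Codec, codecs.StreamWriter):
    pass


class StreamReader(Codec, codecs.StreamReader):
    pass


def search(name):
    # only answer for our own (hypothetical) codec name
    if name != 'reasons_charmap':
        return None
    return codecs.CodecInfo(
        name='reasons_charmap',
        encode=Codec().encode,
        decode=Codec().decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamreader=StreamReader,
        streamwriter=StreamWriter,
    )


codecs.register(search)

raw = 'café'.encode('reasons_charmap')
text = raw.decode('reasons_charmap')
# byte 0xff is undefined in the table, so 'replace' substitutes U+FFFD
lenient = b'ok\xff'.decode('reasons_charmap', errors='replace')
# file-like objects use the incremental decoder
streamed = io.TextIOWrapper(io.BytesIO(raw), encoding='reasons_charmap').read()
```

Because each byte maps to exactly one character, the incremental classes need no buffering. A multi-byte or stateful encoding would have to keep partial input between calls.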


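P.S. instead of codecs.decode, you can also use codecs.open(..., encoding='Reasons'), provided the CodecInfo also supplies a stream reader and stream writer (the minimal example above does not).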

