Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

How do I extract bytes with offsets from a huge block efficiently in Python?

Let's suppose I have a block of bytes like this:

block = b'0123456789AB'

I want to take the first 3 bytes of each 4-byte chunk and join them together. The result for the block above should be:

b'01245689A'  # 3, 7 and B are skipped

I can solve this with a script like the following:

block = b'0123456789AB'
result = b''
for i in range(0, len(block), 4):
    result += block[i:i + 3]
print(result)

But Python is known to be slow with for-loops and repeated bytes concatenation like this, so my approach will practically never finish if I apply it to a really huge block of bytes. Is there a faster way to do this?

Question from: https://stackoverflow.com/questions/66047126/how-do-i-extract-bytes-with-offsets-from-a-huge-block-efficiently-in-python

1 Reply

Make it mutable and delete the unwanted slice?

>>> tmp = bytearray(block)
>>> del tmp[3::4]
>>> bytes(tmp)
b'01245689A'
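
To see which bytes the extended slice selects before deleting them, a small check along these lines may help. It is only a sketch: the variable names and the 10-byte input with a partial trailing chunk are made up for illustration.

```python
block = b'0123456789AB'

# Every 4th byte, starting at index 3: these are the bytes that get dropped.
dropped = block[3::4]

tmp = bytearray(block)  # mutable copy; bytes objects do not support del
del tmp[3::4]           # one slice deletion, no Python-level loop
kept = bytes(tmp)

# A trailing partial chunk is kept as-is, like in the loop from the question.
partial = bytearray(b'0123456789')
del partial[3::4]

print(dropped, kept, bytes(partial))
```

Note that this trick relies on dropping exactly one byte per chunk, at a fixed position.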

If your chunks are large and you want to remove almost all of the bytes, it might be faster to collect the bytes you do want instead, similar to your approach. Repeated += can take quadratic time, though, so it is better to use join:

>>> b''.join([block[i : i+3] for i in range(0, len(block), 4)])
b'01245689A'

(Btw according to PEP 8 it should be block[i : i+3], not block[i:i + 3], and for good reason.)

That builds a lot of small bytes objects, though, which could be a memory problem. For your stated case it is much faster than your loop but much slower than my bytearray version.
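
If creating one object per chunk is a concern, one possible alternative is strided slice assignment into a preallocated bytearray. This is not one of the benchmarked solutions and its speed has not been measured here; the function name and parameters are hypothetical. It creates only `keep` temporary objects, however many chunks there are.

```python
def keep_prefix_of_each_chunk(block, chunk=4, keep=3):
    """Return the first `keep` bytes of every `chunk`-byte group of `block`.

    Hypothetical helper for illustration; a trailing partial chunk
    contributes at most `keep` bytes, like the loop in the question.
    """
    full, rest = divmod(len(block), chunk)
    out = bytearray(full * keep + min(rest, keep))
    for j in range(keep):
        # block[j::chunk] holds the j-th byte of every chunk; write those
        # bytes to positions j, j + keep, j + 2*keep, ... of the output.
        out[j::keep] = block[j::chunk]
    return bytes(out)


print(keep_prefix_of_each_chunk(b'0123456789AB'))
```

Unlike `del tmp[3::4]`, this form also covers the case where more than one byte per chunk has to be dropped.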

Benchmark with block = b'0123456789AB' * 100_000 (much smaller than the 1GB you mentioned in the comments below):

    0.00 ms      0.00 ms      0.00 ms  baseline
15267.60 ms  14724.33 ms  14712.70 ms  original
    2.46 ms      2.46 ms      3.45 ms  Kelly_Bundy_bytearray
   83.66 ms     85.27 ms    122.88 ms  Kelly_Bundy_join

Benchmark code:

import timeit

def baseline(block):
    pass

def original(block):
    result = b''
    for i in range(0, len(block), 4):
        result += block[i:i + 3]
    return result

def Kelly_Bundy_bytearray(block):
    tmp = bytearray(block)
    del tmp[3::4]
    return bytes(tmp)

def Kelly_Bundy_join(block):
    return b''.join([block[i : i+3] for i in range(0, len(block), 4)])

funcs = [
    baseline,
    original,
    Kelly_Bundy_bytearray,
    Kelly_Bundy_join,
    ]

block = b'0123456789AB' * 100_000
args = block,
number = 10**0

expect = original(*args)
for func in funcs:
    print(func(*args) == expect, func.__name__)
print()

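# One list of timings per function; each round appends one measurement.
tss = [[] for _ in funcs]
for _ in range(3):
    for func, ts in zip(funcs, tss):
        # Best time over timeit.repeat's runs, in seconds per call.
        t = min(timeit.repeat(lambda: func(*args), number=number)) / number
        ts.append(t)
        # Print all rounds measured so far for this function, in milliseconds.
        print(*('%8.2f ms ' % (1e3 * t) for t in ts), func.__name__)
    print()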
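
For data in the 1 GB range that does not need to sit in memory all at once, the bytearray deletion could also be applied piece by piece. The following is a rough sketch under assumptions: the function name, the `piece_size` default and the use of file-like objects are all hypothetical and not part of the original answer. It tracks the read position modulo 4 so that pieces need not align with chunk boundaries.

```python
import io


def drop_every_fourth_byte(src, dst, piece_size=1 << 20):
    """Copy binary stream `src` to `dst`, dropping bytes 3, 7, 11, ... of the input."""
    offset = 0  # absolute position of the current piece's start, modulo 4
    while True:
        piece = src.read(piece_size)
        if not piece:
            break
        tmp = bytearray(piece)
        # Index i within the piece is dropped when (offset + i) % 4 == 3.
        del tmp[(3 - offset) % 4::4]
        dst.write(tmp)
        offset = (offset + len(piece)) % 4


# Usage with in-memory streams and a deliberately awkward piece size.
src = io.BytesIO(b'0123456789AB')
dst = io.BytesIO()
drop_every_fourth_byte(src, dst, piece_size=5)
print(dst.getvalue())
```

With real files, `src` and `dst` would be objects opened in binary mode ('rb' and 'wb').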
