Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to read tokens without reading whole line or file

Is there a well-hidden way to read tokens from a file or file-like object without reading entire lines? The application I immediately have (someone else's problem, not mine) is transposing a large matrix with a few very long rows, essentially performing an itertools.izip() on iterators that pick out the elements of a single column. The idea is not to have the entire file in memory during iteration.

The rows are space-delimited ASCII decimal numbers.

The problem would be simple with Java's Scanner class, but I don't see anything in the Python Standard Library that appears to tokenize without having the whole input in a string.

For the record, I know how to write this on my own. I'm just wondering if there's a standard tool that I missed. Something FOSS/libre that can be EasyInstalled is good, too, but I don't see anything on PYPI either.

The full problem was to take the sample input:

"123 3 234234 -35434 112312 54 -439 99 0 42\n" +
"13 456 -78 910 333 -44 5555 6 8"

...and produce the output (as a generator, without reading all of the very long rows into memory at once):

[123, 13], [3, 456], [234234, -78], ...etc

As I said, it's essentially itertools.izip(iterator1, iterator2), pointing iterator1 at the start of the file, and iterator2 just past the newline to read the second row.

1 Reply

To read tokens from a file one by one, you could use the re module to generate tokens from a memory-mapped file:

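#!/usr/bin/env python3
import re
import sys
from mmap import ACCESS_READ, mmap

def generate_tokens(filename, pattern):
    # Map the file read-only; the regex then scans the mapping lazily,
    # yielding one match object per token.
    with open(filename, 'rb') as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        yield from re.finditer(pattern, mm)

# sum all integers in a file specified at the command-line
# (br'\d+' matches unsigned digit runs; use br'-?\d+' to keep minus signs)
print(sum(int(m.group()) for m in generate_tokens(sys.argv[1], br'\d+')))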

It works even if the file doesn't fit in memory, because the operating system pages the mapped file in on demand instead of loading it all at once.
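
The same idea can be stretched to the transpose problem from the question. What follows is only a sketch under a few assumptions: the input is a regular, non-empty file that can be memory-mapped, rows are separated by b'\n', and tokens are optionally signed decimal integers. The names TOKEN, row_spans, row_tokens and transpose are made up for this illustration and are not part of any library. Each row gets its own generator that remembers only a byte offset into the mapping, and zip() advances all of them in lockstep:

```python
import re
from mmap import ACCESS_READ, mmap

TOKEN = re.compile(rb'-?\d+')  # assumption: optionally signed decimal integers

def row_spans(mm):
    """Yield (start, end) byte offsets of every non-empty line in the mapping."""
    start, size = 0, len(mm)
    while start < size:
        end = mm.find(b'\n', start)
        if end == -1:          # last line has no trailing newline
            end = size
        if end > start:        # skip empty lines
            yield start, end
        start = end + 1

def row_tokens(mm, start, end):
    """Lazily yield the integers found between two byte offsets."""
    pos = start
    while True:
        m = TOKEN.search(mm, pos, end)
        if m is None:
            return
        pos = m.end()          # offsets stay absolute even with pos/endpos
        yield int(m.group())

def transpose(filename):
    """Yield one column at a time as a list, stopping at the shortest row."""
    with open(filename, 'rb') as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        rows = [row_tokens(mm, s, e) for s, e in row_spans(mm)]
        for column in zip(*rows):
            yield list(column)
```

Iterating over transpose('matrix.txt') (a hypothetical file name) produces one column per step. Like itertools.izip(), zip() stops as soon as the shortest row runs out. Finding the row boundaries still scans the whole file once for newlines, but only the offsets are kept, not the row contents.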

