Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
245 views
in Technique[技术] by (71.8m points)

c++ - Using a regex_iterator on an istream

I want to be able to solve problems like this: Getting std :: ifstream to handle LF, CR, and CRLF? where an istream needs to be tokenized by a complex delimiter; such that the only way to tokenize the istream is to:

  1. Read it in the istream a character at a time
  2. Collect the characters
  3. When a delimiter is hit return the collection as a token

Regexes are very good at tokenizing strings with complex delimiters:

string foo{ "A
B
C

" };
vector<string> bar;

// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), regex("(.*)(?:

?|
)")), sregex_iterator(), back_inserter(bar), [](const smatch& i){ return i[1].str(); });

But I can't use a regex_iterator on a istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.

Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This question is about code appearance:

  1. Since we know that a regex will work 1 character at a time, this question is asking to use a library to parse the istream 1 character at a time rather than internally reading and parsing the istream 1 character at a time
  2. Since parsing an istream 1 character at a time will still copy that one character to a temp variable (buffer) this code seeks to avoid buffering all the code internally, depending on a library instead to abstract that

C++11's regexes use ECMA-262 which does not support look aheads or look behinds: https://stackoverflow.com/a/14539500/2642059 This means that a regex could match using only an input_iterator_tag, but clearly those implemented in C++11 do not.

boost::regex_iterator on the other hand does support the boost::match_partial flag (which is not available in C++11 regex flags.) boost::match_partial allows the user to slurp part of the file and run the regex over that, on a mismatch due to end of input the regex will "hold it's finger" at that position in the regex and await more being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A B C ", this can save buffer size.

boost::match_partial has 4 drawbacks:

  1. In the worst case, like "ABC " this saves the user no size and he must slurp the whole istream
  2. If the programmer can guess a buffer size that is too large, that is it contains the delimiter and a significant amount more, the benefits of the reduction in buffer size are squandered
  3. Any time the buffer size selected is too small, additional computations will be required compared to the slurping of the entire file, therefore this method excels in a delimiter-dense string
  4. The inclusion of boost always causes bloat

Circling back to answer the question: A standard library regex_iterator cannot operate on an input_iterator_tag, slurping of the whole istream required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance though, and because boost::regex_iterator's worst case requires slurping of the whole file anyway, it is not a good answer to this question.

For the best code appearance slurping the whole file and running a standard regex_iterator over it is your best bet.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...