Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c++ - How to make my split work on only one real line and be capable of skipping quoted parts of a string?

So we have a simple split:

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;

vector<string> split(const string& s, const string& delim, const bool keep_empty = true) {
    vector<string> result;
    if (delim.empty()) {
        result.push_back(s);
        return result;
    }
    string::const_iterator substart = s.begin(), subend;
    while (true) {
        subend = search(substart, s.end(), delim.begin(), delim.end());
        string temp(substart, subend);
        if (keep_empty || !temp.empty()) {
            result.push_back(temp);
        }
        if (subend == s.end()) {
            break;
        }
        substart = subend + delim.size();
    }
    return result;
}

or Boost split. And we have a simple main like:

int main() {
    const vector<string> words = split("close no \"\n matter\" how \n far", " ");
    copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
}

How can we make it output something like

close 
no
"
 matter"
how
end symbol found.

We want to introduce into split structures that should be kept unsplit and characters that should end the parsing process. How can this be done?


1 Reply


Update: By way of 'thank you' for awarding the bonus, I went and implemented 4 features that I initially skipped as "You Ain't Gonna Need It".

  1. now supports partially quoted columns

    This is the problem you reported: e.g. with the delimiter ',' only test,"one,two",three would be valid, not test,one","two","three. Now both are accepted.

  2. now supports custom delimiter expressions

    You could only specify single characters as delimiters. Now you can specify any Spirit Qi parser expression as the delimiter rule. E.g.

      splitInto(input, output, ' ');                          // single space
      splitInto(input, output, +qi::lit(' '));                // one or more spaces
      splitInto(input, output, +qi::char_(" \t"));            // one or more spaces or tabs
      splitInto(input, output, qi::double_ >> !qi::lit('#')); // -- any parse expression
    

    Note this changes behaviour for the default overload

    The old version treated repeated spaces as a single delimiter by default. You now have to explicitly specify that (2nd example) if you want it.

  3. now supports quotes ("") inside quoted values (instead of just making them disappear)

    See the code sample. Quite simple of course. Note that the sequence "" outside a quoted construct still represents the empty string (for compatibility with e.g. existing CSV output formats which quote empty strings redundantly)

  4. support boost ranges in addition to containers as input (e.g. char[])

    Well, you ain't gonna need it (but it was rather handy for me in order to just be able to write splitInto("a char array", ...) :)
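For readers without Boost at hand, the quoting rules from items 1 and 3 can be sketched in plain C++. This is only a hand-rolled illustration under stated assumptions, not the Spirit grammar from this answer: the name splitQuoted and its single-character delimiter are made up for the example, it treats the whole input as one line, and unlike the grammar it silently accepts an unterminated quote.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the Spirit-based splitInto): splits one
// line on a single-character delimiter while honouring double quotes.
std::vector<std::string> splitQuoted(const std::string& line, char delim = ',')
{
    std::vector<std::string> fields(1); // always at least one (possibly empty) field
    bool inQuotes = false;

    for (std::string::size_type i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inQuotes) {
            if (c != '"') {
                fields.back() += c;          // delimiters are literal inside quotes
            } else if (i + 1 < line.size() && line[i + 1] == '"') {
                fields.back() += '"';        // "" inside quotes -> one literal quote
                ++i;
            } else {
                inQuotes = false;            // closing quote
            }
        } else if (c == '"') {
            inQuotes = true;                 // opening quote, may appear mid-field
        } else if (c == delim) {
            fields.push_back(std::string()); // start the next field
        } else {
            fields.back() += c;
        }
    }
    return fields;
}
```

With this sketch, both test,"one,two",three and test,one","two","three are accepted (the second yields the two fields test and one,two,three), and a,"",b yields an empty middle field.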

As I had half expected, you were gonna need partially quoted fields (see your comment [1]). Well, here you are (the bottleneck was getting it to work consistently across different versions of Boost).

Introduction

Random notes and observations for the reader:

  • splitInto template function happily supports whatever you throw at it:

    • input from a vector or std::string or std::wstring
    • output to -- some combinations shown in demo --
      • vector<string> (all lines flattened)
      • vector<vector<string>> (tokens per line)
      • list<list<string>> (if you prefer)
      • set<set<string>> (unique linewise tokensets)
      • ... any container you dream up
  • for demo purposes showing off karma output generation (especially taking care of nested container)
    • note: a newline (\n) in the output is shown as ? for comprehension (safechars)
  • complete with handy plumbing for new Spirit users (legible rule naming, commented DEBUG defines in case you want to play with things)
  • you can specify any Spirit parse expression to match delimiters. This means that by passing +qi::lit(' ') instead of the default (' ') you will skip empty fields (i.e. repeated delimiters)
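The difference between a single delimiter and a "one or more" delimiter can be illustrated without Spirit. The following is a rough sketch only: splitOn and its collapse flag are hypothetical names invented here, and quoting is ignored entirely so that just the delimiter behaviour is visible.

```cpp
#include <string>
#include <vector>

// Hypothetical helper for illustration only. With collapse == false every
// delimiter character ends a field, so "a  b" yields an empty field between
// the two spaces. With collapse == true a run of delimiters counts as one.
std::vector<std::string> splitOn(const std::string& s, char delim, bool collapse)
{
    std::vector<std::string> fields;
    std::string::size_type pos = 0;
    while (true) {
        const std::string::size_type end = s.find(delim, pos);
        // substr with npos as length takes everything up to the end of s
        fields.push_back(s.substr(pos, end == std::string::npos ? end : end - pos));
        if (end == std::string::npos)
            break;
        pos = end + 1;
        if (collapse) {
            // swallow the rest of a run of delimiters
            while (pos < s.size() && s[pos] == delim)
                ++pos;
        }
    }
    return fields;
}
```

Note that even with collapse set, a leading or trailing delimiter still produces one empty field in this sketch, which is similar to the '' entry in the custom delimiter output of the demo.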

Versions required/tested

This was compiled using

  • gcc 4.4.5,
  • gcc 4.5.1 and
  • gcc 4.6.1.

It works (tested) against

  • boost 1.42.0 (possibly earlier versions too) all the way through
  • boost 1.47.0.

Note: The flattening of output containers only seems to work for Spirit V2.5 (boost 1.47.0).
(this might be something as simple as needing an extra include for older versions?)

The Code!

//#define BOOST_SPIRIT_DEBUG
#define BOOST_SPIRIT_DEBUG_PRINT_SOME 80

// YAGNI #4 - support boost ranges in addition to containers as input (e.g. char[])
#define SUPPORT_BOOST_RANGE // our own define for splitInto
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp> // for pre 1.47.0 boost only
#include <boost/spirit/version.hpp>
#include <iostream>
#include <list>
#include <set>
#include <sstream>
#include <string>
#include <vector>

namespace /*anon*/
{
    namespace phx=boost::phoenix;
    namespace qi =boost::spirit::qi;
    namespace karma=boost::spirit::karma;

    template <typename Iterator, typename Output> 
        struct my_grammar : qi::grammar<Iterator, Output()>
    {
        typedef qi::rule<Iterator> delim_t;

        //my_grammar(delim_t const& _delim) : delim(_delim),
        // base class listed first to match the actual initialization order
        my_grammar(delim_t _delim) : my_grammar::base_type(rule, "quoted_delimited"),
            delim(_delim)
        {
            using namespace qi;

            noquote = char_ - '"';
            plain   = +((!delim) >> (noquote - eol));
            quoted  = lit('"') > *(noquote | '"' >> char_('"')) > '"';

#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
            mixed   = *(quoted|plain);
#else
            // manual folding
            mixed   = *( (quoted|plain) [_a << _1]) [_val=_a.str()];
#endif

            // you gotta love simple truths:
            rule    = mixed % delim % eol;

            BOOST_SPIRIT_DEBUG_NODE(rule);
            BOOST_SPIRIT_DEBUG_NODE(plain);
            BOOST_SPIRIT_DEBUG_NODE(quoted);
            BOOST_SPIRIT_DEBUG_NODE(noquote);
            BOOST_SPIRIT_DEBUG_NODE(delim);
        }

      private:
        qi::rule<Iterator>                  delim;
        qi::rule<Iterator, char()>          noquote;
#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
        qi::rule<Iterator, std::string()>   plain, quoted, mixed;
#else
        qi::rule<Iterator, std::string()>   plain, quoted;
        qi::rule<Iterator, std::string(), qi::locals<std::ostringstream> > mixed;
#endif
        qi::rule<Iterator, Output()> rule;
    };
}

template <typename Input, typename Container, typename Delim>
    bool splitInto(const Input& input, Container& result, Delim delim)
{
#ifdef SUPPORT_BOOST_RANGE
    typedef typename boost::range_const_iterator<Input>::type It;
    It first(boost::begin(input)), last(boost::end(input));
#else
    typedef typename Input::const_iterator It;
    It first(input.begin()), last(input.end());
#endif

    try
    {
        my_grammar<It, Container> parser(delim);

        bool r = qi::parse(first, last, parser, result);

        r = r && (first == last);

        if (!r)
            std::cerr << "parsing failed at: \"" << std::string(first, last) << "\"\n";
        return r;
    }
    catch (const qi::expectation_failure<It>& e)
    {
        std::cerr << "FIXME: expected " << e.what_ << ", got '";
        std::cerr << std::string(e.first, e.last) << "'" << std::endl;
        return false;
    }
}

template <typename Input, typename Container>
    bool splitInto(const Input& input, Container& result)
{
    return splitInto(input, result, ' '); // default space delimited
}


/********************************************************************
 * replaces '\n' character by '?' so that the demo output is more   *
 * comprehensible (see when a \n was parsed and when one was output *
 * deliberately)                                                    *
 ********************************************************************/
void safechars(char& ch)
{
    switch (ch) { case '\r': case '\n': ch = '?'; break; }
}

int main()
{
    using namespace karma; // demo output generators only :)
    std::string input;

#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
    // sample invocation: simple vector of elements in order - flattened across lines
    std::vector<std::string> flattened;

    input = "actually on\ntwo lines";
    if (splitInto(input, flattened))
        std::cout << format(*char_[safechars] % '|', flattened) << std::endl;
#endif
    std::list<std::set<std::string> > linewise, custom;

    // YAGNI #1 - now supports partially quoted columns
    input = "partially q\"oute\"d columns";
    if (splitInto(input, linewise))
        std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', linewise) << std::endl;

    // YAGNI #2 - now supports custom delimiter expressions
    input="custom delimiters: 1997-03-14 10:13am"; 
    if (splitInto(input, custom, +qi::char_("- 0-9:"))
     && splitInto(input, custom, +(qi::char_ - qi::char_("0-9"))))
        std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', custom) << std::endl;

    // YAGNI #3 - now supports quotes ("") inside quoted values (instead of just making them disappear)
    input = "would like ne\"\"sted \"quotes like \"\"\n\"\" that\"";
    custom.clear();
    if (splitInto(input, custom, qi::char_("() ")))
        std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', custom) << std::endl;

    return 0;
}

The Output

Output from the sample as shown:

actually|on|two|lines
set['columns', 'partially', 'qouted']
set['am', 'custom', 'delimiters']
set['', '03', '10', '13', '14', '1997']
set['like', 'nested', 'quotes like "?" that', 'would']

Update: Output for your previously failing test case:

--server=127.0.0.1:4774/|--username=robota|--userdescr=robot A ? I am cool robot ||--robot|>|echo.txt

[1] I must admit I had a good laugh when reading that 'it crashed' [sic]. That sounds a lot like my end-users. Just to be precise: a crash is an unrecoverable application failure. What you ran into was a handled error, and was nothing more than 'unexpected behavior' from your point of view. Anyways, that's fixed now :)

