Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
140 views
in Technique[技术] by (71.8m points)

c++ - Given a regular expression, how would I generate all strings that match it?

I'm using a simple language of only (), |, spaces, and alpha characters.
Given a regular expression like the following:

(hello|goodbye) (world(s|)|)

How would I go about generating the following data?

hello worlds
hello world
hello 
goodbye worlds
goodbye world
goodbye

I'm not quite sure if I need to build a tree first, or if it can be done recursively. I'm stuck on what data structures to utilize, and how to generate the strings as I go. Will I have to keep a bunch of markers, and index back into partially built strings to concatenate more data on? I don't know how best to approach this problem. Would I need to read the whole expression first, and re-order it a certain way?

The function signature is going to look the following way:

std::vector<std::string> Generate(std::string const&){
   //...
}

What do you suggest I do?

EDIT:
Let me clarify that the results should always be finite here. In my particular example, there are only 6 strings that would ever be true for the expression. I'm not sure if my terminology is correct here, but what I'm looking for, is a perfect match of the expression- not any string that contains a substring which matches.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Somewhat following Kieveli's advice, I have come up with a working solution. Although not previously mentioned, it was important for me to also get a count of how many results could potentially be generated. I was using a python script called "exrex" which I had found on github. Embarrassingly, I did not realize that it had the capability to also count. Nonetheless, I implemented it the best I could in C++ using my simplified regular expression language. If interested in my solution, please read on.

From an object oriented stand point, I wrote a scanner to take the regular expression(string), and convert it into a list of tokens(vector of strings). The list of tokens was then sent to a parser which generated an n-ary tree. All of this was packed inside an "expression generator" class that could take an expression and hold the parse tree, as well as the generated count.
object overview
The scanner was important because it tokenized the empty string case which you can see in my question appearing as "|)". Scanning also created a pattern of [word] [operation] [word] [operation] ... [word].
For example, scanning: "(hello|goodbye) (world(s|)|)"
will create: [][(][hello][|][goodbye][)][ ][(][world][(][s][|][][)][][|][][)][]

The parse tree was a vector of nodes. Nodes contain a vector of vector of nodes. parse structure
The orange cells represent the "or"s, and the other boxes that draw the connections, represent the "and"s. Below is my code.

Node header

#pragma once
#include <string>
#include <vector>

class Function_Expression_Node{

public:
    Function_Expression_Node(std::string const& value_in = "", bool const& more_in = false);

    std::string value;
    bool more;
    std::vector<std::vector<Function_Expression_Node>> children;

};

Node source

#include "function_expression_node.hpp"

    Function_Expression_Node::Function_Expression_Node(std::string const& value_in, bool const& more_in)
    : value(value_in)
    , more(more_in)
    {}

Scanner header

#pragma once
#include <vector>
#include <string>

class Function_Expression_Scanner{

    public: Function_Expression_Scanner() = delete;
    public: static std::vector<std::string> Scan(std::string const& expression);

};

Scanner source

#include "function_expression_scanner.hpp"

std::vector<std::string> Function_Expression_Scanner::Scan(std::string const& expression){

    std::vector<std::string> tokens;
    std::string temp;

    for (auto const& it: expression){

        if (it == '('){
            tokens.push_back(temp);
            tokens.push_back("(");
            temp.clear();
        }

        else if (it == '|'){
            tokens.push_back(temp);
            tokens.push_back("|");
            temp.clear();
        }

        else if (it == ')'){
            tokens.push_back(temp);
            tokens.push_back(")");
            temp.clear();
        }

        else if (isalpha(it) || it == ' '){
            temp+=it;
        }

    }

    tokens.push_back(temp);

    return tokens;
    }

Parser header

#pragma once
#include <string>
#include <vector>
#include "function_expression_node.hpp"

class Function_Expression_Parser{

    Function_Expression_Parser() = delete;

//get parse tree
public: static std::vector<std::vector<Function_Expression_Node>> Parse(std::vector<std::string> const& tokens, unsigned int & amount);
    private: static std::vector<std::vector<Function_Expression_Node>> Build_Parse_Tree(std::vector<std::string>::const_iterator & it, std::vector<std::string>::const_iterator const& end, unsigned int & amount);
        private: static Function_Expression_Node Recursive_Build(std::vector<std::string>::const_iterator & it, int & total); //<- recursive

    //utility
    private: static bool Is_Word(std::string const& it);
};

Parser source

#include "function_expression_parser.hpp"

bool Function_Expression_Parser::Is_Word(std::string const& it){
        return (it != "(" && it != "|" && it != ")");
    }
Function_Expression_Node Function_Expression_Parser::Recursive_Build(std::vector<std::string>::const_iterator & it, int & total){

    Function_Expression_Node sub_root("",true); //<- contains the full root
    std::vector<Function_Expression_Node> root;

    const auto begin = it;

    //calculate the amount
    std::vector<std::vector<int>> multiplies;
    std::vector<int> adds;
    int sub_amount = 1;

    while(*it != ")"){

        //when we see a "WORD", add it.
        if(Is_Word(*it)){
            root.push_back(Function_Expression_Node(*it));
        }

        //when we see a "(", build the subtree,
        else if (*it == "("){
            ++it;
            root.push_back(Recursive_Build(it,sub_amount));

            //adds.push_back(sub_amount);
            //sub_amount = 1;
        }

        //else we see an "OR" and we do the split
        else{
            sub_root.children.push_back(root);
            root.clear();

            //store the sub amount
            adds.push_back(sub_amount);
            sub_amount = 1;
        }

        ++it;
    }

    //add the last bit, if there is any
    if (!root.empty()){
        sub_root.children.push_back(root);

        //store the sub amount
        adds.push_back(sub_amount);
    }
    if (!adds.empty()){
        multiplies.push_back(adds);
    }


    //calculate sub total
    int or_count = 0;
    for (auto const& it: multiplies){
        for (auto const& it2: it){
            or_count+=it2;
        }

        if (or_count > 0){
            total*=or_count;
        }
        or_count = 0;
    }

    /*
    std::cout << "---SUB FUNCTION---
";
    for (auto it: multiplies){for (auto it2: it){std::cout << "{" << it2 << "} ";}std::cout << "
";}std::cout << "--
";
    std::cout << total << std::endl << '
';
    */

    return sub_root;
}
std::vector<std::vector<Function_Expression_Node>> Function_Expression_Parser::Build_Parse_Tree(std::vector<std::string>::const_iterator & it, std::vector<std::string>::const_iterator const& end, unsigned int & amount){

    std::vector<std::vector<Function_Expression_Node>> full_root;
    std::vector<Function_Expression_Node> root;

    const auto begin = it;

    //calculate the amount
    std::vector<int> adds;
    int sub_amount = 1;
    int total = 0;

    while (it != end){

        //when we see a "WORD", add it.
        if(Is_Word(*it)){
            root.push_back(Function_Expression_Node(*it));
        }

        //when we see a "(", build the subtree,
        else if (*it == "("){
            ++it;
            root.push_back(Recursive_Build(it,sub_amount));

        }

        //else we see an "OR" and we do the split
        else{
            full_root.push_back(root);
            root.clear();

            //store the sub amount
            adds.push_back(sub_amount);
            sub_amount = 1;
        }

        ++it;
    }

    //add the last bit, if there is any
    if (!root.empty()){
        full_root.push_back(root);

        //store the sub amount
        adds.push_back(sub_amount);
        sub_amount = 1;
    }

    //calculate sub total
    for (auto const& it: adds){ total+=it; }

    /*
    std::cout << "---ROOT FUNCTION---
";
    for (auto it: adds){std::cout << "[" << it << "] ";}std::cout << '
';
    std::cout << total << std::endl << '
';
    */
    amount = total;

    return full_root;
}
std::vector<std::vector<Function_Expression_Node>> Function_Expression_Parser::Parse(std::vector<std::string> const& tokens, unsigned int & amount){

    auto it = tokens.cbegin();
    auto end = tokens.cend();
    auto parse_tree = Build_Parse_Tree(it,end,amount);
    return parse_tree;
}

Generator header

#pragma once
#include "function_expression_node.hpp"

class Function_Expression_Generator{

    //constructors
    public: Function_Expression_Generator(std::string const& expression);
    public: Function_Expression_Generator();

    //transformer
    void Set_New_Expression(std::string const& expression);

    //observers
    public: unsigned int Get_Count();
    //public: unsigned int Get_One_Word_Name_Count();
    public: std::vector<std::string> Get_Generations();
        private: std::vector<std::string> Generate(std::vector<std::vector<Function_Expression_Node>> const& parse_tree);
            private: std::vector<std::string> Sub_Generate(std::vector<Function_Expression_Node> const& nodes);

private:
    std::vector<std::vector<Function_Expression_Node>> m_parse_tree;
    unsigned int amount;

};

Generator source

#include "function_expression_generator.hpp"

#include "function_expression_scanner.hpp"
#include "function_expression_parser.hpp"

//constructors
Function_Expression_Generator::Function_Expression_Generator(std::string const& expression){
    auto tokens = Function_Expression_Scanner::Scan(expression);
    m_parse_tree = Function_Expression_Parser::Parse(tokens,amount);
}
Function_Expression_Generator::Function_Expression_Generator(){}

//transformer
void Function_Expression_Generator::Set_New_Expression(std::string const& expression){
    auto tokens = Function_Expression_Scanner::Scan(expression);
    m_parse_tree = Function_Expression_Parser::Parse(tokens,amount);
}

//observers
unsigned int Function_Expression_Generator::Get_Count(){
    return amount;
}
std::vector<std::string> Function_Expression_Generator::Get_Generations(){
    return Generate(m_parse_tree);
}
std::vector<std::string> Function_Expression_Generator::Generate(std::vector<std::vector<Function_Expression_Node>> const& parse_tree){
    std::vector<std::string> results;
    std::vector<std::string> more;

    for (auto it: parse_tree){
        more = Sub_Generate(it);
        results.insert(results.end(), more.begin(), more.end());
    }

    return results;
}
std::vector<std::string> Function_Expression_Generator::Sub_Generate(std::vector<Function_Expression_Node> const& nodes){
    std::vector<std::string> results;
    std::vector<std::string> more;
    std::vector<std::string> new_results;

    results.push_back("");
    for (auto it: nodes){
        if (!it.more){
            for (auto & result: results){
                result+=it.value;
            }
        }
        else{
            more = Generate(it.children);
            for (auto m: more){
                for (auto r: results){
                    new_results.push_back(r+m);
                }
            }
            m

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...