There is both a simple solution and a more advanced solution (added after edit) to handle more complex functions.
To achieve the example you posted, I suggest doing this in two steps. The first step is to extract the parameters as a single string (the regexes are explained at the end):
[^()]+\((.*)\)$
Now, to parse the parameters.
Simple solution
Extract the parameters using:
([^,]+\(.+?\))|([^,]+)
Here are some C# code examples (all asserts pass):
// Literal brackets must be escaped; unescaped, they would only create capture groups.
string extractFuncRegex = @"[^()]+\((.*)\)$";
string extractArgsRegex = @"([^,]+\(.+?\))|([^,]+)";
//Your test string
string test = @"func1(2 * 7, func2(3, 5))";
var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );
Explanation of regexes. The arguments extraction as a single string:
[^()]+\((.*)\)$
where:
[^()]+
characters that are neither an opening nor a closing bracket.
\((.*)\)
everything inside the outermost brackets, captured in group 1
The args extraction:
([^,]+\(.+?\))|([^,]+)
where:
([^,]+\(.+?\))
characters that are not commas, followed by characters in brackets. This picks up the func arguments. Note the +? so that the match is lazy and stops at the first ) it meets.
|([^,]+)
If the previous alternative does not match, then match consecutive characters that are not commas. These matches go into groups.
More advanced solution
Now, there are some obvious limitations with that approach. For example, it stops at the first closing bracket, so it doesn't handle nested functions very well. For a more comprehensive solution (if you require it), we need to use balancing group definitions (as I mentioned before this edit). For our purposes, balancing group definitions let us keep track of the open brackets and subtract the closing brackets as they are found. In essence, opening and closing brackets cancel each other out in the balancing part of the search, so the match continues until the brackets balance and the final closing bracket is found.
So, the regex to extract the parameters is now (the function extraction can stay the same):
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+
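To make the limitation concrete, here is a minimal sketch that runs the simple argument regex (with its literal brackets escaped) over a nested call. The class name SimpleSplitDemo and the input are made up for illustration only.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class SimpleSplitDemo
{
    // The simple argument regex: lazy match up to the first ')'.
    private const string ExtractArgsRegex = @"([^,]+\(.+?\))|([^,]+)";

    // Returns every match value so the fragments can be inspected.
    public static string[] Split(string innerArgs) =>
        Regex.Matches(innerArgs, ExtractArgsRegex)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();
}
```

Calling SimpleSplitDemo.Split("f(g(1),2)") should return the two fragments "f(g(1)" and "2)" instead of the single argument "f(g(1),2)", because the lazy match ends at the first closing bracket.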
Here are some test cases to show it in action:
// Literal brackets are escaped; (?<open>) and (?<-open>) are the balancing groups.
string extractFuncRegex = @"[^()]+\((.*)\)$";
string extractArgsRegex = @"(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+";
//Your test string
string test = @"func1(2 * 7, func2(3, 5))";
var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );
//A more advanced test string
test = @"someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)";
match = Regex.Match( test, extractFuncRegex );
innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2" );
matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "a" );
Assert.AreEqual( matches[1].Value.Trim(), "b" );
Assert.AreEqual( matches[2].Value.Trim(), "func1(a,b+c)" );
Assert.AreEqual( matches[3].Value.Trim(), "func2(a*b,func3(a+b,c))" );
Assert.AreEqual( matches[4].Value.Trim(), "func4(e)+func5(f)" );
Assert.AreEqual( matches[5].Value.Trim(), "func6(func7(g,h)+func8(i,(a)=>a+2))" );
Assert.AreEqual( matches[6].Value.Trim(), "g+2" );
Note in particular that the second test string is quite complex:
someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)
So, looking at the regex again:
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+
In summary, it starts out with characters that are not commas or brackets. Then, if there are brackets in the argument, it matches and subtracts the brackets until they balance. It then tries to repeat that match in case there are other functions in the argument. Finally, it goes on to the next argument (after the comma). In detail:
[^,()]+
matches anything that is not ',()'
?:
means non-capturing group, i.e. do not store matches within brackets in a group.
\(
means start at an opening bracket.
?>
means atomic grouping. Essentially, the engine does not remember backtracking positions inside the group. This also helps performance because there are fewer step-backs to try different combinations.
[^()]+|
means anything but an opening or closing bracket. This is followed by | (or)
\((?<open>)|
This is the good stuff and says match '(' or
\)(?<-open>)
This is the better stuff that says match a ')' and balance out the '('. This means that this part of the match (everything after the first bracket) will continue until all the internal brackets match. Without the balancing expressions, the match would finish on the first closing bracket. The crux is that the engine does not match this ')' against the final ')', instead it is subtracted from the matching '('. When there are no further outstanding '(', the -open fails so the final ')' can be matched.
The rest of the regex contains the closing parentheses for the groups and the repetitions (*, * and +), which respectively: repeat the inner bracket match 0 or more times, repeat the full bracket search 0 or more times (0 allows arguments without brackets), and repeat the full match 1 or more times (allows foo(1)+foo(2)).
One final embellishment:
If you add (?(open)(?!))
to the regex:
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*(?(open)(?!))\)))*)+
The (?!) will always fail if open has captured something (that hasn't been subtracted), i.e. it will always fail if there is an opening bracket without a closing bracket. This is a useful way to test whether the balancing has failed.
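As a rough illustration of that check, the sketch below compares the argument regex with and without (?(open)(?!)) on the same input. The class name BalanceCheckDemo and the sample input are assumptions for the example, not part of the original question.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class BalanceCheckDemo
{
    // Argument regex without the final balance check.
    private const string WithoutCheck =
        @"(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+";

    // Same regex, but (?(open)(?!)) forces a failure while any '(' is still outstanding.
    private const string WithCheck =
        @"(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*(?(open)(?!))\)))*)+";

    // Returns the text of the first match for the chosen variant.
    public static string FirstMatch(string input, bool check) =>
        Regex.Match(input, check ? WithCheck : WithoutCheck).Value;
}
```

For the unbalanced input "f((a)", which has one more '(' than ')', the variant without the check should accept the whole string, while the variant with the check should only match "f". For a balanced input such as "f(g(1),2)", both variants should return the full argument.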
Some notes:
- \b will not match when the last character is a ')', because ')' is not a word character and \b tests for word boundaries, so your regex would not match.
- While regex is powerful, unless you are a guru among gurus it is best to keep the expressions simple, because otherwise they are hard to maintain and hard for other people to understand. That is why it is sometimes best to break the problem into subproblems with simpler expressions and let the language do the non-search/match operations that it is good at. So, you may want to mix simple regexes with more complex code or vice versa, depending on where you are comfortable.
- This will match some very complex functions, but it is not a lexical analyzer for functions.
- If the arguments can contain strings, and the strings themselves can contain brackets, e.g. "go(...", then you will need to modify the regex to take strings out of the comparison. The same goes for comments.
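As one way of letting the language do part of the work, here is a rough sketch of a plain-code splitter that tracks bracket depth and skips over double-quoted strings. This is a swapped-in technique, not the regex approach above. The class name ArgumentSplitter is hypothetical, and the sketch assumes backslash-escaped quotes inside strings and no comments in the input.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class ArgumentSplitter
{
    // Splits on commas at bracket depth 0; text inside "..." is copied verbatim.
    public static List<string> Split(string innerArgs)
    {
        var args = new List<string>();
        var current = new StringBuilder();
        int depth = 0;
        bool inString = false;

        for (int i = 0; i < innerArgs.Length; i++)
        {
            char c = innerArgs[i];
            if (inString)
            {
                current.Append(c);
                if (c == '\\' && i + 1 < innerArgs.Length)
                    current.Append(innerArgs[++i]); // keep the escaped character, e.g. \"
                else if (c == '"')
                    inString = false;               // closing quote
                continue;
            }

            switch (c)
            {
                case '"': inString = true; current.Append(c); break;
                case '(': depth++; current.Append(c); break;
                case ')': depth--; current.Append(c); break;
                case ',' when depth == 0:
                    // Top-level comma: the current argument is complete.
                    args.Add(current.ToString().Trim());
                    current.Clear();
                    break;
                default: current.Append(c); break;
            }
        }

        if (current.Length > 0)
            args.Add(current.ToString().Trim());
        return args;
    }
}
```

With this sketch, ArgumentSplitter.Split("a, go(\"x(\", 1), b") should yield three arguments: a, go("x(", 1) and b. The unbalanced bracket inside the string literal does not affect the depth count.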
Hope that helps.