I have N strings.
Also, there are K regular expressions, unknown to me. Each string is either matching one of the regular expressions, or it is garbage. There are total of L garbage strings in the set. Both K and L are unknown.
I'd like to deduce the regular expressions. Obviously, this problem has infinite number of solutions. I need to find a "reasonably good solution", which
1) minimizes K
2) minimizes L
3) maximizes "specifics" of the regular expressions. I don't know what't the right term for this quality. For example, the string "ab123" can be described as /abd+/ or /w+.+/, but the first regex is more "specific".
All 3 requirements need to be taken as one compound criteria, with certain reasonable weights.
A solution for one particular case: If L = 0 and K = 1 (just one regex, and no garbage), then we can just find LCS (longest common subsequence) for the strings and come up with a corresponding regex from there. However, when we have "noise" (L > 0), this approach doesn't work.
Any ideas (or pointers to existing work) are greatly appreciated.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…