I'm working on analyzing a large public dataset with lots of verbose human-readable strings that were clearly generated by some regular (in the formal language theory sense) grammar.
It's not too hard to look at these strings in small batches and spot the patterns; unfortunately, there are about 24,000 unique strings broken up into 33 categories and 1,714 subcategories, so doing this manually is fairly painful.
Basically, I'm looking for an existing algorithm (preferably with an existing reference implementation) that takes an arbitrary list of strings and tries to infer some minimal (for some reasonable definition of minimal) spanning set of regular expressions that can generate them; in other words, it should infer a regular grammar from a finite set of strings drawn from the language that grammar generates.
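For concreteness, here's a toy illustration of the behavior I'm after. The sample strings are invented, and the pattern is written by hand rather than inferred; the point is just what a good answer should produce for a group like this:

```python
import re

# Invented sample group: strings that look like they were generated by
# a single rule of some underlying regular grammar.
samples = [
    "ERROR 042 in module auth",
    "ERROR 317 in module billing",
    "ERROR 007 in module auth",
]

# The kind of output I'd want the inference algorithm to produce for
# this group (hand-written here): varying fields are generalized
# instead of enumerated verbatim.
pattern = re.compile(r"ERROR \d{3} in module \w+")

assert all(pattern.fullmatch(s) for s in samples)
```

A small set of such regexes covering all 24,000 strings is what I mean by a "minimal spanning set".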
I've considered doing repeated greedy longest-common-substring elimination, but that only goes so far, because it collapses nothing but exact matches; it won't detect, say, a common pattern of varying numerical strings at a particular position in the grammar (see the sketch below).
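A minimal sketch of that idea, just to show the limitation (the naive brute-force search below is far too slow for 24,000 strings, but it's enough to illustrate):

```python
def longest_common_substring(strings):
    """Longest substring shared by every string (naive brute-force search)."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

samples = ["user-1234-active", "user-98-active", "user-7-active"]
print(longest_common_substring(samples))      # -> "-active"

# Factor that out and repeat on the remainders:
remainders = [s.replace("-active", "") for s in samples]
print(longest_common_substring(remainders))   # -> "user-"

# What's left is ["1234", "98", "7"]: exact-match elimination stops
# here and never generalizes the varying digits to something like \d+.
```

This finds the shared literal scaffolding just fine; the problem is the varying fields in between, which is exactly where I'd want \d+-style generalization.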
Brute forcing anything that doesn't fall out of common substring elimination is possible, but probably computationally infeasible. Furthermore, I suspect there's a "phase ordering" and/or "local minimum" issue with substring elimination: a greedy substring match that looks like the best reduction initially might force the final grammar to be less compressed/minimal, for example by collapsing a literal that spans what is really a boundary between two varying fields.