In Python, there is a Sniffer class in the csv module that can be used to guess a given file's delimiter and quote characters. Its strategy is (quoted from csv.py's docstrings):
[First, look] for text enclosed between two identical quotes
(the probable quotechar) which are preceded and followed
by the same character (the probable delimiter).
For example:
,'some text',
The quote with the most wins, same with the delimiter.
If there is no quotechar the delimiter can't be determined
this way.
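A rough sketch of that first heuristic, to make it concrete. This is not the code from csv.py (the real regular expressions cover more cases, such as quoted fields at the start or end of a line); the function name guess_quote_and_delimiter and the simplified pattern are my own assumptions for illustration:

```python
import re
from collections import Counter

def guess_quote_and_delimiter(sample):
    """Simplified illustration: find quoted text that has the same
    character directly before and after it, then let the most frequent
    quote character and the most frequent surrounding character win."""
    pattern = re.compile(
        r"""(?P<delim>[^\w\n"'])(?P<quote>["']).*?(?P=quote)(?P=delim)"""
    )
    quotes = Counter()
    delims = Counter()
    for match in pattern.finditer(sample):
        quotes[match.group("quote")] += 1
        delims[match.group("delim")] += 1
    if not quotes:
        # No quoted field found: this heuristic cannot decide.
        return None, None
    return quotes.most_common(1)[0][0], delims.most_common(1)[0][0]

quote, delim = guess_quote_and_delimiter("a,'some text',b\nc,'more text',d\n")
```

When the sample contains no quoted fields, this sketch returns (None, None) and the delimiter has to be found some other way.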
In that case, try the following:
The delimiter should occur the same number of times on
each row. However, due to malformed data, it may not. We don't want
an all or nothing approach, so we allow for small variations in this
number.
- build a table of the frequency of
each character on every line.
- build a table of frequencies of this
frequency (meta-frequency?), e.g.
'x occurred 5 times in 10 rows, 6
times in 1000 rows, 7 times in 2
rows'
- use the mode of the meta-frequency
to determine the expected
frequency for that character
- find out how often the character
actually meets that goal
- the character that best meets its
goal is the delimiter
For performance reasons, the data is evaluated in chunks, so it can
try and evaluate the smallest portion of the data possible, evaluating
additional chunks as necessary.
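The frequency idea can be sketched roughly like this. Again, this is only an illustration under my own assumptions, not the csv.py implementation: it looks at a fixed set of candidate characters (the candidates parameter is hypothetical), ignores the chunking, and scores each character by the fraction of lines that hit the most common per-line count:

```python
from collections import Counter

def guess_delimiter_by_frequency(lines, candidates=",;\t|:"):
    """Return (character, score) for the candidate whose per-line count
    is most consistent, or (None, 0.0) if nothing qualifies."""
    if not lines:
        return None, 0.0
    best_char, best_score = None, 0.0
    for ch in candidates:
        # Frequency of this character on every line.
        per_line = [line.count(ch) for line in lines]
        # Meta-frequency: how many lines share each per-line count.
        meta = Counter(per_line)
        # The mode of the meta-frequency is the expected count.
        expected, hits = max(meta.items(), key=lambda kv: kv[1])
        if expected == 0:
            continue  # the character is mostly absent
        # How often the character actually meets that expectation.
        score = hits / len(lines)
        if score > best_score:
            best_char, best_score = ch, score
    return best_char, best_score

# The third row is malformed, but the comma still meets its goal most often.
result = guess_delimiter_by_frequency(["a,b,c", "d,e,f", "g,h", "i,j,k"])
```

Because the score is a fraction rather than an all-or-nothing test, a few malformed rows lower it without ruling the character out.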
I'm not going to quote the source code here - it's in the Lib directory of every Python installation.
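Using the class itself is short. A minimal example, where the sample data is made up for illustration:

```python
import csv
import io

sample = "name,city,age\nAnna,Berlin,31\nBen,Hamburg,45\n"

# sniff() returns a Dialect subclass describing the guessed format.
dialect = csv.Sniffer().sniff(sample)

# The dialect can be passed straight to csv.reader.
rows = list(csv.reader(io.StringIO(sample), dialect))
```

With a real file you would typically read the first few kilobytes as the sample, call sniff() on that, then seek back to the start before reading. sniff() raises csv.Error if it cannot work out a delimiter.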
Remember that CSV files can also use semicolons instead of commas as delimiters (e.g. German versions of Excel write semicolon-delimited CSVs, because the comma is used as the decimal separator in Germany).
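If you know which delimiters are plausible, you can pass them to sniff() through its optional delimiters argument so that it only considers those. A small example with made-up data in the German style (semicolons between fields, decimal commas inside them):

```python
import csv
import io

sample = "Artikel;Preis;Menge\nApfel;1,25;3\nBirne;0,99;10\n"

# Only consider semicolon, comma and tab as possible delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")

rows = list(csv.reader(io.StringIO(sample), dialect))

# The decimal commas still have to be converted by hand.
prices = [float(row[1].replace(",", ".")) for row in rows[1:]]
```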