Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
190 views
in Technique[技术] by (71.8m points)

c# - How should I detect which delimiter is used in a text file?

I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use?

One way would be to read in every line and count both tabs and commas and find out which is most consistently used in every line. Of course, the data could include commas or tabs, so that may be easier said than done.

Edit: Another fun aspect of this project is that I will also need to detect the schema of the file when I read it in because it could be one of many. This means that I won't know how many fields I have until I can parse it.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In Python, there is a Sniffer class in the csv module that can be used to guess a given file's delimiter and quote characters. Its strategy is (quoted from csv.py's docstrings):


[First, look] for text enclosed between two identical quotes (the probable quotechar) which are preceded and followed by the same character (the probable delimiter). For example:

         ,'some text',

The quote with the most wins, same with the delimiter. If there is no quotechar the delimiter can't be determined this way.

In that case, try the following:

The delimiter should occur the same number of times on each row. However, due to malformed data, it may not. We don't want an all or nothing approach, so we allow for small variations in this number.

  1. build a table of the frequency of each character on every line.
  2. build a table of freqencies of this frequency (meta-frequency?), e.g. 'x occurred 5 times in 10 rows, 6 times in 1000 rows, 7 times in 2 rows'
  3. use the mode of the meta-frequency to determine the expected frequency for that character
  4. find out how often the character actually meets that goal
  5. the character that best meets its goal is the delimiter

For performance reasons, the data is evaluated in chunks, so it can try and evaluate the smallest portion of the data possible, evaluating additional chunks as necessary.


I'm not going to quote the source code here - it's in the Lib directory of every Python installation.

Remember that CSV can also use semicolons instead of commas as delimiters (e. g. in German versions of Excel, CSVs are semicolon-delimited because commas are used as decimal separators in Germany...)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...