Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
66 views
in Technique[技术] by (71.8m points)

python - Considering name with space (country name) as multiple columns when reading txt file

I am trying to load the text file into Python, but due to the spaces in the string with multiple words, it is considering each word as a separate column. What's wrong, and how can I fix this?

Data:

Name 2000–12 2012–13 2013–14 2012 2012 2012 2012 2012 2012 2012
Costa Rica 4.7 3.4 4.3 15.9 15.1 –5.3 –3.5 .. 4.5 49.4
C?te d’Ivoire 1.2 8.7 8.2 .. .. 2.0 –3.1 .. 1.3 39.0
Croatia 2.1 .. .. 18.9 9.3 –0.3 –4.7 .. 3.4 80.7
Cuba 5.8 .. .. .. .. .. .. .. .. ..
Cura?ao .. .. .. .. .. .. .. .. .. ..
Cyprusb 2.6c .. .. 8.8c 3.9c –6.9 –6.3 113.3 2.4 ..
Czech Republic 3.3 .. .. 21.0 5.1 –2.4 –4.4 38.3 3.3 77.3
Denmark 0.6 .. .. 23.6 15.7 5.9 –2.0 50.6 2.4 74.6
Djibouti 3.5 .. .. .. .. .. .. .. 3.7 ..
Dominica 3.2 1.1 1.7 10.8 .. –11.5 –11.9 .. 1.4 97.4
Dominican Republic 5.6 2.5 3.9 9.2 .. –6.8 –2.9 .. 3.7 34.3
Ecuador 4.4 4.0 4.1 26.9 6.1 –0.2 .. .. 5.1 31.6
Egypt, Arab Rep. 4.9 1.8 2.3 13.0 0.0 –2.7 –10.6 .. 7.1 74.1

What I have tried so far:

q3=pd.read_csv("E:DRISTIAQuestion2.txt",skiprows=2,encoding='unicode_escape',header=0,engine='python',skipinitialspace=True,delim_whitespace=True)
print(q3)

               Name 2000–12 2012–13 2013–14  2012 2012.1 2012.2 2012.3 2012.4  
Costa          Rica     4.7     3.4     4.3  15.9   15.1   –5.3   –3.5     ..   
C?te       d’Ivoire     1.2     8.7     8.2    ..     ..    2.0   –3.1     ..   
Croatia         2.1      ..      ..    18.9   9.3   –0.3   –4.7     ..    3.4   
Cuba            5.8      ..      ..      ..    ..     ..     ..     ..     ..   
Cura?ao          ..      ..      ..      ..    ..     ..     ..     ..     ..   
Cyprusb        2.6c      ..      ..    8.8c  3.9c   –6.9   –6.3  113.3    2.4   
Czech      Republic     3.3      ..      ..  21.0    5.1   –2.4   –4.4   38.3   
Denmark         0.6      ..      ..    23.6  15.7    5.9   –2.0   50.6    2.4   
Djibouti        3.5      ..      ..      ..    ..     ..     ..     ..    3.7   
Dominica        3.2     1.1     1.7    10.8    ..  –11.5  –11.9     ..    1.4   
Dominican  Republic     5.6     2.5     3.9   9.2     ..   –6.8   –2.9     ..   

Notice how e.g. the second word in "Costa Rica" and "C?te d'Ivoire" etc are pushed into the first data column. I would like for those to stay in the Name column.

Required output:

Name       2000–12 2012–13 2013–14 2012 2012 2012 2012 2012 2012 2012
Costa Rica 4.7     3.4     4.3     15.9 15.1 –5.3 –3.5 ..   4.5  49.4

i.e. all the strings of Country name should fall under Name column. All the other columns are regular and do not contain any whitespace.

question from:https://stackoverflow.com/questions/65886711/considering-name-with-space-country-name-as-multiple-columns-when-reading-txt

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The Pandas read_csv function accepts a custom delimiter sep which can be a regular expression. Now the task is to articulate a regex which only matches the last n-1 spaces in a line, where n is the number of columns in the file.

q3=pd.read_csv(
    "Question2.txt", engine='python', skiprows=2, encoding='unicode_escape',
    sep=r's+(?!S+(?:s+S+){10})')
print(q3)

The regular expression matches whitespace (s+) but only if it is not followed (?!...) by ten or more whitespace-separated columns.

The sample data you provided doesn't seem to exactly match what your code expects, but with a couple of empty lines added at the top, I get

                  Name 2000a€“12 2012a€“13  ... 2012.4 2012.5 2012.6
0           Costa Rica       4.7       3.4  ...     ..    4.5   49.4
1     C?′te da€?Ivoire       1.2       8.7  ...     ..    1.3   39.0
2              Croatia       2.1        ..  ...     ..    3.4   80.7
3                 Cuba       5.8        ..  ...     ..     ..     ..
4             Cura?§ao        ..        ..  ...     ..     ..     ..
5              Cyprusb      2.6c        ..  ...  113.3    2.4     ..
6       Czech Republic       3.3        ..  ...   38.3    3.3   77.3
7              Denmark       0.6        ..  ...   50.6    2.4   74.6
8             Djibouti       3.5        ..  ...     ..    3.7     ..
9             Dominica       3.2       1.1  ...     ..    1.4   97.4
10  Dominican Republic       5.6       2.5  ...     ..    3.7   34.3
11             Ecuador       4.4       4.0  ...     ..    5.1   31.6
12    Egypt, Arab Rep.       4.9       1.8  ...     ..    7.1   74.1

[13 rows x 11 columns]

(Notice the Unicode mojibake, because of the slightly weird encoding keyword argument.)

Probably as your very first task save the result in a less inane format, probably as proper CSV (comma-delimited, with quoting around any field which contains literal commas; but Pandas to_csv() takes care of all of this for you).

As an aside, hardcoding the file name with a full path is probably going to make your script less useful. Maybe take out the path, like I have done above, and run the script in the directory where you have the input file, or (less usefully, but sometimes more practically) put in a relative path to a subdirectory, and run it from the corresponding parent directory.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...