Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
228 views
in Technique[技术] by (71.8m points)

Parse a csv using awk and ignoring commas inside a field

I have a csv file where each row defines a room in a given building. Along with room, each row has a floor field. What I want to extract is all floors in all buildings.

My file looks like this...

"u_floor","u_room","name"
0,"00BDF","AIRPORT TEST            "
0,0,"BRICKER HALL, JOHN W    "
0,3,"BRICKER HALL, JOHN W    "
0,5,"BRICKER HALL, JOHN W    "
0,6,"BRICKER HALL, JOHN W    "
0,7,"BRICKER HALL, JOHN W    "
0,8,"BRICKER HALL, JOHN W    "
0,9,"BRICKER HALL, JOHN W    "
0,19,"BRICKER HALL, JOHN W    "
0,20,"BRICKER HALL, JOHN W    "
0,21,"BRICKER HALL, JOHN W    "
0,25,"BRICKER HALL, JOHN W    "
0,27,"BRICKER HALL, JOHN W    "
0,29,"BRICKER HALL, JOHN W    "
0,35,"BRICKER HALL, JOHN W    "
0,45,"BRICKER HALL, JOHN W    "
0,59,"BRICKER HALL, JOHN W    "
0,60,"BRICKER HALL, JOHN W    "
0,61,"BRICKER HALL, JOHN W    "
0,63,"BRICKER HALL, JOHN W    "
0,"0006M","BRICKER HALL, JOHN W    "
0,"0008A","BRICKER HALL, JOHN W    "
0,"0008B","BRICKER HALL, JOHN W    "
0,"0008C","BRICKER HALL, JOHN W    "
0,"0008D","BRICKER HALL, JOHN W    "
0,"0008E","BRICKER HALL, JOHN W    "
0,"0008F","BRICKER HALL, JOHN W    "
0,"0008G","BRICKER HALL, JOHN W    "
0,"0008H","BRICKER HALL, JOHN W    "

What I want is all floors in all buildings.

I am using cat, awk, sort and uniq to obtain this list although I am having a problem with the "," in the building name field such as "BRICKER HALL, JOHN W" and it is throwing off my entire csv generation.

cat Buildings.csv | awk -F, '{print $1","$2}' | sort | uniq > Floors.csv 

How can I get awk to use the comma but ignore a comma in between "" of a field? Alternatively, does someone have a better solution?

Based on the answer provided suggesting a awk csv parser I was able to get the solution:

cat Buildings.csv | awk -f csv.awk | awk -F" -> 2|"  '{print $2}' | awk -F"|" '{print $2","$3}' | sort | uniq > floors.csv 

There we want to use the csv awk program and then from there I want to use a " -> 2|" which is formatting based on the csv awk program. The print $2 there prints only the csv parsed contents, this is because the program prints the original line followed by " -> #" where # is the count parsed from csv. (Ie. the columns.) From there I can split this awk csv result on the "|" whcih is what it replaces the comma's with. Then the sort, uniq and pipe out to a file and done!

Thanks for the help.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)
gawk -vFPAT='[^,]*|"[^"]*"' '{print $1 "," $3}' | sort | uniq

This is an awesome GNU Awk 4 extension, where you define a field pattern instead of a field-separator pattern. Does wonders for CSV. (docs)

ETA (thanks mitchus): To remove the surrounding quotes, gsub("^"|"$","",$3); if there's more fields than just $3 to process that way, just loop through them.
Note this simple approach is not tolerant of malformed input, nor of some possible special characters between quotes – covering all of those would go beyond the scope of a neat one-liner.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...