Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


awk : awk script to group by column with condition

I have a tab-delimited file like the following, and I am trying to write an awk script for it:

aaa_log-000592                    2     p      STARTED   7027691  21.7   a1
aaa_log-000592                    28    r      STARTED   7027815  21.7   a2
aaa_log-000592                    33    p      STARTED   7032607  21.7   a3
aaa_log-000592                    33    r      STARTED   7032607  21.7   a4
aaa_log-000592                    43    p      STARTED   7025709  21.7   a5
aaa_log-000592                    43    r      STARTED   7025709  21.7   a6
aaa_log-000595                    2     r      STARTED   7027691  21.7   a7
aaa_log-000598                    28    p      STARTED   7027815  21.7   a8
aaa_log-000599                    13    p      STARTED   7033090  21.7   a9

I am trying to count the values in the 3rd column (p or r), grouped by column 1.

The output would look like this:

Col1              Count-P  Count-R
aaa_log-000592          3        3
aaa_log-000595          0        1
aaa_log-000598          1        0
aaa_log-000599          1        0

I can't find an example that combines an IF condition with a group-by in awk.

Question from: https://stackoverflow.com/questions/65851124/awk-awk-script-to-group-by-column-with-condition

Fight with dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss gazes back…

1 Reply


awk (more specifically, the GNU variant, gawk, version 4.0 or later) has true multi-dimensional arrays that can be indexed by input values, including character strings like those in your example. You can therefore count the values the way you want with:

{
    values[$3] = 1    # this line records the values seen in column three
    counts[$1][$3]++  # and this line counts their frequency per column-one key
}

The first line isn't strictly required, but it simplifies generating the output.

The only remaining part is to have an END clause that outputs the tabulated results.

END {
    # Print column headings
    printf "Col1              "
    for (v in values) {
        printf "  Count-%s", v
    }
    printf "\n"

    # Print tabulated results
    for (i in counts) {
        printf "%-20s", i
        for (v in values) {
            printf "    %d", counts[i][v]
        }
        printf "\n"
    }
}

Building the values array handles the case where the values in column three are not known in advance (for example, when there's an error in your input).
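
To see what the values array ends up holding, here is a minimal sketch that collects only the distinct column-three values. The function name distinct_values and the sample rows (including the stray "x") are made up for illustration; the output is piped through sort because awk does not guarantee any order for "for (v in values)".

```shell
# Hypothetical helper: list the distinct values found in column three.
distinct_values() {
    awk '{ values[$3] = 1 } END { for (v in values) print v }' | sort
}

# Sample input with one unexpected value ("x") in column three.
printf 'aaa_log-000592\t2\tp\naaa_log-000592\t28\tr\naaa_log-000595\t2\tx\n' | distinct_values
```

Every value that appears gets one entry, so an unexpected value would simply show up as an extra Count- column in the tabulated output.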

If you're using a different awk implementation (like the one you might find on macOS, for example), array indexing may be different: arrays are single-dimensional, but can be indexed by a comma-separated list of indices. This adds a little complexity, but the idea is the same.
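
As a rough illustration of that indexing: in POSIX awk, counts[a,b] stores a single string key made by joining a and b with the built-in SUBSEP variable, and split can take the key apart again. The function name count_pairs and the sample rows below are assumptions for this sketch only.

```shell
# Hypothetical helper: count (column 1, column 3) pairs with a single-dimensional array.
count_pairs() {
    awk '
    { counts[$1, $3]++ }                 # key is $1 SUBSEP $3
    END {
        for (key in counts) {
            split(key, part, SUBSEP)     # part[1] = column 1, part[2] = column 3
            print part[1], part[2], counts[key]
        }
    }' | sort
}

printf 'f1\t2\tp\nf1\t3\tp\nf2\t4\tr\n' | count_pairs
```

The script in this answer avoids the split step by keeping the column-one values in a separate files array.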

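{
    files[$1] = 1      # record each distinct value seen in column one
    values[$3] = 1     # record each distinct value seen in column three
    counts[$1,$3]++    # count each (column one, column three) pair
}

END {
    # Print column headings
    printf "Col1              "
    for (v in values) {
        printf "  Count-%s", v
    }
    printf "\n"

    # Print tabulated results
    for (f in files) {
        printf "%-20s", f
        for (v in values) {
            printf "    %d", counts[f,v]
        }
        printf "\n"
    }
}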
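
If column three is known to contain only p and r, the counting can also be written with explicit conditions, which is closer to the "IF with group by" asked about in the question. This is only a sketch under that assumption: the function name count_p_r and the arrays seen, order, p and r are made up here, and rows are printed in first-seen order because "for (x in array)" does not guarantee any order.

```shell
# Hypothetical helper: count p and r per column-one value using explicit conditions.
count_p_r() {
    awk '
    {
        # Remember each column-one value in first-seen order.
        if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
        # The explicit conditions: count p and r separately per group.
        if ($3 == "p") p[$1]++
        else if ($3 == "r") r[$1]++
    }
    END {
        printf "%-20s %8s %8s\n", "Col1", "Count-P", "Count-R"
        for (i = 1; i <= n; i++)
            printf "%-20s %8d %8d\n", order[i], p[order[i]], r[order[i]]
    }'
}

printf 'aaa_log-000592\t2\tp\naaa_log-000592\t28\tr\naaa_log-000595\t2\tr\n' | count_p_r
```

Groups that never see a given letter print 0 for it, since an unset awk array element formats as 0 with %d. Any other value in column three is silently ignored in this version, which is the trade-off against the values-array approach.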
