awk -v n=4 '
function join(start, end,    result, i) {
    # "result" and "i" are local variables
    for (i = start; i <= end; i++)
        result = result $i (i == end ? ORS : FS)
    return result
}
{
    c = 0
    for (i = 1; i < NF; i += n) {
        c++
        col[c] = col[c] join(i, i + n - 1)
    }
}
END {
    for (i = 1; i <= c; i++)
        printf "%s", col[i]    # the value already ends with a newline
}
' file
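As a quick sanity check with made-up data: one row of 8 columns, with n=4, should come out as two rows of 4.

```shell
# Sanity check: one row of 8 columns, chopped into groups of n=4,
# should come out as two rows of 4 columns.
printf 'a b c d e f g h\n' | awk -v n=4 '
function join(start, end,    result, i) {
    for (i = start; i <= end; i++)
        result = result $i (i == end ? ORS : FS)
    return result
}
{
    c = 0
    for (i = 1; i < NF; i += n) {
        c++
        col[c] = col[c] join(i, i + n - 1)
    }
}
END { for (i = 1; i <= c; i++) printf "%s", col[i] }'
# prints:
#   a b c d
#   e f g h
```

With multiple input rows, each row's chunks are appended to the same col[] entries, which is what produces the transposed layout.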
The awk info page has a short primer on awk, so read that too.
Benchmarking
Create an input file with 2**20 = 1,048,576 columns and 8 rows (approximately the 1,000,000 columns specified by the OP):
#!/usr/bin/env perl
use strict;
use warnings;

my $cols = 2**20;                      # 1,048,576
my $rows = 8;
my @alphabet = ( 'a' .. 'z', 0 .. 9 );
my $size = scalar @alphabet;

for my $r ( 1 .. $rows ) {
    for my $c ( 1 .. $cols ) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    print "\n";
}
$ perl createfile.pl > input.file
$ wc input.file
8 8388608 16777224 input.file
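Those wc numbers check out: 8 rows times 1,048,576 single-character columns, each written as a character plus a space (2 bytes), plus one newline per row.

```shell
# 8 rows of 2**20 single-character columns, each written as "X " (2 bytes),
# plus one newline per row.
rows=8; cols=1048576
echo "words: $((rows * cols))"             # 8388608
echo "bytes: $((rows * cols * 2 + rows))"  # 16777224
```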
Time the various implementations. I use the fish shell, so the timing output differs from bash's.
My awk:
$ time awk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in    3.62 secs      fish         external
   usr time    3.49 secs    0.24 millis    3.49 secs
   sys time    0.11 secs    1.96 millis    0.11 secs
$ wc output.file
2097152 8388608 16777216 output.file
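The output shape is also as expected: each input row contributes 1,048,576 / 4 = 262,144 output lines, and every output line holds 4 two-byte words, so the total byte count is unchanged by the transpose.

```shell
# Each of the 8 input rows becomes cols/n output lines; every output line
# holds n words of 2 bytes each ("X " or, for the last word, "X\n").
rows=8; cols=1048576; n=4
echo "lines: $((rows * cols / n))"    # 2097152
echo "bytes: $((rows * cols * 2))"    # 16777216
```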
Timur's perl:
$ time perl -lan columnize.pl input.file > output.file
________________________________________________________
Executed in    3.25 secs      fish         external
   usr time    2.97 secs    0.16 millis    2.97 secs
   sys time    0.27 secs    2.87 millis    0.27 secs
Ravinder's awk:
$ time awk -f columnize.ravinder input.file > output.file
________________________________________________________
Executed in    4.01 secs      fish         external
   usr time    3.84 secs    0.18 millis    3.84 secs
   sys time    0.15 secs    3.75 millis    0.14 secs
kvantour's awk, first version:
$ time awk -f columnize.kvantour -v n=4 input.file > output.file
________________________________________________________
Executed in    3.84 secs        fish           external
   usr time    3.71 secs     166.00 micros    3.71 secs
   sys time    0.11 secs    1326.00 micros    0.11 secs
kvantour's second awk version: Ctrl-C interrupted after a few minutes
$ time awk -f columnize.kvantour2 -v n=4 input.file > output.file
^C
________________________________________________________
Executed in    260.80 secs      fish         external
   usr time    257.39 secs    0.13 millis    257.39 secs
   sys time      1.68 secs    2.72 millis      1.67 secs
$ wc output.file
9728 38912 77824 output.file
The $0=a[j] line is pretty expensive, as it has to parse the string into fields each time.
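The cost comes from awk's record semantics: every assignment to $0 makes awk re-split the record into fields and recompute NF, so with million-column records each pass pays for a full million-field split. A minimal illustration:

```shell
# Assigning to $0 re-splits the record: NF and $1..$NF are rebuilt each time.
awk 'BEGIN {
    $0 = "a b c d"      # first split
    print NF            # 4
    $0 = "x y"          # re-split on every assignment
    print NF, $1        # 2 x
}' </dev/null
```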
dawg's python:
$ timeout 60s fish -c 'time python3 columnize.py input.file 4 > output.file'
[... 60 seconds later ...]
$ wc output.file
2049 8196 16392 output.file
Another interesting data point: comparing awk implementations. I'm on a Mac with GNU awk and mawk installed via Homebrew.
With many columns and few rows:
$ time gawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in    3.78 secs        fish           external
   usr time    3.62 secs     174.00 micros    3.62 secs
   sys time    0.13 secs    1259.00 micros    0.13 secs
$ time /usr/bin/awk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in    17.73 secs     fish         external
   usr time    14.95 secs   0.20 millis    14.95 secs
   sys time     2.72 secs   3.45 millis     2.71 secs
$ time mawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in       2.01 secs       fish           external
   usr time    1892.31 millis    0.11 millis    1892.21 millis
   sys time      95.14 millis    2.17 millis      92.97 millis
With many rows and few columns, this test took over half an hour on a MacBook Pro (6-core Intel CPU, 16 GB RAM):
$ time mawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in    32.30 mins    fish           external
   usr time    23.58 mins    0.15 millis    23.58 mins
   sys time     8.63 mins    2.52 millis     8.63 mins