awk -v n=4 '
function join(start, end,    result, i) {
    # "result" and "i" are local variables
    for (i = start; i <= end; i++)
        result = result $i (i == end ? ORS : FS)
    return result
}
{
    c = 0
    for (i = 1; i < NF; i += n) {
        c++
        col[c] = col[c] join(i, i + n - 1)
    }
}
END {
    for (i = 1; i <= c; i++)
        printf "%s", col[i]    # the value already ends with a newline
}
' file
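As a quick sanity check with made-up data: one row of 8 columns, with n=4, should come out as two rows of 4.

```shell
# Sanity check: one row of 8 columns, chopped into groups of n=4,
# should come out as two rows of 4 columns.
printf 'a b c d e f g h\n' | awk -v n=4 '
function join(start, end,    result, i) {
    for (i = start; i <= end; i++)
        result = result $i (i == end ? ORS : FS)
    return result
}
{
    c = 0
    for (i = 1; i < NF; i += n) {
        c++
        col[c] = col[c] join(i, i + n - 1)
    }
}
END { for (i = 1; i <= c; i++) printf "%s", col[i] }'
# prints:
#   a b c d
#   e f g h
```

With multiple input rows, each row's chunks are appended to the same col[] entries, which is what produces the transposed layout.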
The awk info page has a short primer on awk, so read that too.
Benchmarking
Create an input file with 2**20 = 1,048,576 columns and 8 rows (approximately the 1,000,000 columns specified by the OP):
#!/usr/bin/env perl
use strict;
use warnings;

my $cols = 2**20;                      # 1,048,576
my $rows = 8;
my @alphabet = ( 'a' .. 'z', 0 .. 9 );
my $size = scalar @alphabet;

for my $r ( 1 .. $rows ) {
    for my $c ( 1 .. $cols ) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    print "\n";
}
$ perl createfile.pl > input.file
$ wc input.file
8 8388608 16777224 input.file
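Those wc numbers check out: 8 rows times 1,048,576 single-character columns, each written as a character plus a space (2 bytes), plus one newline per row.

```shell
# 8 rows of 2**20 single-character columns, each written as "X " (2 bytes),
# plus one newline per row.
rows=8; cols=1048576
echo "words: $((rows * cols))"             # 8388608
echo "bytes: $((rows * cols * 2 + rows))"  # 16777224
```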
Time the various implementations. I use the fish shell, so the timing output differs from bash's.
My awk:
$ time awk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in    3.62 secs      fish         external
   usr time    3.49 secs    0.24 millis    3.49 secs
   sys time    0.11 secs    1.96 millis    0.11 secs
$ wc output.file
2097152 8388608 16777216 output.file
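The output shape is also as expected: each input row contributes 1,048,576 / 4 = 262,144 output lines, and every output line holds 4 two-byte words, so the total byte count is unchanged by the transpose.

```shell
# Each of the 8 input rows becomes cols/n output lines; every output line
# holds n words of 2 bytes each ("X " or, for the last word, "X\n").
rows=8; cols=1048576; n=4
echo "lines: $((rows * cols / n))"    # 2097152
echo "bytes: $((rows * cols * 2))"    # 16777216
```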
Timur's perl:
$ time perl -lan columnize.pl input.file > output.file
________________________________________________________
Executed in    3.25 secs      fish         external
   usr time    2.97 secs    0.16 millis    2.97 secs
   sys time    0.27 secs    2.87 millis    0.27 secs
Ravinder's awk:
$ time awk -f columnize.ravinder input.file > output.file
________________________________________________________
Executed in    4.01 secs      fish         external
   usr time    3.84 secs    0.18 millis    3.84 secs
   sys time    0.15 secs    3.75 millis    0.14 secs
kvantour's awk, first version:
$ time awk -f columnize.kvantour -v n=4 input.file > output.file
________________________________________________________
Executed in    3.84 secs        fish           external
   usr time    3.71 secs     166.00 micros    3.71 secs
   sys time    0.11 secs    1326.00 micros    0.11 secs
kvantour's second awk version: Ctrl-C interrupted after a few minutes
$ time awk -f columnize.kvantour2 -v n=4 input.file > output.file
^C
________________________________________________________
Executed in    260.80 secs      fish         external
   usr time    257.39 secs    0.13 millis    257.39 secs
   sys time      1.68 secs    2.72 millis      1.67 secs
$ wc output.file
9728 38912 77824 output.file
The $0=a[j] line is pretty expensive, as it has to parse the string into fields each time.
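The cost comes from awk's record semantics: every assignment to $0 makes awk re-split the record into fields and recompute NF, so with million-column records each pass pays for a full million-field split. A minimal illustration:

```shell
# Assigning to $0 re-splits the record: NF and $1..$NF are rebuilt each time.
awk 'BEGIN {
    $0 = "a b c d"      # first split
    print NF            # 4
    $0 = "x y"          # re-split on every assignment
    print NF, $1        # 2 x
}' </dev/null
```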
dawg's python:
$ timeout 60s fish -c 'time python3 columnize.py input.file 4 > output.file'
[... 60 seconds later ...]
$ wc output.file
2049 8196 16392 output.file
Another interesting data point: comparing awk implementations. I'm on a Mac with GNU awk and mawk installed via Homebrew.
With many columns and few rows:
$ time gawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in    3.78 secs        fish           external
   usr time    3.62 secs     174.00 micros    3.62 secs
   sys time    0.13 secs    1259.00 micros    0.13 secs
$ time /usr/bin/awk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in    17.73 secs     fish         external
   usr time    14.95 secs   0.20 millis    14.95 secs
   sys time     2.72 secs   3.45 millis     2.71 secs
$ time mawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in       2.01 secs       fish           external
   usr time    1892.31 millis    0.11 millis    1892.21 millis
   sys time      95.14 millis    2.17 millis      92.97 millis
With many rows and few columns, this test took over half an hour on a MacBook Pro (6-core Intel CPU, 16 GB RAM):
$ time mawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in    32.30 mins    fish           external
   usr time    23.58 mins    0.15 millis    23.58 mins
   sys time     8.63 mins    2.52 millis     8.63 mins