Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

shell - What is the simplest way to join columns from a variable number of files using a bash script?

I have input files in one directory. All the input files have the same format and I'd like to join certain columns from these input files into one output file.

For example:

in File1

Adam    0.5 a1
Bills   0.7 b1
Carol   0.8 c1
Dean    0.4 d1

in File2

Adam    0.4 a2
Carol   0.8 c2
Evan    0.9 e2

in File3

Bills   0.6 b3
Carol   0.7 c3
Evan    0.1 e3

I'd like to join the third column from all input files, using the first column as the key. The output should look like this:

Adam    a1  a2  NA
Bills   b1  NA  b3
Carol   c1  c2  c3
Dean    d1  NA  NA
Evan    NA  e2  e3

Because the number of input files varies, the number of columns in the output varies too. There are at least 200 input files and there can be as many as 10,000.

I couldn't find a simple way to solve this with 'for', 'awk', 'join', and 'cut'. And yes, I could write a Python or Perl script instead, but I wonder whether this can be done with a bash script alone.

P.S. I searched for a solution before asking this question but couldn't find one. If this question has already been asked, please point me to the answer.

1 Reply

You can do this by combining two joins.

$ join -o '0,1.3,2.3' -a1 -a2 -e 'NA' file1 file2
Adam a1 a2
Bills b1 NA
Carol c1 c2
Dean d1 NA
Evan NA e2

First, join the first two files, using -a1 -a2 so that lines present in only one file are still printed. -o '0,1.3,2.3' selects the output fields (the join key, then field 3 of each file), and -e 'NA' replaces missing fields with NA.
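One thing the command above relies on is that join expects both inputs to be sorted on the key, as the example files happen to be. If that is not guaranteed, something like the following could sort them first. This is only a sketch: the function name join_pair_sorted is made up for illustration, and it uses a temporary file so that it does not depend on process substitution.

```shell
#!/bin/bash
# join_pair_sorted FILE1 FILE2 (hypothetical helper, not part of the answer's script):
# full outer join of column 3 from two files that may be unsorted.
join_pair_sorted() {
    local tmp
    tmp=$(mktemp) || return 1

    # Sort both inputs on the key in the C locale so sort and join agree on order.
    LC_ALL=C sort -k1,1 "$2" > "$tmp"
    LC_ALL=C sort -k1,1 "$1" |
        LC_ALL=C join -o '0,1.3,2.3' -a1 -a2 -e 'NA' - "$tmp"

    rm -f "$tmp"
}
```

Calling join_pair_sorted file1 file2 should print the same five lines as the join command above, whatever order the input lines are in.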

$ join -o '0,1.3,2.3' -a1 -a2 -e 'NA' file1 file2 | join -o '0,1.2,1.3,2.3' -a1 -a2 -e 'NA' - file3
Adam a1 a2 NA
Bills b1 NA b3
Carol c1 c2 c3
Dean d1 NA NA
Evan NA e2 e3

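Then pipe that output into a second join that adds the third file. The trick here is passing - as the first file name, which tells join to read that input from stdin. The output list becomes '0,1.2,1.3,2.3' because the piped input already holds the key plus two value columns.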
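The same step can be repeated once per extra file, growing the output list by one left-hand field each time. Here is a rough sketch of that as a loop. It is an illustrative variant rather than a tested production script: the name join_iter is hypothetical, it sorts each input itself, and it folds the result through a temporary file so that only one join runs at a time.

```shell
#!/bin/bash
# join_iter FILE... (hypothetical name): fold the files left to right, one join per file.
# Prints the key followed by column 3 of each file, with NA where a key is absent.
join_iter() {
    local acc next left n file
    acc=$(mktemp) || return 1
    next=$(mktemp) || return 1

    # Start with "key value" from the first file, sorted on the key.
    awk '{print $1, $3}' "$1" | LC_ALL=C sort -k1,1 > "$acc"
    shift
    left=1.2   # value columns to keep from the accumulated (left) side
    n=1        # number of value columns accumulated so far

    for file in "$@"; do
        # Keep every accumulated column and append column 3 of the next file.
        LC_ALL=C sort -k1,1 "$file" |
            LC_ALL=C join -a1 -a2 -e 'NA' -o "0,$left,2.3" "$acc" - > "$next"
        mv "$next" "$acc"
        n=$((n + 1))
        left="$left,1.$((n + 1))"
    done

    cat "$acc"
    rm -f "$acc" "$next"
}
```

With the three example files, join_iter file1 file2 file3 should give the same table as the two-stage pipeline above.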


For an arbitrary number of files, here's a script which applies this idea recursively.

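#!/bin/bash

# Recursively full-outer-join column 3 of every file given, keyed on column 1.
# The input files must already be sorted on column 1, as join requires.
join_all() {
    local file=$1
    shift

    # Reduce this file to "key value", then join it with the remaining files, if any.
    awk '{print $1, $3}' "$file" | {
        if (($# > 0)); then
            # The right-hand input has the key plus one value column per remaining file.
            join2 - <(join_all "$@") $(($# + 1))
        else
            cat
        fi
    }
}

# join2 FILE1 FILE2 COUNT: FILE1 has 2 fields, FILE2 has COUNT fields.
join2() {
    local file1=$1
    local file2=$2
    local count=$3

    # Build the output list 2.2 2.3 ... 2.COUNT without resorting to eval.
    local fields="" i
    for ((i = 2; i <= count; i++)); do
        fields+=" 2.$i"
    done
    join -a1 -a2 -e 'NA' -o "0 1.2$fields" "$file1" "$file2"
}

join_all "$@"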

Example usage:

$ ./joinall file1
Adam a1
Bills b1
Carol c1
Dean d1

$ ./joinall file1 file2
Adam a1 a2
Bills b1 NA
Carol c1 c2
Dean d1 NA
Evan NA e2

$ ./joinall file1 file2 file3
Adam a1 a2 NA
Bills b1 NA b3
Carol c1 c2 c3
Dean d1 NA NA
Evan NA e2 e3
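With 200 to 10,000 input files, the recursive script keeps one awk and one join process alive per file at the same time, which may run into process or file-descriptor limits. For comparison, here is a sketch that swaps join for a single awk pass. This is a different technique from the one above, not the answer's method. The name join_awk is hypothetical, the inputs do not need to be sorted, and the whole table is held in memory.

```shell
#!/bin/bash
# join_awk FILE... (hypothetical name): collect column 3 of every file, keyed on column 1.
# Assumption: no input file is empty, since an empty file never triggers FNR == 1
# and so would not get a column of its own.
join_awk() {
    awk '
        FNR == 1 { nfiles++ }          # first line of a new input file
        {
            seen[$1] = 1               # remember every key encountered
            val[$1, nfiles] = $3       # value of this key in the current file
        }
        END {
            for (key in seen) {
                line = key
                for (i = 1; i <= nfiles; i++)
                    line = line " " (((key, i) in val) ? val[key, i] : "NA")
                print line
            }
        }
    ' "$@" | LC_ALL=C sort -k1,1       # awk iteration order is unspecified, so sort by key
}
```

join_awk file1 file2 file3 should print the same five lines as ./joinall file1 file2 file3 for the example files.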
