Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


awk - grep does not remove rows matching a pattern file from a CSV

I have a file which needs to be cleaned of some URLs. The URLs are listed in one file, say fileA, and the CSV to be cleaned is fileB (these are huge files, 6-10 GB in size). I have tried the following grep command, but it does not work on newer versions of fileB.

grep -vwF -f patterns.txt fileB.csv > result.csv

The structure of file A is a single list of URLs like so:

URLs (header, single column)
bwin.hu
paradisepoker.li

and fileB:

type|||URL|||Date|||Domain
1|||https://www.google.com|||1524024000|||google.com 
2|||www.bwin.hu|||1524024324|||bwin.hu

The delimiter for fileB is |||

I am open to all solutions including awk. Thanks.

Edit: expected output is the CSV file retaining all rows not matching the domain patterns in fileA

type|||URL|||Date|||Domain
1|||https://www.google.com|||1524024000|||google.com 


1 Reply


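Could you please try the following.

awk 'FNR==NR{a[$0];next} !($NF in a)' filea FS='[|][|][|]' fileb

OR

awk -F'[|][|][|]' 'FNR==NR{a[$0];next} !($NF in a)' filea fileb

When FS is longer than one character, awk treats it as a regular expression, and | is a regex operator. Wrapping each | in a bracket expression makes it a literal character. A bare FS='|||' or a backslash-escaped form depends on how the shell and the particular awk handle it, so the bracket form is the safer choice. The second command applies the separator to both files, which is harmless here because the lines of filea contain no |.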

Output will be as follows.

type|||URL|||Date|||Domain
1|||https://www.google.com|||1524024000|||google.com 
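
To check this against the sample data from the question, here is a minimal sketch that rebuilds both files in the current directory and runs the command. The names filea, fileb and result.csv are placeholders for illustration only, and the delimiter is written as a bracket expression so that each | is matched literally.

```shell
#!/bin/sh
# Recreate the sample inputs from the question (placeholder file names).
printf '%s\n' 'URLs' 'bwin.hu' 'paradisepoker.li' > filea
printf '%s\n' \
  'type|||URL|||Date|||Domain' \
  '1|||https://www.google.com|||1524024000|||google.com' \
  '2|||www.bwin.hu|||1524024324|||bwin.hu' > fileb

# Pass 1 (filea): store every line as a key of array a.
# Pass 2 (fileb): print only rows whose last field is not one of those keys.
awk 'FNR==NR{a[$0];next} !($NF in a)' filea FS='[|][|][|]' fileb > result.csv

cat result.csv
```

With this input, result.csv should contain the header row and the google.com row, and the bwin.hu row should be gone.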

Explanation: the same command with comments.

awk '                                          ##Start the awk program.
FNR==NR{                                       ##FNR==NR is true only while the first file, filea, is being read.
  a[$0]                                        ##Create an entry in array a whose index is $0 (the current line).
  next                                         ##Skip the remaining statements and move on to the next line.
}                                              ##End of the FNR==NR block.
!($NF in a)                                    ##For fileb only: true when the last field of the current line is NOT an index of a.
                                               ##No action is given, so a line that passes the condition is printed by default.
' filea FS='[|][|][|]' fileb                   ##Input files; FS is set just before fileb is read, and each | is bracketed so it is matched as a literal character.
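
One possible reason a lookup like this stops matching on newer files is trailing whitespace or Windows line endings (CRLF), which make the last field differ from the stored domain. That is only a guess about the data, and nothing in the question confirms it. Under that assumption, here is a sketch that trims trailing spaces, tabs and carriage returns before comparing. The file names filea_crlf, fileb_crlf and result_crlf.csv are hypothetical test files created just for the demonstration.

```shell
#!/bin/sh
# Hypothetical inputs with CRLF line endings and a trailing space in the last field.
printf '%s\r\n' 'URLs' 'bwin.hu' 'paradisepoker.li' > filea_crlf
printf '%s\r\n' \
  'type|||URL|||Date|||Domain' \
  '1|||https://www.google.com|||1524024000|||google.com ' \
  '2|||www.bwin.hu|||1524024324|||bwin.hu ' > fileb_crlf

awk '
FNR==NR {                          # first file: the domain list
  sub(/[ \t\r]+$/, "")             # strip trailing blanks and carriage returns
  if ($0 != "") a[$0]              # remember non-empty lines as array keys
  next
}
{
  key = $NF                        # copy the last field so the printed row stays untouched
  sub(/[ \t\r]+$/, "", key)        # normalize the copy the same way as the list
}
!(key in a)                        # print rows whose normalized domain is not listed
' filea_crlf FS='[|][|][|]' fileb_crlf > result_crlf.csv

cat result_crlf.csv
```

Rows are printed exactly as they were read, so any CRLF endings in the input are preserved in the output. Only the comparison key is normalized.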
