I have a 4GB text file with highly variable-length lines. This is only a sample file; production files will be much larger. I need to read the file and apply a multi-line regex.
What is the best way to read such a large file for a multi-line regex?
If I read it line by line, I don't think my multi-line regex will work correctly. When I use the three-argument form of read, my regex results vary as I change the length I specify in the read statement. I believe the file is too large to be read into an array or into memory.
Here is my code:
package main;
use strict;
use warnings;
our $VERSION = 1.01;
my $buffer;
my $INFILE;
my $OUTFILE;
open $INFILE, '<', ... or die "Bad Input File: $!";
open $OUTFILE, '>',... or die "Bad Output File: $!";
while ( read $INFILE, $buffer, 512 ) {
    if ($buffer =~ /(?m)(^[^\r\n]*\R+){1}^(B|BREAK|C|CLOSE|D|DO(?! NOT)|E|ELSE|F|FOR|G|GOTO|H|HALT|HANG|I|IF|J|JOB|K|KILL|L|LOCK|M|MERGE|N|O|OPEN|Q|QUIT|R|READ|S|SET|TC|TRE|TRO|TS|U|USE|V|VIEW|W|WRITE|X|XECUTE)( |:).*[^\r\n]/) {
        print $OUTFILE $&;
        print $OUTFILE "\n";
    }
}
close( $INFILE );
close( $OUTFILE );
1;
Here is some sample data:
^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
^%Z("EXY")
X ^%Z("EW2"),^%Z("ELONG"):$L(%L)>245 S %N="E1" Q:$L(%L)>255 X ^%ZOSF("EON") S DX=0,DY=%EY,X=%RM+1 X ^%ZOSF("RM"),XY K %EX,%EY,%E1,%E2,DX,DY,%N Q
^%Z("F12")
S %A=$P(^DIC(9.8,0),"^",3)+1,%C=$P(^(0),"^",4)+1 X "F %=0:0 Q:'$D(^DIC(9.8,%A,0)) S %A=%A+1" S $P(^DIC(9.8,0),"^",3,4)=%A_"^"_%C,^DIC(9.8,%A,0)=%X_"^R",^DIC(9.8,"B",%X,%A)=""
^%Z("F2")
S %=$H>21549+$H-.1,%Y=%\365.25+141,%=%#365.25\1,%D=%+306#(%Y#4=0+365)#153#61#31+1,%M=%-%D\29+1,%DT=%Y_"00"+%M_"00"+%D,%D=%M_"/"_%D_"/"_$E(%Y,2,3)
The lines above are syntactically paired (lines 1 and 2 go together, 3 and 4, etc.). I need to find specific pairs; in the data above, that is every pair except:
^%Z("RT")
This is data that I don't want the regex to find
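To make the pairing concrete, here is a rough sketch of what I am trying to achieve, written as a two-lines-at-a-time read instead of a fixed-size read. It assumes the file strictly alternates a name line and a value line, which holds for the sample but which I have not verified for the production files. The helper name filter_pairs is made up for illustration, and the command list is copied from my regex above. Memory use should be bounded by the longest pair rather than by the file size.

```perl
use strict;
use warnings;

# Command list copied from the regex in the question; the command must be
# followed by a space or a colon.
my $COMMAND = qr/^(?:B|BREAK|C|CLOSE|D|DO(?! NOT)|E|ELSE|F|FOR|G|GOTO|H|HALT|HANG|I|IF|J|JOB|K|KILL|L|LOCK|M|MERGE|N|O|OPEN|Q|QUIT|R|READ|S|SET|TC|TRE|TRO|TS|U|USE|V|VIEW|W|WRITE|X|XECUTE)[ :]/;

# Hypothetical helper: reads two lines at a time (name line, then value line)
# and copies the pair to $out when the value line starts with a command.
# Assumes the input strictly alternates name/value lines.
sub filter_pairs {
    my ( $in, $out ) = @_;
    my $kept = 0;
    while ( defined( my $name = <$in> ) ) {
        my $value = <$in>;
        last if !defined $value;    # dangling last line, no pair to test
        if ( $value =~ $COMMAND ) {
            print {$out} $name, $value;
            $kept++;
        }
    }
    return $kept;
}

# Small demonstration on two pairs from the sample data, using in-memory
# filehandles so no real files are needed.
my $sample = join "\n",
    '^%Z("EUD")',
    'S %L=%LO,%N="E1"',
    '^%Z("RT")',
    q{This is data that I don't want the regex to find},
    '';
my $result = '';
open my $in,  '<', \$sample or die "Cannot open in-memory input: $!";
open my $out, '>', \$result or die "Cannot open in-memory output: $!";
my $kept = filter_pairs( $in, $out );
close $in;
close $out;
print "kept $kept pair(s)\n$result";
```

On this sample only the ^%Z("EUD") pair should be kept. If the real files do not alternate strictly, this would drift out of step, which is part of why I am asking about the multi-line regex approach.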