Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Perl read a large file for use with multi line regex

I have a 4GB text file with highly variable line lengths. This is only a sample file; production files will be much larger. I need to read the file and apply a multi-line regex.

What is the best way to read such a large file for the multi line regex?

If I read it line by line, I don't think my multi-line regex will work correctly. When I use the read function in its 3-argument form, my regex results vary as I change the length I specify in the read statement. I believe the file is too large to be read into an array or into memory.

Here is my code

package main;
use strict;
use warnings;

our $VERSION = 1.01;
my $buffer;
my $INFILE;
my $OUTFILE;

open $INFILE, '<', ... or die "Bad Input File: $!";
open $OUTFILE, '>',... or die "Bad Output File: $!";

while ( read $INFILE, $buffer, 512  ) {
    if ($buffer =~ /(?m)(^[^\r\n]*\R+){1}^(B|BREAK|C|CLOSE|D|DO(?! NOT)|E|ELSE|F|FOR|G|GOTO|H|HALT|HANG|I|IF|J|JOB|K|KILL|L|LOCK|M|MERGE|N|O|OPEN|Q|QUIT|R|READ|S|SET|TC|TRE|TRO|TS|U|USE|V|VIEW|W|WRITE|X|XECUTE)( |:).*[^\r\n]/) {
        print $OUTFILE $&;
        print $OUTFILE "\n";
    }
}

close( $INFILE ); 
close( $OUTFILE );
1;

Here is some sample data:

^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
^%Z("EXY")
X ^%Z("EW2"),^%Z("ELONG"):$L(%L)>245 S %N="E1" Q:$L(%L)>255  X ^%ZOSF("EON") S DX=0,DY=%EY,X=%RM+1 X ^%ZOSF("RM"),XY K %EX,%EY,%E1,%E2,DX,DY,%N Q
^%Z("F12")
S %A=$P(^DIC(9.8,0),"^",3)+1,%C=$P(^(0),"^",4)+1 X "F %=0:0 Q:'$D(^DIC(9.8,%A,0))  S %A=%A+1" S $P(^DIC(9.8,0),"^",3,4)=%A_"^"_%C,^DIC(9.8,%A,0)=%X_"^R",^DIC(9.8,"B",%X,%A)=""
^%Z("F2")
S %=$H>21549+$H-.1,%Y=%\365.25+141,%=%#365.25\1,%D=%+306#(%Y#4=0+365)#153#61#31+1,%M=%-%D\29+1,%DT=%Y_"00"+%M_"00"+%D,%D=%M_"/"_%D_"/"_$E(%Y,2,3)

The lines above are paired syntactically (lines 1 and 2 go together, 3 and 4, etc.). I need to find specific pairs; in the above data that is all of the pairs except for:

^%Z("RT")
This is data that I don't want the regex to find


1 Reply


The question is apparently about parsing a DSL, and it seems that in general regex isn't the right tool for that. A quick search did not yield an easy list of accepted approaches, except for pages of CPAN modules and posts like this article. Finding out the best approach is indeed the first step.

However, below is an answer to the question as stated in the title and in the clear description: how to parse a very large file where units to be processed spread over an unknown number of lines.


Keep assembling a 'buffer' and checking it. Once you find a match, process and clear it.

For instance, append each line to a variable and check it (try to match, if you use a regex). Keep going, and once it matches, process the variable and clear it.

my $unit;
while (<$fh>) {
    # chomp;            # if suitable, and then add a space
    # $unit .= ' '.$_;  # as a separator that newline was
    $unit .= $_;

    if ( test_unit($unit) ) {
         # process ...
         $unit = undef;
    }
}

The test_unit() sub is a placeholder for code that decides whether the assembled unit should be processed. If that is a regex, it can be defined before the loop with my $re = qr/.../; (see qr in perlop) and then tested in the loop with if ($unit =~ $re).
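
As a minimal sketch of the loop with a precompiled regex, the following may help. Everything specific in it is an assumption for illustration: the pattern is a heavily simplified stand-in (a header line starting with ^ followed by a line starting with S or X and a space, not the full command list from the question), and an in-memory filehandle stands in for the large file.

```perl
use strict;
use warnings;

# Hypothetical, simplified pattern (NOT the regex from the question):
# a "^..." header line immediately followed by a line starting with
# "S " or "X ", anchored at the end of what has been assembled so far.
my $re = qr/^(\^[^\n]*\n(?:S|X) [^\n]*\n)\z/m;

# A few lines of the sample data, used as an in-memory "file".
my $data = <<'END_DATA';
^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
^%Z("EXY")
X ^%Z("EW2")
END_DATA

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @found;
my $unit = '';
while ( my $line = <$fh> ) {
    $unit .= $line;              # keep assembling the buffer
    if ( $unit =~ $re ) {
        push @found, $1;         # process only the matched part
        $unit = '';              # clear and start the next unit
    }
}
close $fh;

print @found;
```

Note that lines which never match stay in $unit until the next match clears it, so with long runs of unwanted data the buffer grows; some rule for discarding stale lines would likely be needed for production files.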

A note in the question states that lines to be processed come in pairs, but a comment clarifies that subsequent lines don't always pair up. Thus we can't simply process pairs of lines.
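
One possible way to cope with units of varying length is to let a recognizable first line close the previous unit. The sketch below assumes, based only on the sample data, that every unit begins with a line starting with ^. The keep_unit() helper, the simplified test pattern, and the extra unpaired line in the data are all hypothetical and only illustrate the idea.

```perl
use strict;
use warnings;

# Assumption (inferred from the sample data only): every unit starts
# with a header line beginning with "^".
my $is_header = qr/^\^/;

# Hypothetical, simplified test: the line right after the header starts
# with S or X followed by a space or a colon.
my $wanted = qr/\A[^\n]*\n(?:S|X)[ :]/;

# Return the unit if it should be kept, otherwise an empty list.
sub keep_unit {
    my ($unit) = @_;
    return ( defined $unit && $unit =~ $wanted ) ? ($unit) : ();
}

# Sample lines plus one invented line that breaks the strict pairing.
my $data = <<'END_DATA';
^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
with a second line that breaks the pairing
^%Z("EXY")
X ^%Z("EW2")
END_DATA

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my ( @kept, $unit );
while ( my $line = <$fh> ) {
    if ( $line =~ $is_header ) {
        push @kept, keep_unit($unit);   # the previous unit is complete
        $unit = '';
    }
    $unit .= $line if defined $unit;    # ignore lines before the first header
}
push @kept, keep_unit($unit);           # flush the last unit at end of file
close $fh;

print "---\n$_" for @kept;
```

With this shape the buffer never holds more than one unit, so memory use stays independent of the file size as long as individual units are small.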

