
Perl Study Notes: Pattern Matching, Modules, Documentation


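Perl's most distinctive feature, and the main reason it became the preferred language for CGI, is its set of pattern-matching operators. Perl's powerful text processing rests on this built-in support for pattern matching. Patterns are written as regular expressions. Two things set Perl's regular expressions and pattern matching apart: first, they are built into the language rather than provided through a library or function calls, which makes them more convenient to use; second, they are more powerful than ordinary regular expressions and pattern matching.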

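Overview of the pattern-matching operators

Operator        Meaning                              Example
=~              matches (contains)
!~              does not match (does not contain)
m//             match                                $haystack =~ m/needle/
                                                     $haystack =~ /needle/
s///            substitute                           $italiano =~ s/butter/olive oil/
tr/// (y///)    transliterate                        $rotate13 =~ tr/a-zA-Z/n-za-mN-ZA-M/
qr//            regular expression (quote regex)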

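Usage notes

• Keep Perl's binding operator (=~) distinct from AWK's: the AWK match operator is ~. The negated match operator is the same in both languages (!~).

• Without a binding operator, the operation binds to $_ by default:

/new life/ and /new civilizations/;   # two searches against $_
s/sugar/aspartame/;                   # substitution on $_
tr/ATCG/TAGC/;                        # transliteration on $_

• The leading m of m// may be omitted when the delimiters are slashes, but keeping it is more readable and is recommended.

• With the binding operator =~, the right-hand side is treated as a pattern even when m// is omitted entirely:

print "matches" if $somestring =~ $somepattern;       # is equivalent to
print "matches" if $somestring =~ m/$somepattern/;

• m//, s///, tr///, and qr// are quote-like operators, so you may choose your own delimiters (as with q//, qq//, and qw//):

$path =~ s#/tmp#/var/tmp/scratch#;

if ($dir =~ m[/bin]) {
    print "No binary directories please.\n";
}

• A bracketing pair can be combined with a different delimiter pair, and the two parts may be separated by whitespace:

s(egg)<larva>;
s(larva){pupa};
s[pupa]/imago/;
s (egg) <larva>;

• After a successful match, $`, $&, and $' are set to the text to the left of the match, the matched text, and the text to the right of the match:

"hot cross buns" =~ /cross/;
print "Matched: <$`> $& <$'>\n";   # Matched: <hot > cross < buns>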

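• The special variables set after a pattern match are:

Variable      Meaning
$`            the text before the matched text
$&            the matched text
$'            the text after the matched text
$1, $2, $3    the text matched by the 1st, 2nd, and 3rd capturing groups
$+            the text matched by the highest-numbered group that took part in the match
$^N           the text matched by the most recently closed group
@-            array of offsets in the target text where each match (whole match, then each group) starts
@+            array of offsets in the target text where each match ends
$^R           the result of the most recently executed embedded code construct; not set when the embedded code serves as the condition of a conditional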
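
The variables in this table can be tried out with a short sketch. The helper name match_info and the sample "key=value" text are assumptions made for illustration; the point is that @- and @+ give the same boundaries that $` and $' describe.

```perl
use strict;
use warnings;

# Collect the match variables for a "key=value" pair (hypothetical helper).
sub match_info {
    my ($text) = @_;
    return unless $text =~ /(\w+)=(\w+)/;
    my %info = (
        key    => $1,
        value  => $2,
        last   => $+,                          # highest-numbered group that matched
        start  => $-[0],                       # offset where the whole match starts
        end    => $+[0],                       # offset just past the whole match
        before => substr($text, 0, $-[0]),     # same text as $`
        after  => substr($text, $+[0]),        # same text as $'
    );
    return \%info;
}

my $info = match_info("set color=red now");
print "$info->{key} => $info->{value} at $info->{start}..$info->{end}\n";
```

Using @- and @+ with substr avoids touching $`, $&, and $', which older Perl versions penalize with a slowdown on every match in the program.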

 

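m//, s///, and qr// all accept the following modifiers:

Modifier   Meaning
/i         ignore case when matching
/s         single-line (dot-matches-all) mode: . also matches a newline; also ignores the obsolete $* variable
/m         multi-line (enhanced line-anchor) mode: ^ and $ also match just after and just before embedded newlines (\n); has no effect if the target string contains no "\n" or the pattern contains no ^ or $
/x         free-spacing and comment mode: ignore whitespace (except escaped whitespace) and allow comments in the pattern
/o         compile the pattern only once, preventing recompilation at run time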

For example:

m/\w+:(\s+\w+)\s*\d+/;           # A word, colon, space, word, space, digits

m/\w+: (\s+  \w+) \s* \d+/x;     # A word, colon, space, word, space, digits

m{
    \w+:                     # Match a word and a colon
    (                        # (begin group)
        \s+                  # Match one or more spaces
        \w+                  # Match another word
    )                        # (end group)
    \s*                      # Match zero or more spaces
    \d+                      # Match some digits
}x;

$/ = "";     # "paragraph" mode

while (<>) {
    while ( m{
               \b           # start at a word boundary
               (\w\S+)      # find a wordish chunk
               (
                   \s+      # separated by some whitespace
                   \1       # and that chunk again
               ) +          # repeat ad lib
               \b           # until another word boundary
          }xig
    ) {
        print "dup word '$1' at paragraph $.\n";
    }
}
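
The effect of /m, /s, and /i is easiest to see side by side. The following is a minimal sketch; the sample string is an assumption chosen so that each modifier changes the result.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches right after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, . stops at a newline; with /s it crosses newlines.
my ($no_s)   = $text =~ /(alpha.*)/;
my ($with_s) = $text =~ /(alpha.*)/s;

# /i ignores letter case.
my $case_insensitive = "PERL" =~ /perl/i ? 1 : 0;

print scalar(@plain), " match without /m, ", scalar(@multi), " with /m\n";
```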

The pattern-matching operators in detail

7.3.1 The m// operator (matching)

EXPR =~ m/PATTERN/cgimosx

EXPR =~ /PATTERN/cgimosx

EXPR =~ ?PATTERN?cgimosx

m/PATTERN/cgimosx

/PATTERN/cgimosx

?PATTERN?cgimosx

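Notes:

• If PATTERN is the empty string, the last successfully matched regular expression is used instead.

• Modifiers specific to m//:

Modifier   Meaning
/g         find all matches
/cg        allow the search to continue after a /g match fails (the search position is not reset)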

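• In list context, m//g returns all the matches:

if (@perls = $paragraph =~ /perl/gi) {
    printf "Perl mentioned %d times.\n", scalar @perls;
}

• The ?? delimiters mean match only once; the '' delimiters suppress variable interpolation and the six translation escapes (\U, \u, \L, \l, \Q, \E). Since Perl 5.22 the ?? form needs the explicit leading m, so the example below writes m?...?:

open DICT, "/usr/share/dict/words" or die "Cannot open words: $!\n";

while (<DICT>) {
    $first = $1 if m?(^love.*)?;     # matches only the first time
    $last  = $1 if /(^love.*)/;      # matches every time, so the last one wins
}

print $first, "\n";
print $last, "\n";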
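
In scalar context, m//g finds one match per call and remembers where it stopped, which is what the /cg combination is for. The sketch below is a small tokenizer; the helper name tokenize and the token labels are assumptions, and \G anchors each attempt at the current pos().

```perl
use strict;
use warnings;

# Split a simple arithmetic expression into tokens (hypothetical helper).
# Each scalar-context m//g resumes at pos(); \G anchors the attempt there,
# and /c keeps pos() unchanged when an alternative fails to match.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while (pos($src) < length $src) {
        if    ($src =~ /\G\s+/gc)          { next }
        elsif ($src =~ /\G(\d+)/gc)        { push @tokens, "NUM:$1" }
        elsif ($src =~ /\G([-+*\/()])/gc)  { push @tokens, "OP:$1" }
        else  { die "Unexpected character at offset " . pos($src) . "\n" }
    }
    return @tokens;
}

print join(" ", tokenize("12 + 3*(4-1)")), "\n";
```

Without /c, the first failed alternative would reset pos() and the remaining branches would start over from the beginning of the string.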

7.3.2 The s/// operator (substitution)

LVALUE =~ s/PATTERN/REPLACEMENT/egimosx

s/PATTERN/REPLACEMENT/egimosx

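Notes:

• The operator searches a string for PATTERN and, if it is found, replaces the matched substring with REPLACEMENT. The return value is the number of substitutions made (which can exceed 1 with the /g modifier). On failure it returns false (the empty string "", which is 0 in numeric context).

if ($lotr =~ s/Bilbo/Frodo/) { print "Successfully wrote sequel." }

$change_count = $lotr =~ s/Bilbo/Frodo/g;

• The replacement part is treated as a double-quoted string, so it can use the dynamically set match variables ($`, $&, $', $1, $2, and so on):

s/revision|version|release/\u$&/g;

s/version ([0-9.]+)/the $Names{$1} release/g;

• If PATTERN is the empty string, the last successfully matched regular expression is used instead. Both PATTERN and REPLACEMENT undergo variable interpolation, but PATTERN is interpolated once, when the s/// as a whole is evaluated, whereas REPLACEMENT is interpolated each time the pattern matches.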

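• Modifiers specific to s///:

Modifier   Meaning
/g         replace all matches
/e         evaluate the right side as a Perl expression (code) rather than as a string

Examples of the /e modifier:

s/[0-9]+/sprintf("%#x", $&)/ge;     # $& is the matched number; the pattern has no capture group

s{
    version
    \s+
    (
        [0-9.]+
    )
}{
    $Names{$1}
        ? "the $Names{$1} release"
        : $&
}xge;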

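• Substituting without modifying the original string:

$lotr = $hobbit;
$lotr =~ s/Bilbo/Frodo/g;

($lotr = $hobbit) =~ s/Bilbo/Frodo/g;     # the same, in one statement

• Substituting in every element of an array:

for (@chapters) { s/Bilbo/Frodo/g }

s/Bilbo/Frodo/g for @chapters;

• Performing several substitutions on one string:

for ($string) {
    s/^\s+//;       # discard leading whitespace
    s/\s+$//;       # discard trailing whitespace
    s/\s+/ /g;      # collapse internal whitespace
}

for ($newshow = $oldshow) {
    s/Fred/Homer/g;
    s/Wilma/Marge/g;
    s/Pebbles/Lisa/g;
    s/Dino/Bart/g;
}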

• Substituting when a single global pass is not enough:

# put commas in the right places in an integer
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

# expand tabs to 8-column spacing
1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;

# remove (nested (even deeply nested (like this))) remarks
1 while s/\([^()]*\)//g;

# remove duplicate words (and triplicate (and quadruplicate...))
1 while s/\b(\w+) \1\b/$1/gi;
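
The "1 while s///" idiom and the /e modifier can be wrapped in small functions for experimenting. This is a sketch only: the names commify and double_numbers are assumptions, and commify uses a slightly different pattern from the one above so that it also handles a sign and a fractional part.

```perl
use strict;
use warnings;

# Insert thousands separators into the integer part of a number
# (hypothetical helper). Each pass adds one comma, so the
# substitution is repeated until it no longer matches.
sub commify {
    my ($n) = @_;
    1 while $n =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
    return $n;
}

# Double every integer in a string; /e evaluates the replacement as code.
# The copy-then-substitute form leaves the caller's string untouched.
sub double_numbers {
    my ($s) = @_;
    (my $copy = $s) =~ s/(\d+)/$1 * 2/ge;
    return $copy;
}

print commify(1234567), "\n";
print double_numbers("3 apples and 10 pears"), "\n";
```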

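7.3.3 The tr/// operator (transliteration)

LVALUE =~ tr/SEARCHLIST/REPLACEMENTLIST/cds

tr/SEARCHLIST/REPLACEMENTLIST/cds

Usage notes:

• The tr/// modifiers are:

Modifier   Meaning
/c         complement the SEARCHLIST
/d         delete found but unreplaced characters (those in SEARCHLIST with no counterpart in REPLACEMENTLIST)
/s         squash runs of duplicate replaced characters into a single character

• If the /d modifier is used, REPLACEMENTLIST is always interpreted exactly as written. Otherwise, if REPLACEMENTLIST is shorter than SEARCHLIST, its last character is repeated until it is long enough, and if REPLACEMENTLIST is empty, SEARCHLIST is used in its place. The empty form is useful for counting characters without changing them, and for squashing runs of characters with /s.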

tr/aeiou/!/;                  # change any vowel into !
tr{/\\\r\n\b\f. }{_};         # change strange chars into an underscore
tr/A-Z/a-z/ for @ARGV;        # canonicalize to lowercase ASCII

$count = ($para =~ tr/\n//);  # count the newlines in $para
$count = tr/0-9//;            # count the digits in $_

$word =~ tr/a-zA-Z//s;        # bookkeeper -> bokeper

tr/@$%*//d;                   # delete any of those
tr#A-Za-z0-9+/##cd;           # remove non-base64 chars

# change en passant
($HOST = $host) =~ tr/a-z/A-Z/;

$pathname =~ tr/a-zA-Z/_/cs;  # change non-(ASCII) alphas to single underbar
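
A few of these tr/// uses can be packaged as functions to make the return value and the in-place modification explicit. The function names are assumptions made for illustration; the ATCG to TAGC mapping is the one used earlier in these notes.

```perl
use strict;
use warnings;

# Count the A, C, G, T bases in a DNA string without modifying it:
# an empty REPLACEMENTLIST leaves the characters unchanged.
sub count_bases {
    my ($dna) = @_;
    return ($dna =~ tr/ACGT//);
}

# Reverse-complement a DNA string with the ATCG -> TAGC mapping.
sub reverse_complement {
    my ($dna) = @_;
    (my $rc = reverse $dna) =~ tr/ATCG/TAGC/;
    return $rc;
}

# Squash runs of the same letter with /s.
sub squash_letters {
    my ($word) = @_;
    $word =~ tr/a-zA-Z//s;
    return $word;
}

print count_bases("ACGTNNAC"), " bases\n";
print reverse_complement("AAGCT"), "\n";
print squash_letters("bookkeeper"), "\n";
```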

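Metacharacters

Perl's metacharacters are:

\ | ( ) [ { ^ $ * + ? .

The regular expression metacharacters mean the following:

Symbol     Atomic   Meaning
\...       Varies   escape
...|...    No       alternation
(...)      Yes      grouping (treated as a unit)
[...]      Yes      character class
^          No       start of string
.          Yes      match one character (normally except newline)
$          No       end of string (or before a newline)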

 

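* + ? are quantifiers. Perl's quantifiers mean the following:

Quantifier    Atomic   Meaning
*             No       match 0 or more times (maximal), same as {0,}
+             No       match 1 or more times (maximal), same as {1,}
?             No       match 0 or 1 time (maximal), same as {0,1}
{COUNT}       No       match exactly COUNT times
{MIN,}        No       match at least MIN times (maximal)
{MIN,MAX}     No       match at least MIN but not more than MAX times (maximal)
*?            No       match 0 or more times (minimal)
+?            No       match 1 or more times (minimal)
??            No       match 0 or 1 time (minimal)
{MIN,}?       No       match at least MIN times (minimal)
{MIN,MAX}?    No       match at least MIN but not more than MAX times (minimal)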
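
The difference between maximal and minimal matching shows up as soon as a string contains more than one possible end point. A minimal sketch, with a sample string chosen for illustration:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Maximal (greedy): .* runs on to the last ">" in the string.
my ($greedy) = $html =~ /<(.*)>/;

# Minimal: .*? stops at the first ">" that lets the match succeed.
my ($minimal) = $html =~ /<(.*?)>/;

# Counted quantifier: exactly four digits.
my ($year) = "released 2002-07-18" =~ /(\d{4})/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
print "year:    $year\n";
```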

The extended regular expression sequences are:

Extension           Atomic   Meaning
(?#...)             No       Comment, discard.
(?:...)             Yes      Cluster-only parentheses, no capturing.
(?imsx-imsx)        No       Enable/disable pattern modifiers.
(?imsx-imsx:...)    Yes      Cluster-only parentheses plus modifiers.
(?=...)             No       True if lookahead assertion succeeds.
(?!...)             No       True if lookahead assertion fails.
(?<=...)            No       True if lookbehind assertion succeeds.
(?<!...)            No       True if lookbehind assertion fails.
(?>...)             Yes      Match nonbacktracking subpattern.
(?{...})            No       Execute embedded Perl code.
(??{...})           Yes      Match regex from embedded Perl code.
(?(...)...|...)     Yes      Match with if-then-else pattern.
(?(...)...)         Yes      Match with if-then pattern.

Note: the table above covers lookahead (?=PATTERN), negative lookahead (?!PATTERN), lookbehind (?<=PATTERN), negative lookbehind (?<!PATTERN), conditionals, and other more advanced matching features; consult the relevant documentation when you need them.
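
The four lookaround assertions can be illustrated briefly. The following sketch uses sample strings chosen for illustration; in each case the assertion checks the neighbouring text without making it part of the match.

```perl
use strict;
use warnings;

# Lookahead: "foo" only when "bar" follows; "bar" itself is not consumed.
my @foo_before_bar = "foobar foobaz foobar" =~ /foo(?=bar)/g;

# Negative lookahead: offset of the first "foo" not followed by "bar".
my $offset = "foobar foobaz" =~ /foo(?!bar)/ ? $-[0] : -1;

# Lookbehind: digits that come right after a dollar sign.
my @prices = 'cost: $25, tax: $3, qty 4' =~ /(?<=\$)\d+/g;

# Negative lookbehind: whole numbers that are not preceded by a dollar sign.
my @plain = 'cost: $25, tax: $3, qty 4' =~ /(?<!\$)\b\d+/g;

print "prices: @prices; other numbers: @plain\n";
```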

Alphanumeric metasymbols:

Symbol        Atomic   Meaning
\0            Yes      Match the null character (ASCII NUL).
\NNN          Yes      Match the character given in octal, up to \377.
\1, \2, ...   Yes      Match nth previously captured string (decimal).
\a            Yes      Match the alarm character (BEL).
\A            No       True at the beginning of a string.
\b            Yes      Match the backspace character (BS) (inside a character class).
\b            No       True at word boundary.
\B            No       True when not at word boundary.
\cX           Yes      Match the control character Control-X (\cZ, \c[, etc.).
\C            Yes      Match one byte (C char) even in utf8 (dangerous).
\d            Yes      Match any digit character.
\D            Yes      Match any nondigit character.
\e            Yes      Match the escape character (ASCII ESC, not backslash).
\E            --       End case (\L, \U) or metaquote (\Q) translation.
\f            Yes      Match the form feed character (FF).
\G            No       True at end-of-match position of prior m//g.
\l            --       Lowercase the next character only.
\L            --       Lowercase till \E.
\n            Yes      Match the newline character (usually NL, but CR on Macs).
\N{NAME}      Yes      Match the named char (\N{greek:Sigma}).
\p{PROP}      Yes      Match any character with the named property.
\P{PROP}      Yes      Match any character without the named property.
\Q            --       Quote (de-meta) metacharacters till \E.
\r            Yes      Match the return character (usually CR, but NL on Macs).
\s            Yes      Match any whitespace character.
\S            Yes      Match any nonwhitespace character.
\t            Yes      Match the tab character (HT).
\u            --       Titlecase next character only.
\U            --       Uppercase (not titlecase) till \E.
\w            Yes      Match any "word" character (alphanumerics plus "_").
\W            Yes      Match any nonword character.
\x{abcd}      Yes      Match the character given in hexadecimal.
\X            Yes      Match Unicode "combining character sequence" string.
\z            No       True at end of string only.
\Z            No       True at end of string or before optional newline.

(All of the above is copied directly from Programming Perl, as are the untranslated parts below.)

Note in particular the following classic character classes:

Symbol   Meaning                 As Bytes          As utf8
\d       Digit                   [0-9]             \p{IsDigit}
\D       Nondigit                [^0-9]            \P{IsDigit}
\s       Whitespace              [ \t\n\r\f]       \p{IsSpace}
\S       Nonwhitespace           [^ \t\n\r\f]      \P{IsSpace}
\w       Word character          [a-zA-Z0-9_]      \p{IsWord}
\W       Non-(word character)    [^a-zA-Z0-9_]     \P{IsWord}

The POSIX-style character classes are:

Class     Meaning
alnum     Any alphanumeric, that is, an alpha or a digit.
alpha     Any letter. (That's a lot more letters than you think, unless you're thinking Unicode, in which case it's still a lot.)
ascii     Any character with an ordinal value between 0 and 127.
cntrl     Any control character. Usually characters that don't produce output as such, but instead control the terminal somehow; for example, newline, form feed, and backspace are all control characters. Characters with an ord value less than 32 are most often classified as control characters.
digit     A character representing a decimal digit, such as 0 to 9. (Includes other characters under Unicode.) Equivalent to \d.
graph     Any alphanumeric or punctuation character.
lower     A lowercase letter.
print     Any alphanumeric or punctuation character or space.
punct     Any punctuation character.
space     Any space character. Includes tab, newline, form feed, and carriage return (and a lot more under Unicode.) Equivalent to \s.
upper     Any uppercase (or titlecase) letter.
word      Any identifier character, either an alnum or underline.
xdigit    Any hexadecimal digit. Though this may seem silly ([0-9a-fA-F] works just fine), it is included for completeness.

Note how POSIX-style character classes are used:

42 =~ /^[[:digit:]]+$/    (correct)

42 =~ /^[:digit:]$/       (wrong)

The correct pattern opens with [[ and closes with ]]. The class itself is [:digit:]: the outer [] defines a character class, and the inner [] is part of the POSIX class name.
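
The double-bracket form can be exercised with a few small predicates. This is a sketch; the function names are assumptions, and the second one shows that a POSIX class can share a bracket set with other members such as "_".

```perl
use strict;
use warnings;

# True if the string consists only of decimal digits.
sub all_digits { return $_[0] =~ /^[[:digit:]]+$/ ? 1 : 0 }

# A letter or underscore followed by letters, digits, or underscores.
# The POSIX class sits inside the outer brackets next to the literal "_".
sub is_identifier { return $_[0] =~ /^[[:alpha:]_][[:alnum:]_]*$/ ? 1 : 0 }

# Remove every punctuation character from a copy of the string.
sub strip_punct { (my $s = $_[0]) =~ s/[[:punct:]]+//g; return $s }

print all_digits("42"), is_identifier("_foo1"), " ", strip_punct("Hello, world!"), "\n";
```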

Regex solutions to common problems

IP address

(((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))\.){3}((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))

E-mail address

(\w+\.)*\w+@(\w+\.)+[A-Za-z]+

(This e-mail regex is not strict, but it matches the great majority of ordinary addresses.)

HTTP URL:

{http://([^/:]+)(:(\d+))?(/.*)?$}i

https?://(\w*:\w*@)?[-\w.]+(:\d+)?(/([\w/_.]*(\?\S+)?)?)?

C comments
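
The IP-address and first HTTP URL patterns above can be wrapped in functions for testing. This is a sketch under stated assumptions: the function names are hypothetical, the IP pattern is anchored with \A and \z so the whole string must be an address, the capture groups are turned into non-capturing ones, and the defaults of port 80 and path "/" are choices made here, not part of the original pattern.

```perl
use strict;
use warnings;

# One octet, using the same alternation as the IP-address pattern above.
my $octet = qr/(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])/;
my $ipv4  = qr/\A(?:$octet\.){3}$octet\z/;

# True if the whole string is a dotted-quad IPv4 address.
sub is_ipv4 { return $_[0] =~ $ipv4 ? 1 : 0 }

# Split an http URL into host, port, and path with the first URL pattern.
# Returns an empty list if the URL does not match.
sub parse_http_url {
    my ($url) = @_;
    return unless $url =~ m{^http://([^/:]+)(:(\d+))?(/.*)?$}i;
    return ($1, defined($3) ? $3 : 80, defined($4) ? $4 : '/');
}

my ($host, $port, $path) = parse_http_url("http://example.com:8080/index.html");
print "$host $port $path\n";
```

Building the address pattern from a qr// fragment keeps the octet rule in one place instead of repeating it four times.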

/

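In Perl, classes, packages, and modules are related: a module is simply a package defined in a file of the same name (with a .pm suffix); a class is simply a package; an object is simply a reference; a method is simply a subroutine. Only the simplest usage is described here.

Using modules

The following shows how a module (Bestiary.pm) is written; it can serve as a template for ordinary modules.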

package      Bestiary;
require      Exporter;
 
our @ISA       = qw(Exporter);
our @EXPORT    = qw(camel);    # Symbols to be exported by default
our @EXPORT_OK = qw($weight);  # Symbols to be exported on request
our $VERSION   = 1.00;         # Version number
 
### Include your variables and functions here
 
sub camel { print "One-hump dromedary" }
 
$weight = 1024;
 
1;
(Quoted from Programming Perl)
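
How a caller imports from such a module can be sketched in a single file. The BEGIN block below is an inline stand-in for Bestiary.pm, an assumption made so the example runs without a separate file; it also differs from the module above in that camel returns its string rather than printing it. With a real Bestiary.pm on @INC, only the two use lines would be needed.

```perl
use strict;
use warnings;

# Inline stand-in for Bestiary.pm so the sketch runs as a single file.
BEGIN {
    package Bestiary;
    require Exporter;
    our @ISA       = qw(Exporter);
    our @EXPORT    = qw(camel);      # exported by default
    our @EXPORT_OK = qw($weight);    # exported only on request
    our $weight    = 1024;
    sub camel { return "One-hump dromedary" }
    $INC{'Bestiary.pm'} = __FILE__;  # mark the module as already loaded
}

use Bestiary;               # imports camel (listed in @EXPORT)
use Bestiary qw($weight);   # imports $weight (listed in @EXPORT_OK)

print camel(), " weighs $weight\n";
```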

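Using objects

The following example builds an Ipregion object whose get_area_isp_id method looks up the region and ISP for an IP address. It can serve as a template for ordinary classes.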

package Ipregion;

use strict;
use warnings;

my ($DEFAULT_AREA_ID, $DEFAULT_ISP_ID) = (999999, 9);
my ($START_IP, $END_IP, $AREA_ID, $ISP_ID) = (0 .. 3);   # column indexes within one row

sub new {
    my $invocant       = shift;
    my $ip_region_file = shift;
    my $class = ref($invocant) || $invocant;
    my $self  = [ ];                    # $self is a reference to an array of arrays

    # Load the IP region data from the file. The rows must be sorted by
    # start IP and must not overlap, because the lookup uses binary search.
    open my $fh_ip_region, '<', $ip_region_file
        or die "Cannot open $ip_region_file to load ip region data: $!";

    while (<$fh_ip_region>) {
        chomp;
        my ($start_ip, $end_ip, $area_id, $isp_id) = split;
        push @$self, [ $start_ip, $end_ip, $area_id, $isp_id ];
    }
    close $fh_ip_region;

    bless($self, $class);
    return $self;
}

sub get_area_isp_id {
    my $self    = shift;
    my $ip      = shift;
    my $area_id = $DEFAULT_AREA_ID;
    my $isp_id  = $DEFAULT_ISP_ID;

    # Check whether the IP address is in the table, using binary search.
    my $left  = 0;
    my $right = @$self - 1;             # highest array index
    my $middle;

    while ($left <= $right) {
        $middle = int( ($left + $right) / 2 );
        if ( ($self->[$middle][$START_IP] <= $ip) && ($ip <= $self->[$middle][$END_IP]) ) {
            $area_id = $self->[$middle][$AREA_ID];
            $isp_id  = $self->[$middle][$ISP_ID];
            last;
        }
        elsif ($ip < $self->[$middle][$START_IP]) {
            $right = $middle - 1;
        }
        else {
            $left = $middle + 1;
        }
    }

    return ($area_id, $isp_id);
}

1;   # a module file must end with a true value

 

The object is used like this:

use Ipregion;

my $ip_region = Ipregion->new("new_ip_region.dat");

my @search_result = $ip_region->get_area_isp_id(974173694);
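
get_area_isp_id takes the address as an integer, as in the call above. Assuming the data file stores addresses as unsigned 32-bit integers in network byte order (the notes do not state this), a dotted-quad address could be converted as in the following sketch. Both helper names are hypothetical and are not part of the Ipregion class.

```perl
use strict;
use warnings;

# Convert a dotted-quad IPv4 address to an unsigned 32-bit integer
# (hypothetical helper). Dies if the input is not four octets of 0..255.
sub ip_to_int {
    my ($ip) = @_;
    my @octets = split /\./, $ip;
    die "Not a dotted-quad address: $ip\n"
        unless @octets == 4 && !grep { !/^\d{1,3}$/ || $_ > 255 } @octets;
    return unpack 'N', pack 'C4', @octets;
}

# The inverse conversion, for printing results.
sub int_to_ip { return join '.', unpack 'C4', pack 'N', $_[0] }

print ip_to_int("58.16.181.254"), "\n";
print int_to_ip(974173694), "\n";
```

With these helpers the lookup could be written as $ip_region->get_area_isp_id(ip_to_int("58.16.181.254")).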

 

Perl special variables

Variable (name)                           Meaning
$a                                        Used by sort to hold the first of the two values being compared
$b                                        Used by sort to hold the second of the two values being compared
$_ ($ARG)                                 The default input and pattern-search space
@_ (@ARG)                                 Holds the arguments passed to a subroutine
ARGV                                      The special filehandle that iterates over command-line filenames in @ARGV
$ARGV                                     Contains the name of the current file when reading from the ARGV filehandle
@ARGV                                     The array containing the command-line arguments intended for the script
$^T ($BASETIME)                           The time at which the script began running, in seconds since the epoch
$? ($CHILD_ERROR)                         The status returned by the last pipe close, backtick (``) command, or wait, waitpid, or system function
DATA                                      This special filehandle refers to anything following the __END__ or the __DATA__ token in the current file
$) ($EGID, $EFFECTIVE_GROUP_ID)           The effective GID of this process
$> ($EUID, $EFFECTIVE_USER_ID)            The effective UID of this process as returned by the geteuid(2) syscall
%ENV                                      The hash containing your current environment variables
$@ ($EVAL_ERROR)                          The currently raised exception or the Perl syntax error message from the last eval operation
@EXPORT                                   Used by the import method of the Exporter module
@EXPORT_OK                                Used by the import method of the Exporter module
%EXPORT_TAGS                              Used by the import method of the Exporter module
%INC                                      The hash containing entries for the filename of each Perl file loaded via do FILE, require, or use
@INC                                      The array containing the list of directories where Perl modules may be found by do FILE, require, or use
$. ($NR, $INPUT_LINE_NUMBER)              The current record number (usually line number) for the last filehandle you read from
$/ ($RS, $INPUT_RECORD_SEPARATOR)         The input record separator, newline by default, which is consulted by the readline function, the <FH> operator, and the chomp function
                                          $/ = "" makes blank lines the record separator (paragraph mode), which is not the same as "\n\n"
                                          undef $/; reads the rest of the file in one read
                                          $/ = \$number reads $number bytes at a time
@ISA                                      This array contains names of other packages to look through when a method call cannot be found in the current package
@+ @- $` $' $& $1 $2 $3                   Match-related variables
$^ $~ $|                                  Filehandle-related variables
$" ($LIST_SEPARATOR)                      When an array or slice is interpolated into a double-quoted string, this variable specifies the string to put between individual elements. Default is space.
$^O ($OSNAME)                             The name of the platform
$! ($ERRNO, $OS_ERROR)                    In numeric context: the errno value of the most recent failed system call
                                          In string context: the corresponding system error message
$, ($OFS, $OUTPUT_FIELD_SEPARATOR)        The field separator for print (empty by default)
$\ ($ORS, $OUTPUT_RECORD_SEPARATOR)       The record separator for print (empty by default; setting it to "\n" is often a good choice)
$$ ($PID)                                 The process number
$0 ($PROGRAM_NAME)                        The program name
$( ($GID, $REAL_GROUP_ID)                 The real GID of this process

$<
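
The settings of $/ listed in the table can be compared on one piece of input. The sketch below reads from an in-memory filehandle so that it needs no file on disk; the helper name read_records and the sample text are assumptions made for illustration.

```perl
use strict;
use warnings;

my $data = "line 1\nline 2\n\n\n\npara two\n";

# Read all records from an in-memory filehandle with the given value of $/.
# local restores the previous separator when the function returns.
sub read_records {
    my ($separator) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    local $/ = $separator;
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @lines = read_records("\n");    # default: one record per line
my @paras = read_records("");      # paragraph mode: blank lines separate records
my @slurp = read_records(undef);   # slurp mode: the whole input as one record
my @fixed = read_records(\4);      # fixed-size records of 4 bytes each

printf "%d lines, %d paragraphs, %d slurped, %d fixed-size\n",
    scalar @lines, scalar @paras, scalar @slurp, scalar @fixed;
```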

