Perl Study Notes: Pattern Matching, Modules, and Documentation


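Perl's most distinctive feature, and the main reason it became the language of choice for CGI, is its set of pattern-matching operators. Perl's powerful text processing rests on this built-in support for pattern matching. Patterns are written as regular expressions. Two things set Perl's regular expressions and pattern matching apart: first, they are built into the language rather than supplied through a library or functions, so they are simpler to use; second, they are more powerful than ordinary regular-expression matching.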

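Introduction to the Pattern-Matching Operators

Operator	Meaning	Example
=~	Matches (contains)
!~	Does not match (does not contain)
m//	Match	$haystack =~ m/needle/ or $haystack =~ /needle/
s///	Substitute	$italiano =~ s/butter/olive oil/
tr/// (y///)	Transliterate	$rotate13 =~ tr/a-zA-Z/n-za-mN-ZA-M/
qr//	Quote a regular expression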

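Usage notes:

• Keep Perl's binding operator (=~) distinct from AWK's: the AWK match operator is ~. The negated match operator is the same in both languages (!~).

• Without a binding operator, the operation applies to $_ by default:

/new life/ and /new civilizations/    # searches $_ twice
s/sugar/aspartame/                    # substitutes in $_
tr/ATCG/TAGC/                         # transliterates $_

• The leading m of m// may be omitted when the delimiters are slashes, but keeping it is more readable and is recommended.

• With the binding operator =~, the right-hand side is treated as a pattern even when m// is left out entirely:

print "matches" if $somestring =~ $somepattern;        # is equivalent to
print "matches" if $somestring =~ m/$somepattern/;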

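• m//, s///, tr///, and qr// are quote-like operators, so you can choose your own delimiters (just as with q//, qq//, and qw//):

$path =~ s#/tmp#/var/tmp/scratch#;
if ($dir =~ m[/bin]) {
    print "No binary directories please.\n";
}

• With bracketing delimiters, the two parts may use different bracket pairs and may be separated by whitespace:

s(egg)<larva>;
s(larva){pupa};
s[pupa]/imago/;
s (egg) <larva>;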

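• After a successful match, $`, $&, and $' are set to the text to the left of the match, the matched text, and the text to the right of the match:

"hot cross buns" =~ /cross/;
print "Matched: <$`> $& <$'>\n";    # Matched: <hot > cross < buns>

• The special variables set after a pattern match are:

Variable	Meaning
$`	Text before the matched text
$&	The matched text
$'	Text after the matched text
$1, $2, $3	Text matched by the 1st, 2nd, and 3rd capturing groups
$+	Text matched by the highest-numbered capturing group
$^N	Text matched by the most recently closed capturing group
@-	Array of offsets in the target text where each match begins
@+	Array of offsets in the target text where each match ends
$^R	Result of the most recently executed embedded code; not set when the embedded-code construct serves as the if part of a conditional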
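The variables in the table can be inspected directly after a match. The following is a minimal sketch; the sample string and the extra capture group around "o" are illustrative choices, not part of the original notes.

```perl
use strict;
use warnings;

my $text = "hot cross buns";

my ($before, $matched, $after, $start, $end, $group1);
if ($text =~ /cr(o)ss/) {
    # Copy the match variables right away: the next successful match overwrites them.
    ($before, $matched, $after) = ($`, $&, $');
    ($start, $end) = ($-[0], $+[0]);    # offsets of the whole match
    $group1 = $1;                       # text of the first capturing group
    print "Matched: <$before> $matched <$after>\n";
    print "Offsets $start..$end, group 1 is '$group1'\n";
}
```

Element 0 of @- and @+ describes the whole match; element N describes capturing group N.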

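m//, s///, and qr// all accept the following modifiers:

Modifier	Meaning
/i	Ignore alphabetic case when matching
/s	Single-line mode: let . match newline and ignore the obsolete $* variable (dot-matches-all mode)
/m	Multiline mode: let ^ and $ match just after and just before embedded newlines (\n). It has no effect if the target string contains no "\n" or the pattern contains neither ^ nor $ (enhanced line-anchor mode)
/x	Free-spacing and comment mode: ignore whitespace (except escaped whitespace) and allow comments in the pattern
/o	Compile the pattern only once, preventing recompilation at run time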

For example:

m/\w+:(\s+\w+)\s*\d+/;          # A word, colon, space, word, space, digits
m/\w+: (\s+ \w+) \s* \d+/x;     # A word, colon, space, word, space, digits
m{
    \w+:    # Match a word and a colon
    (       # (begin group)
    \s+     # Match one or more spaces
    \w+     # Match another word
    )       # (end group)
    \s*     # Match zero or more spaces
    \d+     # Match some digits
}x;

$/ = "";    # "paragrep" mode
while (<>) {
    while ( m{
        \b          # start at a word boundary
        (\w\S+)     # find a wordish chunk
        (
            \s+     # separated by some whitespace
            \1      # and that chunk again
        ) +         # repeat ad lib
        \b          # until another word boundary
    }xig
    ) {
        print "dup word '$1' at paragraph $.\n";
    }
}
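The effect of /s and /m from the modifier table is easiest to see side by side. This is a small sketch with a made-up two-line string; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /s, "." stops at a newline; with /s it crosses it.
my $dot_plain  = $text =~ /first.*second/  ? 1 : 0;
my $dot_single = $text =~ /first.*second/s ? 1 : 0;

# Without /m, ^ anchors only at the start of the string;
# with /m it also anchors just after each embedded newline.
my @starts_plain = $text =~ /^(\w+)/g;
my @starts_multi = $text =~ /^(\w+)/mg;

print "dot: plain=$dot_plain, /s=$dot_single\n";
print "line starts: plain=@starts_plain, /m=@starts_multi\n";
```

The two modifiers are independent: /s changes only the dot, /m changes only ^ and $.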

The Pattern-Matching Operators in Detail

7.3.1 The m// Operator (Matching)

EXPR =~ m/PATTERN/cgimosx

EXPR =~ /PATTERN/cgimosx

EXPR =~ ?PATTERN?cgimosx

m/PATTERN/cgimosx

/PATTERN/cgimosx

?PATTERN?cgimosx

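Notes:

• If PATTERN is an empty string, the last successfully matched regular expression is used instead.

• The modifiers specific to m// are:

Modifier	Meaning
/g	Find all matches
/cg	Allow the search to continue after a failed /g match (the search position is not reset)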

• In list context, m//g returns all the matches:

if (@perls = $paragraph =~ /perl/gi) {
    printf "Perl mentioned %d times.\n", scalar @perls;
}
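In scalar context, m//g instead steps through the string one match at a time, and /cg keeps the current position when an attempt fails. The sketch below uses that to split a string into tokens; the token names (NUM, ID, OP) and the input string are assumptions made for illustration.

```perl
use strict;
use warnings;

my $input = "x = 42 + y";
my @tokens;

# Each scalar-context m//g match resumes where the previous one ended.
# \G anchors the attempt at that position, and /c preserves the position
# on failure so the next alternative is tried at the same spot.
for ($input) {
    while (1) {
        if    (/\G\s+/gc)    { next }
        elsif (/\G(\d+)/gc)  { push @tokens, "NUM:$1" }
        elsif (/\G(\w+)/gc)  { push @tokens, "ID:$1" }
        elsif (/\G([=+])/gc) { push @tokens, "OP:$1" }
        else                 { last }
    }
}
print join(" ", @tokens), "\n";
```

The digit rule is tried before the word rule because \w also matches digits.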

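• The ?? delimiters mean match only once (until reset is called); since Perl 5.22 this form must be written with an explicit m, as m?PATTERN?. Single-quote delimiters ('') suppress variable interpolation and the six translation escapes (\U and friends).

open DICT, "/usr/share/dict/words" or die "Cannot open words: $!\n";
while (<DICT>) {
    $first = $1 if m?(^love.*)?;    # matches only the first time
    $last  = $1 if /(^love.*)/;     # overwritten on every match
}
print $first, "\n";
print $last, "\n";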

7.3.2 The s/// Operator (Substitution)

LVALUE =~ s/PATTERN/REPLACEMENT/egimosx

s/PATTERN/REPLACEMENT/egimosx

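Notes:

• This operator searches a string for PATTERN and, if it is found, replaces the matched substring with REPLACEMENT. The return value is the number of substitutions made (which can exceed 1 with the /g modifier). On failure it returns false (the empty string "", which is 0 in numeric context).

if ($lotr =~ s/Bilbo/Frodo/) { print "Successfully wrote sequel. " }
$change_count = $lotr =~ s/Bilbo/Frodo/g;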

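• The replacement part is treated as a double-quoted string, so it can use the dynamically set match variables ($`, $&, $', $1, $2, and so on):

s/revision|version|release/\u$&/g;
s/version ([0-9.]+)/the $Names{$1} release/g;

• If PATTERN is an empty string, the last successfully matched regular expression is used instead. Both PATTERN and REPLACEMENT undergo variable interpolation, but PATTERN is interpolated once, when the s/// as a whole is evaluated, whereas REPLACEMENT is interpolated each time the pattern matches.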

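• The modifiers specific to s/// are:

Modifier	Meaning
/g	Replace all matches
/e	Evaluate the right-hand side as a Perl expression (code) rather than as a string

Examples of the /e modifier:

s/[0-9]+/sprintf("%#x", $&)/ge;

s{
    version
    \s+
    (
        [0-9.]+
    )
}{
    $Names{$1}
        ? "the $Names{$1} release"
        : $&
}xge;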
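To see /e at work on concrete data, here is a minimal sketch. The sample strings and the contents of %Names are hypothetical, chosen only so the result is easy to check by hand.

```perl
use strict;
use warnings;

# Convert every run of digits to hexadecimal.
my $text = "ports 80 and 8080";
(my $hex = $text) =~ s/([0-9]+)/sprintf("%#x", $1)/ge;

# Replace known version numbers with a release name; leave unknown ones alone.
my %Names = ('1.0' => 'First', '2.0' => 'Second');    # hypothetical release names
my $notes = "version 1.0 and version 9.9";
(my $named = $notes) =~ s{
    version \s+ ( [0-9.]+ )
}{
    $Names{$1} ? "the $Names{$1} release" : $&
}xge;

print "$hex\n$named\n";
```

With /e the replacement is ordinary Perl code, so the conditional operator can decide per match whether to substitute or to put the original text ($&) back.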

• To substitute without modifying the original string, copy it first:

$lotr = $hobbit;
$lotr =~ s/Bilbo/Frodo/g;

or, in a single statement:

($lotr = $hobbit) =~ s/Bilbo/Frodo/g;
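Perl 5.14 and later also offer the /r modifier, which returns the modified copy and leaves the original untouched. The sketch below compares it with the copy-then-modify idiom; the sample sentence is made up.

```perl
use strict;
use warnings;

my $hobbit = "Bilbo went on an adventure; Bilbo came back.";

# Copy-then-modify idiom.
(my $lotr = $hobbit) =~ s/Bilbo/Frodo/g;

# /r (Perl 5.14+): s/// returns the new string instead of a count.
my $lotr_r = $hobbit =~ s/Bilbo/Frodo/gr;

print "$lotr\n$lotr_r\n";
```

Both forms leave $hobbit unchanged; /r is convenient inside larger expressions such as map.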

• To substitute in every element of an array:

for (@chapters) { s/Bilbo/Frodo/g }
s/Bilbo/Frodo/g for @chapters;

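• To apply several substitutions to one string:

for ($string) {
    s/^\s+//;       # strip leading whitespace
    s/\s+$//;       # strip trailing whitespace
    s/\s+/ /g;      # collapse internal whitespace
}

for ($newshow = $oldshow) {
    s/Fred/Homer/g;
    s/Wilma/Marge/g;
    s/Pebbles/Lisa/g;
    s/Dino/Bart/g;
}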

• When a single global substitution is not enough, repeat the substitution until it fails:

# put commas in the right places in an integer
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

# expand tabs to 8-column spacing
1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;

# remove (nested (even deeply nested (like this))) remarks
1 while s/\([^()]*\)//g;

# remove duplicate words (and triplicate (and quadruplicate...))
1 while s/\b(\w+) \1\b/$1/gi;
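The comma example can be wrapped in a small function to watch the loop converge. The name commify is a hypothetical helper introduced here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: insert thousands separators into an integer string.
sub commify {
    my ($number) = @_;
    # Each pass adds one comma in front of the rightmost group of three
    # digits that is not followed by another digit; the loop ends when
    # no run of four digits is left.
    1 while $number =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;
    return $number;
}

print commify("1234567"), "\n";
```

For 1234567 the first pass yields 1234,567 and the second 1,234,567; the third pass finds nothing and the loop stops.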

7.3.3 The tr/// Operator (Transliteration)

LVALUE =~ tr/SEARCHLIST/REPLACELIST/cds

tr/SEARCHLIST/REPLACELIST/cds

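Usage notes:

• The modifiers of tr/// are:

Modifier	Meaning
/c	Complement the SEARCHLIST
/d	Delete found but unreplaced characters (those in SEARCHLIST with no counterpart in REPLACELIST)
/s	Squash runs of the same replaced character into a single character

• If the /d modifier is given, REPLACELIST is always interpreted exactly as written. Otherwise, if REPLACELIST is shorter than SEARCHLIST, its final character is repeated until it is long enough, and if REPLACELIST is empty, SEARCHLIST is used in its place. The empty form is useful for counting characters without changing them, and for squashing runs of characters with /s.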

tr/aeiou/!/;                    # change any vowel into !
tr{/\\\r\n\b\f. }{_};           # change strange chars into an underscore
tr/A-Z/a-z/ for @ARGV;          # canonicalize to lowercase ASCII
$count = ($para =~ tr/\n//);    # count the newlines in $para
$count = tr/0-9//;              # count the digits in $_
$word =~ tr/a-zA-Z//s;          # bookkeeper -> bokeper
tr/@$%*//d;                     # delete any of those
tr#A-Za-z0-9+/##cd;             # remove non-base64 chars

# change en passant
($HOST = $host) =~ tr/a-z/A-Z/;
$pathname =~ tr/a-zA-Z/_/cs;    # change non-(ASCII) alphas to single underbar
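A few of these forms, run against concrete strings, make the return value and the modifiers easier to follow. The sample data below is invented for the sketch.

```perl
use strict;
use warnings;

# Plain transliteration on a copy: complement a DNA strand.
my $dna = "ATCGGA";
(my $complement = $dna) =~ tr/ATCG/TAGC/;

# Empty REPLACELIST: count characters without changing the string.
my $para = "one\ntwo\nthree\n";
my $newlines = ($para =~ tr/\n//);

# /s squashes runs of the same character.
(my $squashed = "bookkeeper") =~ tr/a-zA-Z//s;

# /c complements the search list and /d deletes: keep only base64 characters.
(my $base64 = "aGVs bG8=\n") =~ tr#A-Za-z0-9+/##cd;

print "$complement $newlines $squashed $base64\n";
```

Note that the base64 search list from the notes does not include "=", so padding characters are deleted as well.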

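Metacharacters

Perl's metacharacters are:

\ | ( ) [ { ^ $ * + ? .

The regular-expression metacharacters have the following meanings:

Symbol	Atomic	Meaning
\...	Varies	Escape
...|...	No	Alternation
(...)	Yes	Grouping (treat as a unit)
[...]	Yes	Character class
^	No	Beginning of string
.	Yes	Match one character (normally any except newline)
$	No	End of string (or just before a newline)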

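*, +, and ? are quantifier metacharacters. Perl's quantifiers have the following meanings:

Quantifier	Atomic	Meaning
*	No	Match 0 or more times (maximal), same as {0,}
+	No	Match 1 or more times (maximal), same as {1,}
?	No	Match 1 or 0 times (maximal), same as {0,1}
{COUNT}	No	Match exactly COUNT times
{MIN,}	No	Match at least MIN times (maximal)
{MIN,MAX}	No	Match at least MIN but not more than MAX times (maximal)
*?	No	Match 0 or more times (minimal)
+?	No	Match 1 or more times (minimal)
??	No	Match 1 or 0 times (minimal)
{MIN,}?	No	Match at least MIN times (minimal)
{MIN,MAX}?	No	Match at least MIN but not more than MAX times (minimal)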
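The difference between maximal and minimal matching shows up as soon as more than one match length is possible. A short sketch, using an invented HTML fragment:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Maximal (greedy): .+ runs to the last ">" in the string.
my ($greedy)  = $html =~ /<(.+)>/;

# Minimal (non-greedy): .+? stops at the first ">".
my ($minimal) = $html =~ /<(.+?)>/;

# The same distinction applies to counted quantifiers.
my ($max_run) = "aaaa" =~ /(a{2,3})/;
my ($min_run) = "aaaa" =~ /(a{2,3}?)/;

print "$greedy | $minimal | $max_run | $min_run\n";
```

Both kinds of quantifier still start at the leftmost possible position; they differ only in how much they take from there.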

扩展正则表达式序列如下：

Extension	Atomic	Meaning
(?#...)	No	Comment, discard.
(?:...)	Yes	Cluster-only parentheses, no capturing.
(?imsx-imsx)	No	Enable/disable pattern modifiers.
(?imsx-imsx:...)	Yes	Cluster-only parentheses plus modifiers.
(?=...)	No	True if lookahead assertion succeeds.
(?!...)	No	True if lookahead assertion fails.
(?<=...)	No	True if lookbehind assertion succeeds.
(?<!...)	No	True if lookbehind assertion fails.
(?>...)	Yes	Match nonbacktracking subpattern.
(?{...})	No	Execute embedded Perl code.
(??{...})	Yes	Match regex from embedded Perl code.
(?(...)...|...)	Yes	Match with if-then-else pattern.
(?(...)...)	Yes	Match with if-then pattern.

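Note: the table above defines the more advanced matching features: lookahead (?=PATTERN), negative lookahead (?!PATTERN), lookbehind (?<=PATTERN), negative lookbehind (?<!PATTERN), conditional matching, and so on. Consult the reference documentation when you need them.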
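The four lookaround assertions test the text next to the match without consuming it. The following sketch tries each one on made-up strings.

```perl
use strict;
use warnings;

my $text = 'paid $15 for 70kg';

# Positive lookbehind: digits that directly follow a dollar sign.
my ($dollars) = $text =~ /(?<=\$)(\d+)/;

# Positive lookahead: digits that are directly followed by "kg".
my ($kilos) = $text =~ /(\d+)(?=kg)/;

# Negative lookahead: "foo" not followed by "bar".
my ($suffix) = "foobar foobaz" =~ /foo(?!bar)(\w+)/;

# Negative lookbehind: a number not preceded by "$" or by another digit.
my ($plain) = $text =~ /(?<![\$\d])(\d+)/;

print "dollars=$dollars kilos=$kilos suffix=$suffix plain=$plain\n";
```

Because the assertions are zero-width, the "$" and the "kg" are checked but never become part of the captured text.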

Meanings of the alphanumeric metasymbols:

Symbol	Atomic	Meaning
\0	Yes	Match the null character (ASCII NUL).
\NNN	Yes	Match the character given in octal, up to \377.
\1, \2, ...	Yes	Match nth previously captured string (decimal).
\a	Yes	Match the alarm character (BEL).
\A	No	True at the beginning of a string.
\b	Yes	Match the backspace character (BS).
\b	No	True at word boundary.
\B	No	True when not at word boundary.
\cX	Yes	Match the control character Control-X (\cZ, \c[, etc.).
\C	Yes	Match one byte (C char) even in utf8 (dangerous).
\d	Yes	Match any digit character.
\D	Yes	Match any nondigit character.
\e	Yes	Match the escape character (ASCII ESC, not backslash).
\E	--	End case (\L, \U) or metaquote (\Q) translation.
\f	Yes	Match the form feed character (FF).
\G	No	True at end-of-match position of prior m//g.
\l	--	Lowercase the next character only.
\L	--	Lowercase till \E.
\n	Yes	Match the newline character (usually NL, but CR on Macs).
\N{NAME}	Yes	Match the named char (\N{greek:Sigma}).
\p{PROP}	Yes	Match any character with the named property.
\P{PROP}	Yes	Match any character without the named property.
\Q	--	Quote (de-meta) metacharacters till \E.
\r	Yes	Match the return character (usually CR, but NL on Macs).
\s	Yes	Match any whitespace character.
\S	Yes	Match any nonwhitespace character.
\t	Yes	Match the tab character (HT).
\u	--	Titlecase next character only.
\U	--	Uppercase (not titlecase) till \E.
\w	Yes	Match any "word" character (alphanumerics plus "_").
\W	Yes	Match any nonword character.
\x{abcd}	Yes	Match the character given in hexadecimal.
\X	Yes	Match Unicode "combining character sequence" string.
\z	No	True at end of string only.
\Z	No	True at end of string or before optional newline.

(The tables above are copied directly from Programming Perl, as is the other reference material below.)

Among these, note the classic character classes:

Symbol	Meaning	As Bytes	As utf8
\d	Digit	[0-9]	\p{IsDigit}
\D	Nondigit	[^0-9]	\P{IsDigit}
\s	Whitespace	[ \t\n\r\f]	\p{IsSpace}
\S	Nonwhitespace	[^ \t\n\r\f]	\P{IsSpace}
\w	Word character	[a-zA-Z0-9_]	\p{IsWord}
\W	Non-(word character)	[^a-zA-Z0-9_]	\P{IsWord}

The POSIX-style character classes are:

Class	Meaning
alnum	Any alphanumeric, that is, an alpha or a digit.
alpha	Any letter. (That's a lot more letters than you think, unless you're thinking Unicode, in which case it's still a lot.)
ascii	Any character with an ordinal value between 0 and 127.
cntrl	Any control character. Usually characters that don't produce output as such, but instead control the terminal somehow; for example, newline, form feed, and backspace are all control characters. Characters with an ord value less than 32 are most often classified as control characters.
digit	A character representing a decimal digit, such as 0 to 9. (Includes other characters under Unicode.) Equivalent to \d.
graph	Any alphanumeric or punctuation character.
lower	A lowercase letter.
print	Any alphanumeric or punctuation character or space.
punct	Any punctuation character.
space	Any space character. Includes tab, newline, form feed, and carriage return (and a lot more under Unicode.) Equivalent to \s.
upper	Any uppercase (or titlecase) letter.
word	Any identifier character, either an alnum or underline.
xdigit	Any hexadecimal digit. Though this may seem silly ([0-9a-fA-F] works just fine), it is included for completeness.

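Note how POSIX-style character classes are used:

42 =~ /^[[:digit:]]+$/    (correct)
42 =~ /^[:digit:]$/       (wrong)

The correct pattern opens the class with [[ and closes it with ]]. The POSIX class itself is [:digit:]: the outer [] defines a character class, and the inner [] is part of the POSIX class name.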
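The sketch below checks both forms. It assumes, as perldiag describes, that the wrong form is compiled as an ordinary character class containing the characters ":", "d", "i", "g", and "t", and it silences the warning Perl issues for it; the input list is invented.

```perl
use strict;
use warnings;

my @inputs = ("42", "4x2", "");

# Correct: the POSIX class sits inside a bracketed character class.
my @valid = grep { /^[[:digit:]]+$/ } @inputs;

# Wrong: without the outer brackets this is just the set of characters : d i g t.
my ($bad_matches_42, $bad_matches_d);
{
    no warnings 'regexp';    # Perl warns that [: :] belongs inside a character class
    $bad_matches_42 = "42" =~ /^[:digit:]$/ ? 1 : 0;
    $bad_matches_d  = "d"  =~ /^[:digit:]$/ ? 1 : 0;
}

print "valid: @valid; wrong form matches 42: $bad_matches_42, matches d: $bad_matches_d\n";
```

Keeping `use warnings` enabled in real code is the simplest way to catch the wrong form.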

Regex Solutions to Common Problems

IP address:

(((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))\.){3}((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))

Email address:

(\w+\.)*\w+@(\w+\.)+[A-Za-z]+

(This email pattern is not strict, but it matches the great majority of ordinary addresses.)

HTTP URL:

m{http://([^/:]+)(:(\d+))?(/.*)?$}i

https?://(\w*:\w*@)?[-\w.]+(:\d+)?(/([\w/_.]*(\?\S+)?)?)?
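As written, the IP pattern is unanchored, so it will also match inside a longer string such as 256.1.1.1 (it finds 56.1.1.1). A hedged sketch of one way to use it for validation follows: the octet alternatives are taken from the pattern above, while the \A and \z anchors, the non-capturing groups, and the helper name is_ipv4 are additions made here.

```perl
use strict;
use warnings;

# One octet, 0..255, using the same alternatives as the pattern above.
my $octet = qr/(?:25[0-5]|2[0-4]\d|1\d{2}|\d{1,2})/;

# Anchored at both ends so the whole string must be an address.
my $ipv4 = qr/\A(?:$octet\.){3}$octet\z/;

# Hypothetical helper for illustration.
sub is_ipv4 {
    my ($candidate) = @_;
    return $candidate =~ $ipv4 ? 1 : 0;
}

for my $candidate ("192.168.1.1", "256.1.1.1", "1.2.3") {
    printf "%-12s %s\n", $candidate, is_ipv4($candidate) ? "valid" : "invalid";
}
```

Building the pattern from a qr// piece keeps the octet rule in one place instead of repeating it four times.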

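C-language comments:

In Perl, classes, packages, and modules are related: a module is simply a package stored in a file of the same name (with a .pm suffix); a class is a package; an object is a reference; and a method is a subroutine. Only the simplest usage is described here.

Using Modules

The following shows how a module (Bestiary.pm) is written; it can serve as a template for ordinary modules.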

package      Bestiary;

require      Exporter;

our @ISA       = qw(Exporter);

our @EXPORT    = qw(camel);    # Symbols to be exported by default

our @EXPORT_OK = qw($weight);  # Symbols to be exported on request

our $VERSION   = 1.00;         # Version number

### Include your variables and functions here

sub camel { print "One-hump dromedary" }

$weight = 1024;

1;

(Quoted from Programming Perl)
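To show how the export lists behave, the sketch below defines the package inline instead of in a separate Bestiary.pm, so that it runs as a single file. That inlining, the %INC entry that makes `use` skip the file lookup, and having camel return its string rather than print it are all adjustments made for this illustration.

```perl
use strict;
use warnings;

BEGIN {
    package Bestiary;
    require Exporter;
    our @ISA       = qw(Exporter);
    our @EXPORT    = qw(camel);      # exported by default
    our @EXPORT_OK = qw($weight);    # exported only on request
    our $weight    = 1024;
    sub camel { return "One-hump dromedary" }    # returns instead of printing
    $INC{'Bestiary.pm'} = __FILE__;  # mark the module as already loaded
}

# An explicit import list replaces the defaults, so camel is named as well.
use Bestiary qw(camel $weight);

my $description = camel();
print "$description weighs $weight\n";
```

With a plain `use Bestiary;` only camel would be imported, and $weight would have to be written as $Bestiary::weight.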

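Using Objects

The following example builds an Ipregion object whose get_area_isp_id method looks up the region and ISP of an IP address. It can serve as a template for ordinary classes.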

package Ipregion;

use strict;

my ($DEFAULT_AREA_ID, $DEFAULT_ISP_ID) = (999999, 9);

my ($START_IP, $END_IP, $AREA_ID, $ISP_ID) = (0 .. 3);

sub new {

my $invocant = shift;

my $ip_region_file = shift;

my $class = ref($invocant) || $invocant;

my $self = [ ]; # $self is a reference to an array of arrays

# Read the IP region data from the file

open my $fh_ip_region, '<', $ip_region_file

or die "Cannot open $ip_region_file to load ip region data $!";

my $i = 0;

while (<$fh_ip_region>) {

chomp;

my ($start_ip, $end_ip, $area_id, $isp_id) = split;

$self->[$i++] = [ $start_ip, $end_ip, $area_id, $isp_id ];

}

bless($self, $class);

return $self;

}

sub get_area_isp_id {

my $self = shift;

my $ip = shift;

my $area_id = $DEFAULT_AREA_ID;

my $isp_id = $DEFAULT_ISP_ID;

# Check whether an IP address is in the table, using binary search.

my $left = 0;

my $right = @$self - 1; # Get max array index

my $middle;

while ($left <= $right) {

$middle = int( ($left + $right) / 2 );

if ( ($self->[$middle][$START_IP] <= $ip) && ($ip <= $self->[$middle][$END_IP]) ) {

$area_id = $self->[$middle][$AREA_ID];

$isp_id = $self->[$middle][$ISP_ID];

last;

}

elsif ($ip < $self->[$middle][$START_IP]) {

$right = $middle - 1;

}

else {

$left = $middle + 1;

}

} # end of the while loop

return ($area_id, $isp_id);

}

1; # a module file must return a true value

The object is used like this:

use Ipregion;

my $ip_region = Ipregion->new("new_ip_region.dat");

my @search_result = $ip_region->get_area_isp_id(974173694);
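The lookup compares addresses numerically, and the call above passes 974173694, which suggests the table stores IPv4 addresses as 32-bit integers. Under that assumption, a dotted-quad address has to be converted first. The helper name ip_to_int below is hypothetical and not part of the Ipregion class.

```perl
use strict;
use warnings;

# Hypothetical helper: convert "a.b.c.d" to an unsigned 32-bit integer.
sub ip_to_int {
    my ($dotted) = @_;
    my @octets = split /\./, $dotted;
    # Pack the four octets as bytes, then read them back as one
    # big-endian ("network order") 32-bit number.
    return unpack("N", pack("C4", @octets));
}

my $as_int = ip_to_int("58.16.181.254");
print "$as_int\n";
```

The result could then be passed to the lookup, for example `$ip_region->get_area_isp_id(ip_to_int("58.16.181.254"))`.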

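Perl Special Variables

Variable (name)	Meaning
$a	Used by the sort function to hold the first value being compared
$b	Used by the sort function to hold the second value being compared
$_ ($ARG)	The default input and pattern-search space
@_ (@ARG)	Within a subroutine, holds the arguments passed in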
ARGV	The special filehandle that iterates over command-line filenames in @ARGV
$ARGV	Contains the name of the current file when reading from ARGV filehandle
@ARGV	The array containing the command-line arguments intended for script
$^T ($BASETIME)	The time at which the script began running, in seconds since the epoch
$? ($CHILD_ERROR)	The status returned by the last pipe close, backtick(``)command, or wait, waitpid, or system functions.
DATA	This special filehandle refers to anything following the __END__ or the __DATA__ token in the current file
$) ($EGID, $EFFECTIVE_GROUP_ID)	The effective GID of this process
$> ($EUID, $EFFECTIVE_USER_ID)	The effective UID of this process as returned by the geteuid(2) syscall
%ENV	The hash containing your current environment variables
$@ ($EVAL_ERROR)	The currently raised exception or the Perl syntax error message from the last eval operation
@EXPORT	Used by the import method of the Exporter module
@EXPORT_OK	Used by the import method of the Exporter module
%EXPORT_TAGS	Used by the import method of the Exporter module
%INC	The hash containing entries for the filename of each Perl file loaded via do FILE, require or use
@INC	The array containing the list of directories where Perl module may be found by do FILE, require or use
$. ($NR, $INPUT_LINE_NUMBER)	The current record number (usually line number) for the last filehandle you read from.
$/ ($RS, $INPUT_RECORD_SEPARATOR)	The input record separator, newline by default, which is consulted by the readline function, the <FH> operator, and the chomp function. Setting $/ = "" makes blank lines the record separator (paragraph mode), which is not the same as "\n\n"; undef $/ reads all remaining lines of the file at once; $/ = \$number reads $number bytes at a time
@ISA	This array contains names of other packages to look through when a method call cannot be found in the current package
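@+ @- $` $' $& $1 $2 $3	Match-related variables
$^ $~ $|	Filehandle-related variables
$" ($LIST_SEPARATOR)	When an array or slice is interpolated into a double-quoted string, this variable specifies the string to put between individual elements. Default is space.
$^O ($OSNAME)	The name of the platform
$! ($ERRNO, $OS_ERROR)	In numeric context, the error number of the last failed system call; in string context, the corresponding system error message
$, ($OFS, $OUTPUT_FIELD_SEPARATOR)	The field separator for print (empty by default)
$\ ($ORS, $OUTPUT_RECORD_SEPARATOR)	The record separator for print (empty by default; setting it to "\n" is often a good choice)
$$ ($PID)	The process number
$0 ($PROGRAM_NAME)	The name of the program
$( ($GID, $REAL_GROUP_ID)	The real GID of this process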
$<