Perl in 20 pages
A guide to Perl 5 for C/C++, awk, and shell programmers
Russell Quong
Feb 19 2001 - Document version 2001a
Keywords: Perl documentation, Perl tutorial, Perl beginners, Guide to
Perl. (For internet search engines.)
Table of Contents
1. Introduction
* Perl Versions
2. Obtaining Perl binaries, documentation
3. Basics
4. Command line usage: substituting text
5. A simple one-shot script
6. A prototype Perl script
7. Control constructs
8. Variables
* Scalar types
* String or number
* Null string/zero versus no value
* Operators
* Lists/arrays
* Hashes
* Variables declaration
* Barewords
9. Context: scalar, list, hash or reference
* Forcing scalar or list context
10. Functions
* Calling functions
* Defining functions
* Returning values
* Optional parameters
11. Regular Expressions
* Symbols, syntax
* Searching and substituting
12. Built-in Perl functions
* File tests
13. Command line arguments
14. File I/O
15. Running external commands
16. References
* Passing references to functions
17. Quoting
* Here strings
18. Packages, Modules, Records and Objects in Perl
19. Revision History
20. Feedback, motivation and afterthoughts
Introduction
Perl is an interpreted scripting language with high-level support for
text processing, file/directory management, and networking. Perl
originated on Unix but as of 1997 has been ported to numerous platforms
including the Win32 API (on which Win95/NT are based). It is the
de facto language for CGI scripts. If I had to learn just one scripting
language, it would be Perl.
This document is not meant to be a thorough reference manual; instead,
see the concisely-written manual pages ("man pages") or buy the Perl
book (Programming Perl 2nd Edition, by Wall, Christiansen and Schwartz,
ISBN 1-56592-149-6) [Note: Like the K&R book on C, this definitive
reference on a popular language is dense and insightful, but not for
all tastes]. This document attempts to get an experienced programmer
unfamiliar with Perl up to speed as quickly as possible on the most
commonly used features of Perl. For the experienced Perl programmer
looking for a reference, I recommend Perl in a Nutshell, by Ellen
Siever, Stephen Spainhour and Nathan Patwardhan, ISBN 1-56592-286-7.
I am willing to sacrifice 100% correctness if there is a much simpler
view that is correct 99% of the time. There are several reasons for
taking this approach (I need to finish this paragraph).
My Perl programming philosophy emphasizes reuse and clarity over
brevity. I happily acknowledge that much of the Perl code presented
could easily be written in half the number of lines of code and with
greater efficiency.
1. I name variables and avoid using the implicit $_ or @_ variables whenever possible.
2. I use subroutines to hold all code.
3. I use local variables and avoid globals whenever possible.
The latest version of this document can be found at
http://www.best.com/~quong/perlin20/ . Additionally there are gzip'ped
2-up (US letter) PostScript, and 2-up (US letter) PDF versions. Have
at them.
License/use: You are free to reproduce/redistribute this document in
its entirety in any form for any use so long as (i) this license (what
you are reading right now) is maintained, and (ii) you make no claims
about the authorship. I, Russell Quong, have copyrighted this document.
I would appreciate notification of any large scale reproduction and/or
feedback.
As of Jun 1999, this document is fairly complete; continued work will
be infrequent with updates every 4-10 months. Thanks to all who have
pointed out errors.
Perl Versions
This document covers Perl version 5. If you have an older version,
upgrade immediately. Run perl -v to see the version. As of 6/2000, Perl
5.6 is the latest Unix and Win32 version and is available at
http://www.perl.com . (Version 5.005 was out by 9/98 and version 5.004
was available by 2/98.) I used 5.003 when initially writing this
document in 4/98.
Before version 5, Perl was a cryptic language, in large part due to its
use of variables. In version 4, most built-in variables were named via
single punctuation symbols, such as $] and $_, and, even worse, most
statements operated on an implicit variable, named _ (yes, the variable
named underscore), to increase brevity. In Perl 5, released in October
1994, most built-in variables also have descriptive English
names (available via use English;) and all statements can be rewritten
to show explicitly the variables being used.
Obtaining Perl binaries, documentation
Check http://www.perl.com and/or CPAN (the Comprehensive Perl Archive
Network) for any Perl related binaries, material, documentation, source
or modules. If anything, there is too much information at CPAN. CPAN is
mirrored at many (over 40) different sites. Pick one near you.
Basics
Perl is a polymorphic, interpreted language with built in support for
textual processing, regular expressions, file/directory manipulation,
command execution, networking, associative arrays, lists, and dbm
access. We next present three increasingly complicated examples using
Perl.
Command line usage: substituting text
In some cases, a script is not needed. For example, I often want to
replace all occurrences of a regex (regular expression) FROMX with a new
value TOX in one or more files FILESX. The trailing g in the commands
below replaces every occurrence on each line; without it, only the first
occurrence on each line is replaced.
Here's the command:
## replace FROMX with TOX in all files FILESX, renaming originals with .bak
% perl -p -i.bak -e "s/FROMX/TOX/g;" FILESX
## replace FROMX with TOX in all files FILESX, overwriting originals
% perl -p -i -e "s/FROMX/TOX/g;" FILESX
## Same as above, for when FROMX or TOX contains a '/' but not a '@'
% perl -p -i -e "s@FROMX@TOX@g;" FILESX
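For readers wondering what the -p flag is doing: it wraps the -e code in a loop that reads each input line, runs the code on it, and prints the result. A rough in-script sketch of the same substitution, applied to a list of lines rather than to files, might look as follows. The function name substitute_lines and the sample data are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply s/FROM/TO/g to each line, roughly what
# perl -p -e "s/FROM/TO/g;" does to each line of its input files.
# $from is treated as a regex, $to as the replacement text.
sub substitute_lines {
    my($from, $to, @lines) = @_;
    my @result;
    foreach my $line (@lines) {
        (my $copy = $line) =~ s/$from/$to/g;    # g = replace every occurrence
        push(@result, $copy);
    }
    return @result;
}

my @out = substitute_lines("cat", "dog", "cat and cat\n", "no match\n");
print @out;     # prints "dog and dog" then "no match", one per line
```

Unlike the -i flag, this sketch never touches a file, so it is safe to experiment with.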
A simple one-shot script
Sometimes you need a simple throw-away script to do a task once or
twice, in which case the full-blown script in the next section is just
too much. The following script oneShot.pl reads all files specified as
command line arguments and prints out each line preceded by the file
name and the line number. You may need to make the script file
executable (via the Unix command chmod 755 oneShot.pl) first.
To run the script type
% oneShot.pl input-file(s)
or
% perl -w oneShot.pl input-file(s)
#! /usr/bin/perl -w
use English;

sub main () {
    my($filename, $line, $lineno) = ("f-not-set", undef, 0);  # local vars
    ## <> returns one-by-one every line of all files in @ARGV
    while ( defined($line=<>) ) {
        if ($ARGV ne $filename) {       # detect when we switch files
            $lineno = 0;                # reset the line number
            $filename = $ARGV;          # $ARGV = current file name
        }
        $lineno ++;                     # increment the line number
        chomp($line);                   # strip off newline from the line
        print "file=$filename, $lineno: line=($line)\n";
    }
}

main();
0;
A prototype Perl script
We present a non-trivial prototype Perl script that illustrates many common Perl script operations, including
* command line flag handling
* variables, defining/calling functions, parameter syntax
* reading multiple files
* writing the results to a file
* text searching and matching using regular expressions
* sorting an array of strings alphabetically
If this script is too much for your needs, use the one-shot script in
the preceding section for simpler tasks. Remember, it is
much easier to remove parts from a big script than to add to a small
script. (Retrospective: even after writing this prototype script, I
resisted using it because it seemed too long, but in most cases I ended
up cutting/pasting from it to my new script; since then, I just start
with this script and whittle away.)
By breaking each of the major steps into a separate function, you can
modify this prototype script for your needs with minimal changes.
Although this script is long, it should be fairly easy to read.
This example script proto-getH1.pl extracts and then sorts
(alphabetizes) all the high-level headings from one or more HTML files,
by looking for lines that contain
<Hn> ... </Hn>
This script proto-getH1.pl is run via:
% perl -w proto-getH1.pl [-o outputfile] input-file(s)
or
% proto-getH1.pl [-o outputfile] input-file(s)
All HTML headers are sent to the output file, which is stdout by default, or the file specified after the -o command line flag.
#! /usr/bin/perl -w

# Example perl file - extract H1, H2 or H3 headers from HTML files
# Run via:
#   perl this-perl-script.pl [-o outputfile] input-file(s)
# E.g.
#   perl proto-getH1.pl -o headers *.html
#   perl proto-getH1.pl -o output.txt homepage.htm
#
# Russell Quong 2/19/98

require 5.003;          # need this version of Perl or newer
use English;            # use English names, not cryptic ones
use FileHandle;         # use FileHandles instead of open(),close()
use Carp;               # get standard error / warning messages
use strict;             # force disciplined use of variables

## define some variables.
my($author) = "Russell W. Quong";
my($version) = "Version 1.0";
my($reldate) = "Jan 1998";

my($lineno) = 0;                # variable, current line number
my($OUT) = \*STDOUT;            # default output file stream, stdout
my(@headerArr) = ();            # array of HTML headers

# print out a non-crucial for-your-information message.
# By making fyi() a function, we enable/disable debugging messages easily.
sub fyi ($) {
    my($str) = @_;
    print "$str\n";
}

sub main () {
    fyi("perl script = $PROGRAM_NAME, $version, $author, $reldate.");
    handle_flags();
    # handle remaining command line args, namely the input files
    if (@ARGV == 0) {           # @ARGV used in scalar context = number of args
        handle_file('-');
    } else {
        my($i);
        foreach $i (@ARGV) {
            handle_file($i);
        }
    }
    postProcess();              # additional processing after reading input
}

# handle all the arguments, in the @ARGV array.
# we assume flags begin with a '-' (dash or minus sign).
#
sub handle_flags () {
    my($oname) = undef;
    # look only at the front of @ARGV, so shifting it inside the loop is safe
    while (@ARGV > 0 && $ARGV[0] =~ /^-o/) {
        shift @ARGV;            # discard ARGV[0] = the -o flag
        $oname = $ARGV[0];      # get arg after -o
        shift @ARGV;            # discard ARGV[0] = output file name
        $OUT = new FileHandle "> $oname";
        if (! defined($OUT) ) {
            croak "Unable to open output file: $oname. Bye-bye.";
            exit(1);
        }
    }
}

# handle_file (FILENAME);
#   open a file handle or input stream for the file named FILENAME.
#   if FILENAME == '-' use stdin instead.
sub handle_file ($) {
    my($infile) = @_;
    fyi(" handle_file($infile)");
    if ($infile eq "-") {
        read_file(\*STDIN, "[stdin]");  # \*STDIN=input stream for STDIN.
    } else {
        my($IN) = new FileHandle "$infile";
        if (! defined($IN)) {
            fyi("Can't open spec file $infile: $!\n");
            return;
        }
        read_file($IN, "$infile");      # $IN = file handle for $infile
        $IN->close();                   # done, close the file.
    }
}

# read_file (INPUT_STREAM, filename);
#
sub read_file ($$) {
    my($IN, $filename) = @_;
    my($line, $from) = ("", "");
    $lineno = 0;                        # reset line number for this file
    while ( defined($line = <$IN>) ) {
        $lineno++;
        chomp($line);                   # strip off trailing '\n' (newline)
        do_line($line, $lineno, $filename);
    }
}

# do_line(line of text data, line number, filename);
#   process a line of text.
sub do_line ($$$) {
    my($line, $lineno, $filename) = @_;
    my($heading, $htype) = (undef, undef);
    # search for a <Hx> .... </Hx> line, save the .... in $heading.
    # where Hx = H1, H2 or H3.
    if ( $line =~ m:<(H[123])>(.*)</H[123]>:i ) {
        $htype = $1;            # either H1, H2, or H3
        $heading = $2;          # text matched by the second parentheses in the regex
        fyi("FYI: $filename, $lineno: Found ($heading)");
        print $OUT "$filename, $lineno: $heading\n";

        # we'll also save all the headers in an array, headerArr
        push(@headerArr, "$heading ($filename, $lineno)");
    }
}

# print out headers sorted alphabetically
#
sub postProcess() {
    my(@sorted) = sort { $a cmp $b } @headerArr;    # example using sort
    print $OUT "\n--- SORTED HEADERS ---\n";
    my($h);
    foreach $h (@sorted) {
        print $OUT "$h\n";
    }
    my $now = localtime();
    print $OUT "\nGenerated $now.\n";
}

# start executing at main()
#
main();
0;              # return 0 (no error from this script)
Control constructs
Perl has syntax similar to C/C++/Java for control constructs such
as if, while, for statements. The following table compares the control
constructs between C and Perl. In Perl, the values 0, "0", "" (the
empty string) and the undefined value are false; any other value is
true when evaluating a condition in an if/for/while statement.
         C                          Perl (braces required)
same     if () { ... }              if () { ... }
diff     } else if () { ... }       } elsif () { ... }
same     while () { ... }           while () { ... }
diff     do { ... } while ();       do { ... } while ();  (See below)
same     for (aaa;bbb;ccc) { ... }  for (aaa;bbb;ccc) { ... }
diff     N/A                        foreach $var (@array) { ... }
diff     break                      last
diff     continue                   next
similar  0 is FALSE                 0, "0", "" and undef are FALSE
similar  != 0 is TRUE               anything not false is TRUE
Note in Perl, the curly braces around a block are required, even if the
block contains a single statement. Also you must use elsif in Perl,
rather than else if as shown below.
if ( conditionAAA ) {
...
} elsif ( conditionBBB ) {
...
} else {
...
}
Finally, although the do { body } while (...) is legal Perl, it is not
an actual loop construct in Perl. Instead, it is the do statement with
a while modifier. In particular, last and next will not work inside the
body.
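As a small illustration of the table above, here is a sketch that uses foreach, elsif, next and last together. The function name first_negative is invented for this example, and the regular expression it uses to recognize integers is covered later in this document.

```perl
use strict;
use warnings;

# Return the first negative integer in the list, skipping entries that
# are not integers; returns undef if there is none.
# (first_negative is an illustrative name, not a Perl built-in.)
sub first_negative {
    my(@items) = @_;
    my $found;
    foreach my $item (@items) {
        if ($item !~ /^-?\d+$/) {
            next;               # like C's continue: skip to the next item
        } elsif ($item < 0) {
            $found = $item;
            last;               # like C's break: leave the loop
        }
    }
    return $found;
}

my $neg = first_negative(3, "abc", -7, -2);
print "first negative = $neg\n";    # prints first negative = -7
```

Because last and next only work in real loops, this is written with foreach rather than do { ... } while.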
Variables
There are four types of data in Perl: scalars, arrays, hashes and
references. Scalars and arrays are ubiquitous (used everywhere).
Hashes are common in large programs and not unusual in smaller
programs. References are scalars that point to other data; that is, a
reference is a pointer. References are an advanced topic and can be
ignored initially; there is sparse coverage of references later in
this document. In the following listing, the initial symbol is the
context specifier for that type.
1. ($) A scalar is a single string or numeric value. More advanced scalar types include references and typeglobs.
2. (@) A list or array is a one-dimensional vector of zero
or more scalars. Arrays/lists are indexed via [ ]; the
starting index is 0, like C/C++. The Perl reference documentation
intermixes the terms list and array freely; so shall we.
3. (%) A hash is a list of (key, value) pairs, in which
you can search for a particular key efficiently. In practice, a hash is
implemented via a hash table, hence the name.
4. (\) A reference refers to another value, much like a pointer in C/C++ refers to some other value.
A scalar holds a single value; an array or list holds zero or more
values. The scalar types in Perl are string, number, and
reference [Note: There is also a symbol table entry scalar type, poorly
named a typeglob in Perl, but you are not likely to use it initially].
Like awk, a scalar data value in Perl contains either a string or a
(floating point) number. For reference, we create scalars of all four
types (number, string, reference and typeglob).
$numx = 3.14159;                # numeric
$strx = "The constant pi";      # string
$refx = \$numx;                 # reference
$tglobx = *numx;                # typeglob (different from file name globbing)
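If you are curious how the reference above gets used, the following minimal sketch reads and writes $numx through $refx. It assumes nothing beyond core Perl; the extra $ in $$refx plays the role of * in C's *ptr.

```perl
use strict;
use warnings;

my $numx = 3.14159;
my $refx = \$numx;          # reference to the scalar $numx

# $$refx follows the reference, much like *ptr in C.
print "value via reference = $$refx\n";     # prints value via reference = 3.14159

$$refx = 2.71828;           # assigning through the reference changes $numx
print "numx is now $numx\n";                # prints numx is now 2.71828
```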
A numeric value is a real or floating point value and can use any of the standard C specifications, e.g. 1.2 or 12e-1.
A string value is enclosed in matching single or double quotes. Within
double quotes, variable references (but not expressions involving
operators) are evaluated, like shells (csh,sh); within single quotes
nothing is evaluated. Double quotes are especially convenient when
printing out values.
$i = 123;
print('i = $i\n');                  # print: i = $i\n
print("i = $i\n");                  # print: i = 123
print("i = $i+4\n");                # print: i = 123+4
print("i = " . ($i+4) . "\n");      # print: i = 127
print("i = " . $i+4 . "\n");        # print: 4 (may get warnings)
print((("i = " . $i) + 4) . "\n");  # print: 4 (same as previous)
Perl automatically converts from string to number or vice versa as
needed, based on the operation being done. Below, + is arithmetic plus
and . is string concatenation.
$pi = "3.14";
$two_pi = 2 * $pi; # $two_pi = 6.28
$pi_pi = $pi . $pi; # $pi_pi = "3.143.14"
The following table shows that a non-numeric string value is viewed as
0 (zero), and a numeric value viewed as a string is the ASCII
representation of the number.
Type of $x   Value of $x   $x+1   $x . "::"   if ($x) {
string       "abc"         1      abc::       true
number       3             4      3::         true
string       "45.0"        46     45.0::      true
number       0             1      0::         false
string       ""            1      ::          false
undefined    undef         1      ::          false
Because strings are converted to numbers on demand and vice versa,
there is no practical difference between a number and its string
equivalent. Thus, in the following statements $i and $j are assigned the
same value.
$i = 3;             # same as $i = "3"
$j = "3";           # same as $j = 3
$k = $i + $j;       # $k = 6
$s = $i . $j;       # $s = "33"
$f = "3.0";         # not the same as "3", as $f . 1 would give "3.01"
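The difference between "3" and "3.0" only shows up when the values are treated as strings. The sketch below, using only core Perl, compares the two values both ways: == compares as numbers and eq compares as strings.

```perl
use strict;
use warnings;

my $i = 3;
my $f = "3.0";

# == compares as numbers, eq compares as strings.
my $same_number = ($i == $f) ? "yes" : "no";    # 3 versus 3.0
my $same_string = ($i eq $f) ? "yes" : "no";    # "3" versus "3.0"

print "same number: $same_number\n";    # prints same number: yes
print "same string: $same_string\n";    # prints same string: no
print(($f + 0) . "\n");                 # prints 3, the numeric value of "3.0"
```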
Null string/zero versus no value
A scalar variable that has a valid string or numeric value, such as 4.3
or "hello" or even "" (the empty string), is defined. In contrast, a
variable without a valid value is undefined. The builtin value undef
represents this undefined value, much like NULL in C/C++, null in Java
or nil in Lisp. Use the defined() function to test if a scalar
variable is defined. Do not use defined() on an entire array:
defined(@array) only reported whether the array had ever been allocated
memory, is deprecated, and is a fatal error as of Perl 5.22. To test
whether an array is empty, evaluate it in scalar context, as in
if (@array).
my $emptystr = "";
my(@nonemptylist) = ( undef );      # one element, whose value is undef
if ( defined($emptystr) && @nonemptylist ) {
    print "will see this\n";
}
my $invalid;
my(@emptylist) = ();
if ( defined($invalid) || @emptylist ) {
    print "will NOT see this\n";
}
@emptylist = (1, 2);
@emptylist = ();
if ( ! @emptylist ) {
    print "emptylist is empty again\n";
}
If you read or access an undefined variable var as a string or number,
you get the undefined value, which is then converted to "" or 0. Thus
an undefined variable is considered false.
An entry for a key KKK in a hash can contain the undefined value. This
situation is different from the key KKK not existing in the hash. Use
the Perl functions exists and defined to distinguish the two cases.
sub hashdefined () {
    my(%hhh);
    $hhh{"red"} = undef;
    if (! exists $hhh{"nowhere"} ) {
        print "key nowhere is not in hash hhh.\n";      # YES
    }
    if (! exists $hhh{"red"} ) {
        print "key red is not in hash hhh.\n";          # NOPE
    }
    if (exists $hhh{"nowhere"} && ! defined($hhh{"nowhere"}) ) {
        print "key nowhere exists but has the undefined value.\n";  # NOPE
    }
    if (exists $hhh{"red"} && ! defined($hhh{"red"}) ) {
        print "key red exists but has the undefined value.\n";      # YES
    }
}
Most Perl operators, such as + or < or . work either on numbers or on strings but not both.
Description       string op         numeric op
equality          eq                ==
inequality        ne                !=
ternary compare   cmp               <=>
concatenation     . (a dot)         N/A
arithmetic        N/A               +, -, *, /
relational        lt, le, gt, ge    <, <=, >, >= (ANSI C ops)
ASCII strings are ordered character by character based on the
underlying ASCII value. For purely alphabetic strings, this results in
normal alphabetization, as A < B < ... < Z < a < b <
... < z. When use locale is in effect, strings are instead ordered
using the locale's collating sequence. The tri-valued compare
operations xx cmp yy and xx <=> yy return -1, 0, or 1 if xx is less
than, equal to, or greater than yy for strings and numbers
respectively; these operators are commonly used in sort comparison
functions.
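To see why the choice between cmp and <=> matters, here is a short sketch that sorts the same three values both ways. The sample data is made up; $a and $b are the two values that sort hands to the comparison block.

```perl
use strict;
use warnings;

my @values = (10, 9, 100);

my @as_numbers = sort { $a <=> $b } @values;    # numeric order
my @as_strings = sort { $a cmp $b } @values;    # character-by-character order

print "numeric: @as_numbers\n";     # prints numeric: 9 10 100
print "string:  @as_strings\n";     # prints string:  10 100 9
```

Using the string comparison on numbers is a common source of surprising sort results.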
A list/array is a one-dimensional vector that holds zero or more
values. To Perl, lists and arrays are identical, and we shall use the
terms interchangeably, with the poor justification that the existing
documentation does so, too. In Perl, a list/array value is denoted by
scalars enclosed in parentheses. Arrays can be indexed; like C/C++/Java,
the first element has index 0.
@fib = (0, 1, 1, 2, 3, 5);
@mixed = ("quiet", +4, 3.14, "hot dog");
@empty = ();
@emptyAlso = ( (), (), () );
$five = pop @fib; # get $five
$three = $fib[4];
The length or size of an array can be obtained in two different ways.
$len = @array;              ## need SCALAR CONTEXT. Number of items in the array.
$last_index = $#array;      ## index of last element in the array.
Finally, here are four ways to iterate through an array, @arr. In this
example, we simply print out each element. For accessing each element,
I prefer foreach; if the index is needed too, I use the second method.
my $item;
foreach $item (@arr) { ## cleanest, but no index
print $item;
}
my $i;
for ($i=0; $i<@arr; $i++) { ## just like C
print $arr[$i];
}
for (my $i=0; $i<@arr; $i++) { ## As of v 5.004, 'my' is allowed inside for
print $arr[$i];
}
my $j;
for ($j=0; $j<=$#arr; $j++) { ## I don't use this much
print $arr[$j];
}
The next block shows some common array operations. Push and pop
add/remove elements at the right-end of the array. We show how to
construct the list ("one1", "two2", "three3", "four4") in the following
steps.
@list = ("one1");
push(@list, "two2");
$list[2] = "three3";
$nelements = @list;                 # get three, as there are three elements
$list[$nelements] = "four" . "4";
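Push and pop have left-end counterparts, unshift and shift, which are standard Perl built-ins. The following sketch, with made-up data, shows all four on one array.

```perl
use strict;
use warnings;

my @list = ("two2", "three3");

push(@list, "four4");               # add at the right end
unshift(@list, "one1");             # add at the left end
# @list is now ("one1", "two2", "three3", "four4")

my $last  = pop(@list);             # remove from the right end: "four4"
my $first = shift(@list);           # remove from the left end: "one1"

print "first=$first last=$last remaining=@list\n";
# prints first=one1 last=four4 remaining=two2 three3
```

Using push with shift gives a queue; using push with pop gives a stack.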
Perl automatically and dynamically enlarges an array so you do not have
to predeclare the size of an array. However, if you know you will need a
very large array, largeArr, you can pre-allocate space by assigning to
$#largeArr. Pre-allocating is slightly more efficient, but potentially
wastes a lot of space, and should only be done for arrays bigger than
16K elements.
$#largeArr = 987654; ## preallocate 987K worth of space.
A hash variable stores an array of (key, value) pairs, collectively
known as a map. Typically, the key and value are different but related
values, such as a person's name and phone number. A hash is implemented
in Perl so that you can quickly look up the value given the key, even
when there are many (key, value) pairs. From an algorithms/data
structures standpoint, a Perl hash implements a dictionary, most likely
using a hash table.
For example, given the name of a state, such as california, I want the
Postal abbreviation, CA. We define, initialize, and modify a hash,
%abbrevTable as follows.
my(%abbrevTable) = ( # this is the initialization syntax.
"california" => "CA", # key = california, value = CA
"oregon" => "OR",
);
sub printAbbrev($) {
my($state) = @_;
if (exists $abbrevTable{$state}) {
print "Abbreviation for $state = $abbrevTable{$state} \n";
} else {
print "No known abbreviation for $state\n";
}
}
sub hashdemo () {
printAbbrev("arizona"); # no such key
$abbrevTable{"arizona"} = "AZ"; # add a new (key, value) pair
printAbbrev("arizona"); # this will succeed
}
Calling the function hashdemo() gives
No known abbreviation for arizona
Abbreviation for arizona = AZ
Note that we use the exists $hash{$key} syntax to test if a key exists
in the hash table. Also, a hash is asymmetric in that we can look up
entries based on the key, not the value.
If treated as a normal array/list, a hash will appear as
(keyA, valueA, keyB, valueB, keyC, valueC, ... ).
The order of the keys will appear random[Note: The key order is based
on the underlying hash function being used, we are simply listing the
hash table buckets.].
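Because the key order is effectively random, a common idiom is to sort the keys before walking through a hash. The sketch below reuses the abbreviation table from above as sample data and also shows the hash flattened into a list.

```perl
use strict;
use warnings;

my %abbrevTable = (
    "california" => "CA",
    "oregon"     => "OR",
    "arizona"    => "AZ",
);

# keys returns the keys in no particular order, so sort them
# to get repeatable output.
foreach my $state (sort keys %abbrevTable) {
    print "$state => $abbrevTable{$state}\n";
}

my @flat = %abbrevTable;    # hash in list context: (key, value, key, value, ...)
my $nflat = @flat;          # three keys plus three values
print "flattened length = $nflat\n";    # prints flattened length = 6
```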
Declare local variables using my(var-name[s]) = initial-vals, which
evaluates initial-vals in list context, or my scalar-var = initial-val,
which evaluates initial-val in scalar context. A local variable only
exists in, and hence can only be used in, the function (or block) where
it was declared.
sub some_function () {
    my(@copyOfARGV) = @ARGV;        # array local variable
    my($i, $mesg) = (0, "hi");      # local variables for some_function
    for ($i = 0; $i < @ARGV; $i++) {
        my $arg = $ARGV[$i];        # $arg only exists in the for loop
    }
    print $arg;                     # Arghh. ERROR, $arg does not exist here.
}
In older Perl code, you may see the local keyword instead of my. If in
doubt, use my instead of local[Note: There are advanced situations,
beyond the scope of this document, where local must be used.]. A local
variable is dynamically-scoped[Note: With dynamic scoping, we use the
variable in the closest function-call stack frame, which means that the
same line of code might use different non-local variables as it depends
on the function call nesting.]; a my variable is statically-scoped,
which is faster and almost certainly what you want. For example,
C/C++/Java use static scoping.
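The difference between the two kinds of scoping is easiest to see when one function calls another. The sketch below is only an illustration: the variable $color and the three function names are made up, and our (available since Perl 5.6) declares the global variable.

```perl
use strict;
use warnings;

our $color = "global";          # a package (global) variable

sub show_color { return $color; }   # always reads the global $color

sub with_local {
    local $color = "local";     # dynamically scoped: visible to callees
    return show_color();
}

sub with_my {
    my $color = "my";           # statically scoped: invisible to callees
    return show_color();
}

my @results = (with_local(), with_my(), show_color());
print "@results\n";             # prints local global global
```

The last value shows that the global is restored once with_local returns.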
A bareword is an unquoted literal not used as a variable or function
name. Barewords are used mainly for labels and for filehandles [Note:
and for package names, but this is an advanced topic]. The following
code snippet shows three barewords, A_FILE_HANDLE, bare and bareword.
Filehandles are uppercase to avoid naming conflicts, and to follow the
normal Perl naming convention. (If you use the FileHandle package, you
don't need to make your own file handles.)
open(A_FILE_HANDLE, "./perlscript.pl");
bare: while ($line = <A_FILE_HANDLE>) {
bareword: while ($line[$i] ne "") {
if ($line[$i] =~ /\s*#/) {
next bare;
}
}
}
A bareword not used as a filehandle or label, and which is not a known function, is viewed as a string constant.
$str = hi;          # AVOID. Use of bareword hi, same as "hi".
$str = "hi";        # same, but much easier to read.
We advise against using barewords as strings, since it impedes clarity,
as function calls are typically barewords. Instead, put your strings in
double quotes, which is standard across most languages.
Context: scalar, list, hash or reference
A context specifier, which is one of the characters $, @, %, must be
used before all variable references. The context indicates the kind of
value that will be used or assigned. The context is not part of a
variable name. Consider the following assignment statements.
$eight = 8;                 # numeric scalar
@nulllist = ();             # null or empty list.
$four = $eight / 2;         #
@cubes = (1, 8, 27, 64);    # assign an entire array/list.
$eight = $cubes[1];         # huh? cubes is an array, why not @cubes[1].
The $ specifier in the statement ... = $varX ... means that we expect
to read a scalar value from a variable named varX. Thus, Perl uses the
scalar variable named varX. Similarly, the @ specifier in ... = @varX
means that we expect to read an array/list value from a variable varX;
Perl uses the array/list variable varX.
While it might seem that the $ and the @ are part of the variable names
in $varX and @varX, this view is wrong. In reality, there are two
different variables, each named varX; one is a scalar, the other an
array. In an expression like varX[...], because array subscripting is
used, Perl selects the array variable. The last statement in the
preceding example $eight = $cubes[1]; illustrates the preceding rule as
we precede the array variable cubes by a $.
An expression like @aaa = @bbb[$ccc] means that we expect the element
bbb[$ccc] to produce a list/array value, which is probably wrong
thinking. Since Perl array elements must be scalars, @bbb[$ccc]
results in a one-element array containing $bbb[$ccc], namely (
$bbb[$ccc] ). [Note: If $bbb[$ccc] is undefined, we get the array (
undef ) ]
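The @ form with subscripts is called an array slice, and it becomes useful once more than one index is given. Here is a small sketch using the @cubes array from above; the no warnings line is only there because perl -w otherwise suggests writing $cubes[1] for a one-element slice.

```perl
use strict;
use warnings;

my @cubes = (1, 8, 27, 64);

my $second = $cubes[1];         # $ = one scalar element: 8
my @pair   = @cubes[1, 3];      # @ = slice with two indexes: (8, 64)

my @one;
{
    no warnings 'syntax';       # silence "better written as $cubes[1]"
    @one = @cubes[1];           # one-element slice: (8)
}

print "second=$second one=(@one) pair=(@pair)\n";
# prints second=8 one=(8) pair=(8 64)
```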
In an expression like ... = $varX[kk], we first interpret the array
brackets, which means varX must be an array. We get the kkth element.
Finally the leading $ specifier indicates we expect this element to be
a scalar value.
What happens if the LHS and RHS contexts do not match in an assignment
statement? Perl uses the following rules which are often convenient but
sometimes unexpected.
Value assigned to LHS in the assignment LL = RR

                      RHS: Scalar $RR   List @RR         Hash %RR
Original RHS value    "hi"              (1, 4, 9)        ("one" -> 1, "two" -> 2)
LHS scalar, $LL       "hi"              3 [arr length]   1/8 [used/alloc buckets]
LHS list, @LL         ("hi")            (1, 4, 9)        ("one", 1, "two", 2)
LHS hash, %LL         [empty hash]      (1, 4)           ("one" -> 1, "two" -> 2)
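The most commonly used conversions, those involving a scalar and a list, can be tried out with the following sketch. The variable names are made up for this illustration.

```perl
use strict;
use warnings;

my @squares = (1, 4, 9);

my $count   = @squares;     # scalar context: number of elements, 3
my($first)  = @squares;     # list context: first element, 1
my @wrapped = "hi";         # scalar assigned to a list: ("hi")

print "count=$count first=$first wrapped=(@wrapped)\n";
# prints count=3 first=1 wrapped=(hi)
```

Note how the parentheses around $first are all that separates getting the length from getting the first element.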
Variables of different types (scalar, list, hash) can have the same
name, because each type has its own namespace. Thus, the following code
refers to three different variables, so that no data values are
overwritten.
$xyz = "my foot";                               # scalar mode variable
@xyz = ("tulip", "rose", "mum is the word");    # list mode variable
$xyz{$xyz} = $xyz[1];                           # $xyz{"my foot"} = "rose";
Even the Perl book is misleading as it states that "all variable names
start with a $, @, or %" (page 37), which would imply that $cubes[1] is
using the $cubes variable, which is incorrect. (It is accurate to say
that all variable uses begin with a $, @ or a %.)
The condition of an if-statement or while-loop is evaluated in scalar
context. Thus it is acceptable and indeed common Perl programming
practice to say
if ( @array > 4 ) {     ## @array ==> number of items in it.
    ...
}
Many functions and operators behave differently depending on the
context. For example, using my($var) = RHS; produces a list context on
the LHS and RHS, because the parentheses denote a list, so RHS will be
evaluated in list context. Instead do my $var = RHS;.
Thus, to get a string of the current time there are several correct ways. We show some commonly encountered cases.
my($now1) = scalar(localtime());    # CORRECT, force scalar context
my $now2 = localtime();             # CORRECT, no parens, scalar context
my($now3);
$now3 = localtime();                # CORRECT, scalar context
my($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(); # OK
my($nowWRONG) = localtime();        # WRONG, list context, get $sec
Forcing scalar or list context
Use the scalar(...) function to force scalar context. Use (...) to force array/list context.
$scalarVar = scalar(@arrayVar);     # force scalar context.
my($line) = scalar(<FILE>);         # just read one line
Functions
Perl functions take a single list/array as a parameter, which naturally
handles the case of passing several scalars. Parameters are separated
by commas, because they are separate elements of the parameter
list/array.
$two = sqrt 4.00;                   # square root of 4
open FILEHANDLE, "input.txt";       # open the file input.txt for reading
$i = index "abcdefg", "cde";        # index of substring cde in abcdefg
print "i = $i \n";                  # print value of i
if (defined $somevar) { ... }       # test if $somevar has a defined value
You may optionally put parentheses around the arguments, resulting in
the standard call-syntax of most languages as shown below. I personally
prefer using parentheses. However, I prefer no parentheses if the
function call is the entire conditional of an if or while statement.
$two = sqrt(4.00);                  # square root of 4
open (FILEHANDLE, "input.txt");     # open the file input.txt for reading
$i = index("abcdefg", "cde");       # index of substring cde in abcdefg
print ("i = $i \n");                # print value of i.
if (defined($somevar)) { ... }      # test if $somevar has a defined value (ugly)
A few functions, such as print, grep, map, and sort have secondary
syntaxes that require a space, not a comma, after the first parameter.
If you use parentheses around the arguments, you must still use a space.
print STDERR "i = $i \n";           # print value of i to STDERR
print(STDERR "i = $i \n");          # print value of i to STDERR
print(STDERR, "i = $i \n");         # (ACK) print 'STDERR' followed by i
Beware that the first set of outermost parentheses fully delimits the
parameters, so that subsequent values are not parameters. Whitespace
does not affect things.
$ten = sqrt (1+3)*5;        # Ack. same as $ten = (sqrt(4)) * 5;
$ten = 5 * sqrt (1+3);      # Arithmetically the same as preceding.
$n = sqrt ((1+3)*5);        # Good. $n = sqrt (20);
A function definition looks as follows. All the parameters to the
function are passed in the @_ list/array. This is one time where use of
this cryptic variable cannot be avoided. I always immediately rename
the parameters as shown in the prototype code.
sub do_line ($$$) {
my($line, $lineno, $filename) = @_;
...
}
As of Perl 5.002, you can pre-declare the number and types of the
function parameters (see Section Prototypes in perlsub) using a
function prototype, so that the parameters can be interpreted in a user
specified manner. In the function declaration sub do_line ($$$) {, each
of the $ signifies a single scalar parameter. A @ in the parameter list
signifies a list; nothing can follow it as the list parameter gobbles
up all remaining parameters. Warning: the function-prototype for a
function fn must be seen before calling fn for Perl to do parameter
checking.
A Perl function can return any type of value including a scalar, an
array, or nothing (void). Unfortunately, the return type of a function
cannot be specified in the function prototype. If a function returns
one type, say an array, and you expect a scalar, Perl will silently do
a conversion.
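A small sketch of this silent conversion, using a hypothetical function colors that returns an array. Assigning the result to a plain scalar yields the number of elements, not the first element.

```perl
use strict;
use warnings;

# colors is a hypothetical function that returns an array.
sub colors {
    my @c = ("red", "green", "blue");
    return @c;
}

my @all     = colors();   # list context: all three elements
my $count   = colors();   # scalar context: the array becomes its length
my ($first) = colors();   # parentheses force list context: first element

print "count = $count, first = $first\n";
```

If you want the first element, remember the parentheses around the variable on the left-hand side.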
You can write functions that return different types based on expected
return type (known as the calling context) by using the wantarray
function. For example,
sub scalarOrList () {
return wantarray ? ("red", "green", "blue") : 88;
}
...
$i = scalarOrList(); # scalar context, get 88
@color = scalarOrList(); # list context, get ("red", "green", "blue")
If a function takes optional trailing parameters, they are declared and fetched as follows.
# called as:
# dieMessage("Whoops, that hurt."); # one parameter
# dieMessage("Whoops, that hurt.", 0); # two parameters
#
sub dieMessage ($;$) {
my($message) = shift @_;
my($shouldDie) = (@_ > 0) ? shift @_ : 1; ## 1 = default value if no param
}
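Here is a runnable variation on the same idea. The function describe is hypothetical; it returns a string instead of dying so that both call forms can be shown side by side.

```perl
use strict;
use warnings;

# describe is a hypothetical function: the second parameter is optional
# and defaults to 1 when the caller omits it.
sub describe ($;$) {
    my($message) = shift @_;
    my($shouldDie) = (@_ > 0) ? shift @_ : 1;   # 1 = default value
    return $shouldDie ? "fatal: $message" : "warning: $message";
}

my $one = describe("Whoops, that hurt.");      # one parameter, default used
my $two = describe("Whoops, that hurt.", 0);   # two parameters
print "$one\n$two\n";
```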
Regular Expressions
In regular expressions, Perl understands the following convenient
character set symbols which match a single character. Thus, to handle
arbitrary blank space you must use \s+. You may use these symbols in a
character set. For example, when looking for a hex integer you might
look for [a-fA-F\d]. Also, the term regex is short for regular
expression.
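A short sketch of these symbols at work. The input string and the variable names are made up for illustration; the search syntax itself is explained in the next section.

```perl
use strict;
use warnings;

# Made-up input: a name, some blank space, an equals sign, a hex number.
my $text = "offset  =\t0x1F3a";

# \w+ matches the name, \s+ matches any run of blank space (spaces or tabs),
# and [a-fA-F\d]+ matches a run of hex digits.
my ($name, $hex);
if ($text =~ /(\w+)\s+=\s+0x([a-fA-F\d]+)/) {
    ($name, $hex) = ($1, $2);
}
print "name = $name, hex digits = $hex\n";
```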
Symbol Equiv Description
\w [a-zA-Z0-9_] A "word" character (alphanumeric plus "_")
\W [^a-zA-Z0-9_] Match a non-word character
\s [ \t\n\f\r] Match a whitespace character
\S [^\s] Match a non-whitespace character
\d [0-9] Match a digit character
\D [^0-9] Match a non-digit character
Perl has the standard regex quantifiers or closures, where r is any regular expression.
r* Zero or more occurrences of r (greedy match).
r+ One or more occurrences of r (greedy match).
r? Zero or one occurrence of r (greedy match).
r*? Zero or more occurrences of r (match minimal).
r+? One or more occurrences of r (match minimal).
r?? Zero or one occurrence of r (match minimal).
Let q be a regex with a quantifier. If there are many ways for q to
match some text, a greedy quantifier will match (or "eats up") as much
text as possible; a minimal matcher does the opposite. If a regex
contains more than one quantifier, the quantifiers are "fed" left to
right.
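To make the difference concrete, here is a small sketch with a made-up string. It uses the search operator and parenthesized submatches, both described in the next section.

```perl
use strict;
use warnings;

my $s = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ eats as much as it can, up to the LAST '>' in the string.
my ($greedy)  = $s =~ /<(.+)>/;

# Minimal: .+? eats as little as it can, up to the FIRST '>'.
my ($minimal) = $s =~ /<(.+?)>/;

# Two greedy quantifiers are fed left to right: the first takes everything.
my ($left, $right) = "aaa" =~ /(a*)(a*)/;

print "greedy  = $greedy\n";
print "minimal = $minimal\n";
print "left = '$left', right = '$right'\n";
```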
- Searching and substituting
The two main regex operations are searching/finding and substituting.
In searching, we test if a string contains a regular expression[Note:
"Regex searching" is often incorrectly called "regex matching".]. In
substituting, we replace part of the original string with a new string;
the new string is often based on the original. Both of these operations
use the regular expression operator =~, which consists of two
characters. This operator is not related to either equals = or
~[Note: (1) The choice of symbols was quite confusing to me initially.
(2) The =~ is officially called the "binding operator", as there are
other non-regex operations that use it.]
Searching: For example, to determine if the string $line contains a
recent year such as 1998 or 1983, we use the search operator =~ /.../.
Here the slashes '/' delimit or mark the beginning and the end of the
regular expression.
if ($line =~ /19[89]\d/) {
# we found a year in $line
}
In general, to determine if string $var contains the regular expression
re, use any of the following forms. If the regular expression contains a
slash '/' itself, then you must use the mXreX form, where each X is the
same single character not appearing in re.
In mX...X, the m stands for "match".
if ($var =~ /re/) { ... }
if ($var =~ m:re:) { ... } # can replace ':' with any other character
while ($var =~ m/re/) { ... } # can replace '/' with any other character
To access the substring in $var matched by part of the regular
expression re, put that part of re in parentheses. The matched text is
accessible via the variables $1, $2, ..., $k, where $k matches the k-th
parenthesized part of the regular expression. For example, to break up
an e-mail address user@machine in $line we could do
if ($line =~ /(\S+)@(\S+)/) { # \S = any non-space character
my($user, $machine) = ($1, $2);
...
}
The submatch variables $1, $2, ... $k are updated after each successful
regex operation, which wipes out the previous values. I store these
submatch values into other well-named variables immediately after the
regex operation, if I want them.
Use \k, not $k, in the regular expression itself to refer to a
previously matched substring. For example, to search for identical
beginning and ending HTML tags <xyz> ... </xyz> on a single
line $line use
if ($line =~ m|<(.*)>(.*)</\1>|) { # search for: <xyz>stuff</xyz>
my($stuff) = $2;
...
}
Substitution: To replace or substitute text in $var matching the regular expression old with the string new, use the following form.
$var =~ s/old/new/; # replace old with new
if ($var =~ s:old:new:) { ... } # replace ':' with any other character
To use part of the actual text matched by the old regex, the new string
can use the $k variables. Taking our previous example involving years,
to replace the year 19xy with xy, use
$line =~ s/19(\d\d)/$1/;
Modifiers: When searching or substituting, there are several optional
modifiers you can use to alter the regular expression. For example, in
if ($var =~ /<title>/i), the i at the end specifies a
case-insensitive search. We use m// and s/// to represent searching and
substituting.
Option  Where       What
i       m//, s///   case insensitive (upper=lower case) pattern
m       m//, s///   treat $var as multiple lines
g       s///        replace all orig with new. I.e. apply repeatedly.
g       m//         (Adv) search for all occurrences. On next evaluation, continue where previous search left off.
s       m//, s///   (Adv) treat $var as a single line, even if it has embedded '\n' chars
x       m//, s///   (Adv) allow extended regex syntax. Ignore spaces in the regex (for readability)
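A brief sketch of the i, g, and x modifiers in action. The strings are made up, and the substitution works on a copy so that the original is left alone.

```perl
use strict;
use warnings;

my $line = "Cat cat CAT";

# s///gi: replace every match, ignoring case. Work on a copy of $line.
(my $dogs = $line) =~ s/cat/dog/gi;

# m//g in a while loop: each evaluation continues where the last one stopped.
my @found;
while ($line =~ /(cat)/gi) {
    push @found, $1;
}

# /x: spaces in the regex are ignored, so the pattern can be spread out.
my $is_year = ("in 1998" =~ / 19 [89] \d /x) ? 1 : 0;

print "dogs = $dogs\n";
print "found = @found\n";
print "is_year = $is_year\n";
```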
The regex operations return different results depending on the context. For clarity, I recommend using the scalar context.
context     return value
scalar      true, if there was a match (or substitution)
list/array  list of sub-matches ($1, $2, ...) found in the match
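The two contexts can be sketched as follows. The address is a made-up example, and the @ is escaped in the regex to avoid any doubt about array interpolation.

```perl
use strict;
use warnings;

my $line = 'mail quong@example.com today';   # made-up address

# Scalar context: we only learn whether the search succeeded.
my $found = ($line =~ /(\S+)\@(\S+)/) ? 1 : 0;

# List context: the search returns the sub-matches ($1, $2) directly.
my ($user, $machine) = ($line =~ /(\S+)\@(\S+)/);

print "found = $found, user = $user, machine = $machine\n";
```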
Built-in Perl functions
Perl has many built-in functions.
There are numerous ways to access documentation about Perl functions.
* On a Unix system with Perl installed, run % man perlfunc.
* On a Win 95 PC with standard Perl installed in perldir, look at perldir/lib/Pod/perlfunc.html.
Here are some of the more common functions I've used. If a function
has additional options, its description starts with a (+).
@arr = split(/[ :]+/, $line);
    (+) Split $line into words. Words are separated by spaces or
    colons (but not tabs). Store words in @arr; spaces and colons are
    discarded.
@arr = stat(filename);
    Returns a 13 element list ($dev, $ino, $mode (permissions on this
    file), $nlink, $uid, $gid, $rdev, $size (in bytes), $atime, $mtime
    (last modification time), $ctime, $blksize, $blocks) containing
    information about a file.
$str = join("::", @arr);
    Concatenate all elements of @arr into a single scalar string;
    separate all the elements by a double colon. Useful when printing
    out an array.
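A quick sketch of these three functions together. The input string is made up, and stat is applied to the current directory "." simply because that always exists.

```perl
use strict;
use warnings;

my $line = "usr:local bin::perl";          # made-up input

# Split on runs of spaces and/or colons, then glue back with "::".
my @arr = split(/[ :]+/, $line);
my $str = join("::", @arr);
print scalar(@arr), " words: $str\n";

# stat returns a 13 element list; a list slice picks out single fields.
my @info = stat(".");
my ($size, $mtime) = (stat("."))[7, 9];    # $size and $mtime fields
print "stat returned ", scalar(@info), " fields\n";
```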
Perl has several functions which test properties about files. These
functions have the name -X, for some character X. (Yes, the function
name starts with a dash.) These names mimic the Unix csh and the Unix
sh test operations. These functions take a filename or a file handle,
as in -X filename.
For example, if you want to run a command /bin/ccc on the data file
../input/ddd, you might want to check if ccc is executable and ddd is
readable first.
if ( (-x "/bin/ccc") && (-r "../input/ddd") ) {
my(@cccout) = `/bin/ccc ../input/ddd`; # run the command.
} else {
... complain ...
}
I give the descriptions directly from the perlfunc manual page, listed from most common to least common, based on my own usage.
-f File is a plain file.
-e File exists.
-d File is a directory.
-l File is a symbolic link.
-r File is readable by effective uid/gid.
-x File is executable by effective uid/gid.
-w File is writable by effective uid/gid.
-z File has zero size.
-s File has non-zero size (returns size).
-o File is owned by effective uid.
-R File is readable by real uid/gid.
-W File is writable by real uid/gid.
-X File is executable by real uid/gid.
-O File is owned by real uid.
-p File is a named pipe (FIFO).
-S File is a socket.
-b File is a block special file.
-c File is a character special file.
-t Filehandle is opened to a tty.
-u File has setuid bit set.
-g File has setgid bit set.
-k File has sticky bit set.
-T File is a text file.
-B File is a binary file (opposite of -T).
-M Age of file in days when script started.
-A Same for access time.
-C Same for inode change time.
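Here is a small sketch of the most common tests, run against a scratch file. It assumes the File::Temp module is available (it ships with recent Perl 5 releases); the file is deleted automatically when the script exits.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file to test; it is removed when the script exits.
my ($fh, $path) = tempfile(UNLINK => 1);
print $fh "hello\n";
close($fh);

my $exists   = (-e $path) ? 1 : 0;   # file exists
my $is_plain = (-f $path) ? 1 : 0;   # plain file
my $is_dir   = (-d $path) ? 1 : 0;   # not a directory
my $is_empty = (-z $path) ? 1 : 0;   # not zero size
my $size     = -s $path;             # non-zero size, returned in bytes

print "exists=$exists plain=$is_plain dir=$is_dir empty=$is_empty size=$size\n";
```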
Command line arguments
When you run a Perl script, perl puts the command line arguments in the global array @ARGV. For example, running the command
% perl somescript.pl -o abc -t one.html two.html
results in
$ARGV[0] -o
$ARGV[1] abc
$ARGV[2] -t
$ARGV[3] one.html
$ARGV[4] two.html
The prototype code at the beginning of this document shows one way to process @ARGV.
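As a separate, minimal sketch (not the prototype code itself), a loop like the following can walk an argument list. The function parse_args is hypothetical, and it assumes that -o takes a value while -t is a plain flag; a real script would pass @ARGV instead of the literal list.

```perl
use strict;
use warnings;

# parse_args is a hypothetical helper. Assumptions: -o is followed by a
# value, -t is a flag, and everything else is a file name.
sub parse_args {
    my (@args) = @_;
    my (%opt, @files);
    while (@args) {
        my $arg = shift @args;
        if ($arg eq "-o") {
            $opt{o} = shift @args;     # consume the option's value
        } elsif ($arg eq "-t") {
            $opt{t} = 1;               # flag: just note that it was seen
        } else {
            push @files, $arg;
        }
    }
    return (\%opt, \@files);
}

# In a real script: my ($opt, $files) = parse_args(@ARGV);
my ($opt, $files) = parse_args("-o", "abc", "-t", "one.html", "two.html");
print "o = $opt->{o}, t = $opt->{t}, files = @$files\n";
```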
File I/O
See the prototype example for reading/writing from/to a file.
Given a file handle FH from either open() or a new FileHandle, the
operation <FH> reads the next line in scalar context or the
entire file in list context.
while ( $line = <FILE_DATA> ) { # read a line at a time.
if ( $line =~ /keyboard/ ) {
print $line;
}
}
my(@whole_file) = <FILE_DATA>; # be careful, file could be BIG.
my($numlines) = scalar(@whole_file); #
If you only want to read from stdin, use
while ($line = <STDIN>) { # read a line at a time
...
}
But how can we read from a file sometimes and from STDIN at other times
in the same Perl script? The routines handle_file() and read_file() in
the prototype code show how to read from any input stream, such as a
file or stdin (which itself could be a file, the keyboard, or a network
connection).[Note: An
input stream is any source of input data and is a generalization of an
input file. In C an input stream is a file descriptor or a FILE*
pointer (from stdio.h), such as stdin. In C++ an input stream is an
istream, such as cin.] The function handle_file() is a "driver" for
read_file() that passes as a parameter either STDIN or a FileHandle
input stream to read_file().
In read_file(istream, fname) the first parameter, istream, is the input
stream from which we read input data. The second parameter, fname, is
the file name, which is used for, say, reporting errors. To pass STDIN
as a parameter to read_file(), we use \*STDIN.[Note: This is a very
advanced topic as we are passing a reference to the typeglob for
STDIN.] Sadly, explaining \*STDIN is beyond the scope of this document.
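The general shape of such a routine can be sketched as below. This is not the prototype's read_file(); the function read_stream is hypothetical, and the demonstration reads from an in-memory string (which assumes Perl 5.8 or later) so that it runs without any input file.

```perl
use strict;
use warnings;

# read_stream is a hypothetical stand-in for a routine like read_file():
# it reads lines from ANY input stream and counts those with "keyboard".
sub read_stream {
    my ($istream, $fname) = @_;
    my $count = 0;
    while (my $line = <$istream>) {     # read a line at a time
        $count++ if $line =~ /keyboard/;
    }
    return $count;
}

# Demonstration input: a filehandle opened on a string (Perl 5.8+).
my $data = "keyboard\nmouse\nkeyboard tray\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $n = read_stream($fh, "in-memory data");
close($fh);
print "lines containing keyboard: $n\n";

# The same routine reads standard input when called as:
#   read_stream(\*STDIN, "stdin");
```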
Running external commands
(This may or may not work on Win32) You can run an external command,
such as ls -l, by placing it in back quotes (also known as back ticks or
grave accents), as in `ls -l`. The returned value is the output the
command sends to stdout. In scalar context, you get one big string, with
a \n character separating lines; in array context, each output line is a
separate array item.
Thus, to see the contents of a tar file, xyz.tar, in Perl, you could do
my(@tarlist) = `tar tfv xyz.tar`;
Commands are run in the current working directory, which is initially
the directory where you started the Perl script. You can change the
current working directory to DDD by calling the built-in Perl function
chdir DDD.
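A minimal sketch of both contexts and of chdir. It assumes a Unix-like system where the shell commands echo and pwd exist; it will likely not work as written on Win32.

```perl
use strict;
use warnings;

# Assumes a Unix-like shell with echo and pwd available.
my $all   = `echo one; echo two`;   # scalar context: one big string
my @lines = `echo one; echo two`;   # array context: one item per line
chomp(@lines);                      # strip the trailing \n from each item

# Commands run in the current working directory, which chdir changes.
chdir("/") or die "cannot chdir to /: $!";
my $cwd = `pwd`;
chomp($cwd);

print "scalar: $all";
print "array has ", scalar(@lines), " items: @lines\n";
print "cwd = $cwd\n";
```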
References
A reference in Perl is equivalent to a pointer in C. Any Perl scalar
value/variable can be a reference. The address-of operator in Perl is
the \ (backslash); the dereference operator is sadly and confusingly
the $ (dollar sign).
Thus the following lines are equivalent in Perl and C; in both cases we
change the value of str from "hi" to "bye" via ptr and we add 5 to the
value of num via a pointer. In Perl, we can use the same reference
variable ptr because references are not typed; in C we must use
different pointers sptr and iptr.
Perl              C/C++
$str = "hi";      char* str = "hi";
$ptr = \$str;     char** sptr = &str;
$$ptr = "bye";    *sptr = "bye";
$num = 4;         int num = 4;
$ptr = \$num;     int* iptr = &num;
$$ptr += 5;       (*iptr) += 5;
In the last line, the double dollar sign $$ptr is pretty ugly; as a
notational convenience, for a reference to an array or hash, the
postfix -> operator can be used. Thus, to dereference the array
reference arrRef, we can use either $arrRef->[...] or $$arrRef[...].
An analogous notation is used for hashes passed by reference. The
following table shows how to use an array/hash versus a reference to
it. There should be no surprises to an experienced C programmer.
Approach   Var             whole array  k-th item                    address-of array
Normal     @arr            @arr         $arr[k]                      \@arr
Reference  $aref = \@arr   @$aref       $aref->[k] or $$aref[k]      $aref
Approach   Var             whole hash   key lookup                   address-of hash
Normal     %hash           %hash        $hash{key}                   \%hash
Reference  $href = \%hash  %$href       $href->{key} or $$href{key}  $href
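The table entries can be tried out with a short sketch like this one. The array and hash contents are made up.

```perl
use strict;
use warnings;

my @arr  = (10, 20, 30);
my %hash = (red => 1, green => 2);

my $aref = \@arr;          # address-of array
my $href = \%hash;         # address-of hash

$aref->[1]   = 21;         # k-th item; same as $$aref[1] = 21
$$href{blue} = 3;          # key lookup; same as $href->{blue} = 3

my $count = scalar(@$aref);        # whole array via the reference
my @keys  = sort keys %$href;      # whole hash via the reference

print "arr = @arr, count = $count, keys = @keys\n";
```

Because $aref and $href point at the original variables, the changes show up in @arr and %hash themselves.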
- Passing references to functions
I typically pass arrays and hashes as references like C/C++, because
this method is fast (as we only pass a scalar) and it allows the array
to be modified. The basic scheme is to declare the formal parameters as
scalars; the actual parameters passed are "the-address-of" the array
or hash.
# call via:
# toBeCalled (array-reference, hash-reference);
#
sub toBeCalled ($$) { # declare params to be scalars
my($ref2arr, $ref2hash) = @_;
...
$ref2arr->[$idx] = ...
...
$ref2hash->{key} = ...
...
foreach my $item ( @$ref2arr ) {
...
}
}
sub caller () {
my(@arr) = ( ... );
my(%hash) = ();
...
toBeCalled(\@arr, \%hash);
}
Here's an example of a function clearEntry which clears the specified
index idx of an array of strings arr and increments idx. Because both
variables are modified, they are both passed as references.
sub clearEntry ($$) {
my($idx, $arr) = @_;
$arr->[$$idx] = "";
$$idx ++;
}
sub callClear () {
my(@stuff) = ("aa", "bb", "cc", "dd");
my($indexer) = 1;
print "BEFORE indexer = $indexer " . join(":", @stuff) . "\n";
clearEntry(\$indexer, \@stuff);
print "AFTER indexer = $indexer " . join(":", @stuff) . "\n";
}
Calling callClear() gives
BEFORE indexer = 1 aa:bb:cc:dd
AFTER indexer = 2 aa::cc:dd
Quoting
There are a variety of other quoting mechanisms, as summarized in the
table below, which borrows directly from the Section Quote and
Quotelike Operators in perlop. Interpolates means that variables are
evaluated: all variable references starting with $ or @ are fully
evaluated, including array and hash elements such as $arr[0] and
$hash{key}. (A whole hash %hash is not interpolated.)
@squares = (0, 1, 4, 9, 16, 25);
$i = 2;
print("i = $i, 3+i = (3+$i)\n"); # print: i = 2, 3+i = (3+2)
print("squares[i+3] = $squares[$i+3]\n"); # print: squares[i+3] = 25
In the first print() statement, the arithmetic expression (3+$i) is not
evaluated, because it is not a variable; only the $i inside it is
replaced. However, the reference to $squares[$i+3] is fully evaluated.
Customary Generic Meaning Interpolates
'xxx' q:xxx: Literal no
"xxx" qq:xxx: Literal yes
`xxx` qx:xxx: Command yes
none qw:xxx: Word list no
/xxx/ m:xxx: Pattern match yes
none s:xxx:yyy: Substitution yes
none tr:xxx:yyy: Translation no
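A few rows of the table, sketched as runnable code. The strings are made up; parentheses are used as the generic delimiters.

```perl
use strict;
use warnings;

my $where = "a hot dog stand";

my @days    = qw(Mon Tue Wed);     # word list: no quotes or commas needed
my $literal = q(at $where);        # like '...': $where is NOT interpolated
my $interp  = qq(at $where);       # like "...": $where IS interpolated

# tr (translation) maps characters one for one; here lower to upper case.
(my $upper = $where) =~ tr/a-z/A-Z/;

print "days = @days\n";
print "literal = $literal\n";
print "interp  = $interp\n";
print "upper   = $upper\n";
```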
The generic quoting mechanism allows you to delimit a string with
arbitrary characters, which is especially convenient when the string
contains single and/or double quotes.
$where = "a hot dog stand";
$proverb = 'Don\'t buy sushi from a hot dog stand.';
$proverb = q/Don't buy sushi from a hot dog stand./;
$proverb = q(Don't buy sushi from a hot dog stand.);
$proverb = "Don't buy sushi from $where.";
$proverb = qq/Don't buy sushi from $where./;
$proverb = qq(Don't buy sushi from $where.);
You can specify multi-line, verbatim strings, called "here documents",
using the << syntax. This syntax originated in the Bourne shell.
The following three snippets produce the same output.
sub here_one () {
my $weather = "sunny";
print $OUT <<"EOStr";
Oh great. It
is $weather today.
EOStr
}
sub here_two () {
my $weather = "sunny";
my $heredoc =<<"EOStr";
Oh great. It
is $weather today.
EOStr
print $OUT $heredoc;
}
sub no_here () {
my $weather = "sunny";
print $OUT "Oh great. It\n";
print $OUT "is $weather today.\n";
}
In the preceding examples, I use EOStr as a delimiter; as a rule of
thumb, the delimiter can be any string that does not appear in the here
document. Beware, the syntax is intolerant of extra spaces surrounding
the delimiter. In particular, at the start of the here document (i) do
not put a space after the <<, and (ii) remember to add a ;
(semicolon), and at the end (iii) the delimiter must be on a line by
itself without spaces.
Packages, Modules, Records and Objects in Perl
I have no plans to cover these topics in this introductory document.
Perhaps in a not-in-the-near future "Reusable Perl code in 10 pages"
document.
Revision History
Revision When Description
2000c 9 Jun 2000 Fixed errors (thanks AA and GGS). Added here strings.
2000b 19 Apr 2000 Very minor rewrites.
1999c ??? 1999 Added table of contents (by fixing ltoh ).
Feedback, motivation and afterthoughts
I wrote this document because I wish someone had done so when I was
learning Perl. I welcome any constructive feedback on this document.
This document © Russell W Quong, 1998,1999,2000. You may freely copy
and distribute this document so long as the copyright is left intact.
You may freely copy and post unaltered versions of this document in
HTML and Postscript formats on a web site or ftp site. Lastly, if you
do something injurious or stupid because of this document, I don't want
to know about it. Unless it's amusing.