CIS 2.55 - Notes 0006

Regular expressions are probably the most important and most often used feature of Perl. They are what make Perl a Practical Extraction and Report Language.

Regular expressions are patterns that you specify and then look for in other strings. You can use them to match data and to do substitutions (replace whatever matches this pattern with that string, etc.).

They are really the key to the usefulness of Perl.

What are Regular Expressions? (really short review of theory stuff)

Strictly speaking, regular expressions are a way to specify regular languages. Regular languages are sets of strings that can be recognized by a fairly simple device called a finite automaton (a machine that only needs a finite amount of memory to recognize a string). So, you can specify a regular language either by writing a regular expression or by defining such a simple machine (both ways are equivalent).

Heh? What is all this?

Compilers and just about all text processors use modules called scanners and parsers. Scanners break up the input into tokens, and parsers build parse trees from those tokens. Why do we need to know this? Well, scanners accomplish their goal of breaking input into tokens through the use of finite automata, which are specified by a programmer using a set of regular expressions.

For a parser to recognize an if statement, the input must first pass through a scanner, which matches a regular expression (maybe something like "if") against the input text (containing if) and reports that an if keyword has been read in. Sounds simple, doesn't it?
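As a rough illustration of the scanner idea, here is a toy tokenizer. It is only a sketch: the input string, the token kinds, and the pattern are all made up for this example, and a real scanner is usually generated from a much more careful set of expressions. It also uses a few pattern features that are covered later in these notes.

```perl
use strict;
use warnings;

my $source = "if (x) y = 42;";   # made-up input text

# One pattern describes every token: a run of word characters,
# or a single character that is neither a word character nor whitespace.
my @tokens = $source =~ /(\w+|[^\w\s])/g;

for my $tok (@tokens) {
    # Classify each token with further (very simple) patterns.
    my $kind = $tok eq "if"    ? "KEYWORD"
             : $tok =~ /^\d+$/ ? "NUMBER"
             : $tok =~ /^\w+$/ ? "IDENTIFIER"
             :                   "PUNCTUATION";
    print "$kind: $tok\n";
}
```

The scanner only labels the pieces. Deciding whether those pieces form a valid if statement is the parser's job.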

Regular Expressions do have their limits though. They cannot go beyond specifying regular languages. For example, languages like C, Perl, Java, and natural languages are not regular languages (they're quite a bit more complex). So, while we can use regular expressions to extract and recognize simple language constructs, we cannot use them to build a whole compiler.

Knowing the limits of regular expressions is key to understanding how and when to use them best.

Anyway, let's move on to a practical example.

Recognition

The primary use of regular expressions is to recognize patterns in strings (scalars). So, for example, if we wanted to check whether there is a "bob" inside some scalar named $foo, we might write code like this:

$foo = "this is bob's scalar";

if($foo =~ m/bob/){
   print "foo has bob\n";
}

This looks for bob in $foo. Notice that the string does contain "bob" ("this is bob's scalar"). Regular expressions can be anchored to match only at the beginning (with ^) or only at the end (with $) of the string; by default they'll look through the entire string for the pattern.
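Here is a small sketch of matching at the beginning and at the end of the string, using the ^ and $ characters. The variable names holding the results are just for illustration.

```perl
use strict;
use warnings;

my $foo = "this is bob's scalar";

# ^ ties the pattern to the start of the string.
my $starts_with_this = ($foo =~ /^this/) ? 1 : 0;

# $ ties the pattern to the end of the string.
my $ends_with_scalar = ($foo =~ /scalar$/) ? 1 : 0;

# "bob" sits in the middle, so tying it to the start fails.
my $starts_with_bob = ($foo =~ /^bob/) ? 1 : 0;

print "starts with this:   $starts_with_this\n";
print "ends with scalar:   $ends_with_scalar\n";
print "starts with bob:    $starts_with_bob\n";
```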

And just as with everything else in Perl, you can rewrite the above in several different ways. For one, the m is not necessary when the pattern is delimited by slashes. So you can turn the code into:

if($foo =~ /bob/){
   print "foo has bob\n";
}

The $foo itself is not necessary if we save the string to the $_ scalar:

$_ = "this is bob's scalar";

if(/bob/){
   print "foo has bob\n";
}

This is the style of code that you'll often see in Perl (everything is done using the default $_ scalar).

Now, in the previous examples, we've used the =~ operator, which seems obvious enough (does it match?). There is also the !~ operator, which does the opposite: it succeeds if there is no match. You can try that out on your own.
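If you want a starting point for trying out !~, something like this should do (the name "alice" and the $no_alice flag are made up for the example):

```perl
use strict;
use warnings;

my $foo = "this is bob's scalar";
my $no_alice = 0;

# !~ succeeds only when the pattern is NOT found in the string.
if ($foo !~ /alice/) {
    $no_alice = 1;
    print "foo has no alice\n";
}
```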

There are also other minor details, like the fact that you don't have to use /pattern/, but can actually use things like m|pattern| (or for that matter, almost any punctuation character after m, as long as the same character closes the pattern on the other side; this is useful for matching file paths).
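To see why a different delimiter helps with file paths, compare the two matches in this sketch. The path is hypothetical and used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical path, used only for illustration.
my $path = "/usr/local/bin/perl";

# With / as the delimiter, every slash inside the pattern needs a backslash.
my $with_slashes = ($path =~ /\/usr\/local\//) ? 1 : 0;

# With | as the delimiter, the slashes can stay as they are.
my $with_pipes = ($path =~ m|/usr/local/|) ? 1 : 0;

print "slash-delimited match: $with_slashes\n";
print "pipe-delimited match:  $with_pipes\n";
```

Both patterns match the same thing; the second one is just easier to read.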

So what exactly are regular expressions?

Regular expressions are patterns. How do we specify them, you ask? Well, patterns follow a rule. Whatever the character is in the pattern, that's the character matched, except for a few special characters.

What that means is that if you wanted to find the string "hi, what a wonderful day" in a scalar (here the default $_), all you'd have to do is:

$_ = "Bill! hi, what a wonderful day this is.";

if(/hi, what a wonderful day/){
   print "foo has hi, what a wonderful day\n";
}

Notice that the pattern is word for word what we're looking for. Sometimes it is this simple, sometimes not. The difficulty comes in when we introduce special (or wild) characters. The short list includes:

\ | ( ) [ { ^ $ * + ? .

These have a special effect. Instead of representing themselves in strings, they cause patterns to have special features.
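To match one of these characters as itself, put a backslash in front of it. A minimal sketch using the . character, which normally matches any single character (other than a newline); the sample strings are made up:

```perl
use strict;
use warnings;

my $wrong = "total: 3x50";   # made-up string without a real "3.50"
my $right = "total: 3.50";

# Unescaped, . is a wildcard, so this pattern also matches "3x50".
my $loose = ($wrong =~ /3.50/) ? 1 : 0;

# Escaped with a backslash, . only matches a literal dot.
my $literal_on_wrong = ($wrong =~ /3\.50/) ? 1 : 0;
my $literal_on_right = ($right =~ /3\.50/) ? 1 : 0;

print "wildcard dot on 3x50: $loose\n";
print "literal dot on 3x50:  $literal_on_wrong\n";
print "literal dot on 3.50:  $literal_on_right\n";
```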

For example, the * (formally known as the Kleene star) lets the character right before it repeat zero or more times. So for example, a pattern such as:

/0*/

Would try to match zero or more 0s in a given scalar. Now, this will always succeed. Why?

Because we are matching zero or more times, which means that even a string that doesn't have any 0s will match that pattern.

Now, there is a similar operator +, which matches 1 or more times. So the thing MUST exist. If we apply it to our pattern:

/0+/

This will only match scalars with at least one 0. If there are more 0s in a row, it will match more, but it needs at least one.
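The difference between * and + is easy to check for yourself. In this sketch the two sample strings and the hash names are invented for the example:

```perl
use strict;
use warnings;

my @samples = ("1000", "abc");   # illustrative strings

my %star_match;
my %plus_match;
for my $s (@samples) {
    # 0* can match zero 0s, so it succeeds on any string at all.
    $star_match{$s} = ($s =~ /0*/) ? 1 : 0;

    # 0+ needs at least one 0 somewhere in the string.
    $plus_match{$s} = ($s =~ /0+/) ? 1 : 0;

    print "$s: star=$star_match{$s} plus=$plus_match{$s}\n";
}
```

Only the + pattern tells the two strings apart, which is why /0*/ on its own is rarely what you want.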

We can also use scalars inside our expressions. For example, let's say we want to match some name in some string. How can we do that? Well...

$somestring = "Hi John, How are you?";
$name = "John";

if($somestring =~ /$name/){
   print "found $name in $somestring\n";
}

The value of $name is interpolated into the pattern and is used as a regular expression, so any special characters it contains keep their special meaning.
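One thing to watch for: if the scalar holds special characters and you want them matched literally, you can wrap it in \Q and \E. A small sketch, with made-up strings:

```perl
use strict;
use warnings;

my $somestring = "code 3x50";   # made-up text; note there is no "3.50" in it
my $needle     = "3.50";        # contains the special character .

# Interpolated as-is, the . acts as a wildcard and matches the x.
my $unquoted = ($somestring =~ /$needle/) ? 1 : 0;

# \Q...\E quotes the special characters in the interpolated text,
# so only a literal "3.50" would match.
my $quoted = ($somestring =~ /\Q$needle\E/) ? 1 : 0;

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
```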

Modifiers

We can also modify the way our regular expression does the matching. For example, we can make it case insensitive; let's take our old example:

$somestring = "Hi John, How are you?";
$name = "john";

if($somestring =~ /$name/){
   print "found $name in $somestring\n";
}

This will not match. Notice that $name is now "john" (lower case), while in $somestring the name is John (capitalized).

Anyway, by simply putting an i at the end of the regular expression (right after the closing slash), we cause it to ignore case:

$somestring = "Hi John, How are you?";
$name = "john";

if($somestring =~ /$name/i){
   print "found $name in $somestring\n";
}

There are several other useful modifiers and I'll expect you to read about them yourself.

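One other useful one that you should know is /g, which makes the match global: it finds every occurrence of the pattern rather than stopping at the first one.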
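As a quick sketch of what /g does for matching, here it is used to collect every occurrence of a pattern (the sample sentence is made up):

```perl
use strict;
use warnings;

my $somestring = "John met John's friend John";   # made-up sentence

# In list context, a /g match returns every match in the string,
# not just the first one.
my @found = $somestring =~ /John/g;
my $count = scalar @found;

print "John appears $count times\n";
```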

Substitution

Substitution refers to the idea that we can replace some text with other text, using a pattern to select exactly what text to replace. Let's say we are given a string:

$somestring = "Hi John, How are you?";

And we'd like to replace John with Jane. How would we do this? We try to match John, and once matched, we replace it with Jane (fairly straightforward?).

$somestring = "Hi John, How are you?";

print "somestring is: $somestring\n";

$somestring =~ s/John/Jane/;

print "somestring is: $somestring\n";

This produces the expected output of:

somestring is: Hi John, How are you?
somestring is: Hi Jane, How are you?

We can replace just about anything with anything.
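The modifiers from earlier work with substitution too. Here is a sketch that combines /g and /i; the sentence and the variable names are invented for the example:

```perl
use strict;
use warnings;

my $somestring = "Hi John, how is john's brother John?";   # made-up sentence

# Work on copies so the original string is left untouched.
my $first_only = $somestring;
$first_only =~ s/John/Jane/;        # no modifiers: only the first John changes

my $every = $somestring;
# g replaces every match, i ignores case; s/// returns how many it replaced.
my $replaced = ($every =~ s/john/Jane/gi);

print "$first_only\n";
print "$every\n";
print "replaced $replaced names\n";
```

Without /g only the first match is replaced, which is easy to forget when a string has more than one.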