php resource centre

PHP Regular Expressions

admin — Thu, 19/10/2006 - 4:14pm

A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject.



The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.

There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets. Outside square brackets, the meta-characters are as follows:


\  general escape character with several uses

^  assert start of subject (or line, in multiline mode)

$  assert end of subject (or line, in multiline mode)

.  match any character except newline (by default)

[  start character class definition

]  end character class definition

|  start of alternative branch

(  start subpattern

)  end subpattern

?  extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer

*  0 or more quantifier

+  1 or more quantifier

{  start min/max quantifier

}  end min/max quantifier
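The meta-characters listed above can be tried out with preg_match(). The following is a minimal sketch; the patterns and subject strings are invented for illustration and are not part of the original article.

```php
<?php
// Pattern => sample subject. The subjects are invented for this sketch.
$checks = [
    '/^cat/'     => 'catalog', // ^ anchors at the start: matches
    '/cat$/'     => 'bobcat',  // $ anchors at the end: matches
    '/c.t/'      => 'cut',     // . stands for any one character: matches
    '/cat|dog/'  => 'hotdog',  // | offers an alternative: matches
    '/colou?r/'  => 'color',   // ? makes "u" optional: matches
    '/ab*c/'     => 'ac',      // * allows zero "b": matches
    '/ab+c/'     => 'ac',      // + needs at least one "b": no match
    '/^a{2,3}$/' => 'aaaa',    // {2,3} allows two or three "a": no match
    '/^(ab)+$/'  => 'ababab',  // ( ) groups "ab" so + repeats the pair: matches
];

$results = [];
foreach ($checks as $pattern => $subject) {
    // preg_match() returns 1 for a match and 0 for no match.
    $results[$pattern] = preg_match($pattern, $subject);
    printf("%-12s %-8s %s\n", $pattern, $subject, $results[$pattern] ? 'match' : 'no match');
}
```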

Within square brackets, you can list characters that you want to be accepted or rejected:

[abc] - will match a or b or c

[^abc] - will match anything BUT a or b or c - i.e. any character that is not a or b or c

You can include ranges of characters:

[a-zA-Z] - will match a-z letters in both upper and lower case

[0-9] - will match digits 0-9

If you want to match a literal hyphen, place it first or last in the class (or escape it with a backslash) so that it is not read as a range:

[a-z-] - will match a-z in lowercase only, and hyphens
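A short sketch of these character classes in use. The anchors ^ and $ are added here so that the whole subject has to consist of characters from the class; the sample strings are made up.

```php
<?php
// Each entry: [pattern, subject]. Sample subjects are invented for illustration.
$classChecks = [
    'set'      => ['/^[abc]$/',     'b'],          // "b" is in the set
    'negated'  => ['/^[^abc]$/',    'b'],          // "b" is excluded by the negated set
    'letters'  => ['/^[a-zA-Z]+$/', 'Hello'],      // letters in either case
    'digits'   => ['/^[0-9]+$/',    '2006'],       // digits only
    'hyphen'   => ['/^[a-z-]+$/',   'well-known'], // lowercase letters and hyphens
    'no_upper' => ['/^[a-z-]+$/',   'Well-Known'], // uppercase is not in the class
];

$classResults = [];
foreach ($classChecks as $name => $check) {
    $classResults[$name] = preg_match($check[0], $check[1]);
    echo $check[0], ' on "', $check[1], '": ', $classResults[$name] ? 'match' : 'no match', "\n";
}
```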

There are other character classes that are shorthand:

\s - matches whitespace characters: the space, \n and \r (both line breaks), \t (tabs) and \f (form feeds, not used often). Equivalent to [ \n\r\t\f]

\S - matches all non-whitespace characters. Equivalent to [^ \n\r\t\f]

\d - matches 0-9. Equivalent to [0-9]

\D - matches all non-digits. Equivalent to [^0-9]

\w - matches a word character: a-z, A-Z, 0-9 and the underscore. Equivalent to [a-zA-Z0-9_]

\W - matches all non-word characters. Equivalent to [^a-zA-Z0-9_]

. - the period matches any character apart from newlines. If you wish to match an actual period, you need to escape it with a backslash: \.
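The shorthand classes can be sketched with preg_match_all() and preg_split(). The sample strings below are invented for illustration. Note that in PHP the shorthands are written with a leading backslash, and that \w also accepts the underscore.

```php
<?php
// Sample text: the \t inside double quotes is a real tab character.
$line = "Order 66\tshipped on 2006-10-19";

// \d+ collects every run of digits.
preg_match_all('/\d+/', $line, $digitRuns);
print_r($digitRuns[0]);   // 66, 2006, 10, 19

// \s+ splits on any run of whitespace (spaces and the tab here).
$words = preg_split('/\s+/', $line);
print_r($words);          // Order, 66, shipped, on, 2006-10-19

// \w+ collects runs of letters, digits and underscores.
preg_match_all('/\w+/', 'user_name@example.com', $wordRuns);
print_r($wordRuns[0]);    // user_name, example, com

// An unescaped dot matches any character; an escaped dot only a literal period.
$anyChar     = preg_match('/^a.c$/',  'abc'); // 1
$literalMiss = preg_match('/^a\.c$/', 'abc'); // 0
$literalHit  = preg_match('/^a\.c$/', 'a.c'); // 1
```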

As an example, suppose we want to validate an email address as a user would enter it in a form (not the complete specification, which also allows aliases and so on).

So for a start we define:

1. It starts with a sequence of alphanumeric characters plus hyphen, underscore and dot.

2. This is followed by the @ symbol, after which comes a sequence of alphanumeric characters plus hyphen.

3. Then we have a dot followed by the TLD suffix, which may consist of 2 to 6 letters.

As regex patterns, these parts could be expressed as:

^[-_.a-zA-Z0-9]+

@

[-a-zA-Z0-9]+

\.

[a-zA-Z]{2,6}$


Putting these together, we get:

"/^[-_.a-zA-Z0-9]+@[-a-zA-Z0-9]+\.[a-zA-Z]{2,6}$/"

Or slightly shorter (but possibly slightly less efficient, too) when we decide to match case-insensitively:

"/^[-_.a-z0-9]+@[-a-z0-9]+\.[a-z]{2,6}$/i"

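As a quick check, the case-insensitive pattern can be run against a few made-up addresses. This is only a sketch; the sample addresses are assumptions chosen for illustration.

```php
<?php
$simplePattern = '/^[-_.a-z0-9]+@[-a-z0-9]+\.[a-z]{2,6}$/i';

// Invented sample input.
$samples = [
    'yourname@domain.com',  // plain address: match
    'Your.Name@Domain.COM', // mixed case is accepted because of the i modifier
    'yourname@domain',      // no dot and suffix after the domain: no match
    'yourname.domain.com',  // no @ symbol: no match
];

$simpleResults = [];
foreach ($samples as $address) {
    $simpleResults[$address] = preg_match($simplePattern, $address);
    echo $address, ' => ', $simpleResults[$address] ? 'match' : 'no match', "\n";
}
```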
But this pattern would reject valid addresses, because it does not yet allow for subdomains or additional suffixes, as in 'yourname@domain.co.in'. To fix this, we break down the domain part, treating the dot as "special" because it separates the "normal" parts (subdomains and domain). The pattern is then: "normal", optionally followed by "special" and "normal" again, where the optional part may repeat. Our new domain part becomes:


[-a-z0-9]+(\.[-a-z0-9]+)*


Now our pattern looks like this:


"/^[-_.a-z0-9]+@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/i"

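The effect of the repeating domain group can be seen by comparing the two patterns on the same address. A minimal sketch; the variable names are hypothetical.

```php
<?php
// Pattern without and with the repeating "(\.[-a-z0-9]+)*" group.
$simplePattern   = '/^[-_.a-z0-9]+@[-a-z0-9]+\.[a-z]{2,6}$/i';
$extendedPattern = '/^[-_.a-z0-9]+@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/i';

$address = 'yourname@domain.co.in';

$simpleHit   = preg_match($simplePattern, $address);   // 0: ".co.in" does not fit a single suffix
$extendedHit = preg_match($extendedPattern, $address); // 1: ".co" is consumed by the repeating group

// A single-suffix address still matches, with the group repeated zero times.
$plainHit = preg_match($extendedPattern, 'yourname@domain.com'); // 1

echo "simple: $simpleHit, extended: $extendedHit, plain: $plainHit\n";
```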

Treating the first part the same way to reject email addresses with a leading dot or a sequence of dots, our pattern would become:


"/^[-_a-z0-9]+(\.[-_a-z0-9]+)*@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/i"

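To illustrate what the stricter local part changes, the sketch below runs the previous and the final pattern against addresses with misplaced dots. The addresses are invented for illustration.

```php
<?php
$previousPattern = '/^[-_.a-z0-9]+@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/i';
$finalPattern    = '/^[-_a-z0-9]+(\.[-_a-z0-9]+)*@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/i';

// Invented sample input.
$dotSamples = [
    'your.name@domain.com',  // single dots between "normal" parts
    '.yourname@domain.com',  // leading dot
    'your..name@domain.com', // sequence of dots
];

$comparison = [];
foreach ($dotSamples as $address) {
    $comparison[$address] = [
        'previous' => preg_match($previousPattern, $address),
        'final'    => preg_match($finalPattern, $address),
    ];
    printf("%-24s previous: %d  final: %d\n", $address,
        $comparison[$address]['previous'], $comparison[$address]['final']);
}
```

Only the first address passes the final pattern, while the previous pattern accepts all three because the dot is an ordinary member of its first character class.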

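So the final script looks like this:

<?php

// $email is assumed to hold the address submitted by the user.
$email = strtolower($email);

// preg_match() returns 1 when the pattern matches and 0 when it does not.
if (!preg_match('/^[-_a-z0-9]+(\.[-_a-z0-9]+)*@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/', $email)) {
    echo "Sorry, invalid email address.";
} else {
    echo ":)";
}

?>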
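If the check is needed in more than one place, it could be wrapped in a function. The sketch below is one possible way to do that; the name isPlausibleEmail() is hypothetical, and the trim() call and the i modifier (in place of strtolower()) are assumptions of this sketch rather than part of the script above.

```php
<?php
/**
 * Hypothetical helper: returns true when $email fits the pattern built above.
 */
function isPlausibleEmail(string $email): bool
{
    $pattern = '/^[-_a-z0-9]+(\.[-_a-z0-9]+)*@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/i';

    // preg_match() returns 1 on a match, 0 on no match and false on error,
    // so compare against 1 explicitly instead of relying on truthiness.
    return preg_match($pattern, trim($email)) === 1;
}

// Invented sample input.
foreach (['yourname@domain.co.in', ' Your.Name@Domain.com ', 'yourname@domain'] as $candidate) {
    echo var_export($candidate, true), ' => ',
        isPlausibleEmail($candidate) ? 'accepted' : 'rejected', "\n";
}
```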

 

copyright © 2010 Vipin Chandran