
You must of course install the module file in a directory where Perl will find it. For details on library installation, see Recipe 2.4.

A significant benefit of putting a collection of utility routines into a library file is that you can use it for all kinds of programs. It's rare for a data manipulation problem to be completely unique. If you can pick and choose at least a few validation routines from a library, it's possible to reduce the amount of code you need to write, even for highly specialized programs.

10.22 Validation by Direct Comparison

10.22.1 Problem

You need to make sure a value is equal to or not equal to some specific value, or that it lies within a given range of values.

10.22.2 Solution

Perform a direct comparison.

10.22.3 Discussion

The simplest kind of validation is to perform comparisons against specific literal values:

    # require a nonempty value
    $valid = ($val ne "");
    # require a specific nonempty value
    $valid = ($val eq "abc");
    # require one of several values
    $valid = ($val eq "abc" || $val eq "def" || $val eq "xyz");
    # require value in particular range (1 to 10)
    $valid = ($val >= 1 && $val <= 10);

Most of those tests perform string comparisons. The last is a numeric comparison; however, a numeric comparison often is preceded by preliminary tests to verify first that the value doesn't contain non-numeric characters. Pattern testing, discussed in the next section, is one such way to do that.

String comparisons are case sensitive by default. To make a comparison case insensitive, convert both operands to the same lettercase:

    # require a specific nonempty value in case-insensitive fashion
    $valid = (lc ($val) eq lc ("AbC"));
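
The comparisons above can be wrapped in small reusable routines. The following is only a sketch: the names is_one_of and is_int_in_range are hypothetical helpers invented for this illustration, not functions from the recipes distribution. The range test assumes unsigned integer input and checks for digits before comparing numerically.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of the recipes distribution):
# true if $val equals one of the allowed strings, ignoring lettercase.
sub is_one_of
{
  my ($val, @allowed) = @_;
  return 0 unless defined $val;
  foreach my $candidate (@allowed)
  {
    return 1 if lc ($val) eq lc ($candidate);
  }
  return 0;
}

# Hypothetical helper: true if $val is an unsigned integer from $min to $max.
# The digit test runs first so that the numeric comparison is never applied
# to a value containing non-numeric characters.
sub is_int_in_range
{
  my ($val, $min, $max) = @_;
  return 0 unless defined $val && $val =~ /^\d+$/;
  return ($val >= $min && $val <= $max) ? 1 : 0;
}

print is_one_of ("ABC", "abc", "def", "xyz") ? "ok\n" : "bad\n";  # prints ok
print is_int_in_range ("7", 1, 10) ? "ok\n" : "bad\n";            # prints ok
print is_int_in_range ("7x", 1, 10) ? "ok\n" : "bad\n";           # prints bad
```

Putting the digit check inside the range routine keeps callers from having to remember the preliminary test each time.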

10.23 Validation by Pattern Matching

10.23.1 Problem

You need to compare a value to a set of values that is difficult to specify literally without writing a really ugly expression.

10.23.2 Solution

Use pattern matching.

10.23.3 Discussion

Pattern matching is a powerful tool for validation because it allows you to test entire classes of values with a single expression. You can also use pattern tests to break up matched values into subparts for further individual testing, or in substitution operations to rewrite matched values. For example, you might break up a matched date into pieces so that you can verify that the month is in the range from 1 to 12 and the day is within the number of days in the month. Or you might use a substitution to reorder MM-DD-YY or DD-MM-YY values into YY-MM-DD format.

The next few sections describe how to use patterns to test for several types of values, but first let's take a quick tour of some general pattern-matching principles. The following discussion focuses on Perl's regular expression capabilities. Pattern matching in PHP and Python is similar, though you should consult the relevant documentation for any differences. For Java, the ORO pattern matching class library offers Perl-style pattern matching; Appendix A indicates where you can get it.

In Perl, the pattern constructor is /pat/:

    $it_matched = ($val =~ /pat/);     # pattern match

Put an i after the /pat/ constructor to make the pattern match case insensitive:

    $it_matched = ($val =~ /pat/i);    # case-insensitive match

To use a character other than slash, begin the constructor with m. This can be useful if the pattern itself contains slashes:

    $it_matched = ($val =~ m|pat|);    # alternate constructor character

To look for a non-match, replace the =~ operator with the !~ operator:

    $no_match = ($val !~ /pat/);       # negated pattern match

To perform a substitution in $val based on a pattern match, use s/pat/replacement/. If pat occurs within $val, it's replaced by replacement. To perform a case-insensitive match, put an i after the last slash.
To perform a global substitution that replaces all instances of pat rather than just the first one, add a g after the last slash:

    $val =~ s/pat/replacement/;      # substitution
    $val =~ s/pat/replacement/i;     # case-insensitive substitution
    $val =~ s/pat/replacement/g;     # global substitution
    $val =~ s/pat/replacement/ig;    # case-insensitive and global

Here's a list of some of the special pattern elements available in Perl regular expressions:

    Pattern     What the pattern matches
    ^           Beginning of string
    $           End of string
    .           Any character (except newline, by default)
    \s, \S      Whitespace or non-whitespace character
    \d, \D      Digit or non-digit character
    \w, \W      Word (alphanumeric or underscore) or non-word character
    [...]       Any character listed between the square brackets
    [^...]      Any character not listed between the square brackets
    p1|p2|p3    Alternation; matches any of the patterns p1, p2, or p3
    *           Zero or more instances of preceding element
    +           One or more instances of preceding element
    {n}         n instances of preceding element
    {m,n}       m through n instances of preceding element

Many of these pattern elements are the same as those available for MySQL's REGEXP regular expression operator. (See Recipe 4.8.)

To match a literal instance of a character that is special within patterns, such as *, ^, or $, precede it with a backslash. Similarly, to include a character within a character class construction that is special in character classes ([, ], or -), precede it with a backslash. To include a literal ^ in a character class, list it somewhere other than as the first character between the brackets.

Many of the validation patterns shown in the following sections are of the form /^pat$/. Beginning and ending a pattern with ^ and $ has the effect of requiring pat to match the entire string that you're testing.
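
As a rough illustration of what anchoring and backslash escaping change, the sketch below compares an unanchored digit test with an anchored one and matches a literal * character. The helper names contains_digits, is_all_digits, and is_product are hypothetical, introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: unanchored test, true if digits occur anywhere in $val.
sub contains_digits
{
  my $val = shift;
  return ($val =~ /\d+/) ? 1 : 0;
}

# Hypothetical helper: anchored test, true only if all of $val is digits.
sub is_all_digits
{
  my $val = shift;
  return ($val =~ /^\d+$/) ? 1 : 0;
}

# Hypothetical helper: the backslash makes * a literal character rather than
# a repetition operator, so this matches values such as 3*4.
sub is_product
{
  my $val = shift;
  return ($val =~ /^\d+\*\d+$/) ? 1 : 0;
}

foreach my $val ("42", "abc42def", "", "4 2")
{
  printf "%-12s contains=%d entire=%d\n",
         "'$val'", contains_digits ($val), is_all_digits ($val);
}
print is_product ("3*4") ? "product\n" : "not a product\n";
```
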
Anchoring in this way is common in data validation contexts, because it's generally desirable to know that a pattern matches an entire input value, not just part of it. If you want to be sure that a value represents an integer, for example, it doesn't do you any good to know only that it contains an integer somewhere. This is not a hard-and-fast rule, however, and sometimes it's useful to perform a more relaxed test by omitting the ^ and $ characters as appropriate. For example, if you want to strip leading and trailing whitespace from a value, use one pattern anchored only to the beginning of the string, and another anchored only to the end:

    $val =~ s/^\s+//;    # trim leading whitespace
    $val =~ s/\s+$//;    # trim trailing whitespace

That's such a common operation, in fact, that it's a good candidate for being placed into a utility function. The Cookbook_Utils.pm file contains a function trim_whitespace() that performs both substitutions and returns the result:

    $val = trim_whitespace ($val);

To remember subsections of a string that is matched by a pattern, use parentheses around the relevant parts of the pattern. After a successful match, you can refer to the matched substrings using the variables $1, $2, and so forth:

    if ("abcdef" =~ /^(ab)(.*)$/)
    {
      $first_part = $1;    # this will be ab
      $the_rest = $2;      # this will be cdef
    }

To indicate that an element within a pattern is optional, follow it by a ? character. To match values consisting of a sequence of digits, optionally beginning with a minus sign, and optionally ending with a period, use this pattern:

    /^-?\d+\.?$/

You can also use parentheses to group alternations within a pattern. The following pattern matches time values in hh:mm format, optionally followed by AM or PM:

    /^\d{1,2}:\d{2}\s*(AM|PM)?$/i

The use of parentheses in that pattern also has the side-effect of remembering the optional part in $1.
To suppress that side-effect, use (?:pat) instead:

    /^\d{1,2}:\d{2}\s*(?:AM|PM)?$/i

That's sufficient background in Perl pattern matching to allow construction of useful validation tests for several types of data values. The following sections provide patterns that can be used to test for broad content types, numbers, temporal values, and email addresses or URLs.

The transfer directory of the recipes distribution contains a test_pat.pl script that reads input values, matches them against several patterns, and reports which patterns each value matches. The script is easily extensible, so you can use it as a test harness to try out your own patterns.
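
To tie the pieces together, here is a minimal sketch of using parenthesized subpatterns to break a value into parts for further checking. Both routines, reorder_mmddyy and split_time, are hypothetical helpers written for this illustration and are not part of the recipes distribution. The date check assumes two-digit fields and, for brevity, always allows 29 days in February rather than testing for leap years.

```perl
use strict;
use warnings;

# Hypothetical helper: convert an MM-DD-YY value to YY-MM-DD.
# Returns undef if the value doesn't have that shape, or if the month
# or day is out of range. Assumption: leap years are not checked, so
# February is always allowed 29 days.
sub reorder_mmddyy
{
  my $val = shift;
  my @days_in_month = (31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);

  return undef unless defined $val && $val =~ /^(\d{2})-(\d{2})-(\d{2})$/;
  my ($month, $day, $year) = ($1, $2, $3);
  return undef if $month < 1 || $month > 12;
  return undef if $day < 1 || $day > $days_in_month[$month - 1];
  return "$year-$month-$day";
}

# Hypothetical helper: split an hh:mm value, optionally followed by AM or PM,
# into (hour, minute, marker). The marker is "" when absent. Returns an
# empty list if the value doesn't match. The hour and minute ranges are
# not checked here.
sub split_time
{
  my $val = shift;
  return () unless defined $val && $val =~ /^(\d{1,2}):(\d{2})\s*(AM|PM)?$/i;
  my ($hour, $minute, $marker) = ($1, $2, $3);
  return ($hour, $minute, defined $marker ? uc ($marker) : "");
}

print reorder_mmddyy ("12-25-99"), "\n";        # prints 99-12-25
print join (",", split_time ("9:05 pm")), "\n"; # prints 9,05,PM
```

The time pattern keeps (AM|PM) as a capturing group because the marker is wanted in $3; when only a yes-or-no answer is needed, the (?:AM|PM) form avoids the extra capture.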

10.24 Using Patterns to Match Broad Content Types