
The prevalence of these issues in data transfer problems means that you'll probably end up writing some of your own validators on occasion to handle very specific date formats. Later sections of this chapter can provide additional assistance. For example, Recipe 10.30 covers conversion of two-digit year values to four-digit form, and Recipe 10.31 discusses how to perform validity checking on components of date or time values.

10.27 Using Patterns to Match Email Addresses and URLs

10.27.1 Problem

You want to determine whether or not a value looks like an email address or a URL.

10.27.2 Solution

Use a pattern, tuned to the level of strictness you want to enforce.

10.27.3 Discussion

The immediately preceding sections use patterns to identify classes of values such as numbers and dates, which are fairly typical applications for regular expressions. But pattern matching has such widespread applicability that it's impossible to list all the ways you can use it for data validation. To give some idea of a few other types of values that pattern matching can be used for, this section shows a few tests for email addresses and URLs.

To check values that are expected to be email addresses, the pattern should require at least an @ character with nonempty strings on either side:

    /.@./

That's a pretty minimal test. It's difficult to come up with a fully general pattern that covers all the legal values and rejects all the illegal ones, but it's easy to write a pattern that's at least a little more restrictive. [3] For example, in addition to being nonempty, the username and the domain name should consist entirely of characters other than @ characters or spaces:

    /^[^@ ]+@[^@ ]+$/

[3] To see how hard it can be to perform pattern matching for email addresses, check Appendix E in Jeffrey Friedl's Mastering Regular Expressions (O'Reilly).

You may also wish to require that the domain name part contain at least two parts separated by a dot:

    /^[^@ ]+@[^@ .]+\.[^@ .]+/

To look for URL values that begin with a protocol specifier of http://, ftp://, or mailto:, use an alternation that matches any of them at the beginning of the string. These values contain slashes, so it's easier to use a different character around the pattern to avoid having to escape the slashes with backslashes:

    m#^(http://|ftp://|mailto:)#i

The alternatives in the pattern are grouped within parentheses because otherwise the ^ will anchor only the first of them to the beginning of the string. The i modifier follows the pattern because protocol specifiers in URLs are not case sensitive. The pattern is otherwise fairly unrestrictive, because it allows anything to follow the protocol specifier. I leave it to you to add further restrictions as necessary.
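
To compare the strictness levels side by side, the tests discussed in this section can be wrapped in small Perl functions and run against a few sample values. This is only a sketch: the function names and the sample values are invented for illustration, and the pattern text reflects my reading of the descriptions above (an @ with something on each side, no @ or spaces in either part, a dotted domain, and a leading protocol specifier).

```perl
use strict;
use warnings;

# Hypothetical helper names, chosen for this sketch only.

# Loosest test: an @ character with at least one character on each side.
sub is_email_minimal {
    my $val = shift;
    return $val =~ /.@./ ? 1 : 0;
}

# Stricter: username and domain are nonempty and contain no @ or space.
sub is_email_no_spaces {
    my $val = shift;
    return $val =~ /^[^@ ]+@[^@ ]+$/ ? 1 : 0;
}

# Stricter still: the domain must begin with two parts separated by a dot.
sub is_email_dotted_domain {
    my $val = shift;
    return $val =~ /^[^@ ]+@[^@ .]+\.[^@ .]+/ ? 1 : 0;
}

# URL test: value begins with http://, ftp://, or mailto: in any lettercase.
sub has_url_protocol {
    my $val = shift;
    return $val =~ m#^(http://|ftp://|mailto:)#i ? 1 : 0;
}

# Illustrative sample values (assumptions, not taken from real data).
foreach my $val ('user@example.com', 'user@localhost', 'a b@example.com',
                 'HTTP://www.example.com/', 'mailto:user@example.com') {
    printf "%-26s minimal=%d no_spaces=%d dotted=%d url=%d\n", $val,
        is_email_minimal($val), is_email_no_spaces($val),
        is_email_dotted_domain($val), has_url_protocol($val);
}
```

Running each value through every test shows how a value such as user@localhost passes the looser email checks but fails once a dotted domain is required. That makes it easier to decide which level of strictness suits a given data set.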

10.28 Validation Using Table Metadata