
Earlier recipes in this chapter show how to work with the structural characteristics of files, by reading lines and breaking them up into separate columns. It's important to be able to do that, but sometimes you need to work with the data content of a file, not just its structure:

• It's often a good idea to validate data values to make sure they're legal for the column types into which you're storing them. For example, you can make sure that values intended for INT, DATE, and ENUM columns are integers, dates in CCYY-MM-DD format, and legal enumeration values.
• Data values may need reformatting. Rewriting dates from one format to another is especially common. For example, if you're importing a FileMaker Pro file into MySQL, you'll likely need to convert dates from MM-DD-YY format to ISO format. If you're going in the other direction, from MySQL to FileMaker Pro, you'll need to perform the inverse date transformation, as well as split DATETIME and TIMESTAMP columns into separate date and time columns.
• It may be necessary to recognize special values in the file. It's common to represent NULL with a value that does not otherwise occur in the file, such as -1, Unknown, or NA. If you don't want those values to be imported literally, you'll need to recognize and handle them specially.

This section begins a set of recipes that describe validation and reformatting techniques that are useful in these kinds of situations. Techniques covered here for checking values include direct comparison, pattern matching, and validation against information in a database. It's not unusual for certain validation operations to come up over and over, in which case you'll probably find it useful to construct a library of functions. Packaging validation operations as library routines makes it easier to write utilities based on them, and the utilities make it easier to perform command-line operations on entire files so you can avoid editing them yourself.
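As a rough illustration of the last two bullet points, the following sketch shows one way to recognize NULL placeholders and to rewrite MM-DD-YY dates in ISO format. The function names null_if_marker and mmddyy_to_iso and the placeholder list are hypothetical, chosen only for this example. The century rule (70-99 becomes 19xx, 00-69 becomes 20xx) is an assumption that mirrors how MySQL itself interprets two-digit years; your data may call for a different rule.

```perl
use strict;
use warnings;

# Placeholder strings that stand for NULL in the input file.
# This particular list is an assumption for illustration.
my %null_marker = map { $_ => 1 } ("-1", "Unknown", "NA");

# Return \N (which LOAD DATA reads as NULL when the default escape
# character is in effect) for a placeholder; otherwise return the
# value unchanged.
sub null_if_marker
{
    my $val = shift;

    return exists ($null_marker{$val}) ? "\\N" : $val;
}

# Convert MM-DD-YY to CCYY-MM-DD. Return undef if the value does not
# have the expected form. Assumed century rule: 70-99 => 19xx,
# 00-69 => 20xx.
sub mmddyy_to_iso
{
    my $val = shift;

    return undef unless $val =~ /^(\d{2})-(\d{2})-(\d{2})$/;
    my ($month, $day, $year) = ($1, $2, $3);
    $year += ($year >= 70 ? 1900 : 2000);
    return sprintf ("%04d-%02d-%02d", $year, $month, $day);
}
```

Note that mmddyy_to_iso checks only the form of the value. It does not verify that the month and day are in range, so a real utility would likely combine it with a date validity test.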

10.21.4 Writing an Input-Processing Loop

Many of the validation recipes shown in the next few sections are typical of those that you perform within the context of a program that reads a file and checks individual column values. The general form of such a file-processing utility can be written like this:

    #! /usr/bin/perl -w
    # loop.pl - Typical input-processing loop

    # Assumes tab-delimited, linefeed-terminated input lines.

    use strict;

    while (<>)      # read each line
    {
        chomp;
        # split line at tabs, preserving all fields
        my @val = split (/\t/, $_, 10000);
        for my $i (0 .. @val - 1)   # iterate through columns in line
        {
            # ... test $val[$i] here ...
        }
    }

    exit (0);

The while loop reads each input line. Inside the loop, each line is broken into fields. Then the inner for loop iterates through the fields in the line, allowing each one to be processed in sequence. If you're not applying a given test uniformly to all the fields, you'd replace the for loop with separate column-specific tests.

This loop assumes tab-delimited, linefeed-terminated input, an assumption that is shared by most of the utilities discussed throughout the rest of this chapter. To use these programs with datafiles in other formats, you may be able to convert them into tab-delimited format using the cvt_file.pl script discussed in Recipe 10.19.
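To make the idea of column-specific tests concrete, here is a minimal sketch in which the for loop is replaced by one test per column. The three-column layout (name, quantity, date), the check_line function, and the sample data are all assumptions made up for this example. The sample is read from an in-memory string so that the script runs as is; a real utility would read its input with while (<>) as shown above.

```perl
#! /usr/bin/perl -w
# check_cols.pl - hypothetical example of column-specific tests

use strict;

# Check one tab-delimited line (already chomped) and return a list of
# problem descriptions; the list is empty if the line passes all tests.
# The layout (name, quantity, date) is assumed for illustration only.
sub check_line
{
    my $line = shift;
    my @problems;

    # a negative limit makes split keep trailing empty fields
    my @val = split (/\t/, $line, -1);

    if (@val != 3)
    {
        push (@problems, "expected 3 fields, found " . scalar (@val));
        return @problems;
    }
    push (@problems, "column 1 is empty")
        if $val[0] eq "";
    push (@problems, "column 2 is not an integer")
        unless $val[1] =~ /^[-+]?\d+$/;
    push (@problems, "column 3 is not in CCYY-MM-DD format")
        unless $val[2] =~ /^\d{4}-\d{2}-\d{2}$/;
    return @problems;
}

# Sample input: the second line has a bad quantity and a bad date.
my $sample = "widget\t12\t2006-03-01\n"
           . "gadget\ttwelve\t03-01-06\n";

open (my $fh, "<", \$sample) or die "Cannot open sample data: $!\n";
while (my $line = <$fh>)
{
    chomp ($line);
    # $. holds the current input line number
    print "line $.: $_\n" foreach check_line ($line);
}
close ($fh);
```

Returning a list of problems, rather than stopping at the first one, lets the utility report everything wrong with a line in a single pass over the file.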

10.21.5 Putting Common Tests in Libraries

For a test that you perform often, it may be useful to package it as a library function. This makes the operation easy to perform, and also gives it a name that's likely to make the meaning of the operation clearer than the comparison code itself. For example, the following test performs a pattern match to check that $val consists entirely of digits (optionally preceded by a plus sign), then makes sure the value is greater than zero:

    $valid = ($val =~ /^\+?\d+$/ && $val > 0);

In other words, the test looks for strings that represent positive integers. To make the test easier to use and its intent clearer, you might put it into a function that is used like this:

    $valid = is_positive_integer ($val);

The function itself can be defined as follows:

    sub is_positive_integer
    {
        my $s = shift;

        return ($s =~ /^\+?\d+$/ && $s > 0);
    }

Then put the function definition into a library file so that multiple scripts can use it easily. The Cookbook_Utils.pm module file in the lib directory of the recipes distribution is an example of a library file that contains a number of validation functions. Take a look through it to see which functions may be useful to you in your own programs (or as a model for writing your own library files). To gain access to this module from within a script, include a use statement like this:

    use Cookbook_Utils;

You must of course install the module file in a directory where Perl will find it. For details on library installation, see Recipe 2.4.

A significant benefit of putting a collection of utility routines into a library file is that you can use it for all kinds of programs. It's rare for a data manipulation problem to be completely unique. If you can pick and choose at least a few validation routines from a library, it's possible to reduce the amount of code you need to write, even for highly specialized programs.
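If you want to start a library file of your own, the following sketch shows one possible layout. The module name My_Validators is hypothetical, and the file is not a copy of Cookbook_Utils.pm; it only illustrates the general shape of a Perl module that exports a validation function by means of the standard Exporter module. The added defined test is an extra safeguard, not part of the function shown earlier.

```perl
# My_Validators.pm - hypothetical library file of validation functions

package My_Validators;

use strict;
use warnings;
use Exporter 'import';

# Functions that a script may ask for by name
our @EXPORT_OK = qw(is_positive_integer);

# Return true if the argument is a string of digits, optionally
# preceded by a plus sign, that represents a value greater than zero.
sub is_positive_integer
{
    my $s = shift;

    return (defined ($s) && $s =~ /^\+?\d+$/ && $s > 0);
}

1;  # a module file must return a true value
```

With the file installed where Perl can find it, a script would request the function with a statement such as use My_Validators qw(is_positive_integer); and then call is_positive_integer ($val) just as shown earlier.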

10.22 Validation by Direct Comparison