
want to pull out all but the password column from the colon-delimited /etc/passwd file and write the result in CSV format. Use cvt_file.pl both to preprocess /etc/passwd into tab-delimited format for yank_col.pl and to post-process the extracted columns into CSV format:

    cvt_file.pl --idelim=: /etc/passwd \
      | yank_col.pl --columns=1,3-7 \
      | cvt_file.pl --oformat=csv > passwd.csv

If you don't want to type all of that as one long command, use temporary files for the intermediate steps:

    cvt_file.pl --idelim=: /etc/passwd > tmp1
    yank_col.pl --columns=1,3-7 tmp1 > tmp2
    cvt_file.pl --oformat=csv tmp2 > passwd.csv
    rm tmp1 tmp2

Forcing split to Return Every Field

The Perl split function is extremely useful, but normally it doesn't return trailing empty fields. This means that if you write out only as many fields as split returns, output lines may not have the same number of fields as input lines. To avoid this problem, pass a third argument to indicate the maximum number of fields to return. This forces split to return as many fields as are actually present on the line, or the number requested, whichever is smaller. If the value of the third argument is large enough, the practical effect is to cause all fields to be returned, empty or not. Scripts shown in this chapter use a field count value of 10,000:

    # split line at tabs, preserving all fields
    my @val = split (/\t/, $_, 10000);

In the (unlikely?) event that an input line has more fields than that, it will be truncated. If you think that will be a problem, you can bump up the number even higher.
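The difference the field count makes can be seen with a short test. The following is a minimal sketch, not one of the chapter's scripts; the helper name split_all_fields is hypothetical and simply wraps the three-argument split call shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the chapter's scripts): split a
# tab-delimited line, keeping trailing empty fields.
sub split_all_fields {
    my $line = shift;
    return split (/\t/, $line, 10000);
}

my $line = "a\tb\t\t";                 # four fields, the last two empty

my @short = split (/\t/, $line);       # trailing empty fields are dropped
my @full  = split_all_fields ($line);  # field count preserves them

printf "without field count: %d fields\n", scalar (@short);   # 2 fields
printf "with field count:    %d fields\n", scalar (@full);    # 4 fields
```

If the output line is rebuilt with join ("\t", @short), it has fewer columns than the input line, which is exactly the mismatch the field count is meant to prevent.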

10.21 Validating and Transforming Data

10.21.1 Problem

You need to make sure the data values contained in a file are legal.

10.21.2 Solution

Check them, possibly rewriting them into a more suitable format.

10.21.3 Discussion

Earlier recipes in this chapter show how to work with the structural characteristics of files, by reading lines and busting them up into separate columns. It's important to be able to do that, but sometimes you need to work with the data content of a file, not just its structure:

• It's often a good idea to validate data values to make sure they're legal for the column types into which you're storing them. For example, you can make sure that values intended for INT, DATE, and ENUM columns are integers, dates in CCYY-MM-DD format, and legal enumeration values.

• Data values may need reformatting. Rewriting dates from one format to another is especially common. For example, if you're importing a FileMaker Pro file into MySQL, you'll likely need to convert dates from MM-DD-YY format to ISO format. If you're going in the other direction, from MySQL to FileMaker Pro, you'll need to perform the inverse date transformation, as well as split DATETIME and TIMESTAMP columns into separate date and time columns.

• It may be necessary to recognize special values in the file. It's common to represent NULL with a value that does not otherwise occur in the file, such as -1, Unknown, or NA. If you don't want those values to be imported literally, you'll need to recognize and handle them specially.

This section begins a set of recipes that describe validation and reformatting techniques that are useful in these kinds of situations. Techniques covered here for checking values include direct comparison, pattern matching, and validation against information in a database. It's not unusual for certain validation operations to come up over and over, in which case you'll probably find it useful to construct a library of functions. Packaging validation operations as library routines makes it easier to write utilities based on them, and the utilities make it easier to perform command-line operations on entire files so you can avoid editing them yourself.
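To make the three kinds of checks concrete, here is a minimal sketch of what a few such library routines might look like in Perl. All function names are hypothetical and are not taken from the chapter's own library. The sketch makes several assumptions: values have already been split out of the line and chomped, ENUM members are compared without regard to lettercase, two-digit years follow MySQL's rule (70-99 become 19xx, 00-69 become 20xx), and special values are rewritten as \N, which LOAD DATA reads as NULL under its default escape settings.

```perl
use strict;
use warnings;

# Hypothetical validation helpers; names are illustrative only.

# True if $val looks like an integer (optional sign, then digits).
sub is_integer {
    my $val = shift;
    return $val =~ /^[-+]?\d+$/ ? 1 : 0;
}

# True if $val has CCYY-MM-DD form with month 1-12 and day 1-31.
# This is a range check only; it does not reject 2001-02-31.
sub is_iso_date {
    my $val = shift;
    return 0 unless $val =~ /^(\d{4})-(\d{2})-(\d{2})$/;
    my ($month, $day) = ($2, $3);
    return ($month >= 1 && $month <= 12 && $day >= 1 && $day <= 31) ? 1 : 0;
}

# True if $val matches one of the legal enumeration members.
# Assumption: comparison is not case sensitive.
sub is_enum_member {
    my ($val, @members) = @_;
    return (grep { lc ($_) eq lc ($val) } @members) ? 1 : 0;
}

# Convert MM-DD-YY to CCYY-MM-DD; returns undef if the pattern fails.
# Assumption: years 70-99 map to 19xx and 00-69 map to 20xx.
sub mmddyy_to_iso {
    my $val = shift;
    return unless $val =~ /^(\d{1,2})-(\d{1,2})-(\d{2})$/;
    my ($month, $day, $year) = ($1, $2, $3);
    $year += ($year >= 70 ? 1900 : 2000);
    return sprintf ("%04d-%02d-%02d", $year, $month, $day);
}

# Replace a special "no value" marker with \N; pass other values through.
sub null_if_special {
    my ($val, @markers) = @_;
    return (grep { $_ eq $val } @markers) ? "\\N" : $val;
}

print mmddyy_to_iso ("12-31-99"), "\n";                       # 1999-12-31
print null_if_special ("NA", "-1", "Unknown", "NA"), "\n";    # \N
```

Keeping each check in its own small routine means a file-processing utility only has to decide which routine applies to which column, rather than repeating the patterns inline.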

10.21.4 Writing an Input-Processing Loop