General Import and Export Issues

Numeric values may need to be verified as lying within a specific range, dates may need to be converted to or from ISO format, and so forth.

Source code for the program fragments and scripts discussed in this chapter is located in the transfer directory of the recipes distribution, with the exception that some of the utility functions are contained in library files located in the lib directory. The code for some of the shorter utilities is shown in full. For the longer ones, the chapter generally discusses only how they work and how to use them, but you have access to the source if you wish to investigate in more detail how they're written.

The problems addressed in this chapter involve a lot of text processing and pattern matching. These are particular strengths of Perl, so the program fragments and utilities shown here are written mainly in Perl. PHP and Python provide pattern-matching capabilities, too, so they can of course do many of the same things. If you want to adapt the techniques described here for Java, you'll need to get a library that provides classes for regular expression-based pattern matching. See Appendix A for suggestions.
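The kinds of checks mentioned at the start of this section (verifying that a number lies within a range, converting a date to ISO format) can be sketched with pattern matching. The following is a minimal illustration in Python rather than Perl; the function names are hypothetical and are not taken from the recipes distribution, and the MM/DD/YYYY input format is an assumption made for the example.

```python
import re
from datetime import date

# Hypothetical helpers for illustration; not part of the recipes distribution.
_INT_RE = re.compile(r"[-+]?\d+")
_US_DATE_RE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")


def is_int_in_range(value, low, high):
    """Return True if value looks like an integer and low <= value <= high."""
    if _INT_RE.fullmatch(value) is None:
        return False
    return low <= int(value) <= high


def us_to_iso_date(value):
    """Convert an assumed MM/DD/YYYY string to ISO YYYY-MM-DD.

    Returns None if the string has the wrong shape or names an
    impossible date (for example, month 13 or February 30).
    """
    m = _US_DATE_RE.fullmatch(value)
    if m is None:
        return None
    month, day, year = (int(g) for g in m.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None
```

With these definitions, `us_to_iso_date("2/14/2003")` returns `"2003-02-14"`, while `us_to_iso_date("2/30/2003")` returns `None`. The pattern check rejects values with the wrong shape, and the `date` constructor catches values that look right but are not legal dates.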

10.1.1 General Import and Export Issues

Incompatible datafile formats and differing rules for interpreting various kinds of values lead to many headaches when transferring data between programs. Nevertheless, certain issues recur frequently. By being aware of them, you'll be able to identify more easily just what you need to do to solve particular import or export problems.

In its most basic form, an input stream is just a set of bytes with no particular meaning. Successful import into MySQL requires being able to recognize which bytes represent structural information, and which represent the data values framed by that structure. Because such recognition is key to decomposing the input into appropriate units, the most fundamental import issues are these:

• What is the record separator? Knowing this allows the input stream to be partitioned into records.
• What is the field delimiter? Knowing this allows each record to be partitioned into field values. Recovering the original data values also may include stripping off quotes from around the values or recognizing escape sequences within them.

The ability to break apart the input into records and fields is important for extracting the data values from it. However, the values still might not be in a form that can be used directly, and you may need to consider other issues:

• Do the order and number of columns match the structure of the database table? Mismatches require columns to be rearranged or skipped.
• Do data values need to be validated or reformatted? If the values are in a format that matches MySQL's expectations, no further processing is necessary. Otherwise, they need to be checked and possibly rewritten.
• How should NULL or empty values be handled? Are they allowed? Can NULL values even be detected? Some systems export NULL values as empty strings, making it impossible to distinguish one from the other.

For export from MySQL, the issues are somewhat the reverse. You probably can assume that values stored in the database are valid, but they may require reformatting, and it's necessary to add column and record delimiters to form an output stream that has a structure another program can recognize.

The chapter deals with these issues primarily within the context of performing bulk transfers of entire files, but many of the techniques discussed here can be applied in other situations as well. Consider a web-based application that presents a form for a user to fill in, then processes its contents to create a new record in the database. That is a data import situation. Web APIs generally make form contents available as a set of already-parsed discrete values, so the application may not need to deal with record and column delimiters. On the other hand, validation issues remain paramount. You really have no idea what kind of values a user is sending your script, so it's important to check them.
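The import questions raised in this section (record separator, field delimiter, surrounding quotes, column order, and NULL handling) can be made concrete with a short sketch. The Python function below is a hypothetical illustration, not one of the chapter's utilities. It assumes that separators never occur inside values and that there are no escape sequences. It also adopts one possible convention for NULL: an unquoted empty field is treated as NULL, while a quoted empty field remains an empty string. Files from other systems may not make that distinction at all.

```python
def parse_records(data, record_sep="\n", field_sep="\t", quote='"',
                  column_order=None, empty_as_null=True):
    """Split raw text into records and fields under simple assumptions.

    Hypothetical helper: assumes separators never appear inside values
    and that values contain no escape sequences.
    """
    rows = []
    for record in data.split(record_sep):
        if record == "":
            # Skip empty pieces, such as the one after a trailing separator.
            continue
        fields = []
        for raw in record.split(field_sep):
            # Strip one pair of surrounding quotes, if present.
            quoted = len(raw) >= 2 and raw[0] == quote and raw[-1] == quote
            value = raw[1:-1] if quoted else raw
            # Assumed convention: unquoted empty field means NULL (None).
            if empty_as_null and not quoted and value == "":
                value = None
            fields.append(value)
        if column_order is not None:
            # Rearrange or skip columns to match the table structure.
            fields = [fields[i] for i in column_order]
        rows.append(fields)
    return rows


# Example: comma-delimited input whose first two columns are swapped
# relative to the (hypothetical) target table.
sample = 'Smith,"John",\nJones,"",42\n'
rows = parse_records(sample, field_sep=",", column_order=[1, 0, 2])
# rows == [["John", "Smith", None], ["", "Jones", "42"]]
```

Passing a shorter `column_order`, such as `[0]`, skips the columns that are not listed. Values that come back from this step are still plain strings, so any validation or reformatting would be applied afterward.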

10.1.2 File Formats