10.37 Guessing Table Structure from a Datafile

10.37.1 Problem

Someone gives you a datafile and says, "Here, put this into MySQL for me." But no table yet exists to hold the data.

10.37.2 Solution

Write the CREATE TABLE statement yourself. Or use a utility that guesses the table structure by examining the contents of the datafile.

10.37.3 Discussion

Sometimes you need to import data into MySQL for which no table has yet been set up. You can create the table yourself, based on any knowledge you might have about the contents of the file. Or you may be able to avoid some of the work by using guess_table.pl, a utility located in the transfer directory of the recipes distribution. guess_table.pl reads the datafile to see what kind of information it contains, then attempts to produce an appropriate CREATE TABLE statement that matches the contents of the file. This script is necessarily imperfect, because column contents sometimes are ambiguous. For example, a column containing a small number of distinct strings might be a CHAR column or an ENUM. Still, it's often easier to tweak the statement that guess_table.pl produces than to write the entire statement from scratch. This utility also has a diagnostic function, though that's not its primary purpose. For example, you might believe a column contains only numbers, but if guess_table.pl indicates that it should be created using a CHAR type, that tells you the column contains at least one non-numeric value.

guess_table.pl assumes that its input is in tab-delimited, linefeed-terminated format. It also assumes valid input, because any attempt to guess column types based on possibly flawed data is doomed to failure. This means, for example, that if a date column is to be recognized as such, it should be in ISO format. Otherwise, guess_table.pl may characterize it as a CHAR column. If a datafile doesn't satisfy these assumptions, you may be able to reformat it first using the cvt_file.pl and cvt_date.pl utilities described in Recipe 10.19 and Recipe 10.32.
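The diagnostic use mentioned above can be taken one step further: once you know that a supposedly numeric column contains something non-numeric, you usually want to know which records are responsible. The following Python sketch is not part of the recipes distribution. The function name, the definition of "numeric," and the sample records (including the n/a value) are assumptions made up for illustration.

```python
import re

# Optionally signed integer or decimal number. What counts as "numeric"
# here is an assumption of this sketch, not guess_table.pl's own rule.
NUMERIC_RE = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)")


def non_numeric_values(lines, col_index):
    """Return (line_number, value) pairs for non-empty, non-numeric values.

    lines is an iterable of tab-delimited, linefeed-terminated records.
    Line numbers start at 1. Empty values are skipped because they may
    simply represent NULL.
    """
    offenders = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if col_index >= len(fields):
            continue  # short record; nothing to check in this column
        value = fields[col_index]
        if value != "" and not NUMERIC_RE.fullmatch(value):
            offenders.append((line_number, value))
    return offenders


# Invented sample records: the third column is supposed to be numeric.
sample = [
    "cravebi01\t1871\t25\n",
    "deaneha01\t1871\tn/a\n",
    "hastisc01\t1871\t\n",
]
print(non_numeric_values(sample, 2))  # prints [(2, 'n/a')]
```

To check a real file, you could pass an open file object as the first argument, because the function only requires something that yields one record per iteration.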
guess_table.pl understands the following options:

--labels
    Interpret the first input line as a row of column labels and use them for table column names. If this option is omitted, guess_table.pl uses default column names of c1, c2, and so forth. Note that if the file contains a row of labels and you neglect to specify this option, the labels will be treated as data values by guess_table.pl. The likely result is that the script will mischaracterize any numeric column as a CHAR column, due to the presence of a non-numeric value in the column.

--lower, --upper
    Force column names in the CREATE TABLE statement to be lowercase or uppercase.

--quote-names
    Quote table and column names in the CREATE TABLE statement with ` characters (for example, `mytbl`). This can be useful if a name is a reserved word. The resulting statement requires MySQL 3.23.6 or later, because quoted names are not understood by earlier versions.

--report
    Generate a report rather than a CREATE TABLE statement. The script displays the information that it gathered about each column.

--tbl_name=tbl_name
    Specify the table name to use in the CREATE TABLE statement. The default name is t.

Here's an example of how guess_table.pl works, using the managers.csv file from the CSV version of the baseball1.com baseball database distribution. This file contains records for team managers.
It begins with a row of column labels, followed by rows containing data values:

    LahmanID,Year,Team,Lg,DIV,G,W,L,Pct,Std,Half,Order,PlyrMgr,PostWins,PostLosses
    cravebi01,1871,TRO,NA,,25,12,12,0.5,6,0,2,,,
    deaneha01,1871,KEK,NA,,5,2,3,0.4,8,0,2,,,
    hastisc01,1871,ROK,NA,,25,4,21,0.16,9,0,0,,,
    paborch01,1871,CLE,NA,,29,10,19,0.345,7,0,0,,,
    wrighha01,1871,BOS,NA,,31,20,10,0.667,3,0,0,,,
    youngni99,1871,OLY,NA,,32,15,15,0.5,5,0,0,,,
    clappjo01,1872,MAN,NA,,24,5,19,0.208,8,0,0,,,
    clintji01,1872,ECK,NA,,11,0,11,0,9,0,1,,,
    fergubo01,1872,BRA,NA,,37,9,28,0.243,6,0,0,,,
    ...

The first row indicates the column labels, and the following rows contain data records, one per line. guess_table.pl requires input in tab-delimited, linefeed-terminated format, so to work with the managers.csv file, first convert it using cvt_file.pl, writing the result to a temporary file, managers.txt:

    % cvt_file.pl --iformat=csv --ieol="\r\n" managers.csv > managers.txt

Then run the temporary file through guess_table.pl (the command shown here uses --lower because I prefer lowercase column names):

    % guess_table.pl --table=managers --labels --lower managers.txt > managers.sql

The CREATE TABLE statement that guess_table.pl writes to managers.sql looks like this:

    CREATE TABLE managers
    (
      lahmanid CHAR(9) NOT NULL,
      year INT UNSIGNED NOT NULL,
      team CHAR(3) NOT NULL,
      lg CHAR(2) NOT NULL,
      div CHAR(1) NULL,
      g INT UNSIGNED NOT NULL,
      w INT UNSIGNED NOT NULL,
      l INT UNSIGNED NOT NULL,
      pct FLOAT NOT NULL,
      std INT UNSIGNED NOT NULL,
      half INT UNSIGNED NOT NULL,
      order INT UNSIGNED NOT NULL,
      plyrmgr CHAR(1) NULL,
      postwins INT UNSIGNED NULL,
      postlosses INT UNSIGNED NULL
    );

guess_table.pl produces that statement based on deductions such as the following:

• If a column contains only integer values, it's assumed to be an INT. If none of the values are negative, the column is likely to be UNSIGNED as well.
• If a column contains no empty values, guess_table.pl assumes that it's probably NOT NULL.
• Columns that cannot be classified as numbers or dates are taken to be CHAR columns, with a length equal to the longest value present in the column.

You might want to edit the CREATE TABLE statement that guess_table.pl produces, to make modifications such as increasing the size of character fields, changing CHAR to VARCHAR, or adding indexes. Another reason to edit the statement is that if a column has a name that is a reserved word in MySQL, you can change it to a different name. For example, the managers table definition created by guess_table.pl contains a column named order, which is a reserved keyword. The column represents the order of the manager during the season (in case a team had more than one manager), so a reasonable alternative name is mgrorder. After editing the statement in the managers.sql file to make that change, execute it to create the table:

    % mysql cookbook < managers.sql

Then you can load the datafile into the table (skipping the initial row of labels):

    mysql> LOAD DATA LOCAL INFILE 'managers.txt' INTO TABLE managers
        -> IGNORE 1 LINES;

When you do this, you'll notice that LOAD DATA reports some warnings. These are investigated further in Recipe 10.38.

The baseball1.com database also is available in Access format. The Access database contains explicit information about the structure of the managers table, and this information is available to utilities like DBTools and MySQLFront that can use it to create the MySQL table for you. (See Recipe 10.39 for information about these programs.) This affords us the opportunity to see how well guess_table.pl guesses the table structure using only the datafile, compared to programs that have more information available to them.
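Returning briefly to the reserved-word issue with the order column: it can be caught before the table is created by checking the column labels against a list of reserved words. The Python sketch below is only an illustration. The function names are hypothetical, and the word list is a deliberately tiny subset that I chose for the example; the real list is much longer and depends on the server version.

```python
# A deliberately small, illustrative subset of MySQL reserved words.
# The full list is much longer and varies by server version.
RESERVED_SUBSET = {"order", "group", "select", "table", "key", "index"}


def quote_name(name):
    """Enclose a name in backticks, doubling any embedded backtick
    (the escaping convention that current MySQL versions expect)."""
    return "`" + name.replace("`", "``") + "`"


def check_column_names(names, reserved=RESERVED_SUBSET):
    """Map each name that collides with a reserved word to its quoted form."""
    return {name: quote_name(name) for name in names if name.lower() in reserved}


# Column labels taken from the first line of managers.csv.
labels = ("LahmanID,Year,Team,Lg,DIV,G,W,L,Pct,Std,Half,Order,"
          "PlyrMgr,PostWins,PostLosses").split(",")
print(check_column_names(labels))  # prints {'Order': '`Order`'}
```

Whether you then rename the offending column or quote it is a judgment call. Renaming means you never have to remember the backticks later.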
One problem with utilities like DBTools or MySQLFront is that if an Access table column has a name that is a reserved word, you cannot import it into MySQL without changing the Access table to use a different column name. This is the case for the Order column in the managers table. With guess_table.pl, that wasn't a problem, because you can just edit the CREATE TABLE statement that it produces to change the name to something legal. [7] However, to deal with the Order column in the managers table for purposes of DBTools or MySQLFront, you should change the Access database itself to rename the column (for example, to MgrOrder).

[7] Another approach is to use the --quote-names option when you run guess_table.pl. That allows you to create the table without changing the column name, although then you must put the name within backticks whenever you refer to it.

The managers table structure produced by DBTools looks like this:

    CREATE TABLE managers (
      LahmanID char(9) NOT NULL default '',
      Year int(11) default NULL,
      Team char(3) default NULL,
      Lg char(2) default NULL,
      Div char(2) default NULL,
      G int(11) default NULL,
      W int(11) default NULL,
      L int(11) default NULL,
      Pct double default NULL,
      Std int(11) default NULL,
      Half int(11) default NULL,
      MgrOrder int(11) default NULL,
      PlyrMgr char(1) default NULL,
      PostWins int(11) default NULL,
      PostLosses int(11) default NULL,
      KEY LahmanID (LahmanID)
    );

MySQLFront creates the table like this:

    CREATE TABLE managers (
      LahmanID longtext,
      Year int(11) default NULL,
      Team longtext,
      Lg longtext,
      Div longtext,
      G int(11) default NULL,
      W int(11) default NULL,
      L int(11) default NULL,
      Pct float default NULL,
      Std int(11) default NULL,
      Half int(11) default NULL,
      MgrOrder int(11) default NULL,
      PlyrMgr longtext,
      PostWins int(11) default NULL,
      PostLosses int(11) default NULL
    );

Of the three programs, DBTools does the best job of determining the structure of the MySQL table.
It uses the index information present in the Access file to write the KEY definition, and to create string columns with the proper lengths. MySQLFront doesn't produce the key definition and it defines strings as LONGTEXT columns, even the PlyrMgr column, which never contains a value longer than one character. The quality of the output produced by guess_table.pl appears to be somewhere in between. It doesn't write the key definition, but neither does it write every string column as the longest possible type. On the other hand, the column lengths are somewhat conservative. All in all, that's not bad, considering that guess_table.pl doesn't have available to it all the information contained in the original Access file. And you can use it on a cross-platform basis.

These results indicate that if you're using Windows and your records are stored in an Access file, you're probably best off letting DBTools create your MySQL tables for you. In other situations (such as when you're running under Unix or your datafile comes from a source other than Access), guess_table.pl can be beneficial.
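To make the deductions listed earlier in this section more concrete, here is a minimal Python sketch of the same general idea. It is not the guess_table.pl source and makes no attempt to reproduce everything that script does. The function names, the regular expressions, and the treatment of an all-empty column as CHAR(1) are assumptions of the sketch. The default table name t and the default column names c1, c2, and so forth follow the option descriptions given above.

```python
import re

INT_RE = re.compile(r"[-+]?\d+")
FLOAT_RE = re.compile(r"[-+]?(\d+\.\d*|\.\d+)")
ISO_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def guess_column_type(values):
    """Guess a column definition from a list of string values."""
    non_empty = [v for v in values if v != ""]
    # A column with no empty values is probably NOT NULL.
    nullable = "NULL" if len(non_empty) < len(values) else "NOT NULL"
    if non_empty and all(INT_RE.fullmatch(v) for v in non_empty):
        # Integers with no negative values are taken to be UNSIGNED.
        unsigned = all(not v.startswith("-") for v in non_empty)
        base = "INT UNSIGNED" if unsigned else "INT"
    elif non_empty and all(
        INT_RE.fullmatch(v) or FLOAT_RE.fullmatch(v) for v in non_empty
    ):
        base = "FLOAT"
    elif non_empty and all(ISO_DATE_RE.fullmatch(v) for v in non_empty):
        base = "DATE"
    else:
        # Anything else is CHAR, as wide as the longest value present.
        longest = max((len(v) for v in non_empty), default=1)
        base = "CHAR(%d)" % max(longest, 1)
    return "%s %s" % (base, nullable)


def guess_table(lines, table_name="t", labels=False):
    """Build a CREATE TABLE statement from tab-delimited records.

    Assumes every record has the same number of fields.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines]
    if labels:
        names, rows = rows[0], rows[1:]
    else:
        names = ["c%d" % (i + 1) for i in range(len(rows[0]))]
    columns = zip(*rows)  # transpose records into columns
    defs = ["  %s %s" % (n, guess_column_type(c)) for n, c in zip(names, columns)]
    return "CREATE TABLE %s\n(\n%s\n);" % (table_name, ",\n".join(defs))


# Five of the columns from three of the managers records shown earlier.
sample = [
    "LahmanID\tYear\tTeam\tDIV\tPct\n",
    "cravebi01\t1871\tTRO\t\t0.5\n",
    "deaneha01\t1871\tKEK\t\t0.4\n",
    "clintji01\t1872\tECK\t\t0\n",
]
print(guess_table(sample, table_name="managers", labels=True))
```

For the sample records, the sketch prints:

    CREATE TABLE managers
    (
      LahmanID CHAR(9) NOT NULL,
      Year INT UNSIGNED NOT NULL,
      Team CHAR(3) NOT NULL,
      DIV CHAR(1) NULL,
      Pct FLOAT NOT NULL
    );

As with guess_table.pl itself, the result is a starting point to be edited rather than a finished table definition.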

10.38 A LOAD DATA Diagnostic Utility