
    cvt_file.pl --idelim=":" /etc/passwd > tmp

• Convert tab-delimited query output from mysql into CSV format:

    mysql -e "SELECT * FROM profile" cookbook \
        | cvt_file.pl --oformat=csv > profile.csv

10.20 Extracting and Rearranging Datafile Columns

10.20.1 Problem

You want to pull out columns from a datafile or rearrange them into a different order.

10.20.2 Solution

Use a utility that can produce columns from a file on demand.

10.20.3 Discussion

cvt_file.pl serves as a tool that converts entire files from one format to another. Another common datafile operation is to manipulate its columns. This is necessary, for example, when importing a file into a program that doesn't understand how to extract or rearrange input columns for itself. Perhaps you want to omit columns from the middle of a file so you can use it with LOAD DATA, which cannot skip over columns in the middle of data lines. Or perhaps you have a version of mysqlimport older than 3.23.17, which doesn't support the --columns option that allows you to indicate the order in which table columns appear in the file. To work around these problems, you can rearrange the datafile instead.

Recall that this chapter began with a description of a scenario involving a 12-column CSV file somedata.csv from which only columns 2, 11, 5, and 9 were needed. You can convert the file to tab-delimited format like this:

    cvt_file.pl --iformat=csv somedata.csv > somedata.txt

But then what? If you just want to knock out a short script to extract those specific four columns, that's fairly easy: write a loop that reads input lines and writes only the columns you want in the proper order. Assuming input in tab-delimited, linefeed-terminated format, a simple Perl program to pull out the four columns can be written like this:

    #! /usr/bin/perl -w
    # yank_4col.pl - 4-column extraction example

    # Extracts column 2, 11, 5, and 9 from 12-column input, in that order.
    # Assumes tab-delimited, linefeed-terminated input lines.

    use strict;

    while (<>)
    {
        chomp;
        my @in = split (/\t/, $_);      # split at tabs
        # extract columns 2, 11, 5, and 9
        print join ("\t", $in[1], $in[10], $in[4], $in[8]) . "\n";
    }

    exit (0);

Run the script as follows to read the file containing 12 data columns and write output that contains only the four columns in the desired order:

    yank_4col.pl somedata.txt > tmp

But yank_4col.pl is a special-purpose script, useful only within a highly limited context. With just a little more work, it's possible to write a more general utility, yank_col.pl, that allows any set of columns to be extracted. With such a tool, you'd specify the column list on the command line like this:

    yank_col.pl --columns=2,11,5,9 somedata.txt > tmp

Because the script doesn't use a hardcoded column list, it can be used to pull out an arbitrary set of columns in any order. Columns can be specified as a comma-separated list of column numbers or column ranges. For example, --columns=1,4-7,10 means columns 1, 4, 5, 6, 7, and 10. yank_col.pl looks like this:

    #! /usr/bin/perl -w
    # yank_col.pl - extract columns from input

    # Example: yank_col.pl --columns=2,11,5,9 filename

    # Assumes tab-delimited, linefeed-terminated input lines.

    use strict;
    use Getopt::Long;
    $Getopt::Long::ignorecase = 0;  # options are case sensitive

    my $prog = "yank_col.pl";
    my $usage = <<EOF;
    Usage: $prog [options] [data_file]

    Options:
    --help
        Print this message
    --columns=column-list
        Specify columns to extract, as a comma-separated list of column positions
    EOF

    my $help;
    my $columns;

    GetOptions (
        "help"      => \$help,      # print help message
        "columns=s" => \$columns    # specify column list
    ) or die "$usage\n";

    die "$usage\n" if defined $help;

    # split the column list (empty if --columns was not given)
    my @col_list = defined ($columns) ? split (/,/, $columns) : ();
    @col_list or die "$usage\n";    # nonempty column list is required

    # make sure column specifiers are positive integers, and convert from
    # 1-based to 0-based values

    my @tmp;
    for (my $i = 0; $i < @col_list; $i++)
    {
        if ($col_list[$i] =~ /^\d+$/)               # single column number
        {
            die "Column specifier $col_list[$i] is not a positive integer\n"
                    unless $col_list[$i] > 0;
            push (@tmp, $col_list[$i] - 1);
        }
        elsif ($col_list[$i] =~ /^(\d+)-(\d+)$/)    # column range m-n
        {
            my ($begin, $end) = ($1, $2);
            die "$col_list[$i] is not a valid column specifier\n"
                    unless $begin > 0 && $end > 0 && $begin <= $end;
            while ($begin <= $end)
            {
                push (@tmp, $begin - 1);
                ++$begin;
            }
        }
        else
        {
            die "$col_list[$i] is not a valid column specifier\n";
        }
    }
    @col_list = @tmp;

    while (<>)      # read input
    {
        chomp;
        my @val = split (/\t/, $_, 10000);  # split, preserving all fields
        # extract desired columns, mapping undef to empty string (can
        # occur if an index exceeds number of columns present in line)
        @val = map { defined ($_) ? $_ : "" } @val[@col_list];
        print join ("\t", @val) . "\n";
    }

    exit (0);

In the heredoc above, the usage text and the closing EOF must begin in the first column of the actual script file; they are indented here only as part of the listing.

The input processing loop converts each line to an array of values, then pulls out from the array the values corresponding to the requested columns. To avoid looping through the array, it uses Perl's notation that allows a list of subscripts to be specified all at once to request multiple array elements.
For example, if @col_list contains the values 2, 6, and 3, these two expressions are equivalent:

    ($val[2], $val[6], $val[3])
    @val[@col_list]

What if you want to extract columns from a file that's not in tab-delimited format, or produce output in another format? In that case, combine yank_col.pl with cvt_file.pl. Suppose you want to pull out all but the password column from the colon-delimited /etc/passwd file and write the result in CSV format. Use cvt_file.pl both to preprocess /etc/passwd into tab-delimited format for yank_col.pl and to post-process the extracted columns into CSV format:

    cvt_file.pl --idelim=":" /etc/passwd \
        | yank_col.pl --columns=1,3-7 \
        | cvt_file.pl --oformat=csv > passwd.csv

If you don't want to type all of that as one long command, use temporary files for the intermediate steps:

    cvt_file.pl --idelim=":" /etc/passwd > tmp1
    yank_col.pl --columns=1,3-7 tmp1 > tmp2
    cvt_file.pl --oformat=csv tmp2 > passwd.csv
    rm tmp1 tmp2

Forcing split to Return Every Field

The Perl split function is extremely useful, but normally it doesn't return trailing empty fields. This means that if you write out only as many fields as split returns, output lines may not have the same number of fields as input lines. To avoid this problem, pass a third argument to indicate the maximum number of fields to return. This forces split to return as many fields as are actually present on the line, or the number requested, whichever is smaller. If the value of the third argument is large enough, the practical effect is to cause all fields to be returned, empty or not. Scripts shown in this chapter use a field count value of 10,000:

    # split line at tabs, preserving all fields
    my @val = split (/\t/, $_, 10000);

In the (unlikely?) event that an input line has more fields than that, the fields beyond the limit are not split apart; the rest of the line is returned as part of the last field. If you think that will be a problem, you can bump up the number even higher.
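The effect of the third argument is easy to check directly. The following is a minimal sketch, not part of the chapter's scripts: the helper name field_counts and the sample lines are made up for illustration. It counts the fields that split returns for the same line with no limit, with the limit of 10,000 used in this chapter, and with a negative limit, which Perl treats as an arbitrarily large one. It also shows what happens when a line has more fields than the limit allows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# field_counts (hypothetical helper): split one tab-delimited line three
# ways and return the number of fields produced by each call.
sub field_counts
{
    my ($line) = @_;
    my @default   = split (/\t/, $line);         # trailing empty fields dropped
    my @big_limit = split (/\t/, $line, 10000);  # limit used in this chapter
    my @negative  = split (/\t/, $line, -1);     # negative limit: no upper bound
    return (scalar (@default), scalar (@big_limit), scalar (@negative));
}

# Illustrative five-field line whose last two fields are empty.
my $line = join ("\t", "a", "b", "c", "", "");
my ($n_default, $n_big, $n_neg) = field_counts ($line);
print "no limit: $n_default, limit 10000: $n_big, limit -1: $n_neg\n";

# When a line has more fields than the limit, the last field returned
# holds the unsplit remainder of the line, embedded tabs included.
my @capped = split (/\t/, "a\tb\tc\td", 3);
print "fields: " . scalar (@capped) . ", last contains a tab: "
      . ($capped[2] =~ /\t/ ? "yes" : "no") . "\n";
```

With no limit, the sample line yields three fields; with either limit, it yields all five. A negative limit is one way to sidestep choosing a "large enough" number, at the cost of departing from the convention used by the scripts shown here.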

10.21 Validating and Transforming Data