The following is_ampm_time() function looks for times in 12-hour format with an optional AM or PM suffix, converting PM times to 24-hour values:
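sub is_ampm_time
{
  my $s = shift;
  return (undef) unless $s =~ /^(\d{1,2})\D(\d{2})\D(\d{2})(?:\s*(AM|PM))?$/i;
  my ($hour, $min, $sec) = ($1, $2, $3);
  my $ampm = defined ($4) ? uc ($4) : "";
  $hour += 12 if $ampm eq "PM" && $hour < 12;   # 12 PM stays 12
  $hour = 0 if $ampm eq "AM" && $hour == 12;    # 12 AM is midnight
  return ([ $hour, $min, $sec ]);   # return hour, minute, second
}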
Both functions return undef for values that don't match the pattern. Otherwise, they return a reference to a three-element array containing the hour, minute, and second values.
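As a brief illustration of how the return value might be consumed, the following sketch repeats is_ampm_time() so that it runs on its own, then dereferences the array reference it returns. The sample strings and the treatment of 12 AM and 12 PM are assumptions made for this example.

```perl
use strict;
use warnings;

# Same pattern as is_ampm_time() above, repeated so the example is runnable.
# Assumption: 12 AM maps to hour 0 and 12 PM stays at hour 12.
sub is_ampm_time
{
  my $s = shift;

  return undef
    unless $s =~ /^(\d{1,2})\D(\d{2})\D(\d{2})(?:\s*(AM|PM))?$/i;
  my ($hour, $min, $sec) = ($1, $2, $3);
  my $ampm = defined ($4) ? uc ($4) : "";
  $hour += 12 if $ampm eq "PM" && $hour < 12;
  $hour = 0   if $ampm eq "AM" && $hour == 12;
  return [ $hour, $min, $sec ];
}

for my $str ("3:15:07 PM", "11:59:59", "noon")
{
  my $ref = is_ampm_time ($str);
  if (defined ($ref))
  {
    # the result is a reference to a three-element array
    my ($hour, $min, $sec) = @{$ref};
    printf "%s => %02d:%02d:%02d\n", $str, $hour, $min, $sec;
  }
  else
  {
    print "$str => not a 12-hour time\n";
  }
}
```

Testing the result with defined() before dereferencing matters, because a value that doesn't match the pattern yields undef rather than an array reference.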
10.32 Writing Date-Processing Utilities
10.32.1 Problem
There's a given date-processing operation that you need to perform frequently, so you want to write a utility that does it for you.
10.32.2 Solution
The utilities in this section provide some examples showing how to do that.
10.32.3 Discussion
Due to the idiosyncratic nature of dates, you'll probably find it necessary to write date converters from time to time. This section shows some sample converters that serve various purposes:
• isoize_date.pl reads a file looking for dates in U.S. format (MM-DD-YY) and converts them to ISO format.
• cvt_date.pl converts dates to and from any of ISO, U.S., or British formats. It is more general than isoize_date.pl, but requires that you tell it what kind of input to expect and what kind of output to produce.
• monddccyy_to_iso.pl looks for dates like Feb. 6, 1788 and converts them to ISO format. It illustrates how to map dates with non-numeric parts to a format that MySQL will understand.

All three scripts are located in the transfer directory of the recipes distribution. They assume datafiles are in tab-delimited, linefeed-terminated format. (Use cvt_file.pl to work with files in a different format.)
Our first date-processing utility, isoize_date.pl, looks for dates in U.S. format and rewrites them into ISO format. You'll recognize that it's modeled after the general input-processing loop shown in Recipe 10.21, with some extra stuff thrown in to perform a specific type of conversion:
#! /usr/bin/perl -w
# isoize_date.pl - Read input data, look for values that match
# a date pattern, convert them to ISO format. Also converts
# 2-digit years to 4-digit years, using a transition point of 70.

# By default, this looks for dates in MM-DD-[CC]YY format.
# Assumes tab-delimited, linefeed-terminated input lines.
# Does not check whether dates actually are valid (for example,
# won't complain about 13-49-1928).

use strict;

# transition point at which 2-digit years are assumed to be 19XX
# (below they are treated as 20XX)
my $transition = 70;

while (<>)
{
  chomp;
  my @val = split (/\t/, $_, 10000);  # split, preserving all fields
  for my $i (0 .. @val - 1)
  {
    my $val = $val[$i];
    # look for strings in MM-DD-[CC]YY format
    next unless $val =~ /^(\d{1,2})\D(\d{1,2})\D(\d{2,4})$/;
    my ($month, $day, $year) = ($1, $2, $3);
    # to interpret dates as DD-MM-[CC]YY instead, replace preceding
    # line with the following one:
    #my ($day, $month, $year) = ($1, $2, $3);

    # convert 2-digit years to 4 digits, then update value in array
    $year += ($year >= $transition ? 1900 : 2000) if $year < 100;
    $val[$i] = sprintf ("%04d-%02d-%02d", $year, $month, $day);
  }
  print join ("\t", @val) . "\n";
}

exit (0);
If you feed isoize_date.pl an input file that looks like this:

Fred     04-13-70
Mort     09-30-69
Brit     12-01-57
Carl     11-02-73
Sean     07-04-63
Alan     02-14-65
Mara     09-17-68
Shepard  09-02-75
Dick     08-20-52
Tony     05-01-60
It produces the following output:

Fred     1970-04-13
Mort     2069-09-30
Brit     2057-12-01
Carl     1973-11-02
Sean     2063-07-04
Alan     2065-02-14
Mara     2068-09-17
Shepard  1975-09-02
Dick     2052-08-20
Tony     2060-05-01
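The century rule is what turns Mort's 09-30-69 into 2069 while Fred's 04-13-70 becomes 1970. As a minimal sketch, the same rule can be isolated in a helper routine. The name add_century() and its optional transition argument are hypothetical; they don't appear in isoize_date.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a 2-digit year using a transition point.
# Years at or above the transition become 19XX; lower ones become 20XX.
# Years of 100 or more are returned unchanged.
sub add_century
{
  my ($year, $transition) = @_;

  $transition = 70 unless defined ($transition);  # isoize_date.pl's value
  $year += ($year >= $transition ? 1900 : 2000) if $year < 100;
  return $year;
}

printf "%02d -> %d\n", $_, add_century ($_) for (69, 70);
# prints:
# 69 -> 2069
# 70 -> 1970
```

Passing a different transition point changes the outcome: add_century (69, 50) returns 1969, which is probably closer to what you'd want for a file of birth dates like the one shown.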
isoize_date.pl serves a specific purpose: it converts only from U.S. to ISO format. It does not perform validity checking on date subparts or allow the transition point for adding the century to be specified. A more general tool would be more useful. The next script, cvt_date.pl, extends the capabilities of isoize_date.pl; it recognizes input dates in ISO, U.S., or British formats and converts any of them to any other. It also can convert two-digit years to four digits, allows you to specify the conversion transition point, and can warn about bad dates. As such, it can be used to preprocess input for loading into MySQL, or for postprocessing data exported from MySQL for use by other programs.
cvt_date.pl understands the following options:

--iformat=format, --oformat=format, --format=format
    Set the date format for input, output, or both. The default format value is iso; cvt_date.pl also recognizes any string beginning with us or br as indicating U.S. or British date format.
--add-century
    Convert two-digit years to four digits.
--columns=column_list
    Convert dates only in the named columns. By default, cvt_date.pl looks for dates in all columns. If this option is given, column_list should be a list of one or more column positions separated by commas. Positions begin at 1.
--transition=n
    Specify the transition point for two-digit to four-digit year conversions. The default transition point is 70. This option turns on --add-century.
--warn
    Warn about bad dates. (Note that this option can produce spurious warnings if the dates have two-digit years and you don't specify --add-century, because leap year testing won't always be accurate in that case.)
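To give a feel for what handling these options involves, here is a minimal sketch that uses the standard Getopt::Long module. It is not the cvt_date.pl source: the parse_options() routine, the hash keys, and the choice to let --format override --iformat and --oformat are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical sketch: parse cvt_date.pl-style options from an argument
# list and return the resulting settings as a hash reference.
sub parse_options
{
  my @args = @_;
  my %opt = (iformat => "iso", oformat => "iso", transition => 70);
  my ($format, $transition, $columns);

  GetOptionsFromArray (\@args,
    "iformat=s"    => \$opt{iformat},
    "oformat=s"    => \$opt{oformat},
    "format=s"     => \$format,
    "add-century"  => \$opt{add_century},
    "columns=s"    => \$columns,
    "transition=i" => \$transition,
    "warn"         => \$opt{warn},
  ) or die "unrecognized option\n";

  # --format sets both directions (assumed here to override the others)
  @opt{qw(iformat oformat)} = ($format, $format) if defined ($format);

  # any string beginning with us or br selects U.S. or British format
  for my $key (qw(iformat oformat))
  {
    my $val = lc ($opt{$key});
    if    ($val =~ /^us/)  { $opt{$key} = "us"; }
    elsif ($val =~ /^br/)  { $opt{$key} = "br"; }
    elsif ($val eq "iso")  { $opt{$key} = "iso"; }
    else                   { die "unknown date format: $opt{$key}\n"; }
  }

  # --transition implies --add-century
  if (defined ($transition))
  {
    $opt{transition} = $transition;
    $opt{add_century} = 1;
  }

  # column positions are kept 1-based, as given on the command line
  $opt{columns} = [ split (/,/, $columns) ] if defined ($columns);

  $opt{files} = [ @args ];    # whatever remains is taken as filenames
  return \%opt;
}

my $opt = parse_options ("--iformat=US", "--transition=50",
                         "--columns=2,4", "newdata.txt");
print "input: $opt->{iformat}, output: $opt->{oformat}, ",
      "transition: $opt->{transition}\n";
```

Normalizing the format names once, right after parsing, means the rest of a script only ever has to compare against iso, us, or br.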
I won't show the code for cvt_date.pl here (most of it is taken up with processing command-line options), but you can examine the source for yourself if you like. As an example of how cvt_date.pl works, suppose you have a file newdata.txt with the following contents:

name1   01/01/99   38
name2   12/31/00   40
name3   02/28/01   42
name4   01/02/03   44
Running the file through cvt_date.pl with options indicating that the dates are in U.S. format and that the century should be added produces this result:

% cvt_date.pl --iformat=us --add-century newdata.txt
name1   1999-01-01   38
name2   2000-12-31   40
name3   2001-02-28   42
name4   2003-01-02   44
To produce dates in British format instead with no year conversion, do this:

% cvt_date.pl --iformat=us --oformat=br newdata.txt
name1   01-01-99   38
name2   31-12-00   40
name3   28-02-01   42
name4   02-01-03   44
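The heart of a conversion like this is a reordering of the three date parts. The following sketch shows one way that might be done; the convert_date() routine is hypothetical and is not taken from cvt_date.pl. It assumes that any non-digit character is acceptable as an input separator, that output always uses dashes, and that the year is written back exactly as many digits wide as it was read.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one value between "iso" (YY-MM-DD),
# "us" (MM-DD-YY), and "br" (DD-MM-YY) layouts. Any format name other
# than "iso" or "us" is treated as "br". Returns undef if the value
# doesn't look like three numeric parts.
sub convert_date
{
  my ($val, $iformat, $oformat) = @_;

  return undef unless $val =~ /^(\d{1,4})\D(\d{1,2})\D(\d{1,4})$/;
  my ($p1, $p2, $p3) = ($1, $2, $3);

  # pick the parts apart according to the input layout
  my ($year, $month, $day) = $iformat eq "iso" ? ($p1, $p2, $p3)
                           : $iformat eq "us"  ? ($p3, $p1, $p2)
                           :                     ($p3, $p2, $p1);

  # put them back together in the output layout
  my @parts = $oformat eq "iso" ? ($year, $month, $day)
            : $oformat eq "us"  ? ($month, $day, $year)
            :                     ($day, $month, $year);
  return sprintf ("%02d-%02d-%02d", @parts);
}

print convert_date ("12/31/00", "us", "br"), "\n";     # 31-12-00
print convert_date ("2001-02-28", "iso", "us"), "\n";  # 02-28-2001
```

Splitting the work into a parse step and a format step keeps the number of cases at three plus three, rather than one for each of the nine possible input/output pairs.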
cvt_date.pl has no knowledge of the meaning of each data column, of course. If you have a non-date column with values that match the pattern, it will rewrite that column, too. To deal with that, specify a --columns option to limit the columns that cvt_date.pl attempts to convert.

isoize_date.pl and cvt_date.pl both operate on dates written in all-numeric formats. But dates in datafiles often are written differently, in which case it may be necessary to write a special purpose script to process them. Suppose an input file contains dates in the following format (these represent the dates on which U.S. states were admitted to the Union):

Delaware        Dec. 7, 1787
Pennsylvania    Dec 12, 1787
New Jersey      Dec. 18, 1787
Georgia         Jan. 2, 1788
Connecticut     Jan. 9, 1788
Massachusetts   Feb. 6, 1788
Maryland        Apr. 28, 1788
South Carolina  May 23, 1788
New Hampshire   Jun. 21, 1788
Virginia        Jun 25, 1788
...
The dates consist of a three-character month abbreviation (possibly followed by a period), the numeric day of the month, a comma, and the numeric year. To import this file into MySQL, you'd need to convert the dates to ISO format, resulting in a file that looks like this:

Delaware        1787-12-07
Pennsylvania    1787-12-12
New Jersey      1787-12-18
Georgia         1788-01-02
Connecticut     1788-01-09
Massachusetts   1788-02-06
Maryland        1788-04-28
South Carolina  1788-05-23
New Hampshire   1788-06-21
Virginia        1788-06-25
...
That's a somewhat specialized kind of transformation, though this general type of problem (converting a specific date format) is hardly uncommon. To perform the conversion, identify the dates as those values matching an appropriate pattern, map month names to the corresponding numeric values, and reformat the result. The following script, monddccyy_to_iso.pl, illustrates how to do this:

#! /usr/bin/perl -w
# monddccyy_to_iso.pl - convert dates from mon[.] dd, ccyy to ISO format

# Assumes tab-delimited, linefeed-terminated input

use strict;

my %map =   # map 3-char month abbreviations to numeric month
(
  "jan" => 1, "feb" => 2, "mar" => 3, "apr" => 4, "may" => 5, "jun" => 6,
  "jul" => 7, "aug" => 8, "sep" => 9, "oct" => 10, "nov" => 11, "dec" => 12
);

while (<>)
{
  chomp;
  my @val = split (/\t/, $_, 10000);  # split, preserving all fields
  for my $i (0 .. @val - 1)
  {
    # reformat the value if it matches the pattern, otherwise assume
    # it's not a date in the required format and leave it alone
    if ($val[$i] =~ /^([^.]+)\.? (\d+), (\d+)$/)
    {
      # use lowercase month name
      my ($month, $day, $year) = (lc ($1), $2, $3);
      if (exists ($map{$month}))
      {
        $val[$i] = sprintf ("%04d-%02d-%02d", $year, $map{$month}, $day);
      }
      else
      {
        # warn, but don't reformat
        warn "$val[$i]: bad date?\n";
      }
    }
  }
  print join ("\t", @val) . "\n";
}

exit (0);

The script only does reformatting; it doesn't validate the dates. To do that, modify the script to use the Cookbook_Utils.pm module by adding this statement after the use strict line:

use Cookbook_Utils;
That gives the script access to the module's is_valid_date() routine. To use it, change the reformatting section of the script to look like this:
if (exists ($map{$month})
    && is_valid_date ($year, $map{$month}, $day))
{
  $val[$i] = sprintf ("%04d-%02d-%02d", $year, $map{$month}, $day);
}
else
{
  # warn, but don't reformat
  warn "$val[$i]: bad date?\n";
}
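If you're curious what a check of this kind involves, here is a minimal sketch of one. It is a hypothetical stand-in, named is_valid_date_sketch() to keep it distinct from the real routine; the actual is_valid_date() in Cookbook_Utils.pm may work differently. The sketch assumes numeric year, month, and day arguments and applies the Gregorian leap-year rule to every year.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a date validity check. Returns 1 if the
# year, month, and day form a legal calendar date, 0 otherwise.
sub is_valid_date_sketch
{
  my ($year, $month, $day) = @_;

  return 0 if $year < 0 || $month < 1 || $month > 12 || $day < 1;

  my @days_in_month = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
  # February has 29 days in leap years: divisible by 4, except
  # century years that are not divisible by 400
  my $is_leap = ($year % 4 == 0) && ($year % 100 != 0 || $year % 400 == 0);
  $days_in_month[1] = 29 if $is_leap;

  return $day <= $days_in_month[$month - 1] ? 1 : 0;
}

for my $date ([1788, 2, 29], [1787, 2, 29], [1788, 6, 31])
{
  printf "%04d-%02d-%02d: %s\n", @{$date},
         is_valid_date_sketch (@{$date}) ? "valid" : "bad date?";
}
```

The leap-year test is the reason four-digit years matter: with a two-digit year such as 00, there's no way to tell 1900 (not a leap year) from 2000 (a leap year).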
10.33 Using Dates with Missing Components