
The clicksort.php script shown here does not do that. However, the recipes distribution contains a Perl counterpart script, clicksort.pl, that does perform this kind of check. Have a look at it if you want more information.

The cells in the rows following the header row contain the data values from the database table, displayed as static text. Empty cells are displayed using &nbsp; so that they display with the same border as nonempty cells (see Recipe 17.4).

18.13 Web Page Access Counting

18.13.1 Problem

You want to count the number of times a page has been accessed. This can be used to display a hit counter in the page. The same technique can be used to record other types of information as well, such as the number of times each of a set of banner ads has been served.

18.13.2 Solution

Implement a hit counter, keyed to the page you want to count.

18.13.3 Discussion

This section discusses access counting, using hit counters for the examples. Counters that display the number of times a web page has been accessed are not such a big thing as they used to be, presumably because page authors now realize that most visitors don't really care how popular a page is. Still, the general concept has application in several contexts. For example, if you're displaying banner ads in your pages (Recipe 17.8), you may be charging vendors by the number of times you serve their ads. To do so, you need to count the number of accesses for each one. You can adapt the technique shown in this section for purposes such as these.

There are several methods for writing a page that displays a count of the number of times it has been accessed. The most basic is to maintain the count in a file. When the page is requested, you open the file, read the count, increment it, write the new count back to the file, and display it in the page. This has the advantage of being easy to implement and the disadvantage that it requires a counter file for each page that includes a hit count. It also doesn't work properly if two clients access the page at the same time, unless you implement some kind of locking protocol in the file access procedure. It's possible to reduce counter file litter by keeping multiple counts in a single file, but that makes it more difficult to access particular values within the file, and it doesn't solve the simultaneous-access problem. In fact, it makes it worse, because a multiple-counter file has a higher likelihood of being accessed by multiple clients simultaneously than does a single-counter file. So you end up implementing storage and retrieval methods for processing the file contents, and locking protocols to keep multiple processes from interfering with each other. Hmm...
those sound suspiciously like the problems that MySQL already takes care of! Keeping the counts in the database centralizes them into a single table, SQL provides the storage and retrieval interface, and the locking problem goes away because MySQL serializes access to the table so that clients can't interfere with each other. Furthermore, depending on how you manage the counters, you may be able to update the counter and retrieve the new sequence value using a single query.

I'll assume that you want to log hits for more than one page. To do that, create a table that has one row for each page to be counted. This means it's necessary to have a unique identifier for each page, so that counters for different pages don't get mixed up. You could assign identifiers somehow, but it's easier just to use the page's path within your web tree. Web programming languages typically make this path easy to obtain; in fact, we've already discussed how to do so in Recipe 18.2. On that basis, you can create a hitcount table as follows:

    CREATE TABLE hitcount
    (
      path  VARCHAR(255) BINARY NOT NULL,
      hits  BIGINT UNSIGNED NOT NULL,
      PRIMARY KEY (path)
    );

This table definition involves some assumptions:

• The BINARY keyword in the path column definition makes the column values case sensitive. That's appropriate for a web platform where pathnames are case sensitive, such as most versions of Unix. For Windows or for HFS+ filesystems under Mac OS X, filenames are not case sensitive, so you'd omit BINARY from the definition.
• The path column has a maximum length of 255 characters, which limits you to page paths no longer than that. If you expect to require longer values, use a BLOB or TEXT type rather than VARCHAR.
  But in this case, you're still limited to indexing a maximum of the leftmost 255 characters of the column values, so you'd use a non-unique index rather than a PRIMARY KEY.
• The mechanism works for a single document tree, such as when your web server is used to serve pages for a single domain. If you institute a hit count mechanism on a host that serves multiple virtual domains, you may want to add a column for the domain name. This value is available in the SERVER_NAME value that Apache puts into your script's environment. In this case, the hitcount table index would include both the hostname and the page path.

The general logic involved in hit counter maintenance is to increment the hits column of the record for a page, then retrieve the updated counter value. One way to do that is by using the following two queries:

    UPDATE hitcount SET hits = hits + 1 WHERE path = 'page path';
    SELECT hits FROM hitcount WHERE path = 'page path';

Unfortunately, if you use that approach, you may not get the correct value. If several clients request the same page simultaneously, several UPDATE statements may be issued in close temporal proximity. The following SELECT statements then wouldn't necessarily get the corresponding hits value. This can be avoided by using a transaction or by locking the hitcount table, but that slows down hit counting. MySQL provides a solution that allows each client to retrieve its own count, no matter how many updates happen at the same time:

    UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = 'page path';
    SELECT LAST_INSERT_ID();

The basis for updating the count here is LAST_INSERT_ID(expr), which was discussed in Recipe 11.17. The UPDATE statement finds the relevant record and increments its counter value.
The use of LAST_INSERT_ID(hits+1) rather than just hits+1 tells MySQL to treat the value as though it were an AUTO_INCREMENT value. This allows it to be retrieved in the second query using LAST_INSERT_ID(). The LAST_INSERT_ID() function returns a connection-specific value, so you always get back the value corresponding to the UPDATE issued on the same connection. In addition, the SELECT statement doesn't need to query a table, so it's very fast. A further efficiency may be gained by eliminating the SELECT query altogether, which is possible if your API provides a means for direct retrieval of the most recent sequence number. For example, in Perl, you can update the count and get the new value with a single query like this:

    $dbh->do ("UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1)
               WHERE path = ?", undef, $page_path);
    $hits = $dbh->{mysql_insertid};

However, there's still a problem here. What if the page isn't listed in the hitcount table? In that case, the UPDATE statement finds no record to modify and you get a counter value of zero. You could deal with this problem by requiring that any page that includes a hit counter must be registered in the hitcount table before the page goes online. A friendlier alternate approach is to create a counter record automatically for any page that is found not to have one. That way, page designers can put counters in pages with no advance preparation.

To make the counter mechanism easier to use, put the code in a utility function that takes a page path as its argument, handles the missing-record logic internally, and returns the count.
Conceptually, the function acts like this:

    update the counter
    if the update modifies a row
      retrieve the new counter value
    else
      insert a record for the page with the count set to 1

The first time you request a count for a page, the update modifies no rows because the page won't be listed in the table yet. The function creates a new counter and returns a value of one. For each request thereafter, the update modifies the existing record for the page and the function returns successive access counts.

In Perl, a hit-counting function might look like this, where the arguments are a database handle and the page path:

    sub get_hit_count
    {
      my ($dbh, $page_path) = @_;

      my $rows = $dbh->do ("UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1)
                            WHERE path = ?", undef, $page_path);
      return ($dbh->{mysql_insertid}) if $rows > 0;   # counter was incremented

      # If the page path wasn't listed in the table, register it and
      # initialize the count to one.  Use IGNORE in case another client
      # tries same thing at the same time.
      $dbh->do ("INSERT IGNORE INTO hitcount (path,hits) VALUES(?,1)",
                undef, $page_path);
      return (1);
    }

The CGI.pm script_name() function returns the local part of the URL, so you use get_hit_count() like this:

    my $hits = get_hit_count ($dbh, script_name ());
    print p ("This page has been accessed $hits times.");

The counting mechanism potentially involves multiple queries, and we haven't used a transactional approach, so the algorithm still has a race condition that can occur for the first access to a page. If multiple clients simultaneously request a page that is not yet listed in the hitcount table, each of them may issue the UPDATE query, find the page missing, and as a result issue the INSERT query to register the page and initialize the counter.
The algorithm uses INSERT IGNORE to suppress errors if simultaneous invocations of the script attempt to initialize the counter for the same page, but the result is that they'll all get a count of one. Is it worth trying to fix this problem by using transactions or table locking? For hit counting, I'd say no. The slight loss of accuracy doesn't warrant the additional processing overhead. For a different application, the priority may be accuracy over efficiency, in which case you would opt for transactions to avoid losing a count.

A PHP version of the hit counter looks like this:

    function get_hit_count ($conn_id, $page_path)
    {
      $query = sprintf ("UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1)
                         WHERE path = %s", sql_quote ($page_path));
      if (mysql_query ($query, $conn_id) && mysql_affected_rows ($conn_id) > 0)
        return (mysql_insert_id ($conn_id));

      # If the page path wasn't listed in the table, register it and
      # initialize the count to one.  Use IGNORE in case another client
      # tries same thing at the same time.
      $query = sprintf ("INSERT IGNORE INTO hitcount (path,hits)
                         VALUES(%s,1)", sql_quote ($page_path));
      mysql_query ($query, $conn_id);
      return (1);
    }

To use it, call the get_self_path() function that returns the script pathname (see Recipe 18.2):

    $self_path = get_self_path ();
    $hits = get_hit_count ($conn_id, $self_path);
    print ("<p>This page has been accessed $hits times.</p>\n");

In Python, the function looks like this:

    def get_hit_count (conn, page_path):
        cursor = conn.cursor ()
        cursor.execute ("""
            UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1)
            WHERE path = %s
        """, (page_path,))
        if cursor.rowcount > 0:     # a counter was incremented
            count = cursor.insert_id ()
            cursor.close ()
            return (count)

        # If the page path isn't listed in the table, register it and
        # initialize the count to one.  Use IGNORE in case another client
        # tries same thing at the same time.
        cursor.execute ("""
            INSERT IGNORE INTO hitcount (path,hits) VALUES(%s,1)
        """, (page_path,))
        cursor.close ()
        return (1)

And is used as follows:

    self_path = os.environ["SCRIPT_NAME"]
    count = get_hit_count (conn, self_path)
    print ("<p>This page has been accessed %d times.</p>" % count)

The recipes distribution includes demonstration hit counter scripts for Perl, PHP, and Python under the apache directory. A JSP version is under the tomcat directory. Install any of these in your web tree, invoke it a few times, and watch the count increase. First you'll need to create the hitcount table, as well as the hitlog table described in Recipe 18.14. Both tables can be created from the hits.sql script provided in the tables directory.
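
The discussion above notes that an application needing exact counts would opt for transactions rather than accept the first-access race. As a rough illustration of that alternative, here is a minimal Python sketch. It is not taken from the recipes distribution: it uses Python's built-in sqlite3 module as a stand-in so that it runs without a MySQL server, and the names open_demo_db and get_hit_count_tx are hypothetical. With MySQL you would need a transactional table type such as InnoDB and would issue the equivalent statements through your MySQL API.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory stand-in for the hitcount table (assumption: SQLite, not MySQL)."""
    # isolation_level=None puts the module in autocommit mode, so the explicit
    # BEGIN/COMMIT statements below control the transaction boundaries.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE hitcount (path TEXT NOT NULL PRIMARY KEY, hits INTEGER NOT NULL)"
    )
    return conn


def get_hit_count_tx(conn, page_path):
    """Increment and return the counter for page_path inside a single transaction."""
    cursor = conn.cursor()
    # BEGIN IMMEDIATE takes SQLite's write lock up front, so no other writer
    # can slip in between the statements that follow.
    cursor.execute("BEGIN IMMEDIATE")
    try:
        # Register the page with a zero count if it is not listed yet.
        cursor.execute(
            "INSERT OR IGNORE INTO hitcount (path, hits) VALUES (?, 0)", (page_path,)
        )
        # Increment the counter, then read it back before anyone else can change it.
        cursor.execute(
            "UPDATE hitcount SET hits = hits + 1 WHERE path = ?", (page_path,)
        )
        cursor.execute("SELECT hits FROM hitcount WHERE path = ?", (page_path,))
        (hits,) = cursor.fetchone()
        cursor.execute("COMMIT")
        return hits
    except Exception:
        cursor.execute("ROLLBACK")
        raise
    finally:
        cursor.close()


if __name__ == "__main__":
    conn = open_demo_db()
    for _ in range(3):
        print(get_hit_count_tx(conn, "/cgi-bin/hits.py"))  # prints 1, 2, 3
```

Because the insert, update, and select all happen under one lock, simultaneous first requests for a page each receive a distinct count. The price is the extra locking overhead that the LAST_INSERT_ID(expr) approach avoids, which is why the simpler functions shown earlier are usually good enough for hit counters.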

18.14 Web Page Access Logging