
  Release 2.0 Issue 2.0.9, June 2008 http://r2.oreilly.com

  

Jesse Robbins of the O’Reilly Radar team puts it succinctly: “You only make

money when your web site is up. The more available and faster your web

site, the more revenue-generating pages a customer can view in the same

amount of time, and the happier the customer will be.”

  Jesse Robbins, from Velocity, page 03


ISSN 1935-9446

  Published six times a year by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472 http://r2.oreilly.com

   This newsletter covers the world of information technology and

  the Internet — and the business and societal issues they raise.

   executive editor Tim O’Reilly tim@oreilly.com

   editor Jimmy Guterman jimmy@oreilly.com

   publisher Sara Winge sara@oreilly.com

   art director

   markp@oreilly.com

   copy editor Steven Sloan

   contributing writers Brady Forrest, Jerry Michalski, Sarah Milstein, Peter Morville, Nathan Torkington, David Weinberger

   © 2008, O’Reilly Media, Inc. All rights reserved. No material in this publication may be reproduced without prior written permission; however, we gladly arrange for reprints, bulk orders, or site licenses. Individual subscriptions cost $495 per year. 80427

   subscription information Release 2.0, PO Box 17046, North Hollywood, CA 91615-9588 http://r2service.oreilly.com

   customer service 1.800.889.8969 1.707.827.7019 r2@oreilly.com

  Jimmy Guterman is editor of Release 2.0 and editorial director of O’Reilly’s Radar group.

  Achieving Velocity

  Operations is the “secret sauce” of successful web sites.

  In the early years of the commercial web, sites were organized like this:


  These sites were neat, clean, manageable. In such an environment, with flat files, a hierarchical structure, and not that many customers, it was relatively easy to keep sites smooth and optimized.

  That didn’t last long. Amazon, eBay, and other trailblazers showed that even websites that were neat, clean, and manageable from the point of view of the visitor had to be, in fact, quite complicated on the back end. Flat files couldn’t be the rule, as any large site needed to be built on a database, rather than HTML pages. And, since you’re asking for passwords and Social Security numbers, you’d better be rigorous about security. The iTunes Music Store doesn’t look to Apple developers the way it looks to customers.

  Anyone running a large-scale web site has at least two great worries: performance and scalability. “Another site is just a click away” has been a web-business cliché since NCSA Mosaic was the browser of choice, but much evidence suggests that the quickest way to get customers off a website is to make it slow and unreliable.

  In recent years, as many large sites have launched and some of them have prospered, the art and craft of web operations has become crucial to companies that want thriving digital businesses. Companies don’t just hire “an ops person” to think about such issues. At the leading digital enterprises, web operations is embedded throughout the enterprise. It’s the foundation on which the business runs. More and more, people will expect pages to load faster, sites to have higher uptime, and companies to deliver more performance with fewer resources. The next cool web startup may well be the one that can quickly scale to serve a large and global audience.

  In this special issue of Release 2.0, we look at the state of web operations, examine early signals of where it’s going, and present the industry’s best practices and most interesting players. Also available as a stand-alone O’Reilly Radar research report, this issue is a complement to O’Reilly’s inaugural Velocity conference (http://conferences.oreilly.com/velocity) for web performance and operations. Longtime IT analyst and reporter Allan Alter called on the deep experience and hard-won strategic insight of conference co-chairs Jesse Robbins and Steve Souders as he crafted the issue. With this report and conference, and in ongoing coverage on the Radar blog (http://radar.oreilly.com), we offer tools for making sure your site is one of the winners.

  Velocity: Transforming Web Operations from Cost Center to Competitive Advantage by Allan Alter and the O’Reilly Radar Team

  When people say the Web has changed how we work, they tend to think about how people buy and sell products, collaborate and share information with co-workers, or all the new kinds of businesses that have emerged. What they often overlook is that the Web has changed something else that’s fundamental to every business: execution.

  Execution, say Larry Bossidy and Ram Charan in their book by that name, is the “discipline of getting things done…the missing link between aspirations and results.” It’s understanding how to operate a business in an efficient, effective, and reliable way, knowing how to meet the expectations of your customers so your organization can meet the expectations of management and investors. In an online business—and nearly every business is an online business today— execution must include the discipline of operating websites. Only, “include” doesn’t go far enough. An online business must think of the website as one of the most important parts of a company’s operations. It’s just that critical.

  Customers don’t care about the operational, behind-the-scenes stuff that goes on, such as how many servers support a site, server automation, or HTML coding. But they do care about whether the site keeps crashing, pages take a long time to download, or the features on the site hang. That’s why smart executives at all Internet companies (in fact, at any organization that conducts business on the web) are recognizing that reliable web operations and fast website performance are essential. With the Web now the sales channel most frequently used by U.S. companies, serious money is at stake: Amazon, Google, and Microsoft have found a lag of half a second or less can have a major impact on revenues and the number of searches. As Jesse Robbins, an O’Reilly Radar blogger and website availability expert, says, “You only make money when your website is up. The more available and the faster your website, the more revenue-generating pages a customer can view in the same amount of time, and the happier the customer will be.” (Robbins was responsible for website availability at Amazon.com, where his title was “Master of Disaster.”)

  Still, most executives don’t fully understand all the potential business benefits of a high-performance, high-uptime website, and the hit their business can take if they neglect web operations. Nor do they know enough about the basic principles, technologies, and management practices that separate well-run sites from the also-rans. These principles sometimes require new ways of thinking, especially about how to approach website downtime and “failure.” This report, written for business executives and managers at online businesses (or any company with a commercial website), provides a guide to understanding what web operations means, the business opportunities and risks it presents, and the best practices for operating and managing a mission-critical site.

  What is Web Operations?

  “Operations” has long referred to the day-in, day-out processes of a business.

  Chief operating officers run their organizations’ day-to-day activities; in banking, “operations” refers to running branches and processing checks or transactions. Likewise, the IT profession has been using “operations” for decades to describe running and maintaining mainframes, servers, and data centers. But the phrase “web operations” is much newer. Its earliest use dates back to 2003, when the phrase appeared as part of the name of the Internet Web Ops conference. You still won’t find many companies with a function known as the web operations department. Web operations remains an ill-defined and even controversial term: just as the phrase “classical music” means both music specifically from the era of Haydn and Mozart, and the entire European concert music tradition from Monteverdi to Stockhausen, web operations sometimes refers only to running and maintaining websites, and at other times serves as an umbrella term that also encompasses the field known as “web performance.” In this report, we’ll use the phrase web operations in its broader sense.

  Job Titles in Web Operations: One indication of just how new the terms “web operations” and “web performance” still are: neither phrase appeared in online job postings for these positions between November 2007 and April 2008.

  Operations = Availability

  Theo Schlossnagle, an expert on building scalable, high-performing websites and CEO of OmniTI Computer Consulting Inc., is not a fan of the phrase “web operations”; he prefers terms that give a tip of the hat to the technical skills required to run a website, such as “site architects” and “site reliability engineers.” But he does have a clear definition of “web operations” in its narrow sense: it’s “how I put the website in place, how I keep it going, how I meet the demand (if demand rises above capacity), or not eat my shorts in costs (if demand is less than capacity). As business requirements change and mutate, it’s making sure what you have in place still works well.” Web operations focuses on availability: keeping sites up and running.

  Availability includes reliability: the capability to consistently download not just web pages, but the features on those pages (e.g., search, video, account information, online purchasing, and chat). To achieve reliability, web operations personnel set up and maintain their sites’ hardware, software, storage, and network infrastructure. They also ensure scalability as demand for the site increases, by designing an infrastructure that can grow and by preparing for the future through capacity planning. Availability also encompasses recovery: the ability to get a website back up should the site, or any individual feature, fail to be available. Part of the job of running a website is setting up redundant systems and even data centers that can take over when equipment fails, and creating plans to get the site back online when it crashes.

  Performance = Response

  While web operations and availability focus on the servers, web performance concentrates on what the user sees. “It’s how we deliver our product as quickly as possible and provide a good user experience,” is the succinct definition by Google technical staff member Steve Souders, author of High Performance Websites, creator of the YSlow tool for analyzing web performance, and formerly the “Chief Performance Yahoo!” at Yahoo. Web performance focuses primarily on response time: the length of time it takes to download a web page. Improving performance involves optimizing the files, instructions, and components that make up a web page. But the definition of performance includes efficiency, too: optimizing hardware, data centers, and networking to serve web pages at maximal speed with minimal resources.


Web Operations and Business

  The difference between operations and performance is important to the people who do the work. But for others the distinctions often overlap (at some point, a slow page is equivalent to an unavailable page) and are ultimately not very important to the folks outside the data center. What users care about is their experience on the Web; what management cares about most is the bottom line.

  “In business,” says Adam Jacob, Senior Partner of HJK Solutions and a specialist in establishing web operations for start-up firms, “ ‘operations’ is usually defined as the things you need to do to extract value from your resources. In the oil industry it’s what you do to extract oil from the ground and put it in a barrel. It’s the same with web operations: it’s extracting the value you have in your website and what you are running on it.”

  To extract that value, executives don’t need to know the intricacies of web technology. What they absolutely must understand are the three most important principles for a sound IT infrastructure. And the first of them, design for failure, requires a 180-degree change in how most IT organizations think about downtime.

1. Design for failure. Failure is inevitable: servers break down, bugs and network outages occur, people make mistakes. It’s better to assume that failure will happen, and design the infrastructure so it recovers quickly when it does. “A system that can tolerate minor problems and still deliver a great customer experience is better than one that delivers great experience all the time except when the page is blocked out,” says Robbins. As telephone companies and power grids do with their networks, Google, Amazon, and many of the most successful websites design their web infrastructure as a mesh of connected systems. When one system goes down, be it a web server, a database server, or some other piece of hardware, others pick up the load.

  John Allspaw, manager of the Operations Engineering Group at Flickr.com, recalls a bizarre request at one of his first jobs that brought home the point for him. “My boss said, ‘I want you to take a look at this diagram of networks and servers. In three months I will walk around and randomly unplug things to see if the network still works.’ ” A network that can survive a rambling, destructive force is a perfect allegory for how web operations should work, he says.
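  The "unplug things at random" test can be rehearsed on paper before anyone touches a cable. The sketch below is only an illustration under stated assumptions: the server names, the per-server capacity, and the peak load are hypothetical, and real failover depends on load balancers and application behavior that a capacity sum does not capture.

```python
import random

def surviving_capacity(servers, unplugged):
    """Total capacity (requests/sec) left after the named servers are unplugged."""
    return sum(cap for name, cap in servers.items() if name not in unplugged)

def unplug_test(servers, peak_load, failures=1, trials=1000, seed=0):
    """Unplug `failures` random servers per trial and return the fraction of
    trials in which the remaining servers could still carry `peak_load`."""
    rng = random.Random(seed)
    names = list(servers)
    survived = 0
    for _ in range(trials):
        unplugged = set(rng.sample(names, failures))
        if surviving_capacity(servers, unplugged) >= peak_load:
            survived += 1
    return survived / trials

# Hypothetical pool: four web servers, each able to handle 500 requests/sec.
pool = {"web1": 500, "web2": 500, "web3": 500, "web4": 500}

print(unplug_test(pool, peak_load=1400, failures=1))  # one failure leaves 1500 req/s
print(unplug_test(pool, peak_load=1400, failures=2))  # two failures leave 1000 req/s
```

  With these made-up numbers the pool rides out any single failure but not a double one, which is the kind of answer a "design for failure" review is meant to surface early.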

2. Design for scalability. In a scalable architecture, computing resources keep up with demand as it grows. A scalable system allows you to design or modify applications to run on multiple servers, then add servers to the ones already in place (horizontal scalability), or upgrade your current servers with more powerful and faster ones (vertical scalability). Horizontal scalability is generally cheaper and preferable; vertical scaling makes sense when rewriting code is too costly or, if horizontal scaling is already in place, when it makes more sense to replace old servers with fewer, more powerful servers that are less expensive to run.

  By contrast, an example of an architecture that doesn’t scale is one that keeps a company’s data in a single database that can’t query other database servers. If the database hits its upper limit and it’s necessary for queries to go to multiple database servers, it’s hard to change the code.
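  One common way to avoid that trap is to route every query through a single function that decides which database server holds a given record, so that adding servers later changes one function rather than the whole codebase. The sketch below is a minimal illustration, not a method described by anyone quoted in this report; the host names and the `shard_for` helper are hypothetical.

```python
import hashlib

def shard_for(customer_id, shards):
    """Pick the database server that holds a given customer's rows.

    Hashing the key spreads customers evenly and always sends the same
    customer to the same server.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Hypothetical server names; a real deployment would list its own hosts.
SHARDS = ["db1.example.com", "db2.example.com", "db3.example.com"]

for customer in ("alice", "bob", "carol"):
    print(customer, "->", shard_for(customer, SHARDS))
```

  Simple modulo hashing like this reassigns many keys whenever the list of servers changes, so it is a starting point for discussion rather than a production design.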

  For startup companies with a tiny operations staff, the big scalability issue is surviving sudden spikes in traffic and avoiding long periods of downtime. That’s why Sendi Windaja, CTO of startup legal referral website Avvo.com, set up an automated web operations infrastructure from the start: one where the work of adding a new server to a network can be done automatically by the network, rather than manually by the operations staff. His one-and-a-half person operations staff can add a bare metal server, load the operating system and applications, and add it to the network in 30 minutes. Downtime to deploy a new software release is usually a regularly scheduled 10 seconds.

  The Operations Advantage: new online businesses can spend far less time managing their website if they use server automation from the beginning.

  As websites emerge from the startup phase, capacity planning and anticipating the stress points on systems become the main issues. Mainstream retailers and companies have the additional burden of integrating their website’s features with their existing IT infrastructure and systems, still a big challenge for store retailers or catalog firms.
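  At its simplest, capacity planning is arithmetic: how many servers does the expected peak require, with room to spare? The sketch below shows one rough way to frame the question. The 30 percent headroom, the single spare server, and the traffic figures are assumptions chosen for illustration, not numbers from any company in this report.

```python
import math

def servers_needed(peak_requests_per_sec, per_server_capacity, headroom=0.3, spares=1):
    """Estimate how many servers a site needs at peak.

    headroom: fraction of each server's capacity deliberately left unused.
    spares:   extra servers kept so one failure does not cause an outage.
    """
    usable_capacity = per_server_capacity * (1 - headroom)
    return math.ceil(peak_requests_per_sec / usable_capacity) + spares

# Hypothetical site: 2,000 requests/sec at peak, 500 requests/sec per server.
print(servers_needed(2000, 500))  # 6 servers to carry the load, plus 1 spare
```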

3. Design for the browser. Important as it is, a scalable architecture isn’t sufficient for achieving a fast, bug-free user experience. The reason? Steve Souders’ “Performance Golden Rule”: “Only 10–20% of the end user response time is spent downloading the HTML document. The other 80% to 90% is spent downloading all the components on the page.”

  In other words, when you’re waiting for a page to load, the holdup is not the basic HTML frame of the web page, but the time it takes the browser to fetch all of the scripts, images, and stylesheets, and to perform the DNS lookups and redirects, that make up a modern web page.
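  A quick calculation shows why the Golden Rule points performance work at the front end. The sketch below assumes a hypothetical 10-second page load and takes 15 percent as the HTML document's share (the midpoint of the 10–20% range quoted above); the function name is made up for the example.

```python
def seconds_saved(total_seconds, html_share, backend_cut, frontend_cut):
    """Return (backend_saving, frontend_saving) in seconds.

    html_share:   fraction of response time spent fetching the HTML document.
    backend_cut:  fractional speedup applied to the HTML portion.
    frontend_cut: fractional speedup applied to everything else on the page.
    """
    backend_time = total_seconds * html_share
    frontend_time = total_seconds - backend_time
    return backend_time * backend_cut, frontend_time * frontend_cut

# Hypothetical page: 10 s to load, 15% of it spent on the HTML document.
backend, frontend = seconds_saved(10.0, 0.15, backend_cut=0.5, frontend_cut=0.5)
print(f"halving back-end time saves {backend:.2f} s")    # 0.75 s
print(f"halving front-end time saves {frontend:.2f} s")  # 4.25 s
```

  Under these assumptions the same 50 percent improvement is worth more than five times as much when it is applied to the page components rather than to the HTML document.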

  Souders has set out 14 rules for improving front-end performance in his book High Performance Websites: Essential Knowledge for Frontend Engineers (O’Reilly, 2007). Any website can shed 25 to 50 percent off its download time by following the applicable rules, he claims. A downloadable tool by Souders called “YSlow” (which runs on the Firebug tool for Firefox) analyzes whether a web page is following or breaking these rules, and then grades pages.
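  To get a feel for how such rules shrink what the browser has to fetch, the sketch below applies two of them, minifying a script and gzipping it, to a small made-up JavaScript file and prints the size at each step. It is an illustration only: the sample script is hypothetical, and `crude_minify` is a deliberately naive stand-in for a real minifier.

```python
import gzip
import re

def crude_minify(source):
    """Strip // line comments and collapse runs of whitespace.

    Deliberately simplistic: it would mangle a "//" inside a string literal,
    so treat it as an illustration, not a real minifier.
    """
    without_comments = re.sub(r"//[^\n]*", "", source)
    return re.sub(r"\s+", " ", without_comments).strip()

# Hypothetical script, repeated to approximate a realistically sized file.
SCRIPT = """
// Toggle the visibility of a page element.
function toggle(elementId) {
    var element = document.getElementById(elementId);
    element.style.display = (element.style.display === "none") ? "" : "none";
}
""" * 40

original = SCRIPT.encode("utf-8")
minified = crude_minify(SCRIPT).encode("utf-8")
compressed = gzip.compress(minified)

for label, payload in (("original", original),
                       ("minified", minified),
                       ("minified + gzip", compressed)):
    print(f"{label:>16}: {len(payload)} bytes")
```

  The exact savings depend entirely on the content, which is why measuring a site's real pages matters more than any sample.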

  Steve Souders’ 14 rules for high performance websites

  1. Make fewer HTTP requests.

  Reducing the number of components or combining them so fewer requests are needed can reduce response times by as much as 50%.

  2. Use a content delivery network.

  Bringing the servers closer to users reduces the time to transmit page components over the Internet. (More below)

  3. Add an expires header.

  For repeat users, using components already in the browser’s cache avoids unnecessary HTTP calls.

  4. Gzip your scripts and stylesheets.

  Compressing HTTP responses into smaller files reduces transfer times.

  5. Put stylesheets at the top of the document.

  Placing stylesheets there enables browsers to draw the visible components of a page first as it downloads.

  6. Put scripts at the bottom.

  Otherwise, some components will be blocked from downloading; also enables components to be downloaded simultaneously.

  7. Avoid CSS expressions.

  The browser must evaluate them—and usually much more often than developers expect.

  8. Put JavaScript and CSS in external files.

  This can reduce the size of HTML documents without increasing the number of HTTP requests.

  9. Reduce DNS lookups.

  Reducing the number of unique hostnames cuts the time spent on DNS lookups, though it can also reduce the amount of parallel downloading that takes place.

  10. Minify JavaScript source code.

  Removing unnecessary characters from the code makes for smaller, faster-to-download JavaScript files.

  11. Avoid redirects.

  Rerouting users from one URL to another makes your pages slower.

  12. Remove duplicate scripts.

  This creates superfluous downloads and evaluations.

  13. Remove or reconfigure ETags.

  ETags can thwart caching, forcing browsers to make more HTTP requests.

  14. Make Ajax cacheable.

  Make sure Ajax requests have an Expires header, and follow the other 13 performance guidelines.

  Source: High Performance Websites: Essential Knowledge for Frontend Engineers by Steve Souders (2007, O’Reilly Media, Inc.; $29.99, 146 pages)

  Web Operations and Performance: Business Principles

  So how much more value can a company extract from a high-performance, high-availability website than one that’s slow or crash-prone? Can a few extra seconds of download time or minutes of downtime really matter? It turns out that it matters deeply. Web operations has a direct impact on customer satisfaction, revenues, the cost of doing business, and the ability to collect customer data; it also affects Sarbanes-Oxley (SOX) compliance and energy use. In fact, the very survival of a business can even depend on the speed and availability of its website. These nine principles highlight how performance and availability impact the business value of a website.

1. Speed keeps customers satisfied. Along with pleasing site design, compelling content, and security, site speed is one of the critical customer satisfaction factors for any website. “Without a product that’s fast, it doesn’t matter what features it has, people won’t want to use it. You can’t have satisfied customers if the product is slow,” says Eric Schurman, the senior development lead currently responsible for Microsoft’s Live Search search engine, and formerly responsible for Microsoft’s home page. To retail veteran Thomas Harden, the former COO of Lands’ End and executive vice president at L.L. Bean, and the executive who oversaw both companies’ online businesses, “Speed of response time is just table stakes to get into the game. It’s one of the most significant factors causing people to defect from your website.”

  Study after study confirms site speed’s importance:

  • Experiments by Google found a 500 millisecond delay in the speed with which results are returned led to a 20% drop in the number of searches.

  • Google also found the use of mobile Gmail on Apple’s iPhones surged after the company tweaked the program to improve its speed. (Source: Google Mobile Blog)

  • Prof. Dennis F. Galletta of University of Pittsburgh, along with other academic researchers, confirmed delays of as little as two seconds can have a “profound impact” on the way users react to websites. Other academic studies found that sites with download delays are less trusted than faster sites.

  • By reducing download times around the world by up to 65%, Cathay Pacific Airways increased online bookings by more than 100% while saving $1.5 million as more customers booked tickets online instead of through the airline’s call center.

  • One third of shoppers using broadband will leave a site after a four second wait, compared to just 19% of dial-up shoppers, according to a sponsored study of retail websites for Akamai, conducted by Jupiter Research. The same study found 55% of shoppers who spend more than $1,500 a year online considered “pages load quickly” as one of the most influential factors in their decision to return to an online retailer.

  Websites and pages don’t have to be equally fast. Users are more demanding with search engines, news, and shopping sites where there is plenty of competition, and more patient on sites that provide tools they use or information they can’t find anywhere else (like their bank account status), according to Allspaw and AOL operations architect Eric Goldsmith. Galletta’s research team found users are more tolerant of slower performance on deep websites, or if they are familiar with the “terminology” on a website.

  Speed is a key business metric that just keeps getting more important. Users are more impatient now that they’ve grown accustomed to broadband speeds. Competitors raise the bar by improving their sites’ performance. “Two to five years from now, we will see that speed is a primary differentiator from competitors,” predicts Souders.

2. Performance boosts customer satisfaction. “The ‘wow!’ factor is a differentiator for us,” says Jay Shah, CIO of E-LOAN, Inc., which has been ranked the #1 mortgage industry website by Keystone Systems for five years in a row. When a potential borrower submits their data to apply for a loan, “we can process it, integrate third party data, and make a decision in seconds, and we relay that back to the user while they are in the browser session. Then we contact the majority of approved customers within 15 minutes.”

  That’s customer satisfaction—and even delight: capturing customers and users by delivering features that exceed expectations, or are amazingly useful or truly enjoyable. Speed can deliver “wow.” But the push to wow customers also poses a performance challenge: how to add functionality without harming performance?

  Designers are adding browser-choking features and piling on objects to download. In the past five years, the size of the average web page has more than tripled from 93.7K to over 312K, while the average number of objects per page has nearly doubled from 25.7 to 49.9 objects. Meanwhile, the search for new ways to delight customers continues. In one cutting-edge example, a team of MIT researchers has come up with a way to adapt the look and feel of a website to a web user’s cognitive style: impulsive or deliberative, analytic/visual or holistic/verbal, leader or follower, reader or listener. This so-called site “morphing” has a huge potential to increase sales. A BT Group (formerly British Telecom) experimental website using this technology increased purchasing intentions by nearly 20 percent, which could have added about $80 million in revenue to the BT Group’s coffers.

  Forty-three percent of U.S. households still use dialup, according to a recent April 2008 survey by Website Optimization L.L.C. (Dial-up connections are down: a mere 4.3% of U.S. employees still use dial-up connections at work.)

  Average download times for broadband users dropped from 2.8 to 2.33 seconds between February 2006 and February 2008, according to the Keynote Business 40 Internet Performance Index.

  More features need not cause slower performance. You can add more to a page and still make it faster, if you do it the right way. But any company that adds features without instituting sound practices for web performance will find that its pages load more slowly. And as Harden has found, “Customers don’t go backwards. Once they get used to a level of performance, they don’t expect it to get worse.”

  Site        Page weight   Response time (seconds)   YSlow grade
  Amazon      405K          15.9                      D
  AOL         182K          11.5                      F
  CNN         502K          22.4                      F
  eBay        275K          9.6                       C
  Google      18K           1.7                       A
  MSN         221K          9.3                       F
  MySpace     205K          7.8                       D
  Wikipedia   106K          6.2                       C
  Yahoo!      178K          5.9                       A
  YouTube     139K          9.6                       D

  The page weight (size of page), response time, and YSlow grade for the home pages of 10 top U.S. websites, circa early 2007 (from High Performance Websites, p. 103). Complex, popular sites tend to have big home pages that don’t load quickly.
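  The relationship between page weight and response time in this table can be checked directly. The sketch below computes a Pearson correlation from the ten rows as printed above; it is a rough illustration on a small sample, and a correlation by itself says nothing about what causes the slowness.

```python
from math import sqrt

# (page weight in K, response time in seconds), copied from the table above.
PAGES = {
    "Amazon": (405, 15.9), "AOL": (182, 11.5), "CNN": (502, 22.4),
    "eBay": (275, 9.6), "Google": (18, 1.7), "MSN": (221, 9.3),
    "MySpace": (205, 7.8), "Wikipedia": (106, 6.2), "Yahoo!": (178, 5.9),
    "YouTube": (139, 9.6),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

weights = [weight for weight, _ in PAGES.values()]
times = [seconds for _, seconds in PAGES.values()]
r = pearson(weights, times)
print(f"correlation between page weight and response time: {r:.2f}")
```

  A value close to 1 means heavier home pages in this sample tended to be slower ones, which matches the caption's observation.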

3. High-performing sites capture more customer information. Companies that can capture data about the activity of customers and visitors can analyze it to improve current products, develop new ones, and uncover ways to generate more income. It’s one reason spending on analytic and business intelligence tools is growing by 13.1% in 2008, faster than any other application area. Strong site performance helps companies do a better job of analysis: People will use a site more often if it’s fast and easy to use. And the more people use it, the more customer information you can collect.

  Take Flickr.com: the photography website has two levels: free (which limits users to uploading 200 photos) and pro (which permits unlimited uploading for $24.95 a year and provides other benefits). Flickr constantly examines what convinces users to spend the money, says John Allspaw. “Was it the 200 photo limit? Was there a feature that made them turn? That is an exercise that is constantly going on.” Collecting, storing, backing up, and sharing this data is part of the operations team’s job at Flickr, he adds. (For more on web operations at Flickr, see the case study.)

  This online customer data is also invaluable for capacity planning. “Every time we launch a large feature, we see how the feature is adopted,” Allspaw noted. “The marketing people want to know how quickly it is adopted and its effect on revenue. We use the same metrics to figure out whether we planned for enough capacity for this feature.”

  Whether it’s channeling online customer data to the marketing, sales, or customer service departments, or using it for capacity planning, the web ops group has to handle this data with care. They need to be sure the data is kept secure, and that policies are in place that make certain only the appropriate people can view it.

4. Speed increases revenue; downtime loses money. The connection between web performance and revenues is very simple: The faster the site is, the more pages you can view in a unit of time. Faster site, more page views, more clicks. Downtime means no clicks at all, and no clicks means no revenue from ads or sales.
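  The "more pages per unit of time" argument can be made concrete with a toy model. The sketch below assumes a visitor with a fixed amount of time who spends a fixed number of seconds reading each page; the session length, reading time, and load times are hypothetical, and real browsing behavior is far less regular.

```python
def pages_per_session(session_seconds, load_seconds, reading_seconds):
    """Pages a visitor can get through if every page costs load + reading time."""
    return session_seconds / (load_seconds + reading_seconds)

# Hypothetical visitor: a 10-minute session, 20 seconds spent on each page.
slow = pages_per_session(600, load_seconds=4.0, reading_seconds=20.0)
fast = pages_per_session(600, load_seconds=2.0, reading_seconds=20.0)
print(f"{slow:.1f} pages at 4 s per load, {fast:.1f} pages at 2 s per load")
```

  In this model, trimming two seconds from each load gives the same visitor roughly two more page views per session, before counting anyone who would otherwise have given up and left.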

  There have been plenty of estimates of retail revenues lost from unavailable, sluggish, or buggy websites to back this up. They range from $25 billion a year (Zona Research, 2001) to $877 million for the 2004 holiday season (Tom Kuegler’s Conversion Chronicles blog) and £300 million in a 2007 study of 40 British retail, finance, insurance, and travel websites (SciVisum Ltd.). An academic study of a three-hour-long denial-of-service attack on Yahoo on February 7, 2000 estimated that Yahoo lost 2.22 million visits and $88,854 in click-through revenues while under attack, and 7.56 million visits and $302,277 during the following 8 weeks. After that time, losses were negligible as visitors returned to Yahoo.

  Companies rarely spill the beans on their own winnings or losses, though the information does sometimes sneak out: Greg Linden, a former Amazon.com computer scientist who wrote the site’s recommendation engine, stated in a 2006 speech (and later posted on his website) that Amazon found “every 100 millisecond delay in the time it takes a web page to load costs 1% of sales.” Microsoft’s Schurman and Thomas Harden (formerly of Lands’ End and L.L. Bean) won’t provide actual numbers, but both confirm Robbins’ argument.
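  Linden's figure lends itself to a back-of-the-envelope estimate. The sketch below simply extends the "1% of sales per 100 milliseconds" finding in a straight line, which is an assumption: the source gives no evidence that the relationship stays linear for larger delays or holds for sites other than Amazon. The annual sales figure is hypothetical.

```python
def estimated_sales_loss(annual_sales, added_delay_ms, loss_per_100ms=0.01):
    """Rough annual sales lost to extra page delay, extrapolated linearly."""
    return annual_sales * loss_per_100ms * (added_delay_ms / 100)

# Hypothetical retailer with $10 million in annual online sales.
loss = estimated_sales_loss(10_000_000, added_delay_ms=300)
print(f"a 300 ms slowdown would cost roughly ${loss:,.0f} a year")
```

  Even as a crude estimate, a calculation like this gives executives a dollar figure to weigh against the cost of performance work.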

  “When we make a change to the Microsoft Live Search site that makes it faster or slower, we found it has a direct measurable impact on revenue,” says Schurman. “Will they use us more often if we are faster? Yes. We’ve also found if we make the site faster we’ll even get more ad clickthroughs.” Even a small improvement in site speed helps. “If you make the site just a tiny bit faster, we do see a tiny change in revenue. Last year we made a very substantial change of performance, on the order of 50 percent, and we saw a really substantial bump in revenue.”

  Harden found the connection between site performance and revenues resembles the one between in-stock inventory and retail sales. “When in-stocks are 94% or 96% instead of 100%, the vast majority of customers are not going to feel that. But you reach a point at the high 80’s where things begin to fall apart. It’s almost like you hit a tipping point and things begin to collapse— all the customers are complaining about it. To me, site speed is much the same way. There is a point where boom, you hit a bad point and it drops off like a rock and goes to almost zero very quickly. It’s like this in call centers, too.”

5. Scalability makes or breaks online businesses.

  Web operations veterans in Silicon Valley have seen it happen plenty of times: a new web business is written about on Slashdot.org or another popular site, users hit the link in droves, and its servers can’t keep up with the traffic. Some companies, like iLike, survive their brush with popularity; others, like Twitter, a target for criticism over its service outages, risk losing users to new competitors.

  Sudden spikes in traffic are the most dramatic example of a classic operations issue: scalability. The ability to keep up with demand ensures that a company’s website can keep pace with growth. Besides costing revenue and customers, failure to scale damages reputation and provides openings to competitors. Scalability also matters to mature companies: a website that can’t scale becomes a drag on agility; it cannot execute at the speed the business requires. “Getting the platform right is the basis for better availability, much better information for decision making and risk management, and for making agility sustainable,” says George Westerman, co-author of IT Risk: Turning Business Threats into Competitive Advantage (2007, Harvard Business School Press, $35, 256 pages) and a researcher at the Center for Information Systems Management at MIT.

  While managers at larger companies are used to thinking about scale, it’s easily overlooked by developers and managers at start-ups rushing to get their first product out the door. The trick is to design scalability into your architecture from the start.

6. Strong operations practices make Sarbanes-Oxley compliance easier.

  Weak web operations practices don’t just set the stage for scaling problems; they set companies up for problems with Sarbanes-Oxley compliance. The Sarbanes-Oxley Act (SOX), passed in 2002, obliges public companies listed on a U.S. stock exchange to certify the accuracy and timeliness of their financial information, and the integrity of the processes which produce it. Since data on clickthroughs, page views, and online sales and order fulfillment are part of the financial information stream, web operations are just as affected by SOX as any other part of IT involved in tracking the flow of information about financial transactions inside a company. Yet, “there’s a misconception that Sarbanes doesn’t apply to websites, even though they are generating a significant volume of dollars,” according to Marios Damianides, a past international president of ISACA and the IT Governance Institute, and a partner with Ernst & Young’s Technology and Security Risk Services (TSRS) group in New York.

  Under SOX, companies need to establish operating procedures, document them, show they are repeatable, and document who is involved in the procedures and who has access to the company’s IT infrastructure. Auditors need to be able to verify all this information. This requirement isn’t necessary for customer tracking or CRM alone, but it does cover web operations that are directly linked to the financial chain, just like other IT operations.

  Compliance is both a blessing and a curse for web operations. Most of the common practices in a web operations organization must be rethought from a compliance perspective. Everyone must be educated on their responsibilities and the reasons behind them.

  “The best way to approach compliance is to look at the controls as a set of design requirements,” says Robbins. “Make sure that the compliance team, engineering, and operations are working together with a common purpose right from the start. Otherwise you end up with a huge gap between people and process and intense conflict between risk management and web operations.”

7. The web operations cost/benefit equation changes with your business.

  A big-budget line item always gets attention inside a business. For online businesses, says Flickr’s John Allspaw, web operations is one of the bigger expenses there is, if not the biggest. “Every quarter, the person in finance assigned to us brings a slide with a table on how much money going in and how much is going out. At this point, everyone looks at me. The slide makes interesting reading: our capital expenditures is probably the largest driver of the financials every quarter.” Servers, storage, electricity, staff, and (if used) proprietary software are all costs to manage as efficiently as possible; they are part of the equation when deciding whether to use service providers to host websites or host a site internally.

  Managing operations expenses while supporting growth is why capacity planning is so important: the central question is how many servers and how much storage to add without buying more capacity than needed. Communication between the operations group and finance should be almost constant, since the operations group knows when it will need to buy more servers.

  [Figure: John Allspaw’s capacity management chart, from the Radar blog]
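  The capacity-planning question described above (how much to add, and when) can be illustrated with a simple trend fit. This is only a sketch under stated assumptions: it fits a straight line to past storage usage, which real growth rarely follows exactly, and the sample numbers, the 12 TB-per-server figure, and the function names are all hypothetical rather than anything Flickr has described.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_tb, capacity_tb):
    """Days after the last sample until the trend line reaches capacity.

    Returns math.inf if usage is flat or shrinking.
    """
    slope, intercept = fit_line(days, used_tb)
    if slope <= 0:
        return math.inf
    day_full = (capacity_tb - intercept) / slope
    return max(0.0, day_full - days[-1])


def servers_to_add(projected_tb, capacity_tb, tb_per_server):
    """Whole servers needed to cover the projected shortfall, if any."""
    shortfall = max(0.0, projected_tb - capacity_tb)
    return math.ceil(shortfall / tb_per_server)


if __name__ == "__main__":
    # Hypothetical monthly samples: day number and terabytes in use.
    days = [0, 30, 60, 90]
    used = [100, 110, 120, 130]
    capacity = 150  # terabytes installed today (assumed)

    slope, intercept = fit_line(days, used)
    print("Days of headroom left:", round(days_until_full(days, used, capacity)))

    # Project usage 180 days past the last sample and size the purchase.
    projected = intercept + slope * (days[-1] + 180)
    print("Servers to order:", servers_to_add(projected, capacity, 12))
```

  A forecast like this gives operations and finance a shared number to discuss: how long the current hardware will last, and how large the next purchase needs to be.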

8. Efficient websites reduce energy costs.

  An efficient web operation has the additional benefit of lower energy costs and CO2 emissions. According to a study by Jonathan Koomey, a researcher at the Lawrence Berkeley National Laboratory and Stanford University, 1.2% of electricity generated in the US was consumed by servers and cooling equipment in 2005. That amounted to $2.7 billion in electricity sales, or the equivalent of five 1,000-megawatt power plants.

  Reducing the number of web servers through virtualization or replacing them with more energy-efficient equipment can help. For example, when Flickr replaced 67 servers with 18 new quad-core servers in its web operations center, it reduced power consumption by 70% in that cluster, according to Allspaw. Even a small change can make a difference. According to Steve Souders, if Wikipedia were to make one change, adding a far-future Expires header to the thirteen images on its front page, it would eliminate 52 million image validation requests daily. Serving those requests takes six servers running full time, year-round, he estimates. “Assume a fully loaded server uses 100W. Six servers, year-round, consume 5,000 kilowatt-hours per year or approximately 500-1000 pounds of CO2 emissions.” For a mid-sized company in Boston, that’s $950 a year.
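  The arithmetic behind these figures is easy to reproduce. The sketch below recomputes the Wikipedia estimate and works out what the Flickr numbers imply per server. The $0.18 per kilowatt-hour price is an assumption chosen because it lands close to the $950 quoted above; it is not a figure from Souders, and the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_kwh(servers: int, watts_per_server: float) -> float:
    """Energy used by servers running around the clock for one year."""
    return servers * watts_per_server * HOURS_PER_YEAR / 1000.0


def annual_cost(kwh: float, price_per_kwh: float) -> float:
    """Electricity bill for the given consumption."""
    return kwh * price_per_kwh


def per_server_power_ratio(old_count: int, new_count: int,
                           reduction: float) -> float:
    """Power drawn by one new server relative to one old server,
    given the fractional reduction in total cluster power."""
    return (1.0 - reduction) * old_count / new_count


if __name__ == "__main__":
    # Souders' Wikipedia estimate: six servers at 100 W each.
    kwh = annual_kwh(6, 100)
    print(f"Energy: {kwh:,.0f} kWh per year")

    # Assumed electricity price of $0.18 per kWh (not from the article).
    print(f"Cost: ${annual_cost(kwh, 0.18):,.2f} per year")

    # Flickr: 67 servers replaced by 18, total power down 70%.
    ratio = per_server_power_ratio(67, 18, 0.70)
    print(f"Each new server draws about {ratio:.2f}x an old one")
```

  Run as written, the energy figure comes out slightly above the rounded 5,000 kilowatt-hours in the quote. The Flickr calculation suggests the savings came from consolidation: each new machine draws a little more power than an old one, but far fewer of them are needed.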

  Case Study: Flickr

  With 26.5 million users storing 2.2 billion photos, Flickr.com, a San Francisco-based subsidiary of Yahoo, is one of the Web’s largest and most successful photo storing and sharing websites. Operating that website is a massive undertaking: it uses 2.1 petabytes of storage space (4.2 petabytes, if you count storing the photos twice for redundancy), and operates nine data centers with a total of 1,200 servers using the LAMP stack platform with the MySQL database and PHP.

  John Allspaw, manager of the Operations Engineering Group, and his staff are the people who keep Flickr running with rarely a flicker. Allspaw was already a systems operations veteran (he built the web infrastructures for Salon.com, InfoWorld.com, and Friendster.com) when he was recruited to Flickr by founders Stewart Butterfield and Caterina Fake to take over the responsibility of managing web operations from Cal Henderson, the original lead developer who wrote the code for Flickr. Since he joined Flickr, his staff has grown from one to eight people, including Allspaw. (The department also includes one database administrator, one search operations administrator, one systems administrator, and four web operations generalists.) IT security monitoring, testing, and auditing for web operations, as for other functions, is handled separately by a group named, aptly, the Paranoid Group. The group’s data centers make operations the largest capital expense at Flickr.

  Allspaw’s focus is availability and speed. Flickr has servers in multiple geographic regions to reduce network latency, and its system architecture splits data across multiple database servers for scalability and availability, should a data center go down or a traffic routing problem pop up. Allspaw is a firm believer in the principle of design for failure, “the mindset that failure isn’t something you want to avoid, it’s something you expect to happen in unexpected ways and unexpected times,” and he has made sure Flickr’s systems are frequently tested and designed to withstand failures. “For every feature, and I mean almost every tiny little feature, that developers put into production, there is a hook in the code, an ability to turn that feature off at any moment in time, so we can continue operating the site with a reduced feature set. We can just disable uploads of photos or disable tagging of photos. Why is that important? If a feature depends on a resource that fails, it’s better not to have that feature than throw an error to the user or make things worse.”
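  The “hook in the code” Allspaw describes is often called a feature flag or kill switch. The following is a minimal sketch of the idea, not Flickr’s implementation (Flickr’s code is PHP and is not shown in this article). The class, the flag names, and the page structure are hypothetical, and a production system would read its flags from shared configuration rather than keep them in memory.

```python
class FeatureFlags:
    """In-memory on/off switches for site features (illustrative only)."""

    def __init__(self, **initial):
        self._enabled = dict(initial)

    def enable(self, name):
        self._enabled[name] = True

    def disable(self, name):
        self._enabled[name] = False

    def is_enabled(self, name):
        # Unknown features are treated as off, the safer default.
        return self._enabled.get(name, False)


def render_photo_page(photo_id, flags):
    """Build a page description, leaving out any disabled feature."""
    page = {"photo_id": photo_id, "actions": ["view"]}
    if flags.is_enabled("uploads"):
        page["actions"].append("upload")
    if flags.is_enabled("tagging"):
        page["actions"].append("tag")
    return page


if __name__ == "__main__":
    flags = FeatureFlags(uploads=True, tagging=True)
    print(render_photo_page(42, flags))

    # Suppose the service behind tagging fails: operations switches the
    # feature off and the rest of the page keeps working.
    flags.disable("tagging")
    print(render_photo_page(42, flags))
```

  The point of the design is that the check happens at the place where the feature is used, so turning a flag off removes the feature cleanly instead of letting a failing dependency surface as an error.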

  But Allspaw’s secret sauce for running a high performing, reliable website isn’t really the technology; it’s the culture within Flickr’s operations group, the way it communicates outside the group, and the management practices it follows. “Successful web operations comes from non-technical practices. Some of it is culture, some are tips or techniques, some are just good habits.”

  Take the tense, finger-pointing-prone relationship between software developers who write code, and the operators who manage the servers it runs on. “I am absolutely convinced that the key to a successful web operating culture is that different parts of engineering need to trust each other,” Allspaw says. At Flickr, “One of the reasons it works is that the two groups have an understanding of what the other person’s job is. I have developers who think like web operations people and web operations people who think like the developers.”

  Allspaw also stresses interaction between the operations team and Flickr’s product managers, a group of largely non-technical people who manage the features of an application. “Stereotypically, you have product managers who don’t care or have no awareness of operations. You can’t just say we’ll release video. There are a lot of things to consider: storage, technology, resources, time. When product managers don’t talk to the operations group, there are a lot of problems.” Allspaw, along with the development side, is involved early on in product and product feature decisions. And because Yahoo’s technical staff reports separately from the product, design, and marketing groups, technical experts have the independence to make the technical decisions. The result is that decisions about launching products and features are made jointly between engineering and product managers.