
  

Performance Optimizations in a

Cloud-Centric World

  Andy Still

  Performance Optimizations in a Cloud-Centric World

  by Andy Still

  Copyright © 2015 O’Reilly Media, Inc. All rights reserved. Printed in the United States of America.

  Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

  O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  Editor: Brian Anderson

  Revision History for the First Edition

  2015-07-19: First Release
  2015-09-02: Second Release

  The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Performance Optimizations in a Cloud-Centric World, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

  While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.


  For Candance, who insists that all poor performance on the Internet is my fault

  Introduction

  Back in the day, it was simple...

  Content was served from your server, over your network, and then to client machines that you controlled. Even when that moved out from a LAN to a WAN, the connectivity came from a single provider—it was all under your control.

  Then came the Internet…

  Now content was being served across the public Internet to end-user machines—you lost control of the location, type of machine, and type of connectivity.

  Terminology

  For simplicity, I’ve used the term “website” throughout to refer to any system that distributes data across the Internet, including browser-based applications, mobile apps, etc.

Chapter 1. Losing Control

  So, here we are in the world of the cloud, with ever-expanding elements of our websites being placed in the hands of others.

  Advantages to Giving Up Control

  There are many positive aspects to making this move (after all, why else would so many people be doing it?), so before going into the negatives, let’s remind ourselves of some of the advantages of cloud-based systems:

  Quick and easy access to enterprise-level solutions

  For example, building your own geographically available SQL server cluster with real-time failover would take lots of hardware, high-quality connectivity between data centers, a high degree of expertise in databases and networking, and a reasonable amount of time and ongoing maintenance. Services such as Amazon RDS make this achievable within an hour, and at a reasonable hourly rate.

  Cloud-based systems are also built for high performance and throughput and designed to scale out of the box. Many services will scale automatically and invisibly to you as the consumer, and others will scale at the click of a button or an API call.

  Access to systems run by specialists in the area—not generalists

  In house or using a general data center, you may have a small team dedicated to a task—or more likely, a team of generalists who have a degree of expertise across a range of areas. Bringing in a range of specialist cloud providers allows you to work with entire companies that are dedicated to expertise in specific areas, such as security, DNS, or geolocation.

  Performance Risks

  Despite these advantages, it’s important to be aware of the inherent performance risks, especially in this era where good website performance is key to user satisfaction. The next sections cover important considerations for performance and outline key performance risks, following the journey that a user must make in order to take advantage of your website.

1. The Last Mile

  Before any user can access your website, they need to connect from their device to your servers. The first stage of this connection, between the user’s device and the Internet backbone, is known as the last mile. For a desktop user, this is usually the connection to their ISP, whether that be by DSL, cable, or even dial-up. For a mobile user, it’s the connection via their mobile network.

  This section of the connection between user and server is the most inefficient and variable, and it will add latency onto any connection. To illustrate this, in 2013 the FCC released research showing that a top-speed fiber connection would add 18ms of latency—and that was the best-case scenario.

  Performance Risks

  Unreliable delivery of content

  The variability in connection speed of the last mile means that it’s hard to determine how fast content will be delivered to users. This presents many of the same challenges that we’ll explore in the next section—they’re often amplified by the challenges of the last mile.

2. Backbone Connectivity

  Traditionally, this is seen as the path that the data from your website takes after it leaves your data center until it arrives at the end user’s machine. However, in the Internet age, backbone connectivity can be seen more as the means by which a user reaches your data—you have little control over how or from where the user is coming to you to request it. Users are now accessing data from an expanding range of devices, via many different means of connectivity, and from an ever-widening range of locations. To understand the performance challenges caused by unknown means of connectivity, you need to look at three key factors:

  Which Is the Biggest Challenge to Performance?

  Bandwidth is often discussed as a limiting factor, but in many cases, latency is the killer—bandwidth can be scaled up, but latency is not as easy to address. There is a theoretical minimum latency based on the physical distance between two places. Data sent over an optimally configured fiber connection takes approximately 1.5× as long as light would take to travel the same distance in a vacuum. The speed of light is very fast, but there is still a measurable delay when transmitting over long distances. For example, the theoretical fastest round trip for data between New York and London is 56ms; between New York and Sydney, it’s 160ms.

  This means that to serve data to a user in Sydney from your servers in New York, 160ms will pass to establish a connection, and another 160ms will pass before the first response arrives.
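  As a rough check of that arithmetic, here is a minimal sketch; the great-circle distances and the 1.5× fiber factor are approximations of my own, not figures quoted from a specific provider.

```javascript
// Rough sketch of the round-trip latency arithmetic above.
// Distances are approximate great-circle figures, not actual cable routes.
const SPEED_OF_LIGHT_KM_PER_S = 299792; // in a vacuum
const FIBER_SLOWDOWN = 1.5;             // fiber is roughly 1.5x slower than light in a vacuum

function minRoundTripMs(distanceKm) {
  const oneWaySeconds = (distanceKm / SPEED_OF_LIGHT_KM_PER_S) * FIBER_SLOWDOWN;
  return 2 * oneWaySeconds * 1000; // there and back, in milliseconds
}

console.log(minRoundTripMs(5570).toFixed(0));  // New York <-> London: ~56ms
console.log(minRoundTripMs(16000).toFixed(0)); // New York <-> Sydney: ~160ms
```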

  When choosing a data center, you can get information about these connectivity and peering arrangements; however, cloud providers are not so open. Therefore, it’s important to monitor what’s happening to determine the best cloud provider for your end users.

  Performance Risks

  The variability of connectivity across the backbone really boils down to a single performance risk, but it’s a fundamental one that you need to be aware of when building any web-based system.

  Unreliable delivery of content

  If you cannot control how data is being sent to a user, you cannot control the speed at which it arrives. This makes it very difficult to determine exactly how a website should be developed. For example:

  Can data be updated in real time?

  Can activity be triggered in response to user activity, e.g., predictive search?

3. Servers and Data Center Infrastructure

  Traditionally, when hosting in a data center, you can make an informed choice about all aspects of the hardware and infrastructure you use. You can work with the data center provider to build the hardware and the network infrastructure to your specific requirements, including the connectivity into your systems. You can influence or at least be aware of the types of hardware and networking being used, the peering relationships, the physical location of your hardware, and even its location within the building. The construction of your platform is a process of building something to last, and once built, it should remain relatively static, with any changes being non- trivial operations.

  Performance Risks

  Loss of control over the data center creates two key performance risks.

  Loss of ability to fine-tune hardware/networking

  Cloud providers will provide machines based on a set of generic sizes, and they usually keep the underlying architecture deliberately vague, using measurements such as “compute units” rather than specifying the exact hardware being used.

  Likewise, network connectivity is expressed in generic terms such as small, medium, and large rather than actual values, so the exact nature of the networking is out of your control.

  All of this means that you cannot benchmark your application and then fine-tune the underlying hardware and networking to suit it.

4. Third-Party SaaS Tools

  While you lose control over the hardware and the infrastructure with IaaS, you still have access to the underlying operating system. In the world of the cloud, however, systems are increasingly dependent on higher-level Software as a Service (SaaS) offerings that deliver functionality directly, rather than a platform on which to run your own code.

  All access is provided via an API, and you have absolutely no control over how the service is run or configured.

  Examples in this section

  For consistency and to illustrate the range of services offered by single providers, all examples of services in this section are provided by Amazon Web Services (AWS); other providers offer similar ranges of services.

  These SaaS systems can provide a wide range of functionality, including database (Amazon RDS or DynamoDB), file storage (Amazon S3), message queuing (Amazon SQS), data analysis (Amazon EMR), email sending (Amazon SES), authentication (AWS Directory Service), data warehousing (Amazon Redshift), and many others.

  Performance Risks

  As you start to introduce third-party SaaS services, there are two key performance risks that you must be aware of.

  Complete failure or performance degradation

  Although one of the selling points of third-party SaaS systems is that they are built on much more resilient platforms than you could build and manage on your own, the fact remains that if they do go down or start to run slowly, there is nothing you can do about it—you are entirely in the hands of the provider to resolve the issue.

  Loss of data

  Though the data storage systems are designed to be resilient (and in

5. CDNs and Other Cloud-Based Systems

  Many systems now sit behind remote cloud-based services, meaning that any requests made to your server are routed via these systems before hitting it.

  CDNs

  The most common examples of these systems are CDNs (content delivery networks). These are systems that sit outside your infrastructure, handling traffic before it hits your servers to provide globally distributed caching of content.

  CDNs are part of any best-practice setup for a high-usage website, providing higher-speed distribution of data as well as lowering the load on your servers. The way they work is conceptually simple: when a user makes a request for a resource from your system, the DNS lookup resolves to the point of presence (POP) within the CDN infrastructure that has the least latency and load. The user then makes the request to that server. If the server has a cached copy of the resource, it returns it directly; if not, it retrieves the resource from your origin servers, caches it, and returns it to the user.
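  One common way to control what a CDN is allowed to cache, and for how long, is via response headers set at the origin. Here is a minimal sketch using Express; the endpoint and header values are illustrative assumptions, not recommendations from this book.

```javascript
// Minimal Express sketch: a hypothetical origin endpoint whose responses a
// CDN edge node can cache. Header values are illustrative only.
const express = require('express');
const app = express();

app.get('/products.json', (req, res) => {
  // s-maxage targets shared caches (CDN edges); max-age targets browsers.
  res.set('Cache-Control', 'public, max-age=60, s-maxage=300');
  res.json({ products: [] }); // placeholder payload
});

app.listen(3000);
```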

  Other Systems

  There are many other examples of systems that can sit in front of yours, including:

  DDoS protection
  Protects your system from being affected by a DoS (Denial of Service) attack.

  Web application firewall
  Provides protection against some standard security exploits, such as cross-site scripting or SQL injection.

  Traffic queuing

  Performance Risks

  There are a number of performance risks associated with moving your website behind cloud-based services.

  Complete failure or performance degradation

  As with third-party SaaS tools, if a cloud system you rely on goes down, so will your system. Likewise, if that cloud system starts to run slowly, so will your system. This could be caused by hardware or infrastructure issues, or by issues associated with software releases (SaaS providers will usually release often and unannounced). They could also be caused by third-party malicious activities such as hacking or DoS attacks—SaaS systems can be attractive, high-profile targets for such attacks.

6. Third-Party Components

  Websites are increasingly dependent on being consumers of data or functionality provided by third-party systems.

  Client Side

  Client-side systems will commonly display data from third parties as part of their core content. This can include:

  Data from third-party advertising systems (e.g., Google AdWords)
  Social media content (e.g., Twitter feeds or Facebook “like” counts)
  News feeds provided via RSS
  Location mapping and directions (e.g., Google Maps)
  Unseen third-party calls, such as analytics, affiliate tracking tags, or monitoring tools

  Server Side

  Server-side content will often retrieve external data and combine it with your data to create a mashup of multiple data sources. These can include freely available and commercial data sources; for example, combining your branch locations with mapping data to determine the nearest branch to the user’s location.

  Performance Risks

  Dependence on these third-party components can create the following performance risks.

  Complete failure or inconsistent performance

  If your system depends on third-party data and that third party becomes unavailable, your system could fail completely. Likewise, poor performance by the third party can have a domino effect on your system’s performance.

  Unexpected results

  Third parties can sometimes change the data they return or the way their data feeds work, resulting in errors when you make requests or when you process the responses.

Chapter 2. If You Can’t Control It, Monitor It

  It’s vitally important for you to understand what’s going on with the elements of your website and infrastructure that you can’t control—particularly their impact on other areas of your website. A good monitoring system is essential to enabling the performance optimizations that are recommended in the next chapter.

  In addition to monitoring, it’s important that you set up appropriate alerting to notify you when issues may be occurring.

1. RUM and EUM

  Ultimately, the most important data answers the question: what is the user seeing? This is the task of real user monitoring (RUM) and end user monitoring (EUM). RUM gathers data from all user activity and passes that data back to a central collection server, allowing analysis of your users’ exact experience. This will flag any unexpected behavior and can help you drill down to identify the cause of the problem. RUM is also useful for determining whether there is a pattern to the types of users who are experiencing a particular problem. For example, is it related to a specific geographic area, type of connection, browser, or device?
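  As a sketch of the principle, a very basic RUM beacon can be built on the browser’s standard performance timing APIs; the /rum-collect endpoint below is a hypothetical collection URL of my own, not part of any particular product.

```javascript
// Minimal RUM sketch: gather page-load timings in the browser and beacon
// them to a collection endpoint. '/rum-collect' is a hypothetical URL.
window.addEventListener('load', () => {
  // Defer briefly so the load event timings are finalized.
  setTimeout(() => {
    const [nav] = performance.getEntriesByType('navigation');
    if (!nav) return;
    const sample = {
      url: location.pathname,
      dns: nav.domainLookupEnd - nav.domainLookupStart,
      connect: nav.connectEnd - nav.connectStart,
      ttfb: nav.responseStart - nav.requestStart,
      loadTime: nav.loadEventEnd - nav.startTime,
    };
    navigator.sendBeacon('/rum-collect', JSON.stringify(sample));
  }, 0);
});
```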

2. APM

  Application performance management (APM) is a monitoring technology that sits on your server and tracks all activity and reports to a central analysis server. This will collect code-level metrics (e.g., method and SQL query execution times) and details of communications with external systems, in addition to hardware metrics (e.g., memory and CPU usage). APM systems are very useful for getting a detailed understanding of what your application is doing under the hood, and they’re a good starting point for root-cause analysis of issues with your system. Some APM solutions will integrate with RUM and EUM tools to give a full end-to-end breakdown of a user’s interaction with your system.

3. Network Monitoring (NPM)

  While RUM and EUM give you a good understanding of what the end user is experiencing and APM illustrates what’s going on on your server, network monitoring looks at the areas in between. In traditional data centers, this would involve operational management tools such as Nagios, or NPM (network performance monitoring) tools such as Zabbix or SolarWinds to see details of how your network infrastructure is behaving. (It’s worth noting that these two types of tools are increasingly overlapping.) However, the network infrastructure is largely hidden from you in cloud environments.

4. Proprietary System Monitors

  Most cloud providers will offer their own tools for monitoring the performance of their systems, such as Amazon’s CloudWatch for AWS services. The depth of information and functionality provided by these systems varies greatly, but they should all be a first port of call for identifying issues with a system.

5. Data Aggregators/Dashboard Creation

  It can be difficult to stay on top of all of the monitoring tools that are necessary to understand the diverse elements in your system. Data aggregators and dashboarding systems provide the ability to gather all these data sources into one central location and display them side by side. There are many examples of these types of tools, from desktop analytics tools (e.g., Tableau) and cloud-based services (e.g., DataDog) to enterprise-level platforms (e.g., Soasta DOC).

  The more advanced of these systems will also allow you to correlate multiple datasets onto a single graph.

Chapter 3. Minimizing Performance Risks

  The performance risks described previously can be minimized using the following five strategies:

  1. Use a best-of-breed DNS provider

  2. Cache content as close to the user as possible

  3. Understand the nature of cloud services

  4. Apply the same good practice to the cloud as you would to any other system

  5. Ensure you can handle any failure

  Use a Best-of-Breed DNS Provider

  DNS is your first point of contact with an end user; without it, your user will never access your site. So it is essential that it is reliable, performant, and flexible. Providers such as cloud providers or CDNs often prefer (or require) that they also manage your DNS, but this can create a single point of failure (SPOF): if a provider experiences problems with its own system, it may also have issues with its DNS provision, making it difficult to use DNS as a defense against that failure. Having an independent DNS provider allows you to have policies that favor different cloud providers/CDNs in different circumstances, such as user location or provider availability.

  A Low-Latency Network

  It is essential that the DNS provider you select operates a low-latency network, allowing fast resolution of DNS records wherever your users are situated. As all users will need to resolve your DNS record before accessing your system, a slow resolution time will add delay to the first request to your site for all users. If you’re using domain sharding (i.e., serving your content from many subdomains to improve performance), then this delay applies to each of the subdomains you are using. (The actual impact of the overall delay will depend on how well constructed your page is; a well-constructed page will ensure that as many requests as possible are made concurrently.)

  Support for DNS-Based Failover

  If your hosting provider has a complete outage, then your DNS provider should allow you to switch traffic to another location. Alternatively, if one of the cloud providers you’re sitting behind has an outage, then you need to be able to quickly reroute traffic to bypass that service.

TTL AND DNS

  TTL, or time to live, is the element of a DNS record that tells the requester how long the record is valid for. In other words, if the TTL for your DNS record is set to 24 hours, once a browser has resolved that DNS record, it will continue to use that same value for the next 24 hours regardless of whether you’ve updated the details.

  If the TTL is set too high, then DNS cannot be used as a failover method, as the change will take too long to take effect with any existing users. Setting a very low TTL, however, adds extra overhead, as DNS lookups have to happen much more regularly, which adds to the page-load time.

  Support for Geolocation

  A simple way to mitigate the impact of latency is to serve content from as close to users as possible. This can be achieved by caching content close to the user (see “Cache Content as Close to the User as Possible”); however, it can also be achieved by hosting multiple systems at different locations around the world.

  

ANYCAST

Anycast is an addressing methodology that allows a “one-to-nearest” transmission of traffic to a target node, usually using BGP to simultaneously advertise the same IP address at multiple locations. In practice, this means that traffic to a single IP address can be routed to multiple locations based on the location of the request.

  Managed DNS providers use anycast networks to allow resolution of DNS requests from the location nearest to the user.

  Cache Content as Close to the User as Possible

  It’s an old statement, but it’s still as true as ever: the fastest request is the one you don’t make, so it is best to cache content as close to the user as possible. Make sure all your static resources have appropriate Expires headers on them so the browser will cache as you expect. If you’re using any client-side data retrieval from APIs, then try to store what you can locally—JavaScript has access to local storage on the client now, so data can be stored across sessions. Future W3C standards such as service workers are designed to give more control over client-side caching.
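  For the client-side API data mentioned above, here is a minimal sketch of caching responses in local storage; the cache key prefix and the one-hour default age are illustrative choices, not values from the text.

```javascript
// Sketch: cache API responses in localStorage so repeat visits can skip the
// network. The 'cache:' key prefix and one-hour default TTL are illustrative.
async function fetchWithLocalCache(url, maxAgeMs = 60 * 60 * 1000) {
  const key = 'cache:' + url;
  const cached = localStorage.getItem(key);
  if (cached) {
    const { savedAt, data } = JSON.parse(cached);
    if (Date.now() - savedAt < maxAgeMs) return data; // still fresh
  }
  const response = await fetch(url);
  const data = await response.json();
  localStorage.setItem(key, JSON.stringify({ savedAt: Date.now(), data }));
  return data;
}
```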

  CDNs

  If you can’t cache on the client, then try to cache as close to the client as possible. This leads us on to CDNs, which we discussed previously in Chapter 1. CDNs are designed as globally distributed caching and delivery systems. Modern CDNs offer much wider functionality than this, but this is the core of their function. The advantages of CDNs are obvious: most of the time, users should be served content from destinations close to them. CDNs are also typically set up for high-traffic usage, so a good CDN will address issues of both bandwidth and latency.

  Using CDNs for dynamic content

  TCP connections do not start out using the full available bandwidth; the sending rate is increased gradually as it becomes apparent that the network can handle it. After the initial connection is made and a handshake completed, the server sends a small number of packets; the client receives and acknowledges receipt, and the server can then send two packets for every packet successfully acknowledged. This allows for exponential growth until the capacity of the network is determined. This means that an initial request to a server will involve more round trips to the server than are actually necessary. For example, a 20k request that could easily be served in one round trip will take four round trips on an initial connection to a server.
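  To make that arithmetic concrete, here is a rough sketch; the segment size of about 1,460 bytes and an initial congestion window of three segments are assumptions of mine rather than figures from the text, with the window doubling each round trip as described above.

```javascript
// Sketch of TCP slow start: how many round trips does a response need?
// Assumes ~1460-byte segments and an initial congestion window of 3 segments.
function roundTripsForResponse(responseBytes, initialWindow = 3, segmentBytes = 1460) {
  let sent = 0;
  let windowSegments = initialWindow;
  let dataRoundTrips = 0;
  while (sent < responseBytes) {
    sent += windowSegments * segmentBytes;
    windowSegments *= 2; // the window grows as acknowledgments come back
    dataRoundTrips += 1;
  }
  return dataRoundTrips + 1; // +1 for the TCP connection handshake
}

console.log(roundTripsForResponse(20 * 1024)); // ~4 round trips for a 20k response
```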

  Choose the best CDN

  Not all CDNs are created equal, and this is where knowledge of your audience and some of the topology of the Internet comes in useful. Most CDN providers publish maps of the locations of their POPs (points of presence); the number and distribution of these will vary from CDN to CDN. Looking at a selection, you will soon see that there are areas that are well supported and others that are not.

  Understand the Nature of Cloud Services

  Although there are risks inherent in taking advantage of cloud providers’ multitude of different services, these services are generally built for high performance and high resiliency and are less risky than trying to create your own, especially when running that software on cloud-based infrastructure.

  However, it’s essential to confirm that this is the case for you and that the cloud services are being used correctly.

  Try Before You Buy

  Before using any service, you need to put it through its paces and ensure that it is behaving as expected and performing as advertised. The nature of the cloud makes these kinds of proof-of-concept tests much more viable than non-cloud offerings. They can be undertaken with minimal upfront costs and long-term commitment and can be thrown away if they fail. While performing this testing, it’s good to get as many monitoring systems as possible going to ensure that you’re not just focusing on functional correctness; other metrics such as availability, reachability, and performance should be considered. For example, the IPM data should be used to determine the network impact of using this service from different locations.

  Optimize Your Systems for the Cloud

  It’s easy to use cloud services in a sub-optimal way, because they’re relatively new systems with a high velocity of change, and because developers are usually self-taught. Furthermore, developers often apply on-premise thinking and practices to the cloud, not realizing that cloud systems are built with a slightly different paradigm in mind. For example, cloud-based database-as-a-service offerings are better suited to a few larger queries than to many small queries, meaning that any system that is very “chatty” with the database will likely perform considerably worse in the cloud than on premise with a direct database connection.
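  As a sketch of what “chatty” versus batched access looks like, the db.query client and the table and column names below are illustrative stand-ins, not a specific cloud API. The chatty version pays the database’s network round trip once per row; the batched version pays it once.

```javascript
// Sketch: "chatty" per-row queries vs. a single batched query.
// `db.query` stands in for any SQL client; table/column names are made up.

// Chatty: one network round trip to the cloud database per order line.
async function getOrderLinesChatty(db, lineIds) {
  const lines = [];
  for (const id of lineIds) {
    const rows = await db.query('SELECT * FROM order_lines WHERE id = ?', [id]);
    lines.push(rows[0]);
  }
  return lines;
}

// Batched: one round trip for all the rows, so latency is paid only once.
async function getOrderLinesBatched(db, lineIds) {
  return db.query('SELECT * FROM order_lines WHERE id IN (?)', [lineIds]);
}
```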

  Monitoring data should be used to confirm that the performance of these services is as expected and required.

  Understand the Configuration Options

  Cloud services are usually aimed at delivering complex pieces of functionality in a simple way through a GUI or API. Therefore, you can usually get up and running with them fairly quickly. However, the out-of-the-box configuration options may not be the most resilient or performant.

  You should be proactive in understanding which options are available as well as being reactive to issues identified by monitoring and testing.

  Understand the SLAs

  Most cloud providers will provide SLAs; however, it’s important to understand the terms of the SLA that they provide and ensure that you have implemented your service correctly to take advantage of it. For example, Microsoft Azure provides an uptime SLA for cloud services, but only if you’re running two or more instances.

  

Apply the Same Good Practice to the Cloud as You Would to Any Other System

  The same good practices that you would apply to on-premise solutions should be applied to cloud-based solutions. A standard risk assessment process should be followed. For example, cloud-based database-as-a-service systems provide multiple levels of resilience around data (multiple copies in multiple places) but still involve a SPOF if there’s a system failure that causes data corruption. Good practice in this case would dictate that a separate backup be taken and stored remotely—in traditional terms, an “offsite backup.” This backup should ideally be stored with another cloud provider (or elsewhere).

  Ensure You Can Handle Any Failure

  When you’re dependent on services that are out of your control, you have to be conscious of two things:

  1. They may stop working at any point

  2. You will have no control whatsoever over when they will start working again

  Therefore, you have to architect your systems to handle this failure gracefully.

  Avoid “Death by Retry”

  Once a failure state is known, share that knowledge across any elements of your system that depend on that service and put in place a measured policy for attempting retries. Do not create a death by retry situation where your system is brought down by constant attempts to connect to an unavailable system. A good architectural practice is to route all requests through a central point of connection.
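  Here is a minimal sketch of such a central connection point; the failure threshold and cool-off period are illustrative values of my own, not prescriptions. Every part of the system that needs the dependency calls it through the same wrapper, so knowledge of the failure is shared and retries are paced rather than constant.

```javascript
// Sketch of a central connection point that avoids "death by retry":
// after repeated failures it stops calling the dependency for a cool-off
// period instead of retrying on every request. Thresholds are illustrative.
class GuardedService {
  constructor(callFn, { failureThreshold = 3, coolOffMs = 30000 } = {}) {
    this.callFn = callFn;
    this.failureThreshold = failureThreshold;
    this.coolOffMs = coolOffMs;
    this.failures = 0;
    this.blockedUntil = 0;
  }

  async call(...args) {
    if (Date.now() < this.blockedUntil) {
      throw new Error('Dependency marked unavailable; not retrying yet');
    }
    try {
      const result = await this.callFn(...args);
      this.failures = 0; // healthy again, reset the count
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.blockedUntil = Date.now() + this.coolOffMs; // back off
      }
      throw err;
    }
  }
}
```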

  Have a Backup Plan

  If the functionality provided by the third-party system is key, then consider having a replacement system in place and automatically failing over to it. Another option is to capture all the details of the request and process them offline when the system returns. This is valid for systems such as those for payment processing or appointment bookings.
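  A sketch of that capture-and-replay fallback: bookingApi is a hypothetical client for the third-party service, and the in-memory queue stands in for the durable storage a real system would need.

```javascript
// Sketch: if the third-party booking/payment call fails, capture the
// request for later processing instead of failing the user outright.
const pendingRequests = []; // in practice, persist this somewhere durable

async function submitBooking(bookingApi, booking) {
  try {
    return { status: 'confirmed', result: await bookingApi.create(booking) };
  } catch (err) {
    pendingRequests.push({ booking, capturedAt: Date.now() });
    return { status: 'queued' }; // tell the user it will be processed later
  }
}

// Replay captured requests once the third party is available again.
async function replayPending(bookingApi) {
  while (pendingRequests.length > 0) {
    const { booking } = pendingRequests.shift();
    await bookingApi.create(booking);
  }
}
```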

  Provide Ability to Turn Functionality Off

  Your system should be built to provide the ability to remove elements of functionality by a simple configuration or application change—often referred to as feature toggles. This allows you much more granular control over the impact of elements of your system. If they’re starting to cause issues, then remove them.

FEATURE TOGGLES

  Feature toggles are a development methodology where software features are built into systems with the ability to turn them on and off without redeploying the application. This approach is often used as a way of pushing new features into production ahead of the time that they need to be made active, allowing the wider business to activate the feature at an appropriate time with minimal assistance needed from the IT team.
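  A minimal sketch of a feature toggle in practice; the toggle names, the inline configuration, and the recommendations call are all illustrative, and a real system would typically load toggle values from a config file, database, or toggle service so they can change at runtime.

```javascript
// Sketch of feature toggles driven by configuration, so behavior can be
// switched off without redeploying. All names and values are illustrative.
const toggles = { recommendations: true, reviews: false };

function isEnabled(name) {
  return toggles[name] === true;
}

function loadRecommendations(product) {
  // Placeholder for a call to a third-party recommendations service.
  return [];
}

function renderProductPage(product) {
  const page = { product };
  if (isEnabled('recommendations')) {
    // This block can be switched off via configuration if the
    // recommendations service starts causing problems.
    page.recommendations = loadRecommendations(product);
  }
  return page;
}

console.log(renderProductPage({ id: 42, name: 'Example product' }));
```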

  Fail Gracefully

  If there’s no way to proactively handle the failure and prevent any impact on the user, then you need to ensure that your system will fail gracefully. The user should see a properly designed and presented page with a helpful error message that explains what has happened. If the failure happens within a data transaction, the user should be notified of the current state of the transaction (e.g., has their order been placed successfully?).
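  A sketch of graceful failure in an Express application; the route, status code, and wording are illustrative, not a prescribed implementation.

```javascript
// Sketch: an Express error handler that shows a designed, helpful page
// rather than a raw stack trace. Copy and status handling are illustrative.
const express = require('express');
const app = express();

app.get('/checkout', (req, res, next) => {
  // ... call the payment provider; pass any failure to the error handler.
  next(new Error('Payment provider unavailable'));
});

// A middleware with four arguments registered last acts as the error handler.
app.use((err, req, res, next) => {
  console.error(err); // keep the detail in your own logs
  res.status(503).send(
    '<h1>Sorry, something went wrong</h1>' +
    '<p>We could not complete your order and you have not been charged. Please try again shortly.</p>'
  );
});

app.listen(3000);
```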

  Create a “Flight Manual”

  A “flight manual” should be created with mitigation plans associated with each type of failure. This should include the nature of the change that can be made and the circumstances under which it is acceptable to make that change. Having this sort of manual allows people on the ground to be empowered to make decisions and changes without having to go through a complex decision-making process with management.

Chapter 4. Takeaways

  There are six important lessons to take from this book:

  1. Don’t fear loss of control—embrace the cloud.

  Introducing cloud systems will lead to further loss of control over your

  website, but the advantages of using these systems outweigh the disadvantages. For most people, the services offered by cloud providers will be faster and easier to implement and manage, as well as more resilient, technologically advanced, and cost effective to run than anything they could implement themselves.

  Implement a CDN to optimize responses and minimize latency. Use your monitoring to determine the best CDN or combination of CDNs to use.

  5. Understand the difference between cloud and on-premise.

  Cloud providers offer many advantages over on-premise systems, and it’s important to understand the differences between them. Research, investigate, and try new systems to ensure that you’re taking advantage of their features and understanding their weaknesses.

  6. Failure will happen—build systems and processes to handle it.

  As good as they are, when it comes down to it, you have no control over the systems and services you’re using, so your website must be able to handle failure or poor performance, and you must have processes in place to deal with it when it happens.

  About the Author

  Andy Still has worked in the web industry since 1998, leading development on some of the highest-traffic sites in the UK. He co-founded Intechnica, a vendor-independent IT performance consultancy, to focus on helping companies improve performance on their IT systems, particularly websites. He is also the creator of TrafficDefender, a cloud-based traffic-management tool. Andy is one of the organizers of the Web Performance Group North UK and Amazon Web Services NW UK User Group.

  Acknowledgments

  As usual, I have to pay tribute to all my fellow Performance Architects at Intechnica for sharing their knowledge across the spectrum of performance issues. Books like this wouldn’t be possible without them.

  Thanks also to Samir Jafferali for taking the time out to review the content and provide feedback and invaluable comments.