Archive

Recent performance events

Performance testing

Hulk Smash puny web site.

The Healthcare.gov web site was crushed (or smashed) as soon as it opened. The top IT people are quoted as saying it was due to the sheer, unprecedented volume of users accessing the site. They saw more than five times the estimated load: they predicted 50,000 to 60,000 concurrent users, and there were actually 250,000 to 300,000. The total number of people who could use the web site is estimated at 23 million; if one percent are on the system, that is 230,000 people at one time. To arrive at the 50,000 to 60,000 user estimate, you would have to assume one quarter of one percent (0.25%). That seems a little low. Isn't it more reasonable that 1% or more would access a system that was so widely publicized?
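To put those percentages in context, here is a quick back-of-the-envelope calculation in Python using the figures quoted above (all of them estimates reported in the press, not official numbers):

```python
# Back-of-the-envelope concurrency estimate using the figures quoted above.
total_population = 23_000_000                     # estimated eligible users
predicted_low, predicted_high = 50_000, 60_000    # published estimate
observed_low, observed_high = 250_000, 300_000    # reported actual load

# What fraction of the population do the estimates assume?
print(f"Predicted: {predicted_low / total_population:.2%} to "
      f"{predicted_high / total_population:.2%} of the population")
print(f"Observed:  {observed_low / total_population:.2%} to "
      f"{observed_high / total_population:.2%} of the population")

# Simple scenarios: what if 0.25%, 1%, 2%, or 3% show up at once?
for pct in (0.0025, 0.01, 0.02, 0.03):
    print(f"{pct:.2%} concurrent -> {int(total_population * pct):,} users")
```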

Demand forecasting and scenarios

It is always a challenge to define the workload profile for a new web site. There are user behavior questions: when are users most likely to access the web site, what is the hourly traffic pattern, how many simultaneous users will there be, what will the peak be, and what is normal? What are the user profiles or user categories? What is the possible number of users (23 million)? What is the average number of concurrent users (1%, or 230,000)? What if you got 2X or 3X the average?

What will the workload profile look like as the deadline to register gets closer? Will the pattern be like the Batman mask: a spike at the beginning, a flat stretch in the middle, and then another spike at the end of enrollment?
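As a thought experiment, here is a rough sketch of what a "Batman mask" demand scenario could look like. Every number and the shape of the curve are illustrative assumptions, not a forecast for the actual site:

```python
# Hypothetical "Batman mask" enrollment curve: heavy at open and close,
# flat in the middle. All numbers are illustrative.
def daily_demand(day, total_days, base=230_000, spike=3.0, ramp=7):
    """Rough concurrent-user estimate for a given day of the enrollment window."""
    if day < ramp:                        # opening spike, decaying toward base
        return int(base * (1 + (spike - 1) * (ramp - day) / ramp))
    if day > total_days - ramp:           # closing spike, ramping back up
        return int(base * (1 + (spike - 1) * (day - (total_days - ramp)) / ramp))
    return base                           # flat middle of the window

window = 180  # roughly a six-month enrollment period
for day in (0, 3, 90, 177, 180):
    print(f"day {day:3d}: ~{daily_demand(day, window):,} concurrent users")
```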

The next topic to understand is the transaction mix: what are the users doing on the web site? For a new web site, registration is a one-time transaction for each user; they will not register again. But for a new site, this is the most critical transaction. So the question becomes: how many simultaneous registrations do you expect?

The Healthcare.gov web site starts off as a router. Of the 23 million people who will visit it, some percentage will be routed to their state healthcare web site, so perhaps 12 million will eventually be the user population.

Design goals

Answering these questions sets up the design goals and the performance testing goals for your web site. Most web sites, and Healthcare.gov is no different, must interact with many third parties. You can communicate your volumetrics to them so they can plan as well.

The design goals are translated into the development phase. The developers must know what the critical business transactions are, their response time goals, and their throughput targets (for example, 10 TPS).
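One simple way to make those goals concrete is to write them down in a form a test run can be checked against. The sketch below is a minimal example; the transaction names, response time goals, and TPS figures are all hypothetical:

```python
# Hypothetical per-transaction performance goals (names and numbers illustrative).
GOALS = {
    # transaction:        (p95 response time in seconds, required throughput in TPS)
    "register_account":   (3.0, 10),
    "browse_plans":       (2.0, 50),
    "submit_application": (5.0, 10),
}

def check_results(results):
    """Compare measured results against the design goals and report misses."""
    failures = []
    for txn, (goal_p95, goal_tps) in GOALS.items():
        measured_p95, measured_tps = results[txn]
        if measured_p95 > goal_p95 or measured_tps < goal_tps:
            failures.append(f"{txn}: p95 {measured_p95}s (goal {goal_p95}s), "
                            f"{measured_tps} TPS (goal {goal_tps} TPS)")
    return failures

# Example measurements from a hypothetical test run.
run = {"register_account": (4.2, 8), "browse_plans": (1.1, 62), "submit_application": (4.9, 11)}
for failure in check_results(run):
    print("GOAL MISSED:", failure)
```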

Performance testing

Preparing to test a system that might have 10 million, 15 million, or more customers is no easy task. The data setup alone is a huge effort: the team will need to create synthetic data to support the performance test. The team will then have to simulate 50,000 or 100,000 users accessing the system. This is where cloud-based performance testing tools can really help.
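As a small illustration of the data setup problem, here is a sketch of generating synthetic registration records for a load test. The field names, value ranges, and volumes are assumptions for the example, not the actual site's schema:

```python
import csv
import random
import string

# Generate synthetic registration data for a load test.
# Field names and ranges are illustrative only.
STATES = ["TX", "NY", "CA", "FL", "OH"]

def synthetic_user(i):
    return {
        "user_id": f"testuser{i:08d}",
        "last_name": "".join(random.choices(string.ascii_uppercase, k=8)),
        "state": random.choice(STATES),
        "household_size": random.randint(1, 6),
        "income": random.randint(15_000, 120_000),
    }

with open("synthetic_users.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=synthetic_user(0).keys())
    writer.writeheader()
    # 100,000 rows here; scale up to millions for a full-volume test.
    writer.writerows(synthetic_user(i) for i in range(100_000))
```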

Production – APM

For any new web site with a new user profile, you have to be watching the system utilization and the transactions. Application performance management (APM) is a critical part of being prepared for such a large system. These tools provide real-time transaction performance information, measuring the user experience, so you can spot slow transactions before they become a real issue. The system resource consumption can be closely monitored as the workload increases, again allowing you to respond with more resources quickly.
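Commercial APM tools automate this for you, but the sketch below shows the basic idea of an early-warning check: keep a rolling window of response times per transaction and alert when the 95th percentile drifts past a threshold. The threshold and sample sizes are illustrative:

```python
from collections import deque

# Minimal early-warning check: rolling p95 per transaction vs. a threshold.
class TransactionMonitor:
    def __init__(self, name, p95_threshold_s, window=1000):
        self.name = name
        self.threshold = p95_threshold_s
        self.samples = deque(maxlen=window)

    def record(self, response_time_s):
        self.samples.append(response_time_s)
        if len(self.samples) >= 100:                 # wait for enough samples
            ordered = sorted(self.samples)
            p95 = ordered[int(0.95 * (len(ordered) - 1))]
            if p95 > self.threshold:
                print(f"ALERT: {self.name} p95 {p95:.2f}s exceeds {self.threshold:.2f}s")

monitor = TransactionMonitor("register_account", p95_threshold_s=3.0)
for t in [1.2, 2.8, 3.9] * 50:                       # simulated measurements
    monitor.record(t)
```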

Leverage the Software Performance Engineering Body of Knowledge

What can you do to prepare for a very large web site? You can take advantage of the Software Performance Engineering Body of Knowledge and apply each knowledge area to prepare:

  • Software architecture and design
  • Performance testing and validation
  • Capacity planning
  • Application performance management
  • Problem detection and resolution

These five areas work hand-in-hand to prepare your web site.

Some of the top reasons that your application, web site, or mobile app is slow:

10) I thought you turned off the diagnostic logging

9) Do you really have to index a table with five years of history and one billion rows?

8) You doubled the number of calls from the application tier to the database tier for the same workload and were surprised by the increase, which no one noticed until production.

7) They moved the application server to another continent

6) They virtualized it (you weren’t using all the real CPU anyway)

5) You wrote your own caching component and didn’t really understand the impact of flushing the cache

4) Even Amazon Web Services stops allocating servers eventually (you thought you could buy your way out)

3) The Marketing group ran a hugely successful ad on a major TV program and you underestimated the new workload. The good news is you and the CIO are on a first name basis.

2) They upgraded to a new version of the Application server/database server/etc. and no one thought a performance test might be needed.

1) The business-critical applications don't have performance goals, so how do you know it's slow?

Also, I like "Is there really a difference between polling and event-driven programming?"
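The sketch below contrasts the two styles in miniature: a polling consumer keeps asking whether work has arrived (paying in latency and wasted wake-ups), while an event-driven consumer blocks until it is woken. The queue and timings are illustrative:

```python
import queue
import threading
import time

# Polling repeatedly asks "anything yet?"; event-driven code blocks until woken.
work = queue.Queue()

def producer():
    time.sleep(0.5)
    work.put("order #42")

def polling_consumer():
    # Polling: check on a fixed interval, trading CPU wake-ups for latency.
    while True:
        try:
            item = work.get_nowait()
            print("polled:", item)
            return
        except queue.Empty:
            time.sleep(0.1)   # poll interval

def event_consumer():
    # Event-driven (blocking): wake only when an item actually arrives.
    item = work.get()
    print("event:", item)

threading.Thread(target=producer).start()
polling_consumer()            # swap in event_consumer() to compare
```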


In October this year the SEC sponsored a technology roundtable to discuss how to promote stability in today's markets. This comes after a year of headline-grabbing stock market failures:

Disrupted trading in the markets

In March the BATS Exchange IPO was halted and the BATS exchange stopped trading for hours, because the IPO price of BATS went from $16 to zero in minutes. In May the Facebook IPO on NASDAQ had issues due to extreme volume coupled with the effects of high-frequency trading. The rate of order changes and cancellations prevented the exchange from establishing a price for the IPO; orders were changing that fast, in under five milliseconds. Then Knight Capital suffered from a software issue. They installed new software to work with the NYSE's new Retail Liquidity Provider program. The software sent buy orders out from the market open, and it continued unnoticed for 30 minutes. Knight Capital was not aware of the impact it was having on the marketplace.

In between all this, you can find many examples of micro-flash crashes of single stocks. These can be seen by visiting the web site of Nanex (www.nanex.net).

The SEC Roundtable review

The morning session lasted 2½ hours. The roundtable was kicked off by SEC Chairman Mary Schapiro. The success of the market is tied to its technology, and when the technology fails, the consequences are extreme. These events continue to erode confidence in the market. The industry needs to address the high volume of cancellations. Also, more basic "technology 101" issues occurred in the two IPO cases and the Knight Capital case. We need to balance the need for rapid innovation and competition with proper and diligent testing methods.

My key takeaways from the morning session of the roundtable:

  • There is a need for firms to have a better understanding of their impact on the overall market. This would involve using drop copies, where the exchanges send real-time trading records back to the broker/dealers so they understand their order flow. The broker/dealer could then run real-time reports to check their orders.
  • Improved testing strategies within the firms and an elevation of the software quality and performance profession. While QA people are independent from the development teams, they must be integrated into the development teams. The QA role in the firms must change to attract the best and the brightest, for functional testing as well as performance testing.
  • Testing in production. This is always a controversial topic in any industry. The firms and exchanges would agree on a set of test symbols for use when testing new features. This would require significant cooperation across the marketplace.
  • A focus on internal software testing for stability, performance, and scalability. Introduce software quality resources earlier in the SDLC. The perspective an outside organization can bring to the processes and test cases could be helpful. However, the roundtable participants discussed how difficult it is to bring new people into the teams, due to the complex and technical nature of their systems.
  • Order-flow kill switches for the firms and exchanges. The exchanges would provide this capability and allow each firm to set its own parameters or limits that would trigger the kill switch. This would allow brokers to manage specific order types, control the size or value of an order, limit prices, or restrict the stocks they trade in. This came out of a working group that was established after the Knight Capital meltdown; a minimal sketch of such limits follows this list.
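To make the kill-switch idea concrete, here is a hypothetical sketch of the kind of per-firm limits one might enforce. The parameters, field names, and numbers are illustrative only, not any exchange's actual design:

```python
# Hypothetical order-flow kill switch: track a firm's running exposure and
# cut it off once its own pre-set limits are breached. All limits illustrative.
class KillSwitch:
    def __init__(self, max_notional, max_orders_per_sec, restricted_symbols=()):
        self.max_notional = max_notional
        self.max_orders_per_sec = max_orders_per_sec
        self.restricted = set(restricted_symbols)
        self.notional = 0.0
        self.orders_this_second = 0   # a real system resets this every second
        self.tripped = False

    def check(self, symbol, qty, price):
        """Return True if the order may pass; trip the switch otherwise."""
        if self.tripped:
            return False
        self.orders_this_second += 1
        self.notional += qty * price
        if (symbol in self.restricted
                or self.notional > self.max_notional
                or self.orders_this_second > self.max_orders_per_sec):
            self.tripped = True       # cut the firm off; require manual reset
            return False
        return True

switch = KillSwitch(max_notional=50_000_000, max_orders_per_sec=1_000)
print(switch.check("XYZ", qty=1_000, price=25.0))   # True: within limits
```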

The issues this year occurred at both the exchanges and the broker/dealers; no one was immune from disruption. The key items that caused disruption were new or changed software and large volumes of orders. An enterprise-wide software performance engineering strategy will help mitigate these software issues, for both the brokerages and the exchanges. The market is facing a significant challenge: the need to innovate and introduce better features before the competition does, in a very complex and interconnected marketplace, combined with the need for rigor and increased testing. In addition, production monitoring (really marketplace monitoring) is a critical component. The devil is in the details and the compensation models.

Web page size is increasing – From the HTTP Archive

Day one of the CMG Conference was kicked off with a keynote presentation by Pat Meenan of Google and Webpagetest.org, where Pat discussed the user experience and the browser. He discussed performance monitoring and tuning of a few well-known web pages. When is a page really loaded and ready? That keeps getting fuzzier by the day. During his presentation he mentioned a site called the HTTP Archive, which has shown that web page size is increasing at a rapid rate. So I thought I would take a quick look at the HTTP Archive.

From their web site: the HTTP Archive records the content of web pages and how they are constructed and served. It is a permanent repository of web performance information such as the size of pages, failed requests, and the technologies utilized. They use this information to identify trends in how the Web is built and to provide a common data set for conducting web performance research. Starting in November 2011, they began using the web sites listed in the Alexa Top 1,000,000. From November 2010 to October 2011 they analyzed 18,026 URLs.

They produce a number of trending graphs, and almost all of them are increasing over time. The total transfer size is at 1.27 MB and total requests are at 87. They track HTML transfer size and HTML requests, JavaScript size and JavaScript requests, and many more. You can download the data for your own detailed analysis of performance trends.
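As an example of the kind of analysis the download makes possible, here is a sketch that trends average page weight by crawl label from an exported CSV. The file name and column names are assumptions; check the actual export format on httparchive.org before relying on them:

```python
import csv
from collections import defaultdict

# Trend average page weight from a downloaded HTTP Archive export.
# "httparchive_pages.csv", "label", and "bytesTotal" are assumed names.
totals = defaultdict(lambda: [0, 0])   # crawl label -> [sum of bytes, page count]

with open("httparchive_pages.csv", newline="") as f:
    for row in csv.DictReader(f):
        label = row["label"]                    # e.g. the crawl date label
        totals[label][0] += int(row["bytesTotal"])
        totals[label][1] += 1

for label, (total_bytes, pages) in sorted(totals.items()):
    print(f"{label}: average page weight {total_bytes / pages / 1024:.0f} KB "
          f"across {pages:,} pages")
```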

One thing is clear: web pages continue to increase in size. As the bandwidth increases, the ability to consume the bandwidth increases.

Take a look at httparchive.org


The national CMG conference starts today with a series of workshops on top performance metrics for capacity management of virtualization, SAN/Ethernet fabric and network performance, and z/OS enterprise storage performance and architecture. Then the keynote will be given by Patrick Meenan of Google, discussing Web Performance, the Big Picture. And when Google says big, they mean big.

The holiday shopping season has begun and e-retail sales are up more than 15% over last year's traffic for post-Thanksgiving sales. Retailers are driving more customers to their web sites with deals and special offers exclusively for the web customer. I know a few people who finished their Thanksgiving dinner, washed some dishes, and logged onto their favorite site for some early-season purchases. Using their iPads. According to a recently released IBM report (Thanksgiving Day), 27% of the traffic is coming from tablets and smartphones.

Knowing your tiers and application performance management

The Software Performance Engineering Body of Knowledge has five knowledge areas; the knowledge area most relevant for monitoring your production web sites is application performance management (APM). This area is critical to a successful holiday season for retail web sites, because at this point in the calendar year it's money time and freeze time. Many companies minimize changes to production systems so as not to introduce instability and problems during their business's peak time. During the year, design and development occurred to enhance the web site, and now it's the play-offs for retail web sites. This is where software performance is directly tied to revenue: a direct ROI for all the software performance engineering tasks and activities in the enterprise. There are numerous studies out there that directly correlate web site performance with increased revenue.

By implementing an application performance management solution you can be proactive when it comes to your customers' web site experience. These tools can monitor every transaction that goes across your web site. You can enable early-warning alerts for transactions that are slowing down and react before they become a problem. They will show you all the points between your order database and your customer's Safari browser, Chrome browser, Internet Explorer, or iPad. Do you know all the components in your order entry process?

This new generation of APM tools is very powerful and fast. They install very quickly and discover what is going on in your system. Usually, the biggest hurdle to adopting these tools is the processes within the companies. I have seen, and continue to see, companies with performance issues that have partial APM solutions: a combination of COTS tools and custom-developed scripts that are time consuming and problematic. A partial solution cannot see when the user experience is slowing down; it can only react when there is a noticeable issue. So the companies have a problem that repeats, and they will not install these new-generation tools until the budgeting process occurs. As an aside, many companies claim they are Agile; their processes say otherwise. Getting tools into production takes paperwork, testing and vetting, planning, and training. But this should not be a reason to wait until the next quarter.

In one of my papers, "Consequences of Poorly Performing Systems," I attempt to quantify the cost of these unforeseen and unplanned production issues. I believe the performance engineer is able to justify the purchase of these new tools. The costs of using an incomplete APM solution can be calculated: people and time, unplanned capital expenses, missed sales, and so on. The benefits of a full APM solution can be quantified.
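As a rough illustration of that calculation, the sketch below compares the annual cost of incidents diagnosed with and without a full APM solution. Every figure is an illustrative assumption; substitute your own incident counts, rates, and revenue numbers:

```python
# Rough cost model for incomplete vs. full APM. All figures illustrative.
incidents_per_year = 6
hours_to_diagnose_without_apm = 8      # war-room time per incident
hours_to_diagnose_with_apm = 1
people_per_war_room = 10
loaded_hourly_rate = 120               # dollars per person-hour
revenue_lost_per_hour = 25_000         # missed sales while the site is slow

def annual_cost(diagnose_hours):
    people_cost = (incidents_per_year * diagnose_hours
                   * people_per_war_room * loaded_hourly_rate)
    revenue_cost = incidents_per_year * diagnose_hours * revenue_lost_per_hour
    return people_cost + revenue_cost

saving = (annual_cost(hours_to_diagnose_without_apm)
          - annual_cost(hours_to_diagnose_with_apm))
print(f"Estimated annual saving from a full APM solution: ${saving:,.0f}")
```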

We have heard the term software glitch mentioned in the press a lot these last few weeks. A glitch seems to have a very wide range of impact on people and businesses. The usage ranges down to a single gas station in Texas that installed new software, where the software set the price of a gallon of gas to $1.01. This was localized to the poor station owner, who sold gas until his tanks were empty, initially costing him thousands of dollars. But I am sure the company that installed and developed the software will help him or her out.

The other end of the scale is what happened during the Facebook IPO and the Knight Capital glitches. Each of these totaled losses in the hundreds of millions. The glitch was so impactful to Knight Capital that the company was on the brink of being shut down. Knight Capital also installed new software, and the issue became apparent in the first 30 minutes.

The Facebook IPO occurred with the existing software systems, without any new or upgraded software as far as I know. The change here was the new workload generated by the traders, coupled with the outrageous volume they generated. The test plan might not have covered the very large workload, or the entire system may not have been expecting this volume.

Other examples of glitches:

  1. Southwest Airlines – A large number of small transactions crashing systems and cascading (web site, billing systems, call centers). An extreme workload: the airline sent emails to all its Facebook friends that a sale was going on, because they had reached the milestone of 3 million likes.
  2. Local Conoco gas station – New computer system; at midnight, the price of gas was set to $1.01. This caused a frenzy.
  3. Six Flags roller coaster – Suddenly stopped at the top of its ride. They ruled out mechanical issues and are looking into the programming; it may be a computer glitch.
  4. Tokyo Stock Exchange – A computer error halted derivatives trading for 95 minutes. This was the second glitch in seven months.

Two themes emerge from these glitches: the first is that new software introduces a large amount of risk and must be reviewed, tested for performance and scalability, and tested antagonistically. The second is that the impact and the uncertainties of new and extreme workloads cannot be underestimated.

Maybe we can start using the terms small, medium, large, extra-large, and business glitch.

See www.collaborative.com for a paper on Application Performance Risk, in which I have defined a list of criteria that can help determine the degree of risk of your application and workload.