
In our software performance engineering practice here at Collaborative Consulting, my team and I are fortunate to work across many different clients and many different industries: financial services, insurance, retail, life sciences, energy, and more. We are called in for many different reasons, from a production performance and stability issue, to designing and executing a custom benchmark, to reviewing the results of performance tests that didn't quite deliver the value or answer the questions the business needed. I am a big proponent of clearly describing the value that a performance engineering team is providing to the business. In some cases the production system still experiences performance and stability issues, even after performance testing was done. So what happened?

When we look at each project and try to uncover where the disconnect is, we look across the following organizations: architecture and development, the performance engineering or testing team, enterprise architecture, perhaps a stage gate process (how projects are approved to move forward), performance testing, capacity planning, and production monitoring. Each has a performance concern. Oh, and don't forget the business viewpoint. One of the key questions I ask: can the performance engineering team stop a release? Under what conditions will a performance defect prevent code from going in? Do people ask, "Have they run the performance tests?", or do they ask what the results of the performance tests were?

What is the connection between the development team and the performance testing team? Certainly, the development team is a consumer of performance engineering and testing services. How siloed are these two organizations? Is the development team involved in the performance test, and how eager are they for the results? Now, not every development change needs a performance test. But the high-risk, high-value transactions must be tested at volume, with a workload analysis done as well, so there are no surprises when new code changes consume more system resources than planned.

Looking at the performance testing environment or lab, we often find a shared environment, with infrastructure changes occurring without notice. Or the testing can only occur at off-hours, and those test windows are frequently cancelled. Performance and scalability testing require significant resources: large databases, load testing tools, metrics collection, a few people, and hardware. Many times companies only go halfway; some components are close to production size, while others are a fraction of production computing resources. Then the business or IT management is surprised when the performance tests are not indicative of production performance. If you really cannot run a large-scale performance or scalability test that adds real value, you must clearly articulate that.

What else can you do? Component testing, where you rate each component in terms of transactions per second. You can do workload profiling, with a few comparably configured components, and low-volume end-to-end performance testing. You need some additional tools to monitor the workload changes driven by the code modifications. The performance engineering team can work more closely with the development team, you can introduce code profiling tools into development, and you can make use of service virtualization.
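As a rough illustration of rating one component in transactions per second, here is a minimal micro-benchmark sketch. The processOrderStatus method is a hypothetical stand-in for whatever component you are rating, and the warm-up and measurement counts are arbitrary; a proper load testing tool gives far better control over concurrency and pacing.

```java
// Minimal sketch: rate a single component in transactions per second.
// processOrderStatus is a hypothetical stand-in for the component under test.
public class ComponentTpsRating {

    // Placeholder workload; substitute the real component call here.
    static void processOrderStatus() {
        long sum = 0;
        for (int i = 0; i < 10_000; i++) { sum += i; }
        if (sum < 0) throw new IllegalStateException(); // keep the work from being optimized away
    }

    public static void main(String[] args) {
        int warmup = 1_000;     // let the JIT settle before measuring
        int measured = 10_000;  // measured invocations

        for (int i = 0; i < warmup; i++) { processOrderStatus(); }

        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) { processOrderStatus(); }
        long elapsedNanos = System.nanoTime() - start;

        double tps = measured / (elapsedNanos / 1_000_000_000.0);
        System.out.printf("Component rating: %.0f transactions per second%n", tps);
    }
}
```

A single-threaded rating like this is only a first data point; the component's behavior under concurrent load still has to be measured.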

Too often, the release process simply wants to know whether the tests have been run, not the quality and findings of the tests. In many of the stage gate processes out there, performance and scalability aren't even a requirement. Or there is simply a checkbox.

Have you clearly defined the business value of the software performance engineering process to your business? You should be able to tell quickly by looking at your budgeting process.

Transaction analysis

Not all business and system level transactions are created equal. Some system transactions are lightweight and do not consume many system resources. Others may be an order of magnitude heavier. Typically the smaller, lightweight transactions make up the majority of the transaction mix. For example, a key business transaction might be "order status", supported by dozens of system transactions. This is a relatively simple transaction: a logged-in web user selects the check-order option, and the system, which already has the customer information, can quickly retrieve the status. However, it can become more complex if the user can make modifications. Other, more complex transactions may involve rules processing, complex history queries, or an inadvertently poorly written database query.

These system transactions are executed on web servers, application servers, and database servers. Systems are designed and constructed to support tens of thousands of users executing hundreds of concurrent system transactions. These transactions manipulate customer orders, order history, billing and invoicing, and search and recommendations based on past purchases. They are eventually translated into a generic transactions-per-second rating, TPS.

Transactions per second

For example, after doing the workload analysis for a new system, the model might predict that 50 TPS can be expected. This may start as 50 URL requests per second, and then fan out to web service requests, method invocations, queue writes, and SQL statements, each a multiple of the original 50 TPS.
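To make the fan-out concrete, the sketch below multiplies the predicted 50 URL requests per second by assumed per-request counts of downstream calls. The multipliers are illustrative only; the real values come from your own workload analysis.

```java
// Illustrative fan-out from front-end URL requests to downstream work.
// All per-request multipliers below are assumptions for the example.
public class TpsFanOut {
    public static void main(String[] args) {
        double urlTps = 50.0;                  // predicted front-end rate

        double webServiceCallsPerUrl = 2.0;    // assumed
        double methodInvocationsPerUrl = 4.0;  // assumed
        double queueWritesPerUrl = 1.0;        // assumed
        double sqlStatementsPerUrl = 5.0;      // assumed

        System.out.printf("URL requests/sec:       %.0f%n", urlTps);
        System.out.printf("Web service calls/sec:  %.0f%n", urlTps * webServiceCallsPerUrl);
        System.out.printf("Method invocations/sec: %.0f%n", urlTps * methodInvocationsPerUrl);
        System.out.printf("Queue writes/sec:       %.0f%n", urlTps * queueWritesPerUrl);
        System.out.printf("SQL statements/sec:     %.0f%n", urlTps * sqlStatementsPerUrl);
    }
}
```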

The generic TPS rating is a guide; however, we need the detailed transaction mix and the workload distribution. The TPS is an aggregate across transactions, and an average across some time interval. For instance, over an eight-hour day, 1,000,000 URL requests were processed; this becomes 125,000 per hour, 2,083 per minute, and about 35 per second. This even distribution is misleading. You must know the distribution over the eight hours.

What if 50% of the transactions arrive in the first two hours, or even the first hour? Over two hours this becomes 250,000 transactions per hour, double our original estimate, and the TPS becomes roughly 70. Do you design a system that can handle 35 TPS or 70 TPS? By the way, what is your transaction mix?
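Worked through in code, assuming the 1,000,000 requests per eight-hour day from above and a peak window holding half the daily volume:

```java
// Average TPS versus peak-window TPS for the same daily volume.
public class AverageVersusPeakTps {
    public static void main(String[] args) {
        long dailyRequests = 1_000_000L;
        double businessDayHours = 8.0;

        // Even distribution over the whole day: ~35 TPS.
        double averageTps = dailyRequests / (businessDayHours * 3600.0);

        // Assume 50% of the volume lands in a two-hour peak window: ~70 TPS.
        double peakWindowHours = 2.0;
        double peakWindowShare = 0.5;
        double peakTps = (dailyRequests * peakWindowShare) / (peakWindowHours * 3600.0);

        System.out.printf("Even distribution:    %.1f TPS%n", averageTps);
        System.out.printf("Peak two-hour window: %.1f TPS%n", peakTps);
    }
}
```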

The overall TPS is composed of a transaction mix. You must be aware of the heavy transactions, as they consume significantly more resources. If you predicted 5 TPS for these heavy hitters, out of your overall 35 TPS or 70 TPS, what happens to your system when the heavy hitters increase to 10 TPS or more? Your system will run out of gas quickly, and sooner than your capacity model predicted. Using the proper monitoring tools will enable you to watch the workload profile on the production system. You can review the daily workload and compare it to the planned or normal workload.
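A minimal sketch of that daily comparison, assuming hypothetical planned and observed per-transaction TPS figures; in practice the observed numbers would come from your monitoring tools rather than being hard-coded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Compare observed per-transaction TPS against the planned workload model
// and flag transactions running noticeably hotter than planned.
// All figures are hypothetical examples.
public class WorkloadComparison {
    public static void main(String[] args) {
        Map<String, Double> plannedTps = new LinkedHashMap<>();
        plannedTps.put("Login", 10.0);
        plannedTps.put("Product search", 10.0);
        plannedTps.put("Enter order", 10.0);
        plannedTps.put("Order history (heavy)", 5.0);

        Map<String, Double> observedTps = new LinkedHashMap<>();
        observedTps.put("Login", 11.0);
        observedTps.put("Product search", 9.0);
        observedTps.put("Enter order", 10.0);
        observedTps.put("Order history (heavy)", 10.0); // heavy hitter has doubled

        for (Map.Entry<String, Double> entry : plannedTps.entrySet()) {
            String txn = entry.getKey();
            double planned = entry.getValue();
            double observed = observedTps.getOrDefault(txn, 0.0);
            String flag = observed > planned * 1.2 ? "  <-- above plan" : "";
            System.out.printf("%-25s planned %5.1f TPS, observed %5.1f TPS%s%n",
                    txn, planned, observed, flag);
        }
    }
}
```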

For improved capacity planning, your TPS must be categorized by transaction.

Peak hour and peak minute

What is your peak hour? What is the peak minute within the peak hour? Does your business require you to design and build the system for the peak minute? Is the peak a sudden sharp spike, or is it more gradual, building over minutes, sustaining, and then subsiding? Your system might be able to handle a sudden short spike: transactions slow down for a brief period (seconds), then the system works through the queues and returns to normal after a few minutes.

However, if the peak load is sustained, your system will slow down and the queues will fill up. Due to the duration of the spike, the system might not be able to work through the queues, so requests continue to slow down, connections time out, queues form for connection pools, and so on. For this case, you must design and build the system for the sustained peak.

Under load, your complex, heavy transactions will slow down disproportionately, far faster than the increase in load.
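As a rough illustration of why the slowdown is so dramatic, the sketch below applies the classic single-server queueing approximation, response time = service time / (1 - utilization). Real systems are more complicated than this model, but the shape of the curve, nearly flat at moderate load and exploding near saturation, is the point.

```java
// Rough illustration of response time growth as utilization approaches 100%,
// using the single-server (M/M/1) approximation R = S / (1 - rho).
public class QueueingSlowdown {
    public static void main(String[] args) {
        double serviceTimeMs = 200.0; // assumed service time for a heavy transaction

        double[] utilizations = {0.50, 0.70, 0.80, 0.90, 0.95, 0.99};
        for (double rho : utilizations) {
            double responseTimeMs = serviceTimeMs / (1.0 - rho);
            System.out.printf("Utilization %3.0f%% -> response time %7.0f ms%n",
                    rho * 100, responseTimeMs);
        }
    }
}
```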

Safety tips

1) Transaction mix: Identify the transaction mix and the heavy hitters.

2) TPS: Each transaction needs a TPS rating, as well as the aggregate TPS; for example, 50 TPS overall, with 10 TPS login, 10 TPS product search by name, 10 TPS enter order, etc.

3) System resources: Measure the system resources consumed by each transaction (CPU, database).

4) Performance test: Define and execute performance tests to validate each transaction's weight on the system. Run each transaction independently and measure the TPS that the JVM can safely support.

5) Monitor production: Compare workload models and trend transaction response times.

6) Identify the peak hour and the peak minute.

7) Understand the transaction workload distribution.