Increase Sales

Selling the value of your APM investment to your business

Application performance management tools provide great value to the business, from accelerating root-cause analysis to measuring the customer experience and providing business transaction dashboards. When used properly, they provide alerting and trending information, so you can find out whether a change to the application or system did no harm, some harm, or really caused a problem.

In my position as a performance engineer, and a cranky one sometimes, I have used many different APM and diagnostic tools. These tools are usually brought into the company to help solve an immediate production performance or stability problem. It starts with the developer, engineer, or operations person or team who has to solve the problem. The system is slow. An APM tool is used during a free-download period, or a vendor is brought in for a proof of concept, and the problem is solved. However, there is no budget at the time because the tool was not identified last year. So the people on the ground are stuck with no APM tool. I think there is an opportunity here to enhance the standard corporate desktop or laptop by adding an end-user monitoring or measurement tool, such as YSlow, HTTPWatch, or Dynatrace, to name a few.

In these cases, no one really knows why the system is slow. In fact, they usually can’t even tell you what they mean by slow: is it 5 seconds, 10, or 42 seconds? It is amazing that this is still the rule in most Fortune 1000 companies. The issue is that a corporate end user cannot download and install any software on the corporate laptop or desktop. With web performance tools (YSlow, Dynatrace, HTTPWatch, et al.) that can be downloaded and instantly measure the response times of web transactions, SLOW can be measured.

As a technologist, you know implicitly that the APM investment will provide value, save time, allow for better use of all resources, and provide some great-looking dashboards. So, just buy the tool. However, there are budget keepers who are not as informed as you, the performance engineer, and they don’t understand the value. So, with hat in hand, you need to find the money.

Know your business

Let’s talk about business value and business benefit. What are the goals of your business this year? If you are in insurance, is it to capture the business of more Independent Agents? If you sell on the web, then you know that faster response times mean increased revenue. What if you run a distribution center? What is the goal there? The business goal is to increase revenue, so the distribution center must move more orders. That requires a highly available and responsive environment. Don’t forget call centers.

Understand the business motivation for performance and user experience dashboards. There are a number of third-party companies that monitor the performance of web sites across industries to establish benchmarks. The business may be motivated to move up in the benchmark and may understand that some investment is needed.

Your homework: Go find your company business goals; there should be 4-6 of them.

Payback period, direct benefit and indirect benefit

Technology often gets caught in the budgeting process when you cannot clearly link the investment (purchase) to a business goal and benefit. Many technologies are far away from revenue or cost reduction, and how do you quantify flexibility? The ability and the willingness to invest in technology vary greatly by industry and by type of application. Online web retailers work relentlessly to improve the user experience and performance because the business understands the connection between performance and revenue. They can quickly justify the investment in APM solutions.

Direct benefit: The APM investment you are making will improve the application performance and customer experience for the Independent Agents. This lines up with the company goal to attract more Independent Agents and increase revenue. If the business can attract 15% more Independent Agents, then revenue increases by XXX%.

Indirect benefit: The APM investment will defer technology (Server) purchases for two quarters.  If the business has a cost containment goal then this will align with it.

Payback period: There will be an initial investment ($100K), recurring charges ($15), and one-time training expenses. Based on your direct and indirect benefits, how long is the payback period?
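As a rough sketch of the payback arithmetic, the Python function below divides the up-front costs by the net monthly benefit. Only the $100K initial investment comes from the text above. The training cost, the monthly recurring charge, and the monthly benefit are placeholder numbers I made up for illustration, so substitute your own.

```python
def payback_period_months(initial_investment, training_cost,
                          monthly_benefit, monthly_recurring_cost):
    """Months until the cumulative net benefit covers the up-front costs.

    Returns None when the recurring cost eats the whole benefit,
    because the investment then never pays back.
    """
    upfront = initial_investment + training_cost
    net_monthly = monthly_benefit - monthly_recurring_cost
    if net_monthly <= 0:
        return None
    return upfront / net_monthly


# The $100K initial investment is from the text; every other number
# is a hypothetical placeholder.
months = payback_period_months(
    initial_investment=100_000,
    training_cost=10_000,          # assumed one-time training expense
    monthly_benefit=12_000,        # assumed direct + indirect benefit
    monthly_recurring_cost=1_000,  # assumed recurring charge
)
print(f"Payback period: {months:.1f} months")
```

This simple version ignores discounting and assumes the benefit arrives evenly each month, which is usually good enough for a first conversation with the budget keepers.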

Your homework: Does your APM investment provide a direct benefit or an indirect benefit? How long will the payback period be?

Thanks.

Stresstest

Purpose of a stress test

The purpose of the stress test is to design and execute extreme testing scenarios that will cause the application to violate its Service Level Objectives. This is accomplished by increasing the workload on the application for both online and background transactions. The workload for the stress test is well above the workload of a normal day. A key outcome of the stress test is that it shows you how much headroom the system has before the user experience is severely impacted or the background processes slow down.

  • For online transactions, the response time will start to increase under the workload as more users are added and as the transaction arrival rate increases.
  • For background transactions, the component throughput will decrease as the workload increases. For example, starting at 10 Orders per second, it will slow down to three Orders per second.

Ultimately the workload is increased until the application breaks.
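One way to turn the results of such a test into a headroom number is sketched below in Python. The step-load measurements, the 3-second objective, and the normal load of 200 users are all hypothetical values chosen for illustration; the function simply walks up the load steps and stops at the first one that violates the objective.

```python
def find_headroom(measurements, slo_seconds, normal_load):
    """Return (max_load_within_slo, headroom_ratio).

    measurements: list of (load, response_time_seconds) pairs taken
    from a step-load test, one pair per load step.
    """
    max_ok = None
    for load, response_time in sorted(measurements):
        if response_time > slo_seconds:
            break  # first violation of the objective: stop here
        max_ok = load
    if max_ok is None:
        return None, 0.0  # the objective is missed even at the lowest step
    return max_ok, max_ok / normal_load


# Hypothetical results: (concurrent users, response time in seconds)
results = [(100, 0.8), (200, 0.9), (400, 1.4), (800, 2.7), (1600, 9.5)]
max_load, ratio = find_headroom(results, slo_seconds=3.0, normal_load=200)
print(f"SLO holds up to {max_load} users, {ratio:.1f}x the normal load")
```

The same idea works for background transactions if you record throughput instead of response time and flip the comparison.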

Stresstestbatch

Stress test entry criteria

A large application stress test requires a large amount of preparation in order to run the tests and get the most value from them. To execute a stress test successfully, it is critical to know the starting performance of the application. Is the application performing well under a normal workload? If the application has not achieved its service level objectives before you start, what value can you derive from the stress test? Here is a sample checklist you can use to help determine if your application is ready for a stress test.

Item No.  Criteria                                                              Comments/details
1         Passed performance test scenarios and achieved the performance goals
2         Critical business transactions have met the service level objectives  (list critical transactions)
3         Meeting current online transaction service level objectives
4         Meeting the background or batch service level objectives
5         Meeting the real-time messaging requirements, at or better than the desired throughput
6         Application is at the expected Release Level or version for production
7         At normal load, it is fitting within the Batch window
8         The Stress testing system and application configuration is like production configuration
9         Data in the database is at the expected production size.
10

GhostMachin

Please allow me to introduce myself, I am a glitch of wealth and taste. I’ve been around for a long, long year, stole many a software architect’s soul and faith. I was around when the Architect had his moment of doubt and pain. Pleased to meet you, hope you guess my name. But what’s puzzling you is the nature of my game.

Perhaps there is something more sinister at work here, with all the headlines on software glitches and problems, maybe there really are ghosts in the machines.

I rode a tank, held a general’s rank

The software glitch is large and in charge; it creeps in and disrupts your plans and intentions. Most people really want to design and build software systems and web sites that provide an outstanding user experience. However, there are too many demands and temptations to keep to the schedule, even when there is ample evidence that it is not going to work.

I laid traps for troubadours, who get killed before they reached Bombay

There are so many components involved in large-scale enterprise systems that the glitch is lying in wait at every corner, at every hand-off. Often it is hard to see while you are building the system. However, you can feel it coming, you can feel its cold breath on your neck, and when you turn around, no one is there.

You check in your code, hoping the build cycle will keep your code safe and keep the glitch out. As the system gets larger and more complex, you feel that something isn’t quite right. As the release date grows nearer, you stay later, working into the night. The code is finally sent to the QA team for testing.

Just as every cop is a criminal, and all the sinners saints

The QA team moves through their test plans and finds minor things that are not working. These issues go back for fixes and are then retested. Nearing the end of the testing cycle, several team members report errors that can’t be reproduced. The function failed, then succeeded. This is the first sign that a ghost is in the machine. The team dismisses this inconsistency and continues. They pass the new release. They kept to the schedule, but they didn’t do all the testing they had planned. They didn’t have time for a performance test.

So, if you meet me, have some courtesy, have some sympathy and some taste.

Performance testing

Hulk Smash puny web site.

The Healthcare.gov web site was crushed, or smashed, as soon as it opened. The top IT people are quoted as saying it was due to the volume and unprecedented number of users accessing the site. They had roughly five times the estimated number of people: they predicted 50,000 to 60,000 concurrent users, and there were actually 250,000 to 300,000. The total number of people who could use the web site is estimated to be 23 million; if one percent are on the system, that is 230,000 people at one time. The 50,000 to 60,000 user estimate works out to about one quarter of one percent (0.25%). That seems a little low. Isn’t it more reasonable that 1% or more would access a system that was so widely publicized?

Demand forecasting and scenarios

It is always a challenge to define the workload profile for a new web site. There are user behavior questions, such as: When are users most likely to access the web site? What is the hourly traffic pattern? How many simultaneous users will there be? What would the peak be, and what is normal? What are the types of user profiles or user categories? What is the possible number of users (23 million)? What is the average number of users (1%, 230,000)? What if you got 2X or 3X the average?
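A small Python sketch can turn these questions into a scenario table. The 23 million population, the 0.25% and 1% figures, and the 2X and 3X multipliers all come from the discussion above; the function name and the table layout are just illustrative choices.

```python
def concurrent_users(population, concurrency_pct, peak_multiplier=1.0):
    """Estimated simultaneous users for one planning scenario."""
    return round(population * concurrency_pct / 100 * peak_multiplier)


POPULATION = 23_000_000  # possible number of users, from the text

scenarios = [
    ("original estimate (0.25%)", 0.25, 1),
    ("average (1%)", 1.0, 1),
    ("2X the average", 1.0, 2),
    ("3X the average", 1.0, 3),
]
for name, pct, multiplier in scenarios:
    users = concurrent_users(POPULATION, pct, multiplier)
    print(f"{name:28} {users:>9,}")
```

Laying the scenarios side by side makes it obvious how far apart the 0.25% estimate and a 2X or 3X peak really are, and which one the design and the performance test should target.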

What will the workload profile be like as the deadline to register gets closer? Will the pattern be like the Batman mask: a spike in the beginning, then a flat stretch, and then another spike at the end of enrollment?

The next topic to understand is the transaction mix: what are the users doing on the web site? For new web sites, the registration event is a one-time transaction for each user; they will not register again. But for a new site, this is the most critical transaction. So the question becomes, how many simultaneous registrations do you expect?

The Healthcare.gov web site starts off as a router. Of the 23 million people who will visit it, some percentage will be routed to a State Healthcare web site, so maybe 12 million will eventually be the user population.

Design goals

Answering these questions sets up the design goals and the performance testing goals for your web site. Most web sites, and the Healthcare.gov site is no different, must interact with many third parties. You can communicate your volumetrics to them so they can plan as well.

The design goals are carried into the development phase. The developers must know what the critical business transactions are, along with their response time goals and their throughput goals (for example, 10 TPS).

Performance testing

Preparing to test a system that might have 10 million, 15 million, or more customers is no easy task. The data setup alone is a huge job. The team will need to create synthetic data to support the performance test. The team will then have to simulate 50,000 or 100,000 users accessing the system. This is where cloud-based performance testing tools can really help.

Production – APM

For any new web site with a new user profile, you have to watch the system utilization and the transactions. Application performance management is a critical part of being prepared for such a large system. The tools can provide real-time transaction performance information that measures the user experience. You can spot slow transactions before they become a real issue. The system resource consumption can be closely monitored as the workload increases, again allowing you to respond quickly with more resources.

Leverage the Software Performance Engineering Body of Knowledge

What can you do to prepare for a very large web site? You can take advantage of the Software Performance Engineering Body of Knowledge and apply each knowledge area to prepare.

  • Software architecture and design
  • Performance testing and validation
  • Capacity planning
  • Application performance management
  • Problem detection and resolution

These five areas work hand-in-hand to prepare your web site.

For large, complex systems, a performance data engineer is a must because, historically, the database and the SQL statements have caused the most trouble when implemented by application architects with limited database understanding. Experience tells us that when the application team refers to the database as persistent storage, there is much opportunity for performance improvement.

This role requires a deep understanding of the physical implementation of the database system. In this role, you must understand how the application will use the tables and the access paths for those tables. This insight is needed to define the indexing strategy. You must be able to create a physical implementation model, in which you spread the database tables and indexes across the disk subsystem. You must also fully understand the normalization process and know when to de-normalize for performance. The goal is to minimize joins on very large tables.

In this role, you also provide guidance on writing SQL statements and database packages and procedures, and guidelines on the use of the “hint” statement (don’t do it). Understanding the impact of views is very important. With database views, there is a balance to strike between too many views and too few. Most large organizations mandate one view per table, then add views based on application requirements. Views can hide complexity; however, if they are too complex, they can be tough for new team members to maintain.

This role requires the person to be very familiar with the particular optimizer for the database. You must be able to read SQL plans, know how to access the relevant performance tables and views, and understand wait events. What are the different types of wait events? What are the top SQL statements?
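To make the idea of reading a plan concrete, here is a minimal sketch using SQLite through Python's standard sqlite3 module. SQLite is only a stand-in I picked because it needs no setup; the post does not name a database, and the orders table and the index name are hypothetical. The sketch asks the optimizer for its plan before and after an index is created on the column used in the WHERE clause.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as a list of strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index the optimizer has to scan the whole table.
before = query_plan(conn, sql, (42,))

# With an index on customer_id it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = query_plan(conn, sql, (42,))

print("before:", before)
print("after: ", after)
```

The exact wording of the plan text varies by SQLite version, and other databases expose far richer plans and wait-event views, but the habit is the same: look at the access path before you decide on the indexing strategy.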

This week the stock market experienced a technical glitch. A rather large one, as the market stopped for three hours. These complex systems are continuously being pushed further and further in an industry where all parties are looking for an edge. This edge is in the 1 millisecond range and going lower. The benefit is to find fractions of pennies, lots of them. If we push the edge too far, we risk shutting the system down. This, in my mind, is a great example of asymmetric risk.

Asymmetric risk: essentially, the downside is catastrophic and the upside is incremental. Asymmetric risk means that you have reached the point where you keep chasing a small benefit and the risk is now much greater than the benefit. In this case, perhaps the benefit is shaving 1/10 of a penny off the price or transaction cost, or pick your benefit. The risk is that we shut down the exchange for a couple of hours if something goes wrong. In this one case, did this glitch just wipe out all the accumulated benefit?

You could ask Knight Capital the same question.

 

SuperBus

A Ferrari 458 Italia can move two people at speeds in excess of 200 MPH. The Superbus (pictured above) moves 23 people at speeds of 155 MPH. The newest Maglev train in Japan tops out at 312 MPH, with 14 cars each carrying 68 passengers (952 people). These are examples of bulk or batch movement of people.

Applications process data and move it from one application to another. This occurs in a number of ways: as a singleton or as a large set (batch), instantly, in near-real time, delayed by minutes, in batches throughout the business day, or in a large batch cycle at the close of the day. Each case has specific performance and throughput goals. Software performance engineering includes methods for batch processing.

Let’s talk about batches…

Our complex business and IT systems of today still do a large amount of batch processing, moving data out of one system and into many more systems, with transformation and enhancements along the way. These can be internal systems in the Enterprise, external third-party providers, or even customers. I have noticed over the past few years that batch performance design has become (by accident) an iterative approach.

The original design goals were simply specified for the entire batch process: it must be completed by 06:00 EST. Often, the batch critical path is not clear. There has been little thought given to the throughput of each batch job (transactions per second). The iteration starts during the QA testing process, when it is discovered that the current batch process will take 24 hours to process the day’s work. Someone finally did the math. The team did not know if it was designing a Ferrari, a Superbus, or a Maglev.

For the critical path batch processes, you must define the required throughput. What are you moving? Invoices, orders, customers, transactions? How many of them do you have, and what is the peak volume? What is your scalability approach, and how do you achieve increases in throughput? Do you need to scale up or scale out?

You need to design the batch process to fit comfortably in the batch window with room to grow. This year it is 500,000 invoices; next year it is 750,000 invoices. How do you process the increase and stay within the window?
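Doing the math early does not take much. The Python sketch below checks whether a chain of batch programs fits in the window for this year's and next year's invoice volumes. The 500,000 and 750,000 invoice counts come from the text; the 6-hour window and the per-program throughputs are hypothetical numbers, and the sketch assumes the programs on the critical path run one after another.

```python
def required_throughput(volume, window_hours):
    """Items per second needed to finish `volume` items inside the window."""
    return volume / (window_hours * 3600)


def critical_path_hours(volume, step_throughputs):
    """Elapsed hours when the programs run one after another.

    step_throughputs: items per second for each program on the critical path.
    """
    return sum(volume / tps for tps in step_throughputs) / 3600


WINDOW_HOURS = 6           # assumed batch window
STEP_TPS = [100, 50, 100]  # assumed throughput of three programs

for year, invoices in (("this year", 500_000), ("next year", 750_000)):
    need = required_throughput(invoices, WINDOW_HOURS)
    elapsed = critical_path_hours(invoices, STEP_TPS)
    verdict = "fits" if elapsed <= WINDOW_HOURS else "does NOT fit"
    print(f"{year}: need {need:.1f}/s overall, run takes {elapsed:.1f} h, {verdict}")
```

With these made-up numbers the process fits this year and misses the window next year, and the slowest program (50 per second) is the obvious place to scale up or scale out.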

Software performance engineering Sample Questions

1 What are you processing?
2 How many steps (programs) are in the batch process?
3 What is the overall throughput?
4 What is the throughput of the individual programs?
5 What is the growth rate?
6 What is the peaking factor?
7 How will the programs scale with the load?
8 Have you identified the critical path throughput goals?
9 Are all the developers aware of the throughput goals?
10 How will you test and validate the throughput of each program?

Know your design goals

The barchettas were made in the 1940s and 1950s. The barchetta is an Italian-style topless two-seater sports car, designed and built for one purpose: to go fast. It was built for racing, so weight and wind resistance were kept to a minimum, any unnecessary equipment was removed, and there was no decoration. Doors were optional. Ferrari created one of the earlier models, and others followed with the same style. The design team was focused on speed; they knew there were performance and response time goals for the car.