
Nonfunctional Requirements

The SPE Body of Knowledge

 

How can we benefit from using a Body of Knowledge approach for Software Performance Engineering? The goals are to address the career path, the organization, and industry practices, and to enable you to build a BoK for performance engineers within your own company. There are five knowledge areas for performance engineering.

This is an overview presentation I gave at the Greater Boston Computer Measurement Group and at the national meeting. There are two documents here, the PowerPoint and the detailed document. It is a work in progress.

PresentationSPEBoK CMG National V4

Paper with the details of the SPEBoK: The Guide to the Software Performance engineering body of knowledge V4

Please send me your comments.

 

Walter

Frustrated User

Bring Your Own Response Time.

Consumer expectations have greatly influenced the demands placed on enterprise IT departments. Consumers and IT customers brought their own devices and expected more self-service at a much faster pace. One of the key tasks of a performance engineer is to help the business and IT set expectations for the response times of corporate systems. The history of performance requirements for corporate-facing systems, and even call centers, has been problematic: the requirements are often ignored and almost certainly deferred. The typical approach is to see just how slow the system can be before the users completely revolt. This tends to happen because these are not revenue-generating systems; however, many corporate IT systems directly touch the customer or business partner after the sale closes or the contract is signed.

Response time and performance goals for Internet retailers are well defined and measured; there are many industry-specific benchmarks that compare the response times of web pages against competitors. Internet business models demand faster and faster transaction response times. Benchmarks can be found at Compuware (www.compuware.com) and Keynote (www.keynote.com), among others. However, there is no comparable benchmark for corporate systems, and the users of corporate systems are starting to voice their concerns and displeasure more loudly. They expect speeds comparable to Internet retailers: less than five seconds, and often two seconds for simple transactions.

Our studies align with the usability research done by Jakob Nielsen (www.nngroup.com). A guide to setting user expectations must consider three barriers:

1) 0.1 seconds: the user perceives the system as responding in real time, without any noticeable delay.

2) 1.0 second: the user starts to perceive a slight delay, but is still very happy with the response time.

3) 10.0 seconds: the user greatly notices the delay, becomes distracted, and attempts to do other things while waiting.

So, just as consumers have brought their own devices, they are bringing their own response times to corporate systems.

 

Performance testing

Hulk Smash puny web site.

The Healthcare.gov web site was crushed, or smashed, as soon as it opened. The top IT people are quoted as saying this was due to the volume and the unprecedented number of users accessing the site. They saw more than five times the estimated number of people: they predicted 50,000 to 60,000 concurrent users, while the actual number was 250,000 to 300,000. The total number of people who could use the web site is estimated at 23 million; if one percent are on the system at once, that is 230,000 people. To arrive at the 50,000 to 60,000 estimate, you would have to assume one quarter of one percent (0.25%). That seems a little low. Isn't it more reasonable that 1% or more would access a system that was so widely publicized?
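The arithmetic is worth making explicit. Here is a minimal sketch using only the population and percentages quoted above; the output is just the math, not new data:

```java
// Concurrency estimates for the Healthcare.gov example.
public class ConcurrencyEstimate {
    public static void main(String[] args) {
        long population = 23_000_000L;   // total eligible users, per the post
        double plannedPct = 0.0025;      // 0.25% yields the 50,000-60,000 estimate
        double onePct = 0.01;            // the 1% scenario

        System.out.printf("Planned (0.25%%): %,.0f concurrent users%n", population * plannedPct);
        System.out.printf("1%% scenario:     %,.0f concurrent users%n", population * onePct);
        System.out.printf("Observed:        250,000-300,000 (roughly %.1f%% of population)%n",
                100.0 * 275_000 / population);
    }
}
```

At 0.25% the plan lands on about 57,500 concurrent users; the observed load of 250,000 to 300,000 works out to roughly 1.2% of the population, right in line with the "1% or more" intuition.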

Demand forecasting and scenarios

It is always a challenge to define the workload profile for a new web site. There are user behavior questions: When are users most likely to access the site? What is the hourly traffic distribution? How many simultaneous users will there be? What is the peak, and what is normal? What are the user profiles or categories? What is the possible number of users (23 million)? What is the average number of users (1%, or 230,000)? What if you got 2x or 3x the average?

What will the workload profile look like as the registration deadline approaches? Will the pattern be like the Batman mask: a spike in the beginning, a flat middle, and then another spike at the end of enrollment?

The next topic to understand is the transaction mix: what are the users doing on the web site? For a new web site, registration is a one-time transaction for each user; they will not register again. But for a new site, this is the most critical transaction. So the question becomes: how many simultaneous registrations do you expect?

The Healthcare.gov web site starts off as a router: of the 23 million people who visit it, some percentage will be routed to their state healthcare web sites, so the eventual user population may be closer to 12 million.

Design goals

Answering these questions sets up the design goals and the performance testing goals for your web site. Most web sites, and Healthcare.gov is no different, must interact with many third parties. You can communicate your volumetrics to them so they can plan as well.

The design goals are translated into the development phase. The developers must know what the critical business transactions are, along with their response time goals and their throughput (say, 10 TPS).

Performance testing

Preparing to test a system that might have 10 million, 15 million, or more customers is no easy task. The data setup alone is huge: the team will need to create synthetic data to support the performance test. The team will then have to simulate 50,000 or 100,000 users accessing the system. This is where cloud-based performance testing tools can really help.
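As an illustration of the data-setup task, here is a minimal sketch of synthetic test-data generation. The user fields and the CSV hand-off to a load tool are assumptions for the example, not details of the Healthcare.gov project:

```java
import java.util.Random;
import java.util.UUID;

// Generates a batch of synthetic user records for a load test.
public class SyntheticUsers {
    public static void main(String[] args) {
        Random rnd = new Random(42);   // fixed seed so the test data is repeatable
        int count = 100_000;           // one batch of virtual users

        for (int i = 0; i < count; i++) {
            String userId = UUID.randomUUID().toString();
            int age = 18 + rnd.nextInt(60);                  // plausible age range
            String state = String.format("S%02d", rnd.nextInt(50));
            // In practice each row would be written to a CSV consumed by the
            // load-test tool (e.g., a cloud-based JMeter farm).
            System.out.printf("%s,%d,%s%n", userId, age, state);
        }
    }
}
```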

Production – APM

For any new web site with a new user profile, you have to watch the system utilization and the transactions. Application performance management (APM) is a critical part of being prepared for such a large system. The tools provide real-time transaction performance information, measuring the user experience, so you can spot slow transactions before they become a real issue. System resource consumption can be closely monitored as the workload increases, again allowing you to respond quickly with more resources.

Leverage the Software Performance Engineering Body of Knowledge

What can you do to prepare for a very large web site? You can take advantage of the Software Performance Engineering Body of Knowledge and apply each knowledge area to prepare:

  • Software architecture and design
  • Performance testing and validation
  • Capacity planning
  • Application performance management
  • Problem detection and resolution

These five areas work hand-in-hand to prepare your web site.

SuperBus

A Ferrari 458 Italia can move two people at speeds in excess of 200 MPH; the Superbus (pictured above) moves 23 people at speeds of 155 MPH; the newest Maglev train in Japan tops out at 312 MPH, with 14 cars each carrying 68 passengers (952 people in total). These are examples of bulk, or batch, movement of people.

Processing data means moving data from one application to another. This occurs in a number of ways: as a singleton or as a large set (batch); instantly, in near-real time, delayed by minutes, in batches throughout the business day, or in a large batch cycle at the close of the day. Each case has specific performance and throughput goals. Software performance engineering includes methods for batch processing.

Let's talk about batches…

Our complex business and IT systems still do a large amount of batch processing, moving data out of one system and into many others, with transformations and enhancements along the way. These can be internal enterprise systems, external third-party providers, or even customers. I have noticed over the past few years that batch performance design has become (by accident) an iterative approach.

The original design goals are typically specified only for the entire batch process: it must be completed by 06:00 EST. Often the batch critical path is not clear, and little thought has been given to the throughput of each batch job (transactions per second). The iteration starts during QA testing, when it is discovered that the current batch process will take 24 hours to process a day's work. Someone finally did the math. The team did not know whether it was designing a Ferrari, a Superbus, or a Maglev.

For the critical path batch processes you must define the required throughput. What are you moving? Invoices, orders, customers, transactions? How many of them do you have, and what is the peak volume? What is your scalability approach, and how do you achieve increases in throughput? Do you need to scale up or scale out?

You need to design the batch process to fit comfortably in the batch window, with room to grow. This year it is 500,000 invoices; next year it is 750,000. How do you process the increase and stay within the window? A quick feasibility check is sketched below.
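Here is a minimal sketch of that feasibility math. The invoice volumes come from the example above; the four-hour batch window is an assumption for illustration:

```java
// Converts a batch volume and window into the throughput each run must sustain.
public class BatchWindowCheck {
    static double requiredTps(long items, double windowHours) {
        return items / (windowHours * 3600.0);   // items per second
    }

    public static void main(String[] args) {
        double windowHours = 4.0;   // assumed window ending at 06:00 EST
        System.out.printf("This year: %,d invoices -> %.1f TPS required%n",
                500_000, requiredTps(500_000, windowHours));
        System.out.printf("Next year: %,d invoices -> %.1f TPS required%n",
                750_000, requiredTps(750_000, windowHours));
        // If a single program tops out below the required TPS, you must
        // scale up (faster steps) or scale out (parallel streams).
    }
}
```

In a four-hour window, 500,000 invoices require about 35 TPS and 750,000 about 52 TPS; knowing those numbers before QA is the whole point of doing the math early.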

Software Performance Engineering Sample Questions

1) What are you processing?
2) How many steps (programs) are in the batch process?
3) What is the overall throughput?
4) What is the throughput of the individual programs?
5) What is the growth rate?
6) What is the peaking factor?
7) How will the programs scale with the load?
8) Have you identified the critical path throughput goals?
9) Are all the developers aware of the throughput goals?
10) How will you test and validate the throughput of each program?

Know your design goals

The barchettas were made in the 1940s and 1950s: Italian-style topless two-seater sports cars, designed and built for one purpose, to go fast. Built for racing, weight and wind resistance were kept to a minimum; any unnecessary equipment was removed, and there was no decoration. Doors were optional. Ferrari created one of the earlier models, and others followed with the same style. The design team was focused on speed; they knew there were performance and response time goals for the car.

Market Data

Calculating system momentum – A market basket of transactions and an index

Can we use momentum as a derived value, or index, to alert us to impending problems with the application or system? Transaction response time is really a byproduct of the workload on the system resources, so maybe a better way to look at it is: does the workload have momentum? Is the workload increasing or decreasing? Borrowing from physics, momentum is equal to mass times velocity.

We could use transaction complexity to represent mass; we all know that some transactions are heavier than others. However, using response time as velocity really does not work. Instead, I could use the transaction arrival rate to represent velocity. Then I could say that the transaction, or system, momentum increases as the arrival rate increases, taking the weight of the transactions into account.

What I am looking for is a communication vehicle to let non-technical people know how healthy the system is.
Momentum is equal to the transaction weight times the arrival rate of the transactions.

I need to pick a rating scale for my transactions, say 1, 5, 10. Then there is an overall transaction arrival rate and an individual arrival rate per transaction. I need the individual transactions in order for the momentum index to have a chance of being relevant.

M = (T1 weight × T1 TPS) + (T2 weight × T2 TPS) + …, or an index?

This would be a very custom index for each application. It represents a market basket of transactions, much like an ETF represents a basket of stocks.
What I also want to determine is how quickly the momentum is changing, up or down. If I can get real-time transaction arrival rates, then I can use the momentum to get an early warning of trouble in the system. Another term for it might be a volatility index for the application. Can I get the alert at the front end of the application early and then correlate it with all the system resource monitors?

For this I need to borrow from the financial markets' high-frequency traders. They have tools and techniques that track large amounts of market data in real time and try to jump in front of the market momentum. I need to jump in front of my system momentum.
The faster I can determine that the arrival rate of the heavy transactions is increasing, the better chance I have of jumping in front of it and preventing an application or system outage. I need to calculate the rate of change of the arrival rates in real time. I need to see that at a clock tick at time zero the arrival rate is 10 TPS and the transaction response time is 300 ms, then take another sample at the next clock tick to calculate that the TPS is now 11 and the response time is 305 ms. Perfect for HFT techniques.
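Here is a minimal sketch of the momentum index and its tick-to-tick rate of change. Only the weight-times-arrival-rate formula comes from the discussion above; the transaction names, weights, and rates are illustrative:

```java
import java.util.Map;

// Momentum index: sum of (transaction weight x arrival rate) over the basket.
public class MomentumIndex {
    static double momentum(Map<String, Double> weights, Map<String, Double> tps) {
        return weights.entrySet().stream()
                .mapToDouble(e -> e.getValue() * tps.getOrDefault(e.getKey(), 0.0))
                .sum();
    }

    public static void main(String[] args) {
        // Weights on the 1/5/10 scale suggested above.
        Map<String, Double> weights = Map.of("price", 10.0, "search", 5.0, "login", 1.0);

        // Two clock ticks: the heavy "price" transaction rises from 10 to 11 TPS.
        double m0 = momentum(weights, Map.of("price", 10.0, "search", 40.0, "login", 100.0));
        double m1 = momentum(weights, Map.of("price", 11.0, "search", 42.0, "login", 100.0));

        // The rate of change between ticks is the early-warning signal.
        System.out.printf("M(t0)=%.0f  M(t1)=%.0f  delta=%.1f%%%n",
                m0, m1, 100.0 * (m1 - m0) / m0);
    }
}
```

Because the heavy transactions carry more weight, a small rise in their arrival rate moves the index faster than a similar rise in the light ones, which is exactly the behavior you want from a volatility-style alert.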

Archeology

Performance artifacts in development

Where are your requirements and development performance artifacts? Over my years as a performance engineer, I have been involved in a number of performance and scalability readiness assessments. These involve evaluating software, either from a vendor or developed in-house, to determine whether it has been designed and developed with performance and scalability goals. During a readiness assessment, the team I work with and I look for non-functional requirements for the key business and system transactions, and for development guidelines and artifacts that track or measure service time during the development and unit testing phases. The goal is to find performance issues early.

Non-Functional requirements

To start, there should be non-functional requirements defined for the development team, which develops the code that makes the business functions real. The next question is: where do your software development lifecycle and methodology (that's right, I said methodology) have activities and artifacts specific to performance, scalability, and stability? For example, suppose the application needs a change to the pricing calculation or the order history functions. How fast should they be? Where is it specified that the pricing calculation still needs to complete in 300 milliseconds after the functional change? Initially, the non-functional requirements specified that the pricing calculation must complete in 300 milliseconds for average complexity and 600 milliseconds for complex calculations. Can you point to the artifact(s) where that is defined in your methodology? Before developers begin coding, are they aware of it?
Then we look for developer guidelines and services provided by a framework. Has the performance or architecture team defined a set of guidelines for developers to use when building this type of service? Has the use of caching been defined? Who verifies that the database access and SQL statements are optimal? Where is that captured; what artifact captures it? Does each developer understand the proper use of logging and code instrumentation, or is it part of the development framework? In the case of the pricing service, each method must measure its internal service time, and each exposed public service must have a service time measurement.
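As a sketch of what that guideline might look like in code, here is a hypothetical PricingService whose public method measures its own service time. The class name, the logging style, and the 300 ms constant are assumptions built from the example above:

```java
// A service that instruments its own service time, per the guidelines.
public class PricingService {
    private static final long GOAL_MILLIS = 300;   // non-functional requirement

    public double calculatePrice(String orderId) {
        long start = System.nanoTime();
        try {
            return doCalculation(orderId);          // the actual business logic
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            // Every exposed public service records its own service time, so the
            // measurement artifact exists from the first unit test onward.
            System.out.printf("pricing.calculate %s took %d ms (goal %d ms)%n",
                    orderId, elapsedMillis, GOAL_MILLIS);
        }
    }

    private double doCalculation(String orderId) { return 42.0; }   // stub
}
```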

Continuous Integration

A key artifact to look for is the results of the weekly or daily build process. Are there test results for the internal method calls and external service calls? JUnit supports the internal verification, and JMeter can support the external verification. To get value from this, the testing database must be robust (not simply single rows with no history). But how can you use response time results during development to indicate eventual production performance? The value comes from comparing build to build: did the service time change radically? This can be an early indicator. However, the development environment or the database often changes. The performance engineer must show the business there is value in maintaining a consistent development environment. With a consistent environment, you can show that the service time of the pricing service has changed significantly, well before production.
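Here is a minimal sketch of a per-build check, assuming JUnit 5 and the hypothetical PricingService from the previous section. The 300 ms threshold follows the example goal; in practice the logged value, compared build to build, is the real artifact:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

// Runs in every build so the service time is measured and logged each time.
class PricingServiceTimeTest {
    @Test
    void priceCalculationStaysWithinGoal() {
        PricingService service = new PricingService();

        long start = System.nanoTime();
        service.calculatePrice("ORDER-1");
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        // Log the measurement for build-to-build trend charts, and fail the
        // build outright if the 300 ms goal is blown.
        System.out.println("build service time: " + elapsedMillis + " ms");
        assertTrue(elapsedMillis < 300, "pricing exceeded 300 ms goal");
    }
}
```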

Key Performance artifact

For the JMeter test case: in build 1, the pricing service is measured at 1.0 second against a goal of 300 milliseconds. Or what if the service time is 100 milliseconds? Then you need to track the service time from build to build to monitor for consistency. If the 100 milliseconds becomes 1.0 second, how did that happen? Did the environment change? Did the developer add new code to the function? You must evaluate this, because you found it early.

Workflow

Business workflow, business process

The system response time must not impact the workflow. The transition from transaction to transaction must be seamless, and the user must not notice the system. One might even describe the interaction between the person and the system as graceful and flowing, where the system responds before the user can finish a sip of coffee. Do your users cozy up to the system (too far??).

Understanding each workflow in the application is crucial to setting proper response time goals. This is required to establish the software performance requirements for the system and for each transaction that supports the workflow. Today's systems are highly distributed, with web servers, application servers, web services, message hubs, multiple databases, and so on. In the software requirements phase, once the workflows are defined with performance goals, it is critical to make everyone who builds a component in the workflow aware of those goals.

There are call center workflows, document management workflows, order placement workflows, business intelligence and analytical workflows, and of course Big Data workflows.

When in the software requirements phase, you might consider this checklist for the workflow:

1) Identify the workflows: Have the key workflows with performance requirements been defined, and are the response time goals defined?

2) Duration: Have you defined the overall duration of the workflow? How long should the call center interaction be?

3) Downstream processing: Have you defined when the data from the workflow must be available to downstream workflows? For instance, after collecting a customer's demographic and vehicle information, when is it available for rating a policy? 30 seconds? 24 hours?

4) Business transactions: These support the workflow. Have the performance-critical business transactions been defined, with response time goals?

5) System transactions: These support the business transactions. Have you defined response time goals for the critical system transactions supporting the critical business transactions? This is where shared system transactions can be found. Have your requirements captured enough performance information to tell the developer how fast each system transaction must be and how many transactions per second it must support?

6) Performance budget: Now that you have a business transaction response time goal, have you allocated the response time among all the technical components supporting the business transaction? You should create a system interaction diagram to help with this, defining the time allocated across the tiers: client, web, application, message hub, and database (see the sketch after this list).

7) Database query requests: Have you categorized your database queries, from simple transactions to complex? Is there a response time goal for each? Is there a difference between the first request and subsequent requests?

8) Report requests: Have you categorized the report request types? Simple reports take 2 seconds, while complex multi-table grouping and ordering reports that cross fiscal quarters take longer?

9) Discussion and negotiation with the end user or business sponsor: All along, you must be in discussion with the business people who own the system. The role of the architect is to work with the business and tell them what is possible and how much it will cost. The business priorities are critical. The business might want to spend the extra money to have near-real-time reporting to gain an advantage, or they might be satisfied with a four-hour reporting window.
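For item 6, here is a minimal sketch of a performance budget spread across the tiers named above. The 2-second business-transaction goal and the per-tier split are assumed example numbers, not prescriptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Allocates a business-transaction response time goal across the tiers.
public class PerformanceBudget {
    public static void main(String[] args) {
        Map<String, Integer> budgetMillis = new LinkedHashMap<>();
        budgetMillis.put("client render", 300);
        budgetMillis.put("web tier", 200);
        budgetMillis.put("application tier", 600);
        budgetMillis.put("message hub", 300);
        budgetMillis.put("database", 600);

        int total = budgetMillis.values().stream().mapToInt(Integer::intValue).sum();
        budgetMillis.forEach((tier, ms) -> System.out.printf("%-18s %4d ms%n", tier, ms));
        System.out.printf("%-18s %4d ms (goal: 2000 ms)%n", "total", total);
    }
}
```

Writing the budget down this way gives each component team a concrete number to design against, instead of one shared goal nobody owns.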

How to handle the response time discussion

Categorize: Look to categorize response times into satisfied, tolerating, frustrated, and abandonment. Two seconds could keep people satisfied with an online transaction, while eight seconds will make them frustrated. For another transaction, five seconds could keep people satisfied and 15 seconds make them frustrated.

Percentiles: You need to establish a goal for what percentage of the user population will be satisfied: the 50th, 80th, or 90th percentile? For example, 90 percent of the users should have a satisfied experience.
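Here is a minimal sketch of the percentile check: given a sample of measured response times, what does the 90th-percentile user actually see? The sample times are made up for illustration:

```java
import java.util.Arrays;

// Nearest-rank percentile over a sorted sample of response times.
public class PercentileCheck {
    static double percentile(double[] sortedMillis, double pct) {
        int idx = (int) Math.ceil(pct / 100.0 * sortedMillis.length) - 1;
        return sortedMillis[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        double[] millis = {800, 950, 1100, 1300, 1500, 1700, 1900, 2100, 2600, 7800};
        Arrays.sort(millis);
        System.out.printf("90th percentile: %.0f ms%n", percentile(millis, 90));
        // If the goal is "90% of users satisfied at 2 seconds", this sample
        // fails: the 90th percentile sits at 2.6 seconds.
    }
}
```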

Under what load: You need to discuss with the business that there is a normal workload, a peak workload, and an above-peak workload, and define a target for each. The business might be OK with a relaxed target where users are tolerating, or even frustrated, for a short duration during peak load.