Archive

Monthly Archives: August 2012

Nonfunctional requirements define the performance and scalability goals of the business and are needed by the Architecture and Development teams.

Nonfunctional requirements define the performance and scalability goals of your application. They include a defined response time goal for each key business transaction. These transactions are what your customers will do on your web site; together they make up the user profile, or behavior. The system must respond quickly and not slow down the workflow or user experience. For instance, transaction types include: account history (what have I purchased this year?), placing a product in the shopping cart, and of course the check-out process. The actual numbers for the response time can vary by application and transaction. Each transaction must have a number assigned to it from a sliding scale of user experience (from the industry-standard Application Performance Index, or Apdex):

1)      Satisfied

2)      Tolerating

3)      Frustrated

4)      Abandonment
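As a sketch, the published Apdex formula turns the first three buckets into a single 0-to-1 score: satisfied samples count fully, tolerating samples count half, and frustrated samples count zero (abandoned sessions typically never complete, so they fall outside the formula). The sample counts below are made up for illustration:

```python
def apdex(satisfied, tolerating, frustrated):
    """Apdex score: satisfied samples count fully, tolerating count
    half, frustrated count zero. Returns a value between 0 and 1."""
    total = satisfied + tolerating + frustrated
    if total == 0:
        return None
    return (satisfied + tolerating / 2) / total

# With a 2-second target T, "tolerating" is conventionally T to 4T.
score = apdex(satisfied=850, tolerating=100, frustrated=50)
print(round(score, 2))  # 0.9
```

A score near 1.0 means nearly all users are satisfied; anything drifting toward 0.5 means a large share of users are merely tolerating the response times.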

The user business transactions are important; however, we cannot forget about batch-type business transactions. These are programs supporting the application that run without user interaction. Some run in real time, while many are still scheduled in a nightly batch processing window. For example, you may have to match all the health insurance claims to the payors. The NFR must be defined in terms of transactions per second to support the business goals. How soon must the information be made available to the application? Is there a real-time component where the claim must be processed as soon as it arrives? How many arrive in a given minute? Or do 500,000 claims arrive at 8:00 PM EST that must be available to the business by 2:00 AM for further processing?
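To make that last example concrete, the required throughput falls straight out of the batch-window arithmetic (the claim count and window are the illustrative figures from the text):

```python
from datetime import timedelta

# Illustrative figures: 500,000 claims land at 8:00 PM and must be
# ready for the business by 2:00 AM -- a 6-hour window.
claims = 500_000
window = timedelta(hours=6)

required_tps = claims / window.total_seconds()
print(f"required throughput: {required_tps:.1f} claims/sec")
# required throughput: 23.1 claims/sec
```

That sustained rate, plus headroom for reruns and slow nights, is the number the NFR should state.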

Requirements phase

The key business transactions must be defined and assigned a response time goal for the expected number of users at average load, peak load, and the extreme.

The application volumetrics should be defined:

1)      User population goals – What is the total expected user population and how does it change over time (quarter to quarter)?

2)      User behavior and profile – For example, the application must support 2,000 concurrent users with a satisfied response time of 2 seconds or less.

  • How many people will be on the web site during an average day or average hour?
  • What is your expected peak number of concurrent users?
  • Is there an extreme peaking factor? Does the business support flash marketing events?
  • Are there weekly, monthly, quarterly, year-end, or seasonal patterns that impact user concurrency?
  • How is a processing day defined? (24, 12, 8 hours?)

3)      Workflow processing – How soon must information be available for the next step in the workflow for a business process? This is directed towards real-time messaging systems or batch processing. For instance, once a prescription is entered, how soon should it be available for filling and adjudication steps?

  • Real-time message processing – Is there work, such as messages arriving from other applications, and how quickly must it be processed? Instantaneously, in seconds, in minutes, or by the next business day? These should be rated in messages or transactions per second.
  • Batch processing – Batch can run during the business day or during a nightly processing window. The overall batch process must have a transactions-per-second goal, and the individual programs must each have a TPS rating. For instance, we must convert orders into invoices at a rate of at least 15 per second.
  • Movement to other systems – Many operational systems move records into a Data Warehouse for analytics or retention. Is there a requirement on how quickly the records must be replicated into the Data Warehouse?

4)      Historical information – How long must key business information be retained? Many industries are regulated for records retention or have policies on retention.

5)      Key business transactions – These must be identified with performance requirements.

  • The business can establish general guidelines that apply to most transactions.
  • Specific business transactions must be identified, especially if there are SLAs in place under a contract.  Penalties can be incurred if the SLAs are not met for some period of time.
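One way to sanity-check the user-concurrency goals in item 2 above is Little's Law: concurrency equals arrival rate multiplied by the time each user spends in the system. A small sketch, where every number is an illustrative assumption:

```python
# Little's Law: concurrency = arrival rate x time each user spends
# on the site. Every number below is an illustrative assumption.
arrivals_per_hour = 36_000        # visitors in the peak hour
session_seconds = 180             # average session length

arrival_rate = arrivals_per_hour / 3600   # 10 users/sec
concurrent_users = arrival_rate * session_seconds
print(int(concurrent_users))              # 1800

# An extreme peaking factor (e.g. a flash-marketing event) scales it:
print(int(concurrent_users * 3))          # 5400 at a 3x peak
```

Running the business's own visitor forecasts through this arithmetic quickly shows whether a stated concurrency goal is consistent with the expected traffic.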

This information must be communicated to the architecture and design team.

Introduction

There are many examples in the news today of companies with complex systems that did not perform or scale as expected. Southwest Airlines launched a campaign to reward its customers for crossing the three millionth friend mark on Facebook. The reward lured those customers to the web site for discounted airfare. It also caused the web site to slow down for people purchasing tickets, and the status of each purchase was unclear to the customer: did they buy the ticket or not? The issue spilled over into the billing systems, which billed credit cards repeatedly for the same seat, in some cases 6-15 times or more. That drove customers to call the call center, which became overwhelmed, and people waited a long time to speak with a representative.

The software performance architect has emerged to deal with the very complex and interrelated applications and systems being designed for today's large and small companies. The performance architect will help mitigate your technology risk for the business. He or she will plan and manage the Enterprise's performance and scalability processes, tasks, and activities.

The performance architect can help negotiate with the business a range of service level agreements for the key business applications and transactions. They can help establish response times for key online transactions: should it be 1 second, 2, 3, or 5 seconds?

A key role of the performance architect is to be the clearinghouse of Enterprise performance and scalability information. A weak spot in many large enterprises is the sharing of performance information. Oftentimes the application architect is not aware of how the application is performing in the production environment, or how the architecture design decisions perform. The capacity planning team must be aware of any expected changes in the workload, and because it monitors production, it can provide feedback on how the application or web site is being used.

SDLC

The performance architect will define Enterprise-wide practices and procedures that the application architects and the application developers can use to build well-performing systems. They can help make sure that nonfunctional requirements are defined and communicated across the software development lifecycle. They can make sure that developers have implemented unit tests that indicate the performance of components early. They can help in the QA functional testing process by introducing micro-performance tests into the QA process. They can help establish guidelines for using virtual services to help test components when dependent third-party or downstream components are not available.
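A micro-performance test of the kind described above can be as simple as a unit test with a time budget. A minimal sketch in Python; the `checkout` component and the 5 ms budget are hypothetical placeholders:

```python
import time

def checkout(cart):
    # Hypothetical component under test.
    return sum(item["price"] for item in cart)

def test_checkout_under_budget():
    """Fail fast if the component blows its per-call time budget.
    The 5 ms budget is an illustrative assumption, not a standard."""
    cart = [{"price": 9.99}] * 100
    calls = 1_000
    start = time.perf_counter()
    for _ in range(calls):
        checkout(cart)
    avg = (time.perf_counter() - start) / calls
    assert avg < 0.005, f"average call took {avg:.6f}s"

test_checkout_under_budget()
print("ok")
```

Run in every build, a check like this flags a performance regression the day it is introduced, rather than weeks later in a load test.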

Performance Testing

The performance architect will provide the practices for designing and executing performance tests for sub-systems, business applications, Enterprise applications, and SaaS-hosted applications. Many large Enterprises have complex production environments that require many different applications to work together: a front-end portal, an Enterprise service bus, multiple back-end databases, and third-party applications. Each of these must be independently performance tested.  Oftentimes, the risk to the business is large enough to require a massive end-to-end performance test. In addition to performance testing, these must include failover and failback testing, and antagonistic testing.

Performance tests require detailed analysis and interpretation of results. One of the greatest values the SPA provides is informing the business of the results in business terms.

The SPA will also provide the performance testing framework for the Enterprise to use. They can define the proper tools for generating load, the tools used for test data management and data generation, the monitoring and trending tools, the deep-dive diagnostic tools for detailed analysis, and the proper results reporting formats. In order to accelerate performance testing, there must also be a virtual service component that provides the ability to stub out components during a test.
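A virtual service can be as modest as an in-process HTTP stub that returns canned responses in place of the unavailable third-party component. A sketch using only the Python standard library; the `/credit-check` endpoint and its payload are invented for illustration:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class StubHandler(BaseHTTPRequestHandler):
    """Stands in for an unavailable third-party credit-check service."""
    def do_GET(self):
        body = json.dumps({"status": "APPROVED"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), StubHandler)   # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The application under test calls the stub exactly as it would the
# real downstream service.
url = f"http://127.0.0.1:{server.server_port}/credit-check"
reply = json.loads(urlopen(url).read())
print(reply["status"])   # APPROVED
server.shutdown()
```

A stub like this also makes it easy to dial in artificial latency or error rates, which is useful for the failover and antagonistic tests mentioned above.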

Production monitoring

How is the application performing in production?  Is the workload as expected? The answers to these and more questions must be shared and made widely available. Some organizations place large-screen monitors in common areas showing a real-time performance dashboard for the key business transactions.

Application performance management (APM) is a critical component of the Enterprise performance processes.  APM captures real-time response times for every transaction on your web site. The performance architect must be the manager of that information and help distribute it to the business team, the architecture team, the developers, and the performance engineering teams.
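Conceptually, an APM agent instruments each key transaction and records a response-time sample per call. A toy sketch of that idea (the `checkout` transaction is hypothetical, and real APM products do this transparently and at far lower overhead):

```python
import time
from collections import defaultdict

response_times = defaultdict(list)   # transaction name -> samples (s)

def timed(name):
    """Record wall-clock time for every call, the way an APM agent
    captures per-transaction response times."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                response_times[name].append(time.perf_counter() - start)
        return inner
    return wrap

@timed("checkout")
def checkout():
    time.sleep(0.01)   # stand-in for real work
    return "ok"

checkout()
print(len(response_times["checkout"]))   # 1
```

The collected samples are exactly what feeds the dashboards, the Apdex scoring, and the feedback loop to the architects and capacity planners.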

A key task the performance architect must perform is to define and increase the value of performance engineering activities to the business: to answer the question of what we are getting for our investment in the processes, people, and tools of the performance engineering team.

We have heard the term software glitch mentioned in the press a lot these last few weeks. A glitch seems to have a very wide range of impact on people and businesses. The usage ranges down to a single gas station in Texas that installed new software, where the software set the price of a gallon of gas to $1.01.  This was localized to the poor station owner, who sold gas until his tanks were empty, initially costing him thousands of dollars.  But I am sure the company that developed and installed the software will help him or her out.

At the other end of the scale is what happened during the Facebook IPO and the Knight Capital glitches. Each of these totaled losses in the hundreds of millions. The glitch was so impactful to Knight Capital that the company was on the brink of being shut down. Knight Capital also installed new software, and the issue became apparent in the first 30 minutes.

The Facebook IPO occurred with the existing software systems, without any new or upgraded software as far as I know. The change here was the new workload generated by the traders, coupled with the outrageous volume they produced. The test plan might not have covered the very large workload, or the entire system may not have been expecting this volume.

Other examples of glitches:

  1. Southwest Airlines – A large number of small transactions crashing systems and cascading (web site, billing systems, call centers). An extreme workload: the airline sent emails to all its Facebook friends announcing a sale because it had reached the milestone of 3 million likes.
  2. Local Conoco gas station – A new computer system set the price of gas to $1.01 at midnight. This caused a frenzy.
  3. Six Flags roller coaster – Suddenly stopped at the top of its ride. They ruled out mechanical issues and are looking into the programming; it may be a computer glitch.
  4. Tokyo Stock Exchange – A computer error halted derivatives trading for 95 minutes. This was the second glitch in seven months.

Two themes emerge from these glitches: the first is that new software introduces a large amount of risk and must be reviewed and tested for performance, scalability, and antagonistic conditions. The second is that the impact and the uncertainties of new and extreme workloads must not be underestimated.

Maybe we can start using the terms small, medium, large, extra large, and business glitch.

See www.collaborative.com for a paper on Application Performance Risk, in which I have defined a list of criteria that can help determine the degree of risk of your application and workload.

On Wednesday morning, August 1st, right at the market open, a program began trading shares at an accelerated rate, in some cases as many as 100 trades per second. The rogue algorithm bought and sold millions of shares of companies such as Nokia, RadioShack, Wells Fargo, and Best Buy.  There was a sudden acceleration in trading volume for these and many more. The algorithm was the product of Knight Capital Group, whose tag lines are "the Science of Trading" and "The Standard of Trust".

Complex Systems
As part of our software performance engineering practice, one of the key activities I work with clients on is end-to-end performance testing for complex systems. Part of this includes defining a series of workloads to simulate during testing: what are the user profiles, and what type of volume can we expect? Today, across many different industries, those workloads increasingly include a 10X event or more. This means testing at 10 times the expected volume. This could be for a retail web site hosting a new marketing sale, an auction, and so on. In the brokerage industry, it means automated trading algorithms executing at light speed.
These performance tests and workloads help identify weak points or risks in the system, as many systems today were not designed for the new 10X events.  I have extended our performance and scalability analysis and testing to include failover and failback under load, as well as a newer category of test called antagonistic testing.  This term was coined by a colleague of mine a couple of years ago.
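A 10X run is ultimately just the scripted business transaction driven at ten times the expected volume. A bare-bones sketch of the idea; `place_order` stands in for a real scripted transaction, and all the numbers are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def place_order(i):
    # Stand-in for one scripted business transaction.
    return i * 2

def run_load(transactions, workers):
    """Drive the scripted transaction at the chosen concurrency."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(place_order, range(transactions)))

expected_peak = 200                                  # illustrative peak volume
results = run_load(expected_peak * 10, workers=50)   # the "10X" run
print(len(results))   # 2000
```

Real load tools add pacing, think times, and measurement, but the essential design question is the same: what breaks first when the expected volume is multiplied by ten?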

Wednesday's event reminds everyone of the infamous Flash Crash of May 6th, 2010. The entire market dropped 1,000 points and recovered in minutes. Since that time the SEC has put in circuit breakers to suspend trading in stocks that move too far in too short a period of time. This helps stop the overall market from moving too much; however, it still causes havoc for individual stocks.  For detailed analysis and updates I recommend the Nanex Research site: www.nanex.net