Monthly Archives: January 2013

System log data is Big Data.


Over the years I have been involved in a number of production rescue projects, where the system is either unavailable or in a degraded state. These are complex systems with many technical components. These projects often start with a group of diverse people and teams saying their component is working fine. The devil is in the details. To find the problem, we have to look at the logs and configurations and correlate the events. Many companies have partial solutions, where you have to collect the logs from many diverse systems and do the analysis by hand (tailing files, grepping, Perl, Excel). Time and resources are consumed finding the issues. Usually there is no central dashboard or reporting; most of the time is spent finding the right logs and making the data useful, all while the system is still not working. There are a number of new products out there to help with this, to derive Operational Intelligence from machine data.

Machine Data is Big Data

Complex business applications and their supporting systems are a rich source of data. Every technical component in the system produces logging information.  There are detailed logs on every transaction that is processed successfully and there are error or exception logs for failed transactions.

The current definition of Big Data includes:

  • Variety: a large number of data sources and formats; system logs have that covered,
  • Velocity: rapidly changing and fast moving; log records are created by the hundreds every second, and
  • Volume: data is recorded for every event and for every time period.

Enterprise-class business applications can have hundreds of technical components. One business transaction will fan out and create hundreds of log and audit table entries: the Apache web server records each request, the WebSphere or WebLogic server records requests, the application within the container creates log entries, and so do the database server, the message servers, and the storage platform.
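As a sketch of the correlation work described above, here is a minimal example that groups log lines from several tiers by a shared transaction id. The log format and the `txn=` field are assumptions for illustration; real components each have their own formats.

```python
import re
from collections import defaultdict

# Hypothetical log lines from different tiers; the "txn=" field
# is an assumed correlation key, not a real product format.
log_lines = [
    "2013-01-15 10:02:01 apache txn=A17 GET /order/status 200 120ms",
    "2013-01-15 10:02:01 websphere txn=A17 OrderStatusServlet 95ms",
    "2013-01-15 10:02:01 oracle txn=A17 SELECT orders 40ms",
    "2013-01-15 10:02:02 apache txn=B42 POST /cart/add 500 310ms",
]

def correlate(lines):
    """Group log entries from all tiers by their transaction id."""
    by_txn = defaultdict(list)
    for line in lines:
        m = re.search(r"txn=(\w+)", line)
        if m:
            by_txn[m.group(1)].append(line)
    return by_txn

events = correlate(log_lines)
print(len(events["A17"]))  # one business transaction fans out to 3 entries
```

Once the entries are grouped per transaction, the per-tier timings can be compared to find which component is slow.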

Do you have a big picture of your operations environment? This rich and complex technical environment presents a challenge for the operations and development teams. It presents a challenge for the software performance engineer. When response time slows down, or a component is not quite working as expected, or in general a production issue arises, root-cause analysis is problematic. There are only partial solutions for providing the entire picture of the problem. Locating the causes of the problem can be time consuming. The Operations team is managing a larger and larger number of real and virtual environments. Plus, the business is asking for more frequent releases into production.

New tools for machine data

These tools not only help with trending the performance of key business transactions and with root-cause analysis that can track configuration changes (often the root of problems); there is also a big compliance and security benefit from them. Correlating system access logs can help with information forensics.

You need to collect all the machine data that is available to you:

  • Network packet flow information from the Cisco routers and switches,
  • Storage subsystem metrics from EMC or NetApp,
  • Database metrics from MySQL, Oracle, and MS SQL Server,
  • Application server data from your Log4j messages,
  • Microsoft WMI data,
  • Web server information from Apache,
  • Operating system information from AIX, Linux, and VMware, and
  • End-user experience measurements.

Massive amounts of data are collected and indexed, then made available for real-time analysis. Alerts can be configured for key metrics. If you use these tools to monitor all the logs and configuration files, have dashboards created, and are alerting on the key metrics, you are way ahead on the problem-solving curve.
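Alerting on key metrics can be sketched as a simple threshold check; the metric names and threshold values below are hypothetical, for illustration only.

```python
def check_alerts(metrics, thresholds):
    """Return the metrics whose latest value breaches its threshold."""
    return [name for name, value in metrics.items()
            if name in thresholds and value > thresholds[name]]

# Hypothetical key metrics derived from the indexed machine data.
metrics = {"order_status_p90_ms": 2400, "errors_per_min": 3, "cpu_pct": 62}
thresholds = {"order_status_p90_ms": 2000, "errors_per_min": 10}

print(check_alerts(metrics, thresholds))  # ['order_status_p90_ms']
```

A real product evaluates rules like this continuously against the indexed stream; the point is that a breached metric surfaces on a dashboard instead of waiting to be grepped out of a log.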

What is the ROI?

You have to spend money on these tools, so how do you position them with the keeper of the budgets? Will this tool replace some other tool or set of tools? Does your Application Performance Management Road Map already call for achieving Operational Intelligence? Your ROI calculation demonstrates either an indirect or a direct benefit to the business. You can start with a free version of some of these products; I suggest you download them and use them in your testing environments. Then, show the value derived when you have to solve a problem.

If you can make the claim that this tool will keep your retail web site from losing customers, then you have a direct benefit. Or you can make the claim that with this tool fewer people are required for root-cause analysis. If any of your production outages were caused by configuration problems, then these tools can help. You must know the cost of an outage: $5,000/minute, $15,000/minute, etc. Perhaps fewer people are needed to grow with the business. You can change your hiring profile and curve, as one outstanding operations person can now support twice the number of systems.

These are the things the forward-looking software performance engineer must be doing to continue to add value to the business.

Can you make the case that these help with business agility?

Check them out:

Sumo Logic, Splunk, Alert Logic, Loggly, LogRhythm, and more; Google "machine data".



While sitting on a slow triple chair lift, what do you think about?  Well, a ski resort is a great way to demonstrate software performance and capacity concepts.

You need a transaction rating scheme for your key Enterprise applications and systems. It's winter in New England and I like to ski. While sitting on a triple chair lift, slowly being carried up the mountain, I thought it would be great if Enterprise applications had a rating system. Each business would rate its business transactions as easy, intermediate, advanced, or expert, in terms of not only business criticality but also technology complexity. How many different systems are involved in the transaction? The ski industry has a trail rating system: green circles, blue squares, and black diamonds. If the trail is really difficult, it can be rated double or even triple black diamond.

When making the trail ratings, the factors include the trail gradient and slope, the width of the trail, the trail conditions, and whether the trail is groomed or not. The trail itself can be a nicely groomed path, a mogul trail (one with many bumps; you must be experienced to enjoy these), or glades (wooded trails where you ski around trees while going downhill). There are also terrain parks with half pipes and ski jumps.

The trail rating system is relative to each mountain. A blue trail on one mountain might be a black diamond on another, or a green on a third. So there is somewhat of a range for the mountain resort owners. The key idea is that the rating is mountain specific; it is relative to each mountain. This can be applied to our Enterprise IT systems: each system can be considered a mountain, and each business transaction can be rated as a trail. If you take some of the mountains in the northeast, such as Pats Peak (20 trails), Waterville (52), Bretton Woods (102), Killington (140), and Sunday River (132), you can see the mountain statistics sheet.

The mountain fact sheet:

Quick Stats (snapshot of your system)

  • Mountain Name: Different from the resort; Mount Tecumseh
  • Total Acres: 500
  • Skiable acres: 220
  • Longest Run: 3 miles
  • Summit Elevation: 4,004
  • Vertical Drop: 2,020

Vertical Descent by peak (each peak could be an application)

  • Total Vertical: 2,340
  • White Cap: 1,630
  • Locke Mountain: 1,460
  • Bear Mountain: 1,400
  • Jordan Bowl: 1,490

Trails (these are your transactions)

  • Total Trails: 52
  • Novice: 20%
  • Intermediate: 50%
  • Advanced: 20%
  • Expert: 10%
  • Glades: 5
  • Moguls: 6

Snow Making (the engines of the business)

  • Terrain Coverage: 616 Acres
  • Water Capacity: 9,000 gallons per minute
  • Air Capacity: 60,000 cubic feet per minute @ 150 psi
  • Snow making capacity: 4 acre feet per hour
  • Snow making arsenal: 1,900 guns
  • Miles of pipe in system: 73
  • Miles of hose in system: 30

Lifts (hardware and software)

  • Total lifts: 16, including 5 high speed
  • Skier Capacity per hour: 30,000
  • High speed gondola: 1
  • High Speed Quads: 4
  • Fixed Quads: 4
  • Triples: 3
  • Doubles: 2
  • Surface Lifts: 5

Let’s take this approach with your key systems and rate the transactions.

  • What are your Peaks (key applications)?
  • What is the ratio of the trail ratings (green to double diamonds)?
  • How many lifts and what is the capacity?
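The rating scheme above can be sketched as a simple inventory per "mountain" (system); the transaction names and ratings below are illustrative assumptions, not a real inventory.

```python
from collections import Counter

# Hypothetical trail-style ratings for one system's transactions.
# Names and ratings are made up for illustration.
ratings = {
    "order_status":      "green",         # single service, read-only
    "add_to_cart":       "blue",          # session state plus inventory check
    "checkout":          "black",         # payment, fraud, fulfillment fan-out
    "month_end_billing": "double-black",  # batch job touching every subsystem
}

mix = Counter(ratings.values())  # the ratio of trail ratings for this mountain
print(dict(mix))
```

Comparing the rating mix across systems gives the same at-a-glance picture the mountain fact sheet gives a skier: how much of this terrain is easy, and how much is expert-only.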

The next post will review the business and technical transactions with some example ratings.

What type of application are you building?

The performance goals must be negotiated between the business stakeholders and the application architect. To have a productive, two-way negotiation, you must understand the type of application and the amount of data manipulated by each transaction. For instance, generating a large report covering sales for the last two quarters will take longer than adding a product to an order. And you must have supporting data from current production systems: measurements of the key business transactions on the current system.

  • Online web transactions, ecommerce, browser based and mobile based. These transactions are typically synchronous with a customer or business partner waiting for the responses.
  • Message-based transactions: computer-to-computer transactions; what is the distance between the systems and what is the latency of the network? Typically, these are part of a larger workflow.
  • Business intelligence and reporting with on-demand reports
  • Data capture or collection from sensors
  • Calculation engine: Pricing, medical claim adjudication, insurance policy rating
  • Are there specific penalties regarding stability and availability, and performance?
  • Are there stringent time based deadlines that must be enforced?


  1. Have your historical data ready for the business discussion; be able to tell them how much it costs (real dollars) to achieve each level of performance, sub-second vs. two-second response times.
  2. Understand how the application is categorized and how that influences the performance goals. The industry you are in will make a big difference in the performance goals.

Critical business transactions

First, you and the business must understand the user response time satisfaction index. There are four levels: satisfied, tolerating, frustrated, and abandonment. Some transactions will be fine with a 3-second response time; others will need 1 second.

Then, what are the critical business transactions?

  • Define the response time for a satisfied user experience for those key business transactions.
  • Define the response time for a frustrated user experience for those key business transactions.
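The four-level satisfaction index can be sketched as a simple classifier over the defined response times. The thresholds and the tolerating band below are illustrative assumptions, not fixed industry values.

```python
def satisfaction(response_ms, satisfied_ms=2000, frustrated_ms=8000):
    """Classify one response time against the defined thresholds."""
    if response_ms <= satisfied_ms:
        return "satisfied"
    if response_ms <= 2 * satisfied_ms:  # illustrative tolerating band
        return "tolerating"
    if response_ms <= frustrated_ms:
        return "frustrated"
    return "abandonment"

print(satisfaction(1500))  # satisfied
print(satisfaction(3000))  # tolerating
```

The thresholds themselves are exactly what the negotiation with the business produces, per key transaction.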

For message-based systems; what are the throughput requirements?  How many prices must be calculated per second, how many claims must be adjudicated per second?
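A throughput requirement like "claims adjudicated per second" can be checked with a small harness; the stand-in workload below is hypothetical.

```python
import time

def measure_throughput(process, items, required_per_sec):
    """Process a batch and compare the achieved rate to the requirement."""
    start = time.perf_counter()
    for item in items:
        process(item)
    elapsed = time.perf_counter() - start
    rate = len(items) / elapsed
    return rate >= required_per_sec, rate

# Hypothetical stand-in for a claim adjudication step.
ok, rate = measure_throughput(lambda claim: sum(range(100)), range(1000), 20)
print(ok)
```

The same harness works for a pricing engine: swap in the real processing function and the required rate from the business.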

Here are some basic questions that every architect/developer/ performance engineer must be able to answer for the application:

  • Who uses your application?
  • How many people use the application?
  • What do they use it for?
  • Why do they use it?
  • When do they use your application?
  • From where do they access your application?

Define the conditions that these performance goals need to operate in.  This is the workload discussion.  What is the average number of users, and what is the peak number of concurrent users?

If you have 500,000 registered users, how many access the application?

What is the business's tolerance for failure? At what point does a degraded transaction response time become unacceptable and hurtful to the business? A transaction, such as order status, may have a response time of two seconds under normal load. How does it slow down under heavy load? It may go to 4/6/8 seconds. It may just not respond. The architect must decide when to turn users away due to load. You want to make sure that users already in the application continue to be processed under load, and not allow new users into the system until they can be processed safely.
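The "turn users away under load" policy can be sketched as a simple admission gate; the capacity number and the responses below are illustrative assumptions.

```python
import threading

# A bounded semaphore as a crude admission gate: users already in the
# system keep processing, while new arrivals are refused at capacity.
MAX_IN_FLIGHT = 100  # hypothetical capacity
gate = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(work):
    if not gate.acquire(blocking=False):
        # Shed load instead of degrading everyone already in the system.
        return "503 busy - try again shortly"
    try:
        return work()
    finally:
        gate.release()

print(handle_request(lambda: "200 ok"))
```

The real decision is where to put the gate (load balancer, web tier, or application) and what a graceful refusal looks like to the user.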

You must have a scalability plan defined for peak load conditions.

Take-away: requirements phase

1)      List all the key transactions and be specific with the transactions and the satisfied response times.  For instance, the Order Status transaction is satisfied at 2 seconds for the 90th percentile under normal load. Under peak load, tolerated response time is 3 seconds for the 90th percentile.

2)      Have a clearly defined scalability plan for flash events and for normal business growth for online transactions and calculation engines.  For instance, the adjudication engine under normal load must process 20 claims per second. At month end peak, it must process 40 claims per second.

3)      Communicate the performance goals clearly to the next phase of your SDLC. The architect team must be fully aware of these performance and scalability requirements. The goals will be validated in the performance testing process.

4)   Each component of the transaction must have a performance target. The client component, browser or app based, needs response time targets. Once the HTTP response returns to the browser, the browser must render the screen in 500 milliseconds. The architect is responsible for communicating the performance budget to each developer for each component.
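Checking a 90th-percentile goal against measured samples is straightforward; the nearest-rank method below is one common definition, and the samples are made up for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of response times."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical response-time samples (ms) for the Order Status transaction.
samples = [800, 950, 1100, 1300, 1500, 1700, 1900, 2100, 1000, 1200]
p90 = percentile(samples, 90)
print(p90, p90 <= 2000)  # does it meet the 2-second 90th-percentile goal?
```

This is the validation step the performance testing phase performs against the goals written down here.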


Mobile speed – for the performance engineer, you must account for the network.

Web-user experience

The end-user experience is greatly impacted by the network. No kidding, right? This is occurring because consumers are transitioning from the home-based browser, with a high-speed and stable network, to the wireless mobile network with varying bandwidth and varying stability. My home network connection on the latest Comcast cable service consistently delivers 22 Mbps for downloads and 10 Mbps for uploads. But when I use my iPhone 5 on Verizon Wireless 4G/3G, the upload and download speeds change significantly. Using some nice tools on my iPhone (Ookla, Network Analyzer, and Shunra, among others) I can measure latency, upload, and download speeds.

The end-user experience is paramount for today's applications, and that experience is greatly impacted by the variability of the wireless network. The user's preferred device is a mobile device, divided between smartphones and tablets. What is the performance engineer's preferred device? It must be a smartphone and a tablet. You must be able to measure the response time (end-user experience) on these devices. Smartphone adoption is occurring so fast that, according to Branding Brand, by 2014 more than half of the 49 billion visits to the Internet Top 500 e-retailers will be from smartphones. Think of how many more mobile phone contracts will hit the two-year upgrade cycle in 2013. Mobile phone plans are changing as well, encouraging people to add tablets to the plan. The upgrade cycle will switch many people from 3G to 4G, as the 4G wireless networks continue to be built out.

The content that is coming over the wireless network is changing, too. As the bandwidth increases, so does the size of the web page or app: video, pictures, voice, and more connections to social sites such as Facebook and Google+.

The performance engineer must be able to measure the latency of the network (how many milliseconds to reach the destination), the bandwidth of the connection, and the packet loss as well.
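A rough latency probe can be built from the time it takes to open a TCP connection; this is only a sketch, not a substitute for the dedicated tools named above, and it ignores DNS time and packet loss.

```python
import socket
import time

def tcp_latency_ms(host, port=80, timeout=3.0):
    """Round-trip time to open a TCP connection: a rough latency probe."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it immediately
    return (time.perf_counter() - start) * 1000.0

# Example (requires network access):
# print(tcp_latency_ms("www.example.com"))
```

Running the same probe from a wired connection, Wi-Fi, and a 3G/4G hotspot makes the variability discussed here concrete.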

Architects, developers and performance engineers

They really should be one and the same, certainly the architects and the performance engineers, but that career path discussion is for another day. Mobile impacts every aspect of the Software Performance Engineering Body of Knowledge: SA: SDLC; PVT: performance testing and validation; CP: capacity planning; APM: application performance management; and of course PDR: problem detection and resolution. Mobile is added to the underlying competencies and techniques.


The network impacts the architects and developers, and it impacts the software performance engineer. The architect must consider the challenges that varying wireless bandwidth will present. On a low-bandwidth connection, what will happen when the connections are held open longer, from 50 milliseconds to 200 milliseconds? How will this change the time-out settings within each tier of the application? Applications must be designed with unreliable bandwidth and varying latency as part of the non-functional requirements. You now have to consider designing more resilient clients, or possibly a fatter client. Web pages have more third-party site and partner connections than ever before. The web page still must be rendered even when those are not available.
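The time-out question above can be sketched as a budget check: each caller's time-out should exceed the sum of its callees' budgets, or healthy tiers will abort requests that a slow mobile link has merely stretched out. The tier names and numbers below are assumptions for illustration.

```python
# Hypothetical per-tier time-out budgets in milliseconds.
budgets_ms = {
    "browser_to_web": 8000,
    "web_to_app":     5000,
    "app_to_db":      2000,
    "app_to_partner": 2000,
}

def caller_covers_callees(caller, callees):
    """True if the caller's time-out exceeds the sum of its callees' budgets."""
    return budgets_ms[caller] > sum(budgets_ms[c] for c in callees)

print(caller_covers_callees("web_to_app", ["app_to_db", "app_to_partner"]))
```

When the mobile network stretches the client side of the budget, each downstream time-out has to be re-derived, which is exactly why the settings cannot simply be copied from the wired-browser era.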

PVT: Performance testing and validation

The activities, tasks, and tools of the performance engineer must be modified to deal with mobile applications and the variable network. The test design and test cases must include a modified workload that covers many mobile cases: different devices, different network bandwidth and latency, and, importantly, test cases for degraded response times of third-party components.

The test monitoring tools must include network measurement and network bandwidth simulators. The load-generating tool must be able to launch large-scale tests from the cloud. For instance, a tool from Shunra will be very helpful in simulating various network bandwidths.

The workload profile of a mobile site will be different from the PC/browser-based workload. The usage and access patterns will differ, and the user profiles to simulate may change. This will require more complex test scenarios and test cases.

Test results analysis changes, too. The report must include the measured response times under different network bandwidths and latencies.

More to come…