System log data is Big Data.
Over the years I have been involved in a number of production rescue projects, where the system is either unavailable or in a degraded state. These are complex systems with many technical components. These projects often start with a group of diverse people and teams each saying their component is working fine. The devil is in the details. In order to find the problem, we have to look at the logs and configurations and correlate the events. Many companies have only partial solutions, where you have to get the logs from many diverse systems and do the analysis by hand (tailing files, grepping, Perl, Excel). Time and resources are consumed to find the issues. Usually there is no central dashboard or reporting, so most of the time is spent finding the right logs and making the data useful, all while the system is still not working. There are a number of new products out there to help with this, to derive Operational Intelligence from machine data.
Machine Data is Big Data
Complex business applications and their supporting systems are a rich source of data. Every technical component in the system produces logging information. There are detailed logs on every transaction that is processed successfully and there are error or exception logs for failed transactions.
The current definition of Big Data includes:
- Variety: a large number of data sources and formats. System logs have that covered.
- Velocity: rapidly changing and fast moving. Log records are created by the hundreds every second.
- Volume: a large amount of data. Data is recorded for events and for time periods.
Enterprise-class business applications can have hundreds of technical components. One business transaction will fan out and create hundreds of log and audit table entries. The Apache web server records each request, the WebSphere or WebLogic server records requests, the application within the container creates log entries, and so do the database server, the message servers, and the storage platform.
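To make the fan-out concrete, here is a minimal sketch of pulling entries from several components into a single timeline. The log entries, component names, and the `merge_timeline` helper are all hypothetical and only illustrate the idea. It assumes each log has already been parsed into tuples and is sorted by time.

```python
import heapq
from datetime import datetime

# Hypothetical, already-parsed log entries: (ISO timestamp, component, message).
# Each per-component list is assumed to be sorted by time, as log files usually are.
apache_log = [
    ("2013-05-01T10:00:01", "apache", "GET /checkout 200"),
    ("2013-05-01T10:00:07", "apache", "GET /checkout 500"),
]
app_log = [
    ("2013-05-01T10:00:02", "appserver", "order 42 received"),
    ("2013-05-01T10:00:06", "appserver", "ERROR timeout calling database"),
]
db_log = [
    ("2013-05-01T10:00:05", "database", "connection pool exhausted"),
]


def merge_timeline(*sources):
    """Merge time-sorted log sources into one time-ordered list."""
    return list(
        heapq.merge(*sources, key=lambda entry: datetime.fromisoformat(entry[0]))
    )


timeline = merge_timeline(apache_log, app_log, db_log)
for timestamp, component, message in timeline:
    print(timestamp, component.ljust(10), message)
```

Read top to bottom, the merged view puts the database message just before the application error and the failed web request. That ordering is hard to see when each team looks only at its own log.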
Do you have a big picture of your operations environment? This rich and complex technical environment presents a challenge for the operations and development teams. It presents a challenge for the software performance engineer. When response time slows down, or a component is not quite working as expected, or in general a production issue arises, root-cause analysis is problematic. There are only partial solutions for providing the entire picture of the problem. Locating the causes of the problem can be time consuming. The Operations team is managing a larger and larger number of real and virtual environments. Plus, the business is asking for more frequent releases into production.
New tools for machine data
These tools not only help with trending the performance of key business transactions and with root-cause analysis, where they can track configuration changes (often the root of problems); they also bring a big compliance and security benefit. Correlating system access logs can help with information forensics.
You need to collect all the machine data that is available to you:
- Network packet flow information from the Cisco routers and switches
- Storage subsystem metrics from EMC or NetApp
- Database metrics from MySQL, Oracle and MS SQL Server
- Application server information from your Log4j messages
- Microsoft WMI
- Web server information from Apache
- Operating system information from AIX, Linux and VMware
- End user experience
Massive amounts of data are collected and indexed, then made available for real-time analysis. Alerts can be configured for key metrics. If you use these tools to monitor all the logs and configuration files, and have dashboards created and alerts set on the key metrics, you are way ahead on the problem-solving curve.
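As a rough illustration of alerting on a key metric, the sketch below counts errors per minute and flags any minute that exceeds a threshold. The record format, the threshold value, and the `minutes_over_threshold` function are assumptions made for the example. They are not the behavior of any particular product.

```python
from collections import Counter

# Hypothetical parsed log records: (minute bucket, log level).
records = [
    ("10:00", "INFO"), ("10:00", "ERROR"),
    ("10:01", "ERROR"), ("10:01", "ERROR"), ("10:01", "ERROR"),
    ("10:02", "INFO"),
]

ERROR_THRESHOLD = 2  # assumed alert threshold: errors per minute


def minutes_over_threshold(records, threshold):
    """Return the minutes whose ERROR count exceeds the threshold, sorted."""
    errors_per_minute = Counter(
        minute for minute, level in records if level == "ERROR"
    )
    return sorted(
        minute for minute, count in errors_per_minute.items() if count > threshold
    )


alerts = minutes_over_threshold(records, ERROR_THRESHOLD)
for minute in alerts:
    print(f"ALERT: error rate above {ERROR_THRESHOLD}/min at {minute}")
```

A real tool would do this continuously over indexed data, but the principle is the same: pick the metric, pick the threshold, and let the system watch it for you.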
What is the ROI?
You have to spend money on these tools, so how do you position them with the keeper of the budgets? Will this tool replace some other tool or set of tools? Does your Application Performance Management Road Map already call for achieving Operational Intelligence? Your ROI calculation demonstrates either an indirect benefit or a direct benefit to the business. You can start with a free version of some of these products. I suggest you download them and use them in your testing environments. Then, show the value derived when you have to solve a problem.
If you can make the claim that this tool will keep your retail web site from losing customers, then you might have a direct benefit. Or, you can make the claim that with this tool fewer people are required for root-cause analysis. If any of your production outages were caused by product configuration problems, then these tools can help. You must know the cost of an outage: $5,000/minute, $15,000/minute, etc. Perhaps fewer people are needed to grow with the business. You can change your hiring profile and curve, as one outstanding operations person can now support twice the number of systems.
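A back-of-the-envelope version of this calculation might look like the sketch below. Only the $5,000/minute outage cost comes from the example above. The number of outages, the minutes saved per outage, and the tool cost are hypothetical placeholders to replace with your own figures.

```python
def annual_outage_savings(cost_per_minute, outages_per_year, minutes_saved_per_outage):
    """Estimated yearly savings from resolving each outage faster."""
    return cost_per_minute * outages_per_year * minutes_saved_per_outage


def simple_roi(savings, tool_cost):
    """ROI expressed as net benefit divided by cost."""
    return (savings - tool_cost) / tool_cost


# $5,000/minute is the outage cost from the text; the other inputs are hypothetical.
savings = annual_outage_savings(
    cost_per_minute=5_000,
    outages_per_year=4,
    minutes_saved_per_outage=30,
)
roi = simple_roi(savings, tool_cost=100_000)
print(f"Estimated savings: ${savings:,}  ROI: {roi:.0%}")
```

This simple model ignores staffing and licensing details, but it gives the keeper of the budgets a number to react to.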
These are things the forward looking software performance engineer must be doing to continue to add value to the business.
Can you make the case that these help with business agility?
Check them out.