What everybody should know about log analysis and effective critical alerting

Construction of the Great Wall of China began around the 7th century B.C. to protect the Chinese kingdom from Eurasian warriors. Chinese soldiers would marshal forces to protect the Great Wall from enemy attack by using smoke signals to send alerts from tower to tower. This method of alerting enabled messages to reach garrisons hundreds of miles away in just a few hours. With these alerts, soldiers could prepare to convene and combat their enemies.

Move forward more than 2,000 years and we see IT departments building modern versions of the Great Wall to protect their valued infrastructure. Yet in modern times, as these great walls are built, sentries are concerned with the perimeter as well as with the behavior inside the walls. Today, logs take on the role of sentries, recording how the system behaves. Logs provide the insights necessary to understand how the systems are performing. But proper monitoring of these numerous systems requires reviewing vast amounts of data.

How can teams effectively analyze this vast amount of data from their various systems? How can they use this data to troubleshoot issues when they do arise? How can they use this data to prepare for the IT dangers they know of and those that are unforeseen? Furthermore, how can they make sure they are alerted when serious issues and even dangers arise?

Why is Log Analysis Important for IT teams?

While most developers and DevOps teams believe in the importance of log analysis, they consider it akin to eating spinach: it’s good for you, but do we really have to do it? Logs contain a lot of important information on how the system is behaving, but analyzing them is a lot of work. Avoiding this analysis, however, is dangerous. Without it, a company cannot recognize the threats and opportunities that lie before it.

Most companies run on multiple servers and have numerous devices producing logs that inform troubleshooting, monitoring, business intelligence and SEO. Furthermore, as written in a previous article, IT infrastructure continues its move to public clouds such as Amazon Web Services, Microsoft Azure and Google Cloud. As a result, it becomes more difficult to isolate issues. And since server usage in the cloud fluctuates with the specific loads, environments, and number of active users, obtaining an accurate reading can become quite difficult.

Yet with centralized log analysis, you have a way to normalize the data in one database and acquire a sense of what the system’s “normal state” looks like. Log analysis can provide insight into cloud-based services as well as local systems. The analysis shows how the network looks when it is humming along. Knowing baseline traffic, companies then have a reference point for judging the outliers. What should our site traffic be like? Which error logs are normal and consistent with system traffic, and which are causes for alarm? Having answers to these questions enables engineers to make data-informed decisions.
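To make the idea of a baseline and its outliers concrete, here is a minimal Python sketch. It assumes you already have hourly error counts from a period you consider normal; the function names, the sample numbers and the z-score threshold of 3 are illustrative choices, not part of any particular log analysis product.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarize a known-good period as (average, standard deviation)."""
    return mean(history), stdev(history)

def is_outlier(value, baseline, threshold=3.0):
    """Flag a value whose distance from the average exceeds `threshold` standard deviations."""
    avg, spread = baseline
    if spread == 0:
        # A perfectly flat history: anything different from it is unusual.
        return value != avg
    return abs(value - avg) / spread > threshold

# Hypothetical hourly error counts from a "normal" day.
normal_hours = [12, 15, 11, 14, 13, 12, 16, 14, 13, 12, 15]
baseline = build_baseline(normal_hours)

for count in (14, 95):
    label = "cause for alarm" if is_outlier(count, baseline) else "normal"
    print(f"{count} errors/hour: {label}")
```

A simple mean and standard deviation is only one way to define "normal". Traffic with strong daily or weekly cycles would probably need a baseline per hour of day or a more robust statistic.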

Furthermore, logs and log analysis can provide insight into many key points of information throughout deployment. Analytics can be used to understand system logs, web server logs, error logs, and app logs. Logs provide us with a way to see traffic, incidents or events over time. By including log analysis as part of healthy system monitoring, the seemingly impossible process of reading logs and responding to their information becomes possible. With log analysis in place, companies can optimize and debug system performance and identify bottlenecks in the system.
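As a small illustration of seeing events over time, the Python sketch below counts error lines per minute. The log format it assumes (an ISO-style timestamp, a level, then the message, separated by spaces) and the sample lines are hypothetical; real web server or application logs will need their own parsing.

```python
from collections import Counter

def events_per_minute(lines, level="ERROR"):
    """Count log lines of the given level, grouped by minute."""
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed format
        timestamp, line_level, _message = parts
        if line_level == level:
            counts[timestamp[:16]] += 1  # keep "YYYY-MM-DDTHH:MM"
    return counts

# Hypothetical log lines in the assumed format.
sample_lines = [
    "2024-05-01T12:00:03 INFO request served",
    "2024-05-01T12:00:41 ERROR upstream timeout",
    "2024-05-01T12:01:07 ERROR upstream timeout",
    "2024-05-01T12:01:52 ERROR database unavailable",
    "malformed line",
]

errors = events_per_minute(sample_lines)
for minute in sorted(errors):
    print(minute, errors[minute])
```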

Where does ELK come in?

There are several software packages out there that provide log analysis capabilities. Some large enterprises use packages such as Splunk and Sumo Logic. Yet these packages can get quite expensive at high scale. Instead, many in the DevOps community have moved towards using the ELK (Elasticsearch, Logstash and Kibana) stack for their log analysis. ELK components can be used separately. But when joined together, they give users the ability to run log analysis on top of open source software that everyone can run for free.

ELK has many advantages over competitors – it is open source, easy to set up and provides fast performance. Of additional value is the visibility it offers into the overall IT stack. When numerous servers are running multiple applications as well as virtual machines, you need a way to easily view and analyze problems. ELK provides this opportunity in a low cost way that correlates metrics with logs.

Examples of ELK solutions

One of the biggest challenges of building an ELK deployment is making it scalable. After a new product deployment or upgrade, traffic and downloads to a site might skyrocket. Ensuring this influx doesn’t kill the system requires that all components of the ELK stack scale as well. Ideally, you would have a tool that combines these three components into a viable stack integrated in the cloud, so that scaling and security are taken care of. This is where a hosted ELK solution like Logz.io or Elastic Cloud steps in. Logz.io is built on top of AWS and enables this very type of scaling.

Additionally, when running a large environment, problems can originate from the network and cause an interruption in the application. Trying to correlate these issues can be very complicated and time consuming. The ELK Stack is useful in these cases because it provides a method to bring in data from multiple sources and create rich visualizations.

Where critical alerting and OnPage come in

Operational analysis is one of the more common use cases for ELK. DevOps engineers and site reliability engineers can get notifications of events, such as when traffic is significantly higher than usual or the error rate exceeds a certain level. Logz.io has several pre-built alerts for notifying teams when these events happen. These alerts go to Slack or email.
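The two conditions mentioned above, unusually high traffic and an error rate above a set level, can be sketched as simple threshold checks. This is a generic Python illustration and not the Logz.io alert API; the `WindowStats` type, the `check_window` function and the default thresholds (twice the usual traffic, a 5% error rate) are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    requests: int  # total requests seen in the time window
    errors: int    # requests in the window that ended in an error

def check_window(stats, usual_requests, traffic_factor=2.0, max_error_rate=0.05):
    """Return a list of alert messages for one time window (empty if all is well)."""
    alerts = []
    if stats.requests > usual_requests * traffic_factor:
        alerts.append(
            f"traffic spike: {stats.requests} requests vs usual {usual_requests}"
        )
    # Guard against dividing by zero when the window had no traffic at all.
    error_rate = stats.errors / stats.requests if stats.requests else 0.0
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    return alerts

# Hypothetical five-minute windows compared with a usual load of 2,000 requests.
quiet = check_window(WindowStats(requests=2100, errors=10), usual_requests=2000)
busy = check_window(WindowStats(requests=5000, errors=400), usual_requests=2000)
print(quiet)
print(busy)
```

In practice the "usual" figure would come from a measured baseline rather than a hard-coded number.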

Yet there is also the need to alert beyond Slack channels and email. Yes, teams are awake and monitoring systems during normal business hours. And, yes, a product like Logz.io has AI capabilities as well as crowd sourcing capabilities to help flag logs that matter. But even with this level of orchestration, teams cannot catch every system overload, every potential DDoS attack, every memory leak, every server failure. Receiving alerts about complex issues such as these is a key part of completing the picture for DevOps.

There needs to be a way to alert the DevOps on-call engineer or the IT service tech who can respond to the alert, both during and after business hours. These alerting tools need to have the following capabilities:

Persistence. Alerts need to continue until they are responded to.
Low and high priority. Not all alerts are created equal. There needs to be a differentiation between these two types so that high priority alerts come through at any hour. Low priority alerts can wait until normal business hours.
Context. Alerts need to come with information on which system sent the message, along with a timestamp.
Message exchange. Alerting tools also need to give users the ability to message one another.
Escalation. If the alerted individual is unable to answer the critical notification, the tool needs to automatically move on to the next person in the on-call group.
Audit trail. In order to improve future responses to alerting as well as to provide painless post-mortems on recent alerts, the alerting tool needs to provide an audit trail detailing who received the messages and the responses that they did or did not provide.
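Several of the capabilities listed above (priority handling, repeated alerts, escalation to the next person on call, and an audit trail) can be sketched in a few lines of Python. This is a simplified simulation and not OnPage's implementation or API; the names `Alert`, `should_send_now` and `escalate`, the 9-to-5 business hours, and the `acknowledges` callback that stands in for a real response channel are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Alert:
    source: str          # which system sent the message
    message: str
    high_priority: bool
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def should_send_now(alert, hour, business_hours=range(9, 17)):
    """High priority alerts go out at any hour; low priority ones wait for business hours."""
    return alert.high_priority or hour in business_hours

def escalate(alert, on_call, acknowledges, max_attempts=3):
    """Page each person in order, repeating up to max_attempts, until someone responds.

    `acknowledges(person, attempt)` is a stand-in for a real response channel.
    Returns (responder or None, audit trail of every attempt).
    """
    audit = []
    for person in on_call:
        for attempt in range(1, max_attempts + 1):
            acked = acknowledges(person, attempt)
            audit.append({
                "source": alert.source,
                "person": person,
                "attempt": attempt,
                "acknowledged": acked,
            })
            if acked:
                return person, audit
    return None, audit

# Hypothetical scenario: alice never answers, bob answers on his second page.
alert = Alert(source="web-01", message="error rate above 5%", high_priority=True)
responder, trail = escalate(
    alert, ["alice", "bob"], lambda person, attempt: person == "bob" and attempt == 2
)
print(responder)
for entry in trail:
    print(entry)
```

A real tool would wait between attempts and deliver pages over an actual channel. The sketch only shows the order of events and the record kept for post-mortems.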

Fortunately, OnPage Corporation’s critical alerting tool provides this level of insight and capability. Many IT shops have used OnPage’s capabilities to enable critical alerting on their systems to avoid missing important alerts and cut response time. OnPage can also be integrated with Logz.io to provide this additional level of alerting for DevOps.

Logs and IT alerting – better together

Clearly there is great and growing value in collecting and analyzing log data for IT planning, operations, and security. And while there are still challenges to be faced, best practices are emerging to help everyone understand what to expect and how to get the best return on investment in log data collection and analysis tools.

Moving forward, it is fair to expect that an integral part of future IT planning will be enabling further correlations and analysis for known and unknown issues. As these capabilities arise, it will be important to have the log analysis tools that can scale along with the growth, as well as the critical alerting tools to alert teams when issues arise.

OnPage is a critical alerting and incident notification platform used by DevOps and IT practitioners. Download a free trial to get started on the path to better incident management.