This is the third in a series of blog posts examining the role of AIOps in observability.
Logging is an essential method for understanding what’s happening in your environment. Logs help developers and system administrators understand where and when things have gone wrong. Ideally, logs on their own would suffice as indicators of what’s happening. However, there are far too many log messages being produced in today’s world, and most don’t contain the information we actually need.
Logs generate data for every request in the application or service, but, to be understood, logs must be aggregated. As a result, you end up with an immense accumulation of logs that are expensive to collect, process and store. The deluge of data from logs has gotten worse in recent years, because applications and services no longer run on a standalone box, but rather are distributed, containerized and ephemeral.
As I said in my previous blog post, we have moved past “static-only thresholds, manually selected metrics and widgets, across a theatre of dashboards in a dimly lit operations center”. Also gone are the days of data lakes full of useless data and queries that break the bank. After all, the information you really want is: What is going wrong right now and why? Or better yet: what is about to go wrong and why?
Remember that processing power is at a point where serverless architecture and cloud computing allow for collection and analysis to occur directly at the source. Well, that also applies to logs, not just metrics. By distributing our AIOps intelligence and building it into local collectors, we can persist log message metadata to memory while streaming log events in real time. That way, we keep only those logs that match criteria or deviations, with the ability to iteratively identify the relevant subset of logs.
First, let’s take a look at a few log message examples:
2019-09-23 23:30:20 PST abf-moo-023  [utilities]:[error]-[sel-pay.m@214]:: file found /usr/place/file
2019-09-23 23:30:20 PST agp-moo-02  [utilities] [error] [sel-pay.m@214]:: file found /usr/place/file
2019-09-23 23:30:21 PST lov-moo-19 [warn] transaction 4192384 failed [checkout@219]:
2019-09-23 23:30:22 PST ann-moo-12 [info] [cart.s99]  [break.x1132] transaction occurred 
I will admit I wrote these, and they are difficult to look at, but then again, that is the nature of logs. Let’s break them down into the information we need:
2019-09-23 23:30:20 PST | abf-moo-023 | [error] | [sel-pay] | file found /usr/place/file
2019-09-23 23:30:20 PST | agp-moo-02 | [error] | [sel-pay] | file found /usr/place/file
2019-09-23 23:30:21 PST | lov-moo-19 | [warn] | [checkout] | transaction 4192384 failed
Now let’s put that into an even clearer view:
|2019/09/23 23:30:20||abf-moo-023||Critical||Select Payment||File Not Found /usr/place/file|
|2019/09/23 23:30:20||agp-moo-02||Critical||Select Payment||File Not Found /usr/place/file|
|2019/09/23 23:30:21||lov-moo-19||Warning||Checkout||Transaction 4192384 Failed|
Notice the fourth log isn’t shown in the table above. I’d also like to point out some translations that were made: “sel-pay = Select Payment” and “error = critical.” These are shortcuts, or codes if you will, that are translatable only by those who write and support these services, with the exception of the severity. Typically, “info,” “warn,” “error” and “fatal” are translated into a smaller set of severities such as “critical,” “major” and “warning.” The point here is that, with a little transformation and processing, we have the information we need; we can begin to understand what went wrong and why, without the very common, flawed approach of sending, parsing, processing and storing all the logs. Each matching log will only adjust the original message if there’s a delta change (i.e., a value in a field has changed).
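To make this concrete, here’s a minimal sketch of the extraction and translation step applied to the sample logs above. The regular expressions, the component table and the severity mapping are illustrative assumptions modeled on my made-up examples, not Moogsoft’s actual implementation:

```python
import re

# Hypothetical translation tables, mirroring the shortcuts above:
# component codes only the service owners would know, plus a severity mapping.
COMPONENTS = {"sel-pay": "Select Payment", "checkout": "Checkout"}
SEVERITIES = {"fatal": "Critical", "error": "Critical",
              "warn": "Warning", "info": "Info"}

LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+\s+"  # timestamp + tz
    r"(?P<host>[\w-]+)\s+.*?"                              # emitting host
    r"\[(?P<sev>info|warn|error|fatal)\]"                  # severity token
)
COMP_RE = re.compile(r"\[(?P<comp>[a-z][a-z-]*)[.@]")      # e.g. [sel-pay.m@214]

def normalize(line):
    """Extract only the fields we actually need from a raw log line."""
    m = LOG_RE.search(line)
    if not m:
        return None
    comp = COMP_RE.search(line)
    code = comp.group("comp") if comp else None
    return {
        "timestamp": m.group("ts"),
        "host": m.group("host"),
        "severity": SEVERITIES[m.group("sev")],
        "component": COMPONENTS.get(code, code),
    }

line = ("2019-09-23 23:30:20 PST abf-moo-023  [utilities]:[error]"
        "-[sel-pay.m@214]:: file found /usr/place/file")
print(normalize(line))
```

Running this against the first sample log yields the timestamp, host, “Critical” severity and “Select Payment” component shown in the cleaned-up table, without shipping the raw line anywhere.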
Now let’s look at several ways of extracting insights from your logs using Moogsoft AIOps.
The first approach to understanding the value in your logs begins with a combination of two techniques: semantic syntax and neural network processing. Using vast amounts of logs as training sets, we can isolate the natural language within the strings to identify nouns, verbs and the like, and build a reusable, distributable model. As you can imagine, the original tagging or marking of the training set is quite an arduous task. But it is a very valuable one, since the typical conventions and structures used by developers can be indecipherable when you’re looking for the proverbial “needle in the haystack.” Things like date/timestamps and criticality or priority can only be written in a handful of ways, making them easy to identify and transform.
Once we’ve isolated the natural language, we begin to structure it both as surface structure and deep structure in the neural network. This is where the semantic syntax comes in to aid in the tokenization and structuring of the strings to fundamentally turn the underlying language in the log into a decipherable event that we can then begin to aggregate and deduplicate.
Many log messages within a single log file can indicate that the same event is occurring. Likewise, logs from across your microservices can indicate the same event, though not quite in the same language as the messages emitted from the other log files. Allowing all of these events to flood your system would amount to what we call “noise”: noise from hundreds, thousands, millions or even billions of duplicate events. The method in this stage of analysis applies algorithms to remove, or deduplicate, the events.
By determining the surface and deep structure of the event, we can begin to correlate and discover commonalities in the tokens and strings. This allows us to structure the tokens into deduplication keys so we can aggregate and consolidate the duplicate events into a single unique event, while persisting the metadata to memory so we understand when and if there’s a deviation that would result in an updated event or a new unique event. At this point, the event is ready to be analyzed with other events, whether they be metric, log or trace events, for correlation and causality.
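The deduplication-key idea can be sketched like this, assuming each log line has already been normalized into host, component and masked-message (“template”) fields. The key choice and field names are illustrative assumptions, not Moogsoft’s actual schema:

```python
from collections import defaultdict

# One record per unique event; metadata (count, hosts, last timestamp)
# persists in memory while duplicates are folded in.
events = defaultdict(lambda: {"count": 0, "hosts": set(), "last_seen": None})

def ingest(ts, host, component, template):
    """Fold one normalized log line into its unique event."""
    key = (component, template)   # the deduplication key
    ev = events[key]
    ev["count"] += 1
    ev["hosts"].add(host)
    ev["last_seen"] = ts
    return key

# The three sample logs from earlier in the post:
ingest("23:30:20", "abf-moo-023", "sel-pay", "file found <PATH>")
ingest("23:30:20", "agp-moo-02", "sel-pay", "file found <PATH>")
key = ingest("23:30:21", "lov-moo-19", "checkout", "transaction <NUM> failed")
print(len(events))  # 2 unique events remain from 3 raw lines
```

The two “Select Payment” errors from different hosts collapse into a single unique event carrying both hostnames in its metadata, while any delta change in a field would surface as an update or a new event.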
We now combine the above methods with an analysis of the complex rates: the rate at which log messages are occurring or, in other words, the metric behind your logs, which helps us understand how many messages are normal or not. This provides a free, actionable insight without storing all the logs. If you’ve never had to tail a log file or seen a busy log in action, imagine the lines of this post flowing from bottom to top in about 2-3 seconds flat. And remember, each line isn’t decipherable until we apply our algorithms, so along with the blur of words, they would be quite jumbled with special characters and numbers. The complex rates in themselves provide an insight into normal operating behavior, and into when something is outside optimal performance.
By analyzing the rates of the log file, deduplication instances, and unique event occurrence, we can determine and utilize a robust measure of the variability of a univariate sample of quantitative data, along with a range of probability distributions to calculate the deviation separately for above and below the median. Simply put, it’s the same as the first method (Method I) described for metrics analysis in the previous blog post. This allows for anomaly detection in your logs so you can understand how well the model is performing against your data.
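A minimal sketch of that robust, median-based approach, applied to hypothetical per-minute message counts: spread is measured separately for samples above and below the median (a “double MAD”), so skewed, bursty rates don’t need a symmetric distribution. The threshold and sample data are illustrative:

```python
import statistics

def double_mad(samples):
    """Median plus median absolute deviation, computed separately
    for samples above and below the median."""
    med = statistics.median(samples)
    upper = [s - med for s in samples if s >= med]
    lower = [med - s for s in samples if s <= med]
    return med, statistics.median(upper), statistics.median(lower)

def is_anomalous(rate, samples, k=3.0):
    """Flag a rate whose deviation from the median exceeds k robust spreads."""
    med, mad_up, mad_down = double_mad(samples)
    if rate >= med:
        return mad_up > 0 and (rate - med) / mad_up > k
    return mad_down > 0 and (med - rate) / mad_down > k

history = [98, 102, 100, 97, 103, 101, 99, 100, 104, 96]  # messages/minute
print(is_anomalous(100, history))  # typical rate → False
print(is_anomalous(400, history))  # sudden log flood → True
```

Because the median and MAD are insensitive to outliers, a single earlier burst in the history doesn’t inflate the baseline the way a mean-and-standard-deviation approach would.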
As I’ve mentioned, the conventions and structures developers use in the logs might be completely different from one dataset to another. While the algorithms will adapt to new datasets, there can be structures that you’ll want to override. For this, we provide the ability to put the model into a manual “learning” mode, which lets you conduct your own tagging and override the previous tags. This fundamentally circulates your data through methods I and II, but includes your own knowledge in the equation.
When entering manual learning mode, you’ll be able to select which parts of the string should form the description, severity, type, date/timestamp and so on, from single words or by combining words and values from multiple parts of the log. Once you apply your own tags, the tokens will be applied and the algorithms will begin to restructure the data into unique events according to your knowledge.
Natural Language That Maths Can Structure
There is an immense amount of valuable information in your logs. Your logs are meant to be simple, easy to understand, descriptive and manageable. The truth of it is, between the sheer volumes we face today and, quite frankly, text and characters that are indecipherable to us humans, logs are not manageable without the intelligence and real-time analysis AI provides. AI must draw the value and information out of your logs, along with the value you bring to assist or override AI models. Events can then be classified, severities determined and underlying metadata correlated for a high degree of context. There is natural language in your logs that math can structure quite beautifully. It’s better to manage log volumes and information, and get back to the basic simplicity logs are meant to have. That way you’ll be able to focus on your development and the services your customers love, and not on the logging infrastructure you’re paying too much for.
Now that we’ve covered metrics and logs, in that very specific order, let’s take a look at how AI is automating distributed tracing and providing more value than ever before. Tune in next time to learn about inter-process communication discovery and interceptors that provide latency information, dynamic source and sink mapping, and the ability to visualise topologies for contextual traces surrounded by metric deviations and log messages.
About the author
Adam Frank is a product and technology leader with more than 15 years of AI and IT Operations experience. His imagination and passion for creating AIOps solutions are helping DevOps and SREs around the world. As Moogsoft’s VP of Product & Design, he's focused on delivering products and strategies that help businesses to digitally transform, carry out organizational change, and attain continuous service assurance.