It is always productive to re-examine the assumptions behind the generally accepted truths of our industry. Today, I want to look at Big Data.
The topic of Big Data has been discussed for at least a couple of decades, since the last years of the twentieth century. Most of the early focus was on how to store the fast-growing volumes of data being generated by the new web-scale infrastructures that were popping up everywhere. What had previously been a concern only for government statisticians or academics working on high-energy physics experiments suddenly became an area of intense interest for sysadmins everywhere. Best practice used to indicate that /var/log — the portion of the file system assigned to log data — should be measured in single-digit hundreds of megabytes. Suddenly that was barely sufficient even for the systems’ own internal, technical logs, let alone anything relating to the purpose that system was designed to serve — and the volume of data was increasing by the day.
These days, we don’t pay attention to data storage that is measured in less than terabytes. Data transmission still lags an order of magnitude or so behind, but at least within a single environment, network speeds measured in gigabits per second mean that for most purposes it is now Fast Enough. Easy access to this much storage and transmission speed means that, when in doubt, the first reflex is to gather and store every single piece of data, just in case it might turn out to be useful later. That is not to say that you can’t still run into issues at extreme scale; I know of one European telco that took out its own backbone when it turned on some over-aggressive debugging options on the hardware…
For most of us, though, the problem of gathering and storing Big Data is solved. The question now is… what next?
One of the most common metaphors for the resulting Big Pile Of Data is the “data lake”. The assumption is that, if we keep pouring more and more data from more and more diverse sources into the lake, eventually we will be able to find anything in the lake that we might ever need.
Here is where those hidden assumptions come into play, and can potentially cause problems.
Data is Valuable
Because for so long it was hard to gather data, there is a tendency to attribute value to data that may not be deserved. Most of the data gathered is never even queried, let alone made use of. The data lake in this analogy is one of those Alpine dam reservoirs, with a smooth, unruffled surface and a huge mass of cold, dark, anoxic water underneath, full of boulders and dead tree trunks to snare the unwary. The water is only valuable in aggregate, because of the mass it embodies. Unstructured data is the same way: any given item of information is probably in there, but getting it out is a different story.
Extracting Value is Easy
Following the metaphor above, fishing anything out of the data lake is a time-consuming process. Nets will get snagged on lake-bottom rubbish, and return cargos of old boots, tins, and car tires more often than anything valuable. In most cases, a quick trawl of the surface layers is all there is time for; if that provides value, good, but if it doesn’t, the deeper waters are not worth the effort to go diving into. Crafting the custom requests takes time, and there is always another question to ask. Worse, because the lake is so full and deep, most queries will return something, and it takes time to determine whether the result is useful and to filter out the noise.
AI Will Save Us
Much of the more recent hype around Big Data has been about how it does not matter that humans cannot extract value from the data lake, because artificial intelligence and machine learning will spot the patterns and flag them automagically without people even having to ask. On this basis, there was no need to worry; just continue hoarding data against the glorious AI future (and keep paying the ever-larger bills for the storage and management). As the dawn of this magical age of AI continues to be seemingly just as far away as it was several years ago, though, people are beginning to notice that the patterns the latest shiny AI-enabled tools return are not always useful. Within a sufficiently deep lake, any pattern can be found, including many spurious ones.
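To see why, consider a small, self-contained Python sketch (every number in it is arbitrary, and the “metrics” are pure noise): generate a few hundred unrelated random walks, roughly the statistical shape of many operational metrics, and the best-looking “pattern” among them will still be a strong correlation.

```python
# Illustrative only: with enough unrelated series in the lake,
# strong-looking correlations appear by chance alone.
import numpy as np

rng = np.random.default_rng(42)
n_series, n_points = 500, 100

# 500 completely independent random walks, standing in for unrelated metrics.
series = rng.normal(size=(n_series, n_points)).cumsum(axis=1)

# Find the most correlated pair among all of them.
corr = np.corrcoef(series)
np.fill_diagonal(corr, 0.0)
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"Best 'pattern': series {i} vs {j}, correlation {corr[i, j]:+.2f}")
# Typically prints a strong correlation despite there being no relationship at all.
```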
It’s much better to filter the data streams before allowing them to enter the data lake, rather than trying to apply analytic algorithms to the unfiltered data lake itself. There is an important distinction to be made here between redundancy filtering, which is relatively easy and can be done at source or in stream, and higher-order analytics, which are carried out on the now information-rich data left over after the filtering. As more intelligence is made available directly at the network edge, we are finally able to break the age-old dilemma of filtering — namely, how to define a filter so that it removes just enough, but not too much. AI-enabled filtering can avoid the pitfalls of static filters, ensuring that all relevant and actionable data makes it into the second stage — and nothing else does.
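What that first, redundancy-filtering stage might look like will vary by environment. As a minimal sketch, assuming events arrive as dictionaries with hypothetical source, message, and timestamp fields (none of which come from any particular product), a simple in-stream filter can suppress repeats of the same event within a short window before anything is shipped towards the lake:

```python
# Sketch of first-stage redundancy filtering at the edge: suppress repeated
# events within a short window so only new information travels downstream.
# Field names (source, message, timestamp) and the window are assumptions.
import time
from collections import OrderedDict
from typing import Iterable, Iterator

WINDOW_SECONDS = 60.0  # arbitrary choice for the example

def filter_redundant(events: Iterable[dict],
                     window: float = WINDOW_SECONDS) -> Iterator[dict]:
    """Yield only events whose (source, message) was not seen within `window`.

    Assumes events arrive roughly in time order.
    """
    last_seen: "OrderedDict[tuple, float]" = OrderedDict()
    for event in events:
        key = (event["source"], event["message"])
        now = event.get("timestamp", time.time())
        # Drop expired entries from the front of the time-ordered cache.
        while last_seen and now - next(iter(last_seen.values())) > window:
            last_seen.popitem(last=False)
        if key in last_seen:
            continue  # redundant: same event seen recently, don't forward it
        last_seen[key] = now
        yield event

# Usage sketch: wrap whatever produces raw events before shipping them on.
raw = [
    {"source": "web-1", "message": "disk 90% full", "timestamp": 0.0},
    {"source": "web-1", "message": "disk 90% full", "timestamp": 5.0},    # dropped
    {"source": "web-1", "message": "disk 90% full", "timestamp": 120.0},  # kept again
]
for kept in filter_redundant(raw):
    print(kept)
```

A static window like this is exactly the kind of filter the paragraph above warns about; the point of pushing intelligence to the edge is to let that threshold adapt rather than be hard-coded.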
So What to Do Today?
Data is good, and AI/ML are useful. They’re just not magic, neither alone nor in combination. There are huge benefits to being able to gather more data and to increase the sensitivity of data gathering. In the same way, AI can indeed spot useful patterns and bring them to the attention of human specialists. It’s just that the massive data lake is not the way to manage all types of data, especially streaming data with a short half-life. Somebody may well graph the performance of your systems on a curve over the last couple of years, but that curve is not likely to give any particularly useful real-time insight. What you want is something that can identify a pattern involving a particular user request, the code path being exercised, and system resource utilization, determine the parameters of a new problem, and invite specialists from different domains to collaborate on understanding the meaning of the pattern the algorithm has identified.
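As a rough sketch of that kind of live correlation (the field names and thresholds below, such as request_id, code_path, latency_ms, and cpu_pct, are assumptions for illustration, not any particular product’s schema), the essential move is to join the streams as they arrive rather than after they have settled into the lake:

```python
# Sketch: join request traces with recent resource samples as they stream in,
# and flag the combined picture when both look anomalous at the same moment.
from collections import deque
from typing import Optional

LATENCY_MS_THRESHOLD = 500.0   # arbitrary example threshold
CPU_PCT_THRESHOLD = 90.0       # arbitrary example threshold
RESOURCE_WINDOW = 50           # keep only the most recent host samples

class LiveCorrelator:
    """Correlates per-request traces with recent host resource utilization."""

    def __init__(self) -> None:
        self.recent_cpu = deque(maxlen=RESOURCE_WINDOW)

    def on_resource_sample(self, sample: dict) -> None:
        # Called for every host resource sample as it arrives.
        self.recent_cpu.append(sample["cpu_pct"])

    def on_request(self, trace: dict) -> Optional[dict]:
        # Called for every completed request; returns an alert when a slow
        # request coincides with a hot host, i.e. one correlated pattern
        # worth showing to humans instead of two separate graphs.
        cpu_now = max(self.recent_cpu, default=0.0)
        if trace["latency_ms"] > LATENCY_MS_THRESHOLD and cpu_now > CPU_PCT_THRESHOLD:
            return {
                "request_id": trace["request_id"],
                "endpoint": trace["endpoint"],
                "code_path": trace["code_path"],
                "latency_ms": trace["latency_ms"],
                "cpu_pct": cpu_now,
            }
        return None

# Usage sketch with made-up values.
correlator = LiveCorrelator()
correlator.on_resource_sample({"cpu_pct": 97.0})
print(correlator.on_request({"request_id": "r-123", "endpoint": "/checkout",
                             "code_path": "cart.apply_discount",
                             "latency_ms": 830.0}))
```

A real system would replace the fixed thresholds with learned baselines, but the structural point stands: the join happens on the live streams, not on data at rest.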
That sort of correlation is, by definition, done in real time. Subsidiary systems may indeed help to identify whether this is a one-off or a recurring issue, in particular by trawling the data lake, but that is a process that takes time and may in any case come up empty: more and more incidents are occurring for the first time, with no historical antecedents. The rapid rate of change in modern infrastructure means that the proportion of novel incidents is only growing, limiting the value of the historical data lake for this purpose.
This is not to say that data lakes have no value; it’s just that they are not a universal solvent. For monitoring in particular, real-time insight is the primary goal, and the data lake can never help with that.