IT organizations are now realizing that they can only operate at a more competitive velocity if their networks and systems become more agile, scalable, and cost-effective. Software Defined Networking (SDN) and Network Function Virtualization (NFV) are next-generation approaches being quickly adopted to allow networking to better support competitive IT business objectives.
While dramatically enhancing the agility of networking, SDN and NFV have also introduced a new degree of complexity that will undeniably affect service quality.
Here is why…
- Single faults (root causes) rarely cause impact in modern, constantly changing infrastructure;
- modern, constantly changing infrastructures make it impossible to model all the likely failure scenarios;
- it is impossible to model the topology from the core network through the datacenters, the hypervisor layer (OpenStack, VMware, or even containers), the Virtualized Network Functions, and the applications and services that rely on the infrastructure;
- in modern Service Delivery Platforms, it is the confluence of multiple faults that leads to performance or capacity issues which, if left unremediated, will cause service interruptions.
What About Well Maintained Networks?
Logic says that if a network is fully software defined and maintained, then the software controller will always be aware of the state of the infrastructure, as well as its capacity and load. The controller should therefore adapt the configuration of the network automatically to cope with any adverse conditions.
That’s (potentially) all fine and good, but only where the network is homogeneous, where the entire stack from physical to application is software defined, and where you can be 100% confident that your network has no single node that can cause failures.
The problem is that achieving even two of the three conditions above is impossible in all but the simplest of lab environments.
It is well known that rules-based correlation techniques will not work. This is because:
- You don’t know all the models that need to be created;
- once a ruleset is created, things change, and the model has to change with them;
- you can’t keep up with the number of new event types.
Bottom line: This is an insoluble problem.
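To make the brittleness of rules concrete, here is a minimal sketch of rules-based correlation in Python. The rule table, event type names, and diagnosis strings are all hypothetical and are not taken from any real product. A rule only fires when every event type it names is present, so a failure made up of event types nobody wrote a rule for passes through silently.

```python
# Hypothetical rule table: a set of event types mapped to a diagnosis.
# Every name here is invented for illustration.
RULES = {
    frozenset({"LINK_DOWN", "BGP_PEER_LOST"}): "uplink failure",
}


def correlate(event_types):
    """Return the diagnosis of every rule whose event types are all present."""
    seen = set(event_types)
    return [diagnosis for pattern, diagnosis in RULES.items() if pattern <= seen]


# A failure someone anticipated and wrote a rule for.
known = correlate(["LINK_DOWN", "BGP_PEER_LOST", "CPU_HIGH"])

# A failure built from event types that did not exist when the rules were written.
unknown = correlate(["VNF_SCALE_OUT_FAILED", "HYPERVISOR_OVERCOMMIT"])

print(known)
print(unknown)
```

The second call returns an empty list: the rule engine has nothing to say until someone notices the gap and writes another rule.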
What About Historical Analysis?
The alternative approach is to capture a huge set of data (events and/or time series data), then apply analytics to that data in order to identify ‘Features.’ You then compare those Features with real failure conditions, promote the ones that match to ‘Models,’ and monitor for those Models in real time. This is a common approach among vendors in the event correlation space today, but it has its own drawbacks:
- You can only detect what you have seen before;
- you are prone to phantom issues because either the circumstances do not pan out or the models are incomplete;
- and of course, things change.
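The first two drawbacks can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's implementation: the single historical model, the event type names, and the 0.5 similarity threshold are all hypothetical, and Jaccard similarity stands in for whatever matching a real system would use.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of event types."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical model learned from one past incident.
HISTORICAL_MODELS = {
    "db-outage": {"DISK_LATENCY", "DB_TIMEOUT", "APP_5XX"},
}


def match(window, threshold=0.5):
    """Return the names of the models that resemble the current event window."""
    return [
        name
        for name, signature in HISTORICAL_MODELS.items()
        if jaccard(window, signature) >= threshold
    ]


# An exact repeat of the past incident is detected.
seen_before = match({"DISK_LATENCY", "DB_TIMEOUT", "APP_5XX"})

# A partial overlap still matches, even though the application is not failing.
phantom = match({"DISK_LATENCY", "DB_TIMEOUT"})

# A failure with no precedent matches nothing.
novel = match({"VNF_HEARTBEAT_LOST", "OVS_FLOW_DROP"})

print(seen_before, phantom, novel)
```

Raising the threshold reduces the phantom matches but makes the models even less tolerant of change, which is the trade-off the list above describes.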
Forrester tells us that 74% of issues are detected by end users first.
A monitoring survey conducted at AppSphere 2015 tells us that only 35% of issues are recurring, meaning things we have seen before.
So, that leaves us with a problem: constant change, incomplete and inaccurate topology from the bottom (network) to the top (VNF or application/service), and multiple parties/organizations/silos involved in the operations management of an infrastructure.
Is there a viable solution?
A Data-Driven Approach to Service Assurance in SDN & NFV Environments
Highly software-defined environments give technologies like Moogsoft the opportunity to be truly data-driven in their approach to service assurance. Moogsoft allows organizations to dramatically improve their service quality in SDN/NFV environments through several attributes of the core technology.
Firstly, Moog leverages machine learning algorithms to detect features and anomalies (we call them ‘Situations’) without pre-training the system or relying on topology. Moog will begin creating Situations as soon as you turn it on, although the customer goes through a calibration stage for fine-tuning.
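To give a feel for what grouping events without rules or topology can look like, here is a deliberately simple Python sketch. It is not Moogsoft's algorithm, which this article does not describe. It is a greedy clustering that groups events arriving close together in time with similar wording. The `Event` fields, the 60-second gap, and the 0.3 similarity threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    ts: float      # seconds since some reference point
    source: str    # device or service that raised the event
    text: str      # free-text description


def similarity(a, b):
    """Jaccard similarity of the lower-cased word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cluster(events, max_gap=60.0, min_sim=0.3):
    """Greedily group events that are close in time and similar in wording."""
    clusters = []
    for ev in sorted(events, key=lambda e: e.ts):
        for group in clusters:
            last = group[-1]
            if ev.ts - last.ts <= max_gap and similarity(ev.text, last.text) >= min_sim:
                group.append(ev)
                break
        else:
            # No existing group is close enough, so start a new one.
            clusters.append([ev])
    return clusters


events = [
    Event(0, "leaf-1", "bgp peer 10.0.0.1 down"),
    Event(20, "vnf-fw", "bgp session to 10.0.0.1 down"),
    Event(30, "db-7", "disk usage above 90 percent"),
    Event(500, "leaf-2", "bgp peer 10.0.0.9 down"),
]

groups = cluster(events)
for group in groups:
    print([e.source for e in group])
```

The two BGP events from different layers land in one group even though nothing told the code how `leaf-1` and `vnf-fw` are connected. The later BGP event stays separate because it falls outside the time window.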
Secondly, Moog can correlate events and alerts into Situations. From these Situations, Moog can define who should be notified, what is impacted, and why the issue occurred in the first place. Moog can delegate the actions to the appropriate teams that need to (a) be aware of the Situation, and (b) remediate the Situation. That’s called Situational Awareness.
Thirdly, Moog has a virtual war-room (we call it the Situation Room) where operators can collaborate toward remediation of issues. This gives two core benefits: (a) it makes remediation more efficient, and (b) it automatically captures the knowledge of the resolution as a knowledge article.
Furthermore, Moog can allow a Managed Service Provider, which is using SDN/NFV to offer virtualized networks (Cloud VPN, Managed Security Gateway, etc.), to integrate the customer into their support. For example, where a given OpenStack Instance has an issue, the impacted Tenant can be identified and ‘brought into the Situation Room’ to be made instantly aware of the issue, improving customer experience.
What Does This Mean for Your Service Quality?
In the context of SDN/NFV, Moogsoft can detect issues in the underlay (the OpenStack layer), in the datacenter underpinning OpenStack, and in the network, and relate them to issues impacting the virtualized network functions. Moogsoft can indicate whether an issue is confined to a single VNF or is impacting all or part of the NFV layer.
Where some form of topology information is available (for example, in a BGP event from an MPLS network), Moogsoft can extract the BGP information from the alert automatically, and indicate the impacted portion of the network. Where other inventory information exists that relates to the border between IP core and aggregation, Moogsoft can reference that data from inventory or CMDB systems in real-time to augment Situations, decorating them with impact, etc.
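As a rough illustration of pulling topology hints out of an alert and decorating them with inventory data, here is a short Python sketch. It assumes the alert text follows the style of a Cisco IOS BGP adjacency-change message. The `INVENTORY` dictionary is a hypothetical stand-in for a CMDB lookup, and none of this reflects Moogsoft's actual interfaces.

```python
import re

# Matches "neighbor <IPv4 address> Up|Down" in the alert text (assumed format).
NEIGHBOR_RE = re.compile(r"neighbor (\d{1,3}(?:\.\d{1,3}){3}) (Up|Down)")

# Hypothetical inventory keyed by peer address; a real deployment would query a CMDB.
INVENTORY = {
    "10.1.1.2": {"site": "agg-east", "role": "aggregation"},
}


def enrich(alert_text):
    """Extract the BGP peer and state, then attach any inventory attributes."""
    found = NEIGHBOR_RE.search(alert_text)
    if found is None:
        return None
    peer, state = found.groups()
    return {"peer": peer, "state": state, **INVENTORY.get(peer, {})}


result = enrich("%BGP-5-ADJCHANGE: neighbor 10.1.1.2 Down BGP Notification sent")
print(result)
```

A peer that is missing from the inventory still yields the extracted peer and state, just without the extra attributes, so incomplete inventory data degrades the result rather than breaking it.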
In summary, a tool like Moogsoft is able to leverage your data in real time to help you detect incidents faster, delegate incidents to the right people, and resolve incidents faster.
Today, Moogsoft has an OEM partnership with Cisco as a part of their Virtualized Managed Service offering—where Moogsoft is the service assurance component of VMS.
In addition to software-defined networks, Moogsoft offers the same value for IMS and IP Video: it provides Situational Awareness from the transmission and mobile to the IP video/voice servers and applications, and it helps pinpoint quickly (a) that something is happening, and (b) where the cause is located.
About the author Mike Silvey
An expert in IT operational management and technology commercialization, Mike launched SunNet Manager in the UK for Sun Microsystems before founding an open systems service management business at Micromuse, where he brought several innovative service management tools (such as Remedy) into the European market and established key OEM relationships (Cisco, HP, Intel) that led to successful IPOs for both Micromuse and RiverSoft. Today, Mike is focused on scaling Moogsoft by overseeing strategic business relationships with key partners around the globe.