With real-time machine learning capabilities, Moogsoft delivers complete situational awareness for IT Operations teams, allowing them to detect incidents and restore service faster than ever before.
AWS CloudWatch in the Grand Scheme of Things
Amazon CloudWatch is your primary tool for monitoring the AWS resources and applications running on your Amazon infrastructure. It gives you broad visibility into resources such as EC2 instances, Elastic Load Balancers, EBS volumes, RDS database instances, SQS queues, and SNS topics.
When an Incident occurs and one of your cloud applications is impacted, you might be used to seeing hundreds or thousands of CloudWatch alerts. These can take hours to analyze and understand while your end users are being impacted. With Moogsoft, there’s a better way.
Moogsoft Brings Correlation and Collaboration to CloudWatch
With Moogsoft, alert storms are reduced and contextualized in real-time, allowing Incidents to be simplified and resolved in just minutes.
Moogsoft does this by (1) de-duplicating and blacklisting unwanted CloudWatch events and (2) using machine learning to correlate CloudWatch alerts in real-time and create individual clusters (‘Situations’) of alerts that capture the full narrative of an Incident, from beginning to end.
However, the full narrative of an Incident can’t always be composed of alerts from a single monitoring tool. In fact, that’s rarely the case when major Incidents occur at large enterprises and service providers. Fortunately, Moogsoft can ingest events and alerts from your entire production stack and correlate them in real-time to give you full situational awareness.
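To make the two steps above concrete, here is a minimal sketch in Python. It is not Moogsoft's algorithm: the `Alert` fields, the function names, and the fixed time-window grouping are all assumptions chosen for illustration, and the time window is a deliberately simple stand-in for the machine learning correlation described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; real CloudWatch events carry more fields.
    source: str       # monitoring tool, e.g. "cloudwatch"
    resource: str     # affected resource identifier
    name: str         # alarm name
    timestamp: float  # seconds since some reference point


def drop_unwanted(alerts, blocked_names):
    """Step 1a: discard alerts whose alarm name is on a block list."""
    return [a for a in alerts if a.name not in blocked_names]


def deduplicate(alerts):
    """Step 1b: keep only the earliest alert per (source, resource, name)."""
    seen = {}
    for a in sorted(alerts, key=lambda a: a.timestamp):
        seen.setdefault((a.source, a.resource, a.name), a)
    return list(seen.values())


def group_by_time(alerts, window_seconds=300):
    """Step 2 (simplified): start a new group whenever the gap between
    consecutive alerts exceeds window_seconds."""
    groups = []
    for a in sorted(alerts, key=lambda a: a.timestamp):
        if groups and a.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(a)
        else:
            groups.append([a])
    return groups


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    raw = [
        Alert("cloudwatch", "i-1", "CPUHigh", 0),
        Alert("cloudwatch", "i-1", "CPUHigh", 60),       # repeat of the first
        Alert("cloudwatch", "vol-1", "Heartbeat", 30),   # unwanted noise
        Alert("cloudwatch", "db-1", "WriteLatency", 120),
        Alert("cloudwatch", "q-1", "QueueDepth", 2000),  # much later
    ]
    unique = deduplicate(drop_unwanted(raw, {"Heartbeat"}))
    for group in group_by_time(unique):
        print([a.name for a in group])
```

Even this naive version shows the effect: five raw alerts shrink to three unique ones, which fall into two groups. A production system would correlate on far richer signals than time proximity alone.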
As a basic example, let’s say that you have an S3 storage issue and one of your applications can’t write to it. CloudWatch allows you to look at the impact through the storm of application events that come in. But what came first? How do you differentiate between the cause and the symptoms?
With Moogsoft, you can see exactly how this Incident unfolded.
You can interface directly with CloudWatch and tools like New Relic from Moogsoft’s Situation Room to get more context around the alerts that comprise each particular Situation.
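The "what came first?" question can be illustrated with a small Python sketch that merges alerts from several tools into one timeline. The event records, field names, and timestamps below are hypothetical, and the earliest alert is only a candidate for the cause, not proof of it.

```python
from datetime import datetime


def build_timeline(events):
    """Return events from all sources ordered by their timestamp."""
    return sorted(events, key=lambda e: datetime.fromisoformat(e["time"]))


def earliest(events):
    """Return the first event on the merged timeline (a candidate cause)."""
    return build_timeline(events)[0]


if __name__ == "__main__":
    # Made-up events loosely modeled on the S3 example in the text.
    events = [
        {"source": "newrelic", "time": "2024-01-01T10:00:40",
         "message": "Application error rate above threshold"},
        {"source": "cloudwatch", "time": "2024-01-01T10:00:05",
         "message": "S3 write errors detected"},
        {"source": "cloudwatch", "time": "2024-01-01T10:01:10",
         "message": "SQS queue depth rising"},
    ]
    for e in build_timeline(events):
        print(e["time"], e["source"], e["message"])
    print("Candidate cause:", earliest(events)["message"])
```

Ordering by time is only a starting point; separating cause from symptoms usually also needs knowledge of how the affected services depend on each other.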
Furthermore, Moogsoft facilitates collaboration across teams. When a Situation occurs, all relevant cross-domain stakeholders are automatically notified by Moogsoft so that they can come together to communicate and collaborate around remediation. Whether insights are shared, command scripts are executed, or resolving steps are revealed, all activity in the Situation Room is archived for future reference when similar Situations are identified by Moogsoft.
With Moogsoft, operators gain a clear, concise, and actionable workload, tailored to each user so that Incidents are resolved as quickly as possible. The result is a massive reduction in the Mean Time to Detect (MTTD), Mean Time to Resolve (MTTR), and overall business disruption.
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.