Wednesday, March 14, 2012

Using VMware vCenter Operations Manager

As part of my day-to-day routine this morning, I ran into a quick use case that offers a perfect introduction to VMware vCenter Operations Manager (vCOPS) and what it can do for your environment: detecting issues before they become real problems and finding the root causes of performance problems. If you have never seen vCOPS before, here is a funny video that pretty much explains what the product is all about:

This is a real-life use case, and it demonstrates the value of real-time, intelligent monitoring of dynamic environments.

As you can see from the screenshot below, my vCOPS instance is monitoring 1,700+ VMs. Some are in a production environment and some are in a lab environment. The really important part here is that it took 3 seconds to recognize that one of those 1,759 VMs had an issue... just 3 seconds...

OK, so obviously red is not cool, so let's click on it and see what information we are presented with...

A single click shows me exactly where the VM is located and also shows that this issue is only affecting a single VM:

I click on it again and get all the information that matters to me: what is wrong, when it went wrong, and what it is affecting. In this case I see that 85 anomalies were detected and that the biggest indicators that something is wrong are elevated CPU usage and memory usage. It also tells me that this machine has been working fine in the past and that this is a new occurrence.

OK, that's all nice and everything, but what is vCOPS actually looking at? Let's click on the orange Anomalies badge and see what comes up:

As you can see it has symptoms that it is alerting on and you can click on individual symptoms to get more details.

Interested in more details? How about choosing the metrics you want and getting them on a timeline? Sure! Just click on the All Metrics tab and you are presented with a list of the metrics that are alerting. Select the ones you want and you get a pretty sweet datasheet like the one below:

So there you have it: how one piece of software can tell you what is wrong, what is affected, and give you an idea of what needs to be done to fix it, all in a real-time, efficient and intelligent manner. The entire exercise took about 5 minutes to do a complete health check on 1,700+ VMs and figure out what was wrong with the one that I have covered here. If I do the math right, spotting the one problem VM among 1,759 in 3 seconds means I did a complete health check on my environment at a rate of about 586 VMs a second (totally ignoring the hosts and storage, which were also checked)... and within a minute I knew what was wrong with the VM having an issue... now that is pretty awesome!
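
For anyone who wants to check the math, the 586 VMs a second figure lines up with the numbers earlier in the post: 1,759 VMs looked at in roughly 3 seconds. Here is a minimal Python sketch of that arithmetic. The function name is just something I made up for illustration, and the inputs are the approximate figures quoted above, not anything pulled out of vCOPS itself.

```python
def vms_per_second(vm_count: int, seconds: float) -> float:
    """Return how many VMs were covered per second of checking."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return vm_count / seconds


# Figures quoted in this post: 1,759 VMs, about 3 seconds to spot the red one.
rate = vms_per_second(1759, 3)
print(f"{rate:.0f} VMs per second")
```

Obviously this is back-of-the-napkin stuff, since the 3 seconds is my own eyeball estimate rather than a measured value.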