“There are three kinds of lies: lies, damned lies, and statistics.”

A tale of caution.

I was called back to a customer site where I had installed MOM 2005. The problem was that there were “too many alerts”, but no-one could say how many alerts were too many. A figure was being bandied about that MOM was generating 1,000 service desk tickets each day, and senior people were getting annoyed. I was surprised at the figure, as I had done quite a stringent tuning job when I put MOM in, so I was wondering what had changed.

I opened the MOM console and started creating views to show me what was going on. After a while I concluded that MOM was fine and only one or two alerts per hour were being generated. They did say that things had settled down a bit as they had fixed some infrastructure issues. That was the first case: rather than saying that they were getting alerts that pointed them to problems which they then fixed, and that MOM was therefore doing its job, they were just complaining about how noisy it was. Well, it will be noisy if there are lots of problems!

I looked at which rules had changed (searching for rules by date last modified) and found very few changes. There were a couple of noisy rules for Project Server, and some disc space rules had been copied without the originals being disabled. Some fine tuning was needed, but there did not seem to be that many alerts; most of the ones in the console looked like genuine problems to fix, and as they were repeats they were not generating any new tickets.

However, that was not the end of it, so I did a bit of digging. This MOM system has a simple script that takes alerts of warning severity and above and passes them as events onto the two management servers, where an NSM system picks them up and transfers them to the service desk. I moved on to analysing the number of these events being created (the event log filter is very handy for this, as I can put in the event number and a date and it tells me how many events match). The figure matched up with what I was seeing in the console. Puzzling.
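As an aside, that count can be scripted rather than done through the event log filter. The snippet below is only a minimal sketch: it assumes pywin32 is installed on a machine that can read the management server's event log, and the server name, log name and event ID are placeholders (the real ID depends on how the forwarding script writes its events).

    # Count events with a specific event ID logged on a given day.
    # SERVER, LOG and EVENT_ID are hypothetical placeholder values.
    import datetime
    import win32evtlog

    SERVER = "MOMSERVER01"   # placeholder management server name
    LOG = "Application"      # log the forwarding script writes to
    EVENT_ID = 9999          # placeholder event ID for forwarded alerts
    DAY = datetime.date(2006, 3, 1)

    def count_events(server, log_name, event_id, day):
        handle = win32evtlog.OpenEventLog(server, log_name)
        flags = (win32evtlog.EVENTLOG_BACKWARDS_READ |
                 win32evtlog.EVENTLOG_SEQUENTIAL_READ)
        count = 0
        try:
            while True:
                records = win32evtlog.ReadEventLog(handle, flags, 0)
                if not records:
                    break
                for record in records:
                    t = record.TimeGenerated
                    record_day = datetime.date(t.year, t.month, t.day)
                    if record_day < day:
                        # Reading newest-first, so anything older means we are done.
                        return count
                    # The EventID field carries severity bits in the high word.
                    if record_day == day and (record.EventID & 0xFFFF) == event_id:
                        count += 1
        finally:
            win32evtlog.CloseEventLog(handle)
        return count

    print(count_events(SERVER, LOG, EVENT_ID, DAY))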

I was given a few of the alert analysis reports that had previously been run, and they were indeed showing over 1,000 alerts per day. I picked a day, and the alert analysis report for it showed a large number of alerts, but a quick look at them revealed a particular pattern. I created a view that showed all alerts for that day (including resolved ones) and my suspicions were confirmed. Although it looked like only 45 alerts had been created that day (according to the specific events in the event log), there were 710 raw alerts. Of those, 47 were informational, so they don't count as they do not create tickets. There were, however, 594 alerts from one server that was in maintenance mode. There was the culprit. As each alert comes in it gets auto-resolved because the server is in maintenance mode, but the MOM server keeps the alert in the database, so when the alert analysis report is run all of these alerts show up. And because each alert is auto-resolved, none of them gets the chance to repeat, so it looks like there were a lot of individual alerts. Take away the 24 synthetic events that tested the link once per hour and the numbers added up. Well, one problem solved.
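To spell out the arithmetic: 710 raw alerts, less 47 informational alerts, less 594 auto-resolved maintenance-mode alerts, less the 24 hourly link-test events, leaves 45, which matches the count of ticket-generating events I had found in the event log.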

This shows that you cannot take these reports at face value; you have to understand what is going on and where the numbers are coming from. Unfortunately the person who ran the report did not, and the figure of over 1,000 alerts a day started being bandied about. Nobody questioned it because it was there in black and white in a report (so it must be true), and there was already a perception in part of the organisation that MOM was noisy and that this was the problem. In fact one of the team leaders had said as much to me before I had even installed MOM! Talk about prejudging.

And it does not end there. The Unicenter NSM system that was the conduit to the service desk software had been doing its own thing as well. Once I received a report from NSM I could see a whole load of spurious alerts that had not been generated by MOM; in fact they were tripling the number of tickets being created. Apparently the NSM team had decided to “helpfully” add some monitoring of their own without telling anyone, and as no-one knew this was happening those alerts were never tuned. One of my old bosses told me years ago that during a project you can never over-communicate, and he has been proved right many times. Once a ticket was created it looked like it came from MOM, and because of the perception that MOM was noisy it was accepted. We got them to stop, and now the number of tickets per day looks good. Most worrying of all, no-one had thought to compare the tickets being created with the MOM console and the alerts actually being generated. I was not given access to the service desk console as the cost per console was too high!

Lessons learned

  1. Don’t believe everything you read just because it is in a report – dig in to find where the numbers come from
  2. Get the story from both sides
  3. Communicate
  4. Promote MOM internally as a fixer of problems
  5. Quash stories about MOM being noisy

Without a MOM champion, MOM was getting a bad reputation. Numbers were being taken at face value without any detailed analysis of where they had come from. It was only because I was adamant that MOM was working well that the investigation expanded. A sad tale of people too quick to blame the tool and too ready to rely on numbers they did not understand. MOM is a great tool in the right hands. Don't diss it.
