I was investigating a problem with a SQL server having 100% CPU spikes. I found Sysinternals Process Monitor invaluable in helping me see what was going on. A large number of scripts were running at the same time, and because it was a SQL cluster with multiple instances (this customer was doing a SQL Server consolidation) the problem was exacerbated: the scripts ran for each instance at the same time. This has been confirmed by Microsoft as being a problem with the SQL MP (v6.0.6460.0) when running on a cluster with multiple instances all running off one active node.
To improve the situation I created a group containing the cluster members, went through all the discoveries that had generic targets (like Windows Computer or Windows Server) and put an override on each to disable it for that group. This cluster is only going to be used for SQL Server, so there is no need to discover anything else. The overrides that really helped were the ones eliminating the discoveries of SRS, Integration Services and Analysis Services, as those were being run for each instance. The active node is down to running at about 15% CPU, which the SQL DBA thinks is dreadful, but at least there are no 100% spikes now.
In my investigation I found some strange things. The IBM Director Agent script was set to run every 20 seconds, targeted to Windows Computer. I don’t even run discoveries that fast on a demo system! That was definitely changed.
Here is a list of more reasonable discovery intervals, with their values in seconds, that you can use for overrides.
1 hour 3600 seconds
2 hours 7200 seconds
4 hours 14400 seconds
8 hours 28800 seconds
12 hours 43200 seconds
24 hours 86400 seconds
1 week 604800 seconds
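If you need an interval that is not in the list, the arithmetic is easy to script. The following is a minimal Python sketch, not anything that ships with Operations Manager; the function names are my own invention for illustration. It converts an interval to seconds and shows how many times a day a discovery would run at a given interval.

```python
# Hypothetical helpers for working out discovery override values.
# Nothing here talks to Operations Manager; it is plain arithmetic.

SECONDS_PER_UNIT = {"hour": 3600, "day": 86400, "week": 604800}


def interval_seconds(count, unit):
    """Return the override value in seconds for e.g. (12, 'hour')."""
    return count * SECONDS_PER_UNIT[unit]


def runs_per_day(interval):
    """Return how many times a discovery runs per day at this interval."""
    return SECONDS_PER_UNIT["day"] / interval


if __name__ == "__main__":
    for count, unit in [(1, "hour"), (4, "hour"), (12, "hour"), (1, "day"), (1, "week")]:
        secs = interval_seconds(count, unit)
        print(f"{count} {unit}: {secs} seconds, {runs_per_day(secs):g} runs per day")
    # A 20 second interval means 4320 runs per day on every targeted computer.
    print(f"20 seconds: {runs_per_day(20):g} runs per day")
```

Seeing 4320 runs a day next to 1 run a day makes it obvious why a 20 second discovery is worth an override.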
I have always thought that one of the advantages of the application since MOM 2000 has been its automatic discovery and downloading of MP rules. This meant that if someone installed IIS on a server and forgot to tell the monitoring team it would not matter, as the discovery would ensure that IIS was found and the rules downloaded. With 2007 using Cscript there seems to be more chance of hitting 100% CPU, which was pretty much unheard of in 2000 and 2005. I have seen threads on forums about this for the AD and DNS MPs, and now for SQL on clusters with multiple instances.
In contrast, I have been looking at the beta of the new Exchange 2007 MP, which has been written for 2007 rather than converted from 2005. One of the things that struck me immediately when reading the MP Guide was that the majority of discoveries are switched off by default, and when they are enabled the default interval is a rather more sensible 24 hours. After all, how often do you change server roles in a production environment? This is a philosophical change, as in the past all discoveries were targeted to all servers. The only discovery in this MP that does this is a lightweight discovery that just checks registry keys. Once that has been done, the other discoveries (when switched on) are targeted at just those servers. That type of behaviour is seen in a number of MPs, where a general discovery is run against Windows Computer but the other, specific discoveries are targeted at the class that is discovered. Obviously this puts less load on servers that are not running that application – especially if the discovery uses big scripts and/or WMI.
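To make the difference concrete, here is a toy Python model of the two approaches. It is only an illustration: the server names and the registry-key flag are made up, and it does not reflect how any real MP implements its discoveries. It simply counts how many servers end up running the expensive discovery.

```python
# Hypothetical inventory: server name -> whether the application's registry key exists.
servers = {
    "web01": False,
    "sql01": False,
    "exch01": True,
    "exch02": True,
    "file01": False,
}


def seed_discovery(inventory):
    """Cheap check run everywhere; stands in for the lightweight registry-key test."""
    return [name for name, has_key in inventory.items() if has_key]


def heavy_runs_untargeted(inventory):
    """Old approach: the expensive discovery runs on every server."""
    return len(inventory)


def heavy_runs_targeted(inventory):
    """Newer approach: the expensive discovery runs only where the seed found the application."""
    return len(seed_discovery(inventory))


if __name__ == "__main__":
    print(heavy_runs_untargeted(servers))  # prints 5
    print(heavy_runs_targeted(servers))    # prints 2
```

In this made-up example three of the five servers never run the big script at all; they only pay for the cheap registry check.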
I like this idea. I have suggested in the past that the Exchange MP be split into two, with the basics (event monitoring) in one and the advanced stuff that needs configuration (synthetic transactions) in a second. While this is not how the Exchange MP has been done, it is split by role, so you can install just the Mailbox, CAS or Hub MPs. This will make it easier to tune, as you can put in one bit at a time.
I would recommend that you have a look at your discoveries and how often they are running, and ask yourself what frequency is good for your environment. I would suggest a long interval like a day or even a week for most, as you can always create an override with a shorter interval if you need to speed things up temporarily.
I was hoping to add the almost obligatory one-line PowerShell script to show you how to get that information, but although there is a get-discovery command its output does not include the actual frequency of the discovery. You can always use the excellent MP Viewer (from Boris Yanushpolsky). It can only look at one MP at a time, but it can examine an MP before you import it. It has a node for discoveries and will tell you the target, whether the discovery is enabled or not, and the all-important frequency (in seconds).