I have been seeing a number of servers greyed out, with the alert “Health Service Unloaded System Rule(s)” showing that those agents have unloaded rules. The alert knowledge said to initiate a repair, but that did not work. The alert auto-resolves, but then a new alert is created. On the servers themselves the health service was running, but there were hundreds of event ID 1102 entries in the Operations Manager event log. Basically all rules and monitors had unloaded.
A quick search for the alert “Health Service Unloaded System Rule(s)” shows that the problem can have a time zone element, as mentioned here: http://myitforum.com/cs2/blogs/momlist/archive/2009/01/08/msmom-issue-with-health-service-unloaded-system-rules-alert-jahaig.aspx but that was not relevant to this problem.
I found a different problem. Something in the environment is creating a file called program in the root of c:\. This file is obviously being confused with “C:\Program Files”, and the agent just unloads all its rules and monitors. When you log into the server you get an error message about the file and you are asked to rename it.
When you rename it and restart the health service, everything is fine. If you restart the Health Service without renaming the file, you get 1102 errors for every rule and monitor again, but luckily only one alert in the console.
On the servers that I examined, the creation of the file coincides with a reboot. I also noticed a DCOM error 10000 in the system log at the same time, where OpsMgr cannot start because “c:\program files\system center operations manager 2007\monitoringhost.exe –embedding is not a valid win32 application.” So it looks like OpsMgr is not loading, which is why the rules and monitors cannot load.
Apparently it’s a Windows bug that shows up when a call to “%SystemDrive%\program files\anything” is made without the quotes: Windows tries to run c:\program first. The person who wrote about it said that he “wrote a batch file that just deletes it and set Win.ini to activate the batch on boot. You could do the same with a login script or whatever. Problem solved.” There are plenty of posts about it going back to the Windows 2000 days.
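To make the unquoted-path problem concrete, here is a small Python sketch of how I understand the lookup to work: the command line is cut at each space in turn, and each prefix is tried as the executable. This is a simplified model only (the real Windows lookup has more rules, such as extension handling), and the function name candidate_executables is made up for illustration.

```python
def candidate_executables(command_line):
    """List the prefixes of an unquoted command line, shortest first.

    Simplified model: each space is treated as a possible end of the
    executable path, so every space-delimited prefix is a candidate.
    """
    tokens = command_line.split(" ")
    return [" ".join(tokens[:i]) for i in range(1, len(tokens) + 1)]


# The command line taken from the DCOM error in the system log.
cmd = (r"c:\program files\system center operations manager 2007"
       r"\monitoringhost.exe -embedding")

for path in candidate_executables(cmd):
    print(path)
```

The first candidate printed is c:\program, which is why a stray file with that name gets in the way before the real monitoringhost.exe path is ever reached. Quoting the path in the call avoids the ambiguity.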
In this environment this happens after a server is rebooted. So now I need to try and find out what script is run when a server starts up. Or perhaps create a diagnostic that renames the file and restarts the OpsMgr service.
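As a starting point for that diagnostic, here is a rough Python sketch that renames the stray file and then restarts the agent. It is untested against a real agent and rests on a few assumptions: the service name HealthService, the ".stray-" suffix for the renamed file, and both function names are my own choices for illustration. In OpsMgr itself this would more likely end up as a script-based recovery task.

```python
import subprocess
import time
from pathlib import Path

# Assumption: the OpsMgr 2007 agent service is registered as "HealthService".
SERVICE_NAME = "HealthService"


def quarantine_stray_file(root, name="Program"):
    """Rename a stray file called `name` in `root`; return the new path.

    Returns None when nothing needs doing. Only a plain file is touched,
    never a directory such as "Program Files".
    """
    stray = Path(root) / name
    if not stray.is_file():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = stray.with_name(f"{name}.stray-{stamp}")
    stray.rename(target)
    return target


def restart_service(service=SERVICE_NAME):
    """Stop and start a Windows service with the built-in net command."""
    # The stop may fail if the service is already stopped, so do not check it.
    subprocess.run(["net", "stop", service], check=False)
    subprocess.run(["net", "start", service], check=True)
```

On an affected server the two would be combined: call quarantine_stray_file("C:\\") and, only if it returns a path, call restart_service(). Renaming rather than deleting keeps the file around, which may help when working out what is creating it at boot.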