SCOM Performance and Scalability
The newsgroups carry an announcement that the “Operations Manager 2007 Performance and Scalability White Paper” has been released, but only on Connect. I have asked why it is not on the main web sites, as OM is an RTM product and no longer in beta. Also released on the Connect site are the updated SDK and samples for RTM.
Having talked to a couple of PMs at MMS I was not expecting much from this paper, so I am not disappointed. It is 20 pages long – neither big nor exhaustive.
The crux of what they said was that there will not be any definitive sizing as there was for MOM 2000 and 2005. Instead there will be guidelines based on what they tested, which you can use as a starting point, and they recommend using System Center Capacity Planner (SCCP) to model your system. SCCP is still in beta at the moment, so it is not complete, and I have found issues with using it – but that is a separate post.
I can see their point, as SCOM is really three products rolled into one and each has different requirements. See my previous post on SCOM 2007 Architecture – https://ianblythmanagement.wordpress.com/2006/08/01/scom-2007-architecture/ – which needs updating. They also want to be more flexible. After all, 10,000 desktops generating 1 alert per day produce the same load as 100 servers generating 100 alerts per day, so you can create many different variations. I started work on a deployment options post, but it is turning into a major piece of work.
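The desktop/server equivalence above is simple arithmetic, but it is worth making explicit because it is total alert volume, not agent count, that drives Management Group load. A quick sketch (the figures are illustrative examples from this post, not tested limits):

```python
def daily_alert_load(agent_count: int, alerts_per_agent_per_day: float) -> float:
    """Total alerts per day generated by a population of agents."""
    return agent_count * alerts_per_agent_per_day

# 10,000 desktops at 1 alert/day each...
desktops = daily_alert_load(10_000, 1)
# ...versus 100 servers at 100 alerts/day each
servers = daily_alert_load(100, 100)

print(desktops, servers)  # 10000 10000 -- identical load, very different topologies
```

Two very different deployments can therefore present the same load to the database tier, which is why fixed agent-count limits are a blunt instrument.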
I mentioned previously (https://ianblythmanagement.wordpress.com/2006/11/22/scom-sizing-%e2%80%93-an-easy-win/) that SCOM should be a lot more scalable than 2005, based on what I knew of how the testing was done – organisations simply did not go above the tested limits because of support requirements. But it looks like that is not so. The PM did say that the MPs are more complex, but I was expecting a better result. I also asked him about PSS support for designs that people come up with, but did not get an answer as they were Product Group people. He did vaguely mention SCCP, and the white paper says this will be the official tool, so it may be that to get PSS support you have to show them your SCCP design.
The paper does give a fascinating insight into the workings of some of the components, like the RMS. Basically it needs lots of memory, preferably 64-bit, and in large deployments no agents should report into it. From talks with the PM and Microsoft’s internal IT, it appears that Management Servers can be lighter: the testing PM had 3 coping with 5,000 agents, and MS IT is considering turning its Management Servers into virtual servers, which is always a sign that the load is light. The document mentions that 2007 Management Servers do not put all data into the disk cache as 2005 did, which lightens their load. The database is still the main bottleneck in 2007.
There are some obvious recommendations: the more agents and the more management packs, the bigger the load on the database server, and similarly for consoles. IO is the most critical part of the database server. You can add RAM to lessen the load on the IO, and again 64-bit is recommended so you can go above 4 GB.
The reporting database is different from 2005. In 2005 a DTS job did a transfer at 01:00 each night, but in 2007 the data is written in real time, so the reporting database needs to be sized similarly to the operational database server. As the data is summarised for reports, CPU and memory are also important, especially when running large reports that span a wide range of dates.
OM Guidelines Summary
| | MOM 2005 | SCOM 2007 |
| --- | --- | --- |
| Agents per Management Server | 2,000 | 2,000 |
| Agents per Management Group | 4,000 | 5,000 |
| Agents per Gateway Server | N/A | 200 |
| Consoles per Management Group | 15 | 50 |
| Size of Operational database | 30 GB | No official limit but keep small |
| Size of Reporting database | 1 TB | No official limit |
| Collective Client computers per Management Server | N/A | 2,500 |
Collective Client monitoring puts an agent on each computer, but alerting on the individual agent is disabled; the data is gathered and aggregated to report at a collective rather than an individual level.
Agentless Exception Monitoring Guidelines
| | SCOM 2007 |
| --- | --- |
| AEM computers per Management Server | 25,000 |
| AEM computers per Management Group | 100,000 |
But the paper shows a recommended hardware configuration for a Management Server that can deal with 100,000 AEM clients.
Audit Collection Guidelines
(not in the paper but from MMS presentation)
A single collector handles up to:
• Peak maximum of 20,000 events/sec (short bursts only – not sustainable)
• Continuous maximum of 2,500 events/sec
• Average bytes per event over the wire < 100 bytes
A single collector can support up to (this varies depending on factors like Audit Policy):
• 150 Domain Controllers OR
• 3,000 non-DC servers OR
• 15,000 workstations
NB – note the OR.
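The MMS figures above also give a feel for the network load at the collector. A rough back-of-the-envelope sketch, using the sustained maximum (2,500 events/sec) and the < 100 bytes/event upper bound – these are the presentation's figures, and the result is an estimate, not a measured value:

```python
# ACS collector figures from the MMS presentation
SUSTAINED_EVENTS_PER_SEC = 2_500   # continuous maximum per collector
BYTES_PER_EVENT = 100              # upper bound on average wire size per event

# Aggregate inbound traffic at the collector
bytes_per_sec = SUSTAINED_EVENTS_PER_SEC * BYTES_PER_EVENT
mbits_per_sec = bytes_per_sec * 8 / 1_000_000

print(bytes_per_sec)    # 250000 bytes/sec
print(mbits_per_sec)    # 2.0 Mbit/s
```

So even a fully loaded collector implies only around 2 Mbit/s of sustained inbound event traffic; the constraint is collector processing and database insert rate, not bandwidth.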
And if you throw disaster recovery options into this then it gets more complex.
The big question always asked about 2005 was how many agents a single box could handle. The paper shows hardware that can cope with up to 250 agents where the databases (OM and DW), RMS and reporting server are combined, but it does not mention ACS or AEM. As that is a dual-proc, 2 GB server, coping with more agents on a 4-proc, 8 GB, 64-bit system should be possible.
The bottom line is that Microsoft has produced a set of options that they have tested, but the onus is on each organisation to specify the hardware correctly, and Microsoft will recommend that organisations use System Center Capacity Planner.