Fault Tolerant MOM 2005

Reading an article about the new dual core PCs with dual graphics cards brought to mind a question that frequently comes up: should MOM be made fault tolerant? Some organisations say no, because MOM is not a business application and so does not matter. Others say yes, because if MOM goes down their view of the business applications goes down with it and they are blind to any problems. The answer breaks down into two parts: what is your attitude to risk, and what is your budget? Fault tolerance means an extra MOM server, an extra SQL Server and the software licences to go with them, and moving to a SQL cluster requires the Enterprise Edition of both Windows and SQL. If you do not have the budget for that then it is a no brainer. No budget, no fault tolerance.

If you do have the budget then it is down to risk. If you rely heavily on MOM to notify you of problems with Exchange, AD and SQL, have created custom management packs for your in house applications and are using MOM proactively, then your exposure to any downtime on the MOM server is high. If your use of MOM is light and you are not dependent on it (yet; this is usual in the early days of an installation, before people see what MOM can really do) you can use a non fault tolerant system and rely on the improved uptime of the modern versions of SQL and Windows Server.

For fewer than 4,000 agents (the limit of a Management Group) the options are as follows.

Non fault tolerant

  1. All on one box (only good up to 200 agents roughly)
  2. One MOM management server and one SQL Server (up to 2,000 agents)
  3. Two MOM management servers and one SQL Server (up to 4,000 agents)

Although the last option appears to be fault tolerant because there are two MOM servers, it is not: each MOM server is only supported up to 2,000 agents, so if one fails the survivor is left with all 4,000. Does this mean that it won’t work? With modern hardware it may well cope, as the performance testing was done with relatively old hardware, but if you phone Microsoft for support they will tell you that you are running an unsupported installation.

Fault tolerant

  1. Two MOM management servers and a SQL Cluster (up to 2,000 agents – if one MOM server dies then the whole load goes to the other and that can only cope with 2,000 agents)
  2. Three MOM management servers and a SQL Cluster (up to 4,000 agents – if one MOM server dies then the 4,000 agents split across 2 MOM servers which can cope with 2,000 agents each)
  3. Use DTS. This avoids having to use a SQL cluster and is explained comprehensively in the Service Continuity Solution Accelerator. But this is more a hot standby solution than true fault tolerance.
  4. Over the top fault tolerant. I came up with this design for a bank that was very keen on fault tolerance! Have two management groups (either as in bullet 1 for up to 2,000 agents or bullet 2 for up to 4,000 agents) in two separate data centres and dual home the agents so that each management group is fault tolerant and if a whole management group (data centre) goes down then the other one carries on working. An analogy would be RAID 10. Very fault tolerant – very expensive!
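The sizing rule behind the options above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, using the two limits quoted in this post (2,000 agents per management server, 4,000 per management group). The function names are made up for the example, and it ignores the all-on-one-box option and the SQL side entirely.

```python
# Limits as quoted in the post; everything else here is illustrative.
AGENTS_PER_SERVER = 2000   # supported agents per MOM management server
AGENTS_PER_GROUP = 4000    # supported agents per management group


def supported_agents(servers: int, failed: int = 0) -> int:
    """Agents the group can carry within supported limits with `failed` servers down."""
    if servers < 1 or failed < 0:
        raise ValueError("need at least one server and a non-negative failure count")
    surviving = max(servers - failed, 0)
    return min(surviving * AGENTS_PER_SERVER, AGENTS_PER_GROUP)


def tolerates_one_failure(servers: int, agents: int) -> bool:
    """True if the remaining servers can still carry every agent after losing one."""
    return agents <= supported_agents(servers, failed=1)


def servers_needed(agents: int, fault_tolerant: bool = True) -> int:
    """Management servers required for `agents` in a single management group."""
    if not 0 < agents <= AGENTS_PER_GROUP:
        raise ValueError("a single management group covers 1 to 4,000 agents")
    base = -(-agents // AGENTS_PER_SERVER)  # ceiling division
    return base + 1 if fault_tolerant else base


if __name__ == "__main__":
    for servers, agents in [(2, 2000), (2, 4000), (3, 4000)]:
        verdict = "survives" if tolerates_one_failure(servers, agents) else "does not survive"
        print(f"{servers} servers, {agents} agents: {verdict} one server failure")
```

Two servers with 4,000 agents fail the check, which is why that layout sits in the non fault tolerant list even though it has two MOM servers.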

Another option is to use SQL Log Shipping, but the Service Continuity Solution Accelerator excluded it because:

  • A single failure in the log-shipping mechanism will force a full re-synchronization of the primary and alternate MOM database, which is both labor- and time-intensive.
  •  Log shipping will transport all data including statistical information, which increases the WAN bandwidth requirements.

For more than 4,000 agents you are into multiple management groups. The number of companies in that position is small, and they can afford to bring in consultants to help them.

I have started to look at SQL 2005. There are more options but I am still getting to grips with them. SQL 2005 brings in database mirroring, which allows almost the same level of redundancy as clustering without having to have a cluster. See http://www.microsoft.com/sql/prodinfo/overview/whats-new-in-sqlserver2005.mspx for more information. Clustering supports up to 8 nodes (Enterprise Edition) so you can have seven active and one passive. The table at the bottom of that page shows which editions support which features, with prices. A cut down version of the table highlighting the fault tolerant features:

| Edition | Pricing | Key Features |
|---|---|---|
| Express | Free | Replication and SSB Client |
| Workgroup | $3,900 per processor; $739 (server + 5 users) | Limited Replication Publishing; Back-up Log Shipping |
| Standard | $6,000 per processor; $2,799 (server + 10 users) | Database Mirroring; Full Replication and SSB Publishing; Clustering (supports two nodes) |
| Enterprise | $25,000 per processor; $13,500 (server + 25 users) | Advanced database mirroring, complete online and parallel operations, and database snapshot; Clustering up to the limit of the operating system |

So by using Database Mirroring you can get most of the advantages of clustering. But why not use a 2 node cluster, as it is also included in the Standard Edition? Because a cluster needs Windows 2003 Enterprise Edition and supported hardware, whereas Database Mirroring works on standard server hardware and requires no special storage or controllers. There is a good media show of the two at http://www.microsoft.com/sql/prodinfo/demo/wss-refarchdesign-demo.mspx.

And as always make sure you have a good backup!

3 Comments

  1. We chose database mirroring to make the OnePoint DB “redundant” for a customer with 2 geographically dispersed datacenters. The main reason for that, rather than implementing a SQL cluster, was that we couldn’t get everything configured in the networking area (you need a single subnet for the two nodes and that wasn’t possible in their datacenters, so each node would have been placed in a different IP subnet). It’s great that MOM 2005 SP1 supports SQL 2005 so you can leverage database mirroring to increase the availability of the MOM environment.

  2. Thanks Maarten. Also nice to get confirmation from people who have tried it and got it working.

    Ian

  3. Nick Madge

    I guess the one thing to be aware of with MOM 2005 & SQL 2005 is the ongoing issue of the Exchange SLA Scorecard still not working correctly in this configuration.
    However I would imagine this issue disappears with OM 2007?