MOM 2005 and ITIL – Part 1
ITIL (IT Infrastructure Library, http://www.itil.co.uk/) was developed by the UK government as a set of best practices for IT service management. It is now recognised internationally and used in both the public and private sectors. MOF (Microsoft Operations Framework, http://www.microsoft.com/technet/itsolutions/cits/mo/mof/default.mspx) is based on ITIL, with a more specific focus and recommendations for Microsoft products. As demand grows for IT organisations to be cost-efficient and effective, the use of ITIL/MOF is growing too.
MOM can be used to help with ITIL processes. No piece of software is ITIL certified (only people can be ITIL certified), but there is no doubt that MOM can support those processes. ITIL can be a big, and sometimes daunting, project for an organisation to embark on. Where does it start? I have also seen organisations put in MOM and then ask what they should do with it. Using MOM and ITIL together helps on both counts: MOM gives an ITIL project a defined place to start, and ITIL focuses the reasons why MOM is being used and how it should be configured.
The four main areas where MOM can help are:
- Incident Management
- Problem Management
- Service Level Management
- Capacity Management
The goal of Incident Management is to restore normal service operation as quickly as possible with minimum disruption to the business, in order to ensure that the best achievable levels of availability and service are maintained.
The Incident Management life cycle is:
- Incident detecting and recording.
- Initial classification & support.
- Investigation & diagnosis.
- Resolution & recovery.
- Incident Closure.
Incident Detecting and Recording
This is pretty much what MOM does out of the box. For the majority of alerts, an alert equates to an incident; informational alerts obviously do not. Focusing MOM on making alerts equal incidents gives you a target for alert tuning. A MOM alert carries the date, time, server and a severity level (warning, error, critical error, etc.), along with Product Knowledge to help in the investigation and diagnosis, so incident detecting and recording is done automatically. MOM is also proactive: it looks for events that may affect the IT systems and spots issues, such as a server running low on disk space, before they become incidents.
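The idea that most alerts map to incidents while informational alerts do not can be sketched in a few lines of Python. This is only an illustration of the mapping, not MOM's own data model: the `Alert` and `Incident` classes, their field names and the severity strings are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed severity names, loosely following the levels mentioned above.
# Anything not in this set (e.g. "information") is not treated as an incident.
INCIDENT_SEVERITIES = {"warning", "error", "critical error"}


@dataclass
class Alert:
    """Hypothetical alert record: when, where, how serious, and what."""
    raised_at: datetime
    server: str
    severity: str
    description: str


@dataclass
class Incident:
    """Hypothetical incident record created from an actionable alert."""
    opened_at: datetime
    server: str
    severity: str
    summary: str


def alert_to_incident(alert: Alert) -> Optional[Incident]:
    """Return an Incident for actionable alerts, or None for informational ones."""
    if alert.severity.lower() not in INCIDENT_SEVERITIES:
        return None
    return Incident(alert.raised_at, alert.server, alert.severity, alert.description)
```

If tuning is going well, most alerts that reach the console should pass through a function like `alert_to_incident` and come out as incidents; a high proportion returning `None` suggests rules that still need tuning.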
Initial Classification & Support
What you need to do for initial classification & support is:
- Classify incidents
- Match against known errors
- Assign impact and urgency
- Provide initial support
- Close or route to a specialist support group
You can assign a Resolution State and an Owner to an alert. If you use a helpdesk or service desk package, you can also forward the alert to it, manually or automatically, using the MOM Connector Framework (MCF) or a third-party tool. This creates the incident in the service desk software and keeps changes in sync between the two packages. Some organisations prefer this because all incidents are recorded in one package for analysis, regardless of whether the incident comes from a MOM alert or a user call to the helpdesk.
It is also possible to modify a rule so that, where it would normally create a warning alert, it creates a critical alert instead, if you have decided that this alert is important to the organisation. The Product Knowledge tab is filled in with details of the problem, its probable cause and potential solutions. Alongside it is the Company Knowledge tab, where the organisation can add further information, workarounds and company-specific knowledge, with web links, to aid resolution when this alert occurs. The Product Knowledge or Company Knowledge may be enough to provide the initial support, and the operations team may be able to close the incident on that basis. Otherwise, further investigation and diagnosis is needed.
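As a rough model of these two customisations, the Python sketch below keeps a table of severity overrides and a table of company notes, both keyed by rule name. In MOM itself these changes are made on the rule and on its Company Knowledge tab, not in code; the rule name, the note text and both function names here are made up for the example.

```python
from typing import Dict, List

# Hypothetical overrides: rule name -> severity the organisation wants instead.
SEVERITY_OVERRIDES: Dict[str, str] = {
    "Backup job failed": "critical error",
}

# Hypothetical company knowledge: rule name -> notes added by the organisation.
COMPANY_KNOWLEDGE: Dict[str, List[str]] = {
    "Backup job failed": ["Check the tape library before rerunning the job."],
}


def effective_severity(rule_name: str, default_severity: str) -> str:
    """Return the overridden severity if one is defined, else the default."""
    return SEVERITY_OVERRIDES.get(rule_name, default_severity)


def company_notes(rule_name: str) -> List[str]:
    """Return the organisation's own notes for a rule (empty if none exist)."""
    return COMPANY_KNOWLEDGE.get(rule_name, [])
```

Keeping overrides and notes keyed by the same rule name mirrors the way both belong to the rule that raises the alert.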
Investigation & Diagnosis
You can use the Resolution State to escalate to another team, and the states can be customised to suit the organisation. A process should be put in place to hand over incidents from one team to another, rather than relying on someone “looking at the console”. Keeping an eye on the console is a front line task. Second and third line support are unlikely to monitor a console, as they should be working on projects and proactive activities, and responding to front line support when an incident needs to be escalated to them.
The tasks in the Task Pane help with investigation and diagnosis as well as recovery. They range from simple tasks, like pinging a server, to more complex ones, like running a script against a server to retrieve a specific piece of information.
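One way to picture such a handover mechanism is a lookup from Resolution State to the team that must be told about it. The Python sketch below is only a model of the process: the state names, the mailbox addresses and the `handover_notice` function are placeholders, and how the message is actually delivered (e-mail, pager, service desk ticket) is left out.

```python
from typing import Dict, Optional

# Placeholder mapping: resolution state -> mailbox of the team that owns it.
# States without an entry stay with front line support and need no handover.
ESCALATION_CONTACTS: Dict[str, str] = {
    "Level 2": "second-line@example.com",
    "Level 3": "third-line@example.com",
}


def handover_notice(alert_name: str, server: str, new_state: str) -> Optional[str]:
    """Build a handover message when an alert moves to an escalated state.

    Returns None when the new state does not belong to another team.
    """
    contact = ESCALATION_CONTACTS.get(new_state)
    if contact is None:
        return None
    return f"To {contact}: '{alert_name}' on {server} escalated to {new_state}"
```

The point of the design is that the receiving team is told explicitly about every escalation, so nobody has to watch the console to find out.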
The Product Knowledge tab and Company Knowledge tab can assist in this phase by providing information and known fixes.
Resolution & Recovery
MOM can be set up to automatically alert a specialist group via e-mail for certain incidents, or to automatically run a script or command-line task to fix the problem if it is a known issue with a known fix. Automating these known issues frees up the operations staff to focus on incidents that need manual intervention. Different groups can also have different views that show only the alerts matching their criteria. For example, one group may just want to see all alerts for all AD servers, another may just want to see all alerts for Exchange servers in London, while another may just want to see all critical errors that have been in that state for more than 30 minutes.
Once an incident is resolved, any additional knowledge or fixes should be added to the Company Knowledge to aid the resolution and recovery of future occurrences of the incident.
The alert can then be resolved as well, which removes it from the main Alert View.
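The three example views can be thought of as filters over the same list of alerts. The Python sketch below models them that way. It is not how MOM stores or evaluates views: the `Alert` fields (`role`, `site`, `state_since`) and the severity string are assumptions chosen to match the examples.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class Alert:
    """Hypothetical alert record with just the fields the views need."""
    server: str
    role: str              # e.g. "AD" or "Exchange"
    site: str              # e.g. "London"
    severity: str          # e.g. "warning", "critical error"
    state_since: datetime  # when the alert entered its current state


# A view is a predicate over an alert and the current time.
View = Callable[[Alert, datetime], bool]


def ad_view(alert: Alert, now: datetime) -> bool:
    """All alerts for AD servers."""
    return alert.role == "AD"


def exchange_london_view(alert: Alert, now: datetime) -> bool:
    """All alerts for Exchange servers in London."""
    return alert.role == "Exchange" and alert.site == "London"


def stale_critical_view(alert: Alert, now: datetime) -> bool:
    """Critical errors that have been in that state for more than 30 minutes."""
    return (alert.severity == "critical error"
            and now - alert.state_since > timedelta(minutes=30))


def apply_view(view: View, alerts: List[Alert], now: datetime) -> List[Alert]:
    """Return only the alerts that satisfy the given view."""
    return [a for a in alerts if view(a, now)]
```

Each team then calls `apply_view` with its own predicate against the same alert list, so one alert can appear in several views at once.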
To be continued.