ITIL v2 is the most widely adopted standard; Incident Management, Problem Management and Change Management are its most popular processes. ITIL v3 is the latest version of ITIL. It brings in the concept of a Service as the central idea of ITSM, with modules like Service Strategy, Service Design, Service Operation and Service Transition. Event Management is introduced as a separate process in ITIL v3, but you can also implement Event Management with v2 Incident and Problem Management.
The Service Operation module is responsible for managing all the technology (the applications and the infrastructure) to ensure that the service is operating at acceptable levels. It is the most important stage in terms of the business realizing value for all the money that it spent on the service. Event Management is just one of the processes in Service Operation, but it plays a central role in ensuring the operational health of IT services. There are two points I want to touch upon. First, the different types of events: Informational events and Exception events, like a CPU threshold violated on a server. Second, Event Management roles: the Service Desk (SD), IT Operations and Application Management teams perform the different event management activities. We will try to understand what event management is all about, but please remember: "I have a monitoring tool running" does not mean that you are doing Event Management.
Let's look at some of the most common challenges in Event Management. First, major incidents are raised by users instead of IT receiving proactive warning or critical events. Second, there are too many events; everyone suffers from this problem. Third, several event management tasks are manual and time-consuming, which is especially true for IT Operations people dealing with server backups, patch management etc. Let's focus on them one by one.
A good starting point would be the Incident Analysis Report. What % of incidents are being generated by end users?
Maybe we are just monitoring servers and networks. What about incidents caused by application failures? Or database crashes? Maybe your VMware is causing performance problems. Do you have enough monitoring capability to capture these events in future? Increase the monitoring scope to ensure that user-reported incidents are captured in future. This is done by the technical and application management teams in the Service Design phase.
Coming to the second challenge of "Too many events": we first need to get to a level of manageable events before we attempt to perform event management. This involves reducing the number of events using traditional event correlation techniques: reducing duplicate events from multiple monitoring systems; suppressing alerts triggered during scheduled downtime; device dependencies (the upstream router link is down, hence servers are unreachable); handling parent-child type failures, e.g. server down and web server down; and Root Cause Analysis (which component is causing a performance degradation of a business service). A recent trend in Event Management is to introduce automation after correlation, kind of like a second level of filtering. You can automate manual tasks, or automate troubleshooting steps to filter false alarms and identify the root cause.
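The first-level filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` fields, the maintenance set and the dependency map are hypothetical, and real monitoring tools expose this information in their own ways. Root Cause Analysis is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str  # monitoring tool that raised the event (hypothetical field)
    device: str  # affected device or component
    kind: str    # e.g. "down", "cpu_high"


def correlate(events, in_maintenance, depends_on):
    """Return the events that survive three traditional correlation filters.

    in_maintenance: set of devices currently in scheduled downtime.
    depends_on: maps a device to the upstream/parent device it relies on.
    """
    # 1. Duplicate elimination: keep one event per (device, kind),
    #    no matter how many monitoring systems reported it.
    unique = {}
    for ev in events:
        unique.setdefault((ev.device, ev.kind), ev)

    # 2. Scheduled downtime: drop alerts for devices under maintenance.
    active = [ev for ev in unique.values() if ev.device not in in_maintenance]

    # 3. Dependency / parent-child: drop a "down" event when the parent
    #    device is also reported down, since the parent is the likely cause.
    down = {ev.device for ev in active if ev.kind == "down"}
    return [
        ev for ev in active
        if not (ev.kind == "down" and depends_on.get(ev.device) in down)
    ]
```

With a router, a server behind it and a web server on that server all reported down, only the router event would be left for someone to act on.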
Sample scenario: a real-world use case to free up disk space on a server, either periodically or when a disk-space-low event is triggered.
Another sample scenario: when a device down event is triggered, quickly run multiple troubleshooting steps like connecting to a web app on the server, DNS lookup, traceroute etc. to ensure that only real incidents are reported.
OK, now we have this double filtration process for events with automation enabled, and everything is great. But how are the teams organized, and how does the information flow? Traditionally we have had a NOC with monitoring tools and alerts flying to network and server admins. Application teams usually have their own tools and their own notification systems. The Service Desk is end-user facing and deals with incidents. When there is a major incident, is the Service Desk able to quickly isolate events and find the right person? Most of the time in this setup, the Service Desk lacks visibility into what caused the disruption.
Recently one of our Product Managers visited a large telecom customer in Brazil. He was pretty impressed with the way they were using Service Desk. They were doing two things differently. First, in addition to incidents, the monitoring tools were actually logging Critical and Warning events to the Service Desk application. Second, they actually had people from different teams (network, servers and application management) sit together as shown in the picture. The first set of people were working on preventive tasks (warning events); the next, smaller set were proactively looking at critical events and resolving them. The final incident management team was really small. This reminded me of a new concept in ITIL v3 that is known as an Operations Bridge. But if you think about it, in addition to the physical location, the use of a centralized tool to capture all important events from different tools is critical in ensuring that you have event management under control.
Poll Question: Do you think using your Service Desk tool as a centralized place for all actionable events is a good idea? 1. Yes. I think it will help
ITIL v2, ITIL v3 and Event Management
[Image: ITIL v2 and ITIL v3. Image from http://iig.umit.at/]
- Event Management is new in ITIL v3 but can also be implemented with ITIL v2
Service Operation & Event Management
- The Service Operation module is responsible for the activities and processes required to deliver and manage services at agreed levels
- Business services actually deliver value to the business only in this stage
- Event Management plays a significant role in ensuring the operational health of IT services
Types of Events
- Informational events
- Exception events
- Incidents
Event Management Roles
- Service Desk
- IT Operations
- Application Management
[Diagram labels: Monitoring, Event Management]
Event Management challenges
- User-reported incidents: ensure that the right events are being generated by monitoring tools
- Too many events: implement strategies for event correlation
- Manual & time-consuming tasks: identify routine manual tasks that can be automated
Are the right events being generated?
- Why are 34% of incidents being reported by users?
- Why is monitoring not generating these events?
[Chart: % of incidents generated by users vs. % of incidents generated by monitoring tools]
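The split shown in an Incident Analysis Report is simple arithmetic over the incident log. Here is a minimal sketch in Python, assuming each incident record carries an origin label such as "user" or "monitoring"; those labels and the function name are hypothetical, not fields of any particular Service Desk tool.

```python
from collections import Counter


def incident_share_by_origin(origins):
    """Return the percentage of incidents per origin, rounded to one decimal."""
    counts = Counter(origins)
    total = sum(counts.values())
    if total == 0:
        return {}  # no incidents logged, nothing to report
    return {origin: round(100 * n / total, 1) for origin, n in counts.items()}
```

A high user share is the signal to go back and review the monitoring scope.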
Do we have enough monitoring?
Ensure that the right events are being generated
- Identify CIs that are not monitored
- Establish baselines & reconfigure thresholds
- Buy additional tools to add monitoring capability
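Identifying unmonitored CIs comes down to comparing two inventories. A small sketch, assuming you can export the CI names from your CMDB and the device names from your monitoring tools as plain lists (both exports and the function name are hypothetical):

```python
def unmonitored_cis(cmdb_cis, monitored_cis):
    """Return CIs present in the CMDB but absent from monitoring, sorted by name."""
    return sorted(set(cmdb_cis) - set(monitored_cis))
```

In practice names rarely match exactly across tools, so some normalization of hostnames would be needed before the comparison.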
"Manageable Events" before "Event Management"!
Event Correlation techniques
- Eliminate duplicate events
- Dependency correlation
- Root Cause Analysis
Automation
- Run Book Automation
[Image from http://ideachampions.com]
Event Management Automation
Sample Scenario: Automate manual task
- Weekly scheduled task
- Check for free disk space on a server
- If < 20 GB, free up space by deleting temp files
- If delete fails, log a ticket to SD
- If delete is successful, confirm free space again
- If > 20 GB, send a note to the server ops team
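The scenario above could be scripted roughly as follows. This is a sketch, not a finished run book: the four hooks passed to `weekly_disk_check` are hypothetical callables you would wire to your own cleanup script, Service Desk and notification channel. The slide does not say what happens if space is still low after a successful cleanup; the sketch assumes a ticket is logged in that case.

```python
import shutil

GB = 1024 ** 3
THRESHOLD_BYTES = 20 * GB  # the 20 GB limit from the scenario


def free_bytes(path="/"):
    """Free space on the filesystem holding `path`, in bytes."""
    return shutil.disk_usage(path).free


def weekly_disk_check(get_free, delete_temp_files, log_ticket, notify_ops):
    """Run the scenario once; a scheduler would call this weekly.

    get_free() returns free bytes, delete_temp_files() may raise OSError,
    log_ticket(msg) opens a Service Desk ticket, notify_ops(msg) informs
    the server ops team. All four are caller-supplied (hypothetical) hooks.
    """
    if get_free() >= THRESHOLD_BYTES:
        return "ok"  # enough space, nothing to do
    try:
        delete_temp_files()
    except OSError as exc:
        log_ticket(f"Temp file cleanup failed: {exc}")
        return "ticket"
    # Cleanup worked: confirm free space again.
    if get_free() >= THRESHOLD_BYTES:
        notify_ops("Disk space recovered by automated cleanup")
        return "recovered"
    # Assumption (not on the slide): still low after cleanup, so escalate.
    log_ticket("Disk space still low after temp file cleanup")
    return "ticket"
```

Passing the actions in as functions keeps the decision logic testable without touching a real disk; `free_bytes` can be supplied as `get_free` on a real server.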
Automation: Reducing Incidents
Sample Scenario: Filter false alarms
- Server down event from monitoring tool
- Ping server
- If ping fails, try connecting to a web app on the server; if that fails too, log an Incident to SD
- If ping succeeds, try other troubleshooting tasks like DNS lookup and traceroute to determine why ping failed in the first place
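The same decision flow, sketched in Python. The checks are passed in as functions because how you ping, probe a web app or run traceroute depends on your platform and tools; every hook name here is hypothetical. The slide does not spell out the case where ping fails but the web app responds, so the sketch assumes no incident is raised then.

```python
def triage_server_down(host, ping, web_check, dns_lookup, traceroute, log_incident):
    """Decide whether a 'server down' event is a real incident.

    ping(host) and web_check(host) return True on success; dns_lookup(host)
    and traceroute(host) return diagnostic text; log_incident(msg) logs an
    Incident to the Service Desk. All are caller-supplied (hypothetical) hooks.
    """
    if not ping(host):
        if not web_check(host):
            # Both checks failed: treat as a real outage.
            log_incident(f"{host} is down: ping and web app check both failed")
            return {"incident": True, "diagnostics": {}}
        # Assumption: the web app answers, so the server is up
        # and ping may simply be blocked.
        return {
            "incident": False,
            "diagnostics": {"note": "ping failed, web app reachable"},
        }
    # Ping works now, so this is likely a false alarm. Collect clues
    # about why the monitoring tool could not reach the server earlier.
    diagnostics = {"dns": dns_lookup(host), "traceroute": traceroute(host)}
    return {"incident": False, "diagnostics": diagnostics}
```

Only the first branch ever reaches the Service Desk as an Incident; the other two leave diagnostics behind for the operations team to review.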
Conventional IT Organization - NOC vs. Service Desk
- Events sent directly to operations via email
- Events sent directly to app mgmt via email
- Separate teams with separate tools
- Critical events may be missed, leading to service disruption
- Service Desk is missing visibility into the critical events that caused the disruption
The million dollar question: Is it the network, the server or the application? What caused the service disruption?
ITIL v3 Best practice: Operations Bridge
Operations Bridge: "A physical location where IT Services and IT Infrastructure are monitored and managed."
- Preventive tasks: routine operations
- Exceptions: unusual activity
- Incidents: disruption of service
Service Desk (tool) as the Operations Bridge
- All actionable IT events and Incidents are logged in Service Desk
- Reduced Incidents due to high quality event management
- Basis for identifying automation opportunities
Service Desk as Operations Bridge
- Critical & Warning events are logged to Service Desk
- Incidents & Service Requests are already logged
- Service Desk acts as the Operations Bridge
- IT Operations, Application Management & Service Desk teams work together
Info for current (and future) ManageEngine customers
Event Management Feature / Best Practice | ManageEngine Product
Incident Analysis Report | Service Desk Plus
Monitoring Scope: Network, Server, VMware, VoIP | OpManager
Monitoring Scope: Applications and Databases | Applications Manager
Event Correlation Techniques | OpManager
Root Cause Analysis | Applications Manager
Run Book Automation | OpManager (due in the new release)
Operations Bridge | Service Desk Plus
Summary
- Optimum monitoring scope to ensure that the right events are generated
- Implement event correlation strategies to filter events
- Use automation to reduce manual tasks and false alerts
- Consider using your Service Desk as an Operations Bridge