Event management (ITIL)
Event Management, as defined by ITIL, is the process that monitors all events that occur throughout the IT infrastructure. It allows for normal operation and also detects and escalates exception conditions.
An event is any detectable or discernible occurrence that has significance for the management of the IT infrastructure, for the delivery of an IT service, or for evaluating the impact a deviation might have on the services. Events are typically notifications created by an IT service, a Configuration Item (CI), or a monitoring tool.
Purpose/scope
- The purpose of Event Management is to detect events, investigate them, and determine the correct control action.
- The events (warnings and exceptions) can be used to automate many routine activities.
- Event Management can be applied to any aspect of Service Management that can be controlled and automated (Configuration Items).
- Provide mechanisms for early detection of incidents.
- Some types of automated activities can be monitored by exception, reducing downtime.
Event handling
Event notification and detection
Event notifications can be proprietary, in which case only certain management tools can detect them. Most Configuration Items (CIs) generate event notifications using SNMP (Simple Network Management Protocol), an open protocol.
The CIs are configured to generate a set of events based on the designer's experience.
Once an event notification has been generated, it is detected (read and interpreted) by the relevant management tool.
Event filtering
Filtering determines whether an event notification is ignored or communicated to the management tool. If ignored, the event will usually be recorded in a log file on the device, but no further action will be taken.
During the filtering step, the event is assigned a first level of correlation, that is, a type: informational, warning, or exception.
The filtering step is not always mandatory. Some CIs have significant events that are communicated directly to the management tool, even if they are duplicates.
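The filtering decision described above can be sketched as follows. This is a minimal illustration, not part of ITIL: the `Event` class, the `IGNORED` and `ALWAYS_FORWARD` rule sets, and the event names in them are all hypothetical, and a real management tool would take its rules from its own configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    source: str   # CI that emitted the notification
    name: str     # short event identifier
    message: str  # free-text detail


# Hypothetical rule sets, chosen only for illustration.
IGNORED = {"heartbeat", "debug_trace"}
ALWAYS_FORWARD = {"power_supply_failed"}


def filter_event(event: Event, device_log: list) -> Optional[Event]:
    """Return the event if it should reach the management tool, else None."""
    if event.name in ALWAYS_FORWARD:
        # Significant events bypass filtering, duplicates included.
        return event
    if event.name in IGNORED:
        # Ignored events are only recorded locally; no further action.
        device_log.append(event)
        return None
    return event
```

An ignored event ends up in `device_log` and returns `None`; anything else is passed on to the management tool unchanged.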
Significance of event
Standard categorization based on the significance of an event:
- Informational (INFO): the event does not require any immediate action and does not represent an exception. Such events are recorded in log files and retained for a predetermined period. This type of event is used to check the status of a device or service, to confirm the state of an activity, or to generate statistics (for example: user login, batch job completed, device power-up, number of users logged into an application).
- Warning (WARN / ALERT): the event is generated when a device or service (application or utility) is approaching an agreed threshold (KPI). Warnings notify the responsible group, process, or tool so that the necessary action can be taken to prevent an exception from occurring.
- Exception (ERROR): a service or device is currently operating below its predefined normal parameters or indicators. This means that the business service is impacted and the device or service shows a failure, performance degradation, or loss of functionality (for example: web server down, CS coverage lost for several sites). A device failure is an error.
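As a rough sketch of how the three significance levels might be derived from a monitored metric, the function below compares a reading against two thresholds. The `Significance` enum, the `classify` function, and the assumption that higher readings are worse (as with disk usage in percent) are illustrative choices, not something ITIL prescribes.

```python
from enum import Enum


class Significance(Enum):
    INFO = "informational"
    WARN = "warning"
    ERROR = "exception"


def classify(value: float, warn_at: float, error_at: float) -> Significance:
    """Classify a metric where higher readings are worse (assumption)."""
    if warn_at > error_at:
        raise ValueError("warn_at must not exceed error_at")
    if value >= error_at:
        return Significance.ERROR  # operating outside normal parameters
    if value >= warn_at:
        return Significance.WARN   # approaching the agreed threshold
    return Significance.INFO       # normal operation, logged only
```

With a warning threshold of 80 and an exception threshold of 95, a reading of 85 is classified as a warning and a reading of 97 as an exception.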
Note that the item below is not an event type but an analysis that can be carried out on the event logs:
- Trend analysis: the event logs should be regularly analyzed for patterns across event types [INFO, WARN, ALERT, ERROR] that may indicate an underlying Problem, so that it can be addressed before it causes a serious service disruption.
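One very simple form of trend analysis is counting how often each source logs events of a given type. The sketch below assumes the log is available as `(source, level)` pairs; the function name `recurring_sources` and the default of three occurrences are hypothetical values picked for the example.

```python
from collections import Counter


def recurring_sources(log, level="WARN", min_count=3):
    """Return the sources that logged `level` at least `min_count` times, sorted by name."""
    counts = Counter(source for source, lvl in log if lvl == level)
    return sorted(src for src, n in counts.items() if n >= min_count)


event_log = [
    ("db01", "WARN"),
    ("db01", "WARN"),
    ("web01", "INFO"),
    ("db01", "WARN"),
    ("web01", "WARN"),
]
repeat_offenders = recurring_sources(event_log)  # ["db01"]
```

A source that keeps producing warnings, such as `db01` here, is a candidate for investigation as a possible underlying Problem.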
Response
At this point in the process, there are a number of response options available. Some of the options available are:
- Event logging: regardless of the event type, it is good practice to record the event and the actions taken. The event can be logged as an Event Record or left as an entry in the system log of the device.
- Alert and human intervention: events that require human intervention need to be escalated. The purpose of the alert is to notify the correct resource (person) to handle the event.
- Incident Record: an incident can be generated when an exception is detected.
- RFC (Request for Change): two scenarios are outlined:
  - For an exception (for example, two new network devices have been added without the necessary authorization).
  - For a change (for example, a server needs to be upgraded to prevent a file system failure; the change may take some time to take effect).
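The response options listed above can be combined for a single event. The sketch below is one hypothetical way to select them; the function `choose_responses`, its flags, and the response labels are assumptions made for illustration and do not correspond to any particular tool.

```python
def choose_responses(significance: str,
                     needs_human: bool = False,
                     change_needed: bool = False) -> list:
    """Return the response options to trigger for one event, in order."""
    responses = ["log"]  # every event is logged, whatever its type
    if needs_human:
        responses.append("alert")     # escalate to the correct person
    if significance == "exception":
        responses.append("incident")  # an exception can open an Incident Record
    if change_needed:
        responses.append("rfc")       # raise a Request for Change
    return responses
```

For example, `choose_responses("warning", change_needed=True)` selects logging plus an RFC, which matches the preventive server upgrade scenario.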
Close event
- Events that generated an incident, problem, or change should be formally closed with a link to the appropriate record from the other process.
- Informational events are simply logged and then used as input to other processes, such as Backup and Storage Management. Auto-response events are typically closed by the generation of a second event.