ITIL and Network Operations: IT departments are under pressure to adopt stable, standardized management practices. ITIL provides a framework network managers can use to answer the call.
In part one of this article, we discussed how the Information Technology Infrastructure Library (ITIL) model can be applied to the network operations (NetOps) group inside an enterprise-level firm. We also detailed how ITIL Service Delivery, the strategic half of the ITIL model, works on a human scale. In this part, we discuss how the second half of ITIL, Service Management, interlocks with the other ITIL components to complete a unified structure.
ITIL divides roles by their tactical or strategic function.
|Service Delivery|Service Management|
|Service Level Management| |
|IT Security Management| |
The Incident Management (IM) team is the front line of the NetOps group, handling ‘first contact’ on both issues and requests. Note that the ITIL framework categorizes both issues and requests as incidents. Unlike a call center, however, IM is responsible for driving those issues and requests to resolution. Its role breaks down into four major components:
- ownership of the incident
- monitoring the ongoing environment
- communicating the status of the incident to all stakeholders
- tracking the incident to resolution
Phillip manages the Incident Management team. According to his network management suite, a router in Asia just went off-line. While his chief network technician starts troubleshooting the issue, his front-line staff update the intranet homepage and begin answering calls, collecting information on the impact of the outage. Apparently a Layer 2 spanning tree loop pegged the CPU of the router’s switch portion at 100 percent, resulting in a high level of packet drops. Phillip’s network technician grabs the router’s logs and completes a reboot, which triggers a spanning tree priority change and ends the incident. After publishing an ‘all clear’ message to the intranet, Phillip writes up a report on the incident and submits it to the Problem Manager and to the Service Level Manager (SLM) of the impacted groups.
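The four IM responsibilities (ownership, monitoring, communicating and tracking) can be pictured as a small record that travels with the incident. The following is only a minimal sketch: the `Incident` class, its fields and the `Status` values are hypothetical names chosen for illustration, not part of ITIL or of any particular ticketing tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    IN_PROGRESS = "in progress"
    RESOLVED = "resolved"


@dataclass
class Incident:
    """One issue or request owned by the Incident Management team (hypothetical model)."""
    summary: str
    owner: str                      # who drives the incident to resolution
    stakeholders: list[str]         # who must hear about every status change
    status: Status = Status.OPEN
    log: list[str] = field(default_factory=list)

    def update(self, status: Status, note: str) -> list[str]:
        """Record a status change and return one notice per stakeholder."""
        self.status = status
        self.log.append(f"{status.value}: {note}")
        return [
            f"To {name}: [{status.value}] {self.summary} - {note}"
            for name in self.stakeholders
        ]


incident = Incident(
    summary="Router in Asia off-line",
    owner="Phillip",
    stakeholders=["Problem Manager", "Service Level Manager"],
)
incident.update(Status.IN_PROGRESS, "technician troubleshooting")
notices = incident.update(Status.RESOLVED, "router rebooted, all clear")
```

Keeping the log and the stakeholder list on the same record means the final report is already assembled by the time the incident closes.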
While the IM team is responsible for restoring service as quickly as possible, the Problem Manager (PM) is concerned with the underlying reason for the incident and how it might link to other incidents in the environment. First, the PM identifies a potential ‘Problem’ causing one or more incidents. After studying the data provided by the IM team, the PM identifies a ‘Known Error,’ or root cause. The PM’s last step is to communicate the ‘Workaround,’ which, when implemented, will solve the Problem permanently.
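The first step, spotting a potential Problem behind several incidents, amounts to looking for symptoms that recur. As a rough sketch, assuming incident reports arrive as simple dictionaries with a free-text `symptom` field (the field names, sample data and threshold are all hypothetical):

```python
from collections import Counter

# Hypothetical incident reports forwarded by the Incident Management team.
incidents = [
    {"id": 101, "symptom": "router lockup"},
    {"id": 117, "symptom": "vpn login failure"},
    {"id": 124, "symptom": "router lockup"},
    {"id": 139, "symptom": "router lockup"},
]


def potential_problems(reports, threshold=2):
    """Return symptoms seen at least `threshold` times, in alphabetical order."""
    counts = Counter(report["symptom"] for report in reports)
    return sorted(s for s, n in counts.items() if n >= threshold)


flagged = potential_problems(incidents)
```

A recurring symptom is only a candidate; the Known Error still has to be established from the data the IM team collected.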
Ty, a Problem Manager, is happy to see the e-mail from Phillip describing the spanning tree incident. This is the third time in as many months that routers have mysteriously locked up and required a reboot to start functioning again. This time the network technician grabbed all of the logs before rebooting the router, so Ty can see exactly how spanning tree went out of control. It looks like multiple routers had the same spanning tree priorities in their Layer 2 configurations, causing a race condition that clogged the CPUs. The fix is easy: Ty creates different priority assignments for each router in his environment and passes the information to the Configuration Management owner. Ty also alerts the Availability Manager to the Known Error so she can log why her uptime numbers have been down.
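Ty's analysis and fix can be sketched in a few lines. This is an illustration under stated assumptions: the hostnames and the inventory dictionary are hypothetical, and the only protocol facts relied on are that bridge priorities are set in steps of 4096 from 0 to 61440 and that 32768 is the usual default.

```python
from collections import defaultdict

# Hypothetical inventory: device name -> configured bridge priority.
priorities = {
    "rtr-asia-1": 32768,
    "rtr-asia-2": 32768,
    "rtr-emea-1": 32768,
    "rtr-amer-1": 4096,
}


def find_shared_priorities(configured):
    """Map each priority used by more than one device to those devices."""
    by_value = defaultdict(list)
    for device, value in configured.items():
        by_value[value].append(device)
    return {v: sorted(d) for v, d in by_value.items() if len(d) > 1}


def assign_distinct(devices, step=4096, maximum=61440):
    """Give each device its own priority; earlier devices get lower values."""
    values = range(0, maximum + 1, step)
    if len(devices) > len(values):
        raise ValueError("more devices than distinct priority values")
    return dict(zip(devices, values))


shared = find_shared_priorities(priorities)
plan = assign_distinct(sorted(priorities))
```

Because the lowest value wins the root bridge election, the order of the device list passed to `assign_distinct` is itself a design decision and would normally follow the intended topology rather than alphabetical order.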
The Configuration Manager is responsible for identifying, recording and reporting all the hardware and software components in the environment. For NetOps specifically, this means owning records of all hardware chassis and modules, software revisions and configurations. Any modification, whether to hardware, software or configuration, must get the Configuration Manager’s approval.
Alyson receives the Known Error Workaround from Ty. A quick search through her inventory shows that Ty did not assign spanning tree priorities for three devices, and she wonders if this is where the problem began. After she receives the missed updates from Ty, she packages the configuration changes together and passes them to Release Management. Alyson also adds a note reminding Release Management to notify her when the implementation is complete so she can update her configuration database.
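Alyson's inventory check is essentially a comparison between what the configuration database says exists and what the Workaround she received covers. A minimal sketch, assuming the inventory is a set of device names and the Workaround is a mapping of device to new priority (all names and values here are hypothetical):

```python
# Hypothetical configuration database contents.
inventory = {
    "rtr-asia-1", "rtr-asia-2", "rtr-emea-1", "rtr-amer-1", "rtr-amer-2",
}

# Hypothetical Workaround as received: device -> new spanning tree priority.
workaround = {"rtr-asia-1": 4096, "rtr-asia-2": 8192}


def uncovered_devices(known_devices, proposed):
    """Devices in the inventory that the Workaround does not mention."""
    return sorted(known_devices - set(proposed))


def unknown_devices(known_devices, proposed):
    """Devices the Workaround mentions that the inventory has no record of."""
    return sorted(set(proposed) - known_devices)


missing = uncovered_devices(inventory, workaround)
strays = unknown_devices(inventory, workaround)
```

Checking in both directions catches omissions in the Workaround as well as devices that were never recorded in the first place.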
Release Management (RM) includes the personnel who are responsible for actually making modifications to the environment. In the case of NetOps, these are the people logging into devices to update firmware and software and to apply configuration changes. They own not only the change itself but also any testing and communication required.
Robinson, a member of the Release Management team, is assigned the spanning tree update. After confirming that the change is described as ‘zero impact’ to the network, he puts together an implementation plan that lists the exact steps he’ll complete in order, the testing he’ll perform after each upgrade, and the back-out plans he’ll use if the tests are not successful. He has his entire Release Management team review the document and makes changes based on their recommendations. Even though this change shouldn’t cause an outage, he also communicates his request to the Incident Management team in case they receive any otherwise unexplained calls. Because the Known Error has already caused incidents, he requests an emergency change and submits his documentation to the change management board.
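An implementation plan of this kind (ordered steps, a test after each one, and a back-out path) maps naturally onto a small data structure. The sketch below is illustrative only: `Step`, `run_plan` and the in-memory `state` dictionary standing in for real devices are hypothetical, and a real plan would talk to the routers rather than to a dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    apply: Callable[[], None]      # make the change
    test: Callable[[], bool]       # verify it worked
    back_out: Callable[[], None]   # undo the change


def run_plan(steps):
    """Apply steps in order; on a failed test, back out in reverse order."""
    done = []
    for step in steps:
        step.apply()
        done.append(step)
        if not step.test():
            for finished in reversed(done):
                finished.back_out()
            return False
    return True


# Stand-in for device configuration (hypothetical names and values).
state = {"rtr-asia-1": 32768, "rtr-asia-2": 32768}


def make_step(device: str, new_priority: int) -> Step:
    old_priority = state[device]

    def apply() -> None:
        state[device] = new_priority

    def test() -> bool:
        return state[device] == new_priority

    def back_out() -> None:
        state[device] = old_priority

    return Step(f"set {device} priority", apply, test, back_out)


ok = run_plan([make_step("rtr-asia-1", 4096), make_step("rtr-asia-2", 8192)])
```

Backing out in reverse order returns the environment to its starting point even when a later step is the one that fails.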
Change Management involves multiple people representing stakeholder groups, who meet regularly to discuss and review requests. This group, called the Change Advisory Board, is given Change Authority by senior management and is overseen by a single Change Manager, who directs the reviews. There is also a smaller team called the Executive Committee (EC), which is made up of critical stakeholders plus the Change Manager. The EC reviews emergency changes that cannot go through the longer standard review cycle.
Patrick is the NetOps Change Manager. He tends to annoy his peers in Release Management by pushing back on their requests, making sure that any change they request is really required. Since taking on the role of Change Manager, he has also put an end to non-critical network changes during normal business hours. The spanning tree Known Error request is an emergency, however, so he brings his Executive Committee together on a conference call to review the request with Robinson. Patrick makes Robinson explain every detail of his testing and back-out procedure before the EC gives its approval, then sends a note to the Service Level Manager and Incident Manager informing them of the change.
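The two rules at work here (emergency changes go to the Executive Committee rather than the full board, and non-critical changes stay out of business hours) can be written down as simple checks. This is a sketch under assumptions: the `ChangeRequest` fields are hypothetical, and the 08:00 to 18:00 weekday window is an illustrative definition of "normal business hours" that the article does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed business hours; the real window is site-specific.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)


@dataclass
class ChangeRequest:
    title: str
    emergency: bool
    critical: bool
    scheduled: datetime


def review_body(request: ChangeRequest) -> str:
    """Emergency changes skip the longer standard review cycle."""
    if request.emergency:
        return "Executive Committee"
    return "Change Advisory Board"


def window_allowed(request: ChangeRequest) -> bool:
    """Non-critical changes may not run on weekdays during business hours."""
    when = request.scheduled
    in_hours = when.weekday() < 5 and BUSINESS_START <= when.time() < BUSINESS_END
    return request.critical or not in_hours


midday = datetime(2024, 1, 10, 11, 0)   # a Wednesday morning
emergency = ChangeRequest("Spanning tree priority update", True, True, midday)
routine = ChangeRequest("Firmware refresh", False, False, midday)
```

Here `review_body(emergency)` yields the Executive Committee and `window_allowed(routine)` is false, so the routine request would have to be rescheduled outside the window.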
ITIL is not a silver bullet that will solve all of your organization’s issues, but it does provide a relatively easy-to-follow set of guidelines for modularizing your NetOps group and establishing the communication links those groups need to be successful. Properly implemented, ITIL can greatly increase stakeholder satisfaction.
Michael Burton is a senior program manager for Intel. He holds a PMP and ITIL-Foundation certification and resides in Portland, Oregon.