To review, we named a few features our ideal NMS solution should have:
- Layer two and three network discovery
- Visualization of the network – Layer two and three maps
- Network equipment configuration management
- The ability to monitor nodes and their services, and to receive SNMP traps
- Charts of host and service availability
There are likely many other features that we'd name in a wish list, but this is a good start. Unfortunately, this is all about either network device management or network monitoring. What about the things we need to do with our servers?
The entire IT infrastructure is built to support the servers, and more specifically, the applications that run on them. Without services, we wouldn't care about network connectivity, for example. The NMS features listed above, which are present in some NMS products right now, each provide only specific information from a single data source.
Now I'm a firm believer in the Unix idea that each application should perform its one task extremely well, and then pass data along to another application. That type of thinking works quite well when you're thinking about designing a Web infrastructure with databases, proxies, front ends, application servers and the like. It doesn't work well when you're trying to manage all of these complex systems in a sane fashion.
What the IT world really needs is a system that performs all the NMS functions, in addition to a few others. Some readers may recognize this as an expanded complaint from an article titled ITIL: Listen to Your Admins, back in April. How very observant. Indeed, the ITIL ideas are sound, but the implementation is mystical. Taking one of the big mysterious ITIL concepts a bit further, we can easily see that our entire IT infrastructure can be managed as a whole. The Configuration Management Database (CMDB) is sometimes thought of as just an inventory tracker, but in reality, it is more about the specific configuration of each server: every aspect of the configuration. We can realize the true spirit of ITIL by using our NMS data in conjunction with other key IT assets.
Trouble tickets are something every IT organization has to deal with. There must be a mechanism for IT customers to submit problems, and we must be able to keep track of outstanding issues. Many organizations use ticket-tracking systems to track the progress of projects as well. A trouble ticket comes from a user (consumer) of IT services and involves an IT service, which generally invokes some sort of action. The action likely results in some sort of change, which will impact specific servers.
Asset tracking, as the name implies, is usually a database of assets. But what good is that? A few serial numbers and other facts about a server, sitting mostly unused, sure doesn't sound very useful to the sysadmins. What if every time a user reported a problem with a system, we could record which system (application, server, or whatever) was involved, and link it to a ticket? We could then search for all problems related to a specific application, for example, in a given time period. Today this is normally done with text-based searches through a ticket system, which is hardly robust.
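To make the idea concrete, here is a minimal sketch of what an asset-linked ticket store could look like. The schema and names are hypothetical, not drawn from any particular product; the point is that a structured query replaces a text search through ticket bodies.

```python
# A minimal sketch of an asset-linked ticket store, using SQLite.
# The schema, host names, and ticket data are all hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE assets  (id INTEGER PRIMARY KEY, name TEXT, kind TEXT);
CREATE TABLE tickets (id INTEGER PRIMARY KEY, opened TEXT, summary TEXT,
                      asset_id INTEGER REFERENCES assets(id));
""")
db.executemany("INSERT INTO assets VALUES (?, ?, ?)",
               [(1, "mail01", "server"), (2, "webmail", "application")])
db.executemany("INSERT INTO tickets VALUES (?, ?, ?, ?)",
               [(1, "2008-06-02", "Mail bounces", 1),
                (2, "2008-06-15", "Slow IMAP", 1),
                (3, "2008-06-20", "Login page broken", 2)])

# All problems tied to a specific asset in a given period -- a
# structured query instead of grepping through ticket text.
rows = db.execute("""
    SELECT t.id, t.summary FROM tickets t
    JOIN assets a ON a.id = t.asset_id
    WHERE a.name = ? AND t.opened BETWEEN ? AND ?
    ORDER BY t.id
""", ("mail01", "2008-06-01", "2008-06-30")).fetchall()
print(rows)
```

With the link in place, "every problem with mail01 last month" is a one-line query rather than a guessing game over free-form ticket summaries.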
Change management, that lovely process, feels like a productivity killer at first, but produces amazing results and is closely related to ticket tracking. Quite simply, change management is the oversight, documentation, and even execution of changes to any IT asset. A trouble ticket or a planned outage initiates a change in an IT service.
Configuration management of servers is the mechanism sysadmins use to ensure a consistent and functioning infrastructure. Systems like Cfengine and Puppet, and even Group Policies in Windows, provide a means to both document and enforce specific configurations across large groups of servers. Now we're really reaching, getting far from the duties of an "NMS." Well, not quite.
Theoretically, a configuration management system is quite capable of seeding a monitoring system with information about a server. Configuration management systems know every detail about a host, if implemented properly, because the host can be completely rebuilt from scratch and emerge in the exact same configuration. We generally think in terms of automated reproducible builds, which are important, but at the same time we're ignoring the wealth of data available in such systems. We should be able to automatically provide (and update) our monitoring system with information about which services a server is running.
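As a sketch of what that seeding could look like: the catalog below stands in for the host-and-service data a tool like Puppet or Cfengine already holds, and the output is a set of Nagios-style service definitions. The catalog contents and check-command names are assumptions for illustration.

```python
# A sketch of seeding a monitoring system from configuration-management
# data. The catalog stands in for what Puppet or Cfengine already knows
# about each host; host names and services here are hypothetical.
catalog = {
    "mail01": ["smtp", "imap"],
    "web01": ["http"],
}

def nagios_services(catalog):
    """Emit one Nagios-style service block per (host, service) pair."""
    blocks = []
    for host, services in sorted(catalog.items()):
        for svc in services:
            blocks.append(
                "define service {\n"
                f"    host_name           {host}\n"
                f"    service_description {svc}\n"
                f"    check_command       check_{svc}\n"
                "}\n")
    return "".join(blocks)

print(nagios_services(catalog))
```

Regenerating this file whenever the configuration management data changes keeps the monitoring system in step with reality, with no hand-editing of service definitions.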
Likewise, if the monitoring system linked all services associated with a certain host to an asset item within a "host database" or asset tracking system, we'd be able to know precisely what hosts have what configurations. When a problem occurred with a host, or when a change needed to be made, these events would also be linked to the server or application in question.
In an IT world where all these systems are linked, and centered on the real assets that IT provides, managing the entire operation becomes much easier. Imagine searching for all problems related to an e-mail server. You might start by noticing that an increasing number of e-mail complaints are coming in. Searching your ticket system, you find that all complaints in the last month are related to a single server. The change management database lists the remediation actions taken to get the server limping along again, each time a request was entered. To figure out what really happened, you may need to look no further than a past change that impacted this server. If not, it's on to troubleshooting as usual, but with some major clues to the problem already in hand. That's just one example; you can think of more. Heck, just imagine the time savings inherent in automatically configuring your service monitoring system. In some organizations, that's an entire employee's job.
What systems can help you obtain this nirvana, you ask? A good place to start is RT (Request Tracker), an open source ticketing system. The AssetTracker module allows you to create a host database and correlate users and their tickets with specific servers. This is a huge step! The rest, unfortunately, is still best accomplished in separate systems, like Nagios, Puppet, and whatever you currently use to track change management. In theory, almost everything else could be implemented as RT modules, but I'm not advocating that. The purpose of these "NMS" articles is to get some ideas flowing; it's impossible to write a HowTo for something that does not yet exist.
The all-encompassing IT Management System is the first step toward more fully automated IT operations, which require fewer low- and mid-level administrators. Of course, this is not possible today, but it's a wonderful goal.