Monitoring Best Practices

Monitoring is a critical component of running an online application, but it is only truly tested under non-ideal conditions. Effective monitoring has a few layers and a few rules to follow.

Monitoring Stack

Effective monitoring takes place at several different levels to ensure that every facet of a website is up and running. A common mistake is to check only whether a computer system responds to ping. While that does indicate whether a system is reachable, it does not tell you whether the entire system works.

Network Monitoring

Network monitoring is the most basic form of monitoring and measures whether a single computer system is up and on the network. An effective test for this category is to see whether a node responds to ICMP echo requests (ping). Too many people rely on this level of testing alone to ensure that their web application is up and running. Problems in this category typically mean hardware or ISP/network carrier problems. When problems come up in this area, someone needs to pay attention to them immediately. Tools like Nagios and DynDNS.com Network Monitoring do this well.
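As a rough illustration of this level of check, the sketch below shells out to the system ping command and treats exit status 0 as "reachable". It assumes a Linux-style ping that accepts -c (number of requests) and -W (reply timeout in seconds); other platforms use different flags. The function names and the injectable run parameter are illustrative only, not part of any of the tools named here.

```python
import subprocess


def build_ping_command(host, count=3, timeout_s=2):
    # Assumes Linux iputils-style flags:
    #   -c = number of echo requests, -W = seconds to wait for each reply
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_reachable(host, run=subprocess.run, count=3, timeout_s=2):
    """Return True if the ping command exits with status 0."""
    cmd = build_ping_command(host, count, timeout_s)
    try:
        result = run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except OSError:
        # The ping binary is missing or could not be started.
        return False
    return result.returncode == 0
```

A True result only says the node answers on the network. It says nothing about whether the application running on it works.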

Node/System Monitoring

The next level of monitoring looks at a single computer system or operating system, which may be a dedicated server or a virtual private server. Effective monitoring here is more about thresholding and trending to understand CPU, network, and disk usage. As applications scale, it is hard to deduce which bottlenecks are forming without seeing historical usage. Busy mail servers, for example, can look CPU-overloaded when they are typically disk-bound instead. Tools like Cacti and Munin are good for this.
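Thresholding and trending can be sketched in a few lines. The example below assumes usage samples (say, disk usage in percent) taken at evenly spaced intervals. It flags samples over a limit and fits a straight line to estimate how many intervals remain before the limit is reached. The function names are hypothetical, and a straight-line fit is only one simple way to trend. Tools like Cacti and Munin store and graph this history for you.

```python
def over_threshold(samples, limit):
    """Return the samples that exceed the limit."""
    return [s for s in samples if s > limit]


def linear_trend(samples):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_x


def intervals_until(samples, limit):
    """Estimate intervals left before the fitted line reaches the limit.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    current = intercept + slope * (len(samples) - 1)
    if current >= limit:
        return 0.0
    return (limit - current) / slope
```

For weekly disk readings of 40, 50, 60, and 70 percent with a 90 percent limit, the fitted line grows by 10 points per week, so roughly two weeks remain.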

System/Application Monitoring

Once your network and individual systems are being watched, the next step is to take a look at the overall system/application. Can users create accounts or order online? This is application specific and takes some time to set up, but it is a great notification tool in case something breaks. When an outage shows up at this level, it's important to think about the root cause, since failed user signups may actually be caused by network issues, database issues, full disk drives, or many other reasons. Custom plugins for Nagios or Smokeping, or tools like Selenium, can handle this.
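A custom Nagios plugin for this kind of check can be quite small. Nagios plugins report status through their exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN, with a one-line message on standard output. The sketch below assumes a hypothetical attempt_signup callable that runs your own signup flow and returns True on success. The thresholds and message wording are placeholders to adapt.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_signup(attempt_signup, warn_s=2.0, crit_s=5.0, clock=time.monotonic):
    """Run the signup flow once and map the outcome to (exit code, message).

    attempt_signup is a hypothetical callable supplied by you.
    """
    start = clock()
    try:
        succeeded = attempt_signup()
    except Exception as exc:
        # Any failure in the flow is reported; finding the root cause
        # (network, database, full disk) is a separate step.
        return CRITICAL, f"SIGNUP CRITICAL - {type(exc).__name__}: {exc}"
    elapsed = clock() - start
    if not succeeded:
        return CRITICAL, "SIGNUP CRITICAL - signup did not complete"
    if elapsed >= crit_s:
        return CRITICAL, f"SIGNUP CRITICAL - took {elapsed:.2f}s"
    if elapsed >= warn_s:
        return WARNING, f"SIGNUP WARNING - took {elapsed:.2f}s"
    return OK, f"SIGNUP OK - completed in {elapsed:.2f}s"
```

A wrapper script would print the message and call sys.exit() with the returned code so that Nagios can pick up the result.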

Performance Monitoring

Once you know that your application is up and running, it's important to see whether it is running fast. Responses at 3am might be superfast, but is the peak time of day also fast? Smokeping is a great tool for measuring how fast something runs and is highly configurable. You can create custom scripts to perform a test a number of times and see how the results are distributed. Well-tuned applications will have almost identical results from run to run, while overloaded systems may have some fast responses but a lot of slow ones.
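The idea of running a test several times and looking at the distribution can be sketched as follows. This is not how Smokeping is implemented. It is a minimal stand-alone illustration in which probe is any callable you supply (for example, one that fetches a page), and the summary uses a nearest-rank 95th percentile.

```python
import statistics
import time


def measure(probe, runs=20, clock=time.perf_counter):
    """Call probe() repeatedly and return the elapsed time of each call."""
    timings = []
    for _ in range(runs):
        start = clock()
        probe()
        timings.append(clock() - start)
    return timings


def summarize(timings):
    """Summarize a list of timings: min, median, 95th percentile, max."""
    if not timings:
        raise ValueError("no timings to summarize")
    ordered = sorted(timings)
    # Nearest-rank percentile: rank = ceil(0.95 * n), in integer arithmetic.
    rank = max(1, -(-95 * len(ordered) // 100))
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

A small gap between the median and the 95th percentile suggests consistent response times. A large gap suggests that some requests are much slower than the rest. Comparing a 3am run against a peak-hour run shows whether load is the cause.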

General Notes

Carefully pick your vantage point - If your monitoring is taking place on the same switch as your application server, you cannot effectively evaluate whether the application server is generally available to the Internet. You want your monitoring to sit outside your own network, looking in. This may mean picking up some additional servers with a different provider to give you that outsider vantage point.

Do not underestimate good trending data - Being able to look back at your site and correlate traffic and general system performance is critical. Three months ago when you had half the traffic you have now, were your servers half as busy? Which resources are tight, and when your traffic doubles again in three months, how can you prevent pain points before they occur?
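The traffic-doubling question is simple arithmetic once you have the history. The sketch below takes two (traffic, utilization) observations and extrapolates in a straight line to a future traffic level. The function name and the sample figures are made up for illustration, and real systems often stop scaling linearly as they approach saturation, so treat the result as a rough early warning.

```python
def project_utilization(past, current, future_traffic):
    """Linearly extrapolate utilization from two (traffic, utilization) points."""
    (t0, u0), (t1, u1) = past, current
    if t1 == t0:
        raise ValueError("the two observations need different traffic levels")
    slope = (u1 - u0) / (t1 - t0)
    return u1 + slope * (future_traffic - t1)


# Hypothetical figures: 1000 req/s at 30% CPU three months ago,
# 2000 req/s at 55% CPU today, and an expected 4000 req/s in three months.
projected = project_utilization((1000, 30.0), (2000, 55.0), 4000)
needs_capacity = projected > 100.0
```

With these sample numbers the projection lands above 100 percent CPU, which is exactly the kind of pain point worth addressing before the traffic arrives.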

Think through outage scenarios - Monitoring only gives you information about what you are testing. There are probably system components which are not monitored or trended, so be prepared for odd scenarios and think about where to look to determine the root cause of slowdowns or failures.

Dyn Inc. Monitoring

Dyn Inc. watches the DynDNS.com and other Dyn Inc. services through judicious use of network monitoring, node level monitoring, and a fair balance of application / performance monitoring. We use Nagios for networking, node, and system monitoring so that when problems occur, our Network Operations Team can respond quickly. We also use Cacti and Munin to collect and analyze system performance. While we have limited transaction/application testing, all of the testing we do is within Smokeping to make sure that DNS queries and DNS updates are fast.

Tools to Help

Nagios - An open source tool that performs automated checks at regular intervals. Checks can be a variety of protocol checks or custom plugins to do almost anything. Works well with a few to many systems. Features include multiuser support, paging, scheduled downtime, and a lot more.

Cacti - Great graphing tool which collects data in a variety of formats and displays it in a web interface. Easily handles data collected from systems and networking equipment, and can take custom data as well. Cacti does trending and uses RRDtool.

Munin - A monitoring tool that collects data from remote systems and aggregates it all together. Munin and Cacti provide similar data, but Munin is geared more toward system performance and is also easy to install on many systems.

Smokeping - A monitoring tool that tracks latency using a variety of probes. Probes can be basic protocol checks or custom written tests such as online transactions. It also supports a distributed monitoring setup to evaluate latency from a variety of locations.

Selenium - A web application testing system that performs tests scripted from within the browser. Record the complete process of user creation or placing an order to see if your web application works. Set it up and run it every night or even before you promote new code to your site.

DynDNS® Network Monitoring - Our web-based server and network monitoring service can monitor almost a dozen protocols and send emails or SMS messages on outages. Start monitoring in minutes.

DynDNS® Spring Server℠ VPS - Monitoring from a VPS outside of our normal network is easy and provides a critical off-net vantage point. You can install Nagios, Cacti, Munin, and Smokeping on a server to receive alerts.