Monday, February 24, 2014

Upcoming Classes

I look forward to seeing some of you at an upcoming class.

My class, "Technology Manager's Survival Guide" is on the Friday afternoon training schedule at LOPSA-East on May 2 in New Brunswick, NJ.

And I'll be presenting a two-part class, "Leader's Survival Guide," in Jacksonville, FL on July 16 and 23.

Friday, January 17, 2014

Effective Solaris System Monitoring

In order to maintain a reliable IT environment, every enterprise needs to set up an effective monitoring regime.

A common mistake by new monitoring administrators is to alert on everything. This is an ineffective strategy for several reasons. For starters, it may result in higher telecom charges for passing large numbers of alerts. Passing tons of irrelevant alerts also wears down team morale. And, no matter how dedicated your team is, you are guaranteed to reach a state where alerts start being ignored because "they're all garbage anyway."

For example, it is common for non-technical managers to want to send alerts to the systems team when system CPU hits 100%. But, from a technical perspective, this is absurd:

  • You are paying for a certain system capacity. Some applications (especially ones with extensive calculations) will use the full capacity of the system. This is a GOOD thing, since it means the calculations will be done sooner.
  • What is it you are asking the alert recipient to do? Restart the system? Kill the processes that are keeping the system busy? If there is nothing for the systems staff to do in the immediate term, it should be reported in a summary report, not alerted.
  • If there is an indication (beyond a busy CPU) that there is a runaway process of some sort, the alert needs to go to the team that would make that determination and take necessary action.

In order to be effective, a monitoring strategy needs to be thought out. You may end up monitoring a lot of things just to establish baselines or to view growth over time. Some things you monitor will need to be checked out right away. It is important to know which is which.

Historical information should be logged and retained for examination on an as-needed basis. It is wise to set up automated regular reports (distributed via email or web) to keep an eye on historical system trends, but there is no reason to send alerts on this sort of information.
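
For example, here is a minimal sketch of a daily report script for a Solaris box. It assumes the stock sa1/sa2 sar data collection is enabled in the sys crontab; the script path, schedule, and mail address are made up for illustration:

    #!/bin/sh
    # daily_sar_report.sh -- mail yesterday's sar summary to the systems team.
    # Run it from cron, e.g.:  0 6 * * * /usr/local/adm/daily_sar_report.sh

    # Day-of-month for yesterday, zero-padded to match /var/adm/sa/saDD naming.
    DAY=`perl -e '@t = localtime(time - 86400); printf "%02d", $t[3];'`
    SAFILE=/var/adm/sa/sa$DAY

    {
        echo "=== CPU utilization (sar -u) ===" ; sar -u -f $SAFILE
        echo "=== Disk activity (sar -d) ==="   ; sar -d -f $SAFILE
        echo "=== Paging activity (sar -g) ===" ; sar -g -f $SAFILE
    } | mailx -s "`hostname` daily sar report" sysadmins@example.com

A report like this keeps the historical data in front of the team without ever paging anyone.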

Availability information should be characterized and handled in an appropriate way, probably through a tiered system of notifications. Depending on the urgency, it may show up on a monitoring console, be rolled up in a daily summary report, or paged out to the on-call person. Some common types of information in this category include:

  • "Unusual" log messages. Defining what is "unusual" usually takes some time to tune whatever reporting system is being used. Some common tools include logwatch, swatch, and logcheck. Even though it takes time, your team will need to customize this list on their own systems.
  • Hardware faults. Depending on the hardware and software involved, the vendor will have provided monitoring hooks to allow you to identify when hardware is failing.
  • Availability failures. This includes things like ping monitoring or other types of connection monitoring that give a warning when a needed resource is no longer available.
  • Danger signs. Typically, this will include anything that your team has identified that indicates that the system is entering a danger zone. This may mean certain types of performance characteristics, or it may mean certain types of system behavior.
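
The "unusual log messages" item is the one that takes the most tuning. As a minimal hand-rolled illustration of the idea (the pattern file and state file locations here are assumptions; packaged tools like logcheck do the same thing with more polish):

    #!/bin/sh
    # check_messages.sh -- report new /var/adm/messages lines that do not match
    # a locally maintained list of "known harmless" egrep patterns.

    LOG=/var/adm/messages
    IGNORE=/usr/local/adm/ignore_patterns    # one egrep pattern per line
    STATE=/var/tmp/messages.lastcount        # line count from the previous run

    LAST=`cat $STATE 2>/dev/null`
    [ -z "$LAST" ] && LAST=0
    TOTAL=`nawk 'END { print NR }' $LOG`

    # If the log shrank, it was rotated; start counting from the beginning again.
    if [ "$TOTAL" -lt "$LAST" ] ; then
        LAST=0
    fi

    # Print only the lines added since the last run, minus the known-harmless ones.
    tail +`expr $LAST + 1` $LOG | egrep -v -f $IGNORE
    echo "$TOTAL" > $STATE

If this runs from cron, any output (i.e., any unexplained message) is mailed to the crontab owner automatically, which is often all the notification this category needs.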

Alerting Strategy

Alerts can come in different shapes, depending on the requirements of the environment. It is very common for alerts to be configured to be sent to a paging queue, which may include escalations beyond a single on-call person.

(If possible, configure escalations into your alerting system, so that you are not dependent on a single person's cell phone for the availability of your entire enterprise. A typical escalation procedure would be for an unacknowledged alert to be sent up a defined chain of escalation. For example, if the on-call person does not respond in 15 minutes, an alert may go to the entire group. If the alert is not acknowledged 15 minutes after that, the alert may go to the manager.)
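
Most alerting tools have this kind of escalation built in; if yours does not, the logic is simple enough to sketch. Here is a minimal illustration only — the addresses, the 15-minute timings, and the file-based acknowledgement are all assumptions, and a real system would track acknowledgements properly:

    #!/bin/sh
    # escalate.sh "alert text" -- page the on-call person, then escalate an
    # unacknowledged alert to the whole group and finally to the manager.

    ALERT="$1"
    ACKFILE=/var/tmp/alert.ack       # on-call person touches this to acknowledge

    echo "$ALERT" | mailx -s "ALERT" oncall-pager@example.com

    sleep 900                        # wait 15 minutes for an acknowledgement
    if [ ! -f $ACKFILE ] ; then
        echo "$ALERT" | mailx -s "ESCALATED ALERT" unix-admins@example.com
        sleep 900                    # another 15 minutes
    fi
    if [ ! -f $ACKFILE ] ; then
        echo "$ALERT" | mailx -s "ESCALATED ALERT" unix-manager@example.com
    fi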

In some environments, alerts are handled by a round-the-clock team that is sometimes called the Network Operations Center (NOC). The NOC will coordinate response to the issue, including an evaluation of the alert and any necessary escalations.

Before an alert is configured, the monitoring group should first make sure that the alert meets three important criteria. The alert should be:

  1. Important. If the issue being reported does not have an immediate impact, it should be included in a summary report, not alerted. Prioritize monitoring, alerting, and response by the level of risk to the organization.
  2. Urgent. If the issue does not need to have action taken right away, report it as part of a summary report.
  3. Actionable. If the person receiving the alert cannot take any action on it, the alert should be redirected to someone who can. (Or perhaps the issue belongs in a summary report rather than the alerting system.)

Solaris Monitoring Suggestions

Here are some monitoring guidelines I've implemented at places where I have worked. You don't have to alert on a ton of different things to have a robust monitoring solution; just these few items may be enough (two rough sketches after the list show how some of them might be scripted):
  • Ping up/down monitoring. You really can't beat it for a quick reassurance that a given IP address is responding.
  • Uptime monitoring. What happens if the system rebooted in between monitoring intervals? If you make sure that the uptime command is reporting a time larger than the interval between monitoring sweeps, you can keep an eye on sudden, unexpected reboots.
  • Scan rate > 0 for 3 consecutive monitoring intervals. This is the best measure of memory exhaustion on a Solaris box.
  • Run queue > 2x the number of processors for 3 consecutive monitoring intervals. This is a good measure of CPU exhaustion.
  • Service time (avserv in Solaris sar -d output, svctm in Linux) > 20 ms for disk devices with more than 100 (r+w)/s, including NFS disk devices. This is a measure of I/O channel exhaustion. 20 ms is a very long time, so you will also want to keep an eye on trends in regular summary reports of sar -d data.
  • System CPU utilization > user CPU utilization where idle < 40% for systems that are not serving NFS. This is a good indication of system thrashing behavior.
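
Here are the two sketches promised above. The first covers the availability checks: ping and reboot detection. The host names, alert address, and sweep interval are assumptions; adjust them for your site.

    #!/bin/sh
    # availability_check.sh -- ping a list of hosts and watch for unexpected
    # reboots of the local machine.  Intended to run from cron every INTERVAL
    # seconds.

    HOSTS="webserver1 dbserver1"            # hosts to ping (made-up names)
    INTERVAL=300                            # seconds between monitoring sweeps
    ALERT="oncall@example.com"              # where alerts go

    for host in $HOSTS ; do
        # Solaris ping: "ping host timeout" exits 0 if the host answers.
        /usr/sbin/ping $host 5 > /dev/null 2>&1
        if [ $? -ne 0 ] ; then
            echo "$host is not answering pings" | \
                mailx -s "ALERT: $host not responding" $ALERT
        fi
    done

    # Reboot detection: if the boot time is more recent than one sweep interval
    # ago, the system rebooted since the last check.
    BOOT=`kstat -p unix:0:system_misc:boot_time | nawk '{ print $2 }'`
    NOW=`perl -e 'print time();'`           # Solaris date(1) may not support +%s
    AGE=`expr $NOW - $BOOT`
    if [ $AGE -lt $INTERVAL ] ; then
        echo "`hostname` booted $AGE seconds ago" | \
            mailx -s "ALERT: `hostname` rebooted" $ALERT
    fi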
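
The second covers the exhaustion checks, using three 60-second vmstat samples. The thresholds follow the list above; the column positions are for Solaris vmstat output.

    #!/bin/sh
    # exhaustion_check.sh -- flag memory, CPU, and thrashing conditions using
    # three consecutive 60-second vmstat samples.

    NCPU=`psrinfo | wc -l`

    # "vmstat 60 4" prints a since-boot summary line plus three 60-second
    # samples; "tail -3" keeps only the three real samples.  In Solaris vmstat
    # output, column 1 is the run queue (r), column 12 is the scan rate (sr),
    # and the last three columns are the us/sy/id CPU percentages.
    vmstat 60 4 | tail -3 | nawk -v ncpu="$NCPU" '
        {
            if ($12 > 0)                          sr_bad++
            if ($1  > 2 * ncpu)                   rq_bad++
            if ($(NF-1) > $(NF-2) && $NF < 40)    thrash_bad++
        }
        END {
            if (sr_bad == 3)
                print "ALERT: scan rate > 0 for 3 intervals (memory exhaustion)"
            if (rq_bad == 3)
                print "ALERT: run queue > 2x CPUs for 3 intervals (CPU exhaustion)"
            if (thrash_bad == 3)
                print "ALERT: sys CPU > usr CPU with idle < 40% for 3 intervals"
        }'

If this runs from cron, any output is mailed to the crontab owner automatically; a fancier version could feed a paging gateway instead. Per the notes above, skip the thrashing check on NFS servers, and a similar nawk pass over sar -d output could cover the service-time threshold.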