Using Nagios

 

Terminology

* Commands
* Time periods
* Contacts and Contact Groups
* Hosts
* Services
* Host and service escalations

Soft and Hard States

* A problem state stays soft while the check is being retried, and becomes hard once the configured number of check attempts (max_check_attempts) is exhausted
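
The soft-to-hard transition can be sketched in a few lines of shell. This is a simplified illustration, not Nagios code: the function name `classify_states` is made up for this example, results are given as plugin exit codes (0 for OK, anything else for a problem), and the first argument plays the role of max_check_attempts.

```shell
#!/bin/sh
# Simplified sketch of soft vs. hard state tracking (not Nagios source code).
# classify_states MAX_ATTEMPTS RESULT...
#   RESULT is a plugin exit code: 0 = OK, non-zero = problem.
classify_states() {
    max_attempts=$1
    shift
    attempt=0
    for result in "$@"; do
        if [ "$result" -eq 0 ]; then
            # An OK result resets the retry counter.
            attempt=0
            echo "OK"
        elif [ "$attempt" -lt "$max_attempts" ] && [ $((attempt + 1)) -lt "$max_attempts" ]; then
            # Still retrying: the problem is only a soft state.
            attempt=$((attempt + 1))
            echo "PROBLEM SOFT (attempt $attempt/$max_attempts)"
        else
            # Retries exhausted: the problem is now a hard state.
            attempt=$max_attempts
            echo "PROBLEM HARD (attempt $attempt/$max_attempts)"
        fi
    done
}

# Three attempts allowed: two soft results, then hard until recovery.
classify_states 3 0 2 2 2 2 0
```

With three attempts allowed, the first two failures are reported as soft and the third and later ones as hard, until an OK result resets the counter.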

Configuration

/etc/nagios/nagios.cfg

Web Interfaces

* Tactical Overview
* Status Map
* Host information

Plugins

check_ping

# Command options:
# -H: host
# -w: warning thresholds; wrta: round-trip average in ms, wpl%: packet loss percentage
# -c: critical thresholds; crta: round-trip average in ms, cpl%: packet loss percentage
# -p: number of packets, -t: timeout in seconds
# -4|-6: IPv4 or IPv6
check_ping -H <host_address> -w <wrta>,<wpl>% -c <crta>,<cpl>% [-p packets] [-t timeout] [-4|-6]
 
# For example, ping localhost with 5 packets;
# warn if the round-trip average exceeds 3000 ms or packet loss reaches 80%,
# critical if the round-trip average exceeds 5000 ms or packet loss reaches 100%:
$ check_ping -H localhost -w 3000.0,80% -c 5000.0,100% -p 5
PING OK - Packet loss = 0%, RTA = 0.06 ms|rta=0.058000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0
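
The output line above has two parts separated by `|`: the human-readable status text, and performance data in the form `label=value;warn;crit;min;max`. A minimal sketch of splitting it in plain shell follows. The function name `print_perfdata` is invented for this example, and it assumes labels contain no spaces or quotes.

```shell
#!/bin/sh
# Sample plugin output, copied from the check_ping example above.
output='PING OK - Packet loss = 0%, RTA = 0.06 ms|rta=0.058000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0'

status_text=${output%%|*}   # everything before the first '|'
perfdata=${output#*|}       # everything after the first '|'

# print_perfdata PERFDATA: print one line per metric.
# Assumes simple labels (no spaces or quotes), which holds for this sample.
print_perfdata() {
    for metric in $1; do
        label=${metric%%=*}
        fields=${metric#*=}
        # Fields are value;warn;crit;min;max (trailing ones may be missing).
        IFS=';' read -r value warn crit min max <<EOF
$fields
EOF
        echo "$label value=$value warn=$warn crit=$crit"
    done
}

echo "$status_text"
print_perfdata "$perfdata"
```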

check_tcp

check_tcp|check_udp -H host -p port [-w <warning>] [-c <critical>]
           [-s <send string>] [-e <expect string>] [-q <quit string>]
           [-A] [-m <maximum bytes>] [-d <delay>] [-t <timeout>]
           [-r <refuse state>] [-M <mismatch state>] [-v] [-4|-6] 
           [-j] [-D <days to cert expiry>] [-S] [-E]
 
# For example, check localhost on port 80
check_tcp -H localhost -p 80
TCP OK - 0.000 second response time on port 80|time=0.000220s;;;0.000000;10.000000

check_pop, check_spop, check_imap, check_simap

* Similar to check_tcp

 

check_smtp

* Similar to check_tcp
* Port defaults to 25

check_smtp -H host [-p port] [-C command] [-R response] [-f from addr]
           [-F hostname] [-A authtype -U authuser -P authpass]
           [-w <warning time>] [-c <critical time>] [-t timeout]
           [-S] [-D days] [-n] [-4|-6]
check_smtp -H smtp.my.com -p 25

check_ftp

* Similar to check_tcp
* Port defaults to 21, or 990 for SSL
* Expects the standard FTP welcome message

check_ftp -H ftp.my.com

check_dhcp

check_nagios

check_http

check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
           [-w <warning time>] [-c <critical time>] [-t <timeout>]  
           [-L] [-a auth] [-f <ok | warn | critical | follow>]
           [-e <expect>] [-s string] [-l]  
           [-r <regex> | -R <regex>] [-P string]
           [-m <min_pg_size>:<max_pg_size>]  
           [-4|-6] [-N] [-M <age>] [-A string] [-k string] [-S]  
           [-C <age>] [-T <content-type>]
# Examples
check_http -H www.yahoo.com -p 80
check_http -H coe-soa-1 -p 9900 -u /ms/index.do

check_mysql

check_mysql [-H host] [-d database] [-P port]
            [-u user] [-p password] [-S]
 
check_mysql_query -q SQL_query [-w <warn>] [-c <crit>] [-d database]
                  [-H host] [-P port] [-u user] [-p password]

check_pgsql

 check_pgsql [-H <host>] [-P <port>] [-w <warn>] [-c <crit>]
             [-t <timeout>] [-d <database>] [-l <logname>]  
             [-p <password>]

check_oracle

* Requires an Oracle client installation (tnsping)

check_oracle --tns <ORACLE_SID>
             --db <ORACLE_SID>
             --oranames <Hostname>
             --login <ORACLE_SID>
             --cache <ORACLE_SID> <USER> <PASS> <CRITICAL> <WARNING>
             --tablespace <ORACLE_SID> <USER> <PASS>
                          <TABLESPACE> <CRITICAL> <WARNING>

check_swap

Check virtual memory.

check_swap [-a] [-v] -w limit -c limit
# -a: compare the thresholds against each swap partition, one by one
# -w limit: warn if free swap falls below limit
# -c limit: critical if free swap falls below limit

check_ide_smart

check_ide_smart [-d <device>] [-i] [-q] [-1] [-O] [-n]

check_disk

Check disk space.

check_disk 
  -w limit # warn if free space falls below limit
  -c limit  # critical if free space falls below limit
  [-W limit] # warn if free inodes fall below limit
  [-K limit] # critical if free inodes fall below limit
  {-p path # path or partition to check, can be repeated
              | -x device} # device to exclude
  [-C] # clear thresholds
  [-E] # only checks for exact path as specified by -p
  [-e] # displays errors only
  [-g group ] 
  [-k] # same as -u kB
  [-l] # check local file systems only
  [-M] # displays mount point instead of path
  [-m] # same as -u MB
  [-r path ] # regex for path/partition, can be repeated
  [-R path ] # as -r but case insensitive
  [-t timeout] # in seconds, default to 10
  [-u unit] #  bytes, kB, MB, GB, TB, default to MB
  [-v] # verbose
  [-X type] # exclude filesystem type, can be repeated
 
# Examples
check_disk -w 500 -c 10 -p /tmp
DISK OK - free space: /tmp 4449 MB (96% inode=99%);| /tmp=140MB;4340;4830;0;4840

check_disk_smb

Check disk space on remote shares.

check_disk_smb -H <host> -s <share> -u <user> -p <password> 
      -w <warn> -c <crit> [-W <workgroup>] [-P <port>]

check_load

Check system load.

check_load 
  [-r] # divide load averages by the number of CPUs
  -w WLOAD1,WLOAD5,WLOAD15 # warn if the 1-, 5-, or 15-minute load average exceeds its threshold
  -c CLOAD1,CLOAD5,CLOAD15 # critical if the 1-, 5-, or 15-minute load average exceeds its threshold
 
# Example, warn if 1min load average exceeds 10, 5min 8, 15min 5
# critical if 1min load average exceeds 15, 5min 10, 15min 8
check_load -w 10.0,8.0,5.0 -c 15.0,10.0,8.0
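
The three-way threshold comparison can be sketched with a small shell function. This illustrates the logic described above rather than the plugin's actual code: `load_state` is a made-up name, the load averages are passed in by hand instead of being read from the system, and it assumes a threshold is breached only when the value is strictly greater.

```shell
#!/bin/sh
# load_state "L1,L5,L15" "W1,W5,W15" "C1,C5,C15"
# Prints OK, WARNING, or CRITICAL for the given load averages.
# Assumption: a threshold counts as exceeded only when load > threshold.
load_state() {
    echo "$1 $2 $3" | awk '{
        split($1, la, ","); split($2, wl, ","); split($3, cl, ",")
        state = "OK"
        for (i = 1; i <= 3; i++) {
            # Any average over its critical threshold wins immediately.
            if (la[i] + 0 > cl[i] + 0) { state = "CRITICAL"; break }
            # Otherwise remember a warning and keep checking.
            if (la[i] + 0 > wl[i] + 0) state = "WARNING"
        }
        print state
    }'
}

# Same thresholds as the example above, with a 1-minute load of 12.
load_state "12.0,7.0,4.0" "10.0,8.0,5.0" "15.0,10.0,8.0"
```

Here only the 1-minute average is over its warning threshold, so the sketch prints WARNING.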

check_procs

check_procs 
  -w <range> # warn if outside range
  -c <range>  # critical if outside range
  [-m metric] # metric type: PROCS, VSZ, RSS, CPU, ELAPSED
  [-s state] # only scan for processes with one or more of the status flags from the ps command
  [-p ppid] # only scan for child processes of parent ppid 
  [-u user] # only scan for user or user id
  [-r rss] # only scan for processes with rss higher than indicated
  [-z vsz] # only scan for processes with vsz higher than indicated
  [-P %cpu] # only scan for processes with pcpu higher than indicated
  [-a argument-array] # only scan for processes with args that contain string
  [-C command] # only scan for exact matches of command
  [-t timeout] # timeout
  [-v]
 
# Example
# Warn if any process uses more than 10% CPU, critical if more than 20%
check_procs -w 10 -c 20 --metric=CPU

Monitor Logged In User

check_users -w limit -c limit
 
# Example: warn if more than 0 users are logged in, critical if more than 1 user is logged in
check_users -w 0 -c 1
#USERS WARNING - 1 users currently logged in |users=1;0;1;0
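
The status word in plugin output (OK, WARNING, CRITICAL) mirrors the plugin's exit code: 0 is OK, 1 is WARNING, 2 is CRITICAL, and 3 is UNKNOWN. A small wrapper, sketched below, makes that mapping visible when testing plugins by hand. The names `state_name` and `run_check` are made up for this example, any other exit code is reported as UNKNOWN as a simplification, and `true` and `sh -c 'exit 1'` stand in for real plugins so the sketch runs without Nagios installed.

```shell
#!/bin/sh
# state_name EXIT_CODE: translate a plugin exit code into its state name.
# Simplification: any code other than 0, 1, or 2 is reported as UNKNOWN.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# run_check COMMAND [ARGS...]: run a plugin quietly and print its state.
run_check() {
    "$@" >/dev/null 2>&1
    state_name "$?"
}

# Stand-ins for real plugins (exit codes 0 and 1).
run_check true
run_check sh -c 'exit 1'
```

With the plugins on the PATH, the same wrapper could be used as `run_check check_users -w 0 -c 1`.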

