[[TracNav]]

[[TOC(inline,depth=1)]]

Nagios configuration requires both a set of client templates, for commands to be run on clients by the Nagios Remote Plug-in Executor (NRPE), and a set of server templates configuring contacts for alarms, hosts to be monitored, services (also known as sensors) and so on.

= Configuring the Nagios server =

The Nagios server is configured through a set of ''standard'' templates in the [source:templates/trunk/standard/monitoring/nagios ''monitoring/nagios''] namespace. Sensors are also provided for many of the plug-ins described on the [http://www.gridpp.ac.uk/wiki/Nagios_Plugins GridPP wiki]. An [source:templates/trunk/clusters/example-3.1/profiles/nagios-server.example.org.tpl example Nagios server template] is included in the QWG distribution.

== Quickstart ==

To configure a basic Nagios server, include the template [source:templates/trunk/standard/monitoring/nagios/config monitoring/nagios/config] in the server's template. This automatically generates a Nagios configuration that monitors all Quattor-managed machines.

== Monitoring grid services ==

Preliminary work has been done to integrate monitoring of grid services according to the EGEE model (described at https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNcg). Note that, to retrieve SAM results for your site, you will need to request access to the SAM database from your Nagios server, following the procedure described at https://twiki.cern.ch/twiki/bin/view/LCG/SamProgInterfaceACL

The template [source:templates/trunk/standard/monitoring/nagios/ncg_services monitoring/nagios/ncg_services] automatically generates service, service group, service dependency and external info configuration for the CEs and SEs defined in the `CE_HOST` and `SE_HOST` variables. The NCG services configuration template should be included before the core Nagios config template.
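
As a minimal sketch of the include order described above, a server profile might contain the following Pan fragment. It assumes that `CE_HOST` and `SE_HOST` are already defined (for instance by the site configuration) and that your Pan compiler accepts the `include { ... }` form; the rest of the profile is not shown.

{{{
# Sketch only: fragment of a Nagios server profile.
# CE_HOST and SE_HOST are assumed to be defined before this point.

# Grid service definitions come first ...
include { 'monitoring/nagios/ncg_services' };

# ... followed by the core Nagios configuration.
include { 'monitoring/nagios/config' };
}}}

If grid services are not monitored, only the second include is needed, as in the Quickstart.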

= Customising your configuration =

== What is monitored ==

In principle, all hosts present in the `DB_MACHINE` database are expected to be monitored and are added to the Nagios configuration. (This is done via the variable `HOSTSLIST`, which is automatically derived from `DB_MACHINE`.) If you want to monitor additional hosts that are not Quattor-managed, specify them in the variable `NAGIOS_EXTRA_HOSTS`.

Currently, all hosts are given the same settings. Finer-grained settings (for instance, separate sets of hosts to be monitored during working and non-working hours) are not yet implemented.

== Hardware-related monitoring ==

The variable `HW_LISTINGS` is essentially the inverse of `DB_MACHINE`: it is a structure whose fields correspond to hardware attributes and whose values are the lists of nodes with that attribute. For instance, `HW_LISTINGS["per_cpu"]["_4"]` is the list of nodes with 4 CPU cores, whether four single-core Pentiums or a single quad-core Barcelona chip. See [#Variable-index here] for the full description of each variable.

== Specifying commands ==

A structure template with all the commands used at UAM is provided as an example. The variable `NAGIOS_COMMANDS_TEMPLATE` holds the name of a structure template with all the command definitions you need. You can write your own commands template, and even include the example one in it. A commands template is just a set of `"command_name" = "command_definition"` lines, like this:
{{{
"check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$";
"notify-by-email" = "/bin/echo '$SERVICEOUTPUT$' | /bin/mail -s '$SERVICESTATE$ alert for $HOSTALIAS$($HOSTNAME$)/$SERVICEDESC$' $CONTACTEMAIL$";
"check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd";
}}}

== Host aliases and host-specific customization ==

Additional per-host customization is possible with another template, named by the variable `NAGIOS_HOSTALIASES_TEMPLATE`. It is a full template, not a structure one. This is a rather poor way of specifying customizations; better ideas are welcome.
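
The variables `NAGIOS_EXTRA_HOSTS` and `NAGIOS_COMMANDS_TEMPLATE` introduced earlier in this section might be set in the server profile as sketched below. The host names and the template name `site/nagios/commands` are hypothetical, and it is assumed (not confirmed here) that the variables have to be defined before `monitoring/nagios/config` is included.

{{{
# Sketch only: host names and the commands template name are hypothetical.

# Monitor two hosts that are not managed by Quattor (list of FQDNs).
variable NAGIOS_EXTRA_HOSTS = list('router.example.org', 'dns.example.org');

# Name of the site-specific structure template with the command definitions.
variable NAGIOS_COMMANDS_TEMPLATE = 'site/nagios/commands';

include { 'monitoring/nagios/config' };
}}}

The corresponding commands template would then be an ordinary structure template, for example:

{{{
# Sketch only: hypothetical site commands template.
structure template site/nagios/commands;

"check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$";
}}}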

== Host groups ==

A structure template for defining host groups can be supplied with the `NAGIOS_HOSTGROUPS_TEMPLATE` variable.

== Macros ==

User macros can be supplied with the `NAGIOS_MACROS_TEMPLATE` variable. By default, `$USER1$` and `$USER2$` are set to sensible values. The macros file is written with 0600 permissions, so it is safe to place passwords there if needed.

== Time periods ==

A structure template can be supplied with the `NAGIOS_TIMEPERIODS_TEMPLATE` variable. The default templates provide only an '''always''' time period.

== Contacts and contact groups ==

Contacts and contact groups for alarms can be specified with structure templates named by `NAGIOS_CONTACTS_TEMPLATE` and `NAGIOS_CONTACTGROUPS_TEMPLATE`. An example template is provided under '''monitoring/nagios/contacts''', but please adapt it to your needs.

== Services ==

Each service must be provided via a structure template. This way, the Nagios `use` directive can be imitated with just an include. The variable `NAGIOS_SERVICE_TEMPLATES` is a list with the names of the templates that fully describe services. For instance:
{{{
# Inherit all settings of the generic fast-service template ...
include monitoring/nagios/services/fast-service;
# ... and override only the fields specific to this service.
'service_description' = 'DNS response time';
'check_command' = list('check_dns', 'www.cern.ch', '2', '5');
}}}

=== Generic services ===

As seen in the previous example, you can reduce the amount of code by including another structure template and changing the appropriate values. For this, 6 generic templates are provided:

 `generic-service`:: A "generic" service, with the appropriate contact groups and other usual settings. Before an alarm is sent, the check is tried 4 times. Alarms are re-sent every 4 hours.
 `fast-service`:: A `generic-service` checked every 5 minutes, or every minute in case of problems.
 `performance-service`:: A `fast-service` checked every 30 minutes, or every 5 minutes in case of problems.
 `security-service`:: A ''volatile'' `fast-service` checked every 20 minutes.
 `slow-service`:: A `fast-service` checked every hour, or every 20 minutes in case of problems. Alarms are re-sent every 7 hours.
 `expire-service`:: A `fast-service` checked every 48 hours. Alarms are re-sent every 50 hours. It is useful for checking certificate expirations.

=== Checking for a host load ===

The acceptable load on a host depends on the number of CPUs and cores it has. For instance, a load of 4 is problematic on a single-core node, whereas an 8-core node with that same load is wasting CPU. The function `generate-check-load-services` addresses this. It generates service structures based on the ''monitoring/nagios/services/load'' templates for all hosts in `DB_MACHINE`, setting sensible limits. These are:

|| ||'''Load 1'''||'''Load 5'''||'''Load 15'''||
||'''Warning'''||cores*2||cores*1.75||cores*1.5||
||'''Critical'''||cores*3||cores*2||cores*1.75||

The limits are hardcoded in the function; a more flexible approach will be provided (some day).

== External configuration files ==

If you have existing configuration files, or if parts of your configuration are generated by tools other than ncm-nagios, you can integrate them with the `NAGIOS_EXTERNAL_FILES` variable.

== General options ==

The component provides sensible defaults for most of the Nagios options. To override any of these, set the variable `NAGIOS_GENERAL_OPTIONS`.

== Visualizing performance metrics ==

Many service and host checks return performance metrics to the Nagios server. The performance data can be visualized in rrdtool-based plots using pnp4nagios. To enable visualization with pnp4nagios in the web interface, pnp4nagios must be enabled (via `PNP4NAGIOS_ENABLE`) and the base URL for pnp4nagios on the web server must be supplied (via `PNP4NAGIOS_BASE_URL`).
The exact locations of the host and service metrics can be modified via the variables `PNP4NAGIOS_HOST_ACTION_URL` and `PNP4NAGIOS_SERVICE_ACTION_URL`, although by default they are derived from the base location:
{{{
variable PNP4NAGIOS_ENABLE = true;
variable PNP4NAGIOS_BASE_URL = '/nagios/html/pnp4nagios/index.php';
variable PNP4NAGIOS_HOST_ACTION_URL = PNP4NAGIOS_BASE_URL + '?host=$HOSTNAME$';
variable PNP4NAGIOS_SERVICE_ACTION_URL = PNP4NAGIOS_BASE_URL + '?host=$HOSTNAME$&srv=$SERVICEDESC$';
}}}

Every host or service for which visualization is required should contain an `action_url` in its definition. Hosts derived from the generic host template `monitoring/nagios/generic_host` already have this definition:
{{{
"action_url" = PNP4NAGIOS_HOST_ACTION_URL;
}}}

However, the `action_url` must be set explicitly for every service for which it is desired (note that it may not be useful for all services). The `load` service template in the example cluster (template `sites/example/monitoring/nagios/services/load.tpl`) contains an example:
{{{
'action_url' = PNP4NAGIOS_SERVICE_ACTION_URL;
}}}

Note: the above assumes that the Nagios server is configured to collect performance metrics. The default Nagios master template `monitoring/nagios/master` contains the required definitions.

== Passive service checks (NSCA) ==

Besides displaying results or sending notifications for regular checks, Nagios can also receive results from checks that it did not initiate itself. Such checks are called passive checks. They are executed by some mechanism external to the Nagios server (e.g., a cron job on a node), and the check result is sent to the NSCA daemon that runs on the Nagios server. On the node, the `send_nsca` executable submits the results to the NSCA daemon. The NSCA daemon accepts the passive check results and inserts them into Nagios.

Passive service checks are required when using a hierarchy of Nagios servers.
The slave servers, which execute the (active) service checks, submit the check results to the master server.

There are 3 configuration variables that enable parts of the NSCA configuration:
 * `NSCA_SEND_TEMPLATE`: required for nodes that run passive checks, and for slave servers, to submit check results to the master.
 * `NSCA_SUBMIT_RESULT_TEMPLATE`: required for Nagios slave servers to submit check results to the master.
 * `NSCA_DAEMON_TEMPLATE`: required for a Nagios server that must accept passive check results, and for a Nagios master server when using a hierarchy.

The NSCA configuration uses the following variables:
 * `NSCA_PORT`: port on which the server listens (and to which the client sends the results), defaulting to 5667.
 * `NSCA_ENCRYPTION_METHOD`: encryption scheme used by `send_nsca`, defaulting to 1.
 * `NSCA_DECRYPTION_METHOD`: decryption scheme used by the daemon; it should be (and is by default) identical to the encryption method.
 * `NSCA_PASSWORD`: password used for encrypting and decrypting the connection. This variable has no default value and must be set.

== Hierarchy of servers ==

Nagios servers can be used in a hierarchy, with 1 master server and 2 or more slave servers. The master server runs the web interface and handles notifications, performance data etc. The slave servers schedule and execute checks and submit the check results (via NSCA) to the master server. Each slave server runs (a subset of) the checks against (a subset of) the nodes. The master server receives the results for all services and all hosts.

The QWG templates contain some generic templates for Nagios master and slave servers. However, much of the configuration is site-specific. An example setup is provided under the directory `sites/example/site/nagios`. This setup must be modified to match the desired configuration of your site.
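
Because the master and the slaves have to agree on the NSCA settings, the NSCA variables described above could be defined once in a fragment shared by all servers. The following is only a sketch: the password is a placeholder, and whether the numeric variables are expected as longs or as strings is an assumption that should be checked against the templates.

{{{
# Sketch only: value types (long vs. string) are assumptions.
variable NSCA_PORT = 5667;                                  # documented default
variable NSCA_ENCRYPTION_METHOD = 1;                        # documented default
variable NSCA_DECRYPTION_METHOD = NSCA_ENCRYPTION_METHOD;   # must match the sender
variable NSCA_PASSWORD = 'replace-with-a-site-secret';      # no default, must be set
}}}

Deriving `NSCA_DECRYPTION_METHOD` from `NSCA_ENCRYPTION_METHOD` keeps the two values from drifting apart when the encryption scheme is changed later.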

The generic setup consists of the following templates (under `standard/`):
 * `monitoring/nagios/master`: template containing specific settings for the Nagios master server, such as accepting passive checks, running the web interface, collecting performance data etc. Service templates must be included as '''passive''' checks; therefore the variable `NAGIOS_DEFAULT_SERVICE_TEMPLATE` is forced to `monitoring/nagios/services/passive-service`.
 * `monitoring/nagios/slave`: template defining specific Nagios settings for a slave server, such as suppressing notifications, submitting check results and disabling performance data.

The site-specific setup consists of the following templates (under `sites/example`):
 * `site/nagios/master`: contains specific settings for the Nagios master for this cluster; specifically, it includes all host, hostgroup and service templates.
 * `site/nagios/slave-A`: contains specific settings for a Nagios slave server that monitors a subset of the hosts. Only the hosts, hostgroups and services that are monitored by this server must be included.
 * `site/nagios/slave-B`: this template is very similar to the one for `slave-A`, except that it uses a different subset of hosts, hostgroups and services.
 * `site/nagios/common`: site-specific configuration that is common to all servers (master and all slaves).
 * `site/nagios/hosts/`: this directory contains the list of hosts per monitoring slave server (templates `cluster-A` and `cluster-B`) as well as a template aggregating all hosts for the master server (`all`).
 * `site/nagios/hostsgroups/`: this directory contains the list of hostgroups per monitoring slave server and an aggregating template.
 * `site/nagios/config/services/standard-templates`: list of services to be monitored; in the example cluster, there is only 1 service (load), which is monitored for all hosts.
 * `monitoring/nagios/services/load`: this template overrides the standard load template. It uses a variable for the service template (`NAGIOS_DEFAULT_SERVICE_TEMPLATE`) to distinguish the active check (needed by slave servers) from the passive variant (used by the master). Furthermore, it includes support for an `action_url` to visualize performance metrics.
 * `site/nagios/nsca`: various variables for NSCA.
 * `site/nagios/OCP_setup`: this template can optionally be used by Nagios slave servers to push service check results to the master using OCP. This performs much better than running `send_nsca` once per check result: the latency of service checks decreases dramatically. Using this template is recommended. It relies on `monitoring/nagios/nsca/OCP_daemon` (in the standard namespace), which provides the OCP daemon code and the init.d script for the daemon.

Other templates present under the `sites/example/` directory that are not specific to the hierarchy:
 * `site/nagios/config/apache`: example of an Apache configuration for a Nagios web server, with support for X509 certificate-based authentication via https. Users connecting to port 80 of the web server are mapped to the guest user. Because of the complexity of the Apache configuration, this template uses ncm-filecopy to put the configuration files on the server.
 * `site/nagios/config/webinterface`: this template adds the certificate DNs of administrators to the list of people who can access the Nagios web interface via https. The administrators get access to all services, hosts, system and configuration options via the web interface.
 * `site/nagios/site-commands`: commands that are specific to the site; that is, all commands that require non-standard third-party software to be installed on either the Nagios server or the nodes to be monitored. Non-standard here means all software not contained in the rpms nagios-plugins or nagios-plugins-nrpe.

== Variable index ==

|| `NAGIOS_BASE_HOST` || A structure template defining a basic host, with all fields except, perhaps, an alias, set to sensible values. ||
|| `HW_LISTINGS` || Lists all machines, grouped by hardware characteristics. It is set automatically by ''monitoring/nagios/cfg'' and cannot be modified. ||
|| ` per_hardware` || nlist in which keys are the hardware templates (the values of `DB_MACHINE`). The value is another structure: ||
|| `  host_list` || List of all hosts associated with that hardware template. ||
|| `  hardware_structure` || The complete hardware information as described in the hardware template. ||
|| ` per_cpu` || nlist with the machines grouped by number of processors. The key has the form "_n", where n is the number of CPUs. The value is the list of hosts. ||
|| `HOSTSLIST` || Plain nlist with all host structures, extracted from `HW_LISTINGS`. ||
|| `NAGIOS_EXTRA_HOSTS` || List of FQDNs of additional hosts to be monitored. Put here any host outside your CDB that you want to monitor, for instance routers or external DNS servers. ||
|| `NAGIOS_COMMANDS_TEMPLATE` || A structure template with home-made commands. A default set with all commands used today (6/2008) at UAM is provided. ||
|| `NAGIOS_HOSTSALIASES_TEMPLATE` || A template with additional per-host customizations. To be improved. ||
|| `NAGIOS_HOSTGROUPS_TEMPLATE` || A structure template with all hostgroup definitions. ||
|| `NAGIOS_MACROS_TEMPLATE` || A structure template with any user-defined macros, not enclosed by '$' symbols. For instance, for setting `$USER1$`, do: {{{"USER1" = "/usr/lib/nagios/plugins";}}} ||
|| `NAGIOS_TIMEPERIODS_TEMPLATE` || A structure template with time period definitions. ||
|| `NAGIOS_CONTACTS_TEMPLATE` || A structure template with all contacts. ||
|| `NAGIOS_CONTACTGROUPS_TEMPLATE` || A structure template with the different ways the contacts are grouped. ||
|| `NAGIOS_SERVICE_TEMPLATES` || List with the names of the Nagios structure templates that define each sensor. ||
|| `NAGIOS_EXPLICIT_SERVICES` || List with the complete definition of any services you don't want to specify as structure templates. ||
|| `NAGIOS_EXPLICIT_SERVICEGROUPS` || List with the complete definition of any service groups you don't want to specify as structure templates. ||
|| `NAGIOS_EXPLICIT_SERVICEDEPENDENCIES` || List with the complete definition of any service dependencies you don't want to specify as structure templates. ||
|| `NAGIOS_EXPLICIT_SERVICEEXTINFO` || List with the complete definition of any service external info you don't want to specify as structure templates. ||
|| `NAGIOS_GENERAL_OPTIONS` || nlist with all generic options you may want, such as whether to accept external commands or not. Check the Nagios documentation to see what options are available. ||
|| `NAGIOS_EXTERNAL_FILES` || List of external files to be added to the Nagios configuration. ||
|| `NAGIOS_SERVICEEXTINFO_TEMPLATES` || List of templates containing definitions for extended service information. ||
|| `NAGIOS_LOAD_BASE_TEMPLATE` || Base template for the load checks, which are generated by the `generate_load_checks` function. ||
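
To illustrate the list and nlist variables in the index above, the following sketch sets `NAGIOS_GENERAL_OPTIONS` and `NAGIOS_EXTERNAL_FILES`. The option name `check_external_commands` is a directive of the Nagios main configuration file; that the templates accept it under this key and as a boolean is an assumption, and the file path is hypothetical.

{{{
# Sketch only: key name, value type and file path are assumptions.

# Override one general option and keep the defaults for everything else.
variable NAGIOS_GENERAL_OPTIONS = nlist(
    'check_external_commands', true
);

# Add a hand-written configuration file to the generated configuration.
variable NAGIOS_EXTERNAL_FILES = list('/etc/nagios/site/legacy-hosts.cfg');
}}}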