Nagios: System and Network Monitoring


Wolfgang Barth
Nagios: System and Network Monitoring
Munich, San Francisco

NAGIOS. Copyright © 2006 Open Source Press GmbH

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

Printed on recycled paper in the United States of America.

1 2 3 4 5 6 7 8 9 10 - 09 08 07 06

No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

Publisher: William Pollock
Cover Design: Octopod Studios

U.S. edition published by No Starch Press, Inc.
555 De Haro Street, Suite 250, San Francisco, CA 94107
phone: 415.863.9900; fax: 415.863.9950; info@nostarch.com; http://www.nostarch.com

Original edition © 2005 Open Source Press GmbH
Published by Open Source Press GmbH, Munich, Germany
Publisher: Dr. Markus Wirtz
Original ISBN 3-937514-09-0
For information on translations, please contact Open Source Press GmbH, Amalienstr. 45 Rg, 80799 München, Germany
phone +49.89.28755562; fax +49.89.28755563; info@opensourcepress.de; http://www.opensourcepress.de

The information in this book is distributed on an “As Is” basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor Open Source Press GmbH nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.

Library of Congress Cataloging-in-Publication Data

Barth, Wolfgang
Nagios : system and network monitoring / Wolfgang Barth.-- 1st ed.
p. cm.
Includes index.
ISBN 1-59327-070-4
1. Computer networks--Management--Automation. I. Title.
TK5105.5.B374 2005
004.6--dc22
2005026745

Contents

Introduction

From Source Code to a Running Installation

1 Installation
  1.1 Compiling the Source Code
  1.2 Installing and Testing Plugins
    1.2.1 Installation
    1.2.2 Plugin test
  1.3 Configuration of the Web Interface
    1.3.1 Setting Up Apache
    1.3.2 User Authentication

2 Nagios Configuration
  2.1 The Main Configuration File nagios.cfg
  2.2 Objects: an Overview
  2.3 Defining the Machines to Be Monitored, with host
  2.4 Grouping Computers Together with hostgroup
  2.5 Defining Services to Be Monitored with service
  2.6 Grouping Services Together with servicegroup
  2.7 Defining Addressees for Error Messages: contact
  2.8 The Message Recipient: contactgroup
  2.9 When Nagios Needs to Do Something: the command Object
  2.10 Defining a Time Period with timeperiod
  2.11 Templates
  2.12 Configuration Aids for Those Too Lazy to Type
    2.12.1 Defining services for several computers
    2.12.2 One host group for all computers
    2.12.3 Other configuration aids
  2.13 CGI Configuration in cgi.cfg
  2.14 The Resources File resource.cfg

3 Startup
  3.1 Checking the Configuration
  3.2 Getting Monitoring Started
    3.2.1 Manual start
    3.2.2 Automatic start
    3.2.3 Making configuration changes come into effect
  3.3 Overview of the Web Interface

In More Detail...

4 Nagios Basics
  4.1 Taking into Account the Network Topology
  4.2 Forced Host Checks vs. Periodic Reachability Tests
  4.3 States of Hosts and Services

5 Service Checks and How They Are Performed
  5.1 Testing Network Services Directly
  5.2 Running Plugins via Secure Shell on the Remote Computer
  5.3 The Nagios Remote Plugin Executor
  5.4 Monitoring via SNMP
  5.5 The Nagios Service Check Acceptor

6 Plugins for Network Services
  6.1 Standard Options
  6.2 Reachability Test with Ping
    6.2.1 check_icmp as a service check
    6.2.2 check_icmp as a host check
  6.3 Monitoring Mail Servers
    6.3.1 Monitoring SMTP with check_smtp
    6.3.2 POP and IMAP
  6.4 Monitoring FTP and Web Servers
    6.4.1 FTP services
    6.4.2 Web server control via HTTP
    6.4.3 Monitoring Web proxies
  6.5 Domain Name Server under Control
    6.5.1 DNS check with nslookup
    6.5.2 Monitoring the name server with dig
  6.6 Querying the Secure Shell Server
  6.7 Generic Network Plugins
    6.7.1 Testing TCP ports
    6.7.2 Monitoring UDP ports
  6.8 Monitoring Databases
    6.8.1 PostgreSQL
    6.8.2 MySQL
  6.9 Monitoring LDAP Directory Services
  6.10 Checking a DHCP Server
  6.11 Monitoring UPS with the Network UPS Tools

7 Testing Local Resources
  7.1 Free Hard Drive Capacity
  7.2 Utilization of the Swap Space
  7.3 Testing the System Load
  7.4 Monitoring Processes
  7.5 Checking Log Files
    7.5.1 The standard plugin check_log
    7.5.2 The modern variation: check_logs.pl
  7.6 Keeping Tabs on the Number of Logged-in Users
  7.7 Checking the System Time
    7.7.1 Checking the system time via NTP
    7.7.2 Checking system time with the time protocol
  7.8 Regularly Checking the Status of the Mail Queue
  7.9 Keeping an Eye on the Modification Date of a File
  7.10 Monitoring UPSs with apcupsd
  7.11 Nagios Monitors Itself
    7.11.1 Running the plugin manually with a script
    7.11.2 check_nagios as a tool for CGI programs
  7.12 Hardware Checks with LM Sensors
  7.13 The Dummy Plugin for Tests

8 Manipulating Plugin Output
  8.1 Negating Plugin Results
  8.2 Inserting Hyperlinks with urlize

9 Executing Plugins via SSH
  9.1 The check_by_ssh Plugin
  9.2 Configuring SSH
    9.2.1 Generating SSH key pairs on the Nagios server
    9.2.2 Setting up the user nagios on the target host
    9.2.3 Checking the SSH connection and check_by_ssh
  9.3 Nagios Configuration

10 The Nagios Remote Plugin Executor (NRPE)
  10.1 Installation
    10.1.1 Distribution-specific packages
    10.1.2 Installation from the source code
  10.2 Starting via the inet Daemon
    10.2.1 xinetd configuration
    10.2.2 inetd configuration
  10.3 NRPE Configuration on the Computer to Be Monitored
    10.3.1 Passing parameters on to local plugins
  10.4 Nagios Configuration
    10.4.1 NRPE without passing parameters on
    10.4.2 Passing parameters on in NRPE
    10.4.3 Optimizing the configuration
  10.5 Indirect Checks

11 Collecting Information Relevant for Monitoring with SNMP
  11.1 Introduction to SNMP
    11.1.1 The Management Information Base
    11.1.2 SNMP protocol versions
  11.2 NET-SNMP
    11.2.1 Tools for SNMP requests
    11.2.2 The NET-SNMP daemon
  11.3 Nagios's Own SNMP Plugins
    11.3.1 The generic SNMP plugin check_snmp
    11.3.2 Checking several interfaces simultaneously
    11.3.3 Testing the operating status of individual interfaces
  11.4 Other SNMP-based Plugins
    11.4.1 Monitoring hard drive space and processes with nagios-snmp-plugins
    11.4.2 Observing the load on network interfaces with check-iftraffic
    11.4.3 The manubulon.com plugins for special application purposes

12 The Nagios Notification System
  12.1 Who Should be Informed of What, When?
  12.2 When Does a Message Occur?
  12.3 The Message Filter
    12.3.1 Switching messages on and off systemwide
    12.3.2 Enabling and suppressing computer and service-related messages
    12.3.3 Person-related filter options
    12.3.4 Case examples
  12.4 External Notification Programs
    12.4.1 Notification via e-mail
    12.4.2 Notification via SMS
  12.5 Escalation Management
  12.6 Dependencies between Hosts and Services as a Filter Criterion
    12.6.1 The standard case: service dependencies
    12.6.2 Only in exceptional cases: host dependencies

13 Passive Tests with the External Command File
  13.1 The Interface for External Commands
  13.2 Passive Service Checks
  13.3 Passive Host Checks
  13.4 Reacting to Out-of-Date Information of Passive Checks

14 The Nagios Service Check Acceptor (NSCA)
  14.1 Installation
  14.2 Configuring the Nagios Server
    14.2.1 The configuration file nsca.cfg
    14.2.2 Configuring the inet daemon
  14.3 Client-side Configuration
  14.4 Sending Test Results to the Server
  14.5 Application Example I: Integrating syslog and Nagios
    14.5.1 Preparing syslog-ng for use with Nagios
    14.5.2 Nagios configuration: volatile services
    14.5.3 Resetting error states manually
  14.6 Application Example II: Processing SNMP Traps
    14.6.1 Receiving traps with snmptrapd
    14.6.2 Passing on traps to NSCA
    14.6.3 The matching service definition

15 Distributed Monitoring
  15.1 Switching On the OCSP/OCHP Mechanism
  15.2 Defining OCSP/OCHP Commands
  15.3 Practical Scenarios
    15.3.1 Avoiding redundancy in configuration files
    15.3.2 Defining templates

16 The Web Interface
  16.1 Recognizing and Acting On Problems
    16.1.1 Comments on problematic hosts
    16.1.2 Taking responsibility for problems: acknowledgements
  16.2 An Overview of the Individual CGI Programs
    16.2.1 Variations in status display: status.cgi
    16.2.2 Additional information and control center: extinfo.cgi
    16.2.3 Interface for external commands: cmd.cgi
    16.2.4 The most important things at a glance: tac.cgi
    16.2.5 Network plan: the topological map of the network (statusmap.cgi)
    16.2.6 Navigation in 3D: statuswrl.cgi
    16.2.7 Querying the status with a cell phone: statuswml.cgi
    16.2.8 Analyzing disrupted partial networks: outages.cgi
    16.2.9 Querying the object definition with config.cgi
    16.2.10 Availability statistics: avail.cgi
    16.2.11 What events occur, how often? histogram.cgi
    16.2.12 Filtering log entries after specific states: history.cgi
    16.2.13 Who was told what, when? notifications.cgi
    16.2.14 Showing all logfile entries: showlog.cgi
    16.2.15 Evaluating whatever you want: summary.cgi
    16.2.16 Following states graphically over time: trends.cgi
  16.3 Planning Downtimes
    16.3.1 Maintenance periods for hosts
    16.3.2 Downtime for services
  16.4 Additional Information on Hosts and Services
    16.4.1 Extended host information
    16.4.2 Extended service information
  16.5 Configuration Changes through the Web Interfaces: the Restart Problem

17 Graphic Display of Performance Data
  17.1 Processing Plugin Performance Data with Nagios
    17.1.1 The template mechanism
    17.1.2 Using external commands to process performance data
  17.2 Graphs for the Web with Nagiosgraph
    17.2.1 Basic installation
    17.2.2 Configuration
  17.3 Preparing Performance Data for Evaluation with Perf2rrd
    17.3.1 Installation
    17.3.2 Nagios configuration
    17.3.3 Perf2rrd in practice
  17.4 The Graphics Specialist drraw
    17.4.1 Installation
    17.4.2 Configuration
    17.4.3 Practical application
  17.5 Automated to a Large Extent: NagiosGrapher
    17.5.1 Installation
    17.5.2 Configuration
  17.6 Other tools and the limits of graphic evaluation

Special Applications

18 Monitoring Windows Servers
  18.1 NSClient and NC_Net
    18.1.1 Installation
    18.1.2 The check_nt plugin
    18.1.3 Commands which can be run with NSClient and NC_Net
    18.1.4 Advanced functions of NC_Net
  18.2 NRPE for Windows: NRPE_NT
    18.2.1 Installation and configuration
    18.2.2 Function test
    18.2.3 The Cygwin plugins
    18.2.4 Perl plugins in Windows

19 Monitoring Room Temperature and Humidity
  19.1 Sensors and Software
    19.1.1 The PCMeasure software for Linux
    19.1.2 The query protocol
  19.2 The Nagios Plugin check_pcmeasure

20 Monitoring SAP Systems
  20.1 Checking without a Login: sapinfo
    20.1.1 Installation
    20.1.2 First test
    20.1.3 The plugin check_sap.sh
  20.2 Monitoring with SAP's Own Monitoring System (CCMS)
    20.2.1 CCMS: a short overview
    20.2.2 Obtaining the necessary SAP usage permissions for Nagios
    20.2.3 Monitors and templates
    20.2.4 The CCMS plugins
    20.2.5 Performance optimization

Appendixes

A Rapidly Alternating States: Flapping
  A.1 Flap Detection with Services
    A.1.1 Nagios configuration
    A.1.2 The history memory and the chronological progression of the changes in state
    A.1.3 Representation in the Web interface
  A.2 Flap Detection for Hosts
B Event Handlers
  B.1 Execution Times for the Event Handler
  B.2 Defining the Event Handler in the Service Definition
  B.3 The Handler Script
  B.4 Things to Note When Using Event Handlers

C Writing Your Own Plugins: Monitoring Oracle with the Instant Client
  C.1 Installing the Oracle Instant Client
  C.2 Establishing a Connection to the Oracle Database
  C.3 A Wrapper Plugin for sqlplus
    C.3.1 How the wrapper works
    C.3.2 The Perl plugin in detail

D An Overview of the Nagios Configuration Parameters
  D.1 The Main Configuration File nagios.cfg
  D.2 CGI Configuration in cgi.cfg
    D.2.1 Authentication parameters
    D.2.2 Other Parameters

Index

Introduction

It's ten o'clock on Monday morning. The boss of the branch office is in a rage. He's been waiting for hours for an important e-mail, and it still hasn't arrived. It can only be the fault of the mail server; it's probably hung yet again. But a quick check of the computer shows that no mails have got stuck in the queue there, and there's no mention either in the log file that a mail from the sender in question has arrived. So where's the problem?

The central mail server of the company doesn't respond to a ping. That's probably the root of the problem. But the IT department at the company head office absolutely insists that it is not to blame. It also cannot ping the mail node of the branch office, but it maintains that the network at the head office is running smoothly, so the problem must lie with the network at the branch office. The search for the error continues...
The humiliating result: the VPN connection to head office was down, and although the ISDN backup connection was working, no route to the head office (and thus to the central mail server) was defined in the backup router. A globally operating IT service provider, for whom something like this “just doesn't happen”, was responsible for the network connections (VPN and ISDN) between branch and head office. The end result: many hours spent searching for the error, an irritated boss (the meeting for which the e-mail was urgently required has long since finished), and a sweating admin.

With a properly configured Nagios system, the administrator would already have noticed the problem at eight in the morning and been able to isolate its cause within a few minutes. Instead of valuable time being lost, the IT service provider would have been informed directly. The time then required to eliminate the error (in this case, half an hour) would have been sufficient to deliver the e-mail in time.

A second example: somewhere in Germany, the hard drive on which the central Oracle database for a hospital stores its log files reaches full capacity. Although this does not cause the “lights to go out” in the operating room, the database stops working and there is considerable disruption to work procedures: patients cannot be admitted, examination results cannot be saved, and reports cannot be documented until the problem has been fixed.

If the critical hard drive had been monitored with Nagios, the IT department would have been warned at an early stage. The problem would not even have occurred.

With personnel resources becoming more and more scarce, no IT department can really afford to regularly check all systems manually. Networks that are growing more and more complex make it especially important to be informed early on of disruptions that have occurred or of problems that are about to happen. Nagios, the Open Source tool for system and network monitoring, helps the administrator to detect problems before the phone rings off the hook.
The aim of the software is to inform administrators quickly about questionable (WARNING) or critical (CRITICAL) conditions. What is regarded as “questionable” or “critical” is defined by the administrator in the configuration. A Web page summary then informs the administrator of normally working systems and services, which Nagios displays in green, of questionable conditions (yellow), and of critical situations (red). Depending on the services or systems concerned, the administrators in charge can also be informed selectively by e-mail or by paging services such as SMS.

By concentrating on traffic light states (green, yellow, red), Nagios is distinct from network tools that display values graphically over time (for example the load of a WAN interface or a CPU throughout an entire day) or that record and measure network traffic (how high was the proportion of HTTP on a particular interface?). Nagios is concerned plainly and simply with the issue of whether everything is on a green light. The software does an excellent job in looking after this, not just in terms of the current status but also over long periods of time.

The tests

When checking critical hosts and services, Nagios distinguishes between host and service checks. A host check tests a computer, called a host in Nagios slang, for reachability; as a rule, a simple ping is used. A service check selectively tests individual network services such as HTTP, SMTP, DNS, etc., but also running processes, CPU load, or log files. Host checks are performed by Nagios irregularly and only where required, for example if none of the services to be monitored can be reached on the host being monitored. As long as one service can be addressed there, the computer as a whole is evidently reachable, so this test can be dropped.

The simplest test for network services consists of looking to see whether the relevant target port is open, and whether a service is listening there. But this does not necessarily mean that, for example, the SSH daemon really is running on TCP port 22.
Nagios therefore uses tests for many services that go several steps further. For SMTP, for example, the software tests whether the mail server also announces itself with a “220” output, the so-called SMTP greeting; and for a PostgreSQL database, it checks whether this will accept an SQL query.

Nagios becomes especially interesting through the fact that it takes into account dependencies in the network topology (if it is configured to do so). If the target system can only be reached through a particular router that has just gone down, then Nagios reports that the target system is “unreachable”, and does not bother to bombard it with further host and service checks. The software puts administrators in a position where they can more quickly detect the actual cause and rectify the situation.

The suppliers of information

The great strength of Nagios, even in comparison with other network monitoring tools, lies in its modular structure: the Nagios core does not contain one single test. Instead it uses external programs for service and host checks, which are known as plugins. The basic equipment already contains a number of standard plugins for the most important application cases. Special requests that go beyond these are answered, provided that you have basic programming knowledge, by plugins that you can write yourself. Before you invest time developing these, however, it is first worth taking a look on the Internet and browsing through the relevant mailing lists,[1] as there is lively activity in this area. Ready-to-use plugins are available, especially on the Nagios exchange platform, http://www.nagiosexchange.org/.

A plugin is a simple program, often just a shell script (Bash, Perl etc.), that gives out one of the four possible conditions OK, WARNING, CRITICAL, or (with operating errors, for example) UNKNOWN.
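As a rough illustration of the plugin idea described above, the following Python sketch maps an SMTP greeting onto the four states. It assumes the usual plugin convention that the state is reported through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) together with one line of text. The function names and the decision to treat a reply other than 220 as WARNING are illustrative assumptions, not part of Nagios or its standard plugins.

```python
"""Sketch of a Nagios-style plugin that checks the SMTP greeting (hypothetical)."""
import socket

# Exit codes by which a plugin reports its state to Nagios.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_greeting(banner):
    """Map the first line sent by the server to a (state, message) pair."""
    banner = banner.strip()
    if banner.startswith("220"):
        return OK, "greeting received: " + banner
    if banner:
        # Assumption for this sketch: any other reply is only "questionable".
        return WARNING, "unexpected reply: " + banner
    return CRITICAL, "no greeting received"


def check_smtp(host, port=25, timeout=10.0):
    """Open a TCP connection and classify whatever the server sends first."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(512).decode("ascii", errors="replace")
    except OSError as exc:
        # An open port alone is not enough, and a closed one is a clear failure.
        return CRITICAL, "connection failed: {}".format(exc)
    return classify_greeting(banner)


def main(argv):
    """Print one status line and return the exit code for the caller."""
    if len(argv) != 2:
        print("SMTP UNKNOWN - usage: check_smtp_sketch.py <host>")
        return UNKNOWN
    state, message = check_smtp(argv[1])
    print("SMTP {} - {}".format(STATE_NAMES[state], message))
    return state
```

A real plugin would finish by handing the state to the operating system, for example with `sys.exit(main(sys.argv))`, so that the calling program can read it as the exit code.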
Because a plugin only has to report one of these states, Nagios can in principle test everything that can be measured or counted electronically: the temperature and humidity in the server room, the amount of rainfall, the presence of persons in a certain room at a time when nobody should enter it. There are no limits to this, provided that you can find a way of providing measurement data or events as information that can be evaluated by computer (for example, with a temperature and humidity sensor, an infrared sensor, etc.). Apart from the standard plugins, this book accordingly introduces further freely available plugins, such as the use of a plugin to query a temperature and humidity sensor in Chapter 19.

[1] http://www.nagios.org/support/mailinglists.php

Keeping admins up-to-date

Nagios possesses a sophisticated notification system. On the sender side (that is, with the host or service check) you can configure which groups of persons, the so-called contact groups, are informed about which conditions or events (failure, recovery, warnings etc.), and when. On the receiver side you can also define on multiple levels what is to be done with a corresponding message, for example whether the system should forward it, depending on the time of day, or discard the message.

If a specific service is to be monitored seven days a week round the clock, for example, this does not mean that the administrator in charge will never be able to take a break: instead, you can instruct Nagios to notify the person only from Mondays to Fridays between 8am and 5pm, every two hours at the most. If the administrator in charge is not able to solve the problem within a specified period of time, eight hours for example, then the head of department responsible should receive a message. This is also known as escalation management. The corresponding configuration is explained in Section 12.5.
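The notification rule in this example (weekdays only, 8am to 5pm, at most one message every two hours, escalation after eight hours) can be modelled in a few lines of Python. This is only a sketch of the decision logic for illustration. It is not how Nagios implements its filters, and all function and constant names are invented here.

```python
from datetime import datetime, timedelta

WORK_START_HOUR = 8                  # 8am
WORK_END_HOUR = 17                   # 5pm
MIN_INTERVAL = timedelta(hours=2)    # at most one message every two hours
ESCALATE_AFTER = timedelta(hours=8)  # then the head of department is informed


def in_notification_period(now):
    """True from Monday to Friday between 8am and 5pm."""
    return now.weekday() < 5 and WORK_START_HOUR <= now.hour < WORK_END_HOUR


def should_notify(now, last_sent):
    """Decide whether the administrator in charge gets a message now."""
    if not in_notification_period(now):
        return False
    return last_sent is None or now - last_sent >= MIN_INTERVAL


def should_escalate(now, problem_since):
    """True once the problem has been open for the escalation limit or longer."""
    return now - problem_since >= ESCALATE_AFTER
```

For instance, `should_notify(datetime(2006, 1, 2, 10, 0), None)` is true because 2 January 2006 was a Monday, while the same call for a Saturday returns false.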
Nagios can also make use of freely configurable, external programs for notifications, so that you can integrate any system you like: from e-mail to SMS to a voice server that the administrator calls to receive a voice message concerning the error.

With its Web interface (Chapter 16), Nagios provides the administrator with a wide range of information, clearly arranged according to the issues involved. Whether the admin needs a summary of the overall situation, a display of problematic services and hosts and the causes of network outages, or the status of entire groups of hosts or services, Nagios provides an individually structured information page for nearly every purpose.

Through the Web front end, an administrator can inform colleagues upon accepting a particular problem so that they can concentrate on other things that have not yet been seen to. Information already obtained can be stored as comments on hosts and services, just like scheduled downtimes: Nagios prevents false alarms going off in these periods.

By reviewing past events, the Web interface can reveal what problems occurred in a selected time interval, who was informed, and what the availability of a host and/or services was during a particular time period, all this also taking account of downtimes, of course.

Taking in information from outside

For tests, notifications, etc., Nagios makes use of external programs, but the reverse is also possible: through a separate interface (see Section 13.1), independent programs can send status information and commands to Nagios. The Web interface makes widespread use of this possibility, which allows the administrator to send interactive commands to Nagios. But a backup program unknown to Nagios can also transmit a success or failure to Nagios, as can a syslog daemon; there is no limit to the possibilities here.
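To make this interface more concrete, here is a hedged Python sketch of how a backup script might report its result as a passive service check. It assumes the external command line format `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output` and a command file (a named pipe) at /usr/local/nagios/var/rw/nagios.cmd. That path depends on the local installation, and the helper names are invented for this example.

```python
import time

# Assumed location of the external command file (a named pipe); installation-specific.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_service_result(host, service, return_code, output, timestamp=None):
    """Build one external command line that submits a passive service check result."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    if timestamp is None:
        timestamp = int(time.time())
    # Keep the command on a single line.
    text = output.replace("\n", " ")
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}\n".format(
        timestamp, host, service, return_code, text)


def submit(line, command_file=COMMAND_FILE):
    """Write the command to the pipe; Nagios must be running and reading from it."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

A backup job could then call `submit(format_service_result("backupserver", "nightly-backup", 0, "backup finished"))`, where the host and service names are hypothetical and must match objects defined in the Nagios configuration.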
Thanks to this interface, Nagios allows distributed monitoring. This involves several decentralized Nagios installations sending their test results to a central instance, which then helps to maintain an overview of the situation from a central location.

Other tools for network monitoring

Nagios is not the only tool for monitoring systems and networks. The most well-known “competitor,” perhaps on an equal footing, is Big Brother (BB). Despite a number of differences, its Web interface also serves the same purpose as that of Nagios: displaying to the administrator what is in the “green area” and what is not.

The reason why the author uses Nagios instead of Big Brother lies in the license for Big Brother, which the BB homepage[2] calls the Better Than Free License: the product continues to be commercially developed and distributed. If you use BB and earn money with it, you must buy the software. The fact that the software, including the source code, may not be passed on or modified except with the explicit permission of the vendor means that it cannot be reconciled with the criteria for Open Source licenses. This means that Linux distributors have their hands tied.

For the graphical display of certain measured values over a period of time, such as the load on a network interface, CPU load, or the number of mails per minute, there are other tools that perform this task better than Nagios. The original tool is certainly the Multi Router Traffic Grapher (MRTG),[3] which, despite growing competition, still enjoys great popularity. A relatively young but very powerful alternative is Cacti:[4] this has a larger range of applications, can be configured via a Web interface, and avoids the restrictions of MRTG, which can only display two measured values at the same time and cannot display any negative values.

Nagios itself can also display performance data graphically, using extensions (Chapter 17). In many cases this is sufficient, but for very demanding requirements, the use of Nagios in tandem with a graphic representation tool such as MRTG or Cacti is recommended.
[2] http://www.bb4.org/
[3] http://www.mrtg.org/
[4] http://www.cacti.net/

About This Book

This book is directed at network administrators who want to find out about the condition of their systems and networks using an Open Source tool. It describes Nagios version 2.0, which is somewhat different from its predecessors in its configuration. The plugins, on the other hand, lead their own lives, are to a great extent independent of Nagios, and are therefore not restricted to a particular version.

Even though this book is based on Linux as the operating system for the Nagios computer, this is not a requirement. Most descriptions also apply to other Unix systems;[5] only system-specific details such as start scripts need to be adjusted accordingly. Nagios currently does not work under Windows, however.

The first part of this book deals with getting Nagios up and running as quickly as possible with a simple configuration, but one that is sufficient for many uses. This is why Chapters 1 through 3 do not have detailed descriptions and treatments of all options and features.

These are examined in the second part of the book. Chapter 4 looks at the details of service and host checks, and in particular introduces their dependency on network topologies. The options available to Nagios for implementing service checks and obtaining their results are described in Chapter 5.

This is followed by the presentation of individual standard plugins and a number of additional, freely obtainable plugins: Chapter 6 takes a look at the plugins that inspect the services of a network protocol directly from the Nagios host, while Chapter 7 summarizes plugins that need to be installed on the machine that is being monitored, and for which Nagios needs additional utilities to get them running. Several auxiliary plugins, which do not perform any tests themselves, but manipulate already established results, are introduced in Chapter 8.
[5] For example, *BSD, HP-UX, AIX, and Solaris; the author does not know of any Nagios versions running under Mac OS X.

Two utilities that Nagios requires to run local plugins on remote hosts are introduced in the two subsequent chapters: Chapter 9 describes the use of SSH, while Chapter 10 introduces a daemon developed specifically for Nagios.

Wherever networks are being monitored, SNMP also needs to be implemented. Chapter 11 not only describes SNMP-capable plugins but also examines the protocol and the SNMP world itself in detail, providing the background knowledge needed for this.

The Nagios notification system is introduced in Chapter 12, which also deals with notification using SMS, escalation management, and taking account of dependencies. The interface for external commands is discussed in Chapter 13; this forms the basis of other Nagios mechanisms, such as the Nagios Service Check Acceptor (NSCA), a client-server mechanism for transmitting passive test results, covered in Chapter 14. The use of this is shown in two concrete examples: integrating syslog-ng and processing SNMP traps. NSCA is also a requirement for distributed monitoring, discussed in Chapter 15.

Even though you may have already used the Web interface, you might still be wondering about all the detailed options that this offers. Chapter 16 tries to answer this question as completely as possible, supported by very helpful screenshots. It also describes a series of parameters which until now have not been documented anywhere, except in the source code.

Although in its operation Nagios concentrates primarily on traffic light signals (red-yellow-green), there are ways of evaluating and representing the performance data provided by plugins, which are described in detail in Chapter 17.

Networks are rarely homogeneous, that is, equipped only with Linux and other Unix-based operating systems. For this reason Chapter 18 demonstrates what utilities can be used to integrate and monitor Windows systems.
Chapter 19 uses the example of a low-cost hardware sensor to show how room temperature and humidity can be monitored simply yet effectively.

Nagios can also monitor proprietary commercial software, as long as mechanisms that query the states of the system are available and can be integrated into a plugin. In Chapter 20, this is described using an SAP R/3 system.

The appendix Nagios Configuration introduces all the parameters of the two central configuration files nagios.cfg and cgi.cfg, while the appendices Rapidly Changing States: Flapping and Event Handler are devoted to some useful but somewhat exotic features.

Further notes on the book

At the time of going to press, Nagios 2.0 is close to completion. When this book is on the market, there could well be some modifications. Relevant notes, as well as corrections, in case some errors have slipped into the book, can be found at http://linux.swobspace.net/books/nagios/.

Note of Thanks

Many people have contributed to the success of this book. My thanks go first of all to Dr. Markus Wirtz, who initiated this book with his comment, “Why don’t you write a Nagios book, then?!”, when he refused to accept my Nagios activities as an excuse for delays in writing another book. I would also like to thank the two technical editors, Steffen Waitz and Jörg Linge, for their support. A very special thanks goes to Patricia Jung, who, as the technical editor for the German language version, overhauled the manuscript and pestered me with thousands of questions, which was a good thing for the completeness of the book, and which has ultimately made it easier for the reader to understand.

From Source Code to a Running Installation

Chapter 1: Installation

The simplest method of installation is for you to install the Nagios packages that are supplied with the distribution you are using. However, Nagios 2.0 is relatively new, so you may have to make do with an older Nagios version using this method.
Configuring such an older version is quite different from configuring the version 2.0 described here, which is why it is recommended that you take things into your own hands and compile Nagios yourself if the distributor does not provide any Nagios 2.0 packages.

If you are compiling Nagios yourself, you also have an influence on directory structures and several other parameters. A Nagios system compiled in this way also provides an almost complete main configuration file, in which, initially, nothing has to be changed. But it should be mentioned here that compiling Nagios yourself might involve a laborious search for the necessary development packages, depending on what is already installed on the computer.

For compiling Nagios itself you require gcc, make, autoconf and automake. Required libraries are libgd[1] and openssl[2]. The development packages for these must also be installed (depending on the distribution, with either the ending -dev or -devel): libssl-dev, libgd-dev, libc6-dev.

For the plugins it is recommended that you also install the following packages at the same time: ntpdate,[3] snmp,[4] smbclient,[5] libldap2, and libldap2-dev,[6] as well as the client and developer packages for the database to be used (e.g., postgresql-client and postgresql-dev).
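Before starting the build, it can save time to check which of the tools named above are actually present. The following is only a minimal sketch: the function name missing_tools is invented for this illustration and is not part of Nagios or of any distribution.

```shell
# Print every command from the argument list that is not found in PATH.
# Returns 0 if all commands are present, 1 if at least one is missing.
# (missing_tools is a hypothetical helper, not a Nagios tool.)
missing_tools() (
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    exit "$status"
)

# Usage for the build tools named in the text:
#   missing_tools gcc make autoconf automake
```

The check covers only executables. Missing libraries and header files are reported later by the configure script itself.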
1.1 Compiling the Source Code

The Nagios source code itself is available for download on the project page, http://www.nagios.org/. The following installation description uses a beta version that has been released,[7] and that is provided by the developers as a tarball:

    linux:~ # mkdir /usr/local/src
    linux:~ # cd /usr/local/src
    linux:local/src # tar xvzf path/to/nagios-2.0b3.tar.gz

The three commands unpack the source code into the directory created for this purpose, /usr/local/src. When this is done, a subdirectory with the name nagios-2.0b3 is also created. Before the actual compilation and installation, the groups required for operation, namely nagios and nagcmd, are set up with groupadd, and the user nagios, who is assigned to these groups and with whose permissions the Nagios server runs, is set up with useradd:

    linux:~ # groupadd -g 9000 nagios
    linux:~ # groupadd -g 9001 nagcmd
    linux:~ # useradd -u 9000 -g nagios -G nagcmd -d /usr/local/nagios \
        -c "Nagios Admin" nagios

Instead of the user ID (9000) and group IDs (9000 or 9001) used here, any other (available) ID may be used. The primary group nagios of the user nagios should remain reserved exclusively for this user.

[1] http://www.boutell.com/gd/
[2] http://www.openssl.org/ Depending on the distribution, the required RPM and Debian packages are sometimes named differently. Here you need to refer to the search help in the corresponding distribution. For Debian, the homepage will be of help. If a configure instruction complains, for example, of a missing gd.h file, you can search specifically at http://www.debian.org/distrib/packages for the contents of packages. The search will then come up with all packages that contain the file gd.h.
[3] http://ntp.isc.org/bin/view/Main/SoftwareDownloads
[4] http://net-snmp.sourceforge.net/
[5] http://samba.org/samba/
[6] http://www.openldap.org/
[7] The final version of Nagios 2.0 was not yet available at the time of going to press.
The Nagios CGI scripts run under the user ID of the user with whose permissions the Apache Web server runs. In order that this user can access certain protected areas of Nagios, an additional group is required, the so-called Nagios Command Group nagcmd: only the Web user and the user nagios should belong to this group. The Web user can be determined from the Apache configuration file:

    linux:~ # grep "^User" /etc/httpd/httpd.conf
    User www-data
    linux:~ # usermod -G nagcmd www-data

In the example, the Web user is called www-data. The command usermod (this changes the data for an existing user account) also includes the Web user in the nagcmd group thanks to the -G option, by manipulating the corresponding entry in the file /etc/group.

The Apache configuration file is not always located in the directory /etc/httpd/; depending on the distribution and the Apache version used, the directory could also be called /etc/apache or /etc/apache2; the configuration file itself is sometimes called apache.conf or apache2.conf.

In addition, the directory specified as the home directory of the user nagios, /usr/local/nagios, the configuration directory /etc/nagios, and the directory /var/nagios, which records variable data while Nagios is running, are set up manually and are assigned to the user nagios and to the group of the same name:

    linux:~ # mkdir /usr/local/nagios /etc/nagios /var/nagios
    linux:~ # chown nagios:nagios /usr/local/nagios /etc/nagios /var/nagios

You now change to the directory with the Nagios sources to prepare these for compilation:

    linux:~ # cd /usr/local/src/nagios-2.0b3
    linux:src/nagios-2.0b3 # ./configure \
        --sysconfdir=/etc/nagios \
        --localstatedir=/var/nagios \
        --with-command-group=nagcmd

For the configure command, parameters are specified that differ from the standard; Table 1.1 lists the most important of these.
The values chosen here ensure that the installation routine selects the directories used here in the book and that all parameters are correctly set when the main configuration file is generated. This considerably simplifies the fine-tuning of the configuration.

If --prefix is not specified, Nagios installs itself in the directory /usr/local/nagios. We recommend that you stick to this directory.[8]

[8] In accordance with the Filesystem Hierarchy Standard (FHS), version 2.3, local programs loaded by the administrator should be installed in /usr/local.

Table 1.1: Installation parameters for Nagios

    Property                           Value               configure option
    Root directory                     /usr/local/nagios   --prefix
    Configuration directory            /etc/nagios         --sysconfdir
    Directory for variable data        /var/nagios         --localstatedir
    Nagios user (user ID)              nagios (9000)       --with-nagios-user
    Nagios group (group ID)            nagios (9000)       --with-nagios-group
    Nagios Command Group (group ID)    nagcmd (9001)       --with-command-group

The system normally stores its configuration files in the directory etc beneath its root directory. In general it is better to store these in the /etc hierarchy, however. Here we use /etc/nagios.[9]

Variable data such as the log file and the status file are by default stored by Nagios in the directory /usr/local/nagios/var. This is in the /usr hierarchy, which should only contain programs and other read-only files, not writable ones. In order to ensure that this is the case, we use /var/nagios.[10]

Irrespective of these changes, in most cases configure does not run through faultlessly the very first time, since one package or another is missing. For required libraries such as libgd, Nagios almost always demands the relevant developer package with the header files (here, libgd-dev or libgd-devel). Depending on the distribution, their names end in -devel or -dev.
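Whether a header file such as gd.h is present can also be checked before configure is run. The following sketch assumes that headers live in the directories passed to it; the function name find_header and the directory list in the usage comment are assumptions made for this illustration, not something configure itself uses.

```shell
# Search the given include directories for a header file and print the
# first match. Returns 1 if the header is not found in any of them.
# (find_header is a hypothetical helper for illustration only.)
find_header() (
    header=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$header" ]; then
            echo "$dir/$header"
            exit 0
        fi
    done
    exit 1
)

# Usage (the directory list is an assumption; adjust it for your system):
#   find_header gd.h /usr/include /usr/local/include \
#       || echo "gd.h missing: install libgd-dev or libgd-devel"
```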
After all the tests have been run through, configure presents a summary of all the important configuration parameters:

    *** Configuration summary for nagios 2.0b3 04-03-2005 ***:

     General Options:
     -------------------------
            Nagios executable:  nagios
            Nagios user/group:  nagios,nagios
           Command user/group:  nagios,nagcmd
                Embedded Perl:  no
                 Event Broker:  yes
              Install $prefix:  /usr/local/nagios
                    Lock file:  /var/nagios/nagios.lock
               Init directory:  /etc/init.d
                      Host OS:  linux-gnu

     Web Interface Options:
     ------------------------
                     HTML URL:  http://localhost/nagios/
                      CGI URL:  http://localhost/nagios/cgi-bin/
     Traceroute (used by WAP):  /usr/sbin/traceroute

[9] This is not entirely compatible with FHS 2.3, which would prefer to have the configuration files in /etc/local/nagios.
[10] This also does not quite match the requirements of FHS 2.3. But since Nagios makes no differentiation between spool, cache, and status information, an FHS-true reproduction is not possible to achieve in a simple manner.

If there was a yes after the item Embedded Perl, this would mean that Perl plugins are not continually reloaded, but are kept in memory. This saves time when running Perl scripts.[11] The Event Broker in turn provides an interface for extensions that can be loaded as additional modules while the system is running.[12]

If you are satisfied with the result, make starts the actual compilation and then installs the software:

    linux:src/nagios-2.0b3 # make all
    linux:src/nagios-2.0b3 # make install
    linux:src/nagios-2.0b3 # make install-init
    linux:src/nagios-2.0b3 # make install-commandmode
    linux:src/nagios-2.0b3 # make install-config

make all compiles all the relevant programs, which are then copied to the appropriate directories, together with CGI scripts and documentation, by make install. Apart from /etc/nagios and /var/nagios, further directories are created under /usr/local/nagios, which are summarized in Table 1.2.

Table 1.2: Nagios directories under /usr/local/nagios

    Directory    Contents
    ./bin        Executable Nagios main program
    ./libexec    Plugins
    ./sbin       CGI scripts
    ./share      Documentation, HTML files for the Web interface

make install-init installs a suitable init script for the system start. Here make automatically tries to detect the correct path, which for most Linux distributions is /etc/init.d. Depending on your system, this may also go wrong, which is why you should check it. In order for Nagios to start automatically when the system is booted, the following symbolic links are created in the /etc/rc?.d directories:

    linux:~ # ln -s /etc/init.d/nagios /etc/init.d/rc2.d/S99nagios
    linux:~ # ln -s /etc/init.d/nagios /etc/init.d/rc2.d/K99nagios

[11] At the time of going to press, however, the Embedded Perl interface had problems with memory usage: Nagios occupied more and more main memory until the machine came to a standstill.
[12] At the time of going to press there were not yet any external extensions, which is why the Event Broker is currently only of interest to developers.

Where necessary, this step is repeated for rc3.d and rc5.d. Finally make install-commandmode generates the directory that is required for later usage of the command file mechanism (see Section 13.1 from page 240 onwards). This step is optional, depending on the intended use, but since it is easy to forget later on, it is better to take precautions now. The final make install-config creates the example configuration, which will be used in the next chapter.

1.2 Installing and Testing Plugins

What is now still missing are the plugins. They must be downloaded separately from http://www.nagios.org/ and installed. As independent programs, they are subject to a different versioning system than Nagios. The current version at the time of going to press was version 1.4, but you can, for example, also use plugins from version 1.3.1 if you don’t mind doing without the most recent features.
Although the plugins are distributed in a common source distribution, they are independent of one another, so that you can replace one version of an individual plugin with another one at any time, or with one you have written yourself.

1.2.1 Installation

The plugin sources are unpacked and built, like those of Nagios, in the directory /usr/local/src:

    linux:~ # cd /usr/local/src
    linux:local/src # tar xvzf path/to/nagios-plugins-1.4.tar.gz
    linux:src/nagios-plugins-1.4 # ./configure \
        --sysconfdir=/etc/nagios \
        --localstatedir=/var/nagios

When running the configure command you should specify the same deviating values as for the server, which here are the configuration directory (/etc/nagios) and the directory for the data saved by Nagios (/var/nagios). Since the Nagios plugins are not maintained by the same people as Nagios itself, you should always check in advance, with ./configure --help, whether the configure options for Nagios and the plugins really match or deviate from one another.

It is possible that a series of WARNINGs may appear in the output of the configure command, something like this:

    ...
    configure: WARNING: Skipping radius plugin
    configure: WARNING: install radius libs to compile this plugin (see REQUIREMENTS).
    ...
    configure: WARNING: Tried /usr/bin/perl - install Net::SNMP perl module if you want to use the perl snmp plugins
    ...

If you are not using Radius, you can safely ignore the corresponding warnings. Otherwise you should install the missing packages and repeat the configure procedure. In the example, the quite frequently required SNMP functionality lacks a Perl module. This is installed either in the form of the distribution package or online via the CPAN archive:[13]

    linux:~ # perl -MCPAN -e 'install Net::SNMP'

If you are running the CPAN procedure for the first time, it will guide you interactively through a self-explanatory setup, and you can answer nearly all of the questions with the default option.
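Whether a Perl module is already available can be tested without starting the CPAN shell, by asking Perl to load it. This is only a small sketch; the function name perl_has_module is made up for this example.

```shell
# Return 0 if the named Perl module can be loaded, non-zero otherwise
# (also non-zero if no perl interpreter is installed at all).
# (perl_has_module is a hypothetical helper name.)
perl_has_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Usage:
#   perl_has_module Net::SNMP || echo "Net::SNMP is missing"
```

Running such a check before and after the installation shows whether the module has actually arrived in the search path of the Perl interpreter that configure found.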
Running make in the directory nagios-plugins-1.4 will compile all plugins. Afterwards you have the opportunity to perform tests, with make check. Because these have not been particularly carefully programmed, you will often see many error messages that have more to do with the test itself than with the plugin. If you still want to try it, then the Cache Perl module must also be installed.

Irrespective of make check, the most important plugins should be tested manually anyway after the installation.

make install finally installs the plugins in the subdirectory libexec (which in our case is /usr/local/nagios/libexec), but not all of them: the source directory contrib contains a number of plugins that make install does not install automatically.

Most plugins in this directory are shell or Perl scripts. Where needed, these are simply copied to the plugin directory /usr/local/nagios/libexec. The few C programs there are must first be compiled, which in some cases may be no laughing matter, since a corresponding makefile, and often even a description of the required libraries, is missing. If a simple make, as in the case of check_cluster2,[14]

    linux:nagios-plugins-1.4/contrib # make check_cluster2
    cc check_cluster2.c -o check_cluster2

is not sufficient, then it is best to look for help in the mailing list nagiosplug-help.[15] The compiled program must also be copied to the plugin directory.

[13] The Comprehensive Perl Archive Network at http://www.cpan.org/.
[14] With check_cluster, hosts and services of a cluster can be monitored. Here you usually want to be notified if all nodes or all redundant services provided fail at the same time. If one specific service fails, on the other hand, this is not critical, as long as other hosts in the cluster provide this service.
[15] http://lists.sourceforge.net/lists/listinfo/nagiosplug-help

1.2.2 Plugin test

Because plugins are independent programs, they can already be used manually for test purposes right now, before the installation of Nagios has been completed. In any case you should check the check_icmp plugin, which plays an essential role: it checks whether another computer can be reached via ping and is the only plugin to be used both as a service check and a host check. If it is not working correctly, Nagios will also not work correctly, since the system cannot perform any service checks as long as it categorizes a host as “down”. Section 6.2 from page 88 describes check_icmp in detail, which is why there is only a short introduction here describing its manual use.

In order for the plugin to function correctly it must, like the /bin/ping program, be run as the user root. This is done by providing it with the SUID bit:

    linux:~ # chown root:nagios /usr/local/nagios/libexec/check_icmp
    linux:~ # chmod 4711 /usr/local/nagios/libexec/check_icmp
    linux:~ # ls -l /usr/local/nagios/libexec/check_icmp
    -rwsr-x--x 1 root nagios 61326 2005-02-08 19:49 check_icmp

Brief instructions for the plugin are given with the -h option:[16]

    nagios@linux:~$ /usr/local/nagios/libexec/check_icmp -h
    Usage: check_icmp [options] [-H] host1 host2 hostn

    Where options are any combination of:
     * -H | --host        specify a target
     * -w | --warn        warning threshold (currently 200.000ms,40%)
     * -c | --crit        critical threshold (currently 500.000ms,80%)
     * -n | --packets     number of packets to send (currently 5)
     * -i | --interval    max packet interval (currently 80.000ms)
     * -I | --hostint     max target interval (currently 0.000ms)
     * -l | --ttl         TTL on outgoing packets (currently 0)
     * -t | --timeout     timeout value (seconds, currently 10)
     * -b | --bytes       icmp packet size (currenly ignored)
       -v | --verbose     verbosity++
       -h | --help        this cruft

    The -H switch is optional. Naming a host (or several) to check is not.
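The SUID setting described above can also be verified from a script before the plugin is used, for instance after an update has replaced the file. This is a minimal sketch; has_suid is a name invented for this example, and it relies only on the standard -u test of the shell.

```shell
# Return 0 if the file exists and has the set-user-ID bit set.
# (has_suid is a hypothetical helper name.)
has_suid() {
    [ -u "$1" ]
}

# Usage with the plugin path used in this chapter:
#   has_suid /usr/local/nagios/libexec/check_icmp \
#       || echo "check_icmp: SUID bit missing"
```

The test says nothing about the owner of the file; that the owner is root still has to be checked separately, for example with ls -l.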
For a simple test it is sufficient to specify an IP address (it is immaterial whether you prefix the -H flag or not):

    user@linux:~$ cd /usr/local/nagios/libexec
    user@linux:nagios/libexec$ ./check_icmp -H 192.168.1.13
    OK - 192.168.1.13: rta 0.261ms, lost 0%|rta=0.261ms;200.000;500.000;0;
    pl=0%;40;80;;

[16] The listed options are explained in detail in Section 6.2 from page 88.

The output appears in a single line, which has been line-wrapped here for the printed version: with zero percent packet loss (lost 0%), the test has been passed. Nagios uses only the first 300 bytes of the output line. If the plugin provides more information, this is cut off.

If you would like to test other plugins, we refer you to Chapters 6 and 7, which describe the most important plugins in detail. All (reasonably well-programmed) plugins provide somewhat more detailed instructions with the --help option.

1.3 Configuration of the Web Interface

In order for the Web front end of Nagios to function, the Web server must know the CGI directory and the base Web directory. The following description, with a slight deviation, applies to both Apache 1.3 and Apache 2.0.

1.3.1 Setting Up Apache

As long as you have not added a different address for the front end, through the configure script with --with-cgiurl, it can be addressed under /nagios/cgi-bin. Since the actual CGI scripts are located in the directory /usr/local/nagios/sbin, a corresponding script alias is set in the Apache configuration:

    ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin
    <Directory /usr/local/nagios/sbin>
        AllowOverride AuthConfig
        Options ExecCGI
        # Remove the comment sign (#) from the following lines for Apache 2.0:
        # SetHandler cgi-script
        Order allow,deny
        Allow from 192.168.0.0/24
    </Directory>

The directive ScriptAlias ensures that Apache accesses the Nagios CGI directory when calling a URL such as http://nagios-server/nagios/cgi-bin, irrespective of where the Apache CGI directories may be located. Options ExecCGI ensures that the Web server accepts all the scripts located there as CGI.
Apache 2.0 in addition demands the directive SetHandler. The directives Order and Allow ensure that only clients from the network 192.168.0.0/24 (/24 stands for the subnet mask 255.255.255.0) may obtain access to the specified directory.

To be able to address the Nagios document directory /usr/local/nagios/share under http://nagios-server/nagios (independently of where the Apache DocumentRoot is located), the following is added:

    Alias /nagios /usr/local/nagios/share
    <Directory /usr/local/nagios/share>
        Options None
        AllowOverride AuthConfig
        Order allow,deny
        Allow from 192.168.0.0/24
    </Directory>

Here the directives Order and Allow also allow access only from the specified network.

It is recommended that you write the above details in your own configuration file, called nagios.conf, so that this configuration is not lost during an Apache update, and place it in the Apache directory for individual configurations. This is usually to be found under /etc/apache/conf.d, but depending on the distribution and the Apache version, this could also be under /etc/httpd/conf.d or /etc/apache2/conf.d. In any case the Apache configuration file must integrate this directory with the directive Include. More recent SuSE distributions only accept files in the subdirectory conf.d that end in .conf. The command

    linux:~ # /etc/init.d/apache reload

loads the new configuration. If everything has worked out correctly, the Nagios main page appears in the Web browser under http://nagios-server/nagios.

1.3.2 User Authentication

In the state in which it is delivered, Nagios allows only authenticated users access to the CGI directory. This means that users not “logged in” have no way to see anything other than the home page and the documentation. They are blocked off from access to other functions.

There is a good reason for this: apart from status queries and other display functions, Nagios has the ability to send commands via the Web interface.
The interface for external commands is used for this purpose (Section 13.1, page 240). If this is active, checks can be switched on and off via the Web browser, for example, and Nagios can even be restarted. Only authorized users should be in a position to do this.

The easiest way to implement a corresponding authentication is via a .htaccess file in the CGI directory /usr/local/nagios/sbin.[17] The document directory, on the other hand, requires no special protection. In addition, the parameter use_authentication in the CGI configuration file cgi.cfg[18] of Nagios must be set to 1:

    use_authentication=1

This is the default during installation.

[17] The access rule described here, via .htaccess in the CGI directory, adheres to the official Nagios documentation. Those more familiar with Apache will have other configuration possibilities available, of course.
[18] More on this in Section 2.13 from page 57.

In the CGI directory /usr/local/nagios/sbin a .htaccess file is created with the following contents:

    AuthName "Nagios-Monitoring"
    AuthType Basic
    AuthUserFile /etc/nagios/htpasswd
    require valid-user

AuthName is just a comment that the browser displays if the Web server requests authentication. AuthType Basic stands for simple authentication, in which the password is transmitted without encryption, as long as no SSL connection is used.

It is best to save the password file (here htpasswd) in the Nagios configuration directory /etc/nagios. The final parameter, require valid-user, means that all authenticated users have access (there are no restrictions for specific groups; only the user-password pair must be valid).

In combination with its own modules and those of third parties, Apache allows a series of other authentication methods. These include authentication via an LDAP directory, via Pluggable Authentication Modules (PAM),[19] or using SMB via a Windows server. Here we refer you to the relevant literature and the highly detailed documentation on the Apache home page at http://httpd.apache.org/.
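If the installation is to be repeated on several machines, the .htaccess file described above can be written by a small shell function instead of by hand. The following is a sketch only: the function name write_htaccess and its two arguments are assumptions made for this illustration, while the directives themselves are the ones used in this section.

```shell
# Write a Basic-auth .htaccess file into a CGI directory.
# $1: CGI directory, $2: path to the password file
# (write_htaccess is a hypothetical helper, not part of Nagios or Apache.)
write_htaccess() {
    cat > "$1/.htaccess" <<EOF
AuthName "Nagios-Monitoring"
AuthType Basic
AuthUserFile $2
require valid-user
EOF
}

# Usage with the paths used in this chapter:
#   write_htaccess /usr/local/nagios/sbin /etc/nagios/htpasswd
```

The function overwrites an existing .htaccess file without asking, so it is only suitable for a directory that has no other access rules yet.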
The name of the password file can basically be chosen freely; it is specified here so that it indicates what type of password file is involved. It is generated with the htpasswd2 program included in Apache (in Apache 1.3 the program is called htpasswd). Running

    linux:/etc/nagios # htpasswd2 -c htpasswd nagios

generates a new password file with a password for the user nagios. Its format is relatively simple:

    nagios:7NlyfpdI2UZEs

Each line contains a user-password pair, separated by a colon.[20] If you want to add other users, you should ensure that you omit the -c (“create”) option. Otherwise htpasswd(2) will recreate the file and delete the old contents:

    linux:/etc/nagios # htpasswd2 htpasswd another_user

[19] The “Pluggable Authentication Modules” now control authentication in all Linux distributions, so that you can also use existing user accounts here.
[20] To be precise, the second position does not contain the password itself, but rather its hash value.

The user name cannot be chosen freely but must match the name of a contact person (see Section 2.7, page 50).

Only the Web user (www-data in our example) needs to be able to read the generated htpasswd file, and it should be protected from access by anyone else:

    linux:/etc/nagios # chown www-data htpasswd
    linux:/etc/nagios # chmod 600 htpasswd

Even though configuration of the Web interface is now finished, at the moment only the documentation is properly displayed: Nagios itself must first be correspondingly adjusted, as described in detail in the following chapter, before it can make usable monitoring data available in this way.

Chapter 2: Nagios Configuration

Although the Nagios configuration can become quite large, you only need to handle a small part of this to get a system up and running. Luckily many parameters in Nagios are already set to sensible default settings. So this chapter will be primarily concerned with the most basic and frequently used parameters, which is quite sufficient for an initial configuration.
Further details on the configuration are provided by the chapters on individual Nagios features: in Chapter 6 about network plugins (page 85) there are many examples on the configuration of services. All parameters of the Nagios messaging system are explained in detail in Chapter 12, page 215, and the parameters for controlling the Web interface are described in Chapter 16 from page 273. In addition to this, Nagios includes its own extensive documentation, once it is installed, in the directory /usr/local/nagios/share/docs, which can also be reached from the Web interface. This can always be recommended as a useful source for further information, which is why each of the sections below refers to the corresponding location in the original documentation.

The installation routine in make install-config (see Section 1.1 on page 26) stores examples of individual configuration files in the directory /etc/nagios. They all end in -sample, so that a possible update will not overwrite the files needed for productive operation.

All subsequent work should be carried out as the user nagios. If you are editing files as the superuser, you must make sure yourself that the contents of the directory /etc/nagios afterwards belong to the user nagios again. With the exception of the file resource.cfg (this may contain passwords, which is why only the owner nagios should have the read permission set), all other files may be readable for all.

2.1 The Main Configuration File nagios.cfg

The central configuration takes place in nagios.cfg. Instead of storing all configuration options there, it refers to other configuration files (with the exception of the CGI configuration).

The easiest method is first to copy the example file:

    nagios@linux:/etc/nagios$ cp nagios.cfg-sample nagios.cfg

Those who compile and install Nagios themselves have the advantage that at first they do not even need to adjust nagios.cfg, since all paths are already correctly set.[1] And that’s as much as you need to do.
Nevertheless one small modification is recommended, which helps to maintain a clear picture and considerably simplifies configuration where larger networks are involved.

The parameter concerned is cfg_file, which integrates files with object definitions (see Sections 2.2 through 2.10). The file nagios.cfg-sample, included in the package, contains the following entries:

    nagios@linux:/etc/nagios$ fgrep cfg_file nagios.cfg
    ...
    cfg_file=/etc/nagios/checkcommands.cfg
    cfg_file=/etc/nagios/misccommands.cfg
    cfg_file=/etc/nagios/contactgroups.cfg
    cfg_file=/etc/nagios/contacts.cfg
    cfg_file=/etc/nagios/dependencies.cfg
    cfg_file=/etc/nagios/escalations.cfg
    cfg_file=/etc/nagios/hostgroups.cfg
    cfg_file=/etc/nagios/hosts.cfg
    cfg_file=/etc/nagios/services.cfg
    cfg_file=/etc/nagios/timeperiods.cfg
    #cfg_file=/etc/nagios/hostextinfo.cfg
    #cfg_file=/etc/nagios/serviceextinfo.cfg
    ...

[1] If Nagios is from a distribution package, it is worth checking at least the path details. In a well-maintained distribution these will also be matched to the Nagios directories used there.

As an alternative to cfg_file, you can also use the parameter cfg_dir: this requests you to specify the name of a directory from which Nagios should integrate all configuration files ending in .cfg (files with other extensions are simply ignored). This also works recursively; Nagios thus evaluates all *.cfg files from all subdirectories.

With the parameter cfg_dir you therefore only need to specify a single directory, instead of calling all configuration files, with cfg_file, individually. The only restriction: these must be configuration files that describe objects. The configuration files cgi.cfg and resource.cfg are excluded from this, which is why, like the main configuration file nagios.cfg, they remain in the main directory /etc/nagios.
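To see in advance which files a cfg_dir entry would take in, the rule described above (all files ending in .cfg, including those in subdirectories) can be imitated with find. This is only an approximation of the selection rule as stated in the text, not a reimplementation of what Nagios does internally; the function name list_cfg_files is invented for this example.

```shell
# List the files that match the cfg_dir rule described above:
# every regular file ending in .cfg, in the directory and all
# of its subdirectories. Files such as *.cfg-sample are skipped.
# (list_cfg_files is a hypothetical helper name.)
list_cfg_files() {
    find "$1" -type f -name '*.cfg' | sort
}

# Usage with an object directory of your choice:
#   list_cfg_files /path/to/object/directory
```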
For the object-specific configuration, it is best to create a directory called /etc/nagios/mysite, then remove all cfg_file directives in nagios.cfg (or comment them out with a # at the beginning of the line) and replace them with the following:

    ...
    cfg_dir=/etc/nagios/mysite
    ...

The contents of the directory /etc/nagios will then look like this:[2]

    nagios@linux:/etc/nagios$ tree
    .
    |-- nagios.cfg
    |-- cgi.cfg
    |-- resource.cfg
    |-- htpasswd
    |-- mysite
    |   |-- contactgroups.cfg
    |   |-- misccommands.cfg
    |   |-- contacts.cfg
    |   |-- timeperiods.cfg
    |   |-- checkcommands.cfg
    |   |-- hosts.cfg
    |   |-- services.cfg
    |   `-- hostgroups.cfg
    |-- sample
    |   |-- ...
    ...

The main directory /etc/nagios contains only three configuration files and the password file for protected Web access. For the sake of clarity, the configuration examples *-sample should be moved to the directory sample.

In this book we will include all objects of a type in a file of its own, that is, all host definitions in the file hosts.cfg, all services in services.cfg, and so on. But you could just as well save each of the host definitions in a separate file for each host and use a directory structure to reflect this:

[2] http://mama.indstate.edu/users/ice/tree/

    ...
    |-- mysite
    |   |-- linux
    |   |   |-- services
    |   |   `-- hosts
    |   |       |-- linux01.cfg
    |   |       |-- linux02.cfg
    |   |       `-- linux03.cfg
    |   |-- windows
    |   |   |-- services
    |   |   `-- hosts
    |   |       |-- win03.cfg
    |   |       `-- win09.cfg
    |   |-- router
    |   |   |-- services
    |   |   `-- hosts
    |   |       |-- edge01.cfg
    |   |       |-- edge02.cfg
    |   |       `-- backbone.cfg
    ...

In doing this, only the top directory mysite needs to be integrated into nagios.cfg, using cfg_dir. For the initial configuration, however, we will leave all the files in the directory mysite.

The date specifications in Nagios appear by default in the American format MM-DD-YYYY:

    date_format=us

If you prefer something else, e.g. the European date format, it is recommended that you change the parameter date_format in nagios.cfg right from the start.
The value iso8601 ensures that Nagios date specifications are displayed in the ISO or DIN format YYYY-MM-DD HH:MM:SS. Table 2.1 lists the possible values for date_format.

Table 2.1: Possible date formats

    Value             Representation
    us                MM-DD-YYYY HH:MM:SS
    euro              DD-MM-YYYY HH:MM:SS
    iso8601           YYYY-MM-DD HH:MM:SS
    strict-iso8601    YYYY-MM-DDTHH:MM:SS

The other parameters in nagios.cfg are described in Appendix D.1 on page 424; in the original documentation these can be found at http://localhost/nagios/docs/configmain.html or /usr/local/nagios/share/docs/configmain.html.

2.2 Objects: An Overview

A Nagios object describes a specific unit: a host, a service, a contact, but also the groups to which it belongs. Even commands are defined as objects. This definition has not come about by chance: it is what allows object definitions to inherit characteristics from one another (Section 2.11 from page 54).

Object definitions follow the following pattern:

    define object-type {
        parameter value
        parameter value
        ...
    }

Nagios has the following values for the object-type:

host
    The host object describes one of the network nodes that are to be monitored. Nagios expects the IP address as a parameter here (or the Fully Qualified Domain Name) and the command that should determine whether the host is alive (see Section 2.3 from page 44). The host definition is referenced again in the service definition.

hostgroup
    Several hosts can be combined into a group (see Section 2.4 on page 46). This simplifies configuration, since entire host groups instead of single hosts can be specified when defining services (the service will then exist for each member of the group). In addition, Nagios represents the hosts of a host group together in a table in the Web front end, which also helps to increase clarity.

service
    The individual services to be monitored are defined as service objects (Section 2.5 from page 47). A service never exists independently of a host. So it is quite possible to have several services with the same name, as long as they belong to different hosts.
The following code,

    define service {
        service_description  PING
        host_name            linux01
        ...
    }
    define service {
        service_description  PING
        host_name            linux03
        ...
    }

describes two services that both have the same service name but belong to different hosts. So in the language of Nagios, a service is always a host-service pair.

servicegroup
As it does with host groups, Nagios also combines several services, to represent these in the Web front end as a unit with its own table (see Section 2.6 on page 50). Service groups are not absolutely essential, but help to improve clarity, and are also used in reporting.

contact
A person who is to be informed by Nagios of specific events (see Section 2.7 from page 50). Nagios also uses contact objects to show a user, via the Web front end, only those things for which the user is listed as a contact person. In the basic setting users do not get to see hosts and services for which they are not responsible.

contactgroup
Notification of events in hosts and services takes place via the contact group (Section 2.8 from page 52). A direct link between the host/service and a contact person is not possible.

timeperiod
Describes a time period within which Nagios should inform contact groups (Section 2.10 from page 54). Outside such a time slot, the system will not send any messages. The messaging chain can be fine-tuned via various time periods, depending on the host/service and contact/contact groups. More on this will be presented in Section 12.3 from page 217.

command
Nagios always calls external programs via command objects (Section 2.9 from page 53). Apart from plugins, these include messaging programs, for example for sending e-mails or SMS messages.

servicedependency
This object type describes dependencies between services. If, for example, an application does not function without a database, a corresponding dependency object will ensure that Nagios represents the failed database as the primary problem instead of just announcing that the application is not functioning (see Section 12.6 from page 234).
serviceescalation
Used to define proper escalation management: if a service is not available after a specific time period, Nagios informs a further, or different, circle of people. This can also be configured on multiple levels, in any way you want (see Section 12.5).

hostdependency
Like servicedependency, but for hosts.

hostescalation
Like serviceescalation, but for hosts.

hostextinfo
"Extended Host Information" objects are optional and define a specific graphic and/or URL, which Nagios additionally integrates into its graphic output. The URL can refer to a Web page that provides additional information on the host (see Section 16.4 from page 307).

serviceextinfo
"Extended Service Information", the counterpart of Extended Host Information for services.

Not all object types are absolutely essential; especially at the beginning, you can easily do without the *dependency, *escalation, and *extinfo objects, as well as the servicegroup. Chapter 12 looks at escalation and dependencies in detail. The extended information objects are used to provide a "more colorful" graphical representation, but they are not at all necessary for running Nagios. We refer here to the original documentation. [3]

Notes on the object examples below

The following sections describe the individual object types in detail, but cover only the mandatory parameters and those that are absolutely essential for meaningful operation. Mandatory parameters are always printed in bold type. The first (comment) line in each example names the file in which the object definition shown is to be stored.

When you first start using Nagios, it is recommended that you restrict yourself to a minimal configuration with only one or two objects per object type, in order to keep potential sources of error to a minimum and to obtain a running system as quickly as possible. Afterwards, extensions can be implemented very simply and quickly, especially if you take on board the tips in Section 2.11 on templates (page 54).

Time details in general refer to time units. A time unit consists of 60 seconds by default.
It can be set to a different value in the configuration file nagios.cfg, using the parameter interval_length. You should really change this parameter only if you know exactly what you are doing.

[3] http://localhost/nagios/docs/xodtemplate.html#hostextinfo and #serviceextinfo; the file can be found locally in /usr/local/nagios/share/docs/

2.3 Defining the Machines to Be Monitored, with host

The host object is the central command post on which all host and service checks are based. It defines the machine to be monitored. The parameters printed in bold must be specified in all cases:

    # -- /etc/nagios/mysite/hosts.cfg
    define host{
        host_name              linux01
        hostgroups             linux-servers
        alias                  Linux File Server
        address                192.168.1.9
        check_command          check-host-alive
        max_check_attempts     3
        check_period           24x7
        contact_groups         localadmins
        notification_interval  120
        notification_period    24x7
        notification_options   d,u,r,f
        parents                router01
    }

host_name
This parameter specifies the host name with which Nagios addresses the machine in services, host groups, and other objects. The only special characters allowed are - and _.

hostgroups
This parameter, new in version 2.x, allocates the host to a host group object, which must already be defined (Section 2.4, page 46). A host group in the Web interface combines several hosts into a group (see Figure on page 280). The second possibility of assigning a host to a host group, compatible with version 1.x, uses the members parameter in defining the host group itself. The two methods can also be combined.

alias
This parameter contains a short description of the host, which Nagios displays at various locations as additional information. Ordinary text is allowed here.

address
This specifies the IP address or the Fully Qualified Domain Name (FQDN) of the computer. If it is possible (i.e., for static IP addresses), you should use an IP address, since the resolution of a name to an IP address always depends on DNS working, which is not infallible either.
check_command
This specifies the command with which Nagios checks, if necessary, to see whether the host is reachable. The parameter is optional. If it is omitted, Nagios will never carry out a host check! This can be useful for network components that are frequently switched off (for example, print servers).

The command normally used for check_command is called check-host-alive, which is already predefined in the supplied file checkcommands.cfg (see Section 2.9 on page 53). This makes use either of the plugin check_ping or the more modern check_icmp. Both plugins check the reachability of the host via the ICMP packets "ICMP Echo Request" and "Echo Reply".

max_check_attempts
This parameter determines how often Nagios should try to reach the computer if the first test has gone wrong. The value 3 in the example means that the test is repeated up to three times if it returns anything other than "OK" in the first test. As long as there are still repeat tests to be made, Nagios refers to this as a soft state. Once the final test has been made, the system categorizes the state as hard. Nagios notifies the system administrator exclusively of hard states and in the example sends messages only if the third test also ends with an error or warning.

check_period
This specifies the time period in which the host should be monitored. Really, only "round the clock" makes sense, that is, 24x7. A timeperiod object is involved here, the definition of which is described in more detail in Section 2.10 on page 54. It only makes sense to use a specification other than 24x7 if you want to explicitly suppress the host check at certain times.

contact_groups
This specifies the receivers of messages that Nagios sends with respect to the hosts defined here, in the example the contact group localadmins. Section 2.8 explains this more fully on page 52.

notification_interval
This specifies at what intervals Nagios should repeat notification of the continued existence of the state. 120 time units normally mean one message every 120 minutes, provided the error state continues.
notification_period
This specifies the time period during which a message may be sent. A time period different from 24x7 could certainly be useful here. It is important to understand the difference here from check_period: if check_period excludes time periods, Nagios cannot even determine whether there is an error or not. But if the host is monitored round the clock and only the notification period is restricted by the parameter notification_period, Nagios will certainly log errors and also display them in the Web front end and in log evaluations. Outside the notification_period the system does not send any messages. A more detailed description of the notification system is given in Section 12.3 from page 217.

notification_options
This parameter describes the states about which Nagios should provide notification when they occur. Nagios knows the following states for computers:

    d   down
    u   unreachable (the host is not reachable because a network node between Nagios and the host has failed, and the actual state of the host cannot be determined)
    r   recovery (OK state after an error)
    f   flapping (the state changes very quickly; more on this in Appendix A from page 401)

By specifying d,u, the system will send messages if the host is not on the network or not reachable over the network, but not if it can be reached again after an error state (recovery). If n (none) is used as the value, Nagios will normally not give any notification.

The form in which Nagios sends out a message depends on how the contact is defined. Irrespective of when you want to be notified, the Web interface always shows the current state, even if Nagios does not send a message because the time period does not match or the system is still repeating the tests (the so-called soft state).

parents
This allows the physical topology of the network to be taken into account. Here you specify the router or network component through which the host is reachable if it is not directly reachable in the same network segment. This can also be a switch between the Nagios server and the host.
If Nagios does not reach the host because all parents (separated by commas) are down, then Nagios categorizes it as "unreachable", but not as "down".

Further information is provided by the online help under http://localhost/nagios/docs/xodtemplate.html#host (also found locally in /usr/local/nagios/share/docs/xodtemplate.html).

2.4 Grouping Computers Together with hostgroup

A host group contains one or more computers so that they can be represented in the Web interface together (see Figure on page 280). In addition, certain objects (e.g., services) can be applied to an entire group of computers instead of having to define them individually for each host.

The hostgroup_name parameter specifies a unique name for the group; alias accepts a short description. The members parameter lists all host names belonging to the group, separated by commas:

    # -- /etc/nagios/mysite/hostgroups.cfg
    define hostgroup{
        hostgroup_name  linux-servers
        alias           Linux Servers
        members         linux01,linux02
    }

If you specify in the host definition of the individual member computers which group they belong to, using the parameter hostgroups (page 44), the members entry may be omitted from version 2.0. This means that you no longer have to search through all group definitions if you just want to delete a single host.

The combined use of members in the hostgroup object and, at the same time, of hostgroups in the host object is equally possible.

2.5 Defining Services to Be Monitored with service

A service in Nagios always consists of the combination of a host and a service name. This combination must be unique. Service names, on the other hand, may occur many times, as long as they are combined with different hosts.
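The uniqueness rule for host-service pairs can be illustrated with a short sketch. It is not part of Nagios; the function name duplicate_services and the dictionary layout are assumptions chosen purely for illustration.

```python
from collections import Counter


def duplicate_services(services):
    """Return the (host_name, service_description) pairs defined more than once.

    Hypothetical helper: it mimics the rule that the combination of host
    and service name must be unique, while the service name alone may repeat.
    """
    counts = Counter(
        (svc["host_name"], svc["service_description"]) for svc in services
    )
    return sorted(pair for pair, count in counts.items() if count > 1)


services = [
    {"host_name": "linux01", "service_description": "PING"},
    {"host_name": "linux03", "service_description": "PING"},  # same name, other host
    {"host_name": "linux01", "service_description": "PING"},  # repeats the first pair
]

print(duplicate_services(services))
```

Only the pair (linux01, PING) is reported, because the service name PING on linux03 belongs to a different host.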
The simplest service consists of a simple ping, which tests whether the relevant host is reachable, and which registers the response time and any packet loss that may occur:

    # -- /etc/nagios/mysite/services.cfg
    define service{
        host_name              linux01
        service_description    PING
        check_command          check_ping!100.0,20%!500.0,60%
        max_check_attempts     3
        normal_check_interval  5
        retry_check_interval   1
        check_period           24x7
        notification_interval  120
        notification_period    24x7
        notification_options   w,u,c,r,f
        contact_groups         localadmins
    }

In contrast to a host check, which Nagios carries out only if it cannot reach any other service of the host, a ping service is carried out at regular intervals. Problems in the network can be detected relatively simply through response times and packet loss rates. The host check is less suitable for this purpose.

host_name
This refers to the name defined in the host object. Nagios also obtains the IP address of the computer via this. Instead of a single host name, you can also enter a comma-separated list of multiple hosts. As an alternative to host_name, it is also possible to use the parameter hostgroup_name to specify an entire host group instead of individual hosts. The service is then considered to be defined for each of the individual computers grouped together in this way. Whether you make use of this optimization, or allocate a separate service definition to each computer individually, makes no difference to Nagios.

service_description
This parameter defines the actual name of the service. Spaces, colons, and dashes may be included in the name. Nagios always addresses a service as a combination of host name (here: linux01) and service description (PING). This must be unique.

check_command
This defines the command with which Nagios tests the service for functionality. Arguments are passed on to the actual command, check_ping, separated by exclamation marks. The definition of the check_ping command, predefined in the example files, is explained in Section 2.9 on page 53.
In the example, the values for the warning limit (100 ms, 20%) and for the CRITICAL status (500 ms, 60%) are determined. You could compare this to a traffic light: the state OK (green) occurs if the response time remains under the warning limit of 100 milliseconds, and if none or less than 20 percent of packets have been lost. The WARNING state (yellow) occurs if the packet loss or response time lies above the defined warning limit, but still beneath the critical limit. Above the critical limit, Nagios issues a CRITICAL state (red). The return value of the plugin is described at the beginning of Chapter 6 (page 6), the underlying plugin check_icmp is introduced in detail in Section 6.2 from page 88.

max_check_attempts
This specifies how often Nagios should repeat a test in order to verify and definitively accept an error state which has been discovered (or also the recovered functionality), that is, to recognize it as a hard state. In the transitional phase (for example from OK to CRITICAL) we speak of a soft state. Basic distinctions between soft and hard are only made by the Nagios notification system, which is why the two states are described in more detail in that context (Chapter 12 from page 215). The difference has no influence on the representation in the Web interface.

normal_check_interval
This specifies at what interval Nagios should test the service when the system is in a stable condition; this can equally be an OK or an error state. In the example this is five time units, which is normally five minutes.

retry_check_interval
This describes the time interval between two tests when the state is in the process of changing (for example, from OK to WARNING), that is, when there is a soft state. As soon as Nagios has performed the number of tests specified in max_check_attempts, it checks the service again at intervals of normal_check_interval.

check_period
This describes the time period in which the service is to be monitored.
The entry represents a timeperiod object, the definition of which is described in more detail in Section 2.10 from page 54. Here you should enter 24x7 for "round the clock" unless you want to explicitly stop the test from running at specific times (perhaps because of a scheduled maintenance slot). If only the notification is to be prevented at specific times, it is better to use the option notification_period or other filters of the Nagios notification system (see Chapter 12 from page 215).

notification_interval
This determines at what regular intervals Nagios repeats reports on error states. In the example, the system does this every 120 time units (normally minutes), as long as the error state continues. A value of 0 causes Nagios to announce the current state only once.

notification_period
This describes the time period within which a notification should take place. This again involves a timeperiod object (see Section 2.10). Here in the example, 24x7 is used, so notification is sent round the clock. A more detailed discussion of the notification_period parameter can be found in Section 12.3 from page 217.

notification_options
This determines which error states Nagios should report. Possible values here are the five states c (critical), w (warning), u (unknown), r (recovered), and f (flapping). Specifying c,r means that the system sends a notification only if a service is in a CRITICAL state and if it subsequently recovers (RECOVERY).

If you use n (none) as the value, Nagios will normally not send any notification. The Web interface nevertheless shows the current states.

contact_groups
Finally, this parameter defines the recipient group whose members should receive the notifications. Several groups can be entered as a comma-separated list.

Further information can be found in the online help at http://localhost/nagios/docs/xodtemplate.html#service.
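The traffic-light comparison used for the check_ping thresholds can be written down as a small sketch. This is not the plugin's code: the function names are hypothetical, and treating a value that exactly equals a limit as exceeding it is an assumption, since the text only speaks of values "under" and "above" the limits.

```python
def parse_threshold(text):
    """Split a threshold such as '100.0,20%' into (milliseconds, percent)."""
    rta_text, loss_text = text.split(",")
    return float(rta_text), float(loss_text.rstrip("%"))


def classify_ping(rta_ms, loss_pct, warn, crit):
    """Map a measured round trip time and packet loss onto OK/WARNING/CRITICAL.

    Assumption: reaching a limit already counts as exceeding it.
    """
    warn_rta, warn_loss = parse_threshold(warn)
    crit_rta, crit_loss = parse_threshold(crit)
    if rta_ms >= crit_rta or loss_pct >= crit_loss:
        return "CRITICAL"
    if rta_ms >= warn_rta or loss_pct >= warn_loss:
        return "WARNING"
    return "OK"


# Thresholds taken from the service definition: check_ping!100.0,20%!500.0,60%
print(classify_ping(12.5, 0, "100.0,20%", "500.0,60%"))
print(classify_ping(150.0, 0, "100.0,20%", "500.0,60%"))
print(classify_ping(20.0, 70, "100.0,20%", "500.0,60%"))
```

Either measurement on its own can raise the state: a fast reply with 70 percent packet loss is still CRITICAL.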
Note: the online help file xodtemplate.html is located after installation in the directory /usr/local/nagios/share/docs/.

2.6 Grouping Services Together with servicegroup

Service groups, like host groups, combine several services into a group, so that they can be represented together in the Web front end. This increases clarity and simplifies certain evaluations, but it is optional, and is not recommended at the beginning, in the interest of a simpler configuration.

    # -- /etc/nagios/mysite/servicegroups.cfg
    define servicegroup{
        servicegroup_name  all-ping
        alias              All Pings
        members            linux01,PING,linux02,PING
    }

servicegroup_name and alias have the same meanings as for the host group. Note the syntax of the members entry: since a service in Nagios always consists of the combination of host and service names, both must always be listed in pairs. The computer comes first, and then the service:

    members host1,service1,host2,service2,...

2.7 Defining Addressees for Error Messages: contact

A contact is basically a person to whom a message addressed via a contact group is sent:

    # -- /etc/nagios/mysite/contacts.cfg
    define contact{
        contact_name                   nagios
        alias                          Nagios Admin
        host_notification_period       24x7
        service_notification_period    24x7
        service_notification_options   w,u,c,r
        host_notification_options      d,u,r
        service_notification_commands  notify-by-email
        host_notification_commands     host-notify-by-email
        email                          nagios-admin@localhost
    }

The contact also plays a role during authentication: a user who logs in at the Web front end only gets to see the hosts and services for which that user is entered as the contact. The user for logging in to the Web interface must therefore be identical with the value of contact_name specified here. For initial use, the user nagios is sufficient.

contact_name
This parameter defines the user name. It must match the corresponding user name in the password file htpasswd.

alias
This parameter describes the contact briefly. Spaces are allowed here.
host_notification_period
This defines the time period during which messages on the reachability of a computer can be sent. Section 12.3 (page 217) shows how the time period details can be sensibly combined in the different object types. At the beginning, the value 24x7 (that is: always) is certainly not a bad option.

service_notification_period
This defines the time period in which Nagios sends notifications about services to the relevant user. The entry takes effect as a filter: the generated message is simply discarded here if it is sent outside the specified time period. If no further message follows, the contact remains uninformed. You must therefore think about how the individual time periods in the various definitions combine. Dependencies are described extensively in Section 12.3.

host_notification_options
This defines what types of host messages the user should receive. The same options are used here as for the host parameter notification_options (page 46).

service_notification_options
This parameter describes what types of service messages are received by the contact. The same five values are involved as for the notification_options parameter for service objects.

service_notification_commands
This parameter defines which commands (one or more) take charge of notification. They must be defined as objects of the command type (see Section 2.9); basically any external programs can be integrated.

host_notification_commands
Like service_notification_commands, this parameter specifies which commands are to be carried out to send the notification, although here it concerns the reachability of computers.

email
This specifies one or more e-mail addresses (separated by commas) to which a message should be sent. The notification command can evaluate this value (one example of this is the command notify-by-email [6]).

Further information can be found in the online help at http://localhost/nagios/docs/xodtemplate.html#contact.
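The filter character of the contact parameters can be sketched as follows. This is a simplified illustration, not the Nagios implementation: the function contact_accepts is hypothetical, the time period test is reduced to a boolean supplied by the caller, and the handling of n (none) for contacts is assumed to mirror that of the host parameter notification_options.

```python
def contact_accepts(contact, kind, state, in_period):
    """Decide whether a message passes the contact's filters.

    kind      -- "host" or "service"
    state     -- one-letter state code such as "d", "c" or "r"
    in_period -- True if the current time lies inside the contact's
                 host_notification_period or service_notification_period
    """
    options = contact[kind + "_notification_options"].split(",")
    if "n" in options:  # assumption: n (none) suppresses everything
        return False
    # A message outside the period is simply discarded, as described above.
    return in_period and state in options


nagios_admin = {
    "contact_name": "nagios",
    "service_notification_options": "w,u,c,r",
    "host_notification_options": "d,u,r",
}

print(contact_accepts(nagios_admin, "service", "c", True))
print(contact_accepts(nagios_admin, "service", "f", True))
print(contact_accepts(nagios_admin, "host", "d", False))
```

With the options from the example contact, flapping messages for services are filtered out because f is not listed, and a host message outside the notification period is dropped even though d is listed.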
2.8 The Message Recipient: contactgroup

The contactgroup serves as the interface between the notification system and the individual contacts. Nagios never addresses individual contacts directly in the various object definitions, but always goes through the contact group.

Here Nagios also expects a name (contactgroup_name) and a comment (alias), which reveals to visitors to the Web site what the purpose of the group is. For the members (members) of the group, you can enter an individual contact or a comma-separated list of several contacts:

    # -- /etc/nagios/mysite/contactgroups.cfg
    define contactgroup{
        contactgroup_name  localadmins
        alias              Local Site Administrators
        members            nagios
    }

[6] See Table 12.1 on page 226.

2.9 When Nagios Needs to Do Something: the command Object

Everything that Nagios does is defined in command objects. In the example files supplied, checkcommands.cfg-sample defines a broad range of commands which only need to be included. To do this, you just copy the file to the subdirectory mysite:

    nagios@linux:/etc/nagios$ cp sample/checkcommands.cfg-sample \
        mysite/checkcommands.cfg

The already existing command check_ping illustrates the definition of this object type:

    # -- /etc/nagios/mysite/checkcommands.cfg
    ...
    define command{
        command_name  check_ping
        command_line  $USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
    }
    ...

check_ping is the name by which the command will later be called when defining a service. command_line describes the command to be executed. It is not the old plugin check_ping that is used here, but the more efficient check_icmp. The differences between the two are explained in more detail in Section 6.2 from page 88, but they use the same parameters to a large extent.

The identifiers used here, surrounded by dollar signs, are macros. Nagios recognizes three different types of macros:

$USERx$ macros (x may take on values between 1 and 32) are defined in the file resource.cfg. The macro $USER1$, which contains the path to the plugin directory, belongs to this group.
The second group of macros are arguments which can be passed on when a command is called. These include $ARG1$ and $ARG2$.

The third group defined by Nagios includes the macro $HOSTADDRESS$, which references the IP address of the host in the host definition (that is, the parameter address). This type of macro is documented in the online help at http://localhost/nagios/docs/macros.html.

If the service linux01,PING defined on page 47 is called with the check_command

    check_ping!100.0,20%!500.0,60%

then 100.0,20% will appear in $ARG1$, and 500.0,60% in $ARG2$. The exclamation mark is used to separate the command and the arguments to be passed on.

In theory, any programs at all can be started via the command_line, but Nagios expects a certain type of behavior here, particularly where the return value is concerned. For this reason, only Nagios plugins should be used (see Chapters 6 to 9).

2.10 Defining a Time Period with timeperiod

timeperiod objects describe time periods in which Nagios generates and/or sends notifications. The included example files minimal.cfg-sample and bigger.cfg-sample contain a number of definitions that can simply be copied to your own timeperiods.cfg file. In this, the definition of 24x7 is stated as "Sundays to Saturdays, from 0 to 24 hours in each case":

    # -- /etc/nagios/mysite/timeperiods.cfg
    define timeperiod{
        timeperiod_name  24x7
        alias            24 Hours A Day, 7 Days A Week
        sunday           00:00-24:00
        monday           00:00-24:00
        tuesday          00:00-24:00
        wednesday        00:00-24:00
        thursday         00:00-24:00
        friday           00:00-24:00
        saturday         00:00-24:00
    }

The times of day on individual weekdays can also be "cobbled together" from time periods, separated by a comma:

    define timeperiod{
        ...
        monday  00:00-09:00,12:00-13:00,17:00-24:00
        ...
    }

If a day specification is omitted completely, the defined time period will not include this day in its entirety.
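How such a weekday entry can be evaluated is shown in the following sketch. It is an illustration only, not Nagios code: the helper names are hypothetical, and treating the end of each range as exclusive is an assumption.

```python
def to_minutes(hhmm):
    """Convert 'HH:MM' into minutes since midnight ('24:00' gives 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def parse_ranges(spec):
    """Turn '00:00-09:00,12:00-13:00' into a list of (start, end) minute pairs."""
    ranges = []
    for part in spec.split(","):
        start, end = part.split("-")
        ranges.append((to_minutes(start), to_minutes(end)))
    return ranges


def in_timeperiod(period, weekday, hhmm):
    """Return True if the given weekday and time fall inside the time period.

    A weekday that is missing from the definition is excluded completely.
    Assumption: the end of a range does not belong to the range.
    """
    spec = period.get(weekday)
    if spec is None:
        return False
    now = to_minutes(hhmm)
    return any(start <= now < end for start, end in parse_ranges(spec))


# Hypothetical time period that only defines Monday.
nonwork_monday = {"monday": "00:00-09:00,12:00-13:00,17:00-24:00"}

print(in_timeperiod(nonwork_monday, "monday", "08:30"))
print(in_timeperiod(nonwork_monday, "monday", "10:00"))
print(in_timeperiod(nonwork_monday, "tuesday", "08:30"))
```

Monday at 10:00 falls into the gap between the first two ranges, and Tuesday is rejected because the day is not specified at all.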
2.11 Templates

Nagios categorizes definitions as objects for a very good reason: their features can be inherited by other objects, a feature that can save a lot of time otherwise spent typing. You can define a so-called template and pass this on to other objects as a basis from which you only need to describe those details that are different. This is best illustrated by an example (the parameters that are required for the use of templates are printed in bold):

    # -- /etc/nagios/mysite/hosts.cfg
    define host{
        name                   Generic-Host
        register               0
        check_command          check-host-alive
        max_check_attempts     3
        check_period           24x7
        contact_groups         localadmins
        notification_interval  120
        notification_period    24x7
        notification_options   d,u,r,f
    }

With name, the template is first given a name so that it can be referenced later on. The following entry, register 0, prevents Nagios from trying to treat this template as a real host. In the example, the entries are not sufficient for a genuine host object; without register 0, Nagios would consequently break off when reading the configuration file, with an error message that parameters obligatory for such a definition are missing, for example:

    Error: Host name is NULL

All the other parameters involve settings that are to apply to all definitions dependent on Generic-Host.

In the actual host definition (in the following example for linux03 and linux04), the parameter use references the template and thus takes over the preset values:

    # -- /etc/nagios/mysite/hosts.cfg
    define host{
        host_name  linux03
        use        Generic-Host
        alias      Linux File Server
        address    192.168.0.1
    }
    define host{
        host_name  linux04
        use        Generic-Host
        alias      Linux Print Server
        address    192.168.0.2
    }

In this way you only need to complete those entries that vary anyway between the two hosts. But parameters that have already been defined by the template may also appear in host definitions. In this case the definition at the host has priority; it overwrites the value from the template.

Templates created in this way can generally be used for all object types.
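The inheritance rule (template values first, then the object's own values on top) can be sketched in a few lines. This is a simplified model, not the Nagios parser: the function resolve and the dictionary representation are assumptions made for illustration.

```python
def resolve(definition, templates):
    """Merge a definition with the template named in its 'use' parameter.

    Values from the definition itself override values from the template.
    The bookkeeping parameters name, register and use are dropped from
    the result, since they only steer the inheritance.
    """
    merged = {}
    template_name = definition.get("use")
    if template_name is not None:
        merged.update(resolve(templates[template_name], templates))
    merged.update(definition)
    for key in ("use", "name", "register"):
        merged.pop(key, None)
    return merged


templates = {
    "Generic-Host": {
        "name": "Generic-Host",
        "register": "0",
        "check_command": "check-host-alive",
        "max_check_attempts": "3",
        "check_period": "24x7",
    }
}

linux03 = {
    "host_name": "linux03",
    "use": "Generic-Host",
    "alias": "Linux File Server",
    "address": "192.168.0.1",
    "max_check_attempts": "5",  # hypothetical override of the template value
}

for key, value in sorted(resolve(linux03, templates).items()):
    print(key, value)
```

The resolved host keeps check_command and check_period from Generic-Host, while its own max_check_attempts wins over the template's value.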
Further information on the use of templates can be found in the online help at http://localhost/nagios/docs/templaterecursion.html (also found locally in /usr/local/nagios/share/docs/templaterecursion.html).

2.12 Configuration Aids for Those Too Lazy to Type

2.12.1 Defining services for several computers

You can simplify things a lot in the service definition by defining a service for several hosts, or even host groups, at the same time:

    # -- /etc/nagios/mysite/services.cfg
    define service{
        host_name            linux01,linux02,linux04,...
        service_description  PING
        ...
    }

Specifying several hosts, separated by commas, ensures that Nagios defines multiple services in parallel. You can go one step further by specifying the '*' character instead of individual host names. This will assign this service to all hosts.

A third possibility is an allocation in parallel via host groups:

    # -- /etc/nagios/mysite/services.cfg
    define service{
        hostgroup_name       linux-servers,windows-servers
        service_description  PING
        ...
    }

In this case the parameter hostgroup_name is used instead of the host_name parameter.

2.12.2 One host group for all computers

The quickest way to describe a host group containing all defined computers is with the wild card '*':

    # -- /etc/nagios/mysite/hostgroups.cfg
    define hostgroup{
        hostgroup_name  all-hosts
        members         *
        ...
    }

2.12.3 Other configuration aids

In practice, the definition of services covering multiple hosts, described on page 56, is by far the most important. But there are other configuration aids based on the escalation and dependency objects, introduced on page 224 (see Sections 12.5 and 12.6). There you can also use hostgroup_name instead of host_name (a list of host groups) or servicegroup_name instead of service_description. In addition you may set the value * for host_name and service_description, which covers all hosts or services.
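The effect of these shortcuts can be pictured as an expansion into individual host-service pairs. The sketch below is a model of that idea, not the way Nagios processes its configuration; the function names and the sample host and group lists are assumptions.

```python
def split_list(value):
    """Split a comma-separated parameter value; a missing value gives []."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def expand_service(service, all_hosts, hostgroups):
    """Return the (host, service_description) pairs a definition stands for."""
    hosts = []
    for name in split_list(service.get("host_name")):
        # The wild card '*' stands for every defined host.
        hosts.extend(all_hosts if name == "*" else [name])
    for group in split_list(service.get("hostgroup_name")):
        hosts.extend(hostgroups[group])
    unique_hosts = list(dict.fromkeys(hosts))  # drop repeats, keep order
    return [(host, service["service_description"]) for host in unique_hosts]


# Hypothetical inventory used only for this illustration.
all_hosts = ["linux01", "linux02", "linux04", "win03"]
hostgroups = {
    "linux-servers": ["linux01", "linux02"],
    "windows-servers": ["win03"],
}

by_group = {
    "hostgroup_name": "linux-servers,windows-servers",
    "service_description": "PING",
}
print(expand_service(by_group, all_hosts, hostgroups))
```

Whichever form is used, the result is the same kind of list: one PING service per host, each identified by its host-service pair.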
2.13 CGI Configuration in cgi.cfg

In order for the Web front end to work correctly, Nagios needs the configuration file cgi.cfg. The example included, called cgi.cfg-sample, can initially be taken over one-to-one, since the paths contained in it were set correctly during installation:

    nagios@linux:/etc/nagios$ cp sample/cgi.cfg-sample ./cgi.cfg

Important: the file cgi.cfg should be located in the same directory as nagios.cfg, because this path is permanently compiled into the CGI programs. If cgi.cfg is located in a different directory, the Web server must also be given an environment variable with the correct path, called NAGIOS_CGI_CONFIG. How this is set in the case of Apache is described in the corresponding online documentation at http://httpd.apache.org/docs-2.0/env.html.

Out of the box, only a few parameters are enabled in the CGI configuration file. What these are is revealed by the following egrep command, which excludes comments and empty lines:

    nagios@linux:/etc/nagios$ egrep -v '^$|^#' cgi.cfg-sample | less
    main_config_file=/etc/nagios/nagios.cfg
    physical_html_path=/usr/local/nagios/share
    url_html_path=/nagios
    show_context_help=0
    use_authentication=1
    default_statusmap_layout=5
    default_statuswrl_layout=4
    refresh_rate=90

main_config_file
This parameter specifies the main configuration file.

physical_html_path
This specifies the absolute path in the file tree to the directory in which the HTML documents (including online documentation, images, and CSS stylesheets) are located.

url_html_path
This also describes the path to the Nagios HTML documents, but from the perspective of the Web server, not of the operating system.

show_context_help
As long as it is switched on (value 1), this option provides context-dependent help if you move the mouse over individual links or buttons in the Web interface.

use_authentication
This option should always be switched on (value 1). Nagios will then only allow access to authenticated users.
The authentication itself is configured in a .htaccess file in the CGI directory (see Section 1.3 on page 33). If this file is missing, and if use_authentication=1, then the CGI programs will refuse to work.

default_statusmap_layout and default_statuswrl_layout
These layout parameters describe forms of representation in the graphical illustration of network dependencies. Possible values are described in Appendix D.2 on page 444.

refresh_rate
This specifies the time span in seconds after which the browser is instructed to reload data from the Web server. In this way the display in the browser is always up to date.

authorized_for_all_services and authorized_for_all_hosts
In order for a specific user to be able to see all computers and services in the Web interface right from the beginning, without taking account of the allocation of hosts and services to the correct contact group, you should also activate the following two parameters in the file cgi.cfg:

    authorized_for_all_services=nagios
    authorized_for_all_hosts=nagios

The Web user (and contact) nagios is now able to see all hosts and all services in the Web interface, even if he is not entered as the contact responsible for all hosts or services.

A complete list of all parameters can be found in Appendix D.2 on page 443.

2.14 The Resources File resource.cfg

Nagios expects to find the definition of the macros used in creating command objects (Section 2.9 from page 53) in the resources file resource.cfg. This can also be copied from the example supplied:

    nagios@linux:/etc/nagios$ cp sample/resource.cfg-sample ./resource.cfg

The location where Nagios should search for this file is defined by the resource_file parameter in the main configuration file nagios.cfg. It makes sense here to use the same directory in which nagios.cfg is also located.
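How the macros from resource.cfg, the host definition, and the check_command arguments come together in a command_line can be sketched as follows. The function expand_command is hypothetical and only models the substitution described in Section 2.9; it is not how Nagios is implemented.

```python
import re


def expand_command(check_command, command_lines, macros):
    """Build the final command line for a check_command such as 'name!arg1!arg2'.

    command_lines maps a command_name to its command_line; macros holds
    the values for $USERx$ and $HOSTADDRESS$ without the dollar signs.
    """
    name, *args = check_command.split("!")
    values = dict(macros)
    # The arguments after the exclamation marks become $ARG1$, $ARG2$, ...
    for index, arg in enumerate(args, start=1):
        values["ARG%d" % index] = arg
    return re.sub(
        r"\$([A-Z0-9_]+)\$",
        lambda match: values[match.group(1)],
        command_lines[name],
    )


command_lines = {
    "check_ping": "$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5",
}
macros = {
    "USER1": "/usr/local/nagios/libexec",  # from resource.cfg
    "HOSTADDRESS": "192.168.1.9",          # address parameter of host linux01
}

print(expand_command("check_ping!100.0,20%!500.0,60%", command_lines, macros))
```

For the PING service of linux01, this prints /usr/local/nagios/libexec/check_icmp -H 192.168.1.9 -w 100.0,20% -c 500.0,60% -p 5.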
In its “factory settings”, resource.cfg defines only the $USER1$ macro, which contains the path to the plugins:

    $USER1$=/usr/local/nagios/libexec

In total, Nagios has provisions for 32 freely definable $USERx$ macros, where x can be from 1 to 32. These can be very useful in combination with passwords, for example: a password is defined via such a macro in the file resource.cfg, which may be read only by the user nagios. The defined macro is then used in the actual service definitions, thus hiding the password from the view of curious onlookers.

3 Startup

Once Nagios and the plugins are installed, Apache is set up for the Web interface, and the minimal configuration described so far is in place, operation of the system can get under way. If you have not already done so, it is recommended that you first spend a bit of time on the test for the check_icmp plugin, described in Section 1.2 (page 30), to check the initial configuration.

3.1 Checking the Configuration

The nagios program, which normally runs as a daemon and continually collects data, can also be used to test the configuration:

    nagios@linux:~$ /usr/local/nagios/bin/nagios -v /etc/nagios/nagios.cfg
    [...]
    Checking services...
    Checked 1 services.
    Checking hosts...
    Warning: Host 'linux02' has no services associated with it!
    Checked 2 hosts.
    Checking host groups...
    Checked 1 host groups.
    Checking service groups...
    Checked 0 service groups.
    Checking contacts...
    Warning: Contact 'wob' is not a member of any contact groups!
    Checked 2 contacts.
    Checking contact groups...
    Checked 1 contact groups.
    Checking service escalations...
    Checked 0 service escalations.
    Checking service dependencies...
    Checked 0 service dependencies.
    Checking host escalations...
    Checked 0 host escalations.
    Checking host dependencies...
    Checked 0 host dependencies.
    Checking commands...
    Checked 22 commands.
    Checking time periods...
    Checked 4 time periods.
    Checking extended host info definitions...
    Checked 0 extended host info definitions.
    Checking extended service info definitions...
    Checked 0 extended service info definitions.
    Checking for circular paths between hosts...
    Checking for circular host and service dependencies...
    Checking global event handlers...
    Checking obsessive compulsive processor commands...
    Checking misc settings...
    Total Warnings: 2
    Total Errors: 0
    Things look okay - No serious problems were detected during the pre-flight check

Although warnings displayed here can in principle be ignored, this is not always what the inventor had in mind: perhaps you made a mistake in the configuration, and Nagios is ignoring a specific object which you would actually like to use.

The first warning in the example refers to a host called linux02, which has not been allocated any services. Since Nagios works primarily with service checks, and uses host checks only if it needs them, a computer should basically always be allocated at least one service. Nagios issues a warning, as here, if no service at all has been defined for a particular host.

It is also recommended, however, to always define a “PING” service for every host, although this is not absolutely essential. Even if the same plugin, check_icmp, is used here as with the host check, this is not the same thing: the host check is satisfied with a single response packet; after all, it only wants to find out if the host “is alive”. As a service check, check_icmp registers packet run times and loss rates, which can be used to draw conclusions, if necessary, concerning existing problems with a network card.

The second warning refers to a contact named wob, who, although defined, is not used, because he does not belong to any contact group.

In contrast to warnings, genuine errors must be eliminated, because Nagios will usually not start if the parser finds an error, as in the following example:

    Error: Could not find any host matching 'linux03'
    Error: Could not expand hostgroups and/or hosts specified in service (config file '/etc/nagios/mysite/services.cfg', starting on line 0)
    ***> One or more problems was encountered while processing the config files...
Here the configuration mistakenly contains a host called linux03, for which there is no definition. If you read through the error message carefully, you will quickly realize that the error can be found in the file /etc/nagios/mysite/services.cfg.

In the definition of dependencies (host and service dependencies, see Section 12.6 page 234) there is a fundamental risk that circular dependencies could be specified by mistake. Because Nagios cannot automatically resolve such dependencies, this is also checked before the start, and if necessary, an error is displayed. When using the parents parameter, it is also possible that two hosts may inadvertently serve mutually as “parents”; Nagios also tests this.

3.2 Getting Monitoring Started

3.2.1 Manual start

During the Nagios installation, the command

    linux:src/nagios # make install-init

saves a startup script in the /etc/init.d directory. If the configuration test ran without error, Nagios is first started manually with this script:

    linux:~ # /etc/init.d/nagios start

3.2.2 Automatic start

If all runs smoothly here (which can be checked by running the Web interface, see Chapter 3.3), you only need to ensure that the script is also started when the system boots. Symbolic links exist in the directories /etc/init.d/rc[235].d for this purpose:

    linux:~ # ln -s /etc/init.d/nagios /etc/init.d/rc2.d/S99nagios
    linux:~ # ln -s /etc/init.d/nagios /etc/init.d/rc2.d/K99nagios

Corresponding links are also set in the subdirectories responsible for runlevels 3 and 5, rc3.d and rc5.d.

3.2.3 Making configuration changes come into effect

If configuration changes are made, it is not required, and not even recommended, that you restart Nagios each time. Instead, you just perform a reload:

    linux:~ # /etc/init.d/nagios reload

This causes Nagios to reread the configuration, end tests for hosts and services that no longer exist, and integrate new computers and services into the test. However, with each reload there is a renewed scheduling of checks, meaning that Nagios plans to carry out all tests afresh.
To prevent all tests from being started simultaneously at bootup, Nagios performs a so-called spreading. Here the server spreads the start times of the tests over a configurable period.[1] For a large number of services, it can therefore take a while before Nagios continues the test for a specific service. For this reason you should never run reloads at short intervals: in the worst case, Nagios will not manage to perform some checks in the intervening period and will perform them only some time after the most recent reload.

[1] The relevant configuration parameters are called max_host_check_spread and max_service_check_spread, see Appendix D.1, page 435.

Before being reloaded, the configuration is tested to eliminate any existing errors, as shown in Section 3.1.

3.3 Overview of the Web Interface

If you call the URL http://nagios-server/nagios in the browser when the Nagios daemon is running, you will be taken to the welcome screen shown in Figure 3.1.

Figure 3.1: The start screen

The so-called “tactical overview” (Tactical Overview), which can be reached via the first monitoring link in the left menu bar, is shown in Figure 3.2. It summarizes the status of all tested systems.

Figure 3.2: “Tactical” overview of all systems and services to be monitored

Considerably more interesting in practice, however, is the display of the menu item Service Problems (Figure 3.3). It documents the services that are currently causing problems, those that are not in the OK status, in the very sense for which Nagios was conceived: to inform the administrator precisely of any problems.

Figure 3.3: Nagios: summary of all service problems

The first column names the host involved. If this has a gray background, Nagios can reach the computer in principle. If the host is “down”, this can be seen by the red background. For services, red stands for CRITICAL and yellow for WARNING.

The second column provides the service name, the third column the status again, in plain text. Column four specifies the time of the last check.
Column five is interesting: it shows how long the current status has been going on.

The sixth column, with the heading Attempt, reveals how often Nagios has already performed the test (unsuccessfully): 3/3 means that the error status has been confirmed for the third time in succession, but that the test is only performed three times if there is an error (parameter max_check_attempts, see Section 2.3).

Figure 3.4: A summary of all hosts (extract)

Finally, the last column passes on the information from the plugin to the administrator, to whom it describes the current status in more detail. The top line in Figure 3.3, for example, warns that only five percent of storage space is available in the /usr file system of the host eli11.

The Host Detail (Figure 3.4) and Service Detail overviews provide an overview of all hosts and services. In practice you will be looking more precisely for information, either via a single host or on a host group or service group. The name in question is entered in the Show Host search field. Figure 3.5 shows this using the example of the eli11 host.

Figure 3.5: All services for the host eli11

Figure 3.6: The host group eli-linux in the grid view

Alternatively you can search for the names of host and service groups. An interesting variation here is to have a status grid output shown via the link Hostgroup Grid, which displays an overview of all hosts and their corresponding services, together with the status of these (Figure 3.6). Through the color of the service (green/yellow/red) you can see at a glance whether there are problems in the service group or host group that you are viewing.

In More Detail ...

4 Nagios Basics

The fact that a host can be reached, in itself, has little meaning if no service is running on it on which somebody or something relies. Accordingly, everything in Nagios revolves around service checks. After all, no service can run without a host. If the host computer fails, it also cannot provide the desired service.
Things get slightly more complicated if a router, for example, is brought into play, which lies between users and the system providing services. If this fails, the desired service may still be running on the target host, but it is nevertheless no longer reachable for the user.

Nagios is in a position to reproduce such dependencies and to precisely inform the administrator of the failure of an important network component, instead of flooding the administrator with irrelevant error messages concerning services that cannot be reached. An understanding of such dependencies is essential for the smooth operation of Nagios, which is why Section 4.1 will look in more detail at these dependencies and the way Nagios works.

Another important item is the state of a host or service. On the one hand Nagios allows a much finer distinction than just “ok” or “not ok”; on the other hand the distinction between soft state and hard state means that the administrator does not have to deal with short-term disruptions that have long since disappeared by the time the administrator has received the information. These states also influence the intensity of the service checks. How this functions in detail is described in Section 4.3.

4.1 Taking into Account the Network Topology

How Nagios handles dependencies of hosts and services can be best illustrated with an example. Figure 4.1 represents a small network in which the Domain Name Service on proxy is to be monitored.

Figure 4.1: Topology of an example network

The service check always serves as the starting point for monitoring that is regularly performed by the system. As long as the service can be reached, Nagios takes no further steps; that is, it does not perform any host checks. For switch1, switch2, and proxy, such a check would be pointless anyway, because if the DNS service on proxy responds, then the hosts mentioned are automatically accessible.

If the name service fails, however, Nagios tests the computer involved with a host check, to see whether the service or the host is causing the problem.
If proxy cannot be reached, Nagios might test the parent hosts entered in the configuration (Figure 4.2). With the parents host parameter, the administrator has a means available to provide Nagios with information on the network topology.

Figure 4.2: The order of tests performed after a service failure

When doing this, the administrator only enters the direct neighbor computer for each host on the path to the Nagios server as the parent.[1] Hosts that are allocated in the same network segment as the Nagios server itself are defined without a parent. For the network topology from Figure 4.1, the corresponding configuration (reduced to the host name and parent) appears as follows:

    define host{
        host_name proxy
        ...
        parents switch2
    }
    define host{
        host_name switch2
        ...
        parents switch1
    }
    define host{
        host_name switch1
        ...
    }

[1] The parameter name parents can be explained by the fact that there are scenarios, such as in high availability environments, in which a host has two upstream routers that guarantee the Internet connection, for example.

switch1 is located in the same network segment as the Nagios server, so it is not allocated a parent computer. What belongs to a network segment is a matter of opinion: if you interpret the switches as the segment limit, as is the case here, this has the advantage of being able to isolate a disruption more closely. But you can also take a different view and interpret an IP subnetwork as a segment. Then a router would form the segment limit; in our example, proxy would then count in the same network as the Nagios server. However, it would no longer be possible to distinguish between a failure of proxy and a failure of switch1 or switch2.

Figure 4.3: Classification of individual network nodes by Nagios

If switch1 in the example fails, Figure 4.3 shows the sequence in which Nagios proceeds: first the system, when checking the DNS service on proxy, determines that this service is no longer reachable (1).
To differentiate, it now performs a host check to see what the state of the proxy computer is (2). Since proxy cannot be reached, but it has switch2 as a parent, Nagios also subjects switch2 to a host check (3). If this switch also cannot be reached, the system checks its parent, switch1 (4).

If Nagios can establish contact with switch1, the cause for the failure of the DNS service on proxy can be isolated to switch2. The system accordingly specifies the states of the hosts: switch1 is UP, switch2 DOWN; proxy, on the other hand, is UNREACHABLE. Through a suitable configuration of the Nagios messaging system (see Section 12.3 on page 217) you can use this distinction to determine, for example, that the administrator is informed only about the host that is in the DOWN state and represents the actual problem, but not about the hosts that are dependent on the down host.

In a further step, Nagios can determine other topology-specific failures in the network (so-called network outages). proxy is the parent of gate, so gate is also represented as UNREACHABLE (5). gate in turn also functions as a parent; the Internet server dependent on this is also classified as “UNREACHABLE”.

This “intelligence”, which distinguishes Nagios, helps the administrator all the more, the more hosts and services are dependent on a failed component. For a router in the backbone, on which hundreds of hosts and services are dependent, the system informs administrators of the specific disruption, instead of sending them hundreds of error messages that are not wrong in principle, but are not really of any help in trying to eliminate the disruption.

4.2 Forced Host Checks vs. Periodic Reachability Tests

Service checks are carried out regularly by Nagios, host checks only when needed. Although the check_interval parameter provides a way of forcing regular host checks, there is no real reason to do this. There is one reason not to do this, however: continual host checks have a considerable influence on the performance of Nagios.
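The classification from Section 4.1 (UP, DOWN, UNREACHABLE) can be illustrated with a small Python sketch. It is not how Nagios is implemented; it only reproduces the rule described in the text under simplifying assumptions: every host has at most one parent, and the set `responding` stands in for the results of the host checks. The names `PARENTS`, `classify`, and `responding` are made up for this illustration.

```python
# Hypothetical parent map for the example network of Section 4.1.
# A value of None means "same segment as the Nagios server, no parent".
PARENTS = {
    "switch1": None,
    "switch2": "switch1",
    "proxy": "switch2",
    "gate": "proxy",
}


def classify(host, parents, responding):
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for a single host.

    A host that answers is UP. A host that does not answer is DOWN if
    its parent is UP (or if it has no parent); otherwise the fault lies
    further upstream and the host is only UNREACHABLE.
    """
    if host in responding:
        return "UP"
    parent = parents.get(host)
    if parent is None or classify(parent, parents, responding) == "UP":
        return "DOWN"
    return "UNREACHABLE"


# Only switch1 still answers host checks, as in the walk-through above.
states = {host: classify(host, PARENTS, {"switch1"}) for host in PARENTS}
for host, state in states.items():
    print(f"{host}: {state}")
```

With this input, only switch2 is reported as DOWN, which is the single host the administrator needs to hear about; proxy and gate end up as UNREACHABLE.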
If you nevertheless want to regularly check the reachability of a host, it is better to use a ping-based service check (see Section 6.2 from page 88). At the same time you will obtain further information such as the response times or possible packet losses, which provides indirect clues about the network load or possible network problems. A host check, on the other hand, also issues an OK even if many packets go missing and the network performance is catastrophic. What is involved here, as the name “host check” implies, is only reachability in principle and not the quality of the connection.

4.3 States of Hosts and Services

Nagios uses plugins for the host and service checks. They provide four different return values (cf. Table 6.1 on page 85): 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN).

The return value UNKNOWN means that the running of the plugin generally went wrong, perhaps because of wrong parameters. You can normally specify the situations in which the plugin issues a warning or a critical state when it is started.

Nagios determines the states of services and hosts from the return values of the plugin. The states for services are the same as the return values OK, WARNING, CRITICAL and UNKNOWN. For the hosts the picture is slightly different: the UP state describes a reachable host, DOWN means that the computer is down, and UNREACHABLE refers to the state of nonreachability, where Nagios cannot test whether the host is available or not, because a parent is down (see Section 4.1, page 72).

In addition to this, Nagios makes a distinction between two types of state: soft state and hard state. If a problem occurs for the first time (that is, if there was nothing wrong with the state of a service until now) then the program categorizes the new state initially as a soft state and repeats the test several times. It may be the case that the error state was just a one-off event that was eliminated a short while later. Only if the error continues to exist after multiple testing is it then categorized by Nagios as a hard state.
Administrators are informed only of hard states, because messages involving short-term disruptions that disappear again immediately afterwards only add to an unnecessary flood of information.

In our example the chronological sequence of states of a service can be illustrated quite simply. A service with the following parameters is used for this purpose:

    define service {
        host_name proxy
        service_description DNS
        ...
        normal_check_interval 5
        retry_check_interval 1
        max_check_attempts 5
        ...
    }

normal_check_interval specifies at what interval Nagios should check the corresponding service as long as the state is OK or if a hard state exists (in this case, every five minutes). retry_check_interval defines the interval between two service checks during a soft state (one minute in the example). If a new error occurs, then Nagios will take a closer look at the service at shorter intervals.

max_check_attempts determines how often the service check is to be repeated after an error has first occurred. If max_check_attempts has been reached and if the error state continues, Nagios inspects the service again at the intervals specified in normal_check_interval.

Figure 4.4 represents the chronological progression in graphic form: the illustration begins with an OK state (which is always a hard state). Normally Nagios will repeat the service check at five-minute intervals. After ten minutes an error occurs; the state changes to CRITICAL, but this is initially a soft state. At this point in time, Nagios has not yet issued any message.

Now the system checks the service at intervals specified in retry_check_interval; here this is every minute. After a total of five checks (max_check_attempts) with the same result, the state changes from soft to hard. Only now does Nagios inform the relevant people. The tests are now repeated at the intervals specified in normal_check_interval.
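The timeline just described can be replayed with a short Python sketch. It is a simplified model, not Nagios code: it treats every non-OK result as the same problem, ignores host states, and uses the three values from the service definition above as defaults. The function name `simulate` and its parameter names are chosen for this illustration only.

```python
def simulate(results, max_check_attempts=5, normal_interval=5, retry_interval=1):
    """Replay a sequence of check results.

    Returns a list of (minute, state, state_type, attempt) tuples, where
    state_type is 'SOFT' or 'HARD'. Simplifying assumption: any non-OK
    result continues the current problem, whatever its exact state.
    """
    timeline = []
    minute = 0
    attempt = 1
    previous_ok = True
    for state in results:
        if state == "OK":
            # An OK state is always a hard state.
            attempt, state_type = 1, "HARD"
            previous_ok = True
        else:
            # The first failed check counts as attempt 1.
            attempt = 1 if previous_ok else min(attempt + 1, max_check_attempts)
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
            previous_ok = False
        timeline.append((minute, state, state_type, attempt))
        # Soft states are rechecked at the shorter retry interval.
        minute += retry_interval if state_type == "SOFT" else normal_interval
    return timeline


timeline = simulate(["OK", "OK"] + ["CRITICAL"] * 5 + ["OK"])
for minute, state, state_type, attempt in timeline:
    print(f"minute {minute:2d}: {state:8s} {state_type} (attempt {attempt}/5)")
```

In this run the error first appears at minute 10 as a soft state, is rechecked every minute, and becomes a hard state with the fifth check at minute 14; the following check takes place five minutes later.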
Figure 4.4: Example of the chronological progression of states in a monitored service

In the next test the service is again available; thus its state changes from CRITICAL to OK. Since an OK state is always a hard state, this change is not subject to any tests by Nagios at shorter intervals.

The transition of the service to the OK state after an error in the hard state is referred to as a hard recovery. The system informs the administrators of this (if it is configured to do so) as well as of the change between various error-connected hard states (such as from WARNING to UNKNOWN). If the service recovers from an error soft state to the normal state (OK), also called a soft recovery, the administrators will, however, not be notified.

Even though the messaging system leaves out soft states and recoveries from soft states, Nagios will still record such states in the Web interface and in the log files. In the Web front end, soft states can be identified by the fact that the value 2/5 is listed in the column Attempts, for example. This means that max_check_attempts expects five attempts, but only two have been carried out until now. With a hard state, max_check_attempts is listed twice at the corresponding position, which in the example is therefore 5/5.

More important for the administrator in the Web interface than the distinction of whether the state is still “soft” or already “hard” is the duration of the error state in the column Duration. From this a better judgment can be made of how large the overall problem may be.

For services that are not available because the host is down, the entry 1/5 in the column Attempts would appear, since Nagios does not repeat service checks until the entire host is reachable again. The failure of a computer can be more easily recognized by its color in the Web interface: the service overview in Figure 3.3 (page 66) marks the failed host in red; if the computer is reachable, the background remains gray.
5 Service Checks and How They Are Performed

To test services, Nagios makes use of external programs called plugins. In the simplest case this involves testing an Internet service, for example, SMTP. Here the service can be addressed directly over the network, so it is sufficient to call a program locally on the Nagios server that tests the mail server on the remote host.

Not everything you might want to test can be reached so easily over the network, however: there is no network protocol for checking free capacity on a hard drive, for example. Then you must either start a plugin on the remote host via a remote shell (but first this has to be installed on the remote computer), or you use other methods, such as the Simple Network Management Protocol (SNMP), to test the hard drive capacity.

The fact that different methods are available here does not make it any easier in getting started with Nagios. For this reason, this chapter provides an overview of the common methods and attempts to bring an understanding of the underlying concepts involved. Later chapters then provide detailed configuration examples.

Figure 5.1: Nagios allows different testing methods

Figure 5.1 shows an overview of the various test methods supported by Nagios. The upper box with a gray background marks all the components that run directly on the Nagios server machine: this includes the server itself, as well as plugins and other auxiliary tools. This unit is in contact with five clients, which are tested in various different ways. The following sections will go into somewhat more detail regarding the individual methods.

In order to monitor the network service on the first client marked as service (starting from the left), the Nagios server runs its “own” plugin, check_xyz (Section 5.1, page 81). For the second client it starts the “middle plugin” check_by_ssh, in order to execute the plugin it really wants remotely on the client (Section 5.2, page 82).
In the third case the plugin is also executed directly on the client machine, but now Nagios uses the NRPE service, created specifically for this purpose. The query is made on the Nagios side with check_nrpe (Section 5.3, page 82).

The fourth method describes the query via SNMP. For this, the client must have an SNMP agent available (Section 11.1, page 178). Various plugins are available for querying data via SNMP (Section 5.4, page 83).

These four methods represent “active” checks, because Nagios takes the initiative and triggers the test itself. The fifth method, in contrast, is passive. Here Nagios does nothing actively, but waits for incoming information that the client sends to the Nagios server with the program send_nsca. On the Nagios server itself the Nagios Service Check Acceptor, NSCA, is running as a daemon that accepts the transmitted results and forwards them to the interface for external commands (Section 5.5, page 84).

There are other ways of performing checks in addition to these. Usually a separate service is installed on the client, which is then queried by the Nagios server via a specialized plugin. A typical example here is NSClient/NC_Net, which can be used to monitor Windows servers (Section 18.1, page 354).

5.1 Testing Network Services Directly

Mail or Web servers can be tested very simply over the network, since the underlying protocols, SMTP and HTTP, are, by definition, network-capable (Figure 5.1, page 80, Client 1). Nagios can call here on a wide range of plugins, each specialized for a particular service.

Such a specific program has advantages over a generic one: a generic plugin tests only whether the corresponding TCP or UDP port is open and whether the service is waiting there, but it does not determine whether the correct service is on the port, or whether it is active.
Specific plugins adopt the network protocol and test whether the service on the port in question behaves as it is expected to. A mail server, for example, normally responds with a so-called greeting after a connection has been established:

    220 swobspace.de ESMTP

The important thing here is the 220. A number in the 200 range means OK; 220 stands for the greeting. The check_smtp plugin evaluates this reply. It can also simulate the initial dialog when sending mail (in addition to the greeting), as shown in Section 6.3 from page 92.

It behaves in a similar way with other specific plugins, such as check_http, which not only can handle a simple HTTP dialog, but also manipulates HTTP headers where required, checks SSL capabilities and certificates of the Web server, and even sends data to the server with the POST command (more on this in Section 6.4 from page 97).

The package with the Nagios plugins, which is installed separately (see Section 1.2 from page 30), includes specific plugins for the most important network services. If one is missing for a specific service, it is worth taking a look at the Nagios homepage[1] or the Exchange for Nagios Add-ons.[2]

[1] http://www.nagios.org/
[2] http://www.nagiosexchange.org/

If no suitable plugin can be found there either, you can use the generic plugins check_tcp or check_udp, which, apart from a pure port test, also send data to the target port and evaluate the response (but this only makes sense in most cases if an ASCII-based protocol is involved). More on generic plugins in Section 6.7.1 from page 110.

5.2 Running Plugins via Secure Shell on the Remote Computer

To test local resources such as hard drive capacity, the load on the swap area, the current CPU load, or whether a specific process is running, various local plugins are available. They are called local because they have to be installed on the computer that is to be checked. The Nagios server has no way to directly access such information over the network without taking further measures.
But it can start local plugins on the remote host, via a remote shell (Figure 5.1, page 80, Client 2). Only the Secure Shell, SSH, can be considered for use here; the Remote Shell, RSH, simply has too many security holes.

To do this, the Nagios server runs the program check_by_ssh, which is given the command, as an argument, to run the local plugin on the target host. For this, check_by_ssh needs a way of logging in to the target host without a password, which can be set up with Public Key Authentication.

From the viewpoint of the Nagios server, check_by_ssh is the plugin whose results are processed. It does not notice anything concerning the start of the secure shell connection and of the remote plugin. The main thing is that the reply corresponds to the Nagios standard and contains the status plus a line of comment text for the administrator; see the introduction to Chapter 6 on page 85.

Further information on the Remote Execution of plugins via Secure Shell is provided in Chapter 9 from page 157.

5.3 The Nagios Remote Plugin Executor

An alternative method of running plugins installed on the target computer via the secure shell is represented by the Nagios Remote Plugin Executor (NRPE). Figure 5.1 (page 80) illustrates this with the middle client.

The NRPE is installed on the target host and started via the inet daemon, which must be configured accordingly. If NRPE receives a query from the Nagios server via the (selectable) TCP port 5666, it will run the matching query for this. As with the method using the Secure Shell, the plugin that is to perform the test must be installed on the target host.

So all this is somewhat more work than using the Secure Shell, especially as SSH ought to be installed on almost every type of Unix machine and, when it is used, enables monitoring to be configured centrally on the Nagios server. The Secure Shell method requires an account with a local shell, however, thus enabling any command to be run on the target host[3]; the Remote Plugin Executor, on the other hand, is restricted to the commands configured.
If you don’t want the user nagios to be able to do anything more than run plugins on the target host without a password, then you are better off sticking with NRPE. The installation configuration for this is described in Chapter 10 from page 165.

5.4 Monitoring via SNMP

With the Simple Network Management Protocol, SNMP, local resources can also be queried over the network (see also Client 4 in Figure 5.1, page 80). If an SNMP daemon is installed (NET-SNMPD is very extensively used, and is described in Section 11.2.2 from page 187), Nagios can use it to query local resources such as processes, hard drive and interface load.

The advantage of SNMP lies in the fact that it is widely used: there are corresponding services for both UNIX and Windows systems, and almost all modern network components such as routers and switches can be queried via SNMP. Even uninterruptible power supplies (UPSs) and other equipment sometimes have a network connection and can provide current status information via SNMP.

Apart from the standard plugin check_snmp, a generic SNMP plugin, there are various specialized plugins that concentrate on specific SNMP queries but are sometimes simpler to use. So check_ifstatus and check_ifoperstatus, for example, focus precisely on the status of network interfaces.

If you are grappling with SNMP for the first time, you will soon come to realize that the term “readable for human beings” did not seem to be high up on the list of priorities when the protocol was defined. SNMP queries are optimized for machine processing, such as for a network monitoring tool.

If you use the tool available from the vendor for its network components, SNMP will basically remain hidden to the user. But to use it with Nagios, you have to get your hands dirty and get involved with the protocol and its underlying syntax. It takes some getting used to, but it’s not really as difficult as it seems at first sight.

[3] The Secure Shell does allow a single command to be executed without opening a separate shell. Usually, however, you will want to test several resources, so you’ll need to run more than one command.
The use of SNMP is the subject of Chapter 11 (page 177); you can also learn there how to configure and use an SNMP daemon for Linux and other UNIX systems.

5.5 The Nagios Service Check Acceptor

The fifth method of processing the results of service checks leads to the use of the Nagios Service Check Acceptor, NSCA. This runs as a daemon on the Nagios server and waits for incoming test results (see Figure 5.1 on the right on page 80). This method is also referred to as passive, because Nagios itself does not take the initiative.

NSCA uses the interface for external commands used by CGI scripts, among others, to send commands to Nagios. It consists of a named pipe[4] from which Nagios reads the external commands. With the command PROCESS_SERVICE_CHECK_RESULT Nagios processes test results that were determined elsewhere. The interface itself is described in more detail in Section 13.1 from page 240.

The main area of use for NSCA is Distributed Monitoring. By this we mean several different Nagios installations that send their results to a central Nagios server. The distributed Nagios servers, perhaps in different branches of a company, work as autonomous and independent Nagios instances, except that they also send the results to a head office. This does not check the decentralized networks actively, but processes the information sent from the branches in a purely passive manner.

NSCA is not just restricted to distributed monitoring, however. With the program send_nsca, test results can be sent which were not obtained from a Nagios instance, but rather from a cron job, for example, which executes the desired service check.

Before you use NSCA, you should consider the security aspects. Because it can be used by external programs to send information and commands to Nagios, there is a danger that it could be misused. This should not stop you from using NSCA, but rather should motivate you into paying attention to security aspects during the NSCA configuration.
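To make the PROCESS_SERVICE_CHECK_RESULT command mentioned above more concrete, the following Python sketch builds one such command line in the semicolon-separated form that Nagios reads from its external command interface. The helper name `format_passive_result` is made up for this illustration, and the sketch deliberately stops at producing the string: the location of the named pipe depends on the installation and is not assumed here.

```python
def format_passive_result(host, service, return_code, output, timestamp):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the external
    command interface (hypothetical helper, for illustration only).

    return_code is the plugin return value: 0 OK, 1 WARNING,
    2 CRITICAL, 3 UNKNOWN.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0, 1, 2 or 3")
    # Only the first line of the plugin output is passed on.
    lines = output.splitlines()
    first_line = lines[0] if lines else ""
    return (
        f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{first_line}"
    )


# Example: a result for the DNS service on proxy, determined elsewhere.
line = format_passive_result(
    "proxy", "DNS", 2, "DNS CRITICAL - no response", 1130000000
)
print(line)
```

A line built this way is what ultimately reaches Nagios, regardless of whether the result was produced by another Nagios instance or by a cron job.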
Further information on using NSCA, distributed monitoring, and security in general is provided in Chapter 14 from page 247.

[4] A named pipe is a buffer to which one process writes something and from which another process reads out the data. This buffer is given a name in the file system so that it can be specifically addressed, which is why it is called a named pipe.

6 Plugins for Network Services

Every plugin that is used for host and service checks is a separate and independent program that can also be used independently of Nagios. The other way round, it is not so easy: in order for Nagios to use an external program, it must stick to certain rules. The most important of these concerns the return status that is returned by the program. Using this, Nagios precisely evaluates the status. Table 6.1 displays the possible values.

Table 6.1: Return values for Nagios plugins

Status  Name      Description
0       OK        Everything in order
1       WARNING   Warning limit has been exceeded, but critical limit not yet reached
2       CRITICAL  Critical limit exceeded or the plugin has broken off the test after a timeout
3       UNKNOWN   Error has occurred inside the plugin (the wrong parameter has been used, for example)

A plugin therefore does not distinguish using the pattern “OK/Not OK”, but is more differentiated. In order for it to be able to categorize a status as WARNING, it requires details of up to what measured value a certain event is regarded as OK, when it is seen as a WARNING, and when it is CRITICAL.

An example: apart from the response time, a ping also returns the rate of packet loss. For a slow network connection (ISDN, DSL), a response time of 1000 milliseconds could be seen as a warning limit and 5000 milliseconds as critical, because that would mean that interactive working is no longer possible. If there is a high load on the network connection, occasional packet loss could also occur,[1] so that 20 percent packet loss can be specified as a warning limit, 60 percent as the critical limit.
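The ping example can be sketched in a few lines of Python. This is only an illustration of the threshold logic, not the code of any real plugin: the function names classify and classify_ping are our own, and we assume that a limit counts as exceeded as soon as the measured value reaches it. The limits are the ones from the example (1000 and 5000 milliseconds, 20 and 60 percent).

```python
# Return values as listed in Table 6.1
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(value, warn, crit):
    """Map one measured value to a status (assumption: limit reached = limit exceeded)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def classify_ping(rta_ms, loss_pct, warn=(1000.0, 20.0), crit=(5000.0, 60.0)):
    """Combine response time and packet loss; the worse of the two states wins."""
    return max(classify(rta_ms, warn[0], crit[0]),
               classify(loss_pct, warn[1], crit[1]))


print(classify_ping(250.0, 0.0))    # fast, no loss
print(classify_ping(1200.0, 0.0))   # slow response only
print(classify_ping(300.0, 60.0))   # heavy packet loss
```

Because OK, WARNING, and CRITICAL are ordered numerically, taking the maximum of the two partial results yields the more severe state.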
[1] ICMP packets are not re-sent; a lost packet remains lost.

The following applies in all cases: the administrator decides what values shall serve as warning signs or be regarded as critical. Since all services can be individually configured, the values for each host may vary, even for the same plugin.

Plugins always have a timeout, which is usually ten seconds. This prevents the program from waiting endlessly, and thus stops a large number of plugin processes from accumulating on the Nagios host. In other respects too, a response time above 10 seconds makes little sense for many applications, since these interrupt connection attempts themselves after a certain time span, which has the same effect as the total failure of the corresponding service. Here the administrator can also step in and explicitly specify a different timeout.

A further characteristic of all plugins is a text output, which Nagios shows in its overview and which is principally intended for the administrator, so it needs to be “human-readable”. Since Nagios shows only the first line, this text output should not be too long. In addition, Nagios currently processes only a maximum of 300 characters of the text output; the rest is simply cut off. We recommend the following form for the text output:

    TYPE_OF_CHECK STATUS - informational text

In practice, the text output looks like this:

    SMTP OK - 0.186 sec. response time
    DISK WARNING - free space: /net/eli02/a 3905 MB (7%);

The first example is from the plugin check_smtp, the second from check_disk. In both cases, the type of check (here SMTP or DISK) is followed by the status in text form and then the actual information. Not all plugins adhere to this recommendation in their output. Sometimes the detail of the test type is missing, and sometimes even the status is missing.
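For authors of their own plugins, the recommended output form can be sketched as follows. The helper names plugin_output and finish are hypothetical; the sketch assumes the recommended form TYPE_OF_CHECK STATUS - informational text, keeps only the first line, and cuts the text at the 300 characters mentioned above.

```python
import sys

# Status names for the return values 0 to 3
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
MAX_OUTPUT = 300  # characters of text output that Nagios processes, according to the text


def plugin_output(check_type, status, text):
    """Build the one-line text output in the recommended form."""
    line = "{} {} - {}".format(check_type, STATUS_NAMES[status], text)
    # Nagios shows only the first line, so anything after a line break is dropped
    line = line.splitlines()[0]
    return line[:MAX_OUTPUT]


def finish(check_type, status, text):
    """Print the text output and leave the program with the matching return value."""
    print(plugin_output(check_type, status, text))
    sys.exit(status)


print(plugin_output("SMTP", 0, "0.186 sec. response time"))
```

A real plugin would call finish() once at the end of its test, so that the text output and the return value always agree with each other.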
Various plugins also provide performance information, which can be evaluated and graphically represented with external programs (see Chapter 17, page 313):

    OK - 172.17.129.2: rta 97.751ms, lost 0%| rta=97.751ms;200.000;500.000;0; pl=0%;40;80;;

As can be seen here from the example of the check_icmp plugin, the performance data follows the text output, separated by the pipe character |. But this data does not appear in the Web interface.

check_icmp here provides two values: the average reply time, rta (round-trip average), in milliseconds and the packet loss rate, pl.[2] For each variable, the plugin first displays the measured value (97.751ms and 0%), followed by the warning limit (200 milliseconds or 40 percent) and the critical limit (500 milliseconds or 80 percent).

To keep the installation (Section 1.2 from page 30) as simple as possible, there are no manual pages for the plugins. Each of these programs must provide an online help which is displayed with the option -h or --help. Some plugins distinguish here between a short help (-h) and a long one (--help); it is therefore recommended that you always try out --help as well.

This chapter introduces the most important plugins from the basic distribution of the nagios-plugins package (version 1.3.1 or 1.4.x) which test network services. With their help, the Nagios server queries services on other servers. The description is restricted to the functionality that is important for normal operation. If you are interested in all the options, we refer you to the integrated online help.

6.1 Standard Options

Table 6.2 lists the options that are common to all plugins. The options in bold type must be known to all plugins. The keywords not in bold type can be omitted by the programs, but if they are supported at all, they must be used in the sense specified.

If an option demands an argument, it is usually separated by a space in the short form, but by an equals sign in the long form. But for Perl or shell scripts in particular, not all authors adhere to these conventions, so you have no option here but to take a look at the corresponding description.
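The two ways of passing arguments can be tried out with a small Python sketch based on the standard module argparse. It is an assumption-laden illustration, not part of the plugin package: the program name check_example and its version string are invented, and only option names from the standard set (-h, -V, -v, -H, -t, -w, -c) are defined.

```python
import argparse


def build_parser():
    """Parser for a hypothetical plugin that knows only the standard options."""
    parser = argparse.ArgumentParser(prog="check_example", add_help=False)
    parser.add_argument("-h", "--help", action="help")
    parser.add_argument("-V", "--version", action="version",
                        version="check_example 0.1")
    parser.add_argument("-v", "--verbose", action="count", default=0)
    parser.add_argument("-H", "--hostname")
    parser.add_argument("-t", "--timeout", type=int, default=10)
    parser.add_argument("-w", "--warning")
    parser.add_argument("-c", "--critical")
    return parser


# Short form: option and argument separated by a space
short = build_parser().parse_args(
    ["-H", "192.168.1.13", "-w", "100.0,20%", "-c", "200.0,40%", "-vv"])

# Long form: option and argument joined by an equals sign
long_form = build_parser().parse_args(
    ["--hostname=192.168.1.13", "--warning=100.0,20%", "--critical=200.0,40%"])

print(short.hostname, short.warning, short.critical, short.verbose)
print(long_form.hostname, long_form.warning, long_form.critical, long_form.timeout)
```

Both calls lead to the same values for host, warning limit, and critical limit; the limits are deliberately kept as plain strings here, because their interpretation differs from plugin to plugin.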
[2] Short for packet loss.

Table 6.2: Standard options of plugins

Short form  Long form    Description
-h          --help       Output of the online help
-V          --version    Output of the plugin version
-v          --verbose    Output of additional information. This option may be given multiple times.[3]
-H          --hostname   Host name or IP address of the target
-t          --timeout    Timeout in seconds after which the plugin will interrupt the operation and return the CRITICAL status
-w          --warning    Specifies the warning limit value
-c          --critical   Specifies the critical limit value
-4          --use-ipv4   Force IPv4 to be used
-6          --use-ipv6   Force IPv6 to be used

[3] Whether this leads to more information depends on the individual plugin ...

Thus it is not allowed to use -c, for example, for anything other than specifying a critical limit. How exactly -c and -w are used may, on the other hand, vary from plugin to plugin, because sometimes an individual value may be required, at other times multiple values (see also the explanations on the plugin check_icmp, described below).

Not all plugins can handle the options -4 and -6, with which the user can choose the version of the IP protocol to use, and if they can handle these, then usually only from plugin version 1.4.

6.2 Reachability Test with Ping

The classic reachability test in UNIX systems has always been a ping, which sends an “ICMP echo request” packet and waits for an “ICMP echo reply” packet. The Nagios plugin package includes two programs that carry out this ping check: check_icmp and check_ping. Even though check_ping is used in the standard configuration, you should replace it with the more efficient check_icmp, which has been included since plugin version 1.4.

Whereas check_ping calls the UNIX program /bin/ping, which is why there are always compatibility problems with the existing ping version, check_icmp sends ICMP packets without any external help programs. check_icmp basically works more efficiently, since it does not wait for one second between individual packets, as ping does.
In addition, it evaluates ICMP error messages such as ICMP host unreachable, while check_ping discards these. check_icmp is backwards-compatible with check_ping; this makes it easy to do without check_ping entirely and to replace it with check_icmp.

check_icmp measures the reply time of the ICMP packets and determines the proportion of packets that have been lost. If an error message arrives instead of the expected “ICMP echo reply”, this is evaluated immediately. Thus Nagios breaks off the test if an “ICMP host unreachable” message arrives.

check_icmp has the following options:[4]

-H address
    Without the host name or the IP address of the computer to be tested, check_icmp cannot work. With -H, multiple host entries can be specified, separated by spaces.

-w response time,packet loss percent%
    This switch sets the warning limit. response time stands here for the desired response time in milliseconds, packet loss percent for the corresponding packet loss as a percentage. If you specify -w 500.0,20% the plugin will give a warning either if the response time is at least 500.0 milliseconds or if 20 percent or more of ICMP packets are lost.

-c response time,packet loss percent%
    This switch specifies the critical limit in the same way as -w defines the warning value. The critical limit should always be larger than the warning limit.

-n packets
    With packets you can set the number of packets that check_icmp should use for each test. The default is 5 packets.

-t timeout
    After timeout seconds have passed, the plugin interrupts the test and returns the CRITICAL status. The default is 10 seconds.

[4] The online help check_icmp -h does state that it knows the options in the long form as well, but these are neither implemented in version 1.5, included in the Nagios plugins 1.4, nor in later versions up to 1.18.

Like the program /bin/ping, check_icmp must also run with root permissions, which is why the SUID bit is set:
    linux:~ # chown root.nagios /usr/local/nagios/libexec/check_icmp
    linux:~ # chmod 4711 /usr/local/nagios/libexec/check_icmp
    linux:~ # ls -l /usr/local/nagios/libexec/check_icmp
    -rwsr-x--x 1 root nagios 61326 2005-02-08 19:49 check_icmp

For a test, you should execute the plugin on the command line as the user nagios, since Nagios will later execute it under this account:

    nagios@linux:~$ cd /usr/local/nagios/libexec
    nagios@linux:nagios/libexec$ ./check_icmp -H 192.168.1.13 \
        -w 100.0,20% -c 200.0,40%
    OK - 192.168.1.13: rta 0.253ms, lost 0%| rta=0.253ms;100.000;200.000;0; pl=0%;20;40;;

check_icmp then sends the standard number of five ICMP packets on their way, and instead of an OK, issues a WARNING as soon as the response time, averaged over all the packets, is at least 100.0 milliseconds, or if 20 percent or more are lost, that is, at least one packet in five. For a CRITICAL status, the average response time must be at least 200.0 milliseconds, or at least two packets (40 percent of five) must remain unanswered.

6.2.1 check_icmp as a service check

In order that check_icmp can be used as a service check, you need to have a suitable command object. The file checkcommands.cfg already has one for the ping service, using check_ping. We will just replace the check_ping plugin in it with check_icmp:

    define command{
        command_name  check_ping
        command_line  $USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$
    }

The macro $HOSTADDRESS$ provides the IP address from the address parameter of the host definition, and with the two freely defined macros $ARG1$ and $ARG2$, parameters can be taken over from the service definition, so that warning and critical limits can be set with these.
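How the macros end up in a finished command line can be illustrated with a short Python sketch. It is a simplified model of the substitution and not the code Nagios itself uses: the function name expand_command is invented, the value of $USER1$ is assumed to be the plugin directory /usr/local/nagios/libexec, and the arguments are taken from a check_command entry in which they follow the command name, separated by exclamation marks.

```python
def expand_command(command_line, check_command, host_address,
                   user1="/usr/local/nagios/libexec"):
    """Replace $USER1$, $HOSTADDRESS$ and $ARGn$ in a command_line (simplified model)."""
    # "check_ping!100.0,20%!500.0,60%" -> command name plus argument list
    name, *args = check_command.split("!")
    macros = {"$USER1$": user1, "$HOSTADDRESS$": host_address}
    for number, value in enumerate(args, start=1):
        macros["$ARG{}$".format(number)] = value
    for macro, value in macros.items():
        command_line = command_line.replace(macro, value)
    return name, command_line


name, expanded = expand_command(
    "$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$",
    "check_ping!100.0,20%!500.0,60%",
    "192.168.1.21")
print(name)
print(expanded)
```

The first value after the command name lands in $ARG1$ and thus behind -w, the second in $ARG2$ and thus behind -c.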
In the service definition (an extract of it is shown here)[5] for the PING service, the check_command entry, in addition to the name of the command object to be executed, now needs two arguments, which are entered after the command, each separated by an exclamation mark:

    define service{
        service_description  PING
        host_name            linux01
        check_command        check_ping!100.0,20%!500.0,60%
        ...
    }

[5] Like any other object, service definitions can also be defined in a file of your choice, from which Nagios loads object definitions. For the sake of clarity, it is best to choose a descriptive name for the file, such as services.cfg, as in our example on page 39.

From the definition of the command object, you can see that the first parameter (100.0,20%) defines the warning limit, and the second one (500.0,60%) defines the critical value.

6.2.2 check_icmp as a host check

To be able to use the plugin under the name check_host for host checks, a corresponding symbolic link to check_icmp is set:

    linux:~ # cd /usr/local/nagios/libexec
    linux:nagios/libexec # ln -s check_icmp check_host

If it is called under its new name, check_host, the plugin modifies its behavior somewhat: it interrupts the test after receiving the first ICMP echo reply, because a single reply packet is enough to prove that the host “is alive”. The same applies if the first response to be returned is an error message such as ICMP network unreachable or host unreachable: the host is then considered to be unreachable.

Host checks are defined like every other check. The only difference is that this test is specified during the definition of the host object (and not of a service object):

    define host{
        host_name      linux01
        alias          Linux File Server
        address        192.168.1.21
        check_command  check-host-alive
        ...
    }

The name used here, check-host-alive, can be freely defined and can be specified separately for each host. The definition of the command itself is made in checkcommands.cfg:

    define command{
        command_name  check-host-alive
        command_line  $USER1$/check_host -H $HOSTADDRESS$
    }

Host checks do not always need to be executed with check_icmp. You could just as well measure the refrigerator temperature or test, with the generic plugins for TCP or UDP (check_tcp and check_udp; see Section 6.7.1 from page 110), whether a specific port is open or not. The port scanner nmap, for example, uses TCP port 80 (HTTP). The disadvantage of such a method lies in the fact that, apart from the host itself, another application also needs to run, that is, the Web server. In addition, a failed test of a specific application by no means proves that the computer itself is no longer reachable. A ping has the great advantage that the kernel replies to “ICMP echo request” messages itself, so that no application needs to be running for this. You should therefore change from ping to other host check methods only if there is a good reason to do so. One example might be a firewall that filters ICMP messages, and over which the administrator has no influence, but that does let through HTTP queries on TCP port 80.

6.3 Monitoring Mail Servers

A number of plugins are also available to monitor mail servers: the mail server itself (Mail Transport Agent (MTA)) is monitored by check_smtp, and in addition to this the mail queue on the mail server can be checked with check_mailq. Since this test takes place locally, the plugin is described in the next chapter in Section 7.8 (page 147).

To monitor the “Mail User Agent (MUA)” protocols POP3 and IMAP, including the SSL variants POP3S and IMAPS, the plugin check_tcp is used: check_pop and so forth are symbolic links to check_tcp, which determines which protocol it should test by means of the name by which it is called, and makes the relevant presettings.
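The idea of one program that behaves differently depending on the name under which it is called can be sketched in Python. This is not how check_tcp is actually implemented; the table PRESETS and the function presets_for are our own names, and the values are the presettings that Section 6.3.2 gives for the four POP and IMAP variants.

```python
import os

# Presettings per calling name, taken from the description in Section 6.3.2
PRESETS = {
    "check_pop":   {"port": 110, "ssl": False, "expect": "+OK",  "quit": "QUIT\r\n"},
    "check_spop":  {"port": 995, "ssl": True,  "expect": "+OK",  "quit": "QUIT\r\n"},
    "check_imap":  {"port": 143, "ssl": False, "expect": "* OK", "quit": "a1 LOGOUT\r\n"},
    "check_simap": {"port": 993, "ssl": True,  "expect": "* OK", "quit": "a1 LOGOUT\r\n"},
}


def presets_for(argv0):
    """Return the presettings that belong to the name the program was called with."""
    name = os.path.basename(argv0)
    # An unknown name (for example check_tcp itself) gets no presettings
    return PRESETS.get(name, {})


print(presets_for("/usr/local/nagios/libexec/check_spop"))
print(presets_for("./check_imap"))
```

In a real program, argv0 would be sys.argv[0], which contains the name of the symbolic link rather than the name of the file it points to.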
6.3.1 Monitoring SMTP with check_smtp

The SMTP monitoring plugin check_smtp has the following options:

-H address / --host=address
    This specifies the computer on which the SMTP service should be checked.

-p port / --port=port
    port specifies the port, in case the mail service is not listening on the standard port 25. In this way the mail virus scanner Amavis (usually port 10024) can be monitored, for example. But this can normally be reached only from localhost.

-e string / --expect=string
    string defines the text which the mail server must provide in the very first reply line. The default setting for string is 220, with which the normal SMTP greeting begins, but there may be servers that have different settings. A wrong reply from the service monitored will generate a WARNING.

-f address / --from=address
    With address you specify a mail address that check_smtp then sends to the server with the “MAIL FROM:” command. This option is required to test a Microsoft Exchange 2000 Server.

-C "mail command" / --command="mail command" (from version 1.4)
    With -C you can send individual mail commands to the server, to extend the test slightly (see example below).

-R "string" / --response="string" (from version 1.4)
    If you send an SMTP command to the server with -C, you can specify the expected reply here instead of string (for example, 250). A “wrong” reply triggers a WARNING.

-4 / --use-ipv4 (from version 1.4)
    The test is performed explicitly over an IPv4 connection.

-6 / --use-ipv6 (from version 1.4)
    The test is performed explicitly over an IPv6 connection.

-S / --starttls (from version 1.4)
    The connection setup during the test uses STARTTLS.

-w floating point decimal / --warning=floating point decimal
    If the server takes longer than floating point decimal seconds for the answer, check_smtp issues a WARNING.

-c floating point decimal / --critical=floating point decimal
    Like -w, except that check_smtp issues a CRITICAL after floating point decimal seconds.
In the simplest case, you just enter the name or the IP address of the mail server:

    nagios@linux:nagios/libexec$ ./check_smtp -H smtp01
    SMTP OK - 0,008 sec. response time|time=0,008157s;;;0,000000

After receiving the SMTP greeting, the plugin check_smtp sends back a HELO hostname, to which the reply should contain 250. The definition of the corresponding command object in this case appears as follows:

    define command{
        command_name  check_smtp
        command_line  $USER1$/check_smtp -H $HOSTADDRESS$
    }

To check the host object linux01 with this, the following service definition is required:

    define service{
        service_description  SMTP
        host_name            linux01
        check_command        check_smtp
        ...
    }

Using the -C option, the SMTP dialog can be extended even further, roughly until RCPT TO:

    nagios@linux:nagios/libexec$ ./check_smtp -H localhost \
        -C "MAIL FROM: " -R "250" \
        -C "RCPT TO: " -R "554"
    SMTP OK - 0,019 sec. response time|time=0,018553s;;;0,000000

Such a test could be used, for example, to check the configuration of the restrictions built into the mail server (invalid domains, spam defenses, and more). The example checks whether the mail server refuses to accept a mail containing the invalid domain gna.dot (that is, in the RCPT TO:). The test runs successfully, therefore, if the server rejects the mail with 554.

What check_smtp does here corresponds to the following mail dialog reproduced by telnet:

    user@linux:~$ telnet localhost 25
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    220 swobspace.de ESMTP
    helo swobspace
    250 swobspace.de
    MAIL FROM:
    250 Ok
    RCPT TO:
    554 : Recipient address rejected: test not \
        existing top level domain
    ...

If the mail server did not reject the recipient domain because of a configuration error, the reply would no longer contain 554 and the plugin would issue a WARNING.
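The logic behind -e and -R, comparing each reply with the expected text and issuing a WARNING on a mismatch, can be modeled in Python without any network access. The function evaluate_dialog is a hypothetical helper; we assume here that a reply counts as correct if it begins with the expected string, which may differ in detail from what the real plugin does.

```python
OK, WARNING = 0, 1


def evaluate_dialog(steps):
    """steps: list of (expected, reply) pairs in the order of the SMTP dialog."""
    for expected, reply in steps:
        # Assumption: the reply must begin with the expected string
        if not reply.startswith(expected):
            return WARNING, "unexpected reply: " + reply
    return OK, "all replies as expected"


# Replies as in the telnet dialog: greeting, MAIL FROM accepted, RCPT TO rejected
dialog = [
    ("220", "220 swobspace.de ESMTP"),
    ("250", "250 Ok"),
    ("554", "554 : Recipient address rejected"),
]
status, text = evaluate_dialog(dialog)

# A misconfigured server that accepts the invalid recipient domain
misconfigured = dialog[:2] + [("554", "250 Ok")]
bad_status, bad_text = evaluate_dialog(misconfigured)

print(status, text)
print(bad_status, bad_text)
```

The second call shows the case that the test is meant to detect: the server answers the RCPT TO: with 250 instead of 554, and the result is a WARNING.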
In general you should remember, when checking restrictions, that depending on the configuration, the server rejects mails only after a RCPT TO:, even if the reason for this (a certain client IP address, the server name in HELO, or the sender address in MAIL FROM:) has already occurred before this.

6.3.2 POP and IMAP

Four pseudo plugins are available for testing the POP and IMAP protocols: check_pop, check_spop, check_imap, and check_simap. They are called pseudo plugins because they are just symbolic links to the plugin check_tcp. By means of the name with which the plugin is called, this determines its intended use and correspondingly sets the required parameters, such as the standard port, whether something should be sent to the server, the expected response, and how the connection should be terminated. The options are the same for all four plugins, which is why we shall introduce them all together:

-H address / --host=address
    This specifies the computer on which POP or IMAP is to be checked.

-p port / --port=port
    port specifies an alternative port if the plugin is intended to monitor a different port from the standard one: 110 for check_pop, 995 for check_spop, 143 for check_imap, and 993 for check_simap (see also /etc/services).

-w floating point decimal / --warning=floating point decimal
    The placeholder floating point decimal is replaced by the warning limit for the response time in seconds, specified as a floating point decimal.

-c floating point decimal / --critical=floating point decimal
    This sets the critical limit for the response time in seconds (see -w).

-s "string" / --send="string"
    This string is to be sent to the server. In the default setting, none of the four plugins uses this option.

-e "string" / --expect="string"
    string specifies the reply that the server should give. The default is +OK for (S)POP and * OK for (S)IMAP.

-q "string" / --quit="string"
    This is the string with which the service is requested to end the connection. For (S)POP this is QUIT\r\n, for (S)IMAP, a1 LOGOUT\r\n.
-S / --ssl (from version 1.4)
    The connection set up during the test uses SSL/TLS. If you call the plugins check_simap and check_spop, this option is set automatically. In order for a connection to be established, the server must support SSL/TLS directly on the addressed port.

    The plugin does not support STARTTLS[6] on its own. With

        ./check_imap -H computer -s "a1 CAPABILITY" -e "STARTTLS"

    you can at least check whether the server provides this method: the plugin returns OK if the reply string contains STARTTLS, or WARNING if it doesn’t. But this is not a genuine test of whether STARTTLS really does work properly.

[6] STARTTLS refers to the capacity of a service to set up an SSL/TLS-secured connection after a normal connection has been established, for example, for POP3, via TCP port 110. Every service that implements STARTTLS must have a suitable command available to do this. With POP3 this is called STLS (see RFC 2595). STARTTLS is used with SMTP, LDAP, IMAP, and POP3, among others, but not every server supports this method automatically.

Of course, all the other options of the generic plugin check_tcp (described in Section 6.7.1 from page 110) can be used with check_pop, check_spop, check_imap, and check_simap.

In the simplest case you just need to give the name of the computer to be tested (here: mailsrv) or the IP address:

    nagios@linux:nagios/libexec$ ./check_pop -H mailsrv
    POP OK - 0.064 second response time on port 110 [+OK eli11 Cyrus POP3
    v2.1.16 server ready <1481963980.1118597146@eli11>]
    |time=0.064228s;0.000000;0.000000;0.000000;10.000000

In each case the plugin provides just one line of output, which has been line-wrapped here for layout reasons. The details after the pipe character | in turn involve performance data not shown by the Web interface. The structure of performance data and how they are processed are described in more detail in Section 17.1 from page 314.

Implemented as a command object, the above check_pop command looks like this:

    define command{
        command_name  check_pop
        command_line  $USER1$/check_pop -H $HOSTADDRESS$
    }

As a service for the machine linux01, it is integrated like this:

    define service{
        service_description  POP
        host_name            linux01
        check_command        check_pop
        ...
    }

6.4 Monitoring FTP and Web Servers

The Nagios plugin package provides two plugins to monitor the classic Internet services FTP and HTTP (including HTTPS): check_ftp and check_http. When many users from a network are using Web services, a proxy is usually used in addition. To monitor this, you could also use check_http, but with the check_squid.pl plugin, the Nagios Exchange has a better tool available.

6.4.1 FTP services

The plugin check_ftp is, like the plugins for POP and IMAP, a symbolic link to the generic plugin check_tcp, so that it also has the same options. They are described in detail in Section 6.7.1 on page 110.

The generic plugin sets the following parameters if it is called with the name check_ftp:

    --port=21 --expect="220" --quit="QUIT\r\n"

It does not send a string to the server, but it expects a reply containing the text 220, and it ends the connection to the standard port 21 cleanly with QUIT\r\n.

On the command line there is, as usual, a one-line reply (with line breaks for the printed version) with performance data after the | character that is not shown by the Web interface (see Section 17.1 from page 314 for an explanation of this):

    nagios@linux:nagios/libexec$ ./check_ftp -H ftp.gwdg.de
    FTP OK - 0,130 second response time on port 21 [220-Gesellschaft fuer
    wissenschaftliche Datenverarbeitung mbH Goettingen]
    |time=0,130300s;0,000000;0,000000;0,000000;10,000000

As a command object, this call appears as follows:

    define command{
        command_name  check_ftp
        command_line  $USER1$/check_ftp -H $HOSTADDRESS$
    }

A corresponding service definition looks like this:

    define service{
        service_description  FTP
        host_name            linux01
        check_command        check_ftp
        ...
    }

6.4.2 Web server control via HTTP

The check_http plugin for HTTP and HTTPS checks provides a large number of very useful options, depending on the intended use:

-H virtual host / --hostname=virtual host
    This switch specifies the virtual host name that the plugin transmits in the HTTP header in the Host: field:

        nagios@linux:nagios/libexec$ ./check_http -H www.swobspace.de
        HTTP OK HTTP/1.1 200 OK - 2553 bytes in 0.154 seconds

    If you don’t want check_http to send this, you can use -I instead.

-I ip-address / --IP-address=ip-address
    Instead of ip-address, the host name or IP address of the target computer is given. For systems with several virtual environments, you will land in the default environment, and for most Web hosting providers you will then receive an error message:

        nagios@linux:nagios/libexec$ ./check_http -I www.swobspace.de
        HTTP WARNING: HTTP/1.1 404 Not Found

-u url or path / --url=url or path
    The argument is the URL to be sent to the Web server. If the document in question lies on the server to be tested, it is sufficient to enter the directory path, starting from the document root of the server:

        nagios@linux:nagios/libexec$ ./check_http -H linux.swobspace.net \
            -u /mailinglisten/index.html
        HTTP OK HTTP/1.1 200 OK - 5858 bytes in 3.461 seconds

    If this option is not specified, the plugin asks for the document root /.

-p port / --port=port
    This is an alternative port specification for HTTP.

-w floating point decimal / --warning=floating point decimal
    This is the warning limit for the response time of the Web server in seconds.

-c floating point decimal / --critical=floating point decimal
    This is the critical limit for the response time of the Web server in seconds.

-t timeout / --timeout=timeout
    After timeout seconds have expired, the plugin interrupts the test and returns the CRITICAL status. The default is 10 seconds.

-L / --link-url
    This option ensures that the virtual host in the text output appears on the Web interface as a link.
        nagios@linux:nagios/libexec$ ./check_http -H www.swobspace.de -L
        HTTP OK HTTP/1.1 200 OK - 2553 bytes in 0.156 seconds

-a username:password / --authorization=username:password
    If the Web server requires authentication, this option can be used to specify a user-password pair. The plugin can only handle basic authentication, however; digest authentication is currently not yet possible.

-f behavior / --onredirect=behavior
    If the Web server sends a redirect as a reply to the requested Web page, the behavior parameter influences the behavior of the plugin. The values ok, warning, critical, and follow are allowed. The default is ok, so the plugin will simply return an OK, without following the redirect. The plugin can be made to follow the redirect with follow. With warning and critical, a redirect returns the WARNING or CRITICAL status.

-e "string" / --expect="string"
    This is the text that the server response should contain in its first status line. If this option is not specified, the plugin expects HTTP/1. as a string.

-s "string" / --string="string"
    This is the search text that the plugin looks for in the contents of the page returned, not in the header.

-r "regexp" / --regex="regexp"
    This is a regular expression[7] for which the plugin should search in the page returned.

-R "regexp" / --eregi="regexp"
    This switch works like -r, except that the plugin now makes no distinction between upper and lower case.

[7] POSIX regular expressions, see man 7 regex.

-l / --linespan
    Normally the search for regular expressions is restricted to one line with -r and -R. If -l precedes these options, the search pattern can refer to text covering multiple lines.

-P string / --post=string
    Use this switch for data that you would like to send via a POST command to the Web server. The characters in string must be encoded in accordance with RFC 1738:[8] only the letters A to Z (upper and lower case), the special characters $-_.+!*'(), and the numbers 0 to 9 are allowed.
    To send the text Übung für Anfänger (“Exercise for Beginners” in German) as a string, umlauts and spaces must be encoded before they are sent: %DCbung%20f%FCr%20Anf%E4nger.

-m min bytes / --pagesize=min bytes
-m min bytes:max bytes / --pagesize=min bytes:max bytes (from version 1.4)
    The page returned must be at least min bytes in size, otherwise the plugin will issue a WARNING. You can optionally use an upper limit as well, separated by a colon, to specify the size of the Web page. Now check_http will also give a warning if the page returned is larger than max bytes. In the following example, everything is in order if the page returned is at least 500 bytes and at most 2000 bytes in size:

        nagios@linux:nagios/libexec$ ./check_http -H www.swobspace.de \
            -m 500:2000
        HTTP WARNING: page size 2802 too large|size=2802B;500;0;0

-N / --no-body (from version 1.4)
    With this option the plugin does not wait for the server to return the complete page contents, but just reads in the header data. To do this it uses the HTTP commands GET or POST, and not HEAD.

-M seconds / --max-age=seconds (from version 1.4)
    If the returned document is older than the date specified in the header (HTTP header field Date:), the plugin will generate a WARNING. Instead of seconds (without additional details) you can also use explicit units such as 5m (five minutes), 12h (twelve hours), or 3d (three days); combinations are not allowed.

-A "string" / --useragent="string" (from version 1.4)
    Explicitly specifies a user agent in the HTTP header, such as -A "Lynx/1.12" for Lynx version 1.12. Normally the plugin does not send this field.
[8] http://www.faqs.org/rfcs/rfc1738.html, paragraph 2.2

-k "string" / --header="string" (from version 1.4)
    This specifies any HTTP header tags. If several tags are to be specified, they must be separated by a semicolon, as in the following example:

        -k "Accept-Charset: iso-8859-1; Accept-Encoding: compress, gzip;"

-S / --ssl
    This forces an SSL connection to be used:

        nagios@linux:nagios/libexec$ ./check_http --ssl -H \
            www.verisign.com
        HTTP OK HTTP/1.1 200 OK - 33836 bytes in 1.911 seconds

    The host www.verisign.com allows an SSL connection. If this is not the case, the server returns an error and the plugin returns the value CRITICAL:[9]

        nagios@linux:nagios/libexec$ ./check_http --ssl -H www.swobspace.de
        Connection refused
        Unable to open TCP socket

[9] This can be checked in the shell with echo $?.

-C days / --certificate=days
    Tests whether the certificate is valid for at least the given number of days. Otherwise a WARNING is issued.

-4 / --use-ipv4 (from version 1.4)
    The test is made explicitly over an IPv4 connection.

-6 / --use-ipv6 (from version 1.4)
    The test is made explicitly over an IPv6 connection.

The definition of a corresponding command object and its use as a service is no different from that based on other plugins; page 102 shows an example.

6.4.3 Monitoring Web proxies

Proxy test with check_http

A proxy such as Squid can also be tested with check_http, but this assumes that you have some knowledge of how a browser makes contact with the proxy. It does this in the form of an HTTP header:

    GET http://www.swobspace.de/ HTTP/1.1
    Host: www.swobspace.de
    User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-DE; rv:1.7.5) Gecko/20041108 Firefox/1.0
    Accept: text/xml,application/xml,application/xhtml+xml,...
    Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-15,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Proxy-Connection: keep-alive
    Pragma: no-cache
    Cache-Control: no-cache

The decisive entries are the GET line and the Host: field. In contrast to normal Web server queries, the browser requests the document from the server via a GET command, not by specifying the directory path, but by using the complete URL, including the protocol type. In the Host: field it specifies the host name of the Web server that it actually wants to reach. With normal HTTP queries that go directly to a Web server (and not via a proxy), the host name of the Web server would likewise be written there.

This behavior can be reproduced with check_http:

    nagios@linux:nagios/libexec$ ./check_http -H www.swobspace.de \
        -I 192.168.1.13 -p 3128 -u http://www.swobspace.de
    HTTP OK HTTP/1.0 200 OK - 2553 bytes in 0.002 seconds

In order to set the Host: field in the header, you specify the name of a Web server with -H. The nonlocal URL is forced by -u, and specifying -I at the same time ensures that the proxy is addressed, and not the Web server itself. Finally you need to select the proxy port, and the proxy test is then complete. check_http will then send the following HTTP header to the proxy:

    GET http://www.swobspace.de HTTP/1.0
    User-Agent: check_http/1.79 (nagios-plugins 1.4-beta1)
    Host: www.swobspace.de

This test does not use any implementation-specific information of the proxy, so it should work with every Web proxy.

The command object is defined as follows:

    define command{
        command_name  check_proxy
        command_line  $USER1$/check_http -H www.google.de \
                      -u http://www.google.de -I $HOSTADDRESS$ -p $ARG1$
    }

The proxy computer linux01 is then tested with the following service:

    define service{
        service_description  Webproxy
        host_name            linux01
        check_command        check_proxy!3128
        ...
    }

The parameter 3128 ensures that the command object check_proxy can read out the port from $ARG1$.
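The difference between a query sent to a proxy and one sent directly to a Web server can be made visible with a few lines of Python. The sketch only assembles the request text in the shape shown above and does not open a network connection; the function names proxy_request and direct_request and the user agent string are our own inventions.

```python
def proxy_request(url, host, user_agent="check_http-sketch/0.1"):
    """Request as sent to a proxy: the GET line carries the complete URL."""
    return ("GET {} HTTP/1.0\r\n"
            "User-Agent: {}\r\n"
            "Host: {}\r\n"
            "\r\n").format(url, user_agent, host)


def direct_request(path, host, user_agent="check_http-sketch/0.1"):
    """Request as sent directly to a Web server: the GET line carries only the path."""
    return ("GET {} HTTP/1.0\r\n"
            "User-Agent: {}\r\n"
            "Host: {}\r\n"
            "\r\n").format(path, user_agent, host)


via_proxy = proxy_request("http://www.swobspace.de", "www.swobspace.de")
direct = direct_request("/", "www.swobspace.de")

print(via_proxy.splitlines()[0])
print(direct.splitlines()[0])
```

In both cases the Host: field names the Web server; only the first line reveals whether the query is intended for a proxy.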
Proxy test with check_squid

The proxy check with check_http, introduced in the last section, works only if the desired Web page is available or is already in the cache. If neither is the case, this test will produce an error, even if the proxy is working in principle.

The plugin check_squid.pl uses a different method, but it is not part of the standard installation and is to be found in the category Check Plugins → Networking [10], which can be found at http://www.nagiosexchange.com/. It makes use of the cache manager of the Squid proxy, which is queried by a pseudo protocol. A command is sent in the form

    GET cache_object://ip-address/command HTTP/1.1\n\n

to Squid and obtains the desired information. The plugin check_squid.pl uses the info command, which queries a range of statistical usage information:

    user@linux:~$ echo "GET cache_object://192.168.1.13/info HTTP/1.1\n\n" \
        | netcat 192.168.1.13 3128
    ...
    File descriptor usage for squid:
        Maximum number of file descriptors:   1024
        Largest file desc currently in use:     18
        Number of file desc currently in use:   15
        Files queued for open:                   0
        Available number of file descriptors: 1009
        Reserved number of file descriptors:   100
        Store Disk files open:                   0
    ...

The plugin evaluates the number of file descriptors that are still free (the third line from the end); you can set a warning or critical limit for this value. File descriptors come into play when objects in the Squid cache are accessed at the same time. In environments with a high number of parallel accesses to the proxy, it is quite possible that 1024 file descriptors are insufficient. In smaller networks with just a few hundred users, not all of whom are surfing at the same time, the compiled-in value of 1024 will be sufficient.

[10] http://www.nagiosexchange.org/Networking.53.0.html

Squid configuration

Normally Squid allows access to the cache manager only from localhost. So that Nagios can query it over the network, the proxy must be reconfigured accordingly:

    ...
    acl manager proto cache_object
    acl nagiosserver src 192.168.1.9
    http_access allow manager nagiosserver
    http_access deny manager
    cachemgr_passwd none info menu
    ...

The lines to be added to the configuration file squid.conf are the acl nagiosserver entry, the http_access allow rule, and the cachemgr_passwd directive; the other relevant lines are already contained in the default file. The first line defines an access control list (acl) called manager by means of the internal protocol cache_object, so it refers to everything that accesses the proxy using the cache_object protocol. This is followed by an access control list for the Nagios server, based on its IP address, here 192.168.1.9. The list name nagiosserver may be freely chosen here (as can manager in the first line). With http_access allow, nagiosserver obtains access to the cache manager (manager), before the line http_access deny manager prohibits access to all others through the cache_object protocol.

Finally, cachemgr_passwd sets the password for cache manager access. If you do without a password by specifying none, you should allow only selected commands that have no potential to change anything, such as info and menu (the latter shows everything that the cache manager can do). After the configuration file has been modified, Squid needs to read it in again:

    linux:~ # /etc/init.d/squid reload

Applying the plugin

The test plugin check_squid.pl itself has the following options:

-H address / --hostname=address
    This is the server on which Squid is to be tested, specified by IP address or FQDN.

-P port / --port=port
    This specifies the port on which Squid is listening. The default is the standard port 3128.

-p password / --password=password
    This is the password for access to the cache manager.

-w free descriptors / --warning=free descriptors
    This is the number of free file descriptors below which the plugin will issue a warning. The default is 200.

-c free descriptors / --critical=free descriptors
    This is the critical limit for free file descriptors.
    If the number falls below this, check_squid returns CRITICAL. The default is 50.

When check_squid is run, it is usually very unspectacular:

    nagios@linux:nagios/libexec$ ./check_squid.pl -H 192.168.1.13
    Squid cache OK (1009 FreeFileDesc)

The matching command also presents no problems ...

    define command{
        command_name check_squid.pl
        command_line $USER1$/check_squid.pl -H $HOSTADDRESS$
    }

... and the same goes for service definitions:

    define service{
        service_description Squid
        host_name linux01
        check_command check_squid.pl
        ...
    }

6.5 Domain Name Server under Control

Two plugins are also available for testing the Domain Name Service (DNS): check_dns and check_dig. While check_dns tests whether a hostname can be resolved, using the external nslookup program, check_dig allows any records at all to be queried. Both plugins are part of the standard distribution.

The situations in which they are used overlap somewhat. With check_dns, you can also explicitly query a specific DNS server, although this plugin is really for checking whether the name service is available generally.

6.5.1 DNS check with nslookup

The check_dns plugin checks whether a specified hostname can be resolved to an IP address. Used locally, the plugin tests the DNS configuration of the computer on which it is run. For the name resolution, it uses the name server configured in /etc/resolv.conf.

The possible options are just as unspectacular.

-H host / --hostname=host
    This is the hostname to be resolved to an IP address.

-s dns-server / --server=dns-server
    This switch explicitly specifies the name server to be used. If this option is missing, check_dns uses the name server from /etc/resolv.conf.

-a ip address / --expected-address=ip address
    The ip address is the IP address that host should have. If the name service returns a different address, the plugin will raise the alarm with CRITICAL. This option makes sense only if it is necessary for the name server to provide a fixed IP address. Without this option, the plugin will accept every IP address as a reply.

-A / --expect-authority
    The name server specified with -s should answer the given query authoritatively, so it must act as a primary or secondary name server for the corresponding domain. If this is not the case, the plugin returns CRITICAL.

-t timeout / --timeout=timeout
    After timeout seconds have expired, the plugin interrupts the test and returns the CRITICAL state. The default is 10 seconds.

For the local test of the DNS configuration (not that for a name server) you just require a hostname that is highly unlikely to disappear from the DNS, such as www.google.de:

    nagios@linux:nagios/libexec$ ./check_dns -H www.google.de
    DNS OK: 0,009 seconds response time www.google.de returns 216.239.59.99

The corresponding command definition appears as follows in this case:

    define command {
        command_name check_dns
        command_line $USER1$/check_dns -H www.google.de
    }

The following service tests whether the name server configuration for the computer linux01 is functioning:

    define service{
        service_description DNS/nslookup
        host_name linux01
        check_command check_dns
        ...
    }

6.5.2 Monitoring the name server with dig

The plugin check_dig provides more options for monitoring a name server than check_dns. As the name implies, it is based on the external utility dig, intended for precisely this purpose.

-H address / --hostname=address
    The address is the IP address of the DNS server to be tested. It is also possible to specify a hostname (instead of an IP address), but in most cases this makes little sense, because the name would first have to be resolved before the plugin can reach the name server.

-l hostname / --lookup=hostname
    The hostname is the hostname to be looked up. If no particular computer is to be looked up, but only the functionality of the DNS server is to be tested, you should specify an address here that is easily reachable from the Internet, such as www.google.de.

-T record type / --record_type=record type (from version 1.4)
    This switch specifies the record type to be queried. The default is A (IPv4 address), but NS (relevant name server), MX (relevant Mail Exchange), PTR (Pointer; IP address for reverse lookup), or SOA (Source of Authority, the administration details of the domain) are also often used.

-w floating point decimal / --warning=floating point decimal (from version 1.4)
    This switch sets the warning limit for the response time of the name server in seconds (floating point decimal).

-c floating point decimal / --critical=floating point decimal (from version 1.4)
    This switch sets the critical limit for the response time of the name server in seconds (floating point decimal).

-a address / --expected_address=address (from version 1.4)
    This is the address that dig should return in the ANSWER SECTION. In contrast to check_dns, check_dig issues only a WARNING if the IP address does not match but the reply itself has arrived within the given time limit.

-t timeout / --timeout=timeout
    After timeout seconds have expired, the plugin breaks off the test and returns the CRITICAL state. The default is 10 seconds.

The following two examples check the name server 194.25.2.129 by requesting from it the IP address of the computer www.swobspace.de. The second example ends with a WARNING, since the reply of the name server for www.swobspace.de returns an IP address other than 1.2.3.4 in the ANSWER SECTION:

    nagios@linux:nagios/libexec$ ./check_dig -H 194.25.2.129 -l \
        www.swobspace.de
    DNS OK - 2,107 Sekunden Antwortzeit (www.swobspace.de. 1800 IN A 212.227.119.101)
    nagios@linux:nagios/libexec$ ./check_dig -H 194.25.2.129 -l \
        www.swobspace.de -a 1.2.3.4
    DNS WARNING - 0,094 Sekunden Antwortzeit (Server nicht gefunden in ANSWER SECTION)

Example 1 is implemented as a command object as follows:

    define command{
        command_name check_dig
        command_line $USER1$/check_dig -H $HOSTADDRESS$ -l $ARG1$
    }

In order to test the specific name server linux01 with Nagios, you look for an address that Nagios should always be able to resolve, such as www.google.de:

    define service{
        service_description DNS/dig
        host_name linux01
        check_command check_dig!www.google.de
        ...
    }

6.6 Querying the Secure Shell Server

Monitoring of Secure Shell servers (irrespective of whether they use protocol version 1 or 2) is taken over by the plugin check_ssh (included in the standard distribution). It is quite a simple construction and just evaluates the SSH handshake. User name and password are not required for the test.

Not to be confused with check_ssh is the plugin check_by_ssh (see Chapter 9 from page 157), which starts plugins remotely on a different computer.

-H address / --hostname=address
    Hostname or IP address of the computer to which the plugin should set up an SSH connection.

-p port / --port=port
    This specifies an alternative port. The default is 22.

-r version / --remote-version=version (from version 1.4)
    The version details of the tested Secure Shell must match the text specified as version, otherwise a WARNING will be sent (see example below). If the version details contain spaces, the string must be enclosed in quotes.

-4 / --use-ipv4 (from version 1.4)
    The test takes place explicitly over an IPv4 connection.

-6 / --use-ipv6 (from version 1.4)
    The test takes place explicitly over an IPv6 connection.

-t timeout / --timeout=timeout
    After timeout (by default, 10) seconds the plugin breaks off the test and returns the CRITICAL state.

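
The version comparison behind -r can be pictured with a small Python sketch. It assumes that the server announces itself with an identification string of the form SSH-protocol-software, as SSH servers do at the start of the handshake; the function names and the exact message texts are illustrative and are not taken from the check_ssh source.

```python
OK, WARNING = 0, 1  # plugin return values for the states OK and WARNING


def parse_ssh_banner(banner: str):
    """Split 'SSH-2.0-OpenSSH_3.8.1p1 Debian-8.sarge.4' into protocol and software."""
    banner = banner.strip()
    if not banner.startswith("SSH-"):
        raise ValueError("not an SSH identification string: %r" % banner)
    # Split only twice so that hyphens inside the software part survive.
    _, protocol, software = banner.split("-", 2)
    return protocol, software


def check_version(banner: str, expected=None):
    """Return (state, message); a version mismatch yields WARNING, as with -r."""
    protocol, software = parse_ssh_banner(banner)
    if expected is not None and software != expected:
        return WARNING, (
            f"SSH WARNING - {software} (protocol {protocol}) "
            f"version mismatch, expected '{expected}'"
        )
    return OK, f"SSH OK - {software} (protocol {protocol})"


state, message = check_version(
    "SSH-2.0-OpenSSH_3.8.1p1 Debian-8.sarge.4\r\n",
    expected="OpenSSH_3.8.1p1 Debian-8.sarge.4",
)
print(state, message)
```

Because only the announced string is compared, no user name or password is involved at any point.
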
The following example in turn tests the Secure Shell daemons on the local computer and on wobgate, to see whether the current SSH version from Debian Sarge is being used:

    nagios@linux:nagios/libexec$ ./check_ssh -H localhost \
        -r 'OpenSSH_3.8.1p1 Debian-8.sarge.4'
    SSH OK - OpenSSH_3.8.1p1 Debian-8.sarge.4 (protocol 2.0)
    nagios@linux:nagios/libexec$ ./check_ssh -H wobgate -r \
        'OpenSSH_3.8.1p1 Debian-8.sarge.4'
    SSH WARNING - OpenSSH_3.8.1p1 Debian 1:3.8.1p1-8 (protocol 2.0)
    version mismatch, expected 'OpenSSH_3.8.1p1 Debian-8.sarge.4'

The latest version of SSH is not in use on wobgate.

In heterogeneous environments with various Linux distributions, you will usually use version checking only in manual plugin calls, and only rarely integrate it into the Nagios configuration. Instead, it is normally sufficient to use command and service definitions following this simple pattern:

    define command{
        command_name check_ssh
        command_line $USER1$/check_ssh -H $HOSTADDRESS$
    }

    define service{
        service_description SSH
        host_name linux01
        check_command check_ssh
        ...
    }

Otherwise you run the risk of having to adjust the version number in the command object after every security update.

6.7 Generic Network Plugins

Sometimes no plugin can be found that is precisely geared to the service to be monitored. For such cases, two generic plugins are available: check_tcp and check_udp. Both of them test whether a service is active on the target port for the protocol in question. Although this does not yet guarantee that the service running on the port really is the one in question, in an environment that one administrator looks after and configures, this can be sufficiently guaranteed in other ways.

Both plugins send a string to the server and evaluate the reply. This is at its simplest for text-based protocols such as POP or IMAP: the two "specific" plugins that are tailor-made for these two mail services (see Section 6.3.2 from page 95) are nothing more than symbolic links to check_tcp, which already plays the corresponding question-and-answer game with relevant default settings.

If you know the protocol to be tested and you configure a "quiz" that fits it (no easy task for binary protocols), a check becomes considerably more than just a port scan. In this way the generic plugins can also be substituted for specific missing plugins.

6.7.1 Testing TCP ports

check_tcp concentrates on TCP-based services. In line with its generic nature, it has a large number of options:

-H address / --hostname=address
    This is the IP address or hostname of the computer whose port should be tested.

-p port / --port=port
    This specifies the target port. In contrast to the plugins that are formed as a symbolic link to check_tcp, this detail is always required.

-w floating point decimal / --warning=floating point decimal
    This sets the warning limit for the response time in seconds.

-c floating point decimal / --critical=floating point decimal
    This sets a time limit like -w but specifies the critical limit value.

-s "string" / --send="string"
    This is the string that the plugin should send to the server.

-e "string" / --expect="string"
    This is the string that the reply of the server should contain. The plugin does not restrict its search here to the first line.

-q "string" / --quit="string"
    This is the string that requests the service to end the connection.

-m bytes / --maxbytes=bytes
    The plugin closes the connection if it has received more than bytes bytes.

-d floating point decimal / --delay=floating point decimal
    This is the time period in seconds between sending a string and checking the response.

-t timeout / --timeout=timeout
    After timeout (the default is 10) seconds the plugin stops the test and returns the CRITICAL status.

-j / --jail
    Setting this suppresses the display of the TCP output. For text-based protocols such as POP or IMAP, this output is usually "human-readable", but for binary protocols you generally cannot decipher it, so that -j is appropriate.

-r return value / --refuse=return value (from version 1.4)
    This switch specifies what value the plugin returns if the server rejects the TCP connection. The default is crit (CRITICAL). With ok as the return value, you can test whether a service that should not be accessible from outside is indeed unavailable. The third possible value, warn, ensures that a WARNING is given.

-M return value / --mismatch=return value (from version 1.4)
    How should the plugin react if a returned string does not match what is specified with -e? The default is warn, which means that a WARNING is given. With crit, a mismatch is categorized as CRITICAL, and with ok, as OK.

-D days / --certificate=days (from version 1.4)
    This is the time span in days for which a server certificate must at least be valid for the test to run successfully. It is relevant only for SSL connections. Note that there is a danger of confusion: in the check_http plugin this same option is -C (see page 101). If the remaining validity of the server certificate drops below the specified time span, the plugin returns a WARNING.

-S / --ssl (from version 1.4)
    SSL/TLS should be used for the connection. The plugin cannot handle STARTTLS [11].

-4 / --use-ipv4 (from version 1.4)
    The test takes place specifically over an IPv4 connection.

-6 / --use-ipv6 (from version 1.4)
    The test takes place specifically over an IPv6 connection.

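
How the options -s, -e, -r, and -M interact can be sketched in Python. This is a simplified model under stated assumptions, not the check_tcp implementation: it ignores response-time limits, -q, -d, and SSL, and the function names evaluate_reply and check_tcp_sketch are invented for the illustration.

```python
import socket
from typing import Optional

# Return values as Nagios plugins report them.
STATES = {"ok": 0, "warn": 1, "crit": 2}


def evaluate_reply(reply: str, expect: Optional[str], mismatch: str = "warn") -> int:
    """Map a server reply to a state; 'mismatch' plays the role of -M."""
    if expect is None or expect in reply:  # -e may match anywhere in the reply
        return STATES["ok"]
    return STATES[mismatch]


def check_tcp_sketch(host: str, port: int, send: Optional[str] = None,
                     expect: Optional[str] = None, refuse: str = "crit",
                     mismatch: str = "warn", timeout: float = 10.0,
                     maxbytes: int = 1024) -> int:
    """Connect, optionally send a string, and evaluate the reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if send:
                sock.sendall(send.encode())
            # Only read a reply when there is something to compare it with.
            reply = sock.recv(maxbytes).decode(errors="replace") if expect else ""
    except ConnectionRefusedError:
        return STATES[refuse]   # 'refuse' plays the role of -r
    except OSError:             # timeouts and other network errors
        return STATES["crit"]
    return evaluate_reply(reply, expect, mismatch)


# Pure examples that need no network access:
print(evaluate_reply("+OK POP3 server ready", "+OK"))        # reply matches
print(evaluate_reply("-ERR", "+OK", mismatch="crit"))        # mismatch as CRITICAL
```

Calling check_tcp_sketch with refuse="ok" corresponds to the case described for -r: the check succeeds precisely when the port is closed.
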
The following example checks on the command line whether a service on the target host 192.168.1.89 is active on port 5631, the TCP port for the Windows remote-control software PCAnywhere:

    nagios@linux:nagios/libexec$ ./check_tcp -H 192.168.1.89 -p 5631
    TCP OK - 0,061 second response time on port 5631 |time=0,060744s;0,000000;0,000000;0,000000;10,000000

For all services for which the computer name and port detail are sufficient as parameters for the test, the command object is as follows:

    define command{
        command_name check_tcp
        command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$
    }

To monitor the said PCAnywhere on the machine Win01, the following service definition would be used:

    define service{
        service_description pcAnywhere
        host_name Win01
        check_command check_tcp!5631
        ...
    }

[11] See footnote on page 96.

6.7.2 Monitoring UDP ports

It is not so simple to monitor UDP ports, since there is no standard connection setup, such as the three-way handshake for TCP, in the course of which a connection is opened but data is not yet transferred. For a stateless protocol such as UDP there is no regulated sequence for sent and received packets. The server can reply to a UDP packet sent by the client with a UDP packet, but it is not obliged to do this.

If the packet reaches an unoccupied port, the requested host normally sends back an "ICMP port unreachable" message, which the plugin evaluates. If there is no reply, there are two possibilities: either the service on the target port is not reacting to the request, or a firewall is filtering out network traffic (either the UDP traffic itself or the ICMP message). This is why you can never be sure with UDP whether the server behind a particular port really is offering a service or not.

In order to force a positive response where possible, you normally have to send data to the server, with the option -s, containing some kind of meaningful message for the underlying protocol.
Most services will not respond to empty or meaningless packets. This is why you cannot avoid getting to grips with the corresponding protocol, since you will otherwise not be in a position to send meaningful data to the server to prompt it into giving a reply at all.

The plugin itself has the following options:

-H address / --hostname=address
    This is the IP address or hostname of the computer whose port should be tested by the plugin.

-p port / --port=port
    This switch specifies the target port.

-w floating point decimal / --warning=floating point decimal
    This sets the warning limit for the response time in seconds.

-c time / --critical=time
    This sets the critical limit in seconds (see -w).

-s "string" / --send="string"
    This is the string that the plugin sends to the server.

-e "string" / --expect="string"
    This is the string that the first reply line of the server should contain.

-t timeout / --timeout=timeout
    After timeout (default: 10) seconds have expired, the plugin stops the test and returns the CRITICAL status.

The following example tests whether a service on the target host 192.168.1.13 is active on the time server (NTP) port 123. The NTP daemon only replies to packets containing a meaningful request (e.g., to ones whose contents begin with w):

    nagios@linux:nagios/libexec$ ./check_udp -H 192.168.1.13 -p 123 -s "w"
    Connection accepted on port 123 - 0 second response time

It does not respond to packets with data not in the protocol form. Normally NTP expects a relatively complex packet [12] containing various information. The w used here, arrived at by trial and error, does not contain really meaningful data, but it does provoke the server into giving a response.

The command line shown above is implemented as a command object as follows:

    define command{
        command_name check_udp
        command_line $USER1$/check_udp -H $HOSTADDRESS$ -p $ARG1$ -s $ARG2$
    }

In contrast to check_tcp, it is useful here to give the services based on this command the possibility of sending test data with -s. You therefore need two arguments.

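
How two arguments reach the placeholders $ARG1$ and $ARG2$ can be illustrated with a short Python sketch. It mimics the substitution that Nagios performs when a check_command entry carries arguments separated by exclamation marks; the function names are invented, the plugin directory used for $USER1$ is only an assumed example value, and special cases such as escaped exclamation marks are ignored.

```python
def split_check_command(check_command: str):
    """Split 'check_udp!123!w' into the command name and its arguments."""
    name, *args = check_command.split("!")
    return name, args


def expand_command_line(command_line: str, args, macros: dict) -> str:
    """Replace $ARGn$ and the given macros in a command_line template."""
    values = dict(macros)
    for position, arg in enumerate(args, start=1):
        values[f"$ARG{position}$"] = arg
    for macro, value in values.items():
        command_line = command_line.replace(macro, value)
    return command_line


name, args = split_check_command("check_udp!123!w")
expanded = expand_command_line(
    "$USER1$/check_udp -H $HOSTADDRESS$ -p $ARG1$ -s $ARG2$",
    args,
    {
        "$USER1$": "/usr/local/nagios/libexec",  # assumed plugin directory
        "$HOSTADDRESS$": "192.168.1.13",         # address of the host being checked
    },
)
print(expanded)
```

The first value after the command name becomes the port, the second the string that is sent with -s.
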
Checking an NTP time server is then taken over by the following service definition:

    define service{
        service_description NTP
        host_name timesrv
        check_command check_udp!123!w
        ...
    }

As in the command line example, Nagios sends the string w to the service to provoke a positive response.

[12] The protocol version NTPv3 is described in RFC 1305: http://rfc.sunsite.dk/rfc/rfc1305.html.

6.8 Monitoring Databases

Nagios provides three plugins for monitoring databases: check_pgsql for PostgreSQL, check_mysql for MySQL, and check_oracle for Oracle. The last will not be covered in this book [13]. What they all have in common is that they can be used both locally and over the network. The latter has the advantage that the plugin in question does not have to be installed on the database server. The disadvantage is that you have to get more deeply involved with the subject of authentication, because configuring secure local access to the database is somewhat simpler.

[13] The plugin check_oracle assumes the installation of an Oracle Full Client on the Nagios server; it does not work together with the Instant Client and expects its users to have an extensive knowledge of Oracle. To explain all this here is far beyond the scope of this book.

For less critical systems, network access by the plugin can be done without a password. To do this, the user nagios is set up with its own database in the database management system to be tested, which does not contain any (important) data. Areas accessed by this user can be isolated from other data stored in the DBMS through the database's own permissions system.

Of course, there is nothing stopping you from setting up a password for the user nagios. But if you cannot make use of SSL-encrypted connections, this will be transmitted in plain text for most database connections. In addition, it is stored unencrypted in the Nagios configuration files. In this respect the password does offer some protection, but it is not really that secure.
As an additional measure, you should certainly restrict the IP address from which the user nagios can access the database to that of the Nagios server.

The plugins introduced here have only read access to the database. check_mysql additionally allows a pure connection check, without read access. Write access to the database is not available in any of the plugins mentioned. For Oracle there is a plugin on The Nagios Exchange [14] called check_oracle_writeaccess.sh, which also tests the writeability of the database.

[14] http://www.nagiosexchange.org/Databases.57.0.html

6.8.1 PostgreSQL

With the check_pgsql plugin you can establish both local and network connections to the database. Local connections are handled by PostgreSQL via a Unix socket, which is a purely local mechanism. An IP connection is set up by check_pgsql if a target host is explicitly passed to it. The plugin performs a pure connection test to a test database but does not read any data from it.

In order that PostgreSQL can be reached over the network, you must start the postmaster program either with -i, or by setting the parameter tcpip_socket in the configuration file postgresql.conf to the value true.

Configuring a monitor-friendly DBMS

In order to separate the data that the user nagios (executing the plugin) gets to see more clearly from other data, you first set up a database user with the same name, and a database to which this user is given access:

    postgres@linux:~$ createuser --no-adduser --no-createdb nagios
    postgres@linux:~$ createdb --owner nagios nagdb

Of particular importance when creating a database user with the command createuser is the option --no-adduser. To PostgreSQL, being allowed to create users automatically means that you are the superuser, who can easily get round the various permissions set [15]. But nagios should not be given superuser permissions under any circumstances.

createdb finally creates a new, empty database called nagdb, which belongs to nagios.

Access to the database can be restricted in the file pg_hba.conf. Depending on the distribution, this can be found either in /etc/postgresql or in the subdirectory ./data of the database itself (for example, /var/lib/pgsql/data for SUSE). The following extract restricts access by the database user nagios to a specific database and to the IP address of the Nagios server (replace ip-nagios with the IP address of the Nagios server):

    #type  db     user    ip-address  ip-mask          method  options
    local  nagdb  nagios                               ident   sameuser
    host   nagdb  nagios  ip-nagios   255.255.255.255  ident   sameuser

The first line is a comment describing the function of the columns. The second line allows the database user nagios access to the database nagdb over a local connection. Even though the authentication method here is called ident, you do not need a local ident daemon for Linux and BSD variants (NetBSD, FreeBSD, etc.).

The last line describes the same restriction, but this time it is for a TCP/IP connection from the Nagios server. But now PostgreSQL asks the ident daemon of the Nagios server which user has initiated the connection request. This means that an ident daemon must be installed on ip-nagios. In this way the DBMS tests whether the user initiating the connection from the Nagios server really is called nagios. It will not permit another user (or a connection from a different host).

Normally the ident protocol is only partially suited for user authentication. But in the case of the Nagios server you can assume that a host is involved that is under the control of the administrator, who can ensure that an ident daemon really is running on port 113.

There is a huge range of different ident daemons. pidentd [16] is widely used and is included in most Linux distributions. Normally it is already preconfigured and just needs to be started. But how it is started depends on the distribution; usually inetd or xinetd takes over this task. A glance at the documentation (should) put you straight.

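
The question that the DBMS puts to the ident daemon can be pictured with a short Python sketch. It assumes the wire format of the Identification Protocol (RFC 1413): the query names the two port numbers of the connection, and a positive reply carries the user name in its last field. The function names and the port numbers in the usage example are invented, and the sketch only builds and parses strings rather than talking to a daemon on port 113.

```python
from typing import Optional


def build_ident_query(port_on_nagios_server: int, port_on_db_server: int) -> bytes:
    """Build the query the database server would send to the ident daemon.

    The first number is the port on the host running the ident daemon,
    the second the port on the host that asks (here the PostgreSQL port).
    """
    return f"{port_on_nagios_server}, {port_on_db_server}\r\n".encode("ascii")


def parse_ident_reply(reply: str) -> Optional[str]:
    """Return the user name from a USERID reply, or None for an ERROR reply."""
    parts = [part.strip() for part in reply.strip().split(":", 3)]
    if len(parts) == 4 and parts[1] == "USERID":
        return parts[3]
    return None


query = build_ident_query(40123, 5432)  # 40123 is a made-up client port
user = parse_ident_reply("40123, 5432 : USERID : UNIX : nagios\r\n")
print(query, user)
```

With the sameuser option from the extract above, access would be granted only if the returned name matches the database user nagios.
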
After modifying the configuration file pg_hba.conf you must prompt the DBMS to reload the configuration files. This is best done with the command

    linux:~ # /etc/init.d/postgresql reload

(a restart is not necessary). If the configuration of inetd/xinetd was modified, this daemon is reinitialized in the same way.

[15] Permissions in PostgreSQL are given by the database command GRANT.
[16] http://www.lysator.liu.se/~pen/pidentd/.

The test plugin check_pgsql

check_pgsql has the following options:

-H address / --hostname=address
    If given this option, the plugin establishes a TCP/IP connection instead of making contact with a local DBMS through a Unix socket.

-P port / --port=port
    In contrast to the plugins discussed until now, check_pgsql uses a capital P to specify the port on which PostgreSQL is running. By default it connects to port 5432. This option is only useful if PostgreSQL allows TCP/IP connections.

-d database / --database=database
    This is the name of the database to which the plugin should be connected. If this detail is missing, it uses the standard database template1.

-w floating point decimal / --warning=floating point decimal
    This is the warning limit in seconds for the time the test takes.

-c floating point decimal / --critical=floating point decimal
    This is the critical limit in seconds for the time the test takes.

-l user / --logname=user
    This is the name of the user who should establish contact to the database.

-p passwd / --password=passwd
    This switch sets the password for access to the database. Since this must be stored in plain text in the service definition, a potential security problem is involved. It is preferable to explicitly define a restricted, password-free access to the database in the PostgreSQL configuration for the user nagios.

-t timeout / --timeout=timeout
    After 10 seconds have expired, the plugin stops the test and returns the CRITICAL status. This option allows the default value to be changed.

-4 / --use-ipv4 (from version 1.4)
    The test takes place explicitly across an IPv4 connection.

-6 / --use-ipv6 (from version 1.4)
    The test takes place explicitly across an IPv6 connection.

To test the reachability across the network of the database nagdb set up specially for this purpose, this is passed on as a parameter together with the target host (here: linux01):

    nagios@linux:nagios/libexec$ ./check_pgsql -H linux01 -d nagdb
    CRITICAL - no connection to 'nagdb' (FATAL: IDENT authentication
    failed for user "nagios")

The fact that the check went wrong in the example is clearly due to the ident authentication. This happens, for example, if you forget to reload the ident daemon after the configuration has been modified. Once the error has been rectified, the plugin will, hopefully, work better:

    nagios@linux:nagios/libexec$ ./check_pgsql -H linux01 -d nagdb
    OK - database nagdb (0 sec.)|time=0,000000s;2,000000;8,000000;0,000000

If the database parameter is omitted, check_pgsql will address the database template1:

    nagios@linux:nagios/libexec$ ./check_pgsql -H linux01
    CRITICAL - no connection to 'template1' (FATAL: no pg_hba.conf entry
    for host "172.17.129.2", user "nagios", database "template1", SSL off)

A similar result is obtained if you run the test with the correct database, but with the wrong user:

    wob@linux:nagios/libexec$ ./check_pgsql -H linux01 -d nagdb
    CRITICAL - no connection to 'nagdb' (FATAL: no pg_hba.conf entry for
    host "172.17.129.2", user "wob", database "nagdb", SSL off)

You should certainly run the last two tests, just to check that the PostgreSQL database really does reject corresponding requests. Otherwise you will have a security leak, and we recommend that you remove settings in the configuration that are too generous.

If you have created a separate database for the check, there is no reason why you shouldn't write this explicitly in the command definition, instead of using parameters, with $ARG1$:

    define command{
        command_name check_pgsql
        command_line $USER1$/check_pgsql -H $HOSTADDRESS$ -d nagdb
    }

Then the service definition for linux01 is as simple as this:

    define service{
        service_description PostgreSQL
        host_name linux01
        check_command check_pgsql
        ...
    }

6.8.2 MySQL

With the check_mysql plugin, MySQL databases can be tested both locally and across the network. For local connections, it makes contact via a Unix socket, and not via a real network connection.

MySQL configuration

In order that the database can be reached across the network, the skip-networking option in the configuration file my.cnf must be commented out. The database should then be running on TCP port 3306, which can be tested with netstat -ant, for example:

    user@linux:~$ netstat -ant | grep 3306
    tcp        0      0 0.0.0.0:3306      0.0.0.0:*      LISTEN

To set up the password-free access to the database relatively securely, a separate nagdb database is also created here that does not contain any critical data, and for which the user nagios is given restricted access from the Nagios server. To do this, you connect as the database user root to the database mysql, and there you create the database nagdb:

    user@linux:~$ mysql --user=root mysql
    mysql> CREATE DATABASE nagdb;

If the command mysql --user=root mysql functions without the need to enter a root password, then you have a serious security problem. In that case, anyone, at least from the database server, is able to obtain full access to the database. If this is the case, it is essential that you read the security notes in the MySQL documentation [17].

Creating the user and setting the access restrictions can be done in one and the same step:

    mysql> GRANT select ON nagdb.* TO nagios@ip-nagios;

The command sets up the user nagios, if it does not exist.
This user may connect only from the Nagios server with the IP address ip-nagios and obtains access to all tables in the database nagdb, but may execute only the SELECT command there (no INSERT, no UPDATE or DELETE); that is, user nagios only has read access.

[17] To be found, for example, at http://dev.mysql.com/doc/mysql/de/Security.html.

The test plugin check_mysql

check_mysql has fewer options than its PostgreSQL equivalent: apart from -H, it does not implement any standard flags and has neither a warning nor a critical limit for the performance time of the test. For the database-specific options, it uses the same syntax as check_pgsql, except for the user entry:

-H address / --hostname=address
    This sets the hostname or IP address of the database server. If the option -H is omitted, or if it is used in connection with the argument localhost, check_mysql does not set up a network connection but uses a Unix socket. If you want to establish an IP connection to localhost, you must explicitly specify the IP address 127.0.0.1.

-P port / --port=port
    This is the TCP port on which MySQL is listening. By default, port 3306 is used.

-d database / --database=database
    This is the name of the database to which the plugin should set up a connection. If this option is omitted, it only makes a connection to the database process, without addressing a specific database.

-u user / --username=user
    This is the user in whose name the plugin should log in to the DBMS.

-p passwd / --password=passwd
    This switch is used to provide the password for logging in to the database.

To set up a connection to the database nagdb as the user nagios, both are passed to the plugin as parameters:

    nagios@linux:nagios/libexec$ ./check_mysql -H dbhost -u nagios -d nagdb
    Uptime: 19031  Threads: 2  Questions: 80  Slow queries: 0  Opens: 12
        Flush tables: 1  Open tables: 6  Queries per second avg: 0.004

In contrast to PostgreSQL, with MySQL you can also make contact without establishing a connection to a specific database:

    nagios@linux:nagios/libexec$ ./check_mysql -H dbhost
    Uptime: 19271  Threads: 1  Questions: 84  Slow queries: 0  Opens: 12
        Flush tables: 1  Open tables: 6  Queries per second avg: 0.004

With a manual connection to the database, with mysql, you can subsequently change to the desired database, using the MySQL command use:

    user@linux:~$ mysql -u nagios
    mysql> use nagdb;
    Database changed
    mysql>

With the plugin, a subsequent database change is not possible. Here you must decide from the beginning whether you want to contact a database or whether you just want to establish a connection to the MySQL database system.

To test a nagdb database set up explicitly for this purpose, you can do without parameters when creating the corresponding command object, and explicitly specify both user and database:

    define command{
        command_name    check_mysql
        command_line    $USER1$/check_mysql -H $HOSTADDRESS$ -u nagios -d nagdb
    }

This simplifies the service definition:

    define service{
        service_description    MySQL
        host_name              linux01
        check_command          check_mysql
        ...
    }

6.9 Monitoring LDAP Directory Services

For monitoring LDAP directory services, the check_ldap plugin is available. It runs a search query that can be made anonymously or with authentication. It has the following parameters to do this:

-H address / --hostname=address
    This is the host name or IP address of the LDAP server.

-b base dn / --base=base dn
    This is the top element (the base distinguished name, or base DN) of the LDAP directory, formed for example from the components of the domain name: dc=swobspace,dc=de.
-p port / --port=port
    This is the port on which the LDAP server is running. The default is the standard port 389.

-a "ldap-attribute" / --attr="ldap-attribute"
    This switch enables a search according to specific attributes. Thus -a "(objectclass=inetOrgPerson)" searches for all nodes in the directory tree containing the object class inetOrgPerson (normally used for telephone and e-mail directories, for example).
    Specifying attributes in the check is less useful than it may seem. If you search through an LDAP directory for nonexistent attributes, you will normally receive an answer with zero results, but no errors.

-D ldap bind dn / --bind=ldap bind dn
    This specifies a bind DN [18] for an authenticated connection, such as:

        uid=wob,dc=swobspace,dc=de

    Without this entry, the plugin establishes an anonymous connection.

-P ldap passwd / --pass=ldap passwd
    This is the password for an authenticated connection. It only makes sense in conjunction with the option -D.

-t timeout / --timeout=timeout
    After timeout seconds have expired (10 seconds if this option is not given), the plugin stops the test and returns the CRITICAL status.

-2 / --ver2 (from version 1.4)
    Use LDAP version v2 (the default). If the server does not support this protocol version, the connection will fail. In OpenLDAP from version 2.1, v3 is used by default; to activate protocol version v2, the following line is entered in the configuration file slapd.conf:

        allow bind_v2

    Many clients, such as Mozilla and the Thunderbird address book, are still using LDAP version v2.

-3 / --ver3 (from version 1.4)
    Use LDAP version v3. For many modern LDAP servers such as OpenLDAP, this is now the standard, but they usually also have parallel support for the older version v2, since various clients cannot yet implement v3.

-w floating point decimal / --warning=floating point decimal
    If the performance time of the plugin exceeds floating point decimal seconds, it issues a warning.
-c floating point decimal / --critical=floating point decimal
    If the performance time of the plugin exceeds floating point decimal seconds, it returns CRITICAL.

-4 / --use-ipv4 (from version 1.4)
    The test is done explicitly across an IPv4 connection.

-6 / --use-ipv6 (from version 1.4)
    The test is done explicitly across an IPv6 connection.

[18] A bind DN serves to identify the user and refers to the user's node in the directory tree, specifying all the overlying nodes. The bind DN in LDAP corresponds in its function more or less to the user name when logging in under Unix.

In the simplest case it is sufficient to query whether the LDAP server really does own the base DN specified with -b:

    nagios@linux:nagios/libexec$ ./check_ldap -H ldap.swobspace.de \
        -b "dc=swobspace,dc=de"
    LDAP OK - 0,002 seconds response time|time=0,002186s;;;0,000000

This query corresponds to the following command object:

    define command{
        command_name    check_ldap
        command_line    $USER1$/check_ldap -H $HOSTADDRESS$ -b $ARG1$
    }

Since an LDAP server can handle many LDAP directories with different base DNs, it is recommended that you configure this with parameters:

    define service{
        service_description    LDAP
        host_name              linux01
        check_command          check_ldap!dc=swobspace,dc=de
        ...
    }

If authentication is involved, things get slightly more complicated. On the one hand, the plugin is given the bind DN of the nagios user, with -D. On the other hand, the following example protects the necessary password from curious onlookers by storing it as the macro $USER3$ in the file resource.cfg, which should be readable only by the user nagios (see Section 2.14, page 59):

    define command{
        command_name    check_ldap_auth
        command_line    $USER1$/check_ldap -H $HOSTADDRESS$ -b $ARG1$ -D $ARG2$ -P $USER3$
    }

Accordingly, the matching service definition contains the base DN and bind DN as arguments, but not the password:

    define service{
        service_description    LDAP
        host_name              linux01
        check_command          check_ldap_auth!dc=swobspace,dc=de!uid=nagios,dc=swobspace,dc=de
        ...
    }

6.10 Checking a DHCP Server

To monitor DHCP services, the plugin check_dhcp is available. It sends a DHCPDISCOVER via UDP broadcast to the target port 67 and waits for an offer from a DHCP server in the form of a DHCPOFFER, which offers an IP address and further configuration information.

Because check_dhcp does not send a DHCPREQUEST after this, the server does not need to reserve the resources and confirm this reservation with DHCPACK, nor does it need to reject the request with DHCPNAK.

Granting the plugin root permissions

There is a further restriction to check_dhcp: it requires full access to the network interface and must therefore run with root privileges.

It should, however, be executed, like all other plugins, by the user nagios. Ownership of the program is accordingly transferred to the user root, and the SUID bit is set with chmod. Such s-bits are always a potential danger, since buffer overflows could be used to obtain general root privileges if code has been written carelessly. For this reason, the chmod command is chosen so that, apart from root, only the group nagios is given the permission to execute the plugin:

    linux:nagios/libexec # chown root.nagios check_dhcp
    linux:nagios/libexec # chmod 4750 check_dhcp
    linux:nagios/libexec # ls -l check_dhcp
    -rwsr-x--- 1 root nagios 115095 Jan  8 12:15 check_dhcp

The chown command assigns the plugin to the user root and to the group nagios, to which nobody else should belong apart from the user nagios itself. (The user in whose name the Web server is running should be a member of a different group, such as nagcmd, as is described in Chapter 1 from page 25.) In addition, the chmod ensures that nobody apart from root and the members of the group nagios may even read the plugin file, and that only root may edit it.

Applying the plugin

check_dhcp only has five options:

-s server ip / --serverip=server ip
    This is the IP address of a DHCP server that the plugin should explicitly query.
    Without this entry, it is sufficient to have a functioning DHCP server in the network to pass the test satisfactorily. So you have to decide whether you want to test the general availability of the DHCP service or the functionality of a specific DHCP server.

-r requested ip / --requestedip=requested ip
    With this option the plugin attempts to obtain the IP address requested ip from the server. If this is not successful because it is already reserved or lies outside the configured range, check_dhcp reacts with a warning.

-i interface / --interface=interface
    This selects a specific network interface through which the DHCP request should pass. Without this parameter, the plugin always uses the first network card to be configured (in Linux, usually eth0).

-t timeout / --timeout=timeout
    After timeout seconds have expired (10 seconds by default), the plugin stops the test and returns the CRITICAL state.

The plugin does not offer configurable warning or critical limits for the performance time. Where necessary, you must explicitly set a timeout, which causes the CRITICAL return value to be issued.

The following example shows that the DHCP service in the network is working:

    nagios@linux:nagios/libexec$ ./check_dhcp -i eth0
    DHCP ok: Received 1 DHCPOFFER(s), max lease time = 600 sec.

The plugin includes only the lease time as additional information, that is, the time for which the client would be assigned an IP address. If you want to see all the information contained in DHCPOFFER, you should use the option -v ("verbose").

In the next example the plugin explicitly requests a specific IP address (192.168.1.40), but this is not available:

    nagios@linux:nagios/libexec$ ./check_dhcp -i eth0 -r 192.168.1.40
    DHCP problem: Received 1 DHCPOFFER(s), requested address (192.168.1.40)
        was not offered, max lease time = 600 sec.
    nagios@linux:nagios/libexec$ echo $?
    1

The result is a WARNING, as is shown by the output of the exit status with $?.
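The relationship between the exit status and the Nagios state can be sketched in a few lines of Python. This is only an illustration: the function name `run_plugin` is hypothetical, and reporting UNKNOWN when the wrapper itself cannot start the plugin or its own timer expires is a choice made for this sketch, not behavior described above. The mapping 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN is the usual plugin convention.

```python
import subprocess
import sys

# Conventional plugin return values and their state names.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv, timeout=10):
    """Run a plugin command and return (state name, first line of output)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Assumption of this sketch: wrapper-level failures count as UNKNOWN.
        return "UNKNOWN", str(exc)
    lines = proc.stdout.splitlines()
    return STATES.get(proc.returncode, "UNKNOWN"), (lines[0] if lines else "")


if __name__ == "__main__":
    # Stand-in for a real plugin: prints one line and exits with status 1.
    fake_plugin = [sys.executable, "-c",
                   "import sys; print('DHCP problem: address not offered'); "
                   "sys.exit(1)"]
    print(run_plugin(fake_plugin))
```

With a real installation, the stand-in command would be replaced by something like the check_dhcp call shown above.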
If you want to test both the availability of the DHCP service overall and the servers in question individually, you need two different commands:

    define command{
        command_name    check_dhcp_service
        command_line    $USER1$/check_dhcp -i eth0
    }

check_dhcp_service queries the DHCP service as a whole by sending a broadcast, to which any DHCP server at all may respond.

    define command{
        command_name    check_dhcp_server
        command_line    $USER1$/check_dhcp -i eth0 -s $HOSTADDRESS$
    }

check_dhcp_server, on the other hand, explicitly tests the DHCP service on a specific server.

To match this, you can then define one service that monitors DHCP as a whole and another one that tests DHCP for a specific host. Even if the first variation is in principle not host-specific, it still needs to be assigned explicitly to a computer for it to run in Nagios:

    define service{
        service_description    DHCP Services
        host_name              linux01
        check_command          check_dhcp_service
        ...
    }

    define service{
        service_description    DHCP Server
        host_name              linux01
        check_command          check_dhcp_server
        ...
    }

6.11 Monitoring UPS with the Network UPS Tools

There are two possibilities for monitoring uninterruptible power supplies (UPS): the Network UPS Tools support nearly all standard devices. The apcupsd daemon is specifically tailored to UPSs from the company APC and is described in Section 7.10 from page 149. The plugin check_ups included in Nagios supports only the first implementation.

The following rule generally applies: no plugin directly accesses the UPS interface. Rather, they rely on a corresponding daemon that monitors the UPS and provides status information. This daemon primarily serves the purpose of shutting down the connected servers in time in case of a power failure. But it also always provides status information, which plugins can query and which can be processed by Nagios.
Both the solution with the Network UPS Tools and that with apcupsd are fundamentally network-capable, that is, the daemon is always queried via TCP/IP (through a proprietary protocol, or alternatively SNMP). But you should be aware here that a power failure may affect the transmission path, so that the corresponding information might no longer even reach Nagios. Monitoring via the network therefore makes sense only if the entire network path is safeguarded properly against power failure. In the ideal scenario, the UPS is connected directly to the Nagios server. Calling the check_ups plugin is no different in this case from that for the network configuration, since even for local use it communicates via TCP/IP (in this case with the host localhost).

The Network UPS Tools

The Network UPS Tools is a manufacturer-independent package containing tools for monitoring uninterruptible power supplies. Different specific drivers take care of hardware access, so that new power supplies can be easily supported, provided their protocols are known.

The remaining functionality is also spread across various programs: while the daemon upsd provides information, the program upsmon shuts down the computers supplied by the UPS in a controlled manner. It takes care both of machines connected via serial interface to the UPS and, in client/server mode, of computers supplied via the network.

The homepage http://www.networkupstools.org/ lists the currently supported models and provides further information on the topic of UPS. Standard distributions already contain the software, but not always with package names that are very obvious: in SuSE and Debian they are known by the name of nut.

To query the information provided by the daemon upsd, there is the check_ups plugin from the Nagios Plugin package. It queries the status of the UPS through the Network UPS Tools' own network protocol. A subproject also allows it to query the power supplies via SNMP.[19] However, further development on it is not taking place at the present time.
[19] http://eu1.networkupstools.org/server-projects/

For purely monitoring purposes via Nagios (without shutting down the computer automatically, depending on the test result), it is sufficient to configure and start the upsd on the host to which the UPS is connected via serial cable. The relevant configuration file in the directory /etc/nut is called ups.conf. If you perform the query via the network, you must normally add an entry for the Nagios server in the (IP-based) access permissions. Detailed information can be found directly in the files themselves or in the documentation included, which in Debian is in the directory /usr/share/doc/nut, and in SuSE, in /usr/share/doc/packages/nut.

Provided that the Network UPS Tools include a suitable driver for the uninterruptible power supply used, the driver and communication interface are entered in the file ups.conf:

    # -- /etc/nut/ups.conf
    [upsfw]
        driver = apcsmart
        port = /dev/ttyS0
        desc = "Firewalling/DMZ"

In the example, a UPS of the company APC is used. Communication takes place on the serial interface /dev/ttyS0. A name for the UPS is given in square brackets, with which it is addressed later on. desc can be used to describe the intended purpose of the UPS in more detail, but Nagios ignores this.

Next you must ensure that the user with whose permissions the Network UPS Tools are running (such as the user nut from the group nut) has full access to the interface /dev/ttyS0:

    user@linux:~$ chown nut:nut /dev/ttyS0
    user@linux:~$ chmod 660 /dev/ttyS0

In order for Nagios to access information from the UPS via the upsd daemon, corresponding data is entered in an Access Control List in the upsd configuration file upsd.conf:

    # -- /etc/nut/upsd.conf
    # ACL aclname ipblock
    ACL all 0.0.0.0/0
    ACL localhost 127.0.0.1/32
    ACL nagios 172.17.129.2/32

    # ACCESS action level aclname
    ACCESS grant monitor localhost
    ACCESS grant monitor nagios
    ACCESS deny all all

With the keyword ACL you first define hosts and network ranges with their IP address.
You must always specify a network block here: /32 means that all 32 bits of the netmask are set to 1 (this corresponds to 255.255.255.255), which therefore denotes a single host address. It is not sufficient just to specify the IP address here.

An ACCESS entry grants the actual access permissions to the computers specified in the ACL aclname. The computers defined in the ACLs localhost and nagios are allowed to access the monitoring data thanks to the monitor permission (grant), but nothing more. The last ACCESS entry finally denies (deny) any access to all others.

To conclude the configuration, you should make sure that the UPS daemon is started with every system start. In SuSE this is done via YaST2; in Debian this is taken care of during the installation.

The check_ups plugin

The monitoring plugin itself has the following options:

-H address / --host=address
    This is the computer on which upsd is installed.

-u identifier / --ups=identifier
    This is the name for the UPS in ups.conf, specified in square brackets.

-p port / --port=port
    This is the number of the port on which the upsd is running. The default is TCP port 3493.

-w whole number / --warning=whole number
    This switch defines a warning limit as a whole number. If no variable is given (see -v), whole number means a response time in seconds; otherwise it refers to the value range of the variable (e.g., 80 for 80% in BATTPCT). Specifying multiple warning limits is currently not possible: the plugin then only uses the last variable and the last warning limit.

-c whole number / --critical=whole number
    This option specifies a critical limit in connection with a variable (see -v).

-v variable / --variable=variable
    With this option, specific values of the UPS can be queried. The limit values then refer to this parameter. check_ups currently supports only the following variables:

        LINE: Input voltage of the UPS.
        TEMP: Temperature of the UPS.
        BATTPCT: Remaining battery capacity in percent.
        LOADPCT: Load on the UPS in percent.

    If this option is missing, the plugin only checks the status of the UPS (online or offline).
    Since -v thus has another meaning, check_ups does not know the obligatory option --verbose (see Table 6.2 on page 88), even in its long form.

-T / --temperature
    This switch outputs temperature values in degrees Celsius.

-t timeout / --timeout=timeout
    After timeout seconds have expired, the plugin stops the test and returns the CRITICAL state. The default is 10 seconds.

The following example tests the local UPS defined above with the name upsfw. The -T switch should ensure that the temperature is output in degrees Celsius, which only partially works here: the text displayed by Nagios before the pipe sign | contains the correct details, but in the performance data after the |, plugin version 1.4 still shows the information in degrees Fahrenheit.

    user@linux:nagios/libexec$ ./check_ups -H localhost -u upsfw -T
    UPS OK - Status=Online Utility=227.5V Batt=100.0% Load=27.0% Temp=30.6C|
        voltage=227500mV;;;0 battery=100%;;;0;100 load=27%;;;0;100
        temp=30degF;;;0

If a variable is not used, the plugin returns CRITICAL if the UPS is switched off (Status=Off) or has reached low battery capacity (Status=On Battery, Low Battery). check_ups issues a warning if at least one of the three states On Battery, Low Battery, or Replace Battery applies, but this is not sufficient for a CRITICAL status (for example because of correspondingly set variables). With On Battery the power supply is provided by the battery, with Low Battery the UPS is online with a low battery state, and with Replace Battery, the battery must be replaced.
If none of these points apply, the plugin issues an OK for the following states:

    * In the normal online state
    * If the UPS is being calibrated (Calibrating)
    * If it is currently being bypassed and the power supply is provided directly from the power supply grid (On Bypass)
    * If the UPS is overloaded (Overload)
    * If the voltage in the power grid is too high and the UPS restricts the voltage to the normal value (Trimming)
    * If the voltage in the power grid is too low and is supplemented by the UPS (Boosting)
    * If the UPS is currently being charged (Charging)
    * If the UPS is currently being discharged (e.g., during a programmed maintenance procedure) (Discharging)

Transformed into a command object, the above test for any host looks like this:

    define command{
        command_name    check_ups
        command_line    $USER1$/check_ups -H $HOSTADDRESS$ -u $ARG1$ -T
    }

The corresponding service definition for the computer linux01, to which the UPS is connected, and for the UPS upsfw defined above, would then look like this:

    define service{
        service_description    UPS
        host_name              linux01
        check_command          check_ups!upsfw
        ...
    }

If check_ups is to determine the UPS status by means of the current load, the relevant information is taken from the variable LOADPCT:

    user@linux:nagios/libexec$ ./check_ups -H linux01 -u upsfw -T -v \
        LOADPCT -w 60 -c 80
    UPS WARNING - Status=Online Utility=227.5V Batt=100.0% Load=61.9%
        Temp=30.6C|voltage=227500mV;;;0 battery=100%;;;0;100
        load=61%;60000;80000;0;100 temp=30degF;;;0

With 61 percent, the UPS has a heavier load than specified in the limit value -w, but it does not yet reach the critical area above 80 percent, so there is just a warning.

If two error criteria occur, such as a warning limit for a queried variable being exceeded and a critical state simultaneously, because the UPS is losing power (On Battery and Low Battery simultaneously), the most critical state has priority for the return value of the plugin, so here, check_ups would return CRITICAL, and not the WARNING which results from the query of LOADPCT.
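The priority rule can be sketched in Python as follows. This is a simplified model and not the plugin's actual code: the function names are hypothetical, the status rules are those given in the text of this section, and `limit_state` assumes a variable such as LOADPCT where higher values are worse and a limit counts as exceeded only when the value is strictly greater.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def limit_state(value, warn, crit):
    """State for a 'higher is worse' variable such as LOADPCT (assumption)."""
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"


def status_state(flags):
    """State derived from the UPS status flags, following this section."""
    flags = set(flags)
    if "Off" in flags or {"On Battery", "Low Battery"} <= flags:
        return "CRITICAL"
    if flags & {"On Battery", "Low Battery", "Replace Battery"}:
        return "WARNING"
    return "OK"


def overall_state(*states):
    """The most critical of the given states wins."""
    return max(states, key=SEVERITY.__getitem__)


if __name__ == "__main__":
    load = limit_state(61.9, warn=60, crit=80)
    status = status_state(["On Battery", "Low Battery"])
    print(load, status, overall_state(load, status))
```

For the load of 61.9 percent with -w 60 -c 80 and a UPS that is both On Battery and Low Battery, the sketch arrives at CRITICAL, as described above.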
7 Testing Local Resources

The plugins introduced in this chapter, from the basic selection of the nagios-plugins package, test local resources that do not have their own network protocol and therefore cannot be easily queried over the network. They must therefore be locally installed on the computer to be tested. Such plugins on the Nagios server can test only the server itself, with command and service definitions as described in Chapter 6.

To perform such local tests from a central Nagios server on remote hosts, you require further utilities: the plugins are started via a secure shell, or you use the Nagios Remote Plugin Executor (NRPE). Using the secure shell is described in Chapter 9 from page 157, and Chapter 10 (page 165) is devoted to NRPE.

The definition of command and service depends on the choice of mechanism. If you want to test for free hard drive capacity with the check_by_ssh plugin installed on the Nagios server, which remotely calls check_disk on the target server (see Section 7, page 133), then a special command definition is required for this, which differs somewhat from the definitions given in Chapter 6 (page 85). What command and service definitions for remotely executed local plugins look like is described in the aforementioned chapters on NRPE and SSH.

For the remote query of some local resources you can also use SNMP (see Chapter 11 from page 177), but the checks are then restricted to the capabilities of the SNMP daemon used. Local plugins are usually more flexible here and provide more options for querying.
7.1 Free Hard Drive Capacity

The question of when the hard drive(s) of a computer may threaten to fill up is answered by the check_disk plugin, which in version 1.4 includes considerably more functions than its predecessor:

-w limit / --warning=limit
    The plugin will give a warning if the free hard drive capacity drops below this limit, expressed as a percentage or as an integer. If you specify a percentage, the percent sign % must also be included; floating-point decimals such as 12.5% are possible. Integer values are expected in kBytes by version 1.3.x, but in MBytes by version 1.4 (in each case without a unit abbreviation). The unit can also be influenced with -k, -m, and -u.

-c limit / --critical=limit
    If the free hard drive capacity falls below this level, given as a percentage or integer (see -w), check_disk displays the CRITICAL status. The critical limit must be smaller than the warning limit.

-p path or partition / --path=path or --partition=partition
    This specifies the root directory in file systems or the physical device in partitions (e.g., /dev/sda5). From version 1.4, -p can be called multiple times. If the path is not specified, the plugin tests all file systems (see also -x and -X).

-e / --errors-only
    With this switch, the plugin shows only the file systems or partitions that are in a WARNING or CRITICAL state.

-k / --kilobytes (from 1.4)
    With this switch, limit values given as whole numbers with -c and -w are to be interpreted as kBytes.

-m / --megabytes (from 1.4)
    With this switch, whole number limit values with -c and -w are interpreted by the plugin as MBytes (the default). Caution: in version 1.3.x, -m has a completely different meaning!

-m / --mountpoint (1.3.x)
    Normally check_disk in version 1.3.x will return the physical device (e.g., /dev/sda5). -m ensures that the file system path (e.g., /usr) is named instead.

-M / --mountpoint (from 1.4)
    From version 1.4 on, check_disk by default displays the file system path (e.g., /usr). With -M you are told instead what physical device (e.g., /dev/sda5) is involved.
-t timeout / --timeout=timeout
    After timeout seconds have expired, the plugin stops the test and returns the CRITICAL status. The default is 10 seconds.

-u unit / --units=unit (from 1.4)
    In what unit do you specify integer limit values? kB, MB, GB, and TB are all possible.

-x path / --exclude_device=path
    This switch excludes the mount point specified as path from the test. This option may be used several times in a plugin command.

-X fs type / --exclude-type=fs type (from 1.4)
    This switch excludes a specific file system type from the test. It is given the same abbreviation as in the -t option of the mount command. In this way fs type can take the values ext3, reiserfs, or proc, for example (see also man 8 mount). This option can be used several times in a plugin command.

-C / --clear (from 1.4)
    From version 1.4 on, -p can be used multiple times. If you want to test several file systems at the same time, but using different limit values, -C can be used to delete old limit values that have been set:

        -w 10% -c 5% -p / -p /usr -C -w 500 -c 100 -p /var

    The order is important here: the limit values are valid for the file systems specified until they are reset with -C. Then new limits must be set with -w and -c.

The plugin versions 1.3.1 (first example below) and 1.4 (second example) differ not only in their options, but also in their output. Performance data (see Chapter 17 from page 313) are missing from the former:

    user@linux:nagios/libexec$ ./check_disk -w 10% -c 5% -p /usr
    DISK CRITICAL [87000 kB (5%) free on /usr]

    user@linux:nagios/libexec$ ./check_disk -w 10% -c 5% -p /
    DISK OK - free space: / 710 MB (74%);| /=247MB;861;909;0;957

These can be extracted from the Nagios log files and prepared in graphic form.
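To give an idea of what extracting the performance data involves, here is a minimal Python sketch that splits plugin output at the pipe sign and reads the entries of the form label=value[unit];warn;crit;min;max seen in the example above. The function name `parse_perfdata` is hypothetical, and the sketch assumes labels without spaces and values with a decimal point (not a decimal comma).

```python
import re

# One performance data entry: label=value[unit][;warn[;crit[;min[;max]]]]
PERF_ITEM = re.compile(
    r"(?P<label>[^=\s]+)=(?P<value>-?[\d.]+)(?P<unit>[^;\s]*)"
    r"(?P<rest>(?:;[^;\s]*)*)"
)


def parse_perfdata(plugin_output):
    """Return {label: {value, unit, warn, crit, min, max}} for one output line."""
    _, _, perf = plugin_output.partition("|")   # text before the pipe is ignored
    result = {}
    for match in PERF_ITEM.finditer(perf):
        fields = match.group("rest").split(";")[1:]
        fields += [""] * (4 - len(fields))      # missing fields become None
        warn, crit, low, high = (float(f) if f else None for f in fields[:4])
        result[match.group("label")] = {
            "value": float(match.group("value")),
            "unit": match.group("unit"),
            "warn": warn, "crit": crit, "min": low, "max": high,
        }
    return result


if __name__ == "__main__":
    line = "DISK OK - free space: / 710 MB (74%);| /=247MB;861;909;0;957"
    print(parse_perfdata(line))
```

Empty fields, as in the check_ups output voltage=227500mV;;;0, simply come back as None.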
The following example functions only with version 1.4:

    user@linux:nagios/libexec$ ./check_disk -w 10% -c 5% -p / -p /usr \
        -p /var -C -w 5% -c 3% -p /net/emil1/a -p /net/emil1/c -e
    DISK WARNING - free space: /net/emil1/c 915 MB (5%);|
        /=146MB;458;483;0;509 /usr=1280MB;3633;3835;0;4037
        /var=2452MB;3633;3835;0;4037 /net/emil1/a=1211MB;21593;22048;0;22730
        /net/emil1/c=17584MB;17574;17944;0;18499

Everything is in order on the file systems /, /usr, and /var, since more space is available on them (as can be seen from the performance data) than the limit value of 10 percent (for a warning), and certainly more than 5 percent (for the critical status). The file systems /net/emil1/a and /net/emil1/c encompass significantly larger ranges of data, which is why the limit values are set lower, after the previous ones have been deleted with -C.

-e ensures that the plugin shows only the file systems that really display an error status. In fact the output of the plugin before the | sign, with /net/emil1/c, only displays one single file system. The performance information after the pipe can only be seen on the command line; it contains all file systems tested, as before.

This is slightly confusing, because a Nagios plugin restricts its output to a single line, which has been line wrapped here for this printed version.

7.2 Utilization of the Swap Space

The check_swap plugin tests the locally available swap space. Here there are again fundamental differences between versions 1.3.x and 1.4:

-w limit / --warning=limit
    The warning limit can be specified as a percentage or as an integer, as with check_disk, but the integer value is specified in bytes, not in kBytes!
    In version 1.3.x the percentage specification refers to used, and not free, swap space. If at least 10 percent should remain free, you must specify -c 10% in version 1.4, but -c 90% in version 1.3. The integer specification, however, refers to the remaining free space for both versions.
-c limit / --critical=limit
    Critical limit, similar to the warning limit. If a percentage is specified, versions 1.3.x and 1.4 differ, as in the -w option.

-a / --allswaps
    Tests the threshold values for each swap partition individually.

The following example tests to see whether at least half of the swap space is available. If there is less than 20 percent free swap space, the plugin should return a critical status. The output is from plugin version 1.4, and after the | sign the program again provides performance data, which is logged by Nagios but not displayed in the message on the Web interface:

    user@linux:nagios/libexec$ ./check_swap -w 50% -c 20%
    swap OK: 100% free (3906 MB out of 3906 MB) |swap=3906MB;1953;781;0;3906

7.3 Testing the System Load

The load on a system can be seen from the number of simultaneously running processes, which is tested by the check_load plugin. With the help of the uptime program, it determines the average value for the last minute, the last five minutes, and the last 15 minutes. uptime displays these values in this sequence after the keyword load average:

    user@linux:~$ uptime
    16:33:35 up 7:05, 18 users, load average: 1.87, 1.38, 0.74

check_load has only two options (the two limit values), but these can be specified in two different ways:

-w limit / --warning=limit
    This option specifies the warning limit either as a simple floating-point decimal (5.0) or as a comma-separated triplet containing three floating-point decimals (10.0,8.0,5.0). In the first case, the limit specified applies to all three average values. The plugin issues a warning if (at least) one of these is exceeded. In the second case the triplet allows the limit value to be specified separately for each average value. Here as well, check_load issues a warning as soon as one of the average values exceeds the limit defined for it.

-c limit / --critical=limit
    This specifies the critical limit in the same way as -w specifies the warning limit. These critical limit values should be higher than the values for -w.
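The two ways of writing the limits can be modeled with a short Python sketch. It is not the plugin's implementation: the function names are hypothetical, and treating a limit as exceeded only when the load is strictly greater than it is an assumption of this sketch.

```python
def parse_limits(spec):
    """'5.0' applies to all three averages; '10.0,8.0,5.0' sets each separately."""
    parts = [float(p) for p in spec.split(",")]
    if len(parts) == 1:
        return parts * 3
    if len(parts) == 3:
        return parts
    raise ValueError("expected one value or a triplet: %r" % spec)


def load_state(loads, warn_spec, crit_spec):
    """loads is (1-minute, 5-minute, 15-minute) as printed by uptime."""
    warn = parse_limits(warn_spec)
    crit = parse_limits(crit_spec)
    if any(load > limit for load, limit in zip(loads, crit)):
        return "CRITICAL"
    if any(load > limit for load, limit in zip(loads, warn)):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Values from the uptime output above.
    print(load_state((1.87, 1.38, 0.74), "10.0,8.0,5.0", "15.0,10.0,8.0"))
    print(load_state((1.87, 1.38, 0.74), "1.5", "5.0"))
```

With the single limit 1.5, the one-minute average of 1.87 alone is enough to trigger the warning.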
In the following example Nagios would raise the alarm if more than 15 processes were active on average in the last minute, if more than 10 were active on average in the last five minutes, or if more than eight were active on average in the last 15 minutes. There is a warning for average values of more than ten, eight, or five processes:

    user@linux:local/libexec$ ./check_load -w 10.0,8.0,5.0 -c 15.0,10.0,8.0
    OK - load average: 1.93, 0.95, 0.50|
        load1=1.930000;10.000000;15.000000;0.000000
        load5=0.950000;8.000000;10.000000;0.000000
        load15=0.500000;5.000000;8.000000;0.000000

7.4 Monitoring Processes

The check_procs plugin monitors processes according to various criteria. Usually it is used to monitor the running processes of just one single program. Here the upper and lower limits can also be specified.

nmbd, for example, the name service of Samba, always runs as a daemon with two processes. A larger number of nmbd entries in the process table is always a sure sign of a problem; it is commonly encountered, especially in older Samba versions.

Services such as Nagios itself should only have one main process. This can be seen by the fact that its parent process has the process ID 1, marking it as a child of the init process. It was often the case, in the development phase of Nagios 2.0, that several such processes were active in parallel after a failed restart or reload, which led to undesirable side effects. You can test to see whether there really is just one single Nagios main process active, as follows:

    nagios@linux:nagios/libexec$ ./check_procs -c 1:1 -C nagios -p 1
    PROCS OK: 1 process with command name 'nagios', PPID = 1

The program to be monitored is called nagios (option -C), and its parent process should have the ID 1 (option -p). Exactly one Nagios process must be running, no more and no less; otherwise the plugin will issue a CRITICAL status. This is specified as a range: -c 1:1.
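The range notation can be made concrete with a small Python model. It follows the rules given for -w and -c in the option list of this section (an omitted lower limit means zero, an omitted upper limit means infinity, a single number is the maximum, and swapped limits invert the test); the function names are hypothetical and the code is only a sketch of that description, not the plugin's own logic.

```python
import math


def parse_range(spec):
    """'2:10' -> (2, 10); ':10' -> (0, 10); '10:' -> (10, inf); '5' -> (0, 5)."""
    if ":" not in spec:
        return 0.0, float(spec)
    start, end = spec.split(":", 1)
    return (float(start) if start else 0.0,
            float(end) if end else math.inf)


def violates(value, spec):
    """True if value triggers the alert for the given range."""
    low, high = parse_range(spec)
    if low > high:                       # swapped limits: alert inside the range
        return high <= value <= low
    return value < low or value > high   # normal case: alert outside the range


def procs_state(count, warn=None, crit=None):
    if crit is not None and violates(count, crit):
        return "CRITICAL"
    if warn is not None and violates(count, warn):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    print(procs_state(1, crit="1:1"))    # exactly one process
    print(procs_state(2, crit="1:1"))    # one process too many
```

The same helper covers ranges with both a warning and a critical limit, such as -w 1:4 -c 1:7.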
Another example: between one and four simultaneous processes of the OpenLDAP replication service slurpd should be active:

    nagios@linux:nagios/libexec$ ./check_procs -w 1:4 -c 1:7 -C slurpd
    PROCS OK: 1 process with command name 'slurpd'

If the actual number of processes lies between 1 and 4, the plugin returns OK, as is the case here. If it finds between five and seven processes, however, a warning will be given. Outside this range, check_procs categorizes the status as CRITICAL. This is the case here if there are either no processes running at all, or more than seven running.

Instead of the number of processes of the same program, you can also monitor the CPU load caused by it, its use of memory, or even the CPU runtime used. check_procs has the following options:

-w start:end / --warning=start:end
    The plugin issues a warning if the actual values lie outside the range specified by the start and end value. Without further details, it assumes that it should count processes: -w 2:10 means that check_procs gives a warning if it finds fewer than two or more than ten processes.
    If you omit one of the two limit values, zero applies as the lower value, or infinity as the upper limit. This means that the range :10 is identical to 0:10; 10: describes any number larger than or equal to 10. If you just enter a single whole number instead of a range, this represents the maximum. The entry 5 therefore stands for 0:5.
    If you swap the maximum and minimum, the plugin will give a warning if the actual value lies within the range, so for -w 10:5 this will be if the value is 5, 6, 7, 8, 9, or 10. You may always specify only one interval.

-c start:end / --critical=start:end
    This specifies the critical range, in the same way as for the warning limit.
-m type / --metric=type (from version 1.4)
    This switch selects one of the following metrics for the test:

    PROCS: number of processes (the default if no specific type is given)

    VSZ: the virtual size of a process in memory (virtual memory size), consisting of the main memory space that the process uses exclusively, plus that of the shared libraries used. These only take up memory space once, even if they are used by several different processes. The specification is given in bytes.

    RSS: the proportion of main memory in bytes that the process actually uses for itself (Resident Set Size), that is, VSZ minus the shared memory.

    CPU: CPU usage in percent. The plugin here checks the CPU usage of each individual process against the warning and critical limits. If one of the processes exceeds the warning limit, Nagios will issue a warning. In the text output the plugin also shows how many processes have exceeded the warning or critical limit.

    ELAPSED: The overall time that has passed since the process was started.

-s flags / --state=flags
    This restricts the test to processes with the specified status flag.[1] The plugin in the following example gives a warning if there is more than one zombie process (status flag: Z):

    nagios@linux:nagios/libexec$ ./check_procs -w 1 -c 5 -s Z
    PROCS OK: 0 processes with STATE = Z

    Things become critical here if more than five zombies "block up" the process table. Several states can be queried at the same time by adding individual flags together, as in -s DSZ. Now the plugin counts the processes that are in at least one of the states mentioned.

    [1] The following states are possible in Linux: D (uninterruptible waiting, usually a disk wait), R (running process), S (wait status), T (process halted), W (paging, only up to kernel 2.4), X (a finished, killed process), and Z (zombie). Further information is provided by man ps.

-p ppid / --ppid=ppid
    This switch restricts the test to processes whose parent processes have the parent process ID (ppid).
    The only PPIDs that are known from the beginning, and that do not change, are 0 (started by the kernel, which usually concerns only the init process) and 1 (the init process itself).

-P pcpu / --pcpu=pcpu (from version 1.4)
    This option filters processes according to the percentage of CPU they use:

    nagios@linux:nagios/libexec$ ./check_procs -w 1 -c 5 -P 10
    PROCS OK: 1 process with PCPU >= 10,00

    The plugin in this example takes into account only processes that have at least a ten percent share of CPU usage. As long as there is just one such process (-w 1), it returns OK. If there are between two and five such processes, the return value is a WARNING. With at least six processes, each with a CPU usage of at least ten percent, things get critical.

-r rss / --rss=rss (from version 1.4)
    This option restricts the test to processes that occupy at least rss bytes of main memory. It is used like -P.

-z vsz / --vsz=vsz (from version 1.4)
    This option restricts the test to processes whose VSZ (see above) is at least vsz bytes. It is used like -P.

-u user / --user=user
    This option restricts the test to processes that belong to the specified user (see example below).

-a "string" / --argument-array="string"
    This option restricts the test to commands whose argument list contains string. -a .tex, for example, refers to all processes that work with *.tex files; -a -v to all processes that are called with the -v flag.

-C command / --command=command
    This causes the process list to be searched for the specified command name. command must exactly match the command name, without a path (see example below).

-t timeout / --timeout=timeout
    After timeout seconds have expired, the plugin stops the test and returns the CRITICAL status. The default is 10 seconds.

The following example checks to see whether exactly one process called master is running on a mail server on which the Cyrus Imapd is installed.
No process is just as much an error as more than one process:

    user@linux:nagios/libexec$ ./check_procs -w 1:1 -c 1:1 -C master
    CRITICAL - 2 processes running with command name master

The first attempt returns two processes, although only a single Cyrus master process is running. The reason can be found if you run ps:

    user@linux:~$ ps -fC master
    UID    PID  PPID  C STIME TTY      TIME     CMD
    cyrus  431     1  0 2004  ?        00:00:28 /usr/lib/cyrus/bin/master
    root   1042    1  0 2004  ?        00:00:57 /usr/lib/postfix/master

The Postfix mail service also has a process with the same name. To keep an eye just on the master process of the Imapd, the search is additionally restricted to processes running with the permissions of the user cyrus:

    user@linux:nagios/libexec$ ./check_procs -w 1:1 -c 1:1 -C master -u \
       cyrus
    OK - 1 processes running with command name master, UID = 96 (cyrus)

7.5 Checking Log Files

Monitoring log files is not really part of the concept of Nagios. On the one hand, the syslog daemon notes critical events there immediately, so that the onset of an error status can be determined correctly. But if the error status continues, this cannot be seen in the log file in most cases.

Accordingly, the plugins described here can determine only whether new entries on error events have been added. In order to communicate information on a continuing error to Nagios via a log file, the service monitored must log the error status regularly and repeatedly, at least at the same intervals as Nagios reads the log file. Otherwise the plugin will alternate between returning an error status and an OK status, depending on whether the (continuing) error has in the meantime turned up in the log or not.

Under no circumstances may Nagios repeat its test. The parameter max_check_attempts (see page 45) must have the value 1. Otherwise Nagios would first assign the error status as a soft state, would repeat the test, and would almost always arrive at an OK, since it only takes into account new entries during repeat tests.
max_check_attempts = 1 ensures that Nagios diagnoses a hard state after the first test.

For events that log an error just once, Nagios has volatile services, described in Section 14.5.2 from page 257. For services defined in this way, the system treats every error status as if it were occurring for the first time (causing a message to be sent each time, for example). Such services must be reset manually to the OK status. How this is done is described in Section 14.5.3 from page 258.

7.5.1 The standard plugin check_log

With check_log, Nagios provides a simple plugin for monitoring log files. It creates a copy of the tested log file each time it is run. If the log file has changed since the previous call, check_log searches the newly added data for simple text patterns. The plugin does not have any long options and just has the states OK and CRITICAL:

-F logfile
    This is the name and path of the log file to be tested. It must be readable for the user nagios.

-O oldlog
    This is the name and path of the log file copy. The plugin just examines the difference between oldlog and logfile when it is run. Afterwards it copies the current log file to oldlog. oldlog must contain the absolute path and be readable for the user nagios.

-q query
    This is the pattern searched for in examining the log file. Not found means OK; a match returns the CRITICAL status.

It is recommended that you generally do not use messages of the type recovery notification (OK after an error state). An OK in a repeated test just means that no new error events have occurred since the last test. The notification_options parameter (see page 46) in the service definition should therefore not contain an r.
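The principle of examining only newly added data can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of check_log: the function name check_new_entries is hypothetical, the diff utility is assumed to be installed, and the pattern is passed to grep as a basic regular expression.

```shell
#!/bin/bash
# Sketch only: check_new_entries is a hypothetical stand-in for the idea
# behind check_log (compare with the old copy, search only the new lines).
check_new_entries() {
    local logfile=$1 oldlog=$2 pattern=$3 new matches
    [ -f "$oldlog" ] || : > "$oldlog"          # first run: start with an empty copy
    # Lines that exist in the current log but not in the old copy.
    new=$(diff "$oldlog" "$logfile" | sed -n 's/^> //p')
    cp "$logfile" "$oldlog"                    # remember the current state
    matches=$(printf '%s\n' "$new" | grep -c -- "$pattern")
    if [ "$matches" -gt 0 ]; then
        echo "CRITICAL - $matches new matching line(s)"
        return 2
    fi
    echo "OK - no new matching lines"
    return 0
}
```

Calling the function twice in a row on an unchanged log file returns CRITICAL the first time and OK the second time, even though the logged problem may still exist. This is the behavior that makes max_check_attempts = 1 necessary and recovery notifications misleading.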
The following command examines the file /var/log/auth for failed logins:

    nagios@linux:local/libexec$ ./check_log -F /var/log/auth \
       -O /tmp/check_log.badlogin -q "authentication failure"
    (1)

        '/var/log/messages',
        'reg_exp' => 'ntpd',
      },
      {
        'file_name' => '/var/log/warn',
        'reg_exp' => '(named|dhcpd)',
      },
    );
    1;

[2] http://www.nagiosexchange.org/Misc.54.0.html.

The Perl variable $seek_file_template contains the path to the file in which the plugin saves the current position of the last search. check_logs.pl remembers here at what point in the log file it should carry on searching the next time it is run. This means that the plugin does not require a copy of the processed log file. In place of the variable $log_file, it inserts the name of the log file being examined in each case, and so creates a separate position file for each log file.

What exactly check_logs.pl is to do is defined by the Perl array @log_files. The entry file_name points to the log file to be tested (with the absolute path), and reg_exp contains the regular expression [3] for which check_logs.pl should search the log file. In the example above this is just the simple text ntpd in the case of the /var/log/messages log file, but there is an alternative in the case of /var/log/warn: the regular expression (named|dhcpd) matches lines that contain either the text named or the text dhcpd.

The only specification that the plugin itself requires when it is run is the configuration file (option -c):

    nagios@linux:local/libexec$ ./check_logs.pl -c /etc/nagios/check_logs.cfg
    messages => OK; warn => OK;
    nagios@linux:local/libexec$ ./check_logs.pl -c /etc/nagios/check_logs.cfg
    messages => OK; warn => (4): Jul 2 14:33:25 swobspace dhcpd: Configuration file errors encountered -- exiting;

The first command shows the basic principle: in the text output the plugin announces separately for each log file whether it has found a matching event or not. In the first run it didn't find anything, so it returns OK.
In the second command the plugin comes across four relevant entries in the warn log file, but it doesn't find any in /var/log/messages. Because of this, the plugin returns a WARNING; OK is given only if no relevant events were found in any of the log files checked. In its output line, after (4):, the plugin shows the last of the four lines found.

[3] In the form of Perl-compatible regular expressions (PCRE, see man perlre), since check_logs.pl is a Perl script.

7.6 Keeping Tabs on the Number of Logged-in Users

The plugin check_users is used to monitor the number of logged-in users:

    user@linux:nagios/libexec$ ./check_users -w 5 -c 10
    USERS CRITICAL - 20 users currently logged in |users=20;5;10;0

It has just two options:

-w number / --warning=number
    This is the threshold for the number of logged-in users after which the plugin should give a warning.

-c number / --critical=number
    This is the threshold for a critical state, measured by the number of logged-in users.

The performance data after the | is as usual visible only on the command line; Nagios does not include it in the Web interface.

7.7 Checking the System Time

7.7.1 Checking the system time via NTP

The check_ntp plugin compares the clock time of the local computer with that of an available NTP server in the network. If the Nagios server keeps time via NTP accurately enough, so that it can serve as a reference itself, then it can also be used as a network plugin, provided that the host to be checked in the network has an NTP daemon installed.

The plugin requires the program ntpdate, which, if you compile Nagios yourself, must already be available before the check_ntp installation. You should also install the program ntpq, which determines the jitter. This is a measure of the runtime deviations of incoming NTP packets. If the fluctuations are too large, the time synchronization will be imprecise.
In the simplest case, check_ntp is called, specifying the computer (here: ntpserver) whose time should be compared with that of the local computer:

    nagios@linux:nagios/libexec$ ./check_ntp -H ntpserver
    NTP OK: Offset -8.875159 secs, jitter 0.819 msec, peer is stratum 0

The deviation found here is over eight seconds. Whether this is tolerated or not depends on the intended use. If you want to compare log file entries for many computers, then they should all be NTP-synchronized. Then there is no problem in using -w 1 -c 2, which would already categorize a deviation of two seconds as critical.

check_ntp has the following options:

-H address / --host=address
    This is the NTP server with which the plugin should compare the local system time.

-w floating point decimal / --warning=floating point decimal
    This is the warning limit in seconds. The warning is given if the deviation of the local system time is larger than the threshold specified. The default is 60 seconds.

-c floating point decimal / --critical=floating point decimal
    If the local system time deviates more than floating point decimal seconds (in the default setting 120 seconds) from that of the NTP server, the status becomes CRITICAL.

-j milliseconds / --jwarn=milliseconds
    This is the warning limit for the jitter in milliseconds. The default here is 5000.

-k milliseconds / --jcrit=milliseconds
    The critical threshold for the jitter. The default is 10000 milliseconds.

7.7.2 Checking system time with the time protocol

Apart from the Network Time Protocol NTP there is another protocol, older and simpler: the Time Protocol described in RFC 868, in which communication takes place via TCP port 37. On many Unix systems the corresponding server is integrated into the inet daemon, so you do not have to start a separate daemon. With check_time, Nagios provides an appropriate test plugin.
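To make the comparison concrete: a Time Protocol server, as specified in RFC 868, answers with a 32-bit number giving the seconds since 1 January 1900, whereas Unix clocks count from 1 January 1970. The following sketch shows only the arithmetic of such a comparison, assuming the server's raw value has already been read; the function name time_difference is hypothetical and the code is not taken from check_time.

```shell
#!/bin/bash
# Seconds between 1900-01-01 (RFC 868 epoch) and 1970-01-01 (Unix epoch).
RFC868_OFFSET=2208988800

# Sketch only: time_difference is a hypothetical helper.
# $1 = raw value returned by the time server, $2 = local time in Unix seconds.
# Prints the absolute difference in seconds.
time_difference() {
    local server_raw=$1 local_unix=$2 diff
    diff=$(( (server_raw - RFC868_OFFSET) - local_unix ))
    echo "${diff#-}"                 # strip the sign: fast and slow clocks count alike
}

# Usage: compare a raw server value with the local clock.
# (The server value here is made up for illustration.)
time_difference 3913056000 "$(date +%s)"
```

Because only the absolute value is printed, a single positive threshold covers clocks that run slow as well as clocks that run fast.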
check_time can also be used as a network plugin, in a similar way to check_ntp, but this again assumes that the time service is available for every client. In most cases it will therefore be used as a local plugin that compares its own clock time with that of a central time server (here: timesrv):

    nagios@linux:nagios/libexec$ ./check_time -H timesrv -w 10 -c 60
    TIME CRITICAL - 1160 second time difference| time=0s;;;0 offset=1160s;10;60;0

The performance data after the | sign, not shown in the Web interface, contains the response time in seconds, with time (here: zero seconds); offset describes by how much the clock time differs from that of the time server (here: 1160 seconds). The other values, each separated by a semicolon, provide the warning limit, the critical threshold, and the minimum (see also Section 17.1 from page 314). Since we have not set any threshold values with the options -W or -C, the corresponding entries for time are empty.

check_time has the following options:

-H address / --hostname=address
    This is the host name or IP address of the time server.

-p port / --port=port
    This is the TCP port specification, if different from the default 37.

-u / --udp
    Normally the time server is queried via TCP. With -u you can use UDP if the server supports this.

-w integer / --warning-variance=integer
    If the local time deviates more than integer seconds from that of the time server, the plugin returns a WARNING. integer is always positive, and this covers clocks that are running both slow and fast.

-c integer / --critical-variance=integer
    If there is more than integer seconds difference between the local and the time server time, the return value of the plugin is CRITICAL.

-W integer / --warning-connect=integer
    If the time server needs more than integer seconds for the response, a WARNING is returned.

-C integer / --critical-connect=integer
    If the time server does not respond within integer seconds, the plugin reacts with the return value CRITICAL.
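The layout of the performance data discussed in this section (label, value, and then warning limit, critical limit, and minimum, separated by semicolons) can be taken apart with a few lines of shell. This is a sketch for reading such a string, assuming labels without spaces or quotes; the function name parse_perfdata is hypothetical.

```shell
#!/bin/bash
# Sketch only: parse_perfdata is a hypothetical helper that splits a
# performance data string into its fields. Empty fields are shown as "none".
parse_perfdata() {
    local item label rest value warn crit min max
    for item in $1; do                       # entries are separated by spaces
        label=${item%%=*}
        rest=${item#*=}
        IFS=';' read -r value warn crit min max <<< "$rest"
        echo "$label value=$value warn=${warn:-none} crit=${crit:-none} min=${min:-none}"
    done
}

# Usage with the performance data from the check_time example:
parse_perfdata "time=0s;;;0 offset=1160s;10;60;0"
```

For the check_time output this prints one line per entry and shows directly that time carries no thresholds, while offset carries the limits 10 and 60.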
7.8 Regularly Checking the Status of the Mail Queue

The check_mailq plugin can be used to monitor the mail queue of a mail server for e-mails that have not yet been delivered. check_mailq runs the program mailq of the mail service installed. Unfortunately each MTA interprets the mail queue differently, so the plugin can evaluate only mail queues from mail services that the programmer has taken into account. These are, specifically: sendmail, qmail, postfix, and exim. check_mailq has the following options:

-w number / --warning=number
    If there are at least number mails in the mail queue, the plugin gives a warning.

-c number / --critical=number
    As soon as there are at least number mails in the queue waiting to be delivered, the critical status has been reached.

-W number of domains / --Warning=number of domains
    This is the warning limit with respect to the number of recipient domains of a message waiting in the mail queue. Thus -W 3 generates a warning if there are any mails in the queue that are addressed to three or more different recipient domains.

-C number of domains / --Critical=number of domains
    This is the critical threshold with respect to the number of recipient domains (like -W).

-M daemon / --mailserver=daemon (from version 1.4)
    This specifies the mail service used. Possible values for daemon are sendmail (the default), qmail, postfix, and exim.

-t timeout / --timeout=timeout
    After timeout seconds, the plugin stops the test and returns the CRITICAL status. The default here, as an exception, is 15 seconds (usually it is 10 seconds).

In the following example, Nagios should give a warning if there are at least five mails in the queue; if the number reaches ten, the status of the MTA Postfix used here becomes CRITICAL:

    user@linux:nagios/libexec$ ./check_mailq -w 5 -c 10 -M postfix
    OK: mailq reports queue is empty|unsent=0;5;10;0

Since the queue is empty, check_mailq returns OK here.
7.9 Keeping an Eye on the Modification Date of a File

With the check_file_age plugin you can monitor not only the last modification date of a file, but also its size. From version 1.4 it is included in the default installation. In version 1.3.x the sources can be found in the subdirectory contrib; the plugin created from this must be copied manually to the plugin directory.

In the simplest case it is just run with the name and path of the file to be monitored:

    user@linux:nagios/libexec$ ./check_file_age /var/log/messages
    WARNING - /var/log/syslog/messages is 376 seconds old and 7186250 bytes

Here the plugin gives a warning, since the warning limit set is 240 seconds and the critical limit, 600 seconds. The last modification of the file was 376 seconds ago, that is, inside the warning range.

The file size is taken into account by check_file_age only if a warning limit for the file size (option -W) is explicitly specified. The plugin could then give a warning if the file is smaller than the given limit (in bytes). The defaults for the warning and critical limits here are both zero bytes.

check_file_age has the following options:

-w integer / --warning-age=integer
    If the file is older than integer [4] (the default is 240) seconds, the plugin issues a warning.

-c integer / --critical-age=integer
    A critical status occurs if the file is older than integer (default: 600) seconds.

-W size / --warning-size=size
    If the file is smaller than size bytes, the plugin gives a warning. If the option is omitted, 0 bytes is the limit. In this case check_file_age does not take the file size into account.

-C size / --critical-size=size
    A file size smaller than size bytes sets off a critical status. The default is 0 bytes, which means that the file size is ignored.

-f file / --file=file
    The name of the file to be tested. The option may be omitted if you instead, as in the above example, just give the file name itself as an argument.
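The age test itself amounts to comparing the current time with the modification time of the file. The following sketch illustrates this with the default limits named above (240 and 600 seconds). It is not the plugin's code: the function name file_age_status is hypothetical, the GNU form of stat (stat -c %Y) is assumed, and returning UNKNOWN for an unreadable file is a choice of this sketch rather than documented plugin behavior.

```shell
#!/bin/bash
# Sketch only: file_age_status is a hypothetical helper.
# $1 = file, $2 = warning age in seconds (default 240), $3 = critical age (default 600).
file_age_status() {
    local file=$1 warn=${2:-240} crit=${3:-600} now mtime age
    # GNU stat assumed: %Y prints the modification time in Unix seconds.
    if ! mtime=$(stat -c %Y "$file" 2>/dev/null); then
        echo "UNKNOWN - cannot read modification time of $file"
        return 3
    fi
    now=$(date +%s)
    age=$((now - mtime))
    if [ "$age" -gt "$crit" ]; then
        echo "CRITICAL - $file is $age seconds old"
        return 2
    elif [ "$age" -gt "$warn" ]; then
        echo "WARNING - $file is $age seconds old"
        return 1
    fi
    echo "OK - $file is $age seconds old"
}
```

A file last changed 376 seconds ago, as in the example above, falls between the two limits and therefore yields the return value 1 (WARNING).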
[4] Because check_file_age is a Perl script, it does not matter in this case whether an integer or a floating-point decimal is specified. Fractions of a second do not play a role in the file system.

7.10 Monitoring UPSs with apcupsd

To monitor uninterruptible power supplies (UPS) from the company APC there is the possibility, apart from the Network UPS Tools described in Section 6.11 from page 126, of using the apcupsd daemon, optimized specifically for use with these UPSs. The software can be obtained from http://www.apcupsd.com/ and is licensed under the GPL, despite the fact that it is vendor-dependent.

The principal function here is the ability to shut down systems in the event of power failure, rather than a mere monitoring function with Nagios. For this latter purpose, it is easier to configure the Network UPS Tools.

Nearly all Linux distributions contain a working apcupsd package,[5] so you don't have to worry about installing it. Nagios does not include an apcupsd plugin, but there is a very simple and effective script available for download at http://www.negative1.org/check_apc/: check_apc. It is also licensed under the GPL, but it has no network capabilities.

[5] At least SuSE and Debian use this package name.

The plugin cannot be given a host when it is run, and it also does not support any other types of options. Instead of this, internal commands control its functionality, which are given as the first argument. Executing check_apc status tests whether the UPS is online.
If this is the case, the plugin returns the OK status, in all other cases it returns CRITICAL:

    user@linux:nagios/libexec$ ./check_apc status
    UPS OK - ONLINE

check_apc load warn crit checks the load currently on the UPS and displays it as a percentage of the maximum capacity. A warning is given if the load is greater than the warning limit specified in warn (in the following example, 60 percent), CRITICAL if the load is greater than crit (here 80 percent):

    user@linux:nagios/libexec$ ./check_apc load 60 80
    UPS OK - LOAD: 39%

The charge status of the UPS battery is checked by the command check_apc bcharge warn crit. Here the warning limit warn and the critical limit crit are also given in percent. The value 100 means "fully charged." The plugin accordingly gives a warning if the charge is smaller than the warning limit, and a CRITICAL if the charge is smaller than the critical limit:

    user@linux:nagios/libexec$ ./check_apc bcharge 50 30
    UPS OK - Battery Charge: 100%

You can find out how long the saved energy will last with check_apc time warn crit. Here check_apc gives a warning if the remaining time is less than warn minutes, and a CRITICAL if the remaining time is less than crit minutes:

    user@linux:nagios/libexec$ ./check_apc time 20 10
    UPS OK - Time Left: 30 mins

7.11 Nagios Monitors Itself

If necessary, Nagios can even monitor itself: the included plugin, check_nagios, tests, on the one hand, whether Nagios processes are running and, on the other hand, the age of the log file nagios.log in the Nagios var directory, for example /var/nagios/nagios.log.

Despite this, the question needs to be asked: if Nagios itself is not running, then the system simply cannot perform the plugin, which in turn cannot deliver an error message. The solution to this problem consists in having two Nagios servers, each of which addresses the locally installed plugin on the opposite server, with the help of NRPE (see Chapter 10 from page 165).
If you have just one Nagios server you can also run check_nagios alone via cron and have the return value checked using a shell script. In this case you must take care yourself, as shown in Section 7.11.1, that you are suitably informed of a failure.

The plugin has the following options:

-C /path/to/nagios / --command=/path/to/nagios
    This is the complete nagios command, including the path (e.g., -C /usr/local/nagios/bin/nagios).

-F /path/to/logfile / --filename=/path/to/logfile
    This is the path to where the Nagios log file nagios.log is saved. The file is located in the Nagios var directory.

-e integer / --expires=integer
    This is the maximum age of the log file. If there have been no changes to the file for longer than integer minutes, check_nagios issues a warning.

    You should make sure that this time specification is large enough: if no errors are currently occurring, Nagios will not log anything in the log file. The only reliable way to obtain a regular entry is with the parameter retention_update_interval in the configuration file nagios.cfg (see page 438). The default value is 60 minutes.

In the following example the log file should not be older than 60 minutes (this corresponds to the default retention update interval, see page 438):

    user@linux:nagios/libexec$ ./check_nagios -e 60 \
       -F /var/nagios/nagios.log -C /usr/local/nagios/bin/nagios
    Nagios ok: located 5 processes, status log updated 303 seconds ago

With currently five running Nagios processes and a log file last changed 303 seconds ago (a good five minutes), everything is in order here. If the -e parameter is omitted, the plugin always gives a warning.

7.11.1 Running the plugin manually with a script

The following example script demonstrates how the plugin is called outside the Nagios environment. It starts check_nagios initially as Nagios does and then evaluates the return value.
If the status is not 0, it sends an e-mail to the administrator nagios-admin@example.com, using the external mailx program:

    #!/bin/bash
    NAGCHK="/usr/local/nagios/libexec/check_nagios"
    PARAMS="-e 60 -F /var/nagios/nagios.log -C /usr/local/nagios/bin/nagios"
    # run the plugin and keep both its text output and its return value
    INFO=`$NAGCHK $PARAMS`
    STATUS=$?
    case $STATUS in
      0) echo "OK : $INFO"
         ;;
      *) echo "ERROR : $INFO" | \
           /usr/bin/mailx -s "Nagios Error" nagios-admin@example.com
         ;;
    esac

The script can be run at regular intervals via a cron job, such as every 15 minutes. But then it will also "irritate" the administrator every quarter of an hour with an e-mail for as long as the error persists. There is certainly room for improvement in this respect, but that would go beyond the scope of this book.

7.11.2 check_nagios as a tool for CGI programs

Using the nagios_check_command parameter (see page 445) you can also use the plugin in the file cgi.cfg. If the parameter is set there, the CGI programs use the specified command to see if Nagios is operational. The test integrated into the CGI programs functions so well, however, that you do not need to go to the trouble of defining nagios_check_command.

7.12 Hardware Checks with LM Sensors

Modern mainboards are equipped with sensors that allow you to check the "health" of the system. With the lm-sensors [6] project it is also possible in Linux to query this data via I2C or SMBus (System Management Bus, an I2C special case).

To enable this, the kernel must have a suitable driver. Kernel 2.4.x normally requires additional modules, which are included in the software.[7] With a little luck, your distribution may include precompiled modules (e.g. SuSE). Kernel 2.6, however, already includes many drivers; here you just compile the entire branch below I2C Hardware Sensors Chip support.

[6] http://www.lm-sensors.nu/
[7] http://secure.netroedge.com/~lm78/download.html
It would take too much space here to detail the installation of the necessary modules. We will therefore only go into detail for the check_sensors plugin, and assume that the corresponding kernel driver is already loaded as a module. Help is provided during operation with the sensors-detect program from the lm-sensors package, which does a number of tests and then tells you which modules need to be loaded.

If all requirements are fulfilled, running the sensors program will produce an output similar to the following one, and shows that the onboard sensors are providing data:

    user@linux:~$ sensors
    fscher-i2c-0-73
    Adapter: SMBus I801 adapter at 2400
    Temp1/CPU: +41.00 C
    Temp2/MB: +45.00 C
    Temp3/AUX: failed
    Fan1/PS: 1440 RPM
    Fan2/CPU: 0 RPM
    Fan3/AUX: 0 RPM
    +12V: +11.86 V
    +5V: +5.10 V
    Battery: +3.07 V

The output depends on the hardware, so it will be slightly different for each computer. Here you can see, for example, the CPU and motherboard temperatures (41 and 45 degrees Celsius), the rotation speed of the fans, and the voltages on the 12- and 5-volt circuits and on the battery. Depending on the board design and the manufacturer, some details may be missing; in this example, only the fan for the power supply, Fan1/PS,[8] provides information; Fan3/AUX refers to an additional fan inside the computer box that, although it is running, is not recorded by the chipset.

Apart from the standard options -h (help function), -v (verbose), which displays the response of the sensors, and -V, which shows the plugin version, the plugin itself has no special options. Warning and critical limits must be set via the lm-sensors configuration.
check_sensors only returns the status given by the onboard sensors:

    user@linux:nagios/libexec$ ./check_sensors
    sensor ok

If this is called with the -v option, you can see more clearly whether the test works:

    user@linux:nagios/libexec$ ./check_sensors -v
    fscher-i2c-0-73 Adapter: SMBus I801 adapter at 2400 Temp1/CPU: +40.00 C
    Temp2/MB: +45.00 C Temp3/AUX: failed Fan1/PS: 1440 RPM Fan2/CPU: 0 RPM
    Fan3/AUX: 0 RPM +12V: +11.86 V +5V: +5.10 V Battery: +3.07 V
    sensor ok

[8] PS stands for power supply; but the names displayed can be edited in /etc/sensors.conf.

The output line is only wrapped for printing purposes; the plugin displays verbose information on a single line.

Alternatively you can use SNMP to access the sensor data: the NET-SNMP package (see Section 11.2 from page 184) provides the data delivered by lm-sensors, and with the SNMP plugin check_snmp, warning limits can also be set from Nagios. This solution is described in Section 11.3.1 from page 196.

7.13 The Dummy Plugin for Tests

For tests expected to end with a defined response, the check_dummy plugin can be used. It is given a return value and the desired response text as parameters, and it provides exactly these two responses as a result:

    nagios@linux:nagios/libexec$ ./check_dummy 1 "Debugging"
    WARNING: Debugging
    nagios@linux:nagios/libexec$ echo $?
    1

The output line contains the defined response, preceded by the status in text form. The return value can again be checked with echo $?: 1 stands for WARNING.

Alternatively you can give check_dummy a 0 (OK), a 2 (CRITICAL) or a 3 (UNKNOWN) as the first argument. The second argument, the response text, is optional.

Chapter 8
Manipulating Plugin Output

8.1 Negating Plugin Results

In some situations you may want to test the opposite of what the standard plugin normally tests, such as an interface that should not be active, or a Web page or a host that should normally not be reached. In these cases the program negate, included in the Nagios plugins, provides a way of negating the return value of the original check.
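The idea can be sketched as a small wrapper function. The sketch follows the mapping that this section gives for negate (OK and CRITICAL are swapped, WARNING and UNKNOWN pass through unchanged), but it is only an illustration, not the source of the negate program; the function name negate_sketch is hypothetical and no timeout handling is included.

```shell
#!/bin/bash
# Sketch only: negate_sketch is a hypothetical wrapper, not the negate program.
# It runs the given command, passes its output through untouched,
# and swaps the return values 0 (OK) and 2 (CRITICAL).
negate_sketch() {
    "$@"
    local rc=$?
    case $rc in
        0) return 2 ;;      # OK becomes CRITICAL
        2) return 0 ;;      # CRITICAL becomes OK
        *) return "$rc" ;;  # WARNING and UNKNOWN remain unchanged
    esac
}
```

Since only the return value is changed, a wrapped plugin that prints CRITICAL in its text still prints it, even though the status reported to the caller is 0.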
Like plugins, negate has an option to specify a timeout in seconds, with -t, after which it should abort the operation. The actual command line must always contain the complete path to the plugin:

    negate plugin command
    negate -t timeout plugin command

negate changes the return value of 2 (CRITICAL) to 0 (OK) and vice versa. The return codes 1 (WARNING) and 3 (UNKNOWN) remain unchanged.

The following example carries out check_icmp on the host 192.0.2.1, which in normal cases should not be reachable:

    nagios@linux:nagios/libexec$ ./negate \
       /usr/local/nagios/libexec/check_icmp -H 192.0.2.1
    CRITICAL - 192.0.2.1: rta nan, lost 100%| rta=0.000ms;200.000;500.000;0; pl=100%;40;80;;
    nagios@linux:nagios/libexec$ echo $?
    0

The plugin itself returns a CRITICAL in this case with a corresponding text. negate "inverts" the return value; 2 (CRITICAL) turns into 0 (OK). Since the text originates from the plugin and is not changed, the information CRITICAL remains here. For Nagios itself, however, nothing but the return value is of any interest.

8.2 Inserting Hyperlinks with urlize

The program urlize represents the text output of a plugin as a hyperlink, if required, so that clicking in the Nagios Web interface on the test result takes you to another Web page. Like negate, urlize functions as a wrapper around the normal plugin command and is included with the other Nagios plugins.

As the first argument it expects a valid URL to which the hyperlink should point. This is followed by the plugin command, including its path:

    urlize url plugin command

To avoid problems with spaces in plugin arguments, you can set the complete plugin command in double quotation marks.
The hyperlink around the normal plugin output can be easily recognized when running the command manually:

    nagios@linux:nagios/libexec$ ./urlize http://www.swobspace.de \
       /usr/local/nagios/libexec/check_http -H www.swobspace.de
    HTTP OK HTTP/1.1 200 OK - 2802 bytes in 0.132 seconds |time=0.132491s;;;0.000000 size=2802B;;;0

In version 1.4 urlize also embeds the performance output in the link text, but Nagios cuts this off before the representation in the Web interface, together with the end tag. But most browsers do not have any problem with the missing end tag.

Chapter 9
Executing Plugins via SSH

Local plugins, that is, programs that only run tests locally because there are no network protocols available, must be installed on the target system and started there. They check processes, CPU load, or how much free hard disk capacity is still available, among other things.

But if you still want to execute these plugins from the Nagios server, it is recommended that you use the secure shell, especially if any kind of Unix system is installed on the machine to be tested: a Secure Shell daemon will almost always be running on such a target system, and you do not require any special permissions to run most plugins. The Nagios administrator needs nothing more than an account, which he can use from the Nagios server. On the server itself, the check_by_ssh plugin must be installed.

In heterogeneous environments the Secure Shell itself often creates conditions that may cause problems: depending on the operating system, an SSH daemon may be in use that returns a false return code [1] or is so old that it cannot handle the SSH protocol version 2.0. In this case it is better to install the current OpenSSH version from http://www.openssh.org/. In pure Linux environments with up-to-date and maintained installations, such problems generally do not occur.
[1] In the nagios-users mailing list it was reported that Sun_SSH_1.0 returns a return code of 255 instead of 0, which makes it unsuitable for the deployment described here.

9.1 The check_by_ssh Plugin

check_by_ssh is run on the Nagios server and establishes a Secure Shell connection to a remote computer so that it can perform local tests on it. The programs run on the remote machine are to a large extent local plugins (see Chapter 7 from page 133); the use of check_by_ssh is not just restricted to these, however.

The plugin sends a complete command line to the remote computer and then waits for a plugin-compatible response: a response status between 0 (OK) and 3 (UNKNOWN), as well as one line of text information for the administrator (page 85).

If you run network plugins via check_by_ssh in order to perform tests on other computers, these are known as indirect checks, which will be explained in the context of the Nagios Remote Plugin Executor in Section 10.5 from page 174.

The following example shows how check_by_ssh can be used to check the swap partition on the target computer:

    nagios@linux:nagios/libexec$ ./check_by_ssh -H target computer \
      -i /etc/nagios/.ssh/id_dsa \
      -C "/usr/local/nagios/libexec/check_swap -w 50% -c 10%"
    SWAP OK: 100% free (972 MB out of 972 MB) |swap=972MB;486;97;0;972

The command is similar to that for a secure shell, in the form of

    ssh -i private_key target computer "command"

Using a separate private key, rather than the default private key in the home directory, is optional and is described in detail in Section 9.2 from page 160. In contrast to the secure shell ssh, check_by_ssh expects the command to be run with the option -C; the plugin is always specified with an absolute path.

check_by_ssh has the following options:

-H address / --hostname=address
    The host name or IP address of the computer to which the plugin should set up an SSH connection.
-C command / --command=command
    The command to be run on the remote computer, that is, the plugin with its complete path and all the necessary parameters:

        -C "/usr/local/nagios/libexec/check_disk -w 10% -c 5% -e -m"

-1 / --proto1 (from nagios-plugins-1.4)
    Force version 1 of the secure shell protocol.

-2 / --proto2 (from version 1.4)
    Force version 2 of the secure shell protocol.

-4 / --use-ipv4 (from version 1.4)
    The SSH connection is set up explicitly over an IPv4 connection.

-6 / --use-ipv6 (from version 1.4)
    The SSH connection is set up explicitly over an IPv6 connection.

-i keyfile / --identity=keyfile
    Which file should be used instead of the standard key file containing the private key of the user nagios? For one option, which is recommended, see Section 9.2.3, page 162.

-p port / --port=port
    This specifies the port if the Secure Shell daemon on the target server is not listening on the standard TCP port 22.

-l user / --logname=user
    User name on the target host.

-w floating point decimal / --warning=floating point decimal
    If the response to the command to be executed takes more than floating point decimal seconds, the plugin will issue a warning.

-c floating point decimal / --critical=floating point decimal
    The critical value in seconds concerning the response time of the command to be executed.

-f [2]
    Starts a background process without opening an interactive terminal (tty).

-t timeout / --timeout=timeout
    After timeout seconds have expired, the plugin stops the test and returns the CRITICAL status. The default is 10 seconds.

[2] There is currently no long form for this option.

In addition to this, check_by_ssh has the parameters -O, -s, and -n available, enabling it to write the result in passive mode to the interface for external commands (see Section 13.1 from page 240). The mode is named this way because Nagios does not receive the information itself but reads it indirectly from the interface.
This procedure has the advantage of being able to run several separate commands simultaneously over a single SSH connection. This may cause the command definition to be rather complicated, however. Since the plugins themselves are called and executed as programs on the target server, it hardly matters whether the SSH connection is established once or three times. For this reason it is better to use a simple command definition rather than the passive mode. But if you still want to find more information about this, you can look in the online help, which is called with check_by_ssh -h.

9.2 Configuring SSH

So that Nagios can run plugins over the secure shell remotely and automatically, it (or, strictly speaking, the user nagios on the Nagios server) must not be distracted by any password queries. This is avoided with a login via a public key mechanism.

9.2.1 Generating SSH key pairs on the Nagios server

The key pair required to do this is stored by the key generator ssh-keygen by default in the subdirectory .ssh of the respective user's home directory (for the user nagios, following the installation guide in Section 1.1 from page 26, this is /usr/local/nagios). If ssh-keygen is also given the -f private keyfile option (without path specification), the key pair will land in the current working directory, which in the following example is /etc/nagios/.ssh:

    nagios@linux:~$ mkdir /etc/nagios/.ssh
    nagios@linux:~$ cd /etc/nagios/.ssh
    nagios@linux:/etc/nagios/.ssh$ ssh-keygen -b 2048 -f id_dsa -t dsa -N ''
    Generating public/private dsa key pair.
    Your identification has been saved in id_dsa.
    Your public key has been saved in id_dsa.pub.
    The key fingerprint is:
    02:0b:5a:16:9c:b4:fe:54:24:9c:fd:c3:12:8f:69:5c nagios@nagserv

The length of the key here is 2048 bits, and DSA is used as the key type. -N '' ensures that the private key in id_dsa does not receive separate password protection: this option forces an empty password.
9.2.2 Setting up the user nagios on the target host

Similar to the configuration on the Nagios server, the group and the user nagios are also set up on the computer to be monitored:

    target computer:~ # groupadd -g 9000 nagios
    target computer:~ # useradd -u 9000 -g nagios -d /home/nagios -m \
      -c "Nagios Admin" nagios
    target computer:~ # mkdir /home/nagios/.ssh

On the target computer, the user is given the directory /home/nagios as the home directory, where a subdirectory .ssh is created. In this the administrator (or another user [3]) saves the public key generated on the Nagios server, /etc/nagios/.ssh/id_dsa.pub, in a file called authorized_keys:

    linux:~ # scp /etc/nagios/.ssh/id_dsa.pub \
      target computer:/home/nagios/.ssh/authorized_keys

Now the user nagios does not require its own password on the target server. You just need to make sure that on the target server the .ssh directory, together with authorized_keys, belongs to the user nagios:

    target computer:~ # chown -R nagios:nagios /home/nagios/.ssh
    target computer:~ # chmod 700 /home/nagios/.ssh

9.2.3 Checking the SSH connection and check_by_ssh

With this configuration you should first check whether the secure shell connection is working properly. The test is performed as the user nagios, since Nagios makes use of this account during the checks:

    nagios@linux:~$ ssh -i /etc/nagios/.ssh/id_dsa target computer w
     18:02:09 up 128 days, 10:03, 8 users, load average: 0.01, 0.02, 0.00
    USER     TTY      FROM          LOGIN@   IDLE   JCPU   PCPU  WHAT
    wob      pts/1    linux01:S.1   08Sep04  1:27   4.27s  0.03s -bin/tcsh
    ...

The -i option explicitly specifies the path to the private key file. If the command w to be run on the target computer does not provide any output, or if the SSH daemon on the other side requests a password, then the login via public key is not working. In this case you must first find and eliminate the error before you can move on to testing check_by_ssh.

[3] ...but not the user nagios, because when an account is created, useradd first sets an invalid password here, which we do not change into a valid one.
This means that you cannot currently log in to the target computer as nagios.

In this next step, you run the local plugin on the target computer with check_by_ssh from the command line of the Nagios server, just as Nagios will run it automatically later on. Make sure that the plugin paths are correct in each case. The path to the private key file of the user nagios on the server is specified with -i:

    nagios@linux:~$ /usr/local/nagios/libexec/check_by_ssh \
      -H target computer -i /etc/nagios/.ssh/id_dsa \
      -C "/usr/local/nagios/libexec/check_disk -w 10% -c 5% -e -m"
    DISK CRITICAL [2588840 kB (5%) free on /net/linux04/b] [937152 kB (5%) free on /net/linux04/c]

In the example, check_by_ssh should start the /usr/local/nagios/libexec/check_disk plugin on the target computer with the options -w 10% -c 5% -e -m. If this does not work, then the plugin is first run locally on the target host with the same parameters. This lets you determine whether the problem lies in the plugin command itself rather than in the secure shell connection.

9.3 Nagios Configuration

The matching command object is again defined in the file checkcommands.cfg; similar to check_local_disk, it should be named check_ssh_disk:

    # check_ssh_disk command definition
    define command{
       command_name   check_ssh_disk
       command_line   $USER1$/check_by_ssh -H $HOSTADDRESS$ \
                      -i /etc/nagios/.ssh/id_dsa \
                      -C "$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$"
    }

The command line stored in command_line first runs check_by_ssh; $USER1$ contains the local plugin path on the Nagios server. Next come the arguments: the IP address of the target host (parameter -H), the private key file (parameter -i), and finally, with the -C parameter, the complete command that the target host should carry out. If the plugin path on the target host and on the Nagios server are identical, then you can also use the $USER1$ macro in it; otherwise the plugin path on the target computer is given explicitly.

Setting up the command is no different here to the one in check_local_disk in Section 7.1 on page 134.
This means that apart from the warning and critical limits, we explicitly specify a file system or a hard drive partition, with the -p parameter.

The command check_ssh_disk defined in this way is applied as follows, here on a computer called linux02:

    define service{
       host_name             linux02
       service_description   FS_root
       ...
       check_command         check_ssh_disk!10%!5%!/
       ...
    }

The service object defined in this way ensures that Nagios checks its / file system. The warning limit lies at 10 percent, the critical limit at 5 percent.

If you use the check_by_ssh plugin with check_ssh_disk, as in the example here, you must make sure that the plugin path is identical on all target hosts. This is also worth doing for reasons of simplicity, though it is not always possible in practice. For this reason, the following service definition gives the plugin path on the target computer as an additional argument:

    define service{
       host_name             linux02
       service_description   FS_root
       ...
       check_command         check_ssh_disk!/usr/lib/nagios/plugins!10%!5%!/
       ...
    }

In order for this to work, you must change the command passed on with -C in the command definition as follows:

    -C "$ARG1$/check_disk -w $ARG2$ -c $ARG3$ -p $ARG4$"

Caution: this causes the numbers of each of the $ARGx$ macros for -w, -c, and -p to be shifted by one.

Chapter 10: The Nagios Remote Plugin Executor (NRPE)

The Nagios Remote Plugin Executor (or in short, NRPE), as the name suggests, executes programs on a remote host. These are usually plugins that test the corresponding computer locally and therefore must be installed on it. The use of NRPE is not restricted to local plugins; any plugins at all can be executed, including those intended to test network services, for example, to indirectly test computers that are not reachable from the Nagios server (as shown in Section 10.5 from page 174).
While a genuine user account must be available on the remote computer when the secure shell is used (see Chapter 9), which can also be used to do other things than just start plugins, NRPE is restricted exclusively to explicitly configured tests. If you want to, or are forced to, do without a login shell on the target host, it is better to use NRPE, even if there is somewhat more configuration work involved than with the secure shell. In addition to the Nagios configuration and the installation of the required plugins on the target system, the following is required:

- The program nrpe must be installed on the target system.
- The inet daemon there (inetd or xinetd) must be configured with administrator privileges.
- The check_nrpe plugin must be installed on the Nagios server.

10.1 Installation

NRPE and the plugins are installed from the sources, or you can fall back on the packages provided by the distributor. You should use at least version 2.0 of NRPE, since this is incompatible with its predecessors. As this was released back in September 2003, there should now be corresponding packages for it.

Version 1.3.1 of the plugin collection is also from 2003; version 1.4 was only released at the beginning of 2005 and had not been integrated into all the standard distributions at the time of going to press. Whether you need the most up-to-date version depends on your expectations of the respective plugins.

10.1.1 Distribution-specific packages

SuSE Linux 9.3 includes the packages nagios-nrpe-2.0-111.i586.rpm, nagios-plugins-1.4-3.i586.rpm, and nagios-plugins-extras-1.4-3.i586.rpm. nagios-nrpe contains both the daemon and the plugin check_nrpe. nagios-plugins-extras installs several additional plugins, such as database checks, the FPing test, or the Radius test, which can be omitted, depending on your specific monitoring needs. For the sake of simplicity, these packages are installed via YaST2 [1] or rpm -ihv package. The second method is also open to Fedora users.
For Fedora Core 3, the corresponding Nagios packages have been made available by Dag Wieers at http://dag.wieers.com/home-made/apt/packages.php: nagios-nrpe-2.0-3.1.fc3.rf.i386.rpm, nagios-plugins-nrpe-2.0-3.1.fc3.rf.i386.rpm, and nagios-plugins-1.4-2.1.fc3.rf.i386.rpm.

Debian/Sarge distributes the NRPE daemon and the NRPE plugin check_nrpe in two different packages called nagios-nrpe-server and nagios-nrpe-plugin, which can be installed separately via apt-get install package. If you want to do without local documentation, you can omit the package nagios-nrpe-doc and just add the plugin package nagios-plugins to the target hosts.

The paths for the program nrpe, the configuration file nrpe.cfg, and the plugin directory are listed in Table 10.1.

[1] On the command line, using yast -i package.

Table 10.1: Installation paths for NRPE and plugins

    Distribution        NRPE program           NRPE configuration file   Plugins
    Self-compiled [2]   /usr/local/sbin/nrpe   /etc/nagios/nrpe.cfg      /usr/local/nagios/libexec
    SuSE                /usr/bin/nrpe          /etc/nagios/nrpe.cfg      /usr/lib/nagios/plugins
    Debian              /usr/sbin/nrpe         /etc/nagios/nrpe.cfg      /usr/lib/nagios/plugins
    Fedora [3]          /usr/sbin/nrpe         /etc/nagios/nrpe.cfg      /usr/lib/nagios/plugins

10.1.2 Installation from the source code

The plugins are installed on the computers to be monitored exactly as described in Section 1.2 from page 30 for the Nagios server.

The NRPE source code is obtained from The Nagios Exchange. [4] The directory /usr/local/src [5] is ideal for unpacking the sources.

    linux:~ # mkdir /usr/local/src
    linux:~ # cd /usr/local/src
    linux:local/src # tar xvzf /path/to/nrpe-2.0.tar.gz

In the new directory that has been created, you run the configure command:

    linux:local/src # cd nrpe-2.0
    linux:src/nrpe-2.0 # ./configure --sysconfdir=/etc/nagios --enable-ssl

The recommended path specifications are listed in Table 10.1. The only difference from the default settings is the directory in which the NRPE configuration file is stored (configure option --sysconfdir).
Accordingly, we can leave out the entries for --with-nrpe-user and --with-nrpe-group in the configure command. Both options are relevant only if the nrpe program is running as a daemon, and they can be overwritten in the configuration file. If the inet daemon is used, you should specify the user with whose permissions nrpe should start in the configuration file for the inet daemon.

--enable-ssl ensures that NRPE communicates over an SSL-encrypted channel. This will only work, of course, if both nrpe on the target host and check_nrpe on the Nagios server have been compiled accordingly.

[2] Recommended.
[3] From the packages provided by Dag Wieers.
[4] http://www.nagiosexchange.org/NRPE.77.0.html.
[5] The subdirectory src may need to be created first.

The command make all compiles the programs nrpe and check_nrpe, but it does not copy them from the directory /usr/local/src/nrpe-2.0/src to the corresponding system directories. Since there is no make install, you must do this yourself, following the details in Table 10.1: you need to have nrpe on the computer to be monitored and the check_nrpe plugin on the Nagios server.

If the Nagios server and the target host use the same platform, you can compile both programs on one computer (e.g., the server) and then copy nrpe together with its configuration file to the computer to be monitored, instead of separately compiling check_nrpe on the Nagios server and nrpe on the target system.

10.2 Starting via the inet Daemon

It is best to start the program nrpe on the machine to be monitored via the inet daemon rather than as a separate daemon, since the Nagios server only performs the tests occasionally, and nrpe does not need to load any large resources.

If you have a choice, you should use the more modern xinetd. But to keep work to a minimum, the inet daemon that is already running on the target system will normally be used.
In order that NRPE can be started as a service via inetd or xinetd, the nrpe service is defined in the file /etc/services:

    nrpe 5666/tcp # Nagios Remote Plugin Executor NRPE

Even if NRPE has been installed as a distribution package, you should still check to see whether this entry exists. By default, NRPE uses TCP port 5666.

10.2.1 xinetd configuration

If xinetd is used, a separate file is stored in the directory /etc/xinetd.d for each service to be started, so for nrpe it is best to create a file called nrpe or nagios-nrpe:

    # /etc/xinetd.d/nrpe
    # description: NRPE
    # default: on
    service nrpe
    {
       flags           = REUSE
       socket_type     = stream
       wait            = no
       user            = nobody
       group           = nogroup
       server          = /usr/local/sbin/nrpe
       server_args     = -c /etc/nagios/nrpe.cfg --inetd
       log_on_failure += USERID
       disable         = no
       only_from       = 127.0.0.1 ip_of_the_nagios_server
    }

The site-specific values must be adapted to your own environment; instead of the placeholder ip_of_the_nagios_server for only_from, for example, you should enter the IP address of your own Nagios server. NRPE access from outside is then restricted to this computer and to localhost (127.0.0.1). The latter address allows local tests; multiple IP addresses are separated by a space. However, this restrictive configuration functions only if xinetd has been compiled with support for the TCP wrapper (this is normally the case).

Under no circumstances should NRPE run with the permissions of a privileged user; nobody is therefore a sensible value. The server parameter specifies the complete path to the program nrpe; for server_args you should enter the matching path to the configuration file.
After this modification, the configuration of xinetd is reloaded with

    linux:~ # /etc/init.d/xinetd reload

10.2.2 inetd configuration

In the standard inetd, the following line is added to the configuration file /etc/inetd.conf:

    nrpe stream tcp nowait nobody /usr/sbin/tcpd /usr/local/sbin/nrpe -c /etc/nagios/nrpe.cfg --inetd

In the configuration file, all of this must be on a single line. Here the TCP wrapper tcpd is used. If this is not intended, you simply leave out this entry. [6] Here you should also explicitly enter the user nobody, the complete path to the binary nrpe, and the configuration file, also with its complete path. These strings should be adjusted to your own system where necessary. After the configuration change, inetd is reloaded:

    linux:~ # /etc/init.d/inetd reload

[6] inetd does not have a built-in method to allow access to services only from specific IP addresses. This function is added by the TCP wrapper tcpd. The access configuration is then taken over by the files /etc/hosts.allow and /etc/hosts.deny. More information on this is given by man hosts_access.

10.3 NRPE Configuration on the Computer to Be Monitored

When compiling NRPE, the file nrpe.cfg is created in the source directory, which contains several parameters as well as the commands that NRPE is to run. This file is copied manually to the configuration directory, which normally first has to be created on the target computer:

    linux:src/nrpe-2.0 # mkdir /etc/nagios
    linux:src/nrpe-2.0 # cp nrpe.cfg /etc/nagios/.

Distribution-specific packages place the file in the location specified in Table 10.1 on page 167.

At run time, nrpe is given the permissions of the user specified in the inet daemon configuration, which in our case is nobody. Therefore nrpe.cfg needs to be readable for this user. As long as the file does not contain any passwords (these really should not be used) or other critical information, read permissions for all can be allowed.
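Whether read permission for all is actually set can be checked quickly from the shell. The helper below is only a sketch: the name is_world_readable is made up for illustration, and it inspects nothing but the mode bits of the file itself, so access control lists and the permissions of the parent directories are not taken into account.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the "other" read bit is set on the file.
# Only the mode bits of the file itself are examined.
is_world_readable() {
    [ -n "$(find "$1" -prune -perm -004 2>/dev/null)" ]
}

# Possible use on the target host (path as in Table 10.1):
#   is_world_readable /etc/nagios/nrpe.cfg || echo "nrpe.cfg is not readable for all"
```

If the check fails, chmod 644 on the file restores read access for all users, provided the file really contains no confidential data.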
The configuration file contains many comments; the following command displays the active parameters: [7]

    user@linux:~$ egrep -v '^#|^$' nrpe.cfg | less
    server_port=5666
    allowed_hosts=127.0.0.1
    nrpe_user=nobody
    nrpe_group=nogroup
    dont_blame_nrpe=0
    debug=0
    command_timeout=60
    ...

The parameters server_port, allowed_hosts, nrpe_user, and nrpe_group are only relevant if nrpe is working as a daemon. When the inet daemon is used, the program ignores these values since they have already been determined by the (x)inetd configuration.

The entry dont_blame_nrpe=0 prevents nrpe from accepting parameters, thus closing a potential security hole. debug=1 allows extensive logging, useful if you are looking for errors (debug=0 switches off the output of debugging information), and command_timeout specifies a time span in seconds after which nrpe abruptly interrupts a plugin that has hung. Comments in the configuration file explain all these parameters as well.

[7] The regular expression ^#|^$ matches all lines that either begin with a comment sign # or that consist of an empty line. The option -v ensures that egrep shows all lines that are not matched by this.

After this, the commands are defined that are to be executed by NRPE. The configuration file nrpe.cfg already contains some, but first they all have to be commented out, and only those commands activated that really are intended for use.

The keyword command is followed in square brackets by the name with which check_nrpe should call the command. After the equals sign (=), the corresponding plugin command is specified, with its complete path: [8]

    command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
    command[check_load]=/usr/local/nagios/libexec/check_load -w 8,5,3 -c 15,10,7
    command[check_zombies]=/usr/local/nagios/libexec/check_procs -w :1 -c :2 -s Z

With the path, care must be taken that this really does point to the local plugin directory.
In the directory specified here, /usr/local/nagios/libexec, the self-compiled plugins are located [9]; for installations from distribution packages the path is usually /usr/lib/nagios/plugins.

From the Nagios server, the command just defined, check_users, is now run on the target computer via check_nrpe:

    nagios@linux:nagios/libexec$ ./check_nrpe -H target host -c check_users

[8] The check_users command is explained in Section 7.6 from page 144, check_load is explained in Section 7.3 from page 137, and Section 7.4 from page 138 deals with check_procs.
[9] ...provided you have followed the instructions in the book.

10.3.1 Passing parameters on to local plugins

The method described so far has one disadvantage: for each test on the target system, a separately defined command is required. Here is the example of a server on which the plugin check_disk (see Section 7.1 from page 134) is required to monitor nine file systems:

    command[check_disk_a]=path/to/check_disk -w 5% -c 2% -p /net/linux01/a
    command[check_disk_b]=path/to/check_disk -w 4% -c 2% -p /net/linux01/b
    command[check_disk_c]=path/to/check_disk -w 5% -c 2% -p /net/linux01/c
    command[check_disk_d]=path/to/check_disk -w 5% -c 2% -p /net/linux01/d
    command[check_disk_root]=path/to/check_disk -w 10% -c 5% -p /
    command[check_disk_usr]=path/to/check_disk -w 10% -c 5% -p /usr
    command[check_disk_var]=path/to/check_disk -w 10% -c 5% -p /var
    command[check_disk_home]=path/to/check_disk -w 10% -c 5% -p /home
    command[check_disk_tmp]=path/to/check_disk -w 10% -c 5% -p /tmp

To avoid all this work, NRPE can also be configured so that parameters can be passed on via check_nrpe:

    dont_blame_nrpe=1
    ...
    command[check_disk]=path/to/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$

In order for this to work, the NRPE configure script must be run with the option --enable-command-args. The reason for this inconvenient procedure is that passing parameters on is a fundamental risk, since it cannot be ruled out that a certain choice of parameters could cause an (as yet unknown) buffer overflow, allowing the target system to be penetrated.

If you still decide on this, despite all the security risks, you should use a TCP wrapper (see Section 10.2.2, page 169) to ensure that only the Nagios server itself is allowed to send commands to NRPE.

If the plugin provides the corresponding options, there is sometimes a third method, however: the above-mentioned problem can also be solved by getting check_disk, if necessary, to test all file systems with one single command:

    user@linux:nagios/libexec$ ./check_disk -w 10% -c 4% -e -m
    DISK WARNING [2588840 kB (5%) free on /net/linux1/b] [937160 kB (5%) free on /net/linux1/c]

The -e parameter persuades the plugin to display only those file systems that produced a warning or an error. One restriction remains: the warning and critical limits are, by necessity, the same for all file systems.

10.4 Nagios Configuration

Commands that "trigger" local plugins on remote computers via check_nrpe are defined as before in the file checkcommands.cfg on the Nagios server.

10.4.1 NRPE without passing parameters on

If no parameters are passed on to the target plugin, things will look like this:

    define command{
       command_name   check_nrpe
       command_line   $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
    }

As the only argument, Nagios passes the command here that NRPE is to execute. If the check_nrpe plugin on the Nagios server is located in a different directory to the other plugins, you must enter the correct path instead of $USER1$.
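To see what ends up on the command line, it can help to replay the macro substitution by hand. The sketch below is a simplified model, not Nagios code: the function name expand_check_command, the plugin path, and the address 192.0.2.10 are assumptions for illustration. It splits a check_command value at each exclamation mark, drops the command name, and inserts the remaining pieces as $ARG1$, $ARG2$, and so on. Values containing |, & or backslashes are not handled, and unused macros are simply left in place.

```shell
#!/bin/sh
# Hypothetical helper modelling the macro substitution in command_line.
# Arguments: template, value of $USER1$, value of $HOSTADDRESS$,
#            check_command value from the service definition.
expand_check_command() {
    template=$1
    user1=$2
    hostaddress=$3
    check_command=$4

    # Split the check_command value at each "!".
    old_ifs=$IFS
    IFS='!'
    set -f                       # no file name expansion while splitting
    set -- $check_command        # $1 = command name, $2... = arguments
    set +f
    IFS=$old_ifs
    if [ "$#" -gt 0 ]; then shift; fi    # drop the command name itself

    line=$(printf '%s\n' "$template" |
        sed -e "s|\\\$USER1\\\$|$user1|g" \
            -e "s|\\\$HOSTADDRESS\\\$|$hostaddress|g")
    n=1
    for arg in "$@"; do
        line=$(printf '%s\n' "$line" | sed -e "s|\\\$ARG$n\\\$|$arg|g")
        n=$((n + 1))
    done
    printf '%s\n' "$line"
}

expand_check_command '$USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$' \
    /usr/local/nagios/libexec 192.0.2.10 'check_nrpe!check_disk_var'
# prints: /usr/local/nagios/libexec/check_nrpe -H 192.0.2.10 -c check_disk_var
```

The same helper also reproduces the shifted argument numbers from Section 9.3 if it is given the check_by_ssh template and a check_command value with the plugin path as the first argument.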
A service to be tested via NRPE uses the command just defined, check_nrpe, as check_command. As an argument, the command is specified that was defined in nrpe.cfg on the target system (here: linux04):

    define service{
       host_name             linux04
       service_description   FS_var
       ...
       check_command         check_nrpe!check_disk_var
       ...
    }

10.4.2 Passing parameters on in NRPE

In order to address the command defined in Section 10.3.1 on page 171

    command[check_disk]=path/to/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$

from the Nagios server, check_nrpe is given the corresponding arguments through the option -a:

    define command{
       command_name   check_nrpe
       command_line   $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
    }

So that $ARG2$ can correctly transport the parameters for the remote plugin, these are separated by spaces in the service definition. In addition, you should ensure that the order is correct:

    define service{
       host_name             linux04
       service_description   FS_var
       ...
       check_command         check_nrpe!check_disk!10% 5% /var
       ...
    }

NRPE on linux04 assigns the three strings 10%, 5%, and /var to its own three macros $ARG1$, $ARG2$, and $ARG3$ in the check_disk command defined in nrpe.cfg.

10.4.3 Optimizing the configuration

If the NRPE commands are given identical names on all target systems, then all NRPE commands with the same name can be included in a single service definition. When doing this you can make use of the possibility of specifying several hosts, or even an entire group of hosts:

    define service{
       host_name             linux04,linux02,linux11
       service_description   FS_var
       ...
       check_command         check_nrpe!check_disk_var
       ...
    }

With the command check_disk_var, defined at the beginning of Section 10.3.1 on page 171, Nagios now checks the /var file systems on the computers linux04, linux02, and linux11. If other file systems are to be included in the test, a separate service is created for each one, thus avoiding the security problem involved in passing parameters on.
If you use the option of testing all file systems at the same time with the check_disk plugin (see Section 7.1), then ultimately a single service definition is sufficient to monitor all file systems on all Linux servers, provided you have a corresponding NRPE configuration on the target system:

    define service{
       hostgroup_name        linux-servers
       service_description   Disks
       ...
       check_command         check_nrpe!check_disk
       ...
    }

10.5 Indirect Checks

NRPE executes not just local plugins, but any plugins that are available. If you use network plugins via NRPE, these are referred to as indirect checks, as illustrated graphically in Figure 10.1.

If every network service were tested directly across the firewall, the firewall would have to open all the required ports. In the example, these would be the ports for SMTP, HTTP, LDAP, PostgreSQL, and SSH. If the checks are performed indirectly from a computer that is behind the firewall, on the other hand, then it is sufficient just to have the port for NRPE (TCP port 5666) open on the firewall. As long as it is configured via NRPE, the NRPE server behind the firewall can perform any tests it wants.

Figure 10.1: Indirect checks with NRPE

Whether the effort involved in indirect checks is greater than that for direct ones depends on the specific implementation: if direct checks would mean that you have to "drill holes into your firewall," then the additional work on the NRPE server may be worthwhile. But if the ports involved are open anyway, then the direct test can usually be recommended; this makes additional configuration work on an NRPE host unnecessary.

Chapter 11: Collecting Information Relevant for Monitoring with SNMP

SNMP stands for Simple Network Management Protocol, a protocol defined above all to monitor and manage network devices. This means being able to have not only read access, but also write access to network devices, so that you can turn a specific port on a switch on or off, or intervene in other ways.

Nearly all network-capable devices that can also be addressed via TCP/IP can handle SNMP, and not just switches and routers.
For Unix systems there are SNMP daemons; even Windows servers contain an SNMP implementation in their standard distribution, although this must be explicitly installed. But even uninterruptible power supplies (UPSs) or network-capable sensors are SNMP-capable.

If you are using Nagios, then at some point you can't avoid coming into contact with SNMP, because although you usually have a great choice of querying techniques for Unix and Windows systems, when it comes to hardware-specific components such as switches, without their own sophisticated operating system, SNMP is often the only way to obtain information from the network device.

SNMP certainly does not have a reputation of being easy to understand, which among other things lies in the fact that it is intended for communication between programs, and machine processing is in the foreground. In addition, you generally do not make direct contact with the protocol and with the original information, since even modems or routers provide a simple-to-operate interface that disguises the complexity of the underlying SNMP.

If you want to use SNMP with Nagios, you cannot avoid getting involved with the information structure of the protocol. Section 11.1 therefore provides a short introduction to SNMP. Section 11.2 from page 184 introduces NET-SNMP, probably the most widely used implementation for SNMP on Unix systems. On the one hand it shows how to obtain an overview of the information structure of a network device with command-line tools, and on the other it describes the configuration of the SNMP daemon in Linux. Finally, Section 11.3 from page 196 is devoted to the concrete use of SNMP with Nagios.

11.1 Introduction to SNMP

Although SNMP contains the P for "protocol" in its name, this does not stand for a protocol alone, but is used as a synonym for the Internet Standard Management Framework. This consists of the following components:

- Manageable network nodes that can be controlled remotely via SNMP. A specific implementation of an SNMP engine, whether by software or hardware, is referred to as an agent.
- At least one SNMP unit consisting of applications with which the agents can be managed. This unit is referred to as a manager.
- A protocol with which agent and manager can exchange information: the Simple Network Management Protocol (SNMP).
- A well-defined information structure, so that any managers and agents can understand each other: the so-called Management Information Base, or in short, MIB.

The framework assigns the manager the active role. The agent itself just waits passively for incoming commands. In addition, so-called traps extend the application possibilities of SNMP: these are messages that the agent actively sends to a single manager or a whole group of managers, for example if predefined limit values are exceeded or if functions of the network device fail.

As agents, SNMP engines implemented by the manufacturer are used for hardware-specific devices (switches, routers). For Linux and general Unix systems, the NET-SNMP implementation is available (see Section 11.2); for Windows servers there is equivalent software already included with the operating system.

In combination with Nagios, there are two possibilities. With Nagios in the active role, corresponding Nagios plugins act as the manager and ask the agents for the desired information. The other way round, Nagios can also passively receive incoming SNMP traps using utilities and process these. Section 14.6 from page 260 is devoted to this topic.

An understanding of the SNMP information structure, the so-called Management Information Base (MIB), is critical if you want to use SNMP with Nagios successfully. For this reason this section will focus on it. The protocol itself is only mentioned briefly to illustrate the differences between different protocol versions. If you want to get involved more deeply with SNMP, we refer you to the numerous Request for Comments (RFCs) describing SNMP.
The best place to start would be RFC 3410, "Introduction and Applicability Statements for Internet Standard Management Framework", and RFC 3411, "An Architecture for Describing Simple Network Management Protocol (SNMP) Management Frameworks." Apart from an introduction and numerous cross-links, you will also find references there to the original documents of the older versions, today referred to as SNMPv1 and SNMPv2.

11.1.1 The Management Information Base

The SNMP information structure consists of a hierarchical namespace construction of numbers. Figure 11.1 shows an extract from this. The tree structure is similar to those of other hierarchical directory services, such as DNS or LDAP.

Its root is called 1 (iso) and stands for the International Organization for Standardization. The next level, 3 (org), shown in Figure 11.1, provides a space for general, national and international organizations. Beneath this is 6 (dod) for the U.S. Department of Defense. The general (IP-based) internet owes its assignment as subitem 1 (internet) of dod to its origin as a military project.

If you bring together the corresponding numbers from left to right and separate them with a dot, then for the internet node in the tree you arrive at the designation 1.3.6.1. Such nodes are referred to in general as object identifiers (OIDs). Their syntax is used not only in SNMP but also in the definition of LDAP objects and attributes, for example.

The OID 1.3.6.1 is not exactly easily readable for humans, which is why other notation methods have gained acceptance: both iso.org.dod.internet and the combination iso(1).org(3).dod(6).internet(1) are allowed. Because readable descriptions of this kind would quickly become unwieldy in a deep tree, another abbreviated notation method has become established: as long as the term remains unique, you may simply write internet instead of 1.3.6.1.
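The notations described above carry the same information, so converting between them is mechanical. The following Python sketch illustrates this for the four nodes named in the text. The NODES table and the function names are assumptions made for this illustration; they are not part of any SNMP library.

```python
# Illustrative table: (number, name) pairs on the path to the internet node,
# taken from the description in the text.
NODES = [(1, "iso"), (3, "org"), (6, "dod"), (1, "internet")]


def numeric(nodes):
    """Return the dotted numeric notation."""
    return ".".join(str(number) for number, _name in nodes)


def textual(nodes):
    """Return the dotted notation using names only."""
    return ".".join(name for _number, name in nodes)


def combined(nodes):
    """Return the combined notation, name followed by number in parentheses."""
    return ".".join(f"{name}({number})" for number, name in nodes)


if __name__ == "__main__":
    print(numeric(NODES))
    print(textual(NODES))
    print(combined(NODES))
```

Whichever form a tool displays, only the numeric form is exchanged between manager and agent.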
The important thing here is that the communication between manager and agent is exclusively of a numerical nature. Whether the manager also allows text input or is capable of issuing information as text instead of as a numeric OID depends on the implementation in each case. The information on individual nodes is provided by the manufacturer of the SNMP agent as a Management Information Base (MIB) in file form.

Figure 11.1: SNMP namespace using the example of the MIB-II interfaces

The data stored in the MIB includes contact information (who designed the MIB; usually the manufacturer of the device will be given here), the definition of individual subnodes and attributes, and the data types used. If an MIB file also describes the individual subnodes and attributes, this puts the manager in a position to supply the user with additional information on the meaning and purpose of the entry in question.

Below internet, the next level is divided into various namespaces. The management node 1.3.6.1.2 is especially important for SNMP, that is, iso(1).org(3).dod(6).internet(1).mgmt(2). The namespace here is described by RFC 1155, "Structure and Identification of Management Information for TCP/IP-based Internets."

In order for manager and agent to be able to understand each other, the manager needs to know how the agent structures its data. This is where the Management Information Base, Version II comes into play. SNMP requires agents to implement it; as a result, every manager can access the most important parameters of the agent without a previous exchange of MIB definitions. The Management Information Base II, or MIB-II (or mib-2) for short, can be found in the namespace at 1.3.6.1.2.1 or iso(1).org(3).dod(6).internet(1).mgmt(2).mib-2(1). Since it is well-defined and unique, OIDs lying beneath it are usually described in short, starting with MIB-II or mib-2.

Manufacturer-specific information can also be defined in your own Management Information Base.
Corresponding MIBs are located beneath internet.private.enterprises. Once an OID has been described in an MIB, the meaning of this entry may never be changed. The description format for an MIB is standardized by RFC 1212, which is the reason that special MIBs, included by a vendor for its agents, can be integrated into almost any manager.

MIB-II

MIB-II, the Management Information Base that is obligatory for all SNMP agents, contains several information groups. The most important of these are summarized in Table 11.1. The notation mib-2.x stands for 1.3.6.1.2.1.x.

Table 11.1: MIB-II groups (a selection)

Group       OID       Description
system      mib-2.1   Information on the device (e.g., the location, contact
                      person, or uptime)
interfaces  mib-2.2   Information on the network interfaces (name, interface
                      type, status, statistics, etc.)
at          mib-2.3   Assignment of physical addresses (e.g., of MAC
                      addresses) to the IP address (Address Translation Table)
ip          mib-2.4   Routing tables and IP packet statistics
icmp        mib-2.5   Statistics on individual ICMP packet types
tcp         mib-2.6   Open ports and existing TCP connections
udp         mib-2.7   Ditto for UDP
host        mib-2.25  Information on storage media, devices, running
                      processes and their use of resources

How you specifically handle information stored in the MIB-II can be explained using the example of the interfaces group: Figure 11.1 shows how it is split up into the two OIDs interfaces.ifNumber and interfaces.ifTable. This is because the number of interfaces on a network node is not known in advance. This number is held in ifNumber. Before looking at the interfaces more closely, a manager can find out from ifNumber how many there really are.
ifTable then contains the actual information on the different interfaces. To obtain this information for a specific interface, the manager queries all the entries in which the last number is the same, like this:

    ifEntry.ifIndex.1 = INTEGER: 1
    ifEntry.ifDescr.1 = STRING: eth0
    ifEntry.ifType.1 = INTEGER: ethernetCsmacd(6)
    ifEntry.ifMtu.1 = INTEGER: 1500
    ifEntry.ifSpeed.1 = Gauge32: 100000000
    ifEntry.ifPhysAddress.1 = STRING: 0:30:5:6b:70:70
    ifEntry.ifAdminStatus.1 = INTEGER: up(1)
    ifEntry.ifOperStatus.1 = INTEGER: up(1)
    ...

ifIndex describes the device-internal index. SNMP always starts counting from 1, whereas switches start counting here from 100. ifDescr contains the name of the interface, here eth0, so this is obviously a Linux machine. It can be assumed from the next four entries that a normal 100-MBit Ethernet interface is involved. ifType gives the interface type as ethernetCsmacd, [1] that is, Ethernet. ifMtu specifies the Maximum Transmission Unit, which for Ethernet in local networks is typically 1500 bytes. The interface speed ifSpeed is 100,000,000 bits per second here, that is, 100 MBit. And ifPhysAddress contains the physical network address, also called the MAC address.

ifAdminStatus reveals whether the admin has switched the interface on (up) or off (down) via the configuration. ifOperStatus on the other hand specifies the actual status, since even interfaces activated by an administrator are not necessarily connected to a device, or even switched on.
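Collecting all entries that share the same final number, as described above, amounts to grouping table columns by their index. The following Python sketch shows the idea on text in the format of the listing above. The SAMPLE string and the function name group_by_index are assumptions for this illustration; the code only processes text and does not talk to an SNMP agent.

```python
from collections import defaultdict

# Sample lines in the format of the listing above (a subset of the columns).
SAMPLE = """\
ifEntry.ifIndex.1 = INTEGER: 1
ifEntry.ifDescr.1 = STRING: eth0
ifEntry.ifType.1 = INTEGER: ethernetCsmacd(6)
ifEntry.ifSpeed.1 = Gauge32: 100000000
ifEntry.ifAdminStatus.1 = INTEGER: up(1)
ifEntry.ifOperStatus.1 = INTEGER: up(1)
"""


def group_by_index(text):
    """Group 'column.index = TYPE: value' lines into one dict per index."""
    interfaces = defaultdict(dict)
    for line in text.splitlines():
        oid, separator, value = line.partition("=")
        if not separator:
            continue  # skip lines such as "..." that carry no value
        column, index = oid.strip().rsplit(".", 1)
        # Drop the data type ("INTEGER", "STRING", ...) and keep the value.
        _value_type, _colon, data = value.strip().partition(":")
        interfaces[int(index)][column.split(".")[-1]] = data.strip()
    return dict(interfaces)


if __name__ == "__main__":
    for index, columns in group_by_index(SAMPLE).items():
        print(index, columns["ifDescr"], columns["ifOperStatus"])
```

A manager would first read ifNumber to learn how many such groups to expect.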
There is a similar picture for the second interface:

    ifEntry.ifIndex.2 = INTEGER: 2
    ifEntry.ifDescr.2 = STRING: lo
    ifEntry.ifType.2 = INTEGER: softwareLoopback(24)
    ifEntry.ifMtu.2 = INTEGER: 16436
    ifEntry.ifSpeed.2 = Gauge32: 10000000
    ifEntry.ifPhysAddress.2 = STRING:
    ifEntry.ifAdminStatus.2 = INTEGER: up(1)
    ifEntry.ifOperStatus.2 = INTEGER: up(1)
    ...

This is not an Ethernet card, however, but a local loopback device.

[1] Carrier Sense (CS) means that each network interface checks to see whether the line is free, based on the network signal (in contrast to Token Ring, for example, where the network card may use the line only if it explicitly receives a token); Multiple Access (MA) means that several network cards may access a common network medium simultaneously.

11.1.2 SNMP protocol versions

The first SNMP version and the Internet Standard Management Framework were described back in 1988 in RFCs 1065–1067; the current documentation on this version, named SNMPv1, can be found in RFC 1155–1157. It is still used today, since higher versions are fundamentally backward-compatible.

The big disadvantage of SNMPv1 is that this version allows only unsatisfactory authentication in precisely three stages: no access, read access, and full access for read and write operations. Two simple passwords, the so-called communities, provide a little protection here: they divide users into one community with read permissions, and a second one with read and write permissions. No further differentiation is possible. As if this were not enough, the community is transmitted in plain text, making it easy prey for sniffer tools.

Further development on the second version, SNMPv2, was intended to solve problems concerning the display of value ranges, error events, and the performance if there are mass requests (RFC 1905). This RFC was never fully implemented, however.
The only relatively complete implementation that was used in practice is known as Community-based SNMPv2, or SNMPv2c for short (RFC 1901–1908). The current version, SNMPv3 (RFC 3411–3418), has the status of an Internet standard. Agents with SNMPv3 implementations always understand requests from SNMPv1.

Apart from extended protocol operations, there are no fundamental differences between SNMPv1 and SNMPv2c. This is probably also the reason why SNMPv2 could not really gain a foothold. The hoped-for increase in security was certainly missing in this version. It is only the extensions of the framework in SNMPv3 which allow more precise access control, but this is much more complicated than the two community strings in SNMPv1. RFC 3414 describes the user-based security model (USM), RFC 3415 the view-based access control model (VACM).

When accessing an SNMP agent, you must tell all tools, including plugins, which protocol version is to be used. For Nagios you require read access only. If this is restricted to the required information and you only allow access from the Nagios server, you need have no qualms about doing without the extended authentication of SNMPv3. It is only important that you configure the agent, if possible, so that it completely prevents write accesses, or at least demands a password. You should never use this password: since it is transmitted in plain text, there is always a danger that somebody may be listening and misuse it later on.

In NET-SNMP, write accesses can be completely prevented, access can be restricted to specific hosts, and the information revealed can be limited. For other agents implemented in hardware such as switches and routers, you must weigh up whether you really need SNMPv3, assuming the manufacturer has made this available. SNMPv1, however, is available for all SNMP devices. We will therefore only explain access via SNMPv1 below, and assume that this is generally read access only.
If you still want to get involved with SNMPv3, we refer you to the NET-SNMP documentation. [2]

11.2 NET-SNMP

Probably the most widely used SNMP implementation for Linux and other UNIX systems is NET-SNMP, [3] which was originally conceived at Carnegie Mellon University. Wes Hardaker, a system administrator at the University of California in Davis, continued developing the code and first published it under the name UCD-SNMP (Version 3.0).

With version 5.0 the project finally got the name NET-SNMP. But various distributions still call the package UCD-SNMP, in part because it contains version 4.2, in part because the maintainer has simply not gotten around to renaming it.

NET-SNMP consists of a set of command line tools, a graphic browser (tkmib), an agent (snmpd, see Section 11.2.2) and a library, which now forms the basis of nearly all SNMP implementations in the Open Source field.

All common distributions include corresponding packages. In SuSE this is called net-snmp and contains all the components; Debian packs the tools in the package snmp, and the daemon in the package snmpd. At the time of going to press, version 5.2.1 was the current version, but an older version (even 4.2) will do the job for our purposes. Their outputs differ to some extent, but the exact options can be looked up where necessary in the man page.

11.2.1 Tools for SNMP requests

Command line tools snmpget, snmpgetnext and snmpwalk

For read access, the programs snmpget, snmpgetnext and snmpwalk are used. snmpget specifically requests a single OID and returns a single value from it.
snmpgetnext displays the next variable existing in the Management Information Base, including its value:

    user@linux:~$ snmpget -v1 -c public localhost ifDescr.1
    IF-MIB::ifDescr.1 = STRING: eth0
    user@linux:~$ snmpgetnext -v1 -c public localhost ifDescr.1
    IF-MIB::ifDescr.2 = STRING: lo
    user@linux:~$ snmpgetnext -v1 -c public localhost ifDescr.3
    IF-MIB::ifType.1 = INTEGER: ethernetCsmacd(6)

[2] http://net-snmp.sourceforge.net/docs/FAQ.html#How_do_I_use_SNMPv__
[3] http://net-snmp.sourceforge.net/

The option -v1 instructs snmpget to use SNMPv1 as the protocol. With -c you specify the read community; in this case, the password is public. This is followed by the computer to be queried, here localhost, and finally there is the OID whose value we would like to find out.

The NET-SNMP tools are masters of OID abbreviation: without special instructions, they always assume that an OID is involved which lies inside the MIB-II. For unique entries such as ifDescr.1, this is sufficient. But whether the various SNMP plugins for Nagios can also handle this depends on the specific implementation; it is best to try out cases on an individual basis. To be on the safe side, it is better to use complete OIDs, either numerical or in readable form. The latter is obtained if you instruct snmpget to display the full OID:

    user@linux:~$ snmpget -v1 -On -c public localhost ifDescr.1
    .1.3.6.1.2.1.2.2.1.2.1 = STRING: eth0
    user@linux:~$ snmpget -v1 -Of -c public localhost ifDescr.1
    .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifDescr.1 = STRING: eth0

The -On option provides the numerical OID, -Of the text version. In this way you can easily find out the complete OID for plugins which cannot handle the abbreviation. It is important to remember here that each OID always starts with a period. If you omit this, there will always be a plugin which doesn't work properly.
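When output like this is passed on to scripts or plugins, two small steps recur: splitting a reply line into OID, type, and value, and making sure a complete OID carries its leading period. The following Python sketch illustrates both, together with a helper that only assembles the snmpget command line used above without running it. All function names are assumptions for this illustration, and the parsing assumes the simple one-line reply format shown in the listings.

```python
def with_leading_period(oid):
    """Return the OID with a leading period, adding one if it is missing."""
    return oid if oid.startswith(".") else "." + oid


def snmpget_command(host, oid, community="public"):
    """Assemble (but do not run) an snmpget call with the flags from the text."""
    return ["snmpget", "-v1", "-On", "-c", community, host, oid]


def parse_reply(line):
    """Split 'OID = TYPE: value' into its three parts."""
    oid, _equals, rest = line.partition("=")
    value_type, _colon, value = rest.partition(":")
    return oid.strip(), value_type.strip(), value.strip()


if __name__ == "__main__":
    print(" ".join(snmpget_command("localhost", "ifDescr.1")))
    print(parse_reply(".1.3.6.1.2.1.2.2.1.2.1 = STRING: eth0"))
    print(with_leading_period("1.3.6.1.2.1.2.2.1.2.1"))
```

Passing the command as a list rather than a single string avoids shell quoting problems if it is later handed to a process launcher.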
In order to obtain the entire information stored in the MIB-II, it is better to use snmpwalk. As the name suggests, the program takes a walk through the Management Information Base, either in its entirety or in a specified part of the tree. If you would like to find out about all the entries beneath the node mib-2.interfaces (Figure 11.1), you simply give snmpwalk the required OID:

    user@linux:~$ snmpwalk -v1 -c public localhost mib-2.interfaces
    IF-MIB::ifNumber.0 = INTEGER: 3
    IF-MIB::ifIndex.1 = INTEGER: 1
    IF-MIB::ifIndex.2 = INTEGER: 2
    IF-MIB::ifIndex.3 = INTEGER: 3
    IF-MIB::ifDescr.1 = STRING: eth0
    IF-MIB::ifDescr.2 = STRING: lo
    IF-MIB::ifDescr.3 = STRING: eth1
    IF-MIB::ifType.1 = INTEGER: ethernetCsmacd(6)
    ...

snmpwalk hides the exact structure slightly (links to ifTable and ifEntry are missing, for example, see Figure 11.1), so that it is better to use -Of:

    user@linux:~$ snmpwalk -v1 -Of -c public localhost mib-2.interfaces
    ...mib-2.interfaces.ifNumber.0 = INTEGER: 3
    ...mib-2.interfaces.ifTable.ifEntry.ifIndex.1 = INTEGER: 1
    ...mib-2.interfaces.ifTable.ifEntry.ifIndex.2 = INTEGER: 2
    ...mib-2.interfaces.ifTable.ifEntry.ifIndex.3 = INTEGER: 3
    ...mib-2.interfaces.ifTable.ifEntry.ifDescr.1 = STRING: eth0
    ...mib-2.interfaces.ifTable.ifEntry.ifDescr.2 = STRING: lo
    ...mib-2.interfaces.ifTable.ifEntry.ifDescr.3 = STRING: eth1
    ...mib-2.interfaces.ifTable.ifEntry.ifType.1 = INTEGER: ethernetCsmacd(6)

The three dots (...) in the listing, abbreviated here for print, stand for .iso.org.dod.internet.mgmt.

As the next step, you could take a look around your own network and query the Management Information Bases available there. Normally you will get quite far with the read community public, since this is often the default setting. You should also try out the community string private, which is the default set by many vendors.
An extremely dubious practice, by the way: anyone who knows a bit about SNMP and who has access to the network can use this to manipulate device settings, such as switching off certain ports or the entire switch. But even with all the other default passwords, you should take the trouble to change them. Entire password lists can be found on the Internet, sorted by vendors and devices, and are easily found through Google.

Whether you also change the preset read community (such as public) depends on the information available through it and on your own security requirements. But the read-write community should under no circumstances retain the default setting. In addition it is recommended that you switch off SNMP completely for devices that are neither queried nor administered via SNMP, just to be on the safe side.

Taking a graphic walk with mbrowse

A graphic interface is often recommended for interactive research and for initial explorations of the Management Information Base, such as the SNMP browser mbrowse [4] (see Figure 11.2). This is not a component of NET-SNMP, but most Linux distributions provide an mbrowse package for installation.

[4] http://www.kill-9.org/mbrowse/

Figure 11.2: SNMP browser mbrowse

If you highlight an entry and click on the Walk button, the lower window displays the same output as snmpwalk. The graphical display, however, allows better orientation: it is easier to see in which partial tree you are currently located. It is also interesting that mbrowse shows the numeric OID of each selected object, in the Object Identifier field.

11.2.2 The NET-SNMP daemon

The NET-SNMP daemon snmpd works as an SNMP agent for Linux and other Unix systems; that is, it answers requests from a manager and also provides a way of making settings to the Linux system via write accesses, such as manipulating the routing table.
Supported Management Information Bases

The agent initially provides information on the MIB-II described in RFC 1213 (Section 11.1.1), but also the host extensions belonging to this from RFC 2790 (host MIB). Table 11.2 summarizes the groups of the host MIB, and the most important MIB-II groups are introduced in Table 11.1.

If you are interested in a detailed description of the MIB-II, including the host MIB, we refer you to the MIB browser of TU Braunschweig. [5] In addition to the basic MIB-II, the NET-SNMP implementation has its own extension at private.enterprises.ucdavis (UCD-SNMP-MIB). The directives given in Table 11.3 refer to instructions in the configuration file snmpd.conf, described later in this section. Some of the information here is also given in the host resources MIB.

Table 11.2: Components of the Host Resources MIB mib-2.host (RFC 2790)

Group          OID     Description
hrSystem       host.1  System time and uptime of the host, logged-in users,
                       and number of active processes
hrStorage      host.2  Details on all storage media such as swap, hard
                       drives, removable media, and main memory
hrDevice       host.3  List of available devices and their properties: apart
                       from details on the processor, network interfaces,
                       printer and DVD/CD-ROM drives, there is also
                       information on hard drives, their partitioning, file
                       systems, mount points and file system types
hrSWRun        host.4  All running processes including PID and command line
                       parameters
hrSWRunPerf    host.5  CPU usage and memory usage for the processes from
                       hrSWRun
hrSWInstalled  host.6  Installed software; the information originates from
                       the RPM database (unfortunately this does not work in
                       Debian)
[5] http://www.ibr.cs.tu-bs.de/cgi-bin/sbrowser.cgi

Table 11.3: Extract from the UCD-SNMP-MIB

Group            OID          Directive  Description
prTable          ucdavis.2    proc       Details of running processes
memory           ucdavis.4    -          Memory and swap space load, as in
                                         the program free
extTable         ucdavis.8    exec       Information on self-defined commands
                                         in the configuration file [6]
dskTable         ucdavis.9    disk       Information on file systems, see
                                         example in the text
laTable          ucdavis.10   load       System load
ucdExperimental  ucdavis.13   -          Experimental extension containing an
                                         entry with lm-sensor information,
                                         among other things
fileTable        ucdavis.15   file       Information on files to be
                                         explicitly monitored
version          ucdavis.100  -          Details on the NET-SNMP version and
                                         the parameters with which the daemon
                                         was compiled

While mib-2.host only specifies absolute values, such as for file systems, UCD-SNMP-MIB also allows threshold values to be set on the agent side, which then explicitly generate an error value (dskErrorFlag) with error text (dskErrorMsg):

    user@linux:~$ snmpwalk -v1 -c public localhost ucdavis.dskTable | \
        grep '.2 ='
    UCD-SNMP-MIB::dskIndex.2 = INTEGER: 2
    UCD-SNMP-MIB::dskPath.2 = STRING: /net/swobspace/b
    UCD-SNMP-MIB::dskDevice.2 = STRING: /dev/md6
    UCD-SNMP-MIB::dskMinimum.2 = INTEGER: -1
    UCD-SNMP-MIB::dskMinPercent.2 = INTEGER: 10
    UCD-SNMP-MIB::dskTotal.2 = INTEGER: 39373624
    UCD-SNMP-MIB::dskAvail.2 = INTEGER: 1694904
    UCD-SNMP-MIB::dskUsed.2 = INTEGER: 35678636
    UCD-SNMP-MIB::dskPercent.2 = INTEGER: 95
    UCD-SNMP-MIB::dskPercentNode.2 = INTEGER: 1
    UCD-SNMP-MIB::dskErrorFlag.2 = INTEGER: 1
    UCD-SNMP-MIB::dskErrorMsg.2 = STRING: /net/swobspace/b: less than 10% free (= 95%)

The grep '.2 =' filters all entries on the second device from the snmpwalk output, the Linux software RAID /dev/md6. The entry dskPercent shows the current load of this data medium.
An error exists if dskErrorFlag contains the value 1 instead of 0; dskErrorMsg adds a readable message to the error flag. It can be assumed from this that the agent has been configured so that it will announce an error if free capacity falls below 10 percent.

[6] Any executable programs can be used here.

The configuration file snmpd.conf

Configuring the agent is done in the file snmpd.conf, which is either located in the directory /etc directly (the case for SUSE) or in /etc/snmp (Debian), depending on the distribution.

Authentication and security

As the first step towards a finely tuned access control, you first need to define who should have access to which community:

    # (1) source addresses
    com2sec localnet 192.168.1.0/24 public
    com2sec localhost 127.0.0.1 public
    com2sec nagiossrv 192.168.1.9 public

com2sec links the source IP addresses to a community string (the SNMP password). This keyword is followed by an alias for the IP address range, the address range itself, and then a freely selectable community string, for which we will use public here, to keep things simple. [7] 192.168.1.0/24 refers to the local network; the Nagios server itself has the IP address 192.168.1.9. If you set access permissions for the alias localnet later on, they will apply to the entire local network 192.168.1.0/24, but if you reference nagiossrv when doing this, they will only apply to the Nagios server itself.

Then the defined computers and networks are assigned via their aliases to groups which have different security models:

    # (2) assignment of group - security model - source-IP alias
    group Local v1 localhost
    group Nagios v1 nagiossrv

The keyword group is followed first by a freely selectable group name: here we define the group Local with the security model v1, to which the address range defined as localhost belongs, and the group Nagios with the same security model, which contains the Nagios server.

You can choose from v1 (SNMPv1), v2c (community-based SNMPv2), and usm (the User Model from SNMPv3) as the security model.
If you assign a computer or a network several security models at the same time, then separate entries with the same group name are required:

    group Nagios v1 nagiossrv
    group Nagios usm nagiossrv

[7] See also the remarks on default community strings in Section 11.2.1.

With the definition of views (keyword view), the view from the outside can be restricted precisely to partial trees of the Management Information Base. Each view is also given a name for referencing:

    # (3) View definition for partial trees of the SNMP namespace
    view all included .1
    view system included .iso.org.dod.internet.mgmt.mib-2.system

The keyword included includes the partial tree that follows it in the view. Thus the view all covers the entire tree (.1). If you want to exclude certain partial trees in this, then the keyword excluded is used:

    view all included .1
    view all excluded .iso.org.dod.internet.private

The partial tree beneath private is now blocked in all, and with it, for example, the MIB ucdavis (private.enterprises.ucdavis).

One interesting feature is the mask; it specifies in hexadecimal notation which nodes must correspond exactly to the subtree:

    view all included .iso.org.dod.internet.mgmt F8

All places of the queried OID for which the mask contains a 1 in binary notation must be identical to the corresponding places of the OID specified here, .iso.org.dod.internet.mgmt; otherwise the daemon will refuse access and not provide any information. .iso.org.dod.internet.mgmt is written numerically as .1.3.6.1.2.

Thanks to the mask F8, [8] binary 11111000, the first five places from the left in the OID must always be .iso.org.dod.internet.mgmt. If somebody queried an OID that deviates from this (such as the private tree .1.3.6.1.4), the agent would remain silent and not provide any information. If you leave out the mask detail, FF will be used.

Once you have defined the alias, community, security model, and view, you just need to bring them together for the purpose of access control.
This is done with the access instruction:

    # (4) Definition of the access control
    access Local "" any noauth exact all none none
    access Nagios "" any noauth exact all none none

The access restrictions are bound to the group. The context column remains empty (""), since only SNMPv3 requires it. [9] As the security model, you then normally choose any, but you may define a specific model with v1, v2c or usm, since several different security models may be assigned to a group, as shown in the discussion of "Authentication and security" at the beginning of this section. The fifth column specifies the security level, which is also of interest only for SNMPv3. In the other two security models (we are only using v1), noauth is given here. The fourth-last column is likewise meaningful only in SNMPv3. But since you must enter a valid value for SNMPv1 and SNMPv2c as well, exact is chosen here.

[8] $F_{16} = 1 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 = 1111_2$, $8_{16} = 1000_2$
[9] Corresponding descriptions on SNMPv3 would go beyond the bounds of this book.

The next two columns specify which view should be used for which access (read or write). In the example, the groups Local and Nagios obtain read access for the view all, but no write access. The final column defines whether the agent should send SNMP traps (that is, active messages to the manager) for events that occur within the range of validity of the view. Section 14.6 goes into more detail about SNMP traps.

With the configuration described here, only the Nagios server and localhost can now access information via SNMPv1. Access from the server can be restricted further by defining a view that makes only parts of the MIB visible. But you should only try this once the configuration described is working, to avoid logical errors and time-consuming debugging.
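The view mask described earlier in this section is easiest to follow with a worked comparison. The following Python sketch checks a queried OID against a view subtree position by position, in the way the text describes for the mask F8. It is a simplified illustration only: the function names are assumptions, and the real daemon's handling of included and excluded entries and of competing views is not modeled. Treating positions beyond the end of the mask as "must match" is likewise an assumption of this sketch.

```python
def oid_parts(oid):
    """Split a numeric OID such as '.1.3.6.1.2' into a list of integers."""
    return [int(part) for part in oid.strip(".").split(".")]


def mask_bits(mask_hex):
    """Expand a hexadecimal mask such as 'F8' into a bit string ('11111000')."""
    return bin(int(mask_hex, 16))[2:].zfill(len(mask_hex) * 4)


def matches_view(queried, subtree, mask_hex="FF"):
    """Return True if the queried OID agrees with the subtree wherever the mask has a 1."""
    query, view = oid_parts(queried), oid_parts(subtree)
    if len(query) < len(view):
        return False  # the queried OID is shorter than the subtree itself
    bits = mask_bits(mask_hex)
    for position, expected in enumerate(view):
        # Assumption of this sketch: positions past the end of the mask must match.
        must_match = position >= len(bits) or bits[position] == "1"
        if must_match and query[position] != expected:
            return False
    return True


if __name__ == "__main__":
    mgmt = ".1.3.6.1.2"
    print(matches_view(".1.3.6.1.2.1.1.1.0", mgmt, "F8"))  # beneath mgmt
    print(matches_view(".1.3.6.1.4.1.2021", mgmt, "F8"))   # private tree
    print(matches_view(".1.3.6.1.4.1.2021", mgmt, "F0"))   # fifth place not compared
```

With the mask F0 (binary 11110000) only the first four places are compared, so in this sketch the private tree would match as well; this shows how a 0 bit acts as a wildcard for its position.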
System and local information

The partial tree mib-2.system provides information on the system itself and on the available (that is, implemented) MIBs. With syslocation you can specify where a system is located in the company or on the campus, and after the keyword syscontact you enter the e-mail address of the administrator responsible:

    # (5) mib-2.system
    syslocation Server room Martinstr., 2nd rack from the left
    syscontact root

As long as you do not redefine the parameters sysname and sysdescr at this point, the corresponding MIB entries will by default reveal the host name and the system and kernel specification, corresponding to uname -a:

    user@linux:~$ snmpwalk -v1 -c public localhost system
    system.sysDescr.0 = STRING: Linux swobspace 2.6.10 #20 SMP Mon Dec 27 11:55:25 CET 2004 i686
    system.sysObjectID.0 = OID: NET-SNMP-MIB::netSnmpAgentOIDs.10
    system.sysUpTime.0 = Timeticks: (1393474) 3:52:14.74
    system.sysContact.0 = STRING: root
    system.sysName.0 = STRING: swobspace
    system.sysLocation.0 = STRING: Server room Martinstr., 2nd rack from the left
    ...

Defining processes to be monitored

Processes that you want to monitor using SNMP are specified with the proc directive, and if required you can specify the minimum or maximum number of processes:

    # (6) Processes: enterprises.ucdavis.procTable
    # proc process maximum minimum
    # proc process maximum
    # proc process
    proc sshd
    proc nmbd 2 1
    proc smbd
    proc slapd

If the entries for both maximum and minimum are missing, at least one process must be running. If only the minimum is omitted, NET-SNMP will set it to zero processes. The corresponding entries end up in the MIB ucdavis.prTable; in case of error you will receive an error flag (prErrorFlag) and an error description (prErrMessage), which unfortunately you cannot define yourself:

    user@linux:~$ snmpwalk -v1 -c public localhost prTable
    ...
    prTable.prIndex.4 = INTEGER: 4
    prTable.prNames.4 = STRING: slapd
    prTable.prMin.4 = INTEGER: 0
    prTable.prMax.4 = INTEGER: 0
    prTable.prCount.4 = INTEGER: 0
    prTable.prErrorFlag.4 = INTEGER: 1
    prTable.prErrMessage.4 = STRING: No slapd process running.
    ...

ucdavis.prTable only reveals the configured processes; mib-2.host.hrSWRun and mib-2.host.hrSWRunPerf, on the other hand, generally allow all running processes to be queried. If you want to prevent this, the view must exclude the area you do not want.

Your own commands

With the exec directive you can specify commands in the extension ucdavis.extTable, which the agent will execute when the corresponding entries are queried. The result then appears in the relevant entries. In the following example the agent calls /bin/echo if it is asked for ucdavis.extTable:

    # (7) your own commands: enterprises.ucdavis.extTable
    # exec name command arguments
    exec echotest /bin/echo hello world

The program to be executed must appear with its absolute path in the configuration. Running snmpwalk provides only the following:

    user@linux:~$ snmpwalk -v1 -c public localhost extTable
    extTable.extEntry.extIndex.1 = INTEGER: 1
    extTable.extEntry.extNames.1 = STRING: echotest
    extTable.extEntry.extCommand.1 = STRING: /bin/echo hello world
    extTable.extEntry.extResult.1 = INTEGER: 0
    extTable.extEntry.extOutput.1 = STRING: hello world
    ...

extTable.extEntry.extResult contains the return value of the command executed, and extTable.extEntry.extOutput contains the text output.

With the exec directive you can thus query everything that a local script or program can find out. This could be a security problem, however: if the programs used are susceptible to buffer overflows, this feature could be misused as a starting point for a denial-of-service attack.
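Looking back at the proc directive from earlier in this section, the rules for omitted limits can be written down compactly. The following Python sketch restates them as described in the text: without any limits at least one process must run, and with only a maximum the minimum is taken as zero. The function name and the example limits are assumptions for this illustration; the code does not reproduce the agent's actual implementation.

```python
def proc_error(count, maximum=None, minimum=None):
    """Return True if the process count violates the configured limits."""
    if maximum is None and minimum is None:
        # No limits configured: at least one process must be running.
        return count < 1
    if minimum is None:
        # Only a maximum configured: the minimum is taken as zero.
        minimum = 0
    return not (minimum <= count <= maximum)


if __name__ == "__main__":
    # No limits, no process running: corresponds to an error flag of 1.
    print(proc_error(0))
    # Illustrative limits: at most 2, at least 1 process.
    print(proc_error(3, maximum=2, minimum=1))
    print(proc_error(0, maximum=2))
```

The first call mirrors the slapd entry in the listing above, where no process is running and the error flag is set.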
Monitoring hard drive capacity

The disk directive is suitable for monitoring file systems. The keyword disk is followed by the path for a mount point, and then the minimum hard drive space in kBytes or in percent that should be available. If you omit the capacity entry, at least 100 MBytes must be available; otherwise an error message will be given.

In the following example the free capacity in the / file system should not drop below 10%, and on /usr, at least 800 MBytes [10] should remain free:

    # (8) File systems: enterprises.ucdavis.dskTable
    # disk mount point
    # disk mount point minimum_capacity_in_kbytes
    # disk mount point minimum_capacity_in_percent%
    disk / 10%
    disk /usr 819200
    disk /data 50%

[10] 1024 kBytes × 800 = 819200 kBytes

As far as the data partition /data is concerned, the alarm should be raised if free capacity falls below 50%. dskErrorFlag in this case contains the value 1 instead of 0, and dskErrorMsg contains an error text:

    ...
    UCD-SNMP-MIB::dskPercent.3 = INTEGER: 65
    UCD-SNMP-MIB::dskErrorFlag.3 = INTEGER: 1
    UCD-SNMP-MIB::dskErrorMsg.3 = STRING: /data: less than 50% free (= 65%)
    ...

dskPercent reveals a current load of 65%. Besides the partial tree configured here, ucdavis.dskTable, mib-2.host.hrStorage also provides an overview of all file systems, even those not explicitly defined. It lacks percentage details, however, and you do not receive an error status or error message, as supplied by ucdavis.dskTable.

You should think hard about whether you set the warning limit in the NET-SNMP configuration or in the Nagios configuration. In the first case you must configure the values on each individual host. If you query the percentage load with the check_snmp plugin, however (see Section 11.3.1), then you set warning and critical limits centrally on the Nagios server, saving yourself a lot of work if you make changes later on.

The includeAllDisks directive adds all existing file systems to the dskTable table:

    includeAllDisks 10%

It requires a minimum limit to be specified in percent, and also returns error values.
An absolute specification in kBytes is not possible here. If you set warning and error limits centrally for check_snmp (see Section 11.3.1), the error attributes dskErrorFlag and dskErrorMsg are not queried, so that the value set here as the minimum limit can be ignored.

System load

The load directive queries the CPU load. As the limit values, you specify the average values for one minute, and optionally for five and 15 minutes:

    # (9) System Load: enterprises.ucdavis.laTable
    # load max1
    # load max1 max5
    # load max1 max5 max15
    load 5 3 2

If the values are overstepped, laErrorFlag will contain the status 1 (otherwise: 0) and laErrMessage will have the text of the error message.

In a system that exceeds one of the specified limits, snmpwalk returns the following:

    user@linux:~$ snmpwalk -v1 -c public localhost laTable
    ...
    UCD-SNMP-MIB::laNames.1 = STRING: Load-1
    UCD-SNMP-MIB::laNames.2 = STRING: Load-5
    UCD-SNMP-MIB::laNames.3 = STRING: Load-15
    UCD-SNMP-MIB::laLoad.1 = STRING: 5.31
    UCD-SNMP-MIB::laLoad.2 = STRING: 2.11
    UCD-SNMP-MIB::laLoad.3 = STRING: 0.77
    ...
    UCD-SNMP-MIB::laLoadInt.1 = INTEGER: 530
    UCD-SNMP-MIB::laLoadInt.2 = INTEGER: 210
    UCD-SNMP-MIB::laLoadInt.3 = INTEGER: 77
    UCD-SNMP-MIB::laLoadFloat.1 = Opaque: Float: 5.310000
    UCD-SNMP-MIB::laLoadFloat.2 = Opaque: Float: 2.110000
    UCD-SNMP-MIB::laLoadFloat.3 = Opaque: Float: 0.770000
    UCD-SNMP-MIB::laErrorFlag.1 = INTEGER: 1
    UCD-SNMP-MIB::laErrorFlag.2 = INTEGER: 0
    UCD-SNMP-MIB::laErrorFlag.3 = INTEGER: 0
    UCD-SNMP-MIB::laErrMessage.1 = STRING: 1 min Load Average too high (= 5.31)
    UCD-SNMP-MIB::laErrMessage.2 = STRING:
    UCD-SNMP-MIB::laErrMessage.3 = STRING:

laLoadInt.1 gives the one-minute average value for the system load as an integer, laLoad.1 as a string, and laLoadFloat.1 as a floating-point decimal. laErrorFlag.1 contains the corresponding error status, laErrMessage.1 the corresponding error message. The same applies for the other two averages.
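The relationship between the configured limits and the error flags can be checked with a few lines of code. The following Python sketch compares the three load averages from the listing above with limits of 5, 3, and 2 for the 1-, 5-, and 15-minute averages. The function name is an assumption for this illustration, and treating a limit as violated only when the load is strictly greater is also an assumption, not something the text specifies.

```python
def load_error_flags(loads, limits):
    """Return 1 for each load average above its limit, otherwise 0."""
    return [1 if load > limit else 0 for load, limit in zip(loads, limits)]


if __name__ == "__main__":
    loads = [5.31, 2.11, 0.77]   # laLoad.1 to laLoad.3 from the listing
    limits = [5.0, 3.0, 2.0]     # limits for 1, 5, and 15 minutes
    print(load_error_flags(loads, limits))
```

Only the one-minute average exceeds its limit here, which agrees with the laErrorFlag values shown in the listing.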
You can also use the check_snmp plugin here to query the floating-point decimal values just as accurately, and specify limit values centrally.

11.3 Nagios's Own SNMP Plugins

Among the standard Nagios plugins there are three programs with which data can be obtained via SNMP: a generic plugin that queries any OIDs you want, and two Perl scripts that are specialized in interface data of network cards and the ports of switches, routers, and so forth. In addition to this, the directory contrib contains the source code of other SNMP plugins that are not automatically installed. Apparently these are no longer maintained and cannot run without major adjustments to the code.

http://www.nagiosexchange.org/ also provides some useful specialized plugins, some of which are introduced in Section 11.4 from page 205. The following descriptions are limited, for reasons of space, to SNMPv1/2 queries; for SNMPv3-specific options, we refer you to the online help for the corresponding plugin.

11.3.1 The generic SNMP plugin check_snmp

With check_snmp a generic plugin is available that queries any information available via SNMP, according to your requirements. However, its operation does require a degree of care, since as a generic plugin it has no idea specifically what data it is querying. For this reason as well, its output looks quite meager; specialized plugins provide more convenience here. But since these don't exist for every purpose, check_snmp still has its place.

It calls the program snmpget, which means that the NET-SNMP tools must be installed. It provides the following options:

-H address / --host=address
    This is the host name or IP address of the SNMP agent to be queried.

-o OID / --oid=OID
    This is the object identifier to be queried, either as a complete numerical OID or as a string, which is interpreted by snmpget (e.g., system.sysName.0). Attention: in contrast to snmpwalk, you must always specify the end nodes containing the information.

-p port / --port=port
    This is the alternative port on which the SNMP agent is running. The default is UDP port 161.
-C password / --community=password
    This is the community string for read access. The default value is public.

-w start:end / --warning=start:end
    If the queried value lies within the range specified by start and end, check_snmp does not give out a warning. For -w 0:90 it must therefore lie between 0 and 90.

-c start:end / --critical=start:end
    If the queried value lies outside the range, the plugin gives out CRITICAL. If the warning and critical limits overlap, the critical limit always has priority.

-s string / --string=string
    The contents of the queried OID must correspond exactly to the specified string, otherwise check_snmp will give out an error.

-r regexp / --ereg=regexp
    This option checks the contents of the queried OID to see whether the regular expression regexp [11] is matched. If this is the case, the plugin returns OK, otherwise CRITICAL.

-R regexp / --eregi=regexp
    As -r, except that there is no case distinction.

-l prefix / --label=prefix
    A string that is placed in front of the plugin response. The default is SNMP.

-u string / --units=string
    SNMP only has simple values, not units. The string specified here is appended by the plugin to the value in the text output, so that it serves as the unit. Because only text is involved here, you can also specify apples or pears, for example, as "units".

-d delimiter / --delimiter=delimiter
    This character separates the OID in the snmpget output from the value. The default is =.

-D delimiter / --output-delimiter=delimiter
    The plugin is able to query several OIDs simultaneously. The result values are separated with delimiter, which by default is a space.

-m mibs / --miblist=mibs
    This specifies the MIBs that should be loaded for snmpget. The default is ALL. -m +UCD-DEMO-MIB [12] loads the MIB in addition, -m UCD-DEMO-MIB (without the + sign) only loads the specified MIB. [13]

-P version / --protocol=version
    Defines the SNMP protocol version. The values for version are 1 or 3. Without this option, SNMPv1 is used.

[11] POSIX regular expression, see man 7 regex.
[12] UCD-DEMO-MIB is an MIB included for demonstration purposes.
[13] See also the online help, with man snmpcmd.

SNMP provides almost unlimited possibilities, so the following examples can only give an impression of how the plugin can be used.

Testing hard drive capacity via SNMP

The following command queries the load of a file system, and to do this accesses the partial tree ucdavis.dskTable of a locally running NET-SNMP agent:

    nagios@linux:local/libexec$ ./check_snmp -H swobspace -C public \
        -o dskTable.dskEntry.dskPercent.2 -w 0:90 -c 0:95 -u percent
    SNMP WARNING - *95* percent

The query applies to the percentage load of the file system with the index number 2. As long as no more than 90 percent of the hard drive space is occupied, the test returns OK; a warning is returned if it is between 91 and 95 percent, and critical status if it goes beyond this.

Thanks to the -u option, check_snmp adds the description percent to the output of the figure determined.

Nevertheless, the plugin does not tell the whole truth: a test check with df shows a 96 percent load, which comes from the fact that this program correctly rounded up the actual 95.8 percent load, while integer values in SNMP are seldom rounded up, but simply cut off. So you just have to live with slight inaccuracies as long as the MIB does not provide any floating-point decimals.
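The interplay of the two ranges and the cut-off integer value can be reproduced with a short Python sketch. It does not call check_snmp; it only mimics the behaviour described above, and it assumes that both bounds of a start:end range are inclusive (consistent with 90 percent still being reported as OK). The function names are hypothetical.

```python
import math


def in_range(value, spec):
    """True if value lies within 'start:end' (bounds assumed inclusive)."""
    start, end = (float(part) for part in spec.split(":"))
    return start <= value <= end


def classify(value, warn, crit):
    """Check the critical range first, so that it has priority on overlap."""
    if not in_range(value, crit):
        return "CRITICAL"
    if not in_range(value, warn):
        return "WARNING"
    return "OK"


actual = 95.8                     # actual usage in percent
snmp_value = math.trunc(actual)   # integer simply cut off, as in SNMP
df_value = round(actual)          # rounded, as df reports it

print(snmp_value, classify(snmp_value, "0:90", "0:95"))   # 95 WARNING
print(df_value, classify(df_value, "0:90", "0:95"))       # 96 CRITICAL
```

The example shows why the cut-off matters near a threshold: the same file system is reported as WARNING via SNMP, although the rounded df figure would already lie in the critical range.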
If you would like things to be more detailed, you can use the option -l: -l 'SNMP-DISK: /net/swobspace/b' causes other, self-defined information to be added to the output of the above command:

    SNMP-DISK: /net/swobspace/b WARNING - *95* percent

The above query can be run more generally through a command object such as the following:

    define command{
        command_name check_snmp
        command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C $USER3$ \
            -P 1 -o $ARG1$ -w $ARG2$ -c $ARG3$ -l $ARG4$
    }

This definition assumes that the value being queried is numerical, and not Boolean (see page 201), otherwise specifying a warning and critical value simultaneously would make no sense. We store the community here in the macro $USER3$. [14] This is followed by the protocol version (-P 1 stands for SNMPv1), the OID, the warning and critical limits, and a prefix.

[14] The $USERx$ macros are defined in the resource file resource.cfg.

The call for this command in service definitions is then made in the form

    check_snmp!oid!warn!critical!prefix

If you want to specifically monitor the load of the file system with the index number 2 on the computer swobspace through dskTable, then the following definition would be used:

    define service{
        service_description SNMP-DISK-a
        host_name swobspace
        check_command check_snmp!dskTable.dskEntry.dskPercent.2!\
            0:90!0:95!DISK: /net/swobspace/a
        ...
    }

Even though the check_command line is wrapped here, in practice all parameters must be on a single line, separated by an exclamation point ! (without spaces before or after the delimiter).

Measuring temperature via lm-sensors

The next test checks the CPU temperature of the host. For the sensor, the package lm-sensors [15] is used here, which accesses corresponding chips on modern mainboards. As soon as lm-sensors is active, it allows the NET-SNMP agents to read out the corresponding information from the partial tree ucdavis.ucdExperimental.lmSensors:

    nagios@linux:local/libexec$ ./check_snmp -H localhost -C public \
        -o lmTempSensorsValue.1 -w 25000:45000 -c 20000:48000 \
        -u 'degrees Celsius (* 1000)' -l 'Temp1/CPU'
    Temp1/CPU OK - 41000 degrees Celsius (* 1000)

[15] http://www.lm-sensors.nu/

The output depends on the chipset: here the values returned are the temperature multiplied by the factor 1000. Accordingly, you have no alternative but to adjust the warning and critical limits to the mainboard you are using. In the example, the CPU temperature, 41 degrees Celsius, is in the green: if it were to drop below 25 degrees or rise above 45 degrees, it would cause a warning, while below 20 or above 48 degrees would be critical.

Regular expressions and comparing fixed strings

You can check whether the text swobspace occurs in the system name as follows:

    nagios@linux:local/libexec$ ./check_snmp -H localhost -C public \
        -o system.sysName.0 -r swobspace
    SNMP OK - "swobspace"

Instead of defining the string being searched for with -r as a regular expression, you could also use the -s option. Then the text must match exactly, however, which may be quite tricky, since everything counts that snmpget outputs after the delimiter, =.

Monitoring network interfaces

The final example queries whether the first network interface of a Cisco router is in operation:

    nagios@linux:local/libexec$ ./check_snmp -H cisco1 -C public \
        -o ifOperStatus.1 -w 1:1 -l 'SNMP: Port Status for Port 1 is: '
    SNMP: Port Status for Port 1 is: OK - 1

The information sought can be found in ifOperStatus. Here we are querying port 1. While ifOperStatus gives out the operating status, ifAdminStatus reveals whether the interface is administratively switched on or off.

When specifying the warning limit here, we use the range 1:1, so that the plugin gives out a warning if the interface is not in operation and the value returned is therefore something other than 1 (the IF-MIB defines the value 2 for "down").
We will do without the definition of a critical status here, since there are only two states, "on" or "off." If the plugin is to return CRITICAL when the interface is switched off, you should use -c 1:1 and omit -w entirely.

If you just want to query the status of network interfaces, you should certainly take a look at the plugins check_ifstatus and check_ifoperstatus, described below, which provide slightly more operating convenience.

If MIB-II or MIB ucdavis do not provide the desired information, you could also take a look at the MIB provided by the manufacturer. You can find out from mib-2.system in which partial tree the overall MIB is hidden:

    user@linux:~$ snmpwalk -v1 -c public konica01 system
    system.sysDescr.0 = Konica IP Controller
    system.sysObjectID.0 = OID: enterprises.2364
    ...

The example involves a network-capable Konica photocopying machine called konica01. system.sysObjectID.0 reveals that enterprises.2364 serves as the entry point for device-specific details. With snmpwalk you can then obtain further information:

    user@linux:~$ snmpwalk -v1 -c public konica01 enterprises.2364
    ...
    enterprises.2364.1.2.6.1.1.5.1.1 = "Ready to Print"
    ...

In the concrete case of this photocopier, you can query the current device status through enterprises.2364.1.2.6.1.1.5.1.1. Manufacturers usually provide information on the MIBs they implement, so that you are not restricted to just guessing.

11.3.2 Checking several interfaces simultaneously

Active network components such as switches usually have quite a large number of ports, and it would be very time-consuming to check every single one of them. Here the check_ifstatus plugin is very useful, since it tests all ports simultaneously. It retrieves the information necessary for this via SNMP, and has the following options:

-H address / --host=address
    This is the host name or IP address of the SNMP agent to be queried.

-C password / --community=password
    This sets the community string for read access.
-p port / --port=port
    This parameter is the alternative port on which the SNMP agent is running. The default is UDP port 161.

-v version / --snmp_version=version
    This parameter specifies the SNMP version (1, 2, or 3) for the query.

-x list / --exclude=list
    Use this to specify a comma-separated list of interface types that should not be queried (see example below).

-u list / --unused_ports=list
    Use this to specify a comma-separated list of all ports that should be excluded from the test. Unlike -x, the list consists of the indices of the interfaces, which are determined from ifIndex: -u 13,14,15,16.

-M bytes / --maxmsgsize=bytes
    This is the maximum size of the SNMP data packets; the default is 1472 bytes.

With exclusion lists it is possible to exclude certain interface types or port numbers from the test, perhaps because these are not occupied, or are connected to PCs or other devices that are not always running.

With the following query we can find out, for example, which interface types are gathered together on the Cisco switch here named cisco01:

    user@linux:~$ snmpwalk -v1 -c public cisco01 ifType
    ...
    interfaces.ifTable.ifEntry.ifType.12 = ethernetCsmacd(6)
    interfaces.ifTable.ifEntry.ifType.13 = other(1)
    interfaces.ifTable.ifEntry.ifType.14 = propVirtual(53)
    ...

If the interface types other(1) and propVirtual(53) should now be excluded, the plugin is sent off with the two figures, separated by a comma, as the exclusion list -x 1,53:

    nagios@linux:local/libexec$ ./check_ifstatus -C public -H cisco01 \
        -x 1,53
    CRITICAL: host 'cisco01', interfaces up: 2, down: 10, dormant: 0,
    excluded: 4, unused: 0<br>
    GigabitEthernet0/2: down<br>
    GigabitEthernet0/3: down<br>
    GigabitEthernet0/4: down<br>
    GigabitEthernet0/10: down<br>
    GigabitEthernet0/5: down<br>
    GigabitEthernet0/11: down<br>
    GigabitEthernet0/6: down<br>
    GigabitEthernet0/7: down<br>
    GigabitEthernet0/8: down<br>
    GigabitEthernet0/9: down<br>
    |up=2,down=10,dormant=0,excluded=4,unused=0

In reality, this plugin also does not display its output over several lines, as the line wrap here may suggest. The fact that this information appears on the Nagios Web interface in a relatively clear form is because the HTML formatting element <br> is thrown in. This causes the output for each port to be displayed on a separate line. The | character defines the beginning of the performance data, which does not appear at all in the Web interface.

A query of this type is implemented as a command object as follows:

    define command{
        command_name check_ifstatus
        command_line $USER1$/check_ifstatus -H $HOSTADDRESS$ \
            -C $USER3$ -x $ARG1$
    }

Here the macro $USER3$ is also used to define the community string in the file resource.cfg. Altogether, 32 $USERx$ macros are available, of which the first two usually contain path details, and the others can be used in any way you want.

If you would prefer to exclude ports rather than interface types, you can use the -u option instead of -x in the definition.

If Nagios is to monitor the switch cisco01, as shown above, excluding the two interface types 1 and 53, the corresponding service definition begins as follows:

    define service{
        service_description Interfaces
        host_name cisco01
        check_command check_ifstatus!1,53
        ...
    }

11.3.3 Testing the operating status of individual interfaces

To test an individual interface, you can use either the generic plugin check_snmp or check_ifoperstatus, which specifically tests the operating status (ifOperStatus) of the network card. The advantage of this over the generic plugin consists above all in its ease of use: instead of an index for the port, you can also specify its description here, for example, eth0.

check_ifoperstatus has the following options:

-H address / --host=address
    This is the host name or IP address of the SNMP agent to be queried.

-C password / --community=password
    This parameter gives the community string for read access.

-p port / --port=port
    As long as the SNMP agent is not running on UDP port 161, the port is specified with this option.

-k ifIndex / --key=ifIndex
    ifIndex is the number of the network interface to be queried (such as the network card of a computer or the port of a switch).
-d ifDescr / --descr=ifDescr
    Instead of the index key, the plugin processes the name of the interface from ifDescr (see below).

-v version / --snmp_version=version
    This specifies the SNMP version (1, 2, or 3) for the query.

-w return value / --warn=return value
    This option selects the return value if the interface is dormant. The return value can be i (ignore the dormant status and return OK), w (WARNING), or c (CRITICAL, the default).

-D return value / --admin-down=return value
    What value (i, w, or c) should the plugin return if the interface has been shut down administratively? The default, w, issues a warning, c returns CRITICAL, and i returns OK.

-M bytes / --maxmsgsize=bytes
    This is the maximum size of the SNMP data packets; the default is 1472 bytes.

Suppose snmpwalk finds the following interfaces on a system called igate:

    ...
    interfaces.ifTable.ifEntry.ifDescr.3 = ipsec0
    interfaces.ifTable.ifEntry.ifDescr.4 = ipsec1
    ...
    interfaces.ifTable.ifEntry.ifDescr.7 = eth0
    interfaces.ifTable.ifEntry.ifDescr.8 = eth1
    interfaces.ifTable.ifEntry.ifDescr.9 = eth2
    interfaces.ifTable.ifEntry.ifDescr.10 = ppp0

Here the first Ethernet card is tested either with -k 7 or with -d eth0. Since the plugin in the second case has to query all ifDescr entries to determine the index itself, this variation generates a somewhat higher network load. It can be especially useful if not all network interfaces are active on a host, causing the index to change.

The plugin itself reveals which index this port currently has:

    nagios@linux:local/libexec$ ./check_ifoperstatus -H igate -C public \
        -d eth0
    OK: Interface eth0 (index 7) is up.
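What the -d variant has to do internally, namely walk through all ifDescr entries until the matching name is found, can be pictured with the following Python sketch. It is not the plugin's code (check_ifoperstatus is a Perl script that talks SNMP directly); it merely parses the textual snmpwalk lines quoted above, and the function name is made up.

```python
# ifDescr lines as printed by snmpwalk for the host igate (see above).
WALK_OUTPUT = """\
interfaces.ifTable.ifEntry.ifDescr.3 = ipsec0
interfaces.ifTable.ifEntry.ifDescr.4 = ipsec1
interfaces.ifTable.ifEntry.ifDescr.7 = eth0
interfaces.ifTable.ifEntry.ifDescr.8 = eth1
interfaces.ifTable.ifEntry.ifDescr.9 = eth2
interfaces.ifTable.ifEntry.ifDescr.10 = ppp0
"""


def index_for(descr, walk_output):
    """Return the ifIndex whose ifDescr equals descr, or None if absent."""
    for line in walk_output.splitlines():
        oid, _, value = line.partition(" = ")
        if value.strip() == descr:
            # The index is the last component of the OID.
            return int(oid.rsplit(".", 1)[1])
    return None


print(index_for("eth0", WALK_OUTPUT))   # 7
```

Because every entry may have to be inspected before the name is found, the lookup costs more queries than addressing the interface directly with -k.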
As the command object in the Nagios configuration, the call looks like this:

    define command{
        command_name check_ifoperstatus
        command_line $USER1$/check_ifoperstatus -H $HOSTADDRESS$ \
            -C $USER3$ -d $ARG1$
    }

The $USER3$ macro again contains the community string, defined in the file resource.cfg. The service definition for igate specifies the name of the interface to be tested as a plugin argument:

    define service{
        service_description Interface eth0
        host_name igate
        check_command check_ifoperstatus!eth0
        ...
    }

11.4 Other SNMP-based Plugins

Apart from the SNMP plugins from the Nagios Plugin package, the Nagios community provides a large variety of other plugins for special purposes. Most of them can be found at http://www.nagiosexchange.org/ in the category Check Plugins → SNMP. [16]

11.4.1 Monitoring hard drive space and processes with nagios-snmp-plugins

One of these is the package nagios-snmp-plugins, [17] which exists not only as source code but also as an RPM package (for Red Hat and Fedora). It contains two very easy-to-use plugins: check_snmp_disk and check_snmp_proc.

[16] http://www.nagiosexchange.org/SNMP.51.0.html
[17] ftp://ftp.hometree.net/pub/nagios-snmp-plugins/

Both absolutely require the NET-SNMP agent as the partner on the other side (see Section 11.2.2 from page 187) and use ucdavis.dskTable and ucdavis.prTable to test the processes and file systems specified in the configuration file snmpd.conf. Their options are restricted to specifying the host and the community string:

-H address / --host=address
    This is the host name or IP address of the NET-SNMP agent to be queried.

-C password / --community=password
    This is the community string for read access.
The next example tests the available capacity of the /data file system; public is again used as the community string:

    nagios@linux:local/libexec$ ./check_snmp_disk -H swobspace -C public
    /data: less than 50% free (= 95%) (/dev/md6)

The configuration of the NET-SNMP agent specifies, with the disk directive (page 194), 50% as the threshold for this file system. In this case the plugin accordingly returns CRITICAL. It can only distinguish between an error and OK; it does not have a WARNING status.

Using check_snmp_proc is just as easy:

    nagios@linux:local/libexec$ ./check_snmp_proc -H localhost -C public
    No slapd process running.

The plugin again tests the processes defined in the configuration of the NET-SNMP agent with the proc directive (page 193). The process slapd is missing here, which is why CRITICAL is returned. The return value is revealed by echo $?.

The corresponding command objects are defined in a similarly unspectacular way:

    define command{
        command_name check_snmp_proc
        command_line $USER1$/check_snmp_proc -H $HOSTADDRESS$ -C $USER3$
    }

    define command{
        command_name check_snmp_disk
        command_line $USER1$/check_snmp_disk -H $HOSTADDRESS$ -C $USER3$
    }

This definition also assumes that the community string is stored in the $USER3$ macro in the file resource.cfg. In order to query the NET-SNMP agent on the computer linux01 for its hard drive load, the following service object is defined:

    define service{
        service_description DISK
        host_name linux01
        check_command check_snmp_disk
        ...
    }

11.4.2 Observing the load on network interfaces with check-iftraffic

The MIB-II contains only counter values that provide information on the load on network interfaces, but no average values for the used bandwidth, for example. If the vendor has not specifically made such an entry available in his MIB, then you will always have to make a note of the last counter status and the timestamp, so that you can work out the relative usage yourself.

http://www.nagiosexchange.org/ introduces two plugins that take over this task.
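The calculation that has to be done with two counter readings can be sketched in a few lines of Python. This is an illustration of the principle only, not the code of either plugin; the sample numbers are invented, and the handling of a counter that has wrapped around once (MIB-II octet counters are 32 bits wide) is an assumption added for completeness.

```python
def usage_percent(prev_count, prev_time, cur_count, cur_time,
                  bandwidth_bits, counter_bits=32):
    """Average utilisation in percent between two octet-counter samples.

    prev_time and cur_time are timestamps in seconds; bandwidth_bits is
    the nominal interface speed in bits per second.
    """
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    delta = cur_count - prev_count
    if delta < 0:
        # Assumption: the counter wrapped around exactly once.
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / elapsed
    return 100.0 * bits_per_second / bandwidth_bits


# Hypothetical samples: 37,500,000 octets in 300 s on a 100 Mbit/s link.
print(usage_percent(1_000_000, 0, 38_500_000, 300, 100_000_000))   # 1.0
```

The sketch also makes clear why both the previous counter value and its timestamp must be stored between two checks: without them no rate can be derived from the absolute counters.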
The Perl-based plugin check_traffic [18] writes the query values into a round-robin database (RRD, see page 317), which makes it somewhat more complex to handle. The same purpose is achieved, but with simpler means, by the check_iftraffic.pl plugin. [19] It has the following options:

-H address / --host=address
    address is the host name or IP address of the NET-SNMP agent that is to be queried.

-C password / --community=password
    password is the community string for read access. The default is public.

-i ifDescr / --interface=ifDescr
    From the interface name ifDescr the plugin determines the index so that it can access other values (e.g., the counter states).

-b integer / --bandwith=integer
    This is the maximum bandwidth of the interface in bits (see -u).

-u unit / --units=unit
    This is the unit for bandwidth specification with -b. Possible values are g (Gbit), m (Mbit), k (kbit), and the default b (bit): -b 100 -u m corresponds to 100 Megabits (Fast Ethernet).

-w integer / --warning=integer
    If traffic exceeds this warning limit in percent (default: 85 percent), the plugin issues a WARNING.

-c integer / --critical=integer
    This is the critical threshold in percent (default: 98 percent).

[18] http://nagios.sourceforge.net/download/contrib/misc/check_traffic/
[19] http://www.nagiosexchange.org/SNMP.51.0.html?&tx_netnagext_pi1[p_view]=37

The plugin saves the timestamp and counter status of the interface queried in files in /tmp, to which it adds the prefix traffic. So if you are using a different user ID than nagios for the manual test on the command line, you should delete the files /tmp/traffic_interface_computer before activating the appropriate Nagios service.

The following command line example queries the Fast Ethernet network interface eth0 on the computer linux01, which in theory has a bandwidth of 100 MBit:

    nagios@linux:local/libexec$ ./check_iftraffic.pl -H linux01 -i eth0 \
        -b 100 -u m
    Total RX Bytes: 60.32 MB, Total TX Bytes: 26.59 MB<br>
    Average Traffic: 1.14 kB/s (0.0%) in, 777.93 B/s (0.0%) out
    | inUsage=0.0,85,98 outUsage=0.0,85,98

The amount of data transmitted here is reported separately by the plugin, depending on the direction, and here it announces 60.32 (RX, "received") and 26.59 MBytes (TX, "transmitted"). The text contains the HTML element <br> (line break), to display the output in the Nagios Web interface on two lines. This is followed by the average transmission rate, again separated for incoming and outgoing data traffic.

The performance data (see Section 17.1, page 314 ff.) after the | sign contain only the average load as a percentage, separated into incoming and outgoing values. The numbers 85 and 98 are the default values for the warning and critical limits.

The corresponding command object is implemented as follows:

    define command{
        command_name check_iftraffic
        command_line $USER1$/check_iftraffic.pl -H $HOSTADDRESS$ \
            -C $USER3$ -i $ARG1$ -b $ARG2$ -u m
    }

If the definition is taken over literally, you must define the community string in the $USER3$ macro. If you only ever use public as the password, it is better to write -C public instead of -C $USER3$.

To simplify the call of the command within the following service definition, we set the unit to MBit/second (-u m).

    define service{
        service_description Traffic load eth0
        host_name linux01
        check_command check_iftraffic!eth0!100
        ...
        max_check_attempts 1
        normal_check_interval 5
        retry_check_interval 5
        ...
    }

check_iftraffic calculates the bandwidth used by comparing two counter states at different times. Because Nagios does not test exactly down to the second, the check interval you choose should not be too small. The Multi Router Traffic Grapher, [20] which displays the bandwidth used in graphic form, normally works at five-minute intervals.

[20] http://www.mrtg.org/

If you select a value for max_check_attempts other than 1, you should make sure that the retry interval (retry_check_interval) is the same as the normal check interval. For max_check_attempts 1 this makes no difference, but you have to define a retry_check_interval at some time or other.
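How the arguments of a check_command entry such as check_iftraffic!eth0!100 end up in the command line can be pictured with the following Python sketch. It is a simplified model of the macro substitution, not Nagios code; the values chosen for $USER1$, $USER3$, and $HOSTADDRESS$ are hypothetical and would normally come from resource.cfg and the host definition.

```python
import re


def expand(command_line, check_command, macros):
    """Split 'name!arg1!arg2' and substitute $ARGn$ and the given macros."""
    name, *args = check_command.split("!")
    values = dict(macros)
    for number, arg in enumerate(args, start=1):
        values[f"ARG{number}"] = arg
    # Replace every $NAME$ with a known value; unknown macros stay as is.
    expanded = re.sub(r"\$(\w+)\$",
                      lambda m: values.get(m.group(1), m.group(0)),
                      command_line)
    return name, expanded


COMMAND_LINE = ("$USER1$/check_iftraffic.pl -H $HOSTADDRESS$ "
                "-C $USER3$ -i $ARG1$ -b $ARG2$ -u m")

# Hypothetical macro values, chosen only for this illustration.
MACROS = {"USER1": "/usr/local/nagios/libexec",
          "USER3": "public",
          "HOSTADDRESS": "192.0.2.10"}

name, line = expand(COMMAND_LINE, "check_iftraffic!eth0!100", MACROS)
print(line)
```

The first field before the exclamation point selects the command object; the remaining fields are assigned to $ARG1$, $ARG2$, and so on in the order in which they appear.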
11.4.3 The manubulon.com plugins for special application purposes

The Nagios Exchange, with the SNMP plugins to be found under http://www.manubulon.com/nagios/ (see Table 11.4), also includes some that are customized to a specific application, such as querying hard drive space. They are relatively simple to use.

Two of the plugins, check_snmp_storage.pl and check_snmp_load.pl, are introduced here in detail.

Table 11.4: The manubulon.com SNMP plugins

    Plugin                   Description
    check_snmp_storage.pl    Query of storage devices (hard drives, swap space, main memory, etc.)
    check_snmp_int.pl        Interface status and load
    check_snmp_process.pl    Processes: status, CPU and memory usage
    check_snmp_load.pl       System load
    check_snmp_mem.pl        Main memory and swap usage
    check_snmp_vrrp.pl       Querying a Nokia VRRP cluster [21]
    check_snmp_cpfw.pl       Querying a Checkpoint Firewall-1 [22]

[21] The abbreviation VRRP stands for Virtual Router Redundancy Protocol.
[22] http://www.checkpoint.com/products/firewall-1/

Keeping checks on storage media with check_snmp_storage

While the check_snmp_disk plugin, introduced in Section 11.4.1 from page 205, only checks the file systems entered in the NET-SNMP configuration, check_snmp_storage.pl is capable of querying any storage media (even swap space or main memory) without previous configuration on the target host. check_snmp_storage.pl tests the partial tree mib-2.host here, while check_snmp_mem.pl uses ucdavis.memory, so that it remains restricted to NET-SNMP.

The fact that you do not have to battle with OIDs, but instead can work with descriptions such as Swap Space to specify the type of the storage medium, provides a certain level of convenience. These can be queried with snmpwalk as follows:

    user@linux:~$ snmpwalk -v1 -c public swobspace hrStorageDescr
    hrStorageDescr.2 = STRING: Real Memory
    hrStorageDescr.3 = STRING: Swap Space
    hrStorageDescr.4 = STRING: /
    ...
    hrStorageDescr.11 = STRING: /net/swobspace/b

When the plugin is called, the text specified after STRING: is sufficient or, if unique, a part of this:

    nagios@linux:local/libexec$ ./check_snmp_storage.pl -H swobspace \
        -C public -m /net/swobspace/b -w 90 -c 95
    /net/swobspace/b : 91 %used (34842MB/38451MB) (< 90) : WARNING
    nagios@linux:local/libexec$ ./check_snmp_storage.pl -H swobspace \
        -C public -m "Swap" -w 50 -c 75 -f
    Swap Space : 0 %used (0MB/3906MB) (< 50) : OK | 'Swap Space'=0MB;1953;2930;0;3906

In the second example, it is sufficient to specify Swap in order to query the data for Swap Space, since the pattern is unique. The -f option ensures that check_snmp_storage.pl will include performance data in its output.

-w and -c specify in the normal fashion the warning or critical limits in percent of the available memory space. The following overview lists all the options:

-H address / --host=address
    This is the host name or IP address of the NET-SNMP agent that is to be queried.

-C string / --community=string
    This is the community string for read access.

-p port / --port=port
    port specifies an alternative port if the SNMP agent is not running on the default UDP port 161.

-m string / --name=string
    string contains a description of the device to be queried, corresponding to its description in hrStorageDescr (see above), such as -m "Swap Space" for swap devices, -m "Real Memory" for the main memory, or -m "/usr" for the partition mounted under /usr in the file tree.

-w percent / --warn=percent
    A warning is given by default if the proportion of used memory is larger than the specified threshold. Other warning limits can be defined with the -T parameter.

-c crit / --critical=crit
    By default, the status is categorized as critical if the proportion of used memory is larger than the specified critical limit. Other critical limits can also be specified with the -T parameter.
-T option / --type=option
    Selection options for specifying the critical and warning limits:

    pu (percent used): used capacity in percent
    pl (percent left): free capacity in percent
    bu (bytes used): used capacity in megabytes
    bl (bytes left): free capacity in megabytes

    The default is -T pu.

-r / --noregexp
    Normally the description in the -m parameter is treated as a regular expression. For example, /var here stands for all file systems containing /var, for example /var and /var/spool/imap, provided that these are really two independent file systems. The -r option switches off the regular expression capability, so that specifying /var will then match this file system exactly, but not /var/spool/imap, for example.

-s / --sum
    Instead of individual tests for several named memories, these are first added together (used space and overall capacity), and only then is the test performed on the limit values.

-i / --index
    With -m, a text is normally specified which turns up again in the description hrStorageDescr. With the -i option, the index table is used instead of the description. Here the regexp capability also applies: -m 2 matches all the entries containing the number 2 in the index (that is, 2, 12, 20, etc.). It then makes sense to use the -r option at the same time.

-e / --exclude
    Now all the memories that are matched by the -m specification are excluded from the test; the remaining ones are included in the test.

-f / --perfparse
    This option provides an additional output of performance data that is not shown in the Web interface but can be evaluated by additional tools (see Chapter 17).

Testing system load with check_snmp_load

The plugin checks either the average system load against the usual specification of three averages of 1 min, 5 min, and 15 min, or the CPU load in percent.

-H address / --host=address
    This is the host name or IP address of the NET-SNMP agent to be queried.

-C string / --community=string
    This is the community string for read access.
-p port / --port=port
    port is the alternative UDP port on which the SNMP agent is running. The default is UDP port 161.

-w warning limit / --warn=warning limit
    The warning limit is given either as a simple integer value in percent (e.g., 90) or as an integer triplet separated by commas, which defines the thresholds for the system load average for one, five, and 15 minutes (e.g., 8,5,5). The percentage load, on the other hand, always refers to the CPU load of the last minute. If the plugin queries a NET-SNMP agent, you must additionally specify the -L option for the second variation (the triplet), and -N for the percentage.

-c critical limit / --crit=critical limit
    This specifies a critical limit; the syntax is the same as that for -w.

-L / --linux
    This option specifies that the plugin queries the system load of a Linux system via NET-SNMP.

-A / --as400
    This option specifies that the CPU load on an AS/400 machine is queried.

-I / --cisco
    This option specifies that the CPU load of a Cisco network component is involved.

-N / --netsnmp
    If the plugin queries the percentage CPU load of a Linux system via NET-SNMP, the -N option must be specified.

-f / --perfparse
    This option ensures the output of performance data that is not displayed in the Web interface, but can be evaluated by additional tools (see Chapter 17).

The following example queries the system load on the computer swobspace via NET-SNMP and specifies threshold values for the one-, five-, and fifteen-minute averages:

    nagios@linux:local/libexec$ ./check_snmp_load.pl -H swobspace \
        -C public -w 1,2,3 -c 3,5,6 -L
    Load : 0.05 0.07 0.06 : OK
    nagios@linux:local/libexec$ ./check_snmp_load.pl -H swobspace \
        -C public -N -w 80 -c 90 -f
    CPU used 3.0 : < 80 : OK | cpu_prct_used=3%;80;90

The second example involves the percentage CPU load on the same machine. Here we additionally request performance data, which as usual repeats not only the measured value but also the thresholds.
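The performance data that the manubulon.com plugins append after the | sign follow the pattern label=value[unit];warn;crit;min;max, as can be seen in the two outputs 'Swap Space'=0MB;1953;2930;0;3906 and cpu_prct_used=3%;80;90. The following Python sketch shows how a tool could take such a string apart. It is a hypothetical helper, not part of Nagios or of the plugins, and it only covers the simple cases shown in this chapter.

```python
import re

# label (optionally in single quotes) = number, unit, then ';'-separated fields
PERF_ITEM = re.compile(
    r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)([^;\s]*)((?:;[^;\s]*)*)"
)


def parse_perfdata(plugin_output):
    """Return {label: fields} for the performance data after the '|' sign."""
    _, _, perf = plugin_output.partition("|")
    result = {}
    for quoted, plain, value, unit, rest in PERF_ITEM.findall(perf):
        fields = rest.split(";")[1:] if rest else []
        fields += [""] * (4 - len(fields))     # pad missing trailing fields
        warn, crit, minimum, maximum = fields[:4]
        result[quoted or plain] = {
            "value": float(value), "unit": unit,
            "warn": warn, "crit": crit, "min": minimum, "max": maximum,
        }
    return result


cpu = parse_perfdata("CPU used 3.0 : < 80 : OK | cpu_prct_used=3%;80;90")
swap = parse_perfdata("Swap Space : 0 %used (0MB/3906MB) (< 50) : OK "
                      "| 'Swap Space'=0MB;1953;2930;0;3906")
print(cpu["cpu_prct_used"]["warn"], swap["Swap Space"]["max"])   # 80 3906
```

Fields that a plugin leaves out, such as the minimum and maximum in the CPU example, are returned as empty strings here rather than guessed.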
12 The Nagios Notification System

What would be the point of system and network monitoring if it did not inform the right contact partner when things went wrong? Hardly any system or network administrator can afford to keep an eye on the Nagios Web interface continually and wait for changes in status to occur. A practical working system must inform the admin actively (push information), so that the admin has time to devote to other things and needs to intervene only when Nagios raises the alarm.

Whether a notification system does its job in practice or not is ultimately decided by how well it can be adjusted to the requirements of a specific situation. What may already be a critical error for one person may, for another, not be normal but still tolerable, and nothing is worse than being bombarded with supposed error messages that are not even seen as errors in a certain environment. An excess of wrong information can make the administrator careless, and at some point the real problems get lost in a flood of false messages.

Nagios provides a sophisticated notification system allowing your own environment to be fine-tuned to your own requirements. The wide range of settings at first seems confusing, but once you have understood the basic principle, everything becomes much clearer.

The efforts to keep Nagios small and modular also apply to the notification system: sending a message is again left by the system to external programs. From a simple e-mail through SMS, down to hardware solutions such as a real traffic light on the server cabinet, anything is possible.

12.1 Who Should be Informed of What, When?

In order for Nagios to send meaningful messages, the administrator must answer four questions:

    When should the system generate a message?
    When should it be delivered?
    Whom should the system inform?
    How should the message be sent?

Figure 12.1: An overview of the notification system

Figure 12.1 gives a rough outline of the concept. The service and host check generate the message, which then runs through various filters, [1] which usually refer to the time.
The contact refers to the person whom Nagios should inform. If the message has passed all tests, the system hands it to an external program, which informs the respective contact.

[1] Strictly speaking, filters defined in the host or service prevent a message from being created, instead of filtering already generated messages. To keep things simple, however, we pretend that Nagios has created a message that is then discarded by a corresponding filter.

12.2 When Does a Message Occur?

Each message is preceded by a host or service check, which determines the current status. In the following two cases it generates a message:

  - One hard state changes to another hard state.
  - A computer or service remains in a hard error state. (The test therefore confirms a problem that already exists.)

To remind you: the max_check_attempts parameter (see Sections 2.3 and 2.5) defines in host and service objects how often a test should be repeated before Nagios categorizes a new status as "hard." If it is set to 1, this is immediately the case and is followed by the corresponding message. With a value greater than 1, the system repeats the test that number of times, and only if they all come to the same new result (such as determining the CRITICAL error status) does the status finally change to the new hard state, thus triggering a new notification.

As long as Nagios has not exhausted the specified number of repeats, a soft state exists. If the old status reoccurs before these have finished, the administrator remains uninformed unless he looks at the Web interface or in the log file. Ultimately the administrator is only interested in genuine unsolved problems. On the other hand, to assess availability as such, it normally does matter if a service is not available for minutes on end, which is why the soft states are also taken into account in the evaluation.
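The interplay of soft and hard states can be sketched in a few lines of Python. This is only a simplified model of the behavior described above, not the Nagios implementation; the class ServiceState and its method record_check are hypothetical names, and the sketch covers only the first case (a change from one hard state to another), not repeated messages for a persisting error.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    max_check_attempts: int
    hard_state: str = "OK"
    pending: str = "OK"   # new result currently being confirmed (soft state)
    attempts: int = 0     # consecutive checks that returned `pending`

    def record_check(self, result: str) -> bool:
        """Feed one check result; return True if a new hard state was reached."""
        if result == self.hard_state:
            # The old status reoccurred: the soft state vanishes silently.
            self.pending = self.hard_state
            self.attempts = 0
            return False
        if result != self.pending:
            # A different new result starts a fresh series of repeats.
            self.pending = result
            self.attempts = 0
        self.attempts += 1
        if self.attempts >= self.max_check_attempts:
            self.hard_state = result
            self.attempts = 0
            return True   # hard state change: a message is generated
        return False      # still a soft state: no message


if __name__ == "__main__":
    service = ServiceState(max_check_attempts=3)
    for result in ["CRITICAL", "CRITICAL", "OK", "CRITICAL", "CRITICAL", "CRITICAL"]:
        print(result, service.record_check(result))
```

In this run the first two CRITICAL results remain soft and are forgotten when OK returns; only the three consecutive CRITICAL results at the end produce a hard state change.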
12.3 The Message Filter

Even if you define on a systemwide basis that Nagios may draw attention to errors not just through the Web interface and log files but also via e-mail and/or SMS, filter parameters in the host and service definition may in individual cases cancel out these basic decisions. In all cases the final word is had by the filters defined for the relevant contact. Which parameters play a role on each of these three levels (systemwide, host/service, contact) is shown in Figure 12.2.

If a filter stops a notification, the filter chain ends "in a vacuum," so to speak: filter options further down in the hierarchy remain unaccounted for, and Nagios does not generate any message.

Figure 12.2: Sequence of filters in the Nagios notification system

12.3.1 Switching messages on and off systemwide

With the enable_notifications parameter in the central configuration file nagios.cfg, you can in principle define whether Nagios should send messages at all. Only if it is set to 1 will the notification system work:

    enable_notifications=1

12.3.2 Enabling and suppressing computer and service-related messages

When defining a host or service, various parameters can influence the messaging system. Here you can define, for example, at what time Nagios should send messages, whether the contact person is regularly informed of error states, and about which states or changes in state he should be informed (just CRITICAL, or WARNING as well, etc.).

The switch notifications_enabled determines whether this specific computer or service is important enough for the admin to be informed of errors not just through the Web interface, but in other ways as well. If this is so, the parameter must be set to 1:

    notifications_enabled=1

This is also the default, so you have to set the value explicitly to 0 at this point to stop separate notifications.

Taking downtimes into account

At times when a specific service or host is intentionally not available, Nagios should certainly not send any error messages through the network.
The configuration of corresponding maintenance periods (downtime scheduling) is only possible through the Web interface and is described in Section 16.3 from page 304.

What states and changes of state are worth a notification?

If a regular test shows that a service or computer is changing its state continuously, this is called flapping in Nagios (see also Appendix A from page 401). If the flap_detection_enabled parameter is set to 1, the system tries to detect this situation.

Whether Nagios sends a message in this case depends on the notification_options filter. This decides on which states or changes of state Nagios will inform the contact involved. In host definitions it accepts any combination of the following values, separated by commas: d (switched off or crashed, down), u (unreachable), r (computer again reachable, recovered), and f (quickly alternating state, flapping).

For service objects, notification_options recognizes the following states: c (CRITICAL), w (WARNING), u (UNKNOWN, unknown problem), r (service again reachable, recovered), and f (flapping). Nagios correspondingly informs the admin of the state of the service whose definition contains the line

    notification_options=c,r

only if the service is in the CRITICAL state or has recovered after an error state. Messages involving a WARNING or flapping are discarded by the system. If notification_options is set to n (none), Nagios will not send any message at all concerning this computer or service.

When should Nagios send messages?

At what time should a message be sent? This can be defined with the notification_period parameter:

    notification_period=24x7h

notification_period expects a time object (see Section 2.10 from page 54) as the value; 24x7h is such a value and stands for "round the clock." Outside the specified time period, Nagios suppresses possible messages, but does not simply discard them, in contrast to the other filters. Instead of this, the system places the message in a kind of queue and sends it as soon as the notification period begins (rescheduling).
This means that the relevant contact will certainly get to hear about the problem. Nagios also ensures that the admin receives the message only once, even if multiple messages on the same event were generated outside the time period.

notification_period is the only time-controlled filter in which a message is not lost, despite filtering. With all the other time filters, the message never reaches its destination outside the specified period of time.

With an interval check, Nagios can be instructed to report at regular intervals on problems that persist for a longer time:

    notification_interval=120

If a state that Nagios should normally report according to the notification_options parameter (CRITICAL, for example) persists for a long time, the system reports it again, in the example every 120 time units (normally, minutes). In other words, once a notification has been sent, Nagios suppresses the notification that would otherwise be generated with every check until the specified time has elapsed. If nothing has changed in the state until then, it sends the corresponding notification again.

If you set notification_interval to 0, Nagios will send a notification of this only once. You should be careful when doing this, however: filters defined for the contact can also reject messages. If you normally generate just one single message, which might arrive at the relevant admin outside the admin's chosen contact time period, then the admin will never be told anything about the problem, even if it persists into working hours.

Whose concern is the message?
The contact group defined in the host or service object does not itself belong to the message filters, but it still decides on who is informed and who is not:

    contact_groups=admins

Which contacts belong to the specified group (here: admins) is defined by the corresponding contactgroup object (see also Section 2.8 from page 52):

    # -- /etc/nagios/global/contactgroups.cfg
    define contactgroup{
       contactgroup_name   admins
       alias               administrators
       members             nagios,wob,mwi
    }

The specified contact group, though, merely makes a rough preselection: which of the contacts specified in it actually receive the message depends on the filter functions in the definition of the individual contact. In this way you can ensure that one employee is only notified during normal office hours and another one round-the-clock, and that one of them is kept up to date about all changes in status while the other one is informed only of a selection (for example, only CRITICAL but not WARNING).

12.3.3 Person-related filter options

When defining the contact objects, the method by which Nagios delivers the notification in specific cases is also specified (see Section 12.4 from page 224). It can be described separately for host and service problems. Several parallel methods are also possible, such as via e-mail and SMS.

Since the contact-related filters apply specifically to the corresponding contact object, it can certainly be useful to define several contacts for one and the same recipient that differ in individual parameters, such as a contact object that keeps the person informed via e-mail of all problems during normal working hours, and a second one for SMS messages concerning critical events outside working hours.

What should Nagios inform you about?
The events for which somebody should be informed can be specified not only by host or service, but also by contact. Host and service-related states are defined separately here:

    host_notification_options=d,u,r
    service_notification_options=c,r,u

The possible values are the same as those for the host-service parameter notification_options (page 219).

When do messages reach the recipient?

The final filter in the filter chain again refers to time periods. If a message is produced in the time period specified here, Nagios notifies the contact; otherwise it discards the message. The notification window can again be set separately for hosts and services, and as a value it expects a timeperiod object defined elsewhere:

    host_notification_period=24x7
    service_notification_period=workhours

12.3.4 Case examples

Letting you know once, but doing this reliably

What should you do if only a single message should be sent for each change in status of the service, but this message must always reach the relevant recipient during working hours? We can illustrate the solution to this problem through the example of the admins contact group to which the contact wob is assigned, ...

    define contactgroup{
       contactgroup_name   admins
       alias               Local Site Administrators
       members             wob
    }

... and the PING service for the computer linux01:

    define service{
       host_name               linux01
       service_description     PING
       check_command           check_ping!100.0,20%!500.0,60%
       max_check_attempts      3
       normal_check_interval   2
       retry_check_interval    1
       check_period            24x7
       notification_interval   0
       notification_period     workhours
       notification_options    w,u,c,r,f
       contact_groups          admins
    }

notification_interval 0 normally forces Nagios not to produce any repeat messages. The notification_period ensures the desired time period through the timeperiod object workhours: if Nagios raises the alarm at other times, the inbuilt rescheduling is used, that is, the notification is sent on its way only when the specified time period applies again. It is definitely not discarded.
In order for Nagios to be active in all changes of state, the notification_options must always cover all possible events for services.

To guarantee that the contact wob always receives the messages, it is essential that the service_notification_period in the corresponding contact object is 24x7:

    define contact{
       contact_name                   wob
       alias                          Wolfgang
       host_notification_period       24x7
       host_notification_options      d,u,r
       service_notification_period    24x7
       service_notification_options   w,u,c,r,f
       ...
    }

A restricted time filter at this position could, under certain circumstances, lead to the loss of each of the individual messages. The same applies to the values of service_notification_options: only if all are entered here as well will no message be lost.

Informing different admins at different times

If you want to inform different persons at different times about different events, you must not restrict either the notification_period or the notification_options of a host or service:

    define service{
       ...
       notification_interval   120
       notification_period     24x7
       notification_options    w,u,c,r,f
       ...
    }

Filtering takes place exclusively for individual contacts. For this to work on a time level you must ensure that Nagios generates a message regularly (here every 120 time units, normally minutes) if error states persist.

If admin A is to be informed only during his working hours, and then only of changes to critical or OK states, A's contact object will be set with the following parameters:

    define contact{
       ...
       service_notification_period    workhours
       service_notification_options   c,r
       ...
    }

There is also a second and not quite so obvious difference to the first example: let us assume that the service reports the CRITICAL status at 7:30 in the morning, which will persist for several hours. The workhours object is defined so that it describes the time from Monday to Friday between 8:00 and 18:00. In the first example, Nagios holds back the message (rescheduling) until the time period defined in it has been reached. The administrator therefore receives a corresponding message at 8:00.
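The difference in timing can be made concrete with a small Python sketch. It is a simplified model under stated assumptions, not Nagios code: workhours is hard-coded as Monday to Friday, 8:00 to 18:00, the problem is assumed to persist, and the function names are hypothetical. The second function anticipates the variant with notification_interval 120 and a workhours filter in the contact, which is discussed next.

```python
from datetime import datetime, time, timedelta
from typing import Optional

WORK_START = time(8, 0)
WORK_END = time(18, 0)


def in_workhours(moment: datetime) -> bool:
    """Monday to Friday between 8:00 and 18:00, as in the workhours example."""
    return moment.weekday() < 5 and WORK_START <= moment.time() < WORK_END


def delivery_with_rescheduling(event: datetime) -> datetime:
    """First example: the service-level notification_period holds the
    message back until the time period begins."""
    moment = event
    while not in_workhours(moment):
        moment += timedelta(minutes=1)
    return moment


def delivery_with_repeats(event: datetime, interval_minutes: int = 120,
                          max_repeats: int = 1000) -> Optional[datetime]:
    """Second example: a message is generated every interval, and the
    contact-level filter discards those outside workhours."""
    moment = event
    for _ in range(max_repeats):
        if in_workhours(moment):
            return moment
        moment += timedelta(minutes=interval_minutes)
    return None  # no repeat ever fell inside the contact's time period


if __name__ == "__main__":
    event = datetime(2005, 1, 14, 7, 30)   # a Friday, 7:30
    print(delivery_with_rescheduling(event))
    print(delivery_with_repeats(event))
```

For an error first reported on a Friday at 7:30, the first function returns 8:00 and the second 9:30 on the same day.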
In the case described here, no rescheduling takes place. Nagios generates a corresponding message every two hours, which is filtered out if the contact is currently taking a "break." The system correspondingly discards the message at 7:30, but allows the next message two hours later to pass through. The administrator therefore does not receive the corresponding information until 9:30, provided that the problem still exists at this point in time.

Which of the two solutions is more suitable depends on specific requirements. For an e-mail notification, for example, it makes little difference if the administrator receives mails round-the-clock but reads them only when sitting in his office. A filter for Nagios messages in the mail client, sorting them in reverse chronological order (the most current mail first), makes sense in this case. Sitting in front of the screen, the administrator can also take a quick look at the Web interface when problems are announced, to check whether anything has changed.

If the methods of differentiation described so far are not sufficient, then escalation management, described in Section 12.5, may be of further help.

12.4 External Notification Programs

Which external programs deliver the messages is defined by the contact definition. Here there are again two parameters to define the commands to be used, one for services and one for hosts:

    define contact{
       ...
       service_notification_commands   notify-by-email,notify-by-sms
       host_notification_commands      host-notify-by-email,host-notify-by-sms
       email                           nagios-admin@localhost
       pager                           +49-1234-56789
       address1                        root@example.com
       address2                        123-456789
       ...
    }

Both *_notification_commands allow comma-separated lists, so it is permitted to specify more than one command at the same time. The message is then sent simultaneously to the recipient in all the ways defined. The names of the command objects describe these ways: via e-mail and via SMS.
To achieve a better overview, the corresponding commands are not defined together with the plugin commands in the file checkcommands.cfg, but in a separate object file, misccommands.cfg. Nagios loads these like any other file with object definitions, which is why any name can be chosen for them.

The other parameters, email, pager, address1, and address2, can be regarded as variables. The delivery commands access the values set in these through macros. Whether pager contains a telephone number for SMS delivery or an e-mail address pointing to an e-mail SMS gateway is immaterial for the contact definition. The decisive factor is that the value matches the corresponding command that references this variable.

12.4.1 Notification via e-mail

In defining the notify-by-email command, a name and the command line to be executed are specified, as with every other command object. Only its length is unusual, which is why it has had to be line-wrapped several times for this printed version:

    define command{
       command_name   notify-by-email
       command_line   /usr/bin/printf "%b" "***** Nagios *****\n\nNotification
          Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost:
          $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time:
          $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" | /usr/bin/mail
          -s "** $NOTIFICATIONTYPE$ alert - $HOSTALIAS$/$SERVICEDESC$ is
          $SERVICESTATE$ **" $CONTACTEMAIL$
    }

The command object printed here comes from the included example file misccommands.cfg-sample. The command line defined in it can be reduced in principle to the following pattern:

    printf text | mail -s "subject" e-mail address

With the help of the macros, printf generates the message text, which is passed on to the mail program through a pipe. What the macros used here stand for is revealed in Table 12.1.[2] Using this, the jumbo line shown above produces messages that look something like this:

    To: wob@swobspace.de
    Subject: ** PROBLEM alert - mail-WOB/SMTP is CRITICAL **
    Date: Fri, 14 Jan 2005 16:22:47 +0100 (CET)
    From: Nagios Admin

    ***** Nagios *****

    Notification Type: PROBLEM

    Service: SMTP
    Host: mail-WOB
    Address: 172.17.168.2
    State: CRITICAL

    Date/Time: Fri Jan 14 16:22:47 CET 2005

    Additional Info:

    CRITICAL - Socket timeout after 10 seconds

Table 12.1: Macros used in notify-by-email and host-notify-by-email

    Macro                 Description
    $CONTACTEMAIL$        Value of the email parameter from the contact definition
    $LONGDATETIME$        Long form of date specification, e.g., Fri Jan 14 16:22:47 CET 2005
    $HOSTALIAS$           Value of the alias parameter from the host definition
    $HOSTADDRESS$         Value of the address parameter from the host definition
    $HOSTNAME$            Value of the host_name parameter from the host definition
    $HOSTOUTPUT$          Text output of the last host check
    $HOSTSTATE$           State of the host: UP, DOWN, or UNREACHABLE
    $NOTIFICATIONTYPE$    Type of notification: PROBLEM (CRITICAL, WARNING, or UNKNOWN),
                          RECOVERY (OK after error state), ACKNOWLEDGEMENT (an admin has
                          confirmed the error state; see Section 16.1.2, page 278),
                          FLAPPINGSTART, or FLAPPINGSTOP
    $SERVICEDESC$         Value of the service_description parameter in the service definition
    $SERVICEOUTPUT$       Text output of the last service check
    $SERVICESTATE$        State of the service: OK, WARNING, CRITICAL, UNKNOWN

[2] A complete list of all macros is contained in the original documentation at http://localhost/nagios/docs/macros.html (normally to be found in the file system under /usr/local/nagios/share/docs/macros.html).

For the command host-notify-by-email, the command line looks similar, except that now host-related macros are used:

    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type:
       $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress:
       $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" |
       /usr/bin/mail -s "Host $HOSTSTATE$ alert for $HOSTNAME$!" $CONTACTEMAIL$

It generates e-mails with the following content:

    To: wob@swobspace.de
    Subject: Host UP alert for wob-proxy!
    Date: Fri, 14 Jan 2005 17:50:21 +0100 (CET)
    From: Nagios Admin

    ***** Nagios *****

    Notification Type: RECOVERY
    Host: wob-proxy
    State: UP
    Address: 172.17.168.19
    Info: PING OK - Packet loss = 0%, RTA = 69.10 ms

    Date/Time: Fri Jan 14 17:50:21 CET 2005

12.4.2 Notification via SMS

While the infrastructure necessary for sending e-mails[3] is usually available anyway, programs for sending SMS messages such as yaps,[4] smssend,[5] or smsclient[6] usually have to be additionally installed. yaps and smsclient require a local modem or ISDN card and "telephone" directly with the cell phone provider (e.g., T-Mobile); smssend, by contrast, establishes a connection to the Internet servers of the cell phone provider and sends the SMS message on this route. With yaps and smsclient you can also use a mail gateway that generates and sends an SMS message from an e-mail.

[3] Apart from the /usr/bin/mail client, a local mail server is required.
[4] http://www.sta.to/ftp/yaps/
[5] http://zekiller.skytech.org/smssend_menu_en.html
[6] http://www.smsclient.org/

Whichever method you choose, you should be aware of possible interference in sending messages: a connection between the Nagios server and the Internet passes through many hosts, routers, and firewalls. Especially if Nagios is itself monitoring one of the computers involved, things get interesting: if this machine is down, then a message sent via smssend will no longer work either.

The same thing applies for e-mail SMS gateways. Whether a self-made construction with yaps or smsclient is involved, each of which represents its own SMS gateway, or a telecom installation with a sophisticated unified messaging solution: if the actual sender of the SMS is many nodes removed from the Nagios server (because you have a networked telephone installation with several locations, for example), the chances increase that the message will not reach its destination because of an interrupted connection.

For this reason the best solution is an smsclient or yaps installation on the Nagios server itself with direct telephone access.
In larger, networked telephone systems you can also consider giving the telephone access a dedicated, direct line from the telephone system. Whether this is ISDN or analog is just a question here of the technology used.

To represent the programs mentioned here, we will take a closer look at smsclient, which can be configured very simply and has an active community. On its homepage you can also find a link to a mailing list whose members will be pleased to help in case you have questions.

Setting up smsclient

While Debian has its own precompiled smsclient package, for SuSE and other distributions you have to compile the software yourself. For historical reasons the program itself is called sms_client; a short description is provided by man sms_client. The installation from the source code follows the usual procedure:

    linux:~ # cd /usr/local/src
    linux:local/src # tar xvzf /path/to/sms_client-2.x.y
    linux:local/src # cd ./sms_client-2.x.y
    linux:src/sms_client-2.x.y # ./configure
    linux:src/sms_client-2.x.y # make && make install

The only point worth mentioning here is that the "homemade" configure procedure manages without autoconf and automake.

The configuration files listed in Table 12.2 are now located in the directory /etc/sms; the Debian package installs them to /etc/smsclient.

Table 12.2: smsclient configuration files

    File               Description
    sms_addressbook    Definition of aliases and groups
    sms_config         Main configuration file
    sms_daemons        Configuration file for the daemon mode of smsclient, in which
                       this can be reached via a proprietary protocol. Is not required.
    sms_modem          Modem configuration
    sms_services       Supported providers

The file sms_services lists the supported providers and at the same time assigns them to the protocol used. The precise telephone number dialed is specified by the corresponding service file in the directory services (if you have compiled this yourself) or /usr/lib/smsclient/services (for Debian).
In case of doubt, you should request the telephone number from your own cell phone provider. The mailing list can also be of assistance here.

In the file sms_config you set a default provider, which the program uses for calls when the provider is not specifically given:

    SMS_default_service = "d1"

Only the configuration of the modem is now missing in the file sms_modem. In principle, any modem that functions under Linux can be used. In the following example we address an ISDN card with the Isdn4Linux-HiSax driver:

    MDM_lock_dir = "/var/lock"      # directory for the lock files
    MDM_device = "ttyI0"            # device name of the modem
    ...
    MDM_command_prefix = "AT"
    MDM_init_command = "Z&E"
    MDM_dial_command = "D"
    MDM_number_prefix = "0"         # outside line, if required
    ...

/dev/ttyI0 is used as the device here; in MDM_init_command, your own MSN is used. This applies particularly to private branch exchanges, which allow a connection only if your own MSN has been correctly specified. Since Isdn4Linux does not recognize tone or pulse dialing, we use only D instead of the usual DT as the MDM_dial_command. If the ISDN connection requires an outside line as part of a phone exchange, you should enter the corresponding prefix; otherwise this string remains empty.

smsclient requires write permissions both for the device used and for the log file /var/log/smsclient.log:

    linux:~ # touch /var/log/smsclient.log
    linux:~ # chgrp dialout /usr/bin/sms_client
    linux:~ # chgrp dialout /dev/ttyI0 /var/log/smsclient.log
    linux:~ # chmod 2755 /usr/bin/sms_client
    linux:~ # chmod 664 /dev/ttyI0 /var/log/smsclient.log

To test this, you should now send an SMS message to your own cell phone (here to be reached at the number 01604711), preferably as the user nagios, who will later use smsclient:

    nagios@linux:~$ sms_client 01604711 "Text"
    Dialing SMSC 01712521002...
    WARNING: read() Timeout
    Connection Established.
    Login...
    SMSC Acknowledgment received
    Login successful
    Ready to receive message
    Received Message Response: Message 3003123223 send successful - message submitted for processing
    Successful message submission
    Disconnect...
    Disconnected from SMSC
    Hangup...
    d1 Service Time: 17 Seconds
    [000] d1:01604711 "Text"
    Total Elapsed Time: 17 Seconds

Getting Nagios to work together with smsclient

If the second argument, which contains the message text, is missing in the smsclient call, the program will read it from STDIN:

    nagios@linux:~$ /bin/printf "%b" message | sms_client number

Based on the command notify-by-email, described from page 225, we will use the second variation here for defining the notify-by-sms command:

    # 'notify-by-sms' command definition
    define command{
       command_name   notify-by-sms
       command_line   /usr/bin/printf "%.150s" "$NOTIFICATIONTYPE$
          $HOSTNAME$[$HOSTADDRESS$]/$SERVICEDESC$ is $SERVICESTATE$
          /$SHORTDATETIME$/ $SERVICEOUTPUT$" | /usr/bin/smsclient $CONTACTPAGER$
    }

As usual, the entire command_line is written on a single line. Nagios obtains the telephone number (or alias) through the macro $CONTACTPAGER$, which reads out the value of the pager parameter from the contact definition. Since an SMS here may not be longer than 150 characters, we will considerably abbreviate the information, compared to the e-mail message. To be on the safe side (you never know how long the plugin output $SERVICEOUTPUT$ really is), the printf format specification %.150s (instead of %b) cuts off the text after 150 characters. We then have to do without the line breaks that \n would produce in the message, but an SMS is never formatted cleanly anyway, due to its limited display.
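The effect of macro substitution followed by truncation can be imitated in Python. The sketch below is only an illustration, not what Nagios does internally: the functions expand_macros and sms_text are hypothetical, the macro values in the example are invented, unknown macros are simply replaced by an empty string (an assumption of this sketch), and Python slicing counts characters where printf counts bytes.

```python
import re

SMS_TEMPLATE = ("$NOTIFICATIONTYPE$ $HOSTNAME$[$HOSTADDRESS$]/$SERVICEDESC$ "
                "is $SERVICESTATE$ /$SHORTDATETIME$/ $SERVICEOUTPUT$")


def expand_macros(template: str, values: dict) -> str:
    """Replace every $NAME$ in the template with values['NAME']."""
    return re.sub(r"\$([A-Z0-9_]+)\$",
                  lambda match: values.get(match.group(1), ""),
                  template)


def sms_text(values: dict, limit: int = 150) -> str:
    """Expand the template and keep at most `limit` characters,
    similar in spirit to the printf format %.150s."""
    return expand_macros(SMS_TEMPLATE, values)[:limit]


if __name__ == "__main__":
    # Invented example values, for illustration only.
    example = {
        "NOTIFICATIONTYPE": "PROBLEM",
        "HOSTNAME": "host1",
        "HOSTADDRESS": "192.0.2.1",
        "SERVICEDESC": "UPS",
        "SERVICESTATE": "CRITICAL",
        "SHORTDATETIME": "2005-03-30 17:00:53",
        "SERVICEOUTPUT": "Connection refused",
    }
    print(sms_text(example))
```

However long the plugin output turns out to be, the resulting text never exceeds the limit, because everything beyond it is cut off.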
Thus notify-by-sms generates a one-line message of the following type:

    PROBLEM elimail[172.17.130.1]/UPS is CRITICAL /2005-03-30 17:00:53/ Connection refused

12.5 Escalation Management

Whenever the administrators responsible cannot find a solution in the specified time when important components fail, although Service Level Agreements or other contracts commit the IT department to do this,[7] Nagios's ability to escalate notifications makes allowances for conflicts, at least on an organizational level. It can be used to provide multilevel support. For example, Nagios first informs the First Level Support (usually the Help Desk). If the problem still persists after one day, then the Second Level Support is notified, and so on.

Nagios also makes a distinction here between host- and service-related escalation stages. In essence, both function identically.

In the escalation, Nagios does not count in time units, but in how many messages it has already sent out. In the following example the system should report on error states of the Database service on linux01 every 120 minutes,[8] and this round-the-clock:

    define service{
       host_name               linux01
       service_description     Database
       notification_period     24x7
       notification_interval   120
       ...
       contact_groups          admins
    }

The corresponding messages always go to a contact group, so without escalation, that is to admins.

[7] These can also be internal specialist departments.
[8] To be precise, every 120 time units, whereby the default time unit is 60 seconds.

Figure 12.3: Nagios escalates, depending on the number of messages already sent

With the fourth notification, Nagios should switch on the first stage of escalation (as illustrated in Figure 12.3) and, in addition to admins, should notify the second-level contact group. The eighth message triggers the second level, at which Nagios informs the contact group third-level.

As shown in Figure 12.3, escalations may certainly overlap.
It can also be seen from the figure that the contact group defined in the service object only applies as long as Nagios does not escalate. As soon as an escalation stage is switched on, the system puts the default contact group out of action.

If the original contact group (here admins) should also receive a message in the first escalation level, then this must be additionally specified in the escalation definition. If several levels overlap, Nagios informs all the groups involved. In Figure 12.3 the eighth to the tenth messages accordingly go to admins, second-level, and third-level, while only the latter receives message numbers 11 and 12. From message number 13, Nagios keeps only the contact group admins informed, since escalation is no longer defined here.

Escalation takes place via separate serviceescalation (for services) and hostescalation objects (for computers). For a serviceescalation object, Nagios requires the beginning and the end of exceptional circumstances to be defined, apart from service details (consisting of the service_description and host_name parameters) and the name of the contact groups responsible:

    define serviceescalation{
       host_name               linux01
       service_description     Database
       first_notification      4
       last_notification       10
       notification_interval   60
       contact_groups          admins,second-level
    }

The escalation level defined here starts, as desired, with message No. 4 and ends with message No. 10. If last_notification is given the value 0, the escalation only ends if the service changes back to the OK state.

In addition you must specify the notification_interval parameter for service escalations: this changes the notification interval (previously 120 according to the service definition) to 60 time units. This parameter is also mandatory for a host escalation. The only difference in the definition of a hostescalation object is that instead of the host name, you can also specify one or more host groups (in addition the service_description parameter is dropped, of course).
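Which contact groups receive a given message number, and which interval applies, can be sketched as follows in Python. This is merely a model of the rules described for Figure 12.3, not the Nagios implementation; the names Escalation, recipients, and interval are hypothetical. The values of the second escalation (messages 8 to 12, interval 90, group third-level) are taken from the definition that follows in the text, as is the rule that the smallest interval wins when escalations overlap.

```python
from typing import List, NamedTuple, Set, Tuple


class Escalation(NamedTuple):
    first_notification: int
    last_notification: int        # 0 means: until the service is OK again
    notification_interval: int
    contact_groups: Tuple[str, ...]


# Values from the service definition (no escalation active)
DEFAULT_GROUPS = ("admins",)
DEFAULT_INTERVAL = 120

ESCALATIONS = [
    Escalation(4, 10, 60, ("admins", "second-level")),
    Escalation(8, 12, 90, ("third-level",)),
]


def active_escalations(number: int) -> List[Escalation]:
    """Escalations whose message range contains the given message number."""
    return [
        esc for esc in ESCALATIONS
        if number >= esc.first_notification
        and (esc.last_notification == 0 or number <= esc.last_notification)
    ]


def recipients(number: int) -> Set[str]:
    """Contact groups that receive message number `number`."""
    active = active_escalations(number)
    if not active:
        return set(DEFAULT_GROUPS)   # default group applies only without escalation
    groups: Set[str] = set()
    for esc in active:
        groups.update(esc.contact_groups)
    return groups


def interval(number: int) -> int:
    """Notification interval for message number `number`."""
    active = active_escalations(number)
    if not active:
        return DEFAULT_INTERVAL
    return min(esc.notification_interval for esc in active)


if __name__ == "__main__":
    for n in (3, 4, 9, 11, 13):
        print(n, sorted(recipients(n)), interval(n))
```

Note that admins appears in messages 4 to 10 only because the first escalation lists the group explicitly; messages 11 and 12 go to third-level alone.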
The second escalation step is defined in the same way:

    define serviceescalation{
       host_name               linux01
       service_description     Database
       first_notification      8
       last_notification       12
       notification_interval   90
       contact_groups          third-level
    }

If there are overlapping escalations with different notification_interval values, Nagios chooses the smallest defined time unit in each case. Nagios therefore sends messages 8 to 10 at intervals of 60 minutes, numbers 11 and 12 at intervals of 90 minutes, and then the original interval of 120 minutes again applies.

With escalation_period and escalation_options there are two more setting parameters specially for escalations. Both have the same function as notification_period and notification_options in the host or service definition, but they refer only to the escalation case.

In contrast to notification_interval, escalation_period does not replace the notification_period, but acts in addition to this. The actual time period is deduced from the intersection of notification_period and escalation_period. Suppose that notification_period refers to the time between 7:00 A.M. and 5:00 P.M., and escalation_period to the period from 8:00 A.M. to 8:00 P.M. Then Nagios will only send out messages in the escalation level between 8:00 A.M. and 5:00 P.M. You must always remember here that it is only the number of messages that have already been sent that decides whether an escalation level exists. escalation_period and escalation_options only have an effect as additional filters.

Before these two parameters are used, you should carefully consider what it is you want to achieve with them.
Restricting the escalation to a specific time period could under certain circumstances result in it being omitted entirely. If you restrict it to weekdays, for example, this would mean that if the Database service failed during the weekend, Nagios would inform only the contact group admins on Monday morning: over the weekend the system has already sent more than 12 messages, so it no longer even uses its escalation mechanism. If there is a time restriction via escalation_period, you should set last_notification to 0 to ensure that the escalation really does take place.

Every case of error is followed at some point in time by a recovery. An intelligent mechanism ensures that Nagios only notifies those contacts of the corresponding recovery who have previously been informed of an error state.

12.6 Dependences between Hosts and Services as a Filter Criterion

If you test services with local plugins (see Chapter 7) via NRPE (see Chapter 10), all these tests will come to nothing the moment the Plugin Executor fails. With service dependencies you can prevent Nagios from flooding the appropriate administrator with messages on the dependent services. Instead of this, the system informs him specifically of the NRPE failure.

As with such service dependencies, Nagios also has host dependencies, which suppress messages, depending on individual hosts. Both variations can also be used to specifically "switch off" tests.

12.6.1 The standard case: service dependencies

Let us take as an example the host linux01, illustrated in Figure 12.4, on which locally installed plugins, controlled via NRPE, monitor hard drive space (Disks service, see page 174), the number of logged-in users (Users service), and the system load (Load service).

If NRPE were now to fail, Nagios would announce the CRITICAL state for all three services, although their actual state is unknown, and the real problem is the "NRPE daemon." In order to solve this contradiction, NRPE is defined and monitored as a separate service, and the dependencies are described in a servicedependency object.
Figure 12.4: The three above-mentioned services depend on NRPE

To define the additional service check for NRPE, we make use of the possibility of calling the check_nrpe plugin (see page 166) (almost) without any parameters at all. It then simply returns the version of the NRPE daemon being used:

    nagios@linux:~$ /usr/local/nagios/libexec/check_nrpe -H linux01
    NRPE v2.0

The command defined in Section 10.4 on page 172, check_nrpe, requires further arguments and therefore cannot be used for our purposes. For this reason we set up a new command object, test_nrpe, which exclusively tests NRPE:

    define command{
        command_name    test_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$
    }

With this, an NRPE service can now be defined:

    define service{
        host_name              linux01
        service_description    NRPE
        check_command          test_nrpe
        ...
    }

The dependencies of the three local services on NRPE are described by the following servicedependency object:

    define servicedependency{
        host_name                      linux01
        service_description            NRPE
        dependent_host_name            linux01
        dependent_service_description  Disks,Users,Load
        notification_failure_criteria  c,u
        execution_failure_criteria     n
    }

host_name and service_description define the master service, the failure of which leads to the failure of the services named in dependent_service_description on the computer specified in dependent_host_name. Multiple entries, separated by commas, are possible for all four parameters mentioned. You should bear in mind, however, that each dependent service is then dependent on every possible master service.

The remaining parameters influence service checks and notifications: notification_failure_criteria specifies the states of the master service for which notifications about an error in the dependent services (e.g., Disks) should not appear. Possible values are u (UNKNOWN), w (WARNING), c (CRITICAL), p (PENDING, i.e., an initial check is planned but has not yet been carried out), o (OK), and n (None).
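How a criteria list such as c,u is matched against the state of the master service can be sketched in a few lines of shell. This only illustrates the rule described above and is not how Nagios implements it; the function name criteria_match is hypothetical.

```shell
#!/bin/bash
# Sketch only: criteria_match is a hypothetical helper, not part of Nagios.

# Return 0 if the master state is covered by the criteria list.
# $1: master state (OK, WARNING, CRITICAL, UNKNOWN, PENDING)
# $2: comma-separated criteria, e.g. "c,u" or "n"
criteria_match() {
    local letter
    case $1 in
        OK)       letter=o ;;
        WARNING)  letter=w ;;
        CRITICAL) letter=c ;;
        UNKNOWN)  letter=u ;;
        PENDING)  letter=p ;;
        *)        return 2 ;;   # not a known state
    esac
    # Surround the list with commas so that every entry can be matched whole.
    case ",$2," in
        *",$letter,"*) return 0 ;;
        *)             return 1 ;;
    esac
}

# With notification_failure_criteria c,u and NRPE in the CRITICAL state,
# notifications for Disks, Users, and Load are suppressed.
if criteria_match CRITICAL "c,u"; then
    echo "notifications for dependent services suppressed"
fi
```

Because n matches none of the state letters, a criteria list of n never suppresses anything in this sketch, which agrees with the meaning of None.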
c,u in the example above means that Nagios does not inform the administrators responsible of "errors" in the services Disks, Users, and Load on linux01 if the master service is in the CRITICAL or UNKNOWN state. With an o for OK, the logic can be reversed: here there is no message if there is an error in the dependent service, as long as the master service is in an OK state. Accordingly, n means that Nagios provides a notification irrespective of the status of the master service.

The execution_failure_criteria parameter controls tests, depending on the state of the master service. The values u (UNKNOWN), w (WARNING), c (CRITICAL), p (PENDING), o (OK), and n (None), as with notification_failure_criteria, refer to states of the master service for which there should be no check. In the example, n is specified, so that Nagios tests Disks, Users, and Load even if NRPE fails. Nagios therefore suppresses messages, but since it still carries out the service checks on the dependent services, the Web interface always shows their current status.

The details for notification_failure_criteria interact with the freshness mechanism of passive tests (see Section 13.4 from page 243). If check_freshness is used in the service definition, and if Nagios considers the most recently determined status to be out of date, it will carry out active tests even if it ought to suppress them according to the service dependency.

Inheritance

Nagios does not automatically inherit dependencies. An example of this is shown in Figure 12.5: on the internal side of a firewall, the system should query various resources via SNMP. For security reasons, the test is performed indirectly via NRPE; that is, the Nagios server uses NRPE to run the SNMP plugins, which are installed on a host inside the firewall.
Figure 12.5: Multilevel dependencies for services

The following two servicedependency objects describe a dependency between the SNMP service (master) and the Disks service (dependent service) on the host linux04, as well as between the NRPE service on linux01 and the SNMP service on linux04:

    define servicedependency{
        host_name                      linux04
        service_description            SNMP
        dependent_host_name            linux04
        dependent_service_description  Disks
        notification_failure_criteria  c,u
        execution_failure_criteria     c,u
    }

    define servicedependency{
        host_name                      linux01
        service_description            NRPE
        dependent_host_name            linux04
        dependent_service_description  SNMP
        notification_failure_criteria  c,u
        execution_failure_criteria     c,u
    }

If the NRPE daemon on linux01 fails, Nagios would recognize only the defined dependency between NRPE and SNMP, but not the implicit dependency between NRPE and Disks. To take this into account as well, the parameter inherits_parent is inserted in the definition of the service dependency between Disks and SNMP:

    inherits_parent 1

With this, Nagios tests whether the master service itself (here SNMP) is dependent on another service through a corresponding servicedependency. If the NRPE service on linux01 fails (CRITICAL state), Nagios leaves out the check of Disks on linux04, thanks to execution_failure_criteria c,u, and also does not send any notification of the most recently detected status of Disks.

Other application cases

Dependency definitions between services are particularly useful if a great deal depends on a single service, so that the actual problem is in danger of disappearing under a flood of error messages.
Apart from the use in combination with NRPE already described, this applies to all services that the Nagios server cannot test directly and for which it must use tools instead (NRPE, SNMP, or even NSCLIENT for Windows, see Section 18.1). If it is not possible to establish a simple connection to the utility and query a constant value (version number, system name), you can still use a generic plugin to address the corresponding port.

Another example of using service dependencies is applications that depend on a database: a Web application with dynamic Web pages fails if the underlying database (which may be located somewhere in the network on another host) is not working. A precisely defined dependency between the database service and the dynamic Web application also ensures here that the administrator is notified of the actual cause.

12.6.2 Only in exceptional cases: host dependencies

Host dependencies function in principle exactly like service dependencies; the hostdependency object is also capable of suppressing messages. There are a number of subtle differences in the detail, however.

Only explicitly configured regular host checks, for which check intervals are defined as for services, can be suppressed. This type of host check should be used only in exceptional circumstances, however, since it can have a significant influence on the performance of Nagios. Normally Nagios decides for itself when it will perform a host check (see Section 4.1 from page 72).

In nearly all cases the parents parameter in the host definition is better at describing the dependencies between hosts. As long as Nagios can test individual hosts directly, the system can distinguish much better between DOWN and UNREACHABLE (see Section 4.1 from page 72). If you do not want any notification for particular hosts, dependent on the network topology, then you should be informed only for DOWN, but not for UNREACHABLE.

Host dependencies should be used only when Nagios can no longer distinguish between DOWN and UNREACHABLE.
This is usually the case when the host check is performed indirectly (e.g., in Figure 12.1 on page 175).

Chapter 13: Passive Tests with the External Command File

Apart from active service and host checks, Nagios also makes use of passive tests (and combinations of both types of test). For active checks, the system itself determines when they are performed and then initiates them; in passive mode, Nagios only processes incoming results.

For this to work, an interface is required that allows test results from the outside to be passed on to Nagios, as well as commands that perform checks and feed in the results through the interface. Normally remote hosts send their test results, determined by shell scripts, to the Nagios server via the Nagios Service Check Acceptor (NSCA), which is introduced in the next chapter (page 247).

Passive checks are used in particular with distributed monitoring, in which noncentral Nagios servers send all their results to a central Nagios instance. This subject is discussed in Chapter 15. They are also used in the processing of asynchronous events whose timing Nagios cannot determine itself. One example is a backup script that sends a result to Nagios (OK or CRITICAL) when it has completed a data backup; another is the processing of SNMP traps (see Section 14.6).

13.1 The Interface for External Commands

The interface for external commands, known in Nagios jargon as the External Command File, consists of a named pipe (FIFO)[1] in the subdirectory rw of the Nagios var directory:

    user@linux:~$ ls -lF /var/nagios/rw
    prw-rw---- 1 nagios nagcmd 0 Dec 19 10:56 nagios.cmd|

The pipe, marked with p in the ls output, is set up correctly by the make install-commandmode command during installation. For reasons of security it is essential that you ensure that only the group nagcmd can read from and write to the pipe. Anyone who has access here can control Nagios remotely via commands and can, if they want, shut it down entirely.
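The permissions of the named pipe can be checked with a short script, for example before NSCA is put into operation. The following is a minimal sketch under the assumption that the pipe should show exactly the mode from the ls output above (prw-rw----); the function names file_mode and pipe_is_restricted are hypothetical, and the sketch does not check that the group is actually nagcmd.

```shell
#!/bin/bash
# Sketch only: file_mode and pipe_is_restricted are hypothetical helpers.

# Print file type and permission bits as shown by ls, e.g. "prw-rw----".
file_mode() {
    ls -ld "$1" | cut -c1-10
}

# Return 0 only if $1 is a named pipe that owner and group may read and
# write, while all other users have no access at all.
pipe_is_restricted() {
    [ -p "$1" ] && [ "$(file_mode "$1")" = "prw-rw----" ]
}

# Possible use on the Nagios server (path as in the example above):
#   pipe_is_restricted /var/nagios/rw/nagios.cmd || echo "check pipe permissions" >&2
```

Comparing the first ten characters of the ls output keeps the sketch independent of platform-specific stat options.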
Commands that Nagios accepts from the External Command File have the following form:

    [timestamp] command;arguments

As the timestamp in square brackets, Nagios expects the current time in epoch seconds, that is, the number of seconds that have elapsed in the UTC time zone since January 1, 1970. This is followed by a space, then a command followed by a matching number of arguments, separated by semicolons.

The Web interface makes extensive use of this mechanism, allowing its users to make various settings via mouse click.[2] In this chapter we will limit ourselves to the two processing commands with which computers deliver the results of passive checks to the Nagios server, PROCESS_SERVICE_CHECK_RESULT and PROCESS_HOST_CHECK_RESULT.

For reasons of security, the processing of external commands must be explicitly switched on in the main configuration file nagios.cfg with the directive check_external_commands=1:

    # /etc/nagios/nagios.cfg
    ...
    check_external_commands=1
    command_check_interval=-1
    command_file=/var/nagios/rw/nagios.cmd
    ...

command_check_interval specifies the interval, in seconds, at which Nagios checks the interface for pending commands. -1 means "as often as possible." command_file specifies the path to the named pipe.

[1] A named pipe is a buffer to which a process can write something, which can then be read by another process. Whatever is written first is also read first: First In, First Out (FIFO). Since this involves space in the main memory, a named pipe does not need any space on the hard drive.
[2] A detailed description of all possible commands is provided by the online documentation at http://localhost/nagios/docs/extcommands.html or file:/usr/local/nagios/share/docs/extcommands.html.

13.2 Passive Service Checks

In order for Nagios to be able to accept passive service checks via the interface, this must be explicitly allowed in the global configuration and in the corresponding service definition. The corresponding entry in nagios.cfg is:

    # /etc/nagios/nagios.cfg
    ...
    accept_passive_service_checks=1
    ...
In the service definition you can select whether you want to perform active checks in parallel to the passive ones. Active checks are only possible, of course, if Nagios can query the information itself. The following example allows passive checks and stops all active ones:

    define service{
        host_name               linux01
        service_description     Disks
        passive_checks_enabled  1
        active_checks_enabled   0
        check_command           check_dummy
        check_period            none
        ...
    }

An exception is normally made for freshness checks (see Section 13.4 from page 243): here Nagios makes use of the command defined in check_command. To ban active checks entirely, the check_period parameter is set to none. The check command does not play a role in this case, so you can just enter a dummy check here, for example (which, like all other commands, has to be defined, of course).

On the computer to be tested passively (in this example, linux01) you must ensure, via NSCA (see Chapter 14), that it contacts the Nagios server through the interface for external commands. There it writes the command for passive service checks in the following one-line form:

    [timestamp] PROCESS_SERVICE_CHECK_RESULT;host-name;service;return value;plugin output

The timestamp can be created in a shell script, for example with date:

    user@linux:~$ date +%s
    1112435763

A simple script that passes on the result of a passive service check on the Nagios server itself to the Nagios installed there could look like this:

    #!/bin/bash
    # Write one passive service check result to the External Command File.
    EXTCMDFILE="/var/nagios/rw/nagios.cmd"
    TIME=$(date +%s)
    HOST=$1
    SRV=$2
    RESULT=$3
    OUTPUT=$4
    CMD="[$TIME] PROCESS_SERVICE_CHECK_RESULT;$HOST;$SRV;$RESULT;$OUTPUT"
    /bin/echo "$CMD" >> "$EXTCMDFILE"

When it is run, it expects the parameters in the correct sequence:

    name_of_script linux01 Disks 0 'Disks ok: everything in order :-)'

After the host and service names, the test status follows as a digit, and finally the output text. If the service name contains spaces, then it should also be set in quotation marks.
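The one-line form described above can also be built by a small function that validates the return value before anything is written to the pipe. This is a minimal sketch, not part of Nagios: the function name format_service_result and its optional fifth argument (a fixed timestamp, useful for testing) are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: format_service_result is a hypothetical helper.

# Print one PROCESS_SERVICE_CHECK_RESULT line on standard output.
# $1 host name, $2 service description, $3 return value (0 to 3),
# $4 plugin output, $5 optional timestamp (default: current epoch seconds)
format_service_result() {
    local ts
    ts=${5:-$(date +%s)}
    case $3 in
        0|1|2|3) ;;
        *) echo "return value must be 0, 1, 2, or 3" >&2
           return 1 ;;
    esac
    # printf keeps spaces in the service name and output text intact.
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$ts" "$1" "$2" "$3" "$4"
}

# Possible use on the Nagios server (path as in the script above):
#   format_service_result linux01 Disks 0 'Disks ok' >> /var/nagios/rw/nagios.cmd
```

Separating formatting from writing means the line can be inspected on the terminal first and redirected to the External Command File only once it looks right.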
13.3 Passive Host Checks

Passive host checks follow the same principle as passive service checks, except that they involve computers and not services. To allow them globally, the accept_passive_host_checks parameter is set to 1 in nagios.cfg:

    # /etc/nagios/nagios.cfg
    ...
    accept_passive_host_checks=1
    ...

In addition, the host definition for the computer to be monitored passively must allow this kind of host check:

    define host{
        host_name               linux01
        passive_checks_enabled  1
        active_checks_enabled   0
        check_period            none
        check_command           check_dummy
        ...
    }

In this example it simultaneously forbids active checks. The command to be sent through the external interface, with which the computer delivers its test results, differs only marginally from the syntax of the service check command already introduced:

    [timestamp] PROCESS_HOST_CHECK_RESULT;hostname;return value;plugin output

Active and passive host checks differ in one important respect: with passive checks, Nagios is no longer in a position to distinguish between DOWN and UNREACHABLE (see Section 4.1 from page 72). If you still want to take account of network topology dependencies when making notifications and to give specific information on the actual host that is down, you must make use of host dependencies in this case (see Section 12.6.2 from page 238).

13.4 Reacting to Out-of-Date Information of Passive Checks

It lies in the nature of passive checks that Nagios is content with the information delivered. Nagios has no influence over when and at what intervals the remote host delivers it. It may even be the case that the information does not arrive at all.

In order to classify the "knowledge state" of the server as out of date, Nagios has the ability to become active itself, with a freshness check. Like passive checks, freshness checking must be enabled both globally and in the relevant service or host object. To do this, you need to set the following global parameters in the file nagios.cfg:

    # /etc/nagios/nagios.cfg
    ...
    check_service_freshness=1
    service_freshness_check_interval=60
    check_host_freshness=0
    host_freshness_check_interval=60
    ...

The value 0 in check_host_freshness and the value 1 in check_service_freshness ensure that Nagios carries out freshness checks only for services, and not for hosts. The check interval defines how often the server checks whether its information is still current, in this case every 60 seconds. When Nagios really becomes active for a specific service or host depends on the threshold value, which you can set in the appropriate service or host definition with the freshness_threshold parameter:[3]

    define service{
        host_name               linux01
        service_description     Disks
        passive_checks_enabled  1
        active_checks_enabled   0
        check_freshness         1
        freshness_threshold     3600
        check_command           service_is_stale
        ...
    }

So in this example Nagios performs the freshness check for this service only if the last transmitted value is older than 3600 seconds (one hour). Nagios then starts the command defined in check_command, even if active checks have been switched off in the corresponding host or service definition, or even globally.

If you define the command named in the example, service_is_stale, so that Nagios really does check the service or host, then Nagios will perform active tests even if active checking is switched off, but only if passive results are overdue for longer than the threshold value set.

If active checks are not possible or not wanted, you can use a pseudo-test to ensure that Nagios explicitly signals an error status, so that the administrator's attention is drawn to it. Otherwise Nagios will always display the last status received.
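The decision that the freshness threshold controls can be sketched as a simple comparison of timestamps. Nagios makes this decision internally; the function result_is_stale and its optional third argument are hypothetical, and the strict "older than" comparison is an assumption based on the description above.

```shell
#!/bin/bash
# Sketch only: result_is_stale is a hypothetical helper, not part of Nagios.

# Return 0 if the last passive result is older than the threshold.
# $1 time of the last result in epoch seconds
# $2 freshness_threshold in seconds
# $3 optional current time in epoch seconds (default: now)
result_is_stale() {
    local now
    now=${3:-$(date +%s)}
    [ $(( now - $1 )) -gt "$2" ]
}

# Example with freshness_threshold 3600: a result that arrived 3601 seconds
# ago counts as stale, one that arrived 3600 seconds ago does not.
if result_is_stale 1000 3600 4601; then
    echo "result is out of date; check_command would be run"
fi
```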
If that last status was OK, then it will not necessarily be noticed that current results have not been arriving for some time. The following pseudo-test script delivers an appropriate error message with echo, and with exit 2 delivers the return value for CRITICAL, so that the administrator can react accordingly:

    #!/bin/bash
    /bin/echo "CRITICAL: no current results of the service"
    exit 2

[3] If you do not explicitly specify freshness_threshold, the value set for normal_check_interval will be used in the hard state, and if there is a soft state, the value of retry_check_interval will serve as the default.

If the script is located in the plugin directory as service_is_stale.sh, the Nagios command service_is_stale is defined as follows:

    define command{
        command_name  service_is_stale
        command_line  $USER1$/service_is_stale.sh
    }

If the results for the service Disks on linux01 fail to appear for longer than one hour, Nagios will run the script service_is_stale.sh, which always returns CRITICAL, irrespective of what data linux01 last sent. This CRITICAL status is ended only when the host passes on new and more positive results to the server through a passive check.

Chapter 14: The Nagios Service Check Acceptor (NSCA)

In order to send service and host checks across the network to the central Nagios server, a transmission mechanism is required. This is provided by the Nagios Service Check Acceptor (NSCA). It consists of two components: a client program, send_nsca, which accepts the results of a service or host check on the remote host and sends them to the Nagios server, and the NSCA daemon nsca, which runs on the server, receives data from the client, processes it for the External Command File interface (see Section 13.1), and passes the data on to it (Figure 14.1).
The Nagios Service Check Acceptor was originally developed to enable distributed monitoring in which decentralized Nagios servers can send their results to a central Nagios server (see Chapter 15 from page 265). In principle, the data that send_nsca sends to the Nagios server can come from any applications you like.

Sending commands across the network to the central Nagios instance is not insignificant from a security point of view, since Nagios could be completely switched off using the External Command File. This is why NSCA sends the data in encrypted form, and clients must have the correct key to obtain access to the interface. This prevents an arbitrary network participant from being able to run any commands at all on the Nagios server.

Figure 14.1: How the NSCA functions

14.1 Installation

NSCA version 2.4, current at the time of going to press, was published in the summer of 2003; the chances are therefore quite high that the distribution you are using contains a current package. The source code[1] is quite easy to compile yourself, however. As a prerequisite, you need to have the library libmcrypt installed, together with the relevant header files,[2] or else the integrated encryption cannot be used.

In the unpacked source directory, you should run the included configure script, specifying the Nagios configuration and var directories:

    linux:local/src # tar xvzf /path/to/nsca-2.4.tar.gz
    ...
    linux:local/src # cd nsca-2.4
    linux:src/nsca-2.4 # ./configure --sysconfdir=/etc/nagios \
                           --localstatedir=/var/nagios
    ...
    *** Configuration summary for nsca 2.4 07-23-2003 ***:

    General Options:
    -------------------------
    NSCA port: 5667
    NSCA user: nagios
    NSCA group: nagios
    ...

[1] http://www.nagiosexchange.org/Communication.41.0.html
[2] The corresponding binary package usually contains -dev or -devel in its name.
At the end, the configure script displays a summary showing the user and group with whose permissions NSCA runs by default, unless the configuration specifies otherwise. Normally the NSCA daemon waits on TCP port 5667.

A final make all compiles the two programs nsca and send_nsca. They are now located in the subdirectory src and need to be copied manually to a suitable directory:

    linux:src/nsca-2.4 # cp src/nsca /usr/local/sbin/.
    linux:src/nsca-2.4 # scp src/send_nsca remote_host:/usr/local/bin/.

nsca is copied to the Nagios server, preferably to the directory /usr/local/sbin. send_nsca belongs on the remote host that is to send its test results to the Nagios server. If this computer has a different operating system version or platform, it is possible that the client to run there will need to be recompiled.

Each of the two programs requires its own configuration file, which is best stored in the directory /etc/nagios:

    linux:src/nsca-2.4 # cp nsca.cfg /etc/nagios/.
    linux:src/nsca-2.4 # scp send_nsca.cfg remote_host:/etc/nagios/.

14.2 Configuring the Nagios Server

14.2.1 The configuration file nsca.cfg

For NSCA to work, the External Command File interface on the Nagios server must be activated in the configuration file /etc/nagios/nagios.cfg (Section 13.1, page 240) and the corresponding data entered in the NSCA configuration file nsca.cfg:

    # /etc/nagios/nsca.cfg
    server_port=5667
    server_address=192.168.1.1
    allowed_hosts=127.0.0.1
    nsca_user=nagios
    nsca_group=nagios
    debug=0
    command_file=/var/nagios/rw/nagios.cmd
    alternate_dump_file=/var/nagios/rw/nsca.dump
    aggregate_writes=0
    append_to_file=0
    max_packet_age=30
    password=verysecret
    decryption_method=10

The parameters server_port, server_address, allowed_hosts, nsca_user, and nsca_group take effect only if nsca is started as a daemon.
If it is started through an inet daemon, the values set in the inet daemon's configuration apply to the NSCA server address and the port on which NSCA is listening, the IP addresses of the hosts that are allowed to access the interface,[3] and the user and group with whose permissions the Service Check Acceptor runs.

The debug parameter makes it easier to search for errors, but it should normally be switched off (value 0). If it is set to 1, NSCA writes debugging information to the syslog.

The named pipe is defined by the entry command_file. If you specify an alternative output file with alternate_dump_file, this serves as a fallback in case the named pipe given does not exist. Before version 2.0, Nagios removed the pipe each time it was shut down, but this should not happen anymore.

If it is set to 1, aggregate_writes ensures that NSCA first collects all the incoming commands and then passes them on to the interface as a block. If the value is 0, NSCA sends each incoming command immediately to the External Command File.

append_to_file can have the values 0 (opens the External Command File in write mode) or 1 (opens it in append mode), and it should always be set to 0.[4]

Client messages older than max_packet_age seconds are discarded by NSCA, to avoid replay attacks. This value may not be larger than 900 seconds (15 minutes) and should be as small as possible.

The last two parameters refer to the encryption of the communication. password contains the actual key, which is identical for all clients and must also be entered in the configuration for the clients (cf. Section 14.3 on page 252). Because the key is written in the file in plain text, nsca.cfg should be readable only for the user with whose permissions NSCA is running, which in our case is nagios:

    linux:/etc/nagios # chown nagios:nagios nsca.cfg
    linux:/etc/nagios # chmod 400 nsca.cfg

[3] If you want to define more than one IP address for allowed_hosts, they are separated by commas.
[4] The append mode makes sense only if the External Command File is replaced for debugging purposes with a simple file.
Finally, decryption_method defines the encryption algorithm. The default is 1 (XOR), which is almost as insecure as 0 (no encryption). 10 stands for LOKI97, which is regarded as secure.[5] The list of all possible algorithms is contained in the supplied configuration file, which contains many old algorithms and some newer ones, such as DES (2), Triple-DES (3), Blowfish (8), and Rijndael (AES).[6]

14.2.2 Configuring the inet daemon

If you want to start nsca with the inet daemon, the following entry is added to the file /etc/services:

    nsca    5667/tcp    # Nagios Service Check Acceptor (NSCA)

xinetd configuration

If the newer xinetd is used, the file nagios-nsca is created in the directory /etc/xinetd.d with the following contents:

    # /etc/xinetd.d/nagios-nsca
    # description: NSCA
    # default: on
    service nsca
    {
        flags           = REUSE
        socket_type     = stream
        wait            = no
        user            = nagios
        group           = nagios
        server          = /usr/local/sbin/nsca
        server_args     = -c /etc/nagios/nsca.cfg --inetd
        log_on_failure  += USERID
        disable         = no
        only_from       = 127.0.0.1 ip1 ip2 ... ipn
    }

The values for the user and group with whose permissions NSCA should run, the path to the NSCA daemon nsca (parameter server), and the path to the corresponding configuration file are adjusted to your own environment if necessary. The line only_from, as an equivalent to the nsca.cfg parameter allowed_hosts, takes all the IP addresses, separated by spaces, from which NSCA may be addressed. Distributions that include NSCA as a finished package and install xinetd by default include a ready-to-use xinetd configuration file, where you only need to adjust this last parameter.
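The method numbers mentioned for decryption_method in Section 14.2.1 can be collected in a small lookup helper, for example to print a warning when a configuration file still uses one of the insecure settings. This is only a sketch: the function names are hypothetical, and the table covers just the numbers named in this chapter, not the complete list in the supplied nsca.cfg.

```shell
#!/bin/bash
# Sketch only: hypothetical helpers covering the method numbers named in the text.

# Print the algorithm name for a decryption_method value.
nsca_method_name() {
    case $1 in
        0)  echo "none" ;;
        1)  echo "XOR" ;;
        2)  echo "DES" ;;
        3)  echo "Triple-DES" ;;
        8)  echo "Blowfish" ;;
        10) echo "LOKI97" ;;
        14) echo "Rijndael-128" ;;
        15) echo "Rijndael-192" ;;
        16) echo "Rijndael-256" ;;
        *)  echo "unknown"
            return 1 ;;
    esac
}

# Return 0 if the method offers no real protection (no encryption or XOR).
nsca_method_is_weak() {
    [ "$1" = 0 ] || [ "$1" = 1 ]
}

# Example: the value used in this chapter
echo "decryption_method=10 selects $(nsca_method_name 10)"
```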
[5] http://en.wikipedia.org/wiki/LOKI97
[6] Rijndael-128: 14; Rijndael-192: 15; Rijndael-256: 16

In order for the new configuration to become effective, the xinetd init script is run with the reload argument:

    linux:~ # /etc/init.d/xinetd reload

inetd configuration

If the standard inetd is used, the following line is added to the configuration file /etc/inetd.conf:

    nsca stream tcp nowait nagios /usr/sbin/tcpd /usr/local/sbin/nsca -c /etc/nagios/nsca.cfg --inetd

If you want to leave out the TCP wrapper tcpd, you just omit the string /usr/sbin/tcpd. In this case you must also explicitly specify the user (nagios) with whose permissions NSCA starts, the complete path to the binary nsca, and the configuration file with its absolute path. So that the Internet daemon can take account of the modification, its configuration must be reloaded:

    linux:~ # /etc/init.d/inetd reload

14.3 Client-side Configuration

The configuration file send_nsca.cfg on the client side must contain the same encryption parameters as the file on the Nagios server:

    password=verysecret
    decryption_method=10

Since the key is also written here in plain text, it should not be readable for just any user. For this reason it is best to create a user nagios and a group nagios on the client side:

    linux:~ # groupadd -g 9000 nagios
    linux:~ # useradd -u 9000 -g nagios -d /usr/local/nagios \
              -c "Nagios Admin" nagios

You should now protect the file send_nsca.cfg so that only the user nagios can read it, and ensure, using the SUID mechanism, that the program send_nsca always runs under the user ID of this user.
If you now grant execute permission to the group nagios, only its members may execute the NSCA client program:

    linux:~ # chown nagios:nagios /etc/nagios/send_nsca.cfg
    linux:~ # chown nagios:nagios /usr/bin/send_nsca
    linux:~ # chmod 400 /etc/nagios/send_nsca.cfg
    linux:~ # chmod 4710 /usr/bin/send_nsca
    linux:~ # ls -l /usr/bin/send_nsca
    -rws--x--- 1 nagios nagios 83187 Apr  2 17:56 /usr/bin/send_nsca

14.4 Sending Test Results to the Server

The client program send_nsca reads the details of a host or service check from the standard input, which the administrator must format as follows:[7]

    host-name\tservice\treturn value\toutput
    host-name\treturn value\toutput

send_nsca sends this to the Nagios server. The first line describes the format for service checks and the second line that for host checks. The placeholder return value is replaced by the status determined, that is, 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. output means a one-line text of the type that plugins provide as an aid for the administrator. A tab character (\t) is used as the separator.

In order to make a complete command from this that can be understood by the external command interface, the NSCA daemon first prefixes the timestamp and the matching command (PROCESS_SERVICE_CHECK_RESULT or PROCESS_HOST_CHECK_RESULT). This is why only these two commands can be sent using NSCA.

send_nsca itself has the following options:

-H address
    This is the host name or IP address of the Nagios server to be addressed by NSCA.

-d delimiter
    This is the delimiter for the input; the default is a tab character. The example that follows uses the semicolon as a delimiter.

-c path/to/the/configuration file
    This parameter specifies the path to the configuration file send_nsca.cfg. Since no path has been compiled into the client, send_nsca expects by default to find the file in the current directory. For this reason it makes sense to specify the absolute path with this option.
-p port
    This defines an alternative port if the default, TCP port 5667, is not used.

-to timeout
    After timeout seconds (by default, 10), send_nsca aborts the connection attempt to the NSCA daemon if no connection is established.

[7] Normally you have to ensure that test scripts you have written yourself produce the correct output; if you use Nagios plugins, you must reformat their output accordingly. Since the latter can be run much better directly with NRPE, this should be the exception to the rule.

The functionality of NSCA can be tested with simple test scripts such as the following one. A service that is in a state other than UNKNOWN (e.g., OK) is chosen as the test object, in this case nmbd on the host linux01:

    #!/bin/bash
    CFG="/etc/nagios/send_nsca.cfg"
    CMD="linux01;nmbd;3;UNKNOWN - just one NSCA test"
    /bin/echo "$CMD" | /usr/local/bin/send_nsca -H nagios -d ';' -c "$CFG"

The script puts the service, from Nagios's point of view, into the UNKNOWN status. After it is run, you should see whether the transfer was successful:

    nagios@linux:~$ bash ./test_nsca
    1 data packet(s) sent to host successfully.

As soon as Nagios processes the command and you have reloaded the page in your browser, the Web interface displays the UNKNOWN status for the selected service. With the next active check, the previous status will be restored.

Because it is so simple to send Nagios check results with send_nsca, it is essential that you protect NSCA from misuse, as already demonstrated. On the client, you should restrict access to the client program send_nsca and to its configuration file and make sure that you have secure encryption; on the server, you should explicitly define the sender IP addresses that are to be allowed.

14.5 Application Example I: Integrating syslog and Nagios

Linux and Unix systems as a rule log system-relevant events through syslog.
Sooner or later you will probably want Nagios to also inform the administrator of important syslog events. To do this, you require passive service checks, NSCA for transmitting the results to the Nagios server, and a method of filtering individual log entries.

If you are using syslog-ng[8] instead of the standard BSD syslog, you can make use of its ability to set filters and to format the output using templates. The use of NSCA compensates for the fact that the program cannot itself transmit data in encrypted form.

This connection to Nagios complements programs that evaluate log files, such as logcheck,[9] which is contained in almost every Linux distribution, but it does not replace them. This is because Nagios can send individual e-mails for each event, but not a summary of events, as logcheck does (usually once per hour). In addition, the Web interface always displays only the last event in each case.

[8] The "ng" stands here for next generation.

14.5.1 Preparing syslog-ng for use with Nagios

Apart from the source code, the syslog-ng homepage[10] also provides a detailed manual, which is why we shall discuss only the basic principle at this point. The software differentiates between the source, filter, and destination. All three objects can be combined in any form; they are defined in the configuration file /etc/syslog-ng/syslog-ng.conf:

    # /etc/syslog-ng/syslog-ng.conf
    source local {
        unix-stream("/dev/log");
        internal();
        file("/proc/kmsg" log_prefix("kernel: "));
    };
    destination console_10 { file("/dev/tty10"); };
    filter f_messages {
        not facility(auth, authpriv)
        and level(info .. alert);
    };
    log {
        source(local);
        filter(f_messages);
        destination(console_10);
    };

This example defines three sources at the same time: unix-stream reads from the socket /dev/log, through which most programs send their messages to the syslog. internal is the name of the source that syslog-ng feeds with internal messages, and from the file /proc/kmsg syslog-ng receives kernel messages.
These are given the kernel: prefix, so that they can be distinguished from normal log entries.

[9] http://sourceforge.net/projects/logcheck/
[10] http://www.balabit.com/products/syslog_ng/

The destination definition ensures that all syslog output appears on the console tty10 (this can be displayed with Alt-F10).

filter defines what messages should reach this destination, if any. In the case of the f_messages filter, this is all messages with a category (the level) from info to alert that syslog does not mark with the stamp (the facility; see man syslog.conf and man 3 syslog) auth or authpriv. Alternatively syslog-ng filters according to a search pattern, with the instruction match("pattern"), according to the program doing the logging (program("program name")), and according to the source host (host("hostname")).

Finally the keyword log links the source, filter, and destination. Multiple specifications are possible here, so several sources and destinations can be specified in a single statement:

    log {
        source1; source2; ...
        filter1; filter2; ...
        destination1; destination2; ...
    }

If you specify several filters in a log statement, syslog-ng only allows data through that matches all filter criteria (AND link).

To integrate this into Nagios, use is made of the option of defining a program as a target, which is called for every event:

    destination d_nagios_warn {
        program("/usr/local/nagios/misc/send_syslog.sh"
            template("$HOST;syslog-ng;1;WARNING: $MSG\n")
            template_escape(no));
    };
    destination d_nagios_crit {
        program("/usr/local/nagios/misc/send_syslog.sh"
            template("$HOST;syslog-ng;2;CRITICAL: $MSG\n")
            template_escape(no));
    };

The template directive formats the output so that it is suitable for send_nsca, using a semicolon as the delimiter: host and service names (syslog-ng) are followed by the state (1 = WARNING; 2 = CRITICAL), and then the actual output text is given.
Apart from $HOST and $MSG, syslog-ng has a series of further macros, which are described individually in the documentation on the homepage. The parameter template_escape protects quotation marks in the text and is intended principally for SQL commands, so in this case it can be set to no.

The following script send_syslog.sh uses the bash function read to read from the standard input line by line, and for each line read it calls send_nsca, which sends on the data, as described in this chapter, as a passive test result to Nagios:

    #!/bin/bash
    while read -r line; do
        echo "$line" | /usr/bin/send_nsca -H nagsrv -d ';' \
            -c /etc/nagios/send_nsca.cfg \
            1>/usr/local/nagios/var/send_syslog.log 2>&1
    done

Because a semicolon is used as a delimiter, we specify this explicitly with the option -d. The status report that each send_nsca command displays on the standard output is diverted by the script into a separate log file (/usr/local/nagios/var/send_syslog.log).

Thanks to the program instruction in the syslog configuration, syslog-ng starts the script automatically. This is also the reason that the send_nsca command is in an endless loop: the script keeps running and reads one line per event, so syslog-ng does not have to start an external program every time there is a relevant event.

14.5.2 Nagios configuration: volatile services

In Nagios slang, "volatile" refers to services that show an error state only once. This refers to devices, for example, that automatically reset the state when an error is queried, which means that the error cannot be reproduced.

The same applies to syslog entries: if a check following an error state returns an error, this will always be a second event. So we don't have a continuing error state here, but a problem that has occurred again.

For continuing error states, Nagios normally does not send any further messages for the time being. With the is_volatile parameter, however, it treats every error as if it had just occurred. Nagios logs the state, sends a notification, and runs the event handler, provided one is defined (see Appendix B from page 409).
For syslog-ng, this means that each entry is seen as an independent event. In order that Nagios sees things in this way as well, the corresponding service definition contains the is_volatile parameter:

    define service{
        host_name                 linux01
        service_description       syslog-ng
        active_checks_enabled     0
        passive_checks_enabled    1
        check_freshness           0
        is_volatile               1
        max_check_attempts        1
        normal_check_interval     1
        retry_check_interval      1
        check_command             check_dummy!3!active check
        check_period              none
        contact_groups            localadmins
        notification_options      w,c,u
        notification_interval     480
        notification_period       24x7
    }

Since the Nagios server should not test anything on its own, active_checks_enabled 0 switches off active service checks. However, freshness checking (see Section 13.4 from page 243) can always cause Nagios to perform active tests. To prevent this, we set the check_freshness parameter in this case explicitly to 0.

This service definition does not really require the parameters check_command and check_period, but since these are mandatory parameters, they must still be specified: as check_command, the plugin check_dummy (see Section 7.13 on page 154) is used.

It is also important that max_check_attempts is set to 1, so that a transmitted error state immediately triggers a hard state. With a value larger than 1, Nagios would wait for further error results here before categorizing the problem state as a hard state.

The notification_options parameter ensures that the system informs the specified contact group of all error states (WARNING, CRITICAL, and UNKNOWN). The notification_interval, which defines the interval between two notifications for a continuing error state, is actually superfluous, since Nagios, thanks to is_volatile 1, provides notification of every event immediately, irrespective of what the previous state looked like. But since it is a mandatory parameter, notification_interval still has to be specified.
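The service definition above can be exercised by hand before syslog-ng is involved. The following is a minimal sketch, not part of Nagios or NSCA: format_result is a hypothetical helper that builds one result line in the host;service;state;text layout used throughout this chapter and rejects states outside 0 to 3. The host name nagsrv and the paths in the commented-out call are taken from the examples above and may differ on your system.

```shell
#!/bin/bash
# Sketch only: format_result is a hypothetical helper, not a Nagios tool.
# Usage: format_result HOST SERVICE STATE TEXT
# Prints one line in the semicolon-delimited layout that send_nsca
# accepts when it is called with -d ';'.
format_result() {
    case "$3" in
        0|1|2|3) ;;                       # OK, WARNING, CRITICAL, UNKNOWN
        *) echo "invalid state: $3" >&2
           return 1 ;;
    esac
    printf '%s;%s;%s;%s\n' "$1" "$2" "$3" "$4"
}

# Example call (commented out because it needs a reachable NSCA daemon):
# format_result linux01 syslog-ng 2 "CRITICAL: test entry" |
#     /usr/bin/send_nsca -H nagsrv -d ';' -c /etc/nagios/send_nsca.cfg
```

Sending a line with state 2 should put the syslog-ng service into a hard CRITICAL state at once, since max_check_attempts is 1; a line with state 0 resets it again.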
14.5.3 Resetting error states manually

Events that are taken into account by the syslog filter always inform you of only one current state, which is why the syslog service in Nagios never displays an OK state on its own (Figure 14.2). This problem can be solved with the Web interface, which allows a passive check result to be generated manually.

Figure 14.2: The syslog-ng service in an error state

If you click on the service name in Figure 14.2, the extended status information will be shown (Figure 14.3). There you will find the entry Submit passive check result for this service, with which a test result can be sent manually (Figure 14.4). In this way the syslog-ng service can be reset to its normal state. Since the Web interface always shows only the most recent error state, but not individual error messages, you must look through the e-mail messages to see whether other errors have occurred apart from those displayed by Nagios in the Web interface.

Figure 14.3: The arrow points to the possibility of "generating" a passive test result for the syslog-ng service

You can also define your own service for each syslog event, of course. This may sometimes be quite time-consuming, but it does allow you to separate various messages and their processing states in the Web interface. If the filter in syslog-ng is restricted so that a syslog service object always refers to just one resource to be monitored, you can also leave out the is_volatile parameter.

Figure 14.4: Creating a passive check result

14.6 Application Example II: Processing SNMP Traps

Asynchronous messages that are sent by an SNMP agent (see Section 11.1 from page 178) to a central management unit, called traps in SNMP jargon, can be processed by Nagios in a similar way, using the Nagios Service Check Acceptor (NSCA). In addition, NSCA allows SNMP traps to be accepted on a host other than the Nagios server itself.
Processing SNMP traps with Nagios is particularly worthwhile if the system monitors the network almost completely, and only a few devices or services restrict their communication just to SNMP and SNMP traps. Nagios, or the Open Source tool OpenNMS [11], are no substitutes for real commercial SNMP management systems.

In many cases, SNMP traps are vendor-specific, so that you cannot avoid getting to grips with the appropriate documentation and the vendor-specific MIB (Management Information Base; see Section 11.1.1 from page 179).

14.6.1 Receiving traps with snmptrapd

In order to receive SNMP traps, you require a special Unix/Linux daemon that generates messages for Nagios from them. The software package NET-SNMP, described in Section 11.2.2 from page 187, includes the daemon snmptrapd.

In the following scenario, snmptrapd is installed on a third host (neither the computer generating the trap, nor the Nagios server). It evaluates the information received by means of a script and forwards it with NSCA to the Nagios server. [12]

In the snmptrapd configuration file /etc/snmp/snmptrapd.conf, each trap type is given a separate entry, the syntax of which corresponds to one of the following lines:

    traphandle oid program
    traphandle oid program arguments
    traphandle default program
    traphandle default program arguments

The keyword traphandle is followed either by the object identifier of the desired trap, or by the keyword default. In the second case the entry applies to all traps that do not have their own configuration entry. Finally the program that should run if a relevant trap arrives is specified.

[11] http://www.opennms.org/
[12] If you install snmptrapd on the Nagios server itself, you do not need NSCA, and you can send a correspondingly formatted command, as described in Section 13.2 from page 241, directly to the interface for external commands.
In addition you can also include arguments to be used with this program. But you must be a bit careful when doing this. Quotation marks are passed on by snmptrapd as characters, and spaces are always used as delimiters. This means that you cannot pass on any arguments containing spaces, which you should bear in mind when naming services in Nagios.

snmptrapd gives this program information via its standard input in the following format:

    hostname
    ip-address
    oid value
    ...

The first line contains the fully qualified domain name of the host that sends the message, and the second, its IP address. Then one or more OID-value pairs are given, each on a separate line. A particular event is very often linked to a unique OID-value pair, so that the program can often omit the evaluation of the OID-value pair entirely.

In the following snmptrapd.conf example, each traphandle instruction must be entered on a single line:

    # snmptrapd.conf
    traphandle SNMPv2-MIB::coldStart /usr/local/nagios/libexec/eventhandler/handle-trap SNMP cold-start
    traphandle NET-SNMP-AGENT-MIB::nsNotifyRestart /usr/local/nagios/libexec/eventhandler/handle-trap SNMP restart
    traphandle NET-SNMP-AGENT-MIB::nsNotifyShutdown /usr/local/nagios/libexec/eventhandler/handle-trap SNMP shutdown
    traphandle default /usr/local/nagios/libexec/eventhandler/handle-trap SNMP unknown

The traps used here are sent by the SNMP agent snmpd from the NET-SNMP package by default, as long as a destination was specified in snmpd.conf:

    # snmpd.conf
    trapsink name_or_ip_of_the_nagios-server

If a trap arrives with the OID SNMPv2-MIB::coldStart, for example, snmptrapd starts the script handle-trap with the arguments SNMP (the service name) and cold-start. In this way the script does not have to search first for the necessary information in the OID-value pairs. However, this shortcut only works with trap OID names that describe their function.
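The input format described above can be tried out without a real trap. The following is a minimal sketch under stated assumptions: parse_trap is a hypothetical helper (not part of NET-SNMP or Nagios) that reads the host name, the IP address, and any OID-value pairs from standard input and prints them in labelled form. The host name irouter.example.com in the usage note is invented for illustration.

```shell
#!/bin/bash
# Sketch only: parse_trap is a hypothetical helper, not part of NET-SNMP.
# It reads the handler input from stdin: line 1 is the host name,
# line 2 the IP address, and every further line one "oid value" pair.
parse_trap() {
    read -r trap_host || return 1
    read -r trap_ip   || return 1
    echo "host=$trap_host"
    echo "ip=$trap_ip"
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while read -r oid value || [ -n "$oid" ]; do
        [ -n "$oid" ] || continue        # skip blank lines
        echo "oid=$oid value=$value"
    done
}
```

Feeding it three lines with printf, for example `printf 'irouter.example.com\n192.168.201.4\nSNMPv2-MIB::snmpTrapOID.0 SNMPv2-MIB::coldStart\n' | parse_trap`, shows what a handler script would see for a cold-start trap.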
14.6.2 Passing on traps to NSCA

The script handle-trap, which is run by snmptrapd, breaks down the information passed on and hands it over, correctly formatted, to send_nsca:

    #!/bin/bash
    NAGIOS="nagsrv"
    LOGFILE="/usr/local/nagios/var/handle-trap.log"

    read HOST && echo "host: $HOST" >> $LOGFILE
    read IPADDR && echo "ip: $IPADDR" >> $LOGFILE

    case $IPADDR in
        192.168.201.4)
            HOSTNAME="irouter"
            ;;
        *)
            # silent discard from unknown hosts
            exit 0
            ;;
    esac

    if [ -z "$1" ]; then
        echo "usage: $0 "
        echo "usage: $0 " >> $LOGFILE
        exit 1
    else
        SERVICE="$1"
    fi

    if [ ! -z "$2" ]; then
        SWITCH="$2"
    fi

    case $SWITCH in
        "cold-start")
            OUTPUT="snmpd: Cold Start"
            STATE=0
            ;;
        restart)
            OUTPUT="snmpd: Restart"
            STATE=1
            ;;
        shutdown)
            OUTPUT="snmpd: Shutdown"
            STATE=2
            ;;
        *)
            OUTPUT="Unknown Trap"
            STATE=1
            ;;
    esac

    CMD="$HOSTNAME;$SERVICE;$STATE;$OUTPUT"
    echo "$CMD" >> $LOGFILE
    echo "$CMD" | /usr/bin/send_nsca -H $NAGIOS -d ';' \
        -c /etc/nagios/send_nsca.cfg >> $LOGFILE 2>&1

First it saves the log file and the name of the Nagios server nagsrv, each in a separate variable. The first case statement specifies the host name used by Nagios for the IP address passed on (and temporarily stored in IPADDR). HOST normally contains the fully qualified domain name, which cannot be used directly either, and sometimes it contains just an IP address, so that it is better to use the IP address here. The explicit test also allows the script to discard traps from undesired hosts. In the end, matching traps land on the Nagios server without further authentication. [13]

The following if statement determines whether a service name was also given to the script. If this is the case, then it is saved in the SERVICE variable. If there was a second argument, the procedure is similar. Depending on the value, the next "case $SWITCH" instruction defines the output text and the desired status for Nagios.
The command for NSCA is finally assembled, and the CMD variable is passed on by the script to send_nsca. As in previous examples, a semicolon is used as the delimiter, which must be specified in send_nsca with the option -d.

14.6.3 The matching service definition

As in the syslog-ng example (page 257), we again define the service on the Nagios server as a purely passive one:

    define service {
        host_name                 irouter
        service_description       SNMP
        active_checks_enabled     0
        passive_checks_enabled    1
        check_freshness           0
        max_check_attempts        1
        is_volatile               1
        ...
    }

Since soft states do not make any sense for a single trap message, we should set max_check_attempts to 1. Whether the parameter is_volatile is used or not depends on the purpose to which the service is put. As long as you define a separate service for each error category, there is no problem in omitting is_volatile. But if you cover different error categories with a single service, you should set is_volatile 1, because in this case the previous error will seldom have anything to do with the new one. Section 14.5.2 on page 257 is devoted to the subject of volatile services.

[13] Although SNMPv3 does provide authentication for SNMP traps, this would go beyond the scope of this book.

Chapter 15 Distributed Monitoring

Passive service and host checks can be used to create a scenario in which several noncentral Nagios instances send their results to a central server. In general they transfer their results using the Nagios Service Check Acceptor (see Chapter 14); the central Nagios instance receives them through the External Command File interface and continues processing them as passive checks (see Chapter 13).

What is now missing is the mechanism that prepares each test result of a noncentral Nagios instance to be sent with NSCA.
For such cases, Nagios provides the "obsessive" commands, OCSP ("Obsessive Compulsive Service Processor") and OCHP ("Obsessive Compulsive Host Processor"), two commands designed specifically for distributed monitoring. In contrast to the event handler (see Appendix B from page 409), which responds to changes in status and only passes on check results if the status has changed, these two commands obsessively pass on every test result (Figure 15.1).

Figure 15.1: Distributed monitoring with Nagios

15.1 Switching On the OCSP/OCHP Mechanism

In order to use OCSP/OCHP, several steps are necessary. The mechanism is initially switched on (only) on the noncentral Nagios servers in the global configuration file /etc/nagios/nagios.cfg, where a global command for hosts (OCHP) and services (OCSP) is defined. This causes the noncentral Nagios instance to send every result to the central server.

In the service and host definitions you can additionally set whether the corresponding service or host should use the mechanism or not. For the central Nagios server to be able to use the results transferred, each service or host must finally be defined once again on it.

You should only switch on the two parameters obsess_over_services and obsess_over_hosts in nagios.cfg if you really do want distributed monitoring:

    # /etc/nagios/nagios.cfg
    ...
    obsess_over_services=1
    ocsp_command=submit_service_check
    ocsp_timeout=5
    obsess_over_hosts=1
    ochp_command=submit_host_check
    ochp_timeout=5

Every time a new test result arrives on the noncentral Nagios server, it calls the command object defined with ocsp_command or ochp_command. This causes an additional load on resources.

The two timeouts prevent Nagios from spending too much time on one command. If processing were not terminated (because the command itself has no timeout and the central Nagios server does not react), then the process table of the noncentral Nagios instance would fill very quickly, and might overflow.
If you want to selectively exclude test results for specific services and hosts from transmission to the central Nagios server, the following parameters are used:

    define host{
        ...
        obsess_over_host 0
        ...
    }

    define service{
        ...
        obsess_over_service 0
        ...
    }

With a value of 1 the local Nagios instance sends the results of the host or service check to the central server, but with a value of 0, this does not happen. 1 is the default for both obsess_over_host and obsess_over_service; if results are not to be transferred, then you have to set the relevant parameter explicitly to 0. This is always recommended if the central location is only responsible for particular things, and the remaining administration is carried out on site.

15.2 Defining OCSP/OCHP Commands

Defining the two commands with which the noncentral instances send their results to the Nagios main server in most cases involves scripts that are based on send_nsca (see also the example on page 254). For services, such a script would look like the following one, in this case called submit_service_check:

    #!/bin/bash
    # Script submit_service_check
    PRINTF="/usr/bin/printf"
    CMD="/usr/local/bin/send_nsca"
    CFG="/etc/nagios/send_nsca.cfg"
    HOST=$1
    SRV=$2
    RESULT=$3
    OUTPUT=$4
    $PRINTF "%b" "$HOST\t$SRV\t$RESULT\t$OUTPUT" | $CMD -H nagios -c $CFG

When run, the command expects four parameters on the command line in the correct order: the host monitored, the service name, the return value of the plugin called (0 for OK, 1 for WARNING, etc.), and the one-line info text that is issued by the plugin. To format the data we use the printf function (man printf). The newly formatted string is finally passed on to send_nsca.

The equivalent script for OCHP (stored here in the file submit_host_check) looks something like this:

    #!/bin/bash
    # Script submit_host_check
    PRINTF="/usr/bin/printf"
    CMD="/usr/local/bin/send_nsca"
    CFG="/etc/nagios/send_nsca.cfg"
    HOST=$1
    RESULT=$2
    OUTPUT=$3
    $PRINTF "%b" "$HOST\t$RESULT\t$OUTPUT" | $CMD -H nagios -c $CFG

The only difference is that the service description is left out.
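The tab-delimited layout that both scripts hand to send_nsca can be checked in isolation. The following is a sketch of an alternative formatting step, not the scripts' own method: the two function names are hypothetical, and the tabs are placed in the printf format string with %s for the data, so that backslash sequences occurring in plugin output are passed through unchanged (with "%b" they would be expanded). The trailing newline is also an addition of this sketch.

```shell
#!/bin/bash
# Sketch only: hypothetical helpers that print one tab-delimited result line.

# Usage: format_service_result HOST SERVICE STATE TEXT
format_service_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Usage: format_host_result HOST STATE TEXT
# Same layout as above, without the service description.
format_host_result() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

# Example call (commented out because it needs a reachable NSCA daemon;
# the paths and the server name "nagios" are those used in the scripts above):
# format_service_result bonn01 HTTP 0 "HTTP OK" |
#     /usr/local/bin/send_nsca -H nagios -c /etc/nagios/send_nsca.cfg
```

Piping the output through `od -c` makes the tab characters visible, which helps when a result does not arrive on the central server.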
It is best to store the two scripts, in conformity with the Nagios documentation, in a subdirectory eventhandlers (which normally needs to be created) in the plugin directory (usually /usr/local/nagios/libexec, but for some distributions this will be /usr/lib/nagios/plugins). The definition of the matching command object can refer to this directory through the macro $USER1$. The commands are best defined in the misccommands.cfg file:

    define command {
        command_name    submit_service_check
        command_line    $USER1$/eventhandlers/submit_service_check $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$'
    }

    define command {
        command_name    submit_host_check
        command_line    $USER1$/eventhandlers/submit_host_check $HOSTNAME$ $HOSTSTATEID$ '$HOSTOUTPUT$'
    }

If you use a separate file for this, you must make sure that Nagios will load this file by adding an entry to /etc/nagios/nagios.cfg. The single quotes surrounding the $SERVICEDESC$ macro and the two output macros in the command_line line are important. Their values sometimes contain spaces, which the command line would interpret as delimiters without the quotes.

15.3 Practical Scenarios

One application for distributed monitoring is the monitoring of branches or external offices in which a noncentral Nagios installation is limited to running service and host checks and sending the results to the central instance. The noncentral instances do not need further Nagios functions, such as the notification system or the Web interface.

On the other hand, if administrators look after the networks at the distributed locations, while the central IT department only looks after special services, then the noncentral Nagios server is set up as a normal, full-fledged installation and uses the OCSP/OCHP mechanism to forward selectively only those check results for which the specialists in the central office are responsible.
Whatever the case, you must ensure that the host and service definitions are available both noncentrally and centrally. This can be done quite simply using templates (Section 2.11 on page 54) and the cfg_dir directive (Section 2.1, page 38): you set up the definitions so that the configuration files can be copied 1:1.

15.3.1 Avoiding redundancy in configuration files

In the following example we assume that the noncentral servers only perform host and service checks and send the results to the central server, and do not provide any other Nagios functions.

The following directories are set up on the central host:

    /etc/nagios/global
    /etc/nagios/local
    /etc/nagios/sites
    /etc/nagios/sites/bonn
    /etc/nagios/sites/frankfurt
    /etc/nagios/sites/berlin
    ...

Each of the configurations used for a location lands in the directory /etc/nagios/sites/location. The directory global holds all the definitions that can be used identically at all locations (e.g., the command definitions in checkcommands.cfg).

The directory local takes in the definitions specific to the central server. These include the templates for services and hosts, where a distinction must be made between central and noncentral.

This directory is also created separately on the noncentral servers: only the folders global and sites/location are copied from the central instance to the branch offices.

The three directories are read in with the cfg_dir directive in /etc/nagios/nagios.cfg:

    # -- /etc/nagios/nagios.cfg
    ...
    cfg_dir=/etc/nagios/global
    cfg_dir=/etc/nagios/local
    cfg_dir=/etc/nagios/sites
    ...

Only settings that are identical for the noncentral and central side are used in the service definition:

    # -- /etc/nagios/sites/bonn/services.cfg
    define service{
        host_name              bonn01
        service_description    HTTP
        use                    bonn-svc-template
        ...
        check_command          check_http
        ...
    }

The location-dependent parameters are dealt with by the templates.
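The copying step described above, in which only global and sites/location go to a branch office, can be sketched as follows. This is an illustration under stated assumptions, not a procedure from the Nagios documentation: stage_site_config is a hypothetical helper that fills an empty staging directory, which could then be transferred to the noncentral server by whatever means you already use. The local directory is deliberately left out, because each side maintains its own.

```shell
#!/bin/bash
# Sketch only: stage_site_config is a hypothetical helper.
# Usage: stage_site_config SRC DEST LOCATION
# Copies SRC/global and SRC/sites/LOCATION into DEST and leaves out
# SRC/local as well as all other locations. DEST should be empty or
# not yet exist, otherwise cp -R nests the copied directories.
stage_site_config() {
    src="$1"; dest="$2"; location="$3"
    if [ ! -d "$src/global" ] || [ ! -d "$src/sites/$location" ]; then
        echo "missing global or sites/$location under $src" >&2
        return 1
    fi
    mkdir -p "$dest/sites" || return 1
    cp -R "$src/global" "$dest/" || return 1
    cp -R "$src/sites/$location" "$dest/sites/" || return 1
}

# Example call (commented out; /tmp/nagios-bonn is an invented path):
# stage_site_config /etc/nagios /tmp/nagios-bonn bonn
```

Because the noncentral nagios.cfg reads the same three cfg_dir entries, the staged tree can be dropped into /etc/nagios on the branch server next to its own local directory.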
15.3.2 Defining templates

In order that service definitions are identical on both the central and noncentral servers, the local templates must have the same names as the central ones. In addition you should ensure that the obligatory parameters (see Chapter 2 from page 37) are also all entered, even if they are not required at one of the locations, because together, the template and service definitions must cover all obligatory parameters.

The following example shows a service template for one of the noncentral locations:

    # -- On-Site configuration for the Bonn location
    define service{
        name                      bonn-svc-template
        register                  0
        max_check_attempts        3
        normal_check_interval     5
        retry_check_interval      1
        active_checks_enabled     1
        passive_checks_enabled    1
        check_period              24x7
        obsess_over_service       1
        notification_interval     0
        notification_period       none
        notification_options      n
        notifications_enabled     0
        contact_groups            dummy
    }

The parameters that are important for the noncentral side are those that refer to the test itself, together with obsess_over_service, which must not be left out. It ensures that the check results are sent to the central server.

notifications_enabled switches off notification in this case, since the local admins do not need to worry about error messages from services that are centrally monitored. Alternatively this can be done globally in the noncentral /etc/nagios/nagios.cfg. register 0 ensures that the template is used exclusively as a template, so that Nagios does not interpret it as a separate service definition.
The counterpart with the same name on the central server looks something like this:

    # -- Service template for the central Nagios server
    define service{
        name                      bonn-svc-template
        register                  0
        max_check_attempts        3
        normal_check_interval     5
        retry_check_interval      1
        active_checks_enabled     0
        passive_checks_enabled    1
        check_period              none
        check_freshness           0
        obsess_over_service       0
        notification_interval     480
        notification_period       24x7
        notification_options      u,c,r
        notifications_enabled     1
        contact_groups            admins
    }

The parameter passive_checks_enabled is of importance here, as well as the configuration of the notification system. On the central side, the parameters involving the test itself come into play only if freshness checking is used (see Section 13.4 from page 243). This works only if the central Nagios server is itself in a position to actively test all services if there is any doubt. Since the check_command in this simple template solution is given in the location-dependent service definition, which is identical on the noncentral and central servers, this will work only if the same command object can be used both centrally and noncentrally, that is, if the object definitions in global/checkcommands.cfg match on both sides.

In the example, however, we completely switch off active tests of the Bonn services on the central server, with check_period none and check_freshness set to 0. The system described so far can also be applied to host checks, of course.

Chapter 16 The Web Interface

On the left is the navigation area with the unmistakable black background, and the remaining area is for displaying the output of the CGI scripts called (Figure 16.1): the Nagios Web interface is that simple. The start screen provides access to the program documentation, which is extremely useful if you just want to look up something quickly.

Provided you have the correct access rights, the Web interface allows much more than just looking up information. You can run a series of commands and control Nagios actively: from issuing a single command, to switching messages on and off, to restarting the server.
A separate book would be needed to describe all the features completely. This is why we will just describe the concept here on which the CGI programs are based [1], in this way giving you a picture of the extensive range of options available.

[1] There is a good reason that we refer here to CGI programs and not to CGI scripts: all CGI programs for Nagios 2.0 are C programs.

Many functions use the very same CGI program. If you move the mouse up and down in the navigation area shown in Figure 16.1 and observe the status display of the browser when doing this, which reveals the URLs to be called, you will see that in the Monitoring section up to the Show Hosts: entry field, the CGI program status.cgi is always called, with just four exceptions. Only the parameters are different. Things are similar for the CGI program cmd.cgi, with which general commands can be run. The parameters passed specify whether a comment is to be entered, or a message enabled or disabled, or if Nagios is to be restarted.

Figure 16.1: Start page of the Nagios Web interface

Table 16.1: Overview of CGI programs

CGI program         Description
status.cgi          Status display in various forms; by far the most important CGI program (Figures 16.10 to 16.14, page 280)
statusmap.cgi       Topological representation of the monitored hosts (see Figure 16.26, page 292)
statuswrl.cgi       Topological representation in 3D format; requires a VRML-capable browser and allows interactive navigation in a virtual space (Figure 16.28, page 294)
statuswml.cgi       Simple status page for WAP devices (cell phone)
extinfo.cgi         Additional information on a host or service, with the possibility of running commands (Figure 16.4, page 277)
cmd.cgi             Running commands (Figure 16.22, page 288)
tac.cgi             Overview of all services and hosts to be monitored, the Tactical Overview (see Figure 16.25 on page 291)
outages.cgi         Network nodes that cause the failure of partial networks (Figure 16.29, page 295)
config.cgi          Display of Nagios object definitions (Figure 16.30, page 296)
avail.cgi           Availability report (e.g., "98 percent of all systems OK, 2 percent WARNING"; see Figure 16.31, page 296)
histogram.cgi       Histogram of the number of events occurring (Figure 16.33, page 298)
history.cgi         Display of all events that have ever occurred (Figure 16.34, page 300)
notifications.cgi   Overview of all sent notifications (Figure 16.35, page 300)
showlog.cgi         Display of all log file entries (Figure 16.36, page 301)
summary.cgi         Report of events, which can be compiled by host, service, error category, and time period (Figure 16.38, page 303)
trends.cgi          Time axis recording the states that have occurred (Figure 16.39, page 304)

Table 16.1 shows an overview of all the CGI programs included in the package. They all check to see whether the person running the requested action is allowed to do so. Normally a user can only access the hosts and services for which he is entered as the contact. In addition there is the possibility of assigning specific users more comprehensive rights, so that they are basically allowed to display all hosts and services, for example, or to request system information. Settings for other users are made in the cgi.cfg configuration file, and the authentication parameters are described in Appendix D.2, page 443.
16.1 Recognizing and Acting On Problems

A suitable starting point for the administrator is the Service Problems page, which can be reached through the menu item shown in Figure 16.2. You can see all problems at a glance. If there is just a service-related problem, but not a host-related one, the host name in the Host column has a gray background, but a red background means the host itself is the source of the trouble.

Figure 16.2: The menu item Service Problems brings current problems to attention

The hosts sls-mail and sls-proxy, which have failed in Figure 16.2, can be seen again in the Host Problems menu item (Figure 16.3): sls-mail cannot be reached (UNREACHABLE), so the real problem lies in the failure of the host sls-proxy. This dependency is illustrated in the Outages menu item (Figure 16.29, page 295) or the Status Map (Figure 16.26, page 292). In Figure 16.26 the two failed hosts are shown with a red background, and you can also clearly see which host is dependent on the other (always from the point of view of the central Nagios host).

Figure 16.3: The Host Problems menu item reveals this display

16.1.1 Comments on problematic hosts

The administrator clarifies the problem with the external office by telephone: the DSL connection has failed. He reports this failure to the provider responsible. To stop his colleagues from going to the same trouble again, the admin enters a corresponding comment on the failed host. To do this he clicks on the host name in the status display, which takes him to an information page for this specific host (Figure 16.4), the options of which are described in more detail in Section 16.2.2, page 284.
Figure 16.4: extinfo.cgi provides additional information on the selected host

Using the Add a new comment link at the bottom of the page, the CGI program cmd.cgi (Section 16.2.3, page 288), which is already prepared for this task by the passing on of a corresponding parameter [2], allows a comment to be recorded (Figure 16.5). The host name is already shown, and the checkmark in the Persistent box ensures that the comments will also "survive" a Nagios restart. The user name filled out in the Author (Your Name): field can be edited, as can the actual comment in the Comment field.

Figure 16.5: Entering a comment for a host

[2] cmd_type=1&host=sls-proxy. More on the parameters in Section 16.2.3, page 288.

The administrator confirms the entry with the Commit button. Returning to the status overview, for example with the Service Problems menu item, the administrator will see a speech bubble next to the host name, indicating that a comment exists for this host (Figure 16.6). Clicking on the icon opens the corresponding information page and takes the admin directly to the comment entries (Figure 16.7). Clicking on the trash can icon in the Actions column deletes these individually, if required.

Figure 16.6: A speech bubble displays the existence of comments

Figure 16.7: A click on Delete all comments deletes all comments at once

16.1.2 Taking responsibility for problems: acknowledgements

Acknowledgements (so spelled on the Web interface) are oriented more closely to the workflow than simple comments.
An acknowledgement signals to other administrators that somebody is already working on a problem, so nobody else needs to get involved with it for the time being. In the status overview, a small laborer icon symbolizes this form of taking responsibility (Figure 16.9), and Nagios additionally notifies the relevant contacts.[3]

To issue such a statement, the link Acknowledge this Host Problem is used on the extended info page for the host in question. As well as the fields used for entering a normal comment, there are two checkboxes in this case (Figure 16.8): Sticky Acknowledgement, which if checked prevents periodic notification while the error status persists, and Send Notification. If the latter is also checked, Nagios notifies the other administrators.

[3] Sending a notification to the contact addresses in charge did not work up to and including version 2.0b3, however.

Figure 16.8: Entry dialog for a host acknowledgement

What we are demonstrating here, using a faulty host state, can also be applied to faulty services. The CGI programs are the same, and through the parameters passed they receive information on whether a host or service is involved, and react accordingly; the only difference is that the host field is accompanied by a service field.

Figure 16.9: A laborer icon shows that an admin has already taken on responsibility for the problem (acknowledgement)

16.2 An Overview of the Individual CGI Programs

At the time of going to press, this chapter was the most extensive documentation on the Nagios Web interface, especially for the individual CGI scripts. But for reasons of space, we shall not go into every detail. If you want to know more, you must take a look at the source code of the scripts or consult the nagios-users[4] mailing list. This list is also read by the Nagios developers, and many a question is answered there for which there is currently no documentation.

16.2.1 Variations in status display: status.cgi

By far the most important CGI program, status.cgi is responsible for the status display.
What it shows is determined by three parameter groups. The first one defines whether the Web page generated displays all hosts, a specific host, a host group, or a service group:

    http://nagiosserver/nagios/cgi-bin/status.cgi?host=all
    http://nagiosserver/nagios/cgi-bin/status.cgi?hostgroup=all
    http://nagiosserver/nagios/cgi-bin/status.cgi?servicegroup=all

With host you can select individual hosts, and all in this case stands for all hosts. hostgroup enables a specific host group to be displayed, and again you can use all to stand for all host groups. Finally, servicegroup tells the CGI program to display either the individual service group given as a value, or all service groups, given with all.

[4] http://lists.sourceforge.net/mailman/listinfo/nagios-users

The outputs of host=all and hostgroup=all differ only in their style, which is defined by the second parameter group. For host=all, style=detail is the default setting, and for hostgroup=all, it is style=overview. status.cgi?host=all&style=overview therefore delivers the same result as status.cgi?hostgroup=all.

Hosts that do not belong to a host group only appear in the detail view host=all&style=detail or hostgroup=all&style=hostdetail. All other display styles always show entire host groups, from which individual hosts may be missing.

Figure 16.10: The overview output style

status.cgi provides five possible output styles: overview represents the hosts in a table, but summarizes the services according to states (Figure 16.10). For the host group SAP, you would call the corresponding display with the URL

    http://nagiosserver/nagios/cgi-bin/status.cgi?hostgroup=SAP&style=overview

The style value summary compresses the output of overview: status.cgi only displays one host group for each line (Figure 16.11).

Figure 16.11: The summary output style

The grid style provides an extremely attractive summary in which you can see the status of each individual service by means of the color with which it is highlighted (Figure 16.12).
detail shows each service in detail on a separate line (Figure 16.13). The hostdetail output style is limited to host information, providing detailed information with one line for each host (Figure 16.14).

Figure 16.12: The grid output style

Figure 16.13: The detail output style

Figure 16.14: The hostdetail output style

The third and final parameter group allows you to influence, through selectors, what states and what properties are shown by status.cgi, such as all services in an error state for which no acknowledgement has yet been set by an administrator (see Section 16.1.2, page 278). States are passed on with the hoststatustypes or servicestatustypes parameter, properties with hostprops and serviceprops. All four parameters demand numerical values after the equals sign, and these are summarized in Tables 16.2, 16.3, and 16.4.

Table 16.2: Possible values for hoststatustypes

    Value   Description
    1       PENDING (a result of the very first test planned for this host is not yet available)
    2       UP
    4       DOWN
    8       UNREACHABLE

Table 16.3: Possible values for servicestatustypes

    Value   Description
    1       PENDING (service was originally planned for a check, but so far no result is available)
    2       OK
    4       WARNING
    8       UNKNOWN
    16      CRITICAL

Table 16.4: Possible values for hostprops and serviceprops

    Value   Description
    1       Scheduled downtime (downtime planned)
    2       No scheduled downtime (no downtime planned)
    4       Acknowledgement (status confirmed by the admin)
    8       No acknowledgement
    16      Host/service check disabled
    32      Host/service check enabled
    64      Event handler disabled
    128     Event handler enabled
    256     Flap detection disabled
    512     Flap detection enabled
    1024    Host/service oscillates (flapping)
    2048    Host/service does not oscillate
    4096    Hosts or services currently excluded from notification
    8192    Notification enabled
    16384   Passive host/service checks disabled (Chapter 13, page 239)
    32768   Passive host/service checks enabled
    65536   Hosts/services for which there is at least one result determined by a passive test
    131072  Hosts/services for which there is at least one active check result

If you want to query several states or properties simultaneously, you just add the specified values together:

status.cgi?host=all&servicestatustypes=28
    shows all services with an error status: WARNING, UNKNOWN, and CRITICAL, that is, 4 + 8 + 16 = 28. This query is identical to the Service Problems menu item in the navigation area.

status.cgi?hostgroup=all&hoststatustypes=12&style=hostdetail
    corresponds to the Host Problems menu item in the navigation area. It queries all hosts which are either DOWN or UNREACHABLE (here 4 + 8 = 12). Since only host information should be shown, but no service information, the output style is hostdetail.

status.cgi?host=all&servicestatustypes=24&serviceprops=10
    is a variation of the first example: only the states UNKNOWN and CRITICAL (8 + 16 = 24) are shown, and only those that neither show a planned downtime nor have already been confirmed (2 + 8 = 10).

The CGI program lists the filter parameters each time in a separate box. Figure 16.15 shows this for the third example.

Figure 16.15: This information box shows what states and properties status.cgi should display

If you want, you can adapt the navigation area to your own requirements or just use the existing one. The main page is built from frames, and the navigation area itself is defined by a normal HTML file: /usr/local/nagios/share/side.html.[5] An example of a changed side.html is provided on the Nagios Demo page[6] at Netways.[7]

[5] If you have kept to the installation in this book.
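The additive filter values for status.cgi can also be expressed as bit flags. The following is only a minimal sketch: the class names and the helper status_query are illustrative assumptions and not part of Nagios; only the numeric values come from Tables 16.2 to 16.4.

```python
from enum import IntFlag
from urllib.parse import urlencode


class HostStatus(IntFlag):
    """Values for hoststatustypes (Table 16.2)."""
    PENDING = 1
    UP = 2
    DOWN = 4
    UNREACHABLE = 8


class ServiceStatus(IntFlag):
    """Values for servicestatustypes (Table 16.3)."""
    PENDING = 1
    OK = 2
    WARNING = 4
    UNKNOWN = 8
    CRITICAL = 16


class Props(IntFlag):
    """The first four values for hostprops/serviceprops (Table 16.4)."""
    SCHEDULED_DOWNTIME = 1
    NO_SCHEDULED_DOWNTIME = 2
    ACKNOWLEDGED = 4
    NOT_ACKNOWLEDGED = 8


def status_query(**params):
    """Build the query part of a status.cgi call (hypothetical helper).

    Flag combinations are converted to their plain integer sum,
    which is what the CGI program expects after the equals sign.
    """
    plain = {
        name: int(value) if isinstance(value, IntFlag) else value
        for name, value in params.items()
    }
    return "status.cgi?" + urlencode(plain)


# All services in an error state: 4 + 8 + 16 = 28
problems = ServiceStatus.WARNING | ServiceStatus.UNKNOWN | ServiceStatus.CRITICAL
print(status_query(host="all", servicestatustypes=problems))

# UNKNOWN or CRITICAL, no planned downtime, not yet acknowledged: 24 and 10
print(status_query(
    host="all",
    servicestatustypes=ServiceStatus.UNKNOWN | ServiceStatus.CRITICAL,
    serviceprops=Props.NO_SCHEDULED_DOWNTIME | Props.NOT_ACKNOWLEDGED,
))
```

Because every value in the tables is a distinct power of two, combining flags with | gives the same result as adding the numbers.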
[6] http://nagios-demo.netways.de/
[7] http://www.netways.de/

16.2.2 Additional information and control center: extinfo.cgi

If called with the host or service parameter, extinfo.cgi not only provides detailed information on a specific host (Figure 16.4, page 277) or service, it also serves as a control center for hosts and services, for host groups (parameter hostgroup), and for service groups (servicegroup). Depending on the object class for which it is called, you can run various commands from here.

In the area on the left, the status of the host is extensively documented, and in the box on the right, headed Host Commands, there is a selection of commands that can be run. These commands call cmd.cgi (Section 16.2.3, page 288) and only function if the interface for external commands (Section 13.1, page 240) is active. The lower area of the page allows you to enter object-specific comments, read them, and delete them again. The Web page that extinfo.cgi generates for services also follows this pattern.

Corresponding pages for service and host groups (Figure 16.16), on the other hand, allow only group-specific commands to be run and do not show any additional information. Each command applies to the entire group, sparing you from a lot of mouse clicking. Disable notifications for all hosts in this hostgroup, for example, ensures that Nagios does not send any more messages for hosts in this host group.
Figure 16.16: Command center for the SAP host group: extinfo.cgi?type=5&hostgroup=SAP

Apart from hosts, services, and the corresponding groups, the CGI program has other display functions, enabled by the CGI parameter type:

    http://nagsrv/nagios/cgi-bin/extinfo.cgi?type=value

Depending on the value specified, further parameters are required, so to display a service you also have to include the host name and service description:

extinfo.cgi?type=0
    Shows information (such as starting time and process ID) for the Nagios process itself and all global parameters (whether notifications are sent, performance data processed, etc.; see Figure 16.17). In the Process Commands box the global parameters can be changed, and Nagios can also be stopped and restarted.

Figure 16.17: Information on the Nagios process and global settings: extinfo.cgi?type=0

extinfo.cgi?type=1&host=host
    Shows commands and information on the host (see Figure 16.4, page 277).

extinfo.cgi?type=2&host=host&service=service
    The same for the service.

extinfo.cgi?type=3
    Shows all available host and service comments on a single page (Figure 16.18).

Figure 16.18: Overview of all existing comments: extinfo.cgi?type=3

extinfo.cgi?type=4
    Provides information on the performance of Nagios, separated according to host and service, as well as active and passive checks (Figure 16.19).

Figure 16.19: Information on the performance: extinfo.cgi?type=4

    The middle column reveals how many of the planned tests Nagios has already performed in the last 1, 5, 15, and 60 minutes. As long as there are checks for which normal_check_interval is more than five minutes, the first two values can never reach 100 percent.

    The right-hand columns provide the real value of this page: Check Execution Time specifies the minimum, maximum, and average time which Nagios requires to perform active host and service checks. Check Latency measures the interval between the planned start and the actual running time of a test.
    If this delay is considerably larger than one or two seconds, Nagios probably has a performance problem. One possible cause is that the system is processing performance data too slowly, but low-performance hardware may also play a role here. Searching for the cause can sometimes turn out to be very difficult, and the original documentation[8] provides a number of tips on the subject.

extinfo.cgi?type=5&hostgroup=hostgroup
    Shows the command center for a host group (Figure 16.16).

extinfo.cgi?type=6
    Shows all planned maintenance periods for hosts and services (Figure 16.20).

[8] /usr/local/nagios/share/docs/tuning.html

Figure 16.20: Overview of all planned maintenance periods: extinfo.cgi?type=6

extinfo.cgi?type=7
    Shows an overview of all planned tests, sorted by their next scheduled time (see Figure 16.21). Next to this, extinfo.cgi also lists the time of the last check. The Active Checks column shows whether the respective tests are active or not, and in the Actions column the planned check can be deleted or moved to a different time.

extinfo.cgi?type=8&servicegroup=servicegroup
    Shows the command center for a service group, identical in structure to the command center of a host group.

Figure 16.21: All planned tests, sorted by their planned implementation time: extinfo.cgi?type=7

16.2.3 Interface for external commands: cmd.cgi

As a real all-rounder, cmd.cgi, with some 100 functions, covers nearly all the possibilities that the interface provides for external commands. The cmd_typ parameter defines which of these the CGI program should run. The command

    http://nagsrv/nagios/cgi-bin/cmd.cgi?cmd_typ=6

switches off active service checks for a specific service (Figure 16.22). In order to describe the desired service uniquely, you must specify the host and service description. If you run the CGI program manually, the Web form shown queries these values, and if cmd.cgi is started by another CGI program, the required data is passed through CGI parameters.
Possible parameters here are host, service, hostgroup, and servicegroup, which are followed by an equals (=) sign and then the appropriate Nagios object.

Figure 16.22: Disabling a service check with cmd.cgi?cmd_typ=6

Figure 16.23 lists the commands which refer to a host or service, and Figure 16.24 shows those that refer to the control of global parameters (corresponding to the values in the main configuration file nagios.cfg). The source code file include/common.h contains a complete list of all possible values, including ones that are planned but not yet implemented.

The first column in Figures 16.23 and 16.24 describes the function of the command: ADD_HOST_COMMENT adds a comment to a host, and DISABLE_ACTIVE_SVC_CHECK switches off active checks for a service (in abbreviated form: SVC).

The columns after this specify the object type to which the respective function refers. To add a comment with ADD_HOST_COMMENT, you must specify the host in question. For this reason the function code 1 is shown in the Host column. A specific active service check can only be switched off if the matching service is named, so the function code 6 is to be found in the Service column. With 16 you switch off all active service checks on a host to be specified; there are also corresponding codes for all active service checks for a host or service group.

With ACKNOWLEDGE_PROBLEM, an administrator confirms that he is taking care of a specific problem. 33 (Host column) refers to a host problem, and 34 (Service column) to a service problem. The gray fields mean that there is no corresponding function for host and service groups. The Web form that opens with cmd_typ=33 (Figure 16.8, page 279) then allows a comment to be entered.

Figure 16.23: Host/Service-related codes for cmd.cgi?cmd_typ=

Functions that refer to global parameters (Figure 16.24) can normally only be switched on or off. So the value 11 in the Start column for NOTIFICATIONS means that this command code switches on all notifications globally, while 12 switches them off globally.
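The way cmd.cgi is addressed through cmd_typ and object parameters can be sketched in a few lines. This is an illustration under assumptions rather than anything shipped with Nagios: the dictionary keys and the function cmd_url are invented labels, while the numeric codes, the parameter names, and the server name nagsrv are the ones quoted in the text.

```python
from urllib.parse import urlencode

# Example server name as used in the text.
BASE_URL = "http://nagsrv/nagios/cgi-bin/cmd.cgi"

# Command codes quoted in the text; the keys are illustrative labels only.
CMD_TYP = {
    "add_host_comment": 1,
    "disable_active_svc_check": 6,
    "enable_notifications": 11,
    "disable_notifications": 12,
    "disable_host_svc_checks": 16,
    "acknowledge_host_problem": 33,
    "acknowledge_svc_problem": 34,
}

# Object parameters that cmd.cgi accepts according to the text.
OBJECT_PARAMS = ("host", "service", "hostgroup", "servicegroup")


def cmd_url(command, **objects):
    """Return the cmd.cgi URL for one of the labelled commands."""
    unknown = sorted(set(objects) - set(OBJECT_PARAMS))
    if unknown:
        raise ValueError("unsupported parameter(s): " + ", ".join(unknown))
    query = {"cmd_typ": CMD_TYP[command], **objects}
    return BASE_URL + "?" + urlencode(query)


# Form for switching off the active check of one service on one host
print(cmd_url("disable_active_svc_check", host="sap-12", service="PING"))

# Global switch: no object parameter needed
print(cmd_url("disable_notifications"))
```

Calling such a URL only opens the corresponding Web form; the command itself is not executed until the form is committed.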
If you are not quite certain whether the chosen function does what you really wanted, it is best to run cmd.cgi manually with the corresponding function code, as shown here:

    http://nagsrv/nagios/cgi-bin/cmd.cgi?cmd_typ=12

The Web page generated in this way always has a small gray box next to the required entry fields that explains the corresponding command (Figure 16.22, on the right side of the page).

Figure 16.24: cmd.cgi command codes for global parameters

16.2.4 The most important things at a glance: tac.cgi

As a “tactical overview,” tac.cgi provides a wealth of information on a single Web page, displayed in summary form (Figure 16.25). On the left-hand side of the page you can see, in order of priority, first the failure of entire network ranges (Network Outages), followed by the status of hosts and services, and at the bottom tac.cgi lists whether individual monitoring features such as notifications and event handlers are active.

Up to this final section, everything is concentrated on displaying problems. Provided everything is OK, the CGI merely shows the number of unproblematic services or hosts, highlighted in light gray (and announces 47 Up, for example, in the Hosts box). In problem cases it distinguishes between open problems, which nobody has looked at yet (highlighted in red, e.g., 2 Unhandled Problems for Services → Critical), and those for which an administrator has already taken responsibility through an acknowledgement (pink background, like 1 Acknowledged for Services → Unknown). If host or service checks are disabled, these are also shown with a pink background, since they are problems that do not require the immediate attention of the admin (e.g., 2 Disabled for Services → Ok).

Enabled features in the lower part are marked by tac.cgi in green, and disabled ones in red. The vertically written green Enabled in Notifications means that notifications are enabled globally, whereas the entry 2 Services Disabled on a red background means that they were explicitly switched off for two individual services.
For all the problems displayed, a link takes you to an overview specifically showing the hosts and services in question.

Figure 16.25: Tactical overview with tac.cgi

On the right-hand side of the page the upper box summarizes the Nagios performance data from extinfo.cgi?type=4 (see page 285), which can be shown in detail. The bar graph beneath it shows the health of the entire network monitored as a percentage. If you move the mouse over one of the bars, you will also see the percentage as a number.

16.2.5 Network plan: the topological map of the network (statusmap.cgi)

statusmap.cgi (Figure 16.26) provides a view of the dependencies between the monitored hosts. Starting from the central Nagios server in the middle, lines connect all hosts that the server reaches directly, that is, those whose host definitions do not need the parents parameter to be specified (see Section 2.3, page 44).

The graphics also reveal the hosts to which Nagios has only indirect access through other hosts. So between sls-mail and the Nagios server in Figure 16.26 lie the hosts sls-proxy, hspvip, and pfint. sls-proxy, as the comment Down and the red (instead of green) background suggest, has failed. Since sls-mail depends on this, it is in an UNREACHABLE state, which statusmap.cgi also marks with a red background.

Figure 16.26: Dependencies of monitored hosts shown graphically

How Nagios arranges the hosts in the graphics is defined by the parameter default_statusmap_layout (page 444) in the configuration file cgi.cfg. The layout can also be changed with a selection window in the Web interface (at the top right in Figure 16.27). The figure shows the demo system of Netways,[9] whose appearance depends on user-specific coordinates, which in this case you have to specify individually for each host (see page 310). The question mark icon supplied by Nagios has been replaced with nicer pictures by the operator of the site. Coordinates and icons are defined with the hostextinfo object, described in more detail in Section 16.4.1, page 307.
[9] http://netways.de/Demosystem.1621.0.html

Figure 16.27: Status map with self-defined coordinates and icons

If you move the mouse onto a particular host, Nagios opens a yellow window at the top left with status information, which includes the IP address, current status information, and the time of the last check. At the bottom of this box, statusmap.cgi summarizes the states of the services running on this host.

If you double-click on a particular host, Nagios branches off to the usual status overview, which apart from data on the host selected, also displays all the services belonging to this host (Figure 16.13 on page 281 gives an example).

16.2.6 Navigation in 3D: statuswrl.cgi

statuswrl.cgi allows you to move through a 3D representation of the network plan (Figure 16.28). In this you can zoom in on hosts, move the overall view, rotate it, etc.

A VRML-capable browser is necessary for the display.[10] Although the original documentation[11] provides links to the corresponding plug-ins, two of them are out of date, and only Cortona[12] could be reached at the time of going to press. This plugin does not work under Linux, however; in Windows it works with Internet Explorer, and also with Netscape, Mozilla, and Firefox.[13]

[10] The Virtual Reality Modeling Language (VRML), version 2.0/1997, is used to describe the virtual “space.”
[11] /usr/local/nagios/share/docs/cgis.html#statuswrl_cgi

Of the VRML plugins for Linux (three well-known projects are OpenVRML,[14] freeWRL,[15] and vrwave[16]) the standard Linux distributions usually do not include a finished package, so you are dependent on external packages. There are binary RPM packages for OpenVRML, but the version current at the time of going to press, 0.15.9, needs the very newest libc and therefore cannot even be installed in SuSE Linux 9.3.
You should not try compiling the software yourself unless you are an experienced system administrator or software developer: there are a large number of pitfalls. If you have never worked with the Java compiler before and have not compiled complex software packages such as Mozilla or Firefox yourself, then you should leave it alone.

Figure 16.28: This picture marks the beginning of the tour through your own network

But all of this is no reason to despair, since the use of 3D navigation is questionable anyway, especially as the 2D view of the normal status map displays all the information required, and displaying simple flat graphics in the browser takes up considerably less time than CPU-intensive 3D rendering. Before you rush into the adventure of compiling software yourself, we recommend that you decide for yourself, using the Cortona plugin, whether it is worth the effort of compiling a project like OpenVRML.

[12] http://www.parallelgrafics.com/products/cortona/
[13] For Mozilla and Firefox you have to install it manually: select Custom instead of Typical in the installation routine, and under unsupported browsers specify the plug-in directory of the browser.
[14] http://www.openvrml.org/
[15] http://freewrl.sourceforge.net/
[16] http://www.iicm.edu/vrwave/

16.2.7 Querying the status with a cell phone: statuswml.cgi

In order to make the information provided by Nagios accessible for WAP[17]-capable devices without a fully functional browser, statuswml.cgi generates a Web page in the WML format,[18] which can be displayed with a cell phone, provided that the Nagios server is reachable from the Internet. Apart from the status query for hosts and services, the CGI program also allows tests and notifications to be switched off and existing problems to be confirmed with acknowledgements.

You should think carefully before you make Nagios accessible over the Internet: Nagios makes available much sensitive data that can be misused by hackers.
In case of doubt, you’re better off doing without it. Without direct Internet access, statuswml.cgi is useless, since a cell phone cannot use protected access methods such as a VPN tunnel. This is why we shall not introduce statuswml.cgi in great detail at this point.

16.2.8 Analyzing disrupted partial networks: outages.cgi

The CGI program outages.cgi shows, in a host overview, only those network nodes that are responsible for the failure of a partial network. In contrast to a status overview, as in Figure 16.14, page 281, outages.cgi specifies in the # Hosts Affected column how many services and hosts this affects in each case (Figure 16.29).

Figure 16.29: As long as sls-proxy fails, Nagios cannot reach any hosts lying behind it

With the icons in the Actions column you call other CGI programs that selectively filter out information on the host shown here. From left to right, they show the status display in the detail view (traffic light), the topological network view (network tree), the 3D view (3-D), the trend display (graph), the log file entries for the host (spreadsheet), and the display of notifications which have been made (megaphone).

16.2.9 Querying the object definition with config.cgi

config.cgi shows a tabular overview of the definitions of all objects of a type that can be specified (Figure 16.30); the type of object involved can be defined in the selection field in the top right corner. Where the display itself contains Nagios objects (in the host view Host Check Command, Default Contact Group, and, not visible in the picture, Notification Period), a link takes you directly to the configuration view of this object type.

[17] Wireless Application Protocol.
[18] The Wireless Markup Language contains a part of HTML, heavily reduced in its functionality.

Figure 16.30: config.cgi displays the current configuration of the selected object class, here hosts (extract)

The CGI program does not provide any way of changing anything in the settings.
In addition, only users who are entered in the parameter authorized_for_configuration_information (configuration file cgi.cfg, page 444) have access to this view.

16.2.10 Availability statistics: avail.cgi

If you are monitoring systems, then you also take an interest in their availability. avail.cgi first asks whether you are interested in Hosts, Services, Hostgroups, or Servicegroups. After you have selected a time period, you will see an overview, as in Figure 16.31. For Services and Hosts you can also have the availability data for All Hosts or All Services presented as a CSV file.

Figure 16.31: An availability report using the example of the SAP-Services service group

avail.cgi shows the hosts involved separately from the services. How long a service or host remained in a particular state can be seen, in percent, from the corresponding colored column: green for OK, yellow for WARNING, red for CRITICAL (service), DOWN, and UNREACHABLE (host). The column that shows how much time the status of a service was UNKNOWN is shown in orange. Incomplete log files show up in the Undetermined column. If there is a value larger than zero, then there are periods for which Nagios cannot make a statement concerning the state.

Below each table, the Average line specifies the average of the individual values. In Figure 16.31 the hosts involved were available 99.965 percent of the time.

avail.cgi shows the availability twice in each case: first as an absolute value for the evaluation period, and then (in brackets) with respect to the time during which data actually was available. As long as the Time Undetermined column displays 0.000%, the two availability values match.

If you click on one of the hosts or services displayed, a detailed view will appear. Figure 16.32 shows such a view for the host sap-12.
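The relationship between the two availability values that avail.cgi displays can be illustrated with a small calculation. This is a sketch under the assumption that the bracketed value simply leaves the undetermined time out of the denominator; the function name and the sample durations are made up for illustration and are not taken from avail.cgi itself.

```python
def availability(up, down, unreachable, undetermined):
    """Return (percent of total time, percent of known time) for the UP state.

    All arguments are durations in seconds within the report period.
    """
    total = up + down + unreachable + undetermined
    known = total - undetermined  # time for which log data exists
    pct_total = 100.0 * up / total if total else 0.0
    pct_known = 100.0 * up / known if known else 0.0
    return pct_total, pct_known


# Complete log data: both values match.
print(availability(up=86000, down=400, unreachable=0, undetermined=0))

# One tenth of the period is undetermined: the bracketed value is higher.
print(availability(up=80, down=10, unreachable=0, undetermined=10))
```

With no undetermined time the two percentages are identical, which corresponds to the case of 0.000% in the Time Undetermined column.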
Figure 16.32: The availability of the host sap-12 explained in detail

Below a bar diagram that shows the states over the selected period in color, there is detailed information on the host itself, followed by statistics on the availability of the services that are monitored on this host. This includes an extract from the log file, which only shows the entries relevant to the availability of the host; that is, HOST UP, HOST DOWN, or HOST UNREACHABLE. The log file entries are cut off by avail.cgi to save space.

16.2.11 What events occur, how often? histogram.cgi

If the state of a host or service changes, this is called an event. The CGI program histogram.cgi shows the frequency of such events in different views. If you select Day of the Month as the Breakdown type, it illustrates what event took place on which day of the month, and how often (Figure 16.33). For services, the red graph stands for CRITICAL, the orange one for UNKNOWN, yellow for WARNING, and green for OK. The curve for hosts in the DOWN state is marked by histogram.cgi in red, that for UNREACHABLE hosts in wine-red, and the green line stands, as usual, for OK.

Figure 16.33: How many events of what type were there on which day?

If you choose the variation Day of Week, the Web page shows on which day of the week most events occur, so you can find out whether Monday really is always the worst day. In addition to this you can have the frequency presented by hour of the day (Hour of Day) or by month of the year (Month). With Report Period you can adjust the report period.

With Assume state retention you can specify whether the previously existing states are retained and included in the evaluation (yes) or not (no). If you have configured Nagios so that it explicitly logs the states of the monitored hosts and services at a restart or when the log file is rotated,[19] and if you set Initial states logged to yes, the script includes this explicitly in the evaluation.
A no ignores the entry; histogram.cgi then assumes that the state after a system start is identical to that which existed directly before the restart.[20]

[19] Parameter log_initial_state in nagios.cfg; see page 433.
[20] The subtle difference here lies in retain_state_information (see page 438). If this parameter is set to 0, Nagios forgets the previous state. Without log_initial_state=yes, Nagios assumes an OK after the restart.

Ignore repeated states makes allowances if a state persists for a long time and therefore delivers the same result again and again. If you set yes here, the script evaluates it once instead of many times.

If you select the item Hard and soft states in State types to graph:, histogram.cgi also counts soft states. If a service changes from OK to CRITICAL, for example, while max_check_attempts is set to 4,[21] then histogram.cgi counts a total of four results, three soft and one hard. If you only evaluate hard states, the statistics count the value 1. If an error is rectified, there are no soft states; therefore the value for CRITICAL is usually larger than that for RECOVERY if you include soft states in the evaluation.

16.2.12 Filtering log entries by specific states: history.cgi

The history.cgi script allows the states of a type (soft or hard) to be extracted selectively from the log file using the selection field State type options (at the top right in Figure 16.34), and specific events to be extracted (all, all related to hosts, all service events, only host-recovery, only host-down, etc.) using History detail level for all hosts. The entries to be shown can be restricted through parameters to individual hosts, services, or host or service groups when the CGI program is called. So the command

    history.cgi?host=sap-12

only displays log file entries for the host sap-12.
If the output is to be restricted to a specific service, then the service description needs to be specified as well:

    history.cgi?host=sap-12&service=PING

Selecting a host group or service group is done in the same way:

    history.cgi?hostgroup=SAP
    history.cgi?servicegroup=SAP-Services

The period that history.cgi views depends on the archiving interval of the log file. The script always refers to the contents of one archive file. If you set the parameter log_rotation_method (page 434) in the configuration file nagios.cfg to d for daily archiving, the Web page presents the entries for one day. Using the arrows (at the top in Figure 16.34) you can then scroll back and forth through the days.

[21] Nagios thus repeats the test four times before it categorizes the state as “hard.”

Figure 16.34: history.cgi filters the information from the log file

16.2.13 Who was told what, when? notifications.cgi

Another filtered view of the log file is offered by notifications.cgi: it shows all sent messages. Here the view can also be restricted to a specific message group, through the selection field at the top right in Figure 16.35: to all notifications involving hosts, to all which are about services in a critical state, and so on.

Figure 16.35: notifications.cgi answers the question of who gets messages when, about what

If you just want to see messages here concerning particular hosts and services, you must again specify this with parameters when running the CGI program:

    notifications.cgi?host=host
    notifications.cgi?host=host&service=service name
    notifications.cgi?contact=contact

Apart from host and service, you can also select a particular contact, but selecting host or service groups is not possible.
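The different selectors accepted by history.cgi and notifications.cgi can be summarized in a short sketch. The function log_view_query and the SELECTORS table are illustrative assumptions, not part of Nagios; they only encode the restrictions described in the two sections above.

```python
from urllib.parse import urlencode

# Selection parameters per CGI program, as described in the text.
SELECTORS = {
    "history.cgi": ("host", "service", "hostgroup", "servicegroup"),
    "notifications.cgi": ("host", "service", "contact"),
}


def log_view_query(cgi, **selectors):
    """Return the call for a filtered log view, e.g. 'history.cgi?host=sap-12'."""
    allowed = SELECTORS[cgi]
    for name in selectors:
        if name not in allowed:
            raise ValueError(f"{cgi} does not accept the selector {name!r}")
    # A service description is only unique together with its host.
    if "service" in selectors and "host" not in selectors:
        raise ValueError("a service must be named together with its host")
    if not selectors:
        return cgi
    return cgi + "?" + urlencode(selectors)


print(log_view_query("history.cgi", host="sap-12", service="PING"))
print(log_view_query("notifications.cgi", contact="admin"))
```

Asking notifications.cgi for a hostgroup raises a ValueError in this sketch, mirroring the statement that group selection is not possible there.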
16.2.14 Showing all log file entries: showlog.cgi

The CGI program showlog.cgi shows the log file as it is, with a few colored icons added to help you find your way: a red button marks critical service states or DOWN/UNREACHABLE hosts, a yellow button marks WARNINGs, and a green one, OK. Other buttons refer to information entries or Nagios restarts (Figure 16.36).

You have only a single option here: the chronological order. Normally showlog.cgi shows the newest entries first. If you enable the check mark in Older Entries First: (top right), the oldest entries will be shown first.

The period represented here also depends on the archiving method: if you archive once a day, you will obtain just one day for each Web page. To reach the entries for other days you must make your way through the individual archive files of the log file using the arrows at the top of the picture.

Figure 16.36: A blue button marks information entries, the graph changing from red to green stands for Nagios restarts, and the icon marked GO with a green checked background represents restarts of the monitoring system

16.2.15 Evaluating whatever you want: summary.cgi

If the display and selection options introduced so far are not sufficient for you, you can create your own report with summary.cgi, which generates the selection dialog shown in Figure 16.37. The upper section, Standard Reports:, provides a quick summary in which just one fixed report type can be selected. Clicking on the button directly below this generates the report.

The second section is more sophisticated. The field Report Type: with the report type Most Recent Alerts provides an individual listing of the last n events. The number n is defined further down in the selection dialog in Max List Items:.[22] Report Type: can thus be used to show all events individually, each on a separate line, with Most Recent Alerts, or you can have statistics displayed for the number of events that have occurred overall, for each host group, etc., with Alert Totals, Alert Totals by Hostgroups, etc.

One particularly interesting report type is Top Alert Producers: such reports show, in a hit list, who has caused the most trouble during the report period.

In Report Period: you can either choose the desired report period from predefined intervals (this week, the past seven days, this month, last week, last month, etc.), or you can specify CUSTOM REPORT PERIOD and define any period you choose. If you forget to specify CUSTOM REPORT PERIOD explicitly, the CGI program ignores the dates you have set and selects what is currently entered in Report Period.

Figure 16.37: Selection template for parameters in summary.cgi

The details that follow the report period filter according to host, services or their groups, state types, and/or individual states (e.g., only services in a CRITICAL state).

[22] If the number of events in the report period is less than specified in Max List Items:, the report covers all the events that have happened during this period.

It is important to specify Max List Items at the end: summary.cgi always shows only as many entries as are specified here. The default is a little small; if you want all the entries in the selected period to be shown, you should enter 0 as the value. The largest value that can be given explicitly here is 999.

The Create Summary Report! button then generates the requested report (Figure 16.38). The header of the report contains details of the report period and the selection made. The detail directly above the table is interesting: Displaying most recent 25 of 3721 total matching alerts shows that the selection criteria matched a total of 3721 entries, but that the CGI script restricted the output to the 25 most current entries, thanks to Max List Items:.

Figure 16.38: An individual report, as generated by summary.cgi

16.2.16 Following states graphically over time: trends.cgi

A rapid overview of what state occurred when for a particular host or service is provided by the graphic output of trends.cgi (Figure 16.39). After selecting a specific host or service, you can define a period, as with summary.cgi. The states are color-coded by trends.cgi, which makes the overview easier to follow.

The zoom function of the CGI program is an interesting detail. If you click in the colored area on a particular section, the selected area is enlarged or reduced in size by the zoom factor specified at the top right. Negative entries (-1, -2, -3, and -4 are possible) expand the report period instead of reducing it.

Figure 16.39: trends.cgi represents the chronological sequence of states, here using the example of a service

16.3 Planning Downtimes

In every system environment maintenance work accumulates from time to time that the administrator can normally plan, so that users can be informed accordingly beforehand. Nagios refers to such maintenance windows as Scheduled Downtime; the administrator enters these either in the information page for the host or service generated by extinfo.cgi (Figure 16.4, page 277) or for the corresponding host or service group (Figure 16.16, page 284). In doing this, extinfo.cgi makes use of cmd.cgi (Section 16.2.3, page 288), which can also be called selectively:

    http://nagsrv/nagios/cgi-bin/cmd.cgi?cmd_typ=55

opens the input template for maintenance times for a single host. The values for cmd_typ are summarized by Figure 16.23 on page 289.

A further method of recording maintenance periods is provided by addons, which, like the CGI programs, use the external command interface, but which can be automated, in contrast to the interactive Web interface. Such addons can also be found on the Nagios Exchange. [23]

For scheduled downtimes, Nagios prevents notifications from being sent.
This ensures that the administrator is not flooded with false alarms. When checks are made to see whether messages should be sent, a downtime is the third item in the list (Figure 16.2, page 218). In addition, avail.cgi (Section 16.2.10, page 296) takes account of the downtime when evaluating the availability of individual hosts and services, and counts error states that occur during these times not as error states, but as OK.

[23] http://www.nagiosexchange.org/Downtimes.38.0.html

Maintenance periods can overlap. If one maintenance window lasts from 8:00 A.M. till 12:00 P.M., and a second one, involving the same host or service, from 10:00 A.M. to 2:00 P.M., then Nagios does not send any error messages between 8:00 A.M. and 2:00 P.M., and the whole period is also ignored in the availability statistics.

16.3.1 Maintenance periods for hosts

What data is required to record the maintenance window can be explained quite clearly using the Web interface. Figure 16.40 shows the input template for the downtime of a host (cmd.cgi?cmd_typ=55).

Figure 16.40: The downtime for a host in the Web interface is recorded using this dialog

The first line defines the host, and in the second line Nagios automatically enters the login with which you have logged in to the Web interface. In the input field after the Comment: keyword, you can describe the reason for the planned downtime.

Specifying the trigger shows whether the downtime was generated indirectly through another entry. When recording a new downtime, you should leave the value N/A (not available, that is, no trigger) as it is.

In the next four lines you have the option of entering two different downtime types: fixed ones (Type: Fixed) or variable periods (Flexible). The first has a fixed start and a fixed end. In this case Nagios completely ignores the period entry in hours and minutes in the Flexible Duration: fields.

A flexible downtime starts when the first-ever event occurs in the period specified.
From this point in time Nagios plans the downtime for the length of time that was specified here in hours and minutes. This may certainly exceed the end point specified in End Time:.

If further hosts are dependent on the computer specified in Host Name: (perhaps because a router is involved, which other host objects have entered as parents), you have the possibility of extending the downtime to all dependent hosts with the last item, Child Hosts:. Schedule triggered downtime for all child hosts passes on flexible downtimes to all “child hosts,” Schedule non-triggered downtime for all child hosts does the same for fixed downtimes, and Do nothing with child hosts ignores dependencies, so that Nagios does not plan any downtime for any hosts other than the one specified here.

How this hereditary behavior takes effect in Figure 16.40 is shown by the overview of all scheduled downtimes in Figure 16.20 on page 287. The first line contains the downtime just described for the host eli-saprouter with the Downtime ID number 1. Entries that are caused by inheriting this downtime contain the Downtime ID of the downtime causing them in the Trigger ID column: for sap-12 this is 1, since the maintenance of eli-saprouter also affects this host.

Nagios simultaneously generates a comment entry when planning a downtime, which is automatically removed when this period has passed. This is why a speech bubble appears in the status display. During the downtime Nagios supplements this with a “snoring sign,” which is meant to represent a sleep state (Figure 16.41).

Figure 16.41: The snoring sign zzzzz shows that the downtime for the host has begun

16.3.2 Downtime for services

Downtimes for services differ from those for hosts in two small details. Apart from the host name, the service description must be included, and the possibility of inheritance is excluded, since there are no corresponding dependencies for services.

A downtime for a host does not automatically apply to the services running on it.
But since they are also not available if the host is down, it is recommended that you plan the same downtime for all dependent services.

It can be quite strenuous to enter all the services individually. It is much easier to do this using a host group (cmd_typ=85), as shown in Figure 16.42. With this you can define the downtime for services in a specific host group with a single command, and much more as well: a checkmark in Schedule Downtime For Hosts Too at the same time defines the same downtime for all hosts belonging to this group. [24]

[24] In the Nagios 2.0 beta versions the checkmark had no effect, however; there you have to enter the downtime of the hosts separately by running cmd.cgi?cmd_typ=84 again.

Figure 16.42: One downtime for all services of a host group

16.4 Additional Information on Hosts and Services

With the objects hostextinfo and serviceextinfo you can take in additional information in the Web interface and also brighten this up somewhat, using suitable icons. Both objects only have an effect in the Web interface, and they do not influence the capabilities of Nagios.

16.4.1 Extended host information

hostextinfo objects allow you to enhance the display of hosts in the Web interface through additional functions in the form of links and enhancement features in the form of icons and coordinates:

    # -- /etc/nagios/mysite/hostextinfo.cfg
    define hostextinfo{
        host_name          linux01
        notes              Samba Primary Domaincontroller
        notes_url          /hosts/linux01.html
        action_url         /hosts/actions/linux01.html
        icon_image         base/linux40.png
        icon_image_alt     Linux Host
        vrml_image         base/linux40.png
        statusmap_image    base/linux40.gd2
        2d_coords          120,80
        3d_coords          70.0,30.0,40.0
    }

The only obligatory parameter when these are defined is the specification of the host, with host_name; everything else is optional:

host_name
This is the name of the host object whose Web pages are to be expanded by the following properties.

notes
Use this for additional information that extinfo.cgi takes into account in its information pages.
(The entry specified in the above example, Samba Primary Domaincontroller, can be found in Figure 16.43 below the Linux icon.)

Figure 16.43: Next to the three icons for Extra Host Actions, Extra Host Notes and the Linux penguin, extinfo.cgi also shows an alternative text here for the Linux icon (beneath the Tux in brackets) and the additional information from the parameter notes (beneath the alternative text)

notes_url
This is the URL of an (HTML) file with additional information on the host in question, to which you are linked by an icon in the form of a red, slightly opened manual, both in the status overview (Figure 16.44) and in the info page generated by extinfo.cgi (Figure 16.43). If the documentation on the host involved is stored in the Intranet, then maintenance contracts, hotline numbers, system configuration, etc. are just a mouse click away. The parameter may contain an absolute path (from the view of the Web server) or a complete URL (http://...).

Figure 16.44: This status detail view additionally shows an icon each for notes_url (open, red booklet), action_url (pink star), and icon_image (here, Linux penguin)

action_url
This is a link pointing to an action to be run for the host, which executes a CGI program such as cmd.cgi, for example, with just a mouse click. Since a link in the browser is always just a link, this does not have to be a command, and you can just as easily link another Web page. Both in the status overview (Figure 16.44) and on the extinfo.cgi info page (Figure 16.43) it is hidden behind the pink star. As a value, absolute paths from the view of the Web server or complete URLs can be used.

icon_image
This is an icon to enhance the Web interface, but also to provide help: if you systematically use pictures here that represent the operating system (e.g., the Tux for Linux, the Windows window for Microsoft operating systems, the Sun logo for Solaris computers, etc.), this helps you to keep an overview of the operating systems in the status view, especially if you have a large number of hosts (Figure 16.44).
extinfo.cgi also uses this icon (Figure 16.43). Icons should be approximately 40x40 pixels large and be available as a GIF, JPEG, or PNG file. If you specify a relative path (or none at all), then this begins with the directory /usr/local/nagios/share/images/logos/. [25]

icon_image_alt
This alternative text for the icon appears if the browser does not show a picture (for example for reading devices or output devices for Braille). From the icon and the icon text details, Nagios generates the HTML image element, with icon_image_alt as its alternative text.

vrml_image
This is an image symbolizing the host in the 3D representation of statuswrl.cgi. Permissible formats are again GIF, JPEG, or PNG. You should avoid images with transparent areas, since the image is placed on a cube, and the transparent parts in the 3D interface may lead to unexpected results.

statusmap_image
This is the image with which statusmap.cgi (see Section 16.2.5, page 291) symbolizes the host in its topological map. The Nagios demo page of Netways [26] (Figure 16.27 on page 293) shows a nice example. Although GIFs, JPEGs, and PNGs are allowed, it is better to use the GD2 format, because then Nagios requires less computer time to generate the status map. Using the program pngtogd2, which ought to be available as a component of the utilities for Thomas Boutell's GD library in most Linux distributions, PNG files can be easily converted. Again the image size of 40x40 pixels is recommended.

[25] If you have kept to the paths suggested in this book.
[26] http://netways.de/Demosystem.1621.0.html

2d_coords
This parameter specifies coordinates for a user-defined layout of the topological map. Details are given in pixels, with the origin, (0,0), at the top left, and values must be positive: a positive x value counts the number of pixels from the origin to the right, a positive y value from the origin downwards. Figure 16.27 works with fixed coordinates for individual hosts. Nagios ignores 2d_coords details if the status map uses a layout other than the user-defined one.

3d_coords
These are the coordinates for the 3D representation.
Positive and negative floating-point numbers are allowed. (0.0,0.0,0.0) is used as the origin. In the start view, statuswrl.cgi scales the 3D image so that all existing hosts appear on the screen. Where the starting point lies on the screen can therefore not be predicted.

On the Nagios Exchange there is a wide range of finished icons in the category Logos and Images. [27] It is best to unpack these into separate subdirectories, and then the individual packages will not get in each other's way:

    linux:~ # cd /usr/local/nagios/share/images/logos
    linux:images/logos # tar xvzf imagepak-base.tar.gz
    base/aix.gd2
    base/aix.gif
    base/aix.jpg
    base/aix.png
    base/amiga.gd2
    ...

imagepak-base.tar.gz contains a basic selection of icons, which can be supplemented as you please with other packages. The base subdirectory that is created must also be included in the image path, as in the object definition at the beginning of this section (for example, base/linux40.png).

16.4.2 Extended service information

serviceextinfo objects are more or less identical to their host equivalents, so that we will only mention the differences. In addition to the host name, the service description in service_description is obligatory, but the details on the 2D (status map) and 3D views are omitted:

    # -- /etc/nagios/mysite/hostextinfo.cfg
    define serviceextinfo{
        host_name              linux01
        service_description    LPD
        notes                  Linux Print Services
        notes_url              /hosts/linux01-lpd.html
        action_url             /hosts/linux01-lpd-action.html
        icon_image             base/hp-printer40.png
        icon_image_alt         Linux Print Server
    }

[27] http://www.nagiosexchange.org/Image_Packs.75.0.html

In contrast to hostextinfo, the status overview for this example only shows the printer icon specified in icon_image, but not the two icons for the links defined in notes_url and action_url. They only appear in the page generated by extinfo.cgi, with the same icons as for the extended host information (Figure 16.43, page 308).
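
If many hosts follow the same conventions for icons and documentation links, extended information objects like the ones above can be generated instead of typed by hand. The following Python sketch is one possible approach, not part of Nagios: the function names render_object and hostextinfo_for, the second host linux02, and the URL scheme /hosts/<name>.html are assumptions made for this illustration. Only directives that appear in the definitions above are used.

```python
def render_object(kind, directives):
    """Render one Nagios object definition as configuration text.

    `kind` is the object type (e.g. "hostextinfo"); `directives` maps
    parameter names to values. Entries whose value is None are left out.
    """
    names = [name for name, value in directives.items() if value is not None]
    width = max((len(name) for name in names), default=0)
    lines = ["define %s{" % kind]
    for name in names:
        lines.append("    %s  %s" % (name.ljust(width), directives[name]))
    lines.append("}")
    return "\n".join(lines)


def hostextinfo_for(host, alt_text, icon):
    """Build a hostextinfo object using one icon base name for all images."""
    return render_object("hostextinfo", {
        "host_name": host,
        "notes_url": "/hosts/%s.html" % host,  # assumed URL scheme
        "icon_image": "base/%s.png" % icon,
        "icon_image_alt": alt_text,
        "statusmap_image": "base/%s.gd2" % icon,
    })


if __name__ == "__main__":
    # linux02 is a hypothetical second host added for illustration.
    for host in ("linux01", "linux02"):
        print(hostextinfo_for(host, "Linux Host", "linux40"))
        print()
```

The output can be redirected into a file such as the hostextinfo.cfg used above; since only host_name is obligatory, any of the other directives can be dropped from the dictionary.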

16.5 Configuration Changes through the Web Interfaces: the Restart Problem

The CGI program cmd.cgi (Section 16.2.3, page 288) enables a series of changes to be made to the current configuration through the Web interface. [28] In this way notifications or active checks can be switched on and off, for example.

Nagios does not save such changes in the accompanying configuration file, but notes the current status in a separately defined file, with the parameter state_retention_file in nagios.cfg (see page 441). But what happens if you restart Nagios after many changes using the Web interface?

Whether Nagios retains the interactive changes made after a restart or forgets them is dependent on the parameter retain_state_information in the configuration file nagios.cfg (page 438). The default 0 tells the system to forget interactive changes. For Nagios to remember them, you have to set

    # /etc/nagios/nagios.cfg
    ...
    retain_state_information=1
    ...

But this causes a new problem: settings made in the Web interface now have priority over the details in the configuration files. If you change the active_checks_enabled parameter there for a service, a change to the parameter in the configuration file is ignored, since the current, temporarily stored setting in the file defined with state_retention_file will always “win out.” This behavior affects all parameters that can be changed through the external command interface, and therefore also via the CGI program cmd.cgi. The original documentation of Nagios [29] labels these with a red star.

[28] The CGI program makes use of the External Command File interface when doing this.

Two approaches provide a remedy in this case: on the one hand you can set the parameter retain_state_information to 0 shortly before a restart. Then Nagios forgets all the changes when it restarts and reads the configuration files in from scratch.
This procedure is recommended only in exceptional cases, as in large environments it will hardly be possible to go through all the interactive changes in the configuration files. Alternatively you can get into the habit, whenever you make changes in the configuration file, of making them a second time in the Web interface. Although this means slightly more work, there is never a danger that current, and perhaps very important, settings will be lost.

Two additional parameters in the host and service definitions provide opportunities for fine-tuning:

    define host{
        ...
        retain_status_information       1
        retain_nonstatus_information    1
        ...
    }

    define service{
        ...
        retain_status_information       1
        retain_nonstatus_information    1
        ...
    }

retain_status_information specifies whether the current state of a host or service should survive the Nagios restart: 1 means that the system temporarily stores the state, and 0, that it forgets it. 1 is certainly the more sensible value for states, and you should depart from this only in cases that can be justified.

retain_nonstatus_information, on the other hand, refers to all information that describes no status. This includes, for example, whether active checks are switched on or off, whether passive checks are allowed or not, or whether admins are to be informed of status changes for this object. With a value of 1, the system stores this information temporarily and uses it again after a restart, whereas with a value of 0, Nagios forgets the current settings and reads the settings from the configuration file when it restarts.
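
The precedence rules described in this section can be summarized in a small decision function. The following Python sketch is a toy model of the behavior as described in the text, not Nagios code: the function name value_after_restart and its arguments are invented here, and it covers only non-status settings such as active_checks_enabled.

```python
def value_after_restart(config_value, runtime_value, *,
                        retain_state_information,
                        retain_nonstatus_information):
    """Return the setting that is in effect after a Nagios restart.

    config_value  -- value written in the object configuration file
    runtime_value -- value last set through the Web interface, or None if
                     the setting was never changed interactively
    The two keyword arguments mirror the nagios.cfg parameter and the
    per-object parameter of the same names (1 = retain, 0 = forget).
    """
    retained = (retain_state_information == 1
                and retain_nonstatus_information == 1
                and runtime_value is not None)
    # A retained interactive change "wins out" over the configuration file.
    return runtime_value if retained else config_value


if __name__ == "__main__":
    # Active checks enabled in the file (1) but switched off interactively (0).
    print(value_after_restart(1, 0, retain_state_information=1,
                              retain_nonstatus_information=1))
    print(value_after_restart(1, 0, retain_state_information=0,
                              retain_nonstatus_information=1))
```

In the first call the interactive change survives the restart; in the second, Nagios falls back to the value from the configuration file.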

[29] /usr/local/nagios/share/docs/xodtemplate.html

Chapter 17 Graphic Display of Performance Data

When Nagios reports to the administrator quickly and selectively on problems that have occurred, it can basically only distinguish between OK states and error states, sparing the admin a flood of information on problematic services and hosts. The graphic display of measured values over a time period cannot be integrated into this “traffic light approach,” but it is available through third-party software. Nagios supports external processing of values with an interface created specifically for this. The data processed through it is referred to in Nagios jargon as performance data.

Nagios has two different classes of performance data. The first is Nagios-internal performance data: statistics on the execution times of tests and on the difference between the actual test time and the planned time (the latency). The second class includes performance data that the plugin passes on with the test result. This involves everything that the plugin can measure: response times, hard drive usage, system load, and so on. These are the very things that are of interest to an administrator, which is why the book concentrates on how they are processed.

Nagios extracts this data and either writes it to a file where it can be processed by other programs, or passes it on directly to the external software that is run after every service or host check.

17.1 Processing Plugin Performance Data with Nagios

Performance data provided by service and host checks can be processed only if the corresponding plugin delivers it in a predefined format. As shown here using the check_icmp plugin (Section 6.2, page 88), it is preceded by a | sign and is not shown in the Web interface:

    nagios@linux:libexec/nagios$ ./check_icmp -H vpn01
    OK - eli02: rta 96.387ms, lost 0%| rta=96.387ms;200.000;500.000;0; pl=0%;40;80;;

This standardized form is provided by most plugins only after version 1.4.

[1] Some tools such as Nagiosgraph and NagiosGrapher make use of the fact that the remaining text normally contains performance data as well. If they are correspondingly configured, they are able to extract the performance data contained there. In this way they can further process data that does not conform to the standard format.

The performance data itself consists of one or more variables in the following form:

    name=value;warn;crit;min;max

The variable name may contain spaces, but then it must be surrounded by single quotation marks. After the equals sign comes first the measured value as an integer or floating-point decimal, with or without a unit. Possible units are % (percentage), s (time in seconds), B (data size in bytes), or c (an incremental counter). This is followed, separated by semicolons, by the warning and critical limits, and then the minimum and maximum value. For percentage values, the minimum and maximum can be left out by the plugin. You can also specify 0 for minimum/maximum, as well as for the warning or critical limit, if there is no such threshold value. If there are several variables, these are separated with spaces, as in the check_icmp example. However, in contrast to this example, the final specification should not end with a semicolon, according to the Developer Guidelines.

17.1.1 The template mechanism

Nagios has two methods of processing performance data: either the system saves the data to a file using a template, or it executes an external command. If you just want to write data consistently to a log file, the template procedure is somewhat easier to configure.

In order that Nagios can process performance data at all, the parameter

    # /etc/nagios/nagios.cfg
    ...
    process_performance_data=1
    ...

must be set to 1. The file to which Nagios writes the host or service performance data is specified by the *_perfdata_file parameters:

    # /etc/nagios/nagios.cfg
    ...
    # host_perfdata_file=/var/nagios/host-perfdata.dat
    service_perfdata_file=/var/nagios/service-perfdata.dat
    # host_perfdata_file_template=[HOSTPERFDATA]\t$TIMET$\t$HOSTNAME$\t\
        $HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$
    service_perfdata_file_template=[SERVICEPERFDATA]\t$TIMET$\t\
        $HOSTNAME$\t$SERVICEDESC$\t$SERVICEEXECUTIONTIME$\t\
        $SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$
    ...

If host_perfdata_file is commented out, as in this example, Nagios does not save any performance data of host checks. But since they are only used if all service checks fail, it lies in the nature of host checks that they only provide data sporadically and at irregular intervals. This is why it is not worth evaluating them in most cases.

The *_perfdata_file_template parameters define the output format. The definition shown above, service_perfdata_file_template, delivers (one-line) log file entries in the following pattern:

    [SERVICEPERFDATA] 1114353266 linux01 PING 0.483 0.104 OK - 10.128.254.12: rta 100.436ms, lost 0% rta=100.436ms;3000.000;6000.000;0; pl=0%;40;80;;

Each line begins with a [SERVICEPERFDATA] “stamp,” followed by the test time in epoch seconds ($TIMET$), the host name and service description ($HOSTNAME$ and $SERVICEDESC$), the time Nagios requires for the test ($SERVICEEXECUTIONTIME$), and the latency between the planned and actual time of execution ($SERVICELATENCY$), each separated by a tab. Then Nagios writes the output for the Web interface to the log file ($SERVICEOUTPUT$) and finally the actual performance data ($SERVICEPERFDATA$). \t in the parameter definition ensures that a tab separates the individual details from each other in the log.

With the *_perfdata_file_mode parameters you can define whether Nagios appends the data to an existing file (a) or overwrites the existing file (w):

    # /etc/nagios/nagios.cfg
    ...
    host_perfdata_file_mode=a
    service_perfdata_file_mode=a
    ...

This is suitable for external programs that can read the data from a (previously set up) named pipe. This method provides better performance and does not require any space on the hard drive. If the processing software is not running, however, the data may be lost: Nagios does try for a time to continue writing to the pipe, but aborts this process after a timeout if the data cannot be read out.

Programs that read from a log file generally delete it afterwards, to prevent the file system from overflowing. If the program does not retrieve any data, the file will grow quickly, but nothing will be lost as long as there is still space on the file system.

It is best to run external evaluation software as a permanent service. But you can also configure Nagios so that it regularly triggers a program for further processing:

    # /etc/nagios/nagios.cfg
    ...
    # host_perfdata_file_processing_interval=0
    # service_perfdata_file_processing_interval=0
    # host_perfdata_file_processing_command=process-host-perfdata-file
    # service_perfdata_file_processing_command=process-service-perfdata-file
    ...

With the *_perfdata_file_processing_interval parameters you set an interval in seconds; each time it elapses, Nagios runs the corresponding *_perfdata_file_processing_command. This command is defined as a normal Nagios command object:

    # misccommands.cfg
    ...
    define command{
        command_name    process-service-perfdata-file
        command_line    /path/to_the/evaluation_program
    }
    ...

As long as the external software itself looks after the further processing of the file with the performance data, you do not need to use the *_perfdata_file_processing_* parameters.

17.1.2 Using external commands to process performance data

As an alternative to the template method, Nagios can also directly call a command that takes over further processing of data.
This is done directly after each test result; so after each individual check, an external program is started. If you have a large number of services to be checked, this can, depending on the software, considerably degrade performance.

The command itself is defined with the *_perfdata_command parameter instead of the *_perfdata_file parameter:

    # /etc/nagios/nagios.cfg
    ...
    process_performance_data=1
    service_perfdata_command=process-service-perfdata
    ...

In the same way as with service performance data, you can also process the results of host checks, using the host_perfdata_command parameter. process-service-perfdata itself again refers to a normal Nagios command object:

    # misccommands.cfg
    ...
    define command{
        command_name    process-service-perfdata
        command_line    /path/to/program "$LASTSERVICECHECK$||$HOSTNAME$||\
                        $SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$"
    }
    ...

This calls the external program, which is given the necessary information as arguments. This should include at least the timestamp of the last service check ($LASTSERVICECHECK$), the host name ($HOSTNAME$), and the service description ($SERVICEDESC$), as well as the actual service performance data ($SERVICEPERFDATA$). The delimiter depends on the program used: this example uses ||, as is used by the Nagiosgraph program.

17.2 Graphs for the Web with Nagiosgraph

With the program Nagiosgraph from http://nagiosgraph.sf.net/, performance data supplied by plugins can be displayed graphically in a Web interface in chronological form.

The software consists of two Perl scripts. The script insert.pl writes the Nagios performance data to a round-robin database, a ring buffer in which the newest data overwrites the oldest. [2] The advantage of this is the small amount of space required, which can be defined beforehand. The trick consists of saving data in various resolutions, depending on its age: older data with a lower resolution (e.g., one measurement value per day), current data with a high resolution (e.g., one measurement every five minutes).
When setting up the database, you also define how long the data is retained. This defines space requirements right from the beginning.

Provided that Nagiosgraph detects the performance data, the program creates a separate round-robin database for each new service, when it appears for the first time. The map configuration file included describes just a few services, so that usually some manual work (and a basic knowledge of Perl) is required.

The second Nagiosgraph script show.cgi, a CGI script, represents the information from the database in a dynamic HTML page. To do this, it is run (after configuration is completed) in the form

    http://nagsrv/path/to/show.cgi?host=host&service=service_description

Nagiosgraph then displays four graphs (a daily, a weekly, a monthly, and a yearly summary) for the desired service.

17.2.1 Basic installation

An installed RRDtool package, which is contained in most Linux distributions, is a prerequisite for Nagiosgraph. Alternatively you can obtain the current source code from http://www.rrdtool.org/. [3] For reasons of performance, it is recommended here that you also install the included Perl module RRDs.

The Nagiosgraph tar file itself is preferably unpacked in the directory /usr/local/nagios:

    nagios@linux:local/nagios$ tar xvzf nagiosgraph-0.5.tar.gz
    nagiosgraph/INSTALL
    nagiosgraph/README
    nagiosgraph/README.map
    nagiosgraph/insert.pl
    nagiosgraph/insert_fast.pl
    nagiosgraph/map
    nagiosgraph/nagiosgraph.conf
    nagiosgraph/show.cgi
    nagiosgraph/testcolor.cgi
    nagiosgraph/testentry.pl

[2] Further information on this topic can be found at http://www.rrdtool.org/.
[3] To install, see page 330.

insert.pl extracts the data transferred by Nagios and inserts this into the RRD database. If this does not exist, however, the script will create it. Alternatively insert_fast.pl can take on this task. This script uses the Perl module RRDs, which is considerably more efficient than calling up rrdtool as an external program each time, which is what insert.pl does.
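
The round-robin principle behind the database that insert.pl and insert_fast.pl fill (a fixed number of slots, the newest value displacing the oldest, with coarser archives for older data) can be illustrated in a few lines of Python. This is only a toy model under simplifying assumptions, not RRDtool's actual implementation: the class name RoundRobinArchive and its parameters rows and steps are invented for this sketch, and averaging is assumed as the consolidation method.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive storing one averaged value per `steps` samples."""

    def __init__(self, rows, steps=1):
        # deque(maxlen=rows) silently drops the oldest entry when full,
        # which is exactly the ring-buffer behavior.
        self.values = deque(maxlen=rows)
        self.steps = steps
        self._pending = []

    def update(self, value):
        """Add one raw sample; store an average once `steps` samples exist."""
        self._pending.append(value)
        if len(self._pending) == self.steps:
            self.values.append(sum(self._pending) / self.steps)
            self._pending = []


if __name__ == "__main__":
    fine = RoundRobinArchive(rows=3)             # every sample, short history
    coarse = RoundRobinArchive(rows=2, steps=2)  # averages of two samples
    for sample in [1, 2, 3, 4, 5, 6]:
        fine.update(sample)
        coarse.update(sample)
    print(list(fine.values))
    print(list(coarse.values))
```

Because rows is fixed when the archive is created, the space requirement is known from the start, regardless of how many samples arrive later.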

Another Perl script called testentry.pl helps if you are testing your own map entries. But since you have to write these directly into this file, you can also change the map file itself (as shown below), provided you have made a backup copy first. The CGI script testcolor.cgi looks more like a developer's utility left over in the package, rather than a tool that is of any use for users.

Apart from the already mentioned map configuration file, there is a second one, nagiosgraph.conf, and its path must be defined correctly in both insert.pl (or insert_fast.pl) and show.cgi, so it is recommended that you check this:

    my $configfile = '/usr/local/nagios/nagiosgraph/nagiosgraph.conf';

17.2.2 Configuration

The configuration file nagiosgraph.conf

All other relevant paths, such as those to the map file and to rrdtool, are adjusted in nagiosgraph.conf:

    rrdtool = /usr/bin/rrdtool
    rrddir = /var/lib/rrd/nagiosgraph
    logfile = /var/nagios/nagiosgraph.log
    mapfile = /usr/local/nagios/nagiosgraph/map
    debug = 2
    colorscheme = 4

Nagiosgraph creates the RRD databases in the rrddir directory. Here the user nagios must have write access and the user with whose rights the Web server is running must have read access:

    linux:~ # mkdir -p /var/lib/rrd/nagiosgraph
    linux:~ # chown nagios.nagcmd /var/lib/rrd/nagiosgraph
    linux:~ # chmod 755 /var/lib/rrd/nagiosgraph

The log file, for which both users need write access (the Web user because the CGI script also records information to the log file), is also critical:

    linux:~ # touch /var/nagios/nagiosgraph.log
    linux:~ # chown nagios.nagcmd /var/nagios/nagiosgraph.log
    linux:~ # chmod 775 /var/nagios/nagiosgraph.log

How verbose Nagiosgraph is can be adjusted with debug. The possible debug levels are documented in the configuration file included: 2 means “errors,” 4 “information.” At level 4 Nagiosgraph is already so verbose that you must watch out that the file system does not overflow. Except for debugging purposes (such as when setting up the system), it is better to choose 2.

With colorscheme, which can accept values from 1 to 8, you can influence the amount of color in the graphs; it is best to try out the options to see which color scheme matches your personal taste best.

Nagios configuration

Nagiosgraph grabs the performance data directly from Nagios. For this reason nagios.cfg does not require any *_perfdata_file_* parameters.

    # /etc/nagios/nagios.cfg
    ...
    process_performance_data=1
    service_perfdata_command=process-service-perfdata
    ...

process_performance_data switches on processing of performance data in general; service_perfdata_command refers to the Nagios command object that contains the external command:

    # misccommands.cfg
    ...
    define command{
        command_name    process-service-perfdata
        command_line    /usr/local/nagios/nagiosgraph/insert_fast.pl \
                        "$LASTSERVICECHECK$||$HOSTNAME$||\
                        $SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$"
    }
    ...

The definition of the parameter command_line must be written on one line (without the backslashes \), as usual.

So that the CGI script can run directly from the Nagios Web interface, a serviceextinfo object is defined:

    define serviceextinfo{
        service_description    PING
        host_name              *
        notes_url              /nagiosgraph/show.cgi?host=$HOSTNAME$&service=PING
        icon_image             graph.gif
        icon_image_alt         show graphics
    }

If the graphic defined in icon_image is in the directory /usr/local/nagios/share/images/logos, the Web interface marks the PING services for all hosts in the status display with this. [4] Here the strength of show.cgi can be seen: only because this script is called explicitly with host and service names is a definition like the one above possible. Instead of an individual host name, you can also specify a host group, or, as in this example, a *. A requirement for this is that PING really is defined as a service for every host. The $HOSTNAME$ macro then automatically inserts the appropriate host. The additional information for a specific service type (which must have the same service description in all hosts) can therefore be catered for with just one single definition.
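
To make the data flow more tangible, the following Python sketch shows how a program called through service_perfdata_command could take apart the single ||-delimited argument and the performance data it contains. It is a hedged illustration only and does not reproduce how insert.pl or insert_fast.pl work internally: the function names split_argument and parse_perfdata and the two regular expressions are assumptions made for this example, which follows the name=value;warn;crit;min;max form described in Section 17.1.

```python
import re

ARG_DELIMITER = "||"  # delimiter used in the command_line definition above

# One performance data entry: a label (optionally in single quotes), an
# equals sign, then everything up to the next whitespace.
PERF_ITEM = re.compile(r"('[^']+'|[^\s=']+)=(\S*)")
# A measured value: a number followed by an optional unit such as ms or %.
VALUE = re.compile(r"(-?\d+(?:\.\d+)?)(\S*)")


def split_argument(arg):
    """Split the argument into its five parts (order as in command_line)."""
    lastcheck, host, service, rest = arg.split(ARG_DELIMITER, 3)
    # The plugin output may contain anything, so split off the performance
    # data from the right-hand side.
    output, perfdata = rest.rsplit(ARG_DELIMITER, 1)
    return {"lastcheck": int(lastcheck), "host": host, "service": service,
            "output": output, "perfdata": perfdata}


def parse_perfdata(perfdata):
    """Parse 'name=value;warn;crit;min;max ...' into a dictionary."""
    result = {}
    for label, rest in PERF_ITEM.findall(perfdata):
        fields = (rest.split(";") + [""] * 5)[:5]  # pad missing fields
        match = VALUE.fullmatch(fields[0])
        if match is None:
            continue  # skip entries without a numeric value
        result[label.strip("'")] = {
            "value": float(match.group(1)),
            "unit": match.group(2),
            "warn": fields[1] or None,
            "crit": fields[2] or None,
            "min": fields[3] or None,
            "max": fields[4] or None,
        }
    return result


if __name__ == "__main__":
    record = split_argument(
        "1114853435||linux01||PING||OK - 172.17.4.11: rta 99.278ms, lost 0%"
        "||rta=99.278ms;3000.000;7000.000;0; pl=0%;60;80;;")
    print(record["host"], record["service"])
    print(parse_perfdata(record["perfdata"]))
```

Thresholds are kept as strings (or None when empty) because this sketch only needs the measured value as a number.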
Apache configuration

So that the Apache Web server can accept the CGI script as it is, a ScriptAlias is created, for example:

   ScriptAlias /nagiosgraph/ /usr/local/nagios/nagiosgraph/

This entry is best placed in the configuration file discussed in Section 1.3 (page 33), nagios.conf. Only after Apache is reloaded can the CGI script be run from the URL specified on page 318.

Adjustments to the map

Depending on the service, the round-robin database may also save several series of measurements, which can be requested individually through the CGI script:

   http://nagsrv/path/to/show.cgi?host=host&service=service_description&db=database,entry1,entry2&db=database,entry3

The database used here contains at least three different series of measurements, the first two of which are shown together in one graphic, while the third is shown in a separate graphic. What is shown together and what is shown separately depends on the units and scale of the values. It makes little sense to display the percentage load of a hard drive and the absolute value in bytes in the same graphic, since the Y axis can only have one scale. It is better here to display percentage values in one graphic and absolute byte values in a second one. On the other hand you can display the various average values of the system load (for one, five, and 15 minutes) in a single graphic. If you leave out all db= specifications, Nagiosgraph always displays all measured values for a service in a single graphic.

[4] A more detailed description of the serviceextinfo object is contained in Section 16.4.2, page 310.

Which databases and measured values are available is defined by the map file. To understand how the instructions contained there influence the extraction of data, you just need to switch the debugging level to 4 and take a look at the output in the log file nagiosgraph.log. Each time the insert function is run, Nagiosgraph rereads the configuration files, so that this change does not require any kind of reset. In the following extract from the log file the three dots mark sections which we will not print, for the sake of clarity:

   ...
   INSERT info:... servicedescr:PING
   ...
   INSERT info:... hostname:linux01
   ...
   INSERT info:... perfdata:rta=99.278ms;3000.000;7000.000;0; pl=0%;60;80;;
   ...
   INSERT info:... lastcheck:1114853435
   ...
   INSERT info:... output:OK - 172.17.4.11: rta 99.278ms, lost 0%

The output is from the check_icmp plugin. The host name, service description, performance data (perfdata:), and the standard output line (output:) each have their own line. In the performance data the plugin announces the round trip average with the variable rta, and the share of packets that have gone missing with pl (packet loss).

The map file contains Perl instructions that filter these outputs and extract the corresponding data if there are hits. Each of them starts with a search instruction:

   /perfdata:rta=([.\d]+)ms.+pl=(\d+)%/

The classic Perl search function consists of the two forward slashes / with a search pattern in the form of a regular expression in between. Round pairs of brackets enclose partial patterns with which the text found in this way can later be accessed using the variables $1, $2, etc. The pattern in the first bracket thus matches a single digit (\d) or a dot, [5] and the + that follows states that there can be several of them (but at least one). In the second pair of round brackets, though, one or more digits are allowed, but no period. In concrete terms $1 delivers the numerical value of the response time, $2 provides the packet loss in percent.

The full instruction in the map file links two Perl statements with the and operator:

   # -- check_icmp
   # perfdata:rta=100.424ms;5000.000;9000.000;0; pl=0%;40;80;;
   /perfdata:rta=([.\d]+)ms.+pl=(\d+)%/
   and push @s, [ 'ping',
                  [ 'rta', 'GAUGE', $1 ],
                  [ 'losspct', 'GAUGE', $2 ],
   ];

[5] A pair of square brackets contains alternatives.

If the first one (the search function) is successful, then it is the turn of the push statement. It adds the expression in square brackets that follows to the array @s. The instruction ends with a semicolon.
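The interplay of search pattern and push statement can be tried out away from a running system. The following Python sketch applies the same regular expression to a perfdata line and builds a nested list in the shape of the map entry. It is an illustration only: Nagiosgraph itself evaluates Perl, and the function name extract_ping is an assumption made for this example.

```python
import re

# Same pattern as in the map file, written as a Python raw string.
PING_PATTERN = re.compile(r"perfdata:rta=([.\d]+)ms.+pl=(\d+)%")


def extract_ping(log_line):
    """Return a list of map-style entries, or an empty list without a match."""
    entries = []
    match = PING_PATTERN.search(log_line)
    if match:
        # group(1) corresponds to $1 in Perl, group(2) to $2.
        entries.append([
            "ping",
            ["rta", "GAUGE", float(match.group(1))],
            ["losspct", "GAUGE", float(match.group(2))],
        ])
    return entries


line = "INSERT info:... perfdata:rta=99.278ms;3000.000;7000.000;0; pl=0%;60;80;;"
entries = extract_ping(line)
no_match = extract_ping("INSERT info:... output:OK - 172.17.4.11: rta 99.278ms, lost 0%")
```

As in the Perl original, a line without a hit leaves the result list empty, which corresponds to no entry being added to @s.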
If the search function provides no result, the map instruction will not save any entry in the @s array. The expression to be included in the array has the following format:

   [ db-name,
     [ name_of_data_source, type, value ],
     [ name_of_data_source, type, value ],
     ...
   ]

The file name for a Nagiosgraph database file consists of the host name, service description, and the database name together, for example, linux01_PING_ping.rrd. The desired string for the database name is entered in the map file in place of the placeholder db-name (in this case, ping). The name of the data source can be chosen freely, but should contain an indication of the data that is stored here, such as rta for the response time or losspct for the percentage of packets that have been lost.

What type you specify is determined by the RRDtools. GAUGE stands for simple measured values that are displayed simply as they are. DERIVE is recommended by Nagiosgraph author Soren Dossing for processing counters, such as in querying a packet counter on the network interface. Counters grow incrementally and, when they run over, start again at zero. What is of interest here is the difference between two points in time. The RRD database determines these automatically if the data source type DERIVE is specified.

The database name, data source, and type should always be placed in single quotation marks in the map file, so that no name conflicts can occur with keywords reserved in Perl.

The measured value itself is determined using Perl methods, and the placeholder value is substituted with the corresponding instructions. In the simplest case, you take over the values found with the search pattern in the performance data with $1, $2, etc. (see example above), or calculate new values from these by multiplying [6] by 1024 or by calculating the percentage:

   # -- check_nt -v USEDDISKSPACE
   # perfdata:C:\ Used Space=1.71Gb;6.40;7.20;0.00;8.00
   /perfdata:.*Used Space=([.\d]+)Gb;([.\d]+);([.\d]+);([.\d]+);([.\d]+)/
   and push @s, [ 'disk',
                  [ 'used', 'GAUGE', $1*1024 ],
                  [ 'usepct', 'GAUGE', ($1/$5)*100 ],
                  [ 'freepct', 'GAUGE', (($5-$1)/$5)*100 ],
   ];

   # -- check_disk (unix)
   # perfdata:/=498MB;1090;1175;0;1212
   m@perfdata:.*/([^ =]+)=([.\d]+)MB;([.\d]+);([.\d]+);([.\d]+);([.\d]+)@
   and push @s, [ $1,
                  [ 'used', 'GAUGE', $2*1024**2 ],
                  [ 'warn', 'GAUGE', $3*1024**2 ],
                  [ 'crit', 'GAUGE', $4*1024**2 ],
   ];

[6] This turns kilobytes into bytes.

The first entry evaluates the query of hard drive space on a Windows server with check_nt (see Section 18.1, page 359). The performance data also contains, apart from the occupied space in $1, the size of the data carrier in $5. This can be used to calculate the percentage that is available (freepct) and the percentage used (usepct).

Figure 17.1: Used space and limit values for the file system /net/linux01/a on the host linux01, as Nagiosgraph represents them

The second example evaluates data obtained on a Unix host with check_disk, multiplying the used hard drive space and the two limits, all specified in MB, by 1024**2 to convert them to bytes.

The critical and warning limits always remain constant, which leads to horizontal lines, as seen in Figure 17.1: the lower line at 12.1 GB represents the warning limit, the middle line the current load, and the top line at 18.1 GB the critical limit. The keys for the individual graphs each list minimum, maximum, and average as a numerical value. This differentiation for the two limit values is not of any use, but it cannot be avoided, since Nagiosgraph does not know that these are constant values: it treats warning and critical limits just like any other measured values.

If a plugin does not provide any performance data, but values that are used in normal output, the search function can be applied to the output (/output:.../) instead of to the performance data. Help is provided, for example, by the Nagiosgraph Forum at http://sourceforge.net/forum/forum.php?forum_id=394748.
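The derived values in the check_nt example above are plain arithmetic on the captured groups. The following Python sketch repeats the calculation for the sample perfdata line from the listing so that the results can be checked by hand. The function name disk_values and the dictionary layout are assumptions for illustration; the actual map file works with Perl expressions as shown.

```python
import re

# Pattern from the check_nt USEDDISKSPACE map entry, as a Python raw string.
DISK_PATTERN = re.compile(
    r"perfdata:.*Used Space=([.\d]+)Gb;([.\d]+);([.\d]+);([.\d]+);([.\d]+)"
)


def disk_values(line):
    """Compute used, usepct and freepct the way the map entry does."""
    match = DISK_PATTERN.search(line)
    if match is None:
        return None
    used = float(match.group(1))   # $1: occupied space
    size = float(match.group(5))   # $5: size of the data carrier
    return {
        "used": used * 1024,                    # $1*1024
        "usepct": (used / size) * 100,          # ($1/$5)*100
        "freepct": ((size - used) / size) * 100,  # (($5-$1)/$5)*100
    }


values = disk_values(r"perfdata:C:\ Used Space=1.71Gb;6.40;7.20;0.00;8.00")
```

For the sample line, 1.71 of 8.00 is occupied, so usepct comes to about 21.4 and freepct to about 78.6; the two always add up to 100.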
Changes to the map are critical. It is therefore recommended that you copy the file first and edit the copy, and then perform a syntax check, using perl -c:

   nagios@linux:libexec/nagios$ cp map map.new
   nagios@linux:libexec/nagios$ vi map.new
   nagios@linux:libexec/nagios$ perl -c map.new
   nagios@linux:libexec/nagios$ mv map.new map

If the syntax check is in order, you can install the new file as map.

17.3 Preparing Performance Data for Evaluation with Perf2rrd

Another tool which transfers Nagios performance data to an RRD database is the Java application Perf2rrd. This requires an installed Java Runtime Environment (1.4.2, or preferably 1.5). Since the virtual machine generates a noticeable load on less powerful computers, and also requires a large amount of memory, the requirements made of the Nagios server by Perf2rrd are significantly higher than those made by Nagiosgraph.

On the other hand there is no more work after the installation as far as generating the RRD databases is concerned, because Perf2rrd uses the template mechanism of Nagios (see Section 17.1, page 314). For each service and each variable contained in the template, the tool creates a separate RRD database using the following naming pattern:

   host+service_description+variable_name.rrd

So to evaluate the check_icmp variables rta (round trip average) and pl (packet loss), the file names are linux01+PING+pl.rrd and linux01+PING+rta.rrd.

Perf2rrd only looks after the storage of data in an RRD database and does not provide any tools to graphically display the data saved there. The Perf2rrd author Marc DeTrano refers here to the drraw tool (see Section 17.4, page 330). It can be advantageous to use this, because on the one hand drraw allows far more than just the one display provided by Nagiosgraph, and on the other hand you do not have to struggle with regular expressions in Perl.
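The naming pattern host+service_description+variable_name.rrd can be illustrated with a few lines of Python. This is a sketch under assumptions, not Perf2rrd's own code: the function name is invented here, and the parsing only handles simple labels without spaces or quotation marks.

```python
def perf2rrd_filenames(host, service_description, perfdata):
    """Derive one file name per perfdata variable, following the pattern
    host+service_description+variable_name.rrd described above.

    Assumption: variables are separated by whitespace and each label ends
    at the first '='; quoted labels containing spaces are not handled.
    """
    names = []
    for item in perfdata.split():
        label, separator, _rest = item.partition("=")
        if separator:  # skip fragments that contain no '='
            names.append("%s+%s+%s.rrd" % (host, service_description, label))
    return names


files = perf2rrd_filenames(
    "linux01", "PING", "rta=99.278ms;3000.000;7000.000;0; pl=0%;60;80;;"
)
```

For the check_icmp perfdata used in this chapter, the sketch yields the two file names mentioned in the text.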
17.3.1 Installation

For the installation you should get hold of the archive in tar format from http://perf2rrd.sf.net/ and unpack it, preferably in the /usr/local hierarchy:

   linux:~ # cd /usr/local
   linux:usr/local # tar xvzf /path/to/perf2rrd-1.0.tar.gz
   ...
   perf2rrd/run
   ...

The executable program that is later run is a script called run, which in turn calls the Java bytecode interpreter, java. Besides this the directory contains the Java class files and other utilities, with which you can recompile the included shared library librrdj.so, if required. This is normally not necessary for the newer distributions.

In order for run to be able to find the java program, it must be located in /usr/bin. If this is not the case (because you have installed the Java archive from http://www.sun.com/, for example), then you should set a link:

   linux:~ # ln -s /usr/local/jre1.5.0_02/bin/java /usr/bin/java

A short test shows whether or not Perf2rrd starts correctly:

   nagios@linux:local/perf2rrd$ ./run
   perf2rrd starting
   Using Nagios Config: /etc/nagios/nagios.cfg
   Using RRD Repository: /var/log/nagios/rrd
   Unable to create RRD Repository

The error message issued in the last line is not a problem at the moment, since we save the RRD databases in a different directory anyway (page 329).

17.3.2 Nagios configuration

Perf2rrd searches in the Nagios configuration for all the data it requires: to what file Nagios should write the performance data, the write mode used for this, [7] and the format of the template:

   # /etc/nagios/nagios.cfg
   ...
   process_performance_data=1
   ...
   service_perfdata_file=/var/nagios/service-perfdata.dat
   service_perfdata_file_template=$TIMET$\t$HOSTNAME$\t\
   $SERVICEDESC$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t\
   $SERVICEOUTPUT$\t$SERVICEPERFDATA$
   service_perfdata_file_mode=w
   ...

[7] With a, Nagios appends the data to a normal log file; with w it makes it accessible through a named pipe. See Section 17.1, page 314.
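The template above produces one tab-separated record per service check. As a rough illustration of what a consumer of this file or pipe has to do, the following Python sketch splits such a record into named fields. The field names and the sample record are assumptions made for this example; they are not taken from Perf2rrd, which is written in Java and whose parsing details are not shown in this book.

```python
# Field order follows service_perfdata_file_template from nagios.cfg above;
# the lower-case names are illustrative only.
TEMPLATE_FIELDS = [
    "timet", "hostname", "servicedesc",
    "executiontime", "latency", "output", "perfdata",
]


def parse_perfdata_record(record):
    """Split one tab-separated template record into a dictionary."""
    values = record.rstrip("\n").split("\t")
    if len(values) != len(TEMPLATE_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(TEMPLATE_FIELDS), len(values)))
    return dict(zip(TEMPLATE_FIELDS, values))


# Hypothetical sample record; execution time and latency are made up.
sample_record = "\t".join([
    "1114938329", "eli02", "PING", "0.012", "0.5",
    "OK - 172.17.4.11: rta 0.079ms, lost 0%",
    "rta=0.079ms;3000.000;7000.000;0; pl=0%;60;80;;",
]) + "\n"
record = parse_perfdata_record(sample_record)
```

Since the plugin output may itself contain spaces but normally no tabs, the tab character is a convenient separator for this kind of record.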
The named pipe used here, thanks to service_perfdata_file_mode=w, must be created manually; Perf2rrd 1.0 in Nagios 2.0 has problems with the normal file interface (service_perfdata_file_mode=a):

   linux:~ # mknod /var/nagios/service-perfdata.dat p
   linux:~ # ls -l /var/nagios/service-perfdata.dat
   prw-r--r-- 1 nagios nagios 0 May 1 10:49 /var/nagios/service-perfdata.dat

In the template the introductory [SERVICEPERFDATA] stamp is missing (see Section 17.1), since Perf2rrd 1.0 does not parse this correctly. Changes to the Nagios configuration require a reload:

   linux:~ # /etc/init.d/nagios reload

Finally you create the directory for the RRD databases:

   linux:~ # mkdir /var/lib/rrd/perf2rrd
   linux:~ # chown nagios.nagios /var/lib/rrd/perf2rrd

17.3.3 Perf2rrd in practice

Program start

Loading the Java Virtual Machine each time Perf2rrd is started requires considerable resources. For this reason you should not have Nagios start Perf2rrd at specific intervals through the parameter service_perfdata_file_processing_command, and you should also not use the one-shot mode, with ./run -o, in which the software processes one file at a time. In theory this would make it possible to run Perf2rrd regularly with a cron job. Instead, it is recommended that you keep the program running permanently.

When using this for the first time, we recommend that you switch on the debugging mode, which will show any problems that occur. The option -d specifies the directory in which the tool should create and update the RRD databases:

   nagios@linux:local/perf2rrd$ ./run -d /var/lib/rrd/perf2rrd -x
   perf2rrd starting
   Using Nagios Config: /etc/nagios/nagios.cfg
   Using RRD Repository: /var/lib/rrd/perf2rrd
   Debug Mode is on
   Reading perfdata from named pipe.
   Perf Data File is :/var/nagios/service-perfdata.dat
   I believe we are using Nagios ver. 2
   Object Cache File is :/var/nagios/objects.cache
   Nagios interval_length 60
   called update with: .../eli02+PING+rta.rrd 1114938329:0.079
   called update with: .../eli02+PING+pl.rrd 1114938329:0.0
   /var/lib/rrd/perf2rrd/sap-14+SAP-3202+time.rrd created.
   called update with: .../sap-14+SAP-3202+time.rrd 1114938688:0.030775
   ...

The output of the Nagios configuration file, the RRD repository, and the data transfer mode (named pipe) is followed by the time unit used by Nagios (and set with the interval_length parameter). Normally this is 60 seconds, that is, a check interval of 5 is five minutes long. It is extremely important that this parameter is correctly recognized, since Perf2rrd determines the step interval of the RRD database by multiplying the normal_check_interval and interval_length parameters together.

All measured values that occur during a step interval are averaged by the database. If this time period is too small, it is possible that the database will never issue any values, since it expects considerably more data than it obtains for saving.

While Nagiosgraph works with a fixed five-minute interval, Perf2rrd adjusts itself to the Nagios configuration. The software only takes the interval into account when creating the RRD database, however; changing the Nagios configuration later on has no further consequences. The only thing you can do here to alter this is delete the RRD database and set it up again.

Perf2rrd in permanent operation

Operating Perf2rrd on a named pipe has one disadvantage: if Nagios restarts, it closes the pipe before opening it again. Unfortunately when the pipe closes, Perf2rrd closes as well.

This can be countered by the use of the Daemon Tools by Daniel J. Bernstein. They monitor programs and restart them if these programs should ever stop. They are themselves started through an /etc/inittab entry by the init process, and are restarted if they were to shut themselves down at some point.
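The principle of "monitor a program and restart it when it stops" can be shown in a few lines of Python. The sketch below is deliberately simplified and is not how the Daemon Tools are implemented: the function name supervise, the upper limit max_starts, and the demo command are assumptions for illustration, whereas a real supervisor restarts its program for as long as the service is wanted.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a daemon that stops unexpectedly with exit code 3.
DEMO_COMMAND = [sys.executable, "-c", "raise SystemExit(3)"]


def supervise(command, max_starts, pause=0.0):
    """Run command, wait until it exits, and start it again.

    Returns the list of exit codes observed. max_starts bounds the loop
    only so that the example terminates.
    """
    exit_codes = []
    for _ in range(max_starts):
        finished = subprocess.run(command)  # blocks until the program stops
        exit_codes.append(finished.returncode)
        time.sleep(pause)                   # optional pause before restarting
    return exit_codes


codes = supervise(DEMO_COMMAND, max_starts=2)
```

Applied to Perf2rrd, this is the behavior that matters: when Nagios closes the pipe and the program ends, the supervisor simply starts it again.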
The Daemon Tools tar file can be obtained from http://cr.yp.to/daemontools/install.html and is unpacked in the directory /usr/local/src:

   linux:~ # cd /usr/local/src
   linux:local/src # tar xvzf /path/to/daemontools-0.76.tar.gz
   admin
   admin/daemontools-0.76
   admin/daemontools-0.76/package
   admin/daemontools-0.76/package/README
   ...
   admin/daemontools-0.76/src

This creates the directory admin/daemontools-0.76, with the subdirectories package and src. From there you should run the install script, which compiles and installs the program:

   linux:local/src # cd admin/daemontools-0.76
   linux:admin/daemontools-0.76 # package/install

The binaries land in the newly created directory daemontools-0.76/command and remain there. The installation routine also sets up symbolic links pointing to them from the folder /command, which is also newly created. The install script also adds the following line to the file /etc/inittab, which ensures that the Daemon Tools run permanently:

   SV:123456:respawn:/command/svscanboot

The program svscanboot searches regularly for new or crashed daemons. For this purpose it scans the /service directory, which is also created during the installation. Just one symbolic link is required to have Perf2rrd monitored:

   linux:~ # ln -s /usr/local/perf2rrd /service/perf2rrd

The Daemon Tools search in this directory for a script called run and start it. In order for run to be able to find the path to the RRD repository, an actual command-line option is entered in the script file instead of $*:

   #exec java -cp $classpath perf2rrd $*
   exec java -cp $classpath perf2rrd -d /var/lib/rrd/perf2rrd

Starting and ending Perf2rrd is now taken over by the program svc:

   linux:~ # /command/svc -d /service/perf2rrd
   linux:~ # /command/svc -u /service/perf2rrd

The -d option (for down) stops the service specified, and -u (up) starts it again. It is not necessary to run it at the beginning, since the Daemon Tools regularly scan the /service directory for new services and automatically start them.
This is important insofar as the Nagios 2.0 beta versions, on which this book is based, had problems if the configured named pipe was not read. Then it might not deliver any more data at all until a reload or restart. Whether this problem has been fixed in the final version 2.0 of Nagios could not be clarified at the time of going to press.

17.4 The Graphics Specialist drraw

From the RRD databases, generated for example by Perf2rrd or Nagiosgraph, the CGI script drraw creates interactive graphics: simple ones relatively quickly, whereas for more complex ones you need to know a bit more about the RRDtools. [8]

17.4.1 Installation

For the drraw installation, you need to obtain the current tar file from http://www.taranis.org/drraw/ and unpack it to its own subdirectory in the CGI hierarchy [9] on the Web server:

   linux:~ # cd /usr/lib/cgi-bin
   linux:lib/cgi-bin # tar xvzf /path/to/drraw-2.1.1.tar.gz
   drraw-2.1.1/
   ...
   drraw-2.1.1/drraw.cgi
   drraw-2.1.1/drraw.conf
   drraw-2.1.1/icons/
   ...

The version-dependent directory created by this is then renamed to drraw: [10]

   linux:lib/cgi-bin # mv drraw-2.1.1 drraw

drraw.cgi itself requires, apart from Perl, the Perl CGI module (CGI.pm) and the RRDtools, from at least version 1.0.47; nothing will work below version 1.0.36. If your distribution does not include a current version, you should obtain the sources from http://www.rrdtool.org/ and compile them yourself:

   linux:~ # cd /usr/local/src
   linux:local/src # tar xvzf /path/to/rrdtool-1.0.49.tar.gz
   linux:local/src # cd rrdtool-1.0.49
   linux:src/rrdtool-1.0.49 # ./configure
   linux:src/rrdtool-1.0.49 # make
   linux:src/rrdtool-1.0.49 # make install
   linux:src/rrdtool-1.0.49 # make site-perl-install

[8] Apart from the documentation on the homepage http://www.rrdtool.org/, the tutorial included (man rrdtutorial) is a useful starting point, as well as the man page man rrdgraph.
[9] Which directory this is depends on the distribution or Apache configuration you are using.
[10] A symbolic link would also be possible, but then Apache must be configured so that it follows symbolic links, which is normally not automatically the case.

The CGI script drraw.cgi uses the Perl module RRDs, which is found automatically after the installation with make site-perl-install.

17.4.2 Configuration

The drraw configuration is contained in the file drraw.conf:

   linux:cgi-bin/drraw # egrep -v '^#|^$' drraw.conf
   ...
   %datadirs = ( '/var/lib/rrd' => '[RRDbase]',
               );
   $vrefresh = '120';
   @dv_def = ( 'end - 6 hours', 'end - 28 hours', 'end - 1 week',
               'end - 1 month', 'end - 1 year' );
   @dv_name = ( 'Past 6 Hours', 'Past 28 Hours', 'Past Week',
                'Past Month', 'Past Year' );
   @dv_secs = ( 21600, 100800, 604800, 2419200, 31536000 );
   $saved_dir = '/var/lib/drraw/saved';
   $tmp_dir = '/var/lib/drraw/tmp';
   ...

The extract shown specifies the RRD repository (here: /var/lib/rrd) as the most important detail, but several directories can also be specified:

   %datadirs = ( '/var/lib/rrd' => '[RRDbase]',
                 '/data/rrd' => '[RRDdata]',
               );

The text in square brackets (e.g., [RRDbase]) appears later on the Web interface, which allows a distinction to be made between various different repositories.

The variables @dv_def, @dv_name, and @dv_secs influence the layout and number of graphics. The configuration shown above generates one graphic more than the standard configuration. This represents the past six hours: the extended statement 'end - 6 hours' in @dv_def describes the time period for rrdtool (see man rrdgraph), in @dv_name the representation is given a suitable title with 'Past 6 Hours', and @dv_secs contains the six hours converted into seconds (21600), displayed by drraw as a time period in a separate graphic.
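The numbers in @dv_secs are simply the periods from @dv_name expressed in seconds. The short Python calculation below reproduces them, which can be handy when adding a further period; the variable names are chosen for this example only. Note that the value for the month in the listing corresponds to 28 days.

```python
HOUR = 60 * 60      # seconds per hour
DAY = 24 * HOUR     # seconds per day

# Titles as in @dv_name, durations as plain arithmetic.
periods = [
    ("Past 6 Hours", 6 * HOUR),
    ("Past 28 Hours", 28 * HOUR),
    ("Past Week", 7 * DAY),
    ("Past Month", 28 * DAY),   # 2419200 seconds are exactly 28 days
    ("Past Year", 365 * DAY),
]

dv_secs = [seconds for _title, seconds in periods]
```

The three arrays must stay in step: an additional entry in one of them needs a matching entry at the same position in the other two.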
The repository must be readable for the user with whose rights the Web server is running, and the directories specified in $saved_dir and $tmp_dir must also be readable and, since drraw stores its data there, writable for this user. If a user other than www-data runs the Web server, the following commands must be adapted accordingly:

   linux:~ # mkdir -p /var/lib/drraw/{saved,tmp}
   linux:~ # chown -R www-data.www-data /var/lib/drraw

Data arrives in the temporary directory $tmp_dir, whose contents can be deleted at any time, whereas in $saved_dir drraw stores configuration data which the program needs in order to access already created graphics later on. This data must not be lost.

drraw implements a simple access protection in three stages: read-only (0), restricted editing (1), and full access (2). Users logged into the Web server automatically obtain level 2. Nonauthorized users are treated as guests and assigned level 0. To avoid the hassle with authentication at the beginning, you can grant the user guest full access via the following directive in the configuration file:

   %users = ( 'guest' => 2 );

17.4.3 Practical application

The CGI script in the CGI directory of the Web server can be addressed through the following URL:

   http://nagsrv/cgi-bin/drraw/drraw.cgi

Figure 17.2: The drraw start menu

New graphics are generated in the menu item Create a new graph in the start picture, which is shown in Figure 17.2. The dialog shown in Figure 17.3 allows the appropriate RRD database to be selected. Using a regular expression [11] in the Data Source filter regexp field, the data sources available can be further restricted; this expression can also be a simple literal text, such as sap-12.

Once you have chosen an RRD database, you just need to specify the round-robin archive (RRA) to be used. Each of these archives saves data in a particular form, processed with a consolidation function: the AVERAGE function averages all measurement data that accumulates in a measurement period, MIN saves only the minimum value of the data in an interval, and MAX saves only the maximum.
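What the three consolidation functions do to the values of one measurement period can be illustrated with a small Python function. This is a conceptual sketch only: the function name and the sample values are made up, and details of the real RRDtools (for example the handling of unknown values) are left out.

```python
def consolidate(values, function):
    """Reduce the measurements of one period to a single stored value."""
    if not values:
        raise ValueError("no measurements in this period")
    if function == "AVERAGE":
        return sum(values) / len(values)
    if function == "MIN":
        return min(values)
    if function == "MAX":
        return max(values)
    raise ValueError("unknown consolidation function: %s" % function)


# Made-up response times (in ms) collected during one period.
measurements = [10.0, 20.0, 60.0]
average = consolidate(measurements, "AVERAGE")
lowest = consolidate(measurements, "MIN")
highest = consolidate(measurements, "MAX")
```

The example also shows why a peak is no longer visible in an AVERAGE archive: from the stored value 30.0 alone, the single measurement of 60.0 cannot be recovered.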
Since the original data is lost, the archives must be specified when the round-robin database is created; maximum values can only be recalled later if this was taken into account at the time.

Figure 17.3: Selecting the data source

If you cannot remember what archives exist, you can display them using the button RRD Info for selected DB. Clicking on the Add DB(s) to Data Sources button takes you to a dialog where you first have to scroll down a bit to reach the item Data Source Configuration (Figure 17.4). There you can fine-tune the desired graph, now or later. You can define your own colors, and whether a line or a surface will be shown. You should only make use of the other possibilities if you are familiar with the concepts of the RRDtools and the way they work. [12]

The Update button provides a preview of the finished graphic, which at the same time reveals the rrdtool options used (Figure 17.5). When you save, with Save Graph, you obtain a link in the form

   http://nagsrv/cgi-bin/drraw/drraw.cgi?Mode=view;Graph=11149589.4932

with which the graphic can be accessed at any time. Alternatively you can now find the graphic in the drraw starting menu under All Graphs.

[11] POSIX regular expression; see man 7 regex.
[12] There are a number of tutorials on the homepage of the RRDtools author, Tobias Oetiker, at http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/tut/index.en.html.

Figure 17.4: Fine-tuning the graphic configuration

Figure 17.5: Preview and specifying the rrdtool options

Figure 17.6: The finished graphic represents different time periods

The link mentioned when you save a graphic can be recorded in a serviceextinfo object, making it directly accessible through the Nagios interface:

   define serviceextinfo{
      service_description PING
      host_name sap-12
      notes_url /nagiosgraph/drraw/drraw.cgi?Mode=view;Graph=11149589.4932
      icon_image graph.gif
      icon_image_alt View graphics
   }

With templates and dashboards, drraw includes other features, which cannot be discussed in detail here, for reasons of space.
Templates allow several sources of the same type to be shown in the same graphic. What these are can be specified in Create a new Graph (see Figure 17.3). Since you can only add one source at a time there, you must click the Add button for each separate source, before moving on to the next one. A dashboard presents a display containing several preview graphics. If you click on one of the graphics, you are shown the detailed representation. The interactive menu Create a Dashboard contains brief instructions on where you can obtain help on the two features.

17.5 Automated to a Large Extent: NagiosGrapher

NagiosGrapher from Netways, the host of The Nagios Exchange Platform http://www.nagiosexchange.org/, is a brand-new representation tool for performance data, but already a very powerful one. It also saves data in round-robin databases and uses the RRDtools for processing and representation. It claims to be easy to install and, in contrast to the "competition," to work automatically to a large extent.

The latter promise has so far not been kept; as in Nagiosgraph, you have to configure search patterns in order to interpret the plugin output or performance data correspondingly. The RRD databases are generated by NagiosGrapher automatically; in addition to this, the tool also generates serviceextinfo entries. Once it recognizes the performance data, you don't have to worry any more about integrating it into Nagios. A reload is sufficient to make the serviceextinfo entries generated in the meantime usable in Nagios. The entries are created "intelligently," so that if you click on the corresponding icon in the service summary (see Figure 17.7 on page 340), you are taken directly to the graphic display of the performance data.
As far as functionality and installation efforts are concerned, NagiosGrapher lies somewhere between Nagiosgraph and Perf2rrd: the initial configuration needed is somewhat more than for Nagiosgraph, but the possibilities of variations in the graphic output are considerably larger, and you do not have to generate each graphic individually, as is the case with Perf2rrd/drraw.

17.5.1 Installation

NagiosGrapher requires the Perl modules GD, CGI, RRDs, XML::Simple, XML::Parser, and Data::Dumper. [13] From version 1.3-dev, Data::Dumper replaces the XML modules, but they are still necessary here to convert data from previous versions, if required.

Normally all the standard Linux distributions will include corresponding packages. The modules can also be downloaded and installed from CPAN, following the pattern:

   linux:~ # perl -MCPAN -e 'install CGI'

The NagiosGrapher sources can be obtained from The Nagios Exchange [14] and are unpacked to the directory /usr/local/nagios:

   linux:~ # cd /usr/local/nagios
   linux:local/nagios # tar xvjf /path/to/NagiosGrapher_1.3-dev.tar.bz2
   linux:local/nagios # ln -s NagiosGrapher_1.3-dev nagiosgrapher

[13] Data::Dumper is a component of the Perl base distribution; the module RRDs belongs to the RRDtools package and is not available through the Comprehensive Perl Archive Network (CPAN).
[14] http://www.nagiosexchange.org/Charts.42.0.html, entry nagios_grapher

So that you can later on use paths which are not version-specific, you should create a symbolic link, nagiosgrapher, to the current directory NagiosGrapher_1.3-dev. Change to this directory and copy the CGI scripts to the Nagios CGI directory, or to /usr/local/nagios/sbin/ if you have installed this yourself:

   linux:local/nagios # cd nagiosgrapher
   linux:nagios/nagiosgrapher # cp *.cgi /usr/local/nagios/sbin/.
   linux:nagios/nagiosgrapher # chown nagios.nagcmd \
         /usr/local/nagios/sbin/{graphs,rrd2-graph}.cgi
   linux:nagios/nagiosgrapher # ln -s \
         /usr/local/nagios/nagiosgrapher/NagiosGrapher.pm \
         /usr/local/nagios/sbin/.

In order for Debian Sarge to be able to find NagiosGrapher.pm, the symbolic link mentioned must be located in /etc/perl:

   ln -s /usr/local/nagios/nagiosgrapher/NagiosGrapher.pm /etc/perl/.

The example configuration file ngraph.ncfg [15] is copied to the directory /etc/nagios, and two icons to the logos directory of the Nagios installation:

   linux:nagios/nagiosgrapher # cp cfg/ngraph.ncfg /etc/nagios/.
   linux:nagios/nagiosgrapher # cp graph.png dot.png \
         /usr/local/nagios/share/images/logos

Before the start script nagios_grapher is stored in /etc/init.d, the path to collect2.pl contained in it is adjusted (this script reprocesses the data passed on by Nagios and writes it to the RRD databases):

   # nagios_grapher start script
   ...
   DAEMON=/usr/local/nagios/nagiosgrapher/collect2.pl
   ...

So that the script also runs in the individual runlevels, corresponding symbolic links are set in the rc?.d directories. [16] Here is an example for Debian; the link target is given relative to the rc?.d directory, since a symbolic link is resolved relative to the directory in which it lies:

   linux:nagios/nagiosgrapher # cp nagios_grapher /etc/init.d/
   linux:nagios/nagiosgrapher # cd /etc/init.d
   linux:etc/init.d # ln -s ../init.d/nagios_grapher /etc/rc2.d/S99nagios_grapher
   linux:etc/init.d # ln -s ../init.d/nagios_grapher /etc/rc3.d/S99nagios_grapher
   linux:etc/init.d # ln -s ../init.d/nagios_grapher /etc/rc4.d/S99nagios_grapher
   linux:etc/init.d # ln -s ../init.d/nagios_grapher /etc/rc5.d/S99nagios_grapher

[15] In versions up to 1.2 the file extension was .cfg!
[16] Depending on the distribution, these are directly in /etc (Debian) or in /etc/init.d (SuSE, Red Hat).

17.5.2 Configuration

The configuration file ngraph.ncfg

The configuration file ngraph.ncfg contains a global config section with paths and general settings. This is followed by as many ngraph definitions as you want, each of which describes a graphic.
Even from the global details, it is not difficult to see that the syntax sticks close to the concept used by Nagios:

   # /etc/nagios/ngraph.ncfg
   define config {
      pipe /var/nagios/ngraph.pipe
      step 60
      heartbeat 600
      rrdpath /var/lib/rrd/nagiosgrapher
      tmppath /tmp/nagiosgrapher/
      serviceext_type MULTIPLE
      serviceextinfo /etc/nagios/serviceextinfo.cfg
      serviceext_path /etc/nagios/serviceext
      url /nagios/cgi-bin/graphs.cgi
      nagios_config /etc/nagios/nagios.cfg
      cgi_config /etc/nagios/cgi.cfg
      icon_image_tag dot.png' border="0"> \
   ...

   pNSClient.exe /install
   C:\Programs\NSClient> net start nsclient

Running pNSClient.exe /install installs the service, and the switch /uninstall removes the service again. Using the services management you should make sure that the operating system starts the service automatically.

NSClient has two parameters: port and password, with the defaults port 1248 and password none. The values can only be changed (with regedit) in the registry under HKEY_LOCAL_MACHINE\SOFTWARE\NSClient\Parms.

NC Net

Before you install the most current version from http://www.shatterit.com/NC_Net, it is essential that any previous version installed is first uninstalled. Since NC Net uses the Microsoft Installer, you do this through the software administration utility. Even an NSClient that might exist should be removed first.

Double clicking on the file NC_Net_setup.msi installs the service, but you should check in the service management that it really is running, and whether or not automatic is entered as the starting type.

NC Net has the same parameters as NSClient, with password and port, but these can also be specified in the services management under properties in the Start parameters line:

   port 4711 password password

18.1.2 The check_nt plugin

When installing the standard Nagios plugins, the check_nt plugin is automatically loaded to the hard drive. It only has the same range of functions as NSClient, however.
To make use of the extensions of NC Net, you must download the extended source code (the file check_nt.c) from http://www.shatterit.com/NC_Net and compile it yourself.

The actual effect that the check_nt parameters have, described below, depends on the command that is specified with the -v option, and which you can read about in more detail in Section 18.1.3 on page 356:

-H address / --host=address
   IP address or host name of the host on which the NSClient/NC Net is installed.

-v command / --variable=command
   The command to be executed.

-p port / --port=port
   This defines an alternative port for NSClient/NC Net. The default is TCP port 1248.

-w integer / --warning=integer
   This defines a warning limit. This option is not available for all commands.

-c integer / --critical=integer
   The critical limit option is also not available for all commands.

-l parameter
   This is used for passing parameters along, such as the drive for the hard drive check or the process name when checking processes.

-d option
   When checking services or processes, you can specify several services or processes simultaneously. Normally check_nt then only shows the defective ones (-d SHOWFAIL). To have all of them displayed you must specify SHOWALL as the option.

-s password
   A password for authentication is only required if NC Net or NSClient starts the corresponding service with the password parameter.

-t timeout / --timeout=timeout
   After timeout seconds have elapsed, the plugin aborts the test and returns the CRITICAL state. The default is 10 seconds.

18.1.3 Commands which can be run with NSClient and NC Net

For the commands introduced here, it makes no difference whether NSClient or NC Net is installed; they can be run with the unpatched check_nt.
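If check_nt is called from your own scripts, for instance for a quick test before the command object is defined, the options listed in Section 18.1.2 can be assembled programmatically. The following Python sketch only builds the argument list and does not run anything; the function name and its keyword arguments are assumptions for illustration, while the switches themselves are the ones described above.

```python
def build_check_nt_args(address, command, port=None, warning=None,
                        critical=None, parameter=None, password=None,
                        timeout=None):
    """Assemble a check_nt argument list from the documented options."""
    args = ["check_nt", "-H", address, "-v", command]
    optional = [
        ("-p", port),       # alternative port (default is 1248)
        ("-w", warning),    # warning limit, not valid for all commands
        ("-c", critical),   # critical limit, not valid for all commands
        ("-l", parameter),  # command-specific parameters
        ("-s", password),   # only if the service was started with a password
        ("-t", timeout),    # timeout in seconds
    ]
    for flag, value in optional:
        if value is not None:
            args.extend([flag, str(value)])
    return args


version_args = build_check_nt_args("winsrv", "CLIENTVERSION")
memuse_args = build_check_nt_args("winsrv", "MEMUSE", warning=80, critical=90,
                                  timeout=15)
```

Passing the arguments as a list rather than as one string avoids quoting problems when a parameter for -l contains spaces.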
Querying the client version

The version of the installed NSClient or NC_Net service is returned by running the command

    check_nt -H address -v CLIENTVERSION

All other arguments are ignored:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v CLIENTVERSION
    NC_Net 2.21 03/13/05

Command and service definitions are not very spectacular, but the latter is extremely useful in describing dependencies:

    define command{
        command_name    check_nt_nsclient
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v CLIENTVERSION
    }

    define service{
        host_name              winsrv
        service_description    NSClient
        check_command          check_nt_nsclient
        ...
    }

If NSClient/NC_Net fails on the Windows server, Nagios normally informs the administrator of all services which have presumably failed. This problem is similar to one with NRPE, which in that case was solved through the definition of dependencies (see Section 12.6, page 234). This is also the case when using NSClient/NC_Net:

    define servicedependency{
        host_name                        winsrv
        service_description              NSClient
        dependent_host_name              winsrv
        dependent_service_description    Disks,Load,Memory
        notification_failure_criteria    c,u
        execution_failure_criteria       n
    }

With NSClient as a master service on which the other services are dependent, Nagios does not trouble the admins with messages from these other services, as long as NSClient is in a CRITICAL or UNKNOWN state.

CPU load

How heavy the load is on the processor is revealed by the command CPULOAD:

    check_nt -H address -v CPULOAD -l interval,warning_limit,critical_limit

It expects a triplet of parameters, separated by commas, consisting of the length of the time interval that is to be averaged, in minutes, and the two thresholds for the warning and critical limits in percent. So CPULOAD, with 5,80,90, forms the average over five minutes and issues a warning if the value determined exceeds 80 percent.
If there is over 90% CPU load, the command returns CRITICAL:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v CPULOAD -l 5,50,90
    CPU Load 10% (5 min average) | '5 min avg Load'=10%;50;90;0;100

The output here also contains additional performance data after the | sign, which Nagios ignores in the Web interface.

If you are interested in average values over several intervals, you just add further triplet values following the first one:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v CPULOAD \
        -l 5,80,90,15,70,80
    CPU Load 10% (5 min average) 10% (15 min average) | '5 min avg Load'=10%;80;90;0;100 '15 min avg Load'=10%;70;80;0;100

In this example CPULOAD checks two intervals: the past five minutes and the past 15 minutes. In the second case there are deviating limit values. The plugin always returns the more critical value; for example, it returns CRITICAL if one interval issues CRITICAL and the other just a WARNING.

The command and service definitions therefore look like this:

    define command{
        command_name    check_nt_cpuload
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v CPULOAD -l $ARG1$
    }

    define service{
        host_name              winsrv
        service_description    CPU Load
        check_command          check_nt_cpuload!5,80,90,15,70,80
        ...
    }

Main memory usage

When specifying the limit values, the command for monitoring the amount of main memory used (in contrast to CPULOAD) is based on the syntax of "normal" Nagios plugins:

    check_nt -H address -v MEMUSE -w integer -c integer

MEMUSE returns the memory usage in percent.
It should be remembered that Windows refers here to the sum of memory and swap files, that is, the entire available virtual memory. The command expects the warning and critical limits as percentages, given without a percent sign:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v MEMUSE \
        -w 70 -c 90
    Memory usage: total:4331.31 Mb - used: 257.04 Mb (6%) - free: 4074.27 Mb (94%) | 'Memory usage'=257.04Mb;3031.91;3898.18;0.00;4331.31

On the example host, winsrv, only six percent of the virtual memory is used. The fact that the physical size of the main memory itself (here: 256 MBytes) is already exceeded is not shown in the output.

It does not necessarily make sense, however, to request the memory usage as in Unix: Windows regularly swaps program and data code from the main memory, even when it still has spare reserves. In Unix, programs and data land in the swap partition only if more space is required than is currently free. In this respect the load of the entire virtual memory in Windows is the more important parameter.

The command mentioned above is again packed into a command and a service object:

    define command{
        command_name    check_nt_memuse
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v MEMUSE \
                        -w $ARG1$ -c $ARG2$
    }

    define service{
        host_name              winsrv
        service_description    MEM Usage
        check_command          check_nt_memuse!70!90
        ...
    }

Hard drive capacity

The load on a file system is tested by USEDDISKSPACE:

    check_nt -H address -v USEDDISKSPACE -l drive_letter -w integer -c integer

In Windows fashion, the file system is specified as a drive letter, the limit values in percent:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v USEDDISKSPACE \
        -l C -w 70 -c 80
    C: - total: 4.00 Gb - used: 2.06 Gb (52%) - free 1.94 Gb (48%) | 'C: Used Space'=2.06Gb;2.80;3.20;0.00;4.00
    nagios@linux:nagios/libexec$ echo $?
    0

In the example, check_nt should issue a WARNING if drive C is more than 70 percent full, and a CRITICAL if the load exceeds 80%. The current value lies at 52 percent, so check_nt returns an OK, which you can check with echo $?.

The corresponding command and service objects would look something like this:

    define command{
        command_name    check_nt_disk
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v USEDDISKSPACE \
                        -l $ARG1$ -w $ARG2$ -c $ARG3$
    }

    define service{
        host_name              winsrv
        service_description    Disk_C
        check_command          check_nt_disk!C!70!80
        ...
    }

Uptime

How long ago the last reboot was performed is revealed by the command UPTIME:

    check_nt -H address -v UPTIME

Defining a warning or critical limit is not possible, which is why such a query is only for information purposes (the plugin returns either OK, or UNKNOWN if it is used wrongly):

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v UPTIME
    System Uptime - 17 day(s) 9 hour(s) 54 minute(s)

So the host winsrv has already been running for 17 days. The definition of the corresponding command and service objects is trivial:

    define command{
        command_name    check_nt_uptime
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v UPTIME
    }

    define service{
        host_name              winsrv
        service_description    UPTIME
        check_command          check_nt_uptime
        ...
    }

Status of services

The current status of Windows services can be checked with SERVICESTATE:

    check_nt -H address -v SERVICESTATE -d SHOWALL -l service1,service2,...

The optional -d SHOWALL ensures that the output text lists all services. If you leave this option out, the plugin provides information only on those services that are not running.

Finding the name of the service description to be specified for NSClient after the -l option is quite a challenge.
It is not the display name shown by the services management (e.g., Routing and RAS) that is being sought, but the registry entry that corresponds to it. Accordingly you search with the Registry editor regedit in the partial tree HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services for the node with the corresponding display name. It contains the service description being sought, which in the case of Routing and RAS is something like RemoteAccess.

If you use NC_Net, you have an easier task: the software accepts both the service description and the display name, and no distinction is made between upper and lower case. The following two examples use the display name:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v SERVICESTATE \
        -l "Routing and RAS"
    Routing and RAS: Stopped
    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v SERVICESTATE \
        -l "VNC Server"
    All services are running

The service Routing and RAS is currently not running, and check_nt returns the return value 2 (CRITICAL). The fact that the VNC server is performing its services correctly, on the other hand, is only revealed indirectly without -d SHOWALL. The plugin here returns 0 (OK) as the return value.

Several services can be included in a single command, separated by commas. The corresponding return value is dictated by the "worst case."

The matching command and service objects look something like this:

    define command{
        command_name    check_nt_service
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v SERVICESTATE \
                        -l $ARG1$
    }

    define service{
        host_name              winsrv
        service_description    Routing and RAS
        check_command          check_nt_service!"Routing and RAS"
        ...
    }

Status of processes

As with the services, PROCSTATE monitors running processes:

    check_nt -H address -v PROCSTATE -d SHOWALL -l process1,process2,...
The process name, which almost always ends in .exe, is best determined in the process list of the task manager; upper and lower case are also ignored here:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v PROCSTATE \
        -l WinVNC.exe,winlogon.exe,notexist.exe
    notexist.exe: not running

As with the services, you can also specify a list of several processes, separated by commas. Without -d SHOWALL, PROCSTATE shows only those processes that are not running, in this example, notexist.exe.

The corresponding command and service definitions look like this:

    define command{
        command_name    check_nt_process
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v PROCSTATE \
                        -d SHOWALL -l $ARG1$
    }

    define service{
        host_name              winsrv
        service_description    WinVNC
        check_command          check_nt_process!winvnc.exe
        ...
    }

Age of files

It is worth monitoring the time since the last modification of critical files with FILEAGE, particularly for log files and other files that change regularly:

    check_nt -H address -v FILEAGE -l path -w integer -c integer

The command needs the file name together with its complete path, and backslashes must be doubled, as in C:\\xyz.log. The units for threshold values are minutes, and if they are exceeded, FILEAGE will issue a WARNING or CRITICAL. By default, the command gives the time since the last modification in epoch seconds (seconds since January 1, 1970):

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v FILEAGE \
        -l "C:\\test.log" -w 1 -c 20
    1113158517
    nagios@linux:nagios/libexec$ echo $?
    1

The status can again be checked with echo $?. Here as well, the command and service definitions do not hold any secrets:

    define command{
        command_name    check_nt_fileage
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v FILEAGE \
                        -l $ARG1$ -w $ARG2$ -c $ARG3$
    }

    define service{
        host_name              winsrv
        service_description    Log file
        check_command          check_nt_fileage!C:\\xyz.log!60!1440
        ...
    }

18.1.4 Advanced functions of NC_Net

NC_Net's range of functions is expanding constantly; this chapter describes the possibilities that go beyond NSClient for version 2.21.
Many of them, especially the ENUM* functions, are only suitable for direct use in Nagios in exceptional cases. But they are very useful if you need to find out the precise name of a service, a process, or a Windows Performance Counter.

The extensions require an up-to-date check_nt plugin, whose source code consists of a single file, check_nt.c. It is copied to the hard drive during the installation of NC_Net, but it can also be downloaded separately from http://www.shatterit.com/nc_net.

The source can currently be compiled without problems only in combination with the entire Nagios plugin package (see Section 1.2, page 30). To do this, you overwrite the existing file check_nt.c in the subdirectory plugins with the extended version. The old check_nt binary must be deleted; then you run make check_nt to recompile the source file. Afterwards you copy the binary to the libexec directory of Nagios, along with the other plugins:

    linux:~ # cp check_nt.c /usr/local/src/nagios-plugins-1.4/plugins
    linux:~ # cd /usr/local/src/nagios-plugins-1.4/plugins
    linux:nagios-plugins-1.4/plugins # rm check_nt
    linux:nagios-plugins-1.4/plugins # make check_nt
    linux:nagios-plugins-1.4/plugins # cp check_nt /usr/local/nagios/libexec/.

Windows Performance Counter

Through so-called Performance Counters, Windows provides values for everything in the system that can be expressed in numbers: hard drive usage, CPU usage, number of logins, number of terminal server sessions, the load on the network interface, and many more things.

    check_nt -H address -v ENUMCOUNTER -l category1,category2

If you omit the -l parameter, ENUMCOUNTER will display a list of all performance counter categories:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v ENUMCOUNTER
    ... Processor; ... Terminal services; .NET CLR loading procedure; total RAS services; Process; ...

Otherwise, it shows all counters in the category specified with -l. Several categories are separated with commas.
The Terminal Services category contains three counter objects in all:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v ENUMCOUNTER \
        -l "Terminal Services"
    Terminal Services: Total Sessions; Active Sessions; Inactive Sessions
    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v ENUMCOUNTER \
        -l "Terminal Services","Process"
    Terminal Services: Total Sessions; Active Sessions; Inactive Sessions
    - Process: %Processor Time; %User Time; %Privileged Time; Virtual Bytes Peak; Virtual Bytes; Page Faults/sec; Working Set Peak; Working Set; ...

The precise object name is important for later use, and the % sign (as, for example, in %Processor Time) is part of the name. If the counter or category name contains spaces, you must remember to place it within quotation marks when formulating the request.

The descriptions stored in the Windows Performance Counter objects are shown, by the way, with the command ENUMCOUNTERDESC.

Several counter categories contain instances, which you must specify when querying a counter object. For this reason you should always check first, using the INSTANCES function, whether the category you want works with instances:

    check_nt -H address -v INSTANCES -l category1,category2

For the terminal services, this is not the case:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v INSTANCES \
        -l "Terminal Services"
    Terminal Services:

Typical categories with instances are Processor or Process:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v INSTANCES \
        -l "Process"
    Process: svchost#6,svchost,Idle,explorer,services,...

Here it becomes apparent what is meant by instances: Windows views every running process as an instance in the Process Performance Counter category. As can be seen on page 364, the counter object %Processor Time, which contains the percentage use of processor time, is in this category. It can be queried only for individual instances, such as for the explorer process, or for all processes together; in that case you specify _Total instead of an instance.
In order to access a Windows Performance Counter, therefore, you always need to give the following details:

    \category\counter object
    \category(instance)\counter object

The instance is specified only if the category has instances available. There must be no space between the category name and the first bracket. The corresponding query command is called COUNTER; the placeholder name is replaced by the combination just described:

    check_nt -H address -v COUNTER -l name,format_description -w integer -c integer

This function asks after the Windows Performance Counter object that is specified after the -l option with its exact name. The warning and critical limits given as integer values refer to the size measured: if an object is involved that has a percentage figure (e.g., the processor load), just imagine a percent sign added to it; the numbers of processes, sessions, etc., are just values that are not specified in units.

The number of active sessions is checked with the Active Sessions object, for which there are no instances:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v COUNTER \
        -l "\Terminal Services\Active Sessions"
    1
    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v COUNTER \
        -l "\Process(Idle)\%Processor Time"
    98

Because the Idle instance always looks at the difference between used and spare processor load, so that the sum of the two is always 100 percent, querying the _Total pseudo-instance in the second example does not make much sense.

Normally COUNTER does not format its output. This can be changed by following the object name with a description in the printf format,⁵ separated from it with a comma:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v COUNTER \
        -l "\Process(Idle)\%Processor Time","Idle Process: %.2f %%"
    Idle Process Usage is: 54.00 % | 'Idle Process Usage is: %.2f %%'=54.000000%;0.000000;0.000000;

Not only does this make the output clearer, it also returns additional performance data.
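The naming scheme for COUNTER described above can be sketched in a few lines of Python. The function names counter_path and counter_argument are made up for this illustration; the sketch only reproduces the two path forms and the optional printf-style description, and it does not contact a Windows host.

```python
def counter_path(category, counter, instance=None):
    r"""Return \category\counter or \category(instance)\counter.

    No space is placed between the category name and the bracket,
    as the text requires.
    """
    if instance is None:
        return "\\{}\\{}".format(category, counter)
    return "\\{}({})\\{}".format(category, instance, counter)


def counter_argument(path, fmt=None):
    """Build the value for -l: the quoted path, optionally followed by
    a quoted printf-style description, separated by a comma."""
    if fmt is None:
        return '"{}"'.format(path)
    return '"{}","{}"'.format(path, fmt)


# A category without instances and one with an instance.
sessions = counter_path("Terminal Services", "Active Sessions")
idle = counter_path("Process", "%Processor Time", instance="Idle")

print(counter_argument(sessions))
print(counter_argument(idle, "Idle Process: %.2f %%"))
```

Keeping the path construction in one place makes it harder to forget the leading backslash or to slip a space in front of the instance bracket.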
The Nagios command and the corresponding service definition then look like this:

    define command{
        command_name    check_nt_counter
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v COUNTER \
                        -l $ARG1$
    }

    define service{
        host_name              winsrv
        service_description    Idle Time
        check_command          check_nt_counter!"\Process(Idle)\%Processor Time","Idle Process: %.2f %%"
        ...
    }

⁵ man 3 printf

The two functions COUNTER and INSTANCES also belong to the NSClient range of functions, but they are extremely difficult to handle there. If you want to use them, you are well advised to switch to NC_Net.

Listing processes and services

To find out the names of processes, you can work your way through the Task Manager, or have a list of all running processes displayed with ENUMPROCESS:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v ENUMPROCESS
    System Idle Process; System; smss.exe; csrss.exe; winlogon.exe; services.exe; lsass.exe; svchost.exe; svchost.exe; svchost.exe; ...

The equivalent command for listing all installed services is ENUMSERVICE:

    check_nt -H host -v ENUMSERVICE -l type,short

The optional -l restricts the output to specific categories (see Table 18.1):

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v ENUMSERVICE \
        -l manual,short
    ALG; AppMgmt; BITS; COMSysApp; dmadmin; EventSystem; HTTPFilter; LPDSVC; MSIServer; Netman; Nla; NtFrs; NtLmSsp; NtmsSvc; RasAuto; ...

With the short option, ENUMSERVICE displays the service names as they are entered in the registry; if you leave out the keyword, it shows the display names.
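The ENUMPROCESS and ENUMSERVICE listings above are single lines of names separated by semicolons. A minimal Python sketch, assuming exactly that format, splits such a line into individual names. The function name split_enum_output is hypothetical, and the sample string is an abbreviated copy of the ENUMPROCESS output shown above.

```python
def split_enum_output(text):
    """Split semicolon-separated plugin output into a list of names,
    dropping surrounding whitespace and empty entries."""
    return [item.strip() for item in text.split(";") if item.strip()]


# Abbreviated sample taken from the ENUMPROCESS listing in the text.
sample = ("System Idle Process; System; smss.exe; csrss.exe; "
          "winlogon.exe; services.exe; lsass.exe; svchost.exe; svchost.exe")

names = split_enum_output(sample)

# Process names are matched without regard to case, so a lowercased,
# de-duplicated list is convenient when choosing names for PROCSTATE.
unique_names = sorted({name.lower() for name in names})
print(unique_names)
```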
Table 18.1: Limiting options for ENUMSERVICE

    Type         Description
    all          all services
    running      all currently active services
    stopped      all services which have been stopped
    automatic    services starting automatically
    manual       services which must be started manually
    disabled     disabled services

Querying the Windows event log

With the EVENTLOG command, the Windows Event Log can be queried:

    check_nt -H address -v EVENTLOG -w integer -c integer \
        -l eventlog,event_type,interval,source_filter,description_filter,id_filter

Using it does take some getting used to, however:⁶ the first three parameters to follow -l select the events to be taken into account by type and by time. The placeholder eventlog is replaced with one of the three log areas application, security, or system that you want to look at. If EVENTLOG is to include all three, you just specify any; but you cannot choose only two of the three areas.

For the event type you can choose from error, Warning, Information, or any for all three. In place of interval you specify a time interval in minutes: 5 limits the selection to events which occurred in the last five minutes, for example; 1440 stands for a whole day.

The last three parameters in effect work as filters with which specific results can be determined from the preselection: events that all originate from a particular source (the source filter placeholder), that contain a specific pattern in their descriptions (description filter), or that have a specific event ID (id filter).

Each of these filters consists of two parts: in the first an integer reveals how many search patterns are to follow (formulated as regular expressions in accordance with the .NET Regexp class), and then the actual filter entries are specified, separated by commas. If one of the filters is not used, its placeholder is replaced with a 0, which searches for exactly zero search patterns. A source filter which only looks for NC_Net events would be called 1,NC_Net; if you want to search for NC_Net and Perflib events, it would be called 2,NC_Net,Perflib.
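The count-then-patterns encoding of the three filters is easy to get wrong by hand. The following Python sketch shows one way to assemble the -l value under the rules just described; the function names filter_part and eventlog_argument are invented for this example, and patterns that themselves contain commas are not handled.

```python
def filter_part(patterns=()):
    """Encode one filter: the number of patterns, followed by the
    patterns themselves, all separated by commas. An unused filter
    becomes a plain 0."""
    patterns = [str(p) for p in patterns]
    return ",".join([str(len(patterns))] + patterns)


def eventlog_argument(log="any", event_type="any", minutes=5,
                      sources=(), descriptions=(), ids=()):
    """Assemble the six-part value for -l of the EVENTLOG command:
    log area, event type, interval in minutes, then the source,
    description, and ID filters."""
    parts = [log, event_type, str(minutes),
             filter_part(sources),
             filter_part(descriptions),
             filter_part(ids)]
    return ",".join(parts)


# The two source filters named in the text:
print(filter_part(["NC_Net"]))             # 1,NC_Net
print(filter_part(["NC_Net", "Perflib"]))  # 2,NC_Net,Perflib

# All events from all log areas in the last five minutes, no filters:
print(eventlog_argument())                 # any,any,5,0,0,0
```

Generating the string instead of typing it keeps the leading counts in step with the number of patterns, which is the part most likely to drift when a filter is edited later.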
-l any,any,5,0,0,0 evaluates all entries from all event ranges from the last five minutes. -l application,error,1440,0,0,0 determines all events of the type error which occurred in the event range application within the last 24 hours. With -l application,error,60,1,NC_Net,0,0, the time window is set to 60 minutes and the event source is filtered using the string NC_Net. Finally, -l application,any,60,0,2,start,stop,0 searches the event description for two keywords: start and stop.

With the warning and critical limits you can specify how many matching entries are needed before the plugin returns a WARNING or CRITICAL value. If you leave out these two parameters, Nagios shows OK as long as no events occurred; otherwise, it shows CRITICAL.

⁶ According to his own comments, author Tony Montibello wanted to change the syntax for defining services in version 2.25. But up to and including version 2.28, this resolution has not yet been implemented.

The following example asks how many messages there were within the last 24 hours in the applications area:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v EVENTLOG \
        -l "Application,any,1440,0,0,0"
    9 Errors with ID: 13001;2003;1010;6013;1111;262194;26;262194;26 LAST - ID 262194;Not all data for the file "\Device\LanmanRedirector" were saved. Possible causes are computer hardware or the network connection. Please specify a different file path.

The error message displayed here, LAST - ID 262194;Not all data..., belongs to the last entry found.

A command definition that omits details of warning and critical limits could look like this:

    define command{
        command_name    check_nt_eventlog
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -v EVENTLOG \
                        -l $ARG1$
    }

On this basis a service could be defined that, for example, searches for errors in all classes in the System area which occurred in the past five minutes. (When specifying the time period you should generally ensure that it correlates with the time period in normal_check_interval.)
The service examines the descriptions of the entries found for the text data loss. The source and ID filters are not used here:

    define service {
        host_name                winsrv
        service_description      Eventlog data loss
        check_command            check_nt_eventlog!System,any,5,0,1,data loss,0
        is_volatile              1
        normal_check_interval    5
        max_check_attempts       1
        ...
    }

Log files have the characteristic of pointing out a problem only once under certain circumstances, even if the problem continues. You must therefore ensure that Nagios immediately makes a notification the first time the event occurs, and leaves out repeated tests and soft states. This can be achieved with max_check_attempts 1: this immediately sets off a hard state, and notification is given right away.

But if the hard state remains, this would mean in practice that new errors might occur in the meantime (the next test after five minutes no longer records the old states), while the state has not changed; the admin would only be informed again after the notification_interval has expired. For such cases, Nagios has available the is_volatile parameter (see Section 14.5.2, page 257), with which the system provides notification on every single error.

Displaying and manipulating the NC_Net configuration

The ENUMCONFIG function displays the current settings of NC_Net in a readable form:

    nagios@linux:nagios/libexec$ ./check_nt -H winsrv -v ENUMCONFIG
    Date: 16.04.2005 18:15:10;
    Version: NC_Net 2.21 03/13/05;
    NC_Net Config Path: c:\Programs\shatter it\nc_net\config\;
    Startup Config: c:\Programs\shatter it\nc_net\config\startup.cfg;
    Debug Log: c:\Programs\shatter it\nc_net\config\deb.log;
    ...
    Port: 1248;
    Pass: None;
    ...

Date shows the current query date, Version the NC_Net version used. NC_Net Config Path describes the path to the configuration directory, Startup Config the configuration file used. Debug Log specifies the log file containing the debugging output, but only if the MYDEBUG true parameter is set in the configuration file.
Port reveals the port on which NC_Net is listening, and Pass shows whether a password has been used for the connection (None: no password).

There is also the command CONFIG to manipulate the configuration of the NC_Net installation over the network. For reasons of security you should use this for test purposes only, and otherwise keep the function switched off. Accordingly you should keep the following default set in the configuration file startup.cfg:

    lock_passive_config true
    lock_active_config true

This means that the configuration cannot be changed from the outside.

Other functions

NC_Net's range of functions is growing all the time, and to describe all the functions in detail would need a separate book. We'll just mention a few quite useful commands:

FREEDISKSPACE
    The equivalent of USEDDISKSPACE (page 359) expects the free hard drive capacity (instead of the used space) in percent for warning and critical limits.

WMIQUERY
    This function enables the SQL-capable WMI⁷ database to be queried, which contains the .NET configuration data.

WMICOUNTER
    Objects comparable to the Windows performance counters also exist in the WMI area (only .NET); they can be queried with this.

Passive Checks
    From version 2.0, NC_Net also supports passive checks based on the NSCA mechanism (see Chapter 14, page 247). A short documentation can be found in the included passive.cfg file.

More information can be found in the file readme.html, included in the installation, but it can also be viewed directly at http://www.shatterit.com/nc_net/files/readme.html.

18.2 NRPE for Windows: NRPE_NT

With NRPE_NT there is a version of the Nagios Remote Plugin Executor, introduced in Chapter 10, ported for Windows. Its task is to execute plugins on the target system if a particular test is only possible locally and no suitable network protocol exists to query the resource concerned. As with the Unix version, the desired plugins must be installed locally on the target system, apart from the daemon (in this case: NRPE_NT), and the tests must be entered in a local configuration file.
NRPE_NT is based on NRPE version 2.0. This means that the same check_nrpe plugin can be used for querying as the one for the Unix NRPE.

On the Internet a series of plugins executable in Windows can be found which work together with NRPE_NT. The first place to look is again The Nagios Exchange, which has a separate subcategory.⁸ Some of these programs are based on the same source code as their Unix equivalents and were simply compiled for Windows. The ported programs also include some Perl scripts, which require an installed version of Perl; in most cases the script language will first have to be installed.

NRPE_NT can also be used for other purposes: once installed on the Windows server, you can use the mechanism to run other scripts remotely, apart from Nagios plugins. If you want Nagios to restart a service remotely through the Eventhandler, this can be done just as easily with NRPE_NT.⁹

⁷ Short for Windows Management Instrumentation.
⁸ http://www.nagiosexchange.org/NRPE_Plugins.66.0.html
⁹ To execute scripts remotely on a Windows server, you can also use the Windows version of the Secure Shell, a topic that is too large to go into in this book.

18.2.1 Installation and configuration

The current zip archive from The Nagios Exchange or http://www.miwi-dv.com/nrpent is unpacked to a suitable directory, such as D:\Programs\Nagios\nrpe_nt:

    D:\Programs\Nagios\nrpe_nt> unzip nrpe_nt.0.8-bin.zip

It contains a subdirectory bin, in which are found the daemon NRPE_NT.exe, two DLLs for using SSL (libeay32.dll and ssleay32.dll), an example of a simple plugin script (test.cmd), and the configuration file nrpe.cfg.
The service is installed from this directory with the command nrpe_nt -i, after which it just needs to be started, either in the Windows services manager or from the command line:

    D:\Programs\Nagios\nrpe_nt\bin> nrpe_nt -i
    D:\Programs\Nagios\nrpe_nt\bin> net start nrpe_nt

The configuration file nrpe.cfg is only slightly different from the Unix version of NRPE 2.0 (see Section 10.3, page 170): only the directive include_dir does not function in NRPE_NT.

The file in Windows also has the classical Unix text format, so either you require a suitable editor (notepad.exe is not sufficient) or you must edit it in Linux and copy it afterwards to the test system.

Since there is no inet daemon in Windows, you must specify the port (standard: server_port=5666) and the hosts from which NRPE should be addressed (you should only enter the Nagios server here; for example: allowed_hosts=172.17.129.2)¹⁰ in nrpe.cfg. The parameters nrpe_user and nrpe_group have no meaning in Windows, and the other parameters correspond to those discussed in Section 10.3.

In the definition of executable commands (here for the included test plugin) you must remember the Windows-typical syntax with hard drive letters and backslashes:

    command[check_cmd]=D:\Programs\nagios\nrpe_nt\plugins\test.cmd

In this example the plugins are in a separate subdirectory called plugins. After changes to the configuration file you should always restart NRPE_NT:

    D:\Programs\Nagios\nrpe_nt\bin> net stop nrpe_nt
    D:\Programs\Nagios\nrpe_nt\bin> net start nrpe_nt

¹⁰ This security measure, however, is restricted to a simple comparison of IP addresses.

18.2.2 Function test

Before putting NRPE_NT into service, you should check whether it is functioning correctly.
To do this, run the plugin check_nrpe on the Nagios server as the user nagios, with just one host specification and no other parameters:

    nagios@linux:nagios/libexec$ ./check_nrpe -H winsrv
    NRPE_NT v0.8/2.0

If the service has been correctly installed and configured, it will reply with a version number. Another simple test is performed by the included test.cmd plugin. It provides a short text and ends with the return value 1:

    @echo off
    echo hallo from cmd
    exit 1

The command to be executed (defined in the previous section) is passed to the plugin check_nrpe with the -c option:

    nagios@linux:nagios/libexec$ ./check_nrpe -H winsrv -c check_cmd
    hallo from cmd
    nagios@linux:nagios/libexec$ echo $?
    1

The return value, determined with echo $?, must be 1 in this case, since the script exits with an exit 1.

18.2.3 The Cygwin plugins

In the Check Plugins → Windows¹¹ category, Nagios Exchange includes the Cygwin Plugins package for downloading. It consists of Nagios standard plugins, which have been compiled for Windows with the help of the Cygwin Tools.¹² Apart from the executable plugins (*.exe) the package also contains all the necessary DLLs. It is therefore sufficient to unpack the zip archive into a directory:

    D:\Tmp> unzip CygwinPlugins1-3-1.zip
    D:\Tmp> dir NagPlug
    check_dummy.exe   check_ssh.exe    check_udp.exe         cygwin1.dll
    check_http.exe    check_tcp.exe    cygcrypto-0.9.7.dll   negate.exe
    check_smtp.exe    check_time.exe   cygssl-0.9.7.dll      urlize.exe

¹¹ http://www.nagiosexchange.org/Windows.49.0.html
¹² These are ported versions of a large number of GNU tools, including compilers, libraries, and shells. Thanks to their open license (a GPL derivative) they have become an unofficial standard for those who wish to port Open Source programs from the Unix world to Windows.
For the sake of simplicity, just copy the contents of the directory that is created, NagPlug, to the plugin directory of NRPE_NT:

    D:\Tmp\NagPlug> copy * D:\Programs\Nagios\nrpe_nt\plugins

The plugins function in the same way as in Linux. Table 18.2 refers to the corresponding sections in this book.

Table 18.2: Cygwin Plugins for NRPE_NT

    Plugin             Page    Description
    check_dummy.exe    154     Test plugin
    check_http.exe     98      Reachability of a Web site
    check_smtp.exe     92      Testing a mail server
    check_ssh.exe      108     SSH availability
    check_tcp.exe      110     Generic plugin
    check_time.exe     146     Clock time comparison of two hosts
    check_udp.exe      112     Generic plugin
    negate.exe         155     Negates the return value of a plugin
    urlize.exe         156     Creates a link to the plugin output in the Nagios Web interface

As in Unix, each of the corresponding command definitions in the configuration file nrpe.cfg must be written on a single line:

    command[check_web]=D:\Programs\nagios\nrpe_nt\plugins\check_http -H www.swobspace.de
    command[check_identd]=D:\Programs\nagios\nrpe_nt\plugins\check_tcp -H linux01 -p 113

The first command checks whether a Web server is running on the HTTP standard port 80 of the host www.swobspace.de. The second tests whether an identd daemon (TCP port 113) is active on the host linux01.

18.2.4 Perl plugins in Windows

Unfortunately the Cygwin plugins do not contain a check_ping or check_icmp. You can use the Perl script check_ping.pl instead, which is available for download on The Nagios Exchange in the Networking category.¹³ It uses the Perl module Net::Ping for the network connection. In contrast to check_tcp, check_ping.pl sends several packets, so it can make a more precise assessment of response times and packet losses.

¹³ http://www.nagiosexchange.org/Networking.53.0.html

An up-to-date and simple-to-install Perl for Windows can be obtained from ActiveState.¹⁴ To download the Active Perl Free Distribution, no registration is required, even if the download procedure would suggest otherwise.
Of the versions offered, you should use the latest Perl version (currently 5.8.7) and only fall back on the older version 5.6.1 if this causes problems.

The plugin script itself contains a BEGIN statement, which you must comment out for use in Windows:

    #BEGIN{
    #  push @INC, "/usr/lib/perl5/site_perl/...
    # }

By default the script sends a TCP echo request to port 7. Alternatively, you can explicitly set a different port by adding the following line after the Net::Ping->new statement:

    $p->{port_num} = 80;

This would cause a TCP ping to port 80 (HTTP). So that NRPE_NT can execute the script, you must explicitly start the Perl executable:

    command[check_ping_eli02]=C:\Perl\bin\perl.exe \
        D:\Programs\nagios\nrpe_nt\plugins\check_ping.pl \
        --host 172.17.129.2 --loss 10,20 --rta 50,250

The command has been line-wrapped for the printed version, but in the configuration file the whole command must be written on a single line. With the --host parameter you specify a resolvable host name or an IP address. --loss is followed by a pair of values, separated by a comma, for the warning and critical limits for packet loss in percent (so values between 0 and 100 are possible here). The --rta option likewise demands a pair of threshold values as an argument, for the average response time in milliseconds. Since this is a Perl script, it does not matter whether these are specified as integers or as floating-point numbers.

[14] http://www.activestate.com/store/languages/register.plex?id=ActivePerl

Chapter 19
Monitoring Room Temperature and Humidity

There are a number of sensors for monitoring room temperature and humidity. Most of them are integrated into the network as independent network devices and are normally addressed via SNMP. But you have to spend at least three hundred dollars on your first sensor. Searching for a cheaper and modular system, the author finally came across http://www.pcmeasure.com/; it has met all his requirements until now.
The fact that this chapter is restricted to this sensor is not meant to detract from other systems, but is down to the fact that this topic alone would be enough for a separate book.

19.1 Sensors and Software

A complete monitoring system for physical data normally consists of three components: a sensor (for temperature or humidity, for example), an adapter to connect it to the serial or parallel port of a PC, and software to query the sensor. [1]

There are adapters for the PCMeasure system in variants for one to four sensors, which can be operated simultaneously. For the power supply the adapters need a free USB interface; alternatively a separate “USB power supply” is available. Instead of the adapter solution, there is also an optional Ethernet box with four sensor connections, which is somewhat more expensive and can be expanded to accept 12 sensors.

The measurement querying software PCMeasure is available for both Linux and Windows. [2] Some features are exclusive to the Windows version, which is why it is slightly more expensive. For use with Nagios, the Linux version is entirely sufficient, since only the measurement values are transmitted over a simple network protocol.

The sensors themselves are interesting: in addition to those for temperature and humidity (and combinations of the two), there are also a contact sensor, a smoke and water alarm, a movement detector, and voltage detectors. These are normally connected with a twisted-pair cable (RJ45 connector); according to the FAQ, [3] they can be used up to 100 meters from the adapter or Ethernet box, that is, throughout a building, provided you have good cables.

19.1.1 The PCMeasure software for Linux

The tar archive pcmeasure.tar.gz with the Linux software is unpacked in its own directory, such as /usr/local/pcmeasure. The configuration file pcmeasure4linux.cfg is also installed here.
The port entries in this file need to be adjusted so that only those ports are listed to which a sensor is actually connected:

    [ports]
    com1.1=01

com1 stands for the first serial port; if you are using the first parallel port instead, the entry before the period is lpt1. The digit following the port refers to the adapter slot used by the sensor, so depending on the adapter, this is a number from 1 to 4. The = sign is followed by the sensor type: 01 stands for a temperature sensor, 03 for a humidity sensor. An additional humidity sensor on the second slot of the same adapter would then be addressed as com1.2=03.

The query program pcmeasure requires the configuration file to be specified as an argument:

    linux:local/pcmeasure # ./pcmeasure ./pcmeasure4linux.cfg

It runs as a daemon in the background and only ends if it is terminated with kill. In principle, any user who has read permissions for the corresponding interface can start it.

[1] The PCMeasure Web site showed the following prices as of March 2006: simple temperature sensor 30101, $27; serial single-port adapter 30201, $39; Linux software, $29 (Windows: $39).
[2] The access data for the download comes with the invoice.
[3] http://www.pcmeasure.com/faq.php

19.1.2 The query protocol

The software opens TCP port 4000 by default and accepts requests from the network. The protocol used is quite simple: you send a text in the format

    pcmeasure.interface.slot<CR><LF>

(that is, with a DOS line ending) and you receive a response in the format

    port;valid=validity;value=value;...

The validity placeholder is replaced by a 1 for a valid value or a 0 for an invalid one. The port specification follows the internal numbering system: lpt1.1 corresponds to port1, com1.1 to port13. Whether everything functions correctly can be tested with telnet:

    user@linux:~$ telnet localhost 4000
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    pcmeasure.com1.1
    port13;valid=1;value=22.59;counter0=10627;counter1=14373;
    Connection closed by foreign host.
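The same exchange can be sketched in Python. This is only an illustration of the request and response format described above, not part of the PCMeasure software or of check_pcmeasure.pl; the function names parse_response and query_sensor are hypothetical, and the sketch assumes that the reply is a single line of semicolon-separated fields as in the telnet session.

```python
import socket


def parse_response(line):
    """Parse 'port13;valid=1;value=22.59;...' into (port, fields)."""
    parts = [p for p in line.strip().split(";") if p]
    port = parts[0]
    fields = dict(p.split("=", 1) for p in parts[1:])
    return port, fields


def query_sensor(host, sensor, tcp_port=4000, timeout=5.0):
    """Send one request such as 'pcmeasure.com1.1' and parse the reply.

    Hypothetical helper: it assumes the server answers with one line.
    """
    request = f"pcmeasure.{sensor}\r\n".encode("ascii")  # DOS line ending
    with socket.create_connection((host, tcp_port), timeout=timeout) as conn:
        conn.sendall(request)
        reply = conn.makefile("r", encoding="ascii").readline()
    return parse_response(reply)


# Parsing the reply shown in the telnet session:
port, fields = parse_response(
    "port13;valid=1;value=22.59;counter0=10627;counter1=14373;"
)
print(port, fields["valid"], float(fields["value"]))
```

A call such as query_sensor("localhost", "com1.1") would perform the network request itself; it is not executed here because it needs a running pcmeasure daemon.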
The current temperature in this example is 22.59 °C, and the value is valid.

19.2 The Nagios Plugin check_pcmeasure

The plugin check_pcmeasure.pl [4] enables a single sensor to be queried over the network. It enters the values received into a round-robin database in the following form:

    timestamp:value

A script called create-rrd.sh to create this database and a CGI script to display the generated graphics (temp.cgi) can also be found at the link specified.

To be able to work with a round-robin database (see page 317) you require the RRDtools, [5] which contain the Perl module RRDs used by the plugin. If you do not use it, you should comment out the line use RRDs; in the Perl code of the plugin by placing a # at the beginning of the line.

[4] http://linux.swobspace.net/projects/nagios

The plugin has the following options:

-H address / --host=address
    This is the host name or IP address of the measuring computer on which the software is running and to which the sensors are connected.

-S sensor_port / --sensor=sensor_port
    This switch defines the sensor port, such as com1.1 or lpt1.2 (see above).

-p port / --port=port
    This sets an alternative TCP port for the software. The default is port 4000.

-w floating-point number / --warn-min=floating-point number
    If the measured value falls below the given threshold value, check_pcmeasure issues a warning.

-W floating-point number / --warn-max=floating-point number
    If the measured value lies above this warning limit, the plugin issues a warning. Upper and lower thresholds can be combined.

-c floating-point number / --crit-min=floating-point number
    The plugin issues CRITICAL if the value drops below this limit.

-C floating-point number / --crit-max=floating-point number
    The plugin issues CRITICAL if the value goes above this threshold. It can be combined with -c.

-R file / --rrd-database=file
    This option specifies the round-robin database. If this option is missing, the RRD Perl module can be commented out.
-V / --version
    This outputs the plugin version and a short help text. The plugin does not query any sensor in doing this.

[5] http://www.rrdtool.org/

In the following example the plugin asks for the temperature of the sensor connected to the host with the IP address 172.17.193.6:

    nagios@linux:nagios/libexec$ ./check_pcmeasure.pl -H 172.17.193.6 \
        -S com1.1 -W 22.0 -C 25.0
    WARNING: Value com1.1: /22.6/ >22.0

Since the measured value lies above the warning limit of 22 °C but below the critical limit of 25 °C, there is a WARNING.

The corresponding Nagios command can be specified with or without a round-robin database:

    define command{
        command_name  check_temp_max
        command_line  $USER1$/check_pcmeasure.pl -H $HOSTADDRESS$ \
                      -S $ARG1$ -W $ARG2$ -C $ARG3$
    }
    define command{
        command_name  check_temp_max_rrd
        command_line  $USER1$/check_pcmeasure.pl -H $HOSTADDRESS$ \
                      -S $ARG1$ -W $ARG2$ -C $ARG3$ -R $ARG4$
    }

Without the database, you only need the upper warning and critical limits, apart from the sensor details. In the second variant the RRD file given in $ARG4$ additionally stores the measured data. The file must be created beforehand and must be writable for the user nagios.

The following service uses the file /var/lib/rrd/temperatur-serverraum1.rrd for this purpose:

    define service{
        host_name              linux01
        service_description    Room temperature
        max_check_attempts     1
        normal_check_interval  2
        check_command          check_temp_max_rrd!com1.1!23.0!27.0!\
                               /var/lib/rrd/temperatur-serverraum1.rrd
        ...
    }

With max_check_attempts set to 1, Nagios does not repeat the query at intervals of retry_check_interval in case of an error. Instead the temperature is simply measured every two minutes.

Since room temperatures normally change very slowly, you could use a normal_check_interval of five minutes. If you choose larger measuring intervals, you can set max_check_attempts to more than 1 and repeat the measurement at shorter intervals in case of errors (e.g., retry_check_interval 1).
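The way the four threshold options of check_pcmeasure interact can be summarized in a few lines of Python. This is a sketch of the logic as the option descriptions state it, not the plugin's actual Perl code; the function name evaluate is hypothetical, and the sketch assumes strict comparisons ("falls below", "lies above") and that a critical limit takes precedence over a warning limit.

```python
# Nagios plugin return values
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate(value, warn_min=None, warn_max=None, crit_min=None, crit_max=None):
    """Return the Nagios state for a measured value.

    A limit set to None is not checked (assumption: unset options are
    simply ignored). Critical limits are tested before warning limits.
    """
    if crit_min is not None and value < crit_min:
        return CRITICAL
    if crit_max is not None and value > crit_max:
        return CRITICAL
    if warn_min is not None and value < warn_min:
        return WARNING
    if warn_max is not None and value > warn_max:
        return WARNING
    return OK


# The example from the text: 22.6 with -W 22.0 -C 25.0 yields WARNING.
print(evaluate(22.6, warn_max=22.0, crit_max=25.0))
```

Combining upper and lower limits, for instance warn_min=18.0 with warn_max=22.0, defines a band of acceptable temperatures.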
Chapter 20
Monitoring SAP Systems

There are several ways of monitoring an SAP system. The simplest is just to check the ports on which the corresponding SAP services are running. Normally these are TCP ports 3200/3300 for system number 00, 3201/3301 for system number 01, etc. This can be done with the generic plugin described in Section 6.7.1, page 110.

But it is possible that no user is able to log in, even though the port is reachable, because SAP-internal services fail, making it impossible to work with the system.

To really test the complex interaction of various SAP components, you require a program that communicates with the SAP system on the application layer. There are two alternatives here: the simpler one uses the program sapinfo, which queries the available information without a direct login, like the SAP GUI at the start.

With somewhat more effort you can communicate with the SAP system over an SAP standard interface. This is no use, however, unless you have an SAP login with corresponding permissions. With the Computing Center Management System (CCMS), SAP provides its own internal monitoring system, which can also be queried through the RFC [1] interface, and which can be put to excellent use in Nagios, with the right plugins.

20.1 Checking without a Login: sapinfo

The program sapinfo is part of an optional software package for the development of client-side RFC interfaces. The Linux version that you require, RFC_OPT_46C.SAR, can be obtained either at ftp://ftp.sap.com/pub/linuxlab/contrib/ or from the SAP Service Marketplace at http://service.sap.com/ (a password is required for this), where you can use the search help to look for the keyword “RFC-SDK”.
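The port convention mentioned at the start of this chapter (3200/3300 for system number 00, 3201/3301 for system number 01) is easy to compute. The following Python sketch is only an illustration; the function names sap_ports and check_tcp_args are hypothetical, and the sketch assumes that the pattern continues unchanged for all two-digit system numbers.

```python
def sap_ports(system_number):
    """Return the two TCP ports for a two-digit SAP system number."""
    if not 0 <= system_number <= 99:
        raise ValueError("system number must be between 00 and 99")
    return 3200 + system_number, 3300 + system_number


def check_tcp_args(host, system_number):
    """Build one check_tcp argument list per port (port-only test)."""
    return [["check_tcp", "-H", host, "-p", str(port)]
            for port in sap_ports(system_number)]


print(sap_ports(0))
print(sap_ports(1))
print(check_tcp_args("sap01", 1))
```

As the chapter goes on to explain, such a port-only test says nothing about whether users can actually log in.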
20.1.1 Installation

SAP has its own archiving format in which the precompiled software is stored. To unpack programs you require the program SAPCAR, which can also be obtained through the FTP link mentioned or through the SAP Service Marketplace. It is operated in a way similar to tar:

    linux:~ # mkdir /usr/local/sap
    linux:~ # cd /usr/local/sap
    linux:local/sap # /path/to/SAPCAR -xvf RFC_OPT_46C.SAR
    SAPCAR: processing archive RFC_OPT_46C.SAR
    x rfcsdk
    x rfcsdk/bin
    x rfcsdk/bin/sapinfo
    ...

The data contained in the archive lands in its own subdirectory, rfcsdk. If you run SAPCAR without any parameters, a short operating manual is displayed.

20.1.2 First test

The program sapinfo can now be tested without further configuration. To do this you require the so-called connect string; if the connection runs through an SAP gateway, this is a string such as /H/ip_of_the_sap-gateway/S/3297/H/ip_of_the_sap_system; without a gateway you simply specify an IP address or a resolvable host name instead of this complex expression. In case of doubt, the administrator responsible for the SAP system will reveal the exact connect string. In addition you must specify the system number, [2] in this example 01:

    nagios@linux:~$ cd /usr/local/sap/rfcsdk/bin
    nagios@linux:rfcsdk/bin$ ./sapinfo ashost=10.128.254.13 sysnr=01

    SAP System Information
    -----------------------------------------------
    Destination         p10ap013_P10_01
    Host                p10ap013
    System ID           P10
    Database            P10
    DB host             P10DB012
    DB system           ORACLE
    SAP release         620
    SAP kernel release  620
    RFC Protokoll       011
    Characters          1100
    Integers            LIT
    Floating P.         IE3
    SAP machine id      560
    Timezone            3600

[1] Remote Function Call.
[2] The SAP administrator will also know this.

The output provides various information on the SAP installation, including the SAP release (620), the SAP system ID (P10), the host on which the database is located, and the database system used, which in this case is Oracle. With the ashost parameter you query a specific application server.
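If the sapinfo output is to be processed further, the key and value columns can be separated with a short Python sketch. This is an illustration only: the function name parse_sapinfo is hypothetical, and the sketch assumes that sapinfo aligns its output in columns so that key and value are separated by at least two spaces, which the collapsed listing in this book cannot confirm.

```python
import re

# Shortened sample in the assumed column layout
SAMPLE = """\
SAP System Information
-----------------------------------------------
Destination            p10ap013_P10_01
Host                   p10ap013
System ID              P10
DB system              ORACLE
SAP release            620
"""


def parse_sapinfo(text):
    """Return a dict of the key/value lines in sapinfo output."""
    info = {}
    for line in text.splitlines():
        # key (may contain single spaces), two or more spaces, value
        match = re.match(r"^(\S.*?)\s{2,}(\S.*)$", line.rstrip())
        if match:
            info[match.group(1)] = match.group(2)
    return info


info = parse_sapinfo(SAMPLE)
print(info["System ID"], info["SAP release"])
```

The heading and the separator line contain no run of two spaces and are therefore skipped.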
For a message server, sapinfo requires the following details:

    nagios@linux:rfcsdk/bin$ ./sapinfo r3name=P10 mshost=10.128.254.12 \
        group=ISH

The r3name parameter specifies the SAP system ID, mshost defines the IP address of the message server, and group describes the logon group. As long as the PUBLIC group exists, you can leave this parameter out, and then the default, PUBLIC, will be used.

If the query ends with an error message such as

    ERROR service 'sapmsP10' unknown

then the definition of the sapmsP10 service [3] is missing on the Nagios server in /etc/services:

    sapmsP10 3600/tcp

For the port you define the TCP port on which the message server is running. Which one this is depends on the particular SAP installation; the standard port is 3600.

[3] Instead of P10, the appropriate system ID will always be shown here.

20.1.3 The plugin check_sap.sh

The plugin check_sap.sh, a shell script based on sapinfo, is included in the standard Nagios Plugins package, but it is in the contrib directory and is not automatically installed. You can copy it manually to the plugin directory:

    linux:~ # cp /usr/local/src/nagios-plugins-1.4/contrib/check_sap.sh \
        /usr/local/nagios/libexec/.

Then you look in the plugin for the variable sapinfocmd and adjust the path for sapinfo:

    sapinfocmd='/usr/local/sap/rfcsdk/bin/sapinfo'

Like sapinfo, the plugin can be run in two ways: with the argument as it queries an application server, and with ms, a message server. The second argument in each case is the connect string; if no SAP gateway is used, this is the IP address or the host name of the host to be queried:

    check_sap.sh as connect_string system_number
    check_sap.sh ms connect_string SID logon_group

The first variant demands the two-digit system number of the application server as the third parameter, the counting of which starts at 00:

    nagios@linux:nagios/libexec$ ./check_sap.sh as 10.128.254.13 01
    OK - SAP server p10ap013_P10_01 available.

This means that the application server running on the host 10.128.254.13 is available.
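Whether the sapmsP10 entry from Section 20.1.2 is present can be checked before the plugin is run. The following Python sketch scans text in the format of /etc/services; the function name find_service is hypothetical, and the sample text is invented for illustration.

```python
def find_service(services_text, name, proto="tcp"):
    """Return the port for name/proto in services-format text, else None."""
    for raw in services_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments
        if len(fields) < 2:
            continue
        service, port_proto, aliases = fields[0], fields[1], fields[2:]
        port, _, line_proto = port_proto.partition("/")
        if line_proto == proto and name in [service, *aliases]:
            return int(port)
    return None


# Invented sample in /etc/services format
SERVICES = """\
ssh        22/tcp
sapmsP10   3600/tcp   # SAP message server for system P10
"""

print(find_service(SERVICES, "sapmsP10"))
print(find_service(SERVICES, "sapmsQ20"))
```

On a real system the same lookup is available through socket.getservbyname("sapmsP10", "tcp") from the Python standard library, which raises an error if the service is unknown.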
When the message server is queried, the plugin displays the application server belonging to the specified logon group (given as the fourth argument). If this information is missing, it determines the application server for the PUBLIC group. For a message server, you specify the SAP system ID (SID), for example P10, [4] instead of the system number:

    nagios@linux:nagios/libexec$ ./check_sap.sh ms 10.128.254.12 P10 ISH
    OK - SAP server p10ap014_P10_02 available.

[4] The first instance of this has the system number 00, the second one 01, etc.

In this example the message server running on 10.128.254.12 identifies p10ap014_P10_02 as the application server for the logon group ISH and also reveals that it is reachable.

The following two command definitions assume that it is sufficient to use the IP address, and that no SAP connect string is required:

    define command{
        command_name  check_sap_as
        command_line  $USER1$/check_sap.sh as $HOSTADDRESS$ $ARG1$
    }
    define command{
        command_name  check_sap_ms
        command_line  $USER1$/check_sap.sh ms $HOSTADDRESS$ $ARG1$ $ARG2$
    }

If this is not the case, the command_line for querying an application server could look like this:

    $USER1$/check_sap.sh as /H/sapgw/S/3297/H/$HOSTADDRESS$ $ARG1$

The following service definition can be used for all application servers:

    define service{
        service_description  SAP_AS
        host_name            sap01
        check_command        check_sap_as!00
        ...
    }

Since there is only a single message server in an SAP system, it makes more sense to define a separate service for each logon group. The following example shows this for the group ISH:

    define service{
        service_description  SAP_MS_ISH
        host_name            sap09
        check_command        check_sap_ms!P10!ISH
        ...
    }

In this way you can test whether a user may log in without actually logging in. If there are interruptions between the database and the application server that make it impossible to log in, sapinfo provides a corresponding error message after a timeout.
The author was able to observe several times that sapinfo and check_sap.sh reported an error in such a situation, while the TCP port-only test of the application server, check_tcp, returned an OK, although no user could log in any longer. So check_sap.sh, even without a login, provides more reliable information than a port-only check.

20.2 Monitoring with SAP’s Own Monitoring System (CCMS)

With SAP’s own Computing Center Management System framework (CCMS), not only SAP systems but also external applications can be monitored. Here local agents collect data from each of the hosts, which, since Release R/3 4.6C, [5] can be queried from a central component. The data examined includes not only SAP-specific features such as SAP buffers or batch jobs, but also operating system data such as memory and CPU usage, or disk I/O and swapping. Even information on the database used or the average response times of applications can be queried.

The data of the CCMS can also be queried externally through RFC (Remote Function Calls, a standard SAP interface). SAP provides corresponding libraries for Unix and Windows platforms, with which a Linux program, for example, can query information from the CCMS over the network.

20.2.1 CCMS: a short overview

Within the SAP world you gain access to this data through the CCMS Alert Monitor (transaction RZ20) (Figure 20.1). The illustration shows so-called monitor collections that categorize various information in groups.

Figure 20.1: The SAP CCMS Alert Monitor

[5] Central evaluation was not possible in earlier releases.

SAP provides several monitor collections with preconfigured values in its distribution. A trained SAP administrator can create and operate monitors at any time. We shall restrict ourselves here to the monitor collection SAP CCMS Monitor Templates and focus on the Dialog Overview monitor (Figure 20.2).
The dialog response times specified there (accessible through the monitor attribute Dialog Response Time) provide a measurable equivalent for performance problems, corresponding to what the user perceives as a “slow system.” This value specifies the average processing time of a transaction (without network transmission time and without the time needed to render the information in the GUI of the client).

Figure 20.2: The SAP CCMS monitor Dialog Overview

The monitor attribute Network Time reveals how much time the system needs during a dialog step to send data from the client (the SAP GUI) to the SAP system and back again.

For each of the attributes, the monitor shows which context defined in the SAP system (normally, which SAP instance) the specified measured values belong to. Most measurement parameters have a warning and a critical limit. If the value lies beneath the warning limit, the monitor displays the line in green; for monochrome devices the color is listed as text. If the warning limit is exceeded, yellow is shown, and if the critical limit is exceeded, red. If an entry of a partial tree lies outside the green limit, the monitor also sets the overlying nodes to yellow or red, so that the administrator can see that something is not right, even when the menus are not open.

You do not normally need to worry about the thresholds. The settings configured by SAP are sensible and should only be changed if there is a sound reason to do so.

The Nagios plugins for the CCMS query, described in Section 20.2.4 (page 394), return the status defined in the CCMS: OK if the traffic light is on green, WARNING for yellow, and CRITICAL for red. The thresholds are therefore set by the SAP system, and not by Nagios.
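The traffic-light behavior described above, where a yellow or red entry also colors the nodes above it and the plugins map the color to a Nagios state, can be sketched in Python. This is a model for illustration only, not SAP's or the plugins' implementation; the tree structure and the function name node_color are hypothetical.

```python
GREEN, YELLOW, RED = 0, 1, 2
NAGIOS_STATE = {GREEN: "OK", YELLOW: "WARNING", RED: "CRITICAL"}


def node_color(node):
    """Return a node's color: the worst of its own and its children's."""
    own = node.get("color", GREEN)
    children = node.get("children", [])
    return max([own] + [node_color(child) for child in children])


# Hypothetical monitor tree: one instance has a red attribute.
tree = {
    "name": "P10",
    "children": [
        {"name": "p10ap013_P10_01", "color": GREEN},
        {"name": "p10ap014_P10_02",
         "children": [{"name": "ResponseTime", "color": RED}]},
    ],
}

print(NAGIOS_STATE[node_color(tree)])
```

Because the worst color wins at every level, a single red attribute is enough to turn the whole tree, and with it the Nagios service, CRITICAL.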
If you want to find out more about CCMS, we refer you to the documentation at http://service.sap.com/monitoring (password required). There SAP provides detailed information on the installation and operation of CCMS. The SAP online help also has an extensive range of information available. If you just want a short summary of the subject and are more interested in the way the Nagios plugins work, you can find two informative PDF documents at http://www.nagiosexchange.org/Misc.54.0.html under the keyword SAP CCMS.

20.2.2 Obtaining the necessary SAP usage permissions for Nagios [6]

Retrieving information from the CCMS is done through RFC (Remote Function Calls), which requires a login on the SAP side. Luckily the user only needs a minimal set of permissions. A new role is set up in the role generator (transaction PFCG) with a name that conforms to the company-internal conventions. It is not given any transaction assignment in the menu.

Figure 20.3: For access from Nagios you require these SAP authorization objects

[6] This section is intended for SAP authorization administrators. If you do not maintain SAP authorizations yourself, you can skip this section.

When maintaining permissions, the following permission objects are added manually: S_RFC, S_XMI_LOG, and S_XMI_PROD (see also Figure 20.3).

Whether these permissions are sufficient can be tested with the plugin check_sap_cons described in Section 20.2.4, page 394. If a function group (such as SALG) is missing from the permission object S_RFC, the plugin shows its name in plain text in the error message.

The login data is stored on the Nagios server in the file /etc/sapmon/login.cfg. In this file, various target hosts (called RFC destinations in SAP) can be configured side by side. Such a login configuration for a target system is called an RFC template in the language of the CCMS plugins (Section 20.2.4, page 394).
It has the following form:

    [LOGIN_template]
    LOGIN=-d target -u user -p password -c client-id -h address -s system_number

The complete LOGIN definition must be written on a single line, and it is essential that it contain the following details:

-d target
    This is the name of the SAP system, also referred to as SID or system ID.

-u user -p password
    These parameters state the SAP user and the corresponding password. Remember that a newly created dialog user has to change his or her password on first logon.

-c client-id
    This is the three-digit client ID, also called the client.

-h address
    The host name of the host on which the named user should log in. This must resolve to an IP address.

-s system_number
    The SAP system number. The first SAP instance is normally 00, and the number is then incremented for each further instance.

Below, the user user with the password secret is to log in to the client with the ID 020 on the host p10ap013, whose SAP installation has the system number 01:

    [LOGIN_P10]
    LOGIN=-d P10 -u user -p secret -c 020 -h p10ap013 -s 01

The RFC template name in square brackets consists of the text LOGIN_ and the SAP system ID (SID). The RFC template defined here belongs to the SAP system P10.

20.2.3 Monitors and templates

The interface provided by SAP that is used by the plugins comes in a simple and an extended variant. Only the additional functions of the extended variant enable all information from the CCMS to be retrieved, which is why we are omitting the description of the simple interface. [7]

For the extended interface, templates define the monitor data to be used. These are stored on the Nagios server in the file /etc/sapmon/agent.cfg and have the following format:

    [TEMPLATE_name]
    DESCRIPTION=description
    MONI_SET_NAME=monitor collection
    MONI_NAME=name of the monitor
    PATTERN_0=SID\context\monitor object\attribute

The placeholders are replaced as follows:

name
    This is the name with which the plugins later address the template. When this book was written, only template names that began with two digits worked, so 00_sap13 worked, for example, but not TEST.
description
    This consists of a freely selectable, simple text.

monitor collection
    This is the name of the monitor collection, written exactly as it is in the CCMS (including upper/lower case and spaces).

name of the monitor
    The name of the monitor must also match the SAP name exactly.

context
    This pattern filters out the desired values from those contained in the monitor. In most cases you specify the identifier for the SAP instance, such as p10ap013_P10_01 (p10ap013 is the host name, P10 the SID of the SAP system, and 01 the system number).

monitor object
    This is the name of the desired monitor object, for example Dialog. Unfortunately the term demanded here rarely corresponds to the one shown in the SAP GUI. It is best to determine it using PATTERN_0=*, as described below.

attribute
    This is the variable to be queried. Each monitor object may contain several variables. Dialog, for example, has, apart from the ResponseTime variable, the FrontEndNetTime variable, which reveals the average processing time of a transaction, restricted to the network transmission time and the processing time on the client.

[7] Information on this is provided by the PDF documents mentioned on page 390.

The challenge here is in specifying the filter in PATTERN_0. It must exactly match the SAP-internal names, and these are not identical to the terms that are displayed in the CCMS Alert Monitor (transaction RZ20). It is best to start with PATTERN_0=*, which ensures that the entire tree appears. We shall call the template for this simply 00:

    [TEMPLATE_00]
    DESCRIPTION=Dialog response time
    MONI_SET_NAME=SAP CCMS Monitor Templates
    MONI_NAME=Dialog Overview
    PATTERN_0=*

With this entry in /etc/sapmon/agent.cfg you query the complete list of all monitor entries, in this case those of the system with the ID P10, using the check_sap_cons plugin:

    nagios@linux:nagios/libexec$ ./check_sap_cons 00 P10
    ...
    P10 p10ap013_P10_01 Dialog ResponseTime 262 msec
    P10 p10ap014_P10_02 Dialog ResponseTime 61 msec
    P10 p10db012_P10_00 Dialog ResponseTime 11 msec
    ...
The entries contain the following information, with items separated by spaces:

    SID context monitor object attribute value

The information for the P10 system queried above first gives the SAP instance, such as p10ap013_P10_01, then the monitor object (Dialog) and the attribute (ResponseTime) together with the value. In the SAP GUI (Figure 20.2) the latter is called Dialog Response Time, and since each empty space is significant, this is a completely different name.

In a template that is only interested in the response time of the instance p10ap014_P10_02, the PATTERN_0 is defined as follows:

    PATTERN_0=P10\p10ap014_P10_02\Dialog\ResponseTime

If you want to query all the entries of a query level, you must use the wildcard *. The following example defines templates for the dialog response time, the network response time, and the average CPU load for all instances of the system P10:

    [TEMPLATE_00]
    DESCRIPTION=Dialog response time
    MONI_SET_NAME=SAP CCMS Monitor Templates
    MONI_NAME=Dialog Overview
    PATTERN_0=P10\*\Dialog\ResponseTime

    [TEMPLATE_01]
    DESCRIPTION=network response time
    MONI_SET_NAME=SAP CCMS Monitor Templates
    MONI_NAME=Dialog Overview
    PATTERN_0=P10\*\Dialog\FrontEndNetTime

    [TEMPLATE_10]
    DESCRIPTION=System load in five-minute average
    MONI_SET_NAME=SAP CCMS Monitor Templates
    MONI_NAME=Operating System
    PATTERN_0="P10\*\CPU\5minLoadAverage"

20.2.4 The CCMS plugins

SAP demonstrates the use of the RFC interface to the CCMS with the CCMS plugins for SuSE. In Debian you can convert the RPM package nagios-plugins-sap-ccms-0.7.2 [8] to a tar file with alien, or alternatively you can obtain the source RPM from a SuSE FTP mirror [9] and compile the source code yourself. This will give you the plugins listed in Table 20.1.
Table 20.1: The SAP CCMS plugins

    Plugin                   Description
    check_sap                Output of the monitor data in HTML format
    check_sap_cons           Ditto, but without HTML formatting and without hyperlinks, for output on the command line
    check_sap_instance       Dialog response time and number of logged-in users on a particular application server (requires CCMS Ping [10])
    check_sap_instance_cons  Ditto, as text output without HTML markup
    check_sap_multiple       HTML-formatted output of the data of a monitor template that returns more than one value
    check_sap_mult_no_thr    Output of multiple values with simple HTML formatting, without hyperlinks, in contrast to check_sap_multiple
    check_sap_system         Shows the application servers of the SAP system and their states (requires CCMS Ping)
    check_sap_system_cons    Like check_sap_system, only without HTML formatting

[8] It can be found at http://www.rpmseek.com/, for example, if you search there for nagios-plugins-sap-ccms.
[9] e.g., ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/suse/i386/9.3/suse/src/nagios-plugins-sap-ccms-0.7.2-45.src.rpm.
[10] As a component of the CCMS monitoring system, CCMS Ping monitors the availability of the application servers belonging to the SAP system.

The plugins that end in cons are especially suitable for test purposes: they simply pass the data on to the command line, without further formatting. The output of the others contains HTML formatting for a Nagios version modified by SAP; with Nagios 2.0 they usually lead to an incorrect view and are therefore useless.

Individual values are best retrieved with check_sap_cons, but then the monitor definition must really return only a single value. The remaining ones would be returned on additional lines, which Nagios ignores.

If Nagios is to display several return values, it is best to use check_sap_mult_no_thr, which provides these values with some HTML formatting elements that also work with Nagios 2.0.
All plugins demand two arguments: check_sap, check_sap_cons, check_sap_multiple, and check_sap_mult_no_thr first require the name of the monitor template from the file /etc/sapmon/agent.cfg, such as 00, 00_sap13, 01, or 10 (see page 392), followed by the name of the RFC template, as defined in /etc/sapmon/login.cfg (in the examples in this book we use the system ID P10).

For check_sap_system / check_sap_system_cons and check_sap_instance / check_sap_instance_cons, the first argument changes: instead of the monitor template, check_sap_system demands the system ID (here, P10), and check_sap_instance demands the SAP instance, consisting of the host name, the SID, and the system number (for example, p10ap013_P10_01).

First steps with check_sap_cons

The plugin check_sap_cons is probably best suited to your first attempts. Only after this has worked properly on the command line should you move on to the actual Nagios configuration. The example on page 393 already showed how you determine the dialog response time with the monitor template 00, and the following example queries the network time that passes until the result of the transaction appears in the SAP GUI, using the monitor template 01:

    nagios@linux:nagios/libexec$ ./check_sap_cons 01 P10
    P10 p10ap013_P10_01 Dialog FrontEndNetTime 383 msec
    P10 p10ap014_P10_02 Dialog FrontEndNetTime 673 msec
    P10 p10db012_P10_00 Dialog FrontEndNetTime 1491 msec

The definitions in the two templates can be found in Section 20.2.3 on page 392.

In both examples, check_sap_cons returns multiple values, only the first line of which would be noticed by Nagios in the Web interface and in notifications. If the instance p10ap014_P10_02 displayed a critical status, but p10ap013_P10_01 did not, the plugin would return a CRITICAL, but the Web interface (like the notification) would only present the first line, which would not give any reason to worry. This means that the admin would not see the very thing that has set off the critical state.
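The line format of the check_sap_cons output can be taken apart with a few lines of Python, for instance to condense all instances into a single line so that none is hidden behind the first one. This is a sketch for illustration, not a replacement for the plugin; the function names parse_line and one_line_summary are hypothetical, and the sketch assumes that monitor object and attribute contain no spaces, as in the examples shown.

```python
def parse_line(line):
    """Split one output line into its six space-separated fields."""
    sid, context, monitor_object, attribute, value, unit = line.split()
    return {
        "sid": sid,
        "context": context,
        "object": monitor_object,
        "attribute": attribute,
        "value": value,
        "unit": unit,
    }


def one_line_summary(output):
    """Condense multi-line output into one line listing every context."""
    rows = [parse_line(line) for line in output.splitlines() if line.strip()]
    return ", ".join(f"{r['context']}={r['value']} {r['unit']}" for r in rows)


# Output of the example above
OUTPUT = """\
P10 p10ap013_P10_01 Dialog FrontEndNetTime 383 msec
P10 p10ap014_P10_02 Dialog FrontEndNetTime 673 msec
P10 p10db012_P10_00 Dialog FrontEndNetTime 1491 msec
"""

print(one_line_summary(OUTPUT))
```

A wrapper script built along these lines would still have to pass on the return value of the plugin unchanged, since the status itself is determined by the CCMS.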
If check_sap_cons only returns error messages instead of the data you want, there could be several reasons for this. In the following example the login fails:

    nagios@linux:nagios/libexec$ ./check_sap_cons 00 P10
    <== RfcLastError
    FUNCTION: SXMI_LOGON
    RFC operation/code SYSTEM_FAILURE
    ERROR/EXCEPTION
    key     :
    status  :
    message :User account not in validity date
    internal:
    <== RfcClose

The reason is given in the message: field: the user currently does not have a valid account. If the following message were to be found there

    message :User 910WOB has no RFC authorization for function group SXMI .

this would mean that the user 910WOB does not have the necessary permission in the authorization object S_RFC. To grant it, the function group SXMI must be added to this object for that user.

The plugins record such RFC error messages in the file dev_rfc in the current working directory. If Nagios runs the plugin, it generates this file in the Nagios home directory (/usr/local/nagios, if you have followed the installation description in this book).

In the next case the login works perfectly, but the plugin does not return any values:

    nagios@linux:nagios/libexec$ ./check_sap_cons 01 P10
    No information gathered! System up?

The error here lies in the monitor definition: often the name of the monitor set or the monitor is written wrongly, or the pattern does not match the monitor used. The intersection of monitor and pattern is then empty, and SAP does not warn explicitly even if the monitor or monitor set does not exist at all.

Checking multiple values with check_sap_mult_no_thr

If Nagios is to represent multiple queried values in the Web interface, you should use check_sap_mult_no_thr:

    nagios@linux:nagios/libexec$ ./check_sap_mult_no_thr 00 P10
    P10 p10ap013_P10_01
    Dialog ResponseTime 785 msec
    P10 p10ap014_P10_02
    Dialog ResponseTime 352 msec
    P10 p10db012_P10_00
    Dialog ResponseTime 22 msec
The output is given in a single line, which we have reformatted manually here so that it can be read more easily. With the HTML code, the plugin ensures that each value (thanks to the CLASS specifications) is shown on a separate line in the color matching its status. The status of the Nagios service changes to CRITICAL if at least one measured value is critical. Such a case is shown in Figure 20.4.

Figure 20.4: check_sap_mult_no_thr uses HTML markups which Nagios 2.0 also understands

In this case as well you should remember that Nagios processes no more than 300 bytes of the plugin output altogether, and cuts off the rest. For HTML-formatted output, not only is information then missing, there are also side effects in the table layout in the Web interface. In case of doubt, you must split the test across several service checks.

In the definition of the Nagios command objects, the host name, exceptionally, does not play a role for the CCMS plugins. This means that the $HOSTADDRESS$ macro is not used:

    define command{
        command_name    check_sap_ccms
        command_line    $USER1$/check_sap_mult_no_thr $ARG1$ $ARG2$
    }

If you request several values simultaneously, they will normally belong to different hosts. This means that a service can be assigned unambiguously to a host only if it queries a single value. Nevertheless, Nagios expects a specific host in the service definition:

    define service{
        service_description    SAP Dialog Response Time
        host_name              sap01
        check_command          check_sap_ccms!00!P10
        ...
    }

20.2.5 Performance optimization

Since the monitor always transmits all the data it has available over the RFC interface, filtering always takes place on the client side through the plugin. For this reason it is not recommended that you query single values from a large monitor one after another: this consumes considerable resources. You should either have a single service provide all the values, [11] or you should define a separate monitor yourself containing precisely those values you would like to test. This latter method is recommended by SAP.
If you want to check several monitors, or even single values of the monitor one after the other, you should keep an eye on the necessary network bandwidth. Within a local network this is normally not a problem, but it can place a considerable burden on narrow-bandwidth long-distance connections (ISDN, simple VPNs). In such cases you should measure the network traffic when starting operation, so that you can increase the check intervals accordingly in case of problems.

[11] Using a plugin predestined for the output of multiple values.

Appendixes

Appendix A
Rapidly Alternating States: Flapping

If the state of a host or service keeps on changing over and over, Nagios inundates the administrator with a flood of problem and recovery messages, which can not only be very irritating but also distract the administrator's attention from other, perhaps more urgent problems.

With a special mechanism, Nagios quickly recognizes alternating states and can inform the administrator of these selectively. The Nagios documentation refers to such alternating states as state flapping and to their detection as flap detection.

Whether these alternating states involve hosts or services has no influence on the detection mechanism itself. The differences are to be found more in the nature of host and service checks: Nagios carries out service checks periodically, and therefore regularly. In this way the system continuously receives new information on the current status. Host checks, on the other hand, normally only take place if they are necessary, so Nagios has to obtain the appropriate information in other ways.

A.1 Flap Detection with Services

To detect alternating states you need a complete list of all states that occurred during the last service checks. For this purpose Nagios stores the last 21 test results for each service and then overwrites the oldest value in each case in the memory. In these 21 states, a maximum of 20 changes can occur. Figure A.1 shows an example.
The x-axis numbers the possible alternating states in each case from 1 to 20, and the heads of the arrows indicate alternating states that have actually occurred.

Figure A.1: Nagios saves the last 21 states to detect frequently alternating states. This service changed its state twelve times

In the period specified, the state of the system shown changed 12 times out of a possible 20, which as a percentage is 60 percent. At 0 percent, not one change of state has taken place, and 100 percent means that the service really was in a different state every time it was recorded.

When determining the percentage value, Nagios assigns less significance to older changes of state than to more recent ones. Accordingly it weights the oldest change in state at 1 in Figure A.1 with 0.8, and the most recent at 20 with 1.2. From left to right, the factor increases each time by approx. 0.02, [1] resulting in a linear progression.

[1] (1.2 - 0.8)/19 = 0.0211

This weighting does not have any major effects on the end result in this example: for Figure A.1, this results in 62.21 percent (instead of 60), a slight shift, since the state in the second half changed more often. If there was only a single change of state at 20, the weighting would have the most effect: instead of 5 percent (that is, one change out of a possible 20) this would result in 5 * 1.2 = 6 percent.

Using threshold values which can be defined (two for services, two for hosts), Nagios determines whether a service or host is "flapping". Both the upper and lower limits are specified as percentages. If the detected percentage of state changes exceeds the upper threshold, Nagios categorizes the service as flapping. This has consequences: Nagios logs the event in the log file, adds a nonpermanent comment, [2] and stops any notifications concerning this from being sent.

If the percentage value falls below the lower limit, the system undoes this step; that is, the comment disappears, notifications are sent again, and the result also appears in the log file.
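To make the weighting concrete, the following minimal Python sketch recomputes the percentage from a list of 21 stored states. It follows only the description given here (linear weights from 0.8 to 1.2 across the 20 possible changes). The function names and the default thresholds of 5 and 20 percent are assumptions for illustration, and the real Nagios implementation may differ in detail.

```python
def change_weight(position, slots=20, oldest=0.8, newest=1.2):
    """Linear weight for the change at `position` (1 = oldest, `slots` = newest)."""
    step = (newest - oldest) / (slots - 1)
    return oldest + (position - 1) * step


def percent_state_change(states):
    """Weighted percentage of state changes in a history given oldest first."""
    if len(states) < 3:
        raise ValueError("need at least three stored states")
    slots = len(states) - 1  # 21 states allow 20 possible changes
    weighted = sum(
        change_weight(position, slots)
        for position in range(1, slots + 1)
        if states[position] != states[position - 1]
    )
    return 100.0 * weighted / slots


def update_flapping(is_flapping, percent, low=5.0, high=20.0):
    """Start flapping above `high`, stop below `low`, otherwise keep the old verdict."""
    if not is_flapping and percent > high:
        return True
    if is_flapping and percent < low:
        return False
    return is_flapping
```

With this sketch, a history of 20 OK states followed by one CRITICAL yields roughly 6 percent, while a single change at the oldest position yields roughly 4 percent. The two thresholds form a hysteresis: a value between the lower and the upper limit leaves the current categorization unchanged.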
A.1.1 Nagios configuration

Flap detection is configured at two locations: in the central configuration file and in the definition of the service object. In nagios.cfg the feature is switched on generally with the parameter enable_flap_detection, and global limit values are also defined here, which will always apply if nothing else is defined for the service in question:

    # /etc/nagios/nagios.cfg
    ...
    enable_flap_detection=1
    low_service_flap_threshold=5.0
    high_service_flap_threshold=20.0
    ...

The value 1 set here for enable_flap_detection enables flap detection, and 0 switches it off. The lower limit low_service_flap_threshold lies at 5 percent in this case, the upper limit high_service_flap_threshold at 20. This means that Nagios categorizes a service as flapping if the saved history contains at least five changes in state (more than four out of a possible 20). [3] The lower limit of five percent corresponds to one change in state. To drop below this, all 21 states must be identical. [4]

[2] Nonpermanent comments disappear after the monitoring system is restarted, but permanent ones remain.
[3] If the changes in state took place recently, the weighting would ensure that four changes in state would already be enough to exceed the 20 percent limit.
[4] If a single change of state takes place in the first half, the weighting results in a value of less than 5 percent.

In the definition of a service object, you have another chance to decide whether flap detection is desired in this case. You also have an option to specify threshold values for this service that differ from the global settings:

    define service{
        host_name                 linux01
        service_description       NTP
        ...
        flap_detection_enabled    1
        low_flap_threshold        5.0
        high_flap_threshold       20.0
        ...
    }

The value 1 in flap_detection_enabled switches on the feature for this service, and 0 (the default) switches it off. The two values low_flap_threshold and high_flap_threshold define the limits that override the globally defined values. If they are set to 0, or are omitted, the global thresholds will apply.
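The override rule for the thresholds can be summarized in a few lines. The following Python sketch is purely illustrative (the function and argument names are made up and are not part of Nagios): an object-specific value of 0, or a missing value, falls back to the global setting from nagios.cfg.

```python
def effective_thresholds(global_low, global_high, object_low=None, object_high=None):
    """Return the (low, high) pair that applies to one service or host.

    None stands for an omitted parameter; 0 is treated the same way.
    """
    low = object_low if object_low else global_low
    high = object_high if object_high else global_high
    return low, high
```

Called with only the global values 5.0 and 20.0, the function returns them unchanged; passing object-specific values of 10.0 and 30.0 returns that pair instead.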
A.1.2 The history memory and the chronological progression of the changes in state

Since the history only saves hard states and soft recoveries, the sections on the x-axis cannot be allocated so easily on a chronological basis, because the intervals between possible changes of state are not equal. Assume that the service object has the following definitions:

    max_check_attempts       3
    normal_check_interval    5
    retry_check_interval     1

Nagios checks the service two more times after a change in state from OK to WARNING has taken place, before the service changes to the hard state WARNING (state 1 in Figure A.1 on page 402). Since the last check that returned OK, a total of seven minutes [5] has elapsed, since the two soft states after five and six minutes are not included in the history.

[5] 5 + 2*1 = 7

If the next service check, as in Figure A.1, again detects a WARNING (i.e., the state does not change this time), then only five minutes elapse this time between states 1 and 2. The x-axis therefore only illustrates time in a linear form in exceptional circumstances, for example if no change of state occurs.

A.1.3 Representation in the Web interface

Services that Nagios categorizes as flapping are visible in the Web interface at three points: in the summaries generated by tac.cgi (Section 16.2.4, page 290) and status.cgi (Section 16.2.1, page 279), as well as on the information page created by extinfo.cgi (Section 16.2.2, page 284).

The quickest way to get there is through tac.cgi (Figure A.2): a link in the Monitoring Features section marked by x Services Flapping takes you to the status overview of services which continually change their state. The status overview shown in Figure A.3 can also be opened directly with status.cgi?host=all&style=detail&serviceprops=1024. serviceprops=1024 describes all services that Nagios categorizes as flapping. style=detail provides a detailed view (in contrast to overview, as can be seen in Figure A.10 on page 280), and host=all includes all hosts.
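If you want to bookmark this view or generate the link from a script, the query string can be assembled as follows. This is a small Python sketch. The base URL is a placeholder that depends on where your Web server publishes the Nagios CGI programs, and the function name is our own.

```python
from urllib.parse import urlencode

SERVICE_IS_FLAPPING = 1024  # serviceprops value named in the text


def flapping_overview_url(base="http://nagios.example.com/nagios/cgi-bin/status.cgi"):
    """Build the status.cgi link that lists all services categorized as flapping."""
    query = urlencode(
        {"host": "all", "style": "detail", "serviceprops": SERVICE_IS_FLAPPING}
    )
    return f"{base}?{query}"
```

Only the three parameters mentioned above are used; any other CGI parameters are outside the scope of this sketch.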
Figure A.2: tac.cgi notes changing states in section Monitoring Features

In the status view in Figure A.3, a white field with several horizontal gray bars moving to and fro reveals that a flapping service is involved. At the same time a white speech bubble denotes the existence of a comment on this (generated automatically by Nagios).

Figure A.3: Animated horizontal bars in the status display denote flapping states

If you click in the status view on the flapping icon next to the service in question, extinfo.cgi generates additional information on the service (Figure A.4), showing the changes in state in percent next to the flapping category, depicted by a red bar labeled with YES.

Figure A.4: Percent State Change: reveals how often the hard state changed, as a percentage

The page also contains the nonpermanent comment generated by Nagios (Figure A.5), which points out that the sending of messages has been stopped until the status of the service becomes stable again. Since it is nonpermanent, it also disappears when Nagios is restarted.

Figure A.5: With this comment, Nagios categorizes a service as flapping

A.2 Flap Detection for Hosts

Nagios only performs host checks if all available services are in an error state, that is, extremely irregularly. The system therefore cannot rely exclusively on these reachability tests when detecting changes in state for hosts. As long as at least one service check returns OK, Nagios deduces from this that the host itself is also reachable, and is therefore in an OK state. The software therefore checks for flapping states for each host and each service check.

The result of each host check that returns a hard state or soft recovery is saved by Nagios. During the period in which the reachability test is not available, the system assumes, after a time period which it defines itself, that the state has not changed, and stores the current state again in the history. The time period corresponds to the average of all service check intervals.

On the basis of this history, the same flap detection mechanism is used for hosts as for services.
So the difference is only in how Nagios determines the corresponding data basis.

Whether flap detection is desired for hosts is determined by the central configuration file nagios.cfg and the definition of the host objects. The global parameter enable_flap_detection, which applies equally to hosts and services, must be set to 1:

    # /etc/nagios/nagios.cfg
    enable_flap_detection=1
    low_host_flap_threshold=5.0
    high_host_flap_threshold=20.0

The threshold parameters for hosts include host in their names, but they have the same effect as their service equivalents. [6]

[6] Cf. page 403.

For the host object itself, detection is switched on with flap_detection_enabled 1 and off with 0:

    define host{
        host_name                 linux01
        ...
        flap_detection_enabled    1
        low_flap_threshold        5.0
        high_flap_threshold       20.0
    }

The two optional parameters low_flap_threshold and high_flap_threshold allow for host-specific thresholds. If these are omitted, the global threshold values are used.

Appendix B
Event Handlers

If the state of a host or service alternates between OK and error states, you can use an event handler to run any programs you want. You can make use of this if a service fails, for example, and Nagios should attempt to restart it. This provides an opportunity to solve minor problems without the administrator needing to intervene.

Use of the event handlers is not just restricted to self-healing, however: with an appropriate script you can just as easily log current values or the event itself in a database. But there are more suitable methods for doing this, described in Section 17.1, page 314.

A failed printer service serves as an example here of using an event handler for self-healing. In this example the printer service lpd is used, but this method can be applied in general to any service for which a start-stop script is available.
B.1 Execution Times for the Event Handler

The following parameters in the service definition ensure that Nagios tests the service under normal circumstances every five minutes, but in cases of error, every two minutes:

    normal_check_interval    5
    retry_check_interval     2
    max_check_attempts       4

An error state becomes hard after four tests leading to the same result.

Figure B.1: When does Nagios run the event handler?

Figure B.1 shows an example of the change of the lpd service from an OK state to CRITICAL, and back again. After 10 minutes test No. 2 detects that the service is no longer available. The soft state that results causes Nagios to examine lpd more closely at two-minute intervals (checks No. 3, 4, and 5). Test No. 5 returns a CRITICAL for the fourth time, causing Nagios to categorize this as a hard state and to go back to the normal, five-minute test interval. In check No. 7 the service is functioning again, and the state changes from CRITICAL to OK (for hard states, see Section 4.3, page 75).

Event handlers are carried out by Nagios for soft error states (in checks No. 2, 3, 4), the first time a hard error state occurs (in check No. 5), and when the OK state is restored after an error (irrespective of whether this is a hard or soft recovery).

Since hard error states lead to the administrator being notified, it is recommended that the repair attempt be moved to the time of the soft error states. If it succeeds at this point in time, the administrator is spared these minor details. Ideally the service will be running again before a user even notices that it has failed.

The fact that Nagios only executes the event handler when a hard error state first occurs prevents periodic attempts at repair that do not lead to the desired result after all (if the attempt had succeeded, no further hard error states would have occurred).
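The rules for when the handler runs can be replayed with a short simulation. The following Python sketch is a simplified model based only on the description above, not on the Nagios source code: it treats every non-OK result as the same error, and the function name is hypothetical. The usage note assumes that check No. 6 in Figure B.1 also returns CRITICAL.

```python
def handler_runs(results, max_check_attempts=4):
    """Return (check_no, state, state_type) for every check that triggers the handler."""
    runs = []
    failures = 0        # consecutive non-OK results seen so far
    hard_error = False  # True once the error has become a hard state
    for check_no, state in enumerate(results, start=1):
        if state == "OK":
            if failures:
                # Recovery after an error: hard or soft, the handler runs either way
                runs.append((check_no, "OK", "HARD" if hard_error else "SOFT"))
            failures = 0
            hard_error = False
        elif not hard_error:
            # Soft error states and the first hard error state trigger the handler
            failures += 1
            hard_error = failures >= max_check_attempts
            runs.append((check_no, state, "HARD" if hard_error else "SOFT"))
        # A hard error that merely persists does not trigger the handler again
    return runs
```

For the sequence OK, five times CRITICAL, OK, the sketch reports handler runs at checks No. 2, 3, and 4 (soft), No. 5 (hard), and No. 7 (recovery), while check No. 6 is skipped.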
B.2 Defining the Event Handler in the Service Definition

Although Nagios executes the event handler for every event, it does not have to carry out an action each time. In our example the handler should attempt to restart the printer service on the third soft error state (check No. 4) and on the first hard error state (check No. 5), and do nothing at all the other execution times. For this purpose, the service definition is modified as follows:

    define service{
        host_name              printserver
        service_description    LPD
        ...
        event_handler          restart-lpd
        ...
    }

The event_handler parameter expects a Nagios command object that will run the handler script:

    define command{
        command_name    restart-lpd
        command_line    $USER1$/eventhandler/restart-lpd.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
    }

In this example it is called restart-lpd.sh and is not located directly in the Nagios plugin directory /usr/local/nagios/libexec, but in a subdirectory called /usr/local/nagios/libexec/eventhandler, as suggested in the Nagios documentation. The script receives three macros as parameters: the current state $SERVICESTATE$ (OK, WARNING, CRITICAL, or UNKNOWN), the state type $SERVICESTATETYPE$ (SOFT or HARD), and the number of the current (possibly repeated) attempt $SERVICEATTEMPT$ (e.g., 3 if the test is being performed for the third time). If the event handler is to be used for host checks, then the macros $HOSTSTATE$, $HOSTSTATETYPE$, and $HOSTATTEMPT$ are used instead.

B.3 The Handler Script

The actual treatment of the error, depending on the current event, is dealt with by the script defined in the command definition. So that we can concentrate on the essential aspects in this context, we shall assume that lpd is installed on the Nagios server itself. This enables the service to be restarted locally, without the need for a remote shell such as the Secure Shell.
The script restart-lpd.sh checks to see exactly what event is involved, using the macros passed on to it, and either does nothing at all or tries to restart lpd:

    #!/bin/bash
    # /usr/local/nagios/libexec/eventhandler/restart-lpd.sh
    # $1 = state, $2 = state type, $3 = attempt
    case "$1" in
      OK)
        ;;
      WARNING)
        ;;
      CRITICAL)
        # Restart on a hard state or on the third soft attempt
        if [[ "$2" == "HARD" ]] || [[ "$2" == "SOFT" && "$3" -eq 3 ]]; then
          echo "Restarting lpd service"
          /usr/bin/sudo /etc/init.d/lpd restart
        fi
        ;;
      UNKNOWN)
        ;;
    esac
    exit 0

The case statement first checks to see what state exists. Only if it is CRITICAL will the script do anything; it does not carry out any action for other states. If the service is in a critical state, either the state type must be HARD or (||) a corresponding soft state must occur for the third time in succession, so that restart-lpd.sh can execute the lpd init script with the argument restart. [1]

[1] If you want to get to know Bash programming more closely, we can recommend the excellent Advanced Bash-Scripting Guide (http://www.tldp.org/LDP/abs/html) by Mendel Cooper.

The script is executed with the permissions of the user nagios, who may neither stop nor restart system services. This is why sudo is used, which provides temporary root permissions exclusively for the start-up script /etc/init.d/lpd, just for this user. The corresponding configuration can be found in the file /etc/sudoers, but if it is edited then you must use the program visudo rather than a standard editor (this checks the configuration file for syntax errors when it is saved):

    linux:~ # visudo

Then you add the following line to the configuration file:

    nagios nagsrv=(root) NOPASSWD: /etc/init.d/lpd

In plain language this means: the user nagios may run the command /etc/init.d/lpd on the host nagsrv. The command is run as the user root, but no password is required for this.

B.4 Things to Note When Using Event Handlers

If you restart a service that is already in a soft error state as described here, the administrator will not receive any notification as long as the action was successful.
Although the log file records the restart, it will scarcely be noticed unless you search the log file explicitly for such events. This means that the administrator will seldom investigate the cause of the service failure.

You should therefore bear in mind that eliminating the problem is the best solution, and that a restart is only second best. Like air bags in automobiles, the event handler should just be regarded as an additional security measure, and should certainly not represent the primary method of handling errors. If you carry out the restart only when a hard error state occurs, the administrator is confronted with the problem through the notification mechanism.

In addition, not every service is suitable for an automatic restart. With OpenLDAP in versions before 2.1.17, a problem occurred sporadically in the replication through slurpd, which left behind a corrupted replication file. Although the replication service could be restarted, it died again after a short time. To really get the replication up and running again, you would have to repair the replication file manually.

You should always remember this example and never have complete faith in self-healing. In the worst case, restarting a service repeatedly and without thought could lead to loss of data, which might be rectified only by retrieving data from the backups.

Appendix C
Writing Your Own Plugins: Monitoring Oracle with the Instant Client

The following chapter will not introduce any finished plugins, but will illustrate how you can build your own plugin, using an example that monitors Oracle.

Some plugins do already exist for this DBMS, such as check_oracle, one of the standard Nagios plugins, or check_oracle_writeaccess [1] by Mathias Kettner. But both of them require the normal Oracle client, and most non-Oracle administrators will be out of their depth attempting to install it.

[1] http://mathias-kettner.de/nagios_plugins.html
Luckily there is an easier solution: for some time now, Oracle has been offering an instant client, which drastically reduces the installation work: unpack the zip files, set the variables, and the installation is finished. The command-line tool sqlplus can then be used immediately. It can also be used in a plugin, as the Perl script introduced in this chapter does: it sends a request to the Oracle database using sqlplus and evaluates the response.

C.1 Installing the Oracle Instant Client

Even though the instant client has been available only since Oracle version 10g, it can be used just as well with older Oracle databases such as 8i or 9i. The software is available in the form of zip files at the Oracle homepage, [2] provided you have previously registered on the Web site of the company. When downloading, you are asked some additional questions on export conditions.

Although the software costs nothing, you must observe Oracle's license terms. If your Oracle database is licensed on a CPU basis, you do not need to worry about additional access by another user (Nagios).

For sqlplus you require two zip files, [3] instantclient-basic-linux32-10.1.0.3.zip and instantclient-sqlplus-linux32-10.1.0.3.zip.

The instantclient-basic package, some 31 MB in size, contains all the necessary libraries, and the instantclient-sqlplus package, only 320 kB in size, contains a short documentation (README_IC.htm) as well as the client itself with a further library. It does not matter for the installation where the files are unpacked; in this case we will use /usr/local/oracle:

    linux:~ # mkdir /usr/local/oracle
    linux:~ # cd /usr/local/oracle
    linux:local/oracle # unzip instantclient-basic-linux32-10.1.0.3.zip
    Archive: instantclient-basic-linux32-10.1.0.3.zip
      inflating: instantclient10_1/classes12.jar
      ...
    linux:local/oracle # unzip instantclient-sqlplus-linux32-10.1.0.3.zip
    Archive: instantclient-sqlplus-linux32-10.1.0.3.zip
      inflating: instantclient10_1/README_IC.htm
      inflating: instantclient10_1/glogin.sql
      inflating: instantclient10_1/libsqlplus.so
      inflating: instantclient10_1/sqlplus

[2] http://www.oracle.com/technology/software/tech/oci/instantclient/
[3] Apart from the Linux version introduced here on Intel x86-32 systems, the client is also available for Linux x86-64, Linux Itanium, Mac OS X, HP-UX (32- and 64-bit, for both PA-RISC and Itanium), Solaris SPARC (32- and 64-bit), Solaris x86-32, AIX 5L (32- and 64-bit), and HP Tru64 UNIX.

This creates a subdirectory instantclient10_1, containing all the required files. After setting two environment variables, the instant client is ready for use:

    LD_LIBRARY_PATH=/usr/local/oracle/instantclient10_1
    SQLPATH=/usr/local/oracle/instantclient10_1

LD_LIBRARY_PATH ensures that all shared libraries from the instant client directory are taken into account first when programs are run, before the libraries installed system-wide are loaded. SQLPATH tells sqlplus where it needs to look for the file glogin.sql. This file makes a number of default settings for accessing the Oracle database, and no adjustments are necessary for our purposes.

C.2 Establishing a Connection to the Oracle Database

sqlplus requires the following details to make contact with the database:

    sqlplus user/password@//host/database

The placeholder user is replaced by a user who exists in the database, followed by a forward slash and the password. After the @// sign comes the host name or IP address, followed by the name of the database to which sqlplus should make a connection. In the following example we will use the database DEMO:

    user@linux:~$ sqlplus wob/password@//192.168.1.9/DEMO
    SQL*Plus: Release 10.1.0.3.0 - Production on Sat Aug 13 14:12:52 2005
    ...
    SQL> quit
    Disconnected from Oracle8i Release 8.1.7.0.0 - Production
    JServer Release 8.1.7.0.0 - Production

On connecting you are shown the version of the instant client used (here: 10.1.0.3.0) as well as a note on the version of the Oracle database used, in this case 8.1.7.0.0. The quit command terminates the connection. If the password is wrong, or if the user does not exist, Oracle explicitly requests the user to enter both again.

C.3 A Wrapper Plugin for sqlplus

To query an Oracle database, sqlplus is given the appropriate SQL statement via standard input and returns the reply via standard output:

    user@linux:~$ echo "select trash from nothing" | \
        sqlplus -s wob/password@//192.168.1.9/DEMO
    select trash from nothing
                      *
    ERROR at line 1:
    ORA-00942: table or view does not exist

The switch -s (silent) prevents the output of things like version and copyright, and restricts the reply to the really interesting part. If the query fails, as above, the text merely points out the error that has occurred. sqlplus itself only returns an error status as a return value if the error occurred when using the client itself; otherwise it just returns OK (command executed).

This is why sqlplus cannot be used directly by Nagios. Instead, a wrapper must be written around the actual query which evaluates the reply of the database. In the above example it generates a CRITICAL return value appropriate for Nagios from the ERROR reply, and adds a short one-line reply.

sqlplus can in principle be run with any scripting language that enables the text response to be interpreted. Since this is one of the strengths of Perl, we shall use this language for the wrapper plugin, but it could also be written in a shell like Bash; the basic principle is always the same.

C.3.1 How the wrapper works

The wrapper plugin is constructed on the following lines:

    sql-statement | sqlplus arguments | output processing

sqlplus receives an SQL statement on the standard input, and the plugin retrieves the result from the standard output.
Wrappers can be built around (almost) any program which does not provide sensible return values, but "hides" the result in text.

Perl itself does not provide a direct way of handling standard input and output at the same time. But Perl would not be Perl if there were not a module created specifically for this purpose. IPC::Open2 [4] fulfills exactly this purpose:

    use IPC::Open2;
    open2(*READFROM, *WRITETO, program, list_of_arguments);
    print WRITETO "instruction_via_standard_input\n";
    while (<READFROM>) {
        processed_standard_output;
    }
    close(READFROM);
    close(WRITETO);

[4] The module is included in the standard package of Perl 5.8.

The routine open2 requires two file handles. Their names, WRITETO and READFROM, describe the interaction from the point of view of the wrapper; seen from the program started by open2, the behavior is exactly the opposite: it reads from its standard input (WRITETO) and writes to its output (READFROM), where no distinction is made between standard output and error output. The third argument is a program with its complete path, followed by any number of arguments for the program, each separated from the next by a comma.

With the WRITETO file handle, the desired commands are sent with print. Each line for sqlplus should end here with a correct end-of-line (Perl: '\n'). With the while (<READFROM>) construction, Perl reads line by line from the standard (or error) output until there are no more lines. Then close() closes the two file handles.

Using IPC::Open2 can cause problems, however: it is conceivable that the program used (in our case, sqlplus) gets blocked, because it continues processing a part of the input only after it has written something. If the plugin only processes the output once all the input is completed, you have the classic situation of a deadlock. For this reason you must make sure that neither reading nor writing can block. Luckily the danger of this happening in our simple application is minimal.
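The same pattern (write the input, then read the output) can be sketched in other languages as well. As a comparison, the following Python example uses subprocess.run from the standard library, which feeds the input and collects the output concurrently and therefore sidesteps the deadlock problem just described. The helper name run_with_input and the stand-in program are our own assumptions; in a real plugin the command would be sqlplus with its arguments.

```python
import subprocess
import sys

# Stand-in for sqlplus so that the sketch runs anywhere: it echoes its
# standard input in upper case. This is an assumption made for illustration.
STAND_IN = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def run_with_input(command, text, timeout=30):
    """Send `text` to the command's standard input; return (exit_code, output_lines)."""
    completed = subprocess.run(
        command,
        input=text,                # written to the child's standard input
        stdout=subprocess.PIPE,    # collected while the input is being written
        stderr=subprocess.STDOUT,  # error text is captured in the same stream
        text=True,
        timeout=timeout,           # do not wait forever for a hanging program
    )
    return completed.returncode, completed.stdout.splitlines()
```

A call such as run_with_input(STAND_IN, "select 1 from dual;\n") returns the exit code of the program together with its output as a list of lines, which can then be searched for keywords in the same way as the Perl wrapper does.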
C.3.2 The Perl plugin in detail

A good Perl script starts with the instructions use strict and use warnings. Then all variables must be declared, and in other ways, too, Perl becomes very particular about syntax. [5]

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use IPC::Open2;

    my $ipath = "/usr/local/oracle/instantclient10_1";
    my $sqlplus = "$ipath/sqlplus";
    # Single quotes keep Perl from trying to interpolate the @ sign
    my $connectstring = 'wob/password@//192.168.1.9/DEMO';

    # -- Set environment variables
    $ENV{'LD_LIBRARY_PATH'} = $ipath;
    $ENV{'SQLPATH'} = $ipath;

[5] Some programmers get very irritated, especially at the start, because Perl reacts very pettily with use strict. Without this instruction, variables do not need to be declared. One single typing error in a variable name is sometimes sufficient to keep you searching for hours to find out why the value at a certain position is always 0.

$ipath contains the path to the directory in which the instant client is located, and $sqlplus has the absolute path to the program sqlplus. The connect string was already explained above. With the hash %ENV, the script sets the two required environment variables. Hash entries are referenced by Perl with $ENV{'variable name'}.

The database query statement is defined for this example in a variable:

    # -- SQL statement
    my $select = "SELECT table_name FROM all_tables ";
    $select .= "where table_name = 'VERSION';";

The instruction .= appends the following text to that already existing in $select. The SQL statement therefore selects, from the Oracle system table all_tables, which contains all the names of existing tables, the column table_name, in this case with an additional restriction to the table name VERSION.
In the next step the plugin opens the standard input and output with the routine open2:

    # -- open2 with error processing
    eval {
        open2(*READFROM, *WRITETO, $sqlplus, "-s", $connectstring);
    };
    if ($@) {
        die "Error in open2: $!\n$@\n";
    }

The sqlplus switch -s prevents unnecessary connect output. For adequate error processing, we embed the open2 command in an eval environment: since open2 aborts directly if there is an error, the programmer would otherwise have no chance to display a sensible error message. If it is needed, the error output is obtained in the eval environment through $@. die outputs this and aborts the execution of the Perl script.

The only thing remaining now is to send the SQL statement, with print WRITETO, to sqlplus (afterwards we close down the standard input WRITETO, to be on the safe side) and evaluate the output:

    # -- Write instruction
    print WRITETO $select;
    close(WRITETO);

    # -- Process reply
    while (<READFROM>) {
        print $_;
    }

while reads the output line by line. The contents of the current line are contained in $_. With your first attempts, we recommend that you have the output of all lines displayed with print $_; so that you can determine whether everything is working. If this is the case, the actual logic can be expanded: if the table name sought exists in the database, Oracle first displays the column header, then (separated by hyphens) the actual contents, that is, the name of the table being sought:

    TABLE_NAME
    ------------------------------
    VERSION

If such a table does not exist in the database, the response is:

    no rows selected

If an error occurs in the query, perhaps because the column sought, table_name, is missing or the table all_tables does not exist, sqlplus returns a message containing the keyword ERROR, as in the initial example on page 418.
The while loop now looks like this:

    # -- Process response
    while (<READFROM>) {
        if (/^VERSION/i) {
            print "OK - Table VERSION found\n";
            exit 0;
        } elsif (/no rows selected/i) {
            print "WARNING - Table VERSION not found\n";
            exit 1;
        } elsif (/ERROR/i) {
            print "CRITICAL - SQL-Statement failed\n";
            exit 2;
        }
    }
    close(READFROM);
    print "UNKNOWN - unknown response\n";
    exit 3;

The search instruction /^VERSION/i contains two special features: the i at the end ensures that the comparison ignores upper or lower case. The ^ at the beginning ensures that the text VERSION must stand at the beginning of the line. If the SQL statement sent to Oracle was incorrect, the error message repeats the statement first, but then the text VERSION is not at the beginning of the line.

If the plugin finds the sought table name VERSION in the response sent, an OK text message is displayed and it terminates with the return value 0.

If the database issues no rows selected or even an ERROR, however, the script feeds Nagios a corresponding reply and terminates with exit and the corresponding return value.

If none of the three search patterns matches, a return value must also be provided; otherwise the script will end with the status 0, and Nagios will announce: “Everything in order.” Here we take advantage of the UNKNOWN status, which is actually reserved for error processing for the plugin.

Armed with this background knowledge, it should not be too difficult to write your own Oracle plugin. Its use here is not restricted to read access: provided you have write permissions for the user in question, you can just as well formulate SQL statements with UPDATE, INSERT, or DELETE, and evaluate the answer.

Appendix D
An Overview of the Nagios Configuration Parameters

Nagios contains two independent main configuration files: nagios.cfg controls operation of the Nagios daemon, cgi.cfg configures the Web interface. Both files should be located in the Nagios configuration directory, which is normally /etc/nagios.
nagios.cfg specifies a series of further configuration, database, and log files; their functions are briefly described under the respective parameter in the following reference. The notation ⇒ parameter refers to the description of the parameter in the configuration file currently being discussed.

Unless specified otherwise, parameters may have either the value 0 (disabled) or 1 (enabled). If a parameter has a default value, this is specified accordingly. For some path details, the standard value is defined by options during compiling. The values listed in this case correspond to the paths used in the book (see Table 1.1, page 28).

For some parameters there are no defaults. If these are missing from the configuration, Nagios does not provide the corresponding function (so, for example, without the cfg_dir parameter, Nagios ignores the object definitions stored in separate directories).

D.1 The Main Configuration File nagios.cfg

accept_passive_host_checks

Global switch for passive host checks; the value 0 suppresses them.
Even though passive host checks are allowed according to nagios.cfg, this feature must also be explicitly enabled when defining the host object. Default value:

    accept_passive_host_checks=1

accept_passive_service_checks

Global switch for passive service checks. Even though the value 1 allows the corresponding tests, this feature must also be explicitly enabled when defining the service object. Default value:

    accept_passive_service_checks=1

admin_email

The e-mail address of the administrator responsible for the Nagios server, which you can access through the macro $ADMINEMAIL$. Without an explicitly configured contact object, Nagios will not send any e-mail to this address. Example (no default value):

    admin_email=nagios

admin_pager

Pager number, SMS number, or e-mail address for a pager gateway/SMS gateway through which the administrator of the Nagios server can be reached. Accessible through the macro $ADMINPAGER$. Example (no default value):

    admin_pager=pagenagios

aggregate_status_updates

Specifies whether Nagios collects the status information from hosts, services, and its own programs for the time interval ⇒ status_update_interval and writes it in a block to the ⇒ status_file. The value 0 means Nagios updates this file immediately after every event. Default value:

    aggregate_status_updates=1

auto_reschedule_checks

With this experimental feature, Nagios spreads tests equally over the time period, to avoid peaks. This can considerably reduce performance and in particular is of no use if Nagios is already struggling to keep on schedule because of poor performance. Normally this option should be switched off. Default value:

    auto_reschedule_checks=0

auto_rescheduling_interval

At the interval specified here, in seconds, Nagios redistributes the tests that are to be executed in the next auto_rescheduling_window seconds, so that the load is spread equally. Experimental feature!
Default value:

    auto_rescheduling_interval=30

auto_rescheduling_window

All tests that are to take place within the next number of seconds specified here are rescheduled by Nagios so that they are spread equally over this time period. Checks scheduled for a future time that lies outside this interval are not (yet) taken into account. Experimental feature; use only in exceptional cases! Default value:

    auto_rescheduling_window=180

cfg_dir

The directory in which the configuration files containing object definitions are located. Nagios searches through it recursively for configuration files with the extension .cfg. Files with other names are ignored, so that you can place auxiliary files in this directory, such as a CSV file from which host definitions are generated automatically by a script. To integrate individual files, use ⇒ cfg_file. The directive may be specified as often as you want (see also Section 2.1, page 38). Example (no default value set):

    cfg_dir=/etc/nagios/servers

cfg_file

Integrates a single file with object definitions. More on this in Section 2.1, page 38. The directive can be specified as often as you want. Example (no default value set):

    cfg_file=/etc/nagios/checkcommands.cfg

check_external_commands

Enables the interface for external commands. Necessary for passive checks or if commands are to be executed through the Web interface. More on this in Section 13.1, page 240. Default value:

    check_external_commands=0

check_for_orphaned_services

If the results of a service check are not received after a certain time, this is referred to as an orphaned service. Since Nagios only reschedules service checks if a result exists, it could be the case under certain conditions that a service is never tested again. Normally this only happens if a running service check is terminated manually from outside. If there is a suspicion that such orphaned services have occurred, you should set check_for_orphaned_services to 1 for debugging purposes. The suspicion is confirmed if Nagios then writes a corresponding error entry to the log file.
Whether this suspicion is justified or not can easily be seen in the Web interface: you can have all services displayed independently of their status, sorted by the last test time in ascending order. Normally the last execution of an active check should lie no further back than the time specified in normal_check_interval. Default value:

    check_for_orphaned_services=0

check_host_freshness

Allows a passive host check to be tested actively if no check result has arrived for a long time. If Nagios considers the test result to be too old, ⇒ host_freshness_check_interval steps in. More on freshness checking in Section 13.4, page 243. Default value:

    check_host_freshness=1

check_service_freshness

The service equivalent to check_host_freshness. The time after which Nagios considers the test result to be too old is defined by the parameter ⇒ service_freshness_check_interval. Default value:

    check_service_freshness=1

command_check_interval

Defines the time interval in which Nagios tests the External Command File (see Section 13.1, page 240) for new entries. For this to happen at all, ⇒ check_external_commands must be enabled.

A simple number as the value refers to the time unit specified by ⇒ interval_length (normally 60 seconds, so that 1 stands for one minute). The value -1 means that Nagios tests the interface as often as possible. If the number is supplemented (without a space) with the unit s, seconds can also be explicitly specified.

If you rely on passive checks, the interval must not be too large, since the External Command File is a named pipe in which the operating system can normally buffer only 4 KB. Default value:

    command_check_interval=-1

command_file

The named pipe that serves as an External Command File. It should only be writable for the user nagios and the group nagcmd (see also Section 13.1, page 240). Default value:

    command_file=/var/nagios/rw/nagios.cmd

comment_file

File in which Nagios stores the comments that can be entered through the Web interface.
Default value:

    comment_file=/var/nagios/comments.dat

date_format

The date format that Nagios displays in the Web interface or uses in the date and time macros. Possible values are us (mm/dd/yyyy hh:mm:ss), euro (dd/mm/yyyy hh:mm:ss), iso8601 (yyyy-mm-dd hh:mm:ss), and strict-iso8601 (yyyy-mm-ddThh:mm:ss). Default value:

    date_format=us

downtime_file

File in which the downtime details are saved that can be specified through the Nagios Web interface for hosts and/or services (see Section 16.3, page 304). Default value:

    downtime_file=/var/nagios/downtime.dat

enable_event_handlers

Globally switches on (or off) the option to work with event handlers for service and host checks. More on this in Appendix B, page 409. Default value:

    enable_event_handlers=1

enable_flap_detection

Defines whether Nagios is generally able to detect continually changing states (flap detection, more on this in Appendix A, page 401). Default value:

    enable_flap_detection=0

enable_notifications

Defines whether Nagios can send notifications. Switching off this feature normally only makes sense on the non-central hosts of a distributed installation, which do not generate notifications themselves and instead forward their test results to a central Nagios instance (see Chapter 15 from page 265). Default value:

    enable_notifications=1

event_broker_options

The event broker, a new interface in Nagios 2.0, allows third parties to add features to Nagios in the form of loadable modules, for example to save test results to a database instead of to a file. At the time of going to press there were not yet any functional modules. Possible values are 0 (switched off) and -1 (accept all broker modules). Default value:

    event_broker_options=0

event_handler_timeout

The time after which Nagios terminates event handlers that have not yet finished.
Default value:

    event_handler_timeout=30

execute_host_checks

Enables/disables active host checks globally. This is only worth switching off in distributed environments with a central Nagios instance that only accepts passive results from other Nagios servers (see Chapter 15, page 265). Default value:

    execute_host_checks=1

execute_service_checks

Like execute_host_checks, but for service checks. Default value:

    execute_service_checks=1

global_host_event_handler

Defines a global host event handler, in addition to the host-specific event handlers defined with event_handler. For this, both the global parameter ⇒ enable_event_handlers and the parameter event_handler_enabled in the host definition must be enabled. Nagios executes the global event handler, a normal command object, before the host-specific one. Example (no default value set):

    global_host_event_handler=name_of_the_command_object

global_service_event_handler

The service-specific equivalent to global_host_event_handler. Apart from ⇒ enable_event_handlers, the parameter event_handler_enabled in the service definition must also be enabled. Example (no default value set):

    global_service_event_handler=name_of_the_command_object

high_host_flap_threshold

Upper limit of flap detection for host checks. Details are given in Appendix A, page 401. Default value:

    high_host_flap_threshold=30.0

high_service_flap_threshold

Upper limit of flap detection for service checks (see Appendix A). Default value:

    high_service_flap_threshold=30.0

host_check_timeout

Time in seconds after which Nagios aborts a host check if this has not yet returned a result. Default value:

    host_check_timeout=30

host_freshness_check_interval

Interval between two freshness checks in seconds.
Default value:

    host_freshness_check_interval=60

host_inter_check_delay_method

Controls how Nagios processes host checks after a restart. A sophisticated procedure aims to prevent Nagios in this situation from executing all tests simultaneously and thus overloading the server. Possible values are: s (smart, intelligent, automatic distribution of the host checks), n (no, all checks start simultaneously), d (dumb, Nagios processes the tests at intervals of one second), and an interval specified in seconds, in the format x.xx. Default value:

    host_inter_check_delay_method=s

host_perfdata_command

A Nagios command object that processes the performance data after every host check. Requires the ⇒ process_performance_data parameter to be set.

This parameter only makes sense in a few cases, since Nagios executes host checks only if necessary, and therefore at very irregular intervals. It is used if performance data are to be processed without a template (Section 17.1, page 314). Example (no default value set):

    host_perfdata_command=process-host-perfdata

host_perfdata_file

Specifies a file or named pipe through which Nagios forwards performance data from host checks via a template mechanism to an external program (see Chapter 17, page 313). ⇒ process_performance_data must be set. Example (no default value set):

    host_perfdata_file=/tmp/host-perfdata

host_perfdata_file_mode

Defines how data is passed on to the file ⇒ host_perfdata_file. Possible values are a (append, append to a normal file) or w (write, write to a named pipe). Example (no default value set):

    host_perfdata_file_mode=a

host_perfdata_file_processing_command

Nagios command object that is called after host performance data is passed on to the ⇒ host_perfdata_file interface. The parameter is only used with the template mechanism and is optional. Programs such as perf2rrd (Section 17.3, page 325) have their own tool that permanently reads data from the interface as a daemon.
Example (no default value set):

    host_perfdata_file_processing_command=process-host-perfdata-file

host_perfdata_file_processing_interval

If this interval, specified in seconds, is larger than 0, the command belonging to it (⇒ host_perfdata_file_processing_command) is run periodically at these intervals. 0 ensures that it is not used. Example (no default value set):

    host_perfdata_file_processing_interval=0

host_perfdata_file_template

Describes the output format of the performance data. The Nagios macros and format details in it, such as \t (tabulator) or \n (linefeed), are replaced in the output. More on the use of templates in Section 17.1, page 314. Example (no default value set; the value belongs on a single line):

    host_perfdata_file_template=$TIMET$\t$HOSTNAME$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$

illegal_macro_output_chars

Lists characters that are discarded when macros are substituted for notifications, to avoid problems such as interpretation by the shell. The parameter has no influence on the substitution of macros in host or service definitions. Example (no default value set):

    illegal_macro_output_chars=`~$&|'"<>

illegal_object_name_chars

Specifies impermissible characters in the names of Nagios objects. It is recommended that at least the characters listed in the following example be specified (no default value set):

    illegal_object_name_chars=`~!$%^&*|'"<>?,()=

interval_length

Defines the time unit in seconds to which time details in object definitions (such as normal_check_interval or retry_check_interval) refer. If interval_length is 60 seconds, a time specification of 5 means five minutes. You should only change the default of 60 seconds if there is good reason to do so.
interval_length has no influence on time parameters in nagios.cfg, however. Default value:

    interval_length=60

lock_file

Specifies a lock file for the Nagios daemon, containing the process ID (PID) of the running daemon. It is required for start/stop purposes. Default value:

    lock_file=/var/nagios/nagios.lock

log_archive_path

The archive directory for rotated Nagios log files. Evaluations are based on the archive files copied there. If one of the files is deleted, the information contained in it is lost. Nagios uses the directory only if log rotation is enabled with the ⇒ log_rotation_method parameter. Default value:

    log_archive_path=/var/nagios/archives

log_event_handlers

Should event handler actions appear in the log file? The parameter is used primarily to search for errors. Default value:

    log_event_handlers=1

log_external_commands

Should Nagios log external commands (see Section 13.1, page 240) in the log file? Default value:

    log_external_commands=1

log_file

The central log file. Apart from errors and problems, it also retains all events. All history evaluations use this file. For log rotation, Nagios provides a separate mechanism with ⇒ log_rotation_method, and you should not use external programs here. Default value:

    log_file=/var/nagios/nagios.log

log_host_retries

Specifies whether Nagios should log host check repeats caused by an error state. This is absolutely essential if event handlers (see Appendix B, page 409) are used that are to react to soft states. Default value:

    log_host_retries=0

log_initial_states

Specifies whether the start state of services and hosts should appear in the log file when the Nagios system is started. Default value:

    log_initial_states=0

log_notifications

Defines whether Nagios should also log notifications in the log file. Default value:

    log_notifications=1

log_passive_checks

Specifies whether Nagios should log passive checks in the log file.
Default value:

    log_passive_checks=1

log_rotation_method

Defines whether the log file ⇒ log_file should be saved periodically to the archive ⇒ log_archive_path. Log rotation should always be left to Nagios itself rather than to any external programs; otherwise the software will have difficulties in evaluating history data. Possible values are n (none, no archiving), h (hourly, at the beginning of each hour), d (daily, each day at 00:00 hours), w (weekly, at midnight from Saturday to Sunday), and m (monthly, the first day of each month at 00:00 hours). Default value:

    log_rotation_method=n

log_service_retries

Should Nagios log the repeat of a service check because of a soft state error? This is useful for debugging when developing event handlers, but otherwise it is best to leave this out. Default value:

    log_service_retries=0

low_host_flap_threshold

Lower limit of flap detection for host checks. Details are described in Appendix A, page 401. Default value:

    low_host_flap_threshold=20.0

low_service_flap_threshold

Like low_host_flap_threshold, but for service checks. Default value:

    low_service_flap_threshold=20.0

max_concurrent_checks

Specifies how many checks Nagios may execute simultaneously. The value 0 allows an unlimited number. A restriction through a value larger than zero may, under unfavorable circumstances, lead to tests not being executed in time. Default value:

    max_concurrent_checks=0

max_host_check_spread

Within what period (in minutes) should Nagios have started all host checks after a restart? This prevents all tests from being executed simultaneously, which would overload the Nagios server.
Default value:

    max_host_check_spread=30

max_service_check_spread

Like max_host_check_spread, but for service checks. Default value:

    max_service_check_spread=30

nagios_group

The group with whose permissions the Nagios daemon runs. Default value (defined during compilation):

    nagios_group=nagios

nagios_user

The user with whose permissions the Nagios daemon runs. Default value (defined during compilation):

    nagios_user=nagios

notification_timeout

After how many seconds should Nagios abort the attempt to deliver a notification? Some actions, such as sending an SMS message, require a certain amount of time, since the system first waits for confirmation from the recipient. The value should therefore not be too low. Default value:

    notification_timeout=30

object_cache_file

The file in which Nagios stores all objects after it starts. Since the Web interface uses this file, the normal configuration files with the object definitions can be edited while Nagios is running, without jeopardizing the functionality of the Web interface. Default value:

    object_cache_file=/var/nagios/objects.cache

obsess_over_host

Defines in general whether host check results are forwarded to a central Nagios instance. If the parameter is enabled, the command defined in ⇒ ochp_command is run. This is used in distributed environments; a description can be found in Chapter 15, page 265.
Default value:

    obsess_over_host=0

obsess_over_services

Defines in general whether service check results should be forwarded to a central Nagios instance. If the parameter is enabled, the command defined in ⇒ ocsp_command is used. This feature is used in distributed environments (see Chapter 15, page 265). Default value:

    obsess_over_services=0

ochp_command

Defines the obsessive compulsive host processor, a Nagios command object that forwards all host check results in a distributed environment to a central instance (see Chapter 15, page 265). Example (no default value set):

    ochp_command=name_of_the_command_object

ochp_timeout

Defines the timeout for the ⇒ ochp_command. After this time has expired, Nagios aborts the execution of the command. Default value:

    ochp_timeout=15

ocsp_command

Specifies the command object that, as the obsessive compulsive service processor, should forward all service check results in a distributed environment to a central instance (see Chapter 15, page 265). Example (no default value set):

    ocsp_command=name_of_the_command_object

ocsp_timeout

The timeout for the ⇒ ocsp_command. After the time specified here has expired, Nagios aborts the execution of the command. Default value:

    ocsp_timeout=15

perfdata_timeout

Defines after how many seconds a performance command (⇒ host_perfdata_command, ⇒ service_perfdata_command, ⇒ host_perfdata_file_processing_command, or ⇒ service_perfdata_file_processing_command) should be aborted. Default value:

    perfdata_timeout=5

process_performance_data

Switches on processing of performance data. This parameter should be enabled only if performance data really is evaluated. Otherwise it only uses up resources on the Nagios server. Default value:

    process_performance_data=0

resource_file

The configuration file containing the definitions of the (maximum of 32) $USERx$ macros. $USER1$ normally specifies the path to the Nagios plugins.
In addition, you could store passwords here, for example, which should not be readable in the normal Nagios configuration files. The file must then be protected from all external access, and only the user nagios should be able to read it. Example (no default value set):

    resource_file=/etc/nagios/resource.cfg

retain_state_information

Determines whether Nagios will save current states to a file on shutdown (⇒ state_retention_file) and read these again when it starts. Default value:

    retain_state_information=0

retention_update_interval

Every how many minutes should Nagios store current state information in the ⇒ state_retention_file? With a value of 0, the system only saves information when Nagios is shut down. The parameter ⇒ retain_state_information must be enabled for this. Default value:

    retention_update_interval=60

service_check_timeout

Number of seconds after which Nagios aborts a service check if this has not returned a result by then. Default value:

    service_check_timeout=60

service_freshness_check_interval

Interval between two freshness checks in seconds. Default value:

    service_freshness_check_interval=60

service_inter_check_delay_method

Controls how Nagios processes service checks after a restart. An "intelligent" procedure should prevent them from all starting at the same time, to avoid putting an unnecessary load on the server. Possible values are s (smart, automatic distribution), n (no, start all tests simultaneously!), d (dumb, one second interval between checks), as well as an explicitly specified interval in seconds, in the form x.xx. Default value:

    service_inter_check_delay_method=s

service_interleave_factor

Prevents the checks that accumulate for a specific host from being executed at the same time (⇒ max_concurrent_checks, page 435), by having Nagios distribute the planned checks for all hosts "intelligently" over a period of time. Possible values are s (smart, automatic distribution) or an integer larger than 0.
With a value of 1, Nagios does not carry out any distribution; with a value of 4, Nagios initially plans every fourth service check (that is, from the set of intended checks, the 1st, 5th, 9th, etc.), then the following ones (that is, the 2nd, 6th, 10th, etc.), and so on. The test sequence is shown by the Service Detail item in the Web interface. In case of doubt, the default value can be left as it is:

    service_interleave_factor=s

service_perfdata_command

The Nagios command object that is run after each service check to process performance data. A requirement for this is that ⇒ process_performance_data must be set. The parameter is used if the performance data is to be processed without a template (Section 17.1, page 314). Example (no default value set):

    service_perfdata_command=process-service-perfdata

service_perfdata_file

Path to the file or named pipe through which Nagios forwards performance data from service checks via a template mechanism to an external program. This only works if ⇒ process_performance_data is set. More on processing performance data in Chapter 17, page 313. Example (no default value set):

    service_perfdata_file=/tmp/service-perfdata

service_perfdata_file_mode

Defines the mode in which data is passed on to ⇒ service_perfdata_file. Possible values are a (append, append to a normal file) or w (write, write to a named pipe). Example (no default value):

    service_perfdata_file_mode=a

service_perfdata_file_processing_command

A command object that is executed after Nagios has passed on service performance data to the ⇒ service_perfdata_file. The parameter is optional and is only used together with the template mechanism. As long as programs that further process the data, such as perf2rrd (Section 17.3, page 325), include their own service that permanently reads out the service_perfdata_file, you can manage without defining a command for reading out. See also Chapter 17, page 313.
Example (no default value set):

    service_perfdata_file_processing_command=process-service-perfdata-file

service_perfdata_file_processing_interval

Interval in seconds at which the command defined in ⇒ service_perfdata_file_processing_command is periodically run. Setting the value 0 ensures that it is never used. Example (no default value set):

    service_perfdata_file_processing_interval=0

service_perfdata_file_template

The output format for performance data; Nagios macros and format details such as \t (tabulator) or \n (linefeed) are substituted in the output. See also Section 17.1, page 314. Example (no default value set; the value belongs on a single line):

    service_perfdata_file_template=$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$

service_reaper_frequency

Every how many seconds should Nagios process accumulated service test results? Default value:

    service_reaper_frequency=10

sleep_time

Pause in seconds for which Nagios waits before searching again in the scheduling queue for checks to be performed. Default value:

    sleep_time=0.5

state_retention_file

The file in which Nagios stores status information on shutdown, and from which the information is read in again when Nagios is started. This is used only if the ⇒ retain_state_information parameter is set. Default value:

    state_retention_file=/var/nagios/retention.dat

status_file

Path to the file in which Nagios saves all current status values and from which the Web interface retrieves them. Default value:

    status_file=/var/nagios/status.dat

status_update_interval

At what interval should Nagios store status values in the file ⇒ status_file? If ⇒ aggregate_status_updates is not set, the system ignores this parameter and immediately writes the status values to this file (not recommended).
Default value:

    status_update_interval=60

temp_file

Path to a temporary file that Nagios uses if necessary, and deletes each time it no longer requires it. Default value:

    temp_file=/var/nagios/tempfile

use_regexp_matching

Defines whether the wildcards * (any sequence of characters) and ? (a single character) are allowed in object definitions. If you want to work with regular expressions, ⇒ use_true_regexp_matching must be used. Default value:

    use_regexp_matching=0

use_retained_program_state

Should changes made in the Web interface to the parameters ⇒ enable_notifications, ⇒ enable_flap_detection, ⇒ enable_event_handlers, ⇒ execute_service_checks, and ⇒ accept_passive_service_checks survive a Nagios restart? Only works if ⇒ retain_state_information is enabled. Default value:

    use_retained_program_state=1

use_retained_scheduling_info

Should Nagios save current scheduling information on shutdown so it can read it in again when it restarts? You can temporarily disable the parameter if you are adding a large number of tests; otherwise it is sensible to keep it enabled. Default value:

    use_retained_scheduling_info=1

use_syslog

Ensures logging of all Nagios activities in the syslog. Default value:

    use_syslog=1

use_true_regexp_matching

In contrast to ⇒ use_regexp_matching, allows the use of real regular expressions in accordance with the POSIX standard (see man 7 regex). Default value:

    use_true_regexp_matching=0

D.2 CGI Configuration in cgi.cfg

D.2.1 Authentication Parameters

Through the contact and the contact group, Nagios allocates responsibilities to users, from which permissions for the Web interface can likewise be inferred: each contact may normally only see those hosts and services for which he is also responsible. This is why the name of the Web login must match the contact name.

The parameters listed below work around this concept to some extent. They are not intended, however, to solve problems caused by contact and Web user names not matching.
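The relationship between contacts and the authorized_for_* exceptions can be pictured with a small model. The following Python sketch is a deliberately simplified illustration, not the logic of the Nagios CGI programs: it assumes that a Web user may see a host either because the login name matches one of the host's contacts or because the name is listed in the authorized_for_all_hosts parameter described later in this section. The function name may_view_host and the sample names are assumptions.

```python
def may_view_host(user, host_contacts, authorized_for_all_hosts=()):
    """Return True if the Web user may see the host (simplified model).

    host_contacts: names of the contacts responsible for the host.
    authorized_for_all_hosts: user names listed under that cgi.cfg parameter.
    """
    return user in host_contacts or user in authorized_for_all_hosts

# Hypothetical data: jdoe is the only contact responsible for the host.
contacts_for_host = {"jdoe"}
print(may_view_host("jdoe", contacts_for_host))
print(may_view_host("guest", contacts_for_host))
print(may_view_host("guest", contacts_for_host,
                    authorized_for_all_hosts={"nagiosadmin", "guest"}))
```

The model also shows why matching names matter: a Web user whose login differs from the contact name is simply not found among the contacts.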
use_authentication

Determines whether you normally need to log in to the Web interface. The contact name is always used as the user name; how you store passwords is described in Section 1.3, page 33. In general you should never switch off this authentication, but if you do, you should make sure that the interface for external commands (Section 13.1, page 240) is switched off completely. Default value:

    use_authentication=1

authorized_for_all_host_commands

Allows the users specified here to run commands through the Web interface for all hosts, without them belonging to the appropriate contact group. Example (no default value set):

    authorized_for_all_host_commands=nagiosadmin

authorized_for_all_hosts

Allows the users specified here to look at all host information, irrespective of their actual responsibility. Example (no default value set):

    authorized_for_all_hosts=nagiosadmin,guest

authorized_for_all_service_commands

Allows the users defined here to run commands for all services via the Web interface, independently of membership of contact groups. Example (no default value set):

    authorized_for_all_service_commands=nagiosadmin

authorized_for_all_services

Allows the users specified to view all service information, irrespective of their own permissions. Example (no default value set):

    authorized_for_all_services=nagiosadmin,guest

authorized_for_configuration_information

Enables the users specified to view all configuration data via the Web interface. This should be reserved for the Nagios administrators. Example (no default value set):

    authorized_for_configuration_information=nagiosadmin,jdoe

authorized_for_system_commands

Allows the specified users to shut down or restart Nagios via the Web interface. Normally, nobody has this authorization. Example (no default value set):

    authorized_for_system_commands=nagiosadmin

authorized_for_system_information

Allows the specified users to view Nagios process information. Normally, nobody may do this.
Example (no default value set):

    authorized_for_system_information=nagiosadmin,theboss,jdoe

D.2.2 Other Parameters

default_statusmap_layout

Defines the layout for the status map. Possible values are:

    0  coordinates defined through a hostextinfo object
    1  the user must move by mouse click from one layer to the next one
    2  compressed tree; somewhat confusing, because branches cut across each other in the picture
    3  balanced tree; the branches are displayed so that there are no crossovers in the graphic, which is clearer but requires much space
    4  circular representation, with Nagios at the center: hosts that can be reached directly (that is, without the "diversion" via parents) are shown in the inner circle, while the other circles hold those hosts that can be reached from hosts already entered in the graphic
    5  circular, like 4; the area around the host is marked in color, gray for OK, red for DOWN or UNREACHABLE (Figure D.26 on page 291 shows an example)
    6  circular; the hosts are shown as balloons

The settings can also be changed in the Web interface without the need to adjust the configuration file each time, which makes it easier to try things out. Example:

    default_statusmap_layout=5

default_statuswrl_layout

Determines the layout for the VRML representation of the status page through statuswrl.cgi. Possible values are 0, 2, 3, and 4; the corresponding appearance is based on the values of the same name for ⇒ default_statusmap_layout. Example:

    default_statuswrl_layout=4

default_user_name

Name of a guest user who may use the Web pages without authentication. You should only use this parameter if the Web server is protected from unauthorized access, and you should look closely at what permissions this user is allocated through the contact groups. Example (no default value set):

    default_user_name=guest

main_config_file

The Nagios main configuration file. Default value:

    main_config_file=/etc/nagios/nagios.cfg
nagios_check_command
A command that checks the status of the Nagios daemon. You can omit this parameter, since the CGI programs already contain an equivalent built-in function. If Nagios is not running, they issue a corresponding error message. If you still want to define a separate command here, you can use the plugin check_nagios (Section 7.11, page 150):
nagios_check_command=/usr/local/nagios/libexec/check_nagios -F /var/nagios/nagios.log -e 60 -C /usr/local/nagios/bin/nagios

physical_html_path
Path in the file system that leads to the Nagios directory for documentation and images. See also ⇒ url_html_path. Default value:
physical_html_path=/usr/local/nagios/share

refresh_rate
Specifies at what intervals (in seconds) the Web page is automatically updated. Default value:
refresh_rate=60

statusmap_background_image
The background image for the status map display. Example (no default value set):
statusmap_background_image=smbackground.gd2

statuswrl_include
A file with its own VRML objects used in the VRML representation. The path is specified relative to ⇒ physical_html_path. Example (no default value set):
statuswrl_include=myworld.wrl

url_html_path
The logical path to the Nagios documents and images from the point of view of the browser, starting from the document root of the Web server.
If you use this path in a URL, you will be taken to the Nagios start page. Default value:
url_html_path=/nagios
oscillating see flapping status display in theWeb interface see sta- tus.cgi status flags monitoringprocesseswithspe- cific 139 status macros 411 status.cgi 274, 279–283, 404, 405 outputstyle 280 status file441 status update interval 441 statusmap.cgi274, 291–293 user defined maplayout310 usingindividualicons 309 statusmap background image446 statusmap image309 statuswml.cgi274, 295 statuswrl.cgi274, 293–294, 309, 310, 445 statuswrl include446 storagespace see hard drivecapac- ity sudo 412 summary.cgi 275, 301–303 SuSE NET-SNMP 184 NRPE installation 166 smsclient installation 228 specialfeatures of theApache configuration 34 swap area usageinUnixvs. Windows359 swap partition testing158 swap space in thehostMIB 188 in theUCD-SNMP-MIB 188 monitoringwithSNMP 209– 212 testing82, 136–137 switched-offcomputer see down (state) switches monitoring177 symbolic links forthe startscript64 syslog integratingintoNagios254– 259 logging of NSCA 250 syslog-ng see syslog documentation255 syslog-ng.conf 255 system information storinginSNMP 192 system load see CPUload system start64 system time checkingwithNTP see check ntp checking with thetimeprotocol see check time monitoring145–147 T tac.cgi274, 290–291, 404, 405 TCPwrapper usingwithNRPE169, 172 telephonenumber forSMS see pager temp file441 temperature monitoring377–382 testingvia SNMP 200 templates54–56 fordistributedmonitoring 269–272 fordrraw 335 forprocessing performance data 314–316 to retrieve SAPmonitoringdata 392–394 test of theNSCA254 test plugin see check dummy test repeat definingnumber see max check attempts tests postponing287 time system see system time time axis of states that have occurred see trends.cgi time details 43 time object see timeperiod(object) time period defining54 formessages 220 formonitoring see check period fornotification42, 45, 51, 222–223 time protocol formonitoringsystemtime see check time time unit43 timeout plugin 86, 88 timeperiod(object)42, 45, 54 TLS see SSL TokenRing vs.CSMA/CD (Ethernet) 
182 topology see networktopology traffic see networktraffic traffic light states 16, 48 traps see SNMP traps trends.cgi 275, 303–304 U UCD-SNMP 184 UCD-SNMP-MIB 188 UDPservices monitoring see check udp uninterruptible power supply see UPS UNKNOWN(state) 17, 75, 86 as adisplay criterionfor sta- tus.cgi282 colorinthe Webinterface 297 displaying in theWeb interface 291 force/suppressnotification219 macro227 return value154, 155 UNREACHABLE(state) 17, 46, 74, 219 as displaycriterionfor status.cgi 460 Index 282 macro226 UP (state)74, 75 as adisplay criterionfor sta- tus.cgi282 as displaycriterionfor status.cgi 282 macro226 UPS126 check load 150 checking load status 150 monitoring126–131, 149–150 SNMP capability177 upsd 127 upsmon 127 uptime 137 testingonWindowscomputers 360–361 UPTIME(NSClient/NC Netcom- mand)360–361 URL adding to Nagios Webpage43 url html path 58, 446 urlize156 forWindows374 use authentication 58,443 use regexp matching 441 use retained program state442 use retained scheduling info 442 use syslog 442 use true regexp matching 442 USEDDISKSPACE(NSClient/NC Net command) 359–360 user creating161 user account creating see creatinguser user permissions changingonfile see chmod useradd161 users loggedin, monitoringnumber of 144 V volatile services 142, 257–258 voltagedetector 378 VRML display monitored computer see sta- tuswrl.cgi VRML-capable browser293 vrml image309 VRRP 209 vrwave 294 W WAP Nagios via295 WAPaccess to Nagios see statuswml.cgi WARNING(state) 16, 17, 75, 85 as adisplay criterionfor sta- tus.cgi282 force/suppressnotification219 macro227 markinginthe Webinterface 66 resetting manually see error states return value154, 155 warninglimit check apc150 check by ssh 159 check dig107 check disk 134 check file age149 check http 99 check icmp 89 check iftraffic 208 check ldap 122 check load 137 check mailq 147, 148 check nt 356 check ntp146 check pcmeasure 380 check pgsql 117 check procs139 check smtp 93 check snmp 197, 201 check snmp in lm-sensors 
query200 check snmp load 212 check squid105 check swap 136 check tcp95, 110 check time 146, 147 check udp113 check ups129 check users145 CPULOAD358 forslownetwork connections 86 in performance data 146 in plugin output87 specifying 88 wateralarm 378 Webfront end see Webinterface Webinterface 18, 64–68, 273–312 configuration 33–36 context-dependenthelp58 displaying host groups 41 generaloverview65, 274 granting auseraccess to every- thing58 overview of allhosts andser- vices67 overview of defectiveservices 67 overview of faulty services 66 representation of flappingser- vices404–406 representing servicegroups42 search options67 showingasingle host 67 showingvirtualhosts as links 99 starting 34 switching authentication on/off 58 welcomescreen 64 Webproxy monitoring see Squid Webserver specifying user andpassword forthe test 99 testing see HTTP testingthe lifespanofacertifi- cate 101 weekdays restricting actions54 Windows 461 Index listing processes367 listing services 367 monitoring353–375 NRPE see NRPE NT, see NRPE NT Performance Counter see Per- formance Counter querying eventlog 368–370 querying WMIdatabase 371 scripting 354 SNMP 354 Windowseventlog353 Windowsserver monitoring81 Windowsservices monitoring361–362 WMIdatabase querying 371 WMICOUNTER (NC Netcommand) 371 WMIQUERY (NC Netcommand) 371 WML see statuswml.cgi X xinetd configuration forNRPE168 configuration forNSCA251 Y yaps 227 yellow(state) 16 Z zombies checking system for139 462