= Quattor Workshop - CERN - 16-18/3/2011 =

[[TracNav]]

[[TOC(inline)]]

[http://indico.cern.ch/conferenceTimeTable.py?confId=118603#all.detailed Agenda].

== Quattor Status Report - M. Jouvin ==

See [http://indico.cern.ch/getFile.py/access?contribId=16&sessionId=13&resId=0&materialId=slides&confId=118603 Slides].

== Core Tools ==

=== Quattor Configuration Modules - L. Munoz Meijias ===

Recent addition
 * `ncm-ofed` (from Stijn): Infiniband configuration, very preliminary

Obsolete/suspect modules
 * `ncm-http`: used only at CERN
 * `ncm-tomcat`: used only at CERN
 * `ncm-tomcat2`: who uses it?
 * `ncm-rproxy`: replaced by `ncm-squid`?
 * `ncm-smartd` and `ncm-smartd5`: same schema, what is the difference?

Component quality: testing is tedious, old bugs often come back...
 * CAF is a critical tool to help with component reliability
   * Can run commands, update/write files with traceability and a real test suite
 * LC status: still maintained, private copy in Quattor
   * LC::File should not be used directly: use CAF::FileWriter and CAF::FileEditor instead
   * LC::Process no longer expands to a subshell: a new version will allow splitting the command on whitespace

`ncm-accounts`: rewritten twice in the last 2 years, hope v6 will fix all problems for good
 * Very fast with a large number of accounts
 * It correctly handles the creation of users' home directories

`ncm-network`: lots of complaints, ugly code, needs rewriting, must review what the component should really do
 * Define IP config for network cards
 * Define bridge devices: required for virtualization support
 * Tuning of device parameters
 * Configuration of channel bonding: kernel module manipulations should be done with `ncm-modprobe`
 * Must work on diskless servers: no more restart of the network just for testing
 * TCP parameters should be controlled by `ncm-sysctl`
 * IPv6 support: schema extension needed
 * Loic may try to look at it

`ncm-sudo`: CERN needs to be able to manage sudoers.ldap

`ncm-useraccess`: proposal to remove PAM-based ACL support
 * It was a mistake to put it here, use `ncm-pam` instead

`ncm-authconfig`: rewritten with LDAP support

`ncm-modprobe`: support for SL4 to SL6
 * SL3 support removed

`ncm-spma` v2: new incompatible schema, 2-phase upgrade needed

`ncm-chkconfig`: problem with the removal of the ability to specify both `on` and `off`
 * Need the ability to say "set service on at given runlevels (e.g. 4,5) but off at the other levels"

CCM: no changes since Feb. 2009, MS patches still not there?
 * Still no support for local profiles, which is required for easier testing

`ncm-ncd`: version 1.3 by MS allows subclassing
 * Some components won't work with 1.2 in the future
 * Should try to enable `-t` in Perl to get warnings about dangerous components
 * `cdp-listend` and `ncm-cdispd` should drop privileges when not needed

RHEL6 support: another platform to maintain... introduces Perl 5.12
 * Major language changes but backward compatible
 * Use stricter modes in Perl 5.12 to identify improvements needed in existing components
 * Don't jump too quickly into new, incompatible features
 * Core libraries (CAF, LC) are not affected by the changes

Some actions needed
 * Identify obsolete components and move them to another location (delete them?)
 * Identify and fix components using backquotes to spawn commands instead of CAF::Process
 * Do a Quattor component release? Or an official/recommended list of components?
   * Can the new build tools help with this?
   * What process? Luis will no longer be able to (significantly) contribute as he's leaving CERN...

=== Aquilon and other MS Developments Status - N. Williams ===

Main developments: integration with ESX through QRD

Profile signature: decided it was not the way to go, just encrypting profiles over the wire using the Kerberos host keytab

`ncm-network`: MS would like to be involved in the rewrite, ready to contribute

`ncm-modprobe`: difficulties with the last version because of the removal of shell expansion

RHEL6: found various issues, some of them not identified at CERN (in particular AII-related ones)
 * Goal: support ready in a few months, fixing problems as they are discovered

Aquilon: will add support for "clusters" in a way similar to SCDB to help with the use of QWG templates. Will start looking at Solaris 11 support in the future.
 * Probably not before autumn

May also look at feeding their internal Windows management system from Quattor.

== QWG Templates ==

=== QWG Templates Update - M. Jouvin ===

See slides.

=== Nagios Templates - R. Starink ===

Almost done
 * monitoring/nagios/config: server configuration
 * RPM includes
 * Apache configuration

Difference between generic and local configuration: most work done locally, still needs to be merged

Pnp4Nagios: service tpl needs site modifications

Hierarchy of Nagios servers: optional

Grid monitoring: based on EGEE/EGI work
 * Uses YAIM, not easy to integrate into existing QWG templates. More discussion required.
   * Replace `ncm-yaim` by `ncm-filecopy`?
 * Not generic

Summary
 * No visible progress yet but some progress behind the scenes
 * Merge into SVN requires time and dedication: would like to test results
 * Grid monitoring to be done after completion of base work

=== LEMON - I. Fedorko ===

Exception sensor: runs on the monitored node to report abnormal conditions (based on local conditions and metrics collected by other sensors on the node) and optionally run a corrective action.
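
The exception sensor behaviour described above (evaluate conditions on locally collected metrics, report the abnormal ones, optionally run a corrective action) can be illustrated with a minimal sketch. This is not LEMON code or its API: the `ExceptionRule` class, the `evaluate` function, the metric names and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping, Optional


@dataclass
class ExceptionRule:
    """Hypothetical rule: a named condition plus an optional corrective action."""
    name: str
    condition: Callable[[Mapping[str, float]], bool]
    corrective_action: Optional[Callable[[], None]] = None


def evaluate(rules: List[ExceptionRule], metrics: Mapping[str, float]) -> List[str]:
    """Return the names of the rules whose condition holds, running their actions."""
    raised = []
    for rule in rules:
        if rule.condition(metrics):
            raised.append(rule.name)
            if rule.corrective_action is not None:
                rule.corrective_action()
    return raised


# Example with made-up metric names and thresholds (assumptions, not LEMON metrics).
actions_run: List[str] = []
rules = [
    ExceptionRule("tmp_full", lambda m: m["tmp_used_percent"] > 95.0,
                  lambda: actions_run.append("clean_tmp")),
    ExceptionRule("high_load", lambda m: m["load_average"] > 50.0),
]
raised = evaluate(rules, {"tmp_used_percent": 97.5, "load_average": 3.2})
print(raised, actions_run)
```

In this sketch the corrective action is optional per rule, mirroring the "optionally run a corrective action" wording; a rule without one only reports.
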

LEMON usage: CERN (several instances at CC and in online farms), CNAF, MS

Backends
 * Supported: Oracle and flat files
 * No alarming possible with flat files: alarming relies on PL/SQL (available for most DBs)
 * Postgres/MySQL support should be possible to add, as internally LEMON is DB vendor independent

LEMON usage at CERN: 11K entities (8K machines), 5K+ nodes with 150-200 metrics
 * Performance monitoring
 * Application monitoring
 * Infrastructure metrics: temperature, power...
 * 5 core sensors covering 60% of measurements
 * 1 LEMON server (+1 for historical data)

Quattor management of LEMON
 * LEMON agent configured with `ncm-fmonagent`
 * LEMON client configuration files signed with SINDES

Current activities
 * Federation of instances (worked with MS)
 * Virtual machine monitoring: Quattor is used to generate images (golden nodes), leading to several VM instances sharing the same SINDES certificate
   * Also customized sensors for hypervisors
 * New LEMON web for better scalability and improved flexibility (more dynamic operations)

Future
 * Consolidation of monitoring activities at CERN IT, including experiment support
 * New DB schema for increased scalability
 * DB data export for data mining
 * Exception classification for alarming
 * Remote instances (monitoring of remote T0)
 * Probably integrating with Nagios for collection
 * Interfacing to Nagios: interested in ideas/experiences
 * Interfacing with Windows monitoring

LEMON development team at CERN: 1 to 1.5 FTE
 * Enough for maintenance for CERN usage
 * Not enough to do major reengineering: a problem as the use cases have changed
   * Consolidating monitoring activities (and manpower) is one direction
   * Interacting with other tools is the other direction

== CMS Experience with Quattor - J. A. Coarasa Perez ==

CMS online computing infrastructure requirements
 * Complete autonomy from CERN IT and network
 * Myrinet core network
 * Redundant services (2 different computing rooms 200 m apart), managed remotely from a central control room
 * Fast configuration turnaround
 * Ability to read from electronics at 100 kHz (100 GB/s)
 * Computing power to do data selection (2 GB/s output): ~2500 cores
 * Enough disk storage for 2 days: ~300 TB
 * Spare computers for quick replacement in case of failure (~200)

Limited manpower: ~5 FTEs

Quattor is the core configuration management tool
 * Computers installed through Quattor but not using AII
 * Based on a (very) old Quattor version: 1.3
 * panc 6
 * RPM repositories hosted on NetApp filers with 6 SWrep servers
 * Performance: 10 min to compile the profiles of all 2500 machines
 * Time to deploy changes: ~10 min (cdp notification used)
 * Reinstallation of 1K machines in 1 h 15 min
   * Problem when SWrep servers were overloaded: spma timeout, no retry

Well defined template structure/layout (CMS specific)

Some CMS-specific developments
 * Dropbox for RPMs to allow people without Quattor expertise to update their SW in a traceable/reproducible way
 * Buggy components requiring workarounds (in particular for configuration removal)
 * Copy of configuration files to individual computers: copyd
 * Network configuration derived from DNS
 * RPMs used to create users
 * Web frontend to CDB configuration: template summarizer (parsing part of the profiles according to the CMS way of configuring systems)
 * Software Update Tools: allow updating an existing RPM on a group of computers (dropbox)
   * Upgrade or downgrade
   * Allows rolling back changes
   * Allows checking the status of the update
   * Performs checks on the RPM: already exists, not named properly...
   * Cannot be used to add a new RPM to the configuration: need to request addition of a new RPM
   * Only one person can run it at any time
 * Ability to run commands on a set of computers selected based on the use of a given template (like a type)

Conclusions
 * Quattor has been and is scalable to CMS needs
 * Not always as flexible as we'd like...
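
As a back-of-the-envelope cross-check of the CMS requirements quoted above (100 kHz readout at 100 GB/s, 2 GB/s output, ~300 TB of disk for 2 days), the arithmetic can be sketched as follows. The sketch assumes decimal units (1 GB = 10^9 bytes) and a constant output rate, neither of which is stated in the talk; the variable names are illustrative only.

```python
# Figures taken from the requirements list; units assumed decimal (1 GB = 1e9 bytes).
TRIGGER_RATE_HZ = 100e3        # 100 kHz readout rate
READOUT_BYTES_PER_S = 100e9    # 100 GB/s read from the electronics
OUTPUT_BYTES_PER_S = 2e9       # 2 GB/s after data selection
DISK_BYTES = 300e12            # ~300 TB of disk storage
SECONDS_PER_DAY = 86400

# Average amount of data read out per trigger.
event_size_bytes = READOUT_BYTES_PER_S / TRIGGER_RATE_HZ

# Volume written in two days if the output rate were sustained continuously.
two_days_bytes = OUTPUT_BYTES_PER_S * 2 * SECONDS_PER_DAY

# Number of days the quoted disk space lasts at that sustained rate.
days_covered = DISK_BYTES / (OUTPUT_BYTES_PER_S * SECONDS_PER_DAY)

print(f"event size: {event_size_bytes / 1e6:.1f} MB")
print(f"two days at full rate: {two_days_bytes / 1e12:.1f} TB")
print(f"days covered by the disk: {days_covered:.2f}")
```

Under these assumptions the readout figures correspond to about 1 MB per event, and two days of uninterrupted 2 GB/s output would be about 346 TB, the same order as the quoted ~300 TB, which is an approximate figure.
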