= 11th Quattor Workshop - CERN - 16-18/3/2011 =

[[TracNav]]

[[TOC(inline)]]

[http://indico.cern.ch/conferenceTimeTable.py?confId=118603#all.detailed Agenda].

== Quattor Status Report - M. Jouvin ==

See [http://indico.cern.ch/getFile.py/access?contribId=16&sessionId=13&resId=0&materialId=slides&confId=118603 Slides].

== Core Tools ==

=== Quattor Configuration Modules - L. Munoz Meijias ===

Recent addition
 * `ncm-ofed` (from Stijn): Infiniband configuration, very preliminary

Obsolete/suspect modules
 * `ncm-http`: used only at CERN
 * `ncm-tomcat`: used only at CERN
 * `ncm-tomcat2`: who uses it?
 * `ncm-rproxy`: replaced by `ncm-squid`?
 * `ncm-smartd` and `ncm-smartd5`: same schema, what is the difference?

Component quality: testing is tedious, old bugs often come back...
 * CAF is a critical tool to help with component reliability
   * Can run commands and update/write files with traceability and a real test suite
 * LC status: still maintained, private copy in Quattor
   * LC::File should not be used directly: use CAF::FileWriter and CAF::FileEditor instead
   * LC::Process no longer expands to a subshell: the new version will allow splitting the command on whitespace

`ncm-accounts`: rewritten twice in the last 2 years, hope v6 will fix all problems definitively
 * Very fast with a large number of accounts
 * Correctly handles creation of users' home directories

`ncm-network`: lots of complaints, ugly code, needs rewriting; must review what the component should really do
 * Define IP config for network cards
 * Define bridge devices: required for virtualization support
 * Tuning of device parameters
 * Configuration of channel bonding: kernel module manipulations should be done with `ncm-modprobe`
 * Must work on diskless servers: no more restart of the network just for testing
 * TCP parameters should be controlled by `ncm-sysctl`
 * IPv6 support: schema extension needed
 * Loic may try to look at it

`ncm-sudo`: CERN needs to be able to manage sudoers.ldap

`ncm-useraccess`: proposal to remove PAM-based ACL support
 * Mistake to have put it here, use `ncm-pam` instead

`ncm-authconfig`: rewritten with LDAP support

`ncm-modprobe`: support for SL4 to SL6
 * SL3 support removed

`ncm-spma` v2: new incompatible schema, 2-phase upgrade needed

`ncm-chkconfig`: problem with removal of the ability to specify both `on` and `off`
 * Need ability to say "set service on to runlevel ",4,5 but off at other levels"

CCM: no changes since Feb. 2009, MS patches still not there?
 * Still no support for local profiles, required for easier testing

`ncm-ncd`: version 1.3 by MS allows subclassing
 * Some components won't work with 1.2 in the future
 * Should try to enable `-t` in Perl to get warnings about dangerous components
 * `cdp-listend` and `ncm-cdispd` should drop privileges when not needed

RHEL6 support: another platform to maintain... introduces Perl 5.12
 * Major language changes but backward compatible
 * Use stricter modes in Perl 5.12 to identify improvements needed in existing components
 * Don't jump too quickly into new, incompatible features
 * Core libraries (CAF, LC) are not affected by the changes

Some actions needed
 * Identify obsolete components and move them to another location (delete them?)
 * Identify and fix components using backquotes to spawn commands instead of CAF::Process
 * Do a Quattor component release? Or an official/recommended list of components?
   * Can the new build tools help with this?
   * What process?

Luis will no longer be able to (significantly) contribute as he's leaving CERN...

=== Aquilon and other MS Developments Status - N. Williams ===

Main developments: integration with ESX through QRD

Profile signature: decided it was not the way to go, just encrypting profiles over the wire through Krb host keytab

`ncm-network`: MS would like to be involved in the rewrite, ready to contribute

`ncm-modprobe`: difficulties with the last version because of the removal of shell expansion

RHEL6: found various issues, some of them not identified at CERN (in particular AII-related ones)
 * Goal: support ready in a few months, fixing problems as they are discovered

Aquilon: will add support for "clusters" in a way similar to SCDB to help with use of QWG templates.

Will start looking at supporting Solaris 11 in the future.
 * Probably not before autumn

May also look at feeding their internal Windows management system from Quattor.

=== Pan Compiler Status - C. Loomis ===

v8.5-7:
 * Warnings for all v9 deprecated features
 * Fix for NS object templates
 * Fix for panc on Windows
 * Introduction of "prefix" statement to simplify paths in templates
   * In effect from where it is declared to the end of the template; several are possible in the same template but this is not recommended

v9
 * v9.0: available soon, development version
 * v9.2: early summer, production version

v9 main new features for the first versions
 * Fixed generation of annotation output for better handling of namespaces
 * Fix command line script
 * New ant task and maven plugin

v9 roadmap
 * Investigate use of clojure (lisp-like over JVM): better implementation of mem mgt in panc, agent model similar to panc task mgt, memoization of file system stats...
   * May allow simplifying/streamlining panc code
 * Focus on code maintainability rather than new features (none requested)

Support
 * v8.4.7 is the last v8 version
 * v8.2.x is now unsupported, please upgrade
 * v9.x: all new developments in these releases

Requests from the discussion
 * debug() outside DML (Nick)
 * Produce a "map" of the includes without executing (compiling) them (Nick)
   * Difficult when using loadpath
   * clojure may help

== QWG Templates ==

=== QWG Templates Update - M. Jouvin ===

See slides.

=== Nagios Templates - R. Starink ===

Almost done
 * monitoring/nagios/config: server configuration
 * RPM includes
 * Apache configuration

Difference between generic and local configuration: most work done locally, still needs to be merged

Pnp4Nagios: service tpl needs site modifications

Hierarchy of Nagios servers: optional

Grid monitoring: based on EGEE/EGI work
 * Uses YAIM, not easy to integrate into existing QWG templates. More discussion required.
   * Replace ncm-yaim by ncm-filecopy?
 * Not generic

Summary
 * No visible progress yet but some progress behind the scenes
 * Merge into SVN requires time and dedication: would like to test results
 * Grid monitoring to be done after completion of base work

=== LEMON - I. Fedorko ===

Exception sensor: runs on the monitored node to report abnormal conditions (based on local conditions and metrics collected by other sensors on the node) and optionally run a corrective action.

LEMON usage: CERN (several instances: one at CC + online farms), CNAF, MS

Backends
 * Supported: Oracle and flat-files
 * No alarming possible with flat-files: alarming relies on PL/SQL (available for most DBs)
 * Postgres/MySQL support should be possible to add as internally LEMON is DB vendor independent

LEMON usage at CERN: 11K entities (8K machines), 5+ Knodes with 150-200 metrics
 * Performance monitoring
 * Application monitoring
 * Infrastructure metrics: temperature, power...
 * 5 core sensors covering 60% of measurements
 * 1 LEMON server (+1 for historical data)

Quattor management of LEMON
 * LEMON agent configured with ncm-fmonagent
 * LEMON client authenticated to the server via the SINDES host certificates

Current activities
 * Federation of instances (worked with MS)
 * Virtual machines monitoring: Quattor used to generate images (golden nodes), leading to several VM instances sharing the same SINDES certificate
   * Also customized sensors for hypervisors
 * New LEMON web for better scalability and improved flexibility (more dynamic operations)

Future
 * Consolidation of monitoring activities at CERN IT, including experiment support
 * New DB schema for increased scalability
 * DB data export for data mining
 * Exception classification for alarming
 * Remote instances (monitoring of remote T0)
 * Probably integrating with Nagios for collection
 * Interfacing to Nagios: interested in ideas/experiences
 * Interfacing with Windows monitoring

LEMON development team at CERN: 0.5 to 1.5 FTE
 * Enough for maintenance for CERN usage
 * Not enough to do major reengineering: a problem as the use cases have changed
   * Consolidating monitoring activities (and manpower) is one direction
   * Interacting with other tools is the other direction

== CMS Experience with Quattor - J. A. Coarasa Perez ==

CMS online computing infrastructure requirements
 * Complete autonomy from CERN IT and network
 * Myrinet core network
 * Redundant services (2 different computing rooms 200 m apart), managed remotely from a central control room
 * Fast configuration turnaround
 * Ability to read from electronics at 100 kHz (100 GB/s)
 * Computing power to do data selection (2 GB/s output): ~2500 cores
 * Enough disk storage for 2 days: ~300 TB
 * Spare computers for quick replacement in case of failure (~200)

Limited manpower: ~5 FTEs

Quattor is the core configuration management tool
 * Computers installed through Quattor but not using AII
 * Based on a (very) old Quattor version: 1.3
   * panc 6
 * RPM repositories hosted on NetApp filers with 6 SWrep servers
 * Performance: 10 min for compiling the whole 2500 machines
 * Time to deploy changes: ~10 min (cdp notification used)
 * Reinstallation of 1K machines in 1h 1/4
 * Problem when SWrep servers were overloaded: spma timeout, no retry

Well defined template structure/layout (CMS specific)

Some CMS-specific developments
 * Dropbox for RPMs to allow people without Quattor expertise to update their SW in a traceable/reproducible way
 * Buggy components requiring workarounds (in particular for configuration removal)
 * Copy of configuration files to individual computers: copyd
 * Network configuration derived from DNS
 * RPMs used to create users
 * Web frontend to CDB configuration: template summarizer (parsing part of the profiles according to the CMS way of configuring systems)
 * Software Update Tools: allow updating existing RPMs in a group of computers (dropbox)
   * Upgrade or downgrade
   * Allows rolling back changes
   * Allows checking the status of the update
   * Performs checks on the RPM: already exists, not named properly...
   * Cannot be used to add a new RPM to the configuration: addition of a new RPM must be requested
   * Only one person can run it at any time
 * Ability to run commands on a set of computers selected based on the use of a given template (like a type)

Conclusions
 * Quattor has been and is scalable to CMS needs
 * Not always as flexible as we'd like...

== Security ==

=== SINDES Update - J. Dudziec ===

Certificate authority + secured file store.

Several issues requiring major reengineering
 * Only one SINDES user per system: no fine-grained permissions for different users
 * Impossible to delete a file in the file store
 * Impossible to move a machine from one cluster to another
 * No support for subclusters
 * Unattended installation of machines not really possible: SINDES must be notified first

Progress since last workshop
 * Documentation on Quattor wiki
 * Code in SF with Apache2 license: still some issues (currently GPL)

SINDES v2 overview
 * Apache + SOAP
 * Internal DB for SINDES-specific information, e.g. SINDES items (file + permissions)
 * Interaction with CDB through CDB2SQL to retrieve host information (e.g. cluster/subcluster)
 * Krb used as a replacement for SINDES CA for accessing the SINDES server
 * Same client tool for users and hosts
 * Major improvement of code maintainability: 300 LOC instead of 10K in v1
 * Site-specific authorization modules to use whatever is appropriate at the site: CERN is using egroups, LANDB...
 * Flexible item to host/cluster/subcluster mapping

Impact of SINDES CA removal: not necessarily more difficult to set up Krb than SINDES CA...
 * SINDES CA adds a lot of complexity to SINDES code, not necessarily secure...
 * May consider adding optional support for other authentication methods, like certificates

Unattended installation
 * CERN: initial keytab download allowed if the request comes from low ports and reverse DNS matches (no time window)
 * MS: adding a time-window feature to Krb

Only client available in v1 is command line, other APIs under consideration for v2
 * May change the implication of a GPL license

=== Quattor and CERN Security - L. Munoz Meijias ===

Applying SW updates
 * CERN IT prepares snapshots of recommended versions every week, users pick one of them
 * In case of a major security vulnerability/problem, CERN Security can make an update mandatory

Challenge of managing 250M accounts... in particular the challenge of maintaining ACLs
 * Unix groups are not an appropriate solution: difficult to maintain, no recursive groups
 * LDAP + egroups used to implement the required features
   * egroups: logical name for a set of accounts, recursive
 * Quattor used to configure LDAP authorization
 * Still an issue to configure non LDAP-aware services

Access to accounts: mainly through Krb (+PAM)
 * SSH keys are a potential problem if computers outside CERN are compromised: no way to know if the keys have been removed from all CERN computers
 * Privileged accounts: no possibility to ssh out from a privileged account
   * Need to restrict more accounts (apache, oracle)
   * Would like to restrict the allowed origin of incoming connections to privileged accounts

Multi-factor authentication: no uniform way of doing it, work in progress

Logging of all commands and command arguments on sensitive systems using `snoopy`
 * Intercepts calls to execv and sends what is executed to syslog

Monitoring with alarming of ~12 security-sensitive files per computer
 * File permissions

== Development Process and Tools ==

=== End of current sprint ===

See [https://trac.lal.in2p3.fr/Quattor/wiki/Development/Scrum/Sprint-2011-01 wiki].

=== Maven-based Build Tools - C. Loomis ===

Tools available
 * Build plugins, configuration for full configuration module
 * Archetype to build config for a new configuration module
 * Creates tarballs as well as RPMs

[https://trac.lal.in2p3.fr/Quattor/wiki/Web/UsingMaven Documentation] available.

Usage status
 * pan
 * pan-templates
 * ncm-query and ncm-cdispd: prototype available

Making releases
 * "Magic" to do a release with Maven is easy: `mvn clean release:prepare release:perform`
   * Builds, tags, uploads packages, makes "site", increments version
 * But requires some specific permissions: write access to nexus repository for packages, write access to SF website for "site" information
   * Allow anyone from command line to make release?
   * Use continuous integration server (e.g. Hudson) for releases? Preferred solution... but it is an additional service (not available on SF, need to manage identities)

Continuous integration service: many more benefits than just making releases
 * Not necessarily a big deal to set up the service
   * Already have the StratusLab experience
 * Is LAPP ready to do it (previously in charge of build infrastructure)?

More discussion tomorrow? Attempt to start a Hudson server?

=== Development Process Discussion ===

Generally good feeling for those who participated
 * But not a very good feeling about EVO
   * Let's test the MS audio system for next standups

Meeting day and time: Thursday 2pm CET is not necessarily the best slot
 * Start a Doodle to find another possible slot, preferably during the morning
 * Wednesday may be a good choice, avoid Monday/Friday

Sprint duration: 1 month, smaller backlog
 * Monthly meeting to close the sprint and have the user interaction slot

Configuration module management: how to know/update what is the recommended (stable) version to use?
 * No great idea...
 * Involve manual steps/work
 * Must be sustainable...
 * To be reviewed when we have converted all configuration modules to use Maven tools and have a continuous integration server in place.

Migration from SVN to Git: is it desirable? If yes, which steps?
 * StratusLab feedback: required a bit of effort for SVN users but several benefits in the end
   * Smaller, more specific repositories: smaller number of users with write access per repository
   * External references are really external to the repository
 * At SF, all users with write access in the project have write access to Git repositories by default, but this can be restricted per repository
 * Short term steps: new projects (e.g. SINDES) in Git, panc moved from SVN to Git
 * Need a candidate with interaction between several developers: CCM-related things (ccm, ncm-cdispd, ncm-ncd, QuattorFS), AII and its plugins

== RHEL6 ==

=== CERN Experience ===

 * Path to install Perl modules has changed: fixed by Luis for all Perl modules in SVN
   * Now installed in /usr/lib/perl, which applications are not supposed to use (not used by anybody)
   * Using the standard installation place makes RPMs OS-version dependent
   * Use /opt/quattor/lib/perl: upgrade ncm-ncd to look in both places
 * `rpmt-py` broken by 2 changes in standard Python RPM bindings: fixed by Christos
 * `curl` bug breaking SINDES and potentially anything using `wget`: certificate not looked for in the right place
   * No workaround/fix yet
   * No impact known yet on Quattor itself
 * `ncm-grub` doesn't update the kernel after a kernel upgrade and the %post script in the RPM is failing
   * Christos will try to look at this
   * Also a need to add kernel architecture to grub command
 * `SELinux`: issue with rpmt-py, needs to be documented on Quattor wiki
 * LDAP configuration is now in another place with another layout (information previously held in one string is now several tokens): ncm-authconfig has to be enhanced
   * Difficult to split the existing string into SL6 tokens: site specific
   * Rework the schema to have something more appropriate to handle different platforms?
   * Luis will try to summarize the problem and possible solutions on quattor-devel
   * Not clear if others are using LDAP for authentication...
 * Time synchronization in virtual machines running on Hyper-V: need to run the new `ntpdate` service during startup to force synchronization
   * Initial time seems to be UTC...
 * ncm-modprobe: modprobe configuration files changed their location, new attribute in the schema to specify this location
   * Needs to be handled by OS configuration templates
 * ncm-symlinks: use of the latest version required

=== AUTH ===

 * A few fixes to be committed

=== MS ===

 * Problem with AII file system partitioning (caches not flushed or something like that): thinking of using standard Anaconda for this, driving it from Quattor config
   * In the past, issues with Anaconda's handling of file system partitioning
 * ncm-pam issue, about to commit the fix

=== Discussion ===

RHEL6 raises the question of the potential changes that may be required for multiple-platform support
 * MS wants to restart Solaris port
 * Some components may be valid only on one platform: how to prevent them from being used on the wrong one?
   * Bind the path in configuration to something that prevents its use

== Conclusions - Next Workshop ==

Next workshop in Strasbourg. Start a [http://www.doodle.com/mc57tvnrcwtv5f7w Doodle] to find the most appropriate date.
 * Best candidates look to be weeks 41 (10-14/10) and 42 (17-21/10)