Changes between Initial Version and Version 1 of Meetings/Workshops/20110316


Timestamp:
Mar 16, 2011, 6:50:41 PM (13 years ago)
Author:
/O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=Michel Jouvin

     1= Quattor Workshop - CERN - 16-18/3/2011 =
     2[[TracNav]]
     3
     4[[TOC(inline)]]
     5
     6
     7[http://indico.cern.ch/conferenceTimeTable.py?confId=118603#all.detailed Agenda].
     8
     9== Quattor Status Report - M. Jouvin ==
     10
     11See [http://indico.cern.ch/getFile.py/access?contribId=16&sessionId=13&resId=0&materialId=slides&confId=118603 Slides].
     12
     13== Core Tools ==
     14
     15=== Quattor Configuration Modules - L. Munoz Meijias ===
     16
     17Recent addition
     18 * `ncm-ofed` (from Stijn): Infiniband configuration, very preliminary
     19 
     20Obsolete/suspect modules
     21 * `ncm-http`: used only at CERN
     22 * `ncm-tomcat`: used only at CERN
     23 * `ncm-tomcat2`: who uses it?
     24 * `ncm-rproxy`: replaced by `ncm-squid`?
     25 * `ncm-smartd` and `ncm-smartd5`: same schema, what is the difference?
     26 
     27Component quality: testing is tedious, old bugs often come back...
     28 * CAF is a critical tool to help with component reliability
     29   * Can run commands, update/write files with traceability and a real test suite
     30 * LC status: still maintained, private copy in Quattor
     31   * LC::File should not be used directly, CAF::FileWriter and CAF::FileEditor instead
     32   * LC::Process no longer runs commands through a subshell: the new version will allow splitting the command on whitespace
     33   
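The LC::Process change noted above (no subshell, command split on whitespace) can be illustrated with a minimal Python analogy. This is a sketch of the general technique only, not the LC or CAF API; the function names `split_command` and `run_without_shell` are hypothetical.

```python
import subprocess


def split_command(command: str) -> list[str]:
    """Split a command line on whitespace only; no shell parsing is involved."""
    return command.split()


def run_without_shell(command: str) -> str:
    """Run the command directly (no subshell) and return its standard output."""
    argv = split_command(command)
    # With an argv list and no shell, metacharacters such as $VAR, * or |
    # are passed to the program as literal arguments.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

One consequence of this design is that a component relying on variable expansion, globbing or pipes in its command string stops working and has to do that work explicitly.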
     34`ncm-accounts`: rewritten twice in the last 2 years; the hope is that v6 will fix all problems for good
     35 * Very fast with large number of accounts
     36 * It correctly handles creation of users' home directories
     37 
     38`ncm-network`: lots of complaints, ugly code, needs rewriting; must review what the component should really do
     39 * Define IP config for network cards
     40 * Define bridge devices: required for virtualization support
     41 * Tuning of device parameters
     42 * Configuration of channel bonding: kernel module manipulations should be done with ncm-modprobe
     43 * Must work on diskless servers: no more restart of the network just for testing
     44 * TCP-parameters should be controlled by ncm-sysctl
     45 * IPv6 support: schema extension needed
     46 * Loic may try to look at it
     47 
     48`ncm-sudo`: CERN needs to be able to manage sudoers.ldap
     49
     50`ncm-useraccess`: proposal to remove PAM-based ACL support
     51 * Mistake to have put it here, use `ncm-pam` instead
     52 
     53`ncm-authconfig` rewritten with LDAP support
     54
     55`ncm-modprobe`: support for SL4 to SL6
     56 * SL3 support removed
     57 
     58`ncm-spma` v2: new incompatible schema, 2-phase upgrade needed
     59
     60`ncm-chkconfig`: problem with removal of ability to specify both `on` and `off`
     61 * Need ability to say "set service on at some runlevels (e.g. 4 and 5) but off at the other levels"
     62
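The mixed on/off request for `ncm-chkconfig` can be sketched as follows. This is an illustration under assumptions, not the component's schema or code: the helper names are hypothetical, runlevels 4 and 5 are only an example, and the only external fact relied on is the standard `chkconfig --level <levels> <service> on|off` syntax.

```python
ALL_RUNLEVELS = (0, 1, 2, 3, 4, 5, 6)


def runlevel_states(on_levels):
    """Return {runlevel: 'on' | 'off'}: on at the listed levels, off elsewhere."""
    unknown = set(on_levels) - set(ALL_RUNLEVELS)
    if unknown:
        raise ValueError(f"unknown runlevels: {sorted(unknown)}")
    return {lvl: ("on" if lvl in on_levels else "off") for lvl in ALL_RUNLEVELS}


def chkconfig_args(service, on_levels):
    """Build the chkconfig invocations (as argv lists) for a mixed on/off state."""
    states = runlevel_states(on_levels)
    on = "".join(str(lvl) for lvl, state in states.items() if state == "on")
    off = "".join(str(lvl) for lvl, state in states.items() if state == "off")
    commands = []
    if on:
        commands.append(["chkconfig", "--level", on, service, "on"])
    if off:
        commands.append(["chkconfig", "--level", off, service, "off"])
    return commands
```

For example, `chkconfig_args("sshd", {4, 5})` yields one command turning the service on at levels 4 and 5 and one turning it off at all other levels.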
     63CCM: no changes since Feb. 2009, MS patches still not there?
     64 * Still no support for local profiles required for easier testing
     65
     66`ncm-ncd`: version 1.3 by MS allows subclassing
     67 * Some components won't work with 1.2 in the future
     68 * Should try to enable `-t` in Perl to get warnings about dangerous components
     69 * `cdp-listend` and `ncm-cdispd` should drop privileges when not needed
     70 
     71RHEL6 support: another platform to maintain... introduces Perl 5.12
     72 * Major language changes but backward compatible
     73 * Use stricter modes in Perl 5.12 to identify improvements needed to existing components
     74 * Don't jump too quickly into new, incompatible features
     75 * Core libraries (CAF, LC) are not affected by the changes
     76 
     77Some actions needed
     78 * Identify obsolete components and move them to another location (delete them?)
     79 * Identify and fix components using backquotes to spawn commands instead of CAF::Process
     80 * Do a Quattor component release? or official/recommended list of components?
     81   * Can the new build tools help with this?
     82   * What process?
     83 
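Finding components that spawn commands with backquotes could start from a simple source scan. The sketch below is a rough heuristic written in Python for illustration; the function name and regular expressions are assumptions, and a real audit would need a proper Perl parser (the crude `#` comment stripping, for instance, misreads constructs such as `$#array`).

```python
import re

# Heuristic patterns: a backquoted string, or the qx operator with a common delimiter.
BACKTICK_RE = re.compile(r"`[^`\n]+`")
QX_RE = re.compile(r"\bqx\s*[\(\{\[/<!|]")


def find_backquote_calls(perl_source):
    """Return (line_number, line) pairs that appear to spawn a command via backquotes or qx."""
    hits = []
    for number, line in enumerate(perl_source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # crude comment stripping
        if BACKTICK_RE.search(code) or QX_RE.search(code):
            hits.append((number, line.strip()))
    return hits
```

Each hit is a candidate for conversion to CAF::Process and still needs manual review.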
     84Luis will no longer be able to (significantly) contribute as he's leaving CERN...
     85
     86
     87=== Aquilon and other MS Developments Status - N. Williams ===
     88
     89Main developments: integration with ESX through QRD
     90
     91Profile signature: decided it was not the way to go; profiles are just encrypted over the wire using the Kerberos host keytab
     92
     93ncm-network: MS would like to be involved in rewrite, ready to contribute
     94
     95ncm-modprobe: difficulties with the latest version because of the removal of shell expansion
     96
     97RHEL6: found various issues, some of them not identified at CERN (in particular AII-related ones)
     98 * Goal: support ready in a few months, fixing problems as they are discovered
     99
     100Aquilon: will add support for "clusters" in a way similar to SCDB to help with use of QWG templates.
     101
     102Will start looking at supporting Solaris 11 in the future.
     103 * Probably not before autumn
     104
     105May also look at feeding their internal Windows management system from Quattor.
     106
     107
     108== QWG Templates ==
     109
     110=== QWG Templates Update - M. Jouvin ===
     111
     112See slides.
     113 
     114=== Nagios Templates - R. Starink ===
     115
     116Almost done
     117 * monitoring/nagios/config: server configuration
     118   * RPM includes
     119   * Apache configuration
     120   
     121Difference between generic and local configuration: most work done locally, still need to be merged
     122
     123Pnp4Nagios: service tpl needs site modifications
     124
     125Hierarchy of Nagios servers: optional
     126
     127Grid monitoring: based on EGEE/EGI work
     128 * Uses YAIM, not easy to integrate into existing QWG templates. More discussion required.
     129   * Replace ncm-yaim by ncm-filecopy?
     130 * Not generic
     131 
     132Summary
     133 * No visible progress yet but some progress behind the scenes
     134 * Merge into SVN requires time and dedication: would like to test results
     135 * Grid monitoring to be done after completion of base work
     136 
     137
     138=== LEMON - I. Fedorko ===
     139
     140Exception sensor: runs on the monitored node to report abnormal conditions (based on local conditions and metrics collected by other sensors on the node) and optionally run a corrective action.
     141
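The exception sensor logic described above (evaluate conditions on locally collected metrics, optionally run a corrective action) can be sketched generically. This is not LEMON code; the rule structure, metric names and function name are hypothetical.

```python
def check_exceptions(metrics, rules):
    """Evaluate each rule against locally collected metrics.

    rules: list of dicts with 'name', 'condition' (a callable taking the
    metrics dict) and an optional 'action' callable run when the condition holds.
    Returns the names of the exceptions that were raised.
    """
    raised = []
    for rule in rules:
        if rule["condition"](metrics):
            raised.append(rule["name"])
            action = rule.get("action")
            if action is not None:
                action()  # optional corrective action
    return raised


# Example with made-up metric names and thresholds.
actions_run = []
example_rules = [
    {
        "name": "tmp_full",
        "condition": lambda m: m["tmp_used_pct"] > 95,
        "action": lambda: actions_run.append("clean_tmp"),
    },
    {"name": "high_load", "condition": lambda m: m["load1"] > 50},
]
example_raised = check_exceptions({"tmp_used_pct": 97, "load1": 3.2}, example_rules)
```

Here only the first rule fires, so its corrective action runs while the second rule stays silent.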
     142LEMON usage: CERN (several instances at CC and in online farms), CNAF, MS
     143
     144Backends
     145 * Supported: Oracle and flat-files
     146   * No alarming possible with flat-files: rely on PL/SQL (available for most DBs)
     147 * It should be possible to add Postgres/MySQL support, as LEMON is internally DB-vendor independent
     148 
     149LEMON usage at CERN: 11K entities (8K machines), 5+ Knodes with 150-200 metrics
     150 * Performance monitoring
     151 * Application monitoring
     152 * Infrastructure metrics: temperature, power...
     153 * 5 core sensors covering 60% of measurements
     154 * 1 LEMON server (+1 for historical data)
     155
     156Quattor management of LEMON
     157 * LEMON agent configured with ncm-fmonagent
     158 * LEMON client configuration files signed with SINDES
     159 
     160Current activities
     161 * Federation of instances (worked with MS)
     162 * Virtual machines monitoring: Quattor used to generate images (golden nodes) leading to several VM instances sharing the same SINDES certificate
     163   * Also customized sensors for hypervisors
     164 * New LEMON web for better scalability and improved flexibility (more dynamic operations)
     165
     166Future
     167 * Consolidation of monitoring activities at CERNIT, including experiment support
     168 * New DB schema for increased scalability
     169 * DB data export for data mining
     170 * Exception classification for alarming
     171 * Remote instances (monitoring of remote T0)
     172   * Probably integrating with Nagios for collection
     173 * Interfacing to Nagios: interested in ideas/experiences
     174 * Interfacing with Windows monitoring
     175 
     176LEMON development team at CERN: 1 to 1.5 FTE
     177 * Enough for maintenance for CERN usage
     178 * Not enough to do major reengineering: a problem as the use cases have changed
     179   * Consolidating monitoring activities (and manpower) is one direction
     180   * Interacting with other tools is the other direction
     181   
     182   
     183== CMS Experience with Quattor - J. A. Coarasa Perez ==
     184
     185CMS online computing infrastructure requirements
     186 * Complete autonomy from CERN IT and network
     187   * Myrinet core network
     188 * Redundant services (2 different computing rooms 200 m apart), managed remotely from a central control room
     189 * Fast configuration turnaround
     190 * Ability to read from electronics at 100 kHz (100 GB/s)
     191 * Computing power to do data selection (2 GB/s output): ~2500 cores
     192 * Enough disk storage for 2 days:  ~300 TB
     193 * Spare computers for quick replacement in case of failure (~200)
     194 
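The figures in the requirements list can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and a sustained peak output rate, neither of which is stated in the notes.

```python
# Readout: 100 kHz event rate at 100 GB/s implies the average event size.
readout_rate_hz = 100e3
readout_bandwidth_bytes = 100e9
event_size_bytes = readout_bandwidth_bytes / readout_rate_hz  # about 1 MB per event

# Storage: 2 GB/s of selected data kept for 2 days.
output_bandwidth_bytes = 2e9
two_days_seconds = 2 * 24 * 3600
storage_tb = output_bandwidth_bytes * two_days_seconds / 1e12  # about 346 TB
```

At sustained peak output the buffer would need about 346 TB, the same order of magnitude as the quoted ~300 TB, which suggests the quoted figure assumes an average output rate somewhat below 2 GB/s.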
     195Limited manpower: ~5 FTEs
     196
     197Quattor is the core configuration management tool
     198 * Computers installed through Quattor but not using AII
     199 * Based on a (very) old Quattor version: 1.3
     200   * panc 6
     201   * RPM repositories hosted on NetApp filers with 6 SWrep servers
     202 * Performance: 10 min to compile all 2500 machines
     203   * Time to deploy changes:  ~10 min (cdp notification used)
     204   * Reinstallation of 1K machines in 1 h 15 min
     205   * Problem when SWrep servers were overloaded: spma timeout, no retry
     206
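The missing retry on timeout mentioned for overloaded SWrep servers could be handled with a retry-and-backoff wrapper. The sketch below shows the general technique only; it is not spma code, and the function name, parameters and use of `TimeoutError` are assumptions.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on TimeoutError retry with exponential backoff.

    The last failure is re-raised once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x ... base_delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Spacing the retries out also avoids having thousands of clients hit an already overloaded server again at the same moment; adding random jitter would help further.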
     207Well defined template structure/layout (CMS specific)
     208   
     209Some CMS-specific developments
     210 * Dropbox for RPMs to allow people without Quattor expertise to update their SW in a traceable/reproducible way
     211 * Buggy components requiring workarounds (in particular for configuration removal)
     212 * Copy of configuration files to individual computers: copyd
     213 * Network configuration derived from DNS
     214 * RPMs used to create users
     215 * Web frontend to CDB configuration: template summarizer (parsing part of the profiles according to CMS way of configuring systems)
     216 * Software Update Tools: allow updating an existing RPM on a group of computers (dropbox)
     217   * Upgrade or downgrade
     218   * Allows changes to be rolled back
     219   * Allows checking the status of the update
     220   * Perform checks on the RPM: already exists, not named properly...
     221   * Cannot be used to add a new RPM to the configuration: need to request addition of a new RPM
     222   * Only one person can run it at any time
     223 * Ability to run commands on a set of computers selected based on the use of a given template (like a type)
     224 
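Selecting computers by the template they use, then running a command on them, can be sketched as below. This is an illustration only, not the CMS tool: the profile mapping, host and template names, and the use of `ssh` as transport are all assumptions.

```python
def hosts_using_template(profiles, template):
    """Return the sorted hosts whose profile includes the given template."""
    return sorted(host for host, templates in profiles.items() if template in templates)


def build_remote_commands(hosts, command):
    """Build one argv list per host; ssh is an assumed transport."""
    return [["ssh", host, command] for host in hosts]


# Hypothetical profiles: host name -> set of templates it includes.
example_profiles = {
    "node-002": {"type/filter_unit", "os/slc5"},
    "node-001": {"type/filter_unit", "os/slc5"},
    "store-001": {"type/storage", "os/slc5"},
}
example_hosts = hosts_using_template(example_profiles, "type/filter_unit")
example_commands = build_remote_commands(example_hosts, "uptime")
```

Treating a template as a type keeps the selection tied to the configuration database rather than to a separately maintained host list.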
     225Conclusions
     226 * Quattor has scaled, and still scales, to CMS needs
     227 * Not always as flexible as we'd like...
     228 
     229