Version 1 (modified by /O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=Michel Jouvin, 13 years ago)

--

Quattor Workshop - CERN - 16-18/3/2011

Agenda.

Quattor Status Report - M. Jouvin

See Slides.

Core Tools

Quattor Configuration Modules - L. Munoz Mejias

Recent addition

  • ncm-ofed (from Stijn): InfiniBand configuration, very preliminary

Obsolete/suspect modules

  • ncm-http: used only at CERN
  • ncm-tomcat: used only at CERN
  • ncm-tomcat2: who uses it?
  • ncm-rproxy: replaced by ncm-squid?
  • ncm-smartd and ncm-smartd5: same schema, what is the difference?

Component quality: testing is tedious, old bugs often come back...

  • CAF is a critical tool to help with component reliability
    • Can run commands, update/write files with traceability and a real test suite
  • LC status: still maintained, private copy in Quattor
    • LC::File should not be used directly: use CAF::FileWriter and CAF::FileEditor instead
    • LC::Process no longer runs commands through a subshell (no shell expansion): the new version will allow splitting the command on whitespace
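
The difference between running a command through a subshell and splitting it on whitespace can be sketched as follows. This is only a Python analogy, not LC::Process or CAF::Process code: the helper name `run_split` is hypothetical, and a POSIX `echo` binary is assumed to be available.

```python
import subprocess


def run_split(command: str) -> str:
    """Split the command on whitespace and execute it without a shell."""
    argv = command.split()  # plain whitespace split: no quoting, no globbing
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.rstrip("\n")


# No subshell is involved, so "$HOME" and "*" reach echo as literal strings.
literal = run_split("echo $HOME *")
```

Code that relied on the shell for variable expansion, globbing or pipes would have to do that work itself once commands are executed this way.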

ncm-accounts: rewritten twice in the last 2 years, hopefully v6 will fix all problems for good

  • Very fast with large number of accounts
  • It correctly handles creation of users' home directories

ncm-network: lots of complaints, ugly code, needs rewriting; must review what the component should really do

  • Define IP config for network cards
  • Define bridge devices: required for virtualization support
  • Tuning of device parameters
  • Configuration of channel bonding: kernel module manipulations should be done with ncm-modprobe
  • Must work on diskless servers: no more restart of the network just for testing
  • TCP-parameters should be controlled by ncm-sysctl
  • IPv6 support: schema extension needed
  • Loic may try to look at it

ncm-sudo: CERN needs to be able to manage sudoers.ldap

ncm-useraccess: proposal to remove PAM-based ACL support

  • Mistake to have put it here, use ncm-pam instead

ncm-authconfig rewritten with LDAP support

ncm-modprobe: support for SL4 to SL6

  • SL3 support removed

ncm-spma v2: new incompatible schema, 2-phase upgrade needed

ncm-chkconfig: problem with removal of ability to specify both on and off

  • Need the ability to say "set service on at runlevels ...,4,5 but off at other levels"
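
The requirement can be illustrated with a small sketch that turns a set of "on" runlevels into the two `chkconfig --level` invocations needed to express it. This is a Python illustration, not ncm-chkconfig code: the function name, the service name `myservice` and the choice of runlevels 4 and 5 are assumptions made for the example.

```python
from typing import List

ALL_RUNLEVELS = "0123456"


def chkconfig_commands(service: str, on_levels: str) -> List[List[str]]:
    """Build chkconfig command lines: on at on_levels, off everywhere else."""
    off_levels = "".join(
        level for level in ALL_RUNLEVELS if level not in on_levels
    )
    return [
        ["chkconfig", "--level", on_levels, service, "on"],
        ["chkconfig", "--level", off_levels, service, "off"],
    ]


# Hypothetical service that must run only at runlevels 4 and 5.
commands = chkconfig_commands("myservice", "45")
```

Expressing both halves requires the schema to carry the "on" and the "off" levels at the same time, which is the ability the notes report as removed.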

CCM: no changes since Feb. 2009, MS patches still not there?

  • Still no support for local profiles required for easier testing

ncm-ncd: version 1.3 by MS allows subclassing

  • Some components won't work with 1.2 in the future
  • Should try to enable -t (taint warnings) in Perl to get warnings about dangerous components
  • cdp-listend and ncm-cdispd should drop privileges when not needed

RHEL6 support: another platform to maintain... introduces Perl 5.12

  • Major language changes but backward compatible
  • Use stricter modes in Perl 5.12 to identify improvements needed to existing components
  • Don't jump too quickly into new, incompatible features
  • Core libraries (CAF, LC) are not affected by the changes

Some actions needed

  • Identify obsolete components and move them to another location (delete them?)
  • Identify and fix components using backquotes to spawn commands instead of CAF::Process
  • Do a Quattor component release? or official/recommended list of components?
    • Can the new build tools help with this?
    • What process?

Luis will no longer be able to (significantly) contribute as he's leaving CERN...

Aquilon and other MS Developments Status - N. Williams

Main developments: integration with ESX through QRD

Profile signature: decided it was not the way to go; instead, profiles are simply encrypted over the wire using the Kerberos host keytab

ncm-network: MS would like to be involved in rewrite, ready to contribute

ncm-modprobe: difficulties with the latest version because of the removal of shell expansion

RHEL6: found various issues, some of them not identified at CERN (in particular AII-related ones)

  • Goal: support ready in a few months, fixing problems as they are discovered

Aquilon: will add support for "clusters" in a way similar to SCDB to help with use of QWG templates.

Will start looking at supporting Solaris 11 in the future.

  • Probably not before autumn

May also look at feeding their internal Windows management system from Quattor.

QWG Templates

QWG Templates Update - M. Jouvin

See slides.

Nagios Templates - R. Starink

Almost done

  • monitoring/nagios/config: server configuration
    • RPM includes
    • Apache configuration

Difference between generic and local configuration: most work done locally, still needs to be merged

PNP4Nagios: service template needs site-specific modifications

Hierarchy of Nagios servers: optional

Grid monitoring: based on EGEE/EGI work

  • Uses YAIM, not easy to integrate into existing QWG templates. More discussion required.
    • Replace ncm-yaim by ncm-filecopy?
  • Not generic

Summary

  • No visible progress yet but some progress behind the scenes
  • Merge into SVN requires time and dedication: would like to test results
  • Grid monitoring to be done after completion of base work

LEMON - I. Fedorko

Exception sensor: runs on the monitored node to report abnormal conditions (based on local conditions and metrics collected by other sensors on the node) and optionally run a corrective action.
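
The behaviour described for the exception sensor can be sketched in a few lines of Python. This is not LEMON code: the `ExceptionRule` class, the `check` function, the metric names and the thresholds are all hypothetical and only illustrate the "evaluate local conditions, optionally run a corrective action" pattern.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass
class ExceptionRule:
    name: str
    condition: Callable[[Metrics], bool]  # True means "abnormal"
    corrective_action: Optional[Callable[[], None]] = None


def check(rules: List[ExceptionRule], metrics: Metrics) -> List[str]:
    """Return the names of raised exceptions, running corrective actions."""
    raised = []
    for rule in rules:
        if rule.condition(metrics):
            raised.append(rule.name)
            if rule.corrective_action is not None:
                rule.corrective_action()
    return raised


actions_run: List[str] = []
rules = [
    # Hypothetical rule with a corrective action attached.
    ExceptionRule("tmp_full", lambda m: m["tmp_used_pct"] > 90,
                  lambda: actions_run.append("clean_tmp")),
    # Hypothetical rule that only reports.
    ExceptionRule("high_load", lambda m: m["load_avg"] > 50),
]
# Metrics as they might have been collected by other sensors on the node.
raised = check(rules, {"tmp_used_pct": 97.0, "load_avg": 3.2})
```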

LEMON usage: CERN (several instances at CC and in online farms), CNAF, MS

Backends

  • Supported: Oracle and flat-files
    • No alarming possible with flat files: alarming relies on PL/SQL (available for most DBs)
  • It should be possible to add Postgres/MySQL support, as LEMON is internally DB-vendor independent

LEMON usage at CERN: 11K entities (8K machines), 5K+ nodes with 150-200 metrics

  • Performance monitoring
  • Application monitoring
  • Infrastructure metrics: temperature, power...
  • 5 core sensors covering 60% of measurements
  • 1 LEMON server (+1 for historical data)

Quattor management of LEMON

  • LEMON agent configured with ncm-fmonagent
  • LEMON client configuration files signed with SINDES

Current activities

  • Federation of instances (worked with MS)
  • Virtual machines monitoring: Quattor used to generate images (golden nodes) leading to several VM instances sharing the same SINDES certificate
    • Also customized sensors for hypervisors
  • New LEMON web for better scalability and improved flexibility (more dynamic operations)

Future

  • Consolidation of monitoring activities at CERN IT, including experiment support
  • New DB schema for increased scalability
  • DB data export for data mining
  • Exception classification for alarming
  • Remote instances (monitoring of remote T0)
    • Probably integrating with Nagios for collection
  • Interfacing to Nagios: interested in ideas/experiences
  • Interfacing with Windows monitoring

LEMON development team at CERN: 1 to 1.5 FTE

  • Enough for maintenance for CERN usage
  • Not enough to do major reengineering: a problem as the use cases have changed
    • Consolidating monitoring activities (and manpower) is one direction
    • Interacting with other tools is the other direction

CMS Experience with Quattor - J. A. Coarasa Perez

CMS online computing infrastructure requirements

  • Complete autonomy from CERN IT and network
    • Myrinet core network
  • Redundant services (2 different computing rooms 200 m apart), managed remotely from a central control room
  • Fast configuration turnaround
  • Ability to read from electronics at 100 kHz (100 GB/s)
  • Computing power to do data selection (2 GB/s output): ~2500 cores
  • Enough disk storage for 2 days: ~300 TB
  • Spare computers for quick replacement in case of failure (~200)
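
As a quick sanity check of the figures above, the implied event size and the volume written in 2 days can be worked out directly. The sketch assumes decimal units (1 GB = 10^9 bytes) and a sustained output at the full 2 GB/s rate; neither assumption is stated in the notes.

```python
READOUT_RATE_HZ = 100 * 10**3            # 100 kHz event rate
READOUT_BYTES_PER_S = 100 * 10**9        # 100 GB/s read from the electronics
OUTPUT_BYTES_PER_S = 2 * 10**9           # 2 GB/s after data selection
BUFFER_SECONDS = 2 * 24 * 3600           # 2 days of storage

# Average event size implied by the readout rate and bandwidth.
event_size_bytes = READOUT_BYTES_PER_S // READOUT_RATE_HZ

# Volume produced in 2 days at the full output rate, in TB.
buffer_tb = OUTPUT_BYTES_PER_S * BUFFER_SECONDS / 10**12
```

Under these assumptions an event is about 1 MB, and 2 days at the full output rate amount to 345.6 TB, the same order of magnitude as the ~300 TB quoted.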

Limited manpower: ~5 FTEs

Quattor is the core configuration management tool

  • Computers installed through Quattor but not using AII
  • Based on a (very) old Quattor version: 1.3
    • panc 6
    • RPM repositories hosted on NetApp filers with 6 SWrep servers
  • Performance: 10 min to compile all 2500 machines
    • Time to deploy changes: ~10 min (cdp notification used)
    • Reinstallation of 1K machines in 1 h 15 min
    • Problem when SWrep servers were overloaded: spma timeout, no retry
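
The notes do not say how the timeout problem was or should be addressed. One common mitigation for an overloaded server is to retry with exponential backoff, sketched below in Python. This is not spma code: `fetch_with_retry`, its parameters and the simulated `flaky_download` function are hypothetical.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def fetch_with_retry(fetch: Callable[[], T], attempts: int = 4,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(), retrying on TimeoutError with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries: propagate the last timeout
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Simulated overloaded repository: times out twice, then answers.
calls: Dict[str, int] = {"n": 0}


def flaky_download() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("repository overloaded")
    return "package.rpm"


delays: List[float] = []
# The sleep function is replaced by a recorder so the example runs instantly.
result = fetch_with_retry(flaky_download, sleep=delays.append)
```

Growing delays spread the retries of many clients over time instead of hitting the overloaded servers again all at once.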

Well-defined template structure/layout (CMS specific)

Some CMS-specific developments

  • Dropbox for RPMs to allow people without Quattor expertise to update their SW in a traceable/reproducible way
  • Buggy components requiring workarounds (in particular for configuration removal)
  • Copy of configuration files to individual computers: copyd
  • Network configuration derived from DNS
  • RPMs used to create users
  • Web frontend to CDB configuration: template summarizer (parsing part of the profiles according to CMS way of configuring systems)
  • Software Update Tools: allow updating an existing RPM on a group of computers (dropbox)
    • Upgrade or downgrade
    • Allows rolling back changes
    • Allows checking the status of the update
    • Perform checks on the RPM: already exists, not named properly...
    • Cannot be used to add a new RPM to the configuration: need to request addition of a new RPM
    • Only one person can run it at any time
  • Ability to run commands on a set of computers selected based on the use of a given template (like a type)
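
The RPM checks mentioned for the software update tools (already exists, not named properly) could look like the sketch below. It is a Python illustration, not the CMS tool: it assumes the conventional `name-version-release.arch.rpm` file naming, and the function name and package names are hypothetical.

```python
import re
from typing import List, Set

# Conventional RPM file name: name-version-release.arch.rpm
RPM_NAME = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)"
    r"\.(?P<arch>[^.]+)\.rpm$"
)


def check_rpm(filename: str, repository: Set[str]) -> List[str]:
    """Return the list of problems found for an RPM dropped by a user."""
    problems = []
    if RPM_NAME.match(filename) is None:
        problems.append("not named properly")
    if filename in repository:
        problems.append("already exists")
    return problems


# Hypothetical repository content and uploads.
repository = {"cms-daq-1.0-1.x86_64.rpm"}
ok = check_rpm("cms-daq-1.1-1.x86_64.rpm", repository)
duplicate = check_rpm("cms-daq-1.0-1.x86_64.rpm", repository)
misnamed = check_rpm("cms-daq.rpm", repository)
```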

Conclusions

  • Quattor has scaled, and still scales, to CMS needs
  • Not always as flexible as we'd like...