Changes between Initial Version and Version 1 of Meetings/Workshops/20091104


Timestamp:
Nov 5, 2009, 3:53:28 PM
Author:
/O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=Michel Jouvin

     1= Quattor Workshop - Bruxelles - 4-5/11/09 =
     2[[TracNav]]
     3
     4[[TOC(inline)]]
     5
     6[http://indico.cern.ch/conferenceTimeTable.py?confId=67632 Agenda].
     7
     8== Site Reports ==
     9
     10=== GRIF - G. Philippon ===
     11
     126 subsites, 31 clusters: 750 machines
     13 * Mainly grid machines: 100 nongrid
     14 * 2 subsites managing non grid machines
     15 * ~20 persons involved in Quattor
     16 
     17Main changes:
     18 * Use of dummy WNs: /2 for full recompile (/5 on the largest cluster)
     19 * First attempts to use checkdeps for pre-deployment RPM checks: successful so far
     20 * Errata deployment with the latest features in QWG
     21 * No more recompile if the only change is the addition/removal of an RPM in a repository
     22 
     23Nagios:
     24 * Issues with QWG templates for Nagios and ncm-nagios
     25 * Added host and host group support on ncm-nagios: to be committed...
     26 
     27=== CNAF - A. Chierici ===
     28
     29Migration to SCDB done!
     30 * Smooth migration
     31 * Started with a pilot last summer
     32 * Some effort needed to convince users: mixed feedback... see dedicated presentation
     33 
     34Quatview now working: depends on SCDB...
     35
     36New templates for partitioning HDs
     37
     38Pb with Ethernet bonding but apparently not supported by standard templates
     39 * Remark: Probably a matter of documentation
     40 
     41Monitoring: still rely on Lemon for storage, also some attempts with Nagios
     42 * Lemon templates in QWG seem not to be backward compatible: reverted to the original templates
     43 * Same pb with QWG templates for Nagios but did not manage to reproduce the current configuration
     44 
     45SCDB: 1 instance, 19 clusters, 13 users
     46 * Grid services configured with YAIM
     47 * OS configured with CNAF-specific templates
     48
     49Virtualization of WNs in production, based on KVM.
     50 * All VM managed by Quattor
     51 * VMs treated as standard machines
     52 
     53=== MS - N. Williams ===
     54
     55Main change is in numbers: 2x as many nodes, now 15K managed by Quattor
     56 * Total compile time: 10mn
     57   * Now use SCDB ant interface to panc
     58   * Some tweaking of panc required, should come mainstream
     59 * 2 sets of 6 boot servers
     60 * AII performance (notify) is the pb: 3h for all nodes
     61   
     62Monitoring based on Lemon.
     63 * 2 servers
     64 
     65Pending for open-source release:
     66 * AII, CCM patches
     67 * FUSE interface to profile on clients: allows browsing the configuration, ncm-query replacement
     68   * Ready, looking for volunteers
     69   * Allows browsing a specific revision of a profile
     70 * Aquilon (already given to RAL): should be part of QUEST
     71 
     72 
     73=== AUTH - C. Tryantafyllydys ===
     74
     75New sysadmins involved with Quattor
     76
     77Merge in progress from AUTH-specific cfg/local to SCDB sites
     78 * Almost done
     79 
     80Ganglia configuration based on SCDB clusters
     81 * Use ncm-filecopy: plan to improve
     82 
     83Virtualisation with Xen and QWG templates from 1 year ago.
     84
     85New machine types:
     86 * Hydra committed to QWG
     87 * Working on core machine types
     88 
     89Quattor Asset Database: another Quattor db viewer
     90 * See presentation
     91 
     92Monitoring based on Nagios + OAT
     93
     94=== CERN - V. Lefébure ===
     95
     96Several instances based on CDB
     97 * Main instance: 7800 profiles
     98
     99CDB 2.2.0-3
     100 * New CDBChangeTracker: statistics about CDB (number of commits, ''transaction'' log message, link with incidents...)
     101 * grep-like feature for cdbop
     102 * panc 8.2.10
     103 * Compilation in 6mn
     104 
     105Virtualization based on Hyper-V
     106 * Clients are Quattor-managed
     107 * Will go to production Nov. 2009
     108 
     109lxcloud: tests with OpenNebula and PlatformVMO
     110 * Currently Xen, will look at KVM
     111 * Quattor used to manage images
     112 
     113Quattor schema extended to allow more HW types to be described in CDB
     114 * Used for HW inventory: includes HW not managed by Quattor
     115 * Not yet committed:
     116   * Remark from Michel: should be done to avoid a fork of the schema, should not be a major pb as long as these are only additions
     117 
     118Related activities
     119 * SMS: duration option added
     120 * Lemon: work with MS on federated Lemon instance support, improving the schema to address perf issues
     121 * SINDES ported to SLC5
     122
     123CLUMAN: tool to display clusters and act on selected subsets
     124 * Based on CDBSQL + Lemon
     125 * Contact: Marian.Barbik@cern.ch
     126 
     127=== TCD and Grid Ireland - D. O'Callaghan ===
     128
     1292 TCD sites + 17 Irish sites
     130 * + 3 test sites
     131 * 500 machines
     132 * 18 Quattor servers
     133 * 55 virtual hosts + 190 VMs
     134 
     135SCDB
     136 * Added support for Bazaar client to support personal branches
     137 * Switch back from svn:externals to a copy for QWG templates
     138 
     139Nagios: not much work in the past year
     140
     141Starting to rename Quattor client commands (ncm-ncd, ccm-fetch, ncm-query...) as subcommands of `quattor` command
     142 * Committed and documented on http://quattor.org
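
A renaming of this kind can be pictured as a thin dispatcher that maps each subcommand onto the existing executable. The sketch below only illustrates that idea: the subcommand names (`configure`, `fetch`, `query`) and the `resolve` helper are hypothetical and are not taken from the committed implementation.

```python
# Hypothetical mapping: the subcommand names on the left are illustrative only.
# The executables on the right are the existing client commands named above.
SUBCOMMANDS = {
    "configure": "ncm-ncd",
    "fetch": "ccm-fetch",
    "query": "ncm-query",
}


def resolve(argv):
    """Translate ['<subcommand>', *args] into the legacy command line."""
    if not argv:
        raise ValueError("missing subcommand")
    name, rest = argv[0], argv[1:]
    if name not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {name}")
    # Arguments after the subcommand are passed through untouched.
    return [SUBCOMMANDS[name]] + list(rest)
```

A wrapper script could pass the returned list to `subprocess.run` so that the old commands keep working underneath.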
     143 
     144New services Shibboleth, RT, Boinc
     145 * Using filecopy
     146 
     147=== NIKHEF - R. Starink ===
     148
     149No major changes in numbers
     150 * +1 admin: now 6
     151 * +175 hosts: now 475
     152 * Using SCDB 2.3.1 + local changes
     153 * Virtualization based on Xen: 12 hosts, 48 guests
     154 
     155Monitoring based on Nagios
     156 * 1 master, 3 slaves
     157 * QWG-ish hierarchy, NCG generator for grid probes
     158 
     159Trying to move to a better compliance with QWG
     160
     161Virtualization: considering OpenNebula, KVM
     162 * BEGrid investigating WN virtualization
     163 
     164=== Philips - Serge Vrijaldenhoven ===
     165
     166No major changes: 6 clusters, 200 machines
     167 * Want to virtualize service nodes: no plan for WNs
     168 * New people involved in Quattor, Serge acting as backup: still a steep learning curve
     169 
     170Philips would like to host a workshop in the future.
     171
     172
     173=== RAL - D. Ross ===
     174
     175Started with Torque/MAUI server: needed replacement. Fairly smooth transition.
     176 * Some problems: RAL using a standalone Torque server, a few differences in Torque/MAUI config
     177 
     178WNs: push to deploy new CPUs and to deploy SL5
     179 * Moved 90% of our capacity to gLite 3.2/SL5 with Quattor
     180 * Problems: Castor client configuration missing, pbs_mom crashing on lots of nodes (version/arch inconsistency)
     181   * Also jobs dropped during WN reconfig: not clear why. Pb when executing ncm-chkconfig?
     182   
     183Current work:
     184 * UI: largely done
     185 * BDII
     186 * SinDes
     187 * Disk servers: move from Puppet being discussed
     188 * Castor service nodes: probably more tricky
     189 
     1909 persons involved with Quattor: after 6 weeks in production and 60% of T1 under Quattor control, benefits of Quattor beginning to be felt
     191 * Also some issues with VO configuration: what's done in QWG not necessarily matching what we were doing... nothing insurmountable
     192 
     193=== LAPP - E. Fede ===
     194
     1951 quattor server with SCDB 2.3.2
     196 * 150 nodes: 4mn for a full build
     197 * Quattor server running in a VM
     198 
     199Running nightly autobuild for Quattor components.
     200 * Build everything that has a 'make rpm' in its directory
     201 * Report linked in http://quattor.org
     202 * No alerting capabilities yet in case of errors
     203 * Future of this tool highly connected to QUEST
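
The discovery step of such an autobuild could be sketched as follows. This is not the LAPP tool: the function name `find_rpm_buildable` is hypothetical, and the sketch assumes that "has a 'make rpm'" means the directory holds a `Makefile` declaring an `rpm` target.

```python
import os
import re

# Assumption: a component is buildable when its Makefile declares an "rpm" target.
RPM_TARGET = re.compile(r"^rpm\s*:", re.MULTILINE)


def find_rpm_buildable(root):
    """Return the sorted list of directories under root whose Makefile has an rpm target."""
    buildable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "Makefile" not in filenames:
            continue
        makefile = os.path.join(dirpath, "Makefile")
        with open(makefile, encoding="utf-8", errors="replace") as handle:
            if RPM_TARGET.search(handle.read()):
                buildable.append(dirpath)
    return sorted(buildable)
```

A nightly job could then run `make rpm` in each returned directory and collect the exit codes for the report.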
     204 
     205 
     206== Core Components ==
     207
     208=== Pan Compiler Update - C. Loomis ===
     209
     2104 bug fix releases since last workshop for v8.2
     211 * v7 highly deprecated: nobody seems to use it anymore
     212 
     213Current work:
     214 * Entitlements/Authorization: on hold, lack of time
     215 * Internationalization: progressing, done when modifying files for other purposes
     216 * Performance enhancements from MS to be integrated
     217 * Eclipse editor: preliminary work being done at the LAL
     218 * Eclipse debugger: will be part of QUEST but some preparatory work being done
     219 
     220Planned changes:
     221 * Incompatible change to dependency file needed: full support for file_contents() and exists() dependencies only possible with the new format
     222   * Needs simultaneous release of ant task: are there other tools which could be affected?
     223   * Will be done in 8.4.0 as it seems no other tools are affected
     224 * Contemplating change from ant to maven2 for building panc
     225   * Ant build files are extremely complex and external handling not very nice, maven2 would be simpler and allow easier inclusion/upgrades of external dependencies
     226   
     227Other requests:
     228 * Ronald: get some information about the template tree processed in case of error
     229 
     230No version 9 planned right now: mainly removal of deprecated syntax
     231 * Concentrate on component cleanups to upgrade deprecated syntax
     232 
     233 
     234=== SCDB Update - M. Jouvin ===
     235
     236See [http://indico.cern.ch/getFile.py/access?contribId=1&sessionId=1&resId=1&materialId=slides&confId=67632 slides].
     237
     238=== CNAF migration to SCDB - A. Chierici ===
     239
     240Migration rather smooth, minor problems mainly due to the need to reorganize a few things
     241 * Also the problem with the buggy apr rpm distributed by GRIF during a couple of days
     242
     243Quattor server migrated, not reinstalled
     244 * SCDB installed in // with CDB: clusters migrated one after another
     245 
     246The existing Quattor front-end has been kept
     247 * In fact this is the main point of administration because of the performance of Eclipse on desktops
     248 
     249Used profile cloning: huge speed-up
     250
     251Cluster in SCDB much easier than in CDB
     252 * Easy to move a node from cluster to cluster
     253
     254Feedback from users better than expected
     255 * svn already known to many users
     256 * A wiki was prepared in advance and most people seemed to have read it
     257 * Not too much problem with the compile/deploy workflow
     258 
     259Main complaints:
     260 * ant seems to demand too many resources (on laptops mainly, connected with Eclipse)
     261 * Users don't like to recompile changes made by others after svn update
     262   * NIKHEF has a modification to remove this requirement but at the price of deployment errors
     263
     264=== QWG Templates Update - M. Jouvin ===
     265
     266See [http://indico.cern.ch/getFile.py/access?contribId=2&sessionId=1&resId=1&materialId=slides&confId=67632 slides].
     267
     268
     269=== Generic Machine Types - C. Tryantafyllydys ===
     270
     271Several machine types, like web servers, are used at many places: may benefit from a shared effort.
     272
     273Some specificities:
     274 * Do not depend on gLite: no reason to have them in these branches
     275 * Depend a lot on OS rpms
     276 
     277Based on machine-types/core: attempt to reduce the base RPM configuration to the minimum
     278
     279Several types currently worked on:
     280 * web_server: Apache + main modules, filecopy used to configure them
     281 * db_server: currently a mysql server
     282 * nfs_server: move from gLite
     283 
     284Need to be able to combine them: not really possible with machine-types, even with gLite.
     285 * Misunderstanding about machine-type: they are the combination of a base OS configuration + 1 feature
     286 * Put features in standard/features
     287 
     288Need to discuss how to enforce annotations for high-level services (features) and produce some tools to easily access this information.
     289
     290
     291== Improving Quattor's Accessibility - M. Jouvin ==
     292
     293See [http://indico.cern.ch/getFile.py/access?contribId=8&sessionId=2&resId=1&materialId=slides&confId=67632 slides].
     294
     295Proposed changes:
     296 * ''NCM Components'' renamed into ''Quattor Configuration Modules''
     297   * Need to remove ''components'': source of confusion
     298   * Could also keep NCM for ''Node Configuration Modules'' but will be confusing as NCM stands for ''Node Configuration Manager''
     299   * No change to component names themselves
     300 * ''Dummy WN'' renamed into ''profile cloning''
     301   * ''Exact node'' also renamed into ''reference node''
     302   * QWG variables will be updated at some point...
     303 * CDB: both one component in Quattor architecture and one of its implementations
     304   * Only the component in architecture can be changed
     305   * One proposal is CMDB for ''Configuration Management DB''. Need more thinking
     306   * Maybe not really an issue anymore with CERN left as the only site using CDB...
     307
     308== QUEST Proposal Status - M. Jouvin ==
     309
     310See [http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=2&resId=1&materialId=slides&confId=67632 slides].
     311
     312== Other Developments ==
     313
     314=== Quattor Asset Database - C. Tryantafyllydys ===
     315
     316AUTH needed a tool to ease everyday tasks for new sysadmins and to feed other tools with Quattor data
     317 * Eg. configure a monitoring system from Quattor
     318 
     319QAD main characteristics:
     320 * Written in Ruby
     321 * A few schema extensions: /monitoring, network "parent" (switch it depends on), rack
     322   * Need to be merged with CERN similar extensions
     323   * /monitoring is an alternative to the LEMON /monitoring
     324 * Not yet decided: access to SVN? read or write?
     325   * Currently using a post-commit hook to synchronize QAD db with SCDB deploys: impact on performance
     326   * Would like to remove this in the future and rely on XML but some high level information is currently missing: need to extend the schema
     327 * Allows changing some node characteristics: OS, IP... and redeploying
     328 * From the config db, can list nodes attached to a specific switch: rely on configured information
     329 * Can list all the VMs attached to a host (using the enclosure definition)
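
The two lookups above amount to simple filters over per-node data. Here is a minimal sketch in Python (QAD itself is written in Ruby), assuming a hypothetical flattened view in which each node is a dict with optional `parent` (the switch it depends on) and `enclosure` (the hosting machine) keys. The function names are illustrative, not part of QAD.

```python
def nodes_on_switch(nodes, switch):
    """Return the sorted names of nodes whose network parent is the given switch."""
    return sorted(name for name, info in nodes.items() if info.get("parent") == switch)


def vms_on_host(nodes, host):
    """Return the sorted names of VMs whose enclosure is the given host."""
    return sorted(name for name, info in nodes.items() if info.get("enclosure") == host)
```

Both functions rely only on configured information, so a node with a missing or wrong `parent` entry would simply not show up in the result.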
     330 
     331Need to make progress on using CDB2SQL back-end for producing/maintaining db and use it in all tools that need to build a db from the XML
     332 * Rename XML2SQL
     333   * CERN will provide an update on the status and the best version to start from
     334 * May look at document-oriented dbs, like CacheDB
     335 
     336=== Remote Configuration with Quattor - N. Williams ===
     337
     338''Disclaimer: doesn't currently work with SCDB (because of dependency over CDB notification).''
     339
     340Goal: apply Quattor benefits to boxes that cannot run Quattor client by using delegation
     341 * A Linux box will act as a delegate where a configuration module will execute appropriate configuration commands
     342 * Combination of AII and CCM/NCM/NCD
     343 
     344Current implementation:
     345 * quattor-remote-dispatcher (QRD): a tool running on Linux box and receiving CDB notification messages. It acts as a replacement for listend, cdispd, ccm-fetch, ncm-ncd
     346   * Configuration defines which parts of the configuration are listened to and which command to run
     347   * Can use different commands on different sets of nodes
     348   * Can define constraints on some part of the configuration to do different things based on some configuration state (for example state=build or production)
     349 * quattor-remote-configure: an AII equivalent allowing to produce a new configuration for the managed box and notify the remote dispatcher
     350   * Configuration of the managed device is under /software/components
     351   * The actual component to use is defined by /system/components/namespace: default is NCM::NCD:: but can be defined to something specific to a device, eg. ESX, Netapp...
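
The dispatcher rules described above (watched part of the configuration, command to run, optional constraints on configuration state) can be pictured with the sketch below. It is not the QRD implementation: the `Rule` class, the `select_commands` function, the `/system/state` path and the command names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    path_prefix: str   # part of the configuration tree being listened to
    command: list      # command to run when that part changes
    # Required values elsewhere in the profile, e.g. {"/system/state": "build"}
    constraints: dict = field(default_factory=dict)


def select_commands(rules, changed_paths, profile):
    """Return the commands of rules whose subtree changed and whose constraints hold."""
    selected = []
    for rule in rules:
        subtree = rule.path_prefix.rstrip("/") + "/"
        touched = any(p == rule.path_prefix or p.startswith(subtree) for p in changed_paths)
        satisfied = all(profile.get(k) == v for k, v in rule.constraints.items())
        if touched and satisfied:
            selected.append(rule.command)
    return selected
```

Keeping the rules as plain data makes it easy to attach different commands to different sets of nodes, which is one of the capabilities listed above.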
     352
     353Used to manage virtualization in Aquilon (for a VMware hypervisor)
     354 * VMs are not associated with actual hosts (handled for example by VMware) but with clusters
     355   * A cluster is a group of hosts running a hypervisor. There is an object template for each cluster.
     356   * Each virtual host (machine running a hypervisor) also has an object profile that allows its configuration out of the box
     357 * VMs are managed as normal machines
     358   
     359Plan to release this as soon as it is polished but will not be able to release the specific components as they are very MS-specific and sometimes rely on non-public APIs.
     360
     361=== Virtualisation - D. O'Callaghan ===
     362
     363Currently used Hypervisors configured with Quattor:
     364 * Xen: TCD, NIKHEF, CERN, AUTH
     365 * OpenVZ: UAM
     366 * VMware: MS
     367 
     368Also in use but not managed by Quattor:
     369 * KVM: CNAF
     370 * Hyper-V: CERN
     371 
     372VM cluster managers of interest: Platform/VMO, OpenNebula
     373 
     374Issues and requests:
     375 * Use libvirt, common configuration module for Linux-based hypervisors
     376   * Could look at what is done by Puppet or other config tools
     377 * Creation of images
     378 
     379QWG templates for basic virtualisation: mainly used at TCD (also in Senegal!)
     380 * Rely on ncm-xen, would be great if we had KVM support, probably using libvirt
     381 
     382=== Integration with Monitoring Systems - C. Tryantafyllydys ===
     383
     384Problem: current Nagios configuration in QWG relies on everything (in particular probes) being described in the configuration. But EGEE and OAT developed NCG to do it more dynamically on a specific node. How to take advantage of this?
     385
     386NCG: generic tool to define Nagios configuration based on context
     387 * 2 main basic modules/entry points: NCG::SiteSet and NCG::SiteInfo::
     388 * Also several internal modules to define probes to use, ... that would benefit from receiving information from Quattor
     389   * Example: configure all probes for a CE if the node is configured as a CE
     390 
     391Integration between NCG and XML doesn't scale as it is far too long. Need to pre-process data and this is done with QAD.
     392 * Currently more a proof of concept than a ready-to-use tool
     393 
     394Discussion:
     395 * Need to figure out how to use NCG to define services without defining hosts and rely on Quattor for host definitions of hosts managed by Quattor
     396 * Create a small working group with a specific mailing list: quattor-monitoring
     397 * TCD will commit their change to the generic templates: no specific change
     398 * NIKHEF will document the change they had to make to generic templates to identify additional customizations needed
     399 * CNAF: currently the storage group has its own way of configuring Nagios with filecopy, difficult to change even if convinced
     400 * RAL interested but probably not in the short term
     401 * ULB/BEGrid started to look at it
     402 * Guillaume will fix the Nagios example in QWG repository
     403 * Potential scalability pb if the Nagios server profile needs to depend on all the profiles for all the machines it monitors
     404 
     405
     406== Miscellaneous ==
     407
     408=== SourceForge Status ===
     409
     410QWG Trac migration on SF: 2 issues:
     411 * Export of DB from MySQL to SQLite
     412 * Use of Trac plugins that were not supported initially at SF
     413
     414Michel agrees to have a new look at this in December.
     415
     416When migrated, may think about migrating MediaWiki to Trac.
     417 * Nick agreeing to review MediaWiki contents after QUEST submission
     418 
     419=== Source structure ===
     420
     421Current use of .cin leading to many problems as the files are not recognized by editors and other consequences.
     422
     423MS experimented with editing the source files directly and putting the metadata separately in the RPM. Will try to document as a proof of concept.
     424
     425=== Errata deployment at MS ===
     426
     427MS is thinking of tagging repositories based on their contents (critical and non critical) and having the ability to instruct SPMA to do only the non critical upgrades outside maintenance windows.
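
The tagging idea can be illustrated with a small filter. This is only a sketch of the policy, not SPMA code: the `upgrades_allowed` function and the tag values are hypothetical, and it assumes that an untagged repository is treated as critical to stay on the safe side.

```python
def upgrades_allowed(pending, repo_tags, in_maintenance_window):
    """Filter pending (package, repository) upgrades according to repository tags.

    Inside a maintenance window everything may be upgraded. Outside it, only
    packages coming from repositories tagged "non-critical" are kept.
    Assumption: a repository without a tag is handled as critical.
    """
    if in_maintenance_window:
        return list(pending)
    return [(pkg, repo) for pkg, repo in pending if repo_tags.get(repo) == "non-critical"]
```

With this shape, the decision stays in the repository metadata and the client only needs to know whether it is currently inside a maintenance window.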
     428
     429
     430== Wrap-Up ==
     431
     432Next meeting in Greece 17-19/3
     433 * 2 days over 2 like this time
     434 * Plan 1/2 day after the meeting for an informal hands-on session
     435 * Keep site reports much shorter: 10mn by default
     436 
     437