Changes between Initial Version and Version 1 of Meetings/Workshops/20101011


Timestamp:
Oct 12, 2010, 8:03:17 PM (15 years ago)
Author:
/O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=Michel Jouvin
Comment:

--

  • Meetings/Workshops/20101011

    v1 v1  
     1= Quattor Workshop - RAL - 11-13/10/2010 =
     2[[TracNav]]
     3
     4[[TOC(inline)]]
     5
     6[http://indico.cern.ch/conferenceTimeTable.py?confId=105169#all.detailed Agenda]
     7
     8== Quattor at RAL T1 - A. Sun ==
     9
     10Started the grid with a makeshift setup based on Kickstart, Puppet... In 2006, realized that this should be re-engineered.
     11 * 500 WNs, 500 disk servers
     12
     13Main benefit of Quattor so far: huge improvement in system management efficiency.
     14 * But must not underestimate the difficulty of getting the whole team on board: mostly done now. It takes time to capture the existing knowledge in the Quattor configuration.
     15 * Experienced it during the last kernel update: full reinstallation would have been an affordable option if necessary
     16 
     17
     18== Quattor Usage Report - M. Jouvin ==
     19
     20See [http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=2&resId=0&materialId=slides&confId=105169 slides].
     21
     22
     23== Core Tools ==
     24=== ncm-filesystems and ncm-lib-blockdevices New Ideas - L. Munoz ===
     25
     26ncm-filesystems: NCM components able to build/destroy block devices
     27 * Take advantage of advanced description available in Quattor
     28 * Able to do things not possible to do with Kickstart
     29 * A few bugs:
     30   * Logical partitions cannot be grown: use LVM instead, no plan to fix
     31   * preserve_partitions sometimes not honoured: problem understood, some time required to fix it
     32 * preserve and format: required by AII but should not be available in the component
     33 * Some requests for new file system types: tmpfs, iscsi (replacement for ncm-iscsitarget), smbfs, FUSE filesystems...
     34
     35One of the problems is that ncm-filesystems also manages fstab: proposal to move this part to a specific component, ncm-fstab.
     36 * Add pseudo-file systems and network filesystems (without a block device) to fstab
     37 
     38These changes require some changes in the schema
     39 * Some validation relaxation
     40 
     41Backward compatibility
     42 * Profiles 100% compatible
     43 * ncm-filesystems will require ncm-fstab: can be handled in the component templates
     44 
     45Remark on FUSE filesystems: an alternative to fstab is to use a specific daemon for that but this will not be managed by the component.
     46
     47
     48=== SCDB Update - M. Jouvin ===
     49
     50See [http://indico.cern.ch/getFile.py/access?contribId=2&sessionId=2&resId=1&materialId=slides&confId=105169 Slides].
     51
     52
     53=== PAN Update - C. Loomis ===
     54
     55Status of v8 series
     56 * 8.4.2: last announced, the one everybody should be using
     57 * 8.4.3, 8.4.4: not yet announced
     58   * Maven integration
     59 * 8.4.5: planned soon after the workshop, last v8 version
     60   * All deprecated features will provide warnings
     61   * Add `prefix` keyword to grammar (not active, implementation in v9)
     62   
     63v9: first beta planned soon after 8.4.5
     64 * Main feature for the first release:
     65   * Removal of deprecated features
     66   * Bareword includes
     67   * New syntax of external reference: `machine:/path` instead of `//machine/path` (will allow proper support of namespaces)
     68   * Search order between .pan and .tpl reversed?
     69   
     70Spare time going way down: no time to tackle performance issues
     71 * Not foreseen in the next 6 months
     72 
     73Discussion - Wishes:
     74 * MS: signing of XML profiles
     75   * May be easier to chain a signing task with Maven, the same way it is done for gzipping profiles
     76 * Michel: auto-escaping of keys in nlist
     77   * Difficult to implement, pretty intrusive change in the compiler, risk of ambiguity
     78
     79=== QWG Update - M. Jouvin  ===   
     80 
     81See [http://indico.cern.ch/getFile.py/access?contribId=1&sessionId=5&resId=0&materialId=slides&confId=105169 Slides].
     82 
     83Scrum/Agile ideas agreed. Let's decide later how to do it
     84
     85
     86=== Quattor FS - N. William ===
     87
     88Alternative to ncm-query implemented as a FUSE file system
     89 * Written in Python (300 lines): manages access to XML profile
     90 * Currently requires Python 2.5 but should be easy to backport to 2.4
     91 
     92Can specify a default mode for accessing the profile while still restricting some parts of the namespace
     93 * Currently only explicit paths, but pattern matching could be added, for example to disable access to all 'password' attributes
     94
     95nlist/list become directories, values become files whose content is the value.
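
The mapping described above can be sketched as follows. This is only an illustration, not the QuattorFS code: the function name `flatten_profile` and the sample profile are hypothetical, Python dicts stand in for nlists, and naming list elements by their index is an assumption.

```python
def flatten_profile(node, path=""):
    """Return (directories, files) for a nested profile.

    dicts stand in for nlists and Python lists for pan lists; both become
    directories. Any other value becomes a file holding its string form.
    """
    dirs, files = [], {}
    if isinstance(node, dict):
        children = node.items()
    elif isinstance(node, list):
        # Assumption: list elements are named after their index.
        children = ((str(i), v) for i, v in enumerate(node))
    else:
        files[path] = str(node)
        return dirs, files
    dirs.append(path or "/")
    for name, child in children:
        sub_dirs, sub_files = flatten_profile(child, f"{path}/{name}")
        dirs.extend(sub_dirs)
        files.update(sub_files)
    return dirs, files


# Hypothetical profile fragment, for illustration only.
profile = {"system": {"network": {"hostname": "node01",
                                  "nameserver": ["10.0.0.1", "10.0.0.2"]}}}
dirs, files = flatten_profile(profile)
```

With this layout, `files["/system/network/hostname"]` holds `node01` and `/system/network/nameserver` is a directory with one file per list element.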
     96
     97Can use all the file commands to browse, show the differences between the configuration versions.
     98
     99Escaped values (e.g. package names): a symlink is created with the unescaped value.
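
A minimal sketch of how such a symlink name could be derived, assuming the usual pan escaping where a special character is written as `_` followed by two hex digits (e.g. `_2d` for `-`). The helper names `unescape` and `add_unescaped_link` are hypothetical and not taken from QuattorFS.

```python
import os
import re

# Assumption: escaped characters appear as "_" followed by two hex digits.
_ESCAPED = re.compile(r"_([0-9a-fA-F]{2})")


def unescape(name):
    """Turn 'kernel_2dsmp' back into 'kernel-smp'."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), name)


def add_unescaped_link(directory, escaped_name):
    """Create <directory>/<unescaped> pointing at <escaped_name>.

    Returns the link name, or None when no link is created (nothing to
    unescape, or the unescaped value contains '/' and cannot be a file name).
    """
    plain = unescape(escaped_name)
    if plain == escaped_name or "/" in plain:
        return None
    os.symlink(escaped_name, os.path.join(directory, plain))
    return plain
```

Values that unescape to a path, such as file names, cannot be exposed as a single directory entry, hence the `None` case in this sketch.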
     100
     101With `--cache` you can start quattorfs to browse profiles in arbitrary locations (e.g. SCDB `build/xml`).
     102
     103Works on Linux and Mac.
     104
     105Python code stats the NCM directory at every request to detect profile changes.
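
The stat-based check could look roughly like this. It is a sketch under assumptions: the class name `ProfileCache` and the `loader` callback are hypothetical, and comparing the directory modification time is one plausible way to detect a change, not necessarily what QuattorFS does.

```python
import os


class ProfileCache:
    """Reload the profile only when the watched directory has changed."""

    def __init__(self, directory, loader):
        self.directory = directory
        self.loader = loader      # callable taking the directory path
        self._mtime = None
        self._data = None

    def get(self):
        # One stat() call per request: cheap compared to re-parsing the XML.
        mtime = os.stat(self.directory).st_mtime_ns
        if mtime != self._mtime:
            self._data = self.loader(self.directory)
            self._mtime = mtime
        return self._data
```

A directory's modification time changes when entries are added, removed or renamed, which covers a profile file being replaced by a rename; a file rewritten in place would need a stat on the file itself.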
     106
     107=== Aquilon and other related MS developments - N. William ===
     108
     109No Aquilon changes since last workshop.
     110 * Mainly worked on a DNS schema to produce DNS configuration, including host records, based on what is in the configuration database
     111 * Aquilon soon to be committed to SF repository
     112 * A few specific requirements in particular Python 2.6, Kerberos
     113 
     114The main task still to be done is QWG templates integration
     115 * Aquilon enforces the expected namespaces
     116 * Namespaces: more directories expected by Aquilon than QWG
     117   * E.g. cpu/intel/l5520 rather than cpu/intel_l5520, rhel/5.0-x86_64 rather than rhel5.0-x86_64
     118
     119Schema extensions: need a way to allow optional extensions
     120 * Product lifecycle
     121 * Personality: what the machine is used for, OS version/arch
     122   * Related to Aquilon
     123 * Function: development, production...
     124 * Threshold, maintenance windows
     125 * Start/stop jobs: scripts that must be executed at startup but are not a service
     126 
     127Component subclassing implemented and working, including for exception handling
     128 * ncm-ncd (NCM::Component) enhanced to have a method prefix() returning the configuration path for the current component.
     129   * Replacement of "my $base" definition
     130   * Helps with support of subclassing
     131   
     132ncm-network: would like to add support for loopback aliases
     133
     134Monitoring configuration: would be good to have it embedded into the service configuration rather than done at a later stage.
     135 * More discussion required on where the template configuring the information should sit (service directory or monitoring directory)
     136 * May think about putting a meta-description of monitoring information in the service and generate the appropriate information on the fly
     137 
     138Versioned components: we need to be able to have several versions of a component installed, so that a new version can be deployed while still using the version described in the configuration
     139 * One idea would be to install components at a location that includes the version number
     140 * Still to be checked if it really solves the problem: the real problem may be to enable SPMA to deploy only a subset of the RPM changes
     141 
     142@xxx@ substitution in source code: should be reduced and probably restricted to where it doesn't break syntax checks (strings, comments)
     143
     144
     145== Security Management ==
     146
     147=== Quattor and Security - Mingchao Ma ===
     148
     149Operational security aims to maintain normal operation at a reasonable cost and effort
     150 * Prevention and response
     151 
     152Problems at many sites managed by Quattor because a lot of unnecessary stuff is installed on WNs
     153 * E.g. firefox: a lot of severe, well-known vulnerabilities with known exploits
     154 * Other examples: Samba, Xorg, KDE...
     155   * X Window was used to trigger a kernel exploit through one recent vulnerability
     156   
     157Would be better not to install something which is not needed.
     158
     159Also saw some sites claiming to have upgraded but in fact still running the old kernel.
     160
     161
     162=== Cruft Removal - J. Adams ===
     163
     164Work started after realizing that mrtg is running on WNs...
     165
     166Started to remove the not so useful stuff
     167 * pkg_del
     168 * Removal of the inclusion of some RH groups
     169 
     170But immediately got some complaints about favorite admin tools and editors missing
     171 * Also a few grid bits broken
     172 
     173Started to turn into a nightmare: decided to improve checkdeps to produce .dot files showing dependencies.
     174 * Result: 223 packages removed, 29 running processes disappeared, hundreds of MB freed in RAM and disk
     175 * Goal should be only to have 'core' group and add explicitly the required bits
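
As an illustration of the kind of .dot output mentioned above, here is a small sketch that writes a dependency map in Graphviz DOT syntax. It is not the checkdeps code: the function `to_dot` and the input format (package name mapped to the packages it requires) are assumptions.

```python
def to_dot(dependencies, name="deps"):
    """Render {package: [required packages]} as a Graphviz digraph."""
    lines = [f"digraph {name} {{"]
    for package in sorted(dependencies):
        # Declare the node so packages without dependencies still show up.
        lines.append(f'    "{package}";')
        for required in sorted(dependencies[package]):
            lines.append(f'    "{package}" -> "{required}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical dependency data, for illustration only.
dot = to_dot({"mrtg": ["perl", "gd"], "perl": []})
```

The resulting text can be rendered with the Graphviz `dot` tool to see which packages pull in which others before removing anything.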
     176 
     177
     178=== QWG Errata Framework - I. Collier ===
     179
     180See [http://indico.cern.ch/conferenceTimeTable.py?confId=105169#20101011.detailed slides].
     181
     182
     183=== GRIF Approach to Errata Deployment - G. Philippon ===
     184
     185GRIF would like to move to scheduled deployment of errata once a month.
     186 * Only SL security errata (no fastbugs)
     187 * No kernel update, to avoid complex/disruptive reboots, except if there is a critical kernel vulnerability
     188   * Kernel updates controlled by each GRIF site's admins
     189   
     190In case of a critical vulnerability, a specific out-of-schedule errata is produced.
     191
     192Deployment strategy
     193 * First deployment on a test cluster representing the most common configurations to fix main problems
     194 * Then deployment on production clusters, under control of GRIF site admins
     195 * NODE_OS_ERRATA_TEMPLATE used to force a machine to stay with its current errata level
     196   * Cumbersome to maintain, some GRIF-specific scripts to help
     197   
     198Main issues are changes in RPM version, dependencies, arch
     199 * arch-specific to noarch
     200 * RPM split into -common and -libs
     201 * RPM name change
     202 * Problem with the algorithm used to guess the most recent version: 4.6 considered more recent than 4.7
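
The notes do not say what caused the misordering. As a hedged illustration of one robust way to pick the most recent version, the sketch below compares versions segment by segment, with numeric segments compared as integers. It is a simplification and not the RPM comparison algorithm (no epoch or release handling); `version_key` is a hypothetical helper.

```python
import re


def version_key(version):
    """Split '4.10a' into comparable segments.

    Numeric segments are compared as integers and sort after alphabetic
    ones; separators such as '.' and '-' are ignored.
    """
    key = []
    for segment in re.findall(r"\d+|[A-Za-z]+", version):
        if segment.isdigit():
            key.append((1, int(segment), ""))
        else:
            key.append((0, 0, segment))
    return key


newest = max(["4.6", "4.7", "4.10"], key=version_key)
```

Comparing integers rather than strings is what makes 4.10 sort after 4.7 here; a plain string comparison would put it before 4.6.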
     203 
     204Useful companion tools
     205 * Pakiti: easy to see problems with undeployed/misdeployed errata
     206 * Nagios: specific probe to detect SPMA problems
     207 
     208
     209== Quattor Software Process ==
     210
     211=== Build Tools - C. Loomis ===
     212
     213Reasons to replace Quattor Build Tools
     214 * Broken
     215 * Incredibly complex
     216 * Linux dependent
     217 * No maintainer
     218 
     219Why Maven?
     220 * Portable, open to non-java language
     221 * Clean, standard build process
     222 * Integrated mechanisms for release management
     223 * All build information in a single file
     224 
     225First tests done with a few components and looked reasonably easy
     226
     227Roadmap
     228 1. Components: may need some adjustments in configuration
     229 1. Primary tools: in particular NCM related stuff
     230 1. Other tools
     231
     232Updated build for configuration components exists and is ready to be applied to all components
     233 * Lots of changes required but most of them can be automated
     234 * Current features
     235   * Only build functionality, no tagging, code update...
     236   * Creates RPMs for Quattor clients but only with the Perl modules
     237   * Tarball with pan configuration and documentation
     238 * Configuration component archetype
     239   * Example component with simple command
     240   
     241Basically 3 commands:
     242 * mvn clean: clean up workspace
     243 * mvn install: build and store locally
     244 * mvn deploy: build and deploy
     245 
     246Build features:
     247 * Substitute values in source files but no need for specific extensions
     248 * Checks pan language syntax
     249 * Create RPM and tarballs
     250 
     251Still to be done
     252 * Script for automated conversion
     253 * Finish documentation on website
     254 * Apply to all existing components
     255 * Verify conversion/update of components
     256 * Determine integration with continuous integration
     257 
     258Maven requirements
     259 * Java + Maven core (jar file)
     260 * Or Eclipse plugin
     261 
     262=== Discussion ===
     263
     264Components
     265 * Review which ones are needed and/or actively maintained
     266   * Maven migration may be an occasion to contact official maintainers, discuss problems during monthly meetings
     267 * Ensure there is a reference person for every critical/important component
     268 
     269Quattor releases: not a big deal for people already using Quattor but important for new users
     270 * Maven should help as it allows to easily maintain a list of what is considered production
     271   * Component maintainers will keep the control/responsibility of updating the list
     272   
     273SVN repository usage
     274 * Encourage developers to use Git as their SVN client to commit their work and restrict trunk to reasonably good stuff
     275   * trunk should be able to build successfully at every revision
     276   * May be enforced by rebuilding trunk every night
     277 * tag management: delay decision until we have some experience with Maven and possible workflow for tagging new versions
     278 
     279 
     280== Site Experiences ==
     281
     282=== NIKHEF Migration to QWG - R. Starink ===
     283
     284Historically, NIKHEF started with pre-QWG templates and tried to remain as compatible as possible, but with a lot of duplicated effort
     285 * Some specific requirements, in particular wants to stay with YAIM
     286 
     287Approach for migration
     288 * Use 1 SCDB with duplicate namespaces
     289 * Migration per cluster
     290 * Focus on host layout and services: generic hosts, non-gLite hosts, gLite hosts
     291 * Hit dirty workarounds
     292 
     293First impressions
     294 1. Lots of old stuff in our CDB: clean up before migrating
     295 1. gLite everywhere in QWG examples: in fact missed core machine type
     296 1. Various services not in QWG: authconfig, audit, psacct...
     297 
     298Decided to keep the current node layout/machine types
     299
     300Kernel errata management: intentions ok but implementation/documentation not consistent across all OS versions
     301
     302File system configuration: AII magic too complicated or risky
     303
     304Node cloning: looks too complicated in QWG
     305
     306User management via central LDAP
     307
     308Hardware description: machine classes rather than individual machines
     309
     310Migration done after 4 weeks without disrupting the site and without complaint from other admins.
     311 * Too early to say about real benefits
     312 * Now in a position to contribute to QWG but need to better understand how specific a contribution must be
     313   * In particular may contribute to better consistency in QWG templates
     314 * "One size fits all" won't work... but not a reason not to collaborate
     315 
     316=== SINDES - J. Dudziek ===
     317
     318SINDES main purposes:
     319 * CA : manage the certificates, confirm identities, create/revoke certificates
     320   * Generated certificate intended to be used only for securing communication: different from the service certificates
     321   * Notion of time windows during which a given client can request a new certificate
     322 * Storage centre for secret files, passwords...
     323   * Deliver them in a secure way
     324 
     325Based on Apache, openssl, mod_rewrite
     326
     327Currently in use at CERN and serving 8000 hosts
     328 * Several applications relying on it
     329 
     330Weaknesses
     331 * No feature to delete files
     332 * Only 2 target types: host and cluster. Subclusters needed
     333 * No easy way to move a machine from a cluster to another one
     334 * No possibility to view files
     335 * No file versioning
     336 
     337Possibility for improvements
     338 * Enhance the current implementation
     339 * Provide the same features based on an existing product, e.g. wallet
     340 * Manpower available: 1 year of technical student
     341 
     342 
     343=== SINDES at RAL - J. Adams ===
     344
     345Background: desire to put passwords in templates but plain http serving not very appropriate
     346 * Anyone can access a machine profile
     347 * Every node can access another node profile
     348 
     349SINDES used only to deliver the certificates
     350 * File store has been disabled
     351 * Information that could be in the file store is put in the profile, then transferred securely
     352 * Integration with AII through hooks (from BEGrid)
     353 
     354Used to secure transmission of profiles but doesn't secure template files in the repository
     355 * Assumption that users accessing the repository can be trusted
     356 
     357Problems
     358 * SINDES version used was for SLC4 only: required a lot of effort to port to SL5
     359   * In fact CERN has an SL5 version... but no official distribution point
     360 * Documentation: the most useful is BEGrid documentation
     361 
     362Question: would CERN accept that we import it to quattor.org for easier use by Quattor sites and better integration with Quattor?
     363 * May become a standard component of a Quattor server
     364
     365
     366=== BEGrid Experience and Questions - D. Durvaux / S. Rugovac ===
     367
     368BEGrid: several partners around BELNET
     369 * Need to reengineer current SCDB structure: looking for some input
     370 
     371Current configuration based on 2-tier infrastructure
     372 * 1 central national server running SCDB
     373 * 1 site server per site which is a SCDB client (doing a checkout): no ability to commit
     374 * `runcheck` script on each site server doing replacement in central configuration for site-specific parts and handling/triggering the deployment for the site
     375 
     376Problems:
     377 * Quattor out-of-sync with the community: way to use it, QWG templates, OS/errata
     378 * BELNET not yet skilled enough with Quattor to take over the coordination responsibility: still relying on IIHE team
     379 * SINDES support
     380 * dCache support: no other Quattor site using it? Relying on very old templates
     381 
     382Possible solutions envisioned
     383 * Refactoring of SVN structure with SVN externals
     384 * SWrep replacement: is http-based repository suitable?
     385 
     386Discussion
     387 * Need to clarify the workflow
     388   * In particular how much control the central team has over the sites: ability to trigger deployment...
     389   * Centrally-triggered deployment requires nothing more than ability to write the site tags/ branch
     390 * 1 specific branch per site (with its trunk/, tags/ structure) and an svn:externals reference to the central server
     391   * An option is one SVN server per site
     392   * The external reference may or may not be pinned to a fixed revision
     393
     394
     395== Monitoring Support ==
     396
     397=== Future Changes in QWG Nagios Templates - R. Starink ===
     398
     399Problems found in standard templates by NIKHEF
     400 * `monitoring/nagios/config`
     401   * Location of some RPM includes (minor)
     402   * Host list derived from HW database: a problem with a master/slave Nagios config, one function needlessly complicated
     403 * `monitoring/nagios/command`:  huge list of command definitions
     404   * Should be broken into a common part and a site-specific part
     405 * Some configuration variables have meaningless defaults rather than triggering an error if undefined
     406 * pnp4nagios configuration missing: general setup easy to add, keep it optional
     407   * Currently requires modification (1 line) of service.tpl
     408   
     409Hierarchy of Nagios servers: slaves collect the information, master runs the web interface
     410 * Configuration of host list is different on slave and master: a problem for the current automatic determination of host list in QWG templates
     411 * Current NIKHEF implementation based on hosts and services groups
     412   * Grouping done by cluster
     413   * Master is configured with everything
     414   
     415Grid monitoring based on EGEE/EGI work
     416 * Using YAIM
     417 
     418Discussion
     419 * Build an RPM for probes related to monitoring Quattor activity
     420   * Probe sources should be put and built in the SF repository as any usual Quattor component
     421
     422== Build Awareness (and knowledge) of Quattor - D. O'Callaghan ==
     423
     424We need to promote Quattor to:
     425 * have more users
     426 * have more contributors
     427 * to improve Quattor
     428 
     429This requires making Quattor easier to discover and the community easier to join.
     430
     431Functionality / complexity must be taken into account
     432 * Quattor's narrow OS support counts against it
     433 * Pan language is powerful
     434 * Complexity of creating a new configuration component and bringing it to the community
     435 
     436User-facing website: content is good but several problems
     437 * Server certificate check and user-certificate check
     438   * Server certificate should be from a well-known CA
     439   * Should not require a user certificate
     440 * Search results on Trac are not focused enough, e.g. `tutorial`. Should not return results about Trac documentation
     441 * Too much hierarchy exposed on the first page
     442 * `quattor.org` doesn't appear in Google
     443   
     444Aims for user website
     445 * Create a user landing page outside Trac
     446 * Resolve the certificate-related security issues
     447 * Better internal search results
     448 * Better external search engine results
     449   * Links on sites "powered by Quattor"
     450   
     451Worth looking at how the competition does things: Puppet, Chef
     452 * Document differences
     453 * Document how Quattor can be used in these other environments
     454   * In particular for OS configuration and initial installation
     455   
     456Market Pan outside the Quattor community and, separately, the other parts of the Quattor framework
     457
     458Need to check Quattor description on some well-known places: Wikipedia, freshmeat, Ohloh...
     459 * Fix Wikipedia based on LISA paper introduction
     460
     461David agrees to spend some effort on the new landing page before the next monthly meeting.
     462 * Landing page to be hosted on SF website
     463 * Try to define a stylesheet that could be reused by other pages, in particular those generated by Maven
     464 
     465== Virtualisation Support - Cal ==
     466
     467Quattor survey showed 80% of sites use some virtualization
     468 * Easy integration as PXE booting supported by all hypervisors
     469 * S. Childs provided configuration component for Xen with some support in QWG
     470 
     471StratusLab contributions
     472 * 2 new configuration components: source currently in StratusLab repository and will remain there for the duration of the project
     473   * ncm-libvirtd: configure and control libvirt
     474   * ncm-oned: configure and control OpenNebula
     475 * Quattor configuration of a cloud
     476   * Configures OpenNebula, incl. NFS mounts
     477   * Configures private and public network bridges
     478   * HTTPS proxy for OpenNebula XMLRPC server (which is plain http)
     479   * Ganglia for rudimentary monitoring of cloud infrastructure
     480 * 2 manual configuration steps required
     481   * Definition/addition of hosts in the cloud: in the future may be done by a cron/service with SINDES
     482   * NFS mounts verification: currently static mounts, should be solved using autofs
     483   
     484Contextualization
     485 * Networking done by DHCP
     486 * Files and parameters passed through a disk (ISO image) mounted on /dev/hdc
     487   * Provide a nice way of handling credentials: the disk is not exposed to other VMs
     488 * Initialization script on disk run through rc.local
     489 
     490Virtualization provides a cleaner implementation of profile cloning for WNs...
     491 * One reference WN compilation, without node specific information
     492 * Node specific information passed in through contextualization
     493 * One profile per group of machines
     494 * Compilation time scales with number of groups
     495 
     496... but relies on an external tool (VM manager) to handle the deployment
     497 * Requires a mechanism to specify state of the fabric
     498 * Still requires profile cloning for efficient management of hypervisor machines
     499 
     500Quattor is a good appliance generator, offering bookkeeping to track VM configuration, save state information and regenerate images if necessary
     501 * Would like to be able to interface with virt-install for automated image generation and deployment
     502 * Implementation could be a service listening for profile changes and running virt-install
     503 * Deployment may use AII hooks
     504 
     505StratusLab developments will be publicly available on Nov. 2nd.
     506
     507
     508== Actions ==
     509
     510QuattorFS
     511 * Backport to Python 2.4 if possible/easy (James?)
     512 
     513SINDES
     514 * Check with CERN whether they agree to import it into the SF repository (Véronique)
     515 * Check licensing (Véronique)
     516 * Integration into Quattor server configuration (RAL?)
     517 
     518Web site
     519 * Landing page for quattor.org on SF (David)
     520   * Develop a stylesheet that could be reused by pages generated from Maven
     521 * Fix Trac server certificate CA (Michel)
     522 * Remove Trac request for a user certificate for anonymous access (Michel)
     523 * Enable Trac indexing (Michel)
     524 * Fix navigation menu behaviour: discuss by email what we want to implement, then implement it (Michel)
     525
     526QWG
     527 * Organize initial development meeting based on today's proposal (Michel)
     528 * Discuss namespace improvements as suggested by Nic (email discussion + decision at monthly meeting)
     529   * standard/hardware/... including a vendor directory
     530   * vendor/version-arch or vendor/version/arch for OS templates
     531   
     532Monitoring
     533 * Implement NIKHEF suggestions to add flexibility and support hierarchy of Nagios servers (Ronald?)
     534 * Collect existing Nagios probes related to Quattor activity monitoring, put them in SF and package them as a RPM
     535 
     536Documentation
     537 * Better integration of former MediaWiki content into existing section, remove duplicates
     538 * Update SINDES related documentation, improve based on BEGrid wiki and RAL experience
     539 * Clarify or add missing material to answer Ronald's questions after his QWG migration experience
     540 * Implement changes based on Andrea's review
     541 * Fix/improve Quattor description in Wikipedia and Freshmeat
     542 * Ensure Quattor is referenced on the appropriate open-source or software project portals
     543
     544== Wrap-up ==
     545 
     546