= Quattor Workshop - DESY - 18-20/10/06 =
[[TracNav]]

[[TOC(inline)]]

Agenda: https://indico.desy.de/conferenceTimeTable.py?confId=64&showDate=all&showSession=all&detailLevel=contribution&viewMode=parallel

== Site Reports ==

=== CERN ===

Main change is an increase in the number of nodes managed by Quattor: now ~4000.

Quattor Solaris support dropped.

CERN-specific activities:
 * CDB profile authentication/encryption: not yet deployed; need to solve the issue of a host certificate expiring or appearing in a CRL, to avoid a deadlock. The idea is to use 2 URLs for downloading the configuration, with an automatic failover if the first one fails.
 * Manage Xen and VirtualPC with Quattor
 * Namespaces: urgently needed; want to agree on the namespace layout first. Will take time to implement in 20K templates
 * test, new and pro areas to be provided using load paths and ACLs
 * Integration of Quattor with SLS (Service Level Status)
 * Impact of SL5 (based on FC5/6) on ncm-components
 * SINDES maintenance: not really part of Quattor but mainly used in this context


=== DESY ===

200+ systems in production in the Grid infrastructure:
 * Local HERA experiments
 * CMS and ATLAS (DESY T2)
 * VOs that are part of EGEE

Still using Quattor 1.1:
 * CDB, AII, SWrep, SPMA
 * Still interested in SCDB but no time yet...
 * Middleware installation is done using YAIM (yaimexec?)

Current issues:
 * How to keep CA and middleware RPMs up to date? How to use YAIM and SPMA in conjunction?
 * PAN compile time increasing with the number of machines (currently 5-6 minutes for 200 nodes)
 * Time to get fixes for YAIM bugs reported in Savannah
 * Infrequent cdispd aborts: seems to be known at CERN, related to CCM


=== BEGrid ===

Central SCDB + SWrep:
 * SCDB: certificates + ACLs. Not everybody is allowed to edit everything.
 * Goal:

Current configuration:
 * 4 sites with LCG, 2 with gLite
 * Restarted from scratch (new repository) with gLite

Additions to QWG templates:
 * SE_dCache
 * Lemon server with Oracle
 * Use of IPMI for /system/hardware
 * Ganglia server and client
 * All passwords and sensitive information in one template

Problems using SLC with Quattor/QWG.

Tests done with WNs in VMware under Windows XP, managed by Quattor.

Changes to AII:
 * SINDES/AII integration: some templates needed to be changed
 * Install the kernel in ks rather than ks-post-install: easier to use alternative kernels

Work on improving bulk compilation of a large number of worker nodes:
 * Compile a dummy WN
 * In the real WNs, include the compiled profile and redo part of the configuration
 * Doesn't fit well with QWG templates; difficult to say which part to re-include and which to ignore, probably needs something inside panc


=== CNAF ===

Early adopter of Quattor: experimented with both QWG templates and YAIM.

Currently using Quattor for initial installation (AII):
 * Use HTTP repositories instead of SWrep
 * Since LCG 2.7, moved to YAIM with ncm-yaim

Pretty well accepted by farm managers:
 * Especially the storage guys

Started to implement our own namespaces:
 * Mainly to achieve machine category segmentation, in particular to control access to templates
 * Would like some discussion on the use of namespaces in standard templates

Would be useful to have all the templates needed to install a basic SL system.

More documentation on the basic Quattor components (PAN, LC libs...) would help dissemination.


=== NIKHEF ===

T1 for LHC and involved in several national projects (BIG GRID, VL-e).

2 "sites":
 * Production: ~180 nodes. Significant increase expected.
 * Installation testbed: ~15 nodes

Quattor usage: CVS + ant/panc
 * OS: CentOS 3
 * panc 5.0.6
 * Only generic components used

gLite: moved to ncm-yaim
 * Initial installation via Quattor

Issues:
 * 64-bit SW installation
 * Compiler performance


=== Philips Eindhoven ===

Part of the research division uses the grid:
 * A few test systems installed with Quattor
 * Links with NIKHEF


=== PIC ===

250 nodes running SL3 + 2 CASTOR nodes.

Use ncm-yaim:
 * Problem with the hardcoded list of variables that can be configured with ncm-yaim. Made some changes, not filed in Savannah. Problem fixed 6 months ago by Ronald.

Deploying Quattor 1.2 + SCDB.


=== Irish Grid ===

Entire Irish grid managed with Quattor:
 * 18 sites, 200 nodes: all nodes centrally managed from 1 site (Dublin)
 * 1 Quattor database
 * CVS SCDB, HTTP RPM repositories
 * 95% of the nodes are Xen VMs
 * Had our own hierarchical model: EGEE->GI->Site. Starting to move to the standard QWG layout/hierarchy soon, with gLite 3.0.2.
 * Expertise is spreading throughout the group

Moving non-Grid servers to Quattor: 64-bit, SL4, Xen...

Integration with an automatic VM creation tool for building testbeds.

Have spent a lot of time keeping up to date with changes in the QWG structure.


=== UAM ===

Quattor used for installation + configuration of 3 clusters, using 3 Quattor servers.
 * Use CDB with the latest QWG templates

Cluster UAM-LCG2:
 * Part of a distributed T2
 * 130 WNs
 * QWG LCG templates
 * Issue with update synchronization with the other sites

Cluster GVMUAM-LCG2:
 * 500 nodes, mainly PCs for student lectures
 * Used for different topics
 * Installed with QWG templates
 * Must preserve existing partitions
 * No full control of DHCP
 * Network pretty slow

Cluster WS:
 * User desktops: ~40
 * Template layout based on organization: department, group...
 * Home-made templates
 * Software for desktops
 * Components to configure desktop services like printers, X11


== Experience with ncm-yaim - R. Starink ==

QWG templates until 2.6.0: several difficulties, mainly due to the lack of genericity
 * Took 4-8 weeks to incorporate local changes into a new QWG release
 * Backward compatibility between releases
 * Complex structure

Move to YAIM with ncm-yaim, with LCG 2.7:
 * YAIM used only for configuration, installation with SPMA
 * YAIM variables created from templates
 * Activate YAIM on each machine

Setting YAIM variables in templates is very similar to writing a pro_lcg2_config_site.tpl (see the sketch below):
 * Issue with new versions of YAIM requiring variables not supported by ncm-yaim
 * The explicit list of supported variables comes from the ncm-yaim schema, which brings the advantage of validation (instead of using a plain filecopy)
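
A minimal sketch of such a template; the schema path and variable names below are assumptions, not the verified ncm-yaim schema:

{{{
# Hypothetical sketch: YAIM variables fed through the ncm-yaim schema,
# so values are validated at compile time. The path
# /software/components/yaim/conf and the variable names are assumptions.
template pro_yaim_config_site;

"/software/components/yaim/conf/SITE_NAME" = "EXAMPLE-SITE";
"/software/components/yaim/conf/CE_HOST"   = "ce.example.org";
"/software/components/yaim/conf/WN_LIST"   = "/opt/lcg/yaim/etc/wn-list.conf";
}}}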

Building the RPM list: several solutions attempted:
 * Dependencies from gLite meta-packages
 * From an APT repository

Some YAIM local functions were required:
 * Security: no shared pool accounts for SGM users
 * Central database for DPM and LFC
 * Shared gridmapdir, shared home directories, accounts in LDAP: not supported by YAIM
 * Developed nikhef-yaim-local, to be installed (via SPMA) or updated after gLite-yaim. Close contacts with the YAIM developers.

Satisfied by the change:
 * Easier to maintain, less overhead, less dependent on an external party (QWG)
 * Still some surprises with YAIM...
 * Shorter time to deploy a new release (1 week)
 * No experience yet in changing a node type without reinstalling


== RPM Dependency Hell - Stijn Weirdt ==

RPM management by Quattor involves several steps:
 * Create/update the repository with the tools in cvs/utils
 * Create the base repository contents with cvs/utils
 * Keep the repository up to date
 * Test deployment: nothing to help here

Other existing tools:
 * apt: no bi-arch support
 * yum: the best presently, but does not support all RPM options
 * smart: tries to support everything, still buggy
 * All these tools share the idea of RPM metadata stored in some database that can be accessed without touching the RPMs themselves.

SPMA: works well but has some limitations, in particular the inability to test a deployment without installing the RPMs, as there is no metadata.

RPM repositories:
 * SWrep: should have an "rsync URL" option to keep a local repository up to date.
 * OS distros: could use existing mirrors to avoid duplicating them locally. Would require some kind of metadata to produce the required local templates.
 * Initial loading of OS templates: may rely on comps.xml, or on some other metadata, to find what is needed.
 * Keeping the repository up to date: OS update metadata parsing?

RPM testing before deployment: test all the dependencies from the administrator machine before deploying.
 * Should be very fast (30-60s per machine): requires metadata
 * A fake installation is not fast enough
 * Injection of all RPMs into an rpmdb + rpm -i
 * Problems with first tests on 64-bit: yum fails for unknown reasons, rpmdb doesn't handle bi-arch correctly.


== Specific Use Cases ==

=== Diskless Systems - M. Shroeder ===

2 possible setups:
 * RAM disk: the whole system in a large file, loaded at boot time
 * NFS mount: a small image loaded from the network at boot, other filesystems mounted through NFS, mainly read-only (and shareable)

Red Hat's way:
 * PXE + NFS mount
 * Clone the server system
 * One snapshot for each client: non-shared files, writable files

Quattor usage in this context:
 * Configure the RH tools (pxeos, pxeboot) via Quattor templates and components: ncm-diskless_server
 * Kickstart for server installation and cloning: only installs the base system
 * The server and its clone are configured separately: chrooted for the clients
 * Client configuration cloned in 2 parts: 1 common to all clients (done on the clone), 1 specific to each client (1 profile / client)

The clone is not really a real machine: it cannot receive CDB modification notifications.
 * Has to fetch new profiles via cron
 * ncm components run on the server but must not impact the server
 * The client filesystem is read-only, but running a component on a client requires creating some files and the ability to modify existing ones
 * Not clear if we want to support several clones (several configurations) per server

Current experience: 2 test clusters (2 and 8 clients)
 * Clients in a private network without access to CDB/SWrep
 * SPMA cannot be run on the client: establishing a matrix of components that can run on the clients


=== Quattor and XEN - S. Childs ===

Main problem is the grub component:
 * Needs support for multiboot
 * Xen is the kernel; the Linux kernel and initrd are "modules"
 * A new version with this support is now checked in, but a problem was found at CERN?

Started ncm-xen (see the sketch below):
 * Writes configuration files for individual VMs
 * Should also write the base Xen configuration
 * Will set up links for automatic start of domains
 * Will check in 0.1 soon... Still not mature!
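
Since ncm-xen was not yet checked in, the following is only a rough sketch of what a per-domain configuration might look like; the path and every property name are assumptions:

{{{
# Hypothetical sketch of per-domain configuration for ncm-xen.
# The path /software/components/xen and all property names below
# are assumptions, not the actual component schema.
"/software/components/xen/domains/0" = nlist(
    "name",      "wn-vm01",
    "memory",    512,
    "kernel",    "/boot/vmlinuz-2.6-xen",
    "disk",      list("phy:vg0/wn-vm01,xvda,w"),
    "autostart", true
);
}}}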

GridBuilder: web-based (Ajax) interface for creating and managing VMs
 * http://gridbuilder.sourceforge.net, developed at Trinity College Dublin (author: S. Childs)
 * LVM allows fast creation of copy-on-write filesystem images
 * Database of VMs and images
 * Quattor used for configuration. Still a small amount of pre-configuration:
   * Configure the network on the filesystem images
   * Fetch the Quattor profile

Possible improvements in Quattor integration:
 * Automatic generation of node profiles from the user-supplied description in GridBuilder
 * Support for coLinux, a Linux version cooperating with Windows to share HW resources (memory...)
 * Condor pool


== PAN Compiler Update ==

C version: implementation frozen, only major bug fixes; v6.0.3
 * Performance improvements in the last version: compression removed, defaults processing improved; speed and memory consumption as good as or better than before
 * Added a "session" directory to improve the interface for CDB

Java version: still in development, all major parts functioning
 * Limited alpha available, first beta mid-December, production in January
 * Main part missing: built-in functions
 * Validation suite is complete
 * License: probably Apache 2, to be consistent with EGEE-II
 * Source in the QWG SVN repository: https://svn.lal.in2p3.fr/LCG/QWG
 * Backward compatibility: as much as possible; there may be some incompatibilities for rarely used features
 * Requires Java 1.5+
 * Compilation and packaging: ant
 * Parser (build): JavaCC 4.0
 * Unit testing: JUnit 4.1
 * Base64 encoding/decoding (build & run): classes available from Apache/W3C, probably incorporated directly into the code base.

Syntax changes (see the 'bind' sketch below):
 * Bit operators
 * Unary plus, for symmetry with unary minus
 * Octal and hex accepted everywhere: ranges, paths...
 * Limits allowed on record statements
 * 'bind' statement added for binding a path to a type, replacing one form of 'type' (this 'type' usage will be deprecated)
 * 'return' allowed wherever functions are (not very useful, a grammar simplification)
 * Warnings could be issued for deprecated usage
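
A minimal sketch of the 'bind' change, assuming the semantics described above:

{{{
# A sketch, not verified against the new compiler.
type port_t = long(0..65535);

# Deprecated form: 'type' used with a path to bind the path to a type
type "/software/components/myservice/port" = port_t;

# Equivalent new 'bind' statement in the Java compiler
bind "/software/components/myservice/port" = port_t;
}}}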

Other changes:
 * Stricter syntax checking at compile time
 * Generation of "object" files (a binary form of a syntax-checked template) to avoid recompiling unchanged templates
 * ant tasks will be the primary interface to the compiler (no "binary"), but wrapper scripts will be provided for the command line

Incompatibilities:
 * 'bind' is now a keyword
 * OBJECT, SELF, ARGV, ARGC defined to conform to best practices for global variables and to avoid a conflict (at the grammar level) between the object keyword and the object variable
 * No pointers to properties: 'x = y[0] = 0; y[0] = 1' now leaves x == 0 (with the current compiler, x would also be set to 1, since x points to y[0])

Emphasis for the first release: verifying functionality and measuring performance
 * In particular, evaluate the cost/benefit of object files

Future changes after the initial release:
 * Parallelization: compilation of templates, building of configuration trees
 * Removal of deprecated features: lowercase global variables, the deprecated form of 'type', the 'define' keyword
 * Addition of string functions: uppercasing, lowercasing, push, pop
 * More default types: XMLSchema, port...

Missing features/pending bug reports:
 * Unescape strings in the traceback produced in error messages
 * Add a file existence test operator
 * Add an argument to matches() to allow passing global options (as in Perl)

== QWG Templates ==

From the discussion: need to think about explicit support for CDB
 * Probably mainly a matter of defining load paths in an optional template, replacing cluster.build.properties (see the sketch below)
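
A minimal sketch of such an optional template, using pan's LOADPATH variable; the template name and the directory layout are assumptions:

{{{
# Hypothetical optional template replacing cluster.build.properties.
# LOADPATH is the pan variable consulted when resolving includes;
# the template name and the directories listed are assumptions.
template cluster_loadpath;

variable LOADPATH = list(
    "os/sl4-i386",
    "grid/glite-3.0",
    "sites/example"
);
}}}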


== CDB/SCDB Update ==

=== CDB Update ===

New features since the last workshop:
 * Namespaces
 * X509 and Kerberos authentication
 * ACLs with namespace support
 * Client/server improvements through session metadata

State management through metadata:
 * The problem was that session directories were used for both data and state
 * Clear separation, with specific metadata for state control: better and earlier detection of commits

Parallel compilation of templates:
 * Compiling all profiles with one command doesn't scale: too long, too much memory
 * The whole set of templates is divided into several subsets (without dependencies), compiled separately and in parallel on several processors/machines

Smarter rollback and commit:
 * Currently a rollback requires rolling back all the modifications in the session, and a commit can screw up previous modifications committed but not in the session directory
 * Now allows selective rollback and interactive commit

Handling of dead revisions: a problem related to the CVS backend
 * A removed template is no longer in CVS, and restoring it from backup requires a lot of manual cleanup. Look at SVN as a new backend?

The new authentication for CDB has moved to a separate library, now used in all components (SWrep in particular)
 * Using the other components no longer requires installing CDB, only the library

Other project status:
 * CDB as a web service: is there still any interest? Not sure...
 * Fine-grained CDB locking with fair queuing: really required
 * Concurrent compilation of (non-object) templates: too complex, wait for the new compiler...
 * mod_perl: no further investigation, mod_fastcgi is probably a better solution.

Open issues:
 * CVS doesn't scale: a problem with 24K templates. Possible solutions: a Perl-based CVS? Subversion? An XML database?
 * Relocatability: difficult to port to other systems; testing requires a full installation or specific privileges

=== SCDB Update ===

See slides.

Suggestion:
 * To avoid a full rebuild after an update of the repository templates, ignore these templates when evaluating whether a node profile must be recompiled (this is how it is handled within CDB).


== Quattor Core Modules Update ==

Several ongoing developments at CERN, of general interest.

CDB2SQL:
 * Rewrite with multithreaded Python and a fast XML parsing library. Should have no CERN dependency, only an Oracle dependency (but it should not be difficult to add support for other RDBMSes)

autoconf:
 * In contact with ETICS to use their framework for configuration and automatic builds
 * It will remain possible to run the Quattor build tools outside of ETICS (mainly need to define --LOCALDIR)

CCM:
 * Add support for a failover profile, in case the URL in --profile is not available
 * Problems observed causing ncm-cdispd to crash

wassh2: improvement of wassh (parallel ssh), interfaced with CDB
 * CDB access is done via a plugin: CERN uses Oracle/cdb2sql
 * Will be part of Quattor 1.3
 * Non-CERN testers welcome

Notification system: the ability to trigger execution of an NCM component on one or several nodes without logging in to them (even with wassh)
 * Basically one command: notify_host myhost component
 * 'component' is a keyword, translated on the target host via a configuration file
 * The current version has lots of CERN dependencies; the plan is to reengineer it and release it as part of Quattor


== AII ==

Since the last workshop, various bug fixes and some enhancements:
 * Support for rescue images
 * Support for alternative device naming schemes

Work in progress:
 * Separate site configuration from component configuration
 * Error handling
 * Complete partitioning scheme
 * Documentation
 * Schema change for block devices

On the todo list:
 * SPMA proxy start
 * Support for SINDES

Separating site and component configuration (see the sketch below):
 * The idea is to have pro_software/declaration_component_aii really related only to the AII component. Will provide a function aiiconfig(name,disk) returning the actual configuration
 * pro_config_aii_OSNAME: OS / arch specific configuration
 * pro_aii_config_site: site-specific information
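
A rough sketch of how the proposed aiiconfig function might be called; the path and both argument values are assumptions based on the description above:

{{{
# Hypothetical call of the proposed aiiconfig(name, disk) function,
# returning the actual AII configuration for an OS/arch and a target disk.
# The path and the argument values are assumptions.
"/system/aii" = aiiconfig("sl4-i386", "hda");
}}}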

Expected incompatibilities in the new version:
 * Separation of the configuration
 * Change in the schema for generic block devices

Remark: a property could be added to /software/components/osinstall/options to select whether the initial installation is done with DHCP or with the final address, and the template improved accordingly (see the sketch below)
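
A one-line sketch of the suggested property; the property name is an assumption, since the option was only proposed:

{{{
# Hypothetical: use DHCP (rather than the final static address) for the
# initial installation. Only the path /software/components/osinstall/options
# comes from the discussion; the property name 'use_dhcp' is an assumption.
"/software/components/osinstall/options/use_dhcp" = true;
}}}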


== New Schema for Block Devices ==

Need to add support for SW RAID, new kinds of block devices, and filesystem mount options.
 * Also need to align the naming for HW RAID and SW RAID

New schema proposal: /system/blockdevices/[disk|md|lvm|hwraid] (see the sketch below)
 * disk: 1 entry per disk, almost all information optional, mainly partitions
 * md/hwraid: basically the same, adding information about RAID members, RAID level, stripe size...
 * lvm: allows more sophisticated LVM schemes than the current functions do (an LVM VG split over several HW RAIDs...)
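
A minimal sketch of a profile fragment under the proposed layout; only the top-level /system/blockdevices/[disk|md|lvm] split comes from the proposal, every property name is an assumption:

{{{
# Hypothetical fragment under the proposed schema. Only the top-level
# layout is from the proposal; all property names below are assumptions.
"/system/blockdevices/disk/sda/partitions/sda1" = nlist("size", 20480);
"/system/blockdevices/disk/sdb/partitions/sdb1" = nlist("size", 20480);
"/system/blockdevices/md/md0" = nlist(
    "members",    list("sda1", "sdb1"),
    "raid_level", "RAID1"
);
"/system/blockdevices/lvm/vg0" = nlist(
    "members", list("md0")
);
}}}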

Seems OK, but a few things need checking:
 * AII compatibility, in particular the KS template
 * Ability to represent multi-pathed devices
 * Is the md/hwraid distinction relevant: maybe just keep a property in a common schema

== Namespaces ==

See German Cancio's presentation. Basically everybody agrees; just a few details:
 * What goes in pan/ and what goes in quattor/
 * Have all types defined in one place, as some will move to the compiler
 * Hardware: maybe cards/ is not needed; for ram/, use bank.tpl
 * Components: upgrade the Quattor build tools to automatically produce a namespaced version from a non-namespaced source, rename declaration.tpl to schema.tpl and define default values there in the future, add the ability to build RPM-less components (automatically insert the Perl script into the template).

Clusters:
 * CDB relies on clusters and subclusters
 * Could be added in SCDB

Sites:
 * Rename site/ to config/
 * Difficult to agree on the whole layout between CDB/SCDB, as the concepts are different

OS templates:
 * Change rpmlist to rpms
 * Rename the templates describing groups to groupname.tpl
 * Rename the os/ namespace to config/

Standard variables: no real need to agree on the variables, need to agree only on the schema

OS/arch naming: originally os_arch, QWG uses os-arch; no real need to agree, as this is not in the schema


== Wish List, Roadmap... ==

Documentation:
 * Provide user guides, in addition to the specifications, for all components
 * Move quattor.org to Twiki (except the home page?)
 * Have a short installation guide (not 80 pages...)
 * List the available components, with a very short explanation and the recommended production version (may come from CVS tags, updated manually if necessary)
 * Tutorials: differentiate between old and recent ones

Open issues (from Savannah):

== Conclusions ==

Next meeting: Trinity College, target date: mid-March


TBD:
 * Add an option to osinstall for using DHCP at installation time, and merge the LAL and standard KS templates
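
A related pan fragment, apparently from the VO account creation code in the templates: it builds the account entry for a VO role user by merging the role-specific parameters with the shared home-directory settings (role_user_home):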
{{{
# For each VO role: build the user entry by merging the role-specific
# values (uid, groups, pool-account settings) with the common
# home-directory definition held in role_user_home.
result['accounts']['users'][role_user] = merge(nlist('uid', vo_params[vo]['base_uid']+role_num,
                                                     'groups', list(vo_group),
                                                     'comment', 'VO '+vo_name+' '+role['description'],
                                                     'createKeys', vo_params[vo]['create_keys'],
                                                     'createHome', vo_params[vo]['create_home'],
                                                     'poolDigits', vo_params[vo]['pool_digits'],
                                                     ),
                                               role_user_home);
}}}