Changes between Version 1 and Version 2 of Meetings/Workshops/20080317


Timestamp:
Mar 17, 2008, 5:58:41 PM
Author:
jouvin
Comment:

--

  • Meetings/Workshops/20080317

    v1 v2  
    33
    44[[TOC(inline)]]
     5
     6[http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=28976 Agenda].
     7
     8== Site Reports ==
     9
     10=== CERN - V. Lefebure ===
     11
     12Several Quattor instances :
     13 * Main instance : 7600 profiles with 1700 not Quattor managed (for inventory only), in 140 clusters
     14 * 2 instances for Linux controls
     15 * Desktops : currently using only a small subset of NCM components to configure site-wide defaults
     16   * Special requirement on components : touch only the part they manage, don't remove comments
     17   * Rely on lcm rather than ncm-ncd to allow users to select which component to run
     18   
     19PAN : still using v6 because of a performance issue with the duplicate() function. Fix proposed for v8, CERN may skip v7.
     20
     21Namespaces : used mainly for staging purposes (/prod, /test...)
     22 * Just started to use namespace to organize templates
     23 
     24CDB : currently on a 4-core machine
     25 * 33 minutes for 7600 profiles
     26 * Memory issue : compile by batch of 100 profiles
     27 * May be related to the big RPM repository
     28 * More and more CDB users : issues with ACL management and performance
     29 * Compilation serialization leads to long apparent commit times : thinking about allowing some parallel sessions when they do not interfere with each other.
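The batching workaround for the memory issue can be illustrated with a minimal Python sketch. It is not CDB code: `compile_batch` is a hypothetical callable standing in for the real compiler invocation, and only the batch size of 100 comes from the notes above.

```python
from typing import Callable, Iterator, List, Sequence


def batches(profiles: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` profile names."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(profiles), size):
        yield list(profiles[start:start + size])


def compile_all(profiles: Sequence[str],
                compile_batch: Callable[[List[str]], None],
                size: int = 100) -> int:
    """Compile every profile, one bounded batch at a time.

    `compile_batch` is a hypothetical stand-in for the real compiler call.
    Returns the number of batches that were processed.
    """
    count = 0
    for batch in batches(profiles, size):
        compile_batch(batch)  # memory use is bounded by the batch size
        count += 1
    return count
```

With 7600 profiles and batches of 100, such a loop would invoke the compiler 76 times.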
     30 
     31SPMA and SWrep
     32 * Plan to clean up the RPM repository to reduce compilation memory requirements
     33 
     34CDB2SQL and Oracle
     35 * CDB2SQL being rewritten : plan for 3x speed improvement
     36 * CCM v2 deployed : SSL based
     37
     38Quattor development activities focused on maintenance of CERN templates, in particular namespace migration.
     39
     40Xen-based virtualization use increasing but needs a more flexible template structure.
     41
     42Trying to define a profile template structure everybody will use, with the aim of separating OS configuration, VO configuration...
     43
     44Other needs and issues :
     45 * SMS needs to access CDB ACLs
     46 * Update in progress to Quattor 1.3 templates : performance issue with push and npush, 2 CERN-specific properties in structure_interface
     47 * Reviewing how to better handle objects not managed by Quattor : remove requirements for unused information
     48   * Plan to use a specific profile_base for these systems
     49 * Migration of service data historically in CDB to SDB (Service Database)
     50
     51=== LAL - M. Jouvin ===
     52
     53http://indico.cern.ch/materialDisplay.py?contribId=2&sessionId=0&materialId=slides&confId=28976
     54
     55=== NIKHEF - R. Starink ===
     56
     57Sites:
     58 * NIKHEF-ELPROD : ~150 hosts
     59   * Will increase to 300 in April
     60 * Testbed : ~15 nodes
     61
     62Install with Quattor for all grid machines
     63 * CentOS 3, 4, 5
     64
     65Grid MW deployment with ncm-yaim
     66 * Lack of time to look at/switch to QWG : open to benefit from & contribute to community effort
     67   * Interest from another site : occasion to look at it
     68
     69Straightforward implementation of Xen guests using S. Childs's work.
     70
     71Monitoring : Nagios + Ganglia
     72 * Ganglia presents per-node or summary view of successful/failed execution of ncm-ncd
     73 * Nagios : alarms in case of non-zero exit (using NRPE)
     74   * Server configured manually
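A Nagios check of this kind can be sketched as follows. This is an illustrative Python function, not the NIKHEF plugin: the only facts relied on are the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). How the last ncm-ncd exit status is obtained on the node is left out, and `check_ncd` is a hypothetical name.

```python
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_ncd(exit_code: Optional[int]) -> Tuple[int, str]:
    """Map the exit status of the last configuration run to a Nagios result.

    `exit_code` is None when no run has been recorded (an assumption of
    this sketch); any non-zero status raises an alarm.
    """
    if exit_code is None:
        return UNKNOWN, "NCD UNKNOWN - no run recorded"
    if exit_code == 0:
        return OK, "NCD OK - last run succeeded"
    return CRITICAL, "NCD CRITICAL - last run exited with status %d" % exit_code
```

A wrapper script run through NRPE would print the message and exit with the returned code.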
     75
     76Pan Compiler : using v7.
     77
     78SCDB : non-SVN version
     79 * Using Makefiles to hide SCDB tools
     80
     81AII : successful migration to v2.
     82
     83Facing scaling issues with TFTP and/or RPM repositories
     84 * TFTP unlikely to be the cause
     85 * External reasons like network ?
     86
     87
     88=== Grid-Ireland - S. Childs ===
     89
     9018 distributed sites managed : ~400 nodes
     91 * Single SVN repository with 6 active users
     92 * RPM repositories replicated by rsync on each site
     93 
     94New compute resources deployed : ~40 Condor VMs
     95
     96Installation of LEMON monitoring
     97 * Just for TCD site currently
     98 * Starting with Stijn's recipe
     99 * Work required for better integration : alarms, IPMI
     100 
     101Xen config tidied up.
     102
     103Developing GraphXML output format for template dependencies in panc v8.
     104 * http://grid.ie/panc/HyperGraph/index.php
     105
     106Range of new non-grid machines installed
     107 * Portal server
     108 * Data management
     109 
     110QWG pulled in via svn:externals
     111 * "current" pointer to certified revision
     112 * "trunk" pointer for development
     113 * Need to specify ignore:externals in case QWG repository is not accessible
     114 * Using Stijn's dummy WN trick for performance improvement
     115 
     116QWG structure based on a site hierarchy : target "compile.sites"
     117
     118Deployment of OS errata : still not working seamlessly
     119 * Mainly kernel issues
     120 
     121Plan to bring more service nodes in Quattor (e.g. Web servers)
     122
     123
     124=== Morgan Stanley - N. Williams ===
     125
     12620K machines in 4 sites
     127 * Looked at several solutions, including Quattor
     128 * Pressure to have 10K machines managed with the replacement system by June. Will be a major test before final decision.
     129
     130Like the Quattor architecture.
     131
     132Tried SCDB but don't like Subversion and prefer CDB model with specialized commands for non specialist users
     133 * Trying to write a new tool merging both : AQDB
     134 * Looking at performance for compiling 10K machines
     135 * Concerns about scalability of build server : DHCP, HTTP...
     136 
     137Using AII to manage initial installation :
     138 * Would like to dramatically improve installation time from 20 min down to 5 min
     139
     14010 people involved, 5 really writing templates : main complaint is the difficulty of locating where things are included
     141 * Would like to use namespaces to provide better predictability
     142 
     143SPMA : dependency management is very time consuming
     144
     145Monitoring : very interested in LEMON but no time to look at it in detail.
     146
     147
     148=== UAM - L. Munoz Mejias ===
     149
     150Several clusters :
     151 * Atlas T2 : 150 nodes, SL 4.5, managed with QWG templates, monitoring with Nagios
     152 * GVM-UAM : private cluster for UAM users, not necessarily HEP. Many problems with HW support on SL 4.5.
     153 * 40-desktop cluster : still running old-CDB, will probably be closed (old HW)
     154 
     155Quattor changes :
     156 * Implementation of staged deployment
     157 * Dropped SWrep in favor of HTTPrep
     158 * Secure delivery of profiles, using certificates : lightweight alternative to SINDES developed.
     159 * AII v2
     160 * Nagios integration into templates
     161 
     162Plans for the short term:
     163 * Panc v8
     164 * Quattorization of Quattor instances
     165 
     166=== INFN - A. Chierici ===
     167
     168Running SLC4/gLite 3.1, except for CE
     169 * Disk servers run 64-bit (except if machine doesn't support it)
     170 
     171Quattor configuration :
     172 * Using ncm-yaim to configure grid services
     173 * Adopted a new NS schema, close to NIKHEF one
     174 * Xen studied and tested : Xen-UI to be deployed soon
     175 
     176M.E. back at CNAF : continue to support core development
     177 * But only for one year...
     178
     179New LHCb Tier-2 hosted at CNAF and fully quattorized.
     180
     181PANC : migrated to v7 after Madrid workshop. 40% perf improvement.
     182
     183LEMON configured using Quattor
     184 * Storage nodes and WNs
     185   * On WNs, conflict with GridICE
     186 * Migration in progress to NS
     187 * LEMON used for monitoring, Nagios used for alarming
     188 
     189More than 10 people involved in Quattor template maintenance at CNAF
     190
     191Concerns about CERN's role in the future : is the community strong enough to take over CERN's role ?
     192
     193Developed a web application to display node Quattor status graphically
     194 * Organized by rack
     195 * When a pointer is on a machine, displays the status of the components
     196 * Status retrieved from a MySQL database, filled with a cron job
     197 * Able to send status through RSS
     198
     199
     200=== Philips Research - S. Vrijaldenhoven ===
     201
     202Current internal cluster not managed by Quattor. Want to move to grid to get access to more resources.
     203
     204Test grid cluster managed with Quattor (SCDB+QWG templates)
     205 * SL 4.5, gLite 3.1
     206 * Plan to move to production soon (planned March 25th).
     207 
     208Nagios work going on.
     209
     210
     211=== IBCP - C. Eloto ===
     212
     2131 cluster running gLite 3.1
     214 * 1 CE, 1 SE, 18 WNs
     215 * Everything installed in VMs
     216 * Quattor chosen to compensate for lack of manpower : seems to provide efficient management
     217 
     218Site not yet certified : problem with SE.
     219
     220Faced many problems with early release of QWG templates for gLite 3.1 and some unusual settings (RPM server different from TFTP server)
     221
     222Documentation needs to be improved for AII configuration, Python 32-bit for 64-bit, gLite 3.0/3.1 coexistence.
     223
     224
     225=== BEgrid - Stijn De Weirdt ===
     226
     227Quattor configuration :
     228 * Central configuration database (SCDB)
     229 * RPM repositories with SWrep
     230 * Certificates used for ACLs on SVN and SWrep
     231 * Deployment in 2 phases :
     232   * When central admins update central repository, site admins are notified
     233   * Deploy the changes with a custom script if they are interested
     234   
     235Sensitive information deployed with SINDES but not properly integrated with AII
     236 * Needs to be done for AII v2
     237 
     238Still not using QWG OS templates for historical reasons but plan to move.
     239
     240Use of dummy WN build (idea from CERN) to speed up the compilation
     241 * Compile just once
     242 * Reuse compiled version of a WN in every real WN profile
     243 * Not yet integrated with QWG : some changes required, mainly in `machine-types/base.tpl` and `machine-types/wn.tpl`
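The dummy WN idea (compile once, reuse in every real WN profile) can be sketched in Python as below. This is only an illustration of the caching principle, not the BEgrid or CERN implementation: `compile_dummy`, the dictionary layout and the `hostname` key are assumptions made for the example.

```python
import copy
from typing import Callable, Dict, Iterable


def build_wn_profiles(hostnames: Iterable[str],
                      compile_dummy: Callable[[], dict]) -> Dict[str, dict]:
    """Build one profile per worker node from a single shared compilation.

    `compile_dummy` is a hypothetical callable for the expensive step
    (compiling the common WN configuration); it is called exactly once.
    """
    shared = compile_dummy()  # expensive, executed only once
    profiles = {}
    for host in hostnames:
        profile = copy.deepcopy(shared)  # each node gets an independent copy
        profile["hostname"] = host       # node-specific part (assumed key)
        profiles[host] = profile
    return profiles
```

The gain comes from paying the cost of the common part once instead of once per node; only the node-specific values are filled in per profile.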
     244
     245Other issues :
     246 * Better integration of monitoring tools in Quattor
     247 * Integration of DNS management in a way similar to DHCP
     248
     249
     250== Main Developments ==
     251
     252=== QWG Templates - M. Jouvin ===
     253
     254http://indico.cern.ch/materialDisplay.py?contribId=18&sessionId=2&materialId=slides&confId=28976
     255
     256=== AII v2 - R. Starink ===
     257
     258Reasons for v2 :
     259 * v1 limited to PXE + Kickstart. Want to support other installation infrastructure like JumpStart
     260 * Device schema limited to /dev/[hs]d[a-z] : other schema supported by some hacks
     261 * RAID and LVM support very limited and impossible to combine
     262 * KS templates were becoming too complex and not maintainable
     263 * Untyped schema : no validation possible at compile time
     264 * Code unnecessarily complex : 2800 lines in v1 vs 1600 in v2
     265 
     266v2 architecture :
     267 * Front-end is using plug-ins to do the real work
     268 * 3 types of plugins : NBP, DHCP, osinstall
     269 * Node profile determines which plug-ins are loaded
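The plug-in architecture can be pictured with the following Python sketch. It is not AII code: the registry, the decorator and the returned strings are invented for illustration, and the assumption that a plug-in is selected by the name of a subtree under its type (as in /system/aii/nbp/pxelinux) is inferred from the configuration paths given in these notes.

```python
from typing import Callable, Dict, List, Tuple

PLUGIN_TYPES = ("nbp", "dhcp", "osinstall")

# Hypothetical registry: (plug-in type, plug-in name) -> implementation.
REGISTRY: Dict[Tuple[str, str], Callable[[dict], str]] = {}


def register(kind: str, name: str):
    """Decorator that records a plug-in under its type and name."""
    if kind not in PLUGIN_TYPES:
        raise ValueError("unknown plug-in type: %s" % kind)

    def decorator(func: Callable[[dict], str]) -> Callable[[dict], str]:
        REGISTRY[(kind, name)] = func
        return func
    return decorator


@register("nbp", "pxelinux")
def pxelinux(profile: dict) -> str:
    return "pxelinux config for %s" % profile["hostname"]


@register("osinstall", "ks")
def kickstart(profile: dict) -> str:
    return "kickstart file for %s" % profile["hostname"]


def configure(profile: dict) -> List[str]:
    """Front-end: run only the plug-ins named in the node profile."""
    results = []
    for kind in PLUGIN_TYPES:
        for name in sorted(profile.get("aii", {}).get(kind, {})):
            results.append(REGISTRY[(kind, name)](profile))
    return results
```

The front-end itself contains no installation logic; adding support for another installer would mean registering another plug-in rather than changing `configure`.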
     270 
     271aii-pxelinux : default plug-in for NBP, no more use of template
     272
     273aii-ks : default plug-in for osinstall, no more use of template
     274 * Generator for Kickstart files
     275 * Support complex blockdevice combinations : based on ncm-lib-blockdevices
     276 * Flexible partitioning and formatting of file systems... but a bit slower
     277 * Site specific setup through hooks : plug-ins for aii-ks organized in 3 groups (pre-install, post-install, post-reboot)
     278   * Configured through PAN
     279   * Ordered list
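The hook mechanism (three groups, each an ordered list) can be sketched as follows. This is a Python illustration under stated assumptions, not the aii-ks implementation: hooks are modelled as callables returning lines for the generated Kickstart file, and the group names are Python-friendly spellings of the three groups mentioned above.

```python
from typing import Callable, Dict, List

HOOK_GROUPS = ("pre_install", "post_install", "post_reboot")

Hook = Callable[[str], List[str]]


def run_hooks(hooks: Dict[str, List[Hook]], group: str, node: str) -> List[str]:
    """Run the hooks of one group in their configured order.

    Each hook (a hypothetical callable here) returns the lines it wants
    to contribute; the output keeps the order of the configured list.
    """
    if group not in HOOK_GROUPS:
        raise ValueError("unknown hook group: %s" % group)
    lines: List[str] = []
    for hook in hooks.get(group, []):
        lines.extend(hook(node))
    return lines
```

Because the list is ordered, a site can rely on one hook running before another within the same group.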
     280   
     281AII configuration has changed... but easy to adjust
     282 * NBP : configuration moved to /system/aii/nbp/pxelinux
     283 * osinstall configuration : configuration moved to /system/aii/osinstall/ks
     284   * Customizable via 22 variables AII_OSINSTALL_*
     285 * hooks : execute NCM object, print what is desired in KS file
     286 * File system definitions : separated between /sys/blockdevices and /system/filesystems
     287   * Require new pan-templates and ncm-lib-blockdevices
     288   
     289Almost no change in command line interface : aii-shellfe, aii-installfe
     290 * --notify being reimplemented
     291 
     292Upgrading Quattor server
     293 * Pre-requisites : CCM >= 2.0.2, pan-templates >= 2.7.1, ncm-lib-blockdevices >= 0.17
     294 * Packages : aii-server >= 2.0.4, aii-ks >= 1.0.1, aii-pxelinux >= 1.0.0
     295 * Configuration moved from /etc to /etc/aii
     296 * Move configuration for NBP and osinstall : remove everything under /system/aii/*/options as it will be assumed to be related to a plugin 'option'
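Before upgrading, the minimum versions listed above can be checked with a small script. The sketch below is an assumption-laden illustration: the version numbers come from these notes, but the dictionary keys are not guaranteed to match the real package names, and obtaining the installed versions (for instance from the RPM database) is left to the caller.

```python
from typing import Dict, List, Tuple

# Minimum versions quoted in the notes; keys are illustrative names.
MINIMUM = {
    "ccm": "2.0.2",
    "pan-templates": "2.7.1",
    "ncm-lib-blockdevices": "0.17",
    "aii-server": "2.0.4",
    "aii-ks": "1.0.1",
    "aii-pxelinux": "1.0.0",
}


def parse(version: str) -> Tuple[int, ...]:
    """Turn '2.0.4' into (2, 0, 4) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def missing_requirements(installed: Dict[str, str]) -> List[str]:
    """Return the names whose installed version is absent or too old."""
    problems = []
    for name, minimum in MINIMUM.items():
        have = installed.get(name)
        if have is None or parse(have) < parse(minimum):
            problems.append(name)
    return problems
```

Comparing integer tuples rather than strings matters here: as strings, "0.9" would sort after "0.17".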
     297 
     298Possible contributions :
     299 * Alternative modules
     300 * Hooks : generic (e.g. SINDES support) or site specific to be used by others as a starting point
     301 * Separate directory for sharing contributions and enhancements : contrib
     302   * Extensions maintained by extension author
     303   * A generic extension may be moved to standard AII
     304   
     305Support for AII v1 has been dropped.
     306
     307Already tested with Xen at NIKHEF.
     308
     309
     310=== Update on blockdevices layout - L. Munoz Mejias ===
     311
     312blockdevices :
     313 * partitions_add : bulk addition of partitions to a disk. Partitions given as a pair name/size.
     314 * lvs_add : similar for LVM
     315 * Support for LVM striping added : `stripe_size` property on `logical_volumes` structure.
     316 * No support for HW raid yet
     317 * No support for quota yet : looking for suggestions about a quota schema
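The bulk-addition idea behind partitions_add can be shown with a Python analogue. This is not the Pan function and does not reproduce its signature, which is not given in these notes: `add_partitions`, the `partitions` key and the `size` field are names chosen for the sketch, and the unit of the sizes is left unspecified.

```python
from typing import Any, Dict


def add_partitions(disk: Dict[str, Any], *pairs: Any) -> Dict[str, Any]:
    """Return a copy of `disk` with partitions added from name/size pairs.

    `pairs` is a flat sequence: name1, size1, name2, size2, ...
    (a hypothetical calling convention for this illustration).
    """
    if len(pairs) % 2 != 0:
        raise ValueError("partitions must be given as name/size pairs")
    partitions = dict(disk.get("partitions", {}))
    for name, size in zip(pairs[0::2], pairs[1::2]):
        partitions[name] = {"size": size}
    result = dict(disk)
    result["partitions"] = partitions
    return result
```

For example, `add_partitions({"label": "msdos"}, "sda1", 100, "sda2", 4096)` describes two partitions in one call instead of one assignment per partition.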
     318 
     319 
     320== Core Components ==
     321
     322=== PAN Compiler - Cal ===
     323
     324Production version 7.2.9
     325 * Bug fixes only
     326 
     327Version 8 in development, still in trunk
     328 * 8.1.0 tagged last week and available for evaluation
     329 * Feedback from each site would help moving this to production : large chunks rewritten
     330 * Removed features deprecated in v7
     331   * Keywords : define, delete, description, descro
     332   * Types : embed, fetch, stream
     333 * Newly deprecated features in v8
     334   * Bareword include : `include mytemplate;` changed to `include { 'mytemplate' };`
     335   * Using `type` for binding : must be replaced by `bind`
     336   * Lowercase automatic variables : `loadpath`, `self`, `object`
     337 * Language changes :
     338   * External path syntax : //myobject/some/absolute/path to be deprecated in favor of /my/object/tpl:/some/absolute/path
     339     * Will eventually allow object templates to be namespaced
     340   * Literal escaping of paths : can escape part of a path
     341 * More limits on structure template contents : variable, functions, include no longer allowed
     342 * New and changed functions :
     343   * format() : printf-like capabilities. Can avoid incremental building of configuration file contents.
     344     * Not printing everything by itself
     345   * is_defined(), is_boolean()... return false instead of error if variable does not exist.
     346   * String manipulation : to_lowercase/uppercase(), split(), replace()
     347   * to_string() will accept any element, undefined, null and resources included
     348 * New automatic variable : TEMPLATE. Name of the template that initiated a DML block.
     349   * Not changed with function calls, including create()
     350 * New output format : write machine template as a dot (Graphviz) file
     351 * Logging capabilities added
     352   * Several logging types : task, call, memory, all, none
     353   * Messages short and easily parsed for analysis
     354   * Cost : 15% slower with all logging
     355   * Example analysis scripts for memory usage vs. time, task per thread vs. time, graph of call (include) structure, performance studies (how much time spent in every template)
     356 * Change in global variable handling : `variable X = ...exists(X)...` always true.
     357   * Must use null instead of undef for tri-state variables
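Migrating the deprecated bareword include form could be partly automated. The Python sketch below rewrites `include mytemplate;` into the braced, quoted form mentioned above. It is a rough text filter under stated assumptions, not a Pan parser: it only handles an include statement at the start of a line and ignores comments, strings and multi-line statements, so its output should be reviewed before committing.

```python
import re

# An include followed by a bare template name (optionally namespaced
# with '/'), at the start of a line. Braced includes do not match.
BAREWORD_INCLUDE = re.compile(
    r"^([ \t]*)include\s+([A-Za-z_][\w./+-]*)\s*;",
    re.MULTILINE,
)


def migrate_includes(source: str) -> str:
    """Rewrite bareword includes to the braced, quoted form."""
    return BAREWORD_INCLUDE.sub(r"\1include { '\2' };", source)
```

Statements that already use the braced form are left untouched, so the filter can be run repeatedly on the same templates.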
     358   
     359
     360
     361Implementation changes in v8 :
     362 * Better handling of SELF : faster and less memory intensive, more consistent in various contexts
     363 * Some optimizations : evaluation and use of compile-time expressions, specialized operators to avoid redundant checks at runtime, optimization to allow stricter syntax checking (earlier detection of some errors)
     364 * "Read-only" resources : infrastructure in place to avoid unnecessary copying of resources but not used yet. Some semantic issues to work out.
     365 * GRIF full build : 25% faster than v7
     366 
     367Documentation updated, including tutorial and man pages for PAN functions.
     368
     369Migration of standard templates to v8 required to avoid deprecation warnings
     370 * CERN has to migrate to v8 to benefit from the latest change to improve performance : requires removing no longer supported keywords
     371
     372Would like to clean up panc script options. What are the options used ?
     373 * To be discussed on the mailing list
     374
     375Authorization/Entitlements (Morgan Stanley request)
     376 * What changed : template or configuration ? Real issue is probably configuration and it's tricky to know the reference value.
     377 * Who did it ? How to get user identity ?
     378 * Was it authorized ? How to define authorization ? One possibility is to define parts of configuration that can be modified by a template but at the price of flexibility.
     379 * Need to refine the use case and possible design : first discuss on mailing list and try to write a wrap up, then try to implement a prototype
     380 * To be acceptable the performance price for this should be only when you use the feature (and not for everybody like in panc v6).
     381