Changes between Version 1 and Version 2 of Meetings/Workshops/20071029


Timestamp: Oct 29, 2007, 1:50:38 PM
Author: jouvin

[[TOC(inline)]]


== Site Reports ==

=== CERN ===

Main instance : ~6100 nodes (+ 1K nodes in 6 months)
 * Template "de-optimization" : had to remove most uses of value('/xxx/yyy.tpl') in order to move to namespaces. As a result, compilation is slower by a factor of 2.
 * Still using panc v6 : performance was not satisfactory in the first tests with panc v7, and there has been no time for extensive testing. Current compilation time is 20 minutes on a system with 2 dual-core CPUs (16 GB of memory).
 * CDB : still using a flat namespace, moving slowly to namespaces. Optimizing for multi-core machines...
 * SPMA + SWrep : extended authentication added (Krb5), many bug fixes and improvements
 * CDB2SQL : no progress
 * CCM : authenticated mode added
 * Initial installation : still using ''PrepareInstall'' instead of AII. It interacts with both CDB and CDBSQL to get machine parameters, and with some CERN-specific components (LAN db, SINDES...). Installation is done by AIMS. Very successful but a little bit ugly to maintain.

Quattor development activities
 * Release management related projects (ETICS integration, Savannah...) on hold
 * Default template set has been moved to namespaces and build tools updated
 * Maintenance of a few core modules : CDB2SQL, SPMA, SWrep, CDB...

Other activities :
 * Xen-based virtualization based on ncm-xen and ncm-filesystems : development of a Xen Hypervisor.
 * Integration between Quattor and SLS in progress

=== LAL ===

See http://indico.cern.ch/materialDisplay.py?contribId=10&sessionId=0&materialId=slides&confId=20479

About using Squid to cache RPMs at every site : Grid-Ireland's experience is that there are some drawbacks if something outdated is in the cache, because it can be difficult to clear all caches. They prefer to use rsync at deployment time to ensure that every site repository is up to date.

=== NIKHEF ===

NIKHEF-ELPROD : ~150 nodes, strong increase expected
 * Also an installation testbed with 15 nodes

Currently using AII for initial installation and Quattor for configuration of OS and MW
 * OS configured with NCM components
 * MW configured with ncm-yaim : frequent patches to YAIM required

Quattor components used :
 * Moved to panc v7 : faster than v6.
 * AII, SPMA
 * SCDB without Subversion (using CVS for versioning)... Only the Ant tools are used.

Issues :
 * PAN compiler performance, as a significant increase in the number of nodes is expected
 * Scaling problems when deploying to 130 nodes simultaneously : try to monitor update results with Ganglia.

Future :
 * Xen virtualization
 * Nagios setup under Quattor

=== GRID-Ireland ===

Quattor managing 18 sites with a total of 400 nodes
 * Single CVS repository, moving to SVN
 * Replicated SW repositories at each site (rsync)
 * 3 deployment servers : production, tests, e-Learning

Recent developments :
 * 17 sites reinstalled with Quattor/Xen : fully automated PXE installation of hosts and guests
 * Migration to SVN : integration with the local web deployment tool in progress. The migration revealed a high number of unused, obsolete templates.
 * New compute resources : ~140 Condor VMs

QWG usage :
 * "Pointer" to a particular QWG revision (by rev number), checked out semi-automatically. Plan to pull QWG templates via svn:externals
 * Real sites containing clusters, ability to select sites. Local target ''compile.sites''.
 * Plan to use Stijn's dummy WN template

Issues :
 * Desperately need integrated monitoring to detect failed deployments, or at least documentation about currently existing solutions.
 * int.eu.grid site backed out of Quattor due to conflicts with APT/YAIM
 * Getting started with Quattor is still difficult

=== PIC ===

Several Quattor servers :
 * Quattor01 : still Quattor 1.1, CDB
 * Quattor02 : Quattor 1.3, SCDB, QWG glite-3.1
 * Quattor03 : will replace Quattor01/02 soon with QWG glite-3.0 and glite-3.1 (mainly WNs and UIs at first).

Local developments :
 * ncm-snmp : needed for Nagios monitoring
 * Local commands to hide SCDB/CDB differences : gettpl, ...

=== UAM ===

2 grid clusters : UAM-LCG2 (150 nodes) and GVMUAM-LCG2 (300 nodes)
 * Quattor 1.2, SCDB, QWG 3.0.2-x (not the most recent), AII, SWrep
 * Still configuring d-Cache manually : considering moving to Quattor

1 non-grid cluster still managed using CDB
 * Managed by a different set of people, who are used to CDB

=== Morgan Stanley (Nick Williams) ===

Not yet using Quattor, looking at it as one possible solution for Morgan Stanley's needs.

Currently using a home-made product called ''Aurora''
 * Quite good but designed at the beginning of the 90s
 * Defines all distributed services and host configuration : how a machine is configured, how to access apps, ...
 * Designed with the ability to restore a machine to a previous state in 20 minutes
 * Based on a homogeneous view of the resources, in particular through use of AFS : nothing installed locally

Current configuration is 20K Unix servers, including "grid systems" (pools of nodes running the same app) with home-grown MW.
 * Aurora designed for 5K machines
 * Need to integrate risk management, an important part of financial business (failover...)
 * Users asking for custom configurations, delivered in a quicker and more agile way.
 * Machines organized into ''buckets'' : ~2500 machines sharing most of their configuration settings.
 * A campus is made of multiple buckets, a region is made up of datacenters.
 * Significant work done at night to synchronize configurations

Core of the new Aurora must be :
 * A configuration system
 * An entitlements system : control of rights to do an action
 * Move off AFS if possible : RPM ?
 * Virtualization
 * Work done by Unix Engineering group (20 people, based in London)
 * Support for Linux and Solaris

=== Philips ===

Restarting work on Quattor : new people found.

Current effort is to transfer Quattor knowledge to system management to get more people involved.
 * Quattor 1.3, SCDB, QWG templates 3.1

Issues :
 * How to revert to a previous tag
 * What actions to perform when new QWG templates are out

=== CNAF ===

Still using Quattor 1.2 without SCDB/QWG : lack of time to look at it.
 * SLC 4 support has been added, mainly 32-bit
 * Quattor 1.3 update still planned, will move to namespaces at the same time (lots of CNAF-specific templates to update)
 * Using the latest AII version : no problem so far...

Evaluating Xen : some issues using pypxeboot

Training of CNAF staff still a problem : Quattor is often blamed for deleting what was done manually...
 * Very time consuming

Candidate for hosting the next Quattor workshop...


== Core Components ==

=== PAN Compiler - C. Loomis ===

Current status :
 * v6 deprecated and frozen : no bug fixes or enhancements. Still used by default by CDB.
 * v7 is the production version (7.2.6) : (almost) 100% backward compatible, very limited enhancements. Used by default by SCDB.
 * v8 : development version. First version with language-level changes compared to v6. A production version is expected before Christmas

v7 performance :
 * Faster than v6 for almost everybody, except CERN. Rudimentary profiling added to v7 to help investigate the problem.
 * Better memory management : can make a huge difference with a large number of templates
 * Generic performance tests done :
   * v7 significantly faster than v6 (x5-10) for almost every operation, except the variable ''self'' reference test, where v7.2.5 was 50x slower than v6. Fixed in 7.2.6, where performance is comparable with v6.
   * Many operations slower on multi-core than on single-core : requires some investigation.
   * Performance on CERN profiles is better with panc v7.2.6 than with v6 (x1.5 to x2). Memory use is around 2 GB for 920 profiles (probably much less than panc v6 but difficult to get exact consumption for v6).
 * The number of threads can be set explicitly, but all tests showed that the best performance is obtained when the number of threads matches the number of cores (default behaviour).

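The thread-count default described in the last item can be illustrated with a minimal Python sketch. This is not panc code : the names `pick_thread_count` and `compile_all` are hypothetical, and the sketch only shows the policy "use the requested number of threads if one is given, otherwise one thread per core".

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_thread_count(requested=None):
    """Return the explicit thread count if positive, else one thread per core."""
    if requested is not None and requested > 0:
        return requested
    # os.cpu_count() may return None on unusual platforms; fall back to 1.
    return os.cpu_count() or 1


def compile_all(profiles, compile_one, threads=None):
    """Apply compile_one (hypothetical per-profile function) to every profile."""
    with ThreadPoolExecutor(max_workers=pick_thread_count(threads)) as pool:
        # pool.map preserves the input order of the profiles.
        return list(pool.map(compile_one, profiles))
```

Calling `compile_all(profiles, compile_one)` without `threads` uses the core count, which corresponds to the default behaviour reported above.
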
Planned language changes :
 * v7 : foreach statement, bind statement (replacement for one form of type), bit and misc. functions, some deprecated keywords
 * v8 :
   * Enhancements : `format` function, i18n support, performance and include logging, automatic variable for current template name, simple DML and copy-on-write optimization (only copy changed parts), `final` for structure templates.
   * Deprecated : lowercase automatic variables, `type` synonym for `bind`.
   * Unsupported : define, description and descro keywords, delete statement
 * v9 : deprecation of include with literal name (important simplification of grammar), removal of lowercase automatic variables and of type as a synonym for bind
 * v10 : removal of include with literal name
 * v11 : include DML without braces

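The idea behind the copy-on-write optimization listed for v8 (only copy changed parts) can be sketched in Python with path copying on nested dictionaries. This is an illustration of the general technique, not the panc implementation : the function `set_path` and the sample tree are hypothetical.

```python
def set_path(tree, path, value):
    """Return a new tree with `value` stored at `path`.

    Only the dictionaries along `path` are copied; every other subtree
    is shared between the old and the new tree.
    """
    if not path:
        return value
    key, rest = path[0], path[1:]
    new_tree = dict(tree)  # shallow copy of this level only
    new_tree[key] = set_path(tree.get(key, {}), rest, value)
    return new_tree


# Hypothetical configuration tree, for illustration only.
base = {
    "hardware": {"ram": 16, "cpu": {"cores": 4}},
    "system": {"kernel": "2.6"},
}
updated = set_path(base, ["hardware", "ram"], 32)
```

After the update, `base` is unchanged, `updated["system"]` is the very same object as `base["system"]`, and only the `hardware` level was copied.
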
Other requests :
 * Cast operator : brings a lot of complications. Delayed for now.
 * Forced assignment : changes the type of a variable. Risk of undetected errors. Delayed for now.

Roadmap discussion :
 * Deprecate include with literal name in v8 : the change required is compatible with panc v6. Agreed.
 * Introduce include DML without braces at the same time as the former include syntax becomes unsupported (v9).