| | 5 | |
| | 6 | [http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=28976 Agenda]. |
| | 7 | |
| | 8 | == Site Reports == |
| | 9 | |
| | 10 | === CERN - V. Lefebure === |
| | 11 | |
| | 12 | Several Quattor instances : |
| | 13 | * Main instance : 7600 profiles with 1700 not Quattor managed (for inventory only), in 140 clusters |
| | 14 | * 2 instances for Linux controls |
| | 15 | * Desktops : currently using only a small subset of NCM components to configure site-wide defaults |
| | 16 | * Special requirement on components : touch only the part they manage, don't remove comments |
| | 17 | * Rely on lcm rather than ncm-ncd to allow users to select the components to run |
| | 18 | |
| | 19 | PAN : still using v6 because of a performance issue with the duplicate() function. Fix proposed for v8, CERN may skip v7. |
| | 20 | |
| | 21 | Namespaces : used mainly for staging purposes (/prod, /test...) |
| | 22 | * Just started to use namespace to organize templates |
| | 23 | |
| | 24 | CDB : currently on a 4-core machine |
| | 25 | * 33 minutes for 7600 profiles |
| | 26 | * Memory issue : compile by batch of 100 profiles |
| | 27 | * May be related to the big RPM repository |
| | 28 | * More and more CDB users : issues with ACL management and performance |
| | 29 | * Compilation serialization leads to long apparent commit times : thinking about allowing some parallel sessions if they do not interfere with each other. |
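The batch compilation mentioned above can be illustrated with a minimal sketch. It only shows how 7600 profiles split into batches of 100; the profile names and the `batches` helper are hypothetical and are not part of CDB.

```python
from typing import Iterator, List, Sequence


def batches(profiles: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` profile names."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(profiles), size):
        yield list(profiles[start:start + size])


# Hypothetical profile names; 7600 matches the count reported above.
profiles = [f"profile_{i:04d}" for i in range(7600)]
chunks = list(batches(profiles))
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Each batch would then be handed to the compiler separately, so peak memory depends on the batch size rather than on the total number of profiles.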
| | 30 | |
| | 31 | SPMA and SWrep |
| | 32 | * Plan to clean up the RPM repository to reduce compilation memory requirements |
| | 33 | |
| | 34 | CDB2SQL and Oracle |
| | 35 | * CDB2SQL being rewritten : plan for 3x speed improvement |
| | 36 | * CCM v2 deployed : SSL based |
| | 37 | |
| | 38 | Quattor development activities focused on maintenance of CERN templates, in particular namespace migration. |
| | 39 | |
| | 40 | Xen-based virtualization use increasing but needs a more flexible template structure. |
| | 41 | |
| | 42 | Trying to define a profile template structure everybody will use, with the aim of separating OS configuration, VO configuration... |
| | 43 | |
| | 44 | Other needs and issues : |
| | 45 | * SMS needs to access CDB ACLs |
| | 46 | * Update in progress to Quattor 1.3 templates : performance issue with push and npush, 2 CERN specific properties in structure_interface |
| | 47 | * Reviewing how to better handle non Quattor managed objects : remove requirements for unused information |
| | 48 | * Plan to use a specific profile_base for these systems |
| | 49 | * Migration of service data historically in CDB to SDB (Service Database) |
| | 50 | |
| | 51 | === LAL - M. Jouvin === |
| | 52 | |
| | 53 | http://indico.cern.ch/materialDisplay.py?contribId=2&sessionId=0&materialId=slides&confId=28976 |
| | 54 | |
| | 55 | === NIKHEF - R. Starink === |
| | 56 | |
| | 57 | Sites: |
| | 58 | * NIKHEF-ELPROD : ~150 hosts |
| | 59 | * Will increase to 300 in April |
| | 60 | * Testbed : ~15 nodes |
| | 61 | |
| | 62 | Install with Quattor for all grid machines |
| | 63 | * CentOS 3, 4, 5 |
| | 64 | |
| | 65 | Grid MW deployment with ncm-yaim |
| | 66 | * Lack of time to look at/switch to QWG : open to benefit from & contribute to community effort |
| | 67 | * Interest from another site : occasion to look at it |
| | 68 | |
| | 69 | Straightforward implementation of Xen guests using S. Childs's work. |
| | 70 | |
| | 71 | Monitoring : Nagios + Ganglia |
| | 72 | * Ganglia presents per-node or summary view of successful/failed execution of ncm-ncd |
| | 73 | * Nagios : alarms in case of non-zero exit (using NRPE) |
| | 74 | * Server configured manually |
| | 75 | |
| | 76 | Pan Compiler : using v7. |
| | 77 | |
| | 78 | SCDB : non-SVN version |
| | 79 | * Using Makefiles to hide SCDB tools |
| | 80 | |
| | 81 | AII : successful migration to v2. |
| | 82 | |
| | 83 | Facing scaling issues with TFTP and/or RPM repositories |
| | 84 | * TFTP unlikely to be the cause |
| | 85 | * External reasons like network ? |
| | 86 | |
| | 87 | |
| | 88 | === Grid-Ireland - S. Childs === |
| | 89 | |
| | 90 | 18 distributed sites managed : ~400 nodes |
| | 91 | * Single SVN repository with 6 active users |
| | 92 | * RPM repositories replicated by rsync on each site |
| | 93 | |
| | 94 | New compute resources deployed : ~40 Condor VMs |
| | 95 | |
| | 96 | Installation of LEMON monitoring |
| | 97 | * Just for TCD site currently |
| | 98 | * Starting with Stijn's recipe |
| | 99 | * Work required for better integration : alarms, IPMI |
| | 100 | |
| | 101 | Xen config tidied up. |
| | 102 | |
| | 103 | Developing GraphXML output format for template dependencies in panc v8. |
| | 104 | * http://grid.ie/panc/HyperGraph/index.php |
| | 105 | |
| | 106 | Range of new non-grid machines installed |
| | 107 | * Portal server |
| | 108 | * Data management |
| | 109 | |
| | 110 | QWG pulled in via svn:externals |
| | 111 | * "current" pointer to certified revision |
| | 112 | * "trunk" pointer for development |
| | 113 | * Need to specify `--ignore-externals` in case QWG repository is not accessible |
| | 114 | * Using Stijn's dummy WN trick for performance improvement |
| | 115 | |
| | 116 | QWG structure based on a site hierarchy : target "compile.sites" |
| | 117 | |
| | 118 | Deployment of OS errata : still not working seamlessly |
| | 119 | * Mainly kernel issues |
| | 120 | |
| | 121 | Plan to bring more service nodes in Quattor (e.g. Web servers) |
| | 122 | |
| | 123 | |
| | 124 | === Morgan Stanley - N. Williams === |
| | 125 | |
| | 126 | 20K machines in 4 sites |
| | 127 | * Looked at several solutions, including Quattor |
| | 128 | * Pressure to have 10K machines managed with the replacement system by June. Will be a major test before final decision. |
| | 129 | |
| | 130 | Like the Quattor architecture. |
| | 131 | |
| | 132 | Tried SCDB but don't like Subversion and prefer CDB model with specialized commands for non specialist users |
| | 133 | * Trying to write a new tool merging both : AQDB |
| | 134 | * Looking at performance for compiling 10K machines |
| | 135 | * Concerns about scalability of build server : DHCP, HTTP... |
| | 136 | |
| | 137 | Using AII to manage initial installation : |
| | 138 | * Would like to dramatically improve installation time from 20 min down to 5 min |
| | 139 | |
| | 140 | 10 people involved, 5 really writing templates : main complaint is the difficulty of locating where things are included |
| | 141 | * Would like to use namespaces to provide better predictability |
| | 142 | |
| | 143 | SPMA : dependency management is very time consuming |
| | 144 | |
| | 145 | Monitoring : very interested in LEMON but no time to look at it in detail. |
| | 146 | |
| | 147 | |
| | 148 | === UAM - L. Munoz Mejias === |
| | 149 | |
| | 150 | Several clusters : |
| | 151 | * Atlas T2 : 150 nodes, SL 4.5, managed with QWG templates, monitoring with Nagios |
| | 152 | * GVM-UAM : private cluster for UAM users, not necessarily HEP. Many problems with hardware support on SL 4.5. |
| | 153 | * 40-desktop cluster : still running old-CDB, will probably be closed (old HW) |
| | 154 | |
| | 155 | Quattor changes : |
| | 156 | * Implementation of staged deployment |
| | 157 | * Dropped SWrep in favor of HTTPrep |
| | 158 | * Secure delivery of profiles, using certificates : lightweight alternative to SINDES developed. |
| | 159 | * AII v2 |
| | 160 | * Nagios integration into templates |
| | 161 | |
| | 162 | Plans for the short term: |
| | 163 | * Panc v8 |
| | 164 | * Quattorization of Quattor instances |
| | 165 | |
| | 166 | === INFN - A. Chierici === |
| | 167 | |
| | 168 | Running SLC4/gLite 3.1, except for CE |
| | 169 | * Disk servers run 64-bit (except if machine doesn't support it) |
| | 170 | |
| | 171 | Quattor configuration : |
| | 172 | * Using ncm-yaim to configure grid services |
| | 173 | * Adopted a new NS schema, close to NIKHEF one |
| | 174 | * Xen studied and tested : Xen-UI to be deployed soon |
| | 175 | |
| | 176 | M.E. back at CNAF : continue to support core development |
| | 177 | * But only for one year... |
| | 178 | |
| | 179 | New LHCb Tier-2 hosted at CNAF and fully quattorized. |
| | 180 | |
| | 181 | PANC : migrated to v7 after Madrid workshop. 40% perf improvement. |
| | 182 | |
| | 183 | LEMON configured using Quattor |
| | 184 | * Storage nodes and WNs |
| | 185 | * On WNs, conflict with GridICE |
| | 186 | * Migration in progress to NS |
| | 187 | * LEMON used for monitoring, Nagios used for alarming |
| | 188 | |
| | 189 | More than 10 people involved in Quattor template maintenance at CNAF |
| | 190 | |
| | 191 | Concerns about CERN role in the future : is the community strong enough to take over CERN role ? |
| | 192 | |
| | 193 | Developed a web application to display node Quattor status graphically |
| | 194 | * Organized by rack |
| | 195 | * Hovering the pointer over a machine displays the status of its components |
| | 196 | * Status retrieved from a MySQL database, filled with a cron job |
| | 197 | * Able to send status through RSS |
| | 198 | |
| | 199 | |
| | 200 | === Philips Research - S. Vrijaldenhoven === |
| | 201 | |
| | 202 | Current internal cluster not managed by Quattor. Want to move to grid to get access to more resources. |
| | 203 | |
| | 204 | Test grid cluster managed with Quattor (SCDB+QWG templates) |
| | 205 | * SL 4.5, gLite 3.1 |
| | 206 | * Plan to move to production soon (planned March 25th). |
| | 207 | |
| | 208 | Nagios work going on. |
| | 209 | |
| | 210 | |
| | 211 | === IBCP - C. Eloto === |
| | 212 | |
| | 213 | 1 cluster running gLite 3.1 |
| | 214 | * 1 CE, 1 SE, 18 WNs |
| | 215 | * Everything installed in VMs |
| | 216 | * Quattor chosen to compensate for lack of manpower : seems to provide efficient management |
| | 217 | |
| | 218 | Site not yet certified : problem with SE. |
| | 219 | |
| | 220 | Faced many problems with early release of QWG templates for gLite 3.1 and some unusual settings (RPM server different from TFTP server) |
| | 221 | |
| | 222 | Documentation needs to be improved for AII configuration, Python 32-bit for 64-bit, gLite 3.0/3.1 coexistence. |
| | 223 | |
| | 224 | |
| | 225 | === BEgrid - Stijn De Weirdt === |
| | 226 | |
| | 227 | Quattor configuration : |
| | 228 | * Central configuration database (SCDB) |
| | 229 | * RPM repositories with SWrep |
| | 230 | * Certificates used for ACLs on SVN and SWrep |
| | 231 | * Deployment in 2 phases : |
| | 232 | * When central admins update central repository, site admins are notified |
| | 233 | * Deploy the changes with a custom script if they are interested |
| | 234 | |
| | 235 | Sensitive information deployed with SINDES but not properly integrated with AII |
| | 236 | * Needs to be done for AII v2 |
| | 237 | |
| | 238 | Still not using QWG OS templates for historical reasons but plan to move. |
| | 239 | |
| | 240 | Use of dummy WN build (idea from CERN) to speed up the compilation |
| | 241 | * Compile just once |
| | 242 | * Reuse compiled version of a WN in every real WN profile |
| | 243 | * Not yet integrated with QWG : some changes required, mainly in `machine-types/base.tpl` and `machine-types/wn.tpl` |
| | 244 | |
| | 245 | Other issues : |
| | 246 | * Better integration of monitoring tools in Quattor |
| | 247 | * Integration of DNS management in a way similar to DHCP |
| | 248 | |
| | 249 | |
| | 250 | == Main Developments == |
| | 251 | |
| | 252 | === QWG Templates - M. Jouvin === |
| | 253 | |
| | 254 | http://indico.cern.ch/materialDisplay.py?contribId=18&sessionId=2&materialId=slides&confId=28976 |
| | 255 | |
| | 256 | === AII v2 - R. Starink === |
| | 257 | |
| | 258 | Reasons for v2 : |
| | 259 | * v1 limited to PXE + Kickstart. Want to support other installation infrastructure like JumpStart |
| | 260 | * Device schema limited to /dev/[hs]d[a-z] : other schema supported by some hacks |
| | 261 | * RAID and LVM support very limited and impossible to combine |
| | 262 | * KS templates were becoming too complex and not maintainable |
| | 263 | * Untyped schema : no validation possible at compile time |
| | 264 | * Code unnecessarily complex : 2800 lines in v1 vs 1600 in v2 |
| | 265 | |
| | 266 | v2 architecture : |
| | 267 | * Front-end is using plug-ins to do the real work |
| | 268 | * 3 types of plugins : NBP, DHCP, osinstall |
| | 269 | * Node profile determines which plug-ins are loaded |
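The plug-in dispatch described above can be pictured with a small sketch. It is written in Python for illustration only and is not AII's own code: the registry, the `run_frontend` function and the profile layout (plug-in type mapped to plug-in name) are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Registry of plug-ins keyed by (type, name); all names here are illustrative.
PLUGINS: Dict[Tuple[str, str], Callable[[str], str]] = {}

# The three plug-in types named in the talk.
PLUGIN_TYPES = ("dhcp", "nbp", "osinstall")


def register(kind: str, name: str):
    """Decorator that records a plug-in under its type and name."""
    if kind not in PLUGIN_TYPES:
        raise ValueError(f"unknown plug-in type: {kind}")

    def decorator(func):
        PLUGINS[(kind, name)] = func
        return func

    return decorator


@register("nbp", "pxelinux")
def configure_pxelinux(host: str) -> str:
    return f"{host}: network boot configured by pxelinux plug-in"


@register("osinstall", "ks")
def generate_kickstart(host: str) -> str:
    return f"{host}: Kickstart file generated by ks plug-in"


def run_frontend(host: str, profile: Dict[str, str]) -> List[str]:
    """Load only the plug-ins named in the node profile and run them in order."""
    results = []
    for kind in PLUGIN_TYPES:
        name = profile.get(kind)
        if name is None:
            continue  # this node does not request a plug-in of this type
        results.append(PLUGINS[(kind, name)](host))
    return results


# Hypothetical profile: plug-in type -> plug-in name.
report = run_frontend("node01", {"nbp": "pxelinux", "osinstall": "ks"})
print("\n".join(report))
```

Under this layout, supporting another installer such as JumpStart would mean registering a new "osinstall" plug-in without touching the front-end.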
| | 270 | |
| | 271 | aii-pxelinux : default plug-in for NBP, no more use of template |
| | 272 | |
| | 273 | aii-ks : default plug-in for osinstall, no more use of template |
| | 274 | * Generator for Kickstart files |
| | 275 | * Support complex blockdevice combinations : based on ncm-lib-blockdevices |
| | 276 | * Flexible partitioning and formatting of file systems... but a bit slower |
| | 277 | * Site specific setup through hooks : plug-ins for aii-ks organized in 3 groups (pre-install, post-install, post-reboot) |
| | 278 | * Configured through PAN |
| | 279 | * Ordered list |
| | 280 | |
| | 281 | AII configuration has changed... but easy to adjust |
| | 282 | * NBP : configuration moved to /system/aii/nbp/pxelinux |
| | 283 | * osinstall configuration : configuration moved to /system/aii/osinstall/ks |
| | 284 | * Customizable via 22 variables AII_OSINSTALL_* |
| | 285 | * hooks : execute NCM object, print what is desired in KS file |
| | 286 | * File system definitions : separated between /system/blockdevices and /system/filesystems |
| | 287 | * Require new pan-templates and ncm-lib-blockdevices |
| | 288 | |
| | 289 | Almost no change in command line interface : aii-shellfe, aii-installfe |
| | 290 | * --notify being reimplemented |
| | 291 | |
| | 292 | Upgrading Quattor server |
| | 293 | * Pre-requisites : CCM >= 2.0.2, pan-templates >= 2.7.1, ncm-lib-blockdevices >= 0.17 |
| | 294 | * Packages : aii-server >= 2.0.4, aii-ks >= 1.0.1, aii-pxelinux >= 1.0.0 |
| | 295 | * Configuration moved from /etc to /etc/aii |
| | 296 | * Move configuration for NBP and osinstall : remove everything under /system/aii/*/options as it will be assumed to be related to a plugin 'option' |
| | 297 | |
| | 298 | Possible contributions : |
| | 299 | * Alternative modules |
| | 300 | * Hooks : generic (e.g. SINDES support) or site specific to be used by others as a starting point |
| | 301 | * Separate directory for sharing contributions and enhancements : contrib |
| | 302 | * Extensions maintained by extension author |
| | 303 | * A generic extension may be moved to standard AII |
| | 304 | |
| | 305 | Support for AII v1 has been dropped. |
| | 306 | |
| | 307 | Already tested with Xen at NIKHEF. |
| | 308 | |
| | 309 | |
| | 310 | === Update on blockdevices layout - L. Munoz Mejias === |
| | 311 | |
| | 312 | blockdevices : |
| | 313 | * partitions_add : bulk addition of partitions to a disk. Partitions given as a pair name/size. |
| | 314 | * lvs_add : similar for LVM |
| | 315 | * Support for LVM striping added : `stripe_size` property on `logical_volumes` structure. |
| | 316 | * No support for HW raid yet |
| | 317 | * No support for quota yet : looking for suggestions about a quota schema |
| | 318 | |
| | 319 | |
| | 320 | == Core Components == |
| | 321 | |
| | 322 | === PAN Compiler - Cal === |
| | 323 | |
| | 324 | Production version 7.2.9 |
| | 325 | * Bug fixes only |
| | 326 | |
| | 327 | Version 8 in development, still in trunk |
| | 328 | * 8.1.0 tagged last week and available for evaluation |
| | 329 | * Feedback from each site would help moving this to production : large chunks rewritten |
| | 330 | * Removed features deprecated in v7 |
| | 331 | * Keywords : define, delete, description, descro |
| | 332 | * Types : embed, fetch, stream |
| | 333 | * Newly deprecated features in v8 |
| | 334 | * Bareword include : `include mytemplate;` changed to `include { 'mytemplate' };` |
| | 335 | * Using `type` for binding : must be replaced by `bind` |
| | 336 | * Lowercase automatic variables : `loadpath`, `self`, `object` |
| | 337 | * Language changes : |
| | 338 | * External path syntax : //myobject/some/absolute/path to be deprecated in favor of /my/object/tpl:/some/absolute/path |
| | 339 | * Will eventually allow object templates to be namespaced |
| | 340 | * Literal escaping of paths : can escape part of a path |
| | 341 | * More limits on structure template contents : variable, functions, include no longer allowed |
| | 342 | * New and changed functions : |
| | 343 | * format() : printf-like capabilities. Can avoid incremental building of configuration file contents. |
| | 344 | * Not printing everything by itself |
| | 345 | * is_defined(), is_boolean()... return false instead of error if variable does not exist. |
| | 346 | * String manipulation : to_lowercase/uppercase(), split(), replace() |
| | 347 | * to_string() will accept any element, undefined, null and resources included |
| | 348 | * New automatic variable : TEMPLATE. Name of the template that initiated a DML block. |
| | 349 | * Not changed with function calls, including create() |
| | 350 | * New output format : write machine template as a dot (Graphviz) file |
| | 351 | * Logging capabilities added |
| | 352 | * Several logging types : task, call, memory, all, none |
| | 353 | * Messages short and easily parsed for analysis |
| | 354 | * Cost : 15% slower with all logging |
| | 355 | * Example analysis scripts for memory usage vs. time, task per thread vs. time, graph of call (include) structure, performance studies (how much time spent in every template) |
| | 356 | * Change in global variable handling : `variable X = ...exists(X)...` always true. |
| | 357 | * Must use null instead of undef for tri-state variables |
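As an illustration of the bareword include deprecation listed above, the rewrite can be sketched as a small Python helper. This is a hypothetical script, not a tool shipped with panc; it uses a naive regular expression that ignores comments and strings, so its output should be reviewed before committing.

```python
import re

# Matches `include name;` where name is a bare template name (no braces or quotes).
BAREWORD_INCLUDE = re.compile(
    r"^([ \t]*)include[ \t]+([A-Za-z0-9_][A-Za-z0-9_./+-]*)[ \t]*;",
    re.MULTILINE,
)


def modernize_includes(source: str) -> str:
    """Rewrite bareword includes to the braced, quoted form expected by panc v8."""
    return BAREWORD_INCLUDE.sub(r"\1include { '\2' };", source)


old = "object template profile_node01;\ninclude mytemplate;\ninclude { 'other' };\n"
print(modernize_includes(old))
```

Includes that already use the braced form are left untouched, so the helper can be run repeatedly on the same template.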
| | 358 | |
| | 359 | |
| | 360 | |
| | 361 | Implementation changes in v8 : |
| | 362 | * Better handling of SELF : faster and less memory intensive, more consistent in various contexts |
| | 363 | * Some optimization : evaluation and use of compile-time expressions, specialized operators to avoid redundant checks at runtime, optimization to allow stricter syntax checking (earlier detection of some errors) |
| | 364 | * "Read-only" resources : infrastructure in place to avoid unnecessary copying of resources but not used yet. Some semantic issues to work out. |
| | 365 | * GRIF full build : 25% faster than v7 |
| | 366 | |
| | 367 | Documentation updated, including tutorial and man pages for PAN functions. |
| | 368 | |
| | 369 | Migration of standard templates to v8 required to avoid deprecation warnings |
| | 370 | * CERN has to migrate to v8 to benefit from the latest performance improvement : requires removing keywords that are no longer supported |
| | 371 | |
| | 372 | Would like to clean up panc script options. What are the options used ? |
| | 373 | * To be discussed on the mailing list |
| | 374 | |
| | 375 | Authorization/Entitlements (Morgan Stanley request) |
| | 376 | * What changed : template or configuration ? The real issue is probably configuration, and it's tricky to know the reference value. |
| | 377 | * Who did it ? How to get user identity ? |
| | 378 | * Was it authorized ? How to define authorization ? One possibility is to define parts of configuration that can be modified by a template but at the price of flexibility. |
| | 379 | * Need to refine the use case and possible design : first discuss on mailing list and try to write a wrap up, then try to implement a prototype |
| | 380 | * To be acceptable, the performance price for this should be paid only when the feature is used (and not by everybody as in panc v6). |
| | 381 | |