| | 1 | = Quattor Workshop - London - 11-13/3/09 = |
| | 2 | [[TracNav]] |
| | 3 | |
| | 4 | [[TOC(inline)]] |
| | 5 | |
| | 6 | See [http://indico.cern.ch/conferenceTimeTable.py?confId=50010&showDate=all&showSession=all&detailLevel=contribution&viewMode=parallel Agenda]. |
| | 7 | |
| | 8 | == Site Reports == |
| | 9 | |
| | 10 | === LAL - G. Philippon === |
| | 11 | |
| | 12 | === Grid Ireland - S. Childs === |
| | 13 | |
| | 14 | Little change in the number of admins or resources |
| | 15 | * Virtualization of most nodes (Xen) |
| | 16 | |
| | 17 | Developments in monitoring: |
| | 18 | * Nagios + Ganglia |
| | 19 | * MonAMI to feed Ganglia from DPM and Torque |
| | 20 | * LEMON : upgraded to lemon-web on SL5 |
| | 21 | |
| | 22 | Many tools upgraded : checkdeps, quatview |
| | 23 | * ncm-accounts massive speedup |
| | 24 | |
| | 25 | SCDB : merge "hierarchical site" model into trunk |
| | 26 | |
| | 27 | Issues: |
| | 28 | * Get network and file systems fully under Quattor control |
| | 29 | * Consistent scheme for monitoring in Quattor |
| | 30 | * Dummy WN speedup trick integrated into the compiler |
| | 31 | |
| | 32 | |
| | 33 | === NIKHEF - R. Starink === |
| | 34 | |
| | 35 | 4 clusters: 300 machines |
| | 36 | * Currently 5 people involved with Quattor |
| | 37 | |
| | 38 | SCDB + local changes: deployment not done by SCDB but by local tools to allow deployment of a specific machine |
| | 39 | * Related to the historical way of managing systems at NIKHEF but has the disadvantage that a postponed change deployment may break something later on some other nodes... |
| | 40 | |
| | 41 | Xen: 7 hosts, 38 guests |
| | 42 | * Based on QWG but issues with host and guests in different clusters: workaround found |
| | 43 | |
| | 44 | Monitoring entirely based on Nagios with 1 master and 3 slaves |
| | 45 | * Based on QWG but some mods to handle hierarchy of servers that are willing to share |
| | 46 | |
| | 47 | panc v7 to v8 transition: no problem but no performance improvement observed |
| | 48 | * Very happy with the new logging features |
| | 49 | |
| | 50 | NCM components: |
| | 51 | * ncm-openvpn to configure server and clients |
| | 52 | * ncm-yaim: complete refactoring/rewriting, new features, some backward incompatible changes |
| | 53 | |
| | 54 | Issues: |
| | 55 | * WN compilation speed-up has some problems with compile-time dependencies |
| | 56 | * Strength of community with increased usage: what about the ability to support everybody? |
| | 57 | |
| | 58 | === LAPP - E. Fede === |
| | 59 | |
| | 60 | Quattor server is running on a VMware virtual machine |
| | 61 | * 110 profiles |
| | 62 | * 4 people using it |
| | 63 | |
| | 64 | Running autobuild of RPMs for NCM components and other core components from SourceForge |
| | 65 | * http://lapp.in2p3.fr/Quattor |
| | 66 | * trunk and tags/latest |
| | 67 | * repodata available from YUM |
| | 68 | |
| | 69 | === CERN - V. Lefébure === |
| | 70 | |
| | 71 | Main instance: 7500 profiles in 139 clusters |
| | 72 | * +1200 increase |
| | 73 | * 1900 profiles corresponding to machines not managed by Quattor |
| | 74 | * Running the last version of CDB, panc v8 ready |
| | 75 | * v8: 20% improvement in compile time but not yet in production. All known issues solved |
| | 76 | * Problem of use of RECORD type by some components (ncm-httpd, ncm-tomcat....) |
| | 77 | |
| | 78 | Xen-based virtualization: support for SLC5 hypervisors ready |
| | 79 | |
| | 80 | Issues of number of users: 65 ACL groups |
| | 81 | |
| | 82 | Package list templates: working on automation |
| | 83 | * Use of comps.xml |
| | 84 | * Automatic detection of missing dependencies |
| | 85 | |
| | 86 | CDB2SQL: Python version fast but buggy, no manpower to fix it, reverted to previous version |
| | 87 | |
| | 88 | |
| | 89 | === Morgan & Stanley - N. Williams === |
| | 90 | |
| | 91 | In production now: AQB, AQDB, LEMON |
| | 92 | * 7500 nodes, compile at 10 minutes (8-core machine) but aii-shellfe --notify at 1h (with patches not yet committed) ! |
| | 93 | * 5 template admins |
| | 94 | * New building just commissioned; expect to double the number of machines in the coming months, plan to keep one server (+1 for redundancy) |
| | 95 | |
| | 96 | Issues: |
| | 97 | * Format change of XML profiles painful: dropping LINK support forces "big-bang" changes |
| | 98 | * Configuration success feedback: thinking of implementing a DB of the last time a component was run, updated by ncm-ncd/ncm-cdispd, that could be compared with the timestamp of the last configuration |
| | 99 | |
| | 100 | Will submit code now via SourceForge |
| | 101 | * Waiting for approval to open source : AQDB, FUSE interface to configuration browsing, AII, CCM patches |
| | 102 | |
| | 103 | |
| | 104 | === UAM - Laura del Caño === |
| | 105 | |
| | 106 | Luis left, Laura is his replacement. |
| | 107 | |
| | 108 | Proposal of tasks that UAM could handle: |
| | 109 | * Maintenance of monitoring tools |
| | 110 | * openvz support |
| | 111 | * AII |
| | 112 | |
| | 113 | 5 clusters |
| | 114 | * Use of ant local tasks for template management |
| | 115 | * Performance tests of new machines configured with Quattor |
| | 116 | |
| | 117 | New component in progress: |
| | 118 | * ncm-amanda to configure Amanda backup SW |
| | 119 | * ncm-pnp4nagios |
| | 120 | |
| | 121 | Some local developments: |
| | 122 | * Postgresql DB to store machine info and group them into categories + ant local task to generate the profile and some other templates (monitoring) for the machine |
| | 123 | * SinDes alternative used to manage secured access to profile (AII hook) |
| | 124 | |
| | 125 | === Greek Grid - D. Zilaskos === |
| | 126 | |
| | 127 | 1 Quattor server to manage 2 clusters representing 133 machines spanning 13 subnets |
| | 128 | * 4 Xen hosts, 19 guests |
| | 129 | * SVN server installed with Trac |
| | 130 | * 3 admins + 2 new people who recently joined |
| | 131 | |
| | 132 | Developments and issues: |
| | 133 | * Wiki guides for Quattor newbies |
| | 134 | * New components: still in progress... some services like Hydra evolving very quickly |
| | 135 | * Involvement in OAT benefits as work is implemented and tested locally with Quattor |
| | 136 | * Thinking about an administration model for Southern Europe based on GRIF/GRID Ireland experience |
| | 137 | * Lots of small sites, very limited effort available... |
| | 138 | |
| | 139 | === CNAF - A. Chierici === |
| | 140 | |
| | 141 | 90% of the templates adapted to the new schema |
| | 142 | * Inspired by QWG |
| | 143 | * Next step is migration of gLite nodes to SLC5 |
| | 144 | |
| | 145 | Xen used on 3 servers providing 16 cores each |
| | 146 | * LHCb T2 running on Xen |
| | 147 | * Quattor used to configure the profiles of guests but Dom0 managed by hand |
| | 148 | * Investigating KVM |
| | 149 | |
| | 150 | Planning to install soon new ncm-yaim and ncm-accounts |
| | 151 | |
| | 152 | CDB vs. SCDB: still thinking about migration but need to investigate the impact for the users |
| | 153 | * Is CDB still supported after ME left ? |
| | 154 | * Who will take care of new core release ? |
| | 155 | |
| | 156 | Presenting a poster about Quattor at CHEP: anyone who could help to present it ? |
| | 157 | |
| | 158 | One new person to help with Quattor at CNAF (Elisabetha) |
| | 159 | * All new sysadmins are taught to use Quattor |
| | 160 | |
| | 161 | |
| | 162 | == Aquilon Drive Through - W. Hertlein == |
| | 163 | |
| | 164 | Aquilon is an architecture to address system management at M&S, in particular scalability, management delegation, application-centric management... |
| | 165 | * Goal is to install hundreds of machines without any manual intervention |
| | 166 | |
| | 167 | AQDB is the CDB replacement: no direct interaction between users and templates, everything goes through the Aquilon broker (AQB) |
| | 168 | * Aquilon configuration stored into AQDB |
| | 169 | |
| | 170 | Workflow for provisioning machines: |
| | 171 | * Rack is the unit of work: racks delivered cabled |
| | 172 | * Limited number of vendors and models |
| | 173 | * When a rack is powered on, the top-of-rack switch (tor_switch) sends a DHCP request and receives a temporary address that allows it to be configured once its type has been discovered |
| | 174 | * Switch entered in AQDB |
| | 175 | * DNS integration : all servers configured with the same information |
| | 176 | * DNS DB built from a periodic dump of AQDB (every 3h) |
| | 177 | * DHCP integration: a DHCP server set to serve a specific set of machines |
| | 178 | * Configuration built from a periodic dump of AQDB |
| | 179 | * Discovering machines: done by scripts scanning the tor_switch and querying it with snmp get |
| | 180 | * Create a machine entry in AQDB for each discovered MAC entry |
| | 181 | * After a machine has been entered in AQDB, create a machine plenary (PAN) template describing the HW |
| | 182 | * Machine plenary template is an equivalent of SCDB hardware/machine + the service/personality |
| | 183 | * Creating hosts: logical entry for the host created and associated with a plenary template |
| | 184 | * IP is derived by choosing an available IP from the tor_switch subnet and put in the plenary template so that an IP address change does not trigger recompilation of another host |
| | 185 | * Wait for the propagation of previous information, in particular DNS: may take a couple of hours |
| | 186 | * Next day build out the hosts binding host to required services and use panc to compile the template |
| | 187 | * Invoke aii-shellfe to switch over the install pxe image (done through AQB) |
| | 188 | * DNS + Krb propagation delay: looking for some optimizations in the future |
| | 189 | * Finishing the build process: a script runs on the newly built host to update its information in AQDB |
| | 190 | * Record any management interfaces found and if successful update the status information |
| | 191 | * Grid hand-off: after successful build of a host, it is transferred to an application group |
| | 192 | * Deployment schedules are set up in advance |
| | 193 | * Use personality to set up a host with application defaults |
| | 194 | * Spread hosts for a group over several racks for better resilience |
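The host-creation step in the workflow above picks an available IP from the tor_switch subnet. A minimal sketch of that selection, assuming the set of addresses already in use is known (in Aquilon it would come from AQDB); the function name `pick_free_ip` and the example subnet are hypothetical and not part of Aquilon:

```python
import ipaddress


def pick_free_ip(subnet, used):
    """Return the first usable host address in `subnet` that is not in `used`."""
    taken = {ipaddress.ip_address(addr) for addr in used}
    # hosts() skips the network and broadcast addresses of the subnet
    for addr in ipaddress.ip_network(subnet).hosts():
        if addr not in taken:
            return str(addr)
    raise RuntimeError("no free address left in " + subnet)


# Documentation-range subnet used purely for illustration
print(pick_free_ip("192.0.2.0/29", ["192.0.2.1", "192.0.2.2"]))
```

Taking the lowest free address is only one possible policy; the notes do not say which one Aquilon applies.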
| | 195 | |
| | 196 | Managing services: a service may have several instances |
| | 197 | * 1 service instance is bound to a host |
| | 198 | |
| | 199 | Monitoring hosts to meet SLA |
| | 200 | * Hosts are monitored by a daemon that does periodic snmp sweep of all the hosts |
| | 201 | * Hosts are grouped by personality with a threshold of hosts allowed to be offline |
| | 202 | * Host personality also defines a reboot schedule: potential previous threshold violation will prevent the reboot |
| | 203 | * Include some draining capabilities |
| | 204 | * LEMON is enabled as a service to provide visualization of aggregated metrics |
| | 205 | * LEMON configured from Aquilon to guarantee consistency |
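The interaction between the offline threshold and the reboot schedule described above can be sketched as follows. This is an illustration under assumptions: the notes only say that a threshold violation prevents the reboot, so the exact rule used here (the reboot is allowed only if it keeps the number of offline hosts within the threshold) and the name `can_reboot` are hypothetical:

```python
def can_reboot(online_by_host, host, max_offline):
    """Decide whether `host` may be rebooted.

    online_by_host maps each host of one personality to True (online)
    or False (offline). The reboot is refused if taking `host` down
    would push the number of offline hosts above `max_offline`.
    """
    others_offline = sum(
        1 for name, online in online_by_host.items()
        if not online and name != host
    )
    return others_offline + 1 <= max_offline


# Hypothetical personality with three hosts, one of them already down
states = {"node1": True, "node2": True, "node3": False}
print(can_reboot(states, "node1", max_offline=2))
print(can_reboot(states, "node1", max_offline=1))
```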
| | 206 | |
| | 207 | Updating a personality involves working on a local copy of the personality and creating a new temporary one (AQ domain), compiling it with a fake profile and putting the change back into AQDB |
| | 208 | * Currently no check enforced by 'aq put' that the new version compiles. Rely on conventions... |
| | 209 | * A test host can be associated with the new personality and reconfigured to validate the changes |
| | 210 | * AQ then allows the changes to be merged into a production personality and all nodes using it to be reconfigured |
| | 211 | |
| | 212 | |
| | 213 | == Monitoring == |
| | 214 | |
| | 215 | === Monitoring Templates in QWG - S. Kenny === |
| | 216 | |
| | 217 | LEMON: |
| | 218 | * NCM components: fmonagent, oramonserver |
| | 219 | * Templates: standard/monitoring/lemon |
| | 220 | * Web front-end via filecopy |
| | 221 | |
| | 222 | Nagios: |
| | 223 | * NCM components: nagios, ncg |
| | 224 | * Templates: standard/monitoring/nagios |
| | 225 | * A few changes to support Nagios 3 |
| | 226 | * Hosts created from HW DB |
| | 227 | * Services defined as separate templates and added to NAGIOS_SERVICE_TEMPLATES variable but not very scalable |
| | 228 | * ncm-ncg currently being developed to produce an input file for WLCG NCG which generates the service definition: looks promising |
| | 229 | |
| | 230 | Ganglia: configured with filecopy |
| | 231 | * Would be better to have a component generating the required config file on client and server from the site hierarchy description |
| | 232 | |
| | 233 | MonAMI: configured with filecopy |
| | 234 | |
| | 235 | Currently, every monitoring tool has its own configuration. Ideally the monitoring schema should be mostly tool-independent. Proposed model based on current LEMON config: |
| | 236 | * Host: coming from DB_MACHINE |
| | 237 | * Cluster: group of nodes sharing the same node types |
| | 238 | * Super-cluster: |
| | 239 | * Need to be part of the information in the node profile so that it can be used by several components |
| | 240 | * May be connected with some representation of M&S personalities into the schema |
| | 241 | |
| | 242 | |
| | 243 | == Core Components == |
| | 244 | |
| | 245 | === PAN Compiler - C. Loomis === |
| | 246 | |
| | 247 | Code hosted on SourceForge since 8.2.6 |
| | 248 | * Bug tracking moved to SF too |
| | 249 | |
| | 250 | Production version is v8 |
| | 251 | * v7 deprecated |
| | 252 | * 8.2.3 : CDB compatibility, annotation |
| | 253 | * 8.2.4 : selective debugging |
| | 254 | * 8.2.7 introduces prepend/append functions to replace push/npush |
| | 255 | |
| | 256 | Outstanding bugs: |
| | 257 | * Race condition in validation |
| | 258 | * Corner cases with unintuitive behaviour |
| | 259 | * Enforce final flag in structure templates: no real request... |
| | 260 | |
| | 261 | Enhancement requests (in priority order): |
| | 262 | 1. Restricted include (aka entitlements) |
| | 263 | 1. Add perf tips to documentation |
| | 264 | 1. XInclude directive to replace Embedded |
| | 265 | 1. Add OBJECT to debug() and error() output |
| | 266 | 1. Add prefix/define statements to shorten literal paths |
| | 267 | 1. Internationalize error messages |
| | 268 | 1. Enable/disable debugging from within pan |
| | 269 | 1. Allow include to take a list (from M&S) |
| | 270 | |
| | 271 | |
| | 272 | Other ideas/wishes: |
| | 273 | * Better Eclipse integration: editor, debugger, dialog boxes to select options for ant/pan |
| | 274 | * Ability to include a file which is not a template inside a template as an alternative to `<<EOF` |
| | 275 | * File name should be relative to the loadpath as for other templates |
| | 276 | * Rework string escaping handling for automatic escaping by the compiler and a prefix in the string to know if it has been escaped or not without ambiguity |
| | 277 | |
| | 278 | Restricted include statements: design questions |
| | 279 | * Allow the scope of configuration tree changes to be specified when including a template ? |
| | 280 | * - : require modification of templates to do entitlement |
| | 281 | * Specified allowed areas of tree ? or restricted (disallowed) areas ? |
| | 282 | * Need both; to be discussed whether exclude is applied before include or the opposite |
| | 283 | * Can '/a' be fixed but '/a/b/c' be changed ? |
| | 284 | * Enforce entitlement on loadpaths ? |
| | 285 | * What about variables ? |
| | 286 | * Allow wildcards and/or regular expressions on the template names ? |
| | 287 | |
| | 288 | |
| | 289 | === NCM Components - N. Williams === |
| | 290 | |
| | 291 | ncm-cron: more smearing |
| | 292 | * `frequency` may be replaced by `timing`, an nlist that specifies a target (eg. interval into hours) + a smear interval |
| | 293 | * hours/days/month/... are strings that allow expressions and are validated |
| | 294 | * If present `timing` takes precedence over `frequency` |
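To illustrate the target + smear idea, here is a minimal sketch for a minute field. It assumes the smear is derived deterministically from the host name so that each host keeps the same slot between runs; the notes do not say how ncm-cron actually picks the offset, and the function name `smeared_minute` is hypothetical:

```python
import hashlib


def smeared_minute(hostname, target, smear):
    """Return a minute in [target, target + smear], wrapped at the hour.

    The offset is taken from a hash of the host name: different hosts
    are spread over the smear interval, but a given host always gets
    the same value.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big") % (smear + 1)
    return (target + offset) % 60


# Hypothetical host names, target minute 10 with a 20-minute smear
for name in ("wn001.example.org", "wn002.example.org"):
    print(name, smeared_minute(name, target=10, smear=20))
```

A random offset chosen at configuration time would spread the load equally well, at the cost of the crontab entry changing on every run.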
| | 295 | |
| | 296 | ncm-download: retrieve URL |
| | 297 | * Functionality similar to filecopy but contents are downloaded with curl from a URL |
| | 298 | * Support proxies |
| | 299 | * Can use `spnego` support in curl for authenticated access to URL, using Krb principal for example |
| | 300 | * Support global definition of http server to use and relative URLs |
| | 301 | * Subclassable: a method allows the component to be told which configuration path to use instead of the default, so that it can be called by another component for one specific file |
| | 302 | * An idea to reuse in other components like ncm-filecopy |
| | 303 | * A problem with ncm-ncd which might not match exception |
| | 304 | |
| | 305 | CCM: support for local CCM DB format |
| | 306 | * Specified with `dbformat`: can be `GDBM_File`, `DB_File`, `CDB_File` (compact DB, nothing to do with Quattor CDB!) |
| | 307 | * Format is stored in file `.fmt` close to `.db` file: only the writers need to take care of it, read is properly handled by CCM library |
| | 308 | * Download ''if-modified-since'' local disk version, ignoring the mtime in CDP notification message |
| | 309 | * ccm-fetch now uses the library coming from ccm-fetch.new (3 years old!) |
| | 310 | |
| | 311 | ncm-cdispd/ncd: would like to add configuration status feedback |
| | 312 | * `/var/lib/ncm` contains an entry per component: |
| | 313 | * If the file is empty, component must run |
| | 314 | * If run successfully, delete the file |
| | 315 | * If failed, write the exception into the file |
| | 316 | * If a component is inactive, ncm-cdispd removes the file for the component in `/var/lib/ncm` unconditionally (without checking its contents) |
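The proposed status-file convention can be sketched as below. This is an illustration of the rules listed above, not the ncm-cdispd/ncm-ncd code (which is Perl); the function names are hypothetical and a temporary directory stands in for `/var/lib/ncm`:

```python
import os
import tempfile


def mark_pending(state_dir, component):
    """Create an empty file: the component must run."""
    open(os.path.join(state_dir, component), "w").close()


def record_result(state_dir, component, error=None):
    """Delete the file on success, write the exception text on failure."""
    path = os.path.join(state_dir, component)
    if error is None:
        if os.path.exists(path):
            os.remove(path)
    else:
        with open(path, "w") as handle:
            handle.write(str(error))


def status(state_dir, component):
    """Map the file state to 'ok' (no file), 'pending' (empty) or 'failed'."""
    path = os.path.join(state_dir, component)
    if not os.path.exists(path):
        return "ok"
    return "pending" if os.path.getsize(path) == 0 else "failed"


with tempfile.TemporaryDirectory() as state_dir:
    mark_pending(state_dir, "accounts")
    print(status(state_dir, "accounts"))
    record_result(state_dir, "accounts", error="cannot write /etc/passwd")
    print(status(state_dir, "accounts"))
    record_result(state_dir, "accounts")
    print(status(state_dir, "accounts"))
```

Removing the file of an inactive component amounts to calling `record_result` without an error. With this convention a missing file cannot distinguish a successful component from an inactive one.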
| | 317 | |
| | 318 | spma/aii-ks: randomization of the proxy chosen when multiple are available |
| | 319 | * If the AII server processing the kickstart file is in the list of proxies, use it |
| | 320 | * Support for a list of "installack" servers, requested in parallel to ensure consistency when several AII servers are used in parallel |
| | 321 | * Require ncm-nscd start before starting configuration to ensure name service can be restarted without impact on the configuration process |
| | 322 | |
| | 323 | AII: --notify, --firmware, use_fqdn |
| | 324 | * `--notify`: automatically configures anything that needs configuring, based on `profiles-info.xml` |
| | 325 | * `--firmware` sets alternate pxe boot target to boot relevant firmware installer based on HW information from the profile |
| | 326 | * Schema change required |
| | 327 | * Reset done by another external tool |
| | 328 | * Global lock no longer requested for aii-shellfe, only for aii-dhcp. |
| | 329 | * Only a lock on the specific client host |
| | 330 | * Fixed caching, created multi-level cache hierarchy where first level is the DNS domain name |
| | 331 | |
| | 332 | |
| | 333 | |
| | 334 | == QWG == |
| | 335 | |
| | 336 | |
| | 337 | == SourceForge Migration - S. Childs == |
| | 338 | |
| | 339 | Main source repository now on SF as a snapshot (without history) |
| | 340 | * 22 developers registered: many active developers still missing |
| | 341 | * Send your SF account to Stephen |
| | 342 | * Build tools basically working but unmaintainable now that ME left |
| | 343 | |
| | 344 | quattor.org now is a wiki (MediaWiki) |
| | 345 | * An account is required to edit pages |
| | 346 | * Feel free to add new pages for experimental stuff without linking it to other pages |
| | 347 | |
| | 348 | Issue tracker used only by panc so far. |
| | 349 | * Would be better if we could use the Trac one as it seems to be available on SF |
| | 350 | |
| | 351 | Download menu: mainly pan currently, difficult to automate updates |
| | 352 | |
| | 353 | Documentation menu is unused as this is only plain web pages |
| | 354 | |
| | 355 | Mailing lists still hosted at CERN: next candidate for migration |
| | 356 | * Create new lists with current subscribers |
| | 357 | * Block submission on mailing lists at CERN |
| | 358 | |
| | 359 | QWG code and documentation still hosted at LAL |
| | 360 | * Trac now available on SF, test it as it would be much easier |
| | 361 | * Check if Trac DB can be reimported and how to migrate SVN (snapshot would be acceptable). |
| | 362 | |
| | 363 | User branches: no problem, create them in branches, currently unused. |
| | 364 | |
| | 365 | Announcing new tags on the mailing list: no consensus, not sure it will be manageable |
| | 366 | * Would be preferable to have a dashboard, let's discuss it as part of QBT future |
| | 367 | |
| | 368 | |
| | 369 | == Quattor Datawarehousing - N. Williams == |
| | 370 | |
| | 371 | Quattor and databases: |
| | 372 | * Managing relations between objects and metadata having integrity constraints should involve a relational model and is generally connected to the enterprise |
| | 373 | * Managing configuration behaviours should be the task of an appropriate language like PAN |
| | 374 | * What is configured where ? requires a DB like CDB2SQL |
| | 375 | * Not only display of existing configuration with different views but also historical views |
| | 376 | |
| | 377 | QuatView: uploads selected information from profiles into MySQL, displays with a web browser |
| | 378 | * Big limitation : single timestamp |
| | 379 | * Currently only MySQL but should be easy to use other back-ends |
| | 380 | * Recent additions/enhancements to make the web client much more flexible and usable, eg. search on every column |
| | 381 | |
| | 382 | CDB2SQL: similar features to QuatView but without the web tool to browse the DB |
| | 383 | * Use DBI API |
| | 384 | * Use DB bulk insert |
| | 385 | * Implemented as a server module that can run anywhere |
| | 386 | * Detect changed profiles |
| | 387 | * `-ora` version is in fact not Oracle specific (MySQL is still the default) but has a lot of Oracle specific additions (mainly views) |
| | 388 | * `-dist` version more recent and faster but currently Oracle specific, although not hard to change |
| | 389 | * cdb2sql just schedules uploads of XML that are done by a multi-threaded process |
| | 390 | |
| | 391 | Server modules: |
| | 392 | * CDB and SCDB send CDP notification that profiles have changed |
| | 393 | * SCDB uses a CCM CDP notification sent to the host affected |
| | 394 | * CDB sends a message to a server in charge of taking appropriate actions for the affected clients |
| | 395 | * Involves calling cdb2sql-sync |
| | 396 | |
| | 397 | '''Proposal''': use cdb2sql with Quatview web interface |
| | 398 | * Extend schema to provide versioned data : from underlying SCM ? from XML ? |
| | 399 | * Build into the profile by panc ? |
| | 400 | * Include revision tag in data |
| | 401 | * Publish datawarehouse to AII instead of notifying datawarehouse and AII in parallel |
| | 402 | |
| | 403 | |
| | 404 | == Quattor and Inventory Management == |
| | 405 | |
| | 406 | === CERN - V. Lefébure === |
| | 407 | |
| | 408 | CDB2SQL to retrieve information from Quattor CDB. All other applications using DB produced by CDB2SQL |
| | 409 | |
| | 410 | CERN specific schema extension to describe the HW. |
| | 411 | * Include the template name describing the HW |
| | 412 | * Track warranty contract ID |
| | 413 | * Track power consumption |
| | 414 | * Also used for machines not managed by Quattor to keep track of them in the inventory |
| | 415 | |
| | 416 | CC Tracker: displays a geographical view of the machines managed by Quattor |
| | 417 | |
| | 418 | HMW: application to handle installation, move, rename, retire... of a machine |
| | 419 | * Form-based application with a lot of parameters pre-filled from the spreadsheet delivered with the machines |
| | 420 | * Interconnect with other DB/apps involved: Remedy, LAN DB |
| | 421 | |
| | 422 | Request for new HW done by users using a Web form |
| | 423 | |
| | 424 | |
| | 425 | === Morgan & Stanley Asset Management - S. d'Aquila === |
| | 426 | |
| | 427 | ''Note: M&S interested to hear feedback on how this work may be useful for the community.'' |
| | 428 | |
| | 429 | Aurora was relying heavily on AFS |
| | 430 | * Configuration lives in hand-crafted files, prone to human error, impossible to browse in a reasonable amount of time |
| | 431 | * Poor quality of data about hosts |
| | 432 | * No fine-grained entitlement |
| | 433 | |
| | 434 | Aquilon design goals: |
| | 435 | * Model systems by what makes them similar rather than different |
| | 436 | * Manage resources at a higher level with no manual intervention and application-centric view |
| | 437 | * Avoid NIH syndrome |
| | 438 | * Vendor neutral with the option to open source |
| | 439 | * Build a system that is easy to interact with |
| | 440 | * Mix of declarative configuration (configure this machine as a mail server) and prescriptive configuration where the entire configuration is under the control of a configuration management tool. |
| | 441 | * Ensure consistency between server and clients configuration |
| | 442 | |
| | 443 | Why use a DB: |
| | 444 | * Fast efficient retrieval of data with referential integrity |
| | 445 | * Concurrency control, transaction, fine-grain locking for free |
| | 446 | * Be a source of information for legacy systems |
| | 447 | |
| | 448 | Technology choices: |
| | 449 | * Twisted Python: event driven network application framework |
| | 450 | * SqlAlchemy: object relational mapper and SQL toolkit |
| | 451 | |
| | 452 | Assets managed: |
| | 453 | * Hardware: machines and their components |
| | 454 | * Real estate they occupy |
| | 455 | * Namespaces: DNS domains, reserved tcp/udp, users, groups |
| | 456 | * Services |
| | 457 | |
| | 458 | Lot of effort put in architecture definition and taxonomy: |
| | 459 | * Personality: what is machine role ? |
| | 460 | * Archetype: philosophical basis behind the build process |
| | 461 | * Examples: Aquilon, Aurora, Aegis, Windows |
| | 462 | * Future: clusters of VMware, network gear, SAN/NAS devices... |
| | 463 | * Location: a flexible way of hierarchically organizing our stuff (regions, campus, building...) |
| | 464 | * Configuration assumption: it is usually better to use a local service instance rather than a remote one, or at least one as close as possible |
| | 465 | |
| | 466 | A system configuration is made of: |
| | 467 | * Hostname (and associated interfaces) |
| | 468 | * Archetype and its requirements |
| | 469 | * Personality |
| | 470 | * Services it uses structured as an ordered list of templates |
| | 471 | * For each service, define instances: a host providing the service, its location and the template associated |
| | 472 | * Service mapping responsible for automatic selection of instance based on requirements defined somewhere else like archetype, personality |
| | 473 | |
| | 474 | Future directions: |
| | 475 | * Same kind of requirements for archetypes to personalities |
| | 476 | * Advanced entitlement and audit capabilities |
| | 477 | |
| | 478 | |
| | 479 | == Quattor in Amazon Cloud - C. Loomis == |
| | 480 | |
| | 481 | AWS vs. Xen: |
| | 482 | * Network configuration: all machines have private and public IP addresses but users cannot predict or allocate them before starting the machine. |
| | 483 | * Network interface uses the private address for configuration |
| | 484 | * DNS contains only public address, not the private one |
| | 485 | * IP address can be changed on the fly when using Elastic IP |
| | 486 | * hostname command doesn't return the DNS name associated with the public IP address |
| | 487 | * Installation: PXE not supported for installation of the machine, must start from an existing machine image |
| | 488 | * Must use limited list of supported kernels: only RHEL5 kernels |
| | 489 | |
| | 490 | Current config: |
| | 491 | * Quattor server in the cloud: quattor.stratuslab.org |
| | 492 | * Only packages and profiles, no AII. Only httpd |
| | 493 | * SVN at SixSq |
| | 494 | * Base VM image used already has the basic quattor client installed + a script run during first boot as part of init.d to do the first ccm-fetch and ncm-ncd |
| | 495 | |
| | 496 | Issues and questions: |
| | 497 | * Multiple machines can use the same profile: easy and clean way to define only one WN per site. |
| | 498 | * Machine names not known at compile time : how to link batch server and clients, nfs server and clients ? How to handle late binding ? |
| | 499 | * Change notifications fail : no link between profile name and machine name |
| | 500 | * Allow machines to register for changes ? |
| | 501 | * Move to "chat room" (Jabber?) messaging for changes ? |
| | 502 | * Workflow: how to manage image disks, IP addresses, machine lifecycle ? |
| | 503 | * Should Quattor manage only images instead of machines ? |
| | 504 | |
| | 505 | |
| | 506 | == Virtualization Update - S. Childs == |
| | 507 | |
| | 508 | Xen: |
| | 509 | * ncm-xen pretty stable but needs some work |
| | 510 | * Should delete managed configuration files when removed from profile |
| | 511 | * Filesystems code should be removed in favour of ncm-filesystems |
| | 512 | * QWG helper code: "database" mapping guests to hosts, xen/configure_guests() function |
| | 513 | * configure_guests() populates `/software/components/xen/domains` from guest templates, automatically sets `/hardware/location` to the DOM0 host name |
| | 514 | * Guests and host in different clusters: a SCDB workaround implemented by adding the guest clusters in `cluster.pan.includes` |
| | 515 | |
| | 516 | Current virtualization deployment: |
| | 517 | * TCD and NIKHEF: Xen + QWG |
| | 518 | * NIKHEF has some local tricks for cross-cluster HW handling |
| | 519 | * CERN: Xen + enclosures |
| | 520 | * Enclosures are close to `XEN_DB` : a nlist of parents with their children as the value |
| | 521 | * M&S: VMware + XML injection from config into hypervisor |
| | 522 | * No way to manage VMware server with Quattor (black box) |
| | 523 | |
| | 524 | openvz: used by UAM, currently in testing, configured with Quattor |
| | 525 | |
| | 526 | VM migration: only proposal from Luis, on http://quattor.org |
| | 527 | * Is it within the scope of Quattor or should it be integrated in some new lifecycle workflow manager ? |
| | 528 | * Need to find concrete use case for VM migration ? |
| | 529 | |
| | 530 | Enclosures: generic model for host/guests dependencies |
| | 531 | * VM, blades... |
| | 532 | * Move to it ? |
| | 533 | |
| | 534 | |
| | 535 | == European FP7 Quattor Project - C. Loomis == |
| | 536 | |
| | 537 | Goal: get additional manpower to make some significant changes to Quattor toolkit and do the documentation |
| | 538 | * Currently a strong community with several developers |
| | 539 | * But no dedicated people: mainly fixing urgent problems |
| | 540 | |
| | 541 | FP7 Infrastructure calls in Sept. 09 related to EGI |
| | 542 | * One of them related to middleware, including tools and services for deployment |
| | 543 | |
| | 544 | Institutional requirements: |
| | 545 | * Need to identify a lead member, CNRS could act |
| | 546 | * Must typically involve at least 3-5 different countries |
| | 547 | * Partners must be legal bodies, including JRU |
| | 548 | * Both academic and commercial partners |
| | 549 | |
| | 550 | Project requirements: |
| | 551 | * Scale of funding must fit in the call guidelines. |
| | 552 | * Usually significant matching effort required: European funding only covers 50-80% of the project cost |
| | 553 | * Matching resources may be people and manpower |
| | 554 | * Must have JRA, NA and SA components |
| | 555 | * Must include dissemination and sustainability plans: should fit easily with the SF move |
| | 556 | |
| | 557 | Project outline proposed: |
| | 558 | * NA: management, quality control, dissemination/training |
| | 559 | * Pre-packaged appliance to ease starting with Quattor |
| | 560 | * SA: build and test infrastructure, release management |
| | 561 | * Focus on Quattor use for gLite, QWG may be a good link with production infrastructure |
| | 562 | * JRA: improvement of current tools, control of cloud and virtualized resources, full lifecycle management, IPv6 support |
| | 563 | * EU has a strong push on IPv6 |
| | 564 | * For current tools, focus on move to API allowing language independence |
| | 565 | * Must take into account that SA activities are better funded than others |
| | 566 | |
| | 567 | Concrete tasks for building the project: |
| | 568 | * Need to identify partners and the project activities they are interested in: must cover all major parts of the Quattor toolkit |
| | 569 | * Must define the roadmap for the new features we want |
| | 570 | * Determine plans for dissemination and sustainability |
| | 571 | |
| | 572 | Timeline for preparing the project: |
| | 573 | * June: identify the partners and lead partner |
| | 574 | * Detailed outline of project by Sept. |
| | 575 | * Finalize project description and financial aspects by Dec. |
| | 576 | |
| | 577 | |
| | 578 | == Improving Quattor's Accessibility - C. Loomis == |
| | 579 | |
| | 580 | Quattor has a steep and difficult learning curve
| | 581 | * Both for users and developers
| | 582 | * Some is linked to the comprehensive nature of the toolkit but not all |
| | 583 | * Some reasons include: |
| | 584 | * Inadequate documentation, despite the improvements |
| | 585 | * Treating Quattor as an all-or-nothing affair
| | 586 | * Leftovers from old projects |
| | 587 | |
| | 588 | Inconsistent branding: `Quattor` name is very seldom seen in the individual tools |
| | 589 | * Very difficult for new users/admins to identify Quattor parts: daemons, config file names... |
| | 590 | * Move all config files to `/etc/quattor` |
| | 591 | * Review acronyms and change those which are not useful |
| | 592 | * Need to decide what we want to change and define a roadmap for a non-disruptive change
| | 593 | |
| | 594 | Documentation: |
| | 595 | * Lack of a good overview document |
| | 596 | * Incomplete and often outdated |
| | 597 | * Difficult to relate to a particular release of a tool
| | 598 | * Automate documentation as much as possible |
| | 599 | * For example, use annotation in templates to document ''public'' variables |
| | 600 | |
| | 601 | Tutorials: |
| | 602 | * Very important but material often outdated |
| | 603 | * Generally focused on managing a whole site with Quattor |
| | 604 | * Need some tutorials on how to incrementally add Quattor to an existing management infrastructure |
| | 605 | * Explore Amazon AWS as a way to provide testing resources to people who want to evaluate Quattor |
| | 606 | |
| | 607 | Multiple APIs/Tools |
| | 608 | * Better document strengths and weaknesses of similar tools (e.g. SCDB/CDB)
| | 609 | * Have a good justification for multiple tools |
| | 610 | * Need a way to mark deprecated components and APIs |
| | 611 | |
| | 612 | Coding practices: a set of coding practices exists (at least for Perl-based components) and is widely agreed upon, but it is almost completely unenforced
| | 613 | |
| | 614 | Missing functionalities |
| | 615 | * PAN: Eclipse-based editor and debugger |
| | 616 | * Components: lots of duplicated code for skeleton of component, cannot run component as standalone script |
| | 617 | * Hooks for monitoring Quattor status |
| | 618 | * Messaging infrastructure inside Quattor |
| | 619 | |
| | 620 | Methodology:
| | 621 | * Agree on a list of problems
| | 622 | * Evaluate the impact of fixing each of the problems
| | 623 | * Implement them along with other developments
| | 624 | * Automate as much as possible: prevent regressions on problems that have been fixed
| | 625 | |
| | 626 | |
| | 627 | == Quattor SW Process - C. Loomis == |
| | 628 | |
| | 629 | Current situation: |
| | 630 | * Quattor build tools mostly standardized with some ad hoc checks for conventions
| | 631 | * Nightly build performed from trunk |
| | 632 | * No dashboard of current state of code base |
| | 633 | * Few automated tests of Quattor tools |
| | 634 | * Little gathering of documentation: a script exists but needs to be integrated into SF
| | 635 | * No quality checks of the Perl code |
| | 636 | * panc uses FindBugs for Java; similar tools exist for Perl
| | 637 | |
| | 638 | Continuous build: suggest moving to a system like [https://hudson.dev.java.net Hudson]
| | 639 | * Language agnostic |
| | 640 | * Build when there is a code change |
| | 641 | * Can define multiple builds and hierarchies of builds |
| | 642 | * Provides a dashboard of current and past results |
| | 643 | * Must be done outside SF |
| | 644 | |
| | 645 | Coding standards and development guidelines |
| | 646 | * Good documents from Luis but nothing to enforce them |
| | 647 | * Should start by running [http://perlcritic.com Perl::Critic]: highly configurable, large set of code checks, possibility to include project-specific requirements
| | 648 | |
| | 649 | Unit tests: |
| | 650 | * Having a good set of unit tests makes refactoring and changing code more reliable |
| | 651 | * Many choices for framework for Perl: Perl's own may be easiest |
| | 652 | * Need to implement in a way that many tests come along for free |
| | 653 | * Not easy for NCM components as the test should include the action done: need to add ability in underlying libraries to use another root |
| | 654 | * May be used for other purposes like configuring images |
| | 655 | |
| | 656 | Documentation: something like Perl::Tidy may help to generate documentation in a format easily usable in a central place |
| | 657 | * Can tidy Perl code, extract POD documentation and produce HTML pages
| | 658 | |
| | 659 | Quattor build tools: contain many things related to code checking that should be moved to more appropriate tools
| | 660 | * Try to design a new generation of QBT starting with the core features of the current ones (probably tagging releases) and adding checks later with appropriate tools
| | 661 | |
| | 662 | |
| | 663 | == Future Features == |
| | 664 | |
| | 665 | Working groups of people collaborating on the development of new features, before a final, complete release
| | 666 | * Monitoring, visualization/datawarehousing, virtualization, messaging infrastructure... |
| | 667 | |
| | 668 | Template viewer: evaluate how to move forward based on panc output |
| | 669 | * tplview still based on direct parsing of pan templates |
| | 670 | |
| | 671 | |
| | 672 | == Conclusions == |
| | 673 | |
| | 674 | Decisions: |
| | 675 | * Software contribution and fixes: code maintenance is a common responsibility, everybody is allowed to commit a change in any component
| | 676 | * No need to get permission from the maintainer |
| | 677 | * Maintainer is a person with a global overview who is able to help other people with their contributions and who may review the changes
| | 678 | |
| | 679 | Actions: |
| | 680 | * Merge monitoring nagios3/ templates into the main nagios/ templates: Guillaume/Michel |
| | 681 | * Short term: remove nagios3/ from branches |
| | 682 | * Migrate mailing lists to SF |
| | 683 | * Setup of Trac on SF: Michel |
| | 684 | * Migration of QWG/SCDB to SF: Stephen/Michel, medium priority |
| | 685 | * Roadmap and strategy for consistent naming of command names, config files... |
| | 686 | * Perl module namespace: N. William |
| | 687 | * Command names: Stephen, use `quattor-` as a common prefix |
| | 688 | * Config files: Michel, move to `/etc/quattor` |
| | 689 | * Proposal for a new generation QBT : Cal |
| | 690 | * Evaluate possibility of a chat room in SF for support and other real-time discussions |
| | 691 | * Software tools: |
| | 692 | * Autobuild: Eric to look at Hudson and other continuous build systems
| | 693 | * Perl::Critic: Cal |
| | 694 | * Test frameworks: Nick |
| | 695 | * Perl::Tidy or similar tools: Stephen |
| | 696 | * Review "dummy WN" hack and try to reimplement in a more sensible way: Stephen |
| | 697 | * Prepare FP7 project: coordinated by Cal, expression of interest by others |
| | 698 | * Make schema more modular if possible: Cal |
| | 699 | |
| | 700 | Next workshop: |
| | 701 | * Proposal: Brussels for the next one, Greece for winter 2010
| | 702 | * Fix the dates with a Doodle as soon as possible, target: end of October |
| | 703 | * Factor-out specific topics, like gLite templates (1/2 day before or after) |
| | 704 | * Parallel sessions for specific working groups or topics
| | 705 | * Plan one for the FP7 project
| | 706 | * Consider a 1- or 2-day tutorial
| | 707 | * Co-located to allow contacts between developers and users?
| | 708 | * Tutorials at LISA or similar
| | 709 | * Announce widely |
| | 710 | * Fill agenda early: name programme chairs, identify main topics
| | 711 | |
| | 712 | |
| | 713 | |
| | 714 | |