| | 1 | = Quattor Workshop - CERN - 16-18/3/2011 = |
| | 2 | [[TracNav]] |
| | 3 | |
| | 4 | [[TOC(inline)]] |
| | 5 | |
| | 6 | |
| | 7 | [http://indico.cern.ch/conferenceTimeTable.py?confId=118603#all.detailed Agenda]. |
| | 8 | |
| | 9 | == Quattor Status Report - M. Jouvin == |
| | 10 | |
| | 11 | See [http://indico.cern.ch/getFile.py/access?contribId=16&sessionId=13&resId=0&materialId=slides&confId=118603 Slides]. |
| | 12 | |
| | 13 | == Core Tools == |
| | 14 | |
| | 15 | === Quattor Configuration Modules - L. Munoz Mejias === |
| | 16 | |
| | 17 | Recent addition |
| | 18 | * `ncm-ofed` (from Stijn): Infiniband configuration, very preliminary |
| | 19 | |
| | 20 | Obsolete/suspect modules |
| | 21 | * `ncm-http`: used only at CERN |
| | 22 | * `ncm-tomcat`: used only at CERN |
| | 23 | * `ncm-tomcat2`: who uses it? |
| | 24 | * `ncm-rproxy`: replaced by `ncm-squid`? |
| | 25 | * `ncm-smartd` and `ncm-smartd5`: same schema, what is the difference? |
| | 26 | |
| | 27 | Component quality: testing is tedious, old bugs often come back... |
| | 28 | * CAF is a critical tool to help with component reliability |
| | 29 | * Can run commands, update/write files with traceability and a real test suite |
| | 30 | * LC status: still maintained, private copy in Quattor |
| | 31 | * LC::File should not be used directly: use CAF::FileWriter and CAF::FileEditor instead |
| | 32 | * LC::Process no longer runs commands through a subshell: the new version will allow splitting the command on whitespace |
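
To illustrate what dropping the subshell means in practice, here is a minimal Python sketch. It is an analogy only: it is not the LC::Process or CAF::Process API, and the function name `run_without_shell` is made up for this example. The command string is split on whitespace and executed directly, so shell metacharacters such as `;` or `$HOME` reach the program as literal arguments. It assumes an `echo` executable is available on the PATH.

```python
import subprocess


def run_without_shell(command):
    """Split a command string on whitespace and run it without a shell.

    Hypothetical helper for illustration; not part of any Quattor library.
    Returns (exit code, captured standard output).
    """
    argv = command.split()  # plain whitespace split, no quoting rules
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout


if __name__ == "__main__":
    # ';' and '$HOME' are not interpreted: they are passed to echo as-is.
    code, output = run_without_shell("echo hello; echo $HOME")
    print(code, repr(output))
```

A component that relied on shell features (pipes, globbing, variable expansion) inside the command string would stop working under this model, which is the kind of breakage such a change can expose.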
| | 33 | |
| | 34 | `ncm-accounts`: rewritten twice in the last 2 years, the hope is that v6 will fix all problems for good |
| | 35 | * Very fast with large number of accounts |
| | 36 | * It correctly handles creation of user's home directories |
| | 37 | |
| | 38 | `ncm-network`: lots of complaints, ugly code, needs rewriting; must review what the component should really do |
| | 39 | * Define IP config for network cards |
| | 40 | * Define bridge devices: required for virtualization support |
| | 41 | * Tuning of device parameters |
| | 42 | * Configuration of channel bonding: kernel module manipulations should be done with ncm-modprobe |
| | 43 | * Must work on diskless servers: no more restart of the network just for testing |
| | 44 | * TCP-parameters should be controlled by ncm-sysctl |
| | 45 | * IPv6 support: schema extension needed |
| | 46 | * Loic may try to look at it |
| | 47 | |
| | 48 | `ncm-sudo`: CERN needs to be able to manage sudoers.ldap |
| | 49 | |
| | 50 | `ncm-useraccess`: proposal to remove PAM-based ACL support |
| | 51 | * Mistake to have put it here, use `ncm-pam` instead |
| | 52 | |
| | 53 | `ncm-authconfig` rewritten with LDAP support |
| | 54 | |
| | 55 | `ncm-modprobe`: support for SL4 to SL6 |
| | 56 | * SL3 support removed |
| | 57 | |
| | 58 | `ncm-spma` v2: new incompatible schema, 2-phase upgrade needed |
| | 59 | |
| | 60 | `ncm-chkconfig`: problem with removal of ability to specify both `on` and `off` |
| | 61 | * Need the ability to say "set service on at runlevels 3,4,5 but off at other levels" |
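
The requirement above (a service on at some runlevels and explicitly off at all the others) can be sketched as follows. This is a hypothetical illustration in Python, not the `ncm-chkconfig` schema or code: the function name `chkconfig_commands` and the way runlevels are passed in are assumptions made for the example. It only relies on the standard `chkconfig --level <levels> <service> on|off` command form.

```python
ALL_RUNLEVELS = "0123456"


def chkconfig_commands(service, on_levels):
    """Build chkconfig invocations: 'on' for on_levels, 'off' elsewhere.

    Hypothetical helper: on_levels is a string such as "345".
    The commands are returned as argument lists, not executed.
    """
    unknown = [level for level in on_levels if level not in ALL_RUNLEVELS]
    if unknown:
        raise ValueError(f"unknown runlevel(s): {unknown}")

    on = "".join(level for level in ALL_RUNLEVELS if level in on_levels)
    off = "".join(level for level in ALL_RUNLEVELS if level not in on_levels)

    commands = []
    if on:
        commands.append(["chkconfig", "--level", on, service, "on"])
    if off:
        commands.append(["chkconfig", "--level", off, service, "off"])
    return commands


if __name__ == "__main__":
    for argv in chkconfig_commands("sshd", "345"):
        print(" ".join(argv))
```

Deriving the `off` set as the complement of the `on` set means a single list of runlevels is enough to express both halves of the request.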
| | 62 | |
| | 63 | CCM: no changes since Feb. 2009, MS patches still not there? |
| | 64 | * Still no support for local profiles required for easier testing |
| | 65 | |
| | 66 | `ncm-ncd`: version 1.3 by MS allows subclassing |
| | 67 | * Some components won't work with 1.2 in the future |
| | 68 | * Should try to enable `-t` in Perl to get warnings about dangerous components |
| | 69 | * `cdp-listend` and `ncm-cdispd` should drop privileges when not needed |
| | 70 | |
| | 71 | RHEL6 support: another platform to maintain... introduces Perl 5.12 |
| | 72 | * Major language changes but backward compatible |
| | 73 | * Use stricter modes in Perl 5.12 to identify improvements needed to existing components |
| | 74 | * Don't jump too quickly into new, incompatible features |
| | 75 | * Core libraries (CAF, LC) are not affected by the changes |
| | 76 | |
| | 77 | Some actions needed |
| | 78 | * Identify obsolete components and move them to another location (delete them?) |
| | 79 | * Identify and fix components using backquotes to spawn commands instead of CAF::Process |
| | 80 | * Do a Quattor component release? or official/recommended list of components? |
| | 81 | * Can the new build tools help with this? |
| | 82 | * What process? |
| | 83 | |
| | 84 | Luis will no longer be able to (significantly) contribute as he's leaving CERN... |
| | 85 | |
| | 86 | |
| | 87 | === Aquilon and other MS Developments Status - N. Williams === |
| | 88 | |
| | 89 | Main developments: integration with ESX through QRD |
| | 90 | |
| | 91 | Profile signature: decided it was not the way to go; instead, just encrypt profiles over the wire using the Kerberos host keytab |
| | 92 | |
| | 93 | ncm-network: MS would like to be involved in the rewrite, ready to contribute |
| | 94 | |
| | 95 | ncm-modprobe: difficulties with the last version because of the removal of shell expansion |
| | 96 | |
| | 97 | RHEL6: found various issues, some of them not identified at CERN (in particular AII-related ones) |
| | 98 | * Goal: support ready in a few months, fixing problems as they are discovered |
| | 99 | |
| | 100 | Aquilon: will add support for "clusters" in a way similar to SCDB to help with use of QWG templates. |
| | 101 | |
| | 102 | Will start looking at supporting Solaris 11 in the future. |
| | 103 | * Probably not before autumn |
| | 104 | |
| | 105 | May also look at feeding their internal Windows management system from Quattor. |
| | 106 | |
| | 107 | |
| | 108 | == QWG Templates == |
| | 109 | |
| | 110 | === QWG Templates Update - M. Jouvin === |
| | 111 | |
| | 112 | See slides. |
| | 113 | |
| | 114 | === Nagios Templates - R. Starink === |
| | 115 | |
| | 116 | Almost done |
| | 117 | * monitoring/nagios/config: server configuration |
| | 118 | * RPM includes |
| | 119 | * Apache configuration |
| | 120 | |
| | 121 | Difference between generic and local configuration: most work done locally, still needs to be merged |
| | 122 | |
| | 123 | Pnp4Nagios: service tpl needs site modifications |
| | 124 | |
| | 125 | Hierarchy of Nagios servers: optional |
| | 126 | |
| | 127 | Grid monitoring: based on EGEE/EGI work |
| | 128 | * Uses YAIM, not easy to integrate into existing QWG templates. More discussion required. |
| | 129 | * Replace ncm-yaim by ncm-filecopy? |
| | 130 | * Not generic |
| | 131 | |
| | 132 | Summary |
| | 133 | * No visible progress yet but some progress behind the scenes |
| | 134 | * Merge into SVN requires time and dedication: would like to test results |
| | 135 | * Grid monitoring to be done after completion of base work |
| | 136 | |
| | 137 | |
| | 138 | === LEMON - I. Fedorko === |
| | 139 | |
| | 140 | Exception sensor: runs on the monitored node to report abnormal conditions (based on local conditions and metrics collected by other sensors on the node) and optionally run a corrective action. |
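
The behaviour described above can be pictured with a small Python sketch. It is not LEMON code and does not reflect the real sensor's configuration format: the names `ExceptionRule` and `check_exceptions`, and the metric names in the usage example, are invented for illustration. Each rule is evaluated locally against metrics already collected on the node; when it fires, the exception is reported and the optional corrective action is run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ExceptionRule:
    """Hypothetical rule: a named condition plus an optional corrective action."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    corrective_action: Optional[Callable[[], None]] = None


def check_exceptions(metrics: Dict[str, float], rules: List[ExceptionRule]) -> List[str]:
    """Return the names of the rules whose condition holds for these metrics.

    For each raised exception, the corrective action (if any) is executed.
    """
    raised = []
    for rule in rules:
        if rule.condition(metrics):
            raised.append(rule.name)
            if rule.corrective_action is not None:
                rule.corrective_action()
    return raised


if __name__ == "__main__":
    actions_run = []
    rules = [
        ExceptionRule("high_load", lambda m: m["load"] > 10.0),
        ExceptionRule(
            "tmp_full",
            lambda m: m["tmp_used_pct"] > 90.0,
            corrective_action=lambda: actions_run.append("clean_tmp"),
        ),
    ]
    print(check_exceptions({"load": 12.0, "tmp_used_pct": 95.0}, rules))
    print(actions_run)
```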
| | 141 | |
| | 142 | LEMON usage: CERN (several instances at CC and in online farms), CNAF, MS |
| | 143 | |
| | 144 | Backends |
| | 145 | * Supported: Oracle and flat-files |
| | 146 | * No alarming possible with flat-files: alarming relies on PL/SQL (available for most DBs) |
| | 147 | * It should be possible to add Postgres/MySQL support, as LEMON is internally DB-vendor independent |
| | 148 | |
| | 149 | LEMON usage at CERN: 11K entities (8K machines), 5+ Knodes with 150-200 metrics |
| | 150 | * Performance monitoring |
| | 151 | * Application monitoring |
| | 152 | * Infrastructure metrics: temperature, power... |
| | 153 | * 5 core sensors covering 60% of measurements |
| | 154 | * 1 LEMON server (+1 for historical data) |
| | 155 | |
| | 156 | Quattor management of LEMON |
| | 157 | * LEMON agent configured with ncm-fmonagent |
| | 158 | * LEMON client configuration files signed with SINDES |
| | 159 | |
| | 160 | Current activities |
| | 161 | * Federation of instances (worked with MS) |
| | 162 | * Virtual machines monitoring: Quattor used to generate images (golden nodes) leading to several VM instances sharing the same SINDES certificate |
| | 163 | * Also customized sensors for hypervisors |
| | 164 | * New LEMON web for better scalability and improved flexibility (more dynamic operations) |
| | 165 | |
| | 166 | Future |
| | 167 | * Consolidation of monitoring activities at CERN IT, including experiment support |
| | 168 | * New DB schema for increased scalability |
| | 169 | * DB data export for data mining |
| | 170 | * Exception classification for alarming |
| | 171 | * Remote instances (monitoring of remote T0) |
| | 172 | * Probably integrating with Nagios for collection |
| | 173 | * Interfacing to Nagios: interested in ideas/experiences |
| | 174 | * Interfacing with Windows monitoring |
| | 175 | |
| | 176 | LEMON development team at CERN: 1 to 1.5 FTE |
| | 177 | * Enough for maintenance for CERN usage |
| | 178 | * Not enough to do major reengineering: a problem as the use cases have changed |
| | 179 | * Consolidating monitoring activities (and manpower) is one direction |
| | 180 | * Interacting with other tools is the other direction |
| | 181 | |
| | 182 | |
| | 183 | == CMS Experience with Quattor - J. A. Coarasa Perez == |
| | 184 | |
| | 185 | CMS online computing infrastructure requirements |
| | 186 | * Complete autonomy from CERN IT and network |
| | 187 | * Myrinet core network |
| | 188 | * Redundant services (2 computing rooms 200 m apart), managed remotely from a central control room |
| | 189 | * Fast configuration turnaround |
| | 190 | * Ability to read from electronics at 100 kHz (100 GB/s) |
| | 191 | * Computing power to do data selection (2 GB/s output): ~2500 cores |
| | 192 | * Enough disk storage for 2 days: ~300 TB |
| | 193 | * Spare computers for quick replacement in case of failure (~200) |
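
As a rough cross-check of the figures above, the arithmetic can be written out in a few lines of Python. This is a back-of-the-envelope sketch, not something presented at the workshop: it assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and a sustained, constant data rate, and the function names are invented for the example.

```python
SECONDS_PER_DAY = 86_400


def event_size_bytes(readout_bytes_per_s, event_rate_hz):
    """Average event size implied by a readout bandwidth and an event rate."""
    return readout_bytes_per_s / event_rate_hz


def storage_tb(output_bytes_per_s, days):
    """Storage (in TB) needed to buffer a constant output rate for some days."""
    return output_bytes_per_s * days * SECONDS_PER_DAY / 10**12


def hours_covered(capacity_tb, output_bytes_per_s):
    """Hours of buffering a given capacity provides at a constant output rate."""
    return capacity_tb * 10**12 / output_bytes_per_s / 3600


if __name__ == "__main__":
    print(event_size_bytes(100 * 10**9, 100 * 10**3))  # bytes per event
    print(storage_tb(2 * 10**9, 2))                    # TB for 2 days at 2 GB/s
    print(hours_covered(300, 2 * 10**9))               # hours held by 300 TB
```

Under these assumptions, 100 GB/s at 100 kHz corresponds to about 1 MB per event, and two full days at a continuous 2 GB/s would need roughly 346 TB. The quoted ~300 TB is therefore consistent with two days of data taking at a slightly lower average output rate, or about 42 hours at the full rate.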
| | 194 | |
| | 195 | Limited manpower: ~5 FTEs |
| | 196 | |
| | 197 | Quattor is the core configuration management tool |
| | 198 | * Computers installed through Quattor but not using AII |
| | 199 | * Based on a (very) old Quattor version: 1.3 |
| | 200 | * panc 6 |
| | 201 | * RPM repositories hosted on NetApp filers with 6 SWrep servers |
| | 202 | * Performance: 10 min to compile the profiles of all 2500 machines |
| | 203 | * Time to deploy changes: ~10 min (cdp notification used) |
| | 204 | * Reinstallation of 1K machines in 1h15 |
| | 205 | * Problem when SWrep servers were overloaded: spma timeout, no retry |
| | 206 | |
| | 207 | Well defined template structure/layout (CMS specific) |
| | 208 | |
| | 209 | Some CMS-specific developments |
| | 210 | * Dropbox for RPMs to allow people without Quattor expertise to update their SW in a traceable/reproducible way |
| | 211 | * Buggy components requiring workarounds (in particular for configuration removal) |
| | 212 | * Copy of configuration files to individual computers: copyd |
| | 213 | * Network configuration derived from DNS |
| | 214 | * RPMs used to create users |
| | 215 | * Web frontend to CDB configuration: template summarizer (parsing part of the profiles according to CMS way of configuring systems) |
| | 216 | * Software Update Tools: allow updating an existing RPM on a group of computers (dropbox) |
| | 217 | * Upgrade or downgrade |
| | 218 | * Allows rolling back changes |
| | 219 | * Allows checking the status of the update |
| | 220 | * Perform checks on the RPM: already exists, not named properly... |
| | 221 | * Cannot be used to add a new RPM to the configuration: need to request addition of a new RPM |
| | 222 | * Only one person can run it at any time |
| | 223 | * Ability to run commands on a set of computers selected based on the use of a given template (like a type) |
| | 224 | |
| | 225 | Conclusions |
| | 226 | * Quattor has scaled, and still scales, to CMS needs |
| | 227 | * Not always as flexible as we'd like... |
| | 228 | |
| | 229 | |