= 7th Quattor Workshop - London - 11-13/3/09 =

[[TracNav]]
[[TOC(inline)]]

See [http://indico.cern.ch/conferenceTimeTable.py?confId=50010&showDate=all&showSession=all&detailLevel=contribution&viewMode=parallel Agenda].

== Site Reports ==

=== LAL - G. Philippon ===

=== Grid Ireland - S. Childs ===

Not much change in the number of admins or resources
 * Virtualization of most nodes (Xen)

Developments in monitoring:
 * Nagios + Ganglia
 * MonAMI to feed Ganglia from DPM and Torque
 * LEMON: upgraded to lemon-web on SL5

Many tools upgraded: checkdeps, quatview
 * ncm-accounts massive speedup

SCDB: merge "hierarchical site" model into trunk

Issues:
 * Get network and file systems fully under Quattor control
 * Consistent scheme for monitoring in Quattor
 * Dummy WN speedup trick integrated into the compiler

=== NIKHEF - R. Starink ===

4 clusters: 300 machines
 * Currently 5 people involved with Quattor

SCDB + local changes: deployment not done by SCDB but by local tools, to allow deployment of a specific machine
 * Related to the historical way of managing systems at NIKHEF, but has the disadvantage that a postponed change deployment may break something later on some other nodes...

Xen: 7 hosts, 38 guests
 * Based on QWG but issues with host and guests in different clusters: workaround found

Monitoring entirely based on Nagios with 1 master and 3 slaves
 * Based on QWG but some mods to handle a hierarchy of servers that are willing to share

panc v7 to v8 transition: no problem but no performance improvement observed
 * Very happy with the new logging features

NCM components:
 * ncm-openvpn to configure server and clients
 * ncm-yaim: complete refactoring/rewriting, new features, some backward-incompatible changes

Issues:
 * WN compilation speed-up has some problems with compile-time dependencies
 * Strength of community with increased usage: what about the ability to support everybody?

=== LAPP - E. Fede ===

Quattor server is running on a VMware virtual machine
 * 110 profiles
 * 4 people using it

Running autobuild of RPMs for NCM components and other core components from SourceForge
 * http://lapp.in2p3.fr/Quattor
 * trunk and tags/latest
 * repodata available for YUM

=== CERN - V. Lefébure ===

Main instance: 7500 profiles in 139 clusters
 * +1200 increase
 * 1900 profiles corresponding to machines not managed by Quattor
 * Running the latest version of CDB, panc v8 ready
 * v8: 20% improvement in compile time but not yet in production; all known issues solved
 * Problem with the use of the RECORD type by some components (ncm-httpd, ncm-tomcat...)

Xen-based virtualization: support for SLC5 hypervisors ready

Issue of the number of users: 65 ACL groups

Package list templates: working on automation
 * Use of comps.xml
 * Automatic detection of missing dependencies

CDB2SQL: Python version fast but buggy, no manpower to fix it, reverted to previous version

=== Morgan & Stanley - N. Williams ===

In production now: AQB, AQDB, LEMON
 * 7500 nodes, compilation takes 10 minutes (8-core machine) but aii-shellfe --notify takes 1h (with patches not yet committed)!
 * 5 template admins
 * New building just commissioned; expect to double the number of machines in the next months, plan to keep one server (+1 for redundancy)

Issues:
 * Format change of XML profiles painful: dropping LINK support forces "big-bang" changes
 * Configuration success feedback: thinking about implementing a DB of the last time a component was run, updated by ncm-ncd/ncm-cdispd, that could be compared with the timestamp of the last configuration

Will submit code now via SourceForge
 * Waiting for approval to open source: AQDB, FUSE interface to configuration browsing, AII, CCM patches

=== UAM - Laura del Caño ===

Luis left, Laura is his replacement.

Proposal of tasks that UAM could handle:
 * Maintenance of monitoring tools
 * openvz support
 * AII

5 clusters
 * Use of ant local tasks for template management
 * Performance tests of new machines configured with Quattor

New components in progress:
 * ncm-amanda to configure the Amanda backup software
 * ncm-pnp4nagios

Some local developments:
 * PostgreSQL DB to store machine info and group machines into categories + ant local task to generate the profile and some other templates (monitoring) for the machine
 * SinDes alternative used to manage secure access to profiles (AII hook)

=== Greek Grid - D. Zilaskos ===

1 Quattor server to manage 2 clusters representing 133 machines spanning 13 subnets
 * 4 Xen hosts, 19 guests
 * SVN server installed with Trac
 * 3 admins + 2 new people who recently joined

Developments and issues:
 * Wiki guides for Quattor newbies
 * New components: still in progress... some services like Hydra are evolving very quickly
 * Involvement in OAT benefits as work is implemented and tested locally with Quattor
 * Thinking about an administration model for Southern Europe based on GRIF/Grid Ireland experience
 * Lots of small sites, very limited effort available...

=== CNAF - A. Chierici ===

90% of the templates adapted to the new schema
 * Inspired by QWG
 * Next step is migration of gLite nodes to SLC5

Xen used on 3 servers providing 16 cores each
 * LHCb T2 running on Xen
 * Quattor used to configure the profiles of guests but Dom0 managed by hand
 * Investigating KVM

Planning to install the new ncm-yaim and ncm-accounts soon

CDB vs. SCDB: still thinking about migration but need to investigate the impact for the users
 * Is CDB still supported after ME left?
 * Who will take care of new core releases?

Presenting a poster about Quattor at CHEP: anyone who could help to present it?

One new person to help with Quattor at CNAF (Elisabetha)
 * All new sysadmins taught to use Quattor

== Aquilon Drive Through - W. Hertlein ==

Aquilon is an architecture to address system management at M&S, in particular scalability, management delegation, application-centric management...
 * Goal is to install hundreds of machines without any manual intervention

AQDB is the CDB replacement: no direct interaction between users and templates, everything goes through the Aquilon broker (AQB)
 * Aquilon configuration stored in AQDB

Workflow for provisioning machines:
 * Rack is the unit of work: racks delivered already cabled
 * Limited number of vendors and models
 * When a rack is powered on, the top-of-rack switch (tor_switch) sends a DHCP request and receives a temporary address that allows it to be configured after discovering its type
 * Switch entered in AQDB
 * DNS integration: all servers configured with the same information
 * DNS DB built from a periodic dump of AQDB (every 3h)
 * DHCP integration: a DHCP server set to serve a specific set of machines
 * Configuration built from a periodic dump of AQDB
 * Discovering machines: done by scripts scanning the tor_switch and querying it with snmp get
 * Create a machine entry in AQDB for each discovered MAC address
 * After a machine has been entered in AQDB, create a machine plenary (PAN) template describing the HW (see the sketch at the end of this section)
 * The machine plenary template is an equivalent of SCDB hardware/machine + the service/personality
 * Creating hosts: a logical entry for the host is created and associated with a plenary template
 * The IP is derived by choosing an available IP from the tor_switch subnet and put in the plenary template, so that an IP address change does not trigger recompilation of another host
 * Wait for the propagation of the previous information, in particular DNS: may take a couple of hours
 * The next day, build out the hosts, binding each host to the required services, and use panc to compile the templates
 * Invoke aii-shellfe to switch over to the install PXE image (done through AQB)
 * DNS + Krb propagation delay: looking for some optimizations in the future
 * Finishing the build process: a script runs on the newly built host to update its information in AQDB
 * Record any management interfaces found and, if successful, update the status information
 * Grid hand-off: after a successful build of a host, it is transferred to an application group
 * Deployment schedules are set up in advance
 * Use personality to set up a host with application defaults
 * Spread hosts for a group over several racks for better resilience

Managing services: a service may have several instances
 * 1 service instance is bound to a host

Monitoring hosts to meet SLAs:
 * Hosts are monitored by a daemon that does a periodic snmp sweep of all the hosts
 * Hosts are grouped by personality with a threshold of hosts allowed to be offline
 * Host personality also defines a reboot schedule: a previous threshold violation will prevent the reboot
 * Includes some draining capabilities
 * LEMON is enabled as a service to provide visualization of aggregated metrics
 * LEMON configured from Aquilon to guarantee consistency

Updating personalities involves working on a local copy of the personality and creating a new temporary one (AQ domain), compiling it with a fake profile and putting the change back into AQDB
 * Currently no check enforced by 'aq put' that the new version compiles. Rely on conventions...
 * A test host can be associated with the new personality and reconfigured to validate the changes
 * AQ then allows merging the changes into a production personality and reconfiguring all nodes using it
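As a rough illustration of the "machine plenary" idea, here is a minimal sketch of what such a generated pan template could look like, using a QWG-style hardware description; the template name, hardware part templates and values are invented for the example and were not shown in the talk.

{{{
# Hypothetical machine plenary generated by the broker for one discovered box
# (illustrative only; paths follow the usual QWG hardware description style).
structure template machine/example/rack42/node07;

"location" = "rack42";
"serialnumber" = "ABC1234";

# Hardware inventory recorded at discovery time
"cpu" = list(create("hardware/cpu/intel/xeon_e5430"),
             create("hardware/cpu/intel/xeon_e5430"));
"harddisks" = nlist("sda", create("hardware/harddisk/generic"));

# NIC with the MAC address learned from the tor_switch; "boot" marks the PXE interface
"cards/nic" = nlist("eth0", create("hardware/nic/generic",
                                   "hwaddr", "00:16:3e:00:00:07",
                                   "boot", true));
}}}

In SCDB terms this plays the role of the hardware/machine part of a profile; the difference is that Aquilon generates and refreshes it from AQDB instead of having admins edit it by hand.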
== Monitoring ==

=== Monitoring Templates in QWG - S. Kenny ===

LEMON:
 * NCM components: fmonagent, oramonserver
 * Templates: standard/monitoring/lemon
 * Web front-end via filecopy

Nagios:
 * NCM components: nagios, ncg
 * Templates: standard/monitoring/nagios
 * A few changes to support Nagios 3
 * Hosts created from HW DB
 * Services defined as separate templates and added to the NAGIOS_SERVICE_TEMPLATES variable, but not very scalable (see the sketch after this section)
 * ncm-ncg currently being developed to produce an input file for WLCG NCG, which generates the service definitions: looks promising

Ganglia: configured with filecopy
 * Would be better to have a component generating the required config file on client and server from the site hierarchy description

MonAMI: configured with filecopy

Currently, every monitoring tool has its own configuration. Ideally the monitoring schema should be mostly tool-independent.

Proposed model based on the current LEMON config:
 * Host: coming from DB_MACHINE
 * Cluster: group of nodes sharing the same node types
 * Super-cluster:
 * Need to be part of the information in the node profile so that it can be used by several components
 * May be connected with some representation of M&S personalities in the schema
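To make the NAGIOS_SERVICE_TEMPLATES mechanism concrete, here is a minimal sketch of how a site might list its extra service templates, assuming the variable simply collects pan template names; the service template names shown are invented for the example.

{{{
# Hypothetical site-level template (illustrative only): each entry names a
# pan template holding one Nagios service definition; the list is consumed
# when the Nagios configuration is built.
variable NAGIOS_SERVICE_TEMPLATES = list(
    'monitoring/nagios/services/ssh',
    'monitoring/nagios/services/dpm_daemons'
);
}}}

The scalability complaint is visible here: every new service has to be added to this list by hand.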
== Core Components ==

=== PAN Compiler - C. Loomis ===

Code hosted on SourceForge since 8.2.6
 * Bug tracking moved to SF too

Production version is v8
 * v7 deprecated
 * 8.2.3: CDB compatibility, annotation
 * 8.2.4: selective debugging
 * 8.2.7: introduces prepend/append functions to replace push/npush (see the sketch at the end of this section)

Outstanding bugs:
 * Race condition in validation
 * Corner cases with unintuitive behaviour
 * Enforce final flag in structure templates: no real request...

Enhancement requests (in priority order):
 1. Restricted include (aka entitlements)
 1. Add perf tips to documentation
 1. XInclude directive to replace Embedded
 1. Add OBJECT to debug() and error() output
 1. Add prefix/define statements to shorten literal paths
 1. Internationalize error messages
 1. Enable/disable debugging from within pan
 1. Allow include to take a list (from M&S)

Other ideas/wishes:
 * Better Eclipse integration: editor, debugger, dialog boxes to select options for ant/pan
 * Ability to include a file which is not a template inside a template as an alternative to `<
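As background for the 8.2.7 item above, a minimal sketch of the old and new list idioms; the path and value are illustrative only.

{{{
# Illustrative only: appending an element to a list held at an absolute path.

# Pre-8.2.7 idiom, using push() on SELF:
"/software/repositories" = push("example_repo");

# From panc 8.2.7, the same operation with the new functions:
"/software/repositories" = append("example_repo");   # add at the end
"/software/repositories" = prepend("example_repo");  # add at the beginning
}}}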