Quattor Workshop - CNAF - 17-18/3/08
Site Reports
CERN - V. Lefebure
Several Quattor instances :
- Main instance : 7600 profiles with 1700 not Quattor managed (for inventory only), in 140 clusters
- 2 instances for Linux controls
- Desktops : currently using only a small subset of NCM components to configure site-wide defaults
- Special requirement on components : touch only the part they manage, don't remove comments
- Rely on lcm rather than ncm-ncd to allow users to select component to run
PAN : still using v6 because of a performance issue with the duplicate() function. Fix proposed for v8; CERN may skip v7.
Namespaces : used mainly for staging purposes (/prod, /test...)
- Just started to use namespace to organize templates
CDB : currently on a 4-core machine
- 33 minutes for 7600 profiles
- Memory issue : compile by batch of 100 profiles
- May be related to the big RPM repository
- More and more CDB users : issues with ACL management and performance
- Compilation serialization leads to long apparent commit times : considering allowing concurrent sessions when they do not interfere with each other.
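The batch compilation mentioned above can be pictured with a short sketch. It is an illustration only, not CERN's actual tooling: the helper `batches` and the profile names are hypothetical, and only the batch size of 100 and the total of 7600 profiles come from the report.

```python
from typing import Iterator, List, Sequence


def batches(profiles: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` profiles."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(profiles), size):
        yield list(profiles[start:start + size])


# Hypothetical profile names; a real run would hand each batch to the compiler
# so that memory use is bounded by the batch size, not by the full profile set.
profiles = [f"profile_{i:04d}" for i in range(7600)]
batch_list = list(batches(profiles))
print(len(batch_list))  # 76 batches of 100 profiles
```

A smaller batch size lowers peak memory at the cost of more compiler invocations.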
SPMA and SWrep
- Plan to clean up RPM repository to reduce compilation memory requirements
CDB2SQL and Oracle
- CDB2SQL being rewritten : plan for 3x speed improvement
- CCM v2 deployed : SSL based
Quattor development activities focused on maintenance of CERN templates, in particular namespace migration.
Xen-based virtualization use increasing but needs a more flexible template structure.
Trying to define a profile template structure that everybody will use, with the aim of separating OS configuration, VO configuration...
Other needs and issues :
- SMS needs to access CDB ACLs
- Update in progress to Quattor 1.3 templates : performance issue with push and npush, 2 CERN specific properties in structure_interface
- Reviewing how to better handle non Quattor managed objects : remove requirements for unused information
- Plan to use a specific profile_base for these systems
- Migration of service data historically in CDB to SDB (Service Database)
LAL - M. Jouvin
http://indico.cern.ch/materialDisplay.py?contribId=2&sessionId=0&materialId=slides&confId=28976
NIKHEF - R. Starink
Sites:
- NIKHEF-ELPROD : ~150 hosts
- Will increase to 300 in April
- Testbed : ~15 nodes
Install with Quattor for all grid machines
- CentOS 3, 4, 5
Grid MW deployment with ncm-yaim
- Lack of time to look at/switch to QWG : open to benefit from & contribute to community effort
- Interest from another site : occasion to look at it
Straightforward implementation of Xen guests using S. Childs's work.
Monitoring : Nagios + Ganglia
- Ganglia presents per-node or summary view of successful/failed execution of ncm-ncd
- Nagios : alarms in case of non-zero exit (using NRPE)
- Server configured manually
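As a rough sketch of the Nagios alarm described above: a check can map the exit code of the last ncm-ncd run to the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name `check_ncd_status` and the way the exit code reaches the check are assumptions made for illustration; the report does not describe NIKHEF's actual plugin.

```python
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_ncd_status(exit_code: Optional[int]) -> Tuple[int, str]:
    """Map the last recorded ncm-ncd exit code to a Nagios status and message."""
    if exit_code is None:
        # Hypothetical case: no run has been recorded on this node yet.
        return UNKNOWN, "UNKNOWN: no ncm-ncd run recorded"
    if exit_code == 0:
        return OK, "OK: last ncm-ncd run succeeded"
    return CRITICAL, f"CRITICAL: last ncm-ncd run exited with {exit_code}"


status, message = check_ncd_status(1)
print(message)
```

A real plugin would print the message and terminate with `status` as its own exit code so that NRPE can relay it to the Nagios server.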
Pan Compiler : using v7.
SCDB : non-SVN version
- Using Makefiles to hide SCDB tools
AII : successful migration to v2.
Facing scaling issues with TFTP and/or RPM repositories
- TFTP unlikely to be the cause
- External reasons like network ?
Grid-Ireland - S. Childs
18 distributed sites managed : ~400 nodes
- Single SVN repository with 6 active users
- RPM repositories replicated by rsync on each site
New compute resources deployed : ~40 Condor VMs
Installation of LEMON monitoring
- Just for TCD site currently
- Starting with Stijn's recipe
- Work required for better integration : alarms, IPMI
Xen config tidied up.
Developing GraphXML output format for template dependencies in panc v8.
Range of new non-grid machines installed
- Portal server
- Data management
QWG pulled in via svn:externals
- "current" pointer to certified revision
- "trunk" pointer for development
- Need to specify --ignore-externals in case QWG repository is not accessible
- Using Stijn's dummy WN trick for performance improvement
QWG structure based on a site hierarchy : target "compile.sites"
Deployment of OS errata : still not working seamlessly
- Mainly kernel issues
Plan to bring more service nodes in Quattor (e.g. Web servers)
Morgan Stanley - N. Williams
20K machines in 4 sites
- Looked at several solutions, including Quattor
- Pressure to have 10K machines managed with the replacement system by June. Will be a major test before final decision.
Like the Quattor architecture.
Tried SCDB but don't like Subversion and prefer CDB model with specialized commands for non specialist users
- Try to write a new thing merging both : AQDB
- Looking at performances for compiling 10K machines
- Concerns about scalability of build server : DHCP, HTTP...
Using AII to manage initial installation :
- Would like to dramatically reduce installation time from 20 min to 5 min
10 people involved, 5 really writing templates : main complaint is the difficulty of locating where things are included
- Would like to use namespaces to provide better predictability
SPMA : dependency management is very time consuming
Monitoring : very interested in LEMON but no time to look at it in detail.
UAM - L. Munoz Mejias
Several clusters :
- Atlas T2 : 150 nodes, SL 4.5, managed with QWG templates, monitoring with Nagios
- GVM-UAM : private cluster for UAM users, not necessarily HEP. Many problems with hardware support on SL 4.5.
- 40-desktop cluster : still running old-CDB, will probably be closed (old HW)
Quattor changes :
- Implementation of staged deployment
- Dropped SWrep in favor of HTTPrep
- Secure delivery of profiles, using certificates : lightweight alternative to SINDES developed.
- AII v2
- Nagios integration into templates
Plans for the short term:
- Panc v8
- Quattorization of Quattor instances
INFN - A. Chierici
Running SLC4/gLite 3.1, except for CE
- Disk servers run 64-bit (except if machine doesn't support it)
Quattor configuration :
- Using ncm-yaim to configure grid services
- Adopted a new NS schema, close to NIKHEF one
- Xen studied and tested : Xen-UI to be deployed soon
M.E. back at CNAF : continue to support core development
- But only for one year...
New LHCb Tier-2 hosted at CNAF and fully quattorized.
PANC : migrated to v7 after Madrid workshop. 40% perf improvement.
LEMON configured using Quattor
- Storage nodes and WNs
- On WNs, conflict with GridICE
- Migration in progress to NS
- LEMON used for monitoring, Nagios used for alarming
More than 10 people involved in Quattor template maintenance at CNAF
Concerns about CERN's role in the future : is the community strong enough to take over CERN's role ?
Developed a web application to display node Quattor status graphically
- Organized by rack
- When a pointer is on a machine, displays the status of the components
- Status retrieved from a MySQL database, filled with a cron job
- Able to send status through RSS
Philips Research - S. Vrijaldenhoven
Current internal cluster not managed by Quattor. Want to move to grid to get access to more resources.
Test grid cluster managed with Quattor (SCDB+QWG templates)
- SL 4.5, gLite 3.1
- Plan to move to production soon (planned March 25th).
Nagios work going on.
IBCP - C. Eloto
1 cluster running gLite 3.1
- 1 CE, 1 SE, 18 WNs
- Everything installed in VMs
- Quattor chosen to compensate for lack of manpower : seems to provide efficient management
Site not yet certified : problem with SE.
Faced many problems with early release of QWG templates for gLite 3.1 and some unusual settings (RPM server different from TFTP server)
Documentation needs to be improved for AII configuration, Python 32-bit for 64-bit, gLite 3.0/3.1 coexistence.
BEgrid - Stijn De Weirdt
Quattor configuration :
- Central configuration database (SCDB)
- RPM repositories with SWrep
- Certificates used for ACLs on SVN and SWrep
- Deployment in 2 phases :
- When central admins update central repository, site admins are notified
- Deploy the changes with a custom script if they are interested
Sensitive information deployed with SINDES but not properly integrated with AII
- Need to be done for AII v2
Still not using QWG OS templates for historical reasons but plan to move.
Use of dummy WN build (idea from CERN) to speed up the compilation
- Compile just once
- Reuse compiled version of a WN in every real WN profile
- Not yet integrated with QWG : some changes required, mainly in `machine-types/base.tpl` and `machine-types/wn.tpl`
Other issues :
- Better integration of monitoring tools in Quattor
- Integration of DNS management in a way similar to DHCP
Main Developments
QWG Templates - M. Jouvin
http://indico.cern.ch/materialDisplay.py?contribId=18&sessionId=2&materialId=slides&confId=28976
AII v2 - R. Starink
Reasons for v2 :
- v1 limited to PXE + Kickstart. Want to support other installation infrastructure like JumpStart
- Device schema limited to /dev/[hs]d[a-z] : other schema supported by some hacks
- RAID and LVM support very limited and impossible to combine
- KS templates were becoming too complex and not maintainable
- Untyped schema : no validation possible at compile time
- Code unnecessarily complex : 2800 lines in v1 vs 1600 in v2
v2 architecture :
- Front-end is using plug-ins to do the real work
- 3 types of plugins : NBP, DHCP, osinstall
- Node profile determines which plug-ins are loaded
aii-pxelinux : default plug-in for NBP, no more use of template
aii-ks : default plug-in for osinstall, no more use of template
- Generator for Kickstart files
- Support complex blockdevice combinations : based on ncm-lib-blockdevices
- Flexible partitioning and formatting of file systems... but a bit slower
- Site specific setup through hooks : plug-ins for aii-ks organized in 3 groups (pre-install, post-install, post-reboot)
- Configured through PAN
- Ordered list
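The plug-in architecture described above can be pictured with a small sketch. It is not the AII code: the registry layout, the function names, the `example-dhcp` plug-in name and the profile structure are assumptions made for illustration. Only the three plug-in types, the `pxelinux` and `ks` defaults and the three ordered hook groups come from the talk.

```python
from typing import Any, Callable, Dict, List

Profile = Dict[str, Any]
Plugin = Callable[[Profile], str]

# Registry keyed by plug-in type, then by plug-in name (hypothetical layout).
PLUGINS: Dict[str, Dict[str, Plugin]] = {
    "nbp": {"pxelinux": lambda p: "pxelinux entry for " + p["hostname"]},
    "dhcp": {"example-dhcp": lambda p: "dhcp entry for " + p["hostname"]},
    "osinstall": {"ks": lambda p: "kickstart file for " + p["hostname"]},
}

HOOK_GROUPS = ("pre_install", "post_install", "post_reboot")


def run_plugins(profile: Profile) -> List[str]:
    """Load only the plug-ins named in the node profile and run each one."""
    return [PLUGINS[kind][name](profile)
            for kind, name in profile["plugins"].items()]


def run_hooks(profile: Profile, group: str) -> List[str]:
    """Run the hooks of one group in the order listed in the profile."""
    if group not in HOOK_GROUPS:
        raise ValueError("unknown hook group: " + group)
    return [hook(profile) for hook in profile.get("hooks", {}).get(group, [])]


node = {
    "hostname": "node01.example.org",
    "plugins": {"nbp": "pxelinux", "osinstall": "ks"},
    "hooks": {"post_install": [lambda p: "step 1", lambda p: "step 2"]},
}
print(run_plugins(node))
print(run_hooks(node, "post_install"))
```

Keeping the front-end ignorant of what each plug-in does is what allows a site to swap in, for example, a JumpStart generator without touching the front-end.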
AII configuration has changed... but easy to adjust
- NBP : configuration moved to /system/aii/nbp/pxelinux
- osinstall configuration : configuration moved to /system/aii/osinstall/ks
- Customizable via 22 variables AII_OSINSTALL_*
- hooks : execute NCM object, print what is desired in KS file
- File system definitions : separated between /system/blockdevices and /system/filesystems
- Require new pan-templates and ncm-lib-blockdevices
Almost no change in command line interface : aii-shellfe, aii-installfe
- --notify being reimplemented
Upgrading Quattor server
- Pre-requisites : CCM >= 2.0.2, pan-templates >= 2.7.1, ncm-lib-blockdevices >= 0.17
- Packages : aii-server >= 2.0.4, aii-ks >= 1.0.1, aii-pxelinux >= 1.0.0
- Configuration moved from /etc to /etc/aii
- Move configuration for NBP and osinstall : remove everything under /system/aii/*/options as it will be assumed to be related to a plugin 'option'
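Before upgrading, the minimum versions listed above can be checked with a few lines of code. The sketch below is only an illustration: the function names are hypothetical, the installed versions would have to come from the package manager, and real RPM version strings may carry suffixes that this simple dotted-number parser does not handle.

```python
from typing import Dict, List, Tuple

# Minimum versions quoted in the upgrade notes above.
MINIMUM: Dict[str, str] = {
    "CCM": "2.0.2",
    "pan-templates": "2.7.1",
    "ncm-lib-blockdevices": "0.17",
    "aii-server": "2.0.4",
    "aii-ks": "1.0.1",
    "aii-pxelinux": "1.0.0",
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.0.4' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def too_old(installed: Dict[str, str]) -> List[str]:
    """Return the packages that are missing or older than the minimum."""
    problems = []
    for name, minimum in MINIMUM.items():
        found = installed.get(name)
        if found is None or parse_version(found) < parse_version(minimum):
            problems.append(name)
    return problems


# Hypothetical inventory of a server about to be upgraded.
print(too_old({"CCM": "2.0.2", "pan-templates": "2.6.0", "aii-ks": "1.0.1"}))
```

Comparing tuples of integers avoids the classic string-comparison trap where "0.9" sorts after "0.17".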
Possible contributions :
- Alternative modules
- Hooks : generic (e.g. SINDES support) or site specific to be used by others as a starting point
- Separate directory for sharing contributions and enhancements : contrib
- Extensions maintained by extension author
- A generic extension may be moved to standard AII
Support for AII v1 has been dropped.
Already tested with Xen at NIKHEF.
Update on blockdevices layout - L. Munoz Mejias
blockdevices :
- partitions_add : bulk addition of partitions to a disk. Partitions given as a pair name/size.
- lvs_add : similar for LVM
- Support for LVM striping added : `stripe_size` property on `logical_volumes` structure.
- No support for HW RAID yet
- No support for quota yet : looking for suggestions about a quota schema
Core Components
PAN Compiler - Cal
Production version 7.2.9
- Bug fixes only
Version 8 in development, still in trunk
- 8.1.0 tagged last week and available for evaluation
- Feedback from each site would help moving this to production : large chunks rewritten
- Removed features deprecated in v7
- Keywords : define, delete, description, descro
- Types : embed, fetch, stream
- Newly deprecated features in v8
- Bareword include : `include mytemplate;` changed to `include { 'mytemplate' };`
- Using `type` for binding : must be replaced by `bind`
- Lowercase automatic variables : `loadpath`, `self`, `object`
- Language changes :
- External path syntax : myobject/some/absolute/path to be deprecated in favor of /my/object/tpl:/some/absolute/path
- Will allow eventually object templates to be namespaced
- Literal escaping of paths : can escape part of a path
- More limits on structure template contents : variable, functions, include no longer allowed
- New and changed functions :
- format() : printf-like capabilities. Can avoid incremental building of configuration file contents.
- Does not print anything by itself
- is_defined(), is_boolean()... return false instead of error if variable does not exist.
- String manipulation : to_lowercase/uppercase(), split(), replace()
- to_string() will accept any element, undefined, null and resources included
- New automatic variable : TEMPLATE. Name of the template that initiated a DML block.
- Not changed with function calls, including create()
- New output format : write machine template as a dot (Graphviz) file
- Logging capabilities added
- Several logging types : task, call, memory, all, none
- Messages short and easily parsed for analysis
- Cost : 15% slower with all logging
- Example analysis scripts for memory usage vs. time, task per thread vs. time, graph of call (include) structure, performance studies (how much time spent in every template)
- Change in global variable handling : `variable X = ...exists(X)...` always true.
- Must use null instead of undef for tri-state variables
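The benefit of a printf-like format() over incremental building of file contents can be shown with an analogy. The sketch below uses Python's `%` string formatting, not Pan's format() function, and the interface data is made up for illustration.

```python
# Hypothetical data: (device, address) pairs to be written to a config file.
entries = [("eth0", "192.0.2.10"), ("eth1", "192.0.2.11")]

# Incremental building: the contents grow by one concatenation at a time.
incremental = ""
for device, address in entries:
    incremental = incremental + "DEVICE=" + device + "\n"
    incremental = incremental + "IPADDR=" + address + "\n"

# printf-style: the layout is stated once and filled in for each entry.
template = "DEVICE=%s\nIPADDR=%s\n"
formatted = "".join(template % entry for entry in entries)

print(formatted == incremental)  # both approaches yield the same contents
```

With the layout held in a single template string, the shape of the generated file is visible at a glance instead of being scattered across concatenations.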
Implementation changes in v8 :
- Better handling of SELF : faster and less memory intensive, more consistent in various contexts
- Some optimizations : evaluation and use of compile-time expressions, specialized operators to avoid redundant checks at runtime, optimization to allow stricter syntax checking (earlier detection of some errors)
- "Read-only" resources : infrastructure in place to avoid unnecessary copying of resources but not used yet. Some semantic issues to work out.
- GRIF full build : 25% faster than v7
Documentation updated, including tutorial and man pages for PAN functions.
Migration of standard templates to v8 required to avoid deprecation warnings
- CERN has to migrate to v8 to benefit from the latest performance improvement : requires removing keywords that are no longer supported
Would like to clean up panc script options. What are the options used ?
- To be discussed on the mailing list
Authorization/Entitlements (Morgan Stanley request)
- What changed : template or configuration ? The real issue is probably configuration, and it's tricky to know the reference value.
- Who did it ? How to get user identity ?
- Was it authorized ? How to define authorization ? One possibility is to define parts of configuration that can be modified by a template but at the price of flexibility.
- Need to refine the use case and possible design : first discuss on mailing list and try to write a wrap up, then try to implement a prototype
- To be acceptable the performance price for this should be only when you use the feature (and not for everybody like in panc v6).