The atlasgce-modules are Puppet modules for contextualizing analysis clusters for the ATLAS Experiment. They are developed primarily for Google Compute Engine (GCE).
The modules have been tested on the CentOS 6 operating system. They should work out of the box for most Red Hat based systems of the same generation, such as SL6 and SLC6.
Work is also in progress to partially support CernVM. The SLC5-based CernVM 2.6 and 2.7 will support all modules except packagerepos (due to the lack of Conary support in Puppet) and cvmfs (which is already configured during the CernVM contextualization). μCernVM will be SLC6-based, will use RPM for package management, and should be fully supported.
Debian-based systems are not supported, but support can be added.
These modules have been developed for GCE, but support for other clouds can be implemented.
The atlasgce-modules provide from-scratch contextualization on bare machines (virtual or physical) for ATLAS analysis.
Three different roles are available: the manager role (`head`), the worker role (`node`), and the worker role for a Cloud Scheduler environment (`csnode`).
The manager role consists of the following elements:
- The AutoPyFactory service fetches jobs from a PanDA queue and submits them locally to Condor.
- The Condor collector, negotiator, and schedd services manage job submission and the distribution of subjobs over the worker nodes.
- The XRootD and Cluster Management services act as a local XRootD redirector and are responsible for accessing and caching input data files through the Federated ATLAS XRootD system. (Optional)
- The CernVM-FS service provides consistent access to ATLAS software. CernVM-FS is not strictly required for the manager role, but can be helpful when debugging the Condor services. (Optional)
- Compatibility packages for running SLC5 binaries on SLC6.
The worker role consists of the following elements:
- The Condor startd service runs the individual subjobs as dictated by the manager.
- The XRootD, Cluster Management, and File Residency Management services are responsible for accessing input data files through the manager and for downloading those that are not yet available in the cache.
- The CernVM-FS service provides consistent access to ATLAS software.
- Compatibility packages for running SLC5 binaries on SLC6.
The Cloud Scheduler worker role consists of the following elements:
- The Condor startd service runs the individual subjobs as dictated by the Cloud Scheduler.
- The CernVM-FS service provides consistent access to ATLAS software.
- Compatibility packages for running SLC5 binaries on SLC6.
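As a quick sanity check after contextualization, the services associated with each role can be inspected. The commands below are a minimal sketch assuming standard SL6-style init scripts and package names; the exact service names may differ depending on the installed packages.

```bash
# On the manager (head) node: Condor collector, negotiator, and schedd,
# plus the XRootD redirector if it is enabled.
service condor status
service xrootd status
condor_status        # should eventually list slots from the worker nodes

# On a worker node (node or csnode): the Condor startd.
service condor status
```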
The contextualization of the machine is governed by Puppet through the means of a public collection of modules (the atlasgce-modules, this repository) and one or more scripts for bootstrapping.
The standard way of bootstrapping GCE images is to use a startup script, and this method is used for the manager and worker nodes.
For Cloud Scheduler worker nodes, startup scripts are not supported; bootstrapping is instead achieved with a machine image prepared with software that downloads and runs a bootstrap script supplied through the `userdata` metadata attribute. (Cloud Scheduler might support startup scripts on GCE in the future.)
See atlasgce-scripts for more information on the bootstrapping procedure.
The atlasgce-modules, in combination with the bootstrapping procedure, handle every part of the contextualization: preparing and mounting additional storage, adding package repositories and downloading the required software, creating and setting up user accounts for the services, and configuring and starting the supported services.
This contextualization is done on a bare machine, meaning that no software other than the package manager is required. Even Puppet is installed during the bootstrapping procedure.
This means that the extra work of preparing machine images with the required software and configuration, and the rather long turn-around time that comes with it, is effectively eliminated. The extra cost of redoing the contextualization at instantiation has been found to be very small on GCE.
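To illustrate the idea, a startup script for a manager node could look roughly like the sketch below. This is not the actual script from atlasgce-scripts; the paths and the manifest location are illustrative, and it assumes a yum-based system with Puppet available from an already configured repository (e.g. EPEL).

```bash
#!/bin/bash
# Illustrative bootstrap sketch (not the actual atlasgce-scripts startup script).

# Install Puppet and git using nothing but the system package manager.
yum install -y puppet git

# Fetch the public module collection.
git clone https://github.com/spiiph/atlasgce-modules.git /opt/atlasgce-modules

# Apply the node manifest for the desired role (here: the manager role).
# gce_node_head.pp comes from the atlasgce-scripts repository.
puppet apply --modulepath=/opt/atlasgce-modules /root/gce_node_head.pp
```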
See What is Puppet? for an introduction to Puppet.
This section describes suggested usage together with the atlasgce-scripts. It describes how to configure GCE options, how to configure the node template and other parts of the bootstrapping procedure, and how to start, update, and stop a cluster.
Note: Configuration of the GCE project including adding SSH keys and configuring the firewall for incoming traffic (if necessary) is not covered here. Refer to the official documentation.
Note: Detailed information about configurable parts of the atlasgce-scripts can be found in its documentation.
- Download atlasgce-scripts: `git clone https://github.com/spiiph/atlasgce-scripts.git`
- Download atlasgce-modules (optional): `git clone https://github.com/spiiph/atlasgce-modules.git`
- Enter the `atlasgce-scripts` directory and edit `defaults.sh` to change the GCE configuration to reflect your project and cluster (a hypothetical sketch of such settings follows after this list).
- Edit `gce_node_head.pp` and `gce_node_worker.pp` to configure important options such as the role, the manager node address, the XRootD redirector, PanDA settings, etc.
- Edit `mount-head.sh` and `mount-worker.sh` to match your disk setup. (Remember to change the mounts in `gce_node_head.pp` and `gce_node_worker.pp` accordingly.)
- Edit `modules.sh` if you want to download the module repository in a non-standard way. Note: if the repository format is changed from git to something else, the `update-cluster.sh` script also has to be updated.
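As an illustration of the kind of settings involved, a `defaults.sh`-style configuration might look like the sketch below. The variable names are hypothetical stand-ins; consult the actual `defaults.sh` in atlasgce-scripts for the real ones.

```bash
# Hypothetical example only -- the real variable names are defined in defaults.sh.
PROJECT="my-gce-project"        # GCE project to run the cluster in
ZONE="us-central1-a"            # GCE zone for the instances
MACHINE_TYPE="n1-standard-2"    # instance type for the nodes
NUM_WORKERS=8                   # number of worker nodes to start
```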
Once the node template and bootstrapping procedure have been configured, three commands are used to control the cluster. These commands read the information they require (such as the GCE project, the number of worker nodes in the cluster, etc.) from `defaults.sh`.
- `start-cluster.sh` — starts the manager node and the worker nodes
- `stop-cluster.sh` — deletes the manager node and the worker nodes
- `update-cluster.sh` — fetches updates to the module repository on each node and applies them
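A typical session, assuming the scripts are run from the root of the atlasgce-scripts checkout and take no arguments, might look like this:

```bash
cd atlasgce-scripts
./start-cluster.sh      # boot the manager node and the worker nodes
# ... edit or pull new module code, then push it out:
./update-cluster.sh     # fetch module updates on each node and re-apply them
./stop-cluster.sh       # delete the manager node and the worker nodes
```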
See the documentation in each subdirectory for detailed information about each module.
The packagerepos module manages package repositories containing extra software and compatibility libraries required to run ATLAS software. These include the SLC repositories and repositories for HT Condor, CernVM-FS, and AutoPyFactory.
The autofs and cvmfs modules manage the CernVM-FS configuration and the Autofs service.
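Once the cvmfs and autofs modules have been applied, a quick way to check the mount (a sketch assuming the standard CernVM-FS client tools and the ATLAS repository name) is:

```bash
cvmfs_config probe atlas.cern.ch     # verify that the repository can be mounted
ls /cvmfs/atlas.cern.ch              # autofs should mount it on access
```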
The xrootd module manages configuration and services for XRootD, Cluster Management, and File Residency Management.
The condor module manages the HT Condor configuration and the collector, negotiator, schedd, and startd services.
The apf module manages the AutoPyFactory configuration and service.
The gce_node module is an umbrella module to configure a machine for one of the specified roles. It is responsible for installing compatibility packages and doing any contextualization that is not directly tied to any of the other modules.
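When developing or debugging the modules it can be useful to apply the node manifest by hand instead of going through `update-cluster.sh`. A minimal sketch, assuming the modules are checked out to /opt/atlasgce-modules and `gce_node_head.pp` is the manifest in use:

```bash
# Dry-run first to see what would change, then apply for real.
puppet apply --noop --modulepath=/opt/atlasgce-modules gce_node_head.pp
puppet apply --verbose --modulepath=/opt/atlasgce-modules gce_node_head.pp
```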