atlasgce-modules

Description

The atlasgce-modules are Puppet modules for contextualizing analysis clusters for the ATLAS Experiment. They are developed primarily for Google Compute Engine (GCE).

Operating system support

The modules have been tested on the CentOS 6 operating system. They should work out of the box on most RedHat-based systems of the same generation, such as SL6 and SLC6.

Work is also in progress to partially support CernVM. The SLC5-based CernVM 2.6 and 2.7 will support all modules except packagerepos (due to the lack of Conary support in Puppet) and cvmfs (which is already configured during the CernVM contextualization). μCernVM will be SLC6 based, will use RPM for package management, and should be fully supported.

Debian-based systems are not supported, but support can be added.

Cloud support

These modules have been developed for GCE, but support for other clouds can be implemented.

Functionality

Overview

The atlasgce-modules provide from-scratch contextualization on bare machines (virtual or physical) for ATLAS analysis.

Three different roles are available: the manager role (head), the worker role (node), and the worker role for a Cloud Scheduler environment (csnode).

The manager role (head)

The manager role consists of the following elements:

  • The AutoPyFactory service fetches jobs from a PanDA queue and submits them locally to Condor.
  • The Condor collector, negotiator, and schedd services manage job submission and the distribution of subjobs over the worker nodes.
  • The XRootD and Cluster Management services act as a local XRootD redirector and are responsible for accessing and caching input data files through the Federated ATLAS XRootD system. (Optional)
  • The CernVM-FS service provides consistent access to ATLAS software. CernVM-FS is not strictly required for the manager role, but can be helpful when debugging the Condor services. (Optional)
  • Compatibility packages for running SLC5 binaries on SLC6.

The worker role (node)

The worker role consists of the following elements:

  • The Condor startd service runs the individual subjobs as dictated by the manager.
  • The XRootD, Cluster Management, and File Residency Management services are responsible for accessing input data files through the manager and for downloading those that are not yet available in the cache.
  • The CernVM-FS service provides consistent access to ATLAS software.
  • Compatibility packages for running SLC5 binaries on SLC6.

The Cloud Scheduler worker role (csnode)

The Cloud Scheduler worker role consists of the following elements:

  • The Condor startd service runs the individual subjobs as dictated by the Cloud Scheduler.
  • The CernVM-FS service provides consistent access to ATLAS software.
  • Compatibility packages for running SLC5 binaries on SLC6.
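
In all three cases the role is selected when the gce_node umbrella class (see Contents below) is declared in the node template. A minimal sketch, assuming a role-style parameter (the actual parameter names are defined by the gce_node module and may differ):

# Minimal sketch only; check the gce_node module documentation for the
# actual interface.
class { 'gce_node':
  role => 'head',   # 'head', 'node', or 'csnode'
}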

Contextualization

The contextualization of the machine is governed by Puppet by means of a public collection of modules (the atlasgce-modules, this repository) and one or more bootstrapping scripts.

Bootstrapping

The standard way of bootstrapping GCE images is to use a startup script, and this method is used for the manager and worker nodes.

For Cloud Scheduler worker nodes, startup scripts are not supported; bootstrapping is instead achieved using a machine image prepared with software that downloads and runs a bootstrap script supplied through the userdata metadata attribute. (Cloud Scheduler might support startup scripts on GCE in the future.)

See atlasgce-scripts for more information on the bootstrapping procedure.

What is contextualized?

The atlasgce-modules, in combination with the bootstrapping procedure, handle contextualization of everything from preparing and mounting additional storage, to adding package repositories and downloading required software, to creating and setting up user accounts for services, to configuring and starting the supported services.

This contextualization is done on a bare machine, meaning that no software other than the package manager is required. Even Puppet is installed during the bootstrapping procedure.

This means that the extra work of preparing machine images with the required software and configuration, and the rather high turnaround time that comes with it, is effectively eliminated. The extra cost of redoing the contextualization at instantiation has been found to be very small on GCE.
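
To give a flavor of what the modules do, the sketch below uses standard built-in Puppet resource types (mount, user, service) of the kind involved in this contextualization; the resource titles and values are placeholders, not taken from the modules themselves.

# Illustration only: built-in Puppet resource types of the kind the
# modules declare. Titles, devices, and service names are placeholders.
mount { '/data':
  ensure  => mounted,
  device  => '/dev/sdb',   # hypothetical scratch disk
  fstype  => 'ext4',
  options => 'defaults',
}

user { 'xrootd':
  ensure => present,
  system => true,
}

service { 'autofs':
  ensure => running,
  enable => true,
}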

Puppet

See What is Puppet?

Usage

This section describes suggested usage together with the atlasgce-scripts. It describes how to configure GCE options, how to configure the node template and other parts of the bootstrapping procedure, and how to start, update, and stop a cluster.

Note: Configuration of the GCE project including adding SSH keys and configuring the firewall for incoming traffic (if necessary) is not covered here. Refer to the official documentation.

Note: Detailed information about configurable parts of the atlasgce-scripts can be found in its documentation.

Getting started

  1. Download atlasgce-scripts:
git clone https://github.com/spiiph/atlasgce-scripts.git
  2. Download atlasgce-modules (optional):
git clone https://github.com/spiiph/atlasgce-modules.git
  3. Enter the atlasgce-scripts directory and edit defaults.sh to change the GCE configuration to reflect your project and cluster.
  4. Edit gce_node_head.pp and gce_node_worker.pp to configure important options such as the role, manager node address, XRootD redirector, PanDA settings, etc. (a hypothetical template is sketched after this list).
  5. Edit mount-head.sh and mount-worker.sh to match your disk setup. (Remember to change the mounts in gce_node_head.pp and gce_node_worker.pp accordingly.)
  6. Edit modules.sh if you want to download the module repository in a non-standard way. Note: if the repository format is changed from git to something else, the update-cluster.sh script also has to be updated.
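
The node templates are Puppet manifests that declare the gce_node class. The snippet below is only a hypothetical sketch of what gce_node_head.pp might contain; the parameter names are illustrative and should be checked against the gce_node module documentation and the templates shipped with atlasgce-scripts.

# Hypothetical sketch, not the actual gce_node_head.pp shipped with
# atlasgce-scripts. All parameter names below are illustrative.
node default {
  class { 'gce_node':
    role              => 'head',                      # 'head', 'node', or 'csnode'
    head              => 'head.example.com',          # manager node address (placeholder)
    panda_queue       => 'EXAMPLE_QUEUE',              # PanDA queue name (placeholder)
    xrootd_redirector => 'redirector.example.com',    # FAX redirector (placeholder)
  }
}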

Managing the cluster

Once the node template and bootstrapping procedure have been configured, three commands are used to control the cluster. These commands read the information they require (such as the GCE project, the number of worker nodes in the cluster, etc.) from defaults.sh.

  • start-cluster.sh — starts a manager node and worker nodes
  • stop-cluster.sh — deletes the manager node and worker nodes
  • update-cluster.sh — fetches updates to the module repository on each node and applies them

Contents

Detailed module documentation

See the documentation in each subdirectory for detailed information about each module.

packagerepos

The packagerepos module manages package repositories containing extra software and compatibility libraries required to run ATLAS software. These include the SLC repositories and repositories for HT Condor, CernVM-FS, and AutoPyFactory.
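
In Puppet this typically amounts to yumrepo resources; a hedged example with a placeholder name and URL (not the repositories actually shipped with the module):

# Placeholder repository; the module defines the real names and URLs.
yumrepo { 'example-extra-repo':
  descr    => 'Extra software and compatibility libraries',
  baseurl  => 'http://repo.example.com/el6/$basearch',
  enabled  => 1,
  gpgcheck => 0,
}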

autofs and cvmfs

The autofs and cvmfs modules manage the CernVM-FS configuration and the Autofs service.

xrootd

The xrootd module manages configuration and services for XRootD, Cluster Management, and File Residency Management.

condor

The condor module manages the HT Condor configuration and the collector, negotiator, schedd, and startd services.
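
Which daemons run on a given machine follows from its role; HT Condor itself is controlled through a single service plus its DAEMON_LIST configuration. A minimal sketch, assuming the stock service name:

# Sketch only; the condor module's actual resources and parameters may differ.
# On the manager, DAEMON_LIST typically includes COLLECTOR, NEGOTIATOR, and
# SCHEDD; on workers it includes STARTD.
service { 'condor':
  ensure => running,
  enable => true,
}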

apf

The apf module manages the AutoPyFactory configuration and service.

gce_node

The gce_node module is an umbrella module to configure a machine for one of the specified roles. It is responsible for installing compatibility packages and doing any contextualization that is not directly tied to any of the other modules.
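
As an umbrella, gce_node essentially selects and declares the other modules according to the chosen role. A hypothetical sketch of that composition (the real class is parameterized and more involved):

# Hypothetical composition; see the gce_node module for the real structure.
class gce_node_example {
  include packagerepos
  include cvmfs
  include xrootd
  include condor
  include apf
}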
