Commit 566d5d1

Merge pull request #8055 from martynia/janusz_pilot_docs
[Integration] Centralised pilot logging documentation
2 parents 072f781 + 1c2e9ad commit 566d5d1

File tree

1 file changed

+71
-2
lines changed
  • docs/source/AdministratorGuide/Systems/WorkloadManagement/Pilots

docs/source/AdministratorGuide/Systems/WorkloadManagement/Pilots/index.rst

+71-2
@@ -181,7 +181,6 @@ Pilots started when not controlled by the SiteDirector
181181

182182
You should keep reading if your resources include IAAS and IAAC type of resources, like Virtual Machines.
183183
If this is the case, then you need to:
184-
185184
- provide a certificate, or a proxy, to start the pilot;
186185
- such certificate/proxy should have the `GenericPilot` property;
187186
- in case of multi-VO environment, the Pilot should set the `/Resources/Computing/CEDefaults/VirtualOrganization` (as done e.g. by `vm-pilot <https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/WorkloadManagementSystem/Utilities/CloudBootstrap/vm-pilot#L122>`_);
@@ -190,7 +189,7 @@ If this is the case, then you need to:
190189
We have introduced a special command named "GetPilotVersion" that you should use,
191190
and possibly extend, in case you want to send/start pilots that don't know beforehand the (VO)DIRAC version they are going to install.
192191
In this case, you have to provide a json file freely accessible that contains the pilot version.
193-
This is tipically the case for VMs in IAAS and IAAC.
192+
This is typically the case for VMs in IAAS and IAAC.
194193

195194
The files to consider are in https://github.com/DIRACGrid/Pilot
196195

@@ -269,3 +268,73 @@ A simple example using the LHCbPilot extension follows::
269268
--name "$1" \
270269
--cert \
271270
--certLocation=/scratch/dirac/etc/grid-security \
271+
272+
Centralised Pilot Logging
273+
===========================
274+
The pilot jobs generate log files which are primarily accessed for debugging if
275+
there are issues with a particular resource; these (*classic*) log files are stored in a
276+
resource-dependent manner. On a grid CE, the pilot writes logs to stdout/stderr
277+
which are captured by the batch system and can later be retrieved using a CE
278+
specific tool. For a cloud resource the logs are typically written to a file on
279+
a given virtual machine instance where there is no standard or simple way for
280+
them to be retrieved.
281+
282+
The centralised (*remote*) pilot logging system offers new, resource-agnostic logging
283+
to ensure that the pilot logs are captured and made readily accessible for all
284+
resources as an extra debugging facility in parallel with the existing CE-based
285+
logging system. It also offers the ability to preview logs while the pilot
286+
is running.
287+
288+
The design of the new pilot logging system for DIRAC is based around having the
289+
pilot jobs periodically send their logs back to a central storage service based on the
290+
Tornado web server. For this to work, *TornadoPilotLoggingHandler* has to be installed on Tornado.
291+
Further processing of the log entries is done by a back-end plugin;
292+
the plugin to use is selected by the collector service configuration. Currently only
293+
a plugin which stores logs in a file on Tornado is implemented (*FileCacheLoggingPlugin*).
294+
When a pilot job marks a log file finalised, it can be copied by the *PilotLoggingAgent*
295+
to a selected SE.
296+
297+
The centralised logger can be enabled on a VO-by-VO basis. In addition, a CE whitelist can
298+
be provided to restrict pilot logging to those CEs.
299+
300+
The remote logger *FileCacheLoggingPlugin* requires the following mandatory configuration parameters, set in *Operations/<vo_name>/Pilot* or *Operations/Defaults/Pilot*:
301+
302+
- *RemoteLogging* - Enable remote logging (default False - disabled).
303+
- *RemoteLoggerURL* - to be set to the Tornado endpoint, e.g. *https://<host.name>:8443/WorkloadManagement/TornadoPilotLogging*.
304+
- *UploadSE* - name of the DIRAC SE to which complete logs are periodically uploaded by the *PilotLoggingAgent*.
305+
- *UploadPath* - VO-specific upload path on the SE (e.g. */<vo_name>/pilotlogs/*).
306+
307+
To fine-tune the logger, the following parameters can also be adjusted if necessary:
308+
309+
- *PilotLogLevel* - log level, default INFO.
310+
- *RemoteLoggerBufsize* - client-side buffer size in lines; default=1000. If the buffer is full it is flushed, causing log records
311+
to be sent to the server. The buffer
312+
is also flushed when an initial pilot activity is finished (i.e. before pilot commands are run) and when a pilot command
313+
finishes (successfully or not).
314+
- *RemoteLoggerTimerInterval* - client-side timer interval in seconds at which the logs are
315+
periodically flushed. Default: 0 (disabled). This option makes logs available for inspection should a pilot get stuck.
316+
- *RemoteLoggerCEsWhiteList* - a list of CEs for which the logger records are sent.
317+
Default: no CE restriction.
318+
- *RemoteLogsPriority* - which log source to try first; default False, which attempts to retrieve
319+
classic logs first.
320+
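The buffer-size flushing described for *RemoteLoggerBufsize* can be illustrated with a minimal sketch. This is not the DIRAC Pilot implementation: the class name ``BufferedRemoteLogger``, its methods and the ``send`` callable are hypothetical, and the timer-driven flush (*RemoteLoggerTimerInterval*) is left out for brevity.

```python
class BufferedRemoteLogger:
    """Hypothetical client-side log buffer (illustration only, not DIRAC code)."""

    def __init__(self, send, bufsize=1000):
        self._send = send        # assumed transport: a callable taking a list of lines
        self._bufsize = bufsize  # plays the role of RemoteLoggerBufsize
        self._buffer = []

    def write(self, line):
        """Store one log line and flush as soon as the buffer is full."""
        self._buffer.append(line)
        if len(self._buffer) >= self._bufsize:
            self.flush()

    def flush(self):
        """Send all buffered lines in one batch, then empty the buffer."""
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


# Usage: collect the batches locally instead of sending them to a server.
sent_batches = []
logger = BufferedRemoteLogger(sent_batches.append, bufsize=3)
for i in range(4):
    logger.write(f"line {i}")
logger.flush()  # explicit flush, e.g. when a pilot command finishes
```

With a buffer size of 3, the first three lines leave as one batch when the buffer fills, and the explicit flush sends the remaining line.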
321+
*PilotLoggingAgent* configuration:
322+
323+
- in *Operations* - *Shifter/DataManager* User and Group of a shifter proxy used to upload data.
324+
325+
Agent's options:
326+
327+
- *ClearPilotsDelay* - lifetime in days of logs in the Tornado document area, default: 30.
328+
- *proxyTimeleftLimit* - remaining proxy lifetime in seconds below which a new proxy is obtained; default: 600.
329+
330+
The administrator interface for retrieving pilot log files has also been
331+
connected to the collector. When the admin requests a pilot log from the DIRAC
332+
PilotManager service, the default resource-based method for fetching the log
333+
file is tried first; if for any reason this fails (e.g. log not available or
334+
the resource is offline) then the remote log collector is queried instead. The
335+
collector uses the configured plugin to try to retrieve the log file from the
336+
store. The order of the log sources is configurable by the DIRAC administrator (see
337+
*RemoteLogsPriority* flag above)
338+
allowing the collector to be queried before the resource-based system. This
339+
fallback mechanism is completely transparent to the administrator; the log is
340+
simply fetched from whichever source has it available.
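The retrieval order described above can be sketched as follows. This is an illustration under stated assumptions, not the PilotManager code: the function ``fetch_pilot_log`` and the two fetcher callables are hypothetical, and only the ordering controlled by *RemoteLogsPriority* is modelled.

```python
def fetch_pilot_log(pilot_ref, fetch_classic, fetch_remote, remote_logs_priority=False):
    """Try the log sources in the configured order and return the first log found.

    Hypothetical helper: fetch_classic and fetch_remote are assumed callables
    that return the log text, return None, or raise if the log is unavailable.
    """
    sources = [fetch_classic, fetch_remote]
    if remote_logs_priority:
        sources.reverse()  # query the remote collector before the resource
    for fetch in sources:
        try:
            log = fetch(pilot_ref)
        except Exception:
            continue  # source unavailable (e.g. resource offline): try the next one
        if log is not None:
            return log
    return None


# Usage with stand-in fetchers: the resource is offline, the collector has the log.
def offline_resource(pilot_ref):
    raise RuntimeError("resource is offline")


def remote_collector(pilot_ref):
    return f"remote log for {pilot_ref}"


result = fetch_pilot_log("pilot-1", offline_resource, remote_collector)
```

Here the classic source fails, so the caller receives the remote log without having to know which source supplied it.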
