DM-43722 Implement Transformed EFD service #72

rcboufleur · 2025-03-17T11:59:30Z

Overview

This pull request implements a structured framework for processing and transforming data from the Engineering and Facilities Database (EFD). It enables data retrieval, transformation, schema generation, and integration within the LSST ecosystem.

Main components:

Configurable data transformation of EFD topics
Automated schema generation for database tables
Task scheduling and retry management
Statistical summarization of time-series data
YAML-based configuration files per instrument
Customized Alembic migration versioning for schema updates

This implementation supports data processing for instruments such as LATISS, LSSTComCam, and LSSTComCamSim.

Configuration Framework

Pydantic-based models (Field, Topic, Column, ConfigModel) for YAML validation
Validation for unpivoted tables
Instrument-specific configurations (e.g., config_latiss.yaml)
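
A minimal sketch of the kind of rule "validation for unpivoted tables" suggests, using plain dataclasses as a stand-in for the Pydantic models. The field names (`function`, `tables`, `store_unpivoted`) and the rule itself are assumptions for illustration, not the schema defined in config_model.py.

```python
from dataclasses import dataclass, field

# Key-value table names taken from the PR description.
UNPIVOTED_TABLES = {"exposure_efd_unpivoted", "visit1_efd_unpivoted"}


@dataclass
class Column:
    """Hypothetical column entry; the field names are illustrative only."""

    name: str
    function: str
    tables: list = field(default_factory=list)
    store_unpivoted: bool = False

    def __post_init__(self) -> None:
        # Assumed rule: an unpivoted column may only target key-value tables.
        if self.store_unpivoted:
            invalid = [t for t in self.tables if t not in UNPIVOTED_TABLES]
            if invalid:
                raise ValueError(
                    f"Column {self.name!r}: invalid unpivoted tables {invalid}"
                )


def load_columns(raw: list) -> list:
    """Build validated Column objects from parsed YAML-like dictionaries."""
    return [Column(**entry) for entry in raw]
```

With this sketch, a column marked as unpivoted that points at `exposure_efd` raises `ValueError` when the configuration is loaded, not later during a transformation run.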

Data Transformation Pipeline

Processes exposures and visits using the Transform class
Queries EFD data via InfluxDbDao
Dynamically processes columns based on configuration
Stores transformed data in PostgreSQL via ExposureEfdDao/VisitEfdDao
Supports structured (pivoted) and key-value (unpivoted) data formats
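
To illustrate the difference between the two storage formats, the sketch below converts one wide (pivoted) row into key-value (unpivoted) records. The column names `property` and `value` and the `exposure_id` key are assumptions; the actual table layout is defined by the schema files in this PR.

```python
def unpivot_row(row: dict, id_key: str = "exposure_id") -> list:
    """Turn one wide (pivoted) row into key-value (unpivoted) records.

    Columns whose value is None are skipped, so the key-value table
    only holds properties that were actually computed.
    """
    record_id = row[id_key]
    return [
        {id_key: record_id, "property": name, "value": value}
        for name, value in row.items()
        if name != id_key and value is not None
    ]


# One pivoted row: a fixed set of columns per exposure.
pivoted = {"exposure_id": 1001, "mean_temp": 12.5, "max_wind": None}

# The same data as key-value records: one row per (exposure, property).
unpivoted = unpivot_row(pivoted)
```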

Summary Statistics

Computes metrics such as mean, standard deviation, and RMS from time-series data
Supports time-based filtering (e.g., most recent value in the last minute)
Integrates with transformation pipelines
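
The statistics listed above can be sketched with the standard library alone. This is not the code in summary.py; the function names are hypothetical, and using the population standard deviation (not the sample one) is an assumption.

```python
import math
from datetime import datetime, timedelta
from statistics import fmean, pstdev


def summarize(samples: list) -> dict:
    """Compute mean, standard deviation, and RMS of (timestamp, value) pairs."""
    values = [value for _, value in samples]
    if not values:
        return {"mean": None, "stddev": None, "rms": None}
    return {
        "mean": fmean(values),
        # Population standard deviation; an assumption for this sketch.
        "stddev": pstdev(values),
        "rms": math.sqrt(fmean([v * v for v in values])),
    }


def most_recent(samples: list, end: datetime,
                window: timedelta = timedelta(minutes=1)):
    """Return the latest value within `window` before `end`, or None."""
    in_window = [(t, v) for t, v in samples if end - window <= t <= end]
    if not in_window:
        return None
    return max(in_window, key=lambda pair: pair[0])[1]
```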

Schema Generation & Alembic Migrations

generate_schema_from_config.py creates database schemas from configuration files
Supports structured tables (exposure_efd, visit1_efd) and key-value tables (exposure_efd_unpivoted, visit1_efd_unpivoted)
Includes transformed_efd_scheduler table for task tracking
Implements customized Alembic migration versioning:
- Saves and restores schema snapshots during migrations
- Tracks configuration changes across database versions
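
As a rough picture of config-driven schema generation, the sketch below renders SQL DDL from a mapping of column names to types. The output format of generate_schema_from_config.py is not shown in this thread, so both the SQL output and the `build_create_table` helper are assumptions made purely for illustration.

```python
def build_create_table(table: str, columns: dict, key: str) -> str:
    """Render a CREATE TABLE statement from a column-name-to-type mapping."""
    lines = [f"    {key} BIGINT PRIMARY KEY"]
    lines += [f"    {name} {sql_type}" for name, sql_type in columns.items()]
    body = ",\n".join(lines)
    return f"CREATE TABLE {table} (\n{body}\n);"


# Hypothetical configuration fragment: table name -> {column: SQL type}.
config = {
    "exposure_efd": {
        "mean_temp": "DOUBLE PRECISION",
        "n_samples": "INTEGER",
    }
}

ddl = build_create_table("exposure_efd", config["exposure_efd"], key="exposure_id")
```

Deriving the table definition from the same configuration that drives the transformation keeps the two from drifting apart, which is presumably why the migrations also snapshot the configuration.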

Task Management

QueueManager handles task creation, retries, and status tracking
Supports execution via Kubernetes Jobs (fixed intervals) and CronJobs (scheduled runs)
Implements automatic retries for failed tasks with defined limits
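
A minimal sketch of retry handling with a fixed limit. The status names, the `Task` fields, and the limit of three retries are assumptions; the real QueueManager persists task state in the transformed_efd_scheduler table rather than in memory.

```python
from dataclasses import dataclass

MAX_RETRIES = 3  # assumed limit for this sketch


@dataclass
class Task:
    """Hypothetical in-memory task covering one time interval."""

    start: str
    end: str
    status: str = "pending"
    retries: int = 0


def run_task(task: Task, job, max_retries: int = MAX_RETRIES) -> Task:
    """Run job(task); on failure, re-queue it until the retry limit is hit."""
    try:
        job(task)
    except Exception as exc:
        task.retries += 1
        # Stay "pending" so a later run picks the task up again,
        # unless the retry budget is exhausted.
        task.status = "failed" if task.retries >= max_retries else "pending"
        print(f"task {task.start}..{task.end} failed ({exc}); retries={task.retries}")
    else:
        task.status = "completed"
    return task
```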

Code Structure

config_model.py – Configuration validation models
summary.py – Statistical operations on time-series data
transform.py – Core transformation logic
transform_efd.py – CLI entry point and workflow orchestration
generate_schema_from_config.py – Schema generation
dao/*.py – Database access layer (PostgreSQL/InfluxDB)
queue_manager.py – Task queue management

Validation & Error Handling

Configuration validation: Ensures valid tables for unpivoted data and correct data types in summaries
Task retry logic: Automatic retries with detailed error logging
Data integrity checks: Time range validation for exposures/visits and database constraint enforcement
Schema migration validation: Ensures Alembic migration consistency with stored snapshots
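
The time range check can be as simple as the sketch below. The function name and the exact rule (start strictly before end) are assumptions, since the thread does not show the implemented check.

```python
from datetime import datetime


def validate_time_range(start: datetime, end: datetime) -> None:
    """Raise ValueError unless `start` is strictly earlier than `end`."""
    if start >= end:
        raise ValueError(
            f"start {start.isoformat()} must precede end {end.isoformat()}"
        )
```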

Testing

Unit tests: Validation of configuration models, statistical calculations, and transformation edge cases (test coverage is still incomplete)

.vscode/settings.json

alembic/transformed_efd_latiss/env.py

python/lsst/consdb/transformed_efd/dao/base.py

JeremyMcCormick · 2025-03-17T17:27:41Z

This may already be planned, but you will need to clean up the commit history. Commit messages should be like "Add something," etc. (I am assuming you are familiar with the DM standards on this, and I realize this is still a draft.)

You should also not be doing merges from main. This should be done with git rebase -i main after you have pulled main. Again, please see the DM standards on this in the dev guide.

ktlim · 2025-03-20

Still working through this, but posting some initial thoughts.

.vscode/settings.json

Dockerfile.efdtransform

ktlim · 2025-03-20T23:11:43Z

Dockerfile.efdtransform

+
+# Create and populate the data directory
+RUN mkdir -p /opt/lsst/software/stack/data
+COPY --chown=lsst:lsst tmp/efd_transform/*.db /opt/lsst/software/stack/data/


I'm not seeing where this tmp directory comes from. Should this be removed for production?

rcboufleur · 2025-03-24

Yes, it should be removed. It was the database used before we had PostgreSQL available.

Dockerfile.efdtransform

ktlim · 2025-03-20T23:17:08Z

Dockerfile.efdtransform

+    PGUSER="rubin" \
+    CONSDB_URL="sqlite:////opt/lsst/software/stack/data/test.db" \
+    TIMEDELTA="5" \
+    LOG_FILE="/opt/lsst/software/stack/data/transform.log" \


Writing to stdout or stderr is more Kubernetes-friendly, and it doesn't risk (as much) filling the disk and crashing the job. But if jobs deal with short time periods, this is not so much of an issue.

rcboufleur · 2025-03-24

It currently writes logs to both stdout/stderr and a log file. The log file can be removed or made optional.
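
One way to make the log file optional, sketched with the standard `logging` module: always log to stdout, and attach a file handler only when `LOG_FILE` is set. The logger name and the `configure_logging` helper are assumptions, not code from this PR.

```python
import logging
import os
import sys


def configure_logging(name: str = "transformed_efd") -> logging.Logger:
    """Log to stdout always; add a file handler only if LOG_FILE is set."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    stream = logging.StreamHandler(sys.stdout)
    stream.setFormatter(formatter)
    logger.addHandler(stream)

    log_file = os.environ.get("LOG_FILE")
    if log_file:
        file_handler = logging.FileHandler(log_file)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger
```

With this arrangement the Dockerfile could drop the `LOG_FILE` default, and a Kubernetes deployment would collect output from stdout without writing to the container's disk.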

pyproject.toml

python/lsst/consdb/transformed_efd/config_model.py

ktlim · 2025-03-20T23:22:38Z

python/lsst/consdb/transformed_efd/__init__.py

@@ -0,0 +1,39 @@
+"""Provides a structured framework for processing and transforming data from the (EFD).


The documentation here does not exactly conform to https://developer.lsst.io/python/numpydoc.html though it is close. A later ticket can be used to clean this up.

rcboufleur · 2025-03-24

Sure. I'll double-check it.
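
For reference, a function docstring laid out in the numpydoc section style that the linked guide describes might look like the sketch below. The `rms` function is a hypothetical example, not code from this PR, and the guide remains the authority on the exact conventions.

```python
import math


def rms(values):
    """Compute the root mean square of a sequence of numbers.

    Parameters
    ----------
    values : `list` [`float`]
        Samples to summarize; must not be empty.

    Returns
    -------
    result : `float`
        Square root of the mean of the squared samples.

    Raises
    ------
    ValueError
        Raised if ``values`` is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    return math.sqrt(sum(v * v for v in values) / len(values))
```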

…method. Introduced mutable tables for better data modifications. Added logging for memory usage to track performance. Allowed handling of duplicated idle tasks. Developed a new query method for improved data retrieval.


rcboufleur changed the title ~~Tickets/dm 43722~~ DM-43722 Implement Transformed EFD service Mar 17, 2025

rcboufleur requested a review from JeremyMcCormick March 17, 2025 12:04

JeremyMcCormick reviewed Mar 17, 2025

.vscode/settings.json (outdated, resolved)

JeremyMcCormick reviewed Mar 17, 2025

alembic/transformed_efd_latiss/env.py (outdated, resolved)

JeremyMcCormick reviewed Mar 17, 2025

python/lsst/consdb/transformed_efd/dao/base.py (resolved)

JeremyMcCormick reviewed Mar 17, 2025

python/lsst/consdb/transformed_efd/dao/base.py (resolved)

ktlim reviewed Mar 20, 2025

rcboufleur force-pushed the tickets/DM-43722 branch from 5047fa4 to a9bfb54 Compare March 24, 2025 21:49

glaubervila and others added 21 commits March 24, 2025 18:52

Implement initial EFD transformations

39a44ee

Update missing config values

56032c4

Fix exception for unimplemented dialect

85c99c7

Refactor Summary for column operations

445a646

Refactor Aggregate class to Summary for better clarity and consistency

239552b

Refactor ATAOS_correctionOffsets_w function to use mean

e013d78

Update ExposureEfd schema for config_LATISS

8d3aac4

Add VisitEFD table and fix empty values

62cce83

Batch upsert transactions; add DAO docstrings

b0a241c

Refactor configuration file loading/validation

5e70961

Update InfluxDB API for topic queries

c9600d6

Fix lint

677b569

Insert additional config parameters

91eeabb

Add Dockerfile for Efd Transform

e3fe1e1

Add main command to Dockerfile

db3553d

Implement InfluxDB packed time series retrieval using API query

4b45179

Fix lint

dea0588

Implement new config formats and validation

c50dc15

Implement unique topic queries

f40b4fc

Change base image to w_2024_33

203a81f

Use env variables for usdf_efd API

a474636

rcboufleur added 29 commits March 24, 2025 18:52

Refactor pytest for config_model

4a7c2a7

Implement pytest for generate_schema

a8ae1fe

Implement pytest for queue_manager

96ac4be

Implement pytest for summary methods

5cd4684

Update temporary files used in local testing

eb73c39

Update Dockerfile configurations

e730892

Update PYTHONPATH in Dockerfile

d8cbbe3

Fix LATISS schema tables

2349cae

Fix lint errors

e8cdbc1

Update temporary files (local run)

649a1d5

Fix linting issues across the codebase

b6bc6d0

Fix pre-commit issues across the transform_efd codebase

90dd9d8

Add bulter_repo column to scheduler table

02d67ca

Extend queue_manager and TransformDB methods

968cd21

Refactor and update jobs and cronjobs workflow

e7abad1

Update schema yaml files

7a78c91

Fix efdtransform entry at .github/workflows/build.yaml

a4c906d

Pause cronjob from picking up failed jobs

363946e

Ignore local testing files in git

93729c1

Refactor file names and locations

22c966f

Add Alembic initial migration

6be23c8

Stop tracking .vscode directory

36e31cb

Update python version in pre commit config

b4e930b

Conform EOF standards

5b21612

Refactor environment variables in Dockerfile to use one per line with ENV

011339c

Add license preamble to all Python files

f60a3f1

Add @classmethod to validate_tables_when_unpivoted

aa6390b

Refactor 'con' as 'connexion' for clarity and consistency. Fix flake8 errors across the codebase. Renaming the con attribute to connexion improves readability. Additionally, flake8-reported issues were fixed to maintain code quality.

bad8fbd

rcboufleur force-pushed the tickets/DM-43722 branch from a9bfb54 to bad8fbd Compare March 24, 2025 21:53
