Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-43722 Implement Transformed EFD service #72

Draft
wants to merge 113 commits into
base: main
Choose a base branch
from
Draft

Conversation

rcboufleur
Copy link

Overview

This pull request implements a structured framework for processing and transforming data from the Engineering and Facilities Database (EFD). It enables data retrieval, transformation, schema generation, and integration within the LSST ecosystem.

Main components:

  • Configurable data transformation of EFD topics
  • Automated schema generation for database tables
  • Task scheduling and retry management
  • Statistical summarization of time-series data
  • YAML-based configuration files per instrument
  • Customized Alembic migration versioning for schema updates

This implementation supports data processing for instruments such as LATISS, LSSTComCam, and LSSTComCamSim.

Configuration Framework
  • Pydantic-based models (Field, Topic, Column, ConfigModel) for YAML validation
  • Validation for unpivoted tables
  • Instrument-specific configurations (e.g., config_latiss.yaml)
Data Transformation Pipeline
  • Processes exposures and visits using the Transform class
  • Queries EFD data via InfluxDbDao
  • Dynamically processes columns based on configuration
  • Stores transformed data in PostgreSQL via ExposureEfdDao/VisitEfdDao
  • Supports structured (pivoted) and key-value (unpivoted) data formats
Summary Statistics
  • Computes metrics such as mean, standard deviation, and RMS from time-series data
  • Supports time-based filtering (e.g., most recent value in the last minute)
  • Integrates with transformation pipelines
Schema Generation & Alembic Migrations
  • generate_schema_from_config.py creates database schemas from configuration files
  • Supports structured tables (exposure_efd, visit1_efd) and key-value tables (exposure_efd_unpivoted, visit1_efd_unpivoted)
  • Includes transformed_efd_scheduler table for task tracking
  • Implements customized Alembic migration versioning:
    • Saves and restores schema snapshots during migrations
    • Tracks configuration changes across database versions
Task Management
  • QueueManager handles task creation, retries, and status tracking
  • Supports execution via Kubernetes Jobs (fixed intervals) and CronJobs (scheduled runs)
  • Implements automatic retries for failed tasks with defined limits

Code Structure

  • config_model.py – Configuration validation models
  • summary.py – Statistical operations on time-series data
  • transform.py – Core transformation logic
  • transform_efd.py – CLI entry point and workflow orchestration
  • generate_schema_from_config.py – Schema generation
  • dao/*.py – Database access layer (PostgreSQL/InfluxDB)
  • queue_manager.py – Task queue management

Validation & Error Handling

  • Configuration validation: Ensures valid tables for unpivoted data and correct data types in summaries
  • Task retry logic: Automatic retries with detailed error logging
  • Data integrity checks: Time range validation for exposures/visits and database constraint enforcement
  • Schema migration validation: Ensures Alembic migration consistency with stored snapshots

Testing

  • Unit tests: Validation of configuration models, statistical calculations, and transformation edge cases (incomplete)

@rcboufleur rcboufleur changed the title Tickets/dm 43722 DM-43722 Implement Transformed EFD service Mar 17, 2025
@JeremyMcCormick
Copy link
Contributor

JeremyMcCormick commented Mar 17, 2025

This may be planned but you will need to cleanup the commit history. Commit messages should be like "Add something," etc. (Assuming you are familiar with DM standards on this, and I realize this is still a draft.)

You should also not be doing merges from main. This should be done with git rebase -i main after you have pulled main. Again, please see the DM standards on this in the dev guide.

Copy link
Contributor

@ktlim ktlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still working through this, but posting some initial thoughts.


# Create and populate the data directory
RUN mkdir -p /opt/lsst/software/stack/data
COPY --chown=lsst:lsst tmp/efd_transform/*.db /opt/lsst/software/stack/data/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing where this tmp directory comes from. Should this be removed for production?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does. It was the database used before we had postgres available.

PGUSER="rubin" \
CONSDB_URL="sqlite:////opt/lsst/software/stack/data/test.db" \
TIMEDELTA="5" \
LOG_FILE="/opt/lsst/software/stack/data/transform.log" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing to stdout or stderr is more Kubernetes-friendly, and it doesn't risk (as much) filling the disk and crashing the job. But if jobs deal with short time periods, this is not so much of an issue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It currently writes logs to both stdout/stderr and a log file. The log file can be removed or made optional.

@@ -0,0 +1,39 @@
"""Provides a structured framework for processing and transforming data from the (EFD).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation here does not exactly conform to https://developer.lsst.io/python/numpydoc.html though it is close. A later ticket can be used to clean this up.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'll double check it.

…method.

Introduced mutable tables for better data modifications.
Added logging for memory usage to track performance.
Allowed handling of duplicated idle tasks.
Developed a new query method for improved data retrieval.
… errors across the codebase.

Renaming the con attribute to connexion improves readability. Additionally, flake8-reported issues were fixed to maintain code quality.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants