Make AirflowRunFacet more reliable for OpenLineage consumers #47165
kacpermuda asked this question in Ideas (unanswered, 0 comments)
Problem Statement
The AirflowRunFacet (and AirflowDagRunFacet, which is built in the same way) in the OpenLineage provider provides valuable metadata to OpenLineage consumers. While a spec file exists here, most fields are not required by it, and the final format of some fields is not consistently enforced by the OpenLineage integration. Instead, in some cases, the facet simply reflects the structure and data of the Airflow core models. These can vary between Airflow versions, and they can also present the same information differently depending on whether we are dealing with a single object or multiple objects (see the example below). This makes the facet prone to unexpected changes and inconsistencies, as it directly mirrors Airflow’s internal representation at any given moment.
For example:
Consumers of OpenLineage data must account for not only different OpenLineage versions but also varying Airflow versions, making integration more fragile compared to other facets, which are explicitly defined in the spec and backward-compatible.
We have some tests that verify what some parts of this facet look like, but I'm not sure we check the stability of this facet across Airflow versions well enough. Also, there is always room for more tests, so I'll look into that.
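One way such a cross-version check might look is to compare the structure of the facet rather than its values. The sketch below is only an illustration: `shape_of` and the two facet fragments are hypothetical and not part of the OpenLineage provider, and the key names are made up for the example.

```python
from typing import Any


def shape_of(value: Any) -> Any:
    """Return a structural signature: dict keys and type names, no concrete values."""
    if isinstance(value, dict):
        return {key: shape_of(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Collapse a list to the distinct shapes of its elements.
        shapes = []
        for item in value:
            item_shape = shape_of(item)
            if item_shape not in shapes:
                shapes.append(item_shape)
        return ["list-of", *shapes]
    return type(value).__name__


# Hypothetical facet fragments (assumed shapes, not real provider output).
facet_single = {"dag": {"timetable": {"condition": {"uri": "s3://bucket/a"}}}}
facet_multi = {
    "dag": {
        "timetable": {
            "condition": {
                "objects": [{"uri": "s3://bucket/a"}, {"uri": "s3://bucket/b"}]
            }
        }
    }
}

# Same structure with different values compares equal;
# a single object vs. a list of objects does not.
same_shape = shape_of(facet_single) == shape_of(
    {"dag": {"timetable": {"condition": {"uri": "s3://other/x"}}}}
)
differs = shape_of(facet_single) != shape_of(facet_multi)
```

A test suite could store the signature produced on one Airflow version and compare it with the one produced on another, so that structural drift shows up as a failing assertion instead of a surprise for consumers.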
Example
This issue caught my attention while testing how a DAG’s schedule is represented in OpenLineage events. As seen in #47150, the representation can vary. The `dag.timetable` field:
`objects` key with a list of datasets):
And in Airflow 3, `dataset` will be renamed to `asset`, so:
This is just an example (and it could be somehow standardized on the serialization level in core Airflow), but it made me think about whether we can or should try to provide a more consistent experience for OL consumers regardless of the core Airflow version.
Proposed Solution
We could consider the following approaches:
This approach would admittedly place more burden on OpenLineage to handle differences across Airflow versions, but it would shift complexity away from consumers and ensure that OpenLineage events remain stable and predictable - ultimately making the ecosystem more robust and easier to integrate with in the long run.
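To make the idea of absorbing these differences on the OpenLineage side more concrete, here is a minimal sketch of a normalization step. Everything in it is an assumption for illustration: the function names, the `dataset_condition` / `asset_condition` keys, and the output shape are hypothetical and are not the provider's actual API or Airflow's actual serialization. It only relies on the two variations described above (a single object vs. an `objects` key with a list, and the `dataset` to `asset` rename).

```python
from typing import Any

# Assumed key names for the two naming schemes (hypothetical).
CONDITION_KEYS = ("asset_condition", "dataset_condition")


def _collect_uris(node: Any) -> list[str]:
    """Walk a condition node and gather every URI, for single or nested objects."""
    if not isinstance(node, dict):
        return []
    if "objects" in node:
        uris: list[str] = []
        for child in node["objects"]:
            uris.extend(_collect_uris(child))
        return uris
    uri = node.get("uri")
    return [uri] if isinstance(uri, str) else []


def normalize_timetable(timetable: dict[str, Any]) -> dict[str, list[str]]:
    """Return one stable shape, {"asset_uris": [...]}, whatever the input variant."""
    for key in CONDITION_KEYS:
        if key in timetable:
            return {"asset_uris": _collect_uris(timetable[key])}
    return {"asset_uris": []}


# Hypothetical inputs: a single object with the old naming, a list with the new one.
single = {"dataset_condition": {"uri": "s3://bucket/a"}}
multi = {
    "asset_condition": {
        "objects": [{"uri": "s3://bucket/a"}, {"uri": "s3://bucket/b"}]
    }
}

normalized_single = normalize_timetable(single)
normalized_multi = normalize_timetable(multi)
```

With a step like this in the integration, both inputs would yield the same output structure, so a consumer would only ever need to read one field, at the cost of the integration tracking every variant that core Airflow produces.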
Would love to hear your thoughts on whether you think this is a real problem and, if so, if OpenLineage should solve it and what other solutions might be worth considering.
CC @mobuchowski @JDarDagran