DAGRun API endpoint returns status 500 for some run states #19836
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
WTH, a DAG run is not supposed to be skipped. There's a serious bug somewhere.
@frapa-az How can we reproduce a DagRun ending up in the `skipped` state?
@ephraimbuddy Thanks for answering! I managed to reproduce it by simply pausing the DAG (not the DAG run). I do not know exactly whether this is timing dependent, but for me it seems rather easy to reproduce. Usually I pause just after the run was scheduled. I have had a few of these runs in the last few days, so it must not be too rare:
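For concreteness, pausing the DAG can be done through the UI toggle or the stable REST API. A sketch of the latter, where the host, credentials, and `my_dag` are placeholders for whatever the deployment uses:

```python
import requests

# Pausing the DAG (not the DAG run) just after a run is scheduled
# is what reproduced the skipped run for me.
resp = requests.patch(
    "http://localhost:8080/api/v1/dags/my_dag",
    json={"is_paused": True},
    auth=("admin", "admin"),
)
resp.raise_for_status()
```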
Just curious, do any of your DAGs happen to be subDAGs?
@uranusjr What do you mean by subDAGs? Unfortunately I do not know the details, because we have an abstraction layer on top, but I think there's actually a single DAG that runs all pipelines, creating tasks dynamically based on a JSON spec. Does that help?
Ah! SubDAGs are DAGs that belong to another DAG, created with SubDagOperator: https://www.astronomer.io/guides/subdags I'm guessing your abstraction layer is implemented exactly with this: the top-level DAG creates a task for each of the DAGs you create through the layer, and those DAGs (except the top one) are actually subDAGs from Airflow's perspective. From my code search, there is a single possible code path that would set a DAG run's state to skipped. If a SubDagOperator task is skipped, the task scheduler would set its state to skipped (obviously), and the operator would also set its controlling subDAG's run state to `skipped`. This is the exact function that does it: airflow/airflow/api/common/experimental/mark_tasks.py, lines 195 to 210 in c4e8959
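For reference, that helper looks roughly like the sketch below. This is paraphrased from memory of the Airflow 2.x experimental API, not an exact quote of the linked lines, so treat it as an approximation:

```python
# Approximate paraphrase of _set_dag_run_state from
# airflow/api/common/experimental/mark_tasks.py; see the linked
# lines for the authoritative version.
from airflow.models import DagRun
from airflow.utils import timezone
from airflow.utils.state import State


def _set_dag_run_state(dag_id, execution_date, state, session=None):
    """Write a DAG run's state directly to the metadata database."""
    dag_run = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
        .one()
    )
    dag_run.state = state
    if state == State.RUNNING:
        dag_run.start_date = timezone.utcnow()
        dag_run.end_date = None
    else:
        dag_run.end_date = timezone.utcnow()
    session.merge(dag_run)
```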
This function is only called from the SubDagOperator code path. As far as I can tell, a DAG run can only be `skipped` through it.
@uranusjr Wow, thanks for finding that out! I can ask the team responsible for this, if they can give me more details to ensure that this is really the case. Most probably I'll come back tomorrow with an answer. I guess setting it to `skipped` makes sense in that case.
When you do, it'd be most awesome if you could also tell them we appreciate them pushing how Airflow works to the limit where this obscure bug can surface 😛
I dug through the code. We're not using SubDAGs, but we do something really weird with callbacks. So after all, this is our misuse of Airflow and not a bug in Airflow itself. Here is the relevant snippet (I believe):

```python
from airflow import settings
from airflow.utils.state import State


def on_failure_callback(context):
    # TaskCanceledException is our own custom exception type.
    if isinstance(context['exception'], TaskCanceledException):
        session = settings.Session()
        context['task_instance'].set_state(State.SKIPPED, session=session)
        # This writes a state to the DAG run row that the API schema
        # does not allow, which is what breaks the endpoint.
        context['dag_run'].state = State.SKIPPED
        session.merge(context['dag_run'])
        session.commit()


task.on_failure_callback = on_failure_callback
```

However, maybe it would be good to add some validation to the states before saving them, to avoid them ending up in such inconsistent states? I find it a little strange that Airflow accepts this being saved into the database, but then validates it when returning it from the API.
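For what it's worth, a less invasive way to end up with a skipped *task* (without touching the DAG run's state at all) is to raise AirflowSkipException from inside the task and let the scheduler manage the state transitions. A minimal sketch, where `should_cancel` is a hypothetical stand-in for whatever detects cancellation in the abstraction layer:

```python
from airflow.exceptions import AirflowSkipException


def should_cancel() -> bool:
    # Hypothetical predicate: replace with whatever signals
    # cancellation in your abstraction layer.
    return True


def my_task_fn(**context):
    if should_cancel():
        # The scheduler marks this task instance as skipped; the DAG run's
        # state is left for Airflow to resolve, so it stays consistent.
        raise AirflowSkipException("Task was cancelled")
```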
Currently you are fully in control of what you do with Airflow code, but power also means responsibility. This is the basic premise of the "open DB access" mode that Airflow currently supports. There is nothing to prevent any DB operations from the code point of view. You'd have to implement complex database triggers or tap into SQLAlchemy logic to prevent this (but the latter could be bypassed). Airflow supports "raw" DB operation. Use those powers wisely.

And this one is not a very disruptive operation. Currently in Airflow you can literally delete the whole database from a callback if you want. So preventing this case is like trying to plug a small hole with your finger when half of the bottom of your boat is missing.

However, we are already working on the multi-tenancy implementation, and specifically it includes a DB-isolation mode, where you won't be able to directly execute DB operations, but you will have some API methods to call (including setting task state on your task). Until then it makes little sense to do anything with it. See the notes here: https://docs.google.com/document/d/19d0jQeARnWm8VTVc_qSIEDldQ81acRk0KuhVwAd96wo/edit
Closing this one for now, then.
FWIW, we can pretty trivially validate state values on assignment (both on `DagRun` and `TaskInstance`).
Ah right!
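A minimal sketch of what on-assignment validation could look like with SQLAlchemy's `validates` hook. The model and state names mirror Airflow's for illustration only; this is not the actual Airflow model code:

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import validates

Base = declarative_base()

# Illustrative: the three DAG-run states the 2.0.x API schema accepts.
VALID_DAG_RUN_STATES = {"running", "success", "failed"}


class DagRun(Base):
    __tablename__ = "dag_run"

    id = Column(Integer, primary_key=True)
    state = Column(String(50))

    @validates("state")
    def validate_state(self, key, value):
        # Raises on assignment, so an invalid state never reaches the DB
        # and the API never has to serialize a value it cannot represent.
        if value is not None and value not in VALID_DAG_RUN_STATES:
            raise ValueError(f"{value!r} is not a valid DagRun state")
        return value
```

With a validator like this in place, `dag_run.state = "skipped"` would raise immediately instead of persisting a value the API cannot serialize.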
Apache Airflow version
2.0.2
Operating System
Unknown
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
No response
What happened
I am using the REST API. When I make a request to the DAGRun endpoint and one DAGRun is in state `skipped`, the request fails with a 500 Internal Server Error. This seemingly happens because the state of the run is `skipped`, which is not one of the three states (running, success, failed) that the API response schema allows.
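For illustration, a request along the following lines triggers the failure. This is a sketch: the host, credentials, and `my_dag` are placeholders for whatever the deployment uses:

```python
import requests

# Placeholders: adjust the host, credentials, and dag_id for your deployment.
resp = requests.get(
    "http://localhost:8080/api/v1/dags/my_dag/dagRuns",
    auth=("admin", "admin"),
)
print(resp.status_code)  # 500 when any returned run is in state "skipped"
```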
What you expected to happen
I would expect `state=skipped` to be returned by the API.
How to reproduce
No response
Anything else
I have no access to the system and do not know how Airflow was set up.
Are you willing to submit PR?
Code of Conduct