Fix potential workspace deadlocks (ocrd_network) #1142

Merged: 7 commits merged into master on Dec 5, 2023

Conversation

@MehmedGIT (Contributor) commented Nov 21, 2023

Fix for #1125:

  • the processing endpoint now fails early in case the processing agent (processing worker or processor server) is not deployed; as a result, the workspace is no longer locked when the processing agent is missing (see the sketch after this list)
  • the workflow endpoint now fails early in case a processing agent is missing for one of the tasks (workflow script); this reduces the need to clean up the leftovers of a workflow that would otherwise fail partway through
  • removed a restriction: the processing server now also detects processing workers that were deployed manually after the config-driven deployment stage of the processing server
  • code improvements: removed some code duplication and replaced literals with constants
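
For illustration, here is a minimal sketch of the fail-early idea, assuming a FastAPI app and hypothetical helper names (`worker_queue_exists`, `validate_agent_exists`); whether the check goes through RabbitMQ queue existence is an assumption for this sketch, not a statement about the actual ocrd_network code:

```python
# Hypothetical sketch of the fail-early agent check; helper names and the
# queue-existence mechanism are assumptions, not the actual ocrd_network API.
import pika
from fastapi import HTTPException, status


def worker_queue_exists(rabbitmq_url: str, queue_name: str) -> bool:
    """Return True if a RabbitMQ queue for the processing worker exists."""
    connection = pika.BlockingConnection(pika.URLParameters(rabbitmq_url))
    try:
        channel = connection.channel()
        # passive=True only checks existence and never creates the queue
        channel.queue_declare(queue=queue_name, passive=True)
        return True
    except pika.exceptions.ChannelClosedByBroker:
        return False
    finally:
        connection.close()


def validate_agent_exists(processor_name: str, agent_type: str,
                          rabbitmq_url: str, processor_server_urls: dict) -> None:
    """Fail early with 404 so the workspace is never locked for a missing agent."""
    if agent_type == 'worker' and not worker_queue_exists(rabbitmq_url, processor_name):
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=f"Agent of type 'worker' does not exist for '{processor_name}'")
    if agent_type == 'server' and processor_name not in processor_server_urls:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=f"Agent of type 'server' does not exist for '{processor_name}'")
```

The point of such a check is that it runs before any workspace state is touched, so a missing agent can never leave the workspace locked.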

There are still issues related to the workflow endpoint. The workflow endpoint assumes that the processing endpoint will always return a job_id. However, that is not always the case, so either:

  • The workflow endpoint should do proper error handling
  • The processing endpoint should not raise exceptions, but log those to the log file of the specific job_id

The first choice (proper error handling in the workflow endpoint) was implemented.
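
A rough sketch of that first choice, under the assumption of hypothetical names (`run_workflow_tasks`, an injected `push_processing_job` callable); the real implementation differs in detail:

```python
# Hypothetical error handling in the workflow endpoint (names are assumptions).
from typing import Awaitable, Callable, List

from fastapi import HTTPException, status


async def run_workflow_tasks(
    tasks: List[str],
    workspace_id: str,
    push_processing_job: Callable[[str, str], Awaitable[str]],
) -> List[str]:
    """Submit each workflow task; abort with a clear error on the first failure."""
    job_ids = []
    for task in tasks:
        try:
            # the injected callable stands in for the processing endpoint logic;
            # it may raise an HTTPException instead of returning a job_id
            job_id = await push_processing_job(task, workspace_id)
        except HTTPException as error:
            # surface the step failure instead of assuming a job_id always comes back
            raise HTTPException(
                status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
                detail=f"Workflow step '{task}' failed to start: {error.detail}",
            )
        job_ids.append(job_id)
    return job_ids
```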

@MehmedGIT changed the base branch from network-processor-api to master on November 21, 2023 at 14:10
@MehmedGIT (Contributor, Author) commented Nov 21, 2023

There are still issues related to the workflow endpoint. The workflow endpoint assumes that the processing endpoint will always return a job_id. However, that is not always the case, so either:

  • The workflow endpoint should do proper error handling
  • The processing endpoint should not raise exceptions, but log those to the log file of the specific job_id

@kba, @joschrew let me know your opinion regarding this.

To reproduce, just remove one processor from the processing server config file but keep that processor in the workflow script.

@kba (Member) commented Nov 21, 2023

There are still issues related to the workflow endpoint. The workflow endpoint assumes that the processing endpoint will always return a job_id. However, that is not always the case, so either:

  • The workflow endpoint should do proper error handling
  • The processing endpoint should not raise exceptions, but log those to the log file of the specific job_id

@kba, @joschrew let me know your opinion regarding this.

To reproduce, just remove one processor from the processing server config file but keep that processor in the workflow script.

I am not sure whether I understand the alternatives. I think the behavior as of 6b88ad2 (returning 404 and {"detail":"Agent of type 'worker' does not exist for '...'"}) is correct from a user's perspective.

@MehmedGIT (Contributor, Author)

I am not sure whether I understand the alternatives. I think the behavior as of 6b88ad2 (returning 404 and {"detail":"Agent of type 'worker' does not exist for '...'"}) is correct from a user's perspective.

Yes, but this is only correct for the processing endpoint. From the viewpoint of the workflow endpoint it is different: the user expects some workflow_job_id regardless of whether any or all steps in that workflow fail, but instead they get {"detail":"Agent of type 'worker' does not exist for '...'"}. This is especially bad if previous workflow steps were successful.
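
From the client's perspective the difference looks roughly like this; the endpoint paths and JSON field names below are assumptions for illustration only:

```python
# Hypothetical client-side view; endpoint paths and field names are assumptions.
import requests

SERVER = 'http://localhost:8000'

# Single-job submission: a 404 with a 'detail' message is a reasonable answer.
response = requests.post(f'{SERVER}/processor/ocrd-dummy',
                         json={'workspace_id': 'ws1', 'input_file_grps': ['OCR-D-IMG']})
if response.status_code == 404:
    print('processing request rejected:', response.json()['detail'])
else:
    print('processing job submitted:', response.json().get('job_id'))

# Workflow submission: the caller expects a workflow_job_id for the whole run,
# so a bare agent-not-found 'detail' from one failing step is confusing,
# especially after earlier steps already succeeded.
with open('workflow.txt', 'rb') as workflow_file:
    response = requests.post(f'{SERVER}/workflow', files={'workflow': workflow_file})
print(response.status_code, response.json())
```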

@joschrew (Contributor)

I would choose whatever is easier to implement; I am not sure, but I think having the processing endpoint "always" return a job_id might be easier to achieve.
Some of the exceptions, like the workspace not existing or a wrong agent type, are "automatically" prevented when the call comes from the workflow endpoint. So logging the error and returning only a job_id would just have to be done when input validation fails or when the processor is not available. And we already have cases where a job_id is returned and the job still fails in the end (so no surprise when a job fails after a 200 was returned). We could also check the existence of processors beforehand in run_workflow to catch these cases up front as well.
So I would probably do a mixture: check processor existence beforehand in workflow runs and log wrong input instead of failing on it. But maybe just doing try/except in run_workflow could be cleaner...

@MehmedGIT (Contributor, Author)

I would choose whatever is easier to implement; I am not sure, but I think having the processing endpoint "always" return a job_id might be easier to achieve.

Well, that is more convenient for the workflow endpoint, but unfortunate if the user wants to run a single job through the processing endpoint: the response to a processing job submission would then always be a job_id regardless of errors, and the user would need to check the logs for details instead of getting them in the response to the job submission.

Some of the exceptions, like the workspace not existing or a wrong agent type, are "automatically" prevented when the call comes from the workflow endpoint. So logging the error and returning only a job_id would just have to be done when input validation fails or when the processor is not available.

That's true.

And we already have cases where a job_id is returned and the job still fails in the end (so no surprise when a job fails after a 200 was returned).

Yes, when the request looks fine before it is forwarded to a processing worker or processor server, the processing server returns 200 with a job_id to the user. Only when the job execution by either of those agents fails does the job status change to FAILED.
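
In other words, the usual client pattern is to poll the job until it leaves the non-final states; a hedged sketch, where the job URL layout and the state names (taken from this discussion) are assumptions:

```python
# Hypothetical polling loop for a processing job (URL layout and state names are assumptions).
import time

import requests


def wait_for_job(job_url: str, interval: float = 2.0, timeout: float = 600.0) -> str:
    """Poll the job endpoint until it reports a final state such as SUCCESS or FAILED."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = requests.get(job_url).json()['state']
        if state not in ('QUEUED', 'RUNNING'):
            return state
        time.sleep(interval)
    raise TimeoutError(f'job at {job_url} did not finish within {timeout} seconds')
```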

We could also check the existence of processors beforehand in run_workflow to catch these cases up front as well.

That's a good idea.

So I would probably do a mixture: check processor existence beforehand in workflow runs and log wrong input instead of failing on it.

Yes, a mixture is the right way. Unfortunately, the processor existence check still needs to happen in the processing endpoint as well.

@MehmedGIT (Contributor, Author)

We could also check the existence of processors beforehand in run_workflow to catch these cases up front as well.

This is now implemented; it was the most straightforward way to resolve the issue.
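
A minimal sketch of such a pre-flight check, with hypothetical names (`validate_workflow_agents`, an injected `agent_exists` callable); it is not the literal code of this PR:

```python
# Hypothetical pre-flight check before running a workflow (names are assumptions).
from typing import Callable, List

from fastapi import HTTPException, status


def validate_workflow_agents(
    processor_names: List[str],
    agent_exists: Callable[[str], bool],
) -> None:
    """Reject the whole workflow up front if any step has no deployed agent."""
    missing = [name for name in processor_names if not agent_exists(name)]
    if missing:
        raise HTTPException(
            status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
            detail=f"No deployed processing agent found for: {', '.join(missing)}",
        )
```

Injecting the existence check as a callable keeps the sketch self-contained; in practice the workflow endpoint would reuse whatever agent lookup the processing endpoint already performs.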

@MehmedGIT marked this pull request as ready for review on November 29, 2023 at 13:11
@MehmedGIT requested review from kba and joschrew on November 29, 2023 at 13:12
@MehmedGIT changed the title from "Fix potential workspace deadlocks" to "Fix potential workspace deadlocks (ocrd_network)" on Nov 29, 2023
@joschrew (Contributor) left a comment


LGTM. I have reviewed what changed code-wise and run some tests with docker-compose against the changes; for me it works as expected.

I noticed two things which could optionally be changed in addition (GitHub does not let me add suggestions for these):
- processing_server.py line 533: the "TODO" comment could be removed, as I think the mentioned changes have probably been made. But maybe more of the TODO notes should be revised in another pass.
- server_utils.py line 94: The response message when input validation of the processor job fails should be clearer imo. At least the same message the logger writes could be sent back to the client.

@MehmedGIT (Contributor, Author)

  • processing_server.py line 533: the "TODO" comment could be removed, as I think the mentioned changes have probably been made. But maybe more of the TODO notes should be revised in another pass.

True, the duplications were fixed in the current PR.

  • server_utils.py line 94: The response message when input validation of the processor job fails should be clearer imo. At least the same message the logger writes could be sent back to the client.

Added.
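
The pattern is roughly the following; the helper name and signature are assumptions, not necessarily the code in server_utils.py:

```python
# Hypothetical helper: send the same validation message to the client that is logged.
import logging

from fastapi import HTTPException, status

logger = logging.getLogger('ocrd_network.server_utils')


def raise_http_exception(status_code: int, message: str) -> None:
    """Log the error and put the identical message into the HTTP response."""
    logger.error(message)
    raise HTTPException(status_code=status_code, detail=message)


# e.g. when input validation of a processing job fails (report is hypothetical):
# raise_http_exception(status.HTTP_422_UNPROCESSABLE_ENTITY,
#                      f'Invalid processing job input: {report.errors}')
```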

@kba (Member) commented Dec 5, 2023

It makes sense to catch undeployed agents before executing a workflow, and if users use the endpoints directly, they can be expected to poll the job. So LGTM, thanks, merging.

@kba merged commit d2a1546 into master on Dec 5, 2023
@kba deleted the fix-ws-locks branch on December 5, 2023 at 14:21