Fix potential workspace deadlocks (ocrd_network) #1142

Merged: 7 commits merged into master on Dec 5, 2023

Conversation

@MehmedGIT (Contributor) commented Nov 21, 2023

Fix for #1125:

  • the processing endpoint now fails early in case the processing agent (processing worker or processor server) is not deployed; as a result, the workspace is no longer locked when the processing agent is missing (see the sketch after this list)
  • the workflow endpoint now fails early in case a processing agent is missing for one of the tasks (workflow script); this reduces the need to clean up the leftovers of a workflow that would otherwise fail partway through
  • removed a restriction: the processing server now also detects processing workers that were deployed manually after the config-driven deployment stage of the processing server
  • code improvements: removed some code duplication and replaced literals with constants
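
For illustration, here is a minimal sketch of the fail-early idea, assuming a FastAPI app and hypothetical helper names (`worker_queue_exists`, `validate_agent_exists`); whether the check goes through RabbitMQ queue existence is an assumption for this sketch, not a statement about the actual ocrd_network code:

```python
# Hypothetical sketch of the fail-early agent check; helper names and the
# queue-existence mechanism are assumptions, not the actual ocrd_network API.
import pika
from fastapi import HTTPException, status


def worker_queue_exists(rabbitmq_url: str, queue_name: str) -> bool:
    """Return True if a RabbitMQ queue for the processing worker exists."""
    connection = pika.BlockingConnection(pika.URLParameters(rabbitmq_url))
    try:
        channel = connection.channel()
        # passive=True only checks existence and never creates the queue
        channel.queue_declare(queue=queue_name, passive=True)
        return True
    except pika.exceptions.ChannelClosedByBroker:
        return False
    finally:
        connection.close()


def validate_agent_exists(processor_name: str, agent_type: str,
                          rabbitmq_url: str, processor_server_urls: dict) -> None:
    """Fail early with 404 so the workspace is never locked for a missing agent."""
    if agent_type == 'worker' and not worker_queue_exists(rabbitmq_url, processor_name):
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=f"Agent of type 'worker' does not exist for '{processor_name}'")
    if agent_type == 'server' and processor_name not in processor_server_urls:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=f"Agent of type 'server' does not exist for '{processor_name}'")
```

The point of such a check is that it runs before any workspace state is touched, so a missing agent can never leave the workspace locked.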

There are still issues related to the workflow endpoint. The workflow endpoint assumes that the processing endpoint will always return a job_id. However, that is not always the case, so either:

  • The workflow endpoint should do proper error handling
  • The processing endpoint should not raise exceptions, but log those to the log file of the specific job_id

The first choice (proper error handling in the workflow endpoint) was implemented.
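
A rough sketch of that first choice, under the assumption of hypothetical names (`run_workflow_tasks`, an injected `push_processing_job` callable); the real implementation differs in detail:

```python
# Hypothetical error handling in the workflow endpoint (names are assumptions).
from typing import Awaitable, Callable, List

from fastapi import HTTPException, status


async def run_workflow_tasks(
    tasks: List[str],
    workspace_id: str,
    push_processing_job: Callable[[str, str], Awaitable[str]],
) -> List[str]:
    """Submit each workflow task; abort with a clear error on the first failure."""
    job_ids = []
    for task in tasks:
        try:
            # the injected callable stands in for the processing endpoint logic;
            # it may raise an HTTPException instead of returning a job_id
            job_id = await push_processing_job(task, workspace_id)
        except HTTPException as error:
            # surface the step failure instead of assuming a job_id always comes back
            raise HTTPException(
                status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
                detail=f"Workflow step '{task}' failed to start: {error.detail}",
            )
        job_ids.append(job_id)
    return job_ids
```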

@MehmedGIT changed the base branch from network-processor-api to master on November 21, 2023 at 14:10
@MehmedGIT (Contributor, Author) commented Nov 21, 2023

There are still issues related to the workflow endpoint. The workflow endpoint assumes that the processing endpoint will always return a job_id. However, that is not always the case, so either:

  • The workflow endpoint should do proper error handling
  • The processing endpoint should not raise exceptions, but log those to the log file of the specific job_id

@kba, @joschrew let me know your opinion regarding this.

To reproduce, just remove one processor from the processing server config file but keep that processor in the workflow script.

@kba (Member) commented Nov 21, 2023

There are still issues related to the workflow endpoint. The workflow endpoint assumes that the processing endpoint will always return a job_id. However, that is not always the case, so either:

  • The workflow endpoint should do proper error handling
  • The processing endpoint should not raise exceptions, but log those to the log file of the specific job_id

@kba, @joschrew let me know your opinion regarding this.

To reproduce, just remove one processor from the processing server config file but keep that processor in the workflow script.

I am not sure whether I understand the alternatives. I think the behavior as of 6b88ad2 (returning 404 and {"detail":"Agent of type 'worker' does not exist for '...'"}) is correct from a user's perspective.

@MehmedGIT (Contributor, Author)

I am not sure whether I understand the alternatives. I think the behavior as of 6b88ad2 (returning 404 and {"detail":"Agent of type 'worker' does not exist for '...'"}) is correct from a user's perspective.

Yes, but this is only correct for the processing endpoint. From the viewpoint of the workflow endpoint it is different: the user expects some workflow_job_id regardless of whether any or all steps in that workflow fail, but instead they get {"detail":"Agent of type 'worker' does not exist for '...'"}. This is especially bad if previous workflow steps were successful.
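
From the client's perspective the difference looks roughly like this; the endpoint paths and JSON field names below are assumptions for illustration only:

```python
# Hypothetical client-side view; endpoint paths and field names are assumptions.
import requests

SERVER = 'http://localhost:8000'

# Single-job submission: a 404 with a 'detail' message is a reasonable answer.
response = requests.post(f'{SERVER}/processor/ocrd-dummy',
                         json={'workspace_id': 'ws1', 'input_file_grps': ['OCR-D-IMG']})
if response.status_code == 404:
    print('processing request rejected:', response.json()['detail'])
else:
    print('processing job submitted:', response.json().get('job_id'))

# Workflow submission: the caller expects a workflow_job_id for the whole run,
# so a bare agent-not-found 'detail' from one failing step is confusing,
# especially after earlier steps already succeeded.
with open('workflow.txt', 'rb') as workflow_file:
    response = requests.post(f'{SERVER}/workflow', files={'workflow': workflow_file})
print(response.status_code, response.json())
```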

@joschrew (Contributor)

I would choose whatever is easier to implement; I am not sure, but I think having the processing endpoint "always" return a job_id might be easier to achieve.
Some of the exceptions, like the workspace not existing or a wrong agent type, are "automatically" prevented when the call comes from the workflow endpoint. So logging the error and returning only a job_id would just have to be done when input validation fails or when the processor is not available. And we already have cases where a job_id is returned and the job still fails in the end (so no surprise when a job fails after a 200 was returned). We could also check the existence of processors beforehand in run_workflow to catch these cases up front as well.
So I would probably do a mixture: check processor existence beforehand in workflow runs and log wrong input instead of failing on it. But maybe just doing try/except in run_workflow could be cleaner...

@MehmedGIT (Contributor, Author)

I would choose whatever is easier to implement; I am not sure, but I think having the processing endpoint "always" return a job_id might be easier to achieve.

Well, that is more convenient for the workflow endpoint, but unfortunate if the user wants to run a single job through the processing endpoint: the response to a processing job submission would then always be a job_id regardless of errors, and the user would need to check the logs for details instead of getting them in the response to the job submission.

Some of the exceptions, like the workspace not existing or a wrong agent type, are "automatically" prevented when the call comes from the workflow endpoint. So logging the error and returning only a job_id would just have to be done when input validation fails or when the processor is not available.

That's true.

And we already have cases where a job_id is returned and the job still fails in the end (so no surprise when a job fails after a 200 was returned).

Yes, when the request looks fine before it is forwarded to a processing worker or processor server, the processing server returns 200 with a job_id to the user. Only when the job execution by either of those agents fails does the job status change to FAILED.
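
In other words, the usual client pattern is to poll the job until it leaves the non-final states; a hedged sketch, where the job URL layout and the state names (taken from this discussion) are assumptions:

```python
# Hypothetical polling loop for a processing job (URL layout and state names are assumptions).
import time

import requests


def wait_for_job(job_url: str, interval: float = 2.0, timeout: float = 600.0) -> str:
    """Poll the job endpoint until it reports a final state such as SUCCESS or FAILED."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = requests.get(job_url).json()['state']
        if state not in ('QUEUED', 'RUNNING'):
            return state
        time.sleep(interval)
    raise TimeoutError(f'job at {job_url} did not finish within {timeout} seconds')
```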

We could also check the existence of processors beforehand in run_workflow to catch these cases up front as well.

That's a good idea.

So I would probably do a mixture: check processor existence beforehand in workflow runs and log wrong input instead of failing on it.

Yes, a mixture is the right way. Unfortunately, the processor existence check still needs to happen in the processing endpoint as well.

@MehmedGIT (Contributor, Author)

We could also check the existence of processors beforehand in run_workflow to catch these cases up front as well.

This is now implemented; it was the most straightforward way to resolve the issue.
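
A minimal sketch of such a pre-flight check, with hypothetical names (`validate_workflow_agents`, an injected `agent_exists` callable); it is not the literal code of this PR:

```python
# Hypothetical pre-flight check before running a workflow (names are assumptions).
from typing import Callable, List

from fastapi import HTTPException, status


def validate_workflow_agents(
    processor_names: List[str],
    agent_exists: Callable[[str], bool],
) -> None:
    """Reject the whole workflow up front if any step has no deployed agent."""
    missing = [name for name in processor_names if not agent_exists(name)]
    if missing:
        raise HTTPException(
            status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
            detail=f"No deployed processing agent found for: {', '.join(missing)}",
        )
```

Injecting the existence check as a callable keeps the sketch self-contained; in practice the workflow endpoint would reuse whatever agent lookup the processing endpoint already performs.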

@MehmedGIT marked this pull request as ready for review on November 29, 2023 at 13:11
@MehmedGIT requested review from kba and joschrew on November 29, 2023 at 13:12
@MehmedGIT changed the title from "Fix potential workspace deadlocks" to "Fix potential workspace deadlocks (ocrd_network)" on Nov 29, 2023
@joschrew (Contributor) left a comment


LGTM. I have reviewed what changed code-wise and run some tests with docker-compose against the changes; for me it works as expected.

I noticed two things which could optionally be changed in addition (GitHub does not let me add suggestions for these):
- processing_server.py line 533: the "TODO" comment could be removed, as I think the mentioned changes have probably been made. But maybe more of the TODO notes should be revised in another pass.
- server_utils.py line 94: The response message when input validation of the processor job fails should be clearer imo. At least the same message the logger writes could be sent back to the client.

@MehmedGIT (Contributor, Author)

  • processing_server.py line 533: the "TODO" comment could be removed, as I think the mentioned changes have probably been made. But maybe more of the TODO notes should be revised in another pass.

True, the duplications were fixed in the current PR.

  • server_utils.py line 94: The response message when input validation of the processor job fails should be clearer imo. At least the same message the logger writes could be sent back to the client.

Added.
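
The pattern is roughly the following; the helper name and signature are assumptions, not necessarily the code in server_utils.py:

```python
# Hypothetical helper: send the same validation message to the client that is logged.
import logging

from fastapi import HTTPException, status

logger = logging.getLogger('ocrd_network.server_utils')


def raise_http_exception(status_code: int, message: str) -> None:
    """Log the error and put the identical message into the HTTP response."""
    logger.error(message)
    raise HTTPException(status_code=status_code, detail=message)


# e.g. when input validation of a processing job fails (report is hypothetical):
# raise_http_exception(status.HTTP_422_UNPROCESSABLE_ENTITY,
#                      f'Invalid processing job input: {report.errors}')
```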

@kba (Member) commented Dec 5, 2023

It makes sense to catch undeployed agents before executing a workflow, and if users use the endpoints directly, they can be expected to poll the job. So LGTM, thanks, merging.

@kba merged commit d2a1546 into master on Dec 5, 2023
@kba deleted the fix-ws-locks branch on December 5, 2023 at 14:21