Remove internal bulk processor retries #3739

alexshtin · 2022-12-20T19:18:28Z

What changed?
Remove internal bulk processor retries.

Why?
To simplify retry logic: the visibility task processor has its own retry logic, and it is better to rely on it.

How did you test it?
Local runs with different failures from Elasticsearch.

Potential risks
No risks.

Is hotfix candidate?
Yes.

alexshtin · 2022-12-20T19:21:23Z

service/worker/addsearchattributes/workflow.go

+	}
+
+	switch httpStatusCode {
+	case http.StatusBadRequest, http.StatusUnauthorized, http.StatusForbidden, http.StatusNotFound, http.StatusConflict:


https://discuss.elastic.co/t/knowing-when-and-when-not-to-retry-a-request-based-on-elasticsearchexception-or-ioexception-with-the-resthighlevelclient/183779

MichaelSnowden · 2022-12-20T20:20:20Z

common/persistence/visibility/store/elasticsearch/processor.go

-		httpStatus := client.HttpStatus(err)
-		isRetryable := client.IsRetryableStatus(httpStatus)
+		var httpStatus int
+		if err, isElasticErr := err.(*elastic.Error); isElasticErr {


nit: Use errors.As. It's safer because it works even when the error is wrapped.

MichaelSnowden · 2022-12-20T20:35:03Z

service/worker/addsearchattributes/workflow.go

+	var httpStatusCode int
+	if err, isElasticErr := err.(*elastic.Error); isElasticErr {
+		httpStatusCode = err.Status
+	}


I think this is safer because it works with wrapped errors, and we return early. Leaving the status code variable as the zero value works because of the default branch, but I think it's better to just return early so that it's clear that this value has no meaning when it's not an ES error.

	var esErr *elastic.Error
	if !errors.As(err, &esErr) {
		return true
	}
	httpStatusCode := esErr.Status

MichaelSnowden · 2022-12-20T20:35:50Z

service/worker/addsearchattributes/workflow.go

+	case http.StatusBadRequest, http.StatusUnauthorized, http.StatusForbidden, http.StatusNotFound, http.StatusConflict:
+		return false
+	default:
+		return true


This means we retry all non-ES errors. Is that what we want?

What do you mean by "non-ES"? This error came from Elasticsearch, and Elasticsearch uses HTTP status codes to indicate errors.

I think I understand what you meant. Yes, non-ES errors are most likely network failures and should be retryable. They might also indicate a code bug (such as a malformed URL or a missing required parameter). In that case it would probably be better not to retry, but I don't know how to differentiate them. Generally, the idea is not to retry anything that we know for sure is non-retryable, and to retry everything else.

MichaelSnowden

Mostly LGTM, just a few comments

alexshtin requested a review from a team as a code owner December 20, 2022 19:18

alexshtin added the release/1.19.1 Patches for v1.19.1 label Dec 20, 2022

alexshtin commented Dec 20, 2022

MichaelSnowden reviewed Dec 20, 2022

alexshtin added 4 commits December 20, 2022 13:37

Exclude http status code 500 from Elasticsearch bulk process retryable codes

e1ae9b5

Remove internal bulk processor retries

b0f1979

Fix unit tests

68d3676

Address feedback

81067a4

alexshtin force-pushed the feature/es-bulk-500 branch from c670b47 to 81067a4 Compare December 20, 2022 21:54

yycptt approved these changes Dec 20, 2022

alexshtin merged commit 2b761b4 into temporalio:master Dec 21, 2022

alexshtin deleted the feature/es-bulk-500 branch December 21, 2022 07:44

yycptt pushed a commit that referenced this pull request Jan 12, 2023

Remove internal bulk processor retries (#3739)

7b64aa0
