
Commit 86d961f

comandeo-mongo, alexbevi, and prestonvasquez authored
DRIVERS-1571 Retry on different mongos when possible (#1450)
Co-authored-by: Alex Bevilacqua <alex@alexbevi.com>
Co-authored-by: Preston Vasquez <prestonvasquez@icloud.com>
1 parent 3ff3800 commit 86d961f

File tree

5 files changed: +188 −16 lines changed

source/retryable-reads/retryable-reads.rst

+23-7
@@ -268,10 +268,13 @@ selecting a server for a retry attempt.
 3a. Selecting the server for retry
 ''''''''''''''''''''''''''''''''''
 
-If the driver cannot select a server for a retry attempt or the newly selected
-server does not support retryable reads, retrying is not possible and drivers
-MUST raise the previous retryable error. In both cases, the caller is able to
-infer that an attempt was made.
+In a sharded cluster, the server on which the operation failed MUST be provided
+to the server selection mechanism as a deprioritized server.
+
+If the driver cannot select a server for
+a retry attempt or the newly selected server does not support retryable reads,
+retrying is not possible and drivers MUST raise the previous retryable error.
+In both cases, the caller is able to infer that an attempt was made.
 
 3b. Sending an equivalent command for a retry attempt
 '''''''''''''''''''''''''''''''''''''''''''''''''''''
@@ -357,9 +360,17 @@ and reflects the flow described above.
  */
 function executeRetryableRead(command, session) {
   Exception previousError = null;
+  Server previousServer = null;
   while true {
     try {
-      server = selectServer();
+      if (previousServer == null) {
+        server = selectServer();
+      } else {
+        // If a previous attempt was made, deprioritize the previous server
+        // where the command failed.
+        deprioritizedServers = [ previousServer ];
+        server = selectServer(deprioritizedServers);
+      }
     } catch (ServerSelectionException exception) {
       if (previousError == null) {
         // If this is the first attempt, propagate the exception.
@@ -416,9 +427,11 @@ and reflects the flow described above.
     } catch (NetworkException networkError) {
       updateTopologyDescriptionForNetworkError(server, networkError);
       previousError = networkError;
+      previousServer = server;
     } catch (NotWritablePrimaryException notPrimaryError) {
       updateTopologyDescriptionForNotWritablePrimaryError(server, notPrimaryError);
       previousError = notPrimaryError;
+      previousServer = server;
     } catch (DriverException error) {
       if ( previousError != null ) {
         throw previousError;
@@ -614,8 +627,8 @@ The spec concerns itself with retrying read operations that encounter a
 retryable error (i.e. no response due to network error or a response indicating
 that the node is no longer a primary). A retryable error may be classified as
 either a transient error (e.g. dropped connection, replica set failover) or
-persistent outage. If a transient error results in the server being marked as
-"unknown", a subsequent retry attempt will allow the driver to rediscover the
+persistent outage. If a transient error results in the server being marked as
+"unknown", a subsequent retry attempt will allow the driver to rediscover the
 primary within the designated server selection timeout period (30 seconds by
 default). If server selection times out during this retry attempt, we can
 reasonably assume that there is a persistent outage. In the case of a persistent
@@ -678,6 +691,9 @@ degraded performance can simply disable ``retryableReads``.
 Changelog
 =========
 
+:2023-08-??: Require that in a sharded cluster the server on which the
+             operation failed MUST be provided to the server selection
+             mechanism as a deprioritized server.
 :2023-08-21: Update Q&A that contradicts SDAM transient error logic
 :2022-11-09: CLAM must apply both events and log messages.
 :2022-10-18: When CSOT is enabled multiple retry attempts may occur.
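For illustration, the retry flow this file now specifies could be sketched in
Python roughly as follows. This is a minimal sketch assuming hypothetical
driver internals (``select_server``, ``ServerSelectionError``,
``RetryableError``), not a real driver API, and it simplifies to a single
retry as in the non-CSOT case::

    def execute_retryable_read(command, session):
        previous_error = None
        previous_server = None
        while True:
            try:
                if previous_server is None:
                    server = select_server()
                else:
                    # Deprioritize the server where the previous attempt
                    # failed so that, in a sharded cluster, another mongos
                    # is preferred for the retry.
                    server = select_server(deprioritized=[previous_server])
            except ServerSelectionError:
                if previous_error is None:
                    raise  # first attempt: propagate the selection error
                raise previous_error  # retry: surface the original failure
            try:
                return server.run(command, session)
            except RetryableError as exc:
                if previous_error is not None:
                    # One retry was already attempted; re-raise its error.
                    raise previous_error
                previous_error = exc
                previous_server = server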

source/retryable-reads/tests/README.rst

+70
@@ -232,10 +232,80 @@ This test requires MongoDB 4.2.9+ for ``blockConnection`` support in the failpoint
 
 9. Disable the failpoint.
 
+Retrying Reads in a Sharded Cluster
+===================================
+
+These tests will be used to ensure drivers properly retry reads on a different
+mongos.
+
+Retryable Reads Are Retried on a Different mongos if One is Available
+---------------------------------------------------------------------
+
+This test MUST be executed against a sharded cluster that has at least two
+mongos instances.
+
+1. Ensure that a test is run against a sharded cluster that has at least two
+   mongoses. If there are more than two mongoses in the cluster, pick two to
+   test against.
+
+2. Create a client per mongos using the direct connection, and configure the
+   following fail point on each mongos::
+
+     {
+       configureFailPoint: "failCommand",
+       mode: { times: 1 },
+       data: {
+         failCommands: ["find"],
+         errorCode: 6,
+         closeConnection: true
+       }
+     }
+
+3. Create a client with ``retryReads=true`` that connects to the cluster,
+   providing the two selected mongoses as seeds.
+
+4. Enable command monitoring, and execute a ``find`` command that is
+   supposed to fail on both mongoses.
+
+5. Assert that there were failed command events from each mongos.
+
+6. Disable the fail points.
+
+
+Retryable Reads Are Retried on the Same mongos if No Others are Available
+-------------------------------------------------------------------------
+
+1. Ensure that a test is run against a sharded cluster. If there are multiple
+   mongoses in the cluster, pick one to test against.
+
+2. Create a client that connects to the mongos using the direct connection,
+   and configure the following fail point on the mongos::
+
+     {
+       configureFailPoint: "failCommand",
+       mode: { times: 1 },
+       data: {
+         failCommands: ["find"],
+         errorCode: 6,
+         closeConnection: true
+       }
+     }
+
+3. Create a client with ``retryReads=true`` that connects to the cluster,
+   providing the selected mongos as the seed.
+
+4. Enable command monitoring, and execute a ``find`` command.
+
+5. Assert that there was a failed command and a successful command event.
+
+6. Disable the fail point.
+
 Changelog
 =========
 
+:2023-08-??: Add prose tests for retrying in a sharded cluster.
+
 :2022-04-22: Clarifications to ``serverless`` and ``useMultipleMongoses``.
 
 :2022-01-10: Create legacy and unified subdirectories for new unified tests
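As a concrete illustration of the first prose test above, a PyMongo-based
version might look like the following sketch; the mongos host names and the
listener class are assumptions for the example, not part of the spec::

    from pymongo import MongoClient, monitoring

    FAIL_POINT = {
        "configureFailPoint": "failCommand",
        "mode": {"times": 1},
        "data": {"failCommands": ["find"], "errorCode": 6,
                 "closeConnection": True},
    }

    class FailureListener(monitoring.CommandListener):
        """Records the (host, port) of every server that fails a command."""
        def __init__(self):
            self.failed_hosts = set()
        def started(self, event): pass
        def succeeded(self, event): pass
        def failed(self, event):
            self.failed_hosts.add(event.connection_id)

    # Step 2: one direct-connection client per mongos to set the fail point.
    admins = [MongoClient(host, directConnection=True)
              for host in ("mongodb://s0:27017", "mongodb://s1:27017")]
    for admin in admins:
        admin.admin.command(FAIL_POINT)

    # Steps 3-4: a retryReads client seeded with both mongoses.
    listener = FailureListener()
    client = MongoClient("mongodb://s0:27017,s1:27017", retryReads=True,
                         event_listeners=[listener])
    try:
        client.test.coll.find_one()  # fails on both mongoses
    except Exception:
        pass

    # Step 5: a failed command event was observed on each mongos.
    assert len(listener.failed_hosts) == 2

    # Step 6: disable the fail points.
    for admin in admins:
        admin.admin.command({"configureFailPoint": "failCommand",
                             "mode": "off"})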

source/retryable-writes/retryable-writes.rst

+17-7
@@ -395,11 +395,14 @@ of the following conditions is reached:
    <../client-side-operations-timeout/client-side-operations-timeout.rst#retryability>`__.
 - CSOT is not enabled and one retry was attempted.
 
-For each retry attempt, drivers MUST select a writable server. If the driver
-cannot select a server for a retry attempt or the selected server does not
-support retryable writes, retrying is not possible and drivers MUST raise the
-retryable error from the previous attempt. In both cases, the caller is able
-to infer that an attempt was made.
+For each retry attempt, drivers MUST select a writable server. In a sharded
+cluster, the server on which the operation failed MUST be provided to
+the server selection mechanism as a deprioritized server.
+
+If the driver cannot select a server for a retry attempt
+or the selected server does not support retryable writes, retrying is not
+possible and drivers MUST raise the retryable error from the previous attempt.
+In both cases, the caller is able to infer that an attempt was made.
 
 If a retry attempt also fails, drivers MUST update their topology according to
 the SDAM spec (see: `Error Handling`_). If an error would not allow the caller
@@ -492,11 +495,15 @@ The above rules are implemented in the following pseudo-code:
       }
     }
 
-    /* If we cannot select a writable server, do not proceed with retrying and
+    /*
+     * We try to select a server that is not the one that failed by passing the
+     * failed server as a deprioritized server.
+     * If we cannot select a writable server, do not proceed with retrying and
      * throw the previous error. The caller can then infer that an attempt was
      * made and failed. */
     try {
-      server = selectServer("writable");
+      deprioritizedServers = [ server ];
+      server = selectServer("writable", deprioritizedServers);
     } catch (Exception ignoredError) {
       throw previousError;
     }
@@ -822,6 +829,9 @@ inconsistent with the server and potentially confusing to developers.
 Changelog
 =========
 
+:2023-08-??: Require that in a sharded cluster the server on which the
+             operation failed MUST be provided to the server selection
+             mechanism as a deprioritized server.
 :2022-11-17: Add logic for persisting "currentError" as "previousError" on first
              retry attempt, avoiding raising "null" errors.
 :2022-11-09: CLAM must apply both events and log messages.
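A minimal Python sketch of the changed selection step, assuming a hypothetical
``select_server`` helper (the spec's pseudo-code above remains the normative
description)::

    def select_server_for_write_retry(failed_server, previous_error):
        # Pass the failed server as deprioritized; in a sharded cluster
        # another mongos is preferred, while in other topologies server
        # selection ignores the hint.
        try:
            return select_server("writable", deprioritized=[failed_server])
        except Exception:
            # No server available: surface the error from the prior attempt
            # so the caller can infer that a retry was attempted.
            raise previous_error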

source/retryable-writes/tests/README.rst

+67
@@ -456,9 +456,76 @@ and sharded clusters.
       mode: "off",
    })
 
+#. Test that in a sharded cluster writes are retried on a different mongos if
+   one is available.
+
+   This test MUST be executed against a sharded cluster that has at least two
+   mongos instances.
+
+   1. Ensure that a test is run against a sharded cluster that has at least two
+      mongoses. If there are more than two mongoses in the cluster, pick two to
+      test against.
+
+   2. Create a client per mongos using the direct connection, and configure the
+      following fail point on each mongos::
+
+        {
+          configureFailPoint: "failCommand",
+          mode: { times: 1 },
+          data: {
+            failCommands: ["insert"],
+            errorCode: 6,
+            errorLabels: ["RetryableWriteError"],
+            closeConnection: true
+          }
+        }
+
+   3. Create a client with ``retryWrites=true`` that connects to the cluster,
+      providing the two selected mongoses as seeds.
+
+   4. Enable command monitoring, and execute a write command that is
+      supposed to fail on both mongoses.
+
+   5. Assert that there were failed command events from each mongos.
+
+   6. Disable the fail points.
+
+#. Test that in a sharded cluster writes are retried on the same mongos if no
+   other is available.
+
+   This test MUST be executed against a sharded cluster.
+
+   1. Ensure that a test is run against a sharded cluster. If there are multiple
+      mongoses in the cluster, pick one to test against.
+
+   2. Create a client that connects to the mongos using the direct connection,
+      and configure the following fail point on the mongos::
+
+        {
+          configureFailPoint: "failCommand",
+          mode: { times: 1 },
+          data: {
+            failCommands: ["insert"],
+            errorCode: 6,
+            errorLabels: ["RetryableWriteError"],
+            closeConnection: true
+          }
+        }
+
+   3. Create a client with ``retryWrites=true`` that connects to the cluster,
+      providing the selected mongos as the seed.
+
+   4. Enable command monitoring, and execute a write command that is
+      supposed to fail.
+
+   5. Assert that there was a failed command and a successful command event.
+
+   6. Disable the fail point.
+
 Changelog
 =========
 
+:2023-08-??: Add prose tests for retrying in a sharded cluster.
+
 :2022-08-30: Add prose test verifying correct error handling for errors with
              the NoWritesPerformed label, which is to return the original
              error.
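For the single-mongos write test above, a PyMongo-shaped sketch might look as
follows; the host name and listener class are illustrative assumptions::

    from pymongo import MongoClient, monitoring

    class InsertEventCounter(monitoring.CommandListener):
        """Counts failed and succeeded events for insert commands."""
        def __init__(self):
            self.n_failed = 0
            self.n_succeeded = 0
        def started(self, event): pass
        def succeeded(self, event):
            if event.command_name == "insert":
                self.n_succeeded += 1
        def failed(self, event):
            if event.command_name == "insert":
                self.n_failed += 1

    # Step 2: configure the fail point over a direct connection.
    admin = MongoClient("mongodb://s0:27017", directConnection=True)
    admin.admin.command({
        "configureFailPoint": "failCommand",
        "mode": {"times": 1},
        "data": {
            "failCommands": ["insert"],
            "errorCode": 6,
            "errorLabels": ["RetryableWriteError"],
            "closeConnection": True,
        },
    })

    # Steps 3-4: the first attempt fails, the retry runs on the same mongos.
    listener = InsertEventCounter()
    client = MongoClient("mongodb://s0:27017", retryWrites=True,
                         event_listeners=[listener])
    client.test.coll.insert_one({"x": 1})

    # Step 5: one failed and one successful command event.
    assert listener.n_failed == 1 and listener.n_succeeded == 1

    # Step 6: disable the fail point.
    admin.admin.command({"configureFailPoint": "failCommand", "mode": "off"})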

source/server-selection/server-selection.rst

+11-2
@@ -843,7 +843,11 @@ For multi-threaded clients, the server selection algorithm is as follows:
 2. If the topology wire version is invalid, raise an error and log a
    `"Server selection failed" message`_.
 
-3. Find suitable servers by topology type and operation type
+3. Find suitable servers by topology type and operation type. If a list of
+   deprioritized servers is provided, and the topology is a sharded cluster,
+   these servers should be selected only if there are no other suitable servers.
+   The server selection algorithm MUST ignore the deprioritized servers if the
+   topology is not a sharded cluster.
 
 4. Filter the suitable servers by calling the optional, application-provided server
    selector.
@@ -915,7 +919,11 @@ as follows:
 5. If the topology wire version is invalid, raise an error and log a
    `"Server selection failed" message`_.
 
-6. Find suitable servers by topology type and operation type
+6. Find suitable servers by topology type and operation type. If a list of
+   deprioritized servers is provided, and the topology is a sharded cluster,
+   these servers should be selected only if there are no other suitable servers.
+   The server selection algorithm MUST ignore the deprioritized servers if the
+   topology is not a sharded cluster.
 
 7. Filter the suitable servers by calling the optional, application-provided
    server selector.
@@ -2070,3 +2078,4 @@ Changelog
 :2022-01-19: Require that timeouts be applied per the client-side operations timeout spec
 :2022-10-05: Remove spec front matter, move footnote, and reformat changelog.
 :2022-11-09: Add log messages and tests.
+:2023-08-??: Add list of deprioritized servers for sharded cluster topology.
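The deprioritization rule added in steps 3 and 6 could be sketched as follows;
the topology and server objects here are hypothetical stand-ins for a driver's
internal types::

    def find_suitable_servers(topology, operation, deprioritized=()):
        # Suitability by topology type and operation type, as before.
        suitable = [s for s in topology.servers if s.matches(operation)]
        # Deprioritization only applies to sharded clusters.
        if topology.type != "Sharded":
            return suitable
        preferred = [s for s in suitable if s not in deprioritized]
        # Fall back to the deprioritized servers only when no other suitable
        # server exists.
        return preferred or suitable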
