Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KCL 1.X Fix for ShardEnd corruption and preventing lease table interference in multi-app JVM #776

Merged
merged 2 commits into from
Jan 26, 2021

Conversation

ashwing
Copy link
Contributor

@ashwing ashwing commented Jan 23, 2021

  1. Updating HashRange directly while creating leases and including HashRange in serializing and deserializing to/from DDB
  2. Fix for making LeaseCleanupManager non-singleton to avoid cross-table interference in multiple apps running in same JVM
  3. Fixing updateMetaInfo method to not update other lease table fields as this might lead to stale info persistence
  4. Preventing shard deletion in LeaseCleanupManager if a valid shard does not have child shards in lease table and in Kinesis Service
  5. Adding childshards update support in updateMetaInfo
  6. Making LeaseCleanupManager to call updateMetaInfo instead of update for childshard update in lease

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ashwing ashwing requested a review from joshua-kim January 23, 2021 22:44
…uding HashRange in seralizing and deserializing to/from DDB

2. Fix for making LeaseCleanupManager non-singleton to avoid cross-table interference in multiple apps running in same JVM
3. Fixing updateMetaInfo method to not update other lease table fields
4. Preventing shard deletion in LeaseCleanupManager if a valid shard does not have child shards in lease table and in Kinesis Service
5. Adding childshards update support in updateMetaInfo
6. Fixing LeaseCleanupManager to call updateMetaInfo instead of update for childshard update in lease

7. Fixing unit tests to accommodate HashRange changes
@ashwing ashwing force-pushed the shard_end_corruption_issues branch from 5e51f67 to 9cb5020 Compare January 24, 2021 02:01
Comment on lines -99 to +101
static final long LEASE_TABLE_CHECK_FREQUENCY_MILLIS = 3 * 1000L;
static final long MIN_WAIT_TIME_FOR_LEASE_TABLE_CHECK_MILLIS = 1 * 1000L;
static final long MAX_WAIT_TIME_FOR_LEASE_TABLE_CHECK_MILLIS = 30 * 1000L;
static long LEASE_TABLE_CHECK_FREQUENCY_MILLIS = 3 * 1000L;
static long MIN_WAIT_TIME_FOR_LEASE_TABLE_CHECK_MILLIS = 1 * 1000L;
static long MAX_WAIT_TIME_FOR_LEASE_TABLE_CHECK_MILLIS = 30 * 1000L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we marking these as non-final?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, I see it's for unit test. Ideally we would be using builder pattern to avoid lengthy constructors, how bad is the build time if we don't override these values in tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should be refactoring the long arg constructors. temporarily giving this package level access. this saves more than 3 minutes of build time.

if (CollectionUtils.isNullOrEmpty(childShardKeys)) {
LOG.error("No child shards returned from service for shard " + shardInfo.getShardId());
// If no children shard is found in DDB and from service, then do not delete the lease
Copy link
Contributor

@joshua-kim joshua-kim Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this adds additional safety since this is just retried either way on the next deletion run, are we adding this for unit testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will prevent the the lease deletion and log error as this closed and valid shard could not retrieve children info from anywhere. This is a final guard to protect against bad server response.

Copy link
Contributor

@joshua-kim joshua-kim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ashwing ashwing linked an issue Jan 26, 2021 that may be closed by this pull request
@ashwing ashwing merged commit e873c99 into v1.x Jan 26, 2021
@ashwing ashwing changed the title KCL 1.X Guardrails for ShardEnd corruption and preventing lease table interference in multi-app JVM KCL 1.X Fix for ShardEnd corruption and preventing lease table interference in multi-app JVM Jan 26, 2021
@ashwing ashwing deleted the shard_end_corruption_issues branch January 26, 2021 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants