Deal with Dynamo write throughput scaling during batch writes #345

Open
gcv opened this issue Nov 3, 2018 · 4 comments
Labels
high High priority

Comments

@gcv
Contributor

gcv commented Nov 3, 2018

We need to deal with the spiky loads we get when a new report adds a slew of new gene requirements, causing potentially very large imports to occur (adding 60 new genes means 60×N writes, where N is the number of users in the system!).

In addition to increasing table throughput, we may need to split the batch-update Lambda invocations into smaller pieces. The places where this needs to happen are marked with TODO: Split into pieces before calling? in bioinformatics.
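
A minimal sketch of the splitting, assuming the import can be expressed as a flat list of write operations; the function name "bioinformatics-batch-writer", the payload shape, and the chunk size are made up for illustration:

```python
# Hypothetical sketch: split one oversized payload into fixed-size chunks and
# fan out one async Lambda invocation per chunk instead of a single huge call.
import json

import boto3

lambda_client = boto3.client("lambda")

def invoke_in_chunks(operations, function_name="bioinformatics-batch-writer", chunk_size=500):
    """Fan the write operations out across several async Lambda invocations."""
    for i in range(0, len(operations), chunk_size):
        chunk = operations[i:i + chunk_size]
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # async, so the caller isn't bound by the write time
            Payload=json.dumps({"operations": chunk}),
        )
```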

Dynamo auto-scaling may not react quickly enough. According to this article, Dynamo auto-scaling is implemented as a CloudWatch alarm, which can take up to 15 minutes to react. That will not work for us, as the Lambdas doing Dynamo writes have short timeouts (300 seconds maximum).

A Serverless plugin for doing some of this exists, but it may not create a sufficiently aggressive CloudWatch scaling alarm.

Another article that covers Dynamo scaling: https://medium.com/rue-la-la-tech/how-rue-la-la-bulk-loads-into-dynamodb-ad1469af578e

@gcv gcv assigned aneilbaboo and gcv Nov 3, 2018
@gcv gcv added the high High priority label Nov 3, 2018
@gcv
Contributor Author

gcv commented Nov 13, 2018

Proposed solution:

  1. No auto-scaling.
  2. Figure out what throughput we need to achieve some reasonable write rate (1000 base entries per second?).
  3. Increase Dynamo write throughput before running an update.
  4. Run the update.
  5. Decrease Dynamo write throughput.

Need to check that Dynamo alerts us when write throughput is set high. Otherwise it can get awfully expensive if the process fails to set the write throughput back down.
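
A minimal sketch of steps 3-5 with boto3, assuming a provisioned-capacity table named VariantCall; the capacity numbers are illustrative placeholders for whatever step 2 produces:

```python
# Sketch only: raise write capacity, run the batch update, always scale back down.
import boto3

dynamo = boto3.client("dynamodb")

def run_update_with_scaled_throughput(run_update, table="VariantCall",
                                      high_writes=1000, low_writes=25, reads=25):
    # Step 3: raise write throughput before the import.
    dynamo.update_table(
        TableName=table,
        ProvisionedThroughput={"ReadCapacityUnits": reads,
                               "WriteCapacityUnits": high_writes},
    )
    dynamo.get_waiter("table_exists").wait(TableName=table)  # wait until ACTIVE
    try:
        run_update()  # Step 4: perform the batch writes.
    finally:
        # Step 5: scale back down even if the update fails, so we don't keep
        # paying for the high throughput.
        dynamo.update_table(
            TableName=table,
            ProvisionedThroughput={"ReadCapacityUnits": reads,
                                   "WriteCapacityUnits": low_writes},
        )
```

Scaling down in a finally block addresses the cost concern, but Dynamo limits how many times per day a table's throughput can be decreased, so repeated runs still need care.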

@aneilbaboo aneilbaboo changed the title Deal with Dynamo scaling Deal with Dynamo write throughput scaling during batch writes Feb 13, 2019
@aneilbaboo
Contributor

Copying solution proposed in #363:
Doing blind Lambda-based writes to Dynamo is unsustainable for larger numbers of users and bases referenced in reports. We need to transition to performing all Dynamo writes, throttling, and throughput scaling by means of a queue.

This can probably be done without breaking existing code. Lambdas responsible for Dynamo writes will instead enqueue the needed operations. Another process will take care of dequeuing and performing the actual writes, and can also handle Dynamo scaling. When the queue grows to the point that the process can no longer keep up without forcing another shard to be created, it can alert us so we can figure out what to do.
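
The producer side could look roughly like this; the queue name "dynamo-write-queue" and the message shape are assumptions, not existing code:

```python
# Sketch of the producer side: Lambdas enqueue the write instead of performing it.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = sqs.get_queue_url(QueueName="dynamo-write-queue")["QueueUrl"]

def enqueue_write(user_id, item):
    """Called from the Lambdas that currently write to Dynamo directly."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user_id": user_id, "item": item}),
    )
```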

@aneilbaboo
Contributor

WORK IN PROGRESS

There are two separate issues:

  1. WRITE CAPACITY: Each Dynamo shard can only handle a maximum provisioned write throughput of 1,000 write units per second.
    Since we're partitioning on user IDs, each user's data lives on a single shard, so when we're uploading data for a user we cannot exceed 1,000 writes per second.

  2. USAGE COST: Since Dynamo only allows 27 downscaling events per day (4 at any time, plus 1 per hour after the most recent downscale), we need to batch variantCall writes together to ensure that we end up with efficient usage of the table.

Here is an equation that describes how many shards you'll have in Dynamo (from https://cloudonaut.io/dynamodb-pitfall-limited-throughput-due-to-hot-partitions/ ):

MAX( (Provisioned Read Throughput / 3,000), (Provisioned Write Throughput / 1,000), (Used Storage / 10 GB))
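
For illustration, evaluating the formula with some made-up numbers (not our actual table settings):

```python
import math

# Made-up example: 500 provisioned read units, 3,000 provisioned write units, 5 GB stored.
partitions = math.ceil(max(500 / 3000, 3000 / 1000, 5 / 10))
print(partitions)  # 3 -> the table spreads over ~3 shards, but each user's data
                   # still lives on one shard, so the per-user ceiling stays at 1,000/sec
```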

SOLUTION:

  1. Batch all writes to the VariantCall table.
    • every user is placed into a single queue
  2. Put the user ID in an SQS queue.
    • e.g. a new user after the initial upload
  3. A process runs every 2 hours (a rough sketch of its drain loop follows this list):

    • scale up write capacity
    • write all the queued data
    • scale down write capacity
    • TODO: what happens if the process takes longer than 2 hours to complete?

    Note: Throttle writes so that fewer than 1,000 writes per second hit DynamoDB for each user.

    • In principle, we could parallelize writes across users by spinning up multiple Lambdas, up to N, where N = maximum write throughput / 1,000. That gives one process per user, each writing at up to 1,000 rows per second.
    • We don't have to parallelize for the first version.
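
A minimal sketch of the drain loop, assuming the queue and message shape from the earlier producer sketch and items already in DynamoDB attribute-value format; the scale-up/scale-down calls, retry of UnprocessedItems, and the per-user split are left out, so this throttles the worker as a whole:

```python
# Sketch of the 2-hourly worker's drain loop. Queue name, table name, and
# message shape are assumptions; capacity scaling and retries are omitted.
import json
import time

import boto3

sqs = boto3.client("sqs")
dynamo = boto3.client("dynamodb")
QUEUE_URL = sqs.get_queue_url(QueueName="dynamo-write-queue")["QueueUrl"]

def drain_queue(table="VariantCall", max_writes_per_second=1000):
    window_start, written = time.monotonic(), 0
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained
        # Items are assumed to already be in DynamoDB attribute-value format.
        requests = [{"PutRequest": {"Item": json.loads(m["Body"])["item"]}}
                    for m in messages]
        dynamo.batch_write_item(RequestItems={table: requests})
        for m in messages:  # delete only after the write succeeded
            sqs.delete_message(QueUrl if False else QUEUE_URL,
                               ReceiptHandle=m["ReceiptHandle"]) if False else \
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
        # Throttle so fewer than max_writes_per_second writes hit the table.
        written += len(requests)
        if written >= max_writes_per_second:
            elapsed = time.monotonic() - window_start
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)
            window_start, written = time.monotonic(), 0
```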

