Deal with Dynamo write throughput scaling during batch writes #345

Open
gcv opened this issue Nov 3, 2018 · 4 comments
Labels
high High priority

Comments

@gcv
Contributor

gcv commented Nov 3, 2018

We need to deal with the spiky loads we get when a new report adds a slew of new gene requirements, causing potentially very large imports to occur (adding 60 new genes means 60×N writes, where N is the number of users in the system!).

In addition to increasing table throughput, we may need to split the batch-update Lambda invocations into smaller pieces. The places where this needs to happen are marked with TODO: Split into pieces before calling? in bioinformatics.
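
A minimal sketch of the splitting, assuming the import can be expressed as a flat list of write operations; the function name "bioinformatics-batch-writer", the payload shape, and the chunk size are made up for illustration:

```python
# Hypothetical sketch: split one oversized payload into fixed-size chunks and
# fan out one async Lambda invocation per chunk instead of a single huge call.
import json

import boto3

lambda_client = boto3.client("lambda")

def invoke_in_chunks(operations, function_name="bioinformatics-batch-writer", chunk_size=500):
    """Fan the write operations out across several async Lambda invocations."""
    for i in range(0, len(operations), chunk_size):
        chunk = operations[i:i + chunk_size]
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # async, so the caller isn't bound by the write time
            Payload=json.dumps({"operations": chunk}),
        )
```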

Dynamo auto-scaling may not react quickly enough. According to this article, Dynamo auto-scaling is implemented as a CloudWatch alarm, which can take up to 15 minutes to react. That will not work for us, as the Lambdas doing Dynamo writes have short timeouts (300 seconds maximum).

A Serverless plugin for doing some of this exists, but it may not create a sufficiently aggressive CloudWatch scaling alarm.

Another article that covers Dynamo scaling: https://medium.com/rue-la-la-tech/how-rue-la-la-bulk-loads-into-dynamodb-ad1469af578e

@gcv gcv assigned aneilbaboo and gcv Nov 3, 2018
@gcv gcv added the high High priority label Nov 3, 2018
@gcv
Contributor Author

gcv commented Nov 13, 2018

Proposed solution:

  1. No auto-scaling.
  2. Figure out what throughput we need to achieve some reasonable write rate (1000 base entries per second?).
  3. Increase Dynamo write throughput before running an update.
  4. Run the update.
  5. Decrease Dynamo write throughput.

Need to check that Dynamo alerts us when write throughput is set high. Otherwise it can get awfully expensive if the process fails to set the write throughput back down.
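
A minimal sketch of steps 3-5 with boto3, assuming a provisioned-capacity table named VariantCall; the capacity numbers are illustrative placeholders for whatever step 2 produces:

```python
# Sketch only: raise write capacity, run the batch update, always scale back down.
import boto3

dynamo = boto3.client("dynamodb")

def run_update_with_scaled_throughput(run_update, table="VariantCall",
                                      high_writes=1000, low_writes=25, reads=25):
    # Step 3: raise write throughput before the import.
    dynamo.update_table(
        TableName=table,
        ProvisionedThroughput={"ReadCapacityUnits": reads,
                               "WriteCapacityUnits": high_writes},
    )
    dynamo.get_waiter("table_exists").wait(TableName=table)  # wait until ACTIVE
    try:
        run_update()  # Step 4: perform the batch writes.
    finally:
        # Step 5: scale back down even if the update fails, so we don't keep
        # paying for the high throughput.
        dynamo.update_table(
            TableName=table,
            ProvisionedThroughput={"ReadCapacityUnits": reads,
                                   "WriteCapacityUnits": low_writes},
        )
```

Scaling down in a finally block addresses the cost concern, but Dynamo limits how many times per day a table's throughput can be decreased, so repeated runs still need care.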

@aneilbaboo aneilbaboo changed the title Deal with Dynamo scaling Deal with Dynamo write throughput scaling during batch writes Feb 13, 2019
@aneilbaboo
Contributor

Copying solution proposed in #363:
Doing blind Lambda-based writes to Dynamo is unsustainable for larger numbers of users and bases referenced in reports. We need to transition to performing all Dynamo writes, throttling, and throughput scaling by means of a queue.

This can probably be done without breaking existing code. Lambdas responsible for Dynamo writes will instead enqueue the needed operations. Another process will take care of dequeuing and performing the actual writes, and can also handle Dynamo scaling. When the queue grows to the point that the process can no longer keep up without forcing another shard to be created, it can alert us so we can figure out what to do.
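
The producer side could look roughly like this; the queue name "dynamo-write-queue" and the message shape are assumptions, not existing code:

```python
# Sketch of the producer side: Lambdas enqueue the write instead of performing it.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = sqs.get_queue_url(QueueName="dynamo-write-queue")["QueueUrl"]

def enqueue_write(user_id, item):
    """Called from the Lambdas that currently write to Dynamo directly."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user_id": user_id, "item": item}),
    )
```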

@aneilbaboo
Contributor

WORK IN PROGRESS

There are two separate issues:

  1. WRITE CAPACITY: Each Dynamo shard can only handle a maximum provisioned write throughput of 1,000 write units per second.
    Since we're partitioning on user IDs, each user's data lives on a single shard, so when we're uploading data for a user we cannot exceed 1,000 writes per second.

  2. USAGE COST: Since Dynamo only allows 27 downscaling events per day (4 at any time, plus 1 per hour after the most recent downscale), we need to batch variantCall writes together to ensure that we end up with efficient usage of the table.

Here is an equation that describes how many shards you'll have in Dynamo (from https://cloudonaut.io/dynamodb-pitfall-limited-throughput-due-to-hot-partitions/ ):

MAX( (Provisioned Read Throughput / 3,000), (Provisioned Write Throughput / 1,000), (Used Storage / 10 GB))
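
For illustration, evaluating the formula with some made-up numbers (not our actual table settings):

```python
import math

# Made-up example: 500 provisioned read units, 3,000 provisioned write units, 5 GB stored.
partitions = math.ceil(max(500 / 3000, 3000 / 1000, 5 / 10))
print(partitions)  # 3 -> the table spreads over ~3 shards, but each user's data
                   # still lives on one shard, so the per-user ceiling stays at 1,000/sec
```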

SOLUTION:

  1. Batch all writes to the VariantCall table.
    • every user is placed into a single queue
  2. Put the user ID in an SQS queue.
    • e.g. a new user after the initial upload
  3. A process runs every 2 hours (a rough sketch of its drain loop follows this list):

    • scale up write capacity
    • write all the queued data
    • scale down write capacity
    • TODO: what happens if the process takes longer than 2 hours to complete?

    Note: Throttle writes so that fewer than 1,000 writes per second hit DynamoDB for each user.

    • In principle, we could parallelize writes across users by spinning up multiple Lambdas, up to N, where N = maximum write throughput / 1,000. That gives one process per user, each writing at up to 1,000 rows per second.
    • We don't have to parallelize for the first version.
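
A minimal sketch of the drain loop, assuming the queue and message shape from the earlier producer sketch and items already in DynamoDB attribute-value format; the scale-up/scale-down calls, retry of UnprocessedItems, and the per-user split are left out, so this throttles the worker as a whole:

```python
# Sketch of the 2-hourly worker's drain loop. Queue name, table name, and
# message shape are assumptions; capacity scaling and retries are omitted.
import json
import time

import boto3

sqs = boto3.client("sqs")
dynamo = boto3.client("dynamodb")
QUEUE_URL = sqs.get_queue_url(QueueName="dynamo-write-queue")["QueueUrl"]

def drain_queue(table="VariantCall", max_writes_per_second=1000):
    window_start, written = time.monotonic(), 0
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained
        # Items are assumed to already be in DynamoDB attribute-value format.
        requests = [{"PutRequest": {"Item": json.loads(m["Body"])["item"]}}
                    for m in messages]
        dynamo.batch_write_item(RequestItems={table: requests})
        for m in messages:  # delete only after the write succeeded
            sqs.delete_message(QueUrl if False else QUEUE_URL,
                               ReceiptHandle=m["ReceiptHandle"]) if False else \
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
        # Throttle so fewer than max_writes_per_second writes hit the table.
        written += len(requests)
        if written >= max_writes_per_second:
            elapsed = time.monotonic() - window_start
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)
            window_start, written = time.monotonic(), 0
```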

