Rc vs 778 spike import limit #8197

RoriCremer · 2023-02-10T19:21:53Z

The goal of this PR is to adjust the ingest in two ways:

1. Update the ingest to loop through all samples (not just the first 10k).
2. Make the ingest far more efficient in two ways:
   - Remove the files that are downloaded to each VM so that the VMs do not carry around the extra weight.
   - Check whether the samples in the FOFNs have already been ingested, so that no additional work is spent processing those samples.
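The "already ingested" check can be pictured with a minimal sketch. This is not the PR's implementation: the function name `samples_still_to_ingest` and both inputs are hypothetical, and how the list of already-loaded samples is obtained is deliberately left out.

```python
def samples_still_to_ingest(fofn_sample_names, already_ingested):
    """Return the FOFN samples that have not been ingested yet, in FOFN order.

    Both arguments are plain iterables of sample names (hypothetical inputs;
    the real workflow would obtain them from the FOFN and from the dataset).
    """
    done = set(already_ingested)  # set lookup keeps the filter linear in the FOFN size
    return [name for name in fofn_sample_names if name not in done]


if __name__ == "__main__":
    requested = ["sample_1", "sample_2", "sample_3", "sample_4"]
    loaded = ["sample_2", "sample_4"]
    # Only the samples that still need work are passed on to the ingest step.
    print(samples_still_to_ingest(requested, loaded))
```

Filtering before any download happens is what avoids the repeated work described in the list above.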

There is still work to do to make the bulk ingest process significantly more user-friendly.

scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl

1. We do NOT want to assume that the sample ids we want are in the name field. Pass that through as a parameter.
2. We want to explicitly pause every 500 samples, as that's our page size. This slows our requests down enough to not spam the backend server and hit 503 errors, although it also slows the rate at which we can write the files if the dataset is very large. That shouldn't be a concern: as long as it doesn't cause errors, it is still a hands-off process.
3. We want to account for heterogeneous data. In AoU Delta, for instance, the control samples keep their vcf and vcf_index data in a different field. This would cause the whole thing to fail if we weren't accounting for it explicitly. We now generate an errors.txt file that holds the rows for which we couldn't find the correct columns, so they can be examined later.
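The three points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in generate_FOFNs_for_import.py: the function names, the `column_candidates` parameter, the one-second pause, and the representation of rows as dictionaries are all hypothetical. Only the page size of 500 comes from the commit message.

```python
import time

PAGE_SIZE = 500  # page size quoted in the commit message


def pick_columns(row, column_candidates):
    """Return (vcf, vcf_index) from the first candidate column pair present in the row.

    column_candidates is a hypothetical list of (vcf_column, index_column) pairs,
    tried in order so that rows storing their paths in a different field still resolve.
    Returns None when no pair is fully populated.
    """
    for vcf_col, index_col in column_candidates:
        if row.get(vcf_col) and row.get(index_col):
            return row[vcf_col], row[index_col]
    return None


def build_fofn_entries(rows, sample_id_column, column_candidates,
                       pause_seconds=1.0, sleep=time.sleep):
    """Split rows into usable (sample_id, vcf, vcf_index) entries and error rows.

    The sample id column is a parameter rather than an assumed 'name' field,
    and the loop pauses after every PAGE_SIZE rows to throttle requests.
    """
    entries, errors = [], []
    for count, row in enumerate(rows, start=1):
        sample_id = row.get(sample_id_column)
        paths = pick_columns(row, column_candidates)
        if sample_id is None or paths is None:
            errors.append(row)  # kept for later inspection instead of failing the run
        else:
            entries.append((sample_id, *paths))
        if count % PAGE_SIZE == 0:
            sleep(pause_seconds)
    return entries, errors
```

The rows returned in `errors` are the ones a caller would write out to errors.txt. Passing `sleep` as a parameter is only a convenience so the throttling can be exercised without real delays.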

codecov · 2023-02-23T03:19:32Z

Codecov Report

❗ No coverage uploaded for pull request base (ah_var_store@4a1c203).
The diff coverage is n/a.

Additional details and impacted files

@@               Coverage Diff                @@
##             ah_var_store     #8197   +/-   ##
================================================
  Coverage                ?   83.979%           
  Complexity              ?     34803           
================================================
  Files                   ?      2194           
  Lines                   ?    167039           
  Branches                ?     18005           
================================================
  Hits                    ?    140278           
  Misses                  ?     20534           
  Partials                ?      6227

gatk-bot · 2023-03-07T20:41:06Z

GitHub Actions tests reported job failures from actions build 4358183003
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
cloud	8	4358183003.10	logs

rsasch

first pass of comments

scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl

scripts/variantstore/wdl/GvsImportGenomes.wdl

scripts/variantstore/wdl/GvsPrepareBulkImport.wdl

mcovarr

first pass

scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl

scripts/variantstore/wdl/GvsPrepareBulkImport.wdl

scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl

scripts/variantstore/wdl/GvsImportGenomes.wdl

scripts/variantstore/wdl/extract/generate_FOFNs_for_import.py

scripts/variantstore/wdl/GvsPrepareBulkImport.wdl

mcovarr · 2023-03-15T21:50:38Z

scripts/variantstore/wdl/GvsPrepareBulkImport.wdl

+
+    >>>
+    runtime {
+        docker: "us.gcr.io/broad-dsde-methods/variantstore:2023-1-20-FOFN"


ISO 8601 nit: tag names should follow YYYY-MM-DD, i.e. always two digits for month and day (2023-01-20 rather than 2023-1-20). That way they sort chronologically and lexically at the same time. 🙂
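To illustrate the sorting point, here is a small sketch that zero-pads the date prefix of a tag. The helper name `normalize_date_tag` is hypothetical and is not part of the PR. It assumes tags start with a year-month-day prefix, as in the docker tag quoted above.

```python
import re

# Date prefix with a 4-digit year and 1- or 2-digit month and day, plus any suffix.
_TAG_DATE = re.compile(r"^(\d{4})-(\d{1,2})-(\d{1,2})(.*)$")


def normalize_date_tag(tag):
    """Zero-pad the month and day of a date-prefixed tag, keeping any suffix."""
    match = _TAG_DATE.match(tag)
    if match is None:
        raise ValueError(f"tag does not start with a year-month-day date: {tag!r}")
    year, month, day, rest = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}{rest}"


if __name__ == "__main__":
    print(normalize_date_tag("2023-1-20-FOFN"))  # prints 2023-01-20-FOFN
    tags = ["2023-10-05", "2023-2-14"]
    print(sorted(tags))                                  # unpadded: October sorts before February
    print(sorted(normalize_date_tag(t) for t in tags))   # padded: chronological order
```

With unpadded months the plain string sort puts "2023-10-05" ahead of "2023-2-14" because "1" compares lower than "2". Once padded, lexical and chronological order agree.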

RoriCremer force-pushed the rc-VS-778-spike-import-limit branch from 9782194 to c1dbb93 Compare February 14, 2023 02:35

RoriCremer marked this pull request as ready for review February 21, 2023 04:23

RoriCremer commented Feb 22, 2023

scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl (outdated)

koncheto-broad and others added 23 commits February 22, 2023 21:58

laying framework for FOFN bulk import code

bf1a95d

adding in terra notebook utils code

66eb6da

updating wdl

6bef86d

updating environment variables to make this work better

faca6e2

quotey McBetterQuotes

ed9e517

extra environment variables

c8eced0

normalizing variable name with other wdls that require it

2655cb7

gotta explicitly set WORKSPACE_NAMESPACE to the google project as well. Apparently.

ca24401

typoooooooooooooooooo

85f6127

Didn't pipe the output files the entire way up

7c35c62

whoopsie

66616ef

typo

33957c1

silly mistake copying the functioning code over from the workbook

9e87c22

making script more robust against specifying imaginary columns in the data table and being slightly more informative in the output of the python script

47488c9

increasing the size of the disk this is running on for the sake of efficiency (and handling larger callsets)

b475e62

Passing errors up

656be5b

init for wrapper ingest WDL

ab344b1

add read_lines

97eec7e

add dockstore

02b6813

add dockstore 2

c9d0e92

linting errors

b323841

update params

dc7d07c

RoriCremer force-pushed the rc-VS-778-spike-import-limit branch from c1dbb93 to dc7d07c Compare February 23, 2023 02:58

short term testing (rate lim)

1460fed

make it only 25 shards!

ecc2e13

RoriCremer added 16 commits March 14, 2023 22:24

we get it on our own

7a78c58

adjust check

700a00a

dont clean up for debugging

59c6f40

put in the right docker

50da72d

and now add GATK back

f29efd5

try to better understand the input_vcf value

987da01

gatk correctly

474a3e5

try curling the python instead

bc98879

update gatk docker for the right gatk method, but does it have python?

28f9edc

need the compression postfix?

f796921

fix typo

99d59fb

index has to have the same name silly

77f8007

be more robust and less cautious

477a9b4

forgot that I left this for testing

648ec48

general clean up for merge

4e68b24

comment out throttling

15c0966

RoriCremer force-pushed the rc-VS-778-spike-import-limit branch from 62795cf to 15c0966 Compare March 15, 2023 02:25

rsasch suggested changes Mar 15, 2023

mcovarr reviewed Mar 15, 2023

RoriCremer added 8 commits March 20, 2023 00:24

clean up comment

2295e9c

clean up python

9f6e46b

correct fc-secure bucket bug

17fb7ee

better spacing

39b8f7f

update docker

8e2ca8c

testing on the classic ingest

aecb4c3

up to date docker image

798b1c5

fix python typo

81aa964

RoriCremer merged commit c0535f2 into ah_var_store Mar 27, 2023

RoriCremer deleted the rc-VS-778-spike-import-limit branch March 27, 2023 14:20
