Introduce local disk cache for references #1523

jkaczynski1 · 2020-12-21T11:51:04Z

Description

PR introduces local disk cache for references. The two existing options i.e. downloading references from a web site or providing a single FASTA file are not practical in case of large data sets. The solution supports multilevel cache to avoid storing all the files in one folder.

Things to think about before submitting:

Make sure your changes compile and new tests pass locally.
Add new tests or update existing ones:
- A bug fix should include a test that previously would have failed and passes now.
- New features should come with new tests that exercise and validate the new functionality.
Extended the README / documentation, if necessary
Check your code style.
Write a clear commit title and message
- The commit message should describe what changed and is targeted at htsjdk developers
- Breaking changes should be mentioned in the commit message.

lbergelson · 2020-12-21T21:32:19Z

@jkaczynski1 Hello, thank you for this PR. Could you explain a bit more of the motivation?

The automatic reference discovery/download is not currently a really well supported or used part of htsjdk so I'm curious how you're using it. Are you running a project with MANY reference different reference sequences from something like bacterial assemblies? Or cram files with sections that each refer to different assemblies that are not well contained in a single fasta?
The usage I'm used to is that a project will typically have 1 reference file with many samples aligned to it. Or perhaps several similar reference files, like different builds of their organism genome. In those cases we typically explicitly specify the per sample reference file when opening the sample files.

I'd like to understand your use case a bit more before I review this.

lbergelson · 2020-12-21T21:33:51Z

I'd also like to understand the advantage of splitting reference files into different sub folders. Is there some per-folder size limit you're hitting? I can understand wanting to split files across multiple filesystems but I don't see the advantage of multiple folders unless you really have thousands of references.

jkaczynski1 · 2020-12-22T15:18:13Z

@lbergelson I am working with human whole-genome sequencing data. A typical WGS CRAM file requires 20-200 references, one per each sequence. The hg38 genome contains 455 sequences overall. If a user of the cache works with multiple genomes the cache can contain thousands of files. Without the cache the code will attempt downloading each reference from the EBI web site which can be very slow.
Regarding splitting the references into different sub-folders the proposed solution is similar to samtools REF_CACHE. Both single level and multilevel structure is supported so it is optional. If a user prefers to keep all the references in one folder he can do so. The advantage of being compliant with the samtools is that you can use the tools they provide to populate the cache.

lbergelson · 2020-12-22T16:07:57Z

@jkaczynski1 Thank you for clarifying! I think we have slightly different terminology we're using. I think of a reference file as a set of sequences (i.e. a fasta with all the human contigs in it) where as you're meaning is a single md5 addressable sequence.

I didn't realize samtools had a cache like this. It makes sense to match their cache structure.

cmnbroad · 2020-12-22T16:52:45Z

Both samtools and sra have these caches, and although I see the value, I vaguely recall having been surprised by both of them in the past, so I'm luke warm about embracing them. A few comments/questions:

Does samtools have some ejection strategy ?
Would it suffice to have htsjdk read from the cache, but not write to it, and just rely on the samtools to manage it ?
If we are going to write into the samtools cache, then we should have some tests that verify samtools interop (samtools is installed for the CI tests, and can be executed from within htsjdk tests).
I'll put a few more comments inline.

cmnbroad

Some preliminary comments.

src/main/java/htsjdk/samtools/Defaults.java

src/test/java/htsjdk/samtools/cram/ref/ReferenceSourceTest.java

src/test/resources/htsjdk/samtools/cram/7ddd8a4b4f2c1dec43476a738b1a9b72

jmarshall · 2020-12-22T22:44:16Z

HTSlib, and hence samtools, uses and populates this cache as controlled by the $REF_CACHE and $REF_PATH environment variables documented here. Enabling the possibility of configuring a multi-level directory structure is aimed at the case of a single cache shared institute-wide that might contain reference contigs for many different organisms and/or unfinished assemblies with thousands or millions of contigs. Re “a [multi-]fasta with all the human contigs in it” I think the original plan was that HTSlib would allow such multifasta files to be specified in $REF_PATH in some way, but IIRC this was never implemented.

It was anticipated and hoped that other implementations would want to share this cache, and the names of the environment variables were consciously chosen to be agnostic accordingly — i.e., it was intentionally not called anything like $HTSLIB_CACHE but rather a more non-committal name. I'm fairly sure this caching arrangement was discussed with the Picard/HTSJDK maintainers at the time, but alas my records of that appear to be lost in the mists of time.

Hence (1) I expect the HTSlib maintainers would like to see HTSJDK also using this cache and will be motivated to assist in ensuring interoperability; (2) please consider configuring this via the same environment variables in addition to / instead of getStringProperty("ref_cache", …).

jkaczynski1 · 2020-12-23T08:08:40Z

I have inserted my answers below:

Does samtools have some ejection strategy ?
I do not know.

Would it suffice to have htsjdk read from the cache, but not write to it, and just rely on the samtools to manage it ?
Currently there are two options:
REFERENCE_FASTA - a user determines and downloads a required reference as a single FASTA file. The feature is efficient in terms of speed but it requires manual actions. Typically a user has to download more data than needed for example the entire human genome.
USE_CRAM_REF_DOWNLOAD + EBI_REFERENCE_SERVICE_URL_MASK - missing references are automatically discovered and downloaded. However the current implementation downloads the same file over and over again which can be very slow.
The proposed enhancement gives users the best of both worlds. Missing references are automatically discovered and downloaded once.
Making the feature "read-only" removes the automation.

If we are going to write into the samtools cache, then we should have some tests that verify samtools interop (samtools is installed for the CI tests, and can be executed from within htsjdk tests).
If the htsjdk unit test framework supports interoperability testing with samtools I can provide such a test.

I'll put a few more comments inline.
I can address the issues when we agree on the requirements.

jkaczynski1 · 2020-12-23T08:15:15Z

@jmarshall I agree using the reference cache in the same way as samtools is valuable. To be fully compliant we would have to replace the USE_CRAM_REF_DOWNLOAD + EBI_REFERENCE_SERVICE_URL_MASK combination with the REF_PATH and refactor the code accordingly at the expense of loosing backward compatibility.
Regarding using the environment variables majority of Java libraries prefer system properties because they can be set from within the JVM while the environment variables are read-only and thus hard to test. I see it as as system wide decision to start using environment variables.

codecov-io · 2020-12-29T11:42:46Z

Codecov Report

Merging #1523 (63c6ffd) into master (2c4c9b2) will decrease coverage by 0.077%.
The diff coverage is 4.651%.

@@               Coverage Diff               @@
##              master     #1523       +/-   ##
===============================================
- Coverage     69.404%   69.328%   -0.077%     
- Complexity      8920      8923        +3     
===============================================
  Files            601       602        +1     
  Lines          35515     35563       +48     
  Branches        5904      5914       +10     
===============================================
+ Hits           24649     24655        +6     
- Misses          8532      8572       +40     
- Partials        2334      2336        +2

Impacted Files	Coverage Δ	Complexity Δ
...java/htsjdk/samtools/cram/ref/ReferenceSource.java	`33.784% <0.000%> (-12.945%)`	`19.000 <0.000> (ø)`
src/main/java/htsjdk/samtools/Defaults.java	`76.667% <100.000%> (+0.805%)`	`6.000 <0.000> (ø)`
...main/java/htsjdk/samtools/filter/InvertFilter.java	`80.000% <0.000%> (ø)`	`3.000% <0.000%> (?%)`

Introduce local disk cache for references

08ba02e

cmnbroad reviewed Dec 22, 2020

View reviewed changes

Move ref_cache property name into a static constant

63c6ffd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce local disk cache for references #1523

Introduce local disk cache for references #1523

jkaczynski1 commented Dec 21, 2020

lbergelson commented Dec 21, 2020

lbergelson commented Dec 21, 2020

jkaczynski1 commented Dec 22, 2020

lbergelson commented Dec 22, 2020

cmnbroad commented Dec 22, 2020

cmnbroad left a comment

jmarshall commented Dec 22, 2020

jkaczynski1 commented Dec 23, 2020

jkaczynski1 commented Dec 23, 2020

codecov-io commented Dec 29, 2020 •

edited

Loading

Introduce local disk cache for references #1523

Are you sure you want to change the base?

Introduce local disk cache for references #1523

Conversation

jkaczynski1 commented Dec 21, 2020

Description

Things to think about before submitting:

lbergelson commented Dec 21, 2020

lbergelson commented Dec 21, 2020

jkaczynski1 commented Dec 22, 2020

lbergelson commented Dec 22, 2020

cmnbroad commented Dec 22, 2020

cmnbroad left a comment

Choose a reason for hiding this comment

jmarshall commented Dec 22, 2020

jkaczynski1 commented Dec 23, 2020

jkaczynski1 commented Dec 23, 2020

codecov-io commented Dec 29, 2020 • edited Loading

Codecov Report

codecov-io commented Dec 29, 2020 •

edited

Loading