Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce local disk cache for references #1523

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jkaczynski1
Copy link

Description

PR introduces local disk cache for references. The two existing options i.e. downloading references from a web site or providing a single FASTA file are not practical in case of large data sets. The solution supports multilevel cache to avoid storing all the files in one folder.

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extended the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

@lbergelson
Copy link
Member

@jkaczynski1 Hello, thank you for this PR. Could you explain a bit more of the motivation?

The automatic reference discovery/download is not currently a really well supported or used part of htsjdk so I'm curious how you're using it. Are you running a project with MANY reference different reference sequences from something like bacterial assemblies? Or cram files with sections that each refer to different assemblies that are not well contained in a single fasta?
The usage I'm used to is that a project will typically have 1 reference file with many samples aligned to it. Or perhaps several similar reference files, like different builds of their organism genome. In those cases we typically explicitly specify the per sample reference file when opening the sample files.

I'd like to understand your use case a bit more before I review this.

@lbergelson
Copy link
Member

I'd also like to understand the advantage of splitting reference files into different sub folders. Is there some per-folder size limit you're hitting? I can understand wanting to split files across multiple filesystems but I don't see the advantage of multiple folders unless you really have thousands of references.

@jkaczynski1
Copy link
Author

@lbergelson I am working with human whole-genome sequencing data. A typical WGS CRAM file requires 20-200 references, one per each sequence. The hg38 genome contains 455 sequences overall. If a user of the cache works with multiple genomes the cache can contain thousands of files. Without the cache the code will attempt downloading each reference from the EBI web site which can be very slow.
Regarding splitting the references into different sub-folders the proposed solution is similar to samtools REF_CACHE. Both single level and multilevel structure is supported so it is optional. If a user prefers to keep all the references in one folder he can do so. The advantage of being compliant with the samtools is that you can use the tools they provide to populate the cache.

@lbergelson
Copy link
Member

@jkaczynski1 Thank you for clarifying! I think we have slightly different terminology we're using. I think of a reference file as a set of sequences (i.e. a fasta with all the human contigs in it) where as you're meaning is a single md5 addressable sequence.

I didn't realize samtools had a cache like this. It makes sense to match their cache structure.

@cmnbroad
Copy link
Collaborator

Both samtools and sra have these caches, and although I see the value, I vaguely recall having been surprised by both of them in the past, so I'm luke warm about embracing them. A few comments/questions:

  • Does samtools have some ejection strategy ?
  • Would it suffice to have htsjdk read from the cache, but not write to it, and just rely on the samtools to manage it ?
  • If we are going to write into the samtools cache, then we should have some tests that verify samtools interop (samtools is installed for the CI tests, and can be executed from within htsjdk tests).
  • I'll put a few more comments inline.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some preliminary comments.

@jmarshall
Copy link
Member

HTSlib, and hence samtools, uses and populates this cache as controlled by the $REF_CACHE and $REF_PATH environment variables documented here. Enabling the possibility of configuring a multi-level directory structure is aimed at the case of a single cache shared institute-wide that might contain reference contigs for many different organisms and/or unfinished assemblies with thousands or millions of contigs. Re “a [multi-]fasta with all the human contigs in it” I think the original plan was that HTSlib would allow such multifasta files to be specified in $REF_PATH in some way, but IIRC this was never implemented.

It was anticipated and hoped that other implementations would want to share this cache, and the names of the environment variables were consciously chosen to be agnostic accordingly — i.e., it was intentionally not called anything like $HTSLIB_CACHE but rather a more non-committal name. I'm fairly sure this caching arrangement was discussed with the Picard/HTSJDK maintainers at the time, but alas my records of that appear to be lost in the mists of time.

Hence (1) I expect the HTSlib maintainers would like to see HTSJDK also using this cache and will be motivated to assist in ensuring interoperability; (2) please consider configuring this via the same environment variables in addition to / instead of getStringProperty("ref_cache", …).

@jkaczynski1
Copy link
Author

I have inserted my answers below:

  • Does samtools have some ejection strategy ?
    I do not know.
  • Would it suffice to have htsjdk read from the cache, but not write to it, and just rely on the samtools to manage it ?
    Currently there are two options:
    REFERENCE_FASTA - a user determines and downloads a required reference as a single FASTA file. The feature is efficient in terms of speed but it requires manual actions. Typically a user has to download more data than needed for example the entire human genome.
    USE_CRAM_REF_DOWNLOAD + EBI_REFERENCE_SERVICE_URL_MASK - missing references are automatically discovered and downloaded. However the current implementation downloads the same file over and over again which can be very slow.
    The proposed enhancement gives users the best of both worlds. Missing references are automatically discovered and downloaded once.
    Making the feature "read-only" removes the automation.
  • If we are going to write into the samtools cache, then we should have some tests that verify samtools interop (samtools is installed for the CI tests, and can be executed from within htsjdk tests).
    If the htsjdk unit test framework supports interoperability testing with samtools I can provide such a test.
  • I'll put a few more comments inline.
    I can address the issues when we agree on the requirements.

@jkaczynski1
Copy link
Author

@jmarshall I agree using the reference cache in the same way as samtools is valuable. To be fully compliant we would have to replace the USE_CRAM_REF_DOWNLOAD + EBI_REFERENCE_SERVICE_URL_MASK combination with the REF_PATH and refactor the code accordingly at the expense of loosing backward compatibility.
Regarding using the environment variables majority of Java libraries prefer system properties because they can be set from within the JVM while the environment variables are read-only and thus hard to test. I see it as as system wide decision to start using environment variables.

@codecov-io
Copy link

codecov-io commented Dec 29, 2020

Codecov Report

Merging #1523 (63c6ffd) into master (2c4c9b2) will decrease coverage by 0.077%.
The diff coverage is 4.651%.

@@               Coverage Diff               @@
##              master     #1523       +/-   ##
===============================================
- Coverage     69.404%   69.328%   -0.077%     
- Complexity      8920      8923        +3     
===============================================
  Files            601       602        +1     
  Lines          35515     35563       +48     
  Branches        5904      5914       +10     
===============================================
+ Hits           24649     24655        +6     
- Misses          8532      8572       +40     
- Partials        2334      2336        +2     
Impacted Files Coverage Δ Complexity Δ
...java/htsjdk/samtools/cram/ref/ReferenceSource.java 33.784% <0.000%> (-12.945%) 19.000 <0.000> (ø)
src/main/java/htsjdk/samtools/Defaults.java 76.667% <100.000%> (+0.805%) 6.000 <0.000> (ø)
...main/java/htsjdk/samtools/filter/InvertFilter.java 80.000% <0.000%> (ø) 3.000% <0.000%> (?%)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants