Introduce local disk cache for references #1523
base: master
Conversation
@jkaczynski1 Hello, thank you for this PR. Could you explain the motivation a bit more? The automatic reference discovery/download is not currently a well-supported or widely used part of htsjdk, so I'm curious how you're using it. Are you running a project with MANY different reference sequences, from something like bacterial assemblies? Or CRAM files with sections that each refer to different assemblies that are not well contained in a single fasta? I'd like to understand your use case a bit more before I review this.
I'd also like to understand the advantage of splitting reference files into different sub-folders. Is there some per-folder size limit you're hitting? I can understand wanting to split files across multiple filesystems, but I don't see the advantage of multiple folders unless you really have thousands of references.
@lbergelson I am working with human whole-genome sequencing data. A typical WGS CRAM file requires 20-200 references, one per sequence. The hg38 genome contains 455 sequences overall. If a user of the cache works with multiple genomes, the cache can contain thousands of files. Without the cache, the code attempts to download each reference from the EBI web site, which can be very slow.
@jkaczynski1 Thank you for clarifying! I think we're using slightly different terminology. I think of a reference file as a set of sequences (i.e. a fasta with all the human contigs in it), whereas you mean a single MD5-addressable sequence. I didn't realize samtools had a cache like this. It makes sense to match their cache structure.
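For context on what "MD5-addressable" means here, each sequence is looked up by the MD5 of its bases, and a samtools-style cache splits that hex digest into nested sub-directories. A minimal Java sketch of the idea (the class and helper names are hypothetical illustrations, not htsjdk API):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RefCachePath {
    // Compute the MD5 hex digest of a sequence's bases (upper-cased,
    // as CRAM reference digests are computed over normalized bases).
    static String sequenceMd5(String bases) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(bases.toUpperCase().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // Map an MD5 to a two-level cache path, e.g.
    // "7ddd8a4b..." -> <root>/7d/dd/8a4b...
    static Path cachePath(Path root, String md5) {
        return root.resolve(md5.substring(0, 2))
                   .resolve(md5.substring(2, 4))
                   .resolve(md5.substring(4));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String md5 = sequenceMd5("ACGT");
        System.out.println(md5 + " -> " + cachePath(Paths.get("/tmp/ref_cache"), md5));
    }
}
```

Splitting the digest this way keeps any single directory from accumulating thousands of entries, which is the same motivation given for the multilevel cache in this PR.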
Both samtools and sra have these caches, and although I see the value, I vaguely recall having been surprised by both of them in the past, so I'm lukewarm about embracing them. A few comments/questions:
Some preliminary comments.
HTSlib, and hence samtools, uses and populates this cache as controlled by the REF_PATH and REF_CACHE environment variables. It was anticipated and hoped that other implementations would want to share this cache, and the names of the environment variables were consciously chosen to be agnostic accordingly (i.e., it was intentionally not given a samtools-specific name). Hence (1) I expect the HTSlib maintainers would like to see HTSJDK also using this cache and will be motivated to assist in ensuring interoperability; (2) please consider configuring this via the same environment variables in addition to / instead of htsjdk-specific settings.
I have inserted my answers below:
@jmarshall I agree that using the reference cache in the same way as samtools is valuable. To be fully compliant we would have to replace the USE_CRAM_REF_DOWNLOAD + EBI_REFERENCE_SERVICE_URL_MASK combination with REF_PATH and refactor the code accordingly, at the expense of losing backward compatibility.
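For reference, the HTSlib/samtools cache discussed above is typically configured like this (the paths here are example values; `%s` and `%2s` are expanded from the sequence MD5, with `%2s` consuming two hex digits at a time):

```shell
# Where HTSlib writes and reads cached sequences; the %2s/%2s/%s pattern
# yields a two-level directory layout such as 7d/dd/8a4b...
export REF_CACHE=/data/ref/cache/%2s/%2s/%s

# Colon-separated list of places to look up a sequence by MD5.
# The EBI reference service URL can be appended as a network fallback.
export REF_PATH="/data/ref/cache/%2s/%2s/%s:https://www.ebi.ac.uk/ena/cram/md5/%s"

echo "$REF_CACHE"
echo "$REF_PATH"
```

Reusing these variables, as suggested in the comment above, would let htsjdk share a cache directory with samtools installations that are already populated.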
Codecov Report
@@ Coverage Diff @@
## master #1523 +/- ##
===============================================
- Coverage 69.404% 69.328% -0.077%
- Complexity 8920 8923 +3
===============================================
Files 601 602 +1
Lines 35515 35563 +48
Branches 5904 5914 +10
===============================================
+ Hits 24649 24655 +6
- Misses 8532 8572 +40
- Partials 2334 2336 +2
Description
This PR introduces a local disk cache for references. The two existing options, i.e. downloading references from a web site or providing a single FASTA file, are not practical for large data sets. The solution supports a multilevel cache to avoid storing all the files in one folder.
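The overall lookup flow described here can be sketched as "check the multilevel cache, download on a miss, then serve from disk". The class, method names, and URL mask below are hypothetical stand-ins for illustration, not the PR's actual code (the real URL mask comes from configuration, cf. EBI_REFERENCE_SERVICE_URL_MASK):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CachedReferenceSource {
    // Hypothetical download URL mask, keyed by sequence MD5.
    static final String URL_MASK = "https://www.ebi.ac.uk/ena/cram/md5/%s";

    private final Path cacheRoot;

    CachedReferenceSource(Path cacheRoot) {
        this.cacheRoot = cacheRoot;
    }

    // Two-level layout: <root>/<md5[0:2]>/<md5[2:4]>/<md5[4:]>
    Path pathFor(String md5) {
        return cacheRoot.resolve(md5.substring(0, 2))
                        .resolve(md5.substring(2, 4))
                        .resolve(md5.substring(4));
    }

    // Return the cached sequence file, downloading it once on a miss.
    Path fetch(String md5) throws IOException {
        Path cached = pathFor(md5);
        if (!Files.exists(cached)) {
            Files.createDirectories(cached.getParent());
            try (InputStream in = new URL(String.format(URL_MASK, md5)).openStream()) {
                Files.copy(in, cached, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return cached;
    }
}
```

With a layout like this, each of the 20-200 sequences a WGS CRAM needs is downloaded at most once, and subsequent runs are served entirely from local disk.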