[Question] variants and output #46
Comments
Thanks for posting such a detailed explanation of the issue. Since you're interested in mutations on specific haplotypes, could you install the latest version:
Yes, --region and -z are not supposed to be used together.
Based on the output you received and the length of the haplotypes, it seems to me that not all your reads were considered in the analysis. VILOCA divides the alignment into windows based on your bed file, and only reads covering at least 85% of a window are included in that window; reads that cover less than 85% of the window are discarded for that window.
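To make the 85% rule concrete, here is a minimal sketch of such a filter. It is not VILOCA's actual code: the function names, the half-open coordinates, and the example window and reads are all assumptions chosen for illustration.

```python
def window_coverage(read_start, read_end, win_start, win_end):
    """Fraction of the window [win_start, win_end) covered by the read
    [read_start, read_end). Half-open coordinates are an assumption."""
    overlap = max(0, min(read_end, win_end) - max(read_start, win_start))
    return overlap / (win_end - win_start)


def reads_kept_for_window(reads, win_start, win_end, min_fraction=0.85):
    """Names of reads covering at least min_fraction of the window."""
    return [
        name
        for name, start, end in reads
        if window_coverage(start, end, win_start, win_end) >= min_fraction
    ]


# Hypothetical window and reads, for illustration only.
window = (100, 200)
reads = [
    ("full", 90, 210),      # spans the whole window
    ("most", 100, 190),     # covers 90% of the window
    ("partial", 150, 260),  # covers 50% of the window
]
kept = reads_kept_for_window(reads, *window)
print(kept)
```

Under this rule a read can be long and still be dropped for a given window if it only overlaps part of it, which is one way a large share of reads can end up unused.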
Thanks for getting back to me so quickly!
Besides, I investigated the
Edit: While I was writing this I found your
are the ranges from A to D or B to C?
Dear @LaraFuhrmann and VILOCA dev team,
currently, I am using VILOCA in an attempt to reconstruct the haplotypes of a pooled SARS-CoV-2 (SC2) sample, focusing my analysis only on the virus' spike (S) gene. Unfortunately, I face some difficulties somewhere between understanding the variant reporting and the corresponding haplotypes returned by the program. I don't have a precise question yet so I'll explain the situation as I write, my apologies if this post becomes very verbose.
The data
bed file is limited to exactly those primers targeting the S gene
My expectation
I understand VILOCA as an approach to determine co-occurring mutations from the heterogeneous SC2 population in my pooled sample. Each set of co-occurring mutations defines a mutation profile. I expect the corresponding haplotypes to be consensus sequences reflecting one particular mutation profile each.
Program execution
I installed viloca from bioconda.
I attempted two program runs so far
(1)
(2)
Run (2) is a re-run of (1), since I don't know whether --region and -z are supposed to be used together. However, similar issues occurred in both cases. Both runs finished seemingly successfully, each requiring ~1h wall clock time.
The issue(s)
I'll demonstrate the issues using the results of run (2).
The first thing I checked after the run is the variants.
SNVs_0.010000_final.vcf (I shortened some decimals for readability)
The reported variants are absolutely off target w.r.t. what I can spot in IGV:


There are five SNPs that should be easily found given their allele frequency. Not a single SNP reported in the VCF is matching the SNPs in the screenshot above.
Again, if I search in the IGV track for e.g. the first variant (POS=22599) in the VCF file, I see the following picture:
The C (ALT) base over G (REF) is neither a major allele nor the most frequent ALT base. C is only the third most likely occurring allele with 7% reads supporting it.
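The ranking described here can be made explicit with a small sketch that tallies the bases in one pileup column. The column below is a toy example: only the 7% support for C comes from the IGV observation, while the G and T counts (and the choice of T as the other ALT base) are made up for illustration.

```python
from collections import Counter


def allele_frequencies(bases):
    """Return (base, frequency) pairs sorted from most to least frequent."""
    counts = Counter(bases)
    total = sum(counts.values())
    return [(base, n / total) for base, n in counts.most_common()]


# Toy pileup column of 100 reads. Only the 7% for C reflects the post;
# the G and T counts are hypothetical.
column = "G" * 60 + "T" * 33 + "C" * 7
ranked = allele_frequencies(column)
for base, freq in ranked:
    print(base, round(freq, 2))
```

With a column like this, C ranks third, so a caller reporting G>C but not the more frequent ALT base would look inconsistent with the raw read counts.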
An intermediate question at this point is: did I do something wrong using VILOCA to be so far off the ground truth?
Second, and I don't know if this is a hint or an issue, I have seven w-<genome-region>.reads-support.fas files in the haplotypes subfolder, containing 3, 5, 8, 2, 8, 4, and 2 sequences respectively. Each of the sequences in these fas files is ~20-30nt long. Given the data, I expected haplotype sequences of at least ~1200nt, even in the absolute worst case where the phasing of co-occurring variants is discontinued after one amplicon.
Can you spot any obvious error in the way I use VILOCA?
Anything wrong with the data used?
Wrong assumptions?
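For reference, the sequence counts and lengths quoted above can be tallied with a short script like the following sketch. The glob pattern follows the file naming mentioned above; the function names and the haplotypes directory path are assumptions, and the length counts every character on a sequence line, including any gap symbols.

```python
from pathlib import Path


def fasta_lengths(text):
    """Return {header: sequence length} for FASTA-formatted text."""
    lengths = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            # Multi-line records are summed; gap characters are counted too.
            lengths[header] += len(line)
    return lengths


def summarize_haplotype_dir(directory):
    """Map each w-*.reads-support.fas file name to its per-sequence lengths."""
    summary = {}
    for path in sorted(Path(directory).glob("w-*.reads-support.fas")):
        summary[path.name] = fasta_lengths(path.read_text())
    return summary


# Usage (the directory name is an assumption):
# for name, lengths in summarize_haplotype_dir("haplotypes").items():
#     print(name, len(lengths), sorted(lengths.values()))
```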
Any help much appreciated!