Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] variants and output #46

Open
Krannich479 opened this issue Mar 4, 2025 · 2 comments
Open

[Question] variants and output #46

Krannich479 opened this issue Mar 4, 2025 · 2 comments

Comments

@Krannich479
Copy link

Dear @LaraFuhrmann and VILOCA dev team,

currently, I am using VILOCA in an attempt to reconstruct the haplotypes of a pooled SARS-CoV-2 (SC2) sample, focusing my analysis only on the virus' spike (S) gene. Unfortunately, I face some difficulties somewhere between understanding the variant reporting and the corresponding haplotypes returned by the program. I don't have a precise question yet so I'll explain the situation as I write, my apologies if this post becomes very verbose.

The data

  • a pooled sample of five distinct SC2 haplotypes
  • decent quality ONT long read data from R10 flowcells
  • a custom primer scheme of roughly 1200nt per amplicon
  • four of my amplicons span the S gene region (the S gene is ~3821nt long according to the Wuhan-Hu-1 reference genome)
  • the primers' bed file is limited to exactly those primers targeting the S gene
  • read depth is 5,000 - 15,000 at amplicon regions

My expectation

I understand VILOCA as an approach to determine co-occurring mutations from the heterogeneous SC2 population in my pooled sample. Each set of co-occurring mutations defines a mutation profile. I expect the corresponding haplotypes to be consensus sequences reflecting one particular mutation profile each.

Program execution

I installed viloca from bioconda.

$ viloca --version
1.0.0

I attempted two program runs so far

(1)

 (env_viloca)$ viloca run \
    -b barcode01.sorted.primerclipped.bam \
    -f NC_045512.2.fasta \
    -z nCoV-2019.scheme.bed \
    --region NC_045512.2:21563-25767 \
    --seed 479 \
    -of vcf \
    -t 16 \
    --mode use_quality_scores

(2)

 (env_viloca)$ viloca run \
    -b barcode01.sorted.primerclipped.bam \
    -f NC_045512.2.fasta \
    -z nCoV-2019.scheme.bed \
    --seed 479 \
    -of vcf \
    -t 16 \
    --mode use_quality_scores

Run (2) is a re-run of (1) since I don't know if --region and -z are might not supposed to be used together. However, in both cases similar issues occurred. Both runs finished seemingly successful with

Index file was created successfully.
VILOCA terminated successfully.

requiring ~1h wall clock time.

The issue(s)

I'll demonstrate the issues using the results of run (2).

The first thing I checked after the run is the variants.
SNVs_0.010000_final.vcf (I shortened some decimals for readability)

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
NC_045512.2     22599   .       G       C       100     PASS    Freq1=0.0653331;Post1=1.0
NC_045512.2     23525   .       C       T       100     PASS    Freq1=0.0380719;Post1=1.0
NC_045512.2     24642   .       C       T       133.70      PASS    Freq1=0.0007442;Post1=0.99999
NC_045512.2     24724   .       G       A       81.29       PASS    Freq1=0.0007392;Post1=0.99999

The reported variants are absolutely off target w.r.t. what I can spot in IGV:
Image
There are five SNPs that should be easily found given their allele frequency. Not a single SNP reported in the VCF is matching the SNPs in the screenshot above.
Again, if I search in the IGV track for e.g. the first variant (POS=22599) in the VCF file, I see the following picture:
Image
The C (ALT) base over G (REF) is neither a major allele nor the most frequent ALT base. C is only the third most likely occurring allele with 7% reads supporting it.
An intermediate question at this point is: did I do something wrong using VILOCA to be so far off the ground truth?

Second, and I don't know if this is a hint or an issue, I have seven w-<genome-region>.reads-support.fas files in the haplotypes subfolder containing 3,5,8,2,8,4, and 2 sequences each. Each of the sequences in these fas files is ~20-30nt long. According to the data I expected at least ~1200nt haplotype sequences in the absolute worst case that the phasing of co-occurring variants is discontinued after one amplicon.


Can you spot any obvious error in the way I use VILOCA?
Anything wrong with the data used?
Wrong assumptions?
Any help much appreciated!

@LaraFuhrmann
Copy link
Collaborator

LaraFuhrmann commented Mar 4, 2025

Thanks for posting such a detailed explanation of the issue.

Since you're interested in mutations on specific haplotypes, could you install the latest version:
pip install git+https://github.com/cbg-ethz/VILOCA@dev
This includes an updated output table for co-occurring mutations. You can find more details about the update here: #45

Yes, --region and -z are not supposed to be used together.

Based on the output you received and the length of the haplotypes, it seems to me that not all your reads were considered in the analysis. VILOCA is dividing the alignment into windows based on your bed file and only reads covering at least 85% of the window are included in that window, reads that cover less than 85% of the window are discarded for that window.

  • Is there the expected coverage for each window/amplicon in the coverage.txt file?
  • What is your usual read length?
  • What is the usual amplicon length as defined in your bed file?

@Krannich479
Copy link
Author

Thanks for getting back to me this quick!
I used your latest dev commit d3c76c6 to rerun (2), so I should be up to date now. By comparing to #45 I am not sure if we have the same output format in cooccurring_mutations.csv. My header is

,haplotype_id,chrom,start,reads,support,position,ref,var,coverage

Besides, I investigated the coverage.txt and might found the error. The ranges in that file resemble the genome indices of my primers! So I presume my primer scheme bed file is not formatted as the program expects. I couldn't find a format description required by VILOCA but my scheme is in artic-primer format:

<REF> <START> <END> <NAME> <POOL> <STRAND> [<SEQ>]

Edit: While I was writing this I found your /tests/data_1/scheme.insert.bed. VILOCA seems to expect one line per amplicon. Are the start/end indices with or without primers? E.g.

amplicon   |------primer1------|----------------seq---------------|--------primer2--------|
range      A                   B                                  C                       D

are the ranges from A to D or B to C?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants