all scripts using in building a species tree of Maesa (Primulaceae s.l.) Pirada Sumanon, April 2021
The analyses were performed mainly on GenomeDK cluster.
All programs/softwares used in the pipeline that cannot be installed by conda need to be installed at root (~
)
All data are stored in a project folder /faststorage/project/Maesa
- data: directory for raw data with renamed files and also alignments after manual edited (this directory is backed up)
raw
edited_alignments
- scripts: all scripts used (this directory is backed up)
- steps: input and output in each step of the pipeline
essential
: all txt files that needed in the analysis (e.g. outgroups.txt)input_rawreads
: raw reads copied from../data
fastqc
: contains 2 subdirectories for the results for raw reads (fastqc_raw
) and trimmed reads (fastqc_trimmed
) (NOTE: I run fastqc on Linospadix & my own computer, so this directory is not exist on GenomeDK.)trimmed
: trimmed readshybpiper
: all about hybpiper *Target file needs to be at~/target
coverage
: output of coverage trimmingseq_sets2
: sequence sets after coverage trimming and length filteringMAFFT
: all related to sequence alignmentsoptrimal_prepare
: files before optrimaloptrimal_ready
: files ready for optrimal and also outputs from optrimalalignments_for_editing
: trimmed alignments for manaul editingalignments_edited
: alignments after manual editing, plus a subdirectorynoempty
for alignments that remove empty sequences (_noempty.fasta) and also alignments without outgroup sequences (_noOG.fasta)addOGexons
: alignments which exon outgroups were added to. These OGexonadded alignments are ready for further analysis.iqtree_prepare
: clean alignments (no exon1 and exon2) and partition files, and a subdirectory containing a set of 10 MLtrees for each alignments for RogueNaRok.RNR
: everything produced during RNR to detect and prune rogue taxa plus pruned alignmentsprelim_iqtree
: first gene trees from IQ-Tree before Treeshrinktreeshrink
: for TreeShrink analysisalignments_edited_2nd
: already checked for long branches and edited alignments for final tree buildingfinaltree
: final gene trees from IQ-tree and species tree from ASTRAL
Start at the Global Working Directory (main directory with subproject and scripts; in my case, it's /faststorage/project/Maesa
)
Execute QC_and_Trim.sh
for running fastqc on raw reads, trimming raw reads with trimmomatic and running fastqc on trimmed reads
- fastqc outputs stored in
fastqc
, inspect the .html outputs for read quality of each sample - trimmomatic outputs (trimmed reads) stored in
trimmed
by executing readsfirst.sh
from Global Working Directory (/faststorage/project/Maesa
, in my case)
run from within hybpiper
directory by executing statsum.sh
this script uses get_seq_lengths.py and test_stats.py of HybPiper and will produce 2 output files
- seq_lengths.txt for producing a heatmanp in R
- stats.txt for summary
by running retrieve_seq.sh
from within hybpiper
directory,
For intronerate.py, I used intronerate_dev.py instead of the one from release version. (mossmatters/HybPiper#41).
Run paralog_inv.sh
, then create paralog_genelist.txt and execute paralog_retrieve.sh
To create a paralog genelist, use the file obtained from previous step/copy the output printed on screen while running
paralog_inv.sh
, remove all the information from it except for the gene name. Example: '2 paralogs written for Gene001' changes to 'Gene001'. Then delete any duplicated lines. In some text editor, go to "Text->Process Duplicated Lines". One gene per line.
For more detail on paralogs: https://github.com/mossmatters/HybPiper/wiki/Paralogs
This part follows Wolf Eiserhardt's scripts for Dypsidinae_species_tree.
In a global working directory, run coverage_use.sh
This script will execute coverage.py
then samples2genes.py
.
Output: the new gene fasta files with coverage trimming will be saved in seq_sets2
and ready for alignment.
From GWD, run MAFFT.sh
.
The aligned sequences will be stored in MAFFT/alignedseq
.
This step creates new alignments that contain the original alignments plus the exon sequences of the two species that had the highest recovery success at each locus. Using exon_mapper.py
written by Wolf Eiserhardt.
The outputs will be stored in MAFFT/alignments_exon
.
- prepare alignments by executing
optrimal_prepare.sh
, the alignments ready for optrimal will be stored inoptrimal_ready
. - In
optrimal_ready
, generatecutoff_trim.txt
with desired-gt
values to be tested (one value per line, the values must included 0 and 1). For example:
0
0.2
0.3
0.4
0.5
0.6
0.7
1
run optrimal_run.sh
from GWD.
This script implemented PASTA_taster.sh and optrimAL.R written by Shee, Zhi Qiang (https://github.com/keblat/bioinfo-utils/blob/master/docs/advice/scripts/optrimAl.txt).
Note: optrimAL.R was modified at the last line to NOT discard alignments with data loss exceeding 30%.
The last part of the script will copy the trimmed alignments to alignments_for_editing
. Then, it will filter paralog loci out and move them to paralogs
subdirectory.
Note: some alignments may contain empty sequences, but we will deal with this issue after finish manual editing.
Each alignment needs to cleaned manually in an alignment editor program (i.e. AliView).
- copy the directory
alignments_for_editing
to a place of your choice - after making all neccessary edits, save the edited version of all alignments to
alignments_edited
- If alignments are found to be overall wrong or doubtful (e.g. paralogs/chimeric sequences), move these alignments to
alignments_bad
and excluded from further analysis.
From GWD, run iqtree_prepare.sh
. This script will perform as followed:
- remove empty sequences from alignments using
remove_empty.py
- remove outgroup sequences and add outgroup exons instead. outgroup sequences used in our analysis (Ardisia) look weird for intron regions, so we decided to remove outgroup sequences and added only outgroup exons for the analysis. To do that, the script use
removeOG.py
andaddOGexon.py
- generate partition files and remove exon sequences (exon1 and exon2) from the alignments using
partition_v2.py
Based on low branch supports in our preliminary tree, we thought that it was severly affected from rogue taxa. Thus, we decided to implemented RogueNaRok algorithm (https://github.com/aberer/RogueNaRok) to detect and prune those taxa out. To do that, we have to:
- generate a set of ML trees: we go for 10 trees using
10MLtrees.sh
executed withiniqtree_prepare
directory the files that ready for RNR will be store inRNR
directory - In
RNR
, runRNR.sh
which produces rogue-pruned alignments
RNR may remove outgroup sequences in some alignments, we need to add them back by running postRNR.sh
.
run firstiqtree.sh
to get gene trees.
from GWD, execute treeshrink_prep.sh
followed by treeshrink.sh
, this will print a report of long branches detected by TreeShrink in report.txt
in treeshrink
directory. Use it to check the alignments in 2nd manual editing.
A quick check of the alignments (now stored in steps/prelim_iqtree/done
in AliView based on the result from TreeShrink, remove/edit sequences either appropriate. May use gene trees generated from 7.4 as a guideline. The edited alignments from this step should be stored in alignments_edited_2nd
execute final_iqtree.sh
from GWD, this script will
- remove some problematic samples and repetitive samples (IMPORTANT: prepare a list of samples in blacklists.txt, same line seperated by space)
#example format of blacklists.txt
sample1 sample99 sample124
- copy input files to
finaltree
directory for working - perform IQ-tree
run speciestree.sh
which check outgroups, root gene trees, collapse low support branches, build species tree and annotate tree.
I set to collapse branch that has BS<=10 and <30, so will get 2 sets of outputs. Choose the one that has higher resolution.
Then for fullannotation files (genetrees_bs30_astralfull_annot.tre or genetrees_bs10_astralfull_annot.tre), open it in TextEditor and
Replace:
[q1=[\d,.,E,-]+;q2=[\d,.,E,-]+;q3=[\d,.,E,-]+;f1=[\d,.,E,-]+;f2=[\d,.,E,-]+;f3=[\d,.,E,-]+;pp1=[\d,.,E,-]+;pp2=[\d,.,E,-]+;pp3=[\d,.,E,-]+;QC=[\d,.,E,-]+;EN=([\d,.,E,-]+)]
With:
\1
And at some branches, there might be NaN, So replace:
[q1=[\w+]+;q2=[\w+]+;q3=[\w+]+;f1=[\d,.,E,-]+;f2=[\d,.,E,-]+;f3=[\d,.,E,-]+;pp1=[\d,.,E,-]+;pp2=[\d,.,E,-]+;pp3=[\d,.,E,-]+;QC=[\d,.,E,-]+;EN=([\d,.,E,-]+)]
With: \1
save the file as "genetrees_bs30_astral_EN.tre" or "genetrees_bs10_astral_EN.tre"