- Introduction
- Running the pipeline
- Main arguments
- Reference genomes
- Preprocessing
- Alignment
- Counts
- Reference-guided Transcriptome Assembly
- Profiles
- Job resources
- Automatic resubmission
- Custom resource requests
- Other command line parameters
Nextflow handles job submissions on SLURM or other environments, and supervises the running jobs. The Nextflow process must therefore run until the pipeline is finished. We recommend running the process in the background through `screen` / `tmux` or a similar tool. Alternatively, you can run Nextflow within a cluster job submitted to your job scheduler.
It is recommended to limit the memory of the Nextflow Java virtual machine. We recommend adding the following line to your environment (typically in `~/.bashrc` or `~/.bash_profile`):
NXF_OPTS='-Xms1g -Xmx4g'
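In practice this is usually added as an export line, for example:

```shell
# limit the Nextflow JVM heap: 1 GB initial, 4 GB maximum
export NXF_OPTS='-Xms1g -Xmx4g'
```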
The typical command for running the pipeline is as follows:
nextflow run main.nf --reads '*_R{1,2}.fastq.gz' --aligner 'star' --counts 'star' -profile 'singularity'
This will launch the pipeline with the `singularity` configuration profile. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:
work # Directory containing the nextflow working files
results # Finished results (configurable, see below)
.nextflow_log # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.
You can change the output directory using the `--outDir`/`-w` options.
Gene-based quantification is a very common question in RNA-seq analysis. Broadly speaking, two different strategies can be used:
- Sequencing reads are first aligned on a reference genome, and gene counts are estimated using tools such as `STAR`, `HTSeqCounts` or `featureCounts`.
nextflow run main.nf --reads '*_R{1,2}.fastq.gz' --aligner 'star' --counts 'star' -profile 'singularity,cluster'
- Or gene abundance can be directly inferred from raw sequencing data using pseudo-alignment (or selective-alignment) methods such as `salmon`.
nextflow run main.nf --reads '*_R{1,2}.fastq.gz' --pseudoAligner 'salmon' -profile 'singularity,cluster'
In this case, there is no aligned (BAM) file: the pipeline takes raw fastq files and directly extracts a counts table.
Several studies have shown that transcript quantification approaches (such as `salmon`) are more sensitive than read counting methods (such as `HTSeqCounts` or `featureCounts`), although most of these demonstrations are based on simulated data. So far, the current best practice would thus be to favor `salmon` over the other tools.
Regarding the differences / benefits of running `salmon` in alignment mode (from a BAM file) versus in selective-alignment mode (from raw reads), this point is discussed in the mapping and alignment methodology paper. The most obvious difference is that the modes which are not based on a BAM file are much faster. On the other hand, it is usually useful to have a BAM file to perform additional analysis and visualization of the data. Regarding accuracy, there does not seem to be a huge difference: as long as the `salmon` indexes are properly built (considering the genome as "decoy" sequence), both methods usually lead to accurate quantification estimates.
One of the main advantages of `salmon` is its ability to estimate abundance at both the gene and (known) transcript levels.
If you are interested in isoform analysis, it can also be useful to run `STAR` in two-pass mode. Here, the idea is to run a first alignment with the usual parameters, then collect the detected junctions and use them as "annotated" junctions for the second mapping pass.
In addition, all tools' default parameters can be updated on the command line. As an example, here we would like to update the `--numBootstraps` parameter, which is required to run tools such as `sleuth` for differential analysis at the isoform level.
The typical command line to estimate both known isoform and gene abundances with `salmon` would be:
nextflow run main.nf --reads '*_R{1,2}.fastq.gz' --aligner 'star' --starTwoPass --counts 'salmon' --salmonQuantOpts '--numBootstraps 200' -profile 'singularity,cluster'
Since v4.0.0, the pipeline includes tools for reference-guided de-novo assembly.
The goal of such an analysis is to detect new isoforms/genes from short-read data. The typical output is a new `gtf` file with known and new transcripts.
In the current version, `scallop` and `stringtie` are available and can be specified using the `--denovo` option.
Note that both methods require a BAM file as input. Multiple tools can be specified (comma separated).
The results are then assessed using the `gffCompare` utility, which compares a known `gtf` with the one(s) generated by the pipeline.
In most cases, a high fraction of the detected transcripts should correspond to known ones. `gffCompare` thus proposes sensitivity/specificity metrics for accuracy estimation.
Here again, it is recommended to run `STAR` in two-pass mode to improve novel splice junction detection.
nextflow run main.nf --reads '*_R{1,2}.fastq.gz' --aligner 'star' --starTwoPass --denovo 'stringtie,scallop' -profile 'singularity,cluster'
Use this to specify the location of your input FastQ files. For example:
--reads 'path/to/data/sample_*_{1,2}.fastq'
Please note the following requirements:
- The path must be enclosed in quotes
- The path must have at least one `*` wildcard character
- When using the pipeline with paired-end data, the path must use `{1,2}` notation to specify read pairs

If left unspecified, a default pattern is used: `data/*{1,2}.fastq.gz`
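Bash brace expansion behaves like the `{1,2}` notation used here, so you can sanity-check a pattern from the shell before launching the pipeline (the file names below are made up):

```shell
# create dummy paired-end files and check what the pattern matches
mkdir -p /tmp/reads_demo
touch /tmp/reads_demo/sampleA_R1.fastq.gz /tmp/reads_demo/sampleA_R2.fastq.gz
touch /tmp/reads_demo/sampleB_R1.fastq.gz /tmp/reads_demo/sampleB_R2.fastq.gz
ls /tmp/reads_demo/*_R{1,2}.fastq.gz    # lists the four files above
```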
Use this to specify a sample plan file instead of a regular expression to find fastq files. For example:
--samplePlan 'path/to/data/sample_plan.csv'
The sample plan is a csv file with the following information:
Sample ID | Sample Name | Path to R1 fastq file | Path to R2 fastq file
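A minimal sample plan following these four columns might look like this (sample IDs, names and paths are purely illustrative):

```csv
SAMPLE_1,Control_rep1,/path/to/data/sample1_R1.fastq.gz,/path/to/data/sample1_R2.fastq.gz
SAMPLE_2,Treated_rep1,/path/to/data/sample2_R1.fastq.gz,/path/to/data/sample2_R2.fastq.gz
```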
The current version of the pipeline supports two different aligners: `STAR` and `HISAT2`. Since version 4.0.0, no default value is defined. You can specify the tool to use as follows:
--aligner 'STAR'
Recent advances in the field of RNA-seq data analysis promote the usage of pseudo-aligners, which are able to estimate gene/transcript abundance directly from the raw fastq files.
The following tools are currently available:
--pseudoAligner 'salmon'
The raw count table for all samples can be generated from the alignment files (BAM) using one of the following tools:
- `STAR`. Requires `--aligner 'STAR'`
- `featureCounts`
- `HTSeqCounts`
- `salmon`

Since version 4.0.0, no default value is defined. You can specify one of these tools using:
--counts 'salmon'
By default, the pipeline expects paired-end data. If you have single-end data, you need to specify `--singleEnd` on the command line when you launch the pipeline. A normal glob pattern, enclosed in quotation marks, can then be used for `--reads`. For example:
--singleEnd --reads '*.fastq.gz'
It is not possible to run a mixture of single-end and paired-end files in one run.
Several parts of the RNA-seq data analysis rely on the strandness information.
If you already have the information, you should specify the strandness using either `forward` (ie. stranded), `no` (ie. unstranded) or `reverse` (ie. reverse stranded), as follows:
--stranded 'reverse'
If you do not have the information, you can use the automatic detection mode (default mode) as follows:
--stranded 'auto'
In this case, the pipeline will run the `rseqc` tool to automatically detect the strandness parameter.
The pipeline config files come bundled with paths to the genomes reference files.
There are different species supported in the genomes references file. To run the pipeline, you must specify which to use with the --genome
flag.
You can find the keys to specify the genomes in the genomes config file. Common genomes that are supported are:
- Human
--genome hg38
- Mouse
--genome mm10
There are numerous others - check the config file for more.
Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the genomes resource. See the Nextflow documentation for instructions on where to save such a file.
The syntax for this reference configuration is as follows:
params {
genomes {
'hg19' {
fasta = '<path to genome fasta file for identito monitoring>'
fastaFai = '<path to genome index file for identito monitoring>'
bowtie2 = '<path to the bowtie2 index files>'
star = '<path to the STAR index files>'
hisat2 = '<path to the HiSat2 index files>'
salmon = '<path to the Salmon index files>'
rrna = '<path to bowtie1 mapping on rRNA reference>'
bed12 = '<path to Bed12 annotation file>'
gtf = '<path to GTF annotation file>'
transcriptsFasta = '<path to fasta transcriptome file for pseudo-alignment>'
gencode = '<boolean - is the annotation file based on Gencode?>'
polym = '<path to bed file for identito monitoring>'
}
// Any number of additional genomes, key is used with --genome
}
}
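For instance, a custom genome entry (the `'myAssembly'` key and all paths below are hypothetical) can be saved in its own config file and loaded with the `-c` option:

```groovy
// e.g. saved as ~/genomes.config and loaded with:
// nextflow run main.nf -c ~/genomes.config --genome 'myAssembly' ...
params {
  genomes {
    'myAssembly' {
      fasta = '/data/myAssembly/genome.fa'
      star  = '/data/myAssembly/STARIndex/'
      gtf   = '/data/myAssembly/annotation.gtf'
    }
  }
}
```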
Note that these paths can be updated on the command line using the following parameters:
- `--bowtie2Index` - Path to bowtie2 index
- `--starIndex` - Path to STAR index
- `--hisat2Index` - Path to HiSAT2 index
- `--salmonIndex` - Path to Salmon index
- `--gtf` - Path to GTF file
- `--gencode` - Specify that the GTF file is from Gencode
- `--bed12` - Path to gene bed12 file
- `--transcriptsFasta` - Path to transcriptome in fasta format
- `--saveAlignedIntermediates` - Save the BAM files from the alignment step (not done by default)
Raw data trimming can be performed with TrimGalore if the `--trimming` option is specified.
By default, `TrimGalore` should be able to automatically detect the Illumina 3' adapter. If more advanced trimming parameters are required, please use the `--trimmingOpts` option.
The trimmed fastq files are then used for downstream analysis but are not exported by default. Use the `--saveIntermediates` parameter to export them.
In the context of Mouse xenograft samples, it is strongly recommended to distinguish Mouse from Human reads in order to avoid data misalignment.
To do so, we implemented the `xengsort` tool (`--pdx`), which outputs distinct fastq files for each genome.
These new fastq files are then used for downstream alignment and analysis.
According to the `--genome` option specified, the analysis will be performed from either the Human or the Mouse fastq files.
Change default bowtie1 mapping options for rRNA cleaning. See the nextflow.config
file for details.
Change default Hisat2 mapping option. See the nextflow.config
file for details.
Change the default STAR mapping options. By default, STAR is run with the following options:
--outSAMmultNmax 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --outSAMprimaryFlag OneBestScore \
--outMultimapperOrder Random --outSAMattributes All
In other words, it means that only one alignment will be reported in the output, randomly chosen from the top scoring alignments (in case of multiple alignments). The allowed number of mismatches is indexed on the read length (0.04 * read length). And all common SAM attributes will be added.
Note that the default STAR options can vary from one organism to another. For instance, for Human data, the pipeline adds the ENCODE recommended options to the default ones:
--outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000
See the nextflow.config
file for details.
Run the STAR aligner in two-pass mode with the `--twopassMode Basic` option.
Change default HTSeq options. See the nextflow.config
file for details.
Change default featureCounts options. See the nextflow.config
file for details.
Change default options for Salmon quantification from either BAM file or FASTQ files. See the nextflow.config
file for details.
Specify which tools to use for reference-guided transcriptome assembly. Several tools can be specified (comma separated)
--denovo 'scallop,stringtie'
Change default scallop options. See the nextflow.config
file for details.
Change default stringtie options. See the nextflow.config
file for details.
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Note that multiple profiles can be loaded, for example: -profile docker
- the order of arguments is important!
Look at the Profiles documentation for details.
Each step in the pipeline has a default set of requirements for number of CPUs, memory and time (see the conf/base.conf
file).
For most of the steps in the pipeline, if the job exits with an error code of 143
(exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three times then the pipeline is stopped.
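This retry behaviour is the standard Nextflow `errorStrategy` pattern; a minimal sketch of what such a `conf/base.conf` entry typically looks like is shown below (the exact exit codes and base requests in the pipeline may differ):

```groovy
process {
  // resubmit on the resource-related exit code, up to 3 attempts in total
  errorStrategy = { task.exitStatus == 143 ? 'retry' : 'finish' }
  maxRetries    = 2
  memory        = { 8.GB * task.attempt }   // 1x, 2x, then 3x the base request
  time          = { 4.h  * task.attempt }
}
```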
The pipeline provides a few skip options that allow optional steps in the workflow to be skipped. The following options can be used:
- `--skipQC` - Skip all QC steps apart from MultiQC
- `--skipRrna` - Skip rRNA mapping
- `--skipFastqc` - Skip FastQC
- `--skipQualimap` - Skip genebody coverage step
- `--skipSaturation` - Skip saturation QC
- `--skipDupradar` - Skip dupRadar (and Picard MarkDups)
- `--skipExpan` - Skip exploratory analysis
- `--skipBigwig` - Do not generate bigwig files
- `--skipIdentito` - Skip identito monitoring
- `--skipMultiqc` - Skip MultiQC
- `--skipSoftVersions` - Skip software versions reporting
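For example, to run the pipeline while skipping the FastQC and dupRadar steps:

```shell
nextflow run main.nf --reads '*_R{1,2}.fastq.gz' --aligner 'star' --counts 'star' --skipFastqc --skipDupradar -profile 'singularity,cluster'
```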
Specify a two-column (tab-delimited) metadata file to display in the final MultiQC report.
The output directory where the results will be saved.
Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
This is used in the MultiQC report (if not default) and in the summary HTML / e-mail (always).
NB: Single hyphen (core Nextflow option)
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
You can also supply a run name to resume a specific run: -resume [run-name]
. Use the nextflow log
command to show previous run names.
NB: Single hyphen (core Nextflow option)
Specify the path to a specific config file (this is a core Nextflow option).
NB: Single hyphen (core Nextflow option)
Note - you can use this to override pipeline defaults.
Use to set a top-limit for the default memory requirement for each process.
Should be a string in the format integer-unit. eg. --max_memory '8.GB'
Use to set a top-limit for the default time requirement for each process.
Should be a string in the format integer-unit. eg. --max_time '2.h'
Use to set a top-limit for the default CPU requirement for each process.
Should be an integer. eg. --max_cpus 1
Specify a path to a custom MultiQC configuration file.