nfRNAseq
https://nf-co.re/rnaseq/3.14.0
nf-core/rnaseq is the most popular nf-core pipeline; sort the pipeline list by stars to see this:
nf-core list -s stars
Let’s use nf-core tools to build a command to run the nf-core/rnaseq pipeline.
mkdir -p ~/workshop/nfRNAseq
cd ~/workshop/nfRNAseq
nf-core launch -h
Launch a pipeline using a web GUI or command line prompts.
Uses the pipeline schema file to collect inputs for all available pipeline parameters.
When finished, saves a file with the selected parameters which can be passed to Nextflow using the
-params-file option.
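As a rough sketch of what the saved parameters file enables: the pipeline parameters (the double-dash options) live in a JSON file, and Nextflow reads them with -params-file. The file name nf-params.json and the scratch directory are assumptions for illustration; the parameter values are the ones used later in this section. The command is only printed, not run.

```shell
# Work in a scratch directory so nothing in the workshop folder is overwritten.
workdir="$(mktemp -d)"
params_file="${workdir}/nf-params.json"

# Pipeline parameters (double-dash options on the command line) go in the file.
cat > "${params_file}" <<'EOF'
{
    "input": "nfSampleSheet.csv",
    "outdir": "Out",
    "fasta": "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa",
    "gtf": "Homo_sapiens.GRCh38.111.gtf"
}
EOF

# Nextflow's own options (single dash) stay on the command line.
launch_cmd="nextflow run nf-core/rnaseq -r 3.14.0 -profile singularity -params-file ${params_file}"

# Print the command instead of running it.
echo "${launch_cmd}"
```

Keeping the parameters in a file makes a run easier to repeat and to share than a long command line.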
nf-core launch nf-core/rnaseq
- use the arrow keys to select the following parameters
- select release - version '3.14.0'
Selecting the release downloads the pipeline, effectively running ‘nextflow pull’.
The key Nextflow files are pulled to the ~/.nextflow directory, including the main workflow script and configs (main.nf, nextflow.config).
(Exit with Ctrl+C if needed.)
ls /home/workshop/.nextflow/assets/nf-core/rnaseq/
This is one of many ways to pull an nf-core pipeline.
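Another route is to call nextflow pull directly. The sketch below leaves that command as a comment (it needs Nextflow and network access) and only checks whether the two key files are present in the default assets location. It assumes the cache lives under NXF_HOME, falling back to ~/.nextflow.

```shell
# Default location of pulled pipelines (assumes NXF_ASSETS is not overridden).
nxf_home="${NXF_HOME:-${HOME}/.nextflow}"
assets_dir="${nxf_home}/assets/nf-core/rnaseq"

# To pull the pipeline directly (not run here):
#   nextflow pull nf-core/rnaseq -r 3.14.0

# Report whether each key file is present.
pull_report=""
for f in main.nf nextflow.config; do
    if [ -f "${assets_dir}/${f}" ]; then
        status="found"
    else
        status="missing"
    fi
    pull_report="${pull_report}${f}: ${status}
"
done
printf '%s' "${pull_report}"
```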
You can then choose between the GUI and the command line; the GUI option gives a URL to copy and paste into a browser.
We can continue adding parameters with nf-core launch:
-name test
-profile singularity
-resume
--input nfSampleSheet.csv
--outdir Out
--fasta Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
--gtf Homo_sapiens.GRCh38.111.gtf
This gives you an nf-core launch command with those parameters:
nf-core launch --id 1728482936_6064841c2138
Alternatively, you can use a nextflow run command.
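A sketch of what such a command could look like, assembled from the parameters listed above. This is an assumption about the shape of the command, not a copy of the workshop's prepared script, and it is printed rather than executed.

```shell
# One argument per line so each option is easy to read.
# Single-dash options belong to Nextflow; double-dash options belong to the pipeline.
run_cmd="nextflow run nf-core/rnaseq \\
    -r 3.14.0 \\
    -name test \\
    -profile singularity \\
    -resume \\
    --input nfSampleSheet.csv \\
    --outdir Out \\
    --fasta Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \\
    --gtf Homo_sapiens.GRCh38.111.gtf"

# Print only; the output could be pasted into a terminal to launch a real run.
printf '%s\n' "${run_cmd}"
```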
There is no need to run this command: in the interest of time (and given the limited disk space on your Nectar instance), I’ve pre-prepared the outputs for this run.
Navigate to the run directory to see the nextflow run command:
cd ~/workshop/RNAseq
ls
cat nf_rnaseq.sh
cat nfSampleSheet.csv
Dataset
The dataset used throughout this workshop is as follows:
- 16 Samples sequenced with an MGI400 sequencer at SAGC using the Tecan Universal RNA-seq library protocol.
- 2 different cancer cell lines (human)
- treatment vs control
- 4 replicates for each
- most of this information is reflected in the samplesheet and run command
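To make the link between the design and the samplesheet concrete, here is a minimal sketch that generates a samplesheet for a 2 x 2 x 4 design. The column header is the one nf-core/rnaseq 3.14.0 expects; the cell line labels, condition labels, paired-end layout and FASTQ paths are all hypothetical, so the real nfSampleSheet.csv will differ.

```shell
samplesheet="$(mktemp -d)/nfSampleSheet.example.csv"

# Header expected by nf-core/rnaseq 3.14.0.
echo "sample,fastq_1,fastq_2,strandedness" > "${samplesheet}"

# 2 cell lines x 2 conditions x 4 replicates = 16 samples.
# Labels and FASTQ paths below are hypothetical placeholders.
for cell_line in A B; do
    for condition in control treated; do
        for rep in 1 2 3 4; do
            sample="${cell_line}${condition}${rep}"
            echo "${sample},fastq/${sample}_R1.fastq.gz,fastq/${sample}_R2.fastq.gz,auto" >> "${samplesheet}"
        done
    done
done

# Count data rows (everything after the header).
n_samples=$(tail -n +2 "${samplesheet}" | wc -l | tr -d ' ')
echo "samples: ${n_samples}"
head -n 3 "${samplesheet}"
```

Encoding cell line, condition and replicate in the sample name keeps the design recoverable from the name alone in downstream reports.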
Run Summary
List the results directory. All nf-core pipelines lay out their outputs in a consistent structure.
tree /home/workshop/workshop/nfRNAseq/outs
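Where tree is not installed, a small helper can give a similar overview. The sketch below prints each top-level folder with its file count; summarise_outputs is a hypothetical helper name, and the demo folder names are placeholders rather than real pipeline output.

```shell
# Print each top-level results folder with the number of files it contains.
summarise_outputs() {
    outdir="$1"
    for d in "${outdir}"/*/; do
        [ -d "${d}" ] || continue
        n_files=$(find "${d}" -type f | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$(basename "${d}")" "${n_files}"
    done
}

# Demo on a throwaway directory with placeholder folder names.
demo_outs="$(mktemp -d)"
mkdir -p "${demo_outs}/stepA" "${demo_outs}/stepB/nested"
touch "${demo_outs}/stepA/a.txt" "${demo_outs}/stepB/b.txt" "${demo_outs}/stepB/nested/c.txt"
summarise_outputs "${demo_outs}"
```

On the workshop instance the same helper could be pointed at the results directory, for example summarise_outputs /home/workshop/workshop/nfRNAseq/outs.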
Genomics Files Background
A quick background on genomic file formats helps describe the inputs and outputs of an NGS experiment.
Knowing what these files are is important not only for choosing which files to give a pipeline; it is a key foundation of genomic bioinformatics.
Being able to use and manipulate each file type opens up many opportunities, and is often required for troubleshooting.
Many are plain text files, which means they can be manipulated with basic text-editing tools.
.fastq / .fastq.gz
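FASTQ files hold the raw sequencing reads, four lines per read: a name line starting with @, the bases, a + separator, and one quality character per base. Because the format is plain text, basic tools are enough to inspect it. The sketch below uses two made-up reads in a temporary file.

```shell
fastq="$(mktemp -d)/example.fastq"

# Two made-up reads; each record is 4 lines: @name, bases, '+', qualities.
cat > "${fastq}" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGACCA
+
IIIIIII
EOF

# Number of reads = number of lines / 4.
n_lines=$(wc -l < "${fastq}" | tr -d ' ')
n_reads=$((n_lines / 4))
echo "reads: ${n_reads}"

# Sequence lines are every 4th line, starting at line 2.
read_lengths=$(awk 'NR % 4 == 2 { print length($0) }' "${fastq}")
echo "${read_lengths}"

# For compressed files, decompress on the fly (sample.fastq.gz is a placeholder):
#   gzip -cd sample.fastq.gz | awk 'NR % 4 == 2' | head
```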
.fasta
The genome.fa file is a plain text representation of the genome sequence. This is the ‘reference’ to which the sequencing files (.fastq) are aligned.
head ./smallRNA/outs/mirtrace/[:]/mirtrace/qc_passed_reads.rnatype_unknown.collapsed/Acontrol1.fastp.fasta
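As a small illustration of working with FASTA as plain text, the sketch below lists sequence names and totals the length of each sequence. The two sequences are made up; on a real genome the same commands would report chromosome names and sizes.

```shell
fasta="$(mktemp -d)/example.fa"

# Two made-up sequences; a '>' header line starts each record.
cat > "${fasta}" <<'EOF'
>chrA example sequence
ACGTACGTAC
GGCC
>chrB
TTTT
EOF

# List the sequence names (first word after '>').
seq_names=$(grep '^>' "${fasta}" | cut -c 2- | cut -d ' ' -f 1)
echo "${seq_names}"

# Total the length of every sequence, which may span several lines.
seq_lengths=$(awk '
    /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
    { len += length($0) }
    END { if (name != "") print name "\t" len }
' "${fasta}")
echo "${seq_lengths}"
```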
.gtf
genes.gtf: a genomic interval file referencing the genome. This file describes the genes and transcripts. It is required for counting which features reads map to, and often carries a lot more annotation data about each gene/transcript.
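A GTF can be explored with the same plain-text tools. The sketch below counts records per feature type and pulls out gene IDs; the four records are made up, with the nine tab-separated columns GTF uses (seqname, source, feature, start, end, score, strand, frame, attributes).

```shell
gtf="$(mktemp -d)/example.gtf"

# Four made-up records, written with printf so the tabs are explicit.
{
    printf 'chr1\texample\tgene\t1000\t5000\t.\t+\t.\tgene_id "G1"; gene_name "DEMO1";\n'
    printf 'chr1\texample\ttranscript\t1000\t5000\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\texample\texon\t4800\t5000\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
} > "${gtf}"

# Count records per feature type (column 3), skipping '#' comment lines.
feature_counts=$(awk -F '\t' '!/^#/ { n[$3]++ } END { for (f in n) print f "\t" n[f] }' "${gtf}" | sort)
echo "${feature_counts}"

# Pull the gene_id out of the attributes column for gene records.
gene_ids=$(awk -F '\t' '$3 == "gene" { print $9 }' "${gtf}" | sed 's/.*gene_id "\([^"]*\)".*/\1/')
echo "${gene_ids}"
```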
.bed
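BED is another tab-separated interval format, but its coordinates are 0-based and half-open, whereas GTF coordinates are 1-based and inclusive. The sketch below converts one made-up GTF gene record to a six-column BED line to show the shift in the start coordinate; the choice of gene_id for the name column and 0 for the score is an assumption for illustration.

```shell
# One made-up GTF gene record (tab-separated, 1-based inclusive coordinates).
gtf_line=$(printf 'chr1\texample\tgene\t1000\t5000\t.\t+\t.\tgene_id "G1";')

# BED is 0-based and half-open: start shifts down by one, end is unchanged.
# Columns written: chrom, start, end, name (gene_id), score, strand.
bed_line=$(printf '%s\n' "${gtf_line}" | awk -F '\t' '
    BEGIN { OFS = "\t" }
    { split($9, attr, "\""); print $1, $4 - 1, $5, attr[2], 0, $7 }
')
echo "${bed_line}"

# In BED, interval length is simply end - start.
bed_length=$(printf '%s\n' "${bed_line}" | awk -F '\t' '{ print $3 - $2 }')
echo "length: ${bed_length}"
```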
.bam
Walkthrough of nf-core/rnaseq outputs
The MultiQC summary is a real strength of nf-core pipelines. A lot of the key analyses are captured in it.
Overview of the key processes run with nf-core/RNAseq.
- identifying steps for troubleshooting the success of a run
- far more than just mapping (STAR) and counting (RSEM)
- demonstrating the need for a workflow manager
Preprocessing
- cat - Merge re-sequenced FastQ files
- FastQC - Raw read QC
- UMI-tools extract - UMI barcode extraction
- TrimGalore - Adapter and quality trimming
- BBSplit - Removal of genome contaminants
- SortMeRNA - Removal of ribosomal RNA
Alignment and quantification
- STAR and Salmon - Fast splice-aware genome alignment and transcriptome quantification
- STAR via RSEM - Alignment and quantification of expression levels
- HISAT2 - Memory-efficient splice-aware alignment to a reference
Alignment post-processing
- SAMtools - Sort and index alignments
- UMI-tools dedup - UMI-based deduplication
- picard MarkDuplicates - Duplicate read marking
Other steps
- StringTie - Transcript assembly and quantification
- BEDTools and bedGraphToBigWig - Create bigWig coverage files
Quality control
- RSeQC - Various RNA-seq QC metrics
- Qualimap - Various RNA-seq QC metrics
- dupRadar - Assessment of technical / biological read duplication
- Preseq - Estimation of library complexity
- featureCounts - Read counting relative to gene biotype
- DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram
- MultiQC - Present QC for raw reads, alignment, read counting and sample similarity
Pseudoalignment and quantification
- Salmon - Wicked fast gene and isoform quantification relative to the transcriptome
- Kallisto - Near-optimal probabilistic RNA-seq quantification
Workflow reporting and genomes
- Reference genome files - Saving reference genome indices/files
- Pipeline information - Report metrics generated during the workflow execution