nfRNAseq

nf-core/RNAseq

https://nf-co.re/rnaseq/3.14.0

nf-core/RNAseq is the most popular nf-core pipeline. List the pipelines sorted by stars to confirm:

nf-core list -s stars

Let’s use nf-core tools to build a command to run the nf-core/RNAseq pipeline.

mkdir -p  ~/workshop/nfRNAseq
cd ~/workshop/nfRNAseq 

nf-core launch -h

Launch a pipeline using a web GUI or command line prompts.
Uses the pipeline schema file to collect inputs for all available pipeline parameters. When finished, saves a file with the selected parameters which can be passed to Nextflow using the -params-file option.

nf-core launch nf-core/rnaseq

- use the arrow keys to select the following parameters
	- select release - version '3.14.0'

Selecting the release downloads the pipeline, effectively running ‘nextflow pull’. The key Nextflow files are pulled to the ~/.nextflow directory, including the main script and config (main.nf, nextflow.config).
(exit with control+c if needed)

ls /home/workshop/.nextflow/assets/nf-core/rnaseq/

This is one of many ways to pull an nf-core pipeline.

You can then choose between the web GUI and the command line. The GUI option prints a URL to copy and paste into a browser.

We can continue adding parameters with nf-core launch:

	-name test
	-profile singularity
	-resume
	--input nfSampleSheet.csv
	--outdir Out
	--fasta Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
	--gtf Homo_sapiens.GRCh38.111.gtf

This gives you an nf-core launch command with those variables set:

nf-core launch --id 1728482936_6064841c2138
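
As the help text above says, nf-core launch finishes by saving a parameters file that Nextflow can read with -params-file. The sketch below shows roughly what that file could contain for the pipeline parameters chosen above. The file name nf-params.json is assumed to be the default that nf-core launch writes, and the exact contents of your saved file may differ. Options with a single dash (-name, -profile, -resume) belong to Nextflow itself, so they stay on the command line rather than in the file. The command is only printed, not run.

```shell
# Sketch only: write a params file with the pipeline parameters chosen above.
# The file name nf-params.json is an assumption (believed to be the launch default).
cat > nf-params.json <<'EOF'
{
  "input": "nfSampleSheet.csv",
  "outdir": "Out",
  "fasta": "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa",
  "gtf": "Homo_sapiens.GRCh38.111.gtf"
}
EOF

# Nextflow's own options stay on the command line; -r pins the pipeline release.
run_cmd="nextflow run nf-core/rnaseq -r 3.14.0 -profile singularity -params-file nf-params.json"

# Print the command instead of running it.
echo "$run_cmd"
```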

Alternatively, you can use a nextflow run command.
There is no need to run this command. In the interest of time (and given the lack of disk space on your Nectar instance), I’ve pre-prepared the outputs for this run.

Navigate to the run directory to see the nextflow run command:
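
For orientation, here is a sketch of what such a nextflow run command could look like, assembled only from the release and parameters selected in nf-core launch above. It is an assumption that the pre-prepared script uses the same values; check the real script in the run directory. The sketch builds the command as a string and prints it, so nothing is executed.

```shell
# Dry-run sketch: assemble the command from the values chosen earlier and print it.
# Assumption: the real run script may use different options.
cmd="nextflow run nf-core/rnaseq -r 3.14.0"                 # -r pins the release
cmd="$cmd -name test -profile singularity -resume"          # Nextflow options (single dash)
cmd="$cmd --input nfSampleSheet.csv --outdir Out"           # pipeline parameters (double dash)
cmd="$cmd --fasta Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
cmd="$cmd --gtf Homo_sapiens.GRCh38.111.gtf"

# Print only; paste the output into a terminal if you ever want to run it.
echo "$cmd"
```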

cd ~/workshop/RNAseq
ls
cat nf_rnaseq.sh

cat nfSampleSheet.csv

Dataset

The dataset used throughout this workshop is as follows:

- 16 samples sequenced with an MGI400 sequencer at SAGC using the Tecan Universal RNA-seq library protocol.
- 2 different cancer cell lines (human)
- treatment vs control
- 4 replicates for each

Most of this information is reflected in the samplesheet and run command.
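
To make the design concrete, the sketch below generates a samplesheet in the column layout that nf-core/rnaseq 3.14.0 expects (sample, fastq_1, fastq_2, strandedness) for 2 cell lines x 2 conditions x 4 replicates. The sample names, the fastq/ paths, the paired-end layout and the "auto" strandedness are all assumptions for illustration; the real nfSampleSheet.csv may differ.

```shell
# Hypothetical samplesheet for the design above; names and paths are invented.
out=example_samplesheet.csv
echo "sample,fastq_1,fastq_2,strandedness" > "$out"

for line in A B; do                  # 2 cell lines
  for cond in control treated; do    # treatment vs control
    for rep in 1 2 3 4; do           # 4 replicates each
      s="${line}${cond}${rep}"
      echo "${s},fastq/${s}_R1.fastq.gz,fastq/${s}_R2.fastq.gz,auto" >> "$out"
    done
  done
done

# Count data rows (total lines minus the header).
n=$(( $(wc -l < "$out") - 1 ))
echo "$n samples"
head -n 3 "$out"
```

Counting the rows is a quick sanity check that the sheet matches the experimental design before launching a run.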

Run Summary

List the results directory. All nf-core pipelines write their outputs in a consistent structure.

tree /home/workshop/workshop/nfRNAseq/outs

link to execution timeline

link to execution report

Genomics Files Background

Quick background on genomic file formats to help describe the inputs and outputs of an NGS experiment. Knowing what these files are isn’t only important for finding which files to use for a pipeline; it is also a key foundation of genomic bioinformatics. Being able to use and manipulate each file opens up many opportunities, and is often required for troubleshooting.
Many are plain text files, which means they can be manipulated with basic text-editing tools.

.fastq /.fastq.gz
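
A FASTQ file stores each read as four lines: a header starting with @, the sequence, a + separator, and one quality character per base. Because it is plain text, basic tools are enough to inspect it. The sketch below writes a tiny made-up FASTQ file (the reads are invented, not from the workshop data) and counts the reads.

```shell
# Write a tiny two-read FASTQ file (invented reads, for illustration only).
cat > example.fastq <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGCAATGCC
+
IIIIIHHHGG
EOF

# Each record is 4 lines, so reads = lines / 4.
lines=$(wc -l < example.fastq)
reads=$(( lines / 4 ))
echo "reads: $reads"

# Keep only the sequence lines (line 2 of every 4-line record).
seqs=$(awk 'NR % 4 == 2' example.fastq)
echo "$seqs"

# For a compressed file, decompress to a pipe first, e.g.:
#   gzip -cd file.fastq.gz | wc -l
```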

.fasta

The genome.fa file is a plain text representation of the genome sequence. This is the ‘reference’ to which the sequencing files (.fastq) are aligned.

head ./smallRNA/outs/mirtrace/[:]/mirtrace/qc_passed_reads.rnatype_unknown.collapsed/Acontrol1.fastp.fasta
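
In a FASTA file, every sequence starts with a header line beginning with >, followed by one or more lines of sequence. The sketch below uses a tiny invented file (the names chr1 and chr2 and the sequences are hypothetical) to show how to count sequences and bases with standard tools.

```shell
# Tiny invented FASTA file; headers and sequences are hypothetical.
cat > example.fa <<'EOF'
>chr1 hypothetical
ACGTACGTAC
GGCCTTAA
>chr2 hypothetical
TTTTAAAA
EOF

# Number of sequences = number of header lines.
n_seqs=$(grep -c '^>' example.fa)

# Total bases = all non-header characters, with newlines removed.
n_bases=$(( $(grep -v '^>' example.fa | tr -d '\n' | wc -c) ))

echo "sequences: $n_seqs"
echo "bases: $n_bases"

# List just the sequence names.
grep '^>' example.fa
```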

Walkthrough of nf-core/RNAseq outputs

The MultiQC summary is a real strength of nf-core pipelines. A lot of the key analyses are captured in it.

link to multiqc report

Overview of the key processes run with nf-core/RNAseq.
- identifying steps for troubleshooting the success of a run
- far more than just mapping (STAR) and counting (RSEM)
- demonstrating the need for a workflow manager
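
One practical way to find which step to troubleshoot is the Nextflow trace file, a tab-separated table with one row per task. nf-core pipelines typically write it under pipeline_info/ in the output directory, but treat that location as an assumption and check your own outputs. The sketch below uses a small mock trace (all rows are invented, and real task names are longer) and lists every task whose status is neither COMPLETED nor CACHED.

```shell
# Mock trace file in Nextflow's tab-separated layout; every row is invented.
printf 'task_id\thash\tnative_id\tname\tstatus\texit\n'                     > example_trace.txt
printf '1\tab/123456\t1001\tFASTQC (Acontrol1)\tCOMPLETED\t0\n'            >> example_trace.txt
printf '2\tcd/789abc\t1002\tSTAR_ALIGN (Acontrol1)\tFAILED\t137\n'         >> example_trace.txt
printf '3\tef/def012\t1003\tTRIMGALORE (Acontrol1)\tCACHED\t0\n'           >> example_trace.txt

# Look up columns by header name, then print name and exit code of unsuccessful tasks.
failed=$(awk -F '\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["status"]) != "COMPLETED" && $(col["status"]) != "CACHED" {
    print $(col["name"]) "\t" $(col["exit"])
  }
' example_trace.txt)

# Count the non-empty lines in the result.
n_failed=$(printf '%s\n' "$failed" | grep -c .)
echo "unsuccessful tasks: $n_failed"
echo "$failed"
```

Looking columns up by header name, rather than by position, keeps the filter working if the trace contains extra fields.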

Preprocessing

cat - Merge re-sequenced FastQ files
FastQC - Raw read QC

link to fastqc.html

UMI-tools extract - UMI barcode extraction
TrimGalore - Adapter and quality trimming
BBSplit - Removal of genome contaminants
SortMeRNA - Removal of ribosomal RNA

Alignment and quantification

STAR and Salmon - Fast splice-aware genome alignment and transcriptome quantification
STAR via RSEM - Alignment and quantification of expression levels
HISAT2 - Memory-efficient splice-aware alignment to a reference

Alignment post-processing

SAMtools - Sort and index alignments
UMI-tools dedup - UMI-based deduplication
picard MarkDuplicates - Duplicate read marking

Other steps

StringTie - Transcript assembly and quantification
BEDTools and bedGraphToBigWig - Create bigWig coverage files

Quality control

RSeQC - Various RNA-seq QC metrics
Qualimap - Various RNA-seq QC metrics
dupRadar - Assessment of technical / biological read duplication
Preseq - Estimation of library complexity
featureCounts - Read counting relative to gene biotype
DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram
MultiQC - Present QC for raw reads, alignment, read counting and sample similarity

Pseudoalignment and quantification

Salmon - Wicked fast gene and isoform quantification relative to the transcriptome
Kallisto - Near-optimal probabilistic RNA-seq quantification

Workflow reporting and genomes

Reference genome files - Saving reference genome indices/files
Pipeline information - Report metrics generated during the workflow execution