Stereo-seq background

Stereo-seq (SpaTial Enhanced Resolution Omics-sequencing) builds upon the chip technology already used in MGI DNA nanoball sequencing (DNBSEQ). For background on DNBSEQ, see here.

The chips are patterned, with fixed spots of 220 nm in diameter and spaced at a distance of 500 or 715 nm. The standard chip size is 1cm x 1cm, but the largest size available does go up to 13cm x 13cm (fresh frozen samples only).

The first step is to deposit DNB templates with random barcodes on the chip, which is sequenced to obtain the CID (coordinate ID) of each DNB. This is the “mask file”, which is basically just a big matrix which barcodes correspond to which locations, and is required to run SAW.

Then, oligos with MIDs (molecular IDs) and polyT sequences are ligated to each spot.

Next, the tissue is loaded onto the chip, followed by fixation and permeabilization to capture polyA RNAs.

Finally, cDNA is generated via reverse transcription and amplification and used as a template for library prep.

More information on Stereo-seq can be found in the paper

Available versions

  • Current fresh-frozen (FF) kit version is v1.3
  • OMNI is the new FFPE offering, which also uses total RNA

Processing data with SAW

Once your experiment is complete, you will have the following:

  • Mask file (.h5)
  • Reads (.fastq)

You will need to process the data using Stereo-seq Analysis Workflow (SAW). This command-line tool provides similar functionality for Stereo-seq to what Spaceranger provides for 10x spatial platforms.

The latest version of SAW is 8.1, with reference documentation found here:

https://stereotoolss-organization.gitbook.io/saw-user-manual-v8.1

Actually building a reference and running SAW will take quite a long time, so we won’t actually run it for this workshop. However, we’ll walk through the steps and then take a look at some pre-processed files.

Obtaining SAW

You can download the latest version of SAW from the STOmics download page (does require making an account).

Alternatively, you can find earlier versions of SAW on Github. Please note that the interface has changed significantly between versions 7.x and 8.x. Version 8.x introduces both internal workflow management with Snakemake, and compatibility with OMNI.

The documentation for using SAW v8.1 with fresh-frozen datasets can be found here: https://stereotoolss-organization.gitbook.io/saw-user-manual-v8.1/tutorials/run-main-pipeline/stereo-seq-ff

In addition, saw --help and subcommands (e.g. saw count --help) provide more information.

SAW makeRef

SAW v8.1 is quite easy to use, but not yet very configurable. To build the reference for this example, here’s the command I used:

#!/bin/bash

saw makeRef \
    --mode=STAR \
    --fasta=/path/to/GRCm39.genome.fa \
    --gtf=/path/to/annotation.gtf \
    --genome=/path/to/reference/output

Notes:

  • The GENCODE GRCm39 reference was used
  • The --genome parameter actually specifies the name of the output directory
  • There is no way to pass parameters to STAR in this version.

To have more control over the parameters used, you can use version 7.x of SAW from Github. This version is implemented as a bash script which calls into a container image, and is easier to customise (and easier to break).

SAW count

Next, we want to run the main pipeline which is called via saw count. Here’s how it was run for this demo:

#!/bin/bash

saw count \
    --id=workshop_demo \
    --sn=C04042E3 \
    --chip-mask="/path/to/C04042E3.barcodeToPos.h5" \
    --omics=transcriptomics \
    --kit-version="Stereo-seq T FF V1.3" \
    --sequencing-type="PE75_50+100" \
    --organism=mouse \
    --tissue=brain \
    --fastqs=/path/to/fastq/directory/ \
    --image-tar="/path/to/C04042E3_SC_20240925_142342_4.1.0.tar.gz" \
    --reference=/path/to/reference/output

Notes:

  • --reference here should point to the directory specified in the --genome parameter of saw makeRef
  • --image-tar accepts image files processed by StereoMap or ImageStudio, and is optional
  • --fastqs points to a directory containing fastq file pairs. More than one pair can be present (e.g. if multiple sequencing runs or lanes were used)
  • --organism and --tissue can be any string

SAW outputs

Once this is complete, an outs directory will be created similar to this:

$ tree outs/
.
├── analysis
│   ├── C04042E3.bin20_1.0.h5ad
│   ├── C04042E3.bin20_1.0.marker_features.csv
│   ├── C04042E3.bin50_1.0.h5ad
│   ├── C04042E3.bin50_1.0.marker_features.csv
│   ├── C04042E3.cellbin_1.0.adjusted.h5ad
│   ├── C04042E3.cellbin_1.0.adjusted.marker_features.csv
│   ├── C04042E3.cellbin_1.0.h5ad
│   └── C04042E3.cellbin_1.0.marker_features.csv
├── bam
│   └── annotated_bam
│       ├── C04042E3.Aligned.sortedByCoord.out.merge.q10.dedup.target.bam
│       └── C04042E3.Aligned.sortedByCoord.out.merge.q10.dedup.target.bam.csi
├── C04042E3.report.tar.gz
├── feature_expression
│   ├── C04042E3.adjusted.cellbin.gef
│   ├── C04042E3.cellbin.gef
│   ├── C04042E3.merge.barcodeReadsCount.txt
│   ├── C04042E3_raw_barcode_gene_exp.txt
│   ├── C04042E3.raw.gef
│   └── C04042E3.tissue.gef
├── image
│   ├── C04042E3_HE_mask_edm_dis_10.tif
│   ├── C04042E3_HE_mask.tif
│   ├── C04042E3_HE_regist.tif
│   └── C04042E3_HE_tissue_cut.tif
├── visualization.tar.gz
└── visualization.tar.gz.md5

A detailed guide to these outputs can be found in the SAW manual.

HTML report

The first place to look is generally the HTML report. We can extract it from the archive and open it in a browser:

# First extract the archive if needed

if (!file.exists('data/report/')) {
  untar('data/C04042E3.report.tar.gz', exdir='data')
}

# Now access it from data/report/report.html

A more detailed guide to the SAW report can be found in the SAW manual.

GEF files

These are the main count matrices, in HDF5 format.

h5ad files

These contain clustering results, for the most part. The default method is Leiden clustering.

visualization.tar.gz

This packages the main outputs into a single file for visualization. It is also a handy file for distributing results easily, as it packages most of the main outputs (excluding the BAM files).

# Extracting the visualization archive

if (!file.exists('data/visualization/')) {
  untar('data/visualization.tar.gz', exdir='data')
}

Taking a look inside, we see many of the other output files have been repackaged here:

.
├── C04042E3.adjusted.cellbin.gef
├── C04042E3.bin20_1.0.h5ad
├── C04042E3.bin50_1.0.h5ad
├── C04042E3.cellbin_1.0.adjusted.h5ad
├── C04042E3.rpi
├── C04042E3_SC_20240925_142342_4.1.0.tar.gz
├── C04042E3.stereo
├── C04042E3.tissue.gef
└── HE_matrix_template.txt

A JSON manifest of the experiment can be found in .stereo, which allows loading all of the packaged data in a single step from StereoMap.

Although these have varied extensions, they’re mostly just hdf5 format archives.

The rpi file can be converted to tif files using SAW (although it’s easier to just take the output directly from saw count)

StereoMap

StereoMap is an interactive tool to examine your Stereo-seq data (similar to Loupe Browser from 10x).

StereoMap can be downloaded from STOmics directly: https://en.stomics.tech/products/stomics-software/stomics-offline-software/list.html

StereoMap v4.1 is compatible with SAW v8.1. Previous versions also included image QC in a seperate program, ImageStudio, which is now integrated with StereoMap.

StereoPy

StereoPy is a Python library designed to work with Stereo-seq data.

https://stereopy.readthedocs.io/en/latest/content/01_Usage_principles.html

We will use it to investigate our data over on JupyterLab.

Loading in SpatialExperiment

SpatialExperiment library is an S4 class that extends SingleCellExperiment. While there’s no way to import directly, we can find all the data we need from the h5ad files provided from SAW.

First let’s load the matrix.

library(rhdf5)
library(Matrix)

# Load the matrix data (CSR format)
matrix_data <- h5read("data/visualization/C04042E3.bin20_1.0.h5ad", "X")

# Extract dimensions (this is stored as an attribute)
shape <- h5readAttributes("data/visualization/C04042E3.bin20_1.0.h5ad", "X")$shape

# Extract the components
data <- matrix_data$data        # Non-zero values
indices <- matrix_data$indices  # Column indices of non-zero values
indptr <- matrix_data$indptr    # Row pointers for non-zero values

# Reconstruct the CSR matrix (have to add 1 to indices for R)
csr_matrix <- sparseMatrix(
  i = as.integer(rep(seq_len(length(indptr)-1), diff(indptr))),  # expand row indices
  j = as.integer(indices + 1),  # R uses 1-based indexing, so add 1 to column indices
  x = as.numeric(data),
  dims = shape,
)

We can also add in row and col data.

obs <- h5read("data/visualization/C04042E3.bin20_1.0.h5ad", "obs")
var <- h5read("data/visualization/C04042E3.bin20_1.0.h5ad", "var")

library(S4Vectors)
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
    table, tapply, union, unique, unsplit, which.max, which.min

Attaching package: 'S4Vectors'
The following objects are masked from 'package:Matrix':

    expand, unname
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname
library(SpatialExperiment)
Loading required package: SingleCellExperiment
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':

    rowMedians
The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians
# create SpatialExperiment object
spe <- SpatialExperiment(
  assays = list(counts = t(csr_matrix)), 
  colData = DataFrame(cluster=as.factor(obs$leiden$codes), in_tissue=TRUE), # get cluster in there 
  rowData = DataFrame(gene_id=var$`_index`),
  spatialCoords = data.matrix(DataFrame(x=obs$x, y=obs$y))
)

We can now use this as we would any other spot-based experiment.

library(ggspavis)
Loading required package: ggplot2
Registered S3 method overwritten by 'ggside':
  method from   
  +.gg   ggplot2
# Get norm counts 
colData(spe)$sums <- colSums(counts(spe))
plotSpots(spe, annotate="sums")