Single-cell GPU Acceleration with Seurat

ggmlR can accelerate the heavy steps of a single-cell RNA-seq pipeline on the GPU, working directly on Seurat objects. There is no conversion on your side and no hard dependency: Seurat / SeuratObject live in Suggests, so ggmlR installs fine without them and the adapter only activates when they are present. Vulkan is used automatically when a GPU is available, with a transparent CPU fallback.

The adapter leans on two more Suggests packages for the neighbour-graph path: Matrix (the sparse graphs) and FNN (kd-tree kNN search). Install everything once alongside ggmlR, then just load Seurat — the S3 methods activate on their own:

install.packages(c("Seurat", "Matrix", "FNN"))
library(ggmlR)
library(Seurat)

The one call: RunGGML()

RunGGML() mirrors Seurat’s own RunPCA() / NormalizeData() style — object in, object out, pipe-friendly. The supported operations map onto the expensive matrix steps of a standard workflow:

op Replaces What runs on the GPU
"normalize" NormalizeData() (LogNormalize) per-cell library-size scaling + log1p, elementwise
"scale" ScaleData() per-gene z-score (x − mean) / sd + clamp, over the full dense matrix
"embed" RunPCA() gene-by-gene covariance multiply (the eigendecomposition stays on the CPU — ggml has no eigensolver)
"umap" RunUMAP() two custom compute shaders — a tiled f32 pairwise-distance kernel and an SGD layout kernel
"neighbors" FindNeighbors() kNN distances on the GPU feeding a shared-nearest-neighbour (SNN/Jaccard) graph

"normalize" and "scale" are transforms: they write the result back into the assay (the data and scale.data layers), so the rest of the Seurat pipeline picks them up unchanged. "embed" and "umap" add a dimensionality reduction. "neighbors" writes the <assay>_nn / <assay>_snn graphs into @graphs, exactly where FindClusters() looks.

A worked example

We build a small synthetic object here so the vignette is self-contained; in practice you would load your own data with Read10X().

set.seed(1)
ng <- 400L; nc <- 200L
counts <- matrix(rpois(ng * nc, lambda = 5), nrow = ng, ncol = nc)
rownames(counts) <- paste0("gene", seq_len(ng))
colnames(counts) <- paste0("cell", seq_len(nc))
counts <- methods::as(counts, "dgCMatrix")

pbmc <- CreateSeuratObject(counts = counts)
pbmc

Normalize and scale on the GPU

pbmc <- RunGGML(pbmc, op = "normalize")   # -> assay "data" layer
pbmc <- RunGGML(pbmc, op = "scale")       # -> assay "scale.data" layer

These behave like NormalizeData() and ScaleData() — the transformed matrices now live in the standard assay layers:

dim(LayerData(pbmc, layer = "data"))
dim(LayerData(pbmc, layer = "scale.data"))

The GPU "normalize" result matches Seurat’s NormalizeData() to floating-point tolerance:

gpu_data <- as.matrix(LayerData(pbmc, layer = "data"))
ref_data <- as.matrix(LayerData(
  NormalizeData(pbmc, verbose = FALSE), layer = "data"))
max(abs(gpu_data - ref_data))

PCA (embed)

pbmc <- RunGGML(pbmc, op = "embed", n_components = 20, reduction_name = "ggml")
Embeddings(pbmc, "ggml")[1:3, 1:4]

The reduction is an ordinary DimReduc, so every downstream step — Seurat’s own or ggmlR’s — uses it directly.

UMAP (umap)

op = "umap" lays the cells out in 2-D on the GPU. Both heavy phases are custom Vulkan compute shaders: a tiled f32 pairwise-distance kernel that builds the kNN graph (it sidesteps mul_mat, whose f16 accumulation would reorder nearest neighbours and corrupt the graph), and an SGD layout kernel that runs one dispatch per epoch with lock-free Hogwild updates. With a kd-tree kNN search and a sparse fuzzy graph in between, the whole UMAP runs an order of magnitude faster than a naive reference while matching its layout to float32 precision.

pbmc <- RunGGML(pbmc, op = "umap", reduction = "ggml", dims = 1:20,
                reduction_name = "umap")
Embeddings(pbmc, "umap")[1:3, ]

Neighbour graphs (neighbors)

op = "neighbors" is the GPU equivalent of FindNeighbors(): it builds the binary kNN graph and the shared-nearest-neighbour (SNN) graph whose weights are the Jaccard overlap of each pair’s neighbourhoods. The graphs land in @graphs under Seurat’s naming convention, so FindClusters() consumes them unchanged.

pbmc <- RunGGML(pbmc, op = "neighbors", reduction = "ggml", dims = 1:20)
Graphs(pbmc)                                   # <assay>_nn and <assay>_snn

pbmc <- FindClusters(pbmc, graph.name = paste0(DefaultAssay(pbmc), "_snn"),
                     verbose = FALSE)
table(pbmc$seurat_clusters)
DimPlot(pbmc, reduction = "umap", group.by = "seurat_clusters", label = TRUE)

The SNN weights are numerically identical to FindNeighbors() when both use the same exact kNN (Seurat’s default Annoy is approximate, so an exact match needs nn.method = "rann").

Provenance

Each operation records which backend it used (and timings) in the object’s Misc slot, keyed by the layer / reduction it produced:

Misc(pbmc, "data_ggml")$backend          # normalize
Misc(pbmc, "scale.data_ggml")$backend    # scale
Misc(pbmc, "ggml_ggml")$backend          # embed (and neighbors)
Misc(pbmc, "umap_ggml")$backend_sgd      # umap: layout phase backend

For "umap" the per-phase backends are reported separately (backend_dist for the kNN distances, backend_sgd for the layout), since each falls back to the CPU independently; the summary backend is "vulkan" only when both ran on the GPU.

The layers underneath

RunGGML() is a thin wrapper over three public generics you can also call on their own — handy for a bare matrix with no Seurat object, or to inspect capabilities before dispatch:

# What can the adapter do?
names(ggml_ops_registry())
ggml_ops_registry("embed")

# Compose the layers manually on a plain matrix:
mat  <- ggml_extract(gpu_data)                       # genes x cells, dense
task <- ggml_task("embed", mat, params = list(n_components = 10))
res  <- ggml_run(task)                               # ggml_result
dim(res$embedding)                                   # cells x components
  • ggml_extract() pulls a feature × cell matrix out of a Seurat / dgCMatrix / matrix, handling Seurat v4 (GetAssayData) vs v5 (LayerData) and sparse → dense.
  • ggml_run() validates against ggml_ops_registry() and routes to the Vulkan GPU or the CPU (device = "auto", with fallback).
  • ggml_inject() writes the result back — a reduction for "embed" / "umap", an assay layer for the "normalize" / "scale" transforms, or the <assay>_nn / <assay>_snn graphs for "neighbors".

Bioconductor: SingleCellExperiment

The same RunGGML() works on a SingleCellExperiment (SCE) — the adapter has methods for both object models. On an SCE the default assay read is logcounts, results land in reducedDim(), transforms overwrite the named assay, and the neighbour graphs and provenance go into metadata().

library(SingleCellExperiment)

# a self-contained SCE (genes x cells), so this section does not depend on the
# Seurat object built earlier
set.seed(1)
ng <- 200L; nc <- 120L
sce_counts <- matrix(stats::rpois(ng * nc, lambda = 5), ng, nc)
rownames(sce_counts) <- paste0("gene", seq_len(ng))
colnames(sce_counts) <- paste0("cell", seq_len(nc))
sce <- SingleCellExperiment(assays = list(
  counts    = sce_counts,
  logcounts = log1p(sce_counts)))

sce <- RunGGML(sce, op = "embed", n_components = 20)         # -> reducedDim "ggml"
sce <- RunGGML(sce, op = "neighbors", reduction = "ggml", dims = 1:20)

reducedDimNames(sce)                                         # "ggml"
names(S4Vectors::metadata(sce))                              # ggml_nn / ggml_snn / ggml_ggml

Install the Bioconductor pieces alongside ggmlR if you work with SCE objects:

BiocManager::install(c("SingleCellExperiment", "SummarizedExperiment", "S4Vectors"))

What is and isn’t accelerated

Five steps of a standard workflow move to the GPU; only the final community detection stays on the CPU:

Standard step ggmlR Runs on
NormalizeData() RunGGML(op = "normalize") GPU
ScaleData() RunGGML(op = "scale") GPU
RunPCA() RunGGML(op = "embed") GPU matrix multiply (eigensolve on CPU)
RunUMAP() RunGGML(op = "umap") GPU (distance + SGD shaders)
FindNeighbors() RunGGML(op = "neighbors") GPU distances → CPU sparse SNN
FindClusters() — (use Seurat’s) CPU — iterative graph Louvain/Leiden

So a typical run is normalize → scale → embed → umap for visualisation and embed → neighbors → FindClusters for clustering. FindClusters is left to Seurat because community detection is iterative and graph-structured — a poor fit for the GPU and already well optimised on the CPU. The PCA eigendecomposition likewise stays on the CPU, as ggml has no eigensolver. The same operations are available on SingleCellExperiment objects (see above), so the adapter covers both the Seurat and Bioconductor ecosystems.