RNAseq analysis of the mouse mammary tumors derived from PROCR+ mammary stem cells, bulk tumors

I. Signatures extraction

Author

Michal Kloc, Marion Salvador

Published

March 1, 2026

Introduction

The Protein C receptor (PROCR) was identified as a marker of a population of adult multipotent mammary stem cells (MaSCs) in mouse mammary epithelium (Wang et al. 2015). In this vignette, we present parts of the analysis of the bulk RNAseq data of the tumors derived from mouse PROCR+ MaSCs. For the complete code, please refer to our Zenodo repository [Zenodo will be available upon publication].

The analysis presented here contains:

Scoring the similarity of the mouse mammary tumors to the reference data set of annotated human breast tumors (Prat and Perou 2009).
Unbiased extraction of molecular signatures (metagenes) from the mouse mammary tumors and their functional annotation.

Description of the experiment, alignment to the genome.

The mice (n = 20) developed mammary tumors that were harvested 4 to 5 weeks after 4-OH-Tam induction. 2 control samples of normal mouse mammary epithelial cells were also included.

Reads were aligned to the mouse genome (UCSC version mm10) with STAR, version 2.7.9a, using the multi-map settings outFilterMultimapNmax = 10 (maximal number of multiple alignments allowed for a read) and outSAMmultNmax = 1 (maximal number of SAM lines for each mapped read). The output was sorted and indexed with SAMtools, version 1.11. Duplicate reads were marked using MarkDuplicates from Picard, version 2.23.4. Strand-specific coverage tracks per sample were generated by tiling the genome in 20bp windows and counting overlapping reads in each window using the bamCount function from the Bioconductor package bamsignals (Bioconductor version 3.13). These window counts were then exported in bigWig format using the export function rtracklayer of the Bioconductor package. The rsubread::featureCounts functionwas used to count the number of reads (5’ends) overlapping with the exons of each gene assuming an exon union model.

Reading in raw data, filtering and normalization

First, we load necessary packages and read in the raw quantified data in a form of DGEList

suppressPackageStartupMessages({
    if (!require("BiocManager"))
        install.packages("BiocManager")
    BiocManager::install(version = "3.18", force = TRUE)
    if (!require("SummarizedExperiment"))
        BiocManager::install("SummarizedExperiment")
    library(SummarizedExperiment)
    if (!require("edgeR"))
        BiocManager::install("edgeR")
    library(edgeR)
    if (!require("stringr"))
        BiocManager::install("stringr")
    library(stringr)
    if (!require("ggplot2"))
        BiocManager::install("ggplot2")
    library(ggplot2)
    if (!require("ggfortify"))
        BiocManager::install("ggfortify")
    library(ggfortify)
    if (!require("ggrepel"))
        BiocManager::install("ggrepel")
    library(ggrepel)
    if (!require("GEOquery"))
        BiocManager::install("GEOquery")
    library(GEOquery)
    if (!require("biomaRt"))
        BiocManager::install("biomaRt")
    library(biomaRt)
    if (!require("org.Hs.eg.db"))
        BiocManager::install("org.Hs.eg.db")
    library(org.Hs.eg.db)
    if (!require("pheatmap"))
        BiocManager::install("pheatmap")
    library(pheatmap)
    if (!require("ComplexHeatmap"))
        BiocManager::install("ComplexHeatmap")
    library(ComplexHeatmap)
    if (!require("circlize"))
        BiocManager::install("circlize")
    library(circlize)
    if (!require("reshape2"))
        BiocManager::install("reshape2")
    library(reshape2)
    if (!require("ggcorrplot"))
        BiocManager::install("ggcorrplot")
    library(ggcorrplot)
})

Bioconductor version 3.18 (BiocManager 1.30.25), R 4.3.2 (2023-10-31)

Old packages: 'corrplot', 'emmeans', 'gert'

setwd("../RNAseq_PROCRpos-derived_tumor_pieces")
# when downloaded from Zenodo, data stored in input_data folder
dge <- readRDS("input_data/dge01_procr-derived_chunks.rds")

This next step contains renaming of the samples’ description. We explicitly named the samples nr. 21 and 22 as controls.

dge$samples$ExternalSampleName <- gsub("__", "_", dge$samples$ExternalSampleName)
dge$samples$code_id <- colnames(dge)
colnames(dge) <- dge$samples$ExternalSampleName
colnames(dge)[1:20] <- 1:20
colnames(dge)[c(21, 22)] <- c("Ctr1", "Ctr2")

Next, we filtered out low-expressed genes and calculated normalization factors.

keep.exprs <- filterByExpr(dge, group = dge$samples$SampleGroup)
dge <- dge[keep.exprs, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)
dim(dge)

[1] 20211    22

source("../shared_Rscripts/pca_functions_general.R")
res <- computePCA(dge)
res[[1]]["SampleName"] <- colnames(dge)
plotPCs(res)

Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

Reference human data set

In order to assess human relevance of the mouse mammary tumors, we next correlated the gene expression to the reference human PAM50-annotated breast cancer dataset (Prat et al. 2010). The reference consisted of microarray log-normalized data.

# load series and platform data from GEO
gset <- getGEO("GSE18229", GSEMatrix = TRUE, AnnotGPL = TRUE)[[1]]

Found 7 file(s)

GSE18229-GPL1390_series_matrix.txt.gz

GSE18229-GPL1708_series_matrix.txt.gz

GSE18229-GPL5325_series_matrix.txt.gz

Annotation GPL not available, so will use submitter GPL instead

GSE18229-GPL6607_series_matrix.txt.gz

Annotation GPL not available, so will use submitter GPL instead

GSE18229-GPL7504_series_matrix.txt.gz

Annotation GPL not available, so will use submitter GPL instead

GSE18229-GPL885_series_matrix.txt.gz

GSE18229-GPL887_series_matrix.txt.gz

# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
# select PAM50 classification
classification <- pData(gset)[c(66)]
names(classification) <- "PAM50_classification"
table(classification)

PAM50_classification
  Basal Claudin    Her2    LumA    LumB  Normal 
     32      19      22      70      37      17

To perform the correlation analyses, we needed to reduce the two data sets on 1-to-1 orthologous genes. We used EnseblIDs to do so. The microarray data were annotated by Entrezid, so first we built an EnseblID annotation. We also had to deal with the situation, where multiple probes on the microarray chip corresponded to the same EnseblID. In such situations, we kept the probes with the highest variabilty (more biologically informative of the sample differences). ::: {.cell}

gset <- gset[fData(gset)$Gene.ID != "", ]  #remove those without EntrezId
dim(gset)

Features  Samples 
   19429      199

# annotation
fData(gset)$ensembl <- mapIds(org.Hs.eg.db, keys = fData(gset)$Gene.ID, column = c("ENSEMBL"),
    keytype = "ENTREZID", multiVals = "first")  #no nan

'select()' returned 1:many mapping between keys and columns

a <- table(fData(gset)$ensembl)  #some ensemblids more than once?
length(a[a > 1])  # 2835 genenes has non-unique ensembl, keep the representative with highest dispersion

[1] 2835

testStat <- apply(exprs(gset), 1, IQR, na.rm = TRUE)  # (inter-quantile distance) variance for each row
tSsp <- split.default(testStat, fData(gset)$ensembl)
kept <- sapply(tSsp, function(x) names(which.max(x)))
gset <- gset[kept, ]  #
featureNames(gset) <- fData(gset)$ensembl
dim(gset)

Features  Samples 
   14391      199

:::

Correlation between the mouse mammary tumors derived from PROCR+ MaSCs and the PAM50 human molecular breast cancer subtypes

Next, we used BioMart package to find and annotatate unique mouse-human orthologs.

ensembl <- useEnsembl(biomart = "genes", dataset = "mmusculus_gene_ensembl", version = 105)
orthology <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", "hsapiens_homolog_associated_gene_name",
    "hsapiens_homolog_ensembl_gene", "hsapiens_homolog_orthology_type", "hsapiens_homolog_subtype",
    "hsapiens_homolog_orthology_confidence"), mart = ensembl)
# Some filtering to keep only 1-to-1 orthologs
orthology1to1 <- orthology[orthology$hsapiens_homolog_ensembl_gene != "", ]
orthology1to1 <- orthology1to1[orthology1to1$hsapiens_homolog_orthology_type %in%
    c("ortholog_one2one"), ]
## Filter on taxonomic level. We could be more stringent here but would lose
## some orthologs identified as 'dubious duplications'
orthology1to1 <- orthology1to1[orthology1to1$hsapiens_homolog_subtype %in% c("Boreoeutheria",
    "Eutheria", "Euarchontoglires", "Theria", "Amniota"), ]
row.names(orthology1to1) <- orthology1to1$hsapiens_homolog_ensembl_gene
fData(gset)$hm1to1orthologs <- orthology1to1[fData(gset)$ensembl, "ensembl_gene_id"]

The gene expression matrix was storein in a separate GEOmatrix object and the respective mouse EnsemblIDs were assigned to the genes.

GEOmatrix <- exprs(gset)
colnames(GEOmatrix) <- classification[, 1]
rownames(GEOmatrix) <- fData(gset)$hm1to1orthologs
GEOmatrix <- GEOmatrix[!is.na(rownames(GEOmatrix)), !is.na(colnames(GEOmatrix))]

We next performed Pearson correlation analysis on the co-expressed genes in both data sets. To reduce the noise, we selected the top 1000 highly variable genes for the analysis.

# get our data
MYDATAmatrix <- edgeR::cpm(dge, log = TRUE)
GeneVar <- apply(MYDATAmatrix, 1, IQR, na.rm = TRUE)
rank.dispersion <- rank(-GeneVar)
MYDATAmatrix <- MYDATAmatrix[order(rank.dispersion), ]  #sort genes according to its variance
# which genes are shared across the datasets?
shared.genes <- rownames(MYDATAmatrix)[rownames(MYDATAmatrix) %in% rownames(GEOmatrix)]  #names of the shared genes
dim(MYDATAmatrix[shared.genes, ])

[1] 10812    22

MYDATAmatrix <- MYDATAmatrix[shared.genes, ]
GEOmatrix <- GEOmatrix[shared.genes, ]
table(rownames(MYDATAmatrix) == rownames(GEOmatrix))


 TRUE 
10812

# Correlation on 1000 most varying genes
MYDATAmatrix1000 <- MYDATAmatrix[1:1000, ]
GEOmatrix1000 <- GEOmatrix[1:1000, ]
table(rownames(MYDATAmatrix1000) == rownames(GEOmatrix1000))


TRUE 
1000

We also mean-centered the data. This was important to correct for a technological batch effect between the data sets. Thus, the correlation was performed on the deviation of the expression levels in each sample from the dataset mean.

MYDATAmatrix1000 <- t(scale(t(MYDATAmatrix1000), center = TRUE, scale = FALSE))
GEOmatrix1000 <- t(scale(t(GEOmatrix1000), center = TRUE, scale = FALSE))
CorMat1000 <- cor(MYDATAmatrix1000, GEOmatrix1000, use = "na.or.complete", method = "pearson")
dim(CorMat1000)

[1]  22 197

We also performed sample by sample correlation, thus the cross-correlation matrix CorMat1000 had 22 rows (corresponding to our samples) and 197 columns (corresponding to the reference dataset). We aggregated the results over the reference samples from the same PAM50 class (Parker et al. 2009) simply by taking the mean value.

CorMat1000.aggr <- t(aggregate(t(CorMat1000), list(colnames(as.data.frame(CorMat1000))),
    mean))
colnames(CorMat1000.aggr) <- CorMat1000.aggr[1, ]
CorMat1000.aggr <- CorMat1000.aggr[2:NROW(CorMat1000.aggr), ]
CorMat1000.aggr <- apply(CorMat1000.aggr, 2, as.numeric)
rownames(CorMat1000.aggr) <- rownames(CorMat1000)
pheatmap(CorMat1000.aggr, cluster_cols = FALSE, main = "Pearson correlation", display_numbers = T)

Using this annotation, we reproduce the Figure 2D from the manuscript.

ClaudinClass <- c("claudin-low", "claudin-low", "differentiated", "differentiated",
    "claudin-low", "mixed", "claudin-low", "mixed", "differentiated", "claudin-low",
    "mixed", "differentiated", "claudin-low", "claudin-low", "claudin-low", "claudin-low",
    "claudin-low", "claudin-low", "mixed", "claudin-low", "control", "control")
dge$samples$clusters <- ClaudinClass
procr.annotation <- dge$samples$clusters
heat_anno <- HeatmapAnnotation(group = procr.annotation, col = list(group = c(mixed = "firebrick",
    differentiated = "cornflowerblue", `claudin-low` = "gold3", control = "grey17")))
transCor <- t(CorMat1000.aggr)
limit.value <- max(abs(min(transCor)), max(transCor))
col_fun = colorRamp2(seq(-limit.value, limit.value, length = 3), c("darkolivegreen",
    "#EEEEEE", "red3"))
Heatmap(transCor, name = "Cor. Coef.", row_names_gp = grid::gpar(fontsize = 10),
    column_names_gp = grid::gpar(fontsize = 10), cluster_rows = TRUE, row_title = NULL,
    top_annotation = heat_anno, show_row_names = TRUE, show_column_names = TRUE,
    col = col_fun)

Plotting only the correlation to the Claudin reference, we reproduce the supplementary Figure S8 B.

CorMat1000.sel <- CorMat1000[, colnames(CorMat1000) %in% c("Claudin")]
melt.Cor <- melt(CorMat1000.sel)
colnames(melt.Cor) <- c("sample", "class", "corr")
melt.Cor$category <- ClaudinClass
ggplot(melt.Cor, aes(x = sample, y = corr, fill = category)) + geom_boxplot() + stat_summary(fun = mean,
    geom = "point", shape = 23, size = 4) + ggtitle("Correlation with the claudin-low class") +
    scale_fill_manual(values = c(mixed = "firebrick", differentiated = "cornflowerblue",
        `claudin-low` = "gold3", control = "grey17"))

We can also generate a PCA plot with this annotation and we can see that the visible clusters correspond to the define classes (Figure 1F).

# make pca with sample group annotation
dge2 <- dge
dge2$samples$SampleGroup <- ClaudinClass
res <- computePCA(dge2)
res[[1]]["SampleName"] <- colnames(dge2)
color.groups <- c(mixed = "firebrick", differentiated = "cornflowerblue", `claudin-low` = "gold3",
    control = "grey17")
plotPCs(res, color = color.groups)

Here we note that robustness of the results was tested also on 1200 and 800 most variable genes. Spearman correlation was also tested yielding similar results, thus a role of outliers biasing the Pearson correlation was ruled out.

Heatmap of the selected genes

Next, we plotted a heatmap with selected genes reproducing a supplementary Figure S8 A.

test.genes <- c("Krt5", "Krt15", "Krt14", "Acta2", "Tp63", "Itgb1", "Krt8", "Krt18",
    "Esr1", "Pgr", "Gata3", "Elf5", "Erbb2", "Cdh1", "Vim", "Fn1", "Twist1", "Twist2",
    "Zeb1", "Zeb2", "Snai1", "Snai2", "Cldn1", "Cldn4", "Cldn6", "Cldn8", "Cldn12",
    "Cldn19", "Cldn23", "Ocln", "Sparc", "Procr", "Lgr5", "Axin2")
dge.sel <- dge[dge$genes$SYMBOL %in% test.genes, ]
z <- edgeR::cpm(dge.sel, log = TRUE)
rownames(z) <- dge.sel$genes$SYMBOL
myBreaks <- seq(-3, 3, by = 0.2)
Heatmap((z - rowMeans(z))/rowSds(z), name = "Z score", row_names_gp = grid::gpar(fontsize = 10),
    column_names_gp = grid::gpar(fontsize = 10), row_title = NULL, top_annotation = heat_anno)

Extraction of molecular signatures

We next used Non-negative matrix factorization (NMF) to extract molecular signatures (metagenes) from the samples. To that end, ButchR package will be used (Quintero et al. 2020). For installation instruction refer to the package github page and the full code available at [our Zenodo].

suppressPackageStartupMessages({
    if (!require("knitr"))
        BiocManager::install("knitr")
    library(knitr)
    if (!require("viridis"))
        BiocManager::install("viridis")
    library(viridis)
    if (!require("org.Mm.eg.db"))
        BiocManager::install("org.Mm.eg.db")
    library(org.Mm.eg.db)
    if (!require("clusterProfiler"))
        BiocManager::install("clusterProfiler")
    library(clusterProfiler)
    # if (!require('reticulate')) BiocManager::install('reticulate'); if
    # (!require('devtools')) BiocManager::install('devtools');
    # library(devtools)
})
# reticulate::use_condaenv('tf', required = TRUE) library(reticulate)
# py_config()
library(ButchR)

NMF approximates the RNA count matrix C, \dim{C}= \{m,n\} (m detected genes, n samples) as a product of two matrices C \approx W H with non-negative entries. The algorithm also encourages sparcity of the W and H matrices. These two characteristics (non-negativity and sparseness) ensure interpretability of the results. The decomposition has a free parameter k of the latent dimension (\dim{W}= \{m,k\},\dim{H}= \{k,n\}) which has to be chosen beforehand. ButchR offers multiple diagnostic measures to assign the most optimal k for the input data, however, any k is “allowed” and the main criterion is the interpretability of the extracted signatures. We ran NMF over 30 random initializations of the exposure matrix H and signature matrix W for the range k = 3, 4, 5, 6. The convergence of the stochastic NMF algorithm was assessed when the assignment of each sample to a given signature did not change after 40 iterations.

procr.matrix.cts <- edgeR::cpm(dge, log = FALSE)
k_min <- 3
k_max <- 6
seed <- 102
# original command
# procr_nmf_exp <- run_NMF_tensor(X = procr.matrix.cts, ranks = k_min:k_max,
# method = 'NMF', n_initializations = 30, extract_features = TRUE, seed = seed)

Results from run_NMF_tensor function correspond to the input_data/procr_nmf_exp.rda file.

load(file = "../RNAseq_PROCRpos-derived_tumor_pieces/input_data/procr_nmf_exp.rda")

procr_nmf_exp <- normalizeW(procr_nmf_exp)
gg_plotKStats(procr_nmf_exp)

No optimal latent dimension k was suggested. Though, from the diagnostics plot, k = 4 appeeared to be a minimal suitable dimension to consider (minimizing Frobenius error measures and Amari distance while scoring relatively high in the other metrics which is desirable). Thus, we selected this k = 4 for the downstream analysis.

The next piece of code plots the exposure matrix H (Figure 3A).

procr_Hk4 <- HMatrix(procr_nmf_exp, k = 4)
rownames(procr_Hk4) <- c("Sign.1", "Sign.2", "Sign.3", "Sign.4")
class(procr_Hk4)  #matrix array

[1] "matrix" "array"

procr_Hk4 <- as.matrix(procr_Hk4)
colnames(procr_Hk4)[1:20] <- 1:20
procr.annotation <- dge$samples$clusters
print(procr.annotation)

 [1] "claudin-low"    "claudin-low"    "differentiated" "differentiated"
 [5] "claudin-low"    "mixed"          "claudin-low"    "mixed"         
 [9] "differentiated" "claudin-low"    "mixed"          "differentiated"
[13] "claudin-low"    "claudin-low"    "claudin-low"    "claudin-low"   
[17] "claudin-low"    "claudin-low"    "mixed"          "claudin-low"   
[21] "control"        "control"

heat_anno <- HeatmapAnnotation(group = procr.annotation, col = list(group = c(mixed = "firebrick",
    differentiated = "cornflowerblue", `claudin-low` = "gold3", control = "grey17")))
h_heatmap_4 <- Heatmap(procr_Hk4, col = viridis(100), name = "Exposure", clustering_distance_columns = "pearson",
    show_column_dend = TRUE, top_annotation = heat_anno, show_column_names = TRUE,
    show_row_names = TRUE, cluster_rows = FALSE)
print(h_heatmap_4)

Next, we plotted the normalized signature matrix W (Figure S10 A).

# SIGNATURES
procr_features <- SignatureSpecificFeatures(procr_nmf_exp, k = 4, return_all_features = TRUE)
colnames(procr_features) <- paste0("Sign.", 1:4)
kable(head(procr_features))

	Sign.2	Sign.3	Sign.4
ENSMUSG00000069125	0	1	0
ENSMUSG00000103224	1	1	1
ENSMUSG00000091223	1	1	1
ENSMUSG00000025255	0	0	1
ENSMUSG00000103034	1	1	1
ENSMUSG00000027499	1	1	0

# keep only those specific
procr_specific <- rownames(procr_features)[rowSums(procr_features) == 1]
length(procr_specific)  #6927

[1] 6927

procr_Wspecific <- WMatrix(procr_nmf_exp, k = 4)[procr_specific, ]  #reduce the W matrix to the specific ones
colnames(procr_Wspecific) <- paste0("Sign.", 1:4)
# This W matrix is the same as procr_Wk4 from the input_data folder
# normalize exposure score in W matrix across rows
procr_Wspecific.norm <- procr_Wspecific/matrixStats::rowMaxs(procr_Wspecific)  #normalizes W to the max value in the row...better for plotting
# Display selected features on W matrix
w_heatmap <- Heatmap(procr_Wspecific.norm, col = inferno(100), name = "W matrix",
    clustering_distance_columns = "pearson", show_column_dend = TRUE, show_column_names = TRUE,
    show_row_names = FALSE, cluster_rows = TRUE, cluster_columns = FALSE)
w_heatmap

Functional annotation of the metagenes

For the functional annotation of the metagenes, we used clusterProfiler R package that implements Fisher’s exact test for overrepresentation analysis.

# Sign.4 <-> PROCR metagene
boolean.genes.Sig.Mes <- procr_Wspecific.norm[, "Sign.4"] == 1
# SORT BY Wmatrix
procr_Wspecific.noNorm.SigMes <- sort(procr_Wspecific[boolean.genes.Sig.Mes, 4],
    decreasing = TRUE)
dgeSigMes <- dge[names(procr_Wspecific.noNorm.SigMes), ]
names(procr_Wspecific.noNorm.SigMes) <- dgeSigMes$genes$SYMBOL
# SAME FOR the normal signature
boolean.genes.Sig.Normal <- procr_Wspecific.norm[, "Sign.3"] == 1
procr_Wspecific.noNorm.SigNormal <- sort(procr_Wspecific[boolean.genes.Sig.Normal,
    3], decreasing = TRUE)
dgeSigNormal <- dge[names(procr_Wspecific.noNorm.SigNormal), ]
names(procr_Wspecific.noNorm.SigNormal) <- dgeSigNormal$genes$SYMBOL
# SAME FOR 'ANTI-CLAUDIN'
boolean.genes.Sig.Anti <- procr_Wspecific.norm[, "Sign.2"] == 1
procr_Wspecific.noNorm.SigAnti <- sort(procr_Wspecific[boolean.genes.Sig.Anti, 2],
    decreasing = TRUE)
dgeSigAnti <- dge[names(procr_Wspecific.noNorm.SigAnti), ]
names(procr_Wspecific.noNorm.SigAnti) <- dgeSigAnti$genes$SYMBOL
# SAME FOR 'Mixed'
boolean.genes.Sig.Spec <- procr_Wspecific.norm[, "Sign.1"] == 1
procr_Wspecific.noNorm.SigSpec <- sort(procr_Wspecific[boolean.genes.Sig.Spec, 1],
    decreasing = TRUE)
dgeSigSpec <- dge[names(procr_Wspecific.noNorm.SigSpec), ]
names(procr_Wspecific.noNorm.SigSpec) <- dgeSigSpec$genes$SYMBOL

The full-length metagenes contain over 1000 genes each, however, most of them with very low weights. For the functional annotation, we selected the top 10 percent of the genes based on their weights (most important genes that contribute to each metagene).

# need EntrezIds
gene.set.SigMes <- names(procr_Wspecific.noNorm.SigMes)[1:trunc(0.1 * length(names(procr_Wspecific.noNorm.SigMes)))]
gene.set.SigSpec <- names(procr_Wspecific.noNorm.SigSpec)[1:trunc(0.1 * length(names(procr_Wspecific.noNorm.SigSpec)))]
gene.set.SigAnti <- names(procr_Wspecific.noNorm.SigAnti)[1:trunc(0.1 * length(names(procr_Wspecific.noNorm.SigAnti)))]
gene.set.SigNormal <- names(procr_Wspecific.noNorm.SigNormal)[1:trunc(0.1 * length(names(procr_Wspecific.noNorm.SigNormal)))]
# sanity check
intersect(gene.set.SigNormal, gene.set.SigMes)

character(0)

For the annotation, we used the Hallmark gene sets from MSigDB (Liberzon et al. 2011), see Figure 3B.

# Mouse symbol
mm <- org.Mm.eg.db
my.symbols.SigMes <- gene.set.SigMes
dictionary.SigMes <- AnnotationDbi::select(mm, keys = my.symbols.SigMes, columns = c("ENTREZID",
    "SYMBOL"), keytype = "SYMBOL")

'select()' returned 1:1 mapping between keys and columns

my.symbols.SigSpec <- gene.set.SigSpec
dictionary.SigSpec <- AnnotationDbi::select(mm, keys = my.symbols.SigSpec, columns = c("ENTREZID",
    "SYMBOL"), keytype = "SYMBOL")

'select()' returned 1:1 mapping between keys and columns

my.symbols.SigAnti <- gene.set.SigAnti
dictionary.SigAnti <- AnnotationDbi::select(mm, keys = my.symbols.SigAnti, columns = c("ENTREZID",
    "SYMBOL"), keytype = "SYMBOL")

'select()' returned 1:1 mapping between keys and columns

my.symbols.SigNormal <- gene.set.SigNormal
dictionary.SigNormal <- AnnotationDbi::select(mm, keys = my.symbols.SigNormal, columns = c("ENTREZID",
    "SYMBOL"), keytype = "SYMBOL")

'select()' returned 1:1 mapping between keys and columns

input.overrepr <- list(dictionary.SigMes$ENTREZID, dictionary.SigNormal$ENTREZID,
    dictionary.SigAnti$ENTREZID, dictionary.SigSpec$ENTREZID)
names(input.overrepr) <- c("Sig.4", "Sig.3", "Sig.2", "Sig.1")
m_t2g <- msigdbr::msigdbr(species = "Mus musculus", category = c("H")) %>%
    dplyr::select(gs_name, entrez_gene) %>%
    dplyr::distinct(gs_name, entrez_gene)
xx <- compareCluster(input.overrepr, enricher, TERM2GENE = m_t2g, pvalueCutoff = 0.1,
    pAdjustMethod = "BH")
xx <- as.data.frame(xx)
xx$Description <- str_remove(xx$Description, "HALLMARK_")
ggplot(xx, showCategory = 10, aes(Cluster, Description)) + theme_bw() + ylab("") +
    xlab("Signature") + geom_point(aes(color = p.adjust, size = Count)) + scale_color_gradient(low = "blue",
    high = "green") + ggtitle("Gene set enrichment")

Session Info

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] clusterProfiler_4.10.1      org.Mm.eg.db_3.18.0        
 [3] viridis_0.6.5               viridisLite_0.4.2          
 [5] ButchR_1.0                  devtools_2.4.5             
 [7] usethis_3.0.0               ggcorrplot_0.1.4.1         
 [9] reshape2_1.4.4              circlize_0.4.16            
[11] ComplexHeatmap_2.18.0       pheatmap_1.0.12            
[13] org.Hs.eg.db_3.18.0         AnnotationDbi_1.64.1       
[15] biomaRt_2.58.2              GEOquery_2.70.0            
[17] ggrepel_0.9.6               ggfortify_0.4.17           
[19] ggplot2_3.5.1               stringr_1.5.1              
[21] edgeR_4.0.16                limma_3.58.1               
[23] SummarizedExperiment_1.32.0 Biobase_2.62.0             
[25] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
[27] IRanges_2.36.0              S4Vectors_0.40.2           
[29] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
[31] matrixStats_1.4.1           BiocManager_1.30.25        
[33] png_0.1-8                   knitr_1.48                 

loaded via a namespace (and not attached):
  [1] splines_4.3.2           later_1.3.2             ggplotify_0.1.2        
  [4] bitops_1.0-9            filelock_1.0.3          R.oo_1.26.0            
  [7] tibble_3.2.1            polyclip_1.10-7         XML_3.99-0.17          
 [10] lifecycle_1.0.4         doParallel_1.0.17       MASS_7.3-60.0.1        
 [13] lattice_0.22-6          magrittr_2.0.3          rmarkdown_2.28         
 [16] yaml_2.3.10             remotes_2.5.0           httpuv_1.6.15          
 [19] sessioninfo_1.2.2       pkgbuild_1.4.4          reticulate_1.39.0      
 [22] cowplot_1.1.3           DBI_1.2.3               RColorBrewer_1.1-3     
 [25] abind_1.4-8             pkgload_1.4.0           zlibbioc_1.48.2        
 [28] R.utils_2.12.3          purrr_1.0.2             msigdbr_7.5.1          
 [31] ggraph_2.2.1            RCurl_1.98-1.16         yulab.utils_0.1.7      
 [34] tweenr_2.0.3            rappdirs_0.3.3          GenomeInfoDbData_1.2.11
 [37] enrichplot_1.22.0       tidytree_0.4.6          codetools_0.2-20       
 [40] DelayedArray_0.28.0     ggforce_0.4.2           DOSE_3.28.2            
 [43] xml2_1.3.6              tidyselect_1.2.1        shape_1.4.6.1          
 [46] aplot_0.2.3             farver_2.1.2            BiocFileCache_2.10.2   
 [49] jsonlite_1.8.9          GetoptLong_1.0.5        ellipsis_0.3.2         
 [52] tidygraph_1.3.1         iterators_1.0.14        foreach_1.5.2          
 [55] tools_4.3.2             progress_1.2.3          treeio_1.26.0          
 [58] Rcpp_1.0.13             glue_1.8.0              gridExtra_2.3          
 [61] SparseArray_1.2.4       xfun_0.48               qvalue_2.34.0          
 [64] dplyr_1.1.4             withr_3.0.1             formatR_1.14           
 [67] fastmap_1.2.0           fansi_1.0.6             digest_0.6.37          
 [70] gridGraphics_0.5-1      R6_2.5.1                mime_0.12              
 [73] colorspace_2.1-1        Cairo_1.6-2             GO.db_3.18.0           
 [76] RSQLite_2.3.7           R.methodsS3_1.8.2       utf8_1.2.4             
 [79] tidyr_1.3.1             generics_0.1.3          data.table_1.16.2      
 [82] graphlayouts_1.2.0      prettyunits_1.2.0       httr_1.4.7             
 [85] htmlwidgets_1.6.4       S4Arrays_1.2.1          riverplot_0.10         
 [88] scatterpie_0.2.4        pkgconfig_2.0.3         gtable_0.3.5           
 [91] blob_1.2.4              XVector_0.42.0          shadowtext_0.1.4       
 [94] htmltools_0.5.8.1       profvis_0.4.0           fgsea_1.28.0           
 [97] clue_0.3-65             scales_1.3.0            ggfun_0.1.6            
[100] rstudioapi_0.16.0       tzdb_0.4.0              rjson_0.2.23           
[103] nlme_3.1-166            curl_5.2.3              cachem_1.1.0           
[106] GlobalOptions_0.1.2     parallel_4.3.2          miniUI_0.1.1.1         
[109] HDO.db_0.99.1           pillar_1.9.0            vctrs_0.6.5            
[112] urlchecker_1.0.1        promises_1.3.0          dbplyr_2.5.0           
[115] xtable_1.8-4            cluster_2.1.6           evaluate_1.0.1         
[118] readr_2.1.5             cli_3.6.3               locfit_1.5-9.10        
[121] compiler_4.3.2          rlang_1.1.4             crayon_1.5.3           
[124] labeling_0.4.3          plyr_1.8.9              fs_1.6.4               
[127] stringi_1.8.4           BiocParallel_1.36.0     nnls_1.5               
[130] babelgene_22.9          munsell_0.5.1           Biostrings_2.70.3      
[133] lazyeval_0.2.2          GOSemSim_2.28.1         Matrix_1.6-5           
[136] patchwork_1.3.0         hms_1.1.3               bit64_4.5.2            
[139] KEGGREST_1.42.0         statmod_1.5.0           shiny_1.9.1            
[142] igraph_2.0.3            memoise_2.0.1           ggtree_3.10.1          
[145] fastmatch_1.1-4         bit_4.5.0               gson_0.1.0             
[148] ape_5.8

References

Liberzon, Arthur, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdóttir, Pablo Tamayo, and Jill P. Mesirov. 2011. “Molecular signatures database (MSigDB) 3.0.” Bioinformatics 27 (12): 1739–40. https://doi.org/10.1093/bioinformatics/btr260.

Parker, Joel S., Michael Mullins, Maggie C. U. Cheang, Samuel Leung, David Voduc, Tammi Vickery, Sherri Davies, et al. 2009. “Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes.” Journal of Clinical Oncology 27 (8): 1160–67. https://doi.org/10.1200/JCO.2008.18.1370.

Prat, Aleix, Joel S. Parker, Olga Karginova, Cheng Fan, Chad Livasy, Jason I. Herschkowitz, Xiaping He, and Charles M. Perou. 2010. “Phenotypic and Molecular Characterization of the Claudin-Low Intrinsic Subtype of Breast Cancer.” Breast Cancer Research 12 (5): R68. https://doi.org/10.1186/bcr2635.

Prat, Aleix, and Charles M. Perou. 2009. “Mammary Development Meets Cancer Genomics.” Nature Medicine 15 (8): 842–44. https://doi.org/10.1038/nm0809-842.

Quintero, Andres, Daniel Hübschmann, Nils Kurzawa, Sebastian Steinhauser, Philipp Rentzsch, Stephen Krämer, Carolin Andresen, et al. 2020. “ShinyButchR: Interactive NMF-based decomposition workflow of genome-scale datasets.” Biology Methods and Protocols 5 (1): bpaa022. https://doi.org/10.1093/biomethods/bpaa022.

Wang, Daisong, Cheguo Cai, Xiaobing Dong, Qing Cissy Yu, Xiao-Ou Zhang, Li Yang, and Yi Arial Zeng. 2015. “Identification of Multipotent Mammary Stem Cells by Protein c Receptor Expression.” Nature 517 (7532): 81–84. https://doi.org/10.1038/nature13851.