RNAseq analysis of the mouse mammary tumors derived from PROCR+ mammary stem cells, bulk tumors
I. Signatures extraction
Author
Michal Kloc, Marion Salvador
Published
March 1, 2026
Introduction
The Protein C receptor (PROCR) was identified as a marker of a population of adult multipotent mammary stem cells (MaSCs) in mouse mammary epithelium (Wang et al. 2015). In this vignette, we present parts of the analysis of the bulk RNAseq data of the tumors derived from mouse PROCR+ MaSCs. For the complete code, please refer to our Zenodo repository [Zenodo will be available upon publication].
The analysis presented here contains:
Scoring the similarity of the mouse mammary tumors to the reference data set of annotated human breast tumors (Prat and Perou 2009).
Unbiased extraction of molecular signatures (metagenes) from the mouse mammary tumors and their functional annotation.
Description of the experiment, alignment to the genome.
The mice (n = 20) developed mammary tumors that were harvested 4 to 5 weeks after 4-OH-Tam induction. 2 control samples of normal mouse mammary epithelial cells were also included.
Reads were aligned to the mouse genome (UCSC version mm10) with STAR, version 2.7.9a, using the multi-map settings outFilterMultimapNmax = 10 (maximal number of multiple alignments allowed for a read) and outSAMmultNmax = 1 (maximal number of SAM lines for each mapped read). The output was sorted and indexed with SAMtools, version 1.11. Duplicate reads were marked using MarkDuplicates from Picard, version 2.23.4. Strand-specific coverage tracks per sample were generated by tiling the genome in 20bp windows and counting overlapping reads in each window using the bamCount function from the Bioconductor package bamsignals (Bioconductor version 3.13). These window counts were then exported in bigWig format using the export function rtracklayer of the Bioconductor package. The rsubread::featureCounts functionwas used to count the number of reads (5’ends) overlapping with the exons of each gene assuming an exon union model.
Reading in raw data, filtering and normalization
First, we load necessary packages and read in the raw quantified data in a form of DGEList
Bioconductor version 3.18 (BiocManager 1.30.25), R 4.3.2 (2023-10-31)
Old packages: 'corrplot', 'emmeans', 'gert'
setwd("../RNAseq_PROCRpos-derived_tumor_pieces")# when downloaded from Zenodo, data stored in input_data folderdge <-readRDS("input_data/dge01_procr-derived_chunks.rds")
This next step contains renaming of the samples’ description. We explicitly named the samples nr. 21 and 22 as controls.
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
Reference human data set
In order to assess human relevance of the mouse mammary tumors, we next correlated the gene expression to the reference human PAM50-annotated breast cancer dataset (Prat et al. 2010). The reference consisted of microarray log-normalized data.
# load series and platform data from GEOgset <-getGEO("GSE18229", GSEMatrix =TRUE, AnnotGPL =TRUE)[[1]]
Found 7 file(s)
GSE18229-GPL1390_series_matrix.txt.gz
GSE18229-GPL1708_series_matrix.txt.gz
GSE18229-GPL5325_series_matrix.txt.gz
Annotation GPL not available, so will use submitter GPL instead
GSE18229-GPL6607_series_matrix.txt.gz
Annotation GPL not available, so will use submitter GPL instead
GSE18229-GPL7504_series_matrix.txt.gz
Annotation GPL not available, so will use submitter GPL instead
GSE18229-GPL885_series_matrix.txt.gz
GSE18229-GPL887_series_matrix.txt.gz
# make proper column names to match toptablefvarLabels(gset) <-make.names(fvarLabels(gset))# select PAM50 classificationclassification <-pData(gset)[c(66)]names(classification) <-"PAM50_classification"table(classification)
To perform the correlation analyses, we needed to reduce the two data sets on 1-to-1 orthologous genes. We used EnseblIDs to do so. The microarray data were annotated by Entrezid, so first we built an EnseblID annotation. We also had to deal with the situation, where multiple probes on the microarray chip corresponded to the same EnseblID. In such situations, we kept the probes with the highest variabilty (more biologically informative of the sample differences). ::: {.cell}
gset <- gset[fData(gset)$Gene.ID !="", ] #remove those without EntrezIddim(gset)
'select()' returned 1:many mapping between keys and columns
a <-table(fData(gset)$ensembl) #some ensemblids more than once?length(a[a >1]) # 2835 genenes has non-unique ensembl, keep the representative with highest dispersion
Correlation between the mouse mammary tumors derived from PROCR+ MaSCs and the PAM50 human molecular breast cancer subtypes
Next, we used BioMart package to find and annotatate unique mouse-human orthologs.
ensembl <-useEnsembl(biomart ="genes", dataset ="mmusculus_gene_ensembl", version =105)orthology <-getBM(attributes =c("ensembl_gene_id", "external_gene_name", "hsapiens_homolog_associated_gene_name","hsapiens_homolog_ensembl_gene", "hsapiens_homolog_orthology_type", "hsapiens_homolog_subtype","hsapiens_homolog_orthology_confidence"), mart = ensembl)# Some filtering to keep only 1-to-1 orthologsorthology1to1 <- orthology[orthology$hsapiens_homolog_ensembl_gene !="", ]orthology1to1 <- orthology1to1[orthology1to1$hsapiens_homolog_orthology_type %in%c("ortholog_one2one"), ]## Filter on taxonomic level. We could be more stringent here but would lose## some orthologs identified as 'dubious duplications'orthology1to1 <- orthology1to1[orthology1to1$hsapiens_homolog_subtype %in%c("Boreoeutheria","Eutheria", "Euarchontoglires", "Theria", "Amniota"), ]row.names(orthology1to1) <- orthology1to1$hsapiens_homolog_ensembl_genefData(gset)$hm1to1orthologs <- orthology1to1[fData(gset)$ensembl, "ensembl_gene_id"]
The gene expression matrix was storein in a separate GEOmatrix object and the respective mouse EnsemblIDs were assigned to the genes.
We next performed Pearson correlation analysis on the co-expressed genes in both data sets. To reduce the noise, we selected the top 1000 highly variable genes for the analysis.
# get our dataMYDATAmatrix <- edgeR::cpm(dge, log =TRUE)GeneVar <-apply(MYDATAmatrix, 1, IQR, na.rm =TRUE)rank.dispersion <-rank(-GeneVar)MYDATAmatrix <- MYDATAmatrix[order(rank.dispersion), ] #sort genes according to its variance# which genes are shared across the datasets?shared.genes <-rownames(MYDATAmatrix)[rownames(MYDATAmatrix) %in%rownames(GEOmatrix)] #names of the shared genesdim(MYDATAmatrix[shared.genes, ])
# Correlation on 1000 most varying genesMYDATAmatrix1000 <- MYDATAmatrix[1:1000, ]GEOmatrix1000 <- GEOmatrix[1:1000, ]table(rownames(MYDATAmatrix1000) ==rownames(GEOmatrix1000))
TRUE
1000
We also mean-centered the data. This was important to correct for a technological batch effect between the data sets. Thus, the correlation was performed on the deviation of the expression levels in each sample from the dataset mean.
MYDATAmatrix1000 <-t(scale(t(MYDATAmatrix1000), center =TRUE, scale =FALSE))GEOmatrix1000 <-t(scale(t(GEOmatrix1000), center =TRUE, scale =FALSE))CorMat1000 <-cor(MYDATAmatrix1000, GEOmatrix1000, use ="na.or.complete", method ="pearson")dim(CorMat1000)
[1] 22 197
We also performed sample by sample correlation, thus the cross-correlation matrix CorMat1000 had 22 rows (corresponding to our samples) and 197 columns (corresponding to the reference dataset). We aggregated the results over the reference samples from the same PAM50 class (Parker et al. 2009) simply by taking the mean value.
Plotting only the correlation to the Claudin reference, we reproduce the supplementary Figure S8 B.
CorMat1000.sel <- CorMat1000[, colnames(CorMat1000) %in%c("Claudin")]melt.Cor <-melt(CorMat1000.sel)colnames(melt.Cor) <-c("sample", "class", "corr")melt.Cor$category <- ClaudinClassggplot(melt.Cor, aes(x = sample, y = corr, fill = category)) +geom_boxplot() +stat_summary(fun = mean,geom ="point", shape =23, size =4) +ggtitle("Correlation with the claudin-low class") +scale_fill_manual(values =c(mixed ="firebrick", differentiated ="cornflowerblue",`claudin-low`="gold3", control ="grey17"))
We can also generate a PCA plot with this annotation and we can see that the visible clusters correspond to the define classes (Figure 1F).
# make pca with sample group annotationdge2 <- dgedge2$samples$SampleGroup <- ClaudinClassres <-computePCA(dge2)res[[1]]["SampleName"] <-colnames(dge2)color.groups <-c(mixed ="firebrick", differentiated ="cornflowerblue", `claudin-low`="gold3",control ="grey17")plotPCs(res, color = color.groups)
Here we note that robustness of the results was tested also on 1200 and 800 most variable genes. Spearman correlation was also tested yielding similar results, thus a role of outliers biasing the Pearson correlation was ruled out.
Heatmap of the selected genes
Next, we plotted a heatmap with selected genes reproducing a supplementary Figure S8 A.
We next used Non-negative matrix factorization (NMF) to extract molecular signatures (metagenes) from the samples. To that end, ButchR package will be used (Quintero et al. 2020). For installation instruction refer to the package github page and the full code available at [our Zenodo].
NMF approximates the RNA count matrix C, \dim{C}= \{m,n\} (m detected genes, n samples) as a product of two matrices C \approx W H with non-negative entries. The algorithm also encourages sparcity of the W and H matrices. These two characteristics (non-negativity and sparseness) ensure interpretability of the results. The decomposition has a free parameter k of the latent dimension (\dim{W}= \{m,k\},\dim{H}= \{k,n\}) which has to be chosen beforehand. ButchR offers multiple diagnostic measures to assign the most optimal k for the input data, however, any k is “allowed” and the main criterion is the interpretability of the extracted signatures. We ran NMF over 30 random initializations of the exposure matrix H and signature matrix W for the range k = 3, 4, 5, 6. The convergence of the stochastic NMF algorithm was assessed when the assignment of each sample to a given signature did not change after 40 iterations.
No optimal latent dimension k was suggested. Though, from the diagnostics plot, k = 4 appeeared to be a minimal suitable dimension to consider (minimizing Frobenius error measures and Amari distance while scoring relatively high in the other metrics which is desirable). Thus, we selected this k = 4 for the downstream analysis.
The next piece of code plots the exposure matrix H (Figure 3A).
procr_Hk4 <-HMatrix(procr_nmf_exp, k =4)rownames(procr_Hk4) <-c("Sign.1", "Sign.2", "Sign.3", "Sign.4")class(procr_Hk4) #matrix array
heat_anno <-HeatmapAnnotation(group = procr.annotation, col =list(group =c(mixed ="firebrick",differentiated ="cornflowerblue", `claudin-low`="gold3", control ="grey17")))h_heatmap_4 <-Heatmap(procr_Hk4, col =viridis(100), name ="Exposure", clustering_distance_columns ="pearson",show_column_dend =TRUE, top_annotation = heat_anno, show_column_names =TRUE,show_row_names =TRUE, cluster_rows =FALSE)print(h_heatmap_4)
Next, we plotted the normalized signature matrix W (Figure S10 A).
# SIGNATURESprocr_features <-SignatureSpecificFeatures(procr_nmf_exp, k =4, return_all_features =TRUE)colnames(procr_features) <-paste0("Sign.", 1:4)kable(head(procr_features))
Sign.1
Sign.2
Sign.3
Sign.4
ENSMUSG00000069125
0
0
1
0
ENSMUSG00000103224
0
1
1
1
ENSMUSG00000091223
0
1
1
1
ENSMUSG00000025255
0
0
0
1
ENSMUSG00000103034
0
1
1
1
ENSMUSG00000027499
0
1
1
0
# keep only those specificprocr_specific <-rownames(procr_features)[rowSums(procr_features) ==1]length(procr_specific) #6927
[1] 6927
procr_Wspecific <-WMatrix(procr_nmf_exp, k =4)[procr_specific, ] #reduce the W matrix to the specific onescolnames(procr_Wspecific) <-paste0("Sign.", 1:4)# This W matrix is the same as procr_Wk4 from the input_data folder# normalize exposure score in W matrix across rowsprocr_Wspecific.norm <- procr_Wspecific/matrixStats::rowMaxs(procr_Wspecific) #normalizes W to the max value in the row...better for plotting# Display selected features on W matrixw_heatmap <-Heatmap(procr_Wspecific.norm, col =inferno(100), name ="W matrix",clustering_distance_columns ="pearson", show_column_dend =TRUE, show_column_names =TRUE,show_row_names =FALSE, cluster_rows =TRUE, cluster_columns =FALSE)w_heatmap
Functional annotation of the metagenes
For the functional annotation of the metagenes, we used clusterProfiler R package that implements Fisher’s exact test for overrepresentation analysis.
# Sign.4 <-> PROCR metageneboolean.genes.Sig.Mes <- procr_Wspecific.norm[, "Sign.4"] ==1# SORT BY Wmatrixprocr_Wspecific.noNorm.SigMes <-sort(procr_Wspecific[boolean.genes.Sig.Mes, 4],decreasing =TRUE)dgeSigMes <- dge[names(procr_Wspecific.noNorm.SigMes), ]names(procr_Wspecific.noNorm.SigMes) <- dgeSigMes$genes$SYMBOL# SAME FOR the normal signatureboolean.genes.Sig.Normal <- procr_Wspecific.norm[, "Sign.3"] ==1procr_Wspecific.noNorm.SigNormal <-sort(procr_Wspecific[boolean.genes.Sig.Normal,3], decreasing =TRUE)dgeSigNormal <- dge[names(procr_Wspecific.noNorm.SigNormal), ]names(procr_Wspecific.noNorm.SigNormal) <- dgeSigNormal$genes$SYMBOL# SAME FOR 'ANTI-CLAUDIN'boolean.genes.Sig.Anti <- procr_Wspecific.norm[, "Sign.2"] ==1procr_Wspecific.noNorm.SigAnti <-sort(procr_Wspecific[boolean.genes.Sig.Anti, 2],decreasing =TRUE)dgeSigAnti <- dge[names(procr_Wspecific.noNorm.SigAnti), ]names(procr_Wspecific.noNorm.SigAnti) <- dgeSigAnti$genes$SYMBOL# SAME FOR 'Mixed'boolean.genes.Sig.Spec <- procr_Wspecific.norm[, "Sign.1"] ==1procr_Wspecific.noNorm.SigSpec <-sort(procr_Wspecific[boolean.genes.Sig.Spec, 1],decreasing =TRUE)dgeSigSpec <- dge[names(procr_Wspecific.noNorm.SigSpec), ]names(procr_Wspecific.noNorm.SigSpec) <- dgeSigSpec$genes$SYMBOL
The full-length metagenes contain over 1000 genes each, however, most of them with very low weights. For the functional annotation, we selected the top 10 percent of the genes based on their weights (most important genes that contribute to each metagene).
# need EntrezIdsgene.set.SigMes <-names(procr_Wspecific.noNorm.SigMes)[1:trunc(0.1*length(names(procr_Wspecific.noNorm.SigMes)))]gene.set.SigSpec <-names(procr_Wspecific.noNorm.SigSpec)[1:trunc(0.1*length(names(procr_Wspecific.noNorm.SigSpec)))]gene.set.SigAnti <-names(procr_Wspecific.noNorm.SigAnti)[1:trunc(0.1*length(names(procr_Wspecific.noNorm.SigAnti)))]gene.set.SigNormal <-names(procr_Wspecific.noNorm.SigNormal)[1:trunc(0.1*length(names(procr_Wspecific.noNorm.SigNormal)))]# sanity checkintersect(gene.set.SigNormal, gene.set.SigMes)
character(0)
For the annotation, we used the Hallmark gene sets from MSigDB (Liberzon et al. 2011), see Figure 3B.
Liberzon, Arthur, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdóttir, Pablo Tamayo, and Jill P. Mesirov. 2011. “Molecular signatures database (MSigDB) 3.0.”Bioinformatics 27 (12): 1739–40. https://doi.org/10.1093/bioinformatics/btr260.
Parker, Joel S., Michael Mullins, Maggie C. U. Cheang, Samuel Leung, David Voduc, Tammi Vickery, Sherri Davies, et al. 2009. “Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes.”Journal of Clinical Oncology 27 (8): 1160–67. https://doi.org/10.1200/JCO.2008.18.1370.
Prat, Aleix, Joel S. Parker, Olga Karginova, Cheng Fan, Chad Livasy, Jason I. Herschkowitz, Xiaping He, and Charles M. Perou. 2010. “Phenotypic and Molecular Characterization of the Claudin-Low Intrinsic Subtype of Breast Cancer.”Breast Cancer Research 12 (5): R68. https://doi.org/10.1186/bcr2635.
Prat, Aleix, and Charles M. Perou. 2009. “Mammary Development Meets Cancer Genomics.”Nature Medicine 15 (8): 842–44. https://doi.org/10.1038/nm0809-842.
Quintero, Andres, Daniel Hübschmann, Nils Kurzawa, Sebastian Steinhauser, Philipp Rentzsch, Stephen Krämer, Carolin Andresen, et al. 2020. “ShinyButchR: Interactive NMF-based decomposition workflow of genome-scale datasets.”Biology Methods and Protocols 5 (1): bpaa022. https://doi.org/10.1093/biomethods/bpaa022.
Wang, Daisong, Cheguo Cai, Xiaobing Dong, Qing Cissy Yu, Xiao-Ou Zhang, Li Yang, and Yi Arial Zeng. 2015. “Identification of Multipotent Mammary Stem Cells by Protein c Receptor Expression.”Nature 517 (7532): 81–84. https://doi.org/10.1038/nature13851.