Let’s use mixOmics
package for multi-omics data integration using the DIABLO (Data Integration Analysis for Biomarker discovery using Latent cOmponents) method on the breast TCGA dataset. We will go through data loading, pairwise PLS models, design matrix setup, model tuning, visualization, and performance evaluation. Checks(e.g., dimension prints, summaries) are included and printouts to verify each step.
First, load the mixOmics
library and set a seed for reproducibility. Then, load the breast TCGA data and prepare the input lists.
library(mixOmics) # import the mixOmics library
set.seed(123) # for reproducibility, remove for normal use
data(breast.TCGA) # load in the data
data = list(miRNA = breast.TCGA$data.train$mirna, # set a list of all the X dataframes
mRNA = breast.TCGA$data.train$mrna,
proteomics = breast.TCGA$data.train$protein)
# Check: Print dimensions of each dataset
lapply(data, dim) # check their dimensions
## $miRNA
## [1] 150 184
##
## $mRNA
## [1] 150 200
##
## $proteomics
## [1] 150 142
Y = breast.TCGA$data.train$subtype # set the response variable as the Y dataframe
# Check: Summary of the response variable
summary(Y)
## Basal Her2 LumA
## 45 30 75
Note: Verify the dimensions match expectations (e.g., same number of samples across omics). The summary shows the distribution of subtypes.
We build pairwise sparse PLS models to explore correlations between omics datasets. We plot circle correlation plots for visualization.
list.keepX = c(25, 25) # select arbitrary values of features to keep
list.keepY = c(25, 25)
pls1 <- spls(data[["miRNA"]], data[["mRNA"]], keepX = list.keepX, keepY = list.keepY) # generate three pairwise PLS models
pls2 <- spls(data[["miRNA"]], data[["proteomics"]], keepX = list.keepX, keepY = list.keepY)
pls3 <- spls(data[["mRNA"]], data[["proteomics"]], keepX = list.keepX, keepY = list.keepY)
plotVar(pls1, cutoff = 0.5, title = "(a) miRNA vs mRNA", legend = c("miRNA", "mRNA"), # plot features of first PLS
var.names = FALSE, style = 'graphics',
pch = c(16, 17), cex = c(2,2),
col = c('darkorchid', 'lightgreen'))
plotVar(pls2, cutoff = 0.5, title = "(a) miRNA vs proteomics", legend = c("miRNA", "proteomics"), # plot features of second PLS
var.names = FALSE, style = 'graphics',
pch = c(16, 17), cex = c(2,2),
col = c('darkorchid', 'lightgreen'))
plotVar(pls3, cutoff = 0.5, title = "(a) mRNA vs proteomics", legend = c("mRNA", "proteomics"), # plot features of third PLS
var.names = FALSE, style = 'graphics',
pch = c(16, 17), cex = c(2,2),
col = c('darkorchid', 'lightgreen'))
FIGURE 1: Circle Correlation Plots for pairwise PLS models on the breast TCGA data. Only displays the top 25 features for each dimension, subsetting by those with a correlation above 0.5.
Note: These plots show correlations between features. Check if the models converged properly by inspecting the outputs.
For the next calculations, we adjust to single component for correlation checks.
Calculate and print correlations between variates from pairwise models.
cor(pls1$variates$X, pls1$variates$Y) # calculate correlation of miRNA and mRNA
## comp1
## comp1 0.8841792
cor(pls2$variates$X, pls2$variates$Y) # calculate correlation of miRNA and proteins
## comp1
## comp1 0.8361993
cor(pls3$variates$X, pls3$variates$Y) # calculate correlation of mRNA and proteins
## comp1
## comp1 0.9360264
Note: Higher correlations indicate stronger associations between omics layers. Values above 0.5 are typically good.
Create a design matrix for the DIABLO model.
design = matrix(0.1, ncol = length(data), nrow = length(data), # for square matrix filled with 0.1s
dimnames = list(names(data), names(data)))
diag(design) = 0 # set diagonal to 0s
# Print the design matrix
design
## miRNA mRNA proteomics
## miRNA 0.0 0.1 0.1
## mRNA 0.1 0.0 0.1
## proteomics 0.1 0.1 0.0
Note: The design matrix specifies connections between blocks. Off-diagonals are set to 0.1 for weak connections.
Fit a basic DIABLO model with 5 components.
basic.diablo.model = block.splsda(X = data, Y = Y, ncomp = 5, design = design) # form basic DIABLO model
Use cross-validation to choose the optimal number of components.
perf.diablo = perf(basic.diablo.model, validation = 'Mfold', folds = 10, nrepeat = 10) # run component number tuning with repeated CV
plot(perf.diablo) # plot output of tuning
FIGURE 2: Choosing the number of components in block.plsda
using perf()
with 10 × 10-fold CV function in the breast.TCGA
study. Classification error rates (overall and balanced, see Section 7.3) are represented on the y-axis with respect to the number of components on the x-axis for each prediction distance presented in PLS-DA
ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"] # set the optimal ncomp value
# Print the optimal choices
perf.diablo$choice.ncomp$WeightedVote # show the optimal choice for ncomp for each dist metric
## max.dist centroids.dist mahalanobis.dist
## Overall.ER 3 2 3
## Overall.BER 3 2 3
Note: The plot helps select ncomp where error rates stabilize. Printout shows choices per metric.
Tune the number of features to keep per component. (Note: This chunk is set to eval=FALSE for time-saving; load precomputed results.)
# Use the existing design matrix already created
test.keepX <- list(
mRNA = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
miRNA = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
proteomics = c(5:9, seq(10, 18, 2), seq(20, 30, 5))
)
tune.TCGA <- tune.block.splsda(
X = data,
Y = Y,
ncomp = ncomp,
test.keepX = test.keepX,
design = design,
validation = "Mfold",
folds = 10,
nrepeat = 1,
dist = "centroids.dist"
)
Load precomputed tuning results.
list.keepX = tune.TCGA$choice.keepX # set the optimal values of features to retain
# Print the optimal keepX
list.keepX
## $mRNA
## [1] 15
##
## $miRNA
## [1] 15
##
## $proteomics
## [1] 10
Tuning selects sparse features. Check the list for reasonableness (e.g., not too few or many).
Fit the final optimized DIABLO model.
final.diablo.model = block.splsda(X = data, Y = Y, ncomp = ncomp, # set the optimised DIABLO model
keepX = list.keepX, design = design)
# Check: Print the design matrix of the final model
final.diablo.model$design # design matrix for the final model
## miRNA mRNA proteomics Y
## miRNA 0.0 0.1 0.1 1
## mRNA 0.1 0.0 0.1 1
## proteomics 0.1 0.1 0.0 1
## Y 1.0 1.0 1.0 0
# Check: Selected variables for mRNA on component 1
selectVar(final.diablo.model, block = 'mRNA', comp = 1)$mRNA$name # the features selected to form the first component
## [1] "ZNF552" "KDM4B" "CCNA2" "LRIG1" "PREX1" "FUT8" "C4orf34"
## [8] "TTC39A" "ASPM" "SLC43A3" "MEX3A" "SEMA3C" "E2F1" "STC2"
## [15] "FMNL2"
Note: Verify the model design matches input. Selected variables should be printed for inspection.
plotDiablo(final.diablo.model, ncomp = 1)
FIGURE 3: Diagnostic plot from multiblock sPLS-DA applied on the breast.TCGA
study. Samples are represented based on the specified component (here ncomp = 1
) for each data set (mRNA, miRNA and protein). Samples are coloured by breast cancer subtype and 95% confidence ellipse plots are represented.
plotIndiv(final.diablo.model, ind.names = FALSE, legend = TRUE, title = 'DIABLO Sample Plots')
FIGURE 4: Sample plot from multiblock sPLS-DA performed on the breast.TCGA
study. The samples are plotted according to their scores on the first 2 components for each data set. Samples are coloured by cancer subtype
plotArrow(final.diablo.model, ind.names = FALSE, legend = TRUE, title = 'DIABLO')
FIGURE 5: Arrow plot from multiblock sPLS-DA performed on the breast.TCGA
study. The samples are projected into the space spanned by the first two components for each data set then overlaid across data sets.
plotVar(final.diablo.model, var.names = FALSE, style = 'graphics', legend = TRUE,
pch = c(16, 17, 15), cex = c(2,2,2), col = c('darkorchid', 'brown1', 'lightgreen'))
FIGURE 6: Correlation circle plot from multiblock sPLS-DA performed on the breast.TCGA
study. Variable types are indicated with different symbols and colours, and are overlaid on the same plot.
circosPlot(final.diablo.model, cutoff = 0.7, line = TRUE,
color.blocks= c('darkorchid', 'brown1', 'lightgreen'),
color.cor = c("chocolate3","grey20"), size.labels = 1.5)
FIGURE 7: Circos plot from multiblock sPLS-DA performed on the breast.TCGA
study. The plot represents the correlations greater than 0.7 between variables of different types, represented on the side quadrants
network(final.diablo.model, blocks = c(1,2,3),
color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)
FIGURE 8: Relevance network for the variables selected by multiblock sPLS-DA performed on the breast.TCGA
study on component 1. Each node represents a selected with colours indicating their type. The colour of the edges represent positive or negative correlations
Optionally, export the network (eval=FALSE).
library(igraph)
my.network = network(final.diablo.model, blocks = c(1,2,3),
color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)
write.graph(my.network$gR, file = "myNetwork.gml", format = "gml")
plotLoadings(final.diablo.model, comp = 2, contrib = 'max', method = 'median')
FIGURE 9: Loading plot for the variables selected by multiblock sPLS-DA performed on the breast.TCGA
study on component 1. The most important variables (according to the absolute value of their coefficients) are ordered from bottom to top. As this is a supervised analysis, colours indicate the class for which the median expression value is the highest for each feature
cimDiablo(final.diablo.model)
##
## trimming values to [-3, 3] range for cim visualisation. See 'trim' arg in ?cimDiablo
FIGURE 10: Clustered Image Map for the variables selected by multiblock sPLS-DA performed on the breast.TCGA
study on component 1. By default, Euclidean distance and Complete linkage methods are used. The CIM represents samples in rows (indicated by their breast cancer subtype on the left hand side of the plot) and selected features in columns (indicated by their data type at the top of the plot).
Note: These visualizations help interpret the model. Check for separation by subtypes and strong correlations.
Perform repeated cross-validation on the final model.
perf.diablo = perf(final.diablo.model, validation = 'Mfold', M = 10, nrepeat = 10,
dist = 'centroids.dist') # run repeated CV performance evaluation
# Print error rates
perf.diablo$MajorityVote.error.rate
## $centroids.dist
## comp1 comp2
## Basal 0.02888889 0.04666667
## Her2 0.20666667 0.08666667
## LumA 0.05466667 0.05066667
## Overall.ER 0.07733333 0.05666667
## Overall.BER 0.09674074 0.06133333
perf.diablo$WeightedVote.error.rate
## $centroids.dist
## comp1 comp2
## Basal 0.01333333 0.03777778
## Her2 0.14000000 0.08333333
## LumA 0.05466667 0.04533333
## Overall.ER 0.05933333 0.05066667
## Overall.BER 0.06933333 0.05548148
auc.splsda = auroc(final.diablo.model, roc.block = "miRNA", roc.comp = 2, print = FALSE)
FIGURE 11: ROC and AUC based on multiblock sPLS-DA performed on the breast.TCGA
study for the miRNA data set after 2 components. The function calculates the ROC curve and AUC for one class vs. the others.
Note: Low error rates and high AUC indicate good performance. Inspect for overfitting.
Prepare test data and predict.
data.test.TCGA = list(mRNA = breast.TCGA$data.test$mrna,
miRNA = breast.TCGA$data.test$mirna)
predict.diablo = predict(final.diablo.model, newdata = data.test.TCGA)
confusion.mat = get.confusion_matrix(truth = breast.TCGA$data.test$subtype,
predicted = predict.diablo$WeightedVote$centroids.dist[,2])
# Print confusion matrix
confusion.mat
## predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal 19 2 0
## Her2 0 14 0
## LumA 0 1 34
# Print Balanced Error Rate
get.BER(confusion.mat)
## [1] 0.04126984
Note: The confusion matrix and BER assess predictive accuracy on unseen data.
rmarkdown::render("case_study/DIABLO_TCGA_Case_Study.Rmd")
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Rocky Linux 8.10 (Green Obsidian)
##
## Matrix products: default
## BLAS/LAPACK: /opt/apps/OpenBLAS/0.3.26/lib/libopenblas_zenp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/Los_Angeles
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] mixOmics_6.30.0 ggplot2_3.5.2 lattice_0.22-6 MASS_7.3-61
## [5] knitr_1.49
##
## loaded via a namespace (and not attached):
## [1] tidyr_1.3.1 sass_0.4.9 utf8_1.2.4
## [4] generics_0.1.3 renv_1.1.4 stringi_1.8.4
## [7] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1
## [10] grid_4.4.2 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] plyr_1.8.9 jsonlite_1.8.9 Matrix_1.7-1
## [16] ggrepel_0.9.6 RSpectra_0.16-2 gridExtra_2.3
## [19] BiocManager_1.30.26 purrr_1.0.2 fansi_1.0.6
## [22] scales_1.3.0 codetools_0.2-20 jquerylib_0.1.4
## [25] cli_3.6.3 rlang_1.1.4 munsell_0.5.1
## [28] withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [31] ellipse_0.5.0 tools_4.4.2 parallel_4.4.2
## [34] reshape2_1.4.4 BiocParallel_1.40.0 dplyr_1.1.4
## [37] colorspace_2.1-1 corpcor_1.6.10 vctrs_0.6.5
## [40] R6_2.5.1 matrixStats_1.4.1 lifecycle_1.0.4
## [43] stringr_1.5.1 pkgconfig_2.0.3 pillar_1.9.0
## [46] bslib_0.8.0 gtable_0.3.6 glue_1.8.0
## [49] rARPACK_0.11-0 Rcpp_1.0.13-1 xfun_0.49
## [52] tibble_3.2.1 tidyselect_1.2.1 farver_2.1.2
## [55] htmltools_0.5.8.1 igraph_2.1.1 labeling_0.4.3
## [58] rmarkdown_2.29 compiler_4.4.2