DIABLO: Multi-Omics Integration with mixOmics on Breast TCGA Data

Let’s use mixOmics package for multi-omics data integration using the DIABLO (Data Integration Analysis for Biomarker discovery using Latent cOmponents) method on the breast TCGA dataset. We will go through data loading, pairwise PLS models, design matrix setup, model tuning, visualization, and performance evaluation. Checks(e.g., dimension prints, summaries) are included and printouts to verify each step.

Setup

Load Libraries and Data

First, load the mixOmics library and set a seed for reproducibility. Then, load the breast TCGA data and prepare the input lists.

library(mixOmics) # import the mixOmics library

set.seed(123) # for reproducibility, remove for normal use

data(breast.TCGA) # load in the data

data = list(miRNA = breast.TCGA$data.train$mirna, # set a list of all the X dataframes
            mRNA = breast.TCGA$data.train$mrna,
            proteomics = breast.TCGA$data.train$protein)

# Check: Print dimensions of each dataset
lapply(data, dim) # check their dimensions

## $miRNA
## [1] 150 184
## 
## $mRNA
## [1] 150 200
## 
## $proteomics
## [1] 150 142

Y = breast.TCGA$data.train$subtype # set the response variable as the Y dataframe

# Check: Summary of the response variable
summary(Y)

## Basal  Her2  LumA 
##    45    30    75

Note: Verify the dimensions match expectations (e.g., same number of samples across omics). The summary shows the distribution of subtypes.

Pairwise PLS Models

We build pairwise sparse PLS models to explore correlations between omics datasets. We plot circle correlation plots for visualization.

list.keepX = c(25, 25) # select arbitrary values of features to keep
list.keepY = c(25, 25)

pls1 <- spls(data[["miRNA"]], data[["mRNA"]], keepX = list.keepX, keepY = list.keepY) # generate three pairwise PLS models
pls2 <- spls(data[["miRNA"]], data[["proteomics"]], keepX = list.keepX, keepY = list.keepY)
pls3 <- spls(data[["mRNA"]], data[["proteomics"]], keepX = list.keepX, keepY = list.keepY)

plotVar(pls1, cutoff = 0.5, title = "(a) miRNA vs mRNA", legend = c("miRNA", "mRNA"), # plot features of first PLS
        var.names = FALSE, style = 'graphics', 
        pch = c(16, 17), cex = c(2,2), 
        col = c('darkorchid', 'lightgreen'))

plotVar(pls2, cutoff = 0.5, title = "(a) miRNA vs proteomics", legend = c("miRNA", "proteomics"), # plot features of second PLS
        var.names = FALSE, style = 'graphics', 
        pch = c(16, 17), cex = c(2,2), 
        col = c('darkorchid', 'lightgreen'))

plotVar(pls3, cutoff = 0.5, title = "(a) mRNA vs proteomics", legend = c("mRNA", "proteomics"), # plot features of third PLS
        var.names = FALSE, style = 'graphics', 
        pch = c(16, 17), cex = c(2,2), 
        col = c('darkorchid', 'lightgreen'))

FIGURE 1: Circle Correlation Plots for pairwise PLS models on the breast TCGA data. Only displays the top 25 features for each dimension, subsetting by those with a correlation above 0.5.

FIGURE 1: Circle Correlation Plots for pairwise PLS models on the breast TCGA data. Only displays the top 25 features for each dimension, subsetting by those with a correlation above 0.5.

Note: These plots show correlations between features. Check if the models converged properly by inspecting the outputs.

For the next calculations, we adjust to single component for correlation checks.

Check Correlations

Calculate and print correlations between variates from pairwise models.

cor(pls1$variates$X, pls1$variates$Y) # calculate correlation of miRNA and mRNA

##           comp1
## comp1 0.8841792

cor(pls2$variates$X, pls2$variates$Y) # calculate correlation of miRNA and proteins

##           comp1
## comp1 0.8361993

cor(pls3$variates$X, pls3$variates$Y) # calculate correlation of mRNA and proteins

##           comp1
## comp1 0.9360264

Note: Higher correlations indicate stronger associations between omics layers. Values above 0.5 are typically good.

Design Matrix

Create a design matrix for the DIABLO model.

design = matrix(0.1, ncol = length(data), nrow = length(data), # for square matrix filled with 0.1s
                dimnames = list(names(data), names(data)))
diag(design) = 0 # set diagonal to 0s

# Print the design matrix
design

##            miRNA mRNA proteomics
## miRNA        0.0  0.1        0.1
## mRNA         0.1  0.0        0.1
## proteomics   0.1  0.1        0.0

Note: The design matrix specifies connections between blocks. Off-diagonals are set to 0.1 for weak connections.

Basic DIABLO Model

Fit a basic DIABLO model with 5 components.

basic.diablo.model = block.splsda(X = data, Y = Y, ncomp = 5, design = design) # form basic DIABLO model

Tune Number of Components

Use cross-validation to choose the optimal number of components.

perf.diablo = perf(basic.diablo.model, validation = 'Mfold', folds = 10, nrepeat = 10) # run component number tuning with repeated CV

plot(perf.diablo) # plot output of tuning

$FIGURE 2: Choosing the number of components in `block.plsda` using `perf()` with 10 × 10-fold CV function in the `breast.TCGA` study. Classification error rates (overall and balanced, see Section 7.3) are represented on the y-axis with respect to the number of components on the x-axis for each prediction distance presented in PLS-DA$

FIGURE 2: Choosing the number of components in block.plsda using perf() with 10 × 10-fold CV function in the breast.TCGA study. Classification error rates (overall and balanced, see Section 7.3) are represented on the y-axis with respect to the number of components on the x-axis for each prediction distance presented in PLS-DA

ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"] # set the optimal ncomp value

# Print the optimal choices
perf.diablo$choice.ncomp$WeightedVote # show the optimal choice for ncomp for each dist metric

##             max.dist centroids.dist mahalanobis.dist
## Overall.ER         3              2                3
## Overall.BER        3              2                3

Note: The plot helps select ncomp where error rates stabilize. Printout shows choices per metric.

Tune Number of Features

Tune the number of features to keep per component. (Note: This chunk is set to eval=FALSE for time-saving; load precomputed results.)

# Use the existing design matrix already created  

test.keepX <- list(
  mRNA       = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
  miRNA      = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
  proteomics = c(5:9, seq(10, 18, 2), seq(20, 30, 5))
)

tune.TCGA <- tune.block.splsda(
  X          = data, 
  Y          = Y, 
  ncomp      = ncomp,
  test.keepX = test.keepX, 
  design     = design,
  validation = "Mfold", 
  folds      = 10, 
  nrepeat    = 1,
  dist       = "centroids.dist"
)

Load precomputed tuning results.

list.keepX = tune.TCGA$choice.keepX # set the optimal values of features to retain

# Print the optimal keepX
list.keepX

## $mRNA
## [1] 15
## 
## $miRNA
## [1] 15
## 
## $proteomics
## [1] 10

Tuning selects sparse features. Check the list for reasonableness (e.g., not too few or many).

Final DIABLO Model

Fit the final optimized DIABLO model.

final.diablo.model = block.splsda(X = data, Y = Y, ncomp = ncomp, # set the optimised DIABLO model
                          keepX = list.keepX, design = design)

# Check: Print the design matrix of the final model
final.diablo.model$design # design matrix for the final model

##            miRNA mRNA proteomics Y
## miRNA        0.0  0.1        0.1 1
## mRNA         0.1  0.0        0.1 1
## proteomics   0.1  0.1        0.0 1
## Y            1.0  1.0        1.0 0

# Check: Selected variables for mRNA on component 1
selectVar(final.diablo.model, block = 'mRNA', comp = 1)$mRNA$name # the features selected to form the first component

##  [1] "ZNF552"  "KDM4B"   "CCNA2"   "LRIG1"   "PREX1"   "FUT8"    "C4orf34"
##  [8] "TTC39A"  "ASPM"    "SLC43A3" "MEX3A"   "SEMA3C"  "E2F1"    "STC2"   
## [15] "FMNL2"

Note: Verify the model design matches input. Selected variables should be printed for inspection.

Visualizations

Diagnostic Plot

plotDiablo(final.diablo.model, ncomp = 1)

FIGURE 3: Diagnostic plot from multiblock sPLS-DA applied on the breast.TCGA study. Samples are represented based on the specified component (here ncomp = 1) for each data set (mRNA, miRNA and protein). Samples are coloured by breast cancer subtype and 95% confidence ellipse plots are represented.

Sample Plot

plotIndiv(final.diablo.model, ind.names = FALSE, legend = TRUE, title = 'DIABLO Sample Plots')

FIGURE 4: Sample plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are plotted according to their scores on the first 2 components for each data set. Samples are coloured by cancer subtype

Arrow Plot

plotArrow(final.diablo.model, ind.names = FALSE, legend = TRUE, title = 'DIABLO')

FIGURE 5: Arrow plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are projected into the space spanned by the first two components for each data set then overlaid across data sets.

Correlation Circle Plot

plotVar(final.diablo.model, var.names = FALSE, style = 'graphics', legend = TRUE,
        pch = c(16, 17, 15), cex = c(2,2,2), col = c('darkorchid', 'brown1', 'lightgreen'))

FIGURE 6: Correlation circle plot from multiblock sPLS-DA performed on the breast.TCGA study. Variable types are indicated with different symbols and colours, and are overlaid on the same plot.

Circos Plot

circosPlot(final.diablo.model, cutoff = 0.7, line = TRUE,
           color.blocks= c('darkorchid', 'brown1', 'lightgreen'),
           color.cor = c("chocolate3","grey20"), size.labels = 1.5)

FIGURE 7: Circos plot from multiblock sPLS-DA performed on the breast.TCGA study. The plot represents the correlations greater than 0.7 between variables of different types, represented on the side quadrants

Relevance Network

network(final.diablo.model, blocks = c(1,2,3),
        color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)

FIGURE 8: Relevance network for the variables selected by multiblock sPLS-DA performed on the breast.TCGA study on component 1. Each node represents a selected with colours indicating their type. The colour of the edges represent positive or negative correlations

Optionally, export the network (eval=FALSE).

library(igraph)
my.network = network(final.diablo.model, blocks = c(1,2,3),
        color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)
write.graph(my.network$gR, file = "myNetwork.gml", format = "gml")

Loading Plot

plotLoadings(final.diablo.model, comp = 2, contrib = 'max', method = 'median')

FIGURE 9: Loading plot for the variables selected by multiblock sPLS-DA performed on the breast.TCGA study on component 1. The most important variables (according to the absolute value of their coefficients) are ordered from bottom to top. As this is a supervised analysis, colours indicate the class for which the median expression value is the highest for each feature

Clustered Image Map

cimDiablo(final.diablo.model)

## 
## trimming values to [-3, 3] range for cim visualisation. See 'trim' arg in ?cimDiablo

FIGURE 10: Clustered Image Map for the variables selected by multiblock sPLS-DA performed on the breast.TCGA study on component 1. By default, Euclidean distance and Complete linkage methods are used. The CIM represents samples in rows (indicated by their breast cancer subtype on the left hand side of the plot) and selected features in columns (indicated by their data type at the top of the plot).

Note: These visualizations help interpret the model. Check for separation by subtypes and strong correlations.

Performance Evaluation

Perform repeated cross-validation on the final model.

perf.diablo = perf(final.diablo.model, validation = 'Mfold', M = 10, nrepeat = 10, 
                   dist = 'centroids.dist') # run repeated CV performance evaluation

# Print error rates
perf.diablo$MajorityVote.error.rate

## $centroids.dist
##                  comp1      comp2
## Basal       0.02888889 0.04666667
## Her2        0.20666667 0.08666667
## LumA        0.05466667 0.05066667
## Overall.ER  0.07733333 0.05666667
## Overall.BER 0.09674074 0.06133333

perf.diablo$WeightedVote.error.rate

## $centroids.dist
##                  comp1      comp2
## Basal       0.01333333 0.03777778
## Her2        0.14000000 0.08333333
## LumA        0.05466667 0.04533333
## Overall.ER  0.05933333 0.05066667
## Overall.BER 0.06933333 0.05548148

ROC and AUC

auc.splsda = auroc(final.diablo.model, roc.block = "miRNA", roc.comp = 2, print = FALSE)

FIGURE 11: ROC and AUC based on multiblock sPLS-DA performed on the breast.TCGA study for the miRNA data set after 2 components. The function calculates the ROC curve and AUC for one class vs. the others.

Note: Low error rates and high AUC indicate good performance. Inspect for overfitting.

Prediction on Test Data

Prepare test data and predict.

data.test.TCGA = list(mRNA = breast.TCGA$data.test$mrna,
                      miRNA = breast.TCGA$data.test$mirna)

predict.diablo = predict(final.diablo.model, newdata = data.test.TCGA)

confusion.mat = get.confusion_matrix(truth = breast.TCGA$data.test$subtype,
                     predicted = predict.diablo$WeightedVote$centroids.dist[,2])

# Print confusion matrix
confusion.mat

##       predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal                 19                 2                 0
## Her2                   0                14                 0
## LumA                   0                 1                34

# Print Balanced Error Rate
get.BER(confusion.mat)

## [1] 0.04126984

Note: The confusion matrix and BER assess predictive accuracy on unseen data.

Session Information

render the Rmd with rmarkdown::render("case_study/DIABLO_TCGA_Case_Study.Rmd")

sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Rocky Linux 8.10 (Green Obsidian)
## 
## Matrix products: default
## BLAS/LAPACK: /opt/apps/OpenBLAS/0.3.26/lib/libopenblas_zenp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Los_Angeles
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] mixOmics_6.30.0 ggplot2_3.5.2   lattice_0.22-6  MASS_7.3-61    
## [5] knitr_1.49     
## 
## loaded via a namespace (and not attached):
##  [1] tidyr_1.3.1         sass_0.4.9          utf8_1.2.4         
##  [4] generics_0.1.3      renv_1.1.4          stringi_1.8.4      
##  [7] digest_0.6.37       magrittr_2.0.3      evaluate_1.0.1     
## [10] grid_4.4.2          RColorBrewer_1.1-3  fastmap_1.2.0      
## [13] plyr_1.8.9          jsonlite_1.8.9      Matrix_1.7-1       
## [16] ggrepel_0.9.6       RSpectra_0.16-2     gridExtra_2.3      
## [19] BiocManager_1.30.26 purrr_1.0.2         fansi_1.0.6        
## [22] scales_1.3.0        codetools_0.2-20    jquerylib_0.1.4    
## [25] cli_3.6.3           rlang_1.1.4         munsell_0.5.1      
## [28] withr_3.0.2         cachem_1.1.0        yaml_2.3.10        
## [31] ellipse_0.5.0       tools_4.4.2         parallel_4.4.2     
## [34] reshape2_1.4.4      BiocParallel_1.40.0 dplyr_1.1.4        
## [37] colorspace_2.1-1    corpcor_1.6.10      vctrs_0.6.5        
## [40] R6_2.5.1            matrixStats_1.4.1   lifecycle_1.0.4    
## [43] stringr_1.5.1       pkgconfig_2.0.3     pillar_1.9.0       
## [46] bslib_0.8.0         gtable_0.3.6        glue_1.8.0         
## [49] rARPACK_0.11-0      Rcpp_1.0.13-1       xfun_0.49          
## [52] tibble_3.2.1        tidyselect_1.2.1    farver_2.1.2       
## [55] htmltools_0.5.8.1   igraph_2.1.1        labeling_0.4.3     
## [58] rmarkdown_2.29      compiler_4.4.2