In [13]:
# Standard library
import gc
import warnings
from typing import Union, Optional, Tuple, List, Dict, Any
from pathlib import Path
import pickle
import base64
# Core data science
import numpy as np
import pandas as pd
# Scikit-learn
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
from PIL import Image
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
DATA_DIR = Path("/share/crsp/lab/pkaiser/ddlin/single-cell-multimodal-ml/data")
RAW_DIR = DATA_DIR.joinpath("raw")
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
FP_CITE_TRAIN_INPUTS = RAW_DIR.joinpath("train_cite_inputs.h5")
FP_CITE_TRAIN_TARGETS = RAW_DIR.joinpath("train_cite_targets.h5")
FP_CITE_TEST_INPUTS = RAW_DIR.joinpath("test_cite_inputs.h5")
FP_IMPORTANT_COLS = PROCESSED_DIR.joinpath("important_cols.txt")
FP_CONSTANT_COLS = PROCESSED_DIR.joinpath("constant_cols.txt")
FP_EVALUATION_IDS = RAW_DIR.joinpath("evaluation_ids.csv")
FP_CELL_METADATA = RAW_DIR.joinpath("metadata.csv")
VERBOSE = 0
Keras for single cell CITE-seq & Multiome¶
In [18]:
# Open and display
img = Image.open(DATA_DIR.joinpath("images", "multiome.jpg"))
plt.figure(figsize=(10, 8))
plt.imshow(img)
plt.axis('off')
plt.show()
Understanding how a single genome produces diverse cellular states is essential for deciphering the mechanisms of tissue function or dysfunction in health and disease. This study addresses this fundamental challenge in single-cell biology.¶
- The dataset consists of single-cell multiomics data from mobilized peripheral CD34+ hematopoietic stem and progenitor cells (HSPCs) obtained from four healthy human donors.
- At each sampling time point from each culture plate, cells were collected and analyzed using two single-cell assays: the 10x Chromium Single Cell Multiome ATAC + Gene Expression technology (Multiome) and the 10x Genomics Single Cell Gene Expression with Feature Barcoding technology, utilizing the TotalSeq™-B Human Universal Cocktail, V1.0 (CITE-seq).
Assay Data Modalities¶
10x Chromium Single Cell Multiome ATAC + Gene Expression
- Chromatin accessibility (epigenomic modality) via ATAC‑seq: measures which regions of the genome are in open (accessible) chromatin, reflecting putative regulatory elements (enhancers, promoters).
- mRNA expression (transcriptomic modality) via 3′‑end gene expression profiling: quantifies poly‑A transcripts in each cell.
10x Genomics Single Cell Gene Expression + TotalSeq™‑B Feature Barcoding (CITE‑seq)
- mRNA expression (transcriptomic modality) via the same 3′‑end gene expression chemistry as above.
- Surface protein abundance (proteomic modality) via oligo‑conjugated antibodies (TotalSeq™‑B): each antibody carries a DNA barcode that’s sequenced alongside the transcriptome, giving you per‑cell counts for each targeted protein.
ML Question Framing¶
- Multiome: Predict gene expression levels based on chromatin accessibility data.
- CITE-seq: Predict surface protein levels from gene expression data.
------ CITEseq MODEL ---------¶
Data Preprocessing¶
- Load metadata and key features (mRNA levels of cell surface proteins).
- Apply dimension reduction using TruncatedSVD for sparse matrices.
- Reindex data for training.
In [4]:
# Read in the metadata
metadata_df = pd.read_csv(DATA_DIR.joinpath("raw", "metadata.csv"),index_col='cell_id')
metadata_df = metadata_df[metadata_df.technology == "citeseq"]
metadata_df.head()
Out[4]:
day | donor | cell_type | technology | |
---|---|---|---|---|
cell_id | ||||
c2150f55becb | 2 | 27678 | HSC | citeseq |
65b7edf8a4da | 2 | 27678 | HSC | citeseq |
c1b26cb1057b | 2 | 27678 | EryP | citeseq |
917168fa6f83 | 2 | 27678 | NeuP | citeseq |
2b29feeca86d | 2 | 27678 | EryP | citeseq |
In [7]:
# Load the important cols (cell surface protein) and constant cols (cols with no info)
with open(FP_IMPORTANT_COLS, "r") as f:
important_cols = [line.strip() for line in f]
with open(FP_CONSTANT_COLS, 'r') as s:
constant_cols = [line.strip() for line in s]
print("First 5 important columns:", important_cols[:5])
print("First 5 constant columns:", constant_cols[:5])
First 5 important columns: ['ENSG00000135218_CD36', 'ENSG00000010278_CD9', 'ENSG00000204287_HLA-DRA', 'ENSG00000117091_CD48', 'ENSG00000004468_CD38'] First 5 constant columns: ['ENSG00000003137_CYP26B1', 'ENSG00000004848_ARX', 'ENSG00000006606_CCL26', 'ENSG00000010379_SLC6A13', 'ENSG00000010932_FMO1']
In [8]:
# Boolean flag: if True, compute & save SVD; if False, load precomputed arrays
RUN_SVD = False
PROCESSED_DIR = DATA_DIR / "processed"
TRAIN_PKL = PROCESSED_DIR / "train_Citeseq_truncated_512.pkl"
TEST_PKL = PROCESSED_DIR / "test_Citeseq_truncated_512.pkl"
# Read the raw cite training data
X_raw = pd.read_hdf(DATA_DIR / "raw" / "train_cite_inputs.h5").drop(columns=constant_cols)
# Get the index, use it to arrange the metadata
cell_index = X_raw.index
meta = metadata_df.reindex(cell_index)
# Extract impotant cell surface proteins RNA expression levels
X0 = X_raw[important_cols].values
# Clean up the large file for mem management
del X_raw
gc.collect()
# Read test and convert to sparse matrix
Xt = pd.read_hdf(FP_CITE_TEST_INPUTS).drop(columns = constant_cols)
cell_index_test = Xt.index
meta_test = metadata_df.reindex(cell_index_test)
X0t = Xt[important_cols].values
del Xt
gc.collect()
# Normalize the data, fit on train and apply to both train and test
st = StandardScaler()
X0 = st.fit_transform(X0)
X0t = st.transform(X0t)
if RUN_SVD:
print("Computing SVD embeddings...")
# Dimension of SVD
n_components = 512
svd = TruncatedSVD(n_components=n_components, random_state=42)
# 1. Load raw, drop constant cols, reindex
X_raw = pd.read_hdf(DATA_DIR / "raw" / "train_cite_inputs.h5").drop(columns=constant_cols)
Xt_raw = pd.read_hdf(DATA_DIR / "raw" / "test_cite_inputs.h5").drop(columns=constant_cols)
# 2. Fit SVD on training, transform both
X = svd.fit_transform(X_raw)
Xt = svd.transform(Xt_raw)
# 3. Persist to disk
PROCESSED_DIR.mkdir(exist_ok=True)
with open(TRAIN_PKL, "wb") as f:
pickle.dump(X, f)
with open(TEST_PKL, "wb") as f:
pickle.dump(Xt, f)
# 4. Cleanup
del X_raw, Xt_raw
gc.collect()
print(f"Computed and saved SVD embeddings X (512 PCs): {X.shape}, X_test: {Xt.shape}")
else:
# Load precomputed embeddings
with open(TRAIN_PKL, "rb") as f:
X = pickle.load(f)
with open(TEST_PKL, "rb") as f:
Xt = pickle.load(f)
print(f"Loaded precomputed SVD embeddings X (512 PCs): {X.shape}, X_test: {Xt.shape}")
Loaded precomputed SVD embeddings X (512 PCs): (70988, 512), X_test: (48663, 512)
Target normalization¶
- Our target is the protein expression levels for each cell in the training set
- We will predict protein expression from RNA expression embeddings
In [11]:
# Load targets, our target is the protein expression levels for each cell in the training set
print(f'Reading CITE-seq targets from {FP_CITE_TRAIN_TARGETS}')
# Note: This file contains the protein expression levels for each cell in the training set
Y = pd.read_hdf(FP_CITE_TRAIN_TARGETS)
Y.head()
Reading CITE-seq targets from /share/crsp/lab/pkaiser/ddlin/single-cell-multimodal-ml/data/raw/train_cite_targets.h5
Out[11]:
gene_id | CD86 | CD274 | CD270 | CD155 | CD112 | CD47 | CD48 | CD40 | CD154 | CD52 | ... | CD94 | CD162 | CD85j | CD23 | CD328 | HLA-E | CD82 | CD101 | CD88 | CD224 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cell_id | |||||||||||||||||||||
45006fe3e4c8 | 1.167804 | 0.622530 | 0.106959 | 0.324989 | 3.331674 | 6.426002 | 1.480766 | -0.728392 | -0.468851 | -0.073285 | ... | -0.448390 | 3.220174 | -0.533004 | 0.674956 | -0.006187 | 0.682148 | 1.398105 | 0.414292 | 1.780314 | 0.548070 |
d02759a80ba2 | 0.818970 | 0.506009 | 1.078682 | 6.848758 | 3.524885 | 5.279456 | 4.930438 | 2.069372 | 0.333652 | -0.468088 | ... | 0.323613 | 8.407108 | 0.131301 | 0.047607 | -0.243628 | 0.547864 | 1.832587 | 0.982308 | 2.736507 | 2.184063 |
c016c6b0efa5 | -0.356703 | -0.422261 | -0.824493 | 1.137495 | 0.518924 | 7.221962 | -0.375034 | 1.738071 | 0.142919 | -0.971460 | ... | 1.348692 | 4.888579 | -0.279483 | -0.131097 | -0.177604 | -0.689188 | 9.013709 | -1.182975 | 3.958148 | 2.868600 |
ba7f733a4f75 | -1.201507 | 0.149115 | 2.022468 | 6.021595 | 7.258670 | 2.792436 | 21.708519 | -0.137913 | 1.649969 | -0.754680 | ... | 1.504426 | 12.391979 | 0.511394 | 0.587863 | -0.752638 | 1.714851 | 3.893782 | 1.799661 | 1.537249 | 4.407671 |
fbcf2443ffb2 | -0.100404 | 0.697461 | 0.625836 | -0.298404 | 1.369898 | 3.254521 | -1.659380 | 0.643531 | 0.902710 | 1.291877 | ... | 0.777023 | 6.496499 | 0.279898 | -0.841950 | -0.869419 | 0.675092 | 5.259685 | -0.835379 | 9.631781 | 1.765445 |
5 rows × 140 columns
In [12]:
# Normalize Y by subtracting the mean and dividing by the standard deviation for each cell
# This is important for training stability and performance
Y = Y.values
Y -= Y.mean(axis=1).reshape(-1, 1)
Y /= Y.std(axis=1).reshape(-1, 1)
Y.shape
Out[12]:
(70988, 140)
Let's keep only some features¶
In [ ]:
# Concatenate the X with 75 SVD embeddings and X0 with selected cell surface proteins RNA expression levels
# These are information from RNA Expression
X = np.hstack((X[:,:75],X0))
X.shape
Out[ ]:
(70988, 159)
Tensorflow Keras librairies¶
In [14]:
import math
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping
from tensorflow.keras.layers import Dense, Input, Concatenate, Dropout, BatchNormalization
# Check if GPU is available
if tf.config.list_physical_devices('GPU'):
print("GPU is available")
else:
print("GPU is not available, using CPU")
2025-07-22 18:53:53.876489: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used. 2025-07-22 18:53:54.439562: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used. 2025-07-22 18:53:54.602645: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1753235634.721339 3305803 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1753235634.741439 3305803 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered W0000 00:00:1753235635.241123 3305803 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. W0000 00:00:1753235635.241136 3305803 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. W0000 00:00:1753235635.241138 3305803 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. W0000 00:00:1753235635.241141 3305803 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. 2025-07-22 18:53:55.260339: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
GPU is not available, using CPU
2025-07-22 18:54:02.545338: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
Metric and loss function¶
In [15]:
def correlation_score(y_true, y_pred):
"""
Compute the mean Pearson correlation coefficient between true and predicted values.
Parameters:
- y_true: Ground truth values (numpy array or pandas DataFrame) of shape (n_samples, n_targets).
- y_pred: Predicted values (numpy array or pandas DataFrame) of shape (n_samples, n_targets).
Returns:
- float: Mean Pearson correlation coefficient across all samples.
"""
# Convert pandas DataFrame to numpy array if needed
if type(y_true) == pd.DataFrame:
y_true = y_true.values
if type(y_pred) == pd.DataFrame:
y_pred = y_pred.values
corrsum = 0 # Initialize sum of correlation coefficients
# Iterate over each sample (row) to compute correlation
for i in range(len(y_true)):
# Calculate Pearson correlation coefficient for the i-th sample
# np.corrcoef returns a 2x2 matrix; [1,0] is the off-diagonal (true vs. pred)
corrsum += np.corrcoef(y_true[i], y_pred[i])[1, 0]
# Return the mean correlation across all samples
return corrsum / len(y_true)
def negative_correlation_loss(y_true, y_pred):
"""
Custom Keras loss function to compute the negative mean Pearson correlation coefficient.
Parameters:
- y_true: Ground truth tensor of shape (batch_size, n_targets).
- y_pred: Predicted tensor of shape (batch_size, n_targets).
Returns:
- Tensor: Negative mean Pearson correlation coefficient (to be minimized).
"""
# Compute mean of predictions across features (axis=1) for each sample
my = K.mean(tf.convert_to_tensor(y_pred), axis=1)
# Reshape mean to (batch_size, 1) and tile to match y_true shape (batch_size, n_targets)
my = tf.tile(tf.expand_dims(my, axis=1), (1, y_true.shape[1]))
# Center predictions by subtracting the mean (y_pred - mean(y_pred))
ym = y_pred - my
# Numerator: Sum of element-wise product of true and centered predicted values
r_num = K.sum(tf.multiply(y_true, ym), axis=1)
# Denominator: Product of standard deviation of predictions and sqrt(n_targets)
r_den = tf.sqrt(K.sum(K.square(ym), axis=1) * float(y_true.shape[-1]))
# Compute mean Pearson correlation coefficient across the batch
r = tf.reduce_mean(r_num / r_den)
# Return negative correlation (to minimize as a loss)
return -r
Model and Parameters¶
- A basic "wide-and-deep" MLP model.
- Hyperparameters optimized using Keras Tuner.
- Further architectural options could be investigated.
- We aim to predict protein expression from RNA expression data.
In [16]:
LR_START = 0.01
BATCH_SIZE = 512
def create_model():
# regularizers & dropout rate
reg1 = 9.613e-06
reg2 = 1e-07
REG1 = tf.keras.regularizers.l2(reg1)
REG2 = tf.keras.regularizers.l2(reg2)
DROP = 0.1
activation = 'selu'
inputs = Input(shape=(X.shape[1],))
# four successive Dense→Dropout blocks
x0 = Dense(256, kernel_regularizer=REG1, activation=activation)(inputs)
x0 = Dropout(DROP)(x0)
x1 = Dense(512, kernel_regularizer=REG1, activation=activation)(x0)
x1 = Dropout(DROP)(x1)
x2 = Dense(512, kernel_regularizer=REG1, activation=activation)(x1)
x2 = Dropout(DROP)(x2)
x3 = Dense(Y.shape[1], kernel_regularizer=REG1, activation=activation)(x2)
x3 = Dropout(DROP)(x3)
# concatenate all four intermediate outputs
x = Concatenate()([x0, x1, x2, x3])
# final linear layer to produce Y.shape[1] outputs
x = Dense(Y.shape[1], kernel_regularizer=REG2, activation='linear')(x)
return Model(inputs, x)
test_model = create_model()
test_model.summary()
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ Connected to ┃ ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ │ input_layer │ (None, 159) │ 0 │ - │ │ (InputLayer) │ │ │ │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ dense (Dense) │ (None, 256) │ 40,960 │ input_layer[0][0] │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ dropout (Dropout) │ (None, 256) │ 0 │ dense[0][0] │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ dense_1 (Dense) │ (None, 512) │ 131,584 │ dropout[0][0] │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ dropout_1 (Dropout) │ (None, 512) │ 0 │ dense_1[0][0] │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ dense_2 (Dense) │ (None, 512) │ 262,656 │ dropout_1[0][0] │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ dropout_2 (Dropout) │ (None, 512) │ 0 │ dense_2[0][0] │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ dense_3 (Dense) │ (None, 140) │ 71,820 │ dropout_2[0][0] │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ dropout_3 (Dropout) │ (None, 140) │ 0 │ dense_3[0][0] │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ concatenate │ (None, 1420) │ 0 │ dropout[0][0], │ │ (Concatenate) │ │ │ dropout_1[0][0], │ │ │ │ │ dropout_2[0][0], │ │ │ │ │ dropout_3[0][0] │ ├─────────────────────┼───────────────────┼────────────┼───────────────────┤ │ dense_4 (Dense) │ (None, 140) │ 198,940 │ concatenate[0][0] │ └─────────────────────┴───────────────────┴────────────┴───────────────────┘
Total params: 705,960 (2.69 MB)
Trainable params: 705,960 (2.69 MB)
Non-trainable params: 0 (0.00 B)
Training¶
- The correlation scores (~0.89 per fold and ~0.89 for out of fold) suggest the model is performing well for using the embedded RNA expression information to predict protein expression.
In [ ]:
%%time
# Create the model dir
model_dir = DATA_DIR.joinpath("models", "citeseq", "submissions")
model_dir.mkdir(parents=True, exist_ok=True)
# Also create the directory for weights
weights_dir = DATA_DIR.joinpath("models", "citeseq")
weights_dir.mkdir(parents=True, exist_ok=True)
EPOCHS = 300
N_SPLITS = 3
pred_train = np.zeros((Y.shape[0],Y.shape[1]))
np.random.seed(1)
tf.random.set_seed(1)
score_list = []
kf = GroupKFold(n_splits=N_SPLITS)
score_list = []
# GroupKFold to ensure that cells from the same donor are not in both train and validation sets, preventing data leakage
# This way, we can ensure that the model generalizes well to unseen data
for fold, (idx_tr, idx_va) in enumerate(kf.split(X, groups=meta.donor)):
start_time = datetime.datetime.now()
model = None
gc.collect()
X_tr = X[idx_tr]
y_tr = Y[idx_tr]
X_va = X[idx_va]
y_va = Y[idx_va]
lr = ReduceLROnPlateau(
monitor = "val_loss",
factor = 0.9,
patience = 4,
verbose = VERBOSE)
es = EarlyStopping(
monitor = "val_loss",
patience = 40,
verbose = VERBOSE,
mode = "min",
restore_best_weights = True)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
filepath = weights_dir.joinpath("citeseq.weights.h5"),
save_weights_only = True,
monitor = 'val_loss',
mode = 'min',
save_best_only = True)
callbacks = [
lr,
es,
model_checkpoint_callback
]
model = create_model()
model.compile(
optimizer = tf.keras.optimizers.Adam(learning_rate=LR_START),
metrics = [negative_correlation_loss],
loss = negative_correlation_loss
)
# Training
model.fit(
X_tr,
y_tr,
validation_data=(
X_va,
y_va),
epochs = EPOCHS,
verbose = VERBOSE,
batch_size = BATCH_SIZE,
shuffle = True,
callbacks = callbacks)
del X_tr, y_tr
gc.collect()
model.load_weights(weights_dir.joinpath("citeseq.weights.h5"))
model.save(model_dir.joinpath(f"model_{fold}.keras"))
print('model saved')
# Model validation
y_va_pred = model.predict(X_va)
corrscore = correlation_score(y_va, y_va_pred)
pred_train[idx_va] = y_va_pred
print(f"Fold {fold}, correlation = {corrscore:.5f}")
del X_va, y_va, y_va_pred
gc.collect()
score_list.append(corrscore)
# Show overall score
print(f"{Fore.GREEN}{Style.BRIGHT}Mean correlation = {np.array(score_list).mean():.5f}{Style.RESET_ALL}")
score_total = correlation_score(Y, pred_train)
print(f"{Fore.BLUE}{Style.BRIGHT}Out of fold correlation = {score_total:.5f}{Style.RESET_ALL}")
model saved 776/776 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step Fold 0, correlation = 0.89081 model saved 750/750 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step Fold 1, correlation = 0.89672 model saved 694/694 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step Fold 2, correlation = 0.89311 Mean correlation = 0.89355 Out of fold correlation = 0.89353 CPU times: user 1h 56min 21s, sys: 8min 18s, total: 2h 4min 40s Wall time: 15min 26s
Modeling & Training¶
A "wide-and-deep" Multi-Layer Perceptron (MLP) model was constructed using TensorFlow and Keras. The model architecture consists of several dense and dropout layers, with the final output predicting the levels of 140 surface proteins.
Key aspects of the model and training process are:
- Loss Function: The model is optimized using a
negative_correlation_loss
function, which aims to maximize the Pearson correlation between true and predicted values. - Evaluation Metric: The primary metric for success is the
correlation_score
, which calculates the mean Pearson correlation coefficient. - Training: The model is trained using a 3-fold cross-validation strategy grouped by donor to prevent data leakage and ensure generalization. The training process includes callbacks like
ReduceLROnPlateau
for adjusting the learning rate andEarlyStopping
to prevent overfitting. - Public Score: 0.81
Results¶
- The model demonstrates strong performance, achieving a mean out-of-fold correlation score of approximately 0.89, indicating a high degree of accuracy in predicting protein expression from RNA expression data.
In [ ]:
In [ ]:
Appendix: Data Detailes¶
- Multiome
- train/test_multi_inputs.h5: ATAC-seq peak counts processed with TF-IDF transformation using the standard log(TF) * log(IDF) output (chromatin accessibility). Rows represent cells, and columns correspond to genomic locations, identified by coordinates on the GRCh38 reference genome from 10x References - 2020-A (July 7, 2020).
- train_multi_targets.h5: RNA gene expression levels, normalized by library size and log1p transformed, for the same cells.
- CITE-seq
- train/test_cite_inputs.h5: RNA gene expression levels, normalized by library size and log1p transformed. Rows represent cells, and columns correspond to genes, labeled as {gene_name}_{gene_ensemble-ids}.
- train_cite_targets.h5: Surface protein levels for the same cells, normalized using the DSB method.
In [ ]:
# Public Score: 0.81, predictions have a Pearson correlation coefficient of 0.81 against the public test set.