Multi-Omic Analysis Plan: An AI-Driven Framework for Advancing CAR-T Programs¶

In [9]:
from PIL import Image
import matplotlib.pyplot as plt
from pathlib import Path
DATA_DIR = Path("/share/crsp/lab/pkaiser/ddlin/single-cell-multimodal-ml/data")
img = Image.open(DATA_DIR.joinpath("images", "Plan.png"))
plt.figure(figsize=(8, 6))
plt.imshow(img)
plt.axis('off')
plt.show()

1. Executive Summary

Following our insightful conversation, I've outlined a multi-phase strategic plan to leverage AstraZeneca's rich clinical and multi-omic datasets. The goal is to build a predictive and mechanistic framework that de-risks clinical development and accelerates the discovery of next-generation therapies. This plan directly addresses the key opportunities we discussed: integrating high-dimensional single-cell, cytokine, and clinical data to understand the tumor microenvironment (TME), predict patient outcomes, and identify biomarkers for efficacy, persistence, and safety (e.g., CRS).

The approach is grounded in a "Patient-Centric" philosophy, where all data modalities are integrated to create a holistic view of an individual's response to therapy. By combining established bioinformatics with advanced AI, we can build a learning system that grows more powerful with each new dataset, much like the integrative approach demonstrated in the recent Cell paper.

2. The Phased Analysis Plan

This is an iterative framework designed to deliver value at each stage.

Phase 1: Foundational Data Integration & Harmonization¶

The first step is to create a unified analytical substrate from our diverse data sources.

  • Objective: To integrate scRNA-seq, TCR-seq, CITE-seq, cytokine panels, and clinical metadata into a cohesive data object for each patient.

  • Methodology:

    1. QC & Pre-processing: Employ rigorous, standardized pipelines (e.g., using Scanpy/Seurat) for each data type to handle QC, normalization, and filtering (e.g., doublet detection, mitochondrial content).
    2. Harmonization: Use methods like Harmony or scVI to correct for batch effects across different patient samples and experimental runs.
    3. Unsupervised Integration: Utilize MOFA+ to decompose the integrated dataset into a set of "latent factors." These factors represent the main drivers of biological variation across all data types, providing an unbiased look at the underlying biology connecting gene programs, cell states, and protein expression.
  • Usage Example: We could use MOFA+ on a cohort's pre-infusion and post-infusion samples. The resulting latent factors might reveal a specific factor that strongly correlates with T-cell expansion and a decrease in monocytic suppressor cells, immediately highlighting a key biological axis for further investigation.
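The latent-factor idea behind MOFA+ can be illustrated with a minimal stand-in: standardize each modality, concatenate, and take the leading singular vector, which recovers an axis of variation shared across data types. This is a conceptual sketch with simulated data, not the MOFA+ algorithm itself (which fits a proper probabilistic factor model per modality via mofapy2/muon); feature counts and the "activation axis" interpretation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Ground-truth latent factor shared across modalities (e.g., a hypothetical
# T-cell activation axis spanning gene and surface-protein expression).
z = rng.normal(size=(n, 1))

# Two simulated modalities: RNA-like (50 features) and CITE-seq-like (10).
rna = z @ rng.normal(size=(1, 50)) + rng.normal(scale=1.0, size=(n, 50))
adt = z @ rng.normal(size=(1, 10)) + rng.normal(scale=1.0, size=(n, 10))

def standardize(x):
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Concatenate standardized modalities and take the leading singular vector:
# a crude analogue of a MOFA+ latent factor shared across data types.
joint = np.hstack([standardize(rna), standardize(adt)])
u, s, vt = np.linalg.svd(joint, full_matrices=False)
factor1 = u[:, 0]

r = abs(np.corrcoef(factor1, z.ravel())[0, 1])
print(f"correlation of factor 1 with true latent axis: {r:.2f}")
```

In practice the harmonized AnnData objects from steps 1–2 would be passed to MOFA+ directly, and the resulting factors correlated with clinical covariates such as expansion or response.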

Phase 2: Predictive Modeling & Explainable AI (XAI)¶

With an integrated foundation, we can build models to predict clinically meaningful outcomes.

  • Objective: To identify robust biomarkers that predict therapeutic efficacy, long-term persistence, or adverse events like CRS.

  • Methodology:

    1. Supervised Integration & Feature Selection: For a given question (e.g., Responder vs. Non-Responder), use a supervised method like DIABLO. DIABLO is designed to find interconnected features across omics layers that are maximally correlated with the outcome. This gives us a concise, multi-modal biomarker signature.
    2. Explainable Predictive Modeling:
      • Train an XGBoost model on the biomarker signature from DIABLO or the latent factors from MOFA+. XGBoost is robust, fast, and highly accurate for tabular/omics data.
      • Apply SHAP analysis to the model's predictions. This is critical: for each patient, we can generate a plot showing exactly which features (e.g., high expression of IFNG in CD8+ T-cells, low levels of IL-6, presence of a specific TCR clone) pushed the model towards a "Responder" prediction.
  • Usage Example: We train a model to predict durable response at 6 months. For a new patient, the model predicts "Non-Responder." The SHAP plot reveals the top reason is the high abundance of a specific myeloid-derived suppressor cell (MDSC) population in their pre-infusion product, a feature we discovered and validated with DIABLO. This provides a testable hypothesis and a potential biomarker for patient stratification.
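The train-then-explain loop above can be sketched without assuming xgboost or shap are installed: for a linear model, coefficient-times-centered-feature gives exact additive per-patient contributions, which is precisely the decomposition SHAP generalizes to tree ensembles like XGBoost. Feature names, effect sizes, and labels below are simulated assumptions, not findings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
features = ["IFNG_CD8", "IL6_serum", "MDSC_fraction"]  # hypothetical signature

X = rng.normal(size=(n, 3))
# Simulated label: response driven by high IFNG and a low MDSC fraction.
logits = 2.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)
y = (logits > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Per-patient additive contributions (exact for a linear model; SHAP
# produces the analogous decomposition for an XGBoost model).
patient = X[0]
contrib = model.coef_[0] * (patient - X.mean(axis=0))
for name, c in sorted(zip(features, contrib), key=lambda t: -abs(t[1])):
    print(f"{name:>14}: {c:+.2f}")
label = "Responder" if model.predict(patient.reshape(1, -1))[0] else "Non-Responder"
print("prediction:", label)
```

The same pattern scales directly: swap `LogisticRegression` for `xgboost.XGBClassifier` and the per-patient loop for `shap.TreeExplainer`, keeping the DIABLO-selected features as input.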

Phase 3: Building a TME Knowledge Graph (TME-KG)¶

This phase synthesizes our findings into a dynamic, queryable knowledge base.

  • Objective: To create an evolving "digital twin" of the cell therapy TME by connecting our internal findings with public data (TCGA, DepMap, literature).

  • Methodology:

    1. Graph Construction (using Neo4j):
      • Nodes: Genes, proteins, cell types (e.g., CAR-T, Treg, Macrophage), clinical phenotypes (Persistence, Exhaustion, CRS), and drugs/treatments.
      • Edges: Relationships between nodes, weighted by statistical significance. Examples: (Gene X) -[:EXPRESSED_IN]-> (Cell Type Y), (Cell Type Y) -[:CORRELATED_WITH]-> (Phenotype Z), (Drug A) -[:INDUCES]-> (Gene X).
    2. Graph Analytics:
      • Community Detection: Find modules of genes and cells that function as a unit.
      • PageRank/Centrality: Identify the most influential nodes in a given process (e.g., what is the most central gene in the network of T-cell exhaustion?).
  • Usage Example: A query to our TME-KG could be: "Show me all genes that are (1) upregulated in our non-responding patients' CAR-T cells, (2) known to interact with the TGF-β pathway based on public data, and (3) are targeted by an existing drug." This query, impossible with standard analysis, could instantly generate novel combination therapy hypotheses.
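The graph query and centrality steps can be prototyped in-memory before committing to Neo4j. The sketch below uses networkx on a toy graph; every node, edge, and drug name is illustrative, and the set-intersection query mirrors the multi-hop Cypher query described above.

```python
import networkx as nx

# Toy TME knowledge graph; nodes and relationships are illustrative only.
G = nx.DiGraph()
edges = [
    ("TGFB1", "CAR_T_exhausted", "EXPRESSED_IN"),
    ("SMAD3", "CAR_T_exhausted", "EXPRESSED_IN"),
    ("TOX", "CAR_T_exhausted", "EXPRESSED_IN"),
    ("CAR_T_exhausted", "Non_Response", "CORRELATED_WITH"),
    ("Galunisertib", "TGFB1", "TARGETS"),
    ("SMAD3", "TGFB1", "INTERACTS_WITH"),
]
for src, dst, rel in edges:
    G.add_edge(src, dst, rel=rel)

# Centrality: which node is most influential in this mini-network?
ranks = nx.pagerank(G)
top = max(ranks, key=ranks.get)
print("most central node:", top)

# Query analogous to the Cypher example: genes expressed in exhausted
# CAR-T cells that are also targeted by an existing drug.
upregulated = {s for s, d, a in G.edges(data=True)
               if a["rel"] == "EXPRESSED_IN" and d == "CAR_T_exhausted"}
druggable = {d for s, d, a in G.edges(data=True) if a["rel"] == "TARGETS"}
print("candidate targets:", sorted(upregulated & druggable))
```

In Neo4j the same question becomes a single `MATCH` over typed relationships, with edge weights carrying the statistical significance from Phases 1–2.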

Phase 4: Leveraging LLMs for Unstructured Data¶

This is our strategy for unlocking insights from text-based data like clinical notes and publications.

  • Objective: To convert unstructured text into quantitative features that can enrich our multi-omic models and knowledge graph.

  • Methodology:

    1. Embedding: Use pre-trained or fine-tuned Language Models (e.g., SentenceTransformers, BioBERT) to convert clinical notes, pathology reports, and scientific abstracts into high-dimensional vector embeddings.
    2. Semantic Search & Augmentation: These embeddings allow for powerful semantic searching (e.g., "find all patients with notes describing symptoms of neurotoxicity"). The embeddings themselves can be used as features in our predictive models (Phase 2) or used to create new relationships in our knowledge graph (Phase 3).
  • Usage Example: We embed all physician notes from a clinical trial. We notice that the embeddings for patients who experienced CRS cluster together in vector space. By analyzing the words that drive this clustering, we might discover a previously unappreciated early clinical symptom (e.g., "mild headache") that precedes CRS, creating a potential early warning sign.
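The embed-then-search loop can be demonstrated with a lightweight lexical stand-in: TF-IDF vectors plus cosine similarity give the same retrieval mechanics that dense SentenceTransformer or BioBERT embeddings would provide, just without semantic generalization. The notes below are invented examples, not trial data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy clinical notes; in production these would be embedded with a
# domain model such as BioBERT or a SentenceTransformer.
notes = [
    "patient reports mild headache and low-grade fever on day 2",
    "afebrile, tolerating infusion well, no complaints",
    "high fever, hypotension, suspected cytokine release syndrome",
    "routine follow-up, labs within normal limits",
]

# TF-IDF is a lexical stand-in for dense embeddings: each note becomes a
# vector, so nearest-neighbor search works the same way downstream.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(notes)

query = vectorizer.transform(["fever and hypotension"])
scores = cosine_similarity(query, vectors).ravel()
best = scores.argmax()
print("closest note:", notes[best])
```

Swapping in a real embedding model changes only the vectorization step; the similarity search, clustering, and feature-augmentation downstream stay identical.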

3. Conclusion

I am incredibly enthusiastic about the potential to apply these advanced AI techniques to the rich, clinically relevant datasets at AstraZeneca. This framework is designed to be a collaborative and iterative engine for discovery. By building from a solid multi-omic foundation to predictive, explainable models, and finally to a synthesized knowledge graph, we can create a powerful system to better understand disease biology, predict patient response, and ultimately design more effective cell therapies.