Objective: Detect copy number variations (CNVs) in cell-free DNA (cfDNA) samples using a variational autoencoder (VAE) trained on normal samples. Anomaly detection is based on reconstruction error: normal samples should reconstruct with low error, while cfDNA samples carrying cancer-derived CNVs should reconstruct with high error.
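The scoring idea in the objective can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes the reconstruction error is a per-sample mean squared error over gene-level features, and it uses a flat-baseline stand-in (`x_hat`) in place of a real decoder output. All names here are hypothetical.

```python
import numpy as np

def reconstruction_error(x: np.ndarray, x_hat: np.ndarray) -> np.ndarray:
    """Per-sample mean squared error over the feature (gene) axis.

    Assumption: MSE is the error metric; the report only says
    "reconstruction error".
    """
    return np.mean((x - x_hat) ** 2, axis=1)

# Toy data: 5 samples x 8 gene-level features (hypothetical values).
rng = np.random.default_rng(0)
x_normal = rng.normal(0.0, 0.1, size=(5, 8))

# Simulate a copy-number gain over a contiguous block of genes.
x_cnv = x_normal.copy()
x_cnv[:, 2:5] += 1.0

# Stand-in for the decoder output: a model that has only learned the
# flat "normal" baseline and therefore cannot reproduce the gain.
x_hat = np.zeros_like(x_normal)

err_normal = reconstruction_error(x_normal, x_hat)
err_cnv = reconstruction_error(x_cnv, x_hat)
print("normal:", err_normal.round(3))
print("cnv:   ", err_cnv.round(3))
```

Samples whose error exceeds a chosen threshold would be flagged as anomalous; how that threshold is chosen determines the sensitivity and specificity trade-off.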
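Key Finding: Across 4 experiments, the best run (holdout_latent32_beta0) reaches AUROC = 0.907 (+0.082 over the first run) and a separation ratio of 3.03x (+0.16x over the first run). That run meets every target; both runs with beta_max = 0.01 fall short of the AUROC target.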
| Dataset | Samples | Features | Notes |
|---|---|---|---|
| Normal (Train) | 395 | 21,574 genes | ENCODE blacklist filtered, gene-level aggregation |
| Normal (Val) | 49 | | |
| Normal (Test) | 49 | | |
| cfDNA (Test) | 500 | | |
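The normal-sample counts above (395 / 49 / 49, i.e. 493 in total) are consistent with holding out 10% for test and 10% for validation. A minimal sketch of such a split follows; the validation fraction, the shuffling method, and the function name are assumptions for illustration, not taken from the training script.

```python
import numpy as np

def split_indices(n_samples: int, test_frac: float = 0.1,
                  val_frac: float = 0.1, seed: int = 42):
    """Shuffle sample indices and carve out test and validation subsets.

    Assumption: both fractions are applied to the full sample count and
    rounded down; the remainder is used for training.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test = order[:n_test]
    val = order[n_test:n_test + n_val]
    train = order[n_test + n_val:]
    return train, val, test

train, val, test = split_indices(493)
print(len(train), len(val), len(test))  # 395 49 49
```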
Note: For holdout experiments, Test = held-out normal test subset vs cfDNA (primary). Optional AUROC shown in parentheses is computed using all normal samples vs cfDNA and is for reference only (not a held-out split).
| Metric | holdout_latent16_beta0 (holdout, 2025-12-18) | holdout_latent16_beta001 (holdout, 2025-12-18) | holdout_latent32_beta0 (holdout, 2025-12-18) | holdout_latent32_beta001 (holdout, 2025-12-18) | Target |
|---|---|---|---|---|---|
| AUROC | Test: 0.825 (All normals: 0.804) | Test: 0.754 (All normals: 0.711) ↓ | Test: 0.907 (All normals: 0.900) ↑ | Test: 0.741 (All normals: 0.709) ↓ | ≥ 0.8 |
| Separation Ratio | Test: 2.87x | Test: 1.64x ↓ | Test: 3.03x ↑ | Test: 1.53x ↓ | ≥ 1.5x |
| Cohen's d | 0.50 | 0.51 ↑ | 0.68 ↑ | 0.47 ↓ | ≥ 0.5 |
| AUPRC | 0.976 (All normals: 0.775) | 0.968 (All normals: 0.715) | 0.988 (All normals: 0.871) | 0.964 (All normals: 0.706) | ≥ 0.5 |
| Sensitivity @ 95% Spec | 43.2% | 41.0% | 66.0% | 33.6% | ≥ 30% |
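For readers who want to recompute the AUROC row, the sketch below uses the rank-based definition: the probability that a randomly chosen cfDNA sample has a higher reconstruction error than a randomly chosen normal sample, with ties counted as one half. The function name and the toy scores are hypothetical; the report does not say which library computed the published values.

```python
import numpy as np

def auroc(normal_scores, cfdna_scores) -> float:
    """AUROC via pairwise comparison (equivalent to the Mann-Whitney U statistic).

    Higher scores are treated as "more anomalous", so cfDNA is the
    positive class.
    """
    neg = np.asarray(normal_scores, dtype=float)[:, None]
    pos = np.asarray(cfdna_scores, dtype=float)[None, :]
    wins = np.sum(pos > neg)
    ties = np.sum(pos == neg)
    return float((wins + 0.5 * ties) / (neg.size * pos.size))

# Toy scores (hypothetical): three of the four (normal, cfDNA) pairs
# are ranked correctly.
print(auroc([0.1, 0.3], [0.2, 0.4]))  # 0.75
```

The "Test" figure would pass only the held-out normal errors as `normal_scores`, while the "All normals" reference figure would pass the errors of every normal sample, including those the model was trained on.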
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Latent Dim | 16 | Learning Rate | 0.0003 |
| Hidden Channels | [16, 32, 64] | Beta Max | 0.0 |
| Kernel Sizes | [15, 9, 5] | Beta Anneal Epochs | 20 |
Test = held-out normal test subset vs cfDNA (primary). All normals (when shown) uses all normal samples vs cfDNA for reference only.
| Metric | Test (Holdout) | All Normals (Ref) | Target | Status |
|---|---|---|---|---|
| AUROC | 0.825 | 0.804 | ≥ 0.8 | PASS |
| AUPRC | 0.976 | 0.775 | ≥ 0.5 | PASS |
| Separation Ratio | 2.87x | - | ≥ 1.5x | PASS |
| Cohen's d | 0.50 | - | ≥ 0.5 | PASS |
| Normal Error (mean ± std) | 0.028 ± 0.033 | - | - | |
| cfDNA Error (mean ± std) | 0.080 ± 0.109 | - | - | |
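The separation ratio and Cohen's d in this table can be approximately recovered from the summary rows alone. The sketch below assumes the separation ratio is the ratio of mean errors and that Cohen's d uses a pooled standard deviation over 49 held-out normal and 500 cfDNA samples; those definitions are inferred from the numbers, not stated in the report.

```python
import math

def separation_ratio(mean_normal: float, mean_cfdna: float) -> float:
    """Ratio of mean cfDNA error to mean normal error (assumed definition)."""
    return mean_cfdna / mean_normal

def cohens_d(mean_a: float, std_a: float, n_a: int,
             mean_b: float, std_b: float, n_b: int) -> float:
    """Cohen's d with a pooled standard deviation (assumed definition)."""
    pooled_var = ((n_a - 1) * std_a ** 2 + (n_b - 1) * std_b ** 2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var)

# Summary statistics copied from the table above.
ratio = separation_ratio(0.028, 0.080)
d = cohens_d(0.028, 0.033, 49, 0.080, 0.109, 500)
print(f"separation ratio ~ {ratio:.2f}, Cohen's d ~ {d:.2f}")
```

The recomputed ratio (about 2.86) differs from the reported 2.87x in the last digit, most likely because the means in the table are already rounded to three decimals.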
Trade-offs between sensitivity and specificity at different thresholds. For cancer screening, high sensitivity (95%+) is typically prioritized.
| Specificity | Sensitivity (Recall) | FPR | Threshold | Note |
|---|---|---|---|---|
| 99.0% | 7.8% | 2.0% | 0.2466 | |
| 95.0% | 40.0% | 6.1% | 0.0436 | |
| 90.0% | 48.6% | 10.2% | 0.0369 | |
| 80.0% | 63.0% | 20.4% | 0.0289 | |
| 46.9% | 95.0% | 53.1% | 0.0202 | 95% sensitivity operating point |
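The threshold table can be read as follows: a sample is called positive when its reconstruction error reaches the threshold, and the last row picks the threshold that flags at least 95% of cfDNA samples. The sketch below illustrates that logic on toy data; the function names and the `>=` convention are assumptions. It also shows why the FPR column takes values such as 2.0%, 6.1%, 10.2%, and 20.4%: with 49 held-out normals, the false positive rate can only move in steps of 1/49.

```python
import math
import numpy as np

def sensitivity_specificity(normal_errors, cfdna_errors, threshold):
    """Positive call = error >= threshold (assumed convention)."""
    normal = np.asarray(normal_errors, dtype=float)
    cfdna = np.asarray(cfdna_errors, dtype=float)
    sensitivity = float(np.mean(cfdna >= threshold))
    specificity = float(np.mean(normal < threshold))
    return sensitivity, specificity

def threshold_for_sensitivity(cfdna_errors, target=0.95):
    """Highest observed cfDNA error that still flags >= target of cfDNA samples."""
    ordered = np.sort(np.asarray(cfdna_errors, dtype=float))
    n = ordered.size
    n_needed = math.ceil(target * n - 1e-9)  # guard against float round-off
    return float(ordered[n - n_needed])

# Toy errors (hypothetical): 20 cfDNA samples and 4 normal samples.
cfdna = np.arange(1.0, 21.0)
normal = np.array([0.5, 1.5, 2.5, 3.5])

thr = threshold_for_sensitivity(cfdna, target=0.95)
sens, spec = sensitivity_specificity(normal, cfdna, thr)
print(thr, sens, spec)  # 2.0 0.95 0.5

# With 49 held-out normals, FPR is quantized in steps of 1/49.
fpr_steps = [round(100 * k / 49, 1) for k in (1, 3, 5, 10)]
print(fpr_steps)  # [2.0, 6.1, 10.2, 20.4]
```

Because of this quantization, the nominal specificity in the first column and the measured FPR do not always sum to exactly 100%.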
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Latent Dim | 16 | Learning Rate | 0.0003 |
| Hidden Channels | [16, 32, 64] | Beta Max | 0.01 |
| Kernel Sizes | [15, 9, 5] | Beta Anneal Epochs | 20 |
Test = held-out normal test subset vs cfDNA (primary). All normals (when shown) uses all normal samples vs cfDNA for reference only.
| Metric | Test (Holdout) | All Normals (Ref) | Target | Status |
|---|---|---|---|---|
| AUROC | 0.754 | 0.711 | ≥ 0.8 | FAIL |
| AUPRC | 0.968 | 0.715 | ≥ 0.5 | PASS |
| Separation Ratio | 1.64x | - | ≥ 1.5x | PASS |
| Cohen's d | 0.51 | - | ≥ 0.5 | PASS |
| Normal Error (mean ± std) | 0.048 ± 0.031 | - | - | |
| cfDNA Error (mean ± std) | 0.079 ± 0.063 | - | - | |
Trade-offs between sensitivity and specificity at different thresholds. For cancer screening, high sensitivity (95%+) is typically prioritized.
| Specificity | Sensitivity (Recall) | FPR | Threshold | Note |
|---|---|---|---|---|
| 99.0% | 3.6% | 2.0% | 0.2610 | |
| 95.0% | 40.0% | 6.1% | 0.0648 | |
| 90.0% | 62.0% | 10.2% | 0.0494 | |
| 80.0% | 64.0% | 20.4% | 0.0485 | |
| 0.0% | 100.0% | 100.0% | 0.0283 | 95% sensitivity operating point |
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Latent Dim | 32 | Learning Rate | 0.0003 |
| Hidden Channels | [16, 32, 64] | Beta Max | 0.0 |
| Kernel Sizes | [15, 9, 5] | Beta Anneal Epochs | 20 |
Test = held-out normal test subset vs cfDNA (primary). All normals (when shown) uses all normal samples vs cfDNA for reference only.
| Metric | Test (Holdout) | All Normals (Ref) | Target | Status |
|---|---|---|---|---|
| AUROC | 0.907 | 0.900 | ≥ 0.8 | PASS |
| AUPRC | 0.988 | 0.871 | ≥ 0.5 | PASS |
| Separation Ratio | 3.03x | - | ≥ 1.5x | PASS |
| Cohen's d | 0.68 | - | ≥ 0.5 | PASS |
| Normal Error (mean ± std) | 0.015 ± 0.012 | - | - | |
| cfDNA Error (mean ± std) | 0.045 ± 0.046 | - | - | |
Trade-offs between sensitivity and specificity at different thresholds. For cancer screening, high sensitivity (95%+) is typically prioritized.
| Specificity | Sensitivity (Recall) | FPR | Threshold | Note |
|---|---|---|---|---|
| 99.0% | 11.6% | 2.0% | 0.0952 | |
| 95.0% | 63.8% | 6.1% | 0.0224 | |
| 90.0% | 73.8% | 10.2% | 0.0189 | |
| 80.0% | 87.2% | 22.4% | 0.0157 | |
| 63.3% | 95.2% | 36.7% | 0.0129 | 95% sensitivity operating point |
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Latent Dim | 32 | Learning Rate | 0.0003 |
| Hidden Channels | [16, 32, 64] | Beta Max | 0.01 |
| Kernel Sizes | [15, 9, 5] | Beta Anneal Epochs | 20 |
Test = held-out normal test subset vs cfDNA (primary). All normals (when shown) uses all normal samples vs cfDNA for reference only.
| Metric | Test (Holdout) | All Normals (Ref) | Target | Status |
|---|---|---|---|---|
| AUROC | 0.741 | 0.709 | ≥ 0.8 | FAIL |
| AUPRC | 0.964 | 0.706 | ≥ 0.5 | PASS |
| Separation Ratio | 1.53x | - | ≥ 1.5x | PASS |
| Cohen's d | 0.47 | - | ≥ 0.5 | FAIL |
| Normal Error (mean ± std) | 0.040 ± 0.031 | - | - | |
| cfDNA Error (mean ± std) | 0.060 ± 0.045 | - | - | |
Trade-offs between sensitivity and specificity at different thresholds. For cancer screening, high sensitivity (95%+) is typically prioritized.
| Specificity | Sensitivity (Recall) | FPR | Threshold | Note |
|---|---|---|---|---|
| 99.0% | 1.2% | 2.0% | 0.2468 | |
| 95.0% | 32.0% | 6.1% | 0.0557 | |
| 90.0% | 55.0% | 10.2% | 0.0413 | |
| 80.0% | 66.6% | 20.4% | 0.0378 | |
| 0.0% | 100.0% | 100.0% | 0.0212 | 95% sensitivity operating point |
| Metric | Current Best | Target | Gap | % to Target |
|---|---|---|---|---|
| AUROC | 0.907 | ≥ 0.8 | Met | 113% |
| Separation Ratio | 3.03x | ≥ 1.5x | Met | 202% |
| Cohen's d | 0.92 | ≥ 0.5 | Met | 184% |
Latent-space analysis for the best-performing model, showing how normal and cfDNA samples are encoded.
Synthesized from analysis of all 4 experiments and parsed from the latest experiment report.
| Parameter | holdout_latent16_beta0 | holdout_latent16_beta001 | holdout_latent32_beta0 | holdout_latent32_beta001 |
|---|---|---|---|---|
| latent_dim | 16 | 16 | 32 | 32 |
| hidden_channels | [16, 32, 64] | [16, 32, 64] | [16, 32, 64] | [16, 32, 64] |
| kernel_sizes | [15, 9, 5] | [15, 9, 5] | [15, 9, 5] | [15, 9, 5] |
| learning_rate | 0.0003 | 0.0003 | 0.0003 | 0.0003 |
| beta_max | 0.0 | 0.01 | 0.0 | 0.01 |
| beta_anneal_epochs | 20 | 20 | 20 | 20 |
| clip_values | 3.0 | 3.0 | 3.0 | 3.0 |
| batch_size | 32 | 32 | 32 | 32 |
| epochs | 200 | 200 | 200 | 200 |
| patience | 30 | 30 | 30 | 30 |
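The `beta_max` and `beta_anneal_epochs` rows control how strongly the KL term is weighted during training. The sketch below assumes a linear warm-up from 0 to `beta_max` and a loss of the form reconstruction error plus beta times KL divergence; the actual schedule and loss in the training script may differ, and the function names are hypothetical.

```python
import numpy as np

def beta_at_epoch(epoch: int, beta_max: float, anneal_epochs: int) -> float:
    """Linear KL warm-up: 0 at epoch 0, beta_max from anneal_epochs onward (assumed)."""
    if anneal_epochs <= 0:
        return beta_max
    return beta_max * min(1.0, epoch / anneal_epochs)

def kl_divergence(mu: np.ndarray, logvar: np.ndarray) -> np.ndarray:
    """KL(N(mu, exp(logvar)) || N(0, I)) per sample, summed over latent dims."""
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)

def vae_loss(recon_error: np.ndarray, mu: np.ndarray,
             logvar: np.ndarray, beta: float) -> float:
    """Batch-mean of reconstruction error plus beta-weighted KL term."""
    return float(np.mean(recon_error + beta * kl_divergence(mu, logvar)))

# Schedule for the beta_max = 0.01 runs with a 20-epoch warm-up.
for epoch in (0, 10, 20, 50):
    print(epoch, beta_at_epoch(epoch, beta_max=0.01, anneal_epochs=20))

# With beta_max = 0.0 the KL term never contributes.
print(beta_at_epoch(50, beta_max=0.0, anneal_epochs=20))  # 0.0
```

Under these assumptions, the two `beta_max = 0.0` runs optimize reconstruction error only, which may help explain why they separate normal and cfDNA samples better than the regularized runs.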
uv run python scripts/train.py \
--normal-data /data/cnv_autoencoder/processed/gene_aggregated/normal_batch.npz \
--cfdna-data /data/cnv_autoencoder/processed/gene_aggregated/cfdna_batch.npz \
--checkpoint-dir /data/cnv_autoencoder/checkpoints/holdout_latent16_beta0 \
--mlflow-uri /data/cnv_autoencoder/mlflow \
--experiment-name cnv_vae_holdout_split \
--run-name holdout_latent16_beta0 \
--exclude-list configs/sample_exclude_list.csv \
--test-split 0.1 \
--latent-dim 16 \
--hidden-channels 16 32 64 \
--kernel-sizes 15 9 5 \
--learning-rate 0.0003 \
--beta-max 0.0 \
--clip-values 3.0 \
--epochs 200 \
--patience 30 \
--seed 42
uv run python scripts/train.py \
--normal-data /data/cnv_autoencoder/processed/gene_aggregated/normal_batch.npz \
--cfdna-data /data/cnv_autoencoder/processed/gene_aggregated/cfdna_batch.npz \
--checkpoint-dir /data/cnv_autoencoder/checkpoints/holdout_latent16_beta001 \
--mlflow-uri /data/cnv_autoencoder/mlflow \
--experiment-name cnv_vae_holdout_split \
--run-name holdout_latent16_beta001 \
--exclude-list configs/sample_exclude_list.csv \
--test-split 0.1 \
--latent-dim 16 \
--hidden-channels 16 32 64 \
--kernel-sizes 15 9 5 \
--learning-rate 0.0003 \
--beta-max 0.01 \
--clip-values 3.0 \
--epochs 200 \
--patience 30 \
--seed 42
uv run python scripts/train.py \
--normal-data /data/cnv_autoencoder/processed/gene_aggregated/normal_batch.npz \
--cfdna-data /data/cnv_autoencoder/processed/gene_aggregated/cfdna_batch.npz \
--checkpoint-dir /data/cnv_autoencoder/checkpoints/holdout_latent32_beta0 \
--mlflow-uri /data/cnv_autoencoder/mlflow \
--experiment-name cnv_vae_holdout_split \
--run-name holdout_latent32_beta0 \
--exclude-list configs/sample_exclude_list.csv \
--test-split 0.1 \
--latent-dim 32 \
--hidden-channels 16 32 64 \
--kernel-sizes 15 9 5 \
--learning-rate 0.0003 \
--beta-max 0.0 \
--clip-values 3.0 \
--epochs 200 \
--patience 30 \
--seed 42
uv run python scripts/train.py \
--normal-data /data/cnv_autoencoder/processed/gene_aggregated/normal_batch.npz \
--cfdna-data /data/cnv_autoencoder/processed/gene_aggregated/cfdna_batch.npz \
--checkpoint-dir /data/cnv_autoencoder/checkpoints/holdout_latent32_beta001 \
--mlflow-uri /data/cnv_autoencoder/mlflow \
--experiment-name cnv_vae_holdout_split \
--run-name holdout_latent32_beta001 \
--exclude-list configs/sample_exclude_list.csv \
--test-split 0.1 \
--latent-dim 32 \
--hidden-channels 16 32 64 \
--kernel-sizes 15 9 5 \
--learning-rate 0.0003 \
--beta-max 0.01 \
--clip-values 3.0 \
--epochs 200 \
--patience 30 \
--seed 42
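The four commands above differ only in `--latent-dim`, `--beta-max`, and the run name derived from them. A small helper like the one below could generate them from the 2x2 grid instead of maintaining four copies. It only builds and prints the command lines; the helper names are hypothetical, and the flags and paths are copied from the commands above.

```python
import itertools
import shlex

BASE = "/data/cnv_autoencoder"
LATENT_DIMS = [16, 32]
# (beta_max value, tag used in the run name)
BETAS = [(0.0, "beta0"), (0.01, "beta001")]

def build_command(latent_dim: int, beta_max: float, beta_tag: str):
    """Return (run_name, argv) for one grid point."""
    name = f"holdout_latent{latent_dim}_{beta_tag}"
    argv = [
        "uv", "run", "python", "scripts/train.py",
        "--normal-data", f"{BASE}/processed/gene_aggregated/normal_batch.npz",
        "--cfdna-data", f"{BASE}/processed/gene_aggregated/cfdna_batch.npz",
        "--checkpoint-dir", f"{BASE}/checkpoints/{name}",
        "--mlflow-uri", f"{BASE}/mlflow",
        "--experiment-name", "cnv_vae_holdout_split",
        "--run-name", name,
        "--exclude-list", "configs/sample_exclude_list.csv",
        "--test-split", "0.1",
        "--latent-dim", str(latent_dim),
        "--hidden-channels", "16", "32", "64",
        "--kernel-sizes", "15", "9", "5",
        "--learning-rate", "0.0003",
        "--beta-max", str(beta_max),
        "--clip-values", "3.0",
        "--epochs", "200",
        "--patience", "30",
        "--seed", "42",
    ]
    return name, argv

commands = [
    build_command(latent_dim, beta_max, tag)
    for latent_dim, (beta_max, tag) in itertools.product(LATENT_DIMS, BETAS)
]

for name, argv in commands:
    print(f"# {name}")
    print(shlex.join(argv))
```

To actually launch a run, each `argv` list could be passed to `subprocess.run(argv, check=True)`.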
| Experiment | Analysis Directory |
|---|---|
| holdout_latent16_beta0 | /data/cnv_autoencoder/analysis/88bc9ed08ea542a1a7840216703eca66/ |
| holdout_latent16_beta001 | /data/cnv_autoencoder/analysis/1ef6b476f675419e972f31f27625e741/ |
| holdout_latent32_beta0 | /data/cnv_autoencoder/analysis/dfef8db798bf42dbbfc77f5be3aa59ec/ |
| holdout_latent32_beta001 | /data/cnv_autoencoder/analysis/4277be6f338b4f0ea44f58c2cc668fca/ |