AI/ML for Variant Impacts on Pathways
Systems-scale machine learning to predict variant impacts
Background
Interpreting variants of uncertain significance (VUS) requires moving beyond traditional sequence annotation. These variants sit in complex biological contexts, where their impact may arise from structural perturbations at the protein level or from network-level effects across pathways.
Machine learning provides a scalable framework for capturing both perspectives:
At the molecular scale, models can extract features from 3D protein structures.
At the systems scale, graph-based and matrix factorization methods can map variants into regulatory networks that drive cellular phenotypes.
Phase 1: Baseline AI/ML Model Using 3D Protein Structures to Predict Single-Pathway Changes (complete)
Problem: Functional impacts of mutations are often non-obvious from sequence.
Approach: Applied supervised machine learning to structural features of proteins.
Algorithms: Density-Based Clustering, Random Forest, Gradient Boosting, Graph Neural Networks
Data: Protein structural data from AlphaFold and the Protein Data Bank (PDB); cell-line-level data from DepMap/CCLE; patient sample data from The Cancer Genome Atlas (TCGA)
Features: 3D coordinates and variant cluster information
Output: Effect on NRF2 Pathway Transcription
Outcome: Identified spatially dense variant clusters with shared functional consequences, providing interpretable ML evidence for structural hotspots.
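As an illustration of the density-based clustering step listed under Algorithms, the sketch below groups variant positions by their 3D coordinates. It assumes DBSCAN as one common density-based method (the summary does not say which algorithm was used) and is a minimal NumPy reimplementation written for illustration, not the project's pipeline. The coordinates, the neighborhood radius eps, and the min_samples threshold are all hypothetical.

```python
import itertools

import numpy as np


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point, with -1 meaning noise."""
    n = len(points)
    # Pairwise Euclidean distances between all variant coordinates.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Each point's neighborhood includes the point itself.
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if not is_core[j]:
                continue  # border points join a cluster but do not expand it
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels


# Hypothetical variant coordinates: two compact groups and one isolated site.
cube = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
hotspot_a = cube                                   # 8 points around the origin
hotspot_b = cube + np.array([30.0, 0.0, 0.0])      # 8 points around (30, 0, 0)
isolated = np.array([[15.0, 15.0, 15.0]])
coords = np.vstack([hotspot_a, hotspot_b, isolated])

labels = dbscan(coords, eps=4.0, min_samples=4)
print(labels)  # first 8 points -> 0, next 8 -> 1, isolated point -> -1
```

Cluster labels of this kind could then serve as the "variant cluster information" feature for the supervised models; for real structures, a library implementation such as scikit-learn's DBSCAN would be the more usual choice.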
Phase 2: Scaling Up with Unsupervised ML and Geospatial Statistics (in progress)
Idea: Structural hotspots are not isolated; they cascade into pathway dysregulation.
Approach: Developed multi-scale ML pipelines that map protein-level variant clusters into pathway activity scores across >500 gene regulatory networks.
Algorithms: Non-negative Matrix Factorization (NMF) and clustering to reveal variant-driven subtypes
Spatial Statistics: Geospatial statistical methods used to build interactive 2D protein variant maps
Database: Interactive database covering 500+ pathways and 16k+ proteins (planned)
Data: Protein structural data from AlphaFold and the Protein Data Bank (PDB); cell-line-level data from DepMap/CCLE; patient sample data from The Cancer Genome Atlas (TCGA)
Outcome: Built a scalable framework that connects variants → pathway perturbations → therapeutic response; an interactive database is planned.
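The NMF step named under Algorithms can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes a non-negative samples-by-pathways activity matrix (the toy values here are invented), uses the standard multiplicative updates for the squared-error objective, and assigns each sample to the component with the largest weight. The names nmf, activity, and subtype are hypothetical.

```python
import numpy as np


def nmf(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (samples x pathways) as W @ H.

    Uses multiplicative updates for the squared Frobenius error, which keep
    W and H non-negative as long as they start non-negative.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical pathway activity scores, assumed already scaled to be >= 0.
# Rows are samples; columns are four pathways.
activity = np.array([
    [4.0, 2.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [6.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
    [0.0, 0.0, 1.5, 4.5],
])

W, H = nmf(activity, rank=2)

# W holds per-sample component weights; H holds per-component pathway loadings.
subtype = W.argmax(axis=1)
print(subtype)
print(np.round(H, 2))
```

Here each row of H can be read as a pathway program and each row of W as how strongly a sample expresses it. The choice of rank is a modeling decision that would need to be tuned on real data.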