Education

Massachusetts Institute of Technology
Cambridge, MA
Ph.D. Computational and Systems Biology
2026 - Present
Columbia University in the City of New York
New York, NY
B.S. Biomedical Engineering, Minor in Computer Science. GPA: 4.01/4.0
2023 - 2026

Relevant Coursework: Probabilistic Models and Machine Learning, High Performance Machine Learning, Biostatistics for Engineers, Advanced Programming, Data Structures, High Dimensional Stats for Biomedical Data, Deep Learning for Biomedical Imaging, Quantitative Physiology, Calculus III, Linear Algebra

Stony Brook University (SBU)
Stony Brook, NY
B.E. Electrical Engineering. GPA: 4.0/4.0
2022 - 2023

Publications

Sarkar A, Duran A, Yiyang Yu, Lin DW, Kang Y, Somia N, Mantilla P, Zhou J, Nagai M, Tang Z, Hanington K, Chang K, Koo PK, Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion, bioRxiv, Apr 2026.

Adams E, Bai L, Lee M, Yiyang Yu, AlQuraishi M., From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models, ICML, May 2025.

Ziqi Tang, Nirali Somia, Yiyang Yu, Peter K Koo, Evaluating the representational power of pre-trained DNA language models for regulatory genomics, Genome Biology, Volume 26, Issue 1, Jul 2025.

Yiyang Yu, Shivani Muthukumar, Peter K Koo, EvoAug-TF: extending evolution-inspired data augmentations for genomic deep learning to TensorFlow, Bioinformatics, Volume 40, Issue 3, Mar 2024.

Rafi AM, Nogina D, Penzar D, et al. A community effort to optimize sequence-based deep learning models of gene regulation. Nat Biotechnol, Oct 2024. (Name in Consortium)

Selected Honors

  • 2026 NSF Graduate Research Fellowship (GRFP)
  • 2025 Barry M. Goldwater Scholar ($7,500 scholarship)
  • 2025 Tau Beta Pi Engineering Honor Society
  • 2024 Kaggle Competitions Master (3x Gold, 4x Silver), Global Ranking: 133 / 203,363 (Top 0.065%)
  • 2024 HackMIT InterSystems Challenge 1st Place ($2,000 prize)
  • 2024 Gilbert Family Scholarship from Columbia Engineering

Work Experience

Columbia University Irving Medical Center (Mohammed AlQuraishi Lab)
New York, NY
SURF Fellow / Undergraduate Researcher
Dec 2023 - Present
  • Developed a discrete diffusion model for protein sequence-structure co-generation using the ESM3 structure tokenizer and BERT architectures, achieving faster generation than continuous diffusion approaches.
  • Implemented sparse autoencoder training pipelines to uncover interpretable biological features in protein language models. Work was published at ICML 2025 with highlights.
  • Designing algorithms to select a maximally diversified set of proteins for molecular dynamics simulations, generating data to support subsequent model development for predicting protein conformation trajectories.
  • Developing methods to extract protein conformational ensembles from AlphaFold2 through latent space exploration; systematically created a benchmark library to compare them with existing methods.
Cold Spring Harbor Laboratory (Peter Koo Lab)
Cold Spring Harbor, NY
Research Intern
Mar 2022 - Dec 2023, May 2025 - Aug 2025
  • Co-led development of DNA Discrete Diffusion (D3), a score-entropy discrete diffusion model for conditional generation of regulatory sequences. Implemented variant effect prediction extensions and built a latent space visualizer to interpret learned regulatory motifs.
  • Established fine-tuning pipelines with Low-Rank Adaptation (LoRA) and Supervised Fine-Tuning (SFT) for four pre-trained DNA language models to evaluate their representational power for regulatory genomics.
  • Developed and implemented evolution-inspired data augmentations (EvoAug-TF) in TensorFlow for genomic deep neural networks and demonstrated that they improve generalization and interpretability.
  • Designed and evaluated more than 100 deep-learning models for predicting DNA promoters' expression rates using Python, TensorFlow, and WandB in the 2022 DREAM Challenge. Placed 7th on the final leaderboard.
Leash Biosciences
Salt Lake City, UT (Remote)
Machine Learning Intern
Aug 2024 - May 2025
  • Developed a multimodal transformer model to predict protein-molecule binding affinities.
  • Optimized the training and inference pipeline of protein language models by up to 30% by reimplementing models with FlashAttention, and provided a validation method to evaluate data from wet-lab experiments.
Bioengineering Education, Application and Research (BEAR)
Stony Brook, NY
Research Assistant
Dec 2022 - Aug 2023
  • Developed CICaidA, a centralized health monitoring system for nursing homes featuring custom ESP32 hardware with MAX3010 sensors to track heart rate and blood oxygen, a Flutter mobile app, and a Firebase backend to provide real-time alerts to caretakers.

Projects

Zotero MCP - Open Source MCP Server for Zotero Integration
Mar 2025 - Present
  • Built an MCP server connecting Zotero research libraries with AI assistants (Claude, ChatGPT, Cursor) via the Model Context Protocol. Implemented semantic search with multiple embedding models, PDF annotation extraction, and BibTeX export. Achieved over 720,000 downloads and 3k+ GitHub stars.
Nano Protein Viewer - Protein Structure Visualization Tool
Aug 2025 - Nov 2025
  • Developed a protein structure visualization tool using the Molstar framework with support for multiple formats (PDB, mmCIF, MOL2, etc.), diffusion animation playback, sequence alignment, and ESMFold integration. Released as a VS Code extension (1100+ downloads), web application, and Jupyter notebook plugin (2000+ downloads).
Gold, NeurIPS 2024 - Predict New Medicines with BELKA, Kaggle (13/1946)
Apr 2024 - Jul 2024
  • Developed deep learning models to predict small molecule-protein interactions using the Big Encoded Library for Chemical Assessment. Implemented over 40 types of models, including CNNs, GNNs, Transformers, RNNs, and GBDT models, and developed a robust final solution after using up all 480 submissions.

Leadership Experience

Co-President, Columbia University Biotech Society (CUBS)
Dec 2024 - Dec 2025
  • Organized the annual Biotech Summit with over 200 participants and hosted 8+ speaker events featuring guests from academia and industry, with total attendance exceeding 400 people.
  • Started a data science competition initiative to help students learn AI through hands-on work with biological data; led a team of four students to finish in the top 5% (silver medal) in a bioimaging competition.
  • Coordinated multiple initiatives including an iGEM team, a podcast series featuring professors and industry leaders, and outreach workshops introducing high school students to computational biology.
HardCORE Initiative Lead, Columbia Organizing of Rising Entrepreneurs (CORE)
Jan 2024 - May 2026
  • Organizing and facilitating a series of biotech & deeptech speaker events and investor dinners, connecting 15+ industry professionals with Columbia student entrepreneurs, resulting in mentorship opportunities and potential funding partnerships for early-stage biotech ventures.
Founder and President, Artificial Intelligence Community at SBU
Dec 2022 - Jul 2023
  • Hosted biweekly hands-on machine learning and Python workshops to cultivate members' interest in AI.
  • Collaborated with AI professors to create research opportunities and organized guest lectures from industry professionals.

Technical Skills

Languages: Python, JavaScript, TypeScript, Java, C, C++, HTML/CSS, LaTeX, Bash, MATLAB
Libraries: PyTorch, TensorFlow, JAX, NumPy, Pandas, WandB, OpenCV, HuggingFace, React, Node.js
Tools: Git, GCP, HPC, Slurm, Jupyter, Fusion 360, Photoshop, Unity 3D, Blender