Portfolio

View all projects on GitHub

I build open-source tools and libraries for AI-driven scientific discovery. My projects span drug discovery systems, protein language models, generative modeling for therapeutics, and materials informatics.


LOBSTER - Language Models for Biological Sequences

Repository: prescient-design/lobster
Status: Active development • Apache 2.0 License
Technologies: Python, PyTorch, Hugging Face, Flash Attention

Overview

LOBSTER (Language models for Biological Sequence Transformation and Evolutionary Representation) is an open-source library for training protein language models from scratch and deploying them. Developed at Genentech’s Prescient Design, it lets researchers build custom foundation models with full control over training data and embedding spaces.

Key Capabilities

Pre-trained Models:

Core Features:

Research Impact

Based on peer-reviewed research published in 2024:

LOBSTER addresses a critical gap in drug discovery: the ability to rapidly iterate on protein language models tailored to specific therapeutic applications. By enabling customizable training, it allows biotech companies and research labs to build foundation models optimized for their target space without relying solely on general-purpose models like ESM.

Applications in Drug Discovery

Who Should Use This



Walk-Jump Sampling - Protein Discovery System

Repository: prescient-design/walk-jump
Status: Active • Apache 2.0 License • 58 stars
Technologies: Python, Hydra, PyTorch

Overview

Official implementation of discrete Walk-Jump Sampling (dWJS), a novel generative modeling approach for protein discovery and therapeutic design. Recognized with the ICLR 2024 Outstanding Paper Award (1 of 5 from 7,300+ submissions), this work represents a fundamental advance in discrete sampling methods for biological sequences.

Key Features

Sampling Pipeline:

Evaluation Framework:

Research Foundation

Publication: “Protein Discovery with Discrete Walk-Jump Sampling”
Authors: Nathan C. Frey, Daniel Berenberg, Karina Zadorozhny, Joseph Kleinhenz, et al.
Citation: arXiv:2306.12360 (2023)
Recognition: ICLR 2024 Outstanding Paper Award

Why This Matters

Generative modeling for proteins faces a fundamental challenge: sequences are discrete, but most generative methods (VAEs, GANs, standard diffusion) assume continuous spaces. Discrete Walk-Jump Sampling addresses this by smoothing one-hot sequence data with Gaussian noise, sampling from the smoothed distribution with Langevin MCMC (the walk), and projecting back to discrete sequences with one-step denoising (the jump). Every sample therefore ends as a discrete sequence, which makes the method particularly suited to therapeutic protein design where sequence validity is critical.
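The walk-then-jump idea can be illustrated with a toy sketch. It is not the code in the prescient-design/walk-jump repository. It assumes a tiny made-up alphabet and training set, replaces the learned neural score model with the exact score of the Gaussian-smoothed training data, and uses plain overdamped Langevin dynamics. All function names and parameters below are hypothetical.

```python
import numpy as np

# Toy alphabet and training sequences (illustrative assumptions, not real data).
ALPHABET = "ACDE"
TRAIN = ["ACDE", "ACDA", "ECDA"]


def one_hot(seq):
    """Encode a sequence as a flattened one-hot vector."""
    x = np.zeros((len(seq), len(ALPHABET)))
    x[np.arange(len(seq)), [ALPHABET.index(c) for c in seq]] = 1.0
    return x.ravel()


def score(y, data, sigma):
    """Exact score of the training set smoothed with N(0, sigma^2 I).

    Stands in for the learned model: grad log p(y) = (E[x | y] - y) / sigma^2.
    """
    sq_dist = np.sum((data - y) ** 2, axis=1)
    logits = -sq_dist / (2 * sigma**2)
    w = np.exp(logits - logits.max())  # numerically stable softmax weights
    w /= w.sum()
    return (w @ data - y) / sigma**2


def walk(y, data, sigma, steps, step_size, rng):
    """Walk: overdamped Langevin MCMC in the smoothed (continuous) space."""
    for _ in range(steps):
        noise = rng.standard_normal(y.shape)
        y = y + step_size * score(y, data, sigma) + np.sqrt(2 * step_size) * noise
    return y


def jump(y, data, sigma):
    """Jump: one-step denoising, x_hat = y + sigma^2 * score(y)."""
    return y + sigma**2 * score(y, data, sigma)


def decode(x_hat, length):
    """Project the denoised vector back to tokens with a per-position argmax."""
    idx = x_hat.reshape(length, len(ALPHABET)).argmax(axis=1)
    return "".join(ALPHABET[i] for i in idx)


def sample(n, sigma=0.5, steps=200, step_size=0.01, seed=0):
    """Draw n sequences: start from noise, walk, then jump and decode."""
    rng = np.random.default_rng(seed)
    data = np.stack([one_hot(s) for s in TRAIN])
    length = len(TRAIN[0])
    samples = []
    for _ in range(n):
        y = rng.standard_normal(data.shape[1])
        y = walk(y, data, sigma, steps, step_size, rng)
        samples.append(decode(jump(y, data, sigma), length))
    return samples


if __name__ == "__main__":
    print(sample(3))
```

Because the score here comes from the empirical training distribution, this toy version can only return sequences built from the training set. A learned, smoothed model is what would allow novel sequences to be produced.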

Who Should Use This



Keywords: protein language models, generative protein design, AI drug discovery tools, walk-jump sampling, discrete diffusion, antibody design software, materials informatics, graph neural networks, semi-supervised learning, topological materials, open source drug discovery, foundation models biology, therapeutic protein engineering