Ben Yu

yubenjamin2022 at ucla dot edu

Hi! I'm Ben, an undergraduate at UCLA majoring in Data Theory. I am currently a researcher in the Digital Synthesis Lab, advised by Professor Daniel Schwalbe-Koda, where I work on information-theoretic approaches to efficiently compressing atomistic datasets. I am also a researcher with UCLA NLP, advised by Wenbo Hu, where I explore ways to improve the generation and understanding of world models. Previously, I was a researcher in the Computational Machine Learning Group, advised by Justin Cui, where I worked on improving text preservation and generation efficiency in image and video generation models.

I'm interested in explainable AI, specifically in how we can make large language models and multimodal models more interpretable and aligned with human values. I'm particularly interested in improving LLMs through architectural changes (a recent paper I am keen to explore is this one) and post-training methods such as reinforcement learning and supervised fine-tuning. My current interests in this area include improving reasoning in LLMs and LLM interpretability. I'm always looking for new and interesting research opportunities in these areas, so feel free to reach out if you'd like to collaborate!

Email  /  CV  /  Scholar  /  GitHub

profile photo

Publications

* Represents Equal Contribution
Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models
Benjamin Yu, Jackie Liu*, Justin Cui
AAAI 2026 AIR-FM Workshop
arXiv

By optimizing the noise perturbations used for reinforcement learning, we enable flow-matching models to efficiently improve image quality and alignment with human preferences.

Maximizing Efficiency of Dataset Compression for Machine Learning Potentials With Information Theory
Benjamin Yu, Daniel Schwalbe-Koda*
Preprint
arXiv, GitHub

By quantifying dataset redundancy using information entropy, we can selectively subsample data to achieve compact, information-preserving datasets that improve training efficiency.

Video Text Preservation with Synthetic Text-Rich Videos
Ziyang Liu, Benjamin Yu, Kevin Valencia, Justin Cui
ICCV 2025 Curated Data for Efficient Learning Workshop

A small, curated dataset of videos containing high-quality text is sufficient to fine-tune a text-to-video model to generate legible, high-quality text.

Miscellanea

Oral Presentations

Efficient compression of atomistic datasets with information theory, APS Global Physics Summit 2025

Academic Service

Not yet but hopefully eventually!

Teaching

ACM AI at UCLA, Workshop Officer, 2025 - 2026
Statistics Club, Workshop Chair, 2024 - 2025

Thank you to Jon Barron for this website template; the source code is here.