Shengguang Wu
Ph.D. Student, Computer Science, Stanford University
Email: shgwu [AT] cs [DOT] stanford [DOT] edu

Google Scholar | GitHub | LinkedIn | CV
About me

Hi! My name is Shengguang (I also go by Daniel). I'm a first-year PhD student in Computer Science at Stanford University, currently rotating with Diyi Yang in the Stanford Social and Language Technologies (SALT) Lab. Last quarter, I worked with Nick Haber in the Stanford Autonomous Agents Lab.
I received my Master's degree from Peking University, where I was advised by Qi Su. Before that, I worked as a research intern on the Qwen Team.

Research Interests

I am interested in developing human-like learning and reasoning skills in machines across domains and modalities. Currently, my work involves the following areas:

Self-Improvement:
     Enabling AI agents to learn from interactions and continually self-improve — actively adapting to new information and novel tasks.

Multimodal Grounding & Reasoning:
     Harnessing textual feedback to guide fine-grained visual perception, and drawing on visual insights to optimize language-based reasoning.

Publications
(see also Google Scholar)
Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou, Alex Zook, Shangru Li, Ethem Can, Xunlei Wu, Clemens Eppner, Valts Blukis, Jonathan Tremblay, Jiajun Wu, Stan Birchfield, Nick Haber
TBA, 2025
TL;DR: 3D-Generalist is a generative graphics framework for creating 3D environments. Key modules include: 1. a diffusion model that generates panoramic images rendering the structures of 3D environments; 2. a VLA trained via a self-improving loop for code generation that refines the environments; 3. another VLA for placing diverse unlabeled 3D assets. 3D-Generalist provides a controllable pipeline for scaling up synthetic 3D environment data for embodied AI.
Shengguang Wu, Fan-Yun Sun, Kaiyue Wen, Nick Haber
arXiv, 2025
TL;DR: S-VCO is a novel finetuning method that enhances the visual-centric capabilities of VLMs while preserving general performance. The key design is a symmetrical visual contrastive objective that optimizes over visual details while avoiding a one-sided "preference" formulation. Across various VLM benchmark domains, S-VCO delivers the most significant and consistent improvements, with especially strong gains on visually demanding tasks.
Shengguang Wu, Shusheng Yang, Zhenglun Chen, Qi Su
EMNLP-Main, 2024
TL;DR: We propose novel paradigms for assessing and enhancing social-pragmatic abilities in L(V)LMs. Key results include: 1. open-ended evaluation reveals LLMs' pragmatic generation better than a multiple-choice setup; 2. preferential tuning effectively invokes pragmatic reasoning without compromising generic abilities; 3. the speaker model's multimodal theory of mind improves in image referential games.
Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, Chang Zhou
arXiv, 2023
TL;DR: DiverseEvol is an efficient instruction-tuning method that lets the model itself iteratively sample training subsets to improve its own performance, with the key selection principle of maintaining high diversity in the chosen subsets. Across three datasets and benchmarks, our models, trained on less than 4% of the original data, match or exceed the performance of finetuning on the full data.
Qwen Team
arXiv, 2023
TL;DR: We release Qwen, a family of highly capable foundation LLMs and chat models. Qwen models outperform baselines of similar sizes (e.g., LLaMA2) on a wide range of benchmarks measuring natural language understanding, reasoning, problem solving, etc. Qwen-72B also outperforms GPT-3.5 on 70% of all tasks.
Shengguang Wu, Zhenglun Chen, Qi Su
ACM-MM, 2024
TL;DR: We present an artifact-recovery model that accurately generates images of lost artifacts in accordance with historical knowledge. Key designs include: 1. prompt enhancement with archaeological knowledge elicited from LLMs; 2. contrastive learning for textual guidance grounded in correlated historical expertise; 3. visual-semantic constraints on edge and perceptual features for learning intricate visual details.
Shengguang Wu, Mei Yuan, Qi Su
EMNLP-Findings, 2023
TL;DR: We introduce DiffuVST, a novel non-autoregressive approach to visual storytelling: a diffusion-based LM featuring bidirectional context guidance and multimodal adapters. It directly predicts ground-truth text embeddings from any noisy input, achieving superior performance across NLG metrics at a massively faster inference speed than strong autoregressive baselines.


Website template from YueYANG1996.github.io.