Shengguang Wu
Ph.D. Student, Computer Science, Stanford University
Email: shgwu [AT] cs [DOT] stanford [DOT] edu

Google Scholar | GitHub | LinkedIn | CV
About me

Hi! My name is Shengguang (I also go by Daniel). I'm a first-year PhD student in Computer Science at Stanford University, currently rotating with Diyi Yang in the Stanford Social and Language Technologies (SALT) Lab. Last quarter, I worked with Nick Haber in the Stanford Autonomous Agents Lab.
I received my Master's degree from Peking University, where I was advised by Qi Su. Before that, I worked as a research intern on the Qwen Team.

Research Interests

I am interested in developing human-like learning and reasoning skills in machines across domains and modalities. Currently, my work involves the following areas:

Self-Improvement:
     Enabling AI agents to learn from interactions and continually self-improve — actively adapting to new information and novel tasks.

Multimodal Grounding & Reasoning:
     Harnessing textual feedback to guide fine-grained visual perception, and drawing on visual insights to optimize language-based reasoning.

Publications
(see also Google Scholar)
Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou, Alex Zook, Shangru Li, Ethem Can, Xunlei Wu, Clemens Eppner, Valts Blukis, Jonathan Tremblay, Jiajun Wu, Stan Birchfield, Nick Haber
TBA, 2025
TL;DR: 3D-Generalist is a generative graphics framework for creating 3D environments. Key modules include: 1. a diffusion model that generates panoramic images rendering the structures of 3D environments; 2. a VLA trained via a self-improving loop for code generation that refines the environments; 3. another VLA for placing diverse unlabeled 3D assets. 3D-Generalist provides a controllable pipeline for scaling up synthetic 3D environment data for embodied AI.
Shengguang Wu, Fan-Yun Sun, Kaiyue Wen, Nick Haber
arXiv, 2025
TL;DR: S-VCO is a novel finetuning method that enhances the visual-centric capabilities of VLMs while preserving general performance. The key design is a symmetrical visual contrastive objective that optimizes over visual details while avoiding a one-sided "preference" formulation. Across various VLM benchmark domains, S-VCO delivers the most significant and consistent improvements, with especially strong gains on visually demanding tasks.
Shengguang Wu, Shusheng Yang, Zhenglun Chen, Qi Su
EMNLP-Main, 2024
TL;DR: We propose novel paradigms for assessing and enhancing social-pragmatic abilities in L(V)LMs. Key results include: 1. open-ended evaluation reveals LLMs' pragmatic generation better than a multiple-choice setup; 2. preferential tuning effectively invokes pragmatic reasoning without compromising generic abilities; 3. the speaker model's multimodal theory of mind improves in image referential games.
Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, Chang Zhou
arXiv, 2023
TL;DR: DiverseEvol is an efficient instruction-tuning method that lets the model itself iteratively sample training subsets to improve its own performance, with the key selection principle of maintaining high diversity in the chosen subsets. Across three datasets and benchmarks, our models, trained on less than 4% of the original data, match or exceed the performance of finetuning on the full data.
Qwen Team
arXiv, 2023
TL;DR: We release Qwen, a family of highly capable foundation LLMs and chat models. Qwen models outperform baselines of similar sizes (e.g., LLaMA2) on a wide range of benchmarks measuring natural language understanding, reasoning, problem solving, etc. Qwen-72B also outperforms GPT-3.5 on 70% of all tasks.
Shengguang Wu, Zhenglun Chen, Qi Su
ACM-MM, 2024
TL;DR: We present an artifact-recovery model that accurately generates images of lost artifacts in accordance with historical knowledge. Key designs include: 1. prompt enhancement with archaeological knowledge elicited from LLMs; 2. contrastive learning for textual guidance grounded in correlated historical expertise; 3. visual-semantic constraints on edge and perceptual features for learning intricate visual details.
Shengguang Wu, Mei Yuan, Qi Su
EMNLP-Findings, 2023
TL;DR: We introduce DiffuVST, a novel non-autoregressive approach to visual storytelling: a diffusion-based LM featuring bidirectional context guidance and multimodal adapters. It directly predicts ground-truth text embeddings from any noisy input, achieving superior performance across NLG metrics at a massively faster inference speed than strong autoregressive baselines.


Website template from YueYANG1996.github.io.