I am Xinyi Wang (王心怡), a Postdoctoral Researcher at the Princeton Language and Intelligence Lab, working closely with Danqi Chen. I received my Ph.D. from the University of California, Santa Barbara (UCSB), where I was advised by William Yang Wang. I’ve also interned at the MIT-IBM Watson AI Lab and Microsoft Research. I am honored to have received the J.P. Morgan AI Ph.D. Fellowship and the UCSB Computer Science Outstanding Publication Award. My research focuses on developing a principled understanding of large foundation models from their pretraining data distribution, with the goal of improving their capabilities, addressing their limitations, and optimizing their application across diverse domains. You can download my CV here.

Selected Publications

* indicates equal contribution

  • Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality [paper]

    Vedant Badoni, Danqi Chen, Xinyi Wang

    To Be Released

  • Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models

    Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen

    Proceedings of ICML 2026, Seoul (poster) [paper][code]

  • Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement

    Xunjian Yin, Xinyi Wang, Liangming Pan, Xiaojun Wan, William Yang Wang

    Proceedings of ACL 2025, Vienna (poster) [paper][code]

  • Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data

    Xinyi Wang*, Antonis Antoniades*, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang

    Proceedings of ICLR 2025, Singapore (poster) [paper][code]

  • Guiding Language Model Math Reasoning with Planning Tokens

    Xinyi Wang, Lucas Caccia, Oleksiy Ostapenko, Xingdi Yuan, William Yang Wang, Alessandro Sordoni

    Proceedings of COLM 2024, Philadelphia (poster) [paper][code]

  • Understanding the Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation

    Xinyi Wang, Alfonso Amayuelas, Kexun Zhang, Liangming Pan, Wenhu Chen, William Yang Wang

    Proceedings of ICML 2024, Vienna (poster) [paper][code]

  • Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning

    Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, William Yang Wang

    Proceedings of NeurIPS 2023, New Orleans (poster) [paper][code]

  • Causal Balancing for Domain Generalization

    Xinyi Wang, Michael Saxon, Jiachen Li, Hongyang Zhang, Kun Zhang, William Yang Wang

    Proceedings of ICLR 2023, Kigali (poster) [paper][code]

Talks

  • Talk at PLI lunch on WebGraphMix (pretraining data selection), May 2026: [slides]
  • Talk at NYU CILVR seminar on reasoning scaling law, March 2026: [slides]
  • Talk at Princeton NLP lab meeting on LLM RL Training-Rollout Mismatch, January 2026: [slides]
  • Talk at PLI lunch on reasoning scaling law, September 2025: [slides]
  • Talk at NICE on Generalization v.s. Memorization, August 2025: [slides]
  • Talk at ICML 2025 MOSS workshop on reasoning scaling law, July 2025: [slides]
  • My academic job talk and PhD defense presentation, given at multiple institutions, on Understanding Large Language Models from Pretraining Data Distribution, February-May 2025: [slides]
  • My PhD proposal presentation, March 2024: [slides]
  • Talk at Tsinghua University and Peking University on Understanding and Improving Pre-trained Large Language Models through a Probabilistic Lens, October 2023: [slides]
  • Talk at Hong Kong University of Science and Technology on understanding in-context learning and demonstration selection, May 2023: [slides]
  • My PhD major area exam presentation on probabilistic theory of LLMs, March 2023: [slides]

Services