Rongjiehuang

@Rongjiehuang · Menlo Park · rongjiehuang.github.io

Focusing on multimodal synthesis (speech/audio/singing), speech translation, and self-supervised learning.

29 repos
533 followers
66 following

Python 86%
Jupyter Notebook 14%

Hi there 👋

I am Rongjie Huang (黄融杰). I did my graduate study at the College of Computer Science and Software, Zhejiang University, supervised by Prof. Zhou Zhao, and I also obtained my Bachelor’s degree at Zhejiang University. During my graduate study, I was lucky to collaborate with the CMU Speech Team led by Prof. Shinji Watanabe and with the Audio Research Team at Zhejiang University. I was grateful to intern or collaborate at TikTok, Shanghai AI Lab (OpenGV Lab), Tencent Seattle Lab, and Alibaba DAMO Academy, with Yi Ren, Jinglin Liu, Chunlei Zhang, and Dong Yu.

My research interests include Multi-Modal Generative AI, Multi-Modal Language Processing, and AI4Science. I have published first-author papers at top international AI conferences such as NeurIPS, ICLR, ICML, ACL, and IJCAI.

I am actively looking for academic collaboration; feel free to drop me an email.

📎 Homepages

Personal Page: https://rongjiehuang.github.io (updated recently🔥)
LinkedIn: https://www.linkedin.com/in/rongjie-huang-a362541b2
Google Scholar: https://scholar.google.com/citations?user=iRHBUsgAAAAJ

💻 Selected Research Papers

Generative AI for Speech, Singing, and Audio: Spoken Large Language Model, Text-to-Audio Synthesis, Text-to-Speech Synthesis, Singing Voice Synthesis

Audio-Visual Language Processing: Audio-Visual Speech-to-Speech Translation, Self-Supervised Learning

My full paper list is available on my personal homepage.

Spoken Large Language Model

InstructSpeech: Following Speech Editing Instructions via Large Language Models. Rongjie Huang, Ruofan Hu, Yongqi Wang, Zehan Wang, Xize Cheng, Ziyue Jiang, Zhenhui Ye, Dongchao Yang, Luping Liu, Peng Gao, Zhou Zhao. ICML, 2024
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe. AAAI, 2024
UniAudio: An Audio Foundation Model Toward Universal Audio Generation. Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, Helen Meng. ICML, 2024

Text-to-Audio Synthesis

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models. Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao. ICML, 2023
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT. Le Zhuo*, Ruoyi Du*, Han Xiao*, Yangguang Li*, Dongyang Liu*, Rongjie Huang*, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao. ICLR, 2024
Make-An-Audio 2: Improving Text-to-Audio with Dual Text Information Representation. Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao. arXiv, 2023

Audio-Visual Language Processing

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset. FAIR at Meta. Core contributor
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation. Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, and Zhou Zhao. ICLR, 2023
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation. Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, and Zhou Zhao. ACL, 2023

Text-to-Speech Synthesis

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech. Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. NeurIPS, 2022
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. Rongjie Huang, Max W.Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. IJCAI, 2022 (oral)
Multi-Singer: Fast multi-singer singing voice vocoder with a large-scale corpus. Rongjie Huang, Feiyang Chen, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. ACM MM, 2021 (oral)
