CLIPSynth

CLIPSynth: Learning Text-to-audio Synthesis from Videos using CLIP and Diffusion Models

Hao-Wen Dong¹,²* Gunnar A. Sigurdsson¹ Chenyang Tao¹ Jiun-Yu Kao¹ Yu-Hsiang Lin¹ Anjali Narayan-Chen¹
Arpit Gupta¹ Tagyoung Chung¹ Jing Huang¹ Nanyun Peng¹,³ Wenbo Zhao¹
¹ Amazon Alexa AI ² University of California San Diego ³ University of California, Los Angeles
* Work done during an internship at Amazon


Content

Example results on MUSIC
Example results on VGG-Sound
Image-queried Synthesis Demo
Out-of-distribution Generalization Experiments
Citation

Summary of the compared models

CLIPSynth: Our proposed model (trained with image queries).
CLIPSynth-Text: The proposed CLIPSynth model trained with text queries.
CLIPSynth-Hybrid: The proposed CLIPSynth model trained with both image and text queries.
MiniLMSynth: The proposed CLIPSynth model with the query model replaced by a MiniLM encoder.
CLIPRetriever: A retrieval-based model that returns the audio associated with the image whose CLIP embedding is closest to that of the input text query.
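
The retrieval step described for CLIPRetriever can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cosine similarity as the measure of "closest", uses toy 3-dimensional vectors in place of real CLIP embeddings, and the names `cosine_similarity`, `retrieve_audio`, and the audio file names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_audio(text_embedding, image_embeddings, audio_ids):
    """Return the audio paired with the image closest to the text query.

    Assumption: image_embeddings[i] and audio_ids[i] come from the same video.
    """
    best_index = max(
        range(len(image_embeddings)),
        key=lambda i: cosine_similarity(text_embedding, image_embeddings[i]),
    )
    return audio_ids[best_index]


# Toy stand-ins for CLIP embeddings (hypothetical values).
image_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
audio_ids = ["cello.wav", "flute.wav", "piano.wav"]
text_embedding = [0.1, 0.9, 0.0]

result = retrieve_audio(text_embedding, image_embeddings, audio_ids)
print(result)  # flute.wav
```

Because the model only looks up existing audio, it is not generative, which matches its row in the table below.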

Model	Generative	Unlabeled data only	Query type (training)	Query type (test)
CLIPSynth	✓	✓	Image	Text
CLIPSynth-Text	✓		Text	Text
CLIPSynth-Hybrid	✓		Image + Text	Text
MiniLMSynth	✓		Text	Text
CLIPRetriever		✓	-	Text

Important notes

For the CLIPSynth models, we rewrite each text query as “a photo of playing {query}” on MUSIC and “a photo of {query}” on VGG-Sound.
All the spectrograms shown are mel spectrograms.
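
The query templates in the first note above can be sketched as follows. The template strings are taken from the note; the function name `build_prompt` and the dictionary `PROMPT_TEMPLATES` are hypothetical names introduced here for illustration.

```python
# Templates quoted from the note above, keyed by dataset name.
PROMPT_TEMPLATES = {
    "MUSIC": "a photo of playing {query}",
    "VGG-Sound": "a photo of {query}",
}


def build_prompt(query, dataset):
    """Wrap a raw text query in the dataset-specific template."""
    return PROMPT_TEMPLATES[dataset].format(query=query)


print(build_prompt("cello", "MUSIC"))              # a photo of playing cello
print(build_prompt("goat bleating", "VGG-Sound"))  # a photo of goat bleating
```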

Example results on MUSIC

All examples in this section use text queries and are randomly selected samples.

Examples

bassoon	cello	pipa	acoustic guitar	electric bass

erhu	piano	erhu	bagpipe	guzheng

bassoon	bagpipe	drum	flute	cello

clarinet	acoustic guitar	erhu	pipa	guzheng

Comparison 1

Query: “bassoon”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Comparison 2

Query: “cello”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Comparison 3

Query: “pipa”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Comparison 4

Query: “acoustic guitar”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Comparison 5

Query: “electric bass”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Example results on VGG-Sound

All examples in this section use text queries and are randomly selected samples.

Examples

people crowd	people sniggering	goat bleating	baby laughter	sharpen knife

playing marimba, xylophone	car engine starting	playing sitar	sliding door	engine accelerating, revving, vroom

child speech, kid speaking	train horning	helicopter	male speech, man speaking	dog bow-wow

ambulance siren	playing acoustic guitar	dog barking	bowling impact	pigeon, dove cooing

Comparison 1

Query: “people crowd”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Comparison 2

Query: “people sniggering”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Comparison 3

Query: “goat bleating”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Comparison 4

Query: “baby laughter”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Comparison 5

Query: “sharpen knife”

CLIPSynth	CLIPSynth-Text	CLIPSynth-Hybrid	MiniLMSynth	CLIPRetriever

Image-queried Synthesis Demo

All examples in this section use image queries and are randomly selected samples.

Examples on MUSIC

Examples on VGG-Sound

Out-of-distribution Generalization Experiments

In this experiment, we examine how well the trained CLIPSynth model generalizes to unseen objects and combinatory prompts.

Experiment on CLIPSynth trained on MUSIC

Note: The model generalizes to unseen objects to some extent (viola, double bass, marimba, and bongos are not present in the MUSIC dataset). However, it fails to handle combinatory inputs and instead generates “average” sounds.

Experiment on CLIPSynth trained on VGG-Sound

Citation

Hao-Wen Dong, Gunnar A. Sigurdsson, Chenyang Tao, Jiun-Yu Kao, Yu-Hsiang Lin, Anjali Narayan-Chen, Arpit Gupta, Tagyoung Chung, Jing Huang, Nanyun Peng, and Wenbo Zhao, “CLIPSynth: Learning Text-to-audio Synthesis from Videos using CLIP and Diffusion Models,” Proceedings of the CVPR Workshop on Sight and Sound, 2023.

@inproceedings{dong2023clipsynth,
    author = {Hao-Wen Dong and Gunnar A. Sigurdsson and Chenyang Tao and Jiun-Yu Kao and Yu-Hsiang Lin and Anjali Narayan-Chen and Arpit Gupta and Tagyoung Chung and Jing Huang and Nanyun Peng and Wenbo Zhao},
    title = {CLIPSynth: Learning Text-to-audio Synthesis from Videos using CLIP and Diffusion Models},
    booktitle = {Proceedings of the CVPR Workshop on Sight and Sound},
    year = 2023,
}