P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

Authors: Sungwon Kim, Kevin J. Shih, Rohan Badlani, João Felipe Santos, Evelina Bakhturina, Mikyas Desta, Rafael Valle, Sungroh Yoon, Bryan Catanzaro

Posted: Sungwon Kim

Overview

In our latest work, we present P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow-matching generative decoder for high-quality and fast speech synthesis. The speech-prompted text encoder takes a speech prompt and the text input and produces a speaker-conditional text representation; the flow-matching decoder then uses this speaker-conditional output to synthesize high-quality personalized speech significantly faster than real time. Unlike neural codec language models, we train P-Flow on the LibriTTS dataset using a continuous mel representation. With this training method based on continuous speech prompts, P-Flow matches the speaker-similarity performance of large-scale zero-shot TTS models with two orders of magnitude less training data and offers more than 20x faster sampling. Our results show that P-Flow has better pronunciation and is preferred over its recent state-of-the-art counterparts in human likeness and speaker similarity, making it an attractive and practical alternative.

The following are the key technical highlights of our work:

  • We propose a speech-prompting approach for non-autoregressive zero-shot TTS that surpasses the speaker-embedding approach and provides in-context learning capabilities for speaker adaptation.
  • We propose a flow-matching generative decoder for high-quality and fast zero-shot TTS that significantly improves synthesis speed and sample quality over the large-scale autoregressive baseline.
  • We demonstrate speaker-adaptation performance comparable to the large-scale autoregressive baseline using significantly less training data and a small transformer-based encoder, highlighting the effectiveness of the proposed speech-prompting approach.
  • P-Flow achieves an average inference latency of 0.11 seconds on an NVIDIA A100 GPU.

This work was published at NeurIPS 2023.

The samples below are generated by a version of P-Flow that was trained on more data and uses a larger model architecture than the version presented in the paper. In these samples we synthesize P-Flow's abstract in English with different reference voices from our first and last authors.

Reference Speaker | Reference | P-Flow (v1.5)
Bryan Catanzaro
Sungwon Kim
Sungwon Kim
Sungwon Kim
Sungwon Kim
Sungwon Kim

Method

In this work, we aim to provide in-context learning capabilities for zero-shot speaker adaptation in a high-quality, fast, non-autoregressive TTS model. To avoid the potential information bottleneck of speaker-embedding approaches, which compress the reference speech into a fixed-size vector, we adopt a speech-prompting approach that directly uses the reference speech as a prompt for speaker adaptation, similar to neural codec language models.

Given paired <text, speech> data, P-Flow is trained in a manner similar to a masked autoencoder: a random 3-second segment of the speech is used as the prompt, and the model performs generative modeling of the remaining segment given that prompt and the text input. At inference time, P-Flow performs zero-shot TTS by taking a 3-second reference speech of the desired speaker as the prompt. P-Flow therefore has the advantage of executing zero-shot TTS using only audio as a prompt, without needing a transcript of the reference speech.
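As a concrete illustration of this prompt construction, here is a minimal PyTorch-style sketch. The 3-second prompt length follows the paper; the frame rate, tensor names, and the choice to exclude the prompted frames from the loss are assumptions for illustration, not the authors' code.

```python
import torch

def sample_prompt(mel: torch.Tensor, prompt_sec: float = 3.0,
                  frames_per_sec: int = 86) -> tuple[torch.Tensor, torch.Tensor]:
    """Pick a random 3-second window of an utterance as the speech prompt.

    mel: [T, n_mels] mel-spectrogram of the full utterance.
    Returns the prompt segment and a boolean mask over the frames that are
    NOT part of the prompt, i.e. the frames the decoder must generate.
    (frames_per_sec ~ 86 assumes a 22.05 kHz / 256-sample hop mel; adjust as needed.)
    """
    T = mel.size(0)
    p = min(int(prompt_sec * frames_per_sec), T)       # prompt length in frames
    start = torch.randint(0, T - p + 1, (1,)).item()   # random segment start
    prompt = mel[start:start + p]                      # [p, n_mels] speech prompt
    gen_mask = torch.ones(T, dtype=torch.bool)
    gen_mask[start:start + p] = False                  # mask out the prompted frames
    return prompt, gen_mask
```

At inference time no masking is needed: the prompt is simply the mel-spectrogram of the 3-second reference clip, and the decoder generates the entire target utterance.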

[Figure: pflow_prompt]

In P-Flow, we feed the text encoder the mel-spectrogram of a 3-second reference speech as a speech prompt, along with the text input, so that it outputs a text representation conditioned on the reference speaker. Additionally, we employ flow matching, a recently proposed class of generative model, as the decoder for fast and high-quality speech synthesis.

  • Speech prompted text encoder for speaker adaptation
  • Flow matching generative model for fast and high-quality speech synthesis
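Flow matching trains the decoder with a simple regression objective over interpolation paths between noise and data. Below is a minimal sketch of an optimal-transport conditional flow matching (OT-CFM) loss of the kind used by flow-matching TTS decoders; the conditioning tensor cond stands for the speaker-conditional text-encoder output, and sigma_min and the network interface are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def cfm_loss(v_theta, x1, cond, sigma_min: float = 1e-4):
    """Optimal-transport conditional flow matching loss (generic sketch).

    v_theta: network predicting a velocity field v(x_t, t, cond).
    x1:      target mel-spectrogram batch, shape [B, T, n_mels].
    cond:    speaker-conditional text-encoder output aligned to x1.
    """
    B = x1.size(0)
    t = torch.rand(B, 1, 1, device=x1.device)             # t ~ U(0, 1)
    x0 = torch.randn_like(x1)                              # Gaussian noise sample
    # Straight-line (OT) path from noise to data at time t.
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0                      # target velocity along the path
    return F.mse_loss(v_theta(x_t, t.view(B), cond), u_t)
```

Because the target is a simple velocity along a straight path rather than a score at many noise levels, the decoder can be sampled with only a handful of ODE steps at inference, which is what makes synthesis significantly faster than real time.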

[Figure: pflow_arch]

Audio Samples

Section 4.1 - Prompting vs. Separate Speaker Encoder

This result demonstrates the effectiveness of speaker conditioning through speech prompting. For the baseline (P-Flow w/o Prompt), instead of feeding the speech prompt directly into the text encoder, we encode a random segment of speech into a fixed-size speaker embedding using a speaker encoder with the same architecture as the text encoder. The table below shows that, with everything else unchanged, using speech prompts leads to a substantial improvement in speaker similarity.

  • P-Flow: uses the speech-prompted text encoder
  • P-Flow (w/o Prompt): uses a fixed-size speaker embedding from a transformer-based speaker encoder

[Table: pflow_table1]
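To make the architectural difference concrete, here is an illustrative PyTorch-style sketch of the two conditioning strategies; module names, sizes, and the shared encoder are simplifications for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

class ConditioningSketch(nn.Module):
    """Illustrative contrast: speech prompting vs. a fixed-size speaker embedding."""

    def __init__(self, d_model: int = 256, n_mels: int = 80, vocab_size: int = 100):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.mel_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, prompt_mel, use_prompt: bool = True):
        text = self.text_emb(text_ids)                     # [B, L, d]
        mel = self.mel_proj(prompt_mel)                    # [B, P, d]
        if use_prompt:
            # Speech prompting: prompt frames are concatenated with the text
            # tokens, so every text position can attend to the raw prompt.
            h = self.encoder(torch.cat([mel, text], dim=1))
            return h[:, mel.size(1):]                      # keep only the text positions
        # Baseline (w/o Prompt): pool the prompt into a single fixed-size
        # speaker embedding, creating the information bottleneck discussed above.
        spk = self.encoder(mel).mean(dim=1, keepdim=True)  # [B, 1, d]
        return self.encoder(text + spk)
```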

Reference | P-Flow | P-Flow (w/o Prompt)
They moved thereafter cautiously about the hut, groping before and about them to find something to show that Warrenton had fulfilled his mission.
Yea, his honourable worship is within. But he hath a godly minister or two with him, and likewise a leech.
The army found the people in poverty and left them in comparative wealth.
He was in deep converse with the clerk, and entered the hall holding him by the arm.

Section 4.2 - Model Comparison (LibriSpeech)

We compare P-Flow with the state-of-the-art zero-shot TTS model VALL-E. Although P-Flow is trained on a significantly smaller amount of data than VALL-E, it achieves a better WER (ASR metric), comparable SECS (speaker-similarity metric), and 20 times faster sampling.

  • Baseline: VALL-E

[Table: pflow_table2]

Ground Truth | Reference | P-Flow | VALL-E
They moved thereafter cautiously about the hut, groping before and about them to find something to show that Warrenton had fulfilled his mission.
And lay me down in thy cold bed, and leave my shining lot.
Yea, his honourable worship is within. But he hath a godly minister or two with him, and likewise a leech.
Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.
The army found the people in poverty and left them in comparative wealth.
Thus did this humane and right-minded father comfort his unhappy daughter; and her mother, embracing her again, did all she could to soothe her feelings.
He was in deep converse with the clerk, and entered the hall holding him by the arm.

Appendix - Model Comparison (VCTK)

We further compare P-Flow with the baseline VALL-E model on a subset of samples from the VCTK dataset. Despite being trained on significantly less data than VALL-E, P-Flow demonstrates strong speaker adaptation on the out-of-distribution VCTK dataset.

Ground Truth | Reference | P-Flow | VALL-E
We have to reduce the number of plastic bags.
So what is the campaign about?
My life has changed a lot.
Nothing is yet confirmed.
I could hardly move for the next couple of days.
His son has been travelling with the Tartan Army for years.
Her husband was very concerned that it might be fatal.
We’ve made a couple of albums.

Section 4.3 - Effect of Guidance Scale and Euler Steps

Effect of Guidance Scale (Gamma = guidance scale, N = 10 sampling steps)

Reference | P-Flow (Gamma=0) | P-Flow (Gamma=1) | P-Flow (Gamma=2)
They moved thereafter cautiously about the hut, groping before and about them to find something to show that Warrenton had fulfilled his mission.
Yea, his honourable worship is within. But he hath a godly minister or two with him, and likewise a leech.
The army found the people in poverty and left them in comparative wealth.
He was in deep converse with the clerk, and entered the hall holding him by the arm.

Effect of Sampling Steps (N = number of Euler sampling steps, Gamma = 1)

Reference | P-Flow (N=1) | P-Flow (N=2) | P-Flow (N=5) | P-Flow (N=10) | P-Flow (N=20)
They moved thereafter cautiously about the hut, groping before and about them to find something to show that Warrenton had fulfilled his mission.
Yea, his honourable worship is within. But he hath a godly minister or two with him, and likewise a leech.
The army found the people in poverty and left them in comparative wealth.
He was in deep converse with the clerk, and entered the hall holding him by the arm.
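At inference time the flow-matching decoder integrates the learned ODE from noise to mel frames with N Euler steps, and the guidance scale amplifies the velocity in the direction of the condition. The sketch below assumes a classifier-free-guidance-style combination in which the guidance branch uses a time-averaged ("neutral") condition; the exact guidance formulation in the paper may differ, so treat this purely as an illustration of how Gamma and N enter sampling.

```python
import torch

@torch.no_grad()
def euler_sample(v_theta, cond, shape, n_steps: int = 10, gamma: float = 1.0):
    """Euler ODE sampling with a guidance scale (illustrative sketch).

    v_theta: learned velocity field v(x_t, t, cond).
    cond:    speaker-conditional text-encoder output, shape [B, T, d].
    shape:   shape of the mel-spectrogram to generate, e.g. (B, T, n_mels).
    gamma:   guidance scale; gamma = 0 recovers plain conditional sampling.
    """
    x = torch.randn(shape, device=cond.device)        # start from Gaussian noise at t = 0
    # Guidance branch: a time-averaged condition (an assumption for illustration).
    cond_bar = cond.mean(dim=1, keepdim=True).expand_as(cond)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt, device=cond.device)
        v_c = v_theta(x, t, cond)                     # conditional velocity
        v_b = v_theta(x, t, cond_bar)                 # guidance-branch velocity
        v = v_c + gamma * (v_c - v_b)                 # push toward the condition
        x = x + dt * v                                # one Euler step toward t = 1
    return x
```

With gamma = 0 each step simply follows the conditional velocity; the tables above illustrate how the generated samples change as Gamma increases and as N is reduced.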

Appendix - Zero-shot TTS with Emotional Reference Speech

Text: We have to reduce the number of plastic bags.

Emotion | Reference | P-Flow | VALL-E
Anger
Sleepiness
Neutral
Amused
Disgust