# Speech To Speech

This repository implements a speech-to-speech cascaded pipeline made up of consecutive parts:

1. Voice Activity Detection (VAD)
2. Speech to Text (STT)
3. Language Model (LM)
4. Text to Speech (TTS)
The pipeline aims to provide a fully open and modular approach, leveraging models available on the Transformers library via the Hugging Face hub. The level of modularity intended for each part is as follows:

- **VAD**: [Silero VAD](https://github.com/snakers4/silero-vad)
- **STT**: any Whisper model checkpoint on the Hugging Face hub
- **LM**: any instruct model available on the Hugging Face hub
- **TTS**: [Parler-TTS](https://github.com/huggingface/parler-tts)
The code is designed to facilitate easy modification. Each component is implemented as a class and can be re-implemented to match specific needs.
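For instance, a new pipeline stage can be sketched as a class exposing `setup` and `process` hooks. The snippet below is a minimal illustration assuming the queue-based handler pattern suggested by `baseHandler.py`; the method names and constructor are assumptions, so adapt it to the actual `BaseHandler` interface.

```python
from queue import Queue
from threading import Event

# Minimal sketch of a custom pipeline stage, assuming a queue-based
# handler pattern like the one in baseHandler.py (names and signatures
# here are assumptions, not the repository's exact interface).
class UppercaseTextHandler:
    def __init__(self, stop_event: Event, queue_in: Queue, queue_out: Queue):
        self.stop_event = stop_event
        self.queue_in = queue_in
        self.queue_out = queue_out
        self.setup()

    def setup(self):
        # Load models or other heavy resources once, up front.
        pass

    def process(self, text: str) -> str:
        # Replace this with your component's actual logic.
        return text.upper()

    def run(self):
        # Consume items from the previous stage and feed the next one.
        while not self.stop_event.is_set():
            item = self.queue_in.get()
            self.queue_out.put(self.process(item))
```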
### Setup

Clone the repository:

```bash
git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
```

Install the required dependencies using [uv](https://github.com/astral-sh/uv):

```bash
uv pip install -r requirements.txt
```

For Mac users, use the `requirements_mac.txt` file instead:

```bash
uv pip install -r requirements_mac.txt
```

If you want to use Melo TTS, you also need to run:

```bash
python -m unidic download
```
### Usage

The pipeline can be run in two ways:

- **Server/Client approach**: models run on a server, and audio input/output are streamed from a client.
- **Local approach**: everything runs on a single machine.

To run the server inside Docker (see `docker-compose.yml`), you will need the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
### Server/Client Approach
1. Run the pipeline on the server:
```bash
python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
```

2. Run the client locally to handle microphone input and receive generated audio:

```bash
python listen_and_play.py --host <IP address of your server>
```
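Under the hood, the client streams raw audio to the server and plays back what it receives. The sketch below illustrates that loop, assuming raw 16-bit PCM over two TCP sockets; the ports, sample rate, and chunk size are assumptions, so check `listen_and_play.py` for the actual protocol.

```python
import socket
import threading

import sounddevice as sd  # listed in requirements.txt

HOST = "192.168.1.10"                # IP address of your server
SEND_PORT, RECV_PORT = 12345, 12346  # assumed default ports
RATE, CHUNK = 16000, 1024            # assumed sample rate / chunk size

send_sock = socket.create_connection((HOST, SEND_PORT))
recv_sock = socket.create_connection((HOST, RECV_PORT))

def record():
    # Stream microphone audio to the server, chunk by chunk.
    with sd.RawInputStream(samplerate=RATE, channels=1, dtype="int16") as stream:
        while True:
            data, _ = stream.read(CHUNK)
            send_sock.sendall(bytes(data))

def play():
    # Play back the audio chunks generated by the server.
    with sd.RawOutputStream(samplerate=RATE, channels=1, dtype="int16") as stream:
        while True:
            data = recv_sock.recv(CHUNK * 2)  # 2 bytes per int16 sample
            if not data:
                break
            stream.write(data)

threading.Thread(target=record, daemon=True).start()
play()
```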
### Local Approach (Mac)

For optimal settings on Mac:

```bash
python s2s_pipeline.py --local_mac_optimal_settings
```

This setting sets `--device mps` to use MPS for all models.
### Recommended usage with CUDA

Leverage Torch Compile for Whisper and Parler-TTS:

```bash
python s2s_pipeline.py \
  --recv_host 0.0.0.0 \
  --send_host 0.0.0.0 \
  --lm_model_name microsoft/Phi-3-mini-4k-instruct \
  --init_chat_role system \
  --stt_compile_mode reduce-overhead \
  --tts_compile_mode default
```

For the moment, modes capturing CUDA Graphs (`reduce-overhead`, `max-autotune`) are not compatible with streaming Parler-TTS.
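For intuition, a compile mode is typically applied by wrapping a model's forward pass with `torch.compile`. The snippet below is a sketch of that general pattern, not the exact wiring in `s2s_pipeline.py`:

```python
import torch

# Sketch of the general torch.compile pattern behind flags like
# --stt_compile_mode (illustrative; the pipeline's own code may differ).
def apply_compile_mode(model: torch.nn.Module, mode: str = "default"):
    model.forward = torch.compile(model.forward, mode=mode, fullgraph=True)
    return model

# Modes like "reduce-overhead" capture CUDA Graphs and need a few warmup
# calls before reaching steady-state latency.
```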
### Multi-language Support

The pipeline supports multiple languages, allowing for automatic language detection or a specific language setting. Here are examples for both local (Mac) and server setups:
#### With the server version

For automatic language detection:

```bash
python s2s_pipeline.py \
  --stt_model_name large-v3 \
  --language auto \
  --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct
```
Or for one language in particular, Chinese in this example:

```bash
python s2s_pipeline.py \
  --stt_model_name large-v3 \
  --language zh \
  --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct
```
#### Local Mac setup

For automatic language detection:

```bash
python s2s_pipeline.py \
  --local_mac_optimal_settings \
  --device mps \
  --stt_model_name large-v3 \
  --language auto \
  --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
```
Or for one language in particular, Chinese in this example:

```bash
python s2s_pipeline.py \
  --local_mac_optimal_settings \
  --device mps \
  --stt_model_name large-v3 \
  --language zh \
  --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
```
### Command-line Usage

`model_name`, `torch_dtype`, and `device` are exposed for each part leveraging the Transformers' implementations: Speech to Text, Language Model, and Text to Speech. Specify the targeted pipeline part with the corresponding prefix:

- `stt` (Speech to Text)
- `lm` (Language Model)
- `tts` (Text to Speech)

For example:

```bash
--lm_model_name google/gemma-2b-it
```
Other generation parameters of the model's `generate` method can be set using the part's prefix + `_gen_`, e.g., `--stt_gen_max_new_tokens 128`. These parameters can be added to the pipeline part's arguments class if not already exposed (see `LanguageModelHandlerArguments` for an example).
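As a hedged sketch of how this naming convention can be resolved into keyword arguments for `generate()` (illustrative only; the repository's actual helper may differ):

```python
from dataclasses import dataclass, fields

# Illustrative: collect `<prefix>_gen_*` fields from an arguments class
# into kwargs for a model's generate() method. This only demonstrates
# the naming convention described above.
def build_gen_kwargs(args, prefix: str) -> dict:
    marker = f"{prefix}_gen_"
    return {
        f.name[len(marker):]: getattr(args, f.name)
        for f in fields(args)
        if f.name.startswith(marker) and getattr(args, f.name) is not None
    }

@dataclass
class SttArguments:  # hypothetical arguments class for the example
    stt_gen_max_new_tokens: int = 128
    stt_gen_num_beams: int | None = None

print(build_gen_kwargs(SttArguments(), "stt"))  # {'max_new_tokens': 128}
```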
#### VAD parameters

- `--thresh`: Threshold value to trigger voice activity detection.
- `--min_speech_ms`: Minimum duration of detected voice activity to be considered speech.
- `--min_silence_ms`: Minimum length of silence intervals for segmenting speech, balancing sentence cutting and latency reduction.

#### Language Model parameters

- `--init_chat_role`: Defaults to `None`. Sets the initial role in the chat template, if applicable. Refer to the model's card to set this value (e.g., for Phi-3-mini-4k-instruct you have to set `--init_chat_role system`).
- `--init_chat_prompt`: Defaults to `"You are a helpful AI assistant."` Required when setting `--init_chat_role`.

#### Text to Speech parameters

- `--description`: Sets the description for the Parler-TTS generated voice. Defaults to: "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."
- `--play_steps_s`: Specifies the duration of the first chunk sent during streaming output from Parler-TTS, impacting readiness and decoding steps.
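For intuition on `--play_steps_s`: the duration is converted into a number of decoder steps using the audio codec's frame rate. A back-of-the-envelope sketch, where the frame rate is an assumption (read it from the TTS model's audio encoder config in practice):

```python
# Sketch: how a duration like --play_steps_s 0.5 maps to decoder steps.
frame_rate = 86      # codec frames per second (assumption)
play_steps_s = 0.5   # duration of the first streamed chunk
play_steps = int(frame_rate * play_steps_s)
print(play_steps)    # 43 decoder steps before the first audio chunk
```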
### Citations

```bibtex
@misc{Silero_VAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}
```

```bibtex
@misc{gandhi2023distilwhisper,
  title = {Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
  author = {Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
  year = {2023},
  eprint = {2311.00430},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```

```bibtex
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
```