Real-time speech interaction.

Speech To Speech: an effort toward an open-sourced and modular GPT-4o


Approach

Structure

This repository implements a speech-to-speech cascaded pipeline consisting of the following parts:

  1. Voice Activity Detection (VAD)
  2. Speech to Text (STT)
  3. Language Model (LM)
  4. Text to Speech (TTS)
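
To make the data flow concrete, here is a minimal sketch of how such a cascade can be chained with queues, each stage consuming the previous stage's output. The Stage class and the placeholder functions are illustrative stand-ins, not this repository's actual handlers:

```python
# Minimal sketch of a cascaded pipeline: four stages chained by queues,
# each running in its own thread and consuming the previous stage's output.
from queue import Queue
from threading import Thread

class Stage:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def run(self, in_q: Queue, out_q: Queue):
        # Block on the input queue, process one item, forward the result.
        while True:
            out_q.put(self.fn(in_q.get()))

# audio_in -> speech -> text -> reply -> audio_out
queues = [Queue() for _ in range(5)]
stages = [
    Stage("vad", lambda chunk: chunk),        # 1. keep only voiced audio
    Stage("stt", lambda audio: "hello"),      # 2. transcribe speech to text
    Stage("lm",  lambda text: text.upper()),  # 3. generate a text response
    Stage("tts", lambda text: b"\x00\x01"),   # 4. synthesize audio bytes
]
for stage, in_q, out_q in zip(stages, queues, queues[1:]):
    Thread(target=stage.run, args=(in_q, out_q), daemon=True).start()

queues[0].put(b"raw microphone chunk")
print(queues[-1].get())  # -> b'\x00\x01'
```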

Modularity

The pipeline is fully open and modular, with a focus on leveraging models available through the Transformers library on the Hugging Face Hub. The code is designed for easy modification, and we already support device-specific and external-library implementations:

VAD

STT

LLM

TTS
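
Swapping implementations works because each part exposes the same narrow surface. The sketch below shows what such an interface can look like; it is only an illustration, and the repository defines its own in baseHandler.py:

```python
# Illustrative handler interface; any class with the same two methods
# can be dropped into the corresponding pipeline slot.
from abc import ABC, abstractmethod

class SpeechHandler(ABC):
    @abstractmethod
    def setup(self, model_name: str, device: str = "cuda") -> None:
        """Load the model once, before any audio flows."""

    @abstractmethod
    def process(self, item):
        """Consume one item (audio chunk or text), return the next stage's input."""

class EchoSTT(SpeechHandler):
    # Toy STT implementation used only to demonstrate the interface.
    def setup(self, model_name: str, device: str = "cuda") -> None:
        self.model_name = model_name

    def process(self, item):
        return f"[{self.model_name}] transcribed {len(item)} bytes"

stt = EchoSTT()
stt.setup("openai/whisper-large-v3", device="cpu")
print(stt.process(b"\x00" * 320))
```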

Setup

Clone the repository:

git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech

Install the required dependencies using uv:

uv pip install -r requirements.txt

For Mac users, use the requirements_mac.txt file instead:

uv pip install -r requirements_mac.txt

If you want to use Melo TTS, you also need to run:

python -m unidic download

Usage

The pipeline can be run in two ways:

  • Server/Client approach: models run on a server; audio input and output are streamed from a client.
  • Local approach: the whole pipeline runs on a single machine.

Recommended setup

Server/Client Approach

  1. Run the pipeline on the server:

    python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
    
  2. Run the client locally to handle microphone input and receive the generated audio:

    python listen_and_play.py --host <IP address of your server>
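
For reference, the client only ships microphone audio to the server and plays back the audio it receives. A simplified send-side sketch follows (the address, port, sample rate, and audio format are assumptions for illustration; see listen_and_play.py for the real client, which also handles playback):

```python
import socket
import sounddevice as sd  # microphone capture; assumed available

HOST, PORT = "192.168.1.10", 12345  # hypothetical server address and port

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((HOST, PORT))

def on_audio(indata, frames, time, status):
    # Forward each captured chunk of raw PCM bytes to the server.
    sock.sendall(bytes(indata))

# 16 kHz mono int16 is a common speech format; an assumption here.
with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16",
                       callback=on_audio):
    input("Streaming microphone audio; press Enter to stop.\n")
sock.close()
```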

Local Approach (Mac)

  1. For optimal settings on Mac:

    python s2s_pipeline.py --local_mac_optimal_settings
    

    This setting:

    • Adds --device mps to run all models on MPS
    • Sets LightningWhisperMLX for STT
    • Sets MLX LM for the language model
    • Sets MeloTTS for TTS
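
Before forcing MPS, you can confirm that your PyTorch build actually exposes it:

```python
import torch

# Both should be True for --device mps to work on Apple Silicon.
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
```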

Docker Server

Install the NVIDIA Container Toolkit:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Start the docker container:
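
With the docker-compose.yml provided at the repository root, this is typically:

docker compose up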
Recommended usage with CUDA

Leverage Torch Compile for Whisper and Parler-TTS. Using Parler-TTS enables audio output streaming, further reducing the overall latency 🚀:

```bash
python s2s_pipeline.py \
    --lm_model_name microsoft/Phi-3-mini-4k-instruct \
    --stt_compile_mode reduce-overhead \
    --tts_compile_mode default \
    --recv_host 0.0.0.0 \
    --send_host 0.0.0.0
```

For the moment, torch.compile modes that capture CUDA graphs (reduce-overhead, max-autotune) are not compatible with streaming Parler-TTS.

Multi-language Support

The pipeline currently supports English, French, Spanish, Chinese, Japanese, and Korean.
Two use cases are considered:

  • Single-language conversation: Enforce the language setting using the --language flag, specifying the target language code (default is 'en').
  • Language switching: Set --language to 'auto'. In this case, Whisper detects the language for each spoken prompt, and the LLM is prompted with "Please reply to my message in ..." to ensure the response is in the detected language.
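
As an illustration of the language-switching path described above (the mapping and helper below are hypothetical, not the repository's code):

```python
# Hypothetical sketch: prompt the LLM in the language Whisper detected.
LANGUAGE_NAMES = {"en": "english", "fr": "french", "es": "spanish",
                  "zh": "chinese", "ja": "japanese", "ko": "korean"}

def build_prompt(user_text: str, detected_code: str, language: str = "auto") -> str:
    if language == "auto":
        name = LANGUAGE_NAMES.get(detected_code, "english")
        return f"Please reply to my message in {name}. {user_text}"
    return user_text

print(build_prompt("你好！", "zh"))
# -> Please reply to my message in chinese. 你好！
```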

Please note that you must use STT and LLM checkpoints compatible with the target language(s). For the TTS part, Parler-TTS is not yet multilingual (though that feature is coming soon! 🤗). In the meantime, you should use Melo (which supports English, French, Spanish, Chinese, Japanese, and Korean) or ChatTTS.

With the server version:

For automatic language detection:

python s2s_pipeline.py \
    --stt_model_name large-v3 \
    --language auto \
    --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct

Or for one language in particular, Chinese in this example:

python s2s_pipeline.py \
    --stt_model_name large-v3 \
    --language zh \
    --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct

Local Mac Setup

For automatic language detection:

python s2s_pipeline.py \
    --local_mac_optimal_settings \
    --device mps \
    --stt_model_name large-v3 \
    --language auto \
    --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

Or for one language in particular, Chinese in this example:

python s2s_pipeline.py \
    --local_mac_optimal_settings \
    --device mps \
    --stt_model_name large-v3 \
    --language zh \
    --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

Command-line Usage

NOTE: References for all the CLI arguments can be found directly in the arguments classes or by running python s2s_pipeline.py -h.

Module level Parameters

See the ModuleArguments class. It allows you to set:

  • a common --device (if you want every part to run on the same device)
  • --mode: local or server
  • the chosen STT implementation
  • the chosen LM implementation
  • the chosen TTS implementation
  • the logging level
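
For example, to run the whole pipeline locally on one device (values are illustrative; check python s2s_pipeline.py -h for the exact flags):

python s2s_pipeline.py --mode local --device cuda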

VAD parameters

See VADHandlerArguments class. Notably:

  • --thresh: Threshold value to trigger voice activity detection.
  • --min_speech_ms: Minimum duration of detected voice activity to be considered speech.
  • --min_silence_ms: Minimum length of silence intervals for segmenting speech, balancing sentence cutting and latency reduction.
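
For example, to require at least half a second of speech and treat pauses above 300 ms as segment boundaries (values are illustrative):

python s2s_pipeline.py --thresh 0.5 --min_speech_ms 500 --min_silence_ms 300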

STT, LM and TTS parameters

model_name, torch_dtype, and device are exposed for each implementation of the Speech to Text, Language Model, and Text to Speech parts. Specify the targeted pipeline part with the corresponding prefix (e.g. stt, lm, or tts; check the implementations' arguments classes for more details).

For example:

--lm_model_name google/gemma-2b-it
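
Or, setting several parts at once (the flag names follow the prefix pattern described above; the model names are only examples):

python s2s_pipeline.py \
    --stt_model_name distil-whisper/distil-large-v3 \
    --lm_model_name google/gemma-2b-it \
    --tts_model_name parler-tts/parler-tts-mini-v1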

Generation parameters

Other generation parameters of the model's generate method can be set using the part's prefix + _gen_, e.g., --stt_gen_max_new_tokens 128. These parameters can be added to the pipeline part's arguments class if not already exposed.
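
As an illustration, prefixed arguments like these can be collected into keyword arguments for generate() roughly as follows (a sketch, not the repository's code):

```python
# Hypothetical helper: turn a flag such as --stt_gen_max_new_tokens 128
# into model.generate(max_new_tokens=128).
def collect_gen_kwargs(args: dict, prefix: str = "stt_gen_") -> dict:
    return {k[len(prefix):]: v for k, v in args.items() if k.startswith(prefix)}

cli_args = {"stt_model_name": "large-v3", "stt_gen_max_new_tokens": 128}
print(collect_gen_kwargs(cli_args))  # -> {'max_new_tokens': 128}
```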

Citations

Silero VAD

@misc{SileroVAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}

Distil-Whisper

@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Parler-TTS

@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}