Andrés Marafioti 93d74ba3bc Merge pull request #124 from huggingface/new-new-faster-whisper | 6 days ago | |
---|---|---|
LLM | 1 week ago | |
STT | 6 days ago | |
TTS | 1 week ago | |
VAD | 1 month ago | |
arguments_classes | 6 days ago | |
connections | 1 month ago | |
utils | 1 month ago | |
.dockerignore | 2 months ago | |
.gitignore | 2 months ago | |
Dockerfile | 2 months ago | |
Dockerfile.arm64 | 1 month ago | |
LICENSE | 2 months ago | |
README.md | 3 weeks ago | |
baseHandler.py | 1 month ago | |
docker-compose.yml | 1 month ago | |
listen_and_play.py | 1 week ago | |
logo.png | 2 months ago | |
requirements.txt | 6 days ago | |
requirements_mac.txt | 6 days ago | |
s2s_pipeline.py | 6 days ago |
This repository implements a speech-to-speech cascaded pipeline consisting of the following parts:
The pipeline provides a fully open and modular approach, with a focus on leveraging models available through the Transformers library on the Hugging Face hub. The code is designed for easy modification, and we already support device-specific and external library implementations:
VAD
STT
LLM
TTS
Clone the repository:
git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
Install the required dependencies using uv:
uv pip install -r requirements.txt
For Mac users, use the requirements_mac.txt
file instead:
uv pip install -r requirements_mac.txt
If you want to use Melo TTS, you also need to run:
python -m unidic download
The pipeline can be run in two ways:
Run the pipeline on the server:
python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
bash
python listen_and_play.py --host <IP address of your server>
For optimal settings on Mac:
python s2s_pipeline.py --local_mac_optimal_settings
This setting:
--device mps
to use MPS for all models.
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
### Recommended usage with Cuda
Leverage Torch Compile for Whisper and Parler-TTS. **The usage of Parler-TTS allows for audio output streaming, futher reducing the overeall latency** 🚀:
```bash
python s2s_pipeline.py \
--lm_model_name microsoft/Phi-3-mini-4k-instruct \
--stt_compile_mode reduce-overhead \
--tts_compile_mode default \
--recv_host 0.0.0.0 \
--send_host 0.0.0.0
For the moment, modes capturing CUDA Graphs are not compatible with streaming Parler-TTS (reduce-overhead
, max-autotune
).
The pipeline currently supports English, French, Spanish, Chinese, Japanese, and Korean.
Two use cases are considered:
--language
flag, specifying the target language code (default is 'en').--language
to 'auto'. In this case, Whisper detects the language for each spoken prompt, and the LLM is prompted with "Please reply to my message in ...
" to ensure the response is in the detected language.Please note that you must use STT and LLM checkpoints compatible with the target language(s). For the STT part, Parler-TTS is not yet multilingual (though that feature is coming soon! 🤗). In the meantime, you should use Melo (which supports English, French, Spanish, Chinese, Japanese, and Korean) or Chat-TTS.
For automatic language detection:
python s2s_pipeline.py \
--stt_model_name large-v3 \
--language auto \
--mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct \
Or for one language in particular, chinese in this example
python s2s_pipeline.py \
--stt_model_name large-v3 \
--language zh \
--mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct \
For automatic language detection:
python s2s_pipeline.py \
--local_mac_optimal_settings \
--device mps \
--stt_model_name large-v3 \
--language auto \
--mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
Or for one language in particular, chinese in this example
python s2s_pipeline.py \
--local_mac_optimal_settings \
--device mps \
--stt_model_name large-v3 \
--language zh \
--mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
NOTE: References for all the CLI arguments can be found directly in the arguments classes or by running
python s2s_pipeline.py -h
.
See ModuleArguments class. Allows to set:
--device
(if one wants each part to run on the same device)--mode
local
or server
See VADHandlerArguments class. Notably:
--thresh
: Threshold value to trigger voice activity detection.--min_speech_ms
: Minimum duration of detected voice activity to be considered speech.--min_silence_ms
: Minimum length of silence intervals for segmenting speech, balancing sentence cutting and latency reduction.model_name
, torch_dtype
, and device
are exposed for each implementation of the Speech to Text, Language Model, and Text to Speech. Specify the targeted pipeline part with the corresponding prefix (e.g. stt
, lm
or tts
, check the implementations' arguments classes for more details).
For example:
--lm_model_name google/gemma-2b-it
Other generation parameters of the model's generate method can be set using the part's prefix + _gen_
, e.g., --stt_gen_max_new_tokens 128
. These parameters can be added to the pipeline part's arguments class if not already exposed.
@misc{Silero VAD,
author = {Silero Team},
title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad}},
commit = {insert_some_commit_here},
email = {hello@silero.ai}
}
@misc{gandhi2023distilwhisper,
title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
year={2023},
eprint={2311.00430},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/parler-tts}}
}