# Speech To Speech

This repository implements a speech-to-speech cascaded pipeline made up of consecutive parts:

1. Voice Activity Detection (VAD)
2. Speech to Text (STT)
3. Language Model (LM)
4. Text to Speech (TTS)
The pipeline aims to provide a fully open and modular approach, leveraging models available on the Transformers library via the Hugging Face hub. The level of modularity intended for each part is as follows:

- **VAD**: [Silero VAD](https://github.com/snakers4/silero-vad)
- **STT**: any Whisper model checkpoint on the Hugging Face hub
- **LM**: any instruct model available on the Hugging Face hub
- **TTS**: [Parler-TTS](https://github.com/huggingface/parler-tts)
The code is designed to facilitate easy modification. Each component is implemented as a class and can be re-implemented to match specific needs.
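For instance, a new pipeline stage can be sketched as a class exposing `setup` and `process` hooks. The snippet below is a minimal illustration assuming the queue-based handler pattern suggested by `baseHandler.py`; the method names and constructor are assumptions, so adapt it to the actual `BaseHandler` interface.

```python
from queue import Queue
from threading import Event

# Minimal sketch of a custom pipeline stage, assuming a queue-based
# handler pattern like the one in baseHandler.py (names and signatures
# here are assumptions, not the repository's exact interface).
class UppercaseTextHandler:
    def __init__(self, stop_event: Event, queue_in: Queue, queue_out: Queue):
        self.stop_event = stop_event
        self.queue_in = queue_in
        self.queue_out = queue_out
        self.setup()

    def setup(self):
        # Load models or other heavy resources once, up front.
        pass

    def process(self, text: str) -> str:
        # Replace this with your component's actual logic.
        return text.upper()

    def run(self):
        # Consume items from the previous stage and feed the next one.
        while not self.stop_event.is_set():
            item = self.queue_in.get()
            self.queue_out.put(self.process(item))
```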
### Setup

Clone the repository:

```bash
git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
```

Install the required dependencies using [uv](https://github.com/astral-sh/uv):

```bash
uv pip install -r requirements.txt
```

For Mac users, use the `requirements_mac.txt` file instead:

```bash
uv pip install -r requirements_mac.txt
```

If you want to use Melo TTS, you also need to run:

```bash
python -m unidic download
```
### Usage

The pipeline can be run in two ways:

- **Server/Client approach**: models run on a server, and audio input/output are streamed from a client.
- **Local approach**: everything runs on a single machine.

To run the server inside Docker (see `docker-compose.yml`), you will need the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
### Server/Client Approach
1. Run the pipeline on the server:
```bash
python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
```

2. Run the client locally to handle microphone input and receive generated audio:

```bash
python listen_and_play.py --host <IP address of your server>
```
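Under the hood, the client streams raw audio to the server and plays back what it receives. The sketch below illustrates that loop, assuming raw 16-bit PCM over two TCP sockets; the ports, sample rate, and chunk size are assumptions, so check `listen_and_play.py` for the actual protocol.

```python
import socket
import threading

import sounddevice as sd  # listed in requirements.txt

HOST = "192.168.1.10"                # IP address of your server
SEND_PORT, RECV_PORT = 12345, 12346  # assumed default ports
RATE, CHUNK = 16000, 1024            # assumed sample rate / chunk size

send_sock = socket.create_connection((HOST, SEND_PORT))
recv_sock = socket.create_connection((HOST, RECV_PORT))

def record():
    # Stream microphone audio to the server, chunk by chunk.
    with sd.RawInputStream(samplerate=RATE, channels=1, dtype="int16") as stream:
        while True:
            data, _ = stream.read(CHUNK)
            send_sock.sendall(bytes(data))

def play():
    # Play back the audio chunks generated by the server.
    with sd.RawOutputStream(samplerate=RATE, channels=1, dtype="int16") as stream:
        while True:
            data = recv_sock.recv(CHUNK * 2)  # 2 bytes per int16 sample
            if not data:
                break
            stream.write(data)

threading.Thread(target=record, daemon=True).start()
play()
```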
### Local Approach (Mac)

For optimal settings on Mac:

```bash
python s2s_pipeline.py --local_mac_optimal_settings
```

This setting sets `--device mps` to use MPS for all models.
### Recommended usage with CUDA

Leverage Torch Compile for Whisper and Parler-TTS:

```bash
python s2s_pipeline.py \
  --recv_host 0.0.0.0 \
  --send_host 0.0.0.0 \
  --lm_model_name microsoft/Phi-3-mini-4k-instruct \
  --init_chat_role system \
  --stt_compile_mode reduce-overhead \
  --tts_compile_mode default
```

For the moment, modes capturing CUDA Graphs (`reduce-overhead`, `max-autotune`) are not compatible with streaming Parler-TTS.
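For intuition, a compile mode is typically applied by wrapping a model's forward pass with `torch.compile`. The snippet below is a sketch of that general pattern, not the exact wiring in `s2s_pipeline.py`:

```python
import torch

# Sketch of the general torch.compile pattern behind flags like
# --stt_compile_mode (illustrative; the pipeline's own code may differ).
def apply_compile_mode(model: torch.nn.Module, mode: str = "default"):
    model.forward = torch.compile(model.forward, mode=mode, fullgraph=True)
    return model

# Modes like "reduce-overhead" capture CUDA Graphs and need a few warmup
# calls before reaching steady-state latency.
```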
### Multi-language Support

The pipeline supports multiple languages, allowing for automatic language detection or a specific language setting. Here are examples for both local (Mac) and server setups:
#### With the server version

For automatic language detection:

```bash
python s2s_pipeline.py \
  --stt_model_name large-v3 \
  --language auto \
  --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct
```
Or for one language in particular, Chinese in this example:

```bash
python s2s_pipeline.py \
  --stt_model_name large-v3 \
  --language zh \
  --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct
```
#### Local Mac setup

For automatic language detection:

```bash
python s2s_pipeline.py \
  --local_mac_optimal_settings \
  --device mps \
  --stt_model_name large-v3 \
  --language auto \
  --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
```
Or for one language in particular, Chinese in this example:

```bash
python s2s_pipeline.py \
  --local_mac_optimal_settings \
  --device mps \
  --stt_model_name large-v3 \
  --language zh \
  --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
```
### Command-line Usage

`model_name`, `torch_dtype`, and `device` are exposed for each part leveraging the Transformers' implementations: Speech to Text, Language Model, and Text to Speech. Specify the targeted pipeline part with the corresponding prefix:

- `stt` (Speech to Text)
- `lm` (Language Model)
- `tts` (Text to Speech)

For example:

```bash
--lm_model_name google/gemma-2b-it
```
Other generation parameters of the model's `generate` method can be set using the part's prefix + `_gen_`, e.g., `--stt_gen_max_new_tokens 128`. These parameters can be added to the pipeline part's arguments class if not already exposed (see `LanguageModelHandlerArguments` for an example).
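As a hedged sketch of how this naming convention can be resolved into keyword arguments for `generate()` (illustrative only; the repository's actual helper may differ):

```python
from dataclasses import dataclass, fields

# Illustrative: collect `<prefix>_gen_*` fields from an arguments class
# into kwargs for a model's generate() method. This only demonstrates
# the naming convention described above.
def build_gen_kwargs(args, prefix: str) -> dict:
    marker = f"{prefix}_gen_"
    return {
        f.name[len(marker):]: getattr(args, f.name)
        for f in fields(args)
        if f.name.startswith(marker) and getattr(args, f.name) is not None
    }

@dataclass
class SttArguments:  # hypothetical arguments class for the example
    stt_gen_max_new_tokens: int = 128
    stt_gen_num_beams: int | None = None

print(build_gen_kwargs(SttArguments(), "stt"))  # {'max_new_tokens': 128}
```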
#### VAD parameters

- `--thresh`: Threshold value to trigger voice activity detection.
- `--min_speech_ms`: Minimum duration of detected voice activity to be considered speech.
- `--min_silence_ms`: Minimum length of silence intervals for segmenting speech, balancing sentence cutting and latency reduction.

#### Language Model parameters

- `--init_chat_role`: Defaults to `None`. Sets the initial role in the chat template, if applicable. Refer to the model's card to set this value (e.g., for Phi-3-mini-4k-instruct you have to set `--init_chat_role system`).
- `--init_chat_prompt`: Defaults to `"You are a helpful AI assistant."` Required when setting `--init_chat_role`.

#### Text to Speech parameters

- `--description`: Sets the description for the Parler-TTS generated voice. Defaults to: "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."
- `--play_steps_s`: Specifies the duration of the first chunk sent during streaming output from Parler-TTS, impacting readiness and decoding steps.
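For intuition on `--play_steps_s`: the duration is converted into a number of decoder steps using the audio codec's frame rate. A back-of-the-envelope sketch, where the frame rate is an assumption (read it from the TTS model's audio encoder config in practice):

```python
# Sketch: how a duration like --play_steps_s 0.5 maps to decoder steps.
frame_rate = 86      # codec frames per second (assumption)
play_steps_s = 0.5   # duration of the first streamed chunk
play_steps = int(frame_rate * play_steps_s)
print(play_steps)    # 43 decoder steps before the first audio chunk
```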
### Citations

```bibtex
@misc{Silero_VAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}
```

```bibtex
@misc{gandhi2023distilwhisper,
  title = {Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
  author = {Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
  year = {2023},
  eprint = {2311.00430},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```

```bibtex
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
```