Add working OpenWeb-UI

2025-08-07 20:11:08 +02:00
commit cd76763e66
50 changed files with 20505 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,438 @@
+# OpenedAI Speech
+
+Notice: This software is mostly obsolete and will no longer be updated.
+
+Some Alternatives:
+
+* https://speaches.ai/
+* https://github.com/remsky/Kokoro-FastAPI
+* https://github.com/astramind-ai/Auralis
+* https://lightning.ai/docs/litserve/home?code_sample=speech
+
+----
+
+An OpenAI API compatible text to speech server.
+
+* Compatible with the OpenAI audio/speech API
+* Serves the [/v1/audio/speech endpoint](https://platform.openai.com/docs/api-reference/audio/createSpeech)
+* Not affiliated with OpenAI in any way, does not require an OpenAI API Key
+* A free, private, text-to-speech server with custom voice cloning
+
+Full Compatibility:
+* `tts-1`: `alloy`, `echo`, `fable`, `onyx`, `nova`, and `shimmer` (configurable)
+* `tts-1-hd`:  `alloy`, `echo`, `fable`, `onyx`, `nova`, and `shimmer` (configurable, uses OpenAI samples by default)
+* response_format: `mp3`, `opus`, `aac`, `flac`, `wav` and `pcm`
+* speed 0.25-4.0 (and more)
+
+Details:
+* Model `tts-1` via [piper tts](https://github.com/rhasspy/piper) (very fast, runs on cpu)
+  * You can map your own [piper voices](https://rhasspy.github.io/piper-samples/) via the `voice_to_speaker.yaml` configuration file
+* Model `tts-1-hd` via [coqui-ai/TTS](https://github.com/coqui-ai/TTS) xtts_v2 voice cloning (fast, but requires around 4GB GPU VRAM)
+  * Custom cloned voices can be used for `tts-1-hd`, see: [Custom Voices Howto](#custom-voices-howto)
+  * 🌐 [Multilingual](#multilingual) support with XTTS voices; the language is automatically detected if not set
+  * [Custom fine-tuned XTTS model support](#custom-fine-tuned-model-support)
+  * Configurable [generation parameters](#generation-parameters)
+  * Streamed output while generating
+* Occasionally, certain words or symbols may sound incorrect; you can fix them with regex replacements in `pre_process_map.yaml`
+* Tested with Python 3.9-3.11; piper does not install on Python 3.12 yet
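+
+As a rough illustration of how such regex fixes work, here is a minimal sketch that applies a list of `[pattern, replacement]` pairs in order with Python's `re.sub`. The pair layout mirrors the entries in `pre_process_map.yaml`, but the `apply_pre_process_map` helper and the sample rules are hypothetical and are not taken from the server code.
+
+```python
+import re
+
+# Hypothetical rules, in the same [pattern, replacement] shape as pre_process_map.yaml.
+EXAMPLE_RULES = [
+    [r"&", " and "],  # speak the ampersand as a word
+    [r"\s+", " "],    # collapse runs of whitespace
+]
+
+
+def apply_pre_process_map(text, rules):
+    """Apply each regex replacement in order and return the resulting text."""
+    for pattern, replacement in rules:
+        text = re.sub(pattern, replacement, text)
+    return text
+
+
+print(apply_pre_process_map("Salt & pepper", EXAMPLE_RULES))
+```
+
+Because the rules run in order, a later rule sees the output of the earlier ones, so a whitespace cleanup rule is best placed last.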
+
+
+If you find a better voice match for `tts-1` or `tts-1-hd`, please let me know so I can update the defaults.
+
+## Recent Changes
+
+Version 0.18.2, 2024-08-16
+
+* Fix docker building for amd64, refactor github actions again, free up more disk space
+
+Version 0.18.1, 2024-08-15
+
+* refactor github actions
+
+Version 0.18.0, 2024-08-15
+
+* Allow folders of wav samples in xtts. Samples will be combined, allowing for mixed voices and collections of small samples. Still limited to 30 seconds total. Thanks @nathanhere.
+* Fix missing yaml requirement in -min image
+* fix fr_FR-tom-medium and other 44khz piper voices (detect non-default sample rates)
+* minor updates
+
+Version 0.17.2, 2024-07-01
+
+* fix -min image (re: langdetect)
+
+Version 0.17.1, 2024-07-01
+
+* fix ROCm (add langdetect to requirements-rocm.txt)
+* Fix zh-cn for xtts
+
+Version 0.17.0, 2024-07-01
+
+* Automatic language detection, thanks [@RodolfoCastanheira](https://github.com/RodolfoCastanheira)
+
+Version 0.16.0, 2024-06-29
+
+* Multi-client safe version. Audio generation is synchronized in a single process. The estimated 'realtime' factor of XTTS on a GPU is roughly 1/3, which means that multiple simultaneous streams, or a `speed` over 2, may experience audio underrun (delays or pauses in playback). Multiple clients are now possible and safe, but in practice 2 or 3 simultaneous streams is the maximum without audio underrun.
+
+Version 0.15.1, 2024-06-27
+
+* Remove deepspeed from requirements.txt, it's too complex for typical users. A more detailed deepspeed install document will be required.
+
+Version 0.15.0, 2024-06-26
+
+* Switch to [coqui-tts](https://github.com/idiap/coqui-ai-TTS) (updated fork), updated simpler dependencies, torch 2.3, etc.
+* Resolve cuda threading issues
+
+Version 0.14.1, 2024-06-26
+
+* Make deepspeed possible (`--use-deepspeed`), but not enabled in pre-built docker images (too large). Requires the cuda-toolkit installed, see the Dockerfile comment for details
+
+Version 0.14.0, 2024-06-26
+
+* Added `response_format`: `wav` and `pcm` support
+* Output streaming (while generating) for `tts-1` and `tts-1-hd`
+* Enhanced [generation parameters](#generation-parameters) for xtts models (temperature, top_p, etc.)
+* Idle unload timer (optional) - doesn't work perfectly yet
+* Improved error handling
+
+Version 0.13.0, 2024-06-25
+
+* Added [Custom fine-tuned XTTS model support](#custom-fine-tuned-model-support)
+* Initial prebuilt arm64 image support (Apple M-series, Raspberry Pi - MPS is not supported in XTTS/torch), thanks [@JakeStevenson](https://github.com/JakeStevenson), [@hchasens](https://github.com/hchasens)
+* Initial attempt at AMD GPU (ROCm 5.7) support
+* Parler-tts support removed
+* Move the *.default.yaml to the root folder
+* Run the docker as a service by default (`restart: unless-stopped`)
+* Added `audio_reader.py` for streaming text input and reading long texts
+
+Version 0.12.3, 2024-06-17
+
+* Additional logging details for BadRequests (400)
+
+Version 0.12.2, 2024-06-16
+
+* Fix :min image requirements (numpy<2?)
+
+Version 0.12.0, 2024-06-16
+
+* Improved error handling and logging
+* Restore the original alloy tts-1-hd voice by default, use alloy-alt for the old voice.
+
+Version 0.11.0, 2024-05-29
+
+* 🌐 [Multilingual](#multilingual) support (16 languages) with XTTS
+* Remove high Unicode filtering from the default `config/pre_process_map.yaml`
+* Update Docker build & app startup. thanks @justinh-rahb
+* Fix: "Plan failed with a cudnnException"
+* Remove piper cuda support
+
+Version: 0.10.1, 2024-05-05
+
+* Remove `runtime: nvidia` from docker-compose.yml, this assumes nvidia/cuda compatible runtime is available by default. thanks [@jmtatsch](https://github.com/jmtatsch)
+
+Version: 0.10.0, 2024-04-27
+
+* Pre-built & tested docker images, smaller docker images (8GB or 860MB)
+* Better upgrades: reorganize config files under `config/`, voice models under `voices/`
+* **Compatibility!** If you customized your `voice_to_speaker.yaml` or `pre_process_map.yaml` you need to move them to the `config/` folder.
+* default listen host to 0.0.0.0
+
+Version: 0.9.0, 2024-04-23
+
+* Fix bug with yaml and loading UTF-8
+* New sample text-to-speech application `say.py`
+* Smaller docker base image
+* Add beta [parler-tts](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) support (you can describe very basic features of the speaker voice), See: (https://www.text-description-to-speech.com/) for some examples of how to describe voices. Voices can be defined in the `voice_to_speaker.default.yaml`. Two example [parler-tts](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) voices are included in the `voice_to_speaker.default.yaml` file. `parler-tts` is experimental software and is kind of slow. The exact voice will be slightly different each generation but should be similar to the basic description.
+
+...
+
+Version: 0.7.3, 2024-03-20
+
+* Allow different xtts versions per voice in `voice_to_speaker.yaml`, ex. xtts_v2.0.2
+* Quality: Fix xtts sample rate (24000 vs. 22050 for piper) and pops
+
+
+## Installation instructions
+
+### Create a `speech.env` environment file
+
+Copy the `sample.env` to `speech.env` (customize if needed)
+```bash
+cp sample.env speech.env
+```
+
+#### Defaults
+```bash
+TTS_HOME=voices
+HF_HOME=voices
+#PRELOAD_MODEL=xtts
+#PRELOAD_MODEL=xtts_v2.0.2
+#EXTRA_ARGS=--log-level DEBUG --unload-timer 300
+#USE_ROCM=1
+```
+
+### Option A: Manual installation
+```shell
+# install curl and ffmpeg
+sudo apt install curl ffmpeg
+# Create & activate a new virtual environment (optional but recommended)
+python -m venv .venv
+source .venv/bin/activate
+# Install the Python requirements
+# - use requirements-rocm.txt for AMD GPU (ROCm support)
+# - use requirements-min.txt for piper only (CPU only)
+pip install -U -r requirements.txt
+# run the server
+bash startup.sh
+```
+
+> On first run, the voice models will be downloaded automatically. This might take a while depending on your network connection.
+
+### Option B: Docker Image (*recommended*)
+
+#### Nvidia GPU (cuda)
+
+```shell
+docker compose up
+```
+
+#### AMD GPU (ROCm support)
+
+```shell
+docker compose -f docker-compose.rocm.yml up
+```
+
+#### ARM64 (Apple M-series, Raspberry Pi)
+
+> XTTS only has CPU support here and will be very slow. You can use the Nvidia image for XTTS with CPU (slow), or use the piper only image (recommended).
+
+#### CPU only, No GPU (piper only)
+
+> For a minimal docker image with only piper support (<1GB vs. 8GB).
+
+```shell
+docker compose -f docker-compose.min.yml up
+```
+
+## Server Options
+
+```shell
+usage: speech.py [-h] [--xtts_device XTTS_DEVICE] [--preload PRELOAD] [--unload-timer UNLOAD_TIMER] [--use-deepspeed] [--no-cache-speaker] [-P PORT] [-H HOST]
+                 [-L {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
+
+OpenedAI Speech API Server
+
+options:
+  -h, --help            show this help message and exit
+  --xtts_device XTTS_DEVICE
+                        Set the device for the xtts model. The special value of 'none' will use piper for all models. (default: cuda)
+  --preload PRELOAD     Preload a model (Ex. 'xtts' or 'xtts_v2.0.2'). By default it's loaded on first use. (default: None)
+  --unload-timer UNLOAD_TIMER
+                        Idle unload timer for the XTTS model in seconds, Ex. 900 for 15 minutes (default: None)
+  --use-deepspeed       Use deepspeed with xtts (this option is unsupported) (default: False)
+  --no-cache-speaker    Don't use the speaker wav embeddings cache (default: False)
+  -P PORT, --port PORT  Server tcp port (default: 8000)
+  -H HOST, --host HOST  Host to listen on, Ex. 0.0.0.0 (default: 0.0.0.0)
+  -L {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
+                        Set the log level (default: INFO)
+```
+
+
+## Sample Usage
+
+You can use it like this:
+
+```shell
+curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
+    "model": "tts-1",
+    "input": "The quick brown fox jumped over the lazy dog.",
+    "voice": "alloy",
+    "response_format": "mp3",
+    "speed": 1.0
+  }' > speech.mp3
+```
+
+Or just like this:
+
+```shell
+curl -s http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
+    "input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
+```
+
+Or like this example from the [OpenAI Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech):
+
+```python
+import openai
+
+client = openai.OpenAI(
+  # This part is not needed if you set these environment variables before importing openai
+  # export OPENAI_API_KEY=sk-11111111111
+  # export OPENAI_BASE_URL=http://localhost:8000/v1
+  api_key = "sk-111111111",
+  base_url = "http://localhost:8000/v1",
+)
+
+with client.audio.speech.with_streaming_response.create(
+  model="tts-1",
+  voice="alloy",
+  input="Today is a wonderful day to build something people love!"
+) as response:
+  response.stream_to_file("speech.mp3")
+```
+
+Also see the `say.py` sample application for an example of how to use the openai-python API.
+
+```shell
+# play the audio, requires 'pip install playsound'
+python say.py -t "The quick brown fox jumped over the lazy dog." -p
+# save to a file in flac format
+python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac
+```
+
+You can also try the included `audio_reader.py` for listening to longer text and streamed input.
+
+Example usage:
+```bash
+python audio_reader.py -s 2 < LICENSE # read the software license - fast
+```
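+
+If the `openai` package is not installed, the same request can also be made with the Python standard library. The sketch below assumes the server is listening on `http://localhost:8000` as in the examples above; the `build_speech_request` and `save_speech` helper names are made up for this illustration.
+
+```python
+import json
+import urllib.request
+
+
+def build_speech_request(text, model="tts-1", voice="alloy",
+                         response_format="mp3", speed=1.0,
+                         base_url="http://localhost:8000/v1"):
+    """Build (but do not send) a POST request for the /v1/audio/speech endpoint."""
+    payload = {
+        "model": model,
+        "input": text,
+        "voice": voice,
+        "response_format": response_format,
+        "speed": speed,
+    }
+    return urllib.request.Request(
+        f"{base_url}/audio/speech",
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+def save_speech(text, path="speech.mp3", **kwargs):
+    """Send the request to a running server and write the audio bytes to a file."""
+    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as response:
+        with open(path, "wb") as audio_file:
+            audio_file.write(response.read())
+
+
+# Example call (requires a running server):
+# save_speech("The quick brown fox jumped over the lazy dog.")
+```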
+
+## OpenAI API Documentation and Guide
+
+* [OpenAI Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech)
+* [OpenAI API Reference](https://platform.openai.com/docs/api-reference/audio/createSpeech)
+
+
+## Custom Voices Howto
+
+### Piper
+
+  1. Select the piper voice and model from the [piper samples](https://rhasspy.github.io/piper-samples/)
+  2. Update the `config/voice_to_speaker.yaml` with a new section for the voice, for example:
+```yaml
+...
+tts-1:
+  ryan:
+    model: voices/en_US-ryan-high.onnx
+    speaker: # default speaker
+```
+  3. New models will be downloaded as needed, or you can download them in advance with `download_voices_tts-1.sh`. For example:
+```shell
+bash download_voices_tts-1.sh en_US-ryan-high
+```
+
+### Coqui XTTS v2
+
+Coqui XTTS v2 voice cloning can work with as little as 6 seconds of clear audio. To create a custom voice clone, you must prepare a WAV file sample of the voice.
+
+#### Guidelines for preparing good sample files for Coqui XTTS v2
+* Mono (single channel) 22050 Hz WAV file
+* 6-30 seconds long - longer isn't always better (I've had some good results with as little as 4 seconds)
+* Low noise (no hiss or hum)
+* No partial words, breathing, laughing, music or background sounds
+* An even speaking pace with a variety of words is best, like in interviews or audiobooks.
+* Audio longer than 30 seconds will be silently truncated.
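+
+The measurable parts of these guidelines (channel count, sample rate and length) can be checked before a sample is used. The following is a minimal sketch using Python's standard `wave` module; the `check_voice_sample` helper is hypothetical and not part of this project, and it only reads uncompressed PCM WAV files.
+
+```python
+import wave
+
+
+def check_voice_sample(path, min_seconds=6.0, max_seconds=30.0):
+    """Return a list of guideline problems found in a WAV file (empty if none)."""
+    problems = []
+    with wave.open(path, "rb") as wav_file:
+        channels = wav_file.getnchannels()
+        sample_rate = wav_file.getframerate()
+        duration = wav_file.getnframes() / sample_rate
+    if channels != 1:
+        problems.append(f"expected mono, found {channels} channels")
+    if sample_rate != 22050:
+        problems.append(f"expected 22050 Hz, found {sample_rate} Hz")
+    if duration < min_seconds:
+        problems.append(f"only {duration:.1f} s long, under {min_seconds:.0f} s")
+    if duration > max_seconds:
+        problems.append(f"{duration:.1f} s long, audio past {max_seconds:.0f} s is truncated")
+    return problems
+
+
+# Example call, assuming a sample was saved as voices/me.wav:
+# print(check_voice_sample("voices/me.wav"))
+```
+
+Noise, breathing and speaking pace still have to be judged by ear.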
+
+You can use FFmpeg to prepare your audio files. Here are some examples:
+
+```shell
+# convert a multi-channel audio file to mono, set sample rate to 22050 Hz, trim to 6 seconds, and output as WAV file.
+ffmpeg -i input.mp3 -ac 1 -ar 22050 -t 6 -y me.wav
+# use a simple noise filter to clean up audio, and select a start time for sampling.
+ffmpeg -i input.wav -af "highpass=f=200, lowpass=f=3000" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav
+# A more complex noise reduction setup, including volume adjustment
+ffmpeg -i input.mkv -af "highpass=f=200, lowpass=f=3000, volume=5, afftdn=nf=25" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav
+```
+
+Once your WAV file is prepared, save it in the `voices/` directory and update the `config/voice_to_speaker.yaml` file with the new file name.
+
+For example:
+
+```yaml
+...
+tts-1-hd:
+  me:
+    model: xtts
+    speaker: voices/me.wav # this could be you
+```
+
+You can also point the speaker at a subfolder of audio samples, either to combine several small samples or to mix different voices together.
+
+For example:
+
+```yaml
+...
+tts-1-hd:
+  mixed:
+    model: xtts
+    speaker: voices/mixed
+```
+
+Here, the `voices/mixed/` folder contains multiple WAV files. The total audio length is still limited to 30 seconds.
+
+## Multilingual
+
+Multilingual cloning support was added in version 0.11.0 and is available only with the XTTS v2 model. To use multilingual voices with piper, simply download a language-specific voice.
+
+Coqui XTTSv2 has support for multiple languages: English (`en`), Spanish (`es`), French (`fr`), German (`de`), Italian (`it`), Portuguese (`pt`), Polish (`pl`), Turkish (`tr`), Russian (`ru`), Dutch (`nl`), Czech (`cs`), Arabic (`ar`), Chinese (`zh-cn`), Hungarian (`hu`), Korean (`ko`), Japanese (`ja`), and Hindi (`hi`). When not set, an attempt will be made to automatically detect the language, falling back to English (`en`).
+
+Unfortunately, the OpenAI API does not support setting a language, but you can create your own custom speaker voice and set the language for that.
+
+1) Create the WAV file for your speaker, as in [Custom Voices Howto](#custom-voices-howto)
+2) Add the voice to `config/voice_to_speaker.yaml` and include the correct Coqui `language` code for the speaker. For example:
+
+```yaml
+  xunjiang:
+    model: xtts
+    speaker: voices/xunjiang.wav
+    language: zh-cn
+```
+
+3) Don't remove high unicode characters in your `config/pre_process_map.yaml`! If you have these lines, you will need to remove them. For example:
+
+Remove:
+```yaml
+- - '[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U000024C2-\U0001F251]+'
+  - ''
+```
+
+These lines were added to the `config/pre_process_map.yaml` config file by default before version 0.11.0.
+
+4) Your new multilingual speaker voice is ready to use!
+
+
+## Custom Fine-Tuned Model Support
+
+Adding a custom xtts model is simple. Here is an example of how to add a custom fine-tuned 'halo' XTTS model.
+
+1) Save the model folder under `voices/` (all 4 files are required, including the vocab.json from the model)
+```
+openedai-speech$ ls voices/halo/
+config.json  vocab.json  model.pth  sample.wav
+```
+2) Add the custom voice entry under the `tts-1-hd` section of `config/voice_to_speaker.yaml`:
+```yaml
+tts-1-hd:
+...
+  halo:
+    model: halo # This name is required to be unique
+    speaker: voices/halo/sample.wav # voice sample is required
+    model_path: voices/halo
+```
+3) The model will be loaded when you access the voice for the first time (`--preload` doesn't work with custom models yet)
+
+## Generation Parameters
+
+The generation of XTTSv2 voices can be fine-tuned with the following options (defaults included below):
+
+```yaml
+tts-1-hd:
+  alloy:
+    model: xtts
+    speaker: voices/alloy.wav
+    enable_text_splitting: True
+    length_penalty: 1.0
+    repetition_penalty: 10
+    speed: 1.0
+    temperature: 0.75
+    top_k: 50
+    top_p: 0.85
+```
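+
+To make the relationship between these defaults and a per-voice entry concrete, here is a minimal sketch of overlaying the options set on one voice onto the defaults listed above. The default values are copied from the example; the `resolve_generation_params` helper and the merge behaviour are assumptions for illustration and may not match how the server reads the file.
+
+```python
+# Defaults copied from the YAML example above; the merge helper itself is hypothetical.
+XTTS_DEFAULTS = {
+    "enable_text_splitting": True,
+    "length_penalty": 1.0,
+    "repetition_penalty": 10,
+    "speed": 1.0,
+    "temperature": 0.75,
+    "top_k": 50,
+    "top_p": 0.85,
+}
+
+
+def resolve_generation_params(voice_entry):
+    """Overlay the generation options set on a voice entry onto the defaults."""
+    overrides = {key: value for key, value in voice_entry.items() if key in XTTS_DEFAULTS}
+    return {**XTTS_DEFAULTS, **overrides}
+
+
+# A voice entry that only changes the temperature keeps every other default.
+entry = {"model": "xtts", "speaker": "voices/alloy.wav", "temperature": 0.6}
+print(resolve_generation_params(entry))
+```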