liumaolin committed · Commit 13ffebc · 1 Parent(s): 86f7d71

docs: streamline usage guides by consolidating and decluttering

- Remove duplicated sections from `USAGE.md` and `USAGE_CN.md` related to Quick and Advanced Modes.
- Simplify descriptions of API usage, voice cloning modes, and TTS generation.
- Replace detailed technical steps with a concise feature overview for better readability.
- Update links to reference the interactive Swagger UI for API details.
- Enhance layout by reorganizing key sections like "First-Time Setup" and "Feature Overview."

Files changed (2)
  1. USAGE.md +10 -1146
  2. USAGE_CN.md +10 -1156
USAGE.md CHANGED
@@ -15,15 +15,9 @@
15
  - [Running the Application](#running-the-application)
16
  - [Start Backend API Server](#51-start-backend-api-server)
17
  - [Start Frontend Electron App](#52-start-frontend-electron-app)
18
- - [Usage Guide](#usage-guide)
19
- - [First-Time Setup](#61-first-time-setup)
20
- - [Quick Mode - Voice Cloning for Beginners](#62-quick-mode---voice-cloning-for-beginners)
21
- - [Advanced Mode - Expert Voice Cloning](#63-advanced-mode---expert-voice-cloning)
22
- - [Text-to-Speech Generation](#64-text-to-speech-generation)
23
- - [Voice Library Management](#65-voice-library-management)
24
- - [API Reference](#api-reference)
25
  - [Troubleshooting](#troubleshooting)
26
- - [Development](#development)
27
 
28
  ---
29
 
@@ -379,9 +373,7 @@ The Electron application will launch automatically with hot-reload enabled for d
379
 
380
  ---
381
 
382
- ## Usage Guide
383
-
384
- ### 6.1 First-Time Setup
385
 
386
  When you first launch the Electron app, you'll need to download required models.
387
 
@@ -421,702 +413,21 @@ When you first launch the Electron app, you'll need to download required models.
421
  - Verify you have ~10 GB free disk space
422
  - For manual installation, see section 3.3
423
 
424
- ### 6.2 Quick Mode - Voice Cloning for Beginners
425
-
426
- Quick Mode provides a simplified workflow for users who want to create a voice clone quickly without technical knowledge.
427
-
428
- #### Using the API
429
-
430
- **Step 1: Upload Audio File**
431
-
432
- ```bash
433
- curl -X POST http://localhost:8000/api/v1/files \
434
- -F "file=@path/to/voice_sample.wav" \
435
- -F "purpose=training"
436
- ```
437
-
438
- **Response**:
439
- ```json
440
- {
441
- "file_id": "550e8400-e29b-41d4-a716-446655440000",
442
- "filename": "voice_sample.wav",
443
- "size": 1234567,
444
- "purpose": "training"
445
- }
446
- ```
447
-
448
- **Step 2: Create Training Task**
449
-
450
- ```bash
451
- curl -X POST http://localhost:8000/api/v1/tasks \
452
- -H "Content-Type: application/json" \
453
- -d '{
454
- "exp_name": "my_voice",
455
- "audio_file_id": "550e8400-e29b-41d4-a716-446655440000",
456
- "options": {
457
- "version": "v2",
458
- "language": "zh",
459
- "quality": "standard"
460
- }
461
- }'
462
- ```
463
-
464
- **Response**:
465
- ```json
466
- {
467
- "id": "task-uuid-here",
468
- "status": "queued",
469
- "exp_name": "my_voice",
470
- "created_at": "2026-01-23T10:30:00Z"
471
- }
472
- ```
473
-
474
- **Step 3: Monitor Progress**
475
-
476
- Using Server-Sent Events (SSE):
477
- ```bash
478
- curl -N http://localhost:8000/api/v1/tasks/task-uuid-here/progress
479
- ```
480
-
481
- **Progress Events**:
482
- ```
483
- event: progress
484
- data: {"stage": "audio_slice", "progress": 25, "message": "Slicing audio..."}
485
-
486
- event: progress
487
- data: {"stage": "sovits_train", "progress": 50, "message": "Training SoVITS model..."}
488
-
489
- event: complete
490
- data: {"status": "completed", "voice_id": "voice-uuid-here"}
491
- ```
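
The progress stream shown in the removed section follows the usual Server-Sent Events layout: an `event:` line, a `data:` line carrying JSON, and a blank line between events. A minimal sketch of turning such lines into Python objects might look like this. The `parse_sse` helper is hypothetical and not part of the project; it only assumes the event shape shown above.

```python
import json


def parse_sse(lines):
    """Group raw SSE lines into (event, data) pairs; a blank line ends an event."""
    events = []
    event_name, data_parts = "message", []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            # A blank line terminates the current event, if it carried data.
            if data_parts:
                events.append((event_name, json.loads("\n".join(data_parts))))
            event_name, data_parts = "message", []
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
    if data_parts:
        # Flush a final event that has no trailing blank line.
        events.append((event_name, json.loads("\n".join(data_parts))))
    return events


# Sample lines copied from the progress events documented above.
sample = [
    'event: progress',
    'data: {"stage": "audio_slice", "progress": 25, "message": "Slicing audio..."}',
    '',
    'event: complete',
    'data: {"status": "completed", "voice_id": "voice-uuid-here"}',
]
events = parse_sse(sample)
```

In practice the lines would come from a streaming HTTP response rather than a list; the grouping logic stays the same.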
492
-
493
- #### Quality Presets
494
-
495
- | Preset | SoVITS Epochs | GPT Epochs | Est. Time | Quality |
496
- |--------|---------------|------------|-----------|---------|
497
- | **fast** | 4 | 8 | ~10 min | Good for testing |
498
- | **standard** | 8 | 15 | ~20 min | Balanced quality/speed |
499
- | **high** | 16 | 30 | ~40 min | Best quality |
500
-
501
- **Recommendations**:
502
- - Use `fast` for quick tests and previews
503
- - Use `standard` for most production use cases
504
- - Use `high` for professional applications requiring maximum quality
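
As a rough illustration, the preset table above can be read as a lookup from preset name to epoch counts. The sketch below is an assumption about how a client might mirror that table; `QUALITY_PRESETS` and `resolve_preset` are hypothetical names, and the server's own preset handling may differ.

```python
# Epoch counts copied from the Quality Presets table; names are illustrative.
QUALITY_PRESETS = {
    "fast": {"sovits_epochs": 4, "gpt_epochs": 8},
    "standard": {"sovits_epochs": 8, "gpt_epochs": 15},
    "high": {"sovits_epochs": 16, "gpt_epochs": 30},
}


def resolve_preset(name):
    """Return the epoch settings for a preset, rejecting unknown names."""
    try:
        return QUALITY_PRESETS[name]
    except KeyError:
        raise ValueError(f"unknown quality preset: {name!r}") from None


standard = resolve_preset("standard")
```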
505
-
506
- #### Using the UI
507
-
508
- **Step 1: Navigate to Voice Clone Page**
509
- - Click "Voice Clone" in the sidebar
510
- - Or use keyboard shortcut: `Ctrl/Cmd + N`
511
-
512
- **Step 2: Upload Audio Sample**
513
- - Click "Upload Audio" button
514
- - Select a WAV or MP3 file
515
- - **Requirements**:
516
- - Duration: 5-30 seconds recommended
517
- - Quality: Clear voice, minimal background noise
518
- - Content: Natural speech, not singing or shouting
519
-
520
- **Step 3: Configure Training**
521
- - **Voice Name**: Enter a unique name (e.g., "John's Voice")
522
- - **Language**: Select primary language (Chinese, English, Japanese)
523
- - **Quality Preset**: Choose from fast/standard/high
524
-
525
- **Step 4: Start Training**
526
- - Click "Start Training" button
527
- - The task will be queued and processing will begin
528
-
529
- **Step 5: Monitor Progress**
530
- - Progress bar shows overall completion
531
- - Current stage displayed (e.g., "Training SoVITS model...")
532
- - Estimated time remaining shown
533
- - You can navigate away and check back later
534
-
535
- **Step 6: Training Complete**
536
- - You'll receive a notification when complete
537
- - The voice automatically appears in Voice Library
538
- - You can immediately use it for TTS generation
539
-
540
- **Tips for Best Results**:
541
- - Use high-quality audio (preferably 48kHz WAV)
542
- - Ensure consistent tone and speaking style
543
- - Avoid audio with music or sound effects
544
- - 10-15 seconds is the sweet spot for sample length
545
- - Multiple short samples can be combined
546
-
547
- ### 6.3 Advanced Mode - Expert Voice Cloning
548
-
549
- Advanced Mode provides granular control over each stage of the voice training pipeline. This is recommended for users who want to fine-tune training parameters.
550
-
551
- #### Training Pipeline Stages
552
-
553
- The complete training pipeline consists of 7 stages:
554
-
555
- 1. **Audio Slice**: Split audio into segments
556
- 2. **ASR** (Automatic Speech Recognition): Transcribe audio to text
557
- 3. **Text Feature**: Extract text embeddings
558
- 4. **Hubert Feature**: Extract audio features
559
- 5. **Semantic Token**: Generate semantic tokens
560
- 6. **SoVITS Train**: Train voice synthesis model
561
- 7. **GPT Train**: Train text-to-semantic model
562
-
563
- #### Stage Dependencies
564
-
565
- ```
566
- audio_slice → asr → text_feature → sovits_train
567
- ↘ ↗
568
- hubert_feature → semantic_token → gpt_train
569
- ```
570
-
571
- **Important**: Each stage must wait for its dependencies to complete.
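
One way to respect "each stage must wait for its dependencies" is to compute a topological order before running anything. The sketch below uses Python's standard `graphlib` module. The edge set in `DEPENDENCIES` is an assumption: the diagram above does not make every arrow unambiguous, so treat these edges as illustrative rather than as the project's actual graph.

```python
from graphlib import TopologicalSorter

# Maps each stage to the stages it waits for.
# ASSUMPTION: edges are inferred from the diagram and may not match the server.
DEPENDENCIES = {
    "audio_slice": set(),
    "asr": {"audio_slice"},
    "text_feature": {"asr"},
    "hubert_feature": {"asr"},
    "semantic_token": {"hubert_feature"},
    "sovits_train": {"text_feature", "hubert_feature"},
    "gpt_train": {"text_feature", "semantic_token"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
```

Running the stages in `order` guarantees that no stage starts before the ones it depends on, whatever the exact edge set turns out to be.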
572
-
573
- #### Using the API
574
-
575
- **Step 1: Create Experiment**
576
-
577
- ```bash
578
- curl -X POST http://localhost:8000/api/v1/experiments \
579
- -H "Content-Type: application/json" \
580
- -d '{
581
- "exp_name": "my_custom_voice",
582
- "version": "v2",
583
- "audio_file_id": "file-uuid-here"
584
- }'
585
- ```
586
-
587
- **Response**:
588
- ```json
589
- {
590
- "id": "exp-uuid-here",
591
- "exp_name": "my_custom_voice",
592
- "version": "v2",
593
- "stages": {
594
- "audio_slice": {"status": "pending"},
595
- "asr": {"status": "pending"},
596
- "text_feature": {"status": "pending"},
597
- "hubert_feature": {"status": "pending"},
598
- "semantic_token": {"status": "pending"},
599
- "sovits_train": {"status": "pending"},
600
- "gpt_train": {"status": "pending"}
601
- }
602
- }
603
- ```
604
-
605
- **Step 2: Execute Stages Individually**
606
-
607
- **Stage 1 - Audio Slice**:
608
- ```bash
609
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/audio_slice \
610
- -H "Content-Type: application/json" \
611
- -d '{
612
- "threshold": -34,
613
- "min_length": 4000,
614
- "min_interval": 300,
615
- "hop_size": 10,
616
- "max_silence_kept": 500
617
- }'
618
- ```
619
-
620
- **Parameters**:
621
- - `threshold`: dB threshold for silence detection (-60 to 0, default: -34)
622
- - `min_length`: Minimum segment length in ms (1000-10000, default: 4000)
623
- - `min_interval`: Minimum silence interval in ms (0-3000, default: 300)
624
- - `hop_size`: Analysis window hop size in ms (default: 10)
625
- - `max_silence_kept`: Maximum silence to keep in ms (default: 500)
626
-
627
- **Stage 2 - ASR**:
628
- ```bash
629
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/asr \
630
- -H "Content-Type: application/json" \
631
- -d '{
632
- "model": "达摩 ASR (中文)",
633
- "language": "zh"
634
- }'
635
- ```
636
-
637
- **ASR Models**:
638
- - `达摩 ASR (中文)`: DamoASR for Chinese (best for Chinese)
639
- - `Faster Whisper (多语言)`: Faster Whisper for multilingual
640
-
641
- **Stage 3 - Text Feature**:
642
- ```bash
643
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/text_feature \
644
- -H "Content-Type: application/json" \
645
- -d '{
646
- "language": "zh"
647
- }'
648
- ```
649
-
650
- **Stage 4 - Hubert Feature**:
651
- ```bash
652
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/hubert_feature \
653
- -H "Content-Type: application/json" \
654
- -d '{}'
655
- ```
656
-
657
- **Stage 5 - Semantic Token**:
658
- ```bash
659
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/semantic_token \
660
- -H "Content-Type: application/json" \
661
- -d '{}'
662
- ```
663
-
664
- **Stage 6 - SoVITS Train**:
665
- ```bash
666
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/sovits_train \
667
- -H "Content-Type: application/json" \
668
- -d '{
669
- "total_epoch": 8,
670
- "batch_size": 4,
671
- "save_every_epoch": 4,
672
- "text_low_lr_rate": 0.4,
673
- "if_save_latest": true,
674
- "if_save_every_weights": true,
675
- "version": "v2"
676
- }'
677
- ```
678
-
679
- **Parameters**:
680
- - `total_epoch`: Total training epochs (4-32, default: 8)
681
- - `batch_size`: Batch size (1-40, default: 4)
682
- - `save_every_epoch`: Save checkpoint every N epochs (1-50, default: 4)
683
- - `text_low_lr_rate`: Text encoder learning rate multiplier (0.2-1.0, default: 0.4)
684
-
685
- **Stage 7 - GPT Train**:
686
- ```bash
687
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/gpt_train \
688
- -H "Content-Type: application/json" \
689
- -d '{
690
- "total_epoch": 15,
691
- "batch_size": 4,
692
- "save_every_epoch": 5,
693
- "if_save_latest": true,
694
- "if_save_every_weights": true,
695
- "version": "v2"
696
- }'
697
- ```
698
-
699
- **Step 3: Monitor Stage Progress**
700
-
701
- Each stage provides real-time progress via SSE:
702
-
703
- ```bash
704
- curl -N http://localhost:8000/api/v1/experiments/exp-uuid/stages/sovits_train/progress
705
- ```
706
-
707
- **Progress Events**:
708
- ```
709
- event: progress
710
- data: {"epoch": 2, "total_epochs": 8, "progress": 25, "loss": 0.234}
711
-
712
- event: progress
713
- data: {"epoch": 4, "total_epochs": 8, "progress": 50, "loss": 0.189}
714
-
715
- event: complete
716
- data: {"status": "completed", "final_loss": 0.142}
717
- ```
718
-
719
- #### Using the UI
720
-
721
- **Step 1: Create New Experiment**
722
- - Navigate to "Advanced Mode" page
723
- - Click "New Experiment"
724
- - Enter experiment name and upload audio
725
-
726
- **Step 2: Configure Each Stage**
727
- - Click on a stage card to expand settings
728
- - Adjust parameters (or use preset defaults)
729
- - Click "Run Stage" to execute
730
-
731
- **Step 3: Monitor Pipeline**
732
- - Visual pipeline diagram shows stage status
733
- - Green: Completed, Blue: Running, Gray: Pending
734
- - Click any stage to view detailed logs
735
-
736
- **Step 4: Iterate and Refine**
737
- - Review results after each stage
738
- - Adjust parameters and re-run if needed
739
- - Export final model when satisfied
740
-
741
- **Advanced Tips**:
742
- - Use lower `batch_size` (2-4) on GPUs with limited memory
743
- - Increase `total_epoch` for better quality with sufficient data
744
- - Save checkpoints frequently (`save_every_epoch`) to recover from interruptions
745
- - Monitor loss values - should decrease over epochs
746
-
747
- ### 6.4 Text-to-Speech Generation
748
-
749
- Once you have trained a voice, you can use it to generate speech from text.
750
-
751
- #### Using the API
752
-
753
- **Basic TTS Request**:
754
- ```bash
755
- curl -X POST http://localhost:8000/api/v1/inference/tts \
756
- -H "Content-Type: application/json" \
757
- -d '{
758
- "text": "Hello, this is a test of text-to-speech synthesis.",
759
- "voice_id": "voice-uuid-here",
760
- "speed": 1.0,
761
- "emotion": "auto"
762
- }'
763
- ```
764
-
765
- **Response**:
766
- ```json
767
- {
768
- "audio_url": "http://localhost:8000/api/v1/files/audio-uuid-here",
769
- "duration": 3.2,
770
- "format": "wav"
771
- }
772
- ```
773
-
774
- **Parameters**:
775
- - `text` (required): Text to synthesize (max 5000 characters)
776
- - `voice_id` (required): UUID of trained voice
777
- - `speed` (optional): Speaking speed multiplier (0.5 - 2.0, default: 1.0)
778
- - `emotion` (optional): Emotion style (auto, neutral, happy, sad)
779
- - `seed` (optional): Random seed for reproducibility
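
The constraints listed above (text length, speed range, emotion values) can be enforced when building the request body. This is a sketch only: `build_tts_request` is a hypothetical helper, and defaulting `emotion` to `"auto"` is an assumption based on the example request, not a documented default.

```python
ALLOWED_EMOTIONS = {"auto", "neutral", "happy", "sad"}
MAX_TEXT_LENGTH = 5000


def build_tts_request(text, voice_id, speed=1.0, emotion="auto", seed=None):
    """Return a JSON-ready payload for the TTS endpoint after basic validation."""
    if not text or len(text) > MAX_TEXT_LENGTH:
        raise ValueError("text must contain 1 to 5000 characters")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if emotion not in ALLOWED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    payload = {"text": text, "voice_id": voice_id, "speed": speed, "emotion": emotion}
    if seed is not None:
        # seed is optional, so it is only included when provided.
        payload["seed"] = seed
    return payload


payload = build_tts_request("Hello, this is a test.", "voice-uuid-here", seed=42)
```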
780
-
781
- **Download Generated Audio**:
782
- ```bash
783
- curl -o output.wav http://localhost:8000/api/v1/files/audio-uuid-here
784
- ```
785
-
786
- #### Using the UI
787
-
788
- **Step 1: Navigate to TTS Page**
789
- - Click "Text to Speech" in sidebar
790
- - Or use keyboard shortcut: `Ctrl/Cmd + T`
791
-
792
- **Step 2: Select Voice**
793
- - Open voice dropdown
794
- - Select a trained voice from the list
795
- - Preview button lets you hear a sample
796
-
797
- **Step 3: Enter Text**
798
- - Type or paste text into the text area
799
- - Character count shown (max 5000)
800
- - Supports multi-line text
801
-
802
- **Step 4: Adjust Settings**
803
- - **Speed**: Drag slider or enter value (0.5x - 2.0x)
804
- - 0.5x: Very slow, clear enunciation
805
- - 1.0x: Natural speaking pace
806
- - 1.5x: Fast, still intelligible
807
- - 2.0x: Very fast
808
- - **Emotion**: Select from dropdown (if supported by model)
809
- - Auto: Infer from text
810
- - Neutral: Flat, factual delivery
811
- - Happy: Upbeat, positive tone
812
- - Sad: Somber, melancholic tone
813
-
814
- **Step 5: Generate**
815
- - Click "Generate" button
816
- - Processing takes 2-5 seconds
817
- - Progress indicator shown
818
-
819
- **Step 6: Listen and Download**
820
- - Audio player appears automatically
821
- - Click play button to listen
822
- - Click download button to save WAV file
823
- - Share button to copy shareable link
824
-
825
- **Text Guidelines**:
826
- - Use proper punctuation for natural pauses
827
- - Break long text into sentences
828
- - Use quotation marks for dialogue
829
- - All-caps for emphasis (use sparingly)
830
-
831
- **Tips for Natural Speech**:
832
- - Add commas for breath pauses
833
- - Use ellipsis (...) for trailing off
834
- - Question marks affect intonation
835
- - Exclamation points add emphasis
836
-
837
- ### 6.5 Voice Library Management
838
-
839
- The Voice Library is where all your trained voices are stored and managed.
840
-
841
- #### Using the API
842
-
843
- **List All Voices**:
844
- ```bash
845
- curl http://localhost:8000/api/v1/files?purpose=training
846
- ```
847
-
848
- **Response**:
849
- ```json
850
- {
851
- "files": [
852
- {
853
- "id": "voice-uuid-1",
854
- "filename": "john_voice",
855
- "created_at": "2026-01-20T10:30:00Z",
856
- "size": 1234567,
857
- "metadata": {
858
- "language": "zh",
859
- "quality": "standard",
860
- "duration": 12.5
861
- }
862
- },
863
- {
864
- "id": "voice-uuid-2",
865
- "filename": "mary_voice",
866
- "created_at": "2026-01-21T14:20:00Z",
867
- "size": 2345678,
868
- "metadata": {
869
- "language": "en",
870
- "quality": "high",
871
- "duration": 18.3
872
- }
873
- }
874
- ]
875
- }
876
- ```
877
-
878
- **Get Voice Details**:
879
- ```bash
880
- curl http://localhost:8000/api/v1/files/voice-uuid-1
881
- ```
882
-
883
- **Delete Voice**:
884
- ```bash
885
- curl -X DELETE http://localhost:8000/api/v1/files/voice-uuid-1
886
- ```
887
-
888
- **Export Voice Model**:
889
- ```bash
890
- curl -o voice_model.zip http://localhost:8000/api/v1/voices/voice-uuid-1/export
891
- ```
892
-
893
- #### Using the UI
894
-
895
- **Browse Voice Library**:
896
- - Navigate to "Voice Library" page
897
- - Voices displayed as cards with:
898
- - Voice name
899
- - Language and quality badges
900
- - Creation date
901
- - Sample duration
902
- - Preview waveform
903
-
904
- **Voice Card Actions**:
905
- - **Play**: Listen to voice sample
906
- - **Edit**: Rename or update metadata
907
- - **Export**: Download voice model files
908
- - **Delete**: Remove voice (with confirmation)
909
-
910
- **Search and Filter**:
911
- - Search bar: Filter by voice name
912
- - Language filter: Show only specific languages
913
- - Quality filter: Show only specific quality presets
914
- - Sort options:
915
- - Name (A-Z)
916
- - Date created (newest first)
917
- - Date created (oldest first)
918
- - File size
919
-
920
- **Bulk Operations**:
921
- - Select multiple voices (Shift+Click)
922
- - Export selected voices as ZIP
923
- - Delete selected voices
924
- - Tag selected voices
925
-
926
- **Voice Details Panel**:
927
- Click on any voice card to view:
928
- - Full training parameters
929
- - Training history and logs
930
- - Model file sizes
931
- - Sample audio clips
932
- - Export and sharing options
933
-
934
- **Organization Tips**:
935
- - Use descriptive names (e.g., "John_Professional", "Mary_Casual")
936
- - Tag voices by project or use case
937
- - Export important voices as backups
938
- - Delete test voices to save space
939
-
940
  ---
941
 
942
- ## API Reference
943
-
944
- ### Quick Mode Endpoints
945
-
946
- #### Tasks
947
-
948
- **Create Task** - Start a one-click voice training task
949
- ```http
950
- POST /api/v1/tasks
951
- Content-Type: application/json
952
-
953
- {
954
- "exp_name": "string",
955
- "audio_file_id": "uuid",
956
- "options": {
957
- "version": "v2",
958
- "language": "zh|en|ja",
959
- "quality": "fast|standard|high"
960
- }
961
- }
962
- ```
963
-
964
- **List Tasks** - Get all tasks
965
- ```http
966
- GET /api/v1/tasks?status=queued|running|completed|failed
967
- ```
968
-
969
- **Get Task** - Get specific task details
970
- ```http
971
- GET /api/v1/tasks/{task_id}
972
- ```
973
-
974
- **Cancel Task** - Cancel a running task
975
- ```http
976
- DELETE /api/v1/tasks/{task_id}
977
- ```
978
-
979
- **Task Progress** - Real-time progress via SSE
980
- ```http
981
- GET /api/v1/tasks/{task_id}/progress
982
- Accept: text/event-stream
983
- ```
984
-
985
- ### Advanced Mode Endpoints
986
-
987
- #### Experiments
988
-
989
- **Create Experiment** - Initialize a new training experiment
990
- ```http
991
- POST /api/v1/experiments
992
- Content-Type: application/json
993
-
994
- {
995
- "exp_name": "string",
996
- "version": "v2",
997
- "audio_file_id": "uuid"
998
- }
999
- ```
1000
-
1001
- **Get Experiment** - Get experiment details
1002
- ```http
1003
- GET /api/v1/experiments/{exp_id}
1004
- ```
1005
-
1006
- **List Experiments** - Get all experiments
1007
- ```http
1008
- GET /api/v1/experiments?status=pending|running|completed
1009
- ```
1010
 
1011
- **Delete Experiment** - Delete experiment and all data
1012
- ```http
1013
- DELETE /api/v1/experiments/{exp_id}
1014
- ```
1015
 
1016
- #### Stages
1017
 
1018
- **Execute Stage** - Run a specific pipeline stage
1019
- ```http
1020
- POST /api/v1/experiments/{exp_id}/stages/{stage_type}
1021
- Content-Type: application/json
1022
 
1023
- {
1024
- // Stage-specific parameters
1025
- }
1026
- ```
1027
 
1028
- **Stage Types**:
1029
- - `audio_slice`
1030
- - `asr`
1031
- - `text_feature`
1032
- - `hubert_feature`
1033
- - `semantic_token`
1034
- - `sovits_train`
1035
- - `gpt_train`
1036
-
1037
- **Get Stage Status** - Get status of a specific stage
1038
- ```http
1039
- GET /api/v1/experiments/{exp_id}/stages/{stage_type}
1040
- ```
1041
 
1042
- **Get All Stage Statuses** - Get status of all stages
1043
- ```http
1044
- GET /api/v1/experiments/{exp_id}/stages
1045
- ```
1046
-
1047
- **Stage Progress** - Real-time stage progress via SSE
1048
- ```http
1049
- GET /api/v1/experiments/{exp_id}/stages/{stage_type}/progress
1050
- Accept: text/event-stream
1051
- ```
1052
-
1053
- **Get Stage Schema** - Get parameters schema for a stage
1054
- ```http
1055
- GET /api/v1/stages/{stage_type}/schema
1056
- ```
1057
-
1058
- ### Common Endpoints
1059
-
1060
- #### Files
1061
-
1062
- **Upload File** - Upload audio or data file
1063
- ```http
1064
- POST /api/v1/files
1065
- Content-Type: multipart/form-data
1066
-
1067
- file: binary
1068
- purpose: training|inference
1069
- ```
1070
-
1071
- **List Files** - Get all uploaded files
1072
- ```http
1073
- GET /api/v1/files?purpose=training|inference
1074
- ```
1075
-
1076
- **Get File** - Download a specific file
1077
- ```http
1078
- GET /api/v1/files/{file_id}
1079
- ```
1080
-
1081
- **Delete File** - Delete a file
1082
- ```http
1083
- DELETE /api/v1/files/{file_id}
1084
- ```
1085
-
1086
- #### Inference
1087
-
1088
- **Text-to-Speech** - Generate speech from text
1089
- ```http
1090
- POST /api/v1/inference/tts
1091
- Content-Type: application/json
1092
-
1093
- {
1094
- "text": "string",
1095
- "voice_id": "uuid",
1096
- "speed": 1.0,
1097
- "emotion": "auto|neutral|happy|sad",
1098
- "seed": 42
1099
- }
1100
- ```
1101
-
1102
- **Get Voice Info** - Get voice model information
1103
- ```http
1104
- GET /api/v1/voices/{voice_id}
1105
- ```
1106
-
1107
- #### Configuration
1108
-
1109
- **Get Stage Presets** - Get preset configurations for stages
1110
- ```http
1111
- GET /api/v1/stages/presets
1112
- ```
1113
-
1114
- **Health Check** - Check API server health
1115
- ```http
1116
- GET /health
1117
- ```
1118
-
1119
- **Full OpenAPI specification available at**: http://localhost:8000/openapi.json
1120
 
1121
  ---
1122
 
@@ -1160,28 +471,6 @@ rm ~/.moyoyo-tts/data/tasks.db
1160
  python app/main.py
1161
  ```
1162
 
1163
- #### Training Fails Immediately
1164
-
1165
- **Symptom**: Training starts but fails within seconds.
1166
-
1167
- **Diagnosis**:
1168
- ```bash
1169
- # Check GPU availability
1170
- python -c "import torch; print(torch.cuda.is_available())"
1171
-
1172
- # Check CUDA version
1173
- python -c "import torch; print(torch.version.cuda)"
1174
-
1175
- # Check disk space
1176
- df -h
1177
- ```
1178
-
1179
- **Solutions**:
1180
- 1. **No GPU**: System will use CPU (slower but works)
1181
- 2. **CUDA mismatch**: Reinstall PyTorch with correct CUDA version
1182
- 3. **Out of disk space**: Free up at least 10GB
1183
- 4. **Out of memory**: Reduce `batch_size` in training parameters
1184
-
1185
  #### Python Environment Issues
1186
 
1187
  **Symptom**: `ModuleNotFoundError` or import errors.
@@ -1286,431 +575,6 @@ nvm use 18
1286
  %APPDATA%\tts-voice-app\logs\
1287
  ```
1288
 
1289
- ### Common Errors
1290
-
1291
- #### "PYTHONPATH not set" Error
1292
-
1293
- **Symptom**: Import errors related to `GPT_SoVITS` module.
1294
-
1295
- **Cause**: The API server needs to find the main project directory.
1296
-
1297
- **Solution**: The API automatically sets `PYTHONPATH`, but verify:
1298
- ```bash
1299
- # Check project structure
1300
- ls GPT-SoVITS/ # Should contain *.py files
1301
-
1302
- # Set manually if needed
1303
- export PYTHONPATH=/Users/coldish/workspace/GPT-SoVITS:$PYTHONPATH
1304
- ```
1305
-
1306
- #### "Model not found" Error
1307
-
1308
- **Symptom**: Training fails with "Cannot find pretrained model" message.
1309
-
1310
- **Diagnosis**:
1311
- ```bash
1312
- # Check if models exist
1313
- ls GPT_SoVITS/pretrained_models/
1314
- # Should show: s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt, s2G488k.pth, s2D488k.pth
1315
- ```
1316
-
1317
- **Solution**: Download pretrained models (see section 3.3):
1318
- ```bash
1319
- wget https://www.modelscope.cn/models/XXXXRT/GPT-SoVITS-Pretrained/resolve/master/pretrained_models.zip
1320
- unzip -q -o pretrained_models.zip -d GPT_SoVITS
1321
- ```
1322
-
1323
- #### "Out of memory" Error
1324
-
1325
- **Symptom**: Training crashes with `CUDA out of memory` or `MemoryError`.
1326
-
1327
- **Solutions**:
1328
- 1. **Reduce batch size**:
1329
- ```json
1330
- {
1331
- "batch_size": 2 // Reduce from 4 to 2
1332
- }
1333
- ```
1334
-
1335
- 2. **Close other applications**: Free up GPU/RAM
1336
-
1337
- 3. **Use CPU mode**: Slower but uses system RAM instead of GPU:
1338
- ```bash
1339
- # Set environment variable
1340
- export CUDA_VISIBLE_DEVICES=""
1341
- python app/main.py
1342
- ```
1343
-
1344
- 4. **Increase system swap** (Linux):
1345
- ```bash
1346
- sudo dd if=/dev/zero of=/swapfile bs=1G count=8
1347
- sudo mkswap /swapfile
1348
- sudo swapon /swapfile
1349
- ```
1350
-
1351
- #### "NLTK Data Not Found" Error
1352
-
1353
- **Symptom**: Text processing fails with NLTK data errors.
1354
-
1355
- **Solution**: Download NLTK data (see section 3.3):
1356
- ```bash
1357
- wget https://www.modelscope.cn/models/XXXXRT/GPT-SoVITS-Pretrained/resolve/master/nltk_data.zip
1358
- unzip -q -o nltk_data.zip -d .venv/
1359
- ```
1360
-
1361
- #### Audio Quality Issues
1362
-
1363
- **Symptom**: Generated audio sounds robotic, distorted, or unclear.
1364
-
1365
- **Solutions**:
1366
- 1. **Use better training data**:
1367
- - High-quality audio (48kHz WAV preferred)
1368
- - Clear voice, minimal background noise
1369
- - 10-15 seconds of audio
1370
- - Natural, conversational speech
1371
-
1372
- 2. **Increase training quality**:
1373
- ```json
1374
- {
1375
- "quality": "high" // Use high instead of standard
1376
- }
1377
- ```
1378
-
1379
- 3. **Train longer**:
1380
- ```json
1381
- {
1382
- "total_epoch": 16 // Increase from 8 to 16
1383
- }
1384
- ```
1385
-
1386
- 4. **Check reference audio**: Ensure uploaded audio is not corrupted
1387
-
1388
- ---
1389
-
1390
- ## Development
1391
-
1392
- ### Backend Development
1393
-
1394
- #### Running with Hot-Reload
1395
-
1396
- Hot-reload automatically restarts the server when code changes are detected:
1397
-
1398
- ```bash
1399
- # Using uvicorn
1400
- uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
1401
-
1402
- # With custom reload directories
1403
- uvicorn app.main:app --reload --reload-dir api_server/app
1404
- ```
1405
-
1406
- #### Running Tests
1407
-
1408
- ```bash
1409
- # Navigate to project root
1410
- cd GPT-SoVITS
1411
-
1412
- # Run all tests
1413
- pytest api_server/tests/
1414
-
1415
- # Run specific test file
1416
- pytest api_server/tests/test_tasks.py
1417
-
1418
- # Run with coverage report
1419
- pytest --cov=api_server/app --cov-report=html
1420
-
1421
- # View coverage report
1422
- open htmlcov/index.html
1423
- ```
1424
-
1425
- #### Code Formatting
1426
-
1427
- ```bash
1428
- # Format Python code with Black
1429
- black api_server/
1430
-
1431
- # Sort imports with isort
1432
- isort api_server/
1433
-
1434
- # Lint with flake8
1435
- flake8 api_server/
1436
-
1437
- # Type checking with mypy
1438
- mypy api_server/
1439
- ```
1440
-
1441
- #### Database Migrations
1442
-
1443
- ```bash
1444
- # Generate migration
1445
- alembic revision --autogenerate -m "Add new column"
1446
-
1447
- # Apply migrations
1448
- alembic upgrade head
1449
-
1450
- # Rollback migration
1451
- alembic downgrade -1
1452
- ```
1453
-
1454
- #### Adding New Endpoints
1455
-
1456
- 1. Create route in `api_server/app/routes/`
1457
- 2. Add business logic in `api_server/app/services/`
1458
- 3. Update models in `api_server/app/models/`
1459
- 4. Add tests in `api_server/tests/`
1460
- 5. Update OpenAPI documentation
1461
-
1462
- ### Frontend Development
1463
-
1464
- #### Development Mode
1465
-
1466
- Development mode enables hot module replacement (HMR) for instant feedback:
1467
-
1468
- ```bash
1469
- # Start development server
1470
- npm run dev
1471
-
1472
- # Start with custom port
1473
- npm run dev -- --port 5174
1474
-
1475
- # Start with debug logging
1476
- DEBUG=electron* npm run dev
1477
- ```
1478
-
1479
- #### Type Checking
1480
-
1481
- ```bash
1482
- # Run Vue type checking
1483
- npm run type-check
1484
-
1485
- # Run TypeScript compiler check
1486
- npx tsc --noEmit
1487
-
1488
- # Watch mode for continuous checking
1489
- npm run type-check -- --watch
1490
- ```
1491
-
1492
- #### Building for Production
1493
-
1494
- **Development Build** (with source maps):
1495
- ```bash
1496
- npm run build
1497
- ```
1498
-
1499
- **Production Build** (optimized):
1500
- ```bash
1501
- npm run build:prod
1502
- ```
1503
-
1504
- **Preview Production Build**:
1505
- ```bash
1506
- npm run preview
1507
- ```
1508
-
1509
- #### Building Distribution Packages
1510
-
1511
- Build platform-specific installers:
1512
-
1513
- **macOS**:
1514
- ```bash
1515
- npm run build:mac
1516
- # Output: tts-voice-app/release/MoYoYo-TTS-1.0.0.dmg
1517
- ```
1518
-
1519
- **Windows**:
1520
- ```bash
1521
- npm run build:win
1522
- # Output: tts-voice-app/release/MoYoYo-TTS-Setup-1.0.0.exe
1523
- ```
1524
-
1525
- **Linux**:
1526
- ```bash
1527
- npm run build:linux
1528
- # Output: tts-voice-app/release/moyoyo-tts-1.0.0.AppImage
1529
- ```
1530
-
1531
- **Build All Platforms** (requires platform-specific dependencies):
1532
- ```bash
1533
- npm run build:all
1534
- ```
1535
-
1536
- **Build Configuration**:
1537
- Edit `tts-voice-app/electron-builder.yml` to customize:
1538
- - App name and ID
1539
- - Icon files
1540
- - File associations
1541
- - Auto-update settings
1542
- - Code signing
1543
-
1544
- #### Component Development
1545
-
1546
- **Create New Component**:
1547
- ```bash
1548
- # Navigate to components directory
1549
- cd tts-voice-app/src/components
1550
-
1551
- # Create component file
1552
- touch MyComponent.vue
1553
- ```
1554
-
1555
- **Component Template**:
1556
- ```vue
1557
- <template>
1558
- <div class="my-component">
1559
- <!-- Template here -->
1560
- </div>
1561
- </template>
1562
-
1563
- <script setup lang="ts">
1564
- import { ref } from 'vue'
1565
-
1566
- // Component logic here
1567
- const myValue = ref('')
1568
- </script>
1569
-
1570
- <style scoped>
1571
- .my-component {
1572
- /* Styles here */
1573
- }
1574
- </style>
1575
- ```
1576
-
1577
- #### State Management
1578
-
1579
- The app uses Vue Composition API with Pinia stores:
1580
-
1581
- ```typescript
1582
- // Create new store in src/stores/myStore.ts
1583
- import { defineStore } from 'pinia'
1584
-
1585
- export const useMyStore = defineStore('myStore', {
1586
- state: () => ({
1587
- items: []
1588
- }),
1589
- getters: {
1590
- itemCount: (state) => state.items.length
1591
- },
1592
- actions: {
1593
- addItem(item) {
1594
- this.items.push(item)
1595
- }
1596
- }
1597
- })
1598
- ```
1599
-
1600
- #### Debugging
1601
-
1602
- **Vue DevTools**:
1603
- - Automatically enabled in development mode
1604
- - Access via browser DevTools panel
1605
-
1606
- **Electron DevTools**:
1607
- ```bash
1608
- # Open DevTools on startup
1609
- DEBUG_ELECTRON=true npm run dev
1610
- ```
1611
-
1612
- **Console Logging**:
1613
- ```typescript
1614
- // Main process logs
1615
- console.log('Main:', data)
1616
-
1617
- // Renderer process logs
1618
- console.log('Renderer:', data)
1619
-
1620
- // Check logs in terminal and DevTools console
1621
- ```
1622
-
1623
- #### Testing
1624
-
1625
- ```bash
1626
- # Run unit tests
1627
- npm run test
1628
-
1629
- # Run with coverage
1630
- npm run test:coverage
1631
-
1632
- # Run E2E tests
1633
- npm run test:e2e
1634
-
1635
- # Watch mode
1636
- npm run test:watch
1637
- ```
1638
-
1639
- ### Project Structure
-
- ```
- GPT-SoVITS/
- ├── api_server/              # Backend API
- │   ├── app/
- │   │   ├── main.py          # FastAPI application
- │   │   ├── routes/          # API endpoints
- │   │   ├── services/        # Business logic
- │   │   ├── models/          # Data models
- │   │   └── utils/           # Utilities
- │   └── tests/               # Backend tests
- ├── tts-voice-app/           # Frontend Electron app
- │   ├── src/
- │   │   ├── main/            # Electron main process
- │   │   ├── renderer/        # Vue UI
- │   │   ├── components/      # Vue components
- │   │   └── stores/          # State management
- │   └── dist/                # Build output
- ├── GPT_SoVITS/              # Core ML models
- │   ├── pretrained_models/   # Base models
- │   └── text/                # Text processing
- └── .env                     # Configuration
- ```
-
- ### Contribution Guidelines
-
- 1. **Fork and clone the repository**
- 2. **Create feature branch**: `git checkout -b feature/my-feature`
- 3. **Make changes** and add tests
- 4. **Run tests and linting**: `pytest && black . && isort .`
- 5. **Commit changes**: `git commit -m "feat: add my feature"`
- 6. **Push to branch**: `git push origin feature/my-feature`
- 7. **Create Pull Request** with description
-
- **Commit Message Format**:
- - `feat:` New feature
- - `fix:` Bug fix
- - `docs:` Documentation changes
- - `style:` Code style changes
- - `refactor:` Code refactoring
- - `test:` Test changes
- - `chore:` Build/tooling changes
-
- ---
-
- ## Additional Resources
-
- ### Documentation
-
- - **API Documentation**: http://localhost:8000/docs
- - **Design Document**: `frontend_design.md`
- - **Development Guide**: `development.md`
- - **OpenAPI Specification**: `openapi.json`
-
- ### External Links
-
- - **GPT-SoVITS Repository**: https://github.com/RVC-Boss/GPT-SoVITS
- - **ModelScope Models**: https://www.modelscope.cn/models/XXXXRT/GPT-SoVITS-Pretrained
- - **FastAPI Documentation**: https://fastapi.tiangolo.com
- - **Vue 3 Documentation**: https://vuejs.org
- - **Electron Documentation**: https://www.electronjs.org
-
- ### Support
-
- For issues, questions, or feature requests:
- 1. Check this documentation first
- 2. Search existing GitHub issues
- 3. Create a new issue with detailed description
- 4. Include error messages, logs, and system info
-
- ### License
-
- This project is licensed under the MIT License. See `LICENSE` file for details.
-
  ---

  **Last Updated**: 2026-01-23
 
15
  - [Running the Application](#running-the-application)
16
  - [Start Backend API Server](#51-start-backend-api-server)
17
  - [Start Frontend Electron App](#52-start-frontend-electron-app)
18
+ - [First-Time Setup](#first-time-setup)
19
+ - [Feature Overview](#feature-overview)
20
  - [Troubleshooting](#troubleshooting)
 
21
 
22
  ---
23
 
 
373
 
374
  ---
375
 
376
+ ## First-Time Setup
 
 
377
 
378
  When you first launch the Electron app, you'll need to download required models.
379
 
 
413
  - Verify you have ~10 GB free disk space
414
  - For manual installation, see section 3.3
415
 
416
  ---
417
 
418
+ ## Feature Overview
419
 
420
+ MoYoYo.tts provides voice cloning and text-to-speech through four main features:
421
 
422
+ **Quick Mode** offers a streamlined one-click workflow for beginners. Upload a 5-30 second audio sample, select a quality preset (fast/standard/high), and start training. The system handles every pipeline stage automatically, including audio processing, speech recognition, feature extraction, and model training. Training takes roughly 10 minutes (fast), 20 minutes (standard), or 40 minutes (high), after which the custom voice is ready for text-to-speech generation.
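As a rough illustration of the Quick Mode request, the sketch below builds the JSON body for `POST /api/v1/tasks`, using the field names and preset values from the detailed guide that this commit removes. The helper name `build_task_payload` and the `QUALITY_PRESETS` table are illustrative assumptions, not part of the project; the Swagger UI remains the authority on the real schema.

```python
import json

# Preset values taken from the quality-preset table in the removed guide:
# (SoVITS epochs, GPT epochs, approximate training minutes).
QUALITY_PRESETS = {
    "fast": (4, 8, 10),
    "standard": (8, 15, 20),
    "high": (16, 30, 40),
}


def build_task_payload(exp_name, audio_file_id, quality="standard",
                       language="zh", version="v2"):
    """Return a request body for POST /api/v1/tasks (hypothetical helper)."""
    if quality not in QUALITY_PRESETS:
        raise ValueError(f"unknown quality preset: {quality!r}")
    return {
        "exp_name": exp_name,
        "audio_file_id": audio_file_id,
        "options": {"version": version, "language": language, "quality": quality},
    }


payload = build_task_payload(
    "my_voice", "550e8400-e29b-41d4-a716-446655440000", quality="fast"
)
print(json.dumps(payload, indent=2))
```

Validating the preset name on the client keeps an obviously bad request from being queued; the server may apply further checks of its own.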
423
 
424
+ **Advanced Mode** gives experienced users granular control over each stage of the training pipeline. You can fine-tune parameters for audio slicing, choose between ASR models (DamoASR for Chinese, Faster Whisper for multilingual), adjust training epochs and batch sizes, and monitor detailed progress for each stage. This mode is ideal for optimizing quality or working with specific audio characteristics.
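The stage-by-stage flow could be driven from a script along these lines. The seven stage names and the `/api/v1/experiments/{exp_id}/stages/{stage_type}` path come from the removed guide; the helpers `stage_url` and `next_pending_stage` are hypothetical, and running the stages strictly in listed order is a simplifying assumption (the original dependency diagram allows some branches to proceed independently).

```python
# Stage names in the order the removed guide lists them.
STAGES = [
    "audio_slice",
    "asr",
    "text_feature",
    "hubert_feature",
    "semantic_token",
    "sovits_train",
    "gpt_train",
]


def stage_url(base_url, exp_id, stage):
    """Build the endpoint URL used to run or inspect one stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{base_url}/api/v1/experiments/{exp_id}/stages/{stage}"


def next_pending_stage(stages):
    """Return the first stage not yet completed, or None when all are done.

    `stages` mirrors the "stages" object of the experiment response,
    e.g. {"audio_slice": {"status": "completed"}, "asr": {"status": "pending"}}.
    """
    for name in STAGES:
        if stages.get(name, {}).get("status") != "completed":
            return name
    return None


status = {"audio_slice": {"status": "completed"}, "asr": {"status": "pending"}}
stage = next_pending_stage(status)
print(stage, stage_url("http://localhost:8000", "exp-uuid", stage))
```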
425
 
426
+ **Text-to-Speech Generation** converts text into natural-sounding speech with any trained voice. Adjust the speaking speed (0.5x-2.0x), select an emotional tone if the model supports it, and generate audio within a few seconds. The system supports multiple languages and provides audio playback and download.
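A minimal sketch of a client-side check for a TTS request, assuming the parameter limits documented in the removed guide (text up to 5000 characters, speed between 0.5 and 2.0, emotion one of auto/neutral/happy/sad, optional seed). The function name `build_tts_payload` is hypothetical; the body is intended for `POST /api/v1/inference/tts`.

```python
ALLOWED_EMOTIONS = ("auto", "neutral", "happy", "sad")
MAX_TEXT_LENGTH = 5000  # character limit stated in the removed guide


def build_tts_payload(text, voice_id, speed=1.0, emotion="auto", seed=None):
    """Validate the inputs and return a body for POST /api/v1/inference/tts."""
    if not text or len(text) > MAX_TEXT_LENGTH:
        raise ValueError("text must contain 1 to 5000 characters")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if emotion not in ALLOWED_EMOTIONS:
        raise ValueError(f"emotion must be one of {ALLOWED_EMOTIONS}")
    payload = {"text": text, "voice_id": voice_id, "speed": speed, "emotion": emotion}
    if seed is not None:
        # The seed is optional and only sent when reproducibility is wanted.
        payload["seed"] = seed
    return payload


request_body = build_tts_payload("Hello, this is a test.", "voice-uuid-here", speed=1.5)
print(request_body)
```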
427
 
428
+ **Voice Library Management** keeps all your trained voices organized in one place. Browse, search, and filter voices by language or quality. Preview any voice with sample audio, export models for backup or sharing, and manage your voice collection efficiently.
429
 
430
+ For detailed API documentation and advanced usage, visit the interactive Swagger UI at **http://localhost:8000/docs** when the backend server is running.
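Since the Swagger UI is generated from the server's OpenAPI document, the same endpoint list can also be read programmatically. The sketch below assumes the document is served at `http://localhost:8000/openapi.json` (the location named in the removed API reference); `list_endpoints` and `fetch_spec` are illustrative helper names, and the sample document is a made-up fragment used so the example runs without a server.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def list_endpoints(spec):
    """Return sorted (METHOD, path) pairs from an OpenAPI document's "paths" object."""
    endpoints = []
    for path, item in spec.get("paths", {}).items():
        for method in item:
            # Path items can also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


def fetch_spec(url="http://localhost:8000/openapi.json"):
    """Download the OpenAPI document; requires the backend server to be running."""
    with urlopen(url) as response:
        return json.load(response)


# Offline demonstration with a tiny hand-written fragment.
sample_spec = {
    "paths": {
        "/api/v1/tasks": {"post": {}, "get": {}},
        "/health": {"get": {}},
    }
}
for method, path in list_endpoints(sample_spec):
    print(method, path)
```

With the backend running, `list_endpoints(fetch_spec())` would print the live endpoint list instead of the sample.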
431
 
432
  ---
433
 
 
471
  python app/main.py
472
  ```
473
 
474
  #### Python Environment Issues
475
 
476
  **Symptom**: `ModuleNotFoundError` or import errors.
 
575
  %APPDATA%\tts-voice-app\logs\
576
  ```
577
 
578
  ---
579
 
580
  **Last Updated**: 2026-01-23
USAGE_CN.md CHANGED
@@ -16,15 +16,9 @@
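  - [Running the Application](#running-the-application)
  - [Start Backend API Server](#51-start-backend-api-server)
  - [Start Frontend Electron App](#52-start-frontend-electron-app)
- - [Usage Guide](#usage-guide)
- - [First-Time Setup](#61-first-time-setup)
- - [Quick Mode - Voice Cloning for Beginners](#62-quick-mode---voice-cloning-for-beginners)
- - [Advanced Mode - Expert Voice Cloning](#63-advanced-mode---expert-voice-cloning)
- - [Text-to-Speech Generation](#64-text-to-speech-generation)
- - [Voice Library Management](#65-voice-library-management)
- - [API Reference](#api-reference)
  - [Troubleshooting](#troubleshooting)
- - [Development](#development)

  ---

@@ -427,9 +421,7 @@ The Electron application will launch automatically, with hot reload enabled in development mode.

  ---

- ## Usage Guide
-
- ### 6.1 First-Time Setup

  When you first launch the Electron app, you'll need to download required models.

@@ -469,702 +461,21 @@ The Electron application will launch automatically, with hot reload enabled in development mode.
  - Verify you have ~10 GB free disk space
  - For manual installation, see section 3.3
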
- ### 6.2 Quick Mode - Voice Cloning for Beginners
-
- Quick Mode provides a simplified workflow for users who want to create a voice clone quickly without technical knowledge.
-
- #### Using the API
-
- **Step 1: Upload the audio file**
-
- ```bash
- curl -X POST http://localhost:8000/api/v1/files \
-   -F "file=@path/to/voice_sample.wav" \
-   -F "purpose=training"
- ```
-
- **Response**:
- ```json
- {
-   "file_id": "550e8400-e29b-41d4-a716-446655440000",
-   "filename": "voice_sample.wav",
-   "size": 1234567,
-   "purpose": "training"
- }
- ```
-
- **Step 2: Create a training task**
-
- ```bash
- curl -X POST http://localhost:8000/api/v1/tasks \
-   -H "Content-Type: application/json" \
-   -d '{
-     "exp_name": "my_voice",
-     "audio_file_id": "550e8400-e29b-41d4-a716-446655440000",
-     "options": {
-       "version": "v2",
-       "language": "zh",
-       "quality": "standard"
-     }
-   }'
- ```
-
- **Response**:
- ```json
- {
-   "id": "task-uuid-here",
-   "status": "queued",
-   "exp_name": "my_voice",
-   "created_at": "2026-01-23T10:30:00Z"
- }
- ```
-
- **Step 3: Monitor progress**
-
- Using Server-Sent Events (SSE):
- ```bash
- curl -N http://localhost:8000/api/v1/tasks/task-uuid-here/progress
- ```
-
- **Progress events**:
- ```
- event: progress
- data: {"stage": "audio_slice", "progress": 25, "message": "Slicing audio..."}
-
- event: progress
- data: {"stage": "sovits_train", "progress": 50, "message": "Training SoVITS model..."}
-
- event: complete
- data: {"status": "completed", "voice_id": "voice-uuid-here"}
- ```
-
- #### Quality Presets
-
- | Preset | SoVITS Epochs | GPT Epochs | Estimated Time | Quality |
- |--------|---------------|------------|----------------|---------|
- | **fast** | 4 | 8 | ~10 min | Good for testing |
- | **standard** | 8 | 15 | ~20 min | Balanced quality/speed |
- | **high** | 16 | 30 | ~40 min | Best quality |
-
- **Recommendations**:
- - Use `fast` for quick tests and previews
- - Use `standard` for most production use cases
- - Use `high` for professional applications that need the highest quality
-
- #### Using the UI
-
- **Step 1: Open the Voice Cloning page**
- - Click "Voice Cloning" in the sidebar
- - Or use the keyboard shortcut: `Ctrl/Cmd + N`
-
- **Step 2: Upload an audio sample**
- - Click the "Upload Audio" button
- - Select a WAV or MP3 file
- - **Requirements**:
-   - Duration: 5-30 seconds recommended
-   - Quality: clear voice with minimal background noise
-   - Content: natural speech, not singing or shouting
-
- **Step 3: Configure training**
- - **Voice name**: enter a unique name (for example, "Zhang San's Voice")
- - **Language**: select the primary language (Chinese, English, Japanese)
- - **Quality preset**: choose from fast/standard/high
-
- **Step 4: Start training**
- - Click the "Start Training" button
- - The task is queued and processing begins
-
- **Step 5: Monitor progress**
- - A progress bar shows overall completion
- - The current stage is displayed (for example, "Training SoVITS model...")
- - The estimated time remaining is displayed
- - You can navigate away and check back later
-
- **Step 6: Training complete**
- - You receive a notification when training finishes
- - The voice appears in the Voice Library automatically
- - You can use it for TTS generation immediately
-
- **Tips for best results**:
- - Use high-quality audio (48kHz WAV preferred)
- - Keep pitch and speaking style consistent
- - Avoid audio with music or sound effects
- - 10-15 seconds is the ideal sample length
- - Multiple short samples can be combined
-
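The progress stream from the `/progress` endpoint in this section uses the standard Server-Sent Events text format, so it could be consumed with a small parser like the sketch below. The function name `parse_sse` is an assumption for illustration, the sample lines are copied from the events shown in this section with the `message` field omitted for brevity, and the parser assumes every `data:` payload is JSON, as in those examples.

```python
import json


def parse_sse(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
    if data:
        # Flush a final event that was not followed by a blank line.
        yield event, json.loads("\n".join(data))


sample_stream = [
    "event: progress",
    'data: {"stage": "audio_slice", "progress": 25}',
    "",
    "event: complete",
    'data: {"status": "completed", "voice_id": "voice-uuid-here"}',
    "",
]
events = list(parse_sse(sample_stream))
for name, body in events:
    print(name, body)
```

In practice the lines would come from a streaming HTTP response rather than a list; the parsing logic stays the same.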
- ### 6.3 Advanced Mode - Expert Voice Cloning
-
- Advanced Mode provides fine-grained control over each stage of the voice training pipeline. It is recommended for users who want to fine-tune training parameters.
-
- #### Training Pipeline Stages
-
- The complete training pipeline consists of 7 stages:
-
- 1. **Audio Slice**: split the audio into segments
- 2. **ASR** (Automatic Speech Recognition): transcribe the audio to text
- 3. **Text Feature**: extract text embeddings
- 4. **Hubert Feature**: extract audio features
- 5. **Semantic Token**: generate semantic tokens
- 6. **SoVITS Train**: train the voice synthesis model
- 7. **GPT Train**: train the text-to-semantic model
-
- #### Stage Dependencies
-
- ```
- audio_slice → asr → text_feature → sovits_train
- ↘ ↗
- hubert_feature → semantic_token → gpt_train
- ```
-
- **Important**: each stage must wait for its dependencies to complete.
-
- #### Using the API
-
- **Step 1: Create an experiment**
-
- ```bash
- curl -X POST http://localhost:8000/api/v1/experiments \
-   -H "Content-Type: application/json" \
-   -d '{
-     "exp_name": "my_custom_voice",
-     "version": "v2",
-     "audio_file_id": "file-uuid-here"
-   }'
- ```
-
- **Response**:
- ```json
- {
-   "id": "exp-uuid-here",
-   "exp_name": "my_custom_voice",
-   "version": "v2",
-   "stages": {
-     "audio_slice": {"status": "pending"},
-     "asr": {"status": "pending"},
-     "text_feature": {"status": "pending"},
-     "hubert_feature": {"status": "pending"},
-     "semantic_token": {"status": "pending"},
-     "sovits_train": {"status": "pending"},
-     "gpt_train": {"status": "pending"}
-   }
- }
- ```
-
- **Step 2: Run stages individually**
-
- **Stage 1 - Audio Slice**:
- ```bash
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/audio_slice \
-   -H "Content-Type: application/json" \
-   -d '{
-     "threshold": -34,
-     "min_length": 4000,
-     "min_interval": 300,
-     "hop_size": 10,
-     "max_silence_kept": 500
-   }'
- ```
-
- **Parameters**:
- - `threshold`: dB threshold for silence detection (-60 to 0, default: -34)
- - `min_length`: minimum segment length in milliseconds (1000-10000, default: 4000)
- - `min_interval`: minimum silence interval in milliseconds (0-3000, default: 300)
- - `hop_size`: analysis window hop size in milliseconds (default: 10)
- - `max_silence_kept`: maximum silence to keep in milliseconds (default: 500)
-
- **Stage 2 - ASR**:
- ```bash
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/asr \
-   -H "Content-Type: application/json" \
-   -d '{
-     "model": "达摩 ASR (中文)",
-     "language": "zh"
-   }'
- ```
-
- **ASR models**:
- - `达摩 ASR (中文)`: DamoASR for Chinese (best for Chinese)
- - `Faster Whisper (多语言)`: Faster Whisper for multilingual audio
-
- **Stage 3 - Text Feature**:
- ```bash
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/text_feature \
-   -H "Content-Type: application/json" \
-   -d '{
-     "language": "zh"
-   }'
- ```
-
- **Stage 4 - Hubert Feature**:
- ```bash
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/hubert_feature \
-   -H "Content-Type: application/json" \
-   -d '{}'
- ```
-
- **Stage 5 - Semantic Token**:
- ```bash
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/semantic_token \
-   -H "Content-Type: application/json" \
-   -d '{}'
- ```
-
- **Stage 6 - SoVITS Train**:
- ```bash
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/sovits_train \
-   -H "Content-Type: application/json" \
-   -d '{
-     "total_epoch": 8,
-     "batch_size": 4,
-     "save_every_epoch": 4,
-     "text_low_lr_rate": 0.4,
-     "if_save_latest": true,
-     "if_save_every_weights": true,
-     "version": "v2"
-   }'
- ```
-
- **Parameters**:
- - `total_epoch`: total training epochs (4-32, default: 8)
- - `batch_size`: batch size (1-40, default: 4)
- - `save_every_epoch`: save a checkpoint every N epochs (1-50, default: 4)
- - `text_low_lr_rate`: text encoder learning-rate multiplier (0.2-1.0, default: 0.4)
-
- **Stage 7 - GPT Train**:
- ```bash
- curl -X POST http://localhost:8000/api/v1/experiments/exp-uuid/stages/gpt_train \
-   -H "Content-Type: application/json" \
-   -d '{
-     "total_epoch": 15,
-     "batch_size": 4,
-     "save_every_epoch": 5,
-     "if_save_latest": true,
-     "if_save_every_weights": true,
-     "version": "v2"
-   }'
- ```
-
- **Step 3: Monitor stage progress**
-
- Each stage reports real-time progress over SSE:
-
- ```bash
- curl -N http://localhost:8000/api/v1/experiments/exp-uuid/stages/sovits_train/progress
- ```
-
- **Progress events**:
- ```
- event: progress
- data: {"epoch": 2, "total_epochs": 8, "progress": 25, "loss": 0.234}
-
- event: progress
- data: {"epoch": 4, "total_epochs": 8, "progress": 50, "loss": 0.189}
-
- event: complete
- data: {"status": "completed", "final_loss": 0.142}
- ```
-
- #### Using the UI
-
- **Step 1: Create a new experiment**
- - Open the "Advanced Mode" page
- - Click "New Experiment"
- - Enter an experiment name and upload audio
-
- **Step 2: Configure each stage**
- - Click a stage card to expand its settings
- - Adjust parameters (or use the preset defaults)
- - Click "Run Stage" to execute
-
- **Step 3: Monitor the pipeline**
- - A visual pipeline diagram shows stage status
- - Green: completed, blue: running, gray: pending
- - Click any stage to view detailed logs
-
- **Step 4: Iterate and refine**
- - Check the results after each stage
- - Adjust parameters and re-run if needed
- - Export the final model when satisfied
-
- **Advanced tips**:
- - Use a lower `batch_size` (2-4) on GPUs with limited memory
- - Increase `total_epoch` for better quality when you have enough data
- - Save checkpoints frequently (`save_every_epoch`) to recover from interruptions
- - Monitor the loss value; it should decrease as epochs progress
-
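The audio-slice parameter ranges listed in this section could be checked before a request is sent, roughly as follows. The defaults and ranges are the ones documented above; the helper name `audio_slice_params` is hypothetical, and `hop_size` and `max_silence_kept` are left unchecked because the guide gives no range for them.

```python
# Defaults documented for the audio_slice stage.
AUDIO_SLICE_DEFAULTS = {
    "threshold": -34,         # dB
    "min_length": 4000,       # ms
    "min_interval": 300,      # ms
    "hop_size": 10,           # ms
    "max_silence_kept": 500,  # ms
}

# Inclusive ranges; only parameters with a documented range are listed.
AUDIO_SLICE_RANGES = {
    "threshold": (-60, 0),
    "min_length": (1000, 10000),
    "min_interval": (0, 3000),
}


def audio_slice_params(**overrides):
    """Merge overrides into the defaults and reject out-of-range values."""
    unknown = set(overrides) - set(AUDIO_SLICE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**AUDIO_SLICE_DEFAULTS, **overrides}
    for name, (low, high) in AUDIO_SLICE_RANGES.items():
        if not low <= params[name] <= high:
            raise ValueError(f"{name}={params[name]} is outside {low}..{high}")
    return params


slice_params = audio_slice_params(threshold=-40, min_length=3000)
print(slice_params)
```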
- ### 6.4 Text-to-Speech Generation
-
- Once a voice is trained, you can use it to generate speech from text.
-
- #### Using the API
-
- **Basic TTS request**:
- ```bash
- curl -X POST http://localhost:8000/api/v1/inference/tts \
-   -H "Content-Type: application/json" \
-   -d '{
-     "text": "Hello, this is a test of text-to-speech synthesis.",
-     "voice_id": "voice-uuid-here",
-     "speed": 1.0,
-     "emotion": "auto"
-   }'
- ```
-
- **Response**:
- ```json
- {
-   "audio_url": "http://localhost:8000/api/v1/files/audio-uuid-here",
-   "duration": 3.2,
-   "format": "wav"
- }
- ```
-
- **Parameters**:
- - `text` (required): text to synthesize (up to 5000 characters)
- - `voice_id` (required): UUID of the trained voice
- - `speed` (optional): speaking speed multiplier (0.5 - 2.0, default: 1.0)
- - `emotion` (optional): emotional style (auto, neutral, happy, sad)
- - `seed` (optional): random seed for reproducibility
-
- **Download the generated audio**:
- ```bash
- curl -o output.wav http://localhost:8000/api/v1/files/audio-uuid-here
- ```
-
- #### Using the UI
-
- **Step 1: Open the TTS page**
- - Click "Text-to-Speech" in the sidebar
- - Or use the keyboard shortcut: `Ctrl/Cmd + T`
-
- **Step 2: Select a voice**
- - Open the voice dropdown
- - Choose a trained voice from the list
- - The preview button lets you hear a sample
-
- **Step 3: Enter text**
- - Type or paste text into the text area
- - A character count is shown (maximum 5000)
- - Multi-line text is supported
-
- **Step 4: Adjust settings**
- - **Speed**: drag the slider or enter a value (0.5x - 2.0x)
-   - 0.5x: very slow, clear articulation
-   - 1.0x: natural speaking pace
-   - 1.5x: fast, still clear
-   - 2.0x: very fast
- - **Emotion**: choose from the dropdown (if the model supports it)
-   - Auto: inferred from the text
-   - Neutral: flat, factual delivery
-   - Happy: upbeat, positive tone
-   - Sad: melancholy, sorrowful tone
-
- **Step 5: Generate**
- - Click the "Generate" button
- - Processing takes 2-5 seconds
- - A progress indicator is shown
-
- **Step 6: Listen and download**
- - The audio player appears automatically
- - Click the play button to listen
- - Click the download button to save a WAV file
- - The share button copies a shareable link
-
- **Text guidelines**:
- - Use proper punctuation for natural pauses
- - Break long text into sentences
- - Use quotation marks for dialogue
- - Use all caps for emphasis (sparingly)
-
- **Tips for natural speech**:
- - Add commas for breathing pauses
- - Use ellipses (...) for trailing off
- - Question marks affect intonation
- - Exclamation marks add emphasis
-
- ### 6.5 Voice Library Management
-
- The Voice Library is where all trained voices are stored and managed.
-
- #### Using the API
-
- **List all voices**:
- ```bash
- # Quote the URL so the shell does not interpret "?"
- curl "http://localhost:8000/api/v1/files?purpose=training"
- ```
-
- **Response**:
- ```json
- {
-   "files": [
-     {
-       "id": "voice-uuid-1",
-       "filename": "john_voice",
-       "created_at": "2026-01-20T10:30:00Z",
-       "size": 1234567,
-       "metadata": {
-         "language": "zh",
-         "quality": "standard",
-         "duration": 12.5
-       }
-     },
-     {
-       "id": "voice-uuid-2",
-       "filename": "mary_voice",
-       "created_at": "2026-01-21T14:20:00Z",
-       "size": 2345678,
-       "metadata": {
-         "language": "en",
-         "quality": "high",
-         "duration": 18.3
-       }
-     }
-   ]
- }
- ```
-
- **Get voice details**:
- ```bash
- curl http://localhost:8000/api/v1/files/voice-uuid-1
- ```
-
- **Delete a voice**:
- ```bash
- curl -X DELETE http://localhost:8000/api/v1/files/voice-uuid-1
- ```
-
- **Export a voice model**:
- ```bash
- curl -o voice_model.zip http://localhost:8000/api/v1/voices/voice-uuid-1/export
- ```
-
- #### Using the UI
-
- **Browse the Voice Library**:
- - Open the "Voice Library" page
- - Voices are shown as cards with:
-   - Voice name
-   - Language and quality badges
-   - Creation date
-   - Sample duration
-   - Preview waveform
-
- **Voice card actions**:
- - **Play**: listen to the voice sample
- - **Edit**: rename or update metadata
- - **Export**: download the voice model files
- - **Delete**: delete the voice (with confirmation)
-
- **Search and filter**:
- - Search bar: filter by voice name
- - Language filter: show only a specific language
- - Quality filter: show only a specific quality preset
- - Sort options:
-   - Name (A-Z)
-   - Creation date (newest first)
-   - Creation date (oldest first)
-   - File size
-
- **Bulk operations**:
- - Select multiple voices (Shift+click)
- - Export the selected voices as a ZIP
- - Delete the selected voices
- - Tag the selected voices
-
- **Voice details panel**:
- Click any voice card to view:
- - Complete training parameters
- - Training history and logs
- - Model file size
- - Sample audio clips
- - Export and share options
-
- **Organization tips**:
- - Use descriptive names (for example, "ZhangSan_Professional", "LiSi_Casual")
- - Tag voices by project or use case
- - Export important voices as a backup
- - Delete test voices to save space
-
  ---

- ## API Reference
-
- ### Quick Mode Endpoints
-
- #### Tasks

- **Create Task** - Start a one-click voice training task
- ```http
- POST /api/v1/tasks
- Content-Type: application/json

- {
-   "exp_name": "string",
-   "audio_file_id": "uuid",
-   "options": {
-     "version": "v2",
-     "language": "zh|en|ja",
-     "quality": "fast|standard|high"
-   }
- }
- ```
-
- **List Tasks** - Get all tasks
- ```http
- GET /api/v1/tasks?status=queued|running|completed|failed
- ```

- **Get Task** - Get details of a specific task
- ```http
- GET /api/v1/tasks/{task_id}
- ```
-
- **Cancel Task** - Cancel a running task
- ```http
- DELETE /api/v1/tasks/{task_id}
- ```
-
- **Task Progress** - Real-time progress via SSE
- ```http
- GET /api/v1/tasks/{task_id}/progress
- Accept: text/event-stream
- ```

- ### Advanced Mode Endpoints

- #### Experiments
-
- **Create Experiment** - Initialize a new training experiment
- ```http
- POST /api/v1/experiments
- Content-Type: application/json
-
- {
-   "exp_name": "string",
-   "version": "v2",
-   "audio_file_id": "uuid"
- }
- ```
-
- **Get Experiment** - Get experiment details
- ```http
- GET /api/v1/experiments/{exp_id}
- ```
-
- **List Experiments** - Get all experiments
- ```http
- GET /api/v1/experiments?status=pending|running|completed
- ```
-
- **Delete Experiment** - Delete an experiment and all of its data
- ```http
- DELETE /api/v1/experiments/{exp_id}
- ```
-
- #### Stages
-
- **Execute Stage** - Run a specific pipeline stage
- ```http
- POST /api/v1/experiments/{exp_id}/stages/{stage_type}
- Content-Type: application/json
-
- {
-   // Stage-specific parameters
- }
- ```
-
- **Stage types**:
- - `audio_slice`
- - `asr`
- - `text_feature`
- - `hubert_feature`
- - `semantic_token`
- - `sovits_train`
- - `gpt_train`
-
- **Get Stage Status** - Get the status of a specific stage
- ```http
- GET /api/v1/experiments/{exp_id}/stages/{stage_type}
- ```
-
- **Get All Stage Statuses** - Get the status of all stages
- ```http
- GET /api/v1/experiments/{exp_id}/stages
- ```
-
- **Stage Progress** - Real-time stage progress via SSE
- ```http
- GET /api/v1/experiments/{exp_id}/stages/{stage_type}/progress
- Accept: text/event-stream
- ```
-
- **Get Stage Schema** - Get the parameter schema for a stage
- ```http
- GET /api/v1/stages/{stage_type}/schema
- ```
-
- ### Common Endpoints
-
- #### Files
-
- **Upload File** - Upload an audio or data file
- ```http
- POST /api/v1/files
- Content-Type: multipart/form-data
-
- file: binary
- purpose: training|inference
- ```
-
- **List Files** - Get all uploaded files
- ```http
- GET /api/v1/files?purpose=training|inference
- ```
-
- **Get File** - Download a specific file
- ```http
- GET /api/v1/files/{file_id}
- ```
-
- **Delete File** - Delete a file
- ```http
- DELETE /api/v1/files/{file_id}
- ```
-
- #### Inference
-
- **Text-to-Speech** - Generate speech from text
- ```http
- POST /api/v1/inference/tts
- Content-Type: application/json
-
- {
-   "text": "string",
-   "voice_id": "uuid",
-   "speed": 1.0,
-   "emotion": "auto|neutral|happy|sad",
-   "seed": 42
- }
- ```
-
- **Get Voice Info** - Get voice model information
- ```http
- GET /api/v1/voices/{voice_id}
- ```
-
- #### Configuration
-
- **Get Stage Presets** - Get preset configurations for stages
- ```http
- GET /api/v1/stages/presets
- ```
-
- **Health Check** - Check API server health
- ```http
- GET /health
- ```

- **The complete OpenAPI specification is available at**: http://localhost:8000/openapi.json

  ---

@@ -1208,38 +519,6 @@ rm ~/.moyoyo-tts/data/tasks.db
  python app/main.py
  ```

- #### Training Fails Immediately
-
- **Symptom**: Training starts but fails within a few seconds.
-
- **Diagnosis**:
- ```bash
- # Check GPU availability
- python -c "import torch; print(torch.cuda.is_available())"
-
- # Check CUDA version
- python -c "import torch; print(torch.version.cuda)"
-
- # Check disk space
- df -h
- ```
-
- **Solutions**:
- 1. **No GPU**: the system falls back to the CPU (slower but functional)
- 2. **CUDA mismatch**: reinstall PyTorch with the correct CUDA version:
- ```bash
- # For CUDA 12.6
- uv sync --reinstall-package torch --reinstall-package torchaudio
-
- # For CUDA 12.8 (Windows)
- uv sync --reinstall-package torch --reinstall-package torchaudio --index pytorch-cu128
-
- # CPU only
- uv sync --reinstall-package torch --reinstall-package torchaudio --index pytorch-cpu
- ```
- 3. **Insufficient disk space**: free up at least 10GB
- 4. **Out of memory**: reduce `batch_size` in the training parameters
-
  #### Python Environment Issues

  **Symptom**: `ModuleNotFoundError` or import errors.
@@ -1344,431 +623,6 @@ nvm use 18
  %APPDATA%\tts-voice-app\logs\
  ```

- ### Common Errors
-
- #### "PYTHONPATH not set" Error
-
- **Symptom**: Import errors related to the `GPT_SoVITS` module.
-
- **Cause**: The API server needs to locate the main project directory.
-
- **Solution**: The API sets `PYTHONPATH` automatically, but verify:
- ```bash
- # Check the project structure
- ls GPT-SoVITS/ # Should contain *.py files
-
- # Set manually if needed
- export PYTHONPATH=/Users/coldish/workspace/GPT-SoVITS:$PYTHONPATH
- ```
-
- #### "Model not found" Error
-
- **Symptom**: Training fails with a "pretrained model not found" message.
-
- **Diagnosis**:
- ```bash
- # Check that the models exist
- ls GPT_SoVITS/pretrained_models/
- # Should show: s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt, s2G488k.pth, s2D488k.pth
- ```
-
- **Solution**: Download the pretrained models (see section 3.3):
- ```bash
- wget https://www.modelscope.cn/models/XXXXRT/GPT-SoVITS-Pretrained/resolve/master/pretrained_models.zip
- unzip -q -o pretrained_models.zip -d GPT_SoVITS
- ```
-
- #### "Out of memory" Error
-
- **Symptom**: Training crashes with `CUDA out of memory` or `MemoryError`.
-
- **Solutions**:
- 1. **Reduce the batch size**:
- ```json
- {
-   "batch_size": 2 // Reduced from 4 to 2
- }
- ```
-
- 2. **Close other applications**: free up GPU memory and RAM
-
- 3. **Use CPU mode**: slower, but uses system RAM instead of the GPU:
- ```bash
- # Set the environment variable
- export CUDA_VISIBLE_DEVICES=""
- python app/main.py
- ```
-
- 4. **Increase system swap space** (Linux):
- ```bash
- sudo dd if=/dev/zero of=/swapfile bs=1G count=8
- # Restrict permissions so only root can read the swap file
- sudo chmod 600 /swapfile
- sudo mkswap /swapfile
- sudo swapon /swapfile
- ```
-
- #### "NLTK Data Not Found" Error
-
- **Symptom**: Text processing fails with an NLTK data error.
-
- **Solution**: Download the NLTK data (see section 3.3):
- ```bash
- wget https://www.modelscope.cn/models/XXXXRT/GPT-SoVITS-Pretrained/resolve/master/nltk_data.zip
- unzip -q -o nltk_data.zip -d .venv/
- ```
-
- #### Audio Quality Issues
-
- **Symptom**: Generated audio sounds robotic, distorted, or unclear.
-
- **Solutions**:
- 1. **Use better training data**:
-   - High-quality audio (48kHz WAV preferred)
-   - Clear voice with minimal background noise
-   - 10-15 seconds of audio
-   - Natural, conversational speech
-
- 2. **Raise the training quality**:
- ```json
- {
-   "quality": "high" // Use high instead of standard
- }
- ```
-
- 3. **Train for longer**:
- ```json
- {
-   "total_epoch": 16 // Increased from 8 to 16
- }
- ```
-
- 4. **Check the reference audio**: make sure the uploaded audio is not corrupted
-
- ---
-
- ## Development
-
- ### Backend Development
-
- #### Running with Hot Reload
-
- Hot reload restarts the server automatically when code changes are detected:
-
- ```bash
- # Using uvicorn
- uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
-
- # With a custom reload directory
- uvicorn app.main:app --reload --reload-dir api_server/app
- ```
-
- #### Running Tests
-
- ```bash
- # Go to the project root
- cd GPT-SoVITS
-
- # Run all tests
- pytest api_server/tests/
-
- # Run a specific test file
- pytest api_server/tests/test_tasks.py
-
- # Run with a coverage report
- pytest --cov=api_server/app --cov-report=html
-
- # View the coverage report
- open htmlcov/index.html
- ```
-
- #### Code Formatting
-
- ```bash
- # Format Python code with Black
- black api_server/
-
- # Sort imports with isort
- isort api_server/
-
- # Lint with flake8
- flake8 api_server/
-
- # Type-check with mypy
- mypy api_server/
- ```
-
- #### Database Migrations
-
- ```bash
- # Generate a migration
- alembic revision --autogenerate -m "Add new column"
-
- # Apply migrations
- alembic upgrade head
-
- # Roll back a migration
- alembic downgrade -1
- ```
-
- #### Adding a New Endpoint
-
- 1. Create the route in `api_server/app/routes/`
- 2. Add the business logic in `api_server/app/services/`
- 3. Update the models in `api_server/app/models/`
- 4. Add tests in `api_server/tests/`
- 5. Update the OpenAPI documentation
-
- ### Frontend Development
-
- #### Development Mode
-
- Development mode enables Hot Module Replacement (HMR) for instant feedback:
-
- ```bash
- # Start the development server
- npm run dev
-
- # Start on a custom port
- npm run dev -- --port 5174
-
- # Start with debug logging
- DEBUG=electron* npm run dev
- ```
-
- #### Type Checking
-
- ```bash
- # Run Vue type checking
- npm run type-check
-
- # Run the TypeScript compiler check
- npx tsc --noEmit
-
- # Watch mode for continuous checking
- npm run type-check -- --watch
- ```
-
- #### Building for Production
-
- **Development build** (with source maps):
- ```bash
- npm run build
- ```
-
- **Production build** (optimized):
- ```bash
- npm run build:prod
- ```
-
- **Preview the production build**:
- ```bash
- npm run preview
- ```
-
- #### Building Distribution Packages
-
- Build platform-specific installers:
-
- **macOS**:
- ```bash
- npm run build:mac
- # Output: tts-voice-app/release/MoYoYo-TTS-1.0.0.dmg
- ```
-
- **Windows**:
- ```bash
- npm run build:win
- # Output: tts-voice-app/release/MoYoYo-TTS-Setup-1.0.0.exe
- ```
-
- **Linux**:
- ```bash
- npm run build:linux
- # Output: tts-voice-app/release/moyoyo-tts-1.0.0.AppImage
- ```
-
- **Build for all platforms** (requires platform-specific dependencies):
- ```bash
- npm run build:all
- ```
-
- **Build configuration**:
- Edit `tts-voice-app/electron-builder.yml` to customize:
- - App name and ID
- - Icon files
- - File associations
- - Auto-update settings
- - Code signing
-
- #### Component Development
-
- **Create a new component**:
- ```bash
- # Go to the components directory
- cd tts-voice-app/src/components
-
- # Create the component file
- touch MyComponent.vue
- ```
-
- **Component template**:
- ```vue
- <template>
-   <div class="my-component">
-     <!-- Template here -->
-   </div>
- </template>
-
- <script setup lang="ts">
- import { ref } from 'vue'
-
- // Component logic here
- const myValue = ref('')
- </script>
-
- <style scoped>
- .my-component {
-   /* Styles here */
- }
- </style>
- ```
-
- #### State Management
-
- The app uses the Vue Composition API with Pinia stores:
-
- ```typescript
- // Create a new store in src/stores/myStore.ts
- import { defineStore } from 'pinia'
-
- export const useMyStore = defineStore('myStore', {
-   state: () => ({
-     // Typed explicitly so TypeScript does not infer never[]
-     items: [] as string[]
-   }),
-   getters: {
-     itemCount: (state) => state.items.length
-   },
-   actions: {
-     addItem(item: string) {
-       this.items.push(item)
-     }
-   }
- })
- ```
-
- #### Debugging
-
- **Vue DevTools**:
- - Automatically enabled in development mode
- - Access via the browser DevTools panel
-
- **Electron DevTools**:
- ```bash
- # Open DevTools on startup
- DEBUG_ELECTRON=true npm run dev
- ```
-
- **Console logging**:
- ```typescript
- // Main process logs
- console.log('Main:', data)
-
- // Renderer process logs
- console.log('Renderer:', data)
-
- // Check logs in the terminal and the DevTools console
- ```
-
- #### Testing
-
- ```bash
- # Run unit tests
- npm run test
-
- # Run with coverage
- npm run test:coverage
-
- # Run E2E tests
- npm run test:e2e
-
- # Watch mode
- npm run test:watch
- ```
-
1697
- ### 项目结构
1698
-
1699
- ```
1700
- GPT-SoVITS/
1701
- ├── api_server/ # 后端 API
1702
- │ ├── app/
1703
- │ │ ├── main.py # FastAPI 应用
1704
- │ │ ├── routes/ # API 端点
1705
- │ │ ├── services/ # 业务逻辑
1706
- │ │ ├── models/ # 数据模型
1707
- │ │ └── utils/ # 实用工具
1708
- │ └── tests/ # 后端测试
1709
- ├── tts-voice-app/ # 前端 Electron 应用
1710
- │ ├── src/
1711
- │ │ ├── main/ # Electron 主进程
1712
- │ │ ├── renderer/ # Vue UI
1713
- │ │ ├── components/ # Vue 组件
1714
- │ │ └── stores/ # 状态管理
1715
- │ └── dist/ # 构建输出
1716
- ├── GPT_SoVITS/ # 核心 ML 模型
1717
- │ ├── pretrained_models/ # 基础模型
1718
- │ └── text/ # 文本处理
1719
- └── .env # 配置
1720
- ```
-
- ### Contributing Guidelines
-
- 1. **Fork and clone the repository**
- 2. **Create a feature branch**: `git checkout -b feature/my-feature`
- 3. **Make your changes** and add tests
- 4. **Run tests and linters**: `pytest && black . && isort .`
- 5. **Commit your changes**: `git commit -m "feat: add my feature"`
- 6. **Push the branch**: `git push origin feature/my-feature`
- 7. **Open a Pull Request** with a description
-
- **Commit message format**:
- - `feat:`: New feature
- - `fix:`: Bug fix
- - `docs:`: Documentation changes
- - `style:`: Code style changes
- - `refactor:`: Code refactoring
- - `test:`: Test changes
- - `chore:`: Build/tooling changes
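
The prefixes listed above resemble the Conventional Commits style, so a commit subject can be checked mechanically. The following is a minimal sketch for illustration only: the helper name `is_valid_commit_subject` and the rule that a non-empty summary must follow the prefix are assumptions, not part of this project's tooling.

```python
# Hypothetical helper; the project does not ship this check.
ALLOWED_PREFIXES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")


def is_valid_commit_subject(subject: str) -> bool:
    """Return True if the subject looks like '<prefix>: <summary>'."""
    prefix, separator, summary = subject.partition(":")
    # partition() leaves `separator` empty when the subject has no colon.
    return bool(separator) and prefix in ALLOWED_PREFIXES and bool(summary.strip())


print(is_valid_commit_subject("feat: add my feature"))  # True
print(is_valid_commit_subject("update stuff"))          # False
```

A check like this could run in a local Git hook, though whether the repository enforces the format anywhere is not stated in the guide.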
-
- ---
-
- ## Additional Resources
-
- ### Documentation
-
- - **API documentation**: http://localhost:8000/docs
- - **Design document**: `frontend_design.md`
- - **Development guide**: `development.md`
- - **OpenAPI specification**: `openapi.json`
-
- ### External Links
-
- - **GPT-SoVITS repository**: https://github.com/RVC-Boss/GPT-SoVITS
- - **ModelScope models**: https://www.modelscope.cn/models/XXXXRT/GPT-SoVITS-Pretrained
- - **FastAPI documentation**: https://fastapi.tiangolo.com
- - **Vue 3 documentation**: https://cn.vuejs.org
- - **Electron documentation**: https://www.electronjs.org
-
- ### Support
-
- For issues, questions, or feature requests:
- 1. Check this documentation first
- 2. Search existing GitHub issues
- 3. Create a new issue with a detailed description
- 4. Include error messages, logs, and system information
-
- ### License
-
- This project is licensed under the MIT License. See the `LICENSE` file for details.
-
  ---
 
  **Last updated**: 2026-01-23
 
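  - [Running the Application](#running-the-application)
  - [Start Backend API Server](#51-start-backend-api-server)
  - [Start Frontend Electron App](#52-start-frontend-electron-app)
+ - [First-Time Setup](#first-time-setup)
+ - [Feature Overview](#feature-overview)
  - [Troubleshooting](#troubleshooting)
 
  ---
 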
 
 
  ---
 
+ ## First-Time Setup
 
  When you first launch the Electron app, you'll need to download the required models.
 
 
  - Verify that you have about 10 GB of free disk space
  - For manual installation, see section 3.3
  ---
 
+ ## Feature Overview
+
+ MoYoYo.tts provides powerful voice cloning and text-to-speech features through an intuitive interface:
+
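+ **Quick Mode** gives beginners a simplified one-click workflow. Upload a 5-30 second audio sample, choose a quality preset (fast/standard/high), and start training. The system handles every pipeline stage automatically, including audio processing, speech recognition, feature extraction, and model training. Within 10-40 minutes you have a custom voice ready for text-to-speech generation.
+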
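
The Quick Mode constraints described above (a 5-30 second sample and one of three quality presets) can be expressed as a small pre-flight check. This is a sketch under stated assumptions: the function name, the constants, and the error messages are hypothetical, and the guide does not say how the application itself validates input.

```python
# Hypothetical pre-flight check; names and messages are illustrative assumptions.
QUALITY_PRESETS = ("fast", "standard", "high")
MIN_SAMPLE_SECONDS = 5.0
MAX_SAMPLE_SECONDS = 30.0


def validate_quick_mode_request(duration_seconds: float, preset: str) -> list[str]:
    """Return a list of problems; an empty list means the request looks acceptable."""
    problems = []
    if not MIN_SAMPLE_SECONDS <= duration_seconds <= MAX_SAMPLE_SECONDS:
        problems.append(
            f"sample is {duration_seconds:.1f}s; expected "
            f"{MIN_SAMPLE_SECONDS:.0f}-{MAX_SAMPLE_SECONDS:.0f}s"
        )
    if preset not in QUALITY_PRESETS:
        problems.append(f"unknown preset {preset!r}; expected one of {QUALITY_PRESETS}")
    return problems


print(validate_quick_mode_request(12.0, "standard"))  # []
```

Returning a list rather than raising on the first failure lets a caller report every problem at once.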
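+ **Advanced Mode** gives experienced users fine-grained control over each stage of the training pipeline. You can tune audio slicing parameters, choose between ASR models (DamoASR for Chinese, Faster Whisper for multilingual audio), adjust the number of training epochs and the batch size, and monitor detailed progress for each stage. This mode is well suited to optimizing quality or handling specific audio characteristics.
+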
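+ **Text-to-Speech Generation** converts text into natural-sounding speech using any trained voice. Adjust the speaking speed (0.5x-2.0x), choose an emotional tone where supported, and generate high-quality audio output within seconds. The system supports multiple languages and provides real-time audio playback and download.
+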
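
The speaking-speed range mentioned above (0.5x-2.0x) can be enforced on the client side before a request is sent. A minimal sketch, assuming out-of-range values should be clamped rather than rejected; the helper name `clamp_speed` is hypothetical and the guide does not describe how the application treats such values.

```python
# Hypothetical helper; clamping (instead of rejecting) is an assumption.
MIN_SPEED = 0.5
MAX_SPEED = 2.0


def clamp_speed(speed: float) -> float:
    """Limit a requested speed multiplier to the documented 0.5x-2.0x range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


print(clamp_speed(3.0))   # 2.0
print(clamp_speed(1.25))  # 1.25
```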
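+ **Voice Library Management** keeps all trained voices in one place. Browse, search, and filter voices by language or quality. Preview any voice with sample audio, export models for backup or sharing, and manage your voice collection efficiently.
+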
+ For detailed API documentation and advanced usage, open the interactive Swagger UI while the backend server is running: **http://localhost:8000/docs**.
 
  ---
 
 
  python app/main.py
  ```
 
  #### Python Environment Issues
 
  **Symptom**: `ModuleNotFoundError` or import errors.
 
  %APPDATA%\tts-voice-app\logs\
  ```
 
  ---
 
  **Last updated**: 2026-01-23