YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Synthetic Translation Dataset Generator
A professional, clean-code implementation for generating synthetic translation datasets using LLMs.
Project Structure
synthetic_projects/
βββ src/
β βββ core/ # Shared core utilities
β β βββ config.py # Configuration management
β β βββ models.py # Data models
β β βββ llm_client.py # LLM API client
β β βββ worker_pool.py # Multiprocessing worker pool
β βββ asr_translation/ # ASR -> English translation
β β βββ prompts.py # Translation prompts
β β βββ models.py # Data models
β β βββ processor.py # Data processing logic
β β βββ runner.py # Main runner script
β βββ chat_translation/ # Chat -> English translation + moderation
β βββ prompts.py # Translation & moderation prompts
β βββ models.py # Data models
β βββ processor.py # Data processing logic
β βββ runner.py # Main runner script
βββ tests/ # Unit tests
βββ scripts/ # Background execution scripts
βββ configs/ # Configuration files
Features
- Clean Architecture: Separation of concerns with modular design
- Type-Safe: Full type hints with Pydantic models
- Configurable: YAML/ENV-based configuration
- Efficient: Multi-CPU processing with dynamic batching
- Resilient: Retry logic and error handling
- Testable: Comprehensive test coverage
- Maintainable: Well-documented, easy to extend
Sub-Projects
1. ASR Translation
Translates Vietnamese ASR transcriptions to well-written English text.
Input: Raw ASR transcriptions (Vietnamese, unnormalized)
Output: Clean, well-organized English translations
2. Chat Translation
Translates Vietnamese chat messages to formal English with content moderation.
Input: Vietnamese chat messages
Output:
- Formal English translation
- Political compliance metadata (Vietnam laws)
Quick Start
Installation
pip install -r requirements.txt
Configuration
Create .env file:
VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL=Qwen/Qwen3-Next-80B-A3B-Instruct
MAX_WORKERS=8
BATCH_SIZE=32
Run ASR Translation
python -m src.asr_translation.runner \
--input translation_for_asr/telephone2000h.txt \
--output outputs/asr_translated.jsonl \
--num-workers 8
Run Chat Translation
python -m src.chat_translation.runner \
--dataset tarudesu/VOZ-HSD \
--output outputs/chat_translated.jsonl \
--num-workers 8
Background Execution
# ASR translation in background
nohup bash scripts/run_asr_translation.sh > logs/asr.log 2>&1 &
# Chat translation in background
nohup bash scripts/run_chat_translation.sh > logs/chat.log 2>&1 &
Testing
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html
License
MIT
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support