Update README.md
README.md
CHANGED
|
@@ -51,4 +51,114 @@ The community is actively seeking institutional and technical stakeholders to as
|
|
| 51 |
|
| 52 |
Interested parties may reach out via the Hugging Face discussion board or review our current repository of model weights and datasets.
|
| 53 |
|
| 54 |
-
</details>
| 51 |
|
| 52 |
Interested parties may reach out via the Hugging Face discussion board or review our current repository of model weights and datasets.
|
| 53 |
|
| 54 |
+
</details>
|
| 55 |
+
|
| 56 |
+
<details>
|
| 57 |
+
<summary><b>TL;DR Summary:</b> Phase 1 is underway, but building a high-fidelity "Teacher" model for Philippine languages using Llama 3.1 and machine-translated Alpaca data is currently bottlenecked. Llama 3.1's inherent English-centric bias combined with syntactically flawed, machine-translated training data creates a compounding error loop. This results in grammatical corruption, language mixing, and severe hallucinations rather than true Neural Machine Translation (NMT) parity. A reliable teacher model is still a long way off; we must pivot away from machine-translated shortcuts and invest in human-curated, native-first datasets before progressing to knowledge distillation.</summary>
|
| 58 |
+
|
| 59 |
+
<br>
|
| 60 |
+
|
| 61 |
+
# PROGRESS REPORT: Phase 1 Foundation Model Alignment
|
| 62 |
+
|
| 63 |
+
**Organization:** Philippine Languages Translation and AI Training Community
|
| 64 |
+
**Project Phase:** Phase 1 - Foundation Model Alignment and NMT Parity
|
| 65 |
+
**Target Languages:** Tagalog, Cebuano, Ilocano, Hiligaynon, Bicolano, Waray, Kapampangan, and Pangasinan
|
| 66 |
+
**Date:** [Current Date]
|
| 67 |
+
|
| 68 |
+
---
|
| 69 |
+
|
| 70 |
+
### **1. Executive Summary**
|
| 71 |
+
In pursuit of bridging the digital divide for low-resource Philippine languages, our team has initiated Phase 1 of our technical roadmap. The primary objective is to apply Supervised Fine-Tuning (SFT) with QLoRA to large-scale transformer models (the Llama 3.1 series) to achieve Neural Machine Translation (NMT) parity, meaning translation quality on par with commercial systems.
|
| 72 |
+
|
| 73 |
+
While the QLoRA and SFT infrastructure is fully operational, initial evaluations reveal critical bottlenecks. Specifically, the inherent limitations of the Llama 3.1 base model, combined with the use of machine-translated instruction datasets (such as the Alpaca dataset), have severely hindered our progress toward building a reliable "Teacher" model.
|
| 74 |
+
|
| 75 |
+
### **2. Current Progress**
|
| 76 |
+
* **Infrastructure:** QLoRA and SFT pipelines have been successfully implemented.
|
| 77 |
+
* **Model Selection:** Llama 3.1 has been established as the primary foundational architecture.
|
| 78 |
+
* **Data Ingestion:** Initial fine-tuning has commenced using machine-translated versions of the Alpaca instruction-tuning dataset across the eight major regional languages.
|
| 79 |
+
|
| 80 |
+
### **3. Key Challenges & Technical Limitations**
|
| 81 |
+
|
| 82 |
+
Despite successful pipeline implementation, high-fidelity translation and complex instruction following are currently compromised by two compounding factors (A and B below) and by the error loop that their interaction produces (C):
|
| 83 |
+
|
| 84 |
+
#### **A. Inherent Limitations of the Llama 3.1 Architecture in Low-Resource Contexts**
|
| 85 |
+
While Llama 3.1 is a highly capable state-of-the-art model, its pre-training corpus is overwhelmingly English-centric.
|
| 86 |
+
* **Tokenization Inefficiency:** The Llama 3.1 tokenizer is not optimized for the agglutinative and morphologically rich nature of Philippine languages (which rely heavily on complex prefixes, infixes, and suffixes). As a result, native words are fractured into an excessive number of tokens. This degrades the model's context window efficiency and severely impairs its syntactic reasoning.
|
| 87 |
+
* **Latent Space Bias:** Because the model's foundational weights lack deep representations of Philippine linguistics, it defaults to English-language logic and syntax, applying it unnaturally to the regional languages.
|
| 88 |
+
|
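The tokenization inefficiency described above can be quantified as token fertility: the average number of tokens produced per whitespace-separated word. The following is a minimal sketch, not the project's evaluation code. `char_bigram_tokenize` is a hypothetical stand-in tokenizer so the example runs without any downloads, and the Tagalog sentence is illustrative only. To measure Llama 3.1 itself, one would pass that model's tokenizer function as the `tokenize` argument instead (an assumption about how it would be wired in).

```python
from typing import Callable, List


def char_bigram_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits each word into 2-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


# An affix-heavy Tagalog sentence and a rough English equivalent (illustrative).
tagalog = "Nakikipagkuwentuhan ang mga bata"
english = "The children are chatting"

print(f"Tagalog fertility: {fertility(tagalog, char_bigram_tokenize):.2f}")
print(f"English fertility: {fertility(english, char_bigram_tokenize):.2f}")
```

A higher fertility means the same sentence consumes more of the context window, so comparing the two numbers per language gives a simple, repeatable measure of how much a tokenizer penalizes each target language.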
| 89 |
+
#### **B. The Flaws of Machine-Translated Datasets (Alpaca)**
|
| 90 |
+
To rapidly generate instruction-tuning data, the English Alpaca dataset was machine-translated into the target Philippine languages. From a multilingual perspective, this approach is highly problematic:
|
| 91 |
+
* **Loss of Nuance and Context:** Machine translation of the Alpaca dataset frequently results in literal, word-for-word translations. It fails to capture the cultural context, idioms, or appropriate levels of formality inherent to Philippine languages.
|
| 92 |
+
* **Syntactic Corruption:** Standard MT tools struggle with the Verb-Subject-Object (VSO) sentence structure common in Philippine languages, often forcing them into the English Subject-Verb-Object (SVO) structure. This trains the model on grammatically incorrect, "robotic" phrasing.
|
| 93 |
+
|
| 94 |
+
#### **C. The Hallucination Loop (Multilingual Parity Failure)**
|
| 95 |
+
When a fundamentally English-biased model (Llama 3.1) is fine-tuned using flawed, machine-translated training data, the result is a compounding error loop.
|
| 96 |
+
* Because the model lacks a strong native understanding of languages like Ilocano or Hiligaynon, it accepts the syntactically flawed machine-translated Alpaca data as absolute "ground truth."
|
| 97 |
+
* When prompted, the model attempts to map this flawed training data against its English-centric weights. This directly leads to **severe hallucinations**.
|
| 98 |
+
* Instead of achieving NMT parity, the model frequently generates outputs that mix languages (e.g., confusing Tagalog and Cebuano vocabularies), fabricates non-existent words (morphological hallucinations), or confidently outputs nonsensical, direct translations of English idioms that hold no meaning to native speakers.
|
| 99 |
+
|
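The vocabulary confusion between Tagalog and Cebuano described above can be flagged automatically with a simple lexical check. The sketch below is a rough heuristic under stated assumptions: the `MARKERS` word lists are tiny, illustrative samples of common function words, not validated lexicons, and the function names are hypothetical. A production check would need word lists reviewed by native speakers.

```python
from collections import Counter
from typing import Dict, Set

# Tiny, illustrative marker-word lists (assumptions, not validated lexicons).
MARKERS: Dict[str, Set[str]] = {
    "tagalog": {"ng", "ay", "hindi", "at", "ito", "ano"},
    "cebuano": {"ug", "dili", "nga", "kini", "unsa", "kaayo"},
}


def marker_counts(text: str) -> Counter:
    """Count how many marker words of each language appear in the text."""
    counts: Counter = Counter()
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'()")
        for language, markers in MARKERS.items():
            if word in markers:
                counts[language] += 1
    return counts


def is_mixed(text: str, min_hits: int = 1) -> bool:
    """Flag text containing marker words from more than one language."""
    counts = marker_counts(text)
    return sum(1 for hits in counts.values() if hits >= min_hits) > 1


print(is_mixed("Hindi ito ang bahay ng guro."))    # Tagalog markers only
print(is_mixed("Hindi kini ang balay ug bahay."))  # Tagalog and Cebuano markers
```

Outputs flagged this way are candidates for human review rather than automatic rejection, since some words are legitimately shared across Philippine languages.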
| 100 |
+
### **4. Implications for the "Teacher Model"**
|
| 101 |
+
The ultimate milestone for Phase 1 is the creation of a **validated "Teacher" model** capable of high-fidelity translation, which will serve as the performance baseline for knowledge distillation into smaller, efficient, edge-ready models.
|
| 102 |
+
|
| 103 |
+
**Based on current data, there is still a long way to go to achieve this milestone.** A teacher model trained on machine-translated data propagates and amplifies grammatical errors and hallucinations. If we proceed to distillation using the current iteration of the model, we will inevitably pass these hallucinations down to the smaller student models, rendering them unusable for actual native speakers and defeating the purpose of edge deployment.
|
| 104 |
+
|
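The reason a flawed teacher contaminates its students follows from how distillation losses are commonly defined. The sketch below uses the widely used temperature-scaled KL-divergence formulation, which may differ from the loss this project eventually adopts. The logits are hypothetical values for three candidate words, where index 0 stands for the correct word and index 1 for a fabricated one.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """Soft-target loss: how far the student is from the teacher's distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


# Hypothetical logits: the teacher prefers the fabricated word (index 1).
teacher = [1.0, 3.0, 0.5]
student_correct = [3.0, 1.0, 0.5]   # prefers the correct word
student_copying = [1.0, 3.0, 0.5]   # reproduces the teacher's mistake

print(distillation_loss(teacher, student_correct))
print(distillation_loss(teacher, student_copying))
```

The loss is lowest when the student reproduces the teacher's distribution, so a student that prefers the correct word is penalized whenever the teacher prefers the fabricated one.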
| 105 |
+
### **5. Strategic Recommendations & Next Steps**
|
| 106 |
+
To achieve true NMT parity and develop a high-fidelity Teacher model, we must pivot our data strategy:
|
| 107 |
+
1. **Move Beyond Machine Translation:** We must phase out reliance on purely machine-translated datasets like Alpaca.
|
| 108 |
+
2. **Human-in-the-Loop (HITL) Curation:** Invest in community-driven, native-speaker verification of parallel corpora to ensure morphological and syntactic accuracy.
|
| 109 |
+
3. **Native Prompt Generation:** Shift toward instruction-tuning datasets that are originally authored in the target languages, rather than translating English instructions.
|
| 110 |
+
4. **Vocabulary Expansion:** Investigate vocabulary expansion and embedding initialization techniques for Llama 3.1 to improve tokenization efficiency for Philippine languages prior to SFT.
|
| 111 |
+
|
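For the vocabulary expansion item above, one common embedding initialization technique is to set each new token's vector to the mean of the vectors of the sub-tokens it replaces. The sketch below illustrates that idea on a toy table. It is an assumption about one possible approach, not the project's chosen method, and the token pieces and 2-dimensional values are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean_init(subtokens: List[str], embeddings: Dict[str, Vector]) -> Vector:
    """Initialize a new token's embedding as the mean of its sub-token embeddings."""
    if not subtokens:
        raise ValueError("a new token needs at least one existing sub-token")
    vectors = [embeddings[token] for token in subtokens]
    dim = len(vectors[0])
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]


# Toy 2-dimensional embedding table (hypothetical pieces and values).
embeddings: Dict[str, Vector] = {
    "nag": [1.0, 0.0],
    "lu": [0.0, 2.0],
    "to": [2.0, 4.0],
}

# Add the whole word "nagluto" ("cooked") as a single new token.
embeddings["nagluto"] = mean_init(["nag", "lu", "to"], embeddings)
print(embeddings["nagluto"])
```

Starting from the mean keeps the new token close to what the model already associates with its pieces, which tends to be a gentler starting point for SFT than random initialization.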
| 112 |
+
**Conclusion:**
|
| 113 |
+
Building high-performance NLP architectures for Philippine languages cannot rely on shortcuts like machine-translated datasets layered over English-centric models. Achieving the high-fidelity teacher model required for our roadmap demands a rigorous, culturally accurate, and native-first approach to data curation.
|
| 114 |
+
|
| 115 |
+
</details>
|
| 116 |
+
|
| 117 |
+
<details>
|
| 118 |
+
<summary><b>TL;DR Summary:</b> In response to the hallucination loop caused by machine-translated training data, stakeholders have pivoted towards authentic, native-first dataset curation. To facilitate this, we have developed the PLTAT App—an all-in-one "Swiss Army knife" platform for crowdsourcing the translation, generation, evaluation, and correction of NLP datasets. Because building a high-fidelity teacher model is a long-term, iterative process, we are actively seeking institutional stakeholders (universities, government agencies) to sustain this effort. Technical resources, including the PLTAT Chat App and our Ollama Colab Server Notebook, are now live for community testing.</summary>
|
| 119 |
+
|
| 120 |
+
<br>
|
| 121 |
+
|
| 122 |
+
# SOLUTION DOCUMENT: Crowdsourced Authentic Dataset Generation Strategy
|
| 123 |
+
|
| 124 |
+
**Organization:** Philippine Languages Translation and AI Training Community (PLTAT)
|
| 125 |
+
**Project Phase:** Phase 1.5 - Authentic Data Remediation & HITL Integration
|
| 126 |
+
**Date:** [Current Date]
|
| 127 |
+
|
| 128 |
+
---
|
| 129 |
+
|
| 130 |
+
### **1. Strategic Pivot: The Need for Authentic Datasets**
|
| 131 |
+
Following the findings of the Phase 1 Progress Report, project stakeholders convened to address the severe limitations of fine-tuning the Llama 3.1 architecture using machine-translated data (e.g., the Alpaca dataset). They unanimously agreed that relying on automated translation pipelines results in syntactic corruption and severe hallucinations, preventing us from achieving Neural Machine Translation (NMT) parity.
|
| 132 |
+
|
| 133 |
+
**The Resolution:** We are shifting our methodology from automated data ingestion to **authentic, human-verified dataset curation**. To train a high-fidelity "Teacher" model capable of accurate knowledge distillation, the model must be trained on high-quality, native text authored and verified by fluent speakers of Tagalog, Cebuano, Ilocano, Hiligaynon, Bicolano, Waray, Kapampangan, and Pangasinan.
|
| 134 |
+
|
| 135 |
+
### **2. The Solution: The PLTAT App ("Swiss Army Knife" of Data Curation)**
|
| 136 |
+
To solve the logistical challenge of building native datasets across eight languages, we have developed the **PLTAT App**. Designed as an all-in-one, crowdsourcing "Swiss Army knife," this platform empowers native speakers, linguists, and AI enthusiasts to actively participate in the model-training pipeline.
|
| 137 |
+
|
| 138 |
+
The PLTAT App features four core modules:
|
| 139 |
+
* **Translate:** Allows contributors to perform Human-in-the-Loop (HITL) translations of high-value English datasets, ensuring cultural nuance, correct verb-subject-object (VSO) sentence structures, and appropriate morphological alignments are maintained.
|
| 140 |
+
* **Generate:** Enables users to natively author brand new instruction-tuning prompts and responses directly in their regional languages, bypassing the English-translation bias entirely.
|
| 141 |
+
* **Evaluate:** A rating system where contributors can test the current iteration of the fine-tuned Llama 3.1 model, scoring its outputs for fluency, accuracy, and logic. These ratings lay the groundwork for future Reinforcement Learning from Human Feedback (RLHF).
|
| 142 |
+
* **Correct:** Provides a direct interface for users to edit and fix model hallucinations, language mixing, or grammatical errors. These corrections are automatically fed back into the dataset repository for the next training run.
|
| 143 |
+
|
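To make the four modules concrete, the sketch below shows one way a contribution could be represented and filtered before it is fed back into the training data. The `Contribution` schema, the 1-5 rating scale, and the acceptance threshold are all hypothetical; the PLTAT App's actual data model is not described in this document.

```python
import json
from dataclasses import asdict, dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class Contribution:
    module: str      # "translate", "generate", "evaluate", or "correct"
    language: str    # e.g. "tagalog"
    prompt: str
    response: str
    # Hypothetical 1-5 ratings keyed by criterion (fluency, accuracy, logic).
    scores: Dict[str, int] = field(default_factory=dict)


def accepted(records: List[Contribution], threshold: float = 4.0) -> List[Contribution]:
    """Keep records whose mean rating meets the threshold; unrated records are held back."""
    return [
        record for record in records
        if record.scores and mean(list(record.scores.values())) >= threshold
    ]


def to_jsonl(records: List[Contribution]) -> str:
    """Serialize records as JSON Lines, preserving non-ASCII characters."""
    return "\n".join(json.dumps(asdict(record), ensure_ascii=False) for record in records)


records = [
    Contribution("translate", "tagalog", "Isalin sa Ingles: Magandang umaga.",
                 "Good morning.", {"fluency": 5, "accuracy": 5, "logic": 4}),
    Contribution("correct", "tagalog", "Isalin sa Ingles: Magandang umaga.",
                 "Beautiful morning is.", {"fluency": 2, "accuracy": 2, "logic": 3}),
]

print(to_jsonl(accepted(records)))
```

Keeping the module name and per-criterion scores on every record would let later training runs filter by source and quality instead of treating all crowdsourced data as equally reliable.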
| 144 |
+
### **3. Long-Term Vision & Institutional Stakeholder Engagement**
|
| 145 |
+
Building an NMT-parity teacher model is not a problem that can be solved in a single fine-tuning run. It requires a continuous, iterative cycle of model training, evaluation, and data refinement.
|
| 146 |
+
|
| 147 |
+
While community crowdsourcing is the engine of the PLTAT App, **this is a long-term effort that requires the backing of institutional stakeholders.**
|
| 148 |
+
* We are calling upon academic institutions (linguistics and computer science departments), government bodies (such as DOST and the Komisyon sa Wikang Filipino), and regional NGOs to partner with PLTAT.
|
| 149 |
+
* Institutional backing will provide the necessary oversight for linguistic validation, ensure long-term computing resource sustainability, and help integrate these open-source models into public-serving digital infrastructure.
|
| 150 |
+
|
| 151 |
+
### **4. Technical Resources & Access**
|
| 152 |
+
To democratize access to our current progress and facilitate immediate community contribution, we have deployed the following resources. The models currently hosted represent our baseline and will be iteratively updated as the PLTAT App generates higher-quality, authentic datasets.
|
| 153 |
+
|
| 154 |
+
* 🔗 **PLTAT Chat App:** [https://app.pltat-community-placeholder.io](https://app.pltat-community-placeholder.io)
|
| 155 |
+
*(Access the unified platform to chat with the current models, translate datasets, generate new prompts, and submit corrections.)*
|
| 156 |
+
|
| 157 |
+
* 🔗 **PLTAT Ollama Colab Server:** [https://colab.research.google.com/drive/placeholder-pltat-ollama-server](https://colab.research.google.com/drive/placeholder-pltat-ollama-server)
|
| 158 |
+
*(For developers and researchers: A ready-to-run Jupyter/Colab notebook hosting the quantized PLTAT models via Ollama. Use this to run inference testing and API integrations on your own instances.)*
|
| 159 |
+
|
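For API integrations against a running Ollama server, a request to Ollama's `/api/generate` endpoint can be sketched with the Python standard library alone. This is a minimal example under stated assumptions: the base URL uses Ollama's default local port, and the model name `pltat-tagalog` is a hypothetical placeholder, since the actual model names served by the notebook are not listed here.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]


# Example call (requires a running server; the model name is a placeholder):
# print(generate("http://localhost:11434", "pltat-tagalog",
#                "Isalin sa Tagalog: Good morning."))
```

Separating request construction from sending makes the payload easy to inspect or log before any network call is made.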
| 160 |
+
**Conclusion:**
|
| 161 |
+
By combining the raw power of the Llama 3.1 architecture with the linguistic authenticity of crowdsourced, human-verified data via the PLTAT App, we are establishing a sustainable, long-term pathway toward true AI parity for Philippine languages.
|
| 162 |
+
|
| 163 |
+
</details>
|
| 164 |
+
|