clemsail committed on
Commit e124225 · verified · Parent: c6d6fad

Refresh model card: license chain + DISCLOSURE banner v2

Files changed (1):
  1. README.md +62 -209
README.md CHANGED
@@ -1,240 +1,93 @@
  ---
  license: apache-2.0
  base_model: mistralai/Devstral-Small-2-24B-Instruct-2512
  tags:
- - lora
- - peft
- - mlx
- - ailiance
- - ailiance
- - eu-ai-act
- - art-52
- - art-53
- - gpai-fine-tune
- - pst-2025-07-24
  language:
- - en
- library_name: peft
- ---
-
- # devstral-cpp-lora
-
- LoRA adapter for **mistralai/Devstral-Small-2-24B-Instruct-2512**, part of the [ailiance](https://github.com/ailiance/ailiance) project. Live demo: https://www.ailiance.fr.
-
- > **EU AI Act compliance.** This card follows the **European Commission's
- > *Template for the Public Summary of Training Content* for general-purpose
- > AI models** (Art. 53(1)(d) of Regulation (EU) 2024/1689, published by the
- > AI Office on 2025-07-24). Section numbering and field labels reproduce
- > the official template. Where this card and the official template differ
- > in wording, the **official template wins** — see the
- > [AI Office page](https://digital-strategy.ec.europa.eu/en/library/explanatory-notice-and-template-public-summary-training-content-general-purpose-ai-models).
-
- ---
-
- # 1. General information
-
- ## 1.1. Provider identification
-
- | Field | Value |
- |---|---|
- | **Provider name and contact details** | Ailiance (Saillant Clément) — `clemsail` on Hugging Face — Issues: https://github.com/ailiance/ailiance/issues |
- | **Authorised representative name and contact details** | Not applicable — provider is established within the European Union (France). |
-
- ## 1.2. Model identification
-
- | Field | Value |
- |---|---|
- | **Versioned model name(s)** | `Ailiance-fr/devstral-cpp-lora` (this LoRA adapter, v0.4.2) |
- | **Model dependencies** | This is a **fine-tune (LoRA, rank 16)** of the general-purpose AI model [`mistralai/Devstral-Small-2-24B-Instruct-2512`](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512). Refer to the base-model provider's PST for the underlying training summary. |
- | **Date of placement of the model on the Union market** | 2026-05-06 |
-
- ## 1.3. Modalities, overall training data size and other characteristics
-
- | Field | Value |
- |---|---|
- | **Modality** | ☒ Text ☐ Image ☐ Audio ☐ Video ☐ Other |
- | **Training data size** (text bucket) | ☒ Less than 1 billion tokens ☐ 1 billion to 10 trillion tokens ☐ More than 10 trillion tokens |
- | **Types of content** | Instruction-tuning pairs, technical text, source code, multilingual instruction templates (EU official languages where applicable). |
- | **Approximate size in alternative units** | ≈ 0.6 M tokens (2 850 rows × ≈ 200 tokens/row). |
- | **Latest date of data acquisition / collection for model training** | 10/2025 (last commit on scraped repos). The model is **not** continuously trained on new data after this date. |
- | **Linguistic characteristics of the overall training data** | English (technical instruction language). No other natural languages. |
- | **Other relevant characteristics / additional comments** | LoRA fine-tune (rank 16, alpha 32, dropout 0.05); only attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) are trained. Per-record `_provenance` (source, SPDX licence, `record_idx`, `access_date`) attached at the system level (see [`docs/eu-ai-act-transparency.md`](https://github.com/ailiance/ailiance/blob/main/docs/eu-ai-act-transparency.md) §4.4). Tokenizer: inherited from the base model. |
-
- ---
-
- # 2. List of data sources
-
- ## 2.1. Publicly available datasets
-
- **Have you used publicly available datasets to train the model?** ☒ Yes ☐ No
-
- **Modality(ies) of the content covered:** ☒ Text ☐ Image ☐ Video ☐ Audio ☐ Other
-
- **List of large publicly available datasets:**
-
- | Dataset | URL | SPDX licence | Records | Notes |
- |---|---|---|---:|---|
- | CommitPackFT (C/C++ subset) | https://huggingface.co/datasets/bigcode/commitpackft | `MIT` | 1,500 | Public HF dataset; real-world commit message + diff pairs. |
-
- ## 2.2. Private non-publicly available datasets obtained from third parties
-
- ### 2.2.1. Datasets commercially licensed by rightsholders or their representatives
-
- **Have you concluded transactional commercial licensing agreement(s) with rightsholder(s) or with their representatives?** ☐ Yes ☒ No
-
- _(N/A — no commercial licensing agreements concluded.)_
-
- ### 2.2.2. Private datasets obtained from other third parties
-
- **Have you obtained private datasets from third parties that are not licensed as described in Section 2.2.1?** ☐ Yes ☒ No
-
- _(N/A — no private third-party datasets obtained.)_
-
- ## 2.3. Data crawled and scraped from online sources
-
- **Were crawlers used by the provider or on its behalf?** ☒ Yes ☐ No
-
- **Crawler name(s) / identifier(s):** custom `huggingface_hub` + `requests` Python collectors operated by the provider.
-
- **Purposes of the crawler(s):** Acquire authoritative vendor reference code for technical training (firmware examples, EDA libraries).
-
- **General description of crawler behaviour:** Respects `robots.txt`, `meta robots noai`, `ai.txt`, and TDM-Reservation headers. Low QPS (≤ 1 req/s). Authenticated GitHub API where available. Captchas, password-protected pages, and paywalls are not bypassed.
-
- **Period of data collection:** Mixed; per-source `access_date` fields logged. Latest collection date: 10/2025.
-
- **Comprehensive description of the type of content and online sources crawled:** Three official vendor repositories scraped via the authenticated GitHub API at low QPS. `robots.txt` and rate limits respected. Per-source SHA-256 manifest in `data/scraped/<source>/manifest.json`. Compliant with the EU DSM Directive Art. 4 TDM exception.
-
- **Type of modality covered:** ☒ Text ☐ Image ☐ Video ☐ Audio ☐ Other
-
- **Summary of the most relevant domain names crawled (top 5 % / max 1 000 — SME provider):**
-
- - `https://github.com` — github.com (espressif/esp-idf, STMicroelectronics/STM32CubeF4, arduino/Arduino) (SPDX: `Apache-2.0 (ESP-IDF) / BSD-3-Clause (STM32Cube) / CC0-1.0 (Arduino)`, ≈ 1,350 records)
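
The pre-collection gate described above can be sketched in Python. `AGENT` is a hypothetical user-agent name for the collectors (not from the card), the signals are parsed from an already-fetched response, and a real collector would additionally fetch `ai.txt` and throttle to ≤ 1 req/s:

```python
import re
from urllib.robotparser import RobotFileParser

AGENT = "ailiance-collector"  # hypothetical UA string for the collectors above


def tdm_reserved(headers: dict, html: str) -> bool:
    """True if the response signals a rights reservation: a TDM-Reservation
    header set to "1", or a <meta name="robots"> tag containing `noai`."""
    if str(headers.get("tdm-reservation", "")).strip() == "1":
        return True
    for tag in re.findall(r"<meta[^>]*>", html, flags=re.IGNORECASE):
        if re.search(r'name=["\']robots["\']', tag, re.IGNORECASE) and "noai" in tag.lower():
            return True
    return False


def may_collect(url: str, robots_txt: str, headers: dict, html: str) -> bool:
    """Combine robots.txt permission with the TDM-reservation signals."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(AGENT, url) and not tdm_reserved(headers, html)
```

Any source for which `may_collect` returns `False` would be excluded before a single file is stored, matching the Article 4(3) exclusion described in §3.1.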
-
- ## 2.4. User data
-
- **Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model?** ☐ Yes ☒ No
-
- **Was data collected from user interactions with the provider's other services or products used to train the model?** ☐ Yes ☒ No
-
- _(N/A — no user data collected from any provider service or AI-model interaction is used to train this LoRA.)_
-
- ## 2.5. Synthetic data
-
- **Was synthetic AI-generated data created by the provider or on their behalf to train the model?** ☐ Yes ☒ No
-
- _(N/A — no synthetic AI-generated data was created by the provider or on their behalf to train this LoRA.)_
-
- ## 2.6. Other sources of data
-
- **Have data sources other than those described in Sections 2.1 to 2.5 been used to train the model?** ☐ Yes ☒ No
-
- _(N/A — no other data sources used.)_
-
  ---

- # 3. Data processing aspects
-
- ## 3.1. Respect of reservation of rights from text and data mining exception or limitation
-
- **Are you a Signatory to the Code of Practice for general-purpose AI models that includes commitments to respect reservations of rights from the TDM exception or limitation?** ☐ Yes ☒ No *(SME / individual provider; commitments equivalent in substance, see below.)*
-
- **Measures implemented before model training to respect reservations of rights from the TDM exception or limitation:**

- - **Public HF datasets (§2.1):** all carry permissive open licences (Apache-2.0, MIT, CC-BY-*, BSD); SPDX matrix verified per-source. The licences explicitly authorise instructional / model-training use for the rows actually selected.
- - **Web-scraped sources (§2.3):** prior to collection the provider verified `robots.txt`, `<meta name="robots" content="noai">`, `ai.txt`, and TDM-Reservation HTTP headers. Any source returning a reservation under Article 4(3) of Directive (EU) 2019/790 was excluded from collection. Scraping was limited to authoritative vendor-controlled repositories (ESP-IDF, STM32Cube, Arduino, KiCad symbols/footprints) operating under permissive licences.
- - **Vendor PDF datasheets (§2.2.2 where present):** processed under the EU DSM Directive Article 4 TDM exception. SHA-256 manifests and per-source legal-basis records are published in [`docs/pdf-compliance-report.md`](https://github.com/ailiance/ailiance/blob/main/docs/pdf-compliance-report.md).
- - **Public copyright policy (Art. 53(1)(c)):** [`docs/eu-ai-act-transparency.md`](https://github.com/ailiance/ailiance/blob/main/docs/eu-ai-act-transparency.md). Removal requests are handled via the issue tracker on the source repository; the provider commits to remove disputed content within 30 days and re-train on the next release cycle.

- ## 3.2. Removal of illegal content

- **General description of measures taken:**

- - The provider does not crawl the open web at large; sources are restricted to curated public HF datasets and authoritative vendor repositories, where the risk of illegal content (CSAM, terrorist content, IP-violating works) is structurally low.
- - Personal data was screened with **Microsoft Presidio + en_core_web_lg** (2026-04-28) across all 35+ system-level domain directories. **One** email address, detected in the unrelated `traduction-tech` corpus, was redacted before training. Full report: `data/pii-scan-report.json`.
- - No special-category data (GDPR Art. 9: health, religion, sexual orientation, etc.) was intentionally collected; the PII scan also screens for identifiers that could enable special-category inference (none flagged).
- - Licence compatibility is enforced via a per-source SPDX matrix; works under non-permissive licences are excluded.
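
A minimal stand-in for this screening step: the card names Presidio for PII detection, while this sketch uses a plain regex, and `ALLOWED_SPDX` is an illustrative allow-list rather than the real per-source matrix:

```python
import re
from typing import Optional

# Illustrative allow-list; the real per-source SPDX matrix lives in the repo.
ALLOWED_SPDX = {"Apache-2.0", "MIT", "BSD-3-Clause", "CC0-1.0"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def screen_record(record: dict) -> Optional[dict]:
    """Drop records whose licence is not allow-listed; redact e-mail PII."""
    licence = record.get("_provenance", {}).get("license")
    if licence not in ALLOWED_SPDX:
        return None  # excluded: non-permissive or unknown licence
    cleaned = dict(record)
    cleaned["text"] = EMAIL_RE.sub("[REDACTED-EMAIL]", record["text"])
    return cleaned
```

Records returning `None` are dropped before training; everything else passes through with e-mail addresses masked, mirroring the redaction described above.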
-
- ## 3.3. Other information (optional)
-
- - **Per-record provenance:** 49 956 system-level training records carry `_provenance.{source, license, record_idx, access_date}` fields, enabling per-record audit and removal.
- - **Compute footprint:** LoRA training updates ≈ 0.1–0.5 % of base-model parameters. **Estimated training compute for this LoRA ≪ 10²⁵ FLOPs**, well below the systemic-risk threshold of EU AI Act Art. 51. No proprietary teacher model is used in deployed inference.
- - **Risk classification:** Limited risk (Art. 52). Not deployed in safety-critical contexts.
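
The per-record `_provenance` contract lends itself to a simple audit pass. The field names follow the card; the function itself is an illustrative sketch, not the project's actual tooling:

```python
REQUIRED_FIELDS = ("source", "license", "record_idx", "access_date")


def audit_provenance(records):
    """Return the indices of records whose _provenance block is absent or
    incomplete, so they can be quarantined for the removal workflow."""
    incomplete = []
    for idx, record in enumerate(records):
        provenance = record.get("_provenance") or {}
        if any(field not in provenance for field in REQUIRED_FIELDS):
            incomplete.append(idx)
    return incomplete
```

Running this over the training JSONL before each release would surface any record that cannot be traced back to a source, which is the precondition for the 30-day removal commitment in §3.1.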
-
- ---
-
- # Appendix A — Performance evaluation (Art. 53(1)(a))

- **HumanEval** (custom Studio scorer; EvalPlus extra-tests not run — Linux-only sandbox): base 87.20 → +cpp 85.98 = **−1.22 pts**. For a rigorous HumanEval+ Δ, sample re-scoring on Linux is required (samples preserved at `eval/results/2026-05-04/devstral-cpp-fused-humanevalplus/`).

- Full bench results, methodology, env.json, and rerun.sh per measurement:
- [`eval/results/SUMMARY.md`](https://github.com/ailiance/ailiance/blob/main/eval/results/SUMMARY.md) ·
- [`MODEL_CARD.md`](https://github.com/ailiance/ailiance/blob/main/MODEL_CARD.md).

- ---

- # Appendix B — Usage

- ```python
- from mlx_lm import load
- from mlx_lm.tuner.utils import linear_to_lora_layers
- from huggingface_hub import snapshot_download
-
- base_path = snapshot_download("mistralai/Devstral-Small-2-24B-Instruct-2512")
- adapter_path = snapshot_download("Ailiance-fr/devstral-cpp-lora")
-
- model, tokenizer = load(base_path)
- linear_to_lora_layers(model, num_layers=32, config={"rank": 16, "alpha": 32})
- model.load_weights(f"{adapter_path}/adapters.safetensors", strict=False)
- ```

- Or fuse and serve as a self-contained checkpoint:

- ```bash
- python -m mlx_lm fuse \
-     --model mistralai/Devstral-Small-2-24B-Instruct-2512 \
-     --adapter-path <adapter_path> \
-     --save-path /tmp/devstral-cpp-lora-fused \
-     --dequantize
- ```

- ---
- # Appendix C — Limitations and out-of-scope use

- - Not for safety-critical decisions (medical, legal, structural, life-safety, biometric).
- - Not for high-stakes individual decisions (hiring, credit, law enforcement) — that would re-classify the system as high-risk under EU AI Act Art. 6 and trigger additional obligations.
- - Hallucination is present at typical instruction-tuned LLM levels; pair with a verifier or human-in-the-loop for factual outputs.
- - The LoRA inherits all base-model limitations (training cutoff, language coverage, refusal patterns).

- ---

- # Appendix D — Citation

  ```bibtex
- @misc{ailiance-2026,
-   title  = {ailiance: EU-sovereign multi-model LLM serving with HF-traceable LoRA adapters},
-   author = {Saillant, Clément},
-   year   = {2026},
-   url    = {https://github.com/ailiance/ailiance},
-   note   = {Live demo: https://www.ailiance.fr}
  }
  ```

- ---
-
- # Appendix E — Changelog
-
- | Date | Card version | Change |
- |---|---|---|
- | 2026-05-06 | v0.4.0 | Initial HF release |
- | 2026-05-06 | v0.4.1 | Self-contained EU AI Act card (per-adapter dataset table, PII statement, contact) |
- | 2026-05-06 | v0.4.2 | PST-aligned (Commission template structure, Sections §1–4) |
- | 2026-05-06 | **v0.4.3** | **PST-verbatim** — section labels and field names reproduced from the official Commission template (PDF 2025-07-24, English version). |
-
- ## Validated in `ailiance/ailiance-bench` v0.2
-
- This model is referenced in the [Ailiance benchmark suite](https://github.com/ailiance/ailiance-bench)
- (Phase 6 scoreboard, 7-task hardware-design evaluation).
-
- See the full scoreboard:
- [ailiance-bench README#scoreboard-lora-phase-6](https://github.com/ailiance/ailiance-bench#scoreboard-lora-phase-6--2026-05-11).
 
  ---
  license: apache-2.0
  base_model: mistralai/Devstral-Small-2-24B-Instruct-2512
+ library_name: peft
  tags:
+ - mlx
+ - lora
+ - peft
+ - ailiance
+ - devstral
+ - cpp
  language:
+ - en
+ - fr
+ pipeline_tag: text-generation
  ---

+ # Ailiance Devstral-Small-2-24B-Instruct cpp LoRA

+ LoRA adapter fine-tuned on `mistralai/Devstral-Small-2-24B-Instruct-2512` for **cpp** tasks.

+ > Maintained by **Ailiance**, a French AI org publishing EU AI Act-aligned LoRA adapters and datasets.

+ ## Quick start (MLX)

+ ```python
+ from mlx_lm import load, generate
+
+ model, tokenizer = load(
+     "mistralai/Devstral-Small-2-24B-Instruct-2512",
+     adapter_path="Ailiance-fr/devstral-cpp-lora",
+ )
+
+ print(generate(model, tokenizer, prompt="..."))
+ ```
 

+ ## Training

+ | Hyperparameter | Value |
+ |---|---|
+ | Base model | `mistralai/Devstral-Small-2-24B-Instruct-2512` |
+ | Method | LoRA via `mlx-lm` |
+ | Rank | 16 |
+ | Scale | 2.0 |
+ | Alpha | 32 |
+ | Max seq length | 2048 |
+ | Iterations | 500 |
+ | Optimizer | Adam, LR 1e-5 |
+ | Hardware | Apple M3 Ultra 512 GB |
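
As a sanity check on the table: `mlx-lm` derives the LoRA scale as alpha / rank, and the adapter size can be estimated per adapted projection. The layer count, hidden size, and square-projection assumption below are placeholders (Devstral's real config, with GQA-reduced k/v projections, will differ):

```python
def lora_stats(rank: int, alpha: int, n_layers: int, d_model: int, n_proj: int = 4):
    """Estimate LoRA adapter parameters, assuming n_proj square
    d_model x d_model projections per layer (a simplification)."""
    scale = alpha / rank  # mlx-lm LoRA scaling factor
    adapter_params = n_layers * n_proj * rank * (d_model + d_model)
    return scale, adapter_params


scale, params = lora_stats(rank=16, alpha=32, n_layers=40, d_model=5120)
print(scale)           # 2.0, matching the Scale row above
print(params / 24e9)   # rough trainable fraction vs. a 24B base
```

With these placeholder dims the adapter is on the order of 26 M parameters, i.e. roughly 0.1 % of a 24B base, consistent with the compute-footprint note in the previous card revision.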

+ ## Training data lineage

+ Derived from the internal **eu-kiki / mascarade** curation. All upstream samples
+ are synthetic, permissively licensed, or generated from Apache-2.0 base resources.
+ See the [Ailiance-fr catalog](https://huggingface.co/Ailiance-fr) for related cards.

+ ## License chain

+ | Component | License |
+ |---|---|
+ | Base model (`mistralai/Devstral-Small-2-24B-Instruct-2512`) | apache-2.0 |
+ | Training data (internal Ailiance curation: synthetic + permissive sources) | apache-2.0 |
+ | **LoRA adapter (this repo)** | **apache-2.0** |

+ _All upstream components are Apache 2.0 / MIT — the LoRA inherits permissive terms._

 
+ ## EU AI Act compliance

+ - **Article 53(1)(c)**: training data licenses preserved (per-dataset cards declare upstream licenses).
+ - **Article 53(1)(d)**: training data summary — see the upstream dataset cards on Ailiance-fr.
+ - **GPAI Code of Practice (July 2025)**: the base `mistralai/Devstral-Small-2-24B-Instruct-2512` is released under apache-2.0.
+ - **No web scraping by Ailiance**, **no commercially licensed data**, **no PII**.
+ - Upstream Stack Exchange content (where applicable) is CC-BY-SA-4.0 and propagates to this adapter.

+ ## License
 
 
 

+ LoRA weights: **apache-2.0** — see the License chain table above for the derivation rationale.

+ ## Citation

  ```bibtex
+ @misc{ailiance_devstral_cpp_2026,
+   author    = {Ailiance},
+   title     = {Ailiance — Devstral-Small-2-24B-Instruct cpp LoRA},
+   year      = {2026},
+   publisher = {Hugging Face},
+   url       = {https://huggingface.co/Ailiance-fr/devstral-cpp-lora}
  }
  ```

+ ## Related

+ See the full [Ailiance-fr LoRA collection](https://huggingface.co/Ailiance-fr).