| --- |
| title: Resep ID Gemma 4 |
| emoji: 🍲 |
| colorFrom: red |
| colorTo: yellow |
| sdk: static |
| pinned: false |
| license: gemma |
| short_description: Gemma 4 Indonesian recipe fine-tune case study |
| models: |
| - google/gemma-4-e2b-it |
| - junwatu/resep-ID-gemma-4-E2B-it |
| - junwatu/resep-ID-gemma-4-E2B-it-gguf |
| datasets: |
| - junwatu/indonesian-recipes |
| tags: |
| - gemma |
| - gemma-4 |
| - fine-tuning |
| - mi300x |
| - rocm |
| - indonesian |
| - recipes |
| - gguf |
| - text-generation |
| --- |
| |
| # Resep ID Gemma 4 |
|
|
| This Space explains an end-to-end fine-tuning project: taking `google/gemma-4-e2b-it`, adapting it to Indonesian recipe generation, evaluating the result, quantizing it to GGUF, and deploying it as a lightweight recipe assistant. |
|
|
| The goal was simple: |
|
|
| > Given an Indonesian dish title, generate a structured recipe with `Bahan:` and `Langkah:` in natural Bahasa Indonesia. |
|
|
| Example input: |
|
|
| ```text |
| Tulis resep masakan Indonesia berjudul: "Tumis Kangkung Tempe". |
| ``` |
|
|
| Expected output shape: |
|
|
| ```text |
| Bahan: |
| - ... |
| - ... |
| |
| Langkah: |
| 1. ... |
| 2. ... |
| ``` |
|
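The output shape above can be checked mechanically. The following is a minimal sketch, not the project's own evaluation code: the function name `check_recipe_format` and the strictness of the rules (bullets must start with `- `, steps must be numbered consecutively from 1) are assumptions made for illustration.

```python
import re


def check_recipe_format(text: str) -> bool:
    """Return True if text is a 'Bahan:' bullet list followed by a 'Langkah:' numbered list."""
    # Drop blank lines and surrounding whitespace so spacing does not matter.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if "Bahan:" not in lines or "Langkah:" not in lines:
        return False
    bahan_idx = lines.index("Bahan:")
    langkah_idx = lines.index("Langkah:")
    # 'Bahan:' must come first and contain at least one ingredient line.
    if bahan_idx != 0 or langkah_idx <= bahan_idx + 1:
        return False
    ingredients = lines[bahan_idx + 1 : langkah_idx]
    steps = lines[langkah_idx + 1 :]
    if not steps:
        return False
    if not all(ln.startswith("- ") for ln in ingredients):
        return False
    # Steps must be numbered 1., 2., 3., ... in order (assumed rule).
    for i, ln in enumerate(steps, start=1):
        if not re.match(rf"{i}\.\s+\S", ln):
            return False
    return True
```

A check like this only covers structure. It says nothing about whether the ingredients or steps are right for the requested dish.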
|
| ## Project Summary |
|
|
| | Item | Details | |
| |---|---| |
| | Base model | `google/gemma-4-e2b-it` | |
| | Fine-tuned model | `junwatu/resep-ID-gemma-4-E2B-it` | |
| | GGUF model | `junwatu/resep-ID-gemma-4-E2B-it-gguf` | |
| | Dataset | `junwatu/indonesian-recipes` | |
| | Task | Indonesian recipe generation | |
| | Training hardware | AMD Instinct MI300X | |
| | GPU memory | 192 GB HBM3 | |
| | Software stack | ROCm 7.2, PyTorch ROCm wheel, Transformers 5.x, TRL 1.x | |
| | Training method | Full supervised fine-tune | |
| | Training data | 66,419 recipes | |
| | Validation data | 1,748 recipes | |
| | Held-out test data | 1,748 recipes | |
| | Final deployment format | Safetensors + GGUF Q4_K_M / Q8_0 | |
| |
| ## Why Fine-Tune? |
| |
| The base Gemma 4 model was already fluent in Indonesian, but it often missed the identity of specific Indonesian dishes. |
| |
| For example, the base model could produce a plausible recipe, but not always the right recipe. It struggled with regional or highly specific dishes such as: |
| |
| - Sosis Solo |
| - Tahu Thek |
| - Tempe Mendoan |
| - Tahu Walik Aci |
| - Kering Tempe Pete |
| - DEBM / MPASI recipe variants |
| |
| A baseline evaluation on 50 held-out recipes showed the main gap: |
| |
| | Dimension | Base Gemma 4 E2B | |
| |---|---:| |
| | Language fidelity | 5.00 | |
| | Format compliance | 3.90 | |
| | Ingredient plausibility | 3.10 | |
| | Step coherence | 3.20 | |
| | Dish authenticity | 2.70 | |
| | Overall | 3.58 | |
| |
| The key weakness was `dish_authenticity`: the model was fluent, but too often produced a generic Indonesian recipe instead of the requested dish. |
|
|
| ## Dataset |
|
|
| The dataset contains structured Indonesian home-cooking recipes. |
|
|
| Each row has: |
|
|
| | Field | Description | |
| |---|---| |
| | `title` | Recipe name | |
| | `ingredients` | List of ingredient lines | |
| | `steps` | Ordered cooking steps | |
| | `num_ingredients` | Ingredient count | |
| | `num_steps` | Step count | |
| | `char_count` | Approximate recipe length in characters | |
|
|
| The project converts the original parquet files into JSONL splits: |
|
|
| ```text |
| data/processed/train.jsonl |
| data/processed/val.jsonl |
| data/processed/test.jsonl |
| ``` |
|
|
| The held-out test split is never used for training. It is reserved for comparing the model before and after fine-tuning. |
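
As a rough illustration of the JSONL step, the sketch below writes one JSON object per line using only the standard library. The helper name `write_jsonl` is hypothetical, and the rows are assumed to be plain dicts with list-valued `ingredients` and `steps`; how the parquet files are loaded and split is not shown here.

```python
import json
from pathlib import Path
from typing import Iterable


def write_jsonl(rows: Iterable[dict], path: str) -> int:
    """Write rows as one JSON object per line and return the number of rows written."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out.open("w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps Indonesian text readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `write_jsonl(train_rows, "data/processed/train.jsonl")` would produce the first of the three files, where `train_rows` stands for whatever iterable holds the training split.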
|
|
| ## Training Setup |
|
|
| The fine-tune used a single AMD MI300X GPU on ROCm 7.2. |
|
|
| Important training choices: |
|
|
| - Full fine-tune instead of LoRA |
| - bf16 training |
| - 1 epoch |
| - Effective batch size 16 |
| - Max sequence length 2048 |
| - Cosine learning-rate schedule |
| - 3% warmup |
| - Gradient checkpointing enabled |
| - Vision/audio paths frozen because this task is text-only |
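
To make these numbers concrete, here is a small worked calculation of the step count and warmup length implied by the settings above. It is a sketch under stated assumptions: no sequence packing, the last partial batch is kept, and the schedule follows the common linear-warmup plus cosine-decay shape. The peak learning rate is not given in this write-up, so `peak_lr` is a placeholder argument.

```python
import math

TRAIN_EXAMPLES = 66_419   # training recipes, from the project summary
EFFECTIVE_BATCH = 16      # per-device batch size x gradient accumulation steps
WARMUP_RATIO = 0.03       # 3% warmup


def schedule_summary(n_examples: int, effective_batch: int, warmup_ratio: float, epochs: int = 1):
    """Return (total optimizer steps, warmup steps) for a simple unpacked run."""
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


def cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))


total, warmup = schedule_summary(TRAIN_EXAMPLES, EFFECTIVE_BATCH, WARMUP_RATIO)
print(total, warmup)  # 4152 optimizer steps, of which 125 are warmup
```

The real trainer may round differently or drop the last batch, so treat these as order-of-magnitude figures rather than the exact step counts of the run.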
|
|
| Gemma 4 is multimodal, but this project trains only the text path: |
|
|
| ```text |
| Train: |
| - model.language_model.* |
| - lm_head |
| |
| Freeze: |
| - vision tower |
| - audio tower |
| - vision/audio adapters |
| ``` |
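
One way to express this split is to decide trainability from the parameter name. The sketch below assumes the prefixes listed above (`model.language_model.` and `lm_head`) and treats everything else as frozen; the helper names are hypothetical, and `apply_freeze` expects any object exposing `named_parameters()` the way a PyTorch module does.

```python
TRAINABLE_PREFIXES = ("model.language_model.", "lm_head")


def is_trainable(param_name: str) -> bool:
    """Text-path parameters are trained; everything else (vision, audio, adapters) is frozen."""
    return param_name.startswith(TRAINABLE_PREFIXES)


def apply_freeze(model) -> tuple[int, int]:
    """Set requires_grad on each parameter and return (trainable, frozen) tensor counts."""
    trainable = frozen = 0
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name)
        if param.requires_grad:
            trainable += 1
        else:
            frozen += 1
    return trainable, frozen
```

Printing the two counts before training is a cheap sanity check that the vision and audio towers really are excluded from the optimizer.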
|
|
| ## Training Format |
|
|
| The project uses TRL's conversational prompt/completion format: |
|
|
| ```json |
| { |
| "prompt": [ |
| { |
| "role": "user", |
| "content": "Tulis resep masakan Indonesia berjudul: \"Tumis Kangkung Tempe\"..." |
| } |
| ], |
| "completion": [ |
| { |
| "role": "assistant", |
| "content": "Bahan:\n- ...\n\nLangkah:\n1. ..." |
| } |
| ] |
| } |
| ``` |
|
|
| The choice of format mattered. In this stack, the alternative `messages` format with `assistant_only_loss=True` produced unstable loss. |
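
A dataset row could be mapped into this format roughly as follows. This is a sketch only: the function name `to_prompt_completion` is hypothetical, and because the prompt in the JSON example above is truncated, the prompt text here reuses the shorter example input from the introduction rather than the project's full wording.

```python
def to_prompt_completion(row: dict) -> dict:
    """Build one prompt/completion record from a row with title, ingredients, and steps."""
    # Assumed prompt wording; the project's full prompt is not shown in this write-up.
    prompt = f'Tulis resep masakan Indonesia berjudul: "{row["title"]}".'
    bahan = "\n".join(f"- {item}" for item in row["ingredients"])
    langkah = "\n".join(f"{i}. {step}" for i, step in enumerate(row["steps"], start=1))
    completion = f"Bahan:\n{bahan}\n\nLangkah:\n{langkah}"
    return {
        "prompt": [{"role": "user", "content": prompt}],
        "completion": [{"role": "assistant", "content": completion}],
    }
```

Keeping the user turn in `prompt` and the recipe in `completion` makes the boundary between input and target explicit in the data itself.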
|
|
| ## Results |
|
|
| Fine-tuning improved every dimension except language fidelity, which dipped from 5.00 to about 4.6. |
|
|
| | Dimension | Base | Fine-tuned | |
| |---|---:|---:| |
| | Language fidelity | 5.00 | ~4.6 | |
| | Format compliance | 3.90 | ~4.95 | |
| | Ingredient plausibility | 3.10 | ~3.5 | |
| | Step coherence | 3.20 | ~3.9 | |
| | Dish authenticity | 2.70 | ~3.25 | |
| | Overall | 3.58 | ~4.0 | |
|
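The `Overall` row is consistent with an unweighted mean of the five dimensions. That reading is an inference from the numbers rather than something the evaluation setup states, and the fine-tuned scores are approximate, so the check below is only a sketch:

```python
from statistics import mean

# (base, fine-tuned) scores copied from the results table; fine-tuned values are approximate.
scores = {
    "language_fidelity": (5.00, 4.6),
    "format_compliance": (3.90, 4.95),
    "ingredient_plausibility": (3.10, 3.5),
    "step_coherence": (3.20, 3.9),
    "dish_authenticity": (2.70, 3.25),
}

base_overall = mean(base for base, _ in scores.values())
tuned_overall = mean(tuned for _, tuned in scores.values())

print(round(base_overall, 2), round(tuned_overall, 2))  # 3.58 and 4.04
```

The base mean reproduces the reported 3.58, and the fine-tuned mean of about 4.04 matches the reported ~4.0.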
|
| The strongest gains were: |
|
|
| - More consistent `Bahan:` / `Langkah:` formatting |
| - Better recipe length discipline |
| - More natural Indonesian cooking vocabulary |
| - Better common-dish ingredient profiles |
| - Better structure for common dishes like tumis, pepes, rendang, sambal, and gulai |
|
|
| ## Critical Inference Setting |
|
|
| One important lesson from the project: the fine-tuned model needs repetition control. |
|
|
| For Hugging Face Transformers inference, use: |
|
|
| ```python |
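| # Assumes `model`, `tok` (the tokenizer), and tokenized `inputs` are already loaded. |
| model.generate( |
|     **inputs, |
|     max_new_tokens=1280, |
|     do_sample=False,  # deterministic decoding, no sampling |
|     repetition_penalty=1.05,  # mild penalty on already-generated tokens |
|     no_repeat_ngram_size=6,  # no 6-token sequence may appear twice |
|     pad_token_id=tok.eos_token_id, |
| ) |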
| ``` |
|
|
| Without `no_repeat_ngram_size=6`, long recipes can fall into repeated ingredient-list loops. |
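
To see what kind of loop this setting guards against, the sketch below scans a token sequence for the first 6-gram that occurs twice. It is an illustration of the idea, not how Transformers implements the constraint: the function name is hypothetical, and the usage example splits on whitespace, whereas real generation works on tokenizer ids.

```python
from typing import Optional, Sequence, Tuple


def first_repeated_ngram(tokens: Sequence[str], n: int = 6) -> Optional[Tuple[str, ...]]:
    """Return the first n-gram that appears a second time, or None if no n-gram repeats."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i : i + n])
        if gram in seen:
            return gram
        seen.add(gram)
    return None


# A looping ingredient list: the same two lines come back a second time.
looping = "- 1 sdt garam - 1 sdt gula - 1 sdt garam - 1 sdt gula".split()
print(first_repeated_ngram(looping))
```

Short repeated phrases such as `- 1 sdt` stay below the 6-token window, so ordinary ingredient lines with similar quantities are not affected in this whitespace-level view.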
|
|
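| For GGUF runtimes such as llama.cpp or LM Studio, enable the DRY sampler with an allowed length of around 6. In llama.cpp this corresponds to `--dry-allowed-length`, used together with a non-zero `--dry-multiplier`. |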
|
|
| ## GGUF Deployment |
|
|
| The model was also converted to GGUF for local and CPU-friendly use. |
|
|
| Available quantizations: |
|
|
| | Quant | Approx. size | Use case | |
| |---|---:|---| |
| | Q4_K_M | ~3.2 GB | Default portable version | |
| | Q8_0 | ~4.7 GB | Higher quality, more RAM | |
| |
| The GGUF model can run with llama.cpp, LM Studio, or other GGUF-compatible runtimes. |
| |
| ## What Worked |
| |
| The project worked well for: |
| |
| - Common Indonesian home-cooking recipes |
| - Structured recipe generation |
| - Concise recipe output |
| - Natural Indonesian recipe phrasing |
| - Common ingredients and cooking methods |
| |
| Examples of stronger categories: |
| |
| - Ayam |
| - Ikan |
| - Sapi |
| - Kambing |
| - Tahu |
| - Tempe |
| - Telur |
| - Udang |
| - Sambal |
| - Tumis |
| - Pepes |
| - Rendang-style dishes |
| |
| ## Limitations |
| |
| This is not a perfect cookbook model. |
| |
| Known limitations: |
| |
| - Rare regional dishes can become generic. |
| - Some defining ingredients may be omitted. |
| - Diet or modifier terms such as MPASI, DEBM, basah, or kering may be ignored. |
| - The model may produce plausible but not authentic recipes. |
| - Some outputs may contain minor formatting or fraction glitches. |
| - Recipes should be checked before cooking. |
| |
| The main remaining bottleneck is dataset coverage, especially for regional and specialty dishes. |
| |
| ## Lessons Learned |
| |
| The biggest technical lessons: |
| |
| 1. Use the native ROCm 7.2 PyTorch wheel on MI300X. |
| 2. Avoid older ROCm wheels for this Gemma 4 bf16 training path. |
| 3. Use prompt/completion format with TRL for this stack. |
| 4. Always run a cheap quick-validation training pass before a full run. |
| 5. Judge the base model before fine-tuning. |
| 6. Automatic metrics are not enough for recipe quality. |
| 7. `no_repeat_ngram_size=6` is critical for stable inference. |
| 8. Dataset coverage matters more than another epoch for rare dishes. |
|
|
| ## Cost and Runtime |
|
|
| The full successful cycle was inexpensive because MI300X training was fast for this model size. |
|
|
| Approximate reference run: |
|
|
| | Phase | Approx. cost | |
| |---|---:| |
| | Setup and debugging | ~$2.50 | |
| | Quick validation | ~$1.50 | |
| | Full training | ~$3.00 | |
| | Evaluation iterations | ~$2.00 | |
| | GGUF conversion and upload | ~$1.30 | |
| | Idle/debugging slack | ~$4.00 | |
| | Total | ~$14 | |
|
|
| Future cycles should be cheaper because the stack and gotchas are now documented. |
|
|
| ## Links |
|
|
| - Base model: [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it) |
| - Fine-tuned model: [`junwatu/resep-ID-gemma-4-E2B-it`](https://huggingface.co/junwatu/resep-ID-gemma-4-E2B-it) |
| - GGUF model: [`junwatu/resep-ID-gemma-4-E2B-it-gguf`](https://huggingface.co/junwatu/resep-ID-gemma-4-E2B-it-gguf) |
| - Dataset: [`junwatu/indonesian-recipes`](https://huggingface.co/datasets/junwatu/indonesian-recipes) |
| - Live recipe demo: [`junwatu/koki-ai`](https://huggingface.co/spaces/junwatu/koki-ai) |
|
|
| ## License |
|
|
| This project inherits the Gemma Terms of Use from the base model. |
|
|