---
title: Resep ID Gemma 4
emoji: 🍲
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
license: gemma
short_description: Gemma 4 Indonesian recipe fine-tune case study
models:
- google/gemma-4-e2b-it
- junwatu/resep-ID-gemma-4-E2B-it
- junwatu/resep-ID-gemma-4-E2B-it-gguf
datasets:
- junwatu/indonesian-recipes
tags:
- gemma
- gemma-4
- fine-tuning
- mi300x
- rocm
- indonesian
- recipes
- gguf
- text-generation
---

# Resep ID Gemma 4

This Space explains an end-to-end fine-tuning project: taking `google/gemma-4-e2b-it`, adapting it to Indonesian recipe generation, evaluating the result, quantizing it to GGUF, and deploying it as a lightweight recipe assistant.

The goal was simple:

> Given an Indonesian dish title, generate a structured recipe with `Bahan:` and `Langkah:` in natural Bahasa Indonesia.

Example input:

```text
Tulis resep masakan Indonesia berjudul: "Tumis Kangkung Tempe".
```

Expected output shape:

```text
Bahan:
- ...
- ...

Langkah:
1. ...
2. ...
```

## Project Summary

| Item | Details |
|---|---|
| Base model | `google/gemma-4-e2b-it` |
| Fine-tuned model | `junwatu/resep-ID-gemma-4-E2B-it` |
| GGUF model | `junwatu/resep-ID-gemma-4-E2B-it-gguf` |
| Dataset | `junwatu/indonesian-recipes` |
| Task | Indonesian recipe generation |
| Training hardware | AMD Instinct MI300X |
| GPU memory | 192 GB HBM3 class |
| Software stack | ROCm 7.2, PyTorch ROCm wheel, Transformers 5.x, TRL 1.x |
| Training method | Full supervised fine-tune |
| Training data | 66,419 recipes |
| Validation data | 1,748 recipes |
| Held-out test data | 1,748 recipes |
| Final deployment format | Safetensors + GGUF Q4_K_M / Q8_0 |

## Why Fine-Tune?

The base Gemma 4 model was already fluent in Indonesian, but it often missed the identity of specific Indonesian dishes. For example, the base model could produce a plausible recipe, but not always the right recipe.
It struggled with regional or highly specific dishes such as:

- Sosis Solo
- Tahu Thek
- Tempe Mendoan
- Tahu Walik Aci
- Kering Tempe Pete
- DEBM / MPASI recipe variants

A baseline evaluation on 50 held-out recipes showed the main gap:

| Dimension | Base Gemma 4 E2B |
|---|---:|
| Language fidelity | 5.00 |
| Format compliance | 3.90 |
| Ingredient plausibility | 3.10 |
| Step coherence | 3.20 |
| Dish authenticity | 2.70 |
| Overall | 3.58 |

The key weakness was `dish_authenticity`: the model was fluent, but too often produced a generic Indonesian recipe instead of the requested dish.

## Dataset

The dataset contains structured Indonesian home-cooking recipes. Each row has:

| Field | Description |
|---|---|
| `title` | Recipe name |
| `ingredients` | List of ingredient lines |
| `steps` | Ordered cooking steps |
| `num_ingredients` | Ingredient count |
| `num_steps` | Step count |
| `char_count` | Approximate recipe length |

The project converts the original parquet files into JSONL splits:

```text
data/processed/train.jsonl
data/processed/val.jsonl
data/processed/test.jsonl
```

The held-out test split is not used for training. It is used only for pre/post fine-tune comparison.

## Training Setup

The fine-tune used a single AMD MI300X GPU on ROCm 7.2.

Important training choices:

- Full fine-tune instead of LoRA
- bf16 training
- 1 epoch
- Effective batch size 16
- Max sequence length 2048
- Cosine learning-rate schedule
- 3% warmup
- Gradient checkpointing enabled
- Vision/audio paths frozen because this task is text-only

Gemma 4 is multimodal, but this project trains only the text path:

```text
Train:
- model.language_model.*
- lm_head

Freeze:
- vision tower
- audio tower
- vision/audio adapters
```

## Training Format

The project uses TRL prompt/completion conversational format:

```json
{
  "prompt": [
    {
      "role": "user",
      "content": "Tulis resep masakan Indonesia berjudul: \"Tumis Kangkung Tempe\"..."
    }
  ],
  "completion": [
    {
      "role": "assistant",
      "content": "Bahan:\n- ...\n\nLangkah:\n1. ..."
    }
  ]
}
```

This format was important. In this stack, the alternative `messages` format with `assistant_only_loss=True` caused unstable loss behavior.

## Results

The fine-tuned model improved the practical recipe-generation behavior.

| Dimension | Base | Fine-tuned |
|---|---:|---:|
| Language fidelity | 5.00 | ~4.6 |
| Format compliance | 3.90 | ~4.95 |
| Ingredient plausibility | 3.10 | ~3.5 |
| Step coherence | 3.20 | ~3.9 |
| Dish authenticity | 2.70 | ~3.25 |
| Overall | 3.58 | ~4.0 |

The strongest gains were:

- More consistent `Bahan:` / `Langkah:` formatting
- Better recipe length discipline
- More natural Indonesian cooking vocabulary
- Better common-dish ingredient profiles
- Better structure for common dishes like tumis, pepes, rendang, sambal, and gulai

## Critical Inference Setting

One important lesson from the project: the fine-tuned model needs repetition control.

For Hugging Face Transformers inference, use:

```python
model.generate(
    **inputs,
    max_new_tokens=1280,
    do_sample=False,
    repetition_penalty=1.05,
    no_repeat_ngram_size=6,
    pad_token_id=tok.eos_token_id,
)
```

Without `no_repeat_ngram_size=6`, long recipes can fall into repeated ingredient-list loops.

For GGUF runtimes such as llama.cpp or LM Studio, use the DRY sampler equivalent with allowed length around 6.

## GGUF Deployment

The model was also converted to GGUF for local and CPU-friendly use.

Available quantizations:

| Quant | Approx. size | Use case |
|---|---:|---|
| Q4_K_M | ~3.2 GB | Default portable version |
| Q8_0 | ~4.7 GB | Higher quality, more RAM |

The GGUF model can run with llama.cpp, LM Studio, or other GGUF-compatible runtimes.

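Whichever runtime is used, it can help to check generated recipes for the repetition loops described under Critical Inference Setting. The following is a minimal sketch of such a check, not part of the project's code: the function names `repeated_ngrams` and `looks_looped` and the sample strings are hypothetical. It counts whitespace-separated word 6-grams, which only approximates `no_repeat_ngram_size=6`, since that setting operates on tokenizer tokens rather than words.

```python
from collections import Counter


def repeated_ngrams(text: str, n: int = 6) -> list[tuple[str, ...]]:
    """Return the word n-grams that occur more than once in `text`.

    Assumption: words are split on whitespace, which is only a rough
    stand-in for the token-level n-grams the generation setting uses.
    """
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    return [gram for gram, count in counts.items() if count > 1]


def looks_looped(text: str, n: int = 6) -> bool:
    """Flag an output that contains at least one repeated word n-gram."""
    return bool(repeated_ngrams(text, n))


# Hypothetical outputs, written by hand for illustration only.
looped_output = "Bahan:\n" + "- 2 siung bawang putih, cincang halus\n" * 3
clean_output = (
    "Bahan:\n- 1 ikat kangkung\n- 1 papan tempe\n\n"
    "Langkah:\n1. Tumis bumbu hingga harum.\n2. Masukkan kangkung dan tempe."
)

print(looks_looped(looped_output))  # True: the ingredient line repeats
print(looks_looped(clean_output))   # False: no 6-word sequence repeats
```

A check like this could be run over a batch of evaluation outputs to spot loops when comparing sampler settings. It may flag legitimate recipes that happen to repeat a long phrase, so treat a hit as a prompt for manual review rather than a verdict.
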
## What Worked

The project worked well for:

- Common Indonesian home-cooking recipes
- Structured recipe generation
- Concise recipe output
- Natural Indonesian recipe phrasing
- Common ingredients and cooking methods

Examples of stronger categories:

- Ayam
- Ikan
- Sapi
- Kambing
- Tahu
- Tempe
- Telur
- Udang
- Sambal
- Tumis
- Pepes
- Rendang-style dishes

## Limitations

This is not a perfect cookbook model.

Known limitations:

- Rare regional dishes can become generic.
- Some defining ingredients may be omitted.
- Diet or modifier terms such as MPASI, DEBM, basah, or kering may be ignored.
- The model may produce plausible but not authentic recipes.
- Some outputs may contain minor formatting or fraction glitches.
- Recipes should be checked before cooking.

The main remaining bottleneck is dataset coverage, especially for regional and specialty dishes.

## Lessons Learned

The biggest technical lessons:

1. Use the native ROCm 7.2 PyTorch wheel on MI300X.
2. Avoid older ROCm wheels for this Gemma 4 bf16 training path.
3. Use prompt/completion format with TRL for this stack.
4. Always run a cheap quick-validation training pass before a full run.
5. Judge the base model before fine-tuning.
6. Automatic metrics are not enough for recipe quality.
7. `no_repeat_ngram_size=6` is critical for stable inference.
8. Dataset coverage matters more than another epoch for rare dishes.

## Cost and Runtime

The full successful cycle was inexpensive because MI300X training was fast for this model size.

Approximate reference run:

| Phase | Approx. cost |
|---|---:|
| Setup and debugging | ~$2.50 |
| Quick validation | ~$1.50 |
| Full training | ~$3.00 |
| Evaluation iterations | ~$2.00 |
| GGUF conversion and upload | ~$1.30 |
| Idle/debugging slack | ~$4.00 |
| Total | ~$14 |

Future cycles should be cheaper because the stack and gotchas are now documented.

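As a quick sanity check on the table, the itemized phases sum to about $14.30, which the table rounds to ~$14. The short sketch below reproduces that arithmetic; the dictionary name `phase_costs` is only illustrative, and the figures are the approximate values from the table, not billing data.

```python
# Approximate per-phase costs in USD, copied from the cost table.
phase_costs = {
    "Setup and debugging": 2.50,
    "Quick validation": 1.50,
    "Full training": 3.00,
    "Evaluation iterations": 2.00,
    "GGUF conversion and upload": 1.30,
    "Idle/debugging slack": 4.00,
}

total = sum(phase_costs.values())

# Share of the total spent on the full training run itself.
training_share = phase_costs["Full training"] / total

print(f"Total: ~${total:.2f}")
print(f"Full training share: {training_share:.0%}")
```

By these figures the full training run accounts for roughly a fifth of the total, with the rest going to setup, validation, evaluation, conversion, and idle time.
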
## Links

- Base model: [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it)
- Fine-tuned model: [`junwatu/resep-ID-gemma-4-E2B-it`](https://huggingface.co/junwatu/resep-ID-gemma-4-E2B-it)
- GGUF model: [`junwatu/resep-ID-gemma-4-E2B-it-gguf`](https://huggingface.co/junwatu/resep-ID-gemma-4-E2B-it-gguf)
- Dataset: [`junwatu/indonesian-recipes`](https://huggingface.co/datasets/junwatu/indonesian-recipes)
- Live recipe demo: [`junwatu/koki-ai`](https://huggingface.co/spaces/junwatu/koki-ai)

## License

This project inherits the Gemma Terms of Use from the base model.