---
title: Resep ID Gemma 4
emoji: 🍲
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
license: gemma
short_description: Gemma 4 Indonesian recipe fine-tune case study
models:
- google/gemma-4-e2b-it
- junwatu/resep-ID-gemma-4-E2B-it
- junwatu/resep-ID-gemma-4-E2B-it-gguf
datasets:
- junwatu/indonesian-recipes
tags:
- gemma
- gemma-4
- fine-tuning
- mi300x
- rocm
- indonesian
- recipes
- gguf
- text-generation
---

# Resep ID Gemma 4

This Space explains an end-to-end fine-tuning project: taking `google/gemma-4-e2b-it`, adapting it to Indonesian recipe generation, evaluating the result, quantizing it to GGUF, and deploying it as a lightweight recipe assistant.

The goal was simple:

> Given an Indonesian dish title, generate a structured recipe with `Bahan:` and `Langkah:` in natural Bahasa Indonesia.

Example input:

```text
Tulis resep masakan Indonesia berjudul: "Tumis Kangkung Tempe".
```

Expected output shape:

```text
Bahan:
- ...
- ...

Langkah:
1. ...
2. ...
```
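The input and output shapes above can be sketched as a prompt builder and a small parser. This is a minimal illustration, not the project's own code: `build_prompt` and `parse_recipe` are hypothetical helper names, and the prompt text simply mirrors the short example input shown above.

```python
import re


def build_prompt(title: str) -> str:
    """Build the user prompt for one dish title (mirrors the example input)."""
    return f'Tulis resep masakan Indonesia berjudul: "{title}".'


def parse_recipe(text: str) -> dict:
    """Split a generated recipe into ingredient and step lists.

    Raises ValueError when a header is missing or the headers are out of order.
    """
    bahan_at = text.find("Bahan:")
    langkah_at = text.find("Langkah:")
    if bahan_at == -1 or langkah_at == -1 or langkah_at < bahan_at:
        raise ValueError("expected 'Bahan:' followed by 'Langkah:'")

    bahan_block = text[bahan_at + len("Bahan:"):langkah_at]
    langkah_block = text[langkah_at + len("Langkah:"):]

    # Ingredient lines start with "- ", step lines start with "1. ", "2. ", ...
    ingredients = [
        m.group(1).strip()
        for line in bahan_block.splitlines()
        if (m := re.match(r"\s*-\s+(.*)", line))
    ]
    steps = [
        m.group(1).strip()
        for line in langkah_block.splitlines()
        if (m := re.match(r"\s*\d+\.\s+(.*)", line))
    ]
    return {"ingredients": ingredients, "steps": steps}
```

A parser like this is one way to check format compliance automatically; it says nothing about whether the recipe is authentic.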

## Project Summary

| Item | Details |
|---|---|
| Base model | `google/gemma-4-e2b-it` |
| Fine-tuned model | `junwatu/resep-ID-gemma-4-E2B-it` |
| GGUF model | `junwatu/resep-ID-gemma-4-E2B-it-gguf` |
| Dataset | `junwatu/indonesian-recipes` |
| Task | Indonesian recipe generation |
| Training hardware | AMD Instinct MI300X |
| GPU memory | 192 GB HBM3 |
| Software stack | ROCm 7.2, PyTorch ROCm wheel, Transformers 5.x, TRL 1.x |
| Training method | Full supervised fine-tune |
| Training data | 66,419 recipes |
| Validation data | 1,748 recipes |
| Held-out test data | 1,748 recipes |
| Final deployment format | Safetensors + GGUF Q4_K_M / Q8_0 |

## Why Fine-Tune?

The base Gemma 4 model was already fluent in Indonesian, but it often missed the identity of specific Indonesian dishes.

For example, the base model could produce a plausible recipe, but not always the right recipe. It struggled with regional or highly specific dishes such as:

- Sosis Solo
- Tahu Thek
- Tempe Mendoan
- Tahu Walik Aci
- Kering Tempe Pete
- DEBM / MPASI recipe variants

A baseline evaluation on 50 held-out recipes showed the main gap:

| Dimension | Base Gemma 4 E2B |
|---|---:|
| Language fidelity | 5.00 |
| Format compliance | 3.90 |
| Ingredient plausibility | 3.10 |
| Step coherence | 3.20 |
| Dish authenticity | 2.70 |
| Overall | 3.58 |

The key weakness was `dish_authenticity`: the model was fluent, but too often produced a generic Indonesian recipe instead of the requested dish.
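The `Overall` row in the table is consistent with an unweighted mean of the five dimensions. The sketch below checks that arithmetic; the equal weighting is an assumption inferred from the numbers, and every key name except `dish_authenticity` is a hypothetical label.

```python
from statistics import mean

# Baseline scores copied from the table above.
BASE_SCORES = {
    "language_fidelity": 5.00,
    "format_compliance": 3.90,
    "ingredient_plausibility": 3.10,
    "step_coherence": 3.20,
    "dish_authenticity": 2.70,
}

# Assumption: every dimension carries equal weight.
overall = round(mean(BASE_SCORES.values()), 2)
weakest = min(BASE_SCORES, key=BASE_SCORES.get)

print(overall, weakest)
```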

## Dataset

The dataset contains structured Indonesian home-cooking recipes.

Each row has:

| Field | Description |
|---|---|
| `title` | Recipe name |
| `ingredients` | List of ingredient lines |
| `steps` | Ordered cooking steps |
| `num_ingredients` | Ingredient count |
| `num_steps` | Step count |
| `char_count` | Approximate recipe length |

The project converts the original parquet files into JSONL splits:

```text
data/processed/train.jsonl
data/processed/val.jsonl
data/processed/test.jsonl
```

The held-out test split is not used for training. It is used only for pre/post fine-tune comparison.
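As a rough illustration of the JSONL step, the sketch below serializes recipe rows and runs a basic consistency check on each one. It assumes the parquet rows have already been loaded as plain dictionaries with the fields listed above (the parquet reading itself is not shown), and it assumes `num_ingredients` and `num_steps` equal the lengths of their lists. All function names are hypothetical.

```python
import json


def rows_to_jsonl(rows):
    """Serialize recipe rows to JSONL text, one JSON object per line."""
    # ensure_ascii=False keeps Indonesian text readable in the output file.
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def jsonl_to_rows(text):
    """Parse JSONL text back into a list of row dictionaries."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def check_row(row):
    """Return a list of problems found in one recipe row (empty if none)."""
    problems = []
    if not str(row.get("title", "")).strip():
        problems.append("empty title")
    # Assumption: the count fields should match the list lengths.
    if row.get("num_ingredients") != len(row.get("ingredients", [])):
        problems.append("num_ingredients does not match ingredients")
    if row.get("num_steps") != len(row.get("steps", [])):
        problems.append("num_steps does not match steps")
    return problems
```

Writing a split is then a matter of saving the returned text to a path such as `data/processed/train.jsonl` with UTF-8 encoding.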

## Training Setup

The fine-tune used a single AMD MI300X GPU on ROCm 7.2.

Important training choices:

- Full fine-tune instead of LoRA
- bf16 training
- 1 epoch
- Effective batch size 16
- Max sequence length 2048
- Cosine learning-rate schedule
- 3% warmup
- Gradient checkpointing enabled
- Vision/audio paths frozen because this task is text-only
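The choices above imply a rough step budget that is worth working out before a run. The sketch below derives it from the 66,419 training recipes and the effective batch size of 16. It assumes one recipe per sequence (no packing), that the final partial batch is kept, and that warmup steps are rounded up. The function names are hypothetical, and the schedule function is a generic linear-warmup cosine curve rather than the trainer's exact implementation.

```python
import math

TRAIN_EXAMPLES = 66_419   # from the project summary
EFFECTIVE_BATCH = 16
EPOCHS = 1
WARMUP_RATIO = 0.03


def total_steps(n_examples: int, batch: int, epochs: int) -> int:
    """Optimizer steps, assuming the last partial batch is kept."""
    return math.ceil(n_examples / batch) * epochs


def warmup_steps(total: int, ratio: float) -> int:
    """Warmup length, assuming the ratio is rounded up to whole steps."""
    return math.ceil(total * ratio)


def lr_multiplier(step: int, total: int, warmup: int) -> float:
    """Linear warmup to 1.0, then cosine decay towards 0.0."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


steps = total_steps(TRAIN_EXAMPLES, EFFECTIVE_BATCH, EPOCHS)
warmup = warmup_steps(steps, WARMUP_RATIO)
print(steps, warmup)
```

Under these assumptions the run is a little over four thousand optimizer steps, with warmup covering roughly the first 125.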

Gemma 4 is multimodal, but this project trains only the text path:

```text
Train:
- model.language_model.*
- lm_head

Freeze:
- vision tower
- audio tower
- vision/audio adapters
```
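The train/freeze split above can be sketched as a name-based filter over the model's parameters. This is an illustration under assumptions: the prefixes are taken from the listing above and may not match the checkpoint's real parameter names exactly, and `freeze_non_text_params` is a hypothetical helper. It works on any object exposing `named_parameters()`, such as a PyTorch module.

```python
# Prefixes taken from the listing above; verify them against the real model.
TRAINABLE_PREFIXES = ("model.language_model.", "lm_head")


def is_trainable(param_name: str) -> bool:
    """True when a parameter belongs to the text path."""
    return param_name.startswith(TRAINABLE_PREFIXES)


def freeze_non_text_params(model):
    """Enable gradients only for text-path parameters.

    Returns (trainable_names, frozen_names) so the split can be inspected.
    """
    trainable, frozen = [], []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name)
        (trainable if param.requires_grad else frozen).append(name)
    return trainable, frozen
```

Printing both returned lists before training is a cheap way to confirm that nothing from the vision or audio towers stayed trainable. If the output head shares weights with the embeddings, it may not appear under its own name.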

## Training Format

The project uses TRL's conversational prompt/completion format:

```json
{
  "prompt": [
    {
      "role": "user",
      "content": "Tulis resep masakan Indonesia berjudul: \"Tumis Kangkung Tempe\"..."
    }
  ],
  "completion": [
    {
      "role": "assistant",
      "content": "Bahan:\n- ...\n\nLangkah:\n1. ..."
    }
  ]
}
```

The choice of format mattered. In this stack, the alternative `messages` format with `assistant_only_loss=True` produced unstable loss.
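A dataset row can be mapped to this record shape with a small helper. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the prompt uses only the short form from the example input, since the full prompt text is elided in the JSON above.

```python
def format_completion(ingredients, steps):
    """Render the assistant turn in the Bahan/Langkah layout."""
    bahan = "\n".join(f"- {item}" for item in ingredients)
    langkah = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Bahan:\n{bahan}\n\nLangkah:\n{langkah}"


def to_prompt_completion(row):
    """Convert one recipe row into a prompt/completion training record."""
    title = row["title"]
    # Assumption: short prompt only; the project's full prompt may say more.
    prompt = f'Tulis resep masakan Indonesia berjudul: "{title}".'
    return {
        "prompt": [{"role": "user", "content": prompt}],
        "completion": [
            {
                "role": "assistant",
                "content": format_completion(row["ingredients"], row["steps"]),
            }
        ],
    }
```

Keeping the user turn and the assistant turn in separate fields makes it explicit which part of each example is the target text.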

## Results

Fine-tuning improved every scored dimension except language fidelity, which dipped slightly from 5.00 to about 4.6.

| Dimension | Base | Fine-tuned |
|---|---:|---:|
| Language fidelity | 5.00 | ~4.6 |
| Format compliance | 3.90 | ~4.95 |
| Ingredient plausibility | 3.10 | ~3.5 |
| Step coherence | 3.20 | ~3.9 |
| Dish authenticity | 2.70 | ~3.25 |
| Overall | 3.58 | ~4.0 |

The strongest gains were:

- More consistent `Bahan:` / `Langkah:` formatting
- Better recipe length discipline
- More natural Indonesian cooking vocabulary
- Better common-dish ingredient profiles
- Better structure for common dishes like tumis, pepes, rendang, sambal, and gulai
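The per-dimension changes in the results table can be computed directly. The fine-tuned scores are approximate in the source, so the deltas below are approximate too; every key name except `dish_authenticity` is a hypothetical label.

```python
BASE = {
    "language_fidelity": 5.00,
    "format_compliance": 3.90,
    "ingredient_plausibility": 3.10,
    "step_coherence": 3.20,
    "dish_authenticity": 2.70,
}
# Approximate values, as reported with "~" in the table above.
FINE_TUNED = {
    "language_fidelity": 4.6,
    "format_compliance": 4.95,
    "ingredient_plausibility": 3.5,
    "step_coherence": 3.9,
    "dish_authenticity": 3.25,
}

# Positive delta means the fine-tuned model scored higher.
deltas = {name: round(FINE_TUNED[name] - BASE[name], 2) for name in BASE}
largest_gain = max(deltas, key=deltas.get)
regressions = [name for name, delta in deltas.items() if delta < 0]

for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:26s} {delta:+.2f}")
```

Format compliance shows the largest gain, while dish authenticity improves by about half a point and remains the lowest-scoring dimension.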

## Critical Inference Setting

One important lesson from the project: the fine-tuned model needs repetition control.

For Hugging Face Transformers inference, use:

```python
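# Assumes `model`, `tok` (tokenizer), and `inputs` (the tokenized prompt) are already loaded.
output_ids = model.generate(
    **inputs,
    max_new_tokens=1280,        # upper bound on generated tokens
    do_sample=False,            # greedy decoding
    repetition_penalty=1.05,    # mild penalty on already-generated tokens
    no_repeat_ngram_size=6,     # block any repeated 6-token sequence
    pad_token_id=tok.eos_token_id,
)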
```

Without `no_repeat_ngram_size=6`, long recipes can fall into repeated ingredient-list loops.

For GGUF runtimes such as llama.cpp or LM Studio, use the DRY sampler, which is the closest equivalent, with an allowed length of about 6.
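To spot the looping failure in saved outputs, a simple repeated n-gram check can help. The sketch below is a rough diagnostic with hypothetical function names. It counts whitespace-separated words rather than model tokens, so it only approximates what `no_repeat_ngram_size=6` enforces during generation.

```python
from collections import Counter


def repeated_ngrams(text: str, n: int = 6) -> dict:
    """Return every word n-gram that occurs more than once, with its count."""
    # Assumption: whitespace words stand in for model tokens.
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    return {" ".join(gram): count for gram, count in counts.items() if count > 1}


def has_loop(text: str, n: int = 6) -> bool:
    """True when the text repeats any sequence of n words."""
    return bool(repeated_ngrams(text, n))
```

Running `has_loop` over a batch of generations gives a quick count of how many outputs fell into a repetition loop under a given decoding setting.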

## GGUF Deployment

The model was also converted to GGUF for local and CPU-friendly use.

Available quantizations:

| Quant | Approx. size | Use case |
|---|---:|---|
| Q4_K_M | ~3.2 GB | Default portable version |
| Q8_0 | ~4.7 GB | Higher quality, more RAM |

The GGUF model can run with llama.cpp, LM Studio, or other GGUF-compatible runtimes.

## What Worked

The project worked well for:

- Common Indonesian home-cooking recipes
- Structured recipe generation
- Concise recipe output
- Natural Indonesian recipe phrasing
- Common ingredients and cooking methods

Examples of stronger categories:

- Ayam
- Ikan
- Sapi
- Kambing
- Tahu
- Tempe
- Telur
- Udang
- Sambal
- Tumis
- Pepes
- Rendang-style dishes

## Limitations

This is not a perfect cookbook model.

Known limitations:

- Rare regional dishes can become generic.
- Some defining ingredients may be omitted.
- Diet or modifier terms such as MPASI, DEBM, basah, or kering may be ignored.
- The model may produce plausible but not authentic recipes.
- Some outputs may contain minor formatting or fraction glitches.
- Recipes should be checked before cooking.

The main remaining bottleneck is dataset coverage, especially for regional and specialty dishes.

## Lessons Learned

The biggest technical lessons:

1. Use the native ROCm 7.2 PyTorch wheel on MI300X.
2. Avoid older ROCm wheels for this Gemma 4 bf16 training path.
3. Use prompt/completion format with TRL for this stack.
4. Always run a cheap quick-validation training pass before a full run.
5. Judge the base model before fine-tuning.
6. Automatic metrics are not enough for recipe quality.
7. `no_repeat_ngram_size=6` is critical for stable inference.
8. Dataset coverage matters more than another epoch for rare dishes.

## Cost and Runtime

The full successful cycle was inexpensive because MI300X training was fast for this model size.

Approximate reference run:

| Phase | Approx. cost |
|---|---:|
| Setup and debugging | ~$2.50 |
| Quick validation | ~$1.50 |
| Full training | ~$3.00 |
| Evaluation iterations | ~$2.00 |
| GGUF conversion and upload | ~$1.30 |
| Idle/debugging slack | ~$4.00 |
| Total | ~$14 |

Future cycles should be cheaper because the stack and gotchas are now documented.
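The cost table can be checked with a few lines of arithmetic. The figures are the approximate values from the table, and the phase keys are hypothetical labels.

```python
# Approximate per-phase costs in USD, copied from the table above.
COSTS_USD = {
    "setup_and_debugging": 2.50,
    "quick_validation": 1.50,
    "full_training": 3.00,
    "evaluation_iterations": 2.00,
    "gguf_conversion_and_upload": 1.30,
    "idle_debugging_slack": 4.00,
}

total = round(sum(COSTS_USD.values()), 2)
training_share = COSTS_USD["full_training"] / total
slack_share = COSTS_USD["idle_debugging_slack"] / total

print(f"total ${total:.2f}, training {training_share:.0%}, slack {slack_share:.0%}")
```

On these numbers the full training run is about a fifth of the total, and idle or debugging slack is the largest single line item.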

## Links

- Base model: [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it)
- Fine-tuned model: [`junwatu/resep-ID-gemma-4-E2B-it`](https://huggingface.co/junwatu/resep-ID-gemma-4-E2B-it)
- GGUF model: [`junwatu/resep-ID-gemma-4-E2B-it-gguf`](https://huggingface.co/junwatu/resep-ID-gemma-4-E2B-it-gguf)
- Dataset: [`junwatu/indonesian-recipes`](https://huggingface.co/datasets/junwatu/indonesian-recipes)
- Live recipe demo: [`junwatu/koki-ai`](https://huggingface.co/spaces/junwatu/koki-ai)

## License

This project inherits the Gemma Terms of Use from the base model.