# Credit Assessment Curriculum: Qwen2.5-7B + GRPO

Can an LLM learn to be a loan officer without ever seeing a real loan? This is a LoRA adapter on Qwen2.5-7B-Instruct, trained with an SFT warmup followed by three-phase, per-task curriculum GRPO inside a self-built OpenEnv environment that simulates Indian-bank loan underwriting (CIBIL, FOIR, LTV, RERA).

Result: 81.7% → 96.7% overall accuracy (+15.0pp) on 60 held-out applicants. Built for the Scaler + Meta + Hugging Face OpenEnv Hackathon 2026 (Theme #4 Self-Improvement + #3.1 World Modeling).

- 🔗 Try the env: iamnijin/credit-assessment-env Space
- 🔗 Train it yourself: Colab notebook
- 🔗 Code: github.com/Nijin-P-S/Credit_Assessment_Env
- 🔗 Slide deck: Google Slides


## The problem

LLMs are great at pattern-matching ("high income, looks approvable"). They are bad at precise rule adherence: CIBIL 699 ≠ CIBIL 700, FOIR 50.1% ≠ FOIR 49.9%, and RBI's tiered LTV cap is 90/80/75% depending on the loan slab. A real loan officer has to nail those edges every time. That's the gap I wanted to close.
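For concreteness, here is the tiered LTV rule as a minimal sketch; ₹30 lakh and ₹75 lakh are the standard RBI slab boundaries, but whether the environment uses exactly these cut-offs internally is my assumption.

```python
def max_ltv_pct(loan_amount_inr: float) -> float:
    """Illustrative RBI tiered LTV cap for home loans.

    Slab boundaries follow the standard RBI norms (up to ₹30 lakh, ₹30–75 lakh,
    above ₹75 lakh); the exact values inside the environment are an assumption.
    """
    if loan_amount_inr <= 30_00_000:     # up to ₹30 lakh
        return 90.0
    if loan_amount_inr <= 75_00_000:     # ₹30–75 lakh
        return 80.0
    return 75.0                          # above ₹75 lakh
```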

## The environment

I built an OpenEnv environment with 3 escalating loan types:

| Task | Loan type | Key challenge |
|---|---|---|
| 1 · Easy | Personal | CIBIL, FOIR, employment |
| 2 · Medium | Vehicle | + LTV ratio, collateral |
| 3 · Hard | Home | + RBI tiered LTV, RERA compliance |

The environment has 4 actions (approve · reject · request_docs · counter_offer), multi-step episodes (the applicant responds to request_docs), 10 hand-crafted trap profiles, and a deterministic ground-truth oracle that doubles as the reward.

The reward is asymmetric to encode real NPA economics: rejecting a good applicant costs −5 (lost revenue), approving a bad loan costs −15 (NPA risk), and approving a non-RERA home loan costs −20 (regulatory liability).
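A minimal sketch of that reward shape, assuming a flat positive reward for correct decisions and a small default penalty for other mistakes (only the −5/−15/−20 magnitudes come from the environment):

```python
def reward(decision: str, ground_truth: str, is_home_loan: bool, rera_compliant: bool) -> float:
    """Illustrative asymmetric reward; only the penalty magnitudes are from the env."""
    if decision == ground_truth:
        return 10.0                      # assumed reward for a correct call
    if decision == "approve" and is_home_loan and not rera_compliant:
        return -20.0                     # regulatory liability (non-RERA home loan)
    if decision == "approve" and ground_truth == "reject":
        return -15.0                     # NPA risk (approved a bad loan)
    if decision == "reject" and ground_truth == "approve":
        return -5.0                      # lost revenue (rejected a good applicant)
    return -1.0                          # assumed mild penalty for other mistakes
```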

## The training pipeline


The pipeline that actually worked (after one attempt that degraded the model):

1. SFT warmup: 600 supervised examples teach the chain-of-thought format and the rule-walk style. This puts the policy in a region where GRPO's advantage signal is informative. Without this step, GRPO got stuck on certain trap profiles.
2. Phase 1: Personal Loans only (foundation rules: CIBIL, FOIR, employment).
3. Phase 2: adds Vehicle Loans + a 20% replay buffer from Phase 1 (introduces the LTV cap).
4. Phase 3: adds Home Loans + a 20% replay buffer from earlier phases (RBI tiered LTV + RERA).

Each phase gates on a 50-sample held-out evaluation (≥60% mastery required to advance). The replay buffer prevents catastrophic forgetting on previously-mastered loan types.
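A compact sketch of the replay mixing and mastery gate, assuming prompts are plain strings (helper names here are illustrative, not the repo's API):

```python
import random

MASTERY_THRESHOLD = 0.60   # >=60% on the 50-sample held-out gate to advance
REPLAY_FRACTION = 0.20     # 20% of each batch replays earlier loan types

def mix_with_replay(fresh_prompts, replay_prompts, batch_size=32, seed=0):
    """Build one training batch: mostly current-phase prompts, plus 20% replay."""
    rng = random.Random(seed)
    n_replay = int(REPLAY_FRACTION * batch_size) if replay_prompts else 0
    batch = rng.sample(fresh_prompts, batch_size - n_replay)
    if n_replay:
        batch += rng.choices(replay_prompts, k=n_replay)
    rng.shuffle(batch)
    return batch

def passes_gate(n_correct, n_eval=50):
    """Phase advancement check on the held-out evaluation."""
    return n_correct / n_eval >= MASTERY_THRESHOLD

print(passes_gate(32))   # 32/50 = 64% -> advance to the next phase
```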

Hyperparameters: LoRA rank 32, alpha 64; learning rate 1e-6; GRPO beta=0.3; 8 generations per prompt; max completion length 512 (so chain-of-thought isn't truncated).
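Expressed as configs, that is roughly the following; only the values listed above come from training, while `target_modules` and anything else not mentioned is an assumption or a PEFT/TRL default.

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA adapter settings reported above; target_modules is an assumption
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

# GRPO settings reported above; everything else stays at TRL defaults
grpo_config = GRPOConfig(
    output_dir="credit-assessment-grpo",
    learning_rate=1e-6,
    beta=0.3,                    # KL penalty coefficient
    num_generations=8,           # generations per prompt
    max_completion_length=512,   # keep the chain-of-thought from being truncated
)
```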

## Results

### Baseline vs Trained

| Loan type | Baseline | Trained | Δ |
|---|---|---|---|
| Personal (easy) | 80% | 100% | +20pp ✅ |
| Vehicle (medium) | 70% | 98% | +28pp ✅ |
| Home (hard) | 95% | 92% | −3pp (within sampling noise on 20 samples) |
| Overall (60 samples) | 81.7% | 96.7% | +15.0pp |

The Home Loan delta is statistically indistinguishable from noise (the 95% Wilson CI on 20 samples is roughly ±15pp). Personal and Vehicle deltas are well outside noise.
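To reproduce that check, statsmodels computes the Wilson interval directly; the 18-of-20 count below is illustrative, not the exact evaluation tally.

```python
from statsmodels.stats.proportion import proportion_confint

# Wilson score interval for an illustrative 18 correct out of 20 home-loan samples
low, high = proportion_confint(count=18, nobs=20, alpha=0.05, method="wilson")
print(f"95% Wilson CI: {low:.1%} to {high:.1%}")
# With only 20 samples the interval spans tens of percentage points,
# so a 3pp drop is well within sampling noise.
```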

For context against general-purpose APIs, on a matched 30-sample sanity check the trained Qwen scores 96.7% vs GPT-4o-mini's 83.3%. The rule-based oracle still hits 100% by construction (it is the ground truth); the LLM's value lies in narrative understanding, generalization to new loan products, and the self-improvement loop, not in beating the oracle.

### Per-phase mastery


100% on Personal, 98% on Vehicle, 92% on Home: every phase cleared the mastery gate on first attempt thanks to the SFT warmup. GRPO loss stayed near zero throughout, with visible spikes at phase transitions (a new loan type entering the distribution).

## What this is not

It's a LoRA adapter, not a foundation model. It's narrowly trained on synthetic Indian-bank loan applications conforming to RBI norms. It will not give you good underwriting decisions on US mortgages or business loans without further fine-tuning. The environment is designed to be extended with new loan types in 4 files (generator + ground_truth + reward + router), not to be a general credit risk model.
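Purely as a sketch of that extension surface (all names below are hypothetical; see the repo for the real file layout), adding a loan type amounts to registering a generator, a ground-truth oracle, and a reward:

```python
# Hypothetical example of registering a new loan type; the real repo's
# module and function names may differ.
def generate_education_applicant(rng):
    return {"cibil": rng.randint(550, 820), "foir_pct": rng.uniform(20, 70)}

def education_ground_truth(applicant):
    return "approve" if applicant["cibil"] >= 700 and applicant["foir_pct"] <= 50 else "reject"

def education_reward(decision, applicant):
    truth = education_ground_truth(applicant)
    return 10.0 if decision == truth else (-15.0 if decision == "approve" else -5.0)

LOAN_TYPE_REGISTRY = {
    "education": {
        "generator": generate_education_applicant,
        "ground_truth": education_ground_truth,
        "reward": education_reward,
    },
}
```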

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base, "iamnijin/credit-assessment-curriculum")

prompt = """You are a senior loan officer at an Indian bank. ...
Applicant: 32y, CIBIL 740, monthly income ₹1.25L, FOIR 38%, requests ₹8L personal loan ...
Respond with JSON: {"decision": ..., "reasoning": ...}"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```