# Home Credit - Credit Risk Model Stability
This repository contains the complete source code, pretrained models, and documentation for a state-of-the-art (SOTA) credit risk stability model. The solution was developed for the Kaggle Home Credit - Credit Risk Model Stability competition and focuses on predicting default probability while maintaining stable performance over time.
## Key Performance
| Metric | Score | Description |
|---|---|---|
| Stability Score | 0.67701 | Official competition metric (Gini with stability penalty). |
| AUC | 0.8308 | Raw predictive power (Single Fold). |
| Slope | ~0.00 | Performance degradation over time (near zero is ideal). |
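The Stability Score above can be reproduced from per-week Gini values. The sketch below assumes the formula published on the competition page: mean weekly Gini, plus a falloff penalty of `88.0 * min(0, slope)` from a linear trend fit, minus `0.5 *` the standard deviation of the trend residuals.

```python
import numpy as np

def stability_score(weekly_gini: np.ndarray) -> float:
    """Stability metric sketch: mean weekly Gini, penalized for a
    downward trend and for high week-to-week variance."""
    weeks = np.arange(len(weekly_gini))
    slope, intercept = np.polyfit(weeks, weekly_gini, 1)
    residuals = weekly_gini - (slope * weeks + intercept)
    return weekly_gini.mean() + 88.0 * min(0.0, slope) - 0.5 * residuals.std()

# A flat Gini series keeps its mean; a declining one is penalized hard,
# which is why a near-zero slope matters as much as raw AUC.
print(stability_score(np.array([0.55, 0.55, 0.55, 0.55])))
print(stability_score(np.array([0.60, 0.55, 0.50, 0.45])))
```

The harsh `88.0` multiplier on a negative slope explains the table above: a model with slightly lower AUC but a flat weekly trend can outscore a stronger but degrading one.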
## Repository Structure
This project is organized as a production-ready Python package with clear separation of concerns.
```
.
├── models/                 # Pretrained CatBoost models (10GB+)
│   ├── catboost_fold_1.cbm
│   └── ...
├── src/                    # Core source code
│   ├── data/               # Polars-based data pipeline & aggregation logic
│   ├── features/           # Feature engineering & adversarial selection
│   ├── models/             # Trainer wrapper for CatBoost/LGBM
│   └── validation/         # Stability-aware cross-validation splitters
├── notebooks/              # Experimentation labs (Jupyter)
│   ├── 01_baseline...      # Initial feasibility study
│   ├── 02_feature...       # Deep feature engineering (Depth 0/1/2)
│   ├── 05_champion...      # FINAL training script (GPU required)
│   └── ...
├── docs/                   # Detailed technical reports
│   └── reports/            # Technical evolution, summary, and appendices
├── training_artifacts/     # Logs and OOF predictions
└── verify_model.py         # Quick inference verification script
```
## Getting Started
### 1. Prerequisites
- Python 3.10+
- NVIDIA GPU (Recommended for training, optional for inference)
- RAM: 32GB+ (for full data processing)
### 2. Installation
Clone the repository and install dependencies. Note that this repo uses Git LFS for model weights.
```bash
# Install Git LFS first
git lfs install

# Clone repository
git clone https://huggingface.co/Lyes930/home-credit-risk-model-v1
cd home-credit-risk-model-v1

# Install Python dependencies
pip install -r requirements.txt
```
### 3. Data Preparation
Due to licensing, the raw dataset cannot be hosted here. Please download it from Kaggle:
- Go to the Competition Data Page.
- Download and unzip the data.
- Place the `csv_files` or `parquet_files` folders inside a `data/` directory in the root of this repo.
Structure should look like:
```
data/
├── parquet_files/
│   ├── train/
│   └── test/
└── feature_definitions.csv
```
## Usage
### Inference (Verification)
To verify the pretrained models and run predictions on the training data (as a smoke test):
```bash
python verify_model.py
```
This script will load the models from the root directory, generate features, and output the AUC score.
### Training (Retrain from Scratch)
If you have a GPU environment, you can reproduce the training process:
1. Open `notebooks/05_champion_optimization.ipynb`.
2. Ensure your `data/` directory is populated.
3. Run all cells. This will train 5 folds of CatBoost models and save them.
## Technical Highlights
This solution differentiates itself through robust engineering rather than complex ensembles:
- Polars Data Engine: Replaced Pandas with Polars to handle 1.5M rows x 1600 columns with highly efficient memory usage (Lazy API).
- Depth-2 Aggregation: Implemented a double-aggregation strategy (`Payment -> Contract -> User`) to capture deep historical credit behavior.
- Adversarial Validation: Used a time-based discriminator to remove features that drift significantly over time, ensuring model stability.
- No "Metric Hacking": We proved that artificial score reduction (hacking) hurts performance on robust models. We stuck to honest probabilities.
For a deep dive into the architectural decisions, please read the Technical Evolution Path.
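The adversarial-validation step listed above can be sketched as follows. This is a minimal illustration using scikit-learn's `GradientBoostingClassifier` as the time discriminator; the actual selection logic lives in `src/features/`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def drifting_features(X_early, X_late, names, top_k=1):
    """Fit a discriminator that separates early rows from late rows;
    the features it relies on most are the ones drifting over time
    and are candidates for removal."""
    X = np.vstack([X_early, X_late])
    y = np.array([0] * len(X_early) + [1] * len(X_late))
    clf = GradientBoostingClassifier(random_state=0).fit(X, y)
    order = np.argsort(clf.feature_importances_)[::-1]
    return [names[i] for i in order[:top_k]]
```

If the discriminator cannot beat chance after the flagged features are dropped, the remaining feature set is (approximately) stationary, which is exactly what the stability metric rewards.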
## Citation & Acknowledgements
If you use this code or ideas in your research, please cite:
```bibtex
@misc{home-credit-risk-v1,
  author       = {Lyes930},
  title        = {Home Credit - Credit Risk Model Stability Solution},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyes930/home-credit-risk-model-v1}}
}
```
Special thanks to the Kaggle community and Home Credit Group for the challenging dataset.