💕 Relationship Longevity Predictor

An ensemble ML model that predicts relationship compatibility and longevity based on personal and professional profiles of two individuals.

This is a prediction engine, not a matchmaking system. Given two individuals' personal attributes, values, interests, and personality traits, it predicts how likely their relationship is to succeed.

Model Architecture

Ensemble of 3 gradient-boosted tree models with 113 engineered dyadic features:

| Model | Weight | AUC-ROC | F1 |
|---|---|---|---|
| XGBoost | 0.40 | 0.8852 | 0.6013 |
| LightGBM | 0.35 | 0.8912 | 0.6351 |
| CatBoost | 0.25 | 0.8661 | 0.5974 |
| Ensemble | - | 0.8842 | 0.6198 |

Best single model: LightGBM (AUC-ROC = 0.891, F1 = 0.635)

Performance Metrics (5-Fold Cross-Validation)

| Metric | Value |
|---|---|
| AUC-ROC | 0.891 |
| AUC-PR | 0.699 |
| Accuracy | 85.3% |
| F1 Score | 0.635 |
| Precision | 56.8% |
| Recall | 72.0% |
| Brier Score | 0.101 |
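
The metrics in this table can be computed with scikit-learn as in the minimal sketch below. It is not the project's evaluation code: the `summarize` helper, the 0.5 decision threshold, and the toy labels are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def summarize(y_true, y_prob, threshold=0.5):
    """Compute the table's metrics for one set of predicted probabilities.

    The threshold of 0.5 is an assumption; the README does not state
    which threshold was used for the reported numbers.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        # Ranking metrics use the raw probabilities.
        "auc_roc": roc_auc_score(y_true, y_prob),
        "auc_pr": average_precision_score(y_true, y_prob),
        # Threshold metrics use the hard 0/1 predictions.
        "accuracy": float((y_pred == y_true).mean()),
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        # Brier score is the mean squared error of the probabilities.
        "brier": brier_score_loss(y_true, y_prob),
    }


# Toy labels and probabilities, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4]
metrics = summarize(y_true, y_prob)
```

In a cross-validated setting, a function like this would typically be applied to each held-out fold and the results averaged.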

What the Model Learns

Top Predictive Features (from SHAP analysis)

  1. Attractiveness Perception Product: Mutual physical attraction between partners
  2. Probability Partner Wants to Date: Perceived reciprocal interest
  3. Humor Perception Product: Shared sense of humor (mutual ratings)
  4. Total Self-Awareness Gap: How accurately people perceive themselves vs how partners see them
  5. Interest Correlation: Overlap in hobbies and interests
  6. Shared Interests Score: Partner's rating of shared interests
  7. Interest Diversity: Breadth of the dater's interests
  8. Confidence Calibration: How well people predict their own attractiveness to others
  9. Intelligence Value Fulfillment: Whether the partner meets intelligence expectations
  10. Expectation Meets Reality: Gap between expected and actual satisfaction

Key Insights

  • Mutual attraction matters most, but it is the product (both people finding each other attractive) that predicts a match, not one-sided attraction
  • Humor compatibility ranks #3: pairs who both rate each other as funny are much more likely to match
  • Self-awareness is a strong predictor: people who accurately assess how others see them are more likely to match
  • Shared interests matter: the correlation between two people's interests is more predictive than any single interest
  • Value alignment (what you care about vs what your partner delivers) also contributes to predicted compatibility
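
Two of the features named above, the perception product and the interest correlation, can be sketched as follows. This is an illustration, not the training script's implementation: the function names, the 1-10 rating scale, and the 0.0 fallback for constant vectors are assumptions.

```python
import numpy as np


def perception_product(a_rates_b, b_rates_a):
    """Multiply the two ratings so the score is high only when both are high."""
    return a_rates_b * b_rates_a


def interest_correlation(interests_a, interests_b):
    """Pearson correlation between two people's interest ratings."""
    a = np.asarray(interests_a, dtype=float)
    b = np.asarray(interests_b, dtype=float)
    if a.std() == 0 or b.std() == 0:
        # Correlation is undefined for a constant vector;
        # returning 0.0 is an assumed fallback.
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])


# Hypothetical ratings on a 1-10 scale, invented for illustration.
mutual = perception_product(8, 8)      # both rate each other 8
one_sided = perception_product(10, 2)  # strong but one-sided attraction

aligned = interest_correlation([1, 2, 3], [2, 4, 6])
opposed = interest_correlation([1, 2, 3], [3, 2, 1])
```

Here the one-sided pair has a higher average rating in one direction, yet the product ranks the mutual pair higher, which is the behavior the first insight describes.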

Feature Engineering

113 features built from raw dyadic profiles: 48 engineered features in 10 categories, plus about 65 raw profile fields:

| Category | Features | Description |
|---|---|---|
| Perception Gap | 5 | How you rate your partner vs how they rate you (per trait) |
| Mutual Scores | 5 | Average of both partners' ratings (per trait) |
| Perception Products | 5 | Multiplicative interaction of mutual ratings |
| Value Fulfillment | 5 | Does your partner deliver what you value most? |
| Self-Awareness | 5 | Self-perception vs partner perception gap |
| Age Features | 4 | Gap, gap², is_older, combined age |
| Interest Features | 5 | Diversity, intensity, range, correlation |
| Importance Alignment | 8 | Do both people value the same traits? |
| Expectation Features | 2 | Expectation calibration, meets-reality score |
| Demographics | 4 | Race match, gender, same race importance |
| Raw Profiles | ~65 | Original personality ratings, interests, preferences |
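
As a rough illustration of the Perception Gap, Mutual Scores, and Age Features rows, the sketch below derives them with pandas. All column names (`attractive_rating_of_partner`, `partner_age`, and so on) and the `add_dyadic_features` helper are hypothetical; the actual names live in `feature_columns.joblib` and the training script.

```python
import pandas as pd


def add_dyadic_features(df, traits):
    """Add per-trait gap and mutual-mean columns plus the four age features."""
    out = df.copy()
    for trait in traits:
        mine = out[f"{trait}_rating_of_partner"]    # how I rate my partner
        theirs = out[f"{trait}_rating_by_partner"]  # how my partner rates me
        out[f"{trait}_gap"] = mine - theirs
        out[f"{trait}_mutual"] = (mine + theirs) / 2
    age_gap = out["age"] - out["partner_age"]
    out["age_gap"] = age_gap
    out["age_gap_sq"] = age_gap ** 2
    out["is_older"] = (age_gap > 0).astype(int)
    out["combined_age"] = out["age"] + out["partner_age"]
    return out


# Two invented encounters, for illustration only.
encounters = pd.DataFrame({
    "attractive_rating_of_partner": [8, 4],
    "attractive_rating_by_partner": [6, 9],
    "age": [27, 24],
    "partner_age": [25, 30],
})
engineered = add_dyadic_features(encounters, ["attractive"])
```

Squaring the age gap lets a tree model or a linear baseline treat large gaps in either direction similarly, while `is_older` keeps the sign.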

Training Data

Fisman Speed Dating Experiment (mstz/speeddating)

  • 1,048 speed-dating encounters between participants
  • Columbia Business School, 2002-2004
  • 17.7% positive match rate (class imbalance handled via scale_pos_weight + balanced weighting)
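
For reference, the two weighting schemes mentioned above can be derived from the 17.7% match rate alone. This is a sketch under assumptions: the helper names are hypothetical, and the README does not say which scheme was applied to which model.

```python
def scale_pos_weight(positive_rate):
    """Negative-to-positive ratio, the common choice for scale_pos_weight."""
    return (1.0 - positive_rate) / positive_rate


def balanced_class_weights(positive_rate):
    """Per-class weights following the "balanced" heuristic:
    n_samples / (n_classes * class_count), expressed via the class rate."""
    return {
        0: 1.0 / (2.0 * (1.0 - positive_rate)),
        1: 1.0 / (2.0 * positive_rate),
    }


ratio = scale_pos_weight(0.177)
weights = balanced_class_weights(0.177)
```

Both schemes weight a match about 4.65 times as heavily as a non-match; they differ only in overall scale.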

Usage

import json

import joblib
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier

# Load the three models and the ordered feature column names
xgb = joblib.load("xgboost_model.joblib")
lgb = joblib.load("lightgbm_model.joblib")
cat = CatBoostClassifier()
cat.load_model("catboost_model.cbm")
feature_cols = joblib.load("feature_columns.joblib")

# Ensemble settings (weights, threshold, feature list, metrics)
with open("ensemble_config.json") as f:
    config = json.load(f)

# Prepare one row of 113 features, ordered as in feature_columns.joblib.
# The zero vector is only a placeholder: replace it with real feature values.
your_feature_vector = np.zeros(len(feature_cols))
features = pd.DataFrame([your_feature_vector], columns=feature_cols)

# Predict the probability of the positive class (a match) with each model
xgb_prob = xgb.predict_proba(features)[:, 1]
lgb_prob = lgb.predict_proba(features)[:, 1]
cat_prob = cat.predict_proba(features)[:, 1]

# Weighted ensemble; take element 0 because a single row was scored
score = float((0.40 * xgb_prob + 0.35 * lgb_prob + 0.25 * cat_prob)[0])

# Interpret
if score >= 0.7:
    print("High Compatibility ❤️")
elif score >= 0.4:
    print("Moderate Compatibility 💛")
else:
    print("Low Compatibility 💔")
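
When scoring several pairs at once, the weighting and banding steps can be wrapped in small helpers. The sketch below reuses the weights and cut-offs from the snippet above; the names `WEIGHTS`, `ensemble_score`, and `compatibility_band` are hypothetical and are not part of `predictor.py`.

```python
import numpy as np

# Ensemble weights as listed in the Model Architecture table.
WEIGHTS = {"xgboost": 0.40, "lightgbm": 0.35, "catboost": 0.25}


def ensemble_score(probs, weights=WEIGHTS):
    """Weighted average of per-model probabilities (one entry per pair)."""
    if set(probs) != set(weights):
        raise ValueError("probs and weights must name the same models")
    total = sum(weights.values())
    weighted = sum(
        weights[name] * np.asarray(probs[name], dtype=float) for name in weights
    )
    return weighted / total


def compatibility_band(score):
    """Map a score in [0, 1] to the three bands used above."""
    if score >= 0.7:
        return "High"
    if score >= 0.4:
        return "Moderate"
    return "Low"


# Invented probabilities for two pairs, for illustration only.
scores = ensemble_score({
    "xgboost": [0.8, 0.1],
    "lightgbm": [0.6, 0.2],
    "catboost": [0.4, 0.3],
})
bands = [compatibility_band(s) for s in scores]
```

Dividing by the weight total keeps the result a proper average even if the weights are later changed and no longer sum to 1.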

Visualizations

The figures/ directory contains the following plots:

  • ROC Curves
  • Feature Importance
  • SHAP Summary
  • SHAP Dependence (Top 6 Features)
  • Confusion Matrix
  • Prediction Distribution (predicted probabilities)

Literature Basis

| Paper | Contribution |
|---|---|
| Grinsztajn et al. (NeurIPS 2022), "Why do tree-based models still outperform deep learning on tabular data?" | Validated XGBoost/LightGBM as SOTA for tabular data with <100K rows |
| Fisman et al. (QJE 2006), "Gender Differences in Mate Selection" | Original speed dating experiment; ~70% accuracy with logistic regression |
| Gorishniy et al. (NeurIPS 2021), "Revisiting Deep Learning Models for Tabular Data" | FT-Transformer architecture for tabular data; confirmed tree superiority on small datasets |
| Savcisens et al. (Nature Computational Science 2024), "Using Sequences of Life-events to Predict Human Lives" | life2vec: longitudinal life-event prediction; architecture reusable for dyadic temporal modeling |

Limitations

  • Short-term proxy: The training data captures initial match decisions (4-minute speed dates), not long-term relationship outcomes. The model predicts initial compatibility, which is a proxy for, but not equivalent to, relationship longevity.
  • Sample demographics: Participants were Columbia University students (2002-2004), so results may not generalize to other demographics or cultures.
  • Static features only: No temporal or interaction data. Adding communication patterns, life events, or behavioral signals would likely improve longevity prediction (see the life2vec approach).
  • Class imbalance: With a 17.7% match rate, the model is more reliable at predicting non-matches than matches (precision on the positive class is 56.8%).

Files

| File | Description |
|---|---|
| xgboost_model.joblib | XGBoost classifier (2000 trees) |
| lightgbm_model.joblib | LightGBM classifier (2000 trees) |
| catboost_model.cbm | CatBoost classifier (2000 iterations) |
| ensemble_config.json | Ensemble weights, threshold, feature list, metrics |
| feature_columns.joblib | Ordered list of 113 feature column names |
| race_encoder.joblib | LabelEncoder for race categories |
| evaluation_results.csv | Full evaluation metrics table |
| feature_importance.csv | Feature importance rankings from XGB + LGB |
| predictor.py | Prediction interface class |
| train_relationship_predictor.py | Full training script (reproducible) |
| figures/ | All visualizations (ROC, SHAP, confusion matrix, etc.) |

License

Research use. Based on publicly available academic dataset.


Built with XGBoost, LightGBM, CatBoost, SHAP, and scikit-learn.