---
license: cc-by-nc-4.0
library_name: joblib
pipeline_tag: tabular-classification
tags:
- relationships
- gottman
- survival-analysis
- cox-proportional-hazards
- xgboost
- lightgbm
- catboost
- shap
- ensemble
- tabular-classification
- couples
- social-science
datasets:
- mstz/speeddating
- vedastro-org/15000-Famous-People-Marriage-Divorce-Info
metrics:
- roc_auc
- accuracy
- f1
model-index:
- name: relationship-longevity-predictor-v2
  results:
  - task:
      type: tabular-classification
      name: Relationship Longevity Prediction
    dataset:
      name: Speed Dating + Gottman Divorce + Vedastro Marriages (composite)
      type: custom
    metrics:
    - type: roc_auc
      value: 0.8896
      name: AUC-ROC
    - type: accuracy
      value: 0.859
      name: Accuracy
    - type: f1
      value: 0.630
      name: F1
---

# 💕 Relationship Longevity Predictor - v2.0

**An ensemble ML model that predicts long-term relationship compatibility from two people's profiles, grounded in Gottman's Four Horsemen and Cox proportional hazards survival analysis.**

👉 **[Try the live demo →](https://huggingface.co/spaces/Builder-Neekhil/relationship-longevity-predictor-demo)**

---

## What this is (and isn't)

**Is:** A well-calibrated research artifact. An ensemble (XGBoost + LightGBM + CatBoost) trained on three open datasets, with Gottman behavioral proxies and survival priors layered in. Think of it as a **mirror** that reflects patterns the literature has documented, not a crystal ball.

**Isn't:** A decision tool. Don't break up, propose, or pick a partner based on its output. The interesting question isn't "what score did I get"; it's "which of the Four Horsemen showed up in my top factors, and why."

**Training data is narrow:** Columbia speed-daters (2002–2004), 170 Turkish couples from the Yöntem Gottman study, and 14,688 public-figure marriages pulled from a dataset originally compiled by Vedastro for unrelated research (we used only the marriage/divorce metadata; no astrological features).
Generalization beyond these cohorts is unverified. See Limitations.

---

## 📊 Headline Results

| Metric | v1.0 Baseline | v2.0 Enhanced | Change |
| --- | --- | --- | --- |
| **AUC-ROC** | 0.8842 | **0.8896** | +0.0055 ✅ |
| **AUC-PR** | 0.6933 | **0.7108** | +0.0175 ✅ |
| **Brier Score** | 0.0960 | **0.0934** | -0.0026 ✅ |
| **Accuracy** | 83.5% | **85.9%** | +2.4 pts ✅ |
| **F1 Score** | 0.620 | **0.630** | +0.010 ✅ |
| **Precision** | 52.4% | **58.9%** | +6.5 pts ✅ |

**Key improvement: precision rose 6.5 percentage points (+12.3% relative to v1)**, which means far fewer false positives than v1.

### What Changed

v2.0 adds **20 new features** from two additional data sources:

| Source | Features Added | Signal |
|--------|:-:|---|
| **Gottman Behavioral Model** (Phase 1) | 13 | Contempt, criticism, defensiveness, stonewalling proxy scores derived from 170-couple divorce study |
| **Marriage Duration Survival Model** (Phase 2) | 7 | Longevity priors from 14,688 real marriages (age-risk, relationship-history risk, timing hazard) |

**8 of the 20 new features ranked in the top 30** most important features by SHAP:

| Rank | New Feature | SHAP | Source |
|:---:|---|:---:|---|
| 3 | `gottman_proxy_love_maps` | 0.447 | 🔴 Gottman |
| 4 | `gottman_proxy_contempt_x_stonewalling` | 0.403 | 🔴 Gottman |
| 8 | `gottman_proxy_ratio` | 0.306 | 🔴 Gottman |
| 10 | `gottman_proxy_stonewalling` | 0.279 | 🔴 Gottman |
| 12 | `gottman_proxy_horsemen` | 0.264 | 🔴 Gottman |
| 21 | `gottman_proxy_net_risk` | 0.189 | 🔴 Gottman |
| 27 | `survival_age_gap_risk` | 0.163 | 🔵 Survival |
| 29 | `gottman_proxy_contempt` | 0.160 | 🔴 Gottman |

---

## Phase 1: Gottman Behavioral Model

**Dataset:** Yöntem et al. Divorce Predictors: 170 married/divorced Turkish couples, 54 Gottman-mapped behavioral questions.

**Standalone performance:** AUC = **0.998**, Accuracy = **98.2%** on predicting divorce from behavioral patterns.

The 54 questions map to Gottman's relationship theory:

| Gottman Dimension | Questions | What It Measures |
|---|:-:|---|
| **Shared Goals** | Q1-Q10 | Aligned life direction, quality time, common objectives |
| **Love Maps** | Q11-Q20 | Values alignment, role expectations, compatibility beliefs |
| **Love Maps Deep** | Q21-Q30 | Knowing partner's inner world, stress, hopes, anxieties |
| **Criticism** | Q31-Q32, Q37-Q38 | Attacking character, negative statements, sudden arguments |
| **Contempt** | Q33-Q36, Q39-Q40 | Insults, humiliation, anger escalation, hatred |
| **Defensiveness** | Q41, Q45-Q46, Q48-Q50 | Blame-shifting, victimhood, refusing responsibility |
| **Stonewalling** | Q42-Q44, Q47 | Silence, withdrawal, leaving, shutting down |
| **Deep Contempt** | Q51-Q54 | Attributing meanness, vindictiveness, pathology to partner |

**Top divorce predictor by SHAP:** the `love_maps × shared_goals` interaction. Couples who *both* lack shared goals *and* don't know each other's inner world face the highest divorce risk.
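
A minimal sketch of how the dimension table above could be turned into per-dimension scores and a `love_maps × shared_goals` interaction term. The question ranges are copied from the table; everything else is an assumption for illustration and not the repository's actual feature code: the `GOTTMAN_DIMENSIONS` name, averaging the answers within a dimension, multiplying two scores to form the interaction, and the toy responses.

```python
from statistics import mean

# Question numbers per dimension, copied from the table above (1-based).
# The dictionary name and the snake_case keys are hypothetical.
GOTTMAN_DIMENSIONS = {
    "shared_goals": list(range(1, 11)),
    "love_maps": list(range(11, 21)),
    "love_maps_deep": list(range(21, 31)),
    "criticism": [31, 32, 37, 38],
    "contempt": [33, 34, 35, 36, 39, 40],
    "defensiveness": [41, 45, 46, 48, 49, 50],
    "stonewalling": [42, 43, 44, 47],
    "deep_contempt": [51, 52, 53, 54],
}


def dimension_scores(answers):
    """Average the numeric answers belonging to each dimension.

    answers: dict mapping question number (1..54) to a numeric response.
    Using the mean is an assumption; the real pipeline may aggregate differently.
    """
    return {
        name: mean(answers[q] for q in questions)
        for name, questions in GOTTMAN_DIMENSIONS.items()
    }


def love_maps_x_shared_goals(scores):
    """Interaction term, assumed here to be a simple product of two scores."""
    return scores["love_maps"] * scores["shared_goals"]


# Toy responses: every question answered 1, except Q1-Q10 (3) and Q11-Q20 (2).
answers = {q: 1 for q in range(1, 55)}
answers.update({q: 3 for q in range(1, 11)})
answers.update({q: 2 for q in range(11, 21)})

scores = dimension_scores(answers)
print(scores["shared_goals"], scores["love_maps"], love_maps_x_shared_goals(scores))
```

Keeping the mapping in one dictionary makes it easy to check that each of the 54 questions is assigned to exactly one dimension.
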
### Gottman Proxy Features (mapped to speed dating data)

Since speed dating participants didn't answer the 54 Gottman questions, we created **proxy scores** by mapping their existing personality/perception data to Gottman dimensions:

| Proxy | Derived From |
|---|---|
| `gottman_proxy_contempt` | Low mutual scores + high perception gaps |
| `gottman_proxy_criticism` | Misaligned values + asymmetric ratings |
| `gottman_proxy_defensiveness` | Self-rating inflation vs partner perception |
| `gottman_proxy_stonewalling` | Low engagement, low liking, no shared interests |
| `gottman_proxy_love_maps` | Interest correlation + shared interests + mutual perception accuracy |
| `gottman_proxy_shared_goals` | Value alignment + interest overlap |
| `gottman_proxy_ratio` | The famous Gottman 5:1 positive-to-negative ratio |

---

## Phase 2: Marriage Duration Survival Model

**Dataset:** [vedastro-org/15000-Famous-People-Marriage-Divorce-Info](https://hf.co/datasets/vedastro-org/15000-Famous-People-Marriage-Divorce-Info): 14,688 marriage records from 12,353 famous people.
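
Records of this kind (a duration plus a divorced or still-married outcome) are what the Kaplan-Meier curves in this section are estimated from. The sketch below is a dependency-free illustration of the Kaplan-Meier product-limit estimator on made-up toy data. It is not the repository's code, which per the footer relies on lifelines, and the function name, argument names, and toy values are assumptions.

```python
def kaplan_meier(durations, divorced):
    """Return [(t, S(t))] for every time t at which at least one divorce occurred.

    durations: years each marriage was observed
    divorced:  True if the marriage ended in divorce at that time,
               False if it was still intact when last observed (censored)
    """
    records = sorted(zip(durations, divorced))
    at_risk = len(records)   # marriages still under observation just before t
    survival = 1.0           # running product-limit estimate S(t)
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        events = 0           # divorces at time t
        leaving = 0          # divorces + censored records at time t
        while i < len(records) and records[i][0] == t:
            events += 1 if records[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy data (hypothetical): eight marriages, four of which ended in divorce.
durations = [2, 3, 3, 5, 7, 7, 10, 12]
divorced = [True, True, False, True, False, True, False, False]

curve = kaplan_meier(durations, divorced)
print([(t, round(s, 3)) for t, s in curve])
# [(2, 0.875), (3, 0.75), (5, 0.6), (7, 0.45)]
```

Censored marriages still count toward the at-risk total until the time they drop out, which is why the estimate differs from a plain "share not yet divorced" calculation.
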
### Key Findings

| Finding | Statistic |
|---|---|
| **Overall divorce rate** | 34.5% |
| **Median divorce timing** | 7 years |
| **Most dangerous period** | 3-7 years (41.1% of all divorces) |
| **Love marriage divorce rate** | 34.1% |
| **Arranged marriage divorce rate** | 23.4% (p=0.006, significantly lower) |
| **First marriage divorce rate** | 27.8% |
| **Subsequent marriage divorce rate** | **69.3%** |

### Cox Proportional Hazards Model (Concordance = 0.64)

| Factor | Hazard Ratio | p-value | Meaning |
|---|:---:|:---:|---|
| **Is first marriage** | **0.26** | <0.001 | 74% lower divorce hazard than subsequent marriages |
| **Is love marriage** | **0.77** | 0.002 | 23% lower hazard than non-love marriages |
| **Age at marriage** | **0.96** | <0.001 | Each year older → 4% lower divorce hazard |
| **Marriage number** | **1.34** | <0.001 | Each additional marriage → 34% higher hazard |

### Divorce Timing Distribution

![Divorce Timing](phase2_survival_model/figures/divorce_timing.png)

### Kaplan-Meier Survival Curves

![KM by Type](phase2_survival_model/figures/km_by_type.png)

![KM by Marriage Number](phase2_survival_model/figures/km_by_marriage_number.png)

---

## Model Architecture (v2.0)

**Ensemble of 3 gradient-boosted tree models** with **133 engineered features** (113 original + 13 Gottman + 7 survival):

| Model | Weight | v1 AUC | v2 AUC | Change |
|-------|:---:|:---:|:---:|:---:|
| XGBoost | 0.40 | 0.8852 | 0.8920 | +0.0068 |
| LightGBM | 0.35 | 0.8912 | **0.9011** | +0.0099 |
| CatBoost | 0.25 | 0.8661 | 0.8688 | +0.0027 |
| **Ensemble** | - | 0.8842 | **0.8896** | +0.0055 |

## Visualizations

### v1 vs v2 ROC Comparison

![ROC Comparison](v2_enhanced/figures/roc_comparison.png)

### Metrics Comparison

![Metrics Comparison](v2_enhanced/figures/metrics_comparison.png)

### Feature Source Contribution

![Source Contribution](v2_enhanced/figures/source_contribution.png)

### Enhanced SHAP Summary (v2)

![Enhanced SHAP](v2_enhanced/figures/enhanced_shap_summary.png)

### v1 Visualizations

| | |
|---|---|
| ![ROC Curves](figures/roc_curves.png) | ![SHAP Summary](figures/shap_summary.png) |
| ![Feature Importance](figures/feature_importance.png) | ![Confusion Matrix](figures/confusion_matrix.png) |

---

## Training Data

| Dataset | Records | Role |
|---|:---:|---|
| [mstz/speeddating](https://hf.co/datasets/mstz/speeddating) | 1,048 encounters | Primary training data: individual profiles + match outcome |
| Yöntem et al. Divorce Predictors (Kaggle) | 170 couples | Phase 1: Gottman behavioral feature engineering |
| [vedastro-org/15000-Famous-People-Marriage-Divorce-Info](https://hf.co/datasets/vedastro-org/15000-Famous-People-Marriage-Divorce-Info) | 14,688 marriages | Phase 2: Longevity priors + survival analysis |

## Literature Basis

| Paper | Contribution |
|-------|-------------|
| Grinsztajn et al. (NeurIPS 2022), *"Why do tree-based models still outperform deep learning on tabular data?"* | Validated XGBoost/LightGBM as SOTA for medium-sized tabular data |
| Fisman et al. (QJE 2006), *"Gender Differences in Mate Selection"* | Original speed dating experiment; ~70% accuracy with logistic regression |
| **Gottman & Silver (1999), *"The Seven Principles for Making Marriage Work"*** | **Four Horsemen framework: contempt, criticism, defensiveness, stonewalling** |
| **Yöntem et al. (2019), *"Divorce Prediction Using Correlation Based Feature Selection"*** | **54-question Gottman-mapped divorce predictor; published 97.7% accuracy** |
| Savcisens et al. (Nature Human Behaviour 2024), *"Using Sequences of Life-events to Predict Human Lives"* | life2vec: longitudinal prediction architecture |

## Repo Structure

```
├── # v1.0 Baseline Model
├── xgboost_model.joblib, lightgbm_model.joblib, catboost_model.cbm
├── ensemble_config.json, feature_columns.joblib
├── figures/                              # v1 plots
│
├── # Phase 1: Gottman Behavioral Model
├── phase1_divorce_model/
│   ├── divorce_xgb.joblib, divorce_lgb.joblib, divorce_cat.cbm
│   ├── gottman_recipe.json               # Dimension mappings + importance
│   ├── gottman_mapping.joblib
│   └── figures/                          # SHAP, confusion matrix, dimension importance
│
├── # Phase 2: Survival Model
├── phase2_survival_model/
│   ├── longevity_priors.json             # Base rates by type/era/age/marriage#
│   ├── survival_recipe.json              # Cox PH + KM + timing distributions
│   └── figures/                          # KM curves, Cox hazard ratios, timing
│
├── # v2.0 Enhanced Model (RECOMMENDED)
├── v2_enhanced/
│   ├── enhanced_xgb.joblib, enhanced_lgb.joblib, enhanced_cat.cbm
│   ├── enhanced_config.json              # Weights, features, metrics, improvements
│   ├── enhanced_feature_columns.joblib
│   └── figures/                          # Comparison plots, SHAP
│
└── # Training Scripts (fully reproducible)
    ├── train_relationship_predictor.py   # v1 baseline
    ├── phase1_divorce_model.py           # Gottman behavioral model
    ├── phase2_marriage_duration.py       # Survival analysis
    └── phase3_integration.py             # Integration + comparison
```

## Limitations & Ethics

**Cohort bias.** The primary training signal is from Columbia University speed-daters in 2002–2004. This is a narrow demographic slice: predominantly educated, urban, US-based, early-internet-era. Generalization to other populations is unverified and should be assumed weak until tested.

**Celebrity bias in the survival priors.** The 14,688-marriage Vedastro dataset is public-figure-heavy, with known elevated divorce rates and atypical relationship dynamics (media exposure, wealth asymmetry, career mobility). The arranged-vs-love finding (23.4% vs 34.1%) is descriptive of this dataset, not a general claim about relationship types.

**Dataset provenance.** The Vedastro dataset was originally compiled for astrology research. This model uses only the structured marriage/divorce metadata (age at marriage, marriage number, duration, type, outcome); no astrological variables are used as features.

**Short-horizon proxy.** Speed-dating captures initial match decisions, not long-term outcomes. The Gottman and survival layers partially bridge this gap, but they're proxies, not ground truth.

**Small Gottman sample.** The underlying divorce predictor was trained on 170 couples. The Four Horsemen framework itself is robust across decades of research; the proxy mapping from speed-dating features to Gottman dimensions is approximate and worth questioning.

**Not a decision tool.** Outputs are probabilistic, directional, and should be treated as a conversation starter, not advice. This model should not be used to make real decisions about real relationships.

## License

cc-by-nc-4.0

Research use. Based on publicly available academic datasets.

---

*Built with XGBoost, LightGBM, CatBoost, SHAP, lifelines, and scikit-learn.*