Multi-Layer Trust Verification Framework for SSCS
Thesis: "Design and evaluation of a multi-layer trust verification framework for software supply chain security using runtime behavior analysis, provenance graphs, and dependency-chain risk scoring."
Research Question
How can software supply chain security be improved through a combination of dynamic sandboxing, behavior-based classification, provenance analysis, dependency-chain modeling, and package reputation scoring, and how effectively do these methods distinguish trustworthy from untrustworthy packages before deployment?
Pipeline Overview
| Stage | Description | Output |
|---|---|---|
| 1 | Build benchmark corpus (benign + malicious PyPI) | sscs-benchmark-corpus |
| 2 | Dynamic sandbox analysis (eBPF/strace traces) | sscs-runtime-traces |
| 3 | Graph construction (dependency + provenance) | sscs-graph-features |
| 4 | Trust scoring (4-layer weighted) | sscs-trust-scores |
| 5 | Model comparison (Static/Behavior/Graph/Hybrid) | sscs-model-comparison |
| 6 | Package output (dataset + model + paper) | sscs-trust-verifier |
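
Stage 4 is described only as "4-layer weighted" scoring, and the source does not state the weights or how each layer's sub-score is produced. The following is a minimal sketch of one way such a combination might look, assuming each layer yields a sub-score on the same 0 to 100 scale as the final trust score. The weight values and the names `LAYER_WEIGHTS` and `trust_score` are hypothetical.

```python
# Hypothetical weights: the source does not give the actual values.
LAYER_WEIGHTS = {
    "metadata": 0.15,
    "behavior": 0.40,
    "provenance": 0.25,
    "reputation": 0.20,
}


def trust_score(layer_scores, weights=LAYER_WEIGHTS):
    """Combine per-layer sub-scores (each in 0..100) into one 0..100 score.

    The result is a weighted average, so it stays on the same scale as
    the inputs even if the weights do not sum exactly to 1.
    """
    missing = set(weights) - set(layer_scores)
    if missing:
        raise ValueError(f"missing layer scores: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= layer_scores[name] <= 100.0:
            raise ValueError(f"{name} score out of range: {layer_scores[name]}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * layer_scores[name] for name in weights)
    return weighted_sum / total_weight


example = {"metadata": 80, "behavior": 20, "provenance": 60, "reputation": 40}
print(round(trust_score(example), 2))
```

Normalizing by the total weight keeps the score interpretable if a layer is later reweighted; whether the actual pipeline uses a linear combination at all is an assumption here.
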
Key Results
Detection Performance
| Approach | F1 | Precision | Recall | FP Rate |
|---|---|---|---|---|
| Static (metadata) | 0.72 | 0.74 | 0.70 | 0.26 |
| Behavior (DySec-style) | 0.91 | 0.92 | 0.90 | 0.08 |
| Graph (dependency) | 0.78 | 0.80 | 0.76 | 0.20 |
| Hybrid (all layers) | 0.94 | 0.95 | 0.93 | 0.05 |
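
The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. A small sketch using the values from the table above (the function names and the rounding tolerance are illustrative choices, not part of the pipeline):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the Detection Performance table.
REPORTED = {
    "Static (metadata)": (0.74, 0.70, 0.72),
    "Behavior (DySec-style)": (0.92, 0.90, 0.91),
    "Graph (dependency)": (0.80, 0.76, 0.78),
    "Hybrid (all layers)": (0.95, 0.93, 0.94),
}


def check_reported(rows, tol=0.005):
    """Return, per approach, whether the reported F1 matches P and R to 2 decimals."""
    return {
        name: abs(f1_score(p, r) - reported) <= tol
        for name, (p, r, reported) in rows.items()
    }


for name, ok in check_reported(REPORTED).items():
    print(f"{name}: {'consistent' if ok else 'inconsistent'}")
```

In every row the FP Rate column equals 1 − Precision. The source does not say how that column was computed, so it may be a false discovery rate, FP/(FP+TP), and not FP/(FP+TN).
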
Trust Score Distribution
| Decision | Benign | Malicious |
|---|---|---|
| APPROVE (>80) | 85% | 2% |
| MONITOR (61-80) | 10% | 5% |
| QUARANTINE (31-60) | 4% | 15% |
| BLOCK (0-30) | 1% | 78% |
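
The score bands in the table map directly onto a threshold function. This is a sketch under one assumption: the table lists integer bands (0-30, 31-60, 61-80, >80), so fractional scores are assigned here by treating each upper bound as inclusive. The name `decide` is illustrative.

```python
def decide(score):
    """Map a 0..100 trust score to a deployment decision.

    Bands follow the Trust Score Distribution table. Fractional scores
    (e.g. 60.5) fall into the band above the integer bound; this is an
    assumption, since the table only lists integer ranges.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score > 80:
        return "APPROVE"
    if score > 60:
        return "MONITOR"
    if score > 30:
        return "QUARANTINE"
    return "BLOCK"


for s in (95, 72, 45, 10):
    print(s, decide(s))
```
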
Literature Foundation
Based on a deep literature review of supply chain security research:
- DySec (Mehedi et al., 2025): RF + CombinedTraces, 95.99% F1 on 14,271 PyPI packages
  - 6 trace categories: Filetop, Install, Opensnoop, TCP, SystemCall, Pattern
  - 62 candidate features, 36 selected via Pearson + IMS
- Backstabber's Knife Collection (Ohm et al., 2020): Attack taxonomy with real-world malicious packages across npm, PyPI, Maven, Packagist, RubyGems
- ConfuGuard (2025): Metadata-based package confusion detection across 6 registries, reducing false positives
- Zahan (2023): Software Supply Chain Risk Assessment Framework (SSRIAF)
- OpenSSF Malicious Packages: OSV-format vulnerability reports for the community
Identified Research Gaps
- Static-only limitation: Most tools miss install-time/execution-time behavior
- Ecosystem fragmentation: Research focuses on single ecosystems
- Trust estimation gap: No pre-deployment scoring combining behavior + provenance + reputation
- Dependency-chain reasoning gap: Single packages examined, not transitive chains
- Runtime realism gap: Synthetic examples instead of real ecosystem data
- False-positive problem: Overly aggressive static defenses
- Operational decision gap: No guidance for block/quarantine/approve decisions
- Dataset gap: Shortage of runtime behavior traces with ground truth
Running the Pipeline
```bash
# Install dependencies
pip install -r pipeline/requirements.txt

# Run each stage
python pipeline/stage1_corpus.py    # Build benchmark dataset
python pipeline/stage2_sandbox.py   # Dynamic analysis (ISOLATED ONLY!)
python pipeline/stage3_graphs.py    # Graph construction
python pipeline/stage4_trust.py     # Trust scoring
python pipeline/stage5_models.py    # Model comparison
python pipeline/stage6_package.py   # Package outputs
```
⚠️ SAFETY: Stage 2 installs and executes packages from PyPI. Only run in fully isolated containers/sandboxes. Never run on production systems.
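
The stages can also be driven from a single script that stops at the first failure and refuses to start Stage 2 unless the caller confirms isolation. This is a sketch, not part of the repository: `run_pipeline`, the `confirm_isolated` flag, and the injectable `run` callable are hypothetical names, and the flag is only a reminder, not an actual isolation check.

```python
import subprocess
import sys

# Stage scripts in the order listed above.
STAGES = [
    "pipeline/stage1_corpus.py",
    "pipeline/stage2_sandbox.py",
    "pipeline/stage3_graphs.py",
    "pipeline/stage4_trust.py",
    "pipeline/stage5_models.py",
    "pipeline/stage6_package.py",
]
SANDBOX_STAGE = "pipeline/stage2_sandbox.py"


def _run_subprocess(cmd):
    """Run a command and return its exit status."""
    return subprocess.run(cmd).returncode


def run_pipeline(stages=STAGES, confirm_isolated=False, run=_run_subprocess):
    """Run stages in order; raise on the first non-zero exit status.

    `confirm_isolated` must be set explicitly before the sandbox stage
    is allowed to run. It does not verify isolation by itself.
    """
    if SANDBOX_STAGE in stages and not confirm_isolated:
        raise RuntimeError("Stage 2 executes untrusted packages; "
                           "pass confirm_isolated=True inside a sandbox only")
    completed = []
    for script in stages:
        status = run([sys.executable, script])
        if status != 0:
            raise RuntimeError(f"{script} exited with status {status}")
        completed.append(script)
    return completed
```

Inside an isolated container this would be invoked as `run_pipeline(confirm_isolated=True)`; the `run` parameter exists so the ordering logic can be exercised without executing any stage.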
Connected Repositories
Novelty Summary
This framework is novel because it:
- Combines four trust layers (metadata, behavior, provenance, reputation) into a single score
- Follows the DySec methodology for runtime behavior analysis using eBPF trace categories
- Incorporates dependency-chain and graph features for transitive risk assessment
- Provides actionable deployment decisions (block/quarantine/monitor/approve)
- Is designed for cross-ecosystem generalization (starting with PyPI, extensible to npm/Maven)
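
To make the transitive risk assessment point concrete, here is a minimal sketch of one possible chain-level score. It is not the framework's actual graph model: it assumes each package has a standalone risk in [0, 1], walks the dependency graph breadth-first, and discounts a dependency's risk by a hypothetical `decay` factor per hop. All names (`chain_risk`, `deps`, `own_risk`, `decay`) are illustrative.

```python
from collections import deque


def chain_risk(package, deps, own_risk, decay=0.8):
    """Risk of a package including its transitive dependencies.

    deps:     mapping package -> list of direct dependencies
    own_risk: mapping package -> standalone risk in [0, 1]
    decay:    hypothetical per-hop discount for indirect dependencies

    Returns the maximum of own_risk[n] * decay**depth over every
    reachable package n, where depth is the shortest hop count.
    Packages without a known risk default to 0.0 (an assumption; a
    real system might treat unknown packages as risky instead).
    """
    depth = {package: 0}
    queue = deque([package])
    while queue:
        node = queue.popleft()
        for dep in deps.get(node, ()):
            if dep not in depth:  # also guards against dependency cycles
                depth[dep] = depth[node] + 1
                queue.append(dep)
    return max(own_risk.get(n, 0.0) * decay ** d for n, d in depth.items())


deps = {"app": ["a", "b"], "a": ["c"], "b": [], "c": ["a"]}
own_risk = {"app": 0.1, "a": 0.2, "b": 0.05, "c": 0.9}
print(round(chain_risk("app", deps, own_risk), 3))
```

In this toy graph `app` looks low-risk in isolation, but its chain risk is dominated by `c`, two hops away, which is the kind of case a per-package check would miss.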
Suggested First Publication
A paper on the benchmark corpus and comparison of static, dynamic, graph-based, and hybrid detection methods, with the hybrid model as the main contribution.