# Design Choices

Technical justification of the architectural and engineering decisions made during the Hopcroft project development, following professional MLOps and Software Engineering standards.

---

## Table of Contents

1. [Inception (Requirements Engineering)](#1-inception-requirements-engineering)
2. [Reproducibility (Versioning & Pipelines)](#2-reproducibility-versioning--pipelines)
3. [Quality Assurance](#3-quality-assurance)
4. [API (Inference Service)](#4-api-inference-service)
5. [Deployment (Containerization & CI/CD)](#5-deployment-containerization--cicd)
6. [Monitoring](#6-monitoring)

---

## 1. Inception (Requirements Engineering)

### Machine Learning Canvas

The project adopted the **Machine Learning Canvas** framework to systematically define the problem space before implementation. This structured approach ensures alignment between business objectives and technical solutions.

| Canvas Section | Application |
|----------------|-------------|
| **Prediction Task** | Multi-label classification of 217 technical skills from GitHub issue text |
| **Decisions** | Automated developer assignment based on predicted skill requirements |
| **Value Proposition** | Reduced issue resolution time, optimized resource allocation |
| **Data Sources** | SkillScope DB (7,245 PRs from 11 Java repositories) |
| **Making Predictions** | Real-time classification upon issue creation |
| **Building Models** | Iterative improvement over RF+TF-IDF baseline |
| **Monitoring** | Continuous evaluation with drift detection |

The complete ML Canvas is documented in [ML Canvas.md](./ML%20Canvas.md).
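The prediction task above — mapping free-form issue text to a subset of 217 skills — can be sketched as a one-vs-rest multi-label classifier over TF-IDF features. The tiny corpus and skill names below are illustrative only, not taken from the SkillScope DB, and logistic regression stands in for the RF baseline to keep the sketch small:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy issue texts with ground-truth skill sets (illustrative, not SkillScope data).
texts = [
    "NullPointerException in JDBC connection pool",
    "Refactor REST controller and fix JSON serialization",
    "Slow SQL query on issues table, add missing index",
    "Update Jackson dependency to fix JSON parsing error",
]
skills = [{"java", "sql"}, {"rest", "json"}, {"sql"}, {"json"}]

# One binary column per skill (217 in the real project, 4 in this toy example).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(skills)

# TF-IDF features plus a one-vs-rest wrapper: one binary classifier per skill.
X = TfidfVectorizer().fit_transform(texts)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Per-skill confidence scores for one issue (supports the explainability NFR).
scores = clf.predict_proba(X[:1])[0]
ranked = dict(zip(mlb.classes_, scores.round(2)))
```

The same shape carries over to the real pipeline: the label binarizer fixes the 217-column layout, and each column gets its own classifier whose probability doubles as the per-skill confidence score.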
### Functional vs Non-Functional Requirements

#### Functional Requirements

| Requirement | Target | Metric |
|-------------|--------|--------|
| **Precision** | ≥ Baseline | True positives / Predicted positives |
| **Recall** | ≥ Baseline | True positives / Actual positives |
| **Micro-F1** | > Baseline | Harmonic mean across all labels |
| **Multi-label Support** | 217 skills | Simultaneous prediction of multiple labels |

#### Non-Functional Requirements

| Category | Requirement | Implementation |
|----------|-------------|----------------|
| **Reproducibility** | Auditable experiments | MLflow tracking, DVC versioning |
| **Explainability** | Interpretable predictions | Confidence scores per skill |
| **Performance** | Low latency inference | FastAPI async, model caching |
| **Scalability** | Batch processing | `/predict/batch` endpoint (max 100) |
| **Maintainability** | Clean code | Ruff linting, type hints, docstrings |

### System-First vs Model-First Development

The project adopted a **System-First** approach, prioritizing infrastructure and pipeline development before model optimization:

```
Timeline:
┌────────────────────────────────────────────────────────────┐
│ Phase 1: Infrastructure       │ Phase 2: Model Development │
│ - DVC/MLflow setup            │ - Feature engineering      │
│ - CI/CD pipeline              │ - Hyperparameter tuning    │
│ - Docker containers           │ - SMOTE/ADASYN experiments │
│ - API skeleton                │ - Performance optimization │
└────────────────────────────────────────────────────────────┘
```

**Rationale:**
- Enables rapid iteration once infrastructure is stable
- Ensures reproducibility from day one
- Reduces technical debt during model development
- Facilitates team collaboration with shared tooling

---

## 2. Reproducibility (Versioning & Pipelines)

### Code Versioning (Git)

Standard Git workflow with branch protection:

| Branch | Purpose |
|--------|---------|
| `main` | Production-ready code |
| `feature/*` | New development |
| `milestone/*` | Grouping all features before merging into main |

### Data & Model Versioning (DVC)

**Design Decision:** Use DVC (Data Version Control) with DagsHub remote storage for large file management.

```
.dvc/config
├── remote: origin
├── url: https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc
└── auth: basic (credentials via environment)
```

**Tracked Artifacts:**

| File | Purpose |
|------|---------|
| `data/raw/skillscope_data.db` | Original SQLite database |
| `data/processed/*.npy` | TF-IDF and embedding features |
| `models/*.pkl` | Trained models and vectorizers |

**Versioning Workflow:**

```bash
# Track new data
dvc add data/raw/new_dataset.db
git add data/raw/.gitignore data/raw/new_dataset.db.dvc

# Push to remote
dvc push
git commit -m "Add new dataset version"
git push
```

### Experiment Tracking (MLflow)

**Design Decision:** Remote MLflow instance on DagsHub for collaborative experiment tracking.

| Configuration | Value |
|---------------|-------|
| Tracking URI | `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow` |
| Experiments | `skill_classification`, `skill_prediction_api` |

**Logged Metrics:**
- Training: precision, recall, F1-score, training time
- Inference: prediction latency, confidence scores, timestamps

**Artifact Storage:**
- Model binaries (`.pkl`)
- Vectorizers and scalers
- Hyperparameter configurations

### Auditable ML Pipeline

The pipeline is designed for complete reproducibility:

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  dataset.py  │───▶│ features.py  │───▶│   train.py   │
│  (DVC pull)  │    │   (TF-IDF)   │    │   (MLflow)   │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
  .dvc files          .dvc files          MLflow Run
```

---

## 3. Quality Assurance

### Testing Strategy

#### Static Analysis (Ruff)

**Design Decision:** Use Ruff as the primary linter for speed and comprehensive rule coverage.

| Configuration | Value |
|---------------|-------|
| Line Length | 88 (Black compatible) |
| Target Python | 3.10+ |
| Rule Sets | PEP 8, isort, pyflakes |

**CI Integration:**

```yaml
- name: Lint with Ruff
  run: make lint
```

#### Dynamic Testing (Pytest)

**Test Organization:**

```
tests/
├── unit/                 # Isolated function tests
├── integration/          # Component interaction tests
├── system/               # End-to-end tests
├── behavioral/           # ML-specific tests
├── deepchecks/           # Data validation
└── great expectations/   # Schema validation
```

**Markers for Selective Execution:**

```python
@pytest.mark.unit
@pytest.mark.integration
@pytest.mark.system
@pytest.mark.slow
```

### Model Validation vs Model Verification

| Concept | Definition | Implementation |
|---------|------------|----------------|
| **Validation** | Does the model fit user needs? | Micro-F1 vs baseline comparison |
| **Verification** | Is the model correctly built? | Unit tests, behavioral tests |

### Behavioral Testing

**Design Decision:** Implement CheckList-inspired behavioral tests to evaluate model robustness beyond accuracy metrics.

| Test Type | Count | Purpose |
|-----------|-------|---------|
| **Invariance** | 9 | Stability under perturbations (typos, case changes) |
| **Directional** | 10 | Expected behavior with keyword additions |
| **Minimum Functionality** | 17 | Basic sanity checks on clear examples |

**Example Invariance Test:**

```python
def test_case_insensitivity():
    """Model should predict same skills regardless of case."""
    assert predict("Fix BUG") == predict("fix bug")
```

### Data Quality Checks

#### Great Expectations (10 Tests)

**Design Decision:** Validate data at pipeline boundaries to catch quality issues early.
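In spirit, each boundary check reduces to a handful of assertions over the artifact it guards. A minimal NumPy sketch of the kinds of conditions involved, with synthetic matrices standing in for the real pipeline outputs (the project expresses these as Great Expectations suites rather than bare asserts):

```python
import numpy as np

# Synthetic stand-ins for the real feature and label artifacts.
rng = np.random.default_rng(0)
X = rng.random((100, 50))                # dense feature matrix
Y = rng.integers(0, 2, size=(100, 217))  # binary multi-label matrix

# Feature matrix: finite values only (SMOTE and sklearn reject NaN/Inf).
assert np.isfinite(X).all()

# Label matrix: strictly binary, one column per skill.
assert set(np.unique(Y)) <= {0, 1}
assert Y.shape[1] == 217

# Train/test split: disjoint row indices, i.e. no leakage.
train_idx, test_idx = np.arange(80), np.arange(80, 100)
assert not set(train_idx) & set(test_idx)
```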
| Validation Point | Tests |
|------------------|-------|
| Raw Database | Schema, row count, required columns |
| Feature Matrix | No NaN/Inf, sparsity, SMOTE compatibility |
| Label Matrix | Binary format, distribution, consistency |
| Train/Test Split | No leakage, stratification |

#### Deepchecks (24 Checks)

**Suites:**
- **Data Integrity Suite** (12 checks): Duplicates, nulls, correlations
- **Train-Test Validation Suite** (12 checks): Leakage, drift, distribution

**Status:** Production-ready (96% overall score)

---

## 4. API (Inference Service)

### FastAPI Implementation

**Design Decision:** Use FastAPI for async request handling, automatic OpenAPI generation, and native Pydantic validation.

**Key Features:**
- Async lifespan management for model loading
- Middleware for Prometheus metrics collection
- Structured exception handling

### RESTful Principles

**Design Decision:** Follow REST best practices for intuitive API design.

| Principle | Implementation |
|-----------|----------------|
| **Nouns, not verbs** | `/predictions` instead of `/getPrediction` |
| **Plural resources** | `/predictions`, `/issues` |
| **HTTP methods** | GET (retrieve), POST (create) |
| **Status codes** | 200 (OK), 201 (Created), 404 (Not Found), 500 (Error) |

**Endpoint Design:**

| Method | Endpoint | Action |
|--------|----------|--------|
| `POST` | `/predict` | Create new prediction |
| `POST` | `/predict/batch` | Create batch predictions |
| `GET` | `/predictions` | List predictions |
| `GET` | `/predictions/{run_id}` | Get specific prediction |

### OpenAPI/Swagger Documentation

**Auto-generated documentation at runtime:**
- Swagger UI: `/docs`
- ReDoc: `/redoc`
- OpenAPI JSON: `/openapi.json`

**Pydantic Models for Schema Enforcement:**

```python
from typing import List, Optional

from pydantic import BaseModel


class IssueInput(BaseModel):
    issue_text: str
    repo_name: Optional[str] = None
    pr_number: Optional[int] = None


class PredictionResponse(BaseModel):
    run_id: str
    predictions: List[SkillPrediction]  # SkillPrediction is defined alongside these models
    model_version: str
```

---

## 5. Deployment (Containerization & CI/CD)

### Docker Containerization

**Design Decision:** Multi-stage Docker builds with security best practices.

**Dockerfile Features:**
- Python 3.10 slim base image (minimal footprint)
- Non-root user for security
- DVC integration for model pulling
- Health check endpoint configuration

**Multi-Service Architecture:**

```
docker-compose.yml
├── hopcroft-api (FastAPI)
│   ├── Port: 8080
│   ├── Volumes: source code, logs
│   └── Health check: /health
│
├── hopcroft-gui (Streamlit)
│   ├── Port: 8501
│   ├── Depends on: hopcroft-api
│   └── Environment: API_BASE_URL
│
└── hopcroft-net (Bridge network)
```

**Design Rationale:**
- Separation of concerns (API vs GUI)
- Independent scaling
- Health-based dependency management
- Shared network for internal communication

### CI/CD Pipeline (GitHub Actions)

**Design Decision:** Implement Continuous Delivery for ML (CD4ML) with automated testing and image builds.

**Pipeline Stages:**

```yaml
Jobs:
  unit-tests:
    - Checkout code
    - Setup Python 3.10
    - Install dependencies
    - Ruff linting
    - Pytest unit tests
    - Upload test report (on failure)
  build-image:
    - Needs: unit-tests
    - Configure DVC credentials
    - Pull models
    - Build Docker image
```

**Triggers:**
- Push to `main`, `feature/*`
- Pull requests to `main`

**Secrets Management:**
- `DAGSHUB_USERNAME`: DagsHub authentication
- `DAGSHUB_TOKEN`: DagsHub access token

### Hugging Face Spaces Hosting

**Design Decision:** Deploy on HF Spaces for free GPU-enabled hosting with Docker SDK support.

**Configuration:**

```yaml
---
title: Hopcroft Skill Classification
sdk: docker
app_port: 7860
---
```

**Startup Flow:**
1. `start_space.sh` configures DVC credentials
2. Pull models from DagsHub
3. Start FastAPI (port 8000)
4. Start Streamlit (port 8501)
5. Start Nginx (port 7860) for routing

**Nginx Reverse Proxy:**
- `/` → Streamlit GUI
- `/docs`, `/predict`, `/predictions` → FastAPI
- `/prometheus` → Prometheus metrics

---

## 6. Monitoring

### Resource-Level Monitoring

**Design Decision:** Implement Prometheus metrics for real-time observability.

| Metric | Type | Purpose |
|--------|------|---------|
| `hopcroft_requests_total` | Counter | Request volume by endpoint |
| `hopcroft_request_duration_seconds` | Histogram | Latency distribution (P50, P90, P99) |
| `hopcroft_in_progress_requests` | Gauge | Concurrent request load |
| `hopcroft_prediction_processing_seconds` | Summary | Model inference time |

**Middleware Implementation:**

```python
@app.middleware("http")
async def monitor_requests(request, call_next):
    method, endpoint = request.method, request.url.path
    IN_PROGRESS.inc()
    try:
        with REQUEST_LATENCY.labels(method, endpoint).time():
            response = await call_next(request)
        REQUESTS_TOTAL.labels(method, endpoint, response.status_code).inc()
        return response
    finally:
        # Decrement even if the handler raises, so the gauge cannot drift upward.
        IN_PROGRESS.dec()
```

### Performance-Level Monitoring

**Model Staleness Indicators:**
- Prediction confidence trends over time
- Drift detection alerts
- Error rate monitoring

### Drift Detection Strategy

**Design Decision:** Implement statistical drift detection using the Kolmogorov-Smirnov test with Bonferroni correction.
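A minimal sketch of this scheme using SciPy with synthetic data (the real job compares a 1000-sample baseline from the training artifacts against recent production inputs):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
n_features, alpha = 20, 0.05

# Reference window drawn from training data vs. a recent production window.
baseline = rng.normal(size=(1000, n_features))
live = rng.normal(size=(500, n_features))
live[:, 0] += 1.5  # inject an artificial shift in one feature

# One KS two-sample test per feature; Bonferroni divides alpha by the
# number of comparisons to control the family-wise false-alarm rate.
threshold = alpha / n_features
p_values = [ks_2samp(baseline[:, j], live[:, j]).pvalue for j in range(n_features)]
drifted = [j for j, p in enumerate(p_values) if p < threshold]

drift_detected = int(bool(drifted))  # the binary flag published to Pushgateway
```

With 217-skill TF-IDF inputs the per-feature loop is the same; only the number of comparisons (and hence the Bonferroni threshold) changes.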
| Component | Details |
|-----------|---------|
| **Algorithm** | KS Two-Sample Test |
| **Baseline** | 1000 samples from training data |
| **Threshold** | p-value < 0.05 (Bonferroni corrected) |
| **Execution** | Scheduled via cron or manual trigger |

**Drift Types Monitored:**

| Type | Definition | Detection Method |
|------|------------|------------------|
| **Data Drift** | Feature distribution shift | KS test on input features |
| **Target Drift** | Label distribution shift | Chi-square test on predictions |
| **Concept Drift** | Relationship change | Performance degradation monitoring |

**Metrics Published to Pushgateway:**
- `drift_detected`: Binary indicator (0/1)
- `drift_p_value`: Statistical significance
- `drift_distance`: KS distance metric
- `drift_check_timestamp`: Last check time

### Alerting Configuration

**Prometheus Alert Rules:**

| Alert | Condition | Severity |
|-------|-----------|----------|
| `ServiceDown` | Target down for 5m | Critical |
| `HighErrorRate` | 5xx rate > 10% | Warning |
| `SlowRequests` | P95 latency > 2s | Warning |
| `DriftDetected` | drift_detected = 1 | Warning |

**Alertmanager Integration:**
- Severity-based routing
- Email notifications
- Inhibition rules to prevent alert storms

### Grafana Visualization

**Dashboard Panels:**
1. Request Rate (gauge)
2. Request Latency p50/p95 (time series)
3. In-Progress Requests (stat panel)
4. Error Rate 5xx (stat panel)
5. Model Prediction Time (time series)
6. Requests by Endpoint (bar chart)

**Data Sources:**
- Prometheus: Real-time metrics
- Pushgateway: Batch job metrics (drift detection)

### HF Spaces Deployment

Both Prometheus and Grafana are deployed on Hugging Face Spaces via Nginx reverse proxy:

| Service | Production URL |
|---------|----------------|
| Prometheus | `https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/` |
| Grafana | `https://dacrow13-hopcroft-skill-classification.hf.space/grafana/` |

This enables real-time monitoring of the production deployment without additional infrastructure.