devorbit committed on
Commit
26e1c2e
·
0 Parent(s):

Initial deployment - secrets removed

Browse files
This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50) hide show
  1. .dockerignore +18 -0
  2. .gitattributes +3 -0
  3. .gitignore +11 -0
  4. DEPLOY_GUIDE.md +103 -0
  5. Dockerfile +49 -0
  6. MurshidBackend_Colab.ipynb +967 -0
  7. MurshidBackend_Colab_Report.md +545 -0
  8. Needed/murshid_label_columns.json +22 -0
  9. Needed/murshid_logreg_pipeline_manual_oof_pcatuned.joblib +3 -0
  10. Needed/murshid_logreg_thresholds_manual_oof_pcatuned.npy +3 -0
  11. Needed/murshid_query_template_structure_clean_shared.xlsx +3 -0
  12. README.md +39 -0
  13. murshid_backend/README.md +156 -0
  14. murshid_backend/TECHNICAL_REPORT.md +322 -0
  15. murshid_backend/alembic.ini +38 -0
  16. murshid_backend/alembic/env.py +52 -0
  17. murshid_backend/alembic/script.py.mako +25 -0
  18. murshid_backend/alembic/versions/0001_initial_schema.py +87 -0
  19. murshid_backend/app/__init__.py +1 -0
  20. murshid_backend/app/api/__init__.py +1 -0
  21. murshid_backend/app/api/routes/__init__.py +1 -0
  22. murshid_backend/app/api/routes/db_viewer.py +122 -0
  23. murshid_backend/app/api/routes/health.py +73 -0
  24. murshid_backend/app/api/routes/queries.py +78 -0
  25. murshid_backend/app/api/routes/rules.py +100 -0
  26. murshid_backend/app/api/routes/stats.py +43 -0
  27. murshid_backend/app/config.py +29 -0
  28. murshid_backend/app/db/__init__.py +1 -0
  29. murshid_backend/app/db/base.py +5 -0
  30. murshid_backend/app/db/session.py +25 -0
  31. murshid_backend/app/main.py +60 -0
  32. murshid_backend/app/ml/__init__.py +1 -0
  33. murshid_backend/app/ml/embedder.py +116 -0
  34. murshid_backend/app/ml/logistic_model.py +111 -0
  35. murshid_backend/app/ml/pipeline.py +225 -0
  36. murshid_backend/app/ml/sanitizer.py +32 -0
  37. murshid_backend/app/ml/summarizer.py +262 -0
  38. murshid_backend/app/ml/svm_model.py +101 -0
  39. murshid_backend/app/models/__init__.py +16 -0
  40. murshid_backend/app/models/mapping_job.py +40 -0
  41. murshid_backend/app/models/query_template.py +27 -0
  42. murshid_backend/app/models/rule.py +27 -0
  43. murshid_backend/app/models/rule_technique_mapping.py +31 -0
  44. murshid_backend/app/models/technique.py +24 -0
  45. murshid_backend/app/models/user.py +30 -0
  46. murshid_backend/app/repositories/__init__.py +1 -0
  47. murshid_backend/app/repositories/job_repo.py +44 -0
  48. murshid_backend/app/repositories/rule_repo.py +71 -0
  49. murshid_backend/app/repositories/template_repo.py +94 -0
  50. murshid_backend/app/schemas/__init__.py +1 -0
.dockerignore ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ __pycache__
2
+ *.pyc
3
+ *.pyo
4
+ .venv
5
+ venv
6
+ **/.env
7
+ .env.local
8
+ *.db
9
+ *.log
10
+ .git
11
+ .gitignore
12
+ *.zip
13
+ MurshidBackend_Colab.ipynb
14
+ MurshidBackend_Colab_Report.md
15
+ interface_pictures/
16
+ murshid_backend/.venv
17
+ murshid_backend/__pycache__
18
+ murshid_backend/TECHNICAL_REPORT.md
.gitattributes ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ *.joblib filter=lfs diff=lfs merge=lfs -text
2
+ *.npy filter=lfs diff=lfs merge=lfs -text
3
+ *.xlsx filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ __pycache__/
2
+ *.pyc
3
+ *.pyo
4
+ .venv/
5
+ venv/
6
+ *.db
7
+ *.log
8
+ **/.env
9
+ .env.local
10
+ murshid_backend_for_drive.zip
11
+ interface_pictures/
DEPLOY_GUIDE.md ADDED
@@ -0,0 +1,103 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🚀 دليل النشر على Hugging Face Spaces
2
+
3
+ ## المتطلبات
4
+ - حساب على [Hugging Face](https://huggingface.co/) (مجاني)
5
+ - [Git](https://git-scm.com/) مثبّت على جهازك
6
+
7
+ ---
8
+
9
+ ## الخطوات
10
+
11
+ ### 1. إنشاء Space جديد
12
+
13
+ 1. اذهب إلى: https://huggingface.co/new-space
14
+ 2. **Space name**: `murshid`
15
+ 3. **SDK**: اختر **Docker**
16
+ 4. **Visibility**: Public (مجاني) أو Private
17
+ 5. اضغط **Create Space**
18
+
19
+ ### 2. رفع المشروع
20
+
21
+ ```powershell
22
+ cd d:\murishd
23
+
24
+ # تهيئة Git (إذا لم يكن موجوداً)
25
+ git init
26
+
27
+ # إضافة الـ remote (غيّر YOUR_USERNAME باسم حسابك)
28
+ git remote add space https://huggingface.co/spaces/YOUR_USERNAME/murshid
29
+
30
+ # إضافة الملفات والرفع
31
+ git add .
32
+ git commit -m "Initial deployment"
33
+ git push space main
34
+ ```
35
+
36
+ > ⚠️ إذا طلب كلمة مرور، استخدم **Access Token** من:
37
+ > https://huggingface.co/settings/tokens
38
+
39
+ ### 3. إعداد المتغيرات البيئية (Secrets)
40
+
41
+ اذهب إلى إعدادات الـ Space: `Settings → Variables and secrets`
42
+
43
+ أضف هذه المتغيرات:
44
+
45
+ | الاسم | القيمة | النوع |
46
+ |-------|--------|-------|
47
+ | `MURSHID_DB_URL` | `sqlite:////app/data/murshid.db` | Variable |
48
+ | `MURSHID_MODELS_DIR` | `/app/Needed` | Variable |
49
+ | `MURSHID_SKIP_LLM` | `true` | Variable |
50
+ | `SECRET_KEY` | (اختر كلمة سر عشوائية) | **Secret** |
51
+ | `HF_TOKEN` | (اختياري — لو تبغى Llama) | **Secret** |
52
+
53
+ ### 4. انتظر البناء
54
+
55
+ - HF Spaces يبني الـ Docker image تلقائياً
56
+ - يأخذ **3-5 دقائق** للبناء الأول
57
+ - بعد النجاح، الرابط يكون:
58
+ ```
59
+ https://YOUR_USERNAME-murshid.hf.space
60
+ ```
61
+
62
+ ---
63
+
64
+ ## الروابط بعد النشر
65
+
66
+ | الرابط | الوصف |
67
+ |--------|-------|
68
+ | `https://YOUR_USERNAME-murshid.hf.space` | الواجهة الرئيسية |
69
+ | `https://YOUR_USERNAME-murshid.hf.space/docs` | توثيق Swagger |
70
+ | `https://YOUR_USERNAME-murshid.hf.space/health` | فحص الحالة |
71
+
72
+ ---
73
+
74
+ ## ملاحظات
75
+
76
+ ### الوضع الحالي (LITE mode)
77
+ - المشروع ينشر بوضع **LITE** (بدون torch/SecureBERT+)
78
+ - تحليل القواعد يعمل لكن بدقة أقل (embeddings عشوائية)
79
+ - مناسب لاختبار الواجهة والـ API
80
+
81
+ ### للترقية إلى LOCAL mode (SecureBERT+ بدون Llama)
82
+ عدّل `Dockerfile` وأزل التعليق من سطر torch:
83
+ ```dockerfile
84
+ RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu transformers sentencepiece
85
+ ```
86
+ > ⚠️ هذا يزيد حجم الصورة ~800MB ويحتاج ذاكرة أكثر
87
+
88
+ ### للترقية إلى FULL mode (مع Llama 3)
89
+ - غيّر الـ Space إلى **GPU (T4)** من الإعدادات ($0.60/ساعة)
90
+ - عدّل `MURSHID_SKIP_LLM=false`
91
+ - أضف `HF_TOKEN` في الـ Secrets
92
+ - استخدم `requirements.txt` الكامل بدل `requirements_light.txt`
93
+
94
+ ---
95
+
96
+ ## استكشاف الأخطاء
97
+
98
+ | المشكلة | الحل |
99
+ |---------|------|
100
+ | Build فشل | تحقق من الـ Logs في تبويب الـ Space |
101
+ | 502 Bad Gateway | انتظر دقيقة — الخادم يبدأ |
102
+ | DB خطأ | تحقق من `MURSHID_DB_URL` في المتغيرات |
103
+ | Frontend لا يتصل | الـ BASE URL أصبح تلقائي (`window.location.origin`) |
Dockerfile ADDED
@@ -0,0 +1,49 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
FROM python:3.11-slim

# System build dependencies: build-essential plus libxml2/libxslt headers
# (needed to compile lxml and other C extensions from requirements).
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential libxml2-dev libxslt1-dev \
    && rm -rf /var/lib/apt/lists/*

# Non-root user with uid 1000 (Hugging Face Spaces requirement).
RUN useradd -m -u 1000 appuser

WORKDIR /app

# Install Python dependencies first so this layer stays cached across
# application-code changes.
COPY murshid_backend/requirements_light.txt ./requirements.txt
RUN pip install --no-cache-dir -r requirements.txt \
    && pip install --no-cache-dir openpyxl aiofiles scikit-learn

# Optional: install CPU-only torch for LOCAL mode (SecureBERT+ embeddings).
# Uncomment the next line if you want LOCAL mode (adds ~800MB to the image):
# RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu transformers sentencepiece

# Application code, model artifacts, and the static frontend.
COPY murshid_backend/ ./murshid_backend/
COPY Needed/ ./Needed/
COPY murshid_frontend/ ./murshid_frontend/

# Writable directory for the SQLite database (image filesystem is
# read-only for appuser elsewhere).
RUN mkdir -p /app/data && chown -R appuser:appuser /app

# Runtime configuration, consolidated into one layer.
# SECURITY: SECRET_KEY here is a deliberate non-secret placeholder — never
# bake a real secret into the image. Override it at deploy time via the
# Space's "Variables and secrets" settings (see DEPLOY_GUIDE.md, step 3).
ENV MURSHID_DB_URL=sqlite:////app/data/murshid.db \
    MURSHID_MODELS_DIR=/app/Needed \
    MURSHID_SKIP_LLM=true \
    SECRET_KEY=change-me \
    PORT=7860

# Entry point script: runs DB migrations, imports templates, then starts
# the server (presumably uvicorn on $PORT — confirm against start.sh).
COPY start.sh ./start.sh
RUN chmod +x start.sh

USER appuser

EXPOSE 7860

CMD ["./start.sh"]
MurshidBackend_Colab.ipynb ADDED
@@ -0,0 +1,967 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# 🛡️ Murshid Backend — Full Mode on Colab\n",
8
+ "\n",
9
+ "**مُرشِد | From Alerts to Guidance: MITRE ATT&CK-Aligned Techniques Mapping for SOC Analysts**\n",
10
+ "\n",
11
+ "---\n",
12
+ "\n",
13
+ "## 📁 الملفات المطلوبة على Google Drive\n",
14
+ "\n",
15
+ "```\n",
16
+ "MyDrive/\n",
17
+ "├── murshid_backend_for_drive.zip ← ارفعيه ثم شغّلي الخلية 2b لاستخراجه\n",
18
+ "│ أو\n",
19
+ "├── murshid_backend/ ← إذا استخرجته مسبقاً\n",
20
+ "│ ├── app/\n",
21
+ "│ ├── alembic/\n",
22
+ "│ ├── scripts/\n",
23
+ "│ ├── alembic.ini\n",
24
+ "│ └── requirements.txt\n",
25
+ "│\n",
26
+ "└── Needed/\n",
27
+ " ├── murshid_logreg_pipeline_manual_oof_pcatuned.joblib\n",
28
+ " ├── murshid_logreg_thresholds_manual_oof_pcatuned.npy\n",
29
+ " ├── murshid_label_columns.json\n",
30
+ " └── murshid_query_template_structure_clean_shared.xlsx\n",
31
+ "```\n",
32
+ "\n",
33
+ "## تعليمات التشغيل\n",
34
+ "\n",
35
+ "### المتطلبات قبل التشغيل\n",
36
+ "1. ✅ **GPU مُفعَّل:** `Runtime → Change runtime type → T4 GPU`\n",
37
+ "2. ✅ **Google Drive مُتَّصل** (يحتوي مجلد `Needed` بملفات النماذج)\n",
38
+ "3. ✅ **مجلد `murshid_backend`** على Drive أو رفعه يدوياً\n",
39
+ "\n",
40
+ "### الملفات المطلوبة في Google Drive\n",
41
+ "```\n",
42
+ "MyDrive/\n",
43
+ "├── Needed/\n",
44
+ "│ ├── murshid_logreg_pipeline_manual_oof_pcatuned.joblib\n",
45
+ "│ ├── murshid_logreg_thresholds_manual_oof_pcatuned.npy\n",
46
+ "│ ├── murshid_label_columns.json\n",
47
+ "│ └── murshid_query_template_structure_clean_shared.xlsx\n",
48
+ "└── murshid_backend/ ← مجلد الباكند كاملاً\n",
49
+ "```\n",
50
+ "\n",
51
+ "### ترتيب التشغيل\n",
52
+ "**شغّلي الخلايا بالترتيب من الأعلى للأسفل — لا تتخطّي أي خلية**\n"
53
+ ]
54
+ },
55
+ {
56
+ "cell_type": "markdown",
57
+ "metadata": {},
58
+ "source": [
59
+ "---\n",
60
+ "## الخلية 1: التحقق من GPU\n"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": null,
66
+ "metadata": {},
67
+ "outputs": [],
68
+ "source": [
69
+ "import torch\n",
70
+ "\n",
71
+ "print('CUDA available:', torch.cuda.is_available())\n",
72
+ "if torch.cuda.is_available():\n",
73
+ " print('GPU:', torch.cuda.get_device_name(0))\n",
74
+ " print('Memory:', round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1), 'GB')\n",
75
+ "else:\n",
76
+ " print('⚠️ لا يوجد GPU — غيّري Runtime إلى T4 من القائمة أعلاه')"
77
+ ]
78
+ },
79
+ {
80
+ "cell_type": "markdown",
81
+ "metadata": {},
82
+ "source": [
83
+ "---\n",
84
+ "## الخلية 2: تحميل Google Drive\n"
85
+ ]
86
+ },
87
+ {
88
+ "cell_type": "markdown",
89
+ "metadata": {},
90
+ "source": [
91
+ "---\n",
92
+ "## الخلية 3: تجهيز الباكند في /content\n",
93
+ "\n",
94
+ "> تقوم هذه الخلية تلقائياً بـ:\n",
95
+ "> 1. استخراج ZIP من Drive (إذا كان ZIP موجوداً ولم يُستخرج بعد)\n",
96
+ "> 2. نسخ مجلد `murshid_backend` إلى `/content` (أسرع للقراءة)\n",
97
+ "> 3. ضبط Python path\n"
98
+ ]
99
+ },
100
+ {
101
+ "cell_type": "code",
102
+ "execution_count": null,
103
+ "metadata": {},
104
+ "outputs": [],
105
+ "source": [
106
+ "print('(هذه الخلية فارغة — الكود انتقل إلى الخلية 3 أدناه)')\n",
107
+ "\n"
108
+ ]
109
+ },
110
+ {
111
+ "cell_type": "code",
112
+ "execution_count": null,
113
+ "metadata": {},
114
+ "outputs": [],
115
+ "source": [
116
+ "from google.colab import drive\n",
117
+ "import os\n",
118
+ "\n",
119
+ "drive.mount('/content/drive')\n",
120
+ "\n",
121
+ "# ✏️ عدّلي هذا المسار إذا كان مجلدك مختلفاً\n",
122
+ "NEEDED_PATH = '/content/drive/MyDrive/Needed'\n",
123
+ "BACKEND_PATH = '/content/drive/MyDrive/murshid_backend'\n",
124
+ "ZIP_PATH = '/content/drive/MyDrive/murshid_backend_for_drive.zip'\n",
125
+ "\n",
126
+ "print('=' * 55)\n",
127
+ "print('📂 Checking Google Drive files...')\n",
128
+ "print('=' * 55)\n",
129
+ "\n",
130
+ "# ── التحقق من ملفات Needed ────────────────────────���───────────\n",
131
+ "print('\\n📁 Needed/ (model files):')\n",
132
+ "required_files = {\n",
133
+ " 'murshid_logreg_pipeline_manual_oof_pcatuned.joblib': 'LogReg model',\n",
134
+ " 'murshid_logreg_thresholds_manual_oof_pcatuned.npy': 'LogReg thresholds',\n",
135
+ " 'murshid_label_columns.json': 'Technique names',\n",
136
+ "}\n",
137
+ "\n",
138
+ "models_ok = True\n",
139
+ "for fname, desc in required_files.items():\n",
140
+ " path = f'{NEEDED_PATH}/{fname}'\n",
141
+ " exists = os.path.isfile(path)\n",
142
+ " size = f'{os.path.getsize(path)/1024:.0f} KB' if exists else ''\n",
143
+ " status = '✅' if exists else '❌'\n",
144
+ " print(f' {status} {fname} {size}')\n",
145
+ " if not exists:\n",
146
+ " models_ok = False\n",
147
+ "\n",
148
+ "excel_path = f'{NEEDED_PATH}/murshid_query_template_structure_clean_shared.xlsx'\n",
149
+ "excel_ok = os.path.isfile(excel_path)\n",
150
+ "print(f' {\"✅\" if excel_ok else \"⚠️ \"} murshid_query_template_structure_clean_shared.xlsx (optional)')\n",
151
+ "\n",
152
+ "# ── التحقق من الباكند ─────────────────────────────────────────\n",
153
+ "print('\\n📁 murshid_backend/ (backend code):')\n",
154
+ "backend_ok = os.path.isdir(BACKEND_PATH)\n",
155
+ "zip_ok = os.path.isfile(ZIP_PATH)\n",
156
+ "\n",
157
+ "if backend_ok:\n",
158
+ " fcount = sum(len(f) for _, _, f in os.walk(BACKEND_PATH))\n",
159
+ " print(f' ✅ murshid_backend/ ({fcount} files)')\n",
160
+ "elif zip_ok:\n",
161
+ " zsize = f'{os.path.getsize(ZIP_PATH)/1024:.0f} KB'\n",
162
+ " print(f' 📦 murshid_backend_for_drive.zip ({zsize}) — سيُستخرج تلقائياً في الخلية 3')\n",
163
+ "else:\n",
164
+ " print(f' ❌ murshid_backend/ غير موجود')\n",
165
+ " print(f' ❌ murshid_backend_for_drive.zip غير موجود')\n",
166
+ " print(f'\\n ⚠️ ارفعي murshid_backend_for_drive.zip إلى:')\n",
167
+ " print(f' Google Drive → My Drive')\n",
168
+ "\n",
169
+ "# ── ملخص ──────────────────────────────────────────────────────\n",
170
+ "print('\\n' + '=' * 55)\n",
171
+ "if models_ok and (backend_ok or zip_ok):\n",
172
+ " print('✅ كل شيء جاهز — تابعي تشغيل الخلايا')\n",
173
+ "elif not models_ok:\n",
174
+ " print('❌ ملفات النماذج مفقودة من Needed/ — يجب رفعها أولاً')\n",
175
+ "else:\n",
176
+ " print('❌ ملفات الباكند مفقودة — ارفعي ZIP أولاً')"
177
+ ]
178
+ },
179
+ {
180
+ "cell_type": "markdown",
181
+ "metadata": {},
182
+ "source": [
183
+ "---\n",
184
+ "## الخلية 3: نسخ الباكند إلى /content\n",
185
+ "\n",
186
+ "> نسخ الملفات من Drive إلى `/content` لتسريع القراءة\n"
187
+ ]
188
+ },
189
+ {
190
+ "cell_type": "code",
191
+ "execution_count": null,
192
+ "metadata": {},
193
+ "outputs": [],
194
+ "source": [
195
+ "import shutil, os, zipfile, sys\n",
196
+ "\n",
197
+ "DRIVE_BASE = '/content/drive/MyDrive'\n",
198
+ "ZIP_PATH = f'{DRIVE_BASE}/murshid_backend_for_drive.zip'\n",
199
+ "BACKEND_DRIVE= f'{DRIVE_BASE}/murshid_backend'\n",
200
+ "BACKEND_LOCAL= '/content/murshid_backend'\n",
201
+ "\n",
202
+ "# ── الخطوة 1: استخراج ZIP من Drive إذا لزم ────────────────────\n",
203
+ "if not os.path.isdir(BACKEND_DRIVE):\n",
204
+ " if os.path.isfile(ZIP_PATH):\n",
205
+ " print(f'📦 ZIP found — extracting to Drive...')\n",
206
+ " with zipfile.ZipFile(ZIP_PATH, 'r') as z:\n",
207
+ " z.extractall(DRIVE_BASE)\n",
208
+ " print(f'✅ Extracted to {BACKEND_DRIVE}')\n",
209
+ " else:\n",
210
+ " print('❌ ERROR: مجلد murshid_backend غير موجود على Drive')\n",
211
+ " print(f' المطلوب: {BACKEND_DRIVE}')\n",
212
+ " print(f' أو رفع: {ZIP_PATH}')\n",
213
+ " raise FileNotFoundError(f'Backend not found. Upload murshid_backend_for_drive.zip to Google Drive MyDrive.')\n",
214
+ "else:\n",
215
+ " print(f'✅ murshid_backend found on Drive: {BACKEND_DRIVE}')\n",
216
+ "\n",
217
+ "# ── الخطوة 2: نسخ إلى /content (أسرع بكثير من Drive أثناء التشغيل) ─\n",
218
+ "if os.path.exists(BACKEND_LOCAL):\n",
219
+ " shutil.rmtree(BACKEND_LOCAL)\n",
220
+ "\n",
221
+ "shutil.copytree(\n",
222
+ " BACKEND_DRIVE,\n",
223
+ " BACKEND_LOCAL,\n",
224
+ " ignore=shutil.ignore_patterns('__pycache__', '*.pyc', '.venv', '*.db', '*.log')\n",
225
+ ")\n",
226
+ "\n",
227
+ "# ── الخطوة 3: إضافة للـ Python path ──────────────────────────\n",
228
+ "if BACKEND_LOCAL not in sys.path:\n",
229
+ " sys.path.insert(0, BACKEND_LOCAL)\n",
230
+ "\n",
231
+ "os.chdir(BACKEND_LOCAL)\n",
232
+ "\n",
233
+ "# ── تحقق ─────────────────────────────────────────────────────\n",
234
+ "file_count = sum(len(files) for _, _, files in os.walk(BACKEND_LOCAL))\n",
235
+ "print(f'✅ Backend ready at {BACKEND_LOCAL} ({file_count} files)')\n",
236
+ "print(f'✅ Working dir: {os.getcwd()}')\n",
237
+ "\n",
238
+ "# عرض الهيكل\n",
239
+ "print('\\nStructure:')\n",
240
+ "for item in sorted(os.listdir(BACKEND_LOCAL)):\n",
241
+ " full = os.path.join(BACKEND_LOCAL, item)\n",
242
+ " if os.path.isdir(full):\n",
243
+ " sub_count = len(os.listdir(full))\n",
244
+ " print(f' 📁 {item}/ ({sub_count} items)')\n",
245
+ " else:\n",
246
+ " size = os.path.getsize(full)\n",
247
+ " print(f' 📄 {item} ({size:,} bytes)')"
248
+ ]
249
+ },
250
+ {
251
+ "cell_type": "markdown",
252
+ "metadata": {},
253
+ "source": [
254
+ "---\n",
255
+ "## الخلية 4: تثبيت المتطلبات\n"
256
+ ]
257
+ },
258
+ {
259
+ "cell_type": "code",
260
+ "execution_count": null,
261
+ "metadata": {},
262
+ "outputs": [],
263
+ "source": [
264
+ "print('📦 Installing requirements...')\n",
265
+ "\n",
266
+ "# ── الحزم الأساسية للباكند ──────────────────────────────────────\n",
267
+ "!pip install -q \\\n",
268
+ " fastapi==0.115.0 \\\n",
269
+ " \"uvicorn[standard]==0.32.0\" \\\n",
270
+ " pydantic==2.9.0 \\\n",
271
+ " pydantic-settings==2.6.0 \\\n",
272
+ " python-dotenv==1.0.0 \\\n",
273
+ " sqlalchemy==2.0.0 \\\n",
274
+ " alembic==1.13.0 \\\n",
275
+ " aiofiles \\\n",
276
+ " scikit-learn==1.6.1 \\\n",
277
+ " joblib \\\n",
278
+ " lxml \\\n",
279
+ " openpyxl \\\n",
280
+ " nest-asyncio \\\n",
281
+ " pyngrok\n",
282
+ "\n",
283
+ "# ── bitsandbytes: مطلوب لتحميل LLaMA بـ 4-bit على GPU ─────────\n",
284
+ "print('📦 Installing bitsandbytes (required for LLaMA 4-bit)...')\n",
285
+ "!pip install -q -U \"bitsandbytes>=0.46.1\"\n",
286
+ "\n",
287
+ "# ── accelerate: مطلوب لـ device_map=\"auto\" ────────────────────\n",
288
+ "!pip install -q -U accelerate\n",
289
+ "\n",
290
+ "# ── تحقق من التثبيت ──────────────────────────────────────────\n",
291
+ "import importlib\n",
292
+ "for pkg in ['bitsandbytes', 'accelerate', 'fastapi', 'sklearn']:\n",
293
+ " try:\n",
294
+ " mod = importlib.import_module(pkg if pkg != 'sklearn' else 'sklearn')\n",
295
+ " ver = getattr(mod, '__version__', '?')\n",
296
+ " print(f' ✅ {pkg}=={ver}')\n",
297
+ " except ImportError:\n",
298
+ " print(f' ❌ {pkg} — فشل التثبيت')\n",
299
+ "\n",
300
+ "print('\\n✅ All requirements installed')"
301
+ ]
302
+ },
303
+ {
304
+ "cell_type": "markdown",
305
+ "metadata": {},
306
+ "source": [
307
+ "---\n",
308
+ "## الخلية 5: إعداد ملف .env\n"
309
+ ]
310
+ },
311
+ {
312
+ "cell_type": "code",
313
+ "execution_count": null,
314
+ "metadata": {},
315
+ "outputs": [],
316
+ "source": [
317
+ "import os\n",
318
+ "\n",
319
+ "# ✏️ ضعي HF Token هنا إذا لم تُضيفيه عبر Colab Secrets\n",
320
+ "HF_TOKEN = os.environ.get('HF_TOKEN', 'ادخل التوكن')\n",
321
+ "\n",
322
+ "env_content = f\"\"\"# Auto-generated .env for Colab FULL mode\n",
323
+ "MURSHID_DB_URL=sqlite:////content/murshid.db\n",
324
+ "MURSHID_MODELS_DIR={NEEDED_PATH}\n",
325
+ "HF_TOKEN={HF_TOKEN}\n",
326
+ "MURSHID_SKIP_LLM=false\n",
327
+ "SECRET_KEY=murshid_colab_2026\n",
328
+ "LLAMA_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct\n",
329
+ "EMBED_MODEL_ID=ehsanaghaei/SecureBERT_Plus\n",
330
+ "LOGREG_JOBLIB=murshid_logreg_pipeline_manual_oof_pcatuned.joblib\n",
331
+ "LOGREG_THRESHOLDS_NPY=murshid_logreg_thresholds_manual_oof_pcatuned.npy\n",
332
+ "LABEL_COLUMNS_JSON=murshid_label_columns.json\n",
333
+ "\"\"\"\n",
334
+ "\n",
335
+ "env_path = '/content/murshid_backend/.env'\n",
336
+ "with open(env_path, 'w') as f:\n",
337
+ " f.write(env_content)\n",
338
+ "\n",
339
+ "print('✅ .env created at', env_path)\n",
340
+ "print('\\nContents:')\n",
341
+ "with open(env_path) as f:\n",
342
+ " for line in f:\n",
343
+ " if 'TOKEN' in line or 'SECRET' in line:\n",
344
+ " key = line.split('=')[0]\n",
345
+ " print(f' {key}=****')\n",
346
+ " else:\n",
347
+ " print(' ', line.rstrip())"
348
+ ]
349
+ },
350
+ {
351
+ "cell_type": "markdown",
352
+ "metadata": {},
353
+ "source": [
354
+ "---\n",
355
+ "## الخلية 6: تهجير قاعدة البيانات (Alembic)\n"
356
+ ]
357
+ },
358
+ {
359
+ "cell_type": "code",
360
+ "execution_count": null,
361
+ "metadata": {},
362
+ "outputs": [],
363
+ "source": [
364
+ "import subprocess, os\n",
365
+ "\n",
366
+ "os.chdir('/content/murshid_backend')\n",
367
+ "\n",
368
+ "result = subprocess.run(\n",
369
+ " ['python', '-m', 'alembic', 'upgrade', 'head'],\n",
370
+ " capture_output=True, text=True\n",
371
+ ")\n",
372
+ "\n",
373
+ "print(result.stdout)\n",
374
+ "if result.stderr:\n",
375
+ " print(result.stderr)\n",
376
+ "\n",
377
+ "import os\n",
378
+ "db_exists = os.path.isfile('/content/murshid.db')\n",
379
+ "print('✅ Database ready:', '/content/murshid.db' if db_exists else '❌ لم يُنشأ')"
380
+ ]
381
+ },
382
+ {
383
+ "cell_type": "markdown",
384
+ "metadata": {},
385
+ "source": [
386
+ "---\n",
387
+ "## الخلية 7: استيراد قوالب WQL من Excel\n"
388
+ ]
389
+ },
390
+ {
391
+ "cell_type": "code",
392
+ "execution_count": null,
393
+ "metadata": {},
394
+ "outputs": [],
395
+ "source": [
396
+ "import sys\n",
397
+ "sys.path.insert(0, '/content/murshid_backend')\n",
398
+ "os.chdir('/content/murshid_backend')\n",
399
+ "\n",
400
+ "excel_path = f'{NEEDED_PATH}/murshid_query_template_structure_clean_shared.xlsx'\n",
401
+ "\n",
402
+ "if os.path.isfile(excel_path):\n",
403
+ " from app.db.session import SessionLocal\n",
404
+ " from scripts.import_excel_templates import run as import_excel\n",
405
+ "\n",
406
+ " db = SessionLocal()\n",
407
+ " try:\n",
408
+ " result = import_excel(db, replace=False)\n",
409
+ " print('✅ Excel import result:')\n",
410
+ " for k, v in result.items():\n",
411
+ " print(f' {k}: {v}')\n",
412
+ " finally:\n",
413
+ " db.close()\n",
414
+ "else:\n",
415
+ " print(f'⚠️ Excel file not found at: {excel_path}')\n",
416
+ " print(' يمكنك المتابعة — القوالب ستُضاف لاحقاً يدوياً')"
417
+ ]
418
+ },
419
+ {
420
+ "cell_type": "markdown",
421
+ "metadata": {},
422
+ "source": [
423
+ "---\n",
424
+ "## الخلية 8: تشغيل FastAPI + ngrok\n",
425
+ "\n",
426
+ "> ⏳ هذه الخلية تأخذ **5-10 دقائق** لتحميل LLaMA (4.5GB) و SecureBERT+\n",
427
+ "\n",
428
+ "> 🔑 **الرابط العام سيظهر في النهاية** — انسخيه للفرونت\n"
429
+ ]
430
+ },
431
+ {
432
+ "cell_type": "code",
433
+ "execution_count": null,
434
+ "metadata": {},
435
+ "outputs": [],
436
+ "source": [
437
+ "import subprocess, time, os, sys, urllib.request\n",
438
+ "import nest_asyncio\n",
439
+ "nest_asyncio.apply()\n",
440
+ "\n",
441
+ "os.chdir('/content/murshid_backend')\n",
442
+ "\n",
443
+ "# ─── التحقق من bitsandbytes قبل تشغيل الخادم ─────────────────\n",
444
+ "try:\n",
445
+ " import bitsandbytes as bnb\n",
446
+ " print(f'✅ bitsandbytes {bnb.__version__}')\n",
447
+ "except ImportError:\n",
448
+ " print('❌ bitsandbytes غير مثبّت — شغّلي الخلية 4 أولاً')\n",
449
+ " raise\n",
450
+ "\n",
451
+ "# ─── تشغيل uvicorn ───────────────────────────────────────────\n",
452
+ "log_path = '/content/murshid_server.log'\n",
453
+ "log_file = open(log_path, 'w')\n",
454
+ "\n",
455
+ "server_proc = subprocess.Popen(\n",
456
+ " [\n",
457
+ " 'python', '-m', 'uvicorn', 'app.main:app',\n",
458
+ " '--host', '0.0.0.0',\n",
459
+ " '--port', '8000',\n",
460
+ " '--log-level', 'info'\n",
461
+ " ],\n",
462
+ " cwd='/content/murshid_backend',\n",
463
+ " stdout=log_file,\n",
464
+ " stderr=subprocess.STDOUT\n",
465
+ ")\n",
466
+ "\n",
467
+ "print('⏳ Loading LLaMA 3 8B + SecureBERT+...')\n",
468
+ "print(' جاري التحميل — انتظري حتى تظهر الرسالة النهائية')\n",
469
+ "\n",
470
+ "# ─── انتظار ذكي مع عرض اللوج ────────────────────────────────\n",
471
+ "started = False\n",
472
+ "last_log_size = 0\n",
473
+ "\n",
474
+ "for i in range(180): # 15 دقيقة كحد أقصى\n",
475
+ " time.sleep(5)\n",
476
+ "\n",
477
+ " # تحقق إذا بدأ الخادم\n",
478
+ " try:\n",
479
+ " resp = urllib.request.urlopen('http://localhost:8000/health', timeout=3)\n",
480
+ " if resp.status == 200:\n",
481
+ " started = True\n",
482
+ " break\n",
483
+ " except Exception:\n",
484
+ " pass\n",
485
+ "\n",
486
+ " # عرض اللوج الجديد كل 30 ثانية\n",
487
+ " if i % 6 == 0:\n",
488
+ " elapsed = (i + 1) * 5\n",
489
+ " log_file.flush()\n",
490
+ " try:\n",
491
+ " with open(log_path) as f:\n",
492
+ " log_content = f.read()\n",
493
+ " new_content = log_content[last_log_size:]\n",
494
+ " last_log_size = len(log_content)\n",
495
+ "\n",
496
+ " # تحقق من خطأ مبكر\n",
497
+ " if 'ERROR' in new_content or 'ImportError' in new_content:\n",
498
+ " print(f'\\n❌ خطأ في الخادم عند {elapsed}s:')\n",
499
+ " # عرض آخر 1000 حرف من اللوج\n",
500
+ " print(log_content[-1500:])\n",
501
+ " server_proc.terminate()\n",
502
+ " log_file.close()\n",
503
+ " raise RuntimeError('Server failed to start. See log above.')\n",
504
+ "\n",
505
+ " # عرض ما تم تحميله\n",
506
+ " if 'Loaded' in new_content or 'loaded' in new_content or 'Application' in new_content:\n",
507
+ " for line in new_content.strip().split('\\n'):\n",
508
+ " if any(k in line for k in ['INFO', 'Loaded', 'loaded', 'Application', 'WARNING']):\n",
509
+ " print(f' {line.strip()}')\n",
510
+ " else:\n",
511
+ " mins = elapsed // 60\n",
512
+ " secs = elapsed % 60\n",
513
+ " print(f' ⏳ {mins}m {secs}s — يجري تحميل النماذج...')\n",
514
+ " except RuntimeError:\n",
515
+ " raise\n",
516
+ " except Exception:\n",
517
+ " print(f' ⏳ {elapsed}s elapsed...')\n",
518
+ "\n",
519
+ "log_file.flush()\n",
520
+ "log_file.close()\n",
521
+ "\n",
522
+ "if not started:\n",
523
+ " print('\\n❌ Server did not start after 15 minutes.')\n",
524
+ " print('─── آخر سطور اللوج ───')\n",
525
+ " with open(log_path) as f:\n",
526
+ " print(f.read()[-3000:])\n",
527
+ "else:\n",
528
+ " print('\\n✅ Server started successfully!')\n",
529
+ "\n",
530
+ " # ─── Cloudflare Tunnel (مجاني — بدون حساب) ──────────────────\n",
531
+ " import subprocess, re, threading, time\n",
532
+ "\n",
533
+ " # تثبيت cloudflared\n",
534
+ " subprocess.run(\n",
535
+ " ['wget', '-q', 'https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64',\n",
536
+ " '-O', '/usr/local/bin/cloudflared'],\n",
537
+ " check=True\n",
538
+ " )\n",
539
+ " subprocess.run(['chmod', '+x', '/usr/local/bin/cloudflared'], check=True)\n",
540
+ " print('✅ cloudflared installed')\n",
541
+ "\n",
542
+ " # تشغيل النفق\n",
543
+ " cf_log = open('/content/cloudflared.log', 'w')\n",
544
+ " cf_proc = subprocess.Popen(\n",
545
+ " ['cloudflared', 'tunnel', '--url', 'http://localhost:8000'],\n",
546
+ " stdout=cf_log, stderr=subprocess.STDOUT\n",
547
+ " )\n",
548
+ "\n",
549
+ " # انتظار ظهور الرابط في اللوج\n",
550
+ " public_url = None\n",
551
+ " for _ in range(30):\n",
552
+ " time.sleep(2)\n",
553
+ " cf_log.flush()\n",
554
+ " try:\n",
555
+ " with open('/content/cloudflared.log') as f:\n",
556
+ " content = f.read()\n",
557
+ " match = re.search(r'https://[a-z0-9\\-]+\\.trycloudflare\\.com', content)\n",
558
+ " if match:\n",
559
+ " public_url = match.group(0)\n",
560
+ " break\n",
561
+ " except Exception:\n",
562
+ " pass\n",
563
+ "\n",
564
+ " if public_url:\n",
565
+ " print('\\n' + '='*60)\n",
566
+ " print('🌐 PUBLIC URL (الرابط العام — Cloudflare):')\n",
567
+ " print(f' {public_url}')\n",
568
+ " print('='*60)\n",
569
+ " print(f'📖 Swagger: {public_url}/docs')\n",
570
+ " print(f'💚 Health: {public_url}/health')\n",
571
+ " print(f'🗄️ DB Summary: {public_url}/api/db/summary')\n",
572
+ " print('='*60)\n",
573
+ " print('\\n📋 انسخي هذا السطر والصقيه في الفرونت (index.html):')\n",
574
+ " print(f\" const BASE = '{public_url}';\")\n",
575
+ " else:\n",
576
+ " print('⚠️ Cloudflare tunnel URL not found, check /content/cloudflared.log')\n",
577
+ " with open('/content/cloudflared.log') as f:\n",
578
+ " print(f.read()[-1000:])"
579
+ ]
580
+ },
581
+ {
582
+ "cell_type": "code",
583
+ "execution_count": null,
584
+ "metadata": {},
585
+ "outputs": [],
586
+ "source": [
587
+ "# ─── تشغيل Cloudflare Tunnel بشكل منفصل (إذا فشل مع الخلية 8) ─\n",
588
+ "# شغّلي هذه الخلية فقط إذا كان الخادم يعمل لكن الـ tunnel فشل\n",
589
+ "\n",
590
+ "import subprocess, re, time, os\n",
591
+ "\n",
592
+ "# تثبيت cloudflared إذا لم يُثبَّت\n",
593
+ "if not os.path.isfile('/usr/local/bin/cloudflared'):\n",
594
+ " subprocess.run(\n",
595
+ " ['wget', '-q',\n",
596
+ " 'https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64',\n",
597
+ " '-O', '/usr/local/bin/cloudflared'],\n",
598
+ " check=True\n",
599
+ " )\n",
600
+ " subprocess.run(['chmod', '+x', '/usr/local/bin/cloudflared'], check=True)\n",
601
+ " print('✅ cloudflared installed')\n",
602
+ "else:\n",
603
+ " print('✅ cloudflared already installed')\n",
604
+ "\n",
605
+ "# تشغيل النفق\n",
606
+ "cf_log_path = '/content/cloudflared.log'\n",
607
+ "cf_log = open(cf_log_path, 'w')\n",
608
+ "cf_proc = subprocess.Popen(\n",
609
+ " ['cloudflared', 'tunnel', '--url', 'http://localhost:8000'],\n",
610
+ " stdout=cf_log, stderr=subprocess.STDOUT\n",
611
+ ")\n",
612
+ "\n",
613
+ "print('⏳ Opening Cloudflare tunnel...')\n",
614
+ "\n",
615
+ "public_url = None\n",
616
+ "for _ in range(30):\n",
617
+ " time.sleep(2)\n",
618
+ " cf_log.flush()\n",
619
+ " try:\n",
620
+ " with open(cf_log_path) as f:\n",
621
+ " content = f.read()\n",
622
+ " match = re.search(r'https://[a-z0-9\\-]+\\.trycloudflare\\.com', content)\n",
623
+ " if match:\n",
624
+ " public_url = match.group(0)\n",
625
+ " break\n",
626
+ " except Exception:\n",
627
+ " pass\n",
628
+ "\n",
629
+ "if public_url:\n",
630
+ " print('\\n' + '='*60)\n",
631
+ " print(f'🌐 PUBLIC URL: {public_url}')\n",
632
+ " print(f'📖 Swagger: {public_url}/docs')\n",
633
+ " print(f'💚 Health: {public_url}/health')\n",
634
+ " print('='*60)\n",
635
+ " print('\\n📋 الصقي هذا السطر في index.html:')\n",
636
+ " print(f\" const BASE = '{public_url}';\")\n",
637
+ "else:\n",
638
+ " print('❌ لم يُعثر على URL. اللوج:')\n",
639
+ " with open(cf_log_path) as f:\n",
640
+ " print(f.read())\n"
641
+ ]
642
+ },
643
+ {
644
+ "cell_type": "markdown",
645
+ "metadata": {},
646
+ "source": [
647
+ "---\n",
648
+ "## الخلية 9: ربط الفرونت بـ Cloudflare URL\n",
649
+ "\n",
650
+ "بعد تشغيل الخلية السابقة، ستظهر رسالة مثل:\n",
651
+ "```\n",
652
+ "🌐 PUBLIC URL: https://xxxx-xxxx.trycloudflare.com\n",
653
+ "```\n",
654
+ "\n",
655
+ "**الخلية أدناه تُحدّث الفرونت تلقائياً** — أو يمكنك التعديل يدوياً في `index.html`:\n",
656
+ "```javascript\n",
657
+ "const BASE = 'https://xxxx-xxxx.trycloudflare.com';\n",
658
+ "```\n"
659
+ ]
660
+ },
661
+ {
662
+ "cell_type": "code",
663
+ "execution_count": null,
664
+ "metadata": {},
665
+ "outputs": [],
666
+ "source": [
667
+ "import subprocess, re, time, os\n",
668
+ "\n",
669
+ "# ── الخطوة 1: تثبيت cloudflared ──────────────────────────────\n",
670
+ "if not os.path.isfile('/usr/local/bin/cloudflared'):\n",
671
+ " subprocess.run([\n",
672
+ " 'wget', '-q',\n",
673
+ " 'https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64',\n",
674
+ " '-O', '/usr/local/bin/cloudflared'\n",
675
+ " ], check=True)\n",
676
+ " subprocess.run(['chmod', '+x', '/usr/local/bin/cloudflared'], check=True)\n",
677
+ " print('✅ cloudflared installed')\n",
678
+ "else:\n",
679
+ " print('✅ cloudflared ready')\n",
680
+ "\n",
681
+ "# ── الخطوة 2: تشغيل النفق ────────────────────────────────────\n",
682
+ "cf_log_path = '/content/cf.log'\n",
683
+ "cf_log = open(cf_log_path, 'w')\n",
684
+ "subprocess.Popen(\n",
685
+ " ['cloudflared', 'tunnel', '--url', 'http://localhost:8000'],\n",
686
+ " stdout=cf_log, stderr=subprocess.STDOUT\n",
687
+ ")\n",
688
+ "\n",
689
+ "print('⏳ Opening Cloudflare tunnel...')\n",
690
+ "\n",
691
+ "# ── الخطوة 3: انتظار الرابط ───────────────────────────────────\n",
692
+ "public_url = None\n",
693
+ "for _ in range(30):\n",
694
+ " time.sleep(2)\n",
695
+ " cf_log.flush()\n",
696
+ " with open(cf_log_path) as f:\n",
697
+ " content = f.read()\n",
698
+ " match = re.search(r'https://[a-z0-9\\-]+\\.trycloudflare\\.com', content)\n",
699
+ " if match:\n",
700
+ " public_url = match.group(0)\n",
701
+ " break\n",
702
+ "\n",
703
+ "if not public_url:\n",
704
+ " print('❌ Tunnel failed. Log:')\n",
705
+ " with open(cf_log_path) as f: print(f.read())\n",
706
+ "else:\n",
707
+ " # ── الخطوة 4: تحديث index.html تلقائياً ─────────────────\n",
708
+ " frontend_path = '/content/drive/MyDrive/murshid_frontend/index.html'\n",
709
+ "\n",
710
+ " if os.path.isfile(frontend_path):\n",
711
+ " with open(frontend_path, 'r', encoding='utf-8') as f:\n",
712
+ " html = f.read()\n",
713
+ " html_updated = re.sub(r\"const BASE = '[^']*';\",\n",
714
+ " f\"const BASE = '{public_url}';\", html)\n",
715
+ " with open(frontend_path, 'w', encoding='utf-8') as f:\n",
716
+ " f.write(html_updated)\n",
717
+ " print(f'✅ index.html updated automatically')\n",
718
+ " else:\n",
719
+ " print(f'⚠️ index.html not found — عدّليه يدوياً')\n",
720
+ "\n",
721
+ " print('\\n' + '='*60)\n",
722
+ " print(f'🌐 PUBLIC URL: {public_url}')\n",
723
+ " print(f'📖 Swagger: {public_url}/docs')\n",
724
+ " print(f'💚 Health: {public_url}/health')\n",
725
+ " print(f'🖥️ Frontend: {public_url}/index.html')\n",
726
+ " print('='*60)\n",
727
+ " print(f\"\\n📋 const BASE = '{public_url}';\")"
728
+ ]
729
+ },
730
+ {
731
+ "cell_type": "markdown",
732
+ "metadata": {},
733
+ "source": [
734
+ "---\n",
735
+ "## الخلية 10: اختبار الـ API\n"
736
+ ]
737
+ },
738
+ {
739
+ "cell_type": "code",
740
+ "execution_count": null,
741
+ "metadata": {},
742
+ "outputs": [],
743
+ "source": [
744
+ "import urllib.request, json\n",
745
+ "\n",
746
+ "# ─── Health Check ────────────────────────────────────────────\n",
747
+ "with urllib.request.urlopen('http://localhost:8000/health') as r:\n",
748
+ " health = json.load(r)\n",
749
+ "\n",
750
+ "print('=== Health Check ===')\n",
751
+ "print(f\" status: {health['status']}\")\n",
752
+ "print(f\" pipeline_mode: {health['pipeline_mode']}\")\n",
753
+ "print(f\" llama_loaded: {health['components']['llama_loaded']}\")\n",
754
+ "print(f\" embedder_loaded: {health['components']['embedder_loaded']}\")\n",
755
+ "print(f\" logreg_loaded: {health['components']['logreg_loaded']}\")\n",
756
+ "print(f\" cuda_available: {health['components']['cuda_available']}\")\n",
757
+ "\n",
758
+ "mode = health.get('pipeline_mode', 'unknown')\n",
759
+ "if mode == 'full':\n",
760
+ " print('\\n✅ FULL mode — نتائج مطابقة 100% للدفتر')\n",
761
+ "elif mode == 'local':\n",
762
+ " print('\\n⚠️ LOCAL mode — LLaMA لم يُحمَّل، تحققي من MURSHID_SKIP_LLM=false')\n",
763
+ "else:\n",
764
+ " print('\\n❌ LITE mode — تحققي من تثبيت torch والنماذج')"
765
+ ]
766
+ },
767
+ {
768
+ "cell_type": "code",
769
+ "execution_count": null,
770
+ "metadata": {},
771
+ "outputs": [],
772
+ "source": [
773
+ "# ─── تحليل قاعدة اختبار ──────────────────────────────────────\n",
774
+ "import urllib.request, json\n",
775
+ "\n",
776
+ "test_rule = '''<rule id=\"18205\" level=\"5\">\n",
777
+ " <if_sid>18201</if_sid>\n",
778
+ " <id>^634$|^4730$</id>\n",
779
+ " <description>Windows: Security Enabled Global Group Deleted</description>\n",
780
+ " <mitre><id>T1484</id></mitre>\n",
781
+ " <group>group_deleted,win_group_deleted</group>\n",
782
+ "</rule>'''\n",
783
+ "\n",
784
+ "payload = json.dumps({'rule_xml': test_rule}).encode()\n",
785
+ "req = urllib.request.Request(\n",
786
+ " 'http://localhost:8000/rules/analyze',\n",
787
+ " data=payload,\n",
788
+ " headers={'Content-Type': 'application/json'},\n",
789
+ " method='POST'\n",
790
+ ")\n",
791
+ "\n",
792
+ "with urllib.request.urlopen(req) as r:\n",
793
+ " result = json.load(r)\n",
794
+ "\n",
795
+ "print('=== Analyze Result ===')\n",
796
+ "print(f\" rule_id: {result['rule_id']}\")\n",
797
+ "print(f\" pipeline_mode: {result['pipeline_mode']}\")\n",
798
+ "print(f\" summary: {result['summary']}\")\n",
799
+ "print(f\"\\n TOP 5 Techniques:\")\n",
800
+ "print(f\" {'Technique':<15} {'Conf%':>8} {'Proba':>8} {'Thr':>6} {'Gap':>8} {'Pred':>6}\")\n",
801
+ "print(f\" {'-'*55}\")\n",
802
+ "for r in result['all_results'][:5]:\n",
803
+ " pred = '✅' if r['predicted'] else ' '\n",
804
+ " print(f\" {pred} {r['technique_id']:<13} {r['confidence_percent']:>7.2f}%\"\n",
805
+ " f\" {r['proba']:>8.4f} {r['threshold']:>6.2f} {r['gap']:>+8.4f}\")\n",
806
+ "\n",
807
+ "print(f\"\\n Detected: {len(result['detected'])} technique(s)\")\n",
808
+ "for d in result['detected']:\n",
809
+ " print(f\" ✅ {d['technique_id']} — {d['confidence_percent']}%\")"
810
+ ]
811
+ },
812
+ {
813
+ "cell_type": "code",
814
+ "execution_count": null,
815
+ "metadata": {},
816
+ "outputs": [],
817
+ "source": [
818
+ "# ─── قوالب WQL للتقنية المكتشفة ──────────────────────────────\n",
819
+ "if result['detected']:\n",
820
+ " top_technique = result['detected'][0]['technique_id']\n",
821
+ "\n",
822
+ " with urllib.request.urlopen(f'http://localhost:8000/queries/{top_technique}') as r:\n",
823
+ " queries = json.load(r)\n",
824
+ "\n",
825
+ " print(f'=== WQL Templates for {top_technique} ===')\n",
826
+ " for i, q in enumerate(queries, 1):\n",
827
+ " print(f\"\\n [{i}] {q.get('purpose', 'N/A')}\")\n",
828
+ " print(f\" Query: {q['wql_query'][:120]}...\")\n",
829
+ " print(f\" Note: {q.get('note', 'N/A')}\")"
830
+ ]
831
+ },
832
+ {
833
+ "cell_type": "markdown",
834
+ "metadata": {},
835
+ "source": [
836
+ "---\n",
837
+ "## الخلية 11: تصدير النتائج (اختياري)\n",
838
+ "\n",
839
+ "لحفظ النتائج بصيغة JSON لاستخدامها لاحقاً على الجهاز المحلي\n"
840
+ ]
841
+ },
842
+ {
843
+ "cell_type": "code",
844
+ "execution_count": null,
845
+ "metadata": {},
846
+ "outputs": [],
847
+ "source": [
848
+ "# ─── تحليل قائمة من القواعد وتصديرها ───────────────────────\n",
849
+ "import urllib.request, json, os\n",
850
+ "\n",
851
+ "# ✏️ أضيفي Rule IDs التي تريدين تحليلها\n",
852
+ "# يمكنك قراءتها من ملف\n",
853
+ "test_ids_path = f'{NEEDED_PATH}/test_rule_ids.json'\n",
854
+ "\n",
855
+ "if os.path.isfile(test_ids_path):\n",
856
+ " with open(test_ids_path) as f:\n",
857
+ " rule_ids = json.load(f)\n",
858
+ " print(f'Loaded {len(rule_ids)} rule IDs from test_rule_ids.json')\n",
859
+ "else:\n",
860
+ " # قواعد تجريبية\n",
861
+ " rule_ids = ['18205']\n",
862
+ " print('Using default test rule')\n",
863
+ "\n",
864
+ "print(f'Processing {len(rule_ids)} rules...')\n",
865
+ "\n",
866
+ "export_results = []\n",
867
+ "\n",
868
+ "for rule_id in rule_ids:\n",
869
+ " try:\n",
870
+ " with urllib.request.urlopen(f'http://localhost:8000/results/{rule_id}') as r:\n",
871
+ " data = json.load(r)\n",
872
+ " data['source'] = 'colab_full_mode'\n",
873
+ " export_results.append(data)\n",
874
+ " detected = len(data.get('detected', []))\n",
875
+ " top = data['mappings'][0] if data['mappings'] else {}\n",
876
+ " print(f\" ✅ {rule_id}: {top.get('technique_id','?')} ({top.get('confidence_percent','?')}%) — {detected} detected\")\n",
877
+ " except Exception as e:\n",
878
+ " print(f\" ⚠️ {rule_id}: {e}\")\n",
879
+ "\n",
880
+ "# حفظ النتائج\n",
881
+ "export_path = f'{NEEDED_PATH}/murshid_full_results.json'\n",
882
+ "with open(export_path, 'w', encoding='utf-8') as f:\n",
883
+ " json.dump(export_results, f, ensure_ascii=False, indent=2)\n",
884
+ "\n",
885
+ "print(f'\\n✅ Exported {len(export_results)} results to:')\n",
886
+ "print(f' {export_path}')\n",
887
+ "print('\\nيمكنك الآن استيراد هذا الملف في الباكند المحلي')"
888
+ ]
889
+ },
890
+ {
891
+ "cell_type": "markdown",
892
+ "metadata": {},
893
+ "source": [
894
+ "---\n",
895
+ "## الخلية 12: إيقاف الخادم (عند الانتهاء)\n"
896
+ ]
897
+ },
898
+ {
899
+ "cell_type": "code",
900
+ "execution_count": null,
901
+ "metadata": {},
902
+ "outputs": [],
903
+ "source": [
904
+ "# إيقاف الخادم وإغلاق ngrok\n",
905
+ "try:\n",
906
+ " from pyngrok import ngrok\n",
907
+ " ngrok.kill()\n",
908
+ " print('✅ ngrok tunnel closed')\n",
909
+ "except Exception:\n",
910
+ " pass\n",
911
+ "\n",
912
+ "try:\n",
913
+ " server_proc.terminate()\n",
914
+ " print('✅ Server stopped')\n",
915
+ "except Exception:\n",
916
+ " pass"
917
+ ]
918
+ },
919
+ {
920
+ "cell_type": "markdown",
921
+ "metadata": {},
922
+ "source": [
923
+ "---\n",
924
+ "## ملاحظات مهمة\n",
925
+ "\n",
926
+ "### إذا انقطع الاتصال بـ Colab\n",
927
+ "- الخادم يتوقف تلقائياً\n",
928
+ "- أعيدي تشغيل الخلايا من الخلية 8\n",
929
+ "- رابط Cloudflare سيتغيّر — عدّلي الفرونت بالرابط الجديد\n",
930
+ "\n",
931
+ "### إذا ظهر خطأ في LLaMA\n",
932
+ "- تأكدي أن لديك صلاحية الوصول للنموذج: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct\n",
933
+ "- تأكدي من صحة HF_TOKEN\n",
934
+ "\n",
935
+ "### المقارنة مع الجهاز المحلي\n",
936
+ "| | Colab (FULL) | الجهاز المحلي (LOCAL) |\n",
937
+ "|--|-------------|----------------------|\n",
938
+ "| LLaMA | ✅ | ❌ |\n",
939
+ "| T1484 confidence | **94.76%** | 89.29% |\n",
940
+ "| القرار النهائي | T1484 ✅ | T1484 ✅ |\n",
941
+ "\n",
942
+ "### للعرض التقديمي\n",
943
+ "1. شغّلي الخلايا 1-8 مسبقاً (قبل العرض بـ 15 دقيقة)\n",
944
+ "2. انسخي رابط Cloudflare\n",
945
+ "3. عدّلي الفرونت\n",
946
+ "4. افتحي `https://xxxx.trycloudflare.com/index.html`\n"
947
+ ]
948
+ }
949
+ ],
950
+ "metadata": {
951
+ "accelerator": "GPU",
952
+ "colab": {
953
+ "gpuType": "T4",
954
+ "machine_shape": "hm",
955
+ "provenance": []
956
+ },
957
+ "kernelspec": {
958
+ "display_name": "Python 3",
959
+ "name": "python3"
960
+ },
961
+ "language_info": {
962
+ "name": "python"
963
+ }
964
+ },
965
+ "nbformat": 4,
966
+ "nbformat_minor": 0
967
+ }
MurshidBackend_Colab_Report.md ADDED
@@ -0,0 +1,545 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # تقرير تقني: آلية عمل MurshidBackend_Colab.ipynb
2
+
3
+ ## مشروع مُرشِد | From Alerts to Guidance
4
+ ### MITRE ATT&CK-Aligned Techniques Mapping for SOC Analysts
5
+
6
+ ---
7
+
8
+ ## 1. نظرة عامة
9
+
10
+ `MurshidBackend_Colab.ipynb` هو دفتر Jupyter مُصمَّم لتشغيل الباكند الكامل لمشروع مُرشِد على بيئة **Google Colab** باستخدام **GPU (Tesla T4)**، مما يُتيح تشغيل نموذج **LLaMA 3 8B** بتكميم 4-bit لتوليد ملخصات دلالية غنية لقواعد Wazuh XML، وذلك على عكس البيئة المحلية التي تعمل بدون LLaMA (LOCAL mode).
11
+
12
+ ### الهدف الرئيسي
13
+ تشغيل **FULL mode** للـ pipeline:
14
+ ```
15
+ قاعدة Wazuh XML
16
+
17
+ LLaMA 3 8B ←── ملخص دلالي غني (GPU)
18
+
19
+ SecureBERT+ ←── 768-dim embedding
20
+
21
+ Logistic Regression ←── confidence scores لكل تقنية
22
+
23
+ FastAPI + SQLite ←── تخزين وخدمة النتائج
24
+
25
+ Cloudflare Tunnel ←── رابط عام للفرونت
26
+ ```
27
+
28
+ ---
29
+
30
+ ## 2. المتطلبات قبل التشغيل
31
+
32
+ ### 2.1 إعداد Google Colab
33
+ | المتطلب | التفاصيل |
34
+ |---------|----------|
35
+ | **GPU** | Tesla T4 — يُفعَّل من: `Runtime → Change runtime type → T4 GPU` |
36
+ | **الذاكرة** | High RAM (machine_shape: "hm") |
37
+ | **الإنترنت** | مفعَّل لتنزيل النماذج من Hugging Face |
38
+
39
+ ### 2.2 الملفات المطلوبة على Google Drive
40
+ ```
41
+ MyDrive/
42
+ ├── murshid_backend_for_drive.zip ← ملفات الباكند مضغوطة (44 KB)
43
+ │ أو
44
+ ├── murshid_backend/ ← المجلد مستخرج مسبقاً
45
+ │ ├── app/
46
+ │ │ ├── main.py
47
+ │ │ ├── config.py
48
+ │ │ ├── api/routes/
49
+ │ │ ├── ml/
50
+ │ │ ├── models/
51
+ │ │ ├── services/
52
+ │ │ └── repositories/
53
+ │ ├── alembic/
54
+ │ ├── scripts/
55
+ │ ├── alembic.ini
56
+ │ └── requirements.txt
57
+
58
+ └── Needed/
59
+ ├── murshid_logreg_pipeline_manual_oof_pcatuned.joblib ← نموذج LogReg
60
+ ├── murshid_logreg_thresholds_manual_oof_pcatuned.npy ← عتبات التنبؤ
61
+ ├── murshid_label_columns.json ← أسماء التقنيات الـ 20
62
+ └── murshid_query_template_structure_clean_shared.xlsx ← 60 قالب WQL
63
+ ```
64
+
65
+ ### 2.3 Hugging Face Token
66
+ مطلوب للوصول إلى نموذج `meta-llama/Meta-Llama-3-8B-Instruct`:
67
+ - يُضاف في `Colab Secrets` باسم `HF_TOKEN`
68
+ - أو مباشرةً في خلية 5 من الدفتر
69
+
70
+ ---
71
+
72
+ ## 3. شرح الخلايا بالتفصيل
73
+
74
+ ### الخلية 1: التحقق من GPU
75
+
76
+ **الهدف:** التأكد من وجود GPU قبل البدء.
77
+
78
+ ```python
79
+ import torch
80
+ print('CUDA available:', torch.cuda.is_available())
81
+ print('GPU:', torch.cuda.get_device_name(0))
82
+ print('Memory:', round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1), 'GB')
83
+ ```
84
+
85
+ **المخرج المتوقع:**
86
+ ```
87
+ CUDA available: True
88
+ GPU: Tesla T4
89
+ Memory: 15.8 GB
90
+ ```
91
+
92
+ **ماذا يحدث إذا لم يكن هناك GPU؟**
93
+ - LLaMA لن يُحمَّل (يحتاج CUDA)
94
+ - الخادم سيعمل بـ LOCAL mode فقط (بدون تلخيص)
95
+
96
+ ---
97
+
98
+ ### الخلية 2: تحميل Google Drive والتحقق من الملفات
99
+
100
+ **الهدف:** ربط Colab بـ Google Drive والتحقق من وجود جميع الملفات المطلوبة.
101
+
102
+ ```python
103
+ from google.colab import drive
104
+ drive.mount('/content/drive')
105
+
106
+ NEEDED_PATH = '/content/drive/MyDrive/Needed'
107
+ BACKEND_PATH = '/content/drive/MyDrive/murshid_backend'
108
+ ZIP_PATH = '/content/drive/MyDrive/murshid_backend_for_drive.zip'
109
+ ```
110
+
111
+ **ما يتحقق منه:**
112
+ | الملف | النوع | الحالة |
113
+ |-------|-------|--------|
114
+ | `murshid_logreg_pipeline_manual_oof_pcatuned.joblib` | إلزامي | ✅ / ❌ |
115
+ | `murshid_logreg_thresholds_manual_oof_pcatuned.npy` | إلزامي | ✅ / ❌ |
116
+ | `murshid_label_columns.json` | إلزامي | ✅ / ❌ |
117
+ | `murshid_query_template_structure_clean_shared.xlsx` | اختياري | ✅ / ⚠️ |
118
+ | `murshid_backend/` أو `.zip` | إلزامي | ✅ / ❌ |
119
+
120
+ ---
121
+
122
+ ### الخلية 3: تجهيز الباكند في /content
123
+
124
+ **الهدف:** نقل ملفات الباكند من Drive إلى `/content` لتسريع القراءة (Drive أبطأ في I/O).
125
+
126
+ **المنطق الذكي:**
127
+ ```
128
+ هل murshid_backend/ موجود على Drive؟
129
+ ↓ نعم → انسخ مباشرةً إلى /content
130
+ ↓ لا
131
+ هل murshid_backend_for_drive.zip موجود؟
132
+ ↓ نعم → استخرجه إلى Drive أولاً ثم انسخ
133
+ ↓ لا
134
+ → ❌ خطأ: "ارفعي ZIP إلى Google Drive"
135
+ ```
136
+
137
+ **الخطوات المنفَّذة:**
138
+ 1. **استخراج ZIP** (إذا لزم) إلى `MyDrive/`
139
+ 2. **نسخ** `murshid_backend/` إلى `/content/murshid_backend/` (بدون pycache وملفات مؤقتة)
140
+ 3. **إضافة** `/content/murshid_backend` إلى `sys.path`
141
+ 4. **تغيير** working directory إلى `/content/murshid_backend`
142
+
143
+ **لماذا النسخ إلى /content؟**
144
+ - Drive يعتمد على FUSE mount = بطيء للقراءة المتكررة
145
+ - `/content` على SSD محلي للـ VM = أسرع بـ 5-10x
146
+
147
+ ---
148
+
149
+ ### الخلية 4: تثبيت المتطلبات
150
+
151
+ **الهدف:** تثبيت جميع المكتبات اللازمة لتشغيل الباكند.
152
+
153
+ **المكتبات المثبَّتة:**
154
+
155
+ | المكتبة | الإصدار | الغرض |
156
+ |---------|---------|--------|
157
+ | `fastapi` | 0.115.0 | إطار API |
158
+ | `uvicorn` | 0.32.0 | خادم ASGI |
159
+ | `pydantic` | 2.9.0 | تحقق من البيانات |
160
+ | `sqlalchemy` | 2.0.0 | ORM |
161
+ | `alembic` | 1.13.0 | هجرة DB |
162
+ | `scikit-learn` | **1.6.1** | نموذج LogReg (يطابق بيئة التدريب) |
163
+ | `bitsandbytes` | ≥0.46.1 | تكميم LLaMA 4-bit |
164
+ | `accelerate` | آخر نسخة | `device_map="auto"` للـ GPU |
165
+ | `openpyxl` | آخر نسخة | قراءة ملف Excel |
166
+ | `lxml` | آخر نسخة | معالجة XML |
167
+ | `pyngrok` | آخر نسخة | (احتياطي — غير مستخدم) |
168
+
169
+ > **ملاحظة مهمة:** `scikit-learn==1.6.1` محدَّد بدقة لأن ملفات joblib دُرِّبت بهذه النسخة — استخدام نسخة مختلفة يُنتج تحذيرات `InconsistentVersionWarning`.
170
+
171
+ ---
172
+
173
+ ### الخلية 5: إعداد ملف .env
174
+
175
+ **الهدف:** إنشاء ملف الإعدادات لتشغيل FULL mode.
176
+
177
+ **محتوى الملف المُولَّد:**
178
+ ```env
179
+ MURSHID_DB_URL=sqlite:////content/murshid.db
180
+ MURSHID_MODELS_DIR=/content/drive/MyDrive/Needed
181
+ HF_TOKEN=****
182
+ MURSHID_SKIP_LLM=false ← مفتاح FULL mode
183
+ SECRET_KEY=murshid_colab_2026
184
+ LLAMA_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
185
+ EMBED_MODEL_ID=ehsanaghaei/SecureBERT_Plus
186
+ LOGREG_JOBLIB=murshid_logreg_pipeline_manual_oof_pcatuned.joblib
187
+ LOGREG_THRESHOLDS_NPY=murshid_logreg_thresholds_manual_oof_pcatuned.npy
188
+ LABEL_COLUMNS_JSON=murshid_label_columns.json
189
+ ```
190
+
191
+ **الفرق بين FULL و LOCAL mode:**
192
+ | المتغير | FULL mode | LOCAL mode |
193
+ |---------|-----------|------------|
194
+ | `MURSHID_SKIP_LLM` | `false` | `true` |
195
+ | LLaMA يُحمَّل؟ | ✅ نعم | ❌ لا |
196
+ | جودة التلخيص | عالية | الوصف الخام فقط |
197
+ | T1484 confidence (مثال) | **94.76%** | 89.29% |
198
+
199
+ ---
200
+
201
+ ### الخلية 6: تهجير قاعدة البيانات (Alembic)
202
+
203
+ **الهدف:** إنشاء جداول قاعدة البيانات SQLite.
204
+
205
+ ```bash
206
+ python -m alembic upgrade head
207
+ ```
208
+
209
+ **الجداول المُنشأة (من migration 0001):**
210
+
211
+ | الجدول | الغرض | مصدره في التقرير |
212
+ |--------|--------|-----------------|
213
+ | `users` | مستخدمو النظام (admin/analyst) | ER Diagram §3.2.6 |
214
+ | `mapping_jobs` | وظائف معالجة ملفات القواعد | ER Diagram §3.2.6 |
215
+ | `rules` | قواعد Wazuh المُحلَّلة | ER Diagram §3.2.6 |
216
+ | `techniques` | تقنيات MITRE ATT&CK | ER Diagram §3.2.6 |
217
+ | `rule_technique_mappings` | ربط القواعد بالتقنيات + confidence | ER Diagram §3.2.6 |
218
+ | `query_templates` | قوالب WQL للتحقيق | ER Diagram §3.2.6 |
219
+
220
+ > **ملاحظة:** قاعدة البيانات في `/content/murshid.db` — تُنشأ من جديد في كل جلسة Colab.
221
+
222
+ ---
223
+
224
+ ### الخلية 7: استيراد قوالب WQL من Excel
225
+
226
+ **الهدف:** تحميل 60 قالب WQL من ملف Excel إلى قاعدة البيانات.
227
+
228
+ **البيانات المستوردة:**
229
+
230
+ | الإحصائية | القيمة |
231
+ |-----------|--------|
232
+ | إجمالي التقنيات | 20 تقنية |
233
+ | إجمالي القوالب | 60 قالب (3 لكل تقنية) |
234
+ | التقنيات المشمولة | T1047, T1055, T1059.001, T1070.004, T1078, T1083, T1095, T1098, T1105, T1110, T1112, T1114, T1176, T1190, T1484, T1498, T1499, T1529, T1531, T1562.001 |
235
+
236
+ **مثال على قالب WQL (T1484):**
237
+ ```
238
+ Template 1: Host pivot
239
+ agent.name:${HOST} AND win.system.eventID:(4728 OR 4729 ...) AND @timestamp:[now-24h TO now]
240
+
241
+ Template 2: Actor pivot
242
+ win.eventdata.SubjectUserName:${USER} AND win.system.eventID:(...) AND @timestamp:[now-24h TO now]
243
+
244
+ Template 3: High-impact target change
245
+ win.system.eventID:(...) AND win.eventdata.TargetUserName:("Domain Admins" OR ...) AND @timestamp:[now-24h TO now]
246
+ ```
247
+
248
+ **منع التكرار:**
249
+ - يتحقق من وجود (`technique_id` + `purpose`) قبل الإضافة
250
+ - `replace=False` بشكل افتراضي (لا يُعيد الكتابة)
251
+
252
+ ---
253
+
254
+ ### الخلية 8: تشغيل FastAPI + Cloudflare Tunnel
255
+
256
+ **الهدف:** الخلية الرئيسية — تُشغّل الباكند وتُنشئ رابطاً عاماً.
257
+
258
+ #### 8.1 التحقق من bitsandbytes
259
+ ```python
260
+ import bitsandbytes as bnb
261
+ print(f'✅ bitsandbytes {bnb.__version__}')
262
+ ```
263
+ > إذا فشل: يُوقف التشغيل فوراً مع رسالة واضحة.
264
+
265
+ #### 8.2 تشغيل uvicorn
266
+ ```bash
267
+ python -m uvicorn app.main:app --host 0.0.0.0 --port 8000 --log-level info
268
+ ```
269
+ - `--host 0.0.0.0`: يستمع على كل الواجهات (مطلوب للـ tunnel)
270
+ - اللوج يُحفظ في `/content/murshid_server.log`
271
+
272
+ #### 8.3 تحميل النماذج (lifespan)
273
+ عند بدء الخادم تُنفَّذ `load_models()` بهذا الترتيب:
274
+
275
+ ```
276
+ 1. hf_login(token) ← 1-2 ثانية
277
+ 2. LLaMA 3 8B-Instruct (4-bit NF4) ← 5-8 دقائق (4.5 GB)
278
+ - BitsAndBytesConfig: load_in_4bit=True
279
+ - bnb_4bit_quant_type="nf4"
280
+ - bnb_4bit_compute_dtype=float16
281
+ 3. SecureBERT+ (ehsanaghaei) ← 1-2 دقيقة
282
+ - AutoModel + AutoTokenizer
283
+ - mean pooling 768-dim
284
+ 4. LogisticRegressionModel ← < 1 ثانية
285
+ - joblib.load (Pipeline: PCA + OneVsRestClassifier)
286
+ - np.load thresholds
287
+ ```
288
+
289
+ #### 8.4 الانتظار الذكي
290
+ ```python
291
+ for i in range(180): # 15 دقيقة كحد أقصى
292
+ time.sleep(5)
293
+ # فحص /health كل 5 ثوانٍ
294
+ # عرض اللوج كل 30 ثانية
295
+ # كشف مبكر للأخطاء (ERROR, ImportError)
296
+ ```
297
+
298
+ #### 8.5 Cloudflare Tunnel
299
+ ```bash
300
+ wget cloudflared-linux-amd64 → /usr/local/bin/cloudflared
301
+ cloudflared tunnel --url http://localhost:8000
302
+ ```
303
+ - لا يحتاج حساباً أو توكناً
304
+ - يُنتج رابطاً مثل: `https://xxxx.trycloudflare.com`
305
+ - صالح طوال جلسة Colab
306
+
307
+ ---
308
+
309
+ ### الخلية 9: ربط الفرونت تلقائياً
310
+
311
+ **الهدف:** تحديث `index.html` بالرابط الجديد من Cloudflare تلقائياً.
312
+
313
+ ```python
314
+ # استخراج الرابط
315
+ match = re.search(r'https://[a-z0-9\-]+\.trycloudflare\.com', content)
316
+ public_url = match.group(0)
317
+
318
+ # تحديث index.html على Drive
319
+ html = re.sub(
320
+ r"const BASE = '[^']*';",
321
+ f"const BASE = '{public_url}';",
322
+ html
323
+ )
324
+ ```
325
+
326
+ **النتيجة:**
327
+ ```javascript
328
+ // قبل
329
+ const BASE = 'http://127.0.0.1:8000';
330
+
331
+ // بعد
332
+ const BASE = 'https://xxxx.trycloudflare.com';
333
+ ```
334
+
335
+ ---
336
+
337
+
338
+
339
+ ### الخلية 10: اختبار الـ API
340
+
341
+ **الهدف:** التحقق من عمل كل مكون.
342
+
343
+ #### 10.1 Health Check
344
+ ```python
345
+ urllib.request.urlopen('http://localhost:8000/health')
346
+ ```
347
+
348
+ **المخرج المتوقع (FULL mode):**
349
+ ```json
350
+ {
351
+ "pipeline_mode": "full",
352
+ "pipeline_description": "LLaMA + SecureBERT+ + LogReg",
353
+ "components": {
354
+ "llama_loaded": true,
355
+ "embedder_loaded": true,
356
+ "logreg_loaded": true,
357
+ "cuda_available": true
358
+ },
359
+ "all_model_files_present": true
360
+ }
361
+ ```
362
+
363
+ #### 10.2 تحليل قاعدة اختبار
364
+ ```python
365
+ rule_xml = '<rule id="18205" level="5">...'
366
+ POST http://localhost:8000/rules/analyze
367
+ ```
368
+
369
+ **الـ pipeline خطوة بخطوة:**
370
+
371
+ ```
372
+ XML Input (rule 18205)
373
+
374
+ sanitize_rule_from_string()
375
+ - حذف: mitre, if_sid, group, if_group
376
+
377
+ summarize_one_rule() [LLaMA]
378
+ - Input: sanitized XML
379
+ - Output: "Detects the deletion of a security-enabled global group on a Windows system."
380
+
381
+ build_text_for_embedding()
382
+ - text = summary + ". " + description
383
+ - "Detects the deletion of a security-enabled global group on a Windows system. Windows: Security Enabled Global Group Deleted."
384
+
385
+ SecureBERTEmbedder.embed_text()
386
+ - Chunks (256 tokens max)
387
+ - mean pooling per chunk
388
+ - average chunks → 768-dim vector
389
+ - L2 normalize
390
+
391
+ LogisticRegressionModel.predict()
392
+ - predict_proba(X_user)
393
+ - pred = (proba >= logreg_thr)
394
+ - conf = proba * 100
395
+ - gap = proba - logreg_thr
396
+
397
+ save_technique_mappings() [DB]
398
+ - حفظ 20 تقنية مع confidence
399
+
400
+ JSON Response
401
+ ```
402
+
403
+ **المخرج للقاعدة 18205:**
404
+ ```
405
+ Technique Pred Conf% Proba Thr Gap
406
+ T1484 ✅ 94.76 0.9476 0.74 +0.2076 ← Primary
407
+ T1531 ❌ 27.92 0.2792 ... ...
408
+ T1070.004 ❌ 21.03 0.2103 ... ...
409
+ T1098 ❌ 10.65 0.1065 ... ...
410
+ T1112 ❌ 9.27 0.0927 ... ...
411
+ ```
412
+
413
+ ---
414
+ الخطوات القادمة للوضع المحلي (LOCAL Mode) غير ضرورية
415
+
416
+ ### الخلية 11: تصدير النتائج (اختياري)
417
+
418
+ **الهدف:** تصدير نتائج القواعد المُحلَّلة إلى JSON لاستخدامها لاحقاً على الجهاز المحلي.
419
+
420
+ ```python
421
+ export_path = f'{NEEDED_PATH}/murshid_full_results.json'
422
+ json.dump(export_results, f, ensure_ascii=False, indent=2)
423
+ ```
424
+
425
+ **الاستخدام:** يُمكِّن استيراد نتائج FULL mode في الباكند المحلي بدون GPU.
426
+
427
+ ---
428
+
429
+ ### الخلية 12: إيقاف الخادم
430
+
431
+ ```python
432
+ ngrok.kill() # إغلاق نفق ngrok (كما في كود الخلية 12)
433
+ server_proc.terminate() # إيقاف uvicorn
434
+ ```
435
+
436
+ ---
437
+
438
+ ## 4. مقارنة أوضاع التشغيل
439
+
440
+ | | FULL mode (Colab) | LOCAL mode (الجهاز) | LITE mode |
441
+ |--|-------------------|---------------------|-----------|
442
+ | **LLaMA** | ✅ | ❌ | ❌ |
443
+ | **SecureBERT+** | ✅ | ✅ | ❌ |
444
+ | **LogReg** | ✅ | ✅ | ✅ |
445
+ | **GPU** | Tesla T4 | لا يلزم | لا يلزم |
446
+ | **Embedding** | نص مُثرى بـ LLaMA | وصف القاعدة فقط | عشوائي |
447
+ | **T1484 confidence** | **94.76%** | 89.29% | غير موثوق |
448
+ | **القرار النهائي** | T1484 ✅ | T1484 ✅ | غير موثوق |
449
+ | **وقت التحليل/قاعدة** | ~30-60 ثانية | ~2-5 ثوانٍ | < 1 ثانية |
450
+ | **الاستخدام** | إنتاج / عرض | تطوير محلي | اختبار فقط |
451
+
452
+ ---
453
+
454
+ ## 5. معمارية النظام الكاملة على Colab
455
+
456
+ ```
457
+ ┌─────────────────────────────────────────────────────┐
458
+ │ Google Colab VM │
459
+ │ │
460
+ │ ┌─────────────────────────────────┐ │
461
+ │ │ /content/murshid_backend/ │ │
462
+ │ │ │ │
463
+ │ │ FastAPI (uvicorn :8000) │ │
464
+ │ │ ├── /health │ │
465
+ │ │ ├── POST /rules/analyze │ │
466
+ │ │ ├── GET /results/{rule_id} │ │
467
+ │ │ ├── GET /queries/{tech_id} │ │
468
+ │ │ └── GET /api/db/... │ │
469
+ │ └───────────────┬─────────────────┘ │
470
+ │ │ │
471
+ │ ┌───────────────┴───────────┐ │
472
+ │ │ ML Models (GPU VRAM) │ │
473
+ │ │ ├── LLaMA 3 8B (4-bit) │ │
474
+ │ │ ├── SecureBERT+ │ │
475
+ │ │ └── LogReg Pipeline │ │
476
+ │ └───────────────────────────┘ │
477
+ │ │ │
478
+ │ ┌───────────────┴───────────┐ │
479
+ │ │ /content/murshid.db │ │
480
+ │ │ (SQLite — 6 جداول) │ │
481
+ │ └───────────────────────────┘ │
482
+ │ │
483
+ │ ┌───────────────────────────┐ │
484
+ │ │ cloudflared tunnel │ │
485
+ │ │ localhost:8000 → HTTPS │ │
486
+ │ └───────────────┬───────────┘ │
487
+ └──────────────────┼──────────────────────────────────┘
488
+
489
+
490
+ https://xxxx.trycloudflare.com
491
+
492
+
493
+ ┌─────────────────────────┐
494
+ │ المتصفح / الفرونت │
495
+ │ index.html (React) │
496
+ └─────────────────────────┘
497
+ ```
498
+
499
+ ---
500
+
501
+ ## 6. الأخطاء الشائعة وحلولها
502
+
503
+ | الخطأ | السبب | الحل |
504
+ |-------|-------|------|
505
+ | `ImportError: bitsandbytes>=0.46.1` | نسخة قديمة | شغّلي `!pip install -U bitsandbytes>=0.46.1` |
506
+ | `FileNotFoundError: murshid_backend` | ZIP غير مرفوع | ارفعي `murshid_backend_for_drive.zip` إلى Drive |
507
+ | `ERR_NGROK_4018` | ngrok يحتاج حساباً | استخدمي Cloudflare Tunnel (خلية 9) |
508
+ | `Cannot connect to backend` | CORS مغلق | `allow_origins=["*"]` في `main.py` |
509
+ | Server يستغرق > 15 دقيقة | تنزيل LLaMA بطيء | في الجلسة الثانية التنزيل من Cache |
510
+ | `InconsistentVersionWarning` | sklearn إصدار مختلف | تأكدي من `scikit-learn==1.6.1` |
511
+
512
+ ---
513
+
514
+ ## 7. الـ Endpoints المتاحة بعد التشغيل
515
+
516
+ | Method | Endpoint | الوصف |
517
+ |--------|----------|-------|
518
+ | `GET` | `/health` | حالة الخادم والنماذج |
519
+ | `GET` | `/api/stats` | إحصائيات Dashboard |
520
+ | `GET` | `/api/db/summary` | عدد الصفوف في الجداول |
521
+ | `GET` | `/api/db/rules` | جميع القواعد في DB |
522
+ | `GET` | `/api/db/mappings` | جميع المطابقات |
523
+ | `GET` | `/api/db/techniques` | تقنيات MITRE المخزّنة |
524
+ | `GET` | `/api/db/templates` | قوالب WQL |
525
+ | `POST` | `/api/db/import-excel` | استيراد Excel |
526
+ | `POST` | `/rules/analyze` | تحليل قاعدة XML (FULL pipeline) |
527
+ | `GET` | `/results/{rule_id}` | نتائج تقنية قاعدة محددة |
528
+ | `GET` | `/queries/{technique_id}` | استعلامات WQL لتقنية |
529
+ | `POST` | `/admin/templates` | إضافة قالب WQL |
530
+ | `PATCH` | `/admin/templates/{id}` | تعديل قالب |
531
+ | `GET` | `/docs` | Swagger UI التفاعلي |
532
+
533
+ ---
534
+
535
+ ## 8. ملاحظات للعرض التقديمي
536
+
537
+ 1. **شغّلي الخلايا قبل العرض بـ 15 دقيقة** (وقت تحميل LLaMA)
538
+ 2. **انسخي رابط Cloudflare** وتحققي منه في المتصفح
539
+ 3. **الفرونت يُحدَّث تلقائياً** بالرابط الجديد في خلية 9
540
+ 4. **كل جلسة Colab جديدة = رابط Cloudflare جديد** — كرّري الخطوات
541
+ 5. **DB فارغة في كل جلسة** — حلّلي القواعد عبر Admin Panel أو خلية اختبار
542
+
543
+ ---
544
+
545
+ *تاريخ الإنشاء: 8 أبريل 2026 | مشروع مُرشِد — CCIS, PNU*
Needed/murshid_label_columns.json ADDED
@@ -0,0 +1,22 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ "T1047",
3
+ "T1055",
4
+ "T1059.001",
5
+ "T1070.004",
6
+ "T1078",
7
+ "T1083",
8
+ "T1095",
9
+ "T1098",
10
+ "T1105",
11
+ "T1110",
12
+ "T1112",
13
+ "T1114",
14
+ "T1176",
15
+ "T1190",
16
+ "T1484",
17
+ "T1498",
18
+ "T1499",
19
+ "T1529",
20
+ "T1531",
21
+ "T1562.001"
22
+ ]
Needed/murshid_logreg_pipeline_manual_oof_pcatuned.joblib ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:be629d9f6780456a9435f8be2655e3fa0a848fbe2a4f166813913331b4c43ba4
3
+ size 206584
Needed/murshid_logreg_thresholds_manual_oof_pcatuned.npy ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:005a664d7faa22104e4a9e58ace6976628d1d00c1cabcaead1833ff792366c79
3
+ size 208
Needed/murshid_query_template_structure_clean_shared.xlsx ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a1491c4dee86bbf29691b3c4254a344e2cb87eabbb77f04f49da09856cb1d145
3
+ size 20938
README.md ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: Murshid - مُرشِد
3
+ emoji: 🛡️
4
+ colorFrom: blue
5
+ colorTo: indigo
6
+ sdk: docker
7
+ pinned: false
8
+ license: mit
9
+ ---
10
+
11
+ # 🛡️ Murshid | مُرشِد
12
+
13
+ **From Alerts to Guidance: MITRE ATT&CK-Aligned Techniques Mapping for SOC Analysts**
14
+
15
+ REST API + Dashboard for analyzing Wazuh IDS rules and mapping them to MITRE ATT&CK techniques.
16
+
17
+ ## Features
18
+
19
+ - **Rule Analysis**: Parse Wazuh XML rules and classify MITRE ATT&CK techniques
20
+ - **WQL Queries**: Get pre-built Wazuh Query Language templates per technique
21
+ - **Dashboard**: Interactive web UI with statistics and DB viewer
22
+ - **ML Pipeline**: Logistic Regression with SecureBERT+ embeddings
23
+
24
+ ## Tech Stack
25
+
26
+ - **FastAPI** — REST API
27
+ - **SQLite** — Database
28
+ - **Logistic Regression** — Primary classification model
29
+ - **SecureBERT+** — Text embeddings (optional, requires torch)
30
+
31
+ ## API Endpoints
32
+
33
+ | Method | URL | Description |
34
+ |--------|-----|-------------|
35
+ | `GET` | `/health` | System health check |
36
+ | `POST` | `/rules/analyze` | Analyze a Wazuh XML rule |
37
+ | `GET` | `/results/{rule_id}` | Get stored results for a rule |
38
+ | `GET` | `/queries/{technique_id}` | Get WQL templates for a technique |
39
+ | `GET` | `/docs` | Interactive Swagger documentation |
murshid_backend/README.md ADDED
@@ -0,0 +1,156 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Murshid Backend
2
+
3
+ REST API لمشروع "مرشد — من التنبيهات إلى التوجيه: ربط تقنيات MITRE ATT&CK لمحللي SOC"
4
+
5
+ ## التقنيات
6
+
7
+ - **FastAPI** — REST API
8
+ - **MySQL** + **SQLAlchemy** — قاعدة البيانات
9
+ - **Alembic** — هجرة الجداول
10
+ - **Logistic Regression** — النموذج الأساسي في هذه المرحلة
11
+ - **SecureBERT+** — تضمينات نصية
12
+ - **Llama 3 8B** — تلخيص قواعد Wazuh
13
+
14
+ > المنطق مستخرج من `MurshidUIPipeline.ipynb` دون تعديله.
15
+
16
+ ---
17
+
18
+ ## هيكل المشروع
19
+
20
+ ```
21
+ murshid_backend/
22
+ app/
23
+ main.py ← نقطة تشغيل FastAPI
24
+ config.py
25
+ api/routes/
26
+ health.py ← GET /health
27
+ rules.py ← POST /rules/analyze + GET /results/{rule_id}
28
+ queries.py ← GET /queries/{technique_id} + Admin endpoints
29
+ services/
30
+ ml_service.py
31
+ rule_service.py
32
+ result_service.py
33
+ template_service.py
34
+ ml/
35
+ sanitizer.py ← تنظيف XML
36
+ summarizer.py ← تلخيص Llama
37
+ embedder.py ← SecureBERT+
38
+ logistic_model.py ← Logistic Regression inference
39
+ pipeline.py ← analyze_rule() الشامل
40
+ models/ ← SQLAlchemy ORM (6 جداول من ER Diagram)
41
+ schemas/ ← Pydantic schemas
42
+ repositories/ ← DB access layer
43
+ db/
44
+ base.py
45
+ session.py
46
+ alembic/
47
+ versions/0001_initial_schema.py
48
+ requirements.txt
49
+ .env.example
50
+ ```
51
+
52
+ ---
53
+
54
+ ## جداول قاعدة البيانات (مستخرجة من ER Diagram §3.2.6)
55
+
56
+ | جدول | المصدر في التقرير |
57
+ |------|-------------------|
58
+ | `users` | User entity — username, email, password_hash, role |
59
+ | `mapping_jobs` | MappingJob entity — job_id, file_name, status, progress, timestamp |
60
+ | `rules` | Rule entity — rule_id, embedding_vector, job_id |
61
+ | `techniques` | Technique entity — technique_id, technique_name, tactic |
62
+ | `rule_technique_mappings` | RuleTechniqueMapping — rule_id, technique_id, confidence_score |
63
+ | `query_templates` | QueryTemplate — purpose, wql_query, note, is_active |
64
+
65
+ ---
66
+
67
+ ## الإعداد والتشغيل
68
+
69
+ ### 1) متطلبات
70
+
71
+ - Python 3.10+
72
+ - MySQL 8+
73
+ - GPU موصى به لـ Llama 3 8B
74
+
75
+ ### 2) تثبيت
76
+
77
+ ```powershell
78
+ cd d:\GP\murshid_backend
79
+ python -m venv .venv
80
+ .\.venv\Scripts\activate
81
+ pip install -r requirements.txt
82
+ ```
83
+
84
+ ### 3) إعداد قاعدة البيانات
85
+
86
+ إنشاء قاعدة البيانات في MySQL:
87
+ ```sql
88
+ CREATE DATABASE murshid_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
89
+ ```
90
+
91
+ ### 4) إعداد `.env`
92
+
93
+ ```powershell
94
+ copy .env.example .env
95
+ ```
96
+
97
+ عدّلي القيم:
98
+ ```env
99
+ MURSHID_DB_URL=mysql+pymysql://root:YOUR_PASSWORD@localhost:3306/murshid_db
100
+ MURSHID_MODELS_DIR=d:/GP/Needed
101
+ HF_TOKEN=hf_xxxx
102
+ MURSHID_SKIP_LLM=false
103
+ ```
104
+
105
+ ### 5) تأكد من وجود ملفات النماذج في `d:\GP\Needed`
106
+
107
+ ```
108
+ murshid_logreg_pipeline_manual_oof_pcatuned.joblib
109
+ murshid_logreg_thresholds_manual_oof_pcatuned.npy
110
+ murshid_label_columns.json
111
+ ```
112
+
113
+ ### 6) تشغيل Alembic (هجرة الجداول)
114
+
115
+ ```powershell
116
+ alembic upgrade head
117
+ ```
118
+
119
+ ### 7) تشغيل الـ API
120
+
121
+ ```powershell
122
+ uvicorn app.main:app --reload --host 127.0.0.1 --port 8000
123
+ ```
124
+
125
+ ---
126
+
127
+ ## الـ Endpoints
128
+
129
+ | Method | URL | الوصف |
130
+ |--------|-----|--------|
131
+ | `GET` | `/health` | فحص حالة النظام والنماذج |
132
+ | `POST` | `/rules/analyze` | تحليل قاعدة Wazuh XML وحفظ النتائج |
133
+ | `GET` | `/results/{rule_id}` | استرجاع التقنيات المخزنة لمعرف القاعدة |
134
+ | `GET` | `/queries/{technique_id}` | جلب قوالب WQL لتقنية معينة |
135
+ | `POST` | `/admin/templates` | إضافة قالب WQL جديد (Admin) |
136
+ | `PATCH` | `/admin/templates/{id}` | تعديل أو تعطيل قالب (Admin) |
137
+
138
+ ### مثال — تحليل قاعدة
139
+
140
+ ```bash
141
+ curl -X POST http://127.0.0.1:8000/rules/analyze \
142
+ -H "Content-Type: application/json" \
143
+ -d '{"rule_xml": "<rule id=\"597\" level=\"5\"><description>Registry Key Entry Deleted.</description></rule>"}'
144
+ ```
145
+
146
+ ### التوثيق التفاعلي
147
+
148
+ افتحي: **http://127.0.0.1:8000/docs**
149
+
150
+ ---
151
+
152
+ ## ملاحظات
153
+
154
+ - الملف الأصلي `MurshidUIPipeline.ipynb` **لم يُعدَّل** — المنطق منسوخ إلى طبقة `app/ml/`.
155
+ - النموذج المعتمد في هذه المرحلة: **Logistic Regression** فقط.
156
+ - لتشغيل بدون GPU للاختبار فقط: ضعي `MURSHID_SKIP_LLM=true` في `.env` (لكن `/rules/analyze` ستعيد 503).
murshid_backend/TECHNICAL_REPORT.md ADDED
@@ -0,0 +1,322 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # تقرير تقني مفصّل — مشروع مُرشِد (Murshid)
2
+ ## From Alerts to Guidance: MITRE ATT&CK-Aligned Techniques Mapping for SOC Analysts
3
+
4
+ ---
5
+
6
+ ## 1. نظرة عامة
7
+
8
+ مُرشِد نظام ذكي يحوّل تنبيهات قواعد Wazuh XML إلى تقنيات MITRE ATT&CK مُرتّبة بدرجات ثقة، ويُنتج استعلامات تحقيق WQL جاهزة لمحللي SOC.
9
+
10
+ ```
11
+ قاعدة Wazuh XML
12
+
13
+ Sanitization (حذف if_sid, group, mitre)
14
+
15
+ LLaMA 3 8B (تلخيص بجملة واحدة)
16
+
17
+ SecureBERT+ (768-dim embedding)
18
+
19
+ Logistic Regression + PCA (تصنيف)
20
+
21
+ تقنيات MITRE ATT&CK + Confidence Scores
22
+
23
+ قوالب WQL للتحقيق
24
+ ```
25
+
26
+ ---
27
+
28
+ ## 2. هيكل المشروع الكامل
29
+
30
+ ```
31
+ d:\GP\
32
+ ├── MurshidUIPipeline.ipynb ← الدفتر الأصلي (لا يُعدَّل)
33
+ ├── Needed\ ← ملفات النماذج المدرّبة
34
+ │ ├── murshid_logreg_pipeline_manual_oof_pcatuned.joblib
35
+ │ ├── murshid_logreg_thresholds_manual_oof_pcatuned.npy
36
+ │ ├── murshid_svmlinear_per_label_thresholds.joblib
37
+ │ └── murshid_label_columns.json (20 تقنية)
38
+ ├── murshid_backend\ ← خدمة FastAPI
39
+ │ ├── app\
40
+ │ │ ├── main.py
41
+ │ │ ├── config.py
42
+ │ │ ├── api\routes\
43
+ │ │ │ ├── health.py GET /health
44
+ │ │ │ ├── rules.py POST /rules/analyze | GET /results/{rule_id}
45
+ │ │ │ ├── queries.py GET /queries/{technique_id} | POST,PATCH /admin/templates
46
+ │ │ │ ├── stats.py GET /api/stats
47
+ │ │ │ └── db_viewer.py GET /api/db/{summary|rules|mappings|...}
48
+ │ │ ├── ml\
49
+ │ │ │ ├── sanitizer.py تنظيف XML
50
+ │ │ │ ├── summarizer.py LLaMA inference
51
+ │ │ │ ├── embedder.py SecureBERT+ embeddings
52
+ │ │ │ ├── logistic_model.py LogReg inference (PRIMARY)
53
+ │ │ │ └── pipeline.py تنسيق المراحل (FULL|LOCAL|LITE)
54
+ │ │ ├── models\ SQLAlchemy ORM
55
+ │ │ │ ├── user.py
56
+ │ │ │ ├── mapping_job.py
57
+ │ │ │ ├── rule.py
58
+ │ │ │ ├── technique.py
59
+ │ │ │ ├── rule_technique_mapping.py
60
+ │ │ │ └── query_template.py
61
+ │ │ ├── schemas\ Pydantic schemas
62
+ │ │ ├── services\ Business logic
63
+ │ │ ├── repositories\ DB access
64
+ │ │ └── db\ SQLAlchemy session
65
+ │ ├── alembic\ Migrations
66
+ │ ├── murshid.db SQLite database
67
+ │ ├── .env
68
+ │ └── requirements.txt
69
+ └── murshid_frontend\ واجهة React
70
+ └── index.html
71
+ ```
72
+
73
+ ---
74
+
75
+ ## 3. طبقة الباكند (FastAPI)
76
+
77
+ ### 3.1 الـ Endpoints
78
+
79
+ | Method | URL | الوصف | Actor |
80
+ |--------|-----|--------|-------|
81
+ | `GET` | `/health` | حالة النظام + pipeline mode + ملفات النماذج | All |
82
+ | `GET` | `/api/stats` | إحصائيات Dashboard (KPIs + Technique Frequency) | All |
83
+ | `GET` | `/api/db/summary` | عدد الصفوف في كل جدول | Testing |
84
+ | `GET` | `/api/db/rules` | جميع القواعد المخزّنة | Testing |
85
+ | `GET` | `/api/db/mappings` | جميع مطابقات القواعد-التقنيات | Testing |
86
+ | `GET` | `/api/db/techniques` | جميع تقنيات MITRE المخزّنة | Testing |
87
+ | `GET` | `/api/db/templates` | جميع قوالب WQL | Testing |
88
+ | `POST` | `/rules/analyze` | تحليل قاعدة XML → تخزين النتائج | Admin |
89
+ | `GET` | `/results/{rule_id}` | استرجاع تقنيات قاعدة محددة (Figure 4-11/12) | SOC Analyst |
90
+ | `GET` | `/queries/{technique_id}` | قوالب WQL لتقنية محددة | SOC Analyst |
91
+ | `POST` | `/admin/templates` | إضافة قالب WQL جديد | Admin |
92
+ | `PATCH` | `/admin/templates/{id}` | تعديل/تعطيل قالب | Admin |
93
+
94
+ ### 3.2 معمارية الطبقات
95
+
96
+ ```
97
+ HTTP Request
98
+
99
+
100
+ API Layer (FastAPI routes)
101
+ │ validates input (Pydantic)
102
+
103
+ Service Layer
104
+ │ orchestrates business logic
105
+
106
+ ML Layer Repository Layer
107
+ │ │
108
+ ▼ ▼
109
+ Pipeline SQLAlchemy ORM
110
+ (sanitize→embed→classify) │
111
+ │ ▼
112
+ └──────────→ SQLite DB
113
+ ```
114
+
115
+ ### 3.3 قاعدة البيانات (SQLite + SQLAlchemy)
116
+
117
+ مستخرجة حرفياً من ER Diagram (§3.2.6 من التقرير):
118
+
119
+ | الجدول | الأعمدة الرئيسية | المصدر في التقرير |
120
+ |--------|------------------|-------------------|
121
+ | `users` | user_id, username, email, password_hash, role | User entity |
122
+ | `mapping_jobs` | job_id, user_id, file_name, status, progress, timestamp | MappingJob entity |
123
+ | `rules` | rule_id (PK), job_id, embedding_vector | Rule entity |
124
+ | `techniques` | technique_id (PK), technique_name, tactic | Technique entity |
125
+ | `rule_technique_mappings` | mapping_id, rule_id, technique_id, confidence_score | RuleTechniqueMapping |
126
+ | `query_templates` | template_id, technique_id, purpose, wql_query, note, is_active | QueryTemplate |
127
+
128
+ > Index على `rule_id` في `rule_technique_mappings` (Use Case 6 §3.2.7)
129
+
130
+ ---
131
+
132
+ ## 4. طبقة ML
133
+
134
+ ### 4.1 مراحل الـ Pipeline (من الدفتر)
135
+
136
+ #### المرحلة 1: Sanitization
137
+ ```python
138
+ # ml/sanitizer.py — من cell 10 في الدفتر
139
+ REMOVE_TAGS_ANYWHERE = {"mitre", "if_sid", "group", "if_group"}
140
+ # يُحذف: group tags, if_sid, mitre IDs, compliance tags
141
+ # يبقى: description, id, category, decoded_as, info
142
+ ```
143
+
144
+ #### المرحلة 2: LLM Summarization (LLaMA 3 8B)
145
+ ```python
146
+ # ml/summarizer.py — من cell 11 في الدفتر
147
+ # Input: sanitized XML
148
+ # Prompt: "Write EXACTLY ONE sentence describing the observable event pattern"
149
+ # Output: JSON {"summary": "Detects ..."}
150
+ # Constraints: 7-18 words, يبدأ بـ Detects/Monitors/...
151
+ ```
152
+
153
+ #### المرحلة 3: Paragraph Construction
154
+ ```python
155
+ # ml/embedder.py — من cell 12 في الدفتر
156
+ text = f"{summary}. {description}."
157
+ # مثال: "Detects deletion of global group. Windows: Security Enabled Global Group Deleted."
158
+ ```
159
+
160
+ #### المرحلة 4: SecureBERT+ Embedding
161
+ ```python
162
+ # ml/embedder.py — من cell 15 في الدفتر
163
+ # Model: ehsanaghaei/SecureBERT_Plus
164
+ # MAX_LEN: 512 tokens, chunks
165
+ # Pooling: Mean pooling across tokens → 768-dim vector
166
+ # Normalization: L2
167
+ ```
168
+
169
+ #### المرحلة 5: Logistic Regression Inference
170
+ ```python
171
+ # ml/logistic_model.py — من cell 18-19 في الدفتر
172
+ proba = logreg_model.predict_proba(X_user)
173
+ proba = proba.reshape(-1)
174
+ pred = (proba >= logreg_thr).astype(int)
175
+ conf = proba * 100
176
+ gap = proba - logreg_thr
177
+ # تُرجع جميع الـ 20 تقنية مرتّبة تنازلياً
178
+ ```
179
+
180
+ ### 4.2 أوضاع التشغيل
181
+
182
+ | الوضع | الشرط | الدقة | الاستخدام |
183
+ |-------|--------|-------|-----------|
184
+ | **FULL** | LLaMA + SecureBERT + LogReg | 100% (مطابق للدفتر) | Colab/GPU |
185
+ | **LOCAL** | SecureBERT + LogReg (بدون LLaMA) | ~95% (وصف بدون ملخص) | الجهاز المحلي |
186
+ | **LITE** | LogReg فقط (بدون torch) | منخفضة (عشوائي) | اختبار البنية فقط |
187
+
188
+ ---
189
+
190
+ ## 5. طبقة الفرونت (React + Tailwind + Chart.js)
191
+
192
+ ### 5.1 الصفحات (CDN-based React, بدون Build Step)
193
+
194
+ | الصفحة | ID | المستخدم | الوصف |
195
+ |--------|-----|----------|--------|
196
+ | Login | — | All | تسجيل دخول + اختيار دور |
197
+ | Dashboard | `dashboard` | All | KPIs + MITRE Technique Frequency Chart |
198
+ | Rule Lookup | `rules` | SOC Analyst | بحث بـ Rule ID → Figure 4-11 + Figure 4-12 |
199
+ | نتائج DB | `dbviewer` | All | استعراض قاعدة البيانات للاختبار |
200
+ | Rule Mapping | `admin` | Admin | رفع XML + تحليل + جدول التقدم |
201
+ | WQL Templates | `templates` | Admin | إدارة قوالب الاستعلامات |
202
+ | Settings | `settings` | All | ملف شخصي + Dark Mode + ألوان |
203
+
204
+ ### 5.2 الـ Figures كما في التقرير
205
+
206
+ | Figure | الصفحة | المكوّن |
207
+ |--------|--------|---------|
208
+ | Figure 4-10 | Rule Lookup | Search bar + Rule ID input |
209
+ | Figure 4-11 | Rule Lookup | `TechniqueDistributionChart` — Horizontal bar chart (Top 5, مُلوَّن H/M/L) |
210
+ | Figure 4-12 | Rule Lookup | Investigation Queries table (Primary + Secondary ≥50%) |
211
+ | Figure 4-13 | Admin | Rule Mapping Panel (paste XML + Submit) |
212
+ | Figure 4-14 | Admin | Mapping Progress Table (Job ID, Status, Progress) |
213
+ | Figure 4-9 | Dashboard | KPIs + Technique Frequency Bar Chart |
214
+
215
+ ### 5.3 ربط الفرونت بالباكند
216
+
217
+ ```javascript
218
+ const BASE = 'http://127.0.0.1:8000';
219
+ // CORS مُفعَّل في الباكند لـ http://localhost:5173 و http://127.0.0.1:5173
220
+ // الفرونت يُخدَّم مباشرةً من FastAPI عبر StaticFiles
221
+ ```
222
+
223
+ ---
224
+
225
+ ## 6. مخطط تدفق البيانات الكامل
226
+
227
+ ```
228
+ ┌─────────────────────────────────────────┐
229
+ │ SOC Analyst / Admin │
230
+ │ (murshid_frontend/index.html) │
231
+ └────────────────┬────────────────────────┘
232
+ │ HTTP/JSON
233
+
234
+ ┌─────────────────────────────────────────┐
235
+ │ FastAPI (port 8000) │
236
+ │ │
237
+ │ /health → pipeline status │
238
+ │ POST /rules/analyze: │
239
+ │ 1. sanitizer.py → clean XML │
240
+ │ 2. summarizer.py → LLaMA summary │ ← FULL mode only
241
+ │ 3. embedder.py → 768-dim vector │
242
+ │ 4. logistic_model → proba + scores │
243
+ │ 5. rule_repo → save to DB │
244
+ │ │
245
+ │ GET /results/{id} → from DB │
246
+ │ GET /queries/{id} → WQL templates │
247
+ └────────────────┬────────────────────────┘
248
+ │ SQLAlchemy
249
+
250
+ ┌─────────────────────────────────────────┐
251
+ │ SQLite (murshid.db) │
252
+ │ rules | techniques | mappings │
253
+ │ query_templates | mapping_jobs │
254
+ └─────────────────────────────────────────┘
255
+ ```
256
+
257
+ ---
258
+
259
+ ## 7. التشغيل
260
+
261
+ ### المتطلبات
262
+ - Python 3.12 (عبر uv)
263
+ - ملفات النماذج في `d:\GP\Needed\`
264
+ - اتصال إنترنت (لـ SecureBERT+ من HuggingFace أول مرة)
265
+
266
+ ### تشغيل الخادم
267
+ ```powershell
268
+ cd d:\GP\murshid_backend
269
+ .venv\Scripts\python.exe -m uvicorn app.main:app --host 127.0.0.1 --port 8000
270
+ ```
271
+
272
+ ### الروابط
273
+ | الرابط | الوصف |
274
+ |--------|--------|
275
+ | http://127.0.0.1:8000/index.html | الواجهة الرئيسية |
276
+ | http://127.0.0.1:8000/docs | Swagger API Documentation |
277
+ | http://127.0.0.1:8000/health | فحص حالة النظام |
278
+ | http://127.0.0.1:8000/api/db/summary | ملخص قاعدة البيانات |
279
+
280
+ ### اختبار سريع
281
+ ```powershell
282
+ # 1. تحليل قاعدة
283
+ $body = '{"rule_xml":"<rule id=\"597\"><description>Registry Key Entry Deleted.</description></rule>"}'
284
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/rules/analyze" -Method POST -ContentType "application/json" -Body $body
285
+
286
+ # 2. استرجاع النتائج
287
+ Invoke-RestMethod "http://127.0.0.1:8000/results/597"
288
+
289
+ # 3. إضافة قالب WQL
290
+ $t = '{"technique_id":"T1112","purpose":"Detect registry modification","wql_query":"agent.name:${HOST} AND rule.description:\"registry\"","note":"Replace ${HOST}"}'
291
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/admin/templates" -Method POST -ContentType "application/json" -Body $t
292
+
293
+ # 4. جلب الاستعلامات
294
+ Invoke-RestMethod "http://127.0.0.1:8000/queries/T1112"
295
+ ```
296
+
297
+ ---
298
+
299
+ ## 8. الفرق بين FULL mode (Colab) و LOCAL mode (الجهاز)
300
+
301
+ | | Colab (FULL) | الجهاز المحلي (LOCAL) |
302
+ |--|-------------|----------------------|
303
+ | Input text | `"Detects deletion of a security-enabled global group. Windows: Security Enabled Global Group Deleted."` | `"Windows: Security Enabled Global Group Deleted"` |
304
+ | T1484 proba | **0.9476 (94.76%)** | **0.8929 (89.29%)** |
305
+ | سبب الفرق | LLaMA يُثري النص بسياق دلالي | الوصف فقط بدون إثراء |
306
+ | القرار الصحيح | T1484 ✅ | T1484 ✅ |
307
+
308
+ **الاستنتاج:** القرار النهائي صحيح في كلا الوضعين — الاختلاف في درجة الثقة فقط.
309
+
310
+ ---
311
+
312
+ ## 9. حالات الاستخدام المُنفَّذة (من التقرير)
313
+
314
+ | Use Case | الوصف | مُنفَّذ |
315
+ |----------|--------|---------|
316
+ | UC1 | View techniques and scores for a rule | ✅ `GET /results/{rule_id}` |
317
+ | UC2 | View WQL investigation queries | ✅ `GET /queries/{technique_id}` |
318
+ | UC3 | Copy and fill investigation query | ✅ زر Copy في الفرونت |
319
+ | UC4 | Upload Wazuh rule(s) | ✅ Admin Panel |
320
+ | UC5 | Process rule via ML pipeline | ✅ `POST /rules/analyze` |
321
+ | UC6 | Store mapped techniques in DB | ✅ تلقائي بعد analyze |
322
+ | UC7 | Manage WQL templates repository | ✅ `POST/PATCH /admin/templates` |
murshid_backend/alembic.ini ADDED
@@ -0,0 +1,38 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [alembic]
2
+ script_location = alembic
3
+ prepend_sys_path = .
4
+ sqlalchemy.url = sqlite:///murshid.db
5
+
6
+ [loggers]
7
+ keys = root,sqlalchemy,alembic
8
+
9
+ [handlers]
10
+ keys = console
11
+
12
+ [formatters]
13
+ keys = generic
14
+
15
+ [logger_root]
16
+ level = WARN
17
+ handlers = console
18
+ qualname =
19
+
20
+ [logger_sqlalchemy]
21
+ level = WARN
22
+ handlers =
23
+ qualname = sqlalchemy.engine
24
+
25
+ [logger_alembic]
26
+ level = INFO
27
+ handlers =
28
+ qualname = alembic
29
+
30
+ [handler_console]
31
+ class = StreamHandler
32
+ args = (sys.stderr,)
33
+ level = NOTSET
34
+ formatter = generic
35
+
36
+ [formatter_generic]
37
+ format = %(levelname)-5.5s [%(name)s] %(message)s
38
+ datefmt = %H:%M:%S
murshid_backend/alembic/env.py ADDED
@@ -0,0 +1,52 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
"""Alembic environment: wires app settings and ORM metadata into migrations."""

import sys
from logging.config import fileConfig
from pathlib import Path

from sqlalchemy import engine_from_config, pool

from alembic import context

# Ensure the project root is importable so `app.*` modules resolve.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from app.config import settings
from app.db.base import Base
import app.models  # noqa: F401 — registers all models with Base.metadata

config = context.config
# The DB URL comes from app settings, overriding whatever alembic.ini holds.
config.set_main_option("sqlalchemy.url", settings.murshid_db_url)

if config.config_file_name is not None:
    fileConfig(config.config_file_name)

target_metadata = Base.metadata


def run_migrations_offline() -> None:
    """Emit migration SQL without a live DB connection ("offline" mode)."""
    context.configure(
        url=config.get_main_option("sqlalchemy.url"),
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )
    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online() -> None:
    """Run migrations against a real engine built from the ini section."""
    engine = engine_from_config(
        config.get_section(config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    with engine.connect() as conn:
        context.configure(connection=conn, target_metadata=target_metadata)
        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()
murshid_backend/alembic/script.py.mako ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """${message}
2
+
3
+ Revision ID: ${up_revision}
4
+ Revises: ${down_revision | comma,n}
5
+ Create Date: ${create_date}
6
+
7
+ """
8
+ from typing import Sequence, Union
9
+
10
+ from alembic import op
11
+ import sqlalchemy as sa
12
+ ${imports if imports else ""}
13
+
14
+ revision: str = ${repr(up_revision)}
15
+ down_revision: Union[str, None] = ${repr(down_revision)}
16
+ branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
17
+ depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}
18
+
19
+
20
+ def upgrade() -> None:
21
+ ${upgrades if upgrades else "pass"}
22
+
23
+
24
+ def downgrade() -> None:
25
+ ${downgrades if downgrades else "pass"}
murshid_backend/alembic/versions/0001_initial_schema.py ADDED
@@ -0,0 +1,87 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """initial schema — all 6 tables from ER Diagram §3.2.6
2
+
3
+ Revision ID: 0001
4
+ Revises:
5
+ Create Date: 2026-04-08
6
+ """
7
+
8
+ from typing import Sequence, Union
9
+
10
+ import sqlalchemy as sa
11
+ from alembic import op
12
+
13
+ revision: str = "0001"
14
+ down_revision: Union[str, None] = None
15
+ branch_labels: Union[str, Sequence[str], None] = None
16
+ depends_on: Union[str, Sequence[str], None] = None
17
+
18
+
19
def upgrade() -> None:
    """Create all 6 tables from the ER Diagram (§3.2.6).

    Tables are created parents-first so foreign keys always reference an
    existing table: users → mapping_jobs → rules → techniques → mappings →
    query_templates.
    """
    # users — User entity: credentials plus a coarse role string.
    op.create_table(
        "users",
        sa.Column("user_id", sa.Integer(), primary_key=True, autoincrement=True),
        sa.Column("username", sa.String(100), unique=True, nullable=False),
        sa.Column("email", sa.String(255), unique=True, nullable=False),
        sa.Column("password_hash", sa.String(255), nullable=False),
        # New accounts default to the least-privileged "analyst" role.
        sa.Column("role", sa.String(20), nullable=False, server_default="analyst"),
    )

    # mapping_jobs — one row per uploaded rule file / analysis batch.
    op.create_table(
        "mapping_jobs",
        sa.Column("job_id", sa.Integer(), primary_key=True, autoincrement=True),
        sa.Column("user_id", sa.Integer(), sa.ForeignKey("users.user_id"), nullable=False),
        sa.Column("file_name", sa.String(255), nullable=False),
        sa.Column("rules_count", sa.Integer(), server_default="0"),
        sa.Column("status", sa.String(20), nullable=False, server_default="pending"),
        sa.Column("progress", sa.Integer(), server_default="0"),
        sa.Column("timestamp", sa.DateTime(), server_default=sa.func.now()),
    )

    # rules — Wazuh rule ids are stored as strings; job_id is nullable so
    # rules can be analyzed ad hoc, outside a batch job.
    op.create_table(
        "rules",
        sa.Column("rule_id", sa.String(50), primary_key=True),
        sa.Column("job_id", sa.Integer(), sa.ForeignKey("mapping_jobs.job_id"), nullable=True),
        # Serialized embedding (stored as text, not a vector type).
        sa.Column("embedding_vector", sa.Text(), nullable=True),
    )

    # techniques — MITRE ATT&CK technique catalog (e.g. "T1059.001").
    op.create_table(
        "techniques",
        sa.Column("technique_id", sa.String(20), primary_key=True),
        sa.Column("technique_name", sa.String(255), nullable=False),
        sa.Column("tactic", sa.String(100), nullable=True),
    )

    # rule_technique_mappings — many-to-many rule↔technique with a score.
    op.create_table(
        "rule_technique_mappings",
        sa.Column("mapping_id", sa.Integer(), primary_key=True, autoincrement=True),
        sa.Column("rule_id", sa.String(50), sa.ForeignKey("rules.rule_id"), nullable=False),
        sa.Column(
            "technique_id", sa.String(20), sa.ForeignKey("techniques.technique_id"), nullable=False
        ),
        sa.Column("confidence_score", sa.Float(), nullable=False),
    )
    # Index on rule_id — Use Case 6 §3.2.7 (fast lookup of a rule's techniques).
    op.create_index("ix_rule_technique_rule_id", "rule_technique_mappings", ["rule_id"])

    # query_templates — WQL investigation queries per technique.
    op.create_table(
        "query_templates",
        sa.Column("template_id", sa.Integer(), primary_key=True, autoincrement=True),
        sa.Column(
            "technique_id", sa.String(20), sa.ForeignKey("techniques.technique_id"), nullable=False
        ),
        sa.Column("purpose", sa.String(255), nullable=True),
        sa.Column("wql_query", sa.Text(), nullable=False),
        sa.Column("note", sa.Text(), nullable=True),
        # Soft-delete flag: templates are disabled, not removed.
        sa.Column("is_active", sa.Boolean(), nullable=False, server_default="1"),
    )
77
+
78
+
79
def downgrade() -> None:
    """Drop all 6 tables in reverse dependency order (children before parents).

    The index on rule_technique_mappings is dropped explicitly before its
    table. SQLite defines no custom types, so nothing else needs cleanup.
    """
    # Fix: removed a stray trailing `pass` — a no-op statement left over
    # from the migration template that served no purpose after real drops.
    op.drop_table("query_templates")
    op.drop_index("ix_rule_technique_rule_id", table_name="rule_technique_mappings")
    op.drop_table("rule_technique_mappings")
    op.drop_table("techniques")
    op.drop_table("rules")
    op.drop_table("mapping_jobs")
    op.drop_table("users")
murshid_backend/app/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Murshid backend package."""
murshid_backend/app/api/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """API layer — FastAPI routers."""
murshid_backend/app/api/routes/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Route modules."""
murshid_backend/app/api/routes/db_viewer.py ADDED
@@ -0,0 +1,122 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ GET /api/db/rules — all rules in DB
3
+ GET /api/db/mappings — all rule-technique mappings
4
+ GET /api/db/techniques — all techniques
5
+ GET /api/db/templates — all query templates
6
+ GET /api/db/summary — counts per table
7
+ POST /api/db/import-excel — import WQL templates from Excel file
8
+ """
9
+
10
+ from fastapi import APIRouter, Depends, HTTPException, Query
11
+ from sqlalchemy import func
12
+ from sqlalchemy.orm import Session
13
+
14
+ from app.db.session import get_db
15
+ from app.models.mapping_job import MappingJob
16
+ from app.models.query_template import QueryTemplate
17
+ from app.models.rule import Rule
18
+ from app.models.rule_technique_mapping import RuleTechniqueMapping
19
+ from app.models.technique import Technique
20
+
21
+ router = APIRouter(prefix="/api/db", tags=["db-viewer"])
22
+
23
+
24
@router.get("/summary")
def db_summary(db: Session = Depends(get_db)):
    """Row counts per table — a quick sanity check of DB contents."""

    def count(column):
        # COUNT(pk) per table via the shared request session.
        return db.query(func.count(column)).scalar()

    return {
        "rules": count(Rule.rule_id),
        "techniques": count(Technique.technique_id),
        "rule_mappings": count(RuleTechniqueMapping.mapping_id),
        "query_templates": count(QueryTemplate.template_id),
        "mapping_jobs": count(MappingJob.job_id),
    }
33
+
34
+
35
@router.get("/rules")
def all_rules(db: Session = Depends(get_db)):
    """Every stored rule with its job id and an embedding-presence flag."""
    listing = []
    for rule in db.query(Rule).order_by(Rule.rule_id).all():
        listing.append(
            {
                "rule_id": rule.rule_id,
                "job_id": rule.job_id,
                # Expose only whether an embedding exists, not the vector itself.
                "has_embedding": rule.embedding_vector is not None,
            }
        )
    return listing
46
+
47
+
48
@router.get("/mappings")
def all_mappings(db: Session = Depends(get_db)):
    """All rule→technique mappings, grouped by rule, strongest score first."""
    ordering = (
        RuleTechniqueMapping.rule_id,
        RuleTechniqueMapping.confidence_score.desc(),
    )
    mappings = db.query(RuleTechniqueMapping).order_by(*ordering).all()
    return [
        {
            "mapping_id": m.mapping_id,
            "rule_id": m.rule_id,
            "technique_id": m.technique_id,
            # Raw score to 4 dp plus a human-friendly percentage.
            "confidence_score": round(m.confidence_score, 4),
            "confidence_pct": round(m.confidence_score * 100, 2),
        }
        for m in mappings
    ]
68
+
69
+
70
@router.get("/techniques")
def all_techniques(db: Session = Depends(get_db)):
    """All MITRE techniques currently stored, ordered by technique id."""

    def serialize(t):
        return {
            "technique_id": t.technique_id,
            "technique_name": t.technique_name,
            "tactic": t.tactic,
        }

    rows = db.query(Technique).order_by(Technique.technique_id).all()
    return [serialize(t) for t in rows]
81
+
82
+
83
@router.get("/templates")
def all_templates(db: Session = Depends(get_db)):
    """All WQL query templates (active and inactive), ordered per technique."""
    query = db.query(QueryTemplate).order_by(
        QueryTemplate.technique_id, QueryTemplate.template_id
    )
    results = []
    for tpl in query.all():
        results.append(
            {
                "template_id": tpl.template_id,
                "technique_id": tpl.technique_id,
                "purpose": tpl.purpose,
                "wql_query": tpl.wql_query,
                "note": tpl.note,
                "is_active": tpl.is_active,
            }
        )
    return results
97
+
98
+
99
@router.post("/import-excel")
def import_excel_templates(
    replace: bool = Query(False, description="Update existing templates if True"),
    db: Session = Depends(get_db),
):
    """
    Import WQL query templates from the Excel file:
        murshid_query_template_structure_clean_shared.xlsx

    The file is read from MURSHID_MODELS_DIR or the GP root folder.
    Pass ?replace=true to overwrite existing templates.

    Raises HTTP 404 when the Excel file is missing (or the importer reports
    an "error" key) and HTTP 500 for any other failure.
    """
    try:
        # Imported lazily so the optional scripts/ package is only needed here.
        from scripts.import_excel_templates import run
        result = run(db, replace=replace)
    except FileNotFoundError as e:
        # Fix: chain with `from e` (PEP 3134) so the original traceback is
        # preserved in server logs instead of being discarded.
        raise HTTPException(status_code=404, detail=str(e)) from e
    except Exception as e:
        # Broad catch is deliberate at this API boundary: surface as HTTP 500.
        raise HTTPException(status_code=500, detail=str(e)) from e

    if "error" in result:
        raise HTTPException(status_code=404, detail=result["error"])

    return result
murshid_backend/app/api/routes/health.py ADDED
@@ -0,0 +1,73 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """GET /health — system readiness check with clear pipeline mode info."""
2
+
3
+ from pathlib import Path
4
+
5
+ from fastapi import APIRouter
6
+
7
+ from app.config import settings
8
+ from app.ml.pipeline import _store, is_ready
9
+
10
+ router = APIRouter(tags=["health"])
11
+
12
# Probe torch once at import time so /health can report GPU capability
# without repeating the (slow, possibly failing) import per request.
try:
    import torch
    _CUDA = torch.cuda.is_available()
    _TORCH = True
    _TORCH_ERR = None
# OSError is caught too: torch may be installed yet fail to load its
# native shared libraries (e.g. missing CUDA DLLs).
except (ImportError, OSError) as _e:
    _CUDA = False
    _TORCH = False
    # Keep the failure message so /health can surface it to the caller.
    _TORCH_ERR = str(_e)
21
+
22
+
23
def _check_model_files() -> dict:
    """Map each required model-artifact name to whether its file exists."""
    root = Path(settings.murshid_models_dir).resolve()
    expected = {
        "logreg_joblib": settings.logreg_joblib,
        "logreg_thresholds": settings.logreg_thresholds_npy,
        "label_columns": settings.label_columns_json,
    }
    return {name: (root / rel).is_file() for name, rel in expected.items()}
31
+
32
+
33
@router.get("/health")
def health():
    """Report readiness: pipeline mode, loaded components, and model files."""
    files = _check_model_files()
    classifier_loaded = _store.logreg is not None

    # Effective pipeline mode follows which components made it into memory.
    if _store.llama_model is not None:
        mode, mode_desc = "full", "LLaMA + SecureBERT+ + LogReg"
    elif _store.embedder is not None and classifier_loaded:
        mode, mode_desc = "local", "SecureBERT+ + LogReg (no LLaMA — using description as text)"
    elif classifier_loaded:
        mode, mode_desc = "lite", "LogReg only (no embedder — random vectors, testing only)"
    else:
        mode, mode_desc = "not_ready", "No ML models loaded"

    return {
        "status": "ok",
        "pipeline_ready": is_ready(),
        "pipeline_mode": mode,
        "pipeline_description": mode_desc,
        "analyze_available": classifier_loaded,
        "components": {
            "llama_loaded": _store.llama_model is not None,
            "embedder_loaded": _store.embedder is not None,
            "logreg_loaded": classifier_loaded,
            "torch_installed": _TORCH,
            "cuda_available": _CUDA,
            "torch_error": _TORCH_ERR,
        },
        "model_files": files,
        "all_model_files_present": all(files.values()),
        "models_dir": str(settings.murshid_models_dir.resolve()),
        "skip_llm_env": settings.murshid_skip_llm,
        "next_step": (
            "POST /rules/analyze is ready!"
            if classifier_loaded
            else "Copy .joblib and .npy files to MURSHID_MODELS_DIR and restart."
        ),
    }
murshid_backend/app/api/routes/queries.py ADDED
@@ -0,0 +1,78 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ GET /queries/{technique_id} — SOC Analyst: fetch WQL templates.
3
+ POST /admin/templates — Admin: add new template.
4
+ PATCH /admin/templates/{template_id} — Admin: update / disable template.
5
+
6
+ Based on:
7
+ Use Case 2 (View Investigation WQL Queries) — §3.2.7
8
+ Use Case 7 (Manage static query templates) — §3.2.7
9
+ """
10
+
11
+ from fastapi import APIRouter, Depends, HTTPException
12
+ from sqlalchemy.orm import Session
13
+
14
+ from app.db.session import get_db
15
+ from app.schemas.query import QueryTemplateIn, QueryTemplateOut, QueryTemplateUpdate
16
+ from app.services.template_service import TemplateService
17
+
18
+ router = APIRouter(tags=["queries"])
19
+
20
+
21
def _get_template_service(db: Session = Depends(get_db)) -> TemplateService:
    # FastAPI dependency: builds a request-scoped TemplateService over the DB session.
    return TemplateService(db=db)
23
+
24
+
25
+ # ---------------------------------------------------------------------------
26
+ # GET /queries/{technique_id}
27
+ # ---------------------------------------------------------------------------
28
+
29
+
30
@router.get("/queries/{technique_id}", response_model=list[QueryTemplateOut])
def get_queries(
    technique_id: str,
    svc: TemplateService = Depends(_get_template_service),
):
    """
    Returns all active WQL templates for the given MITRE technique.
    Use Case 2 — §3.2.7
    """
    templates = svc.get_queries_for_technique(technique_id)
    if templates:
        return [QueryTemplateOut(**row) for row in templates]
    # Empty result is surfaced as a 404 so clients can distinguish "no
    # templates" from an empty-but-valid list.
    raise HTTPException(
        status_code=404,
        detail=f"No active query templates found for technique '{technique_id}'.",
    )
46
+
47
+
48
+ # ---------------------------------------------------------------------------
49
+ # Admin endpoints
50
+ # ---------------------------------------------------------------------------
51
+
52
+
53
@router.post("/admin/templates", response_model=QueryTemplateOut, status_code=201)
def add_template(
    body: QueryTemplateIn,
    svc: TemplateService = Depends(_get_template_service),
):
    """Admin: add a new WQL template. Use Case 7 — §3.2.7"""
    created = svc.add_template(
        technique_id=body.technique_id,
        purpose=body.purpose,
        wql_query=body.wql_query,
        note=body.note,
    )
    return QueryTemplateOut(**created)
66
+
67
+
68
@router.patch("/admin/templates/{template_id}", response_model=QueryTemplateOut)
def update_template(
    template_id: int,
    body: QueryTemplateUpdate,
    svc: TemplateService = Depends(_get_template_service),
):
    """Admin: update or disable a WQL template. Use Case 7 — §3.2.7"""
    # Only fields explicitly provided by the client are applied.
    changes = body.model_dump(exclude_none=True)
    updated = svc.update_template(template_id, changes)
    if updated is None:
        raise HTTPException(status_code=404, detail=f"Template {template_id} not found.")
    return QueryTemplateOut(**updated)
murshid_backend/app/api/routes/rules.py ADDED
@@ -0,0 +1,100 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ POST /rules/analyze — Admin: analyze a rule, persist results.
3
+ GET /results/{rule_id} — SOC Analyst: retrieve stored mappings.
4
+
5
+ Based on:
6
+ Use Case 4+5+6 (Upload, Process, Store) — §3.2.7
7
+ Use Case 1 (View techniques and scores) — §3.2.7
8
+ """
9
+
10
+ from fastapi import APIRouter, Depends, HTTPException
11
+ from sqlalchemy.orm import Session
12
+
13
+ from app.db.session import get_db
14
+ from app.ml.pipeline import is_ready
15
+ from app.schemas.result import MappingResult, ResultsResponse
16
+ from app.schemas.rule import AnalyzeRequest, AnalyzeResponse, TechniqueResult
17
+ from app.services.ml_service import MLService
18
+ from app.services.result_service import ResultService
19
+ from app.services.rule_service import RuleService
20
+
21
+ router = APIRouter(tags=["rules"])
22
+
23
+
24
def _get_rule_service(db: Session = Depends(get_db)) -> RuleService:
    # FastAPI dependency: request-scoped RuleService wired with a fresh MLService.
    return RuleService(db=db, ml=MLService())


def _get_result_service(db: Session = Depends(get_db)) -> ResultService:
    # FastAPI dependency: request-scoped ResultService over the DB session.
    return ResultService(db=db)
30
+
31
+
32
+ # ---------------------------------------------------------------------------
33
+ # POST /rules/analyze
34
+ # ---------------------------------------------------------------------------
35
+
36
+
37
@router.post("/rules/analyze", response_model=AnalyzeResponse, status_code=201)
def analyze_rule(
    body: AnalyzeRequest,
    svc: RuleService = Depends(_get_rule_service),
):
    """
    Runs the full ML pipeline on the submitted Wazuh rule XML and stores
    the results in the database.
    """
    if not is_ready():
        raise HTTPException(status_code=503, detail="ML pipeline not ready.")

    # Boundary error mapping: bad input → 422, pipeline unavailable → 503,
    # anything unexpected → 500 (no raw traceback leaks to the client).
    try:
        outcome = svc.analyze_and_persist(body.rule_xml)
    except ValueError as exc:
        raise HTTPException(status_code=422, detail=str(exc)) from exc
    except RuntimeError as exc:
        raise HTTPException(status_code=503, detail=str(exc)) from exc
    except Exception as exc:
        raise HTTPException(status_code=500, detail=str(exc)) from exc

    all_results = [TechniqueResult(**row) for row in outcome["results"]]

    return AnalyzeResponse(
        rule_id=outcome["rule_id"],
        sanitized_xml=outcome["sanitized_xml"],
        summary=outcome["summary"],
        text_for_embedding=outcome["text_for_embedding"],
        embedding_dim=outcome["embedding_dim"],
        pipeline_mode=outcome.get("pipeline_mode", "full"),
        detected=[r for r in all_results if r.predicted],
        all_results=all_results,
    )
71
+
72
+
73
+ # ---------------------------------------------------------------------------
74
+ # GET /results/{rule_id}
75
+ # ---------------------------------------------------------------------------
76
+
77
+
78
@router.get("/results/{rule_id}", response_model=ResultsResponse)
def get_results(
    rule_id: str,
    svc: ResultService = Depends(_get_result_service),
):
    """
    Returns all stored MITRE ATT&CK techniques for a rule ID, sorted by confidence.
    Use Case 1 — §3.2.7
    - mappings: ALL techniques sorted by confidence desc (for Figure 4-11 Top 5 chart)
    - detected: primary + secondary (≥0.5) only (for Figure 4-12 WQL queries)
    """
    data = svc.get_results_for_rule(rule_id)
    if data is None:
        raise HTTPException(
            status_code=404,
            detail=f"No mapping results found for rule_id '{rule_id}'. "
            "Run POST /rules/analyze first.",
        )
    mappings = [MappingResult(**m) for m in data["mappings"]]
    detected = [MappingResult(**m) for m in data["detected"]]
    return ResultsResponse(rule_id=rule_id, mappings=mappings, detected=detected)
murshid_backend/app/api/routes/stats.py ADDED
@@ -0,0 +1,43 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """GET /api/stats — dashboard KPIs."""
2
+
3
+ from fastapi import APIRouter, Depends
4
+ from sqlalchemy import func
5
+ from sqlalchemy.orm import Session
6
+
7
+ from app.db.session import get_db
8
+ from app.models.rule import Rule
9
+ from app.models.rule_technique_mapping import RuleTechniqueMapping
10
+ from app.models.query_template import QueryTemplate
11
+ from app.models.technique import Technique
12
+
13
+ router = APIRouter(prefix="/api", tags=["stats"])
14
+
15
+
16
@router.get("/stats")
def get_stats(db: Session = Depends(get_db)):
    """Dashboard KPIs: overall totals plus the 10 most frequently mapped techniques."""
    rules_total = db.query(func.count(Rule.rule_id)).scalar() or 0
    mappings_total = db.query(func.count(RuleTechniqueMapping.mapping_id)).scalar() or 0
    # Only active templates count toward the query KPI.
    queries_total = (
        db.query(func.count(QueryTemplate.template_id))
        .filter(QueryTemplate.is_active.is_(True))
        .scalar()
        or 0
    )
    techniques_total = db.query(func.count(Technique.technique_id)).scalar() or 0

    top_techniques = (
        db.query(
            RuleTechniqueMapping.technique_id,
            func.count(RuleTechniqueMapping.mapping_id).label("count"),
        )
        .group_by(RuleTechniqueMapping.technique_id)
        .order_by(func.count(RuleTechniqueMapping.mapping_id).desc())
        .limit(10)
        .all()
    )

    return {
        "total_rules_mapped": rules_total,
        "total_techniques": techniques_total,
        "total_mappings": mappings_total,
        "total_queries": queries_total,
        "technique_frequency": [
            {"technique_id": row.technique_id, "count": row.count}
            for row in top_techniques
        ],
    }
murshid_backend/app/config.py ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from pathlib import Path
2
+
3
+ from pydantic_settings import BaseSettings, SettingsConfigDict
4
+
5
+ _GP_ROOT = Path(__file__).resolve().parent.parent.parent
6
+
7
+
8
class Settings(BaseSettings):
    """Application configuration, loaded from environment variables / .env
    (pydantic-settings; unknown env vars are ignored)."""

    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )

    # Database URL and file-system locations.
    murshid_db_url: str = "mysql+pymysql://root:password@localhost:3306/murshid_db"
    murshid_models_dir: Path = _GP_ROOT / "Needed"
    hf_token: str | None = None          # Hugging Face Hub token for gated models
    murshid_skip_llm: bool = False       # true → skip LLaMA load ("local" mode)
    # NOTE(review): placeholder secret — must be overridden via env in production.
    secret_key: str = "change_me"

    # Hugging Face model identifiers.
    llama_model_id: str = "meta-llama/Meta-Llama-3-8B-Instruct"
    embed_model_id: str = "ehsanaghaei/SecureBERT_Plus"

    # Artifact filenames, resolved relative to murshid_models_dir.
    logreg_joblib: str = "murshid_logreg_pipeline_manual_oof_pcatuned.joblib"
    logreg_thresholds_npy: str = "murshid_logreg_thresholds_manual_oof_pcatuned.npy"
    label_columns_json: str = "murshid_label_columns.json"


# Singleton settings instance imported across the app.
settings = Settings()
murshid_backend/app/db/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Database layer."""
murshid_backend/app/db/base.py ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ from sqlalchemy.orm import DeclarativeBase
2
+
3
+
4
class Base(DeclarativeBase):
    """Declarative base shared by all ORM models (SQLAlchemy 2.x style)."""
    pass
murshid_backend/app/db/session.py ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from collections.abc import Generator
2
+
3
+ from sqlalchemy import create_engine
4
+ from sqlalchemy.orm import Session, sessionmaker
5
+
6
+ from app.config import settings
7
+
8
+ _is_sqlite = settings.murshid_db_url.startswith("sqlite")
9
+
10
+ engine = create_engine(
11
+ settings.murshid_db_url,
12
+ connect_args={"check_same_thread": False} if _is_sqlite else {},
13
+ pool_pre_ping=not _is_sqlite,
14
+ pool_recycle=3600 if not _is_sqlite else -1,
15
+ )
16
+
17
+ SessionLocal = sessionmaker(bind=engine, autocommit=False, autoflush=False)
18
+
19
+
20
def get_db() -> Generator[Session, None, None]:
    """FastAPI dependency: yield one session per request, always closing it."""
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()
murshid_backend/app/main.py ADDED
@@ -0,0 +1,60 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Murshid Backend — FastAPI entrypoint.
3
+
4
+ Architecture:
5
+ API Layer → app/api/routes/
6
+ Service Layer→ app/services/
7
+ ML Layer → app/ml/
8
+ Repository → app/repositories/
9
+ Database → app/db/ (SQLAlchemy + Alembic, MySQL)
10
+ """
11
+
12
+ from __future__ import annotations
13
+
14
+ from contextlib import asynccontextmanager
15
+
16
+ from pathlib import Path
17
+
18
+ from fastapi import FastAPI
19
+ from fastapi.middleware.cors import CORSMiddleware
20
+ from fastapi.staticfiles import StaticFiles
21
+
22
+ from app.api.routes import db_viewer, health, queries, rules, stats
23
+ from app.ml.pipeline import load_models, unload_models
24
+
25
+ _FRONTEND_DIR = Path(__file__).resolve().parent.parent.parent / "murshid_frontend"
26
+
27
+
28
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load all ML models once at startup; release them on shutdown.
    load_models()
    yield
    unload_models()
33
+
34
+
35
# Application instance; routers below define the public API surface.
app = FastAPI(
    title="Murshid API",
    description=(
        "MITRE ATT&CK-Aligned Techniques Mapping for SOC Analysts. "
        "Transforms Wazuh IDS rules into actionable threat intelligence."
    ),
    version="1.0.0",
    lifespan=lifespan,
)

# NOTE(review): wildcard origins combined with allow_credentials=True is
# rejected by browsers under the CORS spec — pin explicit origins before
# exposing this API to credentialed cross-origin clients.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(health.router)
app.include_router(stats.router)
app.include_router(db_viewer.router)
app.include_router(rules.router)
app.include_router(queries.router)

# Serve the static frontend (if present) at the root path; mounted after the
# routers so API routes take precedence over static files.
if _FRONTEND_DIR.is_dir():
    app.mount("/", StaticFiles(directory=str(_FRONTEND_DIR), html=True), name="frontend")
murshid_backend/app/ml/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """ML layer — logic extracted from MurshidUIPipeline.ipynb without modifying the original."""
murshid_backend/app/ml/embedder.py ADDED
@@ -0,0 +1,116 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ SecureBERT+ embedder — extracted from MurshidUIPipeline.ipynb (cell 15).
3
+ Produces a 768-dim float32 embedding for a text paragraph.
4
+ Also provides build_text_for_embedding (cell 12).
5
+ Original file is NOT modified.
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ import numpy as np
11
+ from lxml import etree
12
+
13
+ try:
14
+ import torch
15
+ from transformers import AutoModel, AutoTokenizer
16
+ _TORCH_OK = True
17
+ except (ImportError, OSError):
18
+ _TORCH_OK = False
19
+
20
+ from app.config import settings
21
+
22
+
23
+ def _norm_spaces(s: str) -> str:
24
+ return " ".join((s or "").split()).strip()
25
+
26
+
27
+ def _strip_end_punct(s: str) -> str:
28
+ return (s or "").rstrip(". ").strip()
29
+
30
+
31
def build_text_for_embedding(clean_rule: str, summary: str) -> str:
    """Combine the LLM summary with the rule's <description> — cell 12 of notebook.

    Args:
        clean_rule: sanitized rule XML string (must parse as a <rule> element).
        summary: one-sentence LLM summary (may be empty).

    Returns:
        ""                        if both parts are empty,
        the non-empty part        if only one is present,
        "summary."                if both normalise to the same sentence,
        "summary. description."   otherwise.

    Fix: the original normalised ``description`` twice in a row; the second
    ``_norm_spaces(description)`` call was a no-op and has been removed.
    """
    rule_elem = etree.fromstring(clean_rule.strip())
    raw_desc = rule_elem.findtext("description") or ""
    description = _norm_spaces(raw_desc)
    summary = _norm_spaces(summary)

    if not summary and not description:
        return ""
    if summary and not description:
        return summary
    if description and not summary:
        return description

    # Case-insensitive comparison without trailing punctuation, to detect a
    # summary that merely restates the description.
    s0 = _strip_end_punct(summary).lower()
    d0 = _strip_end_punct(description).lower()

    if s0 == d0:
        return _strip_end_punct(summary) + "."
    return f"{_strip_end_punct(summary)}. {_strip_end_punct(description)}."
52
+
53
+
54
class SecureBERTEmbedder:
    """Mean-pooling embedder using ehsanaghaei/SecureBERT_Plus — cell 15."""

    # Hard token window of the encoder, including the [CLS]/[SEP] specials.
    MAX_LEN = 512
    # Number of 512-token chunks pushed through the model per forward pass.
    BATCH_CHUNKS = 8

    def __init__(self, model_id: str | None = None, device: str | None = None):
        # Raises RuntimeError when torch/transformers failed to import.
        if not _TORCH_OK:
            raise RuntimeError("torch/transformers not available — SecureBERTEmbedder cannot be initialised.")
        mid = model_id or settings.embed_model_id
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        # Deterministic cuDNN so repeated embeddings of the same text match.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        self.tokenizer = AutoTokenizer.from_pretrained(mid, use_fast=True)
        self.model = AutoModel.from_pretrained(mid).to(self.device)
        self.model.eval()
        self.cls_id = self.tokenizer.cls_token_id
        self.sep_id = self.tokenizer.sep_token_id
        # Fall back to [SEP] as padding when the tokenizer defines no pad token.
        self.pad_id = (
            self.tokenizer.pad_token_id
            if self.tokenizer.pad_token_id is not None
            else self.sep_id
        )

    def _chunk_text(self, text: str) -> list[list[int]]:
        # Split the raw token stream into (MAX_LEN - 2)-sized windows,
        # re-adding [CLS]/[SEP] around each window.
        token_ids = self.tokenizer.encode(text, add_special_tokens=False)
        chunk_size = self.MAX_LEN - 2
        chunks = []
        for i in range(0, len(token_ids), chunk_size):
            piece = token_ids[i : i + chunk_size]
            chunks.append([self.cls_id] + piece + [self.sep_id])
        return chunks

    def embed_text(self, text: str) -> np.ndarray:
        """Return an L2-normalised float32 paragraph embedding for *text*.

        Mean-pools token embeddings per chunk (attention-mask aware), then
        averages the chunk vectors into a single paragraph vector.
        """
        chunks = self._chunk_text(text)
        all_embs: list[np.ndarray] = []

        for i in range(0, len(chunks), self.BATCH_CHUNKS):
            batch = chunks[i : i + self.BATCH_CHUNKS]
            # Right-pad every chunk in the batch to the longest one.
            max_len = max(len(x) for x in batch)
            input_ids, masks = [], []
            for x in batch:
                pad = max_len - len(x)
                input_ids.append(x + [self.pad_id] * pad)
                masks.append([1] * len(x) + [0] * pad)

            ids_t = torch.tensor(input_ids).to(self.device)
            mask_t = torch.tensor(masks).to(self.device)

            with torch.no_grad():
                out = self.model(input_ids=ids_t, attention_mask=mask_t)
                tok_emb = out.last_hidden_state
                # Mask-aware mean pooling: padding positions contribute zero.
                mask_exp = mask_t.unsqueeze(-1).expand(tok_emb.size()).float()
                summed = torch.sum(tok_emb * mask_exp, dim=1)
                denom = torch.clamp(mask_exp.sum(dim=1), min=1e-9)
                mean_pooled = summed / denom

            all_embs.append(mean_pooled.cpu().numpy())

        all_embs_np = np.vstack(all_embs)
        para_emb = all_embs_np.mean(axis=0)
        # L2-normalise; epsilon guards against an all-zero vector.
        para_emb /= np.linalg.norm(para_emb) + 1e-12
        return para_emb.astype(np.float32)
murshid_backend/app/ml/logistic_model.py ADDED
@@ -0,0 +1,111 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Logistic Regression — PRIMARY model per user decision.
3
+
4
+ Inference logic extracted VERBATIM from MurshidUIPipeline.ipynb (cell 18-19):
5
+
6
+ logreg_model = joblib.load(f"{BASE_PATH}/murshid_logreg_pipeline_manual_oof_pcatuned.joblib")
7
+ logreg_thr = np.load(f"{BASE_PATH}/murshid_logreg_thresholds_manual_oof_pcatuned.npy")
8
+
9
+ proba = logreg_model.predict_proba(X_user)
10
+
11
+ if isinstance(proba, list):
12
+ proba = np.column_stack([p[:, 1] for p in proba])
13
+ elif proba.ndim == 3:
14
+ proba = proba[:, :, 1]
15
+
16
+ proba = proba.reshape(-1)
17
+
18
+ pred_logreg = (proba >= logreg_thr).astype(int)
19
+ conf_logreg = proba * 100
20
+ gap_logreg = proba - logreg_thr
21
+
22
+ Original notebook file is NOT modified.
23
+ """
24
+
25
+ from __future__ import annotations
26
+
27
+ import json
28
+ from pathlib import Path
29
+
30
+ import joblib
31
+ import numpy as np
32
+
33
+ from app.config import settings
34
+
35
+
36
class LogisticRegressionModel:
    """
    Wraps the trained Logistic Regression pipeline + per-label thresholds.
    File structure (from notebook cell 18):
        logreg_model → sklearn Pipeline (PCA-tuned + OneVsRestClassifier(LogReg))
        logreg_thr → np.ndarray shape (n_techniques,) per-label thresholds
    """

    def __init__(self, models_dir: Path | None = None) -> None:
        # Raises FileNotFoundError for a missing artifact and ValueError when
        # the threshold vector does not match the label list length.
        base = Path(models_dir or settings.murshid_models_dir).resolve()

        logreg_path = base / settings.logreg_joblib
        thr_path = base / settings.logreg_thresholds_npy
        labels_path = base / settings.label_columns_json

        # Fail fast with a precise message before joblib opens anything.
        for p in (logreg_path, thr_path, labels_path):
            if not p.is_file():
                raise FileNotFoundError(f"Missing model file: {p}")

        # --- notebook cell 18: load model + thresholds ---
        self._model = joblib.load(logreg_path)  # logreg_model
        self._thr = np.load(thr_path)  # logreg_thr

        with open(labels_path, encoding="utf-8") as f:
            self.technique_names: list[str] = json.load(f)

        # Sanity check: exactly one threshold per technique label.
        n = len(self.technique_names)
        if self._thr.shape[0] != n:
            raise ValueError(
                f"LogReg thresholds length {self._thr.shape[0]} != {n} labels"
            )

    # ------------------------------------------------------------------

    def predict(self, embedding_1d: np.ndarray) -> list[dict]:
        """
        Run LogReg inference exactly as in notebook cell 19.

        Returns list of dicts sorted by confidence_percent desc:
            technique_id, predicted, confidence_percent, proba, threshold, gap
        """
        X_user = embedding_1d.reshape(1, -1)

        # --- verbatim from notebook cell 19 ---
        proba = self._model.predict_proba(X_user)

        # Normalise the predict_proba output shape: list of (n, 2) arrays,
        # a 3-D (labels, n, 2) array, or already a flat probability matrix.
        if isinstance(proba, list):
            proba = np.column_stack([p[:, 1] for p in proba])
        elif proba.ndim == 3:
            proba = proba[:, :, 1]

        proba = proba.reshape(-1)

        pred_logreg = (proba >= self._thr).astype(int)
        conf_logreg = proba * 100
        gap_logreg = proba - self._thr
        # --- end verbatim ---

        results = [
            {
                "technique_id": self.technique_names[i],
                "predicted": bool(pred_logreg[i]),
                "confidence_percent": round(float(conf_logreg[i]), 2),
                "proba": round(float(proba[i]), 4),
                "threshold": round(float(self._thr[i]), 4),
                "gap": round(float(gap_logreg[i]), 4),
            }
            for i in range(len(self.technique_names))
        ]

        # sort: predicted first, then by confidence desc (notebook sort logic)
        return sorted(
            results,
            key=lambda r: (r["predicted"], r["confidence_percent"]),
            reverse=True,
        )
murshid_backend/app/ml/pipeline.py ADDED
@@ -0,0 +1,225 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Full inference pipeline — combines sanitizer → summarizer → embedder → logistic_model.
3
+ Exposes analyze_rule(rule_xml) -> dict as the single callable for the service layer.
4
+
5
+ Modes:
6
+ FULL : LLaMA available + SecureBERT+ + LogReg (GPU/Colab required)
7
+ LOCAL : MURSHID_SKIP_LLM=true + SecureBERT+ + LogReg
8
+ → skips LLaMA; uses <description> field as the paragraph text.
9
+ This allows POST /rules/analyze to work locally without a GPU.
10
+ LITE : torch not installed → uses a trivial bag-of-words fake embedding (testing only)
11
+ """
12
+
13
+ from __future__ import annotations
14
+
15
+ import xml.etree.ElementTree as ET
16
+ from dataclasses import dataclass
17
+ from typing import Any
18
+
19
+ import numpy as np
20
+
21
+ from app.config import settings
22
+ from app.ml.logistic_model import LogisticRegressionModel
23
+ from app.ml.sanitizer import sanitize_rule_from_string
24
+
25
+ try:
26
+ import torch
27
+ from huggingface_hub import login as hf_login
28
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
29
+ from app.ml.embedder import SecureBERTEmbedder, build_text_for_embedding
30
+ from app.ml.summarizer import summarize_one_rule
31
+ _TORCH_AVAILABLE = True
32
+ _TORCH_ERROR: str | None = None
33
+ except (ImportError, OSError) as _e:
34
+ _TORCH_AVAILABLE = False
35
+ _TORCH_ERROR = str(_e)
36
+
37
+
38
+ # ---------------------------------------------------------------------------
39
+ # Singleton container (loaded once at startup)
40
+ # ---------------------------------------------------------------------------
41
+
42
+
43
@dataclass
class _ModelStore:
    # Singleton holder for lazily-loaded ML components (populated by load_models()).
    llama_model: Any | None = None       # quantised LLaMA — FULL mode only
    llama_tokenizer: Any | None = None
    llama_device: str = "cpu"            # "cuda" when a GPU is available
    embedder: SecureBERTEmbedder | None = None       # SecureBERT+ embedder
    logreg: LogisticRegressionModel | None = None    # primary classifier
    ready: bool = False                  # flipped by load_models()/unload_models()
51
+
52
+
53
+ _store = _ModelStore()
54
+
55
+
56
def load_models() -> None:
    """
    Load all models into _store.
    Call once at FastAPI startup (lifespan).

    Degrades gracefully: each component that fails to load is left as None
    and only logged — health() then reports the resulting pipeline mode.
    """
    # Authenticate against the HF Hub first so gated models can be pulled.
    if _TORCH_AVAILABLE and settings.hf_token:
        hf_login(token=settings.hf_token, add_to_git_credential=False)

    if not settings.murshid_skip_llm:
        if not _TORCH_AVAILABLE:
            print("[Murshid] WARNING: torch not installed — skipping LLM load.")
        else:
            # 4-bit NF4 quantisation so the LLaMA model fits in limited VRAM.
            bnb_cfg = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch.float16,
            )
            tok = AutoTokenizer.from_pretrained(settings.llama_model_id, use_fast=True)
            # LLaMA tokenizers ship without a pad token; reuse EOS.
            if tok.pad_token is None:
                tok.pad_token = tok.eos_token
            m = AutoModelForCausalLM.from_pretrained(
                settings.llama_model_id,
                quantization_config=bnb_cfg,
                device_map="auto",
                low_cpu_mem_usage=True,
                dtype=torch.float16,
            )
            m.config.pad_token_id = tok.pad_token_id
            m.eval()
            _store.llama_tokenizer = tok
            _store.llama_model = m
            _store.llama_device = "cuda" if torch.cuda.is_available() else "cpu"

    if _TORCH_AVAILABLE:
        try:
            _store.embedder = SecureBERTEmbedder()
        except Exception as exc:
            print(f"[Murshid] WARNING: SecureBERT+ not loaded — {exc}")
            _store.embedder = None
    else:
        print("[Murshid] WARNING: torch not installed — embedder skipped.")
        _store.embedder = None

    # LogReg is required by /rules/analyze; missing files degrade to None.
    try:
        _store.logreg = LogisticRegressionModel()
    except FileNotFoundError as exc:
        print(f"[Murshid] WARNING: LogReg model files missing — {exc}")
        _store.logreg = None
    except Exception as exc:
        print(f"[Murshid] WARNING: LogReg not loaded — {exc}")
        _store.logreg = None

    # ready=True even in degraded modes; analyze_rule() enforces logreg presence.
    _store.ready = True
110
+
111
+
112
def unload_models() -> None:
    """Drop every loaded component and mark the pipeline as not ready."""
    for attr in ("llama_model", "llama_tokenizer", "embedder", "logreg"):
        setattr(_store, attr, None)
    _store.ready = False
118
+
119
+
120
def is_ready() -> bool:
    """True once load_models() has completed (regardless of loaded mode)."""
    return bool(_store.ready)
122
+
123
+
124
+ # ---------------------------------------------------------------------------
125
+ # Public function
126
+ # ---------------------------------------------------------------------------
127
+
128
+
129
+ def _extract_description(clean_xml: str) -> str:
130
+ """Extract <description> text from sanitized rule XML."""
131
+ try:
132
+ elem = ET.fromstring(clean_xml.strip())
133
+ desc = elem.findtext("description") or ""
134
+ return " ".join(desc.split()).strip()
135
+ except ET.ParseError:
136
+ return ""
137
+
138
+
139
def analyze_rule(rule_xml: str) -> dict:
    """
    Full pipeline: XML → sanitize → summarize → embed → LogReg → ranked results.

    Operates in three modes depending on environment:

    FULL mode (MURSHID_SKIP_LLM=false, GPU available):
        LLaMA generates a natural-language summary → SecureBERT+ embeds it → LogReg predicts.

    LOCAL mode (MURSHID_SKIP_LLM=true, torch installed):
        Skips LLaMA. Uses the rule's <description> field directly as the text.
        SecureBERT+ still embeds it properly → LogReg predicts.

    LITE mode (torch not installed):
        Uses a seeded pseudo-random unit-vector as a placeholder embedding.
        Results are meaningless — for structural testing only.
        FIX: the seed is now derived with zlib.crc32 instead of the built-in
        hash(), which is salted per interpreter run (PYTHONHASHSEED) and made
        lite-mode output non-reproducible across restarts.

    Raises:
        RuntimeError: models not loaded, or LogReg artifacts missing.
        ValueError: input is not a complete <rule>...</rule> document.

    Returns:
        {
            "sanitized_xml": str,
            "summary": str,            # LLaMA output OR description OR "(lite mode)"
            "text_for_embedding": str,
            "embedding_dim": int,
            "pipeline_mode": str,      # "full" | "local" | "lite"
            "results": [...],          # all techniques sorted by confidence desc
            "detected": [...],         # predicted == True only
        }
    """
    if not _store.ready:
        raise RuntimeError("Models not loaded. Call load_models() first.")

    if "<rule" not in rule_xml or "</rule>" not in rule_xml:
        raise ValueError("Incomplete XML: must contain <rule> and </rule>.")

    if _store.logreg is None:
        raise RuntimeError(
            "LogReg model not loaded. "
            "Copy the .joblib and .npy files to MURSHID_MODELS_DIR and restart."
        )

    clean_xml = sanitize_rule_from_string(rule_xml)

    # ── Choose mode ────────────────────────────────────────────────────────────
    if _store.llama_model is not None and _store.llama_tokenizer is not None:
        # FULL mode: LLaMA summary
        mode = "full"
        summary = summarize_one_rule(
            clean_xml,
            _store.llama_model,
            _store.llama_tokenizer,
            _store.llama_device,
        )
        text = build_text_for_embedding(clean_xml, summary)
        embedding: np.ndarray = _store.embedder.embed_text(text)

    elif _store.embedder is not None:
        # LOCAL mode: no LLaMA, use <description> as text
        mode = "local"
        desc = _extract_description(clean_xml)
        summary = desc or "No description available."
        text = desc or clean_xml[:300]
        embedding = _store.embedder.embed_text(text)

    else:
        # LITE mode: torch not available, placeholder vector (structural test only)
        import zlib  # local import: only needed in this fallback branch

        mode = "lite"
        desc = _extract_description(clean_xml)
        summary = f"(lite mode — no embedder) {desc}"
        text = desc or clean_xml[:300]
        dim = 768
        # crc32 is stable across processes, so identical input text always
        # yields the identical placeholder embedding.
        seed = zlib.crc32(text.encode("utf-8"))
        raw = np.random.default_rng(seed).random(dim).astype(np.float32)
        embedding = raw / (np.linalg.norm(raw) + 1e-12)

    # ── Classify ───────────────────────────────────────────────────────────────
    all_results = _store.logreg.predict(embedding)
    detected = [r for r in all_results if r["predicted"]]

    return {
        "sanitized_xml": clean_xml,
        "summary": summary,
        "text_for_embedding": text,
        "embedding_dim": int(embedding.shape[0]),
        "pipeline_mode": mode,
        "results": all_results,
        "detected": detected,
    }
murshid_backend/app/ml/sanitizer.py ADDED
@@ -0,0 +1,32 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Rule sanitizer — extracted from MurshidUIPipeline.ipynb (cell 10).
3
+ Removes: mitre, if_sid, group, if_group tags from Wazuh XML rule.
4
+ Original file is NOT modified.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import copy
10
+ import xml.etree.ElementTree as ET
11
+
12
+ REMOVE_TAGS_ANYWHERE: set[str] = {"mitre", "if_sid", "group", "if_group"}
13
+
14
+
15
+ def _remove_tag_anywhere(root_elem: ET.Element, tag: str) -> None:
16
+ for parent in list(root_elem.iter()):
17
+ for child in list(parent):
18
+ if child.tag == tag:
19
+ parent.remove(child)
20
+
21
+
22
def sanitize_rule(rule_elem: ET.Element) -> ET.Element:
    """Return a deep copy of *rule_elem* with every tag in REMOVE_TAGS_ANYWHERE stripped."""
    cleaned = copy.deepcopy(rule_elem)
    for tag in REMOVE_TAGS_ANYWHERE:
        _remove_tag_anywhere(cleaned, tag)
    return cleaned
27
+
28
+
29
def sanitize_rule_from_string(rule_xml: str) -> str:
    """Parse a rule XML string, sanitize it, and serialise back to unicode XML."""
    parsed = ET.fromstring(rule_xml.strip())
    return ET.tostring(sanitize_rule(parsed), encoding="unicode")
murshid_backend/app/ml/summarizer.py ADDED
@@ -0,0 +1,262 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ LLM summarizer — extracted from MurshidUIPipeline.ipynb (cells 11-12).
3
+ Converts sanitized Wazuh XML rule to a one-sentence behavior summary.
4
+ Original file is NOT modified.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import json
10
+ import re
11
+ import unicodedata
12
+
13
+ import torch
14
+
15
+ # --------------------------------------------------------------------------
16
+ # Constants (identical to notebook)
17
+ # --------------------------------------------------------------------------
18
+ MAX_INPUT_TOKENS = 2048
19
+ MAX_NEW_TOKENS = 160
20
+ DO_SAMPLE = False
21
+ NUM_BEAMS = 4
22
+ MAX_RETRIES = 3
23
+
24
+ SYSTEM_INSTR = (
25
+ "You are a cybersecurity expert.\n"
26
+ "You will be provided with a Wazuh rule in XML format.\n"
27
+ "Write EXACTLY ONE sentence describing the observable event pattern the rule matches.\n\n"
28
+ "HARD CONSTRAINTS:\n"
29
+ '1) Output must be minified JSON only: {"summary":"..."}\n'
30
+ "2) ONE sentence only.\n"
31
+ "3) Start with one of: Detects, Monitors, Identifies, Flags, Reports, Tracks, Captures.\n"
32
+ "4) Use ONLY facts present in the XML. Describe the observable system event only.\n"
33
+ "5) Do NOT infer attacker intent, attack type, or technique.\n"
34
+ "6) Do NOT mention MITRE, ATT&CK, or attack technique names unless explicitly present in the XML.\n"
35
+ "7) Do NOT use speculative language: likely, potentially, possible, possibly, may indicate, or could indicate.\n"
36
+ "8) Length: 7 to 18 words.\n"
37
+ "9) SHOULD include a clear event type when possible.\n"
38
+ "10) Mention at least ONE concrete indicator if available (event_id, process name, file path,\n"
39
+ " registry key, service, protocol/port, URL pattern, command, username, IP).\n"
40
+ "If only a single indicator exists, still produce a complete behavior-focused sentence.\n"
41
+ )
42
+
43
+ REPAIR_HINT = (
44
+ "Your previous output was rejected.\n"
45
+ "Fix it to satisfy ALL constraints:\n"
46
+ '- Output MUST be minified JSON only: {"summary":"..."}\n'
47
+ "- One sentence only.\n"
48
+ "- Keep it behavior-focused.\n"
49
+ "- Include at least ONE concrete indicator if present in the XML.\n"
50
+ "- Do NOT add any extra text outside JSON.\n"
51
+ )
52
+
53
+ VERB_OK = ("Detects", "Monitors", "Identifies", "Flags", "Reports", "Tracks", "Captures")
54
+ JSON_OBJ_RE = re.compile(r"\{.*?\}", re.DOTALL)
55
+ BAD_INTRO_RE = re.compile(
56
+ r"^\s*(this\s+(wazuh\s+)?rule|the\s+rule|this\s+alert)\b", re.IGNORECASE
57
+ )
58
+ BAD_INTENT_RE = re.compile(r"\b(likely|potentially|possible|maybe)\b", re.IGNORECASE)
59
+ GENERIC_RE = re.compile(
60
+ r"\b(detects activity|detects suspicious activity|detects potentially suspicious activity|"
61
+ r"monitors activity|reports activity|detects an event pattern defined by the rule indicators)\b",
62
+ re.IGNORECASE,
63
+ )
64
+
65
+
66
+ # --------------------------------------------------------------------------
67
+ # Helpers (identical to notebook)
68
+ # --------------------------------------------------------------------------
69
+
70
+ def _build_prompt(rule_xml: str, tokenizer, extra_hint: str = "") -> str:
71
+ sys = SYSTEM_INSTR + (("\n" + extra_hint) if extra_hint else "")
72
+ user = f"Wazuh rule XML:\n{rule_xml}\n\nReturn JSON only:"
73
+ messages = [{"role": "system", "content": sys}, {"role": "user", "content": user}]
74
+ return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
75
+
76
+
77
+ def _looks_broken_encoding(s: str) -> bool:
78
+ return any(m in s for m in ("Ã", "Ð", "Ñ", "â", "â")) if s else False
79
+
80
+
81
+ def _try_extract_json_summary(text: str) -> str | None:
82
+ t = (text or "").strip()
83
+ if not t:
84
+ return None
85
+ if t.startswith("{") and '"summary"' in t:
86
+ try:
87
+ obj = json.loads(t)
88
+ if isinstance(obj, dict) and isinstance(obj.get("summary"), str):
89
+ return obj["summary"].strip()
90
+ except Exception:
91
+ pass
92
+ m = JSON_OBJ_RE.search(t)
93
+ if m and '"summary"' in m.group(0):
94
+ blob = m.group(0)
95
+ try:
96
+ obj = json.loads(blob)
97
+ if isinstance(obj, dict) and isinstance(obj.get("summary"), str):
98
+ return obj["summary"].strip()
99
+ except Exception:
100
+ m2 = re.search(r'"summary"\s*:\s*"([^"]+)"', blob)
101
+ if m2:
102
+ return m2.group(1).strip()
103
+ return None
104
+
105
+
106
+ def _normalize_one_sentence(s: str) -> str:
107
+ s = re.sub(r"\s+", " ", (s or "").strip()).strip()
108
+ s = unicodedata.normalize("NFKC", s)
109
+ if not s:
110
+ return ""
111
+ if BAD_INTRO_RE.match(s):
112
+ s = BAD_INTRO_RE.sub("", s).lstrip(":,- ").strip()
113
+ if not s:
114
+ return ""
115
+ if not any(s.startswith(v) for v in VERB_OK):
116
+ s = "Detects " + (s[0].lower() + s[1:]) if len(s) > 1 else ""
117
+ if not s:
118
+ return ""
119
+ m = re.search(r"[.!?](?:\s|$)", s)
120
+ s = s[: m.end()].strip() if m else s + "."
121
+ s = re.sub(r"^(Detects\s+)+", "Detects ", s).strip()
122
+ return re.sub(r"\s+", " ", s).strip()
123
+
124
+
125
+ def _looks_truncated(s: str) -> bool:
126
+ return not s or s.strip().endswith(("(", ":", " -", ","))
127
+
128
+
129
+ def _has_behavior_signal(s: str) -> bool:
130
+ kws = ["create","delete","execute","spawn","launch","login","logon","authentication",
131
+ "connect","request","query","modify","registry","process","command","file",
132
+ "service","ip","url","dns","http","vpn","account"]
133
+ return any(k in s.lower() for k in kws)
134
+
135
+
136
+ def _has_indicator_signal(s: str) -> bool:
137
+ kws = [".exe",".dll",".ps1",".bat",".cmd","powershell","cmd.exe","reg.exe","rundll32",
138
+ "svchost","registry","temp","system32","event_id","http","dns","ip","url","port","key"]
139
+ return any(k in s.lower() for k in kws)
140
+
141
+
142
+ def _is_bad(s: str) -> bool:
143
+ if not s or BAD_INTRO_RE.match(s) or BAD_INTENT_RE.search(s) or GENERIC_RE.search(s):
144
+ return True
145
+ if _looks_broken_encoding(s) or _looks_truncated(s):
146
+ return True
147
+ wc = len(s.split())
148
+ if wc < 7 or wc > 18 or not _has_behavior_signal(s):
149
+ return True
150
+ return bool((s.startswith("{") and "summary" in s) or ('"summary"' in s and "{" in s))
151
+
152
+
153
+ def _is_catastrophic(s: str) -> bool:
154
+ return not s or _looks_broken_encoding(s) or _looks_truncated(s) or len(s.split()) < 3
155
+
156
+
157
+ def _score(s: str) -> int:
158
+ wc = len(s.split())
159
+ return (
160
+ (3 if 7 <= wc <= 18 else 0)
161
+ + (3 if _has_behavior_signal(s) else 0)
162
+ + (2 if _has_indicator_signal(s) else 0)
163
+ + (1 if not GENERIC_RE.search(s) else 0)
164
+ + (1 if not BAD_INTENT_RE.search(s) else 0)
165
+ )
166
+
167
+
168
+ def _rescue_finalize(s: str) -> str:
169
+ s = _normalize_one_sentence(s)
170
+ if not s:
171
+ return "Detects rule-matched behavior."
172
+ s = re.sub(r",\s*(possibly|potentially|maybe|may)\b.*$", "", s, flags=re.IGNORECASE).strip()
173
+ s = re.sub(r"\b(possibly|potentially|maybe|may)\b", "", s, flags=re.IGNORECASE)
174
+ s = re.sub(r"\s+", " ", s).strip()
175
+ if len(s.split()) < 7:
176
+ low = s.lower()
177
+ for kw, rep in [
178
+ ("powershell", "Detects powershell.exe process execution."),
179
+ ("cmd", "Detects cmd.exe process execution."),
180
+ ("reg", "Detects reg.exe process execution."),
181
+ ("svchost", "Detects svchost.exe process execution."),
182
+ ]:
183
+ if kw in low:
184
+ s = rep
185
+ break
186
+ else:
187
+ s = s.rstrip(".") + " matching rule indicators."
188
+ if _looks_truncated(s):
189
+ s = s.rstrip(".") + " matching rule indicators."
190
+ if not any(s.startswith(v) for v in VERB_OK):
191
+ s = "Detects " + s[0].lower() + s[1:] if len(s) > 1 else "Detects rule-matched behavior."
192
+ words = s.split()
193
+ if len(words) > 18:
194
+ s = " ".join(words[:18]).rstrip(".") + "."
195
+ return re.sub(r"\s+", " ", s if s.endswith(".") else s + ".").strip()
196
+
197
+
198
+ # --------------------------------------------------------------------------
199
+ # Public API
200
+ # --------------------------------------------------------------------------
201
+
202
+ def summarize_one_rule(rule_xml: str, model, tokenizer, device: str | None = None) -> str:
203
+ """Generate a one-sentence summary for a sanitized Wazuh rule XML string."""
204
+ if device is None:
205
+ device = "cuda" if torch.cuda.is_available() else "cpu"
206
+
207
+ pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
208
+ eos_id = tokenizer.eos_token_id or pad_id
209
+
210
+ best: str | None = None
211
+ best_any: str | None = None
212
+ last_raw = ""
213
+ last_cleaned = ""
214
+
215
+ for attempt in range(1, MAX_RETRIES + 1):
216
+ prompt = _build_prompt(
217
+ rule_xml, tokenizer, extra_hint=REPAIR_HINT if attempt >= 2 else ""
218
+ )
219
+ inputs = tokenizer(
220
+ prompt, return_tensors="pt", truncation=True, max_length=MAX_INPUT_TOKENS
221
+ ).to(device)
222
+
223
+ with torch.no_grad():
224
+ outputs = model.generate(
225
+ **inputs,
226
+ max_new_tokens=MAX_NEW_TOKENS,
227
+ do_sample=DO_SAMPLE,
228
+ num_beams=NUM_BEAMS,
229
+ pad_token_id=pad_id,
230
+ eos_token_id=eos_id,
231
+ repetition_penalty=1.05,
232
+ no_repeat_ngram_size=3,
233
+ )
234
+
235
+ raw = tokenizer.decode(
236
+ outputs[0][inputs["input_ids"].shape[1] :], skip_special_tokens=True
237
+ ).strip()
238
+ last_raw = raw
239
+
240
+ parsed = _try_extract_json_summary(raw)
241
+ if parsed is None:
242
+ continue
243
+
244
+ cleaned = _normalize_one_sentence(parsed)
245
+ last_cleaned = cleaned
246
+
247
+ if cleaned and not _is_catastrophic(cleaned):
248
+ if best_any is None or _score(cleaned) > _score(best_any):
249
+ best_any = cleaned
250
+
251
+ if not _is_bad(cleaned):
252
+ best = cleaned
253
+ break
254
+
255
+ if best is None:
256
+ if best_any and not _is_catastrophic(best_any):
257
+ best = best_any
258
+ else:
259
+ src = last_cleaned or _try_extract_json_summary(last_raw) or last_raw
260
+ best = _rescue_finalize(src)
261
+
262
+ return best
murshid_backend/app/ml/svm_model.py ADDED
@@ -0,0 +1,101 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ SVM classifier — PRIMARY model per the report (§3.1.3 + §4.1).
3
+
4
+ Report quote:
5
+ "the Support Vector Machine (SVM) was adopted as the core classifier"
6
+ "classification using SVM to predict the associated MITRE ATT&CK techniques"
7
+
8
+ Inference logic (verbatim from MurshidUIPipeline.ipynb cell 16+19):
9
+ scores = svm_model.named_steps["clf"].decision_function(
10
+ svm_model.named_steps["pca"].transform(X_user)
11
+ ).reshape(-1)
12
+ pred = (scores >= thr_per_label).astype(int)
13
+ margins = scores - thr_per_label
14
+ conf = sigmoid(margins) * 100
15
+
16
+ Original notebook file is NOT modified.
17
+ """
18
+
19
+ from __future__ import annotations
20
+
21
+ import json
22
+ from pathlib import Path
23
+
24
+ import joblib
25
+ import numpy as np
26
+
27
+ from app.config import settings
28
+
29
+
30
+ def _sigmoid(x: np.ndarray) -> np.ndarray:
31
+ """Probability calibration: sigmoid(margin) — notebook cell 17."""
32
+ x = np.clip(x, -30, 30)
33
+ return 1.0 / (1.0 + np.exp(-x))
34
+
35
+
36
+ class SVMModel:
37
+ """
38
+ Wraps the trained LinearSVC pipeline with per-label thresholds.
39
+ Structure of the .joblib pack (from notebook):
40
+ svm_pack["model"] → sklearn Pipeline (PCA + LinearSVC)
41
+ svm_pack["thresholds_per_label"] → np.ndarray shape (n_techniques,)
42
+ """
43
+
44
+ def __init__(self, models_dir: Path | None = None) -> None:
45
+ base = Path(models_dir or settings.murshid_models_dir).resolve()
46
+
47
+ svm_path = base / settings.svm_joblib
48
+ labels_path = base / settings.label_columns_json
49
+
50
+ for p in (svm_path, labels_path):
51
+ if not p.is_file():
52
+ raise FileNotFoundError(f"Missing model file: {p}")
53
+
54
+ svm_pack = joblib.load(svm_path)
55
+ self._model = svm_pack["model"] # Pipeline(PCA → LinearSVC)
56
+ self._thresholds = np.asarray(
57
+ svm_pack["thresholds_per_label"], dtype=np.float64
58
+ )
59
+
60
+ with open(labels_path, encoding="utf-8") as f:
61
+ self.technique_names: list[str] = json.load(f)
62
+
63
+ n = len(self.technique_names)
64
+ if self._thresholds.shape[0] != n:
65
+ raise ValueError(
66
+ f"SVM thresholds length {self._thresholds.shape[0]} != {n} labels"
67
+ )
68
+
69
+ # ------------------------------------------------------------------
70
+
71
+ def predict(self, embedding_1d: np.ndarray) -> list[dict]:
72
+ """
73
+ Run SVM inference exactly as in the notebook.
74
+
75
+ Returns list of dicts sorted by confidence_percent desc:
76
+ technique_id, predicted, confidence_percent, score, threshold, margin
77
+ """
78
+ X = embedding_1d.reshape(1, -1)
79
+
80
+ # Apply PCA then LinearSVC decision function (notebook cell 19)
81
+ scores = self._model.named_steps["clf"].decision_function(
82
+ self._model.named_steps["pca"].transform(X)
83
+ ).reshape(-1)
84
+
85
+ pred = (scores >= self._thresholds).astype(int)
86
+ margins = scores - self._thresholds
87
+ conf = _sigmoid(margins) * 100 # calibrated confidence (%)
88
+
89
+ results = [
90
+ {
91
+ "technique_id": self.technique_names[i],
92
+ "predicted": bool(pred[i]),
93
+ "confidence_percent": round(float(conf[i]), 2),
94
+ "score": round(float(scores[i]), 4),
95
+ "threshold": round(float(self._thresholds[i]), 4),
96
+ "margin": round(float(margins[i]), 4),
97
+ }
98
+ for i in range(len(self.technique_names))
99
+ ]
100
+
101
+ return sorted(results, key=lambda r: r["confidence_percent"], reverse=True)
murshid_backend/app/models/__init__.py ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """SQLAlchemy ORM models (tables defined exactly per ER Diagram §3.2.6 of the report)."""
2
+ from app.models.user import User
3
+ from app.models.mapping_job import MappingJob
4
+ from app.models.rule import Rule
5
+ from app.models.technique import Technique
6
+ from app.models.rule_technique_mapping import RuleTechniqueMapping
7
+ from app.models.query_template import QueryTemplate
8
+
9
+ __all__ = [
10
+ "User",
11
+ "MappingJob",
12
+ "Rule",
13
+ "Technique",
14
+ "RuleTechniqueMapping",
15
+ "QueryTemplate",
16
+ ]
murshid_backend/app/models/mapping_job.py ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ MappingJob entity — ER Diagram §3.2.6
3
+ Attributes: job_ID, file_name, timestamp, rules_count, status, progress
4
+ Linked to User via "uploads" relationship.
5
+ Also visible in Figure 4-14 (Mapping Progress Table).
6
+ """
7
+
8
+ import enum
9
+ from datetime import datetime
10
+
11
+ from sqlalchemy import DateTime, Enum, ForeignKey, Integer, String, func
12
+ from sqlalchemy.orm import Mapped, mapped_column, relationship
13
+
14
+ from app.db.base import Base
15
+
16
+
17
+ class JobStatus(str, enum.Enum):
18
+ pending = "pending"
19
+ running = "running"
20
+ done = "done"
21
+ failed = "failed"
22
+
23
+
24
+ class MappingJob(Base):
25
+ __tablename__ = "mapping_jobs"
26
+
27
+ job_id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
28
+ user_id: Mapped[int] = mapped_column(ForeignKey("users.user_id"), nullable=False)
29
+ file_name: Mapped[str] = mapped_column(String(255), nullable=False)
30
+ rules_count: Mapped[int] = mapped_column(Integer, default=0)
31
+ status: Mapped[JobStatus] = mapped_column(
32
+ Enum(JobStatus), nullable=False, default=JobStatus.pending
33
+ )
34
+ progress: Mapped[int] = mapped_column(Integer, default=0)
35
+ timestamp: Mapped[datetime] = mapped_column(
36
+ DateTime, nullable=False, server_default=func.now()
37
+ )
38
+
39
+ user: Mapped["User"] = relationship(back_populates="jobs")
40
+ rules: Mapped[list["Rule"]] = relationship(back_populates="job")
murshid_backend/app/models/query_template.py ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ QueryTemplate entity — ER Diagram §3.2.6
3
+ Attributes: Template_ID, Purpose, wql_query, Note
4
+ Linked to Technique. Admin can add/update/disable (Use Case 7, §3.2.7).
5
+ """
6
+
7
+ from sqlalchemy import Boolean, ForeignKey, String, Text
8
+ from sqlalchemy.orm import Mapped, mapped_column, relationship
9
+
10
+ from app.db.base import Base
11
+
12
+
13
+ class QueryTemplate(Base):
14
+ __tablename__ = "query_templates"
15
+
16
+ template_id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
17
+ technique_id: Mapped[str] = mapped_column(
18
+ String(20), ForeignKey("techniques.technique_id"), nullable=False
19
+ )
20
+ purpose: Mapped[str | None] = mapped_column(String(255), nullable=True)
21
+ # WQL with placeholders: ${HOST}, ${USER}, ${IP}
22
+ wql_query: Mapped[str] = mapped_column(Text, nullable=False)
23
+ note: Mapped[str | None] = mapped_column(Text, nullable=True)
24
+ # Admin can disable without deleting — Use Case 7
25
+ is_active: Mapped[bool] = mapped_column(Boolean, default=True, nullable=False)
26
+
27
+ technique: Mapped["Technique"] = relationship(back_populates="query_templates")
murshid_backend/app/models/rule.py ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Rule entity — ER Diagram §3.2.6
3
+ Attributes: Rule_ID, embedding_vector, job_ID (FK)
4
+ Rule_ID is the Wazuh rule ID string (e.g. "597").
5
+ """
6
+
7
+ from sqlalchemy import ForeignKey, String, Text
8
+ from sqlalchemy.orm import Mapped, mapped_column, relationship
9
+
10
+ from app.db.base import Base
11
+
12
+
13
+ class Rule(Base):
14
+ __tablename__ = "rules"
15
+
16
+ rule_id: Mapped[str] = mapped_column(String(50), primary_key=True)
17
+ job_id: Mapped[int | None] = mapped_column(
18
+ ForeignKey("mapping_jobs.job_id"), nullable=True
19
+ )
20
+ # 768-dimensional float vector stored as JSON string; kept nullable for
21
+ # rules where only the mapping result is persisted without the vector.
22
+ embedding_vector: Mapped[str | None] = mapped_column(Text, nullable=True)
23
+
24
+ job: Mapped["MappingJob | None"] = relationship(back_populates="rules")
25
+ technique_mappings: Mapped[list["RuleTechniqueMapping"]] = relationship(
26
+ back_populates="rule", cascade="all, delete-orphan"
27
+ )
murshid_backend/app/models/rule_technique_mapping.py ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ RuleTechniqueMapping associative entity — ER Diagram §3.2.6
3
+ Attributes: Mapping_ID, Rule_ID (FK), Technique_ID (FK), confidence_score
4
+ Index on rule_id for fast lookup — mentioned explicitly in Use Case 6 (§3.2.7).
5
+ """
6
+
7
+ from sqlalchemy import Float, ForeignKey, Index, Integer, String
8
+ from sqlalchemy.orm import Mapped, mapped_column, relationship
9
+
10
+ from app.db.base import Base
11
+
12
+
13
+ class RuleTechniqueMapping(Base):
14
+ __tablename__ = "rule_technique_mappings"
15
+
16
+ mapping_id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
17
+ rule_id: Mapped[str] = mapped_column(
18
+ String(50), ForeignKey("rules.rule_id"), nullable=False
19
+ )
20
+ technique_id: Mapped[str] = mapped_column(
21
+ String(20), ForeignKey("techniques.technique_id"), nullable=False
22
+ )
23
+ confidence_score: Mapped[float] = mapped_column(Float, nullable=False)
24
+
25
+ rule: Mapped["Rule"] = relationship(back_populates="technique_mappings")
26
+ technique: Mapped["Technique"] = relationship(back_populates="rule_mappings")
27
+
28
+ __table_args__ = (
29
+ # "creates an index on rule_id for efficient lookup" — Use Case 6
30
+ Index("ix_rule_technique_rule_id", "rule_id"),
31
+ )
murshid_backend/app/models/technique.py ADDED
@@ -0,0 +1,24 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Technique entity — ER Diagram §3.2.6
3
+ Attributes: Technique_ID, technique_name, tactic
4
+ """
5
+
6
+ from sqlalchemy import String
7
+ from sqlalchemy.orm import Mapped, mapped_column, relationship
8
+
9
+ from app.db.base import Base
10
+
11
+
12
+ class Technique(Base):
13
+ __tablename__ = "techniques"
14
+
15
+ technique_id: Mapped[str] = mapped_column(String(20), primary_key=True)
16
+ technique_name: Mapped[str] = mapped_column(String(255), nullable=False)
17
+ tactic: Mapped[str | None] = mapped_column(String(100), nullable=True)
18
+
19
+ rule_mappings: Mapped[list["RuleTechniqueMapping"]] = relationship(
20
+ back_populates="technique"
21
+ )
22
+ query_templates: Mapped[list["QueryTemplate"]] = relationship(
23
+ back_populates="technique"
24
+ )
murshid_backend/app/models/user.py ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ User entity — ER Diagram §3.2.6
3
+ Attributes: User_ID, username, email, password_hash, role
4
+ """
5
+
6
+ import enum
7
+
8
+ from sqlalchemy import Enum, String
9
+ from sqlalchemy.orm import Mapped, mapped_column, relationship
10
+
11
+ from app.db.base import Base
12
+
13
+
14
+ class UserRole(str, enum.Enum):
15
+ admin = "admin"
16
+ analyst = "analyst"
17
+
18
+
19
+ class User(Base):
20
+ __tablename__ = "users"
21
+
22
+ user_id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
23
+ username: Mapped[str] = mapped_column(String(100), unique=True, nullable=False)
24
+ email: Mapped[str] = mapped_column(String(255), unique=True, nullable=False)
25
+ password_hash: Mapped[str] = mapped_column(String(255), nullable=False)
26
+ role: Mapped[UserRole] = mapped_column(
27
+ Enum(UserRole), nullable=False, default=UserRole.analyst
28
+ )
29
+
30
+ jobs: Mapped[list["MappingJob"]] = relationship(back_populates="user")
murshid_backend/app/repositories/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Repository layer — thin DB access wrappers."""
murshid_backend/app/repositories/job_repo.py ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """CRUD for MappingJob table."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from datetime import datetime, timezone
6
+
7
+ from sqlalchemy.orm import Session
8
+
9
+ from app.models.mapping_job import JobStatus, MappingJob
10
+
11
+
12
+ def create_job(db: Session, *, user_id: int, file_name: str, rules_count: int = 0) -> MappingJob:
13
+ job = MappingJob(
14
+ user_id=user_id,
15
+ file_name=file_name,
16
+ rules_count=rules_count,
17
+ status=JobStatus.pending,
18
+ progress=0,
19
+ timestamp=datetime.now(tz=timezone.utc),
20
+ )
21
+ db.add(job)
22
+ db.flush()
23
+ return job
24
+
25
+
26
+ def update_job_status(
27
+ db: Session,
28
+ job_id: int,
29
+ *,
30
+ status: JobStatus,
31
+ progress: int | None = None,
32
+ ) -> MappingJob | None:
33
+ job = db.get(MappingJob, job_id)
34
+ if job is None:
35
+ return None
36
+ job.status = status
37
+ if progress is not None:
38
+ job.progress = progress
39
+ db.flush()
40
+ return job
41
+
42
+
43
+ def get_job(db: Session, job_id: int) -> MappingJob | None:
44
+ return db.get(MappingJob, job_id)
murshid_backend/app/repositories/rule_repo.py ADDED
@@ -0,0 +1,71 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """CRUD for Rule and RuleTechniqueMapping tables."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+
7
+ import numpy as np
8
+ from sqlalchemy.orm import Session
9
+
10
+ from app.models.rule import Rule
11
+ from app.models.rule_technique_mapping import RuleTechniqueMapping
12
+
13
+
14
+ def upsert_rule(
15
+ db: Session,
16
+ *,
17
+ rule_id: str,
18
+ job_id: int | None = None,
19
+ embedding: np.ndarray | None = None,
20
+ ) -> Rule:
21
+ rule = db.get(Rule, rule_id)
22
+ if rule is None:
23
+ rule = Rule(rule_id=rule_id)
24
+ db.add(rule)
25
+ if job_id is not None:
26
+ rule.job_id = job_id
27
+ if embedding is not None:
28
+ rule.embedding_vector = json.dumps(embedding.tolist())
29
+ db.flush()
30
+ return rule
31
+
32
+
33
+ def save_technique_mappings(
34
+ db: Session,
35
+ *,
36
+ rule_id: str,
37
+ results: list[dict],
38
+ ) -> list[RuleTechniqueMapping]:
39
+ """
40
+ Persist ALL (rule_id, technique_id, confidence_score) rows sorted by confidence.
41
+ Deletes existing mappings first so re-runs are idempotent.
42
+ Saves ALL techniques (not just detected ones) so Figure 4-11 can show Top 5.
43
+ """
44
+ db.query(RuleTechniqueMapping).filter(
45
+ RuleTechniqueMapping.rule_id == rule_id
46
+ ).delete(synchronize_session=False)
47
+
48
+ sorted_results = sorted(results, key=lambda r: r["confidence_percent"], reverse=True)
49
+
50
+ rows = []
51
+ for r in sorted_results:
52
+ row = RuleTechniqueMapping(
53
+ rule_id=rule_id,
54
+ technique_id=r["technique_id"],
55
+ confidence_score=r["confidence_percent"] / 100.0,
56
+ )
57
+ db.add(row)
58
+ rows.append(row)
59
+ db.flush()
60
+ return rows
61
+
62
+
63
+ def get_mappings_for_rule(
64
+ db: Session, rule_id: str
65
+ ) -> list[RuleTechniqueMapping]:
66
+ return (
67
+ db.query(RuleTechniqueMapping)
68
+ .filter(RuleTechniqueMapping.rule_id == rule_id)
69
+ .order_by(RuleTechniqueMapping.confidence_score.desc())
70
+ .all()
71
+ )
murshid_backend/app/repositories/template_repo.py ADDED
@@ -0,0 +1,94 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """CRUD for Technique and QueryTemplate tables."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from sqlalchemy.orm import Session
6
+
7
+ from app.models.query_template import QueryTemplate
8
+ from app.models.technique import Technique
9
+
10
+
11
+ # --------------------------------------------------------------------------
12
+ # Techniques
13
+ # --------------------------------------------------------------------------
14
+
15
+
16
+ def get_or_create_technique(
17
+ db: Session, *, technique_id: str, technique_name: str = "", tactic: str | None = None
18
+ ) -> Technique:
19
+ t = db.get(Technique, technique_id)
20
+ if t is None:
21
+ t = Technique(
22
+ technique_id=technique_id,
23
+ technique_name=technique_name or technique_id,
24
+ tactic=tactic,
25
+ )
26
+ db.add(t)
27
+ db.flush()
28
+ return t
29
+
30
+
31
+ def get_technique(db: Session, technique_id: str) -> Technique | None:
32
+ return db.get(Technique, technique_id)
33
+
34
+
35
+ # --------------------------------------------------------------------------
36
+ # Query templates
37
+ # --------------------------------------------------------------------------
38
+
39
+
40
+ def get_templates_for_technique(
41
+ db: Session, technique_id: str
42
+ ) -> list[QueryTemplate]:
43
+ return (
44
+ db.query(QueryTemplate)
45
+ .filter(
46
+ QueryTemplate.technique_id == technique_id,
47
+ QueryTemplate.is_active.is_(True),
48
+ )
49
+ .all()
50
+ )
51
+
52
+
53
+ def create_template(
54
+ db: Session,
55
+ *,
56
+ technique_id: str,
57
+ purpose: str | None,
58
+ wql_query: str,
59
+ note: str | None,
60
+ ) -> QueryTemplate:
61
+ tpl = QueryTemplate(
62
+ technique_id=technique_id,
63
+ purpose=purpose,
64
+ wql_query=wql_query,
65
+ note=note,
66
+ is_active=True,
67
+ )
68
+ db.add(tpl)
69
+ db.flush()
70
+ return tpl
71
+
72
+
73
+ def update_template(
74
+ db: Session,
75
+ template_id: int,
76
+ *,
77
+ purpose: str | None = None,
78
+ wql_query: str | None = None,
79
+ note: str | None = None,
80
+ is_active: bool | None = None,
81
+ ) -> QueryTemplate | None:
82
+ tpl = db.get(QueryTemplate, template_id)
83
+ if tpl is None:
84
+ return None
85
+ if purpose is not None:
86
+ tpl.purpose = purpose
87
+ if wql_query is not None:
88
+ tpl.wql_query = wql_query
89
+ if note is not None:
90
+ tpl.note = note
91
+ if is_active is not None:
92
+ tpl.is_active = is_active
93
+ db.flush()
94
+ return tpl
murshid_backend/app/schemas/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Pydantic schemas for API request/response validation."""