oyabun-dev committed on
Commit 55bcd2b · 0 Parent(s)

deploy: 2026-04-02T00:05:48Z

Dockerfile ADDED
@@ -0,0 +1,32 @@
+ FROM python:3.11-slim
+
+ RUN apt-get update && apt-get install -y \
+     ffmpeg \
+     libgl1 \
+     libglib2.0-0 \
+     libsm6 \
+     libxext6 \
+     libxrender-dev \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+
+ COPY requirements.txt .
+ # Install the CPU-only torch build first (~250 MB vs ~2 GB for the default CUDA build)
+ RUN pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
+ RUN pip install --timeout 300 -r requirements.txt
+
+ COPY app/ ./app/
+ COPY scripts/ ./scripts/
+
+ EXPOSE 8000
+
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=300s --retries=3 \
+     CMD curl -f http://localhost:8000/health || exit 1
+
+ # Models download on first cold start and are cached persistently by HF Spaces.
+ # preload_models.py runs before uvicorn so the API is ready when /health passes.
+ CMD ["sh", "-c", "python scripts/preload_models.py && uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 1"]
README.md ADDED
@@ -0,0 +1,220 @@
+ ---
+ title: KAMY Vision AI
+ emoji: 🛡️
+ colorFrom: purple
+ colorTo: blue
+ sdk: docker
+ app_port: 8000
+ pinned: false
+ ---
+
+ # KAMY Vision AI
+
+ Multimodal forensic platform for deepfake detection. Analyzes images, audio, video, and text through a layered pipeline that combines Vision Transformer ensembles with deterministic forensic signals.
+
+ **Production:** [app.kamydev.com](https://app.kamydev.com) · API at [oyabun-dev-kamyvision.hf.space](https://oyabun-dev-kamyvision.hf.space) · Docs at [docs.kamydev.com](https://docs.kamydev.com)
+
+ ---
+
+ ## Stack
+
+ - **Backend:** Python 3.10+, FastAPI, uvicorn, PyTorch, HuggingFace Transformers
+ - **Frontend:** React 18, TypeScript, Vite — deployed on Vercel
+ - **API hosting:** HuggingFace Spaces (Docker)
+ - **Docs:** React + custom CSS — deployed on Vercel
+
+ ---
+
+ ## Models
+
+ ### Image ensemble (3 ViT models, weighted average)
+
+ | Model | Weight | Task |
+ |-------|--------|------|
+ | `Ateeqq/ai-vs-human-image-detector` | 45% | AI-generated vs human photo |
+ | `prithivMLmods/AI-vs-Deepfake-vs-Real` | 35% | 3 classes: AI / Deepfake / Real |
+ | `prithivMLmods/Deep-Fake-Detector-Model` | 20% | Facial deepfakes |
+
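As a minimal sketch of the weighted average described above (helper name is illustrative; the renormalization when a model fails to load mirrors the repository's pipeline code):

```python
# Weights from the ensemble table above.
WEIGHTS = {
    "ai_vs_human": 0.45,
    "ai_vs_deepfake_vs_real": 0.35,
    "deepfake_detector": 0.20,
}

def ensemble_score(scores: dict) -> float:
    """Weighted average of per-model fake probabilities (0 = real, 1 = fake).

    Models that failed to load are simply absent from `scores`; the
    remaining weights are renormalized so the result stays in [0, 1].
    """
    total = sum(WEIGHTS[k] for k in scores)
    if total == 0:
        return 0.5  # neutral score when no model is available
    return sum(prob * WEIGHTS[k] for k, prob in scores.items()) / total
```

With all three models loaded, `ensemble_score({"ai_vs_human": 0.9, "ai_vs_deepfake_vs_real": 0.8, "deepfake_detector": 0.6})` gives 0.805.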
+ ### Forensic layers (no ML)
+
+ - **EXIF** — 19 AI generator signatures detected (Gemini, DALL-E, Firefly, Midjourney, Flux, SynthID, Canva AI, Stable Diffusion...)
+ - **FFT** — frequency-spectrum analysis: GAN over-smoothing and periodic peak detection
+ - **Texture** — local variance per 32×32 patch; flags unnatural uniformity in skin/background
+ - **Color** — colorimetric entropy and HSV distribution; flags artificial saturation patterns
+
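The EXIF layer's signature matching boils down to a substring scan over the metadata's Software/Description/Artist fields. A sketch with an illustrative subset of the signature table (helper name and keyword subset are not the full list from the pipeline):

```python
from typing import Optional

# Illustrative subset of the generator signatures listed above.
AI_SIGNATURES = {
    "midjourney": "Midjourney",
    "dall-e": "DALL-E",
    "firefly": "Adobe Firefly",
    "gemini": "Google Gemini",
    "stable diffusion": "Stable Diffusion",
    "synthid": "Google SynthID",
}

def match_ai_source(metadata_text: str) -> Optional[str]:
    """Return the generator name if an AI signature appears in EXIF text fields."""
    haystack = metadata_text.lower()
    for keyword, source in AI_SIGNATURES.items():
        if keyword in haystack:
            return source
    return None
```

A hit drives the EXIF layer score to 0.97 and switches the fusion to the `EXIF_IA_DETECTE` profile described below; no hit falls through to the camera-metadata heuristics.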
+ ### Fusion profiles
+
+ The engine selects a profile based on the EXIF results, then adjusts the layer weights:
+
+ | Profile | Trigger | EXIF weight |
+ |---------|---------|-------------|
+ | `EXIF_IA_DETECTE` | AI source found in metadata | 60% |
+ | `EXIF_FIABLE` | Real camera identified | 32% |
+ | `EXIF_ABSENT` | No metadata (stripped by social networks) | 0%, FFT + texture boosted |
+ | `STANDARD` | General case | 20% |
+
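The full weight vectors (taken from the image pipeline's `_fuse()` in this commit) and the profile selection order can be sketched as:

```python
# Per-profile layer weights, as defined in app/pipelines/image.py.
PROFILES = {
    "EXIF_IA_DETECTE": {"ensemble": 0.20, "exif": 0.60, "fft": 0.12, "texture": 0.05, "color": 0.03},
    "EXIF_FIABLE":     {"ensemble": 0.45, "exif": 0.32, "fft": 0.12, "texture": 0.07, "color": 0.04},
    "EXIF_ABSENT":     {"ensemble": 0.52, "exif": 0.00, "fft": 0.24, "texture": 0.14, "color": 0.10},
    "STANDARD":        {"ensemble": 0.48, "exif": 0.22, "fft": 0.16, "texture": 0.09, "color": 0.05},
}

def select_profile(ai_source_found: bool, exif_absent: bool, camera_trusted: bool) -> str:
    """Mirrors the pipeline's profile-selection order."""
    if ai_source_found:
        return "EXIF_IA_DETECTE"
    if not exif_absent and camera_trusted:
        return "EXIF_FIABLE"
    if exif_absent:
        return "EXIF_ABSENT"
    return "STANDARD"

def fuse(layer_scores: dict, profile: str) -> float:
    """Weighted sum of layer scores; a missing layer defaults to a neutral 0.5."""
    weights = PROFILES[profile]
    return sum(w * layer_scores.get(layer, 0.5) for layer, w in weights.items())
```

Each weight vector sums to 1.0, so the fused score stays in [0, 1] whenever the layer scores do.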
+ ### Audio (pending)
+
+ `MelodyMachine/Deepfake-audio-detection-V2` (wav2vec2) — pending ONNX conversion.
+
+ ---
+
+ ## API endpoints
+
+ Base URL (local): `http://localhost:8000`
+ Base URL (production): `https://oyabun-dev-kamyvision.hf.space`
+
+ | Method | Endpoint | Status | Description |
+ |--------|----------|--------|-------------|
+ | `GET` | `/health` | Stable | API and model status |
+ | `POST` | `/analyze/image` | Stable | Full image analysis (3 ViT + 4 forensic layers) |
+ | `POST` | `/analyze/image/fast` | Stable | Fast image analysis (2 ViT + EXIF only) |
+ | `POST` | `/analyze/audio` | WIP | Synthetic voice detection |
+ | `POST` | `/analyze/video` | WIP | Frame-by-frame video analysis |
+ | `POST` | `/analyze/text` | WIP | AI-generated text detection |
+
+ ```bash
+ # Health check
+ curl http://localhost:8000/health
+
+ # Full image analysis
+ curl -X POST http://localhost:8000/analyze/image \
+   -F "file=@photo.jpg"
+
+ # Fast image analysis
+ curl -X POST http://localhost:8000/analyze/image/fast \
+   -F "file=@photo.jpg"
+ ```
+
+ ### Response structure
+
+ ```json
+ {
+   "status": "success",
+   "verdict": "DEEPFAKE",
+   "fake_prob": 0.8731,
+   "real_prob": 0.1269,
+   "confidence": "high",
+   "reason": "AI source detected in EXIF metadata (Google Gemini).",
+   "fusion_profile": "EXIF_IA_DETECTE",
+   "ai_source": "Google Gemini",
+   "layer_scores": {
+     "ensemble": 0.82,
+     "exif": 0.97,
+     "fft": 0.61,
+     "texture": 0.55,
+     "color": 0.70
+   },
+   "weights_used": {
+     "ensemble": 0.20,
+     "exif": 0.60,
+     "fft": 0.12,
+     "texture": 0.05,
+     "color": 0.03
+   },
+   "models": [
+     "Ateeqq/ai-vs-human-image-detector",
+     "prithivMLmods/AI-vs-Deepfake-vs-Real",
+     "prithivMLmods/Deep-Fake-Detector-Model"
+   ]
+ }
+ ```
+
+ ---
+
+ ## Getting started
+
+ ### Prerequisites
+
+ | Tool | Version |
+ |------|---------|
+ | Python | 3.10+ |
+ | Node.js | 18+ |
+ | Docker | 24+ (optional) |
+
+ ### Backend
+
+ ```bash
+ git clone https://github.com/oyabun-dev/deepfake_detection
+ cd deepfake_detection
+
+ python -m venv .venv
+ source .venv/bin/activate
+
+ pip install -r requirements.txt
+ uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
+ ```
+
+ Models (~2–4 GB) are downloaded and cached automatically on first startup.
+
+ ### Frontend
+
+ ```bash
+ cd frontend-react
+ npm install
+ npm run dev
+ ```
+
+ ### Docker (recommended)
+
+ ```bash
+ docker compose up --build
+ ```
+
+ - API: `http://localhost:8000`
+ - Frontend: `http://localhost:3000`
+
+ ---
+
+ ## Project structure
+
+ ```
+ deepfake_detection/
+ ├── app/
+ │   ├── main.py — FastAPI application, CORS, routers
+ │   ├── core/
+ │   │   ├── config.py — Constants (formats, thresholds, max size)
+ │   │   └── device.py — Automatic CPU/GPU selection
+ │   ├── routers/
+ │   │   ├── image.py — /analyze/image and /analyze/image/fast
+ │   │   ├── audio.py — /analyze/audio (WIP)
+ │   │   ├── video.py — /analyze/video (WIP)
+ │   │   └── text.py — /analyze/text (WIP)
+ │   └── pipelines/
+ │       └── image.py — Full pipeline: run() and run_fast()
+ ├── frontend-react/ — React + Vite frontend
+ ├── docs/ — React documentation site
+ ├── docker-compose.yml
+ ├── docker-compose.prod.yml
+ ├── Dockerfile
+ └── requirements.txt
+ ```
+
+ ---
+
+ ## Deployment
+
+ ### HuggingFace Spaces (API)
+
+ ```bash
+ pip install huggingface_hub
+ huggingface-cli login
+ git remote add spaces https://huggingface.co/spaces/oyabun-dev/kamyvision
+ git push spaces main
+ ```
+
+ ### Vercel (frontend + docs)
+
+ Both the React frontend (`frontend-react/`) and the documentation (`docs/`) are deployed on Vercel. See the [Deployment docs](https://docs.kamydev.com/deploy) for full configuration.
+
+ ---
+
+ ## Known limitations
+
+ The three ViT models were trained primarily on GAN datasets, so accuracy degrades on recent diffusion-model outputs (Midjourney v6, Stable Diffusion XL, Flux.1). EXIF analysis partially compensates for images that retain their metadata.
+
+ ---
+
+ ## License
+
+ MIT
app/__init__.py ADDED
File without changes
app/core/__init__.py ADDED
File without changes
app/core/config.py ADDED
@@ -0,0 +1,110 @@
+ MAX_FILE_SIZE = 20 * 1024 * 1024  # 20 MB
+ MAX_VIDEO_SIZE = 200 * 1024 * 1024  # 200 MB
+
+ ALLOWED_IMAGE_MIMETYPES = [
+     "image/jpeg", "image/png", "image/webp",
+     "image/heic", "image/heif",
+     "image/jfif", "image/pjpeg", "image/bmp",
+     "image/gif", "image/tiff", "image/avif",
+     "image/x-jfif",
+ ]
+
+ ALLOWED_AUDIO_MIMETYPES = [
+     "audio/wav", "audio/mpeg", "audio/mp3",
+     "audio/ogg", "audio/flac", "audio/x-m4a", "audio/x-wav",
+ ]
+
+ ALLOWED_VIDEO_MIMETYPES = [
+     "video/mp4", "video/quicktime", "video/x-msvideo",
+     "video/webm", "video/mpeg", "video/x-matroska",
+ ]
+
+ # ── Image models ───────────────────────────────────────────────────────────────
+
+ IMAGE_ENSEMBLE = [
+     {
+         "key": "ai_vs_human",
+         "name": "Ateeqq/ai-vs-human-image-detector",
+         "weight": 0.45,
+         "desc": "AI vs Human 120k (ViT)",
+     },
+     {
+         "key": "ai_vs_deepfake_vs_real",
+         "name": "prithivMLmods/AI-vs-Deepfake-vs-Real",
+         "weight": 0.35,
+         "desc": "AI/Deepfake/Real 3-class (ViT)",
+     },
+     {
+         "key": "deepfake_detector",
+         "name": "prithivMLmods/Deep-Fake-Detector-Model",
+         "weight": 0.20,
+         "desc": "Deepfake faces (ViT)",
+     },
+ ]
+
+ IMAGE_FAST_ENSEMBLE = [
+     {
+         "key": "ai_vs_human",
+         "name": "Ateeqq/ai-vs-human-image-detector",
+         "weight": 0.45,
+         "desc": "AI vs Human 120k (ViT)",
+     },
+     {
+         "key": "deepfake_detector",
+         "name": "prithivMLmods/Deep-Fake-Detector-Model",
+         "weight": 0.55,
+         "desc": "Deepfake faces (ViT)",
+     },
+ ]
+
+ # ── Audio model ────────────────────────────────────────────────────────────────
+
+ AUDIO_MODEL = {
+     "key": "deepfake_audio_v2",
+     "name": "MelodyMachine/Deepfake-audio-detection-V2",
+     "desc": "Deepfake Audio Detection V2",
+ }
+
+ # ── Video ensemble (reuses ViT models) ────────────────────────────────────────
+
+ VIDEO_ENSEMBLE = [
+     {
+         "key": "deepfake_detector",
+         "name": "prithivMLmods/Deep-Fake-Detector-Model",
+         "weight": 0.40,
+         "desc": "Deepfake faces (ViT)",
+     },
+     {
+         "key": "ai_vs_deepfake_vs_real",
+         "name": "prithivMLmods/AI-vs-Deepfake-vs-Real",
+         "weight": 0.35,
+         "desc": "AI/Deepfake/Real 3-class",
+     },
+     {
+         "key": "ai_vs_human",
+         "name": "Ateeqq/ai-vs-human-image-detector",
+         "weight": 0.25,
+         "desc": "AI vs Human 120k",
+     },
+ ]
+
+ # ── Text models ────────────────────────────────────────────────────────────────
+
+ TEXT_MODELS = {
+     "ai1": {
+         "name": "fakespot-ai/roberta-base-ai-text-detection-v1",
+         "desc": "RoBERTa AI text detector (Fakespot)",
+     },
+     "ai2": {
+         "name": "Hello-SimpleAI/chatgpt-detector-roberta",
+         "desc": "RoBERTa ChatGPT detector",
+     },
+     "fn1": {
+         "name": "vikram71198/distilroberta-base-finetuned-fake-news-detection",
+         "desc": "DistilRoBERTa fake news detector",
+     },
+     "fn2": {
+         "name": "jy46604790/Fake-News-Bert-Detect",
+         "desc": "BERT fake news detector",
+     },
+ }
app/core/device.py ADDED
@@ -0,0 +1,8 @@
+ import torch
+
+ if torch.backends.mps.is_available():
+     DEVICE = torch.device("mps")
+ elif torch.cuda.is_available():
+     DEVICE = torch.device("cuda")
+ else:
+     DEVICE = torch.device("cpu")
app/main.py ADDED
@@ -0,0 +1,35 @@
+ from fastapi import FastAPI
+ from fastapi.middleware.cors import CORSMiddleware
+
+ from app.routers import image as image_router
+ from app.routers import audio as audio_router
+ from app.routers import video as video_router
+ from app.routers import text as text_router
+
+ app = FastAPI(title="KAMY Vision AI", description="Deepfake detection platform")
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=[
+         "http://localhost:3000",
+         "http://127.0.0.1:3000",
+         "http://localhost:5173",
+         "http://127.0.0.1:5173",
+         "http://localhost:8000",
+         "http://127.0.0.1:8000",
+         "null",
+     ],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ app.include_router(image_router.router)
+ app.include_router(audio_router.router)
+ app.include_router(video_router.router)
+ app.include_router(text_router.router)
+
+
+ @app.get("/health")
+ async def health():
+     return {"status": "ok"}
app/models/__init__.py ADDED
File without changes
app/models/loader.py ADDED
@@ -0,0 +1,42 @@
+ import gc
+ import torch
+ from transformers import AutoImageProcessor, AutoModelForImageClassification
+
+ from app.core.device import DEVICE
+
+ _cache: dict = {}
+
+
+ def load_image_model(cfg: dict):
+     """Lazy-load a model by config key. Returns (processor, model) or None on failure."""
+     key = cfg["key"]
+     if key in _cache:
+         return _cache[key]
+
+     print(f"Loading {cfg['desc']} ({cfg['name']})...")
+     try:
+         proc = AutoImageProcessor.from_pretrained(cfg["name"])
+         model = AutoModelForImageClassification.from_pretrained(cfg["name"]).to(DEVICE)
+         model.eval()
+         _cache[key] = (proc, model)
+         print(f"{key} ready, labels: {model.config.id2label}")
+     except Exception as e:
+         print(f"Failed to load {key}: {e}")
+         _cache[key] = None
+
+     return _cache[key]
+
+
+ def unload_all():
+     global _cache
+     for entry in _cache.values():
+         if entry is not None:
+             proc, model = entry
+             del model
+             del proc
+     _cache = {}
+     gc.collect()
+     if torch.backends.mps.is_available():
+         torch.mps.empty_cache()
+     elif torch.cuda.is_available():
+         torch.cuda.empty_cache()
app/pipelines/__init__.py ADDED
File without changes
app/pipelines/audio.py ADDED
@@ -0,0 +1,84 @@
+ import os
+ import tempfile
+
+ import librosa
+ import numpy as np
+ import torch
+ from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
+
+ from app.core.config import AUDIO_MODEL
+ from app.core.device import DEVICE
+
+ _audio_model = None
+ _audio_proc = None
+
+
+ def _get_audio_model():
+     global _audio_model, _audio_proc
+     if _audio_model is None:
+         name = AUDIO_MODEL["name"]
+         print(f"Loading {AUDIO_MODEL['desc']} ({name})...")
+         _audio_proc = AutoFeatureExtractor.from_pretrained(name)
+         _audio_model = AutoModelForAudioClassification.from_pretrained(name).to(DEVICE)
+         _audio_model.eval()
+         print(f"{AUDIO_MODEL['key']} ready")
+     return _audio_proc, _audio_model
+
+
+ def run(audio_bytes: bytes, sensitivity: int = 50) -> dict:
+     with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tf:
+         tf.write(audio_bytes)
+         tmp = tf.name
+     try:
+         speech, sr = librosa.load(tmp, sr=16000)
+     finally:
+         if os.path.exists(tmp):
+             os.remove(tmp)
+
+     proc, model = _get_audio_model()
+     inputs = proc(speech, sampling_rate=sr, return_tensors="pt").to(DEVICE)
+     with torch.no_grad():
+         probs_tensor = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)[0].cpu().numpy()
+
+     # Resolve fake/real indices dynamically, label order varies by model
+     id2label = {int(k): v.lower() for k, v in model.config.id2label.items()}
+     fake_kw = ["fake", "spoof", "synthetic", "generated", "deepfake", "ai"]
+     real_kw = ["real", "human", "authentic", "genuine", "bonafide", "natural"]
+     fake_idx = [i for i, lbl in id2label.items() if any(w in lbl for w in fake_kw)]
+     real_idx = [i for i, lbl in id2label.items() if any(w in lbl for w in real_kw)]
+     print(f" audio id2label: {id2label} | fake_idx={fake_idx} real_idx={real_idx}")  # noqa: T201
+
+     if fake_idx:
+         fake_prob = float(sum(probs_tensor[i] for i in fake_idx))
+     elif real_idx:
+         fake_prob = float(1.0 - sum(probs_tensor[i] for i in real_idx))
+     else:
+         # Fallback: assume index 0 = fake (common for binary audio models)
+         fake_prob = float(probs_tensor[0])
+
+     shift = (sensitivity - 50.0) / 50.0
+     adjusted = float(np.clip(fake_prob + shift * 0.18, 0.0, 1.0))
+
+     if adjusted > 0.65:
+         verdict, reason = "DEEPFAKE", "Signal vocal présentant des caractéristiques synthétiques détectées."
+     elif adjusted < 0.35:
+         verdict, reason = "AUTHENTIQUE", "Signal vocal naturel, aucun artefact de synthèse détecté."
+     else:
+         verdict, reason = "INDÉTERMINÉ", "Signal vocal ambigu, analyse non concluante."
+
+     confidence = "haute" if adjusted > 0.85 or adjusted < 0.15 else ("moyenne" if adjusted > 0.70 or adjusted < 0.30 else "faible")
+
+     return {
+         "verdict": verdict,
+         "confidence": confidence,
+         "reason": reason,
+         "fake_prob": round(adjusted, 4),
+         "real_prob": round(1.0 - adjusted, 4),
+         "sensitivity_used": sensitivity,
+         "models": {
+             AUDIO_MODEL["key"]: {
+                 "score": round(fake_prob, 4),
+                 "desc": AUDIO_MODEL["desc"],
+             }
+         },
+     }
app/pipelines/fakenews.py ADDED
@@ -0,0 +1,97 @@
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ from deep_translator import GoogleTranslator
+
+ _fn_cache = {"tokenizers": {}, "models": {}}
+
+ device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
+
+
+ def get_or_load_fn_model(model_key):
+     fn_models_map = {
+         "fn1": "vikram71198/distilroberta-base-finetuned-fake-news-detection",
+         "fn2": "jy46604790/Fake-News-Bert-Detect",
+     }
+
+     if model_key not in _fn_cache["models"]:
+         print(f"📥 Loading fake-news model {model_key}...")
+         repo = fn_models_map[model_key]
+         tok = AutoTokenizer.from_pretrained(repo)
+         mod = AutoModelForSequenceClassification.from_pretrained(repo).to(device)
+         mod.eval()
+         _fn_cache["tokenizers"][model_key] = tok
+         _fn_cache["models"][model_key] = mod
+
+     return _fn_cache["tokenizers"][model_key], _fn_cache["models"][model_key]
+
+
+ def apply_local_context_guardrails(text: str, fake_prob: float) -> float:
+     """
+     Artificially lowers the fake-news score when trusted Senegalese or
+     African entities are mentioned. Counters the out-of-distribution bias
+     of models trained on US data.
+     """
+     text_lower = text.lower()
+
+     # Local credibility keywords and institutions
+     credible_keywords = [
+         "aps", "agence de presse sénégalaise", "rts", "radiodiffusion télévision sénégalaise",
+         "le soleil", "seneweb", "dakaractu", "igfm", "tfm", "walfadjri", "sud quotidien"
+     ]
+
+     # Proper nouns often treated as noise by a US-trained model
+     local_entities = [
+         "dakar", "sénégal", "senegal", "macky sall", "ousmane sonko", "diomaye faye",
+         "bassirou diomaye", "pastef", "apr", "assemblée nationale", "ucad"
+     ]
+
+     credible_matches = sum(1 for kw in credible_keywords if kw in text_lower)
+     entity_matches = sum(1 for kw in local_entities if kw in text_lower)
+
+     # Apply a progressive reduction to the fake probability
+     discount = 0.0
+     if credible_matches > 0:
+         discount += 0.25 * credible_matches
+     if entity_matches > 0:
+         discount += 0.15 * entity_matches
+
+     discount = min(discount, 0.45)  # cap the discount at 45%
+
+     adjusted_prob = float(max(0.01, fake_prob - discount))
+     return adjusted_prob
+
+
+ def analyze_fakenews_text(text: str) -> dict:
+     # 1. MULTILINGUAL TRANSLATION (hackathon solution)
+     # Translate the text from any source language (auto-detected) to English,
+     # so the strong English fake-news models also work on French, Wolof (if supported), etc.
+     try:
+         translated_text = GoogleTranslator(source='auto', target='en').translate(text)
+         print("📝 Translating text for fake-news analysis...")
+     except Exception as e:
+         print(f"⚠️ Translation error: {e}. Falling back to the original text.")
+         translated_text = text
+
+     def _predict(model_key, txt):
+         tokenizer, model = get_or_load_fn_model(model_key)
+         inputs = tokenizer(txt, return_tensors="pt", truncation=True, max_length=512).to(device)
+         with torch.inference_mode():
+             outputs = model(**inputs)
+         probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
+         fake_prob = float(probs[0][1])  # index 1 is assumed to be the fake class
+         return fake_prob
+
+     # 2. INFERENCE ON THE TRANSLATED (ENGLISH) TEXT
+     prob1 = _predict("fn1", translated_text)
+     prob2 = _predict("fn2", translated_text)
+     weighted_fake_prob = (prob1 * 0.60) + (prob2 * 0.40)
+
+     # 3. LOCAL CONTEXT GUARDRAILS (SENEGAL)
+     # Applied to the *original* (untranslated) text
+     adjusted_prob = apply_local_context_guardrails(text, weighted_fake_prob)
+
+     return {
+         "verdict": "FAKE NEWS" if adjusted_prob > 0.50 else "INFO VRAIE",
+         "fake_prob": adjusted_prob,
+         "real_prob": 1.0 - adjusted_prob,
+         "is_fake": adjusted_prob > 0.50,
+         "raw_fake_prob": weighted_fake_prob,  # for debugging
+         "was_translated": translated_text != text,
+     }
app/pipelines/image.py ADDED
@@ -0,0 +1,344 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import torch
3
+ from PIL import Image, ImageFilter
4
+
5
+ from app.core.config import IMAGE_ENSEMBLE, IMAGE_FAST_ENSEMBLE
6
+ from app.core.device import DEVICE
7
+ from app.models.loader import load_image_model
8
+
9
+
10
+ # ── Model inference ────────────────────────────────────────────────────────────
11
+
12
+ def _infer_fake_score(proc, model, img: Image.Image) -> float:
13
+ """
14
+ Stable inference: average over 3 passes to reduce variance.
15
+ Dynamically resolves fake/real indices from id2label, no hardcoded assumptions.
16
+ Returns a score 0→1 (1 = synthetic/fake).
17
+ """
18
+ inputs = proc(images=img, return_tensors="pt").to(DEVICE)
19
+ with torch.no_grad():
20
+ logits_list = [model(**inputs).logits for _ in range(3)]
21
+ logits_mean = torch.stack(logits_list).mean(dim=0)
22
+ probs = torch.nn.functional.softmax(logits_mean, dim=-1)[0].cpu().numpy()
23
+
24
+ id2label = {int(k): v.lower() for k, v in model.config.id2label.items()}
25
+ fake_kw = ["fake", "ai", "artificial", "synthetic", "generated", "deepfake"]
26
+ real_kw = ["real", "human", "authentic", "genuine"]
27
+
28
+ fake_indices = [i for i, lbl in id2label.items() if any(w in lbl for w in fake_kw)]
29
+ real_indices = [i for i, lbl in id2label.items() if any(w in lbl for w in real_kw)]
30
+
31
+ if not fake_indices and not real_indices:
32
+ return float(probs[1]) if len(probs) >= 2 else 0.5
33
+
34
+ fake_score = float(np.sum([probs[i] for i in fake_indices])) if fake_indices else 0.0
35
+ real_score = float(np.sum([probs[i] for i in real_indices])) if real_indices else 0.0
36
+ total = fake_score + real_score
37
+ return fake_score / total if total > 1e-9 else 0.5
38
+
39
+
40
+ def _run_ensemble(img: Image.Image, ensemble: list) -> dict:
41
+ """Run all models in the ensemble and return weighted score + per-model details."""
42
+ results = {}
43
+ weighted_sum = 0.0
44
+ total_weight = 0.0
45
+
46
+ for cfg in ensemble:
47
+ loaded = load_image_model(cfg)
48
+ if loaded is None:
49
+ print(f" {cfg['key']} skipped (load failed)")
50
+ continue
51
+ proc, model = loaded
52
+ try:
53
+ score = _infer_fake_score(proc, model, img)
54
+ results[cfg["key"]] = {"score": round(score, 4), "weight": cfg["weight"], "desc": cfg["desc"]}
55
+ weighted_sum += score * cfg["weight"]
56
+ total_weight += cfg["weight"]
57
+ print(f" [{cfg['key']}] fake={score:.4f} × {cfg['weight']}")
58
+ except Exception as e:
59
+ print(f" [{cfg['key']}] error: {e}")
60
+
61
+ ensemble_score = weighted_sum / total_weight if total_weight > 0 else 0.5
62
+ return {"models": results, "ensemble_score": round(ensemble_score, 4)}
63
+
64
+
65
+ # ── Forensic layers ────────────────────────────────────────────────────────────
66
+
67
+ def _analyze_exif(image_bytes: bytes) -> dict:
68
+ result = {"score": 0.50, "exif_absent": False, "has_camera_info": False,
69
+ "suspicious_software": False, "ai_source": None, "details": []}
70
+ try:
71
+ import piexif
72
+ exif_data = piexif.load(image_bytes)
73
+ has_content = any(len(exif_data.get(b, {})) > 0 for b in ["0th", "Exif", "GPS", "1st"])
74
+ if not has_content:
75
+ result["exif_absent"] = True
76
+ result["details"].append("EXIF absent")
77
+ return result
78
+
79
+ zeroth = exif_data.get("0th", {})
80
+ exif_ifd = exif_data.get("Exif", {})
81
+ gps_ifd = exif_data.get("GPS", {})
82
+
83
+ sw = zeroth.get(piexif.ImageIFD.Software, b"").decode("utf-8", errors="ignore").lower()
84
+ desc = zeroth.get(piexif.ImageIFD.ImageDescription, b"").decode("utf-8", errors="ignore").lower()
85
+ artist = zeroth.get(piexif.ImageIFD.Artist, b"").decode("utf-8", errors="ignore").lower()
86
+ combined = sw + " " + desc + " " + artist
87
+
88
+ ai_sources = {
89
+ "stable diffusion": "Stable Diffusion", "midjourney": "Midjourney",
90
+ "dall-e": "DALL-E", "dall·e": "DALL-E", "comfyui": "ComfyUI/SD",
91
+ "automatic1111": "Automatic1111/SD", "generative": "IA Générative",
92
+ "diffusion": "Modèle Diffusion", "novelai": "NovelAI",
93
+ "firefly": "Adobe Firefly", "imagen": "Google Imagen",
94
+ "gemini": "Google Gemini", "flux": "Flux (BFL)",
95
+ "ideogram": "Ideogram", "leonardo": "Leonardo.ai",
96
+ "adobe ai": "Adobe AI", "ai generated": "IA Générique",
97
+ "synthid": "Google SynthID",
98
+ }
99
+ for kw, source in ai_sources.items():
100
+ if kw in combined:
101
+ result["suspicious_software"] = True
102
+ result["ai_source"] = source
103
+ result["score"] = 0.97
104
+ result["details"].append(f"Source IA détectée: {source}")
105
+ return result
106
+
107
+ make = zeroth.get(piexif.ImageIFD.Make, b"")
108
+ cam = zeroth.get(piexif.ImageIFD.Model, b"")
109
+ iso = exif_ifd.get(piexif.ExifIFD.ISOSpeedRatings)
110
+ shut = exif_ifd.get(piexif.ExifIFD.ExposureTime)
111
+ gps = bool(gps_ifd and len(gps_ifd) > 2)
112
+
113
+ if make or cam:
114
+ result["has_camera_info"] = True
115
+ result["details"].append(
116
+ f"Appareil: {make.decode('utf-8', errors='ignore')} {cam.decode('utf-8', errors='ignore')}".strip()
117
+ )
118
+ if gps:
119
+ result["details"].append("GPS présent")
120
+
121
+ if result["has_camera_info"] and gps and iso and shut:
122
+ result["score"] = 0.05
123
+ elif result["has_camera_info"] and (iso or shut):
124
+ result["score"] = 0.12
125
+ elif result["has_camera_info"]:
126
+ result["score"] = 0.28
127
+ else:
128
+ result["score"] = 0.55
129
+
130
+ except Exception as e:
131
+ result["exif_absent"] = True
132
+ result["details"].append(f"Erreur EXIF: {str(e)[:60]}")
133
+ return result
134
+
135
+
136
+ def _analyze_fft(img: Image.Image, fc: float = 0.0) -> dict:
137
+ result = {"score": 0.50, "details": []}
138
+ try:
139
+ gray = np.array(img.convert("L")).astype(np.float32)
140
+ mag = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(gray))))
141
+ h, w = mag.shape
142
+ cy, cx = h // 2, w // 2
143
+ Y, X = np.ogrid[:h, :w]
144
+ dist = np.sqrt((X - cx) ** 2 + (Y - cy) ** 2)
145
+ rl, rm = min(h, w) // 8, min(h, w) // 4
146
+ le = np.mean(mag[dist <= rl])
147
+ he = np.mean(mag[(dist > rl) & (dist <= rm)])
148
+ fr = he / (le + 1e-9)
149
+ tl = 0.18 if fc > 0.45 else 0.25
150
+ th = 0.85 if fc > 0.45 else 0.72
151
+ ss = 0.70 if fr < tl else (0.55 if fr > th else 0.20)
152
+ result["details"].append(f"Ratio freq. {fr:.3f}" + (" → sur-lissage IA" if fr < tl else " ✓"))
153
+
154
+ pr = np.sum((mag * (dist > 5)) > (np.mean(mag) + 5 * np.std(mag))) / (h * w)
155
+ ps = 0.85 if pr > 0.003 else (0.50 if pr > 0.001 else 0.15)
156
+ result["details"].append(f"Pics GAN: {pr:.4f}" + (" ⚠️" if pr > 0.003 else " ✓"))
157
+
158
+ result["score"] = float(0.55 * ss + 0.45 * ps)
159
+ except Exception as e:
160
+ result["details"].append(f"Erreur FFT: {str(e)[:60]}")
161
+ return result
162
+
163
+
164
+ def _analyze_texture(img: Image.Image, fc: float = 0.0) -> dict:
165
+ result = {"score": 0.50, "details": []}
166
+ try:
167
+ arr = np.array(img).astype(np.float32)
168
+ gray = np.array(img.convert("L")).astype(np.float32)
169
+ lap = np.array(img.convert("L").filter(ImageFilter.FIND_EDGES)).astype(np.float32)
170
+ nl = float(np.std(lap))
171
+
172
+ if arr.shape[2] >= 3:
173
+ r, g, b = arr[:, :, 0], arr[:, :, 1], arr[:, :, 2]
174
+ if float(np.mean(np.abs(r - g) < 1)) > 0.98 and float(np.mean(np.abs(g - b) < 1)) > 0.98:
175
+ result["score"] = 0.85
176
+ result["details"].append("Canaux RGB identiques → image IA synthétique")
177
+ return result
178
+
179
+ ts, tm = (5.0, 14.0) if fc > 0.45 else (8.0, 20.0)
180
+ ns = 0.75 if nl > 20.0 else (0.72 if nl < ts else (0.42 if nl < tm else 0.15))
181
+ result["details"].append(f"Bruit: {nl:.1f}")
182
+
183
+ h, w, bl = gray.shape[0], gray.shape[1], 32
184
+ stds = [np.std(gray[y:y + bl, x:x + bl]) for y in range(0, h - bl, bl) for x in range(0, w - bl, bl)]
185
+ u = np.std(stds) / (np.mean(stds) + 1e-9) if stds else 0.5
186
+ ul, uh = (0.20, 0.50) if fc > 0.45 else (0.30, 0.60)
187
+ us = 0.72 if u < ul else (0.38 if u < uh else 0.15)
188
+ result["details"].append(f"Uniformité: {u:.3f}")
189
+
190
+ bg_ratio = float(np.mean(gray > 200))
191
+ border_std = float(np.std(gray[:h // 8, :]))
192
+ if bg_ratio > 0.50 and border_std < 6.0:
193
+ studio_score = 0.88
194
+ elif bg_ratio > 0.50 and border_std < 15.0:
195
+ studio_score = 0.82
196
+ elif bg_ratio > 0.35 and border_std < 25.0:
197
+ studio_score = 0.55
198
+ else:
199
+ studio_score = 0.10
200
+ result["details"].append(f"Fond: {bg_ratio:.0%}")
201
+
202
+ result["score"] = float(0.35 * ns + 0.25 * us + 0.40 * studio_score)
203
+ except Exception as e:
204
+ result["details"].append(f"Erreur texture: {str(e)[:60]}")
205
+ return result
206
+
+
+ def _analyze_color(img: Image.Image) -> dict:
+     result = {"score": 0.50, "details": []}
+     try:
+         arr = np.array(img.convert("RGB")).astype(np.float32)
+         r, g, b = arr[:, :, 0].flatten(), arr[:, :, 1].flatten(), arr[:, :, 2].flatten()
+
+         def channel_entropy(ch):
+             # Probability-normalized histogram so entropy is in bits (max log2(64) = 6),
+             # keeping the 5.2 / 4.8 thresholds below reachable. With density=True the
+             # value is divided by the ~4-intensity bin width and caps near 2.0.
+             hist, _ = np.histogram(ch, bins=64, range=(0, 255))
+             hist = hist / max(hist.sum(), 1)
+             hist = hist[hist > 0]
+             return float(-np.sum(hist * np.log2(hist)))
+
+         er, eg, eb = channel_entropy(r), channel_entropy(g), channel_entropy(b)
+         mean_entropy = (er + eg + eb) / 3.0
+         entropy_std = float(np.std([er, eg, eb]))
+
+         if mean_entropy > 5.2 and entropy_std < 0.15:
+             ent_score = 0.72
+         elif mean_entropy > 4.8 and entropy_std < 0.25:
+             ent_score = 0.45
+         else:
+             ent_score = 0.20
+         result["details"].append(f"Entropie couleur: {mean_entropy:.2f}")
+
+         lum = 0.299 * r + 0.587 * g + 0.114 * b
+         extreme_ratio = float(np.mean((lum < 8) | (lum > 247)))
+         ext_score = 0.65 if extreme_ratio < 0.005 else (0.35 if extreme_ratio < 0.02 else 0.15)
+         result["details"].append(f"Pixels extrêmes: {extreme_ratio:.4f}")
+
+         result["score"] = float(0.60 * ent_score + 0.40 * ext_score)
+     except Exception as e:
+         result["details"].append(f"Erreur palette: {str(e)[:60]}")
+     return result
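The channel-entropy idea above can be sketched on its own. Assuming a probability-normalized 64-bin histogram, the entropy is bounded by log2(64) = 6 bits, which is what makes fixed thresholds like 5.2 meaningful. The arrays here are synthetic:

```python
import numpy as np

def channel_entropy(ch: np.ndarray) -> float:
    # Probability-normalized 64-bin histogram entropy, in bits (max = log2(64) = 6)
    hist, _ = np.histogram(ch, bins=64, range=(0, 255))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(1)
rich = rng.integers(0, 256, 10_000)  # wide palette → entropy near the 6-bit ceiling
flat = np.full(10_000, 200)          # single tone → zero entropy
```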
+ # ── Fusion ─────────────────────────────────────────────────────────────────────
+
+ def _fuse(ensemble_score: float, exif_r: dict, fft_r: dict, tex_r: dict, color_r: dict) -> dict:
+     exif_absent = exif_r.get("exif_absent", False)
+
+     if exif_r.get("suspicious_software"):
+         profile = "EXIF_IA_DETECTE"
+         w = {"ensemble": 0.20, "exif": 0.60, "fft": 0.12, "texture": 0.05, "color": 0.03}
+     elif not exif_absent and exif_r["has_camera_info"] and exif_r["score"] < 0.20:
+         profile = "EXIF_FIABLE"
+         w = {"ensemble": 0.45, "exif": 0.32, "fft": 0.12, "texture": 0.07, "color": 0.04}
+     elif exif_absent:
+         profile = "EXIF_ABSENT"
+         w = {"ensemble": 0.52, "exif": 0.00, "fft": 0.24, "texture": 0.14, "color": 0.10}
+     else:
+         profile = "STANDARD"
+         w = {"ensemble": 0.48, "exif": 0.22, "fft": 0.16, "texture": 0.09, "color": 0.05}
+
+     scores = {
+         "ensemble": ensemble_score,
+         "exif": exif_r["score"],
+         "fft": fft_r["score"],
+         "texture": tex_r["score"],
+         "color": color_r["score"],
+     }
+
+     raw = sum(w[k] * scores[k] for k in w)
+
+     # Anti-false-positive guardrails
+     if ensemble_score < 0.35 and fft_r["score"] < 0.38:
+         raw = min(raw, 0.46)
+     if not exif_absent and exif_r["has_camera_info"] and exif_r["score"] < 0.15:
+         raw = min(raw, 0.82)
+     if exif_r.get("suspicious_software") and raw < 0.85:
+         raw = max(raw, 0.90)
+
+     # High-confidence ensemble override: modern diffusion models evade forensic layers;
+     # when all ML models agree strongly, trust them over FFT/texture/color heuristics.
+     if ensemble_score >= 0.80 and not exif_r.get("has_camera_info"):
+         raw = max(raw, ensemble_score * 0.90)
+     if ensemble_score <= 0.20:
+         raw = min(raw, ensemble_score * 1.10 + 0.05)
+
+     return {
+         "fake_prob": round(raw, 4),
+         "real_prob": round(1.0 - raw, 4),
+         "layer_scores": {k: round(v, 4) for k, v in scores.items()},
+         "weights_used": {k: round(v, 2) for k, v in w.items()},
+         "fusion_profile": profile,
+         "ai_source": exif_r.get("ai_source"),
+     }
+
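The fusion above is a profile-dependent weighted average followed by clamps; the arithmetic can be checked by hand with hypothetical layer scores under the STANDARD profile:

```python
# STANDARD weights copied from _fuse; all scores below are hypothetical
w = {"ensemble": 0.48, "exif": 0.22, "fft": 0.16, "texture": 0.09, "color": 0.05}

scores = {"ensemble": 0.90, "exif": 0.50, "fft": 0.50, "texture": 0.50, "color": 0.50}
raw = sum(w[k] * scores[k] for k in w)  # 0.432 + 0.11 + 0.08 + 0.045 + 0.025 = 0.692

# Guardrail example: weak ensemble + weak FFT caps the fused score at 0.46,
# even when the remaining layers push the weighted sum higher (0.516 here)
low_scores = {"ensemble": 0.30, "exif": 0.90, "fft": 0.30, "texture": 0.90, "color": 0.90}
low = sum(w[k] * low_scores[k] for k in w)
if low_scores["ensemble"] < 0.35 and low_scores["fft"] < 0.38:
    low = min(low, 0.46)
```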
+
+ # ── Verdict ────────────────────────────────────────────────────────────────────
+
+ def _verdict(fake_prob: float, details: dict) -> dict:
+     if fake_prob > 0.65:
+         verdict = "DEEPFAKE"
+         confidence = "haute" if fake_prob > 0.85 else "moyenne"
+         reason = "Artefacts de synthèse détectés."
+     elif fake_prob < 0.35:
+         verdict = "AUTHENTIQUE"
+         confidence = "haute" if fake_prob < 0.15 else "moyenne"
+         reason = "Aucun artefact de synthèse détecté."
+     else:
+         verdict = "INDÉTERMINÉ"
+         confidence = "faible"
+         reason = "Signal ambigu, analyse non concluante."
+
+     if details.get("ai_source"):
+         reason = f"Source IA identifiée dans les métadonnées: {details['ai_source']}."
+
+     return {"verdict": verdict, "confidence": confidence, "reason": reason}
+
+
+ # ── Public API ─────────────────────────────────────────────────────────────────
+
+ def run(img: Image.Image, image_bytes: bytes) -> dict:
+     """Full analysis: 3-model ensemble + forensic layers."""
+     ensemble_result = _run_ensemble(img, IMAGE_ENSEMBLE)
+     exif_r = _analyze_exif(image_bytes)
+     fft_r = _analyze_fft(img)
+     tex_r = _analyze_texture(img)
+     color_r = _analyze_color(img)
+
+     fusion = _fuse(ensemble_result["ensemble_score"], exif_r, fft_r, tex_r, color_r)
+     verdict = _verdict(fusion["fake_prob"], fusion)
+
+     return {**verdict, **fusion, "models": ensemble_result["models"]}
+
+
+ def run_fast(img: Image.Image, image_bytes: bytes) -> dict:
+     """Fast analysis: 2-model ensemble + EXIF only."""
+     ensemble_result = _run_ensemble(img, IMAGE_FAST_ENSEMBLE)
+     exif_r = _analyze_exif(image_bytes)
+     fft_r = {"score": 0.50, "details": []}
+     tex_r = {"score": 0.50, "details": []}
+     color_r = {"score": 0.50, "details": []}
+
+     fusion = _fuse(ensemble_result["ensemble_score"], exif_r, fft_r, tex_r, color_r)
+     verdict = _verdict(fusion["fake_prob"], fusion)
+
+     return {**verdict, **fusion, "models": ensemble_result["models"]}
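The verdict thresholds shared by run and run_fast can be distilled into a pure function for quick sanity checks. A sketch mirroring _verdict, minus the EXIF override:

```python
def verdict(fake_prob: float) -> tuple[str, str]:
    # Thresholds mirror _verdict: >0.65 fake, <0.35 authentic, else undecided
    if fake_prob > 0.65:
        return "DEEPFAKE", "haute" if fake_prob > 0.85 else "moyenne"
    if fake_prob < 0.35:
        return "AUTHENTIQUE", "haute" if fake_prob < 0.15 else "moyenne"
    return "INDÉTERMINÉ", "faible"
```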
app/pipelines/text_ai.py ADDED
@@ -0,0 +1,44 @@
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ _text_cache = {"tokenizers": {}, "models": {}}
+
+ device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
+
+ def get_or_load_ai_model(model_key):
+     text_models_map = {
+         "ai1": "fakespot-ai/roberta-base-ai-text-detection-v1",
+         "ai2": "Hello-SimpleAI/chatgpt-detector-roberta"
+     }
+
+     if model_key not in _text_cache["models"]:
+         print(f"📥 Chargement du modèle texte {model_key}...")
+         repo = text_models_map[model_key]
+         tok = AutoTokenizer.from_pretrained(repo)
+         mod = AutoModelForSequenceClassification.from_pretrained(repo).to(device)
+         mod.eval()
+         _text_cache["tokenizers"][model_key] = tok
+         _text_cache["models"][model_key] = mod
+
+     return _text_cache["tokenizers"][model_key], _text_cache["models"][model_key]
+
+ def analyze_ai_text(text: str) -> dict:
+     def _predict(model_key, txt):
+         tokenizer, model = get_or_load_ai_model(model_key)
+         inputs = tokenizer(txt, return_tensors="pt", truncation=True, max_length=512).to(device)
+         with torch.inference_mode():
+             outputs = model(**inputs)
+             probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
+             fake_prob = float(probs[0][1])
+         return fake_prob
+
+     prob1 = _predict("ai1", text)
+     prob2 = _predict("ai2", text)
+     avg_ai_prob = (prob1 + prob2) / 2.0
+
+     return {
+         "verdict": "TEXTE IA" if avg_ai_prob > 0.50 else "TEXTE HUMAIN",
+         "ai_prob": avg_ai_prob,
+         "human_prob": 1.0 - avg_ai_prob,
+         "is_ai": avg_ai_prob > 0.50
+     }
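analyze_ai_text averages two softmax probabilities and thresholds at 0.5; the decision rule can be reproduced without loading any model. The logits below are hypothetical stand-ins for detector outputs:

```python
import math

def softmax_pair(logits):
    # Numerically stable two-class softmax, matching what
    # torch.nn.functional.softmax computes over the logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

p1 = softmax_pair([-1.2, 2.3])[1]  # hypothetical logits from detector "ai1"
p2 = softmax_pair([0.4, 0.9])[1]   # hypothetical logits from detector "ai2"
avg_ai_prob = (p1 + p2) / 2.0
is_ai = avg_ai_prob > 0.50
```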
app/pipelines/video.py ADDED
@@ -0,0 +1,102 @@
+ import os
+ import tempfile
+
+ import cv2
+ import numpy as np
+ from PIL import Image
+
+ from app.core.config import VIDEO_ENSEMBLE
+ from app.pipelines.image import _run_ensemble
+
+ # Number of frames sampled per video
+ MAX_FRAMES = 16
+
+
+ def _extract_frames(video_path: str, n: int = MAX_FRAMES) -> list[Image.Image]:
+     """Extract n evenly-spaced frames from a video file."""
+     cap = cv2.VideoCapture(video_path)
+     total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+     if total <= 0:
+         cap.release()
+         raise ValueError("Impossible de lire les frames de la vidéo.")
+
+     indices = np.linspace(0, total - 1, min(n, total), dtype=int)
+     frames = []
+     for idx in indices:
+         cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
+         ret, frame = cap.read()
+         if ret:
+             rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+             frames.append(Image.fromarray(rgb))
+     cap.release()
+     return frames
+
+
+ def run(video_bytes: bytes) -> dict:
+     """
+     Analyze a video for deepfake content.
+     Samples MAX_FRAMES frames evenly across the video,
+     runs the image ensemble on each, then aggregates.
+     """
+     with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tf:
+         tf.write(video_bytes)
+         tmp_path = tf.name
+
+     try:
+         frames = _extract_frames(tmp_path)
+     finally:
+         if os.path.exists(tmp_path):
+             os.remove(tmp_path)
+
+     if not frames:
+         raise ValueError("Aucune frame exploitable extraite de la vidéo.")
+
+     # Run ensemble on each frame
+     frame_scores = []
+     per_model_scores: dict[str, list[float]] = {}
+
+     for i, frame in enumerate(frames):
+         result = _run_ensemble(frame, VIDEO_ENSEMBLE)
+         frame_scores.append(result["ensemble_score"])
+         for key, data in result["models"].items():
+             per_model_scores.setdefault(key, []).append(data["score"])
+         print(f"  Frame {i + 1}/{len(frames)} → score={result['ensemble_score']:.4f}")
+
+     scores_arr = np.array(frame_scores)
+     fake_prob = float(np.mean(scores_arr))
+     high_ratio = float(np.mean(scores_arr > 0.65))
+
+     # Boost when most frames agree on deepfake
+     if high_ratio > 0.60:
+         fake_prob = min(fake_prob * 1.10, 1.0)
+
+     fake_prob = round(fake_prob, 4)
+
+     model_summary = {
+         key: round(float(np.mean(v)), 4)
+         for key, v in per_model_scores.items()
+     }
+
+     if fake_prob > 0.65:
+         verdict = "DEEPFAKE"
+         confidence = "haute" if fake_prob > 0.85 else "moyenne"
+         reason = "Artefacts de synthèse détectés sur plusieurs frames."
+     elif fake_prob < 0.35:
+         verdict = "AUTHENTIQUE"
+         confidence = "haute" if fake_prob < 0.15 else "moyenne"
+         reason = "Aucun artefact de synthèse détecté."
+     else:
+         verdict = "INDÉTERMINÉ"
+         confidence = "faible"
+         reason = "Signal ambigu, les frames présentent des résultats mixtes."
+
+     return {
+         "verdict": verdict,
+         "confidence": confidence,
+         "reason": reason,
+         "fake_prob": fake_prob,
+         "real_prob": round(1.0 - fake_prob, 4),
+         "frames_analyzed": len(frames),
+         "suspicious_frames_ratio": round(high_ratio, 4),
+         "models": model_summary,
+     }
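Frame sampling above relies on np.linspace with integer casting; its two edge cases (long video, video shorter than MAX_FRAMES) are easy to verify in isolation:

```python
import numpy as np

MAX_FRAMES = 16

def frame_indices(total: int, n: int = MAX_FRAMES) -> np.ndarray:
    # Evenly spaced frame indices from first to last, clamped to the frame count
    return np.linspace(0, total - 1, min(n, total), dtype=int)

long_clip = frame_indices(300)  # hypothetical 300-frame video
short_clip = frame_indices(5)   # shorter than MAX_FRAMES → every frame is sampled
```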
app/routers/__init__.py ADDED
File without changes
app/routers/audio.py ADDED
@@ -0,0 +1,21 @@
+ from fastapi import APIRouter, File, HTTPException, UploadFile
+
+ from app.core.config import ALLOWED_AUDIO_MIMETYPES, MAX_FILE_SIZE
+ from app.pipelines import audio as audio_pipeline
+
+ router = APIRouter()
+
+
+ @router.post("/analyze/audio")
+ async def analyze_audio(file: UploadFile = File(...)):
+     content_type = getattr(file, "content_type", "")
+     if content_type not in ALLOWED_AUDIO_MIMETYPES:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Format non supporté: {content_type}. Formats acceptés: WAV, MP3, OGG, FLAC, M4A.",
+         )
+     contents = await file.read()
+     if len(contents) > MAX_FILE_SIZE:
+         raise HTTPException(status_code=413, detail="Fichier trop volumineux (max 20 Mo).")
+     result = audio_pipeline.run(contents)
+     return {"status": "success", **result}
app/routers/image.py ADDED
@@ -0,0 +1,47 @@
+ import io
+ from fastapi import APIRouter, File, HTTPException, UploadFile
+ from PIL import Image
+
+ from app.core.config import ALLOWED_IMAGE_MIMETYPES, MAX_FILE_SIZE
+ from app.pipelines import image as image_pipeline
+
+ try:
+     from pillow_heif import register_heif_opener
+     register_heif_opener()
+ except ImportError:
+     pass
+
+ Image.MAX_IMAGE_PIXELS = 100_000_000
+
+ router = APIRouter()
+
+
+ def _validate_and_read(file: UploadFile, contents: bytes) -> Image.Image:
+     content_type = getattr(file, "content_type", "")
+     if content_type not in ALLOWED_IMAGE_MIMETYPES:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Format non supporté: {content_type}. Formats acceptés: JPEG, PNG, WEBP, HEIC.",
+         )
+     if len(contents) > MAX_FILE_SIZE:
+         raise HTTPException(status_code=413, detail="Fichier trop volumineux (max 20 Mo).")
+     try:
+         return Image.open(io.BytesIO(contents)).convert("RGB")
+     except Exception:
+         raise HTTPException(status_code=400, detail="Fichier image corrompu ou illisible.")
+
+
+ @router.post("/analyze/image")
+ async def analyze_image(file: UploadFile = File(...)):
+     contents = await file.read()
+     img = _validate_and_read(file, contents)
+     result = image_pipeline.run(img, contents)
+     return {"status": "success", **result}
+
+
+ @router.post("/analyze/image/fast")
+ async def analyze_image_fast(file: UploadFile = File(...)):
+     contents = await file.read()
+     img = _validate_and_read(file, contents)
+     result = image_pipeline.run_fast(img, contents)
+     return {"status": "success", **result}
app/routers/text.py ADDED
@@ -0,0 +1,62 @@
+ from fastapi import APIRouter, HTTPException
+ from pydantic import BaseModel
+
+ from app.pipelines.text_ai import analyze_ai_text
+ from app.pipelines.fakenews import analyze_fakenews_text
+
+ router = APIRouter()
+
+ class TextRequest(BaseModel):
+     text: str
+
+ @router.post("/analyze/text")
+ async def analyze_text(body: TextRequest):
+     if not body.text or not body.text.strip():
+         raise HTTPException(status_code=400, detail="Le texte ne peut pas être vide.")
+     res = analyze_ai_text(body.text)
+     return {
+         "status": "success",
+         "verdict": res["verdict"],
+         "ai_prob": res["ai_prob"],
+         "human_prob": res["human_prob"]
+     }
+
+ @router.post("/analyze/fakenews")
+ async def analyze_fakenews(body: TextRequest):
+     if not body.text or not body.text.strip():
+         raise HTTPException(status_code=400, detail="Le texte ne peut pas être vide.")
+     res = analyze_fakenews_text(body.text)
+     return {
+         "status": "success",
+         "verdict": res["verdict"],
+         "fake_prob": res["fake_prob"],
+         "real_prob": res["real_prob"]
+     }
+
+ @router.post("/analyze/text/full")
+ async def analyze_text_full(body: TextRequest):
+     if not body.text or not body.text.strip():
+         raise HTTPException(status_code=400, detail="Le texte ne peut pas être vide.")
+     res_ai = analyze_ai_text(body.text)
+     res_fn = analyze_fakenews_text(body.text)
+
+     is_ai = res_ai["is_ai"]
+     is_fake = res_fn["is_fake"]
+
+     if is_ai and is_fake:
+         verdict = "DANGER MAX : Fake news générée par IA"
+     elif is_ai and not is_fake:
+         verdict = "Texte IA mais contenu vérifié"
+     elif not is_ai and is_fake:
+         verdict = "Désinformation humaine"
+     else:
+         verdict = "Texte humain, contenu vérifié"
+
+     return {
+         "status": "success",
+         "verdict": verdict,
+         "ai_prob": res_ai["ai_prob"],
+         "fake_news_prob": res_fn["fake_prob"],
+         "is_ai_generated": is_ai,
+         "is_fake_news": is_fake
+     }
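The four-way verdict matrix of /analyze/text/full reduces to two booleans; extracting it as a pure function makes the mapping trivially testable:

```python
def combined_verdict(is_ai: bool, is_fake: bool) -> str:
    # Same mapping as the /analyze/text/full endpoint
    if is_ai and is_fake:
        return "DANGER MAX : Fake news générée par IA"
    if is_ai:
        return "Texte IA mais contenu vérifié"
    if is_fake:
        return "Désinformation humaine"
    return "Texte humain, contenu vérifié"
```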
app/routers/video.py ADDED
@@ -0,0 +1,24 @@
+ from fastapi import APIRouter, File, HTTPException, UploadFile
+
+ from app.core.config import ALLOWED_VIDEO_MIMETYPES, MAX_VIDEO_SIZE
+ from app.pipelines import video as video_pipeline
+
+ router = APIRouter()
+
+
+ @router.post("/analyze/video")
+ async def analyze_video(file: UploadFile = File(...)):
+     content_type = getattr(file, "content_type", "")
+     if content_type not in ALLOWED_VIDEO_MIMETYPES:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Format non supporté: {content_type}. Formats acceptés: MP4, MOV, AVI, WEBM, MKV.",
+         )
+     contents = await file.read()
+     if len(contents) > MAX_VIDEO_SIZE:
+         raise HTTPException(status_code=413, detail="Fichier trop volumineux (max 200 Mo).")
+     try:
+         result = video_pipeline.run(contents)
+     except ValueError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     return {"status": "success", **result}
docker-compose.yml ADDED
@@ -0,0 +1,39 @@
+ version: '3.9'
+
+ services:
+
+   api:
+     build: .
+     container_name: kamyvision-api
+     ports:
+       - "8000:8000"
+     volumes:
+       - ./app:/app/app                          # code hot reload
+       - model-cache:/root/.cache/huggingface    # HF model cache
+       - ./models:/app/models                    # Assietou ONNX model
+     environment:
+       - PYTHONUNBUFFERED=1
+       - HF_HOME=/root/.cache/huggingface
+     restart: unless-stopped
+     healthcheck:
+       test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
+       interval: 30s
+       timeout: 10s
+       retries: 3
+       start_period: 90s
+
+   frontend:
+     image: nginx:alpine
+     container_name: kamyvision-frontend
+     ports:
+       - "3000:80"
+     volumes:
+       - ./frontend:/usr/share/nginx/html:ro
+       - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
+     depends_on:
+       api:
+         condition: service_healthy
+     restart: unless-stopped
+
+ volumes:
+   model-cache:    # persists HF models across restarts
nginx.conf ADDED
@@ -0,0 +1,30 @@
+ server {
+     listen 80;
+     server_name localhost;
+     root /usr/share/nginx/html;
+     index index.html;
+
+     # Static files
+     location / {
+         try_files $uri $uri/ /index.html;
+     }
+
+     # Proxy to the backend API
+     location /predict {
+         proxy_pass http://api:8000/predict;
+         proxy_set_header Host $host;
+         proxy_set_header X-Real-IP $remote_addr;
+         proxy_read_timeout 120s;
+     }
+
+     location /analyze/ {
+         proxy_pass http://api:8000/analyze/;
+         proxy_set_header Host $host;
+         proxy_set_header X-Real-IP $remote_addr;
+         proxy_read_timeout 120s;
+     }
+
+     location /health {
+         proxy_pass http://api:8000/health;
+     }
+ }
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ fastapi
+ uvicorn[standard]
+ python-multipart
+ torch
+ transformers
+ Pillow
+ pillow-heif
+ librosa
+ soundfile
+ numpy
+ mediapipe
+ opencv-python-headless
+ optimum[onnxruntime]
+ deep-translator
scripts/preload_models.py ADDED
@@ -0,0 +1,62 @@
+ """
2
+ Preload all models into HuggingFace Hub cache at Docker build time.
3
+ This avoids cold-start downloads on the first request in production.
4
+ """
5
+ from transformers import (
6
+ AutoFeatureExtractor,
7
+ AutoModelForAudioClassification,
8
+ AutoModelForSequenceClassification,
9
+ AutoModelForImageClassification,
10
+ AutoTokenizer,
11
+ )
12
+
13
+ import sys
14
+
15
+ MODEL_GROUPS = {
16
+ "Audio": [
17
+ ("AutoFeatureExtractor", "MelodyMachine/Deepfake-audio-detection-V2"),
18
+ ("AutoModelForAudioClassification", "MelodyMachine/Deepfake-audio-detection-V2"),
19
+ ],
20
+ "Text": [
21
+ ("AutoTokenizer", "fakespot-ai/roberta-base-ai-text-detection-v1"),
22
+ ("AutoModelForSequenceClassification", "fakespot-ai/roberta-base-ai-text-detection-v1"),
23
+ ("AutoTokenizer", "Hello-SimpleAI/chatgpt-detector-roberta"),
24
+ ("AutoModelForSequenceClassification", "Hello-SimpleAI/chatgpt-detector-roberta"),
25
+ ("AutoTokenizer", "vikram71198/distilroberta-base-finetuned-fake-news-detection"),
26
+ ("AutoModelForSequenceClassification", "vikram71198/distilroberta-base-finetuned-fake-news-detection"),
27
+ ("AutoTokenizer", "jy46604790/Fake-News-Bert-Detect"),
28
+ ("AutoModelForSequenceClassification", "jy46604790/Fake-News-Bert-Detect"),
29
+ ],
30
+ "Image": [
31
+ ("AutoModelForImageClassification", "Ateeqq/ai-vs-human-image-detector"),
32
+ ("AutoModelForImageClassification", "prithivMLmods/AI-vs-Deepfake-vs-Real"),
33
+ ("AutoModelForImageClassification", "prithivMLmods/Deep-Fake-Detector-Model"),
34
+ ],
35
+ }
36
+
37
+ LOADERS = {
38
+ "AutoFeatureExtractor": AutoFeatureExtractor,
39
+ "AutoModelForAudioClassification": AutoModelForAudioClassification,
40
+ "AutoModelForSequenceClassification": AutoModelForSequenceClassification,
41
+ "AutoModelForImageClassification": AutoModelForImageClassification,
42
+ "AutoTokenizer": AutoTokenizer,
43
+ }
44
+
45
+ errors = []
46
+ for group, models in MODEL_GROUPS.items():
47
+ print(f"\n── {group} ──")
48
+ for loader_name, model_name in models:
49
+ try:
50
+ print(f" Downloading {model_name} ({loader_name})...", end=" ", flush=True)
51
+ LOADERS[loader_name].from_pretrained(model_name)
52
+ print("OK")
53
+ except Exception as e:
54
+ print(f"FAILED: {e}")
55
+ errors.append((model_name, str(e)))
56
+
57
+ if errors:
58
+ print(f"\n⚠️ {len(errors)} model(s) failed to preload (will download on first request):")
59
+ for name, err in errors:
60
+ print(f" - {name}: {err}")
61
+ else:
62
+ print("\nAll models preloaded successfully.")
scripts/train_ai_detector.py ADDED
@@ -0,0 +1,90 @@
+ import torch
+ import pandas as pd
+ from datasets import Dataset, load_dataset
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForSequenceClassification,
+     TrainingArguments,
+     Trainer,
+     DataCollatorWithPadding
+ )
+
+ # 1. Colab / GPU configuration
+ MODEL_NAME = "xlm-roberta-base"  # multilingual, so French is supported
+ OUTPUT_DIR = "./ai_results"
+ SAVE_DIR = "./ai_detector_model"
+
+ def train():
+     # 2. Load the data
+     # On Colab you can load straight from Hugging Face;
+     # here: HC3 (Human-ChatGPT Comparison Corpus), French subset
+     print("📥 Chargement du dataset HC3 (Français)...")
+     dataset = load_dataset("Hello-SimpleAI/HC3", "french", split="train")
+
+     # Flatten HC3 into a binary-classification set: each record carries
+     # 'human_answers' and 'chatgpt_answers'; emit flat (text, label) pairs
+     texts = []
+     labels = []
+     for item in dataset:
+         for ans in item['human_answers']:
+             texts.append(ans)
+             labels.append(0)  # human
+         for ans in item['chatgpt_answers']:
+             texts.append(ans)
+             labels.append(1)  # AI
+
+     df = pd.DataFrame({"text": texts, "label": labels})
+
+     # Subsample for the hackathon demo (fast)
+     MAX_SAMPLES = 5000
+     if len(df) > MAX_SAMPLES:
+         df = df.sample(MAX_SAMPLES, random_state=42).reset_index(drop=True)
+
+     print(f"✅ Dataset prêt : {len(df)} exemples.")
+     hf_dataset = Dataset.from_pandas(df)
+
+     # 3. Tokenization
+     tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+     def preprocess_function(examples):
+         return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)
+
+     tokenized_dataset = hf_dataset.map(preprocess_function, batched=True)
+
+     # 4. Load the model
+     model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
+
+     # 5. Training arguments tuned for Colab (GPU)
+     training_args = TrainingArguments(
+         output_dir=OUTPUT_DIR,
+         learning_rate=2e-5,
+         per_device_train_batch_size=8,
+         num_train_epochs=1,
+         weight_decay=0.01,
+         save_strategy="epoch",
+         fp16=torch.cuda.is_available(),  # mixed precision when a GPU is present
+         push_to_hub=False,
+     )
+
+     # 6. Trainer
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=tokenized_dataset,
+         tokenizer=tokenizer,
+         data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
+     )
+
+     # 7. Run
+     print("🚀 Début de l'entraînement sur Colab...")
+     trainer.train()
+
+     # 8. Save
+     model.save_pretrained(SAVE_DIR)
+     tokenizer.save_pretrained(SAVE_DIR)
+     print(f"✅ Modèle sauvegardé dans {SAVE_DIR}")
+
+ if __name__ == "__main__":
+     train()
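The HC3 flattening loop in train() turns one record with answer lists into flat (text, label) pairs; the shape of that transform can be checked on a hypothetical record:

```python
# One hypothetical record shaped like an HC3 row
records = [{"human_answers": ["réponse A", "réponse B"],
            "chatgpt_answers": ["réponse générée"]}]

texts, labels = [], []
for item in records:
    for ans in item["human_answers"]:
        texts.append(ans)
        labels.append(0)  # 0 = human
    for ans in item["chatgpt_answers"]:
        texts.append(ans)
        labels.append(1)  # 1 = AI
```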
scripts/train_fake_news.py ADDED
@@ -0,0 +1,88 @@
+ import os
+ import torch
+ import pandas as pd
+ from datasets import Dataset
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForSequenceClassification,
+     TrainingArguments,
+     Trainer,
+     DataCollatorWithPadding
+ )
+
+ # 1. Configuration
+ # For French, use a multilingual model (XLM-RoBERTa) or a dedicated one (CamemBERT)
+ MODEL_NAME = "xlm-roberta-base"
+ OUTPUT_DIR = "./results"
+ SAVE_DIR = "./fakenews_model_fr"
+
+ def train():
+     # 2. Load the data (example with a placeholder CSV)
+     # You need a CSV with at least 'text' and 'label' columns (0=Real, 1=Fake)
+     if not os.path.exists("data.csv"):
+         print("Erreur: data.csv non trouvé. Création d'un exemple factice...")
+         data = {
+             "text": [
+                 "Le président a inauguré une nouvelle école ce matin à Paris.",
+                 "Des extraterrestres ont envahi la tour Eiffel hier soir !"
+             ],
+             "label": [0, 1]
+         }
+         pd.DataFrame(data).to_csv("data.csv", index=False)
+
+     df = pd.read_csv("data.csv")
+
+     # Data cleaning: drop NaNs and make sure 'text' is a string
+     df = df.dropna(subset=['text', 'label'])
+     df['text'] = df['text'].astype(str)
+     df = df[df['text'].str.len() > 10]
+
+     # HACKATHON OPTIMIZATION 🚀: random subsample to speed things up
+     MAX_SAMPLES = 2000
+     if len(df) > MAX_SAMPLES:
+         print(f"Extraction d'un échantillon aléatoire de {MAX_SAMPLES} sur {len(df)} lignes...")
+         df = df.sample(MAX_SAMPLES, random_state=42).reset_index(drop=True)
+
+     dataset = Dataset.from_pandas(df.reset_index(drop=True))
+
+     # 3. Tokenization
+     tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+     def preprocess_function(examples):
+         return tokenizer(examples["text"], truncation=True, padding=True)
+
+     tokenized_dataset = dataset.map(preprocess_function, batched=True)
+
+     # 4. Load the model
+     model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
+
+     # 5. Training arguments
+     training_args = TrainingArguments(
+         output_dir=OUTPUT_DIR,
+         learning_rate=2e-5,
+         per_device_train_batch_size=8,  # reduced to save RAM
+         num_train_epochs=1,             # a single pass for the hackathon
+         weight_decay=0.01,
+         save_strategy="epoch",
+     )
+
+     # 6. Trainer
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=tokenized_dataset,
+         tokenizer=tokenizer,
+         data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
+     )
+
+     # 7. Start training
+     print("Début de l'entraînement...")
+     trainer.train()
+
+     # 8. Save
+     model.save_pretrained(SAVE_DIR)
+     tokenizer.save_pretrained(SAVE_DIR)
+     print(f"Modèle sauvegardé dans {SAVE_DIR}")
+
+ if __name__ == "__main__":
+     train()
+ train()