user9200 committed
Commit 6de83f6 · verified · 1 Parent(s): 9f33b5c

Upload 4 files

Files changed (4):
  1. Dockerfile +15 -0
  2. README.md +191 -0
  3. main.py +240 -0
  4. requirements.txt +7 -0
Dockerfile ADDED
@@ -0,0 +1,15 @@
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

RUN python -c "from gliner import GLiNER; GLiNER.from_pretrained('urchade/gliner_small-v2.1')"

COPY . .

ENV PORT=8000

CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port ${PORT} --workers 1"]
README.md ADDED
@@ -0,0 +1,191 @@
# Celarium

**Context-Aware Privacy Middleware for AI Agents & LLMs**

Celarium acts as a smart firewall between your users and Large Language Models. It intercepts sensitive data, replaces it with context-aware, consistent fake entities, and restores the original values after the LLM responds.

---

## Why Celarium?

Unlike simple regex tools that redact data (`[REDACTED]`), Celarium maintains semantic consistency.

- **Context-Aware AI**: Uses GLiNER (Generalist Lightweight NER) to detect entities based on context, not just patterns.
- **Handles**: MRN, SSN, Insurance Policy, Group IDs, Hospital Names (regex + AI detection)
- **Data Consistency**: If "John Doe" becomes "Robert Smith", his email becomes a matching fake address (e.g. robertsmith123@example.com) automatically.
- **Batch Processing**: Natively handles JSON lists and complex objects without hitting token limits.
- **Smart Restoration**: The LLM "thinks" it's talking to Robert Smith. When the response comes back, Celarium swaps it back to John Doe.
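As a sketch of the consistency rule, the matching email can be derived from the fake name the same way `generate_matching_email` in `main.py` does it (the numeric suffix is random in the service; the function below is a simplified stand-in, not the exact implementation):

```python
import random

def matching_email(fake_name: str) -> str:
    # Derive a fake email from the fake name, mirroring main.py's
    # generate_matching_email: lowercase name parts plus random digits.
    parts = fake_name.lower().split()
    base = f"{parts[0]}{parts[1]}" if len(parts) >= 2 else parts[0]
    return f"{base}{random.randint(100, 9999)}@example.com"

email = matching_email("Robert Smith")
print(email)  # e.g. robertsmith4821@example.com
```

Because the email is generated from the fake name rather than independently, the anonymized record stays internally consistent.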
---

## 🛠 How It Works

```mermaid
graph LR
    A[User Input] -->|Contains PII| B(Celarium)
    B -->|Anonymized Data| C[LLM / Agent]
    C -->|Response with Fakes| B
    B -->|Restored Data| D[User Output]
```

1. **Intercept**: Send raw data (Text or JSON) to Celarium.
2. **Anonymize**: Celarium uses a Hybrid Engine (GLiNER AI + Strict Regex) to generate realistic fakes.
3. **Process**: Send the clean data to OpenAI/Claude/Gemini.
4. **Restore**: Send the LLM's response back to Celarium to swap the names back.
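Step 4 boils down to a reverse substitution over the session's fake-to-real mapping. A minimal illustration of that mechanic (the names are this README's running example; the real service keys the mapping by session ID):

```python
def restore(text: str, mapping: dict) -> str:
    # mapping: fake value -> original value, stored per session
    for fake_val, real_val in mapping.items():
        text = text.replace(fake_val, real_val)
    return text

mapping = {"Robert Smith": "John Doe"}
llm_response = "Summary: Robert Smith should schedule a follow-up."
print(restore(llm_response, mapping))
# Summary: John Doe should schedule a follow-up.
```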
---

## Quick Start

### Option A: Hosted API

The easiest way to use Celarium is via the hosted API. No installation required. Substitute the hosted base URL for `<BASE_URL>` in the examples below.

Base URL:

**1. Anonymize Data**

Send text or a list of JSON objects. The system auto-detects PII (Names, Emails, Phones, Medical IDs).

```bash
curl -X POST <BASE_URL>/v1/anonymize \
  -H "X-API-Key: sk_test_celarium_founder_001" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Patient John Doe (DOB 1985-07-14) admitted to Springfield General. MRN-998877."
  }'
```

**Response:**

```json
{
  "anonymized_text": "Patient Michael Smith (DOB 1962-03-12) admitted to Oak Ridge Medical Center. MRN-112233.",
  "session_id": "abc123uuid...",
  "entities_found": 4
}
```

**2. Process with LLM**

Send the `anonymized_text` to OpenAI, Claude, or your local model. The LLM sees "Michael Smith" and processes it safely.

**3. Restore Data**

Send the LLM's response back to Celarium to swap the names back.

```bash
curl -X POST <BASE_URL>/v1/restore \
  -H "X-API-Key: sk_test_celarium_founder_001" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "abc123uuid...",
    "text": "Summary: Michael Smith was treated at Oak Ridge..."
  }'
```
### Option B: Local Python

Requires Python 3.10+

```bash
pip install -r requirements.txt
python main.py
```

The API runs on [http://localhost:8000](http://localhost:8000)
---

## Usage Examples

### 1. Medical / Clinical Data (Unstructured)

Celarium detects specialized medical fields and formatting.

```bash
curl -X POST http://localhost:8000/v1/anonymize \
  -H "X-API-Key: sk_test_celarium_founder_001" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Patient John Doe, DOB 1985-07-14, SSN 123-45-6789, MRN MRN-998877, admitted to Springfield General Hospital with Dr. House."
  }'
```

**Response Example:**

```json
{
  "anonymized_text": "Patient Michael Stevens, DOB 1962-03-12, SSN 542-11-9021, MRN MRN-112233, admitted to Oak Ridge Medical Center with Dr. Wilson.",
  "session_id": "abc123uuid",
  "entities_found": 6
}
```

---

### 2. Batch Processing (JSON Lists)

Send entire database records. Celarium automatically handles list iteration and context preservation.

```bash
curl -X POST http://localhost:8000/v1/anonymize \
  -H "X-API-Key: sk_test_celarium_founder_001" \
  -H "Content-Type: application/json" \
  -d '{
    "text": [
      { "name": "Carlos Rivera", "email": "carlos@outlook.com", "policy": "POL-12345" },
      { "name": "Sarah Jones", "email": "sarah.j@gmail.com", "policy": "POL-98765" }
    ]
  }'
```

**Response Example:**

```json
{
  "anonymized_text": "[\n  { \"name\": \"David Kim\", \"email\": \"davidkim99@example.com\", \"policy\": \"POL-554433\" },\n  { \"name\": \"Emily White\", \"email\": \"emilywhite22@example.com\", \"policy\": \"POL-112211\" }\n]",
  "session_id": "xyz789uuid",
  "entities_found": 6
}
```
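The batch strategy can be sketched as: serialize each record, scrub it on its own (keeping each pass small), then parse it back. Here `scrub` is a stand-in stub for the real GLiNER + regex pass, and the replacement value is illustrative:

```python
import json

def scrub(s: str) -> str:
    # Stand-in for the real anonymization pass
    return s.replace("Carlos Rivera", "David Kim")

records = [{"name": "Carlos Rivera", "policy": "POL-12345"}]

# Each record is serialized, scrubbed individually, then parsed back
# into a dict, matching the per-item loop in main.py's /v1/anonymize.
anonymized = [json.loads(scrub(json.dumps(r))) for r in records]
print(anonymized)  # [{'name': 'David Kim', 'policy': 'POL-12345'}]
```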
---

### 3. Restore Data

After your LLM generates a response using the fake names, swap them back.

```bash
curl -X POST http://localhost:8000/v1/restore \
  -H "X-API-Key: sk_test_celarium_founder_001" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "abc123uuid",
    "text": "We have updated the records for Michael Stevens regarding MRN-112233."
  }'
```

**Response Example:**

```json
{
  "restored_text": "We have updated the records for John Doe regarding MRN-998877."
}
```
---

## 🚀 Deployment

Designed for Railway, Heroku, or AWS:

1. Push to GitHub.
2. Connect the repo to Railway/Heroku.
3. Deploy.

The included Dockerfile downloads the AI model during the build phase, and the image runs a single uvicorn worker, which suits CPU-only hosts.
---

## 🛡 Security & Compliance

- **Ephemeral Storage**: Mappings are stored in-memory. If the server restarts, the data is gone.
- **No PII Logging**: We do not log the input text or the mappings to disk.
- **Strict Regex Fallback**: If the AI misses a pattern, our strict regex engine catches SSNs, phone numbers, and emails as a failsafe.
- **Address Protection**: Entire address blocks (Street + City + State) are replaced to prevent location leakage.
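The regex failsafe can be checked directly with two of the patterns from `main.py` (the sample text below is made up):

```python
import re

# Two of the strict failsafe patterns from main.py
REGEX_PATTERNS = {
    "SSN": r'\b\d{3}-\d{2}-\d{4}\b',
    "EMAIL_ADDRESS": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
}

text = "Reach Jane at jane@example.com, SSN 123-45-6789."
found = {label: re.findall(pat, text) for label, pat in REGEX_PATTERNS.items()}
print(found)  # {'SSN': ['123-45-6789'], 'EMAIL_ADDRESS': ['jane@example.com']}
```

These patterns fire regardless of what the AI model detects, so formatted identifiers are caught even when context is ambiguous.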
---

## Status

This is a **proof-of-concept**. We're actively looking for:

- Early adopters to validate the approach
- Feedback on detection accuracy
- Use cases beyond healthcare
main.py ADDED
@@ -0,0 +1,240 @@
import uvicorn
import os
import uuid
import re
import random
import json
from datetime import datetime
from typing import Union, List, Dict, Any
from fastapi import FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from faker import Faker
from gliner import GLiNER

app = FastAPI(title="Celarium AI")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)
VALID_API_KEYS = {"sk_test_celarium_founder_001", "sk_test_celarium_beta_001"}
SESSIONS = {}
fake = Faker()

# Load Model
print("Loading GLiNER...")
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
print("Loaded.")

# Regex & Labels
REGEX_PATTERNS = {
    "EMAIL_ADDRESS": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
    "PHONE_NUMBER": r'(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',
    "MRN": r'\bMRN[-_]\w+\b',
    "SSN": r'\b\d{3}-\d{2}-\d{4}\b',
    "INSURANCE_GROUP": r'\bG\d{5,}\b',
    "INSURANCE_POLICY": r'\b(POL|POLICY)[-_]?\d+\b',
    "FULL_ADDRESS": r'\d+\s+[A-Za-z0-9\s\.]+,\s+[A-Za-z\s\.]+,\s+[A-Z]{2}\s+\d{5}(?:-\d{4})?'
}
AI_LABELS = ["person", "physical address", "organization", "date of birth"]

# Generators
def generate_clean_name():
    return f"{fake.first_name()} {fake.last_name()}"


def generate_matching_email(fake_name: str):
    if not fake_name:
        return f"user{random.randint(1000, 9999)}@example.com"
    parts = fake_name.lower().split()
    base = f"{parts[0]}{parts[1]}" if len(parts) >= 2 else parts[0]
    return f"{base}{random.randint(100, 9999)}@example.com"


# --- UPDATED GENERATORS ---

def generate_clean_phone():
    """Matches the requested format: +1-XXX-XXX-XXXX"""
    return f"+1-{random.randint(200, 999)}-{random.randint(200, 999)}-{random.randint(1000, 9999)}"


def generate_medical_org():
    """Generates realistic healthcare/clinical names"""
    suffixes = [
        "Medical Center", "Regional Health", "General Hospital",
        "Health Group", "Family Clinic", "Community Care",
        "Medical Associates", "Health System", "Diagnostics Lab"
    ]
    # 50% chance of a city-based name (e.g. "Austin Regional Health"),
    # 50% chance of a surname-based name (e.g. "Rivera Medical Group")
    prefix = fake.city() if random.random() > 0.5 else fake.last_name()
    return f"{prefix} {random.choice(suffixes)}"


def get_fake_value(label: str, context: dict) -> str:
    label = label.upper()

    if "PERSON" in label:
        val = generate_clean_name()
        context["last_person"] = val
        return val

    if "EMAIL" in label:
        return generate_matching_email(context.get("last_person", ""))

    if "PHONE" in label:
        return generate_clean_phone()  # uses the +1-XXX-XXX-XXXX format

    if "ADDRESS" in label or "LOCATION" in label:
        # Replace the full address block to avoid partial location leaks
        return f"{fake.street_address()}, {fake.city()}, {fake.state_abbr()} {fake.zipcode()}"

    if "MRN" in label:
        return f"MRN-{fake.random_number(digits=8, fix_len=True)}"
    if "SSN" in label:
        return fake.ssn()
    if "DATE" in label:
        return str(fake.date_of_birth(minimum_age=18, maximum_age=90))
    if "POLICY" in label:
        return f"POL-{fake.random_number(digits=9, fix_len=True)}"
    if "GROUP" in label:
        return f"G{fake.random_number(digits=5, fix_len=True)}"

    if "ORGANIZATION" in label:
        return generate_medical_org()  # healthcare-flavored org names

    return f"REDACTED_{uuid.uuid4().hex[:6]}"

def analyze_and_replace(text: str) -> tuple[str, dict]:
    """Core logic to anonymize a single string block"""
    findings = []
    # Regex pass
    for label, pattern in REGEX_PATTERNS.items():
        for match in re.finditer(pattern, text):
            findings.append({"start": match.start(), "end": match.end(), "label": label, "score": 1.0})
    # AI pass
    try:
        ai_preds = model.predict_entities(text, AI_LABELS, threshold=0.35)
        for p in ai_preds:
            findings.append({"start": p["start"], "end": p["end"], "label": p["label"], "score": p["score"]})
    except Exception:
        pass

    # Merge overlapping findings, keeping the higher-scoring or longer span
    findings.sort(key=lambda x: x["start"])
    merged = []
    for f in findings:
        if not merged:
            merged.append(f)
            continue
        last = merged[-1]
        if f["start"] < last["end"]:
            if f["score"] > last["score"] or (f["end"] - f["start"]) > (last["end"] - last["start"]):
                merged[-1] = f
        else:
            merged.append(f)

    # Generate fakes
    mapping = {}
    replacements = []
    context = {"last_person": ""}
    used_fakes = set()

    for ent in merged:
        original = text[ent["start"]:ent["end"]]
        # Skip JSON keys
        if original.lower() in ["person_name", "date_of_birth", "ssn", "mrn", "email", "phone", "address"]:
            continue

        fake_val = get_fake_value(ent["label"], context)
        if fake_val in used_fakes:
            fake_val = f"{fake_val}_{random.randint(1, 99)}"
        used_fakes.add(fake_val)

        mapping[fake_val] = original
        replacements.append({"start": ent["start"], "end": ent["end"], "fake": fake_val})

    # Apply replacements right-to-left so earlier offsets stay valid
    replacements.sort(key=lambda x: x["start"], reverse=True)
    text_chars = list(text)
    for r in replacements:
        text_chars[r["start"]:r["end"]] = list(r["fake"])

    return "".join(text_chars), mapping

# --- ENDPOINTS ---

async def get_api_key(api_key: str = Security(api_key_header)):
    if not api_key or api_key not in VALID_API_KEYS:
        raise HTTPException(401, "Invalid API Key")
    return api_key


class AnonymizeRequest(BaseModel):
    text: Union[str, List[Any], Dict[str, Any]]


class RestoreRequest(BaseModel):
    session_id: str
    text: str


@app.post("/v1/anonymize")
async def anonymize(req: AnonymizeRequest, api_key: str = Security(get_api_key)):
    input_data = req.text
    global_mapping = {}
    final_output_str = ""

    # LOGIC: Handle list vs. single string
    if isinstance(input_data, list):
        # Process each item individually to avoid token limits
        anonymized_list = []
        for item in input_data:
            item_str = json.dumps(item)
            anon_str, item_map = analyze_and_replace(item_str)
            anonymized_list.append(json.loads(anon_str))  # convert back to dict
            global_mapping.update(item_map)

        # Return as formatted JSON string
        final_output_str = json.dumps(anonymized_list, indent=2)

    else:
        # Single object or string
        text_to_process = json.dumps(input_data) if isinstance(input_data, dict) else str(input_data)
        final_output_str, global_mapping = analyze_and_replace(text_to_process)

    session_id = str(uuid.uuid4())
    SESSIONS[session_id] = {"mapping": global_mapping, "created": datetime.now(), "api_key": api_key}

    return {
        "anonymized_text": final_output_str,
        "session_id": session_id,
        "entities_found": len(global_mapping)
    }


@app.post("/v1/restore")
async def restore(req: RestoreRequest, api_key: str = Security(get_api_key)):
    session = SESSIONS.get(req.session_id)
    if not session or session["api_key"] != api_key:
        raise HTTPException(404, "Session not found")

    restored = req.text
    for fake_v, real_v in session["mapping"].items():
        restored = restored.replace(fake_v, real_v)

    return {"restored_text": restored}


if __name__ == "__main__":
    port = int(os.getenv("PORT", 8000))
    uvicorn.run(app, host="0.0.0.0", port=port)
requirements.txt ADDED
@@ -0,0 +1,7 @@
fastapi==0.104.1
uvicorn==0.24.0
faker==20.1.0
python-dotenv==1.0.0
pydantic==2.5.0
gliner==0.1.11
torch==2.0.1