Add demo mode, fix scoring, improve preferences

Demo mode:
- Ships 420 pre-scored AI/ML papers as JSON for HF Spaces
- Entrypoint auto-bootstraps DB when no config/API key exists
- Dashboard shows Demo Mode banner, pipelines disabled
- Remove CLAUDE.md from public release
Scoring fixes:
- Generate axis field names from config (was hardcoded wrong:
code_weights vs code_and_weights, has_code vs has_code_poc)
- Fix filter 422: accept empty min_score as string, convert manually
- Fix min_score=0 displaying as blank in template
Preferences:
- Increase boost range from [-1, 1.5] to [-2, 3] for stronger signal
- CLAUDE.md +0 -73
- Dockerfile +2 -0
- data/demo-config.yaml +64 -0
- data/demo-data.json +0 -0
- entrypoint.sh +15 -0
- src/config.py +15 -4
- src/demo.py +93 -0
- src/preferences.py +5 -5
- src/web/app.py +15 -3
- src/web/templates/dashboard.html +9 -0
- src/web/templates/papers.html +1 -1
CLAUDE.md
DELETED
@@ -1,73 +0,0 @@
-# Research Intelligence System
-
-## Architecture
-
-- **Web dashboard**: FastAPI + Jinja2 + HTMX on port 8888
-- **Database**: SQLite at `data/researcher.db` (configurable in `config.yaml`)
-- **Config**: YAML-driven via `config.yaml` (generated by setup wizard on first run)
-- **Pipelines**: `src/pipelines/aiml.py` (HF + arXiv), `src/pipelines/security.py` (arXiv cs.CR)
-- **Scoring**: `src/scoring.py` — Claude API batch scoring with configurable axes
-- **Preferences**: `src/preferences.py` — learns from user signals (upvote/downvote/save/dismiss)
-- **Scheduler**: APScheduler runs on configurable cron schedule
-
-## Key Files
-
-| File | Purpose |
-|------|---------|
-| `src/config.py` | YAML config loader, scoring prompt builder, defaults |
-| `src/db.py` | SQLite schema + query helpers |
-| `src/scoring.py` | Unified Claude API scorer |
-| `src/preferences.py` | Preference computation from user signals |
-| `src/pipelines/aiml.py` | AI/ML paper fetching (HF + arXiv) |
-| `src/pipelines/security.py` | Security paper fetching (arXiv cs.CR) |
-| `src/pipelines/github.py` | GitHub trending projects via OSSInsight |
-| `src/pipelines/events.py` | Conferences, releases, RSS news |
-| `src/web/app.py` | FastAPI routes, middleware, report generation |
-| `src/scheduler.py` | APScheduler weekly trigger |
-
-## Config System
-
-`src/config.py` loads `config.yaml` and exposes module-level constants:
-
-- `FIRST_RUN` — True when `config.yaml` doesn't exist (triggers setup wizard)
-- `SCORING_CONFIGS` — Dict of domain scoring configs (axes, weights, prompts)
-- `DB_PATH` — Path to SQLite database
-- `ANTHROPIC_API_KEY` — From `.env` or environment
-
-Scoring prompts are built dynamically from `scoring_axes` and `preferences` in config.
-
-## Working with the Database
-
-```bash
-sqlite3 data/researcher.db
-
-# Top papers
-SELECT title, composite, summary FROM papers
-WHERE domain='aiml' AND composite IS NOT NULL
-ORDER BY composite DESC LIMIT 10;
-
-# Signal counts
-SELECT action, COUNT(*) FROM signals GROUP BY action;
-
-# Preference profile
-SELECT * FROM preferences ORDER BY abs(pref_value) DESC LIMIT 20;
-```
-
-## Docker
-
-```bash
-docker compose up --build
-# Dashboard at http://localhost:9090
-# Setup wizard runs on first visit
-
-# Trigger pipelines
-curl -X POST http://localhost:9090/run/aiml
-curl -X POST http://localhost:9090/run/security
-```
-
-## Allowed Tools
-
-When working with this project in Claude Code:
-- **Bash**: python, sqlite3, curl, docker commands
-- **WebSearch/WebFetch**: arXiv, GitHub, HuggingFace for paper details
-- **Read/Edit**: all project files and data/
Dockerfile
CHANGED
@@ -12,6 +12,8 @@ RUN useradd -m -u 1000 -s /bin/bash appuser
 # Copy source
 COPY src/ src/
 COPY data/seed_papers.json data/seed_papers.json
+COPY data/demo-data.json data/demo-data.json
+COPY data/demo-config.yaml data/demo-config.yaml
 COPY entrypoint.sh .
 RUN chmod +x entrypoint.sh
 
data/demo-config.yaml
ADDED
@@ -0,0 +1,64 @@
+scoring:
+  model: claude-haiku-4-5-20251001
+  rescore_model: claude-sonnet-4-5-20250929
+  rescore_top_n: 15
+  batch_size: 20
+domains:
+  aiml:
+    enabled: true
+    label: AI / ML
+    sources:
+      - huggingface
+      - arxiv
+    arxiv_categories:
+      - cs.CV
+      - cs.CL
+      - cs.LG
+    scoring_axes:
+      - name: Code & Weights
+        weight: 0.3
+        description: Open weights on HF, code on GitHub
+      - name: Novelty
+        weight: 0.35
+        description: Paradigm shifts over incremental
+      - name: Practical Applicability
+        weight: 0.35
+        description: Usable by practitioners soon
+    include_patterns: []
+    exclude_patterns: []
+    preferences:
+      boost_topics: []
+      penalize_topics: []
+  security:
+    enabled: true
+    label: Security
+    sources:
+      - arxiv
+    arxiv_categories:
+      - cs.CR
+    scoring_axes:
+      - name: Has Code/PoC
+        weight: 0.25
+        description: Working tools, repos, artifacts
+      - name: Novel Attack Surface
+        weight: 0.4
+        description: First-of-kind research
+      - name: Real-World Impact
+        weight: 0.35
+        description: Affects production systems
+    include_patterns: []
+    exclude_patterns: []
+    preferences:
+      boost_topics: []
+      penalize_topics: []
+  github:
+    enabled: true
+  events:
+    enabled: true
+schedule:
+  cron: 0 22 * * 0
+database:
+  path: data/researcher.db
+web:
+  host: 0.0.0.0
+  port: 8888
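Each domain's axis weights in this config already sum to 1.0, so a weighted composite stays on the same 1-10 scale as the per-axis scores. A quick sanity check (the scores here are made-up illustrative values, not data from the demo set):

```python
# Axis weights for the aiml domain, as declared in data/demo-config.yaml.
aiml_weights = {"code_and_weights": 0.3, "novelty": 0.35, "practical_applicability": 0.35}

# Hypothetical per-axis scores for one paper, on the 1-10 scale.
scores = {"code_and_weights": 8, "novelty": 6, "practical_applicability": 7}

# Weighted composite: stays within 1-10 because the weights sum to 1.0.
composite = sum(aiml_weights[k] * scores[k] for k in aiml_weights)
```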
data/demo-data.json
ADDED
The diff for this file is too large to render.
entrypoint.sh
CHANGED
@@ -4,6 +4,21 @@ set -e
 PORT="${PORT:-8888}"
 
 echo "=== Research Intelligence ==="
+
+# Bootstrap demo data if no config exists and no API key set
+if [ -n "$SPACE_ID" ]; then
+    CONFIG_PATH="/data/config.yaml"
+else
+    CONFIG_PATH="config.yaml"
+fi
+
+if [ ! -f "$CONFIG_PATH" ] && [ -f "data/demo-data.json" ] && [ -z "$ANTHROPIC_API_KEY" ]; then
+    echo "No config found and no API key set — loading demo data..."
+    python -c "from src.demo import load_demo; load_demo()"
+    export DEMO_MODE=1
+    echo "Demo mode active. Deploy locally with an API key for full functionality."
+fi
+
 echo "Starting web server + scheduler on port ${PORT} ..."
 
 exec python -m uvicorn src.web.app:app --host 0.0.0.0 --port "${PORT}"
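The bootstrap gate in entrypoint.sh combines three checks (no config yet, bundled demo JSON present, no API key) plus a Spaces-aware config path. A Python mirror of the same decision logic, with `should_load_demo` and `pick_config_path` as hypothetical helper names:

```python
def should_load_demo(env: dict, config_exists: bool, demo_json_exists: bool) -> bool:
    # Mirrors the entrypoint: bootstrap only when there is no config yet,
    # the bundled demo JSON is present, and no Anthropic key is configured.
    return (not config_exists) and demo_json_exists and not env.get("ANTHROPIC_API_KEY")

def pick_config_path(env: dict) -> str:
    # On HF Spaces (SPACE_ID set) the config lives on the persistent /data
    # volume; locally it sits next to the app.
    return "/data/config.yaml" if env.get("SPACE_ID") else "config.yaml"
```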
src/config.py
CHANGED
@@ -31,6 +31,7 @@ log = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
 
 IS_HF_SPACE = bool(os.environ.get("SPACE_ID"))
+DEMO_MODE = bool(os.environ.get("DEMO_MODE"))
 
 
 def _spaces_data_dir() -> Path:
@@ -255,7 +256,7 @@ def _build_aiml_prompt(axes: list[dict], boost: list[str], penalize: list[str])
     for i, ax in enumerate(axes, 1):
         name = ax.get("name", f"axis_{i}")
         desc = ax.get("description", "")
-        field = name.lower().replace(" ", "_").replace("&", "and").replace("/", "_")
+        field = name.lower().replace(" ", "_").replace("&", "and").replace("/", "_").replace("-", "_")
         axis_fields.append(field)
         axis_section.append(f"{i}. **{field}** — {name}: {desc}")
 
@@ -297,7 +298,7 @@ def _build_security_prompt(axes: list[dict], boost: list[str], penalize: list[st
     for i, ax in enumerate(axes, 1):
         name = ax.get("name", f"axis_{i}")
         desc = ax.get("description", "")
-        field = name.lower().replace(" ", "_").replace("&", "and").replace("/", "_")
+        field = name.lower().replace(" ", "_").replace("&", "and").replace("/", "_").replace("-", "_")
         axis_fields.append(field)
         axes_section.append(f"{i}. **{field}** (1-10) — {name}: {desc}")
 
@@ -367,9 +368,14 @@ def _build_scoring_configs() -> dict:
         aiml_weights[key] = ax.get("weight", 1.0 / len(aiml_axes_cfg))
     aiml_weights = _normalize_weights(aiml_weights)
 
+    aiml_axis_fields = [
+        ax.get("name", f"axis_{i+1}").lower().replace(" ", "_").replace("&", "and").replace("/", "_").replace("-", "_")
+        for i, ax in enumerate(aiml_axes_cfg)
+    ]
+
     configs["aiml"] = {
         "weights": aiml_weights,
-        "axes": ["code_weights", "novelty", "practical_applicability"],
+        "axes": aiml_axis_fields,
         "axis_labels": [ax.get("name", f"Axis {i+1}") for i, ax in enumerate(aiml_axes_cfg)],
         "prompt": _build_scoring_prompt("aiml", aiml_axes_cfg, aiml_prefs),
     }
@@ -388,9 +394,14 @@ def _build_scoring_configs() -> dict:
         sec_weights[key] = ax.get("weight", 1.0 / len(sec_axes_cfg))
     sec_weights = _normalize_weights(sec_weights)
 
+    sec_axis_fields = [
+        ax.get("name", f"axis_{i+1}").lower().replace(" ", "_").replace("&", "and").replace("/", "_").replace("-", "_")
+        for i, ax in enumerate(sec_axes_cfg)
+    ]
+
     configs["security"] = {
         "weights": sec_weights,
-        "axes": ["has_code", "novel_attack_surface", "real_world_impact"],
+        "axes": sec_axis_fields,
         "axis_labels": [ax.get("name", f"Axis {i+1}") for i, ax in enumerate(sec_axes_cfg)],
         "prompt": _build_scoring_prompt("security", sec_axes_cfg, sec_prefs),
     }
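`_normalize_weights` is called in both hunks but not shown in this diff. Assuming it rescales the axis weights so they sum to 1.0, a minimal sketch would be:

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    # Assumed behavior of src/config.py's _normalize_weights: rescale so
    # the axis weights sum to 1.0; leave an empty or all-zero dict untouched.
    total = sum(weights.values())
    if not total:
        return weights
    return {k: v / total for k, v in weights.items()}
```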
src/demo.py
ADDED
@@ -0,0 +1,93 @@
+"""Demo data loader — bootstraps a pre-scored DB from bundled JSON."""
+
+import json
+import logging
+import shutil
+from pathlib import Path
+
+log = logging.getLogger(__name__)
+
+
+def load_demo():
+    """Load demo data into a fresh database and copy demo config."""
+    from src.config import IS_HF_SPACE, SPACES_DATA_DIR
+
+    json_path = Path("data/demo-data.json")
+    config_src = Path("data/demo-config.yaml")
+
+    if not json_path.exists():
+        log.warning("Demo data not found at %s", json_path)
+        return
+
+    # Determine target paths
+    if IS_HF_SPACE:
+        config_dst = SPACES_DATA_DIR / "config.yaml"
+        db_path = SPACES_DATA_DIR / "researcher.db"
+    else:
+        config_dst = Path("config.yaml")
+        db_path = Path("data/researcher.db")
+
+    # Copy config
+    if config_src.exists() and not config_dst.exists():
+        config_dst.parent.mkdir(parents=True, exist_ok=True)
+        shutil.copy2(config_src, config_dst)
+        log.info("Demo config copied to %s", config_dst)
+
+    # Skip if DB already has data
+    if db_path.exists():
+        log.info("DB already exists at %s — skipping demo load", db_path)
+        return
+
+    # Initialize DB with current schema
+    import os
+    os.environ["DB_PATH"] = str(db_path)
+
+    # Re-import to pick up new path
+    import importlib
+    import src.config
+    importlib.reload(src.config)
+
+    from src.db import init_db, get_conn
+    init_db()
+
+    # Load JSON
+    data = json.loads(json_path.read_text())
+    runs = data.get("runs", [])
+    papers = data.get("papers", [])
+
+    with get_conn() as conn:
+        # Insert runs
+        for r in runs:
+            conn.execute(
+                """INSERT OR IGNORE INTO runs (id, domain, started_at, finished_at,
+                       date_start, date_end, paper_count, status)
+                   VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
+                (r["id"], r["domain"], r["started_at"], r["finished_at"],
+                 r["date_start"], r["date_end"], r["paper_count"], r["status"]),
+            )
+
+        # Insert papers
+        for p in papers:
+            conn.execute(
+                """INSERT INTO papers (run_id, domain, arxiv_id, entry_id, title,
+                       authors, abstract, published, categories, pdf_url, arxiv_url,
+                       comment, source, github_repo, github_stars, hf_upvotes,
+                       hf_models, hf_datasets, hf_spaces, score_axis_1, score_axis_2,
+                       score_axis_3, composite, summary, reasoning, code_url,
+                       s2_tldr, s2_paper_id, topics)
+                   VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
+                (p.get("run_id"), p.get("domain"), p.get("arxiv_id"), p.get("entry_id"),
+                 p.get("title"), p.get("authors"), p.get("abstract"), p.get("published"),
+                 p.get("categories"), p.get("pdf_url"), p.get("arxiv_url"),
+                 p.get("comment"), p.get("source"), p.get("github_repo"),
+                 p.get("github_stars"), p.get("hf_upvotes"), p.get("hf_models"),
+                 p.get("hf_datasets"), p.get("hf_spaces"), p.get("score_axis_1"),
+                 p.get("score_axis_2"), p.get("score_axis_3"), p.get("composite"),
+                 p.get("summary"), p.get("reasoning"), p.get("code_url"),
+                 p.get("s2_tldr"), p.get("s2_paper_id"), p.get("topics")),
+            )
+
+        # Rebuild FTS index
+        conn.execute("INSERT INTO papers_fts(papers_fts) VALUES('rebuild')")
+
+    log.info("Demo data loaded: %d runs, %d papers into %s", len(runs), len(papers), db_path)
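The run inserts use `INSERT OR IGNORE` so re-running the bootstrap can't duplicate rows whose primary key already exists. The semantics in miniature:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, domain TEXT)")
conn.execute("INSERT OR IGNORE INTO runs (id, domain) VALUES (?, ?)", (1, "aiml"))
# Same primary key: the row is silently skipped instead of raising
# sqlite3.IntegrityError, so the loader is safe to run twice.
conn.execute("INSERT OR IGNORE INTO runs (id, domain) VALUES (?, ?)", (1, "security"))
count, domain = conn.execute("SELECT COUNT(*), MIN(domain) FROM runs").fetchone()
```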
src/preferences.py
CHANGED
@@ -1,6 +1,6 @@
 """Preference engine — learns from user signals to personalize paper rankings.
 
-Adds a preference_boost (max +1.5 / min -1.0) on top of stored composite scores.
+Adds a preference_boost (max +3.0 / min -2.0) on top of stored composite scores.
 Never re-scores papers. Papers with composite >= 8 are never penalized.
 """
 
@@ -190,7 +190,7 @@ def compute_paper_boost(paper: dict, preferences: dict[str, float]) -> tuple[flo
     """Compute preference boost for a single paper.
 
     Returns (boost_value, list_of_reasons).
-    Boost is clamped to [-1.0, +1.5].
+    Boost is clamped to [-2.0, +3.0].
     Papers with composite >= 8 are never penalized (boost >= 0).
     """
     if not preferences:
@@ -284,11 +284,11 @@ def compute_paper_boost(paper: dict, preferences: dict[str, float]) -> tuple[flo
     if total_weight > 0:
         boost = boost / total_weight  # Normalize by actual weight used
 
-    # Scale to boost range: preferences are [-1, 1], we want [-1, 1.5]
-    boost = boost * 1.5
+    # Scale to boost range: preferences are [-1, 1], we want [-2, 3]
+    boost = boost * 3.0
 
     # Clamp
-    boost = max(-1.0, min(1.5, boost))
+    boost = max(-2.0, min(3.0, boost))
 
     # Safety net: high-scoring papers never penalized
     composite = paper.get("composite") or 0
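The new range in one function, a condensed sketch of the scale, clamp, and safety-net steps (not the full `compute_paper_boost`):

```python
def scaled_boost(pref: float, composite: float) -> float:
    # pref is the normalized preference in [-1, 1]; scale by 3.0 and
    # clamp to the new [-2.0, +3.0] window.
    boost = max(-2.0, min(3.0, pref * 3.0))
    # Safety net: papers scoring composite >= 8 are never penalized.
    if composite >= 8 and boost < 0:
        boost = 0.0
    return boost
```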
src/web/app.py
CHANGED
@@ -269,6 +269,7 @@ async def dashboard(request: Request):
         "running_pipelines": running,
         "show_seed_banner": show_seed_banner,
         "has_papers": (aiml_count + security_count) > 0,
+        "demo_mode": bool(os.environ.get("DEMO_MODE")),
     })
 
 
@@ -284,7 +285,7 @@ async def papers_list(
     offset: int = 0,
     limit: int = 50,
     search: str | None = None,
-    min_score: float | None = None,
+    min_score: str | None = None,
     has_code: bool = False,
     topic: str | None = None,
     sort: str | None = None,
@@ -292,6 +293,14 @@ async def papers_list(
     if domain not in ("aiml", "security"):
         return RedirectResponse("/")
 
+    # Convert min_score from string (empty string from blank input → None)
+    min_score_val: float | None = None
+    if min_score:
+        try:
+            min_score_val = float(min_score)
+        except ValueError:
+            min_score_val = None
+
     config = SCORING_CONFIGS[domain]
     run = get_latest_run(domain) or {}
 
@@ -307,7 +316,7 @@ async def papers_list(
     papers, total = get_papers_page(
         domain, run_id=run.get("id"),
         offset=offset, limit=limit,
-        min_score=min_score,
+        min_score=min_score_val,
         has_code=has_code if has_code else None,
         search=search,
         topic=topic,
@@ -341,7 +350,7 @@ async def papers_list(
         "offset": offset,
         "limit": limit,
         "search": search,
-        "min_score": min_score,
+        "min_score": min_score_val,
         "has_code": has_code,
         "topic": topic,
         "sort": sort,
@@ -631,6 +640,9 @@ async def trigger_run(domain: str):
     if domain not in ("aiml", "security", "github", "events"):
         return RedirectResponse("/", status_code=303)
 
+    if os.environ.get("DEMO_MODE"):
+        return RedirectResponse("/", status_code=303)
+
     from src.config import is_pipeline_enabled
     if not is_pipeline_enabled(domain):
         return RedirectResponse("/", status_code=303)
src/web/templates/dashboard.html
CHANGED
@@ -1,6 +1,15 @@
 {% extends "base.html" %}
 {% block title %}Dashboard — Research Intelligence{% endblock %}
 {% block content %}
+{% if demo_mode %}
+<div style="background:linear-gradient(135deg, rgba(251,191,36,0.1), rgba(251,146,60,0.06)); border:1px solid rgba(251,191,36,0.3); border-radius:var(--radius-xl); padding:1rem 1.5rem; margin-bottom:1.5rem; display:flex; align-items:center; gap:0.75rem">
+  <span style="font-size:1.2rem">⚠</span>
+  <div>
+    <span style="font-weight:600; font-size:0.9rem">Demo Mode</span>
+    <span style="font-size:0.85rem; color:var(--text-muted)"> — Browsing sample data. Pipelines and scoring are disabled. To run your own instance, deploy locally with Docker Compose and an Anthropic API key.</span>
+  </div>
+</div>
+{% endif %}
 <div class="page-header">
   <h1>Week of {{ week_label }}</h1>
   <div class="subtitle">Research triage overview</div>
src/web/templates/papers.html
CHANGED
@@ -16,7 +16,7 @@
 <input type="search" name="search" value="{{ search or '' }}" placeholder="Search papers...">
 <label>
   Min score
-  <input type="number" name="min_score" value="{{ min_score or '' }}" min="0" max="10" step="0.5">
+  <input type="number" name="min_score" value="{{ min_score if min_score is not none else '' }}" min="0" max="10" step="0.5">
 </label>
 <label>
   <input type="checkbox" name="has_code" value="1" {% if has_code %}checked{% endif %}>
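The template fix works because Jinja's `or` follows Python truthiness, where `0` is falsy, so `min_score or ''` blanks out a legitimate zero filter:

```python
min_score = 0
old_value = min_score or ''                              # falsy zero collapses to ''
new_value = min_score if min_score is not None else ''   # explicit None check keeps 0
```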