Spaces: SEUyishu/MatTableGPT

SEUyishu committed on Dec 4, 2025

Commit 84a8f07 (verified) · 1 Parent(s): 7d21217

Upload 7 files

Files changed (7)

Dockerfile +50 -0
README.md +280 -10
__init__.py +33 -0
app.py +627 -0
mcp_service.py +1413 -0
requirements.txt +24 -0
start_mcp.py +144 -0

Dockerfile ADDED

	@@ -0,0 +1,50 @@

+# MaTableGPT MCP Service Docker Image
+# ====================================
+# For HuggingFace Spaces Deployment
+FROM python:3.10-slim
+# Set working directory
+WORKDIR /app
+# Set environment variables
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+ENV GRADIO_SERVER_NAME=0.0.0.0
+ENV GRADIO_SERVER_PORT=7860
+# Install system dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+# Copy requirements first for better caching
+COPY requirements.txt .
+# Install Python dependencies
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r requirements.txt
+# Download NLTK data for table splitting
+RUN python -c "import nltk; nltk.download('punkt')"
+# Copy application code
+COPY . .
+# Create necessary directories
+RUN mkdir -p /app/sessions /app/temp
+# Set permissions for HuggingFace Spaces
+RUN chmod -R 777 /app/sessions /app/temp
+# Expose ports
+# 7860 for Gradio, 7865 for MCP SSE
+EXPOSE 7860 7865
+# Health check
+HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/', timeout=10)" || exit 1
+# Run the application
+CMD ["python", "app.py"]
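
Note on the health check: a probe built on `requests.get` only fails when the connection itself fails, because `requests` does not raise on HTTP error statuses unless `raise_for_status()` is called. The following is a minimal standard-library sketch of a stricter probe, not code from this commit. The helper name `probe` is hypothetical; port 7860 is the Gradio port set in the Dockerfile above.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx/3xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers refused connections and HTTP 4xx/5xx (HTTPError),
        # OSError covers socket timeouts, ValueError covers malformed URLs.
        return False


def main() -> int:
    """Exit-code style wrapper: 0 when healthy, 1 otherwise."""
    return 0 if probe("http://localhost:7860/") else 1
```

Saved as a script, this could back the `HEALTHCHECK` instruction by ending with `sys.exit(main())`.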

README.md CHANGED

@@ -1,10 +1,280 @@
----
-title: MatTableGPT
-emoji: 🚀
-colorFrom: green
-colorTo: green
-sdk: docker
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+---
+title: MaTableGPT MCP
+emoji: 🔬
+colorFrom: blue
+colorTo: green
+sdk: docker
+pinned: false
+license: mit
+app_port: 7860
+---
+# MaTableGPT MCP Service
+[![HuggingFace Spaces](https://img.shields.io/badge/🤗-HuggingFace%20Spaces-blue)](https://huggingface.co/spaces)
+[![MCP](https://img.shields.io/badge/MCP-Compatible-green)](https://modelcontextprotocol.io/)
+**GPT-based Table Data Extractor from Materials Science Literature**
+A Model Context Protocol (MCP) service that extracts structured catalyst performance data from HTML tables in materials science publications.
+## 🌟 Features
+### Table Representation
+- **HTML to TSV**: Convert HTML tables to tab-separated format with preserved structure
+- **HTML to JSON**: Convert HTML tables to nested JSON format
+- **Table Splitting**: Break down complex tables with multiple headers into simpler components
+### GPT-based Extraction
+- **Zero-shot**: Multi-step questioning approach without examples
+- **Few-shot**: Guided extraction with input/output examples
+- **Fine-tuned**: Use a model that has already been fine-tuned for table extraction
+### Session Management
+- Track multiple table processing workflows
+- Store representations and extractions
+- Export session data for analysis
+## 📦 Installation
+### Prerequisites
+- Python 3.10+ (the Dockerfile uses `python:3.10-slim`)
+- API key and base URL for an OpenAI-compatible service (for GPT extraction)
+### Local Installation
+```bash
+# Clone or copy the mcp_output folder
+cd mcp_output
+# Create virtual environment
+python -m venv venv
+# Activate (Windows)
+venv\Scripts\activate
+# Activate (Unix/Mac)
+source venv/bin/activate
+# Install dependencies
+pip install -r requirements.txt
+# Set API configuration (use your third-party API service info)
+# Windows PowerShell
+$env:LLM_API_KEY = "your_api_key"
+$env:LLM_API_BASE = "https://api.your-service.com/v1"
+$env:LLM_MODEL = "gpt-4-turbo-preview"
+# Windows CMD
+set LLM_API_KEY=your_api_key
+set LLM_API_BASE=https://api.your-service.com/v1
+set LLM_MODEL=gpt-4-turbo-preview
+# Unix/Mac
+export LLM_API_KEY=your_api_key
+export LLM_API_BASE=https://api.your-service.com/v1
+export LLM_MODEL=gpt-4-turbo-preview
+```
+## 🔑 Environment Variables
+This service works with OpenAI-compatible third-party API services (reverse proxies, OneAPI, API aggregators, etc.).
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `LLM_API_KEY` | ✅ Yes | Your API key from the service provider |
+| `LLM_API_BASE` | ✅ Yes | API base URL, e.g., `https://api.your-service.com/v1` |
+| `LLM_MODEL` | ❌ No | Model name (default: gpt-4-turbo-preview) |
+**Alternative variable names (also supported):**
+| Variable | Description |
+|----------|-------------|
+| `OPENAI_API_KEY` | Alternative to LLM_API_KEY |
+| `OPENAI_API_BASE` | Alternative to LLM_API_BASE |
+| `OPENAI_MODEL` | Alternative to LLM_MODEL |
+## 🚀 Usage
+### Start MCP Server (stdio mode)
+```bash
+python start_mcp.py
+```
+### Start MCP Server (SSE mode for web integration)
+```bash
+python start_mcp.py --mode sse --port 7865
+```
+### Start Gradio Web Interface
+```bash
+python app.py
+```
+## 🔧 MCP Tools Reference
+### Session Management
+| Tool | Description |
+|------|-------------|
+| `create_session` | Create a new extraction session |
+| `get_session_data` | Retrieve all data from a session |
+### Table Processing
+| Tool | Description |
+|------|-------------|
+| `html_to_tsv_representation` | Convert HTML table to TSV format |
+| `html_to_json_representation` | Convert HTML table to JSON format |
+| `analyze_table_structure` | Analyze table structure (headers, merged cells) |
+| `split_complex_table` | Split tables with multiple internal headers |
+### Data Extraction
+| Tool | Description |
+|------|-------------|
+| `extract_catalyst_data_zero_shot` | Extract using zero-shot GPT |
+| `extract_catalyst_data_few_shot` | Extract with example pairs |
+| `extract_catalyst_data_fine_tuned` | Extract using fine-tuned model |
+### Utilities
+| Tool | Description |
+|------|-------------|
+| `list_performance_types` | List supported catalyst performance types |
+| `validate_extraction_result` | Validate extraction against schema |
+| `get_extraction_code_template` | Get Python code for local extraction |
+| `get_environment_requirements` | Get setup requirements |
+## 📋 Supported Performance Types
+The following catalyst performance types can be extracted:
+- `overpotential`, `tafel_slope`, `Rct`, `stability`, `Cdl`
+- `onset_potential`, `current_density`, `potential`, `TOF`, `ECSA`
+- `water_splitting_potential`, `mass_activity`, `exchange_current_density`
+- `Rs`, `specific_activity`, `onset_overpotential`, `BET`, `surface_area`
+- `loading`, `apparent_activation_energy`
+## 🔄 Workflow Example
+### 1. Create a session
+```python
+result = create_session()
+session_id = result["session_id"]
+```
+### 2. Convert HTML table to representation
+```python
+html = "<table>...</table>"
+tsv = html_to_tsv_representation(
+    html_table=html,
+    title="Table 1: Catalyst Performance",
+    caption="OER performance in 1M KOH",
+    session_id=session_id,
+    table_name="table1"
+)
+```
+### 3. Extract catalyst data
+```python
+extraction = extract_catalyst_data_zero_shot(
+    table_representation=tsv["representation"],
+    session_id=session_id,
+    table_name="table1"
+)
+```
+### 4. Validate and export
+```python
+validation = validate_extraction_result(extraction["extraction"])
+session_data = get_session_data(session_id)
+```
+## 🐳 Docker Deployment
+### Build image
+```bash
+docker build -t matablgpt-mcp .
+```
+### Run container
+```bash
+docker run -p 7860:7860 -p 7865:7865 \
+    -e OPENAI_API_KEY=your_key \
+    -e OPENAI_API_BASE=https://api.your-service.com/v1 \
+    matablgpt-mcp
+```
+## 🤗 HuggingFace Spaces Deployment
+1. Create a new Space with Docker SDK
+2. Upload all files from `mcp_output/`
+3. Add `OPENAI_API_KEY` and `OPENAI_API_BASE` as secrets in Space settings
+4. Space will auto-build and deploy
+## 📝 MCP Client Configuration
+Add to your MCP client configuration (e.g., Claude Desktop):
+```json
+{
+  "mcpServers": {
+    "matablgpt": {
+      "command": "python",
+      "args": ["path/to/mcp_output/start_mcp.py"],
+      "env": {
+        "OPENAI_API_KEY": "your_key",
+        "OPENAI_API_BASE": "https://api.your-service.com/v1"
+      }
+    }
+  }
+}
+```
+Or for SSE mode:
+```json
+{
+  "mcpServers": {
+    "matablgpt": {
+      "url": "http://localhost:7865/sse"
+    }
+  }
+}
+```
+## 📄 Output Format
+Extracted data is nested by catalyst name, then performance type, then properties. For example:
+```json
+{
+  "catalyst_name": {
+    "overpotential": {
+      "electrolyte": "1.0 M KOH",
+      "reaction_type": "OER",
+      "value": "230 mV",
+      "current_density": "10 mA/cm²"
+    },
+    "tafel_slope": {
+      "electrolyte": "1.0 M KOH",
+      "reaction_type": "OER",
+      "value": "45 mV/dec"
+    }
+  }
+}
+```
+## 🙏 Acknowledgments
+Based on [MaTableGPT](https://github.com/your-repo/MaTableGPT) - GPT-based Table Data Extractor from Materials Science Literature.
+## 📜 License
+MIT License
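
Note on the README's Output Format: the nested catalyst / performance / properties layout is easier to load into a spreadsheet or DataFrame once it is flattened to one row per measurement. The sketch below is illustrative only and is not part of this commit. `flatten_extraction` is a hypothetical helper, and accepting a list of property dicts per performance type is an assumption, since the README example shows a single dict.

```python
from typing import Any, Dict, List


def flatten_extraction(extraction: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn {catalyst: {performance: {props}}} into one flat row per measurement."""
    rows: List[Dict[str, Any]] = []
    for catalyst, performances in extraction.items():
        if not isinstance(performances, dict):
            continue  # skip non-catalyst keys such as "error" or "raw_response"
        for performance, props in performances.items():
            # Assumption: a performance holds one property dict, or a list of them
            entries = props if isinstance(props, list) else [props]
            for entry in entries:
                if isinstance(entry, dict):
                    rows.append({"catalyst": catalyst, "performance": performance, **entry})
    return rows


# The example from the README's Output Format section
example = {
    "catalyst_name": {
        "overpotential": {
            "electrolyte": "1.0 M KOH",
            "reaction_type": "OER",
            "value": "230 mV",
            "current_density": "10 mA/cm²",
        },
        "tafel_slope": {
            "electrolyte": "1.0 M KOH",
            "reaction_type": "OER",
            "value": "45 mV/dec",
        },
    }
}
rows = flatten_extraction(example)
```

Each row keeps the catalyst and performance type alongside the original property keys, so `pandas.DataFrame(rows)` would give one column per property.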

__init__.py ADDED

	@@ -0,0 +1,33 @@

+"""
+MaTableGPT MCP Output Package
+"""
+from .mcp_service import (
+    TableRepresenter,
+    TableToJSON,
+    TableSplitter,
+    GPTExtractor,
+    SessionManager,
+    table_representer,
+    table_to_json,
+    table_splitter,
+    session_manager,
+    get_extractor,
+    mcp
+)
+__all__ = [
+    'TableRepresenter',
+    'TableToJSON',
+    'TableSplitter',
+    'GPTExtractor',
+    'SessionManager',
+    'table_representer',
+    'table_to_json',
+    'table_splitter',
+    'session_manager',
+    'get_extractor',
+    'mcp'
+]
+__version__ = '1.0.0'

app.py ADDED

	@@ -0,0 +1,627 @@

+#!/usr/bin/env python3
+"""
+MaTableGPT Gradio Web Interface
+================================
+A web interface for the MaTableGPT MCP service.
+Provides an interactive UI for table data extraction from materials science literature.
+For HuggingFace Spaces deployment.
+"""
+import os
+import json
+import logging
+import gradio as gr
+from typing import Optional, Tuple, Dict, Any
+# Configure logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger("matablgpt-app")
+# Import MCP service components
+try:
+    from mcp_service import (
+        table_representer,
+        table_to_json,
+        table_splitter,
+        session_manager,
+        get_extractor,
+        GPTExtractor
+    )
+    MCP_AVAILABLE = True
+except ImportError as e:
+    logger.warning(f"MCP service not available: {e}")
+    MCP_AVAILABLE = False
+# =============================================================================
+# Helper Functions
+# =============================================================================
+def format_json_output(data: Any) -> str:
+    """Format data as pretty JSON string."""
+    try:
+        return json.dumps(data, indent=2, ensure_ascii=False)
+    except (TypeError, ValueError):
+        # Non-serializable or circular data: fall back to plain text
+        return str(data)
+def check_openai_config() -> Tuple[bool, str]:
+    """Check if API configuration is complete (supports third-party services)."""
+    # Check multiple env var names
+    key = (
+        os.environ.get('LLM_API_KEY', '') or
+        os.environ.get('OPENAI_API_KEY', '')
+    )
+    base_url = (
+        os.environ.get('LLM_API_BASE', '') or
+        os.environ.get('OPENAI_API_BASE', '') or
+        os.environ.get('OPENAI_BASE_URL', '')
+    )
+    model = (
+        os.environ.get('LLM_MODEL', '') or
+        os.environ.get('OPENAI_MODEL', '') or
+        'gpt-4-turbo-preview'
+    )
+    status_parts = []
+    if key:
+        status_parts.append(f"✅ API Key: ***{key[-4:]}")
+    else:
+        return False, "⚠️ API Key not configured (set LLM_API_KEY or OPENAI_API_KEY). GPT extraction will not work."
+    if base_url:
+        # Show shortened URL
+        display_url = base_url if len(base_url) <= 35 else base_url[:32] + "..."
+        status_parts.append(f"✅ API URL: {display_url}")
+    else:
+        return False, "⚠️ API Base URL not configured (set LLM_API_BASE or OPENAI_API_BASE). Required for third-party API services."
+    status_parts.append(f"✅ Model: {model}")
+    return True, " | ".join(status_parts)
+def check_openai_key() -> Tuple[bool, str]:
+    """Legacy function - redirects to check_openai_config."""
+    return check_openai_config()
+# =============================================================================
+# Gradio Interface Functions
+# =============================================================================
+def convert_html_to_tsv(html_input: str, title: str, caption: str) -> str:
+    """Convert HTML table to TSV representation."""
+    if not MCP_AVAILABLE:
+        return "Error: MCP service not available"
+    if not html_input.strip():
+        return "Error: Please provide HTML table input"
+    try:
+        result = table_representer.html_to_tsv(html_input, title, caption)
+        return result
+    except Exception as e:
+        return f"Error: {str(e)}"
+def convert_html_to_json(html_input: str, title: str, caption: str) -> str:
+    """Convert HTML table to JSON representation."""
+    if not MCP_AVAILABLE:
+        return "Error: MCP service not available"
+    if not html_input.strip():
+        return "Error: Please provide HTML table input"
+    try:
+        result = table_to_json.html_to_json(html_input, title, caption)
+        return format_json_output(result)
+    except Exception as e:
+        return f"Error: {str(e)}"
+def analyze_table(html_input: str) -> str:
+    """Analyze HTML table structure."""
+    if not MCP_AVAILABLE:
+        return "Error: MCP service not available"
+    if not html_input.strip():
+        return "Error: Please provide HTML table input"
+    try:
+        result = table_splitter.analyze_table_structure(html_input)
+        return format_json_output(result)
+    except Exception as e:
+        return f"Error: {str(e)}"
+def split_table(html_input: str, title: str, caption: str) -> str:
+    """Split complex table into simpler components."""
+    if not MCP_AVAILABLE:
+        return "Error: MCP service not available"
+    if not html_input.strip():
+        return "Error: Please provide HTML table input"
+    try:
+        result = table_splitter.split_table(html_input, title, caption)
+        return format_json_output({
+            "table_count": len(result),
+            "tables": result
+        })
+    except Exception as e:
+        return f"Error: {str(e)}"
+def extract_zero_shot(table_repr: str) -> str:
+    """Extract catalyst data using zero-shot approach."""
+    if not MCP_AVAILABLE:
+        return "Error: MCP service not available"
+    if not table_repr.strip():
+        return "Error: Please provide table representation"
+    has_key, key_status = check_openai_key()
+    if not has_key:
+        return f"Error: {key_status}"
+    try:
+        extractor = get_extractor()
+        result = extractor.extract_zero_shot(table_repr)
+        return format_json_output(result)
+    except Exception as e:
+        return f"Error: {str(e)}"
+def extract_few_shot(table_repr: str, examples_json: str) -> str:
+    """Extract catalyst data using few-shot approach."""
+    if not MCP_AVAILABLE:
+        return "Error: MCP service not available"
+    if not table_repr.strip():
+        return "Error: Please provide table representation"
+    has_key, key_status = check_openai_key()
+    if not has_key:
+        return f"Error: {key_status}"
+    try:
+        examples = json.loads(examples_json) if examples_json.strip() else []
+        extractor = get_extractor()
+        result = extractor.extract_few_shot(table_repr, examples)
+        return format_json_output(result)
+    except json.JSONDecodeError:
+        return "Error: Invalid examples JSON format"
+    except Exception as e:
+        return f"Error: {str(e)}"
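+def validate_extraction(extraction_json: str) -> str:
+    """Validate extraction result."""
+    if not MCP_AVAILABLE:
+        # GPTExtractor is undefined when the mcp_service import failed
+        return "Error: MCP service not available"
+    if not extraction_json.strip():
+        return "Error: Please provide extraction JSON"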
+    try:
+        extraction = json.loads(extraction_json)
+    except json.JSONDecodeError:
+        return "Error: Invalid JSON format"
+    issues = []
+    warnings = []
+    if not isinstance(extraction, dict):
+        return format_json_output({"valid": False, "issues": ["Extraction must be a dictionary"]})
+    if "error" in extraction:
+        issues.append(f"Extraction contains error: {extraction['error']}")
+    valid_performance_types = set(GPTExtractor.PERFORMANCE_LIST)
+    for catalyst_name, performances in extraction.items():
+        if catalyst_name in ["error", "raw_response", "catalysts"]:
+            continue
+        if not isinstance(performances, dict):
+            warnings.append(f"Catalyst '{catalyst_name}' should have dict of performances")
+            continue
+        for perf_name, properties in performances.items():
+            if perf_name not in valid_performance_types:
+                warnings.append(f"Unknown performance type: {perf_name}")
+            if isinstance(properties, dict):
+                for prop_key in properties.keys():
+                    if prop_key not in GPTExtractor.PROPERTY_TEMPLATE:
+                        warnings.append(f"Unknown property key: {prop_key}")
+    return format_json_output({
+        "valid": len(issues) == 0,
+        "issues": issues,
+        "warnings": warnings
+    })
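+def get_performance_types() -> str:
+    """Get list of supported performance types."""
+    if not MCP_AVAILABLE:
+        # Called while the UI is built, so it must not raise if the import failed
+        return "Error: MCP service not available"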
+    return format_json_output({
+        "performance_types": GPTExtractor.PERFORMANCE_LIST,
+        "property_template": GPTExtractor.PROPERTY_TEMPLATE
+    })
+def get_code_template(repr_format: str, model_type: str) -> str:
+    """Generate code template for local extraction."""
+    code = f'''"""
+MaTableGPT Local Extraction Template
+Model Type: {model_type}
+Representation Format: {repr_format}
+"""
+from openai import OpenAI
+import json
+# Initialize client
+client = OpenAI(api_key="YOUR_API_KEY")
+# Performance types to extract
+PERFORMANCE_LIST = [
+    'overpotential', 'tafel_slope', 'Rct', 'stability', 'Cdl',
+    'onset_potential', 'current_density', 'potential', 'TOF', 'ECSA',
+    'water_splitting_potential', 'mass_activity', 'exchange_current_density',
+    'Rs', 'specific_activity', 'onset_overpotential', 'BET', 'surface_area',
+    'loading', 'apparent_activation_energy'
+]
+# Your table representation
+table_representation = """
+# Paste your {repr_format.upper()} representation here
+"""
+# System prompt
+system_prompt = """I will extract catalyst performance information from the table and create JSON format.
+Performance types: """ + str(PERFORMANCE_LIST) + """
+The JSON format will have performance within the catalyst, with elements:
+reaction type, value, electrolyte, condition, current density, versus, substrate.
+Output must contain only JSON dictionary."""
+# Extract
+response = client.chat.completions.create(
+    model="gpt-4-turbo-preview",
+    messages=[
+        {{"role": "system", "content": system_prompt}},
+        {{"role": "user", "content": table_representation}}
+    ],
+    temperature=0
+)
+result = response.choices[0].message.content.strip()
+print(json.dumps(json.loads(result), indent=2))
+'''
+    return code
+# =============================================================================
+# Gradio UI
+# =============================================================================
+# Sample HTML table for demo
+SAMPLE_HTML = '''<table>
+  <thead>
+    <tr>
+      <th>Catalyst</th>
+      <th>Overpotential (mV)</th>
+      <th>Tafel Slope (mV/dec)</th>
+      <th>Electrolyte</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Pt/C</td>
+      <td>280</td>
+      <td>65</td>
+      <td>1M KOH</td>
+    </tr>
+    <tr>
+      <td>NiFe-LDH</td>
+      <td>230</td>
+      <td>45</td>
+      <td>1M KOH</td>
+    </tr>
+    <tr>
+      <td>Co3O4</td>
+      <td>350</td>
+      <td>78</td>
+      <td>1M KOH</td>
+    </tr>
+  </tbody>
+</table>'''
+def create_ui():
+    """Create Gradio interface."""
+    # Check status
+    has_key, key_status = check_openai_key()
+    status_color = "green" if has_key else "orange"
+    with gr.Blocks(
+        title="MaTableGPT - Table Data Extractor",
+        theme=gr.themes.Soft()
+    ) as app:
+        gr.Markdown("""
+        # 🔬 MaTableGPT - Table Data Extractor
+        **Extract structured catalyst performance data from HTML tables in materials science literature**
+        This tool uses GPT models to convert complex HTML tables into structured JSON data with
+        catalyst names, performance metrics (overpotential, Tafel slope, etc.), and associated properties.
+        """)
+        gr.Markdown(f"**Status:** <span style='color:{status_color}'>{key_status}</span>")
+        with gr.Tabs():
+            # Tab 1: Table Representation
+            with gr.TabItem("📋 Table Representation"):
+                gr.Markdown("### Convert HTML tables to TSV or JSON format")
+                with gr.Row():
+                    with gr.Column():
+                        html_input = gr.Textbox(
+                            label="HTML Table Input",
+                            placeholder="Paste your HTML table here...",
+                            lines=15,
+                            value=SAMPLE_HTML
+                        )
+                        title_input = gr.Textbox(
+                            label="Table Title (optional)",
+                            placeholder="e.g., Table 1: OER Catalyst Performance"
+                        )
+                        caption_input = gr.Textbox(
+                            label="Table Caption (optional)",
+                            placeholder="e.g., Performance measured at 10 mA/cm²"
+                        )
+                        with gr.Row():
+                            tsv_btn = gr.Button("Convert to TSV", variant="primary")
+                            json_btn = gr.Button("Convert to JSON", variant="primary")
+                    with gr.Column():
+                        repr_output = gr.Textbox(
+                            label="Representation Output",
+                            lines=20,
+                            show_copy_button=True
+                        )
+                tsv_btn.click(
+                    convert_html_to_tsv,
+                    inputs=[html_input, title_input, caption_input],
+                    outputs=repr_output
+                )
+                json_btn.click(
+                    convert_html_to_json,
+                    inputs=[html_input, title_input, caption_input],
+                    outputs=repr_output
+                )
+            # Tab 2: Table Analysis & Splitting
+            with gr.TabItem("🔍 Table Analysis"):
+                gr.Markdown("### Analyze and split complex tables")
+                with gr.Row():
+                    with gr.Column():
+                        html_analyze = gr.Textbox(
+                            label="HTML Table Input",
+                            placeholder="Paste your HTML table here...",
+                            lines=10,
+                            value=SAMPLE_HTML
+                        )
+                        with gr.Row():
+                            analyze_btn = gr.Button("Analyze Structure", variant="secondary")
+                            split_btn = gr.Button("Split Table", variant="secondary")
+                    with gr.Column():
+                        analysis_output = gr.Textbox(
+                            label="Analysis Result",
+                            lines=15,
+                            show_copy_button=True
+                        )
+                analyze_btn.click(
+                    analyze_table,
+                    inputs=html_analyze,
+                    outputs=analysis_output
+                )
+                split_btn.click(
+                    split_table,
+                    inputs=[html_analyze, title_input, caption_input],
+                    outputs=analysis_output
+                )
+            # Tab 3: GPT Extraction
+            with gr.TabItem("🤖 GPT Extraction"):
+                gr.Markdown("### Extract catalyst data using GPT models")
+                if not has_key:
+                    gr.Markdown("""
+                    ⚠️ **API Configuration Required**
+                    Set `LLM_API_KEY` and `LLM_API_BASE` (or `OPENAI_API_KEY` and `OPENAI_API_BASE`) to enable GPT extraction.
+                    """)
+                with gr.Row():
+                    with gr.Column():
+                        table_repr_input = gr.Textbox(
+                            label="Table Representation (TSV or JSON)",
+                            placeholder="Paste your table representation here...",
+                            lines=10
+                        )
+                        extraction_method = gr.Radio(
+                            ["Zero-shot", "Few-shot"],
+                            label="Extraction Method",
+                            value="Zero-shot"
+                        )
+                        examples_input = gr.Textbox(
+                            label="Examples (for Few-shot, JSON format)",
+                            placeholder='[{"input": "...", "output": "..."}]',
+                            lines=5,
+                            visible=False
+                        )
+                        extract_btn = gr.Button("Extract Catalyst Data", variant="primary")
+                    with gr.Column():
+                        extraction_output = gr.Textbox(
+                            label="Extraction Result",
+                            lines=20,
+                            show_copy_button=True
+                        )
+                def update_examples_visibility(method):
+                    return gr.update(visible=(method == "Few-shot"))
+                extraction_method.change(
+                    update_examples_visibility,
+                    inputs=extraction_method,
+                    outputs=examples_input
+                )
+                def extract_data(table_repr, method, examples):
+                    if method == "Zero-shot":
+                        return extract_zero_shot(table_repr)
+                    else:
+                        return extract_few_shot(table_repr, examples)
+                extract_btn.click(
+                    extract_data,
+                    inputs=[table_repr_input, extraction_method, examples_input],
+                    outputs=extraction_output
+                )
+            # Tab 4: Validation
+            with gr.TabItem("✅ Validation"):
+                gr.Markdown("### Validate extraction results")
+                with gr.Row():
+                    with gr.Column():
+                        validation_input = gr.Textbox(
+                            label="Extraction JSON to Validate",
+                            placeholder="Paste extraction JSON here...",
+                            lines=15
+                        )
+                        validate_btn = gr.Button("Validate", variant="secondary")
+                    with gr.Column():
+                        validation_output = gr.Textbox(
+                            label="Validation Result",
+                            lines=10
+                        )
+                        gr.Markdown("### Supported Performance Types")
+                        perf_types = gr.Textbox(
+                            label="",
+                            value=get_performance_types(),
+                            lines=10,
+                            interactive=False
+                        )
+                validate_btn.click(
+                    validate_extraction,
+                    inputs=validation_input,
+                    outputs=validation_output
+                )
+            # Tab 5: Code Template
+            with gr.TabItem("💻 Code Template"):
+                gr.Markdown("### Generate Python code for local extraction")
+                with gr.Row():
+                    repr_format = gr.Dropdown(
+                        ["tsv", "json"],
+                        label="Representation Format",
+                        value="tsv"
+                    )
+                    model_type = gr.Dropdown(
+                        ["zero-shot", "few-shot", "fine-tuning"],
+                        label="Model Type",
+                        value="zero-shot"
+                    )
+                generate_btn = gr.Button("Generate Code", variant="secondary")
+                code_output = gr.Code(
+                    label="Python Code Template",
+                    language="python",
+                    lines=30
+                )
+                generate_btn.click(
+                    get_code_template,
+                    inputs=[repr_format, model_type],
+                    outputs=code_output
+                )
+            # Tab 6: About
+            with gr.TabItem("ℹ️ About"):
+                gr.Markdown("""
+                ## About MaTableGPT
+                MaTableGPT is a GPT-based table data extractor specifically designed for
+                materials science literature. It converts complex HTML tables containing
+                catalyst performance data into structured JSON format.
+                ### Workflow
+                1. **Table Representation**: Convert HTML tables to TSV or JSON format
+                2. **Table Splitting** (optional): Break down complex tables with multiple headers
+                3. **GPT Extraction**: Use zero-shot, few-shot, or fine-tuned models to extract data
+                4. **Validation**: Verify extraction results against expected schema
+                ### Supported Performance Types
+                - Overpotential, Tafel slope, Rct, Stability, Cdl
+                - Onset potential, Current density, Potential, TOF, ECSA
+                - Water splitting potential, Mass activity, Exchange current density
+                - Rs, Specific activity, Onset overpotential, BET, Surface area
+                - Loading, Apparent activation energy
+                ### MCP Integration
+                This service is also available as an MCP (Model Context Protocol) server,
+                allowing integration with AI assistants like Claude.
+                ### Credits
+                Based on [MaTableGPT](https://github.com/your-repo/MaTableGPT) research.
+                """)
+        gr.Markdown("---\n*MaTableGPT MCP Service - Materials Science Table Data Extraction*")
+    return app
+# =============================================================================
+# Main Entry Point
+# =============================================================================
+def main():
+    """Run the Gradio app."""
+    app = create_ui()
+    # Get port from environment or default
+    port = int(os.environ.get('GRADIO_SERVER_PORT', 7860))
+    app.launch(
+        server_name="0.0.0.0",
+        server_port=port,
+        share=False
+    )
+if __name__ == "__main__":
+    main()
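
Note on configuration lookup: `check_openai_config` above reads each setting from several environment variable names, with the `LLM_*` names taking precedence over the `OPENAI_*` ones. The sketch below restates that precedence as a pure function over a mapping, which makes it easy to unit-test without touching `os.environ`. It is an illustration, not code from this commit; `first_set` and `resolve_llm_config` are hypothetical names.

```python
from typing import Mapping, Optional, Tuple

DEFAULT_MODEL = "gpt-4-turbo-preview"  # default model named in the README


def first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first non-empty value among `names`, or None if all are unset."""
    for name in names:
        value = env.get(name, "")
        if value:
            return value
    return None


def resolve_llm_config(env: Mapping[str, str]) -> Tuple[Optional[str], Optional[str], str]:
    """Resolve (api_key, base_url, model) using the same precedence as app.py."""
    key = first_set(env, "LLM_API_KEY", "OPENAI_API_KEY")
    base = first_set(env, "LLM_API_BASE", "OPENAI_API_BASE", "OPENAI_BASE_URL")
    model = first_set(env, "LLM_MODEL", "OPENAI_MODEL") or DEFAULT_MODEL
    return key, base, model
```

Calling `resolve_llm_config(os.environ)` would give the values the status line reports; a `None` key or base URL corresponds to the two warning branches in `check_openai_config`.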

mcp_service.py ADDED

	@@ -0,0 +1,1413 @@

+"""
+MaTableGPT MCP Service
+======================
+A Model Context Protocol (MCP) service for extracting table data from
+materials science literature using GPT models.
+This service provides tools for:
+1. Table Representation: Converting HTML tables to TSV or JSON format
+2. Table Splitting: Breaking down complex tables into simpler components
+3. GPT-based Data Extraction: Using fine-tuning, few-shot, or zero-shot models
+4. Follow-up Questions: Refining extraction results through iterative questioning
+5. Model Evaluation: Assessing extraction quality
+"""
+import os
+import json
+import re
+import logging
+import tempfile
+import uuid
+from datetime import datetime
+from typing import Optional, Dict, List, Any, Union
+from dataclasses import dataclass, field
+from contextlib import asynccontextmanager
+from bs4 import BeautifulSoup
+import pandas as pd
+# MCP imports
+from mcp.server.fastmcp import FastMCP
+# Configure logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger("matablgpt-mcp")
+# =============================================================================
+# Data Classes
+# =============================================================================
+@dataclass
+class TableData:
+    """Represents a parsed table structure"""
+    title: str = ""
+    caption: str = ""
+    tag: str = ""  # HTML table tag
+    headers: List[List[str]] = field(default_factory=list)
+    body: List[List[str]] = field(default_factory=list)
+@dataclass
+class ExtractionResult:
+    """Represents the result of GPT extraction"""
+    session_id: str
+    table_name: str
+    model_type: str  # 'fine-tuning', 'few-shot', 'zero-shot'
+    result: Dict[str, Any]
+    timestamp: str
+    follow_up_applied: bool = False
+@dataclass
+class SessionData:
+    """Session data for storing extraction results"""
+    session_id: str
+    created_at: str
+    tables: Dict[str, TableData] = field(default_factory=dict)
+    representations: Dict[str, str] = field(default_factory=dict)
+    extractions: List[ExtractionResult] = field(default_factory=list)
+# =============================================================================
+# Table Processing Classes
+# =============================================================================
+class TableRepresenter:
+    """
+    Converts HTML tables to TSV (Tab-Separated Values) representation.
+    Handles merged cells, captions, and titles.
+    """
+    def __init__(self):
+        # Cell representation formats
+        self.merged_cell = '<merge {}={}>{}</merge>'
+        self.both_merged_cell = '<merge {}={} {}={}>{}</merge>'
+        self.cell = '{}\\t'
+        self.line_breaking = '\\n'
+        self.table_tag = '<table>{}</table>'
+        self.caption_tag = '<caption>{}</caption>'
+        self.title_tag = '<title>{}</title>'
+    def text_filter(self, text: str) -> str:
+        """Remove unnecessary text and HTML tags from the given string."""
+        out = text
+        # Replace special Unicode characters
+        replacements = [
+            ('\xa0', ' '), ('\u2005', ' '), ('\u2009', ' '),
+            ('\u202f', ' '), ('\u200b', ''), ('<b>', ''), ('</b>', '')
+        ]
+        for old, new in replacements:
+            out = out.replace(old, new)
+        # Remove specific patterns
+        patterns = [
+            (r'<cap>(\(\d+\)|\d+|\[\d+\]|\d+\,\d+|\d+\,\d+\,\d+|\d+\,\d+\–\d+|\d+\D+|\(\d+\,\s*\d+\)|\(\d+\D+\))</cap>', r'\1'),
+            (r'<cap>(\s*ref\.\s\d+.*?)</cap>', r'\1'),
+            (r'\(<cap>(\s*(ref\.\s\d+.*?)\s*)</cap>\)', r'\1'),
+            (r'<cap>(\s*Ref\.\s\d+.*?)</cap>', r'\1'),
+            (r'\(<cap>(\s*(Ref\.\s\d+.*?)\s*)</cap>\)', r'\1'),
+            (r'<cap>(\[\d+|\d+\])</cap>', r'\1'),
+            (r'<cap>((.*?)et al\..*?)</cap>', r'\1'),
+            (r'<cap>((.*?)Fig\..*?)</cap>', r'\1'),
+            (r'<cap>(Song and Hu \(2014\))</cap>', r'\1'),
+            (r'<div> <cap>  </cap> </div> ', ''),
+            (r'<cap>(mA\.cm)</cap>', r'\1'),
+            (r'<cap>(https.*?)</cap>', r'\1'),
+            (r'<cap>(\d+\.\d+\@\d+)</cap>', r'\1')
+        ]
+        for pattern, repl in patterns:
+            out = re.sub(pattern, repl, out)
+        return out
+    def process_table(self, t):
+        """Remove unnecessary HTML tags from the table element."""
+        tags_to_remove = [
+            'img', 'em', 'i', 'p', 'span', 'strong', 'math', 'mi', 'br',
+            'script', 'svg', 'mrow', 'mo', 'mn', 'msub', 'msubsup', 'mtext',
+            'mjx-container', 'mjx-math', 'mjx-mrow', 'mjx-msub', 'mjx-mi',
+            'mjx-c', 'mjx-script', 'mjx-mspace', 'mjx-assistive-mml', 'mspace'
+        ]
+        for tag in tags_to_remove:
+            elements = t.find_all(tag)
+            for element in elements:
+                if tag in ['img', 'script', 'svg']:
+                    element.decompose()
+                else:
+                    element.unwrap()
+        return t
+    def html_to_tsv(self, html_table: str, title: str = "", caption: str = "") -> str:
+        """
+        Convert HTML table to TSV representation.
+        Args:
+            html_table: HTML string containing the table
+            title: Table title
+            caption: Table caption
+        Returns:
+            TSV representation of the table
+        """
+        soup = BeautifulSoup(html_table, 'html.parser')
+        table = soup.find('table')
+        if not table:
+            table = soup
+        # Get table dimensions
+        tbody = table.find('tbody') or table
+        first_row = tbody.find('tr')
+        if not first_row:
+            return "Error: No table rows found"
+        width = sum(int(cell.get('colspan', 1)) for cell in first_row.find_all(re.compile('(?<!ma)th|td')))
+        height = len(table.find_all('tr'))
+        # Initialize output grid
+        out = [['' for _ in range(width)] for _ in range(height)]
+        # Process each row
+        i = 0
+        for tr in table.find_all('tr'):
+            j = 0
+            for cell in tr.find_all(re.compile('(?<!ma)th|td')):
+                # Process links
+                for a_tag in cell.find_all('a'):
+                    a_text = a_tag.get_text()
+                    if a_text.isdigit():
+                        a_tag.string = f"<ref>{a_text}</ref>"
+                    else:
+                        a_tag.string = f"<cap>{a_text}</cap>"
+                cell = self.process_table(cell)
+                # Find next empty cell
+                while j < width and out[i][j] != '':
+                    j += 1
+                if j >= width:
+                    break
+                refined_text = ''.join(str(element) for element in cell.contents)
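+                colspan = int(cell.get('colspan', 0))
+                rowspan = int(cell.get('rowspan', 0))
+                # An explicit span of 1 is not a merge; normalise it to 0 so the checks below skip it
+                colspan = colspan if colspan > 1 else 0
+                rowspan = rowspan if rowspan > 1 else 0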
+                # Handle merged cells
+                if colspan and rowspan:
+                    out[i][j] = self.both_merged_cell.format('colspan', colspan, 'rowspan', rowspan, self.text_filter(refined_text))
+                    for c in range(colspan):
+                        for r in range(rowspan):
+                            if c > 0 or r > 0:
+                                if i + r < height and j + c < width:
+                                    out[i + r][j + c] = '::'
+                elif colspan:
+                    out[i][j] = self.merged_cell.format('colspan', colspan, self.text_filter(refined_text))
+                    for c in range(1, colspan):
+                        if j + c < width:
+                            out[i][j + c] = '::'
+                elif rowspan:
+                    out[i][j] = self.merged_cell.format('rowspan', rowspan, self.text_filter(refined_text))
+                    for r in range(1, rowspan):
+                        if i + r < height:
+                            out[i + r][j] = '::'
+                else:
+                    text = self.text_filter(refined_text) if refined_text else ' '
+                    out[i][j] = text
+                j += colspan if colspan else 1
+            i += 1
+        # Build result string
+        result = ''
+        for row in out:
+            for element in row:
+                if element != '::':
+                    result += self.cell.format(element)
+            result += self.line_breaking
+        final_result = self.title_tag.format(title) + self.table_tag.format(result)
+        if caption:
+            if isinstance(caption, dict):
+                caption_str = ', '.join([f"{k}: {v}" for k, v in caption.items()])
+            else:
+                caption_str = str(caption)
+            final_result += '\n' + self.caption_tag.format(caption_str)
+        return final_result
+class TableToJSON:
+    """
+    Converts HTML tables to JSON representation.
+    """
+    def process_caption(self, table):
+        """Process caption and reference tags."""
+        # Remove tfoot
+        for tfoot in table.find_all('tfoot'):
+            tfoot.decompose()
+        for cell in table.find_all(['td', 'th']):
+            for link in cell.find_all('a'):
+                link_text = link.get_text()
+                if len(link_text) == 1 and (link_text.isalpha() or link_text == '*'):
+                    link.string = f"<cap>{link_text}</cap>"
+                else:
+                    link.string = f"<ref>{link_text}</ref>"
+        return table
+    def process_sub_sup(self, table):
+        """Process subscript and superscript tags."""
+        for cell in table.find_all(['td', 'th']):
+            for sup in cell.find_all('sup'):
+                sup_text = sup.get_text() or ""
+                sup.string = f"<sup>{sup_text}</sup>"
+            for sub in cell.find_all('sub'):
+                sub_text = sub.get_text() or ""
+                sub.string = f"<sub>{sub_text}</sub>"
+        return table
+    def html_to_json(self, html_table: str, title: str = "", caption: str = "") -> Dict:
+        """
+        Convert HTML table to JSON representation.
+        Args:
+            html_table: HTML string containing the table
+            title: Table title
+            caption: Table caption
+        Returns:
+            JSON dictionary representation of the table
+        """
+        soup = BeautifulSoup(html_table, 'html.parser')
+        table = soup.find('table')
+        if not table:
+            table = soup
+        # Process table
+        table = self.process_caption(table)
+        table = self.process_sub_sup(table)
+        # Fill empty header cells
+        for th in table.find_all('th'):
+            if not th.text.strip():
+                th.insert(0, '-')
+        # Convert to DataFrame
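+        try:
+            # Wrap the markup in StringIO: recent pandas versions deprecate passing literal HTML strings
+            from io import StringIO
+            dfs = pd.read_html(StringIO(str(table)))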
+            if not dfs:
+                return {"error": "Could not parse table"}
+            df = dfs[0]
+            df.fillna("NaN", inplace=True)
+        except Exception as e:
+            return {"error": f"Failed to parse table: {str(e)}"}
+        # Build JSON structure
+        result = {}
+        header_levels = df.columns.nlevels
+        keys = list(df.columns)
+        for i, key in enumerate(keys):
+            values = df.iloc[:, i].tolist()
+            if header_levels > 1:
+                current = result
+                for j, k in enumerate(key):
+                    if j == len(key) - 1:
+                        current[k] = values
+                    else:
+                        if k not in current:
+                            current[k] = {}
+                        current = current[k]
+            else:
+                result[key] = values
+        # Add metadata
+        final_result = {
+            "Title": title,
+            "caption": caption,
+            **result
+        }
+        return final_result
+class TableSplitter:
+    """
+    Splits complex tables into simpler components for better extraction.
+    """
+    def analyze_table_structure(self, html_table: str) -> Dict:
+        """
+        Analyze the structure of an HTML table.
+        Args:
+            html_table: HTML string containing the table
+        Returns:
+            Dictionary containing structural analysis
+        """
+        soup = BeautifulSoup(html_table, 'html.parser')
+        table = soup.find('table') or soup
+        rows = table.find_all('tr')
+        # Analyze each row
+        row_analysis = []
+        for row in rows:
+            cells = row.find_all(['td', 'th'])
+            cell_types = [cell.name for cell in cells]
+            merged_cells = sum(1 for cell in cells if cell.get('colspan') or cell.get('rowspan'))
+            # Determine if row is header or body
+            is_header = all(c.name == 'th' for c in cells) or self._is_header_content(cells)
+            row_analysis.append({
+                "cell_count": len(cells),
+                "cell_types": cell_types,
+                "merged_cells": merged_cells,
+                "is_header": is_header
+            })
+        return {
+            "total_rows": len(rows),
+            "has_thead": table.find('thead') is not None,
+            "has_tbody": table.find('tbody') is not None,
+            "row_analysis": row_analysis
+        }
+    def _is_header_content(self, cells) -> bool:
+        """Check if cells contain header-like content."""
+        if not cells:
+            return False
+        # Check if all cells have the same value (likely a spanning header)
+        texts = [c.get_text().strip() for c in cells]
+        if len(set(texts)) == 1 and texts[0]:
+            return True
+        # Check if content is mostly non-numeric
+        numeric_count = 0
+        for text in texts:
+            try:
+                float(re.sub(r'[^\d.-]', '', text))
+                numeric_count += 1
+            except ValueError:
+                pass
+        return numeric_count < len(texts) / 2
+    def split_table(self, html_table: str, title: str = "", caption: str = "") -> List[Dict]:
+        """
+        Split a complex table into simpler components.
+        Args:
+            html_table: HTML string containing the table
+            title: Table title
+            caption: Table caption
+        Returns:
+            List of simplified table dictionaries
+        """
+        soup = BeautifulSoup(html_table, 'html.parser')
+        table = soup.find('table') or soup
+        analysis = self.analyze_table_structure(html_table)
+        # If simple table, return as-is
+        if all(not r['is_header'] or i == 0 for i, r in enumerate(analysis['row_analysis'])):
+            return [{
+                "html": str(table),
+                "title": title,
+                "caption": caption,
+                "index": 1
+            }]
+        # Split based on internal headers
+        split_tables = []
+        current_header = None
+        current_rows = []
+        thead = table.find('thead')
+        original_header = str(thead) if thead else ""
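+        tbody_tag = table.find('tbody')
+        tbody = tbody_tag or table
+        # row_analysis covers every <tr> in the table, so tbody rows are offset by the
+        # thead row count only when a real <tbody> element exists
+        offset = len(thead.find_all('tr')) if (thead and tbody_tag) else 0
+        for i, row in enumerate(tbody.find_all('tr')):
+            if thead and row.find_parent('thead'):
+                continue  # thead rows are already captured in original_header
+            if analysis['row_analysis'][i + offset]['is_header']: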
+                # Save previous section
+                if current_rows:
+                    split_tables.append({
+                        "html": self._build_table_html(original_header, current_header, current_rows),
+                        "title": title,
+                        "caption": caption,
+                        "index": len(split_tables) + 1
+                    })
+                current_header = str(row)
+                current_rows = []
+            else:
+                current_rows.append(str(row))
+        # Save last section
+        if current_rows:
+            split_tables.append({
+                "html": self._build_table_html(original_header, current_header, current_rows),
+                "title": title,
+                "caption": caption,
+                "index": len(split_tables) + 1
+            })
+        return split_tables if split_tables else [{
+            "html": str(table),
+            "title": title,
+            "caption": caption,
+            "index": 1
+        }]
+    def _build_table_html(self, original_header: str, sub_header: str, rows: List[str]) -> str:
+        """Build HTML table from components."""
+        header = original_header
+        if sub_header:
+            if header:
+                header = header.replace('</thead>', sub_header + '</thead>')
+            else:
+                header = f"<thead>{sub_header}</thead>"
+        body = "<tbody>" + "".join(rows) + "</tbody>"
+        return f"<table>{header}{body}</table>"
+# =============================================================================
+# GPT Extraction Classes
+# =============================================================================
+class GPTExtractor:
+    """
+    Handles GPT-based extraction of catalyst data from table representations.
+    Supports third-party API services with custom base URL (reverse proxy,
+    API aggregators like OpenRouter, OneAPI, etc.).
+    Environment Variables:
+        LLM_API_KEY or OPENAI_API_KEY: Your API key
+        LLM_API_BASE or OPENAI_API_BASE: API base URL (required for third-party services)
+        LLM_MODEL or OPENAI_MODEL: Model name (default: gpt-4-turbo-preview)
+    """
+    # Performance types to extract
+    PERFORMANCE_LIST = [
+        'overpotential', 'tafel_slope', 'Rct', 'stability', 'Cdl',
+        'onset_potential', 'current_density', 'potential', 'TOF', 'ECSA',
+        'water_splitting_potential', 'mass_activity', 'exchange_current_density',
+        'Rs', 'specific_activity', 'onset_overpotential', 'BET', 'surface_area',
+        'loading', 'apparent_activation_energy'
+    ]
+    # Property template
+    PROPERTY_TEMPLATE = {
+        'electrolyte': '', 'reaction_type': '', 'value': '',
+        'current_density': '', 'overpotential': '', 'potential': '',
+        'substrate': '', 'versus': '', 'condition': ''
+    }
+    # Default model
+    DEFAULT_MODEL = "gpt-4-turbo-preview"
+    def __init__(self, api_key: Optional[str] = None, base_url: Optional[str] = None, model: Optional[str] = None):
+        """
+        Initialize GPT Extractor.
+        Args:
+            api_key: API key. Falls back to LLM_API_KEY or OPENAI_API_KEY env var.
+            base_url: API base URL. Falls back to LLM_API_BASE or OPENAI_API_BASE env var.
+            model: Model name. Falls back to LLM_MODEL or OPENAI_MODEL env var.
+        """
+        # Support multiple env var names for flexibility
+        self.api_key = (
+            api_key or
+            os.environ.get('LLM_API_KEY', '') or
+            os.environ.get('OPENAI_API_KEY', '')
+        )
+        self.base_url = (
+            base_url or
+            os.environ.get('LLM_API_BASE', '') or
+            os.environ.get('OPENAI_API_BASE', '') or
+            os.environ.get('OPENAI_BASE_URL', '')
+        )
+        self.model = (
+            model or
+            os.environ.get('LLM_MODEL', '') or
+            os.environ.get('OPENAI_MODEL', '') or
+            self.DEFAULT_MODEL
+        )
+        self._client = None
+        logger.info(f"GPTExtractor initialized with model: {self.model}")
+        if self.base_url:
+            logger.info(f"Using custom API base URL: {self.base_url}")
+        else:
+            logger.warning("No API base URL configured - using default OpenAI endpoint")
+    @property
+    def client(self):
+        """Lazy initialization of OpenAI-compatible client."""
+        if self._client is None:
+            try:
+                from openai import OpenAI
+                # Build client kwargs
+                client_kwargs = {"api_key": self.api_key}
+                # Add base_url for third-party API services
+                if self.base_url:
+                    client_kwargs["base_url"] = self.base_url
+                self._client = OpenAI(**client_kwargs)
+                logger.info("API client initialized successfully")
+            except ImportError:
+                raise ImportError("OpenAI package not installed. Install with: pip install openai")
+        return self._client
+    def get_model(self) -> str:
+        """Get the model name to use for API calls."""
+        return self.model
+    def get_system_prompt(self, model_type: str) -> str:
+        """Get system prompt based on model type."""
+        if model_type == 'fine-tuning':
+            return """This task is to take a string as input and convert it to JSON format.
+I want to extract the performance below: [reaction_type, versus, overpotential, substrate, loading,
+tafel_slope, onset_potential, current_density, BET, specific_activity, mass_activity, surface_area,
+ECSA, apparent_activation_energy, water_splitting_potential, potential, Rs, Rct, Cdl, TOF, stability,
+electrolyte, exchange_current_density, onset_overpotential].
+If there is information about overpotential and Tafel slope in the input, the output should be:
+{
+    "catalyst_name": {
+        "overpotential": {"electrolyte": "1.0 M KOH", "reaction_type": "OER", "value": "230 mV", "current_density": "50 mA/cm2"},
+        "tafel_slope": {"electrolyte": "1.0 M KOH", "reaction_type": "OER", "value": "54 mV/dec"}
+    }
+}
+If certain information cannot be found, those keys should not be included in the output.
+If there are no values corresponding to performance metrics, simply extract the catalyst name as: {"catalyst_name": {}}"""
+        elif model_type == 'few-shot':
+            return f"""I will extract the performance information of the catalyst from the table and create a JSON format.
+The types of performance to be extracted: performance_list = {self.PERFORMANCE_LIST}
+You can only use the names as they are in the performance_list.
+The JSON format will have performance within the catalyst, and each performance will include elements present in the table:
+reaction type, value, electrolyte, condition, current density, versus (ex: RHE) and substrate.
+The output must contain only JSON dictionary. Other sentences or opinions must not be in output."""
+        else:  # zero-shot
+            return f"""I'm going to convert the information in the table representer into JSON format.
+CATALYST_TEMPLATE = {{'catalyst_name': {{'performance_name': {{PROPERTY_TEMPLATE}}}}}}
+PROPERTY_TEMPLATE = {self.PROPERTY_TEMPLATE}
+performance_list = {self.PERFORMANCE_LIST}
+Extract catalyst information following these templates strictly."""
+    def extract_zero_shot(self, table_representation: str) -> Dict:
+        """
+        Extract data using zero-shot approach with step-by-step questioning.
+        Args:
+            table_representation: TSV or JSON representation of the table
+        Returns:
+            Extracted catalyst data in JSON format
+        """
+        messages = [{"role": "system", "content": self.get_system_prompt('zero-shot') + "\n\n" + table_representation}]
+        # Step 1: Get catalyst list
+        catalyst_q = "Show the catalysts present in the table representer as a Python list. Answer must be ONLY python list."
+        messages.append({"role": "user", "content": catalyst_q})
+        try:
+            response = self.client.chat.completions.create(
+                model=self.get_model(),
+                messages=messages,
+                temperature=0
+            )
+            catalyst_answer = response.choices[0].message.content.strip()
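+            # Parse with literal_eval rather than eval: the model output is untrusted text
+            from ast import literal_eval
+            catalyst_list = literal_eval(catalyst_answer)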
+            messages.append({"role": "assistant", "content": catalyst_answer})
+        except Exception as e:
+            return {"error": f"Failed to extract catalysts: {str(e)}"}
+        result = {"catalysts": []}
+        for catalyst in catalyst_list:
+            # Step 2: Get performance template for each catalyst
+            perf_q = f"""Create a CATALYST_TEMPLATE filling in the performance of '{catalyst}' from the table representer,
+strictly adhering to these rules:
+Rule 1: Only include performances that actually exist in the table and appear in performance_list.
+Rule 2: Set all values of keys in PROPERTY_TEMPLATE to be " ". DO NOT INSERT ANY VALUE.
+Rule 3: Answer must be ONLY JSON format."""
+            messages.append({"role": "user", "content": perf_q})
+            try:
+                response = self.client.chat.completions.create(
+                    model=self.get_model(),
+                    messages=messages,
+                    temperature=0
+                )
+                perf_answer = response.choices[0].message.content.strip()
+                messages.append({"role": "assistant", "content": perf_answer})
+                # Step 3: Fill in property values
+                prop_q = """In PROPERTY_TEMPLATE, maintain all keys, and fill in values that exist in the table representer.
+If there are more than two "values" for the same performance, make it into a list. Include units in the values."""
+                messages.append({"role": "user", "content": prop_q})
+                response = self.client.chat.completions.create(
+                    model=self.get_model(),
+                    messages=messages,
+                    temperature=0
+                )
+                prop_answer = response.choices[0].message.content.strip()
+                # Step 4: Remove empty keys
+                delete_q = "Remove keys with no values from previous version of CATALYST_TEMPLATE. Output only JSON."
+                messages.append({"role": "assistant", "content": prop_answer})
+                messages.append({"role": "user", "content": delete_q})
+                response = self.client.chat.completions.create(
+                    model=self.get_model(),
+                    messages=messages,
+                    temperature=0
+                )
+                final_answer = response.choices[0].message.content.strip()
+                # Parse JSON
+                if "```" in final_answer:
+                    final_answer = final_answer.replace("```json", "").replace("```", "")
+                catalyst_data = json.loads(final_answer)
+                result["catalysts"].append(catalyst_data)
+            except Exception as e:
+                result["catalysts"].append({catalyst: {"error": str(e)}})
+        return result["catalysts"][0] if len(result["catalysts"]) == 1 else result
+    def extract_few_shot(self, table_representation: str, examples: Optional[List[Dict]] = None) -> Dict:
+        """
+        Extract data using few-shot approach with example pairs.
+        Args:
+            table_representation: TSV or JSON representation of the table
+            examples: List of input/output example pairs
+        Returns:
+            Extracted catalyst data in JSON format
+        """
+        messages = [{"role": "system", "content": self.get_system_prompt('few-shot')}]
+        # Add examples if provided
+        if examples:
+            for ex in examples:
+                messages.append({"role": "user", "content": ex.get('input', '')})
+                messages.append({"role": "assistant", "content": ex.get('output', '')})
+        messages.append({"role": "user", "content": table_representation})
+        try:
+            response = self.client.chat.completions.create(
+                model=self.get_model(),
+                messages=messages,
+                temperature=0
+            )
+            result = response.choices[0].message.content.strip()
+            if "```" in result:
+                result = result.replace("```json", "").replace("```", "")
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"raw_response": result, "error": "Could not parse as JSON"}
+        except Exception as e:
+            return {"error": str(e)}
+    def extract_with_fine_tuned(self, table_representation: str, model_name: str) -> Dict:
+        """
+        Extract data using a fine-tuned model.
+        Args:
+            table_representation: TSV or JSON representation of the table
+            model_name: Name of the fine-tuned model
+        Returns:
+            Extracted catalyst data in JSON format
+        """
+        messages = [
+            {"role": "system", "content": self.get_system_prompt('fine-tuning')},
+            {"role": "user", "content": str(table_representation)}
+        ]
+        try:
+            response = self.client.chat.completions.create(
+                model=model_name,
+                messages=messages,
+                temperature=0
+            )
+            result = response.choices[0].message.content.strip()
+            try:
+                return json.loads(result)
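+            except json.JSONDecodeError:
+                # Fall back to literal_eval for Python-literal output (e.g. single-quoted keys)
+                from ast import literal_eval
+                return literal_eval(result)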
+        except Exception as e:
+            return {"error": str(e)}
+# =============================================================================
+# Session Management
+# =============================================================================
+class SessionManager:
+    """Manages extraction sessions and data storage."""
+    def __init__(self, storage_dir: Optional[str] = None):
+        self.storage_dir = storage_dir or tempfile.mkdtemp(prefix="matablgpt_")
+        os.makedirs(self.storage_dir, exist_ok=True)
+        self.sessions: Dict[str, SessionData] = {}
+    def create_session(self) -> str:
+        """Create a new session."""
+        session_id = f"session_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}"
+        session_dir = os.path.join(self.storage_dir, session_id)
+        os.makedirs(session_dir, exist_ok=True)
+        self.sessions[session_id] = SessionData(
+            session_id=session_id,
+            created_at=datetime.now().isoformat()
+        )
+        return session_id
+    def get_session(self, session_id: str) -> Optional[SessionData]:
+        """Get session by ID."""
+        return self.sessions.get(session_id)
+    def save_table(self, session_id: str, table_name: str, table_data: TableData) -> bool:
+        """Save table data to session."""
+        session = self.get_session(session_id)
+        if not session:
+            return False
+        session.tables[table_name] = table_data
+        return True
+    def save_representation(self, session_id: str, table_name: str, representation: str, format_type: str) -> bool:
+        """Save table representation to session."""
+        session = self.get_session(session_id)
+        if not session:
+            return False
+        key = f"{table_name}_{format_type}"
+        session.representations[key] = representation
+        return True
+    def save_extraction(self, session_id: str, result: ExtractionResult) -> bool:
+        """Save extraction result to session."""
+        session = self.get_session(session_id)
+        if not session:
+            return False
+        session.extractions.append(result)
+        return True
+    def export_session(self, session_id: str) -> Dict:
+        """Export session data as dictionary."""
+        session = self.get_session(session_id)
+        if not session:
+            return {"error": "Session not found"}
+        return {
+            "session_id": session.session_id,
+            "created_at": session.created_at,
+            "tables_count": len(session.tables),
+            "representations_count": len(session.representations),
+            "extractions_count": len(session.extractions),
+            "extractions": [
+                {
+                    "table_name": e.table_name,
+                    "model_type": e.model_type,
+                    "result": e.result,
+                    "timestamp": e.timestamp,
+                    "follow_up_applied": e.follow_up_applied
+                }
+                for e in session.extractions
+            ]
+        }
+# =============================================================================
+# MCP Server Definition
+# =============================================================================
+# Initialize global components
+table_representer = TableRepresenter()
+table_to_json = TableToJSON()
+table_splitter = TableSplitter()
+session_manager = SessionManager()
+gpt_extractor = None  # Lazy initialization
+def get_extractor() -> GPTExtractor:
+    """Get or create GPT extractor instance."""
+    global gpt_extractor
+    if gpt_extractor is None:
+        gpt_extractor = GPTExtractor()
+    return gpt_extractor
+# Create MCP server
+mcp = FastMCP("MaTableGPT-MCP")
+# =============================================================================
+# MCP Tools
+# =============================================================================
+@mcp.tool()
+def create_session() -> Dict:
+    """
+    Create a new extraction session.
+    Returns a session ID that should be used for subsequent operations.
+    Sessions help organize and track table processing workflows.
+    """
+    session_id = session_manager.create_session()
+    return {
+        "success": True,
+        "session_id": session_id,
+        "message": "Session created successfully. Use this session_id for subsequent operations."
+    }
+@mcp.tool()
+def html_to_tsv_representation(
+    html_table: str,
+    title: str = "",
+    caption: str = "",
+    session_id: str = "",
+    table_name: str = ""
+) -> Dict:
+    """
+    Convert an HTML table to TSV (Tab-Separated Values) representation.
+    This format is optimized for GPT extraction as it preserves table structure
+    including merged cells, headers, and captions in a text format.
+    Args:
+        html_table: HTML string containing the table element
+        title: Optional title of the table
+        caption: Optional caption/footnotes of the table
+        session_id: Optional session ID to save the representation
+        table_name: Optional name for the table (used for saving)
+    Returns:
+        Dictionary containing the TSV representation
+    """
+    try:
+        representation = table_representer.html_to_tsv(html_table, title, caption)
+        result = {
+            "success": True,
+            "format": "TSV",
+            "representation": representation
+        }
+        # Save to session if provided
+        if session_id and table_name:
+            session_manager.save_representation(session_id, table_name, representation, "tsv")
+            result["saved_to_session"] = session_id
+        return result
+    except Exception as e:
+        return {"success": False, "error": str(e)}
+@mcp.tool()
+def html_to_json_representation(
+    html_table: str,
+    title: str = "",
+    caption: str = "",
+    session_id: str = "",
+    table_name: str = ""
+) -> Dict:
+    """
+    Convert an HTML table to JSON representation.
+    This format converts the table structure into a nested JSON dictionary
+    with column headers as keys and cell values as lists.
+    Args:
+        html_table: HTML string containing the table element
+        title: Optional title of the table
+        caption: Optional caption/footnotes of the table
+        session_id: Optional session ID to save the representation
+        table_name: Optional name for the table (used for saving)
+    Returns:
+        Dictionary containing the JSON representation
+    """
+    try:
+        representation = table_to_json.html_to_json(html_table, title, caption)
+        result = {
+            "success": True,
+            "format": "JSON",
+            "representation": representation
+        }
+        # Save to session if provided
+        if session_id and table_name:
+            session_manager.save_representation(
+                session_id, table_name, json.dumps(representation), "json"
+            )
+            result["saved_to_session"] = session_id
+        return result
+    except Exception as e:
+        return {"success": False, "error": str(e)}
+@mcp.tool()
+def analyze_table_structure(html_table: str) -> Dict:
+    """
+    Analyze the structure of an HTML table.
+    This tool examines the table to identify:
+    - Total number of rows
+    - Presence of thead/tbody elements
+    - Header rows vs body rows
+    - Merged cells
+    Use this to understand complex tables before processing.
+    Args:
+        html_table: HTML string containing the table element
+    Returns:
+        Dictionary containing structural analysis
+    """
+    try:
+        analysis = table_splitter.analyze_table_structure(html_table)
+        return {"success": True, "analysis": analysis}
+    except Exception as e:
+        return {"success": False, "error": str(e)}
+@mcp.tool()
+def split_complex_table(
+    html_table: str,
+    title: str = "",
+    caption: str = ""
+) -> Dict:
+    """
+    Split a complex table into simpler components.
+    Complex tables with multiple internal headers or sub-tables are split
+    into individual tables that are easier to process.
+    Args:
+        html_table: HTML string containing the table element
+        title: Optional title of the table
+        caption: Optional caption/footnotes of the table
+    Returns:
+        Dictionary containing list of split table components
+    """
+    try:
+        split_tables = table_splitter.split_table(html_table, title, caption)
+        return {
+            "success": True,
+            "table_count": len(split_tables),
+            "tables": split_tables
+        }
+    except Exception as e:
+        return {"success": False, "error": str(e)}
+@mcp.tool()
+def extract_catalyst_data_zero_shot(
+    table_representation: str,
+    session_id: str = "",
+    table_name: str = ""
+) -> Dict:
+    """
+    Extract catalyst data from table representation using zero-shot GPT.
+    This uses a multi-step questioning approach to:
+    1. Identify catalysts in the table
+    2. Determine performance metrics for each catalyst
+    3. Extract property values
+    4. Clean up the result
+    Args:
+        table_representation: TSV or JSON representation of the table
+        session_id: Optional session ID to save the extraction
+        table_name: Optional name for the table
+    Returns:
+        Dictionary containing extracted catalyst data
+    """
+    try:
+        extractor = get_extractor()
+        result = extractor.extract_zero_shot(table_representation)
+        extraction_result = ExtractionResult(
+            session_id=session_id or "no_session",
+            table_name=table_name or "unnamed",
+            model_type="zero-shot",
+            result=result,
+            timestamp=datetime.now().isoformat()
+        )
+        if session_id:
+            session_manager.save_extraction(session_id, extraction_result)
+        return {
+            "success": True,
+            "model_type": "zero-shot",
+            "extraction": result
+        }
+    except Exception as e:
+        return {"success": False, "error": str(e)}
+@mcp.tool()
+def extract_catalyst_data_few_shot(
+    table_representation: str,
+    examples: List[Dict] = None,
+    session_id: str = "",
+    table_name: str = ""
+) -> Dict:
+    """
+    Extract catalyst data from table representation using few-shot GPT.
+    Provide example input/output pairs to guide the extraction.
+    Args:
+        table_representation: TSV or JSON representation of the table
+        examples: List of {"input": ..., "output": ...} example pairs
+        session_id: Optional session ID to save the extraction
+        table_name: Optional name for the table
+    Returns:
+        Dictionary containing extracted catalyst data
+    """
+    try:
+        extractor = get_extractor()
+        result = extractor.extract_few_shot(table_representation, examples or [])
+        extraction_result = ExtractionResult(
+            session_id=session_id or "no_session",
+            table_name=table_name or "unnamed",
+            model_type="few-shot",
+            result=result,
+            timestamp=datetime.now().isoformat()
+        )
+        if session_id:
+            session_manager.save_extraction(session_id, extraction_result)
+        return {
+            "success": True,
+            "model_type": "few-shot",
+            "extraction": result
+        }
+    except Exception as e:
+        return {"success": False, "error": str(e)}
+@mcp.tool()
+def extract_catalyst_data_fine_tuned(
+    table_representation: str,
+    model_name: str,
+    session_id: str = "",
+    table_name: str = ""
+) -> Dict:
+    """
+    Extract catalyst data using a fine-tuned GPT model.
+    Requires a pre-trained fine-tuned model name from OpenAI.
+    Args:
+        table_representation: TSV or JSON representation of the table
+        model_name: Name of the fine-tuned OpenAI model
+        session_id: Optional session ID to save the extraction
+        table_name: Optional name for the table
+    Returns:
+        Dictionary containing extracted catalyst data
+    """
+    try:
+        extractor = get_extractor()
+        result = extractor.extract_with_fine_tuned(table_representation, model_name)
+        extraction_result = ExtractionResult(
+            session_id=session_id or "no_session",
+            table_name=table_name or "unnamed",
+            model_type="fine-tuning",
+            result=result,
+            timestamp=datetime.now().isoformat()
+        )
+        if session_id:
+            session_manager.save_extraction(session_id, extraction_result)
+        return {
+            "success": True,
+            "model_type": "fine-tuning",
+            "model_name": model_name,
+            "extraction": result
+        }
+    except Exception as e:
+        return {"success": False, "error": str(e)}
+@mcp.tool()
+def get_session_data(session_id: str) -> Dict:
+    """
+    Get a summary of a session.
+    Returns the session's table and representation counts plus every stored
+    extraction, or {"error": "Session not found"} for an unknown session ID.
+    Args:
+        session_id: The session ID to retrieve
+    Returns:
+        Dictionary containing session data
+    """
+    return session_manager.export_session(session_id)
+@mcp.tool()
+def list_performance_types() -> Dict:
+    """
+    List all supported performance types for catalyst extraction.
+    These are the standard property names that can be extracted from
+    materials science literature tables about catalysts.
+    Returns:
+        Dictionary containing list of performance types
+    """
+    return {
+        "success": True,
+        "performance_types": GPTExtractor.PERFORMANCE_LIST,
+        "property_template": GPTExtractor.PROPERTY_TEMPLATE
+    }
+@mcp.tool()
+def validate_extraction_result(extraction: Dict) -> Dict:
+    """
+    Validate an extraction result against expected schema.
+    Checks if the extraction follows the expected format with
+    catalyst names, performance types, and property values.
+    Args:
+        extraction: The extraction result to validate
+    Returns:
+        Dictionary containing validation results
+    """
+    issues = []
+    warnings = []
+    if not isinstance(extraction, dict):
+        # Return the same keys as the normal path so callers can always read "warnings".
+        return {"valid": False, "issues": ["Extraction must be a dictionary"], "warnings": []}
+    # Check for error
+    if "error" in extraction:
+        issues.append(f"Extraction contains error: {extraction['error']}")
+    # Check structure
+    valid_performance_types = set(GPTExtractor.PERFORMANCE_LIST)
+    for catalyst_name, performances in extraction.items():
+        if catalyst_name in ["error", "raw_response", "catalysts"]:
+            continue
+        if not isinstance(performances, dict):
+            warnings.append(f"Catalyst '{catalyst_name}' should have dict of performances")
+            continue
+        for perf_name, properties in performances.items():
+            if perf_name not in valid_performance_types:
+                warnings.append(f"Unknown performance type: {perf_name}")
+            if isinstance(properties, dict):
+                for prop_key in properties.keys():
+                    if prop_key not in GPTExtractor.PROPERTY_TEMPLATE:
+                        warnings.append(f"Unknown property key: {prop_key}")
+    return {
+        "valid": len(issues) == 0,
+        "issues": issues,
+        "warnings": warnings
+    }
+@mcp.tool()
+def get_extraction_code_template(representation_format: str = "tsv", model_type: str = "zero-shot") -> Dict:
+    """
+    Get Python code template for local extraction.
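+    Returns code that can be run locally to perform extraction
+    without relying on the MCP service. The generated script always
+    issues a single system/user prompt; both arguments only label the
+    script and its placeholder text.
+    Args:
+        representation_format: Either 'tsv' or 'json'
+        model_type: One of 'zero-shot', 'few-shot', or 'fine-tuning'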
+    Returns:
+        Dictionary containing code template and instructions
+    """
+    code = f'''"""
+MaTableGPT Local Extraction Template
+Model Type: {model_type}
+Representation Format: {representation_format}
+"""
+from openai import OpenAI
+import json
+# Initialize client
+client = OpenAI(api_key="YOUR_API_KEY")
+# Performance types to extract
+PERFORMANCE_LIST = [
+    'overpotential', 'tafel_slope', 'Rct', 'stability', 'Cdl',
+    'onset_potential', 'current_density', 'potential', 'TOF', 'ECSA',
+    'water_splitting_potential', 'mass_activity', 'exchange_current_density',
+    'Rs', 'specific_activity', 'onset_overpotential', 'BET', 'surface_area',
+    'loading', 'apparent_activation_energy'
+]
+# Your table representation
+table_representation = """
+# Paste your {representation_format.upper()} representation here
+"""
+# System prompt
+system_prompt = """I will extract catalyst performance information from the table and create JSON format.
+Performance types: """ + str(PERFORMANCE_LIST) + """
+The JSON format will have performance within the catalyst, with elements:
+reaction type, value, electrolyte, condition, current density, versus, substrate.
+Output must contain only JSON dictionary."""
+# Extract
+response = client.chat.completions.create(
+    model="gpt-4-turbo-preview",
+    messages=[
+        {{"role": "system", "content": system_prompt}},
+        {{"role": "user", "content": table_representation}}
+    ],
+    temperature=0
+)
+result = response.choices[0].message.content.strip()
+print(json.dumps(json.loads(result), indent=2))
+'''
+    return {
+        "success": True,
+        "code": code,
+        "instructions": [
+            "1. Install openai package: pip install openai",
+            "2. Replace YOUR_API_KEY with your OpenAI API key",
+            "3. Paste your table representation in the designated area",
+            "4. Run the script"
+        ]
+    }
+@mcp.tool()
+def get_environment_requirements() -> Dict:
+    """
+    Get the required environment setup for MaTableGPT.
+    Returns package requirements and setup instructions.
+    Supports third-party API services (reverse proxy, API aggregators).
+    Returns:
+        Dictionary containing requirements and instructions
+    """
+    return {
+        "success": True,
+        "python_version": ">=3.10",
+        "required_packages": [
+            "openai>=1.0.0  # OpenAI-compatible client, works with third-party APIs",
+            "beautifulsoup4>=4.12.0",
+            "pandas>=2.0.0",
+            "lxml>=4.9.0",
+            "mcp>=0.1.0"
+        ],
+        "optional_packages": [
+            "nltk>=3.8.0  # For table splitting analysis"
+        ],
+        "environment_variables": {
+            "LLM_API_KEY": "(Required) Your API key from third-party service",
+            "LLM_API_BASE": "(Required) API base URL, e.g., https://api.your-service.com/v1",
+            "LLM_MODEL": "(Optional) Model name, default: gpt-4-turbo-preview",
+            "---": "--- Alternative variable names (also supported) ---",
+            "OPENAI_API_KEY": "Alternative to LLM_API_KEY",
+            "OPENAI_API_BASE": "Alternative to LLM_API_BASE",
+            "OPENAI_MODEL": "Alternative to LLM_MODEL"
+        },
+        "setup_instructions": [
+            "1. Create virtual environment: python -m venv venv",
+            "2. Activate: venv\\Scripts\\activate (Windows) or source venv/bin/activate (Unix)",
+            "3. Install: pip install -r requirements.txt",
+            "4. Set environment variables (use your API provider's info):",
+            "   - LLM_API_KEY=your_api_key (Required)",
+            "   - LLM_API_BASE=https://api.your-service.com/v1 (Required)",
+            "   - LLM_MODEL=gpt-4-turbo-preview (Optional)",
+            "5. Run: python start_mcp.py"
+        ],
+        "third_party_api_example": {
+            "description": "Configuration for third-party API services (reverse proxy, OneAPI, etc.)",
+            "windows_powershell": [
+                "$env:LLM_API_KEY = 'sk-xxxx'",
+                "$env:LLM_API_BASE = 'https://api.your-service.com/v1'",
+                "$env:LLM_MODEL = 'gpt-4-turbo-preview'",
+                "python start_mcp.py"
+            ],
+            "windows_cmd": [
+                "set LLM_API_KEY=sk-xxxx",
+                "set LLM_API_BASE=https://api.your-service.com/v1",
+                "set LLM_MODEL=gpt-4-turbo-preview",
+                "python start_mcp.py"
+            ],
+            "unix_bash": [
+                "export LLM_API_KEY=sk-xxxx",
+                "export LLM_API_BASE=https://api.your-service.com/v1",
+                "export LLM_MODEL=gpt-4-turbo-preview",
+                "python start_mcp.py"
+            ],
+            "docker_env": [
+                "-e LLM_API_KEY=sk-xxxx",
+                "-e LLM_API_BASE=https://api.your-service.com/v1",
+                "-e LLM_MODEL=gpt-4-turbo-preview"
+            ],
+            "huggingface_secrets": [
+                "LLM_API_KEY = sk-xxxx",
+                "LLM_API_BASE = https://api.your-service.com/v1",
+                "LLM_MODEL = gpt-4-turbo-preview"
+            ]
+        }
+    }
+# =============================================================================
+# Server Entry Point
+# =============================================================================
+def main():
+    """Run the MCP server."""
+    mcp.run()
+if __name__ == "__main__":
+    main()
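
The checks performed by validate_extraction_result can be tried without the MCP server. The following is a minimal standalone sketch, not the service's own code: PERFORMANCE_TYPES is copied from the code template above, while PROPERTY_KEYS is an assumed stand-in for GPTExtractor.PROPERTY_TEMPLATE, whose real keys are not shown in this diff. The sample catalyst data is made up for illustration.

```python
# Standalone sketch of the validation rules; names below are illustrative.
PERFORMANCE_TYPES = {
    "overpotential", "tafel_slope", "Rct", "stability", "Cdl",
    "onset_potential", "current_density", "potential", "TOF", "ECSA",
    "water_splitting_potential", "mass_activity", "exchange_current_density",
    "Rs", "specific_activity", "onset_overpotential", "BET", "surface_area",
    "loading", "apparent_activation_energy",
}
# Assumption: the real PROPERTY_TEMPLATE keys may be spelled differently.
PROPERTY_KEYS = {
    "reaction_type", "value", "electrolyte", "condition",
    "current_density", "versus", "substrate",
}
# Top-level keys that carry metadata rather than catalyst entries.
RESERVED_KEYS = {"error", "raw_response", "catalysts"}


def validate(extraction):
    """Return issues (fatal) and warnings (non-fatal) for an extraction dict."""
    if not isinstance(extraction, dict):
        return {"valid": False, "issues": ["Extraction must be a dictionary"], "warnings": []}
    issues, warnings = [], []
    if "error" in extraction:
        issues.append(f"Extraction contains error: {extraction['error']}")
    for catalyst, performances in extraction.items():
        if catalyst in RESERVED_KEYS:
            continue
        if not isinstance(performances, dict):
            warnings.append(f"Catalyst '{catalyst}' should have dict of performances")
            continue
        for perf_name, properties in performances.items():
            if perf_name not in PERFORMANCE_TYPES:
                warnings.append(f"Unknown performance type: {perf_name}")
            if isinstance(properties, dict):
                for key in properties:
                    if key not in PROPERTY_KEYS:
                        warnings.append(f"Unknown property key: {key}")
    return {"valid": not issues, "issues": issues, "warnings": warnings}


# Made-up sample: one known performance type and one unknown one.
sample = {
    "NiFe-LDH": {
        "overpotential": {"value": "270 mV", "electrolyte": "1 M KOH"},
        "turnover": {"value": "0.1 s-1"},
    }
}
report = validate(sample)
print(report)
```

Only an "error" key or a non-dict input makes the result invalid; unknown performance types and property keys are reported as warnings, so a result can be valid and still need review.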

requirements.txt ADDED Viewed

	@@ -0,0 +1,24 @@

+# MaTableGPT MCP Service Requirements
+# ====================================
+# Core MCP Framework
+mcp>=0.1.0
+# OpenAI API for GPT extraction
+openai>=1.0.0
+# HTML Parsing
+beautifulsoup4>=4.12.0
+lxml>=4.9.0
+# Data Processing
+pandas>=2.0.0
+# Web Framework for HuggingFace Space
+gradio>=4.0.0
+# Async Support
+httpx>=0.25.0
+# Optional: For table splitting analysis
+nltk>=3.8.0

start_mcp.py ADDED Viewed

	@@ -0,0 +1,144 @@

+#!/usr/bin/env python3
+"""
+MaTableGPT MCP Server Launcher
+==============================
+This script starts the MaTableGPT MCP service for extracting
+table data from materials science literature.
+Usage:
+    python start_mcp.py [--host HOST] [--port PORT] [--mode MODE] [--debug]
+Arguments:
+    --host      Host address (default: 0.0.0.0)
+    --port      Port number (default: 7865)
+    --mode      Run mode: 'stdio' or 'sse' (default: stdio)
+    --debug     Enable debug logging
+"""
+import os
+import sys
+import argparse
+import logging
+# Add parent directory to path for imports
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger("matablegpt-mcp-launcher")
+def check_environment():
+    """Check if required environment variables are set."""
+    warnings = []
+    # mcp_service documents LLM_API_KEY as the primary name and OPENAI_API_KEY as an alternative.
+    if not (os.environ.get('LLM_API_KEY') or os.environ.get('OPENAI_API_KEY')):
+        warnings.append(
+            "Neither LLM_API_KEY nor OPENAI_API_KEY is set. GPT extraction features will not work. "
+            "Set it with: export LLM_API_KEY=your_key (Unix) or "
+            "set LLM_API_KEY=your_key (Windows)"
+        )
+    return warnings
+def check_dependencies():
+    """Check if required packages are installed."""
+    missing = []
+    required = [
+        ('mcp', 'mcp'),
+        ('openai', 'openai'),
+        ('bs4', 'beautifulsoup4'),
+        ('pandas', 'pandas'),
+        ('lxml', 'lxml')
+    ]
+    for module, package in required:
+        try:
+            __import__(module)
+        except ImportError:
+            missing.append(package)
+    return missing
+def main():
+    """Main entry point."""
+    parser = argparse.ArgumentParser(
+        description="MaTableGPT MCP Server - Table Data Extraction from Materials Science Literature"
+    )
+    parser.add_argument(
+        '--host',
+        default='0.0.0.0',
+        help='Host address (default: 0.0.0.0)'
+    )
+    parser.add_argument(
+        '--port',
+        type=int,
+        default=7865,
+        help='Port number (default: 7865)'
+    )
+    parser.add_argument(
+        '--mode',
+        choices=['stdio', 'sse'],
+        default='stdio',
+        help='Run mode: stdio for standard I/O, sse for Server-Sent Events (default: stdio)'
+    )
+    parser.add_argument(
+        '--debug',
+        action='store_true',
+        help='Enable debug logging'
+    )
+    args = parser.parse_args()
+    if args.debug:
+        logging.getLogger().setLevel(logging.DEBUG)
+    # Check dependencies
+    missing = check_dependencies()
+    if missing:
+        logger.error(f"Missing required packages: {', '.join(missing)}")
+        logger.error(f"Install with: pip install {' '.join(missing)}")
+        sys.exit(1)
+    # Check environment
+    warnings = check_environment()
+    for warning in warnings:
+        logger.warning(warning)
+    # Display startup info
+    logger.info("=" * 60)
+    logger.info("MaTableGPT MCP Server")
+    logger.info("=" * 60)
+    logger.info(f"Mode: {args.mode}")
+    if args.mode == 'sse':
+        logger.info(f"Host: {args.host}")
+        logger.info(f"Port: {args.port}")
+    logger.info("=" * 60)
+    # Import and run MCP service
+    try:
+        from mcp_service import mcp
+        if args.mode == 'stdio':
+            logger.info("Starting MCP server in stdio mode...")
+            mcp.run()
+        else:
+            logger.info(f"Starting MCP server in SSE mode on {args.host}:{args.port}...")
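+            # Assumes FastMCP comes from the official `mcp` SDK, whose run() takes no
+            # host/port arguments; the bind address is read from the server settings.
+            mcp.settings.host = args.host
+            mcp.settings.port = args.port
+            mcp.run(transport='sse')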
+    except ImportError as e:
+        logger.error(f"Failed to import MCP service: {e}")
+        sys.exit(1)
+    except Exception as e:
+        logger.error(f"Error starting MCP server: {e}")
+        sys.exit(1)
+if __name__ == "__main__":
+    main()
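
get_environment_requirements in mcp_service.py lists two families of variable names (LLM_* and OPENAI_*) and a default model. The following is a minimal sketch of how such a fallback could be resolved. The helper names first_set and resolve_llm_config are hypothetical, and the precedence of LLM_* over OPENAI_* is an assumption, since the diff does not show how GPTExtractor reads its configuration.

```python
from typing import Mapping, Optional

# Default model name documented in get_environment_requirements.
DEFAULT_MODEL = "gpt-4-turbo-preview"


def first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first non-empty value among the given variable names."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


def resolve_llm_config(env: Mapping[str, str]) -> dict:
    """Resolve API settings, preferring LLM_* names (assumed precedence)."""
    return {
        "api_key": first_set(env, "LLM_API_KEY", "OPENAI_API_KEY"),
        "api_base": first_set(env, "LLM_API_BASE", "OPENAI_API_BASE"),
        "model": first_set(env, "LLM_MODEL", "OPENAI_MODEL") or DEFAULT_MODEL,
    }


# Example with a mixed environment; pass os.environ in real use.
config = resolve_llm_config({
    "OPENAI_API_KEY": "sk-xxxx",
    "LLM_API_BASE": "https://api.your-service.com/v1",
})
print(config)
```

Taking the environment as a mapping rather than reading os.environ directly keeps the helper easy to test and lets a launcher warn only when both names for a setting are missing.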