---
title: MaTableGPT MCP
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# MaTableGPT MCP Service

[![HuggingFace Spaces](https://img.shields.io/badge/🤗-HuggingFace%20Spaces-blue)](https://huggingface.co/spaces)
[![MCP](https://img.shields.io/badge/MCP-Compatible-green)](https://modelcontextprotocol.io/)

**GPT-based Table Data Extractor from Materials Science Literature**

A Model Context Protocol (MCP) service that extracts structured catalyst performance data from HTML tables in materials science publications.

## 🌟 Features

### Table Representation

- **HTML to TSV**: Convert HTML tables to tab-separated format with preserved structure
- **HTML to JSON**: Convert HTML tables to nested JSON format
- **Table Splitting**: Break down complex tables with multiple headers into simpler components

### GPT-based Extraction

- **Zero-shot**: Multi-step questioning approach without examples
- **Few-shot**: Guided extraction with input/output examples
- **Fine-tuned**: Use pre-trained specialized models

### Session Management

- Track multiple table processing workflows
- Store representations and extractions
- Export session data for analysis

## 🚀 Quick Start (HuggingFace Space SSE Mode)

This service runs as a **pure MCP SSE server** on HuggingFace Space, accessible via an SSE endpoint.
**SSE Endpoint**: `https://your-space-name.hf.space/sse`

### Connect from Cursor/Claude Desktop

```json
{
  "mcpServers": {
    "matablgpt": {
      "url": "https://your-space-name.hf.space/sse"
    }
  }
}
```

## 📦 Installation

### Prerequisites

- Python 3.8+
- OpenAI-compatible API key (for GPT extraction)

### Local Installation

```bash
# Clone or copy the mcp_output folder
cd mcp_output

# Create virtual environment
python -m venv venv

# Activate (Windows)
venv\Scripts\activate

# Activate (Unix/Mac)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set API configuration (use your third-party API service info)

# Windows PowerShell
$env:LLM_API_KEY = "your_api_key"
$env:LLM_API_BASE = "https://api.your-service.com/v1"
$env:LLM_MODEL = "gpt-4-turbo-preview"

# Windows CMD
set LLM_API_KEY=your_api_key
set LLM_API_BASE=https://api.your-service.com/v1
set LLM_MODEL=gpt-4-turbo-preview

# Unix/Mac
export LLM_API_KEY=your_api_key
export LLM_API_BASE=https://api.your-service.com/v1
export LLM_MODEL=gpt-4-turbo-preview
```

## 🔑 Environment Variables

This service supports third-party API services (reverse proxy, OneAPI, API aggregators, etc.).
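As a rough sketch of the precedence described below, a client could check the primary `LLM_*` variables first and fall back to the `OPENAI_*` aliases. The `resolve_env` helper is purely illustrative and not part of the service's code:

```python
import os

def resolve_env(*names, default=None):
    """Return the value of the first environment variable in `names` that is set."""
    for name in names:
        value = os.environ.get(name)
        if value:
            return value
    return default

# LLM_* takes precedence; OPENAI_* is the documented fallback.
api_key = resolve_env("LLM_API_KEY", "OPENAI_API_KEY")
api_base = resolve_env("LLM_API_BASE", "OPENAI_API_BASE")
model = resolve_env("LLM_MODEL", "OPENAI_MODEL", default="gpt-4-turbo-preview")
```

With neither `LLM_MODEL` nor `OPENAI_MODEL` set, `model` falls back to the default shown in the table below.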
| Variable | Required | Description |
|----------|----------|-------------|
| `LLM_API_KEY` | ✅ Yes | Your API key from the service provider |
| `LLM_API_BASE` | ✅ Yes | API base URL, e.g., `https://api.your-service.com/v1` |
| `LLM_MODEL` | ❌ No | Model name (default: `gpt-4-turbo-preview`) |

**Alternative variable names (also supported):**

| Variable | Description |
|----------|-------------|
| `OPENAI_API_KEY` | Alternative to `LLM_API_KEY` |
| `OPENAI_API_BASE` | Alternative to `LLM_API_BASE` |
| `OPENAI_MODEL` | Alternative to `LLM_MODEL` |

## 🚀 Usage

### Start MCP Server (SSE mode - Default for HuggingFace Space)

```bash
# Default: SSE mode on port 7860
python start_mcp.py

# Custom port
python start_mcp.py --mode sse --port 8080
```

### Start MCP Server (stdio mode - For local Cursor integration)

```bash
python start_mcp.py --mode stdio
```

## 🔧 MCP Tools Reference

### Session Management

| Tool | Description |
|------|-------------|
| `create_session` | Create a new extraction session |
| `get_session_data` | Retrieve all data from a session |

### Table Processing

| Tool | Description |
|------|-------------|
| `html_to_tsv_representation` | Convert HTML table to TSV format |
| `html_to_json_representation` | Convert HTML table to JSON format |
| `analyze_table_structure` | Analyze table structure (headers, merged cells) |
| `split_complex_table` | Split tables with multiple internal headers |

### Data Extraction

| Tool | Description |
|------|-------------|
| `extract_catalyst_data_zero_shot` | Extract using zero-shot GPT |
| `extract_catalyst_data_few_shot` | Extract with example pairs |
| `extract_catalyst_data_fine_tuned` | Extract using fine-tuned model |
| `batch_extract_tables` | Extract from multiple tables in batch |

### Follow-up & Refinement

| Tool | Description |
|------|-------------|
| `apply_follow_up_questions` | Refine extraction with iterative Q&A (from original MaTableGPT) |

### Evaluation

| Tool | Description |
|------|-------------|
| `evaluate_extraction` | Compute Structure F1 Score and Value Accuracy |
| `validate_extraction_result` | Validate extraction against schema |

### Utilities

| Tool | Description |
|------|-------------|
| `list_performance_types` | List supported catalyst performance types |
| `get_extraction_code_template` | Get Python code for local extraction |
| `get_environment_requirements` | Get setup requirements |

## 📋 Supported Performance Types

The following catalyst performance types can be extracted:

- `overpotential`, `tafel_slope`, `Rct`, `stability`, `Cdl`
- `onset_potential`, `current_density`, `potential`, `TOF`, `ECSA`
- `water_splitting_potential`, `mass_activity`, `exchange_current_density`
- `Rs`, `specific_activity`, `onset_overpotential`, `BET`, `surface_area`
- `loading`, `apparent_activation_energy`

## 🔄 Workflow Example

### 1. Create a session

```python
result = create_session()
session_id = result["session_id"]
```

### 2. Convert HTML table to representation

```python
html = "..."
tsv = html_to_tsv_representation(
    html_table=html,
    title="Table 1: Catalyst Performance",
    caption="OER performance in 1M KOH",
    session_id=session_id,
    table_name="table1",
)
```

### 3. Extract catalyst data

```python
extraction = extract_catalyst_data_zero_shot(
    table_representation=tsv["representation"],
    session_id=session_id,
    table_name="table1",
)
```

### 4. Validate and export

```python
validation = validate_extraction_result(extraction["extraction"])
session_data = get_session_data(session_id)
```

## 🐳 Docker Deployment

### Build image

```bash
docker build -t matablgpt-mcp .
```

### Run container (SSE mode)

```bash
docker run -p 7860:7860 \
  -e LLM_API_KEY=your_key \
  -e LLM_API_BASE=https://api.your-service.com/v1 \
  matablgpt-mcp
```

## 🤗 HuggingFace Spaces Deployment

1. Create a new Space with the **Docker SDK**
2. Upload all files from `mcp_output/`
3. Add secrets in the Space settings:
   - `LLM_API_KEY`: Your API key
   - `LLM_API_BASE`: Your API base URL (e.g., `https://api.your-service.com/v1`)
   - `LLM_MODEL`: (Optional) Model name
4. The Space will auto-build and deploy the MCP SSE service
5. Connect via: `https://your-space-name.hf.space/sse`

## 📝 MCP Client Configuration

### For Cursor (SSE mode - HuggingFace Space)

Add to `~/.cursor/mcp.json`:

```json
{
  "mcpServers": {
    "matablgpt": {
      "url": "https://your-space-name.hf.space/sse"
    }
  }
}
```

### For Cursor (stdio mode - Local)

```json
{
  "mcpServers": {
    "matablgpt": {
      "command": "python",
      "args": ["F:/Material_Agent/MaTableGPT/mcp_output/start_mcp.py", "--mode", "stdio"],
      "env": {
        "LLM_API_KEY": "your_key",
        "LLM_API_BASE": "https://api.your-service.com/v1"
      }
    }
  }
}
```

### For Claude Desktop

```json
{
  "mcpServers": {
    "matablgpt": {
      "url": "https://your-space-name.hf.space/sse"
    }
  }
}
```

## 📄 Output Format

Extracted data follows this JSON schema:

```json
{
  "catalyst_name": {
    "overpotential": {
      "electrolyte": "1.0 M KOH",
      "reaction_type": "OER",
      "value": "230 mV",
      "current_density": "10 mA/cm²"
    },
    "tafel_slope": {
      "electrolyte": "1.0 M KOH",
      "reaction_type": "OER",
      "value": "45 mV/dec"
    }
  }
}
```

## 🙏 Acknowledgments

Based on [MaTableGPT](https://github.com/KIST-CSRC/MaTableGPT) - GPT-based Table Data Extractor from Materials Science Literature.

## 📜 License

MIT License