SEUyishu committed
Commit 84a8f07 · verified · 1 Parent(s): 7d21217

Upload 7 files

Files changed (7)
  1. Dockerfile +50 -0
  2. README.md +280 -10
  3. __init__.py +33 -0
  4. app.py +627 -0
  5. mcp_service.py +1413 -0
  6. requirements.txt +24 -0
  7. start_mcp.py +144 -0
Dockerfile ADDED
@@ -0,0 +1,50 @@
+ # MaTableGPT MCP Service Docker Image
+ # ====================================
+ # For HuggingFace Spaces Deployment
+
+ FROM python:3.10-slim
+
+ # Set working directory
+ WORKDIR /app
+
+ # Set environment variables
+ ENV PYTHONDONTWRITEBYTECODE=1
+ ENV PYTHONUNBUFFERED=1
+ ENV GRADIO_SERVER_NAME=0.0.0.0
+ ENV GRADIO_SERVER_PORT=7860
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for better caching
+ COPY requirements.txt .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir --upgrade pip && \
+     pip install --no-cache-dir -r requirements.txt
+
+ # Download NLTK data for table splitting
+ RUN python -c "import nltk; nltk.download('punkt')"
+
+ # Copy application code
+ COPY . .
+
+ # Create necessary directories
+ RUN mkdir -p /app/sessions /app/temp
+
+ # Set permissions for HuggingFace Spaces
+ RUN chmod -R 777 /app/sessions /app/temp
+
+ # Expose ports
+ # 7860 for Gradio, 7865 for MCP SSE
+ EXPOSE 7860 7865
+
+ # Health check (urlopen raises on connection errors and on HTTP 4xx/5xx responses)
+ HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/', timeout=10)" || exit 1
+
+ # Run the application
+ CMD ["python", "app.py"]
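The `HEALTHCHECK` above probes the Gradio port with a Python one-liner. As a rough sketch of the same probe written as a reusable function, the snippet below uses only the standard library and treats any connection error, timeout, or non-2xx status as unhealthy. The name `is_healthy` is hypothetical and does not appear in the uploaded files.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers refused connections and HTTP 4xx/5xx (HTTPError);
        # OSError covers socket timeouts; ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    # Hypothetical usage: exit non-zero when the Gradio port does not respond.
    raise SystemExit(0 if is_healthy("http://localhost:7860/") else 1)
```

Unlike a bare `requests.get(...)`, this does not depend on a third-party package being installed in the image, and an error response from the app counts as a failed check.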
README.md CHANGED
@@ -1,10 +1,280 @@
- ---
- title: MatTableGPT
- emoji: 🚀
- colorFrom: green
- colorTo: green
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: MaTableGPT MCP
+ emoji: 🔬
+ colorFrom: blue
+ colorTo: green
+ sdk: docker
+ pinned: false
+ license: mit
+ app_port: 7860
+ ---
+
+ # MaTableGPT MCP Service
+
+ [![HuggingFace Spaces](https://img.shields.io/badge/🤗-HuggingFace%20Spaces-blue)](https://huggingface.co/spaces)
+ [![MCP](https://img.shields.io/badge/MCP-Compatible-green)](https://modelcontextprotocol.io/)
+
+ **GPT-based Table Data Extractor from Materials Science Literature**
+
+ A Model Context Protocol (MCP) service that extracts structured catalyst performance data from HTML tables in materials science publications.
+
+ ## 🌟 Features
+
+ ### Table Representation
+ - **HTML to TSV**: Convert HTML tables to tab-separated format with preserved structure
+ - **HTML to JSON**: Convert HTML tables to nested JSON format
+ - **Table Splitting**: Break down complex tables with multiple headers into simpler components
+
+ ### GPT-based Extraction
+ - **Zero-shot**: Multi-step questioning approach without examples
+ - **Few-shot**: Guided extraction with input/output examples
+ - **Fine-tuned**: Extraction with a fine-tuned, task-specific model
+
+ ### Session Management
+ - Track multiple table processing workflows
+ - Store representations and extractions
+ - Export session data for analysis
+
+ ## 📦 Installation
+
+ ### Prerequisites
+ - Python 3.8+
+ - API key and base URL for OpenAI or an OpenAI-compatible service (for GPT extraction)
+
+ ### Local Installation
+
+ ```bash
+ # Clone or copy the mcp_output folder
+ cd mcp_output
+
+ # Create virtual environment
+ python -m venv venv
+
+ # Activate (Windows)
+ venv\Scripts\activate
+ # Activate (Unix/Mac)
+ source venv/bin/activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Set API configuration (use your third-party API service info)
+ # Windows PowerShell
+ $env:LLM_API_KEY = "your_api_key"
+ $env:LLM_API_BASE = "https://api.your-service.com/v1"
+ $env:LLM_MODEL = "gpt-4-turbo-preview"
+
+ # Windows CMD
+ set LLM_API_KEY=your_api_key
+ set LLM_API_BASE=https://api.your-service.com/v1
+ set LLM_MODEL=gpt-4-turbo-preview
+
+ # Unix/Mac
+ export LLM_API_KEY=your_api_key
+ export LLM_API_BASE=https://api.your-service.com/v1
+ export LLM_MODEL=gpt-4-turbo-preview
+ ```
+
+ ## 🔑 Environment Variables
+
+ This service supports third-party API services (reverse proxy, OneAPI, API aggregators, etc.)
+
+ | Variable | Required | Description |
+ |----------|----------|-------------|
+ | `LLM_API_KEY` | ✅ Yes | Your API key from the service provider |
+ | `LLM_API_BASE` | ✅ Yes | API base URL, e.g., `https://api.your-service.com/v1` |
+ | `LLM_MODEL` | ❌ No | Model name (default: gpt-4-turbo-preview) |
+
+ **Alternative variable names (also supported):**
+
+ | Variable | Description |
+ |----------|-------------|
+ | `OPENAI_API_KEY` | Alternative to LLM_API_KEY |
+ | `OPENAI_API_BASE` | Alternative to LLM_API_BASE |
+ | `OPENAI_MODEL` | Alternative to LLM_MODEL |
+
+ ## 🚀 Usage
+
+ ### Start MCP Server (stdio mode)
+
+ ```bash
+ python start_mcp.py
+ ```
+
+ ### Start MCP Server (SSE mode for web integration)
+
+ ```bash
+ python start_mcp.py --mode sse --port 7865
+ ```
+
+ ### Start Gradio Web Interface
+
+ ```bash
+ python app.py
+ ```
+
+ ## 🔧 MCP Tools Reference
+
+ ### Session Management
+
+ | Tool | Description |
+ |------|-------------|
+ | `create_session` | Create a new extraction session |
+ | `get_session_data` | Retrieve all data from a session |
+
+ ### Table Processing
+
+ | Tool | Description |
+ |------|-------------|
+ | `html_to_tsv_representation` | Convert HTML table to TSV format |
+ | `html_to_json_representation` | Convert HTML table to JSON format |
+ | `analyze_table_structure` | Analyze table structure (headers, merged cells) |
+ | `split_complex_table` | Split tables with multiple internal headers |
+
+ ### Data Extraction
+
+ | Tool | Description |
+ |------|-------------|
+ | `extract_catalyst_data_zero_shot` | Extract using zero-shot GPT |
+ | `extract_catalyst_data_few_shot` | Extract with example pairs |
+ | `extract_catalyst_data_fine_tuned` | Extract using fine-tuned model |
+
+ ### Utilities
+
+ | Tool | Description |
+ |------|-------------|
+ | `list_performance_types` | List supported catalyst performance types |
+ | `validate_extraction_result` | Validate extraction against schema |
+ | `get_extraction_code_template` | Get Python code for local extraction |
+ | `get_environment_requirements` | Get setup requirements |
+
+ ## 📋 Supported Performance Types
+
+ The following catalyst performance types can be extracted:
+
+ - `overpotential`, `tafel_slope`, `Rct`, `stability`, `Cdl`
+ - `onset_potential`, `current_density`, `potential`, `TOF`, `ECSA`
+ - `water_splitting_potential`, `mass_activity`, `exchange_current_density`
+ - `Rs`, `specific_activity`, `onset_overpotential`, `BET`, `surface_area`
+ - `loading`, `apparent_activation_energy`
+
+ ## 🔄 Workflow Example
+
+ ### 1. Create a session
+
+ ```python
+ result = create_session()
+ session_id = result["session_id"]
+ ```
+
+ ### 2. Convert HTML table to representation
+
+ ```python
+ html = "<table>...</table>"
+ tsv = html_to_tsv_representation(
+     html_table=html,
+     title="Table 1: Catalyst Performance",
+     caption="OER performance in 1M KOH",
+     session_id=session_id,
+     table_name="table1"
+ )
+ ```
+
+ ### 3. Extract catalyst data
+
+ ```python
+ extraction = extract_catalyst_data_zero_shot(
+     table_representation=tsv["representation"],
+     session_id=session_id,
+     table_name="table1"
+ )
+ ```
+
+ ### 4. Validate and export
+
+ ```python
+ validation = validate_extraction_result(extraction["extraction"])
+ session_data = get_session_data(session_id)
+ ```
+
+ ## 🐳 Docker Deployment
+
+ ### Build image
+
+ ```bash
+ docker build -t matablgpt-mcp .
+ ```
+
+ ### Run container
+
+ ```bash
+ docker run -p 7860:7860 -p 7865:7865 \
+     -e OPENAI_API_KEY=your_key \
+     -e OPENAI_API_BASE=https://api.your-service.com/v1 \
+     matablgpt-mcp
+ ```
+
+ ## 🤗 HuggingFace Spaces Deployment
+
+ 1. Create a new Space with Docker SDK
+ 2. Upload all files from `mcp_output/`
+ 3. Add `OPENAI_API_KEY` and `OPENAI_API_BASE` as secrets in Space settings
+ 4. Space will auto-build and deploy
+
+ ## 📝 MCP Client Configuration
+
+ Add to your MCP client configuration (e.g., Claude Desktop):
+
+ ```json
+ {
+   "mcpServers": {
+     "matablgpt": {
+       "command": "python",
+       "args": ["path/to/mcp_output/start_mcp.py"],
+       "env": {
+         "OPENAI_API_KEY": "your_key",
+         "OPENAI_API_BASE": "https://api.your-service.com/v1"
+       }
+     }
+   }
+ }
+ ```
+
+ Or for SSE mode:
+
+ ```json
+ {
+   "mcpServers": {
+     "matablgpt": {
+       "url": "http://localhost:7865/sse"
+     }
+   }
+ }
+ ```
+
+ ## 📄 Output Format
+
+ Extracted data follows this JSON schema:
+
+ ```json
+ {
+   "catalyst_name": {
+     "overpotential": {
+       "electrolyte": "1.0 M KOH",
+       "reaction_type": "OER",
+       "value": "230 mV",
+       "current_density": "10 mA/cm²"
+     },
+     "tafel_slope": {
+       "electrolyte": "1.0 M KOH",
+       "reaction_type": "OER",
+       "value": "45 mV/dec"
+     }
+   }
+ }
+ ```
+
+ ## 🙏 Acknowledgments
+
+ Based on [MaTableGPT](https://github.com/your-repo/MaTableGPT) - GPT-based Table Data Extractor from Materials Science Literature.
+
+ ## 📜 License
+
+ MIT License
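The nested output format documented in the README is keyed by catalyst and then by performance type, which is convenient for lookup but awkward for tabular analysis. The sketch below shows one possible way to flatten it into one row per (catalyst, performance) pair. The function name `flatten_extraction` and the `catalyst` / `performance` column names are hypothetical and are not part of the uploaded service.

```python
from typing import Any, Dict, List


def flatten_extraction(extraction: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn {catalyst: {performance: {property: value}}} into a list of flat rows."""
    rows: List[Dict[str, Any]] = []
    for catalyst, performances in extraction.items():
        if not isinstance(performances, dict):
            # Skip top-level entries that do not follow the documented nesting.
            continue
        for performance, properties in performances.items():
            row: Dict[str, Any] = {"catalyst": catalyst, "performance": performance}
            if isinstance(properties, dict):
                row.update(properties)
            rows.append(row)
    return rows


# Example input copied from the "Output Format" section of the README.
example = {
    "catalyst_name": {
        "overpotential": {
            "electrolyte": "1.0 M KOH",
            "reaction_type": "OER",
            "value": "230 mV",
            "current_density": "10 mA/cm²",
        },
        "tafel_slope": {
            "electrolyte": "1.0 M KOH",
            "reaction_type": "OER",
            "value": "45 mV/dec",
        },
    }
}
rows = flatten_extraction(example)
```

Each resulting row can be passed to `csv.DictWriter` or a DataFrame constructor. Properties missing for a given performance type (such as `current_density` for `tafel_slope`) are simply absent from that row.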
__init__.py ADDED
@@ -0,0 +1,33 @@
+ """
+ MaTableGPT MCP Output Package
+ """
+
+ from .mcp_service import (
+     TableRepresenter,
+     TableToJSON,
+     TableSplitter,
+     GPTExtractor,
+     SessionManager,
+     table_representer,
+     table_to_json,
+     table_splitter,
+     session_manager,
+     get_extractor,
+     mcp
+ )
+
+ __all__ = [
+     'TableRepresenter',
+     'TableToJSON',
+     'TableSplitter',
+     'GPTExtractor',
+     'SessionManager',
+     'table_representer',
+     'table_to_json',
+     'table_splitter',
+     'session_manager',
+     'get_extractor',
+     'mcp'
+ ]
+
+ __version__ = '1.0.0'
app.py ADDED
@@ -0,0 +1,627 @@
+ #!/usr/bin/env python3
+ """
+ MaTableGPT Gradio Web Interface
+ ================================
+
+ A web interface for the MaTableGPT MCP service.
+ Provides an interactive UI for table data extraction from materials science literature.
+
+ For HuggingFace Spaces deployment.
+ """
+
+ import os
+ import json
+ import logging
+ import gradio as gr
+ from typing import Optional, Tuple, Dict, Any
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger("matablgpt-app")
+
+ # Import MCP service components
+ try:
+     from mcp_service import (
+         table_representer,
+         table_to_json,
+         table_splitter,
+         session_manager,
+         get_extractor,
+         GPTExtractor
+     )
+     MCP_AVAILABLE = True
+ except ImportError as e:
+     logger.warning(f"MCP service not available: {e}")
+     MCP_AVAILABLE = False
+
+ # =============================================================================
+ # Helper Functions
+ # =============================================================================
+
+ def format_json_output(data: Any) -> str:
+     """Format data as pretty JSON string."""
+     try:
+         return json.dumps(data, indent=2, ensure_ascii=False)
+     except (TypeError, ValueError):  # data is not JSON-serializable
+         return str(data)
+
+
+ def check_openai_config() -> Tuple[bool, str]:
+     """Check if API configuration is complete (supports third-party services)."""
+     # Check multiple env var names
+     key = (
+         os.environ.get('LLM_API_KEY', '') or
+         os.environ.get('OPENAI_API_KEY', '')
+     )
+     base_url = (
+         os.environ.get('LLM_API_BASE', '') or
+         os.environ.get('OPENAI_API_BASE', '') or
+         os.environ.get('OPENAI_BASE_URL', '')
+     )
+     model = (
+         os.environ.get('LLM_MODEL', '') or
+         os.environ.get('OPENAI_MODEL', '') or
+         'gpt-4-turbo-preview'
+     )
+
+     status_parts = []
+
+     if key:
+         status_parts.append(f"✅ API Key: ***{key[-4:]}")
+     else:
+         return False, "⚠️ API Key not configured (set LLM_API_KEY or OPENAI_API_KEY). GPT extraction will not work."
+
+     if base_url:
+         # Show shortened URL
+         display_url = base_url if len(base_url) <= 35 else base_url[:32] + "..."
+         status_parts.append(f"✅ API URL: {display_url}")
+     else:
+         return False, "⚠️ API Base URL not configured (set LLM_API_BASE or OPENAI_API_BASE). Required for third-party API services."
+
+     status_parts.append(f"✅ Model: {model}")
+
+     return True, " | ".join(status_parts)
+
+
+ def check_openai_key() -> Tuple[bool, str]:
+     """Legacy function - redirects to check_openai_config."""
+     return check_openai_config()
+
+
+ # =============================================================================
+ # Gradio Interface Functions
+ # =============================================================================
+
+ def convert_html_to_tsv(html_input: str, title: str, caption: str) -> str:
+     """Convert HTML table to TSV representation."""
+     if not MCP_AVAILABLE:
+         return "Error: MCP service not available"
+
+     if not html_input.strip():
+         return "Error: Please provide HTML table input"
+
+     try:
+         result = table_representer.html_to_tsv(html_input, title, caption)
+         return result
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+
+ def convert_html_to_json(html_input: str, title: str, caption: str) -> str:
+     """Convert HTML table to JSON representation."""
+     if not MCP_AVAILABLE:
+         return "Error: MCP service not available"
+
+     if not html_input.strip():
+         return "Error: Please provide HTML table input"
+
+     try:
+         result = table_to_json.html_to_json(html_input, title, caption)
+         return format_json_output(result)
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+
+ def analyze_table(html_input: str) -> str:
+     """Analyze HTML table structure."""
+     if not MCP_AVAILABLE:
+         return "Error: MCP service not available"
+
+     if not html_input.strip():
+         return "Error: Please provide HTML table input"
+
+     try:
+         result = table_splitter.analyze_table_structure(html_input)
+         return format_json_output(result)
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+
+ def split_table(html_input: str, title: str, caption: str) -> str:
+     """Split complex table into simpler components."""
+     if not MCP_AVAILABLE:
+         return "Error: MCP service not available"
+
+     if not html_input.strip():
+         return "Error: Please provide HTML table input"
+
+     try:
+         result = table_splitter.split_table(html_input, title, caption)
+         return format_json_output({
+             "table_count": len(result),
+             "tables": result
+         })
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+
+ def extract_zero_shot(table_repr: str) -> str:
+     """Extract catalyst data using zero-shot approach."""
+     if not MCP_AVAILABLE:
+         return "Error: MCP service not available"
+
+     if not table_repr.strip():
+         return "Error: Please provide table representation"
+
+     has_key, key_status = check_openai_key()
+     if not has_key:
+         return f"Error: {key_status}"
+
+     try:
+         extractor = get_extractor()
+         result = extractor.extract_zero_shot(table_repr)
+         return format_json_output(result)
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+
+ def extract_few_shot(table_repr: str, examples_json: str) -> str:
+     """Extract catalyst data using few-shot approach."""
+     if not MCP_AVAILABLE:
+         return "Error: MCP service not available"
+
+     if not table_repr.strip():
+         return "Error: Please provide table representation"
+
+     has_key, key_status = check_openai_key()
+     if not has_key:
+         return f"Error: {key_status}"
+
+     try:
+         examples = json.loads(examples_json) if examples_json.strip() else []
+         extractor = get_extractor()
+         result = extractor.extract_few_shot(table_repr, examples)
+         return format_json_output(result)
+     except json.JSONDecodeError:
+         return "Error: Invalid examples JSON format"
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+
+ def validate_extraction(extraction_json: str) -> str:
+     """Validate extraction result."""
+     if not MCP_AVAILABLE:
+         # GPTExtractor is only defined when the mcp_service import succeeded
+         return "Error: MCP service not available"
+
+     if not extraction_json.strip():
+         return "Error: Please provide extraction JSON"
+
+     try:
+         extraction = json.loads(extraction_json)
+     except json.JSONDecodeError:
+         return "Error: Invalid JSON format"
+
+     issues = []
+     warnings = []
+
+     if not isinstance(extraction, dict):
+         return format_json_output({"valid": False, "issues": ["Extraction must be a dictionary"]})
+
+     if "error" in extraction:
+         issues.append(f"Extraction contains error: {extraction['error']}")
+
+     valid_performance_types = set(GPTExtractor.PERFORMANCE_LIST)
+
+     for catalyst_name, performances in extraction.items():
+         if catalyst_name in ["error", "raw_response", "catalysts"]:
+             continue
+
+         if not isinstance(performances, dict):
+             warnings.append(f"Catalyst '{catalyst_name}' should have dict of performances")
+             continue
+
+         for perf_name, properties in performances.items():
+             if perf_name not in valid_performance_types:
+                 warnings.append(f"Unknown performance type: {perf_name}")
+
+             if isinstance(properties, dict):
+                 for prop_key in properties.keys():
+                     if prop_key not in GPTExtractor.PROPERTY_TEMPLATE:
+                         warnings.append(f"Unknown property key: {prop_key}")
+
+     return format_json_output({
+         "valid": len(issues) == 0,
+         "issues": issues,
+         "warnings": warnings
+     })
+
+
+ def get_performance_types() -> str:
+     """Get list of supported performance types."""
+     if not MCP_AVAILABLE:
+         # Avoid a NameError on GPTExtractor while the UI is being built
+         return "Error: MCP service not available"
+
+     return format_json_output({
+         "performance_types": GPTExtractor.PERFORMANCE_LIST,
+         "property_template": GPTExtractor.PROPERTY_TEMPLATE
+     })
+
+
+ def get_code_template(repr_format: str, model_type: str) -> str:
+     """Generate code template for local extraction."""
+     code = f'''"""
+ MaTableGPT Local Extraction Template
+ Model Type: {model_type}
+ Representation Format: {repr_format}
+ """
+
+ from openai import OpenAI
+ import json
+
+ # Initialize client
+ client = OpenAI(api_key="YOUR_API_KEY")
+
+ # Performance types to extract
+ PERFORMANCE_LIST = [
+     'overpotential', 'tafel_slope', 'Rct', 'stability', 'Cdl',
+     'onset_potential', 'current_density', 'potential', 'TOF', 'ECSA',
+     'water_splitting_potential', 'mass_activity', 'exchange_current_density',
+     'Rs', 'specific_activity', 'onset_overpotential', 'BET', 'surface_area',
+     'loading', 'apparent_activation_energy'
+ ]
+
+ # Your table representation
+ table_representation = """
+ # Paste your {repr_format.upper()} representation here
+ """
+
+ # System prompt
+ system_prompt = """I will extract catalyst performance information from the table and create JSON format.
+ Performance types: """ + str(PERFORMANCE_LIST) + """
+ The JSON format will have performance within the catalyst, with elements:
+ reaction type, value, electrolyte, condition, current density, versus, substrate.
+ Output must contain only JSON dictionary."""
+
+ # Extract
+ response = client.chat.completions.create(
+     model="gpt-4-turbo-preview",
+     messages=[
+         {{"role": "system", "content": system_prompt}},
+         {{"role": "user", "content": table_representation}}
+     ],
+     temperature=0
+ )
+
+ result = response.choices[0].message.content.strip()
+ print(json.dumps(json.loads(result), indent=2))
+ '''
+     return code
+
+
+ # =============================================================================
+ # Gradio UI
+ # =============================================================================
+
+ # Sample HTML table for demo
+ SAMPLE_HTML = '''<table>
+   <thead>
+     <tr>
+       <th>Catalyst</th>
+       <th>Overpotential (mV)</th>
+       <th>Tafel Slope (mV/dec)</th>
+       <th>Electrolyte</th>
+     </tr>
+   </thead>
+   <tbody>
+     <tr>
+       <td>Pt/C</td>
+       <td>280</td>
+       <td>65</td>
+       <td>1M KOH</td>
+     </tr>
+     <tr>
+       <td>NiFe-LDH</td>
+       <td>230</td>
+       <td>45</td>
+       <td>1M KOH</td>
+     </tr>
+     <tr>
+       <td>Co3O4</td>
+       <td>350</td>
+       <td>78</td>
+       <td>1M KOH</td>
+     </tr>
+   </tbody>
+ </table>'''
+
+
+ def create_ui():
+     """Create Gradio interface."""
+
+     # Check status
+     has_key, key_status = check_openai_key()
+     status_color = "green" if has_key else "orange"
+
+     with gr.Blocks(
+         title="MaTableGPT - Table Data Extractor",
+         theme=gr.themes.Soft()
+     ) as app:
+
+         gr.Markdown("""
+         # 🔬 MaTableGPT - Table Data Extractor
+
+         **Extract structured catalyst performance data from HTML tables in materials science literature**
+
+         This tool uses GPT models to convert complex HTML tables into structured JSON data with
+         catalyst names, performance metrics (overpotential, Tafel slope, etc.), and associated properties.
+         """)
+
+         gr.Markdown(f"**Status:** <span style='color:{status_color}'>{key_status}</span>")
+
+         with gr.Tabs():
+             # Tab 1: Table Representation
+             with gr.TabItem("📋 Table Representation"):
+                 gr.Markdown("### Convert HTML tables to TSV or JSON format")
+
+                 with gr.Row():
+                     with gr.Column():
+                         html_input = gr.Textbox(
+                             label="HTML Table Input",
+                             placeholder="Paste your HTML table here...",
+                             lines=15,
+                             value=SAMPLE_HTML
+                         )
+                         title_input = gr.Textbox(
+                             label="Table Title (optional)",
+                             placeholder="e.g., Table 1: OER Catalyst Performance"
+                         )
+                         caption_input = gr.Textbox(
+                             label="Table Caption (optional)",
+                             placeholder="e.g., Performance measured at 10 mA/cm²"
+                         )
+
+                         with gr.Row():
+                             tsv_btn = gr.Button("Convert to TSV", variant="primary")
+                             json_btn = gr.Button("Convert to JSON", variant="primary")
+
+                     with gr.Column():
+                         repr_output = gr.Textbox(
+                             label="Representation Output",
+                             lines=20,
+                             show_copy_button=True
+                         )
+
+                 tsv_btn.click(
+                     convert_html_to_tsv,
+                     inputs=[html_input, title_input, caption_input],
+                     outputs=repr_output
+                 )
+                 json_btn.click(
+                     convert_html_to_json,
+                     inputs=[html_input, title_input, caption_input],
+                     outputs=repr_output
+                 )
+
+             # Tab 2: Table Analysis & Splitting
+             with gr.TabItem("🔍 Table Analysis"):
+                 gr.Markdown("### Analyze and split complex tables")
+
+                 with gr.Row():
+                     with gr.Column():
+                         html_analyze = gr.Textbox(
+                             label="HTML Table Input",
+                             placeholder="Paste your HTML table here...",
+                             lines=10,
+                             value=SAMPLE_HTML
+                         )
+
+                         with gr.Row():
+                             analyze_btn = gr.Button("Analyze Structure", variant="secondary")
+                             split_btn = gr.Button("Split Table", variant="secondary")
+
+                     with gr.Column():
+                         analysis_output = gr.Textbox(
+                             label="Analysis Result",
+                             lines=15,
+                             show_copy_button=True
+                         )
+
+                 analyze_btn.click(
+                     analyze_table,
+                     inputs=html_analyze,
+                     outputs=analysis_output
+                 )
+                 split_btn.click(
+                     split_table,
+                     inputs=[html_analyze, title_input, caption_input],
+                     outputs=analysis_output
+                 )
+
+             # Tab 3: GPT Extraction
+             with gr.TabItem("🤖 GPT Extraction"):
+                 gr.Markdown("### Extract catalyst data using GPT models")
+
+                 if not has_key:
+                     gr.Markdown("""
+                     ⚠️ **API Configuration Required**
+
+                     Set `LLM_API_KEY` and `LLM_API_BASE` (or `OPENAI_API_KEY` and `OPENAI_API_BASE`) to enable GPT extraction.
+                     """)
+
+                 with gr.Row():
+                     with gr.Column():
+                         table_repr_input = gr.Textbox(
+                             label="Table Representation (TSV or JSON)",
+                             placeholder="Paste your table representation here...",
+                             lines=10
+                         )
+
+                         extraction_method = gr.Radio(
+                             ["Zero-shot", "Few-shot"],
+                             label="Extraction Method",
+                             value="Zero-shot"
+                         )
+
+                         examples_input = gr.Textbox(
+                             label="Examples (for Few-shot, JSON format)",
+                             placeholder='[{"input": "...", "output": "..."}]',
+                             lines=5,
+                             visible=False
+                         )
+
+                         extract_btn = gr.Button("Extract Catalyst Data", variant="primary")
+
+                     with gr.Column():
+                         extraction_output = gr.Textbox(
+                             label="Extraction Result",
+                             lines=20,
+                             show_copy_button=True
+                         )
+
+                 def update_examples_visibility(method):
+                     return gr.update(visible=(method == "Few-shot"))
+
+                 extraction_method.change(
+                     update_examples_visibility,
+                     inputs=extraction_method,
+                     outputs=examples_input
+                 )
+
+                 def extract_data(table_repr, method, examples):
+                     if method == "Zero-shot":
+                         return extract_zero_shot(table_repr)
+                     else:
+                         return extract_few_shot(table_repr, examples)
+
+                 extract_btn.click(
+                     extract_data,
+                     inputs=[table_repr_input, extraction_method, examples_input],
+                     outputs=extraction_output
+                 )
+
+             # Tab 4: Validation
+             with gr.TabItem("✅ Validation"):
+                 gr.Markdown("### Validate extraction results")
+
+                 with gr.Row():
+                     with gr.Column():
+                         validation_input = gr.Textbox(
+                             label="Extraction JSON to Validate",
+                             placeholder="Paste extraction JSON here...",
+                             lines=15
+                         )
+                         validate_btn = gr.Button("Validate", variant="secondary")
+
+                     with gr.Column():
+                         validation_output = gr.Textbox(
+                             label="Validation Result",
+                             lines=10
+                         )
+
+                         gr.Markdown("### Supported Performance Types")
+                         perf_types = gr.Textbox(
+                             label="",
+                             value=get_performance_types(),
+                             lines=10,
+                             interactive=False
+                         )
+
+                 validate_btn.click(
+                     validate_extraction,
+                     inputs=validation_input,
+                     outputs=validation_output
+                 )
+
+             # Tab 5: Code Template
+             with gr.TabItem("💻 Code Template"):
+                 gr.Markdown("### Generate Python code for local extraction")
+
+                 with gr.Row():
+                     repr_format = gr.Dropdown(
+                         ["tsv", "json"],
+                         label="Representation Format",
+                         value="tsv"
+                     )
+                     model_type = gr.Dropdown(
+                         ["zero-shot", "few-shot", "fine-tuning"],
+                         label="Model Type",
+                         value="zero-shot"
+                     )
+
+                 generate_btn = gr.Button("Generate Code", variant="secondary")
+
+                 code_output = gr.Code(
+                     label="Python Code Template",
+                     language="python",
+                     lines=30
+                 )
+
+                 generate_btn.click(
+                     get_code_template,
+                     inputs=[repr_format, model_type],
+                     outputs=code_output
+                 )
+
+             # Tab 6: About
+             with gr.TabItem("ℹ️ About"):
+                 gr.Markdown("""
+                 ## About MaTableGPT
+
+                 MaTableGPT is a GPT-based table data extractor specifically designed for
+                 materials science literature. It converts complex HTML tables containing
+                 catalyst performance data into structured JSON format.
+
+                 ### Workflow
+
+                 1. **Table Representation**: Convert HTML tables to TSV or JSON format
+                 2. **Table Splitting** (optional): Break down complex tables with multiple headers
+                 3. **GPT Extraction**: Use zero-shot, few-shot, or fine-tuned models to extract data
+                 4. **Validation**: Verify extraction results against expected schema
+
+                 ### Supported Performance Types
+
+                 - Overpotential, Tafel slope, Rct, Stability, Cdl
+                 - Onset potential, Current density, Potential, TOF, ECSA
+                 - Water splitting potential, Mass activity, Exchange current density
+                 - Rs, Specific activity, Onset overpotential, BET, Surface area
+                 - Loading, Apparent activation energy
+
+                 ### MCP Integration
+
+                 This service is also available as an MCP (Model Context Protocol) server,
+                 allowing integration with AI assistants like Claude.
+
+                 ### Credits
+
+                 Based on [MaTableGPT](https://github.com/your-repo/MaTableGPT) research.
+                 """)
+
+         gr.Markdown("---\n*MaTableGPT MCP Service - Materials Science Table Data Extraction*")
+
+     return app
+
+
+ # =============================================================================
+ # Main Entry Point
+ # =============================================================================
+
+ def main():
+     """Run the Gradio app."""
+     app = create_ui()
+
+     # Get port from environment or default
+     port = int(os.environ.get('GRADIO_SERVER_PORT', 7860))
+
+     app.launch(
+         server_name="0.0.0.0",
+         server_port=port,
+         share=False
+     )
+
+
+ if __name__ == "__main__":
+     main()
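The code template generated by `get_code_template` passes the model's reply straight to `json.loads`. Chat models sometimes wrap JSON in a Markdown code fence, in which case that call would raise. The sketch below shows one defensive way to parse such a reply, under that assumption. The helper name `parse_json_response` is hypothetical; the `error` and `raw_response` keys mirror the ones `validate_extraction` skips, but whether `mcp_service.py` produces them in this way is not visible in this diff.

```python
import json
from typing import Any

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_response(text: str) -> Any:
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        # Strip the opening and closing fence markers.
        body = cleaned[len(FENCE):-len(FENCE)]
        # Drop the rest of the opening fence line (e.g. a "json" language tag).
        first_newline = body.find("\n")
        cleaned = body[first_newline + 1:] if first_newline != -1 else body
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Return a structured error instead of raising, keeping the raw text.
        return {"error": "Response was not valid JSON", "raw_response": text}
```

Returning an error dictionary rather than raising keeps the result displayable in the Gradio output box and lets the validation step report the failure as an issue.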
mcp_service.py ADDED
@@ -0,0 +1,1413 @@
1
+ """
2
+ MaTableGPT MCP Service
3
+ ======================
4
+ A Model Context Protocol (MCP) service for extracting table data from
5
+ materials science literature using GPT models.
6
+
7
+ This service provides tools for:
8
+ 1. Table Representation: Converting HTML tables to TSV or JSON format
9
+ 2. Table Splitting: Breaking down complex tables into simpler components
10
+ 3. GPT-based Data Extraction: Using fine-tuning, few-shot, or zero-shot models
11
+ 4. Follow-up Questions: Refining extraction results through iterative questioning
12
+ 5. Model Evaluation: Assessing extraction quality
13
+ """
14
+
15
+ import os
16
+ import json
17
+ import re
18
+ import logging
19
+ import tempfile
20
+ import uuid
21
+ from datetime import datetime
22
+ from typing import Optional, Dict, List, Any, Union
23
+ from dataclasses import dataclass, field
24
+ from contextlib import asynccontextmanager
25
+ from bs4 import BeautifulSoup
26
+ import pandas as pd
27
+
28
+ # MCP imports
29
+ from mcp.server.fastmcp import FastMCP
30
+
31
+ # Configure logging
32
+ logging.basicConfig(level=logging.INFO)
33
+ logger = logging.getLogger("matablgpt-mcp")
34
+
35
+ # =============================================================================
36
+ # Data Classes
37
+ # =============================================================================
38
+
39
+ @dataclass
40
+ class TableData:
41
+ """Represents a parsed table structure"""
42
+ title: str = ""
43
+ caption: str = ""
44
+ tag: str = "" # HTML table tag
45
+ headers: List[List[str]] = field(default_factory=list)
46
+ body: List[List[str]] = field(default_factory=list)
47
+
48
+ @dataclass
49
+ class ExtractionResult:
50
+ """Represents the result of GPT extraction"""
51
+ session_id: str
52
+ table_name: str
53
+ model_type: str # 'fine-tuning', 'few-shot', 'zero-shot'
54
+ result: Dict[str, Any]
55
+ timestamp: str
56
+ follow_up_applied: bool = False
57
+
58
+ @dataclass
59
+ class SessionData:
60
+ """Session data for storing extraction results"""
61
+ session_id: str
62
+ created_at: str
63
+ tables: Dict[str, TableData] = field(default_factory=dict)
64
+ representations: Dict[str, str] = field(default_factory=dict)
65
+ extractions: List[ExtractionResult] = field(default_factory=list)
66
+
67
+ # =============================================================================
68
+ # Table Processing Classes
69
+ # =============================================================================
70
+
71
+ class TableRepresenter:
72
+ """
73
+ Converts HTML tables to TSV (Tab-Separated Values) representation.
74
+ Handles merged cells, captions, and titles.
75
+ """
76
+
77
+ def __init__(self):
78
+ # Cell representation formats
79
+ self.merged_cell = '<merge {}={}>{}</merge>'
80
+ self.both_merged_cell = '<merge {}={} {}={}>{}</merge>'
81
+ self.cell = '{}\\t'
82
+ self.line_breaking = '\\n'
83
+ self.table_tag = '<table>{}</table>'
84
+ self.caption_tag = '<caption>{}</caption>'
85
+ self.title_tag = '<title>{}</title>'
86
+
87
+ def text_filter(self, text: str) -> str:
88
+ """Remove unnecessary text and HTML tags from the given string."""
89
+ out = text
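+         # Replace special Unicode characters, both as escaped literals
+         # (e.g. the six-character text "\xa0") and as the real characters
+         # that BeautifulSoup yields when parsing raw HTML
+         replacements = [
+             ('\\xa0', ' '), ('\\u2005', ' '), ('\\u2009', ' '),
+             ('\\u202f', ' '), ('\\u200b', ''), ('<b>', ''), ('</b>', ''),
+             ('\xa0', ' '), ('\u2005', ' '), ('\u2009', ' '),
+             ('\u202f', ' '), ('\u200b', '')
+         ]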
95
+ for old, new in replacements:
96
+ out = out.replace(old, new)
97
+
98
+ # Remove specific patterns
99
+ patterns = [
100
+ (r'<cap>(\(\d+\)|\d+|\[\d+\]|\d+\,\d+|\d+\,\d+\,\d+|\d+\,\d+\–\d+|\d+\D+|\(\d+\,\s*\d+\)|\(\d+\D+\))</cap>', r'\1'),
101
+ (r'<cap>(\s*ref\.\s\d+.*?)</cap>', r'\1'),
102
+ (r'\(<cap>(\s*(ref\.\s\d+.*?)\s*)</cap>\)', r'\1'),
103
+ (r'<cap>(\s*Ref\.\s\d+.*?)</cap>', r'\1'),
104
+ (r'\(<cap>(\s*(Ref\.\s\d+.*?)\s*)</cap>\)', r'\1'),
105
+ (r'<cap>(\[\d+|\d+\])</cap>', r'\1'),
106
+ (r'<cap>((.*?)et al\..*?)</cap>', r'\1'),
107
+ (r'<cap>((.*?)Fig\..*?)</cap>', r'\1'),
108
+ (r'<cap>(Song and Hu \(2014\))</cap>', r'\1'),
109
+ (r'<div> <cap> </cap> </div> ', ''),
110
+ (r'<cap>(mA\.cm)</cap>', r'\1'),
111
+ (r'<cap>(https.*?)</cap>', r'\1'),
112
+ (r'<cap>(\d+\.\d+\@\d+)</cap>', r'\1')
113
+ ]
114
+ for pattern, repl in patterns:
115
+ out = re.sub(pattern, repl, out)
116
+
117
+ return out
118
+
119
+ def process_table(self, t):
120
+ """Remove unnecessary HTML tags from the table element."""
121
+ tags_to_remove = [
122
+ 'img', 'em', 'i', 'p', 'span', 'strong', 'math', 'mi', 'br',
123
+ 'script', 'svg', 'mrow', 'mo', 'mn', 'msub', 'msubsup', 'mtext',
124
+ 'mjx-container', 'mjx-math', 'mjx-mrow', 'mjx-msub', 'mjx-mi',
125
+ 'mjx-c', 'mjx-script', 'mjx-mspace', 'mjx-assistive-mml', 'mspace'
126
+ ]
127
+
128
+ for tag in tags_to_remove:
129
+ elements = t.find_all(tag)
130
+ for element in elements:
131
+ if tag in ['img', 'script', 'svg']:
132
+ element.decompose()
133
+ else:
134
+ element.unwrap()
135
+
136
+ return t
137
+
138
+ def html_to_tsv(self, html_table: str, title: str = "", caption: str = "") -> str:
139
+ """
140
+ Convert HTML table to TSV representation.
141
+
142
+ Args:
143
+ html_table: HTML string containing the table
144
+ title: Table title
145
+ caption: Table caption
146
+
147
+ Returns:
148
+ TSV representation of the table
149
+ """
150
+ soup = BeautifulSoup(html_table, 'html.parser')
151
+ table = soup.find('table')
152
+ if not table:
153
+ table = soup
154
+
155
+ # Get table dimensions
156
+ tbody = table.find('tbody') or table
157
+ first_row = tbody.find('tr')
158
+ if not first_row:
159
+ return "Error: No table rows found"
160
+
161
+ width = sum(int(cell.get('colspan', 1)) for cell in first_row.find_all(re.compile('(?<!ma)th|td')))
162
+ height = len(table.find_all('tr'))
163
+
164
+ # Initialize output grid
165
+ out = [['' for _ in range(width)] for _ in range(height)]
166
+
167
+ # Process each row
168
+ i = 0
169
+ for tr in table.find_all('tr'):
170
+ j = 0
171
+ for cell in tr.find_all(re.compile('(?<!ma)th|td')):
172
+ # Process links
173
+ for a_tag in cell.find_all('a'):
174
+ a_text = a_tag.get_text()
175
+ if a_text.isdigit():
176
+ a_tag.string = f"<ref>{a_text}</ref>"
177
+ else:
178
+ a_tag.string = f"<cap>{a_text}</cap>"
179
+
180
+ cell = self.process_table(cell)
181
+
182
+ # Find next empty cell
183
+ while j < width and out[i][j] != '':
184
+ j += 1
185
+ if j >= width:
186
+ break
187
+
188
+ refined_text = ''.join(str(element) for element in cell.contents)
189
+ colspan = int(cell.get('colspan', 0))
190
+ rowspan = int(cell.get('rowspan', 0))
191
+
192
+ # Handle merged cells
193
+ if colspan and rowspan:
194
+ out[i][j] = self.both_merged_cell.format('colspan', colspan, 'rowspan', rowspan, self.text_filter(refined_text))
195
+ for c in range(colspan):
196
+ for r in range(rowspan):
197
+ if c > 0 or r > 0:
198
+ if i + r < height and j + c < width:
199
+ out[i + r][j + c] = '::'
200
+ elif colspan:
201
+ out[i][j] = self.merged_cell.format('colspan', colspan, self.text_filter(refined_text))
202
+ for c in range(1, colspan):
203
+ if j + c < width:
204
+ out[i][j + c] = '::'
205
+ elif rowspan:
206
+ out[i][j] = self.merged_cell.format('rowspan', rowspan, self.text_filter(refined_text))
207
+ for r in range(1, rowspan):
208
+ if i + r < height:
209
+ out[i + r][j] = '::'
210
+ else:
211
+ text = self.text_filter(refined_text) if refined_text else ' '
212
+ out[i][j] = text
213
+
214
+ j += colspan if colspan else 1
215
+ i += 1
216
+
217
+ # Build result string
218
+ result = ''
219
+ for row in out:
220
+ for element in row:
221
+ if element != '::':
222
+ result += self.cell.format(element)
223
+ result += self.line_breaking
224
+
225
+ final_result = self.title_tag.format(title) + self.table_tag.format(result)
226
+
227
+ if caption:
228
+ if isinstance(caption, dict):
229
+ caption_str = ', '.join([f"{k}: {v}" for k, v in caption.items()])
230
+ else:
231
+ caption_str = str(caption)
232
+ final_result += '\n' + self.caption_tag.format(caption_str)
233
+
234
+ return final_result
235
+
236
+
237
+ class TableToJSON:
238
+ """
239
+ Converts HTML tables to JSON representation.
240
+ """
241
+
242
+ def process_caption(self, table):
243
+ """Process caption and reference tags."""
244
+ # Remove tfoot
245
+ for tfoot in table.find_all('tfoot'):
246
+ tfoot.decompose()
247
+
248
+ for cell in table.find_all(['td', 'th']):
249
+ for link in cell.find_all('a'):
250
+ link_text = link.get_text()
251
+ if len(link_text) == 1 and (link_text.isalpha() or link_text == '*'):
252
+ link.string = f"<cap>{link_text}</cap>"
253
+ else:
254
+ link.string = f"<ref>{link_text}</ref>"
255
+
256
+ return table
257
+
258
+ def process_sub_sup(self, table):
259
+ """Process subscript and superscript tags."""
260
+ for cell in table.find_all(['td', 'th']):
261
+ for sup in cell.find_all('sup'):
262
+ sup_text = sup.get_text() or ""
263
+ sup.string = f"<sup>{sup_text}</sup>"
264
+ for sub in cell.find_all('sub'):
265
+ sub_text = sub.get_text() or ""
266
+ sub.string = f"<sub>{sub_text}</sub>"
267
+ return table
268
+
269
+ def html_to_json(self, html_table: str, title: str = "", caption: str = "") -> Dict:
270
+ """
271
+ Convert HTML table to JSON representation.
272
+
273
+ Args:
274
+ html_table: HTML string containing the table
275
+ title: Table title
276
+ caption: Table caption
277
+
278
+ Returns:
279
+ JSON dictionary representation of the table
280
+ """
281
+ soup = BeautifulSoup(html_table, 'html.parser')
282
+ table = soup.find('table')
283
+ if not table:
284
+ table = soup
285
+
286
+ # Process table
287
+ table = self.process_caption(table)
288
+ table = self.process_sub_sup(table)
289
+
290
+ # Fill empty header cells
291
+ for th in table.find_all('th'):
292
+ if not th.text.strip():
293
+ th.insert(0, '-')
294
+
+         # Convert to DataFrame. The markup is wrapped in StringIO because
+         # recent pandas releases deprecate passing a literal HTML string.
+         try:
+             from io import StringIO
+             dfs = pd.read_html(StringIO(str(table)))
+             if not dfs:
+                 return {"error": "Could not parse table"}
+             df = dfs[0]
+             df.fillna("NaN", inplace=True)
+         except Exception as e:
+             return {"error": f"Failed to parse table: {str(e)}"}
304
+
305
+ # Build JSON structure
306
+ result = {}
307
+ header_levels = df.columns.nlevels
308
+ keys = list(df.columns)
309
+
310
+ for i, key in enumerate(keys):
311
+ values = df.iloc[:, i].tolist()
312
+ if header_levels > 1:
313
+ current = result
314
+ for j, k in enumerate(key):
315
+ if j == len(key) - 1:
316
+ current[k] = values
317
+ else:
318
+ if k not in current:
319
+ current[k] = {}
320
+ current = current[k]
321
+ else:
322
+ result[key] = values
323
+
324
+ # Add metadata
325
+ final_result = {
326
+ "Title": title,
327
+ "caption": caption,
328
+ **result
329
+ }
330
+
331
+ return final_result
332
+
333
+
334
+ class TableSplitter:
335
+ """
336
+ Splits complex tables into simpler components for better extraction.
337
+ """
338
+
339
+ def analyze_table_structure(self, html_table: str) -> Dict:
340
+ """
341
+ Analyze the structure of an HTML table.
342
+
343
+ Args:
344
+ html_table: HTML string containing the table
345
+
346
+ Returns:
347
+ Dictionary containing structural analysis
348
+ """
349
+ soup = BeautifulSoup(html_table, 'html.parser')
350
+ table = soup.find('table') or soup
351
+
352
+ rows = table.find_all('tr')
353
+
354
+ # Analyze each row
355
+ row_analysis = []
356
+ for row in rows:
357
+ cells = row.find_all(['td', 'th'])
358
+ cell_types = [cell.name for cell in cells]
359
+ merged_cells = sum(1 for cell in cells if cell.get('colspan') or cell.get('rowspan'))
360
+
361
+ # Determine if row is header or body
362
+ is_header = all(c.name == 'th' for c in cells) or self._is_header_content(cells)
363
+
364
+ row_analysis.append({
365
+ "cell_count": len(cells),
366
+ "cell_types": cell_types,
367
+ "merged_cells": merged_cells,
368
+ "is_header": is_header
369
+ })
370
+
371
+ return {
372
+ "total_rows": len(rows),
373
+ "has_thead": table.find('thead') is not None,
374
+ "has_tbody": table.find('tbody') is not None,
375
+ "row_analysis": row_analysis
376
+ }
377
+
378
+ def _is_header_content(self, cells) -> bool:
379
+ """Check if cells contain header-like content."""
380
+ if not cells:
381
+ return False
382
+
383
+ # Check if all cells have the same value (likely a spanning header)
384
+ texts = [c.get_text().strip() for c in cells]
385
+ if len(set(texts)) == 1 and texts[0]:
386
+ return True
387
+
+         # Check if content is mostly non-numeric
+         numeric_count = 0
+         for text in texts:
+             try:
+                 float(re.sub(r'[^\d.-]', '', text))
+                 numeric_count += 1
+             except ValueError:
+                 # No parseable number in this cell
+                 pass
+ 
+         return numeric_count < len(texts) / 2
398
+
399
+ def split_table(self, html_table: str, title: str = "", caption: str = "") -> List[Dict]:
400
+ """
401
+ Split a complex table into simpler components.
402
+
403
+ Args:
404
+ html_table: HTML string containing the table
405
+ title: Table title
406
+ caption: Table caption
407
+
408
+ Returns:
409
+ List of simplified table dictionaries
410
+ """
411
+ soup = BeautifulSoup(html_table, 'html.parser')
412
+ table = soup.find('table') or soup
413
+
414
+ analysis = self.analyze_table_structure(html_table)
415
+
416
+ # If simple table, return as-is
417
+ if all(not r['is_header'] or i == 0 for i, r in enumerate(analysis['row_analysis'])):
418
+ return [{
419
+ "html": str(table),
420
+ "title": title,
421
+ "caption": caption,
422
+ "index": 1
423
+ }]
424
+
425
+ # Split based on internal headers
426
+ split_tables = []
427
+ current_header = None
428
+ current_rows = []
429
+
430
+ thead = table.find('thead')
431
+ original_header = str(thead) if thead else ""
432
+
+         # Walk every row in document order so that the index lines up with
+         # analysis['row_analysis'], which is also built from find_all('tr').
+         # (Offsetting tbody indices by the thead length raised IndexError
+         # for tables that have a <thead> but no <tbody>.)
+         for i, row in enumerate(table.find_all('tr')):
+             if thead is not None and row.find_parent('thead') is thead:
+                 # Already captured in original_header
+                 continue
+             if analysis['row_analysis'][i]['is_header']:
+                 # Save previous section
+                 if current_rows:
+                     split_tables.append({
+                         "html": self._build_table_html(original_header, current_header, current_rows),
+                         "title": title,
+                         "caption": caption,
+                         "index": len(split_tables) + 1
+                     })
+                 current_header = str(row)
+                 current_rows = []
+             else:
+                 current_rows.append(str(row))
448
+
449
+ # Save last section
450
+ if current_rows:
451
+ split_tables.append({
452
+ "html": self._build_table_html(original_header, current_header, current_rows),
453
+ "title": title,
454
+ "caption": caption,
455
+ "index": len(split_tables) + 1
456
+ })
457
+
458
+ return split_tables if split_tables else [{
459
+ "html": str(table),
460
+ "title": title,
461
+ "caption": caption,
462
+ "index": 1
463
+ }]
464
+
465
+ def _build_table_html(self, original_header: str, sub_header: str, rows: List[str]) -> str:
466
+ """Build HTML table from components."""
467
+ header = original_header
468
+ if sub_header:
469
+ if header:
470
+ header = header.replace('</thead>', sub_header + '</thead>')
471
+ else:
472
+ header = f"<thead>{sub_header}</thead>"
473
+
474
+ body = "<tbody>" + "".join(rows) + "</tbody>"
475
+ return f"<table>{header}{body}</table>"
476
+
477
+
478
+ # =============================================================================
479
+ # GPT Extraction Classes
480
+ # =============================================================================
481
+
482
+ class GPTExtractor:
483
+ """
484
+ Handles GPT-based extraction of catalyst data from table representations.
485
+
486
+ Supports third-party API services with custom base URL (reverse proxy,
487
+ API aggregators like OpenRouter, OneAPI, etc.).
488
+
489
+ Environment Variables:
490
+ LLM_API_KEY or OPENAI_API_KEY: Your API key
491
+ LLM_API_BASE or OPENAI_API_BASE: API base URL (required for third-party services)
492
+ LLM_MODEL or OPENAI_MODEL: Model name (default: gpt-4-turbo-preview)
493
+ """
494
+
495
+ # Performance types to extract
496
+ PERFORMANCE_LIST = [
497
+ 'overpotential', 'tafel_slope', 'Rct', 'stability', 'Cdl',
498
+ 'onset_potential', 'current_density', 'potential', 'TOF', 'ECSA',
499
+ 'water_splitting_potential', 'mass_activity', 'exchange_current_density',
500
+ 'Rs', 'specific_activity', 'onset_overpotential', 'BET', 'surface_area',
501
+ 'loading', 'apparent_activation_energy'
502
+ ]
503
+
504
+ # Property template
505
+ PROPERTY_TEMPLATE = {
506
+ 'electrolyte': '', 'reaction_type': '', 'value': '',
507
+ 'current_density': '', 'overpotential': '', 'potential': '',
508
+ 'substrate': '', 'versus': '', 'condition': ''
509
+ }
510
+
511
+ # Default model
512
+ DEFAULT_MODEL = "gpt-4-turbo-preview"
513
+
514
+ def __init__(self, api_key: Optional[str] = None, base_url: Optional[str] = None, model: Optional[str] = None):
515
+ """
516
+ Initialize GPT Extractor.
517
+
518
+ Args:
519
+ api_key: API key. Falls back to LLM_API_KEY or OPENAI_API_KEY env var.
520
+ base_url: API base URL. Falls back to LLM_API_BASE or OPENAI_API_BASE env var.
521
+ model: Model name. Falls back to LLM_MODEL or OPENAI_MODEL env var.
522
+ """
523
+ # Support multiple env var names for flexibility
524
+ self.api_key = (
525
+ api_key or
526
+ os.environ.get('LLM_API_KEY', '') or
527
+ os.environ.get('OPENAI_API_KEY', '')
528
+ )
529
+ self.base_url = (
530
+ base_url or
531
+ os.environ.get('LLM_API_BASE', '') or
532
+ os.environ.get('OPENAI_API_BASE', '') or
533
+ os.environ.get('OPENAI_BASE_URL', '')
534
+ )
535
+ self.model = (
536
+ model or
537
+ os.environ.get('LLM_MODEL', '') or
538
+ os.environ.get('OPENAI_MODEL', '') or
539
+ self.DEFAULT_MODEL
540
+ )
541
+ self._client = None
542
+
543
+ logger.info(f"GPTExtractor initialized with model: {self.model}")
544
+ if self.base_url:
545
+ logger.info(f"Using custom API base URL: {self.base_url}")
546
+ else:
547
+ logger.warning("No API base URL configured - using default OpenAI endpoint")
548
+
549
+ @property
550
+ def client(self):
551
+ """Lazy initialization of OpenAI-compatible client."""
552
+ if self._client is None:
553
+ try:
554
+ from openai import OpenAI
555
+
556
+ # Build client kwargs
557
+ client_kwargs = {"api_key": self.api_key}
558
+
559
+ # Add base_url for third-party API services
560
+ if self.base_url:
561
+ client_kwargs["base_url"] = self.base_url
562
+
563
+ self._client = OpenAI(**client_kwargs)
564
+ logger.info("API client initialized successfully")
565
+
566
+ except ImportError:
567
+ raise ImportError("OpenAI package not installed. Install with: pip install openai")
568
+ return self._client
569
+
570
+ def get_model(self) -> str:
571
+ """Get the model name to use for API calls."""
572
+ return self.model
573
+
574
+ def get_system_prompt(self, model_type: str) -> str:
575
+ """Get system prompt based on model type."""
576
+ if model_type == 'fine-tuning':
577
+ return """This task is to take a string as input and convert it to JSON format.
578
+ I want to extract the performance below: [reaction_type, versus, overpotential, substrate, loading,
579
+ tafel_slope, onset_potential, current_density, BET, specific_activity, mass_activity, surface_area,
580
+ ECSA, apparent_activation_energy, water_splitting_potential, potential, Rs, Rct, Cdl, TOF, stability,
581
+ electrolyte, exchange_current_density, onset_overpotential].
582
+
583
+ If there is information about overpotential and Tafel slope in the input, the output should be:
584
+ {
585
+ "catalyst_name": {
586
+ "overpotential": {"electrolyte": "1.0 M KOH", "reaction_type": "OER", "value": "230 mV", "current_density": "50 mA/cm2"},
587
+ "tafel_slope": {"electrolyte": "1.0 M KOH", "reaction_type": "OER", "value": "54 mV/dec"}
588
+ }
589
+ }
590
+
591
+ If certain information cannot be found, those keys should not be included in the output.
592
+ If there are no values corresponding to performance metrics, simply extract the catalyst name as: {"catalyst_name": {}}"""
593
+
594
+ elif model_type == 'few-shot':
595
+ return f"""I will extract the performance information of the catalyst from the table and create a JSON format.
596
+ The types of performance to be extracted: performance_list = {self.PERFORMANCE_LIST}
597
+ You can only use the names as they are in the performance_list.
598
+ The JSON format will have performance within the catalyst, and each performance will include elements present in the table:
599
+ reaction type, value, electrolyte, condition, current density, versus (ex: RHE) and substrate.
600
+ The output must contain only JSON dictionary. Other sentences or opinions must not be in output."""
601
+
602
+ else: # zero-shot
603
+ return f"""I'm going to convert the information in the table representer into JSON format.
604
+ CATALYST_TEMPLATE = {{'catalyst_name': {{'performance_name': {{PROPERTY_TEMPLATE}}}}}}
605
+ PROPERTY_TEMPLATE = {self.PROPERTY_TEMPLATE}
606
+ performance_list = {self.PERFORMANCE_LIST}
607
+ Extract catalyst information following these templates strictly."""
608
+
609
+ def extract_zero_shot(self, table_representation: str) -> Dict:
610
+ """
611
+ Extract data using zero-shot approach with step-by-step questioning.
612
+
613
+ Args:
614
+ table_representation: TSV or JSON representation of the table
615
+
616
+ Returns:
617
+ Extracted catalyst data in JSON format
618
+ """
619
+ messages = [{"role": "system", "content": self.get_system_prompt('zero-shot') + "\n\n" + table_representation}]
620
+
621
+ # Step 1: Get catalyst list
622
+ catalyst_q = "Show the catalysts present in the table representer as a Python list. Answer must be ONLY python list."
623
+ messages.append({"role": "user", "content": catalyst_q})
624
+
+         try:
+             response = self.client.chat.completions.create(
+                 model=self.get_model(),
+                 messages=messages,
+                 temperature=0
+             )
+             catalyst_answer = response.choices[0].message.content.strip()
+             # Parse the reply as a Python literal; never eval() model output
+             from ast import literal_eval
+             catalyst_list = literal_eval(catalyst_answer)
+             if not isinstance(catalyst_list, list):
+                 raise ValueError("expected a Python list of catalyst names")
+             messages.append({"role": "assistant", "content": catalyst_answer})
+         except Exception as e:
+             return {"error": f"Failed to extract catalysts: {str(e)}"}
636
+
637
+ result = {"catalysts": []}
638
+
639
+ for catalyst in catalyst_list:
640
+ # Step 2: Get performance template for each catalyst
641
+ perf_q = f"""Create a CATALYST_TEMPLATE filling in the performance of '{catalyst}' from the table representer,
642
+ strictly adhering to these rules:
643
+ Rule 1: Only include actual existing performances from the Performance_list.
644
+ Rule 2: Set all values of keys in PROPERTY_TEMPLATE to be " ". DO NOT INSERT ANY VALUE.
645
+ Rule 3: Answer must be ONLY JSON format."""
646
+
647
+ messages.append({"role": "user", "content": perf_q})
648
+
649
+ try:
650
+ response = self.client.chat.completions.create(
651
+ model=self.get_model(),
652
+ messages=messages,
653
+ temperature=0
654
+ )
655
+ perf_answer = response.choices[0].message.content.strip()
656
+ messages.append({"role": "assistant", "content": perf_answer})
657
+
658
+ # Step 3: Fill in property values
659
+ prop_q = """In PROPERTY_TEMPLATE, maintain all keys, and fill in values that exist in the table representer.
660
+ If there are more than two "values" for the same performance, make it into a list. Include units in the values."""
661
+
662
+ messages.append({"role": "user", "content": prop_q})
663
+ response = self.client.chat.completions.create(
664
+ model=self.get_model(),
665
+ messages=messages,
666
+ temperature=0
667
+ )
668
+ prop_answer = response.choices[0].message.content.strip()
669
+
670
+ # Step 4: Remove empty keys
671
+ delete_q = "Remove keys with no values from previous version of CATALYST_TEMPLATE. Output only JSON."
672
+ messages.append({"role": "assistant", "content": prop_answer})
673
+ messages.append({"role": "user", "content": delete_q})
674
+
675
+ response = self.client.chat.completions.create(
676
+ model=self.get_model(),
677
+ messages=messages,
678
+ temperature=0
679
+ )
680
+ final_answer = response.choices[0].message.content.strip()
681
+
682
+ # Parse JSON
683
+ if "```" in final_answer:
684
+ final_answer = final_answer.replace("```json", "").replace("```", "")
685
+ catalyst_data = json.loads(final_answer)
686
+ result["catalysts"].append(catalyst_data)
687
+
688
+ except Exception as e:
689
+ result["catalysts"].append({catalyst: {"error": str(e)}})
690
+
691
+ return result["catalysts"][0] if len(result["catalysts"]) == 1 else result
692
+
+     def extract_few_shot(self, table_representation: str, examples: Optional[List[Dict]] = None) -> Dict:
694
+ """
695
+ Extract data using few-shot approach with example pairs.
696
+
697
+ Args:
698
+ table_representation: TSV or JSON representation of the table
699
+ examples: List of input/output example pairs
700
+
701
+ Returns:
702
+ Extracted catalyst data in JSON format
703
+ """
704
+ messages = [{"role": "system", "content": self.get_system_prompt('few-shot')}]
705
+
706
+ # Add examples if provided
707
+ if examples:
708
+ for ex in examples:
709
+ messages.append({"role": "user", "content": ex.get('input', '')})
710
+ messages.append({"role": "assistant", "content": ex.get('output', '')})
711
+
712
+ messages.append({"role": "user", "content": table_representation})
713
+
714
+ try:
715
+ response = self.client.chat.completions.create(
716
+ model=self.get_model(),
717
+ messages=messages,
718
+ temperature=0
719
+ )
720
+ result = response.choices[0].message.content.strip()
721
+
722
+ if "```" in result:
723
+ result = result.replace("```json", "").replace("```", "")
724
+
725
+ return json.loads(result)
726
+ except json.JSONDecodeError:
727
+ return {"raw_response": result, "error": "Could not parse as JSON"}
728
+ except Exception as e:
729
+ return {"error": str(e)}
730
+
731
+ def extract_with_fine_tuned(self, table_representation: str, model_name: str) -> Dict:
732
+ """
733
+ Extract data using a fine-tuned model.
734
+
735
+ Args:
736
+ table_representation: TSV or JSON representation of the table
737
+ model_name: Name of the fine-tuned model
738
+
739
+ Returns:
740
+ Extracted catalyst data in JSON format
741
+ """
742
+ messages = [
743
+ {"role": "system", "content": self.get_system_prompt('fine-tuning')},
744
+ {"role": "user", "content": str(table_representation)}
745
+ ]
746
+
747
+ try:
748
+ response = self.client.chat.completions.create(
749
+ model=model_name,
750
+ messages=messages,
751
+ temperature=0
752
+ )
753
+ result = response.choices[0].message.content.strip()
754
+
+             try:
+                 return json.loads(result)
+             except json.JSONDecodeError:
+                 # Fall back to Python-literal syntax (e.g. single-quoted dicts)
+                 from ast import literal_eval
+                 return literal_eval(result)
760
+ except Exception as e:
761
+ return {"error": str(e)}
762
+
763
+
764
+ # =============================================================================
765
+ # Session Management
766
+ # =============================================================================
767
+
768
+ class SessionManager:
769
+ """Manages extraction sessions and data storage."""
770
+
+     def __init__(self, storage_dir: Optional[str] = None):
772
+ self.storage_dir = storage_dir or tempfile.mkdtemp(prefix="matablgpt_")
773
+ os.makedirs(self.storage_dir, exist_ok=True)
774
+ self.sessions: Dict[str, SessionData] = {}
775
+
776
+ def create_session(self) -> str:
777
+ """Create a new session."""
778
+ session_id = f"session_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}"
779
+ session_dir = os.path.join(self.storage_dir, session_id)
780
+ os.makedirs(session_dir, exist_ok=True)
781
+
782
+ self.sessions[session_id] = SessionData(
783
+ session_id=session_id,
784
+ created_at=datetime.now().isoformat()
785
+ )
786
+
787
+ return session_id
788
+
789
+ def get_session(self, session_id: str) -> Optional[SessionData]:
790
+ """Get session by ID."""
791
+ return self.sessions.get(session_id)
792
+
793
+ def save_table(self, session_id: str, table_name: str, table_data: TableData) -> bool:
794
+ """Save table data to session."""
795
+ session = self.get_session(session_id)
796
+ if not session:
797
+ return False
798
+ session.tables[table_name] = table_data
799
+ return True
800
+
801
+ def save_representation(self, session_id: str, table_name: str, representation: str, format_type: str) -> bool:
802
+ """Save table representation to session."""
803
+ session = self.get_session(session_id)
804
+ if not session:
805
+ return False
806
+ key = f"{table_name}_{format_type}"
807
+ session.representations[key] = representation
808
+ return True
809
+
810
+ def save_extraction(self, session_id: str, result: ExtractionResult) -> bool:
811
+ """Save extraction result to session."""
812
+ session = self.get_session(session_id)
813
+ if not session:
814
+ return False
815
+ session.extractions.append(result)
816
+ return True
817
+
818
+ def export_session(self, session_id: str) -> Dict:
819
+ """Export session data as dictionary."""
820
+ session = self.get_session(session_id)
821
+ if not session:
822
+ return {"error": "Session not found"}
823
+
824
+ return {
825
+ "session_id": session.session_id,
826
+ "created_at": session.created_at,
827
+ "tables_count": len(session.tables),
828
+ "representations_count": len(session.representations),
829
+ "extractions_count": len(session.extractions),
830
+ "extractions": [
831
+ {
832
+ "table_name": e.table_name,
833
+ "model_type": e.model_type,
834
+ "result": e.result,
835
+ "timestamp": e.timestamp,
836
+ "follow_up_applied": e.follow_up_applied
837
+ }
838
+ for e in session.extractions
839
+ ]
840
+ }
841
+
842
+
843
+ # =============================================================================
844
+ # MCP Server Definition
845
+ # =============================================================================
846
+
847
+ # Initialize global components
848
+ table_representer = TableRepresenter()
849
+ table_to_json = TableToJSON()
850
+ table_splitter = TableSplitter()
851
+ session_manager = SessionManager()
852
+ gpt_extractor = None # Lazy initialization
853
+
854
+ def get_extractor() -> GPTExtractor:
855
+ """Get or create GPT extractor instance."""
856
+ global gpt_extractor
857
+ if gpt_extractor is None:
858
+ gpt_extractor = GPTExtractor()
859
+ return gpt_extractor
860
+
861
+ # Create MCP server
862
+ mcp = FastMCP("MaTableGPT-MCP")
863
+
864
+ # =============================================================================
865
+ # MCP Tools
866
+ # =============================================================================
867
+
868
+ @mcp.tool()
869
+ def create_session() -> Dict:
870
+ """
871
+ Create a new extraction session.
872
+
873
+ Returns a session ID that should be used for subsequent operations.
874
+ Sessions help organize and track table processing workflows.
875
+ """
876
+ session_id = session_manager.create_session()
877
+ return {
878
+ "success": True,
879
+ "session_id": session_id,
880
+ "message": "Session created successfully. Use this session_id for subsequent operations."
881
+ }
882
+
883
+
884
+ @mcp.tool()
885
+ def html_to_tsv_representation(
886
+ html_table: str,
887
+ title: str = "",
888
+ caption: str = "",
889
+ session_id: str = "",
890
+ table_name: str = ""
891
+ ) -> Dict:
892
+ """
893
+ Convert an HTML table to TSV (Tab-Separated Values) representation.
894
+
895
+ This format is optimized for GPT extraction as it preserves table structure
896
+ including merged cells, headers, and captions in a text format.
897
+
898
+ Args:
899
+ html_table: HTML string containing the table element
900
+ title: Optional title of the table
901
+ caption: Optional caption/footnotes of the table
902
+ session_id: Optional session ID to save the representation
903
+ table_name: Optional name for the table (used for saving)
904
+
905
+ Returns:
906
+ Dictionary containing the TSV representation
907
+ """
908
+ try:
909
+ representation = table_representer.html_to_tsv(html_table, title, caption)
910
+
911
+ result = {
912
+ "success": True,
913
+ "format": "TSV",
914
+ "representation": representation
915
+ }
916
+
917
+ # Save to session if provided
918
+ if session_id and table_name:
919
+ if session_manager.save_representation(session_id, table_name, representation, "tsv"):
920
+ result["saved_to_session"] = session_id
921
+
922
+ return result
923
+ except Exception as e:
924
+ return {"success": False, "error": str(e)}
925
+
926
+
927
+ @mcp.tool()
928
+ def html_to_json_representation(
929
+ html_table: str,
930
+ title: str = "",
931
+ caption: str = "",
932
+ session_id: str = "",
933
+ table_name: str = ""
934
+ ) -> Dict:
935
+ """
936
+ Convert an HTML table to JSON representation.
937
+
938
+ This format converts the table structure into a nested JSON dictionary
939
+ with column headers as keys and cell values as lists.
940
+
941
+ Args:
942
+ html_table: HTML string containing the table element
943
+ title: Optional title of the table
944
+ caption: Optional caption/footnotes of the table
945
+ session_id: Optional session ID to save the representation
946
+ table_name: Optional name for the table (used for saving)
947
+
948
+ Returns:
949
+ Dictionary containing the JSON representation
950
+ """
951
+ try:
952
+ representation = table_to_json.html_to_json(html_table, title, caption)
953
+
954
+ result = {
955
+ "success": True,
956
+ "format": "JSON",
957
+ "representation": representation
958
+ }
959
+
960
+ # Save to session if provided
961
+ if session_id and table_name:
962
+ if session_manager.save_representation(
963
+ session_id, table_name, json.dumps(representation), "json"
964
+ ):
965
+ result["saved_to_session"] = session_id
966
+
967
+ return result
968
+ except Exception as e:
969
+ return {"success": False, "error": str(e)}
970
+
971
+
972
+ @mcp.tool()
973
+ def analyze_table_structure(html_table: str) -> Dict:
974
+ """
975
+ Analyze the structure of an HTML table.
976
+
977
+ This tool examines the table to identify:
978
+ - Total number of rows
979
+ - Presence of thead/tbody elements
980
+ - Header rows vs body rows
981
+ - Merged cells
982
+
983
+ Use this to understand complex tables before processing.
984
+
985
+ Args:
986
+ html_table: HTML string containing the table element
987
+
988
+ Returns:
989
+ Dictionary containing structural analysis
990
+ """
991
+ try:
992
+ analysis = table_splitter.analyze_table_structure(html_table)
993
+ return {"success": True, "analysis": analysis}
994
+ except Exception as e:
995
+ return {"success": False, "error": str(e)}
996
+
997
+
998
+ @mcp.tool()
999
+ def split_complex_table(
1000
+ html_table: str,
1001
+ title: str = "",
1002
+ caption: str = ""
1003
+ ) -> Dict:
1004
+ """
1005
+ Split a complex table into simpler components.
1006
+
1007
+ Complex tables with multiple internal headers or sub-tables are split
1008
+ into individual tables that are easier to process.
1009
+
1010
+ Args:
1011
+ html_table: HTML string containing the table element
1012
+ title: Optional title of the table
1013
+ caption: Optional caption/footnotes of the table
1014
+
1015
+ Returns:
1016
+ Dictionary containing list of split table components
1017
+ """
1018
+ try:
1019
+ split_tables = table_splitter.split_table(html_table, title, caption)
1020
+ return {
1021
+ "success": True,
1022
+ "table_count": len(split_tables),
1023
+ "tables": split_tables
1024
+ }
1025
+ except Exception as e:
1026
+ return {"success": False, "error": str(e)}
1027
+
1028
+
1029
+ @mcp.tool()
1030
+ def extract_catalyst_data_zero_shot(
1031
+ table_representation: str,
1032
+ session_id: str = "",
1033
+ table_name: str = ""
1034
+ ) -> Dict:
1035
+ """
1036
+ Extract catalyst data from table representation using zero-shot GPT.
1037
+
1038
+ This uses a multi-step questioning approach to:
1039
+ 1. Identify catalysts in the table
1040
+ 2. Determine performance metrics for each catalyst
1041
+ 3. Extract property values
1042
+ 4. Clean up the result
1043
+
1044
+ Args:
1045
+ table_representation: TSV or JSON representation of the table
1046
+ session_id: Optional session ID to save the extraction
1047
+ table_name: Optional name for the table
1048
+
1049
+ Returns:
1050
+ Dictionary containing extracted catalyst data
1051
+ """
1052
+ try:
1053
+ extractor = get_extractor()
1054
+ result = extractor.extract_zero_shot(table_representation)
1055
+
1056
+ extraction_result = ExtractionResult(
1057
+ session_id=session_id or "no_session",
1058
+ table_name=table_name or "unnamed",
1059
+ model_type="zero-shot",
1060
+ result=result,
1061
+ timestamp=datetime.now().isoformat()
1062
+ )
1063
+
1064
+ if session_id:
1065
+ session_manager.save_extraction(session_id, extraction_result)
1066
+
1067
+ return {
1068
+ "success": True,
1069
+ "model_type": "zero-shot",
1070
+ "extraction": result
1071
+ }
1072
+ except Exception as e:
1073
+ return {"success": False, "error": str(e)}
1074
+
1075
+
1076
+ @mcp.tool()
1077
+ def extract_catalyst_data_few_shot(
1078
+ table_representation: str,
1079
+ examples: Optional[List[Dict]] = None,
1080
+ session_id: str = "",
1081
+ table_name: str = ""
1082
+ ) -> Dict:
1083
+ """
1084
+ Extract catalyst data from table representation using few-shot GPT.
1085
+
1086
+ Provide example input/output pairs to guide the extraction.
1087
+
1088
+ Args:
1089
+ table_representation: TSV or JSON representation of the table
1090
+ examples: List of {"input": ..., "output": ...} example pairs
1091
+ session_id: Optional session ID to save the extraction
1092
+ table_name: Optional name for the table
1093
+
1094
+ Returns:
1095
+ Dictionary containing extracted catalyst data
1096
+ """
1097
+ try:
1098
+ extractor = get_extractor()
1099
+ result = extractor.extract_few_shot(table_representation, examples or [])
1100
+
1101
+ extraction_result = ExtractionResult(
1102
+ session_id=session_id or "no_session",
1103
+ table_name=table_name or "unnamed",
1104
+ model_type="few-shot",
1105
+ result=result,
1106
+ timestamp=datetime.now().isoformat()
1107
+ )
1108
+
1109
+ if session_id:
1110
+ session_manager.save_extraction(session_id, extraction_result)
1111
+
1112
+ return {
1113
+ "success": True,
1114
+ "model_type": "few-shot",
1115
+ "extraction": result
1116
+ }
1117
+ except Exception as e:
1118
+ return {"success": False, "error": str(e)}
1119
+
1120
+
1121
+ @mcp.tool()
1122
+ def extract_catalyst_data_fine_tuned(
1123
+ table_representation: str,
1124
+ model_name: str,
1125
+ session_id: str = "",
1126
+ table_name: str = ""
1127
+ ) -> Dict:
1128
+ """
1129
+ Extract catalyst data using a fine-tuned GPT model.
1130
+
1131
+ Requires a pre-trained fine-tuned model name from OpenAI.
1132
+
1133
+ Args:
1134
+ table_representation: TSV or JSON representation of the table
1135
+ model_name: Name of the fine-tuned OpenAI model
1136
+ session_id: Optional session ID to save the extraction
1137
+ table_name: Optional name for the table
1138
+
1139
+ Returns:
1140
+ Dictionary containing extracted catalyst data
1141
+ """
1142
+ try:
1143
+ extractor = get_extractor()
1144
+ result = extractor.extract_with_fine_tuned(table_representation, model_name)
1145
+
1146
+ extraction_result = ExtractionResult(
1147
+ session_id=session_id or "no_session",
1148
+ table_name=table_name or "unnamed",
1149
+ model_type="fine-tuning",
1150
+ result=result,
1151
+ timestamp=datetime.now().isoformat()
1152
+ )
1153
+
1154
+ if session_id:
1155
+ session_manager.save_extraction(session_id, extraction_result)
1156
+
1157
+ return {
1158
+ "success": True,
1159
+ "model_type": "fine-tuning",
1160
+ "model_name": model_name,
1161
+ "extraction": result
1162
+ }
1163
+ except Exception as e:
1164
+ return {"success": False, "error": str(e)}
1165
+
1166
+
1167
+ @mcp.tool()
1168
+ def get_session_data(session_id: str) -> Dict:
1169
+ """
1170
+ Get all data from a session.
1171
+
1172
+ Returns the session metadata, table and representation counts, and the stored extractions.
1173
+
1174
+ Args:
1175
+ session_id: The session ID to retrieve
1176
+
1177
+ Returns:
1178
+ Dictionary containing session data
1179
+ """
1180
+ return session_manager.export_session(session_id)
1181
+
1182
+
1183
+ @mcp.tool()
1184
+ def list_performance_types() -> Dict:
1185
+ """
1186
+ List all supported performance types for catalyst extraction.
1187
+
1188
+ These are the standard property names that can be extracted from
1189
+ materials science literature tables about catalysts.
1190
+
1191
+ Returns:
1192
+ Dictionary containing list of performance types
1193
+ """
1194
+ return {
1195
+ "success": True,
1196
+ "performance_types": GPTExtractor.PERFORMANCE_LIST,
1197
+ "property_template": GPTExtractor.PROPERTY_TEMPLATE
1198
+ }
1199
+
1200
+
1201
+ @mcp.tool()
1202
+ def validate_extraction_result(extraction: Dict) -> Dict:
1203
+ """
1204
+ Validate an extraction result against expected schema.
1205
+
1206
+ Checks if the extraction follows the expected format with
1207
+ catalyst names, performance types, and property values.
1208
+
1209
+ Args:
1210
+ extraction: The extraction result to validate
1211
+
1212
+ Returns:
1213
+ Dictionary containing validation results
1214
+ """
1215
+ issues = []
1216
+ warnings = []
1217
+
1218
+ if not isinstance(extraction, dict):
1219
+ return {"valid": False, "issues": ["Extraction must be a dictionary"], "warnings": []}
1220
+
1221
+ # Check for error
1222
+ if "error" in extraction:
1223
+ issues.append(f"Extraction contains error: {extraction['error']}")
1224
+
1225
+ # Check structure
1226
+ valid_performance_types = set(GPTExtractor.PERFORMANCE_LIST)
1227
+
1228
+ for catalyst_name, performances in extraction.items():
1229
+ if catalyst_name in ["error", "raw_response", "catalysts"]:
1230
+ continue
1231
+
1232
+ if not isinstance(performances, dict):
1233
+ warnings.append(f"Catalyst '{catalyst_name}' should have dict of performances")
1234
+ continue
1235
+
1236
+ for perf_name, properties in performances.items():
1237
+ if perf_name not in valid_performance_types:
1238
+ warnings.append(f"Unknown performance type: {perf_name}")
1239
+
1240
+ if isinstance(properties, dict):
1241
+ for prop_key in properties.keys():
1242
+ if prop_key not in GPTExtractor.PROPERTY_TEMPLATE:
1243
+ warnings.append(f"Unknown property key: {prop_key}")
1244
+
1245
+ return {
1246
+ "valid": len(issues) == 0,
1247
+ "issues": issues,
1248
+ "warnings": warnings
1249
+ }
1250
+
1251
+
1252
+ @mcp.tool()
1253
+ def get_extraction_code_template(representation_format: str = "tsv", model_type: str = "zero-shot") -> Dict:
1254
+ """
1255
+ Get Python code template for local extraction.
1256
+
1257
+ Returns code that can be run locally to perform extraction
1258
+ without relying on the MCP service.
1259
+
1260
+ Args:
1261
+ representation_format: Either 'tsv' or 'json'
1262
+ model_type: One of 'zero-shot', 'few-shot', or 'fine-tuning'
1263
+
1264
+ Returns:
1265
+ Dictionary containing code template and instructions
1266
+ """
1267
+ code = f'''"""
1268
+ MaTableGPT Local Extraction Template
1269
+ Model Type: {model_type}
1270
+ Representation Format: {representation_format}
1271
+ """
1272
+
1273
+ from openai import OpenAI
1274
+ import json
1275
+
1276
+ # Initialize client
1277
+ client = OpenAI(api_key="YOUR_API_KEY")
1278
+
1279
+ # Performance types to extract
1280
+ PERFORMANCE_LIST = [
1281
+ 'overpotential', 'tafel_slope', 'Rct', 'stability', 'Cdl',
1282
+ 'onset_potential', 'current_density', 'potential', 'TOF', 'ECSA',
1283
+ 'water_splitting_potential', 'mass_activity', 'exchange_current_density',
1284
+ 'Rs', 'specific_activity', 'onset_overpotential', 'BET', 'surface_area',
1285
+ 'loading', 'apparent_activation_energy'
1286
+ ]
1287
+
1288
+ # Your table representation
1289
+ table_representation = """
1290
+ # Paste your {representation_format.upper()} representation here
1291
+ """
1292
+
1293
+ # System prompt
1294
+ system_prompt = """I will extract catalyst performance information from the table and create JSON format.
1295
+ Performance types: """ + str(PERFORMANCE_LIST) + """
1296
+ The JSON format will have performance within the catalyst, with elements:
1297
+ reaction type, value, electrolyte, condition, current density, versus, substrate.
1298
+ Output must contain only JSON dictionary."""
1299
+
1300
+ # Extract
1301
+ response = client.chat.completions.create(
1302
+ model="gpt-4-turbo-preview",
1303
+ messages=[
1304
+ {{"role": "system", "content": system_prompt}},
1305
+ {{"role": "user", "content": table_representation}}
1306
+ ],
1307
+ temperature=0
1308
+ )
1309
+
1310
+ result = response.choices[0].message.content.strip()
1311
+ print(json.dumps(json.loads(result), indent=2))
1312
+ '''
1313
+
1314
+ return {
1315
+ "success": True,
1316
+ "code": code,
1317
+ "instructions": [
1318
+ "1. Install openai package: pip install openai",
1319
+ "2. Replace YOUR_API_KEY with your OpenAI API key",
1320
+ "3. Paste your table representation in the designated area",
1321
+ "4. Run the script"
1322
+ ]
1323
+ }
1324
+
1325
+
1326
+ @mcp.tool()
1327
+ def get_environment_requirements() -> Dict:
1328
+ """
1329
+ Get the required environment setup for MaTableGPT.
1330
+
1331
+ Returns package requirements and setup instructions.
1332
+ Supports third-party API services (reverse proxy, API aggregators).
1333
+
1334
+ Returns:
1335
+ Dictionary containing requirements and instructions
1336
+ """
1337
+ return {
1338
+ "success": True,
1339
+ "python_version": ">=3.10",
1340
+ "required_packages": [
1341
+ "openai>=1.0.0 # OpenAI-compatible client, works with third-party APIs",
1342
+ "beautifulsoup4>=4.12.0",
1343
+ "pandas>=2.0.0",
1344
+ "lxml>=4.9.0",
1345
+ "mcp>=0.1.0"
1346
+ ],
1347
+ "optional_packages": [
1348
+ "nltk>=3.8.0 # For table splitting analysis"
1349
+ ],
1350
+ "environment_variables": {
1351
+ "LLM_API_KEY": "(Required) Your API key from third-party service",
1352
+ "LLM_API_BASE": "(Required) API base URL, e.g., https://api.your-service.com/v1",
1353
+ "LLM_MODEL": "(Optional) Model name, default: gpt-4-turbo-preview",
1354
+ "---": "--- Alternative variable names (also supported) ---",
1355
+ "OPENAI_API_KEY": "Alternative to LLM_API_KEY",
1356
+ "OPENAI_API_BASE": "Alternative to LLM_API_BASE",
1357
+ "OPENAI_MODEL": "Alternative to LLM_MODEL"
1358
+ },
1359
+ "setup_instructions": [
1360
+ "1. Create virtual environment: python -m venv venv",
1361
+ "2. Activate: venv\\Scripts\\activate (Windows) or source venv/bin/activate (Unix)",
1362
+ "3. Install: pip install -r requirements.txt",
1363
+ "4. Set environment variables (use your API provider's info):",
1364
+ " - LLM_API_KEY=your_api_key (Required)",
1365
+ " - LLM_API_BASE=https://api.your-service.com/v1 (Required)",
1366
+ " - LLM_MODEL=gpt-4-turbo-preview (Optional)",
1367
+ "5. Run: python start_mcp.py"
1368
+ ],
1369
+ "third_party_api_example": {
1370
+ "description": "Configuration for third-party API services (reverse proxy, OneAPI, etc.)",
1371
+ "windows_powershell": [
1372
+ "$env:LLM_API_KEY = 'sk-xxxx'",
1373
+ "$env:LLM_API_BASE = 'https://api.your-service.com/v1'",
1374
+ "$env:LLM_MODEL = 'gpt-4-turbo-preview'",
1375
+ "python start_mcp.py"
1376
+ ],
1377
+ "windows_cmd": [
1378
+ "set LLM_API_KEY=sk-xxxx",
1379
+ "set LLM_API_BASE=https://api.your-service.com/v1",
1380
+ "set LLM_MODEL=gpt-4-turbo-preview",
1381
+ "python start_mcp.py"
1382
+ ],
1383
+ "unix_bash": [
1384
+ "export LLM_API_KEY=sk-xxxx",
1385
+ "export LLM_API_BASE=https://api.your-service.com/v1",
1386
+ "export LLM_MODEL=gpt-4-turbo-preview",
1387
+ "python start_mcp.py"
1388
+ ],
1389
+ "docker_env": [
1390
+ "-e LLM_API_KEY=sk-xxxx",
1391
+ "-e LLM_API_BASE=https://api.your-service.com/v1",
1392
+ "-e LLM_MODEL=gpt-4-turbo-preview"
1393
+ ],
1394
+ "huggingface_secrets": [
1395
+ "LLM_API_KEY = sk-xxxx",
1396
+ "LLM_API_BASE = https://api.your-service.com/v1",
1397
+ "LLM_MODEL = gpt-4-turbo-preview"
1398
+ ]
1399
+ }
1400
+ }
1401
+
1402
+
1403
+ # =============================================================================
1404
+ # Server Entry Point
1405
+ # =============================================================================
1406
+
1407
+ def main():
1408
+ """Run the MCP server."""
1409
+ mcp.run()
1410
+
1411
+
1412
+ if __name__ == "__main__":
1413
+ main()
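The nested shape that `validate_extraction_result` walks (catalyst, then performance type, then property) can be illustrated with a standalone sketch. `PERFORMANCE_LIST` is copied from the code template above. `PROPERTY_KEYS` is an assumption derived from the elements named in the system prompt; the real `GPTExtractor.PROPERTY_TEMPLATE` is defined elsewhere in `mcp_service.py` and may use different keys. `check_extraction` and the sample catalyst are hypothetical names used only for illustration.

```python
from typing import Dict, List

# Performance types, as listed in the local extraction template.
PERFORMANCE_LIST = [
    'overpotential', 'tafel_slope', 'Rct', 'stability', 'Cdl',
    'onset_potential', 'current_density', 'potential', 'TOF', 'ECSA',
    'water_splitting_potential', 'mass_activity', 'exchange_current_density',
    'Rs', 'specific_activity', 'onset_overpotential', 'BET', 'surface_area',
    'loading', 'apparent_activation_energy',
]

# ASSUMPTION: property keys guessed from the system prompt wording,
# not taken from GPTExtractor.PROPERTY_TEMPLATE.
PROPERTY_KEYS = {
    "reaction_type", "value", "electrolyte", "condition",
    "current_density", "versus", "substrate",
}

# Top-level keys that carry metadata rather than catalyst entries.
RESERVED_KEYS = {"error", "raw_response", "catalysts"}


def check_extraction(extraction: Dict) -> Dict[str, List[str]]:
    """Collect issues and warnings for a catalyst -> performance -> property dict."""
    issues: List[str] = []
    warnings: List[str] = []
    if not isinstance(extraction, dict):
        return {"issues": ["Extraction must be a dictionary"], "warnings": []}
    if "error" in extraction:
        issues.append(f"Extraction contains error: {extraction['error']}")
    for catalyst, performances in extraction.items():
        if catalyst in RESERVED_KEYS:
            continue
        if not isinstance(performances, dict):
            warnings.append(f"Catalyst '{catalyst}' should have dict of performances")
            continue
        for perf_name, properties in performances.items():
            if perf_name not in PERFORMANCE_LIST:
                warnings.append(f"Unknown performance type: {perf_name}")
            if isinstance(properties, dict):
                for key in properties:
                    if key not in PROPERTY_KEYS:
                        warnings.append(f"Unknown property key: {key}")
    return {"issues": issues, "warnings": warnings}


# Hypothetical sample: one recognised performance type and one unknown one.
sample = {
    "NiFe-LDH": {
        "overpotential": {"value": "250 mV", "current_density": "10 mA/cm2"},
        "activity": {"value": "high"},
    }
}
report = check_extraction(sample)
print(report)
```

Unknown names only produce warnings, so a result still counts as valid unless it carries an `error` key; this mirrors the tool's choice to flag unexpected model output without rejecting it.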
requirements.txt ADDED
@@ -0,0 +1,24 @@
1
+ # MaTableGPT MCP Service Requirements
2
+ # ====================================
3
+
4
+ # Core MCP Framework
5
+ mcp>=0.1.0
6
+
7
+ # OpenAI API for GPT extraction
8
+ openai>=1.0.0
9
+
10
+ # HTML Parsing
11
+ beautifulsoup4>=4.12.0
12
+ lxml>=4.9.0
13
+
14
+ # Data Processing
15
+ pandas>=2.0.0
16
+
17
+ # Web Framework for HuggingFace Space
18
+ gradio>=4.0.0
19
+
20
+ # Async Support
21
+ httpx>=0.25.0
22
+
23
+ # Optional: For table splitting analysis
24
+ nltk>=3.8.0
start_mcp.py ADDED
@@ -0,0 +1,144 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ MaTableGPT MCP Server Launcher
4
+ ==============================
5
+
6
+ This script starts the MaTableGPT MCP service for extracting
7
+ table data from materials science literature.
8
+
9
+ Usage:
10
+ python start_mcp.py [--host HOST] [--port PORT] [--mode MODE] [--debug]
11
+
12
+ Arguments:
13
+ --host Host address (default: 0.0.0.0)
14
+ --port Port number (default: 7865)
15
+ --mode Run mode: 'stdio' or 'sse' (default: stdio)
+ --debug Enable debug logging
16
+ """
17
+
18
+ import os
19
+ import sys
20
+ import argparse
21
+ import logging
22
+
23
+ # Add parent directory to path for imports
24
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
25
+
26
+ # Configure logging
27
+ logging.basicConfig(
28
+ level=logging.INFO,
29
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
30
+ )
31
+ logger = logging.getLogger("matablegpt-mcp-launcher")
32
+
33
+
34
+ def check_environment():
35
+ """Check if required environment variables are set."""
36
+ warnings = []
37
+
38
+ if not (os.environ.get('LLM_API_KEY') or os.environ.get('OPENAI_API_KEY')):
39
+ warnings.append(
40
+ "Neither LLM_API_KEY nor OPENAI_API_KEY is set. GPT extraction features will not work. "
41
+ "Set one with: export LLM_API_KEY=your_key (Unix) or "
42
+ "set LLM_API_KEY=your_key (Windows)"
43
+ )
44
+
45
+ return warnings
46
+
47
+
48
+ def check_dependencies():
49
+ """Check if required packages are installed."""
50
+ missing = []
51
+
52
+ required = [
53
+ ('mcp', 'mcp'),
54
+ ('openai', 'openai'),
55
+ ('bs4', 'beautifulsoup4'),
56
+ ('pandas', 'pandas'),
57
+ ('lxml', 'lxml')
58
+ ]
59
+
60
+ for module, package in required:
61
+ try:
62
+ __import__(module)
63
+ except ImportError:
64
+ missing.append(package)
65
+
66
+ return missing
67
+
68
+
69
+ def main():
70
+ """Main entry point."""
71
+ parser = argparse.ArgumentParser(
72
+ description="MaTableGPT MCP Server - Table Data Extraction from Materials Science Literature"
73
+ )
74
+ parser.add_argument(
75
+ '--host',
76
+ default='0.0.0.0',
77
+ help='Host address (default: 0.0.0.0)'
78
+ )
79
+ parser.add_argument(
80
+ '--port',
81
+ type=int,
82
+ default=7865,
83
+ help='Port number (default: 7865)'
84
+ )
85
+ parser.add_argument(
86
+ '--mode',
87
+ choices=['stdio', 'sse'],
88
+ default='stdio',
89
+ help='Run mode: stdio for standard I/O, sse for Server-Sent Events (default: stdio)'
90
+ )
91
+ parser.add_argument(
92
+ '--debug',
93
+ action='store_true',
94
+ help='Enable debug logging'
95
+ )
96
+
97
+ args = parser.parse_args()
98
+
99
+ if args.debug:
100
+ logging.getLogger().setLevel(logging.DEBUG)
101
+
102
+ # Check dependencies
103
+ missing = check_dependencies()
104
+ if missing:
105
+ logger.error(f"Missing required packages: {', '.join(missing)}")
106
+ logger.error(f"Install with: pip install {' '.join(missing)}")
107
+ sys.exit(1)
108
+
109
+ # Check environment
110
+ warnings = check_environment()
111
+ for warning in warnings:
112
+ logger.warning(warning)
113
+
114
+ # Display startup info
115
+ logger.info("=" * 60)
116
+ logger.info("MaTableGPT MCP Server")
117
+ logger.info("=" * 60)
118
+ logger.info(f"Mode: {args.mode}")
119
+ if args.mode == 'sse':
120
+ logger.info(f"Host: {args.host}")
121
+ logger.info(f"Port: {args.port}")
122
+ logger.info("=" * 60)
123
+
124
+ # Import and run MCP service
125
+ try:
126
+ from mcp_service import mcp
127
+
128
+ if args.mode == 'stdio':
129
+ logger.info("Starting MCP server in stdio mode...")
130
+ mcp.run()
131
+ else:
132
+ logger.info(f"Starting MCP server in SSE mode on {args.host}:{args.port}...")
133
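+ # Assumes FastMCP from the official mcp SDK, whose run() takes no host/port; set them on settings instead
+ mcp.settings.host = args.host
+ mcp.settings.port = args.port
+ mcp.run(transport='sse')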
134
+
135
+ except ImportError as e:
136
+ logger.error(f"Failed to import MCP service: {e}")
137
+ sys.exit(1)
138
+ except Exception as e:
139
+ logger.error(f"Error starting MCP server: {e}")
140
+ sys.exit(1)
141
+
142
+
143
+ if __name__ == "__main__":
144
+ main()
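`get_environment_requirements` documents two interchangeable sets of variable names (`LLM_*` and `OPENAI_*`), so a launcher check has to look at both. The following is a minimal sketch of one way to resolve them. It assumes the `LLM_*` name takes precedence when both are set, which the source does not state; `resolve_setting` and `credential_warnings` are hypothetical helpers, not functions from `start_mcp.py`.

```python
import os
from typing import List, Mapping, Optional


def resolve_setting(
    primary: str,
    fallback: str,
    env: Mapping[str, str] = os.environ,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the primary variable if set and non-empty, else the fallback, else the default.

    ASSUMPTION: the primary (LLM_*) name wins when both are present.
    """
    return env.get(primary) or env.get(fallback) or default


def credential_warnings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build warnings for missing settings without raising."""
    warnings: List[str] = []
    if resolve_setting("LLM_API_KEY", "OPENAI_API_KEY", env) is None:
        warnings.append("Neither LLM_API_KEY nor OPENAI_API_KEY is set.")
    if resolve_setting("LLM_API_BASE", "OPENAI_API_BASE", env) is None:
        warnings.append("Neither LLM_API_BASE nor OPENAI_API_BASE is set.")
    return warnings


# Usage with an explicit mapping, so the example does not depend on the real environment.
example_env = {"OPENAI_API_KEY": "sk-xxxx"}
api_key = resolve_setting("LLM_API_KEY", "OPENAI_API_KEY", example_env)
model = resolve_setting("LLM_MODEL", "OPENAI_MODEL", example_env, default="gpt-4-turbo-preview")
print(api_key, model, credential_warnings(example_env))
```

Passing the environment as a parameter keeps the check easy to exercise in tests; in the launcher itself the default `os.environ` would be used.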