opendatalab/MinerU-HTML

@@ -16,6 +16,9 @@ base_model:
 - Qwen/Qwen3-0.6B
 ---
 # Dripper(MinerU-HTML)
 **Dripper(MinerU-HTML)** is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting primary content from HTML pages using LLM-based classification and state machine-guided generation.
@@ -29,7 +32,7 @@ base_model:
 - ⚡ **Distributed Processing**: Ray-based parallel processing for large-scale evaluation
 - 🔧 **Multiple Extractors**: Supports various baseline extractors for comparison
-## Github 🔧 🔧 [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML)
 ## Installation
@@ -231,6 +234,107 @@ python app/eval_with_answer.py \
     --step 2 --cpus 128 --force_update
 ```
 ## Project Structure
 ```

 - Qwen/Qwen3-0.6B
 ---
 # Dripper(MinerU-HTML)
+<a href="https://github.com/opendatalab/MinerU-HTML">
+    <img src="https://img.shields.io/badge/GitHub-Repo-black?style=flat-square&logo=github" alt="GitHub Repo" />
+  </a>
 **Dripper(MinerU-HTML)** is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting primary content from HTML pages using LLM-based classification and state machine-guided generation.
 - ⚡ **Distributed Processing**: Ray-based parallel processing for large-scale evaluation
 - 🔧 **Multiple Extractors**: Supports various baseline extractors for comparison
+---
 ## Installation
     --step 2 --cpus 128 --force_update
 ```
+## MinerU Ecosystem & Cloud API (No GPU Required)
+MinerU-HTML is part of the broader **MinerU Ecosystem**. If you have no local GPU, or want to integrate HTML, PDF, and other document extraction into existing workflows, you can use the official Cloud API, SDKs, and RAG integrations.
+### Command Line API
+<details>
+<summary>Show commands</summary>
+```bash
+# Windows (PowerShell)
+irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex
+# macOS / Linux
+curl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh
+# Precision extract — token required
+mineru-open-api auth
+mineru-open-api extract webpage.html -o ./output/       # local file
+mineru-open-api crawl https://mineru.net/apiManage/docs  -o ./output/  # crawl from URL
+```
+</details>
+### Python SDK
+<details>
+<summary>Show code</summary>
+```python
+# pip install mineru-open-sdk
+from mineru import MinerU
+# Precision mode — tables, formulas, large files
+client = MinerU("your-token")  # https://mineru.net/apiManage/token
+result = client.extract("webpage.html")                         # local file
+print(result.markdown)
+result = client.crawl("https://mineru.net/apiManage/docs")     # crawl from URL
+print(result.markdown)
+```
+</details>
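The SDK snippet above passes the token as a string literal. A minimal sketch of reading it from the environment instead, assuming the variable name `MINERU_API_TOKEN` (borrowed from the MCP config in this README; whether the SDK reads that variable on its own is not stated here) and a hypothetical helper `load_token` that is not part of the SDK:

```python
import os


def load_token(env_var: str = "MINERU_API_TOKEN") -> str:
    """Return the API token from the environment, failing loudly if it is unset.

    `load_token` is a hypothetical helper for illustration, not a MinerU SDK function.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} before calling the MinerU API")
    return token
```

With this in place, the client line could become `client = MinerU(load_token())`, which keeps the token out of source control.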
+### RAG — LangChain
+<details>
+<summary>Show code</summary>
+```python
+# pip install langchain-mineru langchain-text-splitters langchain-openai langchain-community faiss-cpu
+from langchain_community.vectorstores import FAISS
+from langchain_mineru import MinerULoader
+from langchain_openai import OpenAIEmbeddings
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+# Precision mode: full RAG pipeline
+docs = MinerULoader(source="article.html", mode="precision", token="your-token",
+                    formula=True, table=True).load()
+chunks = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200).split_documents(docs)
+vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
+results = vectorstore.similarity_search("key requirements", k=3)
+```
+</details>
+### RAG — LlamaIndex
+`llama-index-readers-mineru` is an official LlamaIndex reader that supports multi-format document extraction.
+<details>
+<summary>Show code</summary>
+```python
+# pip install llama-index-readers-mineru
+from llama_index.readers.mineru import MinerUReader
+# Precision mode — OCR, formula, table
+docs = MinerUReader(mode="precision", token="your-token",
+                    ocr=True, formula=True, table=True).load_data("complex_article.html")
+# Full RAG pipeline
+from llama_index.core import VectorStoreIndex
+index = VectorStoreIndex.from_documents(docs)
+response = index.as_query_engine().query("Summarize the key content of this page")
+print(response)
+```
+</details>
+### MCP Server (Claude Desktop · Cursor · Windsurf)
+`mineru-open-mcp` lets any MCP-compatible AI client parse web pages and documents as a native tool.
+<details>
+<summary>Show config</summary>
+```json
+{
+  "mcpServers": {
+    "mineru": {
+      "command": "uvx",
+      "args": ["mineru-open-mcp"],
+      "env": { "MINERU_API_TOKEN": "your-token" }
+    }
+  }
+}
+```
+</details>
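MCP clients usually keep several servers in one config file, so the entry above often has to be merged into an existing `mcpServers` object rather than pasted over it. A small sketch of that merge, assuming the config is already loaded as a Python dict; the helper name `add_mineru_server` is hypothetical, and the location of the config file depends on the client and is left to the reader:

```python
import json
from copy import deepcopy


def add_mineru_server(config: dict, token: str) -> dict:
    """Return a copy of an MCP client config with the `mineru` server entry added.

    Hypothetical helper; the entry mirrors the JSON shown in this README.
    """
    merged = deepcopy(config)  # leave the caller's dict untouched
    servers = merged.setdefault("mcpServers", {})
    servers["mineru"] = {
        "command": "uvx",
        "args": ["mineru-open-mcp"],
        "env": {"MINERU_API_TOKEN": token},
    }
    return merged


if __name__ == "__main__":
    # Example input: a config that already declares another (made-up) server.
    existing = {"mcpServers": {"other": {"command": "npx", "args": ["some-server"]}}}
    print(json.dumps(add_mineru_server(existing, "your-token"), indent=2))
```

Returning a copy rather than mutating in place makes it easy to diff the old and new configs before writing anything back to disk.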
 ## Project Structure
 ```