Commit baa1ca4 by hotelll (verified) · 1 Parent(s): 2498a57

add MinerU Ecosystem info

Files changed (1)
  1. README.md +105 -1
README.md CHANGED
@@ -16,6 +16,9 @@ base_model:
16
  - Qwen/Qwen3-0.6B
17
  ---
18
  # Dripper(MinerU-HTML)
19
 
20
  **Dripper(MinerU-HTML)** is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting primary content from HTML pages using LLM-based classification and state machine-guided generation.
21
 
@@ -29,7 +32,7 @@ base_model:
29
  - ⚡ **Distributed Processing**: Ray-based parallel processing for large-scale evaluation
30
  - 🔧 **Multiple Extractors**: Supports various baseline extractors for comparison
31
 
32
- ## Github 🔧 🔧 [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML)
33
 
34
  ## Installation
35
 
@@ -231,6 +234,107 @@ python app/eval_with_answer.py \
231
  --step 2 --cpus 128 --force_update
232
  ```
233
 
234
  ## Project Structure
235
 
236
  ```
 
16
  - Qwen/Qwen3-0.6B
17
  ---
18
  # Dripper(MinerU-HTML)
19
+ <a href="https://github.com/opendatalab/MinerU-HTML">
20
+ <img src="https://img.shields.io/badge/GitHub-Repo-black?style=flat-square&logo=github" alt="GitHub Repo" />
21
+ </a>
22
 
23
  **Dripper(MinerU-HTML)** is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting primary content from HTML pages using LLM-based classification and state machine-guided generation.
24
 
 
32
  - ⚡ **Distributed Processing**: Ray-based parallel processing for large-scale evaluation
33
  - 🔧 **Multiple Extractors**: Supports various baseline extractors for comparison
34
 
35
+ ---
36
 
37
  ## Installation
38
 
 
234
  --step 2 --cpus 128 --force_update
235
  ```
236
 
237
+ ## MinerU Ecosystem & Cloud API (No GPU Required)
238
+
239
+ MinerU-HTML is part of the broader **MinerU Ecosystem**. If you don't have local GPU resources, or if you want to seamlessly integrate HTML/PDF/Document extraction into your existing workflows, you can use our official Cloud API, SDKs, and RAG integrations.
240
+
241
+ ### Command Line API
242
+ <details>
243
+ <summary>Show commands</summary>
244
+
245
+ ```bash
246
+ # Windows (PowerShell)
247
+ irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex
248
+
249
+ # macOS / Linux
250
+ curl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh
251
+
252
+ # Precision extract — token required
253
+ mineru-open-api auth
254
+ mineru-open-api extract webpage.html -o ./output/ # local file
255
+ mineru-open-api crawl https://mineru.net/apiManage/docs -o ./output/ # crawl from URL
256
+ ```
257
+ </details>
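
The `extract` command shown above can also be driven from a script. The following is a minimal sketch, assuming the `mineru-open-api` binary is already installed, on `PATH`, and authenticated via `mineru-open-api auth`. The helper names `build_extract_command` and `run_extract` are hypothetical and are not part of any MinerU package; only the `extract <file> -o <dir>` form is taken from the commands above.

```python
import shutil
import subprocess
from pathlib import Path


def build_extract_command(html_path, output_dir):
    """Build the argv list for `mineru-open-api extract <file> -o <dir>`."""
    return ["mineru-open-api", "extract", str(html_path), "-o", str(output_dir)]


def run_extract(html_path, output_dir):
    """Run the CLI on one local file; raises if the CLI is missing or fails."""
    if shutil.which("mineru-open-api") is None:
        raise FileNotFoundError("mineru-open-api was not found on PATH")
    # Create the output directory up front so the CLI has somewhere to write.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True turns a non-zero exit code into CalledProcessError.
    return subprocess.run(build_extract_command(html_path, output_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.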
258
+
259
+ ### Python SDK
260
+ <details>
261
+ <summary>Show code</summary>
262
+
263
+ ```python
264
+ # pip install mineru-open-sdk
265
+ from mineru import MinerU
266
+
267
+ # Precision mode — tables, formulas, large files
268
+ client = MinerU("your-token") # https://mineru.net/apiManage/token
269
+ result = client.extract("webpage.html") # local file
270
+ result = client.crawl("https://mineru.net/apiManage/docs") # crawl from URL
271
+ print(result.markdown)
272
+ ```
273
+ </details>
274
+
275
+ ### RAG — LangChain
276
+ <details>
277
+ <summary>Show code</summary>
278
+
279
+ ```python
280
+ # pip install langchain-mineru
281
+ from langchain_mineru import MinerULoader
282
+
283
+ # Precision mode — full RAG pipeline
284
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
285
+ from langchain_openai import OpenAIEmbeddings
286
+ from langchain_community.vectorstores import FAISS
287
+
288
+ docs = MinerULoader(source="article.html", mode="precision", token="your-token",
289
+                     formula=True, table=True).load()
290
+ chunks = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200).split_documents(docs)
291
+ vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
292
+ results = vectorstore.similarity_search("key requirements", k=3)
293
+ ```
294
+ </details>
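
To make the `chunk_size=1200, chunk_overlap=200` settings above concrete, here is a simplified fixed-window chunker. It is only an illustration of how size and overlap interact: LangChain's `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its real output will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1200, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these defaults, a 3000-character document yields three chunks starting at offsets 0, 1000, and 2000, and the last 200 characters of each chunk reappear at the start of the next.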
295
+
296
+ ### RAG — LlamaIndex
297
+ llama-index-readers-mineru is an official LlamaIndex Reader supporting multi-format document extraction.
298
+
299
+ <details>
300
+ <summary>Show code</summary>
301
+
302
+ ```python
303
+ # pip install llama-index-readers-mineru
304
+ from llama_index.readers.mineru import MinerUReader
305
+
306
+ # Precision mode — OCR, formula, table
307
+ docs = MinerUReader(mode="precision", token="your-token",
307
+                     ocr=True, formula=True, table=True).load_data("complex_article.html")
309
+
310
+ # Full RAG pipeline
311
+ from llama_index.core import VectorStoreIndex
312
+ index = VectorStoreIndex.from_documents(docs)
313
+ response = index.as_query_engine().query("Summarize the key content of this page")
314
+ print(response)
315
+ ```
316
+ </details>
317
+
318
+ ### MCP Server (Claude Desktop · Cursor · Windsurf)
319
+ mineru-open-mcp lets any MCP-compatible AI client parse web pages and documents as a native tool.
320
+
321
+ <details>
322
+ <summary>Show config</summary>
323
+
324
+ ```json
325
+ {
326
+   "mcpServers": {
327
+     "mineru": {
328
+       "command": "uvx",
329
+       "args": ["mineru-open-mcp"],
330
+       "env": { "MINERU_API_TOKEN": "your-token" }
331
+     }
332
+   }
333
+ }
334
+ ```
335
+ </details>
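
If you would rather not paste a token into the config by hand, the same JSON can be generated from an environment variable. This is a small sketch using only the Python standard library. The function name `build_mcp_config` is hypothetical, and where the resulting JSON must be saved depends on your MCP client, which this example does not assume.

```python
import json
import os


def build_mcp_config(token):
    """Return the MCP server entry shown above with the given token filled in."""
    return {
        "mcpServers": {
            "mineru": {
                "command": "uvx",
                "args": ["mineru-open-mcp"],
                "env": {"MINERU_API_TOKEN": token},
            }
        }
    }


if __name__ == "__main__":
    # Falls back to the placeholder when the variable is not set.
    token = os.environ.get("MINERU_API_TOKEN", "your-token")
    print(json.dumps(build_mcp_config(token), indent=2))
```

Run it with `MINERU_API_TOKEN` set and copy the printed JSON into your client's MCP configuration.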
336
+
337
+
338
  ## Project Structure
339
 
340
  ```