	# Enhanced Markdown Conversion

	This document describes the enhanced markdown conversion feature for RAG-Anything, which provides high-quality PDF generation from markdown files with multiple backend options and advanced styling.

	## Overview

	Conversion runs through one of several backends (WeasyPrint or Pandoc, chosen explicitly or selected automatically) and supports custom CSS, syntax highlighting, and direct use within RAG-Anything's document processing pipeline.

	## Key Features

	- Multiple Backends: WeasyPrint, Pandoc, and automatic backend selection
	- Advanced Styling: Custom CSS, syntax highlighting, and professional layouts
	- Image Support: Embedded images with proper scaling and positioning
	- Table Support: Formatted tables with borders and professional styling
	- Code Highlighting: Syntax highlighting for code blocks using Pygments
	- Custom Templates: Support for custom CSS and document templates
	- Table of Contents: Automatic TOC generation with navigation links
	- Professional Typography: High-quality fonts and spacing

	## Installation

	### Required Dependencies

	```bash
	# Basic installation
	pip install "raganything[all]"

	# Required for enhanced markdown conversion
	pip install markdown weasyprint pygments
	```

	### Optional Dependencies

	```bash
	# For Pandoc backend (system installation required)
	# Ubuntu/Debian:
	sudo apt-get install pandoc wkhtmltopdf

	# macOS:
	brew install pandoc wkhtmltopdf

	# Or using conda:
	conda install -c conda-forge pandoc wkhtmltopdf
	```

	### Backend-Specific Installation

	#### WeasyPrint (Recommended)
	```bash
	# Install WeasyPrint with system dependencies
	pip install weasyprint

	# Ubuntu/Debian system dependencies:
	sudo apt-get install -y build-essential python3-dev python3-pip \
	    python3-setuptools python3-wheel python3-cffi libcairo2 \
	    libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
	    libffi-dev shared-mime-info
	```

	#### Pandoc
	- Download from: https://pandoc.org/installing.html
	- Requires system-wide installation
	- Used for complex document structures and LaTeX-quality output
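
Because Pandoc and wkhtmltopdf are system tools rather than Python packages, it can help to confirm they are on `PATH` before choosing the Pandoc backend. The following is a minimal sketch using only the standard library; the helper name `find_optional_tools` is hypothetical and not part of RAG-Anything.

```python
import shutil


def find_optional_tools(names=("pandoc", "wkhtmltopdf")):
    """Map each tool name to its resolved executable path, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_optional_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```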

	## Usage

	### Basic Conversion

	```python
	from raganything.enhanced_markdown import EnhancedMarkdownConverter, MarkdownConfig

	# Create converter with default settings
	converter = EnhancedMarkdownConverter()

	# Convert markdown file to PDF
	success = converter.convert_file_to_pdf(
	    input_path="document.md",
	    output_path="document.pdf",
	    method="auto"  # Automatically select best available backend
	)

	if success:
	    print("✅ Conversion successful!")
	else:
	    print("❌ Conversion failed")
	```
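
The boolean return value reports whether the converter considered the run successful. As an extra sanity check, one could also confirm that the output file begins with the standard PDF header bytes (`%PDF-`). This is a small sketch; `looks_like_pdf` is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(5) == b"%PDF-"
```

A call such as `looks_like_pdf("document.pdf")` after the conversion catches empty or truncated output that a bare success flag might miss.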

	### Advanced Configuration

	```python
	from raganything.enhanced_markdown import EnhancedMarkdownConverter, MarkdownConfig

	# Create custom configuration
	config = MarkdownConfig(
	    page_size="A4",            # A4, Letter, Legal, etc.
	    margin="1in",              # CSS-style margins
	    font_size="12pt",          # Base font size
	    line_height="1.5",         # Line spacing
	    include_toc=True,          # Generate table of contents
	    syntax_highlighting=True,  # Enable code syntax highlighting

	    # Custom CSS styling
	    custom_css="""
	    body {
	        font-family: 'Georgia', serif;
	        color: #333;
	    }
	    h1 {
	        color: #2c3e50;
	        border-bottom: 2px solid #3498db;
	        padding-bottom: 0.3em;
	    }
	    code {
	        background-color: #f8f9fa;
	        padding: 2px 4px;
	        border-radius: 3px;
	    }
	    pre {
	        background-color: #f8f9fa;
	        border-left: 4px solid #3498db;
	        padding: 15px;
	        border-radius: 5px;
	    }
	    table {
	        border-collapse: collapse;
	        width: 100%;
	        margin: 1em 0;
	    }
	    th, td {
	        border: 1px solid #ddd;
	        padding: 8px 12px;
	        text-align: left;
	    }
	    th {
	        background-color: #f2f2f2;
	        font-weight: bold;
	    }
	    """
	)

	converter = EnhancedMarkdownConverter(config)
	```
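
The layout options above correspond naturally to CSS paged-media rules, which WeasyPrint understands. How the converter applies them internally is not stated here, so the following is only a sketch of what an equivalent stylesheet might look like; `page_css` is a hypothetical helper.

```python
def page_css(page_size="A4", margin="1in", font_size="12pt", line_height="1.5"):
    """Build a small CSS snippet from page-layout settings (illustrative only)."""
    return (
        f"@page {{ size: {page_size}; margin: {margin}; }}\n"
        f"body {{ font-size: {font_size}; line-height: {line_height}; }}\n"
    )
```

A string built this way could be passed as `custom_css` when the page geometry needs to be controlled from CSS rather than from the individual config fields.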

	### Backend Selection

	```python
	# Check available backends
	converter = EnhancedMarkdownConverter()
	backend_info = converter.get_backend_info()

	print("Available backends:")
	for backend, available in backend_info["available_backends"].items():
	    status = "✅" if available else "❌"
	    print(f" {status} {backend}")

	print(f"Recommended backend: {backend_info['recommended_backend']}")

	# Use specific backend
	converter.convert_file_to_pdf(
	    input_path="document.md",
	    output_path="document.pdf",
	    method="weasyprint"  # or "pandoc", "pandoc_system", "auto"
	)
	```
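
When a project has its own backend preference, the dictionary returned by `get_backend_info()` can drive the choice. The sketch below assumes the dictionary has the two keys used in the example above and that backend names match the `method` values; `pick_backend` is a hypothetical helper.

```python
def pick_backend(backend_info, preferred=("weasyprint", "pandoc")):
    """Return the first preferred backend that is available.

    Falls back to the reported recommendation (which may be None)
    when none of the preferred names is available.
    """
    available = backend_info.get("available_backends", {})
    for name in preferred:
        if available.get(name):
            return name
    return backend_info.get("recommended_backend")
```

The result can then be passed as `method=` to `convert_file_to_pdf`, with `"auto"` used when the helper returns `None`.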

	### Content Conversion

	````python
	# Convert markdown content directly (not from file)
	markdown_content = """
	# Sample Document

	## Introduction
	This is a **bold** statement with *italic* text.

	## Code Example
	```python
	def hello_world():
	    print("Hello, World!")
	    return "Success"
	```

	## Table
	| Feature | Status | Notes |
	|---------|--------|-------|
	| PDF Generation | ✅ | Working |
	| Syntax Highlighting | ✅ | Pygments |
	| Custom CSS | ✅ | Full support |
	"""

	success = converter.convert_markdown_to_pdf(
	    markdown_content=markdown_content,
	    output_path="sample.pdf",
	    method="auto"
	)
	````

	### Command Line Interface

	```bash
	# Basic conversion
	python -m raganything.enhanced_markdown document.md --output document.pdf

	# With specific backend
	python -m raganything.enhanced_markdown document.md --method weasyprint

	# With custom CSS file
	python -m raganything.enhanced_markdown document.md --css custom_style.css

	# Show backend information
	python -m raganything.enhanced_markdown --info

	# Help
	python -m raganything.enhanced_markdown --help
	```
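
The same commands can be driven from Python, for example in a build script. This sketch only assembles the argument list from the flags shown above; combining several flags in one call is an assumption, and `build_cli_args` is a hypothetical helper.

```python
import sys


def build_cli_args(input_path, output=None, method=None, css=None):
    """Assemble the command line for the enhanced_markdown module."""
    args = [sys.executable, "-m", "raganything.enhanced_markdown", input_path]
    if output:
        args += ["--output", output]
    if method:
        args += ["--method", method]
    if css:
        args += ["--css", css]
    return args
```

The list can be handed to `subprocess.run(args, check=True)`, which raises if the command exits with a non-zero status.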

	## Backend Comparison

	| Backend | Pros | Cons | Best For | Quality |
	|---------|------|------|----------|---------|
	| WeasyPrint | • Excellent CSS support<br>• Fast rendering<br>• Great web-style layouts<br>• Python-based | • Limited LaTeX features<br>• Requires system deps | • Web-style documents<br>• Custom styling<br>• Fast conversion | ⭐⭐⭐⭐ |
	| Pandoc | • Extensive features<br>• LaTeX-quality output<br>• Academic formatting<br>• Many input/output formats | • Slower conversion<br>• System installation<br>• Complex setup | • Academic papers<br>• Complex documents<br>• Publication quality | ⭐⭐⭐⭐⭐ |
	| Auto | • Automatic selection<br>• Fallback support<br>• User-friendly | • May not use optimal backend | • General use<br>• Quick setup<br>• Development | ⭐⭐⭐⭐ |

	## Configuration Options

	### MarkdownConfig Parameters

	```python
	from dataclasses import dataclass
	from typing import Dict, Optional

	@dataclass
	class MarkdownConfig:
	    # Page layout
	    page_size: str = "A4"             # A4, Letter, Legal, A3, etc.
	    margin: str = "1in"               # CSS margin format
	    font_size: str = "12pt"           # Base font size
	    line_height: str = "1.5"          # Line spacing multiplier

	    # Content options
	    include_toc: bool = True          # Generate table of contents
	    syntax_highlighting: bool = True  # Enable code highlighting
	    image_max_width: str = "100%"     # Maximum image width
	    table_style: str = "..."          # Default table CSS

	    # Styling
	    css_file: Optional[str] = None       # External CSS file path
	    custom_css: Optional[str] = None     # Inline CSS content
	    template_file: Optional[str] = None  # Custom HTML template

	    # Output options
	    output_format: str = "pdf"        # Currently only PDF supported
	    output_dir: Optional[str] = None  # Output directory

	    # Metadata
	    metadata: Optional[Dict[str, str]] = None  # Document metadata
	```
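
Several of these fields take CSS length strings, so a typo such as `"12"` or `"1 in"` may only surface at render time. The sketch below shows one way to pre-check single-value lengths before building a config. It is deliberately strict (it rejects multi-value shorthands such as `"1in 2cm"`), and both helper names are hypothetical.

```python
import re

# A number followed by a common CSS unit, e.g. "1in", "12pt", "100%".
_CSS_LENGTH = re.compile(r"^\d+(\.\d+)?(in|cm|mm|pt|px|em|rem|%)$")


def is_css_length(value):
    """Return True for a simple single-value CSS length string."""
    return bool(_CSS_LENGTH.match(value.strip()))


def check_layout(margin="1in", font_size="12pt", image_max_width="100%"):
    """Return the names of fields whose values are not simple CSS lengths."""
    fields = {
        "margin": margin,
        "font_size": font_size,
        "image_max_width": image_max_width,
    }
    return [name for name, value in fields.items() if not is_css_length(value)]
```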

	### Supported Markdown Features

	#### Basic Formatting
	- Headers: `# ## ### #### ##### ######`
	- Emphasis: `*italic*`, `**bold**`, `***bold italic***`
	- Links: `[text](url)`, `[text][ref]`
	- Images: `![alt](url)`, `![alt][ref]`
	- Lists: Ordered and unordered, nested
	- Blockquotes: `> quote`
	- Line breaks: Two trailing spaces for a hard line break; a blank line (`\n\n`) starts a new paragraph

	#### Advanced Features
	- Tables: GitHub-style tables with alignment
	- Code blocks: Fenced code blocks with language specification
	- Inline code: `backtick code`
	- Horizontal rules: `---` or `***`
	- Footnotes: `[^1]` references
	- Definition lists: Term and definition pairs
	- Attributes: `{#id .class key=value}`

	#### Code Highlighting

	````markdown
	```python
	def example_function():
	    """This will be syntax highlighted"""
	    return "Hello, World!"
	```

	```javascript
	function exampleFunction() {
	    // This will also be highlighted
	    return "Hello, World!";
	}
	```
	````

	## Integration with RAG-Anything

	The enhanced markdown conversion integrates seamlessly with RAG-Anything:

	```python
	import asyncio

	from raganything import RAGAnything

	async def main():
	    # Initialize RAG-Anything
	    rag = RAGAnything()

	    # Process markdown files - enhanced conversion is used automatically
	    await rag.process_document_complete("document.md")

	    # Batch processing with enhanced markdown conversion
	    result = rag.process_documents_batch(
	        file_paths=["doc1.md", "doc2.md", "doc3.md"],
	        output_dir="./output"
	    )

	    # The .md files will be converted to PDF using enhanced conversion
	    # before being processed by the RAG system

	# `await` is only valid inside a coroutine, so run the example via asyncio
	asyncio.run(main())
	```

	## Performance Considerations

	### Conversion Speed
	- WeasyPrint: ~1-3 seconds for typical documents
	- Pandoc: ~3-10 seconds for typical documents
	- Large documents: Time scales roughly linearly with content
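
These figures are rough, so it is worth measuring on representative documents. A minimal timing wrapper, using only the standard library, might look like the following; `timed` is a hypothetical helper.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `ok, seconds = timed(converter.convert_file_to_pdf, "document.md", "document.pdf", method="weasyprint")` reports how long one backend takes on a given file.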

	### Memory Usage
	- WeasyPrint: ~50-100MB per conversion
	- Pandoc: ~100-200MB per conversion
	- Images: Large images increase memory usage significantly

	### Optimization Tips
	1. Resize large images before embedding
	2. Use compressed images (JPEG for photos, PNG for graphics)
	3. Limit concurrent conversions to avoid memory issues
	4. Cache converted content when processing multiple times
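
Tip 4 can be sketched as a content-addressed cache: hash the markdown text and skip the conversion when a PDF for that hash already exists. Here `convert_fn` is a stand-in for a call such as `converter.convert_markdown_to_pdf`, and `convert_cached` is a hypothetical helper. Note that the cache key below covers only the markdown text, so a change in CSS or config would need to be folded into the hash as well.

```python
import hashlib
from pathlib import Path


def convert_cached(markdown_text, cache_dir, convert_fn):
    """Convert markdown_text once per unique content and reuse the PDF afterwards.

    convert_fn(markdown_text, output_path) must return True on success.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(markdown_text.encode("utf-8")).hexdigest()
    output_path = cache_dir / f"{digest}.pdf"
    if output_path.exists():
        return output_path  # cache hit: skip the conversion entirely
    if not convert_fn(markdown_text, str(output_path)):
        raise RuntimeError("conversion failed")
    return output_path
```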

	## Examples

	### Sample Markdown Document

	````markdown
	# Technical Documentation

	## Table of Contents
	[TOC]

	## Overview
	This document provides comprehensive technical specifications.

	## Architecture

	### System Components
	1. Parser Engine: Handles document processing
	2. Storage Layer: Manages data persistence
	3. Query Interface: Provides search capabilities

	### Code Implementation
	```python
	from raganything import RAGAnything

	# Initialize system
	rag = RAGAnything(config={
	    "working_dir": "./storage",
	    "enable_image_processing": True
	})

	# Process document
	await rag.process_document_complete("document.pdf")
	```

	### Performance Metrics

	| Component | Throughput | Latency | Memory |
	|-----------|------------|---------|--------|
	| Parser | 100 docs/hour | 36s avg | 2.5 GB |
	| Storage | 1000 ops/sec | 1ms avg | 512 MB |
	| Query | 50 queries/sec | 20ms avg | 1 GB |

	## Integration Notes

	> Important: Always validate input before processing.

	## Conclusion
	The enhanced system provides excellent performance for document processing workflows.
	````

	### Generated PDF Features

	The enhanced markdown converter produces PDFs with:

	- Professional typography with proper font selection and spacing
	- Syntax-highlighted code blocks using Pygments
	- Formatted tables with borders and alternating row colors
	- Clickable table of contents with navigation links
	- Responsive images that scale appropriately
	- Custom styling through CSS
	- Proper page breaks and margins
	- Document metadata and properties

	## Troubleshooting

	### Common Issues

	#### WeasyPrint Installation Problems
	```bash
	# Ubuntu/Debian: Install system dependencies
	sudo apt-get update
	sudo apt-get install -y build-essential python3-dev libcairo2 \
	    libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
	    libffi-dev shared-mime-info

	# Then reinstall WeasyPrint
	pip install --force-reinstall weasyprint
	```

	#### Pandoc Not Found
	```bash
	# Check if Pandoc is installed
	pandoc --version

	# Install Pandoc (Ubuntu/Debian)
	sudo apt-get install pandoc wkhtmltopdf

	# Or download from: https://pandoc.org/installing.html
	```

	#### CSS Issues
	- Check CSS syntax in `custom_css`
	- Verify CSS file paths exist
	- Test CSS with simple HTML first
	- Use browser developer tools to debug styling

	#### Image Problems
	- Ensure images are accessible (correct paths)
	- Check image file formats (PNG, JPEG, GIF supported)
	- Verify image file permissions
	- Consider image size and format optimization
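
The first check, that image paths resolve, can be automated before conversion. The sketch below scans for inline image syntax (`![alt](path)`) and reports local files that do not exist. It skips anything that looks like a URL and does not handle reference-style images; `missing_images` is a hypothetical helper.

```python
import re
from pathlib import Path

# Capture the target of inline images: ![alt](target "optional title")
_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")
# Anything starting with a URL scheme such as https: or data:
_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def missing_images(markdown_text, base_dir="."):
    """List local image paths referenced in the markdown that do not exist."""
    missing = []
    for target in _IMAGE.findall(markdown_text):
        if _SCHEME.match(target):
            continue  # remote or data URL; not checked here
        if not (Path(base_dir) / target).is_file():
            missing.append(target)
    return missing
```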

	#### Font Issues
	```python
	# Use web-safe fonts
	config = MarkdownConfig(
	    custom_css="""
	    body {
	        font-family: 'Arial', 'Helvetica', sans-serif;
	    }
	    """
	)
	```

	### Debug Mode

	Enable detailed logging for troubleshooting:

	```python
	import logging

	from raganything.enhanced_markdown import EnhancedMarkdownConverter

	# Enable debug logging
	logging.basicConfig(
	    level=logging.DEBUG,
	    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
	)

	# Create converter with debug logging
	converter = EnhancedMarkdownConverter()
	result = converter.convert_file_to_pdf("test.md", "test.pdf")
	```

	### Error Handling

	```python
	from raganything.enhanced_markdown import EnhancedMarkdownConverter

	def robust_conversion(input_path, output_path):
	    """Convert with fallback backends"""
	    converter = EnhancedMarkdownConverter()

	    # Try backends in order of preference
	    backends = ["weasyprint", "pandoc", "auto"]

	    for backend in backends:
	        try:
	            success = converter.convert_file_to_pdf(
	                input_path=input_path,
	                output_path=output_path,
	                method=backend
	            )
	            if success:
	                print(f"✅ Conversion successful with {backend}")
	                return True
	        except Exception as e:
	            # Report the failure and fall through to the next backend
	            print(f"❌ {backend} failed: {e}")

	    print("❌ All backends failed")
	    return False
	```

	## API Reference

	### EnhancedMarkdownConverter

	```python
	class EnhancedMarkdownConverter:
	    def __init__(self, config: Optional[MarkdownConfig] = None):
	        """Initialize converter with optional configuration"""

	    def convert_file_to_pdf(self, input_path: str, output_path: str, method: str = "auto") -> bool:
	        """Convert markdown file to PDF"""

	    def convert_markdown_to_pdf(self, markdown_content: str, output_path: str, method: str = "auto") -> bool:
	        """Convert markdown content to PDF"""

	    def get_backend_info(self) -> Dict[str, Any]:
	        """Get information about available backends"""

	    def convert_with_weasyprint(self, markdown_content: str, output_path: str) -> bool:
	        """Convert using WeasyPrint backend"""

	    def convert_with_pandoc(self, markdown_content: str, output_path: str) -> bool:
	        """Convert using Pandoc backend"""
	```

	## Best Practices

	1. Choose the right backend for your use case:
	   - WeasyPrint for web-style documents and custom CSS
	   - Pandoc for academic papers and complex formatting
	   - Auto for general use and development

	2. Optimize images before embedding:
	   - Use appropriate formats (JPEG for photos, PNG for graphics)
	   - Compress images to reduce file size
	   - Set reasonable maximum widths

	3. Design responsive layouts:
	   - Use relative units (%, em) instead of absolute (px)
	   - Test with different page sizes
	   - Consider print-specific CSS

	4. Test your styling:
	   - Start with default styling and incrementally customize
	   - Test with sample content before production use
	   - Validate CSS syntax

	5. Handle errors gracefully:
	   - Implement fallback backends
	   - Provide meaningful error messages
	   - Log conversion attempts for debugging

	6. Performance optimization:
	   - Cache converted content when possible
	   - Process large batches with appropriate worker counts
	   - Monitor memory usage with large documents
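
The advice on worker counts can be sketched with a bounded thread pool, which caps how many conversions are in flight (and therefore peak memory). Here `convert_fn` is a stand-in for a call such as `converter.convert_file_to_pdf`; whether a single converter instance is safe to share across threads is not stated in this document, so treat that as an assumption to verify. `convert_batch` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_batch(jobs, convert_fn, max_workers=2):
    """Run convert_fn(input_path, output_path) for each job with a bounded pool.

    jobs is an iterable of (input_path, output_path) pairs.
    Returns {input_path: True/False}; an exception counts as a failure.
    """
    def run(job):
        input_path, output_path = job
        try:
            return input_path, bool(convert_fn(input_path, output_path))
        except Exception:
            return input_path, False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, jobs))
```

Keeping `max_workers` small is the point of the sketch: with the memory figures quoted earlier, each additional concurrent conversion may add on the order of 50-200 MB.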

	## Conclusion

	The enhanced markdown conversion feature provides professional-quality PDF generation with flexible styling options and multiple backend support. It seamlessly integrates with RAG-Anything's document processing pipeline while offering standalone functionality for markdown-to-PDF conversion needs.