# Enhanced Markdown Conversion

This document describes the enhanced markdown conversion feature for RAG-Anything, which provides high-quality PDF generation from markdown files with multiple backend options and advanced styling.

## Overview

The converter generates PDFs from markdown files using one of several backends. It supports custom CSS styling and syntax highlighting, and it plugs into RAG-Anything's document processing pipeline.

## Key Features

- **Multiple Backends**: WeasyPrint, Pandoc, and automatic backend selection
- **Advanced Styling**: Custom CSS, syntax highlighting, and professional layouts
- **Image Support**: Embedded images with proper scaling and positioning
- **Table Support**: Formatted tables with borders and professional styling
- **Code Highlighting**: Syntax highlighting for code blocks using Pygments
- **Custom Templates**: Support for custom CSS and document templates
- **Table of Contents**: Automatic TOC generation with navigation links
- **Professional Typography**: High-quality fonts and spacing

## Installation

### Required Dependencies

```bash
# Basic installation
# (quoted so that shells such as zsh do not try to expand the brackets)
pip install 'raganything[all]'

# Required for enhanced markdown conversion
pip install markdown weasyprint pygments
```

### Optional Dependencies

```bash
# For Pandoc backend (system installation required)
# Ubuntu/Debian:
sudo apt-get install pandoc wkhtmltopdf

# macOS:
brew install pandoc wkhtmltopdf

# Or using conda:
conda install -c conda-forge pandoc wkhtmltopdf
```

### Backend-Specific Installation

#### WeasyPrint (Recommended)
```bash
# Install WeasyPrint with system dependencies
pip install weasyprint

# Ubuntu/Debian system dependencies:
sudo apt-get install -y build-essential python3-dev python3-pip \
    python3-setuptools python3-wheel python3-cffi libcairo2 \
    libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
    libffi-dev shared-mime-info
```

#### Pandoc
- Download from: https://pandoc.org/installing.html
- Requires system-wide installation
- Used for complex document structures and LaTeX-quality output

## Usage

### Basic Conversion

```python
from raganything.enhanced_markdown import EnhancedMarkdownConverter, MarkdownConfig

# Create converter with default settings
converter = EnhancedMarkdownConverter()

# Convert markdown file to PDF
success = converter.convert_file_to_pdf(
    input_path="document.md",
    output_path="document.pdf",
    method="auto"  # Automatically select best available backend
)

if success:
    print("✅ Conversion successful!")
else:
    print("❌ Conversion failed")
```

### Advanced Configuration

```python
from raganything.enhanced_markdown import EnhancedMarkdownConverter, MarkdownConfig

# Create custom configuration
config = MarkdownConfig(
    page_size="A4",           # A4, Letter, Legal, etc.
    margin="1in",             # CSS-style margins
    font_size="12pt",         # Base font size
    line_height="1.5",        # Line spacing
    include_toc=True,         # Generate table of contents
    syntax_highlighting=True, # Enable code syntax highlighting

    # Custom CSS styling
    custom_css="""
    body {
        font-family: 'Georgia', serif;
        color: #333;
    }
    h1 {
        color: #2c3e50;
        border-bottom: 2px solid #3498db;
        padding-bottom: 0.3em;
    }
    code {
        background-color: #f8f9fa;
        padding: 2px 4px;
        border-radius: 3px;
    }
    pre {
        background-color: #f8f9fa;
        border-left: 4px solid #3498db;
        padding: 15px;
        border-radius: 5px;
    }
    table {
        border-collapse: collapse;
        width: 100%;
        margin: 1em 0;
    }
    th, td {
        border: 1px solid #ddd;
        padding: 8px 12px;
        text-align: left;
    }
    th {
        background-color: #f2f2f2;
        font-weight: bold;
    }
    """
)

converter = EnhancedMarkdownConverter(config)
```

### Backend Selection

```python
from raganything.enhanced_markdown import EnhancedMarkdownConverter

# Check available backends
converter = EnhancedMarkdownConverter()
backend_info = converter.get_backend_info()

print("Available backends:")
for backend, available in backend_info["available_backends"].items():
    status = "✅" if available else "❌"
    print(f"  {status} {backend}")

print(f"Recommended backend: {backend_info['recommended_backend']}")

# Use specific backend
converter.convert_file_to_pdf(
    input_path="document.md",
    output_path="document.pdf",
    method="weasyprint"  # or "pandoc", "pandoc_system", "auto"
)
```
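
The availability report can also drive backend choice in code. The following is a minimal sketch that assumes only the dictionary shape used above (`available_backends` and `recommended_backend`); the helper `pick_backend` is hypothetical and not part of RAG-Anything.

```python
from typing import Any, Dict, Optional, Sequence


def pick_backend(
    backend_info: Dict[str, Any],
    preferred: Sequence[str] = ("weasyprint", "pandoc"),
) -> Optional[str]:
    """Return the first preferred backend reported as available.

    Falls back to the reported recommendation, or None if nothing is usable.
    """
    available = backend_info.get("available_backends", {})
    for name in preferred:
        if available.get(name):
            return name
    recommended = backend_info.get("recommended_backend")
    # Only trust the recommendation if it is itself marked available.
    if recommended and available.get(recommended):
        return recommended
    return None


# Example with a hand-written dictionary in the assumed shape.
info = {
    "available_backends": {"weasyprint": False, "pandoc": True},
    "recommended_backend": "pandoc",
}
chosen = pick_backend(info)
```

With a real converter, `pick_backend(converter.get_backend_info())` would give a value to pass as `method`, provided the result is not `None`.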

### Content Conversion

```python
# Convert markdown content directly (not from file)
markdown_content = """
# Sample Document

## Introduction
This is a **bold** statement with *italic* text.

## Code Example
~~~python
def hello_world():
    print("Hello, World!")
    return "Success"
~~~

## Table
| Feature | Status | Notes |
|---------|--------|-------|
| PDF Generation | ✅ | Working |
| Syntax Highlighting | ✅ | Pygments |
| Custom CSS | ✅ | Full support |
"""

success = converter.convert_markdown_to_pdf(
    markdown_content=markdown_content,
    output_path="sample.pdf",
    method="auto"
)
```

### Command Line Interface

```bash
# Basic conversion
python -m raganything.enhanced_markdown document.md --output document.pdf

# With specific backend
python -m raganything.enhanced_markdown document.md --method weasyprint

# With custom CSS file
python -m raganything.enhanced_markdown document.md --css custom_style.css

# Show backend information
python -m raganything.enhanced_markdown --info

# Help
python -m raganything.enhanced_markdown --help
```
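
To call the command line interface from a script, the arguments can be assembled as a list and handed to `subprocess`. This sketch uses only the flags listed above (`--output`, `--method`, `--css`). It assumes that the flags can be combined in one call and that a zero exit status means success; neither is confirmed here. `build_command` and `run_conversion` are hypothetical helper names.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(
    input_path: str,
    output_path: Optional[str] = None,
    method: Optional[str] = None,
    css_path: Optional[str] = None,
) -> List[str]:
    """Build the argument list for the enhanced_markdown module CLI."""
    cmd = [sys.executable, "-m", "raganything.enhanced_markdown", input_path]
    if output_path:
        cmd += ["--output", output_path]
    if method:
        cmd += ["--method", method]
    if css_path:
        cmd += ["--css", css_path]
    return cmd


def run_conversion(cmd: List[str]) -> bool:
    """Run the command; assumes exit status 0 indicates a successful conversion."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode == 0


cmd = build_command("document.md", output_path="document.pdf", method="weasyprint")
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.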

## Backend Comparison

| Backend | Pros | Cons | Best For | Quality |
|---------|------|------|----------|---------|
| **WeasyPrint** | • Excellent CSS support<br>• Fast rendering<br>• Great web-style layouts<br>• Python-based | • Limited LaTeX features<br>• Requires system deps | • Web-style documents<br>• Custom styling<br>• Fast conversion | ⭐⭐⭐⭐ |
| **Pandoc** | • Extensive features<br>• LaTeX-quality output<br>• Academic formatting<br>• Many input/output formats | • Slower conversion<br>• System installation<br>• Complex setup | • Academic papers<br>• Complex documents<br>• Publication quality | ⭐⭐⭐⭐⭐ |
| **Auto** | • Automatic selection<br>• Fallback support<br>• User-friendly | • May not use optimal backend | • General use<br>• Quick setup<br>• Development | ⭐⭐⭐⭐ |

## Configuration Options

### MarkdownConfig Parameters

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MarkdownConfig:
    # Page layout
    page_size: str = "A4"              # A4, Letter, Legal, A3, etc.
    margin: str = "1in"                # CSS margin format
    font_size: str = "12pt"            # Base font size
    line_height: str = "1.5"           # Line spacing multiplier

    # Content options
    include_toc: bool = True           # Generate table of contents
    syntax_highlighting: bool = True   # Enable code highlighting
    image_max_width: str = "100%"      # Maximum image width
    table_style: str = "..."           # Default table CSS

    # Styling
    css_file: Optional[str] = None     # External CSS file path
    custom_css: Optional[str] = None   # Inline CSS content
    template_file: Optional[str] = None # Custom HTML template

    # Output options
    output_format: str = "pdf"         # Currently only PDF supported
    output_dir: Optional[str] = None   # Output directory

    # Metadata
    metadata: Optional[Dict[str, str]] = None  # Document metadata
```
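
The page layout parameters correspond naturally to CSS paged-media rules. The sketch below shows one plausible mapping; it is an illustration of the idea, not the converter's actual stylesheet, and `build_page_css` is a hypothetical helper.

```python
def build_page_css(
    page_size: str = "A4",
    margin: str = "1in",
    font_size: str = "12pt",
    line_height: str = "1.5",
) -> str:
    """Render layout settings as an @page rule plus body typography."""
    return (
        f"@page {{ size: {page_size}; margin: {margin}; }}\n"
        f"body {{ font-size: {font_size}; line-height: {line_height}; }}\n"
    )


css = build_page_css(page_size="Letter", margin="2cm")
print(css)
```

A string like this could be supplied through `custom_css`, though whether it overrides the built-in layout rules depends on how the converter orders its stylesheets.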

### Supported Markdown Features

#### Basic Formatting
- **Headers**: `# ## ### #### ##### ######`
- **Emphasis**: `*italic*`, `**bold**`, `***bold italic***`
- **Links**: `[text](url)`, `[text][ref]`
- **Images**: `![alt](url)`, `![alt][ref]`
- **Lists**: Ordered and unordered, nested
- **Blockquotes**: `> quote`
- **Line breaks**: Two trailing spaces force a line break; a blank line starts a new paragraph

#### Advanced Features
- **Tables**: GitHub-style tables with alignment
- **Code blocks**: Fenced code blocks with language specification
- **Inline code**: `backtick code`
- **Horizontal rules**: `---` or `***`
- **Footnotes**: `[^1]` references
- **Definition lists**: Term and definition pairs
- **Attributes**: `{#id .class key=value}`

#### Code Highlighting

````markdown
```python
def example_function():
    """This will be syntax highlighted"""
    return "Hello, World!"
```

```javascript
function exampleFunction() {
    // This will also be highlighted
    return "Hello, World!";
}
```
````
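
To show what highlighting a fenced block involves, here is a small sketch that calls Pygments directly and falls back to an escaped `<pre>` block when Pygments or the requested lexer is unavailable. How the converter itself wires Pygments in is not shown in this document, so treat `highlight_code` as a hypothetical stand-in.

```python
import html

try:
    from pygments import highlight
    from pygments.formatters import HtmlFormatter
    from pygments.lexers import get_lexer_by_name
    from pygments.util import ClassNotFound

    HAVE_PYGMENTS = True
except ImportError:
    HAVE_PYGMENTS = False


def highlight_code(code: str, language: str) -> str:
    """Return HTML for a code block, highlighted when possible."""
    if HAVE_PYGMENTS:
        try:
            lexer = get_lexer_by_name(language)
        except ClassNotFound:
            lexer = None  # unknown language name: use the plain fallback
        if lexer is not None:
            return highlight(code, lexer, HtmlFormatter())
    # Plain fallback: escape the code so it renders literally.
    return f"<pre><code>{html.escape(code)}</code></pre>"


snippet_html = highlight_code('print("Hello, World!")', "python")
```

The CSS classes emitted by `HtmlFormatter` only produce colors once a matching stylesheet is present, for example one generated with `HtmlFormatter().get_style_defs()`.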

## Integration with RAG-Anything

The enhanced markdown conversion integrates seamlessly with RAG-Anything:

```python
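import asyncio

from raganything import RAGAnything

# Initialize RAG-Anything
rag = RAGAnything()

# Process a markdown file - enhanced conversion is used automatically.
# process_document_complete is a coroutine, so run it in an event loop
# (or use `await` inside an async function).
asyncio.run(rag.process_document_complete("document.md"))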

# Batch processing with enhanced markdown conversion
result = rag.process_documents_batch(
    file_paths=["doc1.md", "doc2.md", "doc3.md"],
    output_dir="./output"
)

# The .md files will be converted to PDF using enhanced conversion
# before being processed by the RAG system
```

## Performance Considerations

### Conversion Speed
- **WeasyPrint**: ~1-3 seconds for typical documents
- **Pandoc**: ~3-10 seconds for typical documents
- **Large documents**: Time scales roughly linearly with content
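
These figures depend on hardware and document content, so it is worth measuring on your own files. A minimal timing helper, using only the standard library; `timed` is a hypothetical name and the workload below is a stand-in for a real conversion call.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stand-in workload; with a converter this could be, for example:
#   timed(lambda: converter.convert_file_to_pdf("document.md", "document.pdf", method="weasyprint"))
result, seconds = timed(lambda: sum(range(100_000)))
print(f"took {seconds:.4f}s")
```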

### Memory Usage
- **WeasyPrint**: ~50-100MB per conversion
- **Pandoc**: ~100-200MB per conversion
- **Images**: Large images increase memory usage significantly

### Optimization Tips
1. **Resize large images** before embedding
2. **Use compressed images** (JPEG for photos, PNG for graphics)
3. **Limit concurrent conversions** to avoid memory issues
4. **Cache converted content** when processing multiple times
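
Tips 3 and 4 can be sketched together: cache each PDF under a hash of its markdown content, and cap parallel conversions with a small worker pool. The conversion function is passed in as a parameter, so nothing here depends on RAG-Anything; `convert_cached`, `convert_many`, and the cache layout are assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List

# Assumed signature: convert(markdown_content, output_path) -> bool
ConvertFn = Callable[[str, str], bool]


def convert_cached(markdown_content: str, cache_dir: Path, convert: ConvertFn) -> Path:
    """Convert once per distinct content; later calls reuse the cached PDF."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(markdown_content.encode("utf-8")).hexdigest()
    target = cache_dir / f"{digest}.pdf"
    if not target.exists():
        if not convert(markdown_content, str(target)):
            raise RuntimeError(f"conversion failed for {target.name}")
    return target


def convert_many(
    documents: Iterable[str],
    cache_dir: Path,
    convert: ConvertFn,
    max_workers: int = 2,
) -> List[Path]:
    """Convert documents with at most max_workers conversions in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda doc: convert_cached(doc, cache_dir, convert), documents))
```

The positional signature matches `converter.convert_markdown_to_pdf`, so that method could be passed as `convert`. Two caveats: identical documents submitted in the same batch may both be converted, and a backend that leaves a partial file behind after failing would need cleanup before the cache can be trusted.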

## Examples

### Sample Markdown Document

```markdown
# Technical Documentation

## Table of Contents
[TOC]

## Overview
This document provides comprehensive technical specifications.

## Architecture

### System Components
1. **Parser Engine**: Handles document processing
2. **Storage Layer**: Manages data persistence
3. **Query Interface**: Provides search capabilities

### Code Implementation
~~~python
from raganything import RAGAnything

# Initialize system
rag = RAGAnything(config={
    "working_dir": "./storage",
    "enable_image_processing": True
})

# Process document
await rag.process_document_complete("document.pdf")
~~~

### Performance Metrics

| Component | Throughput | Latency | Memory |
|-----------|------------|---------|--------|
| Parser | 100 docs/hour | 36s avg | 2.5 GB |
| Storage | 1000 ops/sec | 1ms avg | 512 MB |
| Query | 50 queries/sec | 20ms avg | 1 GB |

## Integration Notes

> **Important**: Always validate input before processing.

## Conclusion
The enhanced system provides excellent performance for document processing workflows.
```

### Generated PDF Features

The enhanced markdown converter produces PDFs with:

- **Professional typography** with proper font selection and spacing
- **Syntax-highlighted code blocks** using Pygments
- **Formatted tables** with borders and alternating row colors
- **Clickable table of contents** with navigation links
- **Responsive images** that scale appropriately
- **Custom styling** through CSS
- **Proper page breaks** and margins
- **Document metadata** and properties

## Troubleshooting

### Common Issues

#### WeasyPrint Installation Problems
```bash
# Ubuntu/Debian: Install system dependencies
sudo apt-get update
sudo apt-get install -y build-essential python3-dev libcairo2 \
    libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
    libffi-dev shared-mime-info

# Then reinstall WeasyPrint
pip install --force-reinstall weasyprint
```

#### Pandoc Not Found
```bash
# Check if Pandoc is installed
pandoc --version

# Install Pandoc (Ubuntu/Debian)
sudo apt-get install pandoc wkhtmltopdf

# Or download from: https://pandoc.org/installing.html
```

#### CSS Issues
- Check CSS syntax in custom_css
- Verify CSS file paths exist
- Test CSS with simple HTML first
- Use browser developer tools to debug styling
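
A quick pre-check can catch the most common mistake, an unclosed rule block, before a conversion is attempted. This is a deliberately naive sketch: it only counts braces and does not account for braces inside strings or comments. `braces_balanced` is a hypothetical helper.

```python
def braces_balanced(css: str) -> bool:
    """Return True if every '{' has a matching '}' in order."""
    depth = 0
    for ch in css:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace with no open block
    return depth == 0


ok = braces_balanced("body { color: #333; }")
broken = braces_balanced("h1 { color: #2c3e50;")
```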

#### Image Problems
- Ensure images are accessible (correct paths)
- Check image file formats (PNG, JPEG, GIF supported)
- Verify image file permissions
- Consider image size and format optimization
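
The first two checks can be automated by scanning the markdown for inline image references. The sketch below handles only the `![alt](path)` form, skips remote and inline `data:` images, and uses the format list given above; `find_image_problems` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import List

# Captures the target of ![alt](target), stopping at whitespace or ')'.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def find_image_problems(markdown_content: str, base_dir: str = ".") -> List[str]:
    """Return a description of each local image reference that looks wrong."""
    problems = []
    for target in IMAGE_PATTERN.findall(markdown_content):
        if target.startswith(("http://", "https://", "data:")):
            continue  # remote or inline images are not checked here
        path = Path(base_dir) / target
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            problems.append(f"{target}: unsupported format")
        elif not path.is_file():
            problems.append(f"{target}: file not found")
    return problems
```

Relative paths are resolved against `base_dir`, which would normally be the directory containing the markdown file.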

#### Font Issues
```python
from raganything.enhanced_markdown import MarkdownConfig

# Use web-safe fonts
config = MarkdownConfig(
    custom_css="""
    body {
        font-family: 'Arial', 'Helvetica', sans-serif;
    }
    """
)
```

### Debug Mode

Enable detailed logging for troubleshooting:

```python
import logging

from raganything.enhanced_markdown import EnhancedMarkdownConverter

# Enable debug logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Create converter with debug logging
converter = EnhancedMarkdownConverter()
result = converter.convert_file_to_pdf("test.md", "test.pdf")
```

### Error Handling

```python
from raganything.enhanced_markdown import EnhancedMarkdownConverter


def robust_conversion(input_path, output_path):
    """Convert with fallback backends"""
    converter = EnhancedMarkdownConverter()

    # Try backends in order of preference
    backends = ["weasyprint", "pandoc", "auto"]

    for backend in backends:
        try:
            success = converter.convert_file_to_pdf(
                input_path=input_path,
                output_path=output_path,
                method=backend
            )
            if success:
                print(f"✅ Conversion successful with {backend}")
                return True
        except Exception as e:
            # Report the failure and move on to the next backend
            print(f"❌ {backend} failed: {e}")

    print("❌ All backends failed")
    return False
```

## API Reference

### EnhancedMarkdownConverter

```python
class EnhancedMarkdownConverter:
    def __init__(self, config: Optional[MarkdownConfig] = None):
        """Initialize converter with optional configuration"""

    def convert_file_to_pdf(self, input_path: str, output_path: str, method: str = "auto") -> bool:
        """Convert markdown file to PDF"""

    def convert_markdown_to_pdf(self, markdown_content: str, output_path: str, method: str = "auto") -> bool:
        """Convert markdown content to PDF"""

    def get_backend_info(self) -> Dict[str, Any]:
        """Get information about available backends"""

    def convert_with_weasyprint(self, markdown_content: str, output_path: str) -> bool:
        """Convert using WeasyPrint backend"""

    def convert_with_pandoc(self, markdown_content: str, output_path: str) -> bool:
        """Convert using Pandoc backend"""
```

## Best Practices

1. **Choose the right backend** for your use case:
   - **WeasyPrint** for web-style documents and custom CSS
   - **Pandoc** for academic papers and complex formatting
   - **Auto** for general use and development

2. **Optimize images** before embedding:
   - Use appropriate formats (JPEG for photos, PNG for graphics)
   - Compress images to reduce file size
   - Set reasonable maximum widths

3. **Design responsive layouts**:
   - Use relative units (%, em) instead of absolute (px)
   - Test with different page sizes
   - Consider print-specific CSS

4. **Test your styling**:
   - Start with default styling and incrementally customize
   - Test with sample content before production use
   - Validate CSS syntax

5. **Handle errors gracefully**:
   - Implement fallback backends
   - Provide meaningful error messages
   - Log conversion attempts for debugging

6. **Performance optimization**:
   - Cache converted content when possible
   - Process large batches with appropriate worker counts
   - Monitor memory usage with large documents

## Conclusion

The enhanced markdown conversion feature provides professional-quality PDF generation with flexible styling options and multiple backend support. It seamlessly integrates with RAG-Anything's document processing pipeline while offering standalone functionality for markdown-to-PDF conversion needs.