# Jupyter Notebook Usage

This document shows how to use the document processing function in Jupyter notebooks for integration into larger processing pipelines.

## Simple Usage

```python
from processing.document_processor import process_document_with_redaction

# Process a single document
result = process_document_with_redaction(
    file_path="path/to/your/document.pdf",
    endpoint="your-azure-openai-endpoint",
    api_key="your-azure-openai-key",
    api_version="2024-02-15-preview",
    deployment="o3-mini"  # or "o4-mini", "o3", "o4"
)

# Access the results
original_md = result.original_document_md
redacted_md = result.redacted_document_md
input_tokens = result.input_tokens
output_tokens = result.output_tokens
cost = result.cost

print("Processing complete!")
print(f"Input tokens: {input_tokens:,}")
print(f"Output tokens: {output_tokens:,}")
print(f"Total cost: ${cost:.4f}")
```

## Batch Processing

```python
import os

from processing.document_processor import process_document_with_redaction

# Configuration
AZURE_OPENAI_ENDPOINT = "your-azure-openai-endpoint"
AZURE_OPENAI_KEY = "your-azure-openai-key"
AZURE_OPENAI_VERSION = "2024-02-15-preview"
AZURE_OPENAI_DEPLOYMENT = "o3-mini"

# Process every PDF in a directory
pdf_directory = "path/to/pdf/files"
results = []

for filename in os.listdir(pdf_directory):
    if filename.endswith('.pdf'):
        file_path = os.path.join(pdf_directory, filename)
        print(f"Processing {filename}...")

        try:
            result = process_document_with_redaction(
                file_path=file_path,
                endpoint=AZURE_OPENAI_ENDPOINT,
                api_key=AZURE_OPENAI_KEY,
                api_version=AZURE_OPENAI_VERSION,
                deployment=AZURE_OPENAI_DEPLOYMENT
            )
            results.append({
                'filename': filename,
                'original_md': result.original_document_md,
                'redacted_md': result.redacted_document_md,
                'input_tokens': result.input_tokens,
                'output_tokens': result.output_tokens,
                'cost': result.cost
            })
            print(f"  ✓ Completed - Cost: ${result.cost:.4f}")
        except Exception as e:
            print(f"  ✗ Error processing {filename}: {e}")

# Summary
total_cost = sum(r['cost'] for r in results)
total_input_tokens = sum(r['input_tokens'] for r in results)
total_output_tokens = sum(r['output_tokens'] for r in results)

print("\nBatch processing complete!")
print(f"Documents processed: {len(results)}")
print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```

## Environment Variables

You can also use environment variables for configuration:

```python
import os

from dotenv import load_dotenv
from processing.document_processor import process_document_with_redaction

# Load environment variables from a .env file
load_dotenv()

# Get configuration from the environment
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_KEY = os.getenv("AZURE_OPENAI_KEY")
AZURE_OPENAI_VERSION = os.getenv("AZURE_OPENAI_VERSION")
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT")

# Process a document
result = process_document_with_redaction(
    file_path="document.pdf",
    endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_KEY,
    api_version=AZURE_OPENAI_VERSION,
    deployment=AZURE_OPENAI_DEPLOYMENT
)
```

## Return Value

The function returns a `ProcessingResult` object with the following attributes:

- `original_document_md`: Markdown version of the original document
- `redacted_document_md`: Markdown version with medication sections removed
- `input_tokens`: Number of input tokens used
- `output_tokens`: Number of output tokens generated
- `cost`: Total cost in USD

## Supported Models

The function supports the following Azure OpenAI deployment names:

- `o3-mini` - Cheapest option
- `o4-mini` - Comparable in cost to `o3-mini`
- `o3` - Medium cost
- `o4` - Most expensive but most capable

## Error Handling

The function will raise exceptions for:

- File not found
- Invalid Azure OpenAI credentials
- API rate limits
- Network errors

Make sure to handle these appropriately in your pipeline.
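Rate-limit and network errors are usually transient, so in a batch pipeline it is often enough to retry the call with exponential backoff instead of failing the whole run. Below is a minimal, generic sketch of such a wrapper; the `retryable` exception tuple is an assumption — substitute whichever exception types your Azure OpenAI client library actually raises for rate limits and timeouts.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, retryable=(Exception,)):
    """Call fn(), retrying with exponential backoff on retryable errors.

    Delays grow as base_delay * 2**(attempt - 1); the last failure is re-raised
    so the caller's normal error handling still applies.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable as e:
            if attempt == max_attempts:
                raise  # out of retries: surface the error to the pipeline
            delay = base_delay * 2 ** (attempt - 1)
            print(f"  Attempt {attempt} failed ({e}); retrying in {delay:.1f}s")
            time.sleep(delay)
```

In the batch loop above you would wrap the call, e.g. `result = with_retries(lambda: process_document_with_redaction(file_path=file_path, ...))`, keeping the surrounding `try`/`except` as the final safety net for non-transient errors.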