	---
	title: "Home"
	---

	version: 2.2.0

	Welcome to the Document Redaction App documentation. Navigate through the sections to learn how to install, use, and manage the application. Below is a brief introduction to the app, followed by installation instructions (PyPI, source, and an overview of Docker deployment).

	## Document redaction

	Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](https://seanpedrick-case.github.io/doc_redaction/src/user_guide.html) for a full walkthrough of all the features in the app.

	![Handwriting and signatures redacted example](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/refs/heads/main/review_redactions/Signatures%20and%20handwriting%20found.PNG)

	To identify text in documents, the 'Local' text extraction option uses PikePDF and the OCR image analysis option uses Tesseract. Both work well only for documents with typed text or scanned PDFs with clear text. Use AWS Textract to extract more complex elements, e.g. handwriting, signatures, or unclear text. For PII identification, the 'Local' option (based on spaCy) gives good results if you are looking for common names or terms, or for a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.

	Additional options include choosing the type of information to redact (e.g. people, places), custom terms to include in or exclude from redaction, fuzzy matching, language settings, and whole-page redaction. After redaction is complete, you can view and modify suggested redactions on the 'Review redactions' tab to quickly create a final redacted document.

	NOTE: The app is not 100% accurate, and it will miss some personal information. It is essential that a human reviews all outputs before the final outputs are used.

	---

	## Installation

	### Prerequisites: Tesseract and Poppler

	The application needs Tesseract (OCR) and Poppler (PDF utilities) installed on your system before the Python packages will work fully.

	#### Windows

	1. Tesseract: Install from the [UB Mannheim Tesseract wiki](https://github.com/UB-Mannheim/tesseract/wiki). During setup, add Tesseract to the system PATH.
	2. Poppler: Download from [Poppler for Windows](https://github.com/oschwartz10612/poppler-windows), extract to a permanent folder, and add the `bin` directory to your system PATH.

	Verify in a new terminal: `tesseract --version` and `pdftoppm -v`.

	#### Linux (Debian/Ubuntu)

	```bash
	sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils
	```

	#### Linux (Fedora/CentOS/RHEL)

	```bash
	sudo dnf install -y tesseract poppler-utils
	```
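
On any platform, the same check as `tesseract --version` and `pdftoppm -v` can be scripted. The following is a minimal sketch using only the Python standard library; the function name `missing_tools` is illustrative and not part of the app.

```python
import shutil


def missing_tools(names=("tesseract", "pdftoppm")):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler executables found.")
```

If a tool is reported as missing right after installation, open a new terminal first so that the updated PATH is picked up.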

	### Python environment

	Use a virtual environment (recommended):

	```bash
	python -m venv venv
	# Windows:
	.\venv\Scripts\activate
	# macOS/Linux:
	source venv/bin/activate
	```
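
To confirm that activation worked before installing anything, you can ask the interpreter itself. This is a small sketch that relies on the standard-library behaviour of `venv` (inside an environment, `sys.prefix` differs from `sys.base_prefix`); the helper name `in_virtualenv` is made up for this example.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```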

	### Install from PyPI

	The package is published on [PyPI](https://pypi.org/project/doc-redaction/) as `doc-redaction`; the import name is `doc_redaction`.

	```bash
	pip install doc_redaction
	```
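
The install command uses an underscore while the PyPI project name uses a hyphen. Both work because Python packaging tools normalise project names before comparing them. The sketch below reproduces that normalisation rule for illustration; the function name is an assumption of this example, not something the package provides.

```python
import re


def normalize_project_name(name):
    """Normalise a project name the way Python packaging tools compare them."""
    # Runs of '-', '_' and '.' collapse to a single '-', and case is ignored.
    return re.sub(r"[-_.]+", "-", name).lower()


if __name__ == "__main__":
    for candidate in ("doc_redaction", "doc-redaction", "Doc.Redaction"):
        print(candidate, "->", normalize_project_name(candidate))
```

Note that this applies only to the name given to `pip`; the name used in an `import` statement is always `doc_redaction`.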

	Optional extras (PaddleOCR, VLM stack; see `pyproject.toml` in the repository for the exact extras):

	```bash
	pip install "doc_redaction[paddle,vlm]"
	```

	After a PyPI install you get:

	- The `cli_redact` console script (same behaviour as `cli_redact.py` in the repo).
	- The programmatic API documented in [Package API usage (Python)](src/package_api_usage.qmd).
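
A quick way to confirm that the install succeeded is to look up the distribution and the import name from Python. This is a hedged sketch using only `importlib` from the standard library; the helper names are illustrative, and it does not call anything inside `doc_redaction` itself.

```python
from importlib import metadata, util


def installed_version(dist_name="doc-redaction"):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_importable(module_name="doc_redaction"):
    """True if a top-level module with this name can be located."""
    return util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("doc-redaction version:", installed_version())
    print("doc_redaction importable:", is_importable())
```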

	Gradio web UI: The full `app.py` UI is easiest to run from a repository clone (see below) or via Docker; a PyPI install is optimised for library and CLI use.

	### Install from source (clone / development)

	```bash
	git clone https://github.com/seanpedrick-case/doc_redaction.git
	cd doc_redaction
	```

	Lightweight install (no Paddle/VLM extras):

	```bash
	pip install -e .
	```

	Full extras from the repo:

	```bash
	pip install -e ".[paddle,vlm]"
	```

	Alternatively, install the dependencies from `requirements_lightweight.txt` or `requirements.txt` in the repository root (see the [README on GitHub](https://github.com/seanpedrick-case/doc_redaction) for GPU notes for Paddle and Torch).

	### Run the Gradio application (source checkout)

	From the repository root, with dependencies installed:

	```bash
	python app.py
	```

	Open the local URL shown in the terminal (typically `http://127.0.0.1:7860`).
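
If the page does not load, it can help to check whether anything is answering at that address. The following sketch uses only the standard library and assumes the typical URL mentioned above; the function name `is_serving` is hypothetical, and your terminal output remains the authority on the actual port.

```python
import urllib.error
import urllib.request


def is_serving(url="http://127.0.0.1:7860", timeout=2.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    # Bypass any configured proxy: the app is expected on the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status.
        return True
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return False


if __name__ == "__main__":
    print("app reachable:", is_serving())
```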

	### Command-line interface

	- PyPI install: `cli_redact --help`
	- Source checkout: `python cli_redact.py --help`
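
When calling the CLI from a script that may run against either kind of install, the choice between the two forms above can be automated. This is a minimal sketch; `cli_help_command` and its `repo_root` parameter are names invented for this example, and only the `--help` flag shown above is used.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def cli_help_command(repo_root="."):
    """Build the argv list that prints the redaction CLI help text."""
    if shutil.which("cli_redact"):
        # PyPI install: the console script is on PATH.
        return ["cli_redact", "--help"]
    # Source checkout: run the script with the current interpreter.
    return [sys.executable, str(Path(repo_root) / "cli_redact.py"), "--help"]


if __name__ == "__main__":
    # check=False: just show whatever the CLI prints, without raising.
    subprocess.run(cli_help_command(), check=False)
```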

	### Using as a Python package (no Gradio UI)

	For examples that mirror each Gradio `api_name` using the CLI-first Python API, see [Package API usage (Python)](src/package_api_usage.qmd).

	### Docker and cloud deployment

	For container-based installs, AWS CDK, and advanced deployment, see the [App installation guide (with CDK)](src/installation_guide.qmd) and the [Dockerfile](https://github.com/seanpedrick-case/doc_redaction/blob/main/Dockerfile) in the repository.

	### Configuration

	Application settings are typically loaded from `config/app_config.env`. A full variable reference is on the site under [App settings management guide](src/app_settings.qmd) and in the GitHub repository documentation.
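
To inspect which settings a config file would define, a few lines of Python are enough. This sketch assumes the file uses plain `KEY=VALUE` lines with `#` comments, which is the usual `.env` convention; it is not the app's own loader, and the variable names in the sample are placeholders rather than real settings.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        settings[key.strip()] = value
    return settings


if __name__ == "__main__":
    # EXAMPLE_FLAG and EXAMPLE_FOLDER are made-up names for illustration only.
    sample = "# comment\nEXAMPLE_FLAG=True\nEXAMPLE_FOLDER='output/'\n"
    print(parse_env_text(sample))
```

To read a real file, pass its contents in, for example `parse_env_text(open("config/app_config.env", encoding="utf-8").read())`.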

	---

	## More documentation

	- [User guide](src/user_guide.qmd)
	- [Package API usage (Python)](src/package_api_usage.qmd)
	- [App installation guide (with CDK)](src/installation_guide.qmd)
	- [App settings management guide](src/app_settings.qmd)