	# 🖋️ Manuscript-Mimic

	AI Style Transfer for Scientific Writing

	An agentic system that rewrites AI-generated scientific text to statistically match the stylometric profile of pre-2022 human-authored academic manuscripts.

	## Architecture

	```
	┌───────────────────────────────────────────────────────────┐
	│                    Gradio UI (app.py)                     │
	│  ┌───────────────┐  ┌───────────────────────────────────┐ │
	│  │ Reference PDF │  │ Target Draft (paste)              │ │
	│  │ or Text       │  │                                   │ │
	│  └───────┬───────┘  └─────────────────┬─────────────────┘ │
	│          │                            │                   │
	│          ▼                            ▼                   │
	│  ┌─────────────────────────────────────────────────────┐  │
	│  │ rewrite_agent.py: CodeAgent                         │  │
	│  │                                                     │  │
	│  │ Step 1: style_extractor(reference) → ref_metrics    │  │
	│  │ Step 2: style_extractor(target)    → tgt_metrics    │  │
	│  │ Step 3: Rewrite target to match ref_metrics         │  │
	│  │ Step 4: style_extractor(rewritten) → verify         │  │
	│  │                                                     │  │
	│  │  ┌───────────────────────────────────────────────┐  │  │
	│  │  │ style_extractor.py: Tool                      │  │  │
	│  │  │                                               │  │  │
	│  │  │ • Sentence Length Variance (σ)                │  │  │
	│  │  │ • Hedging Density (per sentence)              │  │  │
	│  │  │ • Passive Voice Density (per sentence)        │  │  │
	│  │  └───────────────────────────────────────────────┘  │  │
	│  └──────────────────────────┬──────────────────────────┘  │
	│                             │                             │
	│                             ▼                             │
	│  ┌─────────────────────────────────────────────────────┐  │
	│  │ Rewritten Text + Metrics                            │  │
	│  └─────────────────────────────────────────────────────┘  │
	└───────────────────────────────────────────────────────────┘
	```
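The four steps in the diagram can be read as plain control flow. The sketch below is only an illustration under stated assumptions, not the project's implementation: in the real system a CodeAgent decides when to call the tool, whereas here `mimic_pipeline`, `toy_extract`, and `toy_rewrite` are hypothetical names, and the extractor and rewriter are passed in as ordinary callables.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]


def mimic_pipeline(
    reference: str,
    target: str,
    extract: Callable[[str], Metrics],
    rewrite: Callable[[str, Metrics, Metrics], str],
) -> dict:
    """Run the extract -> extract -> rewrite -> verify sequence once."""
    ref_metrics = extract(reference)                       # Step 1
    tgt_metrics = extract(target)                          # Step 2
    rewritten = rewrite(target, ref_metrics, tgt_metrics)  # Step 3
    new_metrics = extract(rewritten)                       # Step 4
    return {
        "rewritten": rewritten,
        "reference_metrics": ref_metrics,
        "original_metrics": tgt_metrics,
        "rewritten_metrics": new_metrics,
    }


# Toy stand-ins so the sketch runs without a model (both are assumptions).
def toy_extract(text: str) -> Metrics:
    return {"words": float(len(text.split()))}


def toy_rewrite(text: str, ref: Metrics, tgt: Metrics) -> str:
    # Pad the draft with a filler word until it is as long as the reference.
    missing = int(ref["words"] - tgt["words"])
    return text + " perhaps" * max(missing, 0)


result = mimic_pipeline("one two three four", "one two", toy_extract, toy_rewrite)
print(result["rewritten_metrics"])
```

Passing the extractor and rewriter as arguments keeps the sketch independent of any particular model; an LLM-backed rewriter could be swapped in for `toy_rewrite` without changing the control flow.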

	## Three Stylometric Metrics

	| Metric | Description | Academic Signature |
	|--------|-------------|--------------------|
	| Sentence Length Variance | Standard deviation (σ) of per-sentence word counts | High σ reflects a mix of short sentences and long multi-clause ones |
	| Hedging Density | Hedge words per sentence (suggest, may, putative, indicate, could, ...) | Pre-2022 manuscripts hedge heavily in Results/Discussion |
	| Passive Voice Density | Academic passive constructions per sentence (was performed, were analyzed, ...) | Methods sections are dominated by passive voice |
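To make the three definitions concrete, here is a rough sketch of how such metrics could be computed with the standard library alone. It is not the code in `style_extractor.py`: the function name `sketch_metrics`, the hedge word list, the passive-voice regex, and the naive sentence splitter are all assumptions chosen for illustration, and the regex will produce false positives (for example "is often").

```python
import re
import statistics

# Assumed, illustrative hedge lexicon; the project's real list may differ.
HEDGE_WORDS = {
    "suggest", "suggests", "suggested", "may", "might", "could",
    "putative", "indicate", "indicates", "indicated", "possibly",
}

# Crude passive heuristic: a form of "to be" followed by a word ending in -ed/-en.
PASSIVE_RE = re.compile(
    r"\b(?:is|are|was|were|be|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE
)


def split_sentences(text: str) -> list:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sketch_metrics(text: str) -> dict:
    sentences = split_sentences(text)
    if not sentences:
        return {"sentence_length_sigma": 0.0, "hedging_density": 0.0, "passive_density": 0.0}
    n = len(sentences)
    lengths = [len(s.split()) for s in sentences]
    words = re.findall(r"[a-z]+", text.lower())
    hedges = sum(1 for w in words if w in HEDGE_WORDS)
    passives = sum(1 for _ in PASSIVE_RE.finditer(text))
    return {
        # Population standard deviation of words per sentence.
        "sentence_length_sigma": statistics.pstdev(lengths),
        "hedging_density": hedges / n,
        "passive_density": passives / n,
    }


sample = (
    "Samples were analyzed by mass spectrometry. "
    "These data suggest that the effect may be real. We agree."
)
print(sketch_metrics(sample))
```

For the sample text, the three sentences contain 6, 9, and 2 words, two hedge words ("suggest", "may"), and one passive construction ("were analyzed"), so both densities are simple per-sentence ratios.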

	## Quick Start

	```bash
	pip install -r requirements.txt
	export HF_TOKEN="hf_..."   # your Hugging Face token
	python app.py              # launches Gradio on http://localhost:7860
	```

	## File Structure

	```
	manuscript_mimic/
	├── __init__.py          # Package marker
	├── style_extractor.py   # StyleExtractorTool + metric functions
	├── rewrite_agent.py     # CodeAgent orchestrator + run_mimic()
	├── app.py               # Gradio web UI
	├── requirements.txt     # Dependencies
	└── README.md            # This file
	```

	## Usage

	### Via Gradio UI
	1. Upload a reference PDF or paste reference text (pre-2022 manuscript excerpt)
	2. Paste your AI-generated draft
	3. Select a model and click "Rewrite to Match Style"
	4. Review the rewritten text and compare metrics

	### Via Python API
	```python
	from style_extractor import extract_style_metrics
	from rewrite_agent import run_mimic

	# Analyze a text
	metrics = extract_style_metrics("Your academic text here...")
	print(metrics)

	# Rewrite to match a reference
	rewritten = run_mimic(
	    reference_text="Pre-2022 manuscript excerpt...",
	    target_text="AI-generated draft...",
	)
	print(rewritten)
	```
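After a rewrite, it can help to check how far the new metrics still are from the reference. The helper below is a hypothetical sketch, not part of the package: it assumes metrics are available as a plain `dict` of floats (this README does not state what `extract_style_metrics` returns, so convert first if needed), and the key names, values, and tolerances are made up for illustration.

```python
from typing import Dict


def metric_gaps(reference: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Signed difference (candidate - reference) for every shared metric."""
    return {
        name: candidate[name] - value
        for name, value in reference.items()
        if name in candidate
    }


def is_close(
    reference: Dict[str, float],
    candidate: Dict[str, float],
    rel_tol: float = 0.2,
    abs_tol: float = 0.05,
) -> bool:
    """True when every shared metric is within the relative or absolute tolerance."""
    gaps = metric_gaps(reference, candidate)
    return all(
        abs(gap) <= max(rel_tol * abs(reference[name]), abs_tol)
        for name, gap in gaps.items()
    )


# Hypothetical values for illustration only, not measured output.
reference = {"sentence_length_sigma": 10.0, "hedging_density": 0.5}
rewritten = {"sentence_length_sigma": 11.0, "hedging_density": 0.9}

print(metric_gaps(reference, rewritten))
print(is_close(reference, rewritten))  # False: hedging density is still too far off
```

The absolute floor `abs_tol` keeps the check usable when a reference metric is zero, where a purely relative tolerance would only accept an exact match.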

	## Models

	The agent works with any model available through the Hugging Face Inference API, for example:
	- `Qwen/Qwen2.5-Coder-32B-Instruct` (default; suited to code-generation agents)
	- `meta-llama/Llama-3.3-70B-Instruct`
	- `mistralai/Mixtral-8x7B-Instruct-v0.1`

	## License

	MIT