	---
	tags:
	- ml-intern
	---
	# Multimodal Fraudulent Paper Detection Framework

	A deep learning system that detects fraudulent scientific papers by analyzing text, images, tabular data, and metadata simultaneously. It also identifies paper mills through clustering analysis.

	## Key Features

	- True Multimodal Analysis: Combines four modalities (text, images, tables, metadata), in contrast to existing single-modality approaches
	- Cross-Modal Attention: Detects inconsistencies between text claims and visual/tabular evidence
	- Explainable Predictions: Shows which modality contributed most to fraud detection
	- Paper Mill Detection: Clusters fraudulent papers to identify organized operations
	- Anomaly Scoring: Identifies outlier papers with unusual patterns

	## Architecture

	```
	Text (SciBERT) ─────────┐
	Image (ViT+Forensics) ──┼──▶ Cross-Modal Fusion ──▶ Fraud Classifier
	Tables (Transformer) ───┤    (Multi-head Attention)
	Metadata (MLP) ─────────┘
	```
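
The fusion step in the diagram can be illustrated with a small sketch. This is not the repository's implementation: it uses a single attention head, one hand-picked query vector, and no learned projections, whereas the diagram names multi-head attention. The function `fuse`, the toy embeddings, and the query are all hypothetical and exist only to show how attention weights can indicate which modality contributed most.

```python
import math

MODALITIES = ["text", "image", "tables", "metadata"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(embeddings, query):
    """Single-head scaled dot-product attention over modality embeddings.

    embeddings: dict mapping modality name -> vector (hypothetical inputs)
    query:      vector with the same dimension as each embedding
    Returns the fused vector and the per-modality attention weights.
    """
    dim = len(query)
    scores = [dot(query, embeddings[m]) / math.sqrt(dim) for m in MODALITIES]
    weights = softmax(scores)
    fused = [
        sum(w * embeddings[m][i] for w, m in zip(weights, MODALITIES))
        for i in range(dim)
    ]
    return fused, dict(zip(MODALITIES, weights))


# Toy 3-dimensional embeddings, made up for illustration only.
embeddings = {
    "text": [1.0, 0.0, 0.0],
    "image": [0.0, 1.0, 0.0],
    "tables": [0.0, 0.0, 1.0],
    "metadata": [0.5, 0.5, 0.0],
}
query = [0.0, 2.0, 0.0]

fused, weights = fuse(embeddings, query)
top_modality = max(weights, key=weights.get)
print(top_modality, weights)
```

In this toy setup the query aligns most strongly with the image embedding, so the image modality receives the largest weight. A trained model would learn the query, key, and value projections rather than use raw embeddings.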

	## Installation

	```bash
	pip install -r requirements.txt
	```

	## Quick Start

	### 1. Train the Model

	```bash
	python scripts/train.py \
	  --dataset Lihuchen/pubmed_retraction \
	  --text_model allenai/scibert_scivocab_uncased \
	  --batch_size 16 \
	  --epochs 5 \
	  --lr 2e-5 \
	  --output_dir ./outputs
	```

	### 2. Detect Paper Mills

	```bash
	python scripts/cluster_paper_mills.py \
	  --embeddings ./outputs/test_embeddings.npy \
	  --dataset Lihuchen/pubmed_retraction \
	  --output_dir ./clustering_outputs
	```

	### 3. Run Inference

	```bash
	python scripts/inference.py \
	  --model_path ./outputs/best_model.pt \
	  --title "Your Paper Title" \
	  --text "Your abstract text..."
	```

	## Datasets

	| Dataset | Modality | Purpose | HF Hub |
	|---------|----------|---------|--------|
	| PubMed Retraction | Text + Metadata | Binary fraud labels | `Lihuchen/pubmed_retraction` |
	| PubMed Retraction (Instruction) | Text + Explanation | Instruction tuning | `Lihuchen/pubmed_retraction_instruction` |
	| Paper Mill Benchmark | Text + Tabular | Paper mill labels | `syedzayyan/paper_mill_benchmark` |
	| PubTables-1M | Image + Table | Table extraction | `bsmock/pubtables-1m` |

	## Paper Mill Clustering

	The clustering module analyzes:
	- Stylometric features: Text entropy, sentence length, vocabulary richness
	- Visual hashes: Detects template/reused figures across papers
	- Metadata patterns: Author concentration, timeline anomalies, citation patterns
	- Multimodal embeddings: Combined representation from all modalities
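
The stylometric features in the list above can be sketched in plain Python. The exact definitions used by the clustering module are not given here, so this example assumes common ones: Shannon entropy of the word distribution, mean words per sentence, and type-token ratio as a proxy for vocabulary richness. The function name `stylometric_features` and the tokenization rules are hypothetical.

```python
import math
import re
from collections import Counter


def stylometric_features(text):
    """Compute three simple stylometric features for one document.

    Assumed definitions (not confirmed by the repository):
      entropy              - Shannon entropy (bits) of the word distribution
      mean_sentence_length - average number of words per sentence
      type_token_ratio     - unique words divided by total words
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens; a deliberately naive tokenizer.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {
            "entropy": 0.0,
            "mean_sentence_length": 0.0,
            "type_token_ratio": 0.0,
        }

    counts = Counter(words)
    total = len(words)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "entropy": entropy,
        "mean_sentence_length": total / max(len(sentences), 1),
        "type_token_ratio": len(counts) / total,
    }


features = stylometric_features("The cat sat. The cat ran.")
print(features)
```

Papers produced from a shared template would tend to produce similar values on features like these, which is why they can be useful inputs for clustering alongside the learned embeddings.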

	The module uses HDBSCAN, a density-based clustering algorithm, to group papers that may come from the same paper mill.

	## Novelty

	1. First unified multimodal framework for scientific fraud detection
	2. Cross-modal inconsistency detection between text and figures/tables
	3. Automated paper mill discovery via clustering on combined embeddings
	4. Explainable by design: attention weights reveal which modality flagged fraud

	## Citation

	If you use this framework, please cite:

	```bibtex
	@software{multimodal_fraud_detection,
	  title={Multimodal Fraudulent Paper Detection Framework},
	  author={Your Name},
	  year={2025}
	}
	```

	## License

	MIT License

	<!-- ml-intern-provenance -->
	## Generated by ML Intern

	This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

	- Try ML Intern: https://smolagents-ml-intern.hf.space
	- Source code: https://github.com/huggingface/ml-intern

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "pangweijlu/multimodal-fraud-detection"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id)
	```

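	For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class. This snippet applies only if the repository ships a Transformers-compatible checkpoint; the framework described above is a fraud classifier, and the Quick Start loads it from `./outputs/best_model.pt` through `scripts/inference.py`.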