---
tags:
- ml-intern
---
# Multimodal Fraudulent Paper Detection Framework
A deep learning system that detects fraudulent scientific papers by analyzing **text**, **images**, **tabular data**, and **metadata** simultaneously. It also identifies **paper mills** through clustering analysis.
## Key Features
- **True Multimodal Analysis**: Combines four modalities (text, images, tables, metadata) rather than relying on a single modality
- **Cross-Modal Attention**: Detects inconsistencies between text claims and visual/tabular evidence
- **Explainable Predictions**: Shows which modality contributed most to fraud detection
- **Paper Mill Detection**: Clusters fraudulent papers to identify organized operations
- **Anomaly Scoring**: Identifies outlier papers with unusual patterns
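The feature list does not say how anomaly scores are computed. As one possible illustration, the sketch below scores each paper by a robust z-score (median and median absolute deviation) on a single feature. The function name `robust_anomaly_scores`, the 3.5 threshold, and the sample values are assumptions made for this example, not part of the framework.

```python
from statistics import median


def robust_anomaly_scores(values):
    """Return |x - median| / (1.4826 * MAD) for each value (all 0.0 if MAD is 0)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the data are roughly normal.
    scale = 1.4826 * mad
    return [abs(v - med) / scale for v in values]


# Hypothetical feature: mean sentence length for six papers.
sentence_lengths = [21.0, 19.5, 22.3, 20.1, 48.7, 20.8]
scores = robust_anomaly_scores(sentence_lengths)

# Flag papers whose score exceeds a commonly used cutoff of 3.5.
outliers = [i for i, s in enumerate(scores) if s > 3.5]
print(outliers)  # prints [4]
```

A median-based score is used here because a few extreme papers would otherwise inflate the mean and standard deviation and hide themselves.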
## Architecture
```
Text (SciBERT) ──────────┐
Image (ViT+Forensics) ───┼──▶ Cross-Modal Fusion ──▶ Fraud Classifier
Tables (Transformer) ────┤    (Multi-head Attention)
Metadata (MLP) ──────────┘
```
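To make the fusion step in the diagram more concrete, here is a minimal pure-Python sketch of attention over the four modality embeddings. It is a single-head, single-query simplification, not the multi-head attention the framework uses, and the names `attention_fuse`, `embeddings`, and `query` are hypothetical. In a trained model the query, keys, and values would come from learned projections rather than the toy vectors shown here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(query, modality_embeddings):
    """Fuse per-modality vectors with scaled dot-product attention.

    Returns (fused_vector, weights), where weights maps each modality
    name to its attention weight.
    """
    names = list(modality_embeddings)
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, modality_embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the modality vectors.
    fused = [
        sum(w * modality_embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy 4-dimensional embeddings; real encoders produce much larger vectors.
embeddings = {
    "text": [0.2, 0.1, 0.0, 0.3],
    "image": [1.5, 0.9, 1.2, 0.8],
    "tables": [0.1, 0.4, 0.2, 0.0],
    "metadata": [0.0, 0.2, 0.1, 0.1],
}
query = [1.0, 1.0, 1.0, 1.0]  # stands in for a learned query vector

fused, weights = attention_fuse(query, embeddings)
top_modality = max(weights, key=weights.get)
print(top_modality)  # prints image
```

The `weights` dictionary is the kind of signal an explainability view could surface: the modality with the largest weight contributed most to the fused representation.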
## Installation
```bash
pip install -r requirements.txt
```
## Quick Start
### 1. Train the Model
```bash
python scripts/train.py \
--dataset Lihuchen/pubmed_retraction \
--text_model allenai/scibert_scivocab_uncased \
--batch_size 16 \
--epochs 5 \
--lr 2e-5 \
--output_dir ./outputs
```
### 2. Detect Paper Mills
```bash
python scripts/cluster_paper_mills.py \
--embeddings ./outputs/test_embeddings.npy \
--dataset Lihuchen/pubmed_retraction \
--output_dir ./clustering_outputs
```
### 3. Run Inference
```bash
python scripts/inference.py \
--model_path ./outputs/best_model.pt \
--title "Your Paper Title" \
--text "Your abstract text..."
```
## Datasets
| Dataset | Modality | Purpose | HF Hub |
|---------|----------|---------|--------|
| PubMed Retraction | Text + Metadata | Binary fraud labels | `Lihuchen/pubmed_retraction` |
| PubMed Retraction (Instruction) | Text + Explanation | Instruction tuning | `Lihuchen/pubmed_retraction_instruction` |
| Paper Mill Benchmark | Text + Tabular | Paper mill labels | `syedzayyan/paper_mill_benchmark` |
| PubTables-1M | Image + Table | Table extraction | `bsmock/pubtables-1m` |
## Paper Mill Clustering
The clustering module analyzes:
- **Stylometric features**: Text entropy, sentence length, vocabulary richness
- **Visual hashes**: Detects template/reused figures across papers
- **Metadata patterns**: Author concentration, timeline anomalies, citation patterns
- **Multimodal embeddings**: Combined representation from all modalities
The module uses HDBSCAN, a density-based clustering algorithm, to group papers that may come from the same paper mill.
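As a rough illustration of the stylometric features listed above, the sketch below computes word-level Shannon entropy, mean sentence length, and type-token ratio (one common measure of vocabulary richness) for a single text. The function name `stylometric_features` and the naive regex tokenization are assumptions for this example; the framework's own feature extraction may differ.

```python
import math
import re
from collections import Counter


def stylometric_features(text):
    """Compute three simple stylometric features for one document."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Naive sentence split on ., ! and ?; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {"entropy": 0.0, "mean_sentence_length": 0.0, "type_token_ratio": 0.0}

    counts = Counter(words)
    total = len(words)
    # Shannon entropy (in bits) of the word-frequency distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "entropy": entropy,
        "mean_sentence_length": total / len(sentences),  # words per sentence
        "type_token_ratio": len(counts) / total,  # unique words / all words
    }


sample = "We tested the drug. The drug worked. The drug worked well."
features = stylometric_features(sample)
print(features)
```

Feature vectors like this one could be concatenated with the multimodal embeddings before clustering, so that papers sharing an unusually uniform writing style end up close together.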
## Novelty
1. **First unified multimodal framework** for scientific fraud detection
2. **Cross-modal inconsistency detection** between text and figures/tables
3. **Automated paper mill discovery** via clustering on combined embeddings
4. **Explainable by design**: attention weights reveal which modality flagged fraud
## Citation
If you use this framework, please cite:
```bibtex
@software{multimodal_fraud_detection,
title={Multimodal Fraudulent Paper Detection Framework},
author={Your Name},
year={2025}
}
```
## License
MIT License
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "pangweijlu/multimodal-fraud-detection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
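Because this framework is a fraud classifier rather than a causal language model, `AutoModelForCausalLM` may not match the checkpoint. In that case, replace it with the appropriate `AutoModel` class, or run `scripts/inference.py` as shown in Quick Start.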