Multimodal Fraudulent Paper Detection Framework
A deep learning system that detects fraudulent scientific papers by analyzing text, images, tabular data, and metadata simultaneously. It also identifies paper mills through clustering analysis.
Key Features
- True Multimodal Analysis: Combines four modalities (text, images, tables, metadata), in contrast to existing single-modality approaches
- Cross-Modal Attention: Detects inconsistencies between text claims and visual/tabular evidence
- Explainable Predictions: Shows which modality contributed most to fraud detection
- Paper Mill Detection: Clusters fraudulent papers to identify organized operations
- Anomaly Scoring: Identifies outlier papers with unusual patterns
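The anomaly scoring idea can be illustrated with a small standard-library sketch. It assumes, purely for illustration, that each paper is already represented by an embedding vector and that "unusual" means "far from the centroid of all papers", measured with a robust z-score (median and MAD). The function name `anomaly_scores`, the toy data, and the scoring rule itself are assumptions; this README does not specify how the framework computes its scores.

```python
import math
import statistics

def anomaly_scores(embeddings):
    """Robust z-score of each embedding's Euclidean distance from the centroid.

    Hypothetical helper: the real framework may score anomalies differently.
    """
    dim = len(embeddings[0])
    # Centroid = per-dimension mean over all papers.
    centroid = [statistics.fmean(e[d] for e in embeddings) for d in range(dim)]
    dists = [math.dist(e, centroid) for e in embeddings]
    # Median and median absolute deviation (MAD) are less sensitive
    # to the outliers we are trying to find than mean and std.
    med = statistics.median(dists)
    mad = statistics.median(abs(d - med) for d in dists)
    if mad == 0:
        return [0.0 for _ in dists]
    # 1.4826 rescales MAD to match the standard deviation under normality.
    return [(d - med) / (1.4826 * mad) for d in dists]

# Toy embeddings (made up): four similar papers and one far away.
papers = [[0.0, 0.1], [0.1, 0.0], [0.0, -0.1], [-0.1, 0.0], [5.0, 5.0]]
scores = anomaly_scores(papers)
most_unusual = scores.index(max(scores))
print(most_unusual, round(scores[most_unusual], 2))
```

A threshold on the score (for example, flagging values above 3) would then mark outlier papers; the right cutoff depends on the data and is not given here.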
Architecture
Text (SciBERT)         ──┐
Image (ViT+Forensics)  ──┼──▶ Cross-Modal Fusion ──▶ Fraud Classifier
Tables (Transformer)   ──┤    (Multi-head Attention)
Metadata (MLP)         ──┘
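As a rough illustration of the fusion step in the diagram, the sketch below runs single-head scaled dot-product attention over four per-modality embeddings and reports how much attention each modality receives on average. It is written in pure Python for readability. The function name `attention_fuse`, the toy 3-dimensional embeddings, and the simplifications (one head, no learned projections, mean pooling) are assumptions for this example, not the repository's actual implementation.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_fuse(embeddings):
    """Self-attention across modality embeddings (hypothetical helper).

    embeddings: dict mapping modality name -> list of floats (same length).
    Returns (fused_vector, mean attention received per modality).
    """
    names = list(embeddings)
    vecs = [embeddings[n] for n in names]
    dim = len(vecs[0])
    scale = math.sqrt(dim)
    # Row i holds the attention weights of modality i over all modalities.
    rows = [softmax([dot(q, k) / scale for k in vecs]) for q in vecs]
    # Each modality's output is the attention-weighted sum of all embeddings.
    attended = [
        [sum(w * v[d] for w, v in zip(row, vecs)) for d in range(dim)]
        for row in rows
    ]
    # Mean-pool the attended outputs into one fused representation.
    fused = [sum(a[d] for a in attended) / len(attended) for d in range(dim)]
    # Average attention each modality receives: a crude contribution signal.
    received = {
        n: sum(row[j] for row in rows) / len(rows) for j, n in enumerate(names)
    }
    return fused, received

# Toy embeddings (made up); real ones would come from the four encoders.
toy = {
    "text": [1.0, 0.0, 0.5],
    "image": [0.9, 0.1, 0.4],
    "tables": [0.2, 1.0, 0.0],
    "metadata": [0.0, 0.3, 1.0],
}
fused, received = attention_fuse(toy)
print({name: round(weight, 3) for name, weight in received.items()})
```

The `received` weights sum to 1 across modalities, which is one simple way attention can be read as a per-modality contribution. Attention weights are only a heuristic explanation, so treat them as a hint rather than a proof of which modality drove a prediction.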
Installation
pip install -r requirements.txt
Quick Start
1. Train the Model
python scripts/train.py \
--dataset Lihuchen/pubmed_retraction \
--text_model allenai/scibert_scivocab_uncased \
--batch_size 16 \
--epochs 5 \
--lr 2e-5 \
--output_dir ./outputs
2. Detect Paper Mills
python scripts/cluster_paper_mills.py \
--embeddings ./outputs/test_embeddings.npy \
--dataset Lihuchen/pubmed_retraction \
--output_dir ./clustering_outputs
3. Run Inference
python scripts/inference.py \
--model_path ./outputs/best_model.pt \
--title "Your Paper Title" \
--text "Your abstract text..."
Datasets
| Dataset | Modality | Purpose | HF Hub |
|---|---|---|---|
| PubMed Retraction | Text + Metadata | Binary fraud labels | Lihuchen/pubmed_retraction |
| PubMed Retraction (Instruction) | Text + Explanation | Instruction tuning | Lihuchen/pubmed_retraction_instruction |
| Paper Mill Benchmark | Text + Tabular | Paper mill labels | syedzayyan/paper_mill_benchmark |
| PubTables-1M | Image + Table | Table extraction | bsmock/pubtables-1m |
Paper Mill Clustering
The clustering module analyzes:
- Stylometric features: Text entropy, sentence length, vocabulary richness
- Visual hashes: Detects template/reused figures across papers
- Metadata patterns: Author concentration, timeline anomalies, citation patterns
- Multimodal embeddings: Combined representation from all modalities
The module uses HDBSCAN, a density-based clustering algorithm, to group papers that likely originate from the same paper mill.
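To make the stylometric features listed above concrete, here is a minimal sketch that computes one plausible version of each. The exact definitions are assumptions: word-level Shannon entropy for "text entropy", mean words per sentence for "sentence length", and type-token ratio for "vocabulary richness". The function name `stylometric_features` and the naive regex tokenizer are also illustrative and may differ from what the clustering script does.

```python
import math
import re
from collections import Counter

def stylometric_features(text):
    """Compute three simple stylometric features (hypothetical definitions)."""
    # Naive tokenization: lowercase alphabetic words, sentences split on . ! ?
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    total = len(words)
    # Shannon entropy (bits) of the word frequency distribution.
    entropy = (
        -sum((c / total) * math.log2(c / total) for c in counts.values())
        if total
        else 0.0
    )
    return {
        "word_entropy": entropy,
        "mean_sentence_length": total / len(sentences) if sentences else 0.0,
        # Type-token ratio: distinct words divided by total words.
        "type_token_ratio": len(counts) / total if total else 0.0,
    }

sample = "We measured cells. We measured cells again."
feats = stylometric_features(sample)
print(feats)
```

Feature vectors like this one could be concatenated with the other signals before clustering. Templated text tends to show low entropy and a low type-token ratio, though neither is evidence of fraud on its own.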
Novelty
- First unified multimodal framework for scientific fraud detection
- Cross-modal inconsistency detection between text and figures/tables
- Automated paper mill discovery via clustering on combined embeddings
- Explainable by design: attention weights reveal which modality flagged fraud
Citation
If you use this framework, please cite:
@software{multimodal_fraud_detection,
title={Multimodal Fraudulent Paper Detection Framework},
author={Your Name},
year={2025}
}
License
MIT License
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "pangweijlu/multimodal-fraud-detection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.