Multimodal Fraudulent Paper Detection Framework

A deep learning system that detects fraudulent scientific papers by analyzing text, images, tabular data, and metadata simultaneously. It also identifies paper mills through clustering analysis.

Key Features

True Multimodal Analysis: Combines four modalities (text, images, tables, metadata), in contrast to existing single-modality approaches
Cross-Modal Attention: Detects inconsistencies between text claims and visual/tabular evidence
Explainable Predictions: Shows which modality contributed most to fraud detection
Paper Mill Detection: Clusters fraudulent papers to identify organized operations
Anomaly Scoring: Identifies outlier papers with unusual patterns

Architecture

Text (SciBERT) ─────────┐
Image (ViT+Forensics) ──┼──▶ Cross-Modal Fusion ──▶ Fraud Classifier
Tables (Transformer) ───┤         (Multi-head Attention)
Metadata (MLP) ─────────┘
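
The fusion step in the diagram can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: it uses a single attention head instead of multi-head attention, plain Python lists instead of tensors, and a hypothetical `query` vector standing in for whatever learned query the real model uses. The function name `fuse_modalities` is also an assumption. The sketch shows how one attention pass yields both a fused representation and a per-modality weight that can be read as a contribution score.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fuse_modalities(
    embeddings: Dict[str, Vector], query: Vector
) -> Tuple[Vector, Dict[str, float]]:
    """Single-head scaled dot-product attention over modality embeddings.

    Returns the attention-weighted sum of the embeddings and the weight
    assigned to each modality (hypothetical helper, for illustration only).
    """
    names = list(embeddings)
    dim = len(query)
    scores = [dot(query, embeddings[name]) / math.sqrt(dim) for name in names]
    weights = softmax(scores)
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy embeddings; a real model would produce these with its encoders.
embeddings = {
    "text": [1.0, 0.0, 0.0, 0.0],
    "image": [0.0, 1.0, 0.0, 0.0],
    "tables": [0.0, 0.0, 1.0, 0.0],
    "metadata": [0.0, 0.0, 0.0, 1.0],
}
fused, weights = fuse_modalities(embeddings, query=[2.0, 0.0, 0.0, 0.0])
top_modality = max(weights, key=weights.get)
print(top_modality, weights)
```

In this toy setup the query aligns with the text embedding, so the text modality receives the largest weight. Reading the weights this way is the same idea behind the per-modality explanations listed under Key Features.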

Installation

pip install -r requirements.txt

Quick Start

1. Train the Model

python scripts/train.py \
  --dataset Lihuchen/pubmed_retraction \
  --text_model allenai/scibert_scivocab_uncased \
  --batch_size 16 \
  --epochs 5 \
  --lr 2e-5 \
  --output_dir ./outputs

2. Detect Paper Mills

python scripts/cluster_paper_mills.py \
  --embeddings ./outputs/test_embeddings.npy \
  --dataset Lihuchen/pubmed_retraction \
  --output_dir ./clustering_outputs

3. Run Inference

python scripts/inference.py \
  --model_path ./outputs/best_model.pt \
  --title "Your Paper Title" \
  --text "Your abstract text..."

Datasets

| Dataset | Modality | Purpose | HF Hub |
|---|---|---|---|
| PubMed Retraction | Text + Metadata | Binary fraud labels | `Lihuchen/pubmed_retraction` |
| PubMed Retraction (Instruction) | Text + Explanation | Instruction tuning | `Lihuchen/pubmed_retraction_instruction` |
| Paper Mill Benchmark | Text + Tabular | Paper mill labels | `syedzayyan/paper_mill_benchmark` |
| PubTables-1M | Image + Table | Table extraction | `bsmock/pubtables-1m` |

Paper Mill Clustering

The clustering module analyzes:

Stylometric features: Text entropy, sentence length, vocabulary richness
Visual hashes: Detects template/reused figures across papers
Metadata patterns: Author concentration, timeline anomalies, citation patterns
Multimodal embeddings: Combined representation from all modalities
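
As an illustration of the stylometric features named in the list above, here is a minimal sketch using only the Python standard library. The function name `stylometric_features`, the tokenization rules, and the choice of type-token ratio as the measure of vocabulary richness are assumptions; the repository may compute these features differently.

```python
import math
import re
from collections import Counter
from typing import Dict


def stylometric_features(text: str) -> Dict[str, float]:
    """Compute three simple stylometric features for a piece of text.

    - word_entropy: Shannon entropy (bits) of the word distribution
    - mean_sentence_length: average number of words per sentence
    - type_token_ratio: unique words / total words (vocabulary richness)
    """
    # Naive sentence split on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens; numbers and symbols are ignored.
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        return {
            "word_entropy": 0.0,
            "mean_sentence_length": 0.0,
            "type_token_ratio": 0.0,
        }

    counts = Counter(words)
    total = len(words)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "word_entropy": entropy,
        "mean_sentence_length": total / len(sentences),
        "type_token_ratio": len(counts) / total,
    }


features = stylometric_features("The cat sat. The cat ran.")
print(features)
```

Templated or machine-generated text often has unusually low entropy and vocabulary richness, which is why features of this kind can help separate paper mill output from ordinary manuscripts.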

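The module uses HDBSCAN, a density-based clustering algorithm, to group papers that likely originate from the same paper mill. Papers that do not fall into any dense group are treated as noise rather than forced into a cluster.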
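
To show what density-based clustering does with paper embeddings, the sketch below implements plain DBSCAN in pure Python. This is a deliberately simplified stand-in, not HDBSCAN: HDBSCAN builds a hierarchy over varying density levels and does not need a fixed `eps` radius. The function name `dbscan`, the parameters `eps` and `min_samples`, and the toy 2-D points are all assumptions chosen for illustration; a real run would more likely use the `hdbscan` package or `sklearn.cluster.HDBSCAN` on the saved embeddings.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(
    points: Sequence[Sequence[float]], eps: float, min_samples: int
) -> List[int]:
    """Minimal DBSCAN: returns one cluster label per point, -1 for noise."""
    n = len(points)

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels: List[Optional[int]] = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
        cluster += 1
    return [label if label is not None else NOISE for label in labels]


# Two tight groups (candidate "mills") and one isolated paper.
points = [
    (0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
    (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
    (20.0, 20.0),
]
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)
```

The isolated point receives the noise label, which mirrors how a density-based method leaves unrelated papers unassigned instead of attaching them to the nearest group.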

Novelty

First unified multimodal framework for scientific fraud detection
Cross-modal inconsistency detection between text and figures/tables
Automated paper mill discovery via clustering on combined embeddings
Explainable by design: attention weights indicate which modality contributed most to a fraud prediction

Citation

If you use this framework, please cite:

@software{multimodal_fraud_detection,
  title={Multimodal Fraudulent Paper Detection Framework},
  author={Your Name},
  year={2025}
}

License

MIT License

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pangweijlu/multimodal-fraud-detection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

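This framework is a multimodal classifier, not a causal language model, so AutoModelForCausalLM is unlikely to be the right class. If the repository ships a Transformers-compatible checkpoint, replace AutoModelForCausalLM with the appropriate AutoModel class; otherwise, run scripts/inference.py as shown in Quick Start.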
