Multimodal Fraudulent Paper Detection Framework

A deep learning system that detects fraudulent scientific papers by analyzing text, images, tabular data, and metadata simultaneously. It also identifies paper mills through clustering analysis.

Key Features

  • True Multimodal Analysis: Combines four modalities (text, images, tables, metadata), where existing approaches typically rely on a single modality
  • Cross-Modal Attention: Detects inconsistencies between text claims and visual/tabular evidence
  • Explainable Predictions: Shows which modality contributed most to fraud detection
  • Paper Mill Detection: Clusters fraudulent papers to identify organized operations
  • Anomaly Scoring: Identifies outlier papers with unusual patterns

Architecture

Text (SciBERT) ──────────┐
Image (ViT+Forensics) ───┼──▶ Cross-Modal Fusion ──▶ Fraud Classifier
Tables (Transformer) ────┤    (Multi-head Attention)
Metadata (MLP) ──────────┘
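
The fusion step in the diagram can be illustrated with a minimal sketch. It assumes a single attention head, hand-written scaled dot-product attention, and toy three-dimensional embeddings; the function names (`softmax`, `fuse`) and all values are hypothetical and are not taken from the repository's code, which uses multi-head attention over learned encoder outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(query, modality_embeddings):
    """Single-head scaled dot-product attention of one query over per-modality vectors.

    Returns the fused vector and the attention weight assigned to each modality.
    """
    names = list(modality_embeddings)
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, modality_embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    fused = [
        sum(w * modality_embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))

# Toy embeddings standing in for the four encoder outputs (illustrative values only).
embeddings = {
    "text": [0.9, 0.1, 0.0],
    "image": [0.1, 0.8, 0.1],
    "tables": [0.0, 0.2, 0.7],
    "metadata": [0.3, 0.3, 0.3],
}
query = [1.0, 0.0, 0.0]  # hypothetical query vector

fused, weights = fuse(query, embeddings)
top_modality = max(weights, key=weights.get)
print(top_modality, weights)
```

The per-modality weights returned here are the kind of signal the "Explainable Predictions" feature refers to: the modality with the largest weight contributed most to the fused representation.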

Installation

pip install -r requirements.txt

Quick Start

1. Train the Model

python scripts/train.py \
  --dataset Lihuchen/pubmed_retraction \
  --text_model allenai/scibert_scivocab_uncased \
  --batch_size 16 \
  --epochs 5 \
  --lr 2e-5 \
  --output_dir ./outputs

2. Detect Paper Mills

python scripts/cluster_paper_mills.py \
  --embeddings ./outputs/test_embeddings.npy \
  --dataset Lihuchen/pubmed_retraction \
  --output_dir ./clustering_outputs

3. Run Inference

python scripts/inference.py \
  --model_path ./outputs/best_model.pt \
  --title "Your Paper Title" \
  --text "Your abstract text..."

Datasets

| Dataset | Modality | Purpose | HF Hub |
|---|---|---|---|
| PubMed Retraction | Text + Metadata | Binary fraud labels | Lihuchen/pubmed_retraction |
| PubMed Retraction (Instruction) | Text + Explanation | Instruction tuning | Lihuchen/pubmed_retraction_instruction |
| Paper Mill Benchmark | Text + Tabular | Paper mill labels | syedzayyan/paper_mill_benchmark |
| PubTables-1M | Image + Table | Table extraction | bsmock/pubtables-1m |

Paper Mill Clustering

The clustering module analyzes:

  • Stylometric features: Text entropy, sentence length, vocabulary richness
  • Visual hashes: Detects template/reused figures across papers
  • Metadata patterns: Author concentration, timeline anomalies, citation patterns
  • Multimodal embeddings: Combined representation from all modalities
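
As a rough illustration of the stylometric features listed above, the sketch below computes word-level entropy, average sentence length, and a type-token ratio as a proxy for vocabulary richness. The function name `stylometric_features`, the tokenization rules, and the exact feature definitions are assumptions made for this example; the clustering script may define them differently.

```python
import math
import re
from collections import Counter

def stylometric_features(text):
    """Compute three simple stylometric features for one document."""
    # Lowercased word tokens; a deliberately crude tokenizer for illustration.
    words = re.findall(r"[a-z']+", text.lower())
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    total = len(words)
    # Shannon entropy (in bits) of the word frequency distribution.
    entropy = (
        -sum((c / total) * math.log2(c / total) for c in counts.values())
        if total
        else 0.0
    )
    return {
        "word_entropy": entropy,
        "avg_sentence_length": total / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(counts) / total if total else 0.0,
    }

sample = "We propose a method. The method works well. We test the method."
features = stylometric_features(sample)
print(features)
```

Papers produced from a shared template tend to show similar values on features like these, which is why they are useful inputs to clustering.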

The module uses HDBSCAN, a density-based clustering algorithm, to group papers that may originate from the same paper mill.
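
HDBSCAN assigns the label -1 to points it treats as noise, so cluster labels need a small post-processing step before they can be read as candidate paper mill groups. The sketch below shows one way to do that; the helper `group_clusters`, the `min_size` threshold, and the example labels are hypothetical and do not come from `cluster_paper_mills.py`.

```python
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN marks unclustered (noise) points with -1

def group_clusters(paper_ids, labels, min_size=2):
    """Group paper IDs by cluster label, dropping noise and undersized clusters."""
    clusters = defaultdict(list)
    for paper_id, label in zip(paper_ids, labels):
        if label != NOISE_LABEL:
            clusters[label].append(paper_id)
    return {label: ids for label, ids in clusters.items() if len(ids) >= min_size}

# Made-up labels, as a clustering run might return them for six papers.
paper_ids = ["p1", "p2", "p3", "p4", "p5", "p6"]
labels = [0, 0, -1, 1, 0, 1]

candidate_mills = group_clusters(paper_ids, labels)
print(candidate_mills)
```

Each remaining group is only a candidate: a dense cluster indicates similarity, and a human reviewer would still need to confirm that the papers share an origin.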

Novelty

  1. First unified multimodal framework for scientific fraud detection
  2. Cross-modal inconsistency detection between text and figures/tables
  3. Automated paper mill discovery via clustering on combined embeddings
  4. Explainable by design: attention weights reveal which modality flagged fraud

Citation

If you use this framework, please cite:

@software{multimodal_fraud_detection,
  title={Multimodal Fraudulent Paper Detection Framework},
  author={Your Name},
  year={2025}
}

License

MIT License

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pangweijlu/multimodal-fraud-detection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.
