| --- |
| tags: |
| - ml-intern |
| --- |
| # Multimodal Fraudulent Paper Detection Framework |
|
|
| A deep learning system that detects fraudulent scientific papers by analyzing **text**, **images**, **tabular data**, and **metadata** simultaneously. It also identifies **paper mills** through clustering analysis. |
|
|
| ## Key Features |
|
|
| - **True Multimodal Analysis**: Combines four modalities (text, images, tables, metadata), in contrast to existing single-modality approaches |
| - **Cross-Modal Attention**: Detects inconsistencies between text claims and visual/tabular evidence |
| - **Explainable Predictions**: Shows which modality contributed most to each fraud prediction |
| - **Paper Mill Detection**: Clusters fraudulent papers to identify organized operations |
| - **Anomaly Scoring**: Identifies outlier papers with unusual patterns |
|
|
| ## Architecture |
|
|
| ``` |
| Text (SciBERT) ─────────┐ |
| Image (ViT+Forensics) ──┼──▶ Cross-Modal Fusion ──▶ Fraud Classifier |
| Tables (Transformer) ───┤    (Multi-head Attention) |
| Metadata (MLP) ─────────┘ |
| ``` |
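
The fusion step in the diagram can be illustrated with a small sketch. This is not the repository's implementation: it uses single-head attention pooling in plain Python as a simplified stand-in for multi-head attention, and the function names (`softmax`, `attention_fuse`), the query vector, and the toy embeddings are all hypothetical. It shows how attention weights over the four modality embeddings could yield both a fused vector and a per-modality contribution score.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(query, modality_embeddings):
    """Single-head scaled dot-product attention over modality embeddings.

    query: list of floats (a hypothetical learned vector).
    modality_embeddings: dict mapping modality name -> list of floats,
        each the same length as the query.
    Returns (fused_vector, weights_by_modality).
    """
    dim = len(query)
    names = list(modality_embeddings)
    # One scaled dot-product score per modality.
    scores = [
        sum(q * k for q, k in zip(query, modality_embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the embeddings.
    fused = [
        sum(w * modality_embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy, made-up embeddings; real ones would come from the four encoders.
embeddings = {
    "text": [0.2, 0.1, 0.0, 0.3],
    "image": [0.9, 0.8, 0.1, 0.7],
    "tables": [0.1, 0.0, 0.2, 0.1],
    "metadata": [0.0, 0.1, 0.1, 0.0],
}
query = [1.0, 1.0, 0.0, 1.0]

fused, weights = attention_fuse(query, embeddings)
top_modality = max(weights, key=weights.get)
print(top_modality, weights)
```

In this toy setup the weights sum to 1, so they can be read as relative contributions of each modality. A trained model would learn the query and projections rather than take them as fixed inputs.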
|
|
| ## Installation |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ## Quick Start |
|
|
| ### 1. Train the Model |
|
|
| ```bash |
| python scripts/train.py \ |
| --dataset Lihuchen/pubmed_retraction \ |
| --text_model allenai/scibert_scivocab_uncased \ |
| --batch_size 16 \ |
| --epochs 5 \ |
| --lr 2e-5 \ |
| --output_dir ./outputs |
| ``` |
|
|
| ### 2. Detect Paper Mills |
|
|
| ```bash |
| python scripts/cluster_paper_mills.py \ |
| --embeddings ./outputs/test_embeddings.npy \ |
| --dataset Lihuchen/pubmed_retraction \ |
| --output_dir ./clustering_outputs |
| ``` |
|
|
| ### 3. Run Inference |
|
|
| ```bash |
| python scripts/inference.py \ |
| --model_path ./outputs/best_model.pt \ |
| --title "Your Paper Title" \ |
| --text "Your abstract text..." |
| ``` |
|
|
| ## Datasets |
|
|
| | Dataset | Modality | Purpose | HF Hub | |
| |---------|----------|---------|--------| |
| | PubMed Retraction | Text + Metadata | Binary fraud labels | `Lihuchen/pubmed_retraction` | |
| | PubMed Retraction (Instruction) | Text + Explanation | Instruction tuning | `Lihuchen/pubmed_retraction_instruction` | |
| | Paper Mill Benchmark | Text + Tabular | Paper mill labels | `syedzayyan/paper_mill_benchmark` | |
| | PubTables-1M | Image + Table | Table extraction | `bsmock/pubtables-1m` | |
|
|
| ## Paper Mill Clustering |
|
|
| The clustering module analyzes: |
| - **Stylometric features**: Text entropy, sentence length, vocabulary richness |
| - **Visual hashes**: Detects template/reused figures across papers |
| - **Metadata patterns**: Author concentration, timeline anomalies, citation patterns |
| - **Multimodal embeddings**: Combined representation from all modalities |
|
|
| The module uses HDBSCAN, a density-based clustering algorithm, to group papers that may originate from the same paper mill. |
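
As an illustration of the stylometric features listed above, here is a minimal sketch using only the Python standard library. The function name `stylometric_features`, the tokenization rule, and the choice of type-token ratio as the measure of vocabulary richness are assumptions made for this example. The repository's clustering script may compute these features differently.

```python
import math
import re
from collections import Counter


def stylometric_features(text):
    """Return entropy, mean sentence length, and type-token ratio for a text."""
    # Lowercased word tokens; a deliberately simple tokenizer (assumption).
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not tokens or not sentences:
        return {"entropy": 0.0, "mean_sentence_length": 0.0, "type_token_ratio": 0.0}

    counts = Counter(tokens)
    total = len(tokens)
    # Shannon entropy (in bits) of the token frequency distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "entropy": entropy,
        "mean_sentence_length": total / len(sentences),  # tokens per sentence
        "type_token_ratio": len(counts) / total,  # unique tokens / all tokens
    }


features = stylometric_features("The cells grew. The cells died.")
print(features)
```

Feature vectors like this one could be concatenated with the multimodal embeddings before clustering. Unusually uniform values across many papers are one possible sign of templated text.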
|
|
| ## Novelty |
|
|
| 1. **First unified multimodal framework** for scientific fraud detection |
| 2. **Cross-modal inconsistency detection** between text and figures/tables |
| 3. **Automated paper mill discovery** via clustering on combined embeddings |
| 4. **Explainable by design**: attention weights reveal which modality flagged fraud |
|
|
| ## Citation |
|
|
| If you use this framework, please cite: |
|
|
| ```bibtex |
| @software{multimodal_fraud_detection, |
| title={Multimodal Fraudulent Paper Detection Framework}, |
| author={Your Name}, |
| year={2025} |
| } |
| ``` |
|
|
| ## License |
|
|
| MIT License |
|
|
| <!-- ml-intern-provenance --> |
| ## Generated by ML Intern |
|
|
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. |
|
|
| - Try ML Intern: https://smolagents-ml-intern.hf.space |
| - Source code: https://github.com/huggingface/ml-intern |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model_id = "pangweijlu/multimodal-fraud-detection" |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModelForCausalLM.from_pretrained(model_id) |
| ``` |
|
|
| The snippet above assumes a causal language model. This framework is a fraud classifier, so replace `AutoModelForCausalLM` with the appropriate `AutoModel` class for the checkpoint you load. |
|
|