---
tags:
- ml-intern
---

# Multimodal Fraudulent Paper Detection Framework

A deep learning system that detects fraudulent scientific papers by analyzing **text**, **images**, **tabular data**, and **metadata** simultaneously. It also identifies **paper mills** through clustering analysis.

## Key Features

- **True Multimodal Analysis**: Combines 4 modalities (text, image, tables, metadata) vs. existing single-modality approaches
- **Cross-Modal Attention**: Detects inconsistencies between text claims and visual/tabular evidence
- **Explainable Predictions**: Shows which modality contributed most to fraud detection
- **Paper Mill Detection**: Clusters fraudulent papers to identify organized operations
- **Anomaly Scoring**: Identifies outlier papers with unusual patterns

## Architecture

```
Text (SciBERT)        ──┐
Image (ViT+Forensics) ──┼──▶ Cross-Modal Fusion ──▶ Fraud Classifier
Tables (Transformer) ───┤    (Multi-head Attention)
Metadata (MLP) ─────────┘
```

## Installation

```bash
pip install -r requirements.txt
```

## Quick Start

### 1. Train the Model

```bash
python scripts/train.py \
  --dataset Lihuchen/pubmed_retraction \
  --text_model allenai/scibert_scivocab_uncased \
  --batch_size 16 \
  --epochs 5 \
  --lr 2e-5 \
  --output_dir ./outputs
```

### 2. Detect Paper Mills

```bash
python scripts/cluster_paper_mills.py \
  --embeddings ./outputs/test_embeddings.npy \
  --dataset Lihuchen/pubmed_retraction \
  --output_dir ./clustering_outputs
```

### 3. Run Inference

```bash
python scripts/inference.py \
  --model_path ./outputs/best_model.pt \
  --title "Your Paper Title" \
  --text "Your abstract text..."
```

## Datasets

| Dataset | Modality | Purpose | HF Hub |
|---------|----------|---------|--------|
| PubMed Retraction | Text + Metadata | Binary fraud labels | `Lihuchen/pubmed_retraction` |
| PubMed Retraction (Instruction) | Text + Explanation | Instruction tuning | `Lihuchen/pubmed_retraction_instruction` |
| Paper Mill Benchmark | Text + Tabular | Paper mill labels | `syedzayyan/paper_mill_benchmark` |
| PubTables-1M | Image + Table | Table extraction | `bsmock/pubtables-1m` |

## Paper Mill Clustering

The clustering module analyzes:

- **Stylometric features**: Text entropy, sentence length, vocabulary richness
- **Visual hashes**: Detects template/reused figures across papers
- **Metadata patterns**: Author concentration, timeline anomalies, citation patterns
- **Multimodal embeddings**: Combined representation from all modalities

Uses HDBSCAN for density-based clustering to find paper mill groups.

## Novelty

1. **First unified multimodal framework** for scientific fraud detection
2. **Cross-modal inconsistency detection** between text and figures/tables
3. **Automated paper mill discovery** via clustering on combined embeddings
4. **Explainable by design**: attention weights reveal which modality flagged fraud

## Citation

If you use this framework, please cite:

```bibtex
@software{multimodal_fraud_detection,
  title={Multimodal Fraudulent Paper Detection Framework},
  author={Your Name},
  year={2025}
}
```

## License

MIT License

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pangweijlu/multimodal-fraud-detection"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
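
The stylometric features named under Paper Mill Clustering (text entropy, sentence length, vocabulary richness) could be computed along the following lines. This is a minimal sketch, not the repository's implementation: the function name `stylometric_features`, the regex tokenizer, and the choice of type-token ratio as the vocabulary-richness measure are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def stylometric_features(text: str) -> dict[str, float]:
    """Compute three simple stylometric features for one document.

    Hypothetical helper: the names and exact definitions are assumptions,
    not taken from the framework's source code.
    """
    # Lowercased word tokens; a crude tokenizer chosen for illustration.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {
            "entropy": 0.0,
            "mean_sentence_length": 0.0,
            "type_token_ratio": 0.0,
        }

    counts = Counter(words)
    total = len(words)
    # Shannon entropy (in bits) of the word-frequency distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "entropy": entropy,
        # Average number of word tokens per sentence.
        "mean_sentence_length": total / len(sentences),
        # Distinct words divided by total words (vocabulary richness).
        "type_token_ratio": len(counts) / total,
    }


features = stylometric_features("We report results. We report results.")
print(features)
```

Repetitive, templated text tends to lower both the entropy and the type-token ratio, which is why features of this kind are a plausible input to the HDBSCAN clustering step alongside the other signals listed above.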