---
language: en
license: mit
tags:
- research
- classification
- scientific-papers
- bert
- academic
- nlp
datasets:
- mendeley-research
pipeline_tag: text-classification
---

# BERT Research Paper Classifier

## Model Description

`bert_text_classifier` is a BERT model fine-tuned to classify research papers into scientific disciplines. It achieves **95.39% accuracy** on a dataset of 140,000+ research papers spanning 9 major scientific categories.

- **Model type:** BERT for sequence classification
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)

## Intended Uses & Limitations

### Primary Use
This model is intended for:
- Automatic categorization of research papers and academic publications
- Building academic recommendation systems
- Organizing digital libraries and research databases
- Educational applications in scientific literature analysis

### Limitations
- Trained primarily on Mendeley research catalog data
- Performance may degrade on papers outside the 9 trained categories
- Performs best on formal academic writing; informal or non-academic text may be classified less reliably

## Categories

The model classifies research papers into 9 scientific disciplines:

| Category | Key Subfields |
|----------|---------------|
| **Biology** | Genetics, Ecology, Biochemistry, Physiology |
| **Business** | Marketing, Finance, Management, Entrepreneurship |
| **Chemistry** | Organic Chemistry, Analytical Chemistry, Biochemistry |
| **Computer Science** | AI, Cloud Computing, Cybersecurity, Software Engineering |
| **Environmental Science** | Climate Change, Conservation, Sustainability |
| **Mathematics** | Algebra, Calculus, Statistics, Optimization |
| **Medicine** | Cardiology, Surgery, Neurology, Pediatrics |
| **Physics** | Quantum Mechanics, Astrophysics, Particle Physics |
| **Psychology** | Clinical, Cognitive, Social, Neuropsychology |
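
At inference time the model returns a class index, so these names must be recovered from the predicted id. A minimal sketch for checking the mapping, assuming `id2label` was saved with the fine-tuned model (if the config only holds generic `LABEL_n` names, fall back to the alphabetical ordering used in the Usage section below):

```python
from transformers import AutoConfig

# Load only the config to see how class ids map to category names.
# Assumption: id2label was populated during fine-tuning; otherwise this
# prints generic entries like {0: 'LABEL_0', 1: 'LABEL_1', ...}.
config = AutoConfig.from_pretrained("Emran025/bert_text_classifier")
print(config.id2label)
```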

## Training Data

### Dataset Statistics
- **Source:** Mendeley Research Catalog
- **Total Papers:** 140,004 (after cleaning)
- **Evaluation Samples:** 27,953 (held-out set, ≈20% of the data)
- **Retained After Cleaning:** 89.81% of the original 155,882 records

### Data Distribution
- Psychology: 16,821 papers (12.0%)
- Chemistry: 16,675 papers (11.9%)
- Physics: 15,941 papers (11.4%)
- Business: 15,929 papers (11.4%)
- Mathematics: 15,464 papers (11.0%)
- Medicine: 15,361 papers (11.0%)
- Computer Science: 14,776 papers (10.6%)
- Biology: 14,729 papers (10.5%)
- Environmental Science: 14,308 papers (10.2%)

## Performance

### Evaluation Results
```
{
  'eval_loss': 0.184,
  'eval_accuracy': 0.9539,
  'eval_runtime': 428.03,
  'eval_samples_per_second': 65.306
}
```

### Detailed Metrics

| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Biology | 0.94 | 0.93 | 0.94 | 3,177 |
| Business | 0.96 | 0.97 | 0.97 | 3,179 |
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 |
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 |
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 |
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 |
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 |
| Physics | 0.97 | 0.95 | 0.96 | 3,181 |
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 |
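
Per-category numbers like these can be reproduced with scikit-learn once you have gold labels and model predictions for the evaluation set. A minimal sketch; the `labels` and `preds` arrays below are placeholders for your own evaluation outputs:

```python
from sklearn.metrics import classification_report

categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine',
              'physics', 'psychology']

# Placeholder integer class ids; in practice these come from the eval set
# (gold labels) and from argmax over the model logits (predictions).
labels = [0, 1, 2, 8]
preds = [0, 1, 3, 8]

# Prints precision, recall, F1-score, and support for each category
print(classification_report(labels, preds,
                            labels=list(range(9)), target_names=categories,
                            zero_division=0))
```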

## Usage

### Direct Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

# Example research paper abstract
text = """
This study explores novel deep learning architectures for protein structure
prediction using transformer-based models and attention mechanisms.
"""

# Preprocess and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

# Map the predicted index to a category name (alphabetical label order)
categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']
print(f"Predicted category: {categories[predicted_class]}")
```
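
To classify several abstracts at once, the tokenizer can batch-encode a list of texts. A short sketch reusing `tokenizer`, `model`, and `categories` from the example above:

```python
# Batch inference: pad the abstracts to a common length and classify together
texts = [
    "Monetary policy effects on small-business lending and entrepreneurship.",
    "A randomized trial of a new beta-blocker in pediatric cardiology.",
]
batch = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    logits = model(**batch).logits
for text, idx in zip(texts, logits.argmax(dim=-1).tolist()):
    print(f"{categories[idx]}: {text}")
```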

### Using Pipeline

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="Emran025/bert_text_classifier",
                      tokenizer="Emran025/bert_text_classifier")

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)
```
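
To see a score for every category rather than only the top label, recent versions of transformers accept a `top_k` argument on the classification pipeline (older releases used `return_all_scores=True` instead):

```python
# top_k=None returns a score for each of the 9 categories, highest first
all_scores = classifier(
    "Advanced quantum computing algorithms for molecular simulation",
    top_k=None,
)
print(all_scores)  # list of {'label': ..., 'score': ...} dicts
```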

## Training Details

### Hyperparameters

- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 3
- Max Sequence Length: 512 tokens
- Optimizer: AdamW
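
These settings map onto the Hugging Face `Trainer` API roughly as follows. This is a minimal sketch, not the authors' training script: `output_dir` is arbitrary, and `train_dataset` / `eval_dataset` are placeholders for tokenized splits of the paper dataset:

```python
from transformers import TrainingArguments, Trainer

# Hyperparameters from the list above (AdamW is the Trainer default
# optimizer; max sequence length is enforced at tokenization time)
training_args = TrainingArguments(
    output_dir="bert_text_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,                   # AutoModelForSequenceClassification with 9 labels
    args=training_args,
    train_dataset=train_dataset,   # placeholder: tokenized training split
    eval_dataset=eval_dataset,     # placeholder: tokenized evaluation split
)
trainer.train()
```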

### Training Environment

- Framework: PyTorch with Transformers
- Hardware: Google Colab GPU
- Training Time: ~6 hours

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bert_research_classifier_2024,
  title = {BERT Research Paper Classification Model},
  author = {Emran Nasser and Mohammed Alyafrosy and Ryadh Alizi},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Emran025/bert_text_classifier}}
}
```

## Contributors

- Emran Nasser (Emran025)
- Mohammed Alyafrosy
- Ryadh Alizi

## License

MIT License - see the LICENSE file for details.

## Repository

https://github.com/Emran025/Research_Paper_Classification_model