---
language: en
license: mit
tags:
- research
- classification
- scientific-papers
- bert
- academic
- nlp
datasets:
- mendeley-research
pipeline_tag: text-classification
---

# BERT Research Paper Classifier

## Model Description

`bert_text_classifier` is a BERT model fine-tuned to classify research papers into scientific disciplines. It achieves **95.39% accuracy** on a dataset of 140,000+ research papers spanning 9 major scientific categories.

- **Model type:** BERT for sequence classification
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)

## Intended Uses & Limitations

### Primary Use
This model is intended for:
- Automatic categorization of research papers and academic publications
- Building academic recommendation systems
- Organizing digital libraries and research databases
- Educational applications in scientific literature analysis

### Limitations
- Trained primarily on Mendeley research catalog data
- Performance may degrade on papers outside the 9 trained categories
- Performs best on formal academic writing; informal or non-academic text may be classified less reliably

## Categories

The model classifies research papers into 9 scientific disciplines:

| Category | Key Subfields |
|----------|---------------|
| **Biology** | Genetics, Ecology, Biochemistry, Physiology |
| **Business** | Marketing, Finance, Management, Entrepreneurship |
| **Chemistry** | Organic Chemistry, Analytical Chemistry, Biochemistry |
| **Computer Science** | AI, Cloud Computing, Cybersecurity, Software Engineering |
| **Environmental Science** | Climate Change, Conservation, Sustainability |
| **Mathematics** | Algebra, Calculus, Statistics, Optimization |
| **Medicine** | Cardiology, Surgery, Neurology, Pediatrics |
| **Physics** | Quantum Mechanics, Astrophysics, Particle Physics |
| **Psychology** | Clinical, Cognitive, Social, Neuropsychology |
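
At inference time the model returns a class index, so these names must be recovered from the predicted id. A minimal sketch for checking the mapping, assuming `id2label` was saved with the fine-tuned model (if the config only holds generic `LABEL_n` names, fall back to the alphabetical ordering used in the Usage section below):

```python
from transformers import AutoConfig

# Load only the config to see how class ids map to category names.
# Assumption: id2label was populated during fine-tuning; otherwise this
# prints generic entries like {0: 'LABEL_0', 1: 'LABEL_1', ...}.
config = AutoConfig.from_pretrained("Emran025/bert_text_classifier")
print(config.id2label)
```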

## Training Data

### Dataset Statistics
- **Source:** Mendeley Research Catalog
- **Total Papers:** 140,004 (after cleaning)
- **Evaluation Samples:** 27,953 (held-out set, ≈20% of the data)
- **Retained After Cleaning:** 89.81% of the original 155,882 records

### Data Distribution
- Psychology: 16,821 papers (12.0%)
- Chemistry: 16,675 papers (11.9%)
- Physics: 15,941 papers (11.4%)
- Business: 15,929 papers (11.4%)
- Mathematics: 15,464 papers (11.0%)
- Medicine: 15,361 papers (11.0%)
- Computer Science: 14,776 papers (10.6%)
- Biology: 14,729 papers (10.5%)
- Environmental Science: 14,308 papers (10.2%)

## Performance

### Evaluation Results
```
{
  'eval_loss': 0.184,
  'eval_accuracy': 0.9539,
  'eval_runtime': 428.03,
  'eval_samples_per_second': 65.306
}
```

### Detailed Metrics

| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Biology | 0.94 | 0.93 | 0.94 | 3,177 |
| Business | 0.96 | 0.97 | 0.97 | 3,179 |
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 |
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 |
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 |
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 |
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 |
| Physics | 0.97 | 0.95 | 0.96 | 3,181 |
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 |
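
Per-category numbers like these can be reproduced with scikit-learn once you have gold labels and model predictions for the evaluation set. A minimal sketch; the `labels` and `preds` arrays below are placeholders for your own evaluation outputs:

```python
from sklearn.metrics import classification_report

categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine',
              'physics', 'psychology']

# Placeholder integer class ids; in practice these come from the eval set
# (gold labels) and from argmax over the model logits (predictions).
labels = [0, 1, 2, 8]
preds = [0, 1, 3, 8]

# Prints precision, recall, F1-score, and support for each category
print(classification_report(labels, preds,
                            labels=list(range(9)), target_names=categories,
                            zero_division=0))
```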

## Usage

### Direct Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

# Example research paper abstract
text = """
This study explores novel deep learning architectures for protein structure
prediction using transformer-based models and attention mechanisms.
"""

# Preprocess and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

# Map the predicted index to a category name (alphabetical label order)
categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']
print(f"Predicted category: {categories[predicted_class]}")
```
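
To classify several abstracts at once, the tokenizer can batch-encode a list of texts. A short sketch reusing `tokenizer`, `model`, and `categories` from the example above:

```python
# Batch inference: pad the abstracts to a common length and classify together
texts = [
    "Monetary policy effects on small-business lending and entrepreneurship.",
    "A randomized trial of a new beta-blocker in pediatric cardiology.",
]
batch = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    logits = model(**batch).logits
for text, idx in zip(texts, logits.argmax(dim=-1).tolist()):
    print(f"{categories[idx]}: {text}")
```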

### Using Pipeline

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="Emran025/bert_text_classifier",
                      tokenizer="Emran025/bert_text_classifier")

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)
```
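
To see a score for every category rather than only the top label, recent versions of transformers accept a `top_k` argument on the classification pipeline (older releases used `return_all_scores=True` instead):

```python
# top_k=None returns a score for each of the 9 categories, highest first
all_scores = classifier(
    "Advanced quantum computing algorithms for molecular simulation",
    top_k=None,
)
print(all_scores)  # list of {'label': ..., 'score': ...} dicts
```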

## Training Details

### Hyperparameters

- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 3
- Max Sequence Length: 512 tokens
- Optimizer: AdamW
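
These settings map onto the Hugging Face `Trainer` API roughly as follows. This is a minimal sketch, not the authors' training script: `output_dir` is arbitrary, and `train_dataset` / `eval_dataset` are placeholders for tokenized splits of the paper dataset:

```python
from transformers import TrainingArguments, Trainer

# Hyperparameters from the list above (AdamW is the Trainer default
# optimizer; max sequence length is enforced at tokenization time)
training_args = TrainingArguments(
    output_dir="bert_text_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,                   # AutoModelForSequenceClassification with 9 labels
    args=training_args,
    train_dataset=train_dataset,   # placeholder: tokenized training split
    eval_dataset=eval_dataset,     # placeholder: tokenized evaluation split
)
trainer.train()
```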

### Training Environment

- Framework: PyTorch with Transformers
- Hardware: Google Colab GPU
- Training Time: ~6 hours

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bert_research_classifier_2024,
  title = {BERT Research Paper Classification Model},
  author = {Emran Nasser and Mohammed Alyafrosy and Ryadh Alizi},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Emran025/bert_text_classifier}}
}
```

## Contributors

- Emran Nasser (Emran025)
- Mohammed Alyafrosy
- Ryadh Alizi

## License

MIT License - see the LICENSE file for details.

## Repository

https://github.com/Emran025/Research_Paper_Classification_model