---
tags:
- text-classification
- roberta
- scientific-abstracts
- multi-class
- research-field-classification
datasets:
- ScientificArticleAbstract_Classification
license: apache-2.0
model-index:
- name: ScientificTextClassification_ResearchField
  results:
  - task:
      name: Text Classification
      type: text-classification
    metrics:
    - type: accuracy
      value: 0.941
      name: Accuracy (Top-1)
    - type: macro_f1
      value: 0.935
      name: Macro F1 Score
---

# ScientificTextClassification_ResearchField

## 📚 Overview

This is a **RoBERTa-base** model fine-tuned for multi-class classification of scientific article abstracts. The model predicts the **primary research field** (e.g., Physics, Biology, Computer Science) from the abstract text alone, making it useful for automated journal indexing and literature review organization.

## 🧠 Model Architecture

RoBERTa was chosen for its robustness and its strong handling of the long-range dependencies common in technical and scientific prose.

* **Base Model:** `roberta-base` (a robustly optimized BERT pretraining approach that drops the next-sentence prediction objective).
* **Classification Head:** Outputs 8 distinct categories (`num_labels: 8`).
* **Input Data:** Detailed scientific abstracts from diverse journals.
* **Output:** A probability distribution over the 8 classes: Physics, Chemistry, Medicine, Computer Science, Biology, Geoscience, Materials Science, and Engineering.
* **Training Dataset:** **ScientificArticleAbstract_Classification**, which links abstracts to their high-level research disciplines.

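The **Output** bullet above means the classification head's 8 logits are converted to a probability distribution with a softmax. A minimal, self-contained sketch in plain Python (the logit values are illustrative, not actual model output):

```python
import math

LABELS = ["Physics", "Chemistry", "Medicine", "Computer Science",
          "Biology", "Geoscience", "Materials Science", "Engineering"]

def softmax(logits):
    """Convert raw classification-head logits to probabilities."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for one abstract (not from the real model).
logits = [4.1, 0.3, -1.2, 2.8, 0.1, -0.5, 1.9, 0.4]
probs = softmax(logits)
predicted = LABELS[probs.index(max(probs))]
print(predicted)  # "Physics" — the class with the highest logit
```

In practice the logits come from the model's forward pass; the softmax and argmax steps are the same.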
## 🎯 Intended Use

The model offers utility in several scientific and information retrieval contexts:

1. **Automated Library and Repository Indexing:** Rapidly and accurately tagging new publications with their correct discipline.
2. **Literature Review Automation:** Filtering large databases of articles to focus on specific fields.
3. **Grant Proposal Routing:** Assisting research institutions in routing incoming proposals to the appropriate review panel or expert based on the summary.
4. **Trend Analysis:** Tracking the volume and convergence of research across different fields.

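Use case 3 can be sketched as a simple lookup from the model's predicted field to a review panel. The panel names below are hypothetical, chosen only to illustrate the routing step:

```python
# Hypothetical mapping from predicted field to review panel.
PANELS = {
    "Physics": "Physical Sciences Panel",
    "Chemistry": "Physical Sciences Panel",
    "Medicine": "Life Sciences Panel",
    "Biology": "Life Sciences Panel",
    "Computer Science": "Computing & Engineering Panel",
    "Engineering": "Computing & Engineering Panel",
    "Geoscience": "Earth & Materials Panel",
    "Materials Science": "Earth & Materials Panel",
}

def route_proposal(predicted_field: str) -> str:
    """Route a proposal to a panel based on the model's predicted field."""
    # Fall back to human triage for any label outside the known set.
    return PANELS.get(predicted_field, "Manual Triage")

print(route_proposal("Geoscience"))  # Earth & Materials Panel
```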
## ⚠️ Limitations

1. **Interdisciplinary Papers:** The model performs single-label classification. It may struggle with highly interdisciplinary abstracts that bridge two or more distinct fields (e.g., computational chemistry or bio-engineering).
2. **Vocabulary Drift:** Scientific terminology evolves quickly. New sub-disciplines or extremely novel concepts may not be classified correctly until the model is retrained.
3. **Class Imbalance:** If the real-world distribution of the eight fields shifts significantly from the training set, performance may vary.

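One way to mitigate limitation 1 is to inspect the full probability distribution rather than trusting the single top label: when the top two probabilities are close, the abstract is likely interdisciplinary. A sketch, with an arbitrarily chosen margin and illustrative probabilities:

```python
def flag_interdisciplinary(probs, labels, margin=0.15):
    """Return the top label, plus the runner-up when the gap is small."""
    ranked = sorted(zip(probs, labels), reverse=True)
    (p1, l1), (p2, l2) = ranked[0], ranked[1]
    if p1 - p2 < margin:           # ambiguous: likely bridges two fields
        return [l1, l2]
    return [l1]

labels = ["Physics", "Chemistry", "Medicine", "Computer Science",
          "Biology", "Geoscience", "Materials Science", "Engineering"]
# Illustrative distribution for a computational-chemistry abstract.
probs = [0.05, 0.38, 0.02, 0.33, 0.08, 0.03, 0.07, 0.04]
print(flag_interdisciplinary(probs, labels))  # ['Chemistry', 'Computer Science']
```

Downstream systems can then index such papers under both fields or flag them for human review.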
### MODEL 3: EcommerceAspectSentiment_BART

This model is a **BART-large** sequence-to-sequence model fine-tuned for abstractive multi-aspect sentiment summarization on Dataset 3 (EcommerceCustomerReview_MultiAspectRating).

#### config.json

```json
{
  "_name_or_path": "facebook/bart-large",
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "model_type": "bart",
  "vocab_size": 50265,
  "d_model": 1024,
  "encoder_layers": 12,
  "decoder_layers": 12,
  "encoder_attention_heads": 16,
  "decoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "decoder_ffn_dim": 4096,
  "dropout": 0.1,
  "activation_function": "gelu",
  "init_std": 0.02,
  "num_labels": 3,
  "max_position_embeddings": 1024,
  "eos_token_id": 2,
  "bos_token_id": 0,
  "pad_token_id": 1,
  "is_encoder_decoder": true,
  "scale_embedding": false,
  "forced_eos_token_id": 2,
  "transformers_version": "4.35.2"
}
```