itsjhuang
/

watsonx-docs-type-classifier

+---
+language:
+- en
+license: apache-2.0
+base_model:
+- sentence-transformers/all-MiniLM-L6-v2
+datasets:
+- itsjhuang/watsonx-docs-document-type
+tags:
+- text-classification
+- embeddings
+- technical-documentation
+metrics:
+- accuracy
+- f1
+---
+# Watsonx Docs Document Type Classifier
+Binary classifier for IBM Watsonx technical documentation pages.
+Given a documentation page, the model predicts whether it is:
+- `conceptual` (0): primarily used to understand or look up information
+- `how-to` (1): primarily used to complete a procedure or fix a problem
+## Model Details
+| | |
+|---|---|
+| Base embeddings | sentence-transformers/all-MiniLM-L6-v2 |
+| Classifier | LinearSVC (C=1.0, max_iter=2000) |
+| Training dataset | [itsjhuang/watsonx-docs-document-type](https://huggingface.co/datasets/itsjhuang/watsonx-docs-document-type) |
+| Input | title + first 800 words of document |
+| Output | `conceptual` or `how-to` |
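The table says the input is the title plus the first 800 words of the document, but not how the two are joined or how words are counted. A minimal sketch, assuming whitespace tokenization and a newline between title and body. Both are assumptions, and `build_input` is a hypothetical helper, not something shipped in the repository:

```python
def build_input(title, body, max_words=800):
    """Join a page title with the first `max_words` words of its body.

    Assumptions (not confirmed by the model card): words are split on
    whitespace, and the title is separated from the body by a newline.
    """
    words = body.split()
    truncated = " ".join(words[:max_words])
    return f"{title.strip()}\n{truncated}".strip()


# Example: a 1000-word body is cut to 800 words before embedding.
example = build_input("Creating a project", "word " * 1000)
```

If the training pipeline joined or truncated the text differently, inference inputs should follow that format instead.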
+## Evaluation Results
+Three embedding/classifier combinations (conditions A, B, and C) were trained and evaluated. Condition B was selected as the best model because it had the highest macro F1 on the test set.
+| Condition | Embedding Model | Classifier | Train Acc | Train F1 | Test Acc | Test F1 |
+|---|---|---|---:|---:|---:|---:|
+| A | all-MiniLM-L6-v2 | Logistic Regression | 0.879 | 0.879 | 0.817 | 0.817 |
+| B ✅ | all-MiniLM-L6-v2 | LinearSVC | 0.971 | 0.971 | 0.867 | 0.867 |
+| C | bge-small-en-v1.5 | Logistic Regression | 0.864 | 0.864 | 0.833 | 0.833 |
+
+Confusion matrices for each condition are available in the repository files.
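For readers unfamiliar with the selection metric, macro F1 is the unweighted mean of the per-class F1 scores, so both classes count equally regardless of how many pages each has. A small pure-Python sketch of that definition follows. The labels are toy values for illustration, not the model's evaluation data, and `macro_f1` is a hypothetical helper:

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Convention used here: a class with no true or predicted
        # samples contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only (0 = conceptual, 1 = how-to).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
score = macro_f1(y_true, y_pred)
```

In this toy case each class has two true positives, one false positive, and one false negative, so both per-class F1 scores and the macro average equal 2/3.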
+## Usage
+```python
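+import joblib
+import numpy as np
+from sentence_transformers import SentenceTransformer
+
+# Same embedding model that was used to train the classifier.
+embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+# Fitted classifier; the path assumes best_model.joblib is available locally.
+clf = joblib.load("best_model.joblib")
+
+def softmax(x):
+    # Subtract the max for numerical stability.
+    e = np.exp(x - np.max(x))
+    return e / e.sum()
+
+def predict(text):
+    """Return softmax-normalized scores for `conceptual` and `how-to`.
+
+    `text` should be the page title plus the first 800 words of the page.
+    The scores come from SVM margins and are not calibrated probabilities.
+    """
+    embedding = embedder.encode([text], convert_to_numpy=True)
+    scores = clf.decision_function(embedding)[0]
+    # A binary LinearSVC returns one signed margin per sample;
+    # expand it to one score per class.
+    if np.ndim(scores) == 0:
+        scores = np.array([-scores, scores])
+    probs = softmax(scores)
+    labels = ["conceptual", "how-to"]  # class 0, class 1
+    # Cast NumPy floats to plain floats so the result is JSON-serializable.
+    return dict(zip(labels, map(float, probs)))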
+```
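The `predict` function in the usage snippet turns a single LinearSVC margin `s` into two scores by applying softmax to `[-s, s]`. Algebraically this is the logistic function of `2s`, so the returned values are a monotone rescaling of the margin rather than calibrated probabilities. A quick check in plain Python. The margins are made-up values, and `softmax_pair` and `sigmoid` are illustrative helpers:

```python
import math


def softmax_pair(s):
    """Softmax over [-s, s] for a scalar margin, mirroring the usage snippet."""
    m = max(-s, s)  # subtract the max for numerical stability
    e_neg, e_pos = math.exp(-s - m), math.exp(s - m)
    total = e_neg + e_pos
    return e_neg / total, e_pos / total


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


for margin in (-1.5, 0.0, 0.3, 2.0):  # illustrative margins, not model output
    p_conceptual, p_howto = softmax_pair(margin)
    # The how-to score equals sigmoid(2 * margin), and the pair sums to 1.
    assert math.isclose(p_howto, sigmoid(2 * margin))
    assert math.isclose(p_conceptual + p_howto, 1.0)
```

If calibrated probabilities are needed, one option is to wrap the classifier in scikit-learn's `CalibratedClassifierCV` at training time. The model card does not indicate that this was done for the released model.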
+## Limitations
+- Trained on IBM Watsonx documentation only; may not generalize to other
+  technical documentation domains.
+- The boundary between weakly procedural pages and conceptual descriptions
+  of capabilities is ambiguous and remains a source of misclassifications.
+## Source Dataset
+The training dataset is derived from
+[`ibm-research/watsonxDocsQA`](https://huggingface.co/datasets/ibm-research/watsonxDocsQA),
+which is licensed under Apache 2.0.