metadata
language:
  - en
license: apache-2.0
base_model:
  - sentence-transformers/all-MiniLM-L6-v2
datasets:
  - itsjhuang/watsonx-docs-document-type
tags:
  - text-classification
  - embeddings
  - technical-documentation
metrics:
  - accuracy
  - f1

Watsonx Docs Document Type Classifier

Binary classifier for IBM Watsonx technical documentation pages.
Given a documentation page, the model predicts whether it is:

  • conceptual (0): primarily used to understand or look up information
  • how-to (1): primarily used to complete a procedure or fix a problem

Model Details

Base embeddings: sentence-transformers/all-MiniLM-L6-v2
Classifier: LinearSVC (C=1.0, max_iter=2000)
Training dataset: itsjhuang/watsonx-docs-document-type
Input: title + first 800 words of document
Output: conceptual or how-to
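
The input format above (title followed by the first 800 words of the document) can be sketched as a small helper. This is a minimal sketch under assumptions: the name `build_input`, whitespace-based word splitting, and the newline between title and body are illustrative choices that this card does not specify.

```python
def build_input(title: str, body: str, max_words: int = 800) -> str:
    """Hypothetical helper: title plus the first `max_words` words of the body."""
    # Assumption: "words" are whitespace-separated tokens.
    words = body.split()
    truncated = " ".join(words[:max_words])
    # Assumption: title and truncated body are joined with a newline.
    return f"{title.strip()}\n{truncated}"
```

The resulting string would be passed to the embedding model as a single text. If the training pipeline joined or tokenized the fields differently, matching that convention is likely to matter more than the exact helper shown here.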

Evaluation Results

Three conditions, each pairing an embedding model with a classifier, were trained and evaluated. Condition B was selected as the best model by test macro F1.

| Condition | Embedding Model | Classifier | Train Acc | Train F1 | Test Acc | Test F1 |
|---|---|---|---|---|---|---|
| A | all-MiniLM-L6-v2 | Logistic Regression | 0.879 | 0.879 | 0.817 | 0.817 |
| B ✅ | all-MiniLM-L6-v2 | LinearSVC | 0.971 | 0.971 | 0.867 | 0.867 |
| C | bge-small-en-v1.5 | Logistic Regression | 0.864 | 0.864 | 0.833 | 0.833 |

Confusion matrices for each condition are available in the repository files.
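
For readers unfamiliar with the selection metric, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python on toy labels. It assumes the standard macro-averaging definition (as in scikit-learn's `f1_score(average="macro")`); the function name `macro_f1` and the toy data are illustrative and are not taken from the evaluation behind the table.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels for illustration only: 0 = conceptual, 1 = how-to.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(round(macro_f1(y_true, y_pred), 3))  # 0.667
```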

Usage

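import joblib
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the base embedding model and the fitted classifier.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
clf = joblib.load("best_model.joblib")

def softmax(x):
    # Subtract the max for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict(text):
    """Score one page; `text` should be the title + first 800 words."""
    embedding = embedder.encode([text], convert_to_numpy=True)
    scores = clf.decision_function(embedding)[0]
    # A binary LinearSVC returns one signed margin; expand it to two scores.
    if np.ndim(scores) == 0:
        scores = np.array([-scores, scores])
    # Softmax of SVM margins gives confidence scores, not calibrated probabilities.
    probs = softmax(scores)
    labels = ["conceptual", "how-to"]
    return {label: float(p) for label, p in zip(labels, probs)}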
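
For a binary classifier, `decision_function` yields a single signed margin s, so the softmax over [-s, s] used in `predict` is equivalent to a logistic sigmoid of 2s. The following sketch checks this numerically using only the standard library; the names `softmax_pair` and `sigmoid` are illustrative. The outputs are a monotone rescaling of the margin and should be read as confidence scores rather than calibrated probabilities.

```python
import math

def softmax_pair(s: float) -> tuple[float, float]:
    """Softmax over the two-element score vector [-s, s]."""
    m = max(-s, s)  # subtract the max for numerical stability
    e_neg, e_pos = math.exp(-s - m), math.exp(s - m)
    total = e_neg + e_pos
    return e_neg / total, e_pos / total

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# The "how-to" score equals sigmoid(2 * s) for any margin s.
for s in (-1.5, 0.0, 0.3, 2.0):
    p_conceptual, p_howto = softmax_pair(s)
    assert math.isclose(p_howto, sigmoid(2 * s))
    assert math.isclose(p_conceptual + p_howto, 1.0)
```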

Limitations

  • Trained on IBM Watsonx documentation only; may not generalize to other technical documentation domains.
  • The label boundary between weakly procedural pages and conceptual capability descriptions is ambiguous and remains a source of error.

Source Dataset

Derived from ibm-research/watsonxDocsQA, licensed under Apache 2.0.