language:
  - en
license: apache-2.0
base_model:
  - sentence-transformers/all-MiniLM-L6-v2
datasets:
  - itsjhuang/watsonx-docs-document-type
tags:
  - text-classification
  - embeddings
  - technical-documentation
metrics:
  - accuracy
  - f1

Watsonx Docs Document Type Classifier

Binary classifier for IBM Watsonx technical documentation pages.
Given a documentation page, the model predicts whether it is:

conceptual (0): primarily used to understand or look up information
how-to (1): primarily used to complete a procedure or fix a problem

Model Details


Base embeddings	sentence-transformers/all-MiniLM-L6-v2
Classifier	LinearSVC (C=1.0, max_iter=2000)
Training dataset	itsjhuang/watsonx-docs-document-type
Input	title + first 800 words of document
Output	`conceptual` or `how-to`
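
The input format listed above can be sketched as a small helper. This is only a sketch under stated assumptions: the name `build_input`, the newline between title and body, and whitespace-based word splitting are hypothetical, since the card does not say how the training text was assembled.

```python
def build_input(title: str, body: str, max_words: int = 800) -> str:
    """Join a page title with the first `max_words` words of its body.

    Assumptions (not confirmed by the model card): words are split on
    whitespace, and the title is separated from the body by a newline.
    """
    words = body.split()
    truncated = " ".join(words[:max_words])
    return f"{title.strip()}\n{truncated}"
```

The resulting string is what would be passed to the embedding model in place of the raw page text.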

Evaluation Results

Three conditions (A, B, C) were trained and evaluated. Condition B was selected as the best model by macro F1 on the test set.

Condition	Embedding Model	Classifier	Train Acc	Train F1	Test Acc	Test F1
A	all-MiniLM-L6-v2	Logistic Regression	0.879	0.879	0.817	0.817
B ✅	all-MiniLM-L6-v2	LinearSVC	0.971	0.971	0.867	0.867
C	bge-small-en-v1.5	Logistic Regression	0.864	0.864	0.833	0.833

Confusion matrices for each condition are available in the repository files.
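
For readers who want to recompute the reported metric from a confusion matrix, macro F1 can be sketched in plain Python as follows. The helper name `macro_f1` and the example counts are illustrative assumptions, not values from this repository.

```python
def macro_f1(confusion):
    """Macro F1 from a square confusion matrix (rows = true, cols = predicted)."""
    n = len(confusion)
    f1_scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp
        fn = sum(confusion[k]) - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro averaging weights every class equally, regardless of its size.
    return sum(f1_scores) / n


# Illustrative counts only, not the model's actual confusion matrix.
example = [[8, 2],
           [1, 9]]
print(round(macro_f1(example), 3))
```

Because each class contributes equally to the average, macro F1 penalizes a model that does well on one label at the expense of the other.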

Usage

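import joblib
import numpy as np
from sentence_transformers import SentenceTransformer

# Same embedding model as in training; the classifier is the LinearSVC from condition B.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
clf = joblib.load("best_model.joblib")

LABELS = ["conceptual", "how-to"]  # class 0, class 1

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict(text):
    """Return one score per label for `text` (title + first 800 words of the page)."""
    embedding = embedder.encode([text], convert_to_numpy=True)
    scores = clf.decision_function(embedding)[0]
    # A binary LinearSVC returns a single signed margin (positive favors the
    # second class), so expand it into a two-class score vector.
    if np.ndim(scores) == 0:
        scores = np.array([-scores, scores])
    # Softmax over SVM margins yields normalized scores, not calibrated probabilities.
    probs = softmax(scores)
    return {label: float(p) for label, p in zip(LABELS, probs)}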
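
In the binary case, the softmax step in `predict` reduces to a logistic function of the SVM margin. The sketch below checks that identity numerically using only the standard library; `margin_to_scores` is a hypothetical helper name, and the margins are arbitrary test values rather than model outputs.

```python
import math


def margin_to_scores(margin: float) -> dict:
    """Softmax over [-margin, margin], mirroring the binary branch of predict()."""
    hi = max(-margin, margin)  # subtract the max for numerical stability
    e_neg = math.exp(-margin - hi)
    e_pos = math.exp(margin - hi)
    total = e_neg + e_pos
    return {"conceptual": e_neg / total, "how-to": e_pos / total}


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


for margin in (-1.5, 0.0, 0.3, 2.0):
    scores = margin_to_scores(margin)
    # Identity being checked: softmax([-m, m])[1] == sigmoid(2m)
    assert math.isclose(scores["how-to"], sigmoid(2 * margin))
```

The scores are therefore a monotone transform of the distance to the decision boundary. If calibrated probabilities are needed, one option is to refit with scikit-learn's `CalibratedClassifierCV` wrapped around a `LinearSVC`, which requires labeled data.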

Limitations

Trained on IBM Watsonx documentation only; may not generalize to other technical documentation domains.
The label boundary between weakly procedural pages and conceptual capability descriptions remains a residual source of error.

Source Dataset

Derived from ibm-research/watsonxDocsQA, licensed under Apache 2.0.