EN-VI Parallel Sense Tagger

This project provides a token classification model that predicts word senses in English and Vietnamese text.
It was developed as a student project for NLP coursework to explore cross-lingual sense tagging.

Project Overview

Objective:
The goal of this project is to build a model that can accurately identify the sense of words in English and Vietnamese sentences, using token-level classification. This is especially useful in tasks like machine translation, semantic understanding, and multilingual NLP applications.

Dataset:

  • A custom dataset combining English and Vietnamese texts.
  • Labels include:
    • PAD: Padding tokens
    • O: Tokens not associated with a sense
    • Sense labels: Specific word senses in English and Vietnamese
  • Number of English labels: 6946
  • Number of Vietnamese labels: 7029
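
As a rough illustration of the label scheme above, the sketch below builds a label-to-id mapping with PAD and O placed first, followed by the sense labels. The sense label strings (`bank.n.01` and so on) and the helper name `build_label_maps` are hypothetical; the card does not state the actual label format, the id ordering, or whether the reported label counts include PAD and O.

```python
# Special labels listed in the dataset description; placing them at ids 0 and 1
# is an assumption made for this sketch.
SPECIAL_LABELS = ["PAD", "O"]


def build_label_maps(sense_labels):
    """Return (label2id, id2label) with PAD and O occupying the first two ids."""
    # Deduplicate and sort the sense labels so the mapping is reproducible.
    senses = sorted(set(sense_labels) - set(SPECIAL_LABELS))
    ordered = SPECIAL_LABELS + senses
    label2id = {label: idx for idx, label in enumerate(ordered)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


# Hypothetical sense labels, for illustration only.
label2id, id2label = build_label_maps(["bank.n.01", "bank.n.02", "run.v.01"])
print(label2id)
# {'PAD': 0, 'O': 1, 'bank.n.01': 2, 'bank.n.02': 3, 'run.v.01': 4}
```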

Model Architecture:

  • Base model: XLM-Roberta (cross-lingual transformer)
  • Task: Token Classification
  • Hidden size: 768
  • Number of layers: 12
  • Attention heads: 12
  • Tokenizer: XLM-Roberta tokenizer
  • Special tokens: PAD, BOS, EOS
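
To make the numbers above concrete, the following back-of-the-envelope sketch derives the per-head attention dimension and the size of a token classification head. It assumes the head is a single linear layer from the hidden state to the label space, which is the usual layout for token classification heads in Transformers. Whether this model uses one combined head or a separate head per language is not stated in the card, so both label counts are shown separately.

```python
HIDDEN_SIZE = 768   # from the architecture list
NUM_HEADS = 12      # from the architecture list


def head_dim(hidden_size, num_heads):
    """Dimension of each attention head (hidden size split evenly across heads)."""
    assert hidden_size % num_heads == 0, "hidden size must divide evenly"
    return hidden_size // num_heads


def classifier_params(hidden_size, num_labels):
    """Parameter count of one linear layer: weight matrix plus bias vector."""
    return hidden_size * num_labels + num_labels


print(head_dim(HIDDEN_SIZE, NUM_HEADS))       # 64
print(classifier_params(HIDDEN_SIZE, 6946))   # 5341474 (English label count)
print(classifier_params(HIDDEN_SIZE, 7029))   # 5405301 (Vietnamese label count)
```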

Training

  • The model is trained using PyTorch and the Hugging Face Transformers library.
  • Fine-tuned for cross-lingual word sense tagging.
  • Supports batching, GPU acceleration, and token-level predictions.
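
Token-level training usually requires aligning word-level labels with the subword pieces produced by the tokenizer. The sketch below shows one common convention, not necessarily the one used for this model: only the first subword of each word keeps its label, and every other position receives -100, the default `ignore_index` of PyTorch's cross-entropy loss. The `word_ids` list has the shape returned by a fast tokenizer's `word_ids()` method; the helper name `align_labels` is hypothetical. Since the card lists a PAD label, the original training may instead have assigned PAD to padded positions.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_label_ids):
    """Map word-level label ids onto subword positions.

    word_ids: one entry per subword; None marks special or padding tokens,
              otherwise the index of the word the subword belongs to.
    word_label_ids: one label id per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)             # special or padding token
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)             # continuation subword
        previous = word_id
    return aligned


# Three words; the second word is split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels(word_ids, [1, 2, 1]))
# [-100, 1, 2, -100, 1, -100]
```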

Usage

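import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and model from the Hugging Face Hub.
# Note: the Hub page for this model shows the repository name
# kytrungchauwork/eng_viet_parrallel_sense_tagger; try that ID if this one does not resolve.
model_id = "kytrungchauwork/en-vi-parallel-sense-tagger"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()  # make sure dropout is disabled for inference

# Example sentence
text = "Your input sentence here"
inputs = tokenizer(text, return_tensors="pt")

# Model inference (gradients are not needed)
with torch.no_grad():
    outputs = model(**inputs)

# Highest-scoring label id per token, shape (batch_size, sequence_length)
predictions = outputs.logits.argmax(dim=-1)

# Map the label ids of the first (only) sentence to label names
labels = [model.config.id2label[i] for i in predictions[0].tolist()]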
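
The predictions above are per subword, so a post-processing step is usually needed to read them at word level. The sketch below is one possible approach under stated assumptions: subwords are merged using the "▁" word-start marker of the XLM-Roberta SentencePiece vocabulary, and each word takes the label of its first subword. The helper name `words_with_labels`, the example tokens, and the label names are hypothetical.

```python
def words_with_labels(tokens, label_ids, id2label,
                      special=("<s>", "</s>", "<pad>")):
    """Merge subword tokens into words, keeping the first subword's label."""
    words = []
    for token, label_id in zip(tokens, label_ids):
        if token in special:
            continue  # skip BOS, EOS, and padding tokens
        if token.startswith("▁") or not words:
            # "▁" marks the start of a new word in SentencePiece output.
            words.append([token.lstrip("▁"), id2label[label_id]])
        else:
            # Continuation piece: extend the current word, keep its label.
            words[-1][0] += token
    return [tuple(word) for word in words]


# Hypothetical tokens and label ids, for illustration only.
tokens = ["<s>", "▁The", "▁bank", "▁clo", "sed", "</s>"]
label_ids = [1, 1, 2, 1, 1, 1]
id2label = {0: "PAD", 1: "O", 2: "bank.n.01"}
print(words_with_labels(tokens, label_ids, id2label))
# [('The', 'O'), ('bank', 'bank.n.01'), ('closed', 'O')]
```

With the objects from the Usage snippet, the inputs would be `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])`, `predictions[0].tolist()`, and `model.config.id2label`.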