RLAC Audio-transcription Segmenter - Chroniques (Hybrid)

Description

This model uses a sophisticated Hybrid Deep Learning approach to detect radio chronicle segments from textual transcriptions (SRT files). It combines the semantic power of CamemBERT with the sequential modeling of a Bi-LSTM + CRF architecture.

Hugging Face link: eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-hybrid

Model Details

This approach relies on two inseparable components that must be used together:

1. Feature Extractor & BERT Embeddings (`radio_chronique_hybrid_base.pkl`)

This file is a Python object (serialized with joblib) that manages the transformation of raw text into rich numerical vectors.

Base Features: Air time, segment duration, word count, punctuation patterns, and jingle detection.
Semantic Embeddings: It uses a CamemBERT-base model to extract high-dimensional semantic representations of the text.
Normalization: It contains the scaler and tfidf_vectorizer used during training.

2. Sequential Classifier (`radio_chronique_hybrid_hybrid.pt`)

This is a PyTorch model that takes the features prepared by the base extractor to make the final sequence-aware decision.

Architecture: A Bi-LSTM (Bidirectional Long Short-Term Memory) layer followed by a CRF (Conditional Random Field) layer.
Sequential Logic: Unlike simple classifiers that look at segments in isolation, this model analyzes the entire flow of the show. It understands that a chronicle has a beginning, a middle, and an end.
Consistency: The CRF layer ensures that the predicted sequence is logically consistent (e.g., an "inside chronicle" segment cannot exist without a preceding "start chronicle" segment).
Training: It was trained using Focal Loss to specifically improve the detection of chronicle starts, which are rare events in the data stream.

Advantages

Deep Understanding: Uses BERT to understand the actual meaning of the words.
Context Aware: Analyzes the rhythm and structure of the radio show.
High Precision: The CRF layer significantly reduces false positives by enforcing temporal coherence.

Usage

Both files are loaded and used by the predict.py script:

from train import RadioChroniqueClassifier, HybridSequenceClassifier
base_extractor = RadioChroniqueClassifier.load_model("models/radio_chronique_hybrid_base.pkl")
hybrid_model = HybridSequenceClassifier.load("models/radio_chronique_hybrid_hybrid.pt")

Author

Maintained by eglantinefonrose.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-hybrid

RLAC - Radio Live A La Carte

Collection

A collection of models for automated segmentation of live radio broadcasts, leveraging both raw audio signals and text transcriptions to identify show • 3 items • Updated 6 days ago