
Gemma-3n-E4B-It - Hindi & Kannada LoRA Adapter (Psychiatric Domain)

This is a LoRA (Low-Rank Adaptation) adapter for Google's Gemma-3n-E4B-It model, fine-tuned for automatic speech recognition (direct audio-to-text transcription) of psychiatric interviews and therapy sessions in Hindi and Kannada.

⚠️ Content Warning & Gated Repository

This model is behind a gated repository for important safety reasons:

Since this model was trained on real-world clinical psychiatric conversations, hallucinated outputs may contain sensitive mental-health content. This model is intended solely for research purposes and clinical applications by qualified teams.

To request access: Please contact the repository owner with:

  • Your research affiliation or clinical organization
  • Intended use case
  • Confirmation of ethical approval for your project (if applicable)

Model Description

  • Base Model: google/gemma-3n-e4b-it
  • Adapter Type: LoRA (Low-Rank Adaptation)
  • Languages: Hindi, Kannada
  • Task: Automatic Speech Recognition (Direct Audio-to-Text Transcription)
  • Domain: Psychiatric interviews and therapy sessions
  • PEFT Version: 0.18.0

LoRA Configuration

  • Rank (r): 32
  • Alpha: 64
  • Dropout: 0.05
  • RSLoRA: Enabled
  • Target Modules: q_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
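With RSLoRA (rank-stabilized LoRA) enabled, PEFT scales the low-rank update by alpha / sqrt(r) instead of the standard alpha / r, which keeps the update magnitude stable at higher ranks. A minimal sketch of the scaling factor implied by this adapter's configuration (the helper function is illustrative, not part of PEFT's API):

```python
import math

def lora_scaling(alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling factor applied to the low-rank update (B @ A) in LoRA.

    Standard LoRA uses alpha / r; RSLoRA uses alpha / sqrt(r).
    """
    return alpha / math.sqrt(r) if use_rslora else alpha / r

# With this adapter's config (r=32, alpha=64):
standard = lora_scaling(64, 32, use_rslora=False)  # 2.0
rslora = lora_scaling(64, 32, use_rslora=True)     # 64 / sqrt(32) ≈ 11.31
print(standard, rslora)
```

At r=32 the RSLoRA scaling is roughly 5.7× larger than the standard scaling, which is why alpha values are not directly comparable across the two schemes.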

Performance

The fine-tuned model shows significant improvements over the base Gemma-3n-E4B-It model across both languages on psychiatric conversation data. All improvements are statistically significant (p < 0.001).

Test Set Results (18 files, ~9.2 hours)

Combined Results (Both Languages)

| Metric | Base Model (Median) | Fine-tuned Model (Median) | Improvement |
|---|---|---|---|
| WER (normalized) | 52.89% | 38.56% | 27.09% relative |
| CER | 31.72% | 20.06% | 36.76% relative |
| WIP | 0.2326 | 0.3857 | +65.82% |
| MER | 55.93% | 41.67% | 25.49% relative |
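WER in these tables counts word-level substitutions, insertions, and deletions against the reference word count. A minimal pure-Python sketch of the computation (in practice libraries such as jiwer are used, and the "normalized" variant additionally applies text normalization not shown here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat"))      # ≈ 0.667 (1 sub + 1 del over 3 words)
```

CER is the same computation applied at the character level, which is why it is consistently lower than WER for these languages.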

Language-Specific Results

Hindi (10 files, ~5.3 hours):

  • WER (normalized): 37.31% → 29.41% (21.18% relative improvement)
  • CER: 26.93% → 18.80% (30.19% relative improvement)
  • WIP: 0.2326 → 0.3857 (+65.82%)

Kannada (8 files, ~3.9 hours):

  • WER (normalized): 68.52% → 50.98% (25.61% relative improvement)
  • CER: 37.65% → 21.40% (43.16% relative improvement)
  • WIP: 0.2326 → 0.3857 (+65.82%)

Dev Set Results (15 files, ~7.4 hours)

| Metric | Base Model (Median) | Fine-tuned Model (Median) | Improvement |
|---|---|---|---|
| WER (normalized) | 55.38% | 36.80% | 33.56% relative |
| CER | 33.85% | 19.85% | 41.36% relative |

Training Set Results (390 sampled instances, ~34.6 hours)

| Metric | Base Model (Median) | Fine-tuned Model (Median) | Improvement |
|---|---|---|---|
| WER (normalized) | 54.63% | 33.78% | 38.16% relative |
| CER | 31.19% | 16.62% | 46.71% relative |

Transcription Prompts

This model requires carefully crafted language-specific prompts that provide detailed transcription guidelines. The prompts are available in the prompts/ directory and specify:

Key Prompt Features:

  1. Verbatim Transcription Requirements

    • Preserve all dysfluencies (filler words, repetitions, stammers, partial words)
    • Include incomplete phrases as-is
    • No grammar correction or polishing
  2. Punctuation Rules

    • Allowed: full-stop, question mark, comma, ellipsis, em-dash, exclamation mark
    • Specific usage guidelines for each punctuation type
  3. Special Tokens for Non-Speech Events

    • Emotional expressions: [हंसना], [रोना], [चिल्लाना] (Hindi)
    • Emotional expressions: [ನಗುವುದು], [ಅಳುವುದು], [ಕೂಗುವುದು] (Kannada)
    • Unclear speech marker: [अस्पष्ट] (Hindi), [ಅಸ್ಪಷ್ಟ] (Kannada)
  4. Code-Mixing Handling

    • Both the Hindi and Kannada prompts handle code-mixing with English and other Indic languages
    • All output is in the respective native script (Hindi: Devanagari, Kannada: Kannada script)
  5. Numerical Quantities

    • Spell out all numbers as spoken (e.g., "दो बजे" not "2 बजे", "ಎರಡು ಗಂಟೆಗೆ" not "2 ಗಂಟೆಗೆ")
    • Apply to both cardinal and ordinal numbers

Prompt Files:

  • prompts/transcription_prompt_chunks_gemma_hindi.txt - Hindi psychiatric interview transcription
  • prompts/transcription_prompt_chunks_gemma_kannada.txt - Kannada psychiatric interview transcription

Training Data

The model was fine-tuned on psychiatric interview transcriptions across two languages:

Training Set (~390 sampled instances from 75 audio files, 34.62 hours total):

  • Hindi: 40 files / 17.07 hours
  • Kannada: 35 files / 17.56 hours

Dev Set (15 files, 7.40 hours):

  • Hindi: 8 files / 3.89 hours
  • Kannada: 7 files / 3.50 hours

Test Set (18 files, 9.17 hours):

  • Hindi: 10 files / 5.28 hours
  • Kannada: 8 files / 3.89 hours

All splits were created with a random seed of 42.

Audio Preprocessing Requirements

IMPORTANT: Before using this model for transcription, audio files must be preprocessed:

  1. Chunk audio files to under 30 seconds using Silero VAD (Voice Activity Detection)
  2. The model was trained on ~30-second chunks, so matching this duration yields optimal performance
  3. Silero VAD places chunk boundaries at natural speech pauses rather than mid-utterance
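One way to turn VAD output into sub-30-second chunks is to group consecutive speech segments until the next segment would push the chunk past the limit. The sketch below assumes speech timestamps have already been obtained (e.g. from Silero VAD's get_speech_timestamps); the grouping policy is an illustrative assumption, not the authors' exact pipeline:

```python
def group_vad_segments(segments, max_chunk_s=30.0):
    """Group consecutive VAD speech segments (start, end) in seconds into
    chunks whose total span stays under max_chunk_s.

    Assumes no single segment itself exceeds max_chunk_s; `segments` would
    typically come from Silero VAD's get_speech_timestamps().
    """
    chunks, current = [], []
    for start, end in segments:
        # Start a new chunk if adding this segment would exceed the limit.
        if current and end - current[0][0] > max_chunk_s:
            chunks.append((current[0][0], current[-1][1]))
            current = []
        current.append((start, end))
    if current:
        chunks.append((current[0][0], current[-1][1]))
    return chunks

segs = [(0.0, 8.0), (9.0, 20.0), (22.0, 29.0), (31.0, 40.0)]
print(group_vad_segments(segs))  # [(0.0, 29.0), (31.0, 40.0)]
```

Each resulting (start, end) span can then be sliced from the original audio file and fed to the model as one chunk.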

Usage

Required Prompts

The model requires language-specific prompts for optimal performance. These prompts are included in the prompts/ directory:

  • Hindi: prompts/transcription_prompt_chunks_gemma_hindi.txt
  • Kannada: prompts/transcription_prompt_chunks_gemma_kannada.txt

These prompts contain detailed instructions for verbatim transcription, punctuation rules, handling of dysfluencies, code-mixing, and special tokens.

Transcription Example

from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel
import torch

# Load base model and processor
model_id = "google/gemma-3n-e4b-it"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).to("cuda")

# Load LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "YOUR_HF_USERNAME/gemma-3n-E4B-it-trl-sft-hi-kn-final"
)

# Load the appropriate prompt for your language
# For Hindi:
with open("prompts/transcription_prompt_chunks_gemma_hindi.txt", "r") as f:
    prompt_text = f.read()

# For Kannada:
# with open("prompts/transcription_prompt_chunks_gemma_kannada.txt", "r") as f:
#     prompt_text = f.read()

# Path to your audio file (must be chunked to <30s using Silero VAD)
audio_file_path = "path/to/your/audio_chunk.wav"

# Prepare the messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt_text},
            {"type": "audio", "audio": audio_file_path},
        ]
    }
]

# Apply chat template and generate
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
input_len = inputs["input_ids"].shape[-1]
inputs = inputs.to(model.device, dtype=model.dtype)

with torch.inference_mode():
    generation = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=1.0,
        top_k=64,
        top_p=0.95,
        disable_compile=False
    )
    generation = generation[:, input_len:]

transcription = processor.batch_decode(generation, skip_special_tokens=True)[0]
print(transcription)

Merging Adapter with Base Model (Optional)

For faster inference, you can merge the adapter weights with the base model:

from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel
import torch

# Load base model and adapter
base_model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-3n-e4b-it",
    torch_dtype=torch.float16
)
processor = AutoProcessor.from_pretrained("google/gemma-3n-e4b-it")

model = PeftModel.from_pretrained(
    base_model,
    "YOUR_HF_USERNAME/gemma-3n-E4B-it-trl-sft-hi-kn-final"
)

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")
processor.save_pretrained("./merged_model")

Error Type Analysis

The fine-tuned model shows significant improvements across all error types:

Test Set Error Reduction (Combined)

| Error Type | Base Model (Median) | Fine-tuned Model (Median) | Improvement |
|---|---|---|---|
| Substitutions | 24.0 | 20.0 | 16.7% reduction |
| Insertions | 3.0 | 2.0 | 33.3% reduction |
| Deletions | 5.0 | 4.0 | 20.0% reduction |

Notable improvements (mean ± SD):

  • Substitutions: 25.14 ± 9.97 → 20.66 ± 8.61
  • Insertions: 85.29 ± 248.89 → 14.69 ± 96.29, bringing the base model's occasional runaway insertions under control
  • Deletions: 6.46 ± 5.54 → 4.99 ± 4.91

Intended Use

Primary Use Cases

  • Clinical Research: Direct audio-to-text transcription of psychiatric interviews in Hindi and Kannada
  • Mental Health Documentation: Automated verbatim transcription of therapy sessions and psychiatric interviews
  • Multilingual ASR: Native multilingual automatic speech recognition for psychiatric domain

Out-of-Scope Use

  • Diagnostic Tool: This model should not be used as a standalone diagnostic tool
  • Replacement for Human Review: Model transcriptions should be reviewed by qualified mental health professionals
  • Non-Psychiatric General Purpose: While it may work on general speech, it is optimized for the psychiatric domain

Limitations and Biases

  • Domain-Specific: Optimized for psychiatric/clinical conversations; may not generalize well to other domains
  • Language Coverage: Specifically tuned for Hindi and Kannada; performance on other languages may vary
  • Hallucination Risk: As with all LLMs, the model may hallucinate content, which in this psychiatric context could include incorrect mental health information
  • Data Privacy: Trained on real clinical data; users must ensure compliance with data protection regulations (HIPAA, GDPR, etc.)
  • Variable Performance by Language: Kannada shows higher baseline error rates than Hindi

Ethical Considerations

  • This model was trained on real psychiatric conversations. Users must ensure appropriate ethical approvals and patient consent for any clinical use
  • Transcriptions should always be reviewed by qualified mental health professionals
  • The model should not be used for surveillance or unauthorized recording of psychiatric sessions
  • Proper data security and patient confidentiality must be maintained
  • Outputs should not be used for clinical decision-making without human verification
  • Special care must be taken given the sensitive nature of psychiatric content

Statistical Significance

All reported improvements have been validated using the Wilcoxon signed-rank test with p < 0.001, indicating highly significant improvements over the base model across all metrics.

License

This adapter is released under the Apache 2.0 license. However, users must also comply with Google's Gemma license and any applicable regulations regarding clinical and psychiatric data.

Citation

If you use this model in your research, please cite:

@misc{gemma-hindi-kannada-psychiatric-lora,
  title={Gemma-3n-E4B-It Hindi \& Kannada LoRA Adapter for Psychiatric ASR},
  author={Shukla, Lekhansh and Shivaprakash, Prakrithi},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/lekhansh/gemma-3n-E4B-it-trl-sft-hi-kn-final}}
}

Contact

For access requests or questions about this model, please contact Dr. Lekhansh Shukla at drlekhansh@gmail.com.


Disclaimer: This model is provided for research purposes only. Users are responsible for ensuring compliance with all applicable laws, regulations, and ethical guidelines when using this model, particularly regarding patient privacy, mental health data handling, and psychiatric care standards.
