GENATATOR-ModernGENA (Human Region Model)

Overview

GENATATOR-ModernGENA (Region Model) is a DNA language model fine-tuned for intragenic region detection directly from genomic DNA sequences.

The model performs token-level multilabel classification to identify strand-aware genomic regions, complementing the edge model used for transcript boundary detection.

This model focuses on region-level signal detection, specifically:

  • identifying intragenic regions (gene bodies)
  • distinguishing them from intergenic background
  • modeling strand-specific transcriptional activity

Model

Model name on Hugging Face:

genatator-moderngena-base-human-region-model

Architecture properties:

  • backbone: ModernBERT (ModernGENA)
  • layers: 22
  • hidden size: 768
  • parameters: ~135M
  • tokenization: BPE
  • output head: linear projection to 6 classes
  • output resolution: token-level
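
Because the tokenizer is BPE and the output is token-level, one prediction can cover several nucleotides. The following is a minimal sketch of one way to expand per-token values into a per-nucleotide track. It assumes that the number of nucleotides covered by each token is already known (special tokens would have length 0). The function name `expand_to_nucleotides` and the toy token lengths are hypothetical and not part of the model's code.

```python
def expand_to_nucleotides(token_lengths, token_values):
    """Repeat each token's value once per nucleotide that the token covers."""
    if len(token_lengths) != len(token_values):
        raise ValueError("one value per token is required")
    track = []
    for length, value in zip(token_lengths, token_values):
        track.extend([value] * length)
    return track


# Hypothetical tokenization of a 10-nt sequence into tokens of 4, 3 and 3 nt.
token_lengths = [4, 3, 3]
token_probs = [0.9, 0.8, 0.1]  # made-up per-token values for one class

track = expand_to_nucleotides(token_lengths, token_probs)
print(len(track))  # 10
```

How token lengths are obtained depends on the tokenizer and is not shown here.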

The model predicts six classes, one output channel per class.

The output channels are ordered as follows:

["TSS+", "TSS-", "PolyA+", "PolyA-", "Intragenic+", "Intragenic-"]

Where:

  • TSS+ — transcription start site on the forward strand
  • TSS- — transcription start site on the reverse strand
  • PolyA+ — transcript termination site on the forward strand
  • PolyA- — transcript termination site on the reverse strand
  • Intragenic+ — intragenic region (forward strand)
  • Intragenic- — intragenic region (reverse strand)
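
The class order above can be turned into a name-to-index lookup, which helps avoid hard-coded channel numbers when slicing model outputs. This is an illustrative sketch: the helper `select_channels` and the toy scores are hypothetical, and only the class order comes from this card.

```python
# Class order as stated in this model card.
CLASS_NAMES = ["TSS+", "TSS-", "PolyA+", "PolyA-", "Intragenic+", "Intragenic-"]
CLASS_INDEX = {name: i for i, name in enumerate(CLASS_NAMES)}


def select_channels(per_token_scores, names):
    """Pick the given class columns from a list of per-token score rows."""
    idx = [CLASS_INDEX[n] for n in names]
    return [[row[i] for i in idx] for row in per_token_scores]


# Toy scores for two tokens (made-up numbers, for illustration only).
toy_scores = [
    [0.1, 0.0, 0.0, 0.0, 0.9, 0.2],
    [0.0, 0.0, 0.3, 0.0, 0.8, 0.1],
]

intragenic = select_channels(toy_scores, ["Intragenic+", "Intragenic-"])
print(intragenic)  # [[0.9, 0.2], [0.8, 0.1]]
```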

Training Data

This model was fine-tuned on full genomic sequences, including intergenic regions.

Training data includes annotations for:

  • mRNA transcripts
  • lncRNA transcripts

Dataset characteristics:

  • human genome only
  • strand-aware annotations
  • genome-wide supervision
  • human chromosomes 8, 20, and 21 held out from training
  • only chromosomes longer than 100 kbp included in training

The model is trained to distinguish transcribed regions from background genomic sequence across full chromosomes.
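
The hold-out and length rules listed above can be sketched as a simple partition of sequence names. This is an assumption-laden illustration rather than the authors' data pipeline: the `chr` naming convention, the function `split_sequences`, and the toy lengths are hypothetical, and the sketch assumes the length filter applies only to training sequences.

```python
HELD_OUT = {"chr8", "chr20", "chr21"}  # held-out chromosomes named in this card
MIN_LENGTH_BP = 100_000                # 100 kbp threshold named in this card


def split_sequences(lengths):
    """Partition sequence names into (train, held_out) lists."""
    train, held_out = [], []
    for name, length in lengths.items():
        if name in HELD_OUT:
            held_out.append(name)
        elif length > MIN_LENGTH_BP:
            train.append(name)
    return train, held_out


# Made-up lengths, for illustration only (not real chromosome sizes).
toy_lengths = {"chr1": 5_000_000, "chr8": 4_000_000, "scaffold_x": 50_000}

train, held_out = split_sequences(toy_lengths)
print(train, held_out)  # ['chr1'] ['chr8']
```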


Method Overview

This model is the region model in the ModernGENA transcript discovery pipeline.

It predicts region-level probability tracks for:

  • intragenic regions
  • strand-specific transcriptional coverage

The full pipeline consists of:

  1. The edge model predicts transcript boundaries (TSS / PolyA)

  2. The region model (this model) predicts intragenic regions

  3. Post-processing

    • signal denoising
    • peak detection (from edge model)
    • interval construction
    • filtering using region predictions

The region model is used to:

  • validate predicted transcript intervals
  • remove false positives
  • improve reconstruction of full gene structures
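
One plausible way to validate intervals with region predictions is to keep only those whose mean same-strand intragenic probability is high enough. The sketch below illustrates that idea and is not the pipeline's actual filter: the function `filter_intervals`, the `min_mean` threshold of 0.5, and the toy tracks are all hypothetical.

```python
import numpy as np


def filter_intervals(intervals, intragenic_plus, intragenic_minus, min_mean=0.5):
    """Keep intervals whose mean same-strand intragenic probability is >= min_mean."""
    kept = []
    for start, end, strand in intervals:
        track = intragenic_plus if strand == "+" else intragenic_minus
        if end > start and float(np.mean(track[start:end])) >= min_mean:
            kept.append((start, end, strand))
    return kept


# Made-up probability tracks over 8 positions.
plus = np.array([0.1, 0.9, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1])
minus = np.zeros(8)

# Candidate intervals as (start, end, strand), end exclusive.
candidates = [(1, 4, "+"), (4, 8, "+"), (1, 4, "-")]

kept = filter_intervals(candidates, plus, minus)
print(kept)  # [(1, 4, '+')]
```

A mean-based criterion is only one option; a coverage fraction above a threshold would be an equally reasonable choice.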

Key Properties

  • strand-aware predictions
  • detects intragenic regions
  • supports both mRNA and lncRNA
  • trained on human genome
  • ab initio inference from DNA only
  • designed for genome-wide transcript discovery

Important Notes

  • This model predicts regions, not precise transcript boundaries
  • It is intended to be used together with the edge model
  • It does not predict exon–intron structure
  • It does not produce final transcript annotations by itself

The model outputs region probability tracks, which are used during post-processing to filter and validate transcript predictions.


Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-region-model"

# trust_remote_code=True allows custom code shipped in the repository to run
tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForTokenClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

# Switch to inference mode (disables dropout)
model.eval()

Example Inference

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-region-model"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # inference mode: disables dropout

# Toy input; real inputs are genomic DNA sequences
sequence = "ACGTACGTACGTACGTACGTACGTACGTACGT"

# Tokenize into BPE tokens and return PyTorch tensors
enc = tokenizer(sequence, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**enc)

# Per-token logits, one value per class
logits = outputs.logits

print("Input shape:", enc["input_ids"].shape)
print("Logits shape:", logits.shape)

Example output:

Input shape: torch.Size([1, sequence_length])
Logits shape: torch.Size([1, sequence_length, 6])

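Each token receives 6 logits, one per class in the order listed above. Here sequence_length is the number of tokens produced by the tokenizer, not the number of nucleotides in the input.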
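
Since the model performs multilabel classification, a natural way to turn logits into probabilities is an element-wise sigmoid, which treats each class independently (a softmax across classes would instead force them to compete). The sketch below uses NumPy on made-up logits; the function name `logits_to_probs` is hypothetical.

```python
import numpy as np


def logits_to_probs(logits):
    """Element-wise sigmoid: each of the 6 classes is an independent label."""
    logits = np.asarray(logits, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-logits))


# Made-up logits with shape (batch=1, tokens=1, classes=6).
toy_logits = np.array([[[0.0, -2.0, -3.0, -3.0, 4.0, -4.0]]])

probs = logits_to_probs(toy_logits)
print(probs.shape)  # (1, 1, 6)
```

With the `logits` tensor from the example above, `torch.sigmoid(logits)` computes the same quantity in PyTorch.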


Recommended Inference Workflow

To obtain transcript annotations:

  1. run the edge model to detect boundaries
  2. run the region model to detect intragenic regions
  3. convert logits to probabilities
  4. smooth the probability tracks
  5. detect peaks (TSS / PolyA) in the edge model tracks
  6. construct candidate transcript intervals
  7. filter intervals using the intragenic predictions
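
Smoothing, peak detection and interval construction can be sketched on toy forward-strand tracks as follows. This is a simplified illustration under stated assumptions, not the pipeline's actual post-processing: the moving-average smoother, the local-maximum peak rule, the pairing rule, and the parameter values (`window`, `threshold`, `max_len`) are all hypothetical choices.

```python
import numpy as np


def smooth(track, window=3):
    """Moving-average smoothing; mode='same' keeps the track length."""
    kernel = np.ones(window) / window
    return np.convolve(track, kernel, mode="same")


def find_peaks(track, threshold=0.5):
    """Indices at or above threshold that are local maxima."""
    peaks = []
    for i in range(1, len(track) - 1):
        if track[i] >= threshold and track[i] > track[i - 1] and track[i] >= track[i + 1]:
            peaks.append(i)
    return peaks


def build_intervals(tss_peaks, polya_peaks, max_len):
    """Forward strand: pair each TSS peak with downstream PolyA peaks within max_len."""
    return [(s, e, "+") for s in tss_peaks for e in polya_peaks if 0 < e - s <= max_len]


# Made-up TSS+ and PolyA+ probability tracks over 12 positions.
tss = np.zeros(12)
tss[1:4] = [0.6, 1.0, 0.6]
polya = np.zeros(12)
polya[8:11] = [0.6, 1.0, 0.6]

tss_peaks = find_peaks(smooth(tss))
polya_peaks = find_peaks(smooth(polya))
intervals = build_intervals(tss_peaks, polya_peaks, max_len=10)
print(intervals)  # [(2, 9, '+')]
```

Pairing each TSS peak with several downstream PolyA peaks yields multiple candidate intervals per gene, which the region model's intragenic predictions can then filter.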

Limitations

  • predicts regions, not full transcript structures
  • requires combination with the edge model
  • does not model exon–intron structure
  • output depends on downstream post-processing

Summary

The GENATATOR-ModernGENA region model is a ModernBERT-based DNA language model for strand-aware intragenic region detection in the human genome.

It complements the edge model and enables:

  • filtering of transcript predictions
  • improved gene structure reconstruction
  • genome-wide transcript discovery
  • multi-isoform detection through downstream processing