GENATATOR-ModernGENA (Human Region Model)

Overview

GENATATOR-ModernGENA (Region Model) is a DNA language model fine-tuned for intragenic region detection directly from genomic DNA sequences.

The model performs token-level multilabel classification to identify strand-aware genomic regions, complementing the edge model used for transcript boundary detection.

This model focuses on region-level signal detection, specifically:

identifying intragenic regions (gene bodies)
distinguishing them from intergenic background
modeling strand-specific transcriptional activity

Model

Model name on Hugging Face:

genatator-moderngena-base-human-region-model

Architecture properties:

backbone: ModernBERT (ModernGENA)
layers: 22
hidden size: 768
parameters: ~135M
tokenization: BPE
output head: linear projection to 6 classes
output resolution: token-level

The model predicts six classes.

The correct order of output classes is:

["TSS+", "TSS-", "PolyA+", "PolyA-", "Intragenic+", "Intragenic-"]

Where:

TSS+ — transcription start site on the forward strand
TSS- — transcription start site on the reverse strand
PolyA+ — transcript termination site on the forward strand
PolyA- — transcript termination site on the reverse strand
Intragenic+ — intragenic region (forward strand)
Intragenic- — intragenic region (reverse strand)
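
The class order matters when reading the output head. The sketch below maps one token's logit vector to named per-class probabilities. It assumes independent sigmoid activations, which is the usual choice for a multilabel head but is not stated explicitly in this card; the helper name `label_token_probs` is hypothetical.

```python
import math

# Class order as listed above; index i of the logit vector maps to CLASS_NAMES[i].
CLASS_NAMES = ["TSS+", "TSS-", "PolyA+", "PolyA-", "Intragenic+", "Intragenic-"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def label_token_probs(token_logits: list[float]) -> dict[str, float]:
    """Map one token's six logits to per-class probabilities (hypothetical helper).

    Assumption: classes are scored independently (sigmoid), not with a softmax.
    """
    if len(token_logits) != len(CLASS_NAMES):
        raise ValueError(f"expected {len(CLASS_NAMES)} logits, got {len(token_logits)}")
    return {name: sigmoid(v) for name, v in zip(CLASS_NAMES, token_logits)}


# Made-up logits for a single token, for illustration only.
probs = label_token_probs([-4.0, -5.0, -3.0, -6.0, 2.5, -2.0])
print(max(probs, key=probs.get))  # name of the highest-scoring class for this token
```

Because the classes are treated as independent here, the six probabilities do not need to sum to one, and a token can score highly for more than one class.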

Training Data

This model was fine-tuned on full genomic sequences, including intergenic regions.

Training data includes annotations for:

mRNA transcripts
lncRNA transcripts

Dataset characteristics:

human genome only
strand-aware annotations
genome-wide supervision
human chromosomes 8, 20, and 21 held out from training
only chromosomes longer than 100 kbp were included in training

The model is trained to distinguish transcribed regions from background genomic sequence across full chromosomes.
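
The selection rule described above (a length threshold plus three held-out chromosomes) can be sketched as a simple filter. This is only an illustration: the `chr`-prefixed naming, the helper name `select_training_chromosomes`, and the toy lengths are assumptions, not part of the released training code.

```python
# Held-out chromosomes and length threshold, taken from the dataset description.
# The "chr" prefix is an assumed naming convention.
HELD_OUT = {"chr8", "chr20", "chr21"}
MIN_LENGTH_BP = 100_000  # 100 kbp


def select_training_chromosomes(lengths: dict[str, int]) -> list[str]:
    """Return sequence names longer than the threshold that are not held out."""
    return [
        name
        for name, length in lengths.items()
        if length > MIN_LENGTH_BP and name not in HELD_OUT
    ]


# Toy lengths for illustration only; these are not real assembly values.
toy_lengths = {
    "chr1": 2_000_000,
    "chr8": 1_500_000,
    "chr21": 400_000,
    "scaffold_x": 50_000,
}
train_names = select_training_chromosomes(toy_lengths)
print(train_names)
```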

Method Overview

This model is the region model in the ModernGENA transcript discovery pipeline.

It predicts region-level probability tracks for:

intragenic regions
strand-specific transcriptional coverage

The full pipeline consists of:

Edge model predicts transcript boundaries (TSS / PolyA)
Region model (this model) predicts intragenic regions
Post-processing
- signal denoising
- peak detection (from edge model)
- interval construction
- filtering using region predictions
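
The card does not specify which denoising or peak-detection method the pipeline uses. As a minimal sketch of the first two post-processing steps, the code below substitutes a centered moving average for denoising and a thresholded local-maximum search for peak detection. Function names, the window size, and the threshold are all hypothetical.

```python
def moving_average(track: list[float], window: int = 5) -> list[float]:
    """Smooth a probability track with a centered moving average.

    Windows are truncated at the ends, so the output has the same length.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        smoothed.append(sum(track[lo:hi]) / (hi - lo))
    return smoothed


def find_peaks(track: list[float], threshold: float = 0.5) -> list[int]:
    """Return indices of local maxima whose value is at least `threshold`."""
    peaks = []
    for i, value in enumerate(track):
        left = track[i - 1] if i > 0 else float("-inf")
        right = track[i + 1] if i + 1 < len(track) else float("-inf")
        # Strict on the left, non-strict on the right, so a flat run is
        # reported at most once (at its first index).
        if value >= threshold and value > left and value >= right:
            peaks.append(i)
    return peaks


# Made-up edge-model track: a broad peak around index 4 and a one-position spike at index 9.
edge_track = [0.02, 0.05, 0.10, 0.85, 0.90, 0.80, 0.10, 0.04, 0.03, 0.70, 0.05, 0.02]
raw_peaks = find_peaks(edge_track)
smoothed_peaks = find_peaks(moving_average(edge_track, window=3))
print(raw_peaks, smoothed_peaks)
```

In this toy track, smoothing suppresses the isolated spike while keeping the broad peak, which is the kind of behavior a denoising step is meant to provide.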

The region model is used to:

validate predicted transcript intervals
remove false positives
improve reconstruction of full gene structures
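
One plausible way to validate an interval against the region track is to require a minimum mean intragenic probability over its span. The sketch below illustrates that idea only; the scoring rule, the `min_mean` cutoff, and the function names are assumptions, since the card does not describe the actual filtering criterion.

```python
def mean_coverage(track: list[float], start: int, end: int) -> float:
    """Mean intragenic probability over the half-open interval [start, end)."""
    if not 0 <= start < end <= len(track):
        raise ValueError("interval must lie inside the track and be non-empty")
    return sum(track[start:end]) / (end - start)


def filter_intervals(
    intervals: list[tuple[int, int]],
    intragenic_track: list[float],
    min_mean: float = 0.5,
) -> list[tuple[int, int]]:
    """Keep candidate intervals whose mean intragenic probability reaches `min_mean`."""
    return [
        (start, end)
        for start, end in intervals
        if mean_coverage(intragenic_track, start, end) >= min_mean
    ]


# Made-up forward-strand region track and two candidate intervals.
intragenic_plus = [0.1, 0.1, 0.8, 0.9, 0.9, 0.8, 0.1, 0.1, 0.2, 0.1]
candidates = [(2, 6), (6, 10)]
kept = filter_intervals(candidates, intragenic_plus)
print(kept)
```

Intervals on the reverse strand would be scored against the Intragenic- track in the same way.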

Key Properties

strand-aware predictions
detects intragenic regions
supports both mRNA and lncRNA
trained on human genome
ab initio inference from DNA only
designed for genome-wide transcript discovery

Important Notes

This model predicts regions, not precise transcript boundaries
It is intended to be used together with the edge model
It does not predict exon–intron structure
It does not produce final transcript annotations by itself

The model outputs region probability tracks, which are used during post-processing to filter and validate transcript predictions.

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-region-model"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForTokenClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model.eval()

Example Inference

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-region-model"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # inference mode (disables dropout), as in the Usage section

sequence = "ACGTACGTACGTACGTACGTACGTACGTACGT"

# Tokenize the DNA string; a BPE token can span several nucleotides
enc = tokenizer(sequence, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**enc)

# Shape: (batch, tokens, 6 classes)
logits = outputs.logits

print("Input shape:", enc["input_ids"].shape)
print("Logits shape:", logits.shape)

Example output (sequence_length is the number of tokens, not the number of nucleotides):

Input shape: torch.Size([1, sequence_length])
Logits shape: torch.Size([1, sequence_length, 6])

Each token receives 6 logits corresponding to the six region and boundary classes.
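
Since the output is token-level and BPE tokens cover a variable number of bases, downstream steps that work in genomic coordinates need a token-to-nucleotide mapping. The sketch below repeats each token's probability over the bases it covers. How token lengths are obtained (for example from tokenizer offsets), and whether special tokens are present, depends on the tokenizer and is assumed here; `expand_to_nucleotides` is a hypothetical helper.

```python
def expand_to_nucleotides(token_probs: list[float], token_lengths: list[int]) -> list[float]:
    """Repeat each token's probability once per nucleotide the token covers.

    Tokens with length 0 (for example special tokens) contribute nothing.
    """
    if len(token_probs) != len(token_lengths):
        raise ValueError("token_probs and token_lengths must have the same length")
    per_base = []
    for prob, length in zip(token_probs, token_lengths):
        per_base.extend([prob] * length)
    return per_base


# Hypothetical example: a special token, tokens covering 3, 5, and 2 bases,
# then another special token.
token_probs = [0.0, 0.2, 0.9, 0.4, 0.0]
token_lengths = [0, 3, 5, 2, 0]
per_base = expand_to_nucleotides(token_probs, token_lengths)
print(len(per_base))  # one value per nucleotide
```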

Recommended Inference Workflow

To obtain transcript annotations:

run edge model to detect boundaries
run region model to detect intragenic regions
convert logits to probabilities
apply smoothing
detect peaks (TSS / PolyA)
construct candidate transcript intervals
filter intervals using intragenic predictions
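
The interval-construction step can be illustrated with a strand-aware pairing of boundary peaks. The sketch below pairs every TSS peak with every PolyA peak downstream of it, up to a maximum length; this all-pairs rule and the `max_length` cap are illustrative assumptions, not the pipeline's documented behavior.

```python
def build_intervals(
    tss_peaks: list[int],
    polya_peaks: list[int],
    strand: str,
    max_length: int = 1_000_000,
) -> list[tuple[int, int]]:
    """Pair each TSS peak with every PolyA peak downstream of it on `strand`.

    Returns (start, end) tuples with start < end in forward-strand coordinates.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    intervals = []
    for tss in tss_peaks:
        for polya in polya_peaks:
            # Downstream means a larger coordinate on "+", a smaller one on "-".
            start, end = (tss, polya) if strand == "+" else (polya, tss)
            if start < end and end - start <= max_length:
                intervals.append((start, end))
    return sorted(intervals)


# Made-up peak positions for illustration.
tss = [100, 5000]
polya = [3000, 9000]
forward = build_intervals(tss, polya, "+")
reverse = build_intervals(tss, polya, "-")
print(forward, reverse)
```

Allowing one TSS to pair with several PolyA sites yields overlapping candidates, which leaves the choice among them to the later filtering step.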

Limitations

predicts regions, not full transcript structures
requires combination with edge model
does not model exon–intron structure
output depends on downstream post-processing

Summary

The GENATATOR-ModernGENA region model is a ModernBERT-based DNA language model for strand-aware intragenic region detection in the human genome.

It complements the edge model and enables:

filtering of transcript predictions
improved gene structure reconstruction
genome-wide transcript discovery
multi-isoform detection through downstream processing
