GENATATOR-ModernGENA (Human Region Model)
Overview
GENATATOR-ModernGENA (Region Model) is a DNA language model fine-tuned for intragenic region detection directly from genomic DNA sequences.
The model performs token-level multilabel classification to identify strand-aware genomic regions, complementing the edge model used for transcript boundary detection.
This model focuses on region-level signal detection, specifically:
- identifying intragenic regions (gene bodies)
- distinguishing them from intergenic background
- modeling strand-specific transcriptional activity
Model
Model name on Hugging Face:
genatator-moderngena-base-human-region-model
Architecture properties:
- backbone: ModernBERT (ModernGENA)
- layers: 22
- hidden size: 768
- parameters: ~135M
- tokenization: BPE
- output head: linear projection to 6 classes
- output resolution: token-level
The model predicts six classes.
The correct order of output classes is:
["TSS+", "TSS-", "PolyA+", "PolyA-", "Intragenic+", "Intragenic-"]
Where:
- TSS+: transcription start site on the forward strand
- TSS-: transcription start site on the reverse strand
- PolyA+: transcript termination site on the forward strand
- PolyA-: transcript termination site on the reverse strand
- Intragenic+: intragenic region (forward strand)
- Intragenic-: intragenic region (reverse strand)
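As a minimal sketch of how the class order can be applied, the snippet below names the six logits of a single token and converts them to probabilities. It assumes the classes are independent (the card describes the task as multilabel), so each logit gets its own sigmoid rather than a softmax across classes. The helper names and the example logits are hypothetical and are not part of the released code.

```python
import math

# Class order as stated in this model card.
CLASS_NAMES = ["TSS+", "TSS-", "PolyA+", "PolyA-", "Intragenic+", "Intragenic-"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def name_token_probs(token_logits):
    """Map one token's 6 logits to {class name: probability}.

    Assumption: multilabel output, so each class is scored independently
    with a sigmoid instead of a softmax across classes.
    """
    if len(token_logits) != len(CLASS_NAMES):
        raise ValueError("expected 6 logits per token")
    return {name: sigmoid(v) for name, v in zip(CLASS_NAMES, token_logits)}


# Hypothetical logits for a single token, for illustration only.
example = name_token_probs([-4.0, -5.0, -3.0, -6.0, 2.0, -2.5])
for name, prob in example.items():
    print(f"{name:12s} {prob:.3f}")
```

In practice the same mapping would be applied along the last axis of the logits tensor for every token.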
Training Data
This model was fine-tuned on full genomic sequences, including intergenic regions.
Training data includes annotations for:
- mRNA transcripts
- lncRNA transcripts
Dataset characteristics:
- human genome only
- strand-aware annotations
- genome-wide supervision
- human chromosomes 8, 20, and 21 held out
- only chromosomes longer than 100 kbp were included in training
The model is trained to distinguish transcribed regions from background genomic sequence across full chromosomes.
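The split described above can be sketched as a simple filter over chromosome names and lengths. This is an illustration only: the `chr`-prefixed naming, the function name, and the toy lengths are assumptions, and the actual data preparation code is not shown in this card.

```python
# Held-out chromosomes and length threshold as stated in this card.
# The "chr" prefix is an assumed naming convention.
HELD_OUT = {"chr8", "chr20", "chr21"}
MIN_TRAIN_LENGTH = 100_000  # 100 kbp


def split_chromosomes(lengths):
    """Split {name: length_bp} into (train, held_out) lists of names."""
    train, held_out = [], []
    for name, length in sorted(lengths.items()):
        if name in HELD_OUT:
            held_out.append(name)
        elif length > MIN_TRAIN_LENGTH:
            train.append(name)
        # Sequences at or below the threshold are skipped entirely.
    return train, held_out


# Toy lengths for illustration; these are not real assembly values.
toy_lengths = {
    "chr1": 5_000_000,
    "chr8": 3_000_000,
    "chr20": 1_500_000,
    "chr21": 1_200_000,
    "scaffold_x": 50_000,  # hypothetical short sequence, dropped
}
train, held_out = split_chromosomes(toy_lengths)
print("train:", train)
print("held out:", held_out)
```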
Method Overview
This model is the region model in the ModernGENA transcript discovery pipeline.
It predicts region-level probability tracks for:
- intragenic regions
- strand-specific transcriptional coverage
The full pipeline consists of:
1. Edge model: predicts transcript boundaries (TSS / PolyA)
2. Region model (this model): predicts intragenic regions
3. Post-processing:
   - signal denoising
   - peak detection (from edge model)
   - interval construction
   - filtering using region predictions
The region model is used to:
- validate predicted transcript intervals
- remove false positives
- improve reconstruction of full gene structures
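One plausible way to use region predictions for validation is to keep a candidate interval only when the intragenic probability averaged over it is high enough. The sketch below assumes a per-position probability track for the matching strand; the function names and the 0.5 threshold are illustrative choices, not values taken from the pipeline.

```python
def mean_coverage(track, start, end):
    """Mean probability of `track` over the half-open interval [start, end)."""
    window = track[start:end]
    return sum(window) / len(window) if window else 0.0


def filter_intervals(intervals, intragenic_track, min_mean=0.5):
    """Keep candidate intervals whose mean intragenic probability is high enough.

    `min_mean` is an illustrative threshold, not a value from the pipeline.
    """
    return [
        (start, end)
        for start, end in intervals
        if mean_coverage(intragenic_track, start, end) >= min_mean
    ]


# Toy same-strand intragenic probability track (one value per position).
track = [0.1, 0.1, 0.9, 0.9, 0.8, 0.9, 0.1, 0.1, 0.2, 0.1]
candidates = [(2, 6), (6, 10)]
kept = filter_intervals(candidates, track)
print(kept)
```

Here the first candidate lies inside a well-supported region and is kept, while the second overlaps only background and is discarded as a likely false positive.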
Key Properties
- strand-aware predictions
- detects intragenic regions
- supports both mRNA and lncRNA
- trained on human genome
- ab initio inference from DNA only
- designed for genome-wide transcript discovery
Important Notes
- This model predicts regions, not precise transcript boundaries
- It is intended to be used together with the edge model
- It does not predict exon–intron structure
- It does not produce final transcript annotations by itself
The model outputs region probability tracks, which are used during post-processing to filter and validate transcript predictions.
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-region-model"

# trust_remote_code=True allows custom code from the repository to be executed
tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
model = AutoModelForTokenClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
model.eval()  # inference mode: disables dropout
Example Inference
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-region-model"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # inference mode: disables dropout

sequence = "ACGTACGTACGTACGTACGTACGTACGTACGT"
enc = tokenizer(sequence, return_tensors="pt")

# No gradients are needed for inference
with torch.no_grad():
    outputs = model(**enc)

logits = outputs.logits  # shape: (batch, tokens, 6)
print("Input shape:", enc["input_ids"].shape)
print("Logits shape:", logits.shape)
Example output:
Input shape: torch.Size([1, sequence_length])
Logits shape: torch.Size([1, sequence_length, 6])
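Each token receives 6 logits, one per class in the order listed above. Because tokenization is BPE, the sequence dimension counts tokens rather than nucleotides, so it is generally shorter than the input DNA sequence.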
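To compare predictions with genomic coordinates, token-level values have to be projected back onto bases. The sketch below spreads each token's probability over the characters it covers, given per-token (start, end) spans. In `transformers`, such spans can be requested from fast tokenizers with `return_offsets_mapping=True`; whether this model's tokenizer supports that option is an assumption, and the offsets used here are hypothetical.

```python
def expand_to_nucleotides(token_probs, offsets, seq_len, fill=0.0):
    """Spread one per-token probability over the bases that token covers.

    token_probs: one probability per token (for a single class).
    offsets: (start, end) character span of each token, end exclusive;
             special tokens are assumed to carry an empty span such as (0, 0).
    """
    track = [fill] * seq_len
    for prob, (start, end) in zip(token_probs, offsets):
        for pos in range(start, min(end, seq_len)):
            track[pos] = prob
    return track


# Hypothetical tokenization of a 10-base sequence into 3 BPE tokens,
# surrounded by two special tokens with empty spans.
offsets = [(0, 0), (0, 4), (4, 7), (7, 10), (0, 0)]
probs = [0.0, 0.2, 0.9, 0.4, 0.0]
nucleotide_track = expand_to_nucleotides(probs, offsets, seq_len=10)
print(nucleotide_track)
```

Repeating a token's value across its span is the simplest choice; interpolation between token centers would be an alternative if smoother tracks are needed.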
Recommended Inference Workflow
To obtain transcript annotations:
- run edge model to detect boundaries
- run region model to detect intragenic regions
- convert logits to probabilities
- apply smoothing
- detect peaks (TSS / PolyA)
- construct candidate transcript intervals
- filter intervals using intragenic predictions
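The smoothing, peak detection, and interval construction steps can be sketched as follows for the forward strand. The card does not specify the denoising or peak-calling methods, so a centered moving average and a thresholded local-maximum search are used here as stand-ins; all function names, thresholds, and the maximum interval length are assumptions.

```python
def moving_average(track, window=3):
    """Centered moving average; edges use the samples that are available."""
    half = window // 2
    out = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        out.append(sum(track[lo:hi]) / (hi - lo))
    return out


def find_peaks(track, threshold=0.3):
    """Indices of local maxima at or above `threshold`."""
    peaks = []
    for i, value in enumerate(track):
        left = track[i - 1] if i > 0 else float("-inf")
        right = track[i + 1] if i + 1 < len(track) else float("-inf")
        if value >= threshold and value > left and value >= right:
            peaks.append(i)
    return peaks


def build_intervals(tss_peaks, polya_peaks, max_len=1_000_000):
    """Pair each forward-strand TSS peak with every downstream PolyA peak."""
    return [
        (tss, polya)
        for tss in tss_peaks
        for polya in polya_peaks
        if 0 < polya - tss <= max_len
    ]


# Toy forward-strand probability tracks, one value per position.
tss_track = [0.0, 0.1, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
polya_track = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.95, 0.1, 0.0]

tss_peaks = find_peaks(moving_average(tss_track))
polya_peaks = find_peaks(moving_average(polya_track))
intervals = build_intervals(tss_peaks, polya_peaks)
print(intervals)
```

Pairing every TSS with every downstream PolyA site yields multiple candidate intervals per locus, which is one way overlapping isoforms could be proposed before filtering. On the reverse strand the order is mirrored, with the PolyA peak preceding the TSS in genomic coordinates.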
Limitations
- predicts regions, not full transcript structures
- requires combination with edge model
- does not model exon–intron structure
- output depends on downstream post-processing
Summary
The GENATATOR-ModernGENA region model is a ModernBERT-based DNA language model for strand-aware intragenic region detection in the human genome.
It complements the edge model and enables:
- filtering of transcript predictions
- improved gene structure reconstruction
- genome-wide transcript discovery
- multi-isoform detection through downstream processing