---
license: other
tags:
  - rna
  - gquad
  - g-quadruplex
  - transformer
  - genomics
  - rna-biology
library_name: transformers
extra_gated_fields:
  I agree to use this model for non-commercial use ONLY: checkbox
---

# G4mer

**G4mer** is a transformer-based RNA foundation model that identifies RNA G-quadruplexes (rG4s) from sequence input. It is fine-tuned from mRNAbert (`Biociphers/mRNAbert`).

## Disclaimer

This is the official implementation of the **G4mer** model as described in the manuscript:

> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

## Model Details

G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:

- **Binary classification**: whether a 70-nt sequence region forms an rG4 structure

All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.
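
To make the overlapping 6-mer scheme concrete, here is a minimal sketch in plain Python. It mirrors the `to_kmers` helper from the usage examples and assumes "overlapping" means a stride of 1; `count_kmers` is a hypothetical helper added only for illustration.

```python
def to_kmers(seq, k=6):
    """Split a sequence into overlapping k-mers (stride 1), space-separated."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(seq_len, k=6):
    """Number of overlapping k-mers in a sequence of length seq_len (hypothetical helper)."""
    return max(seq_len - k + 1, 0)


print(to_kmers("GGGAGGGC"))  # GGGAGG GGAGGG GAGGGC
print(count_kmers(70))       # 65
```

Under this scheme a full 70-nt window yields 65 6-mers, not counting any special tokens the tokenizer may add.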

### Variants

| Model                         | Task                       | Size |
|-------------------------------|----------------------------|------|
| `Biociphers/g4mer`            | rG4 binary classification  | ~46M |
| `Biociphers/g4mer-subtype`    | rG4 subtype classification | ~46M |
| `Biociphers/g4mer-regression` | rG4 strength               | ~46M |

## Usage

### Binary rG4 Prediction

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the binary rG4 classifier
tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()  # disable dropout for inference

sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"  # 50 nt; inputs may be up to 70 nt

def to_kmers(seq, k=6):
    """Split a sequence into space-separated overlapping k-mers."""
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

kmer_sequence = to_kmers(sequence, k=6)  # Convert to 6-mers
inputs = tokenizer(kmer_sequence, return_tensors="pt")
with torch.no_grad():  # no gradients needed for prediction
    logits = model(**inputs).logits

# Class 1 = rG4-forming
rG4_probability = torch.softmax(logits, dim=1)[:, 1].item()
print(rG4_probability)
```

G4mer was trained on sequences of at most 70 nt. For longer sequences, we recommend scanning the input with a sliding 70-nt window and taking the maximum rG4 score across all windows.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()

# Define k-mer function
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

# Define a long sequence (must contain only A/C/G/T)
sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA" * 2  # 100 nt

# Slide 70nt window with stride 1
window_size = 70
stride = 1
windows = [sequence[i:i+window_size] for i in range(0, len(sequence) - window_size + 1, stride)]

# Score each window using G4mer
scores = []
for w in windows:
    kmer_seq = to_kmers(w, k=6)
    tokens = tokenizer(kmer_seq, return_tensors="pt")
    with torch.no_grad():
        output = model(**tokens)
        prob = torch.nn.functional.softmax(output.logits, dim=-1)
        scores.append(prob[0][1].item())  # class 1 = rG4-forming

# Final rG4 score for the long sequence
max_score = max(scores)
print(f"Max rG4 score across windows: {max_score:.3f}")
```
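
The windowing logic can also be separated from the model, which makes it easy to check without downloading any weights. The sketch below is one possible arrangement, not part of the G4mer code base: `normalize_sequence`, `make_windows`, and `scan_max` are hypothetical helpers, the U-to-T conversion is an assumption based on the A/C/G/T requirement noted above, and `g_fraction` is a toy stand-in scorer rather than G4mer itself.

```python
def normalize_sequence(seq):
    """Uppercase the input and map U to T (assumption: the tokenizer expects A/C/G/T)."""
    seq = seq.strip().upper().replace("U", "T")
    if not seq:
        raise ValueError("Empty sequence")
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"Unsupported characters: {sorted(invalid)}")
    return seq


def make_windows(seq, window_size=70, stride=1):
    """Return (start, window) pairs; a sequence no longer than window_size yields one window."""
    if len(seq) <= window_size:
        return [(0, seq)]
    return [(i, seq[i:i + window_size])
            for i in range(0, len(seq) - window_size + 1, stride)]


def scan_max(seq, score_fn, window_size=70, stride=1):
    """Score every window with score_fn and return (best_score, best_start)."""
    windows = make_windows(normalize_sequence(seq), window_size, stride)
    scored = [(score_fn(w), start) for start, w in windows]
    return max(scored, key=lambda pair: pair[0])


# Toy stand-in scorer (NOT G4mer): fraction of G in the window.
def g_fraction(window):
    return window.count("G") / len(window)


best_score, best_start = scan_max("A" * 40 + "G" * 70 + "A" * 40, g_fraction)
print(best_score, best_start)  # 1.0 40
```

To use this with the real model, `score_fn` would wrap the k-mer conversion, tokenizer call, and softmax from the example above and return the class-1 probability for one window. Returning the start position alongside the score shows where in the long sequence the strongest predicted rG4 lies. With a stride greater than 1, the last few nucleotides may not be covered by any window.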

## Web Tool

You can explore G4mer predictions interactively through our web tool:

**[G4mer Web Tool](https://tools.biociphers.org/g4mer)**

Features include:
- **RNA sequence prediction**: runs `G4mer` on GPU to compute the probability that a sequence forms an rG4
- **Transcriptome-wide prediction** of rG4s and subtypes
- **Variant effect annotation** using gnomAD SNVs
- **Search and filter** by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation is needed; just visit the site and start exploring.

## Citation - MLA

```
Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
```

## Contact

For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).