---
license: other
tags:
  - rna
  - gquad
  - g-quadruplex
  - transformer
  - genomics
  - rna-biology
library_name: transformers
extra_gated_fields:
  I agree to use this model for non-commercial use ONLY: checkbox
---

# G4mer Subtype

**G4mer-Subtype** is a transformer-based RNA language model that predicts RNA G-quadruplex (rG4) **subtypes** from sequence input. It is fine-tuned from [`Biociphers/mRNAbert`](https://huggingface.co/Biociphers/mRNAbert) and trained on 70-nt sequences labeled with experimentally derived rG4 subtype categories.

## Disclaimer

This is the official subtype classification model from the **G4mer** framework as described in the manuscript:

> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

## Model Details

G4mer-Subtype is trained to classify each 70-nt RNA sequence into one of **eight rG4 subtypes**, each representing a distinct sequence/structure motif observed in experimental rG4 data.

### Subtype Mapping

| Class Index | Subtype Description                     |
|-------------|------------------------------------------|
| 0           | G≥40%                                    |
| 1           | Unknown                                  |
| 2           | Bulges                                   |
| 3           | Canonical                                |
| 4           | Long loop                                |
| 5           | Potential G-quadruplex & G≥40%           |
| 6           | Potential G-triplex & G≥40%              |
| 7           | Two-quartet                              |

All models use overlapping 6-mer tokenization and were fine-tuned on human transcriptome-derived sequences with subtype labels.
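
To make the overlapping 6-mer scheme concrete, the sketch below tokenizes a sequence with stride 1, so a sequence of length L yields L - 6 + 1 tokens. The helper name `kmer_tokens` and the synthetic 70-nt sequence are illustrative assumptions, not part of the G4mer codebase.

```python
def kmer_tokens(seq, k=6):
    """Return the overlapping k-mers of seq (stride 1)."""
    if len(seq) < k:
        # Too short to form a single k-mer
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Synthetic 70-nt sequence used only for illustration
window = ("GGGA" * 18)[:70]
tokens = kmer_tokens(window)

print(len(window), len(tokens))  # 70 65
print(" ".join(tokens[:3]))      # space-separated form passed to the tokenizer
```

A full 70-nt window therefore becomes 65 tokens before any special tokens the tokenizer may add.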

### Variants

| Model                                | Task                        | Size   |
|--------------------------------------|-----------------------------|--------|
| `Biociphers/g4mer`                   | rG4 binary classification   | ~46M   |
| `Biociphers/g4mer-subtype`           | rG4 subtype classification  | ~46M   |
| `Biociphers/g4mer-regression`        | rG4 strength (score)        | ~46M   |

## Usage

### Predict rG4 Subtypes

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load binary rG4 model and tokenizer
binary_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer")
binary_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer")
binary_model.eval()

# Load subtype model and tokenizer
subtype_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer-subtype")
subtype_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer-subtype")
subtype_model.eval()

# Input sequence (max 70 nt)
sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"

# Convert to space-separated 6-mers
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

kmer_sequence = to_kmers(sequence)

# Predict the rG4 probability with the binary model
# (class index 1 is read as the rG4-forming class)
binary_inputs = binary_tokenizer(kmer_sequence, return_tensors="pt")
with torch.no_grad():
    binary_output = binary_model(**binary_inputs)
    rG4_prob = torch.nn.functional.softmax(binary_output.logits, dim=-1)[0][1].item()

# Classify the subtype only when the binary model is reasonably confident.
# The 0.7 threshold is a moderate-confidence cutoff; adjust it as needed.
if rG4_prob > 0.7:
    subtype_inputs = subtype_tokenizer(kmer_sequence, return_tensors="pt")
    with torch.no_grad():
        subtype_output = subtype_model(**subtype_inputs)
        subtype_probs = torch.nn.functional.softmax(subtype_output.logits, dim=-1)
        predicted_class = torch.argmax(subtype_probs, dim=-1).item()

    subtype_mapping = {
        0: "G≥40%",
        1: "Unknown",
        2: "Bulges",
        3: "Canonical",
        4: "Long loop",
        5: "Potential G-quadruplex & G≥40%",
        6: "Potential G-triplex & G≥40%",
        7: "Two-quartet"
    }
    print(f"Predicted subtype: {subtype_mapping[predicted_class]}")
else:
    print(f"Not a confident rG4 (score = {rG4_prob:.2f}); skipping subtype classification.")
```
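
The example above reports only the top class. When several subtypes have similar probabilities, it can help to look at a ranked list instead. The following is a minimal sketch, assuming the eight probabilities are given in class-index order; `rank_subtypes` is a hypothetical helper and the probability values are made up for illustration.

```python
# Subtype names in class-index order, as in the Subtype Mapping table
SUBTYPE_NAMES = [
    "G≥40%",
    "Unknown",
    "Bulges",
    "Canonical",
    "Long loop",
    "Potential G-quadruplex & G≥40%",
    "Potential G-triplex & G≥40%",
    "Two-quartet",
]

def rank_subtypes(probs, top_k=3):
    """Return the top_k (name, probability) pairs, highest probability first."""
    if len(probs) != len(SUBTYPE_NAMES):
        raise ValueError(f"expected {len(SUBTYPE_NAMES)} probabilities, got {len(probs)}")
    ranked = sorted(zip(SUBTYPE_NAMES, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Made-up probabilities, for illustration only
example_probs = [0.05, 0.02, 0.10, 0.60, 0.08, 0.07, 0.03, 0.05]
for name, prob in rank_subtypes(example_probs):
    print(f"{name}: {prob:.2f}")
```

With the variables from the example above, the input would be `subtype_probs[0].tolist()`.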

## Training data

The model was trained on experimentally validated rG4 regions annotated with subtype labels based on loop lengths, bulges, guanine content, and overall folding potential.
Each 70-nt training window was associated with one of the eight subtype labels shown above.
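
Because training used 70-nt windows, longer sequences have to be split before prediction. The sketch below tiles a sequence into overlapping 70-nt windows. The function name `tile_windows` and the stride of 10 are assumptions made for illustration; they are not taken from the manuscript.

```python
def tile_windows(seq, window=70, stride=10):
    """Split seq into (start, subsequence) windows of at most `window` nt."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(seq) <= window:
        # Short inputs are returned as a single window
        return [(0, seq)]
    starts = list(range(0, len(seq) - window + 1, stride))
    last_start = len(seq) - window
    if starts[-1] != last_start:
        # Add a final window so the 3' end is always covered
        starts.append(last_start)
    return [(s, seq[s:s + window]) for s in starts]

# Synthetic 95-nt sequence used only for illustration
transcript = ("GGGAC" * 19)[:95]
for start, sub in tile_windows(transcript):
    print(start, len(sub))
```

Each window can then be converted to 6-mers and scored independently; how per-window predictions are combined is left to the user.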

## Intended use

G4mer-Subtype is intended for researchers studying:

- RNA G-quadruplex structural diversity  
- Subtype-specific regulatory roles in the transcriptome  
- Effects of sequence variation on rG4 formation patterns
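
For the last use case, one simple approach is to score a reference window and the same window carrying a single-nucleotide variant, then compare the predictions. The sketch below covers only the sequence manipulation. `apply_snv` is a hypothetical helper with 0-based positions, and the example sequence is synthetic.

```python
def apply_snv(seq, pos, ref, alt):
    """Return seq with the base at 0-based position pos changed from ref to alt."""
    if not 0 <= pos < len(seq):
        raise ValueError(f"position {pos} is outside the sequence")
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]

# Synthetic example: disrupt the first G-run
ref_seq = "GGGAGGGAGGGAGGG"
alt_seq = apply_snv(ref_seq, 1, "G", "A")
print(ref_seq)
print(alt_seq)
```

Both sequences can then be passed through the prediction code in the Usage section to compare rG4 probabilities and predicted subtypes.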

## Web Tool

You can explore G4mer predictions interactively through our web tool:

**[G4mer Web Tool](https://tools.biociphers.org/g4mer)**

Features include:
- **RNA sequence prediction** that runs `G4mer` on a GPU to compute the probability of rG4 formation
- **Transcriptome-wide prediction** of rG4s and subtypes
- **Variant effect annotation** using gnomAD SNVs
- **Search and filter** by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation is needed: just visit the site and start exploring.

## Citation - MLA

```
Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
```

## Contact

For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).