# SuperSheikh Multimodal Model

A state-of-the-art multimodal language model that combines text, image, and audio understanding capabilities with an extended context window of 200,000 tokens.

## Model Description

SuperSheikh is a transformer-based multimodal model designed for:

- **Long-context understanding**: Supports up to 200,000 tokens
- **Text processing**: Advanced natural language understanding and generation
- **Image understanding**: Visual question answering and image captioning
- **Audio processing**: Speech recognition and audio understanding
- **Multimodal reasoning**: Combining information from multiple modalities

## Architecture

- **Base Model**: Transformer decoder with 32 layers
- **Hidden Size**: 4096 dimensions
- **Attention Heads**: 32 heads
- **Context Length**: 200,000 tokens
- **Vision Module**: 24-layer vision transformer with 1024 hidden size
- **Audio Module**: 12-layer audio transformer with 768 hidden size

## Installation

```bash
pip install transformers torch tokenizers safetensors accelerate
```

Or install from `requirements.txt`:

```bash
pip install -r requirements.txt
```

## Usage

### Download Model Weights

The model weights (`sheikh.safetensors`) are too large for direct GitHub hosting.
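For a sense of the sizes involved, here is a back-of-envelope estimate of the decoder parameter count implied by the Architecture numbers above (a sketch only: the 4x MLP expansion is an assumption this README does not state, and embeddings plus the vision/audio modules are excluded):

```python
# Rough decoder size from the Architecture section (32 layers, 4096 hidden).
# Assumption: a conventional 4x MLP expansion; embeddings and the
# vision/audio modules are not counted.
layers, hidden = 32, 4096
attn_params = 4 * hidden * hidden        # Q, K, V and output projections
mlp_params = 2 * hidden * (4 * hidden)   # up- and down-projections
total = layers * (attn_params + mlp_params)
print(f"~{total / 1e9:.1f}B parameters, ~{2 * total / 1e9:.0f} GB in fp16")
# → ~6.4B parameters, ~13 GB in fp16
```

A checkpoint in that range is well beyond GitHub's per-file limits, which is why the weights are hosted on the Hub instead.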
Download them from the Hugging Face Hub:

```bash
wget --content-disposition "https://huggingface.co/codedwithlikhon/super-sheikh/resolve/main/sheikh.safetensors"
```

Or use the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codedwithlikhon/super-sheikh")
model = AutoModelForCausalLM.from_pretrained("codedwithlikhon/super-sheikh", trust_remote_code=True)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0])
```

### Multimodal Processing

```python
from transformers import SuperSheikhProcessor
from PIL import Image

processor = SuperSheikhProcessor.from_pretrained("path/to/super-sheikh")

# Process text and image together
text = "Describe this image"
image = Image.open("image.jpg")
inputs = processor(text=text, images=image, return_tensors="pt")
```

## Features

- **Long Context**: Extended context window for processing large documents
- **Multimodal**: Supports text, image, and audio inputs
- **Efficient**: Optimized for both training and inference
- **Flexible**: Customizable for various downstream tasks

## Training

The model was trained on a diverse dataset including:

- Text corpora from books, articles, and web content
- Image-text pairs from various vision-language datasets
- Audio-text pairs from speech recognition datasets

### Tokenizer Training

You can train a custom BPE tokenizer for SuperSheikh:

```python
from tokenizer_super_sheikh import SuperSheikhTokenizer

# Train tokenizer from dataset
tokenizer = SuperSheikhTokenizer.train_from_iterator(
    text_iterator,
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<|startoftext|>", "<|endoftext|>", "", ""]
)

# Save tokenizer files
tokenizer.save_pretrained("path/to/save/directory")
```

### Model Saving

The model supports the safetensors format for efficient storage:

```python
# Save model with safetensors format
model.save_pretrained(
    "path/to/save/directory",
    safe_serialization=True,
    max_shard_size="10GB"
)
```

This automatically generates:

- `model.safetensors` (or sharded files)
- `model.safetensors.index.json` (for sharded models)
- `config.json`
- `generation_config.json` (if present)
- `chat_template.jinja` (if present, for instruction-tuned models)

### Supported File Formats

The tokenizer implementation generates standard tokenizer files:

- `tokenizer.json` - Main tokenizer file
- `vocab.json` - Vocabulary mapping
- `merges.txt` - BPE merges
- `tokenizer_config.json` - Tokenizer configuration
- `special_tokens_map.json` - Special tokens mapping
- `added_tokens.json` - Additional tokens (if any)

## Automated Deployment

This repository includes automated deployment to the Hugging Face Hub via GitHub Actions.

### Setup

1. **Fork or clone** this repository to your GitHub account
2. **Set up a Hugging Face token**:
   - Go to [Hugging Face Settings > Access Tokens](https://huggingface.co/settings/tokens)
   - Create a new token with "Write" permissions
   - Add it to your GitHub repository secrets as `HF_TOKEN`
3.
   **Push to the main branch** or use manual workflow dispatch

### Workflow Features

- **Automatic deployment**: Triggers on pushes to the `main` branch
- **Manual deployment**: Can be triggered manually from the GitHub Actions UI
- **Complete model upload**: Automatically uploads all model files, including:
  - Model weights (`*.safetensors`)
  - Tokenizer files (`tokenizer.json`, `vocab.json`, `merges.txt`)
  - Configuration files (`config.json`, `tokenizer_config.json`)
  - Chat template (`chat_template.jinja`)
  - Special tokens and additional metadata

### Repository Links

- **GitHub**: [https://github.com/codedwithlikhon/super-sheikh](https://github.com/codedwithlikhon/super-sheikh)
- **Hugging Face**: [https://huggingface.co/codedwithlikhon/super-sheikh](https://huggingface.co/codedwithlikhon/super-sheikh)

The model is automatically available on the Hugging Face Hub after a successful deployment.

## Limitations

- Requires significant computational resources
- The large model size may not be suitable for all deployment scenarios
- Performance may vary depending on input quality and domain

## License

This model is released under the MIT License.

## Citation

If you use SuperSheikh in your research, please cite:

```
@misc{super-sheikh-2024,
  title={SuperSheikh: A Multimodal Long-Context Language Model},
  author={SuperSheikh Team},
  year={2024},
  url={https://github.com/codedwithlikhon/super-sheikh}
}
```

## Contact

For questions or support, please open an issue on our GitHub repository.
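As a reference for the Automated Deployment section above, the workflow it describes could look roughly like the following. This is a hypothetical sketch, not the repository's actual workflow file: the file name, action versions, and the upload command are all assumptions.

```yaml
# .github/workflows/deploy.yml (hypothetical sketch)
name: Deploy to Hugging Face Hub

on:
  push:
    branches: [main]
  workflow_dispatch:        # enables manual runs from the Actions UI

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Upload model files to the Hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}   # the repository secret set up above
        run: |
          pip install "huggingface_hub[cli]"
          huggingface-cli upload codedwithlikhon/super-sheikh . . \
            --include "*.safetensors" "*.json" "*.txt" "*.jinja" \
            --token "$HF_TOKEN"
```

The `--include` patterns mirror the file list under Workflow Features, so only model, tokenizer, configuration, and template files are uploaded.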