# SuperSheikh Multimodal Model

A state-of-the-art multimodal language model that combines text, image, and audio understanding capabilities with an extended context window of 200,000 tokens.

## Model Description

SuperSheikh is a transformer-based multimodal model designed for:

- **Long-context understanding**: Supports up to 200,000 tokens
- **Text processing**: Advanced natural language understanding and generation
- **Image understanding**: Visual question answering and image captioning
- **Audio processing**: Speech recognition and audio understanding
- **Multimodal reasoning**: Combining information from multiple modalities

## Architecture

- **Base Model**: Transformer decoder with 32 layers
- **Hidden Size**: 4096 dimensions
- **Attention Heads**: 32 heads
- **Context Length**: 200,000 tokens
- **Vision Module**: 24-layer vision transformer with 1024 hidden size
- **Audio Module**: 12-layer audio transformer with 768 hidden size

## Installation

```bash
pip install transformers torch tokenizers safetensors accelerate
```

Or install from `requirements.txt`:

```bash
pip install -r requirements.txt
```

## Usage

### Download Model Weights

The model weights (`sheikh.safetensors`) are too large for direct GitHub hosting.
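For a sense of the sizes involved, here is a back-of-envelope estimate of the decoder parameter count implied by the Architecture numbers above (a sketch only: the 4x MLP expansion is an assumption this README does not state, and embeddings plus the vision/audio modules are excluded):

```python
# Rough decoder size from the Architecture section (32 layers, 4096 hidden).
# Assumption: a conventional 4x MLP expansion; embeddings and the
# vision/audio modules are not counted.
layers, hidden = 32, 4096
attn_params = 4 * hidden * hidden        # Q, K, V and output projections
mlp_params = 2 * hidden * (4 * hidden)   # up- and down-projections
total = layers * (attn_params + mlp_params)
print(f"~{total / 1e9:.1f}B parameters, ~{2 * total / 1e9:.0f} GB in fp16")
# → ~6.4B parameters, ~13 GB in fp16
```

A checkpoint in that range is well beyond GitHub's per-file limits, which is why the weights are hosted on the Hub instead.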
Download them from the Hugging Face Hub:

```bash
wget --content-disposition "https://huggingface.co/codedwithlikhon/super-sheikh/resolve/main/sheikh.safetensors"
```

Or use the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codedwithlikhon/super-sheikh")
model = AutoModelForCausalLM.from_pretrained("codedwithlikhon/super-sheikh", trust_remote_code=True)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0])
```

### Multimodal Processing

```python
from transformers import SuperSheikhProcessor
from PIL import Image

processor = SuperSheikhProcessor.from_pretrained("path/to/super-sheikh")

# Process text and image together
text = "Describe this image"
image = Image.open("image.jpg")
inputs = processor(text=text, images=image, return_tensors="pt")
```

## Features

- **Long Context**: Extended context window for processing large documents
- **Multimodal**: Supports text, image, and audio inputs
- **Efficient**: Optimized for both training and inference
- **Flexible**: Customizable for various downstream tasks

## Training

The model was trained on a diverse dataset including:

- Text corpora from books, articles, and web content
- Image-text pairs from various vision-language datasets
- Audio-text pairs from speech recognition datasets

### Tokenizer Training

You can train a custom BPE tokenizer for SuperSheikh:

```python
from tokenizer_super_sheikh import SuperSheikhTokenizer

# Train tokenizer from dataset
tokenizer = SuperSheikhTokenizer.train_from_iterator(
    text_iterator,
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<|startoftext|>", "<|endoftext|>", "", ""]
)

# Save tokenizer files
tokenizer.save_pretrained("path/to/save/directory")
```

### Model Saving

The model supports the safetensors format for efficient storage:

```python
# Save model with safetensors format
model.save_pretrained(
    "path/to/save/directory",
    safe_serialization=True,
    max_shard_size="10GB"
)
```

This automatically generates:

- `model.safetensors` (or sharded files)
- `model.safetensors.index.json` (for sharded models)
- `config.json`
- `generation_config.json` (if present)
- `chat_template.jinja` (if present, for instruction-tuned models)

### Supported File Formats

The tokenizer implementation generates standard tokenizer files:

- `tokenizer.json` - Main tokenizer file
- `vocab.json` - Vocabulary mapping
- `merges.txt` - BPE merges
- `tokenizer_config.json` - Tokenizer configuration
- `special_tokens_map.json` - Special tokens mapping
- `added_tokens.json` - Additional tokens (if any)

## Automated Deployment

This repository includes automated deployment to the Hugging Face Hub via GitHub Actions.

### Setup

1. **Fork or clone** this repository to your GitHub account
2. **Set up a Hugging Face token**:
   - Go to [Hugging Face Settings > Access Tokens](https://huggingface.co/settings/tokens)
   - Create a new token with "Write" permissions
   - Add it to your GitHub repository secrets as `HF_TOKEN`
3.
   **Push to the main branch** or use manual workflow dispatch

### Workflow Features

- **Automatic deployment**: Triggers on pushes to the `main` branch
- **Manual deployment**: Can be triggered manually from the GitHub Actions UI
- **Complete model upload**: Automatically uploads all model files, including:
  - Model weights (`*.safetensors`)
  - Tokenizer files (`tokenizer.json`, `vocab.json`, `merges.txt`)
  - Configuration files (`config.json`, `tokenizer_config.json`)
  - Chat template (`chat_template.jinja`)
  - Special tokens and additional metadata

### Repository Links

- **GitHub**: [https://github.com/codedwithlikhon/super-sheikh](https://github.com/codedwithlikhon/super-sheikh)
- **Hugging Face**: [https://huggingface.co/codedwithlikhon/super-sheikh](https://huggingface.co/codedwithlikhon/super-sheikh)

The model is automatically available on the Hugging Face Hub after a successful deployment.

## Limitations

- Requires significant computational resources
- The large model size may not be suitable for all deployment scenarios
- Performance may vary depending on input quality and domain

## License

This model is released under the MIT License.

## Citation

If you use SuperSheikh in your research, please cite:

```
@misc{super-sheikh-2024,
  title={SuperSheikh: A Multimodal Long-Context Language Model},
  author={SuperSheikh Team},
  year={2024},
  url={https://github.com/codedwithlikhon/super-sheikh}
}
```

## Contact

For questions or support, please open an issue on our GitHub repository.
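As a reference for the Automated Deployment section above, the workflow it describes could look roughly like the following. This is a hypothetical sketch, not the repository's actual workflow file: the file name, action versions, and the upload command are all assumptions.

```yaml
# .github/workflows/deploy.yml (hypothetical sketch)
name: Deploy to Hugging Face Hub

on:
  push:
    branches: [main]
  workflow_dispatch:        # enables manual runs from the Actions UI

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Upload model files to the Hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}   # the repository secret set up above
        run: |
          pip install "huggingface_hub[cli]"
          huggingface-cli upload codedwithlikhon/super-sheikh . . \
            --include "*.safetensors" "*.json" "*.txt" "*.jinja" \
            --token "$HF_TOKEN"
```

The `--include` patterns mirror the file list under Workflow Features, so only model, tokenizer, configuration, and template files are uploaded.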