---
language:
- en
base_model: internlm/Intern-S2-Preview
tags:
- mlx
- fp8
- 4bit
- intern-s2-preview
- apple-silicon
- mlx-lm
pipeline_tag: text-generation
library_name: mlx
---

# Intern-S2-Preview FP8 MLX 4-bit

This repository contains an MLX-compatible 4-bit version of [`internlm/Intern-S2-Preview`](https://huggingface.co/internlm/Intern-S2-Preview).

## Local Usage

```bash
python -m mlx_lm generate \
  --model /Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
```

For a local checkout:

```bash
python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
```

## Local Benchmark

Benchmarks were run locally with `mlx_lm generate` on Apple Silicon.

### Basic Generation

Command:

```bash
python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
```

Observed output stats:

| Metric | Value |
| --- | ---: |
| Prompt tokens | 19 |
| Prompt throughput | 306.835 tokens/sec |
| Generation tokens | 702 |
| Generation throughput | 123.388 tokens/sec |
| Peak memory | 19.651 GB |

### Prompted Final-Only Output Test

Command:

```bash
python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Do not show reasoning, analysis, thinking process, scratchpad, or text. Output only the final answer. Write a concise response to your prompt here." \
  --max-tokens 4096
```

Observed output stats:

| Metric | Value |
| --- | ---: |
| Prompt tokens | 44 |
| Prompt throughput | 487.095 tokens/sec |
| Generation tokens | 817 |
| Generation throughput | 122.650 tokens/sec |
| Peak memory | 19.695 GB |

The model still emitted visible reasoning text in this raw generation mode, so prompt-only suppression was not sufficient.

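
Because prompt-only suppression did not work in this test, one option is to strip the reasoning from the generated text after the fact. The following is a minimal sketch, not part of this repository. It assumes the reasoning is delimited by `<think>` and `</think>` tags, which this README does not confirm, so check the actual output and adjust the markers. The helper name `strip_reasoning` is hypothetical.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>. Adjust the tag
# names if the model's output uses different markers.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Return `text` with reasoning spans removed (hypothetical helper)."""
    # Remove every complete <think>...</think> block.
    cleaned = _THINK_BLOCK.sub("", text)
    # A closing tag without an opener can occur when the opener was part of
    # the prompt template; keep only what follows the last closing tag.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[1]
    # An opening tag without a closer can occur when generation hit the
    # token limit mid-reasoning; drop everything from the opener onwards.
    if "<think>" in cleaned:
        cleaned = cleaned.split("<think>", 1)[0]
    return cleaned.strip()
```

With this sketch, `strip_reasoning("<think>plan</think>Answer")` returns `"Answer"`, and text without any tags is returned unchanged apart from whitespace trimming. A regex pass like this only works on complete output, so a streaming serving layer would need to buffer tokens until the closing tag arrives.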
## Notes

- Format: MLX sharded `safetensors`
- Quantization: FP8/4-bit MLX local build
- Base model: `internlm/Intern-S2-Preview`
- The model may emit visible reasoning text in raw generation. For chat applications, use a serving layer or post-processor that strips reasoning if needed.
- Raw generation throughput was about 123 tokens/sec in the local smoke tests above.
- Peak memory in these tests was about 19.7 GB.

## License

This is a derived MLX build of `internlm/Intern-S2-Preview`. Refer to the base model repository for upstream license and usage terms.