---
language:
- en
base_model: internlm/Intern-S2-Preview
tags:
- mlx
- fp8
- 4bit
- intern-s2-preview
- apple-silicon
- mlx-lm
pipeline_tag: text-generation
library_name: mlx
---

# Intern-S2-Preview FP8 MLX 4-bit

This repository contains an MLX-compatible 4-bit version of [`internlm/Intern-S2-Preview`](https://huggingface.co/internlm/Intern-S2-Preview).

## Local Usage

```bash
python -m mlx_lm generate \
  --model <namespace>/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
```

For a local checkout:

```bash
python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
```

## Local Benchmark

Benchmarks were run locally with `mlx_lm generate` on Apple Silicon.

### Basic Generation

Command:

```bash
python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
```

Observed output stats:

| Metric | Value |
| --- | ---: |
| Prompt tokens | 19 |
| Prompt throughput | 306.835 tokens/sec |
| Generation tokens | 702 |
| Generation throughput | 123.388 tokens/sec |
| Peak memory | 19.651 GB |

### Prompted Final-Only Output Test

Command:

```bash
python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Do not show reasoning, analysis, thinking process, scratchpad, or <think> text. Output only the final answer. Write a concise response to your prompt here." \
  --max-tokens 4096
```

Observed output stats:

| Metric | Value |
| --- | ---: |
| Prompt tokens | 44 |
| Prompt throughput | 487.095 tokens/sec |
| Generation tokens | 817 |
| Generation throughput | 122.650 tokens/sec |
| Peak memory | 19.695 GB |

The model still emitted visible reasoning text in this raw generation mode, so prompt-only suppression was not sufficient.

## Notes

- Format: MLX sharded `safetensors`
- Quantization: FP8/4-bit MLX local build
- Base model: `internlm/Intern-S2-Preview`
- The model may emit visible reasoning text in raw generation. For chat applications, use a serving layer or post-processor that strips reasoning if needed.
- Raw generation throughput was about 123 tokens/sec in the local smoke tests above.
- Peak memory in these tests was about 19.7 GB.

## License

This is a derived MLX build of `internlm/Intern-S2-Preview`. Refer to the base model repository for upstream license and usage terms.