---
language:
- en
base_model: internlm/Intern-S2-Preview
tags:
- mlx
- fp8
- 4bit
- intern-s2-preview
- apple-silicon
- mlx-lm
pipeline_tag: text-generation
library_name: mlx
---

# Intern-S2-Preview FP8 MLX 4-bit

This repository contains an MLX-compatible 4-bit quantized build of [`internlm/Intern-S2-Preview`](https://huggingface.co/internlm/Intern-S2-Preview) for use with `mlx-lm` on Apple Silicon.

## Local Usage

```bash
python -m mlx_lm generate \
  --model <namespace>/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
```

For a local checkout:

```bash
python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
```

## Local Benchmark

Benchmarks were run locally with `mlx_lm generate` on Apple Silicon.

### Basic Generation

Command:

```bash
python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
```

Observed output stats:

| Metric | Value |
| --- | ---: |
| Prompt tokens | 19 |
| Prompt throughput | 306.835 tokens/sec |
| Generation tokens | 702 |
| Generation throughput | 123.388 tokens/sec |
| Peak memory | 19.651 GB |
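
To put the throughput figures in wall-clock terms, the sketch below divides the token counts by the reported rates. It assumes the reported throughput is an average over the whole phase and ignores model load time, so treat the result as a rough estimate. The helper name `seconds_for` is illustrative and not part of `mlx_lm`.

```python
def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Estimate elapsed seconds from a token count and an average rate."""
    return tokens / tokens_per_sec


# Values copied from the "Basic Generation" table above.
prompt_s = seconds_for(19, 306.835)
gen_s = seconds_for(702, 123.388)

print(f"prompt: {prompt_s:.2f} s, generation: {gen_s:.2f} s")
```

The same helper can be applied to the numbers from any other run of `mlx_lm generate`.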

### Prompted Final-Only Output Test

Command:

```bash
python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Do not show reasoning, analysis, thinking process, scratchpad, or <think> text. Output only the final answer. Write a concise response to your prompt here." \
  --max-tokens 4096
```

Observed output stats:

| Metric | Value |
| --- | ---: |
| Prompt tokens | 44 |
| Prompt throughput | 487.095 tokens/sec |
| Generation tokens | 817 |
| Generation throughput | 122.650 tokens/sec |
| Peak memory | 19.695 GB |

Despite the instruction in the prompt, the model still emitted visible reasoning text in raw generation mode, so prompt-only suppression was not sufficient.

## Notes

- Format: MLX sharded `safetensors`
- Quantization: FP8/4-bit MLX local build
- Base model: `internlm/Intern-S2-Preview`
- The model may emit visible reasoning text in raw generation. For chat applications, use a serving layer or post-processor that strips reasoning if needed.
- Raw generation throughput was about 123 tokens/sec in the local smoke tests above.
- Peak memory in these tests was about 19.7 GB.
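
The reasoning-stripping post-processor mentioned in the notes could look roughly like the sketch below. It assumes the reasoning is wrapped in `<think>...</think>` tags, which has not been verified against this model's chat template, so adjust the delimiters to match the output you actually observe. The function name `strip_reasoning` is hypothetical.

```python
import re

# Assumption: reasoning appears inside <think>...</think> tags.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Return `text` with <think> reasoning removed."""
    # Remove every complete <think>...</think> block.
    cleaned = _THINK_BLOCK.sub("", text)
    # A closing tag without an opening one (for example, when the opening
    # tag was part of the prompt): keep only what follows it.
    if "</think>" in cleaned:
        cleaned = cleaned.split("</think>", 1)[1]
    # An opening tag that never closes (for example, output cut off by
    # --max-tokens): drop everything from the tag onward.
    if "<think>" in cleaned:
        cleaned = cleaned.split("<think>", 1)[0]
    return cleaned.strip()


if __name__ == "__main__":
    raw = "<think>Plan the reply first.</think>\nHere is the final answer."
    print(strip_reasoning(raw))
```

If a generation is truncated before the reasoning ends, this returns an empty string, so a caller may want to retry with a larger token budget in that case.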

## License

This is a derived MLX build of `internlm/Intern-S2-Preview`. Refer to the base model repository for upstream license and usage terms.