---
language:
- en
base_model: internlm/Intern-S2-Preview
tags:
- mlx
- fp8
- 4bit
- intern-s2-preview
- apple-silicon
- mlx-lm
pipeline_tag: text-generation
library_name: mlx
---
# Intern-S2-Preview FP8 MLX 4-bit
This repository contains an MLX-compatible 4-bit version of [`internlm/Intern-S2-Preview`](https://huggingface.co/internlm/Intern-S2-Preview).
## Local Usage
To run the model directly from the Hugging Face Hub, replace `<namespace>` with the account that hosts this repository:
```bash
python -m mlx_lm generate \
--model <namespace>/Intern-S2-Preview-FP8-MLX-4bit \
--trust-remote-code \
--prompt "Write a concise response to your prompt here." \
--max-tokens 4096
```
For a local checkout:
```bash
python -m mlx_lm generate \
--model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
--trust-remote-code \
--prompt "Write a concise response to your prompt here." \
--max-tokens 4096
```
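To drive the same command from a script, for example to repeat a run with different prompts, the invocation can be wrapped in Python. The following is a minimal sketch that only uses the flags shown above. The helper names `build_generate_command` and `run_generate` are hypothetical and not part of `mlx_lm`.

```python
import subprocess
import sys


def build_generate_command(model, prompt, max_tokens=4096, trust_remote_code=True):
    """Return the argv list for the `mlx_lm generate` command shown above.

    `model` may be a Hub repository ID or a local path.
    """
    cmd = [
        sys.executable, "-m", "mlx_lm", "generate",
        "--model", model,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
    ]
    if trust_remote_code:
        cmd.append("--trust-remote-code")
    return cmd


def run_generate(model, prompt, **kwargs):
    """Run the command and return its captured stdout.

    Assumes mlx-lm is installed in the current Python environment.
    """
    result = subprocess.run(
        build_generate_command(model, prompt, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_generate("/path/to/Intern-S2-Preview-FP8-MLX-4bit", "Write a concise response to your prompt here.")` would then be equivalent to the local-checkout command above. Passing the arguments as a list avoids shell quoting issues with long prompts.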
## Local Benchmark
Benchmarks were run locally with `mlx_lm generate` on Apple Silicon.
### Basic Generation
Command:
```bash
python -m mlx_lm generate \
--model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
--trust-remote-code \
--prompt "Write a concise response to your prompt here." \
--max-tokens 4096
```
Observed output stats:
| Metric | Value |
| --- | ---: |
| Prompt tokens | 19 |
| Prompt throughput | 306.835 tokens/sec |
| Generation tokens | 702 |
| Generation throughput | 123.388 tokens/sec |
| Peak memory | 19.651 GB |
### Prompted Final-Only Output Test
Command:
```bash
python -m mlx_lm generate \
--model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
--trust-remote-code \
--prompt "Do not show reasoning, analysis, thinking process, scratchpad, or <think> text. Output only the final answer. Write a concise response to your prompt here." \
--max-tokens 4096
```
Observed output stats:
| Metric | Value |
| --- | ---: |
| Prompt tokens | 44 |
| Prompt throughput | 487.095 tokens/sec |
| Generation tokens | 817 |
| Generation throughput | 122.650 tokens/sec |
| Peak memory | 19.695 GB |
Even with this instruction, the model still emitted visible reasoning text in raw generation mode, so prompting alone was not sufficient to suppress it.
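The two stats tables above can be turned into a rough wall-clock estimate by dividing each token count by its throughput. The sketch below does that arithmetic. It is an estimate derived from the reported numbers, not a measurement, and it assumes that model load time and other overhead are excluded. The function name `estimated_seconds` is hypothetical.

```python
def estimated_seconds(prompt_tokens, prompt_tps, gen_tokens, gen_tps):
    """Approximate run time as prompt processing time plus generation time."""
    return prompt_tokens / prompt_tps + gen_tokens / gen_tps


# Values copied from the two "Observed output stats" tables above.
basic = estimated_seconds(19, 306.835, 702, 123.388)
final_only = estimated_seconds(44, 487.095, 817, 122.650)

# Relative change in generation throughput between the two runs.
throughput_drop = (123.388 - 122.650) / 123.388

print(f"basic run:       {basic:.2f} s")
print(f"final-only run:  {final_only:.2f} s")
print(f"throughput drop: {throughput_drop:.1%}")
```

With these inputs the estimates come to roughly 5.75 s and 6.75 s, and generation throughput differs by about 0.6% between the runs. Generation dominates the total in both cases, and the longer second run is explained by the higher token count rather than by slower decoding.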
## Notes
- Format: MLX sharded `safetensors`
- Quantization: FP8/4-bit MLX local build
- Base model: `internlm/Intern-S2-Preview`
- The model may emit visible reasoning text in raw generation. For chat applications, use a serving layer or post-processor that strips reasoning if needed.
- Raw generation throughput was about 123 tokens/sec in the local smoke tests above.
- Peak memory in these tests was about 19.7 GB.
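The post-processing step suggested in the notes above could look like the following minimal sketch. It assumes the reasoning is delimited by `<think>` and `</think>` tags, which this README does not confirm. If the model emits untagged reasoning, a different strategy is needed. The function name `strip_reasoning` is hypothetical.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> tags.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove tagged reasoning from generated text and return the remainder."""
    # Drop every complete <think>...</think> block.
    cleaned = _THINK_BLOCK.sub("", text)
    # A closing tag without an opener can occur if the opening tag was part
    # of the prompt template; keep only what follows the last closing tag.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[-1]
    # An opener without a closing tag can occur if generation was cut off
    # mid-reasoning; keep only what precedes it.
    if "<think>" in cleaned:
        cleaned = cleaned.split("<think>", 1)[0]
    return cleaned.strip()
```

For example, `strip_reasoning("<think>plan</think>\nHello")` returns `"Hello"`. In a streaming setup the same logic would have to buffer output until the closing tag arrives, which this sketch does not attempt.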
## License
This is a derived MLX build of `internlm/Intern-S2-Preview`. Refer to the base model repository for upstream license and usage terms.