chanderbalaji's picture
Upload README.md with huggingface_hub
5c0026a verified
metadata
language:
  - en
base_model: internlm/Intern-S2-Preview
tags:
  - mlx
  - fp8
  - 4bit
  - intern-s2-preview
  - apple-silicon
  - mlx-lm
pipeline_tag: text-generation
library_name: mlx

Intern-S2-Preview FP8 MLX 4-bit

This repository contains an MLX-compatible 4-bit version of internlm/Intern-S2-Preview.

Local Usage

python -m mlx_lm generate \
  --model <namespace>/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096

For a local checkout:

python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096

Local Benchmark

Benchmarks were run locally with mlx_lm generate on Apple Silicon.

Basic Generation

Command:

python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096

Observed output stats:

Metric Value
Prompt tokens 19
Prompt throughput 306.835 tokens/sec
Generation tokens 702
Generation throughput 123.388 tokens/sec
Peak memory 19.651 GB

Prompted Final-Only Output Test

Command:

python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Do not show reasoning, analysis, thinking process, scratchpad, or <think> text. Output only the final answer. Write a concise response to your prompt here." \
  --max-tokens 4096

Observed output stats:

Metric Value
Prompt tokens 44
Prompt throughput 487.095 tokens/sec
Generation tokens 817
Generation throughput 122.650 tokens/sec
Peak memory 19.695 GB

The model still emitted visible reasoning text in this raw generation mode, so prompt-only suppression was not sufficient.

Notes

  • Format: MLX sharded safetensors
  • Quantization: FP8/4-bit MLX local build
  • Base model: internlm/Intern-S2-Preview
  • The model may emit visible reasoning text in raw generation. For chat applications, use a serving layer or post-processor that strips reasoning if needed.
  • Raw generation throughput was about 123 tokens/sec in the local smoke tests above.
  • Peak memory in these tests was about 19.7 GB.

License

This is a derived MLX build of internlm/Intern-S2-Preview. Refer to the base model repository for upstream license and usage terms.