davide221 committed on
Commit 3bfaf33 · verified · 1 Parent(s): 34bf60a

Update README.md

Files changed (1)
  1. README.md +1 -12
README.md CHANGED
@@ -51,13 +51,10 @@ Measured with `bench_laguna_generate` from lucebox-hub (dflash autoregressive fo
  |----------|-----------|-------|
  | Decode @ ctx=128 (greedy) | **113 tok/s** | n_gen=128 |
  | Decode @ ctx=1K | 104 tok/s | |
- | Decode @ ctx=4K | 60 tok/s | |
- | llama.cpp tg128 (Q8_0 KV, FA on) | 165 tok/s | for comparison |
+ | Decode @ ctx=4K | 65 tok/s | |
  | 128K TTFT via dflash + PFlash | **15.91 s** | 5.4× faster than llama.cpp pp131072 (86.60 s) |
  | Loader VRAM | 18.77 GiB | + 110 MiB tok_embd kept on CPU |
  
- A100 SXM Q4_K_M: ~155 tok/s decode (single user, short ctx).
-
  ## Usage
  
  ### lucebox-hub (dflash + PFlash, recommended for 128K)
@@ -81,14 +78,6 @@ curl http://localhost:8000/v1/chat/completions \
  -d '{"model":"luce-dflash","messages":[{"role":"user","content":"hello"}],"stream":true}'
  ```
  
- ### llama.cpp
-
- Requires llama.cpp with `laguna` arch support. The lucebox-hub fork at `dflash/deps/llama.cpp` adds it; upstream PR pending.
-
- ```bash
- ./llama-bench -m laguna-xs2-Q4_K_M.gguf -p 0 -n 128 -ctk q8_0 -ctv q8_0 -fa 1 -ngl 99
- ```
-
  ## License
  
  Apache 2.0, inherited from upstream `poolside/Laguna-XS.2`.
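
The table rows touched by this commit can be sanity-checked with a few lines of arithmetic. The sketch below is not part of the README or of lucebox-hub. It re-derives the quoted 5.4× TTFT ratio and the relative change in the ctx=4K decode figure from the numbers shown in the diff. The helper names `speedup` and `relative_change` are hypothetical and exist only for this illustration. It assumes, as the README row does, that the 128K TTFT and the llama.cpp pp131072 time are directly comparable.

```python
# Figures copied from the README table in this diff; nothing here is measured.
LLAMA_CPP_PP131072_S = 86.60  # llama.cpp pp131072 time, seconds
DFLASH_TTFT_128K_S = 15.91    # 128K TTFT via dflash + PFlash, seconds
CTX4K_OLD_TOK_S = 60.0        # ctx=4K decode rate before this commit
CTX4K_NEW_TOK_S = 65.0        # ctx=4K decode rate after this commit


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_s <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_s / candidate_s


def relative_change(old: float, new: float) -> float:
    """Return the fractional change from old to new (0.1 means +10%)."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old


ttft_speedup = speedup(LLAMA_CPP_PP131072_S, DFLASH_TTFT_128K_S)
ctx4k_change = relative_change(CTX4K_OLD_TOK_S, CTX4K_NEW_TOK_S)

if __name__ == "__main__":
    print(f"128K TTFT speedup: {ttft_speedup:.1f}x")   # prints 5.4x
    print(f"ctx=4K decode change: {ctx4k_change:+.1%}")  # prints +8.3%
```

The ratio 86.60 / 15.91 rounds to 5.4, which matches the table, and the ctx=4K edit from 60 to 65 tok/s is an increase of about 8.3%.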