About token generation speed

#1
by budivoy - opened

Summary

I am testing the "fla-version" of RWKV-7-2.9B-g1 on tasks from the GAIA benchmark.

I noticed something odd: generation with the fla version on an NVIDIA T4 is quite slow, even compared to my smartphone.

| Model | Device | Decode (tokens/s) |
|---|---|---|
| fla (RWKV-7-2.9B-g1) | Kaggle, NVIDIA T4 x2 | 8.33 |
| official demo (rwkv7-g1a-2.9b-20250924-ctx4096) | NVIDIA T4 | 26.90 |
| android (RWKV-7-G1a 2.9 250924, W4A16) | Snapdragon 8 Elite, NPU | 27.69 |

The strange thing is that the official demo runs a model of the same size roughly 3x faster (26.90 vs. 8.33 tokens/s) on a single T4.

Question

  1. Is this speed normal for the fla version?
  2. Is there a way to speed it up?

Code for reference

https://github.com/budivoy/Hugging-Face-Courses/blob/devel/agents-course/rwkv-fla-hub.ipynb
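
For anyone who wants to compare numbers, here is a minimal sketch of how a decode figure in tokens/s can be measured. It is not necessarily how the values in the table were obtained. The `generate` argument is a hypothetical callable (my own naming, not part of any library) that takes the number of new tokens to produce and returns only the newly generated token ids.

```python
import time
from typing import Callable, Sequence


def decode_tokens_per_second(
    generate: Callable[[int], Sequence[int]],
    max_new_tokens: int,
    warmup_runs: int = 1,
) -> float:
    """Return newly generated tokens per second of wall-clock time.

    `generate` is a hypothetical wrapper around the model under test:
    it receives `max_new_tokens` and must return only the new token ids
    (prompt tokens excluded).
    """
    # Warm-up runs keep one-time costs (kernel compilation, caches)
    # out of the timed measurement.
    for _ in range(warmup_runs):
        generate(max_new_tokens)

    start = time.perf_counter()
    new_tokens = generate(max_new_tokens)
    elapsed = time.perf_counter() - start

    # Count the tokens actually returned, since generation may stop
    # early on an end-of-sequence token.
    return len(new_tokens) / elapsed
```

With transformers, the wrapper would typically call `model.generate(**inputs, max_new_tokens=n)` and slice off the prompt tokens. On a GPU it should also call `torch.cuda.synchronize()` before returning, so the timer does not stop early. Note that this times prompt processing together with decoding, so it only approximates pure decode speed when the prompt is short.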

P.S. I appreciate your work. Using RWKV via the transformers API is quite convenient.
