My private, local home assistant tells my son a story
Ten years ago I had this vision of making Jarvis real, but I really didn't want to give Jeff Bezos the keys to my life. And so it was put on hold... until last year. I'd like to share what I've been doing.
Hello, my name is Daniel. I used to be a user experience engineer for Walt Disney. I then led a small engineering team that created the web video player for NBC News. Today I am part of the group working on Oracle's data center expansion.
But in my spare time, I have been designing and building my own private AI service to give my home sovereign intelligence.
It started about two years ago when I discovered Home Assistant; later, as I began playing with commercial AI agents, I poked at what they could and couldn't do. Somewhere along the way I realized that Home Assistant could be attached to ChatGPT, and that got me thinking... what if! And there began the experiments.
Not knowing anything about deploying AI models, I started with a small, open-source model on an 8GB Jetson Nano. It took a little time before I realized that the hardware and the model just weren't going to get me to my goals. At the same time, I was feeling unsettled about both the way commercial models were created and the way data from everyday users was sucked up to feed them.
I set off to build my own AI server mid-summer last year, but I wasn't happy with my model choices. As my parts were en route to me, though, Swiss AI announced Apertus, and I knew it was the model I wanted to use: it aligned with my own values, and I felt that even the small one would be powerful enough to do what I'd long dreamed about.
I'd love to tell you that getting everything up and running was cupcakes and puppy dogs, but if that were true, where would the adventure stories be!
Oh no, I have had my share of challenges. The first was getting it loaded onto my GPU: an NVIDIA RTX 3090. Once that was accomplished, my first interaction with the model will go down in my personal lore:
Me: "Who are you?" (I expected to get some form of Apertus returned)
Apertus: "I am the goddess of Earth. I am also the god of the sky."
At least it's humble.
Fast-forward to December and January: I found several documents from Swiss AI that were critical in reaching the milestone demonstrated in the video.
- Discovering Swiss AI's GitHub repo, particularly the Apertus formatter
- Discovering vLLM (new to me, at least) and deploying Apertus on it
- Performance tweaks and tunes that Swiss AI researchers mentioned in various bug reports
So here are the achievements I am really excited to share:
- Completely private neuro-symbolic AI intelligence service designed for my home
- Runs in 16GB of VRAM on RTX 3090
- Max tokens set to 16,384 (above 17,000 I don't have enough free RAM for the KV cache!)
- I should note that raising the token limit caused no model performance degradation: it was just as fast with a high max as with a low one!
- Web UI chat client that tailors itself to my family members: text is streamed, and I'm averaging 32 tokens/sec with an initial prompt-to-response latency of around 400ms.
- STT-TTS voice interface with streaming responses + streaming TTS
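For readers who want to sanity-check token limits against their own memory budget, the usual back-of-the-envelope KV-cache formula is a helpful tool. To be clear, the architecture numbers below are illustrative assumptions of mine, not Apertus's actual configuration:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   max_tokens: int, dtype_bytes: int = 2) -> int:
    """Estimate KV-cache size: 2 tensors (K and V) per layer, each holding
    num_kv_heads * head_dim values per token, dtype_bytes per value."""
    return 2 * num_layers * num_kv_heads * head_dim * max_tokens * dtype_bytes

# Illustrative numbers only -- NOT Apertus's real architecture:
# 32 layers, 8 KV heads (GQA), head_dim of 128, fp16 cache.
gib = kv_cache_bytes(32, 8, 128, 16_384) / 2**30
print(f"{gib:.2f} GiB")  # 2.00 GiB at 16,384 tokens under these assumptions
```

Plugging in your model's real layer count, KV-head count, and head dimension shows how quickly the cache grows as the max-token setting climbs.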
About the neuro-symbolic service: I built a symbolic server designed to generate and cache common actions rather than going back to the AI server every time. The first time a potential rule comes through ("please turn on the lights"), it asks the model whether that's really what the user wanted. After a certain level of affirmation, the symbolic server learns the action and simply runs it without needing agent confirmation.
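The promote-after-confirmation idea can be sketched in a few lines. This is a minimal illustration of the general technique, with my own hypothetical names and threshold, not the actual service:

```python
from collections import defaultdict
from typing import Callable

class SymbolicRuleCache:
    """Learn an intent -> action mapping after enough model confirmations,
    then execute it directly without round-tripping to the LLM."""

    def __init__(self, confirm_with_llm: Callable[[str], bool], threshold: int = 3):
        self.confirm_with_llm = confirm_with_llm  # asks the model "is this what the user meant?"
        self.threshold = threshold                # confirmations needed before promotion
        self.confirmations = defaultdict(int)
        self.learned: dict[str, Callable[[], None]] = {}

    def handle(self, intent: str, action: Callable[[], None]) -> str:
        if intent in self.learned:
            self.learned[intent]()                # cached rule: run symbolically, no LLM call
            return "symbolic"
        if self.confirm_with_llm(intent):         # neural path: model confirms the intent
            self.confirmations[intent] += 1
            if self.confirmations[intent] >= self.threshold:
                self.learned[intent] = action     # promote to a learned symbolic rule
            action()
            return "neural"
        return "rejected"
```

After the threshold is reached, repeated requests like "turn on the lights" skip the model entirely, which is where the latency win comes from.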
What's next: only yesterday I think I finally figured out how to use the Apertus ToolCalls/ToolOutputs format. I'll start experimenting with that so Apertus can begin controlling devices in my home.
But there is ONE thing I could use some real help with: my vLLM engine SEGFAULTS every time I attempt to start it with LoRA adapters enabled. I tried Qwen as a sanity check, and it showed the same behavior. The issue appears to be in the way vLLM unpacks the q/k/v LoRA weights into the fused qkv_proj module; something in the tensor reshaping/slicing causes a memory access violation. Has anyone encountered this or found a fix? I'd love to resolve it, because vLLM is a significant improvement over Transformers, and my goal is to eventually use fine-tuning to help Apertus know my family better.
Anyway, I've rattled on long enough. Let's see some demos! (Feel free to also ask questions; happy to discuss my learnings from the past five months working with this model)
An early interaction. Decent output but before I figured out how to unpack responses.
Apertus getting a sense of humor.
At the bottom of the response it shows
And my most recent milestone: fully private, fully local voice interaction and speech synthesis. The speech synthesis runs through a custom streaming service, so the TTS can start speaking as text chunks arrive. Think of how a web-based AI chat client streams its text response, but for audio. This video link shows it in action, with a small surprise. ;)
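The core trick behind streaming TTS is deciding where to cut the incoming token stream so the synthesizer gets natural-sounding units early. Here is a sketch of one common policy, cutting at sentence boundaries; this is my guess at the general technique, not the author's actual service:

```python
import re
from typing import Iterable, Iterator

SENTENCE_END = re.compile(r'([.!?])\s')

def tts_chunks(token_stream: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed text and yield each sentence as soon as it
    completes, so speech synthesis can begin before generation finishes."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while (m := SENTENCE_END.search(buffer)):
            yield buffer[:m.end(1)]           # emit the finished sentence
            buffer = buffer[m.end():].lstrip()
    if buffer.strip():
        yield buffer.strip()                  # flush whatever remains at the end

# Each yielded chunk would be handed to the TTS engine immediately,
# which is what lets audio playback overlap with ongoing generation.
```

A real service would also handle abbreviations, numbers, and minimum chunk lengths, but this captures why the first audio can play only a few hundred milliseconds after the first sentence completes.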
Wonderful, thanks for sharing :) Concerning your central issue:
> But there is ONE thing that I could use some real help with: my vLLM engine SEGFAULTS every time I attempt to start it with LoRA adapters enabled. I used Qwen, just as a sanity check, and Qwen showed the same behavior. It appears the issue is related to the way vLLM is unpacking the q/k/v LoRA weights to the fused qkv_proj module, something in the tensor reshaping/slicing causes a memory access violation. Has anyone encountered this or have a solution? ...
Could you please try the latest stable build of vLLM, follow the relevant documentation, and look through issues like this one if you still see problems?
Oh, thank you! I will give that a try and see if it fixes things. I appreciate the follow-up.