LLaMa: a GPT-quality language model on commodity hardware (e.g. an RPi)

LLaMa makes Large Language Models (of comparable quality to the Generative Pre-trained Transformer) practical on commodity hardware, i.e. a single GPU; even more impressively, I believe pared-down versions like LLaMa 7B have been run on an RPi and on a smartphone.

The Hackaday article is here, and more informative than anything I’d write.

I presume we’re all proponents of having some modest degree of autonomy when it comes to domestic software & hardware. Although LLaMa was developed by Meta/Facebook, it is a model trained on publicly available text, and both the code and the model weights can be obtained (see the Hackaday article for details).

As with all processing of inputs, there are huge security advantages to locally processing as much of them as practicable.

As with anything a lay reader might think of as AI, I’d like to append a note explaining why I’d describe this as Machine Learning rather than AI (and, at that, the word ‘Learning’ is somewhat of a lazy anthropomorphisation. I know, I know, I probably sound like my grandmother telling me ‘car’ is a vulgar contraction of ‘motor-car’…). The following text is from this web-page:

How LLMs Work:
LLMs like GPT-3 are deep neural networks—that is, neural networks with many layers of “neurons” connected by billions of weighted links. Given an input text “prompt”, in essence what these systems do is compute a probability distribution over a “vocabulary”—the list of all words (or actually parts of words, or tokens) that the system knows about. The vocabulary is given to the system by the human designers. GPT-3, for example, has a vocabulary of about 50,000 tokens.

For simplicity, let’s forget about “tokens” and assume that the vocabulary consists of exactly 50,000 English words. Then, given a prompt, such as “To be or not to be, that is the”, the system encodes the words of the prompt as real-valued vectors, and then does a layer-by-layer series of computations, whose penultimate result is 50,000 real numbers, one for each vocabulary word. These numbers are (for obscure reasons) called “logits”. The system then turns these numbers into a probability distribution with 50,000 probabilities—each represents the probability that the corresponding word is the next one to come in the text. For the prompt “To be or not to be, that is the”, presumably the word “question” would have a high probability. That is because LLMs have learned to compute these probabilities by being shown massive amounts of human-generated text. Once the LLM has generated the next word—say, “question”—it then adds that word to its initial prompt, and recomputes all the probabilities over the vocabulary. At this point, the word “Whether” would have very high probability, assuming that Hamlet, along with all quotes and references to that speech, was part of the LLM’s training data.
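The quoted explanation boils down to “logits in, softmax out, pick a next word”. Here is a toy sketch of that single step; the four-word vocabulary and the logit values are invented for illustration, not taken from any real model:

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-word "vocabulary" instead of GPT-3's ~50,000 tokens.
vocab = ["question", "answer", "point", "banana"]
# Invented logits for the prompt "To be or not to be, that is the".
logits = [4.0, 1.5, 1.0, -2.0]

probs = softmax(logits)
next_word = vocab[probs.index(max(probs))]  # "question"
```

A real LLM then appends the chosen word to the prompt and repeats the whole computation, exactly as the quote describes; sampling from `probs` (rather than always taking the maximum) is what makes the output vary between runs.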


I have been integrating this in OVOS as part of the persona sprint

you can follow progress here


Holy smoke. Thanks @JarbasAl ! I was just mentioning this in the abstract; I didn’t think anyone would remotely be onto turning this into reality in the context of voice assistants! You just blew my mind :smiley:


Hi all, new poster here. This looks awesome, I was hoping to try out and potentially help develop some local LLM voice assistant action. :slight_smile:

I am still trying to figure out the basics of the Mycroft Mark II, which skill store to install skills from and so on, but once I get that sorted out I was hoping to be able to talk with the Mark II which would get responses generated on the GPU on a local stationary computer. Right now I am using a 7B model called wizardLM and think the conversations are pretty good.

Were you aiming to run the LLM actually on the Raspberry Pi as a proof of concept? Or are you also aiming to communicate with some local computer?

I also saw this YouTube video which inspired me. Maybe you have already seen it because it is pretty old.

How far have you gotten? All the best.


some work has been done already, mostly just exploring ideas

main issue tracking progress


Hi @JarbasAl , thanks for the quick reply! It looks very promising and ambitious. Do I need OVOS to try things out, or does it also work with Neon OS?


since Neon is built on top of OVOS, any components made for OVOS will also work in Neon

in this case there are many loose proofs of concept; those can be used, but they do not yet come together as a final product you can just install and be done with. prioritizing this work is a stretch goal of our ongoing fundraiser; right now updates only come whenever i work on this for fun or as a side effect of working on related code


As Jarbas said, you can work with either operating system, and often with just a little care skills & other projects can be compatible with both OVOS and Neon AI. :slight_smile:

Neon has a skill for talking to ChatGPT working in our beta version right now, which you might like to check out. With the Neon OS running on the Mark II, the commands are:

  1. Enable pre-release updates
  2. Check for updates
  3. Update my configuration
  4. Chat with ChatGPT

You’ll still need to either press the button on top of the Mark II or use the wakeword for each sentence you want to say to ChatGPT. We’re considering how to make that smoother - perhaps by leaving the microphone open while the ChatGPT skill is active. Suggestions are welcome. :slight_smile:

Yes, Coqui is an excellent project! We’ve put some contributions in there, and feel we’re very close to enabling our own STT & TTS. :slight_smile: If it’s of interest, here’s our Coqui demo - Neon AI - Coqui AI TTS Plugin | Neon AI


Georgi Gerganov has another excellent repo here, llama.cpp, just as he did with whisper.cpp.

It’s a pretty easy install, but on an RPi, even with the amazing optimisation work, it’s still going to be excruciatingly slow. It’s sort of OK on an RK3588, which is about 5x Pi 4 performance, and even that is maxing out the CPU.
ASR + LLaMA + TTS could make a really cutting-edge home assistant, but it needs some Oooomf!
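To make the ASR + LLM + TTS idea concrete, here is a hypothetical sketch of how the three stages could be glued together. Every function below is a placeholder I invented for illustration, standing in for a real component (e.g. whisper.cpp for ASR, llama.cpp for the LLM, a TTS engine for synthesis); none of these are actual APIs:

```python
def transcribe(audio: bytes) -> str:
    """ASR placeholder (a whisper.cpp call would go here)."""
    return "what is the capital of france"

def generate_reply(prompt: str) -> str:
    """LLM placeholder (a llama.cpp call would go here)."""
    return f"You asked: {prompt!r}. A local LLM's reply would go here."

def synthesise(text: str) -> bytes:
    """TTS placeholder (a TTS engine call would go here)."""
    return text.encode("utf-8")

def assistant_turn(audio: bytes) -> bytes:
    """One voice-assistant turn: audio in, spoken reply out."""
    text = transcribe(audio)
    reply = generate_reply(text)
    return synthesise(reply)

response = assistant_turn(b"")  # stand-in for captured mic audio
```

The point of the sketch is the shape of the pipeline: each stage is independent, so the heavy LLM stage can live on whatever box has the “Oooomf” while the mic and speaker stay on the Pi.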


There is Orca 3B on a Pi 4.

It is real-time, but slow, and note that is on an 8 GB Pi 4.


Alas, for the speed and loquaciousness of LLM I’m sure we all desire, I think at the moment the most expeditious route for the self-hosted/home-labbers to exploit an LLM via Mycroft or Neon will be to construct a wee server with a few Nvidia Tesla 24GB graphics cards, on which to run the LLM du jour. These cards (circa 2014, x16 PCIe) are now being sold cheaply second-hand, often ultra-cheaply when untested if you’re up for a gamble, as they have no graphics-out port. One esoteric requirement: your motherboard must support the BIOS option “Above 4G decoding” in order for them to operate; otherwise they give a “Code 12” fault (“can’t find enough resources”) after installing the driver. The models are getting ‘better’ by the day (r/localllama, RSS feed), and the progress is ongoing.


We’ve got our own LLM instance up now! :slight_smile: It’s in our alpha version, live just a couple days ago, and I need to make the “official” announcement. We’ve set up our version of FastChat using Ctranslate, which we’re calling NeonGPT, and the command at the moment is “Chat with FastChat” but we’ll get “Chat with NeonGPT” running soon.

It runs in the cloud, with hosting managed by us, but it’s a very lightweight language model, so it is a good target for potential offline use. We aren’t collecting any data, so it is a lot more “private” than ChatGPT; though I cannot make any guarantees about keeping your chat data super-encrypted, we’ve made a good-faith effort for it to be private.

We’re still working on more LLM-related development, and though we haven’t had any time to make or share documentation, we would welcome and support community development in this area just as we do in others. :slight_smile:


You need nothing like a few Nvidia Tesla 24GB graphics cards, as small models fine-tuned by bigger models post benchmarks very close to the big models. The problem is that the big-data companies who own the big models have put licence clauses in place against this.
It’s just that currently a Raspberry Pi is not very capable for ML: newer architectures such as the Armv8.2 cores in the RK3588 have added matrix-multiply (dot-product) instructions, so the ~5x ML speedup is far more than the 2.2 GHz clock speedup alone would give.

I have never liked Apple bling, but likely by far the best private home platform is the M1/M2 Mac Mini (6.8 W idle): with unified CPU/GPU/NPU memory, plus great support in the Arm NEON, Accelerate and Metal frameworks, its ML performance per watt is godlike.
Even the RK3588 is capable of running LLMs, and with diversified use, concurrent clashes are minimal.

It’s just that the Raspberry Pi is a “forget about it”, mainly because the A53/A72 cores are old enough that ML really wasn’t on the radar when they were designed.
I was quite excited about what might be the next great thing after the RK3588, but with all the trade wars and licensing malarkey going on at the moment, we might see a reluctance from Chinese fabless designers to implement anything new Arm-based, which probably means this is great news for Apple.

It is basically a LLaMA LLM fine-tuned on GPT-3.5 & GPT-4 output, and comes very close to GPT-4 with huge reductions in model and parameter size.
Fine-tuning models is all the rage, especially when you can automate it with bigger models doing the fine-tuning for you.

There is also some really nice recent stuff from AMD & Intel, but it still lags considerably behind Apple’s take on current flagship Arm IP. For me, though, a privacy-respecting home AI that relies on a cloud LLM is sort of paradoxical…

There are really cheap ~$100 24GB Tesla K80 GPUs on eBay, but with today’s and tomorrow’s energy prices they are not as cheap as you might think to run 24/7/365, and if I remember rightly they are 2x 12GB and about equivalent to an RTX 3050.
I had the same idea with the Tesla K80, but after some spec-checking it no longer seemed a good idea.

Likely the right infrastructure is different: distributed Pi 3-and-above boards acting purely as wireless mic arrays, connected to a central home cloud.
Currently you can scrape by with an RK3588, and likely the best solution is a Mac Mini, but there is also new low-wattage silicon from AMD/Intel that could serve as a central private home AI.

The Pi situation for ML is dire with current stock. Even with some lateral thought about infrastructure, the great Pi Zero 2 is a near-impossible buy: it can make a great mic array with the ReSpeaker 2-Mic, but getting one is another thing, and it’s likely not going to be at the same great price when available, as Raspberry Pi seems to have committed maker suicide.


PS: I have been trying to find alternatives to the Raspberry Pi for creating distributed wireless mics, and have been a fan of the ESP32-S3 for some time now due to its enhanced vector instructions.
For ML it’s like a standard ESP32 but with up to a 10x speedup.

There is a repo currently using the ESP32-S3-Box: GitHub - toverainc/willow: Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative. To be honest I haven’t tried it, though I do have both the ESP32-S3-Box and the ESP32-S3-Box-Lite, which I evaluated as a great technology preview that shoehorned an amazing amount of functionality into a microcontroller.
An Alexa/Google smart speaker it isn’t, but likely many more resources could be released and reallocated if it concentrated on merely being a wireless mic array / KWS, using something tiny, cute and cheap like the LilyGO T7-S3.

Likely with a cheap PCM1808 ADC we could extend the far field of the S3-Box by adding hardware analogue AGC with a MAX9814, while still using the blind source separation of the Espressif ADF/SR.
We could also replace the Espressif WakeNet KWS blobs with trained models (which I can do).

You could also likely do something similar with the Orange Pi Zero 2, as that’s likely the most cost-effective A53 Pi 3 replacement nowadays, and it likely makes model choice a little more flexible, since, unlike on the standard ESP32 (LX6), LSTM layers on the S3 (LX7) are, as far as I can make out, licensed and not available.

Strangely, Mycroft had a dedicated audio board but never preprocessed KWS/ASR datasets with the device of use, and the total lack of any audio engineering to increase accuracy has always been a total confusion to me.
It’s a tedious process, but not that hard, to create KWS/ASR models tailored to a specific device: the mics and algorithms all have unique signatures, and if you train those in by preprocessing a dataset, you increase accuracy.
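As an illustration of that preprocessing idea (a toy sketch, not Mycroft/OVOS code; the waveform samples and the two-tap impulse response are made up), each clean training clip could be convolved with the device’s measured impulse response before training, so the mic chain’s “signature” is baked into the dataset:

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution (use numpy/scipy for real datasets)."""
    n, m = len(signal), len(impulse_response)
    out = [0.0] * (n + m - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

clean_clip = [0.0, 1.0, 0.0, -0.5]  # toy waveform samples
device_ir = [0.8, 0.2]              # toy 2-tap device impulse response
augmented = convolve(clean_clip, device_ir)
# 'augmented' now carries the device's colouration, ready for training.
```

In practice you would measure the device’s impulse response (and run the clip through the same AGC/beamforming front end), then batch-process the whole dataset this way before training the KWS/ASR model.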

It is likely the best value, as the OPi Zero 2 has been updated to the OPi Zero 3.

Doh, my memory: the OPi Zero 2/3 doesn’t have I2S on the pin mux from what I could gather, and thinking about it, I used a Plugable USB soundcard last time.