This is running in the browser via Emscripten, courtesy of Georgi Gerganov of llama.cpp fame:
https://ggerganov.com/llama2.c/
Via his Twitter, with an ongoing thread: https://twitter.com/ggerganov/status/1683174252990660610
This and the original are absolutely awesome. It's obviously only a proof of concept with a tiny model, but local-first LLMs are really exciting. I particularly love the idea of being able to build web apps with local inference.
With optimisation, research into smaller models, partial downloads, and then the opportunity to use WebGPU, we potentially have the start of an exciting new way to build private, local LLM-based apps.
It's never going to match the capabilities of hosted LLMs on massive clusters of top-end GPUs, but there are so many use cases this sort of thing will enable.
Here's a Rust version in case anyone's curious what it would look like. It also clocks 106 tokens/second in release mode.
https://github.com/garrisonhess/llama2.c/blob/517a1a3e487f31...
I'm not sure how many people understand how much of a badass move this is.
Andrej is helping Apple and Facebook, and more importantly the open-source movement, while also being paid really well by OpenAI (MSFT).
But they are not going to push him out, because he will go directly to Tesla or xAI.
I've found Llama-2 to be unusably "safety filtered" for creative work: https://i.imgur.com/GFY0wSL.png
More details from Andrej here: https://twitter.com/karpathy/status/1683143097604243456?s=46...
FYI: this builds cleanly with WASI SDK and runs with no changes in a Wasm runtime if you're into that kind of thing
To run a neural network, how much memory does one need?
Is it enough to load the first two layers from disk, calculate the activations for all nodes, discard the first layer, load the third layer from disk, calculate the activations for all nodes, discard the second layer, and so on?
Then the memory only needs to be big enough to hold 2 layers?
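A rough sketch of that streaming idea in C is below — LayerWeights, read_layer and apply_layer are hypothetical names for illustration, not functions from llama2.c (which keeps all the weights in memory):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { float *w; size_t n; } LayerWeights;   /* hypothetical */

    /* hypothetical: read one layer's worth of weights from the checkpoint */
    static int read_layer(FILE *f, LayerWeights *lw, size_t n) {
        lw->n = n;
        lw->w = malloc(n * sizeof(float));
        return lw->w && fread(lw->w, sizeof(float), n, f) == n;
    }

    /* hypothetical stand-in: a real layer would do attention + FFN here */
    static void apply_layer(const LayerWeights *lw, float *acts, int dim) {
        for (int i = 0; i < dim && (size_t)i < lw->n; i++) acts[i] *= lw->w[i];
    }

    /* stream layers one at a time, so only ~1-2 layers of weights sit in RAM */
    static void forward_streaming(FILE *ckpt, float *acts, int dim,
                                  int n_layers, size_t layer_size) {
        for (int l = 0; l < n_layers; l++) {
            LayerWeights cur;
            if (!read_layer(ckpt, &cur, layer_size)) return;
            apply_layer(&cur, acts, dim);   /* this layer's activations */
            free(cur.w);                    /* discard before loading the next */
        }
    }

The catch is that disk I/O then dominates: you re-read the whole checkpoint for every generated token, which is why inference code normally keeps (or mmaps) all the weights in memory.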
Random thought: right now an LLM returns a probability distribution, an RNG sampler picks one token and appends it to the output, then the sequence repeats; but could the sampler instead pick N tokens that approximate the distribution, ask the LLM to generate N new distributions, combine them somehow, then pick another set of N tokens from the combined distribution?
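For reference, the single-token step being described boils down to roughly this (softmax over the logits, then multinomial sampling; a simplified sketch, not the literal run.c code):

    #include <stdlib.h>
    #include <math.h>

    /* turn logits into a probability distribution, in place */
    static void softmax(float *x, int n) {
        float max = x[0];
        for (int i = 1; i < n; i++) if (x[i] > max) max = x[i];
        float sum = 0.0f;
        for (int i = 0; i < n; i++) { x[i] = expf(x[i] - max); sum += x[i]; }
        for (int i = 0; i < n; i++) x[i] /= sum;
    }

    /* sample one token index from that distribution */
    static int sample(const float *probs, int n) {
        float r = (float)rand() / (float)RAND_MAX;
        float cdf = 0.0f;
        for (int i = 0; i < n; i++) {
            cdf += probs[i];
            if (r < cdf) return i;
        }
        return n - 1;   /* fallback for floating-point rounding */
    }

Keeping N candidate tokens per step and expanding each of them is roughly what beam search does; the main cost is that every candidate needs its own forward pass (and its own KV cache).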
Is this for educational purposes only? Based on the success of llama.cpp and this one, it seems the industry is moving towards separate source code for every released model, rather than general-purpose frameworks like pytorch/tensorflow/onnxruntime?
"make more better tests to decrease yolo" haha
As someone who doesn’t work with languages like C, what’s the appeal of “in one file” or “header only”? Is it about dependency management?
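Partly, yes. For context, the common C "single-header" pattern (stb-style) looks roughly like this — mylib.h and mylib_add are made-up names; you drop the one .h into your project and exactly one .c file defines the implementation macro:

    /* mylib.h -- a made-up single-header library */
    #ifndef MYLIB_H
    #define MYLIB_H

    int mylib_add(int a, int b);           /* declarations for every includer */

    #ifdef MYLIB_IMPLEMENTATION            /* definitions for exactly one .c file */
    int mylib_add(int a, int b) { return a + b; }
    #endif

    #endif /* MYLIB_H */

So a lot of it is dependency management: no build system, no separate library to link, just copy a file in. (llama2.c is "one file" in the even simpler sense that the whole inference program is a single run.c you compile directly.)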
@karpathy, I could not get it to run. It exited while reading tokenizer.bin. Turns out that on Windows with Visual Studio, fopen needs to be called in binary mode, otherwise the read eventually fails.
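The fix is tiny — open the file in binary mode (a sketch of the change, not the literal line from run.c):

    /* "rb" instead of "r": Windows text mode translates CR/LF and treats
       0x1A as end-of-file, which corrupts binary reads; on POSIX 'b' is a no-op */
    FILE *file = fopen("tokenizer.bin", "rb");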
What is required to actually feed it text and then retrieve the results? So that instead of having it produce the story of Lily, it writes something different?
ohh that's some really nice, readable C code
Getting 220 tokens/sec with -Ofast on a 2018 iMac Pro.
"train a baby Llama 2 model in PyTorch, then inference it"
neat!
note that gcc's default optimisation level is 0, which really isn't what people normally want.
adding -O2 to the gcc command line should improve performance quite a bit.
What are some uses for this?
Never seen the word “inference” used as a verb.
Is the trained model available on Hugging Face?
It's been a while since I looked at some random source code and thought, hey, this is nice. This is also how code comments should be - I could follow it all because of them. Not too many or obvious ones, and not too few. I even got a chuckle from "poor man's C argparse".
Bravo!
Very dumb question from someone not steeped in the world of the latest LLM developments... does the C code have to invoke Python every time you pass it a prompt? What kind of permissions does it need?
I'm trying to think of some dataset to create and train this on. Would making a dataset full of axioms, say, influence the logic of the LLM's responses?
Seems like this could be suitable for masochists like me who wish to run language models on retro computers :)
Not that it is necessarily of value, but has anyone got an LLM to run on bare metal?
I wonder how much faster this thing will run with AVX-512.
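The hot spot would be the matmul / dot-product inner loop, so a hand-vectorized version might look roughly like this sketch (assumes an AVX-512F CPU, n divisible by 16, and building with -mavx512f; not code from the repo):

    #include <immintrin.h>

    /* dot product of two float vectors using 16-wide AVX-512 FMAs */
    static float dot_avx512(const float *a, const float *b, int n) {
        __m512 acc = _mm512_setzero_ps();
        for (int i = 0; i < n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            acc = _mm512_fmadd_ps(va, vb, acc);   /* acc += va * vb */
        }
        return _mm512_reduce_add_ps(acc);         /* horizontal sum */
    }

That said, token generation at these sizes tends to be memory-bandwidth bound, so wider SIMD may help less than you'd hope.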
For some reason I parsed this as one line of pure C.
Sounds like what Llama.cpp used to be.
This is amazing. One curious question: Why C? Why not standard C++?
Yay, fun to see it make its way to HN :) It turns out that my original checkpoint runs _way_ faster than I expected (100 tok/s) on a MacBook Air M1 with -O3 when compiling, so I am now training a bigger 44M model, which should still run interactively. Maybe the 7B Llama model is within reach... :thinking_emoji: