How I run LLMs locally

by Abishek_Muthian on 12/29/24, 10:49 AM with 231 comments
by upghost on 12/29/24, 10:52 PM

> Before I begin I would like to credit the thousands or millions of unknown artists, coders and writers upon whose work the Large Language Models (LLMs) are trained, often without due credit or compensation

I like this. If we insist on pushing forward with GenAI we should probably at least make some digital or physical monument like "The Tomb of the Unknown Creator".

Cause they sure as sh*t ain't gettin paid. RIP.

by chb on 12/30/24, 2:40 AM

I’m surprised to see no mention of AnythingLLM (https://github.com/Mintplex-Labs/anything-llm). I use it with an Anthropic API key, but am giving thought to extending it with a local LLM. It’s a great app: good file management for RAG, agents with web search, and a cross-platform desktop client; it can also easily be run as a server using Docker Compose.
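
For the server route, a minimal sketch of what that can look like with plain Docker (the image name, port, and storage variable are taken from my reading of the project's docs, so double-check the repo before relying on them):

    # Rough sketch: AnythingLLM as a self-hosted server (details may differ;
    # check the Mintplex-Labs/anything-llm README for current instructions).
    mkdir -p ~/anythingllm
    docker run -d \
      --name anythingllm \
      -p 3001:3001 \
      -v ~/anythingllm:/app/server/storage \
      -e STORAGE_DIR=/app/server/storage \
      mintplexlabs/anythingllm:latest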

NB: if you’re still paying $20/mo for a feature-poor chat experience that’s locked to a single provider, you should consider using any of the many wonderful chat clients that take a variety of API keys instead. You might find that your LLM utilization doesn’t quite fit a flat-rate model, and that the feature set of the third-party client is comparable to (or surpasses) that of the LLM provider’s.

edit: included repo link; note on API keys as alternative to subscription

by chown on 12/29/24, 10:17 PM

If anyone is looking for a one-click solution without having to have Docker running, try Msty - something I have been working on for almost a year. It has RAG and web search built in, among other features, and can connect to your Obsidian vaults as well.

https://msty.app

by rspoerri on 12/29/24, 7:25 PM

I run a pretty similar setup on an M2 Max with 96 GB.

For AI image generation specifically, though, I would rather recommend Krita with the https://github.com/Acly/krita-ai-diffusion plugin.

by throwaway314155 on 12/29/24, 8:52 PM

Open WebUI sure does pull in a lot of dependencies... Do I really need all of langchain, pytorch, and plenty of others for what is advertised as a _frontend_?

Does anyone know of a lighter/minimalist version?

by halyconWays on 12/29/24, 6:37 PM

Super basic intro, but perhaps useful. It doesn't mention quant sizes, which are important when you're GPU poor. There are lots of other client-side things you can do too, like KoboldAI, TavernAI, Jan, Langfuse for observability, and CogVLM2 for a vision model.
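
On quant sizes, a rough rule of thumb: weight memory is about parameters × bits-per-weight ÷ 8, plus a gigabyte or two for the KV cache and runtime overhead. Treat this as a back-of-the-envelope estimate, not an exact formula:

    # Rough VRAM estimate for a quantized model (illustrative numbers only).
    params_b=8    # parameters, in billions (e.g. an 8B model)
    bits=4        # quantization width (Q4_K_M is closer to ~4.5 bits/weight)
    overhead=2    # GB for KV cache, buffers, etc.; grows with context length
    echo "$params_b * $bits / 8 + $overhead" | bc -l   # ~6 GB for an 8B model at Q4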

One of the best places to get the latest info on what people are doing with local models is /lmg/ on 4chan's /g/.

by ashleyn on 12/29/24, 10:07 PM

Has anyone got a guide on setting up and running the business-class stuff (70B models over multiple A100s, etc.)? I'd be willing to spend the money, but only if I could get a good guide on how to set everything up, what hardware goes with what motherboard/RAM/CPU, etc.
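
(For context, the software-side starting point I've seen mentioned most is tensor-parallel serving with vLLM, roughly like the sketch below - assuming working CUDA drivers and the vllm package, with an illustrative model name and flag values - but it's the hardware pairing that I'm really after.)

    # Rough sketch: serve a 70B model across 4 GPUs with tensor parallelism.
    pip install vllm
    vllm serve meta-llama/Llama-3.1-70B-Instruct \
      --tensor-parallel-size 4 \
      --dtype bfloat16 \
      --max-model-len 8192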

by Salgat on 12/29/24, 6:23 PM

There is a lot I want to do with LLMs locally, but it seems like we're still not quite there hardware-wise (well, within reasonable cost). For example, Llama's smaller models take upwards of 20 seconds to generate a brief response on a 4090; at that point I'd rather just use an API to a service that can generate it in a couple seconds.
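
A quick way to sanity-check whether a setup like this is actually using the GPU (assuming ollama is installed; the model tag is illustrative) is to time a short generation and confirm where the model is loaded:

    # Time a short generation, then check whether the model is resident on the GPU.
    time ollama run llama3.1:8b "Summarize the plot of Hamlet in two sentences."
    ollama ps   # the processor column should report GPU rather than CPU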

by pvo50555 on 12/29/24, 7:15 PM

There was a post a few weeks back (or a reply to a post) showing an app made entirely using an LLM. It was like a 3D globe made with three.js, and I believe the poster had created it locally on his M4 MacBook with 96 GB RAM? I can't recall which model it was or what else the app did, but maybe someone knows what I'm talking about?

by dividefuel on 12/29/24, 8:29 PM

What GPU offers a good balance between cost and performance for running LLMs locally? I'd like to do more experimenting, and am due for a GPU upgrade from my 1080 anyway, but would like to spend less than $1600...

by Der_Einzige on 12/29/24, 8:27 PM

Still nothing better than oobabooga (https://github.com/oobabooga/text-generation-webui) in terms of a maximalist / "Pro" / "Prosumer" LLM UI/UX, à la Blender, Photoshop, Final Cut Pro, etc.

It's embarrassing, and any VCs reading this can contact me to talk about how to fix that. LM Studio is today the closest competition (but not close enough), and Adobe or Microsoft could do it if they fired the current folks who are preventing it from happening.

If you're not using oobabooga, you're likely not playing with the settings on your models, and if you're not playing with your models' settings, you're hardly even scratching the surface of their total capabilities.
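
For anyone wondering what "playing with the settings" means in practice, these are the kinds of sampler knobs involved (llama.cpp-style flags; the values below are illustrative, not recommendations):

    # Common sampler settings to experiment with (llamafile / llama.cpp flags).
    ./llamafile -m llm.gguf -p "Write a limerick about GPUs." \
      --temp 0.7 \
      --top-k 40 \
      --top-p 0.9 \
      --min-p 0.05 \
      --repeat-penalty 1.1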

by gulan28 on 12/29/24, 9:06 PM

You can try out https://wiz.chat (my project) if you want to run Llama in your web browser. It still needs a GPU and the latest version of Chrome, but it's fast enough for my usage.

by coding123 on 12/30/24, 1:35 AM

We will at some point have a JS API for running a preliminary LLM locally to make client-side decisions, with the server as the final arbiter. For example, a comment-rage moderator could help end users revise a proposed post while they write it, so the comment doesn't turn into rage bait. That is best done locally in the user's browser; then, when they're ready to post, the server does one final check. It would be like today's React front ends doing all the state and UI computation, relieving servers from having to render HTML.

by jokethrowaway on 12/29/24, 6:20 PM

I have a similar PC and I use text-generation-webui and mostly exllama-quantized models.

I also deploy text-generation-webui for clients on k8s with GPUs, for similar reasons.

Last I checked, llamafile / ollama are not as optimised for GPU use.
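
(That said, llamafile will use the GPU if layers are offloaded explicitly; roughly like the following, where the flags follow llama.cpp conventions, so treat it as a sketch.)

    # Offload as many layers as possible to the GPU; a large -ngl value
    # effectively means "all layers that fit".
    ./llamafile -m model.gguf -ngl 999 -p "Hello"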

For image generation I moved from the AUTOMATIC1111 web UI to ComfyUI a few months ago - they're different beasts; for some workflows Automatic is easier to use, but for most tasks you can create a better workflow with enough Comfy extensions.

FaceFusion warrants a mention for face swapping.

by novok on 12/30/24, 8:37 PM

As a piece of writing feedback, I would convert your citation links into normal links. Clicking on the citation doesn't jump to the link or the citation entry, and you are basically using hyperlinks anyway.

by mikestaub on 12/30/24, 2:52 PM

I just use MLC with WebGPU: https://codepen.io/mikestaub/pen/WNqpNGg

by prettyblocks on 12/30/24, 2:31 AM

> I have a laptop running Linux with core i9 (32 threads) CPU, 4090 GPU (16GB VRAM) and 96 GB of RAM.

Is there somewhere I can find a computer like this pre-built?

by erickguan on 12/30/24, 12:04 AM

How much memory can models take? I would assume the dGPU setup won't perform better past a certain point.

by koinedad on 12/29/24, 5:49 PM

Helpful summary, short but useful

by masteruvpuppetz on 12/30/24, 7:29 AM

David Bombal interviews a mysterious man who shows how he uses AI/LLMs for his automated LinkedIn posts and other tasks: https://www.youtube.com/watch?v=vF-MQmVxnCs

by sturza on 12/29/24, 6:30 PM

The 4090 has 24 GB of VRAM, not 16.

by dumbfounder on 12/29/24, 7:19 PM

Updation. That’s a new word for me. I like it.

by thangalin on 12/29/24, 9:42 PM

run.sh:

    #!/usr/bin/env bash

    set -eu
    set -o errexit
    set -o nounset
    set -o pipefail

    readonly SCRIPT_SRC="$(dirname "${BASH_SOURCE[${#BASH_SOURCE[@]} - 1]}")"
    readonly SCRIPT_DIR="$(cd "${SCRIPT_SRC}" >/dev/null 2>&1 && pwd)"
    readonly SCRIPT_NAME=$(basename "$0")

    # Avoid issues when wine is installed.
    sudo su -c 'echo 0 > /proc/sys/fs/binfmt_misc/status'

    # Graceful exit to perform any clean up, if needed.
    trap terminate INT

    # Exits the script with a given error level.
    function terminate() {
      level=10

      if [ $# -ge 1 ] && [ -n "$1" ]; then level="$1"; fi

      exit $level
    }

    # Concatenates multiple files.
    join() {
      local -r prefix="$1"
      local -r content="$2"
      local -r suffix="$3"

      printf "%s%s%s" "$(cat ${prefix})" "$(cat ${content})" "$(cat ${suffix})"
    }

    # Swapping this symbolic link allows swapping the LLM without script changes.
    readonly LINK_MODEL="${SCRIPT_DIR}/llm.gguf"

    # Dereference the model's symbolic link to its path relative to the script.
    readonly PATH_MODEL="$(realpath --relative-to="${SCRIPT_DIR}" "${LINK_MODEL}")"

    # Extract the file name for the model.
    readonly FILE_MODEL=$(basename "${PATH_MODEL}")

    # Look up the prompt format based on the model being used.
    readonly PROMPT_FORMAT=$(grep -m1 ${FILE_MODEL} map.txt | sed 's/.*: //')

    # Guard against missing prompt templates.
    if [ -z "${PROMPT_FORMAT}" ]; then
      echo "Add prompt template for '${FILE_MODEL}'."
      terminate 11
    fi

    readonly FILE_MODEL_NAME=$(basename $FILE_MODEL)

    if [ -z "${1:-}" ]; then
      # Write the output to a name corresponding to the model being used.
      PATH_OUTPUT="output/${FILE_MODEL_NAME%.*}.txt"
    else
      PATH_OUTPUT="$1"
    fi

    # The system file defines the parameters of the interaction.
    readonly PATH_PROMPT_SYSTEM="system.txt"

    # The user file prompts the model as to what we want to generate.
    readonly PATH_PROMPT_USER="user.txt"

    readonly PATH_PREFIX_SYSTEM="templates/${PROMPT_FORMAT}/prefix-system.txt"
    readonly PATH_PREFIX_USER="templates/${PROMPT_FORMAT}/prefix-user.txt"
    readonly PATH_PREFIX_ASSIST="templates/${PROMPT_FORMAT}/prefix-assistant.txt"

    readonly PATH_SUFFIX_SYSTEM="templates/${PROMPT_FORMAT}/suffix-system.txt"
    readonly PATH_SUFFIX_USER="templates/${PROMPT_FORMAT}/suffix-user.txt"
    readonly PATH_SUFFIX_ASSIST="templates/${PROMPT_FORMAT}/suffix-assistant.txt"

    echo "Running: ${PATH_MODEL}"
    echo "Reading: ${PATH_PREFIX_SYSTEM}"
    echo "Reading: ${PATH_PREFIX_USER}"
    echo "Reading: ${PATH_PREFIX_ASSIST}"
    echo "Writing: ${PATH_OUTPUT}"

    # Capture the entirety of the instructions to obtain the input length.
    readonly INSTRUCT=$(
      join ${PATH_PREFIX_SYSTEM} ${PATH_PROMPT_SYSTEM} ${PATH_SUFFIX_SYSTEM}
      join ${PATH_PREFIX_USER} ${PATH_PROMPT_USER} ${PATH_SUFFIX_USER}
      join ${PATH_PREFIX_ASSIST} "/dev/null" ${PATH_SUFFIX_ASSIST}
    )

    (
      echo "${INSTRUCT}"
    ) | ./llamafile \
      -m "${LINK_MODEL}" \
      -e \
      -f /dev/stdin \
      -n 1000 \
      -c ${#INSTRUCT} \
      --repeat-penalty 1.0 \
      --temp 0.3 \
      --silent-prompt > ${PATH_OUTPUT}

      #--log-disable \

    echo "Outputs: ${PATH_OUTPUT}"

    terminate 0

map.txt:

    c4ai-command-r-plus-q4.gguf: cmdr
    dare-34b-200k-q6.gguf: orca-vicuna
    gemma-2-27b-q4.gguf: gemma
    gemma-2-7b-q5.gguf: gemma
    gemma-2-Ifable-9B.Q5_K_M.gguf: gemma
    llama-3-64k-q4.gguf: llama3
    llama-3-1048k-q4.gguf: llama3
    llama-3-1048k-q8.gguf: llama3
    llama-3-8b-q4.gguf: llama3
    llama-3-8b-q8.gguf: llama3
    llama-3-8b-1048k-q6.gguf: llama3
    llama-3-70b-q4.gguf: llama3
    llama-3-70b-64k-q4.gguf: llama3
    llama-3-smaug-70b-q4.gguf: llama3
    llama-3-giraffe-128k-q4.gguf: llama3
    lzlv-q4.gguf: alpaca
    mistral-nemo-12b-q4.gguf: mistral
    openorca-q4.gguf: chatml
    openorca-q8.gguf: chatml
    quill-72b-q4.gguf: none
    qwen2-72b-q4.gguf: none
    tess-yi-q4.gguf: vicuna
    tess-yi-q8.gguf: vicuna
    tess-yarn-q4.gguf: vicuna
    tess-yarn-q8.gguf: vicuna
    wizard-q4.gguf: vicuna-short
    wizard-q8.gguf: vicuna-short

Templates (all the template directories contain the same set of file names, but differ in content):

    templates/
    β”œβ”€β”€ alpaca
    β”œβ”€β”€ chatml
    β”œβ”€β”€ cmdr
    β”œβ”€β”€ gemma
    β”œβ”€β”€ llama3
    β”œβ”€β”€ mistral
    β”œβ”€β”€ none
    β”œβ”€β”€ orca-vicuna
    β”œβ”€β”€ vicuna
    └── vicuna-short
        β”œβ”€β”€ prefix-assistant.txt
        β”œβ”€β”€ prefix-system.txt
        β”œβ”€β”€ prefix-user.txt
        β”œβ”€β”€ suffix-assistant.txt
        β”œβ”€β”€ suffix-system.txt
        └── suffix-user.txt

If there's interest, I'll make a repo.
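
For context, the llama3 templates follow Meta's published chat format, roughly like this (an illustration of that format rather than exact file contents; the assistant suffix is typically left empty so the model can generate the reply):

    templates/llama3/prefix-system.txt:     <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    templates/llama3/suffix-system.txt:     <|eot_id|>
    templates/llama3/prefix-user.txt:       <|start_header_id|>user<|end_header_id|>
    templates/llama3/suffix-user.txt:       <|eot_id|>
    templates/llama3/prefix-assistant.txt:  <|start_header_id|>assistant<|end_header_id|>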

by amazingamazing on 12/29/24, 9:16 PM

I've never seen the point of running locally. Not cost effective, worse models, etc.

by farceSpherule on 12/29/24, 11:21 PM

I pay for ChatGPT Teams. Much easier and better than this.

by deadbabe on 12/29/24, 6:32 PM

My understanding is that local LLMs are mostly just toys that output basic responses and simply can't compete with full LLMs trained with $60 million+ worth of compute time. No matter how good hardware gets, larger companies will always have even better hardware and resources to produce even better results, so this is basically pointless for anything competitive or serious. Is that accurate?