The AI Application Stack: The Model

This is Part 2 of a series on the AI Application Stack.

Let’s start with a bottom-up view. What is the LLM? I’m not going to try to explain how LLMs work at the lowest level, but the right abstraction seems to be that an LLM is a “completion” algorithm in a box.

You provide a document, it completes the document.

So if you were to provide a document like:

Write me a haiku about

A basic LLM response might be:

Love.

Or, when you are using a coding-specific LLM in your code editor, you might type some code like:

content, err := os.ReadFile("gophers.txt")
if err !=

Your editor makes a request with the snippet of code, and the completion LLM tries to finish the code block for you:

nil {
    return nil, err
}

Provide a string request, get a string response. This is called “Inference”.

graph LR;
    A[User] -->|String| B[LLM];
    B -->|String| A;
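
To make the “completion in a box” idea concrete, here is a toy sketch in Python. It is nothing like a real LLM: the “model” is just a table counting which word follows which in a tiny made-up corpus, and the names train_bigrams and complete are invented for illustration. The shape is the same, though: a string goes in and the most likely continuation comes out.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus: str) -> dict:
    """Count, for every word, which words follow it in the corpus."""
    table = defaultdict(Counter)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table


def complete(table: dict, document: str, max_new_words: int = 5) -> str:
    """Repeatedly append the most frequent next word; return only the new words."""
    words = document.lower().split()
    generated = []
    for _ in range(max_new_words):
        if not words:
            break
        candidates = table.get(words[-1])
        if not candidates:
            break  # the toy model has never seen this word, so it stops
        next_word = candidates.most_common(1)[0][0]
        words.append(next_word)
        generated.append(next_word)
    return " ".join(generated)


# A made-up training corpus; a real model is trained on vastly more text.
corpus = (
    "write me a haiku about love . "
    "write me a song about love . "
    "write me a poem about rain ."
)
table = train_bigrams(corpus)
completion = complete(table, "Write me a haiku about", max_new_words=1)
print(completion)
```

A real LLM predicts the next token with a neural network rather than a lookup table, but the loop of “predict one piece, append it, repeat” is the same idea.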

Running The Model

The LLM is a combination of Data, Structure and Code.

Data: Billions of numbers that are the result of training the Model.
Structure: The layout of how all those numbers are related to each other.
Code: The glue that allows us to send our String to the model and get one back.

All three components typically exist within the Python ecosystem. The data itself is often originally distributed as a serialized Python data structure in a format known as “pickle” (helpful explanation here). The data structure itself is usually from the PyTorch library.
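
As a rough illustration of how Data, Structure and Code fit together, here is a minimal sketch that uses only Python’s standard library. The weight names and the forward function are hypothetical stand-ins; a real checkpoint holds PyTorch tensors, not plain lists.

```python
import pickle

# Data: the numbers themselves (a real model has billions, this toy has four).
weights = {"linear.weight": [0.5, -1.0, 2.0], "linear.bias": 0.25}

# Serialize to bytes, the way a checkpoint file would be written, then load it back.
blob = pickle.dumps(weights)
restored = pickle.loads(blob)


# Structure and Code: how the numbers relate, and the function that runs them.
def forward(params: dict, inputs: list) -> float:
    """A single linear layer: dot product of weights and inputs, plus bias."""
    weighted = sum(w * x for w, x in zip(params["linear.weight"], inputs))
    return weighted + params["linear.bias"]


output = forward(restored, [1.0, 2.0, 3.0])
print(output)
```

One caveat worth knowing: unpickling can execute arbitrary code, so only load pickle files from sources you trust.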

The llama model, released by Meta, comes with a reference implementation that can be used for inference. In theory, every model needs to come with code that shows how to use it. But often models are based on other models with slight tweaks, so this can get messy really fast.

HuggingFace

The center of the Local LLM world is a website, huggingface.co. It’s like GitHub for Machine Learning and AI. There is a directory of models for all sorts of use-cases, including LLMs that complete text. But most critically, there is a Python library ecosystem for using the models that are listed in the directory. Model builders build to the API. Model users can then use this API to make use of the Model. It looks something like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

This is from mistral-instruct’s model card.
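
Notice that even this chat-style API is still “provide a document, get a completion” underneath: apply_chat_template flattens the list of messages into one string before the model sees it. Here is a hand-written approximation of that step for this model. The function name render_mistral_prompt is made up, and the exact whitespace and special tokens are defined by the tokenizer’s own template, so treat this as a sketch rather than the library’s implementation.

```python
BOS, EOS = "<s>", "</s>"  # begin/end-of-sequence markers, assumed for this sketch


def render_mistral_prompt(messages: list) -> str:
    """Flatten chat messages into a single [INST] ... [/INST] style document."""
    parts = [BOS]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            parts.append(f"{message['content']}{EOS} ")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Fresh lemon juice."},
    {"role": "user", "content": "Do you have mayonnaise recipes?"},
]
prompt = render_mistral_prompt(messages)
print(prompt)
```

The prompt ends right after the last [/INST], so the most natural “completion” of the document is the assistant’s next reply.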

The model also comes with a configuration file that tells the Python library how to use the model:

{
  "architectures": [
    "MistralForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.34.0.dev0",
  "use_cache": true,
  "vocab_size": 32000
}
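
A few of these numbers are enough to work out how big the model is. The sketch below counts parameters from the config, assuming a Llama-style decoder layout (four attention projections, a three-matrix MLP, two norms per layer, no bias terms). That layout is my assumption; it is not spelled out in the config file itself.

```python
import json

# Only the fields needed for counting, copied from the config above.
config = json.loads("""
{
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "tie_word_embeddings": false,
  "vocab_size": 32000
}
""")


def count_parameters(cfg: dict) -> int:
    hidden = cfg["hidden_size"]
    head_dim = hidden // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]  # fewer key/value heads than query heads

    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q and o, then k and v
    mlp = 3 * hidden * cfg["intermediate_size"]            # gate, up and down projections
    norms = 2 * hidden                                     # two norm vectors per layer
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * hidden
    output_head = 0 if cfg["tie_word_embeddings"] else embeddings
    final_norm = hidden
    return embeddings + cfg["num_hidden_layers"] * per_layer + final_norm + output_head


total = count_parameters(config)
print(f"{total:,} parameters")
```

Under those assumptions the total lands a little over 7.2 billion, which is where the “7B” in the model name comes from.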

There is a lot to understand. However, I realize now that if I had started here I would have understood more quickly how this all works. What I did instead was make use of some of the new projects that make running local LLMs easier. Does it need to be easier than this? Well, when I try to run the above code on my Linux system with an NVIDIA RTX 3090, does it work?

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[2], line 10
      7 encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
      9 model_inputs = encodeds.to(device)
---> 10 model.to(device)

When I check the memory usage on the GPU, what do I see?

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2183      C   /home/rhettg/Projects/vllm/bin/python      9772MiB |
+---------------------------------------------------------------------------------------+

Unfortunately even these small 7b parameter models require more memory than my 10GB GPU.
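
The back-of-the-envelope math makes the failure unsurprising. This sketch assumes a round 7 billion weights and counts only the weights themselves, ignoring activations and any other overhead. The helper name weight_memory_gib is made up for illustration.

```python
GIB = 1024 ** 3
PARAMETERS = 7_000_000_000  # assumption: a round "7b" model
GPU_MEMORY_GIB = 10         # the GPU memory figure quoted above


def weight_memory_gib(parameter_count: int, bits_per_weight: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return parameter_count * bits_per_weight / 8 / GIB


for label, bits in [("32-bit floats", 32), ("16-bit floats", 16)]:
    needed = weight_memory_gib(PARAMETERS, bits)
    verdict = "fits" if needed <= GPU_MEMORY_GIB else "does not fit"
    print(f"{label}: {needed:.1f} GiB, {verdict} in {GPU_MEMORY_GIB} GiB")
```

Even at 16 bits per weight, the weights alone need more memory than the card has.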

llama.cpp

Local LLMs weren’t on the scene for long before folks wanted to see if they could get them to work on their MacBooks. There are pretty significant differences between the environment an LLM is built on (big Linux systems with massive GPUs) and personal compute like an M2 MacBook.

llama.cpp was developed to make running models more efficient in these environments. Not only does it sidestep the Python ecosystem by implementing all the code in C++, it also invented new serialization formats that support shrinking the size of the model.

As trained, the weights of an LLM are floating-point values, typically 16 or 32 bits each. At billions of weights, that really adds up. Typically you need all those weights in memory to be useful. This isn’t a problem if you have 100GB of VRAM on your massive GPU, but I do not. If you’re reading this, you don’t either.

“Quantization” takes those weights and represents them at lower precision. Remarkably, at only 4 bits the models seem to still be functional. Not as accurate, but functional. Critically it means the models require a lot less memory and can run in more places.
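
Here is a minimal sketch of the idea, using a simple symmetric “absmax” scheme on one small block of weights. This is not llama.cpp’s actual format, which is more elaborate; the function names and sample weights are made up to show the principle.

```python
def quantize_block(weights: list, bits: int = 4) -> tuple:
    """Map floats to small signed integers that share one scale per block."""
    levels = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    largest = max(abs(w) for w in weights)
    scale = largest / levels if largest else 1.0
    codes = [round(w / scale) for w in weights]
    return scale, codes


def dequantize_block(scale: float, codes: list) -> list:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.7, 0.33, 0.05, -0.21, 0.6, 0.0, -0.44]
scale, codes = quantize_block(weights)
approx = dequantize_block(scale, codes)
worst_error = max(abs(w - a) for w, a in zip(weights, approx))

print("codes:", codes)
print(f"worst rounding error: {worst_error:.3f}")
```

Each weight is now stored as a 4-bit code plus a share of one scale per block, and the price is a rounding error of at most half the scale on every weight.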

For example, llama2 7b (that’s 7 billion weights) requires 13GB at 16-bit precision. Quantized to 4-bit, that requirement drops to 3.9GB. Note that my GPU above has 10GB.

LLMs under llama.cpp can even run on just a CPU, no GPU required!

ollama

Running llama.cpp can be a little tricky. It’s a C++ application. The user needs to manage the model files and integrate it into an AI application by managing a system prompt and various parameters.

ollama provides an elegant interface, similar to Docker, for downloading models and getting up and running quickly. It’s very friendly and easy to use for folks who are comfortable on the command line.

Combined with OllamaHub, it seems like the beginning of an ecosystem for sharing not only models, but also custom-configured system prompts and templates, like OpenAI’s GPTs.

Fundamentally, Ollama is an easy way to run llama.cpp and provides some nice wrappers around the llama.cpp API.
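
Once ollama is running, it also listens for HTTP requests on your machine, so the “string in, string out” box is one request away. Below is a small sketch using only the Python standard library. It assumes the default local address and a model named llama2 that has already been pulled; the helper names are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default local endpoint


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming completion request for a locally running ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the completion text (needs ollama running)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as response:
        return json.loads(response.read())["response"]


# Building the request needs no server; sending it does.
request = build_generate_request("llama2", "Write me a haiku about")
print(request.full_url)
```

With the server up and the model downloaded (for example via ollama pull llama2), calling generate("llama2", "Write me a haiku about") should return the completion as a plain string.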

llamafile

If you still think installing software and a model is difficult, llamafile makes getting started running a local LLM even easier.

It’s just one file. Seriously. Just download, for example, llava-v1.5-7b-q4-server.llamafile and run it. It works on Linux, Mac or Windows.

Under the hood, this is making use of some clever compiler tricks to distribute a single file that contains llama.cpp, the model, and whatever configuration is necessary to run the model on multiple architectures and environments.

The lack of dependencies means you can reasonably archive an LLM to a USB stick and have a backup brain in case the apocalypse destroys the internet. We should probably put some of these in the Arctic Seed Vault.

vLLM

On the other side of the spectrum, running models in production has a different set of tradeoffs. Rather than attempting to run slower on more restricted hardware, the production user likely wants to get the most out of their compute resources.

vLLM is a project that uses a technique called PagedAttention to run models efficiently in production environments.

We saw previously that the LLM weights themselves require a lot of GPU memory. The other big consumer of memory is the set of tokens that make up the request itself, or more precisely the attention keys and values the model caches for each token. For example, a 13B parameter model can consume another 1.7GB just for the request itself. Shuffling this much memory around for every request is very expensive.
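
The 1.7GB figure can be reproduced with a little arithmetic. For each token, the model caches one key vector and one value vector in every layer. The sketch below assumes LLaMA-13B’s shape (40 layers, hidden size 5120), 16-bit values, and a full 2048-token sequence. None of those numbers appear above, so treat them as my assumptions.

```python
def kv_cache_bytes(num_layers: int, hidden_size: int, num_tokens: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    keys_and_values = 2  # one key vector plus one value vector per token per layer
    return keys_and_values * num_layers * hidden_size * bytes_per_value * num_tokens


# Assumed LLaMA-13B shape: 40 layers, hidden size 5120, 16-bit cache entries.
per_token = kv_cache_bytes(num_layers=40, hidden_size=5120, num_tokens=1)
full_sequence = kv_cache_bytes(num_layers=40, hidden_size=5120, num_tokens=2048)

print(f"per token:     {per_token / 1e3:.0f} KB")
print(f"2048 tokens:   {full_sequence / 1e9:.2f} GB")
```

That is roughly 800KB of cache for every single token, which is why managing this memory carefully pays off when serving many requests at once.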

Benchmarks indicate that vLLM increases throughput (tokens generated) by up to 24x over the HuggingFace reference implementations.

Conclusion

Running a local LLM is easier than you might think. I would start with a llamafile just to see how cool it is.

The minute you want to start building and not just tinkering, learning llama.cpp directly is the next step.

If you are blessed with real compute resources like NVIDIA A100s and cloud computing, running your model with either vLLM or the HuggingFace implementations is likely the right place to spend your time.

The next post will get into some more particulars about writing applications against these models. Your use cases will obviously inform how you want to run the model as well.

@rhettg