The AI Application Stack

December 2023 · 4 minute read

Since April, I’ve been tinkering in the intriguing world of Large Language Models (LLMs). This exploration, facilitated by platforms like OpenAI’s API, aligns well with my expertise in computer systems and my passion for creating applications. This journey hasn’t necessarily been about understanding machine learning (an area I unfortunately know little about) but more about how to effectively build the software stack that these systems require.

The latest shakeup at OpenAI has me (and, well, just about everyone in the space) more interested in the world of Open Source LLMs. Running your own LLMs is a daunting task for a production application, but for the tinkerer it’s an appealing excuse to understand more deeply how these components work.

In the following series, I’ll share insights and lessons on the essential elements of an LLM Application Stack. I’m not an expert. Those are in very short supply. But I can at least share what I’ve learned so far for anyone else going down the same twisty path.

In a general sense, you, the User, have a question, request or instruction for a Large Language Model (LLM). It’s a string:

Write me a haiku about Gophers

You expect a response like:

In fields they wander,
Earth’s playful, furry diggers,
Beneath sun, they thrive.

For the user, maybe the stack feels like this:

graph LR;
    A[User] -->|Query| B[LLM];
    B -->|Response| A;

If you start working with the OpenAI API it becomes clear there is more to an AI Application than just the Model. It isn’t trivial to write your own Chat interface. The AI Application needs to keep track of the history of the conversation. The model has a limited “Context” size (how long that conversation can be) and so there must be some mechanism for forgetting earlier parts of the conversation.
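To make the "forgetting" idea concrete, here is a minimal sketch of a conversation history that drops its oldest messages once it outgrows a budget. Everything in it is an assumption for illustration: the `Conversation` class, the `max_tokens` budget, and especially `estimate_tokens`, which counts words as a crude stand-in for a real tokenizer. A real application would likely use the model's own tokenizer and might summarize old messages rather than discard them.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


@dataclass
class Conversation:
    # Hypothetical budget standing in for the model's context size.
    max_tokens: int
    messages: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        """Record a message, then forget old ones if the history is too long."""
        self.messages.append({"role": role, "content": content})
        self._forget_oldest()

    def total_tokens(self) -> int:
        return sum(estimate_tokens(m["content"]) for m in self.messages)

    def _forget_oldest(self) -> None:
        # Drop the oldest non-system message until the history fits the budget.
        # The newest message is always kept, even if it alone exceeds the budget.
        while self.total_tokens() > self.max_tokens:
            droppable = [
                i for i, m in enumerate(self.messages[:-1]) if m["role"] != "system"
            ]
            if not droppable:
                break
            del self.messages[droppable[0]]
```

Keeping the system message while trimming the oldest user and assistant turns is one possible policy among several; it is chosen here only because the system message usually carries instructions that should apply to the whole conversation.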

You can see how OpenAI’s API is structured to suit the needs of the ChatGPT client.

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

The API responds with:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-3.5-turbo-0613",
  "system_fingerprint": "fp_44709d6fcb",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "\n\nHello there, how may I assist you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

The application could then make a follow-up request that includes the original messages, this latest assistant response message, and a new follow-up user message. Now you have a conversation!
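As a rough sketch of that follow-up step, the function below builds the next request body from the earlier messages, the API response, and the user's new message. The function name `follow_up_request` and the example follow-up text are my own inventions for illustration; the field names come from the request and response shown above. It only builds the payload and does not send anything.

```python
import json


def follow_up_request(previous_messages, response, user_text, model="gpt-3.5-turbo"):
    """Build the JSON body for the next turn of the conversation."""
    # The assistant's reply lives in the first choice of the response.
    assistant_message = response["choices"][0]["message"]
    messages = list(previous_messages)  # copy so the caller's history is untouched
    messages.append(
        {"role": assistant_message["role"], "content": assistant_message["content"]}
    )
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages}


# Example data mirroring the request and response above.
previous = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
response = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nHello there, how may I assist you today?",
            },
            "finish_reason": "stop",
        }
    ]
}
payload = follow_up_request(previous, response, "Write me a haiku about Gophers")
print(json.dumps(payload, indent=2))
```

The point to notice is that the API keeps no memory between calls: the application resends the whole conversation every time, which is why the history grows with each turn.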

Critically, the model itself must also understand what a “chat” is. The “completion” reference in the API method is telling: an LLM is fundamentally completing a document with the text its training suggests is most likely to come next.
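One way to picture this is to flatten the chat messages into a single text document that ends where the assistant's reply should begin, so that "completing the document" means writing the reply. The sketch below does this with a made-up `Role: content` layout. This layout is purely an assumption for illustration; real models are trained on their own chat templates, often with special tokens, and I don't know the format OpenAI uses internally.

```python
def render_document(messages):
    """Flatten chat messages into one text document for the model to continue."""
    # Hypothetical layout: one "Role: content" line per message.
    lines = [f"{m['role'].capitalize()}: {m['content']}" for m in messages]
    # End with an open assistant turn so the likely continuation is a reply.
    lines.append("Assistant:")
    return "\n".join(lines)


document = render_document(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
)
print(document)
```

A model that has seen many documents shaped like this during training would tend to continue the text after the final `Assistant:` with something that reads like an assistant's answer.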

With this functionality in mind we can break up our AI Application into a few more components:

graph LR;
    A[User] -->|Query| B[Application];
    B -->|Request| C[API];
    C --> D[Document];
    D --> M[LLM];
    M --> C;
    C -->|Response| B;
    B -->|Answer| A;

I want to dig into each of these parts in turn.

Again, I’m not necessarily an expert in all these areas. I’m sure few are. But I’ve learned enough to start sharing. I hope others find this useful.