The AI Application Stack: The Application

February 2024 · 10 minute read

This is Part 5 of a series on the AI Application Stack


In Part 4 I highlighted the Prompt’s role as the key component for interacting with an LLM, guiding the model’s responses to deliver useful results.

In this final part of the series I will describe how we build applications that compose a prompt and deliver useful results to end Users.

The Chat

The previous parts of this series have already described the basic components of an LLM necessary to implement a Chat application: the application formats the conversation so far as a document, with the User's latest instruction at the end.

The LLM will then complete the document with the answer to that instruction.

Chain several of these together and that’s a Chat.

The application’s purpose, then, is to construct the relevant prompt, send it to the API, and display the result to the User. To build on our diagram from Part 1:

graph LR;
    A[User] -->|Query| B[Application];
    B -->|Query| F[Database];
    B --> P[Prompt];
    P -->|Request| C[API];
    C --> D[Document];
    D --> M[LLM];
    M --> C;
    C -->|Response| B;
    B -->|Answer| A;

The application must maintain a sense of state, mediating the user’s interaction with the LLM. This is reasonable to implement in the same way any Web application might:

  1. A user views a web page with all the previous chat messages served from a database.
  2. They fill out a form with a new chat message and post it to the application.
  3. The application saves the message to the database.
  4. The application constructs a prompt containing previous chat messages and sends a request to an LLM API.
  5. The application retrieves the response and saves it to the database.
  6. The application generates a new page with the new resulting chat message.
  7. Go to step 2 and repeat for additional user messages.
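The loop above can be sketched in a few lines. This is a minimal sketch, not a production design: `llm_complete` is a stub standing in for a real completion API call, and the schema is the simplest thing that works.

```python
import sqlite3

def llm_complete(prompt: str) -> str:
    # Stub: a real application would POST the prompt to an LLM API.
    return "Paris is the capital of France."

def build_prompt(messages) -> str:
    # Format the stored conversation as a transcript for the model to complete.
    lines = [f"{role}: {text}" for role, text in messages]
    lines.append("Assistant:")
    return "\n".join(lines)

def handle_user_message(db, text: str) -> str:
    # Steps 3-6: save the message, build the prompt, call the API, save the reply.
    db.execute("INSERT INTO messages (role, text) VALUES ('User', ?)", (text,))
    history = db.execute("SELECT role, text FROM messages ORDER BY id").fetchall()
    answer = llm_complete(build_prompt(history))
    db.execute("INSERT INTO messages (role, text) VALUES ('Assistant', ?)", (answer,))
    return answer

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, role TEXT, text TEXT)")
print(handle_user_message(db, "What is the capital of France?"))
```

Serving the saved messages back out as a web page (steps 1 and 6) is ordinary web development, which is rather the point.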

Complications and Tradeoffs

The most straightforward chat implementations do just what is described above. But this simple implementation will quickly run into some limitations:

  - The prompt grows with every message, and the model's context window is finite.
  - The model can only generate text; getting it to do structured work takes effort.
  - The model only knows what it was trained on, with no access to your information.

There are other important aspects of a “production” application. For example, depending on your use case, you might not want to answer certain types of questions (or worse, do certain types of “work”). There are also performance and cost considerations, where allowing anyone on the internet to send arbitrary numbers of tokens to an expensive API is not a good idea. I’m not going to attempt to tackle those here. A close consideration of “Prompt Injection” is advised.

Prompt Size

Limiting the chat prompt window is necessary to keep users from chatting until the prompt no longer fits in the model's context. There are a few techniques I’ve seen to address this:

  - Truncate the history, dropping the oldest messages once a budget is reached.
  - Summarize earlier messages into a short recap that replaces them in the prompt.
  - Retrieve only the past messages relevant to the current query rather than all of them.
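The simplest of these, truncation, can be sketched as follows. This is an assumption-laden sketch: real systems budget in tokens using the model's tokenizer, while this toy version counts characters.

```python
def trim_history(messages, budget_chars=500):
    """Keep the most recent messages that fit in the budget; drop the oldest first."""
    kept, used = [], 0
    for role, text in reversed(messages):  # walk newest to oldest
        if used + len(text) > budget_chars:
            break
        kept.append((role, text))
        used += len(text)
    return list(reversed(kept))  # restore chronological order

history = [("User", "a" * 300), ("Assistant", "b" * 300), ("User", "c" * 100)]
print(trim_history(history))  # the oldest 300-character message is dropped
```

A real implementation would also always keep the system instructions and perhaps the first message, since the start of a conversation often carries the task definition.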

Doing Work

Having an LLM do work for the user is a problem of getting the LLM to generate tokens in some predictable pattern. This usually looks one of two ways:

  - The model produces structured output (e.g. JSON) that the application parses and acts on.
  - The model emits a call to a function the application has described, and the application executes it.

These may sound easy. But doing this reliably requires one or more non-obvious tweaks.

Access to Information

Base models have some level of understanding of the data they were trained on. For example, asking about the life details of a U.S. president that are described on Wikipedia will get you a very reliable answer, assuming you trust Wikipedia.

Fine tuning on additional data can provide a model with information it wouldn’t otherwise have. For example, you could take your chat history with all your friends and get a model that perhaps sounds like you.

But fine-tuning is expensive and slow. It also may not be as reliable as needed. Think of fine-tuning as studying for a test by trying to memorize as much of the textbook as you can.

Another approach is to allow open-book / open-notes. Rather than memorize, just bring those notes with you. The trick is then to simply find the right notes fast enough to finish the test on time.

The fancy name for this is Retrieval Augmented Generation (RAG). The basic technique is to add additional context to the prompt that will help the LLM give a good response. Commonly, the application takes the user query, generates an embedding, then uses that embedding as a query into some database. Documents in the database that are similar to that query will then be included in the prompt.
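The retrieval half of that pipeline can be sketched with a toy embedding. Everything here is an assumption for illustration: real systems call an embedding model, while this stub uses bag-of-words vectors and a tiny fixed vocabulary.

```python
import math

# Stub vocabulary; a real embedding model produces dense vectors for any text.
VOCAB = ["moon", "landing", "apollo", "pasta", "recipe", "tomato"]

def embed(text: str):
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "the apollo program achieved the first crewed moon landing",
    "a simple tomato pasta recipe with fresh basil",
]

def retrieve(query: str, k: int = 1):
    # Rank documents by similarity to the query embedding.
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

query = "who landed on the moon"
context = retrieve(query)[0]
prompt = f"Context: {context}\n\nUser: {query}\nAssistant:"
print(prompt)
```

In practice the database is a vector store doing approximate nearest-neighbor search over millions of chunks, but the shape of the prompt is the same: retrieved context first, then the user's question.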

This, of course, has complications:

  - Splitting documents into chunks that embed well is an art; a bad chunking strategy buries the answer.
  - Embedding similarity is not the same as relevance; the nearest documents may not actually help.
  - Retrieved context competes for the same limited prompt window as the chat history.

Another approach relies on the ability of the LLM to do work. The LLM as an Agent can write its own search queries. This allows the LLM to apply more intelligence to the information retrieval, reinterpreting the user queries into something a database can more precisely answer. But of course there are trade-offs here as well:

  - Each query the Agent writes is another LLM call, adding latency and cost.
  - The Agent may write bad queries, and those fail in less predictable ways than an embedding lookup.
  - The application now has to manage a multi-step loop rather than a single request.
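The query-rewriting step itself is small. A sketch, where `llm_complete` is again a stub standing in for a real model call:

```python
def llm_complete(prompt: str) -> str:
    # Stub: a real model would generate the query from the prompt.
    return "moon landing date apollo 11"

def agent_search_query(user_message: str) -> str:
    # Ask the model to reinterpret conversational phrasing as a precise query.
    prompt = (
        "Rewrite the user's message as a concise search query.\n"
        f"User: {user_message}\n"
        "Query:"
    )
    return llm_complete(prompt).strip()

print(agent_search_query("hey, when did we land on the moon again?"))
```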


Agents

The last section, describing how an LLM could craft a query to retrieve information, crosses a nuanced threshold.

LLM Agents are AI applications that go beyond the request / response interaction of Chat, where each User instruction goes to the LLM and the response is displayed right back.

An Agent has access to Tools that give it the ability to do more than just Generate text. It can use these tools to examine or even act on the world around it.

Some examples:

  - Search the web or query a database for information.
  - Run code and read the results.
  - Call external APIs to send an email or create a calendar event.

The difficult part is getting the Agent to use the tools correctly to accomplish some goal. All LLM applications have the tendency to Hallucinate, inventing tools that seem like they should exist, or calling them in ways that are not supported.

Remarkably, an Agent has the ability to be self-correcting. If an Agent makes a mistake, the application can tell it, and sometimes it figures out another approach.
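That self-correction lives in the application's loop: run whatever tool the model names, and if it doesn't exist or fails, feed the error back. A sketch with a stubbed model that hallucinates a tool on its first step:

```python
def search(query: str) -> str:
    return f"results for '{query}'"

TOOLS = {"search": search}  # the only tool that actually exists

def llm_next_action(transcript: str) -> dict:
    # Stub: first the model invents a tool, then corrects itself after the error.
    if "unknown tool" in transcript:
        return {"tool": "search", "arg": "moon landing"}
    return {"tool": "web_browse", "arg": "moon landing"}

def run_agent(goal: str, max_steps: int = 5) -> str:
    transcript = f"Goal: {goal}"
    for _ in range(max_steps):
        action = llm_next_action(transcript)
        tool = TOOLS.get(action["tool"])
        if tool is None:
            # Feed the mistake back so the model can try another approach.
            transcript += f"\nError: unknown tool '{action['tool']}'"
            continue
        return tool(action["arg"])
    return "gave up"

print(run_agent("find the moon landing date"))
```

The `max_steps` cap matters: without it, a confused Agent will loop forever, burning tokens.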

The hardest part is ensuring the Agent has a plan and sticks to it. I’ve often seen in my own experiments that the LLM will become “lazy”, doing only part of the task before reporting back to the User.

Designing a Cognitive Architecture for an Agent is an absolutely unsolved problem, but immensely interesting.

Other Applications

Chat is easy to talk about because most of us have experienced ChatGPT. It’s also what the underlying “instruct” models are generally trained to do. But it isn’t the only application architecture.

There is nothing that says your User has to talk to an AI. The application can talk to the LLM on their behalf instead.

The general approach is the same. The application needs to construct an appropriate prompt and extract the answer in a way that will be useful.
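For example, an application might classify incoming support tickets without the User ever seeing a chat. A sketch, with the usual stubbed `llm_complete` standing in for a real API call:

```python
def llm_complete(prompt: str) -> str:
    # Stub: a real application would send the prompt to a completion API.
    return "Sentiment: negative"

def classify_ticket(ticket_text: str) -> str:
    # The User never sees this prompt; the application talks to the LLM itself.
    prompt = (
        "Classify the sentiment of this support ticket as positive, "
        "neutral, or negative.\n"
        f"Ticket: {ticket_text}\n"
        "Sentiment:"
    )
    answer = llm_complete(prompt)
    # Extract the answer from the generated text in a usable form.
    return answer.split(":")[-1].strip().lower()

print(classify_ticket("My order arrived broken and support never replied."))
```

The same pattern covers summarization, extraction, translation: construct a prompt, extract the answer, never show the User the machinery.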


This article is hand-wavy, speculative, and light on specifics. That is because this is by far the least developed part of AI Applications. The development of LLM technology itself, while also new, has had clear goals that have driven the technology forward. These goals are largely scientific and metric-driven.

AI Applications, with real-world users, have no such constraints. We are only beginning to understand what LLMs are capable of. Combine this technological uncertainty with the usual dose of randomness in what will make a User happy, and you have a volatile, unpredictable, yet exciting frontier to play in.

Just get building.