This is Part 5 of a series on the AI Application Stack
In Part 4 I highlighted the Prompt’s role as the key component for interacting with an LLM, guiding the model’s responses to deliver useful results.
In this final part of the series I will describe how we build applications that compose prompts and deliver useful results to end Users.
The Chat
The previous parts of this series have already described the basic components of an LLM necessary to implement a Chat application.
- A model that has been “instruct” tuned will know how to complete prompts of a certain format.
- A System Prompt describes to the model how it should answer.
- A User instruction.
Then the LLM will complete the document with the answer to the user instruction.
Chain several of these together and that’s a Chat.
The application’s purpose then is to construct the relevant prompt, send it to the API, and then display the result to the User. To build on our diagram from Part 1:
graph LR; A[User] -->|Query| B[Application]; B -->|Query| F[Database]; B --> P[Prompt]; P -->|Request| C[API]; C --> D[Document]; D --> M[LLM]; M --> C; C -->|Response| B; B -->|Answer| A;
The application must maintain a sense of state, mediating the user’s interaction with the LLM. This is reasonable to implement the same way any Web application might be, as sketched in code after the list below:
1. A user views a web page with all the previous chat messages served from a database.
2. They fill out a form with a new chat message and post it to the application.
3. The application saves the message to the database.
4. The application constructs a prompt containing previous chat messages and sends a request to an LLM API.
5. The application retrieves the response and saves it to the database.
6. The application generates a new page with the resulting chat message.
7. Go to step 2 and repeat for additional user messages.
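To make that flow concrete, here is a minimal sketch of a single chat turn. The helpers (`load_messages`, `save_message`, `call_llm_api`, `render_page`) are hypothetical stand-ins for your database layer, LLM API client, and templating; any real implementation will differ.

```python
SYSTEM_PROMPT = "You are a helpful assistant. Answer the user's questions concisely."

def handle_chat_post(conversation_id: str, user_message: str) -> str:
    # Save the user's new message (step 3).
    save_message(conversation_id, role="user", content=user_message)

    # Build the prompt from the system prompt plus all previous messages (step 4).
    history = load_messages(conversation_id)
    prompt = [{"role": "system", "content": SYSTEM_PROMPT}]
    prompt += [{"role": m["role"], "content": m["content"]} for m in history]

    # Send the request to the LLM API and wait for its completion.
    answer = call_llm_api(prompt)

    # Save the LLM's response (step 5).
    save_message(conversation_id, role="assistant", content=answer)

    # Render a page containing the updated conversation (step 6).
    return render_page(conversation_id)
```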
Complications and Tradeoffs
The most straightforward chat implementations do just what is described above. But this simple implementation will quickly run into some limitations:
- If I continue this chat conversation for many turns, the limited prompt capacity may be exceeded. This will be experienced as an error from the API when it rejects your request.
- The user might want the LLM to do some work that requires formatting the responses in a particular way.
- The user might expect the LLM to have access to information that the LLM wasn’t trained on.
There are other important aspects of a “production” application. For example, depending on your use case, you might not want to answer certain types of questions (or worse, do certain types of “work”). There are also performance and cost considerations: allowing anyone on the internet to send arbitrary numbers of tokens to an expensive API is not a good idea. I’m not going to attempt to tackle those here. A close consideration of “Prompt Injection” is advised.
Prompt Size
Limiting the chat prompt window is necessary to avoid users chatting until failure. There are a few techniques I’ve seen to address this:
- Simply cut off messages from the front of the chat once some maximum token length is reached (sketched in code after this list). For some use cases this is likely going to work perfectly well. It assumes, though, that there is sufficient context in the remaining conversation for the LLM to understand what’s going on. If the user later refers to an important piece of information that was last mentioned long ago, the model will be confused.
- The application could use some heuristics to select certain parts of the conversation for removal. For example, if the user is trying to fine tune some essay or block of code, the application could remove some earlier versions once the user has moved past them. This technique is of course very application specific.
- An LLM could itself summarize earlier pieces of the conversation and replace chunks of conversation with some shorter version of it. This requires a much more sophisticated application architecture where a summarization process happens on the side, outside of a typical user request/response interaction pattern.
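Here is a sketch of the simplest approach, truncation from the front. `count_tokens` is a hypothetical stand-in for whatever tokenizer your model or API provides.

```python
def truncate_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep only the most recent messages that fit within the token budget."""
    kept = []
    total = 0
    # Walk backwards from the newest message so recent context survives.
    for message in reversed(messages):
        size = count_tokens(message["content"])
        if total + size > max_tokens:
            break
        kept.append(message)
        total += size
    # Restore chronological order before building the prompt.
    return list(reversed(kept))
```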
Doing Work
Having an LLM do work for the user is a problem of requiring the LLM to generate tokens in some predictable pattern. This takes one of two forms:
- The user may need to see the response formatted in a certain helpful way. The responses from language models are just text. But that isn’t very fun.
- For example, if the user is asking the LLM to write some prose or even code for it, the user likely is going to want to take that prose or code elsewhere. It is most helpful if the LLM formats the response in a way that can be easily identified and formatted for the resulting interface. Many LLMs are very adept at formatting results in “markdown”, which is a text-based formatting language that allows for simple enhancements to plain text such as links, headers and, most importantly, preformatted text. The application can then choose to interpret the LLM’s response as markdown and render it for the user.
- The user may need the application to do something with the response beyond displaying it to the user.
- For example, the application may run code the LLM has written. The LLM must be manipulated into generating a response the application can identify and programmatically interpret. A common approach is to require the response to be a JSON document with specific keys (a defensive-parsing sketch follows this list).
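As a sketch of that second case, the application can instruct the model to respond with JSON and then parse it defensively. The prompt wording, keys, and `call_llm_api` helper here are assumptions for illustration.

```python
import json

INSTRUCTION = (
    "Respond with only a JSON object of the form "
    '{"language": "...", "code": "..."} and nothing else.'
)

def get_code_snippet(task: str) -> dict | None:
    response = call_llm_api(f"{INSTRUCTION}\n\nTask: {task}")
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return None  # The model ignored the format; retry or surface an error.
    if not isinstance(parsed, dict) or not {"language", "code"} <= parsed.keys():
        return None  # Valid JSON, but not the shape we asked for.
    return parsed
```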
These may sound easy. But doing this reliably requires one or more non-obvious tweaks.
- Fine-tuning models to answer in the desired format can improve the likelihood of getting that format back. Some models are specifically fine-tuned to return results in JSON. ChatGPT seems very adept at formatting things in markdown and does so with very little instruction.
- As mentioned in Part 3, some local models support providing a Grammar which constrains the output tokens. This requires being able to describe the response in a formal language such as BNF (a sketch follows).
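For example, here is a sketch using the llama-cpp-python bindings, which accept a GBNF grammar; the exact API and the model path are assumptions and may differ between versions.

```python
from llama_cpp import Llama, LlamaGrammar

# Force the model to answer with one of three literal strings, whatever it
# might otherwise "want" to say.
grammar = LlamaGrammar.from_string('root ::= "yes" | "no" | "unsure"')

llm = Llama(model_path="model.gguf")  # placeholder path
result = llm(
    "Is the following review positive? Answer yes, no, or unsure.\n"
    "Review: Loved it!\nAnswer: ",
    grammar=grammar,
    max_tokens=4,
)
print(result["choices"][0]["text"])
```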
Access to Information
Base models have some level of understanding of the data they were trained on. For example, asking about the life details of a U.S. president that are described on Wikipedia will get you a very reliable answer, assuming you trust Wikipedia.
Fine-tuning on additional data can provide a model with information it wouldn’t otherwise have. For example, you could fine-tune on your chat history with all your friends and get a model that perhaps sounds like you.
But fine-tuning is expensive and slow. It also may not be as reliable as needed. Think of fine-tuning as studying for a test by trying to memorize as much of the textbook as you can.
Another approach is to allow open-book / open-notes. Rather than memorize, just bring those notes with you. The trick is then to simply find the right notes fast enough to finish the test on time.
The fancy name for this is Retrieval Augmented Generation (RAG). The basic technique is to add additional context to the prompt that will help the LLM give a good response. Commonly, the application takes the user query, generates an embedding, then uses that embedding as a query into some database. Documents in the database that are similar to that query will then be included in the prompt.
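A naive sketch of that flow, where `embed`, `vector_db`, and `call_llm_api` are hypothetical stand-ins for an embedding model, a database with vector search, and your LLM client:

```python
def retrieve_context(query: str, top_k: int = 3) -> list[str]:
    # Embed the user's query and ask the database for the closest documents.
    query_vector = embed(query)
    return vector_db.search(query_vector, limit=top_k)

def answer_with_context(query: str) -> str:
    context = "\n\n".join(retrieve_context(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return call_llm_api(prompt)
```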
This retrieval step, of course, has complications:
- What sort of embedding do you use? If the data is very specialized, a specific embedding for the application might be more precise. The embedding also determines how many dimensions are used, which has big performance and cost implications.
- This naive implementation assumes the query is “similar” to the answer in embedding space. If the query is, “how fast is a north american gopher” and you have in your database “the north american gopher runs 4mph”, those might be pretty close. But this is very application specific.
- What do you use as a database? Most standard SQL databases do not support “vector” search. At least not out of the box.
- How much data do you include in the prompt? When performing a query to a vector database you get a list of results, sorted by “distance”. The threshold is up to the application. Include too many results and you risk noisy answers, slow and expensive responses, or exceeding the context size. Too few results might miss the “needle in the haystack”.
Another approach relies on the ability of the LLM to do work. The LLM, as an Agent, can write its own search queries. This allows the LLM to apply more intelligence to the information retrieval, reinterpreting the user’s query into something a database can more precisely answer (sketched after the trade-offs below). But of course there are trade-offs here as well:
- The LLM/Agent might not decide to do a query at all. Maybe it just hallucinates an answer.
- It’s slower and more expensive to call the LLM multiple times (once to do the query, once with the results).
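A sketch of that two-pass pattern, again with hypothetical `call_llm_api` and `search_documents` helpers:

```python
def answer_via_search(question: str) -> str:
    # First pass: have the LLM turn the user's question into a search query.
    search_query = call_llm_api(
        "Rewrite the following question as a short keyword search query. "
        f"Return only the query.\n\nQuestion: {question}"
    )

    # The application runs the search itself (full-text, vector, or otherwise).
    results = search_documents(search_query)

    # Second pass: answer the original question using the retrieved results.
    return call_llm_api(
        "Answer the question using these search results.\n\n"
        f"Results:\n{results}\n\nQuestion: {question}\nAnswer:"
    )
```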
Agent
The last section, describing how an LLM can craft a query to retrieve information, crosses a nuanced threshold.
LLM Agents are AI applications that do more than provide the Request / Response interaction of Chat, where each User instruction goes to the LLM and the response is displayed right back.
An Agent has access to Tools that give it the ability to do more than just generate text. It can use these tools to examine or even act on the world around it (a minimal tool loop is sketched after the examples below).
Some examples:
- Run a search query against Wikipedia to retrieve full exact quotes.
- Drive a Web Browser, click on links, summarize contents.
- Operate a Robot
- Communicate with other Agents
- Launch the Nukes (not advised)
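Here is a bare-bones sketch of such a loop. The JSON action format, the `TOOLS` table, and the `call_llm_api` and `search_wikipedia` helpers are all assumptions for illustration, not any particular framework’s API.

```python
import json

TOOLS = {
    "search_wikipedia": search_wikipedia,  # hypothetical tool function
}

ACTION_FORMAT = (
    '\nRespond with JSON, either '
    '{"action": "search_wikipedia", "input": "..."} '
    'or {"action": "finish", "answer": "..."}.'
)

def run_agent(goal: str, max_steps: int = 5) -> str:
    transcript = f"Goal: {goal}"
    for _ in range(max_steps):
        reply = call_llm_api(transcript + ACTION_FORMAT)
        try:
            step = json.loads(reply)
        except json.JSONDecodeError:
            transcript += "\nThat was not valid JSON. Try again."
            continue  # Give the model a chance to self-correct.
        if step.get("action") == "finish":
            return step.get("answer", "")
        tool = TOOLS.get(step.get("action"))
        if tool is None:
            # Hallucinated tool; tell the Agent and let it try another approach.
            transcript += f"\nUnknown tool {step.get('action')!r}. Try again."
            continue
        # Run the real tool and feed the observation back into the transcript.
        transcript += f"\nObservation: {tool(step.get('input', ''))}"
    return "Gave up after too many steps."
```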
The difficult part is getting the Agent to use the tools correctly to accomplish some goal. All LLM applications have the tendency to Hallucinate, inventing tools that seem like they should exist, or calling them in ways that are not supported.
Remarkably, an Agent has the ability to be self-correcting. If an Agent makes a mistake, the application can tell it, and sometimes it will figure out another approach.
The hardest part is ensuring the Agent has a plan and sticks to it. I’ve often seen in my own experiments that the LLM will become “lazy”, do only part of the task, and immediately report back to the User.
Designing a Cognitive Architecture for an Agent is an absolutely unsolved problem, but immensely interesting.
Other Applications
Chat is easy to talk about because most of us have experienced ChatGPT. It’s also what the underlying “instruct” models are generally trained to do. But it isn’t the only application architecture.
There is nothing that says your User has to talk to an AI. The application can instead.
The general approach is the same. The application needs to construct an appropriate prompt and extract the answer in a way that will be useful.
- Code completion often uses a completely different model that allows for “in-filling” code rather than completing an instruction. But some of the previously discussed techniques still apply. For example, code completion may be more effective if the prompt has more context than just the immediately surrounding code. RAG can be used to provide that additional context.
- LLMs are great at summarization. An LLM can read an article and provide a quick summary, which can be useful in applications outside of a Chat interface.
- LLMs can be used for tasks like “Classification” that have previously been the job of purpose-built models. Rather than tune your own model for identifying “hate speech” or “spam”, ask an LLM and it might do a pretty good job (a sketch follows this list). Tuning the classifier is then a matter of adjusting the instructions rather than building a new model.
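A sketch of that kind of zero-shot classifier; the label set, prompt wording, and `call_llm_api` helper are assumptions.

```python
LABELS = ["spam", "not spam"]

def classify(message: str) -> str:
    prompt = (
        f"Classify the following message as one of: {', '.join(LABELS)}. "
        f"Reply with only the label.\n\nMessage: {message}\nLabel:"
    )
    label = call_llm_api(prompt).strip().lower()
    # "Tuning" this classifier means editing the instructions above.
    return label if label in LABELS else "not spam"  # conservative fallback
```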
Conclusion
This article is hand-wavy, speculative and light on concrete examples. That is because this is by far the least developed part of AI Applications. The development of LLM technology itself, while also new, has had clear goals that have driven it forward. These goals are largely scientific and metric-driven.
AI Applications, with real-world users, have no such constraints. We are only beginning to understand what LLMs are capable of. Combine this technological uncertainty with the usual dose of randomness around what will make a User happy and you have a very volatile, unpredictable yet exciting frontier to play in.
Just get building.