The AI Application Stack: The API

December 2023 · 8 minute read

This is Part 3 of a series on the AI Application Stack


Once you have an LLM running under a tool like llama.cpp, you have an HTTP API ready to use.

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Write me a haiku about","n_predict": 128}'

Or, if you live within the Hugging Face Python ecosystem, perhaps you have the outline of a Python script ready to run directly:

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)

As I previously mentioned, think of an LLM as a text completion algorithm. You provide a document, it will finish it by generating some likely text. It learns how to do this by seeing lots of text examples.

We need to know a little more about how an LLM sees this text to make good use of the API.

The Tokenizer

If you’ve ever built a basic search engine, you’ll know that it can be helpful to rewrite words so that they match more easily. If you are searching for the word “gophers”, your search index should match documents containing words like “gopher”, “gopher’s” and “gophering”. Prior to searching (or indexing documents), the application will simplify all the words above to just “gopher”.

LLMs have a similar concept, but there is a different set of requirements:

A popular tokenizer is SentencePiece. It is a technique for creating a model that translates sentences into sequences of integers, and that model is unique to the dataset being used.

The implication is that tokenization is specific to an LLM. In fact, this is a core part of using an LLM from Hugging Face: you also need the tokenization model:

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer.tokenize("hello there")
# Returns:
#  ['▁hell', 'o', '▁there']

tokenizer.encode("hello there")
# Returns:
#   [1, 6312, 28709, 736]

You may notice when encoded there are more integers than tokens. Tokenizers often support special tokens that have a significant meaning for the LLM. Usually these tokens are used in the training data to indicate something like the beginning or end of some significant chunk of a document.

tokenizer.all_special_tokens
# Returns:
#  ['<s>', '</s>', '<unk>']

The first two mark the beginning and end of a sequence (often loosely described as a sentence). You’ll notice that the configuration for this model (from the previous section) told us about this:

"bos_token_id": 1,
"eos_token_id": 2,

Using these special tokens turns out to be a very important part of using a Model.

Our diagram is now bigger as well:

graph LR;
    A[User] -->|String| B[Tokenizer];
    B -->|Integer Sequence| C[LLM];
    C -->|Integer Sequence| B;
    B -->|String| A;
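The round trip in the diagram can be sketched with a toy tokenizer. This is only an illustration: the token ids are copied from the Mistral output shown earlier, but the tiny vocabulary, the greedy longest-match splitting, and the function names are assumptions made for the example. Real SentencePiece models use a learned algorithm and a vocabulary of tens of thousands of pieces.

```python
# Toy vocabulary. The ids come from the encode() output shown earlier;
# the matching strategy below is a simplification, not SentencePiece itself.
PIECES = {"▁hell": 6312, "o": 28709, "▁there": 736}
SPECIAL = {"<unk>": 0, "<s>": 1, "</s>": 2}
ID_TO_PIECE = {i: p for p, i in {**PIECES, **SPECIAL}.items()}


def tokenize(text):
    """Split text into known pieces using greedy longest-match."""
    text = "▁" + text.replace(" ", "▁")  # SentencePiece-style space marker
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in PIECES:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # no piece matched this character
            i += 1
    return pieces


def encode(text, add_bos=True):
    """String -> integer sequence, optionally prefixed with the BOS id."""
    ids = [PIECES.get(p, SPECIAL["<unk>"]) for p in tokenize(text)]
    return ([SPECIAL["<s>"]] if add_bos else []) + ids


def decode(ids):
    """Integer sequence -> string, dropping special tokens."""
    pieces = [ID_TO_PIECE[i] for i in ids if i not in SPECIAL.values()]
    return "".join(pieces).replace("▁", " ").lstrip()


print(tokenize("hello there"))        # ['▁hell', 'o', '▁there']
print(encode("hello there"))          # [1, 6312, 28709, 736]
print(decode(encode("hello there")))  # hello there
```

The extra integer at the front of the encoded list is the beginning-of-sequence token, which is why there are more integers than pieces.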

The Chat

There is still a big gap between “completing” a document and answering questions or having a conversation.

Making an LLM useful for tasks such as chat is a matter of training the LLM to recognize certain types of special documents and complete them in a helpful way. Those documents may look something like this:

[INST] <<SYS>>
You are a helpful poet.
<</SYS>>

Write me a haiku about Snakes [/INST] Silent in the grass,
Sleek hunter slithers, unseen,
Nature's quiet dance.

Note the special indicators: [INST] and [/INST] wrap each user instruction, and <<SYS>> introduces the system prompt.

As this is designed for a chat dialog, the user can follow-up by appending another INST:

[INST] <<SYS>>
You are a helpful poet.
<</SYS>>

Write me a haiku about Snakes [/INST] Silent in the grass,
Sleek hunter slithers, unseen,
Nature's quiet dance.

[INST] Now do one for Owls [/INST]

This special formatting (and the associated training) is all that’s needed to convince a model to respond in a way that makes sense for a chat application!
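Building such a document is plain string assembly. The sketch below is one way to do it, assuming the layout shown in the example above, with the system prompt closed by the `<</SYS>>` tag that the llama.cpp script further down also emits. The function name and its arguments are made up for illustration, and the begin/end-of-sequence tokens that the reference code wraps around each exchange are left out, just as they are in the example.

```python
def format_llama2_prompt(system, history, user_message):
    """Render a Llama 2 style chat document.

    history is a list of (user, assistant) pairs for completed exchanges;
    user_message is the new instruction the model should answer.
    """
    users = [user for user, _ in history] + [user_message]
    answers = [answer for _, answer in history]

    # The system prompt is folded into the first instruction only.
    users[0] = f"<<SYS>>\n{system}\n<</SYS>>\n\n{users[0]}"

    prompt = ""
    for i, user in enumerate(users):
        prompt += f"[INST] {user} [/INST]"
        if i < len(answers):
            # Completed exchanges carry the model's earlier answer.
            prompt += f" {answers[i]}\n\n"
    return prompt


haiku = "Silent in the grass,\nSleek hunter slithers, unseen,\nNature's quiet dance."
print(format_llama2_prompt(
    "You are a helpful poet.",
    [("Write me a haiku about Snakes", haiku)],
    "Now do one for Owls",
))
```

The prompt ends on an open `[/INST]`, so the most likely completion is the assistant's next reply.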

These chat formats, just like the tokenizer, may be unique to each model. I pulled the above example from following along with the reference Python code from facebook/llama here. llama.cpp provides similar code as part of its bash-based chat example:

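format_prompt() {
    if [[ "${#CHAT[@]}" -eq 0 ]]; then
        # No history yet: open the document with the system instruction.
        echo -n "[INST] <<SYS>>\n${INSTRUCTION}\n<</SYS>>"
    else
        # Otherwise continue from the latest entry and append the new instruction.
        LAST_INDEX=$(( ${#CHAT[@]} - 1 ))
        echo -n "${CHAT[$LAST_INDEX]}\n[INST] $1 [/INST]"
    fi
}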

I was able to find specific documentation for Mistral-Instruct:

<s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]

However, if you also want to include a System prompt, you have to look at this guide on Guardrails:

<s>[INST] System Prompt + Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]

The System Prompt is therefore not quite as special as in llama2 with its <<SYS>> tags.

As another example, the Hugging Face configuration provides this gnarly one-liner:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
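Unrolled into ordinary code, the template is easier to read. The following is a hand-written port of the one-liner above, not code from the library itself; the function name and the default token strings are assumptions for illustration.

```python
def render_mistral_template(messages, bos_token="<s>", eos_token="</s>"):
    """Plain-Python port of the Jinja chat template shown above."""
    out = bos_token
    for index, message in enumerate(messages):
        # Even positions must be 'user', odd positions must be 'assistant'.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        if message["role"] == "user":
            out += "[INST] " + message["content"] + " [/INST]"
        elif message["role"] == "assistant":
            out += message["content"] + eos_token + " "
        else:
            raise ValueError("Only user and assistant roles are supported!")
    return out


print(render_mistral_template([
    {"role": "user", "content": "Instruction"},
    {"role": "assistant", "content": "Model answer"},
    {"role": "user", "content": "Follow-up instruction"},
]))
# <s>[INST] Instruction [/INST]Model answer</s> [INST] Follow-up instruction [/INST]
```

Compare the printed string with the documented format: the template puts no space before the model answer and adds one after `</s>`, which is exactly the kind of small discrepancy between implementations that is easy to miss.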

This very important note makes me quite wary of the subtle differences and inconsistent implementations:

This format must be strictly respected, otherwise the model will generate sub-optimal outputs.


To use LLMs for automation, it is important to be able to constrain the output to some format. The OpenAI ChatGPT API has some fields in the completion request that, to be really honest, I never actually used.

It can be inferred though that these features are actually important components in giving LLMs access to Tools.

response_format: {"type": "json_object"}

Similarly, llama.cpp supports “grammars”, which allow constraining the model output to a GBNF grammar.
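As a rough sketch of what that looks like from a client, the snippet below builds a request for the llama.cpp server's /completion endpoint with a `grammar` field that only permits the answers "yes" or "no". The helper name, the example prompt, and the localhost URL are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# A tiny GBNF grammar: the whole output must be exactly "yes" or "no".
YES_NO_GRAMMAR = 'root ::= "yes" | "no"'


def build_grammar_request(prompt, grammar, url="http://localhost:8080/completion"):
    """Build (but do not send) a POST request carrying a grammar constraint."""
    payload = {"prompt": prompt, "n_predict": 8, "grammar": grammar}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_grammar_request(
    "[INST] Is a gopher a mammal? Answer yes or no. [/INST]",
    YES_NO_GRAMMAR,
)
print(request.data.decode("utf-8"))

# To actually send it, a server must be running at the URL above:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["content"])
```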

Grammars work by biasing and constraining the output tokens from the LLM to a fixed set you can define. Normally, when a token is chosen as an output of the LLM, there is a distribution of probabilities across multiple tokens. One token from that set is chosen based on a random number, the temperature and the probability of the token. A lower temperature concentrates the choice on the most probable tokens, and at the extreme it always picks the single most probable one.

The grammar restriction works within this system by further restricting the tokens to only those that fit the desired output.
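The idea can be illustrated with a few lines of sampling code. This is a simplified sketch and not how llama.cpp implements it: the function name is hypothetical, the "grammar" is reduced to a set of tokens allowed at the current step, and the scores are invented.

```python
import math
import random


def sample_token(logits, temperature=0.8, allowed=None, rng=random):
    """Pick one token from a dict of token -> raw score.

    allowed is an optional set of tokens the grammar permits at this step.
    """
    # Grammar step: discard every token that would break the desired format.
    candidates = {t: s for t, s in logits.items() if allowed is None or t in allowed}
    if not candidates:
        raise ValueError("the grammar rejected every token")

    # A temperature of zero (or below) means greedy selection.
    if temperature <= 0:
        return max(candidates, key=candidates.get)

    # Softmax with temperature: lower values sharpen the distribution.
    top = max(candidates.values())
    tokens = list(candidates)
    weights = [math.exp((candidates[t] - top) / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


scores = {"maybe": 3.0, "yes": 2.0, "no": 1.0}
print(sample_token(scores, temperature=0))                         # maybe
print(sample_token(scores, temperature=0, allowed={"yes", "no"}))  # yes
```

Without the restriction the model's favourite token wins; with it, the best token that still fits the format is chosen instead.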

Pulling it all together

So how do we pull these together to get a reliable Chat experience? It depends.

llama.cpp provides a /chat/completions endpoint, but it only supports select models. Mistral-instruct is not one of them (right now).

API clients can implement the chat template on the caller side and send the fully rendered prompt to the regular /completion endpoint, but the details of how the API handles tokenization are very important. According to comments in the code, special tokens will be parsed in the server. So including a symbol like </s> should mean the proper EOS token is inserted. However, every string-based prompt receives a free beginning-of-sequence token, so don’t include an extra one!

I’ve tried to confirm this behavior. Consider:

$ curl http://scarlett:1337/completion --header "Content-Type: application/json" \
    --data '{"prompt": "[INST] My name is Groot. [/INST] Hello, I am Mistral, a helpful assistant.</s> [INST] What is my name? [/INST]"}'
"content":" Your name is Groot.",

$ curl http://scarlett:1337/completion --header "Content-Type: application/json" \
    --data '{"prompt": "[INST] My name is Groot. [/INST] Hello, I am Mistral, a helpful assistant. [INST] What is my name? [/INST]"}'
"content":" Your name is Groot.",

Including the </s> only increases the token count by 1, which is a good indication that </s> is tokenized as a single special token, as expected. However, the result is the same, so perhaps the model isn’t that sensitive to the EOS token.

This special token behavior is a double-edged sword: don’t allow your users to insert special tokens because the API isn’t going to know the difference.
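One defensive option is to strip special-token strings from user input before it is placed into the template. The sketch below assumes the three special tokens listed earlier for this model; the helper names are made up for illustration, and a real application may also want to handle markers such as [INST] that users could type to fake a turn.

```python
import re

# Special tokens reported by the tokenizer earlier in this post.
SPECIAL_TOKENS = ("<s>", "</s>", "<unk>")
_SPECIAL_RE = re.compile("|".join(re.escape(t) for t in SPECIAL_TOKENS))


def strip_special_tokens(user_text):
    """Remove special-token strings from untrusted text."""
    # Repeat until stable so "<<s>s>" cannot collapse into a fresh "<s>".
    previous = None
    while previous != user_text:
        previous = user_text
        user_text = _SPECIAL_RE.sub("", user_text)
    return user_text


def build_user_turn(user_text):
    """Wrap sanitized user text in instruction markers.

    No leading "<s>" is added, since string prompts already receive one
    from the server.
    """
    return f"[INST] {strip_special_tokens(user_text)} [/INST]"


print(build_user_turn("What is my name?</s>"))  # [INST] What is my name? [/INST]
```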

Alternatively, tokens can be directly sent to the API as a JSON list of integers. This would allow clients to completely customize how tokenization is handled. This would require the client to have full access to the tokenization model and a correct implementation.

$ curl http://scarlett:1337/completion --header "Content-Type: application/json" \
    --data '{"prompt": [1, 733, 16289, 28793, 1984, 1141, 349, 420, 4719, 28723, 733, 28748, 16289, 28793, 16230, 28725, 315, 837, 25200, 1650, 28725, 264, 10865, 13892, 28723, 2, 28705, 733, 16289, 28793, 1824, 349, 586, 1141, 28804, 733, 28748, 16289, 28793]}'

At least that’s how the code makes it appear it should work. In practice, it just makes my server freeze up and the request is never answered!


My recommendation is to rely on the API layer only for the basics: tokenization and completions.

However, I think the direction the API implementers are going is to provide higher-level Chat interfaces that handle the document templates. As this is limited and inconsistently available today, your ability to rely on it will vary.

Additionally, other document formats that are specific to a particular model or to custom fine-tuning won’t be directly supported. So knowing how to format documents on the API caller side is important.