What is LLMOps?
You will increasingly encounter the term 'LLMOps', short for Large Language Model Operations, across your social media platforms and within your professional circles.
LLMOps encompasses the management of the entire lifecycle of applications powered by Large Language Models (LLMs), from their creation through deployment to continuous upkeep. It's essential for developers and businesses looking to launch LLM-enabled applications to understand the operational requirements, infrastructure needs, and associated costs.
At tokes compare, the world's first dedicated LLM and generative AI comparison site, we are seeing a massive spike in interest from developers and enterprises working out how to make the most of the AI revolution, automating and improving applications and services while keeping costs to a minimum.
When choosing a model, you are likely to consider accuracy, speed, and cost. You may also evaluate the model's interpretability (how easily its decisions can be understood), bias (inherent tendencies), scalability (performance as data increases), maintenance (ease of updates or fine-tuning), generalizability (performance on unseen data), regulatory compliance, privacy and security, and integration (compatibility with your existing tech stack). In this blog, however, we focus on tokens and how they translate into costs.
Let's talk tokens
The concept of tokenization is central to the functioning of LLMs. The tokenization process is largely determined by the model's design and training. It involves breaking down the input text into smaller, manageable pieces (tokens) that the model can understand and process. In the context of LLMs, a token can be a word, a subword, a character, or a byte (see below for an example of tokenization based on OpenAI's GPT-3 model).
Many LLM services accessed through an API, including models like GPT-4, Jurassic-2, and PaLM 2, are billed based on the number of tokens processed. The more tokens an application uses, the higher the cost (think of it as a pay-as-you-go consumption model for LLMs). Managing the number of tokens is therefore essential for cost control. Interestingly, we are seeing LLM providers start to advertise that they offer more text per token than their competitors, with the aim of saving consumers money.
Below is an example based on OpenAI's GPT-3 model (you can access their free Tokenizer here): 4,801 characters of text were grouped into 1,000 tokens.
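If you want to count tokens programmatically before sending a request, one option is OpenAI's open-source tiktoken library, which exposes the tokenizers its models use (other providers ship their own tokenizers, so counts differ between models). A minimal sketch:

```python
# pip install tiktoken
import tiktoken

text = "LLMOps encompasses the management of the entire lifecycle of LLM-powered applications."

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era models;
# older GPT-3 models use r50k_base or p50k_base.
encoding = tiktoken.get_encoding("cl100k_base")
token_ids = encoding.encode(text)

print(f"{len(text)} characters -> {len(token_ids)} tokens")
print(encoding.decode(token_ids[:5]))  # decode() maps token IDs back to text
```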
In the example, we used enough characters to reach 1,000 tokens, as pricing per 1,000 tokens (often written as 1K tokens) is the industry standard for most LLMs. You can compare how this pricing applies to the models from the providers mentioned earlier on the tokes compare site. With that in mind, here are ten ways to manage your token costs:
1. Model selection
Choose the right LLM based on the output you want. Models have different architectures and capabilities and excel in different fields. Use the tokes compare website to select the right model to power your application, and contact the tokes compare team if you need further guidance.
2. Utilize cost estimation tools
Employ our tokes compare calculator and model-specific cost data to understand token expenses for application development and maintenance. We have heard that consumers find it challenging to grasp per-1,000-token pricing because the prices are such small decimals, so we recommend that developers, managers, and finance professionals calculate the number of tokens first (which for enterprises may run to hundreds of millions or billions annually). You can then compare similar models' token prices to see your potential monthly or annual savings.
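As a rough illustration of the arithmetic involved, the sketch below estimates annual spend for two hypothetical models from per-1,000-token prices. The model names and prices are placeholders, not live quotes; always use the provider's published rates (or our calculator).

```python
# Hypothetical per-1,000-token prices in USD (placeholders, not real quotes).
PRICES_PER_1K = {
    "model_a": {"prompt": 0.0015, "completion": 0.0020},
    "model_b": {"prompt": 0.0300, "completion": 0.0600},
}

def annual_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate yearly spend from annual prompt/completion token volumes."""
    price = PRICES_PER_1K[model]
    return (prompt_tokens / 1000) * price["prompt"] + (completion_tokens / 1000) * price["completion"]

# Example: an enterprise processing 500M prompt tokens and 100M completion tokens a year.
for model in PRICES_PER_1K:
    print(f"{model}: ${annual_cost(model, 500_000_000, 100_000_000):,.2f} per year")
```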
3. Craft efficient prompts and keep conversations concise
Reduce token costs by writing clear, concise prompts (user inputs), and consider fine-tuning the model so that shorter prompts suffice (prompt design and engineering is a great skill to learn). Keep conversations focused to limit token usage, and consolidate queries where possible for greater efficiency.
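To see the saving in practice, compare the token counts of a verbose prompt and a concise one asking for the same thing. A small sketch, again using tiktoken, with two illustrative prompts:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

verbose = ("I was wondering if you could possibly help me out by providing a summary, "
           "in roughly three sentences or so, of the customer feedback text that I am "
           "going to paste below for you to look at.")
concise = "Summarize the customer feedback below in three sentences."

for name, prompt in [("verbose", verbose), ("concise", concise)]:
    print(f"{name}: {len(encoding.encode(prompt))} tokens")
```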
4. Context management
Eliminate redundant messages from the conversation history to reduce the size of the prior context sent with each request, and with it the prompt cost.
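One simple approach is to trim the history to a token budget before each call, keeping only the most recent messages. A minimal sketch; trim_history and count_tokens are hypothetical helpers (count_tokens could be built on tiktoken, as sketched earlier), not part of any provider SDK:

```python
def trim_history(messages, max_tokens, count_tokens):
    """Keep the most recent messages that fit within a token budget.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    count_tokens: callable returning the token count of a string.
    """
    kept, total = [], 0
    for message in reversed(messages):        # walk from newest to oldest
        cost = count_tokens(message["content"])
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    return list(reversed(kept))               # restore chronological order
```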
5. Context reset
Start fresh conversations to lower overall token usage and costs.
6. Leverage vector stores
Use vector stores (databases designed to store and retrieve embedding vectors) to "cache" LLM responses, reducing the frequency of API calls and the amount of context passed to the model. Five popular vector databases are Faiss (Meta), Pinecone, Milvus, Postgres (pgvector), and Weaviate.
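A semantic cache is one way to do this: embed each prompt, check whether a sufficiently similar prompt has already been answered, and only call the LLM on a miss. A rough sketch using Faiss; embed() is a hypothetical placeholder for your embedding model, and the distance threshold is arbitrary and would need tuning:

```python
# pip install faiss-cpu numpy
import numpy as np
import faiss

DIM = 384                       # embedding dimensionality (depends on your embedding model)
index = faiss.IndexFlatL2(DIM)  # exact L2 search over cached prompt embeddings
cached_answers = []             # answers stored in the same order as their embeddings

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function -- replace with your embedding model or API."""
    raise NotImplementedError

def cached_completion(prompt: str, call_llm, threshold: float = 0.2):
    """Return a cached answer for a semantically similar prompt, else call the LLM."""
    vector = embed(prompt).astype("float32").reshape(1, DIM)
    if index.ntotal > 0:
        distances, ids = index.search(vector, 1)
        if distances[0][0] < threshold:        # close enough: reuse the cached answer
            return cached_answers[ids[0][0]]
    answer = call_llm(prompt)                  # cache miss: pay for an API call
    index.add(vector)
    cached_answers.append(answer)
    return answer
```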
7. Handle long documents
Develop strategies to deal with documents exceeding the maximum context limit (the maximum number of tokens the model can process at one time). You can segment the document into smaller parts, use a sliding window approach (moving through the document in overlapping sections so context carries over), employ abstractive summarization, or combine these strategies.
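The sliding window idea fits in a few lines: move through the token sequence in fixed-size windows that overlap, so each chunk carries some context from the previous one. A sketch with illustrative window and overlap sizes, which you would set from your model's context limit:

```python
def sliding_window_chunks(token_ids, window_size=3000, overlap=200):
    """Split a long token sequence into overlapping windows.

    token_ids: a list of token IDs (e.g. from tiktoken's encode()).
    The overlap carries some context from one chunk into the next.
    """
    step = window_size - overlap
    return [token_ids[start:start + window_size]
            for start in range(0, len(token_ids), step)]
```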
8. Regulate completion length
Set a maximum token limit for model responses (completions) to prevent overly lengthy replies, striking a balance between cost and information quality.
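Most APIs expose this as a max_tokens (or similarly named) parameter on the request. An illustrative sketch using the OpenAI Python library as it existed at the time of writing; newer client versions and other providers use different interfaces, so treat this as a pattern rather than a recipe:

```python
import openai  # pip install openai (pre-1.0 interface shown here)

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the benefits of LLMOps in two sentences."}],
    max_tokens=100,   # cap the completion length to keep costs predictable
    temperature=0.2,
)
print(response["choices"][0]["message"]["content"])
```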
9. Apply wait policies
Implement wait policies that hold a request back until the input has stopped changing (i.e., no new tokens are being added), avoiding redundant calls and unnecessary token usage.
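A common way to implement a wait policy is a debounce: each new piece of input resets a timer, and the request is only sent once the input has been quiet for a short period. A minimal sketch; send_to_llm is a hypothetical wrapper around your API call:

```python
import threading

class Debouncer:
    """Delay an action until no new input has arrived for `wait` seconds."""

    def __init__(self, wait, action):
        self.wait = wait
        self.action = action
        self._timer = None

    def trigger(self, *args, **kwargs):
        # Cancel any pending call and restart the countdown.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.wait, self.action, args, kwargs)
        self._timer.start()

# Example usage: only send the prompt 1.5 seconds after the user stops typing.
# debouncer = Debouncer(1.5, send_to_llm)   # send_to_llm is a hypothetical API wrapper
# debouncer.trigger(current_prompt_text)    # call this on every keystroke or update
```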
10. Custom model training
Consider developing a bespoke model if these strategies do not sufficiently reduce costs.