From Tokens to Context Windows: Simplifying AI Jargon


AI models are trained on trillions of tokens, billions of parameters, and ever-longer context windows.

But what the heck does any of that mean? Is two trillion tokens twice as good as one trillion? And, for that matter, what is a token, anyway?

In this post, we’ll explain key concepts for understanding large language models and other AI models, and why they matter, including tokens, parameters, and the context window. The post continues by explaining open source versus proprietary models (the difference may not be what you think it is); multimodal models; and the CPUs, GPUs, and TPUs that run the models. Finally, we have a technical-ish discussion of parameters and neurons in the training process and the problem of language equity.

Tokens

The smallest piece of information we use when we write is a single character, like a letter or number. Similarly, a token is the smallest piece of information an AI model uses. A token represents a snippet of text, ranging from a single letter to an entire phrase. A token typically covers less text than a full word, so a document contains more tokens than words, but fewer tokens than characters. Each token is assigned a numerical ID, and these IDs, rather than the original text, are used to train the model. This “tokenization” reduces the computational power required to learn from the text.
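
To make this concrete, here is a minimal sketch using the open-source tiktoken tokenizer (an assumption: that the `tiktoken` package is installed). It shows a sentence becoming token IDs and compares the token count to the word count.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer used by GPT-3.5 and GPT-4

text = "Tokenization turns text into integer IDs the model can learn from."
token_ids = enc.encode(text)

print(token_ids)                  # a list of integer IDs, not the original text
print(enc.decode(token_ids))      # decoding round-trips back to the sentence
print(len(text.split()), "words vs.", len(token_ids), "tokens")
```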

One common method for tokenization is called Byte Pair Encoding (BPE), which starts with the most basic elements—such as characters—and progressively merges the most frequent pairs to form tokens. This allows for a dynamic tokenization that can efficiently handle common words as single tokens while breaking down less common words into smaller sub-words or even single characters. BPE is particularly useful because it strikes a balance between the granularity of characters and the semantic meaning captured in longer strings of text.
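
The toy sketch below shows the core BPE merge loop (a simplified illustration, not any production tokenizer): start from single characters and repeatedly merge the most frequent adjacent pair of symbols.

```python
from collections import Counter

def bpe_train(words, num_merges):
    corpus = Counter(tuple(w) for w in words)  # each word as a sequence of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():   # count adjacent symbol pairs
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # the most frequent pair becomes a merge rule
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():   # apply the merge everywhere it occurs
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(bpe_train(["low", "lower", "lowest", "new", "newer"], num_merges=5))
```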

A good rule of thumb is that any given text will have about 30 percent more tokens than it does words, though this can vary based on the text and the specific tokenization algorithm used.

“Training tokens” are the tokens used to train or fine-tune the model, and their number can run into the billions or trillions. Most of the time, the more tokens a model is trained on, the better it performs. However, the quality and diversity of those tokens also matter for general-purpose models: a model trained on trillions of tokens drawn only from Reddit would do worse at some general tasks than one trained on a wider range of sources. Adding more data also makes training take longer.

Parameters

The “parameters” in an LLM are the values a model uses to make its predictions. Each parameter changes during training to improve the model’s predictive output. When training is complete, the parameters are fixed so that they no longer change.

Each parameter is a single number, and it is the collection of parameters and how they are weighted that produces the predictions. The complexity of language means that models rely on billions of parameters to make their inferences. Smaller LLMs currently have single-digit billions of parameters, mid-sized models like Llama 2 top out around 70 billion, GPT-3 has 175 billion, and GPT-4 is widely reported to be larger still, though OpenAI has not disclosed its size.
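
As a sketch of where these counts come from (assuming PyTorch is installed), every weight and every bias in every layer is one parameter, and even a single small feed-forward block adds up quickly:

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(768, 3072),  # weights: 768 x 3072, plus 3,072 biases
    nn.ReLU(),
    nn.Linear(3072, 768),  # weights: 3072 x 768, plus 768 biases
)

total = sum(p.numel() for p in block.parameters())
print(f"{total:,} parameters in this one block")  # 4,722,432
```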

More parameters mean a more complex model that can, ideally, handle more complex tasks, although at the cost of requiring more computational power. Additional complexity usually improves the model, but not always: smaller models with higher-quality training data, such as Microsoft’s Phi-2, can outperform larger models trained on less refined data.

For example, if a model has too many parameters relative to the amount of training data, it can learn to predict every outcome in its training data perfectly. But because it has, in effect, memorized its training data, it may do a poor job of generalizing beyond it. This phenomenon is called “overfitting.” Generally, overfitting can be avoided by training on more tokens than the model has parameters.
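
The toy illustration below (using numpy, an assumption) shows the same idea on a tiny problem: a degree-9 polynomial has as many parameters as training points, fits them almost perfectly, and typically does worse on new data than a simpler degree-3 fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 10)

overfit = np.polyfit(x_train, y_train, deg=9)    # 10 parameters for 10 points
simpler = np.polyfit(x_train, y_train, deg=3)    # far fewer parameters than points

x_new = np.linspace(0, 1, 100)
y_new = np.sin(2 * np.pi * x_new)
print("degree 9, training error:", np.abs(np.polyval(overfit, x_train) - y_train).mean())
print("degree 9, new-data error:", np.abs(np.polyval(overfit, x_new) - y_new).mean())
print("degree 3, new-data error:", np.abs(np.polyval(simpler, x_new) - y_new).mean())
```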

Context Window Limit

The context window is the maximum amount of preceding text the model can “see” when calculating its prediction, measured in tokens. While the number of tokens a model is trained on can run into the trillions, and parameters into the billions, the context window is typically in the thousands or tens of thousands of tokens. Some recent models go much further: Claude 2.1 offers a 200,000-token window, and Gemini 1.5 Pro has a one-million-token window, with an experimental version tested on up to 10 million tokens.
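
In practice, a fixed window means the application has to decide what to drop once a conversation or document exceeds it. The sketch below uses tiktoken only to count tokens; the window size and helper function are illustrative assumptions.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_WINDOW = 8_192  # e.g., the original GPT-4 window size, in tokens

def fit_to_window(history: str, window: int = CONTEXT_WINDOW) -> str:
    """Keep only the most recent `window` tokens of a long prompt."""
    ids = enc.encode(history)
    return history if len(ids) <= window else enc.decode(ids[-window:])

print(len(enc.encode(fit_to_window("some very long conversation " * 5000))))
```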

Larger context windows allow a model to use more user-provided data, like a PDF, and to output longer responses, which can lead to more accurate reasoning. However, some long-context models have a “lost in the middle” problem, where content in the middle of the context window receives too little attention, which matters when reasoning over complex documents. Recent models like Gemini 1.5 Pro and Claude 3 have made progress toward solving this, but benchmarks are still evolving. In addition, bigger windows require more computational power and are slower.

Comparing LLMs

With the information above we can better understand how different models compare, as shown in the table below.

Open Source vs. Proprietary Models

Most people are familiar with the concepts of open source and proprietary software. They’re similar in AI, but with some key differences and controversies over the phrase “open source.”

In general, “open source” AI models are models whose parameter weights are available for the public to use and alter. “Open source,” “open weights,” and “downloadable weights” are used somewhat interchangeably. Some disagree with calling these models “open source,” believing the phrase should be reserved for models whose entire training pipeline – data, architecture, and training plan – is available for download.
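
As a sketch of what downloadable weights mean in practice (assuming the Hugging Face transformers library is installed and that access to this gated model repository has been granted), the parameters themselves can be pulled down and run locally:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # an open-weights model; license acceptance required
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Open weights mean you can", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```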

Multimodal Models

Most LLMs developed to date take input in one form and produce output in the same form. In these single-modal models, for example, you type a question in a text box and the model delivers an answer in text (text-to-text). By contrast, so-called multimodal models are more advanced and can interpret and output multiple formats, such as image-to-image or image-to-text. GPT-4V (V for Vision) can respond to images, and Gemini 1.5 can watch video, see images, and listen to audio without first transcribing or otherwise converting them to text.

The key differences between single- and multi-modal models have to do with the underlying architecture and their ability to seamlessly handle different formats on either end. Text-to-image and text-to-speech are still considered single-modal because they are designed to input a specific format and output a specific format.

CPU vs GPU vs TPU

LLMs and other AIs require a lot of processing power, and the rise of AI is thus affecting how computer processing is developing and, of course, the names of those processor types.

Central Processing Units (CPUs) are standard chips, useful for general-purpose computing. Graphics Processing Units (GPUs) are specialized chips, originally designed for rendering graphics, that prioritize parallel rather than serial computation. That design turned out to be good for cryptocurrency mining, helping create a shortage in 2021, and is also well suited to training and running LLMs.

Tensor Processing Units (TPUs), originally created by Google, are designed specifically for machine learning workloads. They are built for the large-scale matrix math involved in adjusting parameter weights during training and in performing inference efficiently.

Training models requires large numbers of processors; generally speaking, the more parameters a model has, the more processing power is required to train it. The Falcon 180B model was trained on 4,096 GPUs running simultaneously. Google’s A3 supercomputer scales up to 26,000 Nvidia GPUs, although its new Hypercomputer will run on Google’s own TPUs.
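
As a small sketch of why the chip matters (assuming PyTorch is installed), the same matrix multiplication code runs on the CPU by default, and in parallel across thousands of cores when a CUDA GPU is available:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b  # one large matrix multiply, the core operation in training and inference
print(device, c.shape)
```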

Parameters and Neurons in the Training Process

Modern AI models are “deep neural networks.” A neural network is made up of layers of smaller components called neurons, modeled loosely on biological neurons and implying some similarity to the way an organic brain works. They are “deep” (as opposed to “shallow”) because they are complex and include multiple layers.

Training a neural network begins by initializing the model's parameters with random values, often drawn from a normal distribution. Careful initialization helps prevent issues such as vanishing or exploding gradients, which can keep the model from learning effectively. During training, a subset of the input data is passed through the model in a step known as the “forward pass.” Here, each neuron calculates an output by weighting its inputs, adding a bias, and then applying an activation function such as the ReLU (rectified linear unit), which introduces non-linearity by outputting zero for negative inputs and passing positive inputs through unchanged. This non-linearity keeps the network from collapsing into a single linear equation.
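
A minimal numpy sketch of that forward pass through one layer (the sizes and random values are illustrative assumptions): weight the inputs, add a bias, apply ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))         # one input example with 4 features
W = rng.normal(size=(4, 3)) * 0.1   # small random initial weights (4 inputs -> 3 neurons)
b = np.zeros(3)                     # biases, initialized to zero here

def relu(z):
    return np.maximum(0, z)         # zero for negative inputs, unchanged for positive

hidden = relu(x @ W + b)            # the layer's output
print(hidden)
```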

After the forward pass, the model's predictions are compared with the actual target outputs using a loss function, which measures the model's performance by quantifying the difference between predicted and true values. The gradient of the loss with respect to each parameter, computed in the “backward pass,” indicates how to adjust the weights and biases to reduce the loss.

Subsequently, an optimization algorithm (typically Adam or AdamW) updates the parameters based on the computed gradients. This optimization step is crucial for improving the model’s predictive accuracy. The entire process, from forward pass to parameter update, is repeated with different subsets of the training data for a predetermined number of training steps or until the model performs satisfactorily. Through this iterative process, the model learns to map inputs to outputs accurately, gradually improving its ability to make predictions or decisions on new, unseen data.
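
Putting the pieces together, here is a minimal PyTorch training loop on synthetic data, sketching the cycle described above (the toy model, data, and hyperparameters are illustrative assumptions, not any particular model's recipe): forward pass, loss, gradients, and an AdamW update, repeated over batches.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):                 # a fixed number of training steps
    x = torch.randn(64, 10)             # one batch of synthetic inputs
    y = x.sum(dim=1, keepdim=True)      # the target the model should learn
    loss = loss_fn(model(x), y)         # forward pass and loss
    optimizer.zero_grad()
    loss.backward()                     # gradients of the loss w.r.t. every parameter
    optimizer.step()                    # AdamW adjusts the weights and biases

print("final training loss:", loss.item())
```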

Unicode, Tokens, and the Language Equity Problem

Representing text in computing involves assigning a unique number to each character, ranging from symbols like 'a' and '&' to complex characters such as '业' or even emojis like '😊'. This assignment process is known as "encoding." In the early days of computing, different countries developed their own encodings to cater to their specific alphabets. For instance, the United States developed the American Standard Code for Information Interchange (ASCII) standard. This diversity in encodings posed challenges in managing multilingual texts, prompting the need for a universal solution.

This need led to Unicode, a comprehensive standard designed to represent every character used across the world's languages. In the common UTF-8 encoding, a Unicode character takes from 1 to 4 bytes. Commonly used writing systems such as Latin, Arabic, and Cyrillic, as well as the more frequently used Han characters, are encoded in 1 to 3 bytes. Lesser-used Han characters, emojis, and characters from rare or extinct writing systems require 4 bytes.
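
These byte counts are easy to check with Python's built-in UTF-8 encoder:

```python
for ch in ["a", "&", "я", "ع", "业", "😊"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# 'a' and '&' take 1 byte, Cyrillic and Arabic letters 2, common Han characters 3,
# and emoji 4.
```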

Byte-Pair Encoding (BPE, discussed above) typically starts from the 256 possible byte values rather than the full range of Unicode characters. As a result, building tokens can be more complex and more token-intensive in some languages than others. For instance, Telugu, a major language in India, can generate up to 10 times more tokens than English for the same amount of text. That makes processing Telugu more costly for any given amount of text, resulting in less representation and lower accuracy given computing constraints.
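
A small sketch of the disparity using tiktoken (an assumption; the exact ratio depends on the tokenizer and the text): the same short greeting produces far more tokens in Telugu than in English.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
english = "Hello, how are you?"
telugu = "హలో, మీరు ఎలా ఉన్నారు?"  # roughly the same greeting in Telugu

print("English tokens:", len(enc.encode(english)))
print("Telugu tokens: ", len(enc.encode(telugu)))
```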

The question of whether it's feasible to develop a fair multilingual model that performs equally well across different languages, as opposed to optimizing models for specific languages, remains an open and complex challenge.

Stay tuned for more on artificial intelligence models from the Technology Policy Institute. Visit chatbead.org to use TPI’s AI tool to search federal infrastructure grant applications. Nathaniel Lovin is Lead Programmer and Senior Research Analyst at the Technology Policy Institute. Sarah Oh Lam is a Senior Fellow at the Technology Policy Institute.

