Transformers 101 (the basics of ChatGPT)

This is a simple attempt to explain how ChatGPT works. It starts from zero and builds up one concept at a time. Along the way there are small interactive simulations to make each idea easier to follow.

The interactive demos on this page are simplified simulations, not actual model inference. They illustrate the principles that real models use, just at a scale small enough to fit on a screen.

Step 1: Text is not text (to a computer)

When humans look at a sentence, they see words and meaning. A computer sees none of that, because it only works with numbers. So before a language model can do anything useful with your message, it first has to convert the text into a numerical form it can actually process.

The way it does this is by breaking text into small pieces called tokens. A token can be a whole common word like "cat," a chunk of a longer word like "un" or "ing," or sometimes just a single character. The reason is that there are far too many distinct words in any language to give each one its own slot, so the model keeps a fixed set of reusable pieces (about 100,000 of them) that can be combined to spell anything.

Try it: see how text becomes tokens

Type anything below and watch how it gets broken into pieces. Each colored block is one token. Hover over a token to see its numeric ID.

If you play with the demo above you will notice that common short words like "the" or "how" stay in one piece, while longer or rarer words get split up. In most real tokenizers, a leading space is actually merged into the word that follows it, so " the" (with the space) counts as a single token, different from "the" at the start of a text. Punctuation marks typically get their own token. A real model like GPT-4 has a vocabulary of roughly 100,000 token types; the demo uses a smaller set, but the splitting logic works the same way.
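If you want to see the mechanics in code, here is a toy version. Real tokenizers learn their vocabulary with an algorithm called byte-pair encoding; this sketch just does greedy longest-match against a small hand-picked vocabulary, which is enough to show why common words stay whole, why " cat" (with the space) is its own token, and why rarer words get split into pieces.

```python
# A toy tokenizer: greedy longest-match against a tiny, hand-picked
# vocabulary. Real tokenizers use byte-pair encoding over roughly
# 100,000 learned pieces; this only illustrates the splitting behavior.
VOCAB = {
    " the": 0, "the": 1, " cat": 2, "cat": 3, " un": 4, "un": 5,
    "ing": 6, " jump": 7, "jump": 8, ".": 9, " ": 10,
}
# Single characters as a fallback, so anything can still be spelled.
for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz", start=100):
    VOCAB.setdefault(ch, i)

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            i += 1  # skip characters the vocabulary cannot represent
    return tokens

print(tokenize("the cat jumping"))
# "the" at the start has no leading space; " cat" keeps its space;
# "jumping" splits into " jump" + "ing".
```

Notice that the output depends entirely on what is in the vocabulary, which is why different models can tokenize the same text differently.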

Step 2: Giving tokens meaning (embeddings)

At this point we have tokens, but they are just numbered labels. The number 4821 tells the computer nothing about whether "cat" is an animal or how it relates to "dog" or "kitten." So the model goes a step further and assigns each token a long list of numbers (hundreds or thousands of them) that together act as a kind of address in a high-dimensional space. This is called an embedding.

The easiest way to picture it is a giant map. Instead of just two orthogonal axes, imagine hundreds of axes. On this map, words that mean similar things end up close together: "king" near "queen," "Paris" near "France," "running" near "jogging." Nobody programs these positions by hand. The model discovers them during training, simply by noticing which words tend to appear in similar contexts across the billions of sentences it reads.

Try it: explore the word map

Pick a word and see which words end up nearby. This is a simplified 2D projection of a space that, in a real model, has hundreds or thousands of dimensions.

What I find remarkable is that no one ever told the model "cat and dog are related." It figured that out on its own, because those two words keep showing up around the same kinds of other words. If you see "the ___ chased the ball" and "the ___ sat on the couch," both "cat" and "dog" fit, and the model learns to place them near each other.
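For the curious, here is roughly how "nearby" is measured. The vectors below are made up for illustration (real embeddings are learned and have hundreds or thousands of dimensions); closeness is usually computed as the cosine of the angle between two vectors.

```python
import math

# Toy 4-dimensional embeddings, invented for illustration only.
# In a real model these vectors are learned during training.
EMBEDDINGS = {
    "cat":    [0.9, 0.8, 0.1, 0.0],
    "dog":    [0.8, 0.9, 0.2, 0.0],
    "kitten": [0.9, 0.7, 0.1, 0.1],
    "bank":   [0.0, 0.1, 0.9, 0.8],
}

def cosine_similarity(a, b):
    # Close to 1.0 means "pointing the same way"; near 0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"]))   # high
print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["bank"]))  # low
```

The word-map demo above is doing exactly this comparison, just projected down to two dimensions so it fits on a screen.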

There is one more detail worth mentioning here. Because a transformer (a type of neural network architecture used in modern language models) processes all the tokens at the same time rather than reading them left to right, it has no built-in sense of word order. Without some extra information, "dog bites man" and "man bites dog" would look identical to it. To solve this, the model adds a positional encoding to each embedding: a set of numbers that tells the model whether a token is first, tenth, or three-hundredth in the sequence. This way the model knows not just what each word is, but where it sits.
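Here is a small sketch of one classic scheme: the fixed sine-and-cosine encoding from the original transformer paper. GPT-style models typically learn their position vectors instead, but the role is the same, a distinct vector for each position that gets added to the token's embedding.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal positional encoding (the original transformer's scheme).
    Each position gets a unique pattern of sine and cosine values at
    different frequencies, so position 0 and position 10 look different."""
    encoding = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:dim]

# The model adds this vector to the token's embedding.
print(positional_encoding(0, 4))   # first token
print(positional_encoding(10, 4))  # eleventh token: a different pattern
```

With these vectors mixed in, "dog bites man" and "man bites dog" no longer look identical, even though the tokens are processed in parallel.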

Step 3: Context matters (attention)

Here is where it gets interesting. The word "bank" can mean the side of a river or a financial institution, and the only way to tell which meaning someone intends is to look at the surrounding words. This is the core problem that attention solves, and it is the mechanism that made modern language models possible.

The idea is that for every word in a sentence, the model looks at the words it has seen so far and decides which ones are most relevant to understanding that particular word right now. When it processes "I sat on the river bank," the word "bank" pays strong attention to "river" and shifts its internal representation toward the waterside meaning. In "I went to the bank to deposit money," the same word latches onto "deposit" and "money" instead, and ends up with an entirely different internal representation despite being spelled identically.

Try it: see where the model looks

Click on any word to see which other words it attends to most. This uses simulated attention weights that illustrate the real pattern.

When people talk about "transformers," this is what they mean. The whole architecture is built around this attention mechanism. Unlike older approaches that processed words one by one in sequence, a transformer processes all the input tokens in parallel. There is one important constraint: in models like GPT, each token can only attend to the tokens that came before it, not the ones that come after, because the model is designed to predict what comes next and it would be cheating to look ahead. A large model repeats this attention process many times in parallel (using what are called "attention heads") and stacks these layers deep, sometimes 80 or 100 of them, with each pass refining the model's understanding of what every word means in this specific context.
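If you are comfortable with a little code, the core computation is surprisingly short. This sketch computes attention weights for a single token using made-up two-dimensional vectors; a real model uses vectors with thousands of dimensions plus learned query/key projections, which are omitted here.

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention for one token (a minimal sketch).
    `query` is the current token's vector; `keys` are the vectors of
    this token and the ones before it. The causal mask is implicit:
    future tokens are simply never passed in."""
    dim = len(query)
    # Similarity score between the query and each visible key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    # Softmax turns scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up vectors: "bank" attending over ["river", "sat", "bank"].
river, sat, bank = [2.0, 0.5], [-1.0, -1.0], [2.0, 0.5]
print(attention_weights(query=bank, keys=[river, sat, bank]))
# "river" and "bank" share nearly all the weight; "sat" gets almost none.
```

The weighted sum of the value vectors (not shown here) is what actually shifts "bank" toward its waterside meaning.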

Step 4: Predicting the next word

This is the part that tends to surprise people. The core of training (called pre-training) boils down to a single, almost absurdly simple task: given some text, predict the next word.

The model reads trillions of words from books, websites, code, and other sources, and at each position it tries to guess what comes next. When it guesses wrong it nudges its internal numbers to do a bit better next time, and after enough examples it gets remarkably good at this game. To turn this raw predictor into a helpful assistant like ChatGPT, there is a second phase where human reviewers rank its answers and the model is fine-tuned to prefer the responses people actually find useful (a process called RLHF, reinforcement learning from human feedback). But the foundation is still next-word prediction.

When you chat with ChatGPT, the model processes your message through all those layers of embeddings and attention, produces a probability for every possible next token in its vocabulary, picks one, appends it, and repeats. Word by word, the response appears on your screen.
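The loop itself is simple enough to sketch. The "model" below is just a hand-written lookup table standing in for billions of parameters, with made-up probabilities, but the surrounding predict-sample-append loop has the same shape as the real thing.

```python
import random

# A stand-in "model": a lookup table of next-token probabilities.
# A real model computes these from its parameters; the loop around it,
# though, looks just like this. All probabilities here are invented.
NEXT_TOKEN_PROBS = {
    ("The", "capital"): {"of": 1.0},
    ("The", "capital", "of"): {"France": 0.7, "Spain": 0.3},
    ("The", "capital", "of", "France"): {"is": 1.0},
    ("The", "capital", "of", "France", "is"): {"Paris": 0.88, "a": 0.12},
}

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        probs = NEXT_TOKEN_PROBS.get(tuple(tokens))
        if probs is None:
            break  # our toy table has no prediction for this context
        # Sample one token according to its probability, then append it.
        next_token = random.choices(list(probs), weights=probs.values())[0]
        tokens.append(next_token)
    return " ".join(tokens)

print(generate(["The", "capital", "of"], steps=3))
```

Run it a few times and you will sometimes get a different continuation, because the next token is sampled, not chosen deterministically. That sampling step is exactly where temperature comes in.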

Try it: see the probabilities

Type the start of a sentence and see simulated probabilities for the next word. Real models compute these over their full vocabulary; this demo uses a small curated set to show the principle.

Notice that the model never commits to a single answer. It assigns probabilities to many possible continuations. After "The capital of France is," it might put 88% on "Paris," 4% on "a," 2% on "located," and scatter tiny fractions across thousands of other words. Which word it actually picks depends on a setting called temperature.

Step 5: Temperature (creativity vs. safety)

Temperature is a knob that controls how the model picks from its probability distribution. At low temperature (close to 0) it almost always goes with the most likely word, which produces output that is safe and predictable but also repetitive and dull. At high temperature (close to 2) it gives real chances to less likely words, making the output more surprising and creative but sometimes incoherent. Think of it like asking someone to continue a story: a cautious person always says the obvious next line, while a wild improviser takes bizarre detours. Somewhere in the middle you get the best mix.

Try it: turn the temperature knob

Drag the slider and click Generate. Each word is sampled from a different distribution that depends on the previous choice, just like a real model. The chart shows the first step.

This is why asking the same question twice can give you two different answers. The model is not confused; it is sampling from a distribution, and temperature determines how much variety that sampling allows.
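For the mathematically inclined, temperature is a one-line transformation. Real models divide the raw scores (logits) by the temperature before the softmax; dividing log-probabilities, as in this sketch, amounts to the same thing.

```python
import math

def apply_temperature(probs, temperature):
    """Re-shape a probability distribution with temperature (a sketch).
    Assumes all probabilities are nonzero. T < 1 sharpens the
    distribution toward the top choice; T > 1 flattens it."""
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.88, 0.08, 0.04]  # e.g. "Paris", "a", "located"
print(apply_temperature(probs, 0.2))  # near-certain on the top choice
print(apply_temperature(probs, 1.0))  # unchanged
print(apply_temperature(probs, 2.0))  # flatter: underdogs get a chance
```

At temperature 1.0 the distribution comes back unchanged, which is why 1.0 is often described as the "neutral" setting.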

Step 6: Scale

Everything I have described so far is the basic recipe. What turned it from a modest research result into the thing everyone is talking about is not some secret trick. It is scale.

  • Parameters: GPT-3 had 175 billion adjustable numbers. GPT-4's size was never officially disclosed, but models in this class are believed to have hundreds of billions or more. Together these numbers encode patterns about language, facts, reasoning styles, and tone.
  • Training data: trillions of words from books, websites, code, and more. The more diverse the data, the better the model generalizes to new situations.
  • Compute: training a model this large requires thousands of specialized chips running for weeks or months, and it is extremely expensive.

What researchers found is that if you make the model bigger, feed it more data, and let it train longer, its performance improves in a surprisingly smooth and predictable way. They call this a scaling law, and it is the main reason the field accelerated so dramatically over the last few years.
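The published scaling laws have a simple functional form: a power law that decays toward an irreducible floor as the model and dataset grow. The constants below are illustrative stand-ins, not fitted values; the shape of the curve is the point.

```python
# An illustrative scaling-law curve. The functional form (power-law
# terms plus an irreducible floor) matches what researchers report;
# the constants here are made up for the sketch, not fitted values.
def loss(params, data, floor=1.7, a=400.0, alpha=0.34, b=400.0, beta=0.28):
    return floor + a / params**alpha + b / data**beta

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {loss(n, 1e12):.3f}")
```

Each tenfold increase in parameters buys a smooth, predictable drop in loss, which is what let researchers forecast how much better a bigger model would be before spending the money to train it.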

Step 7: What it cannot do

I think it is worth being honest about limits, because the hype around these tools can make them seem almost magical, and that is not a helpful way to think about them.

  • The model does not know what is true. It knows what is probable text. Sometimes probable text happens to be factually wrong, and the model will state it with the same confidence either way.
  • The base model has no persistent memory between conversations. Some products (like ChatGPT) add a memory layer on top, but that is an engineering feature, not something the model itself learned to do.
  • Even within a single conversation, the model can only handle a limited number of tokens at once (this is called the context window). Early models were limited to a few thousand tokens; newer ones can handle over a hundred thousand, but there is always a ceiling. If a conversation gets long enough, the oldest parts simply fall out of what the model can see.
  • It does not reason the way you and I do. It can reproduce patterns of reasoning that it saw during training, which is genuinely impressive but not the same thing as understanding.
  • When it makes something up and presents it confidently, people call that a "hallucination." This happens because the model is optimized for plausible-sounding continuations, not for factual accuracy.

None of that makes the tool useless. It just means you will get more out of it if you understand what it is actually doing under the hood.

The whole picture

A language model breaks text into tokens, maps them to a high-dimensional space, uses attention to figure out what each word means in context, and then predicts the next token, one step at a time, over and over, until the response is complete.

Everything else you see (the chatbot interface, the helpful assistant persona, the ability to follow instructions) is built on top of that foundation. At its core, this is a next-word predictor that has been trained on an enormous amount of text.

I hope this makes the whole thing a bit less mysterious. Next time it gives you a strange answer, you will at least know why.