LLMs contain a LOT of parameters. But what’s a parameter?

When a model is trained, each word in its vocabulary is assigned a numerical value that captures the meaning of that word in relation to all the other words, based on how the word appears in countless examples across the model’s training data.

Each word gets replaced by a kind of code?

Yeah. But there’s a bit more to it. The numerical value—the embedding—that represents each word is in fact a list of numbers, with each number in the list representing a different facet of meaning that the model has extracted from its training data. The length of this list of numbers is another thing that LLM designers can specify before an LLM is trained. A common size is 4,096.

Every word inside an LLM is represented by a list of 4,096 numbers?

Yup, that’s an embedding. And each of those numbers is tweaked during training. An LLM with embeddings that are 4,096 numbers long is said to have 4,096 dimensions.
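To make that concrete, here is a minimal sketch of an embedding table in Python. Everything in it is a toy assumption for illustration: the four-word vocabulary, the length of 8 instead of 4,096, the random starting values, and the hypothetical `nudge` helper that stands in for one training tweak. It is not how any particular LLM is built.

```python
import random

# Hypothetical toy vocabulary; a real LLM's vocabulary is far larger.
vocab = ["table", "chair", "astronaut", "moon"]
DIM = 8  # toy embedding length; the article's example uses 4,096

random.seed(0)  # make the random starting values repeatable

# One embedding (a list of DIM numbers) per word, starting from small random values.
embeddings = {
    word: [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    for word in vocab
}

def nudge(word, gradient, learning_rate=0.01):
    """Tweak every number in one word's embedding (hypothetical training step)."""
    embeddings[word] = [
        x - learning_rate * g for x, g in zip(embeddings[word], gradient)
    ]

# Every number in every embedding is a value that training can adjust.
embedding_parameter_count = len(vocab) * DIM

print(len(embeddings["moon"]))     # how many numbers represent one word
print(embedding_parameter_count)   # how many numbers the whole table holds
```

Even in this toy version, the table holds one adjustable number per word per dimension, so the count grows quickly as either the vocabulary or the embedding length gets bigger.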

Why 4,096?

It might look like a strange number. But LLMs (like anything that runs on a computer chip) work best with powers of two—2, 4, 8, 16, 32, 64, and so on. LLM engineers have found that 4,096 is a power of two that hits a sweet spot between capability and efficiency. Models with fewer dimensions are less capable; models with more dimensions are too expensive or slow to train and run.
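If you want to check the arithmetic, 4,096 is 2 multiplied by itself 12 times. The short Python sketch below confirms that; the `is_power_of_two` helper is a name made up for this example.

```python
def is_power_of_two(n):
    """True when n is 1, 2, 4, 8, ... (exactly one bit set in binary)."""
    return n > 0 and (n & (n - 1)) == 0

print(is_power_of_two(4096))  # True
print(4096 == 2 ** 12)        # True
print(is_power_of_two(4000))  # False
```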

Using more numbers allows the LLM to capture very fine-grained information about how a word is used in many different contexts, what subtle connotations it might have, how it relates to other words, and so on.

Back in February, OpenAI released GPT-4.5, the firm’s largest LLM yet (some estimates have put its parameter count at more than 10 trillion). Nick Ryder, a research scientist at OpenAI who worked on the model, told me at the time that bigger models can work with extra information, like emotional cues, such as when a speaker’s words signal hostility: “All of these subtle patterns that come through a human conversation—those are the bits that these larger and larger models will pick up on.”

The upshot is that all the words inside an LLM get encoded into a high-dimensional space. Picture thousands of words floating in the air around you. Words that are closer together have similar meanings. For example, “table” and “chair” will be closer to each other than they are to “astronaut,” which is close to “moon” and “Musk.” Way off in the distance you can see “prestidigitation.” It’s a little like that, but instead of being related to each other across three dimensions, the words inside an LLM are related across 4,096 dimensions.
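Here is a rough sketch of that idea in Python, using three dimensions instead of 4,096. The numbers are invented by hand so that related words sit near each other; they do not come from a real model. The sketch measures closeness with straight-line (Euclidean) distance, which is one simple choice; in practice other measures such as cosine similarity are also common.

```python
import math

# Made-up 3-number embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "table": [0.9, 0.1, 0.0],
    "chair": [0.8, 0.2, 0.1],
    "astronaut": [0.1, 0.9, 0.3],
    "moon": [0.0, 0.8, 0.5],
}

def distance(word_a, word_b):
    """Straight-line distance between two words' embeddings."""
    return math.dist(toy_embeddings[word_a], toy_embeddings[word_b])

def closest(word):
    """Return the other word whose embedding is nearest to this one."""
    others = (w for w in toy_embeddings if w != word)
    return min(others, key=lambda other: distance(word, other))

print(closest("table"))      # chair
print(closest("astronaut"))  # moon
print(distance("table", "chair") < distance("table", "astronaut"))  # True
```

The same calculation works unchanged on lists of 4,096 numbers; only the picture in your head stops working.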

Yikes.

It’s dizzying stuff. In effect, an LLM compresses the entire internet into a single monumental mathematical structure that encodes an unfathomable amount of interconnected information. It’s both why LLMs can do astonishing things and why they’re impossible to fully understand.
