Quantization, one of the most widely used techniques for making AI models more efficient, has limits, and the industry may be fast approaching them.
In the context of artificial intelligence, quantization refers to reducing the number of bits (the smallest units a computer can process) needed to represent information. Consider this analogy: when someone asks the time, you’d probably say “noon” rather than “twelve hundred, one second and four milliseconds.” That’s quantization: both answers are correct, but one is slightly more precise. How much precision you actually need depends on the context.
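To make the analogy concrete, here is a toy sketch in plain Python (not drawn from the study or from any model library) that stores a time of day using a fixed number of bits; the fewer the bits, the coarser the answer that comes back out:

```python
# Toy quantizer: represent a value from a known range using only `num_bits` bits.
# Purely illustrative; real quantization schemes for AI models are more involved.

def quantize(value, num_bits, lo=0.0, hi=24.0):
    """Snap `value` in [lo, hi] to one of 2**num_bits evenly spaced levels."""
    levels = 2 ** num_bits - 1
    step = (hi - lo) / levels
    code = round((value - lo) / step)      # the integer that fits in `num_bits` bits
    return code, lo + code * step          # the code and the value it decodes back to

exact = 12 + 1 / 3600 + 0.004 / 3600       # 12:00:01.004, expressed in hours
for bits in (16, 8, 4):
    code, approx = quantize(exact, bits)
    print(f"{bits:>2} bits -> decoded {approx:.6f} h (error {abs(exact - approx) * 3600:.3f} s)")
```

At 16 bits the decoded time is off by a fraction of a second; at 4 bits it is off by the better part of an hour, which is the trade-off the analogy is pointing at.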
AI models consist of several components that can be quantized, in particular the parameters, the internal variables models use to make predictions or decisions. This is convenient, considering that models perform millions of calculations when they are run: quantized models, with fewer bits representing their parameters, are less demanding mathematically, and therefore computationally. (To be clear, this is a different process from “distillation,” which is a more involved and selective pruning of parameters.)
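As a rough illustration of what quantizing parameters looks like, here is a minimal NumPy sketch of a symmetric 8-bit scheme applied to a small weight matrix. The scheme and its details are assumptions chosen for clarity, not how any particular production framework does it:

```python
import numpy as np

# Toy symmetric 8-bit quantization of a weight matrix.
rng = np.random.default_rng(0)
weights_fp32 = rng.normal(size=(4, 4)).astype(np.float32)   # 32 bits per parameter

scale = np.abs(weights_fp32).max() / 127.0                   # map the largest weight to +/-127
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)  # 8 bits each

dequantized = weights_int8.astype(np.float32) * scale        # approximate reconstruction
print("max absolute error:", np.abs(weights_fp32 - dequantized).max())
print("memory: fp32 =", weights_fp32.nbytes, "bytes, int8 =", weights_int8.nbytes, "bytes")
```

The int8 copy takes a quarter of the memory and is cheaper to compute with, at the cost of a small reconstruction error per weight.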
But quantization may have more trade-offs than previously thought.
According to a study by researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon, quantized models perform worse if the original, unquantized version of the model was trained over a long period of time on large amounts of data. In other words, at some point it may actually be better to train a smaller model than to cook down a large one.
This could be bad news for AI companies that train extremely large models (which are known to improve answer quality) and then quantize them to make them less expensive to serve.
The effects are already showing. A few months ago, developers and academics reported that quantizing Meta’s Llama 3 model tended to be more “harmful” than quantizing other models, potentially because of the way it was trained.
“I think the number one cost for everyone in AI will continue to be inference, and our work shows an important way to reduce it that won’t work forever,” Harvard math student and first author Tanishq Kumar told TechCrunch.
Contrary to popular belief, AI model inference (running a model, for example when ChatGPT answers a question) is often more expensive in aggregate than model training. Consider that Google spent an estimated $191 million to train one of its flagship Gemini models, certainly a huge sum. But if the company used that model to generate just 50-word answers to half of all Google Search queries, it would spend roughly $6 billion a year.
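For a sense of how such an estimate shakes out per query, here is a back-of-the-envelope calculation using the article’s figures plus one outside assumption: the daily-search volume below is a rough public estimate, not a number from the article or the study.

```python
# Back-of-the-envelope only: the article's $191M training figure and ~$6B/year
# inference figure, plus an assumed query volume.
TRAINING_COST = 191e6             # estimated cost to train one Gemini model (from the article)
YEARLY_INFERENCE_COST = 6e9       # article's estimate for 50-word answers to half of searches
SEARCHES_PER_DAY = 8.5e9          # assumption: rough public estimate of daily Google searches

answers_per_year = 0.5 * SEARCHES_PER_DAY * 365
print(f"answers served per year: {answers_per_year:.2e}")
print(f"implied cost per 50-word answer: ${YEARLY_INFERENCE_COST / answers_per_year:.4f}")
print(f"inference / training cost ratio: {YEARLY_INFERENCE_COST / TRAINING_COST:.0f}x")
```

Under those assumptions, serving costs fractions of a cent per answer but dwarfs the one-time training bill by an order of magnitude every year, which is why labs lean so hard on quantization.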
Major AI labs have embraced training models on massive datasets on the assumption that “scaling up” (increasing the amount of data and compute used in training) will yield increasingly capable AI.
For example, Meta trained Llama 3 on a set of 15 trillion tokens. (Tokens represent bits of raw data; 1 million tokens is equal to about 750,000 words.) The previous generation, Llama 2, was trained on “only” 2 trillion tokens. In early December, Meta released a new model, Llama 3.3 70B, which the company said “improves core performance at a significantly lower cost.”
There is evidence that scaling up eventually yields diminishing returns; Anthropic and Google reportedly trained enormous models recently that fell short of internal benchmark expectations. But there is little sign that the industry is ready to move away from these entrenched scaling approaches in any meaningful way.
So if labs are reluctant to train models on smaller datasets, is there a way to make models less susceptible to this degradation? Possibly. Kumar says he and his co-authors found that training models in “low precision” can make them more robust. Bear with us for a moment as we dive in a bit.
“Precision” here refers to the number of digits a numerical data type can represent accurately. Data types are collections of data values, usually defined by a set of possible values and allowed operations; the FP8 data type, for example, uses only 8 bits to represent a floating-point number.
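To see what dropping precision does to an individual number, here is a small NumPy sketch that casts the same value to progressively narrower floating-point types. Base NumPy stops at 16 bits, so FP8 is only mentioned in the comment, but the principle (fewer bits, coarser representable values) is the same:

```python
import numpy as np

# The same value stored at different floating-point precisions.
# Base NumPy has no FP8 type; with 8 bits the error below would grow further still.
x = 0.123456789
for dtype in (np.float64, np.float32, np.float16):
    y = dtype(x)
    print(f"{np.dtype(dtype).name:>8} ({np.dtype(dtype).itemsize * 8:2d} bits): "
          f"{float(y):.10f}  error={abs(float(y) - x):.2e}")
```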
Today, most models are trained in 16-bit or “half” precision and then “post-train quantized” to 8-bit precision. Certain model components (for example, its parameters) are converted to a lower-precision format at the cost of some accuracy. Think of it as doing the math to a few decimal places and then rounding off to the nearest tenth, which often gives you the best of both worlds.
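As one concrete example of post-training quantization (an assumed choice on my part; the article does not name a toolkit), PyTorch’s dynamic quantization converts the weights of selected layer types to int8 after training, while activations stay in floating point until inference time:

```python
import torch
import torch.nn as nn

# Sketch of post-training quantization via PyTorch dynamic quantization.
# The tiny model below stands in for a trained fp32 network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Linear-layer weights are converted from fp32 to int8 after training;
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print("fp32 output:       ", model(x)[0, :3])
print("int8-weight output:", quantized(x)[0, :3])
```

The two outputs should be close but not identical; that gap is the accuracy cost the rounding analogy describes.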
Hardware vendors such as Nvidia are pushing for even lower precision in quantized model inference. The company’s new Blackwell chip supports 4-bit precision, specifically a data type called FP4; Nvidia has touted this as a boon for memory- and power-constrained data centers.
However, extremely low quantization precision may not be desirable. According to Kumar, unless the original model is incredibly large in terms of its parameter count, precisions lower than 7 or 8 bits may bring a noticeable drop in quality.
If this all sounds a little technical, don’t worry, it is. The main point is simply that AI models are not fully understood, and known shortcuts that work in many kinds of computation do not work here. You wouldn’t say “noon” if someone asked when you started a 100-meter dash, would you? It’s not quite as obvious as that, of course, but the idea is the same:
“The key point of our work is that there are limits you can’t naively get around,” Kumar said. “We hope our work adds nuance to the discussion, which often seeks ever-lower precision defaults for training and inference.”
Kumar acknowledges that his and his colleagues’ study was relatively small in scale (they plan to test it on more models in the future), but he believes at least one insight will hold: there’s no free lunch when it comes to reducing inference costs.
“Bit precision matters, and it’s not free,” he said. “You can’t reduce it forever without the models suffering. Models have finite capacity, so rather than trying to fit a quadrillion tokens into a small model, I think much more effort will go into putting only the highest-quality data into smaller models. I’m optimistic that new architectures that deliberately aim to make low-precision training stable will be important in the future.”
This story was originally published on November 17, 2024 and was updated on December 23 with new information.