Is AI training really scalable?
- Author: Anirudh Sathiya
[ Readtime: <4m ]
Modern LLMs like ChatGPT stem from the transformer architecture introduced in the Google paper "Attention Is All You Need" ^1. While the initial breakthroughs in GPT models showed impressive capabilities, many recent evaluations suggest that the returns on model performance are diminishing as model size and training compute continue to increase ^2. This raises the question: are we approaching a dead end in training such models?
This is a million-dollar question, as it has consequences for the US stock market, many of our careers, and, in the worst case, an apocalyptic AGI scenario ^3. Wall Street and Silicon Valley strongly believe that better AI comes from scale: keep pumping billions into distributed training with more compute, more data, and larger models, and we'll get greater intelligence and fewer hallucinations. OpenAI has raised 21.9 billion USD by preaching this philosophy ^4. As we'll explore below, big AI has three problems to tackle: a scaling problem, an architectural problem, and a data problem.
The most significant obstacle to achieving "higher intelligence" is the law of diminishing returns for transformer-based language models. Current multimodal models follow a log-linear scaling trend during pre-training, meaning that linear improvements to the model take exponentially more compute. Say it costs you 10 million USD to make a 10% zero-shot improvement to your existing model; it would then cost you around a whopping 100 million USD to make a further 10% improvement. ^5^6
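To make that arithmetic concrete, here is a minimal sketch of the log-linear relationship the scaling-law papers describe; the constants $a$ and $b$ are illustrative placeholders, not values taken from those papers:

$$
\text{Perf}(C) \approx a + b \log_{10} C
\quad\Longrightarrow\quad
\Delta\text{Perf} = b \log_{10}\frac{C_2}{C_1}
$$

A fixed gain $\Delta$ therefore requires multiplying compute $C$ (and, roughly, cost) by the same constant factor $10^{\Delta / b}$ every time. Under the numbers in the example above, each additional 10% improvement multiplies the bill by about 10.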
Another problem is the same reason self-driving cars can't be proven to drive safely. A quote I heard, and quite liked, captures it: "AI doesn't 'sometimes' hallucinate. AI is always hallucinating, and most of the time its hallucination matches expectations." In technical terms, current models perform poorly on long-tailed data distributions, where a few kinds of data points occur very frequently while the majority occur rarely. This is a fundamental issue with the current deep learning architecture, which does not understand causality but simply works off correlation: if conditions are freezing and snowy, it won't infer the connection to slipping on black ice. Returning to the self-driving example, these cars seem to drive fine in the normal conditions they are primarily trained on, but when presented with an edge case such as a deer crossing, they are far more likely to hallucinate, which is not an outcome you want as a passenger. ^7
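To see why the tail bites, here is a toy simulation (my own illustration, not an experiment from the cited paper) using a Zipf power-law distribution: a handful of event types dominate the training data, yet deployment keeps surfacing event types that training never contained.

```python
import numpy as np

# Toy long-tail experiment: draw "training" and "deployment" events from the
# same Zipf (power-law) distribution and count how often deployment produces
# an event type that never appeared in training.
rng = np.random.default_rng(0)

train = rng.zipf(a=1.5, size=1_000_000)   # a million "miles" of training experience
test = rng.zipf(a=1.5, size=10_000)       # a short stretch of deployment

seen = set(train.tolist())
novel = sum(1 for event in test if event not in seen)

print(f"distinct event types seen in training: {len(seen):,}")
print(f"deployment events never seen in training: {novel / len(test):.2%}")
```

Even with a hundred times more training data than deployment data, a run of this sketch typically shows a small but nonzero share of deployment events that are brand new; those are the deer crossings and black-ice patches the model has to guess about.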
Finally, a model is only as good as the data it is trained on, and data is finite in both quality and quantity. As things stand, our "brute-force" approach to training intelligent systems is already nearing that limit. The law of diminishing returns shows up here too: exponentially more data is required to observe linear improvements in models ^5^6. To work around this, synthetic data, i.e., data generated by one model, is used to train other models. Unsurprisingly, this has been shown to degrade performance and to introduce undesirable behaviors and biases that are not fully understood, a failure mode often referred to as "Model Autophagy Disorder" or "Habsburg AI" ^8. This is an especially pressing concern today, as more AI-generated data infiltrates the internet and it becomes hard to tell human-written content from AI-synthesized content.
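Here is a toy illustration of that feedback loop (my own sketch, not the method from the cited article): fit a Gaussian to a small sample, generate the next "dataset" from the fit, refit, and repeat. Over enough generations the fitted spread tends to drift toward zero and the original distribution is forgotten.

```python
import numpy as np

# Toy "Habsburg AI" loop: each generation is trained only on data generated
# by the previous generation's model. With a small sample size, estimation
# error compounds and the fitted spread tends to shrink over time.
rng = np.random.default_rng(0)

data = rng.normal(loc=0.0, scale=1.0, size=20)      # generation 0: real, human data
for generation in range(1, 201):
    mu, sigma = data.mean(), data.std()             # "train" a model on the current data
    data = rng.normal(mu, sigma, size=20)           # next generation sees only synthetic data
    if generation % 50 == 0:
        print(f"generation {generation:3d}: fitted sigma = {sigma:.4f}")
```

The same kind of degradation, in far messier form, is what the model-collapse literature reports when language models are trained on each other's output.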
OP-ED:
To me, our current approach to training models is akin to teaching a monkey with learning impairments to multiply two numbers by giving it thousands of multiplication tables and hoping it has seen enough examples not to fail.
While Llama and GPT models still fascinate me day to day with their capabilities, it's important to remember that they lack true comprehension, reasoning, and the ability to think critically. Hence, we can be fairly confident that we won't lose our jobs to AI until we hit a few more key milestones. Until then, AI remains a tool for augmenting human capabilities and assisting us rather than replacing us entirely.
References:
1. Attention Is All You Need (Paper)
2. Current AI scaling laws are showing diminishing returns, forcing AI labs to change course (Article)
3. Not a reference, but you should watch the show Person of Interest
4. OpenAI financials page on Crunchbase; Model Size Comparison (Reddit Post)
5. Scaling Laws for Neural Language Models (Paper)
6. A Neural Scaling Law from the Dimension of the Data Manifold (Paper)
7. No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance (Paper)
8. Preventing AI Model Collapse: Addressing the Inherent Risk of Synthetic Datasets (Article)