The reality of Generative AI: achievements and limitations

Part 1 of a two-part thought piece by Bayes Director of Research and Innovation, Dr Vaishak Belle, on the reality and future of Generative AI.


Generative AI has undoubtedly brought discussions about the adoption and risks of AI to the forefront, though arguably for somewhat routine tasks such as writing reports, generating paintings, and coding. Add the more troubling cases, deepfakes and synthetic media eroding the social fabric (AI-generated images and videos are now sophisticated enough that distinguishing real from fake on social media is increasingly difficult) and preemptive workforce reductions made on the promise of "agentic AI", and the technological discourse swings between optimism and fear. This oscillating discourse often misses what is genuinely interesting about these systems: they represent non-trivial progress while simultaneously revealing fundamental limitations we need to understand clearly.

The achievement is genuine

From the point of view of an algorithm designer or scientist, performing routine tasks such as drafting emails, summarising documents, generating code snippets, and creating images from text prompts at this speed and accuracy is, make no mistake, very hard. Machine learning scientists have spent decades trying to understand how language could be used for these tasks. Even in the nineties, people came up with very clever algorithms for spam detection, relying on probabilistic models and Bayes' Theorem (from which the Bayes Centre takes its name). Getting to where we are now required numerous breakthroughs: deep learning architectures, new computing paradigms, the ability to harness massive datasets, and sophisticated pre-training schemes.
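To make that older, probabilistic tradition concrete, here is a minimal from-scratch sketch of a naive Bayes spam filter. The toy corpus, the 50/50 prior, and the test message are all invented for illustration; real filters use far larger vocabularies and learned priors.

```python
# A minimal sketch of Bayes' theorem applied to spam detection:
# P(spam | words) is proportional to P(spam) * product of P(word | spam),
# computed in log space with Laplace smoothing.
from collections import Counter
import math

spam = ["win cash now", "free prize claim now"]   # toy training data
ham = ["meeting moved to noon", "see you at lunch"]

def word_counts(msgs):
    c = Counter()
    for m in msgs:
        c.update(m.split())
    return c

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(msg, counts, prior):
    total = sum(counts.values())
    lp = math.log(prior)
    for w in msg.split():
        # Laplace smoothing so unseen words don't zero out the product
        lp += math.log((counts[w] + 1) / (total + len(vocab)))
    return lp

msg = "claim your free cash"
is_spam = log_prob(msg, spam_counts, 0.5) > log_prob(msg, ham_counts, 0.5)
print("spam" if is_spam else "ham")  # -> spam
```

Note that everything in this classifier is inspectable: the prior, the word likelihoods, the smoothing. Modern generative systems offer nothing like this transparency.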

Much of this may be engineering rather than fundamental science, more "hack" than insight into human cognition, but the engineering achievement is non-trivial. The productivity gains are arguably substantial, particularly for writing and coding, even if concerns about mediocre outputs are warranted. At the same time, many models, including those from Meta, were trained on copyrighted books without permission, spawning lawsuits alleging copyright infringement.

Yet here's what's striking: these benefits have fallen short of the transformation early adopters anticipated. With the right infrastructure, you can create sophisticated automations. But large language models don't deliver miracles at the click of a button; they need extensive supporting architecture.

This isn't because the technology failed to advance. Recent models are incomprehensibly large: GPT-4 and similar systems are estimated to contain roughly a trillion parameters, about one-hundredth the number of synaptic connections in the human brain. They can cost close to $100 million to train and are trained on a substantial portion of all human-generated text.

For these reasons, I don't believe simply scaling these systems larger will deliver what we expect. And I think this stems from a widespread misunderstanding about what these systems actually are.

Error-prone by design

Current generative AI systems are error-prone—not as a bug, but by design.

Unlike traditional software that follows explicit rules, large language models operate as sophisticated pattern-matching systems. As Rodney Brooks recently argued, they function more like elaborate search engines that merge multiple statistical patterns into coherent-seeming responses than like systems that understand or reason about the world. They generate outputs based on statistical correlations rather than causal understanding or logical reasoning.
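A toy example makes the principle concrete. The sketch below, using an invented four-sentence corpus, builds a bigram model (the crudest possible statistical language model) and generates text by sampling each next word from the words that followed it in training. Real LLMs are vastly more sophisticated, but the underlying move, predicting the next token from observed correlations, is the same.

```python
# A toy sketch of generation from statistical correlation alone: a bigram
# model picks each next word by how often it followed the previous word
# in the training text. No world model, no reasoning.
import random
from collections import defaultdict

corpus = ("the car needs washing . the car wash is nearby . "
          "walk to the shop . drive the car to the wash .").split()

follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)        # record every observed continuation

random.seed(0)                       # fixed seed for reproducibility
word, out = "the", ["the"]
for _ in range(8):
    word = random.choice(follows[word])  # sample proportional to frequency
    out.append(word)
print(" ".join(out))  # fluent-looking, but driven purely by co-occurrence
```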

As Melanie Mitchell and Gary Marcus have argued, LLMs don't explicitly construct world models—intermediate representations of objects, entities, and relationships. A recent viral example, the "Car Wash Puzzle," illustrates this vividly: when asked "I want to wash my car and the car wash is just 50 meters away. Should I start my car and drive there or just walk?", most modern LLMs suggest walking—missing the obvious fact that you need the car at the car wash to wash it. This reveals ongoing struggles with spatial reasoning and basic physical causality, even in frontier models.

They operate as extraordinarily complex statistical models that remix and recombine linguistic patterns. Generating genuinely novel insights is arguably impossible. Everything emerges from manipulating correlations in language.

The human in the loop

This correlation-based architecture explains why generative AI requires substantial human oversight for critical applications. Despite appearances, they don't generate genuinely novel objects. They produce statistical pastiche, recombinations of patterns from their training data. Hence, the term "AI slop." LLM-generated text may be coherent, but it rarely contains truly original insights. Images from DALL-E or Midjourney are remixes of existing visual styles rather than new aesthetic paradigms. The problem is particularly acute with AI-generated videos flooding social media—computationally expensive to produce yet offering questionable value.

The current design pattern: the LLM provides initial drafts or suggestions, but human judgment remains essential for accuracy and quality. Yet there's evidence that humans using LLMs for brainstorming actually remember less and may become trapped in local optima, their thinking constrained by the suggestions they receive.

Consider GitHub Copilot. It can dramatically accelerate coding by suggesting boilerplate and common patterns. But developers must carefully review these suggestions—they frequently contain subtle bugs or security vulnerabilities.
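Here is a hypothetical illustration of the pattern (invented for this article, not an actual Copilot output): a suggested function that runs, looks idiomatic, and hides a classic vulnerability that a quick skim can miss.

```python
# A plausible-looking suggestion with a subtle flaw, plus the fix a
# reviewer should insist on.
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # Vulnerable: user input is spliced directly into the SQL string,
    # opening a classic injection hole.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, name: str):
    # The fix: a parameterised query, so input is treated as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
print(find_user_unsafe(conn, "x' OR '1'='1"))  # leaks every row
print(find_user_safe(conn, "x' OR '1'='1"))    # returns nothing, as intended
```

Both functions pass a casual review; only the second survives a careful one. That gap is exactly where human oversight earns its keep.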

This human-in-the-loop requirement has serious implications. It limits the scalability of productivity gains, since human attention remains necessary. Domain expertise stays essential. And while human work shifts toward verification and curation, we're losing opportunities to train the next generation in foundational problem-solving skills. 

The environmental cost

These qualified benefits must be weighed against a substantial ecological footprint. While training costs for individual LLMs remain relatively small compared to other industrial activities like steel production, inference costs across millions of queries add up significantly. More concerning still are image and especially video generation models, which are far more energy-intensive while arguably producing less useful output.
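A back-of-envelope sketch shows how small per-query costs compound. Both figures below are assumptions chosen purely for illustration, not measurements:

```python
# A back-of-envelope sketch, not a measurement: every figure here is an
# illustrative assumption, used only to show how per-query energy costs
# compound at scale.
ENERGY_PER_QUERY_WH = 0.3         # assumed energy per text query (illustrative)
QUERIES_PER_DAY = 1_000_000_000   # assumed global query volume (illustrative)

daily_kwh = ENERGY_PER_QUERY_WH * QUERIES_PER_DAY / 1000
yearly_gwh = daily_kwh * 365 / 1_000_000
print(f"{daily_kwh:,.0f} kWh/day, about {yearly_gwh:,.1f} GWh/year")
# Tiny per query, yet under these assumptions the aggregate is on the
# order of a small city's annual electricity use; video generation
# multiplies the per-query figure by orders of magnitude.
```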

The rapid expansion of AI has created disruptions in hardware markets. Recent price increases in RAM chips have been partly attributed to surging demand from AI data centres, illustrating how quickly this technology is reshaping infrastructure and markets.

Do incremental productivity gains justify this ecological cost? The question is sharpened by the economic incentives that currently shape AI investment. The technology is primarily used in consumer applications (marketing content, email drafting, social media posts) because that's where immediate revenue exists. These applications are easier to monetise than scientific research, which requires deep domain expertise, longer development cycles, and less obvious paths to profitability. Whether the ecological impact is justified for such applications remains an open question.

The scientific community harbours its own concerns. Many researchers argue that while LLMs represent impressive engineering, they don't capture the full breadth of AI, which encompasses reasoning, planning, search, and combinatorial problems alongside machine learning. The conflation of AI with LLMs creates problems when assessing genuine risks and capabilities.

Generative AI represents non-trivial progress with interesting applications. But recognising its fundamental limitations (error-prone by design, requiring constant human oversight, and carrying substantial environmental costs) is essential for deploying these systems responsibly.

 

In Part 2, we'll explore how hybrid architectures combining neural and symbolic approaches might address some of these limitations while opening new possibilities for scientific research. 
