Unlocking the Mystery: What Large Language Models Can't Do (Yet)

In the not-too-distant future, you’re likely going to face a seemingly simple, yet surprisingly tough question from your leadership or colleagues: “What CAN’T large language models (LLMs) do?” The reasons behind this question are understandable. People want to grasp how their roles might evolve as this technology continues to permeate the workplace. Some of this curiosity stems from apprehension, while some is driven by the urge to prepare. Moreover, expect inquisitive employees at company town halls to pose pointed questions about this technology’s future in the workplace. Today, I’ll help you start answering these kinds of questions.

In the media, you’ll often hear phrases like “LLMs are still dumb.” But what exactly does this mean? And how do we contextualize such statements when we see many instances where LLMs, such as GPT-4, exhibit human-like reasoning abilities?

I recently came across The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain, a paper that might help you answer these questions. The authors elaborate on why LLMs struggle to formulate abstract concepts and apply them to problem-solving. What’s particularly appealing about this paper is that it contains numerous simple examples that illustrate GPT-4’s limitations. These are akin to the puzzles you’d find on a restaurant placemat, designed to keep children engaged — for instance, completing a shape that looks like a square but has a few pixels missing. Here are a few examples from the Appendix:

The paper presents test results comparing humans and GPT-4 on these tasks. Humans correctly answer them over 90% of the time, while GPT-4’s success rate hovers around 20%.
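To give a concrete sense of what these puzzles look like, here is a hypothetical sketch in Python of one such task: a square outline on a grid with a few cells missing. The grid values and the completion rule below are my own illustrative assumptions, not taken from the paper — the point is simply that a human infers the abstract concept (“complete the square”) at a glance, while a short rule-based program only works because we hard-coded that concept into it.

```python
# Hypothetical ARC-style puzzle: a 5x5 grid holding a square outline
# with two cells missing (0 = empty, 1 = filled). The solver below
# encodes one specific abstraction: find the bounding box of the
# filled cells and fill in its border.

def complete_rectangle(grid):
    """Fill in the missing border cells of a rectangle outline."""
    filled = [(r, c) for r, row in enumerate(grid)
              for c, v in enumerate(row) if v]
    r0, r1 = min(r for r, _ in filled), max(r for r, _ in filled)
    c0, c1 = min(c for _, c in filled), max(c for _, c in filled)
    out = [row[:] for row in grid]
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            if r in (r0, r1) or c in (c0, c1):  # on the border
                out[r][c] = 1
    return out

# A square outline with its top-right and bottom-middle cells missing.
broken = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]

solved = complete_rectangle(broken)
```

Of course, this tiny program only solves puzzles matching the one rule it was given; ConceptARC tests whether a system can infer rules like this on its own, across many different abstract concepts.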

Reviewing these examples, you’ll intuitively grasp that LLMs seem to struggle with abstract thought. However, here’s the important caveat: this is what GPT-4 can accomplish today. The authors have provided the dataset used for this experiment, and you can bet that researchers and individuals around the globe will be working to improve upon these results, hoping to match human-level performance in the coming months.

I suspect we’ll encounter real limits to what LLMs can achieve in reasoning, particularly given how far GPT-4’s current performance on these tasks falls short. But only time will tell.

In the interim, these simple examples may prove helpful in answering some challenging, yet crucial questions from your colleagues.

Subscribe to our YouTube channel where we post daily videos on the ever-evolving world of AI and large language models.


Prolego is an elite consulting team of AI engineers, strategists, and creative professionals guiding the world’s largest companies through the AI transformation. Founded in 2017 by technology veterans Kevin Dewalt and Russ Rands, Prolego has helped dozens of Fortune 1000 companies develop AI strategies, transform their workforce, and build state-of-the-art AI solutions.

Let’s Future Proof Your Business.