How to Test for Artificial General Intelligence

AGI has become a term of hype, and the traditional Turing Test can’t reliably detect it. How can we evaluate claims that someone has built artificial general intelligence? Here’s a better test.


Dear friends,

Happy 2026! Will this be the year we finally achieve AGI? I’d like to propose a new version of the Turing Test, which I’ll call the Turing-AGI Test, to see if we’ve achieved this. I’ll explain in a moment why having a new test is important.

The public thinks achieving AGI means computers will be as intelligent as people and be able to do most or all knowledge work. Here’s the test I propose. The test subject — either a computer or a skilled professional human — is given access to a computer that has internet access and software such as a web browser and Zoom. The judge will design a multi-day experience for the test subject, mediated through the computer, to carry out work tasks. For example, an experience might consist of a period of training (say, as a call center operator), followed by being asked to carry out the task (taking calls), with ongoing feedback. This mirrors what a remote worker with a fully working computer (but no webcam) might be expected to do.

A computer passes the Turing-AGI Test if it can carry out the work task as well as a skilled human.

Most members of the public likely believe a real AGI system will pass this test. Surely, if computers are as intelligent as humans, they should be able to perform work tasks as well as a skilled human one might hire. Thus, the Turing-AGI Test aligns with the popular notion of what AGI means.

Here’s why we need a new test: “AGI” has turned into a term of hype rather than a term with a precise meaning. A reasonable definition of AGI is AI that can do any intellectual task that a human can. When businesses hype up that they might achieve AGI within a few quarters, they usually try to justify these statements by setting a much lower bar. This mismatch in definitions is harmful because it makes people think AI is becoming more powerful than it actually is. I’m seeing this mislead everyone from high-school students (who avoid certain fields of study because they think it’s pointless with AGI’s imminent arrival) to CEOs (who are deciding what projects to invest in, sometimes assuming AI will be more capable in 1-2 years than any likely reality).

The original Turing Test, which required a computer, via text chat, to fool a human judge into being unable to distinguish it from a human, has proven insufficient as an indicator of human-level intelligence. The Loebner Prize competition actually ran the Turing Test and found that being able to simulate human typing errors — perhaps even more than actually demonstrating intelligence — was needed to fool judges. A main goal of AI development today is to build systems that can do economically useful work, not fool judges. Thus a modified test that measures ability to do work would be more useful than a test that measures the ability to fool humans.

For almost all AI benchmarks today (such as GPQA, AIME, SWE-bench, etc.), a test set is determined in advance. This means AI teams end up at least indirectly tuning their models to the published test sets. Further, any fixed test set measures only one narrow sliver of intelligence. In contrast, in the Turing Test, judges are free to ask any question to probe the model as they please. This lets a judge test how “general” the knowledge of the computer or human really is. Similarly, in the Turing-AGI Test, the judge can design any experience — which is not revealed in advance to the AI (or human subject) being tested. This is a better way to measure generality of AI than a predetermined test set.

AI is on an amazing trajectory of progress. In previous decades, overhyped expectations led to AI winters, when disappointment about AI capabilities caused reductions in interest and funding, which picked up again when the field made more progress. One of the few things that could get in the way of AI’s tremendous momentum is unrealistic hype that creates an investment bubble, risking disappointment and a collapse of interest. To avoid this, we need to recalibrate society’s expectations on AI. A test will help.

If we run a Turing-AGI Test competition and every AI system falls short, that will be a good thing! By defusing hype around AGI and reducing the chance of a bubble, we will create a more reliable path to continued investment in AI. This will let us keep on driving forward real technological progress and building valuable applications — even ones that fall well short of AGI. And if this test sets a clear target that teams can aim toward to claim the mantle of achieving AGI, that would be wonderful, too. And we can be confident that if a company passes this test, they will have created more than just a marketing release — it will be something incredibly valuable.

Happy New Year, and have a great year building!

Andrew

Subscribe to The Batch
