
Dear friends,

Happy 2026! Will this be the year we finally achieve AGI? I’d like to propose a new version of the Turing Test, which I’ll call the Turing-AGI Test, to see if we’ve achieved this. I’ll explain in a moment why having a new test is important.

The public thinks achieving AGI means computers will be as intelligent as people and able to do most or all knowledge work. Here is the test I propose: The test subject — either a computer or a skilled human professional — is given access to a computer that has internet access and software such as a web browser and Zoom. The judge will design a multi-day experience for the test subject, mediated through the computer, to carry out work tasks. For example, an experience might consist of a period of training (say, as a call center operator), followed by being asked to carry out the task (taking calls), with ongoing feedback. This mirrors what a remote worker with a fully working computer (but no webcam) might be expected to do.

A computer passes the Turing-AGI Test if it can carry out the work task as well as a skilled human.

Most members of the public likely believe a real AGI system will pass this test. Surely, if computers are as intelligent as humans, they should be able to perform work tasks as well as a skilled human one might hire. Thus, the Turing-AGI Test aligns with the popular notion of what AGI means.

Here’s why we need a new test: “AGI” has turned into a term of hype rather than a term with a precise meaning. A reasonable definition of AGI is AI that can do any intellectual task that a human can. When businesses hype the claim that they might achieve AGI within a few quarters, they usually justify it by setting a much lower bar. This mismatch in definitions is harmful because it makes people think AI is becoming more powerful than it actually is. I’m seeing this mislead everyone from high-school students (who avoid certain fields of study because they think studying them is pointless given AGI’s imminent arrival) to CEOs (who are deciding what projects to invest in, sometimes assuming AI will be more capable in 1-2 years than any likely reality).

Andrew Ng is pictured writing in a notebook by a large window, with a garden and pool visible in the background.

The original Turing Test, which required a computer to fool a human judge, via text chat, into being unable to distinguish it from a human, has proven insufficient as an indicator of human-level intelligence. The Loebner Prize competition actually ran the Turing Test and found that simulating human typing errors — perhaps even more than demonstrating intelligence — was what it took to fool judges. A main goal of AI development today is to build systems that can do economically useful work, not fool judges. Thus a modified test that measures the ability to do work would be more useful than one that measures the ability to fool humans.

For almost all AI benchmarks today (such as GPQA, AIME, and SWE-bench), a test set is determined in advance. This means AI teams end up at least indirectly tuning their models to the published test sets. Further, any fixed test set measures only one narrow sliver of intelligence. In contrast, in the Turing Test, judges are free to ask any questions they please to probe the test subject. This lets a judge test how “general” the knowledge of the computer or human really is. Similarly, in the Turing-AGI Test, the judge can design any experience — which is not revealed in advance to the AI (or human subject) being tested. This is a better way to measure generality of AI than a predetermined test set.

AI is on an amazing trajectory of progress. In previous decades, overhyped expectations led to AI winters, when disappointment about AI capabilities caused reductions in interest and funding, which picked up again when the field made more progress. One of the few things that could get in the way of AI’s tremendous momentum is unrealistic hype that creates an investment bubble, risking disappointment and a collapse of interest. To avoid this, we need to recalibrate society’s expectations on AI. A test will help.

If we run a Turing-AGI Test competition and every AI system falls short, that will be a good thing! By defusing hype around AGI and reducing the chance of a bubble, we will create a more reliable path to continued investment in AI. This will let us keep on driving forward real technological progress and building valuable applications — even ones that fall well short of AGI. And if this test sets a clear target that teams can aim toward to claim the mantle of achieving AGI, that would be wonderful, too. And we can be confident that if a company passes this test, it will have created more than just a marketing release — it will have built something incredibly valuable.

Happy New Year, and have a great year building!

Andrew

Agents of 2026

The pieces are in place: AI models have gained the ability to generate coherent text, images, videos, and other data; draw upon proprietary databases; and navigate the web and take actions online. Get ready for a Cambrian Explosion of intelligent applications that help us live better lives and steward our organizations and communities. In this special issue of The Batch, as in previous New Year issues, some of the brightest minds in AI share their hopes for what comes next.


David Cox is pictured during a discussion in a glass-walled office.

Open Source Wins

by David Cox

My hope is that open AI continues to flourish and ultimately wins.

An open ecosystem has always been the engine of real innovation. It lets communities build on decades of progress and tap into collective talent, not just the resources of a single company. Back in the 1990s, open software like Linux, Apache, and Eclipse challenged the dominant proprietary systems. That fight shaped the internet as we know it. Now, the same principles must guide AI’s evolution.

The parallels are eerie. Some players are trying to own and control AI by doing all the same things Microsoft did back when it was dumping free copies of Windows into developing markets to help keep Linux from gaining a foothold.

Like Microsoft’s free floppy disks of yore, OpenAI and Meta both dropped so-called open models that were not actually open. They didn’t disclose anything about their training sets or formulas, and they put caps on how much revenue you could make. All of this is designed to prevent anyone else from getting traction so that your product ends up being the one that wins.

But the potential of truly open AI is too important to lose. It’s important that AI is not owned by anybody and doesn’t represent the values of only one company. And it’s important that everyone can help shape its future, whether they’re people at other companies, academics, or ordinary users.

Open development has important benefits. First, it reduces the odds of vendor lock-in. Nobody wants critical infrastructure to depend on a proprietary model stuck behind someone else’s API. Second, it enables greater customization. Not only do you legally have the ability to customize an open model (including using it to make a closed one), but it’s also easier to modify because you know something about how it was made.

There is a thriving open ecosystem in China right now. Those developers have done amazing work. But there’s also a weird geopolitical overlay. Countries don’t trust other countries. China doesn’t trust the U.S., the U.S. doesn’t trust China, and Europe doesn’t trust either. And it can be easy to poison a model by training on compromised data. Genuinely open development solves that, because everyone knows what the training sets were and how they were obtained.

At IBM, we’ve been walking this talk. We publish the details of our models, how they were trained, and especially what data they’re trained on. The Stanford Transparency Index put us at the very top, with a score of 95 percent, 23 points ahead of second place. And we’re not the only ones. The Allen Institute has done really impressive work, and developers in China are walking the walk as well.

We know IBM has a reputation for being boring. But boring can actually be good. Boring is stable; it’s a foundation you can build on. IBM is also a little weird. That stable foundation actually lets you do weird things without them falling apart. Let’s make AI more open, more weird, and maybe a little more boring in 2026.

David Cox is the VP for AI models at IBM Research and the IBM Director of the MIT-IBM Watson AI Lab. Previously he taught natural sciences, applied sciences, and engineering at Harvard.


Adji Bousso Dieng is pictured typing on a laptop in a warmly lit room.

AI for Scientific Discovery

by Adji Bousso Dieng

In 2026, I hope AI will transition from being a tool for efficiency to a catalyst for scientific discovery.

For the last decade, the dominant paradigm in deep learning has been interpolation. We have built incredibly powerful models that excel at mimicking the distribution of their training data. This is perfect for the applications where AI shines right now, such as conversational agents and coding assistants, where a query can be answered by identifying statistical patterns in existing data. This paradigm has even led to successful applications, such as AlphaFold, that address scientific challenges that can be formulated as supervised learning problems.

However, within that paradigm, models struggle with the rarest examples, the tails of the data distribution. For instance, in our work with the Vendiscope, a tool we developed to audit data collections, we found that even AlphaFold struggles to predict the 3D structures of rare proteins. Furthermore, many grand challenges in the physical sciences, from designing de novo proteins to discovering novel metal-organic frameworks (MOFs) that capture CO2 from the atmosphere, cannot be framed as supervised learning problems. Rather, they can be framed as discovery problems where what is sought is rare.

In these settings, the dominant modes of the distribution are often scientifically uninteresting because they represent more of what we already know. In 2026, I hope we finally crack the code on discovery, moving to techniques that can tame the tail of the distribution and even discover meaningful things that are out of distribution. The goal is to find things that nature allows but we haven’t yet seen.

To make this leap from interpolation to discovery, the AI community must prioritize a fundamental shift in the objective functions that drive machine learning. We need to move beyond maximizing accuracy and probabilistic likelihoods, objectives that inherently drive models toward interpolation and collapse to the dominant modes of the data distribution. Instead, we need to elevate diversity to a first-class objective, rather than treating it solely as a vague sociotechnical concept for fairness.

At my lab, Vertaix, we have led this thread of research by developing the Vendi Score. In our research on materials discovery, we found that optimizing the Vendi Score allowed us to identify stable, energy-efficient MOFs that standard search methods missed because they could not effectively explore a search space that spans trillions of materials.
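For readers who want a mechanical sense of what optimizing for diversity can look like: the Vendi Score is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix, which can be read as the effective number of distinct items in a collection. Below is a minimal NumPy sketch, assuming only that definition; the RBF kernel and toy data are illustrative placeholders, not the setup used in the materials-discovery work.

```python
import numpy as np

def vendi_score(X, kernel):
    """Vendi Score: effective number of distinct items in a collection.

    X      : sequence of n samples
    kernel : similarity function with kernel(x, x) == 1
    """
    n = len(X)
    # Build the n x n similarity (kernel) matrix.
    K = np.array([[kernel(a, b) for b in X] for a in X])
    # The eigenvalues of K / n sum to 1, so they behave like a probability distribution.
    lam = np.linalg.eigvalsh(K / n)
    lam = lam[lam > 1e-12]  # drop numerical zeros
    # Exponential of the Shannon entropy of the eigenvalues.
    return float(np.exp(-np.sum(lam * np.log(lam))))

# Toy usage with an illustrative RBF similarity on random feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 8.0)
print(vendi_score(X, rbf))
```

A set of identical items scores 1 and a set of mutually dissimilar items scores close to n, which is why pushing this quantity up steers a search away from the dominant modes of the distribution.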

In 2026, we should stop treating diversity merely as a secondary evaluation metric and start treating it as the primary mathematical engine for discovery. If we make this shift, AI will cease to be just an imitator of human knowledge and become a true partner in expanding it.

Adji Bousso Dieng is founder of the Vertaix research lab at Princeton University and co-principal investigator of the National Science Foundation Institute for Data-Driven Dynamical Design. She is founder of The Africa I Know, a nonprofit that supports STEM education for young Africans.


Juan M. Lavista Ferres is pictured holding a laptop while students watch a video about AI on a screen.

Education That Works With — Not Against — AI

by Juan M. Lavista Ferres

A little more than three years ago, OpenAI released ChatGPT, and education changed forever. For students, the ability to generate fluent, credible text on demand in seconds is an incredible new tool. For educators, it is a new kind of challenge. In the coming year, I hope the education community will make peace with AI as an educational tool and focus on developing reliable ways to evaluate student performance in the era of generative media.

In the months that followed ChatGPT’s arrival, a comforting story was widely shared: If generative AI could write essays, then we could build AI detectors to identify them. Some early studies reported near-perfect accuracy in controlled settings. The implicit promise was appealing: teachers would not need to rethink assessment. We could keep the same workflows, the same assignments, the same enforcement model.

That hope was an illusion. In a lab, these systems can perform very well. But their performance assumes that students will submit the raw model output. They won’t. The moment there is a detector, students have an incentive to evade it. And evasion is not difficult. Rewrite a paragraph. Add a few typos. Change sentence lengths. Reorder sections. Insert personal anecdotes. Translate and re-translate. Or use any of the growing set of tools that exist to rewrite AI output to look “human.”

This is the structural problem: If you can build a system that detects AI-generated text, then you can use that system to train a system that defeats it. The moment a detector is deployed, entrepreneurs will build products to break it, and students will learn to use them.

But the biggest problem is not designing effective detectors. It is maintaining trust. If educators rely on detector scores and students rely on programs designed to defeat detectors, educators are pushed into suspicion and adjudication. You end up confronting students, navigating appeals, and making high-stakes judgments without reliable evidence. You risk harming students, especially non-native English speakers, and students who have learned to follow certain academic conventions. Meanwhile, the students most committed to misuse will adapt fastest. So in practice, detection can penalize the wrong people while failing to deter the most sophisticated evasion.

Generative AI can improve learning. It can help students practice, give feedback, and deliver tutoring. It can translate material into a student’s own language and help personalize learning at scale.

But we need to be realistic. The traditional take-home essay, used as a universal proof of independent authorship, is broken. Verifying independent authorship through text alone no longer works at scale. Universities and schools should assume students will use generative AI, and they need assessment models that still work in that reality.

A few practical moves: 

  • Use authentic demonstrations of understanding. In-person exams, oral defenses, live writing, presentations, and project walk-throughs make comprehension and ownership visible. 
  • Teach AI literacy. Verification, citation, bias awareness, and responsible use should be part of the curriculum, not an afterthought. 
  • Design for AI, not against it. For take-home assignments, assume students will use these tools. Build work that incorporates them responsibly, and assess students’ judgment, reasoning, and the ability to apply knowledge.

The genie is out of the bottle. There is no way to put it back. Our job now is to build the rules and practices that make education more effective, and more trustworthy, in the world we actually live in.

Juan M. Lavista Ferres is chief data scientist at Microsoft and a corporate vice president. He directs the Microsoft AI for Good Lab and the Microsoft AI Economy Institute.


Tanmay Gupta is pictured smiling next to a whiteboard filled with mathematical formulas.

From Prediction to Action

by Tanmay Gupta

AI research in 2026 should confront a simple but transformative realization: Models that predict are not the same as systems that act. The latter is what we actually need.

Over the last decade, we have become extraordinarily good at passive prediction and generative modeling — producing bounding boxes and segmentation masks for objects in images, transcribing audio into text, or generating fluent paragraphs and images on command. These are impressive achievements, yet they remain proxy tasks: tasks that are often assumed to represent real-world economic utility. This is a fallacy. The world’s economically meaningful tasks do not end at a single prediction or generation from a single input. They require taking a sequence of actions (each of which may be a function of predictions or generations from one or more models) in complex, dynamic environments where each action shapes the state of the environment and hence subsequent actions.
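To make the contrast concrete, here is a deliberately minimal, hypothetical sketch of an acting system: a closed loop in which the model’s output is an action, each action changes the environment, and the changed environment conditions the next decision. The Environment class, policy_model function, and action names are illustrative placeholders, not any particular agent framework.

```python
from dataclasses import dataclass, field

@dataclass
class Environment:
    """Toy stand-in for a dynamic environment (a codebase, a browser, a robot)."""
    state: dict = field(default_factory=dict)

    def observe(self) -> dict:
        return dict(self.state)

    def apply(self, action: dict) -> None:
        # Each action mutates the state that the next decision will see.
        self.state.update(action.get("effects", {}))

def policy_model(goal: str, observation: dict, history: list) -> dict:
    """Placeholder for a predictive/generative model used inside an acting system."""
    if observation.get("tests_pass"):
        return {"name": "stop"}
    # Pretend the third edit finally makes the tests pass.
    return {"name": "edit_code", "effects": {"tests_pass": len(history) >= 2}}

def run_agent(goal: str, env: Environment, max_steps: int = 10) -> list:
    history = []
    for _ in range(max_steps):
        action = policy_model(goal, env.observe(), history)
        history.append(action)
        if action["name"] == "stop":
            break
        env.apply(action)  # the action reshapes the environment, and hence future actions
    return history

print(run_agent("make the tests pass", Environment()))
```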

In 2026, AI research must move decisively from solving these proxy tasks to the corresponding long-horizon realistic tasks that these proxy tasks loosely approximate. Consider how coding has evolved: Models once autocompleted lines, but modern coding agents increasingly take a high-level specification, search through a codebase, run tests, and return a working solution with minimal human intervention.

I hope we can bring this evolution — from generating proxies to accomplishing goals — to other domains. For example, vision models should be studied as parts of larger systems that use visual input streams to drive digital (web/computer use) and physical (embodied) workflows, monitor processes, or extract insights. Speech systems need to be studied as part of intelligent conversational assistant architectures that understand objectives conveyed through conversation and interface with digital or physical tools to fulfill them. Image- and video-generation models should be studied as parts of systems that generate, say, long-form visual educational content from existing documents or marketing material for products or research artifacts.

Shifting focus to these long-horizon tasks and goal-oriented AI systems has two major benefits. First, it exposes the limitations and pain points of current AI models when we use them to construct these larger systems and pipelines. These goal-oriented AI systems need more than predictive or generative capability. They require persistent memory, the ability to focus on a goal over a long time horizon, responsiveness to real-time human feedback, and the ability to cope with uncertainty in an evolving environment. They also require effective interfacing with a wide variety of multimodal information sources, tool calling, the ability to hypothesize and reason, continual learning, self-improvement, and more. Many of these gaps in capability are invisible on short-horizon or single-step predictive tasks but reveal themselves in more complex and realistic long-horizon scenarios. We need better ways to evaluate these aspects of intelligence and methods to improve them.

Second, this goal-centric reframing aligns AI research with end-task utility. By directly trying to solve real end tasks, researchers are less likely to be led astray by the siren song of seemingly useful proxy tasks that ultimately prove incapable of solving real tasks. For instance, for years, NLP researchers assumed semantic parsing was an important component of natural language understanding systems. Today’s LLMs are capable of sophisticated language understanding and manipulation without ever explicitly performing semantic parsing. In hindsight, the research hours devoted to semantic parsing might have been better spent trying to solve the end task rather than chasing the proxy metric of semantic parsing accuracy.

Real digital or physical tasks unfold over minutes, hours, months, and sometimes years. Humans have the extraordinary capability of consolidating diverse information collected over extended periods of time into a consistent world-view that drives execution of complex goals in evolving environments. The technological advancements in deep learning over the last decade, particularly in LLMs and VLMs, have set the stage for the AI research community to take a serious shot at replicating this ability in silicon over the next decade. In the last couple of years alone, we have seen the rise of LLM-powered agentic systems that are automating well-defined workflows. Tackling the underspecified, ill-defined, undiscovered, and unimagined is the next frontier.

Tanmay Gupta is a senior research scientist at the Allen Institute for Artificial Intelligence. He is a co-author of “Visual Programming: Compositional Visual Reasoning Without Training,” which won best paper at CVPR 2023. His work spans multimodal agents, coding agents, VLMs, and VLAs.


Pengtao Xie is pictured standing near a chalkboard filled with mathematical notes, addressing a classroom of attentive students.

Multimodal Models for Biomedicine

by Pengtao Xie

Over the past few years, we have seen rapid progress in models that jointly reason over text, images, sequences, graphs, and time series. Yet in biomedical settings, these capabilities often remain fragmented, brittle, or difficult to interpret. In 2026, I hope the community moves decisively toward building multimodal models that are not only powerful but also scientifically grounded, transparent, and genuinely useful to biomedical discovery and clinical decision-making.

A key priority should be deep multimodal integration, rather than superficial concatenation of modalities. Biological systems are inherently multi-scale and multi-view: molecules, cells, tissues, organs, and patients are connected through complex mechanisms that span sequences, structures, images, and longitudinal records. Foundation models should reflect this structure by learning aligned representations that preserve biological meaning across modalities, enabling coherent reasoning from molecular mechanisms to phenotypic outcomes. Achieving this will require new pretraining objectives, better inductive biases, and principled ways to encode biological context. For instance, researchers could design objectives that explicitly align representations across modalities using shared biological anchors such as pathways, cell states, or disease phenotypes, so information learned from one view remains meaningful when transferred to another.
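One way to read the “shared biological anchors” idea concretely is as a contrastive alignment objective: two modality encoders are trained so that row-aligned pairs describing the same anchor (a pathway, cell state, or phenotype) score higher than mismatched pairs. The sketch below is a hypothetical InfoNCE-style loss in NumPy; the encoders, batch pairing, and temperature are illustrative assumptions, not a published biomedical model.

```python
import numpy as np

def info_nce_alignment(z_seq, z_img, temperature=0.07):
    """Symmetric InfoNCE-style loss where row i of each modality shares an anchor.

    z_seq, z_img : (batch, dim) embeddings from two modality encoders
                   (e.g., a sequence encoder and an imaging encoder), row-aligned
                   so that pair i corresponds to the same biological anchor.
    """
    # L2-normalize so the dot product is a cosine similarity.
    z_seq = z_seq / np.linalg.norm(z_seq, axis=1, keepdims=True)
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    logits = z_seq @ z_img.T / temperature  # (batch, batch) cross-modal similarities

    def cross_entropy(l):
        # The correct match for row i is column i (the shared anchor).
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric: align sequence-to-image and image-to-sequence.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy usage with random "encoder outputs" (illustrative only).
rng = np.random.default_rng(0)
print(info_nce_alignment(rng.normal(size=(16, 64)), rng.normal(size=(16, 64))))
```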

Another critical focus is interpretability. In biomedicine, predictions alone are rarely sufficient. Researchers and clinicians need to understand why a model makes a decision, what evidence it relies on, and how its outputs relate to known biology. As multimodal models grow larger and more general, the AI community should prioritize explanation methods that operate across modalities, allowing users to trace predictions back to molecular interactions, image regions, or temporal patterns in patient data. For instance, models could be designed to produce explanations as structured cross-modal attributions, explicitly linking elements in one modality (e.g., genes or residues) to evidence in another (e.g., image regions or time points).

Data efficiency and adaptability should be central goals. Many biomedical domains suffer from limited labeled data, strong distribution shifts, and incomplete knowledge. Multimodal foundation models must be able to adapt to new tasks, diseases, and institutions with minimal retraining, while maintaining robustness and calibration. Parameter-efficient adaptation, continual learning, and uncertainty-aware inference are especially important in this context.
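As one concrete example of parameter-efficient adaptation, a LoRA-style adapter freezes a pretrained weight matrix and learns only a low-rank update, so a model can be tuned to a new task, disease, or institution by training a small fraction of its parameters. The sketch below is a minimal NumPy forward pass with illustrative shapes and rank; it is not drawn from any specific biomedical foundation model.

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""

    def __init__(self, W: np.ndarray, rank: int = 8, alpha: float = 16.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = W                                                 # frozen (d_out, d_in) weight
        self.A = rng.normal(scale=0.01, size=(rank, W.shape[1]))   # trainable (rank, d_in)
        self.B = np.zeros((W.shape[0], rank))                      # trainable (d_out, rank),
        # zero-initialized so the adapted layer starts out identical to the pretrained one.
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Only A and B need task-specific training; W stays frozen.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

# Toy usage: adapt a 512 -> 256 projection with about 5% of its parameter count.
W_pretrained = np.random.default_rng(1).normal(size=(256, 512))
layer = LoRALinear(W_pretrained, rank=8)
print(layer(np.ones((4, 512))).shape)  # (4, 256)
```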

Finally, progress in 2026 should be measured not only by benchmarks, but also by integration into biomedical workflows. This includes tools that support hypothesis generation, experimental design, and interactive exploration. Such tools would allow domain experts to engage in multi-turn dialogue with models rather than passively consume predictions.

If the AI community aligns around these priorities, multimodal foundation models could become trusted partners in biomedical research, accelerating understanding while respecting the complexity and responsibility inherent in supporting human health.

Pengtao Xie is an associate professor at UC San Diego and an adjunct faculty member at Mohamed bin Zayed University of Artificial Intelligence in Abu Dhabi. Previously, he was senior director of engineering at Petuum, a generative AI startup.


Sharon Zhou is pictured smiling confidently with her hands clasped.

Chatbots That Build Community

by Sharon Zhou

Next year, I’m excited to see AI break out of 1:1 relationships with each of us. In 2026, AI has the potential to bring people together and unite us with human connection, rather than polarize and isolate us. It’s about time for ChatGPT to enter your group chats.

The internet today feels like it’s getting pushed toward two extremes. On one end, it’s heavy AI slopification that paints a strictly worse, noisier version of our former internet — with bots participating in forums and scraping data (getting DDoS’ed by AI scrapers ~1 million times a day is not weird!). On the other end, it’s heavy human curation that’s trying to keep the LLMs out as much as possible.

But this tension doesn’t have to be adversarial. It can be integrating instead. AI can be designed to connect people and strengthen human connections. The bot in the chat becomes a positive uniting force, rather than a neutral assistant or a deceptive agent. To accomplish this, researchers will need to change some things, like post-training on longer contexts and different reinforcement learning environments to handle multi-human contexts and objectives. But it can be done, and I believe it will introduce new heights of intelligence, human and artificial.

As you talk to your LLM at 3:00 A.M. about solving a relationship problem and how it’s like debugging your code, your LLM asks you whether you want to talk to someone else who feels the same way. You think, “well, I thought my problem was niche at this hour, but why not.” What’s more, the LLM isn’t just there to make the intro. It joins your chat, making jokes with funny memes and asking interesting questions to make the conversation lively and full of curiosity — until you realize you’ve made a couple of friends, fixed your bug, and have a new lens for approaching your relationship. You’ve learned something helpful for your job and your personal life. And it’s only 3:15 A.M.

Curiosity accelerates when it’s shared. It’s infectious. It’s easier to learn things when you’re motivated by a group and by where it’s trying to go, reach, and explore. As a collective tool, AI can further our curiosity and creativity together. And there’s a chance that some of those enlightening conversations will be the new data needed to lift AI’s intelligence.

It would be quite the win-win if we design a future where AI is incentivized to bring people together and give people a sense of belonging with each other, and, in so doing, gets people inventing more things and growing our collective intelligence, generating data that pushes models forward in ways that benchmarks on isolated chats don’t. This might even motivate new model architectures, like an extreme MoE (mixture of experts) with lightweight, partially shared weights for each person and their multi-dimensional selves, a more evolved version of today’s scratchpad memory.

Today, the advances are close and this future is completely viable, which is why it excites me. I hope this year we take a step toward making AI a more positive force on humanity at large and on our individual humanity. This is one path that we can take in that direction.

Sharon Zhou is the Corporate Vice President of AI at AMD. Formerly, she was founder and CEO at Lamini and an adjunct instructor at Stanford University.
