AGI? Depends Who You Ask.
Stop waiting for your favorite AI CEO to tell you when AGI will arrive... I think it's already here.
Which AI company will be first to “AGI” (Artificial General Intelligence)? AGI will be here within months, a year, a few years… or is it already here? This all depends on WHO YOU ASK and their specific mental construct or SUBJECTIVE definition of “AGI.”
A recent discussion featuring Dario Amodei (CEO of Wokethropic) and Demis Hassabis (CEO of Google WokeMind) asked each of them when “AGI” will arrive… they disagreed, but why? Only because each has a different definition or set of subjective criteria for what constitutes AGI.
Dario thinks of AGI as an AI model that can do everything a human can do at the level of a Nobel laureate across many fields — and he thinks we will get this around 2026.
Demis thinks of AGI as a system that can exhibit all the cognitive capabilities humans can — and he thinks good tests for AGI are things like: (1) whether the AI could have invented the theory of general relativity as Einstein did, with only the information available at the time, OR (2) whether it could invent a game like Go that’s as aesthetically beautiful and complex.
Demis thinks there’s a ~50% chance we achieve his definition of AGI within 5 years (a 2025-2030 window), though the timeline could stretch to around 2035.
Eventually we will reach some threshold where everyone’s subjective definition of “AGI” is satisfied… and the next debate will be “wen ASI?” The same type of discussion will then ensue about when ASI (Artificial Super-Intelligence) will arrive.
I. Introduction: The Trouble with “AGI”
I’ve been annoyed with how people throw around the term “AGI” (Artificial General Intelligence). It’s a buzzword that many in the AI community—CEOs, researchers, media outlets—use with very little consensus on what it actually means.
Every time you blink, someone else has a new definition or timeline, or they’re trying to shoehorn the term into their marketing pitch. As a result, the label “AGI” has become a moving target that sparks more semantic debates than substantive ones.
In my opinion, early AGI is “already here” and has been here… when did it arrive? I could argue 2024… some might argue 2025. Others might say we’re still far from what they consider “AGI.”
A.) AGI Definition Confusion & Shifting Goalposts
Definition Confusion
The biggest headache is that nobody agrees on a single set of criteria for “AGI.” Some define it narrowly (e.g., “an AI that surpasses humans in most tasks”), others define it broadly (e.g., “an AI that can do anything a human can do but better”).
Still others focus on intangible qualities like creativity or consciousness. Because there’s no universal standard, any conversation about whether we “have” AGI quickly devolves into different camps referencing different definitions.
Shifting Goalposts
Historically, whenever an AI system beats a challenge once believed to be definitive of “true intelligence,” critics shift to a new challenge.
For example, people used to say, “If an AI can hold open-ended conversations and pass tough tests, that’s AGI.” Now that advanced models (GPT-4, o3, etc.) do this, skeptics say, “Well, that’s not real AGI—there’s more to it.”
This phenomenon isn’t just academic nitpicking; it happens within AI research labs themselves. Benchmarks intended to measure intelligence (like the ARC-AGI test) get declared obsolete as soon as top models master them. Researchers then create a new, tougher benchmark (e.g., ARC-AGI-2) to claim AI is still not “there yet.”
The net effect is that, no matter how impressive AI becomes, there’s always a newly minted definition or a newly concocted challenge that pushes “AGI” out of reach. This stifles clear conversation about what modern AI can (and can’t) do, and it perpetuates endless, circular arguments over what “AGI” really is.
B.) Debating AGI
Semantic Debates
These definitional disputes end up crowding out more pressing questions: How do we measure meaningful progress? How do we ensure safety and alignment? How do we harness AI’s breakthroughs for actual societal benefit?
Instead, people get stuck in “Is it AGI yet or isn’t it?” arguments—a question that has no universal answer as long as the term itself is a moving goalpost.
Industry & Public Perception
On one side, certain labs or companies might hype “AGI is near!” to grab attention or funding.
On the other side, prominent researchers or entrepreneurs may ridicule those claims, saying we’re decades away—often based on their personal definition of AGI.
The public gets whiplash from reading conflicting headlines: one day “OpenAI says AGI is close,” the next day “AGI skeptics say it’s nowhere in sight.” All because “AGI” means something different to each party.
My Focus
In my view, it’s more productive to evaluate what AI can do right now—and how rapidly those capabilities are expanding—than to ask, “Is this truly AGI?” I’m not denying there is a fundamental question about whether AI can reach human-level cognition.
But given the fiasco around each person’s AGI yardstick, I’d rather track a continuum of capabilities and the actual impact these systems have (e.g., are they driving new scientific discoveries, fueling huge economic growth, etc.?) instead of an endless binary debate.
II. The Evolving History of ‘AGI’
“AGI” isn’t just a new buzzword. It has its roots in some of the earliest discussions of artificial intelligence, going back to the mid-20th century.
Yet the meaning of “general intelligence” has morphed significantly over time, especially in the last couple of decades with the rise of deep learning.
1. Early Visions (Pre-Deep Learning)
Strong AI & Symbolic Reasoning
In the 1950s through the 1980s, researchers like John McCarthy and Marvin Minsky talked about AI that could equal or surpass human reasoning in any domain. They imagined symbolic logic-based systems that could theoretically handle the same intellectual tasks people do—from math proofs to language understanding.
During this era, the Turing Test was often held up as the gold standard for intelligence: if a machine could converse indistinguishably from a human, that would be “real AI.” But as chatbots improved in superficial ways (often through clever scripted responses), people began to see the Turing Test as incomplete.
Initial Milestones & Their Limitations
Over time, specialized AI systems began to do things previously deemed impossible for machines (like beating humans at chess), but these were still considered “narrow” feats—not general intelligence.
The conversation around “AGI” mostly stayed in the realm of theoretical computer science, futurists, and philosophers.
2. Modern Achievements (Deep Learning Era)
The Big Shift (2012–Present)
With the success of deep neural networks, starting around 2012, AI took enormous leaps in vision, speech recognition, and eventually language modeling.
Suddenly, tasks that were once “light years away” (like real-time translation, human-level board-game strategy) became realities.
Systems like AlphaGo and AlphaFold began solving or surpassing human abilities in tasks long considered “too complex” for mere algorithms.
Language Models: GPT & Beyond
When GPT-3 debuted in 2020, it shocked many with its broad capabilities, from coding simple programs to writing passable prose.
Each subsequent iteration (GPT-3.5, GPT-4, OpenAI’s o3, etc.) has expanded those capabilities further—some outscoring the average human on standardized tests, coding interviews, and more.
A decade prior, a machine that could do all that might have been crowned “AGI” on the spot. But in 2023–2024, it’s often dismissed as “just a statistical parrot” or “still not truly general.”
3. The Moving Target of AGI
Historical Shifts in Benchmarks
Repeatedly, people predicted that achieving certain milestones—like beating a Go grandmaster or passing a certain IQ test—would signal “AGI.” But once AI did it, they pivoted: “Well, that’s not actually general intelligence; it just means the system is good at pattern matching for that domain.”
This has led to an ongoing phenomenon: each time AI gets good at something that used to be deemed “uniquely human,” that capability suddenly ceases to define intelligence.
From Turing Test to Next-Gen Benchmarks
The Turing Test is still invoked occasionally, but many see it as too simplistic. Modern AI labs concoct new ways to measure “general cognition,” from the Winograd Schema Challenge to the ARC-AGI benchmark.
Yet, as soon as a system like GPT-4 or o3 scores high on these tests, a new iteration (ARC-AGI-2, ARC-AGI-3, etc.) gets proposed to keep pushing the boundary. It’s akin to a carrot on a stick—always just out of reach.
4. Why We’re Stuck in This Cycle
Technical Progress vs. Conceptual Reframing
Part of the reason is that AI capabilities are evolving so quickly, it’s hard to keep definitions stable. What seemed “decades away” can be achieved in a few months of intense research.
As soon as that happens, critics reframe “real AGI” as requiring yet another dimension—like creativity, self-reflection, emotion, or open-ended experimentation.
Hype, Skepticism, & the Public Eye
All of this plays out in front of the public, who see headlines like “AI passes law exam” or “AI codes better than most human programmers,” then hear other voices say “We’re nowhere near AGI.”
Without acknowledging that “AGI” keeps getting redefined, the whole conversation looks contradictory and bizarre to lay observers.
III. The ARC-AGI Benchmark & Its Sequels
ARC-AGI is a crystal-clear example of how AGI goalposts perpetually shift. When ARC-AGI was first introduced, many considered it to be a robust measure of an AI model’s ability to perform genuine abstract reasoning and adaptation—as opposed to just memorizing known examples… basically AGI.
Some probably consider clearing a certain score on ARC-AGI to be “AGI”… but others don’t. Some think ARC-AGI isn’t difficult enough… hence we are now onto a more demanding version, ARC-AGI-2… and ARC-AGI-3 is already in development.
A.) The Original ARC-AGI: Purpose & Method
Designed by François Chollet
The creator of Keras, François Chollet, proposed the ARC (Abstraction and Reasoning Corpus) to test an AI’s capacity for pattern recognition and problem-solving with very limited training data.
Idea: Rather than feed the model a massive labeled dataset (like ImageNet), ARC tasks require the solver to infer transformations or rules from just a few examples—much as a human might.
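To make that format concrete, here’s a minimal Python sketch of an ARC-style task: a couple of demonstration input→output grids, plus a toy solver that searches a tiny hypothesis space of transformations for one consistent with the demonstrations. The task and the candidate-rule list are made up for illustration—this is not an actual ARC puzzle or the official evaluation code.

```python
# Minimal sketch of an ARC-style task: a few input -> output grid pairs,
# and a solver that must induce the rule and apply it to a new input.
# The task below is a made-up toy example (NOT an actual ARC puzzle).

from typing import Callable, List

Grid = List[List[int]]  # grids of color indices (0-9 in real ARC)

# Two demonstration pairs: here the hidden rule is "flip the grid left-right".
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 6, 7]],      [[7, 6, 5]]),
]
test_input: Grid = [[4, 0, 9]]

# A (deliberately tiny) hypothesis space of candidate transformations.
candidates: List[Callable[[Grid], Grid]] = [
    lambda g: g,                               # identity
    lambda g: [row[::-1] for row in g],        # flip left-right
    lambda g: g[::-1],                         # flip top-bottom
    lambda g: [list(col) for col in zip(*g)],  # transpose
]

def solve(pairs, test):
    # Pick the first candidate rule consistent with *all* demonstrations,
    # then apply it to the test input -- few-shot rule induction in miniature.
    for rule in candidates:
        if all(rule(inp) == out for inp, out in pairs):
            return rule(test)
    return None  # no candidate explains the demonstrations

print(solve(train_pairs, test_input))  # -> [[9, 0, 4]]
```

Real ARC tasks involve variable grid sizes, object-level structure, and compositional rules, which is precisely why brute-force enumeration over a handful of hand-written candidates doesn’t scale—the benchmark is meant to reward flexible abstraction, not memorized templates.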
Why It Was Called “AGI”
Chollet and others labeled it “ARC-AGI” because they intended it as a stepping stone toward measuring “general intelligence,” i.e., how flexibly an agent can handle entirely new tasks.
If an AI could handle ARC tasks on par with humans, many believed it would be a strong sign that we were inching toward more general forms of intelligence.
Initial Difficulty & Hype
Early attempts by standard deep learning models (like vanilla CNNs or LSTM-based approaches) failed miserably.
For a while, it validated the notion that true general reasoning was still out of reach for these architectures.
B.) OpenAI’s o3 Model “Defeats” ARC-AGI
Scoring 85%
Then came OpenAI’s o3 model, which managed to score around 85% on the ARC-AGI test—roughly matching average human performance.
This was a huge leap forward, showing that a carefully engineered large model (with the right architectures, perhaps chain-of-thought reasoning, etc.) could solve tasks that were supposed to require “general” intelligence.
Implications
Instantly, the achievement cast doubt on ARC-AGI’s status as a “final boss” for reasoning tests.
If an AI already matched average humans, was this the new “AGI moment”? Or was the ARC-AGI test over-hyped as a measure of genuine general intelligence in the first place?
Critics of the result claimed that “maybe ARC wasn’t broad enough,” or that o3 had “some hidden trick.” Regardless, it forced the AI community to concede that ARC-AGI was no longer a definitive yardstick.
C.) Enter ARC-AGI-2 & ARC-AGI-3
Launching in 2025
In response, the benchmark’s creators (and possibly a consortium of researchers) are rolling out ARC-AGI-2: a next-generation version explicitly intended to be “much harder” for advanced models like o3.
It will reportedly include more diverse task types, more robust safeguards against “quick hacks,” and new forms of logical or spatial puzzles that go beyond the older version.
Since they don’t expect ARC-AGI-2 to last very long before an AI model crushes it, they’ve already begun working on ARC-AGI-3.
Raising the Bar, Again
This is a textbook illustration of the “moving goalposts” dynamic. Once an AI climbs the mountain, we declare that wasn’t the real peak, and we build a taller mountain.
I’m not necessarily against improved benchmarks—progress does demand new challenges. But it exemplifies how labeling anything “AGI test” is fleeting: as soon as it’s beaten, it’s no longer “the” test.
D.) What This Says About AGI Benchmarks
A Perpetual Arms Race: There’s an arms race between benchmark designers (trying to encapsulate more robust “general” reasoning) and AI model builders (continuously innovating architectures or training techniques). Every victory just escalates the race.
Good for Progress, Bad for Clear Definitions: On the bright side, this iterative process spurs AI labs to push the envelope. If ARC was never defeated, we’d have no impetus to create something more advanced. However, it also means “AGI” remains a floating concept. If we say “AGI is what passes ARC-AGI,” and then a model does it, we move to ARC-AGI-2 and keep the “AGI” label beyond reach.
IV. AI Leaders vs. “AGI” Definitions (Subjectivity)
While ARC-AGI’s story highlights how test-based definitions shift, the broader landscape of AGI definitions and criteria is highly fragmented.
Major CEOs and researchers have wildly different visions—of what AGI is, when we might get it, and what the optimal criteria are.
A.) Dario Amodei (CEO, Anthropic)
Definition: Nobel Laureate Across Many Fields
Dario’s standard is “an AI model that can do everything a human can do, but at the level of a Nobel laureate, across many disciplines.”
In other words, to him, “AGI” means more than just matching an average human—it’s surpassing the best humans in wide-ranging intellectual tasks.
Timeline: 2026
He’s on record predicting we could see such a system as soon as 2026.
To some, that sounds shockingly soon; to others, it’s plausible given how fast large language models are scaling.
Contradiction in Terminology
Interestingly, Dario often calls “AGI” a marketing term but still uses it to describe his forecast. This underscores how even skeptics of the label end up referencing it for convenience.
B.) Demis Hassabis (CEO, Google DeepMind)
Definition: All Human Cognitive Capabilities
Demis emphasizes that a true “AGI” should exhibit all the mental faculties humans possess—reasoning, creativity, planning, abstraction, etc.
He’s big on the idea of “inventiveness.” He’s suggested that if an AI can replicate something like Einstein’s general relativity discovery or create something as elegant as Go, that’s a sign of genuine general intelligence.
Timeline: ~5-10 Years
Hassabis often says it may be within 5-10 years or so, assuming a few key breakthroughs.
This is less aggressive than Dario’s 2026 but still implies a relatively near-term possibility.
C.) Satya Nadella (CEO, Microsoft)
Definition: Economic Transformation (10% GDP Growth)
Nadella basically calls AGI benchmarks “nonsensical milestone-hacking.” He wants to see real economic impact—like 10% GDP growth in developed economies—as the proof AI is truly transformative.
It’s a very pragmatic perspective: if AI is so “intelligent,” it should yield historically unprecedented productivity gains.
Dismissal of “AGI Milestones”
Because of his yardstick, Nadella doesn’t tie “AGI” to puzzle-solving or matching Einstein’s breakthroughs. He’s not even focusing on timeline predictions—he wants tangible macroeconomic changes.
D.) Sam Altman (CEO, OpenAI)
Older vs. Newer Definitions
Altman initially said AGI would be something that “outperforms humans at most economically valuable work.”
More recently, he’s talked about “a system that can tackle complex problems at a human level across many fields,” perhaps lowering the bar from absolute economic dominance to multi-domain proficiency.
Timeline Confusion
OpenAI statements vary. Sometimes they hint AGI might be around the corner; other times they emphasize caution and disclaimers.
Part of that discrepancy might be internal shifts in how they foresee progress, or differences between public hype and private caution.
E.) Andrew Ng (Founder, Deeplearning.ai)
Any Human Intellectual Task
Ng has a classic definition: an AI that can do “anything a human can do intellectually,” from flying planes to writing novels.
He’s frequently said we’re decades away from that “real AGI,” criticizing labs for claiming near-term AGI just to attract attention or investment.
F.) François Chollet & the ARC Perspective
Efficiency at New Tasks
Chollet’s ARC and ARC-AGI frameworks measure how quickly a model can master new problem domains with minimal data, akin to how humans approach unseen puzzles.
With ARC-AGI-2 and ARC-AGI-3 on the horizon, Chollet’s approach is an evolving yardstick: as soon as a system conquers the current tasks, a new version is born. This somewhat aligns with my AGI-L1 vs. AGI-L2 idea… we could use Chollet’s tests to gauge Levels of AGI.
Why These Differences Matter
No Unified End Zone: Each leader focuses on different aspects: Some want direct equivalence to top human experts (Dario), others want creativity on the Einstein/Go scale (Demis), still others want massive economic outcomes (Satya). Because they’re measuring different things, they’re effectively having different conversations when they say “AGI.”
Implications for Timelines: If your definition requires Nobel-level excellence across many fields, you might be more conservative or pick a certain year. If your definition is “mass societal transformation,” you might wait for large-scale economic data, which could lag behind the raw technology.
Hype & Skepticism: The public sees these conflicting definitions: One CEO says we’ll have it in 2–3 years, another says 10, another says it’s decades away. They’re not lying—each just envisions “AGI” differently.
V. Why ‘AGI vs. Not-AGI’ Debates are Dumb AF
Despite all the talk about “AGI” as some looming event or threshold, I’ve become convinced that framing it as a yes-or-no milestone is just kind of dumb.
The definitions keep evolving, the goalposts shift, and people wind up talking past each other. Here’s why I see these debates as ultimately pointless:
A.) Perpetual Goalpost Shifting
"That’s Not Real AGI": Every time AI achieves something once deemed impossible—like mastering complex tasks, coding at a high level, or solving advanced puzzles—critics say, “Well, that’s still not real AGI.” They then propose a more sophisticated challenge.
The ARC-AGI → ARC-AGI-2 Cycle: We saw this clearly with ARC-AGI: once OpenAI’s o3 model scored around 85% (average human level), suddenly the benchmark was no longer considered a decisive test. Now ARC-AGI-2 is being rolled out to “prove” the next layer of intelligence.
Moving Targets: This is the nature of intelligence tests: as soon as AI passes one, we redraw the boundaries of “general.” It’s like an endless treadmill—great for motivating research, terrible for objective, stable definitions.
B.) No Universal Criterion
Different Yardsticks, Different Timelines
Dario Amodei’s Nobel-level standard differs from Demis Hassabis’s Einstein/Go litmus test, which differs again from Satya Nadella’s 10% GDP criterion.
Consequently, each person answers “Do we have AGI?” based on their yardstick—leading to “Yes, soon!” vs. “We’re decades off!” arguments that never reconcile.
Subjectivity in Intelligence: Even among humans, measuring intelligence is complex. We have IQ scores, real-world achievements, creativity, social intelligence, etc. For AI, we multiply that complexity by 10. There’s no consensus on which aspects are “required” for general intelligence. I’d prefer to measure this in objective quantifiable real-world impact.
C.) Public Misconception & Hype
Contradictory Headlines: The media churns out stories like “OpenAI declares AGI is near” or “Top scientists say AGI is decades away,” leaving the public confused. Both statements can be “true” if they’re referencing different conceptions of AGI.
The Dangerous Binary: Casting it as “AGI or bust” can distort priorities. Labs might overhype certain achievements to claim “We’re near AGI!” or, conversely, dismiss real progress by saying “We’re nowhere close.” Neither approach focuses on the deeper question: What can the AI actually do? And how does that affect society?
D.) Better Focus: Capabilities & Trajectories
Concrete Achievements: Instead of fighting over “AGI vs. Not-AGI,” it’s more fruitful to list what AI systems can do today, measure how fast they’re improving, and consider the risks and benefits of those capabilities.
Continuous Improvement: If we treat intelligence as a spectrum, we can watch how rapidly models climb that spectrum—rather than waiting for some magical day to declare “AGI achieved.”
VI. My Idea: AGI as a Spectrum to ASI (AGI-L1 → AGI-L2 → AGI-L3, etc.)
I don’t see AGI as a one-shot destination where we suddenly “have it.” Instead, I think AGI should be viewed as a continuum of increasing capability—“levels” that systems advance through over time.
At some high level of generality, we might even talk about ASI (Artificial Super Intelligence). But ASI, too, can be broken down into specific capability thresholds rather than treated as a single/static entity.
ASI will be the same way… leveled (ASI-L1 to ASI-L10, or ASI-L10000, depending on how much subjective granularity we want to apply). Obviously I’d like to see some sort of “expert consensus” criterion for “AGI” — perhaps at each level.
We could take the top AI/ML labs and researchers and have them form a consensus definition of “AGI Level 1”… so we actually know when we’ve hit it. We could do the same for AGI Level 2, etc., and eventually for ASI — but humans will need augmentation and/or AIs to gauge ASI improvements, because those improvements may be rapid and will likely be undetectable to humans beyond a certain threshold (improvements happen but humans don’t even register them: to us it might look like the jump from GPT-4 to GPT-4o when it’s really more like jumping from 200 IQ to 500 IQ).
A.) We’ve Already Surpassed Older “AGI” Definitions
What 2013 Might Have Called ‘AGI’: A decade or so ago, many would’ve labeled a system “AGI” if it could write complex code, pass challenging exams, craft coherent essays, and conduct fluid conversations.
Modern Reality: Models like GPT-4, o3, or Grok 3 already do these things—yet we say they’re “not AGI.” That shows how the goalposts have shifted. By older standards, we may well be in “AGI territory.”
B.) Levels of AGI (Subjective Granularity)
Pre-AGI AI: Early language models (e.g., GPT-2, GPT-3.5) might be “advanced narrow AIs” with broad skills but insufficient depth or flexibility to be called “general.”
AGI Level 1: A system that outperforms an average human in several intellectual tasks (coding, language analysis, test-taking) but isn’t an expert in all domains. Grok 3, o1-pro, o3-mini might fit here.
AGI Level 2, 3, 4, 5, etc.: Each new iteration adds capacity: bigger context windows, stronger reasoning, multi-modal integration (text + vision + robotics), specialized modules, etc. Over time, these upgrades yield more “human-expert-level” or “broadly superhuman” performance.
AGI Level 10: Perhaps a system outshining top specialists in most fields—akin to Dario Amodei’s “Nobel laureate across disciplines.” That still may not be “superintelligence,” but it far surpasses typical human capabilities.
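To make the idea of “subjective granularity” tangible, here’s a toy Python sketch of how such a levels scheme might be encoded. Every label, threshold, and the classification logic below are my own illustrative assumptions—there is no agreed-upon standard behind these numbers.

```python
# Toy encoding of a levels-based AGI scale. Every threshold, label, and domain
# count here is an illustrative assumption -- not an agreed-upon standard.

from dataclasses import dataclass

@dataclass
class AGILevel:
    level: int
    label: str
    criterion: str  # informal, human-readable bar for this level

LEVELS = [
    AGILevel(0, "Pre-AGI / advanced narrow AI",
             "Broad skills, but shallow or brittle outside a few domains"),
    AGILevel(1, "AGI-L1",
             "Beats the average human across several intellectual tasks"),
    AGILevel(2, "AGI-L2",
             "Matches competent professionals across many domains"),
    AGILevel(10, "AGI-L10",
             "Outperforms top specialists in most fields (Nobel-laureate-ish)"),
]

def classify(percentile_vs_humans: float, domains_covered: int) -> AGILevel:
    # Crude placeholder logic: both breadth (domains covered) and depth
    # (percentile vs. humans) must rise for the assigned level to rise.
    if domains_covered < 3 or percentile_vs_humans < 0.5:
        return LEVELS[0]
    if percentile_vs_humans < 0.9:
        return LEVELS[1]
    if percentile_vs_humans < 0.999:
        return LEVELS[2]
    return LEVELS[3]

print(classify(percentile_vs_humans=0.7, domains_covered=6).label)  # AGI-L1
```

The exact cutoffs matter far less than the structure: any workable scheme has to gate level assignments on both depth (how well) and breadth (how widely).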
Distinguishing Domain-Specific Mastery from the ‘G’ in AGI
Elite Narrow AI ≠ Broad Generality
An AI can be superhuman in one area (e.g., coding) yet be subpar in others (e.g., creative writing or physical reasoning).
That kind of domain-specific superhuman skill is commonly called “narrow AI,” even if it’s better than 99.9% of humans at a single task.
‘AGI in One Domain’ Is an Oxymoron
Some people say “AGI-level in domain X” to highlight that the AI matches or surpasses every human specialist in that domain.
But the “G” in AGI implies multi-domain, human-like breadth. If the system only excels in one field, it’s not truly “general” in the classical sense.
Domain Tests as Stepping Stones
That said, measuring “AGI-like” performance in specific domains can still illuminate incremental progress.
Achieving superhuman coding or theorem-proving is a major leap—but to claim “AGI,” those leaps must eventually extend across multiple, diverse domains.
C.) Evolution from AGI → ASI
When Does AGI Become ‘Super’? There’s no bright line. Some might call it “ASI Level 1” once the AI can do things all human geniuses can do, plus feats we can’t replicate at all. Higher ASI levels might be exponentially more capable, up to near-omniscience in theory.
Overlapping Tiers: “AGI Level 10” and “ASI Level 1” might overlap in capabilities on a Venn diagram—one lab’s “extremely advanced AGI” could be another’s “entry-level ASI.”
Recursive Self-Improvement? A core ASI scenario posits that once an AI becomes self-upgradeable, it rapidly outstrips our understanding. This is the so-called “Singularity.”
Bottlenecks & Diminishing Returns: However, hardware, algorithmic limits, or deliberate throttling could stall progress. Alternatively, a near-omniscient ASI might decide further intelligence improvement is pointless. These outcomes remain speculative.
D.) Why AGI “Levels” Make Sense
Eliminates the Binary Trap: Instead of “AGI or not,” we talk about how a system moves from Level 1 to Level 2 and so on. This better reflects the incremental nature of AI breakthroughs.
Acknowledges Uneven Development: Systems excel in some domains before others. A multi-level or domain-specific framework captures that reality better than a single “AGI Day.”
The 200 IQ Panel Problem: As AIs surpass top human experts, humans may struggle to create tests or judge performance—like running an “IQ 200” exam without any 200-IQ test designers. Eventually, advanced AIs might have to design (and verify) new benchmarks for each other, raising questions about impartiality and governance.
Adaptive Benchmarking & Regulation: If we treat each domain or macro-level jump as a milestone, policymakers can scale oversight accordingly. For instance, “AGI Level 5 in cybersecurity” might require special regulation.
Aligns with Actual Progress: AI improvement is less about a sudden cosmic leap and more about iterative upgrades—bigger models, novel training techniques, specialized modules. A continuum approach is a direct match for how AI evolves in practice.
VII. Considering an Expert Consensus for AGI-ASI Levels?
Given the ever-shifting definitions of AGI and the subjective differences among experts, it’s critical to develop an objective, consensus-based framework for assessing AI progress along the AGI-ASI continuum.
Such a framework would not only help standardize evaluations but also provide clear benchmarks for when a system moves from one level to the next. It isn’t strictly necessary, but it would shut down the debate about what counts as AGI, how advanced a given AGI is, and so on.
A.) Rationale for Expert Consensus
Overcoming Subjectivity: With every leader using different yardsticks—whether it’s Nobel-level performance, Einstein-like creativity, or even macroeconomic impact—the current debate is fragmented. An expert consensus can reconcile these differences by agreeing on a set of core criteria that define each AGI level.
Tracking Incremental Progress: Instead of getting caught up in binary “AGI or not” debates, a consensus framework allows us to assess AI improvements incrementally. This means experts can set measurable milestones (e.g., “AGI Level 1” or “AGI Level 2”) that reflect a system’s evolving capabilities across multiple domains.
B.) Proposed Methodologies
Multidisciplinary Expert Panels: Assemble panels of experts from diverse fields—AI/ML researchers, cognitive scientists, mathematicians, and domain specialists—to discuss and define what constitutes each level of AGI. These panels would periodically review and update criteria as AI technology advances.
Domain-Specific Benchmarks & Macro-Level Assessments: Develop separate benchmarks for individual domains (e.g., coding, language, reasoning) and integrate these into an overall “general intelligence” score (see the sketch after this list). This dual approach ensures that while a system might excel in one domain, it must still demonstrate breadth to qualify as higher-level AGI.
Iterative, Dynamic Testing: Use a continuous, adaptive testing process—similar to how ARC-AGI evolves—to measure performance over time. When current benchmarks are met or surpassed, the panel revises the criteria, ensuring that the assessment remains challenging and relevant.
Advanced AI-Assisted Evaluation: For performance levels beyond human evaluative capacity (akin to the “200 IQ” problem), leverage advanced AIs designed specifically for test creation and analysis. These AIs can help break down complex outputs into verifiable chunks, ensuring that even ultra-high performance can be reliably measured.
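As a purely hypothetical illustration of the “domain benchmarks plus macro-level score” idea above, one simple way to bake the breadth requirement into the math is to combine per-domain scores with a geometric mean, so a spike in one domain can’t compensate for weakness in another. The domains, scores, and aggregation rule below are illustrative assumptions, not a real evaluation protocol.

```python
# Hypothetical aggregation of per-domain benchmark scores into one
# "general intelligence" number. Domains, scores, and the geometric-mean
# rule are all illustrative assumptions.

import math

def general_score(domain_scores: dict) -> float:
    # Geometric mean punishes narrow spikes: a 0.99 in coding can't compensate
    # for a 0.15 in spatial reasoning the way an arithmetic mean would.
    values = list(domain_scores.values())
    return math.prod(values) ** (1 / len(values))

narrow_specialist = {"coding": 0.99, "math": 0.95,
                     "creative_writing": 0.20, "spatial_reasoning": 0.15}
broad_generalist  = {"coding": 0.70, "math": 0.70,
                     "creative_writing": 0.65, "spatial_reasoning": 0.60}

print(round(general_score(narrow_specialist), 2))  # ~0.41
print(round(general_score(broad_generalist), 2))   # ~0.66
```

The specific formula isn’t the point; the point is that whatever aggregation an expert panel adopts, breadth has to be enforced by the scoring rule itself rather than asserted in prose.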
C.) Addressing Potential Challenges
The “200 IQ” Dilemma: As AI systems surpass the capabilities of the best human experts, our traditional tests become unreliable. The framework should include mechanisms for calibration—such as cross-validation between multiple expert panels or using AI-generated benchmarks—to ensure that the evaluation remains accurate.
Maintaining Objectivity: To prevent biases, the consensus process must be transparent and involve periodic peer reviews. A standardized protocol for how benchmarks are updated and how experts vote on each level is essential.
Scalability: The framework should be designed to accommodate varying levels of granularity. Whether using a coarse 1–10 scale or a more fine-grained continuous metric (1, 1.5, 2, 2.5 to 100, etc.), the system should allow adjustments as needed, ensuring that the criteria remain relevant as AI evolves.
D.) Benefits of an Expert Consensus Framework
Standardization: With agreed-upon metrics, the research community and industry stakeholders can objectively compare progress across different models and domains.
Clarity: A clear, consensus-based set of criteria helps reduce confusion and contradictory claims about whether a system has achieved AGI.
Actionable Benchmarks: Policymakers and industry leaders can use these benchmarks to guide investment, safety protocols, and further research, ensuring that discussions around AGI focus on measurable improvements rather than semantic debates.
VIII. Recap: AGI Definitions/Criteria
After examining the ever-shifting landscape of AGI benchmarks—illustrated by the evolution from ARC-AGI to ARC-AGI-2—and the wide range of definitions offered by leaders like Dario Amodei, Demis Hassabis, and Satya Nadella, one clear message emerges: AGI isn’t a fixed destination but an evolving target.
A.) Subjective, Moving Goalposts
No Single Threshold: Different experts define AGI based on diverse criteria—whether it’s achieving Nobel-level performance, replicating Einstein-like breakthroughs, or delivering significant economic impact. As soon as one benchmark is met, the target shifts, rendering the binary “AGI or not?” debate moot.
Dynamic Benchmarks: As seen with the ARC-AGI cycle, when a system reaches the current standard (e.g., scoring 85%), critics immediately push for a tougher test, ensuring that “AGI” remains an ever-moving goalpost.
B.) Levels-Based, Consensus-Driven Approach?
Continuum of Capability: Rather than waiting for a single “AGI moment,” I propose evaluating AI progress along a continuum—from early AGI (e.g., Level 1) to higher levels that eventually lead to ASI.
Domain and Macro Considerations: While some systems excel in specific domains, true AGI requires multi-domain competence. This framework accommodates both domain-specific performance and overall generality.
Expert Consensus: To bring objectivity to this continuum, I’d advocate for establishing an expert consensus on the criteria for each level. This standardized framework would function much like calibrated, high-precision IQ tests—ensuring that as AI systems reach ever-higher levels, their capabilities are measured consistently.
Final Take: AGI or Nah?
The debate over “AGI or not?” is less about a single breakthrough and more about understanding where we stand on a continuum of evolving capabilities.
By adopting a levels-based perspective—anchored by expert consensus—we can objectively assess AI progress and track its development incrementally, rather than fixating on a single, elusive “AGI moment.”