Will AI Inference (Test-Time Compute) Scaling Yield True Novelty & Major Scientific Breakthroughs?
It is unlikely that scaling AI inference (test-time compute) alone, without humans in the loop, will yield high-value novel breakthroughs that advance humanity
Artificial intelligence, once confined to specialized research laboratories, now underpins chatbots, coding assistants, and creative content generators used by millions.
At the forefront are GPT-style Large Language Models (LLMs), such as GPT-4, whose ability to produce human-like text has fueled excitement about AI’s rapid progress.
Despite these impressive feats, a lingering question remains:
Can these AI systems move beyond recombining known data to produce legitimately novel insights? (Or are they basically just more advanced “calculators” for things like coding, writing, analysis, etc.?)
Evolution of Base GPT Models
The Early GPT Models (GPT-1 and GPT-2)
GPT-1 (2018)
Showcased that a transformer-based architecture, trained on large amounts of text, could achieve state-of-the-art results in multiple NLP tasks.
Despite being notable for its time (~117 million parameters), it remained mostly in research labs.
GPT-2 (2019)
Scaled parameter counts to 1.5 billion, leading to significantly more coherent text generation.
Sparked public debate over AI safety, as it could generate passages that were convincing enough for misinformation or spam.
Takeaway: Early GPT models proved “bigger is better” for language tasks, hinting at the scaling trend that would soon dominate.
GPT-3: A Large Leap in Scale
175 Billion Parameters: GPT-3 was the first LLM to really grab mass attention, showing “few-shot” or even “zero-shot” learning capabilities. It could write code snippets, creative stories, or translations with minimal fine-tuning.
Limitations: GPT-3 still hallucinated facts, struggled with certain logic puzzles, and sometimes produced bizarre or inconsistent outputs. This underscored that raw scaling alone doesn’t guarantee robust reasoning.
Takeaway: GPT-3’s success popularized the notion that “more parameters + more data” = better overall performance, yet it also revealed the persistent issue of shallow or incorrect reasoning.
GPT-4 (2023): Refinements and Partial Multimodality
Refined Language Understanding: GPT-4 claimed lower error rates, better multilingual skills, and an optional multimodal interface (in certain versions). It tackled complex writing tasks, coding, and advanced Q&A more effectively than GPT-3.
Still Not “Truly Novel”: GPT-4 excelled at synthesizing existing ideas, yet it wasn’t producing groundbreaking scientific theories or brand-new solutions out of thin air. Hallucinations diminished but remained an issue.
Takeaway: GPT-4’s jump was less about raw size and more about quality-of-data improvements, fine-tuning, and partial problem-solving enhancements. However, it still didn’t break free from the boundary of known human knowledge.
Scaling Beyond Model Size: Inference-Time Compute & Chain-of-Thought
As GPT-4 arrived, it became clear that merely making models bigger might yield diminishing returns, especially for deeper reasoning tasks. Enter the era of inference-time scaling—emphasizing how the model is used at runtime. (Read: OpenAI’s New O3 Model).
Chain-of-Thought (CoT) Reasoning
What It Is: Instead of having the AI output a final answer in one shot, CoT prompts the model to generate intermediate “thinking steps” before concluding.
Why It Helps: This approach can improve multi-step tasks (e.g., math, coding, logical puzzles) because the model checks partial reasoning en route, reducing blatant errors.
Impact on Novelty: While CoT can yield more transparent and methodical answers, the knowledge base is still drawn from training data. It’s great for reorganizing known facts, but breaking new ground remains unlikely unless the model is specifically pushed to hypothesize beyond known patterns.
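To make the idea concrete, here is a minimal sketch of how a chain-of-thought prompt differs from a direct one-shot prompt. This is not any vendor’s actual method; the ask_llm function is a hypothetical placeholder for whatever LLM API you happen to use.

```python
# Minimal sketch of chain-of-thought (CoT) prompting vs. a direct prompt.
# `ask_llm` is a hypothetical stand-in for any LLM completion API.

def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("Wire this up to your LLM provider of choice.")

question = "A store sells pens in packs of 12 for $3. How much do 60 pens cost?"

# Direct prompt: the model answers in one shot.
direct_prompt = f"{question}\nAnswer with just the final amount."

# Chain-of-thought prompt: the model is asked to show intermediate steps
# before committing to a final answer, which tends to catch arithmetic slips.
cot_prompt = (
    f"{question}\n"
    "Think step by step: first find how many packs are needed, "
    "then multiply by the price per pack. End with 'Final answer: ...'."
)

# direct_answer = ask_llm(direct_prompt)
# cot_answer = ask_llm(cot_prompt)
```

The point of the extra “thinking steps” text is simply that the model commits to intermediate results it can check, rather than jumping straight to a final number.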
Large-Scale Inference & Repeated Sampling
Multiple Candidate Answers: Another tactic is to generate many parallel solutions for the same query, then vote on the best or merge them. This “majority vote” or “self-consistency” method often raises accuracy.
Trade-Offs
Improved Reliability: More tries can reduce random errors or oversights.
Rapidly Escalating Costs: Each extra sample means extra computation. Running 100 or 1000 parallel branches can become very expensive at scale.
Does It Spur Novelty? These repeated passes typically explore various angles of the same underlying knowledge base. Occasionally, you get a creative or unexpected angle. However, it’s still mainly an iterative approach—not an explicit push toward uncharted ideas.
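As a rough sketch of the self-consistency idea described above: sample the same question many times, extract each final answer, and take a majority vote. Again, ask_llm is a hypothetical placeholder, and the cost scales roughly linearly with the number of samples.

```python
# Minimal sketch of self-consistency / majority-vote sampling.
# `ask_llm` is a hypothetical LLM call; temperature > 0 yields varied samples.
from collections import Counter

def ask_llm(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder: return one sampled completion for `prompt`."""
    raise NotImplementedError

def extract_final_answer(response: str) -> str:
    """Pull the text after 'Final answer:' (assumes the prompt asked for it)."""
    return response.rsplit("Final answer:", 1)[-1].strip()

def self_consistency(prompt: str, n_samples: int = 20) -> str:
    """Sample the model n_samples times and return the most common answer."""
    answers = [extract_final_answer(ask_llm(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Cost trade-off: 20 samples cost roughly 20x the tokens of one sample,
# and 1,000 parallel branches cost roughly 1,000x.
```

Note that every branch draws from the same trained model, which is why more samples mainly buy reliability rather than fundamentally new ideas.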
Read: AI Emergence: Emergent Behaviors Do Not Automatically Equal Novelty
Why true novelty remains elusive for AI models…
Despite advanced chain-of-thought and multi-sample inference:
Recombination vs. Genuine Invention: LLMs excel at pattern matching and can propose “new-sounding” ideas, but in practice, those ideas generally derive from rearrangements of training data.
No Direct Real-World Feedback: Without an automated process to test or experiment on novel suggestions, the model can’t refine genuine unknowns. It must stick to textual patterns, which are inherently secondhand human knowledge.
Compute-Cost Constraints: Even at thousands of parallel samples, the cost can be prohibitive. Consequently, most usage focuses on reliability rather than orchestrating huge volumes of “risky,” purely exploratory attempts.
Incremental vs. Radical Gains: CoT, repeated sampling, and better fine-tuning do deliver more robust “analyst/coder” type outputs. But they rarely break from known frameworks—especially without data beyond human-authored text.
Odds of Major Novelty Breakthroughs in the Next 5 Years (2025-2030)
Inference/Test-Time Scaling (Only)
Estimated Probability: Many analysts pin it at 10–20% that we’ll see truly unprecedented solutions (like discovering brand-new laws of physics or curing a major disease from scratch) via GPT-like models used in conventional ways.
Why Not Higher?
Data Boundaries: The model’s knowledge is finite, pulled from existing text.
Lack of Real-World Testing: Novelty also requires experimentation or domain-specific data beyond text.
Cost Constraints: The most advanced multi-branch inference setups become expensive, pushing mainstream usage toward more straightforward tasks.
Bottom Line: We can expect continuing improvements in correctness, coding assistance, multi-step logic—but not a frequent stream of “eureka!” breakthroughs purely from scaling up inference.
Note: Human augmentation with advanced AIs should, in theory, improve cost- and time-efficiency, helping humans achieve rapid breakthroughs in certain fields. This isn’t the same as the AI simply spitting out perfect ideas on its own.
Hypothetical Paths to AI Models Generating Novel Output (Innovative Solutions)
Although conventional GPT-style models (even with chain-of-thought and multi-sample inference) rarely spark radical originality, several targeted innovations could raise the odds that AI systems discover truly unprecedented ideas or solutions.
Massive Parallel Exploration at Inference
Idea: Run thousands—or even millions—of alternative solution paths for each query or challenge.
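As a hedged sketch of what massive parallel exploration might look like in practice: fan out many independent attempts (here concurrently, with varied sampling settings), then score and keep the most promising candidates. The ask_llm and score_candidate functions are hypothetical placeholders; a real system would need domain-specific evaluators (tests, simulations, or verifier models).

```python
# Sketch: fan out many exploratory solution attempts, then keep the best ones.
# `ask_llm` and `score_candidate` are hypothetical placeholders.
import asyncio
import random

async def ask_llm(prompt: str, temperature: float) -> str:
    """Placeholder: return one sampled completion for `prompt`."""
    raise NotImplementedError

def score_candidate(candidate: str) -> float:
    """Placeholder: score a candidate (e.g., via tests, a verifier, or a critic model)."""
    raise NotImplementedError

async def explore(prompt: str, n_paths: int = 1000, keep_top: int = 5) -> list[str]:
    # Vary temperature per path so branches explore different regions of idea space.
    tasks = [
        ask_llm(prompt, temperature=random.uniform(0.5, 1.2))
        for _ in range(n_paths)
    ]
    candidates = await asyncio.gather(*tasks)
    ranked = sorted(candidates, key=score_candidate, reverse=True)
    return ranked[:keep_top]

# best = asyncio.run(explore("Propose a new approach to ..."))
```

The hard part is not the fan-out itself but the scoring step: without a trustworthy way to evaluate unfamiliar candidates, more branches mostly mean more cost.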