Sundeep Teki

Gemini 3 and The Dawn of System 2 AI

19/11/2025


 
Figure 1
Figure 2
The benchmark data from the latest Gemini 3 release marks a clear step change in frontier model performance relative to competing LLMs (figure 1).

The performance delta between Gemini 3 and Gemini 2.5 (figure 2), attributed to improved pre-training and post-training (cf. Oriol Vinyals' post on X), suggests that Google has cracked the code on "System 2" thinking for multimodal AI.

Here are some key insights that I gleaned from the latest benchmark results:

1. Visual Logic is the New Moat:
The divergence on ARC-AGI-2 is striking. While GPT-5.1 and Claude Sonnet 4.5 score in the 13-18% range, Gemini 3 Deep Think reaches 45.1%. This isn't just better image recognition; it points to a substantial advance in abstract visual reasoning and generalization.

2. The "Reasoning" Explosion:
On Humanity's Last Exam (HLE), the jump is large. Gemini 3 Pro scores 37.5%, a 73.6% relative improvement over its predecessor 2.5 Pro, while the Deep Think variant pushes the boundary to 41.0%. The results suggest we are moving rapidly beyond pattern matching toward verifiable logic.

3. Agentic Planning has Matured:
The improvements in "Coding & Agents" are massive. The 855% improvement on Vending-Bench 2 (Planning) and the 537% improvement on ScreenSpot-Pro (UI Vision) signal that the coming year might herald fully autonomous, reliable agents that can navigate software interfaces as well as humans, if not better.

4. LLMs Can Do Math:
Perhaps the most staggering data point is the 4,580% relative jump in Gemini 3 Pro's score on MathArena Apex (from 0.50% to 23.40%, while Sonnet 4.5 and GPT-5.1 score ~1-1.6%). This suggests that hallucinations in mathematical workflows are being brought under control; one plausible explanation is the integration of formal verification steps into the model's chain of thought.

5. Conclusions & Future Trends:
The data supports the view that scaling laws still hold, but the gains are shifting toward quality of thought (inference compute) rather than just fluency. The disparity in the ARC-AGI-2 scores suggests that Google may have found a distinctive architectural advantage in multimodal processing. Future architectures will likely commoditize "Deep Thinking" modes, making high-fidelity complex reasoning accessible for coding and scientific discovery.
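
The percentage gains quoted in points 2 to 4 are relative improvements over the Gemini 2.5 Pro score, not differences in percentage points. The short Python sketch below shows that arithmetic using only numbers quoted in this post. The function names are my own, chosen for illustration, and the Gemini 2.5 Pro HLE score it prints is back-calculated from the quoted gain rather than read from the benchmark table.

```python
def relative_gain_pct(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    if old <= 0:
        raise ValueError("baseline score must be positive")
    return (new - old) / old * 100.0


def implied_baseline(new: float, gain_pct: float) -> float:
    """Baseline score implied by a new score and a quoted relative gain."""
    return new / (1.0 + gain_pct / 100.0)


# MathArena Apex: both scores are quoted in the post (percent correct).
apex_gain = relative_gain_pct(0.50, 23.40)
print(f"MathArena Apex relative gain: {apex_gain:,.0f}%")          # 4,580%
print(f"MathArena Apex absolute gain: {23.40 - 0.50:.1f} points")  # 22.9 points

# Humanity's Last Exam: only the new score (37.5%) and the relative gain
# (73.6%) are quoted, so the predecessor's score is inferred here.
hle_baseline = implied_baseline(37.5, 73.6)
print(f"Implied Gemini 2.5 Pro HLE score: {hle_baseline:.1f}%")    # 21.6%
```

The MathArena Apex case shows why both views matter: a 4,580% relative gain corresponds to 22.9 percentage points, because the baseline was close to zero.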

    Copyright © 2025, Sundeep Teki
    All rights reserved. No part of these articles may be reproduced, distributed, or transmitted in any form or by any means, including  electronic or mechanical methods, without the prior written permission of the author. 
    Disclaimer
    This is a personal blog. Any views or opinions represented in this blog are personal and belong solely to the blog owner and do not represent those of people, institutions or organizations that the owner may or may not be associated with in professional or personal capacity, unless explicitly stated.