
Agentic Context Engineering

14/10/2025


 
"We argue that contexts should function not as concise summaries, but as comprehensive, evolving playbooks - detailed, inclusive, and rich with domain insights."
​

- Zhang et al., 2025
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Table of Contents

1. Conceptual Foundations
  • 1a. Problem Context: The $30 Billion Question
  • 1b. Historical Evolution: From Prompts to Playbooks
  • 1c. Core Innovation: Agentic Context Engineering Framework

2. Technical Architecture
  • 2a. Fundamental Mechanisms: The Three-Role ACE System
  • 2b. Implementation Considerations: Production Patterns

3. Advanced Topics
  • 3a. Variations and Extensions: Multi-Agent Architectures
  • 3b. Current Research Frontiers: Agentic RAG
  • 3c. Limitations and Challenges: The 40% Failure Rate

4. Practical Applications
  • 4a. Industry Use Cases: Production Deployments
  • 4b. Performance Characteristics: Benchmarks and Comparisons
  • 4c. Best Practices: Lessons from Practice

5. Engineering Agentic Systems into Production
  • 5a. Practical Implementation with Modern Frameworks
  • 5b. Evaluation and Benchmarking: Measuring Agent Performance and Reliability
  • 5c. System Design Considerations: Scalability, Latency, and Cost
  • 5d. The Strategic Moat: Building a Proprietary "Context Supply Chain"

6. Conclusions - Cracking Agentic AI and Context Engineering Roles

7. CTA: Subscribe to my upcoming Substack Newsletter on AI Deep Dives & Careers

8. Resources - my other articles on Context Engineering
  • Context Engineering
  • From Vibe Coding to Context Engineering
  • Context-Bench - Evaluating Agentic Context Engineering

Agentic Context Engineering framework (Zhang et al., 2025)
1. Conceptual Foundations

1a. Problem Context: The $30 Billion Question
Despite $30-40 billion in corporate GenAI spending, 95% of organizations report no measurable P&L impact. The culprit isn't model capability - GPT-5 and Claude Sonnet 4.5 demonstrate remarkable reasoning prowess. The bottleneck is context engineering: these powerful models consistently underperform because they receive an incomplete, half-baked view of the world.

Consider this: when you ask an LLM to analyze a company's Q2 financial performance, it has zero access to your actual financial data, recent market trends, internal metrics, or strategic context. It operates with parametric knowledge frozen at training cutoff, attempting to solve real-time problems with static, general information. This is the fundamental gap that context engineering addresses.

The Core Insight:
The quality of the underlying model is often secondary to the quality of the context it receives. Teams investing heavily in swapping between GPT-5, Claude, and Gemini see marginal improvements because all of these models fail when fed incomplete or inaccurate worldviews. The frontier of AI application development has shifted from model-centric optimization to context-centric architecture design.


1b. Historical Evolution: From Prompts to Playbooks

Era 1: Prompt Engineering (2020-2023)
  • Tactical, single-turn interactions
  • Focus on "clever wording" and few-shot examples
  • Stateless operations with no memory
  • Success measured by individual response quality

Era 2: RAG & Context Engineering (2023-present)
  • Strategic, multi-turn conversations
  • Shift to "context pipelines" and information ecosystems
  • Stateful systems with persistent memory
  • Success measured by task completion and business outcomes

Era 3: Agentic Context Engineering (2024-present)
  • Autonomous, self-improving systems
  • Contexts as evolving playbooks that accumulate strategies
  • Continuous learning through incremental adaptation
  • Success measured by long-term reliability and cost efficiency

The progression reflects a maturation from creative prompt crafting to industrial-grade context orchestration. As Andrej Karpathy's "context-as-a-compiler" analogy captures: the LLM is the compiler translating high-level human intent into executable output, and context comprises everything the compiler needs for correct compilation - libraries, type definitions, environment variables.

Unlike traditional compilers, which are deterministic and throw clear errors, LLMs are stochastic. They make best guesses, which can be creative or disastrous. Agentic Context Engineering systematically addresses this unpredictability.

1c. Core Innovation: The Agentic Context Engineering Framework
The arXiv paper by Zhang and colleagues (2025) introducing Agentic Context Engineering identified two critical failure modes in existing context-adaptation approaches:

Brevity Bias:
Optimization systems collapse toward short, generic prompts, sacrificing diversity and omitting domain-specific detail. Research documented near-identical instructions like "Create unit tests..." propagating across iterations, perpetuating recurring errors. The assumption that "shorter is better" breaks down for LLMs - unlike humans who benefit from concise generalization, LLMs demonstrate superior performance with long, detailed contexts and can autonomously distill relevance.


Context Collapse:
When LLMs rewrite accumulated context, they compress into much shorter summaries, causing dramatic information loss. One documented case saw context drop from 18,282 tokens (66.7% accuracy) to 122 tokens (57.1% accuracy) in a single rewrite step.


The ACE Solution: Treat contexts as comprehensive, evolving playbooks rather than concise summaries. This playbook paradigm introduces three key innovations:
  1. Incremental delta updates instead of monolithic rewrites
  2. Bulletized context architecture with metadata-enriched entries
  3. Three-role modular system separating generation, reflection, and curation

This framework achieved:
  • +10.6% on agent benchmarks,
  • +8.6% on finance domains,
  • 86.9% latency reduction, and
  • 75.1% cost reduction - matching top-ranked production agents while using smaller open-source models.
2. Technical Architecture

2a. Fundamental Mechanisms: The ACE Three-Role System

Architecture Overview:

Role 1: Generator
  • Produces reasoning trajectories for new queries
  • Surfaces effective strategies and recurring pitfalls
  • Operates with current context state
  • Outputs: chain-of-thought reasoning, tool calls, intermediate results

Role 2: Reflector (Key Innovation)
  • Critiques traces to extract actionable lessons
  • Separates evaluation from insight extraction
  • Refines across multiple iterations (typically 5 rounds)
  • Crucial for context quality - weak reflection produces noisy, harmful context
  • Outputs: strategic insights, failure patterns, domain concepts

Role 3: Curator
  • Synthesizes lessons into compact delta entries
  • Maintains consistency with existing context structure
  • Handles de-duplication via semantic embeddings
  • Outputs: structured bullets ready for deterministic merging
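To make the division of labor concrete, below is a minimal sketch of how the three roles might compose around a bulletized, metadata-enriched playbook with incremental delta merges. The llm() stub, prompts, class names, and dedup rule are illustrative assumptions, not the paper's reference implementation.

from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; plug in a real model client."""
    raise NotImplementedError

@dataclass
class Bullet:
    """One metadata-enriched playbook entry; entries are appended, never rewritten wholesale."""
    bullet_id: int
    text: str
    helpful: int = 0   # counters the Reflector can adjust over time
    harmful: int = 0

@dataclass
class Playbook:
    bullets: list = field(default_factory=list)

    def render(self) -> str:
        return "\n".join(f"[{b.bullet_id}] {b.text}" for b in self.bullets)

    def merge(self, deltas: list) -> None:
        """Deterministic delta merge: append new bullets, skip near-duplicates."""
        existing = {b.text.strip().lower() for b in self.bullets}
        for text in deltas:
            if text.strip().lower() not in existing:   # naive dedup stand-in for embedding-based checks
                self.bullets.append(Bullet(len(self.bullets), text))

def generator(query: str, playbook: Playbook) -> str:
    return llm(f"Context playbook:\n{playbook.render()}\n\nSolve the task, showing reasoning and tool calls:\n{query}")

def reflector(query: str, trajectory: str, rounds: int = 5) -> str:
    lessons = ""
    for _ in range(rounds):   # multi-round refinement of the critique
        lessons = llm(f"Critique this trajectory for task '{query}'.\nTrajectory:\n{trajectory}\nPrior lessons:\n{lessons}\nExtract concrete, reusable lessons.")
    return lessons

def curator(lessons: str) -> list:
    raw = llm(f"Rewrite these lessons as short, self-contained playbook bullets, one per line:\n{lessons}")
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

def ace_step(query: str, playbook: Playbook) -> str:
    trajectory = generator(query, playbook)           # Role 1
    deltas = curator(reflector(query, trajectory))    # Roles 2 and 3
    playbook.merge(deltas)                            # incremental update, no monolithic rewrite
    return trajectory

The key property is that merge() only appends or updates bullets; nothing ever asks a model to rewrite the whole playbook, which is what guards against context collapse.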

Critical Design Choice:
Separating reflection from curation dramatically improves context quality. Previous approaches combined these roles, leading to superficial analysis and redundant entries.

    
2b. Implementation Considerations: Production Patterns

There are four pillars of context management:

1. Write: Persist state and build memory beyond a single LLM call.
Techniques: scratchpads for reasoning, tool-call logging, structured note-taking.

2. Select: Dynamically retrieve the right information at the right time.
Techniques: retrieval-augmented generation (RAG), tool-definition retrieval, "just-in-time" context.

3. Compress: Manage context-window scarcity by reducing token footprint.
Techniques: LLM-based summarization (compaction), heuristic trimming, linguistic compression.

4. Isolate: Prevent different contexts from interfering with each other.
Techniques: sub-agent architectures with separate contexts, sandboxing disruptive processes.

Pattern 1: WRITE - Contextual Memory Architectures
LLMs are stateless by default. Multi-turn applications require external memory:
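A minimal sketch of the WRITE pillar, assuming a JSON file as the backing store (any database or vector store works the same way); the field names and storage schema are illustrative:

import json
from datetime import datetime, timezone
from pathlib import Path

class AgentMemory:
    """External memory: survives across LLM calls because it lives outside the model."""

    def __init__(self, path: str = "agent_memory.json"):
        self.path = Path(path)
        self.records = json.loads(self.path.read_text()) if self.path.exists() else []

    def write(self, kind: str, content: dict) -> None:
        """Persist a scratchpad note, tool-call record, or structured observation."""
        self.records.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "kind": kind,                # e.g. "note", "tool_call", "plan"
            "content": content,
        })
        self.path.write_text(json.dumps(self.records, indent=2))

    def recent(self, n: int = 10) -> str:
        """Render the last n records for re-injection into the next prompt."""
        return "\n".join(f"- {r['kind']}: {r['content']}" for r in self.records[-n:])

memory = AgentMemory()
memory.write("tool_call", {"tool": "search", "query": "Q2 revenue", "result_ids": [1, 4]})
memory.write("note", {"text": "User only cares about the EMEA segment."})
next_prompt_context = memory.recent()   # prepend this to the next LLM call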

    
Pattern 2: SELECT - Advanced Retrieval
Beyond naive vector similarity:
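One way to go beyond single-pass vector similarity is to fuse lexical and semantic rankings. The sketch below combines the two with reciprocal rank fusion; keyword_search and vector_search are stand-ins for whatever retrievers you already run:

def keyword_search(query: str) -> list:
    """Stand-in for a BM25/keyword retriever; returns document IDs, best first."""
    return ["doc_q2_report", "doc_board_memo", "doc_press_release"]

def vector_search(query: str) -> list:
    """Stand-in for an embedding retriever; returns document IDs, best first."""
    return ["doc_board_memo", "doc_q2_report", "doc_old_forecast"]

def hybrid_retrieve(query: str, k: int = 5, rrf_c: int = 60) -> list:
    """Reciprocal rank fusion: documents ranked highly by either retriever float to the top."""
    scores = {}
    for ranking in (keyword_search(query), vector_search(query)):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(hybrid_retrieve("Q2 revenue drivers"))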

    
Pattern 3: COMPRESS - Managing Million-Token Windows
The Sentinel Framework (2025) demonstrates query-aware compression:
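Setting Sentinel's internals aside, the core idea of query-aware compression can be sketched generically: score each chunk against the current query and keep only what fits the token budget. The scorer and token counter below are simplified stand-ins, not the Sentinel API:

def relevance(query: str, chunk: str) -> float:
    """Stand-in scorer: word overlap. Swap in embeddings or a reranker model."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def count_tokens(text: str) -> int:
    """Rough proxy (about 4 characters per token); use the model's tokenizer in practice."""
    return max(1, len(text) // 4)

def compress_context(query: str, chunks: list, budget_tokens: int) -> str:
    """Keep the most query-relevant chunks that fit within the token budget."""
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: relevance(query, c), reverse=True):
        cost = count_tokens(chunk)
        if used + cost <= budget_tokens:   # greedy fill under the budget
            kept.append(chunk)
            used += cost
    return "\n\n".join(kept)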

    
Pattern 4: ISOLATE - Compartmentalizing Context
Prevent "context soup" that mixes unrelated information streams:
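A minimal sketch of the ISOLATE pillar: each sub-agent keeps its own message history, and only a short summary crosses the boundary back to the caller. The llm() stub and role prompts are illustrative:

def llm(messages: list) -> str:
    """Placeholder: swap in a real chat-completion call."""
    return f"[model output for {len(messages)} messages]"

class SubAgent:
    """Each sub-agent owns a private message history (its isolated context)."""

    def __init__(self, role_prompt: str):
        self.history = [{"role": "system", "content": role_prompt}]

    def run(self, task: str) -> str:
        self.history.append({"role": "user", "content": task})
        answer = llm(self.history)
        self.history.append({"role": "assistant", "content": answer})
        # Only a bounded summary crosses the boundary, not the full working trace.
        return llm(self.history + [{"role": "user", "content": "Summarize your result in 3 bullets."}])

researcher = SubAgent("You research financial filings.")
writer = SubAgent("You draft executive summaries.")
findings = researcher.run("Pull the Q2 revenue drivers.")
draft = writer.run(f"Write a summary using only these findings:\n{findings}")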

    

🎯 PAUSE: Are You Getting Maximum Value?

You've just absorbed 1,000+ words of dense technical content on Agentic Context Engineering. Here's the reality: reading once isn't enough for mastery.

What top performers do differently:
- They revisit advanced concepts with fresh examples
- They stay current on weekly research developments  
- They learn production patterns from real implementations
- They connect theory to evolving industry practices

I publish exclusive content weekly on Substack that extends guides like this with:
✅ New research paper breakdowns (GPT-5, Claude updates, agent frameworks)
✅ Production war stories and debugging lessons
✅ Interview questions actually asked at OpenAI, Anthropic, Google
✅ Career navigation strategies for AI roles
No spam. Unsubscribe anytime. One email per week with genuinely useful insights.

3. Advanced Topics

3a. Variations and Extensions: Multi-Agent Architectures


1. Orchestrator-Workers Pattern (Hub-and-Spoke):

A central orchestrator dynamically decomposes tasks and delegates them to specialist agents:
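A hub-and-spoke sketch under the same kind of placeholder llm() assumption: the orchestrator decomposes the task, routes each sub-task to a specialist, and synthesizes the results. The specialist roster and prompts are illustrative:

import json

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call."""
    raise NotImplementedError

SPECIALISTS = {
    "code": "You write and edit code.",
    "test": "You write and run tests.",
    "docs": "You update documentation.",
}

def orchestrate(task: str) -> str:
    # The orchestrator plans: decompose the task into routed sub-tasks.
    plan = json.loads(llm(
        f"Decompose this task into a JSON list of "
        f'{{"specialist": one of {list(SPECIALISTS)}, "instruction": "..."}} objects: {task}'
    ))
    results = []
    for step in plan:                                   # delegate to specialist workers
        system = SPECIALISTS[step["specialist"]]
        results.append(llm(f"{system}\n\nSub-task: {step['instruction']}"))
    joined = "\n".join(results)
    return llm(f"Task: {task}\nSub-results:\n{joined}\nSynthesize the final answer.")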

    
HyperAgent achieved 31.4% on SWE-bench Verified using this pattern with 4 specialists. MASAI reached 28.33% on SWE-bench Lite with modular sub-agents.
3b. Current Research Frontiers: Agentic RAG

Traditional RAG follows a fixed Retrieve → Augment → Generate sequence. Agentic RAG introduces dynamic reasoning loops built on five capabilities:
  1. Iterative Refinement: Retrieve, analyze, determine sufficiency, retrieve more if needed
  2. Self-Evaluation: Assess own outputs against quality criteria
  3. Query Decomposition: Break complex questions into targeted sub-queries
  4. Tool Integration: Select from multiple tools (vector search, SQL, web search, calculators)
  5. Adaptive Strategy: Adjust retrieval based on query type and intermediate results

Graph RAG: Integrates structured knowledge sources (databases, knowledge graphs) into retrieval.
Value: Enables complex multi-hop reasoning that is impossible with text-only retrieval.
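A compact sketch of the iterative-refinement loop described above: retrieve, judge sufficiency, refine the query if needed, then generate. The llm() and retrieve() helpers are placeholders for real model and tool calls:

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call."""
    raise NotImplementedError

def retrieve(query: str) -> list:
    """Placeholder retriever: vector search, SQL, web search, a calculator, etc."""
    raise NotImplementedError

def agentic_rag(question: str, max_rounds: int = 3) -> str:
    evidence, query = [], question
    for _ in range(max_rounds):
        evidence += retrieve(query)                     # retrieve
        verdict = llm(                                  # self-evaluate sufficiency
            f"Question: {question}\nEvidence so far:\n{evidence}\n"
            "Reply SUFFICIENT if this answers the question, otherwise propose exactly one follow-up query."
        )
        if verdict.strip().upper().startswith("SUFFICIENT"):
            break
        query = verdict                                 # query decomposition / refinement
    return llm(f"Answer using only this evidence:\n{evidence}\n\nQuestion: {question}")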

    
3c. Limitations and Challenges: The 40% Failure Rate

Gartner Prediction: 40% of agentic AI projects will be canceled by the end of 2027 due to:
  1. Escalating Costs: Agents use 3-5× more tokens than single LLM calls
  2. Unclear Business Value: ROI difficult to demonstrate
  3. Inadequate Risk Controls: Security, hallucination, unpredictability

Hallucination Problem (Cannot Be Eliminated):
Research proves hallucinations are inevitable by design in LLMs. Agent-specific types:
  • Reasoning hallucinations: Semantic vagueness in goal interpretation
  • Action hallucinations: Invalid tool usage or API calls
  • Retrieval hallucinations: Incorporating irrelevant context as fact

Mitigation Strategies:
Multi-agent orchestration reduces hallucinations by 10-15 percentage points.


Security Risks:
  • Prompt injection: User inputs manipulate agent behavior
  • Data poisoning: Malicious information in multi-agent communication
  • Jailbreaking: Collaborative reasoning amplifies attack success

Progress (2025):
Anthropic reduced prompt injection success from 23.6% to 11.2% in Claude Sonnet 4.5 through architectural improvements and safety classifiers.

    
4. Practical Applications

4a. Industry Use Cases: Production Deployments

1. Customer Support (Most Mature):
  • Salesforce Agentforce: 70% of tier-1 inquiries automated
  • Usage-based pricing: Charge only for successful resolutions
  • ROI: Virgin Voyages saw 28% sales boost with "Email Ellie" agent

2. Software Development:
  • Claude Sonnet 4.5: 77.2% on SWE-bench Verified, 30+ hour autonomous sessions
  • GPT-5 Codex: 74.9% success, 7+ hour independent work on complex refactors
  • Capabilities: Full-stack implementation, test-driven development, code review

3. Enterprise Operations:
  • Manufacturing: 40% reduction in unplanned downtime, 30% overtime reduction, 15% throughput gain
  • Finance: Capital A (AirAsia) AI-first transformation, Macquarie Bank universal Gemini access
  • Healthcare: Stanford Health Care using agents for tumor board preparation

4b. Performance Characteristics: Benchmarks and Comparisons

SWE-bench Verified (500 real-world software engineering tasks):
  • Claude Sonnet 4.5: 77.2% (baseline), 82.0% (parallel sampling)
  • GPT-5 Codex: 74.9%
  • Evolution: <20% (2023) → >75% (2025)

Computer Use (OSWorld):
  • Claude Sonnet 4.5: 61.4% (19-point improvement over previous SOTA)
  • Gemini 2.5: SOTA on web/mobile with lower latency

Hallucination Rates (29 LLMs tested):
  • Claude 3.7: 17% (lowest/most accurate)
  • Multi-agent orchestration: 10-15 percentage point reduction

4c. Best Practices: Lessons from Practice

Anthropic's Core Principles:
  1. Maintain simplicity: Start with simplest solution, add complexity only when necessary
  2. Prioritize transparency: Show planning steps, make decisions explainable
  3. Craft Agent-Computer Interface carefully: Tool documentation requires rigorous prompt engineering

Claude Code Best Practices:

# 1. Research before coding
agent.instruct("Tell me about this codebase")
agent.explore_structure()

# 2. Plan explicitly
agent.instruct("Think about approach, make a plan")
plan = agent.generate_plan()

# 3. Test-Driven Development
agent.write_tests(feature)
agent.verify_failures()
agent.implement(feature)
agent.verify_passes()

# 4. Use extended thinking for complex tasks
agent.instruct("ultrathink about the optimal architecture")

# 5. Commit frequently
agent.commit("feat: implement user authentication")

12-Factor Agent Framework:
  1. Own your context window and control loop
  2. Provide clear, specific instructions with relevant context
  3. Use structured outputs (JSON) for tool calls
  4. Separate reasoning (LLM) from execution (code)
  5. Small, focused agents (3-10 step workflows)
  6. Implement robust error handling
  7. Human-in-loop for high-risk operations
  8. Monitor and log all agent actions
  9. Version control agent configurations
  10. Test agents extensively before production
  11. Implement cost caps and rate limiting
  12. Design for graceful degradation

Essential Production Metrics:
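As a stand-in for a full metrics table, the sketch below captures the operational signals most teams log per agent run (latency, token usage, tool calls, outcome); the field names and per-token prices are illustrative assumptions:

import time
from dataclasses import dataclass, asdict

@dataclass
class AgentRunMetrics:
    task_id: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    tool_calls: int
    success: bool

    def cost_usd(self, in_price: float = 3e-6, out_price: float = 15e-6) -> float:
        """Assumed per-token prices; substitute your provider's actual rates."""
        return self.input_tokens * in_price + self.output_tokens * out_price

def timed_run(task_id: str, run_fn) -> AgentRunMetrics:
    """Wrap an agent invocation; run_fn should return a dict of usage statistics."""
    start = time.perf_counter()
    usage = run_fn()
    return AgentRunMetrics(
        task_id=task_id,
        latency_s=time.perf_counter() - start,
        input_tokens=usage.get("input_tokens", 0),
        output_tokens=usage.get("output_tokens", 0),
        tool_calls=usage.get("tool_calls", 0),
        success=usage.get("success", False),
    )

metrics = timed_run("task-001", lambda: {"input_tokens": 1200, "output_tokens": 300, "tool_calls": 2, "success": True})
print(asdict(metrics), metrics.cost_usd())   # ship these records to your observability stack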

    
5. Engineering Agentic Systems into Production

Translating the theoretical power of agentic architectures into robust, scalable, and valuable production systems requires a disciplined engineering approach. This involves leveraging modern frameworks, establishing rigorous evaluation practices, and making pragmatic design choices that balance capability with real-world constraints.


5a. Practical Implementation with Modern Frameworks (LangChain, LlamaIndex)
Frameworks like LangChain and LlamaIndex have become indispensable for building agentic systems. They provide the abstractions and tools needed to implement the architectural patterns discussed above.

LangChain, for example, offers a create_agent() function that builds a graph-based agent runtime using its LangGraph library. This runtime implements the ReAct loop by default and simplifies the process of defining tools, configuring models, and managing the agent's state.


A conceptual, production-ready implementation of a simple agent using LangChain might look like this:
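The sketch below assumes the create_agent() entry point described above; exact import paths, the model identifier string, and parameter names vary across LangChain versions, so treat it as illustrative rather than canonical:

from langchain.agents import create_agent        # assumed entry point (LangChain 1.x)
from langchain_core.tools import tool

@tool
def get_quarterly_revenue(quarter: str) -> str:
    """Look up revenue for a quarter from an internal source (stubbed here)."""
    return f"{quarter} revenue: $4.2M (sample data)"

# create_agent() builds a graph-based, ReAct-style runtime on LangGraph.
agent = create_agent(
    model="anthropic:claude-sonnet-4-5",          # assumed provider:model identifier
    tools=[get_quarterly_revenue],
    system_prompt="You are a financial analyst. Use tools for every figure you cite.",
)

# The reason -> act -> observe loop runs inside invoke().
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Summarize our Q2 revenue performance."}]}
)
print(result["messages"][-1].content)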

    
5b. Evaluation and Benchmarking: Measuring Agent Performance and Reliability
Evaluating an agent is significantly more complex than evaluating a simple classification model or even a static RAG system. The focus shifts from measuring the quality of a single, final output to assessing the quality of a dynamic, multi-step process.

In a production environment, evaluation must be multi-faceted:
  • Outcome-Based Metrics: Does the agent successfully complete the task? For RAG-based tasks, this includes metrics like response accuracy, factual consistency (faithfulness), and user satisfaction.
  • Process-Based Metrics: Was the agent's reasoning process logical and efficient? This involves evaluating the quality of the generated "thoughts" and the appropriateness of its tool usage.
  • Operational Metrics: How much did it cost? Key metrics include latency, throughput, and the total number of LLM and tool calls.

The Cost-Aware Pass Rate (CAPR) is an emerging metric that explicitly balances task success with computational cost, which is crucial for enterprise applications.
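The published formulation of CAPR may differ; as one plausible, clearly hypothetical version, the sketch below discounts each successful run by how much of a fixed cost budget it consumed, so two agents with equal accuracy rank differently if one burns far more tokens:

def cost_aware_pass_rate(runs: list, cost_budget_usd: float) -> float:
    """
    runs: [{"passed": bool, "cost_usd": float}, ...]
    A successful run earns credit in proportion to how far under budget it stayed.
    (Illustrative formulation, not necessarily the published CAPR definition.)
    """
    if not runs:
        return 0.0
    score = 0.0
    for r in runs:
        if r["passed"]:
            # Full credit at zero cost, decaying linearly to zero at the budget.
            score += max(0.0, 1.0 - r["cost_usd"] / cost_budget_usd)
    return score / len(runs)

runs = [
    {"passed": True,  "cost_usd": 0.12},
    {"passed": True,  "cost_usd": 0.95},
    {"passed": False, "cost_usd": 0.40},
]
print(cost_aware_pass_rate(runs, cost_budget_usd=1.00))   # ≈ 0.31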

Designing and implementing meaningful evaluation is a critical and often overlooked skill for senior AI engineers. It is the foundation for iterative improvement and for demonstrating the business value of an agentic system.

5c. System Design Considerations: Scalability, Latency, and Cost
Deploying agents in a business context introduces a host of pragmatic constraints. There is often a fundamental trade-off between the depth of an agent's reasoning and the production requirements for low latency and cost. A highly iterative, multi-step agent that performs "deep research" might provide a superior answer but be too slow for a real-time customer support chatbot.
​

Key design considerations include:
  • Model Selection: While frontier models offer the best reasoning, smaller, faster, and cheaper models are rapidly improving and may be sufficient for many specialized tasks, offering significant advantages in latency and cost.
  • Data Security: In enterprise settings, data privacy is non-negotiable. This often means models must be deployed within the company's private network, bringing the "model to the data."
  • Pipeline Complexity: An agentic system is an end-to-end pipeline with a "combinatorial explosion" of choices at each step (chunking strategy, embedding model, retriever, generator, etc.). Building this from scratch can be a multi-year effort, making the use of specialized vendors an attractive option for achieving a faster return on investment.

5d. The Strategic Moat: Building a Proprietary "Context Supply Chain"
Ultimately, the true, defensible value of agentic AI will not reside in the foundation model itself. As powerful models become increasingly commoditized, the competitive battleground is shifting. The strategic moat for AI-native companies will be the quality, breadth, and efficiency of their proprietary "context supply chain".

This supply chain includes:
  • Proprietary Data Sources: Unique, high-quality internal knowledge bases.
  • Exclusive Tools: Access to private APIs and internal systems.
  • Specialized Agentic Workflows: Finely-tuned, domain-specific agentic processes that solve core business problems.

​A company with a slightly inferior foundation model but a superior context supply chain can outperform a competitor with a better model but only generic context. Investing in the engineering systems to build, curate, and manage these proprietary context assets is the most critical strategic imperative for any organization looking to build a lasting advantage with AI.
6. Conclusion: Cracking Agentic AI & Context Engineering Roles
Agentic Context Engineering represents the frontier of applied AI in 2025. As this guide demonstrates, success in this field requires mastery across multiple dimensions: theoretical foundations (RAG, agent architectures, ACE framework), practical implementation (code, tools, frameworks), production considerations (scalability, security, cost), and continuous learning (research, experimentation, community engagement).

The 80/20 of Interview Success:
  1. Deep Understanding (40%): Don't memorize - understand why. Explain brevity bias, derive RAG formulations, reason about trade-offs.
  2. Implementation Skills (30%): Build real systems. Employers value candidates who've debugged production agents over those who've only read papers.
  3. Communication (20%): Articulate complex ideas clearly. Practice verbal explanations, whiteboard sessions, teaching others.
  4. Business Acumen (10%): Connect technical decisions to business outcomes. Understand when agents are appropriate vs. overkill.

Why This Matters for Your Career:
  • Market Demand: 33% of enterprise software will include agentic AI by 2028 (Gartner), creating massive demand for expertise
  • Salary Premium: Agentic AI specialists command 20-30% premium over traditional ML engineers
  • Future-Proofing: Autonomous systems are the next frontier after LLM chat; early expertise positions you as a leader
  • Impact: Build systems that genuinely transform how work gets done, not just incremental improvements

Taking Action:
If you're serious about mastering Agentic Context Engineering and securing roles at top AI companies like OpenAI, Anthropic, Google, and Meta, structured preparation is essential. For a custom roadmap and personalized coaching to accelerate your journey, consider reaching out to me:


With 17+ years of AI and neuroscience experience across Amazon Alexa AI, Oxford, UCL, and leading startups, I have successfully placed 100+ candidates at Apple, Meta, Amazon, LinkedIn, Databricks, and MILA PhD programs.

What You Get:
  • Customized preparation strategy based on your background and target companies
  • Deep technical interview preparation (AI fundamentals, coding, system design)
  • Mock interviews with detailed feedback 
  • Negotiation support
  • Career coaching to perform well in first 90 days of new role

Next Steps:
  1. Review this guide thoroughly - take notes, implement examples
  2. If serious about top-tier placement, schedule a 15-minute intro call
  3. Visit my Coaching website for details, advice and testimonials
  4. Contact me with your specific goals and requirements as below

Contact:
Please email me directly at [email protected] with the following information:
  • Career goals 
  • Career background
  • Coaching requirements
  • Target roles and companies
  • Timelines
  • CV
  • LinkedIn

The field of Agentic AI and Context Engineering is exploding with opportunity. Companies are desperate for engineers who understand these systems deeply. With systematic preparation using this guide and targeted coaching, you can position yourself at the forefront of this transformation.

Subscribe to my upcoming Substack Newsletter focused on AI Deep Dives & Careers

📚 CONTINUE YOUR LEARNING JOURNEY
You've just completed one of the most comprehensive technical guides on Agentic Context Engineering.

But here's the challenge:
The field evolves weekly. New benchmarks, frameworks, and production patterns emerge constantly. Claude Sonnet 4.5 was released just weeks ago. GPT-5 capabilities are expanding. Multi-agent protocols are standardizing.

Reading this once gives you a snapshot. Staying current gives you an edge.
What You Get with my Substack Newsletter:

🔬 Weekly Research Breakdowns
- Latest papers from ArXiv (contextualized for practitioners)
- Model updates and capability analyses
- Benchmark interpretations that matter

🏗️ Production Patterns & War Stories
- Real implementation lessons from Fortune 500 deployments
- What works, what fails, and why
- Cost optimization techniques saving thousands monthly

💼 Career Intelligence
- Interview questions from recent FAANG+ loops
- Salary negotiation advice and strategies
- Team and project selection frameworks

🎓 Extended Learning Resources
- Code repositories and notebooks
- Advanced tutorials building on guides like this
- Office hours announcements and AMAs

Subscribe to DeepSun AI (while it's still free) → https://substack.com/@deepsun
  • One email weekly.
  • No spam.
  • Unsubscribe anytime.
  • Premium tier coming soon.

