
What Jinja2 Templates Taught Me About AI Agents

If you’ve built network automation, you’ve probably picked a side. Jinja2 templates with Ansible and Python for rendering flat CLI configs. Or Cisco NSO with XML templates validated against YANG models for structured, schema-aware orchestration. Different tools, different philosophies, same goal: stop configuring devices from scratch every time.

I recently pulled concepts from both approaches to solve a completely different problem. I needed AI to assemble valid configurations from a database of 33,000 components. I borrowed Jinja2’s template inheritance and macro system for defining reusable blocks. I borrowed NSO’s commit dry-run validation philosophy and service hierarchy for guaranteeing correctness before anything reaches the user.

The domain was a card game (Magic: The Gathering), but the engineering patterns apply to any system where AI agents generate structured output from a large inventory. Network configs, cloud infrastructure, security policies.

The Problem: AI Generating Configurations from Scratch

I’m building a side project (mtgcommander.ai) where an AI assistant helps users build 100-item configurations from a pool of 33,000 valid components. Each configuration needs the right structural foundation (resource generation, threat response, card advantage, defensive options) plus 25-30 strategy-specific picks that make the build unique.

It’s the same shape as a network device config. You need NTP, DNS, logging, and interface configs for the structural foundation. Then you need custom ACLs, route maps, and policies for the context-specific part.

The first version worked like most AI agent implementations I see today. Hand the LLM a prompt describing what the user wants, let it generate the entire configuration from scratch, hope it’s correct. Sometimes the output was great. Sometimes it included invalid components, items that violated compatibility constraints, or obviously poor choices. The equivalent of configuring a router with a VLAN that doesn’t exist on the switch.

That’s what happens when you let an AI agent generate infrastructure output without guardrails. It works 80% of the time. The other 20% is where trust breaks down.

The Fix: Templates First, AI Second

This is where my network automation background kicked in. I pulled two patterns from two different tools:

From Jinja2/Ansible: Template inheritance, macros, and parameterized rendering. Define the structure once, override specifics per variant. The template guarantees the right structural components exist.

From NSO: Validation before commit. YANG models enforce constraints, so you can’t push a config that violates the schema. commit dry-run shows you exactly what will change before anything touches the device. Service hierarchies compose smaller validated units into complete configurations.

I combined both:

Before (AI generates everything):

User: "Build me a configuration optimized for strategy X"
AI: *generates all 100 items from scratch, hopes they're valid*
Result: Some good picks, some random filler, occasionally invalid items

After (template engine + AI for creative parts only):

User: "Build me a configuration optimized for strategy X"

Step 1: Jinja2 template renders structural foundation (deterministic, no AI)
  → 10 resource accelerators (mix varies by strategy)
  → 5 targeted answers + 3 sweepers
  → 10 card advantage pieces
  → 37-item resource base

Step 2: AI picks 25-30 strategy-specific items (the creative part)
  → Selected from pre-classified database, not hallucinated

Step 3: Compile-time validation (like commit dry-run)
  → Compatibility check ✓
  → No banned items ✓
  → Budget compliance ✓
  → Curve analysis balanced ✓

The AI only touches the 25-30 creative decisions. The other 70 items are deterministic, queried from an enriched database using the template’s specifications. No hallucination possible. No invalid items. No random filler.
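To make the flow concrete, here's a minimal Python sketch of those three steps. The spec dict and every helper in it are illustrative stand-ins, not the project's actual code:

# Toy stand-ins for the real template engine, database, and model call.

STRUCTURAL_SPEC = {  # step 1: rendered by Jinja2, never by the AI
    "mana-dork": 2, "mana-rock": 3, "sorcery-ramp": 3, "enchant-ramp": 2,
}

def query_database(tag: str, count: int) -> list[str]:
    # Deterministic query against the enriched component database.
    return [f"{tag}-{i}" for i in range(count)]

def ai_pick(strategy: str, count: int) -> list[str]:
    # The one place the LLM decides anything: strategy-specific picks
    # chosen from pre-classified candidates, never free-generated.
    return [f"{strategy}-pick-{i}" for i in range(count)]

def passes_validation(item: str) -> bool:
    # Stand-in for the commit-dry-run checks covered later in this post.
    return True

deck = [c for tag, n in STRUCTURAL_SPEC.items() for c in query_database(tag, n)]
deck += [c for c in ai_pick("sacrifice", 28) if passes_validation(c)]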

How It Works: Jinja2 Macros as Reusable Services

In NSO, a service is a reusable unit that takes input parameters (device name, VLAN ID, IP range) and produces a validated chunk of configuration. You compose services together (NTP service, DNS service, routing service) to build a complete device config. Each service handles its own domain and validates its own output.

I used Jinja2 macros to create the same pattern. Each macro is a “service” that takes parameters and renders a specification for what components to pull from the database. Below is the actual ramp_package macro from the project (“ramp” is in-game terminology for resource acceleration), followed by its rendered output for the base case. Change the parameters and the same template produces different output, the same shape as an NSO service template parameterized for branch vs. data center.

packages/ramp.j2 — resource acceleration service

Template

{# packages/ramp.j2 — unlimited budget tier shown #}

{% macro ramp_package(dorks=2, sorcery=3, rocks=3, enchant=2) %}
    creature_dorks:
      count: {{ dorks }}
      tag: mana-dork
    sorcery_ramp:
      count: {{ sorcery }}
      type_filter: Sorcery
      constraints:
        db_flag: is_ramp
    mana_rocks:
      count: {{ rocks }}
      tag: mana-rock
    enchant_ramp:
      count: {{ enchant }}
      type_filter: Enchantment
      constraints:
        db_flag: is_ramp
{% endmacro %}

Rendered output

    creature_dorks:
      count: 2
      tag: mana-dork
    sorcery_ramp:
      count: 3
      type_filter: Sorcery
      constraints:
        db_flag: is_ramp
    mana_rocks:
      count: 3
      tag: mana-rock
    enchant_ramp:
      count: 2
      type_filter: Enchantment
      constraints:
        db_flag: is_ramp

Base case — ramp_package() called with all defaults. Used by green-inclusive commanders without a strategy override.

The macro takes four parameters (how many of each sub-type to include) and renders a YAML specification. The tag and type_filter fields tell the compiler engine exactly how to query the database: “give me 3 items tagged mana-rock, ordered by popularity rank.” The constraints field adds hard filters, like requiring the is_ramp flag in the database. No ambiguity, no AI interpretation. A deterministic query spec.
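Here's a sketch of what that compiler step might look like, assuming a simple SQLite schema (a cards table with type_line, popularity_rank, and boolean flag columns, plus a card_tags join table; none of this is the project's actual schema):

import sqlite3

def fill_slot(db: sqlite3.Connection, spec: dict) -> list[str]:
    # Turn one rendered sub-slot spec into a deterministic, ranked query.
    sql = "SELECT c.name FROM cards c"
    where, params = [], []
    if "tag" in spec:                                  # tag: mana-rock
        sql += " JOIN card_tags t ON t.card = c.name"
        where.append("t.tag = ?")
        params.append(spec["tag"])
    if "type_filter" in spec:                          # type_filter: Sorcery
        where.append("c.type_line LIKE ?")
        params.append(f"%{spec['type_filter']}%")
    flag = spec.get("constraints", {}).get("db_flag")  # db_flag: is_ramp
    if flag:
        where.append(f"c.{flag} = 1")  # flag name comes from the template, not user input
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " ORDER BY c.popularity_rank LIMIT ?"       # best-ranked first
    params.append(spec["count"])
    return [row[0] for row in db.execute(sql, params)]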

The defaults (dorks=2, sorcery=3, rocks=3, enchant=2) define the balanced base case. Strategy-specific templates override just the parameters they need to change. Same idea as extending a base device template for a branch office vs. a data center:

{# sacrifice.j2 #}
{% extends "base_deck.j2" %}

{% block ramp_sub_slots %}
{{ ramp_package(dorks=4, sorcery=2, rocks=2, enchant=2) }}
{% endblock %}

One line changes the ramp mix: dorks=4 instead of the default 2. The sacrifice strategy gets more creature-based resource acceleration because those creatures serve double duty. They accelerate your resources AND they’re expendable assets you can sacrifice for value later. A different strategy (voltron) calls ramp_package(dorks=1, rocks=5) because artifact-based acceleration is more resilient to the board-clearing effects that strategy relies on.

Same macro, different parameters, different output. Always valid, always the right structure. If you’ve built NSO service packages where a branch office service extends a base device service but adjusts the routing protocol and interface count, this is that pattern in Jinja2.

The base template imports all the service macros and composes them into a complete configuration:

{# base_deck.j2 #}
{% from "packages/ramp.j2" import ramp_package %}
{% from "packages/removal.j2" import removal_package %}
{% from "packages/draw.j2" import draw_package %}

slots:
  ramp:
    count: {{ ramp_count }}
    sub_slots:
  {% block ramp_sub_slots %}
  {{ ramp_package() }}          {# defaults: balanced mix #}
  {% endblock %}

  removal:
    count: 5
    sub_slots:
  {% block removal_sub_slots %}
  {{ removal_package() }}       {# defaults: balanced mix #}
  {% endblock %}

  card_draw:
    count: 10
    sub_slots:
  {% block draw_sub_slots %}
  {{ draw_package() }}          {# defaults: balanced mix #}
  {% endblock %}

  strategy_cards:
    count: {{ remaining }}      {# AI fills these #}

Each {% block %} is an override point. The base template provides sane defaults. Strategy templates only override the blocks they need to change. Adding a new structural sub-type (a fifth category of resource acceleration) means editing one macro, not every strategy template.
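Rendering these is plain Jinja2 from Python. A minimal harness, assuming the templates live in a templates/ directory with the packages/ subfolder alongside them:

from jinja2 import Environment, FileSystemLoader

env = Environment(
    loader=FileSystemLoader("templates"),
    trim_blocks=True,      # drop the newline after {% ... %} tags
    lstrip_blocks=True,    # strip leading whitespace before tags
)
# sacrifice.j2 extends base_deck.j2, so its overridden blocks render here.
spec_yaml = env.get_template("sacrifice.j2").render(ramp_count=10, remaining=28)
print(spec_yaml)  # the YAML slot spec the compiler engine consumes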

The Data Pipeline: Enrich First, Then Let AI Decide

In network automation, your templates are only as good as your data source. Netbox, Catalyst Center, or your CMDB provides the device inventory: what interfaces exist, what VLANs are assigned, what IP addresses are allocated. Without accurate source data, your templates render garbage configs.

Same principle here. I built a four-phase enrichment pipeline to populate the component database with structured metadata before the AI ever touches it:

  1. Canonical source data. Bulk import from the authoritative API. 33,000 components with popularity rankings.
  2. Community classifications. Functional tags like “accelerator”, “answer”, “tutor”. 18,000 items tagged by the community.
  3. Usage context. Which popular configurations use which components, in what categories.
  4. AI classification. Claude Opus classified the top 10,000 components with strategy tags and role assignments. The important part: Opus received all the data from steps 1-3 as context first. The AI confirmed and synthesized existing data rather than figuring it out from scratch. Cost was $112 for 10,000 items, a one-time investment.

This is the pattern I’d recommend for any AI agent working with infrastructure data. Don’t have the AI figure out your network topology from scratch. Feed it Netbox data, IPAM allocations, and VLAN assignments, then ask it to make decisions within those constraints.
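In code, phase 4 looks less like “ask the model about a card” and more like “hand the model a dossier.” A sketch, with classify_with_llm standing in for whatever model API you call:

import json

def build_classification_prompt(card: dict) -> str:
    dossier = {
        "name": card["name"],                         # phase 1: canonical data
        "popularity_rank": card["rank"],
        "community_tags": card.get("tags", []),       # phase 2: community tags
        "seen_in_categories": card.get("usage", []),  # phase 3: usage context
    }
    return (
        "Given this pre-enriched card data, assign strategy tags and a role. "
        "Confirm or refine the existing tags; do not invent facts.\n"
        + json.dumps(dossier, indent=2)
    )

# tags = classify_with_llm(build_classification_prompt(card))  # hypothetical wrapper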

Compile-Time Validation: commit dry-run for AI Output

This is the NSO concept that made the biggest difference. After the template engine assembles a configuration, a validation pipeline checks every component before the user sees it:

  • Compatibility. Every item must be compatible with the base configuration (like checking a VLAN exists on the switch).
  • Banned list. No prohibited items (like checking for deprecated CLI commands).
  • Slot constraints. Items tagged as “accelerators” actually have that flag in the database (like verifying an interface is actually a routed port before assigning an IP).
  • Budget. No item exceeds the per-item cost cap (like checking license limits).
  • Quality floor. Items below a popularity threshold are rejected unless they come from expert sources (like filtering out EOL hardware from inventory).

Any item that fails validation is rejected and the slot is re-filled from the candidate pool. The user never sees an invalid configuration.
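A sketch of that validation pass, mirroring the checks above. The Card shape and the thresholds are illustrative, not the project's real types:

from dataclasses import dataclass, field

@dataclass
class Card:
    name: str
    colors: set
    price: float
    rank: int
    flags: set = field(default_factory=set)

def validate(card: Card, identity: set, banned: set, slot_flag: str,
             price_cap: float, rank_floor: int) -> list[str]:
    errors = []
    if not card.colors <= identity:
        errors.append("color identity violation")        # VLAN-on-switch check
    if card.name in banned:
        errors.append("banned item")                     # deprecated-CLI check
    if slot_flag and slot_flag not in card.flags:
        errors.append(f"missing required flag {slot_flag}")
    if card.price > price_cap:
        errors.append("over per-item budget")            # license-limit check
    if card.rank > rank_floor and "expert_source" not in card.flags:
        errors.append("below quality floor")             # EOL-hardware filter
    return errors  # empty list passes; otherwise the slot is re-filled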

I also added deterministic CI checks that run on every code change. The system builds three representative configurations and validates all constraints automatically. No manual testing. If a code change breaks configuration quality, CI catches it before merge.
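The CI version is a few lines of pytest. build_deck and deck_errors are hypothetical stand-ins for the project's real entry point and the validation pass above, and the strategy names are just examples:

import pytest

@pytest.mark.parametrize("strategy", ["sacrifice", "voltron", "tokens"])
def test_deck_passes_all_constraints(strategy):
    deck = build_deck(strategy)       # hypothetical entry point
    assert len(deck) == 100           # exactly 100 items, every time
    assert not deck_errors(deck)      # no validation failures allowed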

The Progression

Here’s what the journey looked like, measured by an automated evaluation suite (LLM-as-judge scoring 0-3):

Step | What Changed | Score
Baseline | AI picks all 100 items procedurally | 1.3 / 3.0
+ Template engine + enrichment | Declarative slots, 30K ranked, 18K tagged | 1.2
+ Sub-packages | Strategy-aware structural macros | 1.1
+ Evaluation fix | Infrastructure items exempt from strategy scoring | 2.0
+ Strategy tags source | AI classifications drive selection | 1.8

The scores actually went down before they went up. That’s the most important part of this story.

The template engine and data enrichment were real structural improvements. Deterministic checks started passing (color identity, banned cards, mana curve). But the evaluation judge (an LLM scoring the output 0-3) was penalizing universal infrastructure items for not being “on-theme” with the strategy. That’s like docking points on a network audit because NTP isn’t synergistic with the BGP routing policy. NTP is infrastructure. It’s correct by definition.

Once I separated structural quality from strategy quality in the evaluation (two scoring dimensions instead of one) the improvements showed up immediately. The jump from 1.1 to 2.0 wasn’t a code change. It was fixing the metric.

If your metrics don’t reflect your architecture, you’ll optimize for the wrong things. When I built deterministic structural checks (the equivalent of config compliance tests), I could see the template engine was working perfectly. The LLM judge just wasn’t measuring what mattered.
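The fix itself is small: score the structural slots deterministically and only put the strategy-specific picks in front of the LLM judge. A sketch, with both scoring functions as hypothetical wrappers:

INFRASTRUCTURE_SLOTS = {"ramp", "removal", "card_draw", "lands"}

def evaluate(deck: list[dict]) -> dict:
    structural = [c for c in deck if c["slot"] in INFRASTRUCTURE_SLOTS]
    strategic = [c for c in deck if c["slot"] not in INFRASTRUCTURE_SLOTS]
    return {
        # deterministic: did the template engine do its job?
        "structural_score": run_compliance_checks(structural),  # hypothetical
        # creative: only these items face the "is it on-theme?" question
        "strategy_score": llm_judge(strategic, scale=(0, 3)),   # hypothetical
    }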

What This Means for AI Agents and NetDevOps

The pattern I found building a card game platform applies directly to how we should build AI agents for network automation:

  1. Don’t let the AI generate everything from scratch. Use templates for the structural foundation. The AI adds value on creative, context-specific decisions, not on NTP configuration.

  2. Validate before deploying. Every AI-generated output should go through a commit dry-run equivalent before it reaches production. Compile-time validation catches errors the AI won’t.

  3. Enrich your data first, then use AI. The AI classification step cost $112 for 10,000 items, but only because I fed it structured data from three other sources first. The AI confirmed and synthesized. It didn’t research from scratch. Feed your AI agents Netbox data, not blank prompts.

  4. Separate infrastructure from intent. Structural components are solved problems with deterministic solutions. Custom policies and strategy-specific choices are where AI reasoning adds real value. Don’t waste inference cycles on solved problems.

  5. Build service hierarchies. Each structural category is a self-contained service with its own validation, composing into a complete configuration.

  6. Test your evaluation, not just your output. If your quality metrics penalize correct infrastructure for not being “creative,” you’ll chase the wrong improvements. Separate your scoring dimensions.

If you’re building AI agents for network automation, or any domain where AI generates structured output from an inventory, consider the template-first approach. Let Jinja2 handle the structure. Let the AI handle the creativity. And always validate before you commit.

Related: Model Context Protocol Tutorial: Build Your First Server — the next post in the AI-engineering-for-network-people series.


If you’re scoping AI training for an engineering team, or trying to figure out where templates and validation fit in your own agent architecture, book a 20-minute call. The first conversation is free, and I’ll tell you honestly whether AI is the right fit for what you’re trying to do.
