← All case studies

Knit · May 2025 – April 2026

Agentic Market Research Platform

Raw survey data to verified insights, charts, and consulting-grade PPTX decks.

Role: Senior AI Engineer / principal architect for the India AI team

LLM systems · DAG orchestration · Evals · Sandbox execution · Deck automation

Executive summary

Architected a production agentic research workflow that transformed analyst-heavy reporting into a verified AI execution pipeline.

  • 48-72h → <1h report turnaround
  • 30-50 sandboxed analytics tasks per report
  • 15-25 Highcharts charts per report
  • Multi-provider LLM routing and shared Python agent platform

Problem and constraints

Market research reporting required analysts to process survey data, write insights, generate charts, validate findings, and assemble polished decks. The bottleneck was not text generation alone; the system needed numerical correctness, artifact quality, observability, and recovery boundaries.

  • Insights needed numerical correctness, not fluent guesses.
  • Charts needed to be visually usable and connected to evidence.
  • Reports needed consulting-grade native PowerPoint output.
  • Workflow execution needed parallelism, retries, and traceability.
  • Private prompts, customer data, internal traces, and proprietary implementation details had to stay out of public discussion.

Architecture

01 · Raw survey data
02 · Data ingestion and schema normalization
03 · Task planning
04 · DAG execution
05 · LLM Python code generation
06 · Persistent sandbox execution
07 · Independent judge verification
08 · Insight synthesis
09 · Highcharts chart generation
10 · Visual quality scoring
11 · Deck intermediate representation
12 · HTML preview
13 · Native PPTX export
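
To make steps 02-04 concrete, here is a minimal sketch of how a report could be decomposed into typed analysis tasks with explicit dependencies. The names (TaskType, AnalysisTask) and the task classes are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskType(Enum):
    # Illustrative task classes; the real taxonomy is not public.
    CROSSTAB = "crosstab"
    SEGMENT_COMPARISON = "segment_comparison"
    TREND = "trend"
    CHART_SPEC = "chart_spec"


@dataclass
class AnalysisTask:
    task_id: str
    task_type: TaskType
    question: str                                        # what the task must answer
    depends_on: list[str] = field(default_factory=list)  # upstream task ids


# A tiny plan: the chart task only becomes ready after its crosstab finishes.
plan = [
    AnalysisTask("t1", TaskType.CROSSTAB, "Purchase intent by age band"),
    AnalysisTask("t2", TaskType.CHART_SPEC,
                 "Chart t1 as a grouped column chart", depends_on=["t1"]),
]
```

Typed tasks like these are what the DAG executor, the judge, and the observability layer all key off later in the pipeline.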

Improved system diagrams

Research workflow boundaries

A sanitized view of how raw survey data became verified insights, chart specs, and deck artifacts through explicit boundaries.

01 · Survey data

Ingest and normalize schema.

02 · Task plan

Break report into typed analysis tasks.

03 · Sandbox execution

Generate and run auditable Python.

04 · Independent judge

Recompute and inspect high-risk outputs.

05 · Deck IR

Render charts and slides from inspectable structure.

Sanitized architecture diagram. Customer data, private prompts, internal datasets, and proprietary implementation details omitted.
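
As an illustration of step 05, a deck IR can be plain data that both renderers (HTML preview and native PPTX export) consume. The structure and values below are a hypothetical sketch, not the production IR.

```python
# Hypothetical deck IR: plain dictionaries that both the HTML preview
# and the native PPTX exporter can render from.
slide_ir = {
    "slide_id": "s07",
    "title": "Purchase intent by age band",
    "insight": "Intent is highest in the 25-34 band (62% top-2-box).",
    "evidence_task_id": "t1",        # links the claim back to sandboxed analysis
    "chart": {
        "library": "highcharts",
        "options": {                 # standard Highcharts options object
            "chart": {"type": "column"},
            "xAxis": {"categories": ["18-24", "25-34", "35-44"]},
            "series": [{"name": "Top-2-box intent", "data": [48, 62, 55]}],
        },
    },
}
```

Keeping the IR inspectable is what makes quality scoring and export two separate, testable steps rather than one opaque rendering pass.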

Observability and cost loop

Trace spans, model routes, retry budgets, and normalized cost counters make production AI failures debuggable.

01 · Span

Latency, model, status, and task class.

02 · Route

Select model by risk and value.

03 · Budget

Detect node-level cost anomalies.

04 · Retry

Recover only where useful.

05 · Review

Improve routing and eval policy.

Representative systems diagram. Exact company costs and internal traces omitted.
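
A minimal sketch of the span shape in step 01, using the standard OpenTelemetry Python API. The attribute names and the task/result fields are illustrative assumptions, not the production schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("research-workflow")  # tracer name is illustrative


def run_task_with_span(task, execute):
    # One span per DAG node: latency comes from the span itself,
    # everything else is recorded as attributes for later queries.
    with tracer.start_as_current_span(f"task.{task.task_id}") as span:
        span.set_attribute("task.class", task.task_type.value)
        span.set_attribute("llm.model", task.model)           # chosen route
        result = execute(task)
        span.set_attribute("llm.cost_usd", result.cost_usd)   # normalized cost counter
        span.set_attribute("task.retries", result.retries)
        span.set_attribute("task.status", result.status)
        return result
```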

Decision Theater

Decision fork

Free-form agents vs explicit DAG

The workflow needed parallel execution and reliable recovery, not just autonomous behavior.

Free-form autonomous loop

Pros
  • Fast to prototype
  • Flexible exploration
Cons
  • Hard to debug
  • Hard to parallelize
  • Unclear retry boundaries

Explicit DAG execution

Pros
  • Deterministic dependencies
  • Node-level observability
  • Parallel execution
  • Clear retries
Cons
  • More upfront structure
  • Requires domain modeling

Chosen: Explicit DAG orchestration. Production workflows need predictable execution and debugging more than theatrical autonomy.
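
A minimal sketch of what explicit DAG execution buys in practice, assuming asyncio and the hypothetical AnalysisTask shape sketched earlier: ready nodes run in parallel, and retries are scoped to a single node rather than to an opaque agent loop.

```python
import asyncio


async def run_dag(tasks, run_node, max_attempts=2):
    """Execute tasks level by level: a task becomes ready once all of its
    dependencies have completed. Retries stay scoped to one node."""
    done, results = set(), {}
    pending = {t.task_id: t for t in tasks}
    while pending:
        ready = [t for t in pending.values() if set(t.depends_on) <= done]
        if not ready:
            raise RuntimeError("Cycle or unsatisfiable dependency in task plan")

        async def attempt(task):
            for n in range(1, max_attempts + 1):
                try:
                    return await run_node(task)
                except Exception:
                    if n == max_attempts:
                        raise  # surface the failure at a known node boundary

        outputs = await asyncio.gather(*(attempt(t) for t in ready))
        for task, out in zip(ready, outputs):
            results[task.task_id] = out
            done.add(task.task_id)
            del pending[task.task_id]
    return results
```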

Decision fork

LLM-only insights vs code-backed analysis

Survey analytics cannot rely on plausible natural language when denominators and filters matter.

Prompt the LLM with data summaries

Pros
  • Lower engineering complexity
  • Fast response
Cons
  • Hallucinated metrics
  • Unsupported conclusions
  • Weak audit trail

Generate and execute Python

Pros
  • Evidence-backed outputs
  • Inspectable calculations
  • Better validation hooks
Cons
  • Sandboxing required
  • More latency and orchestration

Chosen: LLM-generated Python with sandbox execution. For business reporting, numerical correctness matters more than generation convenience.
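
A minimal sketch of the code-backed path, assuming the generated script prints a JSON result to stdout. The real sandbox is persistent and far more restrictive, so treat this as the shape of the idea rather than the implementation.

```python
import json
import subprocess
import sys
import tempfile


def run_generated_analysis(code: str, timeout_s: int = 60) -> dict:
    """Execute LLM-generated Python in a separate process and parse its
    JSON output, so every number in an insight traces back to real code."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name

    proc = subprocess.run(
        [sys.executable, "-I", script_path],  # -I: isolated mode, no user site-packages
        capture_output=True, text=True, timeout=timeout_s,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"Generated code failed: {proc.stderr[:500]}")
    return json.loads(proc.stdout)            # e.g. {"metric": 0.62, "n": 1840}
```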

Decision fork

Self-check vs independent judge

A system that verifies itself can still agree with its own mistakes.

Same-model self-check

Pros
  • Cheaper
  • Simple
Cons
  • Self-confirming errors
  • Weak semantic validation

Independent judge

Pros
  • Recomputes evidence
  • Catches silent failures
  • Improves trust
Cons
  • Higher cost
  • More latency

Chosen: Independent judge with separate sandbox execution. Verification is the difference between a demo and a production AI system.
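
A minimal sketch of the judge contract under the same assumptions, reusing the run_generated_analysis helper sketched above: the judge re-derives the claimed numbers through its own sandbox run and only then lets an insight pass. The field names are illustrative.

```python
def judge_insight(insight: dict, judge_code: str, tolerance: float = 0.01) -> bool:
    """Accept an insight only if an independently generated analysis,
    executed in its own sandbox run, reproduces the claimed metrics."""
    recomputed = run_generated_analysis(judge_code)  # separate execution path
    for metric, claimed in insight["metrics"].items():
        check = recomputed.get(metric)
        if check is None:
            return False                             # judge could not reproduce it
        if abs(check - claimed) > tolerance * max(abs(claimed), 1e-9):
            return False                             # silent numerical drift
    return True
```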

Evaluation and reliability

  • Independent judge verification recomputed results in a separate sandbox path.
  • Chart outputs passed multi-threshold quality scoring before deck assembly.
  • Retry semantics were tied to task boundaries rather than vague agent state.
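
As a hedged illustration of the multi-threshold chart gate described above, the dimension names and thresholds below are invented; the point is that every quality dimension must clear its own bar before a chart reaches deck assembly.

```python
# Illustrative gate: each quality dimension has its own minimum score.
QUALITY_THRESHOLDS = {
    "readability": 0.7,         # labels, legends, overlap
    "data_fidelity": 0.9,       # chart values match the sandbox output
    "insight_alignment": 0.8,   # chart actually supports the written claim
}


def chart_passes_gate(scores: dict[str, float]) -> bool:
    return all(scores.get(dim, 0.0) >= floor
               for dim, floor in QUALITY_THRESHOLDS.items())
```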

Observability and debugging

  • OpenTelemetry and Langfuse made model calls, spans, failures, and cost inspectable.
  • Task-level traces exposed latency, retries, and model routing behavior.
  • Generated APIs and SSE streaming made execution state visible to product surfaces.
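
A minimal sketch of surfacing execution state over SSE, assuming FastAPI and an async queue of task events fed by the DAG executor; the endpoint path and event fields are illustrative.

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
events: asyncio.Queue = asyncio.Queue()   # filled by the DAG executor


@app.get("/reports/{report_id}/events")
async def stream_events(report_id: str):
    async def event_stream():
        while True:
            event = await events.get()    # e.g. {"task_id": "t1", "status": "done"}
            yield f"data: {json.dumps(event)}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```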

Reflection

The durable lesson is that production AI systems are less about an agent loop and more about explicit boundaries: typed inputs, executable artifacts, independent verification, observability, and unit economics.

This case study uses sanitized architecture and representative examples. It excludes confidential prompts, customer data, proprietary datasets, private implementation details, and internal traces.