← All case studies

Knit · May 2025 – April 2026

Agentic Market Research Platform

Raw survey data to verified insights, charts, and consulting-grade PPTX decks.

Role: Senior AI Engineer / principal architect for the India AI team

LLM systems · DAG orchestration · Evals · Sandbox execution · Deck automation

Executive summary

Architected a production agentic research workflow that transformed analyst-heavy reporting into a verified AI execution pipeline.

  • 48-72h → <1h report turnaround
  • 30-50 sandboxed analytics tasks per report
  • 15-25 Highcharts charts per report
  • Multi-provider LLM routing and shared Python agent platform

Problem and constraints

Market research reporting required analysts to process survey data, write insights, generate charts, validate findings, and assemble polished decks. The bottleneck was not text generation alone; the system needed numerical correctness, artifact quality, observability, and recovery boundaries.

  • Insights needed numerical correctness, not fluent guesses.
  • Charts needed to be visually usable and connected to evidence.
  • Reports needed consulting-grade native PowerPoint output.
  • Workflow execution needed parallelism, retries, and traceability.
  • Private prompts, customer data, internal traces, and proprietary implementation details had to stay out of public discussion.

Architecture

01 · Raw survey data
02 · Data ingestion and schema normalization
03 · Task planning
04 · DAG execution
05 · LLM Python code generation
06 · Persistent sandbox execution
07 · Independent judge verification
08 · Insight synthesis
09 · Highcharts chart generation
10 · Visual quality scoring
11 · Deck intermediate representation
12 · HTML preview
13 · Native PPTX export
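
To make steps 02-04 concrete, here is a minimal sketch of how a report could be decomposed into typed analysis tasks with explicit dependencies. The names (TaskType, AnalysisTask) and the task classes are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskType(Enum):
    # Illustrative task classes; the real taxonomy is not public.
    CROSSTAB = "crosstab"
    SEGMENT_COMPARISON = "segment_comparison"
    TREND = "trend"
    CHART_SPEC = "chart_spec"


@dataclass
class AnalysisTask:
    task_id: str
    task_type: TaskType
    question: str                                        # what the task must answer
    depends_on: list[str] = field(default_factory=list)  # upstream task ids


# A tiny plan: the chart task only becomes ready after its crosstab finishes.
plan = [
    AnalysisTask("t1", TaskType.CROSSTAB, "Purchase intent by age band"),
    AnalysisTask("t2", TaskType.CHART_SPEC,
                 "Chart t1 as a grouped column chart", depends_on=["t1"]),
]
```

Typed tasks like these are what the DAG executor, the judge, and the observability layer all key off later in the pipeline.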

Improved system diagrams

Research workflow boundaries

A sanitized view of how raw survey data became verified insights, chart specs, and deck artifacts through explicit boundaries.

01 · Survey data

Ingest and normalize schema.

02 · Task plan

Break report into typed analysis tasks.

03 · Sandbox execution

Generate and run auditable Python.

04 · Independent judge

Recompute and inspect high-risk outputs.

05 · Deck IR

Render charts and slides from inspectable structure.

Sanitized architecture diagram. Customer data, private prompts, internal datasets, and proprietary implementation details omitted.
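
As an illustration of step 05, a deck IR can be plain data that both renderers (HTML preview and native PPTX export) consume. The structure and values below are a hypothetical sketch, not the production IR.

```python
# Hypothetical deck IR: plain dictionaries that both the HTML preview
# and the native PPTX exporter can render from.
slide_ir = {
    "slide_id": "s07",
    "title": "Purchase intent by age band",
    "insight": "Intent is highest in the 25-34 band (62% top-2-box).",
    "evidence_task_id": "t1",        # links the claim back to sandboxed analysis
    "chart": {
        "library": "highcharts",
        "options": {                 # standard Highcharts options object
            "chart": {"type": "column"},
            "xAxis": {"categories": ["18-24", "25-34", "35-44"]},
            "series": [{"name": "Top-2-box intent", "data": [48, 62, 55]}],
        },
    },
}
```

Keeping the IR inspectable is what makes quality scoring and export two separate, testable steps rather than one opaque rendering pass.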

Observability and cost loop

Trace spans, model routes, retry budgets, and normalized cost counters make production AI failures debuggable.

01 · Span

Latency, model, status, and task class.

02 · Route

Select model by risk and value.

03 · Budget

Detect node-level cost anomalies.

04 · Retry

Recover only where useful.

05 · Review

Improve routing and eval policy.

Representative systems diagram. Exact company costs and internal traces omitted.
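
A minimal sketch of the span shape in step 01, using the standard OpenTelemetry Python API. The attribute names and the task/result fields are illustrative assumptions, not the production schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("research-workflow")  # tracer name is illustrative


def run_task_with_span(task, execute):
    # One span per DAG node: latency comes from the span itself,
    # everything else is recorded as attributes for later queries.
    with tracer.start_as_current_span(f"task.{task.task_id}") as span:
        span.set_attribute("task.class", task.task_type.value)
        span.set_attribute("llm.model", task.model)           # chosen route
        result = execute(task)
        span.set_attribute("llm.cost_usd", result.cost_usd)   # normalized cost counter
        span.set_attribute("task.retries", result.retries)
        span.set_attribute("task.status", result.status)
        return result
```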

Decision Theater

Decision fork

Free-form agents vs explicit DAG

The workflow needed parallel execution and reliable recovery, not just autonomous behavior.

Free-form autonomous loop

Pros
  • Fast to prototype
  • Flexible exploration
Cons
  • Hard to debug
  • Hard to parallelize
  • Unclear retry boundaries

Explicit DAG execution

Pros
  • Deterministic dependencies
  • Node-level observability
  • Parallel execution
  • Clear retries
Cons
  • More upfront structure
  • Requires domain modeling

Chosen: Explicit DAG orchestration. Production workflows need predictable execution and debugging more than theatrical autonomy.
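
A minimal sketch of what explicit DAG execution buys in practice, assuming asyncio and the hypothetical AnalysisTask shape sketched earlier: ready nodes run in parallel, and retries are scoped to a single node rather than to an opaque agent loop.

```python
import asyncio


async def run_dag(tasks, run_node, max_attempts=2):
    """Execute tasks level by level: a task becomes ready once all of its
    dependencies have completed. Retries stay scoped to one node."""
    done, results = set(), {}
    pending = {t.task_id: t for t in tasks}
    while pending:
        ready = [t for t in pending.values() if set(t.depends_on) <= done]
        if not ready:
            raise RuntimeError("Cycle or unsatisfiable dependency in task plan")

        async def attempt(task):
            for n in range(1, max_attempts + 1):
                try:
                    return await run_node(task)
                except Exception:
                    if n == max_attempts:
                        raise  # surface the failure at a known node boundary

        outputs = await asyncio.gather(*(attempt(t) for t in ready))
        for task, out in zip(ready, outputs):
            results[task.task_id] = out
            done.add(task.task_id)
            del pending[task.task_id]
    return results
```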

Decision fork

LLM-only insights vs code-backed analysis

Survey analytics cannot rely on plausible natural language when denominators and filters matter.

Prompt the LLM with data summaries

Pros
  • Lower engineering complexity
  • Fast response
Cons
  • Hallucinated metrics
  • Unsupported conclusions
  • Weak audit trail

Generate and execute Python

Pros
  • Evidence-backed outputs
  • Inspectable calculations
  • Better validation hooks
Cons
  • Sandboxing required
  • More latency and orchestration

Chosen: LLM-generated Python with sandbox execution. For business reporting, numerical correctness matters more than generation convenience.
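
A minimal sketch of the code-backed path, assuming the generated script prints a JSON result to stdout. The real sandbox is persistent and far more restrictive, so treat this as the shape of the idea rather than the implementation.

```python
import json
import subprocess
import sys
import tempfile


def run_generated_analysis(code: str, timeout_s: int = 60) -> dict:
    """Execute LLM-generated Python in a separate process and parse its
    JSON output, so every number in an insight traces back to real code."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name

    proc = subprocess.run(
        [sys.executable, "-I", script_path],  # -I: isolated mode, no user site-packages
        capture_output=True, text=True, timeout=timeout_s,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"Generated code failed: {proc.stderr[:500]}")
    return json.loads(proc.stdout)            # e.g. {"metric": 0.62, "n": 1840}
```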

Decision fork

Self-check vs independent judge

A system that verifies itself can still agree with its own mistakes.

Same-model self-check

Pros
  • Cheaper
  • Simple
Cons
  • Self-confirming errors
  • Weak semantic validation

Independent judge

Pros
  • Recomputes evidence
  • Catches silent failures
  • Improves trust
Cons
  • Higher cost
  • More latency

Chosen: Independent judge with separate sandbox execution. Verification is the difference between a demo and a production AI system.
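
A minimal sketch of the judge contract under the same assumptions, reusing the run_generated_analysis helper sketched above: the judge re-derives the claimed numbers through its own sandbox run and only then lets an insight pass. The field names are illustrative.

```python
def judge_insight(insight: dict, judge_code: str, tolerance: float = 0.01) -> bool:
    """Accept an insight only if an independently generated analysis,
    executed in its own sandbox run, reproduces the claimed metrics."""
    recomputed = run_generated_analysis(judge_code)  # separate execution path
    for metric, claimed in insight["metrics"].items():
        check = recomputed.get(metric)
        if check is None:
            return False                             # judge could not reproduce it
        if abs(check - claimed) > tolerance * max(abs(claimed), 1e-9):
            return False                             # silent numerical drift
    return True
```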

Evaluation and reliability

  • Independent judge verification recomputed results in a separate sandbox path.
  • Chart outputs passed multi-threshold quality scoring before deck assembly.
  • Retry semantics were tied to task boundaries rather than vague agent state.
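
As a hedged illustration of the multi-threshold chart gate described above, the dimension names and thresholds below are invented; the point is that every quality dimension must clear its own bar before a chart reaches deck assembly.

```python
# Illustrative gate: each quality dimension has its own minimum score.
QUALITY_THRESHOLDS = {
    "readability": 0.7,         # labels, legends, overlap
    "data_fidelity": 0.9,       # chart values match the sandbox output
    "insight_alignment": 0.8,   # chart actually supports the written claim
}


def chart_passes_gate(scores: dict[str, float]) -> bool:
    return all(scores.get(dim, 0.0) >= floor
               for dim, floor in QUALITY_THRESHOLDS.items())
```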

Observability and debugging

  • OpenTelemetry and Langfuse made model calls, spans, failures, and cost inspectable.
  • Task-level traces exposed latency, retries, and model routing behavior.
  • Generated APIs and SSE streaming made execution state visible to product surfaces.
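
A minimal sketch of surfacing execution state over SSE, assuming FastAPI and an async queue of task events fed by the DAG executor; the endpoint path and event fields are illustrative.

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
events: asyncio.Queue = asyncio.Queue()   # filled by the DAG executor


@app.get("/reports/{report_id}/events")
async def stream_events(report_id: str):
    async def event_stream():
        while True:
            event = await events.get()    # e.g. {"task_id": "t1", "status": "done"}
            yield f"data: {json.dumps(event)}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```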

Reflection

The durable lesson is that production AI systems are less about an agent loop and more about explicit boundaries: typed inputs, executable artifacts, independent verification, observability, and unit economics.

This case study uses sanitized architecture and representative examples. It excludes confidential prompts, customer data, proprietary datasets, private implementation details, and internal traces.