Will Leeney
AI Research Engineer
Building an Eval-Driven AI Feature at StackOne
StackOne connects to hundreds of providers through one API. When errors occur — expired keys, missing scopes, unsupported actions — the system surfaces raw provider errors. These errors reference provider-specific concepts that don’t help users in the StackOne dashboard.
As an AI Engineer with a PhD studying evaluation best practices, I spent my first month building an AI agent to translate provider errors into clear resolution steps. The key approach: starting with evaluations from day one.
Start with the Problem, Not the Solution
Before coding, I interviewed Bryce, a solutions engineer, to understand the manual error resolution process. Users encounter errors in three locations: account linking, account status, and connection logs. Solutions are documented across various guides.
The goal was straightforward: automatically search documentation and generate resolution steps. I resisted building a complex RAG system with time-series analysis. Instead, I sketched the simplest possible flow and committed to shipping first and improving later.
Build Tools, Test Early
At its core, the system uses Claude with custom tools for searching documentation. A tool registration looks like this:
```json
{
  "name": "grab_error_code_guide",
  "description": "Fetch unified API error codes from StackOne",
  "input_schema": {
    "type": "object",
    "properties": {},
    "required": []
  }
}
```
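The post doesn't show the surrounding agent loop, so here's a minimal sketch of how a schema like this could be wired up with the Anthropic Python SDK. The model name, the placeholder tool body, and the `resolve` function are assumptions for illustration, not StackOne's actual code.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [{
    "name": "grab_error_code_guide",
    "description": "Fetch unified API error codes from StackOne",
    "input_schema": {"type": "object", "properties": {}, "required": []},
}]

def grab_error_code_guide() -> str:
    # Placeholder body: the real tool fetches the unified error code reference.
    return "401 UNAUTHORIZED: the provider API key is invalid or has expired..."

def resolve(error_log: str) -> str:
    messages = [{"role": "user", "content": f"Suggest resolution steps for:\n{error_log}"}]
    while True:
        response = client.messages.create(
            model="claude-3-5-sonnet-latest",  # assumed model; the post only says "Claude"
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # No more tool calls: return the model's final text answer.
            return "".join(block.text for block in response.content if block.type == "text")
        # Run each requested tool and hand the results back to the model.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {"type": "tool_result", "tool_use_id": block.id, "content": grab_error_code_guide()}
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
```

The loop keeps calling the model until it stops asking for tools, which is the standard tool-use pattern rather than anything StackOne-specific.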
A significant efficiency gain came from exposing all documentation as plain .txt and .md files at docs.stackone.com/llms.txt. This eliminated scraping and complex parsing work.
I built five tools:
- grab_error_code_guide — Fetches the error code reference
- grab_troubleshooting_guide — Fetches troubleshooting documentation
- search_stackone_docs — Searches all documentation
- search_provider_guides — Searches provider-specific documentation
- search_docs — Vector search using the Turbopuffer database
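The post doesn't include the tool bodies. As a rough sketch, one of the search tools could be as simple as a keyword filter over the plain-text index mentioned above; the llms.txt layout assumed here (one title and URL per line) is the common convention, not a documented StackOne contract.

```python
import requests

DOCS_INDEX_URL = "https://docs.stackone.com/llms.txt"

def search_stackone_docs(query: str, max_results: int = 5) -> str:
    """Naive keyword search over the plain-text docs index (illustrative only)."""
    index = requests.get(DOCS_INDEX_URL, timeout=10).text
    terms = query.lower().split()
    # Keep lines that mention every search term; good enough for a first version.
    hits = [
        line for line in index.splitlines()
        if line.strip() and all(term in line.lower() for term in terms)
    ]
    return "\n".join(hits[:max_results]) or "No matching documents found."
```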
Extracting context from messy logs proved challenging. The fix was structured outputs: a Pydantic model paired with gpt-4o-mini pulls out exactly the fields needed to resolve each error.
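Here's roughly what that extraction step can look like using the OpenAI SDK's structured-output parse helper. The field names on the model are illustrative; the post doesn't list the actual schema.

```python
from openai import OpenAI
from pydantic import BaseModel, Field

class ErrorContext(BaseModel):
    # Hypothetical fields: the post doesn't spell out the real schema.
    provider: str = Field(description="Provider the error came from")
    error_code: str = Field(description="HTTP or provider-specific error code")
    failed_action: str = Field(description="The API action that failed")
    summary: str = Field(description="One-sentence plain-English description of the failure")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_context(raw_log: str) -> ErrorContext:
    """Map a messy provider log onto the Pydantic schema via gpt-4o-mini."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the error context from this provider log."},
            {"role": "user", "content": raw_log},
        ],
        response_format=ErrorContext,
    )
    return completion.choices[0].message.parsed
```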
Evals Drive Everything
This is where most AI features fail: they ship with no way to verify that they actually work.
The approach taken:
- Connected the Lambda function to LangSmith for full observability
- Created real errors in the development environment
- Ran them through the resolution agent
- Manually corrected outputs in LangSmith to create a golden dataset
- Used this as an evaluation benchmark
Every prompt change could be measured against this benchmark, eliminating guesswork about whether modifications helped or hurt performance.
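Concretely, the measurement loop can be a small script against the LangSmith SDK: build the golden dataset once, then score every prompt change against it. The dataset name, example content, and evaluator below are invented, the exact imports depend on your SDK version, and `resolve()` is the agent sketch from earlier.

```python
from langsmith import Client, evaluate

client = Client()  # reads LANGSMITH_API_KEY from the environment

# One-off: create the golden dataset from manually corrected runs (names are illustrative).
dataset = client.create_dataset(dataset_name="error-resolutions-golden")
client.create_example(
    inputs={"error_log": "401 invalid_grant returned by provider OAuth endpoint"},
    outputs={"resolution": "Re-link the account; the provider refresh token has expired."},
    dataset_id=dataset.id,
)

def matches_golden(run, example) -> dict:
    """Crude evaluator: does the agent's answer contain the golden fix's first clause?"""
    produced = run.outputs["resolution"].lower()
    expected = example.outputs["resolution"].lower().split(";")[0]
    return {"key": "matches_golden", "score": float(expected in produced)}

# Run on every prompt change to see whether the modification helped or hurt.
evaluate(
    lambda inputs: {"resolution": resolve(inputs["error_log"])},
    data="error-resolutions-golden",
    evaluators=[matches_golden],
)
```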
Ship with Feedback Loops
We knew that once the feature launched, customers could rate the quality of each resolution, and that this real-world signal could feed straight back into the eval dataset.
The feedback loop works like this:
- Each resolution receives a unique LangSmith trace ID
- Users rate resolution quality
- Feedback flows back to LangSmith, expanding the eval dataset
- Real-world usage continuously improves the model
This creates a virtuous cycle: ship, collect feedback, improve evals, enhance prompt, ship better version.
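In code, closing that loop can be as small as attaching the rating to the trace that produced the resolution. A sketch with the LangSmith client; the feedback key is our own naming convention, not something LangSmith prescribes.

```python
from typing import Optional

from langsmith import Client

client = Client()

def record_resolution_feedback(trace_id: str, helpful: bool, comment: Optional[str] = None) -> None:
    """Attach a user's rating to the LangSmith run that generated the resolution."""
    client.create_feedback(
        run_id=trace_id,              # the trace ID returned with each resolution
        key="resolution_helpful",     # our own key name, not a LangSmith requirement
        score=1.0 if helpful else 0.0,
        comment=comment,
    )
```

Low-scoring runs are the natural candidates to review, correct, and promote into the golden dataset, which is what turns raw feedback into better evals.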
The Technical Stack
- Lambda — Hosts the resolution agent
- LangSmith — Tracks generations and builds eval datasets
- Logfire — Monitors LLM performance
- DataDog — Tracks execution metrics
- PostHog — Measures feature usage
- CDK — Manages infrastructure as code
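Since everything is defined with CDK, the agent itself is just a Lambda construct in the stack. A minimal CDK v2 sketch in Python; the construct IDs, asset path, and handler name are invented.

```python
from aws_cdk import Duration, Stack, aws_lambda as _lambda
from constructs import Construct

class ResolutionAgentStack(Stack):
    """Illustrative stack: a single Lambda hosting the resolution agent."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        _lambda.Function(
            self,
            "ResolutionAgent",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="handler.lambda_handler",        # assumed module and function name
            code=_lambda.Code.from_asset("lambda"),  # assumed source directory
            timeout=Duration.seconds(60),
            memory_size=512,
        )
```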
The frontend integration was straightforward: the agent sits behind a single endpoint, /ai/resolutions, that accepts error logs and returns resolution steps.
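On the Lambda side, the handler behind that endpoint can stay very thin, chaining the extraction and resolution sketches from earlier. The request and response shapes below are assumptions, not StackOne's actual contract.

```python
import json

def lambda_handler(event, context):
    """POST /ai/resolutions: error log in, resolution steps out (sketch only)."""
    body = json.loads(event.get("body") or "{}")
    raw_log = body.get("error_log", "")

    error_context = extract_context(raw_log)                # structured fields via gpt-4o-mini
    resolution = resolve(error_context.model_dump_json())   # Claude agent with the doc tools

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"resolution": resolution}),
    }
```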
Lessons Learned
- Start with evals, not features — Without an evaluation you have no way to tell whether a change helps or hurts
- Ship the simplest version first — The initial RAG design would have required months; the basic version shipped in weeks
- Real usage beats synthetic data — Customer feedback produces better evals than hand-written test data
- Tools are just functions — LLM tool design shouldn’t be overcomplicated; they’re Python functions with JSON schemas
The feature progressed from concept to production in one month, not through rushing but by focusing on what mattered: knowing whether it actually worked. Evaluations aren’t afterthoughts in AI development — they’re foundational.