
How to Structure a SaaS Project So AI Doesn’t Break It

4/10/2026
11 min read

2026 guide to AI SaaS architecture: robust SaaS folder structure, contracts, routing, evals, CI/CD, and cost controls so LLMs won't break your production app.

Why AI Breaks SaaS Architecture (If You Let It)

AI turns neat, deterministic code paths into probabilistic workflows. Models change, prompts drift, providers throttle, and outputs vary between calls. When those forces leak into your core domain, production reliability, costs, and iteration speed all suffer. The antidote is a deliberate AI SaaS architecture: a clean, enforceable boundary where AI lives at the edge and your business logic remains deterministic.

This article lays out a 2026-ready blueprint: a practical SaaS folder structure, clear interfaces for LLMs and tools, evaluation gates in CI/CD, and rollout controls. If you’re using AI coding tools daily, this is the AI coding architecture that keeps velocity high without turning your repo into spaghetti.

Core Principles: Deterministic Core, Probabilistic Edge

  • Keep business rules deterministic. Billing, permissions, quotas, and SLAs must never depend on an LLM’s mood.
  • Push AI to the edges behind interfaces. Controllers call application services; services use AI adapters—never raw provider SDKs or inline prompts.
  • Strict contracts at the AI boundary. Validate and version inputs/outputs. Treat AI as untrusted I/O.
  • Measure everything. Emit structured telemetry per call: provider, model, latency, tokens, cost, cache hit, evaluation score.
  • Idempotency over cleverness. Retries, timeouts, and fallbacks are normal—make steps safe to repeat.
  • Always have a plan B. Implement fallbacks and kill-switches at feature and tenant levels.

These principles underpin a repeatable AI SaaS architecture and make it safe to adopt faster models or new prompts without rewriting your core.

A Reference SaaS Folder Structure (Monorepo-Friendly)

A good SaaS folder structure makes the right path the easiest path, especially when AI coding tools are assisting. The example below is language-agnostic; TypeScript names are for clarity.

/                         # repo root
├─ apps/
│  ├─ web/                # Next.js/Remix client + server routes
│  ├─ worker/             # queues, schedulers, long-running jobs
│  └─ admin/              # ops dashboards, eval & cost views
├─ packages/
│  ├─ domain/             # pure business rules (deterministic)
│  ├─ app/                # use-cases, orchestrators, DTOs
│  ├─ infra/              # db, cache, queues, search, email
│  ├─ ai/                 # AI adapters, prompts, tools, evaluators
│  │  ├─ providers/       # openai.ts, anthropic.ts, bedrock.ts …
│  │  ├─ adapters/        # LLMProvider interface, retry, tracing
│  │  ├─ prompts/         # prompt packs with tests + fixtures
│  │  ├─ tools/           # retrieval, function-calls, tool schemas
│  │  ├─ evaluators/      # offline + CI evaluators (golden sets)
│  │  └─ experiments/     # A/B configs, flags, notebooks
│  ├─ interfaces/         # http controllers, graphql/resolvers
│  ├─ shared/             # util, types, telemetry, feature-flags
│  └─ configs/            # env config, model routing rules
├─ data/
│  ├─ seeds/              # fixture data
│  └─ golden/             # truth sets for AI evaluation
├─ scripts/               # migrations, re-embedding, utilities
├─ .github/               # CI: tests, eval gates, lint, cost checks
└─ docs/                  # runbooks, ADRs, architectural maps

Why this SaaS folder structure works:

  • domain never depends on ai, infra, or web. It’s the unbreakable core.
  • app orchestrates domain + ai through explicit interfaces, so you can swap models/providers—or disable AI—without rewiring the product.
  • ai is first-class with prompts, providers, and evaluators versioned together and tested in CI.
  • data/golden provides reproducible checks to catch drift early.

Polyrepo mapping: if you split repos, keep the same boundaries. For instance, “ai” becomes its own package with versioned releases, and “domain” remains a dependency-free library used by web and workers. Enforce imports with lint rules or build-level module boundaries (e.g., TypeScript project references, Bazel visibility rules, or Nx module-boundary constraints).

Conventions that support AI coding architecture:

  • One index.ts per package boundary; forbid deep imports across packages.
  • CODEOWNERS: ai/* owned by the applied-ML team; domain/* by the core backend team.
  • ESLint rules that reject provider SDK imports outside packages/ai.
  • ADRs documenting any cross-boundary exception with a rollback path.
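
One way to enforce the SDK-import convention is ESLint's built-in no-restricted-imports rule, scoped by path in a flat config. The sketch below is an assumption-laden starting point, not a drop-in file: it assumes ESLint flat config, assumes the provider SDKs are the `openai`, `@anthropic-ai/sdk`, and `@aws-sdk/client-bedrock-runtime` npm packages, and uses a hypothetical `@acme/*` workspace scope. Linting TypeScript files also needs a TS-aware parser, which is omitted here.

```typescript
// eslint.config.ts (sketch). SDK package names and the @acme scope are assumptions.
type FlatConfigEntry = {
  files?: string[];
  ignores?: string[];
  rules?: Record<string, unknown>;
};

const providerSdks = ["openai", "@anthropic-ai/sdk", "@aws-sdk/client-bedrock-runtime"];

export const config: FlatConfigEntry[] = [
  {
    // Applies everywhere except packages/ai, the only place SDKs may be imported.
    files: ["apps/**/*.ts", "packages/**/*.ts"],
    ignores: ["packages/ai/**"],
    rules: {
      "no-restricted-imports": [
        "error",
        {
          paths: providerSdks.map((name) => ({
            name,
            message: "Import provider SDKs only inside packages/ai.",
          })),
          patterns: [
            {
              // Forbid deep imports across package boundaries (hypothetical scope).
              group: ["@acme/*/*"],
              message: "Import from the package index, not internal files.",
            },
          ],
        },
      ],
    },
  },
];

export default config;
```

Note that `paths` entries match exact module names only; SDK subpath imports would need additional `patterns` entries.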

The AI Boundary: Contracts, Adapters, and Model Routing

Define a minimal, stable interface that any LLM must satisfy. Keep provider quirks out of business code.

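// packages/ai/adapters/LLMProvider.ts

// Minimal tool types so the interface compiles on its own.
// Their exact shapes are assumptions; adapt them to your tool layer.
export interface ToolSchema {
  name: string;
  description?: string;
  parameters: Record<string, unknown>; // JSON-Schema-like definition
}
export interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

export interface LLMProvider {
  name: string; // e.g., "openai:gpt-4.1-mini"
  generate(input: {
    prompt: string;
    tools?: ToolSchema[];
    system?: string;
    temperature?: number;
    maxTokens?: number;
    metadata?: Record<string, string>;
  }): Promise<{
    text: string;
    toolCalls?: ToolCall[];
    tokens: { input: number; output: number };
    costUSD: number;
    latencyMs: number;
    model: string; // the model that actually served the call
  }>;
}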

Routing policy example:

// packages/configs/model-routing.ts
type RoutingInput = { tier: 'free'|'pro'|'enterprise'; region: 'us'|'eu'; purpose: 'summarize'|'extract'|'agent' };

// Returns a "provider:model" id; agent workloads are routed by region first.
export const policy = ({ tier, region, purpose }: RoutingInput): string => {
  if (purpose === 'agent') return region === 'eu' ? 'bedrock:cohere-command-r' : 'openai:gpt-4.1-mini';
  if (tier === 'free') return 'openai:gpt-4o-mini';
  return 'openai:gpt-4.1-mini';
};

Adapter responsibilities in a robust AI SaaS architecture:

  • Retries with jitter, backoff, and circuit breaking per provider.
  • Strict schema validation for tool calls and JSON outputs.
  • Token budgeting and truncation with visibility into losses.
  • Observability hooks: span ids, request ids, cache keys, evaluation ids.
  • PII redaction and regional routing before any network call.
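
To make the first responsibility concrete, here is a minimal sketch of full-jitter exponential backoff plus a per-provider circuit breaker. The names `backoffDelayMs` and `CircuitBreaker`, the default thresholds, and the injectable clock and random source are all assumptions made for illustration; a production adapter would also distinguish retryable errors from non-retryable ones.

```typescript
// Full-jitter exponential backoff: delay is uniform in [0, min(cap, base * 2^attempt)).
export function backoffDelayMs(
  attempt: number, // 0-based retry attempt
  baseMs = 200,
  capMs = 5000,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

type BreakerState = "closed" | "open" | "half-open";

// Opens after N consecutive failures, then lets trial calls through
// once the cool-down has elapsed (half-open) until a result is recorded.
export class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly coolDownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 3, coolDownMs = 10000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.coolDownMs = coolDownMs;
    this.now = now;
  }

  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.coolDownMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe in half-open state re-opens the breaker immediately.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

An adapter would check `canRequest()` before each provider call, wait `backoffDelayMs(attempt)` between retries, and route to a fallback provider while the breaker is open.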

Prompt Packs With Tests (Stop Silent Drift)

Prompts are code. Store them alongside fixtures and schemas, and test them like business logic.

/packages/ai/prompts/
  ├─ classify-intent/
  │  ├─ prompt.md
  │  ├─ schema.ts       # zod/yup schema for structured output
  │  ├─ cases/
  │  │  ├─ billing-refund.in.json
  │  │  └─ vague-question.in.json
  │  └─ expected/
  │     ├─ billing-refund.out.json
  │     └─ vague-question.out.json
  └─ summarize-ticket/

Example minimal schema assertion:

// schema.ts
import { z } from 'zod';
export const Intent = z.object({
  intent: z.enum(['billing','support','sales','unknown']),
  confidence: z.number().min(0).max(1)
});
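
At the AI boundary, that schema is applied to raw model text before anything reaches application code. With zod this is typically `Intent.safeParse(JSON.parse(raw))`; the sketch below does the same checks by hand so it runs without dependencies. The `parseIntent` name and the result shape are assumptions for illustration.

```typescript
type IntentLabel = "billing" | "support" | "sales" | "unknown";
interface IntentResult {
  intent: IntentLabel;
  confidence: number;
}
type Parsed = { ok: true; value: IntentResult } | { ok: false; error: string };

const LABELS: readonly string[] = ["billing", "support", "sales", "unknown"];

// Treats model output as untrusted input: never throws, always returns a tagged result.
export function parseIntent(raw: string): Parsed {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "expected a JSON object" };
  }
  const { intent, confidence } = data as Record<string, unknown>;
  if (typeof intent !== "string" || LABELS.indexOf(intent) === -1) {
    return { ok: false, error: "intent is not an allowed label" };
  }
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    return { ok: false, error: "confidence must be a number in [0, 1]" };
  }
  return { ok: true, value: { intent: intent as IntentLabel, confidence } };
}
```

Returning a tagged result instead of throwing lets the calling service decide between a retry, a fallback label, or a human handoff.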

CI practices:

  • Run golden-set evals on PRs and nightly. Fail on accuracy or calibration regressions.
  • Post diffs of changed outputs to the PR with cost/latency deltas.
  • Version prompts (prompt@semver) and record the version in logs and A/B flags.

Data, State, and Retrieval: Avoid the Blob Anti-Pattern

LLM features attract semi-structured blobs. Keep your source of truth clean.

  • Structured store first. Entities (users, tickets, invoices) live in Postgres. Emit domain events for downstream AI tasks.
  • Vector store for retrieval only. Namespace by tenant + purpose (support-kb, product-docs, user-threads). Version and store the embedding model with each vector (e.g., text-embedding-3-large@v2).
  • Deterministic ingestion. Place loaders and cleaners in packages/ai/tools/ingestion; never index raw HTML or JSON.
  • PII guardrails. Centralize redaction/tagging by region and tenant policy before any embedding or provider call.
  • Tiered caching:
    1. Deterministic cache for exact-input prompts.
    2. Semantic cache keyed by embedding similarity with safe thresholds.
    3. Business-result cache (e.g., normalized JSON) with TTL and invalidation hooks.
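
The first tier, the deterministic cache, only works if equal inputs always produce the same key. A minimal sketch, assuming an in-memory store and a key built from tenant, prompt version, model, and input; `cacheKey`, `TtlCache`, and the field names are hypothetical, and a real deployment would likely hash the key (for example with SHA-256) and use a shared cache.

```typescript
// Canonical JSON: object keys are sorted so equal inputs serialize identically.
function canonicalize(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) return "[" + value.map(canonicalize).join(",") + "]";
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .filter((k) => obj[k] !== undefined)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
    return "{" + body.join(",") + "}";
  }
  return JSON.stringify(value);
}

export interface CacheKeyParts {
  tenantId: string;
  promptVersion: string; // e.g. "classify-intent@1.2.0" (hypothetical)
  model: string;
  input: unknown;
}

// Tenant, prompt version, and model are part of the key so a change to any of them is a miss.
export function cacheKey(parts: CacheKeyParts): string {
  return canonicalize(parts);
}

export class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```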

Event example decoupling AI from writes:

{
  "type": "ticket.created",
  "id": "evt_01H...",
  "occurred_at": "2026-04-10T14:32:00Z",
  "tenant_id": "t_123",
  "ticket_id": "tk_999",
  "summary": "Customer cannot update credit card",
  "pii_tags": ["email"],
  "region": "us"
}

Workers consume events to create summaries, embeddings, and suggestions off the hot path.
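
Because queues redeliver, the consuming worker should be idempotent. A minimal sketch of a handler that de-duplicates on the event id; the `TicketCreatedHandler` class, the `TriageJob` shape, and the in-memory `Set` are assumptions, and a real worker would back the de-duplication with a durable store.

```typescript
// Event shape mirrors the JSON example above (only the fields used here).
interface TicketCreated {
  type: "ticket.created";
  id: string;
  tenant_id: string;
  ticket_id: string;
}

interface TriageJob {
  kind: "triage";
  tenantId: string;
  ticketId: string;
}

export class TicketCreatedHandler {
  // In-memory stand-in: production needs durable storage, e.g. a unique key on event id.
  private readonly seen = new Set<string>();
  private readonly enqueue: (job: TriageJob) => void;

  constructor(enqueue: (job: TriageJob) => void) {
    this.enqueue = enqueue;
  }

  // Safe to call repeatedly: redelivered events are acknowledged but not re-queued.
  handle(event: TicketCreated): "queued" | "duplicate" {
    if (this.seen.has(event.id)) return "duplicate";
    this.enqueue({ kind: "triage", tenantId: event.tenant_id, ticketId: event.ticket_id });
    this.seen.add(event.id); // mark only after the enqueue succeeded
    return "queued";
  }
}
```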

Testing, Evaluation, and SLOs: Red, Green, Score

Traditional tests ensure determinism. Evaluations ensure probabilistic quality. You need both.

  • Unit tests: 100% coverage on critical domain paths, with LLMProvider mocked.
  • Contract tests: validate provider responses and tool-call structures.
  • Golden set evals: accuracy, extraction F1, refusal precision/recall, toxicity, latency, and cost.
  • Approval tests: human-reviewed outputs stored and diffed; changes require sign-off.
  • Shadow mode: run new prompts/models alongside production on a sample of traffic.
  • SLOs: define p95 latency, min eval score, and max cost per task—CI fails if breached.

Minimal evaluator:

// packages/ai/evaluators/intent-accuracy.ts
export function score(expected: { intent: string }, got: { intent: string }) {
  return expected.intent === got.intent ? 1 : 0;
}

CI pipeline:

  • pnpm test (units + contracts)
  • pnpm eval:run --suite intent --minScore 0.92
  • Fail PR if score < threshold, p95 latency > SLO, or cost regresses > X%
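
The last step of that pipeline can be a small pure function that a CI script calls with the eval results. This is a sketch under assumptions: the `EvalSample` and `GateConfig` shapes, the nearest-rank percentile method, and comparing mean cost per task against a stored baseline are illustrative choices, not a prescribed format.

```typescript
export interface EvalSample {
  score: number; // 0..1, e.g. from score() above
  latencyMs: number;
  costUSD: number;
}

export interface GateConfig {
  minScore: number;
  maxP95LatencyMs: number;
  baselineCostUSD: number; // mean cost per task on the base branch
  maxCostRegressionPct: number; // the "X%" in the pipeline above
}

// Nearest-rank percentile: smallest value with at least p% of samples at or below it.
export function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

export function evaluateGate(
  samples: EvalSample[],
  cfg: GateConfig,
): { pass: boolean; failures: string[] } {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const failures: string[] = [];

  const meanScore = mean(samples.map((s) => s.score));
  if (meanScore < cfg.minScore) {
    failures.push("score " + meanScore.toFixed(3) + " below " + cfg.minScore);
  }

  const p95 = percentile(samples.map((s) => s.latencyMs), 95);
  if (p95 > cfg.maxP95LatencyMs) {
    failures.push("p95 latency " + p95 + "ms above " + cfg.maxP95LatencyMs + "ms");
  }

  const meanCost = mean(samples.map((s) => s.costUSD));
  const regressionPct = ((meanCost - cfg.baselineCostUSD) / cfg.baselineCostUSD) * 100;
  if (regressionPct > cfg.maxCostRegressionPct) {
    failures.push("cost up " + regressionPct.toFixed(1) + "% vs baseline");
  }

  return { pass: failures.length === 0, failures };
}
```

A CI step would exit non-zero when `pass` is false and post `failures` to the PR.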

Observability and Cost Controls From Day 1

  • Tracing: spans for prompt build, provider call, tool call, and post-processing. Attach userId, tenantId, model, tokens, and cost.
  • Logs: redact PII; use hashed references for correlation. Store prompt hash + parameters + output hash + evaluator id.
  • Metrics: per-tenant p50/p95 latency, cost per successful task, cache hit rates, refusal rate, and eval score trend.
  • Budgets: per-tenant and global token/cost ceilings; switch to a cheaper model or disable features when approaching limits.
  • Live kill-switches: flags to disable AI, freeze prompts, or force a specific router path.

Example cost guard:

if (monthlyCostUSD(tenant.id) > 200 && !isEnterprise(tenant)) {
  router.force('openai:gpt-4o-mini');
}
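
The same idea can be written as a pure decision function with a soft limit (downgrade) and a hard limit (disable), which keeps the policy easy to unit-test. A sketch only: `applyBudget`, the `BudgetAction` union, and the two-threshold scheme are assumptions layered on the budget bullet above.

```typescript
export type BudgetAction =
  | { kind: "allow"; model: string }
  | { kind: "downgrade"; model: string }
  | { kind: "disable" };

export interface BudgetInput {
  requestedModel: string;
  cheapModel: string;
  spentUSD: number; // month-to-date spend for the tenant
  softLimitUSD: number; // at or above this, switch to the cheaper model
  hardLimitUSD: number; // at or above this, turn the AI feature off
}

export function applyBudget(input: BudgetInput): BudgetAction {
  if (input.spentUSD >= input.hardLimitUSD) return { kind: "disable" };
  if (input.spentUSD >= input.softLimitUSD && input.requestedModel !== input.cheapModel) {
    return { kind: "downgrade", model: input.cheapModel };
  }
  return { kind: "allow", model: input.requestedModel };
}
```

The router applies the returned action; the caller of a disabled feature falls back to its non-AI path.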

CI/CD and Environments: Make AI Deployable

  • Pin everything: provider, model, and API versions—record them in emitted events.
  • Blue/green for prompts: treat prompt packs like packages with semantic versions.
  • Data migrations include embeddings: ship re-embedding scripts; mark old vectors with previous_version until backfilled.
  • Pre-prod mirroring: replay anonymized prod events to staging; record costs and scores.
  • Rollout levers: percentage-based flags by tenant/tier; ramp gradually while watching eval deltas.
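
Percentage rollouts should be sticky: a tenant that is in at 5% must still be in at 25%. A common way to get that is to hash the flag name and tenant id into a stable bucket from 0 to 99. The sketch below assumes an FNV-1a hash over UTF-16 code units and a hypothetical `isEnabled` helper; a feature-flag service would normally do this for you.

```typescript
// 32-bit FNV-1a over UTF-16 code units: fast, stable, and good enough for bucketing.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Each (flag, tenant) pair maps to a fixed bucket in [0, 100).
export function bucketFor(flag: string, tenantId: string): number {
  return fnv1a(flag + ":" + tenantId) % 100;
}

// Enabled when the tenant's bucket falls below the rollout percentage,
// so raising the percentage only ever adds tenants.
export function isEnabled(flag: string, tenantId: string, percent: number): boolean {
  if (percent <= 0) return false;
  if (percent >= 100) return true;
  return bucketFor(flag, tenantId) < percent;
}
```

Including the flag name in the hash keeps the same tenants from always being the first cohort for every experiment.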

Working With AI Coding Tools Without Letting Them Wreck It

AI pair-programmers are fast but boundary-blind. Enforce your AI coding architecture at the repo level so generated code lands in the right layer.

  • PR checklist: "No provider SDK imports outside packages/ai", "Domain untouched by AI-specific types", "Contracts + tests updated".
  • Lint rules: forbid cross-package deep imports; allow only service interfaces from app.
  • Scaffolding over generation: generate files in the correct folders (adapters/prompts/evaluators) and let the AI fill in functions.
  • ADRs: document AI-influenced refactors and the rollback procedure.

Security, Privacy, and Compliance in the AI Path

  • Data classification: public, internal, restricted, PII. Redact before any external call.
  • Regional routing: EU data to EU-bound models; ensure provider isolation is real, not marketing.
  • Tool safety: strong schemas and authorization checks for tool calls; never run privileged actions without policy gates.
  • Prompt injection defenses: sanitize inputs, scope tools narrowly, and re-verify outputs with deterministic rules.
  • Audit trail: store prompt/output hashes and evaluator scores for forensics and SOC2 evidence.
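
As a small illustration of "redact before any external call", the sketch below masks email addresses and card-like digit runs and reports which PII tags it found. The two regular expressions are deliberately simple assumptions and will miss many real-world formats; treat this as the shape of a central redaction step, not as a complete detector.

```typescript
// Simplified patterns (assumptions): real detectors need broader coverage and validation.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const CARD_LIKE = /\b\d(?:[ -]?\d){12,18}\b/g; // 13 to 19 digits, optional spaces or dashes

export interface Redacted {
  text: string;
  piiTags: string[]; // e.g. ["email"], matching pii_tags on emitted events
}

// Runs before prompt building, embedding, or any provider call.
export function redact(input: string): Redacted {
  const piiTags: string[] = [];

  const withoutEmail = input.replace(EMAIL, "[EMAIL]");
  if (withoutEmail !== input) piiTags.push("email");

  const withoutCard = withoutEmail.replace(CARD_LIKE, "[CARD]");
  if (withoutCard !== withoutEmail) piiTags.push("card");

  return { text: withoutCard, piiTags };
}
```

For example, `redact("Contact jane@example.com about card 4111 1111 1111 1111 today")` yields the text `Contact [EMAIL] about card [CARD] today` with tags `email` and `card`.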

Rollout Playbook: From Prototype to Production

  1. Prototype behind a feature branch; keep all AI calls in packages/ai.
  2. Add evaluators and pass thresholds locally; wire cost and latency metrics.
  3. Ship behind a flag to 5% of Pro tenants; keep the old implementation running in shadow mode for comparison.
  4. Monitor dashboards: latency, cost, accuracy. Adjust prompt, routing, or cache.
  5. Ramp to 25%, then 100% once stable; keep fallback for 2–4 weeks.
  6. Schedule monthly evaluation reviews; rotate golden sets quarterly.

Example: Wiring an AI Use-Case End to End

Goal: auto-triage support tickets with confidence and explanation.

  • Domain: Ticket entity + TriageRequested event.
  • App: TriageTicket use-case fetches text and calls IntentClassifier service.
  • AI: IntentClassifier builds the prompt, calls LLMProvider via router, validates JSON against schema, logs metrics, returns a typed DTO.
  • Interfaces: HTTP POST /tickets/:id/triage triggers the use-case; returns intent + confidence.
  • Worker: listens to ticket.created -> queues triage -> persists result -> notifies analytics.
  • Evaluation: nightly batch scores golden sets and last 1k live samples using agent corrections as ground truth.
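
The layers above can be wired together roughly as follows. This is a sketch under stated assumptions: `TextGenerator` is a narrowed stand-in for LLMProvider, `triageTicket`, `loadTicketText`, and `saveTriage` are hypothetical names, and the inline system prompt stands in for a versioned prompt pack. The point is the shape: the use-case depends on interfaces, and any provider or validation failure degrades to a deterministic fallback.

```typescript
type IntentLabel = "billing" | "support" | "sales" | "unknown";
const LABELS: readonly string[] = ["billing", "support", "sales", "unknown"];

export interface Triage {
  intent: IntentLabel;
  confidence: number;
  source: "model" | "fallback";
}

// Narrowed view of LLMProvider: only what this sketch needs.
export interface TextGenerator {
  generate(input: { prompt: string; system?: string }): Promise<{ text: string }>;
}

// AI layer: builds the prompt, calls the provider, validates the output.
export class IntentClassifier {
  private readonly llm: TextGenerator;

  constructor(llm: TextGenerator) {
    this.llm = llm;
  }

  async classify(ticketText: string): Promise<Triage> {
    try {
      const { text } = await this.llm.generate({
        system: 'Reply with JSON only: {"intent": "billing|support|sales|unknown", "confidence": 0..1}',
        prompt: ticketText,
      });
      const data = JSON.parse(text) as { intent?: unknown; confidence?: unknown };
      if (
        typeof data.intent === "string" &&
        LABELS.indexOf(data.intent) !== -1 &&
        typeof data.confidence === "number" &&
        data.confidence >= 0 &&
        data.confidence <= 1
      ) {
        return { intent: data.intent as IntentLabel, confidence: data.confidence, source: "model" };
      }
    } catch {
      // Provider error or malformed output: fall through to the fallback below.
    }
    return { intent: "unknown", confidence: 0, source: "fallback" };
  }
}

// App layer: the use-case knows nothing about providers or prompts.
export async function triageTicket(
  ticketId: string,
  deps: {
    loadTicketText: (id: string) => Promise<string>;
    classifier: IntentClassifier;
    saveTriage: (id: string, triage: Triage) => Promise<void>;
  },
): Promise<Triage> {
  const text = await deps.loadTicketText(ticketId);
  const triage = await deps.classifier.classify(text);
  await deps.saveTriage(ticketId, triage);
  return triage;
}
```

The HTTP controller and the worker both call `triageTicket`; in unit tests, `TextGenerator` is replaced with a stub that returns canned text.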

Buying vs Building: Pragmatic Choices in 2026

  • Buy: hosted eval platforms, tracing/cost dashboards, and managed vector DBs when you’re <5 engineers and need speed.
  • Build: adapters, router, and prompt packs—your strategic levers.
  • Hybrid: keep a thin in-house interface so you can swap vendors without leakage of provider types into domain/app code.

A 12-Point Checklist to Keep AI From Breaking Your SaaS

  • AI calls exist only behind LLMProvider and service facades.
  • Domain has zero provider imports; deterministic functions are test-covered.
  • Prompts are versioned with tests and golden sets.
  • Model/router policies are codified, observable, and auditable.
  • Telemetry logs tokens, cost, latency, cache hits, and eval scores.
  • Budgets and kill-switches are live in production.
  • Vector store is not the source of truth; embedding versions are stored per record.
  • Tool calls are schema-validated and policy-gated.
  • CI fails on evaluation regressions and excessive cost growth.
  • Staging mirrors prod traffic with privacy-safe replay.
  • ADRs document AI-related architecture decisions.
  • Runbooks exist for prompt rollback, re-embedding, and provider outage.

FAQ

Q: What’s the simplest safe place to put AI code in a full-stack framework? A: Create a packages/ai module (adapters, prompts, evaluators). Controllers call application services, and those services call this module; route handlers never call it directly.

Q: How do I stop prompt drift when models update? A: Pin models in production, run nightly golden-set evals, and promote new models behind a feature flag only after meeting score/cost/latency thresholds. Version prompts with semver for instant rollback.

Q: Do I need a vector database on day one? A: Often no. Start with Postgres + pgvector for simple retrieval. Move to a dedicated vector service when volume, filtering, or multi-tenant isolation demands it.

Q: How do I control runaway costs? A: Track cost per request and per tenant, add budget ceilings, cache aggressively, use small models by default, and route only hard cases to larger models.

Q: How does this help when using AI coding tools daily? A: The enforced boundaries and SaaS folder structure guide generation: AI fills functions inside adapters/prompts/evaluators without leaking provider code into domain or controllers.
