Definition

Customer-success-as-engineering

The discipline that stops every new customer from spawning its own engineering tickets — the work underneath the agent that keeps it reliable as you scale, owned by one person who hands it back to your team. You learn it’s broken the bad way: a customer complaint, not a dashboard. Or the week a frontier lab ships a model and your prod quietly shifts for whoever’s live. Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027 (Gartner, Jun 2025); the wall is reliability, not capability.

Customer-success-as-engineering turns the per-customer AI grind into a system that compounds, so engineering capacity scales sublinearly with customer count instead of one-for-one. It’s the layer underneath the agent, owned end-to-end by one named operator who stays the reliability gate, and built to be handed back rather than held hostage. The person managing the agents at 2am is probably you right now. This moves that off your plate into a system your team runs. Done right, customer #30 costs what #10 did, not 3× it — the flat curve your board probes at the next raise. Economics that improve as you scale, not a founder hiring an engineer every few customers while burn climbs.

Four sub-disciplines

Agent Operating Procedures (AOP)

The instruction layer. For each agent in production, a written spec: its domain, inputs and outputs, where a human touches it, how it fails, who it escalates to. Without it, every new edge case lands on an engineer’s desk and re-opens shipped work. With it, your success and ops people resolve most edge cases by editing the procedure instead of filing a ticket, and “this customer is different” becomes a config change rather than a code change. In a regulated workflow it’s where the line lives that a fabricated citation, a mis-extracted clause, or a hallucinated clinical fact never reaches a customer without a credentialed human in the path — the one error that ends the company, caught by design. It’s the layer I keep legible and hand over: yours to read, edit, and run after I’m gone, not a black box you rent.

Eval cadence

The measurement layer. Production AI accuracy slips silently unless evaluation runs on a schedule — weekly for high-traffic agents, monthly for the rest. The cadence sets what to measure, how to sample, which credentialed reviewers sign off, and what regression threshold blocks a release, so you catch the drop on your schedule before a customer does. Tooling tells you something moved. It doesn’t decide what “reliable enough” means for you, or own the fix when the number drops. That judgment is the product, not the dashboard — the layer I install on top and hand to your team to hold.

Model-migration runbook

The substrate-change layer. When a frontier lab ships a new model, or you switch providers, the migration is never a single PR. It needs regression evaluation across the customers already live, prompt re-tuning per vertical, AOP updates per agent, and a staged rollout with a rollback path. The runbook codifies all of it, so the next one is a two-week structured process your team can run instead of the six-week scramble that quietly drops reliability for whoever’s live that week. Because I own the architecture, model choice stays swappable inside it: vendor-neutral by design, not a lock-in you can’t reverse.

Workflow-definition discipline

The agent-boundary layer. Onboarding a customer forces the question: what does the agent do, and where does the human pick up? Without discipline, every customer negotiates a bespoke workflow and engineering owns the difference forever — one more hire every few customers, gross margin sliding the wrong way as the count climbs. With it, customers select from defined templates, each with its own AOP, eval rubric, reviewer profile, and SLA, and customizations live in config, not code. This is where reliability is won or lost. Shortening the agent’s chains and putting human checkpoints where compounding error bites is how reliability holds as customers pile on, instead of decaying silently until a churn email explains it. It’s the sub-discipline that bends the per-customer cost curve most directly toward flat.

Receipts

Reliability-at-scale — the thing I’ve done longest
The “still works under load” proof sits in systems that ran for years: a distributed micro-frontend architecture I invented at Rakuten before the industry named the pattern, and an enterprise reporting platform I built end-to-end — API, queues, auto-scaling under load — serving global advertisers under strict uptime. (No SLA percentile I can’t show you.)
Artelize — a 4.5-year vertical-AI SaaS technical function, owned end-to-end
Read case study →
A four-actor performing-arts marketplace, the AI underneath it (real-time indexing of 2M+ event pages, generation, recommendations), a real-time data pipeline, and an analytics product on top. Named singularly: the practice, lived once at depth. Honest gap — this is performing-arts vertical AI, not a regulated (legal or clinical) one; that domain I bring by analogy, from ATC, and I’ll say so before you ask.
Deep Thought Solutions — internal, dogfooded
The same AOP, eval-cadence, and migration discipline runs daily across DTS’s own outreach automation, content pipeline, and client work. I run on what I install.
Bulbee — more than four years, two-sided paid platform (specialists B2B + families B2C), three-stakeholder
Read case study →
The adjacent multi-stakeholder analog. Both sides kept paying — that’s the signal, not a count; not a second SaaS-CTO tenure.

What this is NOT

This is not eval tooling, and it does not compete with it. LangSmith, Langfuse, Braintrust, and Arize ship the measurement substrate — a team can plug any of them in an afternoon. That was never the gap. The gap is self-authoring the discipline they sit on: the AOP, cadence, migration runbook, and workflow discipline that turn a regression signal into a fix without an engineering rebuild every time. I wrap their tooling as substrate, never against it. It’s also not a roster you manage, and not one person you can’t afford to lose: the surgical core — architecture, reliability, comprehension of your system — I build by hand and own; the scale underneath runs on a named DTS team, written down so your people own it when I leave. An agent-team shop can promise more tickets closed, faster. Only undivided ownership of the core can promise it still works at customer #30 — reliability rots when ownership is diffuse.

Not for you if

you’re pre-first-customer still hunting product-market fit, you want a body to write tickets under your direction, or you’re buying security/compliance sign-off (a vCISO). This is for a vertical-AI product in production where engineering cost climbs with every customer.

This is the deep read. If your AI works in the demo and breaks past customer #10 — or a model migration just cost you a fortnight you didn’t have — that’s the conversation. 15 minutes, no pitch, and if there’s no fit I’ll say so.

Discuss installing this →← Case studies