Write Design Docs like Amazon / Eric Clemmons

I wrote lots of design docs while working at AWS Amplify, the most recent of which was the Amplify Authenticator Relaunch.

This is a thorough template that’s meant to be adjusted to meet your needs.

Document Guidelines

Aim for 6-8 pages of primary content. Further explanation & examples can be included in the Appendix.
Don’t use weasel words. Use numbers, percentages, dates, etc. If you don’t have the data, get some.
When referencing data, only include the data-point itself in the body. Move graphs, explanation, & methodologies to the Appendix.
Rarely use bullet-points. Instead, write prose.

(Yes, there are lots of bullets in this document, but that’s because it’s acting as an instructional checklist for what prose you should be writing!)

The Template

Problem Statement

…

Present an abstract of your document that clearly and succinctly defines your problem, summarizes your solution, and states your goal.

For example:

The existing @aws-amplify/ui-* packages successfully reduced the level of effort to create cross-framework components (e.g. <Chatbot>) for web, but led to certain technical limitations (e.g. fine-grained customizability of the UI styling, password manager support) that prevented customers from using the Authenticator in production apps.

The next major release of the Authenticator resolves long-standing customer issues with the existing @aws-amplify/ui-* packages and creates a foundation to ensure cross-framework (e.g. Angular, React, React Native, Vue) and cross-platform (e.g. Android, iOS, Flutter) feature parity & stability.

Glossary

…

Define terms, acronyms, and abbreviations here. This doc will last beyond your meeting, so help future readers (including yourself!).

For example:

SSG – Static Site Generation. The page HTML is generated at build time, rather than per-request.

Next.js – React framework developed by Vercel for full-stack web applications.

Use cases

…

Define use cases in terms of impact or business value, not technical outcomes.
Provide a representative list of examples, not an exhaustive one (i.e. one covering every edge-case).
Work backwards from & show the customer’s perspective.

For example:

Customers can wrap their React pages in authentication with zero-configuration of their underlying backend:

import { Authenticator, useAuthenticator } from "@aws-amplify/ui-react"

export default function App() {
  const auth = useAuthenticator();
 
  if (auth.state !== 'AUTHENTICATED') {
    return <Authenticator />
  }

  return (
    <>
      <h1>Welcome {auth.user.username}!</h1>
      <button onClick={auth.signOut}>Sign out</button>
    </>
  )
}

Breaking changes

…

Does your design break customers, services, or anything else? Such as:

404 URLs.
HTTP API request/response changes.
Component props change.
Making a null-able property non-null.
Updating to this dependency breaks code compilation.
Performance regressions.
Function signature changes.
UI structure may break E2E tests.
Buggy behavior customers had to work around, and those workarounds will break.
etc.

Success criteria

…

What will you measure and quantify to demonstrate you’ve successfully solved the problem? Use relative and absolute values. Doubling customers sounds huge, until you find out it’s from 1 to 2.

For example:

Performance – Time-to-first-byte (TTFB) will improve 4x, from 200ms to 50ms.

Cost – We’ll reduce our DataDog bill by $12,000 each month, from $30,000 to $18,000.

Adoption – By Q2, 25% of our users (4,250) will be using this new version, based on NPM downloads.

Stability – Our session error percentage will drop from 3% to 1%, eliminating crashes in 2,000 sessions a month.

Notice how there’s a directional stat and a before/after comparison.
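To make the pairing of relative and absolute values concrete, here is a minimal sketch of a helper that prints both for a metric. The `MetricChange` shape and `describeChange` function are hypothetical names made up for illustration, not part of any library.

```typescript
// Hypothetical shape for a single success metric (illustrative only).
interface MetricChange {
  name: string;
  before: number;
  after: number;
  unit: string; // e.g. "ms", or "" for plain counts
}

// Describes a change with a relative multiplier and the absolute before/after values.
// Assumes both values are positive; a zero would produce "Infinityx".
function describeChange({ name, before, after, unit }: MetricChange): string {
  const decreased = before > after;
  // Always divide the larger value by the smaller so the multiplier is >= 1.
  const factor = decreased ? before / after : after / before;
  // Round to one decimal place so the multiplier stays readable.
  const rounded = Math.round(factor * 10) / 10;
  const direction = decreased ? "drops" : "rises";
  return `${name} ${direction} ${rounded}x, from ${before}${unit} to ${after}${unit}`;
}

console.log(describeChange({ name: "TTFB", before: 200, after: 50, unit: "ms" }));
// TTFB drops 4x, from 200ms to 50ms

console.log(describeChange({ name: "Customer count", before: 1, after: 2, unit: "" }));
// Customer count rises 2x, from 1 to 2
```

The second call is the "doubling customers" case: the multiplier alone sounds impressive, and the absolute values next to it keep it honest.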

Proposed design

…

This is where you outline your solution & approach and what it does.

Focus on ideas, concepts, and tenets that are directionally true, no matter the implementation. There should be key features that are accomplished through the technical design below.

Technical design

…

This is where you discuss how you achieve your solution. Provide architecture diagrams, algorithms, data structures, interfaces, etc. that developers or systems will rely on.

Primarily describe the happy-path, well-supported use cases. If edge-cases or errors impact the design (e.g. working around AWS SQS Standard not behaving like a FIFO Queue), include them.

When including a design, link to the original. (You’ll want that again.)

Here are the tools I like:

Components

…

Like the C4 model, this should enumerate the various humans, systems, processes, & components that are interacting with each other.


For example:

Studio - The primary interface for UX Designers. It consumes UI Codegen as a dependency.

CLI - The primary interface for App Developers. It consumes UI Codegen as a dependency.

GitHub - The source control system-of-record for the open-source UI Codegen package.

Actions - The CI/CD system for building, testing, and deploying the UI Codegen package.

NPM - The primary distribution channel for the open-source UI Codegen package.

Dependencies

…

What external or sibling systems or components does your design rely on? List these out, as they may be relevant as a risk to mitigate.

For example:

LaunchDarkly – Feature flag service that is responsible for rolling out this feature to users.

Monitoring

…

Similar to the customer-facing or business-oriented success metrics above, how will you track technical metrics for the health & stability of your solution?

New APIs or behaviors

…

This means public-facing (i.e. an API another service or person may depend on) changes. These require broader buy-in, particularly by stakeholders and senior technical staff.

For example:

Queues go from delivering messages exactly once to at least once.

Introducing an experimental: {...} configuration object.
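The queue example above shows why a behavior change needs buy-in: with at-least-once delivery, the same message can arrive more than once, so every consumer has to tolerate duplicates. Below is a minimal sketch of one way a consumer might do that. The `QueueMessage` shape and `makeIdempotent` wrapper are hypothetical, and the sketch assumes each message carries a unique `id`.

```typescript
// Hypothetical message shape; `id` is assumed to be unique per logical message.
interface QueueMessage {
  id: string;
  body: string;
}

// Wraps a handler so that a redelivered message is processed only once.
function makeIdempotent(handler: (message: QueueMessage) => void) {
  // In-memory for illustration; a real consumer would need a shared, durable store.
  const seen = new Set<string>();

  return (message: QueueMessage): boolean => {
    if (seen.has(message.id)) {
      return false; // duplicate delivery, skipped
    }
    handler(message);
    // Record the id only after the handler succeeds, so a failed attempt can be retried.
    seen.add(message.id);
    return true; // first delivery, handled
  };
}

const processed: string[] = [];
const consume = makeIdempotent((message) => {
  processed.push(message.body);
});

consume({ id: "m1", body: "charge card" });
consume({ id: "m1", body: "charge card" }); // redelivered by the queue
consume({ id: "m2", body: "send receipt" });
// `processed` now holds two entries: "charge card" and "send receipt".
```

Under exactly-once delivery none of this is needed, which is the point: the change pushes new work onto every consumer, so they should hear about it in the design review.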

Pros & cons

…

Show you’ve pragmatically considered the impact of adopting your solution as well as not adopting it.

Is there development cost that will take away from or delay other initiatives?
Is this solution easy or difficult to maintain? Will this affect headcount?
Does this solution introduce complexity that impacts on-call or operational load?
Is this a 1-way or 2-way door decision?

Major risks & mitigations

…

This is an opportunity to go deeper into Dependencies, Cons, & Components listed above.

Are there dependencies that are outside your control? If there’s a bug or an outage, how will your solution be affected?
Does your solution introduce a new technology or pattern? How will others gain experience with it?
Are there any assumptions your solution has that haven’t been validated yet? What happens if those assumptions are incorrect?
Does your solution introduce technical debt? How will you pay it off? (Feature flags are great fodder for this!)

Security

…

Good designs are also accompanied by a separate security review.

Does the exposed surface area of our systems change?
Is new data being gathered or stored? What type of data? Stored where? What happens if it’s exposed?
Is the design secure-by-default? Or are there extra steps required to improve the security posture?

Scope

…

This should touch on the exact features & functionality required to deploy the solution, including scale & an SLA.

For example:

Customers will be able to upgrade their dependency to this version without changing any code.

This API will use existing, local LLM models, without affecting our OpenAI quota.

Consider what the minimum possible work is to solve the above use cases, without having to “boil the ocean”.

Out of scope

…

With a minimal set of work defined, incremental delivery can be listed here. This is where you can “look around corners” for potential needs, but not block your solution for them.

For example:

The API will not support batch editing, which is planned for Q2.

There are no changes to existing user Roles & Permissions.

Alternatives considered

…

By this point, you should have a well-written proposal for your solution. But “do nothing” is an option that must be addressed.

Likewise, include alternative systems or solutions that are close, but fall short enough to justify the effort of both writing this long-freakin’ proposal and building the damned thing.

Alternative #1

Pros & cons

Similar to Pros & cons for your solution, list the tradeoffs & benefits concisely here.

Reasons discarded

Why was this not pursued? Is there any aspect of this that can change (e.g. release cadences, team structure, support, expertise) that would make this viable?

FAQ

…

If there’s a trend in questions as you solicit pre-review feedback (your team is excellent for this!), you can get ahead of those & answer them here.

However, I consider this section a smell. Don’t bury the lede by making people go to the bottom to get their question answered. Put it up top where it makes sense in context!

Open Questions & Feedback

…

This is where you record items from the design review.

Every item must be addressed and resolved by updating the document. Keep the original questions here, but strike them out:

~~Missing Pros & Cons~~

Have you considered this other solution?

Appendix

…

This is where you include the methodology & sources behind your data, prior art, resources, & attachments.