This document defines the foundational standards for all k0rdent APIs: request/response structure, naming conventions, field design, and operational patterns across Atlas, Arc, and shared services.
Draft: This documentation is currently a work in progress and subject to change.
Quick Navigation:
  • For complete endpoint implementations, see API Specifications
  • For code examples and patterns, see Data Ownership
  • For authentication and security, see Auth


API Response Contract

All API responses use a consistent discriminated union envelope structure with a success boolean discriminator.

Type Definition

// Base response wrapper with discriminated union
type ApiResponse<T, M = {}> =
  | { success: true; data: T; meta: BaseMeta & M } // Success with extendable meta
  | { success: false; error: ApiErrorBody; meta: BaseMeta } // Error with fixed meta

Base Meta Object

The meta object is always present and contains request tracking information:
// Base meta - always present in all responses
interface BaseMeta {
  requestId: string // Required: for log correlation
  timestamp: string // Required: ISO 8601, for distributed debugging
}
Design Decision: requestId and timestamp are always included because they provide negligible overhead while being critical for distributed debugging and log correlation across services.

Extension Meta Types

Success responses can extend meta with additional context:
// Pagination meta (for list responses)
interface PaginationMeta {
  pagination: {
    total: number
    page: number
    pageSize: number
    hasMore: boolean
    nextCursor?: string // For cursor-based pagination
  }
}

// Workflow meta (for long-running tasks via Trigger.dev)
interface WorkflowMeta {
  workflowId?: string // Trigger.dev run ID for polling status
  estimatedDuration?: number // Hint for UI progress (ms)
}

// Debug meta (development/troubleshooting)
interface DebugMeta {
  duration: number // Request duration in milliseconds
}

// Deprecation warnings
interface DeprecationMeta {
  warnings: DeprecationWarning[]
}

interface DeprecationWarning {
  code: 'DEPRECATED_FIELD' | 'DEPRECATED_ENDPOINT'
  message: string
  field?: string
  sunset?: string // ISO 8601 date when feature will be removed
  migration?: string // URL to migration guide
}

Error Response Structure

interface ApiErrorBody {
  code: string // Machine-readable: "VALIDATION_ERROR", "NOT_FOUND"
  message: string // Human-readable description
  details?: Record<string, unknown> // Additional error context
}

Response Examples
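
As a minimal sketch, the following builds one success envelope and one error envelope from the types defined above and narrows on the success discriminator. The helpers `baseMeta`, `ok`, `fail`, and `summarize`, and all sample values, are assumptions made for illustration and are not part of the documented API.

```typescript
// Types restated from the sections above so the sketch is self-contained.
interface BaseMeta {
  requestId: string
  timestamp: string
}

interface ApiErrorBody {
  code: string
  message: string
  details?: Record<string, unknown>
}

type ApiResponse<T, M = {}> =
  | { success: true; data: T; meta: BaseMeta & M }
  | { success: false; error: ApiErrorBody; meta: BaseMeta }

// Hypothetical helpers for building envelopes.
function baseMeta(requestId: string): BaseMeta {
  return { requestId, timestamp: new Date().toISOString() }
}

function ok<T>(data: T, requestId: string): ApiResponse<T> {
  return { success: true, data, meta: baseMeta(requestId) }
}

function fail(code: string, message: string, requestId: string): ApiResponse<never> {
  return { success: false, error: { code, message }, meta: baseMeta(requestId) }
}

// Narrowing on `success` exposes either `data` or `error`, never both.
function summarize<T>(res: ApiResponse<T>): string {
  return res.success
    ? `ok (${res.meta.requestId})`
    : `${res.error.code}: ${res.error.message} (${res.meta.requestId})`
}

const found = ok({ id: 'srv_123' }, 'req_1')
const missing = fail('NOT_FOUND', 'Server not found: srv_999', 'req_2')
```

Because `meta` is present on both branches, client code can read `requestId` before checking `success`.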

Meta Object Design Decisions

| Field | Decision | Rationale |
| --- | --- | --- |
| requestId | Always include | Essential for log correlation across services |
| timestamp | Always include | Negligible overhead; critical for distributed debugging |
| apiVersion | Omit | URL path (/v1/) is the version; redundant in response |
| rateLimit | Add later | When rate limiting is implemented |

Implementation Reference

Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Framework | Next.js 16+ (App Router) | Server components, API routes |
| API Layer | Hono | Lightweight, type-safe API routes |
| Validation | Zod | Runtime validation, schema definitions |
| Database | Drizzle ORM | Type-safe queries, migrations |
| Auth | BetterAuth | Session management, OAuth |
| Workflows | Trigger.dev | Durable task execution |

Zod Schema Definitions

Define all request/response schemas using Zod for validation and type inference:
Pattern: Workflows by Default. All operations that interact with infrastructure (BMC, K8s) are executed as Trigger.dev workflows and return immediately with a workflow run ID. Clients poll workflow status via the Workflows API or receive webhooks.
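
The client side of this pattern can be sketched as a small polling loop. This is an illustration only: the status values, the `pollWorkflow` helper, and the injected `getStatus` function are assumptions, since the Workflows API itself is not specified in this document.

```typescript
// Hypothetical status values; the real Workflows API may use different names.
type WorkflowStatus = 'queued' | 'running' | 'completed' | 'failed'

// Polls until the workflow reaches a terminal state or attempts run out.
// `getStatus` stands in for a call to the Workflows API.
async function pollWorkflow(
  getStatus: () => Promise<WorkflowStatus>,
  options: { intervalMs: number; maxAttempts: number }
): Promise<WorkflowStatus> {
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    const status = await getStatus()
    if (status === 'completed' || status === 'failed') return status
    if (attempt < options.maxAttempts) {
      // Wait before the next poll; skipped after the final attempt.
      await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs))
    }
  }
  throw new Error(`Workflow still in progress after ${options.maxAttempts} attempts`)
}
```

A caller would typically take the interval from the `estimatedDuration` hint when one is provided, and fall back to webhooks for very long operations.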

Field Design Rules

Design fields for extension from day one. The cost of refactoring primitive fields into objects later is high and often requires breaking changes.
Principle: Objects Over Primitives. Always wrap values that might grow into structured objects. It’s better to have nested objects early than to break APIs later when you need to add context.

Use Objects Over Primitives

Wrap any value that might later need additional context in an object from the start.
// ❌ Will require breaking change
interface Server {
  status: 'available' | 'provisioning' | 'error'
  bmcAddress: string
}

// ✅ Extensible
interface Server {
  status: {
    state: 'available' | 'provisioning' | 'error'
    reason?: string
    since?: string
    conditions?: Condition[]
  }
  bmc: {
    address: string
    protocol?: 'ipmi' | 'redfish'
    vendor?: string
  }
}

Never Use Booleans for State

States often grow beyond two values.
// ❌ Trouble waiting to happen
interface Server {
  isOnline: boolean
  isHealthy: boolean
}

// ✅ Extensible
interface Server {
  power: {
    state: 'on' | 'off' | 'unknown'
  }
  health: {
    state: 'healthy' | 'degraded' | 'unhealthy' | 'unknown'
  }
}

Use IDs with Optional Expansion

Don’t embed full objects. Use IDs and let clients request expansion.
// ❌ Embedded - can't change cardinality later
{
  "id": "srv_123",
  "cluster": {
    "id": "cls_456",
    "name": "prod-cluster"
  }
}
GET /v1/compute/servers/srv_123
// ✅ ID reference
{
  "id": "srv_123",
  "clusterId": "cls_456"
}
GET /v1/compute/servers/srv_123?expand=cluster
// ✅ Optional expansion
{
  "id": "srv_123",
  "clusterId": "cls_456",
  "cluster": {
    "id": "cls_456",
    "name": "prod-cluster"
  }
}
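
A minimal sketch of how the server side might honor `?expand=`, assuming an allow-list of expandable relations and a lookup function for clusters. `EXPANDABLE`, `parseExpand`, `serializeServer`, and `lookupCluster` are hypothetical names introduced for illustration.

```typescript
interface Cluster {
  id: string
  name: string
}

interface ServerView {
  id: string
  clusterId: string
  cluster?: Cluster
}

// Relations that clients may expand (an assumed allow-list).
const EXPANDABLE = ['cluster'] as const
type Expandable = (typeof EXPANDABLE)[number]

// Parse "?expand=a,b" into a validated set; unknown names are ignored here.
function parseExpand(raw: string | undefined): Set<Expandable> {
  const requested = (raw ?? '').split(',').map((s) => s.trim())
  return new Set(EXPANDABLE.filter((name) => requested.includes(name)))
}

// Always emit the ID reference; attach the full object only on request.
function serializeServer(
  server: { id: string; clusterId: string },
  expand: Set<Expandable>,
  lookupCluster: (id: string) => Cluster | undefined
): ServerView {
  const out: ServerView = { id: server.id, clusterId: server.clusterId }
  if (expand.has('cluster')) {
    const cluster = lookupCluster(server.clusterId)
    if (cluster) out.cluster = cluster
  }
  return out
}
```

Keeping `clusterId` in both shapes means clients that never expand are unaffected when new expandable relations are added.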

API Versioning

All API endpoints include version in the URL path: /v1/... Domain-based routing separates Atlas and Arc APIs:

Version Format

# Atlas APIs (api.internal.example.com)
https://api.internal.example.com/v1/region/global/compute/servers
https://api.internal.example.com/v1/region/global/compute/clusters
https://api.internal.example.com/v1/region/global/organizations

# Arc APIs (api.example.com)
https://api.example.com/v1/region/{region}/projects
https://api.example.com/v1/region/{region}/compute/clusters
https://api.example.com/v1/region/{region}/stacks

# Shared Services (both domains)
/v1/region/global/auth
/v1/region/global/notifications
/v1/region/global/webhooks

Version Policy

  • Major versions (v1, v2) for breaking changes
  • No minor versions in URL - use feature flags and deprecation warnings instead
  • Deprecation timeline: 6 months notice before removing deprecated endpoints
  • Version in URL, not response: The apiVersion field is omitted from responses because the URL path is the source of truth
When deprecating fields or endpoints, include warnings in the meta.warnings array with sunset dates and migration guides.

Deprecation Example

{
  "success": true,
  "data": { ... },
  "meta": {
    "requestId": "...",
    "timestamp": "...",
    "warnings": [
      {
        "code": "DEPRECATED_ENDPOINT",
        "message": "This endpoint is deprecated. Use /v2/servers instead.",
        "sunset": "2026-06-01",
        "migration": "https://docs.example.com/migration/v2-servers"
      }
    ]
  }
}
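
On the client side, these warnings can be surfaced before the sunset date arrives. The following is a sketch under the assumption that clients inspect `meta.warnings` on every response; `warningsDueSoon` is a hypothetical helper, not part of any SDK.

```typescript
interface DeprecationWarning {
  code: 'DEPRECATED_FIELD' | 'DEPRECATED_ENDPOINT'
  message: string
  field?: string
  sunset?: string // ISO 8601 date when the feature will be removed
  migration?: string
}

// Returns the warnings whose sunset date falls within `withinDays` of `now`.
// Warnings without a parseable sunset date are skipped.
function warningsDueSoon(
  meta: { warnings?: DeprecationWarning[] },
  now: Date,
  withinDays: number
): DeprecationWarning[] {
  const horizon = now.getTime() + withinDays * 24 * 60 * 60 * 1000
  return (meta.warnings ?? []).filter((w) => {
    if (!w.sunset) return false
    const sunset = Date.parse(w.sunset)
    return !Number.isNaN(sunset) && sunset <= horizon
  })
}
```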

Naming Conventions

Consistent naming across URLs, fields, and resources improves developer experience and reduces confusion.

URL Paths

Hono uses colon-prefixed route parameters.
| Rule | Example |
| --- | --- |
| Route params with colon | /v1/clusters/:clusterId |
| Lowercase, hyphenated | /v1/ai-services |
| Plural nouns for collections | /servers, /clusters |
| Singular for singletons | /me, /health |
NOTE: The lowercase, hyphenated rule applies only to URL paths. It does not apply to database columns, request bodies, response fields, or other contexts.

Resource IDs

Resources (clusters, servers, organizations, etc.) get globally unique, opaque IDs that do NOT contain region information. This decouples resource identity from physical location. Format: {prefix}_{base62}
| Resource | Prefix | Example |
| --- | --- | --- |
| Organization | org_ | org_8TcVx2WkZddNmK3Pt9JwX7BzWrLM |
| Server | srv_ | srv_3KpQm9WnXccFjH2Ls8DkT6VzRqYU |
| Cluster | cls_ | cls_6NZtkvWLBbbmHfPi7L6oz7KZpqET |
| Stack | stk_ | stk_5MfRp4WjYbbHmG8Nt2LvS9CxPqZK |
| Workflow Run | run_ | run_7NhTq6WlAbbKmF5Rt3MxU8DzSqWJ |
| Pool | poo_ | poo_2LgPn8WmXccGjE7Mt4KwV9BySrTL |
| Allocation | all_ | all_9QjSr3WnZddMmH6Pt5LxW2CzUrYK |
| API Key | key_ | key_4KfQm7WkYccJmG3Nt8MvX9BzSqWL |
| Event | evt_ | evt_6MgRp2WlXbbKmF9Rt5NxU3DzTqZJ |
Special ID: org_system is reserved for platform-level admin operations. Whether it is still needed is TBD; it was originally introduced for a different purpose.
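
A minimal sketch of generating and parsing IDs in the {prefix}_{base62} format. The 28-character body length is inferred from the examples above and is an assumption, as are the helper names `generateId` and `parseId`; the real generator may differ.

```typescript
import { randomBytes } from 'node:crypto'

const BASE62 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

// Assumed length: the example IDs show 28 characters after the prefix.
const ID_LENGTH = 28

function generateId(prefix: string, length: number = ID_LENGTH): string {
  let body = ''
  while (body.length < length) {
    const bytes = randomBytes(length)
    for (let i = 0; i < bytes.length && body.length < length; i++) {
      // Rejection sampling: 248 is the largest multiple of 62 not above 256,
      // so dropping bytes >= 248 keeps every character equally likely.
      if (bytes[i] < 248) body += BASE62.charAt(bytes[i] % 62)
    }
  }
  return `${prefix}_${body}`
}

// Split an ID into prefix and body; returns null when the shape is wrong.
function parseId(id: string): { prefix: string; body: string } | null {
  const match = /^([a-z]+)_([0-9A-Za-z]+)$/.exec(id)
  return match ? { prefix: match[1], body: match[2] } : null
}
```

Because the body is opaque and carries no region information, `parseId` exposes only the resource type, which is useful for routing and for readable logs.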

Field Names

| Rule | Example |
| --- | --- |
| camelCase | createdAt, nodeCount |
| Suffix IDs with Id | clusterId, organizationId |
| Use past tense for timestamps | createdAt, updatedAt, deletedAt |

Pagination

Use cursor-based pagination for real-time data, offset/limit for stable datasets.

Query Parameters

GET /v1/compute/servers?limit=50&offset=0
GET /v1/compute/servers?limit=50&cursor=srv_abc123

Response

interface PaginatedResponse<T> {
  success: true
  data: T[]
  meta: {
    requestId: string
    timestamp: string
    pagination: {
      total: number
      limit: number
      offset?: number
      cursor?: string
      nextCursor?: string
      hasMore: boolean
    }
  }
}
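
The cursor variant can be sketched over an in-memory list. This assumes the cursor is the ID of the last item on the previous page and that items are ordered by ID; both are illustrative assumptions, and `paginateByCursor` is a hypothetical helper rather than the actual implementation.

```typescript
interface Identified {
  id: string
}

interface CursorPage<T> {
  data: T[]
  pagination: {
    total: number
    limit: number
    cursor?: string
    nextCursor?: string
    hasMore: boolean
  }
}

// Assumes limit >= 1, which matches the query schema's lower bound.
function paginateByCursor<T extends Identified>(
  items: T[],
  limit: number,
  cursor?: string
): CursorPage<T> {
  const sorted = [...items].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
  // Resume strictly after the cursor so rows inserted earlier do not shift the page.
  const remaining = cursor === undefined ? sorted : sorted.filter((item) => item.id > cursor)
  const data = remaining.slice(0, limit)
  const hasMore = remaining.length > limit
  return {
    data,
    pagination: {
      total: sorted.length,
      limit,
      ...(cursor !== undefined ? { cursor } : {}),
      ...(hasMore ? { nextCursor: data[data.length - 1].id } : {}),
      hasMore,
    },
  }
}
```

Unlike offset/limit, resuming after a known ID stays stable when rows are inserted or deleted between requests, which is why it suits real-time data.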

Filtering and Sorting

Query String Format

Use consistent query parameter patterns for filtering and sorting:
GET /v1/compute/servers?status=available&sort=-createdAt&limit=50
| Parameter | Format | Example |
| --- | --- | --- |
| Filter | field=value | status=available |
| Multiple values | field=val1,val2 | status=available,provisioning |
| Sort ascending | sort=field | sort=name |
| Sort descending | sort=-field | sort=-createdAt |
| Multiple sorts | sort=field1,-field2 | sort=status,-createdAt |

Implementation with Zod

import { z } from 'zod'

export const listQuerySchema = z.object({
  // Filtering
  status: z.string().optional(),
  type: z.string().optional(),

  // Sorting
  sort: z.string().optional(),

  // Pagination
  limit: z.coerce.number().int().min(1).max(100).default(25), // whole numbers only
  offset: z.coerce.number().int().min(0).default(0),
  cursor: z.string().optional(),
})
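
The schema above validates `sort` and filter values only as strings, so the comma-separated and minus-prefixed formats still need parsing. A minimal sketch, assuming an allow-list of sortable fields; `parseSort` and `parseList` are hypothetical helpers.

```typescript
type SortDirection = 'asc' | 'desc'

interface SortKey {
  field: string
  direction: SortDirection
}

// Parse "status,-createdAt" into ordered sort keys. Fields outside `allowed`
// are rejected so clients cannot sort on arbitrary columns.
function parseSort(raw: string | undefined, allowed: readonly string[]): SortKey[] {
  if (!raw) return []
  return raw.split(',').map((token): SortKey => {
    const trimmed = token.trim()
    const descending = trimmed.startsWith('-')
    const field = descending ? trimmed.slice(1) : trimmed
    if (!allowed.includes(field)) {
      throw new Error(`Unsupported sort field: ${field}`)
    }
    return { field, direction: descending ? 'desc' : 'asc' }
  })
}

// Parse "available,provisioning" into a list of filter values, dropping empties.
function parseList(raw: string | undefined): string[] {
  return (raw ?? '')
    .split(',')
    .map((value) => value.trim())
    .filter((value) => value.length > 0)
}
```

In a route, a thrown error here would normally be mapped to a VALIDATION_ERROR response rather than surfacing as a 500.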

Action Endpoints

For operations beyond CRUD, use a unified action endpoint with POST method. Actions represent commands that change resource state asynchronously.

Design Principles

  1. Unified Endpoint: Single /actions endpoint handles all action types (power, provision, deprovision, inspect, maintenance)
  2. Type-Safe Parameters: Each action type has its own request schema with action-specific options
  3. Async by Default: Actions return 202 Accepted with workflow/operation IDs for tracking
  4. Audit Trail: Logs show “POST /actions with type=power action=off” for clear tracking
  5. Granular Permissions: Easy to scope permissions like servers:lifecycle vs servers:update

Endpoint Pattern

POST /v1/compute/servers/:serverId/actions     # Actions: power, provision, deprovision, inspect, maintenance

Action Request Schema

import { z } from 'zod'

// Base action schema with discriminated union for type safety
export const serverActionBodySchema = z.discriminatedUnion('type', [
  // Power actions
  z.object({
    type: z.literal('power'),
    action: z.enum(['on', 'off', 'reboot', 'cycle']),
    force: z.boolean().default(false),
  }),

  // Provision actions
  z.object({
    type: z.literal('provision'),
    imageUrl: z.string().url(),
    imageChecksum: z.string().optional(),
    checksumType: z.enum(['md5', 'sha256', 'sha512']).optional(),
    rootDeviceHints: z
      .object({
        deviceName: z.string().optional(),
        minSizeGiB: z.number().optional(),
      })
      .optional(),
  }),

  // Deprovision actions
  z.object({
    type: z.literal('deprovision'),
    wipeDisks: z.boolean().default(true),
  }),

  // Inspect actions
  z.object({
    type: z.literal('inspect'),
    full: z.boolean().default(false),
  }),

  // Maintenance mode actions
  z.object({
    type: z.literal('maintenance'),
    enabled: z.boolean(),
    reason: z.string().optional(),
  }),
])

Implementation Example
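
As a rough sketch of how a handler behind the unified endpoint might dispatch on the validated body, the following restates a simplified action union and uses an exhaustive switch. `enqueue` is a hypothetical stand-in for starting a Trigger.dev task, and the returned shape is an assumption rather than the documented response.

```typescript
// Simplified version of the validated request body from the schema above.
type ServerAction =
  | { type: 'power'; action: 'on' | 'off' | 'reboot' | 'cycle'; force: boolean }
  | { type: 'provision'; imageUrl: string }
  | { type: 'deprovision'; wipeDisks: boolean }
  | { type: 'inspect'; full: boolean }
  | { type: 'maintenance'; enabled: boolean; reason?: string }

interface Accepted {
  status: 202
  workflowId: string
  summary: string // audit-friendly description of the command
}

// Hypothetical stand-in for enqueueing a workflow; returns a fake run ID.
function enqueue(taskName: string, serverId: string): string {
  return `run_${taskName}_${serverId}`
}

function handleServerAction(serverId: string, body: ServerAction): Accepted {
  let summary: string
  switch (body.type) {
    case 'power':
      summary = `type=power action=${body.action}`
      break
    case 'provision':
      summary = `type=provision imageUrl=${body.imageUrl}`
      break
    case 'deprovision':
      summary = `type=deprovision wipeDisks=${body.wipeDisks}`
      break
    case 'inspect':
      summary = `type=inspect full=${body.full}`
      break
    case 'maintenance':
      summary = `type=maintenance enabled=${body.enabled}`
      break
    default: {
      // Compile-time guard: adding a new action type without a case fails here.
      const unreachable: never = body
      throw new Error(`Unhandled action: ${JSON.stringify(unreachable)}`)
    }
  }
  return { status: 202, workflowId: enqueue(body.type, serverId), summary }
}
```

The `never` assignment in the default branch is what keeps the handler in sync with the schema as new action types are added.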


Bulk Operations

Bulk operations allow applying actions to multiple resources simultaneously. All bulk actions use partial success semantics - individual resource failures do not fail the entire bulk operation.

Design Principles

  1. Partial Success: Individual failures don’t abort the entire bulk operation
  2. Explicit IDs: Use explicit ID lists for predictability and safety
  3. Per-Resource Results: Response includes success/failure status for each resource
  4. 207 Multi-Status: Always return 207 to indicate that mixed results are possible
  5. Dedicated Endpoints: Use /bulk pattern for consistency

Endpoint Pattern

POST /v1/compute/servers/bulk           # Atlas API
POST /v1/notifications/inbox/bulk       # Shared service
POST /v1/webhooks/subscriptions/bulk    # Shared service
The action type is specified in the request body, making the API flexible and maintainable.

Request Schema

import { z } from 'zod'

export const bulkRequestSchema = z.object({
  // Action type
  action: z.enum([
    'register',
    'power',
    'provision',
    'deprovision',
    'delete',
  ]),

  // Explicit resource IDs
  ids: z.array(z.string()).min(1).max(1000),

  // Optional dry-run mode
  dryRun: z.boolean().optional(),

  // Action-specific configuration
  params: z.record(z.unknown()).optional(),
})

Response Structure

interface BulkOperationResponse {
  success: true // Always true for bulk ops
  data: {
    action: string
    requested: number
    succeeded: number
    failed: number
    dryRun?: boolean
    wouldAffect?: number // Dry-run only
    results: Array<{
      id: string
      status: 'success' | 'failed'
      error?: {
        code: string
        message: string
      }
    }>
  }
  meta: {
    requestId: string
    timestamp: string
  }
}

Implementation Example
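
A minimal sketch of partial-success semantics: each ID is processed independently and failures are recorded per resource instead of aborting the batch. `runBulk`, the injected `perform` callback, and the `ACTION_FAILED` code are assumptions for illustration.

```typescript
interface BulkResult {
  id: string
  status: 'success' | 'failed'
  error?: { code: string; message: string }
}

interface BulkSummary {
  action: string
  requested: number
  succeeded: number
  failed: number
  results: BulkResult[]
}

// Runs `perform` for every ID in order. A thrown error marks only that
// resource as failed; the remaining IDs are still processed.
async function runBulk(
  action: string,
  ids: string[],
  perform: (id: string) => Promise<void>
): Promise<BulkSummary> {
  const results: BulkResult[] = []
  for (const id of ids) {
    try {
      await perform(id)
      results.push({ id, status: 'success' })
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err)
      // 'ACTION_FAILED' is a placeholder code, not a documented one.
      results.push({ id, status: 'failed', error: { code: 'ACTION_FAILED', message } })
    }
  }
  const succeeded = results.filter((r) => r.status === 'success').length
  return { action, requested: ids.length, succeeded, failed: results.length - succeeded, results }
}
```

The route handler would wrap this summary in the standard envelope and respond with 207 regardless of how many resources failed.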

Safety Features

Dry-Run Mode

Preview which resources would be affected without executing:
POST /v1/compute/servers/bulk
Content-Type: application/json
{
  "action": "power",
  "ids": ["srv_123", "srv_456"],
  "dryRun": true,
  "params": { "action": "off" }
}
Response:
{
  "success": true,
  "data": {
    "action": "power",
    "dryRun": true,
    "wouldAffect": 2,
    "preview": [
      { "id": "srv_123", "name": "node-01", "state": "on" },
      { "id": "srv_456", "name": "node-02", "state": "on" }
    ]
  },
  "meta": {
    "requestId": "kl4c6-1766196422377-0f705e3ef475",
    "timestamp": "2025-01-13T12:00:00.000Z"
  }
}

Rate Limiting

Bulk operations are throttled to prevent resource overload. Default: 10 requests/min.
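
One way to enforce such a limit is a sliding-window counter. The sketch below is an in-memory illustration only: the document does not say what the limit is keyed on, so the per-key scope (for example, per organization) and the `SlidingWindowLimiter` class are assumptions.

```typescript
// In-memory sliding-window limiter: at most `limit` requests per key
// within any `windowMs` interval.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>()
  private readonly limit: number
  private readonly windowMs: number

  constructor(limit: number, windowMs: number) {
    this.limit = limit
    this.windowMs = windowMs
  }

  // Returns true and records the request when `key` is under its limit at `now`.
  allow(key: string, now: number = Date.now()): boolean {
    const recent = (this.hits.get(key) ?? []).filter((t) => now - t < this.windowMs)
    const allowed = recent.length < this.limit
    if (allowed) recent.push(now)
    this.hits.set(key, recent)
    return allowed
  }
}

// Default from the text above: 10 requests per minute.
const bulkLimiter = new SlidingWindowLimiter(10, 60 * 1000)
```

A multi-instance deployment would need shared storage for the counters; this version only works within a single process.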

Implementation Reference


Error Handling

Typed Error Classes

Define semantic error types for consistent error responses:
lib/errors.ts
export class AppError extends Error {
  constructor(
    public code: string,
    message: string,
    public statusCode: number = 500,
    public details?: unknown
  ) {
    super(message)
    this.name = 'AppError'
  }
}

export class ValidationError extends AppError {
  constructor(message: string, details?: unknown) {
    super('VALIDATION_ERROR', message, 400, details)
  }
}

export class NotFoundError extends AppError {
  constructor(resource: string, id: string) {
    super('NOT_FOUND', `${resource} not found: ${id}`, 404)
  }
}

export class ForbiddenError extends AppError {
  constructor(message: string = 'Access denied') {
    super('FORBIDDEN', message, 403)
  }
}

export class ConflictError extends AppError {
  constructor(message: string) {
    super('CONFLICT', message, 409)
  }
}

export class InvalidStateTransitionError extends ConflictError {
  constructor(resource: string, currentState: string, targetState: string) {
    super(
      `Cannot transition ${resource} from ${currentState} to ${targetState}`
    )
    this.code = 'INVALID_STATE_TRANSITION'
  }
}

Global Error Handler

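middleware/error-handler.ts
import type { Context } from 'hono'
import { AppError } from '../lib/errors' // path assumes lib/ and middleware/ are siblings

export function errorHandler(err: Error, c: Context) {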
  const requestId = c.get('requestId')

  // Log error with context
  console.error({
    requestId,
    error: err.message,
    stack: err.stack,
    code: err instanceof AppError ? err.code : 'INTERNAL_ERROR',
    userId: c.get('userId'),
    path: c.req.path,
  })

  // Return typed error response
  if (err instanceof AppError) {
    return c.json(
      {
        success: false,
        error: {
          code: err.code,
          message: err.message,
          details: err.details,
        },
        meta: {
          requestId,
          timestamp: new Date().toISOString(),
        },
      },
      err.statusCode
    )
  }

  // Don't leak internal errors to client
  return c.json(
    {
      success: false,
      error: {
        code: 'INTERNAL_ERROR',
        message: 'An unexpected error occurred',
      },
      meta: {
        requestId,
        timestamp: new Date().toISOString(),
      },
    },
    500
  )
}

// Usage in Hono
app.onError(errorHandler)

Usage in Routes

app.get('/servers/:id', async (c) => {
  const { id } = c.req.param()

  const server = await db.query.servers.findFirst({
    where: eq(schema.servers.id, id),
  })

  if (!server) {
    throw new NotFoundError('Server', id)
  }

  return success(c, server)
})

app.post('/servers/:id/provision', async (c) => {
  const { id } = c.req.param()
  const body = c.req.valid('json')

  const server = await db.query.servers.findFirst({
    where: eq(schema.servers.id, id),
    with: { inventory: true },
  })

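  // Guard against a missing server before reading its inventory
  if (!server) {
    throw new NotFoundError('Server', id)
  }

  if (server.inventory.state !== 'available') {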
    throw new InvalidStateTransitionError(
      'server',
      server.inventory.state,
      'provisioning'
    )
  }

  // ... continue with provisioning
})

Audit Logging

SOC 2 compliant audit logging for all API requests. Every significant action must be traceable to a user and timestamp.

What to Log

| Event Type | Log? | Rationale |
| --- | --- | --- |
| All mutations (POST/PUT/PATCH/DELETE) | ✅ Always | Core audit trail |
| Failed authentication (401) | ✅ Always | Security monitoring |
| Failed authorization (403) | ✅ Always | Access control audit |
| Server errors (5xx) | ✅ Always | Incident response |
| Reads on sensitive resources | ✅ Always | Compliance (see below) |
| General reads (GET) | ⚠️ Optional | High volume; enable for debugging |
| Health/metrics endpoints | ❌ Never | Noise |
For multi-tenant security architecture and authorization patterns, see Auth Architecture.

Sensitive Entities Requiring Audit Logs

These entities require audit logging on all operations, including reads:
| Entity | Why Sensitive | Example Events |
| --- | --- | --- |
| API Keys | Credential access | api_key.created, api_key.viewed, api_key.revoked |
| BMC Credentials | Infrastructure access | bmc_credential.created, bmc_credential.accessed |
| Cluster Credentials | Kubeconfig access | cluster_credential.downloaded |
| SSH Keys | Server access | ssh_key.created, ssh_key.deleted |
| Secrets | User-managed secrets | secret.created, secret.accessed, secret.deleted |
| Organization Members | Access control | member.invited, member.role_changed, member.removed |
| Billing/Payment | Financial data | payment_method.added, invoice.viewed |
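
The two tables above can be combined into a single decision function. This is a sketch, not the actual middleware: the resource type names are taken from the event-name prefixes in the table, and the `/health` and `/metrics` paths are assumed.

```typescript
type AuditDecision = 'always' | 'optional' | 'never'

interface AuditCandidate {
  method: string
  path: string
  statusCode: number
  resourceType?: string
}

// Resource types derived from the event-name prefixes in the table above.
const SENSITIVE_RESOURCES = new Set([
  'api_key',
  'bmc_credential',
  'cluster_credential',
  'ssh_key',
  'secret',
  'member',
  'payment_method',
  'invoice',
])

const MUTATING_METHODS = new Set(['POST', 'PUT', 'PATCH', 'DELETE'])

function auditDecision(req: AuditCandidate): AuditDecision {
  // Assumed paths for health and metrics endpoints.
  if (req.path === '/health' || req.path === '/metrics') return 'never'
  if (MUTATING_METHODS.has(req.method.toUpperCase())) return 'always'
  if (req.statusCode === 401 || req.statusCode === 403) return 'always'
  if (req.statusCode >= 500) return 'always'
  if (req.resourceType !== undefined && SENSITIVE_RESOURCES.has(req.resourceType)) {
    return 'always' // reads on sensitive resources
  }
  return 'optional' // general reads
}
```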

Audit Event Schema

interface AuditEvent {
  id: string // evt_<nanoid>
  timestamp: string // ISO 8601
  requestId: string // Correlation ID

  // Actor
  actor: {
    type: 'user' | 'service' | 'system'
    id: string
    email?: string // For user actors
    service?: string // For service actors
  }

  // Action
  action: string // e.g., "server.provisioned", "api_key.created"
  method: string // HTTP method
  path: string // Request path

  // Resource
  resource: {
    type: string // e.g., "server", "cluster", "api_key"
    id: string
    name?: string
  }

  // Context
  organizationId: string | null

  // Outcome
  outcome: 'success' | 'failure'
  statusCode: number
  errorCode?: string

  // Changes (for mutations)
  changes?: {
    before?: Record<string, unknown>
    after?: Record<string, unknown>
  }

  // Request metadata
  ip: string
  userAgent: string
  duration: number // ms
}

Audit Event Naming Convention

Use past-tense, dot-namespaced actions:
# Resource lifecycle
server.created
server.updated
server.deleted

# State transitions
server.provisioned
server.deprovisioned
cluster.scaled

# Access events
api_key.created
api_key.accessed
api_key.revoked
cluster_credential.downloaded
secret.accessed

# Security events
member.invited
member.role_changed
member.removed
auth.login_failed
auth.login_success

Multi-Tenancy Patterns

Row-Level Security (RLS)

Use PostgreSQL RLS for defense-in-depth isolation:
-- Enable RLS on multi-tenant tables
ALTER TABLE atlas.clusters ENABLE ROW LEVEL SECURITY;

-- Policy: Users see only their org's data
CREATE POLICY clusters_org_isolation ON atlas.clusters
  FOR ALL
  USING (organization_id = current_setting('app.current_org_id')::TEXT);

-- Policy: Service accounts bypass RLS (for background jobs)
CREATE POLICY clusters_service_bypass ON atlas.clusters
  FOR ALL
  USING (current_setting('app.is_service', true)::BOOLEAN = true);

Setting Context Per Request

middleware/rls-context.ts
export async function setRLSContext(db: Database, orgId: string) {
  await db.execute(sql`SELECT set_config('app.current_org_id', ${orgId}, true)`)
}

// In route middleware
app.use('*', async (c, next) => {
  const orgId = c.req.header('X-Org-ID')
  if (orgId) {
    await setRLSContext(db, orgId)
  }
  await next()
})
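Critical: RLS context is set per-transaction. Because set_config(..., true) is transaction-local, the setting lasts only until the current transaction ends. With connection pooling, set the context at the start of each request and run that request's queries inside the same transaction, for example with Drizzle’s transaction() helper; otherwise the setting is discarded before the next query runs.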
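
A minimal sketch of keeping the context and the queries in one transaction. The `Tx` and `Db` interfaces below are simplified stand-ins, not Drizzle's actual types, and `withOrgContext` is a hypothetical helper.

```typescript
// Simplified stand-ins for a transaction-capable database client.
interface Tx {
  execute(query: string, params: unknown[]): Promise<unknown>
}

interface Db {
  transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>
}

// Sets the tenant context and runs `work` inside the same transaction,
// so the transaction-local setting is still in effect for every query.
async function withOrgContext<T>(
  db: Db,
  orgId: string,
  work: (tx: Tx) => Promise<T>
): Promise<T> {
  return db.transaction(async (tx) => {
    // is_local = true scopes the setting to this transaction only.
    await tx.execute("SELECT set_config('app.current_org_id', $1, true)", [orgId])
    return work(tx)
  })
}
```

A route would then call `withOrgContext(db, orgId, (tx) => ...)` and issue all tenant-scoped queries through `tx`, with `orgId` taken from the authenticated session where possible.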

Decision Log

Response Envelope Pattern

Decision: Discriminated union with success: boolean over separate success/error types
Rationale:
  • TypeScript discriminated unions provide excellent type narrowing
  • Client code: if (response.success) gets correct types
  • Consistent structure across all endpoints
  • Easier to generate TypeScript clients
Trade-off: Slightly more verbose than HTTP-only error signaling. Type safety worth it.

Resource ID Format

Decision: Prefixed nanoid (srv_abc123, cls_xyz789) over UUIDs or numeric IDs
Rationale:
  • Human-readable in logs
  • Immediately identify resource type
  • URL-safe
  • Short enough for display
  • Low collision probability
Trade-off: Slightly longer than pure nanoid. Worth it for debugging and log correlation.

Action Endpoints

Decision: Dedicated POST endpoints (/power, /provision) over overloading PATCH
Rationale:
  • Semantic clarity: POST /power action=reboot clearer than PATCH { online: true }
  • Action-specific parameters (e.g., force, imageUrl)
  • Better audit trail: “POST /power action=off” vs “PATCH with field changes”
  • Granular permissions: servers:lifecycle vs servers:update
Trade-off: Slightly more endpoints. Worth it for clarity and permissions.

Bulk Operation Responses

Decision: Always 207 Multi-Status with per-resource results (not 200 OK with mixed results or fail-entire-operation)
Rationale:
  • Partial success is common in bulk operations
  • Client needs to know which specific resources succeeded/failed
  • Failing entire operation for one resource is poor UX
  • 207 status code semantically correct for mixed outcomes
Trade-off: None significant. Standard practice for bulk operations.

Async Operation Default

Decision: Return 202 immediately (not synchronous with long timeouts)
Rationale:
  • Infrastructure operations take 30s to 30min
  • Prevents HTTP timeouts and connection issues
  • Allows UI to show progress
  • Supports horizontal scaling (request and execution on different instances)
  • Better observability via workflow tracking
Trade-off: Requires more client code. Mitigated by SDKs and clear polling patterns.

Testing

Test Structure

All API endpoints should have integration tests covering:
  1. Happy path: Successful requests with expected responses
  2. Validation: Invalid inputs return appropriate errors
  3. Authorization: Unauthorized users receive 403
  4. State transitions: Invalid state transitions are rejected
  5. Edge cases: Empty lists, missing resources, etc.

Test Helpers

tests/helpers/api.ts
import { app } from '../../src/app' // hypothetical path to the Hono app instance

export function createTestClient(options?: {
  userId?: string
  orgId?: string
  roles?: string[]
}) {
  return {
    async get(path: string) {
      const req = new Request(`http://localhost${path}`, {
        method: 'GET',
        headers: {
          'X-User-ID': options?.userId || 'test-user',
          'X-Org-ID': options?.orgId || 'test-org',
          'X-Roles': JSON.stringify(options?.roles || ['admin']),
        },
      })
      return app.fetch(req)
    },

    async post(path: string, body: unknown) {
      const req = new Request(`http://localhost${path}`, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'X-User-ID': options?.userId || 'test-user',
          'X-Org-ID': options?.orgId || 'test-org',
          'X-Roles': JSON.stringify(options?.roles || ['admin']),
        },
        body: JSON.stringify(body),
      })
      return app.fetch(req)
    },
  }
}

Example Test

// tests/api/servers.test.ts
import { describe, it, expect, beforeAll, afterAll } from 'vitest'
import { createTestClient } from '../helpers/api'

describe('Servers API', () => {
  let client: ReturnType<typeof createTestClient>

  beforeAll(async () => {
    client = createTestClient({ roles: ['provider_operator'] })
  })

  it('lists available servers', async () => {
    // Pass only the path: the test client already prepends http://localhost
    const res = await client.get('/v1/compute/servers?state=available')
    expect(res.status).toBe(200)

    const body = await res.json()
    expect(body.success).toBe(true)
    expect(body.data).toBeInstanceOf(Array)
    expect(body.meta.requestId).toBeDefined()
  })

  it('rejects invalid state filter', async () => {
    // Pass only the path: the test client already prepends http://localhost
    const res = await client.get('/v1/compute/servers?state=invalid')
    expect(res.status).toBe(400)

    const body = await res.json()
    expect(body.success).toBe(false)
    expect(body.error.code).toBe('VALIDATION_ERROR')
  })
})

  • Specification - Complete API specification with detailed endpoint examples
  • Auth Architecture - Authentication, authorization, and multi-tenant security patterns
  • Data Ownership - Implementation patterns, workflow orchestration, and development guidance
  • Data Model - Database schema and relationships