Workflows - k0rdent AI

Workflows API

Workflow orchestration for long-running operations. Built on Trigger.dev.

Design Principle: All infrastructure mutations flow through workflows. This provides durable execution, automatic retries, audit trails, and observability. Workflows double as the event log for the platform.

Workflow Runs

Query and manage workflow executions.

GET    /v1/workflows/runs
GET    /v1/workflows/runs/:id
POST   /v1/workflows/runs/:id/cancel
POST   /v1/workflows/runs/:id/retry

List Runs

GET /v1/workflows/runs?type=cluster.create&status=failed&limit=20

Query Parameters:

Parameter	Type	Description
`type`	string	Filter by workflow type
`status`	string	`pending`, `running`, `completed`, `failed`, `cancelled`
`resourceType`	string	`server`, `cluster`, `vm`
`resourceId`	string	Specific resource ID
`triggeredBy`	string	User ID who triggered
`since`	ISO date	Runs started after this time
`until`	ISO date	Runs started before this time
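As a rough illustration of how these filters combine into a request path, the sketch below serializes a filter object into a query string. The `RunFilters` type and `buildRunsPath` helper are hypothetical names, not part of the API. `limit` is included because it appears in the example request, even though the table does not list it.

```typescript
// Hypothetical filter shape mirroring the query parameters documented above.
// `limit` comes from the example request, not from the parameter table.
interface RunFilters {
  type?: string;
  status?: "pending" | "running" | "completed" | "failed" | "cancelled";
  resourceType?: "server" | "cluster" | "vm";
  resourceId?: string;
  triggeredBy?: string;
  since?: string; // ISO date
  until?: string; // ISO date
  limit?: number;
}

// Serialize only the filters that are set, preserving insertion order.
function buildRunsPath(filters: RunFilters): string {
  const pairs = Object.entries(filters)
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`);
  return pairs.length > 0 ? `/v1/workflows/runs?${pairs.join("&")}` : "/v1/workflows/runs";
}

console.log(buildRunsPath({ type: "cluster.create", status: "failed", limit: 20 }));
// prints "/v1/workflows/runs?type=cluster.create&status=failed&limit=20"
```

Skipping unset filters keeps the request identical to a hand-written one, so it is easy to compare against the example above.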

Response:

{
  "success": true,
  "data": [
    {
      "id": "run_abc123",
      "type": "cluster.create",
      "status": "failed",
      "resourceType": "cluster",
      "resourceId": "cls_xyz789",
      "triggeredBy": "usr_operator",
      "startedAt": "2025-01-12T10:00:00Z",
      "completedAt": "2025-01-12T10:05:23Z",
      "durationMs": 323000,
      "error": {
        "code": "POOL_CAPACITY_EXCEEDED",
        "message": "Not enough available servers in pool gpu-h100-pool",
        "step": "allocate-resources"
      },
      "steps": [
        { "name": "validate", "status": "completed", "durationMs": 45 },
        { "name": "create-record", "status": "completed", "durationMs": 120 },
        {
          "name": "allocate-resources",
          "status": "failed",
          "durationMs": 322835
        }
      ]
    }
  ],
  "meta": {
    "requestId": "kl4c6-1766196422377-0f705e3ef475",
    "timestamp": "2025-01-13T12:00:00.000Z",
    "pagination": {
      "total": 47,
      "page": 1,
      "pageSize": 20,
      "hasMore": true,
      "nextCursor": "eyJpZCI6InJ1bl94eXoifQ"
    }
  }
}
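A client can tie the `error.step` field back to the matching entry in `steps` to report where a run failed and how long that step ran. The sketch below is one way to do that; the type and function names are illustrative and cover only the fields shown in the response above.

```typescript
// Minimal types covering only the fields used from the list response above.
interface StepSummary { name: string; status: string; durationMs?: number; }
interface RunError { code: string; message: string; step: string; }
interface RunSummary {
  id: string;
  type: string;
  status: string;
  error?: RunError | null;
  steps: StepSummary[];
}

// Describe where and why a run failed, or return null for runs without an error.
function describeFailure(run: RunSummary): string | null {
  if (run.status !== "failed" || !run.error) return null;
  const error = run.error;
  const failedStep = run.steps.find((s) => s.name === error.step);
  const ms = failedStep?.durationMs;
  const spent = ms !== undefined ? ` after ${ms} ms` : "";
  return `${run.id} (${run.type}) failed at "${error.step}"${spent}: ${error.code}`;
}

const failedRun: RunSummary = {
  id: "run_abc123",
  type: "cluster.create",
  status: "failed",
  error: {
    code: "POOL_CAPACITY_EXCEEDED",
    message: "Not enough available servers in pool gpu-h100-pool",
    step: "allocate-resources",
  },
  steps: [
    { name: "validate", status: "completed", durationMs: 45 },
    { name: "create-record", status: "completed", durationMs: 120 },
    { name: "allocate-resources", status: "failed", durationMs: 322835 },
  ],
};

console.log(describeFailure(failedRun));
// prints: run_abc123 (cluster.create) failed at "allocate-resources" after 322835 ms: POOL_CAPACITY_EXCEEDED
```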

Get Run Details

GET /v1/workflows/runs/run_abc123

Response:

{
  "success": true,
  "data": {
    "id": "run_abc123",
    "type": "cluster.create",
    "status": "completed",
    "resourceType": "cluster",
    "resourceId": "cls_xyz789",
    "triggeredBy": "usr_operator",
    "triggeredFrom": "arc",
    "organizationId": "org_acme",
    "startedAt": "2025-01-12T10:00:00Z",
    "completedAt": "2025-01-12T10:02:34Z",
    "durationMs": 154000,
    "input": {
      "name": "ai-training-cluster",
      "projectId": "proj_abc",
      "stackId": "stk_k8s",
      "config": {
        "controlPlane": { "count": 3 },
        "workers": { "count": 10 }
      }
    },
    "output": {
      "clusterId": "cls_xyz789",
      "allocatedServers": ["srv_1", "srv_2", "srv_3", "..."],
      "kubeApiEndpoint": "https://cls-xyz789.k8s.example.com:6443"
    },
    "steps": [
      {
        "name": "validate",
        "status": "completed",
        "startedAt": "2025-01-12T10:00:00Z",
        "completedAt": "2025-01-12T10:00:00.045Z",
        "durationMs": 45
      },
      {
        "name": "create-record",
        "status": "completed",
        "startedAt": "2025-01-12T10:00:00.045Z",
        "completedAt": "2025-01-12T10:00:00.165Z",
        "durationMs": 120
      },
      {
        "name": "allocate-resources",
        "status": "completed",
        "startedAt": "2025-01-12T10:00:00.165Z",
        "completedAt": "2025-01-12T10:00:02.341Z",
        "durationMs": 2176,
        "output": {
          "allocatedCount": 13,
          "allocationIds": ["alloc_1", "alloc_2", "..."]
        }
      },
      {
        "name": "apply-k8s-resources",
        "status": "completed",
        "startedAt": "2025-01-12T10:00:02.341Z",
        "completedAt": "2025-01-12T10:00:05.678Z",
        "durationMs": 3337
      },
      {
        "name": "wait-for-ready",
        "status": "completed",
        "startedAt": "2025-01-12T10:00:05.678Z",
        "completedAt": "2025-01-12T10:02:34.000Z",
        "durationMs": 148322
      }
    ],
    "retryOf": null,
    "retriedBy": null
  },
  "meta": {
    "requestId": "kl4c6-1766196422377-0f705e3ef475",
    "timestamp": "2025-01-13T12:00:00.000Z"
  }
}
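The per-step `durationMs` values make it straightforward to see where a run spent its time. As a small sketch (the `slowestStep` helper is a hypothetical name, not an API feature), the following finds the slowest step and its share of the total run duration, using the step timings from the response above.

```typescript
interface StepTiming { name: string; durationMs: number; }

// Return the slowest step and its share of the run's total duration (0 to 1).
function slowestStep(
  steps: StepTiming[],
  totalMs: number,
): { name: string; share: number } | null {
  if (steps.length === 0 || totalMs <= 0) return null;
  const slowest = steps.reduce((a, b) => (b.durationMs > a.durationMs ? b : a));
  return { name: slowest.name, share: slowest.durationMs / totalMs };
}

// Step timings copied from the sample response.
const sampleSteps: StepTiming[] = [
  { name: "validate", durationMs: 45 },
  { name: "create-record", durationMs: 120 },
  { name: "allocate-resources", durationMs: 2176 },
  { name: "apply-k8s-resources", durationMs: 3337 },
  { name: "wait-for-ready", durationMs: 148322 },
];

const slowest = slowestStep(sampleSteps, 154000);
if (slowest) {
  console.log(`${slowest.name}: ${(slowest.share * 100).toFixed(1)}%`);
  // prints "wait-for-ready: 96.3%"
}
```

In this sample, almost all of the run is spent waiting for the cluster to become ready rather than in the API-driven steps.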

Retry Failed Run

POST /v1/workflows/runs/run_abc123/retry
Content-Type: application/json

{
  "fromStep": "allocate-resources",
  "inputOverrides": {
    "config": {
      "workers": { "count": 5 }
    }
  }
}

Response:

{
  "success": true,
  "data": {
    "id": "run_def456",
    "type": "cluster.create",
    "status": "pending",
    "retryOf": "run_abc123",
    "startedAt": "2025-01-12T11:00:00Z"
  },
  "meta": {
    "requestId": "kl4c6-1766196422377-0f705e3ef475",
    "timestamp": "2025-01-13T12:00:00.000Z"
  }
}

Design Decision: Retries create new run records linked to the original, which preserves audit history. `fromStep` resumes the run from a specific step when the earlier steps succeeded.
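Because each retry is a separate record that points at its predecessor through `retryOf`, a client can rebuild the full attempt history by walking those links. The sketch below assumes the client already holds the relevant run records; `RunLink` and `retryChain` are illustrative names.

```typescript
interface RunLink { id: string; retryOf: string | null; }

// Walk `retryOf` links from a run back to the original attempt.
// Returns run ids ordered oldest first; stops on unknown ids or cycles.
function retryChain(runs: RunLink[], id: string): string[] {
  const byId = new Map<string, RunLink>();
  for (const run of runs) byId.set(run.id, run);

  const chain: string[] = [];
  const seen = new Set<string>();
  let current: string | null = id;
  while (current !== null && !seen.has(current)) {
    const run = byId.get(current);
    if (!run) break;
    seen.add(current);
    chain.unshift(run.id);
    current = run.retryOf;
  }
  return chain;
}

// Ids taken from the retry example above.
const knownRuns: RunLink[] = [
  { id: "run_abc123", retryOf: null },
  { id: "run_def456", retryOf: "run_abc123" },
];

console.log(retryChain(knownRuns, "run_def456"));
// prints [ 'run_abc123', 'run_def456' ]
```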

Workflow Types

Available workflow types and their purposes:

Type	Trigger Source	Description
`server.register`	Atlas API	Register new bare metal server
`server.inspect`	Atlas API	Hardware inspection
`server.provision`	Atlas API	OS provisioning
`server.lifecycle`	Atlas/Arc API	Lifecycle actions (power, provision, etc.)
`server.decommission`	Atlas API	Remove from inventory
`cluster.create`	Arc API	Create Kubernetes cluster
`cluster.scale`	Arc API	Scale cluster nodes
`cluster.upgrade`	Arc API	Upgrade Kubernetes version
`cluster.delete`	Arc API	Delete cluster
`vm.create`	Arc API	Create virtual machine
`vm.power`	Arc API	VM power actions
`vm.delete`	Arc API	Delete VM
`sync.server-state`	Projector	Sync K8s state to cache
`sync.cluster-state`	Projector	Sync cluster state to cache
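Clients that filter or dispatch on workflow type may want the table above in typed form. The following is a minimal sketch: the constant, the derived `WorkflowType` union, and the `isWorkflowType` guard are illustrative constructs, with the values copied from the table.

```typescript
// Trigger sources copied from the table above; the map itself is illustrative.
const WORKFLOW_TRIGGER_SOURCES = {
  "server.register": "Atlas API",
  "server.inspect": "Atlas API",
  "server.provision": "Atlas API",
  "server.lifecycle": "Atlas/Arc API",
  "server.decommission": "Atlas API",
  "cluster.create": "Arc API",
  "cluster.scale": "Arc API",
  "cluster.upgrade": "Arc API",
  "cluster.delete": "Arc API",
  "vm.create": "Arc API",
  "vm.power": "Arc API",
  "vm.delete": "Arc API",
  "sync.server-state": "Projector",
  "sync.cluster-state": "Projector",
} as const;

// Union of all known workflow type strings, derived from the map keys.
type WorkflowType = keyof typeof WORKFLOW_TRIGGER_SOURCES;

// Type guard: narrow an arbitrary string to a known workflow type.
function isWorkflowType(value: string): value is WorkflowType {
  return Object.prototype.hasOwnProperty.call(WORKFLOW_TRIGGER_SOURCES, value);
}

const candidate: string = "cluster.create";
if (isWorkflowType(candidate)) {
  console.log(`${candidate} is triggered from ${WORKFLOW_TRIGGER_SOURCES[candidate]}`);
}
```

Deriving the union from the map keeps the list of types in one place, so adding a row only requires one change.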

Workflow Metrics

Aggregate statistics for workflow performance.

GET    /v1/workflows/metrics
GET    /v1/workflows/metrics/:type

Example:

GET /v1/workflows/metrics?since=2025-01-01&type=cluster.create

Response:

{
  "success": true,
  "data": {
    "type": "cluster.create",
    "period": {
      "from": "2025-01-01T00:00:00Z",
      "to": "2025-01-12T23:59:59Z"
    },
    "summary": {
      "total": 156,
      "completed": 142,
      "failed": 12,
      "cancelled": 2,
      "successRate": 0.91
    },
    "timing": {
      "p50Ms": 145000,
      "p90Ms": 312000,
      "p99Ms": 545000,
      "avgMs": 178000
    },
    "failureReasons": [
      { "code": "POOL_CAPACITY_EXCEEDED", "count": 8 },
      { "code": "BMC_CONNECTION_FAILED", "count": 3 },
      { "code": "QUOTA_EXCEEDED", "count": 1 }
    ]
  },
  "meta": {
    "requestId": "kl4c6-1766196422377-0f705e3ef475",
    "timestamp": "2025-01-13T12:00:00.000Z"
  }
}
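The summary fields in this response are related by simple arithmetic, which a client can use as a sanity check. The sketch below recomputes `successRate` from the counts and finds the dominant failure code. Rounding to two decimals is an assumption that happens to match the sample (142 / 156 ≈ 0.91); the helper names are hypothetical.

```typescript
interface MetricsSummary { total: number; completed: number; failed: number; cancelled: number; }
interface FailureReason { code: string; count: number; }

// Success rate as completed / total, rounded to two decimals (assumed rounding).
function successRate(summary: MetricsSummary): number {
  if (summary.total === 0) return 0;
  return Math.round((summary.completed / summary.total) * 100) / 100;
}

// Most frequent failure code and its fraction of all failed runs.
function topFailure(
  reasons: FailureReason[],
  failed: number,
): { code: string; share: number } | null {
  if (reasons.length === 0 || failed <= 0) return null;
  const top = reasons.reduce((a, b) => (b.count > a.count ? b : a));
  return { code: top.code, share: top.count / failed };
}

// Values copied from the sample response.
const sampleSummary: MetricsSummary = { total: 156, completed: 142, failed: 12, cancelled: 2 };
const sampleReasons: FailureReason[] = [
  { code: "POOL_CAPACITY_EXCEEDED", count: 8 },
  { code: "BMC_CONNECTION_FAILED", count: 3 },
  { code: "QUOTA_EXCEEDED", count: 1 },
];

console.log(successRate(sampleSummary)); // prints 0.91
console.log(topFailure(sampleReasons, sampleSummary.failed)?.code); // prints "POOL_CAPACITY_EXCEEDED"
```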

Workflow Operations

Long-running operations that interact with infrastructure (BMC, Kubernetes) return 202 Accepted immediately with a Trigger.dev workflow ID for tracking. All infrastructure mutations flow through durable Trigger.dev workflows.

Design Principles

Immediate Response: Return 202 in under 1 second; don’t wait for completion
Workflow ID: Provide Trigger.dev run ID for polling or webhook correlation
Estimated Duration: Give clients a hint for progress UI
Status Endpoint: Query workflow status via /v1/workflows/runs/:id
Webhook Integration: Support webhooks for completion notifications
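From the client's side, the principles above amount to: receive a workflow ID with the 202 response, then poll the status endpoint until the run reaches a terminal status. A minimal polling sketch follows. The `getRun` callback is injected and hypothetical; it stands in for however the caller performs `GET /v1/workflows/runs/:id`. The interval and attempt budget are illustrative choices, not documented values.

```typescript
type RunStatus = "pending" | "running" | "completed" | "failed" | "cancelled";
interface RunState { id: string; status: RunStatus; }

// Statuses after which a run will not change again.
const TERMINAL_STATUSES: ReadonlySet<RunStatus> = new Set<RunStatus>([
  "completed",
  "failed",
  "cancelled",
]);

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll until the run reaches a terminal status or the attempt budget runs out.
// `getRun` is supplied by the caller and wraps the status endpoint request.
async function pollUntilDone(
  getRun: (id: string) => Promise<RunState>,
  id: string,
  opts: { intervalMs: number; maxAttempts: number },
): Promise<RunState> {
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const run = await getRun(id);
    if (TERMINAL_STATUSES.has(run.status)) return run;
    if (attempt < opts.maxAttempts) await sleep(opts.intervalMs);
  }
  throw new Error(`Run ${id} still in progress after ${opts.maxAttempts} polls`);
}
```

Where webhooks are available, they avoid the polling loop entirely; polling remains a reasonable fallback for scripts and CLIs.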

Workflow Orchestration

Use Trigger.dev for durable, retryable task execution.

Pattern: Compensating Actions. Use `onFailure` to clean up partial state: release allocated resources, update status to error, and notify via webhook.
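To make the compensating-action idea concrete, here is a framework-free sketch in plain TypeScript. It deliberately does not use the Trigger.dev SDK, so it should be read as an illustration of the pattern rather than the platform's implementation. `CompensableStep` and `runWithCompensation` are hypothetical names.

```typescript
// A step does some work and optionally knows how to undo it.
interface CompensableStep {
  name: string;
  run: () => Promise<void>;
  compensate?: () => Promise<void>;
}

// Run steps in order. If one throws, undo the already completed steps in
// reverse order, then rethrow so the caller can mark the run as failed
// and send its notification.
async function runWithCompensation(steps: CompensableStep[]): Promise<void> {
  const done: CompensableStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch (err) {
      for (const finished of done.reverse()) {
        if (finished.compensate) await finished.compensate();
      }
      throw err;
    }
  }
}
```

Undoing in reverse order matters when later steps depend on earlier ones, for example releasing server allocations only after the records that reference them have been marked as errored.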

Workflow Status Endpoint

GET /v1/workflows/runs/:id

// Response
{
  "success": true,
  "data": {
    "id": "run_deploy_789",
    "status": "running",           // pending, running, completed, failed, cancelled
    "progress": 45,                // 0-100
    "startedAt": "2025-01-09T12:00:00.000Z",
    "estimatedCompletionAt": "2025-01-09T12:05:00.000Z",
    "steps": [
      { "name": "validate", "status": "completed", "completedAt": "2025-01-09T12:00:15.000Z" },
      { "name": "provision", "status": "running", "startedAt": "2025-01-09T12:00:15.000Z" },
      { "name": "configure", "status": "pending" }
    ],
    "error": null
  },
  "meta": {
    "requestId": "req_abc",
    "timestamp": "2025-01-09T12:01:00.000Z"
  }
}
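A progress UI can combine `progress`, the currently running step, and `estimatedCompletionAt` into a single label. The sketch below shows one possible rendering; `progressLabel` and its types are illustrative, and `now` is passed in explicitly so the function stays deterministic.

```typescript
interface StatusStep { name: string; status: string; }
interface RunProgress {
  status: string;
  progress: number; // 0-100
  estimatedCompletionAt: string | null;
  steps: StatusStep[];
}

// Build a one-line progress label from a status response.
function progressLabel(run: RunProgress, now: Date): string {
  const active = run.steps.find((s) => s.status === "running");
  const parts = [`${run.progress}%`];
  if (active) parts.push(`step: ${active.name}`);
  if (run.estimatedCompletionAt) {
    const remainingMs = Date.parse(run.estimatedCompletionAt) - now.getTime();
    parts.push(`~${Math.max(0, Math.round(remainingMs / 1000))}s remaining`);
  }
  return parts.join(", ");
}

// Values copied from the sample response; `now` is the response's meta.timestamp.
const sampleRun: RunProgress = {
  status: "running",
  progress: 45,
  estimatedCompletionAt: "2025-01-09T12:05:00.000Z",
  steps: [
    { name: "validate", status: "completed" },
    { name: "provision", status: "running" },
    { name: "configure", status: "pending" },
  ],
};

console.log(progressLabel(sampleRun, new Date("2025-01-09T12:01:00.000Z")));
// prints "45%, step: provision, ~240s remaining"
```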

Event Sourcing Pattern

The Workflows API still needs to be updated to use an event sourcing pattern.
An open decision is whether to base it on K8s Informers, Watchers, or a Controller-based approach.

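// Event sourcing pattern (planned). getWorkflow, getEvents, and getState are
// placeholder helpers that do not exist yet.
const workflow = await getWorkflow(workflowId); // run metadata
const events = await getEvents(workflowId);     // ordered event log for the run
const state = await getState(workflowId);       // current state of the run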
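For readers unfamiliar with event sourcing, the core idea is that current state is never stored directly but derived by replaying an ordered event log. The sketch below illustrates this with a pure reducer. Since the design is still open, every event name and field here is a hypothetical placeholder, not a proposed schema.

```typescript
// Hypothetical event shapes; the real event schema has not been decided.
type WorkflowEvent =
  | { kind: "run.started"; at: string }
  | { kind: "step.completed"; step: string; at: string }
  | { kind: "run.failed"; step: string; code: string; at: string }
  | { kind: "run.completed"; at: string };

interface WorkflowState {
  status: "pending" | "running" | "completed" | "failed";
  completedSteps: string[];
  errorCode: string | null;
}

const initialState: WorkflowState = { status: "pending", completedSteps: [], errorCode: null };

// Pure reducer: apply one event to the previous state without mutating it.
function applyEvent(state: WorkflowState, event: WorkflowEvent): WorkflowState {
  switch (event.kind) {
    case "run.started":
      return { ...state, status: "running" };
    case "step.completed":
      return { ...state, completedSteps: [...state.completedSteps, event.step] };
    case "run.failed":
      return { ...state, status: "failed", errorCode: event.code };
    case "run.completed":
      return { ...state, status: "completed" };
  }
}

// Current state is a left fold of the event log over the initial state.
function replay(events: WorkflowEvent[]): WorkflowState {
  return events.reduce(applyEvent, initialState);
}
```

Under this model, something like `getState(workflowId)` would be equivalent to replaying the result of `getEvents(workflowId)`, which is what makes the event log usable as an audit trail.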
