
Workflows API

Workflow orchestration for long-running operations. Built on Trigger.dev.
Design Principle: All infrastructure mutations flow through workflows. This provides durable execution, automatic retries, audit trails, and observability. Workflows double as the event log for the platform.

Workflow Runs

Query and manage workflow executions.
GET    /v1/workflows/runs
GET    /v1/workflows/runs/:id
POST   /v1/workflows/runs/:id/cancel
POST   /v1/workflows/runs/:id/retry
Example:
GET /v1/workflows/runs?type=cluster.create&status=failed&limit=20
Query Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Filter by workflow type |
| status | string | pending, running, completed, failed, cancelled |
| resourceType | string | server, cluster, vm |
| resourceId | string | Specific resource ID |
| triggeredBy | string | User ID who triggered |
| since | ISO date | Runs started after this time |
| until | ISO date | Runs started before this time |
Response:
{
  "success": true,
  "data": [
    {
      "id": "run_abc123",
      "type": "cluster.create",
      "status": "failed",
      "resourceType": "cluster",
      "resourceId": "cls_xyz789",
      "triggeredBy": "usr_operator",
      "startedAt": "2025-01-12T10:00:00Z",
      "completedAt": "2025-01-12T10:05:23Z",
      "durationMs": 323000,
      "error": {
        "code": "POOL_CAPACITY_EXCEEDED",
        "message": "Not enough available servers in pool gpu-h100-pool",
        "step": "allocate-resources"
      },
      "steps": [
        { "name": "validate", "status": "completed", "durationMs": 45 },
        { "name": "create-record", "status": "completed", "durationMs": 120 },
        {
          "name": "allocate-resources",
          "status": "failed",
          "durationMs": 322835
        }
      ]
    }
  ],
  "meta": {
    "requestId": "kl4c6-1766196422377-0f705e3ef475",
    "timestamp": "2025-01-13T12:00:00.000Z",
    "pagination": {
      "total": 47,
      "page": 1,
      "pageSize": 20,
      "hasMore": true,
      "nextCursor": "eyJpZCI6InJ1bl94eXoifQ"
    }
  }
}
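The filters above combine into a single query string. The following is a minimal sketch, assuming a hypothetical `buildRunsUrl` helper that is not part of the API. The `limit` parameter appears only in the example request, so its exact semantics here are an assumption.

```typescript
// Hypothetical helper: builds a /v1/workflows/runs URL from optional filters.
// Filter names mirror the documented query parameters; `limit` is taken from
// the example request and is otherwise undocumented.
type RunStatus = "pending" | "running" | "completed" | "failed" | "cancelled";

interface RunFilters {
  type?: string;
  status?: RunStatus;
  resourceType?: "server" | "cluster" | "vm";
  resourceId?: string;
  triggeredBy?: string;
  since?: string; // ISO date
  until?: string; // ISO date
  limit?: number;
}

function buildRunsUrl(filters: RunFilters): string {
  // Skip unset filters and percent-encode everything else.
  const query = Object.entries(filters)
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`)
    .join("&");
  return query ? `/v1/workflows/runs?${query}` : "/v1/workflows/runs";
}

console.log(buildRunsUrl({ type: "cluster.create", status: "failed", limit: 20 }));
```

Timestamps passed as `since` or `until` are percent-encoded, so colons in an ISO date are sent as `%3A`.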

Workflow Types

Available workflow types and their purposes:
| Type | Trigger Source | Description |
| --- | --- | --- |
| server.register | Atlas API | Register new bare metal server |
| server.inspect | Atlas API | Hardware inspection |
| server.provision | Atlas API | OS provisioning |
| server.lifecycle | Atlas/Arc API | Lifecycle actions (power, provision, etc.) |
| server.decommission | Atlas API | Remove from inventory |
| cluster.create | Arc API | Create Kubernetes cluster |
| cluster.scale | Arc API | Scale cluster nodes |
| cluster.upgrade | Arc API | Upgrade Kubernetes version |
| cluster.delete | Arc API | Delete cluster |
| vm.create | Arc API | Create virtual machine |
| vm.power | Arc API | VM power actions |
| vm.delete | Arc API | Delete VM |
| sync.server-state | Projector | Sync K8s state to cache |
| sync.cluster-state | Projector | Sync cluster state to cache |
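Client code that filters by `type` can treat the workflow types listed above as a closed set. The sketch below copies the type names from the list verbatim. The `isWorkflowType` and `splitWorkflowType` helpers are hypothetical and only illustrate one way to catch typos before a request is sent.

```typescript
// Workflow type names copied from the list above.
const WORKFLOW_TYPES = [
  "server.register",
  "server.inspect",
  "server.provision",
  "server.lifecycle",
  "server.decommission",
  "cluster.create",
  "cluster.scale",
  "cluster.upgrade",
  "cluster.delete",
  "vm.create",
  "vm.power",
  "vm.delete",
  "sync.server-state",
  "sync.cluster-state",
] as const;

type WorkflowType = (typeof WORKFLOW_TYPES)[number];

// Runtime guard for values that arrive as plain strings (e.g. from a URL).
function isWorkflowType(value: string): value is WorkflowType {
  return (WORKFLOW_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Splits a type at its first dot, e.g. "cluster.create" -> category + action.
function splitWorkflowType(type: WorkflowType): { category: string; action: string } {
  const dot = type.indexOf(".");
  return { category: type.slice(0, dot), action: type.slice(dot + 1) };
}

console.log(isWorkflowType("cluster.create"), splitWorkflowType("cluster.create"));
```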

Workflow Metrics

Aggregate statistics for workflow performance.
GET    /v1/workflows/metrics
GET    /v1/workflows/metrics/:type
Example:
GET /v1/workflows/metrics?since=2025-01-01&type=cluster.create
Response:
{
  "success": true,
  "data": {
    "type": "cluster.create",
    "period": {
      "from": "2025-01-01T00:00:00Z",
      "to": "2025-01-12T23:59:59Z"
    },
    "summary": {
      "total": 156,
      "completed": 142,
      "failed": 12,
      "cancelled": 2,
      "successRate": 0.91
    },
    "timing": {
      "p50Ms": 145000,
      "p90Ms": 312000,
      "p99Ms": 545000,
      "avgMs": 178000
    },
    "failureReasons": [
      { "code": "POOL_CAPACITY_EXCEEDED", "count": 8 },
      { "code": "BMC_CONNECTION_FAILED", "count": 3 },
      { "code": "QUOTA_EXCEEDED", "count": 1 }
    ]
  },
  "meta": {
    "requestId": "kl4c6-1766196422377-0f705e3ef475",
    "timestamp": "2025-01-13T12:00:00.000Z"
  }
}
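The `summary` figures in this response are internally consistent, which the sketch below checks. It assumes `successRate` is `completed / total` rounded to two decimals and that the `failureReasons` counts add up to `failed`. Both assumptions match the sample numbers but are not stated by the API.

```typescript
interface RunSummary {
  total: number;
  completed: number;
  failed: number;
  cancelled: number;
}

interface FailureReason {
  code: string;
  count: number;
}

// Assumption: successRate = completed / total, rounded to two decimals.
function successRate(summary: RunSummary): number {
  if (summary.total === 0) return 0;
  return Math.round((summary.completed / summary.total) * 100) / 100;
}

// True when the per-code failure counts add up to the reported failed total.
function failureCountsMatch(summary: RunSummary, reasons: FailureReason[]): boolean {
  const counted = reasons.reduce((sum, reason) => sum + reason.count, 0);
  return counted === summary.failed;
}

const summary: RunSummary = { total: 156, completed: 142, failed: 12, cancelled: 2 };
const reasons: FailureReason[] = [
  { code: "POOL_CAPACITY_EXCEEDED", count: 8 },
  { code: "BMC_CONNECTION_FAILED", count: 3 },
  { code: "QUOTA_EXCEEDED", count: 1 },
];

console.log(successRate(summary), failureCountsMatch(summary, reasons));
```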

Workflow Operations

Long-running operations that interact with infrastructure (BMC, Kubernetes) return 202 Accepted immediately with a Trigger.dev workflow ID for tracking. All infrastructure mutations flow through durable Trigger.dev workflows.

Design Principles

  1. Immediate Response: Return 202 in under 1 second; don’t wait for completion
  2. Workflow ID: Provide Trigger.dev run ID for polling or webhook correlation
  3. Estimated Duration: Give clients a hint for progress UI
  4. Status Endpoint: Query workflow status via /v1/workflows/runs/:id
  5. Webhook Integration: Support webhooks for completion notifications
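From the client's side, principles 1, 2, and 4 amount to a polling loop: take the run ID from the 202 response, then query the status endpoint until the run finishes. The sketch below is one possible shape, not a prescribed client. `pollRun`, `backoffMs`, and the injected `fetchRun` function are hypothetical names, and the backoff values are illustrative.

```typescript
type RunStatus = "pending" | "running" | "completed" | "failed" | "cancelled";

interface RunSnapshot {
  id: string;
  status: RunStatus;
}

// A run is finished once it reaches one of these statuses.
function isTerminal(status: RunStatus): boolean {
  return status === "completed" || status === "failed" || status === "cancelled";
}

// Exponential backoff capped at maxMs: 1s, 2s, 4s, 8s, then 15s.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 15000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// `fetchRun` is injected so the sketch does not depend on an HTTP client;
// in practice it would call GET /v1/workflows/runs/:id.
async function pollRun(
  id: string,
  fetchRun: (id: string) => Promise<RunSnapshot>,
  maxAttempts = 20,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<RunSnapshot> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const run = await fetchRun(id);
    if (isTerminal(run.status)) return run;
    await sleep(backoffMs(attempt));
  }
  throw new Error(`Run ${id} did not finish after ${maxAttempts} polls`);
}
```

Where webhooks are available (principle 5), they avoid the polling cost entirely. A loop like this is mainly useful as a fallback or for short-lived scripts.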

Workflow Orchestration

Use Trigger.dev for durable, retryable task execution.
Pattern: Compensating Actions. Use onFailure to clean up partial state: release allocated resources, update the status to error, and notify via webhook.
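The compensating-actions pattern can be illustrated without the Trigger.dev SDK. The sketch below is plain TypeScript, not Trigger.dev's API: `runWithCompensation` and the `Step` shape are hypothetical, and in a real task the cleanup would live in the onFailure hook mentioned above.

```typescript
// One workflow step plus the optional action that undoes it.
interface Step {
  name: string;
  run: () => Promise<void>;
  compensate?: () => Promise<void>;
}

interface Outcome {
  status: "completed" | "failed";
  failedStep?: string;
  compensated: string[]; // names of steps that were rolled back
}

// Runs steps in order. If one throws, undoes the steps that already
// succeeded, most recent first, then reports which step failed.
async function runWithCompensation(steps: Step[]): Promise<Outcome> {
  const done: Step[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch {
      const compensated: string[] = [];
      for (const prev of done.reverse()) {
        if (prev.compensate) {
          await prev.compensate();
          compensated.push(prev.name);
        }
      }
      return { status: "failed", failedStep: step.name, compensated };
    }
  }
  return { status: "completed", compensated: [] };
}

// Example: the record is created, then allocation fails and the record is cleaned up.
const demoSteps: Step[] = [
  { name: "create-record", run: async () => {}, compensate: async () => console.log("mark record as error") },
  { name: "allocate-resources", run: async () => { throw new Error("POOL_CAPACITY_EXCEEDED"); } },
];

runWithCompensation(demoSteps).then((outcome) => console.log(outcome));
```

Rolling back in reverse order matters when later steps depend on earlier ones, for example releasing servers before marking the cluster record as errored.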

Workflow Status Endpoint

GET /v1/workflows/runs/:id
Response:
{
  "success": true,
  "data": {
    "id": "run_deploy_789",
    "status": "running",           // pending, running, completed, failed, cancelled
    "progress": 45,                // 0-100
    "startedAt": "2025-01-09T12:00:00.000Z",
    "estimatedCompletionAt": "2025-01-09T12:05:00.000Z",
    "steps": [
      { "name": "validate", "status": "completed", "completedAt": "2025-01-09T12:00:15.000Z" },
      { "name": "provision", "status": "running", "startedAt": "2025-01-09T12:00:15.000Z" },
      { "name": "configure", "status": "pending" }
    ],
    "error": null
  },
  "meta": {
    "requestId": "req_abc",
    "timestamp": "2025-01-09T12:01:00.000Z"
  }
}
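A progress UI can combine `progress`, `steps`, and `estimatedCompletionAt` from this response into a short summary. The sketch below assumes a hypothetical `summarizeProgress` helper. It uses only fields shown in the sample response and takes the current time as an argument so the result is reproducible.

```typescript
interface StepStatus {
  name: string;
  status: "pending" | "running" | "completed" | "failed";
}

interface RunProgress {
  status: string;
  progress: number; // 0-100
  estimatedCompletionAt?: string; // ISO timestamp
  steps: StepStatus[];
}

// Returns the step currently running (if any) and the whole seconds left
// until the estimated completion time, clamped at zero.
function summarizeProgress(run: RunProgress, nowIso: string) {
  const current = run.steps.filter((step) => step.status === "running")[0];
  const remainingSec = run.estimatedCompletionAt
    ? Math.max(0, Math.round((Date.parse(run.estimatedCompletionAt) - Date.parse(nowIso)) / 1000))
    : null;
  return {
    currentStep: current ? current.name : null,
    percent: run.progress,
    remainingSec,
  };
}

// Values from the sample response; "now" is the meta.timestamp field.
const sampleRun: RunProgress = {
  status: "running",
  progress: 45,
  estimatedCompletionAt: "2025-01-09T12:05:00.000Z",
  steps: [
    { name: "validate", status: "completed" },
    { name: "provision", status: "running" },
    { name: "configure", status: "pending" },
  ],
};

console.log(summarizeProgress(sampleRun, "2025-01-09T12:01:00.000Z"));
```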

Event Sourcing Pattern

  • The Workflows API still needs to be updated to use an event sourcing pattern.
  • Open decision: whether to use K8s Informers, Watchers, or a Controller-based approach.
// Event sourcing pattern (sketch; these helpers are not implemented yet)
const workflow = await getWorkflow(workflowId); // workflow run record
const events = await getEvents(workflowId);     // ordered event log for the run
const state = await getState(workflowId);       // current state of the run
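In event sourcing, current state is usually derived by replaying the event log rather than stored separately. Since the event schema for this API is still undecided, everything in the sketch below is an assumption: the event kinds, the `WorkflowState` shape, and the `apply` and `replay` functions are hypothetical and only show the general technique.

```typescript
// Hypothetical event shapes; the real schema is not defined yet.
type WorkflowEvent =
  | { kind: "run.started"; at: string }
  | { kind: "step.completed"; step: string; at: string }
  | { kind: "step.failed"; step: string; code: string; at: string }
  | { kind: "run.completed"; at: string };

interface WorkflowState {
  status: "pending" | "running" | "completed" | "failed";
  completedSteps: string[];
  errorCode: string | null;
}

const initialState: WorkflowState = { status: "pending", completedSteps: [], errorCode: null };

// Pure reducer: applies one event to the previous state without mutating it.
function apply(state: WorkflowState, event: WorkflowEvent): WorkflowState {
  switch (event.kind) {
    case "run.started":
      return { ...state, status: "running" };
    case "step.completed":
      return { ...state, completedSteps: [...state.completedSteps, event.step] };
    case "step.failed":
      return { ...state, status: "failed", errorCode: event.code };
    case "run.completed":
      return { ...state, status: "completed" };
  }
}

// Rebuilds the current state by folding the event log from the start.
function replay(events: WorkflowEvent[]): WorkflowState {
  return events.reduce(apply, initialState);
}

const log: WorkflowEvent[] = [
  { kind: "run.started", at: "2025-01-12T10:00:00Z" },
  { kind: "step.completed", step: "validate", at: "2025-01-12T10:00:00Z" },
  { kind: "step.completed", step: "create-record", at: "2025-01-12T10:00:01Z" },
  { kind: "step.failed", step: "allocate-resources", code: "POOL_CAPACITY_EXCEEDED", at: "2025-01-12T10:05:23Z" },
];

console.log(replay(log));
```

Because `apply` is pure, replaying a prefix of the log yields the state at any earlier point, which fits the stated goal of workflows doubling as the platform's event log.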