Troubleshooting

This page covers the most common operational issues seen during deployment and run execution.

ImagePullBackOff

What it means

The cluster cannot pull the container image for one or more pods.

Common causes

  • Wrong image repository, tag, or digest
  • Missing registry credentials or permissions
  • Image not yet published or not visible to the cluster

What to check

  • Pod events for pull and auth failures
  • The exact image reference in Helm values
  • Cluster access to Artifact Registry or the target registry

Typical fix

Correct the image reference, fix registry access, or publish the missing image and redeploy.
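
As a rough aid for the checks above, the sketch below scans a pod object (for example, the parsed output of `kubectl get pod <name> -o json`) for containers that are waiting on an image pull and reports the exact image reference each one is trying to use. The function name `find_image_pull_failures` is illustrative, not part of any product tooling.

```python
# Waiting reasons the kubelet reports when an image cannot be pulled or parsed.
PULL_FAILURE_REASONS = {"ImagePullBackOff", "ErrImagePull", "InvalidImageName"}


def find_image_pull_failures(pod):
    """Return (container, image, message) for each container stuck on an image pull."""
    status = pod.get("status", {})
    # Init containers can fail to pull as well, so check both lists.
    statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
    failures = []
    for cs in statuses:
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in PULL_FAILURE_REASONS:
            failures.append((cs.get("name", ""), cs.get("image", ""), waiting.get("message", "")))
    return failures
```

Compare each reported image reference against the one in your Helm values. The message usually distinguishes a missing image from an authorization failure.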

CrashLoopBackOff

What it means

A container starts, exits, and is restarted repeatedly, with Kubernetes increasing the delay between restart attempts.

Common causes

  • Invalid Helm values
  • Missing files, mounts, or secret references
  • Startup failures caused by insufficient memory or CPU

Typical fix

Review the logs of the crashing container, then correct the values, mounts, or resource settings before rerunning.
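
Before reading logs, it can help to see how the previous container instance ended. The sketch below reads `lastState.terminated` from a pod object (parsed `kubectl get pod <name> -o json` output) and separates out-of-memory kills from ordinary application exits. The function name and the hint strings are illustrative.

```python
def summarize_last_crash(pod):
    """Summarize how each container's previous instance terminated."""
    summaries = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated")
        if not terminated:
            continue  # this container has not crashed yet
        reason = terminated.get("reason", "")
        if reason == "OOMKilled":
            hint = "killed for exceeding its memory limit; review memory settings"
        else:
            hint = "application exit; review container logs, values, and mounts"
        summaries.append({
            "container": cs.get("name", ""),
            "exit_code": terminated.get("exitCode"),
            "reason": reason,
            "restarts": cs.get("restartCount", 0),
            "hint": hint,
        })
    return summaries
```

An `OOMKilled` reason points at resource settings. Any other reason usually means the answer is in the container logs.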

Job Pending

What it means

The job's pod cannot be scheduled onto any node, so it stays in the Pending phase.

Common causes

  • Not enough CPU, memory, or GPU capacity
  • Resource requests larger than any node in the available node pool can satisfy
  • Selectors, taints, tolerations, or affinity rules that prevent scheduling

Typical fix

Scale the node pool, reduce requests if safe, or align node selection settings with the available nodes.
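
The scheduler records why a pod is unschedulable in the pod's `PodScheduled` condition. The sketch below extracts that message from a pod object and maps it to the cause groups listed above. The keyword matching is a heuristic based on typical scheduler wording, and the function name is illustrative.

```python
def scheduling_hints(pod):
    """Return (message, hints) describing why a pod is unschedulable, or (None, [])."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            message = cond.get("message", "")
            text = message.lower()
            hints = []
            # Heuristic keyword checks; scheduler wording can vary between versions.
            if "insufficient" in text:
                hints.append("capacity")
            if "taint" in text:
                hints.append("taints and tolerations")
            if "affinity" in text or "selector" in text:
                hints.append("node selection")
            return message, hints
    return None, []
```

A `capacity` hint suggests scaling the node pool or lowering requests. The other two hints point at the node selection settings in your values.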

PVC Pending

What it means

The PersistentVolumeClaim cannot be bound.

Common causes

  • Missing or wrong StorageClass
  • Provisioner failure
  • Unsupported access mode or size request

Typical fix

Confirm the StorageClass, adjust the access mode or size request, or resolve the provisioner failure.
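
The first two causes can be checked mechanically. The sketch below compares a PersistentVolumeClaim object against the StorageClass names that exist in the cluster (for example, collected from `kubectl get storageclass`). The function name is illustrative, and the check does not detect provisioner failures, which appear in the PVC's events.

```python
def check_pvc(pvc, storage_class_names):
    """Return a list of likely reasons a PVC stays Pending, based on its spec only."""
    problems = []
    spec = pvc.get("spec", {})
    sc = spec.get("storageClassName")
    if sc is None:
        problems.append("no storageClassName set; binding relies on a default StorageClass")
    elif sc == "":
        # An empty string disables dynamic provisioning for this claim.
        problems.append("storageClassName is empty; a matching pre-created PersistentVolume is required")
    elif sc not in storage_class_names:
        problems.append(f"StorageClass {sc!r} does not exist in the cluster")
    if not spec.get("accessModes"):
        problems.append("no accessModes requested")
    if not spec.get("resources", {}).get("requests", {}).get("storage"):
        problems.append("no storage size requested")
    return problems
```

An empty result means the spec looks plausible, so the next place to look is the PVC's events and the provisioner itself.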

403 or AccessDenied on GCS

What it means

The workload does not have permission to read or write the configured bucket path.

Common causes

  • Missing IAM bindings
  • Broken Workload Identity binding
  • Incorrect bucket or prefix configuration

Typical fix

Grant the missing IAM bindings, correct the Workload Identity binding, or repair the bucket path configuration.
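
One part of a Workload Identity binding can be inspected from the cluster side. The sketch below assumes the deployment links the Kubernetes service account to a Google service account through the `iam.gke.io/gcp-service-account` annotation. If your setup grants access to the Kubernetes service account directly, the annotation may legitimately be absent. The function name and the `expected_project` parameter are illustrative, and the check does not cover the IAM side of the binding.

```python
import re

WI_ANNOTATION = "iam.gke.io/gcp-service-account"
# Loose shape check for NAME@PROJECT_ID.iam.gserviceaccount.com; not a full validation.
GSA_EMAIL = re.compile(r"^(?P<name>[a-z0-9-]+)@(?P<project>[a-z0-9-]+)\.iam\.gserviceaccount\.com$")


def check_workload_identity(service_account, expected_project=None):
    """Return problems found in a Kubernetes ServiceAccount's Workload Identity annotation."""
    annotations = service_account.get("metadata", {}).get("annotations") or {}
    email = annotations.get(WI_ANNOTATION)
    if email is None:
        return [f"missing annotation {WI_ANNOTATION}"]
    match = GSA_EMAIL.match(email)
    if not match:
        return [f"{email!r} does not look like a Google service account email"]
    problems = []
    if expected_project and match.group("project") != expected_project:
        problems.append(
            f"service account belongs to project {match.group('project')!r}, "
            f"expected {expected_project!r}"
        )
    return problems
```

Pass the parsed output of `kubectl get serviceaccount <name> -o json`. An empty result only means the annotation is well-formed, so the IAM bindings on the Google service account and the bucket still need to be confirmed separately.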

Job completed but outputs are missing

What it means

The job reached a terminal state, but the expected artifacts are not in the expected location.

Common causes

  • Wrong bucket prefix or PVC path
  • Incorrect config.projectId
  • Misconfigured or skipped output upload step

Typical fix

Compare the runtime config to the expected storage path, then inspect the logs for copy or upload steps.
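
The comparison can be sketched as follows. The code assumes outputs are written under a `gs://bucket/prefix` location and that you already have a listing of object names from that bucket. The function names, the artifact names, and the idea of a fixed expected artifact list are assumptions for illustration.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/some/prefix' into (bucket, 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return bucket, prefix.strip("/")


def missing_outputs(output_uri, expected_names, found_object_names):
    """Return the expected artifact names that are absent under the output prefix."""
    _, prefix = split_gcs_uri(output_uri)
    base = prefix + "/" if prefix else ""
    found = set(found_object_names)
    return [name for name in expected_names if base + name not in found]
```

If every expected artifact is reported missing, the prefix or bucket in the runtime config is the likely culprit. If only some are missing, the upload step logs are the better starting point.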

ubbagent not ready

What it means

The Marketplace reporting sidecar cannot start or report usage.

Common causes

  • Reporting secret is missing or malformed
  • Sidecar image cannot be pulled
  • Network restrictions block reporting endpoints

Typical fix

Recreate the secret, fix the image reference, or allow the required egress path.
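
A malformed secret can often be spotted without starting the sidecar. The sketch below checks a Kubernetes Secret object (parsed `kubectl get secret <name> -o json` output) for required keys and valid base64 values. The default key names are an assumption about what the reporting secret contains, so confirm them against your own deployment and pass a different `required_keys` if they differ.

```python
import base64
import binascii

# Assumed key names for the reporting secret; verify against your deployment.
ASSUMED_REQUIRED_KEYS = ("consumer-id", "entitlement-id", "reporting-key")


def check_reporting_secret(secret, required_keys=ASSUMED_REQUIRED_KEYS):
    """Return problems with a Secret: missing keys or values that are not valid base64."""
    data = secret.get("data") or {}
    problems = []
    for key in required_keys:
        value = data.get(key)
        if not value:
            problems.append(f"missing or empty key: {key}")
            continue
        try:
            # Values under 'data' in a Secret are base64-encoded.
            base64.b64decode(value, validate=True)
        except (binascii.Error, ValueError):
            problems.append(f"value for {key} is not valid base64")
    return problems
```

An empty result rules out the most basic secret problems, leaving the sidecar image and network egress as the remaining causes to check.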

Verification failure

What it means

The downloaded or restored bundle does not pass verification.

Common causes

  • Incomplete download
  • Bundle files modified after retrieval
  • Wrong bundle version or mismatched manifest

Typical fix

Re-download the bundle, avoid mutating the files, and verify against the correct run artifact set.
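
To narrow down which files fail, a checksum comparison along these lines can help. The sketch assumes the manifest can be reduced to a mapping of relative file paths to SHA-256 hex digests. The actual manifest format and hash algorithm used by the bundle are not specified here, so treat both as assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle_dir, manifest):
    """Return {relative_path: problem} for files that are missing or do not match."""
    problems = {}
    for rel_path, expected in manifest.items():
        path = Path(bundle_dir) / rel_path
        if not path.is_file():
            problems[rel_path] = "missing"
        elif sha256_of(path) != expected.lower():
            problems[rel_path] = "checksum mismatch"
    return problems
```

Files reported as `missing` suggest an incomplete download. Files reported as `checksum mismatch` suggest either modification after retrieval or a manifest from a different bundle version.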