Backup Mechanics¶
The backup pipeline is where most of Velero's complexity lives.
Understanding this path is essential for debugging, performance tuning, and writing plugins.
Backup lifecycle state machine¶
The full state machine involves 4 controllers in sequence:
BackupQueueController
New ──► Queued ──► ReadyToStart
│
BackupController
ReadyToStart ──► InProgress
│
BackupOperationsController
WaitingForPluginOperations
WaitingForPluginOperationsPartiallyFailed
│
BackupFinalizerController
Finalizing / FinalizingPartiallyFailed
│
Completed / PartiallyFailed / Failed
│
──► Deleting (via DeleteBackupRequest)
The BackupQueueController enforces concurrency limits and detects
namespace conflicts (two backups covering the same namespaces cannot run
simultaneously). Once a Backup reaches ReadyToStart, the
BackupController takes over and drives execution.
Step-by-step Backup Execution¶
1. BackupController Picks Up a New Backup¶
A controller-runtime informer watches for Backup objects reaching
ReadyToStart phase. The controller sets status.phase = InProgress and
status.startTimestamp.
In HA deployments, controller-runtime's leader election (via Kubernetes leases) ensures only one replica runs reconcilers at a time, preventing concurrent execution of the same backup.
2. Resource Discovery and Collection¶
Uses the API server's discovery API to enumerate all resource types. For each resource type matching the include/exclude filters, lists objects via the dynamic client (an untyped Kubernetes client that works with any resource type without compiled Go structs — essential for backing up CRDs whose types Velero doesn't know at compile time).
Discovery is done concurrently with goroutines per resource group.
Key file: pkg/backup/item_collector.go
3. BackupItemAction Plugins¶
For each collected item, runs all registered BackupItemAction plugins
whose AppliesTo() matches the resource type. These can:
- Mutate the item (e.g. strip sensitive annotations)
- Add additional items to the backup graph (e.g. the built-in
pod-actionadds the PVC when a Pod is backed up, ensuring PVC/PV pairs are consistent) - Set skip flags to exclude an item
This is where most custom business logic lives. See Plugin System.
4. PVC → Volume Backup Decision¶
For each PVC, Velero decides the volume backup method. Volume policies
(ConfigMap-based, referenced via spec.resourcePolicy) take highest
precedence, followed by pod annotations, then defaults:
- Volume policy match (if configured): a
ResourcePoliciesConfigMap can match by storage class, CSI driver, capacity, NFS config, PVC labels, or PVC phase. The first matching rule's action wins:skip,fs-backup, orsnapshot. - Skip: if
snapshotVolumes: falseor if the PVC has the opt-out annotationbackup.velero.io/backup-volumes-excludes - CSI VolumeSnapshot: if the CSI plugin is enabled and a matching VolumeSnapshotClass exists
- Cloud provider snapshot: if a VolumeSnapshotter plugin is registered for the storage class
- Kopia file-level copy: if
defaultVolumesToFsBackup: trueor if the PVC has the opt-in annotationbackup.velero.io/backup-volumes
Volume policies are the modern approach
Volume policies replace the annotation-based per-pod opt-in/out model. They're centrally managed, support rich matching conditions, and apply cluster-wide without modifying application manifests.
5. Pre-backup hooks¶
Before serializing a pod's volume data, executes pre-backup hooks (exec into containers). Used to quiesce databases, flush caches, sync filesystems. See Hooks.
6. Volume Snapshot / Data Upload¶
Creates DataUpload CRDs. The DataUploadController in node-agent picks
these up and runs Kopia to upload data directly from the PVC mount on the
node. Velero-server polls DataUpload.status for completion.
Calls the VolumeSnapshotter plugin synchronously. The plugin calls the
cloud provider API and returns a snapshot ID that Velero stores in the
backup metadata.
7. Post-backup hooks¶
After volume data is captured, runs post-backup hooks to un-quiesce
(e.g. UNLOCK TABLES). Velero guarantees post hooks run even if pre hooks
fail (unless onError: Fail caused the backup to abort).
8. Serialization and Upload¶
All collected, plugin-processed items are serialized to JSON and written into
a tarball (backup.tar.gz). A backup-results.gz file captures warnings and
errors per item. Both are streamed to the object store via the ObjectStore
plugin.
Key file: pkg/backup/backup.go
9. Metadata Upload¶
A velero-backup.json metadata file is written to the BSL. This is what the
BackupSyncController reads to reconstruct Backup objects in a new cluster
(enabling cross-cluster restores without re-creating Backup CRDs manually).
Object Store Layout¶
{bucket}/{prefix}/
backups/
{backup-name}/
velero-backup.json # Backup CRD spec + status
{backup-name}.tar.gz # All K8s resources (JSON per item)
{backup-name}-logs.gz # Velero server logs during backup
{backup-name}-results.gz # Warnings and errors per item
{backup-name}-csi-volumesnapshots.json.gz # CSI snapshot metadata
{backup-name}-volumesnapshots.json.gz # Legacy VSL snapshot metadata
restores/
{restore-name}/
restore-{restore-name}-logs.gz
restore-{restore-name}-results.gz
Tar Archive Structure¶
resources/
deployments/
namespaces/
default/
my-deployment.json
persistentvolumeclaims/
namespaces/
default/
my-pvc.json
persistentvolumes/
cluster/ # cluster-scoped resources live here
pvc-abc123.json
Useful Debug Techniques¶
# Watch backup progress
kubectl get backup my-backup -n velero -o yaml -w
# Stream velero server logs during a backup
kubectl logs -n velero deployment/velero -f --since=5m
# Inspect what's in the tar archive
velero backup download my-backup --output /tmp/my-backup.tar.gz
tar -tzf /tmp/my-backup.tar.gz | head -50
# See per-item warnings/errors
velero backup describe my-backup --details
Performance Note
Backup speed is bound by API server list throughput and object store upload bandwidth. For large clusters (10k+ objects), the list phase dominates.
The spec.resourceVersion is set at list time: items added after listing
may be missing from the backup.