OpenShift Virtualization
Reference Architecture
Migrate & Modernize  ·  February 2026
The Environment — Where Most Customers Are Starting
This is an anonymized composite of the environments we consistently encounter. The specific hardware names differ. The problems are nearly identical across every organization we talk to.
20–150 hosts
Across primary datacenter, multiple hardware generations and vendor contracts
200–5,000 VMs
Mix of RHEL, Windows Server, and legacy operating systems — each with different owners
3–9 mo.
Typical renewal or contract pressure window forcing a platform decision now

What the infrastructure looks like today

Compute — Primary DC has servers across multiple hardware generations — some on active support contracts, some not. Capacity planning is manual and reactive.
Storage — VMs live on datastores. Different teams manage storage differently. There is no consistent tiering strategy, and live migration depends on shared storage being configured correctly — it often isn't.
DR — A secondary site exists but is passive and undertested. Backups are agent-based on individual VMs. RTO is measured in hours, not minutes. The last full DR test was over a year ago.
Automation — Operations are runbook-driven. Changes go through manual change management. VM provisioning takes days. Secrets live in shared spreadsheets or in the heads of two administrators.
Containers — A separate Kubernetes cluster exists somewhere, managed by a different team. VMs and containers do not share tooling, monitoring, or a management plane.

What this costs the mission

The issue isn't that the platform is old. The issue is that the platform increasingly dictates what the mission can and cannot do. Velocity slows. Security posture is harder to prove. Every audit requires manual evidence collection. Every new workload requires a negotiation between teams.

The administrators are not the problem. They are experts in a platform that is approaching end of support, end of commercial viability, or both. Their expertise is real and transferable — the goal is not to replace people, it is to give them a platform that keeps pace with the mission.

Common customer questions

"My administrators only know VMware — is this realistic for them?"
"Do I need all new hardware to make this work?"
"I have a renewal in 3–6 months. Does that timeline work?"
"Why do this over a lift-and-shift to cloud or another legacy platform?"
"What is the FedRAMP story here?"
"Can we get hands-on with this before we commit?"
Primary Datacenter — Multi-Availability Zone OpenShift Cluster
Three physically isolated availability zones — separate rooms in the same datacenter. A single OpenShift cluster spans all three, with control plane nodes distributed for HA, and each zone carries independent storage.

Architecture Overview


Multi-AZ HA  ·  Mixed VM workloads  ·  Existing hardware  ·  FedRAMP boundary
Availability Zone A — Physical Room 1
Tier 1 · All-Flash Array (NVMe)
All-Flash Array · NVMe
StorageClass: ontap-nas-t1  ·  Protocol: NFS / iSCSI  ·  RWX enabled → live VM migration
All-Flash Array — HA Pair · NVMe
OpenShift Nodes
control-plane-01 · Control
worker-01 · Worker
worker-04 · Worker
+ additional workers
VM Workloads
RHEL  ·  Third-Party Linux  ·  Windows Server  ·  Windows Desktop  ·  Mixed OS
Tier 2 · Capacity (SAS/SSD)
Capacity Array · SAS / SSD
StorageClass: ontap-nas-t2  ·  Backup target volumes  ·  SnapMirror source
Capacity Array — HA Pair · SAS / SSD
Availability Zone B — Physical Room 2
Tier 1 · All-Flash Array (NVMe)
All-Flash Array · NVMe
StorageClass: ontap-nas-t1  ·  Protocol: NFS / iSCSI  ·  RWX enabled → live VM migration
All-Flash Array — HA Pair · NVMe
OpenShift Nodes
control-plane-02 · Control
worker-02 · Worker
worker-05 · Worker
+ additional workers
VM Workloads
RHEL  ·  Third-Party Linux  ·  Windows Server  ·  Windows Desktop  ·  Legacy Applications
Tier 2 · Capacity (SAS/SSD)
Capacity Array · SAS / SSD
StorageClass: ontap-nas-t2  ·  Backup target volumes  ·  SnapMirror source
Availability Zone C — Physical Room 3
Tier 1 · All-Flash Array (NVMe)
All-Flash Array · NVMe
StorageClass: ontap-nas-t1  ·  Protocol: NFS / iSCSI  ·  RWX enabled → live VM migration
All-Flash Array — HA Pair · NVMe
OpenShift Nodes
control-plane-03 · Control
worker-03 · Worker
worker-06 · Worker
+ additional workers
VM Workloads
RHEL  ·  Third-Party Linux  ·  Windows Server  ·  Windows Desktop  ·  Containerized Apps
Tier 2 · Capacity (SAS/SSD)
Capacity Array · SAS / SSD
StorageClass: ontap-nas-t2  ·  Backup target volumes  ·  SnapMirror source
Capacity Array — HA Pair · SAS / SSD
How it fits in this environment
Conversation anchor
Platform Services
OCP Virtualization
Advanced Cluster Management
NetApp Trident CSI
Veeam Kasten K10
HashiCorp · ArgoCD / GitOps
Ansible Automation Platform
HashiCorp Vault
Elastic Stack
Prometheus / Monitoring
Compute / Platform
Red Hat OpenShift
OCP Virtualization, node sizing, CPU pinning, multi-AZ scheduling, ACM fleet
Storage / CSI
NetApp
Trident CSI, DataVolumes, StorageClass tiering, ReadWriteMany for live migration
DR / Data Protection
Veeam · Kasten K10
K8s-native backup, SnapMirror replication, app-consistent, RPO/RTO planning
Automation / Secrets
HashiCorp
ArgoCD, Ansible AAP, Vault dynamic secrets, IaC-driven VM lifecycle
Observability
Elastic / Carahsoft
Elastic Agent DaemonSet, unified logs + metrics + SIEM, FedRAMP-ready
Key
Control plane node
Worker node
Tier 1 — NetApp All-Flash
Tier 2 — NetApp Capacity
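The tiered StorageClasses shown in the diagram are defined against Trident backends. A minimal sketch of what the Tier 1 class could look like (the provisioner name is Trident's standard one; the backendType and selector values here are illustrative placeholders, not taken from a real configuration):

```yaml
# Hypothetical Tier 1 StorageClass backed by the NetApp Trident CSI driver.
# Backend type and selector label are illustrative placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-nas-t1
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas        # NFS backend, so volumes can be RWX
  selector: "tier=flash"        # placeholder storage-pool label
allowVolumeExpansion: true
reclaimPolicy: Delete
```

VM disks requested from this class with ReadWriteMany access are what make live migration between zones possible.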
DR & Cloud Sites — Secondary Datacenter and Cloud Egress
A reduced-footprint secondary site serves as the primary DR and backup target. An optional cloud site provides burst, tertiary DR, or archival egress inside a FedRAMP authorization boundary. These sites are managed as part of the same fleet via ACM.

The Backup Conversation is Different on Kubernetes — and More Flexible

Veeam Kasten K10 protects at three scopes: an individual VM and its attached storage, a full namespace containing multiple VMs and applications, or the entire cluster. You define the policy — Kasten executes it consistently. This is a meaningful upgrade from agent-based backup, which requires per-VM configuration and breaks when VMs move. Alongside Kasten, SnapMirror handles block-level storage replication to the DR site independently — so your data is protected at both the application layer and the storage layer. These two mechanisms together give you something more reliable and more testable than most organizations have today.

Per-VM protection  ·  Namespace-level protection  ·  Full cluster backup  ·  App-consistent restore  ·  SnapMirror storage replication  ·  Testable failover
Secondary Datacenter
Reduced footprint — DR and backup target
Passive / Active DR
OpenShift Nodes
control-plane · Control
worker-01 · Worker
worker-02 · Worker
+ scale out on failover
NetApp — less performant by design, SnapMirror destination
Capacity Array — HA Pair · SAS
Capacity Array · SAS
Object / Archival Target · S3-compatible
Veeam Kasten K10 — backup target & DR restore point
DR Modes
Passive — data replicated, cluster powered down. Manual failover. Lowest cost.
Active — cluster running, workloads live. Automated failover. Higher cost.
Cloud Egress Site
GovCloud or commercial — optional architecture component
Optional
Managed OpenShift
Managed OpenShift Cluster · Cloud-hosted
HashiCorp Vault Enterprise — secrets sync across all sites
Cloud Storage
Object Storage · S3-compatible
Managed Block (CSI) · Cloud-native
Use Cases
Burst — overflow compute for mission demand peaks.
Cloud DR — third site for additional resiliency.
Archive — long-term retention and compliance egress.
ACM provides single-pane fleet management across all three sites. The FedRAMP authorization boundary can encompass primary, DR, and cloud sites when each is configured within scope.
How Recovery Actually Works — Two Workflows Side by Side
NetApp SnapMirror — Storage Failover
Block-level replication · Storage layer recovery · Used for site-wide DR
SnapMirror continuously replicates storage volumes from the primary datacenter to the secondary site at the block level. It is storage-layer protection — independent of what is running on top. In a failover scenario, the secondary volumes are promoted and become read/write. OpenShift at the DR site mounts those volumes and starts workloads.
Step 1
SnapMirror replication running continuously. RPO is minutes, not hours — based on replication schedule configured per volume or policy.
Step 2
Primary site incident detected. Decision to failover made. SnapMirror relationship is broken — secondary volumes promoted to read/write.
Step 3
OpenShift cluster at DR site brought active. Trident CSI re-attaches the promoted volumes as PersistentVolumes.
Step 4
VMs start on DR cluster using replicated storage. Applications resume from last replication checkpoint. DNS/routing updated to DR site.
Step 5
Primary restored. Reverse replication syncs changes back. Planned failback executed. SnapMirror relationship re-established in original direction.
Best used for: Site-wide disaster, storage hardware failure, datacenter-level outage. Protects the data layer regardless of what caused the incident.
Veeam Kasten K10 — Application Restore
Application-layer recovery · Per-VM, namespace, or cluster scope · Any cluster in the fleet
Kasten K10 operates at the application layer. It captures a consistent point-in-time snapshot of a VM, a namespace, or the entire cluster — including storage volumes, configuration, and metadata. Restore can target any cluster in the fleet. This makes Kasten the right tool for accidental deletion, corruption, ransomware recovery, and cross-cluster workload mobility as well as DR.
Step 1
Kasten K10 policies run on a defined schedule. Backup scope is per-VM, per-namespace, or cluster-wide. Snapshots stored on-cluster or exported to S3-compatible object storage at the DR site.
Step 2
Recovery event occurs — deleted VM, corrupted namespace, ransomware, or full site failure. Administrator selects restore point and target cluster from the Kasten dashboard.
Step 3
Kasten restores the VM disk (PVC), VM definition, networking config, and associated secrets. Application-consistent to the backup point-in-time — not just the disk.
Step 4
VM comes online on the target cluster. If restoring to a different namespace or cluster, Kasten handles remapping of storage and network references automatically.
Step 5
Restore verified. Policy compliance and audit log available in Kasten dashboard. Immutable backup copies on object storage remain intact for compliance retention.
Best used for: Accidental deletion, ransomware, application corruption, cross-cluster workload mobility, and DR where application-layer consistency is required.
These two mechanisms are complementary, not duplicative. SnapMirror protects storage at the infrastructure layer continuously and is the primary tool for site-level failover. Kasten protects applications at the platform layer on a schedule and is the primary tool for precision recovery, mobility, and compliance. Most production environments run both.
Migration Flow — From Your Current Platform to OpenShift Virtualization
This is not a rip-and-replace. It is a phased transition that keeps existing workloads running at every step. The goal is to land VMs on the new platform, stabilize operations, and give teams the time and space to adopt cloud-native tooling at their own pace.
No New Hardware to Start
OpenShift Virtualization runs on your existing servers. No forklift. No waiting on procurement. Hardware refresh happens on your existing lifecycle, not as a prerequisite.
Two valid destinations. You choose.
Migrate & Stabilize
VMs land on OpenShift. Your team manages them the same way they do today — console, CLI, familiar workflows. The platform is new. The day-to-day is not.
Migrate & Modernize
VMs land on OpenShift. Over time, teams adopt GitOps, automated secrets, and container-based workloads alongside their VMs. One platform, evolving at your pace.
Day 1
Your VMware admins can manage migrated VMs using familiar workflows. Cloud-native tooling is adopted progressively, not on day one.
Free and paid training is available through Red Hat. Admins do not need to learn everything before the first VM migrates.
Phase 01
Assess
Who does this: Your VMware admins + Red Hat architects
Inventory every VM: operating system, CPU/memory footprint, storage dependencies, network topology, and application owner. Identify which VMs have hard dependencies on vSphere features (vMotion, vSAN, snapshots) and map those to OpenShift equivalents. Sequence the migration order — start with low-risk, non-production workloads to build team confidence before touching anything mission-critical.
Migration Toolkit for Virtualization (MTV)
VM inventory and dependency mapping
Network topology analysis
vSphere feature gap analysis
Output: prioritized migration wave plan
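The wave plan itself is expressed to MTV as a Plan resource that ties a source provider to network and storage maps. A hedged sketch, with provider, map, and VM names as placeholders:

```yaml
# Illustrative MTV (Forklift) migration Plan for one wave.
# Provider, map, and VM names are placeholders.
apiVersion: forklift.konveyor.io/v1beta1
kind: Plan
metadata:
  name: wave-1-nonprod
  namespace: openshift-mtv
spec:
  warm: false                   # cold migration for the first low-risk wave
  provider:
    source:
      name: vsphere-prod        # registered vSphere provider (placeholder)
      namespace: openshift-mtv
    destination:
      name: host                # the local OpenShift cluster
      namespace: openshift-mtv
  map:
    network:
      name: wave-1-netmap       # VLAN → NetworkAttachmentDefinition map
      namespace: openshift-mtv
    storage:
      name: wave-1-storagemap   # datastore → StorageClass map
      namespace: openshift-mtv
  vms:
    - name: dev-app-01
    - name: dev-db-01
```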
Phase 02
Prepare
Who does this: Platform + storage + security teams together
Build the landing zone before migrating anything. This means configuring OpenShift Virtualization, defining StorageClasses in Trident CSI that match your Tier 1 and Tier 2 storage, mapping existing VLANs to NetworkAttachmentDefinitions, and setting up namespace structure and RBAC so each application team owns their space. Bootstrap Vault for secrets, stand up ArgoCD for GitOps, and configure Kasten K10 backup policies before the first VM lands — so protection is in place from day one.
NetApp Trident CSI StorageClasses
NetworkAttachmentDefinitions (VLAN map)
Vault secrets bootstrap
Kasten K10 backup policies pre-configured
Output: hardened landing zone, ready to receive VMs
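Mapping an existing VLAN into the landing zone is typically done with a NetworkAttachmentDefinition. A sketch assuming a Linux bridge CNI, with the bridge name br1 and VLAN ID 120 as placeholders for values from your own topology:

```yaml
# Illustrative VLAN mapping for VM traffic. Bridge name, VLAN ID,
# and namespace are placeholders for site-specific values.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan-120-app
  namespace: mission-app
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "vlan-120-app",
      "type": "bridge",
      "bridge": "br1",
      "vlan": 120,
      "ipam": {}
    }
```

VMs attach to this network by referencing it in their VirtualMachine spec, which keeps their existing VLAN placement through the migration.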
Phase 03
Migrate
Who does this: Primarily your existing VMware admins via MTV
Run MTV migration plans against each wave. Disk images are converted (V2V), transferred to DataVolumes on NetApp storage, and VMs are started as KubeVirt VMs. Each availability zone can run migration pipelines in parallel to reduce the total migration window. VMs are validated against the original before cutover. Source VMs remain running until cutover is confirmed.
MTV migration plans per wave
Cold — offline conversion, lowest risk
Warm — background copy, brief cutover
Hot — live replication, near-zero downtime
Output: VMs running on OpenShift, source decommissioned per wave
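Each converted disk lands as a DataVolume on the tiered storage defined during Prepare. A sketch of the kind of object involved, with name, namespace, size, and StorageClass reference as illustrative placeholders:

```yaml
# Illustrative DataVolume for a migrated VM disk on Tier 1 storage.
# Name, namespace, and size are placeholders.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: dev-app-01-disk0
  namespace: mission-app
spec:
  source:
    blank: {}                   # populated by the V2V conversion step
  storage:
    storageClassName: ontap-nas-t1
    accessModes:
      - ReadWriteMany           # RWX enables live migration of the VM
    resources:
      requests:
        storage: 100Gi
```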
Phase 04
Operate & Evolve
Who does this: Your existing teams, on their timeline
VMs are now on OpenShift. Day-2 operations begin. Your admins use the OpenShift console to manage VMs — the experience is recognizable. Kasten K10 is backing everything up. Elastic is collecting logs and metrics. Over time — months, not days — teams adopt GitOps for VM lifecycle, Vault for secrets, and begin containerizing workloads where it makes sense. The platform does not force this transition. It enables it.
Kasten K10 — backup from day one
Elastic — unified observability
VM console management (familiar UX)
GitOps adoption at team's own pace
Output: stable operations, path to modernization open
Before
Your Current Environment
Where most organizations are today.
Traditional hypervisor (VMware vSphere) — vCenter, vMotion, vSAN or external NFS
Datastores (NFS / VMFS) — storage managed separately from compute
Agent-based VM backup — per-VM policies, long RTO, hard to test
Manual operations & runbooks — changes take days, secrets in spreadsheets
Separate container platform — different team, different tooling, no shared plane
During
Transition Period — Both Worlds at Once
This phase is real and should be planned for. It is not a failure state.
VMs running on OpenShift Virtualization — managed via the OpenShift console, a familiar interface
Backup via Kasten K10 from day one — protection in place before source decommission
Some VMs still on legacy platform — wave-based migration, not all at once
Teams learning GitOps and Vault — training and adoption run in parallel with operations
Elastic observability across both platforms — single pane during the transition period
After
Destination Environment
What the platform looks like fully realized.
OpenShift Virtualization (KubeVirt) — VMs and containers on one platform, one team
NetApp Trident CSI DataVolumes — storage as code, consistent tiering, live migration
Kasten K10 namespace-aware backup — app-consistent, testable, auditable
GitOps-driven VM & app lifecycle — Vault secrets, ArgoCD, Ansible AAP
Elastic unified SIEM & observability — single pane for VMs, containers, and platform