Introduction
Dalet has selected VMware Tanzu Kubernetes Grid (Standalone) as one of the reference platforms for On Prem deployments.
A POC was run with the help of an external consulting company, with the following goals:
- fast-track VMware Tanzu adoption within Dalet R&D
- establish a reference architecture that we could communicate to customers
- set up a validation platform for periodic compatibility checks (from scratch & continuous upgrade).
This document summarises the scope of what was tested during the first validation campaign, and lists prerequisites and general recommendations drawn from the POC we ran.
Future validation campaigns will run on this platform, with adjustments when necessary; this document will be updated accordingly.
Disclaimer
There is no continuous validation of VMware Tanzu in place; compatibility checks are best effort.
The main trigger for a new validation will be the release of a new version of Kubernetes that Dalet wishes to support, which usually happens several times a year.
We do not plan to ever validate TKGi (integrated into vSphere).
Validation scope
Dalet will periodically run some compatibility checks with the Vmware Tanzu platform in Dalet R&D.
Those checks include:
- creation of a Workload cluster, from scratch (using an existing Management cluster)
- deployment of Dalet Pyramid, using our own automation tools (mostly Ansible playbooks)
- basic sanity check on the platform, using a mix of automated (API) tests, as well as human tests.
Additionally, when a new Kubernetes version is released that Dalet wishes to support, Dalet will run these additional tests, prior to deploying Pyramid:
- upgrade of the Management cluster
- upgrade of the Workload cluster
- run the Pyramid sanity tests.
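For reference, a minimal sketch of the corresponding Tanzu CLI steps is shown below. Cluster names and the configuration file path are placeholders, not the actual values used in validation.

```bash
# Create a Workload cluster from an existing Management cluster
# ("dalet-workload" and "workload-config.yaml" are placeholder names)
tanzu cluster create dalet-workload --file workload-config.yaml

# Retrieve an admin kubeconfig for the new cluster
tanzu cluster kubeconfig get dalet-workload --admin

# When validating a new Kubernetes version: upgrade the Management
# cluster first, then the Workload cluster
tanzu management-cluster upgrade
tanzu cluster upgrade dalet-workload
```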
The outcome of this validation will be an update of the supported versions (including, if required, updates to our application code and/or deployment tools), the prerequisites documents and general recommendations, as well as a list of known issues with possible remediations where identified.
Reference architecture
High-level layout
This is the general, high-level layout of a typical On Prem deployment.
It shows the main components of the system, as well as main network/domain areas.
Tanzu layout
The platform used in validation has the following layout:
It consists of:
- 1x management cluster (3 master nodes, 2 worker nodes)
- 1x workload cluster (3 master nodes, 7 worker nodes [5x general, 2x dedicated to Postgres])
- 1x NSX-ALB cluster (3x controller nodes)
- 1x NSX-ALB Service engine cluster (2x Service engine nodes).
It is highly recommended to dedicate a Tanzu cluster to Pyramid only, and never to mix it with any other product.
This is because applications and services rely on specific versions of the Kubernetes API; an upgrade may be required by one product but not the other, and resolving such conflicts would be very difficult. Proper isolation is key.
Network
Static IP allocation through the node IPAM feature is used.
NSX-ALB integration is used to expose the Kubernetes API endpoints and in-cluster Services of type LoadBalancer.
By default, two Services of type LoadBalancer are created:
- internal (for services that should not be exposed publicly)
- external (for all other services).
Both use Traefik as the Ingress controller.
An additional LoadBalancer may be created for the in-cluster Postgres engine.
There is technically no difference between the internal and the external Load balancer services; in an On Prem context, it is up to the customer to configure the proper NAT/routing/firewall rules to expose (or not) each Load balancer.
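As an illustration, a minimal Service of type LoadBalancer fronting Traefik could look like the sketch below; the name, namespace and ports are assumptions, and the actual manifests are generated by Dalet's deployment tooling. NSX-ALB allocates the virtual IP for such Services.

```yaml
# Hypothetical "external" LoadBalancer Service in front of the Traefik ingress controller.
# The "internal" Service is identical apart from its name; exposure is controlled by the
# customer's NAT/routing/firewall rules, not by the manifest itself.
apiVersion: v1
kind: Service
metadata:
  name: traefik-external        # placeholder name
  namespace: ingress            # placeholder namespace
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: traefik
  ports:
    - name: web
      port: 80
      targetPort: 8000          # placeholder Traefik entrypoint port
    - name: websecure
      port: 443
      targetPort: 8443          # placeholder Traefik entrypoint port
```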
For simplicity, the validation platform uses a flat network: all clusters and nodes are in the same VM network and the same VLAN, and they share the same gateway.
Finer-grained network segregation can be used when implementing in a production context.
Example of such a network layout (each network segment has a different colour):
Note: the above layout is not tested by Dalet.
Compute
The clusters are provisioned with the following compute resources:
Cluster | k8s version | # Masters | Master CPU | Master Mem | Master Disk | # Workers | Worker CPU | Worker Mem | Worker Disk |
Management | 1.27.5 | 3 | 2 | 4GB | 20GB | 2 | 2 | 4GB | 20GB |
Workload | 1.27.5 | 3 | 2 | 4GB | 20GB | 7 | 4 | 16GB | 40GB |
The standard Tanzu Ubuntu 20.04 OS image is used (Photon was not tested and there are no plans to test it).
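For reference, the node sizing above maps onto standard TKG cluster configuration variables. The following is a partial, illustrative sketch for the Workload cluster only; the cluster name is a placeholder and all other variables are omitted.

```yaml
# Partial TKG cluster configuration sketch for the Workload cluster (illustrative only)
CLUSTER_NAME: dalet-workload          # placeholder name
CLUSTER_PLAN: prod                    # 3 control-plane nodes
CONTROL_PLANE_MACHINE_COUNT: 3
WORKER_MACHINE_COUNT: 7
OS_NAME: ubuntu
OS_VERSION: "20.04"
VSPHERE_CONTROL_PLANE_NUM_CPUS: 2
VSPHERE_CONTROL_PLANE_MEM_MIB: 4096
VSPHERE_CONTROL_PLANE_DISK_GIB: 20
VSPHERE_WORKER_NUM_CPUS: 4
VSPHERE_WORKER_MEM_MIB: 16384
VSPHERE_WORKER_DISK_GIB: 40
```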
Storage
Most Pyramid workloads are stateless, but some of them do require persistent storage, with varying levels of performance requirements.
High-performance, single-access storage
This is applicable to the following workloads:
- Postgresql engine (whether in-cluster or running in external VMs)
- Elasticsearch (asset & metadata index/search solution).
These workloads require PVCs with exclusive read-write access (RWO - ReadWriteOnce).
We do not have precise requirements for these workloads; however, on AWS we provision gp3 volumes with the standard 3000 IOPS (125MB/s) performance level. On Prem environments in Tanzu should match this level at a minimum.
The exact storage size required for each will be determined by the Dalet Solution Architect.
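As an illustration, a claim requested by such a workload would be similar to the sketch below. The name, namespace, size and storage class are placeholders; actual sizes come from the Dalet Solution Architect.

```yaml
# Hypothetical RWO claim for a stateful workload (e.g. an Elasticsearch data volume)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: es-data-0                          # placeholder name
  namespace: pyramid                       # placeholder namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: vsphere-csi-default    # placeholder: the vSphere CSI storage class name
  resources:
    requests:
      storage: 100Gi                       # placeholder size
```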
General purpose, shared storage
Some workloads need to store persistent files that may be shared with other workloads.
The customer must provide an NFS export of size 100GB, with read-write rights to Dalet.
PVCs will be created in mode RWX - ReadWriteMany.
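One way to consume such an NFS export from Kubernetes is a statically provisioned PersistentVolume plus an RWX claim, sketched below. The server address, export path and names are assumptions; the actual mechanism used by Dalet's tooling may differ.

```yaml
# Hypothetical static PV backed by the customer-provided NFS export
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pyramid-shared-nfs        # placeholder name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs.example.local     # placeholder NFS server
    path: /exports/pyramid        # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pyramid-shared            # placeholder name
  namespace: pyramid              # placeholder namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""            # bind to the static PV above, not a dynamic provisioner
  volumeName: pyramid-shared-nfs
  resources:
    requests:
      storage: 100Gi
```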
Database
Pyramid uses a single Postgres database instance to store content (assets, metadata) as well as some of the configurations (applications, user settings, etc).
Layout
There are two options to run this Postgres database:
- as an external database cluster
- as an in-cluster database cluster, running in Kubernetes.
Dalet recommends a highly available deployment, with a minimum of 2 nodes using synchronous or asynchronous replication depending on RPO objectives. Backups should also be taken periodically and copied to storage outside the VM/Kubernetes cluster.
Configuration
Dalet is working on reducing the privileges required to deploy Pyramid.
This section will be updated soon, with detailed requirements.
Prerequisites
Tanzu installation
A Tanzu cluster should be dedicated to Pyramid (see note above, section "Tanzu layout").
It is also recommended to run Tanzu on enough hosts to guarantee N+1, or even N+2, level of redundancy.
A minimum of 3 ESXi hosts is recommended, as this allows proper separation with anti-affinity rules.
Versions
The following versions were used:
- vSphere 7.0U3
- Tanzu CLI 1.0.0, with Kubernetes 1.27.5
- NSX-ALB 22.1.3 (Essential tier)
- CRI: containerd v1.6.18
- CNI: antrea v1.11.2
- CSI: vsphere-csi v3.0.2
Pyramid installation
Dalet will need the following items ready before deploying the Pyramid software stack:
- Access to a kubeconfig file with full administrative rights on the Kubernetes Workload cluster(s) running Pyramid
- Name of the vSphere CSI storage class
- List of node pools with labels and taints (to be agreed beforehand with Dalet Ops team)
- List of DNS servers, as well as DNS search list
- A read-only account to the NSX-ALB admin UI (useful for Dalet Ops team to check NSX-ALB status)
- Creation of DNS records, including a wildcard, pointing to the Pyramid services; usually in the form pyramid.<cluster>.<fqdn> and *.<cluster>.<fqdn>
- A public TLS certificate for the above domains. Note: TLS certificate rotation to be defined
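A few hedged examples of how these items are typically verified once access is provided; the commands are standard kubectl and the file, node pool and domain names are placeholders.

```bash
# Verify the kubeconfig and cluster access (placeholder kubeconfig file name)
kubectl --kubeconfig workload-admin.kubeconfig get nodes -o wide

# Confirm the vSphere CSI storage class name
kubectl get storageclass

# Check node pool labels and taints agreed with the Dalet Ops team
kubectl get nodes --show-labels
kubectl describe nodes | grep -i taints

# Check DNS resolution of the Pyramid records (placeholder domain)
nslookup pyramid.cluster01.example.com
```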
Additional information & recommendations
Authentication
LDAP authentication has been tested, using Pinniped.
For simplicity, it was disabled in the validation platform.
Anti-affinity
It is recommended to put VM anti-affinity rules in place for HA services, so that their VMs are not scheduled onto the same ESXi host, thus limiting the impact of a host failure.
We recommend setting anti-affinity rules for the following VM types (see the sketch after this list):
- NSX-ALB controllers
- NSX-ALB Service engines
- Kubernetes master nodes (Management cluster and Workload clusters)
- Postgres instances (or worker nodes).
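A hedged example of creating such a DRS anti-affinity rule with the govc CLI is shown below; the cluster path, rule name and VM names are placeholders, and the same rule can equally be created from the vSphere UI.

```bash
# Keep the NSX-ALB controller VMs on separate ESXi hosts (placeholder names)
govc cluster.rule.create \
  -cluster=/Datacenter/host/Cluster01 \
  -name=nsx-alb-controllers-anti-affinity \
  -enable \
  -anti-affinity \
  nsx-alb-ctrl-01 nsx-alb-ctrl-02 nsx-alb-ctrl-03
```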
Tanzu Mission Control
TMC was tested as part of the POC, but we did not find a compelling use case for it, and no specific integration or tooling was developed against it.
As a result, TMC may be installed on Tanzu clusters if the customer finds it helpful.
Known issues
Node labels
In the tested version, the Tanzu Management cluster fails to add node labels as part of the initial provisioning (nodes are provisioned without custom labels); labels need to be added manually afterwards.
Taints are added correctly.
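Until this is fixed, the missing labels can be applied manually after provisioning, for example as below; the node name and label are placeholders to be agreed with the Dalet Ops team.

```bash
# Apply the agreed custom label to each affected worker node (placeholder values)
kubectl label node dalet-workload-md-0-abcde nodepool=postgres
```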
403 errors on Keycloak endpoint
When the Keycloak admin interface is not exposed through a Load balancer, we observed 403 errors when running our deployment scripts. The workaround is either to expose the Keycloak admin UI via a Load balancer, or to selectively disable SSL on Keycloak (this has no effect on overall security, because TLS is terminated at the Ingress controller level). This will be handled by the Dalet Ops team.
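For reference, selectively relaxing the SSL requirement on a Keycloak realm can be done with the Keycloak admin CLI, sketched below; the server URL, credentials and realm name are placeholders, and as noted above this is handled by the Dalet Ops team.

```bash
# Authenticate the admin CLI against the Keycloak instance (placeholder URL and user;
# the password is prompted for interactively)
kcadm.sh config credentials --server http://keycloak.example.local:8080 \
  --realm master --user admin

# Relax the SSL requirement on the relevant realm (placeholder realm name)
kcadm.sh update realms/pyramid -s sslRequired=NONE
```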