@@ -28,9 +28,19 @@ Does this sound cool to you? If so, continue to read on! 👇
## 🚀 Let's Go!
There are **5 stages** outlined below for completing this project, make sure you follow the stages in order.
There are **6 stages** outlined below for completing this project. Make sure you follow them in order.
### Stage 1: Machine Preparation
### Stage 1: Hardware Configuration
For a **stable** and **high-availability** production Kubernetes cluster, hardware selection is critical. NVMe/SSDs are strongly preferred over HDDs, and **Bare Metal is strongly recommended** over virtualized platforms like Proxmox.
Using **enterprise NVMe or SATA SSDs on Bare Metal** (even used drives) provides the most reliable performance and rock-solid stability. Consumer **NVMe or SATA SSDs**, on the other hand, carry risks such as latency spikes, corruption, and fsync delays, particularly in multi-node setups.
**Proxmox with enterprise drives can work** for testing or carefully tuned production clusters, but it introduces additional layers of potential I/O contention — especially if consumer drives are used. Any **replicated storage** (e.g., Rook-Ceph, Longhorn) should always use **dedicated disks separate from control plane and etcd nodes** to ensure reliability. Worker nodes are more flexible, but risky configurations should still be avoided for stateful workloads to maintain cluster stability.
These guidelines provide a strong baseline, but there are always exceptions and nuances. The best way to ensure your hardware configuration works is to **test it thoroughly and benchmark performance** under realistic workloads.
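One way to act on the benchmarking advice above is to measure how quickly a disk completes small synchronous writes, since that is the access pattern etcd depends on. The helper below is a minimal sketch, not part of this template: it assumes `fio` is installed, that you run it from a regular Linux environment on the target hardware (for example a live ISO before installing Talos), and the function name and default directory are purely illustrative.

```sh
# Sketch: measure fdatasync latency on the disk that would back etcd.
# Assumptions: `fio` is installed; the function name and default path are illustrative.
etcd_disk_check() {
  dir="${1:-/var/lib/etcd-bench}"
  mkdir -p "$dir" || return 1
  # Small sequential writes, each followed by fdatasync, roughly mimic etcd's write-ahead log.
  fio --name=etcd-fsync-check \
    --directory="$dir" \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
}
```

Call it as `etcd_disk_check /mnt/candidate-disk/bench` and look at the fsync/fdatasync percentiles in the output. etcd's own guidance is that the 99th percentile should stay below roughly 10ms, so treat noticeably higher numbers as a warning sign rather than a hard failure.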
### Stage 2: Machine Preparation
> [!IMPORTANT]
> If you have **3 or more nodes** it is recommended to make 3 of them controller nodes for a highly available control plane. This project configures **all nodes** to be able to run workloads. **Worker nodes** are therefore **optional**.
@@ -40,7 +50,7 @@ There are **5 stages** outlined below for completing this project, make sure you
1. Head over to the [Talos Linux Image Factory](https://factory.talos.dev) and follow the instructions. Be sure to only choose the **bare-minimum system extensions** as some might require additional configuration and prevent Talos from booting without it. You can always add system extensions after Talos is installed and working.
1. Head over to the [Talos Linux Image Factory](https://factory.talos.dev) and follow the instructions. Be sure to only choose the **bare-minimum system extensions** as some might require additional configuration and prevent Talos from booting without it. Depending on your CPU, start with the Intel or AMD system extensions (`i915`, `intel-ucode` & `mei` **or** `amdgpu` & `amd-ucode`). You can always add more system extensions after Talos is installed and working.
2. This will eventually lead you to download a Talos Linux ISO (or for SBCs a RAW) image. Make sure to note the **schematic ID**, as you will need it later on.
@@ -52,19 +62,20 @@ There are **5 stages** outlined below for completing this project, make sure you
> It is recommended to set the visibility of your repository to `Public` so you can easily request help if you get stuck.
1. Create a new repository by clicking the green `Use this template` button at the top of this page, then clone the new repo you just created and `cd` into it. Alternatively you can us the [GitHub CLI](https://cli.github.com/) ...
1. Create a new repository by clicking the green `Use this template` button at the top of this page, then clone the new repo you just created and `cd` into it. Alternatively you can use the [GitHub CLI](https://cli.github.com/) ...
2. **Install** the [Mise CLI](https://mise.jdx.dev/getting-started.html#installing-mise-cli) on your workstation.
2. **Install** the [Mise CLI](https://mise.jdx.dev/getting-started.html#installing-mise-cli) on your local workstation.
3. **Activate** Mise in your shell by following the [activation guide](https://mise.jdx.dev/getting-started.html#activate-mise).
@@ -80,17 +91,17 @@ There are **5 stages** outlined below for completing this project, make sure you
📍 _**Having trouble compiling Python?** Try running `mise settings python.compile=0` and then run these commands again_
5. Logout of GitHub Container Registry (GHCR) as this may cause authorization problems when using the public registry:
5. Log out of the GitHub Container Registry (GHCR), as stale credentials may cause authorization problems in future steps when using the public registry:
```sh
docker logout ghcr.io
helm registry logout ghcr.io
```
### Stage 3: Cloudflare configuration
### Stage 4: Cloudflare configuration
> [!WARNING]
> If any of the commands fail with `command not found` or `unknown command` it means `mise` is either not install or configured incorrectly.
> If any of the commands fail with `command not found` or `unknown command`, it means `mise` is either not installed, not activated, or configured incorrectly.
1. Create a Cloudflare API token for use with cloudflared and external-dns by reviewing the official [documentation](https://developers.cloudflare.com/fundamentals/api/get-started/create-token/) and following the instructions below.
@@ -107,7 +118,7 @@ There are **5 stages** outlined below for completing this project, make sure you
1. Generate the config files from the sample files:
@@ -136,10 +147,10 @@ There are **5 stages** outlined below for completing this project, make sure you
> [!TIP]
> Using a **private repository**? Make sure to paste the public key from `github-deploy.key.pub` into the deploy keys section of your GitHub repository settings. This will make sure Flux has read/write access to your repository.
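
If you already use the GitHub CLI, the deploy key from the tip above can also be registered from the terminal instead of the web UI. This is a minimal sketch and not something the template runs for you: it assumes `gh` is installed, authenticated, and invoked from inside your cloned repository. The function name and the `flux` title are arbitrary choices made for illustration.

```sh
# Sketch: register the Flux deploy key on a private repository with the GitHub CLI.
# Assumptions: `gh` is installed and authenticated; the title is an arbitrary label.
add_flux_deploy_key() {
  key_file="${1:-github-deploy.key.pub}"
  # --allow-write grants the read/write access Flux needs.
  gh repo deploy-key add "$key_file" --allow-write --title "flux"
}
```

Run `add_flux_deploy_key` from the repository root, or pass a different path to the public key as the first argument.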
### Stage 5: Bootstrap Talos, Kubernetes, and Flux
### Stage 6: Bootstrap Talos, Kubernetes, and Flux
> [!WARNING]
> It might take a while for the cluster to be setup (10+ minutes is normal). During which time you will see a variety of error messages like: "couldn't get current server API group list," "error: no matching resources found", etc. 'Ready' will remain "False" as no CNI is deployed yet. **This is a normal.** If this step gets interrupted, e.g. by pressing <kbd>Ctrl</kbd> + <kbd>C</kbd>, you likely will need to [reset the cluster](#-reset) before trying again
> It might take a while for the cluster to be set up (10+ minutes is normal). During this time you will see a variety of error messages such as "couldn't get current server API group list" and "error: no matching resources found". 'Ready' will remain "False" as no CNI is deployed yet. **This is normal.** If this step gets interrupted, e.g. by pressing <kbd>Ctrl</kbd> + <kbd>C</kbd>, you will likely need to [reset the cluster](#-reset) before trying again.
1. Install Talos:
@@ -207,7 +218,7 @@ There are **5 stages** outlined below for completing this project, make sure you
5. Check the status of your wildcard `Certificate`:
```sh
kubectl -n kube-system describe certificates
kubectl -n network describe certificates
```
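Rather than re-running `describe` until the certificate is issued, you can block until it reports ready. The helper below is a small sketch under a few assumptions: `kubectl` points at your cluster, the certificate lives in the `network` namespace, and the function name and default timeout are illustrative.

```sh
# Sketch: wait until every Certificate in a namespace reports Ready.
# Assumptions: the function name, default namespace, and default timeout are illustrative.
wait_for_certificates() {
  namespace="${1:-network}"
  timeout="${2:-10m}"
  kubectl -n "$namespace" wait certificates --all \
    --for=condition=Ready --timeout="$timeout"
}
```

For example, `wait_for_certificates network 15m` returns as soon as the wildcard `Certificate` is ready and exits non-zero if the timeout is reached first.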
### 🌐 Public DNS
@@ -226,9 +237,9 @@ The `external-dns` application created in the `network` namespace will handle cr
_... Nothing working? That is expected, this is DNS after all!_
### 🪝 Github Webhook
### 🪝 GitHub Webhook
By default Flux will periodically check your git repository for changes. In-order to have Flux reconcile on `git push` you must configure Github to send `push` events to Flux.
By default, Flux will periodically check your git repository for changes. In order to have Flux reconcile on `git push` you must configure GitHub to send `push` events to Flux.
1. Obtain the webhook path:
@@ -244,7 +255,7 @@ By default Flux will periodically check your git repository for changes. In-orde
3. Navigate to the settings of your repository on Github, under "Settings/Webhooks" press the "Add webhook" button. Fill in the webhook URL and your token from `github-push-token.txt`, Content type: `application/json`, Events: Choose Just the push event, and save.
3. Navigate to the settings of your repository on GitHub and, under "Settings/Webhooks", press the "Add webhook" button. Fill in the webhook URL and your token from `github-push-token.txt`, set the content type to `application/json`, choose "Just the push event" under Events, and save.
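If deliveries show up as failed in GitHub, it can help to know what the token is used for. Assuming the token is entered in the webhook's secret field, GitHub signs each payload with it using HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below reproduces that calculation locally so you can compare it against a delivery. It assumes `openssl` is available, and the function name is illustrative.

```sh
# Sketch: compute the signature GitHub attaches to a webhook delivery.
# Assumptions: `openssl` is installed; the function name is illustrative.
github_signature() {
  secret="$1"
  payload="$2"
  # Strip the "(stdin)= " style prefix so only the hex digest remains.
  digest="$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$secret" | sed 's/^.*= *//')"
  printf 'sha256=%s\n' "$digest"
}
```

For example, `github_signature "$(cat github-push-token.txt)" "$(cat payload.json)"`, where `payload.json` is a hypothetical file holding the exact request body copied from the delivery log. Note that command substitution drops trailing newlines, so a body that ends in a newline will produce a different digest here.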
## 💥 Reset
@@ -289,6 +300,36 @@ task talos:upgrade-k8s
# e.g. task talos:upgrade-k8s
```
### ➕ Adding a node to your cluster
At some point you might want to expand your cluster to run more workloads and/or improve the reliability of your cluster. Keep in mind it is recommended to have an **odd number** of control plane nodes for quorum reasons.
You don't need to re-bootstrap the cluster to add new nodes. Follow these steps:
1. **Prepare the new node**: Review the [Stage 2: Machine Preparation](#stage-2-machine-preparation) section and boot your new node into maintenance mode.
2. **Get the node information**: While the node is in maintenance mode, retrieve the disk and MAC address information needed for configuration:
```sh
talosctl get disks -n <ip> --insecure
talosctl get links -n <ip> --insecure
```
3. **Update the configuration**: Read the documentation for [talhelper](https://budimanjojo.github.io/talhelper/latest/) and extend the `talconfig.yaml` file manually with the new node information (including the disk and MAC address from step 2).
4. **Generate and apply the configuration**:
```sh
# Render your talosconfig based on the talconfig.yaml file
task talos:generate-config
# Apply the configuration to the node
task talos:apply-node IP=?
# e.g. task talos:apply-node IP=10.10.10.10
```
The node should join the cluster automatically, and workloads will be scheduled on it once it reports as ready.
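To confirm the join, you can list the nodes and wait for the new one to become ready. This is a minimal sketch rather than a task shipped with the template: the function name, the default timeout, and the example node name `k8s-3` are hypothetical.

```sh
# Sketch: list nodes, then block until the new node reports Ready.
# Assumptions: the function name and default timeout are illustrative.
wait_for_node() {
  node="$1"
  timeout="${2:-10m}"
  kubectl get nodes -o wide
  kubectl wait "node/$node" --for=condition=Ready --timeout="$timeout"
}
```

For example, `wait_for_node k8s-3` exits non-zero if the node has not become ready within ten minutes, which usually means the applied configuration or the node's networking needs another look.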
## 🤖 Renovate
[Renovate](https://www.mend.io/renovate) is a tool that automates dependency management. It is designed to scan your repository around the clock and open PRs for out-of-date dependencies it finds. Common dependencies it can discover are Helm charts, container images, GitHub Actions and more! In most cases merging a PR will cause Flux to apply the update to your cluster.
@@ -317,13 +358,13 @@ Below is a general guide on trying to debug an issue with an resource or applica
kubectl -n <namespace> get pods -o wide
```
3. Check the logs of the pod if its there:
3. Check the logs of the pod if it's there:
```sh
kubectl -n <namespace> logs <pod-name> -f
```
4. If a resource exists try to describe it to see what problems it might have:
4. If a resource exists, try to describe it to see what problems it might have:
```sh
kubectl -n <namespace> describe <resource> <name>
@@ -363,7 +404,7 @@ Below are some optional considerations you may want to explore.
### DNS
The template uses [k8s_gateway](https://github.com/ori-edge/k8s_gateway) to provide DNS for your applications, consider exploring [external-dns](https://github.com/kubernetes-sigs/external-dns) as an alternative.
The template uses [k8s_gateway](https://github.com/k8s-gateway/k8s_gateway) to provide DNS for your applications. Consider exploring [external-dns](https://github.com/kubernetes-sigs/external-dns) as an alternative.
External-DNS offers broad support for various DNS providers, including but not limited to:
@@ -376,7 +417,7 @@ This flexibility allows you to integrate seamlessly with a range of DNS solution
### Secrets
SOPs is an excellent tool for managing secrets in a GitOps workflow. However, it can become cumbersome when rotating secrets or maintaining a single source of truth for secret items.
SOPS is an excellent tool for managing secrets in a GitOps workflow. However, it can become cumbersome when rotating secrets or maintaining a single source of truth for secret items.
For a more streamlined approach to those issues, consider [External Secrets](https://external-secrets.io/latest/). This tool allows you to move away from SOPS and leverage an external provider for managing your secrets. External Secrets supports a wide range of providers, from cloud-based solutions to self-hosted options.
@@ -384,13 +425,11 @@ For a more streamlined approach to those issues, consider [External Secrets](htt
If your workloads require persistent storage with features like replication or connectivity to NFS, SMB, or iSCSI servers, there are several projects worth exploring:
These tools offer a variety of solutions to meet your persistent storage needs, whether you’re using cloud-native or self-hosted infrastructures.
@@ -402,27 +441,20 @@ Community member [@whazor](https://github.com/whazor) created [Kubesearch](https
### Community
- Make a post in this repository's Github [Discussions](https://github.com/onedr0p/cluster-template/discussions).
- Make a post in this repository's GitHub [Discussions](https://github.com/onedr0p/cluster-template/discussions).
- Start a thread in the `#support` or `#cluster-template` channels in the [Home Operations](https://discord.gg/home-operations) Discord server.
### GitHub Sponsors
## 📺 Media
If you're having difficulty with this project, can't find the answers you need through the community support options above, or simply want to show your appreciation while gaining deeper insights, I’m offering one-on-one paid support through GitHub Sponsors for a limited time. Payment and scheduling will be coordinated through [GitHub Sponsors](https://github.com/sponsors/onedr0p).
Check out these videos below. If you find them helpful, a like and subscribe goes a long way!
<details>
<summary>Click to expand the details</summary>
<br>
- **Rate**: $50/hour (no longer than 2 hours / day).
- **What’s Included**: Assistance with deployment, debugging, or answering questions related to this project.
- **What to Expect**:
1. Sessions will focus on specific questions or issues you are facing.
2. I will provide guidance, explanations, and actionable steps to help resolve your concerns.
3. Support is limited to this project and does not extend to unrelated tools or custom feature development.
# encrypt_disk: false # (ADVANCED/OPTIONAL) TPM-based disk encryption. Ref: https://www.talos.dev/latest/talos-guides/install/bare-metal-platforms/secureboot
# kernel_modules: [] # (ADVANCED/OPTIONAL) Only applicable if the `schematic_id` you've provided contains system extensions that require kernel modules to correctly load - Example: ["nvidia", "nvidia_uvm", "nvidia_drm", "nvidia_modeset", "zfs"]