#rke2 - Fosstodon

Recent searches

Search options

Only available when logged in.

1 post1 participant0 posts today

Mika @irfan@sakurajima.social

I have finally caved in and dove into the rabbit hole of #Linux Container (#LXC) on #Proxmox during my exploration on how to split a GPU across multiple servers and... I totally understand now seeing people's Proxmox setups that are made up exclusively of LXCs rather than VMs lol - it's just so pleasant to setup and use, and superficially at least, very efficient.

I now have a #Jellyfin and #ErsatzTV setup running on LXCs with working iGPU passthrough of my server's #AMD Ryzen 5600G APU. My #Intel #ArcA380 GPU has also arrived, but I'm prolly gonna hold off on adding that until I decide on which node should I add it to and schedule the shutdown, etc. In the future, I might even consider exploring (re)building a #Kubernetes, #RKE2 cluster on LXC nodes instead of VMs - and if that's viable or perhaps better.

Anyway, I've updated my #Homelab Wiki with guides pertaining LXCs, including creating one, passing through a GPU to multiple unprivileged LXCs, and adding an #SMB share for the entire cluster and mounting them, also, on unprivileged LXC containers.

https://github.com/irfanhakim-as/homelab-wiki/blob/master/topics/proxmox.md#linux-containers-lxc

Wiki about everything Homelab. Contribute to irfanhakim-as/homelab-wiki development by creating an account on GitHub.

GitHubhomelab-wiki/topics/proxmox.md at master · irfanhakim-as/homelab-wikiWiki about everything Homelab. Contribute to irfanhakim-as/homelab-wiki development by creating an account on GitHub.

Johannes Kastl @johanneskastl@digitalcourage.social

Package updates for #rke2 including fixes for the #nginx #ingress #vulnerability are on their way to @opensuse #Tumbleweed. This means rke2 as well as the flavors for Kubernetes 1.31, 1.30 and 1.29.

Replied in thread

OOTS @oots@infosec.exchange

@andreasdotorg @redknight
I think at that level it's conceptually easy, you "just" need (wo-)manpower to set up and maintain everything yourself. Assuming you want to set up a new cloud provider from scratch and build one/two/three new DCs in different regions in Europe:
- buy standard "off-the-shelve" server hardware
- at this level you can use US networking equipment (firewalls, routers, switches)
- and then use/self-host all the open-source software you want

E.g.:
- use your favourite #Linux distro (#debian, #ubuntu, #fedora, or whatever)
- set up Netbox or a similar tool (and maybe phpIPAM) + #PostGreSQL Server
- there's probably no way around #OpenStack either way, with #MariaDB and some other open source tools in the background
- you can set up #Prometheus, #Grafana, #OpenSearch for observability

And on top of that offer services as you see fit:
- automate setup/maintenance of #Kubernetes clusters (I heard #RKE2 is a fairly self-contained #K8s distribution)
- automate setup/maintenance of DB servers
- provide a way to run "serverless" apps
- set up #nextcloud or so

Mika @irfan@sakurajima.social

#FediHire #GetFediHired

I'm a #Programmer/#SoftwareEngineer. I'm most fluent in #Python, have some basics in #Java and #C++, but I'm also taking up new languages like #Javascript and others in my eternal journey of getting better and minimising the impostor syndrome that befalls pretty much all programmers (I feel). I'm also very experienced in #CloudNative/#DevOps technologies, and have been the one devising solutions and maintaining infrastructure in a fast-paced startup environment in my previous employment.

I'm passionate in what I do and those that know me here or IRL would know that I'm always yapping about the things I'm learning or working on - I love discussing them, and I love helping people out - esp those on the same boat as me.

This passion has led me into writing and maintaining tons of #FOSS projects like Mango: a content distribution framework based on #Django for #Mastodon and #Bluesky that powers various bots of mine like @lowyat@mastodon.social and @waktusolat@mastodon.social, Charts: a #Helm chart repository for an easy and reproducible deployment strategy for all my projects and everything else I self-host on my #homelab, and Orked: O-tomated #RKE2 distribution, a collection of scripts I wrote that are comprehensively documented to enable everyone to self-host a production-grade #Kubernetes cluster for absolutely free in their homes.

I'm based in Malaysia, but I'm open to just about any on-site, hybrid, or remote job opportunities anywhere. In the meantime though, I'm actively looking for a job in countries like #Japan and #Singapore, in a bid for a desperate lifestyle change. I've linked below my Portfolio (which you too, could self-host your own!), for those who'd wish to connect/learn more of me. Thank you

https://l.irfanhak.im/resume

GitHubGitHub - irfanhakim-as/mango: Mastodon bot module for DjangoMastodon bot module for Django. Contribute to irfanhakim-as/mango development by creating an account on GitHub.

Mika @irfan@sakurajima.social

#Rancher/#RKE2 #Kubernetes cluster question - I don't need Rancher, but in the past with my RKE2 clusters, I normally deploy Rancher on a single VM using #Docker just for the sake of having some sort of UI for my cluster(s) if need be - with this setup, I'm relying on importing the downstream (RKE 2) cluster(s) onto said Rancher deployment. That worked well.

This time round though, I tried deploying Rancher on the cluster itself, instead of an external VM, using #Helm. Rancher's pretty beefy and heavy to deploy even with a single replica, and from my limited testing I found that it's easier to deploy when your cluster's pretty new and not have much resources running just yet.

What I'm curious about tho are these errors - my cluster's fine, and I'm not seeing anything wrong with it, but ever since deploying it a few days ago, I'm constantly seeing these Liveness/Readiness probe failed error on all 3 of my Master nodes (periodically most of the time, not all at once) - the same error also seems to include etcd failed: reason withheld. What does it mean, and how do I "address" it?

Rancher web UI showing the dashboard of a cluster, with some recent events listed pertaining various Master nodes, each with an error of either Readiness or Liveness probe failed.

Mika @irfan@sakurajima.social

OK, finally got the node back up again. I decided to just get a brand new #BeQuiet SFX PSU, to replace the old #SpeedCruiser Flex PSU and rebuild the node with otherwise the same hardware, into a spare #Silverstone SG13 case I have and it all works.

Also discovered that the reason why HA wasn't working and none of my VMs I've set replication for didn't carry over to the healthy node was because I had forgotten to actually create a HA group on Proxmox and set the VMs to HA so... did that.

Now I'm freaking wrestling with my #RKE2 #Kubernetes cluster to get it back up and running again, cos atm the cluster is littered with pods that are either Pending, Unknown or in a crash loop... which is always fucking fun. Also the cluster itself is kind of slow to respond (on #k9s)... which is concerning but I think has to do prolly with how its networking is setup.

I'm still completely clueless honestly on what the "ideal" networking setup is for both Proxmox, and a Kubernetes cluster hosted on Proxmox. I'm still stuck with the defaults, for now, that is using the onboard nic on each Proxmox node for every single thing. The only customisation I've done was setting a bandwidth limit (on Proxmox) only for migration. #Homelab folks please feel free sending some suggestions my way, that is as dummy-friendly as possible :))

RE: https://sakurajima.social/notes/a2e9fm36yg

Sakurajima Social (桜島SNS）Mika (@irfan)Uh... one of my #Proxmox nodes exploded... wtf 😂 I was running the #Rancher cleanup script, to remove Rancher from my #Kubernetes cluster (was experimenting something), and left it. After a while I noticed that the cluster wasn't reachable. Then I checked Proxmox and saw that the node (which is also the one hosting my primary Master node of my cluster) was offline. I went to the room and saw that the PC's power button's lights weren't on and the fans weren't running. I pressed it, and the fans started running for a while, then it sort of went quiet. I pressed it again, and then something just popped within the PC and all the motherboard lights went off. I've no fucking idea what happened, and what damage was done (I **really** hope it's just the PSU) but I'm just gonna let it sit and check it later :blobfoxangrylaugh:

Continued thread

Mika @irfan@sakurajima.social

I'm wondering right now what to do - I'm outside rn and so fucking eager (and extremely worried) to inspect and see just what caused it, when I'm home. I did notice the power plug (attached to the wall/extension cord) wasn't seated tight.. could that have been the cause?

It's a #B550 ITX board with a 500W Flex PSU - it doesn't have any GPU, just an #AMD Ryzen 5 5600G APU. Besides that, just 2 sticks of DDR4 RAM (64GB) and 2x 1TB NVME SSDs. It all depends on what I'll find later when I inspect it tho, but I'm wondering if I should continue having the node in my #homelab with replaced PSU (found a 600W 80 Platinum rated Flex PSU by FSP), or ditch it completely and replace it with an MATX build instead - #AsRock B550M Pro4 mATX board, same APU (or prolly, replace with my spare Ryzen 7 3700X), and brand new #CoolerMaster or #Corsair ATX PSU (80 Gold rated prolly, cos ATX PSUs are somehow more expensive than Flex ones?).

The latter route is def more expensive, but idk if running a #Proxmox node with an #RKE2 #Kubernetes cluster 24/7 in a mini ITX setup is the most brilliant idea... though that node was running perfectly fine for a pretty long time now (~1 year or so probably).

Mika @irfan@sakurajima.social

Lol, tried draining a #Kubernetes #RKE2 node with #Longhorn - first, I enabled the allow-node-drain-with-last-healthy-replica setting. That still didn't help with the PDB issue, so I manually deleted the PDBs (which helped in a different test, on a different cluster) - that resulted to the pod being evicted, but... it would just hang forever with no output whatsoever.

I tried restarting the drain process, but that did nothing, it'll just be stuck at evicting something again with no output. When I checked available volumes, I'd see several (on the draining node) appearing and disappearing, and changing status from detached to attaching, etc. I'm so freaking done with this whole shenanigan at this point that I just decide to freaking shutdown the whole goddamn node lol.

Maybe the "safe" way to stop a cluster from this point onwards is to just nuke the damn thing

Mika @irfan@sakurajima.social

Does anyone know what I'm missing with removing a node from a #Kubernetes cluster (set up with #RKE2 and #Longhorn)?

I've cordoned the node, drained the node, "killed" the node with the rke2-killall.sh script, uncordoned the node (I wouldn't have in a node removal case, but my script does that generally and thought it doesn't matter), and then uninstalled RKE2 on the (removing) node with the rke2-uninstall.sh script.

If I check the nodes' status on said cluster now, the worker nodes I've "removed" are still there, just in a NotReady state. What's the proper, missing piece here?

NAME             STATUS     ROLES                       AGE   VERSION
orked-master-1   Ready      control-plane,etcd,master   46h   v1.25.15+rke2r2
orked-worker-1   Ready      worker                      46h   v1.25.15+rke2r2
orked-worker-2   NotReady   worker                      23h   v1.25.15+rke2r2
orked-worker-3   NotReady   worker                      23h   v1.25.15+rke2r2

Is it as simple as just, deleting the node with kubectl now?

kubectl delete node <node>

Mika @irfan@sakurajima.social

How do you update #Longhorn's Node Drain Policy on a #Kubernetes/#RKE2 cluster? I think you could do it on the UI, but in this test cluster I'm experimenting with, I did not install #Rancher or "attach" this cluster to one so I don't have access to the UI.

I'm trying to update said policy to allow-if-replica-is-stopped, and see if that would solve the errors I'm getting draining nodes in my cluster: Cannot evict pod as it would violate the pod's disruption budget.

Update: nvm got it https://longhorn.io/docs/1.7.2/advanced-resources/deploy/customizing-default-settings/#using-kubectl

Didn't solve my error though.

LonghornLonghorn | Customizing Default Settings

Mika @irfan@sakurajima.social

How does one properly, and safely shutdown an #RKE2 #Kubernetes cluster (and monitor + ensure that is truly the case)? I'm surprised it doesn't seem to be discussed in the official RKE2 docs at all, and the very few discussions I've seen of it on #GitHub kinda return various answers I'm not quite confident/certain of.

I'm trying to fully and properly tackle this so I can incorporate it in my RKE2 management tool so even fools like me can be "proper" about it rather than... idk... shutting the VMs down and hope for the best each time

https://github.com/irfanhakim-as/orked/issues/25

GitHubHelper script to safely stop all nodes · Issue #25 · irfanhakim-as/orkedBy irfanhakim-as

Mika @irfan@sakurajima.social

I've successfully migrated my #ESXi #homelab server over to #Proxmox after surprisingly a little bit of (unexpected) trouble - haven't really even moved all of my old services or #Kubernetes cluster back into it, but I'd say the most challenging part I was expecting which is #TrueNAS has not only been migrated, but also upgraded from TrueNAS Core 12 to TrueNAS Scale 24.10 (HUGE jump, I know).

Now then. I'm thinking what's the best way to move forward with this, now that I have 2 separate nodes running Proxmox. There are multiple things to consider. I suppose I could cluster 'em up, so I can manage both of them under one roof but from what I can tell, clustering on Proxmox works the same way as you would with Kubernetes clusters like #RKE2 or #K3s whereby you'd want at least 3 nodes, if not just 1. I can build another server, I have the hardware parts for it, but I don't think I'd want to take up more space I already do and have 3 PCs running 24/7.

I'm also thinking of possibly joining my 2 RKE 2 Clusters (1 on each node) into 1... but I'm not sure how I'd go about it having only 2 physical nodes. Atm, each cluster has 1 Master node and 3 Worker nodes (VMs ofc). Having only 2 physical nodes, I'm not sure how I'd spread the number of master/worker nodes across the 2. Maintaining only 1 (joined) cluster would be helpful though, since it'd solve my current issue of not being able to use one of them to publish services online using #Ingress "effectively", since I could only port forward the standard HTTP/S ports to only a single endpoint (which means the secondary cluster will use a non-standard port instead i.e. 8443).

This turned out pretty long - but yea... any ideas what'd be the "best" way of moving forward if I only plan to retain 2 Proxmox nodes - Proxmox wise, and perhaps even Kubernetes wise?

Mika @irfan@sakurajima.social

For some reason, I feel like the #RKE2 cluster on my #Proxmox node is more fragile than the cluster on my #ESXi node. Like, on the latter, I can simply shutdown and boot the nodes however I want just like that and everything seems to just get in a working state on tis own. On the former, for some reason, things seem to boot in a non-running start with various status like Unknown, CrashLoopBackOff, etc. - some gets solved by me deleting/restarting the pods, some though will require me to run the killall script and reboot the entire node. Pretty weird, when both clusters were deployed/configured the exact same way and runs the exact same version.

Mika @irfan@sakurajima.social

Twice now my secondary #RKE2 cluster running on my #Proxmox node is giving me a bunch of useless errors, mostly seems to do with #Longhorn, that are preventing my services from running

Maybe it has to do with one of my SSD which shows that it had passed the #SMART test and that it's "healthy", on Proxmox, yet shows a Media and Data Integrity Errors value of 609 which I assume is definitely concerning.

Mika @irfan@sakurajima.social

One of my #RKE2 #Kubernetes worker nodes seems to be having a networking issue of some sort I’ve not seen before in any of my clusters. I can ssh to it, it can access the internet, but yea a bunch of pods, all from that node, are either stuck at terminating for wtv reason while some of them are in a crash loop. The other worker nodes work fine though and they should all be identical to each other.

Mika @irfan@sakurajima.social

Huh... one of my #RKE2 #Kubernetes nodes running on my second cluster on #Proxmox seems to be completely borked and unbootable for whatever reason.

Jote @ailnoth@social.plux.wtf

Honestly, now when I understand how #kubernetes works, the differences between #k3s, #rke2 etc.... Life is pretty sweet. But this also means I will end up getting more machines at hetzner for build proper HA #clusters for my stuff. It is nice to have everything on servers you actually have control over (the only thing I will not touch for now is hosting my own email as that is a fucking hassle due to bigtech and getting spamlisted etc)

Mika @irfan@sakurajima.social

I've just merged a huge PR to my #Orked (O-tomated RKE Distribution - GREAT NAME I KNOW) that makes it easier than ever for anyone to set up a production-ready #RKE2 #Kubernetes cluster in their #homelab.

With this collection of scripts, all you need to do is just provision the nodes required, including a login/management node, and run the scripts right from the login node to configure all of the other nodes to make up the cluster. This setup includes:

- Configuring the Login node with any required or essential dependencies (such as #Helm, #Docker, #k9s, #kubens, #kubectx, etc.)

- Setup passwordless #SSH access from the Login node to the rest of the Kubernetes nodes

- Update the hosts file for strictly necessary name resolution on the Login node and between the Kubernetes nodes

- Necessary, best practice configurations for all of the Kubernetes nodes including networking configuration, disabling unnecessary services, disabling swap, loading required modules, etc.

- Installation and configuration of RKE2 on all the Kubernetes nodes and joining them together as a cluster

- Installation and configuration of #Longhorn storage, including formatting/configuring their virtual disks on the Worker nodes

- Deployment and configuration of #MetalLB as the cluster's load-balancer

- Deployment and configuration of #Ingress #NGINX as the ingress controller and reverse proxy for the cluster - this helps manage external access to the services in the cluster

- Setup and configuration of #cert-manager to obtain and renew #LetsEncrypt certs automatically - supports both #DNS and HTTP validation with #Cloudflare

- Installation and configuration of #csi-driver-smb which adds support for integrating your external SMB storage to the Kubernetes cluster

Besides these, there are also some other helper scripts to make certain related tasks easy such as a script to set a unique static IP address and hostname, and another to toggle #SELinux enforcement to on or off - should you need to turn it off (temporarily).

If you already have an existing RKE2 cluster, there's a step-by-step guide on how you could use it to easily configure and join additional nodes to your cluster if you're planning on expanding.

Orked currently expects and supports #RockyLinux 8+ (should also support any other #RHEL distros such as #AlmaLinux), but I am planning to improve the project over time by adding more #Linux distros, #IPv6 support, and possibly even #K3s for a more lightweight #RaspberryPi cluster for example.

I've used this exact setup to deploy and manage vital services to hundreds of unique clients/organisations that I've become obsessed with sharing it to everyone and making it easier to get started. If this is something that interests you, feel free to check it out!

If you're wondering what to deploy on a Kubernetes cluster - feel free to also check out my #mika helm chart repo

https://github.com/irfanhakim-as/orked

https://github.com/irfanhakim-as/charts

GitHubGitHub - irfanhakim-as/orked at stop-clusterO-tomated RKE Distribution (Orked) is a collection of scripts that aims to easily and reliably set up a production-ready Kubernetes cluster based on RKE2, with Longhorn storage, that is highly perf...

#certmanager #csidriversmb

Mika @irfan@sakurajima.social

I've #MetalLB + #NGINX #Ingress controller setup on my #RKE2 #Kubernetes cluster. How it should/would work is for me to:

1. Register a domain name on my DNS server i.e. #Cloudflare pointing towards my public IP (router)

2. Set up port forwarding on my router for HTTP (wan port: 80, virtual host port: 80, lan host IP: my loadbalancer IP) and HTTPS (wan port: 443, virtual host port: 443, lan host IP: my loadbalancer IP)

3. Deploy an ingress for my service

^ this setup works. but right now, I'm in a situation where i want to avoid using WAN ports; 80 and 443 - and possibly switch them with 8080 and 8443. What should I do to be able to do this, because as it is if I do just that, going to the address/domain specified in my service's ingress would return me a 404 NGINX error.

Why I want to do this is because that 80-443 port forwarding rule I've already used for my Cluster 1's loadbalancer IP. I'm now setting things up for my second cluster, Cluster 2, and I can't port forward the same 80-443 pair to Cluster 2's loadbalancer IP.

Please help ;(

Mika @irfan@sakurajima.social

Asking this again to #Kubernetes/#networking experts cos I still am not able to figure this out. I've 1 #RKE2 cluster previously, with #MetalLB and #NGINX Ingress configured. In this "Cluster 1" setup, I have 2 port forwarding rules on my router that looks something like this:

name: cluster1-http
wan start/end port: 80
lan host address: 192.168.0.88
virtual host port: 80

name: cluster1-https
wan start/end port: 443
lan host address: 192.168.0.88
virtual host port: 443

The MetalLB IPv4 address range (IPAddressPool) I had set up for Cluster 1 was 192.168.0.88-192.168.0.89.

Now I had deployed a second cluster (i.e. Cluster 2), in the same network, with the exact same set of configurations, besides a different IP range of 192.168.0.86-192.168.0.87 for the load balancer. I thought all I need now is to setup a similar pair of port forwarding rules, but with updated LAN Host Address according to its IP range (i.e. 192.168.0.86) but that's not possible since my router gave out an error complaining that the WAN Start/End ports were conflicting.

I updated the WAN ports to 8080 and 8443 respectively (just for the sake of it really cos idk what else to do for now), but when I tested deploying a service with Ingress and while it successfully received a cert from #LetsEncrypt/#Cloudflare, actually going to the domain would give me an NGINX 404 error. What should I do?

Drag & drop to upload