this week I'm reading Human Factors in Systems Engineering
there are so many gems I've highlighted already, but I really vibed with how the author clearly and simply expressed the impact of writing docs "early" here
Are you looking for a new remote job? Browse 400+ remote positions from open source companies including @acquia @grafana @mozilla @wikimediafoundation and more on #OSJH
https://opensourcejobhub.com/jobs/?q=remote&utm_source=mosjh
#career #OpenSource #engineer #sales #security #marketing #CloudNative #developer #DevSecOps #SRE #FOSS
Want to grow your open source career? The LiFT Scholarship offers training & certs to help you level up—whether you're starting out or advancing.
Apply by April 30: https://app.smarterselect.com/programs/102338-Linux-Foundation-Education
Deploy Consul as OpenTofu Backend with Azure & Ansible
Off to #Berlin for the first time to meet my new colleagues at coding. powerful. systems. CPS GmbH. (Setting out from what must be the smallest train station around, in the neighboring village.)
#Antrittsbesuch #Hauptstadt #Arbeit #Dienstreise #SRE
Running an Incidents 101 training tomorrow, including two games that both involve some dice rolling; should be fun. I don't feel nervous: I know which process and common ground to cover.
I'm trying my best to keep the material interesting, balancing getting some interaction with showing the necessary slides that lay out the steps and rules.
In one activity we throw gaming dice to build a context, randomizing things like customers affected, size of response, time of day, etc. Then we use the rules for gauging severity. That's the whole game!
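As a rough idea of the kind of randomization involved, a dice-driven scenario generator might look like the sketch below. The tables and the severity rule are invented for illustration; they are not the actual training materials.

```python
"""Toy sketch of a dice-driven incident scenario generator.
All tables and the severity rule are made up for illustration."""
import random

CUSTOMERS_AFFECTED = {1: "one customer", 2: "a handful", 3: "one region",
                      4: "one region", 5: "most customers", 6: "everyone"}
TIME_OF_DAY = {1: "02:00", 2: "06:00", 3: "10:00",
               4: "14:00", 5: "18:00", 6: "22:00"}
RESPONSE_SIZE = {1: "single on-call engineer", 2: "single on-call engineer",
                 3: "one team", 4: "one team",
                 5: "several teams", 6: "all hands"}

def roll() -> int:
    """Roll one six-sided die."""
    return random.randint(1, 6)

def scenario() -> dict:
    """Build a random incident context, then gauge a toy severity."""
    customers, tod, response = roll(), roll(), roll()
    return {
        "customers_affected": CUSTOMERS_AFFECTED[customers],
        "time_of_day": TIME_OF_DAY[tod],
        "response_size": RESPONSE_SIZE[response],
        # Toy rule: the worse of customer impact and response size drives severity.
        "severity": f"SEV{4 - max(customers, response) // 2}",
    }

if __name__ == "__main__":
    print(scenario())
```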
Above all, I hope the activities go well and that the people unfamiliar with the process get an opportunity to learn something. I can't make it everything for everybody, but I hope it helps the right people.
a short lil blog post sharing how re-reading the evergreen Etsy Debriefing Facilitation Guide helped me better investigate a mysterious sound...
Not sure if I asked this before: does anyone use anything in particular to inject #apache logs into #SQL databases? I've been looking around and asking around, and the only solid answer I got was "do not expect an apache module for that; it would introduce too much latency to each request" in #httpd@libera.chat.
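For what it's worth, one way to avoid adding latency to each request is to ingest rotated log files out of band rather than in the request path. A minimal sketch, assuming the combined log format and a local SQLite database (the table and column names are my own invention):

```python
#!/usr/bin/env python3
"""Minimal sketch: batch-load Apache combined-format access logs into SQLite.
Assumptions (not from the original post): combined log format, a local
SQLite file, and offline ingestion of a rotated log file, so nothing
runs inside Apache's request handling."""
import re
import sqlite3
import sys

# Combined Log Format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def main(logfile: str, dbfile: str = "access_log.db") -> None:
    conn = sqlite3.connect(dbfile)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS access_log (
               host TEXT, user TEXT, time TEXT, request TEXT,
               status INTEGER, size INTEGER, referer TEXT, agent TEXT)"""
    )
    rows = []
    with open(logfile, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if not m:
                continue  # skip lines that don't match the expected format
            d = m.groupdict()
            size = 0 if d["size"] == "-" else int(d["size"])
            rows.append((d["host"], d["user"], d["time"], d["request"],
                         int(d["status"]), size, d["referer"], d["agent"]))
    conn.executemany("INSERT INTO access_log VALUES (?,?,?,?,?,?,?,?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    main(*sys.argv[1:])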
Your logs are lying to you - metrics are meaner and better.
Everyone loves logs… until the incident postmortem reads like bad fan fiction.
Most teams start with expensive log aggregation, full-text searching their way into oblivion. So much noise. So little signal. And still, no clue what actually happened. Why? Because writing meaningful logs is a lost art.
Logs are like candles, nice for mood lighting, useless in a house fire.
If you need traces to understand your system, congratulations: you're already in hell.
Let me introduce my favourite method: real-time, metric-driven user simulation aka "Overwatch".
Here's how you do it:
Set up a service that runs real end-to-end user workflows 24/7. Use Cypress, Playwright, Selenium… your poison of choice.
Every action creates a timed metric tagged with the user workflow and action.
Now you know exactly what a user did before everything went up in flames.
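Here's a rough sketch of what one of those timed metrics could look like, assuming Playwright for Python and the influxdb-client package; the measurement, tag, and field names (and the checkout workflow itself) are made-up placeholders, not anyone's actual setup:

```python
"""Rough sketch of one synthetic user workflow step with a timed,
tagged metric per action. Names like workflow_action, workflow,
action, and duration_ms are illustrative assumptions."""
import time
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS
from playwright.sync_api import sync_playwright

client = InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

def timed_action(workflow: str, action: str, fn) -> None:
    """Run one user action, time it, and emit a tagged metric."""
    start = time.monotonic()
    ok = True
    try:
        fn()
    except Exception:
        ok = False
        raise
    finally:
        point = (
            Point("workflow_action")
            .tag("workflow", workflow)
            .tag("action", action)
            .tag("ok", str(ok))
            .field("duration_ms", (time.monotonic() - start) * 1000.0)
        )
        write_api.write(bucket="overwatch", record=point)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    timed_action("checkout", "open_shop", lambda: page.goto("https://shop.example.com"))
    timed_action("checkout", "add_to_cart", lambda: page.click("text=Add to cart"))
    browser.close()
```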
Use Grafana + InfluxDB (or other tools you already use) to build dashboards that actually tell stories (a sample query sketch follows the list):
* How fast are user workflows?
* Which steps are breaking, and how often?
* What's slower today than yesterday?
* Who's affected, and where?
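For instance, the "how fast are user workflows?" panel could be fed by something like this, reusing the assumed workflow_action measurement from the sketch above; the bucket, token, and grouping choices are illustrative, not prescriptive:

```python
"""Illustrative query for average workflow step duration per workflow
over the last day, assuming the setup sketched above."""
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="my-org")

flux = '''
from(bucket: "overwatch")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "workflow_action" and r._field == "duration_ms")
  |> group(columns: ["workflow"])
  |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
'''

# Print one row per workflow per hour: workflow tag, timestamp, mean duration.
for table in client.query_api().query(flux):
    for record in table.records:
        print(record.values.get("workflow"), record.get_time(), record.get_value())
```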
Alerts now mean something.
Incidents become surgical strikes, not scavenger hunts.
Bonus: run the same system in every test environment and detect regressions before deployment. And if you make it reusable, you can even use the same service for load tests.
No need to buy overpriced tools. Just build a small service like you already do, except this one might save your soul.
And yes, transform logs into metrics where possible. Just hash your PII data and move on.
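A tiny sketch of what that could look like, turning a log event into a metric-friendly record with hashed identifiers. The field names and salt handling are my own illustration, not anyone's actual pipeline:

```python
"""Toy sketch: convert a log event into a metric record with hashed PII.
Field names and the salt are illustrative assumptions."""
import hashlib
import json

SALT = b"rotate-me-regularly"  # assumption: a salt kept out of the metrics store

def log_event_to_metric(event: dict) -> dict:
    """Keep the useful dimensions, hash anything that identifies a person."""
    user_hash = hashlib.sha256(SALT + event["user_email"].encode()).hexdigest()[:16]
    return {
        "measurement": "user_action",
        "tags": {"action": event["action"], "user": user_hash},
        "fields": {"duration_ms": event["duration_ms"]},
    }

if __name__ == "__main__":
    raw = {"user_email": "jane@example.com", "action": "login", "duration_ms": 87}
    print(json.dumps(log_event_to_metric(raw), indent=2))
```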
Stop guessing. Start observing.
Metrics > Logs. Always.
System Administration
Week 10, Backups: Core Concepts
In this video, we begin our discussion of backups by covering some core concepts and terminology, looking at full vs. incremental vs. differential backups, and at the difference between long-term storage for disaster recovery and restoring files after more localized data loss.
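As a toy illustration of the distinction (my own, not taken from the video), here is roughly which files each backup type would pick up based on modification times:

```python
"""Toy illustration of which files full, differential, and incremental
backups copy, based on modification times. Paths and dates are made up."""
from datetime import datetime

files = {  # path -> last modified
    "/etc/passwd":        datetime(2025, 4, 1),
    "/home/jan/notes.md": datetime(2025, 4, 14),
    "/var/log/syslog":    datetime(2025, 4, 15),
}

last_full        = datetime(2025, 4, 10)  # most recent full backup
last_incremental = datetime(2025, 4, 14)  # most recent backup of any kind

full         = list(files)                                            # everything
differential = [f for f, m in files.items() if m > last_full]         # changed since last full
incremental  = [f for f, m in files.items() if m > last_incremental]  # changed since last backup

print("full:        ", full)
print("differential:", differential)
print("incremental: ", incremental)
```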
And here’s the big reveal:
Virtual flash cards for the key terms across all of DevOps Institute's exams. I took the glossaries from all their public study guides, deduplicated them, converted the courses they appear in into tags, and added an exam they missed.
https://github.com/ajn142/DOI-Exam-Glossary
Reposting because I forgot the number one rule of chronological timelines (don’t post when everyone’s asleep lol).
Observability Migration - A new approach
https://www.cloudraft.io/blog/influxdb-to-grafana-mimir-migration
Discussions: https://discu.eu/q/https://www.cloudraft.io/blog/influxdb-to-grafana-mimir-migration
Site Reliability Engineering is often like Cassandra (not the database), where you tell devs the kinds of scaling issues they'll see if they continue following clever shortsighted patterns — you're frequently correct but they never believe you.
Job search journey as a DevOps/SRE/Platform engineer in the Netherlands/Amsterdam (Dec '24 - Apr '25)
Discussions: https://discu.eu/q/http://cargo.one/
System Administration
Week 9, Writing System Tools
This week we're going on a side-quest to discover solid #programming best practices that apply across simple scripting, prototyping, growing your tools, and owning a software product. We don't have videos for this topic, but the slides below include a lot of hopefully useful links ranging from coding style to ticket management and commit messages.
https://stevens.netmeister.org/615/09-writing-system-tools.pdf
If you've tried both Thanos and Mimir, which do you prefer? Feel free to comment why below
So, I've been using Thanos to receive and store my Prometheus metrics long-term in a self-hosted S3 bucket. Thanos also acts as a datasource for my dashboards in Grafana and provides a Ruler, which evaluates alerting rules against my metrics and forwards them to my Alertmanager. It's OK. It certainly has its downsides, which I can go into later, but I've been thinking... what about Mimir?
How do you all feel about Grafana's Mimir (source on GitHub)? It's AGPL and seems to literally be a replacement for Thanos, which is Apache 2.0.
Thanos description from their website:
Open source, highly available Prometheus setup with long term storage capabilities.
Mimir description from their website:
...open source software project that provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus and OpenTelemetry metrics.
Both work with Alloy and Prometheus alike. Both require you to configure initially confusing hashrings and replication parameters. Both have a bunch of large companies adopting them, so... now I feel conflicted. Should I try Mimir? Poll in reply.