I'm trying to share as much technical detail across this thread as I can, so here goes for your two examples:
System upgrades:
Keep in mind that as per the ISO specification, system upgrades should be applied, but in a controlled manner. This lends itself perfectly to the following flow, which is manually triggered.
Since we take steps to make our applications stateless, and our Ansible scripts treat machines as immutable:
We spin up a new machine with the latest packages, and once it is ready it joins the Cloudflare load balancer. The old machine is then drained and deprovisioned.
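For the join/drain step, Cloudflare's Load Balancing API lets you toggle origins in a pool, which you can drive from Ansible. A hedged sketch; the account ID, pool ID, origin names, and IPs below are placeholders, and the exact payload should be checked against Cloudflare's API docs:

```yaml
# Hypothetical sketch: enable the new origin and disable (drain) the old one
# in a Cloudflare Load Balancer pool. All IDs, names, and addresses are placeholders.
- name: Update Cloudflare LB pool origins
  ansible.builtin.uri:
    url: "https://api.cloudflare.com/client/v4/accounts/{{ cf_account_id }}/load_balancers/pools/{{ cf_pool_id }}"
    method: PATCH
    headers:
      Authorization: "Bearer {{ cf_api_token }}"
    body_format: json
    body:
      origins:
        - { name: "web-new", address: "{{ new_machine_ip }}", enabled: true }
        - { name: "web-old", address: "{{ old_machine_ip }}", enabled: false }
  delegate_to: localhost
```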
A playbook iterates through our machines and performs this replacement one machine at a time before proceeding to the next. Since we have redundancy on all components, this creates no downtime. Redundancy in the web application is easy to achieve using the load balancer in Cloudflare; for the Postgres database, it does require that we promote the read-only replica to become the main database.
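A minimal sketch of that one-machine-at-a-time playbook, assuming a Debian-based host and a hypothetical "web" inventory group:

```yaml
# Sketch: upgrade one machine fully before touching the next.
# The "web" group name is a placeholder for your own inventory.
- hosts: web
  serial: 1          # one machine at a time, so the LB always has healthy peers
  become: true
  tasks:
    - name: Apply the latest packages
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist

    - name: Check whether a reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required

    - name: Reboot if the upgrade requires it
      ansible.builtin.reboot:
      when: reboot_required.stat.exists
```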
DB failover:
The database is written to and read from only by our web applications. We have a second VM in a different cloud that keeps a streaming replica of the Postgres database. It is a hot standby that can be promoted. You could use something like PgBouncer or HAProxy to route traffic from your apps, but our web framework allows changing the database at runtime.
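Promotion itself is a one-liner on the standby. A minimal sketch, assuming a Debian-style install; the Postgres version and data directory path are placeholders:

```yaml
# Sketch: promote the hot standby to primary. Binary and data directory
# paths are placeholders for our actual setup.
- name: Promote the standby to primary
  ansible.builtin.command:
    cmd: /usr/lib/postgresql/16/bin/pg_ctl promote -D /var/lib/postgresql/16/main
  become: true
  become_user: postgres
```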
> Business
Before migration (AWS): We had about 0.1 FTE on infra — most of the time went into deployment pipelines and occasional fine-tuning (the usual AWS dance).
After migration (Hetzner + OVHCloud + DIY stack): After stabilizing, it is still about 0.1 FTE (though I was at 0.5 FTE for the first 3-4 months), and it now rests with one person. We didn’t hire a dedicated ops person.
On scaling — if we grew 5-10×:
* For stateless services, we’re confident we’d stay DIY — Hetzner + OVHCloud + automation scales beautifully.
* For stateful services, especially the Postgres database, I think we'd investigate serving each client out of their own DB in a multi-tenant setup, and if that proved too cumbersome (we would need tenant-specific disaster recovery playbooks), we'd go back to a managed solution quickly.
I can't speak to the FTE toll of cloud vs. a fleet of VPS servers in the big boys' league (millions of dollars in monthly consumption) or in the tiny league, but at our scale it turns out to be the same FTE requirement.
If anyone wants to see my scripts, hit me up at jk@datapult.dk. I'm not sure it'd be a great security posture to hand them out on a public forum.
> Cloudflare could be considered a point of failure and is another level of complexity compared to doing your own LB (the extra part is the external org — extra both in terms of tech and of compliance).
> Have you considered doing your own HA load balancing? If yes, what tech options did you consider?
I took it for granted that Hetzner and OVHCloud would be prone to failures (due to their bad rep, not my own experience), so I wanted to be able to direct traffic to one if the other was down.
Doing load balancing ourselves in either of the two clouds gave rise to chicken-and-egg situations once we assumed that either of them could be down (again, not my lived experience).
Doing this externally was deliberate, and picking something with a better rep than Hetzner and OVHCloud was the obvious choice in that case.
The network allows the relevant ports from the respective IPs, and so does UFW, so the servers can communicate with each other in a restricted way. Needless to say, communication is encrypted with certificates.
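To illustrate, the UFW side of that can be expressed as an Ansible task; the port and the peer_ips variable are placeholders for our actual values:

```yaml
# Sketch: allow Postgres traffic only from the known peer servers.
# peer_ips is a hypothetical variable holding the other machines' addresses.
- name: Allow restricted traffic from peer servers
  community.general.ufw:
    rule: allow
    port: "5432"
    proto: tcp
    from_ip: "{{ item }}"
  loop: "{{ peer_ips }}"
  become: true
```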
Our monitoring server will switch over the primary DB in case the original primary DB server goes down. Since we are counting on downtime, the monitoring server is by default not hosted in the same place as the primary DB but in the same place as the secondary DB.
We assume that each cloud will go down at some point, but not both at the same time.
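A minimal sketch of the resulting check-then-promote logic, as run from the monitoring server; the host group, variables, and timeout are assumptions, and pg_promote() requires Postgres 12 or newer:

```yaml
# Hypothetical failover play run from the monitoring server: if the primary
# stops answering on 5432, promote the standby sitting next to the monitor.
- hosts: db_standby
  tasks:
    - name: Check whether the primary still answers
      ansible.builtin.wait_for:
        host: "{{ primary_db_ip }}"
        port: 5432
        timeout: 5
      delegate_to: localhost
      register: primary_alive
      ignore_errors: true

    - name: Promote the standby if the primary is down
      community.postgresql.postgresql_query:
        query: "SELECT pg_promote(true, 60)"
      become: true
      become_user: postgres
      when: primary_alive is failed
```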