tl;dr: What’s your cost per request (CPR)? How much does more monitoring and logging cost, and what does it buy you in uptime, reliability, visibility and, more importantly, customer satisfaction?
At Hetzner, a dedicated server with 8 hyperthreads, 32 GB RAM and a 4 TB HDD costs €33 a month at the time of writing. At Amazon, a t4g.2xlarge EC2 instance with 8 cores and 32 GB RAM costs $194; then add 2 TB of EBS HDD for $30, then add Amazon's traffic charges.
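To put a number on the tl;dr, here is a rough sketch of cost per million requests using the prices above. The sustained load of 100 RPS and the 30-day month are my assumptions, and `cost_per_million_requests` is a hypothetical helper, not part of any vendor SDK. Note that the two results are in different currencies (€ and $) and that the AWS figure still excludes traffic charges.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month


def cost_per_million_requests(monthly_cost: float, avg_rps: float) -> float:
    """Monthly bill divided by the millions of requests served that month."""
    if avg_rps <= 0:
        raise ValueError("avg_rps must be positive")
    requests_per_month = avg_rps * SECONDS_PER_MONTH
    return monthly_cost / (requests_per_month / 1_000_000)


AVG_RPS = 100  # assumption: sustained average load, purely illustrative

# Prices quoted above: EUR 33 for the dedicated server, USD 194 + 30 for EC2 + EBS.
hetzner_eur = cost_per_million_requests(33, AVG_RPS)
aws_usd = cost_per_million_requests(194 + 30, AVG_RPS)  # before traffic charges

print(f"Hetzner: EUR {hetzner_eur:.3f} per million requests")
print(f"AWS:     USD {aws_usd:.3f} per million requests")
```

Plug in your own bill and your own measured traffic; the point is only that the number is easy to compute and worth knowing.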
Of course, it gets even worse. If you want to run “serverless”, you’ll need 100 AWS Lambdas to serve 1000 requests/sec, while a single Node process running fastify.js or hapi.js would handle hundreds, yea thousands, of RPS. With AWS you also get to shell out for provisioned concurrency and build workarounds for Lambda’s inherent limitations.
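Where does a figure like 100 Lambdas come from? Each Lambda instance handles one request at a time, so the concurrency you need follows Little's law: arrival rate times time per request. A minimal sketch, assuming 100 ms per invocation (my assumption, chosen because it reproduces the 100 above; `required_concurrency` is a hypothetical helper):

```python
import math


def required_concurrency(rps: float, latency_ms: float) -> int:
    """Little's law: in-flight requests = arrival rate x time per request."""
    return math.ceil(rps * latency_ms / 1000)


print(required_concurrency(1000, 100))  # 100 concurrent instances
print(required_concurrency(1000, 250))  # slower handlers need more
```

A long-running Node process overlaps the I/O waits of many requests in one event loop, which is why a single server is not bound by the same one-request-per-instance arithmetic.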
Thus cloud spending shall soon account for half of all IT spending. But how has this money sink seduced our industry?
Optimistically, the hallowed cloud/DevOps saves money by removing operations. The story starts: Once upon a time your servers would just sit there, spinning drives idly, waiting, burning cash in underutilization. The cloud’s not so bad, it just makes you notice hourly costs more than your own idle hardware…
Alas, the typical story sees a hotshot résumé-driven dev spend his investors’ money to build containers hundreds of MBs in size (with an OS, Node.js etc.), even being charged to transfer images between datacenters, lazily storing the build cache in a multi-region bucket and registry, all to get long cold-start times.
Cloudflare sounds a bit different, providing special workers and shipping a V8-like runtime for you. While this shockingly efficient offering beats the “give me your worst container image” approach, it clips your wings. Imagine you know what you’re doing and want to use the “cloud native language”: Go. Too bad, you can’t use Go with Cloudflare Workers. But even if you use JS, you’ll wrestle with a litany of idiosyncrasies and incompatibilities in their runtime. To get a first-class Go experience, you’ll use GCP’s Cloud Functions or AWS Lambda, bringing us full circle.
To fully leverage the cloud’s features, modern organizations turn to tools like Kubernetes. Alas, Kubernetes is hard and we quickly turn to services, outsourcing orchestration to Amazon’s EKS, Microsoft’s AKS and Google’s GKE, yea GKE Autopilot and EKS Fargate, which see majority adoption (or Rancher, Red Hat OpenShift, VMware Tanzu etc.).
Joe Beda, Kubernetes co-founder, in this video: “there are very few reasons to … go … to open source binaries and [write your] own tooling … unless you are using Kubernetes in a way that is really unique”. But why should there be very few reasons? You can go it alone and save!
Sure, “there are always exceptions for organizations with strong engineering and operations chops to run Kubernetes themselves, but it became clear for most customers that became a daunting task”, [said](https://www.infoworld.com/article/3614850/no-one-wants-to-manage-kubernetes-anymore.html) Deepak Singh.
Well, ok. There is always the promise of SaaS. Your company may wish to run its operations itself, but any good economist knows you only gain competitive advantage with operations in your core competency, so you give in and turn to specialized vendors who run them better, cheaper… Wait.
We price our service tiers according to API calls, data volume etc., so we know how much they cost. Yet unlike every other industry, tech companies don’t often care to optimize on cost and increase profit margins. Indeed, many are proudly unprofitable, rushing to scale at all costs while incinerating capital in a zero-sum game. Capital, human and financial, is limited, and we allocate it to these outlets to the detriment of others. We can do better. We owe it to ourselves, the world and the future to build profitable, productive organizations.
I leave you with horrific cloud cost stories:
We are actively encouraged NOT to spend time finding cost savings, we are a unicorn startup and everything is about growth and pruning old instances is not at all a priority right now. If I can spin it that it’s for security reasons, it can be a win. We have a two-week ‘gap week’ in between agile sprints in July where we are encouraged to do work that’s not priority, or fun, or even for other teams, and I used that time to convert some storage to lower storage tiers, saving $15000/month on two buckets alone just by changing some IaC lifecycle flags …I was all like ‘well that’s another FTE budget’ and my manager didn’t even want to announce I had done it because it’s off mission
Someone was benchmarking the latest VM SKUs to see if it will work better for our application. They had them incorrectly configured where the min and max node count was the same in the VM scale set, to save time so he didn’t need to wait for a scaling event. And once he was done, he left them up and forgotten about until after he got back after Christmas vacation, and we saw the 6 digit spend in the subscription. I hope you appreciated our inadvertent Christmas bonus we sent you, you fine folks at M$.
always seems to crop its ugly head is logging costs whether it be DataDog, ELK, Graylog, etc. is that for some reason (valid or not) a software developer turns up logging on the app/service and forgets about it.
what was generating 3000$ in Cloudwatch logs in one day.. My lambda function was freewheeling for most of the day, because I made a typo.. so the function would error out, causing the EC2 instance to be recreated because of the failed startup function and the same thing happened again and again until I finally fixed the log function properly later that day.