Cloud cost breakdown and reduction plan

August’s Cost - budget

AWS Dev: 110k
AWS Main: 45k
Digital Ocean: 30k
Hetzner: 2k

Total: 187k

July’s Cost

AWS Dev: 108,930.75
AWS Main: 46,481.79
Digital Ocean: 68,514.41
Hetzner: 614

Total: 224,540.95

June’s Cost

AWS Dev: 145,954.46
AWS Main: 62,787.60
Digital Ocean: 69,792.60

Total: 278,534.66

May’s Cost

AWS Dev: 190,815.71
AWS Main: 94,936.39
Digital Ocean: 88,419.91

Total: 374,172.01

April's Cost

AWS Dev: 436,991.69
AWS Main: 98,197.06
Digital Ocean: 96,297.69

Total: 631,486.44
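
The monthly totals and the month-over-month trend can be re-derived from the line items above. The following is a minimal sketch, assuming the amounts are in USD and using only the figures listed in this section; the names `MONTHLY_COSTS`, `monthly_totals`, and `month_over_month_change` are illustrative and not part of any existing tooling.

```python
# Line items copied from the monthly tables above, oldest month first.
MONTHLY_COSTS = {
    "April": {"AWS Dev": 436_991.69, "AWS Main": 98_197.06, "Digital Ocean": 96_297.69},
    "May": {"AWS Dev": 190_815.71, "AWS Main": 94_936.39, "Digital Ocean": 88_419.91},
    "June": {"AWS Dev": 145_954.46, "AWS Main": 62_787.60, "Digital Ocean": 69_792.60},
    "July": {
        "AWS Dev": 108_930.75,
        "AWS Main": 46_481.79,
        "Digital Ocean": 68_514.41,
        "Hetzner": 614.00,
    },
}


def monthly_totals(costs):
    """Sum the per-provider line items for each month."""
    return {month: round(sum(items.values()), 2) for month, items in costs.items()}


def month_over_month_change(totals):
    """Percentage change of each month's total relative to the previous month."""
    months = list(totals)
    return {
        cur: round((totals[cur] - totals[prev]) / totals[prev] * 100, 1)
        for prev, cur in zip(months, months[1:])
    }


if __name__ == "__main__":
    totals = monthly_totals(MONTHLY_COSTS)
    changes = month_over_month_change(totals)
    for month, total in totals.items():
        change = changes.get(month)
        suffix = "" if change is None else f" ({change:+.1f}% vs previous month)"
        print(f"{month}: {total:,.2f}{suffix}")
```

Keeping the line items in one structure makes it easy to recheck a total whenever a bill is revised.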

Goals

Note: the following estimate was based on a $300k monthly cost. Given the $687k monthly cost peak in April, the realistic goal is to cut the cost to $200k total.

by 6/15, 30% reduction in cloud cost to $200k total across all services

the goal was set based on the $300k/m assumption
we’ve cut more than 30% by May already

by 9/15, 60% reduction in cloud cost to $100k total
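
The goals above are stated both as a percentage and as a dollar target. A small sketch, assuming the $300k/month baseline mentioned above, shows how the two relate; the helper names are illustrative only. Because both figures are rounded, the percentage implied by each dollar target is slightly higher than the stated percentage.

```python
BASELINE = 300_000  # USD per month, the assumption the goals were set against

# Dollar targets as stated in the goals above.
GOALS = {"6/15": 200_000, "9/15": 100_000}


def required_reduction_pct(baseline, target):
    """Percentage cut needed to get from baseline down to target."""
    return (baseline - target) / baseline * 100


def target_for_reduction(baseline, pct):
    """Monthly cost that results from cutting baseline by pct percent."""
    return baseline * (1 - pct / 100)


if __name__ == "__main__":
    for deadline, target in GOALS.items():
        pct = required_reduction_pct(BASELINE, target)
        print(f"by {deadline}: ${target:,} needs a {pct:.1f}% cut from ${BASELINE:,}")
```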

Weekly Cost Breakdown Tracking Sheet

https://docs.google.com/spreadsheets/d/14BaWu2GhyvrrNAvW1e3u6VS4TA-vjqku7eShU7qRuKw/edit#gid=1856754480

Action Items

By 8/30

Migrate 25% of validator nodes to Hetzner

By 7/31

@Tahir Javed to complete the DNS migration so that rate limiting can start globally (in scope: t.hmny.io), and test the rate limits
@Konstantin Potapov @Nita Neou (Soph) test and deploy the crosslink heartbeat signal feature so that crosslink sending can function even without the beacon chain database for validators in shards 1, 2, and 3.
@Konstantin Potapov @Nita Neou (Soph) test and launch the epoch chain to remove the storage of the full beacon chain database for nodes in shards 1, 2, and 3.
@Socheat KHAUV build 1 mainnet validator node on Hetzner

By 6/30

@Tahir Javed to complete rsync via Storj to reduce S3 cloud cost (snapshots will be faster and cheaper, both for 20TB archival nodes and 2TB full/RPC nodes)
@Tahir Javed DNS migration (in scope: b.hmny.io / bridge.hmny.io / ps.hmny.io / stn.hmny.io)
@Nita Neou (Soph) switch over getLogs calls to RDS
@Socheat KHAUV migrate all EBS storage from gp2 to gp3 to save 20% on storage cost
@Nita Neou (Soph) Build 2 archival nodes on Hetzner
@Diego Nava redirect 50% of s0 RPC traffic to elastic RPC
@Diego Nava / @Haodi Jiang redirect 100% of s0 archival node RPC traffic to elastic RPC and terminate all archival nodes on AWS (keep 1 while building the Hetzner node)
@Nita Neou (Soph) remove the dedicated RPC node used for the bridge API endpoint
@Nita Neou (Soph) retire all internal graph nodes since The Graph now supports Harmony
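
For the gp2 to gp3 EBS migration listed above, the volume type can be changed in place through the EC2 API. The following is a minimal sketch, assuming a boto3 EC2 client is passed in; the function name and the `apply` flag are illustrative and not part of any existing script. It only switches the volume type and leaves size, IOPS, and throughput at their defaults, which may not suit every node.

```python
def migrate_gp2_volumes_to_gp3(ec2_client, apply=False):
    """Find every gp2 volume visible to the client and request a change to gp3.

    With apply=False (the default) nothing is modified; the function only
    reports which volumes would be changed.
    """
    paginator = ec2_client.get_paginator("describe_volumes")
    pages = paginator.paginate(
        Filters=[{"Name": "volume-type", "Values": ["gp2"]}]
    )
    volume_ids = []
    for page in pages:
        for volume in page["Volumes"]:
            volume_id = volume["VolumeId"]
            if apply:
                # The type change is applied online; no detach is requested here.
                ec2_client.modify_volume(VolumeId=volume_id, VolumeType="gp3")
            volume_ids.append(volume_id)
    return volume_ids


# Example usage (requires boto3 and AWS credentials; the region is an assumption):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   print(migrate_gp2_volumes_to_gp3(ec2))              # dry run, lists candidates
#   print(migrate_gp2_volumes_to_gp3(ec2, apply=True))  # actually request the change
```

Running the dry run first gives a list of candidates to review before any volume is touched.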

4/18

integrate Grafana in the backend to get more metrics (@Jenya Piskunov)
explain the new SQL query deployed around March to understand the DB usage (@Jenya Piskunov, @Russell )
try reverting this query to check whether the cloud cost goes down (@Jenya Piskunov)

03/26

weekly meeting on cloud cost metrics and actions (@Nita Neou (Soph), @Jack Chan, @Leo Chen)

Context

We've hit an all-time high ("ATH") in our cloud cost, which is not good, driven mainly by three cost sources: around $140k on our November AWS dev account bill, not including DO, plus about $40k on the mainnet account. In total, we are paying around $200k per month for all cloud services and mainnet nodes. We are projecting an even higher cloud cost in December due to the high transaction volume and increasing usage.

The following is a simple breakdown of the top 3 cost sources, with usage pattern analysis, on the dev account ($140k).

high data transfer cost (40%)

s3 hosted blockchain snapshot for rclone, used by both internal and external nodes
explorer nodes have a high volume of outbound traffic

high cost on RDS (15%)

hosts the explorer DB that serves https://explorer.harmony.one
high cost due to charge per million I/O
indexers write to the DB while explorer backend reads from the DB

high cost on the EC2 instances (23%)

increased number of EC2 instances serving the RPC
high-spec instances cost much more

The high cost is not sustainable for us. We need a plan to reduce the cloud cost while maintaining the same level of service to our users.

Here are some low-hanging fruit and short-term solutions we can adopt. Note that each solution will need engineering resources to design, architect, test, and migrate.

use spot instances to replace on-demand instances

Pros: spot instances usually cost no more than 1/3 of on-demand instances with the same performance.
Cons: spot instances may be terminated by AWS on short notice.
Cons: given the current S0 disk usage and previous experience, it can take up to 12h to fully rebuild a node (and this will incur more cost if we transfer data via public rclone)
Mitigation: develop a monitoring and automation solution to quickly spin up new instances if existing ones are terminated
Plan and ETA: 3 weeks of dedicated engineering resources to plan out and fully automate the monitoring and rollout
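
As a starting point for the monitoring mitigation above, a node can poll the EC2 instance metadata service for a spot interruption notice. The following is a minimal sketch, assuming it runs on the spot instance itself and that IMDSv2 is enabled; the function names and the `on_notice` callback are hypothetical, and what the callback does (for example, triggering a replacement node) is left open.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(status, body):
    """Return the interruption notice as a dict, or None if none is scheduled."""
    if status == 404:  # the endpoint returns 404 until an interruption is scheduled
        return None
    if status != 200:
        raise RuntimeError(f"unexpected metadata status {status}")
    return json.loads(body)


def fetch_instance_action(timeout=2.0):
    """Query the spot instance-action endpoint using an IMDSv2 session token."""
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as resp:
        token = resp.read().decode()
    request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return parse_instance_action(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:
        return parse_instance_action(err.code, "")


def watch_for_interruption(on_notice, fetch=fetch_instance_action,
                           poll_seconds=5.0, sleep=time.sleep):
    """Poll until a notice appears, then hand it to the on_notice callback."""
    while True:
        notice = fetch()
        if notice is not None:
            on_notice(notice)
            return notice
        sleep(poll_seconds)
```

The `fetch` and `sleep` parameters are injectable so the polling loop can be exercised without a real instance. Since the notice arrives only shortly before termination and a full rebuild can take hours, this hook can start a replacement but cannot by itself avoid the rebuild time.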

set up and manage our own Postgres DB clusters instead of RDS

Pros: saving 70% of the cost of RDS
Cons: setup, monitoring, maintenance, disk resizing, etc. require dedicated engineering resources
Mitigation: outsource the DBA tasks; we need to find trusted resources, though
Plan and ETA: 3 weeks of engineering planning, implementation, and migration
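
The savings claimed for the two options above can be put into rough dollar terms using the dev account breakdown from earlier in this section. This is a back-of-the-envelope sketch only: it assumes the $140k dev account bill and the 23% (EC2) and 15% (RDS) shares quoted above, and it treats "no more than 1/3" as exactly 1/3. The `replaced_fraction` parameter is a hypothetical knob, since not every instance can realistically move to spot.

```python
DEV_ACCOUNT_MONTHLY = 140_000   # USD, dev account bill cited above
EC2_SHARE = 0.23                # share of the bill spent on EC2 instances
RDS_SHARE = 0.15                # share of the bill spent on RDS

SPOT_PRICE_RATIO = 1 / 3        # spot price relative to on-demand (upper bound)
SELF_MANAGED_DB_SAVING = 0.70   # claimed saving from self-managed Postgres


def spot_saving(total, ec2_share, price_ratio, replaced_fraction=1.0):
    """Monthly saving if replaced_fraction of EC2 spend moves to spot."""
    ec2_cost = total * ec2_share
    return ec2_cost * replaced_fraction * (1 - price_ratio)


def self_managed_db_saving(total, rds_share, saving_rate):
    """Monthly saving from replacing RDS at the claimed saving rate."""
    return total * rds_share * saving_rate


if __name__ == "__main__":
    spot = spot_saving(DEV_ACCOUNT_MONTHLY, EC2_SHARE, SPOT_PRICE_RATIO)
    db = self_managed_db_saving(DEV_ACCOUNT_MONTHLY, RDS_SHARE, SELF_MANAGED_DB_SAVING)
    print(f"spot (all EC2 replaced): ~${spot:,.0f}/month")
    print(f"self-managed Postgres:   ~${db:,.0f}/month")
```

Both numbers exclude the engineering time listed under Plan and ETA, so they are an upper bound on the net saving.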

use an alternative tier-2 cloud provider to save on instance and data costs

Pros: Vultr, Digital Ocean, Hetzner, or other cloud providers can be considered to provide service
Cons: learning curve and tooling/infra (LB) adaptation
Mitigation: use additional resources like Haodi to do research. Some manual processes can be established at first
Plan and ETA: two weeks of prototyping, two weeks of migration, 3 more weeks of tooling automation