Uptime and Scaling for Services

I would like to discuss uptime and scaling for Harmony's services.

Since the launch of the Harmony protocol in June 2019, Harmony has grown from its infancy (June 2019 to June 2020) through its childhood (June 2020 to June 2021). Since June 2021, Harmony has been in its teenage period. At each age we faced different issues: consensus getting stuck, network outages, p2p spamming attacks, and network recovery taking too long. Today the most prominent issue is growing pains. With so many users, an ever-growing set of ecosystem projects, faster block times, and a higher number of transactions, our web2 infrastructure sometimes cannot keep up with the scaling demands. We have to provide higher uptime and scale our service infrastructure in order to continue serving our users and ecosystem projects smoothly. Of course, protocol-level enhancement cannot stop, as it is the foundation of our technology. But without a scalable web2 service infrastructure, end users will have a poor perception of the blockchain and will be disappointed. The web2 infrastructure can become the weakest link in the entire protocol.

Daniel's document shows the impact and importance of the web2 infrastructure.

Principles

There are a few principles/tenets I would like to share first.

The first is scalability. Our infrastructure has to scale to meet the upcoming 10x or 100x growth of blockchain applications and users. Patching the existing codebase will not work, because it was designed as a single-node solution; a new framework has to be adopted. Lessons from web2 scaling solutions have to be learned.

The second is security. Blockchain is an open ledger technology and carries digital assets worth millions of dollars. Security should always be the first priority and a one-click blocker for any potential risk. The service infrastructure also needs to take security measures into account, both for existing services and for any new ones.

The third is frugality. Frugality doesn't mean we can't spend money. We just need to get the most out of the money we spend. Again, simply adding more nodes may solve the problem only temporarily. In the longer term, we need to think outside the box. The explorer v2 design was a perfect example of how a new architecture can solve an issue that has been plaguing us.

Issues

There are a few urgent issues in our service infrastructure that we should discuss.

RPC nodes out of sync: when the RPC nodes are under heavy load, they regularly fall behind the top of the blockchain. Under the current architecture, this is not a simple fix. I propose a new architecture that separates the DB writer from the DB reader, just like the explorer indexer.
RPC endpoint out of memory: the RPC endpoint has also been observed to run out of memory under heavy load.
Scalable RPC service during traffic spikes

New DB backend - bounty
CDN (Cloudflare) in front of the endpoint to reduce latency
Autoscaling group for the DB reader, aka, RPC service nodes
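
The out-of-sync issue listed above could be detected with a simple lag check, which a load balancer or autoscaling group might also use to drop lagging RPC nodes from rotation. The following is a minimal sketch, not our actual tooling: the function names, the 10-block threshold, and the use of the Ethereum-compatible eth_blockNumber method (with a hex-string result) are all assumptions for illustration.

```python
import json
from typing import Dict, List


def block_number_request(request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 request body asking a node for its latest block."""
    return json.dumps({
        "jsonrpc": "2.0",
        # Assumption: the node exposes the Ethereum-compatible method.
        "method": "eth_blockNumber",
        "params": [],
        "id": request_id,
    })


def parse_block_number(response_body: str) -> int:
    """Extract the block height from a response whose result is a hex string."""
    result = json.loads(response_body)["result"]
    return int(result, 16)


def is_in_sync(node_height: int, reference_height: int,
               max_lag_blocks: int = 10) -> bool:
    """A node is healthy if it is at most max_lag_blocks behind the reference."""
    return reference_height - node_height <= max_lag_blocks


def healthy_nodes(heights: Dict[str, int],
                  max_lag_blocks: int = 10) -> List[str]:
    """Return the names of nodes close enough to the highest observed height."""
    if not heights:
        return []
    # Use the highest height seen across the fleet as the reference tip.
    reference = max(heights.values())
    return sorted(
        name for name, height in heights.items()
        if is_in_sync(height, reference, max_lag_blocks)
    )
```

The threshold (hypothetical here) would need tuning against the actual block time: with faster blocks, the same number of lagging blocks represents less wall-clock delay.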

High Storage requirement

The increasing number of transactions per block and the faster block time add a lot of data to the blockchain, which grows bigger every day. It is increasingly hard for a small validator to run a node, and the growing size of a full node makes it infeasible to keep a full copy of the blockchain.

Cloud Cost Reduction

Our cloud cost has hit an all-time high ("ATH"). We need to find an alternative cloud provider to reduce the cloud cost in the mid-term. In the longer term, we shall use grants or form technical partnerships to fully externalize the services. Please read the "Cloud Cost" document for more details and plans.

Solutions

I proposed a new architecture back in July. Here is the link to the proposal.

In short, the solution is to separate DB syncing from the RPC service nodes, using separate writer and reader instances to scale read operations. The writer keeps syncing and writing to the cloud DB, while the readers read from the DB and serve all the RPC requests.
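
The writer/reader split can be illustrated with a small sketch. This is not the proposed implementation: SharedBlockStore is a hypothetical in-memory stand-in for the cloud DB, and the class names, method names, and block fields are assumptions made for the example.

```python
from typing import Dict, List, Optional


class SharedBlockStore:
    """In-memory stand-in for the cloud DB (a real one would be networked)."""

    def __init__(self) -> None:
        self._blocks: Dict[int, dict] = {}
        self._latest: int = -1  # -1 means no block has been written yet

    def put_block(self, number: int, block: dict) -> None:
        self._blocks[number] = block
        self._latest = max(self._latest, number)

    def get_block(self, number: int) -> Optional[dict]:
        return self._blocks.get(number)

    def latest_number(self) -> int:
        return self._latest


class Writer:
    """The only component that syncs the chain and writes to the store."""

    def __init__(self, store: SharedBlockStore) -> None:
        self.store = store

    def sync(self, blocks: List[dict]) -> None:
        for block in blocks:
            self.store.put_block(block["number"], block)


class Reader:
    """RPC-serving node: reads from the store and never syncs the chain."""

    def __init__(self, store: SharedBlockStore) -> None:
        self.store = store

    def block_number(self) -> int:
        return self.store.latest_number()

    def get_block_by_number(self, number: int) -> Optional[dict]:
        return self.store.get_block(number)


store = SharedBlockStore()
writer = Writer(store)
# Read capacity scales by adding readers; the writer is unaffected.
readers = [Reader(store) for _ in range(3)]
# Placeholder blocks; real blocks would come from the network.
writer.sync([{"number": n, "hash": f"placeholder-{n}"} for n in range(5)])
```

Because the readers hold no chain state of their own, they can be added or removed during traffic spikes without any resyncing, which is what makes an autoscaling group of readers feasible.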

Services

The following is a list of the critical services Harmony provides right now.

https://explorer.harmony.one
https://staking.harmony.one
https://api.harmony.one
wss://ws.s0.t.hmny.io/

We need to discuss and determine:

Which service shall we run by ourselves? Why?

Soph: after the validating nodes, the RPCs are the next most critical infrastructure. Our network is still going through changes (1s finality, RPC architecture change). Keeping the RPCs in-house avoids delays in coordinating with a partner during issues; however, we may miss out on the expertise of professional RPC service providers, who can share very helpful tips on how to run the service.

Which service shall we externalize? How to prioritize them? Who to work with and what's our budget?

Soph: api.harmony.one is already externalized, with Pocket already processing requests; Covalent and Chainstack are coming next.

RPC performance RCA

In response to users' reports about RPC performance, here are some thoughts on the root cause analysis (RCA). Further investigation is needed.

The fast block time puts a lot of pressure on the block syncing logic, which is also not fully optimized.
RPC nodes have to split their resources between state syncing and serving RPC requests.
The performance of an EBS volume can't compete with a local NVMe drive, which is why syncing may take more resources. The following is the pprof chart of one node that underperformed while syncing blocks. It shows that DB reads took most of the resources.

underperform.pdf (38.0 KB)