Our current setup runs several RPC nodes behind a load balancer. While this works for most cases, our nodes are stateful: during peak times, when we want to increase serving capacity by adding a new node, that node takes several hours to catch up with the latest state of the blockchain.
This quirk cripples our ability to react. Scaling up takes so long that by the time a node becomes available, the need for it may be gone. Scaling down is harder than it should be because we don't want to give up that extra serving capacity, yet keeping it means we might be over-serving, making the whole system more expensive to run.
With that in mind, we have two main problems to solve. First, make our nodes stateless so we can scale up and down based on demand, optimizing costs. Second, utilize our resources more efficiently by allocating more than one node on the same server (i.e., EC2 instance).
To make a node stateless, we have to shift from on-disk storage to remote storage, so that more than one node can connect to the same DB and a restarted node no longer needs to catch up. There is an open PR adding TiKV support for storage; we could use it for this purpose.
One question remains open: if two or more nodes connect to the same DB, would that cause race conditions? If so, we could deploy in two stages. First, run a full node that catches up with blockchain updates and stores them in the remote DB. Second, run light RPC nodes, a light version of a full node whose sole purpose is to connect to the DB, serve read calls, and forward write calls to the full RPC nodes that write to the blockchain.
This way, we can scale the light RPCs up and down with ease while keeping their resource consumption somewhat limited, so we can allocate more RPC nodes on the same server (more on that below).
This architecture also lets us distribute resources based on hot domains or hot endpoints. In the example below, we create another set of light RPC nodes that only serve getLogs. Bear in mind this is a standard light RPC node that happens to serve only getLogs requests; each group scales independently of the others, and they are all served through the same load balancer (more on how we could do that below).
This same principle could be applied to more domains or endpoints, where we could have the general RPC group, the blockchain RPC group with all the endpoints related to blockchain, etc.
Now that we have covered grouping domains to further optimize the utilization of our resources (CPU and memory in particular), it is time to articulate how we can reuse resources more efficiently. For the most part, our nodes won't be running at full speed, so keeping one server (EC2) per node (RPC) is a waste of resources. We can build a cluster that allocates more than one node on the same server; if we run out of space, the cluster can add instances to fit more nodes and release them when the need passes.
Note in the image below how we can move nodes interchangeably between servers (EC2); this is the powerful part. We can treat each node as a computational unit that is entirely disposable (nothing personal): we can restart it, remove it, or relocate it to a different server if needed. Our RPC API already supports graceful shutdown, which means that when the cluster decides it is time to shut down an RPC node, that node stops accepting new requests but finishes what it is already processing.
So far, we have introduced the concept of the cluster without discussing what tech we could use. One of the most popular solutions is K8s, which works and would get the job done; however, I'd like to propose another tool that I find better suited for the job. Nomad is a solution created by the folks at HashiCorp, fully open source, and written in Go. One of its main advantages is that it integrates natively with Terraform and Consul, making setup and maintenance a breeze. The learning curve is much smoother than with K8s because it was designed in a modular way: instead of a big out-of-the-box solution with tons of config files, it starts simple, and you add what you need, like Lego. Its web interface surfaces a lot of helpful information for debugging and monitoring the cluster, with an appealing look and feel that is easy to navigate. It exports metrics to Prometheus by default and allows flexible configuration for auto-scalers.
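To give a taste of how lightweight a Nomad setup can be, here is a minimal job sketch; the job name, Docker image, and resource numbers are illustrative assumptions, not our actual configuration:

```hcl
# Minimal Nomad job (illustrative names and values): one group of
# light RPC nodes that Nomad bin-packs onto the available servers.
job "light-rpc" {
  datacenters = ["dc1"]

  group "general" {
    count = 3 # an auto-scaler can adjust this based on demand

    task "node" {
      driver = "docker"

      config {
        image = "example/harmony-node:latest" # hypothetical image
      }

      resources {
        cpu    = 500  # MHz
        memory = 1024 # MB
      }
    }
  }
}
```

That is the whole file; scheduling, health checks, and rolling updates come from the cluster rather than from extra configuration.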
I have used both and find Nomad far easier to use and maintain, and I also see new engineers getting comfortable with it faster.
The native Terraform integration makes node version updates simple: we update a single line of Terraform code with the new version, run terraform apply, and the cluster does the magic, restarting one or two nodes at a time, waiting until they are healthy, and then doing the same with the rest.
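Concretely, the upgrade could look like the fragment below; the variable name and version string are hypothetical:

```hcl
# Hypothetical Terraform fragment: the node version lives in one place,
# so an upgrade is a one-line edit followed by `terraform apply`.
variable "node_version" {
  type    = string
  default = "v4.3.9" # <- the single line we change on upgrade
}

# The variable is interpolated into the Nomad jobspec that Terraform
# manages, and the cluster rolls the nodes a couple at a time.
```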
So far, we have discussed how we can run light RPC nodes for reads backed by a remote database, create groups of hot endpoints so they scale independently and stop competing for each other's resources, and use server resources efficiently by allocating instances of those nodes interchangeably. This makes the network more stable: if one group of endpoints is falling short for some reason, the rest of the network keeps working as if nothing happened. Note that all groups run the same binary; what gives each group its meaning is how we route requests, i.e., which endpoints we forward to which group. This is especially tricky for an RPC endpoint, since the method is embedded in the request body, and most high-performance load balancers and routers don't support that level of routing.
The solution lies in a technology we are already migrating to: Cloudflare. We can leverage Cloudflare Workers to route these requests. The process works as follows:
- Cloudflare receives the request.
- A worker parses the body and extracts the method. If the method belongs to a particular group, it forwards the call with the group appended to the path. Otherwise, it forwards the call without appending anything.
- The Nomad cluster runs an internal router (Fabio) that detects the path and forwards the call to the specific group of nodes, stripping the added path so the call stays backward compatible with the Harmony node.
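The worker's routing decision can be sketched as a pure function; the method-to-group mapping and the path prefix below are illustrative assumptions, not the real configuration:

```javascript
// Map hot JSON-RPC methods to routing groups (illustrative values).
const METHOD_GROUPS = {
  eth_getLogs: "getlogs",
};

// Given a raw JSON-RPC request body, return the path suffix the worker
// should append before forwarding. Unknown methods and unparsable
// bodies fall through to the general group (empty suffix), which keeps
// the failure mode safe.
function routeSuffix(rawBody) {
  try {
    const { method } = JSON.parse(rawBody);
    const group = METHOD_GROUPS[method];
    return group ? `/${group}` : "";
  } catch (e) {
    return "";
  }
}
```

Inside the worker, this suffix would be appended to the origin URL before calling fetch; Fabio then strips it again on the way to the node.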
What if there is a misconfiguration in either the CF worker or Fabio? No problem: by default, the request is served by the general group, so the end-user sees no issue, and for us it is easily detectable because we can add a label to the Prometheus metrics indicating which group executed a call; spotting those issues is as easy as looking at a Grafana chart. By the way, we don't need to update the node code to support the new label; that is something we can do at the cluster level before exporting the metric.
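One way to attach that group label without touching node code is at scrape time. The fragment below assumes the nodes are registered in Consul with a `group-<name>` service tag; the service name and tag convention are assumptions, not our current setup:

```yaml
# Hypothetical Prometheus scrape config: derive a `group` label from a
# Consul service tag (e.g. "group-getlogs") instead of changing node code.
scrape_configs:
  - job_name: rpc-nodes
    consul_sd_configs:
      - server: "127.0.0.1:8500"
        services: ["harmony-rpc"] # hypothetical service name
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: ".*,group-([^,]+),.*"
        target_label: group
```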
What about WebSockets?
How could we extend this architecture to WebSockets? We can create another WebSocket group that serves all calls but only route WebSocket connections there, using a similar CF worker trick: append /ws to the request and forward it to the cluster so it lands in that group. Fabio supports WebSocket connections by detecting the connection upgrade and keeping the link open with the intended node.
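The WebSocket branch of the worker can be sketched the same way; the /ws prefix matches the trick described above, while the function name and header handling are illustrative:

```javascript
// Decide the upstream path for a request: WebSocket upgrades get the
// /ws prefix so Fabio routes them to the WebSocket group; everything
// else passes through untouched.
function upstreamPath(headers, path) {
  const upgrade = (headers["upgrade"] || "").toLowerCase();
  return upgrade === "websocket" ? "/ws" + path : path;
}
```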
If more granular optimization is needed, we can build a custom reverse proxy for WebSockets, so the connection stays open at the proxy level while the proxy interacts with each group to fulfill requests. This wouldn't be too hard to implement, but it can be deferred as a future optimization.