Fast Sync & Gossips: streaming state sync, information dispersal algorithm
We are replacing gRPC in state sync with p2p streaming with a rate limiter and downloader batch. Later we will implement RapidChain’s gossiping protocol based on an information dispersal algorithm (IDA) for sending large blocks. Both approaches, along with further performance analysis and parameter tuning, will improve our network resiliency and latency.
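As a rough illustration of the rate limiter mentioned above, the sketch below shows a token bucket that could cap how many stream requests a peer may issue. It is not the project's actual limiter: the `TokenBucket` type, its rate and burst values, and the `demoBucket` helper are all hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket is a hypothetical, single-goroutine rate limiter sketch.
// It is not taken from the stream sync code base.
type TokenBucket struct {
	capacity float64   // maximum burst size
	rate     float64   // tokens refilled per second
	tokens   float64   // tokens currently available
	last     time.Time // time of the last refill
}

// NewTokenBucket starts with a full bucket at time now.
func NewTokenBucket(rate, capacity float64, now time.Time) *TokenBucket {
	return &TokenBucket{capacity: capacity, rate: rate, tokens: capacity, last: now}
}

// Allow refills the bucket for the time elapsed since the last refill,
// then consumes one token if at least one is available.
func (b *TokenBucket) Allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// demoBucket replays a fixed request pattern: 1 request/s with a burst of 2
// (both values are assumptions chosen for illustration).
func demoBucket() []bool {
	t0 := time.Unix(1000, 0)
	b := NewTokenBucket(1, 2, t0)
	return []bool{
		b.Allow(t0),                  // first request of the burst
		b.Allow(t0),                  // second request of the burst
		b.Allow(t0),                  // bucket is empty
		b.Allow(t0.Add(time.Second)), // one token refilled after 1s
		b.Allow(t0.Add(time.Second)), // empty again
	}
}

func main() {
	fmt.Println(demoBucket())
}
```

A real limiter shared across goroutines would also need a mutex, and would likely be keyed per stream or per peer.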
Current problems at hand:
- CPU and memory usage grow to 100% on the Lightsail nodes after running for 2-3 days.
- Some of the in-sync RPCs are redirected to gRPC sync after some iterations of code change.
- Sometimes the sync loop gets stuck in short-range sync. This needs investigation.
Once these three problems are solved, the stream sync protocol can be launched.
2021/10/26
1. Continue debugging stream sync.
   1. The previously found suspect does impact CPU (very occasionally), but it is still not the root cause. The nodes still die after 3 days of running.
   2. The next suspect is, once again, the discovery protocol. Keep researching.
2021/10/19
- Continue working on stream sync. Found a suspect for CPU explosion
- The problem arises when two peers both send the INIT packet, one from each end of the streaming protocol, because discovery is triggered immediately once the stream count drops below the lower threshold. This makes the stream setup fail and results in a large number of discovery and stream setup operations, which are heavy.
- Reproduced the issue on two nodes in different regions, resulting in high CPU cost (a spike, not slow growth).
- Starting on the fix: adding a cooldown mechanism to discovery and the stream manager.
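The cooldown idea can be sketched as a small gate that refuses to start another discovery (or stream setup) until a minimum interval has passed since the last one. This is only an illustration under assumptions, not the fix that was implemented: the `Cooldown` type, the `demoCooldown` helper, and the 10-second interval are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Cooldown is a hypothetical gate that lets an expensive operation run
// at most once per interval. It is not the actual stream manager code.
type Cooldown struct {
	mu       sync.Mutex
	interval time.Duration
	last     time.Time // zero value means the operation has never run
}

// NewCooldown returns a gate with the given minimum interval between runs.
func NewCooldown(interval time.Duration) *Cooldown {
	return &Cooldown{interval: interval}
}

// Allow reports whether the operation may run at time now.
// When it returns true, now is recorded as the last run time.
func (c *Cooldown) Allow(now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.last.IsZero() && now.Sub(c.last) < c.interval {
		return false // still cooling down, skip this discovery
	}
	c.last = now
	return true
}

// demoCooldown replays three discovery triggers against a 10s cooldown
// (the interval is an assumption chosen for illustration).
func demoCooldown() []bool {
	cd := NewCooldown(10 * time.Second)
	t0 := time.Unix(1000, 0)
	return []bool{
		cd.Allow(t0),                       // first trigger runs
		cd.Allow(t0.Add(3 * time.Second)),  // too soon, suppressed
		cd.Allow(t0.Add(10 * time.Second)), // interval elapsed, runs
	}
}

func main() {
	fmt.Println(demoCooldown())
}
```

With a gate like this in front of discovery, a stream count that dips below the lower threshold would trigger at most one discovery per interval, rather than an immediate retry on every failed setup.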
2021/10/08
- Working on stream sync. First looking at the potential memory / CPU leak. Running two machines with customized code on mainnet, with stream sync turned on. The customized code includes:
  - Frequent discovery calls (one discovery every 10s).
  - An unstable stream sync protocol (errors injected into stream sync so that it occasionally fails).
- Finished code revisit and RPC fix concerning stream sync.
- Next steps:
  - Investigate the CPU / memory leak.
  - Spin up a stressnet to reproduce the stuck short-range sync.
The result remains to be observed.