Gheis Mohammadi

Created time
May 2, 2022 3:36 PM

Gheis is a blockchain engineer experienced in C/C++, C#, Go, and JavaScript. He excels at developing DApps and blockchain services and is highly skilled in blockchain core development, with a good knowledge of blockchain technologies. Additionally, he is familiar with DevOps and automation. Gheis has a Bachelor’s in Electronics and a Master's in Mechatronics, so he is quite familiar with microcontrollers and electronic circuit concepts, and he has been involved in a couple of high-tech projects. Gheis has published a couple of papers on AI and image processing. He is familiar with auto-trading bots and has implemented dozens of Forex and crypto trading and analysis tools. He loves designing and implementing games in his free time and implemented his first 2D game in 1996.

Deliverables

2023 Oct, 02 - Oct, 06

I was on call starting from Wednesday. It has been a relatively quiet week so far. Soph helped me restart the development network (devnet), allowing me to continue my tests on a clean database.

I made continued progress on the tests for fast sync. I implemented numerous code changes to address the last remaining issues with fast sync. The branch has been rebased onto the dev branch and deployed to test nodes a couple of times. I conducted a series of heavy tests for the final stage of state synchronization, and there have been significant improvements in this area. This week, I achieved a major milestone by successfully completing one round of state synchronization for the first time.

We encountered an issue in the devnet environment that was causing failures in Travis tests. The problem stemmed from conflicts in the dev branch and the redeclaration of a function within the same package. This issue has been resolved through a pull request, which has been merged into the dev branch.

I also added tests to the GetNodeData code and submitted a PR to increase the capacity for receipt and state requests. This enhancement enables nodes to request more data per query. Although the numbers are still lower than Ethereum's capacity, they are now substantial enough to handle large batches of receipts and states per request. This has the potential to significantly boost the performance of state sync.

2023 Sep, 25 - Sep, 29

My primary focus was on fixing, refactoring, and testing fast sync. I initiated this process by refactoring the pivot block calculations. In addition, I updated the code that enables the transition between fast and full sync modes, allowing stages to be executed based on the chosen sync mode.
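
For context, the pivot block in fast sync is a recent block whose full state is downloaded, while older blocks are only inserted together with their receipts. A rough sketch of the general idea (the distance constant and names are illustrative assumptions, not the values used in the PR):

```go
package main

import "fmt"

// pivotDistance is how far behind the current head the pivot block is chosen,
// leaving a safety margin against short reorgs (illustrative value only).
const pivotDistance uint64 = 64

// calcPivot picks the pivot block for fast sync: blocks up to the pivot are
// inserted without execution, and the full state is downloaded at the pivot.
func calcPivot(currentHead uint64) uint64 {
	if currentHead <= pivotDistance {
		return 0 // chain too short: fall back to syncing state from genesis
	}
	return currentHead - pivotDistance
}

func main() {
	fmt.Println("pivot for head 24600000:", calcPivot(24_600_000))
}
```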

The main task was refactoring the receipts stage, which I subsequently merged with the block stage for fast sync. This integration allows a single function to insert both into the chain database. Various other refactorings were also made in the fast sync PR.

After completing these preparations, I ran a series of individual tests and, for the first time, a full-scale test of the block and receipt stages. Remarkably, the node on the devnet was able to download all 24.6 million blocks in approximately 78 hours. This equates to a rate of around 87 blocks per second. Notably, this was achieved with just two connections to other peers and a self-imposed limit of 10 on the maximum number of blocks and receipts per request.
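
As a quick sanity check on that figure, the rate follows directly from the totals above:

```go
package main

import "fmt"

func main() {
	const (
		totalBlocks = 24_600_000 // blocks downloaded on the devnet node
		hours       = 78.0       // elapsed wall-clock time
	)
	rate := totalBlocks / (hours * 3600) // blocks per second
	fmt.Printf("average sync rate: %.1f blocks/sec\n", rate) // ~87.6
}
```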

To further enhance synchronization, additional nodes can be introduced to facilitate syncing with peers, and the limit can be raised to match the 128 or 256 limits utilized by Ethereum. This will be my next phase of optimization. At present, our priority is ensuring that the synchronization process functions correctly, with performance improvements slated for subsequent stages.

Currently, I am actively working on the last stage, which involves downloading the states. While progress has been fairly good, a few challenging issues remain to be identified and addressed. If this part gets fixed, hopefully within the next two weeks, fast sync will be ready to deploy on the testnet.

2023 Sep, 18 - Sep, 22

I was on call, and it turned out to be a relatively quiet week. However, we encountered an issue with the devnet node consensus going down. Our investigation involved thoroughly examining nodes, databases, and logs. Fortunately, Soph provided assistance by copying the database from other nodes. It has become evident that our rollback functionalities require improvement. Apart from that, we faced a few issues due to Contabo maintenance.

One critical issue we encountered was related to the getReceipts RPC, which had the potential to cause node crashes. Although the mainnet does not currently utilize stream sync, this function was present in the codebase. I took the initiative to address and fix this issue, and I have submitted a hotfix PR for it. We plan to merge it with the latest mainnet release.

Following discussions with Casey, I rebased the stream sync improvement PR and added the getReceipts fixes into it as well. Soph and Diego were helping me to deploy the PR across all devnet nodes.

Additionally, I conducted several tests of the state sync PR and made some refactoring changes. The new updates were deployed to test nodes, and they successfully resolved 50% of the issues in the state sync code. Currently, the node can fetch blocks and receipts from other peers. I am committed to completing this PR by next week.

2023 Sep, 11 - Sep, 15

Soph and I attended the Token2049 event. I have created a report about the event in Notion.

Additionally, I have been on call since Wednesday. It has been a relatively quiet week so far, with the exception of a maintenance incident. We haven't experienced any other issues.

2023 Sep, 04 - Sep, 08

I focused on testing the state sync PR. During this time, I did some code refactoring. We had an issue related to the testing node, which took a day to resolve; the problem originated from Contabo's side. Soph, Diego, and Ulad assisted in investigating the OOM issue and resource-related aspects.

As of now, the overall state sync flow is functioning as expected, and the various stages are executing correctly. However, we're still encountering a few issues when it comes to receiving states from remote nodes. Specifically, the syncing node receives empty data, even though the request is correct, while the target node successfully receives and handles it. I'm still working on this aspect, and once it's resolved, we should be ready for a full round of testing.

Additionally, I had a couple of discussions with Soph regarding HIP30 and our strategies for applying it. We also explored an alternative solution involving a precompiled smart contract. At present, the decision is to proceed with our existing approach.

I had a few code reviews as well, and the team is addressing the review comments.

2023 Aug, 28 - Sep, 01

I was on call until Wednesday. It was a relatively quiet week with no major incidents to report. Additionally, we successfully deployed the latest stream sync PR on all nodes in devnet shard 0. Now, all devnet nodes are using the latest PR which has all the new improvements.

In shard 1 of the devnet, we had a minor issue when one node became stuck. However, the team managed to resolve this issue by allowing it to resynchronize from scratch. Apart from that, there was a minor issue with the explorer node, which Ulad and Soph promptly addressed and fixed.

With the current syncing issues fixed and the improvements completed, I've begun testing state sync and have made some code changes as part of the initial round of full testing. Once the testing is completed and all the necessary tests pass, I will be ready to finalize the PR for state sync, most likely during the next week.

Additionally, we had a couple of discussions with the team regarding HIP30, and we also conducted a few PR reviews.

2023 Aug, 21 - Aug, 25

This week, my main focus was the implementation of the ideas that I had documented on the Notion page. The current stream synchronization code is designed to identify neighboring peers and establish connections with them. This discovery process is completed through the bootstrap mechanism. Once nodes have successfully connected to the minimum number of peers, as configured in the config file, the bootstrap phase is complete, and they proceed to the subsequent step: synchronization with the connected peers.

In the mainnet environment, the majority of validators don't have a full database and they only possess the latest blocks. This presents a challenge, as new nodes seeking older blocks from these validators may not find the required data. Consequently, for a fresh synchronization, new nodes might have to rely on explorer or archive nodes, leading to a more centralized syncing process.

After the bootstrap phase is accomplished and a node is connected to its neighbors, during the synchronization phase, if a remote peer lacks the necessary blocks, the node removes it from the list. In the mainnet, this often results in the stream list becoming empty within a few seconds due to the limited database availability among most validators. Consequently, the node becomes stuck and is unable to progress further with the synchronization process.

I implemented all the proposed ideas to address the issues mentioned above and pushed the corresponding pull request (PR). This PR enhances the discovery process and ensures that longer stream lists are maintained. Additionally, it checks that there are sufficient available streams for each synchronization cycle. The team is currently reviewing the code. We have already deployed this PR on shard 1 in the devnet (thanks to Diego). Pending successful outcomes, we plan to deploy it on shard 0 as well. Subsequently, we intend to deploy it on the testnet, paving the way for its eventual readiness on the mainnet.

To facilitate the team's understanding, I have documented the stream management process. This documentation provides insights into how stream management is implemented in the existing codebase. It explains the process of adding streams and the manner in which nodes can remove them. Moreover, it clarifies the workings of the discovery mechanism, and it delineates the significance of each configuration option within the stream configuration.

In addition to the aforementioned tasks, I was on call starting from Tuesday last week. Fortunately, no major issues arose during this period. We did encounter a situation where we needed to upgrade a node and allocate additional memory. Aside from this, the week was rather uneventful, and the stability of the mainnet was maintained.

I also conducted a code review for a PR and provided constructive feedback to enhance the quality of the codebase.

2023 Aug, 14 - Aug, 18

This week, we successfully finalized PR 4467 and merged the code using the squash and merge method. This pull request introduced many changes to our stream sync process. I replicated these modifications in the state sync PR code as well. The implementation has been completed and pushed to a separate branch.

My primary focus was addressing the challenges associated with implementing stream sync on the mainnet. I did in-depth research, checking solutions used by other blockchains, and documented them, explaining both the issues and potential solutions. Later, I had discussions within the team, where we shared ideas and discussed the solutions.

Through these discussions, we evaluated various potential solutions and assessed the risks. After that, we reached a consensus on the following strategies to address the mainnet stream sync challenge:

  • Continued Discovery Process: We decided to enable nodes to continue discovering peers even after the bootstrap process has completed.
  • Enhanced Reserved List: Considering the importance of stability, we decided to enhance the reserved list with a higher number of standby nodes if necessary. This will ensure a smooth transition when streams are dropped.
  • New Config Option for Reserved Nodes: We agreed to introduce a new configuration option that allows us to define the number of nodes reserved.
  • Improved Re-bootstrapping: We have also decided to implement a mechanism that allows nodes to effectively re-bootstrap in situations where the peer list is empty.
  • Dynamic Stream Replacement: We plan to have nodes continuously request blocks from other peers. If a remote peer is unable to provide the requested block, the corresponding stream will be replaced with a reserved peer.

We also discussed concerns related to trust, especially during the bootstrapping process. While there might be concerns about the involvement of malicious nodes, they won't be able to participate in consensus.

I have already begun the implementation phase for these solutions. I hope I can get most of them done by the end of next week. By then we will deploy them to devnet and testnet.
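
To make the dynamic stream replacement and reserved list ideas concrete, here is a minimal sketch of the intended behavior; all names (streamManager, requestBlock, and so on) are hypothetical and do not reflect the actual Harmony stream manager API:

```go
package main

import (
	"errors"
	"fmt"
)

// Stream is a hypothetical handle to a remote peer used for syncing.
type Stream struct{ PeerID string }

// errBlockNotAvailable mimics a remote peer lacking the requested block.
var errBlockNotAvailable = errors.New("peer does not have the requested block")

// streamManager keeps an active list plus a reserved (standby) list,
// mirroring the "Enhanced Reserved List" idea above.
type streamManager struct {
	active   []Stream
	reserved []Stream
}

// replace swaps a failing active stream with one from the reserved pool
// instead of letting the active list shrink until it is empty.
func (m *streamManager) replace(bad Stream) {
	for i, s := range m.active {
		if s.PeerID == bad.PeerID {
			m.active = append(m.active[:i], m.active[i+1:]...)
			break
		}
	}
	if len(m.reserved) > 0 {
		m.active = append(m.active, m.reserved[0])
		m.reserved = m.reserved[1:]
	}
	// If both lists run dry, a real implementation would trigger
	// re-bootstrapping (continued discovery) instead of getting stuck.
}

// requestBlock is a stand-in for asking a peer for a block by height.
func requestBlock(s Stream, height uint64) error {
	if s.PeerID == "pruned-peer" && height < 24_000_000 {
		return errBlockNotAvailable // pruned nodes keep only recent blocks
	}
	return nil
}

func main() {
	m := &streamManager{
		active:   []Stream{{PeerID: "pruned-peer"}, {PeerID: "full-node-1"}},
		reserved: []Stream{{PeerID: "standby-1"}, {PeerID: "standby-2"}},
	}
	for _, s := range append([]Stream{}, m.active...) {
		if err := requestBlock(s, 1_000_000); err != nil {
			m.replace(s) // swap the failing stream for a reserved peer
		}
	}
	fmt.Println("active:", m.active, "reserved:", m.reserved)
}
```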

2023 Aug, 07 - Aug, 11

During the week, the team had a comprehensive review of PR 4467, which suggested modifying the debug logs. I addressed this feedback, making the necessary changes to the code and eliminating any extra logs. Furthermore, I checked and responded to the remaining review comments. The pull request has been approved by three reviewers and is ready for merging.

We had a couple of discussions within the team about the merging process. The main point was the consideration of squashing and merging PRs. While this approach would lead to a cleaner and more organized branch, it also entails merging all commits into a single commit, potentially complicating the process of retracing changes and comprehending the logic behind the code. These discussions took place both during my meeting with Soph and Max and during the engineering session. To test the potential solutions, we experimented with the concept of introducing release branches and pre-main branches. Eventually, Soph initiated a team-wide vote to determine whether a normal merge or squash merge approach should be used.

Another issue occurred on devnet in relation to libp2p, which was persistently logging an error in server mode. I investigated the libp2p codebase and found the root cause: the error message was hardcoded, leading to the logs being printed repeatedly. Following a team discussion, I decided to address this issue by introducing a server mode flag. This modification helps transition the peer into server mode and, as a result, halts the error logging. Later, I had a discussion with Soph, and we decided to approach the libp2p team to rectify this issue from their end. I initiated contact with the libp2p team, submitted a pull request to the libp2p repository, and successfully resolved the issue on their platform.

Another challenge we encountered was validators using SnapDB. Many validators on the mainnet rely on SnapDB, so their databases contain only the latest blocks. This can cause many issues during stream sync, as unsynchronized nodes are not able to request older blocks from these validators. As a result, nodes need to connect only to full nodes and request blocks from them. This creates a range of challenges, particularly given that we currently lack a mechanism to check whether a node has all the data. While a few ideas to address this challenge exist, they need further research and development. In an effort to find viable solutions, I've been closely examining how other blockchain networks, such as Ethereum, manage similar challenges.

2023 July, 31 - Aug, 04

During the week, I was on call until Tuesday and encountered a few issues with sync in the devnet environment. Soph and I discussed the problem and investigated thoroughly by examining logs and nodes. We identified three incidents:

1- A panic issue that caused nodes to crash randomly.

2- Empty bytes received when nodes attempted to download blocks, likely because of connectivity issues.

3- Node synchronization was successful, but the consensus block was not updated until the node was restarted and the cache was written to disk.

To resolve these issues, we restarted the nodes and restored consensus. I worked on PR 4467, implementing several changes and fixing the panic issue. It turned out that the empty bytes problem was also a result of the panic issue. However, the third issue is still under investigation.

Additionally, I fixed and rebased my other two PRs (4474, 25). The team reviewed them, I made the required changes, and they were completed and merged.

In response to some go-libp2p packages being moved to its main repository, I created PR 4483, which updates those deprecated packages.

As for PR 4467, it has been reviewed, and the team requested the removal of extra debug logic. We deployed this PR to all devnet nodes starting from Wednesday, and it seems stable so far with no error logs. This was a great achievement because at one point I was close to giving up on these issues. Investigating and replicating them was very tough and required going through every single output in each sync cycle in debug mode. Fortunately, I identified and managed to solve them.

Overall, it was a productive week with several critical issues resolved and significant progress made on various pull requests.

2023 July, 24 - July, 28

Starting from Wednesday, I was on call, and it has been relatively quiet so far.

During the week, I noticed a few instances where RPC tests failed on Travis. We knew it was mostly due to resource constraints. However, after more investigation, I found that there are other improvements that can be made. Currently, RPC tests wait for the local network to boot before starting the test suites. This approach sometimes fails when the local network boot times out and the test suites start regardless of the local network's readiness. To address this, I created a PR to enhance the process, ensuring that the local network is stable enough to run tests and reducing the occurrence of failures.
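
The readiness check could look roughly like the following; this is a hedged sketch (the RPC method, port, and polling logic are assumptions about a typical setup, not the actual test harness code):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// waitForLocalnet polls a local RPC endpoint until it answers a simple
// request, instead of starting the test suites after a fixed timeout.
func waitForLocalnet(rpcURL string, timeout time.Duration) error {
	payload := []byte(`{"jsonrpc":"2.0","method":"hmy_blockNumber","params":[],"id":1}`)
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Post(rpcURL, "application/json", bytes.NewReader(payload))
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // localnet is up and serving RPC
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("localnet not ready after %s", timeout)
}

func main() {
	if err := waitForLocalnet("http://127.0.0.1:9500", 5*time.Minute); err != nil {
		fmt.Println("skipping RPC tests:", err)
		return
	}
	fmt.Println("localnet ready, running RPC test suites")
}
```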

Another issue I worked on was related to consensus catchup, which was generating many error logs. The problem stemmed from the fact that the current code only joins consensus if the client is enabled but doesn't check the stream downloader. As a result, if a node disables DNS sync and relies solely on the stream, it won't run supportSyncing and thus won't join consensus. To resolve this, I created a PR that fixes the issue by adding a simple check for the downloader as well.
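
In essence the fix is a one-line condition change; a minimal sketch with hypothetical names (the real check lives in the node's syncing support path):

```go
package main

import "fmt"

// Config mirrors the relevant flags; the field names are illustrative only.
type Config struct {
	DNSSyncClientEnabled bool // legacy DNS sync client
	StreamDownloaderOn   bool // stream (staged) sync downloader
}

// shouldSupportSyncing reports whether the node should run the syncing
// support path and therefore be able to join consensus after catchup.
func shouldSupportSyncing(c Config) bool {
	// Old behavior: only c.DNSSyncClientEnabled was checked, so a node
	// relying solely on stream sync never joined consensus.
	return c.DNSSyncClientEnabled || c.StreamDownloaderOn
}

func main() {
	streamOnly := Config{DNSSyncClientEnabled: false, StreamDownloaderOn: true}
	fmt.Println("join consensus:", shouldSupportSyncing(streamOnly)) // true after the fix
}
```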

My main focus last week was on fixing conflicts in staged sync and consensus and making staged sync stable. The changes from two weeks ago were deployed and proved to be stable and effective. I also added several other improvements, effectively turning the PR into a new version of stream sync. These changes underwent multiple deployments on nodes for thorough testing, thanks to Diego and his great support. We deployed and tested several times a day, and some changes briefly caused consensus disruptions. However, I addressed most of the issues caused by the sync update code, and all local tests passed successfully.

At the end of the week, there was one instance where we lost consensus on shard 0 in devnet, but this time, no errors were found in the logs. I am currently investigating this matter to identify any hidden issues.

2023 July, 17 - July, 21

During the week, my primary focus was on resolving stream sync issues. Initially, I investigated node-related problems by analyzing the logs and implemented some code changes to address the issues. After monitoring and performance checks, I decided to refactor the stream code. (commits)

Several improvements were made to enhance stream synchronization:

  • Enhanced the short-range stage, reducing the required resources.
  • Improved the epoch sync stage and made various enhancements to the sync loop.
  • Added the last mile stage to the staged stream sync, resulting in improved stability when the node is fully synced.
  • Added Debug logs to the staged stream sync, enabling detailed logging of each process and issue.

By default, the stream sync debug mode is disabled, but it can be easily enabled for debugging and issue tracing.

Furthermore, I optimized the syncing process by eliminating redundant stage executions. The PR addressed an issue with one of the dev-net nodes.

I had a couple of productive discussions with Soph, and the PR has already been deployed on all devnet nodes, thanks to the contributions of Soph and Diego.

As of now, it appears that the issues have been resolved, and the nodes are stable. In the next week, I plan to investigate the explorer nodes as well, as they are currently not as stable as expected.

2023 July, 10 - July, 14

During the week, Soph and I had several productive discussions related to DevOps and CI/CD matters. We successfully addressed the following issues:

1- Git Issue - Error: Pathspec 'feature/statesync_client' did not match any file(s) known to git.

To resolve this, we eliminated redundant code from the script. You can find the details in this pull request.

2- Rosetta Checker Error:

After multiple discussions and a thorough examination of the Rosetta checker code, we determined that the issue was related to Travis resources. It was fixed by rebuilding the PR.

3- Discord Notification:

After checking, there were no issues with Discord notifications.

4- RPC Issue:

We managed to fix this problem after gaining access to rebuild the PR.

5- Localnet Testing:

After careful consideration, we concluded that we should not be testing long-range sync in the local net on the validator.

6- PR #4452:

We discussed different types of syncing and the necessity of merging PR #4452. It is crucial to upgrade dev to have these functions on all nodes. Later, I will be able to test fast sync.

In addition to addressing the above, I also fixed an issue with the kill_node script (pull request: 4463). During CI/CD tests, we suspected this script might be causing random test failures. However, we later realized it is a cosmetic script, and the code does not actually close any service because the services run in separate instances.

I have submitted the "Initial Version of State Sync." The pull request is currently in draft status as it requires a couple of tests and very likely more changes to the code. The fast sync feature relies on streams for synchronization, making it essential to stabilize stream sync first. Presently, we are facing an issue with one of the nodes in devnet. Our team is actively working on resolving this issue, and once it's fixed, we can proceed with thorough testing of fast sync.

Our first goal for next week is ensuring that stream synchronization is stable and resolving any outstanding issues, since this is a prerequisite for the success and reliability of state sync.

2023 July, 3 - July, 7

During the week, I was on call ( weekly on-call report ), and we had a relatively quiet week. The team successfully upgraded all dev-net and test-net nodes to version 2.4. Soph and I also had a great discussion regarding a consensus issue that occurred on one node while the stream sync was running. We have come up with a few ideas to resolve this issue, and we are still working on it.

One of the tasks I completed was refactoring the stream client code to support an array of receipts for multiple blocks. The previous implementation did not have this capability. I have already finished and pushed the code (Commit), and the team has reviewed and approved the pull request (PR). It's now ready to be merged into the dev branch, so I will be able to use it for state sync tests.
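
Conceptually, the change means one request can carry several block hashes and return one receipt list per block. The sketch below is purely illustrative (hypothetical types, not the actual stream client API):

```go
package main

import "fmt"

// Receipt is a stand-in for a transaction receipt.
type Receipt struct{ TxHash string }

// GetReceiptsRequest asks one peer for the receipts of several blocks at once.
type GetReceiptsRequest struct {
	BlockHashes []string
}

// GetReceiptsResponse returns one receipt list per requested block,
// in the same order as the request.
type GetReceiptsResponse struct {
	Receipts [][]Receipt
}

// fetchReceipts simulates the client sending the request to a peer.
func fetchReceipts(req GetReceiptsRequest) GetReceiptsResponse {
	out := GetReceiptsResponse{Receipts: make([][]Receipt, len(req.BlockHashes))}
	for i, h := range req.BlockHashes {
		// A real peer would look the receipts up in its chain database.
		out.Receipts[i] = []Receipt{{TxHash: "tx-of-" + h}}
	}
	return out
}

func main() {
	req := GetReceiptsRequest{BlockHashes: []string{"0xaaa", "0xbbb", "0xccc"}}
	resp := fetchReceipts(req)
	fmt.Printf("fetched receipts for %d blocks\n", len(resp.Receipts))
}
```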

Additionally, I made progress on the state sync feature (commits). I refactored the block insertion and receipt insertion parts, and I improved the staged sync to handle multiple sync modes. Previously, the implementation only supported full sync mode, but now it can adapt stages based on the selected sync mode. I have completed both receipts and states stages, and I'm currently working on a few issues to ensure the code is ready for testing.

I also rebased several PRs with the DEV branch and resolved the conflicts. These PRs are now ready for team review and merging.

Overall, last week was a productive week. Next week I'm looking forward to continuing the state sync and fixing the possible issues.

Also, I have plans to refactor certain parts of the staged sync codes. Additionally, I want to explore the possibility of implementing state sync functionality into the DNS sync, as the main-net currently utilizes DNS sync. This approach could potentially be merged faster into the main-net compared to using stream sync.

2023, June, 26 - June, 30

I worked on state sync client integration with stream sync last week. It is about 70% completed. This was more complicated than I expected, so I created two new stages: one for state sync and one for receipts. The state sync stage downloads all states from neighbor peers, and the receipts stage downloads all receipts (the results of block execution) from other peers.

If state sync is enabled, the node should not execute blocks. Instead, it should only add blocks to the database while the states are fetched by the related stage. We need to refactor the block insertion process to be able to add blocks without executing them. This will involve rebuilding the entire database.
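
A rough sketch of the mode switch being described (hypothetical names, not the actual Harmony insertion code): in full sync every block is executed, while in fast sync blocks and receipts are written directly and the state comes from the separate state sync stage.

```go
package main

import "fmt"

type SyncMode int

const (
	FullSync SyncMode = iota // execute every block
	FastSync                 // write blocks + receipts, fetch state separately
)

type Block struct{ Number uint64 }

// insertBlock decides, based on the sync mode, whether the block is
// executed or only persisted together with its downloaded receipts.
func insertBlock(mode SyncMode, b Block) {
	switch mode {
	case FullSync:
		fmt.Printf("block %d: execute transactions, update state, write block\n", b.Number)
	case FastSync:
		fmt.Printf("block %d: write block and receipts without execution\n", b.Number)
		// The state itself is downloaded by the state sync stage at the pivot.
	}
}

func main() {
	insertBlock(FastSync, Block{Number: 12345})
	insertBlock(FullSync, Block{Number: 12346})
}
```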

I will try to complete the integration next week. If all goes well, we will start testing.

2023 June, 19 - June, 23

I created a GitHub issue for the persistent disk cache issue in stream sync and added exact instructions on how to replicate it. Later, I created a pull request that fixes the issue.

I added instructions to the state pruning PR on how to use the commands. Also fixed a few merge conflicts on that PR.

The stream client was completed last week. This week, I fixed the issues, added tests, finalized the code, and created a PR. The PR is ready for review. This PR adds all the necessary functionalities to the stream client, which will later be used by state sync to fetch state data from other nodes.

I started working on the integration of the client for state sync. It is 30% complete and should be the last step for fast sync. After the integration, we will do the devnet deployments and final tests.

2023 June, 12 - June, 16

After a few weeks of working on issues, tests, hotfixes, and a couple of improvements, last week I was completely focused on the fast sync implementation. I did thorough research on state sync methods, including fast sync and snap sync. Any sync method needs a few new functionalities in the stream client. This client is responsible for building, sending, receiving, and managing sync requests, so adding the new functions to the client means all nodes will be able to request the data needed for state sync.

The client code is 100% complete (State Sync Client) and has been pushed to the "feature/fastsync" branch. Next week I will be testing the client and debugging and enhancing these functions. Once the client is done, the next step will be integrating it into the sync engine. My plan is to add it as a new step in staged stream sync.

2023 June, 5 - June, 9

I made a few changes in the sync code to test the syncing part, and we redeployed the new version on all devnet nodes, including boot nodes. The nodes synced 500k blocks and all became fully synced. However, there seems to be an issue with another part of sync that Konstantin was working on. He made a few fixes, and we are discussing them to make sure they match the production code in the syncing module.

Later, the PR #4444 upgraded the dependencies, and I made a small fix to the cache folder path.

I had a great discussion with Max about extracting accounts from statedb.

Casey and I had a discussion on state sync challenges, and we decided to start with fast sync. Later we can add snap sync, so users will have different syncing options, similar to what other blockchains offer.

I am implementing fast sync and am about 30% done with the client. I will use this client for staged sync later. However, due to new changes in syncing, I may need to update the code. To test the state sync code over streams, we have to deploy it on all devnet nodes; then any new node added to devnet should be able to use fast sync.

2023 May, 29- June, 2

I was on call starting from Wednesday. It was a quiet week with no major issues.

I worked with Konstantin to fix a panic issue in the stream sync code. If a node doesn't have a proper connection to its neighbors, it doesn't receive block bytes, and iterating over the empty block bytes array later breaks the sync cycle. This issue has already been fixed.
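
The fix essentially guards the iteration over the downloaded block bytes; a minimal illustration (hypothetical code, not the actual patch):

```go
package main

import (
	"errors"
	"fmt"
)

var errNoBlockBytes = errors.New("no block bytes received from peers")

// processBlockBytes decodes and applies downloaded blocks. Without the
// empty check, indexing into blockBytes on a poorly connected node
// panicked and broke the whole sync cycle.
func processBlockBytes(blockBytes [][]byte) error {
	if len(blockBytes) == 0 {
		return errNoBlockBytes // retry in the next cycle instead of panicking
	}
	for i, raw := range blockBytes {
		if len(raw) == 0 {
			return fmt.Errorf("empty block payload at index %d", i)
		}
		// decode and insert the block here
	}
	return nil
}

func main() {
	if err := processBlockBytes(nil); err != nil {
		fmt.Println("sync cycle skipped:", err)
	}
}
```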

After researching the state sync, I decided to implement the same approach as Ethereum. I have started implementing the sync client to fetch the receipt and snapshot. This should take a few days to implement and test. Once the client is implemented, I will integrate them into the stream sync as new stages.

May 2023, 22-26

• We enabled stream sync on all devnet nodes. Stream sync was implemented a few months ago and had been disabled for a while due to an unknown issue. The devops team reported an issue about routing; after checking, we confirmed that routing was working fine. We needed to investigate further through the logs and debug the code, but we didn't have those logs, so we had to deploy the newest binary with stream sync enabled in order to monitor the nodes. All nodes in shards 0 and 1 in devnet were working fine after deployment, except for one. I investigated the error and found that the issue was caused by conflicts between the consensus and syncing modules. Both of them were working fine separately, as stream sync had been working for almost 3 months without any issue, so I suspected it must be related to the latest changes. We tested a few solutions, such as logging all consensus output, reverting the database, enabling/disabling different modules, testing a few changes in p2p, and manipulating the channels between sync and consensus. The issue only happened on one node, and the others are working fine. We are still working on this.

• The cache folder was previously stored in the current directory of the binary file. This is not good practice, as it can lead to problems if the binary file is moved or we want to migrate the node. As Soph suggested, we moved it to the data folder. This is a more organized and structured way to store the cache, and it makes maintenance and node migration easier. (PR #4438)

• There were a few conflicts in PR #4422, which required rebasing onto the latest dev branch to fix them.

• I continued my research on state sync and worked on Erigon codes as well. Erigon is a full node implementation of Ethereum. It does not support fast sync or snap sync, so I switched back to go-ethereum codes as my reference. I am still working on this to get familiar with the flow.

May 2023, 15-19

I was on-call, and we encountered a few incidents, such as consensus being down and devnet shard 0 getting stuck. The whole tech team and devops team worked hard to fix these issues. The full report of the on-call can be found here: https://docs.google.com/document/d/1YEsYHqElb7RA2foz88nXFE6KwHX0ysOdDbzYfPNTtYM/edit

During a session with Max, we had a productive knowledge sharing and discussion for a couple of hours regarding the P2P protocol and the challenges related to cross shard communication.

Following the release of version 2023.2.0, we encountered an issue on the mainnet, which required us to merge hot fixes. To ensure a reliable version of the beacon chain database, I set up a backup node (with assistance from Diego) that utilized the original version of the database and an older binary. We utilized the February 2023 version of the DB, allowing the previous version of the binary to sync and continue creating states as we had before the last release. This backup node's DB can be used as the original version of the statedb for any future incidents.

We faced an issue with logs on the devnet, where any error in the print logs could disrupt the sync process. I addressed this issue by fixing the stream log problem and replacing print log errors with warnings (https://github.com/harmony-one/harmony/pull/4433).

There were a few issues with snapshot integration that caused it to fail the tests. These issues were fixed (https://github.com/harmony-one/harmony/pull/4419/commits/5bbafd834ef262ceab3d702b893c58ebb7a6679c).

I resolved conflicts while rebasing dev with main and worked on different approaches for the rebase. I pushed these approaches to two separate branches: dev_rebase_with_merge (https://github.com/harmony-one/harmony/tree/dev_rebase_with_merge) and fix/dev_rebase_conflicts (https://github.com/harmony-one/harmony/tree/fix/dev_rebase_conflicts). I later discussed these approaches with Soph, and we had a productive session.

I also started researching different state sync approaches, such as full sync, fast sync, and snap sync. We have already implemented DNS sync and Stream sync, both falling under full sync. Ethereum has two additional sync modes: Snap and Fast. Snap sync is the default sync behavior in the geth node and only requires downloading less than 2GB (128 blocks and the latest trie). Consequently, it can be completed within a few hours, making it much faster than full sync. However, a snap node cannot respond to requests for historical data, and snap sync cannot be applied to archive or explorer nodes. I will go through the details of each approach to find the proper method to use in the Harmony node.

May 2023, 8-12

We continued working on the mainnet consensus stuck issue. The error message "can't load code hash …: not found" appeared in the logs of the stuck nodes. This error is very random and only happens at random blocks. Only upgraded internal validators encountered it, and the strange part is that after restarting they could continue working as normal. There are a few other signs that the error is a combination of several hidden issues. On my side, I created a command to inspect the database, iterating all states and populating stats.

(https://github.com/harmony-one/harmony/pull/4426). The stats could help us identify the root cause. We set up a new mainnet validator node, and I tested the inspection there. It took almost 7 hours to inspect 3.39 TB of data for the February version of the DB. The inspection showed that the statedb is correct and there were no meaningful differences between DB versions.

Later we noticed that the logs were showing the wrong beacon-ness flag. I fixed that (https://github.com/harmony-one/harmony/pull/4428), but it was not related to the mainnet outage.

As the next step to identify the mainnet issue, I reviewed the rawdb upgrade code and node logs again and found another possible scenario that could cause problems retrieving codes from the DB. The validator wrapper code is prefixed differently than contract code. Prior to the current release (2023.2.0), neither of them used a prefix; both used the code hash as the key to store the code in the database. We used a flag called “isValidatorCode” to tell the state object how to fetch the code and which prefix to use, but this flag cannot be determined for old data, because validator and contract codes have no prefixes there and there is no way to distinguish between the keys. So, it defaulted to false in most parts of the code, which can cause problems if the code belongs to a validator wrapper and is stored under the validator prefix. The new PR removes that flag and tries to retrieve the code both without a prefix and with all possible code prefixes. This issue was addressed later (https://github.com/harmony-one/harmony/pull/4430), and the code was reviewed and added directly to the main branch as a hotfix. We hope this fixes the mainnet issue as well, but we need to replicate the mainnet issue on localnet to be able to test.
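
The retry logic described above roughly amounts to trying the legacy unprefixed key and then each known prefix; here is a simplified sketch with hypothetical prefixes and a map standing in for the database (not the actual PR code):

```go
package main

import "fmt"

// Hypothetical key layouts; old data stored code under the bare code hash.
var codePrefixes = [][]byte{
	nil,          // legacy layout: key == code hash, no prefix
	[]byte("c"),  // contract code prefix (illustrative, not the real bytes)
	[]byte("vc"), // validator wrapper code prefix (illustrative)
}

// readCode tries every possible key layout instead of relying on an
// isValidatorCode flag that cannot be known for pre-2023.2.0 data.
func readCode(db map[string][]byte, codeHash []byte) ([]byte, bool) {
	for _, p := range codePrefixes {
		key := append(append([]byte{}, p...), codeHash...)
		if code, ok := db[string(key)]; ok {
			return code, true
		}
	}
	return nil, false
}

func main() {
	hash := []byte{0xab, 0xcd}
	db := map[string][]byte{
		string(append([]byte("vc"), hash...)): []byte("validator wrapper bytecode"),
	}
	code, ok := readCode(db, hash)
	fmt.Println(ok, string(code))
}
```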

May 2023, 1-5

This was a bit of a tough week. After the latest release, 2023.2.0, we saw a few new issues we hadn't seen before. This release was huge and contains a lot of code changes, which makes it harder to identify issues. Max reported one common issue on all internal validators: reading a smart contract's code from the DB. The leader committed a block that other internal validators couldn't validate because they couldn't fetch the code for that specific contract. We all worked on the issue, trying to figure out the root cause. I made a few changes to check the explorer and validator nodes to see if that code exists in the DB. We had a couple of discussions and meetings on that, and we are still working on it. One suspected cause is the changes to the DB accessors and the code-fetching function from the rawdb upgrade PR. The issue is somewhat random and hard to replicate on localnet.

I finalized the snapshot integration PR (https://github.com/harmony-one/harmony/pull/4419). This PR has the potential to enhance the performance of the state database: the snapshot reduces read times from O(log n) to O(1). We need this integration to be able to use the Ethereum pruner. If we revert the rawdb upgrade PR because of the mainnet issue, this PR will stay pending.

I also submitted the state pruning PR (https://github.com/harmony-one/harmony/pull/4422) to share with Adam. This PR uses the Ethereum pruner, and it still needs more work, tests, and improvements.

April 2023, 24 - 28

⁃ PR #4402 needed testing to make sure it works as expected and doesn't break the states. Thanks to @Max, we used a Python script (#4413) to build a validator on localnet and monitor the related states. This PR was confirmed, merged to dev, and included in release 2023.2.0.

⁃ In order to prune states, a snapshot must be generated, which requires iterating all states. Iteration takes a few hours, and then state pruning can start. The solution is to have the snapshot ready before pruning, so I integrated the snapshot so that it is generated and kept live-updated. Whenever a new block is processed, we do not merge the writes directly into the disk layer; instead, we create a new in-memory diff layer with the changes. If enough in-memory diff layers pile up on top, the bottom ones start getting merged together and are eventually pushed to disk. The snapshot essentially reduces reads from O(log n) to O(1) (times LevelDB overhead), at the cost of increasing writes from O(log n) to O(1 + log n) (times LevelDB overhead) and increasing disk storage from O(n log n) to O(n + n log n). It significantly improves statedb performance. All of these changes are implemented and under test. I will push the code once the tests are done; then we will be ready to finalize the PR for state pruning.
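
To illustrate why reads become O(1): a lookup walks the small stack of in-memory diff layers from newest to oldest and then falls back to the flat disk layer, instead of descending the trie. This is a simplified sketch of the layering idea, not the actual go-ethereum snapshot code:

```go
package main

import "fmt"

// diffLayer holds the account changes introduced by one block.
type diffLayer struct {
	accounts map[string][]byte
}

// snapshotTree is a stack of in-memory diff layers on top of a flat
// disk layer (a plain map standing in for LevelDB here).
type snapshotTree struct {
	diffs []diffLayer       // newest layer last
	disk  map[string][]byte // persisted flat snapshot
}

// account resolves an account by checking diff layers newest-first and
// then the disk layer: a constant number of map lookups, not a trie walk.
func (t *snapshotTree) account(addr string) ([]byte, bool) {
	for i := len(t.diffs) - 1; i >= 0; i-- {
		if v, ok := t.diffs[i].accounts[addr]; ok {
			return v, true
		}
	}
	v, ok := t.disk[addr]
	return v, ok
}

func main() {
	t := &snapshotTree{
		disk: map[string][]byte{"one1alice": []byte("balance=10")},
	}
	// A new block changes alice's balance: record it as a diff layer only.
	t.diffs = append(t.diffs, diffLayer{accounts: map[string][]byte{
		"one1alice": []byte("balance=7"),
	}})
	v, _ := t.account("one1alice")
	fmt.Println(string(v)) // balance=7, served from the newest diff layer
}
```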

April 2023, 17 - 21

  • On call
  • work on testnet issue

https://github.com/harmony-one/harmony/issues/4407

  • remove old stream codes

https://github.com/harmony-one/harmony/pull/4408

  • work on state pruning (the initial version of state pruning is completed and I am testing the code)

April 2023, 10 - April 14

  • Add snapshot feature (an Ethereum feature adjusted for Harmony core)

https://github.com/harmony-one/harmony/pull/4406

⁃ needs more tests and investigation on compatibility with Harmony state db

  • Work on localnet issue

⁃ tested many possible scenarios for staged stream sync with different configs and db mode

  • Review PR (ongoing)

April 2023, 3 - April 7

  • Add prefix for contract code: the first solution I worked on needs a hard fork because it adds a prefix to the code after a certain epoch. (https://github.com/GheisMohammadi/harmony/commit/c8a1a3a4d713c47da86845266ea535ead0cc3004)

Later I noticed that the upgraded rawdb already has the code ready to handle it, and it can be done without a hard fork. Ethereum also uses the same solution. (https://github.com/harmony-one/harmony/pull/4397)

  • Add prefix for validator wrapper (https://github.com/harmony-one/harmony/pull/4402)

I am planning to do more detailed tests on this PR.

  • Working on an issue for localnet (https://github.com/harmony-one/harmony/issues/4403)

It still needs more investigation

March 2023, 27 - March 31

  • I was on-call
  • Fix review comments
    • Marked all the functions which are not used yet and just ported over from eth
    • Fix merge conflicts
  • Tests for statedb and rawdb update PR finalized
    • fetch random transactions
    • fetch random blocks
    • cross check block transactions
    • a few RPC methods
    • added a few functions to dr.Harmony to do tests.
    • All passed and I haven't seen any issue.

  • Moved all rawdb dependencies to harmony repo
  • Worked on Add contract prefix

March 20 - March 24

  • On Leave

March 14 - March 17

  • last tests on rawdb and statedb
  • Run an archival node to test statedb for archival nodes
  • working on next PR to add db snapshot

March 7 - March 13

  • on call
  • update rawdb and statedb (https://github.com/harmony-one/harmony/pull/4374)
  • testing PR 4374 by running a validator node on testnet and mainnet
  • testing statedb using explorer node (in progress)

Feb 29 - March 7

Conducted deep study on state pruning

  • decided to start with offline mode, so I tried to use the same idea as go-ethereum for state pruning.
  • The first requirement for implementing state pruning is that the DB should have snapshot functionality. Harmony uses the Ethereum DB, and we use the same rawdb code as Ethereum with a few changes and some custom code. However, the Harmony rawdb code was taken from Ethereum a long time ago, and since then the Ethereum DB has received a lot of improvements and updates that we don't have yet. One of those updates is the snapshot. We are also using old versions of the trie and state DB, which need to be updated as well.

So, I made a roadmap for implementation of state pruning:

1- Update the rawdb and statedb in the Harmony code first. This is a huge change that requires updating almost every part of the code that touches the DB.

The PR is under review by the team.

2- Add snapshot functionality.

3- Implement state pruning.

4- Complete the tests and run a few nodes to monitor states and performance consistency.

Aug 2022 Deliverables

♏️ Gheis Mohammadi 🛡: Erigon’s staged sync (block execute & download); Stream sync on testnet (testing and integration with staged sync); leader rotation. 100/20/0%

July 2022 Deliverables

♏️ Gheis Mohammadi 🛡: Erigon code review, Staged sync with erigon’s database (10X potential on sync speed from 20TB archival, block catchup among 5% out of sync 160 validators) upgrade from DNS sync and later streaming sync. 100/100/100%

June 2022 Deliverables

♏️ Gheis Mohammadi 🛡: Takeover block sync code; Research Erigon block sync; Implement Erigon block sync. 100/90/80%

May 2022 Deliverables

♏️ Gheis Mohammadi 🛡: Fix unexpected ChainID issue in EVM, Fix endpoints prometheus metrics reporting, Separate staking endpoints from ETH endpoints. 100/100/100%

Cultural Self Assessment

Cultural Values
Self Assessment + Personal Story
Empathy
Communicative
3- I am always trying to be a good listener, but normally I ask very deep questions afterward to enhance my understanding.
5- I don’t follow zero and one logic for my life. Everything could be partially true or false based on testimony and situations.
9- I learned to be honest with myself and ask myself thoughtful questions. I pay attention to my inner voice. Because I think my inner voice expresses what I feel and believe.
Personal
5- It depends on the situation, but normally it could be a good starting point.
5- If we enjoy, then why not? I like the strong energy of people and I believe it’s a combination of their past, their mindset, their dominant thoughts, and their perception of the world.
10- Most people who know me say you are very good at making complicated things so easy to understand. That’s very cool to help people move forward.
Collaborative
8- I believe everyone has their own talents; they just need to be shown what they are good at. That’s how we can make a strong team.
5- It depends on the people and the process and the situation. Sometimes people and sometimes the goal we are looking for. The people’s talent makes the process and the process by itself could be dependent on the team, not each individual.
7- Sometimes I do it directly and sometimes I start to ask questions to clear the details that I do not agree with. I try to find a good end to unhealthy conversations.
Passion
Devoted
8- That’s ideal but sometimes not possible. I believe focusing on the big picture makes you patient and helps you make better long-term decisions.
5- It depends. I think being flexible helps with not breaking hard. Focusing too much on details pushes you away from your goals.
5- I am hungry and trying not to be foolish. But sometimes it happens.
Aligned
8- That’s a part of my everyday life and work. I like to improve my teamwork skills. Being on the same mission and sharing it is an inevitable part of that.
5- I am optimistic about it if we know what we are doing and believe in it, but nothing is guaranteed.
5- Trying to be. But sometimes I like to take care of other people’s dreams rather than mine.
Authentic
10- Thinking deeply before taking action makes me very consistent with my own actions.
4- I would like to make small decisions correctly. It’s very hard to make tough decisions if I don’t know the consequences or they are based on luck. But sometimes we have to, and to be honest, taking the first step matters, whether it’s right or not.
7- I made lots of mistakes in my life which I couldn’t do better than that!
Excellence
Technical
7- Not sure, but I believe everyone can be a superstar in their own way.
8- That helps to make me more productive. As an engineer, it is crucial to learn how to use proper tools and instruments.
9- Being in this field for over two decades and being involved in a couple of teams taught me how to implement what is appropriate.
Potential
8- I do like breaking things down into small steps and taking each single step correctly and consistently. It could take you to any level of growth.
8- I enjoy learning new things, especially those I am passionate about.
7- I do like to get enough reasons and think about them.
Impact
9- If the word important means something worth doing, I will try it out and try to set up the steps to achieve it.
8- Too much activity and no productivity means something is wrong. No activity and too much productivity means cheating. I believe there should be a proper ratio between activity and productivity and anything out of this range is not normal and should be investigated.
8- Based on the time frame and window, there is some sort of order in any chaos. Sometimes being in chaos is healthier than being calm, and it makes you stronger.