PROBLEM:
I want to add fault tolerance for DGEs so that monitoring fails over to a backup DGE if the primary DGE goes down.
SOLUTION:
The Traverse architecture relies on OS- and server-level high-availability options, which lets it fit into the existing IT processes of different organizations.
Traverse already has a large degree of fault tolerance built in: the Data Gathering Engines (DGEs) are distributed, and only the metadata is centralized in the Business Visibility Engine (BVE). As a result, a new machine can assume the identity of a failed DGE simply by connecting to the BVE with the same identity (name) as the failed DGE. For high availability, the typically recommended solution is as follows:
- At regular intervals, save the BVE and DGE databases to another server or servers (depending on how much redundancy you want). Back up the BVE database (configuration data) more often at first; as your system matures and configuration changes become less frequent, you can fall back to once every 24 hours.
- If a server fails (DGE or BVE), enable the spare machine with the identity of the failed machine. Most customers are comfortable with a 5-10 minute interval before monitoring resumes.
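The two steps above can be sketched in a few lines of Python. This is only an illustration of the scheduling and failover logic, not the procedure Professional Services implements. Every name here (`backup_database`, `backup_due`, `should_fail_over`, the directory layout, the heartbeat timestamps) is a hypothetical stand-in rather than part of Traverse, and the plain directory copy is a placeholder for whatever export or dump method your database actually requires.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def backup_database(source_dir: Path, backup_root: Path, now: float) -> Path:
    """Copy source_dir into a UTC-timestamped folder under backup_root.

    Assumption: a file-level copy is enough for illustration. A live
    database normally needs its own export tool to get a consistent copy.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    target = backup_root / f"{source_dir.name}-{stamp}"
    shutil.copytree(source_dir, target)  # target must not already exist
    return target


def backup_due(last_backup: Optional[float], interval_s: float, now: float) -> bool:
    """Return True if no backup exists yet or the interval has elapsed."""
    return last_backup is None or now - last_backup >= interval_s


def should_fail_over(last_heartbeat: float, timeout_s: float, now: float) -> bool:
    """Return True once the primary has been silent for longer than timeout_s.

    Assumption: some external check records when the primary last responded.
    """
    return now - last_heartbeat > timeout_s
```

With a 24-hour backup interval you would call `backup_due(last, 24 * 3600, time.time())` from a scheduler, and `should_fail_over(last_seen, 300, time.time())` would signal after 5 minutes of silence that the spare machine should take over the failed machine's identity.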
*Please contact your Customer Success manager to set up this High Availability configuration; our Professional Services team (PSE) is required to implement the above solution.