modules/cluster-and-ha-management/pages/crr-monitoring.adoc

= Monitor Cross-Region Replication (CRR)
//:page-aliases: crr:monitor-crr.adoc
:description: Monitor Cross-Region Replication (CRR) status and metrics collection in TigerGraph.
:sectnums:

Starting from version 4.3, TigerGraph Disaster Recovery (DR) clusters support monitoring Cross-Region Replication (CRR) status and metrics. This enhancement enables administrators to track replication and replay progress, identify potential delays, and integrate monitoring data with external observability systems.



The CRR metrics collection interval can be configured using the following command (default: `60s`):

[source,console]
----
gadmin config entry System.Metrics.CRRIntervalSec
----
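
Alternatively, the value can be set non-interactively with `gadmin config set` followed by `gadmin config apply` (a sketch only; the `120` below is an example value, and you should confirm the expected value format with `gadmin config entry`):

[source,console]
----
# Example: collect CRR metrics every 120 seconds
gadmin config set System.Metrics.CRRIntervalSec 120
gadmin config apply -y
----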

[#cli_status_query]
== CLI-based CRR status query

TigerGraph introduces a new `gadmin` sub-command to display detailed CRR metrics.

[source,console]
----
gadmin crr status [flags]
----

Displays comprehensive CRR status information, including:

* Overall CRR system status
* Topic replication and replay lag
* MirrorMaker 2 connector status

=== Flags

|===
| Flag | Description

| `--json` | Outputs CRR status in JSON format.
| `--realtime` | Retrieves real-time status instead of cached data.
| `-v`, `--verbose` | Displays detailed metrics, including source and target offsets and timestamps.
|===
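
For example, to retrieve real-time status with full offset and timestamp detail in JSON (an illustrative combination of the flags above):

[source,console]
----
gadmin crr status --realtime --json --verbose
----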


[#topic_metrics]
=== Topic Metrics Definition


|===
| Field Name | Description | Notes | Verbose only

| `replicationLagRecords` | Number of records the DR topic is behind the corresponding PR topic. | Calculated as `sourceOffset - replicatedOffset` | false
| `replayLagRecords` | Number of records the replay process in DR is behind the replicated topic. | Calculated as `targetOffset - replayedOffset` | false
| `replicationLagSeconds` | Time lag in seconds between the latest record in PR and its replication in DR. | Calculated as `sourceTimestamp - replicatedTimestamp` | false
| `replayLagSeconds` | Time lag in seconds between the latest record in DR and its replay. | Calculated as `targetTimestamp - replayedTimestamp` | false
| `replicationStalledSeconds` | Duration in seconds that replication from PR to DR has made no forward progress. | 0 means replication is actively progressing. | false
| `replayStalledSeconds` | Duration in seconds that replay within DR has made no forward progress. | 0 means replay is actively progressing. | false
| `sourceOffset` | Latest offset available in the PR Kafka cluster. | Highest offset produced in the PR topic. | true
| `replicatedOffset` | Offset up to which data has been replicated from PR to DR. | Highest offset confirmed written to the DR topic. | true
| `targetOffset` | Latest offset available in the DR Kafka topic. | Highest offset physically present in the DR topic. | true
| `replayedOffset` | Offset up to which the replay service in DR has processed records. | Highest offset successfully replayed in DR. | true
| `sourceTimestamp` | Timestamp of the record at `sourceOffset` in PR. | Based on Kafka message timestamp. | true
| `replicatedTimestamp` | Timestamp of the record at `replicatedOffset` in DR. | Based on Kafka message timestamp. | true
| `targetTimestamp` | Timestamp of the record at `targetOffset` in DR. | Based on Kafka message timestamp. | true
| `replayedTimestamp` | Timestamp of the record at `replayedOffset` in DR. | Based on Kafka message timestamp. | true
|===
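
For example (hypothetical values): if `sourceOffset` is 1250 and `replicatedOffset` is 1230, then `replicationLagRecords` is 20; if `sourceTimestamp` is `10:00:05` and `replicatedTimestamp` is `10:00:00`, then `replicationLagSeconds` is 5.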

[#openmetrics_integration]
== OpenMetrics Integration

CRR topic metrics can also be collected in **OpenMetrics** format using the following API:

[source,console]
----
curl http://localhost:14240/informant/metrics
----
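
To view only the CRR metrics, the output can be filtered with standard shell tools, for example:

[source,console]
----
curl -s http://localhost:14240/informant/metrics | grep tigergraph_crr_
----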

=== OpenMetrics Definition

|===
| Type | Name | Label | Description

| Metric | `tigergraph_crr_replication_lag_records` | topic | Replication lag (records)
| Metric | `tigergraph_crr_replay_lag_records` | topic | Replay lag (records)
| Metric | `tigergraph_crr_replication_lag_seconds` | topic | Replication lag (seconds)
| Metric | `tigergraph_crr_replay_lag_seconds` | topic | Replay lag (seconds)
| Metric | `tigergraph_crr_replication_stalled_seconds` | topic | Duration replication has stalled
| Metric | `tigergraph_crr_replay_stalled_seconds` | topic | Duration replay has stalled
| Metric | `tigergraph_crr_source_offset` | topic | Source offset in PR
| Metric | `tigergraph_crr_replicated_offset` | topic | Replicated offset in DR
| Metric | `tigergraph_crr_target_offset` | topic | Target offset in DR topic
| Metric | `tigergraph_crr_replayed_offset` | topic | Replayed offset in DR
| Metric | `tigergraph_crr_source_timestamp` | topic | Timestamp of source record in PR
| Metric | `tigergraph_crr_replicated_timestamp` | topic | Timestamp of replicated record in DR
| Metric | `tigergraph_crr_target_timestamp` | topic | Timestamp of target record in DR
| Metric | `tigergraph_crr_replayed_timestamp` | topic | Timestamp of replayed record in DR
|===

[NOTE]
====
TigerGraph provides raw metrics for CRR topic replication and replay, but does not include built-in alerting capabilities.
Replication and replay lag may vary depending on cluster configuration, network bandwidth, and latency. Users are advised to define alerting rules in their own monitoring systems (for example, Prometheus).
A CRR topic might be considered abnormal if, for example, any of the following conditions occur:

* `tigergraph_crr_replication_lag_records > 20`
* `tigergraph_crr_replay_lag_records > 20`
* `tigergraph_crr_replication_stalled_seconds > 60`
* `tigergraph_crr_replay_stalled_seconds > 60`

These thresholds are examples rather than strict rules; adjust them to match your environment and operational requirements.
====
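
If you scrape these metrics with Prometheus, an alerting rule can encode one of the example conditions above. The following is only a minimal sketch (the file path, group name, alert name, and `for` duration are hypothetical); adapt it to your own monitoring setup:

[source,console]
----
# Write an example Prometheus alerting rule that uses the documented metric
# and the illustrative 20-record threshold from the note above.
cat > /etc/prometheus/rules/tigergraph-crr.yml <<'EOF'
groups:
  - name: tigergraph-crr
    rules:
      - alert: CRRReplicationLagHigh
        expr: tigergraph_crr_replication_lag_records > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'CRR replication lag for topic {{ $labels.topic }} exceeds 20 records'
EOF
----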


modules/cluster-and-ha-management/pages/fail-over.adoc

= Fail over to the DR cluster
//:page-aliases: crr:fail-over.adoc


In the event of a catastrophic failure that impacts the entire cluster, such as a data center or region outage, you can initiate a failover to the Disaster Recovery (DR) cluster.
This is a manual process.

Starting from TigerGraph 4.3.0, failover to a Disaster Recovery (DR) cluster is simplified into a single command.
The new `--promote` flag for `gadmin crr stop` replaces the previous multi-step manual process, making DR promotion faster and more reliable.
Update existing failover scripts to use this new command.

[source,bash]
----
gadmin crr stop --promote --dump-checkpoint /path/to/checkpoint.json -y
----

[#key_behavior_changes]
== Key Behavior Changes and Enhancements

=== 1. Single Command Operation
The `gadmin crr stop --promote` command now orchestrates the entire promotion process.
It replaces the four legacy commands, shown below for reference, and automates the following actions:

* Stops the CRR connector.
* Updates the configuration parameter: `System.CrossRegionReplication.Enabled=false`
* Applies configuration changes and restarts dependent services.
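
For reference when updating existing failover scripts and runbooks, the deprecated multi-step sequence from earlier releases was:

[source,bash]
----
gadmin crr stop -y
gadmin config set System.CrossRegionReplication.Enabled false
gadmin config apply -y
gadmin restart -y
----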

=== 2. Automated Safety Checks for Data Consistency
Before stopping replication, the command automatically verifies synchronization between the DR and Primary clusters.

* If replication lag is detected, the system displays a warning indicating the DR cluster is not fully synchronized.
* Promoting a lagging DR cluster can result in data loss. If lag is detected, the command pauses and requests explicit confirmation before proceeding.
* If no lag exists, promotion continues automatically without user intervention.

=== 3. Integrated Replication Checkpointing
The `--dump-checkpoint` flag generates a replication checkpoint file that records precise replication offsets at the moment of promotion.

* When the DR cluster is fully synchronized, the checkpoint file is saved automatically to the specified path.
* This checkpoint enables a **fast-switch** operation when converting the old Primary cluster into a new DR cluster—avoiding a full backup and restore.
* See <<set-up-new-dr-cluster>> for instructions on using the checkpoint for a fast-switch.

=== 4. Forced Promotion for Emergency Scenarios
In urgent cases, administrators can use the `--force` flag along with `--promote` to bypass synchronization checks.

[source,bash]
----
gadmin crr stop --promote --force -y
----

Use this option only in genuine emergencies, for example when the Primary cluster is down and the DR cluster must be promoted immediately to maintain service continuity. Forced promotion may lead to data loss and requires a full backup and restore to re-establish the DR cluster.

[NOTE]
====
Users must update existing failover scripts and operational runbooks.
The older four-step failover process is now deprecated and should be replaced with the single `gadmin crr stop --promote` command.
This update improves safety, reduces operational complexity, and ensures more reliable failover operations.
====


[[set-up-new-dr-cluster]]
== Set Up a New DR Cluster After Failover
:description: Procedure for re-establishing high availability after DR promotion using fast-switch or backup restore.

=== Overview
TigerGraph v4.3 introduces a new **fast-switch** capability that allows administrators to quickly convert the former Primary cluster into a new DR cluster.
This method leverages the replication checkpoint generated during failover, reducing downtime and avoiding a full backup and restore.

After the DR cluster has been promoted to Primary, the former Primary cluster is offline.
To re-establish high availability, configure the old Primary as the new DR cluster.

=== Principle: Checkpoint Validation for Fast-Switch
The fast-switch mechanism depends on whether the DR cluster was fully synchronized with the Primary at the time of failover.
This is determined by validating the replication checkpoint file created during promotion.

[source,bash]
----
gadmin crr stop --promote --dump-checkpoint <path>
----

Validation outcomes:

* **Validation passes:**
Clusters were fully synchronized. A fast-switch occurs, converting the cluster to a DR instance instantly without data transfer.
* **Validation fails:**
The failover occurred while replication lag was present (for example, when `--force` was used, or when new data was written to the old Primary cluster after the failover). A fast-switch is not possible, and a full backup and restore is required.

=== Method 1: Fast-Switch Using a Replication Checkpoint (Recommended)

Use this method when:

* The failover was clean (no replication lag detected).
* A valid replication checkpoint file (e.g., `checkpoint.json`) was created successfully.

==== Steps

. **Configure the old Primary cluster as the new DR cluster.**
+
[source,bash]
----
# Set the IPs of the new Primary cluster
gadmin config set System.CrossRegionReplication.PrimaryKafkaIPs <new_primary_ip1,...>
# Set the Kafka port of the new Primary cluster
gadmin config set System.CrossRegionReplication.PrimaryKafkaPort <new_primary_kafka_port>
# Set the topic prefix. Add an additional ".Primary" suffix for each failover.
gadmin config set System.CrossRegionReplication.TopicPrefix <prefix>
# Apply configuration changes
gadmin config apply
----

. **Perform the fast-switch restore.**
+
[source,bash]
----
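# Restore using the replication checkpoint captured during promotion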
gadmin backup restore --dr --dr-checkpoint /path/to/checkpoint.json
----
+
If validation succeeds, the cluster is reconfigured as a DR instance and replication resumes from the exact checkpoint position.

There is no limit on the number of times a cluster can fail over to another cluster. Each time you designate a new DR cluster, set `System.CrossRegionReplication.TopicPrefix` correctly by appending an additional `.Primary`. For example, if the original cluster fails over once and the current cluster's `TopicPrefix` is `Primary`, the new DR cluster's `TopicPrefix` must be `Primary.Primary`; after another failover, it becomes `Primary.Primary.Primary`.
=== Method 2: Full Backup and Restore (Fallback Method)

Use this method when:

* The checkpoint validation fails (clusters were not synchronized).
* The checkpoint file is missing.
* The DR cluster is being deployed on new hardware.

==== Steps

. **Create a backup on the new Primary cluster.**
+
[source,bash]
----
gadmin backup create <backup_name>
----

. **Configure the new DR cluster.**
+
Follow the configuration steps in *Method 1* to point the DR cluster to the new Primary.

. **Restore the backup to the new DR cluster.**
+
[source,bash]
----
gadmin backup restore <backup_tag> --dr
----
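
After the restore completes, replication progress can be verified on the new DR cluster with the monitoring command described earlier, for example:

[source,bash]
----
gadmin crr status --realtime
----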
