modules/cluster-and-ha-management/pages/crr-monitoring.adoc

= Monitor Cross-Region Replication (CRR)
//:page-aliases: crr:monitor-crr.adoc
:description: Monitor Cross-Region Replication (CRR) status and metrics collection in TigerGraph.
:sectnums:

Starting from version 4.3, TigerGraph Disaster Recovery (DR) clusters support monitoring Cross-Region Replication (CRR) status and metrics. This enhancement enables administrators to track replication and replay progress, identify potential delays, and integrate monitoring data with external observability systems.



The CRR metrics collection interval can be configured using the following command (default: `60s`):

[source,console]
----
gadmin config entry System.Metrics.CRRIntervalSec
----
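
Alternatively, the value can be set non-interactively with `gadmin config set` followed by `gadmin config apply` (a sketch only; the `120` below is an example value, and you should confirm the expected value format with `gadmin config entry`):

[source,console]
----
# Example: collect CRR metrics every 120 seconds
gadmin config set System.Metrics.CRRIntervalSec 120
gadmin config apply -y
----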

[#cli_status_query]
== CLI-based CRR status query

TigerGraph introduces a new `gadmin` sub-command to display detailed CRR metrics.

[source,console]
----
gadmin crr status [flags]
----

Displays comprehensive CRR status information, including:

* Overall CRR system status
* Topic replication and replay lag
* MirrorMaker 2 connector status

=== Flags

|===
| Flag | Description

| `--json` | Outputs CRR status in JSON format.
| `--realtime` | Retrieves real-time status instead of cached data.
| `-v`, `--verbose` | Displays detailed metrics, including source and target offsets and timestamps.
|===
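
For example, to retrieve real-time status with full offset and timestamp detail in JSON (an illustrative combination of the flags above):

[source,console]
----
gadmin crr status --realtime --json --verbose
----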


[#topic_metrics]
=== Topic Metrics Definition


|===
| Field Name | Description | Notes | Verbose only

| `replicationLagRecords` | Number of records the DR topic is behind the corresponding PR topic. | Calculated as `sourceOffset - replicatedOffset` | false
| `replayLagRecords` | Number of records the replay process in DR is behind the replicated topic. | Calculated as `targetOffset - replayedOffset` | false
| `replicationLagSeconds` | Time lag in seconds between the latest record in PR and its replication in DR. | Calculated as `sourceTimestamp - replicatedTimestamp` | false
| `replayLagSeconds` | Time lag in seconds between the latest record in DR and its replay. | Calculated as `targetTimestamp - replayedTimestamp` | false
| `replicationStalledSeconds` | Duration in seconds that replication from PR to DR has made no forward progress. | 0 means replication is actively progressing. | false
| `replayStalledSeconds` | Duration in seconds that replay within DR has made no forward progress. | 0 means replay is actively progressing. | false
| `sourceOffset` | Latest offset available in the PR Kafka cluster. | Highest offset produced in the PR topic. | true
| `replicatedOffset` | Offset up to which data has been replicated from PR to DR. | Highest offset confirmed written to the DR topic. | true
| `targetOffset` | Latest offset available in the DR Kafka topic. | Highest offset physically present in the DR topic. | true
| `replayedOffset` | Offset up to which the replay service in DR has processed records. | Highest offset successfully replayed in DR. | true
| `sourceTimestamp` | Timestamp of the record at `sourceOffset` in PR. | Based on Kafka message timestamp. | true
| `replicatedTimestamp` | Timestamp of the record at `replicatedOffset` in DR. | Based on Kafka message timestamp. | true
| `targetTimestamp` | Timestamp of the record at `targetOffset` in DR. | Based on Kafka message timestamp. | true
| `replayedTimestamp` | Timestamp of the record at `replayedOffset` in DR. | Based on Kafka message timestamp. | true
|===
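
For example (hypothetical values): if `sourceOffset` is 1250 and `replicatedOffset` is 1230, then `replicationLagRecords` is 20; if `sourceTimestamp` is `10:00:05` and `replicatedTimestamp` is `10:00:00`, then `replicationLagSeconds` is 5.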

[#openmetrics_integration]
== OpenMetrics Integration

CRR topic metrics can also be collected in **OpenMetrics** format using the following API:

[source,console]
----
curl http://localhost:14240/informant/metrics
----
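
To view only the CRR metrics, the output can be filtered with standard shell tools, for example:

[source,console]
----
curl -s http://localhost:14240/informant/metrics | grep tigergraph_crr_
----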

=== OpenMetrics Definition

|===
| Type | Name | Label | Description

| Metric | `tigergraph_crr_replication_lag_records` | topic | Replication lag (records)
| Metric | `tigergraph_crr_replay_lag_records` | topic | Replay lag (records)
| Metric | `tigergraph_crr_replication_lag_seconds` | topic | Replication lag (seconds)
| Metric | `tigergraph_crr_replay_lag_seconds` | topic | Replay lag (seconds)
| Metric | `tigergraph_crr_replication_stalled_seconds` | topic | Duration replication has stalled
| Metric | `tigergraph_crr_replay_stalled_seconds` | topic | Duration replay has stalled
| Metric | `tigergraph_crr_source_offset` | topic | Source offset in PR
| Metric | `tigergraph_crr_replicated_offset` | topic | Replicated offset in DR
| Metric | `tigergraph_crr_target_offset` | topic | Target offset in DR topic
| Metric | `tigergraph_crr_replayed_offset` | topic | Replayed offset in DR
| Metric | `tigergraph_crr_source_timestamp` | topic | Timestamp of source record in PR
| Metric | `tigergraph_crr_replicated_timestamp` | topic | Timestamp of replicated record in DR
| Metric | `tigergraph_crr_target_timestamp` | topic | Timestamp of target record in DR
| Metric | `tigergraph_crr_replayed_timestamp` | topic | Timestamp of replayed record in DR
|===

[NOTE]
====
TigerGraph provides raw metrics for CRR topic replication and replay, but does not include built-in alerting capabilities.
Replication and replay lag may vary depending on cluster configuration, network bandwidth, and latency. Users are advised to define alerting rules in their own monitoring systems (for example, Prometheus).
A CRR topic might be considered abnormal if, for example, any of the following conditions occur:

* `tigergraph_crr_replication_lag_records > 20`
* `tigergraph_crr_replay_lag_records > 20`
* `tigergraph_crr_replication_stalled_seconds > 60`
* `tigergraph_crr_replay_stalled_seconds > 60`

These thresholds are examples rather than strict rules; adjust them to match your environment and operational requirements.
====
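
If you scrape these metrics with Prometheus, an alerting rule can encode one of the example conditions above. The following is only a minimal sketch (the file path, group name, alert name, and `for` duration are hypothetical); adapt it to your own monitoring setup:

[source,console]
----
# Write an example Prometheus alerting rule that uses the documented metric
# and the illustrative 20-record threshold from the note above.
cat > /etc/prometheus/rules/tigergraph-crr.yml <<'EOF'
groups:
  - name: tigergraph-crr
    rules:
      - alert: CRRReplicationLagHigh
        expr: tigergraph_crr_replication_lag_records > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'CRR replication lag for topic {{ $labels.topic }} exceeds 20 records'
EOF
----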


modules/cluster-and-ha-management/pages/fail-over.adoc

= Fail over to the DR cluster
//:page-aliases: crr:fail-over.adoc


In the event of a catastrophic failure that impacts the entire cluster, such as a data center or region outage, you can initiate a failover to the Disaster Recovery (DR) cluster.
This is a manual process.

Starting from TigerGraph 4.3.0, failover to a Disaster Recovery (DR) cluster is simplified into a single command.
The new `--promote` flag for `gadmin crr stop` replaces the previous multi-step manual process, making DR promotion faster and more reliable.
Update existing failover scripts to use this new command.

[source,bash]
----
gadmin crr stop --promote --dump-checkpoint /path/to/checkpoint.json -y
----

[#key_behavior_changes]
== Key Behavior Changes and Enhancements

=== 1. Single Command Operation
The `gadmin crr stop --promote` command now orchestrates the entire promotion process.
It replaces the four legacy commands, shown below for reference, and automates the following actions:

* Stops the CRR connector.
* Updates the configuration parameter: `System.CrossRegionReplication.Enabled=false`
* Applies configuration changes and restarts dependent services.
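
For reference when updating existing failover scripts and runbooks, the deprecated multi-step sequence from earlier releases was:

[source,bash]
----
gadmin crr stop -y
gadmin config set System.CrossRegionReplication.Enabled false
gadmin config apply -y
gadmin restart -y
----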

=== 2. Automated Safety Checks for Data Consistency
Before stopping replication, the command automatically verifies synchronization between the DR and Primary clusters.

* If replication lag is detected, the system displays a warning indicating the DR cluster is not fully synchronized.
* Promoting a lagging DR cluster can result in data loss. If lag is detected, the command pauses and requests explicit confirmation before proceeding.
* If no lag exists, promotion continues automatically without user intervention.

=== 3. Integrated Replication Checkpointing
The `--dump-checkpoint` flag generates a replication checkpoint file that records precise replication offsets at the moment of promotion.

* When the DR cluster is fully synchronized, the checkpoint file is saved automatically to the specified path.
* This checkpoint enables a **fast-switch** operation when converting the old Primary cluster into a new DR cluster—avoiding a full backup and restore.
* See <<set-up-new-dr-cluster>> for instructions on using the checkpoint for a fast-switch.

=== 4. Forced Promotion for Emergency Scenarios
In urgent cases, administrators can use the `--force` flag along with `--promote` to bypass synchronization checks.

[source,bash]
----
gadmin crr stop --promote --force -y
----

Use this option only in genuine emergencies, for example when the Primary cluster is down and the DR cluster must be promoted immediately to maintain service continuity. Forced promotion may lead to data loss and requires a full backup and restore to re-establish the DR cluster.

[NOTE]
====
Users must update existing failover scripts and operational runbooks.
The older four-step failover process is now deprecated and should be replaced with the single `gadmin crr stop --promote` command.
This update improves safety, reduces operational complexity, and ensures more reliable failover operations.
====


[[set-up-new-dr-cluster]]
== Set Up a New DR Cluster After Failover
:description: Procedure for re-establishing high availability after DR promotion using fast-switch or backup restore.

=== Overview
TigerGraph v4.3 introduces a new **fast-switch** capability that allows administrators to quickly convert the former Primary cluster into a new DR cluster.
This method leverages the replication checkpoint generated during failover, reducing downtime and avoiding a full backup and restore.

After the DR cluster has been promoted to Primary, the former Primary cluster is offline.
To re-establish high availability, configure the old Primary as the new DR cluster.

=== Principle: Checkpoint Validation for Fast-Switch
The fast-switch mechanism depends on whether the DR cluster was fully synchronized with the Primary at the time of failover.
This is determined by validating the replication checkpoint file created during promotion.

[source,bash]
----
gadmin crr stop --promote --dump-checkpoint <path>
----

Validation outcomes:

* **Validation passes:**
Clusters were fully synchronized. A fast-switch occurs, converting the cluster to a DR instance instantly without data transfer.
* **Validation fails:**
The failover occurred while replication lag was present (for example, when `--force` was used, or when new data was written to the old Primary cluster after the failover). A fast-switch is not possible, and a full backup and restore is required.

=== Method 1: Fast-Switch Using a Replication Checkpoint (Recommended)

Use this method when:

* The failover was clean (no replication lag detected).
* A valid replication checkpoint file (e.g., `checkpoint.json`) was created successfully.

==== Steps

. **Configure the old Primary cluster as the new DR cluster.**
+
[source,bash]
----
# Set the IPs of the new Primary cluster
gadmin config set System.CrossRegionReplication.PrimaryKafkaIPs <new_primary_ip1,...>
# Set the Kafka port of the new Primary cluster
gadmin config set System.CrossRegionReplication.PrimaryKafkaPort <new_primary_kafka_port>
# Set the topic prefix. Add an additional ".Primary" suffix for each failover.
gadmin config set System.CrossRegionReplication.TopicPrefix <prefix>
# Apply configuration changes
gadmin config apply
----

. **Perform the fast-switch restore.**
+
[source,bash]
----
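# Restore using the replication checkpoint captured during promotion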
gadmin backup restore --dr --dr-checkpoint /path/to/checkpoint.json
----
+
If validation succeeds, the cluster is reconfigured as a DR instance and replication resumes from the exact checkpoint position.

There is no limit on the number of times a cluster can fail over to another cluster. Each time you designate a new DR cluster, set `System.CrossRegionReplication.TopicPrefix` correctly by appending an additional `.Primary`. For example, if the original cluster fails over once and the current cluster's `TopicPrefix` is `Primary`, the new DR cluster's `TopicPrefix` must be `Primary.Primary`; after another failover, it becomes `Primary.Primary.Primary`.
=== Method 2: Full Backup and Restore (Fallback Method)

Use this method when:

* The checkpoint validation fails (clusters were not synchronized).
* The checkpoint file is missing.
* The DR cluster is being deployed on new hardware.

==== Steps

. **Create a backup on the new Primary cluster.**
+
[source,bash]
----
gadmin backup create <backup_name>
----

. **Configure the new DR cluster.**
+
Follow the configuration steps in *Method 1* to point the DR cluster to the new Primary.

. **Restore the backup to the new DR cluster.**
+
[source,bash]
----
gadmin backup restore <backup_tag> --dr
----
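
After the restore completes, replication progress can be verified on the new DR cluster with the monitoring command described earlier, for example:

[source,bash]
----
gadmin crr status --realtime
----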
