OpenTelemetry CloudMonitoringMetricsExporter error with Cloudrun #438

@raducodescu

Background
I have a multi-process Python service running on Cloud Run that emits custom OpenTelemetry metrics. Most of the time they are exported successfully.

Error
From time to time (I'm testing with 1k+ requests, which spin up more than 10 instances) I get the following error:

2025-10-21 15:57:11,636 2     opentelemetry.exporter.cloud_monitoring ERROR Failed to create metric descriptor labels {
  key: "service_name"
}
labels {
  key: "instance_id"
}
labels {
  key: "http_status_code"
}
labels {
  key: "endpoint"
}
labels {
  key: "opentelemetry_id"
}
metric_kind: CUMULATIVE
value_type: INT64
unit: "1"
description: "Total number of HTTP requests received"
display_name: "requests_total"
type: "workload.googleapis.com/requests_total"
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 75, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/grpc/_interceptor.py", line 277, in __call__
    response, ignored_call = self._with_call(
  File "/opt/venv/lib/python3.12/site-packages/grpc/_interceptor.py", line 332, in _with_call
    return call.result(), call
  File "/opt/venv/lib/python3.12/site-packages/grpc/_channel.py", line 440, in result
    raise self
  File "/opt/venv/lib/python3.12/site-packages/grpc/_interceptor.py", line 315, in continuation
    response, call = self._thunk(new_method).with_call(
  File "/opt/venv/lib/python3.12/site-packages/grpc/_channel.py", line 1198, in with_call
    return _end_unary_response_blocking(state, call, True, None)
  File "/opt/venv/lib/python3.12/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2025-10-21T15:57:10.835572366+00:00"}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/opentelemetry/exporter/cloud_monitoring/__init__.py", line 248, in _get_metric_descriptor
    response_descriptor = self.client.create_metric_descriptor(
  File "/opt/venv/lib/python3.12/site-packages/google/cloud/monitoring_v3/services/metric_service/client.py", line 1386, in create_metric_descriptor
    response = rpc(
  File "/opt/venv/lib/python3.12/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/google/api_core/timeout.py", line 130, in func_with_timeout
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 77, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded
2025-10-21 15:57:31,136 2     webapp.v2.cr_metrics INFO  Export successful on attempt 1

Current implementation
In the current implementation I'm using a PeriodicExportingMetricReader with a custom exporter:

# Wrap with retry logic
exporter = RetryableCloudMonitoringExporter(
    base_exporter,
    initial_delay=4.0,
    max_delay=10,
    backoff_multiplier=2.0,
)
reader = PeriodicExportingMetricReader(
    exporter,
    export_interval_millis=60_000 + jitter_milis,  # export between 60s-120s
    export_timeout_millis=60_000,
)
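
The jitter computation is not shown above. A minimal sketch of one way to produce it, assuming a uniform random offset of up to 60 s so that instances started together do not all export at the same moment. `export_interval_with_jitter` and the two constants are hypothetical names, not part of OpenTelemetry:

```python
import random

BASE_INTERVAL_MILLIS = 60_000
MAX_JITTER_MILLIS = 60_000  # assumption: matches the "60s-120s" comment above


def export_interval_with_jitter(rng=random):
    """Return an export interval in [60_000, 120_000] ms (hypothetical helper)."""
    # randint is inclusive on both ends.
    jitter_millis = rng.randint(0, MAX_JITTER_MILLIS)
    return BASE_INTERVAL_MILLIS + jitter_millis
```

The returned value would be passed as `export_interval_millis`. Because it is drawn once per process, each instance keeps its own fixed interval rather than re-randomizing on every export.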

The retry logic is the following:

def _export_with_retry(self, metrics_data, timeout_millis, attempt=0):
    try:
        result = self.base_exporter.export(metrics_data,
                                           timeout_millis=timeout_millis)
        if result == MetricExportResult.SUCCESS:
            logger.info("Export successful on attempt %d", attempt + 1)
            return result
        # Export returned failure status
        error_msg = f"Export returned failure status: {result}"
        logger.warning(error_msg)
        if attempt < self.max_retries:
            delay = min(self.initial_delay * (self.backoff_multiplier ** attempt),
                        self.max_delay)
            logger.warning(
                "Export failed (attempt %d/%d), retrying in %ds: %s",
                attempt + 1, self.max_retries + 1, delay, error_msg)
            time.sleep(delay)
            return self._export_with_retry(metrics_data, timeout_millis,
                                           attempt + 1)
        logger.error("Export failed after %d attempts: %s",
                     attempt + 1, error_msg)
        return result
    except Exception as e:  # pylint: disable=broad-except
        if attempt < self.max_retries and self._is_retryable_error(e):
            delay = min(self.initial_delay * (self.backoff_multiplier ** attempt),
                        self.max_delay)
            logger.warning(
                "Export failed with exception (attempt %d/%d ), "
                "retrying in %ds: %s",
                attempt + 1, self.max_retries + 1, delay, e)
            time.sleep(delay)
            return self._export_with_retry(metrics_data, timeout_millis,
                                           attempt + 1)
        logger.error("Export failed with exception after %d attempts: %s",
                     attempt + 1, e)
        raise
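
To make the delay formula concrete, here is a small sketch that evaluates it for the parameters passed to the wrapper (`initial_delay=4.0`, `backoff_multiplier=2.0`, `max_delay=10`). The value of `max_retries` is not shown in this issue, so the default of 3 below is only an assumption, and `backoff_delays` is a hypothetical helper:

```python
def backoff_delays(initial_delay=4.0, backoff_multiplier=2.0,
                   max_delay=10, max_retries=3):
    """Return the sleep (in seconds) applied before each retry.

    Same formula as in _export_with_retry: exponential growth from
    initial_delay, capped at max_delay. max_retries=3 is an assumed value.
    """
    return [
        min(initial_delay * (backoff_multiplier ** attempt), max_delay)
        for attempt in range(max_retries)
    ]


delays = backoff_delays()
total_sleep_seconds = sum(delays)
```

With these inputs the delays are 4 s, 8 s, and then 10 s once the cap is reached, for 22 s of sleeping in total. That sleep budget comes on top of the time each export attempt itself takes, which is worth comparing against the 60 s `export_timeout_millis`.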

One issue I'm having is that I cannot rely on the retry, because I see the 504 Deadline Exceeded error while the result is MetricExportResult.SUCCESS. This does not happen during shutdown; it happens while processing requests.
One more thing: the metrics seem to be recorded correctly.
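
Since the failure only shows up as an ERROR record on the `opentelemetry.exporter.cloud_monitoring` logger (the name in the log above) while the export result is still SUCCESS, one possible workaround is to watch that logger during the export call. This is a sketch under that assumption, using only the standard `logging` module. `ErrorCountingHandler` and `export_and_count_errors` are hypothetical names, not part of the exporter:

```python
import logging

# Logger name taken from the log line in this issue.
EXPORTER_LOGGER_NAME = "opentelemetry.exporter.cloud_monitoring"


class ErrorCountingHandler(logging.Handler):
    """Counts ERROR-or-worse records on the logger it is attached to."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.error_count = 0

    def emit(self, record):
        self.error_count += 1


def export_and_count_errors(export_fn, handler,
                            logger_name=EXPORTER_LOGGER_NAME):
    """Call export_fn() and return (result, errors logged during the call)."""
    exporter_logger = logging.getLogger(logger_name)
    exporter_logger.addHandler(handler)
    before = handler.error_count
    try:
        result = export_fn()
    finally:
        # Always detach so handlers do not pile up across export cycles.
        exporter_logger.removeHandler(handler)
    return result, handler.error_count - before
```

In the retry wrapper, `export_fn` could be a lambda around `self.base_exporter.export(...)`, and a non-zero error count together with SUCCESS could be treated as a partial failure. Whether retrying is appropriate in that case is a separate question, because the data points themselves seem to be written correctly.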
