Description
Background
I have a multi-process Python service running on Cloud Run that exports custom OpenTelemetry metrics, and most of the time the exports succeed.
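For context, the counter behind the failing descriptor is created along these lines. The instrument name, unit, description, and attribute keys are taken from the descriptor dump in the error below; everything else (meter name, attribute values) is illustrative:

```python
from opentelemetry import metrics

meter = metrics.get_meter("webapp.v2.cr_metrics")

# Backs the "workload.googleapis.com/requests_total" descriptor in the
# error below; the attribute keys match the descriptor's label keys.
requests_total = meter.create_counter(
    "requests_total",
    unit="1",
    description="Total number of HTTP requests received",
)

requests_total.add(
    1,
    {
        "endpoint": "/api/v2/items",  # illustrative values
        "http_status_code": "200",
        "instance_id": "instance-0",
        "service_name": "webapp",
    },
)
```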
Error
From time to time (I'm testing with 1k+ requests, which spins up more than 10 instances) I get the following error:
```
2025-10-21 15:57:11,636 2 opentelemetry.exporter.cloud_monitoring ERROR Failed to create metric descriptor labels {
  key: "service_name"
}
labels {
  key: "instance_id"
}
labels {
  key: "http_status_code"
}
labels {
  key: "endpoint"
}
labels {
  key: "opentelemetry_id"
}
metric_kind: CUMULATIVE
value_type: INT64
unit: "1"
description: "Total number of HTTP requests received"
display_name: "requests_total"
type: "workload.googleapis.com/requests_total"
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 75, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/grpc/_interceptor.py", line 277, in __call__
    response, ignored_call = self._with_call(
  File "/opt/venv/lib/python3.12/site-packages/grpc/_interceptor.py", line 332, in _with_call
    return call.result(), call
  File "/opt/venv/lib/python3.12/site-packages/grpc/_channel.py", line 440, in result
    raise self
  File "/opt/venv/lib/python3.12/site-packages/grpc/_interceptor.py", line 315, in continuation
    response, call = self._thunk(new_method).with_call(
  File "/opt/venv/lib/python3.12/site-packages/grpc/_channel.py", line 1198, in with_call
    return _end_unary_response_blocking(state, call, True, None)
  File "/opt/venv/lib/python3.12/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2025-10-21T15:57:10.835572366+00:00"}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/opentelemetry/exporter/cloud_monitoring/__init__.py", line 248, in _get_metric_descriptor
    response_descriptor = self.client.create_metric_descriptor(
  File "/opt/venv/lib/python3.12/site-packages/google/cloud/monitoring_v3/services/metric_service/client.py", line 1386, in create_metric_descriptor
    response = rpc(
  File "/opt/venv/lib/python3.12/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/google/api_core/timeout.py", line 130, in func_with_timeout
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 77, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded

2025-10-21 15:57:31,136 2 webapp.v2.cr_metrics INFO Export successful on attempt 1
```
Current implementation
In the current implementation I'm using a PeriodicExportingMetricReader with a custom exporter wrapper:
```python
# Wrap with retry logic
exporter = RetryableCloudMonitoringExporter(
    base_exporter,
    initial_delay=4.0,
    max_delay=10,
    backoff_multiplier=2.0,
)

reader = PeriodicExportingMetricReader(
    exporter,
    export_interval_millis=60_000 + jitter_milis,  # export between 60s-120s
    export_timeout_millis=60_000,
)
```
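For completeness, here is a minimal sketch of how the pieces referenced above (`base_exporter`, `jitter_milis`) might be constructed. The project ID is a placeholder, and `add_unique_identifier=True` is an assumption inferred from the `opentelemetry_id` label in the descriptor dump above:

```python
import random

from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter

# add_unique_identifier=True adds the "opentelemetry_id" label seen in the
# descriptor dump, so the worker processes don't collide on the same
# Cloud Monitoring time series.
base_exporter = CloudMonitoringMetricsExporter(
    project_id="my-gcp-project",  # placeholder
    add_unique_identifier=True,
)

# Random per-process offset so the instances don't all export at once.
jitter_milis = random.randint(0, 60_000)
```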
The retry logic looks like this:
```python
def _export_with_retry(self, metrics_data, timeout_millis, attempt=0):
    try:
        result = self.base_exporter.export(metrics_data,
                                           timeout_millis=timeout_millis)
        if result == MetricExportResult.SUCCESS:
            logger.info("Export successful on attempt %d", attempt + 1)
            return result

        # Export returned failure status
        error_msg = f"Export returned failure status: {result}"
        logger.warning(error_msg)
        if attempt < self.max_retries:
            delay = min(self.initial_delay * (self.backoff_multiplier ** attempt),
                        self.max_delay)
            logger.warning(
                "Export failed (attempt %d/%d), retrying in %ds: %s",
                attempt + 1, self.max_retries + 1, delay, error_msg)
            time.sleep(delay)
            return self._export_with_retry(metrics_data, timeout_millis,
                                           attempt + 1)

        logger.error("Export failed after %d attempts: %s",
                     attempt + 1, error_msg)
        return result
    except Exception as e:  # pylint: disable=broad-except
        if attempt < self.max_retries and self._is_retryable_error(e):
            delay = min(self.initial_delay * (self.backoff_multiplier ** attempt),
                        self.max_delay)
            logger.warning(
                "Export failed with exception (attempt %d/%d), "
                "retrying in %ds: %s",
                attempt + 1, self.max_retries + 1, delay, e)
            time.sleep(delay)
            return self._export_with_retry(metrics_data, timeout_millis, attempt + 1)

        logger.error("Export failed with exception after %d attempts: %s",
                     attempt + 1, e)
        raise
```
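For reference, a minimal sketch of the wrapper class around that method, assuming `max_retries` defaults to 3 and that `_is_retryable_error` checks for transient gRPC statuses (neither is shown above, so both are assumptions):

```python
import logging
import time

from google.api_core.exceptions import (
    Aborted,
    DeadlineExceeded,
    InternalServerError,
    ServiceUnavailable,
)
from opentelemetry.sdk.metrics.export import MetricExportResult

logger = logging.getLogger(__name__)


class RetryableCloudMonitoringExporter:
    """Delegating wrapper that retries transient export failures."""

    def __init__(self, base_exporter, initial_delay=4.0, max_delay=10.0,
                 backoff_multiplier=2.0, max_retries=3):
        self.base_exporter = base_exporter
        self.initial_delay = initial_delay
        self.max_delay = max_delay
        self.backoff_multiplier = backoff_multiplier
        self.max_retries = max_retries

    def export(self, metrics_data, timeout_millis=10_000, **kwargs):
        return self._export_with_retry(metrics_data, timeout_millis)

    # _export_with_retry as shown above

    def _is_retryable_error(self, exc):
        # Transient gRPC statuses worth retrying; anything else is re-raised.
        return isinstance(
            exc, (Aborted, DeadlineExceeded, InternalServerError, ServiceUnavailable)
        )

    def force_flush(self, timeout_millis=10_000):
        return self.base_exporter.force_flush(timeout_millis)

    def shutdown(self, timeout_millis=30_000, **kwargs):
        self.base_exporter.shutdown(timeout_millis=timeout_millis, **kwargs)

    def __getattr__(self, name):
        # PeriodicExportingMetricReader reads attributes such as
        # _preferred_temporality from the exporter it wraps; forward
        # anything not defined here to the underlying exporter.
        return getattr(self.base_exporter, name)
```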
One issue I'm having is that I cannot rely on the retry logic: the 504 Deadline Exceeded shows up in the logs while export() still returns MetricExportResult.SUCCESS, so my wrapper never retries. This is not happening during shutdown; it happens while requests are being processed.
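That matches what the traceback suggests: the descriptor-creation step inside the exporter catches the exception itself, so it never propagates to export(). Paraphrased from the traceback and the "Failed to create metric descriptor" log line above (not the verbatim library source; `build_descriptor` is a hypothetical stand-in), the flow appears to be roughly:

```python
# Inside CloudMonitoringMetricsExporter (paraphrased, not verbatim):
def _get_metric_descriptor(self, record):
    descriptor = build_descriptor(record)  # hypothetical helper for the sketch
    try:
        response_descriptor = self.client.create_metric_descriptor(
            name=self.project_name, metric_descriptor=descriptor
        )
    except Exception:
        # The DeadlineExceeded above lands here: it is logged ("Failed to
        # create metric descriptor ...") and swallowed, so export() never
        # sees it and still returns MetricExportResult.SUCCESS.
        logger.exception("Failed to create metric descriptor %s", descriptor)
        return None
```

If that reading is right, a retry wrapper around export() can never catch this particular failure, because it never leaves the exporter.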
One more thing: the metrics themselves seem to be recorded correctly.