Running workloads across multiple cloud providers and on-premises infrastructure is now ordinary. Regulatory requirements push organisations toward data residency controls. Cost optimisation motivates spreading compute across providers. Acquisition histories leave teams managing services on AWS, Azure, and GCP simultaneously, often with legacy systems still running in colocated data centres. The architectural reality is heterogeneous, and that heterogeneity creates a specific, difficult problem for observability. When a request crosses provider boundaries, the tooling that worked within a single cloud account breaks down. Traces fragment. Metrics live in incompatible formats. Logs scatter across vendor-specific stores with different query languages and retention policies.

This article examines the observability challenges unique to multi-cloud and hybrid environments: why these architectures make visibility harder, how trace propagation works (and fails) across providers, what it takes to aggregate metrics into a coherent picture, how to correlate logs across boundaries, what the real costs look like, and the organisational factors that determine whether a unified observability strategy actually holds. These concerns connect directly to earlier discussions of observability fundamentals, request timing across infrastructure, and the multi-cloud regulatory pressures shaping deployment decisions.

Why multi-cloud makes observability harder

Single-cloud environments benefit from integrated tooling. AWS CloudWatch, Azure Monitor, and GCP Cloud Operations each provide metrics, logs, and traces within their own ecosystem. The APIs are consistent, the naming conventions are uniform, and the data stays in one place. The moment you span two or more of these environments, that coherence evaporates.

Each provider uses different terminology for equivalent concepts. What AWS calls an "invocation" Azure might call an "execution." Metric names, label conventions, and aggregation intervals differ. A CPU utilisation metric from one provider might report at 60-second granularity while another reports at 5-second granularity. Aligning these for a unified dashboard is not a configuration toggle. It is an ongoing normalisation effort.

Vendor-specific tooling creates silos. An engineer troubleshooting a latency spike involving a service on GCP calling an API on AWS must switch between two different observability consoles, each with its own query language and authentication. Context switching between tools during an incident slows resolution and increases the chance of missing the root cause.

Network boundaries add further complexity. Cross-provider traffic traverses the public internet or dedicated interconnects, introducing latency and packet loss invisible to application-level observability unless explicitly instrumented.

Trace propagation across providers

Distributed tracing is the primary tool for understanding request flow across services. In a multi-cloud environment, the challenge is ensuring that trace context survives the journey between providers without being dropped, mangled, or restarted.

The W3C Trace Context specification provides a standardised header format (traceparent and tracestate) for propagating trace identity across service boundaries. OpenTelemetry implements this specification and has become the dominant instrumentation framework. In principle, if every service in the call chain uses W3C Trace Context headers, traces should connect seamlessly regardless of where each service runs.
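
The snippet below is a minimal sketch of that principle using the OpenTelemetry Python SDK: the outbound side injects the current context into HTTP headers, and the inbound side extracts it and continues the same trace. The service name, span names, and the commented-out HTTP call are illustrative, not part of any particular system.

```python
# Minimal sketch of W3C Trace Context propagation with the OpenTelemetry
# Python SDK (opentelemetry-api + opentelemetry-sdk). Names are illustrative.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Outbound call: inject the current trace context into the HTTP headers.
with tracer.start_as_current_span("call-inventory-api"):
    headers = {}
    inject(headers)  # adds the 'traceparent' (and 'tracestate') headers
    # http_client.get("https://inventory.example.com/stock", headers=headers)
    print(headers["traceparent"])  # e.g. 00-<trace-id>-<span-id>-01

# Inbound side: extract the caller's context and continue the same trace.
incoming_ctx = extract(headers)
with tracer.start_as_current_span("handle-stock-request", context=incoming_ctx):
    pass  # spans created here share the caller's trace ID
```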

In practice, several things break this.

  • Managed services strip headers: API gateways, load balancers, and managed messaging services sometimes drop or overwrite trace context headers. AWS API Gateway, for example, has historically required explicit configuration to forward custom headers. If a managed service sits at a provider boundary and strips the traceparent header, the trace breaks into two disconnected fragments.
  • Legacy services do not propagate: on-premises systems or older microservices that predate OpenTelemetry adoption may not propagate trace context at all. Instrumenting these services, even minimally, is essential. Without it, every request that passes through them creates a gap in the trace.
  • Sampling decisions diverge: if provider A's tracing system decides to sample a trace but provider B's system does not, the trace is incomplete. Consistent sampling decisions need to be made at the trace root and respected downstream; the sketch after this list shows how the root's decision travels in the traceparent flags. Head-based sampling is simpler to coordinate; tail-based sampling is more efficient but requires centralised decision-making that is hard to implement across providers.
  • Clock skew distorts timelines: span timestamps from different providers reflect different clock sources. Even with NTP synchronisation, clock skew of tens of milliseconds is common. When reconstructing a trace timeline, this skew can make it appear that a child span started before its parent, confusing analysis tooling and misleading engineers. The problems of [time and ordering](/notes/distributed-systems/time-clocks-and-ordering/) in distributed systems apply directly here.
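
As a concrete illustration of how the root's sampling decision is carried downstream, the following pure-Python sketch parses a traceparent header and reads the sampled bit from the trace-flags field. The example header is the one used in the W3C specification.

```python
# Pure-Python sketch: the sampled flag travels inside the traceparent header,
# so downstream services can honour the root's head-based sampling decision.
def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": int(flags, 16) & 0x01 == 0x01,  # bit 0 = sampled
    }

# Example header from an upstream service that sampled this trace.
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# A ParentBased sampler in the downstream SDK should respect ctx["sampled"]
# instead of making an independent decision.
print(ctx["sampled"])  # True
```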

The practical recommendation is to treat trace propagation as infrastructure. Define a standard (W3C Trace Context via OpenTelemetry), enforce it through shared libraries or service mesh sidecars, and verify continuously with synthetic transactions.
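
A synthetic check can be as simple as sending a request with a known traceparent and confirming that the downstream service saw the same trace ID. The sketch below assumes a hypothetical debug endpoint that echoes the headers it received; the URL and response shape are illustrative.

```python
# Hedged sketch of a synthetic propagation check. Assumes a hypothetical
# debug endpoint on each service that echoes the request headers it received.
import secrets
import requests

def check_propagation(echo_url: str) -> bool:
    trace_id = secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)     # 16 hex characters
    sent = f"00-{trace_id}-{span_id}-01"
    resp = requests.get(echo_url, headers={"traceparent": sent}, timeout=5)
    parts = resp.json().get("traceparent", "").split("-")
    # The downstream service should preserve our trace ID, even though it will
    # generate new span IDs for its own spans.
    return len(parts) >= 2 and parts[1] == trace_id

# Run periodically from each provider against services across the boundary.
# print(check_propagation("https://inventory.example.com/debug/headers"))
```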

Metric aggregation and normalisation

Metrics are the foundation of dashboards, alerts, and capacity planning. In a multi-cloud environment, the raw metrics from each provider need to be collected, normalised, and stored in a common system before they become useful for cross-provider analysis.

Normalisation involves several dimensions. Naming conventions must be unified: decide whether you call it cpu_utilisation or cpu_usage_percent and map all provider-specific names to your chosen convention. Units must be aligned: some providers report network throughput in bytes per second, others in bits per second. Aggregation intervals must be harmonised or at least accounted for when comparing values.
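
A minimal sketch of that normalisation step is shown below: a mapping from provider-specific metric names onto one canonical convention, plus unit alignment. The specific mappings and conversions are illustrative of the pattern rather than a complete catalogue.

```python
# Map provider-specific metric names onto one canonical naming convention.
PROVIDER_METRIC_MAP = {
    ("aws", "CPUUtilization"): "cpu_utilisation_percent",
    ("azure", "Percentage CPU"): "cpu_utilisation_percent",
    ("gcp", "compute.googleapis.com/instance/cpu/utilization"): "cpu_utilisation_percent",
}

def normalise(provider: str, name: str, value: float, unit: str) -> tuple[str, float]:
    """Return (canonical_name, value) with names and units aligned to our conventions."""
    canonical = PROVIDER_METRIC_MAP.get((provider, name), f"{provider}_{name}")
    if unit == "bits/s":                      # align throughput to bytes per second
        value = value / 8
    if provider == "gcp" and canonical.endswith("_percent"):
        value = value * 100                   # GCP reports CPU utilisation as a 0-1 ratio
    return canonical, value

print(normalise("gcp", "compute.googleapis.com/instance/cpu/utilization", 0.42, "ratio"))
# ('cpu_utilisation_percent', 42.0)
```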

OpenTelemetry's metric SDK provides a vendor-neutral collection layer. Services instrumented with OpenTelemetry export metrics in a common format, regardless of where they run. For infrastructure metrics (CPU, memory, disk, network) that come from the cloud providers themselves, you typically need a collector that pulls from each provider's monitoring API and re-exports in a normalised format. Prometheus with provider-specific exporters is one common pattern. The OpenTelemetry Collector with receiver plugins for CloudWatch, Azure Monitor, and GCP Monitoring is another.
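
In practice the Collector or a Prometheus exporter handles this pull-and-re-export loop; the Python sketch below only shows the shape of the pattern against the CloudWatch API, assuming AWS credentials are configured and re-using the normalised name from the previous example. The instance ID is illustrative.

```python
# Hedged sketch of the "pull and re-export" pattern for infrastructure metrics,
# using boto3 against CloudWatch. Requires configured AWS credentials.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")
end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # illustrative ID
    StartTime=end - timedelta(minutes=10),
    EndTime=end,
    Period=60,                     # request 60-second granularity
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    # Re-export each datapoint under the normalised name used by the central store.
    print("cpu_utilisation_percent", point["Timestamp"].isoformat(), point["Average"])
```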

The storage backend matters. A single time-series database (Prometheus, Thanos, Mimir, or a managed equivalent) holding metrics from all providers enables cross-provider queries and composite alerts. Running separate stores per provider defeats the purpose, but centralising metrics means shipping data across boundaries, which introduces egress costs. The placement of your aggregation layer is an architectural decision with cost implications.

Alerts should be tied to SLO-based indicators rather than raw resource metrics where possible, because SLOs abstract away provider-specific differences in instance types and metric semantics.
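
For example, an alert on error-budget burn rate works the same way regardless of which provider serves the requests. A minimal sketch, with illustrative numbers and a commonly used fast-burn threshold:

```python
# Error-budget burn rate for an availability SLO: how many times faster than
# "sustainable" the budget is being consumed. Numbers here are illustrative.
def burn_rate(error_ratio: float, slo_target: float = 0.999) -> float:
    budget = 1.0 - slo_target
    return error_ratio / budget

# 0.5% of requests failing against a 99.9% SLO burns the budget 5x too fast.
print(burn_rate(0.005))  # 5.0
# A common fast-burn alert: page if the one-hour burn rate exceeds ~14.4,
# i.e. 2% of a 30-day error budget consumed in a single hour.
```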

Log correlation across boundaries

Logs remain the most granular signal for debugging. They capture the details that metrics and traces summarise away. In a multi-cloud environment, the challenge is not just collecting logs from different sources but correlating them so that an engineer can follow a single request's journey through its entire lifecycle.

The minimum requirement is a shared correlation identifier. If every service stamps its log entries with the trace ID from the W3C Trace Context header, any log aggregation system can group log lines by request. This sounds simple. In practice, it requires discipline: every logging framework, in every language, across every team, must extract and include the trace ID. A single service that omits it creates a gap in the correlation chain.
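
In Python, one way to enforce this is a logging filter that copies the active span's IDs onto every record, shown as a minimal sketch below. The logger name and format string are illustrative.

```python
# Minimal sketch: stamp every log line with the active OpenTelemetry trace ID
# via a stdlib logging filter. Logger name and format are illustrative.
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        # Format IDs as the 32- and 16-char hex strings used in traceparent.
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("payments")      # hypothetical service logger
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.INFO)

logger.info("charge authorised")  # carries the trace ID when called inside a span
```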

Log formats should be structured. JSON-formatted logs with consistent field names (timestamp, service name, trace ID, span ID, log level, message) are far easier to aggregate and query than free-text logs with ad hoc formats. Schema enforcement at the collection layer, rejecting or transforming logs that do not conform, prevents the log store from becoming an unsearchable swamp.
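
Schema enforcement can be as small as a function at the collection layer that parses each line and rejects anything missing the agreed fields. A minimal sketch, with an illustrative required-field set:

```python
# Hedged sketch of schema enforcement at the collection layer: drop or route
# aside log records that lack the agreed fields. Field names are illustrative.
import json

REQUIRED_FIELDS = {"timestamp", "service", "trace_id", "level", "message"}

def conforms(raw_line: str) -> dict | None:
    """Parse a JSON log line; return the record if it matches the schema, else None."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None                 # free-text line: reject or route to a quarantine index
    if not REQUIRED_FIELDS.issubset(record):
        return None
    return record

print(conforms('{"timestamp": "2024-05-01T12:00:00Z", "service": "payments", '
               '"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "level": "info", '
               '"message": "charge authorised"}') is not None)  # True
```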

Centralisation is the next problem. Shipping logs from multiple providers into a single aggregation system (Elasticsearch, Loki, a managed service) means dealing with different export mechanisms: CloudWatch Logs on AWS, Diagnostic Settings on Azure, Cloud Logging on GCP. Each requires a different pipeline to forward logs to your central store. These pipelines must be monitored themselves. A broken export pipeline is invisible until someone needs those logs during an incident and discovers they stopped flowing weeks ago.
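
Monitoring the pipelines themselves can start with a freshness check: alert when the newest record from any source is older than a threshold. The sketch below assumes the per-source "latest timestamp" values come from querying the central log store; the source names and threshold are illustrative.

```python
# Minimal sketch: flag log export pipelines whose newest record is too old.
from datetime import datetime, timedelta, timezone

FRESHNESS_LIMIT = timedelta(minutes=15)

def stale_pipelines(latest_seen: dict[str, datetime]) -> list[str]:
    """Return the sources whose latest ingested record is older than the limit."""
    now = datetime.now(timezone.utc)
    return [src for src, ts in latest_seen.items() if now - ts > FRESHNESS_LIMIT]

# latest_seen would be populated by querying the central log store per source.
print(stale_pipelines({
    "aws-cloudwatch": datetime.now(timezone.utc) - timedelta(minutes=2),
    "azure-diagnostics": datetime.now(timezone.utc) - timedelta(hours=6),
}))  # ['azure-diagnostics']
```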

The cost of observability

Observability in multi-cloud environments is expensive, and the costs are not always obvious.

  • Data volume: more infrastructure means more metrics, more traces, more logs. A multi-cloud architecture that spans three providers generates roughly three times the observability data of a single-provider deployment, often more because of the additional instrumentation needed at boundaries.
  • Egress charges: shipping telemetry data across provider boundaries incurs network egress fees. These can be substantial at scale. A service generating 10 GB of logs per day that must be shipped from AWS to a central log store on GCP incurs daily egress charges that compound over time (a rough calculation follows this list).
  • Tool licensing: commercial observability platforms (Datadog, New Relic, Splunk, Dynatrace) charge per host, per GB ingested, or per span. Multi-cloud deployments increase the host count and data volume, scaling costs proportionally. The sticker shock when moving from single-cloud to multi-cloud observability is a common source of budget friction.
  • Engineering time: building and maintaining the normalisation, collection, and correlation pipelines described above is ongoing engineering work. It is not a one-time setup. Provider APIs change, new services are added, schemas evolve. Someone must keep the plumbing working.
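
To make the egress example above concrete, here is a rough calculation for a single service, assuming a purely illustrative rate of $0.09 per GB (actual rates vary by provider, tier, and network path):

```python
# Back-of-the-envelope egress cost for 10 GB of logs per day from one service,
# at an assumed (illustrative) rate of $0.09/GB. Multiply by the number of
# services shipping across the boundary for a fleet-wide figure.
gb_per_day = 10
rate_per_gb = 0.09
print(f"daily:   ${gb_per_day * rate_per_gb:.2f}")        # $0.90
print(f"monthly: ${gb_per_day * rate_per_gb * 30:.2f}")   # $27.00
print(f"yearly:  ${gb_per_day * rate_per_gb * 365:.2f}")  # $328.50
```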

Cost control strategies include aggressive trace sampling (keep 100% of error traces and a small percentage of successes), tiered log storage, and careful selection of which infrastructure metrics to collect. The goal is retaining enough data to debug incidents and track SLOs without paying for telemetry nobody queries.
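
With the OpenTelemetry SDK, the head-based part of that strategy is a one-line sampler configuration, shown as a minimal sketch below; keeping 100% of error traces additionally requires tail-based sampling (typically in a collector), which is not shown.

```python
# Minimal sketch: head-based ratio sampling with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 5% of new traces at the root; respect the caller's decision otherwise.
sampler = ParentBased(root=TraceIdRatioBased(0.05))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```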

Organisational challenges

The hardest parts of multi-cloud observability are not technical. They are organisational.

Ownership boundaries become ambiguous when a request spans teams that operate in different providers. If team A owns a service on AWS and team B owns a service on Azure, who is responsible for the observability of the boundary between them? Without clear ownership, the boundary becomes a blind spot. Defining explicit contracts for what telemetry each team emits, and what correlation identifiers they propagate, is essential.

Tooling standardisation requires political will. Teams accustomed to their provider's native tooling will resist adopting a common observability stack. The argument for standardisation is strongest when framed in terms of incident response: during an outage affecting cross-provider request paths, the on-call engineer cannot afford to learn a new query language. A single pane of glass, built on normalised data from all providers, reduces mean time to resolution. This connects to the broader principle that observability must serve the humans responding to incidents, not just generate dashboards for steady-state monitoring.

Alert routing must account for the fact that a single symptom may have its root cause in any provider. Triage workflows should direct responders to the unified observability layer first. If the default response to an alert is to open three different consoles and compare timestamps manually, the strategy has failed.

Skills and training are a persistent challenge. Engineers need to understand telemetry semantics from multiple providers and the query capabilities of the central platform. Cross-training pays off during incidents but is easy to defer.

Multi-cloud and hybrid architectures are here to stay. Without deliberate investment in trace propagation, metric normalisation, log correlation, and the organisational structures to support them, teams end up with fragmented visibility that fails precisely when it matters most.

Related notes