Data duplication is one of the most persistent sources of cost, inconsistency, and operational complexity in modern data systems. Every copy of a dataset is a liability: it drifts out of sync, consumes storage, requires its own access controls, and creates another surface for compliance exposure. Composable and zero-copy architectures attempt to address this by keeping data in place and bringing computation to it, rather than the reverse. For distributed systems, this is significant because it changes the coupling patterns between services and introduces new consistency, ordering, and observability challenges.

This article examines what zero-copy means in practice, why data duplication persists despite its costs, composability as a design principle for data infrastructure, the trade-offs these architectures introduce, their actual effect on vendor lock-in, and the observability challenges that come with composable data flows.

What zero-copy means in practice

The term "zero-copy" in the context of data architecture does not mean that no bytes are ever duplicated in memory or on disk. It means that the architectural pattern avoids creating persistent, redundant copies of datasets as they move between processing stages. The data stays in one place. Different systems read from it, compute over it, and write results alongside it, but they do not ingest it into their own proprietary storage.

The technical foundations for this have matured considerably. Apache Arrow provides a columnar, language-independent memory format that allows different tools to operate on the same in-memory data without serialisation and deserialisation overhead. Open table formats like Apache Iceberg, Delta Lake, and Apache Hudi define metadata layers on top of object storage that support transactional semantics, schema evolution, and time travel queries. These formats are designed so that multiple engines (a Spark job, a Trino query, a Pandas script) can read the same physical files.
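The byte-level intuition behind reading shared data without per-consumer copies can be sketched with Python's standard library alone: memory-mapped files and memoryviews let several readers reference the same bytes directly. This is only an analogy for what Arrow does at the memory-format level, not Arrow's actual API.

```python
import mmap
import os
import tempfile

# Write a dataset to disk once.
path = os.path.join(tempfile.mkdtemp(), "dataset.bin")
with open(path, "wb") as f:
    f.write(b"\x01\x02\x03\x04" * 1024)  # 4 KiB of sample bytes

# Two independent "readers" map the same file into memory.
# memoryview slices reference the mapped pages directly: no
# per-reader copy of the payload is materialised.
with open(path, "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view_a = memoryview(buf)[:4]   # reader A sees the first record
    view_b = memoryview(buf)[4:8]  # reader B sees the second record
    assert view_a.tobytes() == b"\x01\x02\x03\x04"
    assert view_b.tobytes() == b"\x01\x02\x03\x04"
    view_a.release()
    view_b.release()
    buf.close()
```

The same principle, applied at the level of a standardised columnar memory layout, is what lets different engines exchange Arrow data without serialising it.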

Data lakehouses extend this further by combining the flexibility of data lakes with the management features traditionally associated with data warehouses. The key architectural property is that the storage layer is decoupled from the compute layer. You store data once in an open format. You attach whatever compute engine is appropriate for the task. You avoid the ETL pipelines that historically existed solely to move data from one system's format into another's.
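The store-once, attach-any-compute idea can be sketched in a few lines of standard-library Python. Here a CSV file stands in for an open table format, and two plain functions stand in for a batch engine and an interactive engine; both are hypothetical names for illustration.

```python
import csv
import os
import tempfile

# Store the data once, in an open format (CSV here as a stand-in
# for Parquet/Iceberg), on storage every "engine" can reach.
path = os.path.join(tempfile.mkdtemp(), "orders.csv")
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["region", "amount"])
    w.writerows([["eu", "10"], ["us", "25"], ["eu", "5"]])

def batch_engine(p):
    """Stand-in for a batch engine: total over a full scan."""
    with open(p, newline="") as f:
        return sum(int(r["amount"]) for r in csv.DictReader(f))

def interactive_engine(p, region):
    """Stand-in for an interactive engine: a filtered aggregate."""
    with open(p, newline="") as f:
        return sum(int(r["amount"]) for r in csv.DictReader(f)
                   if r["region"] == region)

# Both engines compute over the same physical file; neither
# ingests the data into a store of its own.
assert batch_engine(path) == 40
assert interactive_engine(path, "eu") == 15
```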

This is not just an optimisation. It is a structural change in how data systems are composed.

Why data duplication persists

If duplication is so costly, why does it remain widespread? The answer is mostly historical and organisational, not technical.

ETL pipelines were originally necessary because different systems genuinely could not read each other's data formats. A relational database stored data in row-oriented pages. An analytics engine needed columnar storage. A machine learning framework expected flat files or specific binary formats. Moving data between these systems required transformation and copying. The pipeline was the bridge.

Those technical constraints have largely been resolved by open formats, but the organisational patterns they created persist. Teams build pipelines because that is how things have always been done. Data engineering roles are defined around pipeline construction and maintenance. Vendor products are sold on the basis of ingesting data into their proprietary format for optimal performance. Each ingestion is another copy.

There are also legitimate reasons for some duplication. Caching data closer to compute reduces read latency. Materialised views and pre-aggregated tables serve queries faster than scanning raw data. Some regulatory frameworks require data to be stored in specific jurisdictions, which may necessitate regional copies, a topic explored in detail in multi-cloud strategies under regulatory pressures. The goal of zero-copy architecture is not to eliminate every copy. It is to eliminate the unnecessary ones.

The distinction matters. A cache is a deliberate, managed copy with a defined invalidation strategy. A copy created because two teams independently ingest the same source into different warehouses is waste.
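The difference between a managed copy and waste can be made concrete. A minimal sketch of a deliberate copy is a cache whose entries carry an explicit expiry, so staleness is bounded by policy rather than by accident; the class and its parameters below are illustrative.

```python
import time

class TTLCache:
    """A deliberate, managed copy: every entry carries an explicit
    expiry, so staleness is bounded by policy, not by accident."""

    def __init__(self, ttl_seconds, loader):
        self.ttl = ttl_seconds
        self.loader = loader   # reads from the single source of truth
        self._entries = {}     # key -> (value, expires_at)

    def get(self, key):
        value, expires_at = self._entries.get(key, (None, 0.0))
        if time.monotonic() >= expires_at:        # invalidation rule
            value = self.loader(key)              # refresh from source
            self._entries[key] = (value, time.monotonic() + self.ttl)
        return value

# Usage: the source stays authoritative; the cache is disposable.
source = {"users": 3}
cache = TTLCache(ttl_seconds=60, loader=source.get)
assert cache.get("users") == 3
source["users"] = 4              # source changes...
assert cache.get("users") == 3   # ...cache serves within its TTL
```

The unmanaged duplicate has no equivalent of the invalidation rule: nothing bounds how far it drifts from the source.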

Composability as a design principle

Composability in data architecture means that components can be assembled, replaced, and recombined without rewriting the systems that depend on them. It is the same principle that makes Unix pipes powerful: small tools that do one thing well, connected through a standard interface.

In practice, composable data architectures rely on several building blocks.

  • Open interchange formats: Arrow, Parquet, Iceberg, and similar formats define the interface between components. Any engine that can read Iceberg tables can participate in the architecture without custom integration.
  • Catalog services: A shared metadata catalog (such as Apache Polaris, Hive Metastore, or Unity Catalog) provides a single registry of datasets, schemas, and access policies. Components discover data through the catalog rather than through hardcoded paths.
  • Pluggable compute: The compute layer is not permanently attached to the storage layer. You can run a Spark cluster for batch processing, Trino for interactive queries, and a Python process for ad hoc analysis, all against the same underlying data.
  • Declarative access control: Permissions are defined at the catalog level and enforced consistently regardless of which compute engine is reading the data. This is essential because composability without unified access control is a security gap.
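The catalog and access-control building blocks above can be sketched together: a registry that resolves dataset names to physical locations and enforces read policy at resolution time, so no engine needs a hardcoded path or its own permission model. The class, dataset name, and principals are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    path: str     # physical location in shared storage
    schema: dict  # column name -> type
    readers: set = field(default_factory=set)  # allowed principals

class Catalog:
    """Minimal sketch of a shared catalog: datasets are discovered
    by name, and access policy is enforced here, not per engine."""

    def __init__(self):
        self._datasets = {}

    def register(self, name, entry):
        self._datasets[name] = entry

    def resolve(self, name, principal):
        entry = self._datasets[name]
        if principal not in entry.readers:
            raise PermissionError(f"{principal} may not read {name}")
        return entry.path, entry.schema

catalog = Catalog()
catalog.register("sales.orders", DatasetEntry(
    path="s3://lake/sales/orders/",
    schema={"order_id": "long", "amount": "decimal(10,2)"},
    readers={"analytics", "finance"},
))

# Any engine resolves data through the catalog, never a hardcoded path.
path, schema = catalog.resolve("sales.orders", principal="analytics")
assert path == "s3://lake/sales/orders/"
```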

The benefit is flexibility. When a new compute engine offers better performance for a specific workload, you plug it in. When a storage format evolves, you migrate the metadata layer without rewriting every consumer. When a team needs access to a dataset, they query the catalog instead of requesting a new pipeline.

The cost is coordination. Every component must adhere to the shared formats and protocols. Schema changes affect all consumers, not just the ones managed by the team that owns the data. This is where consensus and agreement protocols become relevant at the organisational level: composable systems require shared standards and the discipline to maintain them.

Trade-offs and consistency challenges

Composable, zero-copy architectures are not free of trade-offs. Several are significant enough to influence architectural decisions.

Consistency is the primary concern. When multiple engines read from shared storage, and when writes may come from different systems, the question of what state a reader sees becomes critical. Open table formats provide snapshot isolation and atomic commits, but the guarantees vary between formats and between engines reading those formats. Concurrent writes from different engines to the same table can conflict. Time, clocks, and ordering are directly relevant: without a consistent ordering of writes, readers can observe inconsistent states.
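The snapshot-isolation mechanism behind open table formats can be sketched as an atomic metadata pointer: each commit publishes a new immutable snapshot, and a reader that pins a snapshot sees a stable state even while writers commit. This is a simplified model in the spirit of Iceberg's metadata pointer, not any format's actual implementation.

```python
class SnapshotTable:
    """Sketch of snapshot isolation: each commit publishes a new
    immutable snapshot; readers pin the snapshot they started with."""

    def __init__(self):
        self._snapshots = [()]  # list of immutable row tuples
        self._current = 0       # the "metadata pointer"

    def read_snapshot(self):
        return self._current    # a reader pins a version

    def rows(self, snapshot_id):
        return self._snapshots[snapshot_id]

    def commit(self, new_rows):
        base = self._snapshots[self._current]
        self._snapshots.append(base + tuple(new_rows))
        self._current = len(self._snapshots) - 1  # one atomic swap

table = SnapshotTable()
table.commit([("a", 1)])
pinned = table.read_snapshot()  # a long-running reader starts here
table.commit([("b", 2)])        # a writer commits concurrently
assert table.rows(pinned) == (("a", 1),)            # reader is stable
assert table.rows(table.read_snapshot()) == (("a", 1), ("b", 2))
```

The hard problems the paragraph above describes live outside this sketch: two engines racing to perform the "one atomic swap" against the same table is exactly where cross-engine guarantees diverge.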

Schema evolution is the second challenge. In a world where data is copied into each consumer's store, schema changes are isolated. The source can change its schema, and downstream consumers continue reading their copy in the old format until their pipeline is updated. In a zero-copy architecture, a schema change is visible to all consumers immediately. This demands rigorous schema governance, including backward and forward compatibility rules, and it requires every consumer to handle schema changes gracefully.
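One such compatibility rule can be sketched as a check run before a schema change is published: existing readers must still be able to decode new data, so dropping a field or changing its type is rejected while adding an optional field is allowed. The function and schema encoding here are illustrative, not any governance tool's API.

```python
def is_compatible(old_schema, new_schema):
    """Sketch of one compatibility rule: readers written against
    old_schema must still decode data produced under new_schema.
    Fields are (type, required) pairs keyed by name."""
    for name, (old_type, _required) in old_schema.items():
        if name not in new_schema:
            return False  # dropping a field breaks existing readers
        new_type, _ = new_schema[name]
        if new_type != old_type:
            return False  # type changes break existing readers
    return True

v1 = {"order_id": ("long", True), "amount": ("double", True)}
v2 = dict(v1, currency=("string", False))  # additive, optional: safe
v3 = {"order_id": ("long", True)}          # drops "amount": unsafe

assert is_compatible(v1, v2)
assert not is_compatible(v1, v3)
```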

Performance characteristics differ from traditional architectures. Reading data over object storage through an open format is not always as fast as reading from a locally optimised, vendor-specific store. Caching, indexing, and query planning become more important. Some workloads will still benefit from materialised copies, and the architecture needs to accommodate that without degenerating into unmanaged duplication.

Access control complexity increases when multiple engines share data. Each engine has its own authentication and authorisation model. Enforcing consistent fine-grained access policies across all of them is a genuine engineering challenge, not a configuration exercise.

Reducing vendor lock-in

One of the stated motivations for composable architectures is reducing vendor lock-in. The reasoning is straightforward: if your data is in an open format on storage you control, you can switch compute engines without migrating data. This is true, but the picture is more nuanced.

Open formats reduce storage-level lock-in effectively. If your data is in Parquet files on S3, you are not locked into any particular query engine. You can switch from one vendor's warehouse to another, or to an open-source engine, without re-ingesting terabytes of data.

However, lock-in migrates up the stack. Catalog services have their own APIs and metadata formats. Query engines have proprietary optimisations, UDFs, and extensions. Orchestration tools, data quality frameworks, and lineage tracking systems all create switching costs. Adopting a composable architecture reduces one form of lock-in but does not eliminate dependency on specific vendors.

The practical approach is to distinguish between layers. Use open formats for storage. Accept some vendor specificity at the compute layer where performance justifies it. Invest in portability at the catalog and governance layers, because those are the hardest to migrate later.

This layered strategy echoes the broader discussion of human oversight boundaries and operational maturity: architectural decisions should be made deliberately at each layer, with a clear understanding of the trade-offs, rather than by applying a single principle uniformly across the entire stack.

Observability in composable architectures

When data flows through a composable architecture, understanding what happened and when becomes harder, not easier. In a monolithic data warehouse, lineage is implicit. Data enters through defined ingestion paths, transformations are logged, and queries run against a known state. In a composable architecture, multiple engines read and write to shared storage, and the flow of data through the system is less visible.

Observability in this context requires several capabilities that traditional monitoring does not provide.

  • Data lineage tracking: Every read and write to shared storage should be recorded with enough context to reconstruct the provenance of any dataset. This includes which engine performed the operation, what version of the schema was used, and what the input datasets were.
  • Freshness monitoring: When consumers expect data to be current, there needs to be a mechanism to detect when a dataset has not been updated within its expected window. Stale data in a zero-copy architecture has the same effect as a broken pipeline in a traditional one, except that the breakage is less visible.
  • Cross-engine query tracing: A workflow that spans multiple compute engines needs end-to-end tracing. If a Spark job writes a table that a Trino query reads, and the Trino query returns unexpected results, the operator needs to trace back through both engines to find the cause. This is the same principle as distributed tracing in microservices, applied to data infrastructure.
  • Schema change auditing: Every schema change should be logged, versioned, and reviewable. Consumers need to be notified when a schema they depend on has changed, and there should be a mechanism to assess the impact before the change takes effect.

The observability principles for distributed systems apply directly here. The three pillars of metrics, logs, and traces still hold. What changes is the scope: observability must span storage, multiple compute engines, catalog services, and access control layers. Building this observability is not optional. Without it, composable architectures become opaque systems where data quality issues are discovered late, by end users rather than by engineers.

Composable, zero-copy data architectures represent a genuine improvement over the ETL-heavy, duplication-laden patterns that preceded them. But they are not simple. They shift complexity from data movement to data governance, from pipeline engineering to format standardisation, and from vendor-managed silos to self-managed, multi-engine ecosystems. The organisations that benefit most from them are the ones that invest in the coordination, observability, and governance infrastructure these architectures demand.
