Co-Located Metadata and Data

Pattern ID: A-004
Pattern Type: architectural
Author: Mat Bettinson
Last updated: 2026-03-17
Keywords: metadata, data-packaging, portability, archival, long-term-preservation, FAIR
HASS Domains: linguistics, musicology, anthropology, digital-humanities
Source type: talk-transcript
Source ref: Peter Sefton, RO-Crate overview talk (YouTube)


An architectural principle for ensuring that data collections remain self-describing and portable across storage systems and institutional lifespans.


Pattern Metadata

Field          Value
Pattern ID     A-004
Pattern Type   Architectural
Keywords       metadata, data-packaging, portability, archival, long-term-preservation, FAIR
Author(s)      Mat Bettinson
Last Updated   2026-03-17

Intent

Store machine-readable metadata in the same package or directory as the data it describes, so that data collections remain self-describing and portable across storage systems without depending on an external metadata service.


Context

When This Pattern Applies

When This Pattern Does NOT Apply

Prerequisites


Issues

Issue 1: Data Portability Across Storage Systems

Datasets must remain usable when moved between storage technologies — file systems, object storage, cloud platforms, institutional repositories. If the metadata lives only in an external database or catalogue, a storage migration severs the link between data and its description. The data arrives at its destination stripped of context.

Issue 2: Metadata Loss at System Retirement

Institutional systems such as catalogues, data management platforms, and funding-project databases have lifespans shorter than the data they describe. When a system is retired, metadata held only in that system is at risk of being lost or disconnected from the data. Once the two are separated, the link is rarely recovered.

Issue 3: Long-term Preservation Without Institutional Continuity

HASS collections in linguistics, musicology, and anthropology are expected to remain usable decades after initial capture, often outliving the project team, institution, or funding body. Preservation requires that meaning be embedded in the collection itself, not held by a service that may no longer exist.

Key Constraints


Motivating Example

The Situation:

A research team at Western Sydney University managing environmental data from the Hawkesbury Institute for the Environment needed to store field data in a way that would remain interpretable over time. No obvious standard existed at the time for bundling data with a description manifest.

The Issues That Emerged:

Without a co-location standard, the team faced the prospect of metadata living in a separate system. If that system was not maintained alongside the data, future researchers encountering the raw data files would have no way to interpret them.

Why Balance Is Needed:

Metadata storage is often treated as a secondary concern relative to the data itself. The cost of separating them is not felt immediately — it accumulates over time as systems drift and people move on. By the time the separation becomes a problem, the people who understood the original context may no longer be reachable.


Solution

Core Idea

Store metadata in the same package, directory, or object prefix as the data it describes. When data moves — to a new storage system, a new institution, or a new custodian — the metadata moves with it automatically. The package is self-describing.

Key Principles

  1. Metadata travels with data. The metadata file or record is stored in the same directory, object prefix, or package as the data it describes. Moving or copying the data automatically includes its description.
  2. Format-agnostic. The architectural principle does not prescribe a metadata format. PARADISEC has used bespoke XML for 20 years; RO-Crate uses JSON-LD. What matters is that metadata is machine-readable and co-located, not which vocabulary or serialisation is used.
  3. Self-describing packages. Each data package carries enough metadata to be understood in isolation — without querying an external system. A new repository, researcher, or tool receiving the package can understand its contents from the package itself.
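The three principles above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the file name metadata.json follows the layout used in this pattern, while the record shape (a description plus an "items" list) and the function names write_package and describe_package are assumptions made for the example.

```python
import json
from pathlib import Path

# Assumed file name, matching the layout used in this pattern.
METADATA_NAME = "metadata.json"


def write_package(package_dir: Path, files: dict[str, bytes], description: dict) -> Path:
    """Create a package whose metadata sits beside the data it describes."""
    data_dir = package_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, payload in files.items():
        (data_dir / name).write_bytes(payload)

    # Hypothetical record shape: the description plus one entry per data file.
    record = dict(description)
    record["items"] = [{"path": f"data/{name}"} for name in sorted(files)]

    metadata_path = package_dir / METADATA_NAME
    metadata_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return metadata_path


def describe_package(package_dir: Path) -> dict:
    """Read a package using only its own contents; no external service is queried."""
    return json.loads((package_dir / METADATA_NAME).read_text(encoding="utf-8"))
```

Because describe_package takes nothing but the package path, any tool or repository that receives the directory can interpret it without access to the system that produced it.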

Solution Structure

Data Package (directory or object prefix)
├── data/
│   ├── item-001.wav
│   ├── item-002.wav
│   └── ...
└── metadata.json   ← co-located metadata file
    (describes items, provenance, rights, relationships)

When moved to new storage:
[Old storage] ──copy──► [New storage]
    package                 package (metadata intact)

Contrast with external metadata store:
[Old storage]   [Metadata DB]   → system retired → metadata lost
    data ───────────╳
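
The "copy" arrow in the diagram can be sketched as follows, assuming both storage locations are reachable as ordinary directories. The function names package_digest and migrate_package are hypothetical, and the checksum comparison is one possible way to confirm that metadata and data arrived together, not a step the source prescribes.

```python
import hashlib
import shutil
from pathlib import Path


def package_digest(package_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to the package root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(package_dir.rglob("*")):
        if path.is_file():
            relative = path.relative_to(package_dir).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def migrate_package(source: Path, destination: Path) -> bool:
    """Copy the whole package, then confirm data and metadata arrived unchanged."""
    # copytree copies everything under the package root, so the co-located
    # metadata file is carried along without any special handling.
    # The destination directory must not already exist.
    shutil.copytree(source, destination)
    return package_digest(source) == package_digest(destination)
```

No step in the migration refers to the metadata file by name. It moves because it lives inside the package, which is the property the contrast case with an external metadata database lacks.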

How the Issues Are Balanced

Values and Considerations

When choosing a metadata format:

When implementing co-location:


Implementation Examples

Example 1: PARADISEC — Pacific and Regional Archive for Digital Sources in Endangered Cultures

Context: Long-running digital archive for endangered language recordings and cultural materials, operating for approximately 20 years.

How They Balanced the Issues: PARADISEC stores audio, video, and document files on commodity file storage with bespoke XML metadata beside each item. The same principle has been maintained as their storage backend evolved to object storage — the co-location approach works identically in both contexts. Data and metadata live together regardless of the underlying storage technology.

What Worked Well: The archive has survived multiple storage system changes over 20 years without loss of metadata linkage. The collection remains interpretable because meaning is embedded in the packages, not in an external catalogue alone.
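
A per-item XML file stored beside each recording might be sketched as below. This is an illustration of the general sidecar approach only: the element names (item, file, title, language), the ".xml" suffix convention, and the function names are all assumptions for the example and do not reflect PARADISEC's actual schema.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def sidecar_path(item_path: Path) -> Path:
    """Hypothetical naming rule: item-001.wav is described by item-001.wav.xml."""
    return item_path.parent / (item_path.name + ".xml")


def write_sidecar(item_path: Path, title: str, language: str) -> Path:
    """Write an XML description in the same directory as the item it describes."""
    # Element names are illustrative only, not PARADISEC's schema.
    root = ET.Element("item")
    ET.SubElement(root, "file").text = item_path.name
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "language").text = language

    sidecar = sidecar_path(item_path)
    ET.ElementTree(root).write(str(sidecar), encoding="utf-8", xml_declaration=True)
    return sidecar


def read_sidecar(item_path: Path) -> dict[str, str]:
    """Recover the description from the file stored beside the item."""
    root = ET.parse(str(sidecar_path(item_path))).getroot()
    return {child.tag: child.text or "" for child in root}
```

The same naming rule carries over to object storage if the item and its description share a key prefix, which is one way to read the claim that the approach works identically in both contexts.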

Example 2: Language Data Commons of Australia (LDaCA)

Context: National infrastructure for Australian language research collections, including dictionaries and community language materials.

How They Balanced the Issues: LDaCA applies the co-location principle at scale using RO-Crate as the metadata standard. Language collections are packaged as self-describing crates — each collection carries its RO-Crate metadata file alongside the data, making collections portable and independently interpretable.

What Worked Well: Adoption of a shared standard (RO-Crate) extends the principle into an ecosystem: tools built for RO-Crate can consume any LDaCA collection without project-specific integration.
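
For orientation, a minimal RO-Crate metadata file can be generated with the standard library alone. The sketch below follows the general RO-Crate 1.1 layout (a metadata descriptor plus a root dataset in a JSON-LD @graph, stored as ro-crate-metadata.json at the crate root). It is not LDaCA's profile or tooling; the function name write_minimal_crate and its parameters are assumptions for the example, and real collections would carry much richer descriptions.

```python
import json
from pathlib import Path

# File name and context URL as defined by the RO-Crate 1.1 specification.
RO_CRATE_FILE = "ro-crate-metadata.json"
RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context"


def write_minimal_crate(crate_dir: Path, name: str, description: str,
                        date_published: str, license_url: str) -> Path:
    """Write a metadata file at the crate root describing every other file in it."""
    metadata_path = crate_dir / RO_CRATE_FILE
    # Simplification: paths are used as-is; names with spaces or special
    # characters would need URI encoding in a real crate.
    files = sorted(
        p.relative_to(crate_dir).as_posix()
        for p in crate_dir.rglob("*")
        if p.is_file() and p != metadata_path
    )
    graph = [
        {  # Metadata descriptor: identifies this file and what it is about.
            "@id": RO_CRATE_FILE,
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {  # Root dataset: the collection itself.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": f} for f in files],
        },
    ]
    graph += [{"@id": f, "@type": "File"} for f in files]

    document = {"@context": RO_CRATE_CONTEXT, "@graph": graph}
    metadata_path.write_text(json.dumps(document, indent=2), encoding="utf-8")
    return metadata_path
```

Because the file name and graph layout are fixed by the standard rather than by the project, a generic RO-Crate tool can locate and read the description without project-specific integration.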

Additional Examples


Context-Specific Guidance

For HASS Research

For Indigenous Research

CARE Principles Application (Carroll et al., 2020):

Cultural Considerations:

For Different Scales

Small Projects / Solo Researchers:

Large Collaborative Projects:


Consequences

What You Gain

What You Accept

Risks to Manage


Known Uses

PARADISEC

Language Data Commons of Australia (LDaCA)

Hawkesbury Environmental Data (WSU)


Works Well With

Typical Sequence

[Co-Located Metadata and Data (A-004)] → [RO-Crate for Research Data Packaging (I-005)]
Architectural principle                    Concrete implementation choice

Pitfalls to Avoid

Anti-Pattern: Metadata-Only Repository

Common Mistake: Co-Location Without Schema


Resources

Further Reading


Key References

Alexander, C. (1977). A Pattern Language: Towns, Buildings, Construction. Oxford University Press.

Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., Parsons, M., Raseroka, K., Rodriguez-Lonebear, D., Rowe, R., Sara, R., Walker, J. D., Anderson, J., & Hudson, M. (2020). The CARE Principles for Indigenous Data Governance. Data Science Journal, 19(1), 43. https://doi.org/10.5334/dsj-2020-043

Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1994). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional.