RO-Crate for Research Data Packaging

Pattern ID: I-005
Pattern Type: implementation
Also known as: Research Object Crate, RO-Crate
Author: Mat Bettinson
Last updated: 2026-03-17
Keywords: ro-crate, JSON-LD, metadata, data-packaging, research-objects, FAIR, static-publication, schema.org, linked-data
HASS Domains: linguistics, musicology, anthropology, digital-humanities, cultural-heritage
Source type: talk-transcript
Source ref: Peter Sefton, RO-Crate overview talk (YouTube)

A concrete implementation of co-located metadata and data using JSON-LD and optional HTML preview to make research data collections self-describing, portable, and human-accessible.

Alternative Names: Research Object Crate, RO-Crate


Pattern Metadata

Field          Value
Pattern ID     I-005
Pattern Type   Implementation
Keywords       ro-crate, JSON-LD, metadata, data-packaging, research-objects, FAIR, static-publication, schema.org
Author(s)      Mat Bettinson
Last Updated   2026-03-17

Intent

Package a research data collection as a self-describing object by placing a JSON-LD metadata file (ro-crate-metadata.json) alongside the data, with an optional HTML file that provides a human-readable preview — making the collection portable, machine-readable, and immediately interpretable without external infrastructure.


Context

When This Pattern Applies

When This Pattern Does NOT Apply

Prerequisites


Issues

Issue 1: Machine-Readability vs. Human Accessibility

JSON-LD metadata is machine-readable but opaque to researchers, archivists, and community members who need to understand what a package contains. A metadata file alone creates a two-tier collection: interpretable by tools, inaccessible to humans without further infrastructure.

Issue 2: General Vocabulary Coverage vs. Domain Specificity

schema.org provides a well-supported, machine-readable vocabulary for common properties (file size, dates, descriptions, authorship) but does not cover domain-specific concepts in linguistics, musicology, or cultural heritage. Domain-specific terms — e.g., dialogue type, gesture, handwritten vs. audio — are essential for describing HASS collections but absent from the general vocabulary.

Issue 3: Archival Longevity vs. Technology Dependencies

Research data must remain accessible over long timeframes — decades, not years. Technologies that package or present data in ways that depend on maintained infrastructure (apps, web services, proprietary tools) routinely fail before the research communities that depend on them.

Key Constraints


Motivating Example

The Situation:

A project documenting an Arandic language community produced a dictionary combining images, audio recordings of speakers, and textual entries. Historically, such resources were often packaged as phone applications — the obvious choice for a format people would use to access the dictionary.

The Issues That Emerged:

Phone apps built for one operating system version become unmaintainable as platforms evolve. Apps funded for a specific project period typically have no maintenance budget once the grant ends. A dictionary that worked at project completion may be non-functional within two years — while the language community it serves continues to need it.

Why Balance Is Needed:

The need for human accessibility (a format people will actually use) pulls toward rich applications. The need for long-term availability pulls toward low-dependency formats. RO-Crate with a static HTML preview resolves this by delivering a human-accessible presentation layer (browsable in any web browser) from a format that requires no ongoing server infrastructure.


Solution

Core Idea

Place a ro-crate-metadata.json file, a JSON-LD document conforming to the RO-Crate specification, in the same directory as the research data files. Generating a ro-crate-preview.html file as well is optional but strongly recommended: it renders the metadata as a human-readable page, potentially with embedded content previews (the first rows of a spreadsheet, an audio player for sound files). The package can then be served as a static website, deposited in a repository, or archived without any ongoing server infrastructure.

Key Principles

  1. JSON-LD as the metadata layer. ro-crate-metadata.json uses JSON-LD with schema.org as the base vocabulary. Entities (files, datasets, people, organisations) are described using schema.org properties, with domain-specific vocabulary extensions where required.
  2. HTML preview as the accessibility layer. The HTML file can be generated automatically from the JSON-LD and provides human-readable access to the collection without tooling or infrastructure. It can include rendered content — audio players, image previews, spreadsheet rows — not just file listings.
  3. Extend schema.org for domain needs. Where general vocabulary is insufficient, domain communities maintain separate vocabulary documents with persistent identifiers (e.g., W3ID). The extension vocabulary is referenced from within the crate, keeping extensions discoverable and independently resolvable.
  4. Profiles and validation. A profile defines required and optional properties per entity type for a given domain or project. Validation checks candidate crates against their declared profile. This enables domain communities to express and enforce metadata requirements while remaining within the RO-Crate ecosystem.
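
The profile-and-validation principle can be illustrated with a minimal sketch. The profile below (`EXAMPLE_PROFILE`) and its required properties are hypothetical, invented for illustration; a real domain profile would be published by its community and would usually say more than "these properties must be present". The sketch only checks that each entity carries the properties its declared type requires.

```python
# Hypothetical profile: required properties per entity type.
# A real profile would be published by a domain community and declared by the crate.
EXAMPLE_PROFILE = {
    "Dataset": ["name", "description", "license"],
    "File": ["name", "encodingFormat"],
}


def as_list(value):
    """JSON-LD allows a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def validate(metadata, profile):
    """Return a list of human-readable problems; an empty list means the crate passes."""
    problems = []
    for entity in metadata.get("@graph", []):
        for type_name in as_list(entity.get("@type", [])):
            for prop in profile.get(type_name, []):
                if prop not in entity:
                    problems.append(
                        f"{entity.get('@id', '?')}: {type_name} is missing '{prop}'"
                    )
    return problems


# Illustrative crate content; the dataset deliberately lacks a license.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "./", "@type": "Dataset", "name": "Demo", "description": "Demo crate"},
        {"@id": "data/recording-001.wav", "@type": "File",
         "name": "recording-001.wav", "encodingFormat": "audio/wav"},
    ],
}

for problem in validate(crate, EXAMPLE_PROFILE):
    print(problem)
```

Run as written, this reports that the root dataset is missing `license`. Returning a list of problems rather than raising on the first one lets a depositor see every gap in a single pass.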

Solution Structure

Research Data Directory (RO-Crate)
├── ro-crate-metadata.json   ← JSON-LD; entities + relationships
├── ro-crate-preview.html    ← auto-generated human-readable view
├── data/
│   ├── recording-001.wav
│   ├── image-001.jpg
│   └── dictionary-entries.csv
└── (optional: declared profile for validation)

ro-crate-metadata.json structure:
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    { "@type": "CreativeWork", "@id": "ro-crate-metadata.json", ... },
    { "@type": "Dataset", "@id": "./", "name": "...", ... },
    { "@type": "File", "@id": "data/recording-001.wav", ... },
    ...
  ]
}

Publication pathway (no server required):
  RO-Crate directory → GitHub Pages → static website
  RO-Crate directory → archival repository → persistent access
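
As a rough sketch of the structure above, the following Python builds a minimal ro-crate-metadata.json document using only the standard library. The function name `build_metadata` and the sample values are assumptions made for illustration. The `conformsTo` and `about` properties on the metadata descriptor follow the RO-Crate 1.1 specification; the specification also asks for further root dataset properties (such as `datePublished` and `license`) that are left out here for brevity.

```python
import json
from pathlib import PurePosixPath


def build_metadata(name, description, files):
    """Return a minimal RO-Crate 1.1 metadata document as a dict.

    `files` holds paths relative to the crate root, e.g. "data/image-001.jpg".
    """
    # The metadata descriptor identifies the metadata file and points at the root dataset.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # The root dataset describes the collection and lists its files by reference.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "hasPart": [{"@id": path} for path in files],
    }
    # One File entity per data file, keyed by its relative path.
    file_entities = [
        {"@id": path, "@type": "File", "name": PurePosixPath(path).name}
        for path in files
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *file_entities],
    }


metadata = build_metadata(
    "Example dictionary",  # illustrative values only
    "Images, audio and text entries for a community dictionary.",
    ["data/recording-001.wav", "data/image-001.jpg", "data/dictionary-entries.csv"],
)
print(json.dumps(metadata, indent=2))
```

Writing the printed JSON to ro-crate-metadata.json in the directory that contains data/ gives the layout shown in the tree above.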

How the Issues Are Balanced

Values and Considerations

When authoring metadata:

When generating the HTML preview:


Implementation Examples

Example 1: LDaCA Arandic Dictionary

Context: Language Data Commons of Australia; a community language dictionary combining images, audio recordings of speakers, and textual dictionary entries for an Arandic language.

How They Balanced the Issues: A single RO-Crate was generated for the dictionary, with an HTML preview produced automatically from the crate metadata. The preview includes an audio player and image display. The entire collection is published as a static website on GitHub Pages — no server infrastructure required.

What Worked Well: Long-lived, low-overhead publication. The static site has no maintenance dependency after deployment, in direct contrast to app-based alternatives that had failed within years of project end.

Link to Details: https://www.ldaca.edu.au/
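
How a preview can be derived from the crate metadata is sketched below. This is not LDaCA's actual generator; it is a simplified illustration that assumes the root dataset has the identifier "./", that `hasPart` is a list, and that file paths need no URL encoding. The function name `render_preview` and the suffix lists are likewise assumptions.

```python
import html

AUDIO_SUFFIXES = (".wav", ".mp3", ".ogg")
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def render_preview(metadata):
    """Render a static HTML page from an RO-Crate metadata dict."""
    entities = {entity["@id"]: entity for entity in metadata["@graph"]}
    root = entities["./"]  # assumption: the root dataset uses "./"
    title = html.escape(root.get("name", "Untitled crate"))
    lines = [
        "<!DOCTYPE html>",
        '<html lang="en"><head><meta charset="utf-8">',
        f"<title>{title}</title></head><body>",
        f"<h1>{title}</h1>",
        f"<p>{html.escape(root.get('description', ''))}</p>",
        "<ul>",
    ]
    for ref in root.get("hasPart", []):
        entity = entities.get(ref["@id"], ref)
        path = entity["@id"]
        label = html.escape(entity.get("name", path))
        src = html.escape(path)
        # Pick a rendering by file suffix: player, image, or plain link.
        if path.lower().endswith(AUDIO_SUFFIXES):
            media = f'<audio controls src="{src}"></audio>'
        elif path.lower().endswith(IMAGE_SUFFIXES):
            media = f'<img src="{src}" alt="{label}">'
        else:
            media = f'<a href="{src}">{label}</a>'
        lines.append(f"<li>{label} {media}</li>")
    lines += ["</ul>", "</body></html>"]
    return "\n".join(lines)


# Illustrative metadata with one audio file and one spreadsheet.
sample = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "./", "@type": "Dataset", "name": "Example dictionary",
         "description": "Audio and text entries.",
         "hasPart": [{"@id": "data/recording-001.wav"},
                     {"@id": "data/dictionary-entries.csv"}]},
        {"@id": "data/recording-001.wav", "@type": "File", "name": "Recording 1"},
        {"@id": "data/dictionary-entries.csv", "@type": "File", "name": "Entries"},
    ],
}
print(render_preview(sample))
```

Because the output is plain HTML with relative links, saving it as ro-crate-preview.html next to the data is enough for a static host to serve it.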

Example 2: PARADISEC — Archival Scale

Context: Pacific and Regional Archive for Digital Sources in Endangered Cultures; tens of thousands of endangered language recordings and cultural materials.

How They Balanced the Issues: PARADISEC delivers RO-Crates at scale from a server with an object storage backend, and uses file and directory crates for backup. The same RO-Crate structure works from individual crates up to repository scale; the pattern is scale-agnostic.

What Worked Well: RO-Crate’s flexibility across scales means PARADISEC can use it both as the internal packaging format and as the interchange format for data delivery.

Link to Details: https://www.paradisec.org.au/

Example 3: LDaCA — Domain Vocabulary Extension

Context: Language Data Commons of Australia; multiple language collections across language groups and institutions, all requiring domain-specific metadata terms beyond what schema.org provides.

How They Balanced the Issues: LDaCA maintains a vocabulary extension for language data — terms such as dialogue, drama, formulaic, gesture, handwritten — identified via a W3ID persistent identifier. This vocabulary is embedded in RO-Crates for all language collections, providing domain-specific metadata while remaining within the JSON-LD ecosystem.

What Worked Well: Domain vocabulary extension allows richly typed metadata for language collections that schema.org alone cannot express, while the persistent identifier for the vocabulary ensures extensions remain resolvable over time.

Link to Details: https://www.ldaca.edu.au/
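
One way a domain vocabulary can sit alongside schema.org inside a crate is to extend the JSON-LD `@context` with a prefix for the extension namespace. The sketch below uses a placeholder namespace (`https://w3id.org/example/terms#`) and made-up term names (`ex:communicationMode`, `ex:Gesture`); these are not LDaCA's actual identifiers and are assumptions made purely for illustration.

```python
import copy
import json


def add_vocabulary(metadata, prefix, namespace):
    """Return a copy of the crate metadata whose @context also maps `prefix` to `namespace`."""
    extended = copy.deepcopy(metadata)  # leave the caller's dict untouched
    context = extended["@context"]
    base = context if isinstance(context, list) else [context]
    extended["@context"] = [*base, {prefix: namespace}]
    return extended


crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "data/recording-001.wav", "@type": "File", "name": "Recording 1"},
    ],
}

# Placeholder namespace and term names, for illustration only.
extended = add_vocabulary(crate, "ex", "https://w3id.org/example/terms#")
extended["@graph"][0]["ex:communicationMode"] = {"@id": "ex:Gesture"}
print(json.dumps(extended, indent=2))
```

With the prefix declared in the context, a JSON-LD processor expands `ex:Gesture` to a full IRI under the vocabulary's persistent identifier, which is what keeps the extension term independently resolvable.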


Context-Specific Guidance

For HASS Research

For Indigenous Research

CARE Principles Application (Carroll et al., 2020):

Cultural Considerations:

For Different Scales

Small Projects / Solo Researchers:

Large Collaborative Projects:


Consequences

What You Gain

What You Accept

Risks to Manage


Known Uses

PARADISEC

Language Data Commons of Australia (LDaCA)

ARDC-Funded Projects

European Biosciences (FAIR Digital Objects)


Works Well With

Typical Sequence

Co-Located Metadata and Data (A-004) → RO-Crate for Research Data Packaging (I-005)
Architectural principle chosen             Concrete technology selected

Pitfalls to Avoid

Anti-Pattern: Hand-Authored JSON-LD

Common Mistake: Maintaining the HTML Preview Separately


Resources

Learning Materials

Tools and Platforms

Further Reading


Key References

Alexander, C. (1977). A Pattern Language: Towns, Buildings, Construction. Oxford University Press.

Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., Parsons, M., Raseroka, K., Rodriguez-Lonebear, D., Rowe, R., Sara, R., Walker, J. D., Anderson, J., & Hudson, M. (2020). The CARE Principles for Indigenous Data Governance. Data Science Journal, 19(1), 43. https://datascience.codata.org/articles/10.5334/dsj-2020-043 (DOI https://doi.org/10.5334/dsj-2020-043 does not resolve correctly due to journal platform migration)

Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1994). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional.

Soiland-Reyes, S., Sefton, P., Goble, C., Garijo, D., Ó Carragáin, E., Crusoe, M., Ó Searcóid, S., Sloggett, C., & Thieberger, N. (2022). Packaging research artefacts with RO-Crate. Data Science, 5(2), 97–138. https://doi.org/10.3233/DS-210053