
DatAasee - A Metadatalake for Libraries

DatAasee centralizes and interlinks distributed library/research metadata into an API‑first union catalog.

DatAasee Architecture Documentation

Version: 0.9

The principal goal of DatAasee is to provide a library-focused one-stop shop for research data discovery as well as a library-wide metadata hub. DatAasee is a Metadata-Lake (MDL) that aggregates and interconnects research metadata and bibliographic data from various data sources and exposes them via a JSON HTTP API, which is in turn used by a prototype web frontend.

Sections:

  1. Introduction & Goals
  2. Constraints
  3. Context & Scope
  4. Solution Strategy
  5. Building Block View
  6. Runtime View
  7. Deployment View
  8. Crosscutting Concepts
  9. Architectural Decisions
  10. Quality Requirements
  11. Risks & Technical Debt
  12. Glossary

Summary:

NOTE: For the specific data model, see: YASQL schema

For background information on data and software architecture, see: https://arxiv.org/abs/2409.05512 and references therein.


1. Introduction & Goals

1.1 Requirements Overview

Given: research and bibliographic (meta)data maintained in various distributed databases and no central access point to browse, search, or locate data-sets. The metadata-lake …

System Landscape

1.2 Quality Goals

| Quality Goal | Associated Scenarios |
|---|---|
| Functional Suitability | F0 |
| Transferability | T0 |
| Compatibility | C0 |
| Operability | O0 |
| Maintainability | M0, M1 |

2. Constraints

2.1 Technical Constraints

| Constraint | Explanation |
|---|---|
| Cloud Deployability | To integrate into existing infrastructure and operation environments, a containerized service is required. |
| Interoperability | Data pipelining is required to be compatible with existing database interfaces. |
| Extensibility | Components such as metadata schemas, data pipelines, and metadata exports are required to be extensible. |

2.2 Organizational Constraints

| Constraint | Explanation |
|---|---|
| OAI-PMH | Many existing data sources provide an OAI-PMH endpoint, which needs to be supported. |
| XML | All source metadata is expected to be in XML. |
| S3 | File-based ingest must also be possible via object storage, particularly Ceph’s S3 API. |
| K8s | If possible, Kubernetes should be supported (in addition to Compose). |
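
To make the OAI-PMH constraint concrete, the following is a minimal Python sketch of how a harvester might page through `ListRecords` responses using resumption tokens, as defined by the OAI-PMH protocol. The function names and the example base URL are hypothetical; this is not DatAasee's ingest pipeline, which is declared in the processor configuration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token of a response, or None if the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvester would request `list_records_url(base)`, process the returned records, and repeat with `next_token(...)` until it returns `None`.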

2.3 Conventions

Technical

| Standard | Function |
|---|---|
| JSON | Serialization language for all external messages |
| JSON:API | External message format standardization |
| JSON Schema | External message content validation |
| YAML | Internal processor (and prototype frontend) declaration language |
| StrictYAML | Preferred declaration language dialect |
| OpenAPI | External API definition and documentation format |
| SHA256 | Identifier Hashing and Checksums |
| Base64URL | Identifier Encoding |
| Naming Things with Hashes | Identifier Marking |
| Compose | Deployment and orchestration |
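
The three identifier conventions above (SHA256, Base64URL, Naming Things with Hashes) can be combined as in the following minimal Python sketch. It follows the `ni` URI form from RFC 6920 with an empty authority and unpadded base64url; which bytes DatAasee actually hashes, and the exact identifier layout it emits, are assumptions here, and `ni_identifier` is a hypothetical name.

```python
import base64
import hashlib


def ni_identifier(raw_record: bytes) -> str:
    """Derive an RFC 6920 style 'ni' URI from raw record bytes (illustrative only)."""
    digest = hashlib.sha256(raw_record).digest()
    # RFC 6920 uses base64url without '=' padding.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"ni:///sha-256;{encoded}"
```

Because the identifier is a pure function of the input bytes, re-ingesting an unchanged record yields the same identifier, and the `ni:` prefix lets a client recognize identifiers without inspecting field names.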

Content

| Standard | Function |
|---|---|
| DataCite | Core metadata vocabulary |
| OpenWEMI | Entity relationships |
| Fields of Science | Scientific classification |
| SPDX License List | Software license names |
| Creative Commons | License names |
| RightsStatements.org | Copyright classification |
| ISO 8601 | Date and time formatting |
| ISO 639-1 | Language name abbreviations |
| DOI | Preferred resource identifier |
| ORCID | Preferred creator identifier |
| DublinCore | Import format |
| MODS | Import format |
| MARCXML | Import format |
| LIDO | Import format |
| BibJSON | Export format |

Documentation

| Standard | Function |
|---|---|
| Tech Stack Canvas | Product tech stack (see README) |
| Diataxis | Software documentation structure (see docs) |
| arc42 | Software architecture documentation (this document) |
| yasql | Database schema documentation (can be rendered with PlantUML) |

3. Context & Scope

Context

3.1 Business Context

| Channel | Description |
|---|---|
| Interact | All unprivileged functionality |
| Search | Directly query metadata records (typically privileged) |
| Control | Monitor, trigger ingests and backups (privileged) |
| Import | Ingest metadata records from source system |

3.2 Technical Context

| Channel | Description |
|---|---|
| Interact | Unprivileged HTTP API |
| Search | Requested and responded through HTTP API |
| Control | Privileged HTTP API |
| Import | Pulled via HTTP |

4. Solution Strategy


5. Building Block View

DatAasee uses a three-tier architecture with these separately containerized components which are orchestrated by Compose:

| Function | Abstraction | Tier | Product |
|---|---|---|---|
| Metadata Catalog | Multi-Model Database | Data (Database) | ArcadeDB |
| EtLT Processor | Declarative Streaming Processor | Logic (Backend) | Benthos |
| Web Frontend | Declarative Web Framework | Presentation (Frontend) | Lowdefy |

Level 0 (Outside View)

Outside View

DatAasee

Source Databases (External)

Backup Storage (External)

Prototype Web-Frontend (Optional)

Level 1 (Inside View)

Inside View

Database Container

Backend Container

Frontend Container (Optional)

Level 2 (Container View)

Database Container Internals

Database

Backend Container Internals

Backend

Frontend Container Internals

Frontend


6. Runtime View

System Endpoints

/api Endpoint (Public)

NOTE: This endpoint is implicitly cached, meaning all schema files are opened only once.

API Endpoint

See api endpoint documentation and source file.


/ready Endpoint (Public)

NOTE: This endpoint reports ready if processor and database are ready.

Ready Endpoint

See ready endpoint reference and source file.
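
As an illustration of how an operator script might use this endpoint, here is a minimal Python sketch that polls until the service reports ready. It assumes, without confirmation from this document, that readiness is signalled by an HTTP 200 response to a GET request on `<base_url>/ready`, where `base_url` includes any API prefix; `is_ready` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/ready answers with HTTP 200 (assumed convention)."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(check, attempts=30, delay=1.0, sleep=time.sleep):
    """Call check() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, `wait_until_ready(lambda: is_ready(base_url))` could gate a deployment step on both processor and database being up.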


/health Endpoint (Private)

NOTE: Since the returned information is only useful to an operator, not to a user, this endpoint is private and therefore uses POST.

Health Endpoint

See health endpoint reference and source file.


/ingest Endpoint (Private, External Read)

NOTE: The ingest process is asynchronous; the request returns success if an ingest was started.

Ingest Endpoint

See ingest endpoint reference and source file.
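
The "return as soon as the ingest has started" behaviour can be pictured with the following generic Python sketch. It is not DatAasee's implementation (the backend is a declarative stream processor); the class `IngestRunner` is hypothetical, and the choice to refuse a second ingest while one is still running is an assumption made for this example.

```python
import threading


class IngestRunner:
    """Starts a long-running job in the background and reports only whether it started."""

    def __init__(self, job):
        self._job = job              # callable taking one argument (the source)
        self._lock = threading.Lock()
        self._thread = None

    def start(self, source):
        """Return True if an ingest was started, False if one is still running."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False
            self._thread = threading.Thread(
                target=self._job, args=(source,), daemon=True
            )
            self._thread.start()
            return True

    def wait(self, timeout=None):
        """Block until the current job (if any) has finished; mainly useful in tests."""
        thread = self._thread
        if thread is not None:
            thread.join(timeout)
```

The caller learns only that the job was accepted; completion has to be observed separately, which is why a success response from such an endpoint does not imply that the ingest has finished.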


Support Endpoints

/schema Endpoint (Public, Cached)

Schema Endpoint

See schema endpoint reference and source file.


Data Endpoints

/metadata Endpoint (Public)

Metadata Endpoint

See metadata endpoint reference and source file.


/database Endpoint (Public)

NOTE: This endpoint allows idempotent read operations since it uses the query endpoint of ArcadeDB.

Database Endpoint

See database endpoint reference and source file.


7. Deployment View

Level 0 (Technical View)

Overview

See compose.yaml for deployment details.

Level 1 (Data-Flow View)

EtLT

EtLT (Extract-transform-Load-Transform): Ingest vs. Read
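
The EtLT pattern can be sketched in a few lines of Python. The example assumes the usual reading of the acronym: a light normalization ("t") before loading and a heavier shaping step ("T") after loading, here applied when reading. The Dublin Core sample record, the in-memory `store`, and all function names are hypothetical stand-ins; in DatAasee the corresponding steps are carried out by the processor and the database.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for Dublin Core elements.
DC = "{http://purl.org/dc/elements/1.1/}"


def extract(xml_text):
    """E: parse one source record from XML."""
    return ET.fromstring(xml_text)


def small_transform(record):
    """t: light normalization only (select fields, trim whitespace)."""
    return {
        "title": (record.findtext(f"{DC}title") or "").strip(),
        "creators": [c.text.strip() for c in record.findall(f"{DC}creator") if c.text],
    }


def load(store, record_id, document):
    """L: persist the normalized document (a dict stands in for the database)."""
    store[record_id] = document


def transform_on_read(store, record_id):
    """T: shape the stored document for output after loading."""
    doc = store[record_id]
    return {"id": record_id, "label": f"{doc['title']} ({'; '.join(doc['creators'])})"}
```

Keeping "t" small means source-specific parsing stays cheap and replaceable, while the heavier "T" step works on an already uniform representation.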


8. Crosscutting Concepts

Internal Concepts

Security Concepts

Development Concepts

Operational Concepts


9. Architectural Decisions

Timestamp Template
Status
Decision
Consequences
2026-04-17 Remove backup endpoint
Status Approved
Decision Remove manual backups, since data does not change between ingests.
Consequences Backup endpoint not needed any more.
2026-04-08 Remove view tracking
Status Approved
Decision Remove the tracking of record views.
Consequences Simpler and faster database, no data loss between ingests.
2026-02-20 No Backup on Shutdown
Status Approved
Decision The database shutdown does not trigger a backup, as it takes too long for large databases.
Consequences Faster shutdown and simpler database init script; backups after ingest preserve most of state.
2026-01-27 Use NI for record identifiers
Status Approved
Decision Record identifiers are prefixed with the NI URI scheme and use base64url-encoded SHA256 hash.
Consequences Frontends can detect record identifiers without parsing the key field.
2026-01-22 Remove Gremlin query language
Status Approved
Decision Remove Gremlin module from ArcadeDB
Consequences The Gremlin query language is not supported anymore in DatAasee and hence SPARQL will not be overlaid.
2025-12-17 Streamline HTTP API
Status Approved
Decision Remove enums and sources endpoints and integrate their information into schema endpoint
Consequences More uniform API handling, and fewer endpoints for easier usability.
2025-09-19 Frontend Container Image
Status Approved
Decision Production and development frontend images should be air-gapped after build.
Consequences More control over dependencies especially during dynamic rebuilds.
2025-04-11 Post-Processing
Status Approved
Decision Minimize database response post-processing.
Consequences Shift transformation workload to ArcadeDB.
2024-10-23 Container Base Images
Status Approved
Decision Base containers for database and backend are the current Ubuntu LTS (i.e., 26.04).
Consequences Full libc support compared to Alpine and obvious release date and support horizon from version number compared to Debian.
2024-07-04 Indirect Processor Dependency Updates
Status Approved
Decision Indirect processor dependency updates do not cause a (minor) version update.
Consequences A release image build (of the current version) can be triggered and processor dependencies are updated in the process.
2024-06-03 API Licensing
Status Approved
Decision The OpenAPI license definition is additionally licensed under CC-BY.
Consequences Easier third-party reimplementation of the DatAasee API.
2024-02-21 Use OAI vs Non-OAI metadata format variants
Status Approved
Decision Non-OAI variants of the DC and DataCite formats are supported.
Consequences More lenient handling of fields when configuring ingest.
2024-01-17 Compose-only Deployment
Status Approved
Decision Deployment is solely distributed and initiated by the compose.yaml.
Consequences The compose file and orchestrator have central importance.
2023-11-20 Database Storage
Status Approved
Decision Database uses in-container storage, only backups are stored outside.
Consequences Faster database at the price of fixed savepoints.
2023-08-24 Record Identifier
Status Approved (Superseded)
Decision Use xxhash64 / SHA256 of ingested or inserted raw record.
Consequences Identifier is reproducible but not a URL.
2023-08-08 Ingest Modularity
Status Approved
Decision Ingest sources are passed via API to the backend.
Consequences Sources can be maintained outside and appended during runtime.
2023-05-16 Graph Edges
Status Approved
Decision Graph edges are only set by ingest (or other automatic) processes, not by a user.
Consequences Edge semantics need to be machine-interpretable.
2022-12-07 Frontend Language
Status Approved
Decision Use English language only for frontend and metadata labels and comments.
Consequences Additional translations (German) are not prepared for now.
2022-10-10 Only Virtual Storage
Status Approved
Decision No explicit storage component for data, only metadata is managed.
Consequences No interface or instance to e.g. Ceph is developed, but URL references (to data storage) are stored.
2022-10-05 API-only Frontend
Status Approved
Decision The HTTP API is the sole frontend, further frontends are only expressions of the API.
Consequences Web frontend can only use the API.
2022-10-04 Declarative First
Status Approved
Decision Prefer declarative (YAML-based) approaches for defining processes and interfaces to reduce free coding and increase robustness.
Consequences Frontrunners: Benthos as backend, and Lowdefy (or uteam) as prototype web frontend.
2022-09-16 Multi-model Database
Status Approved
Decision Use a (property-)graph / document / key-value database as central catalog component for a maximally flexible data model.
Consequences Frontrunner ArcadeDB (or OrientDB) as database.

10. Quality Requirements

10.1 Quality Requirements

| Quality Category | Quality | ID | Description |
|---|---|---|---|
| Functional Suitability | Appropriateness | F0 | DatAasee should fulfill the expected overall functionality. |
| Transferability | Installability | T0 | Installation should work in various container-based environments. |
| Compatibility | Interoperability | C0 | The available protocols (and format parsers) should fit the most common systems. |
| Operability | Ease of Use | O0 | The API should be self-describing, well documented, and following standards and best practices. |
| Maintainability | Modularity | M0 | New protocols, format parsers or other pipelines should be implementable without too much effort. |
| Maintainability | Reusability | M1 | The protocol and format parser codes serve as sample and documentation. |

10.2 Quality Scenarios

| ID | Scenario |
|---|---|
| F0 | Stakeholder project evaluation |
| T0 | Setup of DatAasee by a new operator |
| C0 | Ingesting from a new source system |
| O0 | User and (downstream) developer API Usage |
| M0 | Extending the compatibility to new systems |
| M1 | Development of a follow-up project to DatAasee |

11. Risks & Technical Debt

| Risk | Description | Mitigation |
|---|---|---|
| Insecure deployment | There is no built-in TLS termination or rate limiting, and the database endpoint is not meant for public consumption. | Comprehensive documentation with warnings and guidelines. |
| DBMS project might cease | ArcadeDB is a small project which has small-project risks. | Since SQL is used internally to interact with ArcadeDB, an RDBMS could in principle serve as a replacement; ArcadeDB nevertheless remains a core architectural dependency. |
| Processor project might complicate | Benthos was acquired by “Redpanda”, who may change its license or the licenses of the connectors. | Use the hard fork Bento. |

12. Glossary

| Term | Acronym | Definition |
|---|---|---|
| Administrative Metadata | | Metadata about accessibility. |
| Application Programming Interface | API | Specification and implementation of a way for software to interact (here HTTP API). |
| Backend | BE | Software component encoding the internal logic. |
| Container | CTR | Software packaged into standardized unit for operating-system-level virtualization. |
| Create-Read-Update-Delete | CRUD | Basic operations when interacting with a database (or storage). |
| Database | DB | Collection of related records. |
| Database Management System | DBMS | The software running the databases. |
| Data Catalog | DCAT | Inventory of databases. |
| Data-Lake | DL | Structured, semi-structured, and unstructured data architecture. |
| Declarative Low-Code | | Defining an application only by configuration of components (and minimal explicit transformations). |
| Declarative Programming | | Programming style of expressing logic without prescribing control flow (“what”, not “how”). |
| Descriptive Metadata | | Metadata describing the underlying data. |
| Domain Specific Language | DSL | A formal language designed for a particular application. |
| Extract-Load-Transform | ELT | A typical ingestion process for unstructured data. |
| Extract-Transform-Load | ETL | A typical ingestion process for structured data. |
| Extract-transform-Load-Transform | EtLT | An ingestion process for semi-structured data. |
| Frontend | FE | (Web-based) software component presenting a user interface. |
| Inter-Metadata | | Metadata about data related to the underlying data. |
| Intra-Metadata | | Metadata about the underlying data. |
| Low-Code | | Functionality assembly using high-level prefabricated components. |
| Metadata | MD | All statements about a (tangible or digital) information object. |
| Metadata Catalog | MDCAT | Inventory of metadata databases. |
| Metadata-Lake | MDL | Structured, semi-structured, and unstructured data architecture for metadata management. |
| Metadata-Set | | A record containing metadata. |
| Named Identifier | NI | Protocol for record identifiers. |
| Process Metadata | | Metadata about lineage. |
| Social Metadata | | Metadata about usage and discoverability. |
| Technical Metadata | | Metadata about format and structure. |