DatAasee - A Metadata-Lake for Libraries

DatAasee Architecture Documentation

Version: 0.3

The principal goal of DatAasee is to provide a library-focussed one-stop shop for research data discovery that also serves as a metadata hub. DatAasee is a Metadata-Lake (MDL) that aggregates and interconnects research metadata and bibliographic data from various data sources. It is accessed via an HTTP API, which the included prototype web front-end uses.

Sections:

Introduction & Goals
Constraints
Context & Scope
Solution Strategy
Building Block View
Runtime View
Deployment View
Crosscutting Concepts
Architectural Decisions
Quality Requirements
Risks & Technical Debt
Glossary

Summary:

Data Architecture: Data-Lake with Metadata Catalog
Software Architecture: 3-Tier Architecture
- Data-Tier Model: Wide, denormalized One-Big-Table (Graph)
- Logic-Tier Type: Semantic layer
- Presentation-Tier Type: HTTP-API (and Web-Frontend)

1. Introduction & Goals

1.1 Requirements Overview

Research and bibliographic (meta)data is maintained in various distributed databases, and there is no central access point to browse, search, or locate data-sets. The metadata-lake:

… allows users to search, filter and browse metadata (and data).
… incorporates metadata of research outputs as well as bibliographic metadata.
… cleans, normalizes, and provides metadata.
… facilitates exports of data/metadata bundles to external repositories.
… integrates with other services and processes.

System Landscape

The database is the core component (included)
The backend encapsulates the database and spans the API (included)
A frontend uses the API (optionally included)
All external and internal communication via HTTP
Imports of sources to the database via the backend (through the API)
Exports to services are triggered externally (through the API)
Consumers can interact (through the API)

1.2 Quality Goals

Quality Goal	Associated Scenarios
Functional Suitability	F0
Transferability	T0
Compatibility	C0
Operability	O0
Maintainability	M0, M1

2. Constraints

2.1 Technical Constraints

Constraint	Explanation
Cloud Deployability	To integrate into existing infrastructure and operation environments, a containerized service is required.
Interoperability	Data pipelining is required to be compatible with existing systems such as databases.
Extensibility	Components such as metadata schemas, data pipelines, and metadata exports are required to be extensible.

2.2 Organisational Constraints

Constraint	Explanation
OAI-PMH	Many existing data sources provide an OAI-PMH API, which needs to be supported.
S3	File-based ingest also has to be possible via object storage, particularly Ceph’s S3 API.
K8s	If possible, Kubernetes should be supported (in addition to Compose).
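To illustrate what supporting OAI-PMH means on the protocol level, the following is a minimal Python sketch of paging through `ListRecords` responses. It is not the DatAasee implementation (ingest is realized in the backend's pipelines); the function names and the example base URL are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: only the verb accompanies it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumption token of a response page, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvester would request `list_records_url(base)` first and then keep requesting with the token returned by `next_token` until it yields `None`.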

2.3 Conventions

Technical

Standard	Function
JSON	Serialization language for all external messages
JSON:API	External message format standardization
JSON Schema	External message content validation
YAML	Internal processor (and prototype frontend) declaration language
StrictYAML	Preferred declaration language dialect
OpenAPI	External API definition and documentation format
MD5	Raw metadata checksums
XXH64	Identifier Hashing
Base64URL	Identifier Encoding
Compose	Deployment and orchestration
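The checksum, hashing, and encoding conventions above can be sketched in a few lines of Python. This is an illustration under assumptions, not the DatAasee code: XXH64 is not part of the Python standard library, so an 8-byte BLAKE2b digest stands in for the 64-bit hash here, and stripping the Base64 padding is likewise an assumption.

```python
import base64
import hashlib


def raw_checksum(raw_record: bytes) -> str:
    """MD5 checksum of the raw metadata, as a hex string."""
    return hashlib.md5(raw_record).hexdigest()


def record_identifier(raw_record: bytes) -> str:
    """Derive a short, URL-safe identifier from a 64-bit hash of the raw record."""
    # Stand-in for XXH64 (assumption): an 8-byte BLAKE2b digest.
    digest = hashlib.blake2b(raw_record, digest_size=8).digest()
    # Base64URL without padding (assumption): 8 bytes encode to 11 characters.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Because both functions depend only on the raw record bytes, re-ingesting an unchanged record reproduces the same checksum and identifier.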

Content

Standard	Function
DataCite	Core metadata vocabulary
OpenWEMI	Entity relationships
Fields of Science	Scientific classification
SPDX License List	Software license names
Creative Commons	License names
RightsStatements.org	Copyright classification
ISO 8601	Date and time formatting
ISO 639-1	Language name abbreviations
DOI	Preferred resource identifier
ORCID	Preferred creator identifier
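Two of the content conventions above lend themselves to a mechanical check. The following Python sketch formats timestamps per ISO 8601 and verifies the check digit of an ORCID iD (ISO 7064 MOD 11-2). The function names are hypothetical, and whether DatAasee performs these exact checks is not stated in this document.

```python
from datetime import datetime, timezone


def iso8601_utc(moment: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def orcid_is_valid(orcid: str) -> bool:
    """Check the ISO 7064 MOD 11-2 check digit of an ORCID iD."""
    chars = orcid.replace("-", "")
    if len(chars) != 16 or not chars.isascii() or not chars[:15].isdigit():
        return False
    total = 0
    for digit in chars[:15]:
        total = (total + int(digit)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[15].upper() == expected
```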

Documentation

3. Context & Scope

Context

3.1 Business Context

Channel	Description
Interact	All unprivileged functionality
Search	Query metadata records
Control	Monitor, trigger ingests and backups (privileged)
Forward	Send metadata record(s) to service
Import	Ingest metadata records from source system

3.2 Technical Context

Channel	Description
Interact	Unprivileged `HTTP` API
Search	Requested and responded through `HTTP` API
Control	Privileged `HTTP` API
Forward	Performed via `HTTP`
Import	Pulled via `HTTP`

4. Solution Strategy

Three-tier architecture:
- HTTP-API is the primary presentation layer (part of the backend)
- Web frontend (exclusively using API) is secondary presentation tier
Two main components:
- Database (data tier)
- Backend (state-less application tier)
All components are packaged in containers for:
- infrastructure compatibility
- cloud deployability
All messaging happens via HTTP APIs:
- internal between components (containers)
- external via endpoints (including frontend)
Source codes and external messages are in plain text and in standardized formats:
- External messages are in JSON, formatted as JSON:API, and documented by JSON Schemas.
- Declarative sources are in YAML, following StrictYAML.
Separate horizontal scaling of database and backend for high availability:
- Database has replication capability
- Backend is stateless, hence it can be replicated without coordination
Further components are optional:
- Storage is not necessary since only metadata is handled and payload data is only referenced
- Web-Frontend uses HTTP API (prototype is included)
Declarative realization for high level of abstraction via:
- Internal Queries: ArcadeDB SQL (external queries may use various query languages)
- Processes: Configuration-based + Bloblang (data mapping language)
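As an illustration of the external message convention (JSON, formatted as JSON:API), the following Python sketch wraps a metadata record in a JSON:API top-level document and builds an error document. The resource type name and the attribute keys are hypothetical; the actual message contents are defined by the JSON Schemas of the API.

```python
import json
from typing import Any


def jsonapi_document(record_id: str, attributes: dict[str, Any],
                     resource_type: str = "records") -> str:
    """Wrap one metadata record as a JSON:API top-level document."""
    document = {
        "jsonapi": {"version": "1.1"},
        "data": {
            "type": resource_type,  # hypothetical type name
            "id": record_id,
            "attributes": attributes,
        },
    }
    return json.dumps(document, ensure_ascii=False)


def jsonapi_error(status: int, title: str) -> str:
    """Build a JSON:API error document; `data` and `errors` never coexist."""
    return json.dumps({"errors": [{"status": str(status), "title": title}]})
```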

5. Building Block View

Level 0 (Outside View)

Outside View

DatAasee

Imports metadata from source systems via pull
Provides API to interact with metadata via endpoints
Exports metadata to other services triggered via endpoints

Source Databases (External)

Known URLs (i.e. service or database endpoints) holding metadata
Bulk ingested
Pollable regularly for updates

Prototype Web-Frontend (Optional)

Included prototype frontend
External to core system
Template and documentation for a production frontend

Level 1 (Inside View)

Inside View

Database

Container holding an ArcadeDB database system
This core component stores and serves all metadata
A system backup saves its database

Backend

Container holding a Benthos stream processor
This component spans the external API endpoints and translates between data formats as well as between API and database
Has no state

Prototype Web-Frontend (Optional)

Container holding a Lowdefy web-frontend
This optional component renders a web-based user interface
Uses the API endpoints, but from the internal network; thus the frontend does not need the external port

Level 2 (Container View)

Database

The native schema is created via SQL (during build)
Enumerated types are inserted via SQL (during build)
The initialization script loads the schema and preloaded data

Backend

API schemas are deposited
Custom configurable components (templates) are defined
Reusable fixed components (resources) are defined

Prototype Web-Frontend

Frontend

Pages are defined via YAML
Static assets (images and styles) are loaded
Reused template blocks are loaded

6. Runtime View

System Endpoints

`/api` Endpoint (Public)

API Endpoint

See the api endpoint documentation and source file.

`/ready` Endpoint (Public)

Ready Endpoint

See the ready endpoint documentation and source file.

`/health` Endpoint (Private)

Health Endpoint

See the health endpoint documentation and source file.

`/backup` Endpoint (Private, External Write)

Backup Endpoint

See the backup endpoint documentation and source file.

`/ingest` Endpoint (Private, External Read)

Ingest Endpoint

See the ingest endpoint documentation and source file.

Metadata Endpoints

`/schema` Endpoint (Public, Cached)

Schema Endpoint

See the schema endpoint documentation and source file.

`/attributes` Endpoint (Public, Cached)

Attributes Endpoint

See the attributes endpoint documentation and source file.

Data Endpoints

`/stats` Endpoint (Public, Cached)

Stats Endpoint

See the stats endpoint documentation and source file.

`/sources` Endpoint (Public, Cached)

Sources Endpoint

See the sources endpoint documentation and source file.

`/metadata` Endpoint (Public)

Metadata Endpoint

See the metadata endpoint documentation and source file.

`/insert` Endpoint (Private)

Insert Endpoint

See the insert endpoint documentation and source file.
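The endpoints of this section, with their grouping and visibility as given in the headings, can be summarized as a small catalog. The Python sketch below only restates those headings as data; the class and field names are hypothetical and do not mirror any DatAasee source file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    path: str
    group: str            # "system", "metadata", or "data"
    public: bool
    cached: bool = False
    note: str = ""        # e.g. "external write"


ENDPOINTS = (
    Endpoint("/api", "system", public=True),
    Endpoint("/ready", "system", public=True),
    Endpoint("/health", "system", public=False),
    Endpoint("/backup", "system", public=False, note="external write"),
    Endpoint("/ingest", "system", public=False, note="external read"),
    Endpoint("/schema", "metadata", public=True, cached=True),
    Endpoint("/attributes", "metadata", public=True, cached=True),
    Endpoint("/stats", "data", public=True, cached=True),
    Endpoint("/sources", "data", public=True, cached=True),
    Endpoint("/metadata", "data", public=True),
    Endpoint("/insert", "data", public=False),
)


def paths(public: bool) -> list[str]:
    """List endpoint paths by visibility."""
    return [e.path for e in ENDPOINTS if e.public is public]
```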

7. Deployment View

Overview

Level 0

See compose.yaml for deployment details.

8. Crosscutting Concepts

Internal Concepts

All components are separately containerized.
All communication between components is performed via HTTP and in JSON.

Security Concepts

Read access is granted to every user without limitation.
Write access (trigger ingest or backup, insert record) is only granted to the “admin” user.
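From a client's perspective, this split could look like the following Python sketch, which attaches credentials only to requests for private endpoints. HTTP Basic authentication, the base URL, and the HTTP methods are assumptions for illustration; this section only states that privileged access is tied to the “admin” user.

```python
import base64
import urllib.request
from typing import Optional

# Endpoints marked "Private" in the Runtime View.
PRIVATE_PATHS = {"/health", "/backup", "/ingest", "/insert"}


def build_request(base_url: str, path: str,
                  admin_password: Optional[str] = None,
                  method: str = "GET") -> urllib.request.Request:
    """Prepare a request; private endpoints get admin credentials attached."""
    request = urllib.request.Request(base_url.rstrip("/") + path, method=method)
    if path in PRIVATE_PATHS:
        if admin_password is None:
            raise PermissionError(f"{path} requires the admin user")
        # Assumption: HTTP Basic authentication with the fixed user name "admin".
        token = base64.b64encode(f"admin:{admin_password}".encode("utf-8"))
        request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request
```

The returned object can be passed to `urllib.request.urlopen` once a running instance is available.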

Development Concepts

Container images are multi-stage with a generic base stage and a custom develop and release stage.
All images run a health check.
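A container health check typically reduces to a single probe that maps to an exit code. The following Python sketch shows that idea against a ready endpoint; the URL is supplied by the caller and the function names are hypothetical, so this is not the check shipped in the images.

```python
import urllib.error
import urllib.request
from typing import Callable


def is_ready(url: str, opener: Callable = urllib.request.urlopen,
             timeout: float = 2.0) -> bool:
    """Return True if a GET on the given URL answers with a 2xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def exit_code(url: str, **kwargs) -> int:
    """Map readiness to a process exit code (0 = healthy, 1 = unhealthy)."""
    return 0 if is_ready(url, **kwargs) else 1
```

The `opener` parameter exists only so the probe can be exercised without a network connection.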

Operational Concepts

All components provide (internal) ready endpoints and write logs to the standard output.
Secrets are mounted as files.
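Reading a file-mounted secret is a one-liner in most languages. The Python sketch below assumes the Compose default mount directory `/run/secrets`; the secret name passed in is up to the deployment and hypothetical here.

```python
from pathlib import Path


def read_secret(name: str, directory: str = "/run/secrets") -> str:
    """Read a secret that the orchestrator mounted as a file."""
    # Trailing newlines are stripped, since editors and `echo` often add one.
    return Path(directory, name).read_text(encoding="utf-8").strip()
```

Compared to environment variables, file-mounted secrets do not show up in process listings or container inspection output.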

9. Architectural Decisions

Timestamp	Title
Status	…
Decision	…
Consequences	…

2025-04-11	Title
Status	Approved
Decision	Minimize database response post-processing.
Consequences	Shift transformation workload to ArcadeDB.

2024-12-03	Simplified API Definition
Status	Approved
Decision	Use `api.csv` as ground truth for API definition.
Consequences	The API definition is better parseable and more complete; furthermore the OpenAPI file is now dependent on the `api.csv`.
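How an OpenAPI file can depend on a CSV ground truth is sketched below in Python. The column layout of `api.csv` is not described in this document, so the columns `path`, `method`, and `summary` as well as the sample row are hypothetical; only the generated OpenAPI skeleton follows the OpenAPI 3 structure.

```python
import csv
import io

# Hypothetical sample; the real api.csv columns may differ.
SAMPLE_CSV = "path,method,summary\n/ready,GET,Readiness probe\n"


def openapi_from_csv(csv_text: str, title: str, version: str) -> dict:
    """Generate a minimal OpenAPI 3 document from CSV rows."""
    paths: dict = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        operation = {
            "summary": row["summary"],
            "responses": {"200": {"description": "OK"}},
        }
        # Several rows may share a path and differ only in the HTTP method.
        paths.setdefault(row["path"], {})[row["method"].lower()] = operation
    return {
        "openapi": "3.0.3",
        "info": {"title": title, "version": version},
        "paths": paths,
    }
```

Generating the OpenAPI file this way keeps a single editable source and makes the derived file reproducible.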

2024-10-23	Title
Status	Approved
Decision	Base containers for database and backend are the current Ubuntu LTS (i.e. 24.04).
Consequences	Full `libc` support compared to Alpine and obvious release date and support horizon from version number compared to Debian.

2024-07-04	Indirect Processor Dependency Updates
Status	Approved
Decision	Indirect processor dependency updates do not cause a (minor) version update.
Consequences	A release image build (of the current version) can be triggered and processor dependencies are updated in the process.

2024-06-03	API Licensing
Status	Approved
Decision	The OpenAPI license definition is additionally licensed under CC-BY.
Consequences	Easier third-party reimplementation of the DatAasee API.

2024-02-21	Use OAI vs Non-OAI metadata format variants
Status	Approved
Decision	Non-OAI variants of the DC and DataCite formats are supported.
Consequences	More lenient ingest of fields.

2024-01-17	Compose-only Deployment
Status	Approved
Decision	Deployment is solely distributed and initiated by the `compose.yaml`.
Consequences	The compose file and orchestrator have central importance.

2023-11-20	Database Storage
Status	Approved
Decision	Database uses in-container storage, only backups are stored outside.
Consequences	Faster database at the price of fixed savepoints.

2023-08-24	Record Identifier
Status	Approved
Decision	Use xxhash64 / SHA256 of ingested or inserted raw record.
Consequences	Identifier is reproducible but not a URL.

2023-08-08	Ingest Modularity
Status	Approved
Decision	Ingest sources are passed via API to the backend.
Consequences	Sources can be maintained outside and appended during runtime.

2023-05-16	Graph Edges
Status	Approved
Decision	Graph edges are only set by ingest (or other automatic) processes, not by a user.
Consequences	Edge semantics need to be machine-interpretable.

2022-12-07	Frontend Language
Status	Approved
Decision	Use English language only for frontend and metadata labels and comments.
Consequences	Additional translations (German) are not prepared for now.

2022-10-10	Only Virtual Storage
Status	Approved
Decision	No explicit storage component for data, only metadata is managed.
Consequences	No storage interface or instance (e.g. to Ceph) is developed, but URL references (to data storage) are stored.

2022-10-05	API-only Frontend
Status	Approved
Decision	The HTTP API is the sole frontend, further frontends are only expressions of the API.
Consequences	The web frontend can only use the HTTP API.

2022-10-04	Declarative First
Status	Approved
Decision	Prefer declarative (YAML-based) approaches for defining processes and interfaces to reduce free coding and increase robustness.
Consequences	Frontrunners are Benthos as backend and Lowdefy (or uteam) as prototype web-frontend.

2022-09-16	Multi-model Database
Status	Approved
Decision	Use (property)-graph / document / key-value database as central catalog component for maximal flexible data model.
Consequences	Frontrunner is ArcadeDB (or OrientDB) as database.

10. Quality Requirements

10.1 Quality Requirements

Quality Category	Quality	ID	Description
Functional Suitability	Appropriateness	F0	DatAasee should fulfill the expected overall functionality.
Transferability	Installability	T0	Installation should work in various container-based environments.
Compatibility	Interoperability	C0	The available protocols (and format parsers) should fit the most common systems.
Operability	Ease of Use	O0	The API should be self-describing, well documented, and follow standards and best practices.
Maintainability	Modularity	M0	New protocols, format parsers or other pipelines should be implementable without too much effort.
Maintainability	Reusability	M1	The protocol and format parser codes serve as sample and documentation.

10.2 Quality Scenarios

ID	Scenario
F0	Stakeholder project evaluation
T0	Setup of DatAasee by a new operator
C0	Ingesting from a new source system
O0	User and (downstream) developer API Usage
M0	Extending the compatibility to new systems
M1	Development of a follow-up project to DatAasee

11. Risks & Technical Debt

Risk	Description	Mitigation
DBMS project might cease	`ArcadeDB` is a small project which has small-project risks	However, `ArcadeDB` is derived from `OrientDB`, which could be a replacement (but not drop-in).
Processor dependency hell	`Benthos` has many dependencies.	Consider rewrite with minimal dependencies.
Processor project might complicate	`Benthos` was acquired by “Redpanda”, which may change its license or that of the connectors.	Use the hard fork `bento` or self-maintain.

12. Glossary

Term	Acronym	Definition
Metadata	MD	All statements about a (tangible or digital) information object.
Metadata-Set		A record containing metadata.
Intra Metadata		Metadata about the underlying data.
Inter Metadata		Metadata about data related to the underlying data.
Descriptive Metadata		Metadata describing the underlying data.
Process Metadata		Metadata about lineage.
Technical Metadata		Metadata about format and structure.
Administrative Metadata		Metadata about accessibility.
Social Metadata		Metadata about usage and discoverability.
Database	DB	Collection of related records.
Database Management System	DBMS	The software running the databases.
Backend	BE	Software component encoding the internal logic.
Frontend	FE	(Web-based) software component presenting a user interface.
Container	CTR	Software packaged into a standardized unit for operating-system-level virtualization.
Data Catalog	DCAT	Inventory of databases.
Metadata Catalog	MDCAT	Inventory of databases of metadata.
Data Lake	DL	Structured, semi-structured, and unstructured data architecture.
Metadata Lake	MDL	Structured, semi-structured, and unstructured data architecture for metadata management.
Extract-Transform-Load	ETL	A typical ingestion process for structured data.
Extract-Load-Transform	ELT	A typical ingestion process for unstructured data.
Extract-transform-Load-Transform	EtLT	An ingestion process for semi-structured data.
Declarative Programming		Programming style of expressing logic without prescribing control flow (“what”, not “how”).
Low-Code		Functionality assembly using high-level prefabricated components.
Declarative Low-Code		Defining an application only by configuration of components (and minimal explicit transformations).
Application Programming Interface	API	Specification and implementation of a way for software to interact (here HTTP API).
Domain Specific Language	DSL	A formal language designed for a particular application.
Command-Query-Responsibility-Segregation	CQRS	API pattern separating read and write requests.
