DatAasee - A Metadata-Lake for Libraries

DatAasee Architecture Documentation

Version: 0.2

The Metadata-Lake (MDL) DatAasee gathers research metadata and bibliographic data from a multitude of sources, associates them with the underlying data-sets, and provides an HTTP API for access, which is also used by a (prototype) web frontend.

The main goal of the Metadata-Lake is to provide a one-stop shop for research data discovery at university, research, academic, and scientific libraries.

Sections:

  1. Introduction & Goals
  2. Constraints
  3. Context & Scope
  4. Solution Strategy
  5. Building Block View
  6. Runtime View
  7. Deployment View
  8. Crosscutting Concepts
  9. Architectural Decisions
  10. Quality Requirements
  11. Risks & Technical Debt
  12. Glossary

Summary:


1. Introduction & Goals

1.1 Requirements Overview

Research and bibliographic (meta)data is maintained in various distributed databases, and there is no central access point to browse, search, or locate data-sets. The metadata-lake:

Overview

1.2 Quality Goals

| Quality Goal | Associated Scenarios |
|--------------|----------------------|
| Functional Suitability | F0 |
| Transferability | T0 |
| Compatibility | C0 |
| Operability | O0 |
| Maintainability | M0, M1 |

2. Constraints

2.1 Technical Constraints

| Constraint | Explanation |
|------------|-------------|
| Cloud Deployability | To integrate into existing infrastructure and operation environments, a containerized service is required. |
| Interoperability | Data pipelining is required to be compatible with existing systems such as databases. |
| Extensibility | Components such as metadata schemas, data pipelines, and metadata exports are required to be extensible. |

2.2 Organisational Constraints

| Constraint | Explanation |
|------------|-------------|
| OAI-PMH | Many existing data sources provide an OAI-PMH API, which needs to be supported. |
| S3 | File-based ingest must also be possible via object storage, particularly Ceph’s S3 API. |
| K8 | If possible, Kubernetes should be supported (in addition to Compose). |
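
To illustrate what supporting OAI-PMH involves on the harvesting side, here is a minimal sketch of paging through `ListRecords` responses. It is not DatAasee's ingest code (which is declared in the backend processor); the base URL, the function names, and the sample response are assumptions for illustration. Only the `verb`, `metadataPrefix`, and `resumptionToken` arguments come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: no metadataPrefix alongside it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token of a ListRecords response, or None if absent or empty."""
    root = ET.fromstring(response_xml)
    token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Made-up, heavily shortened response used only to exercise the helpers.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

BASE = "https://repository.example.org/oai"  # hypothetical source system
first = list_records_url(BASE)
following = list_records_url(BASE, resumption_token=next_token(SAMPLE))
```

A harvester would repeat the second request until a response carries no (or an empty) resumption token.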

2.3 Conventions

Technical

| Standard | Function |
|----------|----------|
| JSON | Serialization language for all external messages |
| JSON:API | External message format standardization |
| JSON Schema | External message content validation |
| YAML | Internal processor (and prototype frontend) declaration language |
| StrictYAML | Preferred declaration language dialect |
| OpenAPI | External API definition and documentation format |
| MD5 | Raw metadata checksums |
| XXH64 | Identifier hashing |
| Base64URL | Identifier encoding |
| Compose | Deployment and orchestration |
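
As a rough sketch of how the checksum and identifier conventions fit together: a raw record is checksummed with MD5, and a 64-bit hash of the record is Base64URL-encoded into a compact identifier. The Python standard library has no XXH64, so the example swaps in a truncated BLAKE2b digest as a clearly labeled stand-in; its identifiers will therefore differ from real XXH64-based ones. Dropping the `=` padding and all function names are assumptions.

```python
import base64
import hashlib


def raw_checksum(raw: bytes) -> str:
    """MD5 hex digest of a raw metadata record (integrity check, not a security measure)."""
    return hashlib.md5(raw).hexdigest()


def stand_in_hash64(raw: bytes) -> bytes:
    """Stand-in 64-bit hash: BLAKE2b truncated to 8 bytes. This is NOT XXH64."""
    return hashlib.blake2b(raw, digest_size=8).digest()


def encode_identifier(digest: bytes) -> str:
    """Base64URL-encode a digest; stripping the '=' padding is an assumption."""
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


record = b"<record><title>Example</title></record>"  # made-up raw record
checksum = raw_checksum(record)
identifier = encode_identifier(stand_in_hash64(record))
```

An 8-byte digest always yields an 11-character identifier that uses only URL-safe characters, and the same raw record always maps to the same identifier.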

Content

| Standard | Function |
|----------|----------|
| DataCite | Core metadata vocabulary |
| FRBR | Entity relationships |
| Fields of Science | Scientific classification |
| SPDX License List | Software license names |
| ISO 8601 | Date and time formatting |
| ISO 639-1 | Language name abbreviations |
| DOI | Preferred resource identifier |
| ORCID | Preferred creator identifier |
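
To make some of the content conventions concrete, the following sketch checks a few field values against them. It is illustrative only and not part of DatAasee: the function names are hypothetical, the DOI pattern is a common heuristic rather than a normative grammar, and the date check covers only plain `YYYY-MM-DD` calendar dates. The ORCID check implements the ISO 7064 MOD 11-2 check digit that ORCID iDs carry.

```python
import re
from datetime import date

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.[0-9]{4,9}/\S+$")
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def is_iso_date(value: str) -> bool:
    """True if value parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def looks_like_doi(value: str) -> bool:
    """True if value has the usual bare DOI shape (no 'doi:' or URL prefix)."""
    return DOI_PATTERN.match(value) is not None


def is_valid_orcid(orcid: str) -> bool:
    """Verify the ISO 7064 MOD 11-2 check digit of a hyphenated ORCID iD."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for char in digits[:15]:
        total = (total + int(char)) * 2
    check = (12 - total % 11) % 11
    return digits[15] == ("X" if check == 10 else str(check))
```

For example, `is_valid_orcid("0000-0002-1825-0097")` accepts the sample iD used in ORCID's own documentation, while changing its last digit makes the check fail.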

Documentation

| Standard | Function |
|----------|----------|
| divio | Software documentation structure |
| arc42 | Software architecture documentation |
| yasql | Database schema documentation |


3. Context & Scope

Context

3.1 Business Context

| Channel | Description |
|---------|-------------|
| Interact | All unprivileged functionality |
| Search | Query metadata records |
| Control | Monitor, trigger ingests and backups (privileged) |
| Forward | Send metadata record(s) to service |
| Import | Ingest metadata records from source system |

3.2 Technical Context

| Channel | Description |
|---------|-------------|
| Interact | Unprivileged HTTP API |
| Search | Requested and responded through HTTP API |
| Control | Privileged HTTP API |
| Forward | Performed via HTTP |
| Import | Pulled via HTTP |
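
Because external messages on these channels are meant to follow JSON:API (see the technical conventions), a client can sanity-check the top level of a response before using it. The sketch below applies two rules from the JSON:API specification: a document must contain at least one of `data`, `errors`, or `meta`, and `data` and `errors` must not coexist. The function name and the sample payload, including its field names, are made up for illustration and do not reflect actual DatAasee responses.

```python
import json


def check_jsonapi_document(text: str) -> dict:
    """Parse a response body and verify basic JSON:API top-level rules."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("top level must be a JSON object")
    if not doc.keys() & {"data", "errors", "meta"}:
        raise ValueError("need at least one of: data, errors, meta")
    if "data" in doc and "errors" in doc:
        raise ValueError("data and errors must not coexist")
    return doc


# Hypothetical response body; type, id, and attribute names are assumptions.
SAMPLE = (
    '{"data": [{"type": "records", "id": "AAAAAAAAAAA",'
    ' "attributes": {"name": "Example"}}], "meta": {"count": 1}}'
)

document = check_jsonapi_document(SAMPLE)
```

Full validation of message content would be the job of the JSON Schema definitions; this only covers the envelope.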

4. Solution Strategy


5. Building Block View

Level 0 (Outside View)

Outside View

DatAasee

Source Databases (External)

Prototype Web-Frontend (Optional)

Level 1 (Inside View)

Inside View

Database

Backend

Prototype Web-Frontend (Optional)

Level 2 (Container View)

Database

Database

Backend

Backend

Prototype Web-Frontend

Frontend


6. Runtime View

Processes

/ready Endpoint

Ready Endpoint

See the /ready endpoint documentation.


/api Endpoint

API Endpoint

See the /api endpoint documentation.


/schema Endpoint

Schema Endpoint

See the /schema endpoint documentation.


/attributes Endpoint

Attributes Endpoint

See the /attributes endpoint documentation.


/stats Endpoint

Stats Endpoint

See the /stats endpoint documentation.


/metadata Endpoint

Metadata Endpoint

See the /metadata endpoint documentation.


/insert Endpoint

Insert Endpoint

See the /insert endpoint documentation.


/ingest Endpoint

Ingest Endpoint

See the /ingest endpoint documentation.


/backup Endpoint

Backup Endpoint

See the /backup endpoint documentation.


/health Endpoint

Health Endpoint

See the /health endpoint documentation.
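
The endpoints listed in this section can be addressed from a client roughly as sketched below. Only the endpoint names are taken from this document; the base URL, the `GET` default, and the `id` query parameter are assumptions for illustration, and the endpoint documentation remains the authoritative reference for methods, parameters, and privileges. The requests are only constructed here, not sent.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Endpoint names as listed in the runtime view.
ENDPOINTS = (
    "ready", "api", "schema", "attributes", "stats",
    "metadata", "insert", "ingest", "backup", "health",
)


def build_request(base_url, endpoint, query=None, method="GET"):
    """Construct (but do not send) a request for one of the listed endpoints."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    url = urljoin(base_url.rstrip("/") + "/", endpoint)
    if query:
        url += "?" + urlencode(query)
    return Request(url, method=method)


BASE_URL = "http://localhost:8000"  # hypothetical deployment address
ready = build_request(BASE_URL, "ready")
lookup = build_request(BASE_URL, "metadata", query={"id": "AAAAAAAAAAA"})  # "id" is an assumed parameter
```

Passing a request object to `urllib.request.urlopen` would send it to a running instance.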


7. Deployment View

Overview

Level 0

See compose.yaml for deployment details.


8. Crosscutting Concepts

Internal Concepts

Security Concepts

Development Concepts

Operational Concepts


9. Architectural Decisions

Timestamp Title
Status
Decision
Consequences
2024-07-04 Indirect Processor Dependency Updates
Status Approved
Decision Indirect processor dependency updates do not cause a (minor) version update.
Consequences A release image build (of the current version) can be triggered and processor dependencies are updated in the process.
2024-06-03 API Licensing
Status Approved
Decision The OpenAPI license definition is additionally licensed under CC-BY.
Consequences Easier third-party reimplementation of the DatAasee API.
2024-02-21 Use OAI vs Non-OAI metadata format variants
Status Approved
Decision Non-OAI variants of the DC and DataCite formats are supported.
Consequences More lenient (less strict) ingest of fields.
2024-01-17 Compose-only Deployment
Status Approved
Decision Deployment is solely distributed and initiated by the compose.yaml.
Consequences The compose file and orchestrator have central importance.
2023-11-20 Database Storage
Status Approved
Decision Database uses in-container storage, only backups are stored outside.
Consequences Faster database at the price of fixed savepoints.
2023-08-24 Record Identifier
Status Approved
Decision Use xxhash64 / SHA256 of ingested or inserted raw record.
Consequences Identifier is reproducible but not a URL.
2023-08-08 Ingest Modularity
Status Approved
Decision Ingest sources are passed via API to the backend.
Consequences Sources can be maintained outside and appended during runtime.
2023-05-16 Graph Edges
Status Approved
Decision Graph edges are only set by ingest (or other automatic) processes, not by a user.
Consequences Edge semantics need to be machine-interpretable.
2022-12-07 Frontend Language
Status Approved
Decision Use English language only for frontend and metadata labels and comments.
Consequences Additional translations (German) are not prepared for now.
2022-10-10 Only Virtual Storage
Status Approved
Decision No explicit storage component for data, only metadata is managed.
Consequences No interface or instance, e.g. to Ceph, is developed, but URL references (to data storage) are stored.
2022-10-05 API-only Frontend
Status Approved
Decision The HTTP API is the sole frontend, further frontends are only expressions of the API.
Consequences The web frontend can only use the API frontend.
2022-10-04 Declarative First
Status Approved
Decision Prefer declarative (YAML-based) approaches for defining processes and interfaces to reduce free coding and increase robustness.
Consequences Frontrunners Benthos as backend, and Lowdefy (or uteam) as prototype web-frontend.
2022-09-16 Multi-model Database
Status Approved
Decision Use a (property-)graph / document / key-value database as the central catalog component for a maximally flexible data model.
Consequences Frontrunner ArcadeDB (or OrientDB) as database.

10. Quality Requirements

10.1 Quality Requirements

| Quality Category | Quality | ID | Description |
|------------------|---------|----|-------------|
| Functional Suitability | Appropriateness | F0 | DatAasee should fulfill the expected overall functionality. |
| Transferability | Installability | T0 | Installation should work in various container-based environments. |
| Compatibility | Interoperability | C0 | The available protocols (and format parsers) should fit the most common systems. |
| Operability | Ease of Use | O0 | The API should be self-describing, well documented, and follow standards and best practices. |
| Maintainability | Modularity | M0 | New protocols, format parsers, or other pipelines should be implementable without too much effort. |
| Maintainability | Reusability | M1 | The protocol and format parser code serves as sample and documentation. |

10.2 Quality Scenarios

| ID | Scenario |
|----|----------|
| F0 | Stakeholder project evaluation |
| T0 | Setup of DatAasee by a new operator |
| C0 | Ingesting from a new source system |
| O0 | User and (downstream) developer API usage |
| M0 | Extending the compatibility to new systems |
| M1 | Development of a follow-up project to DatAasee |

11. Risks & Technical Debt

| Risk | Description | Mitigation |
|------|-------------|------------|
| DBMS project might cease | ArcadeDB is a small project and carries the risks of a small project. | ArcadeDB is derived from OrientDB, which could serve as a replacement (though not a drop-in one). |
| Processor project might become complicated | Benthos was acquired by Redpanda, which may change its license or that of the connectors. | Switch to the hard fork Bento, or self-maintain. |

12. Glossary

| Term | Acronym | Definition |
|------|---------|------------|
| Metadata | MD | All statements about a (tangible or digital) information object. |
| Metadata-Set | | A record containing metadata. |
| Intra Metadata | | Metadata about the underlying data. |
| Inter Metadata | | Metadata about data related to the underlying data. |
| Descriptive Metadata | | Metadata describing the underlying data. |
| Process Metadata | | Metadata about lineage. |
| Technical Metadata | | Metadata about format and structure. |
| Administrative Metadata | | Metadata about accessibility. |
| Social Metadata | | Metadata about usage and discoverability. |
| Database | DB | Collection of related records. |
| Database Management System | DBMS | The software running the databases. |
| Backend | BE | Software component encoding the internal logic. |
| Frontend | FE | (Web-based) software component presenting a user interface. |
| Container | CTR | Software packaged into a standardized unit for operating-system-level virtualization. |
| Data Catalog | DCAT | Inventory of databases. |
| Metadata Catalog | MDCAT | Inventory of databases of metadata. |
| Data Lake | DL | Architecture for structured, semi-structured, and unstructured data. |
| Metadata Lake | MDL | Architecture for structured, semi-structured, and unstructured data, used for metadata management. |
| Extract-Transform-Load | ETL | A typical ingestion process for structured data. |
| Extract-Load-Transform | ELT | A typical ingestion process for unstructured data. |
| Extract-transform-Load-Transform | EtLT | An ingestion process for semi-structured data. |
| Declarative Programming | | Programming style of expressing logic without prescribing control flow (“what”, not “how”). |
| Low-Code | | Functionality assembly using high-level prefabricated components. |
| Declarative Low-Code | | Defining an application only by configuration of components (and minimal explicit transformations). |
| Application Programming Interface | API | Specification and implementation of a way for software to interact (here HTTP API). |
| Domain Specific Language | DSL | A formal language designed for a particular application. |
| Command-Query-Responsibility-Segregation | CQRS | API pattern separating read and write requests. |