DatAasee - A Metadata-Lake for Libraries

DatAasee Software Documentation

Version: 0.2

DatAasee is a metadata-lake for centralizing bibliographic and scientific metadata from various sources. It increases research data findability and discoverability as well as metadata availability, and thus supports FAIR research and research reporting in university, research, academic, and scientific libraries.

Particularly, DatAasee is developed for and by the University and State Library of Münster.


Explanations

In this section, understanding-oriented explanations are collected.

Overview:

About

Features

Components

DatAasee uses a three-tier architecture with these separately containerized components:

| Function | Abstraction | Tier | Product |
|---|---|---|---|
| Metadata Catalog | Multi-Model Database | Data | ArcadeDB |
| EtLT Processor | Declarative Streaming Processor | Logic | Benthos |
| Web Frontend | Declarative Web Framework | Presentation | Lowdefy |

Design

Data Model

The internal data model is based on the one-big-table (OBT) approach, with the exception of linked enumerated dimensions (look-up tables), making it effectively a denormalized wide table with a star schema, named metadata.

EtLT Process

Combining the ETL (Extract-Transform-Load / schema-on-write) and ELT (Extract-Load-Transform / schema-on-read) concepts, processing is built upon the EtLT approach:

Particularly, this means “EtL” happens (batch-wise) during ingest, while “T” occurs when requested.

Security

Secrets:

Infrastructure:

Interface:


How-Tos

In this section, step-by-step guides for real-world problems are listed.

Overview:

Prerequisite

The (virtual) machine deploying DatAasee requires docker-compose or podman-compose. See also the container engine compatibility.
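
To verify that a suitable compose orchestrator is available, its version can be queried, for example:

$ docker compose version

or:

$ podman-compose version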

Resources

The compute and memory resources for DatAasee can be configured via the compose.yaml. Overall, a bare-metal machine or virtual machine requires:

So, a Raspberry Pi would be sufficient. In terms of DatAasee components this breaks down to:

Note that resource and system requirements depend on load; in particular, the database and backend are under heavy load during ingest. After an ingest, (new) metadata records are interrelated, which also causes heavy database load. Generally, the database drives the overall performance. Thus, to improve performance, first try increasing the memory for the database component (e.g. from 4G to 8G).
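
To see how heavily the components are actually loaded, for example before increasing the database memory, the container statistics can be inspected; a minimal sketch assuming Docker and the default compose project name dataasee:

$ docker stats --no-stream dataasee-database-1 dataasee-backend-1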

Deploy

$ mkdir -p backup  # or: ln -s /path/to/backup/volume backup
$ wget https://raw.githubusercontent.com/ulbmuenster/dataasee/0.2/compose.yaml
$ echo -n 'password1' > dl_pass && echo -n 'password2' > db_pass && docker compose up -d; rm -f dl_pass db_pass; history -d $(history 1)

NOTE: The required secrets are kept temporarily in the files dl_pass and db_pass.

NOTE: Make sure to delete (or encrypt) secret files dl_pass and db_pass after use!

NOTE: To customize your deploy, use these environment variables.

NOTE: The runtime configuration environment variables can be stored in an .env file.

NOTE: A custom backup location can alternatively also be specified inside the compose.yaml.
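
For example, a minimal .env file could look like this (values are illustrative; see the runtime configuration reference for all variables):

DL_VERSION=0.2
DL_BASE=http://my.url
DL_PORT=8343
DL_BACKUP=/path/to/backup/volume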

Test

$ wget -SqO- http://localhost:8343/api/v1/ready

NOTE: The default port for the HTTP API is 8343.

Shutdown

$ docker-compose down

NOTE: A (database) backup is automatically triggered on every shutdown.

Ingest

$ wget -O- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data \
  '{"source":"https://my.url/to/oai","method":"oai-pmh","format":"mods","steward":"https://my.url/identifying/steward"}'

NOTE: A (database) backup is automatically triggered after every ingest.

Backup Manually

$ wget -O- http://localhost:8343/api/v1/backup --user admin --ask-password --post-data=''

Logs

$ docker-compose logs backend
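
The logs of the other components can be inspected the same way, using their compose service names:

$ docker-compose logs database
$ docker-compose logs frontend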

Update

$ docker compose down
$ docker compose pull
$ echo -n 'password1' > dl_pass && echo -n 'password2' > db_pass && docker compose up -d; rm -f dl_pass db_pass; history -d $(history 1)

NOTE: “Update” means that, if available, new images of the same DatAasee version with updated dependencies will be installed, whereas “Upgrade” means that a new version of DatAasee will be installed.

Upgrade

$ docker compose down
$ echo -n 'password1' > dl_pass && echo -n 'password2' > db_pass && DL_VERSION=0.3 docker compose up -d; rm -f dl_pass db_pass; history -d $(history 1)

NOTE: docker-compose restart cannot be used here because environment variables (such as DL_VERSION) are not updated when using restart.

NOTE: Make sure to put the DL_VERSION variable also into the .env file for a permanent upgrade.
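
For example, to persist the new version in an .env file (appending to an existing file, or creating one if absent):

$ echo 'DL_VERSION=0.3' >> .env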

Web Interface (Prototype)

NOTE: The default port for the web frontend is 8000 for development and 80 for deployment.

Index Screenshot

Filter Screenshot

Query Screenshot

Overview Screenshot

About Screenshot

Fetch Screenshot

Insert Screenshot

Admin Screenshot

API Indexing

Add the JSON object below to the apis array in your global apis.json API index.

{
  "name": "DatAasee API",
  "description": "The DatAasee API enables research data search and discovery via metadata",
  "keywords": ["Metadata"],
  "attribution": "DatAasee",
  "baseURL": "http://your-dataasee.url/api/v1",
  "properties": [
    {
      "type": "InterfaceLicense",
      "url": "https://spdx.org/licenses/MIT.html"
    },
    {
      "type": "x-openapi",
      "url": "http://your-dataasee.url/api/v1/api"
    }
  ]
}

References

In this section, technical descriptions are summarized.

Overview:

HTTP-API

The HTTP-API is served under http://<your-url-here>/api/v1 and provides the following endpoints:

| Method | Endpoint | Type | Summary |
|---|---|---|---|
| GET | /ready | system | Return service status |
| GET | /api | special | Return API specification and schemas |
| GET | /schema | metadata | Return database schema |
| GET | /attributes | metadata | Return enumerated properties |
| GET | /stats | data | Return statistics about records |
| GET | /metadata | data | Return metadata record(s) |
| POST | /insert | data | Create new record |
| POST | /ingest | system | Trigger ingest from source |
| POST | /backup | system | Trigger database backup |
| POST | /health | system | Return service health |
| GET | /export | data | TODO: |
| GET | /sru | data | TODO: |
| POST | /forward | system | TODO: |

NOTE: The base path for all endpoints is /api/v1.

NOTE: All GET requests are unchallenged; all POST requests are challenged, handled via “Basic Authentication”.

NOTE: All request and response bodies have content type JSON, and if provided, the Content-Type HTTP header must be application/json!

NOTE: As the metadata-lake’s data is metadata, a type “data” means metadata, and a type “metadata” means metadata about metadata.

NOTE: Responses follow the JSON:API format.

NOTE: The id property is the server’s Unix timestamp.
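
For illustration, a challenged POST request can also be sent with curl instead of wget; curl prompts for the password when only the username is given, and the Content-Type header is set explicitly per the note above:

$ curl -u admin -H 'Content-Type: application/json' -d '' http://localhost:8343/api/v1/backup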


/ready Endpoint

Returns a boolean indicating whether the service is ready.

NOTE: The ready endpoint can be used as readiness probe.

Status:

Example:

Get service readiness:

$ wget -qO- http://localhost:8343/api/v1/ready

/api Endpoint

Returns OpenAPI specification if no parameter is given, otherwise returns a request or response schema.

NOTE: In case of a successful request, the response is NOT in the JSON:API format, but the requested JSON file directly.

Statuses:

Examples:

Get OpenAPI definition:

$ wget -qO- http://localhost:8343/api/v1/api

Get ingest endpoint request schema:

$ wget -qO- http://localhost:8343/api/v1/api?request=ingest

Get metadata endpoint response schema:

$ wget -qO- http://localhost:8343/api/v1/api?response=metadata

/schema Endpoint

Returns internal metadata schema.

Statuses:

Example:

Get native metadata schema:

$ wget -qO- http://localhost:8343/api/v1/schema

/attributes Endpoint

Returns list of enumerated attribute values.

Statuses:

Example:

Get all enumerated attributes:

$ wget -qO- http://localhost:8343/api/v1/attributes

Get language attributes:

$ wget -qO- http://localhost:8343/api/v1/attributes?type=languages

/stats Endpoint

Returns statistics about records.

Statuses:

Example:

$ wget -qO- http://localhost:8343/api/v1/stats

/metadata Endpoint

Fetches, searches, filters, or queries metadata record(s).

NOTE: Only idempotent read operations are permitted in custom queries.

NOTE: A full-text search always matches all argument terms (AND-based) in titles, descriptions, and keywords in any order, while accepting * as wildcard and _ to build phrases.

Statuses:

Examples:

Get record by record identifier:

$ wget -qO- http://localhost:8343/api/v1/metadata?id=

Search records by single filter:

$ wget -qO- http://localhost:8343/api/v1/metadata?language=chinese

Search records by multiple filters:

$ wget -qO- http://localhost:8343/api/v1/metadata?resourcetype=book&language=german

Search records by full-text for word “History”:

$ wget -qO- http://localhost:8343/api/v1/metadata?search=History

Search records by full-text and filter, oldest first:

$ wget -qO- http://localhost:8343/api/v1/metadata?search=Geschichte&resourcetype=book&language=german&newest=false

Search records by custom SQL query:

$ wget -qO- http://localhost:8343/api/v1/metadata?language=sql&query=SELECT%20FROM%20metadata%20LIMIT%2010
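
Search records by full-text using a wildcard or a phrase (illustrative examples of the wildcard and phrase syntax noted above):

$ wget -qO- 'http://localhost:8343/api/v1/metadata?search=Histor*'
$ wget -qO- 'http://localhost:8343/api/v1/metadata?search=History_of_Science'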

/insert Endpoint

Inserts a new record into the database, parsing it if necessary.

NOTE: This endpoint is meant for metadata records that cannot be ingested, such as a report on ingested sources, or for testing; general use is discouraged. For details on the request body, see the associated JSON schema.

Status:

Example:

Insert record with given fields: TODO:

$ wget -qO- http://localhost:8343/api/v1/insert --user admin --ask-password --post-file=myinsert.json
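
A hypothetical myinsert.json sketch for the command above, loosely modeled on the mandatory descriptive fields of the native schema; the field names, value formats, and overall request shape shown here are assumptions, and the insert request JSON schema is authoritative:

{
  "name": "Example Dataset",
  "creators": [ { "name": "Jane Doe", "data": "https://orcid.org/0000-0000-0000-0000" } ],
  "publisher": "Example University",
  "publicationYear": 2024,
  "resourceType": "dataset",
  "identifiers": [ { "name": "doi", "data": "https://doi.org/10.0000/example" } ]
}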

/ingest Endpoint

Trigger ingest from data source.

NOTE: To test if the server is busy, send an empty (POST) body to this endpoint. HTTP status 400 means available, status 503 means currently ingesting.

NOTE: The method and format are case-sensitive.

Status:

Example:

Start ingest from a given source:

$ wget -qO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data='{"source":"https://datastore.uni-muenster.de/oai", "method":"oai-pmh", "format":"datacite", "steward":"forschungsdaten@uni-muenster.de"}'
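
Check whether an ingest is already running by sending an empty request body, as described in the note above; the -S flag prints the HTTP status, where 400 means available and 503 means an ingest is in progress:

$ wget -SqO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data=''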

/backup Endpoint

Trigger database backup.

Status:

Example:

$ wget -qO- http://localhost:8343/api/v1/backup --user admin --ask-password --post-data=''

/health Endpoint

Returns internal status of service components.

NOTE: The health endpoint can be used as liveness probe.

Status:

Example:

Get service health:

$ wget -qO- http://localhost:8343/api/v1/health --user admin --ask-password --post-data=''

/export Endpoint

TODO:

/sru Endpoint

TODO:

/forward Endpoint

TODO:

Ingest Protocols

Ingest Encodings

Currently, XML (eXtensible Markup Language) is the sole encoding for ingested metadata.

Ingest Formats

Native Schema

| Key | Class | Entry | Type | Constraints |
|---|---|---|---|---|
| schemaVersion | Process | Automatic | Integer | min 0 |
| recordId | Process | Automatic | String | max 31 |
| metadataQuality | Process | Automatic | String | max 255 |
| dataSteward | Process | Automatic | String | max 4095 |
| source | Process | Automatic | String | max 4095 |
| createdAt | Process | Automatic | Datetime | |
| updatedAt | Process | Automatic | Datetime | |
| | | | | |
| sizeBytes | Technical | Automatic | Integer | min 0 |
| fileFormat | Technical | Automatic | String | max 255 |
| dataLocation | Technical | Automatic | String | max 4095, regexp |
| | | | | |
| numberDownloads | Social | Automatic | Integer | min 0 |
| keywords | Social | Optional | String | max 255 |
| categories | Social | Optional | List(String) | max 4 |
| | | | | |
| name | Descriptive | Mandatory | String | max 255 |
| creators | Descriptive | Mandatory | List(pair) | max 255 |
| publisher | Descriptive | Mandatory | String | min 1, max 255 |
| publicationYear | Descriptive | Mandatory | Integer | min -9999, max 9999 |
| resourceType | Descriptive | Mandatory | Link(attribute) | resourceTypes |
| identifiers | Descriptive | Mandatory | List(pair) | max 255 |
| | | | | |
| synonyms | Descriptive | Optional | List(pair) | max 255 |
| language | Descriptive | Optional | Link(attribute) | languages |
| subjects | Descriptive | Optional | List(pair) | max 255 |
| version | Descriptive | Optional | String | max 255 |
| license | Descriptive | Optional | Link(pair) | licenses |
| rights | Descriptive | Optional | String | max 65535 |
| project | Descriptive | Optional | Embedded(pair) | |
| fundings | Descriptive | Optional | List(pair) | max 255 |
| description | Descriptive | Optional | String | max 65535 |
| message | Descriptive | Optional | String | max 65535 |
| externalItems | Descriptive | Optional | List(pair) | max 255 |
| | | | | |
| rawType | Raw | Optional | String | max 255 |
| raw | Raw | Optional | String | max 1048575 |
| rawChecksum | Raw | Optional | String | max 255 |

NOTE: See also the schema diagram: schema.md

NOTE: The preloaded set of categories (see preload.sql) is highly opinionated.

Helper types

attributes:

| Property | Type | Constraints |
|---|---|---|
| name | String | min 3, max 255 |
| also | List(String) | |

pair:

| Property | Type | Constraints |
|---|---|---|
| name | String | max 255 |
| data | String | max 4095, regexp |

Global Metadata

Each schema property has a label; additionally, the descriptive properties have a comment property.

| Key | Type | Comment |
|---|---|---|
| label | String | For UI labels |
| comment | String | For UI helper texts |

Interrelation Edges

| Type | Comment |
|---|---|
| isRelatedTo | Base edge type |
| isNewVersionOf | Derived from isRelatedTo |
| isDerivedFrom | Derived from isRelatedTo |
| isPartOf | Derived from isRelatedTo |
| isSameExpressionAs | Derived from isRelatedTo |
| isSameManifestationAs | Derived from isRelatedTo |

Ingestable to Native Schema Crosswalk

TODO: Add sub elements

| DatAasee | DataCite | DC | MARC | MODS |
|---|---|---|---|---|
| name | titles | title | 245, 130 | titleInfo, part |
| creators | creators, contributors | creator, contributor | 100, 700 | name, relatedItem |
| publisher | publisher | publisher | 260, 264 | originInfo |
| publicationYear | publicationYear | date | 260, 264 | originInfo, part, recordInfo |
| resourceType | resourceType | type | 007 | genre |
| identifiers | identifier, alternateIdentifiers | identifier | 001, 020, 856 | identifier, recordInfo |
| synonyms | titles | title | 210, 222, 240, 242, 246, 247 | titleInfo |
| language | language | language | 008, 041 | language |
| subjects | subjects | subjects | 655, 689 | subject |
| version | version | | 250 | |
| license | rights | | | accessCondition |
| rights | | rights | 506, 540 | |
| project | | | | |
| fundings | fundingReferences | | | |
| description | description | description | 520 | |
| message | | format | 500 | note |
| externalItems | relatedIdentifiers | identifier | | identifier |
| | | | | |
| isRelatedTo | relatedItems, relatedIdentifiers | related | 773 | relatedItem |
| isNewVersionOf | relatedItems, relatedIdentifiers | | | relatedItem |
| isDerivedFrom | relatedItems, relatedIdentifiers | | | relatedItem |
| isPartOf | relatedItems, relatedIdentifiers | | | relatedItem |
| isSameExpressionAs | | | | relatedItem |
| isSameManifestationAs | | | | recordInfo |

Query Languages

| Language | Identifier | Documentation |
|---|---|---|
| SQL | sql | ArcadeDB SQL |
| Cypher | cypher | Neo4J Cypher |
| GraphQL | graphql | GraphQL Spec |
| Gremlin | gremlin | Tinkerpop Gremlin |
| MQL | mongo | Mongo MQL |
| | | |
| SPARQL | sparql | SPARQL (WIP) |

Runtime Configuration

The following environment variables affect DatAasee if set before starting.

| Symbol | Value | Meaning |
|---|---|---|
| TZ | CET (Default) | Timezone of server |
| DL_VERSION | 0.2 (Example) | Requested DatAasee version |
| DL_BACKUP | $PWD/backup (Default) | Path to backup folder |
| DL_USER | admin (Default) | DatAasee admin username |
| DL_BASE | http://my.url (Example) | Outward DatAasee base URL (including protocol and port, no trailing slash) |
| DL_PORT | 8343 (Default) | DatAasee API port |
| FE_PORT | 8000 (Default) | Web Frontend port (Development) |

Tutorials

In this section, learning-oriented lessons for newcomers are given.

Overview:

Getting Started

  1. Setup compatible compose orchestrator
  2. Download DatAasee release
     $ wget https://raw.githubusercontent.com/ulbmuenster/dataasee/0.2/compose.yaml
    

    or:

     $ curl -O https://raw.githubusercontent.com/ulbmuenster/dataasee/0.2/compose.yaml
    
  3. Unpack compose.yaml
     $ tar -xf dataasee-0.2.tar.gz
    

    and:

     $ cd dataasee-0.2
    
  4. Create or mount folder for backups (assuming your backup volume is mounted under /backup)
     $ mkdir -p backup
    

    or:

     $ ln -s /backup backup
    
  5. Create DatAasee API and database admin passwords. The leading spaces before echo prevent these commands from being added to the shell history (this requires HISTCONTROL to include ignorespace, which is the default on many distributions); echo -n is used because most editors add a newline at the end of a file.
     $  echo -n 'password1' > dl_pass
    

    and:

     $  echo -n 'password2' > db_pass
    
  6. Start DatAasee service
     $ docker-compose up -d
    

    or:

     $ podman-compose up -d
    

Now, if started locally point a browser to http://localhost:8000 to use the web frontend, or send requests to http://localhost:8343/api/v1/ to use the HTTP API directly.
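
For a first request, the readiness endpoint can be queried (see also the Test how-to):

$ wget -qO- http://localhost:8343/api/v1/ready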

Example Ingest

For demonstration purposes, the collection of the “Directory of Open Access Journals” (DOAJ) is ingested. An ingest has four phases: First, the administrator needs to collect the necessary information about the metadata source, i.e. URL, protocol, format, and data steward. Second, the ingest is triggered via the HTTP-API. Third, the backend ingests the metadata records from the source into the database. Fourth and lastly, the ingested data is interconnected inside the database.

  1. Check the documentation of DOAJ:
     https://doaj.org/docs
    

    The oai-pmh protocol is available.

  2. Check the documentation about OAI-PMH:
     https://doaj.org/docs/oai-pmh/
    

    The OAI-PMH endpoint URL is: https://doaj.org/oai.

  3. Check the OAI-PMH for available metadata formats:
     https://doaj.org/oai?verb=ListMetadataFormats
    

    A compatible metadata format is oai_dc.

  4. Start an ingest:
     $ wget -qO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data='{"source":"https://doaj.org/oai", "method":"oai-pmh", "format":"oai_dc", "steward":"helpdesk@doaj.org"}'
    

    A status 202 confirms the start of the ingest. Here, no steward is listed in the DOAJ documentation, thus a general contact is set. Alternatively, the “Ingest” form of the “Admin” page in the web frontend can be used.

  5. DatAasee reports the start of the ingest in the backend logs:
     $ docker logs dataasee-backend-1
    

    with a message akin to: Starting ingest from https://doaj.org/oai via oai-pmh as oai_dc..

  6. DatAasee reports completion of the ingest in the backend logs:
     $ docker logs dataasee-backend-1
    

    with a message akin to: Completed ingest of 20812 records from https://doaj.org/oai after 0.05h..

  7. DatAasee starts interconnecting the ingested metadata records:
     $ docker logs dataasee-database-1
    

    with a message akin to: Interconnect Started!.

  8. DatAasee finishes interconnecting the ingested metadata records:
     $ docker logs dataasee-database-1
    

    with a message akin to: Interconnect Completed!.

NOTE: The interconnection is a potentially long-running, asynchronous operation, whose status is only reported in the database logs.
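
To follow the interconnection progress live, the database logs can be tailed, assuming Docker and the default compose project name:

$ docker logs -f dataasee-database-1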

Container Engines

DatAasee is deployed via a compose.yaml (see How to deploy), which is compatible with the following orchestration tools:

Docker-Compose

Installation see: docs.docker.com/compose/install/

$ docker-compose up -d
$ docker-compose ps
$ docker-compose down

Docker-Compose (with Podman)

Installation see: docs.docker.com/compose/install/

NOTE: This tutorial assumes a Debian-based Linux host like Ubuntu.

$ sudo apt-get -y install dnsmasq podman-plugins containernetworking-plugins podman-docker
$ docker-compose up -d
$ docker-compose ps
$ docker-compose down

Podman-Compose

NOTE: This tutorial assumes a Debian-based Linux host like Ubuntu.

Additionally:

$ sudo apt-get -y install dnsmasq podman-plugins containernetworking-plugins python3-pip
$ pip3 install podman-compose
$ podman-compose up -d
$ podman-compose ps
$ podman-compose down

Kompose (Minikube)

Installation see: kompose.io/installation/

Prepare compose.yaml:

$ kompose -f compose.yaml convert

Particularly, for kompose versions 1.33.0 and 1.34.0, the following manual changes need to be made in:

$ rm compose.yaml
$ minikube start
$ kubectl apply -f .
$ kubectl port-forward service/backend 8343:8343  # now the backend can be accessed via `http://localhost:8343/api/v1`
$ minikube stop

Container Probes

The following endpoints are available for monitoring the respective containers; here the compose.yaml host names (service names) are used. Logs are written to the standard output.

Backend

Ready:

http://backend:4195/ready

returns HTTP status 200 if ready, see also Benthos ready.

Liveness:

http://backend:4195/ping

returns HTTP status 200 if live, see also Benthos ping.

Metrics:

http://backend:4195/metrics

allows Prometheus scraping, see also Connect prometheus.

Database

Ready:

http://database:2480/api/v1/ready

returns HTTP status 204 if ready, see also ArcadeDB ready.

Frontend

Ready:

http://frontend:3000

returns HTTP status 200 if ready.

Custom Queries

NOTE: All custom query results are limited to 100 items.

SQL

DatAasee uses the ArcadeDB SQL dialect. For custom SQL queries, only single, read-only queries are admissible, meaning:

The vertex type (cf. table) holding the metadata records is named metadata.

Examples:

Get the schema:

SELECT FROM schema:types

Get one-hundred metadata record titles:

SELECT name FROM metadata

Gremlin

TODO:

Get one-hundred metadata record titles:

g.V().hasLabel("metadata")

Cypher

DatAasee supports a subset of OpenCypher. For custom Cypher queries, only read-queries are admissible, meaning:

Examples:

Get labels:

MATCH (n) RETURN DISTINCT labels(n)

Get one-hundred metadata record titles:

MATCH (m:metadata) RETURN m

MQL

TODO:

GraphQL

TODO:

SPARQL

TODO:

Custom Frontend

Remove Prototype Frontend

Remove the YAML object "frontend" from the compose.yaml (all lines below ## Frontend # ...).


Appendix

In this section, development-related guidelines are gathered.

Overview:

Development Decision Rationales:

Infrastructure

Database

Backend

Frontend

Development Workflows

Development Setup

  1. git clone https://github.com/ulbmuenster/dataasee && cd dataasee (clone repository)
  2. make setup (builds container images locally)
  3. make start (starts development setup)

Dependency Updates

  1. Dependency documentation
  2. Dependency versions
  3. Version verification (Frontend only)

Schema Changes

  1. Schema definition
  2. Schema documentation
  3. Schema implementation

API Changes

  1. API definition
  2. API architecture
  3. API documentation
  4. API implementation
  5. API testing

Coding Standards