
DatAasee - A Metadatalake for Libraries

DatAasee centralizes and interlinks distributed library/research metadata into an API‑first union catalog.

DatAasee Software Documentation

Version: 0.9

The metadata-lake DatAasee centralizes bibliographic data as well as research-data metadata from distributed sources to increase metadata availability and research data discoverability, thereby supporting FAIR research in university, research, academic, and scientific libraries.

DatAasee is developed for and by the University and State Library of Münster, and available openly under a free and open-source license.

Table of Contents:

  1. Explanations (understanding-oriented)
  2. How-Tos (goal-oriented)
  3. References (information-oriented)
  4. Tutorials (learning-oriented)
  5. Appendix (development-oriented)



1. Explanations

In this section, in-depth explanations and background information are collected.

Overview:

About

Features

Brackets [ ] mean: under construction.

Design

Data Model

A graph database is used with a central node (vertex) type (cf. table) named Metadata. For the descriptive metadata, the node properties are based on the DataCite metadata schema. For further information see the schema.

Persistence

Backup and Restore:

Security

Secrets:

Database:

Infrastructure:

Interface:


2. How-Tos

In this section, brief guides for typical tasks are compiled.

Overview:

Prerequisite

The (virtual) machine deploying DatAasee requires Docker Compose (>=2.37) on top of Docker or Podman; see also container engine compatibility.
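
To verify this prerequisite, check the installed Compose version, for example:

$ docker compose version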

Resources

The compute and memory resources for DatAasee can be configured via the compose.yaml. To run, a bare-metal machine or virtual machine requires:

In terms of DatAasee components this breaks down to:

Note that resource and system requirements depend on load; the database in particular is under heavy load during ingest. After an ingest, (new) metadata records are interrelated, which also causes heavy database load. Generally, the database drives the overall performance. Thus, to improve performance, first try increasing the memory limits (in the compose.yaml) for the database component (i.e., from 4G to 24G).
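
For orientation, raising the database memory limit in the compose.yaml could look like this (a sketch; the service definition in the shipped compose.yaml may differ):

services:
  database:
    deploy:
      resources:
        limits:
          memory: 24G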

NOTE: Practically, the memory required by the database roughly corresponds to its size, which is reported on startup (in case a backup is restored) or after completing a backup. As a rough estimate for resource planning, expect 6G of RAM and 1G of backup disk per million records; for example, three million records would require about 18G of RAM and 3G of disk.

Using DatAasee

In this section the terms “operator” and “user” are used, where “operator” refers to the party installing, serving and maintaining DatAasee, and “user” refers to the individuals or services reading from DatAasee.

Operator Activities

User Activities

This means the user can only use the GET API endpoints, while the operator typically uses the POST API endpoints.

For details about the HTTP API calls, see the API reference.

NOTE: In the following, a leading space (symbolized by “␣”) is used with commands containing passwords to omit them from the shell history. This is not safe for production, as the mechanism depends on the shell implementation and its configuration. Use this only for local testing; for production, see safer methods for passing secrets.
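
For example, in bash this mechanism only takes effect if the history is configured to ignore space-prefixed commands (an assumption about the operator's shell):

$ export HISTCONTROL=ignorespace  # bash: do not record commands starting with a space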

Production Checklist

Deploy

Deploy DatAasee via (Docker) Compose by providing the two secrets DL_PASS (DatAasee admin password) and DB_PASS (database password):

For further details, see the Getting Started tutorial as well as the compose.yaml and the Docker Compose file reference.

WARNING: DatAasee must not be exposed directly to the public Internet! Put it behind a TLS-terminating reverse proxy and block the database endpoint. See compose.proxy.yaml for an example.

$ mkdir -p backup  # or: ln -s /path/to/backup/volume backup
$ wget https://raw.githubusercontent.com/ulbmuenster/dataasee/0.9/compose.yaml
$ ␣DL_PASS=password1 DB_PASS=password2 docker compose up -d

NOTE: The backup folder (or mount) must be readable and writable by root (strictly, by the database container’s user, whom root can represent on the host). Thus, a change of ownership via sudo chown root backup is typically required. For testing purposes chmod o+w backup is fine, but not recommended for production.

NOTE: To further customize your deploy, use environment variables. The runtime configuration environment variables can be stored in an .env file.

WARNING: Do not put secrets into the .env file!
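
For illustration, an .env file holding non-secret runtime configuration might look like this (example values; see the runtime configuration reference):

TZ=Europe/Berlin
DL_VERSION=0.9
FE_PORT=8000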

Logs

$ docker compose logs backend --no-log-prefix

NOTE: The default Docker logging driver local is used.

Shutdown

$ docker compose down

NOTE: No (database) backup is automatically triggered on shutdown!

Probe

For further details see /ready endpoint API reference entry (assuming this is done on the hosting machine).

$ wget -SqO- http://localhost:8343/api/v1/ready

Ingest

For further details see /ingest endpoint API reference entry (assuming this is done on the hosting machine).

$ wget -qO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data \
  '{"source":"https://example.invalid/oai","method":"oai-pmh","format":"mods","rights":"CC0","steward":"steward@example.invalid"}'

NOTE: This is an async action; progress and completion are only noted in the (backend) logs. Whether the backend is currently ingesting can be checked via the /ingest endpoint or the /health endpoint.

Update

$ docker compose pull
$ ␣DL_PASS=password1 DB_PASS=password2 docker compose up -d

NOTE: “Update” means: if available, new images of the same DatAasee version but with updated dependencies will be installed, whereas “Upgrade” means: a new version of DatAasee will be installed.

NOTE: An update terminates an ongoing ingest or interconnect process.

Upgrade

$ docker compose down
$ ␣DL_PASS=password1 DB_PASS=password2 DL_VERSION=0.9 docker compose up -d

NOTE: docker compose restart cannot be used to upgrade because environment variables (such as DL_VERSION) are not updated when using restart.

NOTE: Make sure to put the DL_VERSION variable into an .env file for a permanent upgrade or edit the compose file accordingly.

Reset

$ docker compose restart

NOTE: A reset may be necessary if the backend crashes during an ingest.

Database Console

$ ␣docker exec -it dataasee-database-1 bin/console.sh 'CONNECT remote:localhost/metadatalake root <db_pass>'

NOTE: This is for emergency use only and not needed in day-to-day operations.

Web Interface (Prototype)

All frontend pages show a menu on the left side listing all other pages, as well as an indicator showing whether the backend server is ready.

NOTE: The default port for the web frontend is 8000, e.g. http://localhost:8000, and can be adapted in the compose.yaml.


Home Screenshot

The home page has a full-text search input.


Resolve Screenshot

The “DOI Search” page takes a DOI and returns the associated metadata record.


List Screenshot

The “List Records” page lists all metadata records from a selected source.


Filter Screenshot

The “Filter Search” page filters for a subset of metadata records by category, resource type, language, license, or publication year.


Query Screenshot

The “Custom Query” page accepts a query via sql, opencypher, mql, graphql, or redis.


Statistics Screenshot

The “Statistics Overview” page shows top-10 bar graphs for number of views, publication years, and keywords, as well as top-100 pie charts for resource types, categories, licenses, subjects, languages, and metadata schemas.


About Screenshot

The “Interface Summary” page lists the backend API endpoints and provides links to parameter, request, and response schemas.


Fetch Screenshot

The “Display Record” page presents a single metadata record.


Admin Screenshot

The “Admin Controls” page triggers backend actions such as source ingest or health check.

API Indexing

Add the JSON object below to the apis array in your global apis.json:

{
  "name": "DatAasee API",
  "description": "The DatAasee API enables research data search and discovery via metadata",
  "keywords": ["Metadata"],
  "attribution": "DatAasee",
  "baseURL": "http://your-dataasee.url/api/v1",
  "properties": [
    {
      "type": "InterfaceLicense",
      "url": "https://creativecommons.org/licenses/by/4.0/"
    },
    {
      "type": "x-openapi",
      "url": "http://your-dataasee.url/api/v1/api"
    }
  ]
}

For api-catalog / FAIRiCat, add the JSON object below to the linkset array:

{
  "anchor": "http://your-dataasee.url/api/v1",
  "service-doc": [
    {
      "href": "http://your-dataasee.url/api/v1/api",
      "type": "application/json",
      "title": "DatAasee API"
    }
  ]
}

3. References

In this section technical descriptions are summarized.

Overview:

Runtime Configuration

The following environment variables affect DatAasee if set before starting:

Symbol Value Meaning
TZ Europe/Berlin (Default) Timezone of all component servers
DL_PASS password1 (Example) DatAasee password (use only command-local!)
DB_PASS password2 (Example) Database password (use only command-local!)
DB_OPTS   Custom database server options, see: https://docs.arcadedb.com/#arcadedb-settings
DL_VERSION 0.9 (Example) Requested DatAasee version
DL_BACKUP $PWD/backup (Default) Path to backup folder
DL_USER admin (Default) DatAasee admin username
DL_BASE your-dataasee.url (Example) Outward DatAasee host name
DL_SAFE false (Default) Disables the /database endpoint if true
DL_PORT 8343 (Default) DatAasee API port
FE_PORT 8000 (Default) Web Frontend port

Default Ports

HTTP API

The HTTP API is served by default under http://<your-base-url>:port/api/v1 (see DL_PORT and DL_BASE). It is aimed at human as well as machine clients and consumers, is self-documenting, and provides the following endpoints:

Method Endpoint Type Summary
GET /ready system Returns service readiness.
GET /api system Returns API specification and schemas.
GET /schema support Returns database schema.
GET /metadata data Returns metadata record(s).
GET /database data Returns results of database queries.
POST /health system Returns service liveness.
POST /ingest system Triggers async ingest of metadata records from source.

For more details see also the associated OpenAPI definition. Furthermore, parameters, request and response bodies are specified as JSON-Schemas, which are linked in the respective endpoint entries below.

All GET requests are unchallenged; all POST requests are challenged and handled via “Basic Authentication”, where the username is admin (by default, or as set via DL_USER) and the password is the one set via DL_PASS. A POST request without an Authorization HTTP header is answered with status 401 and a WWW-Authenticate header, whereas a POST request with an Authorization HTTP header but missing or invalid credentials is answered with status 403.
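
The challenge flow can be illustrated as follows (a sketch using curl; wget behaves equivalently, and password1 stands for the value of DL_PASS):

$ curl -i -X POST "http://localhost:8343/api/v1/health" -d ''                 # 401 plus WWW-Authenticate header
$ curl -i -X POST "http://localhost:8343/api/v1/health" -u admin:wrong -d ''  # 403 for invalid credentials
$ ␣curl -i -X POST "http://localhost:8343/api/v1/health" -u admin:password1 -d ''  # 200 with valid credentials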

All response bodies have content type JSON; thus, if provided, the Accept HTTP header can only be application/json or application/vnd.api+json! Responses follow the JSON:API format, with the exception of the /api endpoint, which returns JSON files directly. All error messages, which means 4XX and 5XX HTTP status responses, follow the JSON:API specification as well, see the error response JSON schema.

For the metadata endpoint, the id property in a response’s data element corresponds to the native recordId; for all other (non-resource) endpoints, the id property is the server’s Unix timestamp.


/ready Endpoint

Returns a boolean answering whether the service is ready.

This endpoint is meant for readiness probes by an orchestrator, monitoring or in a frontend.

NOTE: Internally, the overall readiness consists of the backend server AND database server readiness.

Statuses

Examples

Get service readiness:

$ wget -qO- "http://localhost:8343/api/v1/ready"

/api Endpoint

Returns OpenAPI specification without parameters, or parameter, request and response schemas (for the respective endpoint).

This endpoint documents the HTTP API as a whole as well as parameter, request, and response JSON schemas for all endpoints, and helps navigating the API for humans and machines.

NOTE: In case of a successful request, the response is NOT in the JSON:API format, but the requested OpenAPI or Schema JSON-file directly.

NOTE: At most one parameter can be used per request.

Statuses

Examples

Get OpenAPI definition:

$ wget -qO- "http://localhost:8343/api/v1/api"

Get schema endpoint parameter schema:

$ wget -qO- "http://localhost:8343/api/v1/api?params=schema"

Get ingest endpoint request schema:

$ wget -qO- "http://localhost:8343/api/v1/api?request=ingest"

Get metadata endpoint response schema:

$ wget -qO- "http://localhost:8343/api/v1/api?response=metadata"

/schema Endpoint

Returns the native metadata schema.

This endpoint provides the hierarchy of the data model, labels and descriptions for all properties as well as values and facets for enumerated properties, and is meant for labels, selectors, hints or tooltips in a frontend.

NOTE: Keys prefixed with @ refer to meta information (schema version, type comment, or relations).

NOTE: Values for prop are case-sensitive: for example, use prop=resourceType, not prop=resourcetype.

NOTE: The response is cached and refreshed every 5 minutes.

Statuses

Examples

Get native full metadata schema:

$ wget -qO- "http://localhost:8343/api/v1/schema"

Get native metadata schema’s title property:

$ wget -qO- "http://localhost:8343/api/v1/schema?prop=title"

Get native metadata schema’s enumerated language property:

$ wget -qO- "http://localhost:8343/api/v1/schema?prop=language"

Get native metadata schema’s isRelatedTo relation:

$ wget -qO- "http://localhost:8343/api/v1/schema?prop=@isRelatedTo"

/metadata Endpoint

Fetches, lists, or searches and filters metadata record(s). Three distinct modes of operation are available (in order of precedence):

  1. Fetch: return a single record by identifier (id or doi).
  2. List: return all records from a given source.
  3. Search: return records matching a full-text search and/or filter parameters.

This endpoint’s responses can include pagination where appropriate. For the combined full-text / filter search, paging via page is one-based (meaning the first page is 1) and sorting is controlled via newest; for source listings, paging is cursor-based, and the cursor is also passed via page. Requests with id return at most one result, requests with source return at most one-hundred results per page, and search/filter requests return at most twenty results per page.

This is the main endpoint, serving the metadata records of the DatAasee database similar to a resource.

NOTE: Export formats are a convenience feature without round-trip guarantees.

NOTE: An explicitly empty source parameter (i.e., source=) implies all sources.

NOTE: A full-text search always matches all query terms (AND-based) in titles, synonyms, descriptions and keywords in any order, while accepting * as wildcard and _ to build phrases, for example: I_*_a_dream.

NOTE: A response includes paginated links first, prev, and next if applicable.

NOTE: The type in a BibJSON export is renamed entrytype due to a collision with JSON:API rules.

NOTE: The identifier ni:dataasee refers to a special record (see Example Record), which can be fetched by id, but not searched or filtered.

Statuses

Examples

Get record by record identifier:

$ wget -qO- "http://localhost:8343/api/v1/metadata?id=ni:dataasee"

Get record(s) by DOI:

$ wget -qO- "http://localhost:8343/api/v1/metadata?doi=10.5281/zenodo.13734194"

Export record in given format:

$ wget -qO- "http://localhost:8343/api/v1/metadata?id=ni:dataasee&format=datacite"

Search records by single filter:

$ wget -qO- "http://localhost:8343/api/v1/metadata?language=chinese"

Search records by multiple filters:

$ wget -qO- "http://localhost:8343/api/v1/metadata?resourcetype=book&language=german"

Search records by full-text for word “History”:

$ wget -qO- "http://localhost:8343/api/v1/metadata?search=History"

Search records by full-text and filter, oldest first:

$ wget -qO- "http://localhost:8343/api/v1/metadata?search=Geschichte&resourcetype=book&language=german&newest=false"

List records from all sources:

$ wget -qO- "http://localhost:8343/api/v1/metadata?source="

/database Endpoint

Returns the results of queries directly against the database.

This endpoint is meant for custom queries and intended for trusted internal clients only. For public deployments, set DL_SAFE=true or block this endpoint at a reverse proxy.

NOTE: Only idempotent read-only operations are permitted.

WARNING: Queries can potentially cause high loads on the database. If deployed publicly, this endpoint should be blocked by setting the environment variable DL_SAFE to true.

Statuses

Examples

Search records by custom SQL query:

$ wget -qO- "http://localhost:8343/api/v1/database?language=sql&query=SELECT+FROM+Metadata+LIMIT+10"

/health Endpoint

Returns internal status and versions of service components.

This endpoint is meant for liveness checks by an orchestrator, observability, or for manually inspecting the database and processor health and status. In particular, the ingest status of the processor and the interconnect status of the database are reported.

Statuses

Examples

Get service health:

$ wget -qO- "http://localhost:8343/api/v1/health" --user admin --ask-password --post-data=''

/ingest Endpoint

Triggers an asynchronous ingest of metadata records from a source, followed by an interconnect of records.

An ingest is a two-stage process: first, the backend forwards records from an external source to the database; second, the database interconnects all new records based on identified relations.

NOTE: The request body can be JSON or application/x-www-form-urlencoded.

NOTE: This is an asynchronous action, so the response only reports whether an ingest was started. Completion is noted in the backend logs, and the subsequent interconnect in the database logs. The current ingest and interconnect status is also reported by the /health endpoint.

NOTE: Only one ingest can run at a time, which is enforced with a backend lock and a database lock. To check whether the server is currently ingesting, send an empty body to this endpoint.

NOTE: The method and format properties are case-sensitive.

NOTE: The options field follows selective harvesting in OAI-PMH. For example, incremental harvesting is possible using from=2000-01-01 or set=institution&from=2000-01-01.

NOTE: Since the record identifier is a hash of metadata properties, records are not duplicated but updated (UPSERTed) if the hash coincides.

Statuses

Examples

Check if the server is busy ingesting:

$ wget -qO- "http://localhost:8343/api/v1/ingest" --user admin --ask-password --post-data=''

Start ingest from a given source:

$ wget -qO- "http://localhost:8343/api/v1/ingest" --user admin --ask-password --post-data \
  '{"source":"https://datastore.uni-muenster.de/oai2d","method":"oai-pmh","format":"datacite","rights":"CC0","steward":"fdm@uni-muenster"}'

Ingest Protocols

Ingest Encodings

Currently, XML (eXtensible Markup Language) is the only encoding for ingested metadata, with the exception of ingesting via the dataasee protocol, which uses JSON.

Ingest Formats

Native Schema

The underlying DBMS (ArcadeDB) is a property-graph database whose nodes (vertexes) and edges are documents (similar to JSON files). The graph nature is utilized by interconnecting records (vertex documents) via identifiers (i.e., DOI) during ingest, given a set of predefined relations.

Conceptually, the data model for metadata records has five sections: Process, Technical, Social, Descriptive, and Raw (cf. the Section column in the following table).

The central type of the metadatalake database is the Metadata vertex type, with the following properties:

Key Section Entry Internal Type Constraints Comment
schemaVersion Process Automatic Integer =1  
recordId Process Automatic String max 47 base64url-encoded sha256 hash of: source, format, source record identifier (or publisher), publicationYear, title; with prefix "ni:"
metadataQuality Process Automatic String max 255 Currently one of: "Incomplete", "OK"
dataSteward Process Automatic String max 4095  
source Process Automatic Link(Pair) sources  
sourceRights Process Automatic String max 4095  
createdAt Process Automatic Datetime    
           
sizeBytes Technical Optional Integer min 0  
dataFormat Technical Optional String max 255  
dataLocation Technical Optional String max 4095, URL regex  
           
categories Social Automatic List(Link) max 3 Pass array of strings to API, returned as array of strings from API
keywords Social Optional List(String) max 15 Full-text indexed
           
title Descriptive Mandatory String max 255 Full-text indexed (Longer titles are truncated, but stored in full in synonyms)
creators Descriptive Mandatory List(Pair) max 255 Pass array of Pair objects (name:fullname, data:identifier) to API
publisher Descriptive Mandatory String max 255  
publicationYear Descriptive Mandatory Integer min -9999, max 9999  
resourceType Descriptive Mandatory Link(Pair) resourceTypes Pass string to API, returned as string from API
identifiers Descriptive Mandatory List(Pair) max 255 Pass array of Pair objects (name:identifier, data:type) to API
           
synonyms Descriptive Optional List(Pair) max 255 Pass array of Pair objects (name:title, data:type) to API, name full-text indexed
language Descriptive Optional Link(Pair) languages Pass string to API, returned as string from API
subjects Descriptive Optional List(Pair) max 255 Pass array of Pair objects (name:name, data:identifier) to API
version Descriptive Optional String max 255  
license Descriptive Optional Link(Pair) licenses Pass string to API, returned as string from API
rights Descriptive Optional String max 65535  
fundings Descriptive Optional List(Pair) max 255 Pass array of Pair objects (name:funder, data:identifier) to API
description Descriptive Optional String max 65535 Full-text indexed
relatedItems Descriptive Optional List(Pair) max 255 Pass array of Pair objects (name:type, data:URL) to API
           
rawMetadata Raw Automatic Link(Raw)    
rawFormat Raw Automatic Link(Pair) schemas  
rawChecksum Raw Automatic String max 255 SHA256 hash of rawMetadata

NOTE: See also the custom queries section and the schema diagram: schema.md.

NOTE: The recordId property is a DatAasee-specific identifier and should not be treated as a public web identifier.

NOTE: The properties related and visited are for internal purposes only and hence not listed here.

NOTE: The preloaded set of Categories (see categories.csv) is based on the OECD Fields of Science and Technology.

Global Metadata

The Metadata type has the custom metadata fields:

Key Type Comment
version Integer Internal schema version (to compare against the schemaVersion property)
comment String Database comment

Property Metadata

Each schema property has a label; additionally, the descriptive properties have a comment, and enumerated properties hold in enum the name of the document type containing all admissible values.

Key Type Comment
label String For UI labels
comment String For UI helper texts
enum String  

NOTE: The /schema endpoint response provides in enum all admissible values for enumerated properties directly, not the type name; additionally, it provides a facets list, which is not part of the schema but a view refreshed after each ingest.

Pair Documents

A helper document type used for source, creators, identifiers, synonyms, language, subjects, license, fundings, relatedItems link targets or list elements.

Property Type Constraints
name String min 1, max 255
data String max 4095, URL regex

NOTE: The URL regex is based on stephenhay’s pattern.
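
For illustration, a creators value as a list of Pair objects might look like this (hypothetical name and identifier):

"creators": [
  { "name": "Jane Doe", "data": "https://orcid.org/0000-0000-0000-0000" }
]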

Raw Documents

A helper document type for rawMetadata.

Property Type Constraints
value String  

Interrelation Edges

Type Domain Range Comment
isRelatedTo Metadata Metadata Generic catch-all edge type and base type for all other edge types
isNewVersionOf Metadata Metadata See DataCite
isDerivedFrom Metadata Metadata See DataCite
hasPart Metadata Metadata See DataCite
isPartOf Metadata Metadata See DataCite
isDescribedBy Metadata Metadata See DataCite
commonExpression Metadata Metadata See OpenWEMI
commonManifestation Metadata Metadata See OpenWEMI

NOTE: The graph is directed, so the edge names have a direction. By default, the edge name refers to the outbound direction.
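
For example, outbound isPartOf edges can be followed explicitly via a custom Cypher query (a sketch; see the Custom Queries section):

$ wget -qO- "http://localhost:8343/api/v1/database?language=opencypher&query=MATCH+(a:Metadata)-%5B:isPartOf%5D-%3E(b:Metadata)+RETURN+a.title,+b.title+LIMIT+10"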

Edge Metadata

Key Type Comment
label_in String For UI labels (incoming edge)
label_out String For UI labels (outbound edge)

Native Schema Crosswalk

DatAasee DataCite DC LIDO MARC MODS
title titles.title title descriptiveMetadata.objectIdentificationWrap.titleWrap.titleSet 245, 130 titleInfo, part
creators creators.creator creator descriptiveMetadata.eventWrap.eventSet 100, 700 name, relatedItem
publisher publisher publisher descriptiveMetadata.objectIdentificationWrap.repositoryWrap.repositorySet 260, 264 originInfo
publicationYear publicationYear date descriptiveMetadata.eventWrap.eventSet 260, 264 originInfo, part
resourceType resourceType type category 007, 337 genre, typeOfResource
identifiers identifier, alternateIdentifiers.alternateIdentifier identifier lidoRecID, objectPublishedID 001, 003, 020, 024, 856 identifier, recordInfo
synonyms titles.title title descriptiveMetadata.objectIdentificationWrap.titleWrap.titleSet 210, 222, 240, 242, 243, 246, 247 titleInfo
language language language descriptiveMetadata.objectClassificationWrap.classificationWrap.classification 008, 041 language
subjects subjects.subject subject descriptiveMetadata.objectRelationWrap.subjectWrap.subjectSet, descriptiveMetadata.objectClassificationWrap.classificationWrap.classification 655, 689 subject, classification
version version   descriptiveMetadata.objectIdentificationWrap.displayStateEditionWrap.displayEdition 250 originInfo
license rightsList.rights   administrativeMetadata.rightsWorkWrap.rightsWorkSet 506, 540 accessCondition
rights rightsList.rights rights administrativeMetadata.rightsWorkWrap.rightsWorkSet 506, 540 accessCondition
fundings fundingReferences.fundingReference        
description descriptions.description description descriptiveMetadata.objectIdentificationWrap.objectDescriptionWrap.objectDescriptionSet 500, 520 abstract
relatedItems relatedIdentifiers.relatedIdentifier related descriptiveMetadata.objectRelationWrap.relatedWorksWrap.relatedWorkSet 856 relatedItem
           
keywords subjects.subject subject descriptiveMetadata.objectIdentificationWrap.objectDescriptionWrap.objectDescriptionSet 653 subject, classification
           
dataLocation identifier     856 location
dataFormat formats.format format      
sizeBytes          
           
isRelatedTo relatedItems.relatedItem, relatedIdentifiers.relatedIdentifier related descriptiveMetadata.objectRelationWrap.relatedWorksWrap.relatedWorkSet 780, 785, 786, 787 relatedItem
isNewVersionOf relatedItems.relatedItem, relatedIdentifiers.relatedIdentifier       relatedItem
isDerivedFrom relatedItems.relatedItem, relatedIdentifiers.relatedIdentifier       relatedItem
isPartOf relatedItems.relatedItem, relatedIdentifiers.relatedIdentifier     773 relatedItem
hasPart relatedItems.relatedItem, relatedIdentifiers.relatedIdentifier       relatedItem
isDescribedBy          
commonExpression         relatedItem
commonManifestation identifier, alternateIdentifiers.alternateIdentifier identifier lidoRecID, objectPublishedID 001, 003, 020, 024, 856 identifier, recordInfo

Query Languages

Language Identifier Mini Tutorial Documentation
SQL sql here ArcadeDB SQL
Cypher opencypher here OpenCypher
MQL mongo here Mongo MQL
GraphQL graphql here GraphQL Spec
Redis redis here Redis Commands

4. Tutorials

In this section lessons for newcomers are given.

Overview:

Getting Started

  1. Set up a compatible compose orchestrator
  2. Download DatAasee release:

    $ wget "https://raw.githubusercontent.com/ulbmuenster/dataasee/0.9/compose.yaml"
    

    or:

    $ curl -O "https://raw.githubusercontent.com/ulbmuenster/dataasee/0.9/compose.yaml"
    
  3. Create or mount a folder for backups (in case of a mount, the backup volume is assumed to be mounted under /backup on the host):

    $ mkdir -p backup
    

    or:

    $ ln -s /backup backup
    
  4. Ensure the backup location has the necessary permissions:

    $ chmod o+w backup  # For testing
    

    or:

    $ sudo chown root backup  # For deploying
    
  5. Start the DatAasee service:

    $ ␣DL_PASS=password1 DB_PASS=password2 docker compose up -d
    

    or:

    $ ␣DL_PASS=password1 DB_PASS=password2 podman compose up -d
    

Now, if started locally, point a browser to http://localhost:8000 to use the web frontend, or send requests to http://localhost:8343/api/v1/ to use the HTTP API directly, for example via wget or curl.

Example Ingest

For demonstration purposes, the collection of the “Directory of Open Access Journals” (DOAJ) is ingested. An ingest has four steps: First, the operator collects the necessary information about the metadata source, i.e., URL, protocol, format, and data steward. Second, the ingest is triggered via the HTTP API. Third, the backend ingests the metadata records from the source into the database. Fourth and last, the ingested data is interconnected inside the database.

  1. Check the “Directory of Open Access Journals” (in a browser) for a compatible ingest method:

    https://doaj.org
    

    The oai-pmh protocol is available.

  2. Check the documentation about OAI-PMH for the corresponding endpoint:

    https://doaj.org/docs/oai-pmh/
    

    The OAI-PMH endpoint URL is: https://doaj.org/oai.

  3. Check the OAI-PMH endpoint for available metadata formats (for example, in a browser):

    https://doaj.org/oai?verb=ListMetadataFormats
    

    A compatible metadata format is oai_dc.

  4. Trigger the ingest:

    $ wget -qO- "http://localhost:8343/api/v1/ingest" --user admin --ask-password --post-data \
      '{"source":"https://doaj.org/oai", "method":"oai-pmh", "format":"oai_dc", "rights":"CC0", "steward":"helpdesk@doaj.org"}'
    

    HTTP status 202 confirms the start of the ingest. There is no steward listed in the DOAJ documentation, thus a general contact is set. Alternatively, the “Ingest” form of the “Admin” page in the web frontend can be used.

  5. DatAasee reports the start of the ingest in the backend logs:

    $ docker logs dataasee-backend-1
    

    with a message such as: Ingest started from https://doaj.org/oai via oai-pmh as oai_dc..

  6. DatAasee reports completion of the ingest in the backend logs:

    $ docker logs dataasee-backend-1
    

    with a message such as: Ingest completed from https://doaj.org/oai of 21319 records (of which 0 failed) after 0.1h..

  7. DatAasee starts interconnecting the ingested metadata records:

    $ docker logs dataasee-database-1
    

    with the message: Interconnect Started!.

  8. DatAasee finishes interconnecting the ingested metadata records:

    $ docker logs dataasee-database-1
    

    with the message: Interconnect Completed!.

NOTE: The interconnect is an asynchronous operation, whose status is reported in the database logs or via the /health endpoint.

NOTE: Generally, use the ingest method OAI-PMH for suitable sources, S3 for multi-file sources, and GET for single-file sources.

Example Harvest

A typical use-case for DatAasee is to forward all metadata records from a specific source. To demonstrate this, the previous Example Ingest is assumed to have happened.

  1. Check the ingested sources:

    $ wget -qO- "http://localhost:8343/api/v1/schema?prop=source"
    
  2. Request the first set of metadata records from source https://doaj.org/oai (the source needs to be URL encoded):

    $ wget -qO- "http://localhost:8343/api/v1/metadata?source=https%3A%2F%2Fdoaj.org%2Foai"
    

    At most 100 records are returned.

  3. Request the next set of metadata records (A next link is given in the meta object of the previous response):

    $ wget -qO- "http://localhost:8343/api/v1/metadata?source=https%3A%2F%2Fdoaj.org%2Foai&page=IzE3OjQxMTY"
    

    The last page can contain fewer than 100 records; all pages before it contain exactly 100 records.

NOTE: With the source filter, full records are returned, instead of the search results returned without it; see /metadata.

NOTE: No stable ordering of returned records is guaranteed.
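
To harvest an entire source, the cursor links can be followed in a loop. Below is a sketch assuming jq is installed and that the next link is exposed as links.next in the response (adjust the jq paths to the actual response schema):

$ URL="http://localhost:8343/api/v1/metadata?source=https%3A%2F%2Fdoaj.org%2Foai"
$ while [ -n "$URL" ]; do
    wget -qO page.json "$URL"                       # fetch one page of records
    jq -r '.data[].id' page.json                    # process the page, here: print record identifiers
    URL=$(jq -r '.links.next // empty' page.json)   # follow the next link, empty on the last page
  done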

Secret Management

Two secrets need to be managed for DatAasee, the database root password and the backend admin password. To protect these secrets on a host running docker(-compose), for example, the following tools can be used:

sops

$ gpg --quick-generate-key --batch --passphrase '' sops  # For testing
$ export SOPS_PGP_FP=$(gpg --with-colons --fingerprint sops | grep '^fpr:' | cut -d':' -f10 | head -n1)
$ printf "DL_PASS=password1\nDB_PASS=password2" > secrets.env
$ sops encrypt -i secrets.env
$ sops exec-env secrets.env 'docker compose up -d'

consul & envconsul

$ consul agent -dev  # For testing
$ consul kv put dataasee/DL_PASS password1
$ consul kv put dataasee/DB_PASS password2
$ envconsul -prefix dataasee docker compose up -d

env-vault

$ EDITOR=nano env-vault create secrets.env
$ env-vault secrets.env docker compose -- up -d

openssl

$ printf "DL_PASS=password1\nDB_PASS=password2" | openssl aes-256-cbc -e -a -salt -pbkdf2 -in - -out secrets.enc
$ (openssl aes-256-cbc -d -a -pbkdf2 -in secrets.enc -out secrets.env; docker compose --env-file .env --env-file secrets.env up -d; rm secrets.env)

Container Engines

DatAasee is deployed via a compose.yaml (see How to deploy), which is compatible with the following container and orchestration tools:

Docker Compose (Docker)

Installation see: docs.docker.com/compose/install/

$ ␣DL_PASS=password1 DB_PASS=password2 docker compose up -d
$ docker compose ps
$ docker compose down

Docker Compose (Podman)

Installation see: podman-desktop.io/docs/compose/setting-up-compose

NOTE: See also the podman compose manpage.

NOTE: Alternatively the package podman-docker (on Ubuntu) can be used to emulate docker through podman.

NOTE: The compose implementation podman-compose is not compatible at the moment.

$ ␣DL_PASS=password1 DB_PASS=password2 podman compose up -d
$ podman compose ps
$ podman compose down

Kompose (Minikube)

Installation see: kompose.io/installation/

Rename the compose.yaml to compose.txt (so kubectl ignores it) and run:

$ kompose -f compose.txt convert --secrets-as-files
$ minikube start
$ kubectl create secret generic datalake --from-literal=datalake=password1
$ kubectl create secret generic database --from-literal=database=password2
$ kubectl apply -f .
$ kubectl get pods
$ kubectl port-forward service/backend 8343:8343  # now the backend can be accessed via `http://localhost:8343/api/v1`
$ kubectl port-forward service/frontend 8000:8000  # now the frontend can be accessed via `http://localhost:8000`
$ minikube stop

Container Probes

The following endpoints are available for monitoring the respective containers; here the compose.yaml host names (service names) are used. Logs are written to the standard output.

Backend

Ready:

http://backend:4195/ready

returns HTTP status 200 if ready, see also Connect /ready.

Liveness:

http://backend:4195/ping

returns HTTP status 200 if live, see also Connect /ping.

Database

Ready:

http://database:2480/api/v1/ready

returns HTTP status 204 if ready, see also ArcadeDB /ready.

Liveness:

http://database:2480/api/v1/exists/metadatalake

returns HTTP status 200 if live, see also ArcadeDB /exists.

NOTE: This endpoint needs database credentials.

Frontend

Ready:

http://frontend:8000

returns HTTP status 200 if ready.

Custom Queries

Custom queries are meant for downstream services to customize recurring data access. A usage example for a custom query is to get a subset of metadata records or their contents for which filters are too generic; a practical example is the statistics query in the prototype frontend. Overall, the DatAasee database schema is based around the Metadata vertex type, whose properties correspond to a star schema in relational terms. See the schema reference as well as the schema overview for the data model.

NOTE: All custom query results are limited to 100 items per request; use a paging mechanism if needed.

NOTE: A good learning resource for SQL, Cypher, and MQL is “SQL and NoSQL Databases”.

SQL

DatAasee uses the ArcadeDB SQL dialect (via language sql). For custom SQL queries, only single, read-only queries are admissible.

The vertex type (cf. table) holding the metadata records is named Metadata.

Examples:

Get the schema:

SELECT FROM schema:types

Get (at most) the first one-hundred metadata record titles:

SELECT title FROM Metadata
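
A statistics-style aggregation, as used by the prototype frontend, could look like this (a sketch; assumes dotted link traversal to the Pair name):

SELECT resourceType.name AS type, count(*) AS total FROM Metadata GROUP BY type ORDER BY total DESC LIMIT 10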

OpenCypher

DatAasee supports a subset of OpenCypher (via language opencypher). For custom Cypher queries, only read-queries are admissible.

Examples:

Get labels:

MATCH (n) RETURN DISTINCT labels(n)

Get one-hundred metadata record titles:

MATCH (m:Metadata) RETURN m

MQL

DatAasee supports a subset of MQL (via language mongo) as JSON queries.

Examples:

Get (at most) the first one-hundred metadata records:

{ "collection": "Metadata", "query": { } }

GraphQL

DatAasee supports a subset of GraphQL (via language graphql). GraphQL use requires some prior setup:

  1. A corresponding GraphQL type for the native Metadata type needs to be defined:

    type Metadata { recordId: ID! }
    
  2. Some GraphQL query needs to be defined, for example named getMetadata:

    type Query { getMetadata(recordId: ID!): [Metadata!]! }
    

Since GraphQL type and query declarations are ephemeral, declarations and query execution should be sent together.

Examples

Get (at most) the first one-hundred metadata record identifiers:

type Metadata { recordId: ID! }

type Query { getMetadata(recordId: ID!): [Metadata!]! }

{ getMetadata }
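
Since declarations are ephemeral, a complete request sends them together with the query, for example (a sketch using curl's URL encoding):

$ curl -G "http://localhost:8343/api/v1/database" --data-urlencode "language=graphql" \
    --data-urlencode "query=type Metadata { recordId: ID! } type Query { getMetadata(recordId: ID!): [Metadata!]! } { getMetadata }"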

Redis

DatAasee supports a subset of Redis commands (via language redis).

Examples

Get record with a recordId:

HGET Metadata[recordId] "ni:6g8aa2ARLuJ-8L1Uhjnf-dBN2Q-X1pC0Iqfuw7_yKec"

5. Appendix

In this section development-related guidelines are gathered.

Overview:

Support Matrix

Component Status Intention
Database Schema Supported Pilot, breaking changes possible until 1.1
Backend API Supported Pilot, breaking changes possible until 1.1
Compose Deployment Reference Pilot and evaluation
Frontend Container Prototype Testing and template

Dependency Docs

Development Decision Rationales

User Privacy

Infrastructure FAQ

Data Model FAQ

Database FAQ

Backend FAQ

Frontend FAQ

Example Record

The following is a minimal test record, stored in the processor (not the database) and accessible via the special record identifier (aka recordId) ni:dataasee.

{
  "createdAt": "2026-04-07 13:16:08",
  "creators": [
    {
      "name": "C. Himpe"
    }
  ],
  "dataSteward": "dataasee",
  "identifiers": [
    {
      "data": "doi",
      "name": "10.5281/zenodo.13734194"
    }
  ],
  "metadataQuality": null,
  "publicationYear": null,
  "publisher": "ULB Münster",
  "rawChecksum": null,
  "rawFormat": "dataasee",
  "recordId": "ni:dataasee",
  "relatedItems": [
    {
      "data": "https://github.com/ulbmuenster/dataasee",
      "name": "repository"
    }
  ],
  "resourceType": "Software",
  "schemaVersion": 1,
  "source": "dataasee",
  "title": "DatAasee",
  "version": "0.9"
}

Development Workflows

Development Setup

  1. git clone https://github.com/ulbmuenster/dataasee && cd dataasee (clone repository)
  2. make setup (builds container images locally)
  3. make start (starts development setup)

Testing

Release Builds

Compose Setup

Dependency Updates

  1. Dependency listing
  2. Dependency versions
  3. Version verification (Frontend only)

Schema Changes

  1. Schema definition
  2. Schema documentation
  3. Schema implementation

API Changes

  1. API rendering
  2. API schema
  3. API architecture
  4. API documentation
  5. API testing
  6. API implementation

Dev Monitoring

Coding Standards

Release Management