DatAasee Software Documentation
Version: 0.2
DatAasee is a metadata-lake for centralizing bibliographic and scientific metadata from various sources. It increases research data findability and discoverability as well as metadata availability, and thus supports FAIR research and research reporting in university libraries, research libraries, academic libraries, or scientific libraries.
Particularly, DatAasee is developed for and by the University and State Library of Münster.
Explanations
In this section understanding-oriented explanations are collected.
Overview:
- About
- Features
- Components
- Design
- Data Model
- EtLT Process
- Security
About
- What is DatAasee?
- DatAasee is a metadata-lake!
- What is a Metadata-Lake?
- A metadata-lake (a.k.a. metalake) is a data-lake restricted to metadata!
- What is a Data-Lake?
- A data-lake is a data architecture for structured, semi-structured and unstructured data!
- How does a data-lake differ from a database?
- A data-lake includes a database, but requires further components to import and export data!
- How does a data-lake differ from a data warehouse?
- A data warehouse holds only structured, already-transformed data (schema-on-write), while a data-lake also stores raw, semi-structured and unstructured data (schema-on-read)!
- How is data in a data-lake organized?
- A data-lake includes a metadata catalog that stores data locations, their metadata, and transformations!
- What makes a metadata-lake special?
- The metadata-lake’s data-lake and metadata catalog coincide; this implies incoming data is partially transformed (cf. EtLT) to hydrate the catalog aspect of the metadata-lake!
- How does a metadata-lake differ from a data catalog?
- A metadata-lake’s data is (textual) metadata while a data catalog’s data is databases (and their contents)!
- How is a metadata-lake a data-lake?
- The ingested metadata is stored in raw form in the metadata-lake in addition to the partially transformed catalog metadata, and transformations are performed on the raw or catalog metadata upon request.
- How does a metadata-lake relate to a virtual data-lake?
- A metadata-lake can act as a central metadata catalog for a set of distributed data sources and thus define a virtual data-lake.
Features
- Search via: full-text, filter, [SRU]
- Query by: SQL, Gremlin, Cypher, MQL, GraphQL, [SPARQL]
- Ingest: DataCite (XML), DC (XML), MARC (XML), MODS (XML), [LIDO], [EAD], [DCAT], [RDF]
- Ingest via: OAI-PMH (HTTP), S3 (HTTP), [GraphQL], [Postgres], [Self]
- Deploy via: Docker, Podman, [Kubernetes]
- REST-like API with CQRS aspects
- Best-of statistics of enumerated properties
- CRUD frontend for manual interaction and observation.
Components
DatAasee uses a three-tier architecture with these separately containerized components:
Function | Abstraction | Tier | Product |
---|---|---|---|
Metadata Catalog | Multi-Model Database | Data | ArcadeDB |
EtLT Processor | Declarative Streaming Processor | Logic | Benthos |
Web Frontend | Declarative Web Framework | Presentation | Lowdefy |
Design
- Each component is encapsulated in its own container.
- External access happens through an HTTP API transporting JSON and conforming to JSON:API.
- Ingests may happen via compatible protocols, e.g. OAI-PMH or S3.
- The frontend is optional, as it exclusively uses the (external) HTTP-API.
- Internal communication happens via the components’ HTTP-APIs.
- Only the database component holds state, the backend (and frontend) are stateless.
- For more details see the architecture documentation.
Data Model
The internal data model is based on the one big table (OBT) approach, with the exception of linked enumerated dimensions (look-up tables), making it effectively a denormalized wide table with a star schema, named metadata.
EtLT Process
Combining the ETL (Extract-Transform-Load / schema-on-write) and ELT (Extract-Load-Transform / schema-on-read) concepts, processing is built upon the EtLT approach:
- Extract: Ingest from data source, see ingest endpoint.
- transform: Partial parsing and cleaning.
- Load: Write to database.
- Transform: Parse to export format on-demand.
Particularly, this means “EtL” happens (batch-wise) during ingest, while “T” occurs when requested.
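As an illustration of this split (a sketch reusing commands from the how-tos and HTTP-API reference below; the URLs are placeholders):
# "EtL": extract from a source, partially transform, and load into the database (batch-wise, at ingest time)
$ wget -qO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data='{"source":"https://my.url/to/oai","method":"oai-pmh","format":"oai_dc","steward":"https://my.url/identifying/steward"}'
# "T": the remaining transformation into the requested output happens only when a record is fetched
$ wget -qO- http://localhost:8343/api/v1/metadata?search=History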
Security
Secrets:
- Two secrets need to be handled: database admin and datalake admin passwords.
- The default datalake admin user name is admin; the password can be passed during the initial deploy.
- The database admin user name is root; the password can be passed during the initial deploy.
- The passwords are handled as file-based secrets by the deploying compose file (loaded from a file and provided to containers as a file).
- The database credentials are used by the backend and may be used for manual database access.
- If the secrets are kept on the host, they need to be protected, for example via openssl, SOPS, or similar tools (see the sketch below).
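A minimal sketch of the openssl variant, assuming the secret files dl_pass and db_pass from the deploy how-to and an interactively entered passphrase:
# encrypt the secret files at rest (prompts for a passphrase), then remove the plaintext
$ openssl enc -aes-256-cbc -pbkdf2 -salt -in dl_pass -out dl_pass.enc && rm dl_pass
$ openssl enc -aes-256-cbc -pbkdf2 -salt -in db_pass -out db_pass.enc && rm db_pass
# decrypt temporarily right before the next deploy
$ openssl enc -d -aes-256-cbc -pbkdf2 -in dl_pass.enc -out dl_pass
$ openssl enc -d -aes-256-cbc -pbkdf2 -in db_pass.enc -out db_pass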
Infrastructure:
- Component containers are custom-built and hardened.
- Only HTTP and Basic Authentication are used, as it is assumed that HTTPS is provided by a user-provided proxy server.
Interface:
- HTTP-API GET requests are idempotent and thus unchallenged.
- HTTP-API POST requests may change the state of the database and thus need to be authorized by the data-lake admin user credentials (see the example below).
- See the DatAasee OpenAPI definition.
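For illustration, using commands from the how-tos below (localhost and the default port 8343 are assumed):
# read access: unchallenged GET
$ wget -qO- http://localhost:8343/api/v1/ready
# write access: challenged POST, authorized as the datalake admin
$ wget -qO- http://localhost:8343/api/v1/backup --user admin --ask-password --post-data=''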
How-Tos
In this section, step-by-step guides for real-world problems are listed.
Overview:
- Prerequisite
- Resources
- Deploy
- Test
- Shutdown
- Ingest
- Backup Manually
- Logs
- Update
- Upgrade
- Web Interface (Prototype)
- API Indexing
Prerequisite
The (virtual) machine deploying DatAasee requires docker-compose or podman-compose.
See also the container engine compatibility.
Resources
The compute and memory resources for DatAasee can be configured via the compose.yaml.
Overall, a bare-metal machine or virtual machine requires:
- Minimum: 2 CPU, 4G RAM
- Recommended: 4 CPU, 8G RAM
So, a Raspberry Pi would be sufficient. In terms of DatAasee components this breaks down to:
- Database:
- Minimum: 1 CPU, 2G RAM
- Recommended: 2 CPU, 4G RAM
- Backend:
- Minimum: 1 CPU, 1G RAM
- Recommended: 2 CPU, 2G RAM
- Frontend:
- Minimum: 1 CPU, 1G RAM
- Recommended: 2 CPU, 2G RAM
Note that resource and system requirements depend on load; in particular, the database and backend are under heavy load during ingest. After an ingest, (new) metadata records are interrelated, which also causes heavy database load. Generally, the database drives the overall performance, so to improve performance, first try increasing the memory of the database component (e.g. from 4G to 8G).
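To check the actual resource consumption of a running deployment, the container runtime can be queried directly; a sketch (the container names follow the default compose project naming and may differ in your setup):
$ docker stats --no-stream dataasee-database-1 dataasee-backend-1 dataasee-frontend-1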
Deploy
$ mkdir -p backup # or: ln -s /path/to/backup/volume backup
$ wget https://raw.githubusercontent.com/ulbmuenster/dataasee/0.2/compose.yaml
$ echo -n 'password1' > dl_pass && echo -n 'password2' > db_pass && docker compose up -d; rm -f dl_pass db_pass; history -d $(history 1)
NOTE: The required secrets are kept temporarily in the files dl_pass and db_pass.
NOTE: Make sure to delete (or encrypt) the secret files dl_pass and db_pass after use!
NOTE: To customize your deploy, use these environment variables.
NOTE: The runtime configuration environment variables can be stored in an .env file (see the example below).
NOTE: A custom backup location can alternatively also be specified inside the compose.yaml.
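A minimal .env example (a sketch; the variables and defaults are listed in the runtime configuration reference, and http://my.url is a placeholder):
$ cat > .env <<'EOF'
TZ=CET
DL_VERSION=0.2
DL_PORT=8343
DL_BASE=http://my.url
EOF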
Test
$ wget -SqO- http://localhost:8343/api/v1/ready
NOTE: The default port for the HTTP API is 8343.
Shutdown
$ docker-compose down
NOTE: A (database) backup is automatically triggered on every shutdown.
Ingest
$ wget -O- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data \
'{"source":"https://my.url/to/oai","method":"oai-pmh","format":"mods","steward":"https://my.url/identifying/steward"}'
NOTE: A (database) backup is automatically triggered after every ingest.
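Since only one ingest can run at a time, the ingest endpoint can be probed with an empty POST body (see the HTTP-API reference): HTTP status 400 means the service is available, 503 means an ingest is currently running. A sketch:
$ wget -SqO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data=''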
Backup Manually
$ wget -O- http://localhost:8343/api/v1/backup --user admin --ask-password --post-data=''
Logs
$ docker-compose logs backend
Update
$ docker compose down
$ docker compose pull
$ echo -n 'password1' > dl_pass && echo -n 'password2' > db_pass && docker compose up -d; rm -f dl_pass db_pass; history -d $(history 1)
NOTE: “Update” means: if available, new images of the same DatAasee version but updated dependencies will be installed, whereas “Upgrade” means: a new version of DatAasee will be installed.
Upgrade
$ docker compose down
$ echo -n 'password1' > dl_pass && echo -n 'password2' > db_pass && DL_VERSION=0.3 docker compose up -d; rm -f dl_pass db_pass; history -d $(history 1)
NOTE: docker-compose restart cannot be used here because environment variables (such as DL_VERSION) are not updated when using restart.
NOTE: Make sure to put the DL_VERSION variable also into the .env file for a permanent upgrade.
Web Interface (Prototype)
NOTE: The default port for the web frontend is 8000 for development and 80 for deployment.
API Indexing
Add the following JSON object to the apis array in your global apis.json API index.
{
"name": "DatAasee API",
"description": "The DatAasee API enables research data search and discovery via metadata",
"keywords": ["Metadata"],
"attribution": "DatAasee",
"baseURL": "http://your-dataasee.url/api/v1",
"properties": [
{
"type": "InterfaceLicense",
"url": "https://spdx.org/licenses/MIT.html"
},
{
"type": "x-openapi",
"url": "http://your-dataasee.url/api/v1/api"
}
]
}
References
In this section technical descriptions are summarized.
Overview:
- HTTP-API
- Ingest Protocols
- Ingest Encodings
- Ingest Formats
- Native Schema
- Interrelation Edges
- Ingestable to Native Schema Crosswalk
- Query Languages
- Runtime Configuration
HTTP-API
The HTTP-API is served under http://<your-url-here>/api/v1 and provides the following endpoints:
Method | Endpoint | Type | Summary |
---|---|---|---|
GET | /ready | system | Return service status |
GET | /api | special | Return API specification and schemas |
GET | /schema | metadata | Return database schema |
GET | /attributes | metadata | Return enumerated properties |
GET | /stats | data | Return statistics about records |
GET | /metadata | data | Return metadata record(s) |
POST | /insert | data | Create new record |
POST | /ingest | system | Trigger ingest from source |
POST | /backup | system | Trigger database backup |
POST | /health | system | Return service health |
GET | /export | data | TODO: |
GET | /sru | data | TODO: |
POST | /forward | system | TODO: |
NOTE: The base path for all endpoints is /api/v1.
NOTE: All GET requests are unchallenged, all POST requests are challenged; the challenges are handled via “Basic Authentication”.
NOTE: All request and response bodies have content type JSON, and if provided, the Content-Type HTTP header must be application/json!
NOTE: As the metadata-lake’s data is metadata, a type “data” means metadata, and a type “metadata” means metadata about metadata.
NOTE: Responses follow the JSON:API format.
NOTE: The id property is the server’s Unix timestamp.
/ready Endpoint
Returns boolean answering if service is ready.
NOTE: The ready endpoint can be used as readiness probe.
- Method: GET
- Parameters: None
- Response Schema: response/ready.json
- Access: Public
- Process: see architecture
Status:
Example:
Get service readiness:
$ wget -qO- http://localhost:8343/api/v1/ready
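When scripting against a fresh deployment, a simple wait loop can poll this endpoint; a sketch, assuming the readiness boolean appears as true in the response body:
$ until wget -qO- http://localhost:8343/api/v1/ready | grep -q true; do sleep 2; done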
/api Endpoint
Returns OpenAPI specification if no parameter is given, otherwise returns a request or response schema.
NOTE: In case of a successful request, the response is NOT in the JSON:API format, but the requested JSON file directly.
- Method: GET
- Parameters:
  - request (Optional; if provided, a request schema for the endpoint in the parameter value is returned.)
  - response (Optional; if provided, a response schema for the endpoint in the parameter value is returned.)
- Response Schema: response/api.json
- Access: Public
- Process: see architecture
Statuses:
Examples:
Get OpenAPI definition:
$ wget -qO- http://localhost:8343/api/v1/api
Get ingest endpoint request schema:
$ wget -qO- http://localhost:8343/api/v1/api?request=ingest
Get metadata endpoint response schema:
$ wget -qO- http://localhost:8343/api/v1/api?response=metadata
/schema Endpoint
Returns internal metadata schema.
- Method: GET
- Parameters: None
- Response Body: response/schema.json
- Access: Public
- Process: see architecture
Statuses:
Example:
Get native metadata schema:
$ wget -qO- http://localhost:8343/api/v1/schema
/attributes Endpoint
Returns list of enumerated attribute values.
- Method: GET
- Parameters:
  - type (Optional; if provided only selected attribute type is returned.)
- Response Schema: response/attributes.json
- Access: Public
- Process: see architecture
Statuses:
Example:
Get all enumerated attributes:
$ wget -qO- http://localhost:8343/api/v1/attributes
Get language attributes:
$ wget -qO- http://localhost:8343/api/v1/attributes?type=languages
/stats Endpoint
Return statistics about records.
- Method: GET
- Parameters: None
- Response Body: response/stats.json
- Access: Public
- Process: see architecture
Statuses:
Example:
$ wget -qO- http://localhost:8343/api/v1/stats
/metadata Endpoint
Fetch from, search, filter or query metadata record(s).
- Method: GET
- Parameters:
  - id (Optional; if provided, a metadata-set with this value is returned.)
  - search (Optional; if provided, full-text search results for this value are returned.)
  - query (Optional; if provided, query results using this value are returned; no language parameter implies sql.)
  - language (Optional; if provided, filter results by language are returned; also used to set the query language.)
  - resourcetype (Optional; if provided, filter results by resourceType are returned.)
  - license (Optional; if provided, filter results by license are returned.)
  - category (Optional; if provided, filter results by category are returned.)
  - from (Optional; if provided, results with publicationYear greater than or equal to this value are returned.)
  - till (Optional; if provided, results with publicationYear less than or equal to this value are returned.)
  - skip (Optional; if provided, this number of results is skipped; use for paging.)
  - newest (Optional; if provided, results are sorted newest-to-oldest if true (default), or oldest-to-newest if false.)
- Response Body: response/metadata.json
- Access: Public
- Process: see architecture
NOTE: Only idempotent read operations are permitted in custom queries.
NOTE: A full-text search always matches for all argument terms (AND-based) in titles, descriptions and keywords in any order, while accepting * as wildcards and _ to build phrases.
Statuses:
Examples:
Get record by record identifier:
$ wget -qO- http://localhost:8343/api/v1/metadata?id=
Search records by single filter:
$ wget -qO- http://localhost:8343/api/v1/metadata?language=chinese
Search records by multiple filters:
$ wget -qO- "http://localhost:8343/api/v1/metadata?resourcetype=book&language=german"
Search records by full-text for word “History”:
$ wget -qO- http://localhost:8343/api/v1/metadata?search=History
Search records by full-text and filter, oldest first:
$ wget -qO- "http://localhost:8343/api/v1/metadata?search=Geschichte&resourcetype=book&language=german&newest=false"
Search records by custom SQL query:
$ wget -qO- "http://localhost:8343/api/v1/metadata?language=sql&query=SELECT%20FROM%20metadata%20LIMIT%2010"
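Page through larger result sets with the skip parameter, for example (a sketch skipping the first 20 matches):
$ wget -qO- "http://localhost:8343/api/v1/metadata?search=History&skip=20"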
/insert Endpoint
Inserts and parses, if necessary, a new record into the database.
- Method: POST
- Request Body: request/insert.json
- Response Body: response/insert.json
- Access: Challenged (Basic Authentication)
- Process: see architecture
NOTE: This endpoint is meant for metadata records that cannot be ingested such as a report of ingested sources or testing; general use is discouraged. For details on the request body, see the associated JSON schema.
Status:
Example:
Insert record with given fields: TODO:
$ wget -qO- http://localhost:8343/api/v1/insert --user admin --ask-password --post-file=myinsert.json
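A purely hypothetical sketch of such a file, using the mandatory keys of the native schema below with placeholder values; the authoritative layout is the request/insert.json schema (see the api endpoint):
$ cat > myinsert.json <<'EOF'
{
  "name": "Example Record",
  "creators": [{"name": "Doe, Jane", "data": "https://my.url/creator"}],
  "publisher": "Example Publisher",
  "publicationYear": 2024,
  "resourceType": "book",
  "identifiers": [{"name": "url", "data": "https://my.url/record"}]
}
EOF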
/ingest Endpoint
Trigger ingest from data source.
- Method: POST
- Request Body: request/ingest.json
  - source must be a URL
  - method can be oai-pmh or s3
  - format can be datacite, oai_datacite, dc, oai_dc, marc21, marcxml, mods, or rawmods
  - steward should be a URL or email address.
- Response Body: response/ingest.json
- Access: Challenged (Basic Authentication)
- Process: see architecture
NOTE: To test if the server is busy, send an empty (POST) body to this endpoint. HTTP status 400 means available, status 503 means currently ingesting.
NOTE: The method and format are case-sensitive.
Status:
- 202 Accepted.
- 400 Invalid request.
- 403 Invalid credentials.
- 406 Not Acceptable.
- 503 Already ingesting.
Example:
Start ingest from a given source:
$ wget -qO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data='{"source":"https://datastore.uni-muenster.de/oai", "method":"oai-pmh", "format":"datacite", "steward":"forschungsdaten@uni-muenster.de"}'
/backup Endpoint
Trigger database backup.
- Method: POST
- Request Body: None
- Response Schema: response/backup.json
- Access: Challenged (Basic Authentication)
- Process: see architecture
Status:
Example:
$ wget -qO- http://localhost:8343/api/v1/backup --user admin --ask-password --post-data=''
/health Endpoint
Returns internal status of service components.
NOTE: The health endpoint can be used as liveness probe.
- Method: POST
- Parameters: None
- Response Schema: response/health.json
- Access: Public
- Process: see architecture
Status:
Example:
Get service health:
$ wget -qO- http://localhost:8343/api/v1/health --user admin --ask-password --post-data=''
/export Endpoint
TODO:
/sru Endpoint
TODO:
/forward Endpoint
TODO:
Ingest Protocols
- OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
  - Supported Versions: 2.0
- S3 (Simple Storage Service)
  - Supported Versions: 2006-03-01
Ingest Encodings
Currently, XML (eXtensible Markup Language) is the sole encoding for ingested metadata.
Ingest Formats
- DataCite
  - Supported Versions: 4.4, 4.5
  - Format Specification
- DC (Dublin Core)
  - Supported Versions: 1.1
  - Format Specification
- MARC (MAchine-Readable Cataloging)
  - Supported Versions: 1.1
  - Format Specification
- MODS (Metadata Object Description Schema)
  - Supported Versions: 3.7, 3.8
  - Format Specification
- TODO: RDF
Native Schema
Key | Class | Entry | Type | Constraints |
---|---|---|---|---|
schemaVersion | Process | Automatic | Integer | min 0 |
recordId | Process | Automatic | String | max 31 |
metadataQuality | Process | Automatic | String | max 255 |
dataSteward | Process | Automatic | String | max 4095 |
source | Process | Automatic | String | max 4095 |
createdAt | Process | Automatic | Datetime | |
updatedAt | Process | Automatic | Datetime | |
sizeBytes | Technical | Automatic | Integer | min 0 |
fileFormat | Technical | Automatic | String | max 255 |
dataLocation | Technical | Automatic | String | max 4095, regexp |
numberDownloads | Social | Automatic | Integer | min 0 |
keywords | Social | Optional | String | max 255 |
categories | Social | Optional | List(String) | max 4 |
name | Descriptive | Mandatory | String | max 255 |
creators | Descriptive | Mandatory | List(pair) | max 255 |
publisher | Descriptive | Mandatory | String | min 1, max 255 |
publicationYear | Descriptive | Mandatory | Integer | min -9999, max 9999 |
resourceType | Descriptive | Mandatory | Link(attribute) | resourceTypes |
identifiers | Descriptive | Mandatory | List(pair) | max 255 |
synonyms | Descriptive | Optional | List(pair) | max 255 |
language | Descriptive | Optional | Link(attribute) | languages |
subjects | Descriptive | Optional | List(pair) | max 255 |
version | Descriptive | Optional | String | max 255 |
license | Descriptive | Optional | Link(pair) | licenses |
rights | Descriptive | Optional | String | max 65535 |
project | Descriptive | Optional | Embedded(pair) | |
fundings | Descriptive | Optional | List(pair) | max 255 |
description | Descriptive | Optional | String | max 65535 |
message | Descriptive | Optional | String | max 65535 |
externalItems | Descriptive | Optional | List(pair) | max 255 |
rawType | Raw | Optional | String | max 255 |
raw | Raw | Optional | String | max 1048575 |
rawChecksum | Raw | Optional | String | max 255 |
NOTE: See also the schema diagram: schema.md
NOTE: The preloaded set of categories (see preload.sql) is highly opinionated.
Helper types
attributes
Property | Type | Constraints |
---|---|---|
name | String | min 3, max 255 |
also | List(String) | |
pair
Property | Type | Constraints |
---|---|---|
name | String | max 255 |
data | String | max 4095, regexp |
Global Metadata
Each schema property has a label; additionally, the descriptive properties have a comment property.
Key | Type | Comment |
---|---|---|
label | String | For UI labels |
comment | String | For UI helper texts |
Interrelation Edges
Type | Comment |
---|---|
isRelatedTo | Base edge type |
isNewVersionOf | Derived from isRelatedTo |
isDerivedFrom | Derived from isRelatedTo |
isPartOf | Derived from isRelatedTo |
isSameExpressionAs | Derived from isRelatedTo |
isSameManifestationAs | Derived from isRelatedTo |
Ingestable to Native Schema Crosswalk
TODO: Add sub elements
DatAasee | DataCite | DC | MARC | MODS |
---|---|---|---|---|
name | titles | title | 245, 130 | titleInfo, part |
creators | creators, contributors | creator, contributor | 100, 700 | name, relatedItem |
publisher | publisher | publisher | 260, 264 | originInfo |
publicationYear | publicationYear | date | 260, 264 | originInfo, part, recordInfo |
resourceType | resourceType | type | 007 , | genre |
identifiers | identifier, alternateIdentifiers | identifier | 001, 020, 856 | identifier, recordInfo |
synonyms | titles | title | 210, 222, 240, 242, 246, 247 | titleInfo |
language | language | language | 008, 041 | language |
subjects | subjects | subjects | 655, 689 | subject |
version | version | | 250 | |
license | rights | | | accessCondition |
rights | | rights | 506, 540 | |
project | | | | |
fundings | fundingReferences | | | |
description | description | description | 520 | |
message | | format | 500 | note |
externalItems | relatedIdentifiers | identifier | | identifier |
isRelatedTo | relatedItems, relatedIdentifiers | related | 773 | relatedItem |
isNewVersionOf | relatedItems, relatedIdentifiers | | | relatedItem |
isDerivedFrom | relatedItems, relatedIdentifiers | | | relatedItem |
isPartOf | relatedItems, relatedIdentifiers | | | relatedItem |
isSameExpressionAs | | | | relatedItem |
isSameManifestationAs | | | | recordInfo |
Query Languages
Language | Identifier | Documentation |
---|---|---|
SQL | sql | ArcadeDB SQL |
Cypher | cypher | Neo4J Cypher |
GraphQL | graphql | GraphQL Spec |
Gremlin | gremlin | Tinkerpop Gremlin |
MQL | mongo | Mongo MQL |
SPARQL | sparql | SPARQL (WIP) |
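The identifier column is what the language parameter of the metadata endpoint expects; for example (a sketch with the query URL-encoded):
$ wget -qO- "http://localhost:8343/api/v1/metadata?language=cypher&query=MATCH%20(m:metadata)%20RETURN%20m.name"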
Runtime Configuration
The following environment variables affect DatAasee if set before starting.
Symbol | Value | Meaning |
---|---|---|
TZ | CET (Default) | Timezone of server |
DL_VERSION | 0.2 (Example) | Requested DatAasee version |
DL_BACKUP | $PWD/backup (Default) | Path to backup folder |
DL_USER | admin (Default) | DatAasee admin username |
DL_BASE | http://my.url (Example) | Outward DatAasee base URL (including protocol and port, no trailing slash) |
DL_PORT | 8343 (Default) | DatAasee API port |
FE_PORT | 8000 (Default) | Web Frontend port (Development) |
Tutorials
In this section, learning-oriented lessons for newcomers are given.
Overview:
- Getting Started
- Example Ingest
- Container Engines
- Container Probes
- Custom Queries
- Custom Frontend
Getting Started
- Setup a compatible compose orchestrator.
- Download the DatAasee release:
  $ wget https://raw.githubusercontent.com/ulbmuenster/dataasee/0.2/compose.yaml
  or:
  $ curl -O https://raw.githubusercontent.com/ulbmuenster/dataasee/0.2/compose.yaml
- Unpack compose.yaml:
  $ tar -xf dataasee-0.2.tar.gz
  and:
  $ cd dataasee-0.2
- Create or mount a folder for backups (assuming your backup volume is mounted under /backup):
  $ mkdir -p backup
  or:
  $ ln -s /backup backup
- Create the DatAasee API and database admin passwords. A leading space before echo prevents these commands from being added to the shell history, and echo -n is used because most editors add a newline at the end of a file:
  $ echo -n 'password1' > dl_pass
  and:
  $ echo -n 'password2' > db_pass
- Start the DatAasee service:
  $ docker-compose up -d
  or:
  $ podman-compose up -d
Now, if started locally, point a browser to http://localhost:8000 to use the web frontend, or send requests to http://localhost:8343/api/v1/ to use the HTTP API directly.
Example Ingest
For demonstration purposes the collection of the “Directory of Open Access Journals” (DOAJ) is ingested. An ingest has four phases: First, the administrator needs to collect the necessary information of the metadata source, i.e. URL, protocol, format, and data steward. Second, the ingest is triggered via the HTTP-API. Third, the backend ingests the metadata records from the source to the database. Fourth and lastly, the ingested data is interconnected inside the database.
- Check the documentation of DOAJ: https://doaj.org/docs
  The oai-pmh protocol is available.
- Check the documentation about OAI-PMH: https://doaj.org/docs/oai-pmh/
  The OAI-PMH endpoint URL is: https://doaj.org/oai
- Check the OAI-PMH endpoint for available metadata formats: https://doaj.org/oai?verb=ListMetadataFormats
  A compatible metadata format is oai_dc.
- Start an ingest:
  $ wget -qO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data='{"source":"https://doaj.org/oai", "method":"oai-pmh", "format":"oai_dc", "steward":"helpdesk@doaj.org"}'
  A status 202 confirms the start of the ingest. Here, no steward is listed in the DOAJ documentation, thus a general contact is set. Alternatively, the “Ingest” form of the “Admin” page in the web frontend can be used.
- DatAasee reports the start of the ingest in the backend logs:
  $ docker logs dataasee-backend-1
  with a message akin to: Starting ingest from https://doaj.org/oai via oai-pmh as oai_dc.
- DatAasee reports completion of the ingest in the backend logs:
  $ docker logs dataasee-backend-1
  with a message akin to: Completed ingest of 20812 records from https://doaj.org/oai after 0.05h.
- DatAasee starts interconnecting the ingested metadata records:
  $ docker logs dataasee-database-1
  with a message akin to: Interconnect Started!
- DatAasee finishes interconnecting the ingested metadata records:
  $ docker logs dataasee-database-1
  with a message akin to: Interconnect Completed!
NOTE: The interconnection is a potentially long-running, asynchronous operation, whose status is only reported in the database logs.
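To follow the interconnection progress from the host, the database logs can be filtered; a sketch (the container name follows the default compose project naming and may differ in your setup):
$ docker logs -f dataasee-database-1 2>&1 | grep -i interconnect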
Container Engines
DatAasee is deployed via a compose.yaml (see How to deploy), which is compatible with the following orchestration tools:
- docker-compose
- podman-compose
- Kubernetes / Minikube via kompose
Docker-Compose
- docker
- docker-compose >= 2
Installation see: docs.docker.com/compose/install/
$ docker-compose up -d
$ docker-compose ps
$ docker-compose down
Docker-Compose (with Podman)
- podman
- podman-docker
- docker-compose
Installation see: docs.docker.com/compose/install/
NOTE: This tutorial assumes a Debian-based Linux host like Ubuntu.
$ sudo apt-get -y install dnsmasq podman-plugins containernetworking-plugins podman-docker
$ docker-compose up -d
$ docker-compose ps
$ docker-compose down
Podman-Compose
- podman >= 4.6.2
- podman-compose >= 1.0.6
NOTE: This tutorial assumes a Debian-based Linux host like Ubuntu.
Additionally:
$ sudo apt-get -y install dnsmasq podman-plugins containernetworking-plugins python3-pip
$ pip3 install podman-compose
$ podman-compose up -d
$ podman-compose ps
$ podman-compose down
Kompose (Minikube)
- minikube
- kubectl
- kompose
Installation see: kompose.io/installation/
Prepare compose.yaml:
- Add port to database service:
  services:
    database:
      ports:  # value changed from []
        - "2480:2480"
$ kompose -f compose.yaml convert
Particularly, for kompose in versions 1.33.0 and 1.34.0 the following manual changes need to be made in:
- database-deployment.yaml:
  spec:
    template:
      spec:
        containers:
          - env:
            volumeMounts:
              - name: "database"
                mountPath: "/run/secrets/database"  # value changed from "/run/secrets"
- backend-deployment.yaml:
  spec:
    template:
      spec:
        containers:
          - env:
            volumeMounts:
              - name: "database"
                mountPath: "/run/secrets/database"  # value changed from "/run/secrets"
              - name: "datalake"
                mountPath: "/run/secrets/datalake"  # value changed from "/run/secrets"
$ rm compose.yaml
$ minikube start
$ kubectl apply -f .
$ kubectl port-forward service/backend 8343:8343 # now the backend can be accessed via `http://localhost:8343/api/v1`
$ minikube stop
Container Probes
The following endpoints are available for monitoring the respective containers; here the compose.yaml host names (service names) are used.
Logs are written to the standard output.
Backend
Ready: http://backend:4195/ready returns HTTP status 200 if ready, see also Benthos ready.
Liveness: http://backend:4195/ping returns HTTP status 200 if live, see also Benthos ping.
Metrics: http://backend:4195/metrics allows Prometheus scraping, see also Connect prometheus.
Database
Ready: http://database:2480/api/v1/ready returns HTTP status 204 if ready, see also ArcadeDB ready.
Frontend
Ready: http://frontend:3000 returns HTTP status 200 if ready.
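These host names only resolve on the compose network, so a probe can be checked, for example, from a throwaway container attached to that network; a sketch (the network name dataasee_default is an assumption derived from the compose project name and may differ):
$ docker run --rm --network dataasee_default curlimages/curl -fsS http://backend:4195/ready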
Custom Queries
NOTE: All custom query results are limited to 100 items.
SQL
DatAasee uses the ArcadeDB SQL dialect. For custom SQL queries, only single, read-only queries are admissible.
The vertex type (cf. table) holding the metadata records is named metadata.
Examples:
Get the schema:
SELECT FROM schema:types
Get one-hundred metadata record titles:
SELECT name FROM metadata
Gremlin
TODO:
Get one-hundred metadata record titles:
g.V().hasLabel("metadata").values("name")
Cypher
DatAasee supports a subset of OpenCypher. For custom Cypher queries, only read-queries are admissible, meaning:
- MATCH
- OPTIONAL MATCH
- RETURN
Examples:
Get labels:
MATCH (n) RETURN DISTINCT labels(n)
Get one-hundred metadata record titles:
MATCH (m:metadata) RETURN m.name
MQL
TODO:
GraphQL
TODO:
SPARQL
TODO:
Custom Frontend
Remove Prototype Frontend
Remove the YAML object "frontend" in the compose.yaml (all lines below ## Frontend # ...).
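Alternatively, a sketch that avoids editing the file: start only the database and backend services, assuming the default service names from the compose.yaml:
$ docker compose up -d database backend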
Appendix
In this section development-related guidelines are gathered.
Overview:
- Development Decision Rationales
- Development Workflows
- Coding Standards
Development Decision Rationales:
Infrastructure
- What versioning scheme is used?
- DatAasee uses SimVer versioning, with the addition that the minor version starts with one for the first release of a major version (X.1), so during the development of a major version the minor version will be zero (X.0).
- How stable is the upgrade to a release?
- During the development releases (0.X) every release will likely be breaking, particularly with respect to backend API and database schema. Once a version 1.0 is released, breaking changes will only occur between major versions.
- What are the three compose files for?
  - The compose.develop.yaml is only for the development environment,
  - The compose.package.yaml is only for building the release container images,
  - The compose.yaml is the only file making up a release.
- Why does a release consist only of the compose.yaml?
  - The compose configuration acts as an installation script and deploy recipe. Given access to a repository with DatAasee, all containers are set up on-the-fly by pulling. No other files are needed.
- Why is Ubuntu 24.04 used as base image for database and backend?
  - Overall, the calendar-based version together with the 5-year support policy for Ubuntu LTS makes keeping current easier. Generally, glibc is used, and specifically for the database, OpenJDK is supported, as opposed to Alpine.
- Why does building the backend Docker image fail?
  - This is likely a timeout when downloading Go module packages. Multiple retries may be necessary to complete a build.
Database
- Why is an init.sh script used instead of a plain command in the database container?
  - This is a security measure; the script is designed to hide secrets which need to be passed on startup. A secondary use is the setup of the database schema in case the container is freshly created.
Backend
- Why are the main processing components part of the input and not a separate pipeline?
  - Since an ingest may take very long, it is only triggered, and the successful triggering is reported in the response while the ingest keeps running. This async behavior is only possible with a buffer, which has to be directly after the input and after the sync_response of the trigger; thus the input post-processing processors are used as the main pipeline.
- Why is the content type application/json used for responses and not application/vnd.api+json?
  - Using the official JSON MIME-type makes a response more compatible and states what it is in more general terms. Requested content types on the other hand may be either empty, */*, application/json, or application/vnd.api+json.
Frontend
- Why is the frontend a prototype?
- The frontend is not meant for direct production use but serves as system testing device, a proof-of-concept, living documentation, and simplification for manual testing. Thus it has the layout of an internal tool. Nonetheless, it can be used as a basis or template for a production frontend.
- Why is there custom JS defined?
- This is necessary to enable triggering the submit button when pressing the “Enter” key.
- Why does the frontend container use the backend name explicitly and not the host loopback, i.e. extra_hosts: [host.docker.internal:host-gateway]?
  - Because podman does not seem to support it yet.
Development Workflows
Development Setup
- git clone https://github.com/ulbmuenster/dataasee && cd dataasee (clone repository)
- make setup (builds container images locally)
- make start (starts development setup)
Dependency Updates
- Dependency documentation
- Dependency versions
- Version verification (Frontend only)
Schema Changes
API Changes
Coding Standards
- YAML and SQL files must have a comment header line containing: dialect, project, license, author.
- YAML should be restricted to StrictYAML (except github-ci and compose).
- SQL commands should be all-caps.