MetaDIG-Py is a library that contains helper functions and classes to assist users with writing Python checks for suites in metadig-checks.
- Authors: Jeanette Clark, Dou Mok, Peter Slaughter
- License: Apache 2
- Package source code on GitHub
- Submit Bugs and feature requests
- Contact us: [email protected]
- DataONE discussions
The MetaDIG-py client package contains Python modules that users can call when writing checks for metadig-checks. By importing this package, users get access to all available helper functions and classes, such as object_store.py, which enables users to write checks that efficiently retrieve the data objects they work with.
To run a suite, you must have the path to the suite to run, the path to the folder containing all the checks, the path to the metadata file, and the path to the metadata's system metadata.
from metadig import suites
suite_file_path = "/path/to/FAIR-suite-0.4.0.xml"
checks_path = "path/to/folder/containing/checks"
metadata_file_path = "path/to/metadata:data_file.xml"
sysmeta_file_path = "path/to/metadata:data_sysmeta_file.xml"
# Note: storemanager_props is only relevant if you are executing data checks
# and defaults to 'None'
suite_results = suites.run_suite(
    suite_file_path,
    checks_path,
    metadata_file_path,
    sysmeta_file_path,
)
print(suite_results)
{
"suite": "FAIR-suite-0.4.0.xml",
"timestamp": "2025-04-23 12:12:31",
"object_identifier": "doi:10.18739/A2QJ78081",
"run_status": "SUCCESS",
"run_comments": [
"Check not found: /Users/doumok/Code/metadig-py/tests/testdata/checks/resource.abstractLength.sufficient.1.xml",
...
"Check not found: /Users/doumok/Code/metadig-py/tests/testdata/checks/resource.type.valid.1.xml",
],
"sysmeta": {
"origin_member_node": "urn:node:ARCTIC",
"rights_holder": "http://orcid.org/0000-0001-2345-6789",
"date_uploaded": "2024-07-03T19:46:44.414+00:00",
"format_id": "https://eml.ecoinformatics.org/eml-2.2.0",
"obsoletes": "urn:uuid:dou-test-obsoleted"
},
"results": [
{
"check_id": "metadata.identifier.resolvable-2.0.0.xml",
"identifiers": [
"N/A"
],
"output": "The metadata identifier 'urn:uuid:dou-test-obsoleted' was found and is resolvable using the DataONE resolve service.",
"status": "SUCCESS"
},
{
"check_id": "entity.attributeName.differs-2.0.0.xml",
"identifiers": [
"N/A"
],
"output": "All 33 attributes have definitions that differ from the name",
"status": "SUCCESS"
},
{
"check_id": "provenance.ProcessStepCode.present-2.0.0.xml",
"identifiers": [
[
"urn:uuid:6a7a874a-39b5-4855-85d4-0fdfac795cd1"
]
],
"output": [
"Unexpected exception while running check: list index out of range"
],
"status": "Unable to execute check."
},
{
"check_id": "resource.license.present-2.0.0.xml",
"identifiers": [
"N/A"
],
"output": "The resource license 'This work is dedicated to the public dom...' was found.",
"status": "SUCCESS"
}
]
}
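Because run_suite returns a plain dictionary, you can post-process it directly. Below is a minimal sketch that tallies check statuses, assuming the result shape shown above (the sample dictionary here is abridged from the example output, not produced by a real run):

```python
from collections import Counter

# Abridged sample of the suite_results shape shown above (illustrative only).
suite_results = {
    "run_status": "SUCCESS",
    "results": [
        {"check_id": "metadata.identifier.resolvable-2.0.0.xml",
         "status": "SUCCESS"},
        {"check_id": "provenance.ProcessStepCode.present-2.0.0.xml",
         "status": "Unable to execute check."},
    ],
}

# Count how many checks finished in each status.
status_counts = Counter(r["status"] for r in suite_results["results"])
print(dict(status_counts))
# {'SUCCESS': 1, 'Unable to execute check.': 1}
```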
To run a single metadata check, pass the check XML file path, the metadata file path, and the metadata's system metadata file path to the run_check function.
from metadig import checks
check_file_path = "/path/to/resource.license.present-2.0.0.xml"
metadata_file_path = "path/to/metadata:data_file.xml"
sysmeta_file_path = "path/to/metadata:data_sysmeta_file.xml"
result = checks.run_check(
    check_file_path,
    metadata_file_path,
    sysmeta_file_path,
)
print(result)
{
"output": "The resource license 'This work is dedicated to the public dom...' was found.",
"status": "SUCCESS"
}
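Since run_check also returns a plain dictionary, downstream tooling can inspect it directly. A minimal sketch, assuming the "output"/"status" shape shown above:

```python
# Sketch: inspect a run_check result dict (shape taken from the example above).
def check_passed(result):
    """Return True when a check result reports SUCCESS."""
    return result.get("status") == "SUCCESS"

result = {
    "output": "The resource license '...' was found.",
    "status": "SUCCESS",
}
print(check_passed(result))  # True
```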
In your Python code, you can import a specific module or function like so:
from metadig import read_sysmeta_element
Currently, the following modules, classes, and functions are available:
"StoreManager", # class
"getType", # fn
"isResolvable", # fn
"isBlank", # fn
"toUnicode", # fn
"read_sysmeta_element", # fn
"find_eml_entity", # fn
"find_entity_index", # fn
"read_csv_with_metadata", # fn
"get_valid_csv", # fn
"run_check", # fn
"checks" # Module
To install MetaDIG-py, run the following commands:
$ mkvirtualenv -p python3.9 metadigpy  # Create a virtual environment
(metadigpy) $ poetry install  # Run poetry command to install dependencies
- Note: If you run into an issue installing jep, it is likely due to a backwards compatibility issue with setuptools. Try downgrading to version 58.0.0:
(metadigpy) $ pip install setuptools==58.0.0
To confirm that you have installed MetaDIG-py successfully, run pytest:
(metadigpy) $ pytest
- Tip: You may run pytest with the -s option (e.g. pytest -s) to see print statements. Not all tests have print statements, but if you want to extend what already exists, this may prove helpful!
The MetaDIG-py command line client was created to help users test Python checks without having to spin up the Java engine (metadig-engine) and run a check through the dispatcher.
After you've installed MetaDIG-py, you will have access to the metadigpy command. Please see the installation section above if you haven't installed MetaDIG-py. Below is what running a check looks like:
(metadigpy) $ metadigpy -runcheck -store_path=/path/to/hashstore -check_xml=/path/to/check_xml -metadata_doc=/path/to/metadata/doc -sysmeta_doc=/path/to/sysmeta
The metadigpy client extracts the identifier (e.g. a DOI) and the authoritative member node (MN) (e.g. urn:node:ARCTIC) from the system metadata document supplied for the given EML metadata document. It then passes these values to the run_check function, which retrieves the associated data pids and their respective system metadata from the given HashStore. The run_check function then parses the provided check XML, validates the check definition, executes the check, and finally prints the result to stdout.
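The extraction step described above can be sketched with xml.etree; the element names follow the DataONE SystemMetadata schema, and the sample document here is illustrative, not a real sysmeta file:

```python
import xml.etree.ElementTree as ET

# Illustrative sysmeta fragment (not a real document); element names follow
# the DataONE SystemMetadata schema.
sysmeta_xml = """
<systemMetadata>
  <identifier>doi:10.18739/EXAMPLE</identifier>
  <authoritativeMemberNode>urn:node:ARCTIC</authoritativeMemberNode>
</systemMetadata>
"""

root = ET.fromstring(sysmeta_xml)
identifier = root.findtext("identifier")
member_node = root.findtext("authoritativeMemberNode")
print(identifier, member_node)
# doi:10.18739/EXAMPLE urn:node:ARCTIC
```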
As of this writing, the metadigpy client is only set up to work with the following MN:
- urn:node:ARCTIC
To have additional nodes set up, please contact us at [email protected]
To set up a data check, you must have/prepare the following before you run the metadigpy client command (above):
- A HashStore - this step is necessary because run_check will look for the data objects in a HashStore after retrieving the data pids.
- The data objects associated with the DOI to check, stored in the HashStore, including the data objects' system metadata.
- A copy of the metadata document and its respective system metadata for the DOI.
- The check you want to run
- Note: A HashStore is only required if you are running data quality checks
HashStore is a Python package developed for DataONE services to efficiently access data objects, metadata and system metadata. To simulate the process of retrieving data objects within a metadig check, we must mimic the environment in which it happens in production. The requirement of having a HashStore means that we need to create a HashStore and then store data objects and system metadata inside of it. Please see below for an example:
# Step 0: Install hashstore via poetry to create an executable script
(metadigpy) ~/Code $ git clone https://github.com/DataONEorg/hashstore.git ~/Code/hashstore
(metadigpy) ~/Code/hashstore $ poetry install
# Step 1: Create a HashStore at your desired store path (ex. /var/metacat/hashstore)
(metadigpy) ~/Code/hashstore $ hashstore /path/to/store/ -chs -dp=3 -wp=2 -ap=SHA-256 -nsp="https://ns.dataone.org/service/types/v2.0#SystemMetadata"
# Store a data object
(metadigpy) ~/Code/hashstore $ hashstore /path/to/store/ -storeobject -pid=persistent_identifier -path=/path/to/object
# Store a metadata object
(metadigpy) ~/Code/hashstore $ hashstore /path/to/store/ -storemetadata -pid=persistent_identifier -path=/path/to/metadata/object -formatid=https://ns.dataone.org/service/types/v2.0#SystemMetadata
On your file system, HashStore looks like a folder with data objects and system metadata stored with hashes based on either a content identifier, or a combination of values that create a unique identifier. To interact with a HashStore and learn more, please see the documentation here.
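To illustrate the sharded layout idea, the sketch below derives a directory path from a content hash using the depth (-dp=3) and width (-wp=2) values from the commands above. This is a demonstration of the concept only, not HashStore's actual implementation:

```python
import hashlib

def sharded_path(content, depth=3, width=2, algorithm="sha256"):
    """Illustrative sketch: derive a sharded storage path from a content hash,
    mirroring the -dp (depth) and -wp (width) options shown above.
    Not HashStore's actual code; just a demonstration of the idea."""
    digest = hashlib.new(algorithm, content).hexdigest()
    # Split the first depth*width hex characters into directory segments,
    # and use the remainder of the digest as the file name.
    segments = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join(segments + [digest[depth * width:]])

path = sharded_path(b"example data object")
print(path)  # e.g. "ab/cd/ef/<remaining digest characters>"
```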
During the run_check process, after retrieving the data pids for the provided identifier, we retrieve their associated data objects and system metadata from the provided HashStore. If these files are not found, the check you're trying to run or test may fail with errors. Note that every data object stored in HashStore must have equivalent system metadata, which describes the basic attributes of that data object.
Every dataset not only has metadata about the dataset (which usually comes in the form of an EML metadata document) but also system metadata for the metadata itself. To run a metadig check at this time, we need both the metadata document and its respective system metadata. The system metadata is parsed for the identifier, which is used to retrieve the appropriate data pids, which are then used in the check.
TODO: Discuss how a Python check is created, and link to metadig-checks for more info.
$ mkvirtualenv -p python3.9 metadigpy  # Create a virtual environment
(metadigpy) ~/Code $ git clone https://github.com/NCEAS/metadig-py.git ~/Code/metadigpy
(metadigpy) ~/Code $ cd metadigpy
(metadigpy) ~/Code/metadigpy $ poetry install  # Run poetry command to install dependencies
(metadigpy) ~/Code/metadigpy $ git clone https://github.com/DataONEorg/hashstore.git ~/Code/hashstore
(metadigpy) ~/Code/metadigpy $ cd ../hashstore
(metadigpy) ~/Code/hashstore $ poetry install
# Step 1: Create a HashStore at your desired store path (ex. /var/metacat/hashstore)
(metadigpy) ~/Code/hashstore $ hashstore /path/to/store/ -chs -dp=3 -wp=2 -ap=SHA-256 -nsp="https://ns.dataone.org/service/types/v2.0#SystemMetadata"
# Store a data object
(metadigpy) ~/Code/hashstore $ hashstore /path/to/store/ -storeobject -pid=persistent_identifier -path=/path/to/object
# Store a metadata object
(metadigpy) ~/Code/hashstore $ hashstore /path/to/store/ -storemetadata -pid=persistent_identifier -path=/path/to/metadata/object -formatid=https://ns.dataone.org/service/types/v2.0#SystemMetadata
(metadigpy) ~/Code/hashstore $ metadigpy -runcheck -store_path=/path/to/hashstore -check_xml=/path/to/check_xml -metadata_doc=/path/to/metadata/doc -sysmeta_doc=/path/to/sysmeta
{'Check Status': 0, 'Check Result': ['...RESULT...']}
Copyright 2020-2025 [Regents of the University of California]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Work on this package was supported by:
- DataONE Network
- Arctic Data Center: NSF-PLR grant #2042102 to M. B. Jones, A. Budden, M. Schildhauer, and J. Dozier
Additional support was provided for collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.