bdtools
The bdverse (biodiversity data universe) is a biodiversity data quality toolkit constructed as a family of R packages (https://bdverse.org/). It allows users with and without programming skills to conveniently and coherently employ R for data exploration, quality assessment, data cleaning, and standardization.
bdverse is a hierarchical package system. Its architecture comprises six core functionality units, of which four are already operative and two are under construction. These units are: (i) bddwc: a Darwin Core field name standardizer, which facilitates data inclusiveness from any biodiversity data aggregator. (ii) bdchecks: a biodiversity data quality checks system for performing, filtering, developing, and managing various biodiversity data checks. (iii) bdclean: a user-friendly data cleaning workflow system, composed of questionnaires (to collect the user's specific needs) and data cleaning reports. (iv) bdvis: an interactive biodiversity data visualizations and dashboards system (under construction). (v) bdtools: an agile and modular tool framework for biodiversity data exploration (under construction). (vi) bdverse: the main installation package (one package to rule them all), which also stores the Shiny apps launcher. Currently, bdverse contains five Shiny apps and eleven R packages.
Enetwild is a European consortium (https://enetwild.com/) funded by the European Food Safety Authority and involving 10 agencies and universities. It aims to aggregate and harmonise the outputs of wildlife disease-host monitoring in Europe. The consortium's first focus is wild boar, due to the spread of African swine fever through Europe. Outputs consist of data on species occurrences, population abundances, and hunting bags at different scales. To harmonise the data, Enetwild adopted the Darwin Core standard and proposed some innovations to adapt it to its needs, forming the Wildlife Data Model (WLDM) (https://efsa.onlinelibrary.wiley.com/doi/abs/10.2903/sp.efsa.2020.EN-1841 ; https://biss.pensoft.net/article/59120/list/19/). The Enetwild project developed a tool to help regional coordinators format data producers' datasets into the WLDM. The tool was designed to be generic, so that it can also be applied to the Darwin Core standard.
The WLDM app was developed in R with the Shiny framework. The standardisation process is divided into two steps: (i) the generic standardisation transforms a user dataset into a tidy data frame according to database standards (i.e. one observation per row, one variable per column); (ii) the specific standardisation transforms the dataset into the WLDM format.
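As an illustration of the generic standardisation step, the sketch below reshapes a small wide-format table into a tidy one in base R. It is not the WLDM app's code: the dataset, the column names (`boar_2019`, `boar_2020`), and the choice of `individualCount` as the target column are assumptions made for the example.

```r
# Hypothetical wide-format user dataset: one column of counts per year.
wide <- data.frame(
  site = c("A", "B"),
  boar_2019 = c(12, 7),
  boar_2020 = c(15, 9),
  stringsAsFactors = FALSE
)

# Columns that hold the same variable (a count) for different years.
count_cols <- c("boar_2019", "boar_2020")

# Stack the count columns so that each row is one observation
# (site, year, count) and each column is one variable.
long <- do.call(rbind, lapply(count_cols, function(col) {
  data.frame(
    site = wide$site,
    year = as.integer(sub("boar_", "", col)),  # year taken from the column name
    individualCount = wide[[col]],
    stringsAsFactors = FALSE
  )
}))

print(long)
```

The same reshaping could be written with `tidyr::pivot_longer()`; base R is used here only to keep the sketch free of dependencies.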
Under the hood, bdverse comprises 11 R packages, with a hierarchical dependency structure between them. At the first level, bdutilities stores the shared R functionality, so any bdverse package can access it without any code duplication. bdutilities.app stores all the shared Shiny modules (e.g., data download/upload; package citations; Darwinization) for efficient Shiny code maintenance and development. bddwc comprises two packages: one stores its R functionality, and the other (bddwc.app) stores its Shiny modules and app functionality. The same dual-package structure was applied to bdchecks, but with an additional Shiny admin app within bdchecks.app that provides a convenient user interface for editing and managing numerous data checks. bdclean stores all its components in a single app, since users don’t need access to its R functionality, as the entire pipeline is carried out within the app.
All Shiny apps are modularized (see figure below), so the same module can be used in many apps (e.g., data download/upload; package citations; Darwinization), for flexibility, extensibility, and a plug-and-play capability between Shiny modules. bdverse modularity coupled with a robust QA scheme cultivates an agile yet stable development framework.
Figure 1: bdverse Shiny modules architecture. All our apps are modularized, which gives us the ability to build an app from modules from different packages. This gives us great order and control over the Shiny code.
The WLDM app follows a sequential approach. The first step is the generic standardisation into a tidy data frame. Several options are offered to reformat the data frame (i.e. add new columns, duplicate columns, convert a wide-format data frame into a long-format data frame, convert column values). The more complex step of specific standardisation consists of:
- Converting user field names into Darwin Core standard names. This step is very similar to the bddwc unit from the bdverse package. An intermediate concept-based filter helps the user select the right standard name.
- Dividing the fields into Event, Occurrence and Measurement, which are the Wildlife Data Model (WLDM) components.
- For both the Event and Occurrence sets of fields:
  - Identifying the different levels
  - Enriching the information by adding standard variables with free or controlled values.
The last step allows visualising and exporting the formatted dataset, including a diagram view of its structure.
Figure 2: WLDM app architecture.
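The division of fields into Event and Occurrence components can be sketched in base R as follows. This is only an illustration under stated assumptions, not the WLDM app's implementation: the example data, the assignment of fields to each component, and the generated identifier formats (`event-1`, `occ-1`) are hypothetical. The column names themselves are standard Darwin Core terms.

```r
# Hypothetical flat dataset whose columns already use Darwin Core terms.
flat <- data.frame(
  eventDate = c("2020-01-10", "2020-01-10", "2020-02-03"),
  locality = c("Forest A", "Forest A", "Forest B"),
  scientificName = c("Sus scrofa", "Capreolus capreolus", "Sus scrofa"),
  individualCount = c(4, 2, 6),
  stringsAsFactors = FALSE
)

# Assumed assignment of fields to the two components.
event_fields <- c("eventDate", "locality")
occurrence_fields <- c("scientificName", "individualCount")

# Event table: one row per distinct combination of event fields.
events <- unique(flat[, event_fields])
rownames(events) <- NULL
events$eventID <- paste0("event-", seq_len(nrow(events)))

# Helper building a comparable key from the event fields of a data frame.
event_key <- function(df) do.call(paste, c(df[event_fields], sep = "|"))

# Occurrence table: one row per record, linked to its event by eventID.
occurrences <- flat[, occurrence_fields]
occurrences$eventID <- events$eventID[match(event_key(flat), event_key(events))]
occurrences$occurrenceID <- paste0("occ-", seq_len(nrow(occurrences)))

print(events)
print(occurrences)
```

Keeping `eventID` in the occurrence table means the two tables can be joined back into the original flat view, which makes it easy to check the split.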
Integrating the WLDM app into the bdDwC unit is consistent with both tools' objectives and approaches. Both were developed in R with Shiny and take as input user datasets that need to be standardized. While the bdDwC unit focuses on darwinization (i.e. standardizing field names to Darwin Core (DwC) terms), the core of the WLDM app is structuring the data into the Event/Occurrence/Measurement format. The bdDwC unit also offers additional features, including spatial visualization of the dataset after import and the ability to import and export the dictionary used for darwinization.
Figure 3: Comparison of bdDwC and WLDM app features. E: Event, O: Occurrence, M: Measurement
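To make the idea of dictionary-based darwinization concrete, here is a minimal base R sketch. It does not use or reproduce the bdDwC API: the helper `rename_to_dwc()`, the dictionary layout (`fieldname`, `standard`), and the sample column names are all hypothetical. Only the target names (`scientificName`, `decimalLatitude`, `decimalLongitude`) are actual Darwin Core terms.

```r
# Hypothetical dictionary mapping user field names to Darwin Core terms.
dictionary <- data.frame(
  fieldname = c("species", "lat", "lon"),
  standard = c("scientificName", "decimalLatitude", "decimalLongitude"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename the columns found in the dictionary
# (case-insensitive match) and leave all other columns untouched.
rename_to_dwc <- function(data, dictionary) {
  idx <- match(tolower(names(data)), tolower(dictionary$fieldname))
  matched <- !is.na(idx)
  names(data)[matched] <- dictionary$standard[idx[matched]]
  data
}

user_data <- data.frame(
  Species = c("Sus scrofa", "Sus scrofa"),
  Lat = c(48.85, 45.76),
  Lon = c(2.35, 4.84),
  observer_note = c("tracks", "sighting"),
  stringsAsFactors = FALSE
)

darwinized <- rename_to_dwc(user_data, dictionary)
print(names(darwinized))
```

Because the dictionary is a plain data frame, it can be exported and re-imported as a CSV file, which is in the spirit of the dictionary import/export feature described above.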
In order to successfully merge the WLDM app into a bdverse app, we have identified 6 challenging yet crucial steps:
- Learning and internalizing bdverse architecture, different packages, all its apps and different Shiny modules.
- Learning {golem} (a Shiny app development framework)
- Understanding the Darwin Core standard (DwC).
- Understanding the Wildlife Data Model (WLDM) and the WLDM app goals, based on the exploration of different examples of user datasets.
- Managing spatial information (WKT format, package {sf}).
- Identifying how to integrate the WLDM app into the bdverse app.
R (package development), Shiny app development (advanced level), tidyverse & data wrangling with dplyr, data visualizations, testing (testthat; shinytest). Advantage: experience in working with biodiversity data.
The motivation to develop the bdverse was derived from the urgent need to address biodiversity data quality in scientific research. The multilayered complexity of deducing data fitness for use calls for a user-friendly infrastructure forged by user-level needs and R's rich package ecosystem. We conceived and designed the conceptual, methodological, and technical foundations of a modular and agile, R-based toolkit for biodiversity data quality. Hopefully, the bdverse will promote practicality, learning, and reproducibility, for the benefit of biodiversity worldwide.
Students, please contact mentors below after completing at least one of the tests below.
- Tomer Gueta [email protected] is leading the bdverse project, which was born out of his Ph.D. work. The bdverse development team/family was founded thanks to GSoC. Today, Tomer is dividing his time between establishing a citizen science national center and developing an IT infrastructure for Hamaarag - Israel's National Nature Assessment Program. Both projects are for the Steinhardt Museum of Natural History, Tel Aviv University.
- Thiloshon Nagarajah [email protected] is the Shiny lead of the bdverse development team. He is a past GSoC and GCI student for the Fedora Project, the Sahana Foundation, and the R project. Thiloshon joined bdverse as a Google Summer of Code student developer in 2017 and has been a student, contributor, mentor, and now a core member of the bdverse team. All things Shiny in bdverse are Thiloshon's magic.
- Guillaume Body [email protected] is an ecologist at the French Agency of Biodiversity (OFB), in charge of terrestrial vertebrate monitoring. He represents OFB in the Enetwild consortium and proposed the evolution of the Darwin Core standard.
- Sarah Valentin [email protected] is a veterinarian with a PhD in informatics, working at the French Agency of Biodiversity (OFB) on the ENETWILD project.
- Vijay Barve [email protected] is the author and maintainer of bdvis and a key member in the bdverse development team. Vijay is a biodiversity data scientist and has been a GSoC student and mentor since 2012 with the R project organization. Vijay has contributed to several packages on CRAN.
- Sunny Dhoke [email protected] was the student lead for the DevOps project during GSoC 2020. His insights and industry experience will be invaluable in supporting this infrastructural project.
Students, please do one or more of the following tests before contacting the mentors above. We designed these tests to be incorporated into your proposal rather quickly.
- Easy: fork the bdutilities package, develop two unit tests, and submit a PR.
- Medium: fork the bdutilities.app package, build a simple Shiny app by incorporating the three modules in this repository (look at bdverse's other Shiny apps to see how this can be done easily); develop two Shiny tests, and submit a PR with a feature branch.
- Medium: study the bdDwC package and formulate an R markdown document describing its ideal testing strategy, the more detailed, the merrier.
- Hard: convert a dataset into the WLDM format; please follow the instructions.
Students, please post here a GitHub link to your test solutions in the format: Name - Email - University - Link to solutions