How It Works
The purpose of FuTRES is to make trait data from biological and paleontological specimens accessible in a format that improves discoverability and promotes novel research. By following a few steps and using new tools developed by the FuTRES team, these data can be shared on the FuTRES platform, which is backed by an ontological framework that enables logical reasoning.
Step 1: Template
We have developed a template (viewable here) to help data providers create datasets that are ready for ingestion into the FuTRES knowledge base. The field names in the template largely correspond to Darwin Core terms. Because Darwin Core is the most commonly used standard for sharing biodiversity occurrence data, these fields may already exist in most collections databases; if not, they can be mapped (crosswalked) from existing fields.
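A crosswalk of this kind can be sketched in a few lines of Python. The local column names and the sample record below are invented for illustration. The target names are real Darwin Core terms, but whether the FuTRES template uses exactly these fields is an assumption.

```python
import csv
import io

# Hypothetical local column names (left) mapped to Darwin Core terms (right).
CROSSWALK = {
    "cat_no": "catalogNumber",
    "museum": "institutionCode",
    "taxon": "scientificName",
    "date_collected": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}


def crosswalk_rows(rows, mapping):
    """Rename each row's keys using the mapping; unmapped columns pass through unchanged."""
    return [
        {mapping.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Made-up sample export from a collections database.
raw = io.StringIO(
    "cat_no,museum,taxon,femur_length_mm\n"
    "12345,UF,Odocoileus virginianus,251.3\n"
)
rows = crosswalk_rows(csv.DictReader(raw), CROSSWALK)
print(rows[0])
```

Columns without a mapping, such as the trait measurement here, are left as they are so that they can be handled in a later step.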
Existing Data and Future Collaborations
Currently, the FuTRES team is working with Principal Investigators Dr. Kitty Emery, Dr. Ray Bernor, and Dr. Edward Davis to share existing zooarchaeological and paleontological specimen datasets with associated trait information. These use cases have informed the development of the FuTRES template and the FuTRES Ontology of Vertebrate Traits. We are also hosting community workshops and welcome feedback and additional trait-focused datasets from other sources.
Step 2: Pipeline
The data processing pipeline consists of five main steps: pre-processing, triplifying, reasoning, conversion to a tabular format, and data loading. The pipeline depends on an existing ontology that defines and relates the terminology used in the data, but it does not require a specific structure.
The pre-processing step lets users write a custom method that converts each dataset into a common format; using the template minimizes the amount of customization needed. After pre-processing, RDF triples are generated according to configuration files for each data source, which specify term mappings, data validation rules, and the relationships between processes and objects as defined by the ontology. Every instance receives a globally unique identifier, formed by appending the record identifier from the input data to a globally unique, resolvable HTTP prefix that can be customized for each project. Because instance identifiers are derived from the input data, output identifiers can be linked back to the specific records in the raw source data, which also provides a mechanism for tracking data provenance.
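The identifier scheme described above can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the prefix, the function names, and the sample record are hypothetical, and percent-encoding the record identifier is an assumption made here so that the result is always a valid IRI.

```python
from urllib.parse import quote, unquote

# Hypothetical project prefix; a real project would use its own resolvable HTTP prefix.
PREFIX = "https://example.org/futres/specimen/"

# Real Darwin Core term IRI, used here only as an example predicate.
DWC_SCIENTIFIC_NAME = "http://rs.tdwg.org/dwc/terms/scientificName"


def mint_iri(prefix, record_id):
    """Append the percent-encoded record identifier to the project prefix."""
    return prefix + quote(str(record_id), safe="")


def record_id_from_iri(prefix, iri):
    """Recover the source record identifier from a minted IRI (provenance lookup)."""
    if not iri.startswith(prefix):
        raise ValueError(f"{iri!r} was not minted under {prefix!r}")
    return unquote(iri[len(prefix):])


def to_ntriple(subject_iri, predicate_iri, literal):
    """Serialize one triple with a plain string literal as an N-Triples line."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject_iri}> <{predicate_iri}> "{escaped}" .'


iri = mint_iri(PREFIX, "UF 12345/a")
print(to_ntriple(iri, DWC_SCIENTIFIC_NAME, "Odocoileus virginianus"))
print(record_id_from_iri(PREFIX, iri))
```

Because the record identifier is embedded in the IRI, stripping the prefix is enough to find the original row in the source data.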
The next step in the workflow, reasoning, uses the bundled OntoPilot software (Stucky et al., 2018), which supports multiple description logic profiles through multiple reasoners. The workflow can pass an optional configuration file to OntoPilot, allowing users to further customize the reasoning process.
The reformatting step converts the data to a series of CSV files using a customizable SPARQL query run through query_fetcher, a bundled package for fast conversion of RDF to tabular data that is built on the Apache Jena Java library. The output data can be loaded into whichever data storage system the user prefers, including key/value stores (e.g., Elasticsearch), relational databases (e.g., PostgreSQL), or triplestores (e.g., Blazegraph).
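To give a sense of what converting RDF to tabular data involves, the sketch below pivots a small list of subject-predicate-object triples into one CSV row per subject. It is a plain Python illustration only: it does not use query_fetcher, SPARQL, or Apache Jena, and the function names, short predicate names, and sample values are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def triples_to_rows(triples):
    """Pivot (subject, predicate, object) triples into one dict per subject.

    Simplification: if a subject has several values for one predicate,
    only the last value is kept.
    """
    by_subject = defaultdict(dict)
    for subject, predicate, obj in triples:
        by_subject[subject][predicate] = obj
    columns = sorted({predicate for _, predicate, _ in triples})
    rows = [
        {"id": subject, **{col: props.get(col, "") for col in columns}}
        for subject, props in by_subject.items()
    ]
    return ["id"] + columns, rows


def rows_to_csv(header, rows):
    """Write the rows to CSV text using the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up triples with shortened names in place of full IRIs.
triples = [
    ("s1", "scientificName", "Puma concolor"),
    ("s1", "measurementValue", "210.5"),
    ("s2", "scientificName", "Equus ferus"),
]
header, rows = triples_to_rows(triples)
csv_text = rows_to_csv(header, rows)
print(csv_text)
```

Subjects that lack a value for a column get an empty cell, which keeps every row the same width for loading into a database.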
Step 3: Ontology
The FuTRES Ontology of Vertebrate Traits (FOVT) is a fundamental tool for accomplishing the FuTRES project goals. An ontology is a knowledge representation that describes concepts and their relationships to one another in a logical framework understandable by both machines and humans. This logical format allows the data in an ontology-backed knowledge base to be reasoned over, yielding new inferences. The FOVT is an application ontology designed specifically to serve the purposes of the FuTRES project. It was developed by Dr. Ramona Walls, Dr. Meghan Balk, and Laura Brenskelle, and it reuses many existing ontologies (for example, UBERON, PATO, and BSPO) to conceptualize different vertebrate traits.
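The simplest kind of inference a reasoner makes is following subclass relationships: a measurement recorded under a specific trait is also an instance of every more general trait above it. The toy Python sketch below shows only that idea. The class labels are invented and are not actual FOVT terms, each class is given a single parent for simplicity, and real OWL reasoners handle far richer logic.

```python
# Hypothetical class hierarchy: child label -> parent label.
SUBCLASS_OF = {
    "femur length": "limb bone length",
    "limb bone length": "bone length",
    "bone length": "length",
}


def ancestors(cls, subclass_of):
    """Return all superclasses of cls, from nearest to most general."""
    found = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        if cls in found:  # guard against accidental cycles in the hierarchy
            break
        found.append(cls)
    return found


def infer_types(asserted_class, subclass_of):
    """Return the asserted class followed by every class inferred from it."""
    return [asserted_class] + ancestors(asserted_class, subclass_of)


inferred = infer_types("femur length", SUBCLASS_OF)
print(inferred)
```

With these inferred types in place, a search for all "bone length" measurements would also return records that were only ever labeled "femur length".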