Dextr: Semi-automated Data Extraction Tool

Automating the labor-intensive step of data extraction has great potential to improve the speed of conducting literature reviews or literature-based health assessments and to reduce both the workload and the resources required, without compromising the rigor and transparency that are critical to the process.

Machine-learning approaches for literature-based assessments have primarily been implemented within clinical research by the medical community, although there have been efforts in other fields. There are key challenges within the environmental health field that add to the complexity of incorporating these existing clinical research-based algorithms into an environmental health science workflow.

The Division of Translational Toxicology (DTT) initiated a project with ICF and Evidence Prime to develop and test an automated tool for extracting data in literature reviews that supports user-verification of entries, or what can be called a semi-automated approach. Here we present Dextr, a semi-automated, web-based, data-extraction tool. We evaluated recall, precision, and extraction time in both the manual extraction and the semi-automated extraction modes of Dextr. We report a 47% reduction in the time to complete data extraction with similar recall and precision using Dextr’s semi-automated approach. We believe this tool will be a valuable resource for environmental health and other research fields when conducting literature reviews.
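
As a rough illustration of the metrics named above, the sketch below computes precision, recall, and a percent reduction in time. It is not the evaluation code used for Dextr: the function names, the set-based matching of extracted items against a reference set, and all example values are assumptions made for illustration only.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) for extracted items versus a reference set.

    Assumption: an extracted item counts as correct only if it exactly
    matches a reference item. A real evaluation may match more leniently.
    """
    extracted_set, reference_set = set(extracted), set(reference)
    true_positives = len(extracted_set & reference_set)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    # Recall: share of reference items that were found.
    recall = true_positives / len(reference_set) if reference_set else 0.0
    return precision, recall


def percent_reduction(baseline, new):
    """Percent reduction of `new` relative to `baseline` (e.g., minutes)."""
    return 100.0 * (baseline - new) / baseline


if __name__ == "__main__":
    # Made-up example values, not data from the Dextr evaluation.
    reference = ["rat", "10 mg/kg", "oral gavage", "28 days"]
    extracted = ["rat", "10 mg/kg", "oral gavage", "mouse"]
    print(precision_recall(extracted, reference))
    print(percent_reduction(60, 30))
```

Three of the four extracted items appear in the reference set, and three of the four reference items were found, so both precision and recall are 0.75 in this made-up case.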

DTT is providing access to the current version of this jointly developed tool for the scientific community to review and explore Dextr while development continues in preparation for Dextr’s public launch.

Features in Version 2.3.1 of Dextr include:

Complex data: Ability to connect data mentions to maintain complex hierarchical data.
Exporting: Options for multiple csv exports or annotated exports of extractions.
Importing: Support for .ris and EndNote reference imports and batch PDF upload.
Models: Dextr currently provides Natural Language Processing models*, based on the 2018 NIST Text Analysis Conference Challenge**, for extracting entities from animal studies. Dextr also supports extraction of entities using rule-based regular-expression matching.
Quality control (QC): Dextr provides a single-extractor mode and an extractor-with-QC mode.
Vocabulary management: Ability to utilize controlled vocabularies when extracting hierarchical entities to support data categorization and subsequent analysis.
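
To give a sense of what rule-based regular-expression matching can look like, here is a minimal sketch that pulls dose mentions out of a sentence. It does not reflect Dextr's actual rules or interface: the pattern, the `extract_doses` function, and the example sentence are all hypothetical.

```python
import re

# Hypothetical rule: a number followed by a dose unit.
# The unit list is illustrative, not a Dextr vocabulary.
DOSE_PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mg/kg(?:/day)?|ppm)",
    re.IGNORECASE,
)


def extract_doses(text):
    """Return one dict per dose mention found in `text`."""
    return [
        {
            "text": match.group(0),               # matched span as written
            "value": float(match.group("value")), # numeric part
            "unit": match.group("unit"),          # unit part
            "start": match.start(),               # character offsets, useful
            "end": match.end(),                   # for highlighting in a PDF
        }
        for match in DOSE_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sentence = (
        "Rats received 0, 10, or 50 mg/kg/day by gavage; "
        "drinking water contained 2.5 ppm."
    )
    for mention in extract_doses(sentence):
        print(mention["text"], mention["value"], mention["unit"])
```

In the example sentence this rule finds "50 mg/kg/day" and "2.5 ppm" but misses the 0 and 10 dose levels, because they are not directly followed by a unit. Gaps like this are one reason a tool might pair automated suggestions with user verification.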

For more information or to request access to explore the tool, email Vickie R. Walker.

Citation: Walker VR, Schmitt CP, Wolfe MS, Nowak AJ, Kulesza K, Williams AR, Shin R, Cohen J, Burch D, Stout MD, Shipkowski KA, Rooney AR. 2021. Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr. Environment International in press.

* Nowak A, Kunstman P. 2018. Team EP at TAC 2018: Automating data extraction in systematic reviews of environmental agents. Paper presented at: National Institute of Standards and Technology Text Analysis Conference. Gaithersburg, MD.

** Schmitt C, Walker V, Williams A, Varghese A, Ahmad Y, Rooney A, Wolfe M. 2018. Overview of the TAC 2018 systematic review information extraction track. Paper presented at: National Institute of Standards and Technology Text Analysis Conference. Gaithersburg, MD.

National Institute of Environmental Health Sciences
