Data and knowledge management (DKM) systems collect, manage, and provide controlled access to data and knowledge resources. These systems may also provide critical analytical and visualization capabilities to support research and decision processes. Data within the DKM may be at any stage of its lifecycle.
The ability to effectively curate, combine, and use scientific and operational data and knowledge resources (e.g., research data sets, databases, knowledge bases, content management systems [CMS]) is integral to the goals of each NIEHS division. The goals vary across the institute, including enabling data-driven scientific discovery, informing health policies and funding decisions, and informing business operations. The ability to use data- and knowledge-driven approaches is increasingly providing competitive advantage for researchers and is necessary for informed and defensible decision-making. Further, data- and knowledge-driven approaches are integral to fulfilling the NIH mandate for scientific rigor and transparency in the biomedical sciences.
The ability to use interoperable terminologies and semantics between systems is a critical requirement to bring together and make effective use of disparate data and knowledge resources. Unfortunately, DKM systems, whether commercial off-the-shelf tools or tools developed in-house, can be costly to acquire, develop, customize, and maintain. The rapid pace of data and knowledge generation, together with the evolution of DKM solutions, strains planning, budgets, and staff resources. Investments in DKM infrastructure, expertise, and policies will balance costs, meet central needs, support innovation, and enable users in the scientific and business stakeholder communities.
The scope of DKM includes:
- Content comprising structured or unstructured data, and information owned or managed by NIEHS (directly or via contract) or other data that is publicly available.
- Technologies that directly relate to the creation, management, maintenance, and use of NIEHS data and knowledge assets. This includes commercial, externally procured, and internally developed databases and knowledge bases.
- Products that are acquired or created for collecting and using data to inform scientific discovery, or decision support for environmental health policies.
- Tools or systems used in, or necessary for, production, enterprise management, and operational-level activities (e.g., CMS, search engine, storage management system, and others).
Although DKM overlaps with all other landscape areas, including laboratory, clinical, security and privacy, commodity business computing, non-commodity software, and website communications, the focus of I&IT is to ensure NIEHS has a unified approach to meeting DKM needs. The overlap is often use-case specific, but a general rule is that systems producing or consuming data or knowledge for other independent systems are part of the DKM infrastructure.
Ensure NIEHS-Generated Research Data Meets NIH and NIEHS Policies and FAIR+ Principles for Data Quality, Management, Lifecycle, Security, Access, Discovery, and Sharing
- NIEHS research data, whether generated by intramural or extramural researchers, should be prioritized to comply with FAIR+ principles.
- However, the original FAIR (findable, accessible, interoperable, and reusable) principles do not specifically call for data to be made computable, that is, available in a form that fosters semi-automated and automated search, access, and usage by computer programs. This ability is increasingly critical given the rise in data volume and in automated and semi-automated data processing, mining, and analysis tools. Therefore, NIEHS will help lead in making research data both FAIR and computable (i.e., FAIR+).
- The implementation of FAIR+ principles, and how they overlap with research data management practices (e.g., capture of data provenance) and federal records retention requirements has not been worked out. The institute will work with NIH in defining and implementing relevant policies and best practices, especially regarding data of priority to NIEHS.
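As an illustration of what "computable" could mean in practice, the following is a minimal sketch in which data sets are described by machine-readable records that a program can search without human help. The field names, identifiers, and `example.org` URLs are hypothetical placeholders, not an NIEHS or NIH metadata standard.

```python
import json

# Hypothetical machine-readable descriptors; all values are placeholders.
CATALOG_JSON = """
[
  {"id": "ds-001", "title": "Example exposure cohort",
   "keywords": ["air quality", "asthma"],
   "format": "text/csv",
   "access_url": "https://example.org/data/ds-001.csv"},
  {"id": "ds-002", "title": "Example toxicogenomics study",
   "keywords": ["gene expression"],
   "format": "text/csv",
   "access_url": "https://example.org/data/ds-002.csv"}
]
"""


def find_data_sets(catalog: list[dict], keyword: str) -> list[dict]:
    """Return descriptors whose keyword list contains the given term."""
    keyword = keyword.lower()
    return [d for d in catalog if keyword in (k.lower() for k in d["keywords"])]


catalog = json.loads(CATALOG_JSON)
for descriptor in find_data_sets(catalog, "asthma"):
    # A program can go from search hit to access location with no manual step.
    print(descriptor["id"], descriptor["access_url"])
```

The point of the sketch is only that search and access are driven by structured fields rather than by free-text documentation.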
Adopt an Interoperable Set of Metadata, Terminologies, Vocabularies, and Data Exchange Protocols That Facilitate Data Integration Across Identified Data and Knowledge Management Systems
- Integration of data and summarized knowledge is increasingly valuable for research. However, the cost is high, as integration requires ongoing maintenance. In addition, if standards that facilitate integration are not planned for and adopted, the number of mappings between data sets and DKMs can grow much faster than the number of systems, because each new system may need a mapping to every existing one.
- NIEHS will limit integration costs using several strategies: first, prioritizing the adoption of data standards for highly valuable data sets and DKMs; second, educating and training data producers on data curation practices and providing resources to aid them in adopting standards; and third, championing and advancing new and existing methods for linking data sets.
- Efforts exist to advance standards and practices to improve integration, both within NIH and externally (e.g., Research Data Alliance, Global Alliance for Genomics and Health). Partnerships and collaborations with these efforts will aid in promoting environmental health data integration standards and practices.
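The integration cost described above can be illustrated with simple counting. The sketch below assumes that, without a shared standard, every pair of systems needs its own mapping, and that with a shared standard each system needs only one mapping to that standard. The function names are illustrative.

```python
def pairwise_mappings(n_systems: int) -> int:
    """Mappings needed when every pair of systems is mapped directly."""
    return n_systems * (n_systems - 1) // 2


def shared_standard_mappings(n_systems: int) -> int:
    """Mappings needed when each system maps once to a common standard."""
    return n_systems


for n in (5, 10, 20):
    print(n, pairwise_mappings(n), shared_standard_mappings(n))
```

Under these assumptions, 20 systems need 190 direct mappings but only 20 mappings to a common standard, which is the motivation for prioritizing standards adoption.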
Provide Data and Knowledge Management Systems and Services That Enable and Promote a Growing Collection of Tools and Capabilities for Data- and Knowledge-Driven Research and Decision Support
- The continued growth in data and knowledge sources, coupled with the growth in realistic biomedical and population-level computer models, is improving the way research discoveries and policy decisions are made.
- However, accessing data, knowledge, and models for data exploration, hypothesis testing, advanced analytics, and machine learning purposes is often a challenging and tedious task. This work is often duplicated needlessly across separate projects, increasing the share of time lost to data cleaning and wrangling.
- NIEHS I&IT will accelerate the development and adoption of data- and knowledge-driven methods among all staff, by coordinating and supporting the development and usage of core data and knowledge systems (e.g., databases, metadata catalogs), services (e.g., application programming interfaces [APIs]), data sets (e.g., for training artificial intelligence algorithms), and software libraries.
- I&IT will further ensure DKM capabilities meet policies around lifecycle, data standards, security, cost efficiencies, and federal regulations, while supporting best practices that foster innovation and best-of-breed solutions.
Align NIEHS Data and Knowledge Management Systems With Select External Efforts
- There are multiple efforts to manage, aggregate, and provide DKM systems and tools for accessing and analyzing biomedical and population-level health data, both within and external to NIH (e.g., National Cancer Institute [NCI] Commons, NIH Data Commons, All of Us cohort). Many of these efforts are making use of public clouds as supporting infrastructure. In aligning with and making use of these efforts, NIEHS will advance use of data standards that promote interoperability and reduce duplication of costly DKM capabilities.
- Multiple challenges exist in aligning with external systems, including technical challenges in interfacing with external systems (e.g., moving data securely between sites, authenticating with external DKM systems) and challenges in adopting evolving standards used by external partners, in addition to the challenge of funding and staffing such efforts (especially where cloud technologies are in use).
- NIEHS I&IT will facilitate alignment by engaging with external efforts, prioritizing which external systems to work with, advocating for NIEHS needs, and building tools and services to minimize technical challenges in using external systems.
Cultivate a Data Science-Oriented Workforce
- Data- and knowledge-driven approaches (e.g., deep learning for training classifiers, use of vocabularies for enrichment) represent a foundational method of conducting research (in addition to empirical, theoretical, and computational approaches). Increasing training of staff and researchers in data science methods will increase the use and effectiveness of such approaches and reduce the burden on centralized I&IT for developing DKM-related tools and services.
- NIEHS I&IT will facilitate training of data science researchers and staff, both intramural and extramural, by helping identify gaps in training, providing training sessions, promoting a community of data science, and providing infrastructure (e.g., servers, installed software, training data sets) for development of skills.
Strategic Capability Priorities
Convene a Standing Data Governance Committee to Establish Data Management Policies
NIEHS will create a Data Governance Committee to work with NIEHS researchers, staff, and I&IT groups to establish policies and compliance criteria for management of internal and contract-generated data with the goal of meeting FAIR+ principles and federal requirements for management of records. Policies include, but are not limited to, data retention; archiving of research data, including use of tiered and cloud storage; metadata standards for describing research data; and standards for representing, publishing, supporting, discovering, and indexing data. Policies must be tiered to account for differing levels of data prioritization. The Committee will interface with related NIH standards and NIEHS extramural data governance policies. Success will be determined by the creation of policies that allow I&IT groups to make purchasing and development decisions, and by the compliance of NIEHS-generated data with those data management policies.
Implement Processes for Strategic Oversight of Data and Knowledge Management Activities and Investments
NIEHS will put into practice processes to inform best approaches for addressing DKM needs (e.g., developing internally, contracting, partnering with other ICs), to prioritize I&IT investments and projects, to increase coordination between I&IT groups and users on implementations and change management, and to evaluate ongoing activities and lifecycle decisions. The process will include developing inventories of internal needs, capabilities, and projects (including lifecycle); developing inventories of external DKM resources (especially at NIH); formation of internal committees to plan and review activities from scientific and technological perspectives, and support strategic planning; use of knowledge and project management tools (e.g., Confluence, Jira) for process management; and creation of an external review panel to provide expertise in data- and knowledge-intensive methods. Success will be understood by direct evidence of business processes guiding DKM and data science investments.
Advance Adoption of Metadata and Vocabularies for Describing Data Sets and for Use in Data Analysis Efforts
NIEHS will use vocabularies to link knowledge systems to aid in analysis approaches (e.g., enrichment analysis). Advancing use will entail several efforts including tagging of prioritized NIEHS research data sets using manual SOPs and automated mechanisms (e.g., that link to LIMS and ELNs); creation of a reference catalog of commonly used metadata and linked vocabularies that can facilitate automation, inform SOPs, and allow for controlled crowdsourcing of updates; incorporation of metadata and vocabularies into search capabilities; increasing staff expertise in developing semantic and linked data-aware analysis and search tools; education and training on methods to integrate metadata and linked vocabularies into data analysis; aligning NIEHS internal and extramural efforts with environmental health vocabularies and other external communities; and formation of an internal working group to advance adoption of metadata and vocabularies. Success will be evaluated with metrics on metadata tagging on research data sets (accounting for governance policies), and use of metadata and linked vocabularies in tools for searching, analysis, and visualization.
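A reference catalog of metadata fields and controlled terms could support both automated tagging checks and the tagging metric mentioned above. The following is a minimal sketch; the catalog contents, field names, and functions are hypothetical and stand in for whatever vocabularies NIEHS adopts.

```python
# Hypothetical reference catalog: metadata field -> allowed controlled terms.
# The fields and terms are illustrative placeholders, not an NIEHS standard.
REFERENCE_CATALOG = {
    "species": {"Homo sapiens", "Mus musculus", "Rattus norvegicus"},
    "assay_type": {"RNA-seq", "microarray", "mass spectrometry"},
}


def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of problems found in a data set's metadata tags."""
    problems = []
    for field, allowed in REFERENCE_CATALOG.items():
        value = tags.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif value not in allowed:
            problems.append(f"unrecognized term for {field}: {value}")
    return problems


def tagging_rate(data_sets: list[dict[str, str]]) -> float:
    """Fraction of data sets whose tags pass validation."""
    if not data_sets:
        return 0.0
    valid = sum(1 for tags in data_sets if not validate_tags(tags))
    return valid / len(data_sets)


example = [
    {"species": "Mus musculus", "assay_type": "RNA-seq"},
    {"species": "mouse", "assay_type": "RNA-seq"},  # free text, not a catalog term
]
print(tagging_rate(example))
```

Keeping the catalog as data rather than code is one way to allow controlled updates without changing the validation logic.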
Conduct Pilots to Inform Solutions for External Environmental Health Research Data
Whether an NIEHS grantee, external collaborator, or part of a public or private partnership arrangement, researchers are not always able to maintain or properly curate data resources, especially after funding periods end. A critical need exists to provide a solution for long-term management of such NIEHS-funded data, with a goal of ensuring the data meets FAIR+ principles. Multiple questions exist on how to sustain funding and control access, what level of access should be provided, and what services will be provided with the data. The evolving focus of data science at NIH, including the NIH Data Commons currently under development, and the proposed storage of publication-linked data at PubMed, may generate solutions. One or more pilots will aid in assessing these solutions, as well as provide detailed information on needs and constraints. Pilots will address providing standards and expectations for long-term data management, as well as testing the use of specific data repositories. Success will be determined by completion of pilots that inform and lead to decisions for long-term solutions.
Advance FAIR+ Practices for Intramural Research Data
A portion of the NIEHS-generated research data (internally or by contract) is managed by internal systems (e.g., CEBS, EpiShare, REDCap, NIEHS Data Commons) or external systems (e.g., dbGaP, GEO). These systems are at various levels of meeting FAIR+ principles, and limited commonality exists across these systems (e.g., for search, access, archiving, transport, data standards, data provisioning processes). One priority is ensuring these systems address NIEHS data policies and FAIR+ principles, while reducing overall costs for data management and still meeting identified user needs. A second priority is classifying internal research data as to the level of management needed (e.g., none, archive only, FAIR, FAIR+) and ensuring the data is managed according to its classification. Decisions around inclusion of legacy data will be made based on the nature of the data. Success will be understood by the percentage of research data managed according to NIEHS policies, lack of redundant common data management functionality, percentage of user-identified needs met, and user satisfaction.
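The classification levels and the "percentage of research data managed according to NIEHS policies" metric can be sketched as follows. This assumes the four levels named above are ordered by rigor and that a data set counts as compliant when it is managed at or above its assigned level. Both are assumptions made for illustration, as are the class and field names.

```python
from dataclasses import dataclass
from enum import IntEnum


class ManagementLevel(IntEnum):
    """Classification levels named in the text, assumed ordered by rigor."""
    NONE = 0
    ARCHIVE_ONLY = 1
    FAIR = 2
    FAIR_PLUS = 3


@dataclass
class DataSet:
    name: str                  # hypothetical identifier
    required: ManagementLevel  # level assigned by classification
    actual: ManagementLevel    # level the hosting system currently provides


def percent_compliant(data_sets: list[DataSet]) -> float:
    """Percentage of data sets managed at or above their required level."""
    if not data_sets:
        return 0.0
    ok = sum(1 for d in data_sets if d.actual >= d.required)
    return 100.0 * ok / len(data_sets)


inventory = [
    DataSet("study-a", ManagementLevel.FAIR_PLUS, ManagementLevel.FAIR_PLUS),
    DataSet("study-b", ManagementLevel.FAIR, ManagementLevel.ARCHIVE_ONLY),
    DataSet("study-c", ManagementLevel.ARCHIVE_ONLY, ManagementLevel.FAIR),
    DataSet("study-d", ManagementLevel.NONE, ManagementLevel.NONE),
]
print(percent_compliant(inventory))
```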
Provide an NIEHS Information Commons for Querying and Computing Across FAIR+ Designated Data Sets and Knowledge Bases
NIEHS will develop and provide APIs to access and query DKM systems, which provide multiple benefits in enabling data- and knowledge-driven research methods. APIs can provide access to statistical distributions for data that may otherwise be difficult to obtain due to data security and sharing concerns, facilitate development of new tools and methods by providing easy-to-obtain results for common queries, allow for auditing of data usage patterns, and inform data management practices. Coordinating the terminologies and vocabularies used by APIs so that common standards are adopted can further allow for integration of data between systems without direct mapping of the data systems, keeping the underlying implementations separate. Existing efforts at NIH and elsewhere (e.g., NIH Data Commons, EPA CompTox, Global Alliance for Genomics and Health) are making strides toward global information commons. Thus, providing APIs will increase NIEHS data utility and use. Success will be understood by the percentage of DKM systems with APIs, and usage of APIs by intramural and extramural researchers, tool developers, and partner resources.
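One way to picture an API that returns statistical summaries instead of raw records, while also recording usage for audit, is sketched below. The function, the in-memory audit log, and the minimum group size are all hypothetical. In particular, the threshold of 5 is an assumed value, not an NIEHS policy.

```python
import statistics
from datetime import datetime, timezone

MIN_GROUP_SIZE = 5  # assumed disclosure threshold, not an NIEHS policy value
AUDIT_LOG: list[dict] = []  # stand-in for a persistent audit store


def summarize(values: list[float], user: str, query: str) -> dict:
    """Return summary statistics only, and record who asked for what."""
    AUDIT_LOG.append({
        "user": user,
        "query": query,
        "time": datetime.now(timezone.utc).isoformat(),
    })
    if len(values) < MIN_GROUP_SIZE:
        # Too few records to release a summary without disclosure risk.
        return {"suppressed": True}
    return {
        "suppressed": False,
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


print(summarize([1, 2, 3, 4, 5, 6], user="analyst", query="example-measure"))
print(summarize([1, 2], user="analyst", query="small-subgroup"))
print(len(AUDIT_LOG))
```

The caller never sees individual records, and the audit log captures usage patterns that could later inform data management decisions.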
Provide Centralized Resources to Support Researchers in Using and Advancing Data- and Knowledge-Driven Methods
As research staff and programs adopt greater use of data- and knowledge-driven research, there is a need to increase support for these activities. Support will fall into several areas, including developing and deploying data-centric tools, including visualizations and websites; developing, deploying, loading, and querying databases and knowledge bases, including traditional (e.g., RDBMS) and specialized data systems (e.g., graph databases); working with novel data and computational science technologies (e.g., GPU-based computing, machine learning libraries, and semantic and linked data methods); managing and accessing research data; prototyping data science tools and transitioning tools to production use; and creating a tool and project repository to provide a centralized platform of available resources. Success will be understood as a pipeline of research projects involving NIEHS support staff and researchers; success in matrixing of support staff from I&IT groups (including Bioinformatics); development and retention of staff resources; development of software and data resources reused across projects; and successful use of temporary staffing (e.g., postbacs, summer students) to provide tailored expertise.
Establish Processes to Foster a Data-Oriented Workforce
Increasing the adoption of data- and knowledge-centric tools and methods broadly across NIEHS will be achieved through a number of targeted activities, including identifying and promoting resources for training and education; identifying and promoting tools and methods; conducting regular webinars and workshops to showcase methods, tools, and resources, and foster exchange of ideas; holding internal user meetings and journal clubs to foster communities of practice; and identifying NIEHS experts in topic areas willing to provide internal advice and consulting. Success will be evaluated by whether training and education efforts meet user needs; by the establishment of well-attended, active community and engagement efforts; and by increased adoption of data- and knowledge-centric tools across NIEHS.
Data and Knowledge Management Theme Map
| I&IT Landscape | Agility | Analytics | Communications & Transparency | Foster Collaboration | Governance | Optimize Resources | Workforce Development |
Data and Knowledge Management
See Appendix A: I&IT Priorities Support NIEHS Strategic Themes