Bringing down the Tower of Babel in data sharing
By Eddy Ball
As presentations by scientists attending a pair of workshops in June made clear, talking about data sharing is one thing — getting it right is something else entirely.
A workshop held June 25, co-sponsored by the Office of Scientific Information Management (OSIM) at NIEHS and the Office of Science Information Management (OSIM) at the U.S. Environmental Protection Agency (EPA), addressed common language concerns in environmental research. A two-day meeting June 26-27 at NIEHS addressed operability issues that make the task of integrating databases so challenging (see related story).
No one mentioned the biblical story of the Tower of Babel during the workshop on Advancing Environmental Health Data Sharing and Analysis: Finding a Common Language, held at the EPA conference center in Research Triangle Park (RTP), N.C. But the difficulty of settling on one consistent nomenclature, out of the many now in use, to guide searches of multiple databases was the theme of each of the day’s ten presentations.
As Allen Dearry, Ph.D., director of NIEHS OSIM, and Jerry Blancato, Ph.D., director of EPA OSIM, explained in opening remarks, the foundation of effective data sharing is a standard language for computerized searches of massive data repositories, so that data from government-funded research can be made publicly available. Data sharing, they emphasized, is not only desirable for research and regulatory scientists, but also mandated by executive order.
“We’ve come to a new game in town,” said Blancato. “We have to have some type of common language.”
A common language emerging from a common ontology
Although the workshop substituted several terms that express the idea in plainer language, some presenters lapsed into database shoptalk with a more comprehensive philosophical term, ontology. Ontology refers to the formulation of definitions, classifications, and relationships, using the tools of logic and formal semantics, to connect data across different databases and make these data accessible to standard software tools.
Unfortunately, as each of the presenters noted, many databases have emerged independently through good faith efforts to meet discipline-specific needs, using terms that may mean one thing for searches of that database, but something different in other contexts.
EPA Information Management Manager Lynne Petterson, Ph.D., offered a telling example of how this ambiguity might affect environmental health research. The word “flow,” she explained, means something different to atmospheric physicists than it does to hydrologists — a clash of ontologies that reduces the usefulness of information from their respective databases.
The search for solutions
Like several of her co-presenters at the workshop, Petterson is actively involved in developing what she called “a vocabulary for all seasons,” to represent these multiple perspectives, and reconcile past and present meanings of search terms.
Another effort, under way at RTI International and headed by Carol Hamilton, Ph.D., is developing consensus measures for exposures and biological outcomes, or phenotypes, for use in the NIH Common Data Element Resource Portal, to facilitate genome-wide association studies.
Ontology also has important implications for regulatory science. EPA Acting Chief of the Hazardous Pollutant Assessment Group Lyle Burgoon, Ph.D., described his team’s work developing a semiautomated predictive tool for inferring the potential hazards to human health from the thousands of chemicals with insufficient toxicologic value data. This is critical, he said, “because we can’t regulate chemicals with no tox values.”
During the concluding session of the workshop, participants split into small groups for discussion. The groups were charged with brainstorming responses to questions about moving the conversation forward among people with interests in broader and more effective data sharing.
Everyone seemed to agree with what Carolyn Mattingly, Ph.D., a North Carolina State University professor and developer of the NIEHS-funded Comparative Toxicogenomics Database, said during her presentation: “You need better ways of navigating the data.” The question she and her colleagues faced at the close of the day, however, was just how to gather the momentum for unified progress on a much wider scale (see text box).