Autumn 2017

A Closer Read

To parse data sets, researchers look to natural language tools

The Environment Issue

  • Kevin Jiang

On a miserably cold January evening in 2014, Joseph Dexter met his friend and mentor Pramit Chaudhuri at a party at a classical studies conference in Chicago. Dexter, a graduate student in the HMS Department of Systems Biology, had met the former Dartmouth classics professor when Dexter, while still in high school, was taking classes at the New Hampshire college. Catching up with one another that evening began with the usual pleasantries, but their conversation soon carried them into uncharted territory. As the night deepened and the room emptied, the pair remained huddled, deliberating an idea: could bioinformatics be adapted for studying ancient literature?  

In the first century CE, the Roman philosopher and statesman Seneca—tutor, advisor, and, ultimately, victim of Emperor Nero’s anger—wrote a series of plays shaped by the political and social strife of his era. Collectively known as the Senecan tragedies, this corpus was relegated to the margins of history until it was rediscovered by Renaissance scholars in the fifteenth century. The plays’ reemergence marked the revival of the tragedy on European stages and served as a model for dramatic traditions that influence Western culture to this day.

The journey of the Senecan tragedies from antiquity to modernity has taken unpredictable turns. But perhaps the unlikeliest detour was made during that late-night conversation, where, over several glasses of wine, a biologist and a classics scholar began to flesh out how techniques from bioinformatics could be used to gain insights about texts such as the Senecan tragedies.

At first blush, it might seem implausible to speculate that ancient Roman plays packed with supernatural intervention and bloodthirsty revenge would have anything in common with the computational analysis of biological data. But for Dexter, whose lifelong obsession with classics paralleled his path to the study of math and biology, and for an increasing number of researchers like him, the intersection of computation, human language, and biology is fertile ground for discovery.

“There are lots of commonalities that arise when you deal with large amounts of multidimensional data in messy, unstructured contexts,” he says. “That’s certainly true in biomedicine, and it’s certainly true in culture and literature.”

Driven by rapid growth in computing power and new technologies, almost every facet of biomedical research has been deluged with data in recent years, from the petabyte-sized data sets of “-omic” fields used to study the genome, transcriptome, proteome, and similar molecular entities, to what many estimate will become the zettabyte-sized data sets of scientific literature and electronic medical records (EMRs).

Extracting meaningful discoveries out of this wealth of information has necessitated the development of tools that not only can identify patterns of interest across massive data sets but can do so despite the inherent “messiness” of biology. This is no simple challenge. Whether at the level of molecules or populations, the study of biological systems involves untangling sets of rules, connections, and interdependencies that have been laid down by evolution and that can vary by timing, context, and chance.

Yet computational techniques honed for the study of the complex, interconnected, often ambiguous system we call language are increasingly being used to inform biomedical research. For some applications, these tools are showing enormous promise, from improving our understanding of genomics and biochemical pathways to realizing the full potential of precision medicine.

Dramatic Language

As early as the 1940s, linguists and computer scientists were collaborating on methods that would allow computers to learn, understand, and apply human language to a variety of uses. The resulting field, known as natural language processing, drew on disciplines such as artificial intelligence, machine learning, computer science, statistics, and computational linguistics to analyze the rules and patterns of language.

As large amounts of linguistic data and increased computing power became available, these efforts bloomed, leading to contemporary applications such as Siri, Apple’s intelligent personal assistant software, and Google Translate. The field of biomedical informatics, which leverages similar techniques to analyze and interpret medical and biological data, has likewise matured over the past few decades.

Natural language processing and biomedical informatics intersect in many ways. One of the more unusual examples may be the project launched by Dexter and Chaudhuri after that late-night conversation. Applying a technique they dubbed quantitative literary criticism, the project’s team of classics scholars, computer scientists, and computational biologists used computational tools to analyze ancient Latin and Greek texts, including the plays by Seneca.

Earlier this year, Dexter and his colleagues published a paper in Proceedings of the National Academy of Sciences in which they used computational profiling of writing style to explore intertextuality—the concept that all texts have relationships to other texts—across the writings of ancient authors. In one trial, they computationally analyzed the entirety of the Senecan tragedies to investigate their influence on a play by a fifteenth-century Italian author writing in the Senecan tradition. The team identified places in which the later play differs in style from plays written by Seneca. By pinpointing these differences, they could reveal various literary effects for which the author was striving and which gave his work its distinctive character.

The group is also pursuing a method for the detection of verbal intertextuality based on one of the most common bioinformatics techniques: sequence alignment. This analysis allows like-to-like comparisons of DNA, RNA, or protein sequences by lining up the molecular strands so that they match at as many locations as possible. In evolutionary studies, this technique has been used to identify similar genes across different species and analyze the degree of difference between them to build phylogenetic trees.

“Linguistics played an important role in the development of sequence alignment tools that are now ubiquitous in biology,” says Dexter. “We realized you could use the same techniques on literary problems.”
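To make the alignment idea concrete, here is a minimal Python sketch of global sequence alignment using the classic Needleman-Wunsch dynamic program. It works on any sequence, so the same function can line up nucleotides or words. The scoring values and the sample inputs are assumptions chosen for illustration; they are not the parameters or texts used by Dexter's group.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences (strings or lists of tokens).

    Returns (score, aligned_a, aligned_b); None marks a gap.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Walk back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


# Illustrative inputs only: a short DNA pair and a pair of Latin phrases.
dna_score, _, _ = align("GATTACA", "GATACA")
text_score, words_a, words_b = align("arma virumque cano".split(),
                                     "arma cano".split())
```

In the second call the shorter phrase is padded with a gap where a word is missing, which is the kind of near-match a search for verbal echoes between texts would need to tolerate.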

Topic Sentences

Bioinformatics tools can have powerful and creative applications, but when combined with natural language processing and applied to biomedical sciences, they have profound implications for human health.

Peter Park and Doga Gulhan

On the third floor of the Francis A. Countway Library of Medicine, Peter Park, an HMS professor in the Department of Biomedical Informatics, oversees a research group that is using large-scale computational analysis of genomics data to better understand the mechanisms underlying human diseases.

Among the group’s many approaches is one drawn directly from natural language processing: a statistical model that can identify what “topics” are contained within texts. Instead of analyzing language, however, Park and his team are identifying the specific causes of mutations in the genomes of cancer patients.

To illustrate with an analogy, a book about military battles of World War I will include the words “tank” and “trench” more frequently than a book about battles in the American Revolutionary War. But both will have more occurrences of words like “gun” and “cannon” than a book about the Punic Wars, which raged in the third through second century BCE.

This technique can be used to scan entire libraries of literary texts for groups of co-occurring words that indicate a common topic. The statistics can then be used to infer not only what the topic of a book may be, but the mixture of topics contained within.

Led by HMS bioinformatics postdoctoral fellow Doga Gulhan—a particle physicist who trained at MIT and worked at CERN—the team applied this concept to genomes. Key to their work are studies that have linked certain causal factors to specific patterns of mutations. In the genomes of smokers, for instance, there is a dramatic increase in cytosine to adenine mutations. These single nucleotide variants are often accompanied by predictable patterns in nucleotides on either side of the single variant.
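The idea of a single nucleotide variant read together with its neighbors can be sketched in a few lines of Python. The helper below tallies each substitution along with the base on either side of it. The reference sequence, the mutation positions, and the `context_counts` name are all hypothetical, invented for illustration; a real pipeline would also account for the two DNA strands, which this sketch omits.

```python
from collections import Counter


def context_counts(reference, mutations):
    """Count substitutions by their three-base context.

    reference: reference DNA string.
    mutations: iterable of (position, alt_base) with 0-based positions.
    Returns a Counter keyed like 'T[C>A]G'.
    """
    counts = Counter()
    for pos, alt in mutations:
        # Skip positions that lack a neighbor on one side.
        if pos <= 0 or pos >= len(reference) - 1:
            continue
        ref = reference[pos]
        if alt == ref:
            continue  # not a substitution
        left, right = reference[pos - 1], reference[pos + 1]
        counts[f"{left}[{ref}>{alt}]{right}"] += 1
    return counts


# Made-up example: two cytosine-to-adenine changes share the same neighbors.
example = context_counts("ATCGTCGA", [(2, "A"), (5, "A"), (3, "T")])
```

Each key in the resulting tally plays the role of a "word" in the analogy that follows.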

“If we think of each person’s genome as a book that contains many mutations or words,” says Gulhan, “we can use our algorithms to find words that occur together and group them by common occurrences into broad topics. You cannot do this using only a few genomes. You need a big set of books so that you can determine what the topics are. Then you can look at each genome to see which topics it contains.”
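Gulhan's book analogy can be illustrated with a small Python sketch. The article does not specify which statistical model the team uses, so this example substitutes non-negative matrix factorization, one common way to split a table of counts into "topics" and per-book mixtures. The count table, the choice of two topics, and the function names are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def nmf(v, k, steps=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (books x words) into
    w (books x topics) and h (topics x words) by multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(steps):
        # Update topics: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update mixtures: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def error(v, w, h):
    """Sum of squared differences between v and its reconstruction w*h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


# Made-up counts. Columns: tank, trench, musket, cannon. Rows: five "books".
counts = [
    [8, 6, 0, 2],
    [9, 7, 0, 1],
    [0, 0, 7, 5],
    [0, 1, 9, 6],
    [4, 3, 4, 3],  # a book that mixes both themes
]
mixtures, topics = nmf(counts, k=2)
```

Each row of `topics` weights the words that tend to occur together, and each row of `mixtures` says how much of each topic a given book contains. In the genomic setting, the rows would be tumor genomes and the columns mutation types.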

Park, Gulhan, and their team are scanning trillions of DNA base pairs and petabytes of data found in roughly 2,700 different tumor genome sequences from the International Cancer Genome Consortium. They have identified dozens of mutation signatures that indicate different causal factors, or “topics,” in their analogy. Most of these factors are still unknown, but some, including smoking and UV exposure, have been previously identified and are being used to validate and improve the methodology.

“Ultimately, what we want to do is give patients treatments that are appropriate for their disease,” Park says. “If you are presented with two tumors, say, a brain tumor and a lung tumor, they might appear to be caused by different factors. But it could be that the same mechanism is causing mutations in both. Sequencing the genomics of cancer patients will soon be a routine practice, and this type of genome analysis will help us sift through the mutations that reflect the history of the tumor, so that we can identify the best drug or combination of drugs to use for the patient.”


The tools of natural language processing have shown great promise when applied to biological data, but they are no less valuable within the context of their original intent: to provide computers with the capability to do useful things with human language.

Since 2005, the number of papers and abstracts on biomedical topics indexed by the National Institutes of Health’s PubMed search engine has doubled to roughly 27 million, with thousands more being added daily.

“Scientific literature is growing so large that we can’t keep up with it all, even within fields,” says John Bachman, a research fellow in therapeutic science in the Laboratory for Systems Pharmacology (LSP) and the Harvard Program in Therapeutic Science (HiTS) at HMS. “And it is extremely difficult to know if something relevant to your research might exist in some other field.”

In 2014, DARPA, a research and development wing of the U.S. Department of Defense, launched a project to address this growing concern. Under the effort, dubbed the Big Mechanism program, DARPA tasked research teams with developing computational tools that could intelligently scan and make sense of scientific literature.

To tackle this challenge, a group led by Peter Sorger, the Otto Krayer Professor of Systems Pharmacology at HMS and director of the LSP and HiTS, relied heavily on natural language processing. Led by Bachman and Benjamin Gyori, a research fellow in therapeutic science in the LSP, the team is developing a software platform that reads papers and builds models of complex biochemical networks and can also support interactive dialog with scientists in a manner akin to Apple’s Siri.

John Bachman (left) and Benjamin Gyori

The platform, named INDRA (the Integrated Network and Dynamical Reasoning Assembler), first uses machine reading to parse scientific publications and abstracts to look for phrases of interest. These phrases can include biochemical names and processes, as well as key words, for example, “tumorigenesis” or “metastasis.”

“When these systems extract information from the literature, it comes out as this big, error-prone, redundant, fragmented bag of facts,” Gyori says. “The main goal of INDRA is to turn those facts into coherent, predictive, and explanatory models. We’re not just looking for statistical associations in text, like co-occurrence of a drug name with a disease name. We want to extract causal events.”

To do so, the team developed what they’re calling a knowledge assembly methodology. INDRA cross-references raw phrases against each other as well as against databases and other knowledge sources in a manner analogous to sequence alignment. Guided by sophisticated algorithms, INDRA eliminates redundant statements and likely errors about biological processes and identifies the mechanisms that connect them.
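The gist of this assembly step, merging restatements of the same fact and discarding likely errors, can be shown with a toy Python sketch. It is not INDRA's code or API. The `assemble` function, the small synonym table, and the sample statements are all hypothetical stand-ins for the databases and algorithms the team describes.

```python
from collections import defaultdict

# Hypothetical lookup table mapping names found in text to one canonical name.
SYNONYMS = {
    "BRAF": "BRAF",
    "B-Raf": "BRAF",
    "MEK1": "MAP2K1",
    "MAP2K1": "MAP2K1",
}


def assemble(raw_statements, synonyms=SYNONYMS):
    """Merge duplicate statements and collect the sources that support them.

    raw_statements: iterable of (subject, relation, object, source).
    Returns a dict mapping (subject, relation, object) to a sorted list of
    sources. Statements with an unrecognized name are dropped as likely errors.
    """
    merged = defaultdict(set)
    for subj, rel, obj, source in raw_statements:
        subj, obj = synonyms.get(subj), synonyms.get(obj)
        if subj is None or obj is None:
            continue  # name not found in the lookup table
        merged[(subj, rel.lower(), obj)].add(source)
    return {key: sorted(sources) for key, sources in merged.items()}


# Made-up extractions: three restate one fact, one contains a misread name.
raw = [
    ("B-Raf", "phosphorylates", "MEK1", "paper_1"),
    ("BRAF", "Phosphorylates", "MAP2K1", "paper_2"),
    ("BRAF", "phosphorylates", "MAP2K1", "paper_2"),
    ("BRFA", "phosphorylates", "MEK1", "paper_3"),
]
facts = assemble(raw)
```

Four raw extractions collapse into a single statement backed by two papers, a small-scale version of turning a redundant, error-prone bag of facts into something a model can be built on.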

The scale at which INDRA can do this is difficult, if not impossible, for humans to achieve. In one proof-of-concept trial, INDRA assembled a biochemical network model after scanning a corpus of 95,000 papers that contained information relevant to a single study of interest. That study tested the efficacy of nearly one hundred drug combinations on melanoma cell lines, and the twenty-two strongest drug effects were selected from its results. The team asked INDRA to find the mechanisms involved. Of the twenty-two observed effects of a drug on a protein, INDRA generated detailed biochemical explanations for twenty, a success rate of about 91 percent.

With additional natural language processing development, the team has devised a software prototype, provisionally named Bob, that one day will allow any scientist to ask INDRA questions in English and receive an answer in English. In essence, it would be a virtual lab assistant that can supply information to help researchers formulate and evaluate hypotheses.


For patients, tools like INDRA and the topic model used by Park and Gulhan have tremendous potential in opening new lines of research and discovery that can someday affect their health and quality of life. But natural language processing can also have a direct benefit at the bedside.

Perhaps the largest data sets that exist in the biomedical sciences are EMRs, which contain clinical narratives and details such as disease pathology and treatments for hundreds of millions of patients. There is, however, no universal system for EMRs, so they can differ greatly in how critical data elements are presented, from coding for medications to vocabulary use.

This lack of conformity presents an ideal problem for natural language processing tools, one that Guergana Savova, an HMS associate professor and director of the Natural Language Processing Lab at Boston Children’s Hospital, may help solve. Savova and her colleagues are building systems that can read and analyze anonymized clinical notes from EMRs and combine that information with other types of information.

One of their efforts is aimed at performing “deep phenotyping” on cancer. Through their analysis of the plain text within millions of EMRs, they hope to reveal the relationships between the characteristics of a cancer, including its molecular profile, grade, and metastasis patterns, and information extracted about patients, such as family histories, tests, treatments, and comorbidities.

“We need to learn as much as we can about these connections if we are to achieve the goal of precision medicine, because every patient and every tumor has a different set of characteristics,” says Savova, a computational linguist and computer scientist by training. “These questions can be answered only if researchers have large corpora of data from large cohorts of patients to compare. Manually, it’s just not doable.”

Alexa McCray

But state-of-the-art natural language processing systems are not a panacea, and no system is perfect, Savova says. Although errors can be controlled for—INDRA, for example, has a “belief engine” to allow it to determine its probability of correctness—inaccuracies arise for a variety of reasons that range from language variations to the differences in statistical and computational algorithms that underlie any given system.

“We build extraction tools, but there is a tremendous difference between extraction of information and such a complex decision-making process as diagnosis,” Savova says. “What a physician observes or hears or feels, the logical and creative steps that humans are capable of, are not necessarily recorded in the EMRs, and they are as important as any amount of text processing. The big question for artificial intelligence in general is how to encode this comprehensive knowledge into one representation.”

The vast majority of current-generation natural language processing systems rely on human-initiated resources, such as a list of Latin phrases or biochemical names to search for in a corpus of data or a backbone of medical terms to which clinical notes are connected. This can be a troubling variable.

“There are people who disagree with me,” says Alexa McCray, a professor of medicine in the Department of Biomedical Informatics at HMS and Beth Israel Deaconess Medical Center, “but if you’re working with not-so-good data on the way in, then what comes out the other end is not going to be so good either.”

Ensuring access to high-quality data for computational applications has been a priority for McCray for almost her entire career. A linguist who joined IBM as the field of computational linguistics was blossoming, McCray spent decades at the National Library of Medicine at the NIH.

There, she helped develop standards such as the Unified Medical Language System, a comprehensive and curated database of millions of biomedical concepts and names. That system now serves as the backbone for many natural language processing applications.

For biomedical researchers to make full use of natural language processing and uncover knowledge that can affect human health and disease, there must be a strong foundation of data built through human effort.

“Data standards, curation, and language processing, these are areas where I think we have to put more of our combined energy,” McCray says. “Otherwise, it’s the Tower of Babel. What we need to get to is a point where we can compare apples to apples across biomedicine.”

Kevin Jiang is a science writer in the HMS Office of Communications and External Relations.

Images: John Soares