A View from Kaggle’s CORD-19 Data Science Competition
If a brave citizen immediately began an effort to read all English-language academic articles on COVID-19 to discover the best known treatments, she — assuming she could read one article per hour and needed no sleep and assuming no new articles were published — would finish sometime in 2027. We applaud any such brave souls out there. For the rest of us (who have to sleep at least a little in the next seven years), we are going to need a method to synthesize and create scientific knowledge on COVID-19 treatments more quickly.
One possible route to rapid scientific synthesis related to treatment of COVID-19 is natural language processing (NLP) and text mining used for scientific knowledge discovery. For example, Covid19Primer applies traditional NLP techniques such as topic classification and keyword extraction, along with Twitter data, to create dashboards and daily briefings on trending COVID-19 articles. Google’s COVID-19 research explorer is a biomedical search engine that tries to answer a user’s questions. (When we ask it, “Can I go running outside?” it returns abstracts from the 2006 29th annual meeting of the Japan Neuroscience Society, but also an interesting simulation of the effect of one-way pedestrian traffic on COVID-19 spread. Perhaps the question was unfair.)
In fact, the data science competition platform Kaggle, in cooperation with the White House and a number of other parties, tasked data scientists with COVID-19 knowledge discovery using text mining techniques on research articles. The month-long competition in March 2020 challenged data scientists to use machine learning to answer a variety of scientific questions. A few of us at IQT Labs participated in the competition, and we are sharing our experience and reflections below.
Here’s our bottom line for this first blog post in a series on this competition: our review of existing approaches suggests that a general-purpose knowledge discovery tool that can quickly distill and summarize relevant, accurate information from a corpus of scientific papers is elusive. We suspect part of the problem in developing such a tool is the lack of a principled, systematic method of measuring its contributions. Consequently, in our opinion, the best research summaries on COVID-19 are still manually curated, and perhaps paying a talented team of graduate students to regularly wade through academic articles might be both more reliable and more effective than NLP techniques, for the time being.
Kaggle, COVID-19, and Two Approaches to Biomedical Knowledge Discovery
Our interest in this topic was piqued when, quarantined in our homes earlier this year, we read a call to action by the White House that led to a Kaggle data science competition in which participants were to use their data science-related skills to mine scientific articles for knowledge about COVID-19. The competition asked for answers to questions both general (“What do we know about vaccines and therapeutics?”) and specific (“[What do we know about] oral medications that might potentially work?”). All participants had access to a dataset of more than 30,000 relevant research articles and pre-prints.
After the competition’s first challenge round ended April 16, we examined the competition entries for our task and discovered multiple approaches. The most common approach might be called “question and answer engines” (Q&A engines). These are tools that take a question as input and then provide relevant scientific articles and/or snippets from them. Such submissions sought to literally answer the questions posed by the Kaggle challenge. Other approaches attempted high-level knowledge summarization via topic modeling, clustering, and knowledge graphs; we chose knowledge graphs for our submission.
One can think of knowledge graphs and Q&A systems as passive and active information gathering, respectively. Ideally, the two approaches would be combined: a knowledge graph can push a collection of investigational drugs to the user, who can then use a Q&A system to pull lower-level details about those drugs. Such a combined system could plausibly perform well on breadth, depth, and completeness. Indeed, the winning notebook used such a strategy.
The Difficulty of Evaluating Knowledge Discovery Tools
The reader might ask (as we did and still do) how submissions to this competition are, and ought to be, evaluated. The formal evaluation criteria published by the competition are below.
These criteria are reasonable but seem to miss an important component: where’s the criterion that evaluates the utility of a proposed approach? Is this criterion implicit in the “accomplish the task” criteria? This nagging question colored our perspective on the entire enterprise of knowledge discovery and, in particular, this challenge since it entered into our minds mid-competition.
The Q&A engine approach throws the difficulty of evaluation into stark relief. Of course, Q&A engines are attractive due to the availability of large medical Q&A training datasets and powerful pre-trained language models such as BERT that can be fine-tuned for Q&A, but Q&A engines assume that naïve user queries will yield useful matches. For instance, a question such as “How deadly is coronavirus?” may return results that talk about population-wide case fatality rates (CFRs). The user might, however, really be curious about the infection fatality rates (IFRs) for their young child or older parent. The IFR could differ from the general CFR by two orders of magnitude. Q&A engines, in other words, require that the questions themselves be carefully constructed by a user with sufficient knowledge. Therefore, Q&A systems cannot be evaluated on their technical merits alone; they also require a subjective evaluation from the user’s perspective.
So how were the Q&A engines in this competition evaluated? To be honest, we’re not sure. The winning submission for the therapeutics task — the task in which we participated — employed a Q&A engine. It is unclear to us though if the judges had a systematic means by which to evaluate whether the results were reliable. In fact, our scan of the submissions didn’t find any experiment to try to establish confidence in their results. This omission motivated our own evaluation of our submission (yes, we know it’s unfair to evaluate our own work but it’s better than no evaluation at all), which we’ll explain after we discuss our approach.
A Knowledge Graph Approach
Our submission built an interactive knowledge graph by mining sentences from article bodies that contained at least one drug keyword. The method then weighted the relevance of these sentences using a novel efficacy valence model: sentences that were more likely to contain potentially interesting results were given higher weight (we’ll discuss the details of this BERT-based efficacy valence model in our next blog post). Our algorithm generates a core graph of therapeutics and other concepts, and each of those nodes can be expanded to reveal flyout graphs (Figure 2). The approach is fully automated. We should note that we found a tradeoff between including more text (and thus more relevant nodes) and interpretability: very rich knowledge graphs are too dense to analyze visually at a glance.
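To make the sentence-mining step concrete, here is a minimal Python sketch of the general idea, not our actual submission. It assumes a hypothetical keyword set (DRUG_KEYWORDS), a naive regex sentence splitter, and edges formed when two keywords co-occur in a sentence. The placeholder_weight function is a stand-in that returns a constant; it does not reproduce the efficacy valence model.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical keyword list; placeholder names, not real drugs.
DRUG_KEYWORDS = {"drug_a", "drug_b", "drug_c"}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def drug_mentions(sentence):
    """Return the drug keywords that appear as whole tokens in the sentence."""
    tokens = set(re.findall(r"[a-z0-9_\-]+", sentence.lower()))
    return tokens & DRUG_KEYWORDS


def placeholder_weight(sentence):
    """Stand-in for a sentence relevance model (assumption): every
    sentence counts equally."""
    return 1.0


def build_graph(article_bodies, weight_fn=placeholder_weight):
    """Accumulate weighted node and co-occurrence edge counts."""
    node_weights = Counter()
    edge_weights = Counter()
    for body in article_bodies:
        for sentence in split_sentences(body):
            drugs = drug_mentions(sentence)
            if not drugs:
                continue  # keep only sentences with at least one keyword
            w = weight_fn(sentence)
            for d in drugs:
                node_weights[d] += w
            # Assumed edge rule: two keywords in the same sentence.
            for a, b in combinations(sorted(drugs), 2):
                edge_weights[(a, b)] += w
    return node_weights, edge_weights


bodies = [
    "Patients received drug_a. Outcomes improved with drug_a and drug_b.",
    "No drugs were studied here. We tested drug_c alone.",
]
nodes, edges = build_graph(bodies)

# Capping the node count is one way to trade coverage for readability.
core_nodes = [name for name, _ in nodes.most_common(50)]
```

Swapping placeholder_weight for a real scoring function changes which nodes rise to the top without touching the graph-building loop, and the cap passed to most_common is where the density-versus-interpretability tradeoff shows up.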
One crucial motivation for choosing a knowledge graph is that it is more straightforward to evaluate for correctness than a Q&A system: you can automatically compare the nodes and edges of two graphs for matches, provided the comparison can intelligently recognize synonymous node names. With this in mind, we sought to evaluate the knowledge graphs our submission could generate by comparing them to a baseline gold standard: a knowledge graph manually generated from a trusted, but independent, literature survey on COVID-19 therapeutics that we downloaded on April 5 (Figure 3).
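As a rough illustration of this kind of automatic comparison, the sketch below normalizes node names through a small synonym table and reports which baseline nodes a candidate graph matched or missed. The SYNONYMS table, the function names, and the toy inputs are all assumptions made for illustration; they are not our evaluation code or our results.

```python
# Hypothetical synonym table: surface form -> canonical node name.
SYNONYMS = {
    "il-6": "interleukin-6",
    "il6": "interleukin-6",
}


def canonical(name):
    """Lowercase, trim, and map known synonyms to one canonical name."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


def compare_to_baseline(candidate_nodes, baseline_nodes):
    """Compare two node lists after synonym normalization."""
    cand = {canonical(n) for n in candidate_nodes}
    base = {canonical(n) for n in baseline_nodes}
    matched = cand & base
    return {
        "matched": sorted(matched),
        "missed": sorted(base - cand),   # in the baseline, not in candidate
        "extra": sorted(cand - base),    # in the candidate, not in baseline
        "recall": len(matched) / len(base) if base else 0.0,
    }


# Toy inputs only, chosen to exercise the synonym lookup.
report = compare_to_baseline(
    candidate_nodes=["IL-6", "cytokine"],
    baseline_nodes=["Interleukin-6", "favipiravir"],
)
```

Edges could be compared the same way by normalizing both endpoints and treating each edge as an unordered pair. Note that the "extra" list is exactly the part this check cannot score: a node outside the baseline may be noise or a genuine discovery.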
The baseline captures nine potential drug and treatment options (including plasma/serum and interleukin-6 [suppression]) as well as other core topics. Failure to flag any of the treatments that were in the literature survey would count as a failure of any tool. However, it was much more difficult to evaluate additional therapeutics recommended by various competition submissions, which is something we’ll explore in future blog posts.
Comparing Knowledge Graphs to a Baseline
Our approach, when constructed from only article bodies, found the majority of the baseline treatments in our goal graph but missed four. Of these, the first two (sarilumab and siltuximab) never appeared in any of the articles we used to build the graph. We attribute this omission to a limitation of the articles we chose (we used only preprints due to computing constraints associated with Kaggle), not to our algorithm. The last two (interleukin-6 and favipiravir) did not appear in our core graph (unless you count cytokine for interleukin-6), but favipiravir showed up in our flyout graphs.
Meanwhile, the winning notebook was able to flag both favipiravir and siltuximab because it used the full dataset of articles, but it missed interleukin-6 and plasma/serum as treatments in its drug list.
Is either submission good enough? Which one is better?
Comparing Knowledge Graphs to Each Other
Because both our submission and the winning notebook generated a list of therapeutics and core concepts, we decided to compare the algorithmic design decisions between these approaches. Such an analysis could compare the “nodes” generated by the winning notebook against the nodes generated by our analysis, and against those nodes in the goal graph.
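A three-way comparison like this reduces to set arithmetic once each tool's output is a list of node names. The sketch below is illustrative only: the function name and the placeholder node names are hypothetical, and the inputs are not the competition outputs.

```python
def three_way(ours, theirs, goal):
    """Partition nodes by which tool found them and whether the goal
    graph contains them."""
    ours, theirs, goal = set(ours), set(theirs), set(goal)
    return {
        "goal_found_by_both": sorted(goal & ours & theirs),
        "goal_found_only_by_ours": sorted((goal & ours) - theirs),
        "goal_found_only_by_theirs": sorted((goal & theirs) - ours),
        "goal_missed_by_both": sorted(goal - ours - theirs),
        # Nodes outside the goal graph need expert review to judge.
        "outside_goal_ours": sorted(ours - goal),
        "outside_goal_theirs": sorted(theirs - goal),
    }


# Placeholder node names, not real results.
summary = three_way(
    ours=["a", "b", "x"],
    theirs=["b", "c", "y"],
    goal=["a", "b", "c", "d"],
)
```

The first four buckets can be scored automatically against the goal graph; the last two cannot, which is why nodes that fall outside the goal graph are harder to assess.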
We tried our best to do a fair comparison. We decided to run an experiment using the same article abstracts, the same publicly available drug lists, and the same stopwords. We limited our algorithm to return only about 50 nodes, matching the winning notebook. For clarification, we preserved the winning notebook’s ability to look for chemicals that were not in the drug lists (a wise design decision that we would follow given a second chance). We also limited both approaches to therapeutics only in this experiment (no vaccines), and only gave credit for details that were returned from a fully automated analysis (the winning notebook seemed to do some manual summary under their Q&A portion of their tool, but we only examined their raw, automated output for this analysis). The results are shown in Table 1.
Even though our model was not focused on returning only drugs (it allowed for other topics), we ended up finding more of the baseline drugs in this experiment than the winning notebook did. However, one of the drugs that the winning notebook found and we didn’t is one we think will grow in importance as we learn more about COVID-19: heparin. Heparin, and drugs like it, can be used to prevent and treat blood clots, which may turn out to be a primary focus of how we treat this disease in the future.
Why did the winning notebook find heparin when our versions didn’t? It turns out heparin was not in the drug list we used in this experiment, and the winning submission made the sensible decision to screen for chemicals in addition to drugs, which we didn’t do. Notably, heparin was not on the goal graph either, so we view the winning notebook’s ability to flag it as a unique discovery in this experiment.
Interestingly, when we applied our algorithm to article abstracts only, interleukin-6 was flagged by our approach but not by the competition winner in this experiment. We knew interleukin-6 was an important treatment from our goal graph, but what else can we say about the different drugs returned by the different models, shown above? We’re going to need an expert to weigh in, so we’ll ask one in a future blog post.
Utility of Knowledge Discovery
While we found that comparing knowledge graphs to a baseline graph generated from an independent literature survey could increase confidence in their reliability, this doesn’t say anything about the utility of the additional drugs found by various approaches that were not in the goal graph. These results also don’t speak to whether these tools are useful to researchers and academics, especially given how each submission chose to present the information. Notably, our approach is fully automated, while the winner seemed to rely on manual summary of the Q&A portion of their tool. How can we improve these types of tools for researchers who need them? And do crowdsourcing competitions work for this sort of goal? Are they ethical?
In our next post, we’ll further evaluate a novel contribution of our work towards this knowledge discovery end: identifying which sentences in articles are most likely to contain useful results and facts, by building a BERT-based model of sentence efficacy valence. In our final installment on this topic, we’ll explore the issues and suggested improvements for evaluating such knowledge discovery tools more formally, with a deeper user study.