Ebola: Text analytics over patent sources for medicinal chemistry

November 13, 2014

The 2014 Ebola outbreak is officially the deadliest in history. Governments and organizations are searching for ways to halt the spread, both by responding with humanitarian aid and by looking for treatments to prevent or cure the viral infection.

CDC image of Ebola virus

Ebola virus disease (or Ebola haemorrhagic fever) is caused by the Ebola filovirus

A couple of weeks ago we received a tweet from Chris Southan, who has been looking at crowdsourcing anti-Ebola medicinal chemistry. He asked us to mine Ebola C07D patents (i.e. those for heterocyclic small molecules, the standard chemistry for most drugs) using our text analytics tool I2E, and provide him with the resulting chemical structures.

We wanted to help. What anti-Ebola research has been patented that might provide value to the scientific community? Searching patents for chemistry using an automated approach is notoriously tricky: patent documents are long and often purposefully obfuscated, with chemicals obscured by the complex language used to describe them, corrupted by OCR errors, or mangled by the overall poor formatting of the patents.

Andrew Hinton, one of our Application Specialists with a background in chemistry, used I2E to map the patent landscape around Ebola, identify patents for small molecules described as targeting Ebola, and extract the chemical structures. He compiled queries to answer the key questions and find the most relevant patents:

  • Does the patent mention Ebola or Ebola-like diseases? More importantly, is Ebola the major focus of the patent?
  • Who is the pharma or biotech company?
  • Is it a small molecule or non-small molecule patent?
  • What’s the exemplified chemistry? What’s the claimed chemistry? What’s the Markush chemistry?
  • What chemistry is found as an image? What chemistry is found in a table? Can we extract these structures too?
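None of the actual I2E queries are shown here, but a toy heuristic in Python gives the flavour of the first question, weighting mentions of the disease by the patent section they occur in; the section names and weights below are invented for this sketch:

```python
# Toy relevance heuristic, not the actual I2E queries: weight mentions of
# the disease by where in the patent they occur.
WEIGHTS = {"title": 5, "abstract": 3, "claims": 2, "description": 1}

def focus_score(patent, term="ebola"):
    """Crude proxy for whether `term` is a major focus of the patent."""
    return sum(weight * patent.get(section, "").lower().count(term)
               for section, weight in WEIGHTS.items())

# An invented example patent record.
patent = {"title": "Compounds for treating Ebola virus infection",
          "abstract": "Heterocyclic small molecules active against Ebola...",
          "claims": "A compound of formula (I) for use against Ebola.",
          "description": "..."}
print(focus_score(patent))  # 5*1 + 3*1 + 2*1 = 10
```

A patent that only mentions the disease in passing, deep in the description, scores low; one that names it in the title and claims scores high, which is roughly the distinction between "mentions Ebola" and "Ebola is the major focus".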

Andrew ran these queries over patents from USPTO, EPO and WIPO for the past 20 years on data derived from IFI CLAIMS.

Ebola patent trend over past decade

Graph showing C07D patents (blue, left-hand axis) and non-C07D patents (red line, right-hand axis) for Ebola-related patents from 1994 to 2014, from the three major patent registries [please note the different scales for the axes].

The results showed a general increase in the number of patent records related to Ebola, but the numbers are comparatively small. For example, about 50k C07D patents were published in 2010 across all therapeutic areas; of these, we found only about 100 that related to Ebola (and the number of truly unique patent families is likely to be a smaller subset of that figure). This isn't really surprising: as with most viral diseases, the main emphasis for therapies has been on biologics and other non-small-molecule treatments. In fact, of the 16k total patents that mention Ebola, only 1% are C07D patents focused specifically on Ebola.

Ebola patents - ranked by organizations

Heatmap showing that the top 3 organizations with small molecule C07D patents in this area contribute 1/5 of all Ebola patents.

So what is the outcome of this? Using I2E, we have been able to extract the set of molecules reported in these Ebola-related patents, and will provide a set of these to Chris Southan for his chemoinformatics analysis. Let’s hope that this little step might further research towards providing a solution to the current Ebola epidemic.

Ebola patents - chemistry table

Screenshot of I2E results showing names and structures extracted using text analytics from Ebola-related patent set. Structures are generated using ChemAxon’s Name-to-Structure conversion tools (http://www.chemaxon.com/products/naming/).


Pharmacogenomics across the drug discovery pipeline – are we nearly there?

August 20, 2014

Since the human genome was published in 2001, we have been talking about the potential application of this knowledge to personalized medicine, and in the last couple of years, we seem at last to be approaching this goal.

A better understanding of the molecular basis of diseases is key to the development of personalized medicine across pharmaceutical R&D, as was discussed last year by Janet Woodcock, Director of the FDA’s Center for Drug Evaluation and Research (CDER). FDA CDER has been urging adoption of pharmacogenomics strategies and pursuit of targeted therapies for a variety of reasons. These include the potential for decreasing the variability of response, improving safety, and increasing the size of the treatment effect, by stratifying patient populations.

Pharmacogenomics is the study of the role an individual’s genome plays in drug response, which can vary from adverse drug reactions to lack of therapeutic efficacy. With the recent explosion in sequence data from next generation sequencing (NGS) technologies, one of the bottlenecks in applying genomic variation data to understanding disease is access to annotation. From NGS workflows, scientists can quickly identify long lists of candidate genes that differ between two conditions (case-control, or family hierarchies, for example). Gene annotations are essential to interpret these gene lists and to discover fundamental properties like gene function and disease relevance.
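As a toy illustration of how such annotations turn a bare gene list into something interpretable, here is a minimal sketch; the gene symbols and disease links below are invented placeholders, not real curated data:

```python
# Candidate genes as they might emerge from an NGS case-control comparison.
candidate_genes = ["ACTA2", "BRCA2", "TTN", "NOTCH3"]

# In practice these annotations would come from curated sources such as OMIM;
# the entries here are invented for the sketch.
annotations = {
    "ACTA2": {"function": "smooth muscle actin",
              "diseases": ["aortic aneurysm", "stroke"]},
    "NOTCH3": {"function": "signalling receptor",
               "diseases": ["CADASIL", "stroke"]},
}

def relevant_to(disease, genes, annotations):
    """Return the genes on the list annotated against the given disease."""
    return [g for g in genes
            if disease in annotations.get(g, {}).get("diseases", [])]

print(relevant_to("stroke", candidate_genes, annotations))  # ['ACTA2', 'NOTCH3']
```

The unannotated genes drop out immediately, which is exactly the triage step that a long NGS candidate list needs before any deeper interpretation.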

Key sources for these annotations include the ever-growing biomedical literature, some of it distilled into structured databases (such as COSMIC, GAD, and DGA); but much valuable information remains in textual sources such as PubMed Central, MEDLINE, and OMIM. Extracting actionable insight rapidly and accurately from text documents is greatly helped by advanced text analytics – and users of our I2E text analytics solution have been asking for access to OMIM, which will soon be available in our OnDemand portfolio.

In particular, there are two common use cases that I2E users want to address with enhanced text analytics over the OMIM data:

One use case comes from the clinical side of drug discovery and development: clinicians provide information on a particular case or phenotype, and I2E is used to extract from OMIM the genes that might be worth sequencing from the clinical samples, to see whether they are involved in the disease pathway.

The other use case comes from early in the drug discovery-development pipeline, at the initial stages of a project, where I2E is used to pull out a set of potential targets from OMIM for a new disease area. Obviously, more in-depth research is needed before any lab work starts, but this provides an excellent seed for entry into a new therapeutic area.

Utilizing I2E to access OMIM brings benefits such as:

  • High quality results from this manually curated, fact-dense data source, compared to, for example, querying original articles in peer-reviewed literature
  • The use of our domain-specific ontologies (e.g. for diseases, genes, mutations and other gene variants) enables high recall compared to searching via the OMIM interface (for example, using ontologies to search for “liver cancer”, and being able to also find records with annotations for “liver neoplasm”, “hepatic cancer”, “cancer of the liver”, etc)
  • Clustering of various synonyms and expressions from the use of Preferred Terms (PT) such as Gene Symbols
  • The ability to build in-depth queries, such as extraction of gene-gene interactions, and to hit a wide variety of concepts and synonyms, for example many different ways in which gene/protein mutations may be named (see figure legend)
Network of Disease-Gene-Mutation relationships from I2E results in Cytoscape

The image shows a network of Disease – Gene – Mutation relationships from I2E results in Cytoscape. I2E was used to extract gene (green squares) and mutation (circles) information for stroke (central red triangle), showing some overlap of gene interactions with a related disease, cerebral infarction. Utilizing the Linguamatics Mutation resource enables easy extraction of precise information patterns (e.g. "an ACTA2 mutation"; "proline for serine at codon 116"; "4895A/G"; "a 4-bp deletion"; "Q193Sfs*12"; "a 377A-T transversion"), which would be hugely time-consuming to extract by manual curation.
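To give a feel for why such mentions are hard to capture by hand, here is a toy Python sketch of a few regular expressions for mutation patterns of the kinds listed above. These patterns are invented for illustration; the actual Linguamatics Mutation resource covers far more nomenclatures and edge cases than these few regexes:

```python
import re

# Illustrative patterns only; a production mutation resource handles many
# more naming conventions than this.
patterns = {
    "substitution": re.compile(r"\b\d+[ACGT][-/][ACGT]\b"),            # 4895A/G, 377A-T
    "frameshift":   re.compile(r"\b[A-Z]\d+[A-Za-z]+\*?\d*\b"),        # Q193Sfs*12
    "indel":        re.compile(r"\b\d+-bp (?:deletion|insertion)\b"),  # a 4-bp deletion
    "codon_swap":   re.compile(r"\b\w+ for \w+ at codon \d+\b"),       # proline for serine at codon 116
}

text = ("Patients carried a 377A-T transversion and a 4-bp deletion; "
        "one case showed proline for serine at codon 116.")

# Report every mutation-like mention found in the text.
for name, pattern in patterns.items():
    for match in pattern.finditer(text):
        print(name, "->", match.group(0))
```

Even this toy version shows the problem: each naming convention needs its own pattern, and real text mixes them freely within a single sentence.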

Combining OMIM access with extraction of genotype-phenotype relationships from MEDLINE, PubMed Central and ClinicalTrials.gov will give I2E users an excellent resource for NGS annotation, target discovery, and clinical genomics, in order to better target the molecular basis of disease.

If you are interested in accessing OMIM using the power of I2E (or any of the other current content options on I2E OnDemand, i.e. MEDLINE, ClinicalTrials.gov, FDA Drug Labels, NIH Grants, PubMed Central Open Subset, and Patents), please get in touch and we can provide more information and keep you updated on our progress.





Economics of the Obesity Epidemic – Extracting Knowledge with Advanced Text Analytics

July 15, 2014

In the current competitive marketplace for healthcare, pharmaceutical and medical technology companies must be able to demonstrate clinical and economic evidence of benefit to providers, healthcare decision-makers and payers. Now more than ever, pricing pressure and regulatory restrictions are generating increased demand for this kind of outcomes evidence.

Health Economics and Outcomes Research (HEOR) aims to assess the direct and indirect health care costs associated with a disease or a therapeutic area, and associated interventions in real-world clinical practice. These costs include:

  • Direct economic loss
  • Economic loss through hospitalization
  • Indirect costs from loss of wider societal productivity

The availability of increasing amounts of data on patients, prescriptions, markets, and scientific literature, combined with the wider use of comparative effectiveness research, makes traditional keyword-based search techniques ineffectual. I2E can provide the starting point for efficiently performing evidence-based systematic reviews over very large sets of scientific literature, enabling researchers to answer questions such as:

  • What is the economic burden of disease within the healthcare system? Across states, and globally?
  • Does XYZ new intervention merit funding? What are the economic implications of its use?
  • How do the incremental costs compare with the anticipated benefits for specific patient groups?
  • How does treatment XYZ affect quality of life? Activities of daily living? Health status indicators? Patient satisfaction?

A recent project looking at the economics of obesity used I2E to search all 23 million abstracts in MEDLINE for research on the incidence of comorbid diseases, with associated information on patient cohort, geographic location, severity of disease, and associated costs (e.g. hospitalization, treatment, etc.). From the I2E output, more advanced visual analytics can be carried out. For example, the pie chart shows the prevalence of the various comorbid diseases (from 2013 MEDLINE abstracts containing HEOR terms, obesity, and a comorbid disease), showing the high frequency of hypertension and various other cardiovascular diseases. Another view of the same extracted intelligence shows the geographic spread of health economics and obesity research, with a predominance across North America, but also data from China and Brazil, for example.

Prevalence of cardiovascular co-morbid diseases


Geographic view of HEOR research, mined from Medline from 2013
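As a hedged sketch of the aggregation behind views like these, the snippet below counts comorbid diseases and countries over a handful of invented rows standing in for I2E output:

```python
from collections import Counter

# Invented stand-ins for extracted I2E rows: (comorbid disease, country).
rows = [
    ("hypertension", "USA"), ("type 2 diabetes", "USA"),
    ("hypertension", "China"), ("coronary heart disease", "Brazil"),
    ("hypertension", "USA"), ("type 2 diabetes", "China"),
]

# Counts feeding a comorbidity pie chart and a geographic map, respectively.
disease_counts = Counter(disease for disease, _ in rows)
country_counts = Counter(country for _, country in rows)

print(disease_counts.most_common(1))  # [('hypertension', 3)]
print(country_counts["USA"])          # 3
```

The point is that once the text analytics step has produced structured rows, the downstream charting reduces to simple counting and grouping.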

If you are interested in getting a better understanding of the power of advanced text analytics for HEOR, please contact us.

Are the terms text mining and text analytics largely interchangeable?

January 17, 2014

I saw this comment in a recent article by Seth Grimes discussing the terms text analysis and text analytics (see Naming and Classifying: Text Analysis vs. Text Analytics). In the article, Mr. Grimes states that text mining and text analytics are largely interchangeable terms:

“The terms “text analytics” and “text mining” are largely interchangeable. They name the same set of methods, software tools, and applications. Their distinction stems primarily from the background of the person using each — “text mining” seems most used by data miners, and “text analytics” by individuals and organizations in domains where the road to insight was paved by business intelligence tools and methods — so that the difference is largely a matter of dialect.” ref: http://www.huffingtonpost.com/seth-grimes/naming-classifying-text-a_b_4556621.html

I asked Linguamatics CTO, David Milward, for his thoughts:

There is certainly overlap, but I think there are cases of analytics that would not be classed as text mining, and vice versa. Text analytics tends to be more about processing a document collection as a whole, while text mining traditionally has more of a needle-in-a-haystack connotation.

For example, word clouds might be classified as text analytics, but not text mining. Use of natural language processing (NLP) for advanced searching is not so naturally classified under text analytics. Something like the Linguamatics I2E text mining platform is used for many text analytics applications, but its agile nature means it is also used as an alternative to professional search tools.

A further term is Text Data Mining. This is usually used to distinguish cases where new knowledge is being generated, rather than old knowledge being rediscovered. The typical case is indirect relationships: one item is associated with a second item in one document, and the second item is associated with a third in another document. This provides a possible relationship between the first item and the third item: something which may not be possible to find within any one document.
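The classic example of this bridging logic is Swanson's fish-oil/Raynaud's link. A minimal sketch over invented co-occurrence pairs:

```python
from collections import defaultdict

# Invented direct co-occurrence pairs; each tuple stands for a relationship
# asserted within a single document.
direct = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
    ("magnesium", "vascular tone"),
]

def indirect(pairs):
    """Return (a, b, c) triples where a-b and b-c are each asserted in some
    document but a-c is asserted in none: a candidate new relationship."""
    by_first = defaultdict(set)
    for a, b in pairs:
        by_first[a].add(b)
    known = set(pairs)
    links = []
    for a, b in pairs:
        for c in by_first.get(b, ()):
            if c != a and (a, c) not in known:
                links.append((a, b, c))
    return links

print(indirect(direct))  # [('fish oil', 'blood viscosity', "Raynaud's disease")]
```

The inferred fish oil to Raynaud's link appears in no single document; it emerges only when the two direct assertions are combined, which is exactly the "new knowledge" that distinguishes text data mining.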

Pharma and healthcare come together to see the future of text mining

October 16, 2013

The Linguamatics Text Mining Summit has become Linguamatics’ flagship US event over the past few years, attracting a wide variety of attendees from the pharma, biotech and healthcare industries.

This year was no exception; the Summit drew a record crowd of over 85 attendees and a fantastic line-up of speakers from AMIA, Shire Pharmaceuticals, Huntsman Cancer Institute, AstraZeneca, Regeneron Pharmaceuticals, Merck, Georgetown University Medical Center, UNC Charlotte, City of Hope, Pfizer and Roche.

The Summit took place at the Hyatt Regency, Newport, Rhode Island, in the beautiful surroundings of Narragansett Bay, from October 7-9, 2013.


Delegates were provided with an excellent opportunity to explore trends in text mining and analytics, natural language processing and knowledge discovery. They discovered how I2E is delivering valuable intelligence from text in a range of applications, including the mining of scientific literature, news feeds, Electronic Health Records (EHRs), clinical trial data, FDA drug labels and more. Customer presentations demonstrated how I2E helps workers in knowledge-driven organizations meet the challenge of information overload, maximize the value of their information assets and increase speed to insight.

One customer presentation explained how a “Recent I2E literature search saved $180K and 2-3 months [of] time”.

Kevin Fickenscher, President and CEO of AMIA, kicked off the main sessions on Tuesday October 8th with his forward-thinking talk on Big Data and its implications for personalized medicine and performance-based outcomes in healthcare. The talk was engaging and well received, and set the tone for a great day of presentations.


The final talk of Tuesday, from Jonathan Hartmann, an Informationist at Georgetown University Medical Center, also caused quite a stir by demonstrating an iPad-based web portal interface onto I2E to support identification of case histories during physicians’ rounds. The novel real-time usage of I2E prompted many questions from delegates, and the presentation was discussed long into the night at the Summit dinner.


The Summit afforded new and potential users a timely update on the latest I2E product developments, including an introduction to Linked Servers, which allow streamlined access to content by linking enterprise servers with content servers on the cloud, and enhanced chemical querying capabilities. Both of these new features are available in the latest I2E release, I2E 4.1. CTO David Milward’s forward-looking technology road map received a great deal of interest from delegates.

This year we invited a few of our partners to exhibit and participate in 5-minute lightning-round presentations; we heard from Cambridge Semantics, ChemAxon, Copyright Clearance Center and Accelrys on their solutions and how they integrate with Linguamatics I2E.

The whole event was a great success, topped off with some fabulous feedback from our attendees. To quote one of the feedback forms: “One of the best summits and training sessions I have ever attended. The pioneering NLP efforts and user responsiveness is unmatched by any other text mining (NLP) vendor. The Linguamatics approach and I2E system is relatively intuitive, easy to manage, powerful and useful”.

Presentations are available on I2Edia and by email request.

Thanks to everyone who attended and contributed to the Linguamatics Text Mining Summit 2013, we look forward to seeing you next year!

Big Data in the San Francisco Bay Area

September 3, 2013

Natural Language Processing (NLP), big data and precision medicine are three of the hottest topics in healthcare at the moment, and consequently attracted a large audience to the first NLP & Big Data Symposium, focused on precision medicine. The event took place on August 27th, hosted at the new UCSF site at Mission Bay in San Francisco, and was sponsored by Linguamatics and the UCSF Helen Diller Family Comprehensive Cancer Center. Over 75 delegates came to hear the latest insights and projects from some of the West Coast’s leading institutions, including Kaiser Permanente, Oracle, Huntsman Cancer Institute and UCSF. The event was held in the midst of an explosion in new building development to house the latest in medical research and informatics – something big data will be at the heart of. Linguamatics and UCSF recognized the need for a meeting on NLP in the west and put together an exciting program that clearly caught the imagination of many groups in the area.

Delegates at the Symposium

Over 75 delegates attended the Symposium

Key presentations included:

  • The keynote presentation from Frank McCormick, Director of UCSF Helen Diller Family Comprehensive Cancer Center, was a tour de force of the latest insights into cancer research and future prospects. With the advances in genetic sequencing and associated understanding of cancer biology, we are much closer to major breakthroughs in tackling the genetic chaos of cancer
  • Kaiser Permanente presented a predictive model of pneumonia assessment based on Linguamatics I2E that has been trained and tested using over 200,000 patient records. This paper has just been published and can be found here. In addition, Kaiser presented plans for a new project on re-hospitalization that takes into account social factors in addition to standard demographic and diagnosis data
  • Huntsman Cancer Institute showed how pathology reports are being mined using I2E to extract data for use in a research data warehouse to support cohort selection for internal studies and clinical trials
  • Oracle presented their approach to enterprise data warehousing and translational medicine data management, highlighting why a sustainable single source of truth for data is key and how NLP can be used to support this environment
  • UCSF also provided an overview of the current approaches to the use of NLP in medical and research informatics, emphasizing the need for such approaches to deliver the raw data for advanced research
  • Linguamatics’ CTO, David Milward, presented a positional piece on why it is essential for EHRs and NLP to be more closely integrated and illustrated some of the challenges and approaches that can be used with I2E to overcome them
  • Linguamatics’ Tracy Gregory also showed how NLP can be used in the early stages of biomarker research to assess potential research directions and support knowledge discovery

    Panel from NLP & Big Data Symposium

    Panel of speakers, from left to right – Tony Sheaffer, Vinnie Liu, Brady Davis, David Milward, Samir Courdy, Tracy Gregory and Gabriel Escobar

With unstructured data accounting for 75-80% of the data in EHRs, the use of NLP in healthcare analytics and translational research is essential to achieve the required improvements in treatment and disease understanding. This event provided a great forum for understanding the current state of the art in this field and allowed participants to engage in many useful discussions. West Coast residents interested in the field can look forward to another opportunity to get together in 2014 – or, if you can get over to the East Coast, the Linguamatics Text Mining Summit will take place on October 7-9, 2013 in Newport, RI.

I2E and the Semantic Web

August 6, 2013

The internet right now, as Tim Berners-Lee points out in Scientific American, is a web of documents; documents that are designed to be read, primarily, by humans. The vision behind the Semantic Web is a web of information, designed to be processed by machines. The vision is being implemented: important parts of the key enabling technologies are already in place.

RDF, the Resource Description Framework, is one such key technology. RDF is the language for expressing information in the Semantic Web. Every statement in RDF is a simple triple, which you can think of as subject/verb/object, and a set of statements is just a set of triples. Three example triples might be: Armstrong/visited/moon, Armstrong/isa/human and moon/isa/astronomical body. The power of RDF lies partly in the fact that a set of triples is also a graph, and graphs are perfect for machines to traverse and, increasingly, reason over. After all, when you surf the web, you’re just traversing the graph of hyperlinks. And that’s the second powerful feature of RDF. The individual parts, such as Armstrong and moon, are not just strings of letters but web-addressable Uniform Resource Identifiers (URIs). When I publish my little graph about Armstrong it becomes part of a vast world-wide graph: the Semantic Web. So, machines hunting for information about Armstrong can reach my graph and every other graph about Armstrong. This approach allows the web to become a huge distributed knowledge base.
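As a minimal sketch of the idea, with plain Python strings standing in for the web-addressable URIs that real RDF would use:

```python
# The Armstrong triples from the text, as (subject, predicate, object) tuples.
graph = {
    ("Armstrong", "visited", "moon"),
    ("Armstrong", "isa", "human"),
    ("moon", "isa", "astronomical body"),
}

def objects(subject, predicate, graph):
    """Traverse the graph: everything `subject` points to via `predicate`."""
    return {o for s, p, o in graph if s == subject and p == predicate}

# A two-hop traversal: what kind of thing did Armstrong visit?
kinds = {kind
         for place in objects("Armstrong", "visited", graph)
         for kind in objects(place, "isa", graph)}
print(kinds)  # {'astronomical body'}
```

The two-hop query is the machine analogue of surfing: follow one edge, then another, and the answer falls out of the graph structure itself.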

There’s a new component for I2E: the XML to RDF convertor.  It turns standard I2E results into web-standard RDF. Each row in the table of results (each assertion) becomes one or more triples. For example, suppose you run an astronomy query against a news website and it returns the structured fact: Armstrong, visited, the moon, 1969. Let’s also suppose Armstrong and the moon were identified using an ontology derived from Wikipedia. The output RDF will include one URI to identify this structured fact and four more for the fact’s constituents (the first is Armstrong, the second is the relation of visiting, the third is the moon and so forth). Then, there will be a number of triples relating these constituents, for example, that the subject of the visiting is Armstrong. In addition, all the other information available in the traditional I2E results is presented in additional triples. For example, one triple might state that Armstrong’s preferred term is “Neil Armstrong”, another might state that the source concept is en_wikipedia.org/wiki/Neil_Armstrong, a third might state that the hit string (text justifying the concept) is Neil Alden Armstrong. The set of possible output triples for the I2E convertor is fully defined by an RDF schema.
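A hedged sketch of that row-to-triples conversion follows; the URIs and predicate names below are invented for illustration, since the actual output vocabulary is defined by the I2E RDF schema:

```python
# One I2E result row, as in the example above.
row = {"subject": "Armstrong", "relation": "visited",
       "object": "the moon", "date": "1969"}

# A URI identifying the structured fact itself, plus triples linking the
# fact to its constituents and carrying provenance-style information.
fact = "urn:example:fact/1"
triples = [
    (fact, "urn:example:hasSubject", row["subject"]),
    (fact, "urn:example:hasRelation", row["relation"]),
    (fact, "urn:example:hasObject", row["object"]),
    (fact, "urn:example:hasDate", row["date"]),
    (row["subject"], "urn:example:preferredTerm", "Neil Armstrong"),
    (row["subject"], "urn:example:sourceConcept",
     "en_wikipedia.org/wiki/Neil_Armstrong"),
]

for triple in triples:
    print(triple)
```

Reifying the fact with its own URI is what lets the constituent and provenance triples all hang off a single addressable node, so downstream tools can query the assertion and its evidence together.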

RDF Triple & Semantic Web

Visualization of the RDF Triple Armstrong, visited, the moon

Why is this a good thing? First and foremost, I2E results can now join the Semantic Web. Even if you don’t want to publish your results, you can still exploit the growing list of Semantic Web tools for processing your own data and integrating them with other data which has been published. Second, to quote Tim Berners-Lee again: “The Semantic Web will enable machines to COMPREHEND semantic documents and data, not human speech and writings.” I2E is all about extracting structured information from human writings. So link these two things together and you have a powerful tool for traversing the worlds of structured and unstructured data together.

Find out more about Linguamatics I2E
