Mining Nuggets from Safety Silos

May 26, 2015

Better access to the high value information in legacy safety reports has been, for many folk in pharma safety assessment, a “holy grail”. Locked away in these historical data are answers to questions such as:  Has this particular organ toxicity been seen before? In what species, and with what chemistry? Could new biomarker or imaging studies predict the toxicity earlier? What compounds could be leveraged to help build capabilities?

I2E enables extraction and integration of historical preclinical safety information, crucial to optimizing investment in R&D, alleviating concerns where preclinical observations are not expected to be human-relevant, and reducing late stage failures.

Coming as I do from a decade of working in data informatics for safety/tox prediction, I was excited by one of the talks at the recent Linguamatics Spring User conference. Wendy Cornell (ex-Merck) presented an ambitious project to use Linguamatics’ text mining platform, I2E, in a workflow to extract high value information from safety assessment reports stored in Documentum.

Access to historic safety data should become easier as standards are adopted for electronic data submission in regulatory studies (e.g. CDISC’s SEND, the Standard for Exchange of Nonclinical Data).

Standardizing the formats and vocabularies for key domains in safety studies will enable these data to be fed into searchable databases; however these structured data miss the intellectual content added by the pathologists and toxicologists, whose conclusions are essential for understanding whether evidence of a particular tox finding (e.g. hyalinosis, single cell necrosis, blood enzyme elevations) signals a potential serious problem in humans or is specific to the animal model.

For these key conclusions, access to the full study reports is essential.

At Merck, Wendy’s Proprietary Information and Knowledge Management group, in collaboration with the Safety Assessment and Laboratory Animal Resources (SALAR) group, developed an I2E workflow that extracted the key findings from safety assessment ante- and post-mortem reports, final reports, and protocols, in particular pulling out:

  • Study annotation (species, study duration, compound, target, dosage)
  • Interpreted results sections (i.e. summary or conclusions sections)
  • Organ-specific toxicology and histopathology findings
  • Haematological and serum biochemistry findings

In addition, a separate arm of the workflow used the ABBYY OCR software to extract toxicokinetic (TK) parameters such as area under the curve (AUC), maximum drug concentration (Cmax), and time after dosing of peak drug plasma exposure (Tmax) from PDF versions of the documents.

The extracted and normalized information was loaded into a semantic knowledgebase in the Cambridge Semantics ANZO tool and searched and visualized using a tailored ANZO dashboard. This faceted browsing environment enabled the SALAR researchers to ask questions such as, “what compounds with rat 3-month studies show kidney effects, and for these compounds, what long term studies do we have?”
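
As a rough illustration of the kind of question quoted above, the sketch below answers it over a handful of made-up study records in plain Python. Everything in it is an assumption for illustration: the `Study` fields, the compound names, the findings, and the six-month cut-off used to mean "long term". It is not the Merck workflow or the ANZO query layer, which would answer this against the semantic knowledgebase itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    """One extracted study annotation (hypothetical fields)."""
    compound: str
    species: str
    duration_months: int
    organs_with_findings: frozenset


# Hypothetical records standing in for extracted, normalized report data.
STUDIES = [
    Study("CMPD-A", "rat", 3, frozenset({"kidney"})),
    Study("CMPD-A", "dog", 9, frozenset({"kidney", "liver"})),
    Study("CMPD-B", "rat", 3, frozenset({"liver"})),
    Study("CMPD-B", "dog", 12, frozenset()),
    Study("CMPD-C", "rat", 3, frozenset({"kidney"})),
]


def long_term_followups(studies, species="rat", months=3, organ="kidney",
                        long_term_min_months=6):
    """Find compounds with the given organ finding in a short study,
    then list whatever longer studies exist for each of them."""
    flagged = {
        s.compound
        for s in studies
        if s.species == species
        and s.duration_months == months
        and organ in s.organs_with_findings
    }
    followups = {compound: [] for compound in sorted(flagged)}
    for s in studies:
        # The cut-off for "long term" is an assumption, not from the talk.
        if s.compound in flagged and s.duration_months >= long_term_min_months:
            followups[s.compound].append(s)
    return followups


for compound, long_studies in long_term_followups(STUDIES).items():
    durations = [f"{s.species} {s.duration_months}-month" for s in long_studies]
    print(compound, durations or "no long-term study on record")
```

An empty list for a compound is itself informative here: it flags a kidney finding with no long-term study on record.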

Wendy presented several use cases showing real value of this system to the business, including the potential to influence regulatory guidelines. For example, the team were able to run an analysis to assess the correlation between 3-month sub-chronic non-rodent studies, and 9- or 12-month chronic non-rodent results; they found that in nearly 30% of cases an important new toxicologic finding was identified in the long-term studies, confirming the ongoing need for long-term studies.

Wendy stated, “This unstructured information represents a rich body of knowledge which, in aggregate, has potential to identify capability gaps and evaluate individual findings on active pipeline compounds in the context of broad historical data.”

With the current focus on refinement, replacement and reduction of animal studies, being able to identify when long-term studies are needed, and when they are not essential for human risk assessment, will be hugely valuable; extracting these nuggets of information from historic data will contribute to this understanding.

Expert interpretations and conclusions from thousands of past studies can potentially be converted into actionable knowledge. These findings exist as unstructured text in safety documents. See Wendy Cornell speak on this at our upcoming NLP and Big Data Symposium in San Francisco.

BioIT 2015: Bench to Bedside value from Data Science and Text Analytics

April 29, 2015

Last week’s BioIT World Expo kicked off with a great keynote from Philip Bourne (Associate Director for Data Science, National Institutes of Health), setting the scene for a theme that ran throughout the conference: how can we benefit from big data analytics, or data science, for pharma R&D and delivery into healthcare? With 2 days of talks, 12 tracks covering cloud computing, NGS analytics, Pharmaceutical R&D Informatics, Data Visualization & Exploration Tools, and Data Security, plus a full day of workshops beforehand and a busy exhibition hall, there was plenty to see, do, take in and discuss. I attended several talks on best practice in data science by speakers from Merck, Roche, and BMS, and I was pleased to hear speakers mention text analytics, particularly natural language processing, as a key part of the overall data science solution.

All this, and in beautiful Boston, just after the Marathon; luckily for us, the weather warmed up and we were treated to a couple of lovely sunny days.  As one of the speakers in the Pharma R&D informatics track, I presented some of the Linguamatics use cases of text analytics across the drug discovery – development – delivery pipeline.  Our customers are getting value along this bench-to-bedside continuum, using text mining techniques for gene annotation, patent analytics, regulatory QA/QC, clinical trial matching for patients, and through into extracting key endpoints from pathology reports and EHRs for better patient care. If you missed the talk, we will be presenting the material at our webinar on 3rd June.

Boston at night

Advancing clinical trial R&D: industry’s most powerful NLP text analytics meets world-class curated life sciences content

December 19, 2014

What challenges were seen in competitive R&D and clinical stages? What outcomes were measured in related trials? Does the drug I am creating have potential efficacy or safety challenges? What does the patient population look like?

These are the sort of critical business questions that many life science researchers need to answer. And now, there’s a solution that can help you.

We all know the importance of high quality content you can depend on when it comes to making key business decisions across the pharma life cycle. We also know that the best way to get from textual data to new insights is using natural language processing-based text analytics. And that’s where our partnership with Thomson Reuters comes in. We’ve worked together on a solution to bring Linguamatics’ market-leading text mining platform, I2E, together with Thomson Reuters Cortellis high-quality clinical and epidemiology content: Cortellis Informatics Clinical Text Analytics for I2E.

Cortellis Informatics Clinical Text Analytics for I2E applies the power of natural language processing-based text mining from Linguamatics I2E to Cortellis clinical and epidemiology content sets. Taking this approach allows users to rapidly extract relevant information using the advanced search capabilities of I2E. The solution also allows users to identify concepts using a rich set of combined vocabularies from Thomson Reuters and Linguamatics.

Through one single interface users can quickly and easily gain access to new insights to support R&D, clinical development and clinical operations. This is the first time a cloud-based text mining service has been applied to commercial grade clinical and epidemiology content. The wide-ranging content set consists of global clinical trial reports, literature, press releases, conferences and epidemiology data in a secure, ready-to-use on-demand format.

Key features of the solution include:

  • High precision information extraction, using state of the art text analytics, combined with high quality, hand curated data
  • Search using a combination of Cortellis ontologies, plus domain specific and proprietary ontologies.
  • Find numeric information e.g. experimental assay results, patient numbers, trial outcome timepoints, financial values, dates.
  • Generate data tables to support you in your preclinical studies, trial design, and in understanding the impact of clinical trials.
  • Generate new hypotheses through identification of entity relationships in unstructured text e.g. assay and indication.

To find out more about how to save time and get better results from your clinical data searches, visit the Linguamatics website or contact us to gain access.

Ebola: Text analytics over patent sources for medicinal chemistry

November 13, 2014

The 2014 Ebola outbreak is officially the deadliest in history. Governments and organizations are searching for ways to halt the spread – both responding with humanitarian help, and looking for treatments to prevent or cure the viral infection.

CDC image of Ebola virus

Ebola virus disease (or Ebola haemorrhagic fever) is caused by the Ebola filovirus

A couple of weeks ago we received a tweet from Chris Southan, who has been looking at crowdsourcing anti-Ebola medicinal chemistry. He asked us to mine Ebola C07D patents (i.e. those for heterocyclic small molecules, the standard chemistry for most drugs) using our text analytics tool I2E, and provide him with the resulting chemical structures.

We wanted to help. What anti-Ebola research has been patented that might provide value to the scientific community? Searching patents for chemistry using an automated approach is notoriously tricky: patent documents are long and often purposefully obfuscated, with chemicals frequently obscured by the complex language used to describe them, corrupted by OCR errors, or lost in the poor overall formatting of the patents.

Andrew Hinton, one of our Application Specialists with a background in chemistry, used I2E to map the patent landscape around Ebola, identify patents for small molecules described to target Ebola, and extract the chemical structures. He compiled queries to answer the key questions and find those patents which were most relevant:

  • Does the patent mention Ebola or Ebola-like diseases? More importantly, is Ebola the major focus of the patent?
  • Who is the pharma or biotech company?
  • Is it a small molecule or non-small molecule patent?
  • What’s the exemplified chemistry? What’s the claimed chemistry? What’s the Markush chemistry?
  • What chemistry is found as an image? What chemistry is found in a table? Can we extract these structures too?

Andrew ran these queries over patents from USPTO, EPO and WIPO for the past 20 years on data derived from IFI CLAIMS.

Ebola patent trend over past decade

Graph showing C07D patents (blue, left-hand axis) and non-C07D patents (red line, right-hand axis) for Ebola related patents from 1994 to 2014, from the three major patent registries [please note different scales for the axes].

The results showed a general increase in the number of patent records related to Ebola, but the numbers are comparatively small: for example, there were about 50k C07D patents published in 2010 across all therapeutic areas, and of these we found that only about 100 related to Ebola (the number of truly unique patent families is likely to be a smaller subset of that figure). This isn’t really that surprising; as with most viral diseases, the main emphasis for therapies has been on biologics and other non-small-molecule treatments. In fact, of the 16k total patents that mention Ebola, only 1% are C07D patents focused specifically on Ebola.
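
To make the relative sizes concrete, the short calculation below works only from the approximate figures quoted above (about 50k C07D patents in 2010, about 100 of them Ebola-related, 16k patents mentioning Ebola, 1% of those being Ebola-focused C07D patents). The variable names are illustrative, and the results are only as precise as those rounded inputs.

```python
# Approximate figures quoted in the text; variable names are illustrative.
c07d_patents_2010 = 50_000          # all therapeutic areas
ebola_c07d_patents_2010 = 100       # Ebola-related C07D patents in 2010
patents_mentioning_ebola = 16_000   # all patents that mention Ebola
ebola_focused_c07d_share = 0.01     # share that are Ebola-focused C07D

# Share of the 2010 C07D patents that relate to Ebola.
ebola_share_of_c07d_2010 = ebola_c07d_patents_2010 / c07d_patents_2010

# Rough count of Ebola-focused C07D patents implied by the 1% figure.
ebola_focused_c07d_count = round(
    patents_mentioning_ebola * ebola_focused_c07d_share
)

print(f"Ebola share of 2010 C07D patents: {ebola_share_of_c07d_2010:.1%}")
print(f"Ebola-focused C07D patents: about {ebola_focused_c07d_count}")
```

On these inputs, Ebola accounts for roughly 0.2% of the 2010 C07D patents, and the 1% figure corresponds to about 160 patents.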

Ebola patents - ranked by organisations

Heatmap showing that the top 3 organizations with small molecule C07D patents in this area contribute 1/5 of all Ebola patents.

So what is the outcome of this? Using I2E, we have been able to extract the set of molecules reported in these Ebola-related patents, and will provide a set of these to Chris Southan for his chemoinformatics analysis. Let’s hope that this little step might further research towards providing a solution to the current Ebola epidemic.

Ebola patents - chemistry table

Screenshot of I2E results showing names and structures extracted using text analytics from Ebola-related patent set. Structures are generated using ChemAxon’s Name-to-Structure conversion tools (

Economics of the Obesity Epidemic – Extracting Knowledge with Advanced Text Analytics

July 15, 2014

In the current competitive marketplace for healthcare, pharmaceutical and medical technology companies must be able to demonstrate clinical and economic evidence of benefit to providers, healthcare decision-makers and payers. Now more than ever, pricing pressure and regulatory restrictions are generating increased demand for this kind of outcomes evidence.

Health Economics and Outcomes Research (HEOR) aims to assess the direct and indirect health care costs associated with a disease or a therapeutic area, and associated interventions in real-world clinical practice. These costs include:

• Direct economic loss

• Economic loss through hospitalization

• Indirect costs from loss of wider societal productivity

The availability of an increasing amount of data on patients, prescriptions, markets, and scientific literature, combined with the wider use of comparative effectiveness, makes traditional keyword-based search techniques ineffectual. I2E can provide the starting point for efficiently performing evidence-based systematic reviews over very large sets of scientific literature, enabling researchers to answer questions such as:

• What is the economic burden of disease within the healthcare system? Across states, and globally?

• Does XYZ new intervention merit funding? What are the economic implications of its use?

• How do the incremental costs compare with the anticipated benefits for specific patient groups?

• How does treatment XYZ affect quality of life? Activities of daily living? Health status indicators? Patient satisfaction?

A recent project looking at the economics of obesity used I2E to search all 23 million abstracts in Medline for research on the incidence of comorbid diseases, with associated information on patient cohort, geographic location, severity of disease, and associated costs (e.g. hospitalisation cost, treatment, etc.). From the I2E output, more advanced visual analytics can be carried out. For example, the pie chart shows the prevalence of the various comorbid diseases (from 2013 Medline abstracts that mention HEOR terms, obesity and a comorbid disease), showing the high frequency of hypertension and various other cardiovascular diseases. Another view of the same extracted intelligence shows the geographic spread of health economics and obesity research, with a predominance across North America, but also data from China and Brazil, for example.

Prevalence of cardiovascular co-morbid diseases


Geographic view of HEOR research, mined from Medline from 2013

If you are interested in getting a better understanding of the power of advanced text analytics for HEOR, please contact us.

How to protect and develop your enterprise search investment with text analytics

June 24, 2014

It’s funny, isn’t it? Search at home just works. You’re looking for a holiday, train times, a particular recipe or the answer to your kid’s homework. You sit down and type your keyword/s into your search engine. Milliseconds later, results appear – the one you’re looking for is usually one of the first ones – you click on it and voila! You have what you were looking for.

But search at work doesn’t seem to be as effective. Maybe you are looking for information internally. You know it exists but you’re not quite sure where. The information lies across silos and it’s a mix of structured and unstructured. As a scientist it’s important for you to easily find information hidden in memos, project plans, meeting minutes, study reports, literature etc. You type a keyword search in your enterprise search engine. A list of documents comes back but none of them look like the one you want. You feel like you’re wasting your time. Sound familiar?

You’re not alone. At least, that is what recent surveys and conferences on enterprise search have revealed. According to a recent report from Findwise, 64% of organizations say it’s difficult to find information within their organization. Why?

  • Poor search functionality
  • Inconsistencies in how information is tagged
  • People don’t know where to look or what to look for

So how can we address this? Well, there’s already been talk of using text analytics to improve enterprise search. Text analytics, also referred to as text mining, allows users to go beyond keyword search to interpret the meaning of text in documents. While text analytics solutions have existed for some years now, more recently they’ve been working in harmony with enterprise search to improve the quality of results and make information more discoverable.

Let me give you an example. For over 10 years, Linguamatics I2E has been mining data and content such as scientific literature, patents, clinical trials data, news feeds, electronic health records, social media and proprietary content, working with 17 of the top 20 pharmaceutical companies to improve and power their knowledge discovery. Meanwhile, organizations have been deploying enterprise search engines to search internally.

Dissatisfied with their search solution and familiar with using I2E in other areas, a top 20 pharma wanted to see if the power of I2E’s text analytics could be applied to their enterprise search system. A proof of concept was proposed using Microsoft SharePoint. The organization did some internal requirements gathering and worked with both Microsoft and Linguamatics to come up with a solution to improve their search. I2E worked in the background, using its natural language processing technology to identify concepts and mark up semantic entities such as genes, drugs, diseases, organizations, anatomy, authors and other relevant concepts and relationships. Once the documents were annotated, taxonomies/thesauri were built and the marked-up documents were fed back into SharePoint.

To the users, the search interface remained the same but there was a difference in the results. I2E was able to provide semantic facets for the search engine, allowing the user to quickly filter the results to what they were looking for. The facets were concepts rather than words, so users could filter to a more intuitive set of results, e.g. just show me the results for ‘breast cancer’ as a concept. This would also include all results containing variations of how that concept appears in the text, e.g. breast carcinoma, breast tumor, cancer of the breast etc. In addition, I2E provided SharePoint with the ability to autocomplete terms as the user was typing them, and when performing the search, SharePoint was taught to look for synonyms of the word/s typed in.
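
The idea of a concept facet can be sketched in a few lines of Python. This is only a toy under stated assumptions: the thesaurus, the document ids and the document text are invented, and plain regular expressions stand in for real NLP. It does not represent how I2E or SharePoint implement annotation or faceting.

```python
import re
from collections import defaultdict

# Hypothetical thesaurus: one preferred concept name and its surface variants.
THESAURUS = {
    "breast cancer": [
        "breast cancer",
        "breast carcinoma",
        "breast tumor",
        "cancer of the breast",
    ],
}

# Hypothetical internal documents.
DOCS = {
    "memo-1": "Phase II results in breast carcinoma were discussed.",
    "minutes-7": "Action: review the cancer of the breast cohort.",
    "plan-3": "Timeline for the lung cancer programme.",
}


def build_patterns(thesaurus):
    """Compile one case-insensitive pattern per concept, longest variant first."""
    patterns = {}
    for concept, variants in thesaurus.items():
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in ordered)
        patterns[concept] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def facet_index(docs, patterns):
    """Map each concept to the set of documents mentioning any of its variants."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for concept, pattern in patterns.items():
            if pattern.search(text):
                index[concept].add(doc_id)
    return index


index = facet_index(DOCS, build_patterns(THESAURUS))
print(sorted(index["breast cancer"]))
```

Filtering on the ‘breast cancer’ facet returns both the memo and the minutes, even though neither contains that exact phrase, while the lung cancer plan is left out.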

The organization was incredibly happy with the improved search performance, citing as the main benefits improved efficiency, better-quality search results, and more transparent and available information, which stimulated innovation within the organization.

This is just the beginning. The capabilities of I2E could also be applied to other search engines and to other scenarios where search needs to be improved, in order to increase the return on the investment made in the system, protect and develop future investments, and increase usage and findability.

If you’d like to find out more, sign up for Linguamatics’ webinar or contact us for a demo.

Are the terms text mining and text analytics largely interchangeable?

January 17, 2014

I saw this comment in a recent article by Seth Grimes, where he discusses the terms Text Analysis and Text Analytics (see the article Naming and Classifying: Text Analysis vs. Text Analytics). Within the article, Mr. Grimes states that text mining and text analytics are largely interchangeable terms:

“The terms “text analytics” and “text mining” are largely interchangeable. They name the same set of methods, software tools, and applications. Their distinction stems primarily from the background of the person using each — “text mining” seems most used by data miners, and “text analytics” by individuals and organizations in domains where the road to insight was paved by business intelligence tools and methods — so that the difference is largely a matter of dialect.” ref:

I asked Linguamatics CTO, David Milward, for his thoughts:

There is certainly overlap, but I think there are cases of analytics that would not be classed as text mining and vice versa. Text analytics tends to be more about processing a document collection as a whole, whereas text mining traditionally has more of a needle-in-a-haystack connotation.

For example, word clouds might be classified as text analytics, but not text mining. Use of natural language processing (NLP) for advanced searching is not so naturally classified under text analytics. Something like the Linguamatics I2E text mining platform is used for many text analytics applications, but its agile nature means it is also used as an alternative to professional search tools.

A further term is Text Data Mining. This is usually used to distinguish cases where new knowledge is being generated, rather than old knowledge being rediscovered. The typical case is indirect relationships: one item is associated with a second item in one document, and the second item is associated with a third in another document. This provides a possible relationship between the first item and the third item: something which may not be possible to find within any one document.
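
The indirect-relationship case described above can be sketched as follows. This is a minimal illustration, not a description of how I2E works: the document ids and item names are placeholders, and it assumes the pairwise associations have already been extracted from each document.

```python
from collections import defaultdict
from itertools import combinations

# Placeholder input: for each document, the item pairs it associates.
DOC_RELATIONS = {
    "doc-1": [("item A", "item B")],
    "doc-2": [("item B", "item C")],
    "doc-3": [("item A", "item D")],
}


def indirect_links(doc_relations):
    """Suggest (first, third) pairs linked only via a shared second item.

    Returns a dict mapping each suggested pair to the set of bridging items.
    Pairs that are already directly associated in some document are skipped,
    as are pairs whose two halves come from the same single document.
    """
    # item -> neighbouring item -> documents supporting that association
    neighbours = defaultdict(lambda: defaultdict(set))
    for doc_id, pairs in doc_relations.items():
        for left, right in pairs:
            neighbours[left][right].add(doc_id)
            neighbours[right][left].add(doc_id)

    suggestions = {}
    for bridge, linked in neighbours.items():
        for first, third in combinations(sorted(linked), 2):
            if third in neighbours.get(first, {}):
                continue  # already directly associated somewhere
            # Require the two halves to come from different documents.
            if any(d1 != d2 for d1 in linked[first] for d2 in linked[third]):
                suggestions.setdefault((first, third), set()).add(bridge)
    return suggestions


for (first, third), bridges in sorted(indirect_links(DOC_RELATIONS).items()):
    print(f"{first} may relate to {third} via {sorted(bridges)}")
```

Each suggestion is only a hypothesis for someone to check: no single document states the relationship between the first and third items.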
