Who’s Doing What with Linguamatics I2E?

June 26, 2015

Over the past few months, several publications have used Linguamatics I2E to extract key information in a variety of different projects. We are constantly amazed by the inventiveness of our users, who apply text analytics across the bench-to-bedside continuum, and these publications are no exception. Using the natural language processing power of I2E, researchers can answer their questions rapidly and extract the results they need, with high precision and good recall; standard keyword search, by contrast, returns a document set that they then need to read.

Let’s start with Hornbeck et al., “PhosphoSitePlus, 2014: mutations, PTMs and recalibrations”. PhosphoSitePlus is an online systems biology resource for the study of protein post-translational modifications (PTMs), including phosphorylation, ubiquitination, acetylation and methylation. It’s provided by Cell Signaling Technology, who have been users of I2E for several years. In the paper, they describe the value gained from integrating data on protein modifications from high-throughput mass spectrometry studies with high-quality data from manual curation of published low-throughput (LTP) scientific literature.

The authors say: “The use of I2E, a powerful natural language processing software application, to identify articles and highlight information for manual curation, has significantly increased the efficiency and throughput of our LTP curation efforts, and made recuration of selected information a realistic and less time-consuming option.” The CST scientists rescore and recurate PTM assignments, reviewing for coherence and reliability – and use of these data can “provide actionable insights into the interplay between disease mutations and cell signalling mechanisms.”

A paper by a group from Roche, Zhang et al., “Pathway reporter genes define molecular phenotypes of human cells”, describes a new approach to understanding the effect of diseases or drugs on biological systems by looking for “molecular phenotypes”, or fingerprints: patterns of differential gene expression triggered by a change to the cell. Here, text analytics played a smaller role, used alongside other tools to compile a panel of over 900 human pathway reporter genes representing 154 human signalling and metabolic networks. These were then used to gain understanding of cardiomyocyte development (relevant to diabetic cardiomyopathy) and to assess common toxicity mechanisms (relevant to the mechanistic basis of adverse drug events).

The last one I wanted to highlight moves away from the realms of genes and cells, into analysis of co-prescription trends and drug-drug interactions (Sutherland et al., “Co-prescription trends in a large cohort of subjects predict substantial drug-drug interactions”). Better understanding of drug-drug interactions is of increasing importance for good healthcare delivery, as more and more patients routinely take multiple medications, particularly among the elderly; and the huge number of potential combinations makes it impossible to test them all for safety in clinical trials.

Over 30% of older adults take 5 or more drugs. Text analytics can extract clinical knowledge about potential drug-drug interactions.


In this study, the authors used prescription data from NHANES surveys to find which drugs or drug classes were most routinely prescribed together, and then used I2E to search MEDLINE for a set of 133 co-prescribed drugs, to assess the availability of clinical knowledge about potential drug-drug interactions. They found that over 30% of older adults take five or more drugs, yet the combinations taken were almost entirely unique to each patient. A large percentage of the co-prescribed pairs were not mentioned together in any MEDLINE record, demonstrating a need for further study. The authors conclude that these data show “that personalized medicine is indeed the norm, as patients taking several medications are experiencing unique pharmacotherapy”; and yet there is little published research on either the efficacy or safety of these combinations.
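As a rough illustration of the gap analysis described here, a few lines of code can check which co-prescribed pairs ever appear together in a set of abstracts. The drug names and abstracts below are invented placeholders, not the study’s data, and real MEDLINE searching with I2E is far more sophisticated than substring matching:

```python
# Toy sketch: which co-prescribed drug pairs are co-mentioned in the
# literature, and which have no published evidence at all?
from itertools import combinations

# Placeholder abstracts and drug list (not from the actual study).
abstracts = [
    "simvastatin and amlodipine interaction raises myopathy risk",
    "metformin monotherapy outcomes in type 2 diabetes",
]
co_prescribed = {"simvastatin", "amlodipine", "metformin", "lisinopril"}

def pairs_with_evidence(abstracts, drugs):
    """Return the drug pairs co-mentioned in at least one abstract."""
    found = set()
    for text in abstracts:
        lowered = text.lower()
        present = [d for d in drugs if d in lowered]
        for pair in combinations(sorted(present), 2):
            found.add(pair)
    return found

evidence = pairs_with_evidence(abstracts, co_prescribed)
# Pairs never co-mentioned flag gaps in the published evidence base.
all_pairs = set(combinations(sorted(co_prescribed), 2))
unstudied = all_pairs - evidence
```

The same tally, scaled up to 133 drugs and all of MEDLINE, is what exposes how many real-world combinations have never been studied together.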

What do these three studies have in common? The use of text analytics, not as the only tool or necessarily even the major tool, but as part of an integrated analysis of data, to answer focused and specific questions, whether those questions relate to protein modification, molecular patterns of genes in pathways or drug interactions and potential adverse events. And I wonder, where will Linguamatics I2E be used next?


Mining Nuggets from Safety Silos

May 26, 2015

Better access to the high value information in legacy safety reports has been, for many folk in pharma safety assessment, a “holy grail”. Locked away in these historical data are answers to questions such as: Has this particular organ toxicity been seen before? In what species, and with what chemistry? Could new biomarker or imaging studies predict the toxicity earlier? What compounds could be leveraged to help build capabilities?

I2E enables extraction and integration of historical preclinical safety information, crucial to optimizing investment in R&D, alleviating concerns where preclinical observations are not expected to be human-relevant, and reducing late stage failures.


Coming as I do from a decade of working in data informatics for safety/tox prediction, I was excited by one of the talks at the recent Linguamatics Spring User conference. Wendy Cornell (ex-Merck) presented on an ambitious project to use Linguamatics text mining platform, I2E, in a workflow to extract high value information from safety assessment reports stored in Documentum.

Access to historic safety data will be helped by the use of standards in electronic data submission for regulatory studies (e.g. CDISC’s SEND, the standard for exchange of non-clinical data).

Standardizing the formats and vocabularies for key domains in safety studies will enable these data to be fed into searchable databases; however these structured data miss the intellectual content added by the pathologists and toxicologists, whose conclusions are essential for understanding whether evidence of a particular tox finding (e.g. hyalinosis, single cell necrosis, blood enzyme elevations) signals a potential serious problem in humans or is specific to the animal model.

For these key conclusions, access to the full study reports is essential.

At Merck, Wendy’s Proprietary Information and Knowledge Management group, in collaboration with the Safety Assessment and Laboratory Animal Resources (SALAR) group, developed an I2E workflow that extracted the key findings from safety assessment ante- and post-mortem reports, final reports, and protocols, in particular pulling out:

  • Study annotation (species, study duration, compound, target, dosage)
  • Interpreted results sections (i.e. summary or conclusions sections)
  • Organ-specific toxicology and histopathology findings
  • Haematological and serum biochemistry findings

In addition, a separate arm in the workflow leveraged the ABBYY OCR software to extract toxicokinetic (TK) parameters such as area under the curve (AUC), maximum drug concentration (Cmax), and time after dosing of peak drug plasma exposure (TMax) from PDF versions of the documents.
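As a toy illustration of this kind of parameter extraction (and emphatically not the Merck/ABBYY workflow itself), a simple regular expression can pull TK values and units out of OCR’d text. The pattern, units and example sentence below are all assumptions for the sketch:

```python
# Illustrative sketch: extract TK parameters (AUC, Cmax, Tmax) with
# their values and units from free text, e.g. OCR output.
import re

text = "Following dosing, Cmax was 412 ng/mL and AUC was 1530 ng*h/mL; Tmax 2 h."

# Hypothetical pattern: parameter name, optional linking word, value, unit.
PARAM = re.compile(
    r"\b(AUC|Cmax|Tmax)\b\s*(?:was|=|:)?\s*([\d.]+)\s*([A-Za-z*/]+)?",
    re.IGNORECASE,
)

def extract_tk(text):
    """Return a list of (parameter, value, unit) tuples found in the text."""
    return [(p, float(v), u) for p, v, u in PARAM.findall(text)]

print(extract_tk(text))
```

A production workflow would of course need far more robust handling of tables, unit variants and OCR noise than this sketch suggests.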

The extracted and normalized information was loaded into a semantic knowledgebase in the Cambridge Semantics ANZO tool and searched and visualized using a tailored ANZO dashboard. This faceted browsing environment enabled the SALAR researchers to ask questions such as, “what compounds with rat 3-month studies show kidney effects, and for these compounds, what long term studies do we have?”
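Conceptually, once findings are normalized into structured records, that faceted question reduces to simple filters and a join. A minimal sketch with invented placeholder records (not the actual knowledgebase schema):

```python
# Placeholder study records standing in for the semantic knowledgebase.
studies = [
    {"compound": "CMP-1", "species": "rat", "duration_months": 3,
     "finding": "kidney effects"},
    {"compound": "CMP-1", "species": "dog", "duration_months": 9,
     "finding": "none"},
    {"compound": "CMP-2", "species": "rat", "duration_months": 3,
     "finding": "liver effects"},
]

# "What compounds with rat 3-month studies show kidney effects?"
hits = {s["compound"] for s in studies
        if s["species"] == "rat" and s["duration_months"] == 3
        and "kidney" in s["finding"]}

# "...and for these compounds, what long-term studies do we have?"
long_term = [s for s in studies
             if s["compound"] in hits and s["duration_months"] >= 6]
```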

Wendy presented several use cases showing real value of this system to the business, including the potential to influence regulatory guidelines. For example, the team were able to run an analysis to assess the correlation between 3-month sub-chronic non-rodent studies, and 9- or 12-month chronic non-rodent results; they found that in nearly 30% of cases an important new toxicologic finding was identified in the long-term studies, confirming the ongoing need for long-term studies.

Wendy stated, “This unstructured information represents a rich body of knowledge which, in aggregate, has potential to identify capability gaps and evaluate individual findings on active pipeline compounds in the context of broad historical data.”

With the current focus on refinement, replacement and reduction of animal studies, being able to identify when long-term studies are needed vs. when they are not essential for human risk assessment, will be hugely valuable; and extracting these nuggets of information from historic data will contribute to this understanding.

Expert interpretations and conclusions from thousands of past studies, which exist as unstructured text in safety documents, can potentially be converted into actionable knowledge. See Wendy Cornell speak on this at our upcoming NLP and Big Data Symposium in San Francisco.

BioIT 2015: Bench to Bedside value from Data Science and Text Analytics

April 29, 2015

Last week’s BioIT World Expo kicked off with a great keynote from Philip Bourne (Associate Director for Data Science, National Institutes of Health), setting the scene for a theme that ran throughout the conference: how can we benefit from big data analytics, or data science, for pharma R&D and delivery into healthcare? With two days of talks, 12 tracks covering cloud computing, NGS analytics, Pharmaceutical R&D Informatics, Data Visualization & Exploration Tools, and Data Security, plus a full day of workshops beforehand and a busy exhibition hall, there was plenty to see, do, take in and discuss. I attended several talks on best practice in data science by speakers from Merck, Roche, and BMS, and I was pleased to hear speakers mention text analytics, particularly natural language processing, as a key part of the overall data science solution.

All this, and in beautiful Boston, just after the Marathon; luckily for us, the weather warmed up and we were treated to a couple of lovely sunny days.  As one of the speakers in the Pharma R&D informatics track, I presented some of the Linguamatics use cases of text analytics across the drug discovery – development – delivery pipeline.  Our customers are getting value along this bench-to-bedside continuum, using text mining techniques for gene annotation, patent analytics, regulatory QA/QC, clinical trial matching for patients, and through into extracting key endpoints from pathology reports and EHRs for better patient care. If you missed the talk, we will be presenting the material at our webinar on 3rd June.

Boston at night


Patent Landscaping – Text Analytics Extracts the Value

March 23, 2015

Patent literature is a hugely valuable source of novel information for life science research and business intelligence. The wealth of knowledge disclosed in patents may not be found in other information sources, such as MEDLINE or full text journal articles.

Patent landscape reports (also known as patent mapping or IP landscaping) provide a snapshot of the patent situation for a specific technology, and can be used to understand freedom-to-operate issues, to identify in- and out-licensing opportunities, to examine competitor strengths and weaknesses, or as part of a more comprehensive market analysis.

These are valuable searches, but they demand advanced search and data visualization techniques, as any particular landscape report requires examination of many hundreds or thousands of patent documents. Patent text is unstructured; the information needed is often embedded within the body of the patent and may be scattered throughout the lengthy descriptions; and the language is often complex and designed to obfuscate.

Text analytics can provide the key to unlock the value. A recent paper by a team at Bristol-Myers Squibb describes a novel workflow to discover trends in kinase assay technology. The aim was to strengthen their internal kinase screening technology; the first step was to analyze industry trends and benchmark BMS’s capabilities against other pharmaceutical companies, with key questions including:

  • What are the kinase assay technology trends?
  • What are the trends for different therapeutic areas?
  • What are the trends for technology platforms used by the big pharmaceutical companies?

The BMS team built a workflow using several tools: Minesoft’s Patbase, for the initial patent document set collection; Linguamatics I2E, for knowledge extraction; and TIBCO’s Spotfire, for data visualization. The project used I2E to create precise, effective search queries to extract key information around 500 kinases, 5 key screening technologies, 5 therapeutic areas, and across 14 pharmaceutical companies. Use of I2E allowed queries to be designed using domain specific vocabularies for these information entities, for example using over 10,000 synonyms for the kinases, hugely improving the recall of these patent searches. These I2E “macros” enabled information to be extracted regardless of how the facts were described by inventors. Using these vocabularies also allowed semantic normalization; so however the assignee described a concept, the output was standardized to a preferred term, for example, Pfizer for Wyeth, Warner Lambert, etc.
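The normalization step can be pictured as a simple synonym lookup: however an assignee is written in the patent, it maps to one preferred term. The tiny table below is an illustrative stand-in for the curated vocabularies, which contain many thousands of terms:

```python
# Illustrative synonym table (a stand-in for the real curated vocabulary):
# raw assignee strings mapped to a single preferred term.
SYNONYMS = {
    "wyeth": "Pfizer",
    "warner lambert": "Pfizer",
    "warner-lambert": "Pfizer",
    "pfizer inc": "Pfizer",
}

def normalize_assignee(name):
    """Map a raw assignee string to its preferred term (or return it unchanged)."""
    key = name.strip().lower()
    return SYNONYMS.get(key, name.strip())

print(normalize_assignee("Warner Lambert"))
```

The same idea, applied to over 10,000 kinase synonyms, is what lifts the recall of the patent searches so dramatically.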

Using I2E also meant that searches could be focused on specific regions of the patent documents for more precise search; for example, the kinase information was extracted from claims (enhancing the precision of the search).

Using this novel approach, the patent analysis team mined over 7,100 full-text patents: approximately half a million pages of full text, searched for relevant kinase technology trends and the corresponding therapeutic area information. To put the business value into perspective, it takes roughly an hour to manually read one patent for data extraction, so a scope this large would require around 175 person-weeks (nearly 3.5 years!) to accomplish. The authors state that innovative use of I2E enabled a 99% efficiency gain in delivering the relevant information. They also note that the project took two patent analysts three months (about 25 person-weeks), a 7-fold saving in FTE time.
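Those figures can be sanity-checked with a quick back-of-envelope calculation, assuming a 40-hour working week (an assumption of mine, not stated in the paper):

```python
# Back-of-envelope check of the time savings reported in the paper,
# assuming a 40-hour working week.
patents = 7100
hours_per_patent = 1

manual_weeks = patents * hours_per_patent / 40   # manual reading effort
actual_weeks = 2 * 12.5                          # 2 analysts for ~3 months

fold_saving = manual_weeks / actual_weeks        # roughly the 7-fold claimed
```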

The deliverables provided key actionable insights that empowered timely business decisions for BMS researchers; and this paper demonstrates that rich information contained in full text patents can be analyzed if innovative tools/methods are used.


Data to knowledge: visualisation of the different pharma companies and the numbers of relevant patents for each in the kinase assay technologies. Taken from Yang et al. (2014) WPI 39: 24-34.


Advancing clinical trial R&D: industry’s most powerful NLP text analytics meets world-class curated life sciences content

December 19, 2014

What challenges were seen in competitive R&D and clinical stages? What outcomes were measured in related trials? Does the drug I am creating have potential efficacy or safety challenges? What does the patient population look like?

These are the sort of critical business questions that many life science researchers need to answer. And now, there’s a solution that can help you.

We all know the importance of high quality content you can depend on when it comes to making key business decisions across the pharma life cycle. We also know that the best way to get from textual data to new insights is using natural language processing-based text analytics. And that’s where our partnership with Thomson Reuters comes in. We’ve worked together on a solution to bring Linguamatics market-leading text mining platform, I2E, together with Thomson Reuters Cortellis high-quality clinical and epidemiology content: Cortellis Informatics Clinical Text Analytics for I2E.

Cortellis Informatics Clinical Text Analytics for I2E applies the power of natural language processing-based text mining from Linguamatics I2E to Cortellis clinical and epidemiology content sets. Taking this approach allows users to rapidly extract relevant information using the advanced search capabilities of I2E. The solution also allows users to identify concepts using a rich set of combined vocabularies from Thomson Reuters and Linguamatics.

Through one single interface users can quickly and easily gain access to new insights to support R&D, clinical development and clinical operations. This is the first time a cloud-based text mining service has been applied to commercial grade clinical and epidemiology content. The wide-ranging content set consists of global clinical trial reports, literature, press releases, conferences and epidemiology data in a secure, ready-to-use on-demand format.

Key features of the solution include:

  • High precision information extraction, using state-of-the-art text analytics combined with high quality, hand-curated data
  • Search using a combination of Cortellis ontologies, plus domain-specific and proprietary ontologies
  • Find numeric information, e.g. experimental assay results, patient numbers, trial outcome timepoints, financial values, dates
  • Generate data tables to support your preclinical studies, trial design, and understanding of the impact of clinical trials
  • Generate new hypotheses through identification of entity relationships in unstructured text, e.g. assay and indication

To find out more about how to save time and get better results from your clinical data searches, visit the Linguamatics website or contact us to gain access.

Economics of the Obesity Epidemic – Extracting Knowledge with Advanced Text Analytics

July 15, 2014

In the current competitive marketplace for healthcare, pharmaceutical and medical technology companies must be able to demonstrate clinical and economic evidence of benefit to providers, healthcare decision-makers and payers. Now more than ever, pricing pressure and regulatory restrictions are generating increased demand for this kind of outcomes evidence.

Health Economics and Outcomes Research (HEOR) aims to assess the direct and indirect health care costs associated with a disease or a therapeutic area, and associated interventions in real-world clinical practice. These costs include:

  • Direct economic loss
  • Economic loss through hospitalization
  • Indirect costs from loss of wider societal productivity

The availability of increasing amounts of data on patients, prescriptions, markets, and the scientific literature, combined with the wider use of comparative effectiveness research, makes traditional keyword-based search techniques ineffectual. I2E can provide the starting point for efficiently performing evidence-based systematic reviews over very large sets of scientific literature, enabling researchers to answer questions such as:

  • What is the economic burden of disease within the healthcare system? Across states, and globally?
  • Does XYZ new intervention merit funding? What are the economic implications of its use?
  • How do the incremental costs compare with the anticipated benefits for specific patient groups?
  • How does treatment XYZ affect quality of life? Activities of daily living? Health status indicators? Patient satisfaction?

A recent project looking at the economics of obesity used I2E to search all 23 million abstracts in Medline for research on the incidence of comorbid diseases, with associated information on patient cohort, geographic location, severity of disease, and associated costs (e.g. hospitalisation cost, treatment, etc.). From the I2E output, more advanced visual analytics can be carried out. For example, the pie chart shows the prevalence of the various comorbid diseases (from 2013 Medline abstracts with both HEOR terms, obesity and a comorbid disease), showing the high frequency of hypertension and various other cardiovascular diseases. Another view of the same extracted intelligence shows the geographic spread of health economics and obesity research, with a predominance across northern America, but also data from China and Brazil, for example.
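The kind of tally behind such a chart can be sketched in a few lines: count, across abstracts mentioning obesity, how often each comorbid disease is also mentioned. The disease list and abstracts below are invented placeholders, not the project’s data:

```python
# Toy sketch of comorbidity counting over abstracts that mention obesity.
from collections import Counter

comorbidities = ["hypertension", "type 2 diabetes", "stroke", "sleep apnea"]

# Placeholder abstracts (the real project searched 23 million MEDLINE records).
abstracts = [
    "obesity and hypertension: hospitalization costs in the US",
    "economic burden of obesity with type 2 diabetes",
    "obesity, hypertension and stroke risk in older adults",
]

counts = Counter()
for text in abstracts:
    if "obesity" in text:
        for disease in comorbidities:
            if disease in text:
                counts[disease] += 1

print(counts.most_common())
```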

Prevalence of cardiovascular co-morbid diseases



Geographic view of HEOR research, mined from Medline from 2013


If you are interested in getting a better understanding of the power of advanced text analytics for HEOR, please contact us.

How to protect and develop your enterprise search investment with text analytics

June 24, 2014

It’s funny, isn’t it? Search at home just works. You’re looking for a holiday, train times, a particular recipe or the answer to your kid’s homework. You sit down and type your keyword/s into your search engine. Milliseconds later, results appear – the one you’re looking for is usually one of the first ones – you click on it and voila! You have what you were looking for.

But search at work doesn’t seem to be as effective. Maybe you are looking for information internally. You know it exists but you’re not quite sure where. The information lies across silos and it’s a mix of structured and unstructured. As a scientist it’s important for you to easily find information hidden in memos, project plans, meeting minutes, study reports, literature etc. You type a keyword search in your enterprise search engine. A list of documents comes back but none of them look like the one you want. You feel like you’re wasting your time. Sound familiar?

You’re not alone. At least, that is what recent surveys and conferences on enterprise search have revealed. According to a recent report from Findwise, 64% of organizations say it’s difficult to find information within their organization. Why?

  • Poor search functionality
  • Inconsistencies in how information is tagged
  • People don’t know where to look or what to look for

So how can we address this? Well, there’s already been talk of using text analytics to improve enterprise search. Text analytics, also referred to as text mining, allows users to go beyond keyword search to interpret the meaning of text in documents. While text analytics solutions have existed for some years now, more recently they’ve been working in harmony with enterprise search to improve the quality of results and make information more discoverable.

Let me give you an example. For over 10 years, Linguamatics I2E has been mining data and content such as scientific literature, patents, clinical trials data, news feeds, electronic health records, social media and proprietary content, working with 17 of the top 20 pharmaceutical companies to improve and power their knowledge discovery. Meanwhile, organizations have been deploying enterprise search engines to search internally.

Having been dissatisfied with their search solution and familiar with using I2E in other areas, a top 20 pharma wanted to see if the power of I2E’s text analytics could be applied to their enterprise search system. A proof of concept was proposed using Microsoft SharePoint. The organization did some internal requirements gathering and worked with both Microsoft and Linguamatics to come up with a solution to improve their search. I2E worked in the background, using its natural language processing technology to identify concepts and mark up semantic entities such as genes, drugs, diseases, organizations, anatomy, authors and other relevant concepts and relationships. Once annotated, taxonomies/thesauri were built and the marked-up documents were fed back into SharePoint.

To the users, the search interface remained the same, but there was a difference in the results. I2E was able to provide semantic facets for the search engine, allowing users to quickly filter the results to what they were looking for. The facets were concepts rather than words, letting users filter results to a more intuitive set of things, e.g. just show me the results for ‘breast cancer’ as a concept. This would also include all results containing variations of that concept, e.g. breast carcinoma, breast tumor, cancer of the breast, etc. In addition, I2E provided SharePoint with the ability to autocomplete terms as the user typed, and when performing the search, SharePoint was taught to look for synonyms of the words typed in.
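The synonym expansion described here can be pictured as a simple lookup from concept to variants before the query hits the search engine. The variant list below is illustrative, not the actual vocabulary used:

```python
# Illustrative concept-to-variants table for query expansion.
CONCEPT_VARIANTS = {
    "breast cancer": [
        "breast cancer",
        "breast carcinoma",
        "breast tumor",
        "cancer of the breast",
    ],
}

def expand_query(term):
    """Return an OR query covering every known variant of the concept."""
    variants = CONCEPT_VARIANTS.get(term.lower(), [term])
    return " OR ".join(f'"{v}"' for v in variants)

print(expand_query("breast cancer"))
```

Searching with the expanded query is what makes the results behave as if the engine understood the concept rather than the literal keywords.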

The organization was incredibly happy with the improved search performance, citing as the main benefits improved efficiency, better-quality search results, and information that became more transparent and available, which stimulated innovation within the organization.

This is just the beginning. The capabilities of I2E could also be applied to other search engines and to other scenarios where search needs to be improved, increasing the return on the investment already made in the system, protecting and developing future investments, and improving usage and findability.

If you’d like to find out more, sign up for Linguamatics’ webinar or contact us for a demo.
