NLP & Big Data Symposium in San Francisco

June 29, 2015

Life sciences and healthcare professionals gathered at the UCSF Mission Bay campus for the West Coast Natural Language Processing (NLP) & Big Data Symposium on June 18th. The symposium, co-hosted by UCSF, featured presenters from UCSF, Merck, City of Hope, Copyright Clearance Center and Linguamatics and delegates from a diverse range of organizations.

The central theme of this year’s symposium was “From bench-to-bedside, unlocking key insights from your data”. Healthcare delegates were keen to find new ways to address meaningful use and accountable care leveraging NLP text mining of electronic health records. Life sciences delegates were keen to increase the efficiency and effectiveness of their business operations by mining real world data. There was also a strong interest in forging partnership opportunities between pharma/biotech and hospitals/cancer centers.

Sorena Nadaf, the CIO and Director of Translational Informatics at UCSF Helen Diller Family Comprehensive Cancer Center, delivered the welcome address, highlighting the foundations of clinical NLP and its common uses for extracting and transforming narrative information in EMRs to support and accelerate clinical research.

Sorena Nadaf at the NLP & Big Data Symposium in San Francisco.


Wendy Cornell, retired from Merck, described Merck’s development of a natural language processing (NLP) workflow that extracts conclusions and interpretations from the company’s large corpus of internal reports using the Linguamatics I2E software, with the extracted data integrated and analyzed on the ANZO platform from Cambridge Semantics. The primary use case, automated extraction of conclusions and interpretations from internal preclinical safety reports, generated a lot of interest and discussion.

Joyce Niland, Chief Research Information Officer, and Rebecca Ottesen, Biostatistician, from City of Hope (COH) presented a recent project with Linguamatics in which they created a disease registry using Iterative Interactive Enrichment (IIE) of NLP queries shared across institutions. The I2E queries, initially written by the Huntsman Cancer Institute (HCI) and Linguamatics, identify immunohistochemistry (IHC) marker results in unstructured pathology dictations for patients with malignant Non-Hodgkin’s Lymphoma. The queries were shared with COH to assess how well they transfer from one institution to another. Linguamatics, COH, and HCI applied the IIE process over several phases to improve the IHC queries, sharing the improvements between institutions. Precision and recall were measured at each phase to assess the completeness and accuracy of information extraction, and to identify the NLP features with the greatest impact on these results. Final F-scores for COH and HCI were 0.91 and 0.94, respectively. This impressive level of precision and recall across two institutions validates the Linguamatics approach of sharing its wealth of existing healthcare queries with I2E customers to help accelerate research and improve patient outcomes.
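
For readers less familiar with these metrics, here is a minimal Python sketch of how precision, recall and an F-score relate. It assumes the balanced F1 form (the talk summary does not say which F-score variant was used), and the counts in the example are invented for illustration; they are not figures from the COH or HCI evaluation.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw extraction counts."""
    found = true_pos + false_pos      # everything the query returned
    relevant = true_pos + false_neg   # everything it should have returned
    precision = true_pos / found if found else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the balanced F1 (an assumption here)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for one query phase (not from the study):
# 90 correct IHC results, 10 spurious hits, 8 results missed.
p, r = precision_recall(true_pos=90, false_pos=10, false_neg=8)
print(f"precision={p:.3f} recall={r:.3f} F1={f_score(p, r):.3f}")
```

Tracking these three numbers per phase is one simple way to see whether a query change traded precision for recall or improved both.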

Chris Hilbert from Copyright Clearance Center presented CCC’s new RightFind XML for Mining service, its integration with Linguamatics’ I2E, and how the combined solution improves the results of text and data mining queries and mitigates infringement risk. Chris demonstrated how customers can obtain and index full-text XML articles from multiple scientific publishers in I2E and avoid many of the data format and licensing issues associated with working with PDFs. As existing licenced literature does not have to be repurchased, delegates saw this service as a highly effective way of leveraging existing full-text investments and extracting more value via I2E text mining.

To complement our customer and partner presentations, Linguamatics led presentations including an introduction to NLP text mining; healthcare NLP strategies to improve patient care, reduce costs and enhance population health; and Real World Data and text analytics.

It was wonderful to catch up with many of our customers, meet some new ones and help foster introductions and discussions between the various delegates. Keep an eye on our events page for upcoming opportunities to meet with Linguamatics, including our Princeton seminar on July 16 and the Text Mining Summit and I2E Healthcare Hackathon in October.


Who’s Doing What with Linguamatics I2E?

June 26, 2015

Over the past few months, several publications have used Linguamatics I2E to extract key information in a variety of different projects. We are constantly amazed by the inventiveness of our users in applying text analytics across the bench-to-bedside continuum, and these publications are no exception. Using the natural language processing power of I2E, researchers can answer their questions rapidly and extract the results they need with high precision and good recall. A standard keyword search, by comparison, returns a set of documents that they then need to read.

Let’s start with Hornbeck et al., “PhosphoSitePlus, 2014: mutations, PTMs and recalibrations”. PhosphoSitePlus is an online systems biology resource for the study of protein post-translational modifications (PTMs) including phosphorylation, ubiquitination, acetylation and methylation. It’s provided by Cell Signaling Technology who have been users of I2E for several years. In the paper, they describe the value from integrating data on protein modifications from high-throughput mass spectrometry studies, with high-quality data from manual curation of published low-throughput (LTP) scientific literature.

The authors say: “The use of I2E, a powerful natural language processing software application, to identify articles and highlight information for manual curation, has significantly increased the efficiency and throughput of our LTP curation efforts, and made recuration of selected information a realistic and less time-consuming option.” The CST scientists rescore and recurate PTM assignments, reviewing for coherence and reliability – and use of these data can “provide actionable insights into the interplay between disease mutations and cell signalling mechanisms.”

A paper by a group from Roche, Zhang et al., “Pathway reporter genes define molecular phenotypes of human cells” described a new approach to understanding the effect of diseases or drugs on biological systems, by looking for “molecular phenotypes”, or fingerprints, patterns of differential gene expression triggered by a change to the cell. Here, text analytics played a small role in the project, which was (along with other tools) to compile a panel of over 900 human pathway reporter genes – representing 154 human signalling and metabolic networks. These were then used to gain understanding of cardiomyocyte development (relevant to diabetic cardiomyopathy) and assessment of common toxicity mechanisms (relevant to the mechanistic basis of adverse drug events).

The last one I wanted to highlight moves away from the realms of genes and cells, into analysis of co-prescription trends and drug-drug interactions (Sutherland et al., “Co-prescription trends in a large cohort of subjects predict substantial drug-drug interactions”). Better understanding of drug-drug interactions is of increasing importance for good healthcare delivery, as more and more patients, particularly the elderly, routinely take multiple medications, and the huge number of potential combinations prohibits safety testing of every combination in clinical trials.

Over 30% of older adults take 5 or more drugs. Text analytics can extract clinical knowledge about potential drug-drug interactions.


In this study, the authors used prescription data from NHANES surveys to find what drugs or drug classes were most routinely prescribed together; and then used I2E to search MEDLINE for a set of 133 co-prescribed drugs to assess the availability of clinical knowledge about potential drug-drug interactions. The authors found that over 30% of older adults take 5 or more drugs – but these combinations were pretty much unique. Of the co-prescribed pairs, a large percentage were not mentioned together in any MEDLINE record – demonstrating a need for further study. The authors conclude that these data show “that personalized medicine is indeed the norm, as patients taking several medications are experiencing unique pharmacotherapy” – and yet there is little published research on either efficacy or safety of these combinations.
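
The co-mention check at the heart of this analysis can be sketched in a few lines of Python. This is only an illustration of the idea: plain case-insensitive word matching stands in for the I2E queries the authors actually ran, and the drug names and record texts below are toy placeholders, not data from the paper or from MEDLINE.

```python
import re
from itertools import combinations


def mentions(text, term):
    """True if `term` appears in `text` as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def unmentioned_pairs(drugs, records):
    """Return drug pairs that never appear together in any single record."""
    missing = []
    for a, b in combinations(sorted(drugs), 2):
        if not any(mentions(rec, a) and mentions(rec, b) for rec in records):
            missing.append((a, b))
    return missing


# Toy inputs (hypothetical): three drugs and two invented abstracts.
drugs = ["aspirin", "warfarin", "metformin"]
records = [
    "Patients in this cohort received warfarin and aspirin.",
    "Metformin was the most commonly prescribed agent.",
]
for pair in unmentioned_pairs(drugs, records):
    print("no co-mention found for", pair)
```

A real analysis would also need synonym and brand-name handling, which is where an ontology-aware NLP tool earns its keep over a simple string match like this one.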

What do these three studies have in common? The use of text analytics, not as the only tool or necessarily even the major tool, but as part of an integrated analysis of data, to answer focused and specific questions, whether those questions relate to protein modification, molecular patterns of genes in pathways or drug interactions and potential adverse events. And I wonder, where will Linguamatics I2E be used next?

Mining Nuggets from Safety Silos

May 26, 2015

Better access to the high value information in legacy safety reports has been, for many folk in pharma safety assessment, a “holy grail”. Locked away in these historical data are answers to questions such as:  Has this particular organ toxicity been seen before? In what species, and with what chemistry? Could new biomarker or imaging studies predict the toxicity earlier? What compounds could be leveraged to help build capabilities?

I2E enables extraction and integration of historical preclinical safety information, crucial to optimizing investment in R&D, alleviating concerns where preclinical observations are not expected to be human-relevant, and reducing late stage failures.


Coming as I do from a decade of working in data informatics for safety/tox prediction, I was excited by one of the talks at the recent Linguamatics Spring Users Conference. Wendy Cornell (ex-Merck) presented on an ambitious project to use Linguamatics’ text mining platform, I2E, in a workflow to extract high value information from safety assessment reports stored in Documentum.

Access to historic safety data is a potential advantage that will be helped with the use of standards in electronic data submission for regulatory studies (e.g. CDISC’s SEND, the standard for exchange of non-clinical data).

Standardizing the formats and vocabularies for key domains in safety studies will enable these data to be fed into searchable databases; however these structured data miss the intellectual content added by the pathologists and toxicologists, whose conclusions are essential for understanding whether evidence of a particular tox finding (e.g. hyalinosis, single cell necrosis, blood enzyme elevations) signals a potential serious problem in humans or is specific to the animal model.

For these key conclusions, access to the full study reports is essential.

At Merck, Wendy’s Proprietary Information and Knowledge Management group, in collaboration with the Safety Assessment and Laboratory Animal Resources (SALAR) group, developed an I2E workflow that extracted the key findings from safety assessment ante- and post-mortem reports, final reports, and protocols, in particular pulling out:

  • Study annotation (species, study duration, compound, target, dosage)
  • Interpreted results sections (i.e. summary or conclusions sections)
  • Organ-specific toxicology and histopathology findings
  • Haematological and serum biochemistry findings

In addition, a separate arm in the workflow leveraged the ABBYY OCR software to extract toxicokinetic (TK) parameters such as area under the curve (AUC), maximum drug concentration (Cmax), and time after dosing of peak drug plasma exposure (Tmax) from PDF versions of the documents.

The extracted and normalized information was loaded into a semantic knowledgebase in the Cambridge Semantics ANZO tool and searched and visualized using a tailored ANZO dashboard. This faceted browsing environment enabled the SALAR researchers to ask questions such as, “what compounds with rat 3-month studies show kidney effects, and for these compounds, what long term studies do we have?”

Wendy presented several use cases showing real value of this system to the business, including the potential to influence regulatory guidelines. For example, the team were able to run an analysis to assess the correlation between 3-month sub-chronic non-rodent studies, and 9- or 12-month chronic non-rodent results; they found that in nearly 30% of cases an important new toxicologic finding was identified in the long-term studies, confirming the ongoing need for long-term studies.

Wendy stated, “This unstructured information represents a rich body of knowledge which, in aggregate, has potential to identify capability gaps and evaluate individual findings on active pipeline compounds in the context of broad historical data.”

With the current focus on refinement, replacement and reduction of animal studies, being able to identify when long-term studies are needed vs. when they are not essential for human risk assessment, will be hugely valuable; and extracting these nuggets of information from historic data will contribute to this understanding.

Expert interpretations and conclusions from thousands of past studies can potentially be converted into actionable knowledge. These findings exist as unstructured text in Safety documents. See Wendy Cornell speak on this, at our upcoming NLP and Big Data Symposium in San Francisco.

II-SDV 2015: The International Information Conference on Search, Data Mining and Visualization

May 12, 2015

The recent two day II-SDV meeting in the beautiful town of Nice on the Côte d’Azur, France, started with a day of talks considering the question of how to best maximise the value of data extracted from a wide range of sources: patents, full text articles and even big data.

The programme kicked off with a presentation from Aleksander Kapisoda from Boehringer Ingelheim (BI) describing how innovative use of custom search techniques beyond that currently offered by standard public search machines can bring tangible benefits to a global pharmaceutical company.

One theme that emerged was the potential use of text mining particularly in constructing landscapes related to emerging technologies. Jane List (Extract information UK) described some of the tools, workflows, and visualisations for patent landscaping, with a great quote from Marcel Proust: “The real voyage of discovery consists not in seeking new landscapes, but in having new eyes”. Emmanuelle Fortune (INIP, France) discussed the ability to classify world cities dubbed “Smart Cities” as hubs for technological development directly from mining the patent literature.

Staying on the topic of text mining, I presented a number of use cases related to the subject of “time” and pressed home the message that text mining can provide clear advantages for access to timely information. This presentation was followed by news from the Copyright Clearance Center (CCC) that the difficult process of obtaining legal permission for text mining has recently become a lot easier, with the ability to directly create a set of full text documents ready for immediate use in text mining. This has long been a goal for many information scientists, as there are valuable nuggets of information in full text that just can’t be gained from mining abstracts.

Finally, the conference heard from a different group within Boehringer Ingelheim, concerned with automating the currently time-consuming process of extracting medicinally relevant chemistry from patents. Matthias Negri, collaborating with technology partners Chemaxon, has established a Knime™ workflow that makes use of Linguamatics I2E to extract the pharmacological context surrounding the chemistry described within a patent, to provide “a solid information base of value to any phase of a drug discovery project”.

With participants from both US and Europe, the conference provided a great opportunity to meet information specialists, patent experts and scientists from across life science specialities, as well as hearing from vendors about their new product developments. If you are interested in any of the topics discussed, more information can be found at the conference website, and I’d be happy to hear your comments.

Andrew Hinton, Application Specialist, Linguamatics

BioIT 2015: Bench to Bedside value from Data Science and Text Analytics

April 29, 2015

Last week’s BioIT World Expo kicked off with a great keynote from Philip Bourne (Associate Director for Data Science, National Institutes of Health), setting the scene on a theme that ran throughout the conference: how can we benefit from big data analytics, or data science, for pharma R&D and delivery into healthcare? With 2 days of talks, 12 tracks covering cloud computing, NGS analytics, Pharmaceutical R&D Informatics, Data Visualization & Exploration Tools, and Data Security, plus a full day of workshops beforehand, and a busy exhibition hall, there was plenty to see, do, take in and discuss. I attended several talks on best practice in data science, by speakers from Merck, Roche, and BMS, and I was pleased to hear speakers mention text analytics, particularly natural language processing, as a key part of the overall data science solution.

All this, and in beautiful Boston, just after the Marathon; luckily for us, the weather warmed up and we were treated to a couple of lovely sunny days.  As one of the speakers in the Pharma R&D informatics track, I presented some of the Linguamatics use cases of text analytics across the drug discovery – development – delivery pipeline.  Our customers are getting value along this bench-to-bedside continuum, using text mining techniques for gene annotation, patent analytics, regulatory QA/QC, clinical trial matching for patients, and through into extracting key endpoints from pathology reports and EHRs for better patient care. If you missed the talk, we will be presenting the material at our webinar on 3rd June.

Boston at night


Linguamatics I2E users lead the way in text mining for patents, safety and more at this year’s Spring Users Conference

April 28, 2015

We are always amazed and impressed at the inventiveness of Linguamatics customers, in their applications of text analytics to address their information challenges. Our annual Linguamatics Spring Users Conference showcased some examples of their innovation, with presentations on text mining used for patent analytics, chemical pharmacokinetics and pharmacodynamics data extraction, creating value from legacy safety reports, and integrating open source tools for advanced entity recognition. We had a record-breaking number of attendees this year, representing over 20 organizations, ranging from our most experienced I2E users to text mining novices.

A record-breaking number of attendees enjoyed the opportunity to experience Cambridge and share insights with one another at this year's conference.


Patent analytics featured in two of the presentations, demonstrating the value of NLP in extracting critical information from dense and lengthy patent documents. Julia Heinrich (Senior Patent Analyst, Biotechnology at Bristol-Myers Squibb, Princeton, New Jersey) asked the question: “Can the infoglut of biotech patent publications be quickly reviewed to enable timely business decisions?” She admirably demonstrated that with smart use of I2E’s NLP queries, BMS have been able to search the patent body for information on antibody-drug conjugates and convert “unstructured data” into user-friendly, analysis-ready data sets. Thorsten Schweikardt (Senior Information Scientist, Boehringer Ingelheim) gave an overview of workflows developed using KNIME to create patent landscapes for specific disease areas, target identification, and discovery of tool compounds.

Wendy Cornell (former head of the Merck Proprietary Information and Knowledge Management Group), like Julia Heinrich, flew over from the US for the meeting. Wendy presented on the automated extraction of conclusions from internal preclinical safety reports using I2E. These internal safety assessment reports contain a wealth of historical data around safety and toxicity of developmental compounds, and many pharma organizations have sought ways to gain benefits from these valuable legacy documents. Wendy’s group developed a strategy to access Documentum-based safety assessment reports, and were able to pull out histopath findings, organ toxicities, haematological and blood biochemistry results, even pulling out toxicokinetic parameters from tabular sections. Three use cases were presented, showing significant business impact within the Safety Assessment organization.

Wendy Cornell details how she used I2E to create an NLP-driven workflow tapping into the large body of valuable knowledge located in structured and unstructured internal documents.


Linguamatics’ speakers gave an update on future innovations in the I2E roadmap, the new features in I2E 4.3, and the software’s applications in the life sciences and healthcare. Guy Singh showed how I2E 4.3’s Connected Data Technology allows users to exploit big data better no matter where the data are located (on premise, on the cloud), whatever structure they have, and doing this at speed, with digestible results. Phil Hastings gave a brief overview of Linguamatics I2E in Healthcare; and NLP Specialist James Cormack took us through Linguamatics’ approach and results for our submission to the i2b2 2014 Cardiac Risk Factors challenge. You can find out more about what we’re doing in healthcare via this short video.

We heard from a few of our partners in 5-minute lightning round presentations: IFI Claims Patent Services, ChemAxon, Copyright Clearance Center, Thomson Reuters and KNIME discussed their solutions and how they integrate with Linguamatics I2E.

In addition to the presentations, the Linguamatics Spring Users Conference provided opportunities for hands-on training, with workshops aimed at different levels of text mining experience. And of course, there was plenty of time for networking and idea sharing. Our evening events were hosted in the Old Combination Room at Corpus Christi College and the Pembroke College Old Library. We enjoyed beautiful, warm spring evenings at two of Cambridge University’s oldest colleges. One delegate remarked ‘It’s so nice to be shown hidden Cambridge treasures like these, which we would never know about if it wasn’t for the events at the Linguamatics conference.’

Evening social events at Cambridge University's historic colleges


The whole event was a great success that brought together the text mining community from across Europe (and across the pond!).

Presentations which have been approved to share are available on I2Edia and by email request.

Thanks to everyone who attended and contributed to the Linguamatics Spring Users Conference 2015. We look forward to seeing you in October at the Text Mining Summit in Newport, RI or in Cambridge, UK next year.

Accelerating Drug Approvals with better Regulatory QA

April 7, 2015

Submitting a drug approval package to the FDA, whether for an NDA, BLA or ANDA, is a costly process. The final amalgamation of different reports and documents into the overview document set can involve a huge amount of manual checking and cross-checking, from the subsidiary documents to the master. It is crucial to get the review process right. Any errors, and the FDA can send back the whole package, delaying the application. But the manual checking involved in the review process is tedious, slow, and error-prone.

A delayed application can also be costly. How much are we talking about? While not every drug is a blockbuster, these numbers are indicative of what you could be losing: the top 20 drugs in the United States accounted for $319.9 billion in sales in 2011, so a newly launched blockbuster could make around $2Bn in its first year on the market, which is between $5M and $6M per day. If errors in the quality review hold up an NDA for even just a week, the lost revenue could run to tens of millions of dollars.
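
The per-day figure is simple division, and it is easy to rerun with your own numbers. The short Python sketch below assumes sales are spread evenly over a 365-day year, which is a simplification (launch-year sales usually ramp up), and the function names are ours, introduced only for illustration.

```python
def daily_revenue(annual_sales, days_per_year=365):
    """Average revenue per day, assuming sales are spread evenly over the year."""
    return annual_sales / days_per_year


def delay_cost(annual_sales, delay_days, days_per_year=365):
    """Revenue forgone if launch slips by `delay_days` (same even-spread assumption)."""
    return daily_revenue(annual_sales, days_per_year) * delay_days


annual = 2_000_000_000  # the ~$2Bn first-year blockbuster figure used above
print(f"per day:  ${daily_revenue(annual) / 1e6:.2f}M")
print(f"per week: ${delay_cost(annual, 7) / 1e6:.1f}M")
```

Swapping in a more modest first-year forecast shows how quickly even a short delay adds up for a mid-sized product.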

So – how can text analytics improve this quality assurance process?  Linguamatics has worked with some of our top 20 pharma customers to develop an automated process to improve quality control of regulatory document submission. The process cross-checks MedDRA coding, references to tables, decimal place errors, and discrepancies between the summary document and source documents. This requires the use of advanced processing to extract information from tables in PDF documents as well as natural language processing to analyze the free text.

The errors that can be identified include:

  • Incorrect formatting: doubled period, incorrect number of decimal places, addition of percent sign
  • Incorrect calculation: number of patients divided by total number does not agree with percent term
  • Incorrect threshold: presence of row does not agree with table title
  • Text-Table inconsistency: numbers in the table do not agree with numbers in the accompanying text
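
To make the checks in this list concrete, here is a minimal Python sketch of three of them: formatting, calculation, and text-table consistency. It is an illustration only and not the I2E-based process itself. The function names, the one-decimal-place convention, the rule that percent cells carry no percent sign, and the sample values are all assumptions made for the example, and the cell text is assumed to be numeric once the formatting slips are cleaned up.

```python
import re


def decimal_places(value_text):
    """Number of digits after the decimal point in a numeric string."""
    _, _, fraction = value_text.partition(".")
    return len(fraction)


def check_cell(count, total, percent_text, expected_decimals=1):
    """Return a list of issues for one 'n (percent)' table cell."""
    issues = []
    cleaned = percent_text.strip()
    if ".." in cleaned:
        issues.append("doubled period")
        cleaned = re.sub(r"\.{2,}", ".", cleaned)
    if cleaned.endswith("%"):
        # Assumed convention: percent cells hold the bare number.
        issues.append("percent sign")
        cleaned = cleaned[:-1].strip()
    if decimal_places(cleaned) != expected_decimals:
        issues.append("decimal places")
    reported = float(cleaned)
    # Allow for rounding to the expected number of decimal places.
    tolerance = 0.5 * 10 ** (-expected_decimals) + 1e-9
    if abs(100.0 * count / total - reported) > tolerance:
        issues.append("calculation")
    return issues


def text_table_mismatches(sentence, table_values):
    """Numbers quoted in the text that do not appear in the table."""
    quoted = {float(n) for n in re.findall(r"\d+(?:\.\d+)?", sentence)}
    return sorted(quoted - set(table_values))


# Hypothetical row: 7 of 48 patients, reported as 14.6 percent.
print(check_cell(7, 48, "14.6"))
print(check_cell(7, 48, "14..6%"))
print(text_table_mismatches("Headache was reported by 7 patients (15.6%).",
                            {7.0, 48.0, 14.6}))
```

In practice the hard part is upstream of this: getting the counts, totals and percentages out of PDF tables reliably, and linking each sentence to the table it describes.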


Sample table and text highlighting, to show inconsistencies between data. The highlight colour makes it easy for the reviewer to rapidly assess where the errors are and what type they are, and then to correct them appropriately.


Using advanced text mining processing, we are able to identify inconsistencies within FDA submission documents, across tables and textual parts of the reports. Overall, we found that using automated text analysis for quality assurance of submission documents can save many hours, or even weeks, of tedious manual checking, and potentially prevent a re-submission request, with potential savings of millions of dollars.

This work was presented by Jim Dixon, Linguamatics, at the Pharmaceutical Users Software Exchange Computational Science Symposium in March 2015.


