BioIT 2015: Bench to Bedside value from Data Science and Text Analytics

April 29, 2015

Last week’s BioIT World Expo kicked off with a great keynote from Philip Bourne (Associate Director for Data Science, National Institutes of Health), setting the scene on a theme that ran throughout the conference: how can we benefit from big data analytics, or data science, for pharma R&D and delivery into healthcare? With 2 days of talks, 12 tracks covering cloud computing, NGS analytics, Pharmaceutical R&D Informatics, Data Visualization & Exploration Tools, and Data Security, plus a full day of workshops beforehand and a busy exhibition hall, there was plenty to see, do, take in and discuss. I attended several talks on best practice in data science, by speakers from Merck, Roche, and BMS, and I was pleased to hear speakers mention text analytics, particularly natural language processing, as a key part of the overall data science solution.

All this, and in beautiful Boston, just after the Marathon; luckily for us, the weather warmed up and we were treated to a couple of lovely sunny days.  As one of the speakers in the Pharma R&D informatics track, I presented some of the Linguamatics use cases of text analytics across the drug discovery – development – delivery pipeline.  Our customers are getting value along this bench-to-bedside continuum, using text mining techniques for gene annotation, patent analytics, regulatory QA/QC, clinical trial matching for patients, and through into extracting key endpoints from pathology reports and EHRs for better patient care. If you missed the talk, we will be presenting the material at our webinar on 3rd June.

Boston at night

Accelerating Drug Approvals with better Regulatory QA

April 7, 2015

Submitting a drug approval package to the FDA, whether for an NDA, BLA or ANDA, is a costly process. The final amalgamation of different reports and documents into the overview document set can involve a huge amount of manual checking and cross-checking, from the subsidiary documents to the master. It is crucial to get the review process right. Any errors, and the FDA can send back the whole package, delaying the application. But the manual checking involved in the review process is tedious, slow, and error-prone.

A delayed application can also be costly. How much are we talking about? While not every drug is a blockbuster, these numbers are indicative of what you could be losing: the top 20 drugs in the United States accounted for $319.9 billion in sales in 2011, so a newly launched blockbuster could make around $2Bn in its first year on the market, or roughly $5.5M per day. If errors in the quality review hold up an NDA for even just a week, the delay could cost tens of millions of dollars in deferred sales.
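The per-day figure follows from simple division. As a rough sketch, assuming first-year sales are spread evenly over 365 days (a simplification that the text does not state explicitly), the cost of a delay can be estimated like this:

```python
# Back-of-the-envelope estimate of sales deferred by a delayed approval.
# Assumption (not from the post): first-year sales accrue evenly per day.
FIRST_YEAR_SALES_USD = 2_000_000_000  # "around $2Bn" from the text
DAYS_PER_YEAR = 365


def revenue_deferred(days_delayed, annual_sales=FIRST_YEAR_SALES_USD):
    """Return the sales (USD) deferred by a delay of `days_delayed` days."""
    return annual_sales / DAYS_PER_YEAR * days_delayed


per_day = revenue_deferred(1)
per_week = revenue_deferred(7)
print(f"per day:  ${per_day / 1e6:.1f}M")   # per day:  $5.5M
print(f"per week: ${per_week / 1e6:.1f}M")  # per week: $38.4M
```

Under that assumption, a one-week hold-up corresponds to roughly $38M of deferred first-year sales.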

So – how can text analytics improve this quality assurance process?  Linguamatics has worked with some of our top 20 pharma customers to develop an automated process to improve quality control of regulatory document submission. The process cross-checks MedDRA coding, references to tables, decimal place errors, and discrepancies between the summary document and source documents. This requires the use of advanced processing to extract information from tables in PDF documents as well as natural language processing to analyze the free text.

The errors that can be identified include:

  • Incorrect formatting: doubled period, incorrect number of decimal places, addition of percent sign
  • Incorrect calculation: number of patients divided by total number does not agree with percent term
  • Incorrect threshold: presence of row does not agree with table title
  • Text-Table inconsistency: numbers in the table do not agree with numbers in the accompanying text
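To make the formatting and calculation checks concrete, here is a minimal sketch of how a single table row could be validated. It is an illustration only, not the Linguamatics implementation; the function name `check_row` and the one-decimal-place convention are assumptions made for the example.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP


def check_row(n, total, reported, decimals=1):
    """Return a list of problems found in one 'n / total (percent)' table cell.

    `reported` is the percent value exactly as printed in the document.
    `total` is assumed to be non-zero.
    """
    problems = []
    value = reported.strip()
    if value.endswith("%"):
        problems.append("unexpected percent sign")
        value = value[:-1].strip()
    if ".." in value:
        problems.append("doubled period")
        value = value.replace("..", ".")
    try:
        pct = Decimal(value)
    except InvalidOperation:
        pct = None
    if pct is None or not pct.is_finite():
        problems.append("not a number")
        return problems
    # Number of digits after the decimal point as printed
    if -pct.as_tuple().exponent != decimals:
        problems.append("wrong number of decimal places")
    # Recompute the percentage from the counts and compare after rounding
    quantum = Decimal(1).scaleb(-decimals)
    expected = (Decimal(n) * 100 / Decimal(total)).quantize(
        quantum, rounding=ROUND_HALF_UP
    )
    if pct.quantize(quantum, rounding=ROUND_HALF_UP) != expected:
        problems.append(f"percent does not match {n}/{total} (expected {expected})")
    return problems


print(check_row(25, 200, "12.5"))   # []
print(check_row(25, 200, "21.5"))   # calculation mismatch
print(check_row(25, 200, "12..5"))  # doubled period
```

The threshold and text-table checks would need the table title and the surrounding prose as extra inputs, so they are left out of this sketch.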


Sample table and text highlighting, to show inconsistencies between data. The highlight colour makes it easy for the reviewer to rapidly assess where there are errors and what type they are, and then to correct them appropriately.

Using advanced text mining processing, we are able to identify inconsistencies within FDA submission documents, across tables and textual parts of the reports. Overall, we found that using automated text analysis for quality assurance of submission documents can save hours or even weeks of tedious manual checking, and potentially prevent a re-submission request, with potential savings of millions of dollars.

This work was presented by Jim Dixon, Linguamatics, at the Pharmaceutical Users Software Exchange Computational Science Symposium in March 2015.



Picking your brain: Synergy of OMIM and PubMed in Understanding Gene-Disease Associations for Synapse Proteins

January 5, 2015

I read with interest a recent publication which sheds light on the complex interactions of synapse protein complexes with human disease. The study (run by the Genes to Cognition neuroscience research programme) combined wet-lab research with bioinformatics and text analytics to uncover genetic associations with these protein complexes in over seventy human brain diseases, including Alzheimer’s Disease, Schizophrenia and Autism spectrum disorders. The idea was to identify and develop suitable screening assays for synapse proteomes from post-mortem and neurosurgical brain samples, focusing specifically on Membrane-associated guanylate kinase (MAGUK) associated signalling complexes (MASC).

Our CTO, David Milward, was involved in the text analytics work. He used the natural language processing capabilities of the Linguamatics I2E platform to extract gene-mutation-disease associations from PubMed abstracts. The flexibility of I2E enabled an appropriate balance of recall and precision, providing comprehensive results accurate enough for fast curation, without overloading curators with noise. Queries were built using linguistic patterns to allow associations to be discovered between a list of several thousand relevant gene identifiers and appropriate MedDRA disease terms. These text-mined results were combined with data from Online Mendelian Inheritance in Man (OMIM) on human MASC genes and genetic disease associations.

In total, 143 gene-disease associations were found: 26 in both OMIM and extracted from PubMed abstracts via text-mining; 68 in OMIM alone; and 49 via text mining from PubMed alone.

I wanted to dig a little deeper into the data from the paper and the comparison of OMIM and PubMed. Supplementary Table 5 has information on the list of genes coding for MASC proteins and causing inherited diseases as described in the OMIM repository, or identified using text mining software as associated to disease. In total, 143 gene-disease associations were found (see Figure), but only 26 associations were found in both sources. This shows the synergistic value of combining data from these two sources, and the need for integration of multiple sources to get the fullest picture possible, for any particular gene-disease involvement.
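The counts quoted above can be turned into coverage figures with a few lines of arithmetic. This is only a worked check on the published numbers, with no data beyond the three counts:

```python
# Gene-disease association counts as reported in the comparison above
both = 26          # found in OMIM and by text mining PubMed
omim_only = 68     # found in OMIM alone
pubmed_only = 49   # found by text mining PubMed alone

total = both + omim_only + pubmed_only
omim_total = both + omim_only
pubmed_total = both + pubmed_only

print(f"total associations: {total}")                     # total associations: 143
print(f"OMIM coverage:      {omim_total / total:.1%}")    # OMIM coverage:      65.7%
print(f"PubMed coverage:    {pubmed_total / total:.1%}")  # PubMed coverage:    52.4%
print(f"found in both:      {both / total:.1%}")          # found in both:      18.2%
```

In other words, relying on OMIM alone would miss about a third of the associations, and relying on text-mined PubMed abstracts alone would miss nearly half.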

Ebola: Text analytics over patent sources for medicinal chemistry

November 13, 2014

The 2014 Ebola outbreak is officially the deadliest in history. Governments and organizations are searching for ways to halt the spread – both responding with humanitarian help, and looking for treatments to prevent or cure the viral infection.

CDC image of Ebola virus

Ebola virus disease (or Ebola haemorrhagic fever) is caused by the Ebola filovirus

A couple of weeks ago we received a tweet from Chris Southan, who has been looking at crowdsourcing anti-Ebola medicinal chemistry. He asked us to mine Ebola C07D patents (i.e. those for heterocyclic small molecules, the standard chemistry for most drugs) using our text analytics tool I2E, and provide him with the resulting chemical structures.

We wanted to help. What anti-Ebola research has been patented that might provide value to the scientific community? Searching patents for chemistry using an automated approach is notoriously tricky: patent documents are long and often purposefully obfuscated, with chemicals frequently obscured by the complex language used to describe them, corrupted by OCR errors, or lost in the overall poor formatting of the patents.

Andrew Hinton, one of our Application Specialists with a background in chemistry, used I2E to map the patent landscape around Ebola, identify patents for small molecules described to target Ebola, and extract the chemical structures. He compiled queries to answer the key questions and find those patents which were most relevant:

  • Does the patent mention Ebola or Ebola-like diseases? More importantly, is Ebola the major focus of the patent?
  • Who is the pharma or biotech company?
  • Is it a small molecule or non-small molecule patent?
  • What’s the exemplified chemistry? What’s the claimed chemistry? What’s the Markush chemistry?
  • What chemistry is found as an image? What chemistry is found in a table? Can we extract these structures too?

Andrew ran these queries over patents from USPTO, EPO and WIPO for the past 20 years on data derived from IFI CLAIMS.

Ebola patent trend over past decade

Graph showing C07D patents (blue, left-hand axis) and non-C07D patents (red line, right-hand axis) for Ebola related patents from 1994 to 2014, from the three major patent registries [please note different scales for the axes].

The results showed a general increase in the number of patent records related to Ebola, but the numbers are comparatively small – for example, there were about 50k C07D patents published in 2010 across all therapeutic areas; of these, we found that only about 100 patents related to Ebola (and the number of truly unique patent families is likely to be a smaller subset of that figure). This isn’t really that surprising; as with most viral diseases, the main emphasis for therapies has been on biologics and other non-small-molecule treatments – in fact, of the 16k total patents that mention Ebola, only 1% are C07D patents focused specifically on Ebola.

Ebola patents - ranked by organisations

Heatmap showing that the top 3 organizations with small molecule C07D patents in this area contribute 1/5 of all Ebola patents.

So what is the outcome of this? Using I2E, we have been able to extract the set of molecules reported in these Ebola-related patents, and will provide a set of these to Chris Southan for his chemoinformatics analysis. Let’s hope that this little step might further research towards providing a solution to the current Ebola epidemic.

Ebola patents - chemistry table

Screenshot of I2E results showing names and structures extracted using text analytics from Ebola-related patent set. Structures are generated using ChemAxon’s Name-to-Structure conversion tools (

Mining big data for key insights into healthcare, life sciences and social media at the Linguamatics San Francisco Seminar

September 3, 2014

Natural Language Processing (NLP) and text analytics experts from pharmaceutical and biotech companies, healthcare providers and payers gathered together to discuss the latest industry trends and to hear the product news and case studies from Linguamatics on August 26th.

The keynote presentation from Dr Gabriel Escobar was the highlight of the event, covering a rehospitalization prediction project that the Kaiser Permanente Department of Research has been working on in collaboration with Linguamatics. The predictive model has been developed using a cohort of approximately 400,000 patients and combines scores from structured clinical data with factors derived from unstructured data using I2E. Key factors that could affect a patient’s likelihood of rehospitalization are trapped in text; these include ambulatory status, social support network and functional status. I2E queries enabled KP to extract these factors and use them to indicate the accuracy of the structured data’s predictive score.

Leading the use of I2E in healthcare, Linguamatics exemplified how cancer centers are working together to develop queries for pathology reports, mining medical literature and predicting pneumonia from radiology reports. They also demonstrated a prototype application to match patients to clinical trials and a cohort selection tool using semantic tagging of patient narratives in the Apache Solr search engine.

Semantic enrichment was discussed in the context of life sciences using SharePoint as the search engine. This drew great interest from the many life science companies in the audience, who understand the need to improve searching of internal scientific data. This discussion highlighted the challenges of getting a consistent view of internal and external data. The latest version of I2E will address this challenge with a new federated capability that provides merged results sets from internal and external searches. These new I2E capabilities have strong potential to improve insight and they also incorporate a model that allows content providers to more actively support text mining.

Another hot topic was mining social media and news outlets for competitive intelligence and insights into company and product sentiment. The mass of information now available from social media requires a careful strategy of multiple levels of filtering; this enables extraction of the relevant data from the millions of tweets and postings that occur daily. Once the relevant posts have been identified, the data can be text mined, but users need to factor in support for abbreviations and shorter linguistic syntax. Mining social media and news outlets is an area that will continue to grow and require active support.

Linguamatics were grateful for such an engaged and interactive audience and look forward to future discussions on these exciting trends. Keep an eye out for information about our upcoming Text Mining Summit.

Pharmacogenomics across the drug discovery pipeline – are we nearly there?

August 20, 2014

Since the human genome was published in 2001, we have been talking about the potential application of this knowledge to personalized medicine, and in the last couple of years, we seem at last to be approaching this goal.

A better understanding of the molecular basis of diseases is key to development of personalized medicine across pharmaceutical R&D, as was discussed last year by Janet Woodcock, Director of the FDA’s Center for Drug Evaluation and Research (CDER). FDA CDER has been urging adoption of pharmacogenomics strategies and pursuit of targeted therapies for a variety of reasons. These include the potential for decreasing the variability of response, improving safety, and increasing the size of treatment effect, by stratifying patient populations.

Pharmacogenomics is the study of the role an individual’s genome plays in drug response, which can vary from adverse drug reactions to lack of therapeutic efficacy. With the recent explosion in sequence data from next generation sequencing (NGS) technologies, one of the bottlenecks in applying genomic variation data to understanding disease is access to annotation. From NGS workflows, scientists can quickly identify long lists of candidate genes that differ between two conditions (case-control, or family hierarchies, for example). Gene annotations are essential to interpret these gene lists and to discover fundamental properties like gene function and disease relevance.

Key sources for these annotations include the ever-growing biomedical literature. Some of this is captured in structured databases (such as COSMIC, GAD, DGA), but much valuable information is in textual sources such as PubMed Central, MEDLINE, and OMIM. Extracting actionable insight rapidly and accurately from text documents is greatly helped by advanced text analytics, and users of our I2E text analytics solution have been asking for access to OMIM, which will soon be available in our OnDemand portfolio.

In particular, there are two common use cases that I2E users want to address with enhanced text analytics over the OMIM data:

One use case comes from the clinical side of drug discovery-development; the clinicians provide information on a particular case or phenotype, and I2E is used to extract from OMIM the potential genes that might be relevant to sequence from the clinical samples to see if there is involvement in the disease pathway.

The other use case comes from early on in the drug discovery-development pipeline, at the initial stages of a project, where for a new disease area I2E is used to pull out a set of potential targets from OMIM. Obviously, before any lab work starts, more in-depth research is needed, but this provides an excellent seed for entry into a new therapeutic area.

Utilizing I2E to access OMIM brings benefits such as:

  • High quality results from this manually curated, fact-dense data source, compared to, for example, querying original articles in peer-reviewed literature
  • The use of our domain-specific ontologies (e.g. for diseases, genes, mutations and other gene variants) enables high recall compared to searching via the OMIM interface (for example, using ontologies to search for “liver cancer”, and being able to also find records with annotations for “liver neoplasm”, “hepatic cancer”, “cancer of the liver”, etc)
  • Clustering of various synonyms and expressions from the use of Preferred Terms (PT) such as Gene Symbols
  • The ability to build in-depth queries, such as extraction of gene-gene interactions, and to hit a wide variety of concepts and synonyms, for example many different ways in which gene/protein mutations may be named (see figure legend)

Network of Disease-Gene-Mutation relationships from I2E results in Cytoscape

The image shows a network of Disease – Gene – Mutation relationships from I2E results in Cytoscape. I2E was used to extract gene (green squares) and mutation (circles) information for stroke (central red triangle), showing some overlap of gene interactions with a related disease, cerebral infarction. Utilising the Linguamatics Mutation resource enables easy extraction of precise information patterns (e.g. “an ACTA2 mutation”; “proline for serine at codon 116”; “4895A/G”; “a 4-bp deletion”; “Q193Sfs*12”; “a 377A-T transversion”), which would be hugely time-consuming to do by manual curation.
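To give a feel for why so many surface forms are needed, here is a deliberately small sketch that recognises three of the mutation styles quoted in the figure legend using regular expressions. It is not how the Linguamatics Mutation resource works; the pattern names and the `find_mutations` helper are assumptions for illustration, and real mutation nomenclature is far more varied than these patterns cover.

```python
import re

# Illustrative patterns only; each covers one style from the figure legend.
MUTATION_PATTERNS = {
    # e.g. "4895A/G" or "377A-T"
    "nucleotide_substitution": re.compile(r"\b\d+[ACGT][/-][ACGT]\b"),
    # e.g. "Q193Sfs*12"
    "frameshift": re.compile(r"\b[A-Z]\d+[A-Z]fs\*\d+"),
    # e.g. "4-bp deletion"
    "indel": re.compile(r"\b\d+-bp (?:deletion|insertion)\b"),
}


def find_mutations(text):
    """Return (pattern_name, matched_text) pairs, grouped by pattern."""
    hits = []
    for label, pattern in MUTATION_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits


sample = "a 377A-T transversion, 4895A/G, a 4-bp deletion and Q193Sfs*12"
for label, mention in find_mutations(sample):
    print(f"{label}: {mention}")
```

Descriptive forms such as “proline for serine at codon 116” need linguistic patterns rather than character-level regular expressions, which is where an NLP-based approach earns its keep.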

Combining OMIM access with extraction of genotype-phenotype relationships from MEDLINE, PubMed Central and will give I2E users an excellent resource for NGS annotation, target discovery, and clinical genomics, in order to better target the molecular basis of disease.

If you are interested in accessing OMIM using the power of I2E (or any other of the current content options on I2E OnDemand i.e. MEDLINE,, FDA Drug Labels, NIH Grants, PubMed Central Open Subset, and Patents) , please get in touch and we can provide more information and keep you updated on our progress.





Retrieving Properties of a Published Query

February 27, 2014

Although I2E Queries and Multi Queries are binary objects, the I2E Web Services API provides an interface to a subset of the properties of those items, including some that can be modified when running a query programmatically.

Query properties that are read-only and that can be retrieved using the API include title, creator, comments and column headers. Query properties that can be modified before query submissions include number of hits, time limit and smart query parameters.

I2E has two related query resources: Saved Queries (which represent the binary files on disk, stored in the Repository) and Published Queries (which represent the Published location of the Saved Queries). To ensure that Users have permissions to see Query Properties, it is recommended that you only expose access to Published Queries.

Retrieving (by GET) a Published Query provides a “handle” to the Saved Query:

HTTP Header = X-Version: *, Accept: application/json
Success 200
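As a minimal sketch, the request above could be constructed with Python's standard library as follows. Only the two headers come from this post; the host name and the resource path are hypothetical placeholders, and the request is built but not sent.

```python
import urllib.request

# Hypothetical placeholders: substitute your own I2E server and the
# Published Query location. Neither value is taken from the post.
BASE_URL = "https://i2e.example.com"
PUBLISHED_QUERY_PATH = "/path/to/published/query"


def build_published_query_request(base_url, path):
    """Build (but do not send) a GET request for a Published Query."""
    return urllib.request.Request(
        base_url + path,
        headers={
            "X-Version": "*",              # header given in the post
            "Accept": "application/json",  # ask for a JSON response
        },
        method="GET",
    )


request = build_published_query_request(BASE_URL, PUBLISHED_QUERY_PATH)
print(request.get_method(), request.full_url)

# To send it, something along these lines should work; a successful
# call is expected to return status 200:
#   with urllib.request.urlopen(request) as response:
#       print(response.status)
```

Authentication is omitted here; a real server will almost certainly require it.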
