Picking your brain: Synergy of OMIM and PubMed in Understanding Gene-Disease Associations for Synapse Proteins

January 5, 2015

I read with interest a recent publication which sheds light on the complex interactions of synapse protein complexes with human disease. The study (run by the Genes to Cognition neuroscience research programme) combined wet-lab research with bioinformatics and text analytics to uncover genetic associations with these protein complexes in over seventy human brain diseases, including Alzheimer’s Disease, Schizophrenia and Autism spectrum disorders. The idea was to identify and develop suitable screening assays for synapse proteomes from post-mortem and neurosurgical brain samples, focusing specifically on Membrane-associated guanylate kinase (MAGUK) associated signalling complexes (MASC).

Our CTO, David Milward, was involved in the text analytics work. He used the natural language processing capabilities of the Linguamatics I2E platform to extract gene-mutation-disease associations from PubMed abstracts. The flexibility of I2E enabled an appropriate balance of recall and precision, thus providing comprehensive results while not overloading curators with noise. Queries were built using linguistic patterns to allow associations to be discovered between a list of several thousand relevant gene identifiers and appropriate MedDRA disease terms. The key aim was to provide comprehensive results with suitable accuracy to allow fast curation. These text-mined results were combined with data from Online Mendelian Inheritance in Man (OMIM) on human MASC genes and genetic disease associations.

In total, 143 gene-disease associations were found: 26 in both OMIM and extracted from PubMed abstracts via text-mining; 68 in OMIM alone; and 49 via text mining from PubMed alone.

I wanted to dig a little deeper into the data from the paper and the comparison of OMIM and PubMed. Supplementary Table 5 lists the genes coding for MASC proteins that cause inherited diseases as described in the OMIM repository, or that were identified by text mining software as associated with disease. In total, 143 gene-disease associations were found (see Figure), but only 26 associations were found in both sources. This shows the synergistic value of combining data from these two sources, and the need to integrate multiple sources to get the fullest possible picture of any particular gene-disease involvement.
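
The overlap figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the three counts come from this post, while the helper `compare_sources` and the idea of representing each association as a (gene, disease) pair are assumptions, not the method used in the paper.

```python
# Counts reported in this post; the associations themselves are in the
# paper's Supplementary Table 5 and are not reproduced here.
both = 26          # found in OMIM and via text mining of PubMed
omim_only = 68     # found in OMIM alone
pubmed_only = 49   # found via text mining of PubMed alone

total = both + omim_only + pubmed_only   # all distinct associations
omim_total = both + omim_only            # everything OMIM contributed
pubmed_total = both + pubmed_only        # everything text mining contributed

# Share of all associations seen by both sources (Jaccard overlap).
jaccard = both / total


def compare_sources(omim, pubmed):
    """Split two sets of (gene, disease) pairs into shared and unique parts.

    Hypothetical helper: assumes associations are normalised to comparable
    identifiers before the sets are built.
    """
    return {
        "both": omim & pubmed,
        "omim_only": omim - pubmed,
        "pubmed_only": pubmed - omim,
    }


print(f"total={total}, OMIM={omim_total}, PubMed={pubmed_total}, overlap={jaccard:.1%}")
```

On these counts, fewer than one in five associations appears in both sources, which is the point about synergy made above.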


Advancing clinical trial R&D: industry’s most powerful NLP text analytics meets world-class curated life sciences content

December 19, 2014

What challenges were seen in competitive R&D and clinical stages? What outcomes were measured in related trials? Does the drug I am creating have potential efficacy or safety challenges? What does the patient population look like?

These are the sort of critical business questions that many life science researchers need to answer. And now, there’s a solution that can help you.

We all know the importance of high quality content you can depend on when it comes to making key business decisions across the pharma life cycle. We also know that the best way to get from textual data to new insights is using natural language processing-based text analytics. And that’s where our partnership with Thomson Reuters comes in. We’ve worked together on a solution to bring Linguamatics’ market-leading text mining platform, I2E, together with Thomson Reuters Cortellis high-quality clinical and epidemiology content: Cortellis Informatics Clinical Text Analytics for I2E.

Cortellis Informatics Clinical Text Analytics for I2E applies the power of natural language processing-based text mining from Linguamatics I2E to Cortellis clinical and epidemiology content sets. Taking this approach allows users to rapidly extract relevant information using the advanced search capabilities of I2E. The solution also allows users to identify concepts using a rich set of combined vocabularies from Thomson Reuters and Linguamatics.

Through one single interface users can quickly and easily gain access to new insights to support R&D, clinical development and clinical operations. This is the first time a cloud-based text mining service has been applied to commercial grade clinical and epidemiology content. The wide-ranging content set consists of global clinical trial reports, literature, press releases, conferences and epidemiology data in a secure, ready-to-use on-demand format.

Key features of the solution include:

  • High precision information extraction, using state-of-the-art text analytics, combined with high quality, hand-curated data
  • Search using a combination of Cortellis ontologies, plus domain specific and proprietary ontologies.
  • Find numeric information e.g. experimental assay results, patient numbers, trial outcome timepoints, financial values, dates.
  • Generate data tables to support you in your preclinical studies, trial design, and in understanding the impact of clinical trials.
  • Generate new hypotheses through identification of entity relationships in unstructured text e.g. assay and indication.

To find out more about how to save time and get better results from your clinical data searches, visit the Linguamatics website or contact us to gain access.

Ebola: Text analytics over patent sources for medicinal chemistry

November 13, 2014

The 2014 Ebola outbreak is officially the deadliest in history. Governments and organizations are searching for ways to halt the spread – both responding with humanitarian help, and looking for treatments to prevent or cure the viral infection.

CDC image of Ebola virus

Ebola virus disease (or Ebola haemorrhagic fever) is caused by the Ebola filovirus

A couple of weeks ago we received a tweet from Chris Southan, who has been looking at crowdsourcing anti-Ebola medicinal chemistry. He asked us to mine Ebola C07D patents (i.e. those for heterocyclic small molecules, the standard chemistry for most drugs) using our text analytics tool I2E, and provide him with the resulting chemical structures.

We wanted to help. What anti-Ebola research has been patented that might provide value to the scientific community? Searching patents for chemistry using an automated approach is notoriously tricky: patent documents are long and often purposefully obfuscated, with chemicals frequently obscured by the complex language used to describe them, corrupted by OCR errors, or lost in the overall poor formatting of the patents.

Andrew Hinton, one of our Application Specialists with a background in chemistry, used I2E to map the patent landscape around Ebola, identify patents for small molecules described to target Ebola, and extract the chemical structures. He compiled queries to answer the key questions and find those patents which were most relevant:

  • Does the patent mention Ebola or Ebola-like diseases? More importantly, is Ebola the major focus of the patent?
  • Who is the pharma or biotech company?
  • Is it a small molecule or non-small molecule patent?
  • What’s the exemplified chemistry? What’s the claimed chemistry? What’s the Markush chemistry?
  • What chemistry is found as an image? What chemistry is found in a table? Can we extract these structures too?

Andrew ran these queries over patents from USPTO, EPO and WIPO for the past 20 years on data derived from IFI CLAIMS.

Ebola patent trend over past decade

Graph showing C07D patents (blue, left-hand axis) and non-C07D patents (red line, right-hand axis) for Ebola related patents from 1994 to 2014, from the three major patent registries [please note different scales for the axes].

The results showed a general increase in the number of patent records related to Ebola, but the numbers are comparatively small. For example, there were about 50k C07D patents published in 2010 across all therapeutic areas; of these, we found that only about 100 patents related to Ebola (and the likely number of truly unique patent families is going to be a smaller subset of that figure). This isn’t really that surprising: as with most viral diseases, the main emphasis for therapies has been on biologics and other non-small-molecule treatments. In fact, of the 16k total patents that mention Ebola, only 1% are C07D patents focused specifically on Ebola.
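
To put those proportions in perspective, here is a quick back-of-the-envelope calculation. It uses only the rounded figures quoted above ("about 50k", "about 100", "16k", "1%"), so the outputs are rough estimates rather than exact patent counts; the variable names are ours.

```python
# Rounded figures quoted in the post; all values are approximate.
c07d_patents_2010 = 50_000    # C07D patents published in 2010, all areas
ebola_c07d_patents = 100      # of those, patents found to relate to Ebola
ebola_mentions_total = 16_000 # all patents that mention Ebola
focused_c07d_percent = 1      # percent that are Ebola-focused C07D patents

# Ebola-related share of C07D patents.
ebola_share = ebola_c07d_patents / c07d_patents_2010

# Approximate number of Ebola-focused C07D patents implied by the 1% figure.
focused_c07d_estimate = ebola_mentions_total * focused_c07d_percent // 100

print(f"Ebola share of C07D patents: {ebola_share:.1%}")
print(f"Ebola-focused C07D patents (estimate): {focused_c07d_estimate}")
```

In other words, Ebola accounts for roughly one in five hundred of those C07D patents, and the 1% figure corresponds to something on the order of 160 documents.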

Ebola patents - ranked by organisations

Heatmap showing that the top 3 organizations with small molecule C07D patents in this area contribute 1/5 of all Ebola patents.

So what is the outcome of this? Using I2E, we have been able to extract the set of molecules reported in these Ebola-related patents, and will provide a set of these to Chris Southan for his chemoinformatics analysis. Let’s hope that this little step might further research towards providing a solution to the current Ebola epidemic.

Ebola patents - chemistry table

Screenshot of I2E results showing names and structures extracted using text analytics from Ebola-related patent set. Structures are generated using ChemAxon’s Name-to-Structure conversion tools (http://www.chemaxon.com/products/naming/).

Discovering new uses for NLP text analytics at the Linguamatics Text Mining Summit

October 23, 2014

This year, Linguamatics returned to the beautiful town of Newport, Rhode Island, for our annual Text Mining Summit on October 13-15. We were delighted to return to this exquisite setting, where again delegates competed over who could take the most beautiful sunrise and sunset photos.

Sunrise at Newport, Linguamatics Text Mining Summit 2014

Sunrise outside the Hyatt Regency, Newport, RI, Linguamatics Text Mining Summit 2014

Sunset at Newport, Linguamatics Text Mining Summit 2014

Sunset captured at the Linguamatics Text Mining Summit 2014

The Text Mining Summit offers unique opportunities to learn about the latest use cases of Natural Language Processing (NLP) text analytics across pharma and healthcare, plus hands-on training, networking and idea sharing. We hosted a fantastic line-up of presenters from Novartis, Bristol-Myers Squibb, Georgetown University Medical Center, Spartanburg Regional Healthcare System, Boehringer Ingelheim, Pfizer, Cell Signaling Technology, Thomson Reuters, GenoSpace and Microsoft.

Mark Burfoot, Global Head, Knowledge Office for Novartis kicked off the proceedings on Tuesday with a keynote talk looking at the future of text mining and knowledge strategy within an evolving pharma landscape. Several presenters shared their use cases and experiences employing I2E. Jonathan Hartmann, Hospital Informationist, Georgetown University Medical Center, gave us an update on the use of I2E for on-the-spot clinical decision support; he said of his experience with text mining: “I2E allows me to find things I wouldn’t have been able to find.” Ryan Owens, Systems Analyst, Spartanburg Regional Healthcare, described the use of I2E to structure essential information from electronic medical records; he said what his organization was accomplishing would be impossible without the help of I2E: “You have to use NLP. There is no other way.”

Ryan Owens speaking on ‘Gaining critical information for clinical trials using NLP’

Other presentations included Jonathan Keeling, Senior Scientist, Boehringer Ingelheim, who described how researchers are linking genomics (and other ‘omics) data in TranSMART to gene annotation information extracted using I2E; and Mick Correll, Chief Operating Officer, GenoSpace, who discussed how I2E has dramatically improved the clinical trial matching process, finding trials for critically ill patients faster than otherwise possible.

Jonathan Keeling speaking on ‘Integration of text mining into the tranSMART knowledge management platform’

User presentations received a warm welcome from delegates, who took part in very lively Q&A sessions on the different applications of NLP in improving the speed and quality of insight extraction. Delegates were also inspired by Linguamatics speakers’ presentations, which shed light on the power of I2E in more industry-specific contexts. In particular, John Brimacombe (Linguamatics Executive Chairman) laid out the roadmap to “Connecting Knowledge” through federated text mining over any content source, which will enable users to text mine multiple public and private knowledge sources in a single query.

A prime example of the potential of federated text mining was given by Andrew Garrow, Text Analytics Manager at Thomson Reuters. He previewed the Cortellis Informatics Clinical Text Analytics for I2E – exciting new capabilities to gain insights in support of clinical design from Cortellis clinical and epidemiology content sets.

Guy Singh and Andrew Garrow presenting on Cortellis Informatics Clinical Text Analytics for I2E

The evening social events gave delegates a welcome opportunity to network and continue their discussions over the risotto bar at the Hyatt and fresh New England oysters, drinks, and a stunning waterfront view at the Landing restaurant.

The Text Mining Summit was once again a very successful event which brought together Linguamatics, our partners and our users across the healthcare, pharmaceutical and biotech industries to share dialogue on the past, present and future of text mining and data analytics. Many thanks to everyone who attended and contributed.

We very much look forward to seeing you at the TMS next year and also at our Spring Users Conference in Cambridge, UK (April 13-15, 2015).

Mining big data for key insights into healthcare, life sciences and social media at the Linguamatics San Francisco Seminar

September 3, 2014

Natural Language Processing (NLP) and text analytics experts from pharmaceutical and biotech companies, healthcare providers and payers gathered together to discuss the latest industry trends and to hear the product news and case studies from Linguamatics on August 26th.

The keynote presentation from Dr Gabriel Escobar was the highlight of the event, covering a rehospitalization prediction project that the Kaiser Permanente Department of Research has been working on in collaboration with Linguamatics. The predictive model has been developed using a cohort of approximately 400,000 patients and combines scores from structured clinical data with factors derived from unstructured data using I2E. Key factors that could affect a patient’s likelihood of rehospitalization are trapped in text; these include ambulatory status, social support network and functional status. I2E queries enabled KP to extract these factors and use them to indicate the accuracy of the structured data’s predictive score.

Leading the use of I2E in healthcare, Linguamatics showed how cancer centers are working together to develop queries for pathology reports, to mine medical literature and to predict pneumonia from radiology reports. They also demonstrated a prototype application to match patients to clinical trials and a cohort selection tool using semantic tagging of patient narratives in the Apache Solr search engine.

Semantic enrichment was discussed in the context of life sciences using SharePoint as the search engine. This drew great interest from the many life science companies in the audience, who understand the need to improve searching of internal scientific data. This discussion highlighted the challenges of getting a consistent view of internal and external data. The latest version of I2E will address this challenge with a new federated capability that provides merged result sets from internal and external searches. These new I2E capabilities have strong potential to improve insight, and they also incorporate a model that allows content providers to more actively support text mining.

Another hot topic was mining social media and news outlets for competitive intelligence and insights into company and product sentiment. The mass of information now available from social media requires a careful strategy of multiple levels of filtering, which enables the relevant data to be extracted from the millions of tweets and postings that occur daily. Once the relevant data has been identified it can be text mined, but users need to factor in support for abbreviations and shorter linguistic syntax. Mining social media and news outlets is an area that will continue to grow and require active support.

Linguamatics were grateful for such an engaged and interactive audience and look forward to future discussions on these exciting trends. Keep an eye out for information about our upcoming Text Mining Summit.

Pharmacogenomics across the drug discovery pipeline – are we nearly there?

August 20, 2014

Since the human genome was published in 2001, we have been talking about the potential application of this knowledge to personalized medicine, and in the last couple of years, we seem at last to be approaching this goal.

A better understanding of the molecular basis of diseases is key to development of personalized medicine across pharmaceutical R&D, as was discussed last year by Janet Woodcock, Director of the FDA’s Center for Drug Evaluation and Research (CDER). FDA CDER has been urging adoption of pharmacogenomics strategies and pursuit of targeted therapies for a variety of reasons. These include the potential for decreasing the variability of response, improving safety, and increasing the size of treatment effect, by stratifying patient populations.

Pharmacogenomics is the study of the role an individual’s genome plays in drug response, which can vary from adverse drug reactions to lack of therapeutic efficacy. With the recent explosion in sequence data from next generation sequencing (NGS) technologies, one of the bottlenecks in application of genomic variation data to understanding disease is access to annotation. From NGS workflows, scientists can quickly identify long lists of candidate genes that differ between two conditions (case-control, or family hierarchies, for example). Gene annotations are essential to interpret these gene lists and to discover fundamental properties like gene function and disease relevance.

Key sources for these annotations include the ever-growing biomedical literature. Some of it is captured in structured databases (such as COSMIC, GAD, DGA), but much valuable information is in textual sources such as PubMed Central, MEDLINE, and OMIM. Extracting actionable insight rapidly and accurately from text documents is greatly helped by advanced text analytics, and users of our I2E text analytics solution have been asking for access to OMIM, which will soon be available in our OnDemand portfolio.

In particular, there are two common use cases that I2E users want to address with enhanced text analytics over the OMIM data:

One use case comes from the clinical side of drug discovery-development; the clinicians provide information on a particular case or phenotype, and I2E is used to extract from OMIM the potential genes that might be relevant to sequence from the clinical samples to see if there is involvement in the disease pathway.

The other use case comes from early on in the drug discovery-development pipeline, at the initial stages of a project, where for a new disease area I2E is used to pull out a set of potential targets from OMIM. Obviously, before any lab work starts, more in-depth research is needed, but this provides an excellent seed for entry into a new therapeutic area.

Utilizing I2E to access OMIM brings benefits such as:

  • High quality results from this manually curated, fact-dense data source, compared to, for example, querying original articles in peer-reviewed literature
  • The use of our domain-specific ontologies (e.g. for diseases, genes, mutations and other gene variants) enables high recall compared to searching via the OMIM interface (for example, using ontologies to search for “liver cancer” also finds records with annotations for “liver neoplasm”, “hepatic cancer”, “cancer of the liver”, etc.)
  • Clustering of various synonyms and expressions from the use of Preferred Terms (PT) such as Gene Symbols
  • The ability to build in-depth queries, such as extraction of gene-gene interactions, and to hit a wide variety of concepts and synonyms, for example many different ways in which gene/protein mutations may be named (see figure legend)
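
The ontology point in the list above can be illustrated with a toy example. This is a minimal sketch, not how I2E works internally: the `SYNONYMS` dictionary is a hypothetical one-entry ontology built from the "liver cancer" example, and matching is naive lower-case substring search.

```python
# Hypothetical mini-ontology: preferred term -> known synonyms.
# A real disease ontology would hold many thousands of entries.
SYNONYMS = {
    "liver cancer": {
        "liver cancer",
        "liver neoplasm",
        "hepatic cancer",
        "cancer of the liver",
    },
}


def find_preferred_terms(text, ontology=SYNONYMS):
    """Return the preferred terms whose synonyms occur in the text.

    Naive substring matching for illustration only; it ignores
    tokenisation, morphology and negation.
    """
    lowered = text.lower()
    return sorted(
        preferred
        for preferred, synonyms in ontology.items()
        if any(synonym in lowered for synonym in synonyms)
    )


print(find_preferred_terms("Variants reported in a hepatic cancer cohort"))
```

A plain keyword search for "liver cancer" would miss that sentence, whereas mapping every synonym to one preferred term both finds it and groups it with the other variants.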
Network of Disease-Gene-Mutation relationships from I2E results in Cytoscape

The image shows a network of Disease – Gene – Mutation relationships from I2E results in Cytoscape. I2E was used to extract gene (green squares) and mutation (circles) information for stroke (central red triangle), showing some overlap of gene interactions with a related disease, cerebral infarction. Utilising the Linguamatics Mutation resource enables easy extraction of precise information patterns (e.g. “an ACTA2 mutation”; “proline for serine at codon 116”; “4895A/G”; “a 4-bp deletion”; “Q193Sfs*12”; “a 377A-T transversion”), which would be hugely time-consuming to do by manual curation.
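
To give a feel for why mutation mentions are hard to capture by hand, the sketch below matches three of the surface forms quoted above with regular expressions. This is purely illustrative and is not the Linguamatics Mutation resource: the pattern names and the `find_mutations` helper are assumptions, and forms such as "proline for serine at codon 116" need linguistic patterns rather than simple regexes.

```python
import re

# Hypothetical patterns covering only some of the forms quoted in the post.
PATTERNS = {
    # e.g. "Q193Sfs*12": residue, position, residue, frameshift, stop offset
    "frameshift": re.compile(r"\b[A-Z]\d+[A-Z]fs\*\d+"),
    # e.g. "4895A/G" or "377A-T": position followed by two nucleotides
    "nucleotide_substitution": re.compile(r"\b\d+[ACGT][/-][ACGT]\b"),
    # e.g. "4-bp deletion"
    "indel": re.compile(r"\b\d+-bp (?:deletion|insertion)\b"),
}


def find_mutations(text):
    """Return (pattern_name, matched_text) pairs found in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits


sentence = "a 377A-T transversion and Q193Sfs*12, plus a 4-bp deletion near 4895A/G"
for label, mention in find_mutations(sentence):
    print(label, mention)
```

Even this small set needs a separate pattern per naming convention, which is why a maintained resource covering the many variants is valuable.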

Combining OMIM access with extraction of genotype-phenotype relationships from MEDLINE, PubMed Central and ClinicalTrials.gov will give I2E users an excellent resource for NGS annotation, target discovery, and clinical genomics, in order to better target the molecular basis of disease.

If you are interested in accessing OMIM using the power of I2E (or any other of the current content options on I2E OnDemand, i.e. MEDLINE, ClinicalTrials.gov, FDA Drug Labels, NIH Grants, PubMed Central Open Subset, and Patents), please get in touch and we can provide more information and keep you updated on our progress.





Economics of the Obesity Epidemic – Extracting Knowledge with Advanced Text Analytics

July 15, 2014

In the current competitive marketplace for healthcare, pharmaceutical and medical technology companies must be able to demonstrate clinical and economic evidence of benefit to providers, healthcare decision-makers and payers. Now more than ever, pricing pressure and regulatory restrictions are generating increased demand for this kind of outcomes evidence.

Health Economics and Outcomes Research (HEOR) aims to assess the direct and indirect health care costs associated with a disease or a therapeutic area, and associated interventions in real-world clinical practice. These costs include:

• Direct economic loss

• Economic loss through hospitalization

• Indirect costs from loss of wider societal productivity

The increasing amount of available data on patients, prescriptions, markets, and scientific literature, combined with the wider use of comparative effectiveness, makes traditional keyword-based search techniques ineffectual. I2E can provide the starting point for efficiently performing evidence-based systematic reviews over very large sets of scientific literature, enabling researchers to answer questions such as:

• What is the economic burden of disease within the healthcare system? Across states, and globally?

• Does XYZ new intervention merit funding? What are the economic implications of its use?

• How do the incremental costs compare with the anticipated benefits for specific patient groups?

• How does treatment XYZ affect quality of life? Activities of daily living? Health status indicators? Patient satisfaction?

A recent project looking at the economics of obesity used I2E to search all 23 million abstracts in Medline for research on the incidence of comorbid diseases, with associated information on patient cohort, geographic location, severity of disease, and associated costs (e.g. hospitalisation cost, treatment, etc.). From the I2E output, more advanced visual analytics can be carried out. For example, the pie chart shows the prevalence of the various comorbid diseases (from 2013 Medline abstracts that mention HEOR terms, obesity and a comorbid disease), showing the high frequency of hypertension and various other cardiovascular diseases. Another view of the same extracted intelligence shows the geographic spread of health economics and obesity research, with a predominance across North America, but also data from China and Brazil, for example.

Prevalence of cardiovascular co-morbid diseases


Geographic view of HEOR research, mined from Medline from 2013

If you are interested in getting a better understanding of the power of advanced text analytics for HEOR, please contact us.
