Who’s Doing What with Linguamatics I2E?

June 26, 2015

Over the past few months there have been several publications which have used Linguamatics I2E to extract key information to provide value in a variety of different projects. We are constantly amazed by the inventiveness of our users, applying text analytics across the bench to bedside continuum; and these different publications are no exceptions. Using the natural language processing power of I2E, researchers are able to answer their questions rapidly and extract the results they need, with high precision and good recall; compared to more standard keyword search, which returns a document set that they then need to read.

Let’s start with Hornbeck et al., “PhosphoSitePlus, 2014: mutations, PTMs and recalibrations”. PhosphoSitePlus is an online systems biology resource for the study of protein post-translational modifications (PTMs) including phosphorylation, ubiquitination, acetylation and methylation. It’s provided by Cell Signaling Technology who have been users of I2E for several years. In the paper, they describe the value from integrating data on protein modifications from high-throughput mass spectrometry studies, with high-quality data from manual curation of published low-throughput (LTP) scientific literature.

The authors say: “The use of I2E, a powerful natural language processing software application, to identify articles and highlight information for manual curation, has significantly increased the efficiency and throughput of our LTP curation efforts, and made recuration of selected information a realistic and less time-consuming option.” The CST scientists rescore and recurate PTM assignments, reviewing for coherence and reliability – and use of these data can “provide actionable insights into the interplay between disease mutations and cell signalling mechanisms.”

A paper by a group from Roche, Zhang et al., “Pathway reporter genes define molecular phenotypes of human cells” described a new approach to understanding the effect of diseases or drugs on biological systems, by looking for “molecular phenotypes”, or fingerprints, patterns of differential gene expression triggered by a change to the cell. Here, text analytics played a small role in the project, which was (along with other tools) to compile a panel of over 900 human pathway reporter genes – representing 154 human signalling and metabolic networks. These were then used to gain understanding of cardiomyocyte development (relevant to diabetic cardiomyopathy) and assessment of common toxicity mechanisms (relevant to the mechanistic basis of adverse drug events).

The last one I wanted to highlight moves away from the realms of genes and cells, into analysis of co-prescription trends and drug-drug interactions (Sutherland et al., “Co-prescription trends in a large cohort of subjects predict substantial drug-drug interactions”). Better understanding of drug-drug interactions is of increasing importance for good healthcare delivery, as more and more patients are routinely taking multiple medications, particularly in the elderly – and the huge number of potential combinations prohibit testing for safety for all these combinations in clinical trials.

Over 30% of older adults take 5 or more drugs. Text analytics can extract clinical knowledge about potential drug-drug interactions.

Over 30% of older adults take 5 or more drugs. Text analytics can extract clinical knowledge about potential drug-drug interactions.

In this study, the authors used prescription data from NHANES surveys to find what drugs or drug classes were most routinely prescribed together; and then used I2E to search MEDLINE for a set of 133 co-prescribed drugs to assess the availability of clinical knowledge about potential drug-drug interactions. The authors found that over 30% of older adults take 5 or more drugs – but these combinations were pretty much unique. Of the co-prescribed pairs, a large percentage were not mentioned together in any MEDLINE record – demonstrating a need for further study. The authors conclude that these data show “that personalized medicine is indeed the norm, as patients taking several medications are experiencing unique pharmacotherapy” – and yet there is little published research on either efficacy or safety of these combinations.

What do these three studies have in common? The use of text analytics, not as the only tool or necessarily even the major tool, but as part of an integrated analysis of data, to answer focused and specific questions, whether those questions relate to protein modification, molecular patterns of genes in pathways or drug interactions and potential adverse events. And I wonder, where will Linguamatics I2E be used next?

Advertisements

Picking your brain: Synergy of OMIM and PubMed in Understanding Gene-Disease Associations for Synapse Proteins

January 5, 2015

I read with interest a recent publication which sheds light on the complex interactions of synapse protein complexes with human disease. The study (run by the Genes to Cognition neuroscience research programme) combined wet-lab research with bioinformatics and text analytics to uncover genetic associations with these protein complexes in over seventy human brain diseases, including Alzheimer’s Disease, Schizophrenia and Autism spectrum disorders. The idea was to identify and develop suitable screening assays for synapse proteomes from post-mortem and neurosurgical brain samples, focusing specifically on Membrane-associated guanylate kinase (MAGUK) associated signalling complexes (MASC).

Our CTO, David Milward was involved in the text analytics work. He used the natural language processing capabilities of Linguamatics I2E platform to extract gene-mutation-disease associations from PubMed abstracts. The flexibility of I2E enabled an appropriate balance of recall and precision, thus providing comprehensive results while not overloading curators with noise. Queries were built using linguistic patterns to allow associations to be discovered between a list of several thousand relevant gene identifiers, and appropriate MedDRA disease terms. The key aim was to provide comprehensive results with suitable accuracy to allow fast curation. These text-mined results were combined with data from Online Mendelian Inheritance in Man (OMIM) on human MASC genes and genetic disease associations.

In total, 143 gene-disease associations were found: 26 in both OMIM and extracted from PubMed abstracts via text-mining; 68 in OMIM alone, and 49 in PubMed alone.

In total, 143 gene-disease associations were found: 26 in both OMIM and extracted from PubMed abstracts via text-mining; 68 in OMIM alone; and 49 via text mining from PubMed alone.

I wanted to dig a little deeper into the data from the paper and the comparison of OMIM and PubMed. Supplementary Table 5 has information on the list of genes coding for MASC proteins and causing inherited diseases as described in the OMIM repository, or identified using text mining software as associated to disease. In total, 143 gene-disease associations were found (see Figure), but only 26 associations were found in both sources. This shows the synergistic value of combining data from these two sources, and the need for integration of multiple sources to get the fullest picture possible, for any particular gene-disease involvement.


%d bloggers like this: