NLP & Big Data Symposium in San Francisco

June 29, 2015

Life sciences and healthcare professionals gathered at the UCSF Mission Bay campus for the West Coast Natural Language Processing (NLP) & Big Data Symposium on June 18th. The symposium, co-hosted by UCSF, featured presenters from UCSF, Merck, City of Hope, Copyright Clearance Center and Linguamatics and delegates from a diverse range of organizations.

The central theme of this year’s symposium was “From bench-to-bedside, unlocking key insights from your data”. Healthcare delegates were keen to find new ways to address meaningful use and accountable care leveraging NLP text mining of electronic health records. Life sciences delegates were keen to increase the efficiency and effectiveness of their business operations by mining real world data. There was also a strong interest in forging partnership opportunities between pharma/biotech and hospitals/cancer centers.

Sorena Nadaf, CIO and Director of Translational Informatics at the UCSF Helen Diller Family Comprehensive Cancer Center, delivered the welcome address, highlighting the foundations of clinical NLP and its common uses for extracting and transforming narrative information in EMRs to support and accelerate clinical research.

Sorena Nadaf at the NLP & Big Data Symposium in San Francisco.


Wendy Cornell, retired from Merck, described Merck’s development of a natural language processing (NLP) workflow to extract conclusions and interpretations from their large corpus of internal reports using the Linguamatics I2E software, and the integration and analysis of the extracted data using the ANZO platform from Cambridge Semantics. The primary use case discussed, automated extraction of conclusions and interpretations from internal preclinical safety reports using I2E, generated a lot of interest and discussion.

Joyce Niland, Chief Research Information Officer, and Rebecca Ottesen, Biostatistician, from City of Hope (COH) presented a recent project with Linguamatics in which they created a disease registry using Iterative Interactive Enrichment (IIE) of NLP queries shared across institutions. The I2E queries, initially written by the Huntsman Cancer Institute (HCI) and Linguamatics, identify immunohistochemistry (IHC) marker results from unstructured pathology dictations on malignant Non-Hodgkin’s Lymphoma patients. They were shared with COH to assess their portability from one institution to another. Linguamatics, COH, and HCI applied the IIE process through several phases to improve the IHC queries while sharing the improvements between institutions. Precision and recall were measured at each phase to assess the completeness and accuracy of information extraction, and to identify the NLP features with the greatest impact on these results. Final F-scores for COH and HCI were 0.91 and 0.94, respectively. This impressive level of precision and recall across two institutions validates the Linguamatics approach of sharing its wealth of existing healthcare queries with I2E customers to help accelerate research and improve patient outcomes.
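As a reminder of how these figures combine the two measures, the F score reported here is the harmonic mean of precision and recall (F1). A minimal sketch; the counts below are illustrative, not taken from the study:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 measure)."""
    return 2 * precision * recall / (precision + recall)

# Illustrative values only: an extraction that finds 94 of 100 true
# IHC marker results (recall 0.94), with 9 spurious hits among its
# 103 extractions (precision ~0.91), would score:
precision, recall = 94 / 103, 94 / 100
print(round(f1_score(precision, recall), 2))  # 0.93
```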

Chris Hilbert from Copyright Clearance Center presented CCC’s new RightFind XML for Mining service, its integration with Linguamatics’ I2E, and how the combined solution improves the results of text and data mining queries while mitigating infringement risk. Chris demonstrated how customers can obtain and index full-text XML articles from multiple scientific publishers in I2E, avoiding many of the data format and licensing issues associated with working with PDFs. As existing licensed literature does not have to be repurchased, delegates saw this service as a highly effective way of leveraging existing full-text investments and extracting more value via I2E text mining.

To complement our customer and partner presentations, Linguamatics led presentations including an introduction to NLP text mining; healthcare NLP strategies to improve patient care, reduce costs and enhance population health; and Real World Data and text analytics.

It was wonderful to catch up with many of our customers, meet some new ones and help foster introductions and discussions between the various delegates. Keep an eye out for upcoming opportunities to meet with Linguamatics at our events page including our Princeton seminar on July 16 and Text Mining Summit and I2E Healthcare Hackathon in October.

Who’s Doing What with Linguamatics I2E?

June 26, 2015

Over the past few months there have been several publications which have used Linguamatics I2E to extract key information and provide value in a variety of different projects. We are constantly amazed by the inventiveness of our users, applying text analytics across the bench-to-bedside continuum, and these publications are no exception. Using the natural language processing power of I2E, researchers are able to answer their questions rapidly and extract the results they need with high precision and good recall, in contrast to standard keyword search, which simply returns a document set that they then need to read.

Let’s start with Hornbeck et al., “PhosphoSitePlus, 2014: mutations, PTMs and recalibrations”. PhosphoSitePlus is an online systems biology resource for the study of protein post-translational modifications (PTMs), including phosphorylation, ubiquitination, acetylation and methylation. It is provided by Cell Signaling Technology (CST), who have been users of I2E for several years. In the paper, they describe the value of integrating data on protein modifications from high-throughput mass spectrometry studies with high-quality data from the manual curation of published low-throughput (LTP) scientific literature.

The authors say: “The use of I2E, a powerful natural language processing software application, to identify articles and highlight information for manual curation, has significantly increased the efficiency and throughput of our LTP curation efforts, and made recuration of selected information a realistic and less time-consuming option.” The CST scientists rescore and recurate PTM assignments, reviewing for coherence and reliability – and use of these data can “provide actionable insights into the interplay between disease mutations and cell signalling mechanisms.”

A paper by a group from Roche, Zhang et al., “Pathway reporter genes define molecular phenotypes of human cells”, describes a new approach to understanding the effect of diseases or drugs on biological systems by looking for “molecular phenotypes”, or fingerprints: patterns of differential gene expression triggered by a change to the cell. Here, text analytics played a smaller role: I2E was used, along with other tools, to compile a panel of over 900 human pathway reporter genes representing 154 human signalling and metabolic networks. These were then used to gain understanding of cardiomyocyte development (relevant to diabetic cardiomyopathy) and to assess common toxicity mechanisms (relevant to the mechanistic basis of adverse drug events).

The last one I wanted to highlight moves away from the realms of genes and cells into analysis of co-prescription trends and drug-drug interactions (Sutherland et al., “Co-prescription trends in a large cohort of subjects predict substantial drug-drug interactions”). A better understanding of drug-drug interactions is of increasing importance for good healthcare delivery: more and more patients, particularly the elderly, routinely take multiple medications, and the huge number of potential combinations prohibits testing them all for safety in clinical trials.

Over 30% of older adults take 5 or more drugs. Text analytics can extract clinical knowledge about potential drug-drug interactions.


In this study, the authors used prescription data from NHANES surveys to find which drugs or drug classes were most routinely prescribed together, and then used I2E to search MEDLINE for a set of 133 co-prescribed drugs to assess the availability of clinical knowledge about potential drug-drug interactions. The authors found that over 30% of older adults take 5 or more drugs, yet the particular combinations taken were largely unique to each patient. Of the co-prescribed pairs, a large percentage were not mentioned together in any MEDLINE record, demonstrating a need for further study. The authors conclude that these data show “that personalized medicine is indeed the norm, as patients taking several medications are experiencing unique pharmacotherapy” – and yet there is little published research on either the efficacy or the safety of these combinations.
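The pipeline described here can be sketched in a few lines. The patient lists and co-mention counts below are purely hypothetical stand-ins for the NHANES prescription data and the MEDLINE co-mention counts an I2E search might return:

```python
from itertools import combinations
from collections import Counter

# Hypothetical per-patient prescription lists (stand-ins for NHANES data).
patients = [
    ["lisinopril", "simvastatin", "metformin"],
    ["lisinopril", "simvastatin", "aspirin"],
    ["metformin", "aspirin"],
]

# Count how often each unordered drug pair is co-prescribed.
pair_counts = Counter()
for drugs in patients:
    pair_counts.update(combinations(sorted(drugs), 2))

# Hypothetical MEDLINE co-mention counts; pairs with zero co-mentions
# flag a gap in the published evidence, as in the study.
comentions = {("lisinopril", "simvastatin"): 12}
unstudied = [pair for pair in pair_counts if comentions.get(pair, 0) == 0]
```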

What do these three studies have in common? The use of text analytics, not as the only tool or necessarily even the major tool, but as part of an integrated analysis of data, to answer focused and specific questions, whether those questions relate to protein modification, molecular patterns of genes in pathways or drug interactions and potential adverse events. And I wonder, where will Linguamatics I2E be used next?

Mining Nuggets from Safety Silos

May 26, 2015

Better access to the high-value information in legacy safety reports has been, for many people in pharma safety assessment, a “holy grail”. Locked away in these historical data are answers to questions such as: Has this particular organ toxicity been seen before? In what species, and with what chemistry? Could new biomarker or imaging studies predict the toxicity earlier? What compounds could be leveraged to help build capabilities?

I2E enables extraction and integration of historical preclinical safety information, crucial to optimizing investment in R&D, alleviating concerns where preclinical observations are not expected to be human-relevant, and reducing late stage failures.


Coming as I do from a decade of working in data informatics for safety/tox prediction, I was excited by one of the talks at the recent Linguamatics Spring Users Conference. Wendy Cornell (ex-Merck) presented on an ambitious project using Linguamatics’ text mining platform, I2E, in a workflow to extract high-value information from safety assessment reports stored in Documentum.

Access to historic safety data will be helped by the use of standards in electronic data submission for regulatory studies (e.g. CDISC’s SEND, the Standard for Exchange of Nonclinical Data).

Standardizing the formats and vocabularies for key domains in safety studies will enable these data to be fed into searchable databases. However, such structured data miss the intellectual content added by the pathologists and toxicologists, whose conclusions are essential for understanding whether evidence of a particular tox finding (e.g. hyalinosis, single cell necrosis, blood enzyme elevations) signals a potentially serious problem in humans or is specific to the animal model.

For these key conclusions, access to the full study reports is essential.

At Merck, Wendy’s Proprietary Information and Knowledge Management group, in collaboration with the Safety Assessment and Laboratory Animal Resources (SALAR) group, developed an I2E workflow that extracted the key findings from safety assessment ante- and post-mortem reports, final reports, and protocols, in particular pulling out:

  • Study annotation (species, study duration, compound, target, dosage)
  • Interpreted results sections (i.e. summary or conclusions sections)
  • Organ-specific toxicology and histopathology findings
  • Haematological and serum biochemistry findings

In addition, a separate arm of the workflow leveraged the ABBYY OCR software to extract toxicokinetic (TK) parameters, such as area under the curve (AUC), maximum drug concentration (Cmax), and time after dosing of peak drug plasma exposure (Tmax), from PDF versions of the documents.

The extracted and normalized information was loaded into a semantic knowledgebase in the Cambridge Semantics ANZO tool and searched and visualized using a tailored ANZO dashboard. This faceted browsing environment enabled the SALAR researchers to ask questions such as, “what compounds with rat 3-month studies show kidney effects, and for these compounds, what long term studies do we have?”
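As an illustration of the kind of faceted question this enables, consider the query quoted above over a handful of structured study records. The records and compound names below are invented, and the real system queried the ANZO knowledgebase rather than Python lists:

```python
# Hypothetical structured study records of the kind the I2E workflow
# extracts from safety assessment reports (all values invented).
studies = [
    {"compound": "MK-0001", "species": "rat", "duration_months": 3,
     "findings": ["kidney tubular degeneration"]},
    {"compound": "MK-0001", "species": "dog", "duration_months": 9,
     "findings": ["no adverse findings"]},
    {"compound": "MK-0002", "species": "rat", "duration_months": 3,
     "findings": ["hepatocellular hypertrophy"]},
]

# "Which compounds show kidney effects in rat 3-month studies?"
hits = {s["compound"] for s in studies
        if s["species"] == "rat" and s["duration_months"] == 3
        and any("kidney" in f for f in s["findings"])}

# "...and for those compounds, what long-term studies do we have?"
long_term = [s for s in studies
             if s["compound"] in hits and s["duration_months"] >= 9]
```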

Wendy presented several use cases showing the real business value of this system, including the potential to influence regulatory guidelines. For example, the team were able to run an analysis of the correlation between 3-month sub-chronic non-rodent studies and 9- or 12-month chronic non-rodent results; they found that in nearly 30% of cases an important new toxicologic finding was identified in the long-term studies, confirming the ongoing need for long-term studies.

Wendy stated, “This unstructured information represents a rich body of knowledge which, in aggregate, has potential to identify capability gaps and evaluate individual findings on active pipeline compounds in the context of broad historical data.”

With the current focus on refinement, replacement and reduction of animal studies, being able to identify when long-term studies are needed vs. when they are not essential for human risk assessment, will be hugely valuable; and extracting these nuggets of information from historic data will contribute to this understanding.

Expert interpretations and conclusions from thousands of past studies can potentially be converted into actionable knowledge. These findings exist as unstructured text in safety documents. See Wendy Cornell speak on this at our upcoming NLP and Big Data Symposium in San Francisco.

II-SDV 2015: The International Information Conference on Search, Data Mining and Visualization

May 12, 2015

The recent two-day II-SDV meeting in the beautiful town of Nice on the Côte d’Azur, France, started with a day of talks considering how best to maximise the value of data extracted from a wide range of sources: patents, full-text articles and even big data.

The programme kicked off with a presentation from Aleksander Kapisoda of Boehringer Ingelheim (BI), describing how innovative use of custom search techniques, beyond those currently offered by standard public search engines, can bring tangible benefits to a global pharmaceutical company.

One theme that emerged was the potential use of text mining, particularly in constructing landscapes related to emerging technologies. Jane List (Extract Information, UK) described some of the tools, workflows, and visualisations for patent landscaping, with a great quote from Marcel Proust: “The real voyage of discovery consists not in seeking new landscapes, but in having new eyes”. Emmanuelle Fortune (INPI, France) discussed the ability to classify world cities dubbed “Smart Cities” as hubs for technological development directly from mining the patent literature.

Staying on the topic of text mining, I presented a number of use cases on the subject of “time”, and pressed home the message that text mining can provide clear advantages for access to timely information. This was followed by news from the Copyright Clearance Center (CCC) that the difficult process of obtaining legal permission for text mining has recently become a lot easier, with the ability to directly create a set of full-text documents ready for immediate use in text mining. This has long been a goal for many information scientists, as there are valuable nuggets of information in full text that just can’t be gained from mining abstracts.

Finally, the conference heard from a different group within Boehringer Ingelheim, concerned with automating the currently time-consuming process of extracting medicinally relevant chemistry from patents. Matthias Negri, collaborating with technology partner ChemAxon, has established a KNIME workflow that uses Linguamatics I2E to extract the pharmacological context surrounding the chemistry described within a patent, providing “a solid information base of value to any phase of a drug discovery project”.

With participants from both the US and Europe, the conference provided a great opportunity to meet information specialists, patent experts and scientists from across the life science specialities, as well as to hear from vendors about their new product developments. If you are interested in any of the topics discussed, more information can be found at the conference website, and I’d be happy to hear your comments.

Andrew Hinton, Application Specialist, Linguamatics

BioIT 2015: Bench to Bedside value from Data Science and Text Analytics

April 29, 2015

Last week’s BioIT World Expo kicked off with a great keynote from Philip Bourne (Associate Director for Data Science, National Institutes of Health), setting the scene for a theme that ran throughout the conference: how can we benefit from big data analytics, or data science, for pharma R&D and delivery into healthcare? With 2 days of talks, 12 tracks covering cloud computing, NGS analytics, Pharmaceutical R&D Informatics, Data Visualization & Exploration Tools, and Data Security, plus a full day of workshops beforehand and a busy exhibition hall, there was plenty to see, do, take in and discuss. I attended several talks on best practice in data science by speakers from Merck, Roche, and BMS, and I was pleased to hear speakers mention text analytics, particularly natural language processing, as a key part of the overall data science solution.

All this, and in beautiful Boston, just after the Marathon; luckily for us, the weather warmed up and we were treated to a couple of lovely sunny days.  As one of the speakers in the Pharma R&D informatics track, I presented some of the Linguamatics use cases of text analytics across the drug discovery – development – delivery pipeline.  Our customers are getting value along this bench-to-bedside continuum, using text mining techniques for gene annotation, patent analytics, regulatory QA/QC, clinical trial matching for patients, and through into extracting key endpoints from pathology reports and EHRs for better patient care. If you missed the talk, we will be presenting the material at our webinar on 3rd June.

Boston at night


Linguamatics I2E users lead the way in text mining for patents, safety and more at this year’s Spring Users Conference

April 28, 2015

We are always amazed and impressed at the inventiveness of Linguamatics customers, in their applications of text analytics to address their information challenges. Our annual Linguamatics Spring Users Conference showcased some examples of their innovation, with presentations on text mining used for patent analytics, chemical pharmacokinetics and pharmacodynamics data extraction, creating value from legacy safety reports, and integrating open source tools for advanced entity recognition. We had a record-breaking number of attendees this year, representing over 20 organizations, ranging from our most experienced I2E users to text mining novices.

A record-breaking number of attendees enjoyed the opportunity to experience Cambridge and share insights with one another at this year's conference.


Patent analytics featured in two of the presentations, demonstrating the value of NLP in extracting critical information from dense and lengthy patent documents. Julia Heinrich (Senior Patent Analyst, Biotechnology at Bristol-Myers Squibb, Princeton, New Jersey) asked the question: “Can the infoglut of biotech patent publications be quickly reviewed to enable timely business decisions?” She admirably demonstrated that, with smart use of I2E’s NLP queries, BMS have been able to search the patent body for information on antibody-drug conjugates and convert “unstructured data” into user-friendly, analysis-ready data sets. Thorsten Schweikardt (Senior Information Scientist, Boehringer Ingelheim) gave an overview of workflows developed using KNIME to create patent landscapes for specific disease areas, target identification, and discovery of tool compounds.

Wendy Cornell (former head of the Merck Proprietary Information and Knowledge Management Group), like Julia Heinrich, flew over from the US for the meeting. Wendy presented on the automated extraction of conclusions from internal preclinical safety reports using I2E. These internal safety assessment reports contain a wealth of historical data on the safety and toxicity of developmental compounds, and many pharma organizations have sought ways to benefit from these valuable legacy documents. Wendy’s group developed a strategy to access Documentum-based safety assessment reports, and were able to pull out histopath findings, organ toxicities, haematological and blood biochemistry results, and even toxicokinetic parameters from tabular sections. Three use cases were presented, showing significant business impact within the Safety Assessment organization.

Wendy Cornell details how she used I2E to create an NLP-driven workflow tapping into the large body of valuable knowledge located in structured and unstructured internal documents.

Linguamatics’ speakers gave an update on future innovations in the I2E roadmap, the new features in I2E 4.3, and the software’s applications in the life sciences and healthcare. Guy Singh showed how I2E 4.3’s Connected Data Technology allows users to better exploit big data no matter where the data are located (on premise or in the cloud) and whatever structure they have, at speed and with digestible results. Phil Hastings gave a brief overview of Linguamatics I2E in healthcare, and NLP Specialist James Cormack took us through Linguamatics’ approach and results for our submission to the i2b2 2014 Cardiac Risk Factors challenge. You can find out more about what we’re doing in healthcare via this short video.

We heard from a few of our partners in 5-minute lightning round presentations: IFI Claims Patent Services, ChemAxon, Copyright Clearance Center, Thomson Reuters and KNIME discussed their solutions and how they integrate with Linguamatics I2E.

In addition to the presentations, the Linguamatics Spring Users Conference provided opportunities for hands-on training, with workshops aimed at different levels of text mining experience. And of course, there was plenty of time for networking and idea sharing. Our evening events were hosted in the Old Combination Room at Corpus Christi College and the Pembroke College Old Library. We enjoyed beautiful, warm spring evenings at two of Cambridge University’s oldest colleges. One delegate remarked ‘It’s so nice to be shown hidden Cambridge treasures like these, which we would never know about if it wasn’t for the events at the Linguamatics conference.’

Evening social events at Cambridge University's historic colleges


The whole event was a great success that brought together the text mining community from across Europe (and across the pond!).

Presentations which have been approved to share are available on I2Edia and by email request.

Thanks to everyone who attended and contributed to the Linguamatics Spring Users Conference 2015; we look forward to seeing you in October at the Text Mining Summit in Newport, RI, or in Cambridge, UK next year.

Accelerating Drug Approvals with better Regulatory QA

April 7, 2015

Submitting a drug approval package to the FDA, whether for an NDA, BLA or ANDA, is a costly process. The final amalgamation of different reports and documents into the overview document set can involve a huge amount of manual checking and cross-checking, from the subsidiary documents to the master. It is crucial to get the review process right. Any errors, and the FDA can send back the whole package, delaying the application. But the manual checking involved in the review process is tedious, slow, and error-prone.

A delayed application can also be costly. How much are we talking about? While not every drug is a blockbuster, these numbers are indicative of what you could be losing: the top 20 drugs in the United States accounted for $319.9 billion in sales in 2011, so a newly launched blockbuster could make around $2Bn in its first year; that’s around $5-6M per day. If errors in the quality review hold up an NDA for even just a week, this could generate significant costs.
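The arithmetic behind these figures is straightforward; a quick sketch, using the round numbers from the text:

```python
# Back-of-envelope estimate of the cost of a submission delay
# (all revenue figures are illustrative round numbers).
first_year_sales = 2_000_000_000          # ~$2Bn for a new blockbuster
per_day = first_year_sales / 365          # ~$5.5M of sales per day
one_week_delay = per_day * 7              # ~$38M for a one-week delay
```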

So – how can text analytics improve this quality assurance process?  Linguamatics has worked with some of our top 20 pharma customers to develop an automated process to improve quality control of regulatory document submission. The process cross-checks MedDRA coding, references to tables, decimal place errors, and discrepancies between the summary document and source documents. This requires the use of advanced processing to extract information from tables in PDF documents as well as natural language processing to analyze the free text.

The errors that can be identified include:

  • Incorrect formatting: doubled period, incorrect number of decimal places, addition of percent sign
  • Incorrect calculation: number of patients divided by total number does not agree with percent term
  • Incorrect threshold: presence of row does not agree with table title
  • Text-Table inconsistency: numbers in the table do not agree with numbers in the accompanying text


Sample table and text highlighting, showing inconsistencies between data. The highlight colour makes it easy for the reviewer to rapidly assess where the errors are and what type they are, so they can be corrected appropriately.

Using advanced text mining, we are able to identify inconsistencies within FDA submission documents, across both tables and the textual parts of the reports. Overall, we found that using automated text analysis for quality assurance of submission documents can save hours or even weeks of tedious manual checking, and potentially prevent a re-submission request, with potential savings of millions of dollars.

This work was presented by Jim Dixon, Linguamatics, at the Pharmaceutical Users Software Exchange Computational Science Symposium in March 2015.



Patent Landscaping – Text Analytics Extracts the Value

March 23, 2015

Patent literature is a hugely valuable source of novel information for life science research and business intelligence. The wealth of knowledge disclosed in patents may not be found in other information sources, such as MEDLINE or full text journal articles.

Patent landscape reports (also known as patent mapping or IP landscaping) provide a snapshot of the patent situation of a specific technology, and can be used to understand freedom-to-operate issues, to identify in- and out-licensing opportunities, to examine competitor strengths and weaknesses, or as part of a more comprehensive market analysis.

These are valuable searches, but they demand advanced search and data visualization techniques, as any particular landscape report requires examination of many hundreds or thousands of patent documents. Patent text is unstructured; the information needed is often embedded within the body of the patent and may be scattered throughout the lengthy descriptions; and the language is often complex and designed to obfuscate.

Text analytics can provide the key to unlock the value. A recent paper by a team at Bristol Myers Squibb describes a novel workflow to discover trends in kinase assay technology. The aim was to strengthen their internal kinase screening technology, with the first step being to analyze industry trends and benchmark BMS’ capabilities against other pharmaceutical companies, with key questions including:

  • What are the kinase assay technology trends?
  • What are the trends for different therapeutic areas?
  • What are the trends for technology platforms used by the big pharmaceutical companies?

The BMS team built a workflow using several tools: Minesoft’s Patbase, for the initial patent document set collection; Linguamatics I2E, for knowledge extraction; and TIBCO’s Spotfire, for data visualization. The project used I2E to create precise, effective search queries to extract key information around 500 kinases, 5 key screening technologies, 5 therapeutic areas, and across 14 pharmaceutical companies. Use of I2E allowed queries to be designed using domain specific vocabularies for these information entities, for example using over 10,000 synonyms for the kinases, hugely improving the recall of these patent searches. These I2E “macros” enabled information to be extracted regardless of how the facts were described by inventors. Using these vocabularies also allowed semantic normalization; so however the assignee described a concept, the output was standardized to a preferred term, for example, Pfizer for Wyeth, Warner Lambert, etc.
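At its core, semantic normalization of this kind reduces to mapping every synonym or historical name to one preferred term. A toy sketch with an invented fragment of such a vocabulary (the real I2E vocabularies hold many thousands of entries):

```python
# Invented fragment of a normalization vocabulary: every variant maps
# to a single preferred term, whether company names or kinase names.
PREFERRED = {
    "wyeth": "Pfizer",
    "warner lambert": "Pfizer",
    "warner-lambert": "Pfizer",
    "pfizer": "Pfizer",
    "aurora b": "AURKB",
    "aurora kinase b": "AURKB",
}

def normalize(term: str) -> str:
    """Map any recognised variant to its preferred term; pass others through."""
    return PREFERRED.get(term.lower().strip(), term)

print(normalize("Warner Lambert"))   # Pfizer
print(normalize("Aurora B"))         # AURKB
```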

Using I2E also meant that searches could be focused on specific regions of the patent documents for more precise search; for example, the kinase information was extracted from claims (enhancing the precision of the search).

Using this novel approach, the patent analysis team mined over 7,100 full-text patents: approximately half a million pages of full text, searched for relevant kinase technology trends and the corresponding therapeutic area information. To put the business value into perspective, it takes roughly an hour to manually read one patent for data extraction, so a scope this large would require around 175 person-weeks (nearly 3.5 years!) to accomplish. The authors state that innovative use of I2E enabled a 99% efficiency gain for delivering the relevant information. They also note that the project took 2 patent analysts 3 months (i.e. about 25 person-weeks), a 7-fold saving in FTE time.
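The effort estimates here reduce to simple arithmetic, reproduced under the stated assumptions (40-hour working weeks, 1 hour of reading per patent):

```python
# Reproducing the manual-effort estimates quoted in the text.
patents = 7100
hours_per_patent = 1
manual_person_weeks = patents * hours_per_patent / 40   # ~177, i.e. ~175
manual_years = manual_person_weeks / 52                 # ~3.4, "nearly 3.5"
actual_person_weeks = 2 * 12.5                          # 2 analysts x ~3 months
saving = manual_person_weeks / actual_person_weeks      # ~7-fold
```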

The deliverables provided key actionable insights that empowered timely business decisions for BMS researchers; and this paper demonstrates that rich information contained in full text patents can be analyzed if innovative tools/methods are used.


Data to knowledge: visualisation of the different pharma companies and the numbers of relevant patents for each in the kinase assay technologies. Taken from Yang et al. (2014) WPI 39: 24-34.


Big data, real world data… where does text analytics fit in?

February 26, 2015

Big data? Real world data? What do we really mean…?

I was at a conference a couple of weeks ago: an interesting two days spent discussing what big data means in the life science domain, and what value we can expect to gain from better access and use. The keynote speaker kicked off the first day with a great quote from Atul Butte: “Hiding within those mounds of data is knowledge that could change the life of a patient or change the world”.

This is a really great ambition for data analytics. But one interesting topic was: what do we mean by big data? One common definition from some of the pharma delegates was any sort of patient-related data that originated outside their organization. To me, this definition seems to refer more to real world data: adverse event reports, electronic health records, voice of the customer (VoC) feeds, social media data, claims data, patient group blogs. In short, any data that hasn’t been influenced by the drug provider, and can give an external view from the patient, payer, or healthcare provider.

Many of these real world sources have free text fields, and this is where text analytics, and natural language processing (NLP), fit in. We have customers who are using text analytics to get actionable insight from real world data, and finding valuable intelligence that can inform commercial business strategies. Valuable information could be found in electronic health records, but these are notoriously hard for pharma to access, given the regulations and restrictions around data use and data privacy. So, what real world data are accessible? Our customers are inventive, and have used data types such as clinical trial reports, clinical investigator brochures, National Comprehensive Cancer Network (NCCN) guidelines, and VoC call transcripts.

VoC call transcripts are a rich seam of potential patient-reported outcomes, side effects, drug interactions, and more. The medical information group at Pfizer has used the Linguamatics I2E text analytics solution to access insights that can have a huge impact on commercial business decisions. Their strategic goal has been to efficiently analyze unstructured data and alert decision makers to the signals that come from users of Pfizer products.

Workflow for text analytics over unstructured VoC feeds

Researchers in the predictive analytics group built a workflow to take the call transcripts, process them using advanced text analytics to make sense of the unstructured feeds, and visualize the output to reveal trends and build predictive models around the different products and the real-world data coming back from patients, consumers, medical assistants, pharmacists, and sales representatives. The calls could be categorized and tagged for key metadata such as caller demographics and reason for calling (e.g. complaint, formulation information, side effect, drug-drug interactions).
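The tagging step in a workflow like this can be sketched as a minimal reason-for-calling categorizer. The categories echo the examples in the text, but the keywords and function below are illustrative assumptions, not Pfizer's actual rules; a production system would use full NLP rather than keyword matching:

```python
# Minimal sketch of tagging call transcripts with reason-for-calling
# categories. The keyword lists are illustrative assumptions only; a
# production workflow would use NLP, not simple keyword matching.
REASON_KEYWORDS = {
    "side effect": ["dizzy", "nausea", "rash", "headache"],
    "drug-drug interaction": ["interaction", "taking with", "combined with"],
    "formulation information": ["tablet", "capsule", "dosage form"],
    "complaint": ["broken", "damaged", "refund"],
}

def tag_reasons(transcript: str) -> list[str]:
    """Return every reason-for-calling category whose keywords appear."""
    text = transcript.lower()
    return [reason for reason, words in REASON_KEYWORDS.items()
            if any(word in text for word in words)]

call = "Patient reports nausea and a rash after taking with ibuprofen."
print(tag_reasons(call))  # -> ['side effect', 'drug-drug interaction']
```

Tags produced this way can then feed the visualization and predictive-modeling stages described above.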

Key product questions posed by Medical Informatics, to examine unexpected side effects, off-label use, lack of efficacy, dose-related questions, and separating side effects from pre-existing conditions.

Text analytics enabled the medical affairs researchers to explore drug-disease associations in depth, by searching the call logs for information on pre-existing conditions and relating these to the potential side effects reported. These associations showed that over 70% of the reported side effects could be attributed to underlying pre-existing conditions rather than to an adverse drug reaction (ADR).

So does this count as big data? It all depends on your definition, of course. But against the classic 3 Vs (velocity, variety, and volume) there may be a fit: these feeds are unstructured, complex text, and Pfizer receives about 1 million messages per year on its 1-800 number. So, not huge velocity, but reasonable volume, and definitely variety. And, if analyzed well, there’s huge potential value. At least, that’s our view; we’d love to hear what you think.



I2E Querying: License Consumption

February 23, 2015

There are a variety of ways to run searches using I2E, but for most purposes the modes can be simplified to:

  1. Search using the I2E Java Client, and
  2. Everything else

This distinction is important for users, administrators and developers because access to querying is licensed along the same lines. Today’s post explains the differences between the two modes, and how to make sure you’re using your existing capabilities in the most efficient way, with reference to license pools, capabilities and user groups.

