Patent Landscaping – Text Analytics Extracts the Value

March 23, 2015

Patent literature is a hugely valuable source of novel information for life science research and business intelligence. The wealth of knowledge disclosed in patents may not be found in other information sources, such as MEDLINE or full text journal articles.

Patent landscape reports (also known as patent mapping or IP landscaping) provide a snap-shot of the patent situation of a specific technology, and can be used to understand freedom to operate issues, to identify in- and out-licensing opportunities, to examine competitor strengths and weaknesses, or as part of a more comprehensive market analysis.

These are valuable searches, but they demand advanced search and data visualization techniques, as any particular landscape report requires examination of many hundreds or thousands of patent documents. Patent text is unstructured; the information needed is often embedded within the body of the patent and may be scattered throughout lengthy descriptions; and the language is often complex and designed to obfuscate.

Text analytics can provide the key to unlock the value. A recent paper by a team at Bristol Myers Squibb describes a novel workflow to discover trends in kinase assay technology. The aim was to strengthen their internal kinase screening technology, with the first step being to analyze industry trends and benchmark BMS’ capabilities against other pharmaceutical companies, with key questions including:

  • What are the kinase assay technology trends?
  • What are the trends for different therapeutic areas?
  • What are the trends for technology platforms used by the big pharmaceutical companies?

The BMS team built a workflow using several tools: Minesoft’s Patbase, for the initial patent document set collection; Linguamatics I2E, for knowledge extraction; and TIBCO’s Spotfire, for data visualization. The project used I2E to create precise, effective search queries to extract key information around 500 kinases, 5 key screening technologies, 5 therapeutic areas, and across 14 pharmaceutical companies. Use of I2E allowed queries to be designed using domain specific vocabularies for these information entities, for example using over 10,000 synonyms for the kinases, hugely improving the recall of these patent searches. These I2E “macros” enabled information to be extracted regardless of how the facts were described by inventors. Using these vocabularies also allowed semantic normalization; so however the assignee described a concept, the output was standardized to a preferred term, for example, Pfizer for Wyeth, Warner Lambert, etc.

Using I2E also meant that searches could be focused on specific regions of the patent documents for more precise search; for example, the kinase information was extracted from claims (enhancing the precision of the search).
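To make the two ideas above concrete (synonym-based normalization to a preferred term, and restricting a search to one region of the patent), here is a minimal Python sketch. It is not I2E and does not reflect how I2E works internally; the tiny synonym tables, the dictionary layout of a patent, and the function names `normalize_assignee` and `find_terms_in_section` are all assumptions made for illustration.

```python
import re

# Hypothetical synonym tables: each variant maps to a preferred term.
# The Pfizer entries follow the example in the text; real vocabularies
# would hold thousands of entries.
ASSIGNEE_SYNONYMS = {
    "pfizer": "Pfizer",
    "wyeth": "Pfizer",
    "warner lambert": "Pfizer",
    "warner-lambert": "Pfizer",
}

KINASE_SYNONYMS = {
    "egfr": "EGFR",
    "erbb1": "EGFR",
    "epidermal growth factor receptor": "EGFR",
    "abl1": "ABL1",
    "c-abl": "ABL1",
}


def normalize_assignee(raw_name):
    """Return the preferred term for an assignee, or the cleaned input if unknown."""
    cleaned = re.sub(r"\s+", " ", raw_name.strip())
    return ASSIGNEE_SYNONYMS.get(cleaned.lower(), cleaned)


def find_terms_in_section(patent, section, synonyms):
    """Return sorted preferred terms whose synonyms occur in one named section."""
    text = patent.get(section, "").lower()
    found = set()
    for variant, preferred in synonyms.items():
        # Word boundaries avoid matching a synonym inside a longer word.
        if re.search(r"\b" + re.escape(variant) + r"\b", text):
            found.add(preferred)
    return sorted(found)


# Assumed patent layout: one string per section.
example_patent = {
    "claims": "A method of screening inhibitors of ErbB1 in a cell-based assay.",
    "description": "Background: c-Abl has been studied extensively.",
}

claims_hits = find_terms_in_section(example_patent, "claims", KINASE_SYNONYMS)
print(normalize_assignee("Warner  Lambert"), claims_hits)
```

Searching only the `claims` entry reports EGFR and ignores the kinase that appears only in the background description, which is the precision benefit described above; the synonym table is what provides the recall.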

Using this novel approach, the patent analysis team mined over 7100 full text patents: approximately half a million pages of full text, searched for relevant kinase technology trends and the corresponding therapeutic area information. To put the business value into perspective, it takes ~1h to manually read one patent for data extraction, so a scope this large would require around 175 person-weeks (or nearly 3.5 years!) to accomplish. The authors state that innovative use of I2E enabled a 99% efficiency gain for delivering the relevant information. They also say that this project took 2 patent analysts 3 months (i.e. about 25 person-weeks in total), which is a 7-fold saving in FTE time.
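For anyone who wants to check the effort estimate, the arithmetic can be reproduced in a few lines of Python. The patent count, the one hour per patent, and the 25 weeks come from the paragraph above; the 40-hour working week and 52-week year are my assumptions, not figures from the paper.

```python
# Figures quoted in the post
PATENTS = 7100            # full text patents mined
HOURS_PER_PATENT = 1      # approximate manual reading time per patent
ACTUAL_PERSON_WEEKS = 25  # 2 analysts for 3 months, as rounded in the post

# Assumptions (not from the paper)
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52

manual_hours = PATENTS * HOURS_PER_PATENT
manual_person_weeks = manual_hours / HOURS_PER_WEEK
manual_person_years = manual_person_weeks / WEEKS_PER_YEAR
fold_saving = manual_person_weeks / ACTUAL_PERSON_WEEKS

print(f"Manual effort: {manual_person_weeks:.1f} person-weeks "
      f"(about {manual_person_years:.1f} person-years)")
print(f"Saving versus the actual project: about {fold_saving:.1f}-fold")
```

Under these assumptions the manual estimate comes out slightly above the rounded 175 person-weeks quoted, and the ratio to 25 person-weeks is roughly 7.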

The deliverables provided key actionable insights that empowered timely business decisions for BMS researchers; and this paper demonstrates that rich information contained in full text patents can be analyzed if innovative tools/methods are used.


Data to knowledge: visualization of the different pharma companies and the numbers of relevant patents for each in the kinase assay technologies. Taken from Yang et al. (2014) WPI 39: 24-34.


Big data, real world data… where does text analytics fit in?

February 26, 2015

Big data? Real world data? What do we really mean…?

I was at a conference a couple of weeks ago, an interesting two days spent discussing what big data is in the life science domain, and what value we can expect to gain from better access and use. The keynote speaker kicked off the first day with a great quote from Atul Butte: “Hiding within those mounds of data is knowledge that could change the life of a patient or change the world”.

This is a really great ambition for data analytics. But one interesting topic was: what do we mean by big data? One common definition from some of the Pharma folk was that it is any sort of data relating to patient information that originated outside their organization. To me, this definition seems to refer more to real world data – adverse event reports, electronic health records, voice of the customer (VoC) feeds, social media data, claims data, patient group blogs. In other words, any data that hasn’t been influenced by the drug provider, and can give an external view – either from the patient, payer, or healthcare provider.

Many of these real world sources have free text fields, and this is where text analytics, and natural language processing (NLP), can fit in. We have customers who are using text analytics to get actionable insight from real world data – and finding valuable intelligence that can inform commercial business strategies. Valuable information could be found in electronic health records, but these are notoriously hard to access for Pharma, with regulations and restrictions around data use, data privacy etc. So, what real world data are accessible? Our customers are inventive, and have used data types such as clinical trial reports, clinical investigator brochures, National Comprehensive Cancer Network (NCCN) guidelines, and VoC call transcripts.

VoC call transcripts are a rich seam of potential patient reported outcomes, side effects, drug interactions, and more. The medical information group at Pfizer have used Linguamatics I2E text analytics solution to access insights that can have a huge impact on commercial business decisions. It has been their strategic goal to efficiently analyze unstructured data to prompt decision makers to the signals that come from users of Pfizer products.

Workflow for text analytics over unstructured VoC feeds

Researchers in the predictive analytics group built a workflow to take the call transcripts, process them using advanced text analytics to make sense of the unstructured feeds, and visualize the output to see trends, and build predictive models around the different products and the real-world data coming back from patients, consumers, medical assistants, pharmacists, or sales representatives. The calls could be categorized and tagged for key metadata such as caller demographics, and reason for calling (e.g. complaint, formulation information, side effect, drug-drug interactions).
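To give a feel for the categorization step, here is a deliberately crude Python sketch that tags a transcript with a reason for calling. It is not the Pfizer workflow and not I2E: the keyword lists and the `tag_call` function are hypothetical, and plain substring matching like this is far weaker than real NLP (for instance, it has no notion of negation or context).

```python
# Hypothetical keyword lists per reason for calling. The category names
# follow the examples in the text; the keywords are invented.
REASON_KEYWORDS = {
    "side effect": ["side effect", "nausea", "rash", "dizzy"],
    "drug-drug interaction": ["interact", "taking it with", "together with"],
    "formulation information": ["tablet", "capsule", "ingredient", "gluten"],
    "complaint": ["complain", "broken", "damaged"],
}


def tag_call(transcript):
    """Return every reason whose keywords appear in the transcript."""
    text = transcript.lower()
    tags = [
        reason
        for reason, keywords in REASON_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    # Calls that match nothing are kept, so they can be reviewed by hand.
    return tags or ["uncategorized"]


calls = [
    "I felt dizzy after the second dose.",
    "Does the capsule contain gluten?",
    "What are your opening hours?",
]
for call in calls:
    print(tag_call(call), "<-", call)
```

Once each call carries tags like these, counting tags per product or per month is enough to start charting trends.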

Key product questions posed by Medical Informatics, to examine unexpected side effects, off-label use, lack of efficacy, dose-related questions, and separating side effects from pre-existing conditions.

Text analytics enabled the medical affairs researchers to deepen their understanding of drug-disease associations, by looking within the call logs for information on pre-existing conditions, and relating these to the potential side effects reported in the call log. These associations enabled over 70% of the reported side effects to be related to underlying pre-existing conditions – and not to an adverse drug reaction (ADR).

So does this count as big data? Of course it all depends on your definition. But if you think of the classic 3 Vs – velocity, variety, and volume – then maybe there is a fit – these feeds are unstructured complex text, and Pfizer receive about 1 million messages per year on their 1-800 number. So, not huge velocity, but reasonable volume, and definitely variety.  And, if analysed well, there’s huge potential value.  At least, that’s our view – we’d love to hear what you think?



I2E Querying: License Consumption

February 23, 2015

There’s a variety of ways of running searches using I2E but for most purposes, the modes can be simplified to:

  1. Search using the I2E Java Client, and
  2. Everything else

This distinction is important for users, administrators and developers because access to querying is licensed in the same way. Today’s post will explain the differences between the two modes as well as how to make sure that you’re using your existing capabilities in the most efficient way, with reference to license pools, capabilities and user groups.

Picking your brain: Synergy of OMIM and PubMed in Understanding Gene-Disease Associations for Synapse Proteins

January 5, 2015

I read with interest a recent publication which sheds light on the complex interactions of synapse protein complexes with human disease. The study (run by the Genes to Cognition neuroscience research programme) combined wet-lab research with bioinformatics and text analytics to uncover genetic associations with these protein complexes in over seventy human brain diseases, including Alzheimer’s Disease, Schizophrenia and Autism spectrum disorders. The idea was to identify and develop suitable screening assays for synapse proteomes from post-mortem and neurosurgical brain samples, focusing specifically on Membrane-associated guanylate kinase (MAGUK) associated signalling complexes (MASC).

Our CTO, David Milward, was involved in the text analytics work. He used the natural language processing capabilities of the Linguamatics I2E platform to extract gene-mutation-disease associations from PubMed abstracts. The flexibility of I2E enabled an appropriate balance of recall and precision, thus providing comprehensive results while not overloading curators with noise. Queries were built using linguistic patterns to allow associations to be discovered between a list of several thousand relevant gene identifiers and appropriate MedDRA disease terms. The key aim was to provide comprehensive results with suitable accuracy to allow fast curation. These text-mined results were combined with data from Online Mendelian Inheritance in Man (OMIM) on human MASC genes and genetic disease associations.

In total, 143 gene-disease associations were found: 26 in both OMIM and extracted from PubMed abstracts via text-mining; 68 in OMIM alone; and 49 via text mining from PubMed alone.

I wanted to dig a little deeper into the data from the paper and the comparison of OMIM and PubMed. Supplementary Table 5 has information on the list of genes coding for MASC proteins and causing inherited diseases as described in the OMIM repository, or identified using text mining software as associated to disease. In total, 143 gene-disease associations were found (see Figure), but only 26 associations were found in both sources. This shows the synergistic value of combining data from these two sources, and the need for integration of multiple sources to get the fullest picture possible, for any particular gene-disease involvement.
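The overlap is easy to explore with plain set operations. In the Python sketch below, the counts (26, 68, 49) are the ones quoted above, but the association identifiers are just placeholder integers that I made up so that the sets have the right sizes; they are not the real gene-disease pairs from Supplementary Table 5.

```python
# Placeholder IDs only: integers stand in for gene-disease associations.
# 0..93 gives 94 OMIM associations; 68..142 gives 75 text-mined ones;
# the shared range 68..93 contains the 26 associations found in both.
omim = set(range(0, 94))
pubmed = set(range(68, 143))

in_both = omim & pubmed
omim_only = omim - pubmed
pubmed_only = pubmed - omim
combined = omim | pubmed

overlap_share = 100 * len(in_both) / len(combined)
gain_over_omim = 100 * len(pubmed_only) / len(omim)

print(f"Both: {len(in_both)}, OMIM only: {len(omim_only)}, "
      f"PubMed only: {len(pubmed_only)}, total: {len(combined)}")
print(f"Found in both sources: {overlap_share:.1f}% of all associations")
print(f"Text mining adds {gain_over_omim:.1f}% on top of OMIM alone")
```

Seen this way, fewer than one in five associations appears in both sources, and adding the text-mined set extends the OMIM list by roughly half.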

Advancing clinical trial R&D: industry’s most powerful NLP text analytics meets world-class curated life sciences content

December 19, 2014

What challenges were seen in competitive R&D and clinical stages? What outcomes were measured in related trials? Does the drug I am creating have potential efficacy or safety challenges? What does the patient population look like?

These are the sort of critical business questions that many life science researchers need to answer. And now, there’s a solution that can help you.

We all know the importance of high quality content you can depend on when it comes to making key business decisions across the pharma life cycle. We also know that the best way to get from textual data to new insights is using natural language processing-based text analytics. And that’s where our partnership with Thomson Reuters comes in. We’ve worked together on a solution to bring Linguamatics’ market-leading text mining platform, I2E, together with Thomson Reuters Cortellis high-quality clinical and epidemiology content: Cortellis Informatics Clinical Text Analytics for I2E.

Cortellis Informatics Clinical Text Analytics for I2E applies the power of natural language processing-based text mining from Linguamatics I2E to Cortellis clinical and epidemiology content sets. Taking this approach allows users to rapidly extract relevant information using the advanced search capabilities of I2E. The solution also allows users to identify concepts using a rich set of combined vocabularies from Thomson Reuters and Linguamatics.

Through one single interface users can quickly and easily gain access to new insights to support R&D, clinical development and clinical operations. This is the first time a cloud-based text mining service has been applied to commercial grade clinical and epidemiology content. The wide-ranging content set consists of global clinical trial reports, literature, press releases, conferences and epidemiology data in a secure, ready-to-use on-demand format.

Key features of the solution include:

  • High precision information extraction, using state of the art text analytics, combined with high quality, hand curated data
  • Search using a combination of Cortellis ontologies, plus domain specific and proprietary ontologies.
  • Find numeric information e.g. experimental assay results, patient numbers, trial outcome timepoints, financial values, dates.
  • Generate data tables to support you in your preclinical studies, trial design, and in understanding the impact of clinical trials.
  • Generate new hypotheses through identification of entity relationships in unstructured text e.g. assay and indication.

To find out more about how to save time and get better results from your clinical data searches, visit the Linguamatics website or contact us to gain access.

Ebola: Text analytics over patent sources for medicinal chemistry

November 13, 2014

The 2014 Ebola outbreak is officially the deadliest in history. Governments and organizations are searching for ways to halt the spread – both responding with humanitarian help, and looking for treatments to prevent or cure the viral infection.

CDC image of Ebola virus

Ebola virus disease (or Ebola haemorrhagic fever) is caused by the Ebola filovirus

A couple of weeks ago we received a tweet from Chris Southan, who has been looking at crowdsourcing anti-Ebola medicinal chemistry. He asked us to mine Ebola C07D patents (i.e. those for heterocyclic small molecules, the standard chemistry for most drugs) using our text analytics tool I2E, and provide him with the resulting chemical structures.

We wanted to help. What anti-Ebola research has been patented that might provide value to the scientific community? Searching patents for chemistry using an automated approach is notoriously tricky; patent documents are long and often purposefully obfuscated, with chemicals frequently obscured by the complex language used to describe them, corrupted by OCR errors, or lost in the overall poor formatting of the patents.

Andrew Hinton, one of our Application Specialists with a background in chemistry, used I2E to map the patent landscape around Ebola, identify patents for small molecules described to target Ebola, and extract the chemical structures. He compiled queries to answer the key questions and find those patents which were most relevant:

  • Does the patent mention Ebola or Ebola-like diseases? More importantly, is Ebola the major focus of the patent?
  • Who is the pharma or biotech company?
  • Is it a small molecule or non-small molecule patent?
  • What’s the exemplified chemistry? What’s the claimed chemistry? What’s the Markush chemistry?
  • What chemistry is found as an image? What chemistry is found in a table? Can we extract these structures too?

Andrew ran these queries over patents from USPTO, EPO and WIPO for the past 20 years on data derived from IFI CLAIMS.
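As a rough illustration of the first filtering step (does the patent mention Ebola, and is it classified under C07D?), here is a small Python sketch. It is not how the I2E queries were built; the record layout, the `mentions_ebola`, `is_c07d` and `count_by_year` helpers, and the four patent records are invented placeholders, and a simple keyword test says nothing about whether Ebola is the major focus of a patent.

```python
from collections import Counter

# Invented placeholder records; real data would come from a patent feed.
PATENTS = [
    {"year": 2010, "ipc_codes": ["C07D 401/04", "A61K 31/44"],
     "text": "Compounds for treating Ebola virus infection."},
    {"year": 2010, "ipc_codes": ["C12N 15/11"],
     "text": "Antisense agents against Ebola and Marburg viruses."},
    {"year": 2012, "ipc_codes": ["C07D 487/04"],
     "text": "Kinase inhibitors for oncology."},
    {"year": 2012, "ipc_codes": ["C07D 239/42"],
     "text": "Heterocyclic antivirals active against EBOLA."},
]


def mentions_ebola(patent):
    """Crude keyword test on the patent text."""
    return "ebola" in patent["text"].lower()


def is_c07d(patent):
    """True if any classification code falls under C07D."""
    return any(code.startswith("C07D") for code in patent["ipc_codes"])


def count_by_year(patents):
    """Count Ebola-mentioning patents per year, split by C07D status."""
    counts = {"C07D": Counter(), "non-C07D": Counter()}
    for patent in patents:
        if not mentions_ebola(patent):
            continue
        key = "C07D" if is_c07d(patent) else "non-C07D"
        counts[key][patent["year"]] += 1
    return counts


yearly = count_by_year(PATENTS)
print(dict(yearly["C07D"]), dict(yearly["non-C07D"]))
```

Per-year counts split this way are the kind of input needed for a two-axis trend chart of C07D versus non-C07D patents.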

Ebola patent trend over past decade

Graph showing C07D patents (blue, left-hand axis) and non-C07D patents (red line, right-hand axis) for Ebola related patents from 1994 to 2014, from the three major patent registries [please note different scales for the axes].

The results showed a general increase in the number of patent records related to Ebola, but the numbers are comparatively small – for example, there were about 50k C07D patents published in 2010 across all therapeutic areas; of these, we found that only about 100 patents related to Ebola (and the number of truly unique patent families is likely to be a smaller subset of that figure). This isn’t really that surprising; as with most viral diseases, the main emphasis for therapies has been on biologics and other non-small-molecule treatments – in fact, of the 16k total patents that mention Ebola, only 1% are C07D patents focused specifically on Ebola.

Ebola patents - ranked by organisations

Heatmap showing that the top 3 organizations with small molecule C07D patents in this area contribute 1/5 of all Ebola patents.

So what is the outcome of this? Using I2E, we have been able to extract the set of molecules reported in these Ebola-related patents, and will provide a set of these to Chris Southan for his chemoinformatics analysis. Let’s hope that this little step might further research towards providing a solution to the current Ebola epidemic.

Ebola patents - chemistry table

Screenshot of I2E results showing names and structures extracted using text analytics from Ebola-related patent set. Structures are generated using ChemAxon’s Name-to-Structure conversion tools (

Discovering new uses for NLP text analytics at the Linguamatics Text Mining Summit

October 23, 2014

This year, Linguamatics returned to the beautiful town of Newport, Rhode Island, for our annual Text Mining Summit on October 13-15. We were delighted to return to this exquisite setting, where again delegates competed over who could take the most beautiful sunrise and sunset photos.

Sunrise at Newport, Linguamatics Text Mining Summit 2014

Sunrise outside the Hyatt Regency, Newport, RI, Linguamatics Text Mining Summit 2014

Sunset at Newport, Linguamatics Text Mining Summit 2014

Sunset captured at the Linguamatics Text Mining Summit 2014

The Text Mining Summit offers unique opportunities to learn about the latest use cases of Natural Language Processing (NLP) text analytics across pharma and healthcare, plus hands-on training, networking and idea sharing. We hosted a fantastic line up of presenters from Novartis, Bristol-Myers Squibb, Georgetown University Medical Center, Spartanburg Regional Healthcare System, Boehringer Ingelheim, Pfizer, Cell Signalling Technology, Thomson Reuters, GenoSpace and Microsoft.

Mark Burfoot, Global Head, Knowledge Office for Novartis kicked off the proceedings on Tuesday with a keynote talk looking at the future of text mining and knowledge strategy within an evolving pharma landscape. Several presenters shared their use cases and experiences employing I2E. Jonathan Hartmann, Hospital Informationist, Georgetown University Medical Center, gave us an update on the use of I2E for on-the-spot clinical decision support; he said of his experience with text mining: “I2E allows me to find things I wouldn’t have been able to find.” Ryan Owens, Systems Analyst, Spartanburg Regional Healthcare, described the use of I2E to structure essential information from electronic medical records; he said what his organization was accomplishing would be impossible without the help of I2E: “You have to use NLP. There is no other way.”

Ryan Owens speaking on ‘Gaining critical information for clinical trials using NLP’

Other presentations included Jonathan Keeling, Senior Scientist, Boehringer Ingelheim, who described how researchers are linking genomics (and other ‘omics) data in TranSMART to gene annotation information extracted using I2E; and Mick Correll, Chief Operating Officer, GenoSpace, who discussed how I2E has dramatically improved the clinical trial matching process, finding trials for critically ill patients faster than otherwise possible.

Jonathan Keeling speaking on ‘Integration of text mining into the tranSMART knowledge management platform’

User presentations received a warm welcome from delegates, who contributed to very lively Q&A sessions on the different applications of NLP in improving the speed and quality of insight extraction. Delegates were also inspired by Linguamatics speakers’ presentations, which shed light on the power of I2E in more industry-specific contexts. In particular, John Brimacombe (Linguamatics Executive Chairman) laid out the roadmap to “Connecting Knowledge” through federated text mining over any content source, which will enable users to text mine multiple public and private knowledge sources in a single query.

A prime example of the potential of federated text mining was given by Andrew Garrow, Text Analytics Manager at Thomson Reuters. He previewed the Cortellis Informatics Clinical Text Analytics for I2E – exciting new capabilities to gain insights in support of clinical design from Cortellis clinical and epidemiology content sets.

Guy Singh and Andrew Garrow presenting on Cortellis Informatics Clinical Text Analytics for I2E

The evening social events gave delegates a welcome opportunity to network and continue their discussions over the risotto bar at the Hyatt and fresh New England oysters, drinks, and a stunning waterfront view at the Landing restaurant.

The Text Mining Summit was once again a very successful event which brought together Linguamatics, our partners and our users across the healthcare, pharmaceutical and biotech industries to share dialogue on the past, present and future of text mining and data analytics. Many thanks to everyone who attended and contributed.

We very much look forward to seeing you at the TMS next year and also at our Spring Users Conference in Cambridge, UK (April 13-15, 2015).
