Patent Landscaping – Text Analytics Extracts the Value

March 23, 2015

Patent literature is a hugely valuable source of novel information for life science research and business intelligence. The wealth of knowledge disclosed in patents may not be found in other information sources, such as MEDLINE or full text journal articles.

Patent landscape reports (also known as patent mapping or IP landscaping) provide a snapshot of the patent situation of a specific technology, and can be used to understand freedom-to-operate issues, to identify in- and out-licensing opportunities, to examine competitor strengths and weaknesses, or as part of a more comprehensive market analysis.

These are valuable searches, but they demand advanced search and data visualization techniques, as any particular landscape report requires examination of many hundreds or thousands of patent documents. Patent text is unstructured; the information needed is often embedded within the body of the patent and may be scattered throughout the lengthy descriptions; and the language is often complex and designed to obfuscate.

Text analytics can provide the key to unlock this value. A recent paper by a team at Bristol-Myers Squibb describes a novel workflow to discover trends in kinase assay technology. The aim was to strengthen their internal kinase screening technology; the first step was to analyze industry trends and benchmark BMS's capabilities against other pharmaceutical companies, with key questions including:

  • What are the kinase assay technology trends?
  • What are the trends for different therapeutic areas?
  • What are the trends for technology platforms used by the big pharmaceutical companies?

The BMS team built a workflow using several tools: Minesoft's PatBase for the initial patent document set collection; Linguamatics I2E for knowledge extraction; and TIBCO's Spotfire for data visualization. The project used I2E to create precise, effective search queries to extract key information around 500 kinases, 5 key screening technologies, 5 therapeutic areas, and 14 pharmaceutical companies. I2E allowed queries to be designed using domain-specific vocabularies for these entity types, for example using over 10,000 synonyms for the kinases, hugely improving the recall of these patent searches. These I2E “macros” enabled information to be extracted regardless of how the facts were described by inventors. Using these vocabularies also allowed semantic normalization: however the assignee described a concept, the output was standardized to a preferred term, for example, Pfizer for Wyeth, Warner-Lambert, etc.
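This kind of semantic normalization can be pictured as a simple lookup from surface form to preferred term. The sketch below is a minimal illustration of the idea, not Linguamatics' actual macros; the synonym table is invented for the example:

```python
# Hypothetical sketch of semantic normalization: map the many ways an
# assignee can appear in patent text onto one preferred company name.
# This tiny table is illustrative; a real vocabulary would hold
# thousands of synonyms per entity type.
SYNONYMS = {
    "wyeth": "Pfizer",
    "warner lambert": "Pfizer",
    "warner-lambert": "Pfizer",
    "pfizer inc.": "Pfizer",
    "bristol-myers squibb": "Bristol-Myers Squibb",
    "bms": "Bristol-Myers Squibb",
}

def normalize_assignee(raw: str) -> str:
    """Return the preferred term for a raw assignee string, or the input unchanged."""
    return SYNONYMS.get(raw.strip().lower(), raw.strip())

print(normalize_assignee("Warner Lambert"))  # Pfizer
print(normalize_assignee("BMS"))             # Bristol-Myers Squibb
```

Whatever variant the inventor used, downstream analysis and visualization then see a single standardized name.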

Using I2E also meant that searches could be focused on specific regions of the patent documents; for example, the kinase information was extracted from the claims, enhancing the precision of the search.

Using this novel approach, the patent analysis team mined over 7,100 full-text patents (approximately half a million pages of full text) for relevant kinase technology trends and the corresponding therapeutic area information. To put the business value into perspective: it takes roughly one hour to manually read one patent for data extraction, so a scope this large would require around 175 person-weeks (nearly 3.5 years) to accomplish. The authors state that innovative use of I2E enabled a 99% efficiency gain in delivering the relevant information. They also note that the project took two patent analysts three months (about 25 person-weeks), a 7-fold saving in FTE time.

The deliverables provided key actionable insights that empowered timely business decisions for BMS researchers; and this paper demonstrates that rich information contained in full text patents can be analyzed if innovative tools/methods are used.


Data to knowledge: visualization of the different pharma companies and the numbers of relevant patents for each in the kinase assay technologies. Taken from Yang et al. (2014) WPI 39: 24-34.

Advancing clinical trial R&D: industry’s most powerful NLP text analytics meets world-class curated life sciences content

December 19, 2014

What challenges were seen in competitive R&D and clinical stages? What outcomes were measured in related trials? Does the drug I am creating have potential efficacy or safety challenges? What does the patient population look like?

These are the sorts of critical business questions that many life science researchers need to answer. And now, there's a solution that can help you.

We all know the importance of high-quality content you can depend on when it comes to making key business decisions across the pharma life cycle. We also know that the best way to get from textual data to new insights is to use natural language processing-based text analytics. And that's where our partnership with Thomson Reuters comes in. We've worked together on a solution that brings Linguamatics' market-leading text mining platform, I2E, together with Thomson Reuters' high-quality Cortellis clinical and epidemiology content: Cortellis Informatics Clinical Text Analytics for I2E.

Cortellis Informatics Clinical Text Analytics for I2E applies the power of natural language processing-based text mining from Linguamatics I2E to Cortellis clinical and epidemiology content sets. Taking this approach allows users to rapidly extract relevant information using the advanced search capabilities of I2E. The solution also allows users to identify concepts using a rich set of combined vocabularies from Thomson Reuters and Linguamatics.

Through a single interface, users can quickly and easily gain access to new insights to support R&D, clinical development and clinical operations. This is the first time a cloud-based text mining service has been applied to commercial-grade clinical and epidemiology content. The wide-ranging content set consists of global clinical trial reports, literature, press releases, conference proceedings and epidemiology data in a secure, ready-to-use, on-demand format.

Key features of the solution include:

  • High-precision information extraction, using state-of-the-art text analytics combined with high-quality, hand-curated data.
  • Search using a combination of Cortellis ontologies plus domain-specific and proprietary ontologies.
  • Find numeric information, e.g. experimental assay results, patient numbers, trial outcome timepoints, financial values and dates.
  • Generate data tables to support you in your preclinical studies, in trial design, and in understanding the impact of clinical trials.
  • Generate new hypotheses through identification of entity relationships in unstructured text, e.g. assay and indication.

To find out more about how to save time and get better results from your clinical data searches, visit the Linguamatics website or contact us to gain access.

Mining big data for key insights into healthcare, life sciences and social media at the Linguamatics San Francisco Seminar

September 3, 2014

On August 26th, Natural Language Processing (NLP) and text analytics experts from pharmaceutical and biotech companies, healthcare providers and payers gathered to discuss the latest industry trends and to hear product news and case studies from Linguamatics.

The keynote presentation from Dr Gabriel Escobar was the highlight of the event, covering a rehospitalization prediction project that the Kaiser Permanente Department of Research has been working on in collaboration with Linguamatics. The predictive model was developed using a cohort of approximately 400,000 patients and combines scores from structured clinical data with factors derived from unstructured data using I2E. Key factors that could affect a patient's likelihood of rehospitalization are trapped in text, including ambulatory status, social support network and functional status. I2E queries enabled KP to extract these factors and use them to improve the accuracy of the predictive score derived from structured data.

Leading the use of I2E in healthcare, Linguamatics showed how cancer centers are working together to develop queries for pathology reports, mine medical literature and predict pneumonia from radiology reports. They also demonstrated a prototype application to match patients to clinical trials, and a cohort selection tool using semantic tagging of patient narratives in the Apache Solr search engine.

Semantic enrichment was discussed in the context of life sciences, using SharePoint as the search engine. This drew great interest from the many life science companies in the audience, who understand the need to improve searching of internal scientific data. The discussion highlighted the challenges of getting a consistent view of internal and external data. The latest version of I2E addresses this challenge with a new federated capability that provides merged result sets from internal and external searches. These new I2E capabilities have strong potential to improve insight, and they also incorporate a model that allows content providers to more actively support text mining.

Another hot topic was mining social media and news outlets for competitive intelligence and insights into company and product sentiment. The mass of information now available from social media requires a careful strategy of multiple levels of filtering to extract the relevant data from the millions of tweets and postings that occur daily. Once identified, this data can be text mined, but users need to factor in support for abbreviations and the shorter linguistic syntax of social media. Mining social media and news outlets is an area that will continue to grow and require active support.
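The multi-level filtering described above can be sketched as a staged pipeline: a cheap relevance filter runs first, then abbreviation expansion prepares the surviving posts for deeper text mining. Everything here (keyword list, abbreviation table, posts) is invented for illustration and is far simpler than a production system:

```python
# Illustrative sketch of multi-level filtering for social media mining.
# Stage 1 cheaply discards irrelevant posts by keyword; stage 2 expands
# domain abbreviations before any deeper NLP step.
KEYWORDS = {"drug", "trial", "fda", "side effect"}
ABBREVIATIONS = {"fda": "Food and Drug Administration", "ae": "adverse event"}

def relevant(post: str) -> bool:
    """Stage 1: keep only posts mentioning a domain keyword."""
    text = post.lower()
    return any(k in text for k in KEYWORDS)

def expand(post: str) -> str:
    """Stage 2: expand known abbreviations word by word."""
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in post.split())

posts = ["Lunch was great!", "New drug trial results from the FDA"]
filtered = [expand(p) for p in posts if relevant(p)]
print(filtered)
```

Real pipelines would add further stages (language detection, spam removal, sentiment scoring), but the principle of progressively cheaper-to-richer filters is the same.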

Linguamatics were grateful for such an engaged and interactive audience and look forward to future discussions on these exciting trends. Keep an eye out for information about our upcoming Text Mining Summit.

How to protect and develop your enterprise search investment with text analytics

June 24, 2014

It’s funny, isn’t it? Search at home just works. You’re looking for a holiday, train times, a particular recipe or the answer to your kid’s homework. You sit down and type your keywords into your search engine. Milliseconds later, results appear. The one you’re looking for is usually among the first; you click on it and voila! You have what you were looking for.

But search at work doesn’t seem to be as effective. Maybe you are looking for information internally. You know it exists but you’re not quite sure where. The information lies across silos and it’s a mix of structured and unstructured data. As a scientist, it’s important for you to easily find information hidden in memos, project plans, meeting minutes, study reports, literature, etc. You type a keyword search into your enterprise search engine. A list of documents comes back, but none of them looks like the one you want. You feel like you’re wasting your time. Sound familiar?

You’re not alone. At least, that is what recent surveys and conferences on enterprise search have revealed. According to a recent report from Findwise, 64% of organizations say it’s difficult to find information within their organization. Why?

  • Poor search functionality
  • Inconsistencies in how information is tagged
  • People don’t know where to look or what to look for

So how can we address this? Well, there’s already been talk of using text analytics to improve enterprise search. Text analytics, also referred to as text mining, allows users to go beyond keyword search to interpret the meaning of text in documents. While text analytics solutions have existed for some years now, more recently they’ve been working in harmony with enterprise search to improve the quality of results and make information more discoverable.

Let me give you an example. For over 10 years, Linguamatics I2E has been mining data and content such as scientific literature, patents, clinical trials data, news feeds, electronic health records, social media and proprietary content, working with 17 of the top 20 pharmaceutical companies to improve and power their knowledge discovery. Meanwhile, organizations have been deploying enterprise search engines to search internally.

Having been dissatisfied with their search solution and familiar with using I2E in other areas, a top 20 pharma wanted to see if the power of I2E’s text analytics could be applied to their enterprise search system. A proof of concept was proposed using Microsoft SharePoint. The organization did some internal requirements gathering and worked with both Microsoft and Linguamatics to come up with a solution to improve their search. I2E worked in the background, using its natural language processing technology to identify concepts and mark up semantic entities such as genes, drugs, diseases, organizations, anatomy, authors and other relevant concepts and relationships. Once the documents were annotated, taxonomies/thesauri were built and the marked-up documents were fed back into SharePoint.

To the users, the search interface remained the same, but there was a difference in the results. I2E was able to provide semantic facets that let the user quickly filter the results to what they were looking for. The facets were concepts rather than words, which allowed users to filter results down to a more intuitive set of things, e.g. just show me the results for ‘breast cancer’ as a concept. This would also include all results containing variations of that concept in the text, e.g. breast carcinoma, breast tumor, cancer of the breast, etc. In addition, I2E provided SharePoint with the ability to autocomplete terms as the user was typing them, and when performing the search, SharePoint was taught to look for synonyms of the words typed in.
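Concept-based faceting of this kind can be pictured as annotating each document with normalized concept IDs and filtering on those IDs rather than on raw words. The sketch below is a hedged illustration with an invented variant list and documents; it is not how I2E or SharePoint actually implement the feature:

```python
# Sketch of concept faceting: map surface variants to one concept ID,
# annotate documents with the concepts they mention, then filter on the
# concept rather than the literal query string.
VARIANTS = {
    "breast cancer": "breast cancer",
    "breast carcinoma": "breast cancer",
    "breast tumor": "breast cancer",
    "cancer of the breast": "breast cancer",
    "lung cancer": "lung cancer",
}

def annotate(doc: str) -> set:
    """Return the set of normalized concepts mentioned in a document."""
    text = doc.lower()
    return {concept for variant, concept in VARIANTS.items() if variant in text}

docs = [
    "Study of breast carcinoma recurrence rates",
    "Lung cancer screening guidelines",
    "Trial outcomes in cancer of the breast",
]
# Filtering on the 'breast cancer' facet matches both variant phrasings.
hits = [d for d in docs if "breast cancer" in annotate(d)]
print(hits)
```

A keyword search for “breast cancer” would miss both matching documents here; the concept facet finds them regardless of phrasing.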

The organization was incredibly happy with the improved search performance, stating the main benefits as improved efficiency, higher-quality search results, and information that became more transparent and available, stimulating innovation within the organization.

This is just the beginning. The capabilities of I2E could also be applied to other search engines and scenarios where search needs improving, increasing the return on the investment made in the system, protecting and developing future investments, and improving usage and findability.

If you’d like to find out more, sign up for Linguamatics’ webinar or contact us for a demo.

Are the terms text mining and text analytics largely interchangeable?

January 17, 2014

I saw this comment in a recent article by Seth Grimes discussing the terms Text Analysis and Text Analytics (see Naming and Classifying: Text Analysis vs. Text Analytics). In the article, Mr. Grimes states that text mining and text analytics are largely interchangeable terms:

“The terms “text analytics” and “text mining” are largely interchangeable. They name the same set of methods, software tools, and applications. Their distinction stems primarily from the background of the person using each — “text mining” seems most used by data miners, and “text analytics” by individuals and organizations in domains where the road to insight was paved by business intelligence tools and methods — so that the difference is largely a matter of dialect.”

I asked Linguamatics CTO, David Milward, for his thoughts:

There is certainly overlap, but I think there are cases of analytics that would not be classed as text mining, and vice versa. Text analytics tends to be more about processing a document collection as a whole, while text mining traditionally has more of a needle-in-a-haystack connotation.

For example, word clouds might be classified as text analytics, but not text mining. Use of natural language processing (NLP) for advanced searching is not so naturally classified under text analytics. Something like the Linguamatics I2E text mining platform is used for many text analytics applications, but its agile nature means it is also used as an alternative to professional search tools.

A further term is Text Data Mining. This is usually used to distinguish cases where new knowledge is being generated, rather than old knowledge being rediscovered. The typical case is indirect relationships: one item is associated with a second item in one document, and the second item is associated with a third in another document. This provides a possible relationship between the first item and the third item: something which may not be possible to find within any one document.
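The indirect-relationship idea can be sketched in a few lines: collect A-B pairs from individual documents, then join them on the shared middle term to propose A-C links that no single document contains. The pairs below echo Swanson's classic fish-oil example and are purely illustrative:

```python
# Minimal sketch of "A-B-C" indirect relationship discovery: if A relates
# to B in one document and B relates to C in another, propose A-C as a
# candidate new relationship.
from itertools import product

doc_pairs = [
    ("fish oil", "blood viscosity"),   # extracted from document 1
    ("blood viscosity", "Raynaud's"),  # extracted from document 2
    ("magnesium", "migraine"),         # extracted from document 3
]

def indirect_links(pairs):
    """Join pairs on the shared middle term to find candidate A-C links."""
    links = set()
    for (a, b1), (b2, c) in product(pairs, pairs):
        if b1 == b2 and a != c:
            links.add((a, c))
    return links

print(indirect_links(doc_pairs))  # {('fish oil', "Raynaud's")}
```

Real text-data-mining systems add statistical scoring and filtering on top of this join, since naive transitive closure over large corpora produces many spurious candidates.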

Only a few days until the Text Mining Summit!

October 2, 2013

There are only a few days to go until the Linguamatics Text Mining Summit, which begins on 7th October in Newport, RI. This is an opportunity for I2E users and other developers to get hands-on access to the new version of I2E — version 4.1 — as well as attend a variety of interesting presentations and a number of training sessions.

This year, there is a training session dedicated to I2E administration and use of the Web Services API. You can meet individuals from other organizations who will share their experiences with our API, along with training material that covers the various parts of the API in sufficient detail for you to start using it yourself. There will also be a case study on using the API to create workflows that integrate text mining. And, of course, lots of demos!

I look forward to seeing you there!


Big Data in the San Francisco Bay Area

September 3, 2013

Natural Language Processing (NLP), big data and precision medicine are three of the hottest topics in healthcare at the moment, and consequently attracted a large audience to the first NLP & Big Data Symposium, focused on precision medicine. The event took place on August 27th, hosted at the new UCSF site at Mission Bay in San Francisco and sponsored by Linguamatics and the UCSF Helen Diller Family Comprehensive Cancer Center. Over 75 delegates came to hear the latest insights and projects from some of the West Coast’s leading institutions, including Kaiser Permanente, Oracle, Huntsman Cancer Institute and UCSF. The event was held in the midst of an explosion in new building development to house the latest in medical research and informatics, something that big data will be at the heart of. Linguamatics and UCSF recognized the need for a meeting on NLP in the west and put together an exciting program that clearly caught the imagination of many groups in the area.

Over 75 delegates attended the Symposium

Key presentations included:

  • The keynote presentation from Frank McCormick, Director of UCSF Helen Diller Family Comprehensive Cancer Center, was a tour de force of the latest insights into cancer research and future prospects. With the advances in genetic sequencing and associated understanding of cancer biology, we are much closer to major breakthroughs in tackling the genetic chaos of cancer
  • Kaiser Permanente presented a predictive model of pneumonia assessment based on Linguamatics I2E that has been trained and tested using over 200,000 patient records. This paper has just been published and can be found here. In addition, Kaiser presented plans for a new project on re-hospitalization that takes into account social factors in addition to standard demographic and diagnosis data
  • Huntsman Cancer Institute showed how pathology reports are being mined using I2E to extract data for use in a research data warehouse to support cohort selection for internal studies and clinical trials
  • Oracle presented their approach to enterprise data warehousing and translational medicine data management, highlighting why a sustainable single source of truth for data is key and how NLP can be used to support this environment
  • UCSF also provided an overview of the current approaches to the use of NLP in medical and research informatics, emphasizing the need for such approaches to deliver the raw data for advanced research
  • Linguamatics’ CTO, David Milward, presented a positional piece on why it is essential for EHRs and NLP to be more closely integrated and illustrated some of the challenges and approaches that can be used with I2E to overcome them
  • Linguamatics’ Tracy Gregory also showed how NLP can be used in the early stages of biomarker research to assess potential research directions and support knowledge discovery

    Panel of speakers, from left to right – Tony Sheaffer, Vinnie Liu, Brady Davis, David Milward, Samir Courdy, Tracy Gregory and Gabriel Escobar

With unstructured data accounting for 75-80% of the data in EHRs, the use of NLP in healthcare analytics and translational research is essential to achieve the required improvements in treatment and disease understanding. This event provided a great forum for understanding the current state of the art in this field and allowed participants to engage in many useful discussions. West Coast residents interested in the field can look forward to another opportunity to get together in 2014; or, if you can get over to the East Coast, the Linguamatics Text Mining Summit will take place on October 7-9, 2013 in Newport, RI.

I2E and the Semantic Web

August 6, 2013

The internet right now, as Tim Berners-Lee points out in Scientific American, is a web of documents: documents that are designed to be read, primarily, by humans. The vision behind the Semantic Web is a web of information, designed to be processed by machines. That vision is being implemented: important parts of the key enabling technologies are already in place.

RDF, the Resource Description Framework, is one such key technology. RDF is the language for expressing information in the Semantic Web. Every statement in RDF is a simple triple, which you can think of as subject/verb/object, and a set of statements is just a set of triples. Three example triples might be: Armstrong/visited/moon, Armstrong/isa/human and moon/isa/astronomical body. The power of RDF lies partly in the fact that a set of triples is also a graph, and graphs are perfect for machines to traverse and, increasingly, reason over. After all, when you surf the web, you’re just traversing the graph of hyperlinks. And that’s the second powerful feature of RDF: the individual parts, such as Armstrong and moon, are not just strings of letters but web-addressable Uniform Resource Identifiers (URIs). When I publish my little graph about Armstrong, it becomes part of a vast world-wide graph: the Semantic Web. So machines hunting for information about Armstrong can reach my graph and every other graph about Armstrong. This approach allows the web to become a huge distributed knowledge base.

There’s a new component for I2E: the XML to RDF converter. It turns standard I2E results into web-standard RDF. Each row in the table of results (each assertion) becomes one or more triples. For example, suppose you run an astronomy query against a news website and it returns the structured fact: Armstrong, visited, the moon, 1969. Let’s also suppose Armstrong and the moon were identified using an ontology derived from Wikipedia. The output RDF will include one URI to identify this structured fact and four more for the fact’s constituents (the first is Armstrong, the second is the relation of visiting, the third is the moon and so forth). Then there will be a number of triples relating these constituents, for example, that the subject of the visiting is Armstrong. In addition, all the other information available in the traditional I2E results is presented as additional triples. For example, one triple might state that Armstrong’s preferred term is “Neil Armstrong”, another might give the source concept’s identifier, and a third might state that the hit string (the text justifying the concept) is Neil Alden Armstrong. The set of possible output triples from the I2E converter is fully defined by an RDF schema.
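The conversion can be pictured as follows. The URIs, predicate names and Turtle-style serialization below are invented for illustration; the converter's actual output is defined by its RDF schema, not by this sketch:

```python
# Sketch of turning one structured result row into RDF-style triples and
# serializing them in simple Turtle syntax. The http://example.org/
# namespace and predicate names are hypothetical.
EX = "http://example.org/"

def row_to_triples(subject, relation, obj, year):
    """One result row becomes a fact URI plus triples for its constituents."""
    fact = f"<{EX}fact/1>"
    return [
        (fact, f"<{EX}subject>", f"<{EX}{subject}>"),
        (fact, f"<{EX}relation>", f"<{EX}{relation}>"),
        (fact, f"<{EX}object>", f"<{EX}{obj}>"),
        (fact, f"<{EX}year>", f'"{year}"'),
    ]

triples = row_to_triples("Armstrong", "visited", "moon", 1969)
turtle = "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
print(turtle)
```

Because each part is a URI, the emitted triples can be merged with any other graph that uses the same identifiers, which is exactly what lets I2E results join the wider Semantic Web.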

Visualization of the RDF Triple Armstrong, visited, the moon

Why is this a good thing? First and foremost, I2E results can now join the Semantic Web. Even if you don’t want to publish your results, you can still exploit the growing list of Semantic Web tools for processing your own data and integrating it with other data that has been published. Second, to quote Tim Berners-Lee again: “The Semantic Web will enable machines to COMPREHEND semantic documents and data, not human speech and writings”. I2E is all about extracting structured information from human writings. So link these two things together and you have a powerful tool for traversing the worlds of structured and unstructured data together.

Find out more about Linguamatics I2E

Calling all developers: the I2E Web Services API is for you!

June 20, 2013

The release of I2E 4.0 at the end of 2012 included a Web Services API (WSAPI) for our software for the first time. The availability of this interface, along with sample code and a sample GUI, made it possible for developers to integrate I2E into their own code.

We’ve used standard technologies when building our API, but there are many software-specific features that need to be understood before you can choose which capabilities of I2E to include in your applications. For this, we are providing additional training materials, such as training sessions at our Text Mining Summit, webinars, traditional phone and email support and, well, this blog.
