II-SDV 2015: The International Information Conference on Search, Data Mining and Visualization

May 12, 2015

The recent two-day II-SDV meeting in the beautiful town of Nice on the Côte d’Azur, France, started with a day of talks on how best to maximise the value of data extracted from a wide range of sources: patents, full-text articles and even big data.

The programme kicked off with a presentation from Aleksander Kapisoda from Boehringer Ingelheim (BI) describing how innovative use of custom search techniques, beyond those currently offered by standard public search engines, can bring tangible benefits to a global pharmaceutical company.

One theme that emerged was the potential use of text mining, particularly in constructing landscapes related to emerging technologies. Jane List (Extract information UK) described some of the tools, workflows, and visualisations for patent landscaping, with a great quote from Marcel Proust: “The real voyage of discovery consists not in seeking new landscapes, but in having new eyes”. Emmanuelle Fortune (INIP, France) discussed how world cities dubbed “Smart Cities” can be classified as hubs for technological development directly by mining the patent literature.

Staying on the topic of text mining, I presented a number of use cases related to the subject of “time” and pressed home the message that text mining can provide clear advantages for access to timely information. This presentation was followed by news from the Copyright Clearance Center (CCC) that the difficult process of obtaining legal permission for text mining has recently become a lot easier, with the ability to directly create a set of full-text documents ready for immediate use in text mining. This has long been a goal for many information scientists, as there are valuable nuggets of information in full text that just can’t be gained from mining abstracts.

Finally, the conference heard from a different group within Boehringer Ingelheim, concerned with automating the currently time-consuming process of extracting medicinally relevant chemistry from patents. Matthias Negri, collaborating with technology partners Chemaxon, has established a Knime™ workflow that uses Linguamatics I2E to extract the surrounding pharmacological context for the chemistry described within the patent, to provide “a solid information base of value to any phase of a drug discovery project”.

With participants from both the US and Europe, the conference provided a great opportunity to meet information specialists, patent experts and scientists from across life science specialities, as well as to hear from vendors about their new product developments. If you are interested in any of the topics discussed, more information can be found at the conference website, and I’d be happy to hear your comments.

Andrew Hinton, Application Specialist, Linguamatics

Patent Landscaping – Text Analytics Extracts the Value

March 23, 2015

Patent literature is a hugely valuable source of novel information for life science research and business intelligence. The wealth of knowledge disclosed in patents may not be found in other information sources, such as MEDLINE or full text journal articles.

Patent landscape reports (also known as patent mapping or IP landscaping) provide a snapshot of the patent situation of a specific technology, and can be used to understand freedom-to-operate issues, to identify in- and out-licensing opportunities, to examine competitor strengths and weaknesses, or as part of a more comprehensive market analysis.

These are valuable searches, but they demand advanced search and data visualization techniques, as any particular landscape report requires examination of many hundreds or thousands of patent documents. Patent text is unstructured; the information needed is often embedded within the body of the patent and may be scattered throughout the lengthy descriptions; and the language is often complex and designed to obfuscate.

Text analytics can provide the key to unlock the value. A recent paper by a team at Bristol Myers Squibb describes a novel workflow to discover trends in kinase assay technology. The aim was to strengthen their internal kinase screening technology, with the first step being to analyze industry trends and benchmark BMS’ capabilities against other pharmaceutical companies, with key questions including:

  • What are the kinase assay technology trends?
  • What are the trends for different therapeutic areas?
  • What are the trends for technology platforms used by the big pharmaceutical companies?

The BMS team built a workflow using several tools: Minesoft’s Patbase, for the initial patent document set collection; Linguamatics I2E, for knowledge extraction; and TIBCO’s Spotfire, for data visualization. The project used I2E to create precise, effective search queries to extract key information around 500 kinases, 5 key screening technologies and 5 therapeutic areas, across 14 pharmaceutical companies. Use of I2E allowed queries to be designed using domain-specific vocabularies for these information entities, for example using over 10,000 synonyms for the kinases, hugely improving the recall of these patent searches. These I2E “macros” enabled information to be extracted regardless of how the facts were described by inventors. Using these vocabularies also allowed semantic normalization: however the assignee described a concept, the output was standardized to a preferred term, for example, Pfizer for Wyeth, Warner Lambert, etc.
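
To illustrate the idea of semantic normalization described above, here is a minimal Python sketch. It is not I2E code and makes no claim about how I2E works internally; the synonym table, the function names and the case-insensitive matching are all assumptions made for illustration. Only the Pfizer, Wyeth and Warner Lambert grouping comes from the example in the text.

```python
# Hypothetical synonym table (illustrative only, not an I2E vocabulary).
# Each preferred term maps to the variant names that should resolve to it.
ASSIGNEE_SYNONYMS = {
    "Pfizer": ["Pfizer", "Wyeth", "Warner Lambert"],
}


def build_lookup(synonyms):
    """Invert the table so each variant points at its preferred term."""
    lookup = {}
    for preferred, variants in synonyms.items():
        for variant in variants:
            # casefold() gives case-insensitive matching (an assumption here).
            lookup[variant.casefold()] = preferred
    return lookup


def normalize(name, lookup):
    """Return the preferred term for a name, or the trimmed name if unknown."""
    cleaned = name.strip()
    return lookup.get(cleaned.casefold(), cleaned)


LOOKUP = build_lookup(ASSIGNEE_SYNONYMS)

for raw in ["Wyeth", "warner lambert", "Pfizer", "Novartis"]:
    print(raw, "->", normalize(raw, LOOKUP))
```

With a table like this, counts per assignee can be aggregated on the preferred term, so patents filed under an acquired company's name are not counted separately from the parent.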

Using I2E also meant that searches could be focused on specific regions of the patent documents; for example, the kinase information was extracted from the claims, enhancing the precision of the search.

Using the novel approach, the patent analysis team mined over 7100 full-text patents. That’s approximately half a million pages of full text to search for relevant kinase technology trends and the corresponding therapeutic area information. To put this business value into perspective, it takes about an hour to manually read one patent for data extraction, so a scope this large would require around 175 person-weeks (or nearly 3.5 years!) to accomplish. The authors state that innovative use of I2E enabled a 99% efficiency gain for delivering the relevant information. They also say that this project took 2 patent analysts 3 months (i.e. about 25 person-weeks in total), which is a 7-fold saving in FTE time.
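
The effort estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch: the 40-hour working week, the 52-week year and the variable names are assumptions of this example, not figures from the paper, which is why the results land close to, rather than exactly on, the rounded numbers quoted.

```python
# Figures taken from the text above.
PATENTS = 7100           # full-text patents mined
HOURS_PER_PATENT = 1     # roughly one hour of manual reading per patent
ANALYSTS = 2
PROJECT_MONTHS = 3

# Assumptions made for this sketch (not stated in the source).
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52

# Manual effort: one person reading every patent.
manual_hours = PATENTS * HOURS_PER_PATENT
manual_weeks = manual_hours / HOURS_PER_WEEK
manual_years = manual_weeks / WEEKS_PER_YEAR

# Actual project effort, expressed in person-weeks.
project_weeks = ANALYSTS * PROJECT_MONTHS * WEEKS_PER_YEAR / 12

# Ratio of manual effort to actual effort.
fold_saving = manual_weeks / project_weeks

print(f"Manual effort: {manual_weeks:.1f} person-weeks ({manual_years:.2f} years)")
print(f"Project effort: {project_weeks:.1f} person-weeks")
print(f"Saving: about {fold_saving:.1f}-fold")
```

Under these assumptions the manual estimate comes out slightly above 175 person-weeks and the saving slightly below 7-fold, consistent with the rounded figures in the post.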

The deliverables provided key actionable insights that empowered timely business decisions for BMS researchers; and this paper demonstrates that rich information contained in full text patents can be analyzed if innovative tools/methods are used.


Data to knowledge: visualisation of the different pharma companies and the numbers of relevant patents for each in the kinase assay technologies. Taken from Yang et al. (2014) WPI 39: 24-34.

