Pharma and healthcare come together to see the future of text mining

October 16, 2013

Over the past few years, the Linguamatics Text Mining Summit has become Linguamatics’ flagship US event, attracting a wide variety of attendees from the Pharma, Biotech and Healthcare industries.

This year was no exception: the Summit drew a record crowd of over 85 attendees and a fantastic lineup of speakers from AMIA, Shire Pharmaceuticals, Huntsman Cancer Institute, AstraZeneca, Regeneron Pharmaceuticals, Merck, Georgetown University Medical Center, UNC Charlotte, City of Hope, Pfizer and Roche.

The Summit took place from October 7-9, 2013 at the Hyatt Regency in Newport, Rhode Island, in the beautiful surroundings of Narragansett Bay.


Delegates had an excellent opportunity to explore trends in text mining and analytics, natural language processing and knowledge discovery. They discovered how I2E is delivering valuable intelligence from text in a range of applications, including the mining of scientific literature, news feeds, Electronic Health Records (EHRs), clinical trial data, FDA drug labels and more. Customer presentations demonstrated how I2E helps workers in knowledge-driven organizations meet the challenge of information overload, maximize the value of their information assets and increase speed to insight.

One customer presentation explained how a “Recent I2E literature search saved $180K and 2-3 months [of] time”.

Kevin Fickenscher, President and CEO of AMIA, kicked off the main sessions on Tuesday, October 8th with his forward-thinking talk on Big Data and its implications for personalized medicine and performance-based outcomes in healthcare. The talk was engaging and well received, and it set the tone for a great day of presentations.


Tuesday’s final talk, from Jonathan Hartmann, an Informationist at Georgetown University Medical Center, also caused quite a stir by demonstrating an iPad-based web portal interface to I2E that supports the identification of case histories for use during physicians’ rounds. This novel real-time use of I2E prompted many questions from delegates, and the presentation was discussed long into the night at the Summit dinner.


The Summit gave new and potential users a timely update on the latest I2E product developments, including enhanced chemical querying capabilities and an introduction to Linked Servers, which allows streamlined access to content by linking enterprise servers with content servers in the cloud. Both of these new features are available in the latest I2E release, I2E 4.1. CTO David Milward’s forward-looking technology road map received a great deal of interest from delegates.

This year we invited a few of our partners to exhibit and participate in 5-minute lightning round presentations. We heard from Cambridge Semantics, ChemAxon, Copyright Clearance Center and Accelrys about their solutions and how they integrate with Linguamatics I2E.

The whole event was a great success, topped off with some fabulous feedback from our attendees. To quote one of the feedback forms: “One of the best summits and training sessions I have ever attended. The pioneering NLP efforts and user responsiveness is unmatched by any other text mining (NLP) vendor. The Linguamatics approach and I2E system is relatively intuitive, easy to manage, powerful and useful”.

Presentations are available on I2Edia and by email request.

Thanks to everyone who attended and contributed to the Linguamatics Text Mining Summit 2013. We look forward to seeing you next year!

Big Data in the San Francisco Bay Area

September 3, 2013

Natural Language Processing (NLP), big data and precision medicine are three of the hottest topics in healthcare at the moment, and consequently attracted a large audience to the first NLP & Big Data Symposium, focussed on precision medicine. The event took place on August 27th, hosted at the new UCSF site at Mission Bay in San Francisco and sponsored by Linguamatics and UCSF Helen Diller Family Comprehensive Cancer Center. Over 75 delegates came to hear the latest insights and projects from some of the West Coast’s leading institutions, including Kaiser Permanente, Oracle, Huntsman Cancer Institute and UCSF. The event was held in the midst of an explosion in new building development to house the latest in medical research and informatics, work that will have big data at its heart. Linguamatics and UCSF recognized the need for a meeting on NLP on the West Coast and put together an exciting program that clearly caught the imagination of many groups in the area.

Delegates at the Symposium

Over 75 delegates attended the Symposium

Key presentations included:

  • The keynote presentation from Frank McCormick, Director of UCSF Helen Diller Family Comprehensive Cancer Center, was a tour de force of the latest insights into cancer research and future prospects. With the advances in genetic sequencing and associated understanding of cancer biology, we are much closer to major breakthroughs in tackling the genetic chaos of cancer
  • Kaiser Permanente presented a predictive model of pneumonia assessment based on Linguamatics I2E that has been trained and tested using over 200,000 patient records. This paper has just been published and can be found here. In addition, Kaiser presented plans for a new project on re-hospitalization that takes into account social factors in addition to standard demographic and diagnosis data
  • Huntsman Cancer Institute showed how pathology reports are being mined using I2E to extract data for use in a research data warehouse to support cohort selection for internal studies and clinical trials
  • Oracle presented their approach to enterprise data warehousing and translational medicine data management, highlighting why a sustainable single source of truth for data is key and how NLP can be used to support this environment
  • UCSF also provided an overview of the current approaches to the use of NLP in medical and research informatics, emphasizing the need for such approaches to deliver the raw data for advanced research
  • Linguamatics’ CTO, David Milward, presented a positional piece on why it is essential for EHRs and NLP to be more closely integrated and illustrated some of the challenges and approaches that can be used with I2E to overcome them
  • Linguamatics’ Tracy Gregory also showed how NLP can be used in the early stages of biomarker research to assess potential research directions and support knowledge discovery

    Panel from NLP & Big Data Symposium

    Panel of speakers, from left to right – Tony Sheaffer, Vinnie Liu, Brady Davis, David Milward, Samir Courdy, Tracy Gregory and Gabriel Escobar

With unstructured data accounting for 75-80% of data in EHRs, the use of NLP in healthcare analytics and translational research is essential to achieve the required improvements in treatment and disease understanding. This event provided a great forum for understanding the current state of the art in this field and allowed participants to engage in many useful discussions. West Coast residents interested in the field can look forward to another opportunity to get together in 2014 or, if you can get over to the East Coast, the Linguamatics Text Mining Summit will take place on October 7-9, 2013 in Newport, RI.

I2E and the Semantic Web

August 6, 2013

The internet right now, as Tim Berners-Lee points out in Scientific American, is a web of documents;  documents that are designed to be read, primarily, by humans. The vision behind the Semantic Web is a web of information, designed to be processed by machines. The vision is being implemented: important parts of the key enabling technologies are already in place.

RDF, the Resource Description Framework, is one such key technology. RDF is the language for expressing information in the Semantic Web. Every statement in RDF is a simple triple, which you can think of as subject/verb/object, and a set of statements is just a set of triples. Three example triples might be: Armstrong/visited/moon, Armstrong/isa/human and moon/isa/astronomical body. The power of RDF lies partly in the fact that a set of triples is also a graph, and graphs are perfect for machines to traverse and, increasingly, reason over. After all, when you surf the web, you’re just traversing the graph of hyperlinks. And that’s the second powerful feature of RDF: the individual parts, such as Armstrong and moon, are not just strings of letters but web-addressable Uniform Resource Identifiers (URIs). When I publish my little graph about Armstrong, it becomes part of a vast world-wide graph: the Semantic Web. So machines hunting for information about Armstrong can reach my graph and every other graph about Armstrong. This approach allows the web to become a huge distributed knowledge base.
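To make the "set of triples is also a graph" idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the `http://example.org/` namespace and the function names `build_graph` and `reachable` are assumptions made for this example, not part of RDF or of I2E, and real data would use published URIs and a proper RDF library.

```python
from collections import defaultdict

# Hypothetical namespace for illustration only; real data would use published URIs.
EX = "http://example.org/"

# The three example triples from the text, as (subject, predicate, object).
triples = [
    (EX + "Armstrong", EX + "visited", EX + "moon"),
    (EX + "Armstrong", EX + "isa", EX + "human"),
    (EX + "moon", EX + "isa", EX + "astronomical_body"),
]

def build_graph(triples):
    """Index triples by subject so each node lists its outgoing edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph

def reachable(graph, start):
    """Return every node that can be reached by following edges from start."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for _predicate, obj in graph.get(node, []):
            stack.append(obj)
    seen.discard(start)
    return seen

graph = build_graph(triples)
for node in sorted(reachable(graph, EX + "Armstrong")):
    print(node)
```

Starting from Armstrong, the traversal reaches the moon, and from the moon it reaches "astronomical body", which is the kind of hop-by-hop walk a machine performs over linked data.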

There’s a new component for I2E: the XML to RDF convertor.  It turns standard I2E results into web-standard RDF. Each row in the table of results (each assertion) becomes one or more triples. For example, suppose you run an astronomy query against a news website and it returns the structured fact: Armstrong, visited, the moon, 1969. Let’s also suppose Armstrong and the moon were identified using an ontology derived from Wikipedia. The output RDF will include one URI to identify this structured fact and four more for the fact’s constituents (the first is Armstrong, the second is the relation of visiting, the third is the moon and so forth). Then, there will be a number of triples relating these constituents, for example, that the subject of the visiting is Armstrong. In addition, all the other information available in the traditional I2E results is presented in additional triples. For example, one triple might state that Armstrong’s preferred term is “Neil Armstrong”, another might state that the source concept is, a third might state that the hit string (text justifying the concept) is Neil Alden Armstrong. The set of possible output triples for the I2E convertor is fully defined by an RDF schema.
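The shape of that output can be sketched in a few lines of Python. This is not the I2E convertor or its RDF schema: the namespace, the property names (`subject`, `preferredTerm`, `hitString` and so on) and the function `row_to_triples` are all hypothetical, and every sample value other than "Neil Armstrong" and "Neil Alden Armstrong" is made up for illustration. It only shows how one row of results could become one URI for the fact, one URI per constituent, and a handful of triples linking them.

```python
# Hypothetical namespace and property names; NOT the real I2E RDF schema.
EX = "http://example.org/i2e/"

def row_to_triples(fact_id, columns):
    """Turn one result row into triples.

    columns is a list of (role, preferred_term, hit_string) tuples,
    one per constituent of the structured fact.
    """
    fact_uri = f"{EX}fact/{fact_id}"
    triples = []
    for index, (role, preferred_term, hit_string) in enumerate(columns, start=1):
        # Each constituent gets its own URI under the fact's URI.
        part_uri = f"{fact_uri}/part{index}"
        # Relate the fact to its constituent, e.g. "the subject of this fact is ...".
        triples.append((fact_uri, EX + role, part_uri))
        # Carry over the remaining result details as further triples.
        triples.append((part_uri, EX + "preferredTerm", preferred_term))
        triples.append((part_uri, EX + "hitString", hit_string))
    return triples

# Illustrative row for the fact: Armstrong, visited, the moon, 1969.
row = [
    ("subject", "Neil Armstrong", "Neil Alden Armstrong"),
    ("relation", "visit", "visited"),
    ("object", "Moon", "the moon"),
    ("date", "1969", "1969"),
]

for triple in row_to_triples(1, row):
    print(triple)
```

Giving the fact its own URI is what lets a four-part assertion (who, what, where, when) be expressed with three-part statements: each triple hangs one constituent or one detail off that shared identifier.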

RDF Triple & Semantic Web

Visualization of the RDF Triple Armstrong, visited, the moon

Why is this a good thing? First and foremost, I2E results can now join the Semantic Web. Even if you don’t want to publish your results, you can still exploit the growing list of Semantic Web tools for processing your own data and integrating them with other data which has been published. Second, to quote Tim Berners-Lee again, “The Semantic Web will enable machines to COMPREHEND semantic documents and data, not human speech and writings”. I2E is all about extracting structured information from human writings. So link these two things together and you have a powerful tool for traversing the worlds of structured and unstructured data together.

Find out more about Linguamatics I2E
