Apache Tika Plugin

  • Tags: analysis
  • Latest: 1.3.0.1
  • Last Updated: 03 May 2013
  • Grails version: 2.0 > *
Dependency:
compile ":tika-parser:1.3.0.1"

Summary

Bundles Apache Tika along with a parser service.

Installation

grails install-plugin tika-parser

Description

Apache Tika is a toolkit to extract content and metadata from many different file formats like MS Word or Excel, which are difficult to parse without the right libraries.

The tikaParser plugin provides an artifact 'tikaService', which offers a simple parseFile method to retrieve an XHTML representation of the given file containing both metadata and content.

Note: this plugin pulls in everything that Tika depends on, which will increase your application's WAR size by about 10-30 MB. In return, it saves you the considerable pain of handling many different file formats on your own and managing the individual parser libraries required to read them.

Usage scenario

This plugin is used by the open source Cinnamon CMS in combination with Apache Lucene to index the content of all kinds of documents and provide powerful search capabilities. For example, with the XHTML representation of an MS Word file, you can define XPath-based index items to index headlines and content separately. A user can then search for a Word document that contains the word "Tika" in a headline and the word "Grails" in the content.
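
The idea of XPath-based index items can be sketched with the XML classes that ship with the JDK. The following is only an illustration, not how Cinnamon or this plugin implements indexing: the class name, the method names and the sample XHTML string are all hypothetical, and the real output of the parser will contain more metadata and markup than this hand-written sample.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlIndexSketch {

    // Hypothetical, hand-written XHTML standing in for a parsed Word file.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>Notes</title></head>"
        + "<body><h1>Using Tika</h1>"
        + "<p>Indexing documents in a Grails application.</p></body></html>";

    // Evaluates an XPath expression and returns the trimmed text of each match.
    static List<String> select(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness makes local-name() behave predictably.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    // Index item 1 (assumed definition): first- and second-level headlines.
    static List<String> headlines(String xhtml) {
        return select(xhtml, "//*[local-name()='h1' or local-name()='h2']");
    }

    // Index item 2 (assumed definition): paragraphs inside the body.
    static List<String> paragraphs(String xhtml) {
        return select(xhtml, "//*[local-name()='body']//*[local-name()='p']");
    }

    public static void main(String[] args) {
        System.out.println("headlines: " + headlines(SAMPLE));
        System.out.println("content:   " + paragraphs(SAMPLE));
    }
}
```

Each list would then go into its own field of the search index, so that a query can require "Tika" in the headline field and "Grails" in the content field. The expressions match on `local-name()` because XHTML elements live in the XHTML namespace, which a plain `//h1` would not match in a namespace-aware parse.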