MIME Magic with Apache Tika, by Jukka Zitting (Tika committer and mentor)


Fast Feather Track presentation at ApacheCon EU 2008 in Amsterdam

Page 1: Mime Magic With Apache Tika

MIME Magic with Apache Tika

Jukka Zitting

Tika committer and mentor

Page 2: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Page 3: Mime Magic With Apache Tika

The Problem

PDFBox

Apache POI

Apache Xerces

ICU4J

NekoHTML

etc.

Lucene index

Page 4: Mime Magic With Apache Tika

It's even worse!

Licensing/Patents

Dependencies

Metadata extraction

Structured content

Encryption/Compression

Package formats

Streaming/Performance

Processing of digital media

Page 5: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Page 6: Mime Magic With Apache Tika

The Solution: Technical

• Generic API for extracting metadata and structured text content from a document
  – Input: byte stream + optional metadata
  – Output: XHTML SAX events + metadata

• Automatic content type detection
  – Magic bytes
  – File name patterns
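The two detection signals listed on this slide can be illustrated with a small standalone sketch. It does not use Tika; the class name `DetectionSketch`, the tiny lookup tables, and the choice to check magic bytes before file name patterns are all assumptions made for illustration, not a description of Tika's own implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of content type detection; not Tika code.
final class DetectionSketch {

    // Assumed magic table: leading bytes of a document -> media type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // Assumed glob table, simplified to file name suffixes -> media type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("{\\rtf", "application/rtf");
        GLOBS.put(".pdf", "application/pdf");
        GLOBS.put(".html", "text/html");
        GLOBS.put(".txt", "text/plain");
    }

    // Checks the magic bytes first, then falls back to the file name.
    static String detect(byte[] prefix, String fileName) {
        // ISO-8859-1 maps each byte to exactly one char, so prefixes compare safely.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> entry : MAGIC.entrySet()) {
            if (head.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : GLOBS.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        // Nothing matched: report the generic binary type.
        return "application/octet-stream";
    }
}
```

In this sketch the file name is optional, mirroring the "byte stream + optional metadata" input: a caller that only has bytes passes `null` and relies on the magic table alone.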

Page 7: Mime Magic With Apache Tika

The Solution: Legal / Social

• Apache License
  – (L)GPL projects can implement the Tika API

• Pooling of efforts
  – Active development and maintenance
  – Already beyond the functionality of most custom solutions

Page 8: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Page 9: Mime Magic With Apache Tika

Project Status

• Incubating since March 2007

• Sponsoring PMC: Apache Lucene

• First release (0.1-incubating) in December 2007

• Interaction with PDFBox, POI, etc.

• Currently in early adopter phase

Page 10: Mime Magic With Apache Tika

Current Features

• 73 registered media types
  – 167 glob patterns
  – 26 magic header patterns

• 7 built-in parser classes
  – 51 supported media types
  – MS Office, OpenOffice, HTML, PDF, XML, RTF, plain text

Page 11: Mime Magic With Apache Tika

Project Statistics

Page 12: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Page 13: Mime Magic With Apache Tika

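Tika Parser API

    package org.apache.tika.parser;

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public interface Parser {

        // Parses document content and metadata
        void parse(
                InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException;

        // Parses document metadata, @since Tika 0.2
        void parse(InputStream stream, Metadata metadata)
                throws IOException, TikaException;

    }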
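To make the "XHTML SAX events" output of the three-argument `parse` method concrete, here is a minimal sketch that uses only the JDK's SAX classes. It is not a Tika parser: `XhtmlEventsSketch`, `parsePlainText`, and `TextCollector` are hypothetical names, and the mapping of each input line to one `<p>` element is an assumption chosen to keep the example short.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: emits XHTML-style SAX events for plain text input.
final class XhtmlEventsSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Reads UTF-8 text and reports each line as a <p> element to the handler.
    static void parsePlainText(InputStream stream, ContentHandler handler)
            throws IOException, SAXException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8));
        handler.startDocument();
        start(handler, "html");
        start(handler, "body");
        String line;
        while ((line = reader.readLine()) != null) {
            start(handler, "p");
            char[] chars = line.toCharArray();
            handler.characters(chars, 0, chars.length);
            end(handler, "p");
        }
        end(handler, "body");
        end(handler, "html");
        handler.endDocument();
    }

    private static void start(ContentHandler handler, String name) throws SAXException {
        handler.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler handler, String name) throws SAXException {
        handler.endElement(XHTML, name, name);
    }

    // A consumer that keeps only the text, adding a newline after each paragraph.
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                text.append('\n');
            }
        }
    }
}
```

Because the consumer is just a `ContentHandler`, the same event stream could feed a text collector like this one, an indexer, or an XML serializer without the producer changing.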

Page 14: Mime Magic With Apache Tika

Example: Text extraction

    public static void main(String[] args) throws Exception {
        InputStream stream = System.in;
        ContentHandler handler = new WriteOutContentHandler(System.out);
        Metadata metadata = new Metadata();
        new AutoDetectParser().parse(stream, handler, metadata);
    }
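`AutoDetectParser` in the example above detects the document type and hands the stream to a matching parser. The general idea of dispatching on a media type can be sketched without Tika as follows; `DispatchSketch`, its `register` and `extract` methods, and the use of a plain function from bytes to text are all hypothetical simplifications, not Tika's actual design.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: routes content to a parser chosen by media type.
final class DispatchSketch {

    // Each "parser" here is just a function from raw bytes to extracted text.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Looks up the parser for the given media type and applies it.
    String extract(String mediaType, byte[] content) {
        Function<byte[], String> parser = parsers.get(mediaType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + mediaType);
        }
        return parser.apply(content);
    }
}
```

A caller would register one function per supported media type and pass in whatever type the detection step reported.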

Page 15: Mime Magic With Apache Tika

Demo: Tika GUI

Page 16: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Thank You!