Fast Feather Track presentation at ApacheCon EU 2008 in Amsterdam
MIME Magic with Apache Tika
Jukka Zitting
Tika committer and mentor
Agenda
The Problem
The Solution
The Project
The Client
The Problem
• PDFBox
• Apache POI
• Apache Xerces
• ICU4J
• NekoHTML
• etc.
Lucene index
It's even worse!
• Licensing/Patents
• Dependencies
• Metadata extraction
• Structured content
• Encryption/Compression
• Package formats
• Streaming/Performance
• Processing of digital media
The Solution: Technical
• Generic API for extracting metadata and structured text content from a document
  – Input: byte stream + optional metadata
  – Output: XHTML SAX events + metadata
• Automatic content type detection
  – Magic bytes
  – File name patterns
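The two detection inputs named above can be illustrated with a minimal Java sketch. It is not Tika's implementation or API: the class name `DetectionSketch`, its two small lookup tables, and the choice to check magic bytes before the file name are all assumptions made for illustration, and the glob patterns are reduced to simple file name suffixes.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: not Tika's implementation or API.
final class DetectionSketch {
    // Hypothetical magic table: leading bytes of the stream -> media type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // Hypothetical name table: glob patterns reduced to suffixes -> media type.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("{\\rtf", "application/rtf");
        MAGIC.put("PK\u0003\u0004", "application/zip");
        SUFFIXES.put(".html", "text/html");
        SUFFIXES.put(".txt", "text/plain");
        SUFFIXES.put(".pdf", "application/pdf");
    }

    // Assumption of this sketch: magic bytes win, the file name is a fallback.
    static String detect(byte[] prefix, String fileName) {
        // ISO-8859-1 maps every byte to exactly one char, so a byte prefix
        // can be compared with String.startsWith.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> entry : MAGIC.entrySet()) {
            if (head.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : SUFFIXES.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        // Nothing matched: fall back to the generic binary type.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(pdf, "report.bin"));
        System.out.println(detect(new byte[0], "index.html"));
        System.out.println(detect(new byte[0], null));
    }
}
```

Keeping the magic patterns and the name patterns in separate tables mirrors the split on the slide; a real registry would presumably load both from configuration rather than hard-code them.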
The Solution: Legal / Social
• Apache License
  – (L)GPL projects can implement the Tika API
• Pooling of efforts
  – Active development and maintenance
  – Already beyond the functionality of most custom solutions
Project Status
• Incubating since March 2007
• Sponsoring PMC: Apache Lucene
• First release (0.1-incubating) in December 2007
• Interaction with PDFBox, POI, etc.
• Currently in early adopter phase
Current Features
• 73 registered media types
  – 167 glob patterns
  – 26 magic header patterns
• 7 built-in parser classes
  – 51 supported media types
  – MS Office, OpenOffice, HTML, PDF, XML, RTF, plain text
Project Statistics
Tika Parser API

    package org.apache.tika.parser;

    public interface Parser {

        // Parses document content and metadata
        void parse(InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException;

        // Parses document metadata, @since Tika 0.2
        void parse(InputStream stream, Metadata metadata)
            throws IOException, TikaException;

    }
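Example: Text extraction

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.WriteOutContentHandler;
    import org.xml.sax.ContentHandler;

    public class TextExtraction {
        public static void main(String[] args) throws Exception {
            // Read the document from standard input
            InputStream stream = System.in;
            // Write the extracted text to standard output
            ContentHandler handler = new WriteOutContentHandler(System.out);
            Metadata metadata = new Metadata();
            // Detect the content type and delegate to a matching parser
            new AutoDetectParser().parse(stream, handler, metadata);
        }
    }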
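To make the "XHTML SAX events" output more concrete, here is a small sketch that uses only the SAX classes bundled with the JDK. It is a toy, not a Tika parser: `XhtmlEventsSketch`, `parsePlainText`, `TextCollector`, and `extractText` are hypothetical names, and turning each input line into one `<p>` element is an assumption of the sketch.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: a toy producer and consumer of XHTML SAX events.
final class XhtmlEventsSketch {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical toy parser: every input line becomes one <p> element.
    static void parsePlainText(InputStream stream, ContentHandler handler)
            throws IOException, SAXException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8));
        handler.startDocument();
        start(handler, "html");
        start(handler, "body");
        String line;
        while ((line = reader.readLine()) != null) {
            start(handler, "p");
            handler.characters(line.toCharArray(), 0, line.length());
            end(handler, "p");
        }
        end(handler, "body");
        end(handler, "html");
        handler.endDocument();
    }

    private static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    // Consumer that keeps only the text and ends each paragraph with a newline.
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                text.append('\n');
            }
        }
    }

    // Convenience wrapper that runs the toy parser over an in-memory string.
    static String extractText(String input) {
        TextCollector collector = new TextCollector();
        try {
            parsePlainText(new ByteArrayInputStream(
                    input.getBytes(StandardCharsets.UTF_8)), collector);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        System.out.print(extractText("first line\nsecond line"));
    }
}
```

Because the consumer only sees SAX callbacks, the same handler could sit behind any producer that emits XHTML events, which is the appeal of the event-based output format.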
Demo: Tika GUI
Thank You!