Aspire Document Processing


2. Document Processing – “Aspire”

• Very High Performance
• Structured Document Processing Architecture
• Dynamic configuration and deployment
• Based on Open Source Technologies
• Well Supported (wiki, javadoc)
• Administration interface built-in
• Vendor Neutral (CMS and search engine)

3. Top-Level Overview

Diagram labels: Aspire; Data Sources; Feeders; Document Processing Pipelines; Indexing; Index.
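
The overview can be read as a left-to-right flow: feeders pull items from data sources, pipelines process each document stage by stage, and the result is handed to indexing. The following is a minimal sketch of that flow under this assumption. It is not Aspire's API: the function names, the document fields, and the in-memory list standing in for the index are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]
Stage = Callable[[Document], Document]


def feeder(urls: Iterable[str]) -> Iterable[Document]:
    """Hypothetical feeder: turns each source item into a document job."""
    for url in urls:
        yield {"url": url, "content": f"<p>Body of {url}</p>"}


def extract_text(doc: Document) -> Document:
    """Hypothetical stage: strip the paragraph tags from the content."""
    text = doc["content"].replace("<p>", "").replace("</p>", "")
    return {**doc, "text": text}


def host_to_domain(doc: Document) -> Document:
    """Hypothetical stage: derive a domain field from the URL host."""
    host = doc["url"].split("/")[2]
    return {**doc, "domain": ".".join(host.split(".")[-2:])}


def run_pipeline(docs: Iterable[Document], stages: List[Stage]) -> List[Document]:
    """Pass every document through the stages in order.

    The returned list is a stand-in for the search index.
    """
    index: List[Document] = []
    for doc in docs:
        for stage in stages:
            doc = stage(doc)
        index.append(doc)
    return index


index = run_pipeline(
    feeder(["http://www.example.com/a"]),
    [extract_text, host_to_domain],
)
```

Because each stage takes a document and returns a document, a new collection can be assembled by listing existing stages in a different order rather than by writing new code.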

4. Components In Aspire (today)

Aspire

Common Resources

Content Control DB

SubJob Extractors

Unload ARC Files

Unload CSV

Component Manager Pipeline Manager

Metadata Manipulation

Text Extraction

Date Chooser

Split Multi-valued data

Host to Domain

Groovy Scripting

JDBC Connection

Feeders

RSS

Hot Folder

Single Page

RDB

Enhancers

Get CCD Metadata

RDB Enhancer

Output

Push XML to REST

Error Job Handler

Debug Output

JMS

RDB Unloader

Feed One

Fetch URL

Category Tagger

Content Boost

5. Functions Handled by Aspire

• Threading
• Collection Deployment
• Error handling and notification
    • Including individual sub-job notifications
• Collection Configuration
• Component Scripting
• Job Processing
• Admin I/F, performance, live system status

6. Benefits

• Much lower lifecycle cost
• File processing no longer an ad-hoc collection of Java objects and methods
• Encourages re-use of components
• New collections with no programming
    • Just re-configure existing components
• Flexibility: deploy collections individually
• Much better visibility into the file processing internals, performance, and queuing

7. Typical Installation Structure

Diagram labels: Machine #1; Machine #2; Crawler; Aspire (other feeders and doc processing); Search Engine.

8. Aspire Architecture and Components

Details

9. Top-Level Component Architecture

10. Aspire and OSGi Components

Diagram: an Aspire Component is manufactured by an Aspire Component Factory; an Aspire Component Factory is an OSGi Bundle; an OSGi Bundle is a Java Jar File.
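
The "manufactured by" relationship can be illustrated with a small factory sketch. This is only an analogy in Python, assuming a factory looks up a builder by name and returns a configured component. The class names, the `register`/`create` methods, and the `date-chooser` identifier are hypothetical and do not reflect the real Aspire or OSGi interfaces.

```python
class AspireComponent:
    """Hypothetical stand-in for a component; not the real Aspire interface."""

    def __init__(self, name: str, config: dict):
        self.name = name
        self.config = config


class ComponentFactory:
    """Hypothetical stand-in for a bundle that manufactures components."""

    def __init__(self):
        self._builders = {}

    def register(self, implementation: str, builder) -> None:
        """Make one component implementation available under a name."""
        self._builders[implementation] = builder

    def create(self, implementation: str, name: str, config: dict) -> AspireComponent:
        """Manufacture a configured component instance."""
        if implementation not in self._builders:
            raise KeyError(f"factory cannot manufacture {implementation!r}")
        return self._builders[implementation](name, config)


class DateChooser(AspireComponent):
    """Hypothetical component named after the 'Date Chooser' box in the deck."""


factory = ComponentFactory()
factory.register("date-chooser", DateChooser)
component = factory.create("date-chooser", "pickDate", {"format": "yyyy-MM-dd"})
```

Keeping construction inside the factory means a configuration only has to name the implementation it wants; the packaging unit decides how instances are built.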

11. The Contents of a Bundle/Component Factory

12. Component and Factory Details


14. Aspire Sample Configurations

15. Web Site Crawler / Search

16. Processing CSV Files

17. RSS Feeds, Single Pages

18. Aspire Deployment

19. Deployment

• Architected to the latest deployment standards
    • Distribution Archetypes
    • Component Repositories
• Redeploy collections independently
    • In a live running system
• Redeploy and update components
    • In a live running system
• Ready for the cloud

20. Deployment Structure

Diagram labels: Aspire; Resources; Collection Config (six instances); Feeders & Pipelines; Administrator; load/reload configuration; Configuration Control; re-useable components; Component Repository.
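
One way to picture "load/reload configuration" on a running system is a registry that replaces a single collection's configuration without touching the others. The sketch below assumes exactly that behaviour. `CollectionRegistry`, its methods, and the collection and component identifiers are hypothetical names chosen for illustration, not part of Aspire.

```python
class CollectionRegistry:
    """Hypothetical registry: each collection is loaded and reloaded on its own."""

    def __init__(self):
        self._collections = {}

    def load(self, name, config):
        """Load or reload one collection; returns True if it replaced an old config."""
        reloaded = name in self._collections
        self._collections[name] = dict(config)
        return reloaded

    def pipeline_for(self, name):
        """Return the ordered component names configured for a collection."""
        return list(self._collections[name]["pipeline"])


registry = CollectionRegistry()
registry.load("news", {"pipeline": ["fetch-url", "text-extraction"]})
registry.load("products", {"pipeline": ["unload-csv"]})

# Reload only the "news" collection with one extra component.
was_reload = registry.load(
    "news",
    {"pipeline": ["fetch-url", "text-extraction", "category-tagger"]},
)
```

Here the "products" collection keeps its original pipeline after "news" is reloaded, which is the independence the deployment slides describe.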

21. Deployment Implications

• Collections are configured independently
• Collections use standard components
• Can be dynamically and remotely deployed

Diagram labels: Remote System; Aspire (always running); Collection Config; load remote configurations; remote admin control.