N-gram Search Engine on Wikipedia
Satoshi Sekine (NYU), Kapil Dalwani (JHU)
July 30th, 2009, Lexical Knowledge from Ngrams



Page 2: N-gram Search Engine on Wikipedia


Hammer: Fast and multi-functional n-gram search engine

• Search ngram: FAST
• INPUT: token, POS, chunk, NE
• OUTPUT: frequency to text

Page 3: N-gram Search Engine on Wikipedia


Characteristics

• Search up to 7-grams with wildcards
• Multi-level input
  – Token, POS, chunk, NE, combinations
  – NOT, OR for POS, chunk, NE
• Multi-level output
  – Token, POS, chunk, NE
  – Document information
  – Original sentences, KWIC, ngram
• Display
  – Show the results in the order of frequency
• Running environment
  – Single CPU, PC-Linux, 400MB process, 500GB disk
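
To make the multi-level input concrete, here is a minimal Python sketch of matching a query against POS-tagged n-grams with a wildcard, an OR over POS tags, and a NOT over POS tags. The slides do not show the real query syntax, so `Slot`, `ANY`, `matches`, and the sample data are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

# Hypothetical token representation: (surface token, POS tag).
Token = Tuple[str, str]

@dataclass(frozen=True)
class Slot:
    """One query position; an unset field places no constraint."""
    token: Optional[str] = None                  # exact surface form
    pos_any_of: Optional[FrozenSet[str]] = None  # OR over POS tags
    pos_none_of: FrozenSet[str] = frozenset()    # NOT over POS tags

    def accepts(self, tok: Token) -> bool:
        surface, pos = tok
        if self.token is not None and surface != self.token:
            return False
        if self.pos_any_of is not None and pos not in self.pos_any_of:
            return False
        return pos not in self.pos_none_of

ANY = Slot()  # full wildcard: accepts any token

def matches(query: List[Slot], ngram: List[Token]) -> bool:
    """True if every query slot accepts the token at the same position."""
    return len(query) == len(ngram) and all(
        slot.accepts(tok) for slot, tok in zip(query, ngram)
    )

# Made-up sample data, not taken from the Wikipedia index.
ngrams = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("the", "DT"), ("old", "JJ"), ("cat", "NN")],
    [("a", "DT"), ("dog", "NN"), ("runs", "VBZ")],
]

# "the", then anything, then a token tagged VBD or VBZ.
query_or = [Slot(token="the"), ANY, Slot(pos_any_of=frozenset({"VBD", "VBZ"}))]
hits_or = [ng for ng in ngrams if matches(query_or, ng)]

# Anything, then a token that is NOT tagged NN, then anything.
query_not = [ANY, Slot(pos_none_of=frozenset({"NN"})), ANY]
hits_not = [ng for ng in ngrams if matches(query_not, ng)]
```

This sketch scans every n-gram, which only works for toy data; the implementation slides describe an index-based search instead.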


Page 4: N-gram Search Engine on Wikipedia


Demo

• http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2

Page 5: N-gram Search Engine on Wikipedia


Available for you

• Web system
  – At NYU
    • http://nlp.cs.nyu.edu/nsearch
  – At JHU?
• USB hard drive

Page 6: N-gram Search Engine on Wikipedia


Implementation: Overview

[Figure: system diagram. Stages: 1. Search candidates, 2. Filtering, 3. Display. Components: search request; Wikipedia text; Wikipedia POS, chunk, NE; N-gram data; inverted index for n-gram data; suffix array for text; POS, chunk, NE for n-gram data.]

Page 7: N-gram Search Engine on Wikipedia


Implementation: Overview

[Figure: the overview diagram again, showing stage 1. Search candidates.]

Page 8: N-gram Search Engine on Wikipedia


From n-gram to Inverted Index

• Example: 3-grams

  Ngram ID   Position=1   Position=2   Position=3
  1          A            B            C
  2          A            B            B
  3          B            A            C

• Posting list

  A pos=1: 1, 2
  A pos=2: 3
  B pos=1: 3
  B pos=2: 1, 2
  B pos=3: 2
  C pos=3: 1, 3
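
The table-to-posting-list step above can be sketched in a few lines of Python. This is a minimal in-memory illustration using the three example 3-grams from the slide; the function name `build_inverted_index` and the dictionary layout are assumptions, not the system's actual on-disk format.

```python
from collections import defaultdict

# The three example 3-grams from the slide, keyed by n-gram ID.
ngrams = {
    1: ("A", "B", "C"),
    2: ("A", "B", "B"),
    3: ("B", "A", "C"),
}

def build_inverted_index(ngrams):
    """Map each (token, position) pair to the sorted list of n-gram IDs
    that contain that token at that 1-based position."""
    index = defaultdict(list)
    for ngram_id in sorted(ngrams):
        for pos, token in enumerate(ngrams[ngram_id], start=1):
            index[(token, pos)].append(ngram_id)
    return dict(index)

index = build_inverted_index(ngrams)
for key in sorted(index):
    print(key, index[key])
```

Because IDs are visited in ascending order, every posting list comes out sorted, which is what a merge-style intersection needs.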

Page 9: N-gram Search Engine on Wikipedia


Posting list

• Wide variation of posting list size (in 7-gram: 1.27B)
  – “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
  – conscipcuous, consiety, Mizuk (1)
• 3 types for faster speed and smaller index size
  – Bitmap (freq > 1%): #EOS# 1.27B bits (bitmap) <-> 3.2B bits (list)
  – List of ngramID
  – Encoded into pointer (freq = 1)

[Figure: example representations: a list "C pos=3: 1, 3", a bitmap "1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1", and a single entry "C pos=3: 5".]
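
The size trade-off behind the three representations can be checked with simple arithmetic. The sketch below assumes 32-bit n-gram IDs, which is an assumption (the slide does not state the ID width), although it reproduces the slide's 3.2B-bit figure for "#EOS#". The function names are hypothetical.

```python
TOTAL_7GRAMS = 1_270_000_000  # 1.27B 7-grams, as given on the slide
ID_BITS = 32                  # assumed width of one n-gram ID

def choose_representation(freq, total=TOTAL_7GRAMS):
    """Pick a posting-list type using the thresholds named on the slide."""
    if freq == 1:
        return "pointer"  # the single ID is stored in place of a list pointer
    if freq > 0.01 * total:
        return "bitmap"   # one bit per n-gram ID
    return "list"         # explicit list of n-gram IDs

def list_bits(freq):
    """Bits needed to store `freq` IDs as an explicit list."""
    return freq * ID_BITS

def bitmap_bits(total=TOTAL_7GRAMS):
    """Bits needed for a bitmap over all n-gram IDs."""
    return total

eos_freq = 100_906_888  # posting list size of "#EOS#" from the slide
print(choose_representation(eos_freq))          # bitmap
print(list_bits(eos_freq) / 1e9, "B bits as a list")
print(bitmap_bits() / 1e9, "B bits as a bitmap")
```

Under the 32-bit assumption, the explicit list for "#EOS#" costs about 3.2B bits against 1.27B bits for the bitmap.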

Page 10: N-gram Search Engine on Wikipedia


Search

• Given an n-gram request (A B C)
  – Get posting lists for A, B and C
  – Search intersections of posting lists
  – Use “look ahead” to speed up the search
• Look ahead size = Sqrt(size of posting list), Moffat and Zobel (1996)

[Figure: two sorted posting lists being intersected, with a SKIP over part of the longer list.]

  4 33 34 55 76 80 89 92 99
  4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
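
The look-ahead intersection can be sketched as follows, using the two example lists from the slide. This is a simplified in-memory version that jumps ahead by the square root of the longer list's length; it is an illustration of the idea, not the system's code, and the function names are hypothetical.

```python
import math

def intersect_with_skips(short, long):
    """Intersect two sorted ID lists, looking ahead in the longer one."""
    skip = max(1, math.isqrt(len(long)))  # look-ahead size = sqrt(list size)
    result = []
    j = 0
    for target in short:
        # Jump a whole block while the element `skip` ahead is still <= target.
        while j + skip < len(long) and long[j + skip] <= target:
            j += skip
        # Finish with a short linear scan inside the block.
        while j < len(long) and long[j] < target:
            j += 1
        if j < len(long) and long[j] == target:
            result.append(target)
    return result

def intersect_all(posting_lists):
    """Intersect several sorted lists, starting from the shortest."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_with_skips(result, other)
    return result

a = [4, 33, 34, 55, 76, 80, 89, 92, 99]
b = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
print(intersect_with_skips(a, b))  # [4, 33, 76, 89]
```

For a request such as (A B C), `intersect_all` would be called with the posting lists for A at position 1, B at position 2, and C at position 3.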

Page 11: N-gram Search Engine on Wikipedia


Implementation: Overview

[Figure: the overview diagram again, showing stages 1. Search candidates and 2. Filtering.]

Page 12: N-gram Search Engine on Wikipedia


Filtering

• Not all candidate ngram IDs match the request
• We need frequency and sentence information for the matched n-grams
• POS, chunk and NE information is represented as IDs
  – Reduces the index by more than 200GB

[Figure: an n-gram "A B" with Freq=123, tag labels NN, VB, PERSON, LOC, and further entries Freq=10 and Freq=5.]
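
A minimal sketch of the filtering step, assuming each candidate n-gram carries a small side table of tag patterns stored as integer IDs together with a frequency. The layout (`TAG_IDS`, `ngram_tags`) and the numbers in it are hypothetical; the slide only states that tags are stored as IDs and that frequencies are needed for matched n-grams.

```python
# Hypothetical tag dictionary: each POS/chunk/NE label becomes a small integer.
TAG_IDS = {"NN": 0, "VB": 1, "PERSON": 2, "LOC": 3}

# Hypothetical side table: n-gram ID -> list of (tag-ID pattern, frequency).
ngram_tags = {
    1: [((0, 1), 10), ((0, 0), 5)],  # seen as NN VB (10 times) and NN NN (5 times)
    2: [((1, 0), 7)],                # seen as VB NN (7 times)
}

def filter_candidates(candidates, required):
    """Keep candidates with at least one tag pattern satisfying `required`.

    `required` holds one tag name per position, or None for no constraint.
    Returns (ngram_id, matched frequency) pairs.
    """
    want = tuple(None if r is None else TAG_IDS[r] for r in required)
    results = []
    for ngram_id in candidates:
        freq = sum(
            f
            for pattern, f in ngram_tags.get(ngram_id, [])
            if all(w is None or w == p for w, p in zip(want, pattern))
        )
        if freq > 0:
            results.append((ngram_id, freq))
    return results

print(filter_candidates([1, 2, 3], ("NN", None)))  # [(1, 15)]
print(filter_candidates([1, 2], (None, "NN")))     # [(1, 5), (2, 7)]
```

Comparing integers instead of tag strings is what keeps the per-n-gram tag information compact.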

Page 13: N-gram Search Engine on Wikipedia


Implementation: Overview

[Figure: the overview diagram again, showing stages 1. Search candidates, 2. Filtering, and 3. Display.]

Page 14: N-gram Search Engine on Wikipedia


Display

• N-grams are displayed in descending order of frequency
  – N-gram IDs are ordered by frequency
• Sentences are searched using a suffix array
• POS, chunk, NE are displayed with sentence, KWIC, ngram
• Doc ID and Wikipedia title (and possibly other features of the document) are displayed with sentences and KWIC
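
To illustrate how a suffix array can locate the occurrences of an n-gram in text and produce KWIC lines, here is a small word-level sketch in Python. The naive construction (sorting all suffixes) is only suitable for toy input, and the function names and sample sentence are assumptions for illustration.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, sa, phrase):
    """Binary-search the suffix array for suffixes that start with `phrase`."""
    n = len(phrase)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose n-token prefix is >= phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose n-token prefix is > phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

def kwic(tokens, positions, n, width=2):
    """Keyword-in-context triples: (left context, match, right context)."""
    return [
        (
            " ".join(tokens[max(0, p - width):p]),
            " ".join(tokens[p:p + n]),
            " ".join(tokens[p + n:p + n + width]),
        )
        for p in positions
    ]

tokens = "the cat sat on the mat and the cat slept".split()
sa = build_suffix_array(tokens)
hits = find_occurrences(tokens, sa, ["the", "cat"])
for left, match, right in kwic(tokens, hits, 2):
    print(f"{left:>10} [{match}] {right}")
```

All suffixes sharing a prefix sit next to each other in the sorted array, so two binary searches are enough to find every occurrence.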

Page 15: N-gram Search Engine on Wikipedia


Size of data

[Figure: the overview diagram annotated with sizes. Components: Wikipedia text; Wikipedia POS, chunk, NE; N-gram data; inverted index for n-gram data; suffix array for text; POS, chunk, NE for n-gram data; others. Sizes shown: 108 GB, 6 GB, 8 GB, 8 GB, 260 GB, 100 GB, and 40 GB (Others).]

Total: 530 GB

Text: 1.7G words, 200M sentences, 2.4M articles

Ngram:
  n   Count
  1   8M
  2   93M
  3   377M
  4   733M
  5   1.00B
  6   1.17B
  7   1.27B
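
As a quick consistency check on these figures, the short Python snippet below sums the component sizes and the n-gram counts exactly as they appear on the slide. The variable names are arbitrary, and the n-gram total is approximate because the slide's counts are rounded.

```python
# Component sizes in GB, in the order they appear on the slide.
sizes_gb = [108, 6, 8, 8, 260, 100, 40]
total_gb = sum(sizes_gb)

# Rounded n-gram counts in millions, keyed by n.
ngram_counts_millions = {1: 8, 2: 93, 3: 377, 4: 733, 5: 1000, 6: 1170, 7: 1270}
total_ngrams_millions = sum(ngram_counts_millions.values())

print(f"Total size: {total_gb} GB")
print(f"Total n-grams (1 to 7): about {total_ngrams_millions / 1000:.2f}B")
```

The sizes add up to the 530 GB total shown on the slide, and the n-gram counts add up to roughly 4.65B entries.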

Page 16: N-gram Search Engine on Wikipedia


Future Work

• Other information (e.g., parse, coref, relation, genre, discourse…)
• Longer n-grams
• Compress index, dictionary
• Ease the indexing load
  – Now we need a big memory machine
  – Distributed indexing
• Union operation for tokens

Page 17: N-gram Search Engine on Wikipedia


Available for you

• Web demo
  – At NYU
    • http://nlp.cs.nyu.edu/nsearch
  – At JHU?
• USB hard drive