GOOGLE N-GRAMS ON AMAZON WEB SERVICES PART 3
Thomas Tiahrt, MA, PhD
Computer Science 482 Introduction to Text Analytics
Slide 3
Version 1
Data created July 2009
Version 1 file format:
  N-gram \t year \t match_count \t page_count \t volume_count \n
N-gram is the 1-gram, 2-gram, 3-gram, 4-gram, or 5-gram
year is the publication year
match_count is the number of occurrences of the n-gram in that year
page_count is the number of pages on which the n-gram appeared
volume_count is the number of books in which the n-gram occurred
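The field layout above can be illustrated with a small parsing sketch. The names NgramRecord and parse_line are hypothetical, chosen for this example only, and the sketch assumes every row has exactly the five tab-separated fields listed on the slide.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One row of the Version 1 file format (illustrative names)."""
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_line(line: str) -> NgramRecord:
    """Split one tab-separated row into typed fields.

    Assumes exactly five fields; a row with a different number of
    fields raises ValueError.
    """
    ngram, year, match_count, page_count, volume_count = (
        line.rstrip("\n").split("\t")
    )
    return NgramRecord(
        ngram, int(year), int(match_count), int(page_count), int(volume_count)
    )
```

The n-gram itself contains spaces, not tabs, so splitting on the tab character keeps a 5-gram together as one field.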
Slide 4
AWS Public Dataset
http://aws.amazon.com/datasets/8172056142375670
Stored in AWS Simple Storage Service (S3)
Slide 5
AWS Public Dataset
Stored as compressed data
Luckily, Hadoop supports:
  GZIP
  BZIP2
  LZO (see below)
  DEFLATE (zlib implementation)
But Hadoop does not support WinZip
And Hadoop supports LZO only if you build a version that includes it yourself
Slide 6
Hadoop Compression Formats

Compression Format  Tool          Algorithm  Filename Extension  Multiple files?  Able to be Split?
DEFLATE (zlib)      No CLI tools  DEFLATE    .deflate            No               No
gzip                gzip          DEFLATE+   .gz                 No               No
bzip2               bzip2         bzip2      .bz2                No               Yes
LZO                 lzop          LZO        .lzo                No               No

Source: Hadoop: The Definitive Guide
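For inspecting a sample file locally, outside Hadoop, the filename extensions in the table can drive the choice of decompressor. The following is a minimal sketch using only the Python standard library. The names OPENERS and open_text are hypothetical, UTF-8 text is an assumption, and only .gz and .bz2 are handled because the standard library has no opener for .lzo files.

```python
import bz2
import gzip

# Standard-library openers keyed by filename extension.
# .lzo is not covered: Python ships no LZO module.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_text(path):
    """Open a possibly compressed file as text, chosen by extension."""
    for ext, opener in OPENERS.items():
        if path.endswith(ext):
            return opener(path, "rt", encoding="utf-8")
    # Fall back to plain text for any other extension.
    return open(path, "rt", encoding="utf-8")
```

A file opened this way can be read line by line regardless of which of the two formats it uses.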
Project Assignment I
Use the nwcdatabucket as the bucket for input
  Use the tmp folder in nwcdatabucket
  Input is nwcdatabucket/tmp
Write Python code (in > 1 .py files)
Find the twenty most frequently occurring 5-grams for a 10-year period
  You may hard-code the 10-year period
  E.g. 1950 to 1959
  You need not worry about error checking the range
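One possible shape for the map step is sketched below, not as the required solution but as an illustration of filtering by a hard-coded year range. It assumes the job runs as a Hadoop Streaming job whose mapper receives one Version 1 row per input line with exactly five tab-separated fields. The names START_YEAR, END_YEAR, map_lines, and main are hypothetical.

```python
import sys

# Hard-coded 10-year period, inclusive (an example range).
START_YEAR = 1950
END_YEAR = 1959


def map_lines(lines, start=START_YEAR, end=END_YEAR):
    """Yield (ngram, match_count) for rows whose year is in [start, end]."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 5:
            continue  # skip rows that do not have the five expected fields
        try:
            year = int(parts[1])
            match_count = int(parts[2])
        except ValueError:
            continue  # skip rows with non-numeric year or count
        if start <= year <= end:
            yield parts[0], match_count


def main(stdin, stdout):
    """Write one 'ngram<TAB>match_count' line per matching row."""
    for ngram, count in map_lines(stdin):
        stdout.write("%s\t%d\n" % (ngram, count))
```

Saved as a mapper script, the file would end with a call to main(sys.stdin, sys.stdout). The n-gram is emitted as the key so that all counts for the same 5-gram reach the same reducer.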
Slide 9
Project Assignment II
Setting reducers
  Use the extra arguments at the bottom of the first page
  The following creates 1 reducer:
    -D mapred.reduce.tasks=1
Upload your results as a text file
Upload your Python code modules
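With a single reducer, every key arrives at the same process, so a top-twenty list computed there covers the whole job. The sketch below shows one way a reduce step could do this. It assumes Hadoop Streaming input of the form "ngram<TAB>count", already sorted by n-gram as the framework delivers it. The names sum_by_ngram, top_ngrams, and main are hypothetical.

```python
import heapq
import sys
from itertools import groupby


def sum_by_ngram(lines):
    """Sum counts per n-gram; input lines must be sorted by n-gram."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    valid = (p for p in pairs if len(p) == 2)  # skip malformed lines
    for ngram, group in groupby(valid, key=lambda p: p[0]):
        yield ngram, sum(int(count) for _, count in group)


def top_ngrams(lines, n=20):
    """Return the n (ngram, total) pairs with the largest totals."""
    return heapq.nlargest(n, sum_by_ngram(lines), key=lambda item: item[1])


def main(stdin, stdout):
    """Write the top n-grams, most frequent first, one per line."""
    for ngram, total in top_ngrams(stdin):
        stdout.write("%s\t%d\n" % (ngram, total))
```

Saved as a reducer script, the file would end with a call to main(sys.stdin, sys.stdout). With more than one reducer, each would report only the top twenty of its own share of the keys, which is why the slide sets the reducer count to 1.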