Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

DESCRIPTION

Co-taught with Jason Baldridge; topic for the day: practical Hadoop.


  • Data-Intensive Computing for Text Analysis. CS395T / INF385T / LIN386M, University of Texas at Austin, Fall 2011. Lecture 4, September 15, 2011. Jason Baldridge (Department of Linguistics, jasonbaldridge at gmail dot com) and Matt Lease (School of Information, ml at ischool dot utexas dot edu), University of Texas at Austin.
  • Acknowledgments. Course design and slides based on Jimmy Lin's cloud computing courses at the University of Maryland, College Park. Some figures courtesy of the following excellent Hadoop books (order yours today!): Chuck Lam's Hadoop In Action (2010) and Tom White's Hadoop: The Definitive Guide, 2nd Edition (2010).
  • Today's Agenda. Practical Hadoop: Input/Output, Splits (small file and whole file operations), Compression, Mounting HDFS, Hadoop Workflow and EC2/S3.
  • Practical Hadoop
  • Hello World: Word Count
    map (K1=String, V1=String) -> list (K2=String, V2=Integer)
    reduce (K2=String, list(V2=Integer)) -> list (K3=String, V3=Integer)
    Map(String docid, String text):
      for each word w in text:
        Emit(w, 1);
    Reduce(String term, Iterator values):
      int sum = 0;
      for each v in values:
        sum += v;
      Emit(term, sum);
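  • The same word count fleshed out as compilable Java, written against the old org.apache.hadoop.mapred API; this is only a sketch, and the class and variable names are ours rather than from the slides.

      import java.io.IOException;
      import java.util.Iterator;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      public class WordCount {

        public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            // Emit (word, 1) for every whitespace-delimited token in the line.
            for (String token : value.toString().split("\\s+")) {
              if (token.isEmpty()) continue;
              word.set(token);
              output.collect(word, ONE);
            }
          }
        }

        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            // Sum the partial counts for this word.
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
      }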
  • Courtesy of Chuck Lam's Hadoop In Action (2010), p. 17
  • Courtesy of Chuck Lam's Hadoop In Action (2010), pp. 48-49
  • Courtesy of Chuck Lam's Hadoop In Action (2010), p. 51
  • Courtesy of Tom White's Hadoop: The Definitive Guide, 2nd Edition (2010), p. 191
  • Command-Line Parsing. Courtesy of Tom White's Hadoop: The Definitive Guide, 2nd Edition (2010), p. 135
  • Data Types in Hadoop. Writable: defines a de/serialization protocol; every data type in Hadoop is a Writable. WritableComparable: defines a sort order; all keys must be of this type (but not values). IntWritable, LongWritable, Text: concrete classes for different data types. SequenceFiles: a binary encoding of a sequence of key/value pairs.
  • Hadoop basic types. Courtesy of Chuck Lam's Hadoop In Action (2010), p. 46
  • Complex Data Types in Hadoop. How do you implement complex data types? The easiest way: encode them as Text, e.g. (a, b) = "a:b", and use regular expressions to parse and extract the fields. Works, but pretty hack-ish. The hard way: define a custom implementation of WritableComparable; must implement readFields, write, and compareTo. Computationally efficient, but slow for rapid prototyping. Alternatives: Cloud9 offers two other choices, Tuple and JSON (actually, not that useful in practice).
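  • As a concrete illustration of the hard way, here is a minimal sketch of a hypothetical TextPair key (a pair of strings serialized as two Text fields); the class is illustrative only, not taken from the slides or from Cloud9.

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.WritableComparable;

      public class TextPair implements WritableComparable<TextPair> {
        private final Text first = new Text();
        private final Text second = new Text();

        public TextPair() {}                    // no-arg constructor needed for deserialization

        public TextPair(String first, String second) {
          this.first.set(first);
          this.second.set(second);
        }

        @Override
        public void write(DataOutput out) throws IOException {
          // Serialize both fields in a fixed order.
          first.write(out);
          second.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
          // Deserialize in the same order they were written.
          first.readFields(in);
          second.readFields(in);
        }

        @Override
        public int compareTo(TextPair other) {
          // Sort by first component, breaking ties on the second.
          int cmp = first.compareTo(other.first);
          return (cmp != 0) ? cmp : second.compareTo(other.second);
        }

        @Override
        public int hashCode() {
          // Used by the default HashPartitioner to assign keys to reducers.
          return first.hashCode() * 163 + second.hashCode();
        }

        @Override
        public boolean equals(Object o) {
          if (!(o instanceof TextPair)) return false;
          TextPair other = (TextPair) o;
          return first.equals(other.first) && second.equals(other.second);
        }
      }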
  • InputFormat & RecordReader. Courtesy of Tom White's Hadoop: The Definitive Guide, 2nd Edition (2010), pp. 198-199. A split is logical; atomic records are never split. Note that the key & value objects are re-used across calls!
  • Courtesy of Tom White's Hadoop: The Definitive Guide, 2nd Edition (2010), p. 201
  • Input. Courtesy of Chuck Lam's Hadoop In Action (2010), p. 53
  • Output. Courtesy of Chuck Lam's Hadoop In Action (2010), p. 58
  • OutputFormat: each Reducer writes through its own RecordWriter to its own output file. Source: redrawn from a slide by Cloudera, cc-licensed.
  • Creating Input Splits (White pp. 202-203). FileInputFormat: large files are split into blocks. isSplitable() defaults to TRUE. computeSplitSize() = max(minSize, min(maxSize, blockSize)). getSplits(). How to prevent splitting? Option 1: set mapred.min.split.size = Long.MAX_VALUE. Option 2: subclass FileInputFormat and override isSplitable() to return FALSE (see the sketch below).
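  • A sketch of Option 2 against the old mapred API (the class name is ours); Option 1 is shown as a comment for comparison.

      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.TextInputFormat;

      // Option 2: an input format whose files are never split, so each
      // input file is processed in its entirety by a single map task.
      public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
          return false;
        }
      }

      // Option 1, for comparison, needs no subclass:
      //   conf.set("mapred.min.split.size", String.valueOf(Long.MAX_VALUE));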
  • How to process a whole file as a single record? (e.g. file conversion) Preventing splitting is necessary but not sufficient: you also need a RecordReader that delivers the entire file as one record. Implement the WholeFile input format & record reader recipe (see White pp. 206-209): override getRecordReader() in FileInputFormat and define a new WholeFileRecordReader, as sketched below.
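  • A condensed sketch of that recipe against the old mapred API; see White pp. 206-209 for the full version (class names here are ours).

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileSplit;
      import org.apache.hadoop.mapred.InputSplit;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.RecordReader;
      import org.apache.hadoop.mapred.Reporter;

      public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
          return false;  // never split: one file = one split
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
          return new WholeFileRecordReader((FileSplit) split, job);
        }

        // Delivers the entire file contents as a single BytesWritable record.
        static class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
          private final FileSplit split;
          private final Configuration conf;
          private boolean processed = false;

          WholeFileRecordReader(FileSplit split, Configuration conf) {
            this.split = split;
            this.conf = conf;
          }

          public boolean next(NullWritable key, BytesWritable value) throws IOException {
            if (processed) return false;  // only one record per split
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = fs.open(file);
            try {
              IOUtils.readFully(in, contents, 0, contents.length);
              value.set(contents, 0, contents.length);
            } finally {
              IOUtils.closeStream(in);
            }
            processed = true;
            return true;
          }

          public NullWritable createKey() { return NullWritable.get(); }
          public BytesWritable createValue() { return new BytesWritable(); }
          public long getPos() throws IOException { return processed ? split.getLength() : 0; }
          public float getProgress() { return processed ? 1.0f : 0.0f; }
          public void close() throws IOException { }
        }
      }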
  • Small Files. Files smaller than the Hadoop block size are never split (by default); note this is with the default mapred.min.split.size = 1 byte, and you could extend FileInputFormat to override this behavior. Using many small files is inefficient in Hadoop: overhead for the TaskTracker, JobTracker, and Map objects; more disk seeks; and wasteful of NameNode memory. How to deal with small files?
  • Dealing with small files
    Pre-processing: merge into one or more bigger files. Doubles disk space, unless clever (you can delete the originals after the merge).
      Create a Hadoop Archive (White pp. 72-73): doesn't solve the splitting problem, just reduces NameNode memory.
      Simple text: just concatenate (e.g. each record on a single line).
      XML: concatenate and specify start/end tags; StreamXmlRecordReader (the end tag plays the role that a newline plays for plain text).
      Create a SequenceFile (see White pp. 117-118): a sequence of records, all with the same (key, value) type, e.g. key = filename, value = text or bytes of the original file (see the sketch below). Can also be used for larger files, e.g. if per-block processing is really fast.
    Use CombineFileInputFormat: reduces map overhead, but not seeks or NameNode memory. Only an abstract class is provided; you get to implement it yourself :-< Could also be used to speed up the pre-processing above.
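  • A sketch of the SequenceFile pre-processing idea as a standalone (non-MapReduce) tool: pack a directory of small files into one SequenceFile with key = filename and value = raw bytes. The class name and command-line arguments are ours.

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class SmallFilesToSequenceFile {
        public static void main(String[] args) throws IOException {
          Configuration conf = new Configuration();
          Path inputDir = new Path(args[0]);    // directory of small files
          Path outputFile = new Path(args[1]);  // single SequenceFile to create
          FileSystem fs = FileSystem.get(conf);

          // key = original filename, value = raw bytes of that file
          SequenceFile.Writer writer = SequenceFile.createWriter(
              fs, conf, outputFile, Text.class, BytesWritable.class);
          try {
            for (FileStatus status : fs.listStatus(inputDir)) {
              if (status.isDir()) {
                continue;  // skip subdirectories
              }
              byte[] contents = new byte[(int) status.getLen()];
              FSDataInputStream in = fs.open(status.getPath());
              try {
                IOUtils.readFully(in, contents, 0, contents.length);
              } finally {
                IOUtils.closeStream(in);
              }
              writer.append(new Text(status.getPath().getName()),
                            new BytesWritable(contents));
            }
          } finally {
            writer.close();
          }
        }
      }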
  • Multiple File Formats? What if you have multiple formats for the same content type? MultipleInputs (White pp. 214-215): specify the InputFormat & Mapper to use on a per-path basis (a path can be a directory or a single file; even a single file can hold many records, e.g. a Hadoop archive or SequenceFile). All mappers must have the same output signature! The same reducer is used for all (only the input format differs, not the logical records being processed by the different mappers); see the sketch below. What about multiple file formats stored in the same archive or SequenceFile? Multiple formats stored in the same directory? How are multiple file types typically handled in general? E.g. the factory pattern, White p. 80.
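  • A sketch of MultipleInputs configuration with the old mapred API. The input paths and the PlainTextMapper, SequenceFileMapper, and MyReducer classes are hypothetical placeholders; the key constraint is that both mappers emit the same (key, value) output types.

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.SequenceFileInputFormat;
      import org.apache.hadoop.mapred.TextInputFormat;
      import org.apache.hadoop.mapred.lib.MultipleInputs;

      public class MultiFormatDriver {
        public static void configureInputs(JobConf conf) {
          // Plain text under one path, handled by one mapper...
          MultipleInputs.addInputPath(conf, new Path("/data/plaintext"),
              TextInputFormat.class, PlainTextMapper.class);
          // ...SequenceFiles under another path, handled by a different mapper.
          MultipleInputs.addInputPath(conf, new Path("/data/seqfiles"),
              SequenceFileInputFormat.class, SequenceFileMapper.class);
          // A single reducer sees the merged output of both mappers.
          conf.setReducerClass(MyReducer.class);
        }
      }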
  • Data Compression (White pp. 77-86, Lam pp. 153-155). Big data = big disk space & (I/O-bound) transfer times; this affects both intermediate (mapper output) and persistent data. Compression makes big data less big (but still cool), often 1/4 the size of the original data. Main issues: Does the compression format support splitting? What happens to parallelization if an entire 8GB compressed file has to be decompressed before we can access the splits? Compression/decompression ratio vs. speed: more compression reduces disk space and transfer times, but slow compression can take longer than the transfer-time savings. Use the native libraries!
  • Courtesy of Tom White's Hadoop: The Definitive Guide, 2nd Edition (2010), Ch. 4. Slow; decompression can't keep pace with disk reads.
  • Compression Speed. LZO is ~2x faster than gzip; LZO is ~15-20x faster than bzip2. http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/ http://arunxjacob.blogspot.com/2011/04/rolling-out-splittable-lzo-on-cdh3.html
  • Splittable LZO to the rescue. The LZO format is not internally splittable, but we can create a separate, accompanying index of split points. Recipe: get LZO from Cloudera or elsewhere and set it up (see the URLs on the previous slide for instructions); LZO-compress your files and copy them to HDFS at /path; index them: $ hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /path; then use hadoop-lzo's LzoTextInputFormat instead of TextInputFormat. Voila!
  • Compression API for persistent data. Use the JobConf helper functions (or set the corresponding properties). Input: conf.setInputFormat(LzoTextInputFormat.class). Persistent (reducer) output: FileOutputFormat.setCompressOutput(conf, true); FileOutputFormat.setOutputCompressorClass(conf, LzopCodec.class). Courtesy of Tom White's Hadoop: The Definitive Guide, 2nd Edition (2010), p. 85.
  • Compression API for intermediate data. Similar JobConf helper functions (or properties): conf.setCompressMapOutput(true); conf.setMapOutputCompressorClass(LzopCodec.class). Courtesy of Chuck Lam's Hadoop In Action (2010), pp. 153-155. (See the consolidated sketch below.)
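  • A consolidated sketch of the settings from this slide and the previous one, using the old mapred API. GzipCodec is used here only because it ships with Hadoop; substitute the LZO codec classes if hadoop-lzo is installed.

      import org.apache.hadoop.io.compress.GzipCodec;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobConf;

      public class CompressionSettings {
        public static void enableCompression(JobConf conf) {
          // Persistent (reducer) output written to HDFS:
          FileOutputFormat.setCompressOutput(conf, true);
          FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
          // Intermediate (mapper) output shuffled to the reducers:
          conf.setCompressMapOutput(true);
          conf.setMapOutputCompressorClass(GzipCodec.class);
        }
      }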
  • SequenceFile & compression. Use SequenceFile for passing data between Hadoop jobs; it is optimized for this use case: conf.setOutputFormat(SequenceFileOutputFormat.class). With compression, there is one more parameter to set: the default is per-record compression, but it is almost always preferable to compress on a per-block basis (see the sketch below).
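  • A sketch of block-compressed SequenceFile output, again with the old mapred API and GzipCodec standing in for whichever codec you actually use.

      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.compress.GzipCodec;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.SequenceFileOutputFormat;

      public class SequenceFileOutputSettings {
        public static void useBlockCompressedOutput(JobConf conf) {
          // Write job output as a SequenceFile so a downstream job can read it directly.
          conf.setOutputFormat(SequenceFileOutputFormat.class);
          // Compress per block rather than the per-record default.
          SequenceFileOutputFormat.setCompressOutput(conf, true);
          SequenceFileOutputFormat.setOutputCompressionType(conf,
              SequenceFile.CompressionType.BLOCK);
          SequenceFileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        }
      }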
  • From hadoop fs -X commands to a mounted HDFS. See White p. 50; hadoop: src/contrib/fuse-dfs
  • Hadoop Workflow (you and the Hadoop cluster): 1. Load data into HDFS. 2. Develop code locally. 3. Submit the MapReduce job to the cluster. 3a. Go back to Step 2 as needed.