Fault-tolerant File Input & Output
Chandni Singh - Committer, Apache Apex
May 4, 2016
Background: Windows in Apex
- Window: a finite piece of a data set along temporal boundaries.
- Apex assigns an id to each window, which helps with fault tolerance.
- An operator is provided hooks to know which window id it is on.
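To make the windowing idea concrete, here is a minimal sketch, not Apex API, with illustrative names only, that slices a stream of timestamped tuples into fixed-width windows and assigns each window a monotonically increasing id:

```java
import java.util.*;

// Illustrative only: groups timestamped tuples into fixed-width windows,
// mirroring how Apex slices a stream along temporal boundaries and gives
// every window an id.
public class WindowAssigner {
    private final long windowWidthMillis;

    public WindowAssigner(long windowWidthMillis) {
        this.windowWidthMillis = windowWidthMillis;
    }

    // Window id for a tuple timestamp: all tuples in the same
    // windowWidthMillis slice get the same id.
    public long windowIdFor(long timestampMillis) {
        return timestampMillis / windowWidthMillis;
    }

    public static void main(String[] args) {
        WindowAssigner assigner = new WindowAssigner(500);
        long[] timestamps = {0, 100, 499, 500, 999, 1000};
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestamps) {
            windows.computeIfAbsent(assigner.windowIdFor(ts), k -> new ArrayList<>()).add(ts);
        }
        System.out.println(windows); // {0=[0, 100, 499], 1=[500, 999], 2=[1000]}
    }
}
```

Because every tuple maps to exactly one window id, a replayed window deterministically covers the same slice of the stream.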
File Input
- AbstractFileInputOperator: for reading approximately equal-sized files.
- Out-of-the-box supported file formats include txt, Parquet, and Avro.
- FileSplitterInput & AbstractFSBlockReader: for reading different-sized files and parallelizing the read of a single file.
AbstractFileInputOperator
- Scans a folder periodically for new files.
- Parses the file for records.
- Fault-tolerant and scalable.
AbstractFileInputOperator: Fault tolerance
- A record is not lost.
- A record is associated with only one window id irrespective of failures.
- If a window is replayed, all the records associated with it are replayed.
AbstractFileInputOperator: Fault tolerance cont'd
[Diagram: recovery after an operator failure; window 0 is committed]
AbstractFileInputOperator: Fault tolerance cont'd
Fault tolerance is achieved by:
- Support from the platform:
  - Automatic checkpointing of the state of every operator in the DAG.
  - Automatic restoration of a failed operator in another container.
- WindowDataManager:
  - Saves incremental state every window.
  - Helps with replaying windows that were completed by this operator.
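The replay guarantee above can be sketched in miniature. This is not the Malhar WindowDataManager API; it is an illustrative model of the idea: per window, record the byte range read from the file, so a replayed window re-reads exactly the same range and emits the same records.

```java
import java.util.*;

// Illustrative sketch (not the Malhar API): an input operator records,
// per window id, the byte offsets it read, so a replayed window re-reads
// exactly the same range and emits the same records.
public class ReplayableReader {
    private final byte[] file;                                    // stands in for the input file
    private final Map<Long, int[]> windowToRange = new HashMap<>(); // windowId -> {start, end}
    private int position = 0;

    public ReplayableReader(byte[] file) { this.file = file; }

    // Normal processing: read the next n bytes and remember the range.
    public byte[] readWindow(long windowId, int n) {
        int start = position;
        int end = Math.min(file.length, start + n);
        windowToRange.put(windowId, new int[]{start, end});
        position = end;
        return Arrays.copyOfRange(file, start, end);
    }

    // Recovery: replay a completed window from its saved range, producing
    // exactly the records that were emitted the first time.
    public byte[] replayWindow(long windowId) {
        int[] r = windowToRange.get(windowId);
        return Arrays.copyOfRange(file, r[0], r[1]);
    }
}
```

This is why a record stays associated with only one window id across failures: the window-to-offset mapping, not the current read position, is the source of truth during replay.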
AbstractFileInputOperator: Scalability
- Operator partitions read different subsets of files.
- Files are distributed between partitions based on their hash.
- Number of partitions can be changed at run time by changing a property.
- For advanced use cases, subclasses can override the directory scanner to customize behavior such as having each partition scan a different directory.
- Auto-scaling is also supported, in AbstractThroughputFileInputOperator.
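The hash-based distribution of files between partitions can be sketched as follows; the class and method names are illustrative, not Malhar's:

```java
import java.util.*;

// Illustrative sketch of hash-based file distribution: each partition
// picks up only the files whose path hash maps to its index, so no two
// partitions read the same file and every file is covered exactly once.
public class FilePartitioner {
    public static int partitionFor(String filePath, int partitionCount) {
        // floorMod guards against negative hashCode() values.
        return Math.floorMod(filePath.hashCode(), partitionCount);
    }

    public static List<String> filesForPartition(List<String> files, int partitionIndex, int partitionCount) {
        List<String> mine = new ArrayList<>();
        for (String f : files) {
            if (partitionFor(f, partitionCount) == partitionIndex) {
                mine.add(f);
            }
        }
        return mine;
    }
}
```

Changing the partition count at run time simply reshuffles this assignment; each file still lands on exactly one partition.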
AbstractFileInputOperator: Implementations
- LineByLineFileInputOperator in the Malhar library
- Custom implementation:

public class CustomFileInputOperator<RECORD> extends AbstractFileInputOperator<RECORD>
{
  public final transient DefaultOutputPort<RECORD> output = new DefaultOutputPort<RECORD>();

  @Override
  protected RECORD readEntity() throws IOException
  {
    // read the next record from the input stream
    RECORD record = inputStream.read(...);
    return record;
  }

  @Override
  protected void emit(RECORD tuple)
  {
    output.emit(tuple);
  }
}
FileSplitterInput & AbstractFSBlockReader
- The tasks of discovering files and reading them are separated into different logical operators.
- The file splitter discovers files asynchronously and creates task descriptions: FileBlockMetadata.
- Block readers use FileBlockMetadata to read a particular block of a file.
- Fault-tolerant, parallelizes reading of a single file, and is auto-scalable.
FileSplitterInput & AbstractFSBlockReader: Fault tolerance
- The platform supports checkpointing state and re-deployment automatically.
- FileSplitterInput uses WindowDataManager to replay tuples of completed windows.
- AbstractFSBlockReader relies on the upstream buffer-server to replay tuples from a given window.
- The buffer-server is a buffer associated with each output port of an operator; it holds the tuples emitted by that port.
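The buffer-server's replay role can be modeled in a few lines. This is an illustrative sketch of the idea, not the Apex buffer-server implementation: the publishing port keeps tuples grouped by window id, and a recovering subscriber asks to replay from a given window onward.

```java
import java.util.*;

// Illustrative model of a per-port buffer: tuples are retained per window
// id; a recovering downstream operator replays from a given window, and
// fully committed windows can be purged.
public class PortBuffer<T> {
    private final NavigableMap<Long, List<T>> tuplesByWindow = new TreeMap<>();

    public void publish(long windowId, T tuple) {
        tuplesByWindow.computeIfAbsent(windowId, k -> new ArrayList<>()).add(tuple);
    }

    // Replay every tuple from fromWindowId onward, in window order.
    public List<T> replayFrom(long fromWindowId) {
        List<T> out = new ArrayList<>();
        for (List<T> window : tuplesByWindow.tailMap(fromWindowId, true).values()) {
            out.addAll(window);
        }
        return out;
    }

    // Once a window is committed everywhere downstream, drop its tuples.
    public void purgeUpTo(long windowId) {
        tuplesByWindow.headMap(windowId, true).clear();
    }
}
```

This is why AbstractFSBlockReader itself needs no window-level persistence: its upstream port's buffer can re-deliver everything from the recovery window.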
FileSplitterInput & AbstractFSBlockReader: Fault tolerance cont’d
FileSplitterInput & AbstractFSBlockReader: Scalability
- FileSplitterInput is a simple operator which does not take many resources.
- The block reader does the actual work of reading files and is auto-scalable (in beta).
- Min and max partitions are configurable.
- The frequency of re-partitioning is controlled by a time-interval property.
- Scales up/down based on the pending FileBlockMetadata in the input port queue.
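The scale-up/down decision described above can be sketched as a pure function of the backlog; the per-partition threshold below is an assumption for illustration, not a Malhar constant:

```java
// Illustrative sketch of the auto-scaling decision: size the reader
// partition count to the backlog of pending block-metadata tuples,
// clamped to the configured minimum and maximum. blocksPerPartition is
// an assumed tuning knob, not a real Malhar property.
public class ReaderScaler {
    private final int minPartitions;
    private final int maxPartitions;
    private final int blocksPerPartition; // backlog one reader should absorb

    public ReaderScaler(int minPartitions, int maxPartitions, int blocksPerPartition) {
        this.minPartitions = minPartitions;
        this.maxPartitions = maxPartitions;
        this.blocksPerPartition = blocksPerPartition;
    }

    public int desiredPartitions(int pendingBlocks) {
        // ceil(pendingBlocks / blocksPerPartition), clamped to [min, max]
        int wanted = (pendingBlocks + blocksPerPartition - 1) / blocksPerPartition;
        return Math.max(minPartitions, Math.min(maxPartitions, wanted));
    }
}
```

Evaluating this only on a timer (the re-partition interval) keeps the partition count from thrashing on short-lived backlog spikes.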
FileSplitterInput & AbstractFSBlockReader: Implementations
- FileSplitterInput is concrete. Default behavior can be overridden.
- FS block readers:
  - FSSliceReader: record is a slice
  - AbstractFSLineReader and AbstractFSReadAheadLineReader: record is a line
- Custom FS block reader:

public class CustomFSBlockReader<RECORD> extends AbstractFSBlockReader<RECORD>
{
  public CustomFSBlockReader()
  {
    // initialize reader context
    this.readerContext = new RecordReaderContext();
  }

  @Override
  protected RECORD convertToRecord(byte[] bytes)
  {
    // convert bytes to RECORD
    return RECORD.from(bytes);
  }
}
AbstractFileOutputOperator
- Persists data to a single file or multiple files.
- Automatic rotation of files (optional) based on:
  - file size
  - window count
- Optional compression and encryption of data.
- Fault-tolerant
- Scalable as long as different partitions write to different files. Subclasses can achieve this by appending the operator id to the file name.
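The partition-per-file convention mentioned above can be sketched as a tiny helper; in a real AbstractFileOutputOperator subclass this logic would live in the getFileName(tuple) override, and the helper class below is our own, illustrative name:

```java
// Illustrative sketch: each partition derives a distinct file name by
// appending its operator (partition) id, so no two partitions ever write
// to the same file.
public class PartitionedFileNamer {
    public static String fileNameFor(String baseName, int operatorId) {
        return baseName + "_" + operatorId;
    }

    public static void main(String[] args) {
        // Two partitions writing the same logical output use separate files.
        System.out.println(fileNameFor("events.txt", 1)); // events.txt_1
        System.out.println(fileNameFor("events.txt", 2)); // events.txt_2
    }
}
```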
AbstractFileOutputOperator : Fault tolerance
Each record is persisted exactly once:
- A record is never missed.
- A record is not duplicated.
Example application that persists data exactly once: AtomicFileOutputApp
AbstractFileOutputOperator : Fault tolerance cont’d
To write exactly once:
- Assumes idempotent processing.
- The checkpoint includes the size of each file the operator has written so far.
- On recovery, files are truncated to the size saved in the restoration checkpoint.
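The truncate-on-recovery step can be sketched with plain java.io; this is an illustrative model of the technique, not the operator's actual code, and it assumes a local file system rather than HDFS:

```java
import java.io.*;
import java.util.*;

// Illustrative sketch of the recovery step: the checkpoint stores each
// file's length at checkpoint time; on restore, every file is truncated
// back to that length, discarding bytes written after the checkpoint so
// the replayed windows can rewrite them without duplication.
public class TruncateOnRecovery {
    // checkpointedLengths: file -> length recorded in the restoration checkpoint
    public static void restore(Map<File, Long> checkpointedLengths) throws IOException {
        for (Map.Entry<File, Long> e : checkpointedLengths.entrySet()) {
            try (RandomAccessFile raf = new RandomAccessFile(e.getKey(), "rw")) {
                if (raf.length() > e.getValue()) {
                    raf.setLength(e.getValue()); // drop the uncheckpointed tail
                }
            }
        }
    }
}
```

Combined with idempotent replay upstream, the rewritten tail is byte-for-byte identical, which is what makes the output exactly-once.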
AbstractFileOutputOperator : Fault tolerance cont’d
To avoid dangling-lease issues in HDFS:
- Data is always written to temporary files.
- Temp files are renamed to actual files when a file is finalized, that is, closed for writing.
- The user can choose when files get finalized. Rotation handles finalization automatically.
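The temp-file protocol above can be sketched with java.nio; this is an illustrative local-filesystem model, the real operator does the equivalent through the Hadoop FileSystem API, and the naming convention below is assumed:

```java
import java.io.IOException;
import java.nio.file.*;

// Illustrative sketch of the temp-then-rename protocol: all writes go to
// "<name>.tmp"; finalization renames the temp file to its final name, so
// readers never observe a half-written file under the real name.
public class TmpThenRename {
    public static Path writeAndFinalize(Path dir, String name, byte[] data) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");
        Path fin = dir.resolve(name);
        Files.write(tmp, data);                                   // writes target the temp file
        Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE);     // finalize = atomic rename
        return fin;
    }
}
```

Until the rename happens, only the .tmp file exists, so a crashed writer leaves no partially written file under the final name for a reader (or a recovering writer) to trip over.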
AbstractFileOutputOperator: Custom Implementation

public class CustomFileOutputOperator<RECORD> extends AbstractFileOutputOperator<RECORD>
{
  public CustomFileOutputOperator()
  {
    setMaxLength(1024 * 1024);  // rotate after ~1 MB
    setRotationWindows(600);    // or after 600 windows
  }

  @Override
  protected String getFileName(RECORD tuple)
  {
    // file name for this tuple
    return tuple.getFileName();
  }

  @Override
  protected byte[] getBytesForTuple(RECORD tuple)
  {
    // serialize the record to bytes
    return tuple.toBytes();
  }
}
Acknowledgements
- Apex dev team: Munagala Ramanath, Pramod Immaneni, Sasha Parfenov, Thomas Weise, Timothy Farkas
- Meetup organizers: Amol Kekre, Qybare Pula, Ian Gomez
- Apache Apex Community
© 2016 DataTorrent
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - https://www.datatorrent.com/download/
• Twitter
  ᵒ @ApacheApex - https://twitter.com/apacheapex
  ᵒ @DataTorrent - https://twitter.com/datatorrent
• Meetups - http://www.meetup.com/topics/apache-apex
• Webinars - https://www.datatorrent.com/webinars/
• Videos - https://www.youtube.com/user/DataTorrent
• Slides - http://www.slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - full-featured enterprise product
  ᵒ https://www.datatorrent.com/product/startup-accelerator/
We Are Hiring
• [email protected]
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders