Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework
Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú
Telecommunications Software & Systems Group, Waterford Institute of Technology
ICFCA 2012, Leuven, Belgium



DESCRIPTION

An article from the Telecommunications Software & Systems Group, Waterford Institute of Technology, Ireland, describing algorithms for distributed Formal Concept Analysis.

ABSTRACT

While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration we introduce a distributed approach for performing formal concept mining. Our method has its novelty in that we use a light-weight MapReduce runtime called Twister which is better suited to iterative algorithms than recent distributed approaches. First, we describe the theoretical foundations underpinning our distributed formal concept analysis approach. Second, we provide a representative exemplar of how a classic centralized algorithm can be implemented in a distributed fashion using our methodology: we modify Ganter's classic algorithm by introducing a family of MR* algorithms, namely MRGanter and MRGanter+ where the prefix denotes the algorithm's lineage. To evaluate the factors that impact distributed algorithm performance, we compare our MR* algorithms with the state-of-the-art. Experiments conducted on real datasets demonstrate that MRGanter+ is efficient, scalable and an appealing algorithm for distributed problems.

Accepted for publication at the International Conference on Formal Concept Analysis 2012.

Project participants: Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú

Ruairí de Fréin: rdefrein (at) gmail (dot) com

bibtex:

@incollection{
  year={2012},
  isbn={978-3-642-29891-2},
  booktitle={Formal Concept Analysis},
  volume={7278},
  series={Lecture Notes in Computer Science},
  editor={Domenach, Florent and Ignatov, Dmitry I. and Poelmans, Jonas},
  doi={10.1007/978-3-642-29892-9_26},
  title={Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework},
  url={http://dx.doi.org/10.1007/978-3-642-29892-9_26},
  publisher={Springer Berlin Heidelberg},
  keywords={Formal Concept Analysis; Distributed Mining; MapReduce},
  author={Xu, Biao and Fréin, Ruairí and Robson, Eric and Ó Foghlú, Mícheál},
  pages={292-308}
}

DOWNLOAD

The article on arXiv: http://arxiv.org/abs/1210.2401

Citation preview

1. Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework
   Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú
   Telecommunications Software & Systems Group, Waterford Institute of Technology
   ICFCA 2012, Leuven, Belgium

2. Outline
   1 Motivation
       The Basic Problems of Current FCA Algorithms
       Related Work
   2 Our Solution
       Adopt Iterative MapReduce Framework
       FCA Algorithms Adaptation
   3 Evaluation
   4 Future Work

4. The Basic Problems of Current FCA Algorithms
   Applying FCA algorithms in real-world applications:
   • Time-consuming on large and high-dimensional data.

   Table: Execution time of traditional FCA algorithms (in seconds).

       Dataset       mushroom     anon-web      census-income
       size          8124 × 125   32711 × 294   103950 × 133
       NextClosure   618          14671         18230
       CloseByOne    2543         656           7465

   • Hard to deal with distributed databases:
       Data volume
       Communication
       Privacy
       Security

6. Related Work
   • Little work exists on distributed FCA algorithms.
   • A distributed version of CloseByOne based on Hadoop MapReduce:
     Petr Krajca, etc. Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework. IDA, 2009.
   • Differences in our work:
       using an iterative MapReduce runtime, Twister;
       mining formal concepts in the fewest iterations.
8. Features of MapReduce Framework
   • Divide and conquer strategy: map + reduce function.

   Table: Partitioned datasets S1 and S2.
       S1, or (O_S1, P, I_S1): objects 1, 2, 3 over attributes a, b, c, d, e, f, g
       S2, or (O_S2, P, I_S2): objects 4, 5, 6 over attributes a, b, c, d, e, f, g
       (cross marks not shown)

   • Move the algorithms to the nodes rather than moving the datasets.
   • Utilize a cluster rather than a single machine.
   • Fault tolerance.

9. MapReduce Data Flow
   [Figure: MapReduce data flow. Labels: Input, Split 0, Split 1, Split 2, map, sort, copy, merge, reduce, Part 0, Part 1, Output, node 0.]

10. Twister: an Iterative MapReduce Runtime
   • A lightweight MapReduce runtime developed by Indiana University.
   • Efficient support for iterative MapReduce computations.

   Table: Comparison between Twister and Hadoop.

       Twister                        Hadoop
       Long-running map/reduce task   Single-step map/reduce
       Iteration support              Job chaining
       Static & dynamic data          Static data only

11. Twister Architecture
   [Figure: Twister architecture. Labels: Master Node (Main Program, Twister Driver); Pub/sub Broker Network (brokers B); Worker Nodes (Twister Daemon, Worker Pool, cacheable map and reduce tasks, Local Disk); data distribution, collection, and partition file creation.]
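The divide-and-conquer data flow described on the MapReduce slides above (split, map, sort/copy/merge, reduce) can be illustrated with a small sketch in plain Python. It is not Hadoop or Twister code: the two splits are a hypothetical formal context, and the names `SPLITS`, `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are assumptions made for this example. The job chosen here, collecting the objects that carry each attribute, is only an illustrative task.

```python
from collections import defaultdict

# Hypothetical splits of a formal context: object id -> attributes it has.
SPLITS = [
    {1: "aceg", 2: "bcdfg", 3: "abdf"},   # split S1
    {4: "bcfg", 5: "adef", 6: "bde"},     # split S2
]


def map_fn(split):
    """Map: emit one (attribute, object id) pair per cross in the split."""
    for obj, attrs in split.items():
        for attr in attrs:
            yield attr, obj


def shuffle(pairs):
    """Stand-in for sort/copy/merge: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(attr, objs):
    """Reduce: collect the sorted list of objects that have `attr`."""
    return attr, sorted(objs)


def run_job(splits):
    """Run map over every split, shuffle, then reduce each key group."""
    pairs = (pair for split in splits for pair in map_fn(split))
    return dict(reduce_fn(key, values) for key, values in shuffle(pairs).items())


extents = run_job(SPLITS)
print(extents["a"])  # [1, 3, 5]
```

Each map call only touches its own split, which is what lets the splits live on different nodes. In an iterative runtime of the kind the Twister slide describes, the splits would stay cached in long-running map tasks as static data, and only the small dynamic input would be re-sent on each iteration.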
13. Decompose the FCA Algorithm
   • Map phase produces local concepts, F^Y_{S_n}.
   • Reduce phase generates global concepts by merging local concepts from mappers.
   • Theorem: Given the closures F^Y_{S_1}, ..., F^Y_{S_n} from n disjoint partitions,
       F^Y_S = F^Y_{S_1} ∩ ... ∩ F^Y_{S_n}.
   • Named our algorithms with MR*: MRCbo, MRGanter, MRGanter+.

14. MRGanter Work Flow
   [Figure: MRGanter work flow. The main loop runs while(!isLastClosure(Closure)) runMapReduce(). Map tasks on Data Split 1 to Data Split n call computeClosure() and emit pairs (atr_1, localClosure_1), ..., (atr_i, localClosure_i); reduce tasks Reduce 1 to Reduce n call merging() and check() and produce Closure.]
   Figure: Static data labeled by S and dynamic data labeled by D.

15. Running example of MRGanter and MRGanter+.

   First table:

       d       p_i   F1 from S1     F2 from S2     F
       ∅       g     {c,g}          {b,c,f,g}      {c,g}
               f     {b,d,f}        {f}            {f}
               e     {a,c,e,g}      {d,e}          {e}
               d     {b,d,f}        {d,e}          {d}
               c     {c,g}          {b,c,f,g}      {c,g}
               b     {b,d,f}        {b}            {b}
               a     {a}            {a,d,e,f}      {a}
       {f}     g     {b,c,d,f,g}    {b,c,f,g}      {b,c,f,g}
               e     {a,c,e,g}      {d,e}          {e}
               d     {b,d,f}        {d,e}          {d}
               c     {c,g}          {b,c,f,g}      {c,g}
               b     {b,d,f}        {b}            {b}
               a     {a}            {a,d,e,f}      {a}
       {e}     g     {a,c,e,g}      {a,...,g}      {a,c,e,g}
               f     {a,...,g}      {a,d,e,f}      {a,d,e,f}
               d     {b,d,f}        {d,e}          {d}
               c     {c,g}          {b,c,f,g}      {c,g}
               b     {b,d,f}        {b}            {b}
               a     {a}            {a,d,e,f}      {a}
       {d}     g     {b,c,d,f,g}    {a,...,g}      {b,c,d,f,g}
               f     {b,d,f}        {a,d,e,f}      {d,f}
               e     {a,...,g}      {d,e}          {d,e}
               c     {c,g}          {b,c,f,g}      {c,g}
               b     {b,d,f}        {b}            {b}
               a     {a}            {a,d,e,f}      {a}

   Second table:

       d       p_i   F1 from S1     F2 from S2     F
       ∅       g     {c,g}          {b,c,f,g}      {c,g}
               f     {b,d,f}        {f}            {f}
               e     {a,c,e,g}      {d,e}          {e}
               d     {b,d,f}        {d,e}          {d}
               c     {c,g}          {b,c,f,g}      {c,g}
               b     {b,d,f}        {b}            {b}
               a     {a}            {a,d,e,f}      {a}
       {c,g}   f     {b,c,d,f,g}    {b,c,f,g}      {b,c,f,g}
               e     {a,c,e,g}      {a,...,g}      {a,c,e,g}
               d     {b,c,d,f,g}    {a,...,g}      {b,c,d,f,g}
               b     {b,d,f}        {b}            {b}
               a     {a}            {a,d,e,f}      {a}
       {f}     g     {b,c,d,f,g}    {b,c,f,g}      {b,c,f,g}
               e     {a,c,e,g}      {d,e}          {e}
               d     {b,d,f}        {d,e}          {d}
               c     {c,g}          {b,c,f,g}      {c,g}
               b     {b,d,f}        {b}            {b}
               a     {a}            {a,d,e,f}      {a}
       {e}     g     {a,c,e,g}      {a,...,g}      {a,c,e,g}
               f     {a,...,g}      {a,d,e,f}      {a,d,e,f}
               d     {b,d,f}        {d,e}          {d}
               c     {c,g}          {b,c,f,g}      {c,g}
               b     {b,d,f}        {b}            {b}
               a     {a}            {a,d,e,f}      {a}

16. Efficiency of MR*
   Table: Execution time: Distributed algorithms are the fastest (in seconds) on certain number of machines (in round brackets).

       Dataset       mushroom     anon-web     census-income
       concepts      219010       129009       96531
       Density       17.36%       1.03%        6.7%
       NextClosure   618          14671        18230
       CloseByOne    2543         656          7465
       MRCbo         241 (11)     693 (11)     803 (11)
       MRGanter      20269 (5)    20110 (3)    9654 (11)
       MRGanter+     198 (9)      496 (9)      358 (11)

17. Scalability of MR* (1)
   [Plot: CPU Time (Second), log scale from 10^2 to 10^5, against Nodes (Count), 0 to 12, for MRGanter+, MRCbo, and MRGanter.]
   Figure: Mushroom dataset: comparison of MRGanter+, MRCbo and MRGanter. MRGanter+ outperforms MRCbo and MRGanter when dense data is processed.

18. Scalability of MR* (2)
   [Plot: CPU Time (Second), log scale from 10^2 to 10^5, against Nodes (Count), 0 to 12, for MRGanter+, MRCbo, and MRGanter.]
   Figure: Anon-web dataset: comparison of MRGanter+, MRCbo and MRGanter. MRGanter+ is faster when more than 3 nodes are used.
19. Scalability of MR* (3)
   [Plot: CPU Time (Second), log scale from 10^2 to 10^5, against Nodes (Count), 0 to 12, for MRGanter+, MRCbo, and MRGanter.]
   Figure: Census dataset: comparison of MRGanter+, MRCbo and MRGanter. MRGanter+ is fastest when a large dataset is processed.

20. Future Work
   • Explore the effect of data distribution between cluster nodes.
   • Examine MR* performance with larger dataset sizes.
   • Extend our approach by reducing the size of intermediate data.

21. Thank you
   Questions?
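As a worked illustration of the decomposition theorem on slide 13 (the closure of an attribute set in the whole dataset is the intersection of its closures in the disjoint partitions), the following Python sketch may help. It is a minimal check under stated assumptions, not the authors' implementation: the contexts `S1` and `S2` are assumed, chosen so that their local closures agree with the first block of the running example on slide 15, and the object ids and the names `closure`, `map_phase`, `reduce_phase`, and `distributed_closure` are hypothetical.

```python
from functools import reduce

ATTRIBUTES = frozenset("abcdefg")  # the attribute set P

# Assumed partitions (object id -> attributes); object ids are arbitrary.
S1 = {1: frozenset("aceg"), 2: frozenset("bcdfg"), 3: frozenset("abdf")}
S2 = {4: frozenset("bcfg"), 5: frozenset("adef"), 6: frozenset("bde")}


def closure(context, attrs):
    """Closure of `attrs` in `context`: attributes shared by all objects having `attrs`.

    If no object has all of `attrs`, the closure is the whole attribute set P.
    """
    attrs = frozenset(attrs)
    result = ATTRIBUTES
    for intent in context.values():
        if attrs <= intent:
            result &= intent
    return result


def map_phase(partitions, attrs):
    """Map: each partition computes its local closure independently."""
    return [closure(part, attrs) for part in partitions]


def reduce_phase(local_closures):
    """Reduce: merge local closures by intersection, as the theorem states."""
    return reduce(frozenset.intersection, local_closures, ATTRIBUTES)


def distributed_closure(partitions, attrs):
    """Closure over all partitions without ever joining the data."""
    return reduce_phase(map_phase(partitions, attrs))


print(sorted(closure(S1, "g")), sorted(closure(S2, "g")))  # ['c', 'g'] ['b', 'c', 'f', 'g']
print(sorted(distributed_closure([S1, S2], "g")))          # ['c', 'g']
```

Only the local closures travel from the map phase to the reduce phase, so the partitions never need to be merged on one node. The convention that an unmatched attribute set closes to P is what makes the intersection correct when a partition contains no object with the candidate attributes.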