Introduction to cloud computing and big data - part1

  • Published on
    18-Jul-2015

  • View
    448

  • Download
    2

Embed Size (px)

Transcript

<ul><li><p>Introduction</p><p>Amir H. Payberahamir@sics.se</p><p>Amirkabir University of Technology(Tehran Polytechnic)</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 1 / 48</p></li><li><p>Course Information</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 2 / 48</p></li><li><p>Course Objective</p><p>I Introduction to main concepts and principles of cloud computingand data intensive computing.</p><p>I How to read, review and present a scientific paper.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 3 / 48</p></li><li><p>Course Objective</p><p>I Introduction to main concepts and principles of cloud computingand data intensive computing.</p><p>I How to read, review and present a scientific paper.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 3 / 48</p></li><li><p>Topics of Study</p><p>I Topics we will cover include: Cloud platforms Cloud storage, NoSQL and NewSQL databases Cloud resource management Batch processing frameworks Stream processing frameworks Graph processing frameworks</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 4 / 48</p></li><li><p>Course Material</p><p>I Mainly based on research papers.</p><p>I You will find all the material on the course web page:http://www.sics.se/amir/cloud14.htm</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 5 / 48</p></li><li><p>Course Examination</p><p>I Mid term exam: 20%</p><p>I Final exam: 20%</p><p>I Reading assignments: 27%</p><p>I Final presentation: 23%</p><p>I Final project: 10%</p><p>I Lab assignments: 0% :)</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 6 / 48</p></li><li><p>Reading Assignments</p><p>I Nine reading assignments.</p><p>I You should write a review for each paper (at most two pages).</p><p>I For each paper you should: Identify and motivate the problem. Pinpoint the main contributions. Identify positive/negative aspects of the solution/paper.</p><p>I For each paper, you might be given some questions to answer.</p><p>I Students will work in groups of two.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 7 / 48</p></li><li><p>Reading Assignments</p><p>I Nine reading assignments.</p><p>I You should write a review for each paper (at most two pages).</p><p>I For each paper you should: Identify and motivate the problem. Pinpoint the main contributions. Identify positive/negative aspects of the solution/paper.</p><p>I For each paper, you might be given some questions to answer.</p><p>I Students will work in groups of two.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 7 / 48</p></li><li><p>Reading Assignments</p><p>I Nine reading assignments.</p><p>I You should write a review for each paper (at most two pages).</p><p>I For each paper you should: Identify and motivate the problem. Pinpoint the main contributions. Identify positive/negative aspects of the solution/paper.</p><p>I For each paper, you might be given some questions to answer.</p><p>I Students will work in groups of two.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 7 / 48</p></li><li><p>Reading Assignments</p><p>I Nine reading assignments.</p><p>I You should write a review for each paper (at most two pages).</p><p>I For each paper you should: Identify and motivate the problem. Pinpoint the main contributions. Identify positive/negative aspects of the solution/paper.</p><p>I For each paper, you might be given some questions to answer.</p><p>I Students will work in groups of two.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 7 / 48</p></li><li><p>Presentation</p><p>I Each group give a 30 minutes talk on a scientific paper.</p><p>I The list of papers will be available in the course web page.</p><p>I You are also free to choose any other paper, but it should be con-firmed.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 8 / 48</p></li><li><p>Lab Assignments and the Project</p><p>I Implement simple applications on different frameworks through thecourse.</p><p>I The solution of each lab assignment will be uploaded on the coursepage, one week after their start dates.</p><p>I The final project on top of Spark.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 9 / 48</p></li><li><p>Discussion Forum</p><p>I Use the course discussion forum if you have any questions:http://www.sics.se/amir/cloud14.htm</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 10 / 48</p></li><li><p>Course Overview</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 11 / 48</p></li><li><p>Data is not information, information is not knowledge, knowledge isnot understanding, understanding is not wisdom.</p><p>- Clifford Stoll</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 12 / 48</p></li><li><p>Big Data</p><p>small data big data</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 13 / 48</p></li><li><p>I Big Data refers to datasets and flows largeenough that has outpaced our capability tostore, process, analyze, and understand.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 14 / 48</p></li><li><p>The Four Dimensions of Big Data</p><p>I Volume: data size</p><p>I Velocity: data generation rate</p><p>I Variety: data heterogeneity</p><p>I This 4th V is for Vacillation:Veracity/Variability/Value</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 15 / 48</p></li><li><p>Where DoesBig Data Come From?</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 16 / 48</p></li><li><p>Big Data Market Driving Factors</p><p>The number of web pages indexed by Google, which were aroundone million in 1998, have exceeded one trillion in 2008, and itsexpansion is accelerated by appearance of the social networks.</p><p>[Wei Fan et al., Mining big data: current status, and forecast to the future, 2013]</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 17 / 48</p></li><li><p>Big Data Market Driving Factors</p><p>The amount of mobile data traffic is expected to grow to 10.8Exabyte per month by 2016.</p><p>[Dan Vesset et al., Worldwide Big Data Technology and Services 2012-2015 Forecast, 2013]</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 18 / 48</p></li><li><p>Big Data Market Driving Factors</p><p>More than 65 billion devices were connected to the Internet by2010, and this number will go up to 230 billion by 2020.</p><p>[John Mahoney et al., The Internet of Things Is Coming, 2013]</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 19 / 48</p></li><li><p>Big Data Market Driving Factors</p><p>Open source communities</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 20 / 48</p></li><li><p>Big Data Market Driving Factors</p><p>Many companies are moving towards using Cloud services toaccess Big Data analytical tools.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 21 / 48</p></li><li><p>History of Data</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 22 / 48</p></li><li><p>4000 B.C</p><p>I Manual recording</p><p>I From tablets to papyrus, to parchment, and then to paper</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 23 / 48</p></li><li><p>1450</p><p>I Gutenbergs printing press</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 24 / 48</p></li><li><p>1800s - 1940s</p><p>I Punched cards (no fault-tolerance)</p><p>I Binary data</p><p>I 1890: US census</p><p>I 1911: IBM appeared</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 25 / 48</p></li><li><p>1940s - 1950s</p><p>I Magnetic tapes</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 26 / 48</p></li><li><p>1950s - 1960s</p><p>I Large-scale mainframe computers</p><p>I Batch transaction processing</p><p>I File-oriented record processing model (e.g., COBOL)</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 27 / 48</p></li><li><p>1960s - 1970s</p><p>I Hierarchical DBMS (one-to-many)</p><p>I Network DBMS (many-to-many)</p><p>I VM OS by IBM multiple VMs on a single physical node.</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 28 / 48</p></li><li><p>1970s - 1980s</p><p>I Relational DBMS (tables) and SQL</p><p>I ACID</p><p>I Client-server computing</p><p>I Parallel processing</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 29 / 48</p></li><li><p>1990s - 2000s</p><p>I Virtualized Private Network connections (VPN)</p><p>I The Internet...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 30 / 48</p></li><li><p>2000s - Now</p><p>I Cloud computing</p><p>I NoSQL: BASE instead of ACID</p><p>I Big Data</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 31 / 48</p></li><li><p>Cloud and Big Data</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 32 / 48</p></li><li><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 33 / 48</p></li><li><p>Big Data Analytics Stack</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 34 / 48</p></li><li><p>Hadoop Big Data Analytics Stack</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 35 / 48</p></li><li><p>Spark Big Data Analytics Stack</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 36 / 48</p></li><li><p>Big Data - File systems</p><p>I Traditional file-systems are not well-designed for large-scale dataprocessing systems.</p><p>I Efficiency has a higher priority than other features, e.g., directoryservice.</p><p>I Massive size of data tends to store it across multiple machines in adistributed way.</p><p>I HDFS/GFS, Amazon S3, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 37 / 48</p></li><li><p>Big Data - File systems</p><p>I Traditional file-systems are not well-designed for large-scale dataprocessing systems.</p><p>I Efficiency has a higher priority than other features, e.g., directoryservice.</p><p>I Massive size of data tends to store it across multiple machines in adistributed way.</p><p>I HDFS/GFS, Amazon S3, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 37 / 48</p></li><li><p>Big Data - File systems</p><p>I Traditional file-systems are not well-designed for large-scale dataprocessing systems.</p><p>I Efficiency has a higher priority than other features, e.g., directoryservice.</p><p>I Massive size of data tends to store it across multiple machines in adistributed way.</p><p>I HDFS/GFS, Amazon S3, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 37 / 48</p></li><li><p>Big Data - File systems</p><p>I Traditional file-systems are not well-designed for large-scale dataprocessing systems.</p><p>I Efficiency has a higher priority than other features, e.g., directoryservice.</p><p>I Massive size of data tends to store it across multiple machines in adistributed way.</p><p>I HDFS/GFS, Amazon S3, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 37 / 48</p></li><li><p>Big Data - Database</p><p>I Relational Databases Management Systems (RDMS) were not de-signed to be distributed.</p><p>I NoSQL databases relax one or more of the ACID properties: BASE</p><p>I Different data models: key/value, column-family, graph, document.</p><p>I Hbase/BigTable, Dynamo, Scalaris, Cassandra, MongoDB, Volde-mort, Riak, Neo4J, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 38 / 48</p></li><li><p>Big Data - Database</p><p>I Relational Databases Management Systems (RDMS) were not de-signed to be distributed.</p><p>I NoSQL databases relax one or more of the ACID properties: BASE</p><p>I Different data models: key/value, column-family, graph, document.</p><p>I Hbase/BigTable, Dynamo, Scalaris, Cassandra, MongoDB, Volde-mort, Riak, Neo4J, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 38 / 48</p></li><li><p>Big Data - Database</p><p>I Relational Databases Management Systems (RDMS) were not de-signed to be distributed.</p><p>I NoSQL databases relax one or more of the ACID properties: BASE</p><p>I Different data models: key/value, column-family, graph, document.</p><p>I Hbase/BigTable, Dynamo, Scalaris, Cassandra, MongoDB, Volde-mort, Riak, Neo4J, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 38 / 48</p></li><li><p>Big Data - Database</p><p>I Relational Databases Management Systems (RDMS) were not de-signed to be distributed.</p><p>I NoSQL databases relax one or more of the ACID properties: BASE</p><p>I Different data models: key/value, column-family, graph, document.</p><p>I Hbase/BigTable, Dynamo, Scalaris, Cassandra, MongoDB, Volde-mort, Riak, Neo4J, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 38 / 48</p></li><li><p>Big Data - Resource Management</p><p>I Different frameworks require different computing resources.</p><p>I Large organizations need the ability to share data and resourcesbetween multiple frameworks.</p><p>I Resource management share resources in a cluster between multipleframeworks while providing resource isolation.</p><p>I Mesos, YARN, Quincy, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 39 / 48</p></li><li><p>Big Data - Resource Management</p><p>I Different frameworks require different computing resources.</p><p>I Large organizations need the ability to share data and resourcesbetween multiple frameworks.</p><p>I Resource management share resources in a cluster between multipleframeworks while providing resource isolation.</p><p>I Mesos, YARN, Quincy, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 39 / 48</p></li><li><p>Big Data - Resource Management</p><p>I Different frameworks require different computing resources.</p><p>I Large organizations need the ability to share data and resourcesbetween multiple frameworks.</p><p>I Resource management share resources in a cluster between multipleframeworks while providing resource isolation.</p><p>I Mesos, YARN, Quincy, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 39 / 48</p></li><li><p>Big Data - Resource Management</p><p>I Different frameworks require different computing resources.</p><p>I Large organizations need the ability to share data and resourcesbetween multiple frameworks.</p><p>I Resource management share resources in a cluster between multipleframeworks while providing resource isolation.</p><p>I Mesos, YARN, Quincy, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 39 / 48</p></li><li><p>Big Data - Execution Engine</p><p>I Scalable and fault tolerance parallel data processing on clusters ofunreliable machines.</p><p>I Data-parallel programming model for clusters of commodity ma-chines.</p><p>I MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 40 / 48</p></li><li><p>Big Data - Execution Engine</p><p>I Scalable and fault tolerance parallel data processing on clusters ofunreliable machines.</p><p>I Data-parallel programming model for clusters of commodity ma-chines.</p><p>I MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 40 / 48</p></li><li><p>Big Data - Execution Engine</p><p>I Scalable and fault tolerance parallel data processing on clusters ofunreliable machines.</p><p>I Data-parallel programming model for clusters of commodity ma-chines.</p><p>I MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 40 / 48</p></li><li><p>Big Data - Query/Scripting Language</p><p>I Low-level programming of execution engines, e.g., MapReduce, isnot easy for end users.</p><p>I Need high-level language to improve the query capabilities of exe-cution engines.</p><p>I It translates user-defined functions to low-level API of the executionengines.</p><p>I Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 41 / 48</p></li><li><p>Big Data - Query/Scripting Language</p><p>I Low-level programming of execution engines, e.g., MapReduce, isnot easy for end users.</p><p>I Need high-level language to improve the query capabilities of exe-cution engines.</p><p>I It translates user-defined functions to low-level API of the executionengines.</p><p>I Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 41 / 48</p></li><li><p>Big Data - Query/Scripting Language</p><p>I Low-level programming of execution engines, e.g., MapReduce, isnot easy for end users.</p><p>I Need high-level language to improve the query capabilities of exe-cution engines.</p><p>I It translates user-defined functions to low-level API of the executionengines.</p><p>I Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...</p><p>Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 41 / 48</p></li><li><p>Big Data - Query/Scripting Language</p><p>I Low-level programming of execution engines, e.g...</p></li></ul>