Getting Started with AWS: Analyzing Big Data


Contents

Overview
Getting Started
    Step 1: Sign Up for the Service
    Step 2: Create a Key Pair
    Step 3: Create an Interactive Job Flow Using the Console
    Step 4: SSH into the Master Node
    Step 5: Start and Configure Hive
    Step 6: Create the Hive Table and Load Data into HDFS
    Step 7: Query Hive
    Step 8: Clean Up
Variations
Calculating Pricing
Related Resources
Document History


Overview

Big data refers to data sets that are too large to be hosted in traditional relational databases and are inefficient to analyze using nondistributed applications. As the amount of data that businesses generate and store continues to grow rapidly, tools and techniques to process this large-scale data become increasingly important.

This guide explains how Amazon Web Services helps you manage these large data sets. As an example, we'll look at a common source of large-scale data: web server logs.

Web server logs can contain a wealth of information about visitors to your site: where their interests lie, how they use your website, and how they found it. However, web server logs grow rapidly, and their format is not immediately compatible with relational data stores.

A popular way to analyze big data sets is with clusters of commodity computers running in parallel. Each computer processes a portion of the data, and then the results are aggregated. One technique that uses this strategy is called map-reduce.

In the map phase, the problem set is apportioned among a set of computers in atomic chunks. For example, if the question were "How many people used the search keyword chocolate to find this page?", the set of web server logs would be divided among the set of computers, with each computer counting instances of that keyword in its partial data set.

The reduce phase aggregates the results from all the computers into a final result. To continue our example, as each computer finished processing its set of data, it would report its results to a single node that tallied the total to produce the final answer.
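
As a rough, purely illustrative sketch of this idea (it is not how Amazon EMR is driven later in this guide, and the file names and keyword are made up), a shell pipeline can mimic both phases on a single machine: split the logs into chunks, count the keyword in each chunk in parallel (map), then sum the partial counts (reduce).

    # Map phase: split the logs into chunks and count keyword occurrences in each chunk in parallel.
    split -l 100000 access_log chunk_
    for f in chunk_*; do
        grep -o 'chocolate' "$f" | wc -l > "$f.count" &
    done
    wait

    # Reduce phase: sum the partial counts from every chunk into a single total.
    cat chunk_*.count | awk '{ total += $1 } END { print total }'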

Apache Hadoop is an open-source implementation of MapReduce that supports distributed processing of large data sets across clusters of computers. You can configure the size of your Hadoop cluster based on the number of physical machines you would like to use; however, purchasing and maintaining a set of physical servers can be an expensive proposition. Further, if your processing needs fluctuate, you could find yourself paying for the upkeep of idle machines.

How does AWS help?

Amazon Web Services provides several services to help you process large-scale data. You pay for only the resources that you use, so you don't need to maintain a cluster of physical servers and storage devices. Further, AWS makes it easy to provision, configure, and monitor a Hadoop cluster.


The following shows the Amazon Web Services that can help you manage big data, the challenges they address, and the benefits they provide.

Challenge: Web server logs can be very large. They need to be stored at low cost while protecting against corruption or loss.
Amazon Web Service: Amazon Simple Storage Service (Amazon S3)
Benefits: Amazon S3 can store large amounts of data, and its capacity can grow to meet your needs. It is highly redundant, protecting against data loss.

Challenge: Maintaining a cluster of physical servers to process data is expensive and time-consuming.
Amazon Web Service: Amazon Elastic Compute Cloud (Amazon EC2)
Benefits: By running map-reduce applications on virtual Amazon EC2 servers, you pay for the servers only while the application is running, and you can expand the number of servers to match the processing needs of your application.

Challenge: Hadoop and other open-source big-data tools can be challenging to configure, monitor, and operate.
Amazon Web Service: Amazon EMR
Benefits: Amazon EMR handles Hadoop cluster configuration, monitoring, and management while integrating seamlessly with other AWS services to simplify large-scale data processing in the cloud.

Data Processing Architecture

In this architecture, the data to be processed is stored in an Amazon S3 bucket. Amazon EMR streams the data from Amazon S3 and launches Amazon EC2 instances to process the data in parallel. When Amazon EMR launches the EC2 instances, it initializes them with an Amazon Machine Image (AMI) that can be preloaded with open-source tools, such as Hadoop, Hive, and Pig, which are optimized to work with other AWS services.

    Let's see how we can use this architecture to analyze big data.


Getting Started

Topics

Step 1: Sign Up for the Service
Step 2: Create a Key Pair
Step 3: Create an Interactive Job Flow Using the Console
Step 4: SSH into the Master Node
Step 5: Start and Configure Hive
Step 6: Create the Hive Table and Load Data into HDFS
Step 7: Query Hive
Step 8: Clean Up

Suppose you host a popular e-commerce website. In order to understand your customers better, you want to analyze your Apache web logs to discover how people are finding your site. You'd especially like to determine which of your online ad campaigns are most successful in driving traffic to your online store.

The web server logs, however, are too large to import into a MySQL database, and they are not in a relational format. You need another way to analyze them.

Amazon EMR integrates open-source applications such as Hadoop and Hive with Amazon Web Services to provide a scalable and efficient architecture to analyze large-scale data, such as your Apache web logs.

In the following tutorial, we'll import data from Amazon S3 and launch an Amazon EMR job flow from the AWS Management Console. Then we'll use Secure Shell (SSH) to connect to the master node of the job flow, where we'll run Hive to query the Apache logs using a simplified SQL syntax.

Working through this tutorial will cost you 29 cents for each hour or partial hour your job flow is running. The tutorial typically takes less than an hour to complete. The input data we will use is hosted in an Amazon S3 bucket owned by the Amazon Elastic MapReduce team, so you pay only for the Amazon Elastic MapReduce processing on three m1.small EC2 instances (29 cents per hour) to launch and manage the job flow in the US-East region.

Let's begin!


Step 1: Sign Up for the Service

If you don't already have an AWS account, you'll need to get one. Your AWS account gives you access to all services, but you will be charged only for the resources that you use. For this example walkthrough, the charges will be minimal.

    To sign up for AWS

1. Go to http://aws.amazon.com and click Sign Up.
2. Follow the on-screen instructions.

    AWS notifies you by email when your account is active and available for you to use.

You use your AWS account credentials to deploy and manage resources within AWS. If you give other people access to your resources, you will probably want to control who has access and what they can do. AWS Identity and Access Management (IAM) is a web service that controls access to your resources by other people. In IAM, you create users, which other people can use to obtain access and permissions that you define. For more information about IAM, go to Using IAM.

Step 2: Create a Key Pair

You can create a key pair to connect to the Amazon EC2 instances that Amazon EMR launches. For security reasons, EC2 instances use a public/private key pair, rather than a user name and password, to authenticate connection requests. The public key half of this pair is embedded in the instance, so you can use the private key to log in securely without a password. In this step we will use the AWS Management Console to create a key pair. Later, we'll use this key pair to connect to the master node of the Amazon EMR job flow in order to run Hive.

    To generate a key pair

1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
2. In the top navigation bar, in the region selector, click US East (N. Virginia).
3. In the left navigation pane, under Network and Security, click Key Pairs.
4. Click Create Key Pair.
5. Type mykeypair in the new Key Pair Name box and then click Create.
6. Download the private key file, which is named mykeypair.pem, and keep it in a safe place. You will need it to access any instances that you launch with this key pair.

Important
If you lose the key pair, you cannot connect to your Amazon EC2 instances.

For more information about key pairs, see Getting an SSH Key Pair in the Amazon Elastic Compute Cloud User Guide.
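
If you prefer working from a terminal, you can create the same key pair with the AWS CLI instead of the console. This is an optional sketch, not part of the original walkthrough; it assumes the AWS CLI is installed and configured with your credentials.

    # Create the key pair, save the private key locally, and restrict its permissions.
    aws ec2 create-key-pair --key-name mykeypair --query 'KeyMaterial' --output text > mykeypair.pem
    chmod 400 mykeypair.pem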


Step 3: Create an Interactive Job Flow Using the Console

To create a job flow using the console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. In the Region box, click US East, and then click Create New Job Flow.
3. On the Define Job Flow page, click Run your own application. In the Choose a Job Type box, click Hive Program, and then click Continue. You can leave Job Flow Name as My Job Flow, or you can rename it.

   Select which version of Hadoop to run on your job flow in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Using the MapR Distribution for Hadoop in the Amazon Elastic MapReduce Developer Guide.


4. On the Specify Parameters page, click Start an Interactive Hive Session, and then click Continue.

   Hive is an open-source tool that runs on top of Hadoop. With Hive, you can query job flows by using a simplified SQL syntax. We are selecting an interactive session because we'll be issuing commands from a terminal window.

Note
You can select the Execute a Hive Script check box to run Hive commands that you store in a text file in an Amazon S3 bucket. This option is useful for automating Hive queries you want to perform on a recurring basis. Because of the requirements of Hadoop, Amazon S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and hyphens (-).

5. On the Configure EC2 Instances page, you can set the number and type of instances used to process the big data set in parallel. To keep the cost of this tutorial low, we will accept the default instance types, an m1.small master node and two m1.small core nodes. Click Continue.


Note
When you are analyzing data in a real application, you may want to increase the size or number of these nodes to increase processing power and reduce computational time. In addition, you can specify that some or all of your job flow run as Spot Instances, a way of purchasing unused Amazon EC2 capacity at reduced cost. For more information about Spot Instances, go to Lowering Costs with Spot Instances in the Amazon Elastic MapReduce Developer Guide.

    6. On the Advanced Options page, specify the key pair that you created earlier.

Leave the rest of the settings on this page at the default values. For example, Amazon VPC Subnet Id should remain set to No preference.


Note
In a production environment, debugging can be a useful tool to analyze errors or inefficiencies in a job flow. For more information on how to use debugging in Amazon EMR, go to Troubleshooting in the Amazon Elastic MapReduce Developer Guide.

7. On the Bootstrap Actions page, click Proceed with no Bootstrap Actions and then click Continue.

   Bootstrap actions load custom software onto the AMIs that Amazon EMR launches. For this tutorial, we will be using Hive, which is already installed on the AMI, so no bootstrap action is needed.

8. On the Review page, review the settings for your job flow. If everything looks correct, click Create Job Flow. When the confirmation window closes, your new job flow will appear in the list of job flows in the Amazon EMR console with the status STARTING. It will take a few minutes for Amazon EMR to provision the Amazon EC2 instances for your job flow.

Step 4: SSH into the Master Node

When the job flow's Status in the Amazon EMR console is WAITING, the master node is ready for you to connect to it. Before you can do that, however, you must acquire the DNS name of the master node and configure your connection tools and credentials.


To locate the DNS name of the master node

Locate the DNS name of the master node in the Amazon EMR console by selecting the job flow from the list of running job flows. This causes details about the job flow to appear in the lower pane. The DNS name you will use to connect to the instance is listed on the Description tab as Master Public DNS Name. Make a note of the DNS name; you'll need it in the next step.

Next we'll use a Secure Shell (SSH) application to open a terminal connection to the master node. An SSH application is installed by default on most Linux, Unix, and Mac OS installations. Windows users can use an application called PuTTY to connect to the master node. Platform-specific instructions for configuring a Windows application to open an SSH connection are described later in this topic.

You must first configure your credentials, or SSH will return an error message saying that your private key file is unprotected, and it will reject the key. You need to do this step only the first time you use the private key to connect.

    To configure your credentials on Linux/Unix/Mac OS X

1. Open a terminal window. This is found at Applications/Utilities/Terminal on Mac OS X and at Applications/Accessories/Terminal on many Linux distributions.

2. Set the permissions on the PEM file for your Amazon EC2 key pair so that only the key owner has permissions to access the key. For example, if you saved the file as mykeypair.pem in the user's home directory, the command is:

    chmod og-rwx ~/mykeypair.pem

    To connect to the master node using Linux/Unix/Mac OS X

1. From the terminal window, enter the following command line, where ssh is the command, hadoop is the user name you are using to connect, the at symbol (@) joins the user name and the DNS name of the machine you are connecting to, and the -i parameter indicates the location of the private key file you saved in step 6 of Step 2: Create a Key Pair. In this example, we're assuming it's been saved to your home directory.


ssh hadoop@master-public-dns-name -i ~/mykeypair.pem

2. You will receive a warning that the authenticity of the host you are connecting to can't be verified. Type yes to continue connecting.

If you are using a Windows-based computer, you will need to install an SSH program in order to connect to the master node. In this tutorial, we will use PuTTY. If you have already installed PuTTY and configured your key pair, you can skip this procedure.

    To install and configure PuTTY on Windows

1. Download PuTTYgen.exe and PuTTY.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.

2. Launch PuTTYgen.
3. Click Load. Select the PEM file you created earlier. You may have to change the search parameters from file of type PuTTY Private Key Files (*.ppk) to All Files (*.*).
4. Click Open.
5. On the PuTTYgen Notice telling you the key was successfully imported, click OK.
6. To save the key in the PPK format, click Save private key.
7. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
8. Enter a name for your PuTTY private key, such as mykeypair.ppk.

To connect to the master node using Windows/PuTTY

1. Start PuTTY.
2. In the Category list, click Session. In the Host Name box, type hadoop@DNS, where DNS is the Master Public DNS Name you noted earlier. The input looks similar to hadoop@master-public-dns-name.
3. In the Category list, expand Connection, expand SSH, and then click Auth.
4. In the Options controlling SSH authentication pane, click Browse for Private key file for authentication, and then select the private key file that you generated earlier. If you are following this guide, the file name is mykeypair.ppk.

5. Click Open.
6. To connect to the master node, click Open.
7. In the PuTTY Security Alert, click Yes.

Note
For more information about how to install PuTTY and use it to connect to an EC2 instance, such as the master node, go to Appendix D: Connecting to a Linux/UNIX Instance from Windows using PuTTY in the Amazon Elastic Compute Cloud User Guide.

When you are successfully connected to the master node via SSH, you will see a welcome screen and prompt like the following.

-----------------------------------------------------------------------------

Welcome to Amazon EMR running Hadoop and Debian/Lenny.

Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.

The Hadoop UI can be accessed via the following commands:

  JobTracker    lynx http://localhost:9100/
  NameNode      lynx http://localhost:9101/

-----------------------------------------------------------------------------

hadoop@ip-10-245-190-34:~$

Step 5: Start and Configure Hive

Apache Hive is a data warehouse application you can use to query data contained in Amazon EMR job flows by using a SQL-like language. Because we launched the job flow as a Hive application, Amazon EMR will install Hive on the Amazon EC2 instances it launches to process the job flow.

We will use Hive interactively to query the web server log data. To begin, we will start Hive, and then we'll load some additional libraries that add functionality such as the ability to easily access Amazon S3. The additional libraries are contained in a Java archive file named hive_contrib.jar on the master node. When you load these libraries, Hive bundles them with the map-reduce job that it launches to process your queries.

    To learn more about Hive, go to http://hive.apache.org/.

    To start and configure Hive on the master node

1. From the command line of the master node, type hive, and then press Enter.
2. At the hive> prompt, type the following command, and then press Enter.

    hive> add jar /home/hadoop/hive/lib/hive_contrib.jar;
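
If you want to confirm that the library was registered with your session, Hive's list jars command prints the resources you have added. This check is optional and not part of the original procedure.

    hive> list jars;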

Step 6: Create the Hive Table and Load Data into HDFS

In order for Hive to interact with data, it must translate the data from its current format (in the case of Apache web logs, a text file) into a format that can be represented as a database table. Hive does this translation using a serializer/deserializer (SerDe). SerDes exist for all kinds of data formats. For information about how to write a custom SerDe, go to the Apache Hive Developer Guide.

The SerDe we'll use in this example uses regular expressions to parse the log file data. It comes from the Hive open source community, and the code for this SerDe can be found online at https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java. Using this SerDe, we can define the log files as a table, which we'll query using SQL-like statements later in this tutorial.


To translate the Apache log file data into a Hive table

Copy the following multiline command. At the hive command prompt, paste the command, and then press Enter.

CREATE TABLE serde_regex(
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';

In the command, the LOCATION parameter specifies the location of the Apache log files in Amazon S3. For this tutorial we are using a set of sample log files. To analyze your own Apache web server log files, you would replace the Amazon S3 URL above with the location of your own log files in Amazon S3. Because of the requirements of Hadoop, Amazon S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and hyphens (-).

After you run the command above, you should receive a confirmation like this one:

Found class for org.apache.hadoop.hive.contrib.serde2.RegexSerDe
OK
Time taken: 12.56 seconds
hive>

Once Hive has loaded the data, the data will persist in HDFS storage as long as the Amazon EMR job flow is running, even if you shut down your Hive session and close the SSH terminal.
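
Optionally, you can confirm that the table exists and inspect its columns with two standard Hive commands before moving on (these are not part of the original walkthrough):

    hive> show tables;
    hive> describe serde_regex;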

Step 7: Query Hive

You are now ready to start querying the Apache log file data. Below are some queries to run.

Count the number of rows in the Apache web server log files.


select count(1) from serde_regex;

    Return all fields from one row of log file data.

    select * from serde_regex limit 1;

    Count the number of requests from the host with an IP address of 192.168.1.198.

    select count(1) from serde_regex where host="192.168.1.198";

To return query results, Hive translates your query into a MapReduce application that is run on the Amazon EMR job flow.

You can invent your own queries for the web server log files. Hive SQL is a subset of SQL. For more information about the available query syntax, go to the Hive Language Manual.
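
As one more illustration, a query along the following lines shows which referring pages send the most requests, which is one way to approach the ad campaign question posed at the start of this tutorial. It is a sketch rather than part of the original guide; the referer column comes from the table created in Step 6.

    select referer, count(1) as requests
    from serde_regex
    group by referer
    order by requests desc
    limit 10;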

Step 8: Clean Up

To prevent your account from accruing additional charges, you should terminate the job flow when you are done with this tutorial. Because you launched the job flow as interactive, it has to be manually terminated.

    To disconnect from Hive and SSH

1. In your SSH session, press CTRL+C to exit Hive.
2. At the SSH command prompt, type exit, and then press Enter. Afterwards you can close the terminal or PuTTY window.

   exit

To terminate a job flow

In the Amazon Elastic MapReduce console, click the job flow, and then click Terminate.

The next step is optional. It deletes the key pair you created in Step 2. You are not charged for key pairs. If you are planning to explore Amazon EMR further, you should retain the key pair.

    To delete a key pair

1. From the Amazon EC2 console, select Key Pairs from the left-hand pane.
2. In the right pane, select the key pair you created in Step 2 and click Delete.


The next step is optional. It deletes two security groups created for you by Amazon EMR when you launched the job flow. You are not charged for security groups. If you are planning to explore Amazon EMR further, you should retain them.

    To delete Amazon EMR security groups

1. From the Amazon EC2 console, in the Navigation pane, click Security Groups.
2. In the Security Groups pane, click the ElasticMapReduce-slave security group.
3. In the details pane for the ElasticMapReduce-slave security group, delete all actions that reference ElasticMapReduce. Click Apply Rule Changes.
4. In the right pane, select the ElasticMapReduce-master security group.
5. In the details pane for the ElasticMapReduce-master security group, delete all actions that reference ElasticMapReduce. Click Apply Rule Changes.
6. With ElasticMapReduce-master still selected in the Security Groups pane, click Delete. Click Yes to confirm.
7. In the Security Groups pane, click ElasticMapReduce-slave, and then click Delete. Click Yes to confirm.


Variations

Topics

Script Your Hive Queries
Use Pig Instead of Hive to Analyze Your Data
Create Custom Applications to Analyze Your Data

Now that you know how to launch and run an interactive Hive job using Amazon Elastic MapReduce to process large quantities of data, let's consider some alternatives.

Script Your Hive Queries

Interactively querying data is the most direct way to get results, and it is useful when you are exploring the data and developing a set of queries that will best answer your business questions. Once you've created a set of queries that you want to run on a regular basis, you can automate the process by saving your Hive commands as a script and uploading them to Amazon S3. Then you can launch a Hive job flow using the Execute a Hive Script option that we bypassed in Step 4 of Create an Interactive Job Flow Using the Console. For more information on how to launch a job flow using a Hive script, go to How to Create a Job Flow Using Hive in the Amazon Elastic MapReduce Developer Guide.
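
To give a sense of what such a script might look like, here is a hypothetical example. The S3 location is made up, and the statements are simply the ones used interactively in Steps 5 through 7; you would upload a file like this to your own bucket and point the Execute a Hive Script option at it.

    -- weekly-report.q (hypothetical script stored at, for example, s3://mybucket/scripts/weekly-report.q)
    add jar /home/hadoop/hive/lib/hive_contrib.jar;
    -- The CREATE TABLE statement from Step 6 would go here.
    select count(1) from serde_regex;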

Use Pig Instead of Hive to Analyze Your Data

Amazon Elastic MapReduce provides access to many open source tools that you can use to analyze your data. Another of these is Pig, which uses a language called Pig Latin to abstract map-reduce jobs. For an example of how to process the same log files used in this tutorial using Pig, go to Parsing Logs with Apache Pig and Amazon Elastic MapReduce.

Create Custom Applications to Analyze Your Data

If the functionality you need is not available as an open-source project or if your data has special analysis requirements, you can write a custom Hadoop map-reduce application and then use Amazon Elastic MapReduce to run it on Amazon Web Services. For more information, go to Run a Hadoop Application to Process Data in the Amazon Elastic MapReduce Developer Guide.

Alternatively, you can create a streaming job flow that reads data from standard input and then runs a script or executable (a mapper) against the input. Once all the data is processed by the mapper, a second script or executable (a reducer) processes the mapper results. The results from the reducer are sent to standard output. The mapper and the reducer can each be referenced as a file, or you can supply a Java class. You can implement the mapper and reducer in any of the supported languages, including Ruby, Perl, Python, PHP, or Bash. For more information, go to Launch a Streaming Cluster in the Amazon Elastic MapReduce Developer Guide.
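
To make the streaming model concrete, here is a minimal, hypothetical mapper and reducer written in Bash, one of the supported languages listed above. They are not part of the original guide: the mapper emits the keyword and a count of 1 for every match it reads from standard input, and the reducer sums those counts and writes the total to standard output.

    mapper.sh:

    #!/bin/bash
    # Emit "chocolate<TAB>1" for each occurrence of the keyword found in the input.
    grep -o 'chocolate' | awk '{ print $0 "\t1" }'

    reducer.sh:

    #!/bin/bash
    # Sum the counts emitted by the mappers and print the keyword with its total.
    awk -F '\t' '{ total += $2 } END { print "chocolate\t" total }'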


Calculating Pricing

Topics

Amazon S3 Cost Breakdown
Amazon Elastic MapReduce Cost Breakdown

The AWS Simple Monthly Calculator estimates your monthly bill. It provides a per-service cost breakdown, as well as an aggregate monthly estimate. You can also use the calculator to see an estimation and breakdown of costs for common solutions.

This topic walks you through an example of using the AWS Simple Monthly Calculator to estimate your monthly bill for processing 1 GB of weekly web server log files. For additional information, download the whitepaper How AWS Pricing Works.

    To estimate costs using the AWS Simple Monthly Calculator

1. Go to http://calculator.s3.amazonaws.com/calc5.html.
2. In the navigation pane, select a web service that your application requires. Enter your estimated monthly usage for that web service. Click Add To Bill to add the cost for that service to your total. Repeat this step for each web service your application uses.

    3. To see the total estimated monthly charges, click the Estimate of your Monthly Bill tab.

Suppose you have a website that produces web logs that you upload each day to an S3 bucket. This transfer averages about 1 GB of web server data a week. You want to analyze the data once a week for an hour, and then delete the web logs. To make processing the log files efficient, we'll use three m1.large Amazon EC2 instances in the Amazon Elastic MapReduce job flow.

Note
AWS pricing you see in this documentation is current at the time of publication.


Amazon S3 Cost Breakdown

The following table shows the characteristics for Amazon S3 we have identified for this web application hosting architecture. For the latest pricing information, go to AWS Service Pricing Overview.

Characteristic: Storage
Metric: 1 GB/month
Description: The website generates 1 GB of web service logs a week, and we store only one week's worth of data at a time in Amazon S3.

Characteristic: Requests
Metric: PUT requests: 30/month; GET requests: 30/month
Description: Each day a file containing the daily web logs is uploaded to Amazon S3, resulting in 30 PUT requests per month. Each week, Amazon Elastic MapReduce loads seven daily web log files from Amazon S3 to HDFS, resulting in an average of 30 GET requests per month.

Characteristic: Data Transfer
Metric: Data in: 4 GB/month; Data out: 4 GB/month
Description: If the average amount of data generated by the web service logs each week is 1 GB, the total amount of data transferred to Amazon S3 each month is 4 GB. Data transfer into AWS is free. In addition, there is no charge for transferring data between services within AWS in the same Region, so the transfer of data from Amazon S3 to HDFS on an Amazon EC2 instance is also free.

The AWS Simple Monthly Calculator shows the following cost breakdown for Amazon S3 in the US-East region.

The total monthly cost is the sum of the cost for uploading the daily web logs, storing a week's worth of logs in Amazon S3, and then transferring them to Amazon Elastic MapReduce for processing.

Variable: Provisioned Storage
Formula: Storage Rate ($0.125/GB) x Storage Amount (1 GB*)
Calculation: $0.13

Variable: PUT requests
Formula: Request Rate ($0.01/1000 requests) x Number of requests (30**)
Calculation: $0.01

Variable: GET requests
Formula: Request Rate ($0.01/1000 requests) x Number of requests (30**)
Calculation: $0.01

Total Amazon S3 Charges: $0.15

(*) The charge of $0.125 is rounded up to $0.13.
(**) Any number of requests fewer than 1000 is charged the minimum, which is $0.01.

A way to reduce your Amazon S3 storage charges is to use Reduced Redundancy Storage (RRS), a lower-cost option for storing non-critical, reproducible data. For more information about RRS, go to http://aws.amazon.com/s3/faqs/#What_is_RRS.

Amazon Elastic MapReduce Cost Breakdown

The following table shows the characteristics for Amazon Elastic MapReduce we'll use for our example. For more information, go to Amazon Elastic MapReduce Pricing.

Characteristic: Machine Characteristics
Metric: m1.large (on-demand)
Description: To make processing 1 GB of data efficient, we'll use m1.large instances to boost computational power.

Characteristic: Instance Scale
Metric: 3
Description: One master node and two slave nodes.

Characteristic: Uptime
Metric: 4 hours/month
Description: Four hours each month to process the log files.

The AWS Simple Monthly Calculator shows the following cost breakdown for Amazon Elastic MapReduce.


The total monthly cost is the sum of the cost of running and managing the EC2 instances in the job flow.

Variable: Total Amazon Elastic MapReduce Charges
Formula: (Compute cost per hour + Instance Management Cost per hour) x Number of instances x Uptime in hours
Calculation: $0.38 x 3 x 4 = $4.56
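
As a quick, purely illustrative check of the arithmetic above, you could reproduce the totals at a shell prompt with bc (the guide does not ask you to do this):

    echo '0.38 * 3 * 4' | bc               # Amazon Elastic MapReduce charges for the month: 4.56
    echo '4.56 + 0.13 + 0.01 + 0.01' | bc  # Combined with the Amazon S3 charges: 4.71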


Related Resources

    The following table lists related resources that you'll find useful as you work with AWS services.

Resource: AWS Products and Services
Description: A comprehensive list of products and services AWS offers.

Resource: Documentation
Description: Official documentation for each AWS product, including service introductions, service features, API references, and other useful information.

Resource: AWS Architecture Center
Description: Provides the necessary guidance and best practices to build highly scalable and reliable applications in the AWS cloud. These resources help you understand the AWS platform, its services and features. They also provide architectural guidance for design and implementation of systems that run on the AWS infrastructure.

Resource: AWS Economics Center
Description: Provides access to information, tools, and resources to compare the costs of Amazon Web Services with IT infrastructure alternatives.

Resource: AWS Cloud Computing Whitepapers
Description: Features a comprehensive list of technical AWS whitepapers covering topics such as architecture, security, and economics. These whitepapers have been authored either by the Amazon team or by AWS customers or solution providers.

Resource: Videos and Webinars
Description: Previously recorded webinars and videos about products, architecture, security, and more.

Resource: Discussion Forums
Description: A community-based forum for developers to discuss technical questions related to Amazon Web Services.

Resource: AWS Support Center
Description: The home page for AWS Technical Support, including access to our Developer Forums, Technical FAQs, Service Status page, and AWS Premium Support (subscription required).

Resource: AWS Premium Support Information
Description: The primary web page for information about AWS Premium Support, a one-on-one, fast-response support channel to help you build and run applications on AWS Infrastructure Services.

Resource: Form for questions related to your AWS account: Contact Us
Description: This form is only for account questions. For technical questions, use the Discussion Forums.

Resource: Conditions of Use
Description: Detailed information about the copyright and trademark usage at Amazon.com and other topics.


Document History

This document history is associated with the 2011-12-12 release of Getting Started with AWS: Analyzing Big Data with AWS.

Change: Updated Amazon EC2 and Amazon EMR Pricing
Description: Updated to reflect reductions in Amazon EC2 and Amazon EMR service fees.
Release Date: 6 March 2012

Change: Updated Amazon S3 Pricing
Description: Updated to reflect reduction in Amazon S3 storage fees.
Release Date: 9 February 2012

Change: New content
Description: Created new document.
Release Date: 12 December 2011
