Solution Brief

How important is big data to the future success of businesses around the world? According to the McKinsey Global Institute, “a retailer using big data to the full could increase its operating margin by more than 60 percent.”1 The competitive advantages can be enormous, and they aren’t just for retailers. Big data solutions are being used for everything from fraud detection and genetic research to IT infrastructure optimization and social network analysis, and we’re still in the earliest stages of the big data revolution.

As big data analytics moves into the mainstream, business success will increasingly depend on the ability to store, process, and analyze massive volumes of structured and unstructured data in near real time. But traditional data warehouse technologies can’t handle the growing flood of diverse, fast-moving data or the rapid change in requirements that businesses face today. A far more flexible, scalable, and open infrastructure is required.

Together with the open-source community and the big data ecosystem, Intel is helping businesses overcome these obstacles so they can harness big data to make smarter business decisions. Intel software engineers make extensive contributions to open-source projects that span the breadth of the big data solution stack, including Linux*, Java*, Hadoop*, HBase*, and many others. We also collaborate with leading software vendors and global IT organizations to optimize and deploy their big data solutions on Intel® architecture; work with academia to foster technology advancements; and invest in commercial ventures to nurture critical new technologies and solutions. These efforts help to accelerate open-source innovation across the big data ecosystem. They also help to ensure you and your customers get the highest possible value from big data solutions built on Intel architecture.

Do More with Apache Hadoop* on Intel® Architecture

• Simplify Development. With Intel® reference architectures, tuning guides, and best practice recommendations.

• Boost Performance. Through simpler and more effective performance tuning.

• Deliver Higher Value. With advanced server technologies and highly optimized software.

• Stay on Track. With rapid innovation on a proven platform.

According to IDC, more than half a billion dollars in venture capital has been invested in big data technology, and the market is growing at a compound annual growth rate (CAGR) of 40 percent.2

Harness Apache Hadoop* for Faster Performance and Deeper Insight

Speed Innovative Big Data Solutions to Market with Less Effort

As the demand for big data solutions has grown, Apache Hadoop has quickly become one of the preferred platforms for storing and processing large volumes of structured and unstructured data. Businesses can deploy this open-source software framework across a small number of Intel® Xeon® processor-based servers to get started with big data analytics quickly and at remarkably low cost. They can then gradually scale their Apache Hadoop cluster to hundreds or even thousands of nodes to enable sub-second query response times across multiple petabytes of data.

There are many factors to consider when designing, provisioning, and tuning Apache Hadoop solutions, and the decisions you make can have a direct impact on the depth, breadth, and timeliness of the insights your customers can glean from their fast-growing data sets. Intel is collaborating with the Apache Hadoop community to enable system administrators to squeeze the maximum performance out of their Apache Hadoop clusters with minimum complexity.

As a member of the open-source community, we have made extensive contributions to provide you with resources that will help you deliver Apache Hadoop solutions that are optimized for Intel architecture more quickly and with less effort.

• Go from planning to production in just weeks. You can design and implement Apache Hadoop solutions in less time and with greater confidence using Intel reference architectures, tuning guides, and best practice recommendations. Detailed technical recommendations provide you with a solid starting point for designing your own best-fit solutions.

• Deliver higher returns through faster analytics. Identify and resolve performance issues that would be intractable using traditional software tools. Intel developed the HiTune performance analyzer and the HiBench benchmark suite to cut through the complexity of performance tuning for Apache Hadoop, and now makes them freely available as open-source software tools.

What is Apache Hadoop*?

Apache Hadoop is an open-source software framework that enables distributed storage and processing of massive volumes of structured and unstructured data.

It has already become a key competitive differentiator for some of today’s most successful companies, enabling them to extract valuable insights from up to hundreds of petabytes of data in near real time.

www.hadoop.apache.org

Optimize Hadoop Clusters and Workloads for Faster Analytics

HiTune Performance Analyzer

A key advantage of Apache Hadoop is that it’s easier to deploy and use than a traditional data warehouse. Yet optimizing Apache Hadoop clusters and workloads for high performance can be challenging due to the complex interactions among hardware and software in a distributed environment. Intel developed HiTune to address this challenge, providing developers with simple tools to develop highly scalable applications. This scalable, lightweight, and extensible performance analyzer can help you deliver higher performing Apache Hadoop clusters and applications to your customers. It can also help your customers get higher value throughout the life of their cluster.

Typical Apache Hadoop queries are written using an intuitive, high-level, data-flow model. This is great for programmers, because all the messy details of data partitioning, task distribution, load balancing, fault tolerance, and node communications are handled by the Apache Hadoop runtime environment. However, hiding that low-level complexity makes performance tuning a daunting challenge. Engineers may have little or no insight into the low-level interactions between hardware and software that are so critical for understanding and optimizing performance. They typically must rely on trial and error, which is not only time-consuming, but also often results in less-than-optimal performance.
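To make the data-flow model concrete, here is a minimal single-process sketch of word counting written as map, shuffle, and reduce steps. It is plain Python, not Hadoop code: the function names are illustrative assumptions, and the `shuffle` step stands in for the partitioning and grouping that the Apache Hadoop runtime would otherwise perform across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key (hypothetical stand-in for the runtime's shuffle)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big insight"]))
```

The programmer only writes the map and reduce logic; everything hidden inside `shuffle` here is exactly the kind of low-level detail that becomes hard to observe when tuning a real cluster.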

It only took three years for Apache Hadoop* to advance from a pilot project to large-scale commercial distributions. That healthy growth rate continues, pointing toward mainstream adoption throughout the industry and the emergence of a thriving ecosystem of hardware and software vendors. IDC predicts the market for software related to Hadoop will grow at 60 percent a year, reaching $812.8 million by 2016.3

HiTune monitors the key performance metrics on each server in an Apache Hadoop cluster, then aggregates and correlates these low-level indicators with the high-level data flow model. Engineers gain deep insight into the dynamic interactions between different tasks and stages, and can quickly pinpoint performance bottlenecks, application hotspots, and hardware problems that slow performance.

• Simplify and accelerate performance tuning. HiTune provides detailed analysis and visualizations, has negligible performance impact on running applications, and requires no modifications to source code. Intel engineers have used it extensively and have realized performance gains as high as six times, in many cases through relatively simple hardware or software adjustments.

• Scale analyses across thousands of servers. HiTune can be used to analyze applications with tens of thousands of simultaneous processes running across thousands of servers in production environments. The HiTune analysis engine runs as an Apache Hadoop job to enable fast analysis of large amounts of performance data through massively parallel execution. There is no need to analyze just part of an application running on part of a cluster. Engineers can gather and analyze complete information to obtain more useful insights.

• Get higher value over time. Intel will continue to extend and optimize HiTune for Apache Hadoop and other distributed, big data solutions. HiTune has already been used at Intel to tune and optimize performance for Apache Hive, an open-source data warehouse built on top of Apache Hadoop. The tuning expertise you develop today will deliver even higher value in the future.
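The idea of correlating low-level, per-server metrics with high-level data-flow stages can be sketched in a few lines. This is not HiTune's API or data format: the `Sample` and `TaskInterval` records and the `mean_cpu_by_stage` function are hypothetical, chosen only to illustrate attributing each metric sample to the stage whose task was running on that node at that time.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Sample:
    """One low-level metric reading from one server (hypothetical record)."""
    node: str
    timestamp: float
    cpu_util: float


@dataclass(frozen=True)
class TaskInterval:
    """When a task of a given data-flow stage ran on a node (hypothetical record)."""
    stage: str
    node: str
    start: float
    end: float


def mean_cpu_by_stage(samples: Iterable[Sample],
                      tasks: Iterable[TaskInterval]) -> Dict[str, float]:
    """Average CPU utilization per stage, using samples taken while its tasks ran."""
    sample_list = list(samples)
    readings: Dict[str, List[float]] = defaultdict(list)
    for task in tasks:
        for sample in sample_list:
            same_node = sample.node == task.node
            in_window = task.start <= sample.timestamp < task.end
            if same_node and in_window:
                readings[task.stage].append(sample.cpu_util)
    return {stage: sum(vals) / len(vals) for stage, vals in readings.items()}


if __name__ == "__main__":
    samples = [Sample("n1", 1.0, 20.0), Sample("n1", 2.0, 40.0), Sample("n1", 5.0, 90.0)]
    tasks = [TaskInterval("map", "n1", 0.0, 3.0), TaskInterval("reduce", "n1", 4.0, 6.0)]
    print(mean_cpu_by_stage(samples, tasks))
```

A per-stage summary like this is one way a stage with unusually high or low utilization could stand out as a candidate bottleneck; a production analyzer would track many more metrics and run the aggregation in parallel.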

Quantify Performance to Verify Value

HiBench Benchmark Suite

Optimizing and verifying performance for Apache Hadoop clusters will become increasingly important as the market grows and customers begin using big data insights in near real time to improve revenue flows, profitability, and operational efficiency. With the HiBench benchmark suite, you can measure, validate, and compare performance for Apache Hadoop clusters accurately and consistently across diverse workloads to provide your customers with better information and greater confidence.

HiBench provides convenient access to 10 Apache Hadoop workloads that are simple to use and have been extended, configured, and customized to reflect typical deployments. You can measure performance for specific, common tasks, such as sorting and word counting, or for more comprehensive real-world applications, such as web searching, machine learning, and data analytics. The different workloads have different characteristics, so you can establish test matrices that reflect the resource demands of specific environments.
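A test matrix of this kind can be sketched as the cross product of the dimensions you want to vary. The snippet below is only an illustration: `build_test_matrix`, the workload labels, and the data-size and node-count values are assumptions for the example, not HiBench workload identifiers or configuration settings.

```python
from itertools import product
from typing import Dict, List, Sequence, Union

Cell = Dict[str, Union[str, int]]


def build_test_matrix(workloads: Sequence[str],
                      data_sizes_gb: Sequence[int],
                      node_counts: Sequence[int]) -> List[Cell]:
    """Return one test case per (workload, data size, cluster size) combination."""
    return [
        {"workload": workload, "data_gb": size, "nodes": nodes}
        for workload, size, nodes in product(workloads, data_sizes_gb, node_counts)
    ]


if __name__ == "__main__":
    # Illustrative values only; pick dimensions that match the target environment.
    for cell in build_test_matrix(["sort", "wordcount"], [10, 100], [4, 16]):
        print(cell)
```

Each cell then corresponds to one benchmark run, which makes it easier to compare results consistently across workloads with different resource demands.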

Intel will continue to extend and improve HiBench and is also working with leading vendors and standards bodies to develop industry-standard performance benchmarks for Apache Hadoop. Once these benchmarks are established, you’ll have an even better foundation for understanding architectural issues and for measuring and verifying the performance of your Apache Hadoop solutions.

Get the Technical Information You Need

System administrators can squeeze the maximum performance out of their Apache Hadoop* clusters—with minimum complexity—using a wealth of tools and resources.

• White Paper: “Optimizing Hadoop Deployments”

• Reference Architecture: Intel® Cloud Builders Guide to Apache Hadoop

• Hadoop Performance Best Practices

• HiTune: Dataflow-Based Performance Analysis for Big Data Cloud

• HiBench: A Representative and Comprehensive Hadoop Benchmark Suite

www.intel.com/opensource/bigdata

Build on a Proven Foundation

Designing fully optimized Apache Hadoop clusters requires a deep understanding of the entire solution stack. You could spend months exploring the characteristics of Apache Hadoop workloads and how they interact with the underlying hardware and software. Or you could take advantage of the expertise Intel has developed through years of research and collaboration with companies that are now running some of the largest and most successful Apache Hadoop implementations in the world, including Google, Yahoo!, and several leading telecommunications and financial services companies.

Deliver Higher Value through Intel® Technologies

Businesses face new challenges as they work to store and mine growing data volumes and distribute real-time insights to people and processes throughout their organizations. Performance, security, compliance, and reliability become increasingly important, especially when open-source-based solutions are used to support revenue-producing transactions.

Intel helps you meet these demands more easily and at lower cost by integrating forward-looking technologies into Intel Xeon processors and working with the software ecosystem to ensure optimized support.

• Higher performance. Built-in technologies such as Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology help to deliver higher performance and superior scalability for massively parallel, data-intensive applications, such as Apache Hadoop.

• Mission-critical reliability. Advanced reliability, availability, and serviceability (RAS) features protect systems and data more effectively by detecting and correcting errors throughout the server platform, automatically recovering from a wide range of faults, and making it easier for IT organizations to predict, identify, and resolve problems without downtime.

• Stronger security. Integrated security technologies provide a better foundation for protecting systems and data. For example, Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) provides integrated support for high-speed, low-overhead encryption. Your customers can use it to improve security and compliance without slowing performance, all the way from their data centers to their mobile clients.

Intel also works extensively with vendors to optimize and enhance traditional relational database and analytics solutions. Innovative new technologies, such as in-memory analytics, in-database analytics, and columnar data structures, take advantage of key advancements in Intel Xeon processors to enable significant gains in performance and scalability. Building your Apache Hadoop solutions on the same server platform helps to ensure you and your customers have a common foundation for optimizing performance, reliability, and security across complex and widely distributed analytics environments.

Intel engineers have distilled this expertise into reference architectures, tuning guides, and best practice recommendations you can use as a starting point for designing and deploying your Apache Hadoop clusters. With clear guidance that extends all the way from hardware specifications through the complete software stack, you can design, build, and configure best-fit solutions more quickly and at lower cost.

You can also choose from a number of leading Apache Hadoop distributions, all of which are highly optimized for Intel Xeon processors. Intel works with Cloudera, Hortonworks, IBM, and other commercial distributors to help ensure you get the best possible performance on Intel architecture using software that has been extended, hardened, and tested for production-readiness in enterprise environments.

Reference Architectures, Tuning Guides, Best Practice Recommendations

Stay on Track as the Pace Accelerates

Big data solutions are poised to transform the competitive landscape across many industries, and keeping pace with ongoing developments will be both essential and challenging. Intel can help you stay on track. We work directly with academic researchers, the open-source community, hardware and software vendors, cloud providers, and standards bodies throughout the world to help advance open-source innovation. We then build on these efforts by integrating key technologies into next-generation Intel Xeon processors and working to ensure optimized support throughout the solution stack and vendor community, so your customers get leading performance and functionality with each new server they add to their Apache Hadoop clusters.

Fostering open-source innovation

As a member of the open-source community, Intel works upstream to help ensure that critical new capabilities are widely diffused throughout the big data ecosystem. Intel is one of the leading contributors to the Linux kernel and Java open-source software. We are now poised to make substantial contributions to Hadoop, HBase, R, Cassandra*, and many other big data projects to help you get better performance, increased functionality, and higher value in future distributions.

Advancing industry standards

Businesses need standards-based solutions they can deploy with confidence to solve real-world challenges. As technical advisor to the Open Data Center Alliance (ODCA) and its new Data Services workgroup, Intel maintains a connection with more than 300 global IT organizations. Intel is also a founding underwriter of the International Institute for Analytics (IIA). These and many other engagements help Intel shape research and development efforts to ensure that future technologies deliver high value.

Furthering technology breakthroughs through academic research

Intel Labs has injected USD 140 million over five years to fuel research through the rollout of global academic centers, the Intel Science and Technology Centers (ISTCs) and Intel Collaborative Research Institutes (ICRIs), that bring together top professionals and researchers in strategic areas of computing. Research is conducted using an open-source model to facilitate collaboration, and results are shared with the educational community and the technology industry. Researchers at the ISTC for Big Data are working to produce new data management systems and compute architectures that together can help users process data that exceeds the scale, rate, or sophistication of data processing that existing systems provide. The center will also demonstrate the effectiveness of these solutions on real-world applications in science, engineering, and medicine.

“Our goal is to innovate and guide the work of the Intel Science and Technology Center for Big Data across multiple fields, from medical to media, to extract meaning from large amounts of data.”

– Justin Rattner, Chief Technology Officer, Intel

The new Intel Science and Technology Center (ISTC) for Big Data is located at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT).

Spurring the development of pivotal technologies

The most groundbreaking innovations sometimes come from small startup companies that have big ideas. Intel Capital identifies and invests in these companies to help them thrive and to increase their impact on the big data ecosystem. A prime example is Revolution Analytics, a company that helps enterprise customers get higher value from R, an open-source statistics language that has exploded in popularity to become one of the programming languages of choice for many data analysts.

Readying the Linux* Kernel for the Era of Big Data

As businesses look to harness big data to make smarter business decisions, the scalability and performance of the Linux kernel only become more critical. With each successive release of the kernel, and of its own platform hardware, Intel aims to ensure the kernel’s ongoing scalability to take best advantage of the compute capabilities of Intel® architecture-based servers. Together with other members of the Linux community, Intel initiated the Linux Kernel Performance project to continuously monitor kernel performance, evaluating every dot release with key workloads. Beyond contributions to the scalability of Linux, Intel has helped improve Linux power efficiency, graphics operations, wired and wireless networking, and firmware and platform integration.

“There is no doubt in my mind that as trends like big data and open source continue to converge, all of that will be taking place on high-performance Intel® architecture in the cloud.”

– Zack Urlocker, Chief Operating Officer, Zendesk and Board Member, Revolution Analytics

Iceland’s Advania Thor Data Center, powered by 288 Intel® Xeon® processor-based clusters (each featuring 3,456 compute cores), houses the world’s first zero-emissions supercomputer, drawing power from 100-percent renewable resources, including the geothermal plant shown here. Intel strongly supports the Apache Hadoop*/NoSQL ecosystem and other open-source projects that make facilities such as the Advania Thor Data Center possible.

www.intel.com/opensource

Open Source on Intel: Linux contributions, building blocks, industry standards, commercial ecosystem, academic research, tools and resources, customer solutions

1 Source: “Big data: The next frontier for innovation, competition and productivity,” by James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers, McKinsey Global Institute, May, 2011. http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation

2 Source: IDC Press Release: IDC Releases First Worldwide Big Data Technology and Services Market Forecast, Shows Big Data as the Next Essential Capability and a Foundation for the Intelligent Economy, March 7, 2012. http://www.idc.com/getdoc.jsp?containerId=prUS23355112

3 Source: Hadoop software market to hit $812.8 million in 2016, ZDNet, www.zdnet.com/blog/btl/hadoop-software-market-to-hit-812-8-million-in-2016-says-idc/76310

Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s terms and conditions of sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Unless otherwise agreed in writing by Intel, the Intel products are not designed nor intended for any application in which the failure of the Intel product could create a situation where personal injury or death may occur.

Copyright © 2012 Intel Corporation. All rights reserved. Intel, Xeon, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. 1012/NR/PRW/PDF 328063-001US

Apache Hadoop is transforming the way companies store and use data by delivering powerful new capabilities on a distributed architecture that is far more scalable, flexible, and affordable than traditional data warehouse platforms. Forward-thinking companies are gaining first-mover advantages by mining insight from massive data sets and fast-moving data streams that would otherwise be impossible to analyze. Many others are following in their footsteps, leading to rapid market growth for big data products and solutions. Intel offers tools, resources, and platforms that can help you get innovative big data solutions to market faster and with less effort, and deliver higher value to your customers both now and in the future.

The big data revolution is underway. Join us.

www.intel.com/opensource/bigdata

Get Started Now

Intel takes pride in being a long-standing member of the open-source community. We believe in open source development as a means to create rich business opportunities, advance promising technologies, and bring together top talent from diverse fields to solve computing challenges. Our contributions to the community include reliable hardware architectures, professional development tools, work on essential open-source components, collaboration and co-engineering with leading companies, investment in academic research and commercial businesses, and helping to build a thriving ecosystem around open source.