IMEX RESEARCH.COM Are SSDs Ready for Enterprise Storage Systems Anil Vasudeva, President & Chief Analyst, IMEX Research NextGen Infrastructure for Big Data Anil Vasudeva President & Chief Analyst [email protected] 408-268-0800 © 2007-12 IMEX Research All Rights Reserved Copying Prohibited Contact IMEX for authorization IMEX RESEARCH.COM

NextGen Infrastructure for Big Data




1. NextGen Infrastructure for Big Data. Anil Vasudeva, President & Chief Analyst, IMEX Research, [email protected]. © 2010-12 IMEX Research. Copying prohibited. All rights reserved.

2. Abstract: NextGen Infrastructure for Big Data

This session will appeal to Business Planning, Marketing, Technology System Integrators and Data Center Managers seeking to understand the drivers behind the demand for and rise of Big Data.

Abstract
The internet has spawned an explosion in data growth in the form of data sets, called Big Data, that are so large they are difficult to store, manage and analyze using traditional RDBMS, which are tuned for Online Transaction Processing (OLTP) only. Not only is this new data heavily unstructured, voluminous, rapidly streaming and difficult to harness but, even more importantly, the infrastructure cost of the HW and SW required to crunch it using traditional RDBMS, to derive any analytics or business intelligence online (OLAP) from it, is prohibitive.

To capitalize on the Big Data trend, a new breed of Big Data technologies (such as Hadoop and others) and many companies have emerged, leveraging new parallelized processing, commodity hardware, open source software and tools to capture and analyze these new data sets and provide a price/performance that is 10 times better than existing Database/Data Warehousing/Business Intelligence systems.

Learning Objectives
The presentation will illustrate the existing operational challenges businesses face today using RDBMS systems despite using fast-access in-memory and solid state storage technologies. It details how IT is harnessing the emergent Big Data to manage massive amounts of data, and new techniques such as parallelization and virtualization to solve complex problems in order to empower businesses with knowledgeable decision-making.

It lays out the rapidly evolving big data technology ecosystem: different big data technologies from Hadoop, Distributed File Systems and emerging NoSQL derivatives for implementation in private and hybrid cloud-based environments, and the storage infrastructure requirements to store, access, secure and prepare data for analytics and visualization while manipulating it rapidly to derive business intelligence online, to run businesses smartly.

3. Big Data in IT Industry Roadmap
IT Industry Roadmap:
- Analytics BI: Predictive Analytics - Unstructured Data. From Dashboards Visualization to Prediction Engines using Big Data.
- Cloudization: On-Premises > Private Clouds > Public Clouds. DC to Cloud-Aware Infrast. & Apps. Cascade migration to SPs/Public Clouds.
- Automation: Automatically Maintains Application SLAs (Self-Configuration, Self-Healing, Self-Acctg. Charges etc.)
- Virtualization: Pools Resources. Provisions, Optimizes, Monitors. Shuffles Resources to optimize Delivery of various Business Services.
- Integration/Consolidation: Integrate Physical Infrast./Blades to meet CAPSIMS: Cost, Availability, Performance, Scalability, Inter-operability, Manageability & Security.
- Standardization: Standard IT Infrastructure - Volume Economics HW/Syst SW (Servers, Storage, Networking Devices, System Software (OS, MW & Data Mgmt. SW)).

4. Harnessing Big Data for Business Insights
Data Sources > Big Data Infrastructure > Business Insights
- Information is at the center of a new wave of opportunity.
- Majority of data growth is being driven by unstructured data and billions of large objects.
- 80% of the world's data is unstructured, driven by the rise in mobility devices, collaboration and machine generated data.

5. Corporate Need: Business Perf... Optimization
Unstructured Big Data can provide Next Gen Analytics to help businesses make informed, better decisions in:
- Product Strategy
- Targeting Sales
- Just-In-Time Supply-Chain Economics
- Business Performance Optimization
- Predictive Analytics & Recommendations
- Country Resources Management

6. Corporate Need: Real Time Analytics

7. Corporate Need: Business Insights
Store
  Issue: Information Exploding. Volume: Digital Content doubling every 18 months. Velocity: >80% growth driven from unstructured data. Variety: sources of data changing.
  Solution: A unified information/content storage methodology that enables users to manage the volume, velocity and variety of information from multiple sources.
Manage
  Issue: Complexity in "managing" information. Need to classify, synchronize, aggregate, integrate, share, transform, profile, move, cleanse, protect, retire.
  Solution: A solution portfolio of tools and services to manage all types of information in a hybrid storage environment.
Analyze
  Issue: Current solutions limited to BI tools focused on structured and lagging information.
  Solution: Build/buy packaged Real-Time Predictive Analytical Solutions for unstructured analytics tools.
Collaborate
  Issue: Multiple access methods needed to meet needs of a diverse audience.
  Solution: Centralized share, collaborate and act on insights anytime, anywhere on any device.
Model/Adapt
  Issue: Ability to understand how the information impacts the business. How to transfer to action.
  Solution: Model Information on current operations w/potential strategy impact. Leverage Tech. to adapt.

8. Opportunity: Converting Big Data Deluge into Predictive Analytics & Insights
Big Data Predictive Analytics
Personal Location Services Data Generated by | TB/Year
Navigation Devices | 600
Navigation Apps on Phones | 20
Smart Phone Opt-In Tracking | 1000
Geo Targeted Ads | 20
People Locator (Emergency Calls/Search..) | 10
Location based Services (e.g. Games) | 5
Other | 45
Total (Est.) | 1700

9.
Issues with Existing RDBMS
Key Issues with RDBMS Technologies
- Handling Mixed Unstructured Data: RDBMS don't handle non-tabular data (notorious for doing a poor job on recursive data structures)
- Legacy Archaic Architecture: RDBMS don't parallelize well to accommodate commodity HW clusters
- Speed: Seek time of physical storage has not kept pace with network speed improvements
- Scale: Difficult to scale-out RDBMS efficiently; clustering beyond a few servers is notoriously hard
- Integration: Data processing tasks need to combine data from non-related sources, over a network
- Volume: Data volumes have grown from 10s GB > 100s TB > PBs in recent years. Existing tabular RDBMS can't handle such large DBs
[Diagram labels around "Big Data Analytics": Unstructured Data, Scalability, Fault Tolerance, Cost, EU Adhoc Analytics, Latency, Deep Insights, Availability]

10. Issues with Existing RDBMS
Present RDBMS struggling to Store & Analyze Big Data

11. Big Data - Database Solutions

12. Big Data - The New Face of DBs
Big Data Paradigm - The New Face of DB Systems
- Adopts Schema-Free Architecture
- Can do away with Legacy Relational DB Systems: some data have sparse attributes, do not need relational property
- Key Oriented Queries: some data stored/retrieved mainly by primary key, w/o complex joins
- Trade-off of Consistency, Availability & Partition Tolerance
- Scale Out, not up: Online Load balancing, cluster growth

13. Analytics - The Next Frontier in IT

14. Key Innovations: HW Technologies

15. Key Innovations - Solid State Storage
Storage - IOPS/GB & Price Erosion - HDD vs. SSDs
[Chart: Units (Millions) and IOPS/GB for HDDs and SSDs, 2009-2014]
Note: 2U storage rack, 2.5" HDD max cap = 400GB / 24 HDDs, de-stroked to 20%; 2.5" SSD max cap = 800GB / 36 SSDs.
Key to Database performance are random IOPS. SSDs outshine HDD in IO price/performance, a major reason, besides better space and power, for their explosive growth.
Source: IMEX Research SSD Industry Report 2011

16. Innovations - DB SW Technologies
[Timeline chart, Tech Innovation 1985-2015, with rows for OLTP Transactions DB SW, OLAP-Analytics DB SW, Hardware and Big Data. Labels: Rows Locking, Optimizer, Parallel Query, Clustering, XML, Grid, Open Source/Hadoop, Materialized View, Bit Mapped Index, Indexing, Partitioning, Columnar, In-Memory, Query Binding, 32 bit, SMP, NUMA, 64 bit, Multi-core, Flash, MPP, Blades, Columnar MPP, Multi-core In-Memory, Visualization]
[Chart: OLTP Database Innovation Progress - $/TPMc and TPMc/Processor, 1985-2015]
[Chart: Big Data: Analytics DB Technology Impact - $/GB and DW Size TB, 1985-2015]

17. Big Data - Key Requirements
Types of Data Organizations Analyze
[Bar chart with two series; legend: Hadoop, Non Hadoop. Values are listed per data type in the order they appear on the chart.]
- Customer/member data: 59%, 68%
- Transactional data from applications: 44%, 68%
- Application Logs: 69%, 37%
- Other Types of Event Data: 64%, 23%
- Network Monitoring/Network Traffic: 41%, 33%
- Online Retail Transactions: 33%, 34%
- Other Log Files: 51%, 26%
- Call Data Records: 28%, 32%
- Web Logs: 46%, 21%
- Text data from social media and online: 36%, 15%
- Search logs: 36%, 11%
- Trade/quote data: 18%, 15%
- Intelligence/defense data: 18%, 11%
- Multimedia (audio/video/images): 21%, 9%
- Weather: 8%, 3%
- Smartmeter data: 3%, 6%
- Other (please specify): 3%, 5%

18. Big Data Architectural Goals
Big Data Platform:
- Meet Requirements of V3
- Meet Enterprise Criterion
- Analyze Data in Native Format

19.
Big Data - Market Requirements
Platform
- Unified system: Pre-integrated for Ease of Installation and Management
- Large Scale Indexing: Pre-integrated using Hadoop Foundation
- Integrated Text Analytics: Address Unstructured Data
- Usability: User Friendly Admin Console including HDFS Explorer, Query Languages
- Enterprise Class Features: Provisioning, Storage, Scheduler, Advanced Security
- Supports search-centric, document-based XML data model
  o store documents within a transactional repository.
- Schema-Free:
  o No advance knowledge of the document structure (its "schema") needed
  o Index words and values from each of the loaded documents together with its document structure.
- Standard commodity hardware leveraged

20. Big Data - Market Requirements
Architectural
- Shared-nothing clustered DB architecture
  o programmable and extensible application servers.
- Support massive scalability to petabytes of source data
- Support open-source XQuery- and XSLT-driven architecture
- Simple to Deploy, Develop and Manage (UI & RESTful Interface)
- Support extreme mixed workloads: a wide variety of data types including arbitrarily hierarchical data structures, images, waveforms, data logs etc.
- Support thousands of geographically dispersed on-line users and programs executing a variety of requests from ad hoc queries to strategic analysis
- Loading data before declaring or discovering its structure
- Load data in batch and streaming fashion
- Integrate data from multiple sources during load process at very high rates
- Spread I/O and data across instances
- Provide consistent performance with linear cost
- Leverage Open Source SW: Low Costs, Multiple Sources, Hadoop Foundation Tools
- Connectivity with Oracle DB, Teradata Warehouse, JDBC Connectivity

21.
Big Data - Market Requirements
Real Time Analytics Execution
- Execute streaming analytic queries in real time on incoming load data
- Updating data in place at full load speeds
- Scheduling and execution of complex multi-hundred node workflows
- Join a billion row dimension table to a trillion row fact table without pre-clustering the dimension table with the fact table
Performance
- Analyze data at very high rates >GB/sec
- Predictable sub-ms response time for highly constrained standard SQL queries
Availability
- Ability to configure without any single point of failure
- Auto-Failover Extreme High Availability
  o Automated failover and process continuation without operational interruption when processing nodes fail

22. Big Data Product Metrics Choices
[Diagram: "Big Data Product Metrics" with the following choices]
- Data Set Size: PB, TB, GB
- Data Structure: Transaction, Machine, Unstructured, Other
- Access/Use: Transaction, Search, Analytics
- Parallel Processing: Appliance, Cluster < 1K, Cluster > 1K
- Memory: In-Memory, Flash
- DB Technique: Columnar, Zero Sharing, No SQL
- Data Cataloging SW: Text, Image, Audio, Video

23. Advantage: Big Data Products
Characteristic | Legacy Paradigm | Big Data Paradigm
Structure | Transactional/Corporate | Unstructured/Derivative/Internet
Mode | Data Collection | Data Analysis
Focus | Find Answers | Find Questions
Facility | Reportive / What Happened? | Analytic / Why did it Happen? Predictive / What will Happen Next?
Opportunity | Very Small Growth | Massive Growth
Players | Legacy Players | Agile Start Ups, well funded
Impact | Analyze Existing Businesses | Create New Businesses

24.
Advantage: Big Data Products
Characteristic | Traditional RDBMS | Big Data/MapReduce
Data Size | GB | PB
Access | Interactive | Batch/Near Real-Time
Latency | Low | High
Data Updates | Read & Write Many Times | Write Once Read Many Times
Schema/Structure | Static Schema | Dynamic Schema
Language | SQL | UQL/Procedural (Java, C++..)
Integrity | High | Not 100%
Works Well for | Process Intensive Jobs | Data Intensive Jobs
Works Well w Data Size | Gigabytes | Petabytes
Data/Processing Interactions | Low Latency/High BW precursor to success. Ntwk. BW can be a bottleneck causing nodes to be idle | Sends Code to Data, instead of Sending Data to other Nodes (Requiring Lower BW in Cluster)
Fault Tolerance | Coordinating Processes with Node Failures a challenge | Fault Tolerant for HW/SW Failures
Scaling | Non-linear | Linear
Pgm-Distribution of Jobs | Difficult | Simple & Effective

25. Big Data Ecosystem

26. Big Data Stack
Merging Hadoop innovations into Nextgen DBMS Data Analytics Platform
[Diagram labels: Applications, Sources, Advanced Analytics, RHive, Oracle, IBM, Teradata, DBA PerfMon, Query Plan, Data Importer, Hive Workflow, ETL, Enterprise Databases, Oracle-to-Hive, Ad-hoc query, Data Sources, OLAP Data Server, OLTP Data Server, Data Store (Hadoop), Data exporter, Streaming Devices, Collector, Rest/JSON, Search API, Real-time Queries, Admin Center]

27. Hadoop's Fit in Enterprise Stack

28. Big Data - Hadoop Architecture
- Programming Languages: Data Flow (Pig), SQL (Hive)
- Computations: Distributed Computing Framework (MapReduce)
- Coordination (Zookeeper), Management (HMS)
- Metadata (HCatalog)
- Tabular Storage: Column-Stg (HBase)
- Object Storage: Hadoop Distributed File System (HDFS)

29.
Big Data Connectors to EDW/BI
BI Framework - Interoperable with Enterprise Data Warehousing
[Stack diagram, top to bottom]
- EDW BI Applications (Query, Analytics, Reporting, Statistics)
- Big Data Orchestration Framework (Connectors): HBase, Avro, Flume, ZooKeeper
- Big Data Access Framework: Hive, Sqoop, Pig
- Big Data Storage Framework (HDFS), Big Data Processing Framework (MapReduce)
- Hypervisor/VMs
- Operating System
- Servers
[Side labels: Backup & Recovery, Management, Security, Network]

30. Big Data Infrastructure - MapReduce
[Diagram: Data Flow (Pig), SQL (Hive), MapReduce Computing Framework, Coordination (Zookeeper), Management (HMS), Metadata (HCatalog), Column-Stg (HBase), Hadoop Distributed File System (HDFS)]
MapReduce - A Distributed Computing Model
- Typical Pipeline: Input > Map > Shuffle/Sort > Reduce > Output
- Easy to Use: Developer writes few functions
- Moves compute to Data: Schedules work on HDFS node with data
- Scans through data, reducing seeks
- Automatic Reliability and re-execution on failure

31. Big Data Infrastructure - HDFS
HDFS Architecture
[Diagram: Name Nodes actively maintaining High Availability, each holding Namespace State and a Block Map, with Persistent Namespace Metadata & Journal on NFS. Hierarchical Namespace: File name >> Block IDs >> Block Locations. Data Nodes on JBOD hold blocks b1, b2, b3 (Block ID >> Data) and Horizontally Scale I/O & Storage. Data Nodes send Heartbeats & Block Reports and Periodically Check Block Checksums. Recovery steps: 0. Bad/Lost Block, 1. Replicate, 2. Copy, 3. Block Received]
- HDFS - Immutable File System: Read, Write, Sync/Flush, No random writes
- Storage Server used for Computation: Move Computation to Data
- Fault Tolerant & Easy Management: Built In Redundancy, Tolerates Disk & Node Failure, Auto-Managing addition/removal of nodes, One operator/8K nodes
- Not a SAN but high bandwidth network access to data via Ethernet
- Used typically to solve problems not feasible with traditional systems: Large Storage Capacity >100PB raw, Large IO/computational BW >4K node/cluster, scale by adding commodity HW, Cost ~$1.5/GB incl. MR cluster

32. Hadoop Distributed File System
HDFS Architecture
HDFS Characteristics
- Based on Google GFS (Google File System)
- Redundant Storage for massive amounts of data
- Data is distributed across all nodes at load time: efficient MapReduce processing
- Runs on commodity hardware: assumes high failure rate for components
- Works well with lots of large files
- Built around Write once, Read many times
- Large Streaming Reads, Not random access
- High Throughput more important than low latency

33. Hadoop Architecture - Overview
Hadoop Data Processing Architecture

34. Key Technologies Required for Big Data
- Cloud Infrastructure
- Virtualization
- Networking
- Storage
  o In-Memory Data Base (Solid State Memory)
  o Tiered Storage Software (Performance Enhancement)
  o Deduplication (Cost Reduction)
  o Data Protection (Back Up, Archive & Recovery)

35.
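The Input > Map > Shuffle/Sort > Reduce > Output pipeline named on the MapReduce slide (30) can be illustrated with a small single-process sketch. This is not Hadoop code: the function names (`map_words`, `reduce_counts`, `run_pipeline`) and the word-count job are hypothetical, chosen only to show the few functions a developer writes and what the framework does between them.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step (hypothetical job): emit (word, 1) for every word in a record."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values that share one key."""
    return key, sum(values)

def run_pipeline(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    # Map: apply the mapper to every input record.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle/Sort: group values by key; keys are then visited in sorted order.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: one reducer call per key.
    return [reducer(key, groups[key]) for key in sorted(groups)]

lines = ["big data big insights", "data moves to compute"]
word_counts = run_pipeline(lines, map_words, reduce_counts)
print(word_counts)
```

In a real cluster the map calls would run on the nodes that already hold the input blocks and the shuffle would move data across the network; the sketch only mirrors the order of the stages.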
Cloud Infrastructure for Big Data
- Cloud Services Providers: Public - Multitenancy, On Demand; Private - On Premises, Enterprise; Hybrid - Interoperable P2P
  Examples: Public - BT, Telstra, T-Systems, France Telecom; Private/Hybrid - IBM/Cloudburst
- SaaS (Software-as-a-Service): Servers, Network, Storage; Management, Reporting
  Examples: eMail - Yahoo!, Google; Collaboration - Facebook, Twitter; Bus. Apps - SalesForce, GoogleApps, Intuit
- PaaS (Platform Tools & Services): Deploy developed platforms ready for Application SW on Cloud Aware Infrastructure
  Examples: Amazon EC2, Force.com, Navitaire
- IaaS (Infrastructure HW & Services): Servers, Network, Storage; Management, Reporting
  Examples: Amazon S3, Nirvanix

36. Cloud Infrastructure for Big Data
Cloud Computing: Private Cloud (Enterprise), Hybrid, Public Cloud (Service Providers)
- SaaS: Applications (each App with its SLA)
- PaaS: Platform Tools & Services (Python, Ruby, .Net, EJB, PHP ...), Management
- IaaS: Operating Systems, Virtualization, Resources (Servers, Storage, Networks)
An application's SLA dictates the resources required to meet specific requirements of Availability, Performance, Cost, Security, Manageability etc.

37. Private Cloud Requirements for Big Data
[Diagram labels: Public Cloud Storage, Private Cloud Storage, VZ, App Silos]
- Operational Flexibility
- Service Catalog: Choose pre-defined IT Services/user/dept. Define SLA to efficiently meet services Costs
- Self-Service: Access Resources on demand to speed deployment and delivery. Scale Resource Up/down Interactively. Release when not needed
- Service Analytics: Monitor & Analyze Usage for Charge back. Auto-tune to optimize their usage, performance
- Availability with SLA requirements: Moving Data & Processes Seamlessly
- Automation: Automated Provisioning, Allocation, Self Tuning of Resources to meet Workload Requirements

38.
Virtualization: Workloads Consolidation
- A single server 1.5x larger than a standard 2-way server will handle the consolidated load of 6 servers.
- VZ manages the workloads + important apps get the compute resources they need automatically w/o operator intervention.
- Physical consolidation of 15-20:1 is easily possible
- Reasonable goal for VZ x86 servers: 40-50% utilization on large systems (>4way), rising as dual/quad core processors become available
- Savings result in Real Estate, Power & Cooling, High Availability, Hardware, Management
Source: Dan Olds & IMEX Research 2009

39. Storage Architecture - Impact from VZ
Virtualization (VZ) requires Shared Storage for:
- VMotion
- Storage VMotion
- HA/DRS
- Fault Tolerance
Data Protection (Back Up/Archive/DR): RAID 0,1,5,6,10; Virtual Tape; Replication
Storage Efficiency: Virtualization; Thin Provisioning; Deduplication; Auto Tiering; MAID
Additional Capacity Consumed for:
- VZ snapshots
- VM Kernel etc.
Source: IMEX Research SSD Industry Report 2011

40. Storage Issues & Solutions
Technologies Reducing Storage Costs
[Chart: Capacity Requirements under Data Growth, Traditional vs. with capacity savings (~ xx %)]
Storage Costs Reduction:
- Thin Replication: ~95%
- Snapshots: ~75%
- Virtual Clones: ~80%
- Thin Provisioning: ~30%
- DeDuplication: ~25-95%
- Auto-Tiering: 65-95%
- RAID-DP: ~40% vs. R10

41. Storage Architecture Impacting Big Data
Data Protection (Back Up/Archive/DR): RAID 0,1,5,6,10; Virtual Tape; Replication
Storage Efficiency: Storage Virtualization; Thin Provisioning; Deduplication; Auto-Tiering; MAID
Auto-Tiering System using Flash SSDs:
- Data Class (Tiers 0,1,2,3)
- Storage Media Type (Flash/Disk/Tape)
- Policy Engines (Workload Mgmt.)
- Transparent Migration (Data Placement)
- File Virtualization (Uninterrupted App. Opns. in Migration)
Media: DRAM, Flash SSD, Performance Disk, Capacity Disk, Tape
Source: IMEX Research SSD Industry Report 2011

42. Data Storage: Hierarchical Usage
I/O Access Frequency vs. Percent of Corporate Data
[Chart: % of I/O Accesses (65%, 75%, 95%) vs. % of Corporate Data (1%, 2%, 10%, 50%, 100%). Labels: SSD, FCoE/SAS Arrays, SATA, Cloud Storage; Logs, Journals, Temp Tables, Hot Tables, Tables, Indices, Hot Data, Back Up Data, Archived Data, Offsite DataVault]
Source: IMEX Research - Cloud Infrastructure Report 2009-12

43. SSD Storage: Filling Price/Perf. Gaps
[Chart: Price $/GB vs. Performance (I/O Access Latency) for CPU, SDRAM, DRAM, NOR, NAND, SCM, PCIe SSD, SATA SSD, HDD, Tape]
- DRAM getting Faster (to feed faster CPUs) & Larger (to feed Multi-cores & Multi-VMs from Virtualization)
- SSD segmenting into PCIe SSD Cache (as backend to DRAM) & SATA SSD (as front end to HDD)
- HDD becoming Cheaper, not faster
Source: IMEX Research SSD Industry Report 2010-12

44. SSD Storage - Performance & TCO
[Chart: SAN TCO using HDD vs. Hybrid Storage - Cost $K for HDD Only vs. HDD/SSD, broken down into HDD FC, HDD SATA, SSDs, RackSpace, Power & Cooling]
[Chart: SAN Performance Improvements using SSD - Performance (IOPS) and $/IOP for FC-HDD Only vs. SSD/SATA-HDD; labels: IOPS Improvement, $/IOPS Improvement, 475%, 800%]
Source: IMEX Research SSD Industry Report 2011

45. Workloads Characterization
[Chart: IOPS* (*Latency^-1), 10 to 1000K, vs. MB/sec, 1 to 500. Workloads: OLTP Transaction Processing, eCommerce, Business Intelligence, Data Warehousing, OLAP, Scientific Computing, HPC, Imaging, TP, Audio, Video, Web 2.0. Labels: (RAID - 1, 5, 6), (RAID - 0, 3). *IOPS for a required response time (ms); *=(#Channels*Latency^-1)]
Source: IMEX Research - Cloud Infrastructure Report 2009-12

46.
Workloads Characterization
Storage performance, management and costs are big issues in running Databases
Data Warehousing Workloads are I/O intensive
- Predominantly read based with low hit ratios on buffer pools
- High concurrent sequential and random read levels
  o Sequential Reads require high level of I/O Bandwidth (MB/sec)
  o Random Reads require high IOPS
- Write rates driven by life cycle management and sort operations
OLTP Workloads are strongly random I/O intensive
- Random I/O is more dominant
- Read/write ratios of 80/20 are most common but can be 50/50
- Can be difficult to build out test systems with sufficient I/O characteristics
Batch Workloads are more write intensive
- Sequential Writes require high level of I/O Bandwidth (MB/sec)
Backup & Recovery times are critical for these workloads
- Backup operations drive high level of sequential I/O
- Recovery operation drives high levels of random I/O
Source: IMEX Research SSD Industry Report 2011

47. Best Practices - Storage in Big Data Apps
Goals & Implementation
- Establish Goals for SLAs (Performance/Cost/Availability), BC/DR (RPO/RTO) & Compliance
- Increase Performance for DB, OLTP and OLAP Apps: Random I/O > 20x, Sequential I/O Bandwidth > 5x
- Remove Stale data from Production Resources to improve performance
- Use Partitioning Software to Classify Data by Frequency of Access (Recent Usage) and Capacity (by percent of total Data), using as general guidelines: Hyperactive (1%), Active (5%), Less Active (20%), Historical (74%)
Implementation
- Optimize Tiering by Classifying Hot & Cold Data
- Improve Query Performance by reducing number of I/Os
- Reduce number of Disks Needed by 25-50% using advanced compression software achieving 2-4x compression
- Match Data Classification vs. Tiered Devices accordingly: Flash, High Perf Disk, Low Cost Capacity Disk, Online Lowest Cost Archival Disk/Tape
- Balance Cost vs. Performance of Flash: More Data in Flash > Higher Cache Hit Ratio > Improved Data Performance
- Create and Auto-Manage Tiering (Monitoring, Migrations, Placements) without manual intervention

48. Best Practices: I/O Forensics in Storage-Tiering
Storage-Tiered Virtualization
Storage-Tiering at LBA/Sub-LUN Level
[Diagram: a Logical Volume mapped to Physical Storage, with Hot Data on SSD Arrays and Cold Data on HDD Arrays]
LBA Monitoring and Tiered Placement
- Every workload has unique I/O access signature
- Historical performance data for a LUN can identify performance skews & hot data regions by LBAs
- Automatic Migration (Policy Based)
Source: IBM & IMEX Research SSD Industry Report 2011

49. Best Practices: Cached Storage
Improvement over Cached Server vs. HDD only
App/DB Application | Improvement
Oracle OLTP Benchmarks | 681%
SQL Server OLTP Benchmark | 1251%
Neoload (Web Server Simulation) | 533%
SysBench (MySQL OLTP Server) | 150%
LSI MegaRAID CacheCade Pro 2.0
[Charts: Response Time and TPS for All HDD, Smart Flash Cache, and Persist Data on Warpdrive]

50. Big Data Targets: Analytics
Key Industries Benefitting from Big Data Analytics
[Chart: Global IT Spending by Industry Verticals 2010-15 $B - CAGR % (2010-15) vs. 5 Year Cum Global IT Spending 2010-15 ($B)]

51. Big Data Targets - Storage Infrastructure
Value Potential of Using Big Data by Data Intensive Verticals
[Chart: Data Intensity by Industry Vertical, Installed Terabytes/$M Rev 2010, axis 0.0 to 0.9. Verticals: Banking & Financial Services, Media & Entertainment, Healthcare Providers, Professional Services, Telecommunications, Pharma, Life Sciences & Medical Pdcts, Retail & Wholesales, Utilities, Indl Elex & Electrical Equipment, SW Publishing & Internet Srvcs, Consumer Products, Insurance, Transportation, Energy]

52.
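The classification guideline on the storage best-practices slide (47), Hyperactive 1%, Active 5%, Less Active 20%, Historical 74%, can be sketched as a ranking by access count. Several things here are assumptions rather than content from the deck: equal-sized data extents, the extent names, the `classify` function, and the one-to-one pairing of each class with a device tier (the slide lists classes and devices but does not state the pairing explicitly).

```python
from typing import Dict, List, Tuple

# (class name, percent of total data, assumed device tier)
CLASSES: List[Tuple[str, int, str]] = [
    ("Hyperactive", 1, "Flash"),
    ("Active", 5, "High Perf Disk"),
    ("Less Active", 20, "Low Cost Capacity Disk"),
    ("Historical", 74, "Archival Disk/Tape"),
]

def classify(access_counts: Dict[str, int]) -> Dict[str, Tuple[str, str]]:
    """Rank extents by access count and cut the ranking at 1/6/26/100 percent."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    total = len(ranked)
    placement: Dict[str, Tuple[str, str]] = {}
    start = 0
    cumulative_pct = 0
    for name, pct, tier in CLASSES:
        cumulative_pct += pct
        # Integer arithmetic avoids floating-point drift at the boundaries.
        end = cumulative_pct * total // 100
        for extent in ranked[start:end]:
            placement[extent] = (name, tier)
        start = end
    return placement

# Hypothetical access statistics: extent e000 is the hottest, e099 the coldest.
stats = {f"e{i:03d}": 100 - i for i in range(100)}
placement = classify(stats)
print(placement["e000"], placement["e099"])
```

A production tiering engine would weigh recency as well as frequency and would re-evaluate placements continuously; the sketch only shows the percentage cut-offs.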
Big Data Targets: Storage Infrastructure
Data Stored by Large US Enterprises - Big Data Storage Potential
Stored Data by Industry (in US 2009 PB), share of total:
- Discrete Manufacturing: 14%
- Government: 12%
- Communications and Media: 10%
- Process Manufacturing: 10%
- Banking: 9%
- Health Care Providers: 6%
- Securities & Investment Srvcs: 6%
- Professional Services: 6%
- Retail: 5%
- Education: 4%
- Insurance: 4%
- Transportation: 3%
- Wholesale: 3%
- Utilities: 3%
- Resource Industries: 2%
- Consumer and Recreational Services: 2%
- Construction: 1%
[Chart: Stored Data TB/Firm (>1K Employees US). Bar values as they appear on the chart: 231; 150; 825; 1,507; 536; 801; 870; 319; 697; 278; 3,866; 370; 1,931; 831; 1,792; 1,312; 967]

53. Big Data Targets: Savings w Open Source
Legacy BI vs. Open Source Big Data Analytics

54. Rise of Big Data Adoption

55. Key Takeaways
- Big Data creating paradigm shift in IT Industry
- Leverage the opportunity to optimize your computing infrastructure with Big Data Infrastructure after due diligence in selection of vendors/products, industry testing and interoperability.
- Apply best storage technologies listed in this presentation and elsewhere
- Optimize Big Data Analytics for Query Response Time vs. # of Users
  o Improving Query Response time for a given number of users (IOPS) or serving more users (IOPS) for a given query response time
- Select Automated Storage Management Software
  o Data Forensics and Tiered Placement
  o Every workload has unique I/O access signature
  o Historical performance data for a LUN can identify performance skews & hot data regions by LBAs.
  o Non-disruptively migrate hot data using auto-tiering Software
- Optimize Infrastructure to meet needs of Applications/SLA
  o Performance Economics/Benefits
  o Typically 4-8% of data becomes a candidate and, when migrated for higher performance tiering, can provide response time reduction of ~65% at peak loads.
- Many industry Verticals and Applications will benefit using Big Data
Source: IMEX Research SSD Industry Report 2011
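The takeaway about trading query response time against number of users can be made concrete with Little's Law (N = X × R: requests in the system equal throughput times response time). Applying it here is an assumption, not something the deck states, and it further assumes each user keeps one request outstanding with no think time. The function names and all figures below are hypothetical.

```python
def response_time_s(outstanding_requests: float, throughput_per_s: float) -> float:
    """Little's Law rearranged: R = N / X."""
    if throughput_per_s <= 0:
        raise ValueError("throughput must be positive")
    return outstanding_requests / throughput_per_s

def users_supported(throughput_per_s: float, target_response_s: float) -> float:
    """Little's Law: N = X * R."""
    return throughput_per_s * target_response_s

# Hypothetical system: 200 concurrent users on storage sustaining 1,000 requests/s.
baseline = response_time_s(200, 1_000)
# Option 1: raise sustained throughput 5x and keep the same users.
faster = response_time_s(200, 5_000)
# Option 2: raise throughput 5x and keep the baseline response time.
more_users = users_supported(5_000, baseline)
print(baseline, faster, more_users)
```

Either option spends the same throughput gain: it can shorten response time for the current users or admit more users at the current response time, which is the choice the takeaway describes.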