Luncheon Webinar Series, May 13, 2013. InfoSphere DataStage is Big Data Integration. Presented by: Tony Curcio, InfoSphere Product Management

DSX InfoSphere DataStage is Big Data Integration 2013-05-13


- Questions and suggestions regarding presentation topics? Send them to [email protected]
- Downloading the presentation: click "Presentation YES" on the poll question.
- A replay will be available within one day; details will follow by email.
- Bonus offer: free premium membership for your DataStage management. Submit your manager's email address and we will offer him/her access on your behalf. Email [email protected] with the subject line "Managers special".
- Join us all at LinkedIn: http://tinyurl.com/DSXmembers
- ISXchange will sponsor trial membership for new requests at the LinkedIn DSX members site.

InfoSphere DataStage is Big Data Integration
Tony Curcio, InfoSphere Product Management
© 2013 IBM Corporation

Bigger Data Integration Challenges
- New types of data stores: Big Data introduces additional data stores that need to be integrated, both Hadoop-based and NoSQL-based. These data stores don't easily lend themselves to conventional methods for data movement.
- New data types and formats: unstructured data (video, docs, web logs); poly-structured data stores; JSON, Avro, and more to come.
- Larger volumes: solutions need to move, transform, cleanse, and otherwise prepare huge data volumes. Big Data requires data scalability.

Benefits of InfoSphere DataStage
- Speeds Productivity: graphical design is easier to use than hand coding.
- Promotes Object Reuse: build once, share, and run anywhere (ETL/ELT/real-time).
- Simplifies Heterogeneity: common method for diverse data sources.
- Reduces Operational Cost: provides a robust framework to manage data integration.
- Shortens Project Cycles: pre-built components reduce cost and timelines.
- Protects from Changes: isolation from underlying technology changes as they continue to evolve.

Big Data is part of the Information Supply Chain
[Figure: Information Supply Chain. Labels: Analyze, Integrate, Manage, Govern; Business Analytics Applications; External Information Sources; Cubes; Streams; Big Data; Master Data; Content; Data; Streaming Information; Data Warehouses; Transactional & Collaborative Applications; Information Governance (Quality, Security & Privacy, Lifecycle, Standards)]
Gartner Magic Quadrant: IBM is the only DBMS vendor that can offer an information architecture across the entire organization, covering information on all systems.

4 Key Analytical Use Cases for Big Data
- Big Data Exploration: find, visualize, and understand all big data to improve decision making.
- Enhanced 360° View of the Customer: extend existing customer views by incorporating additional information sources.
- Operations Analysis: analyze a variety of machine data for improved business results.
- Data Warehouse Augmentation: integrate big data and data warehouse capabilities to increase operational efficiency.

Data Warehouse Augmentation
Integrate big data and data warehouse capabilities to increase operational efficiency.
Challenges:
- Leveraging structured, unstructured, and streaming data sources for deep analysis
- Low latency requirements
- Query access to data
- Optimizing the warehouse for big data volumes
- Metadata management to support impact analysis and data lineage
Required capabilities:
- Data Integration Hub Processing: high-speed, massively scalable reads from and writes to big data sources and new data.
- Big Data Expert: automatically build MapReduce logic through simple data flow design and coordinate workflow across traditional and big data platforms.

Connectivity Hub
InfoSphere DataStage: effectively handle the complexity of enterprise information sources and types with a common design paradigm across a heterogeneous landscape, and a high-speed, scalable solution to speed the delivery of analytics.

[Figure: the same data flow (Source Data, Transform, Cleanse, Enrich, EDW) running sequentially on a uniprocessor (one CPU, memory, disk), 4-way parallel on an SMP system (several CPUs, shared memory), and 64-way parallel on an MPP clustered system]
- Dynamic: instantly get better performance as hardware resources are added to any topology.
- Extendable: add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software).
- Data Partitioned: in true MPP fashion (like Hadoop), data persisted in the data integration platform is stored in parallel to scale out the I/O.
- Hadoop Integrated: push all or parts of the process out to Hadoop to take advantage of its scalability in ELT fashion.
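The "Data Partitioned" idea above can be illustrated with a minimal sketch. This is not DataStage's engine or its configuration format: the function names (`partition_rows`, `transform`, `run_parallel`) and the sample field names are hypothetical, and Python threads merely stand in for the separate nodes of an SMP or MPP system. The sketch only shows the general technique of hash-partitioning rows by key and running the same transform on every partition.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def partition_rows(rows, key, num_partitions):
    """Hash-partition rows so that equal key values land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
        index = crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def transform(partition):
    """Hypothetical per-partition transform step: uppercase the 'name' field."""
    return [{**row, "name": row["name"].upper()} for row in partition]


def run_parallel(rows, key, num_partitions):
    """Partition the input, transform every partition concurrently, then collect the rows."""
    partitions = partition_rows(rows, key, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(transform, partitions))
    return [row for part in results for row in part]


if __name__ == "__main__":
    sample = [{"id": i % 3, "name": f"customer{i}"} for i in range(9)]
    for row in run_parallel(sample, key="id", num_partitions=4):
        print(row)
```

Changing `num_partitions` changes the degree of parallelism without touching `transform`, which is the property the slide describes: the job design stays the same while the topology underneath it grows.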
Big Data Source Types
- Hadoop Distributed File System: massively scalable and resilient storage
- NoSQL (not-only SQL): record storage optimized for read (or write)
- InfoSphere Streams: massive real-time analytics

Blazing Fast HDFS
- Available since v8.7 in 2011.
- Extends the simple flat file paradigm: just add your Hadoop server name and port number.
- Parallelization techniques pipe data in and out at massive scale.
- A performance study ran at up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster).

Simple data flow design for HDFS
- Read from an HDFS file in parallel
- Transform/restructure the data
- Join two HDFS files
- Create a new HDFS file, fully parallelized

Agile Connector Accelerators for noSQL
- New connectors available on developerWorks.
- Each plugs into InfoSphere DataStage and operates just like any other stage.
- Includes features to exploit specific data sources.

Open Code Sample Job with MongoDB and Hive
- Selects what HDFS data to send downstream
- Writing data to Hive
- Writing data to MongoDB
- Accepts specific MongoDB directives

Parse and Compose JSON (beta)
- Parsing and composing of the JSON data format.
- Included in the advanced transformation framework already provided for XML capabilities.
- Beta available on InfoSphere DataStage 9.1 FP1.

Big Data Expert
InfoSphere DataStage: automatically push transformational processing close to where the data resides, both SQL for DBMS and MapReduce for Hadoop, leveraging the same simple data flow design process, and coordinate workflow across all platforms.
- New in 9.1: leverage the same UI and the same stages to build MapReduce.
- Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming.
- Push the processing to Hadoop for patterns where you don't want to transport the data over the network.
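For readers who have not written MapReduce by hand, the following is a minimal sketch of the programming model that the generated jobs target. It is plain Python, not Hadoop code and not what DataStage emits; the function names and the clickstream record layout (a `page` field) are hypothetical. It shows the three conceptual phases: map emits key/value pairs, the framework groups values by key, and reduce collapses each group.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (page, 1) pair per clickstream record."""
    for record in records:
        yield record["page"], 1


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(records):
    """Run the three phases end to end on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    clicks = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
    print(count_by_page(clicks))
```

Even this small aggregation needs three cooperating functions, which suggests why a drag-and-drop design that produces the equivalent logic is attractive for teams without MapReduce experience.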
Automated MapReduce Job Generation
- Build integration jobs with the same data flow tool and stages.
- Automatically creates MapReduce code.
- A job can include another database on a separate system.
- Recognizes what processing can run natively in Hadoop and what requires the DataStage engine to move the data.

Architecture for Warehouse Landing Zone
[Figure: Architecture for Warehouse Landing Zone. Labels: all sources (clickstream, sensors, transactions, content); Landing Zone; BigInsights / Hadoop; JAQL, Hive, HBase, Custom MR; Operational Warehouse Zone; Analytics Warehouse Zone; Replication; ETL; Information Server; Masking, Lineage, Quality; Optim Masking; Guardium]

Use Case Requirements: Data Warehouse Landing Zone
- Large Scale: large data volumes; scale-out requires an open MPP platform.
- Low Cost: low-cost storage, compute, and commodity hardware.
- Many Data Types: un/semi-structured and social data type coverage.
- Many Access Patterns: exploratory, iterative, and discovery oriented.

Combined Workflows for Big Data
- Oozie Integration: same design paradigm for workflows as for job design. Directly call an Oozie activity that is invoking custom MapReduce code.
- End-to-end Workflows: sequence right alongside other data integration and analytics activities. Allows users to have the data sourcing, ETL, analytics, and delivery of information all controlled through a single process.
- Monitor all stages through the Operations Console's web-based interface.

Cross Tool Impact Analysis and Traceability
- Understand how traditional and big data sources are being used
- Assess the impact of change and mitigate risks
- Show impact on downstream applications and BI reports
- Navigate through impacted areas and drill down

Wrap-up

New analytic applications drive the requirements for a big data platform
- Integrate and manage the full variety, velocity, and volume of data
- Apply advanced analytics to information in its native form
- Visualize all available data for ad-hoc analysis
- Development environment for building new analytic applications
- Workload optimization and scheduling
- Security and governance

The IBM Big Data Platform
[Figure: The IBM Big Data Platform. Labels: Accelerators; Information Integration & Governance; Data Warehouse; Stream Computing; Hadoop System; Discovery; Application Development; Systems Management; data types: Data, Media, Content, Machine, Social]

Information Integration & Governance for Big Data
- Integrate & Link Big Data: Big Data as a Source; Big Data as a Target; Data Transformations; Data Movement; Integrate w/existing Enterprise Data; Lineage & Impact Analysis; Metadata Integration w/Analytics; Realtime & Data Federation
- Master Big Data: Big Data as a Supplier; Big Data as a Consumer; Links between Big Data and Trusted Golden Records; Leverage Master Data in Big Data Analytics; Entity Resolution at Extreme Scale Out Levels; Probabilistic Entity Matching
- Audit & Archive Big Data: Queryable Archive, Structured and Semi-Structured; Optimized Connectors to existing Apps; Hot-Restorable On-the-Fly; Immutable and Secure Access; Automated Legal Hold Capability for Data Freeze
- Cleanse and Validate Big Data: Accuracy and Entity Matching with Social Data; De-duplication and Standardization of Machine Data; In-line Cleansing with Integration; Trusted Data Dashboard and Reporting on Data Quality
- Protect Big Data: Activity Monitoring; Data Masking; Data Encryption; On-Demand / In-Place Protection; In-Line Protection (w/ETL etc.); Active Detection & Alerting

Where to go to learn more
- If you'd like to explore this topic further: contact your IBM account team or your preferred IBM Partner.
- If you'd like to explore more about InfoSphere DataStage and the Information Server platform: http://www-01.ibm.com/software/data/integration/info_server/
- If you're looking for an enterprise-level Hadoop distribution: InfoSphere BigInsights, http://www-01.ibm.com/software/data/infosphere/biginsights/

Thanks