
TIBCO Data Virtualization Technical Overview
TIBCO Data Virtualization deployment, development, run-time, and management capabilities

1. INTRODUCTION
With data the new competitive battleground, businesses that take advantage of their data will be the leaders; those that do not will fall behind.

But gaining an advantage is a more difficult technical challenge than ever because your business requirements are ever-changing, your analytic workloads are exploding, and your data is now widely distributed across on-premises systems, big data platforms, the Internet of Things, and the cloud.

TIBCO® Data Virtualization is data virtualization software that lets you integrate data at big data scale with breakthrough speed and cost effectiveness. With TIBCO Data Virtualization, you can build and manage virtualized datasets and data services that access, transform, and deliver the data your business requires to accelerate revenue, reduce costs, lessen risk, improve compliance, and more.

TIBCO Data Virtualization is:

•  Fast and Economical – Integrate data reliably at a fraction of physical warehousing and ETL time, cost, and rigidity. Evolve rapidly when requirements change.

•  Immediate – Deliver up-to-the-minute data as needed, using advanced performance optimization algorithms and fine-grained security.

•  Business-friendly – Transform native IT structures and syntax into easy-to-understand, IT-curated datasets sharable via a self-service business directory.

CONTENTS
INTRODUCTION 1
OVERVIEW 2
DEPLOYING 4
DEVELOPMENT 4
RUN-TIME 9
MANAGEMENT 15
CONCLUSION 18

WHITEPAPER | 2

•  Wide-ranging – Access data from distributed data sources including traditional enterprise, big data, cloud, and IoT. Use it across myriad analytics, self-service, business intelligence, and transactional applications.

•  Enterprise Grade – Support multiple lines of business, hundreds of projects, and thousands of users.

Beyond the TIBCO Data Virtualization product itself, TIBCO also provides a broad array of complementary advanced services, training, technical support, a customer advisory program, a knowledgebase, and partner offerings so you get the complete solution needed to ensure your data virtualization success.

2. TIBCO DATA VIRTUALIZATION PRODUCT OVERVIEW
TIBCO Data Virtualization is Java-based, enterprise-grade middleware whose modular structure supports all phases of data virtualization development, run-time, and management.

Figure 1. TIBCO Data Virtualization Modules (block diagram: the development environment with Business Directory, Discovery, Studio, and Adapters; the run-time environment with the federation engine, cost-based and rules-based optimizers, caching, security, quality, governance, and the object repository for SQL, SQL Script, Java, XQuery, and XSLT; the management environment with Manager, Deployment Manager, Monitor, and Active Cluster. Data sources below include RDBMSs, Excel and flat files, packaged apps, data warehouses, OLAP cubes, Hadoop/big data, XML docs, and web services, reached via SQL, web services, messaging, and other APIs. Consumers above include analytics, self-service, business intelligence, and transactional apps, reached via SQL (ODBC, JDBC, ADO.NET), web services (HTTP, REST, SOAP, JSON, OData), and messaging (JMS).)

2.1 BUSINESS DIRECTORY
Business Directory is a self-service business data directory you can use to easily search, categorize, and consume IT-curated datasets developed using TIBCO Data Virtualization. Business Directory encourages dataset sharing and reuse to accelerate business outcomes while reducing IT workloads.

2.2 DISCOVERY
Discovery enables you to go beyond simple data profiling to examine data, locate important entities, and reveal hidden relationships across distinct data sources. You can quickly build and display comprehensive entity-relationship (E-R) diagrams and data models so you can meet new business requirements faster and more easily.


2.3 STUDIO
Studio is the agile modeling, development, and resource management tool your data-oriented developers use to model data, design data services, build transformations, optimize queries, manage resources, and more. Easy to learn and use, Studio’s graphical modeling environment, along with powerful code editors, provides a flexible workspace where queries are created and tested, and a data services repository where completed objects are published. Studio also offers a rich set of transformations in addition to an easy-to-use transformation editor. Five development languages, including SQL, SQL Script, Java, XQuery, and XSLT, complement Studio’s graphical modeling capabilities.

2.4 ADAPTERS
Adapters provide a wide range of data source connectivity for databases, files, big data, cloud sources, packaged applications, and more. Beyond schema-to-schema-only connectivity, TIBCO Data Virtualization adapters integrate the product’s optimizers with data source optimizers to ensure more accurate queries and higher performance. The Data Source Tool Kit allows you to easily build additional custom adapters.

2.5 OPTIMIZERS AND ALGORITHMS
Optimizers and algorithms intelligently evaluate workloads at runtime and dynamically select the ideal combination of optimizers (cost-based, rule-based, and data sources), join algorithms, and query techniques to achieve the best possible performance.

2.6 EXECUTION ENGINES
The execution engines process optimized query workloads within a clustered environment.

2.7 METADATA REPOSITORY
The integrated Metadata Repository lets you store and manage all relevant metadata as well as your data services throughout their development and operations lifecycle.

2.8 SECURITY
Myriad fine-grained Security capabilities, including authentication, authorization, encryption, and masking, leverage your existing security investments (LDAP, Active Directory, and Kerberos), standards (SAML, NTLM, TLS, and WSS), and policies to securely access and deliver sanctioned data only, no more, no less.

2.9 CACHING
Flexible Caching options enable higher performance and greater uptime by efficiently moving data to predesignated caches.

2.10 QUALITY
Quality provides a range of profiling, standardization, conformance, enrichment, augmentation, and validation capabilities that you can apply to ensure delivery of the most correct and complete data possible.

2.11 GOVERNANCE
Built-in Governance features including lineage, where-used, logging, and more provide complete visibility, traceability, and control over your data virtualization activities.

2.12 MANAGER
Manager is the administrative console your administrators use to set up user IDs, passwords, security profiles, etc., as well as check server activity, and more.


2.13 DEPLOYMENT MANAGER
Deployment Manager lets you quickly and easily migrate entire projects in a single step, including their resources, cache settings, security profiles, and more across multiple TIBCO Data Virtualization instances to simplify and automate your development lifecycle.

2.14 MONITOR
Monitor provides a comprehensive, real-time view of TIBCO Data Virtualization clusters. Monitor displays all the pertinent system health indicators required to help your IT operations staff guide corrective actions.

2.15 ACTIVE CLUSTER
Active Cluster works in conjunction with load balancers to provide high availability and greater scale to meet your challenging service level agreements. Active Cluster simplifies complex operations management by automatically sharing resources, adjusting capacity on demand, and more.

3. DEPLOYING TIBCO DATA VIRTUALIZATION
3.1 DEPLOYMENT OPTIONS
TIBCO provides multiple options for deploying TIBCO Data Virtualization. You can install and run it on-premises, in your private cloud environment, or at a public cloud provider such as Amazon AWS, Google Cloud Platform, or Microsoft Azure. For Amazon, TIBCO also provides an Amazon Machine Image (AMI) on the AWS Marketplace to simplify and accelerate deployment.

3.2 INSTALLATION AND CONFIGURATION
You can easily install TIBCO Data Virtualization via standard installation processes, as well as flexibly configure it using an intuitive graphical interface.

4. TIBCO DATA VIRTUALIZATION DEVELOPMENT
4.1 AGILE SOFTWARE DEVELOPMENT LIFECYCLE
TIBCO Data Virtualization enables agile data services development across a complete systems development lifecycle.

4.2 DATA ABSTRACTION
Data Abstraction helps overcome source-to-consumer incompatibility by transforming data from its native structure and syntax into reusable data services that are easy for application developers to understand and consuming applications to use. From an enterprise architecture point of view, TIBCO Data Virtualization enables a semantic abstraction or data services layer in support of multiple consuming applications. This middle layer of reusable data services decouples the underlying source data and consuming solution layers. This provides the flexibility required to deal with each layer in the most effective manner, as well as the agility to work quickly across layers as applications, schemas, or underlying data sources change.

Development objectives when building the data services that enable abstraction include:

•  Right business information at the right time – Fulfill complete information needs on demand by linking multiple, diverse data sources together for delivery in real time.

•  Business and IT model alignment – Gain agility, efficiency, and reuse across applications via an enterprise information model or logical business model. Oftentimes known as the canonical model, this abstracted approach overcomes data complexity, structure, and location issues.


•  Business and IT change insulation – Insulate consuming applications from changes in the source and vice versa. Developers create their applications based on a more stable view of the data. Allow ongoing changes and relocation of physical data sources without impacting consumers.

•  End-to-end control – Use a single platform to design, develop, manage, and monitor data access and delivery processes across multiple sources and consumers.

•  More secure data – Consistently apply data security rules across all data sources and consumers via unified security methods and controls.
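The abstraction layer these objectives describe can be sketched in a few lines. The following Java fragment is purely illustrative (the class, column, and service names are invented, not TDV APIs): it maps a source-specific schema onto a canonical business model, so if the source schema changes, only the mapping changes and consumers are untouched.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (invented names, not TDV code): a data service
// that maps source-specific column names onto one canonical business
// model, insulating consumers from changes in the underlying source.
public class CanonicalModel {

    // Source-neutral row shape shared by every data service.
    static Map<String, Object> row(Object... kv) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) m.put((String) kv[i], kv[i + 1]);
        return m;
    }

    // Physical CRM source with its native column names.
    static List<Map<String, Object>> crmRows() {
        return List.of(row("cust_id", 1, "cust_nm", "Acme"));
    }

    // The abstraction step: if the CRM schema changes, only this mapping
    // changes; consumers of the canonical model are untouched.
    static List<Map<String, Object>> customerService() {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> r : crmRows())
            out.add(row("CustomerId", r.get("cust_id"),
                        "CustomerName", r.get("cust_nm")));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(customerService());
    }
}
```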

4.3 ACCESSING SOURCE DATA
TIBCO Data Virtualization Adapters enable access to myriad relational, Apache™ Hadoop®, multi-dimensional, cloud, and NoSQL data sources; flat files; SOAP, REST, OData, and JMS services; a variety of on-premises and cloud applications; and streaming/IoT data sources.

Data source access is both protocol and vendor specific. Each data source type available in TIBCO Data Virtualization uses an associated driver. When you provision a new data source for use in TIBCO Data Virtualization, you set the driver’s parameters so that it can access a specific data source of that type (such as an Oracle database). Once connectivity is established, TIBCO Data Virtualization introspects the data source structure and stores the resulting metadata in the metadata repository. With the data source’s metadata now available, you can create data services using that data source.

Beyond simple metadata (schema) mapping and lowest common denominator access standards such as ODBC or JDBC offered by other vendors, TIBCO Data Virtualization Adapters also provide performance optimization capabilities and development acceleration. Key capabilities include:

•  Certified Vendor-specific Connectivity API – Access proprietary data using vendor-approved access methods

•  Standard SQL to Vendor-specific SQL Resolution – Ensure precise SQL translation and execution

•  Statistical Analysis and Cardinality Estimation – Accumulate critical metrics used by the cost-based optimizer

•  Capability Introspection and Coordination – Determine configurations, functionality, and parameters required to enable optimal performance

•  Vendor-specific Engineered Functions – Leverage the complete range of vendor capabilities rather than the lowest common denominator functions.
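As a rough illustration of capability introspection and coordination, the sketch below (hypothetical names, not the actual adapter API) shows how a planner might split requested operations into those pushed down to a source and those compensated locally:

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch (hypothetical names, not the TDV adapter API):
// an adapter advertises what its source can do, and the planner pushes
// supported operations down while compensating for the rest locally.
public class CapabilityPlanner {

    enum Capability { FILTER, AGGREGATE, SORT }

    // What one data source reports during capability introspection.
    record Adapter(String name, Set<Capability> supports) {}

    // Split requested operations into pushed-down vs locally executed work.
    static Map<String, List<Capability>> plan(Adapter a, List<Capability> ops) {
        Map<String, List<Capability>> p = new LinkedHashMap<>();
        p.put("pushdown", new ArrayList<>());
        p.put("local", new ArrayList<>());
        for (Capability op : ops)
            p.get(a.supports().contains(op) ? "pushdown" : "local").add(op);
        return p;
    }

    public static void main(String[] args) {
        // A flat-file source can filter but cannot aggregate on its own.
        Adapter files = new Adapter("flat-file", EnumSet.of(Capability.FILTER));
        System.out.println(plan(files,
                List.of(Capability.FILTER, Capability.AGGREGATE)));
    }
}
```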

4.4 DATA INTROSPECTION AND RELATIONSHIP DISCOVERY
TIBCO Data Virtualization provides developers with easy-to-use source data introspection and relationship discovery tools that simplify and accelerate data modeling.

You can use Studio to introspect data sources so you can better understand how source data is modeled and managed. You can then use Discovery to further examine the data, locate key entities and apply fuzzy matching to reveal hidden relationships across data sources. And you can use that knowledge to quickly build rich data models in Studio that show live data, making it easier to validate business requirements with end users. Once you have validated that the models match business requirements, you can use these tables and relationships to build new data services with a push of a button.


Figure 2. Discovery

4.5 DATA SERVICE DEVELOPMENT IN STUDIO
Studio is the agile modeling, development, and resource management tool your data-oriented developers use to model data, design data services, build transformations, optimize queries, manage resources, and more. Easy to learn and use, Studio’s graphical modeling environment, along with powerful code editors, provides a flexible workspace where queries are created and tested, and a data services repository where completed objects are published. Studio also offers a rich set of transformations in addition to an easy-to-use transformation editor you can use to transform any source structure (such as relational and hierarchical) to any target structure. Five development languages, including SQL, SQL Script, Java, XQuery, and XSLT, complement Studio’s graphical modeling capabilities.

Data services normalize both the shape and type of the data. TIBCO Data Virtualization normalizes datasets into one of three shapes: tabular (rows and columns of values), hierarchical (such as an XML document), or scalar (a single value). It normalizes each datum within a dataset into a standard SQL or XML data type (all of which are listed in the reference manual). Data normalized into these standard shapes and types can then easily be manipulated, transformed, and combined with other data, without regard for vendor-specific or protocol-specific details. You can reuse existing data services to create new ones.
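As a minimal illustration of type normalization (the type names below are examples, not TDV's full SQL/XML type list), a normalizer might map native values onto standard types so later transformations can ignore vendor specifics:

```java
// Illustrative sketch of type normalization (the type names below are
// examples, not TDV's full SQL/XML type list): map native values onto
// standard types so later transformations ignore vendor specifics.
public class TypeNormalizer {
    static String standardType(Object datum) {
        if (datum instanceof Integer || datum instanceof Long) return "INTEGER";
        if (datum instanceof Float || datum instanceof Double) return "DOUBLE";
        if (datum instanceof String) return "VARCHAR";
        if (datum instanceof Boolean) return "BOOLEAN";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(standardType(42));     // INTEGER
        System.out.println(standardType("Acme")); // VARCHAR
    }
}
```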


Figure 3. Studio

Tabular data services in TIBCO Data Virtualization are conceptually equivalent to a relational database view and look like a database table to the consumer. The key distinction is that TIBCO Data Virtualization data services can query from multiple data sources. A tabular data service may be a final integration, which can be published for clients to query using SQL. Alternatively, it may represent an intermediate result that can then be used within other data services.

Data services have a high degree of interoperability with other components in a SOA environment. TIBCO Data Virtualization supports both contract-first and design-by-example methods, as well as Service Component Architecture (SCA) and any XML standard.

Figure 4. Contract-first Data Services


The transport can occur over HTTP or the Java Message Service (JMS) message bus. There are many additional options within each of these standards and within the TIBCO Data Virtualization processing pipeline. All of the options are outlined in the documentation.

Studio’s graphical editor allows you to create data services using drag-and-drop techniques, with the resulting SQL statement created automatically. You can also create and edit the SQL statement directly. Lineage and where-used visualizations can identify where data originates, how it is used and transformed, and which views reference it for impact analysis.

SQL developed in Studio is optimized for you automatically. However, you can also provide processing hints. These hints are specified in TIBCO Data Virtualization-specific syntax because the SQL standard defines no syntax for hints. The available hints are listed and described in the reference documentation.

4.6 METADATA REPOSITORY
TIBCO Data Virtualization lets you store and manage all relevant metadata as well as data services throughout their development and operations lifecycle. Any composition logic, packaged queries, and transformations are stored here. Beyond design data, the Metadata Repository also captures all user and group information, event and log data, and system management and configuration information. Finally, TIBCO Data Virtualization tracks data source metadata to simplify impact analysis in case of changes to the source structures. You can use the web services–based metadata API for easy metadata access and sharing.

4.7 RESOURCE SHARING AND VERSION CONTROL
TIBCO Data Virtualization provides resource sharing and versioning to facilitate a team-oriented systems development lifecycle.

You can think of each resource in TIBCO Data Virtualization as being analogous to a source code file in a software system. Whether it’s a data service or any other resource, its implementation is the “source code” for that resource, and it is normally stored in the Metadata Repository.

TIBCO Data Virtualization resource sharing is useful when a team creates and modifies a shared set of resources over a period of time—usually during a project life cycle. A source code control system (SCCS) is helpful to manage changes and versions of the files. TIBCO Data Virtualization integrates with popular SCCSs including GIT, SVN, and TFS to allow resources to be checked in and out directly from Studio.

To get a resource in or out of a SCCS, the resource must be in a file. When you use Studio to check in a resource to the SCCS, Studio first externalizes the resource into a file using an XML-based representation of the resource, and then checks the XML file into the SCCS. Likewise, when you use Studio to check out a resource from an SCCS, it checks out the XML file from SCCS, and then imports the file into TIBCO Data Virtualization.

To prevent you from unknowingly overwriting another person’s work, TIBCO Data Virtualization always checks the version of a resource you are editing against the version that has been saved, and will warn that you are trying to save the resource over a newer version.

Studio natively integrates with GIT, Subversion (SVN), and TFS. TIBCO Data Virtualization also provides an open API for integrating with any other SCCS.
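The save-time version check described above amounts to optimistic concurrency control. A minimal sketch, with invented names and none of TDV's actual repository API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the save-time version check (invented names,
// not TDV's repository API): a save based on a stale version is refused,
// so one developer cannot silently overwrite another's work.
public class ResourceRepository {
    private final Map<String, Integer> versions = new HashMap<>();

    // Returns true if the save succeeds; false signals a newer version
    // already exists and the caller should merge before retrying.
    boolean save(String path, int baseVersion) {
        int current = versions.getOrDefault(path, 0);
        if (baseVersion < current) return false;   // stale edit detected
        versions.put(path, current + 1);
        return true;
    }

    public static void main(String[] args) {
        ResourceRepository repo = new ResourceRepository();
        System.out.println(repo.save("/services/CustomerView", 0)); // true
        System.out.println(repo.save("/services/CustomerView", 0)); // false
    }
}
```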

4.8 BUSINESS DIRECTORY
Both business analysts and developers can use Business Directory to easily search, categorize, and consume data services developed in TIBCO Data Virtualization. Business Directory encourages dataset sharing and reuse to accelerate business outcomes and reduce IT workloads.


5. RUN-TIME
5.1 PERFORMANCE OPTIMIZATION
To help you achieve your performance SLAs, TIBCO Data Virtualization provides myriad performance optimization capabilities.

Optimizers and Algorithms intelligently evaluate workloads at runtime and dynamically select the ideal combination of optimizers (cost-based, rule-based, and data sources), join algorithms, query techniques, and execution engines (MPP or classic) to achieve the best possible performance.

Multiple execution engines allow you to run optimized data virtualization workloads at any scale. The classic execution engine processes workloads on a single node within a clustered environment. The MPP execution engine dynamically distributes big data scale workloads across processing nodes.

Granular workload management allows you to provide more intelligent allocation of resources for important workloads. Control memory usage, request length, row counts, and more, as well as avoid potentially problematic requests. These controls can be implemented at the object, user group, server, and/or cluster group level. Once in place, you can monitor these workloads using existing TIBCO Data Virtualization reporting.
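The threshold-and-action model behind such controls can be sketched as follows; the limits and action names here are illustrative, not TDV configuration settings:

```java
// Illustrative sketch of granular workload rules (limits and action
// names are invented, not TDV configuration): each request's estimates
// are checked against thresholds and an action is selected.
public class WorkloadRules {
    static final long MAX_ROWS = 1_000_000;
    static final long MAX_MEMORY_MB = 512;

    static String evaluate(long estimatedRows, long estimatedMemoryMb) {
        if (estimatedMemoryMb > MAX_MEMORY_MB) return "STOP_WITH_ALERT";
        if (estimatedRows > MAX_ROWS) return "RUN_WITH_WARNING";
        return "RUN";
    }

    public static void main(String[] args) {
        System.out.println(evaluate(2_000_000, 100)); // exceeds row limit
    }
}
```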

5.2 INTELLIGENT OPTIMIZATION ALGORITHMS AND TECHNIQUES
TIBCO Data Virtualization provides a number of optimization algorithms and techniques that maximize the use of data source processing capabilities and reduce the amount of data that must be moved across the network. These include:

•  Cost-based Optimization – Use of query characteristics, table statistics, indexes, and data distribution histograms creates an optimal query plan.

•  Rule-based Optimization – You can specify rules that can influence the query plan.

•  Data Source Optimization – Use of data source optimizers ensures query accuracy and maximizes query performance.

•  Automatic Query Rewrite – Automatic rewrite of queries to use the most efficient join strategy.

•  Complete Set of Join Algorithms – The most efficient join strategy is enabled using a full range of join algorithms including hash join, sort-merge join, distributed semi-join, data-ship join, union-join flip, nested-loop join, and others.

•  SQL Pushdown – As much processing as possible is off-loaded by pushing down select query operations such as string searches, comparisons, local joins, sorting, aggregating, and grouping into the underlying data sources.

•  Single-source Join Grouping – Data-reducing joins are run in the data source rather than bringing the data across the network.

•  Predicate Pushdown – WHERE clause predicates are pushed down into the underlying data source to reduce data at the source.

•  Full and Partial Aggregate Pushdown – Aggregate functions are pushed down to the source when applicable.

•  Serialization or Parallelization of Join Operators – Proper join order and join algorithms are determined based on estimated cardinality and join results derived from data distribution histograms.

•  Projection Pruning – All unnecessary columns from fetch nodes in a query tree are eliminated.


•  Constraint Propagation – Distribution of filters to multiple branches of the query plan, allowing data reduction by a single filter to potentially occur in multiple data sources.

•  Scan Multiplexing – Datasets that appear in multiple places in a single query plan are reused.

•  Empty Scan Detection – Detection of logical conditions that would produce empty datasets, and elimination of those parts of the query plan prior to execution.

•  Redundant Operator Cropping – Elimination of redundant or extraneous operators within a complex multi-operator query.

•  Blocking Operator Pre-fetching – Proactive running of parts of the query plan that must complete before other parts of the query plan can continue.

•  Caching – The system can be configured to cache results for query, procedure, and web service calls on a per-service and per-query basis. When enabled, the caching engine stores result sets either in a local file-based cache or in a database. The query engine will verify whether results of a query it is about to execute are already stored in the caching system and will use the cached data as appropriate. While caching can improve the overall execution time, its impact is most pronounced when used on frequently invoked queries that are executed against rarely changing data or high-latency sources such as web data services.

•  Native XML Support – Support for native XML for faster parsing and joins.

•  Multi-threaded, Asynchronous Access – Execution of calls asynchronously on separate threads, reducing idle time and data source response latency.

•  Connection Pool Sharing – Connections to data sources are shared to avoid bottlenecks.

•  Hybrid Memory and Disk Usage – Automatic memory balancing and disk use for optimal performance.

•  Fine-grained workload management – Monitoring and conditional implementation of rules based on server workload thresholds. When rules and thresholds are violated, appropriate actions are executed (Run with Administrator alert, Run with User warning, Stop execution with Administrator alert and User exceptions).

•  Results Streaming – Data streaming to consumers as results are obtained.
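To make one of the join algorithms above concrete, here is a generic in-memory hash join, an illustrative sketch rather than TDV's implementation: build a hash table on the smaller input, then probe it with the larger input.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a generic in-memory hash join (not TDV's
// implementation): build a hash table on the smaller input, then probe
// it with the larger input, emitting one output row per match.
public class HashJoin {
    record Row(int key, String value) {}

    static List<String> join(List<Row> build, List<Row> probe) {
        // Build phase: index the smaller side by join key.
        Map<Integer, List<Row>> table = new HashMap<>();
        for (Row r : build)
            table.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        // Probe phase: stream the larger side and emit matching pairs.
        List<String> out = new ArrayList<>();
        for (Row p : probe)
            for (Row b : table.getOrDefault(p.key(), List.of()))
                out.add(b.value() + ":" + p.value());
        return out;
    }

    public static void main(String[] args) {
        List<Row> customers = List.of(new Row(1, "Acme"), new Row(2, "Globex"));
        List<Row> orders = List.of(new Row(1, "O-100"), new Row(1, "O-101"));
        System.out.println(join(customers, orders)); // [Acme:O-100, Acme:O-101]
    }
}
```

The build side should be the smaller (or cached) input so the hash table fits in memory; the probe side can be streamed, which is why hash joins pair well with the results-streaming behavior described above.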

5.3 QUERY PROCESSING FLOW
Query processing in TIBCO Data Virtualization occurs in several phases:

•  Plan Generation and Optimization Phase – The plan generator and optimizer are responsible for producing one or more plans for each query. The optimal plan is automatically selected based on rules and dynamic cost functions. The cost function measures the time and computing effort required to execute each element of the query plan. Cost information comes from a service that pulls data from the repository about each data source in the query, as well as other environmental information during plan generation. For each data source, capabilities that are factored into plans include indexes, join support, and cardinality estimates.


The plan generator also compensates for the fact that each data source might have different capabilities that need to be normalized across the distributed and disparate data sources to successfully resolve the query. It’s important to note that the cost-based optimizer can work with data sources with limited metadata (such as a web service) as well as systems that have a significant volume of metadata (such as Oracle). Also, during plan generation, TIBCO Data Virtualization employs non-blocking join algorithms that use both memory-based and memory-disk algorithms to manage large result sets, and that have underlying data source awareness for pushing joins and predicates to underlying sources for execution.

•  Query Execution Phase – Proprietary algorithms enable TIBCO Data Virtualization to stream results from massively complex queries while the user query is still running. This lets applications begin consuming data before the query completes execution, and results in substantial performance gains. Queries are decomposed and executed in the target data sources, and the streaming results are assembled and pipelined directly to the user’s client software. The parallel multi-threaded architecture is truly exploited in this module when multiple queries with different run times across multiple data sources are all required to resolve a single composite query. The execution engine also does dynamic plan re-writing based on actual results found at run time.

•  Query Plan Enhancement Phase – An out-of-the-box installation of TIBCO Data Virtualization will do a very good job of optimizing queries. However, you can also enhance these query plans to improve optimization results.

During plan generation and execution, extensive statistics are collected and logged for use in the query plan visualization tool found within Studio. Developers use these insights to learn about the cardinality and distribution of data in the data source’s tables, which in turn allows query optimization to make better decisions about join algorithms and ordering.

Join options let you manually specify some of the run-time behaviors of the join instead of letting query optimization automatically determine the behavior. For example, you can select the actual algorithm with which the join will be processed.

Query hints can be included. For example, you can provide the expected cardinality of either or both sides of a join. All available query hints are outlined and explained in the manual.
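Cost-based plan selection from cardinality estimates can be illustrated with a deliberately simplistic model (the cost function is invented, not TDV's): estimate each candidate join order's cost and pick the cheaper one.

```java
// Illustrative sketch of cost-based plan selection with a deliberately
// simplistic cost model (invented, not TDV's): estimate each candidate
// join order from table cardinalities and pick the cheaper one.
public class CostBasedChooser {
    // Rough model: driving with the left table costs |left| rows scanned
    // plus the estimated join output |left| * |right| * selectivity.
    static long cost(long left, long right, double selectivity) {
        return left + Math.round(left * right * selectivity);
    }

    static String choosePlan(long ordersRows, long customersRows, double sel) {
        long ordersFirst = cost(ordersRows, customersRows, sel);
        long customersFirst = cost(customersRows, ordersRows, sel);
        return ordersFirst <= customersFirst ? "orders JOIN customers"
                                            : "customers JOIN orders";
    }

    public static void main(String[] args) {
        // With a huge orders table, driving from the small side wins.
        System.out.println(choosePlan(1_000_000, 1_000, 0.000001));
    }
}
```

This is also where cardinality hints help: correcting a bad estimate flips which order the chooser considers cheapest.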

5.4 CACHING
Caching stores a queried result-set for a period of time so it can be re-used to satisfy another client request. The cached result-set might be an intermediate result that can be used to satisfy a variety of higher-level requests, or it might be a very specific final result that satisfies a specific request. You can specify when, where, and how often result-sets get cached.

Caching is generally used for two reasons. The first is to reduce loads on underlying data sources. For example, a transactional system needs to be most responsive to the transaction creation process (such as order taking) during peak hours and it should probably not be servicing reporting needs at the expense of transactional needs. By employing caching, you can protect an underlying system from too much activity while still providing recent (cached) data to reporting applications.

A second reason to cache data is performance. When a cached dataset is already available to satisfy a new request, query time can be saved.


Caching may be applied to any data service via a caching policy that instructs TIBCO Data Virtualization where to cache and how often the cache is to be updated. To simplify development, multiple resources can share a common caching policy. The cache may be stored in a local file or a specified database. If the resource being cached is a procedure (including scripts, Java components, and web services), multiple caches will be created, each corresponding to a unique set of input parameters.

During query planning, the Optimizer recognizes that a particular resource has been cached, and it automatically modifies the query plan to use the cache instead of the original resource definition. This is interesting for two reasons. First, entire sub-trees of a query plan can be eliminated because of the availability of a cache for a resource. Second, because the cache itself is a data source (either a file or a database), the Optimizer can apply further optimization techniques on the cache.

You set the caching policy for a resource during development. It includes:

•  Storage Location – You can cache to a file or a database. Files work very well for small and relatively static caches in the development phase. For production use, databases serve as primary storage for larger and more volatile caches. PostgreSQL is the default database cache; files, Greenplum, HSQLDB, IBM DB2, Netezza, Microsoft SQL Server, MySQL, Oracle, SAP HANA, Sybase ASE and IQ, Teradata, and Vertica are also supported.

•  Storage Mode – In multi-table caching mode, you can use several physical tables to store cache copies. Using a single table to store the cache is best for smaller result sets that can be quickly replaced and re-indexed. Multi-table caching works best for large caches so you can update the caches in series, ensuring continuous availability and failover support.

•  Update Strategy – You can update a cache via a complete refresh or incrementally. The complete refresh strategy is the most straightforward approach, and it can be automatically applied to any cached resource. An incremental refresh strategy requires specific business logic that you provide through scripting or programming. Using a database for the cache storage allows a single cache to be shared by all nodes in the cluster.

•  Update Method – Whenever possible, TIBCO Data Virtualization leverages target repository native bulk extract and bulk load functions to load and refresh the cache because these functions typically perform better than SELECT and INSERT statements. When not available, it optimizes the SELECT and INSERT statements for you using multiple threads in parallel.

•  Update Timing – A cache may be updated periodically based on the clock or an external trigger. You can also refresh a cache manually, which is useful for caching reference tables that seldom change.
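The update strategies above can be sketched in code. The following Python sketch is purely illustrative (the policy fields and row format are hypothetical, not TIBCO APIs): a complete refresh replaces the cache wholesale, while an incremental refresh merges only the rows changed since the last update.

```python
from dataclasses import dataclass

@dataclass
class CachePolicy:
    """Illustrative stand-in for a caching policy; field names are hypothetical."""
    storage: str = "database"   # "file" or "database"
    multi_table: bool = False   # single- vs multi-table storage mode
    strategy: str = "full"      # "full" or "incremental"
    refresh_minutes: int = 60   # timer-based update timing

def refresh_cache(policy, source_rows, cache_rows, modified_since=None):
    """Return the new cache contents under the given policy."""
    if policy.strategy == "full":
        # Complete refresh: replace the cache with a fresh copy of the source.
        return list(source_rows)
    # Incremental refresh: merge only rows changed since the last update.
    # This is the business logic the developer must supply.
    changed = [r for r in source_rows if r["updated"] > modified_since]
    merged = {r["id"]: r for r in cache_rows}
    for r in changed:
        merged[r["id"]] = r
    return list(merged.values())
```

The incremental path assumes the source exposes a change indicator (here an `updated` column); in practice that indicator is whatever your scripted refresh logic can query.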

5.5 DATA DELIVERY

TIBCO Data Virtualization delivers data to consumers on demand via a number of protocols using a request-response paradigm.

For views, the client application establishes a connection to TIBCO Data Virtualization using an ODBC, JDBC, or ADO.NET driver, issues SQL queries, and reads results. This form of the request-response paradigm establishes a persistent connection through which multiple request-response cycles may occur. When the client is finished, the connection is closed. There are no practical limits on result set size: results can be as large as necessary to answer the query because TIBCO Data Virtualization streams them through the persistent connection for as long as the client continues to read. All of the usual metadata expected from a relational database is available through this type of connection.
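The streaming behavior described above can be illustrated with standard database cursor batching. The sketch below uses Python's sqlite3 module as a stand-in for a TIBCO Data Virtualization ODBC/JDBC connection; the fetchmany loop is what keeps client memory use flat regardless of result set size.

```python
import sqlite3

def stream_results(conn, sql, batch_size=1000):
    """Generator that streams rows in batches, so arbitrarily large
    result sets never need to be held in memory at once."""
    cur = conn.cursor()
    cur.execute(sql)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row
    cur.close()

# sqlite3 stands in here for a virtualized data service connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(10)])

# The client consumes rows as they arrive rather than materializing them.
total = sum(amount for _id, amount in
            stream_results(conn, "SELECT id, amount FROM orders", batch_size=3))
```

The same pattern applies through any DB-API-style driver; only the connection object changes.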


For data services, there is no persistent connection between TIBCO Data Virtualization and the client application. Instead, a call is made, the result is returned, and the session is closed. XML document limitations may constrain the size of the result set. Several web service standards are supported, including SOAP, REST, OData, JSON, and raw XML, and the request-response cycle can occur over HTTP/S or JMS (a message bus). There are many additional options within each of these standards and within the TIBCO Data Virtualization processing pipeline; all of them are outlined in the documentation.
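As a rough illustration of this request-response pattern for a published REST/OData data service, the sketch below constructs a JSON request with Python's standard library. The host, port, and endpoint path are hypothetical; consult the product documentation for the actual URL scheme.

```python
from urllib.request import Request

def build_odata_request(base_url, view, top=None):
    """Construct a single request against a published OData data service.
    One request, one response, no persistent session."""
    url = f"{base_url}/odata/{view}"
    if top is not None:
        url += f"?$top={top}"          # standard OData paging option
    return Request(url, headers={"Accept": "application/json"})

# Hypothetical server address and view name, for illustration only.
req = build_odata_request("https://dv.example.com:9400", "CustomerView", top=5)
```

Sending the request (for example with `urllib.request.urlopen`) completes one request-response cycle, after which the session is closed.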

While generally used for queries, TIBCO Data Virtualization also provides a transaction processing engine to support database and file updates. INSERT, UPDATE, and DELETE statements are supported against underlying data sources, whether or not those sources offer native support for these functions. TIBCO Data Virtualization also supports compensating processes for data sources that are not inherently transactional, such as flat files and web services. When required, TIBCO Data Virtualization can also roll back any update.

5.6 TRIGGERS

You can use triggers to invoke activities independent of data consumer requests. For example, it may be desirable to swap a particular data source to a different source at night during a maintenance window. Triggers are also used internally to implement some time-based activities; for example, a timed cache refresh is implemented using a trigger.

A trigger resource has two parts. The first part defines the condition that causes the trigger to fire. Supported triggering events include:

•  Timer Event – The trigger fires at the specified time. It can also be set to repeat periodically.

•  System Event – The trigger fires whenever the given system event occurs. TIBCO Data Virtualization provides a documented list of system events.

•  User-defined Event – The trigger fires whenever the given user-defined event occurs. User-defined events are posted by invoking a system procedure from inside TIBCO Data Virtualization or through the public Admin API.

•  JMS Event – The trigger fires based on a JMS event.

The second part of a trigger defines what activity occurs when the trigger fires. TIBCO Data Virtualization runs each activity as an independent execution thread. Supported actions include executing a procedure, sending an email, and generating a cache, among others.
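The two-part trigger model, a firing condition paired with an action run on its own thread, can be sketched as follows. This is a generic illustration of the concept, not TIBCO's internal implementation:

```python
import threading

class Trigger:
    """A trigger's two parts: a firing condition (an event name here)
    and an action (a callable) to run when that condition occurs."""
    def __init__(self, event, action):
        self.event = event
        self.action = action

class TriggerRegistry:
    def __init__(self):
        self.triggers = []

    def register(self, trigger):
        self.triggers.append(trigger)

    def post(self, event):
        """Fire every trigger bound to this event, each on an
        independent execution thread."""
        threads = []
        for t in self.triggers:
            if t.event == event:
                th = threading.Thread(target=t.action)
                th.start()
                threads.append(th)
        for th in threads:
            th.join()

# A user-defined event wired to a hypothetical cache-refresh action.
fired = []
registry = TriggerRegistry()
registry.register(Trigger("cache.refresh", lambda: fired.append("refresh")))
registry.post("cache.refresh")
```

A timer event would simply post into the same registry on a schedule; the action side is identical regardless of what fired the trigger.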

5.7 INTEGRATED SECURITY

TIBCO Data Virtualization provides myriad fine-grained security capabilities, including authentication, authorization, encryption, and masking, and leverages your existing security investments (LDAP, Active Directory, and Kerberos), standards (SAML, NTLM, TLS, and WSS), and policies to protect your sensitive data. Security is enforced for every request made and on all resources involved in the request. Security used to grant access to TIBCO Data Virtualization is called authentication; security used to allow and enforce privileges is called authorization. You set authentication, authorization, and encryption rules via policies.

At design time, security establishes ownership of resources and enforces resource sharing and modification privileges. At run time, security regulates access to TIBCO Data Virtualization resources and the data they return.


You can grant privileges to individual users or to groups that you create. Resource privileges include:

•  Grant – At design time, privileges specify whether a user or group may grant privileges for a given resource to other users or groups. The owner of a resource always has grant privileges.

•  Read, write – At design time, privileges specify whether a user or group may see or modify a resource.

•  Select, insert, update, delete – At run time, privileges specify whether a user or group may perform the given SQL command on a tabular resource.

•  Execute – At run time, privileges specify whether a user or group may execute a procedure (such as a web service or a script), and also see the result of the procedure execution.

Security Authentication

Users attempting to access TIBCO Data Virtualization must present credentials that identify them and allow authentication (usually a username and password). Users and groups may be defined in TIBCO Data Virtualization or in an external authentication system. When configured to use an external authentication system, TIBCO Data Virtualization delegates authentication to the external system, which preserves the separation of roles.

Security Authorization

Whenever you perform any action on a resource (such as executing a procedure), TIBCO Data Virtualization checks your user privileges, and the privileges of the groups to which you belong, to determine whether the action is allowed. Authorization occurs on every resource involved in the operation, which means you may need privileges on resources beyond the one you directly access. If you lack a required privilege, an appropriate error is returned.
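A simplified model of this authorization check might look like the following, where a grants table maps (principal, resource) pairs to privilege sets, and the operation is denied unless every resource involved is covered. The data structures are illustrative, not the product's actual model:

```python
def authorized(user, user_groups, operation, resources, grants):
    """Check the user's own privileges and those of every group the
    user belongs to, for every resource involved in the operation.
    `grants` maps (principal, resource) -> set of privilege names."""
    principals = [user] + list(user_groups)
    for resource in resources:
        if not any(operation in grants.get((p, resource), set())
                   for p in principals):
            return False  # a missing privilege on any one resource denies all
    return True

# Hypothetical grants: a group privilege and a user-specific privilege.
grants = {
    ("analysts", "/views/orders"): {"SELECT"},
    ("alice", "/views/customers"): {"SELECT"},
}
```

Note that a query joining two views needs privileges on both: access to the directly requested view alone is not sufficient.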

Executing Authentication and Authorization

When provisioning a data source, you can specify whether the data source should be accessed with a single shared set of authentication credentials or with the specific credentials of the current user. If a single set of credentials is specified, all authorization performed in the underlying data source is done using those credentials. If user-specific credentials are specified (also called “pass-through authentication”), then authentication to, and authorization within, the underlying data source are performed using the current user's credentials. In this case the underlying data source enforces its own security because it receives the same identity it originally issued; TIBCO Data Virtualization simply passes that identity from the requester down to the underlying data source.

Row-level and Column-level Authorization

TIBCO Data Virtualization supports both row-level and column-level access control. You can set up row-level access via complex security policies, and you can use Studio to set up column-level security.


Column Masking

You can implement column masking rules to hide, replace, or obfuscate portions of a column's value depending on a user's level of access.
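A column masking rule of this kind can be sketched as a function of the caller's access level. The level names and masking formats below are illustrative assumptions, not the product's configuration syntax:

```python
def mask_column(value, access_level):
    """Hide, partially obfuscate, or fully reveal a column value
    depending on the caller's access level (levels are illustrative)."""
    if access_level == "full":
        return value                                   # reveal as-is
    if access_level == "partial":
        return "*" * max(len(value) - 4, 0) + value[-4:]  # keep last four
    return "****"                                      # hide entirely
```

Applied to a payment-card column, for example, a "partial" user would see only the last four digits, while other users see a fixed mask.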

Multiple User Authentication Options

Basic, pass-through, Kerberos, SAML, and NTLM authentication types are supported.

The pluggable authentication module allows administrators to integrate authentication services from any external system.

For more detail on the complete set of Security features, please see the TIBCO Data Virtualization datasheet.

5.8 DATA QUALITY AND GOVERNANCE

TIBCO Data Virtualization provides myriad data quality and governance capabilities that help you ensure correct and complete data with comprehensive visibility, traceability, and control. Please see the TIBCO Data Virtualization datasheet for details.

6. TIBCO DATA VIRTUALIZATION MANAGEMENT

6.1 SYSTEMS MANAGEMENT

You can run TIBCO Data Virtualization continuously so your data services are always available to consumers and service level agreements are met. TIBCO provides five tools that combine to support effective systems management:

•  Manager – This browser-based set of consoles facilitates both TIBCO Data Virtualization configuration and operations. From a management point of view, you can review and modify server settings, see active client sessions, investigate running queries, and more. The Manager is the primary tool for managing a single TIBCO Data Virtualization instance.

Figure 5. Manager through a Web Browser

•  Manager Pane in Studio – Although Studio is primarily a development tool, it also has integrated manager functionality that is similar to the browser-based Manager.

•  Monitor – Monitor is the primary tool for monitoring TIBCO Data Virtualization instances. It provides a comprehensive, real-time view of TIBCO Data Virtualization clusters and displays all the pertinent system health indicators required to help your IT operations staff guide corrective actions. With Monitor, you can see collections of clients and their level of activity, monitor host memory and CPU usage, and display dashboards that include overall health indicators. If processes slow down or operations fail, you can use these indicators to guide the actions required to maintain agreed service levels.


Figure 6. Monitor

•  SNMP – Simple Network Management Protocol (SNMP) is a standard used in many network operations centers (NOCs) to monitor the health and activity of network components. Although it is generally more focused on hardware than software assets, it is still important for software servers to participate in an SNMP monitoring strategy. TIBCO Data Virtualization publishes a standard SNMP MIB that defines numerous available traps that can be monitored by the NOC dashboards and personnel.

•  Administration API – TIBCO Data Virtualization includes a public, documented API for programmatically manipulating all metadata inside an instance. Although the Admin API serves a broader scope than system management, it includes many operations that are useful for it; for example, all configuration variables can be read or changed through the Admin API. You can use the Admin API to implement integrated system management strategies across TIBCO Data Virtualization and other systems.

6.2 CLUSTERING TIBCO DATA VIRTUALIZATION INSTANCES

Active Cluster supports substantial scaling of TIBCO Data Virtualization deployments while ensuring continuous availability to fulfill service level agreements. Active Cluster enables multiple instances of TIBCO Data Virtualization to work together to handle user requests. Each instance in the cluster is called a node. The TIBCO Data Virtualization clustering architecture is peer-to-peer, with all nodes treated equally. There can be up to sixteen nodes in a single cluster. With its active approach, all instances in the cluster are active at all times.

Clustering configurations are flexible based on reliability, availability, and scalability requirements.

•  Scalability – To scale up vertically, you can add capacity (CPU cores, RAM, disk storage, network bandwidth) to existing nodes as needed. Multiple nodes can run on the same server as long as it has sufficient computing resources. To scale out horizontally, you can add nodes to the cluster.


•  High Availability / Disaster Recovery – To achieve high availability and enable disaster recovery, a cluster is created with nodes running on separate servers. Spanning multiple servers protects the cluster from hardware failures. The cluster must retain enough processing headroom to absorb the loss of one or more nodes so that all requests can continue to be serviced when a server fails.

Using a Load Balancer to Distribute Workloads Across a Cluster

The usual practice for distributing client requests is to install a load balancer in front of the TIBCO Data Virtualization cluster (just as you would for a cluster of application servers). The load balancer distributes requests among the individual nodes according to a pre-configured policy (such as round robin).

If a client request is a simple, stateless request-response exchange (such as a web service invocation), subsequent requests from that client can be distributed to different nodes by the load balancer. If a client instead creates a persistent (stateful) connection, as with JDBC/ODBC, the connection is bound to its assigned node for the duration of the session. If the node holding a persistent connection fails, the client receives an error and must re-establish a connection to the cluster; the load balancer then routes the new connection to another node.
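Client-side handling of a failed persistent connection typically reduces to retry logic like the following sketch. The connect function and the simulated cluster here are hypothetical; in practice the load balancer decides which healthy node receives the new connection.

```python
import time

def connect_with_retry(connect, attempts=3, backoff=0.0):
    """Re-establish a session after a node failure. Each retry goes
    back through the load balancer, which picks a healthy node."""
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(backoff)  # brief pause before retrying
    raise last_error

# Simulated cluster: the first node is down; the retry lands elsewhere.
responses = iter([ConnectionError("node-1 down"), "session-on-node-2"])

def fake_connect():
    result = next(responses)
    if isinstance(result, Exception):
        raise result
    return result

session = connect_with_retry(fake_connect)
```

Stateless web service clients need no such logic: the load balancer simply routes each independent request to any available node.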

Metadata Synchronization across Instances

The cluster exists as a peer-to-peer network of TIBCO Data Virtualization nodes. Each node holds its own copy of the metadata repository, which is continuously synchronized with all other nodes in the cluster. If a metadata change is made on one node, that change is immediately propagated to all other nodes. As such, deploying new resources to a TIBCO Data Virtualization cluster is a simple matter of promoting the resources to a single node; the cluster propagates the updates everywhere else.

Caching within a Cluster

If the cache is stored in a relational database, the nodes in the cluster can share a single copy of it. This arrangement provides better cache coherency and makes more efficient use of resources. When the cluster shares a single cache, the nodes must coordinate cache updates so that only one node performs each update. This coordination is done through a bidding and assignment algorithm common in peer-to-peer architectures, which ensures that only a single node updates the shared cache.
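Whatever the bidding details, the outcome of such a coordination step is that exactly one node is assigned the update. A toy illustration, assuming a lowest-bid-wins convention with node id as tie-breaker (the actual algorithm is internal to the product):

```python
def elect_updater(bids):
    """Pick exactly one node to perform the shared-cache refresh.
    Each node submits a bid; the lowest bid wins, with node id as
    the deterministic tie-breaker (convention assumed here)."""
    return min(bids, key=lambda node: (bids[node], node))

# Hypothetical bids gathered from the peer-to-peer exchange.
bids = {"node-a": 17, "node-b": 4, "node-c": 9}
winner = elect_updater(bids)
```

Because every node evaluates the same bids, all peers agree on the winner without a central coordinator, which is the property the clustered cache update relies on.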

6.3 DEPLOYMENT MANAGEMENT

Deployment Manager lets you quickly and easily migrate entire projects in a single step, including their resources, cache settings, security profiles, and more, across multiple TIBCO Data Virtualization instances to simplify and automate your development lifecycle.

Deployment Manager supports:

•  Resource Migration – Transfer or promote artifacts from one TIBCO Data Virtualization instance to another.

•  Cache Setting Migration – Transfer or promote cache table names, caching methods, refresh methods, cache policies, and cache schedules.

•  User/Group Migration – Transfer or promote user/group IDs, security profile, and other information.

•  Trigger Configuration Migration – Replicate trigger configurations.


Global Headquarters
3307 Hillview Avenue
Palo Alto, CA 94304
+1 650-846-1000 TEL
+1 800-420-8450
+1 650-846-1005 FAX
www.tibco.com

TIBCO fuels digital business by enabling better decisions and faster, smarter actions through the TIBCO Connected Intelligence Cloud. From APIs and systems to devices and people, we interconnect everything, capture data in real time wherever it is, and augment the intelligence of your business through analytical insights. Thousands of customers around the globe rely on us to build compelling experiences, energize operations, and propel innovation. Learn how TIBCO makes digital smarter at www.tibco.com.

©2017–2018, TIBCO Software Inc. All rights reserved. TIBCO and the TIBCO logo are trademarks or registered trademarks of TIBCO Software Inc. or its subsidiaries in the United States and/or other countries. Apache and Hadoop are trademarks of The Apache Software Foundation in the United States and/or other countries. All other product and company names and marks in this document are the property of their respective owners and mentioned for identification purposes only.

10/31/18

Figure 7. Deployment Manager

7. CONCLUSION

TIBCO Data Virtualization is data virtualization software that lets you integrate data at big data scale with breakthrough speed and cost effectiveness. With TIBCO Data Virtualization, you can build and manage virtualized data services that access, transform, and deliver the data your business requires to accelerate revenue, reduce costs, lessen risk, improve compliance, and more.

This document provides developers and administrators with a technical overview of TIBCO Data Virtualization deployment, development, run-time, and management capabilities. It complements the TIBCO Data Virtualization datasheet, product documentation, training courseware, and best practices whitepapers. Links to additional detail are provided where appropriate.

FOR MORE INFORMATION

Contact your TIBCO representative to learn more about TIBCO Data Virtualization. Visit the TIBCO Data Virtualization website at www.tibco.com/products/tibco-data-virtualization