
Business Intelligence Exam II Answers

1. Why is Big Data important? What are the Vs that are used to define Big Data?

When big data is effectively and efficiently captured, processed, and analyzed, companies gain a more complete understanding of their business, customers, products, and competitors, which can lead to efficiency improvements, increased sales, lower costs, better customer service, and improved products and services.

Volume: Big data implies enormous volumes of data. Data used to be created mainly by employees; now it is generated by machines, networks, and human interaction on systems such as social media, so the volume of data to be analyzed is massive.

Variety: Variety refers to the many sources and types of data, both structured and unstructured. We used to store data from sources such as spreadsheets and databases; now data also arrives in unstructured forms such as e-mail, photos, video, audio, and sensor readings.

Velocity: Big data velocity deals with the pace at which data flows in from sources such as business processes, machines, networks, and human interaction with social media sites, mobile devices, and so on.

2. What are the critical success factors for Big Data Analytics? Explain them.

Ensure alignment of the organization and project.

Simply designing and building a big data application will not ensure its success. Consideration must be given to who is going to provide the infrastructure for the application and, more importantly, who is going to operate it and how. The application will become costly or even legacy if no thought is given to who is going to maintain it and how. Most importantly, a big data project is likely to disrupt how the business currently operates, and so the project needs to consider the business change required to make full use of the application and how it will transform. This spans process, structural and cultural change. All parts of the organization involved in the project need to focus on a common goal to succeed. Sound governance must be put in place to deliver and sustain the project to realize the desired benefits.

Apply an ethical policy.

Incorporation of new data sources into big data systems, coupled with significant improvements in the capabilities of analytics technology, provides organizations with opportunities to gain far greater and far deeper insight than ever before. For example, bringing together corporate records on customers with log files on customers' use of applications, social media data and statistical modeling techniques allows a rounded, up-to-date view of individuals to be formed. However, this does not mean that every insight that could be derived should be, nor that every insight should necessarily be acted upon. Consideration should be given to the original purpose for which the individual gave information about themselves and whether an organization's intended use of that data is reasonable, and indeed seen to be reasonable. Moreover, data quality becomes more important with big data because errors are amplified. Poor-quality data also makes it harder to minimize false positives and false negatives. So if resulting actions are wrong, an organization risks reputational damage or contravention of regulations.

Employ the right skills.

Organizations should utilize their existing business intelligence staff in big data projects: big data is not something separate, but augments what these people do already. However, skills development is needed to succeed with big data. Firstly, big data systems use large-scale infrastructure, which requires skills to design and operate successfully. Secondly, skills in statistics and programming are needed to reflect the business opportunity in the resulting applications. Taking an approach that only uses data warehousing skills will simply result in today's techniques being applied on big data technology, thereby not fully exploiting the opportunity. As an aside, organizations should recognize that Hadoop is not necessarily a replacement for a data warehouse: they have different design points. What is well suited to one may not be best suited to the other, and the skills required to build and operate each system differ. Ultimately, maximizing the business return from a big data system is about more than the choice of technologies, and one of the factors that must be taken into account is acquisition of the right infrastructure and analytics skills to succeed.

3. What are the common characteristics of emerging Big Data technologies?

Business intelligence, querying, reporting, and searching, including many implementations of searching, filtering, indexing, speeding up aggregation for reporting and report generation, trend analysis, search optimization, and general information retrieval. (Examples include: Alibaba, University of North Carolina Lineberger Comprehensive Cancer Center, University of Freiburg.)

Improved performance for common data management operations, with the majority focusing on log storage, data storage and archiving, followed by sorting, running joins, Extraction/Transformation/Loading (ETL) processing, other types of data conversions, as well as duplicate analysis and elimination. (Examples: AOL, Brilig, Infochimps.)

Non-Database Applications, such as image processing, text processing in preparation for publishing, genome sequencing, protein sequencing and structure prediction, web crawling, and monitoring workflow processes. (Examples: Benipal Technologies, University of Maryland.)

Data mining and analytical applications, including social network analysis, facial recognition, profile matching, other types of text analytics, web mining, machine learning, information extraction, personalization and recommendation analysis, ad optimization, and behavior analysis.

4. What is Map Reduce? What does it do? How does it do it?

Map Reduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Map Reduce has a master and workers, but it is not all push or pull; rather, the work is a collaborative effort between them.

The master assigns a work portion to the next available worker; thus, no work portion is forgotten or left unfinished.

Workers send periodic heartbeats to the master. If the worker is silent for a period of time (usually 10 minutes), then the master presumes this worker crashed and assigns its work to another worker. The master also cleans up the unfinished portion of the crashed worker.

All of the data resides in HDFS, which avoids the central server concept, with its limitations on concurrent access and on size. Map Reduce never updates data; rather, it writes new output. This is one of the features of functional programming, and it avoids update lockups.

Map Reduce is network and rack aware, and it optimizes the network traffic.
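To make the model concrete, below is a minimal single-process sketch of the map and reduce steps in Python, using the classic word-count example. It only illustrates the programming model; a real Map Reduce job distributes the map and reduce tasks across workers as described above, and the function names here are illustrative rather than part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input record."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return (word, sum(counts))

def run_job(documents):
    """Single-process stand-in for the shuffle/sort step between map and reduce."""
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    intermediate.sort(key=itemgetter(0))   # shuffle/sort by key
    return [reduce_phase(word, (count for _, count in group))
            for word, group in groupby(intermediate, key=itemgetter(0))]

if __name__ == "__main__":
    docs = ["big data needs big processing", "map reduce processes big data"]
    print(run_job(docs))   # e.g. [('big', 3), ('data', 2), ...]
```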

5. What is Hadoop? How does it work? What are the main Hadoop components?

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of machines using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of machines, each of which may be prone to failure.

HDFS (storage) and Map Reduce (processing) are the two core components of Apache Hadoop.

Hadoop Distributed File System (HDFS): HDFS is a distributed file system that provides high-throughput access to data. It provides a limited interface for managing the file system so that it can scale and deliver high throughput. HDFS makes multiple replicas of each data block and distributes them on machines throughout a cluster to enable reliable and rapid access.

Map Reduce: Map Reduce is the processing component: a programming model in which a master distributes map and reduce tasks to workers across the cluster so that large data sets can be processed in parallel, as described in the previous answer.
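As a small practical illustration of loading data into the HDFS component described above, the hedged sketch below drives the standard hdfs dfs shell commands from Python. It assumes a configured Hadoop client on the machine; the file and directory paths are hypothetical.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' command and return its output (assumes a Hadoop client is installed)."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical paths: copy a local log file into the cluster, then list the directory.
hdfs("-mkdir", "-p", "/data/weblogs")
hdfs("-put", "-f", "weblogs_2024.log", "/data/weblogs/")
print(hdfs("-ls", "/data/weblogs"))
```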

6. What is a data scientist? What makes them so much in demand? What are the common characteristics of a data scientist? Which one is the most important?

A data scientist is someone who is curious, who can look at data and spot patterns. It is almost like a Renaissance individual who really wants to learn and bring change to an organization.

A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics, and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge; arguably this combination of business acumen and communication is the most important characteristic. Good data scientists will not simply address business problems; they will pick the right problems that have the most value to the organization.

7. Define Cloud Computing. How does it relate to PaaS, SaaS and IaaS?

Cloud computing is computing in which large groups of remote servers are networked to allow centralized data storage and online access to computer services or resources. Clouds can be classified as public, private, or hybrid. Cloud computing providers offer their services according to several fundamental models.

Platform as a service (PaaS): In the PaaS model, cloud providers deliver a computing platform, typically including an operating system, a programming language execution environment, a database, and a web server. Application developers can develop and run their software solutions on a cloud platform without the cost and complexity of buying and managing the underlying hardware and software layers.

Software as a service (SaaS):

In the SaaS model, cloud providers install and operate application software in the cloud and cloud users access the software from cloud clients. Cloud users do not manage the cloud infrastructure and platform where the application runs. This eliminates the need to install and run the application on the cloud user's own computers, which simplifies maintenance and support.

Infrastructure as a service (IaaS):

In the most basic cloud-service model, and according to the IETF (Internet Engineering Task Force), providers of IaaS offer computers (physical or, more often, virtual machines) and other resources. Cloud providers typically bill IaaS services on a utility computing basis: cost reflects the amount of resources allocated and consumed.

8. How does Cloud Computing affect Business Intelligence?

When looking into practicalities of moving BI into the cloud we should first consider potential benefits and then examine the risks involved.

Increased Elastic Computing Power

Computing power refers to how fast a machine or piece of software can perform an operation. Hosting BI on the cloud means that the computing, or processing, power depends on where the software itself is hosted rather than on on-premises hardware. Cloud computing has become very popular over the last few years and is hailed as revolutionizing IT, freeing corporations from large IT capital investments and enabling them to plug into extremely powerful computing resources over the network. As the volume of data increases to unprecedented levels and the growing trend of Big Data becomes the norm rather than the exception, more and more businesses are looking for BI solutions that can handle gigabytes (and eventually terabytes) of data.

Potential Cost Savings

Pay-as-you-go computing power for BI tools has the potential to reduce costs. A user on the cloud only has to pay for whatever computing power is needed. Computing needs can vary considerably due to seasonal changes in demand or during high-growth phases, which makes IT expenditure much more efficient.

Easy Deployment

The cloud makes it easier for a company to adopt a BI solution and quickly experience its value. Managers see results quickly, which increases confidence in the success of the implementation. Deployment requires less complicated upgrades to existing processes and IT infrastructure. The development cycle is much shorter, meaning that adopting BI does not have to be a drawn-out process, thanks to the elimination of the complicated upgrade processes and IT infrastructure demanded by on-premises BI solutions.

Supportive of Nomadic Computing

Nomadic computing is the information systems support that provides computing and communication capabilities and services to users as they move from place to place. As globalization continues to dominate all industries, nomadic computing services and solutions will grow in demand. Cloud-hosted BI also allows employees and BI users to travel without losing access to the tools.

9. Define Business Intelligence? Discuss the framework and the benefits of Business Intelligence?

Business intelligence is a process of analyzing data and transforming raw data into readable information using various tools. Using tools such as ETL (Extract, Transform, Load) we can transform the raw data. Business intelligence provides an effective view of data by supporting decision-making and strategic operational insight through functions such as online analytical processing (OLAP), reporting, and predictive analytics. Analytical tools should help decision makers find the right data quickly and enable them to make well-informed decisions.

Business intelligence (BI) is the process of gathering the right data in the right way at the right time, and delivering the right results to the right people for decision-making purposes.

Framework

A business intelligence framework provides the strategy, standards, and best practices needed to ensure that business intelligence reporting and analysis meet organizational requirements. It comprises:

Data management (Governance) standards and best practices;

Project management framework; and

Metrics management

Benefits:

Higher revenue per employee can be achieved by implementing Business Intelligence in a company.

Time saving: this is a major advantage of BI. By implementing BI in a company, many processes are automated, which saves both time and cost and ultimately increases the productivity of the organization.

Better decision making: in order to stay competitive with other companies, every company has to make the right decisions, and implementing Business Intelligence helps achieve this.

BI makes data readable and accessible.

10. Describe the basic elements of the Balanced Scorecard (BSC) and Six Sigma methodologies?

Balanced scorecard (BSC) is both a performance measurement and a management methodology that helps translate an organization's financial, customer, internal process, and learning and development objectives and targets into a set of actionable initiatives. As a methodology, BSC is designed to overcome the limitations of systems that are financially focused. It does this by translating an organization's vision and strategy into a set of interrelated financial and nonfinancial objectives, measures, targets, and initiatives. The nonfinancial objectives fall into one of three perspectives:

Customers:

This objective defines how the organization should appear to its customers if it is to fulfill its vision.

Internal business process: This objective specifies the processes the organization must excel at in order to satisfy its shareholders and customers.

Learning and growth: This objective indicates how an organization can improve its ability to change and improve in order to achieve its vision.

The six-stage process of the Balanced Scorecard is:

Developing and formulating a strategy.

Create and clarify the organization's mission, values, and vision; identify through strategic analysis the internal and external forces affecting the strategy; and define the organization's strategic direction, specifying where and how the organization will compete.

Planning the strategy.

Convert statements of strategic direction into specific objectives, measures, targets, initiatives, and budgets that guide actions and align the organization for effective strategy execution.

Aligning the organization.

Ensure that business unit and support unit strategies are aligned with the corporate strategy and that employees are motivated to execute it.

Planning the operations

Ensure that the changes required by the strategy are translated into changes in operational processes, and that resource capacity, operational plans, and budgets reflect the direction and needs of the strategy.

Monitoring and learning

Determine, through formal operational review meetings, whether short-term financial and operational performance is in line with the specified targets, and, through strategy review meetings, whether the overall strategy is being executed effectively.

Testing and adapting the strategy

Determine, through strategy testing and adapting meetings, whether the strategy is working, whether its fundamental assumptions are still valid, and whether the strategy needs to be altered or adapted over time.

Six Sigma is a process improvement methodology that enables organizations to scrutinize their processes, pinpoint problems, and apply remedies.

In Six Sigma, a business is viewed as a collection of processes. A business process is a set of activities that transforms a set of inputs, including suppliers, assets, resources, and information, into a set of outputs for another person or process.

The Six Sigma process consists of five steps: defining, measuring, analyzing, improving, and controlling a process.

11. Describe the differences between Scorecards and Dashboards?

Dashboards and scorecards both provide visual displays of important information that is consolidated and arranged on a single screen, so it can be absorbed at a glance and easily explored. The differences between a dashboard and a balanced scorecard are:

Used for: a dashboard is used for performance measurement or monitoring; a balanced scorecard is used for performance management.

Measurement tool: a dashboard displays metrics; a balanced scorecard displays KPIs (a metric plus a target).

Link to business objectives: a dashboard's measures are not linked to business objectives; a scorecard's measures are linked to them.

What it measures: a dashboard measures performance; a scorecard measures progress (the current value versus the target value).

Update frequency: a dashboard is updated in real time; a scorecard is updated periodically.

Focus: a dashboard focuses on operational (short-term) goals; a scorecard focuses on strategic (long-term) goals.

Purpose: a dashboard gives a high-level idea of what is happening in the company; a scorecard is used to plan and execute a strategy, identifying why something is happening and what can be done about it.

12. What is Data Warehousing? What are the characteristics of Data Warehousing? Explain the Data Warehousing framework? What is the future of Data Warehousing?

A data warehouse (DW) is a pool of data delivered to support decision making. It is also a repository of present and historical data of potential interest to managers throughout the organization. Data are usually structured to be available in a form ready for analytical processing activities. A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process.

Characteristics of Data Warehousing

Subject oriented.

Data are organized by detailed subject, such as sales, products, or customers, containing just information relevant for decision support. Subject orientation enables users to determine how their business is performing, as well as why. A data warehouse differs from an operational database in that most operational databases have a product orientation and are tuned to handle transactions that update the database. Subject orientation provides a more comprehensive perspective of the organization.

Integrated.

Integration is closely related to subject orientation. Data warehouses must place data from distinctive sources into a consistent format. To do so, they must deal with naming conflicts and discrepancies among units of measure. A data warehouse is presumed to be totally integrated.

Time Variant

A warehouse maintains historical data, which are used to detect trends, deviations, and long-term relationships for forecasting and comparisons that lead to decision making. The data warehouse must support time.

Nonvolatile

After data are entered into a data warehouse, users cannot change or update them.

Web-based

DWs are especially developed for web-based applications.

Client/server

A data warehouse uses the client/server architecture to provide easy access for end users

Metadata

A data warehouse contains metadata (data about data) about how the data is organized and how to effectively use them. And other characteristics are relational and real-time.

Data warehouse Framework

Data sources.

Data are sourced from different autonomous operational "legacy" systems and perhaps from external data providers. Data may also originate from an online transaction processing (OLTP) or ERP system. Web data such as web logs may also feed a data warehouse.

Data extraction and transformation.

Data are extracted and properly transformed using custom-written or commercial software called ETL; a minimal sketch of this step appears after this framework list.

Data loading.

Data are loaded into a staging area, where they are transformed and purged. The data are then prepared to load into the data warehouse and/or data marts.

Comprehensive database.

Essentially, this is the enterprise data warehouse (EDW) that supports all decision analysis by providing relevant summarized and detailed information originating from many different sources.
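To illustrate the extraction, transformation, and loading step in the framework above, here is a minimal ETL sketch in Python using only the standard library, with SQLite standing in for the warehouse. The file name, column names, and cleansing rules are hypothetical.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw sales records from an operational export (hypothetical CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: standardize the amount field and drop records that fail basic quality checks."""
    clean = []
    for row in rows:
        try:
            amount = round(float(row["amount_usd"]), 2)   # enforce a single currency/unit
        except (KeyError, ValueError):
            continue                                      # discard malformed records
        clean.append((row["order_id"], row["order_date"], amount))
    return clean

def load(rows, db_path="warehouse.db"):
    """Load: append the cleansed rows into a warehouse table (SQLite as a stand-in)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS fact_sales "
                "(order_id TEXT, order_date TEXT, amount REAL)")
    con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales_export.csv")))
```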

Future of Data Warehouse

As data warehousing becomes an integral part of an organization, it is likely to become as "anonymous" as any other part of the information systems landscape.

One challenge is coming up with a workable set of rules that guarantees privacy while encouraging the use of very large data sets. Another is the need to store unstructured data such as multimedia, maps, and sound.

The growth of the Internet allows external data to be integrated into a data warehouse, but its varying quality is likely to prompt the emergence of third-party intermediaries whose purpose is to rate data quality.

To survive in a future world of low-cost application systems, data warehouses must make the transition to a federated architecture.

In the future, data warehousing will be an integral part of the organization, with related technologies such as big data platforms, Teradata, and Hadoop evolving alongside it.

13. Define Business Performance Management? What is the Business Performance Management cycle? How does BPM differ from BI? How are they the same?

Business performance management (BPM) refers to the business processes, methodologies, metrics, and technologies used by enterprises to measure, monitor, and manage business performance.

BPM is a part of BI, and it incorporates many of its technologies, applications, and techniques.

Business Intelligence describes the technology used to access, analyze, and report on data relevant to an enterprise. It encompasses a wide spectrum of software, including ad hoc querying, reporting, online analytical processing (OLAP), dashboards, scorecards, search, visualization, and more.

BPM has been characterized as the convergence of BI and planning. The processes that BPM encompasses are not new; virtually every medium and large organization has them in place. What BPM adds is a framework for integrating these processes, methodologies, metrics, and systems into a unified solution.

A BPM system is strategy driven. It encompasses a closed-loop set of processes that link strategy to execution in order to optimize business performance.

The BPM cycle is a continuous process consisting of five major steps: Plan, Execute, Monitor, Analyze, and Forecast. Each step follows from the prior one. BPM involves monitoring key performance indicators (KPIs) that measure whether an organization is meeting its objectives and overarching strategy. A KPI in this sense is a measure defined by the business that allows observation of actual values as they emerge from line-of-business (LOB) applications.
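As a small sketch of KPI monitoring in this spirit, the Python example below compares actual values against targets. The KPI names, targets, and actual values are invented for illustration and are not drawn from any particular BPM product.

```python
from dataclasses import dataclass

@dataclass
class KPI:
    """A KPI pairs a metric with a target so progress toward an objective can be observed."""
    name: str
    target: float
    higher_is_better: bool = True

    def status(self, actual: float) -> str:
        met = actual >= self.target if self.higher_is_better else actual <= self.target
        return "on track" if met else "needs attention"

# Hypothetical KPIs and actual values pulled from line-of-business applications.
kpis = [KPI("on-time delivery rate", 0.95),
        KPI("customer churn rate", 0.05, higher_is_better=False)]
actuals = {"on-time delivery rate": 0.97, "customer churn rate": 0.07}

for kpi in kpis:
    actual = actuals[kpi.name]
    print(f"{kpi.name}: {actual:.2f} vs target {kpi.target:.2f} -> {kpi.status(actual)}")
```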

14. Define Six Sigma? What is the DMAIC performance model? What is the payoff from Six Sigma?

Six Sigma is a disciplined, data-driven approach and methodology for eliminating defects (driving towards six standard deviations between the mean and the nearest specification limit) in any process, product, or service.

DMAIC Performance model:

Define.

Define the goals, objectives, and boundaries of the improvement activity. At the top level, the goals are the strategic objectives of the company. At lower levels (department or project levels), the goals are focused on specific operational processes.

Measure.

Measure the current system. Establish quantitative measures that will yield statistically valid data. The data can be used to monitor progress toward the goals defined in the previous step.

Analyze.

Analyze the system to identify ways to eliminate the gap between the current performance of the system or process and the desired goal.

Improve

Initiate activities to eliminate the gap by finding ways to do things better, cheaper, or faster. Use project management and other planning tools to implement the new approach.

Control

Institutionalize the improved system by modifying compensation and incentive systems, policies, procedures, manufacturing resource planning, budgets, operating instructions, or other management systems.
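As a small worked illustration of the "six standard deviations between the mean and the nearest specification limit" idea used in the Measure and Analyze steps, the Python sketch below computes that distance for a sample of measurements. The specification limits and data are hypothetical.

```python
import statistics

def sigma_distance(samples, lower_spec, upper_spec):
    """Return how many standard deviations separate the process mean from the nearest spec limit."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return min(upper_spec - mean, mean - lower_spec) / stdev

# Hypothetical cycle-time measurements (minutes) with specification limits of 8 and 12 minutes.
samples = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9]
distance = sigma_distance(samples, lower_spec=8.0, upper_spec=12.0)
print(f"Nearest spec limit is {distance:.1f} standard deviations from the mean")
# A Six Sigma process would show a value of at least 6 here.
```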

15. What is Business Intelligence Governance?

Business Intelligence (BI) governance provides a customized framework to help senior managers design and implement good BI governance: guiding principles, decision-making bodies, decision areas, and oversight mechanisms that fit a company's unique needs and culture.

The objectives of BI Governance are:

Clearly defined authority and accountability, roles and responsibilities.

Program planning, prioritization, and funding processes.

Communicating strategic business opportunities to IT.

Transparent decision-making processes for development activities.

Tracking value and reporting results.

16. What is Big Data analytics? What are the sources of Big Data? What are the characteristics of Big Data? What processing techniques are applied to process Big Data?

Big data analytics is the process of examining large data sets containing a variety of data types (i.e., big data) to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information.

Business intelligence, querying, reporting, and searching, including many implementations of searching, filtering, indexing, speeding up aggregation for reporting and report generation, trend analysis, search optimization, and general information retrieval. (Examples include: Alibaba, University of North Carolina Lineberger Comprehensive Cancer Center, University of Freiburg.)

Improved performance for common data management operations, with the majority focusing on log storage, data storage and archiving, followed by sorting, running joins, Extraction/Transformation/Loading (ETL) processing, other types of data conversions, as well as duplicate analysis and elimination. (Examples: AOL, Brilig, Infochimps.)

Non-Database Applications, such as image processing, text processing in preparation for publishing, genome sequencing, protein sequencing and structure prediction, web crawling, and monitoring workflow processes. (Examples: Benipal Technologies, University of Maryland.)

Data mining and analytical applications, including social network analysis, facial recognition, profile matching, other types of text analytics, web mining, machine learning, information extraction, personalization and recommendation analysis, ad optimization, and behavior analysis.

17. What are ROLAP, MOLAP and HOLAP? How do they differ from OLAP?

OLAP (Online Analytical Processing): On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

HOLAP: HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
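To make the ROLAP point about slicing and dicing mapping to SQL concrete, here is a minimal sketch using SQLite. The table, columns, and data are invented for the example; a production ROLAP engine would generate comparable SQL against a star schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("East", "Laptop", 2023, 1200.0), ("East", "Phone", 2023, 800.0),
    ("West", "Laptop", 2023, 950.0),  ("East", "Laptop", 2024, 1400.0),
])

# "Slicing" the cube to one region and one year is just adding WHERE conditions,
# and "dicing" by product becomes a GROUP BY over the remaining dimension.
rows = con.execute("""
    SELECT product, SUM(amount) AS total
    FROM sales
    WHERE region = 'East' AND year = 2023
    GROUP BY product
""").fetchall()
print(rows)   # e.g. [('Laptop', 1200.0), ('Phone', 800.0)]
```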

18. What are the major data mining processes?

Classification: Mining patterns that can classify future data into known classes.

Association rule mining: Mining any rule of the form X → Y, where X and Y are sets of data items.

Clustering: Identifying a set of similarity groups in the data

Sequential pattern mining: A sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.

Deviation detection: discovering the most significant changes in data

Data visualization: Using graphical methods to show patterns in data.

19. Identify at least three of the main data mining methods.

Clustering: Identifying a set of similarity groups in the data.

Sequential pattern mining: A sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.

Deviation detection: Discovering the most significant changes in data.

20. What are some of the methods for cluster analysis?

There are a number of different methods that can be used to carry out a cluster analysis; they can be classified as follows.

Hierarchical methods: Agglomerative methods, in which subjects start in their own separate clusters; the two closest (most similar) clusters are then combined, and this is repeated until all subjects are in one cluster. At the end, the optimum number of clusters is chosen out of all the cluster solutions. Divisive methods, in which all subjects start in the same cluster and the above strategy is applied in reverse until every subject is in a separate cluster. Agglomerative methods are used more often than divisive methods, so the discussion here concentrates on the former.

Non-hierarchical methods (often known as k-means clustering methods).
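To make the k-means and agglomerative approaches concrete, below is a hedged sketch assuming scikit-learn and SciPy are available; the data points are synthetic and chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Tiny synthetic data set: two obvious groups of points in two dimensions.
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Non-hierarchical (k-means): choose k up front and assign each point to the nearest centroid.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Agglomerative (hierarchical): start with each point in its own cluster, repeatedly merge the
# two closest clusters, then cut the resulting tree to obtain two clusters.
tree = linkage(points, method="ward")
hier_labels = fcluster(tree, t=2, criterion="maxclust")

print("k-means labels:      ", kmeans_labels)
print("agglomerative labels:", hier_labels)
```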

Sequential pattern mining: A sequential rule: A( B, says that event A will be immediately followed by event B with a certain confidence

Deviation detection: discovering the most significant changes in data20. What are some of the methods for cluster analysis? There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows: Hierarchical methods Agglomerative methods, in which subjects start in their own separate cluster. The two closest (most similar) clusters are then combined and this is done repeatedly until all subjects are in one cluster. At the end, the optimum number of clusters is then chosen out of all cluster solutions. Divisive methods, in which all subjects start in the same cluster and the above strategy is applied in reverse until every subject is in a separate cluster. Agglomerative methods are used more often than divisive methods, so this handout will concentrate on the former rather than the latter. Non-hierarchical methods (often known as k-means clustering methods)References :https://mycc.cambridgecollege.edu/.../Data_Warehous.http://www.grantthornton.com/~/media/content-page-files/advisory/pdfs/2014/BAS-prescriptive-analytics.ashxhttp://www.tdan.com/view-articles/4681http://searchdatamanagement.techtarget.com/definition/data-analyticshttp://www.rosebt.com/blog/predictive-descriptive-prescriptive-analyticshttp://www.analytics-magazine.org/november-december-2010/54-the-analytics-journeyhttp://bigdataguru.blogspot.com/2012/09/difference-between-descriptive.htmlhttp://searchdatamanagement.techtarget.com/definition/predictive-modelinghttp://searchbusinessanalytics.techtarget.com/definition/big-data-analyticshttp://www.zdnet.com/article/top-10-categories-for-big-data-sources-and-mining- technologies/http://www.slideshare.net/venturehire/what-is-big-data-and-its-characteristicshttp://www.developer.com/db/understanding-big-data-processing-and-analytics.html http://www.csun.edu/~twang/595DM/Slides/Week6.pdfhttp://www.processmining.org/_media/blogs/pub2013/1-wvda-process-cubes-apbpm2013.pdfhttps://apandre.wordpress.com/data/datacube/http://social.technet.microsoft.com/wiki/contents/articles/19898.differences-between-olap-rolap-molap-and-holap.aspx https://www.ibm.com/developerworks/community/blogs/bigdataanalytics/entry/critical_success_factors_for_big_data_in_business_part_4?lang=en http://www.nedsi.org/proc/2013/proc/p121023001.pdf http://hadoopilluminated.com/hadoop_illuminated/MapReduce_Intro.html http://www.statstutor.ac.uk/resources/uploaded/clusteranalysis.pdf