An Approachable Analytical Study on Big Educational Data Mining

Saeed Aghabozorgi 1, Hamidreza Mahroeian 2, Ashish Dutt 1, Teh Ying Wah 1, and Tutut Herawan 1,3

1 Department of Information System, University of Malaya, 50603 Pantai Valley, Kuala Lumpur, Malaysia

2 University of Otago, New Zealand

3 AMCS Research Center, Yogyakarta, Indonesia

{saeed,teh,tutut}@um.edu.my, [email protected], [email protected]

Abstract. The growth of data in education continues. More institutes now store terabytes and even petabytes of educational data. Data complexity in education is increasing as people store both structured data in relational format and unstructured data such as Word or PDF files, images, videos and geo-spatial data. Indeed, learning developers, universities, and other educational sectors confirm that a tremendous amount of the data captured is in unstructured or semi-structured format. Educators, students, instructors, tutors, research developers and others who deal with educational data are also challenged by the velocity of different data types. Organizations and institutes that process streaming data, such as click streams from web sites, need to update data in real time to serve the right advert or present the right offers to their customers. This analytical study addresses the challenges and analysis of big educational data, that is, uncovering or extracting knowledge from large data sets by using different educational data mining approaches and techniques.

Keywords: Big Data; Educational Data; Educational Data Mining; Data Mining; Analytical Study.

1 Introduction

Big data can be regarded as the study of voluminous amounts of data, whether in physical or digital format, stored in diverse repositories ranging from the tangible bookkeeping records of an educational institution to class test or examination records to alumni records [1]. These records continue to grow in size and variety. As the old adage goes, we learn from our mistakes; in a similar fashion, businesses today are operated on the basis of decisions drawn from the data they collect. Predictions, associations, clustering and many other commonly occurring business decisions are made each day by corporations to enhance productivity and mutual growth [2]. These significant decisions depend exclusively on the data collected during business operations and on human judgment.

This concept of big data has now been applied to various sectors such as government, business and hospital management, to name a few, but little research has been done on its application in the educational sector. This is the gap we aim to address through this research work. Tomes have been written on the efficacy of big data and on the technologies that can be used to harness it, yet research on the application of big data in the educational sector remains very limited. Utilizing data for making decisions is not a new concept; corporations apply complicated calculations to data generated by their customers for business intelligence or analytics. Various techniques used in business intelligence can distinguish historical trends and customer patterns in data and can generate models that predict future patterns and trends [3]. These techniques consist of proven methodologies from computer science, mathematics and statistics used for deriving non-redundant information from large-scale datasets (big data) [4].

One clear example of exploiting data to discover online behavior is Web analytics, whose methods log and report visits to a Web page, a specific region or a particular domain, together with the links that were clicked through. Web analytics are applied to understand how people use the Web, but corporations have adopted more complicated approaches and tools to track more sophisticated user interactions with their websites [5, 6]. Examples of web analytics include analyzing the purchasing habits of consumers and applying recommendation algorithms in the search engines of commercial websites so that they can recommend the product a consumer is most likely to want; notable examples are Netflix and Amazon. The same concept is now being applied to various e-learning systems. For example, Edmodo is a free LMS that is able to suggest similar books or resources based on the learner’s web activity on the LMS [7].

New approaches and methods are considered imperative for the extraction and analysis tasks mentioned above, so that they integrate seamlessly with the unstructured data that these information systems generate. Big data is voluminous, and it would be futile to bind it within a specific numerical boundary. One way to define it is by its net usability worth. According to Manyika et al. [10], a data set whose size exceeds the processing limit of software can be categorized as big data. Several past studies have provided detailed insights into the application of traditional data mining algorithms such as clustering, prediction and association to tame the sheer volume of big data. Recent advances in machine learning have provided unique approaches to knowledge discovery in datasets. These algorithms have been successful in finding correlations in unstructured data, and one of their applications has been predictive modeling. Such models can be treated as virtual prototypes of a real working model. When fed with real datasets, they can help identify failures that can then be promptly addressed, thus reducing the operational costs of both human and machine labor.

Two fields that are significant to the exploitation of big data in education are educational data mining and learning analytics. Although there is no hard and fast distinction between these two areas, they have had somewhat different research histories and are developing as discrete research areas. In general, educational data mining tries to uncover new patterns in captured data, building new algorithms or new models, whilst learning analytics applies known predictive models in educational systems [1, 4]. As can be seen in Figure 1, educational data such as log files, user (learner) interaction data, and social network data are expected to grow in the near future.

This research study addresses the challenges and analysis of big educational data, that is, uncovering or extracting knowledge from large data sets by using different educational data mining approaches and techniques. It is arranged as follows. Section 2 describes the background of the study, including the importance of education and educational data, the nature of big data, and the basics of data mining or knowledge discovery techniques. Section 3 discusses educational data mining, big educational data and big data mining from the perspective of big educational data mining. Section 4 details the major challenges of big educational data mining, and finally the related discussion and conclusion are outlined.

Fig. 1. Growth of different Educational Data

2 Rudimentary

2.1 Education

Learning providers, institutes, universities, schools and colleges have always had the ability to generate huge amounts of educational data [8]. Even a small kindergarten that caters only to a play group of children aged between 4 and 6 years can produce enormous quantities of data, ranging from academics to peer activities, classroom activities and so forth. After the emergence of the buzzword “big data” in different industrial sectors, researchers and industry workers are turning toward the areas that could presumably be affected by this surge [9]. Recent advances in technology have made it possible to explore previously unknown information that lay buried deep within heaps of data sets [10]. However, the most basic question that needs to be answered first is: “Is there really any big data in education?” Or are we simply looking at an impasse?

2.2 Big Data

There are a number of similar definitions of big data. Perhaps the most well-known version comes from IBM, which proposed that big data can be characterized by any or all of three "V" words: volume, variety, and velocity [1, 9, 11].

Volume refers to the larger quantities of data being produced from a wide range of sources. For instance, big data can comprise data captured from the Internet of Things (IoT). As initially pictured, the IoT is associated with the data collected from a range of different devices and sensors networked together over the Internet [12]. Big data can also refer to the explosion of information accessible on common social media such as Facebook and Twitter [13].

Variety refers to the use of numerous types of data to investigate a situation or event. On the IoT, millions of devices generating a steady flow of data result in not only a large volume of data but also different kinds of data describing different situations. Furthermore, people on the Internet produce a highly varied set of structured, semi-structured and unstructured data [9].

Velocity refers to the rapid increase in data over time, for both structured and unstructured data, and to the need for more frequent decision making about that data [1]. As the world becomes more global and developed, and as the IoT grows, data are captured and decisions about them are made with growing frequency. Additionally, the velocity of social media use is on an obvious upward trend; a clear example is the 250 million tweets posted per day. As decisions are made using big data, those decisions can eventually have a substantial impact on the next data that are captured and analyzed, adding another dimension to the velocity of big data [1, 10].

2.3 Data Mining

Data mining, or knowledge discovery in databases (KDD), is the automatic extraction of implicit and interesting patterns from vast amounts of data [14]. Data mining is recognized as a multidisciplinary field in which a number of computing paradigms converge, such as decision tree construction, rule induction, artificial neural networks, instance-based learning, Bayesian learning and logic programming. In addition, some of the functional data mining techniques and methods are statistics, visualization, clustering, classification and association rule mining [15,16]. In education, these techniques discover new, implicit and practical knowledge based on students’ usage data.
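As an illustration of one of the techniques named above, the following minimal sketch computes support and confidence, the two standard measures in association rule mining, over a small set of student usage records. The records, the resource names and the thresholds are hypothetical and chosen only for illustration; they do not come from any data set discussed in this study.

```python
from itertools import combinations

# Hypothetical usage data: the set of resources each student accessed in one week.
sessions = [
    {"forum", "quiz", "video"},
    {"quiz", "video"},
    {"forum", "video"},
    {"quiz", "video", "reading"},
    {"forum", "quiz", "video"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Enumerate single-item rules lhs -> rhs that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue  # the pair is too rare to be interesting
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                rules.append((lhs, rhs, round(s, 2), round(c, 2)))
    return rules

for lhs, rhs, s, c in pair_rules(sessions):
    print(f"{lhs} -> {rhs}  support={s}  confidence={c}")
```

Only rules between single items are enumerated here; a full algorithm such as Apriori extends the same two measures to larger item sets.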

Data mining has been broadly applied in different kinds of educational systems. On one hand, there are traditional classroom environments such as special education [17] and higher education [18]. On the other hand, there is computer-based and web-based education, such as the well-known learning management systems (LMSs) [19], web-based adaptive hypermedia systems [20] and intelligent tutoring systems (ITSs) [21]. The major difference between the two is the data accessible in each system. Traditional classrooms only have information about student attendance, the basic course syllabus, course objectives and learners' plan data. Web-based and computer-based education, however, has much more information readily available, because these systems can record all the data pertaining to specific students’ actions and interactions in log files and databases [22].
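To make the idea of tracking student actions in log files concrete, the sketch below aggregates log lines into per-student action counts. The log format (timestamp, student identifier, action, resource, separated by commas) and all values are assumptions made for illustration; real LMS logs differ from system to system.

```python
from collections import Counter, defaultdict

# Hypothetical log format: "timestamp,student_id,action,resource"
LOG_LINES = [
    "2014-03-01T09:00:12,s001,view,lecture1.pdf",
    "2014-03-01T09:05:40,s001,submit,quiz1",
    "2014-03-01T09:07:02,s002,view,lecture1.pdf",
    "2014-03-01T10:15:33,s001,view,lecture2.pdf",
    "malformed line",
]

def summarize(lines):
    """Count actions per student, skipping lines that do not match the format."""
    per_student = defaultdict(Counter)
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 4:
            continue  # ignore lines that do not have exactly four fields
        _timestamp, student, action, _resource = parts
        per_student[student][action] += 1
    return {student: dict(counts) for student, counts in per_student.items()}

print(summarize(LOG_LINES))
```

Summaries of this kind are a typical first step before any of the mining techniques listed earlier are applied.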

Fig. 2. Applying data mining to the design of educational systems

In order to improve learning effectiveness, the application of data mining approaches and techniques to educational systems can be viewed as a formative evaluation technique, that is, the evaluation of an educational program while it is still in its development phase, with the purpose of continually enhancing the program. Auditing the way students use the educational system is perhaps one common way to assess instructional design in this manner. It helps learning developers to improve instructional materials, and it draws on different data types such as log files, performance and transaction data [23]. Data mining techniques should be applied to collect information that can assist instructional designers and developers in building an educational foundation for judgments when designing or improving an environment’s instructive approach. The application of data mining to the design of educational systems is an iterative cycle of hypothesis formation, testing, and refinement (see Figure 2). Extracted knowledge should go through the loop towards guiding, facilitating, and enhancing learning as a whole. In this process, the aim is not just to turn data into knowledge, but also to filter mined knowledge for decision making [16,24].

As represented in Figure 2, educators and educational designers (whether in school districts, curriculum companies, or universities) design, plan, create, and maintain educational systems. Students use those educational systems to learn. Building on the available information about courses, students, usage, and interaction, data mining techniques can be applied to discover useful knowledge that helps to improve educational designs. The discovered knowledge can be used not only by educational designers and teachers, but also by the end users, the students. Hence, the application of data mining in educational systems can be oriented to supporting the specific needs of each of these categories of stakeholders [23].

[Figure 2 labels: Users (instructors, learners, students, course administrators, academic researchers, educators); Educational system (traditional classrooms, e-learning systems, LMSs, intelligent tutoring systems, web-based adaptive systems); Educational data (user transactions, log files, user performance, identity-type data); Data mining techniques (visualization, clustering, classification, statistics, association rule mining, sequence mining); Knowledge components.]

3 Big Educational Data Mining

3.1 Educational Data Mining

Educational Data Mining (EDM) is a field that applies statistical, machine-learning, and data-mining (DM) algorithms to different types of educational data. Its major objective is to analyze these types of data in order to resolve educational research issues [25,26]. EDM is concerned with developing methods to explore the relationships between the unique types of data produced in educational settings and, using these methods, with better understanding students and the settings in which they learn. The increase in both instrumented educational software and state databases of student information has created large repositories of data reflecting how students learn [26]. In addition, the use of the Internet in education has created a new context known as e-learning or web-based education, in which large amounts of data about teaching–learning interaction are endlessly generated and ubiquitously available [23]. All this information provides a gold mine of educational data. EDM seeks to tap these untouched data repositories to better understand learners and learning abilities, and to develop computational approaches that combine data and theory to transform practice to benefit learners. EDM has emerged as a prolific research area in recent years for researchers all over the world from different and related research areas [7].

Educational data mining can be extremely helpful in drawing inferences and making predictions about students' behavior, attitude and concentration on their educational goals. The results obtained by applying traditional data mining algorithms in an educational context can help enhance the educational system, as all stakeholders can examine the trends found once analytic reasoning is applied to data on student-related parameters [25]. Regression techniques are commonly used to analyze such data. When the data are reduced to summary statistics for analysis, the results can usually be plotted on graphs, and trends appear as lines or as clusters of data points that indicate a concentration of some student behavior related to learning, researching or a similar activity [25]. EDM involves various groups of users such as learning developers, instructors, educators and researchers. Different groups consider educational data from different angles, based on their mission, vision, and major purpose for using data mining, as depicted in Table 1.

Table 1. EDM Users/Stakeholders

User/Actors | Objectives for using data mining

Learners/Students/Pupils | To personalize e-learning; to recommend activities, resources and learning tasks that could further improve learners' learning; to suggest interesting learning experiences to the students [27]

Educators/Instructors/Teachers/Tutors | To get objective feedback about instruction; to analyze students’ learning and behavior; to detect which students need support; to predict student performance; to classify learners into groups [28]

Course Developers/Educational Researchers | To evaluate and maintain courseware; to improve student learning; to evaluate the structure of course content and its effectiveness in the learning process [29]

Organizations/Learning Providers/Universities/Private Training Companies | To enhance the decision processes in higher learning institutions; to streamline the efficiency of the decision-making process; to achieve specific objectives [30]

Administrators/School District Administrators/Network Administrators/System Administrators | To develop the best way to organize institutional resources and their educational offer; to utilize available resources more effectively; to enhance educational program offers and determine the effectiveness of the distance learning approach [31]
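The regression-based trend analysis mentioned in this section, and the objective of predicting student performance, can be illustrated with a minimal sketch of ordinary least squares for a single predictor. The study hours and scores below are invented for illustration, and the function name fit_line is an assumption of this sketch rather than part of any tool cited here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: weekly study hours and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 67, 72]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # extrapolated score for six study hours
print(f"slope={slope:.2f} intercept={intercept:.2f} predicted={predicted:.2f}")
```

The fitted line is the "trend" that would be plotted through the data points; with real educational data, more predictors and a check of the model's assumptions would be needed.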

Today, a wide variety of educational data sets can be downloaded for free from the Internet. Some widely acclaimed and used repositories are: PSLC DataShop (the world’s largest repository of learning interaction data); Data.gov (the United States Government's official website for educational data sets); NCES data sets (NCES is the primary federal entity for collecting and analyzing data related to education in the United States) [26,32]; the Barro-Lee data set (provided by the researchers Barro and Lee, whose contribution has been discussed in section 1); the UNISTATS dataset (a website that provides comparable sets of information about full-time or part-time undergraduate courses and is designed to meet the information needs of prospective students); SABINS (the School Attendance Boundary Information System, which provides, free of charge, aggregate census data and GIS-compatible boundary files for school attendance areas, or school catchment areas, for selected areas in the United States for the 2009-10, 2010-11 and 2011-12 school years); UIS (a UNESCO initiative); EdStats (a World Bank initiative); the Education Human Development Network (a World Bank initiative); the IPEDS Data Center (the primary source for data on colleges, universities, and technical and vocational postsecondary institutions in the United States); and TLRP [33].

3.1.1 Analysis of current tools being used for educational data sets

At present, statistical tools are predominantly used to quantify and assess educational data sets. Prominent ones are RapidMiner, SAS, IBM SPSS and KEEL [34] (a knowledge extraction tool based on evolutionary learning). Programming languages like R are mostly used for statistical analysis and play a pivotal role in programming custom tests that may not be available in commercial software packages. There are also some online, web-based data exploration tools, typically Java-based, that give the user the freedom to choose from varied dataset types and see a graphical representation of them. One of these is the Education Data Explorer provided by the Oregon Department of Education, United States. Another is the Educational Data Analysis Tool (EDAT), which allows users to download NCES survey datasets to their computer. EDAT guides the user through selecting a survey, population, and variables relevant to the analysis [30].

3.1.2 Educational data set problem and possible solutions

Does the problem really exist, or are we chasing a chimera? The “Education for All” global monitoring report, prepared by the United Nations, is the prime instrument for assessing global progress towards its goals. There is a flurry of activity around big data and how it is touching and transforming every aspect of our lives. Analysis of these large-scale datasets can help improve the robustness and generalizability of educational research. The problem with most large-scale secondary data sets used in higher education research is that they are constructed using complex sample designs that often cluster lower-level units (students) within higher-level units (colleges) to achieve efficiencies in the sampling process [35]. As shown in Figure 3, the term “Big Educational Data Mining” (BEDM) can be proposed for the extraction of useful knowledge from vast quantities of different large educational data sets.
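The practical consequence of clustered sample designs can be sketched with the standard design-effect approximation for equal-sized clusters, deff = 1 + (m - 1) * rho, where m is the cluster size and rho is the intraclass correlation. This formula is a general survey-sampling result and is not taken from the studies cited here; the sample size, cluster size and correlation below are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """Sample size of a simple random sample with roughly the same precision."""
    return n / design_effect(cluster_size, icc)

# Hypothetical survey: 2,000 students sampled as 40 colleges of 50 students each,
# with an assumed intraclass correlation of 0.1.
deff = design_effect(50, 0.1)
n_eff = effective_sample_size(2000, 50, 0.1)
print(f"design effect={deff:.1f}, effective sample size={n_eff:.0f}")
```

Under these assumed values the clustered sample carries far less information than its nominal size suggests, which is why analyses that ignore the clustering tend to understate standard errors.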

Fig. 3. Big Educational Data Mining (BEDM), Extracting new knowledge from Big Data Sets

3.2 Big Educational Data

Education has always had the capacity to produce a tremendous amount of data compared to other industries. First, academic study requires many hours of schoolwork and homework over a number of years. These extended interactions with materials produce a huge quantity of data. Second, education content is tailor-made for big data, generating cascading effects of insights thanks to the high correlation between concepts [31]. Recent advances in technology and data science have made it possible to unlock and explore these large data sets [15]. The benefits range from more effective self-paced learning to tools that enable instructors to pinpoint interventions, create productive peer groups, and free up class time for creativity and problem solving. For instance, as represented in Table 2, educational data can be grouped into five categories: one pertaining to student identity and onboarding, and four student activity-based data sets that have the potential to improve learning outcomes. They are listed below in order of how difficult they are to attain:

Table 2. Educational Data Type classes

No. | Data Type | Description

1 | Identity Data | Personal information, authority, domain rights, geographical information

2 | User Interaction Data | Engagement metrics, click rate, page views, bounce rate, etc.

3 | Inferred Content Data | How well does a piece of content perform across a group, or for any one subgroup, of students? What measurable student proficiency gains result when a certain type of student interacts with a certain piece of content?

4 | System-Wide Data | Rosters, grades, disciplinary records, and attendance information are all examples of system-wide data.

5 | Inferred Student Data | Exactly what concepts does a student know, at exactly what percentile of proficiency? What is the probability that a student will pass next week’s quiz, and what can she do right this moment to increase it?

Two areas that are specific to the use of big data in education are educational data mining and learning analytics. Although there is no hard and fast distinction between these two fields, they have had somewhat different research histories and are developing as distinct research areas. Generally, educational data mining looks for new patterns in data and develops new algorithms and/or new models, while learning analytics applies known predictive models in instructional systems. Practical examples of big data in an educational context include the following:

A clear example is an education initiative in the UK. Analysts estimate that £16 billion is wasted in productivity due to under-educated citizens. In response, the UK government gathered data on the outcomes of kindergarten-to-grade-12 education (elementary and high school) as well as higher education (university). The data pertained to students' school performance and “success” afterwards as measured by employment [36].

The government increasingly contributes to the open data movement; it is okay

with releasing “dirty data,” which is raw and not cleansed. Open data enables

individuals and entrepreneurs to use public data to innovate. Data visualization

tools enable parents to understand schools’ outcomes, so they can select appropriate schools for their children.

Universities can use data in exciting ways; they analyze students’ social media

sharing, patterns in checking out library materials, what courses they take (and

outcomes they achieve). This data helps them steer students to courses that are

aligned with their goals. It helps with student retention.

Big data enables interesting insights and correlations such as students that have

high library fines tend to perform worse on tests. Universities also correlate

performance data with socioeconomic and email data, so they can learn what

student characteristics predict the best performance at their schools, and they use

this to guide their recruitment. They are also starting to be able to predict which

students will drop out before graduating, which helps them give additional support [9].

Cost drivers [of education] are key to big data adoption in the UK, which has developed the most comprehensive database of pupils (schoolchildren) in the world. It traces the performance of 600,000 pupils from 3,000 elementary schools through to their careers. It has ten years of data on pupils’ exams, tests, socioeconomic status, geography, transport, free meals, behavior issues and more. It is a rich dataset from which the government can learn and improve schools, and it can answer political questions. The government is also combining its data with health, crime and welfare datasets. It studies what students’ lives are like outside school, to try to develop a fuller picture of the factors that affect performance. This can help challenge conventional thinking and guide policy.

This initiative is teaching us many things. Socioeconomic status is not as

important as we thought; school performance and responsiveness is very

important. Schools can use data to change. For example, science, technology,

engineering and math courses are far more important than we thought, even when

students don’t intend to pursue STEM careers.

Privacy is an issue with these databases, but the government believes that the

advantages outweigh the pupils' compromised privacy [37] .

Another traditional belief is that poor pupils do poorly and that schools need more money to increase performance. The data are showing that how the money

is invested is more important than how much money is in the school’s budget.

We are starting to be able to measure return on outcomes.

The UK example is more complex, but it effectively illustrates how internal and

external data can be mashed up to address complex problems such as school

performance. It's an excellent example of big data.

3.3 Big Data Mining

In typical data mining systems, the mining procedures require computational intensive

computing units for data analysis and comparisons. A computing platform is,

therefore, needed to have efficient access to, at least, two types of resources: data and

computing processors. For small scale data mining tasks, a single desktop computer, which contains hard disk and CPU processors, is sufficient to fulfill the data mining

goals. Indeed, many data mining algorithm are designed for this type of problem

settings. For medium scale data mining tasks, data are typically large (and possibly

distributed) and cannot be fit into the main memory. Common solutions are to rely on

parallel computing [43], [33] or collective mining [12] to sample and aggregate data

from different sources and then use parallel computing programming (such as the

Message Passing Interface) to carry out the mining process. For Big Data mining,

because data scale is far beyond the capacity that a single personal computer (PC) can

handle, a typical Big Data processing framework will rely on cluster computers with a

high-performance computing platform, with a data mining task being deployed by

running some parallel programming tools, such as MapReduce or Enterprise Control

Language (ECL), on a large number of computing nodes (i.e., clusters). The role of the software component is to make sure that a single data mining task, such as finding

the best match of a query from a database with billions of records, is split into many

small tasks each of which is running on one or multiple computing nodes. For

example, as of this writing, the world's most powerful supercomputer, Titan, which is

deployed at Oak Ridge National Laboratory in Tennessee, contains 18,688 nodes each

with a 16-core CPU. Such a Big Data system, which blends both hardware and

software components, is hardly available without key industrial stakeholders’ support.

In fact, for decades, companies have been making business decisions based on

transactional data stored in relational databases [10]. Big Data mining offers

opportunities to go beyond traditional relational databases to rely on less structured

data: weblogs, social media, e-mail, sensors, and photographs that can be mined for useful information [1]. Major business intelligence companies, such as IBM, Oracle,

Teradata, and so on, have all featured their own products to help customers acquire

and organize these diverse data sources and coordinate with customers’ existing data

to find new insights and capitalize on hidden relationships.
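
The splitting of a single task into many small tasks, as described above for the best-match query, can be illustrated with a minimal sketch. This is not MapReduce or ECL itself: it assumes an in-memory list of text records, a hypothetical word-overlap score (similarity), and threads standing in for computing nodes. Each partition reports its local winner (the map step), and the winners are then combined into a global answer (the reduce step).

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split a list of records into at most n_parts roughly equal chunks."""
    size = max(1, -(-len(records) // n_parts))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def similarity(query, record):
    """Hypothetical match score: number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(record.lower().split()))


def local_best(query, chunk):
    """Map step: best (score, record) pair within one partition."""
    return max((similarity(query, r), r) for r in chunk)


def best_match(query, records, n_parts=4):
    """Reduce step: combine the per-partition winners into a global winner.

    Assumes records is non-empty; threads stand in for cluster nodes.
    """
    chunks = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        winners = list(pool.map(lambda c: local_best(query, c), chunks))
    return max(winners)[1]
```

On a real cluster the partitions would already reside on different nodes, so only the small per-partition winners, not the raw records, would travel over the network.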

4 Major Challenges in Big Educational Data Mining

4.1 Is education data big enough to call it big data?

Startups like Knewton [38] and Desire2Learn [10] have been founded on the concept of Big Data. Similar startups appeared during the e-commerce boom of the nineties, and history records that a number of them have since perished. However, startups founded on big data in an educational context should not face a similar fate, because their foundation rests on didactic unstructured data that is already present in information systems. Perhaps one reason some of the ill-fated e-commerce startups failed was that their business model did not rest on a constant flow of data from which information could be mined. That is not the case here. Because the data are already available, what is needed is specialized algorithms designed to work with educational datasets. Companies such as Yahoo, Google, Dell, and HP, to name a few, have ventured into open-source development of big data software such as Apache Hadoop, and facilitate collective learning through contests such as hack days or hackathons [25,9]. We also need to recognize that a gap lies between the application of big data in commerce and its application in the education sector. While the former has seen various advances, the latter still depends on traditional data mining algorithms. The problem with such algorithms is that they may not fit the dataset, which can cause a loss of valuable predictions that could otherwise have been obtained with algorithms suited to educational data. Educational experts have identified various deployment and implementation barriers to harnessing the power of big data in education and learning analytics, most importantly technical gaps, institutional velocity, and the legal and often ethical issues that arise when general data mining algorithms are applied. For big data to be meaningful, it will require the seamless integration of specifically tailored algorithms that can harness the power of this raging beast and tame it into knowledge useful to both the learner and the educator [39].

4.2 BDM: Challenges in applying DM approaches on Big Data (from the educational perspective)

A conceptual view of the Big Data processing framework is depicted in

Fig. 4, which includes three tiers from inside out with considerations on data

accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and

Big Data mining algorithms (Tier III). The challenges at Tier I focus on data

accessing and actual computing procedures.

Fig. 4. A conceptual view of the Big Data processing framework [4]

Because Big Data are often stored at different locations and data volumes may

continuously grow, an effective computing platform will have to take distributed

large-scale data storage into consideration for computing [4,11]. For example, while

typical data mining algorithms require all data to be loaded into the main memory,

this is becoming a clear technical barrier for Big Data because moving data across

different locations is expensive (e.g., subject to intensive network communication and

other IO costs), even if we do have a super large main memory to hold all data for

computing. The challenges at Tier II center on semantics and domain knowledge for

different Big Data applications [40]. Such information can provide additional benefits

to the mining process, as well as add technical barriers to the Big Data access (Tier I)

and mining algorithms (Tier III). For example, depending on different domain applications, the data privacy and information sharing mechanisms between data

producers and data consumers can be significantly different. Sharing sensor network

data for applications like water quality monitoring may not be discouraged, whereas

releasing and sharing mobile users’ location information is clearly not acceptable for

the majority of, if not all, applications [41]. In addition to the above privacy issues, the

application domains can also provide additional information to benefit or guide Big

Data mining algorithm designs. For example, in market basket transactions data, each

transaction is considered independent and the discovered knowledge is typically

represented by finding highly correlated items, possibly with respect to different

temporal and/or spatial restrictions. In a social network, on the other hand, users are

linked and share dependency structures. The knowledge is then represented by user communities, leaders in each group, social influence models, and so on. Therefore,

understanding semantics and application knowledge is important for both low-level

data access and for high level mining algorithm designs [16]. At Tier III, the data

mining challenges concentrate on algorithm designs in tackling the difficulties raised

by the Big Data volumes, distributed data distributions, and by complex and dynamic

data characteristics. The circle at Tier III contains three stages [4].

First, sparse, heterogeneous, uncertain, incomplete, and multi-source data are preprocessed by data fusion techniques.

Second, complex and dynamic data are mined after pre-processing.

Third, the global knowledge that is obtained by local learning and model fusion is tested and relevant information is fed back to the pre-processing stage.

Then the model and parameters are adjusted according to the feedback. In the whole

process, information sharing is not only a promise of smooth development of each stage, but also a purpose of Big Data processing [30].
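
One simple reading of local learning followed by model fusion can be sketched as follows. It is an illustration under stated assumptions, not the framework of [4]: each site (for example, a campus) is assumed to hold a list of numeric scores, the "local model" is just a set of sufficient statistics, and the names LocalSummary, summarize, and fuse are hypothetical. The point is that a global mean and variance can be obtained without moving the raw data across locations, which addresses the Tier I cost of data transfer.

```python
from dataclasses import dataclass


@dataclass
class LocalSummary:
    """Sufficient statistics computed where the data reside."""
    n: int
    total: float
    total_sq: float


def summarize(scores):
    """Local learning step: reduce one site's raw scores to three numbers."""
    return LocalSummary(len(scores), sum(scores), sum(s * s for s in scores))


def fuse(summaries):
    """Model fusion step: merge local summaries into a global mean and
    population variance. Assumes at least one non-empty site."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance
```

Only three numbers per site are exchanged, regardless of how many records each site stores. A feedback stage, as in the third step above, could compare the fused result against held-out data and adjust the pre-processing accordingly.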

4.3 EDM: Challenges in applying DM approaches on education data

Recent advances in information technology have seen the proliferation of software that can build a fully functional website, complete with a backend database system, in less than an hour. This has led to rapid growth in e-learning systems, most of them based on cloud technology. Most of these have incorporated recommendation features like those used by their business-oriented counterparts, and both generate voluminous amounts of data. Online learning systems have offered educators, developers, and researchers opportunities to create personalized learning systems, but note that these personalizations still use traditional data mining algorithms [7]. So what is the problem? One problem is that most e-learning systems fail to account, from an educational point of view, for the fact that they are used by learners who each have an individual learning style. When a learner interacts with an LMS, he or she leaves behind a trail of breadcrumbs in the form of log text files, for example records of interactions within the LMS forum with other students or with the course facilitator [33]. It follows that, to mine these data, it is imperative to identify the correct dataset to use so that logical conclusions can be derived from it. Until now, there have been few instances where data mining methods [44-48] have been introduced within e-learning systems to facilitate learner progress. A second problem, from a developer’s point of view, is how to classify a learner’s individual learning style so as to provide a truly personalized learning environment. A further challenge, as noted in previous sections, is how to develop specific data mining algorithms [49-52] that cater to the learning analytics domain. What matters most at this point is to find methodologies that can help clean educational datasets so that they can be processed further [23, 42].
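
As a minimal sketch of what cleaning such a dataset might involve, the following assumes a hypothetical comma-separated LMS log export with four fields per line (timestamp, learner identifier, action, target). Real LMS logs differ in layout, so the field order and the names clean_log and interactions_per_learner are assumptions for illustration only. The sketch drops malformed and duplicate lines and then counts each learner's actions, which could serve as simple input features for later mining.

```python
import csv
from collections import Counter, defaultdict
from io import StringIO

EXPECTED_FIELDS = 4  # timestamp, learner_id, action, target (assumed layout)


def clean_log(raw_text):
    """Return de-duplicated, well-formed rows from raw log text."""
    seen = set()
    rows = []
    for fields in csv.reader(StringIO(raw_text)):
        fields = [f.strip() for f in fields]
        if len(fields) != EXPECTED_FIELDS or not all(fields):
            continue  # skip malformed or incomplete lines
        key = tuple(fields)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        rows.append(key)
    return rows


def interactions_per_learner(rows):
    """Count each learner's actions, e.g. forum posts versus replies."""
    counts = defaultdict(Counter)
    for _timestamp, learner, action, _target in rows:
        counts[learner][action] += 1
    return counts
```

Which records count as "malformed" is itself a domain decision. Here a line is rejected only if a field is missing or empty, whereas a production pipeline would likely also validate timestamps and learner identifiers.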

5 Discussion and Conclusion

This analytical study has discussed the background to the importance of education, the growth of educational data into big data, and the big data mining tools and techniques used to mine these vast amounts of data. Moreover, the challenges involved in big educational data mining and in extracting big educational data have been addressed from different educational data mining perspectives. Working with big data using data mining and analytics is rapidly

becoming common in the commercial sector. Tools and techniques once confined to

research laboratories are being adopted by forward-looking industries, most notably

those serving end users through online systems [4,43]. Higher education institutions

are applying learning analytics to improve the services they provide and to improve

visible and measurable targets such as grades and retention. K–12 schools and school

districts are starting to adopt such institution-level analyses for detecting areas for

improvement, setting policies, and measuring results. Now, with advances in adaptive

learning systems, possibilities exist to harness the power of feedback loops at the level

of individual teachers and students [40]. Measuring and making visible students’ learning and assessment activities opens up the possibility for learners to develop

skills in monitoring their own learning and to see directly how their effort builds onto

their success. Teachers gain views into students’ performance that help them adapt

their teaching or initiate interventions in the form of tutoring, tailored assignments,

and the like. Personalized adaptive learning systems enable educators to quickly see

the effectiveness of their adaptations and interventions, providing feedback for

continuous improvement. The practical application of open source data mining tools in an educational setting can help both researchers and developers to compare distinct prototypes that share the same design functionality. The results thus obtained could then be integrated into the existing in-house educational frameworks used by institutions, so as to keep pace with the rapid adoption of blended learning environments. Open source tools for adaptive learning systems, commercial offerings, and increased understanding of what data reveal are leading to fundamental shifts in

teaching and learning systems. As content moves online and mobile devices for

interacting with content enable teaching to be always on, educational data mining and

learning analytics will enable learning to be always assessed. Educators at all levels

will benefit from understanding the possibilities of the developments described in the

use of big data herein. Besides the challenges of this new field, introduced here as big educational data mining and concerned with identified big educational data, further studies should consider the importance of analyzing big educational data captured and extracted from large-scale data sets using multiple big data and data mining approaches.

Acknowledgments. This work is supported by University of Malaya High Impact

Research Grant no vote UM.C/625/HIR/MOHE/SC/13/2 from Ministry of Education

Malaysia.

References

1. S. Sagiroglu and D. Sinanc, "Big data: A review," in Collaboration Technologies and Systems

(CTS), 2013 International Conference on, 2013, pp. 42-47.

2. A. Peña-Ayala, "Educational data mining: A survey and a data mining-based analysis of recent

works," Expert systems with applications, 2013.

3. G. Siemens and P. Long, "Penetrating the fog: Analytics in learning and education," Educause

Review, vol. 46, pp. 30-32, 2011.

4. X. Wu, X. Zhu, G. Wu, and W. Ding, "Data mining with big data," 2012.

5. C. Bizer, P. Boncz, M. L. Brodie, and O. Erling, "The meaningful use of big data: four

perspectives--four challenges," ACM SIGMOD Record, vol. 40, pp. 56-60, 2012.

6. A. Abraham, "Business intelligence from web usage mining," Journal of Information &

Knowledge Management, vol. 2, pp. 375-390, 2003.

7. C. Romero and S. Ventura, "Educational data mining: A survey from 1995 to 2005," Expert

Systems with Applications, vol. 33, pp. 135-146, 2007.

8. J. Gobert, M. Sao Pedro, R. Baker, E. Toto, and O. Montalvo, "Leveraging educational data

mining for real time performance assessment of scientific inquiry skills within microworlds,"

Journal of Educational Data Mining (accepted), 2012.

9. S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, "Big data, analytics and

the path from insights to value," MIT Sloan Management Review, vol. 52, pp. 21-31, 2011.

10. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, et al., "Big data: The next

frontier for innovation, competition, and productivity," 2011.

11. O. Trelles, P. Prins, M. Snir, and R. C. Jansen, "Big data, but are we ready?," Nature reviews

Genetics, vol. 12, pp. 224-224, 2011.

12. G. Bathuriya and M. Sai Nandhinee, "Implementation of Big Data for Future Education

Developement Data Mining Data Analytics."

13. D. Centola, "The spread of behavior in an online social network experiment," science, vol. 329,

pp. 1194-1197, 2010.

14. L.A. Kurgan and P. Musilek, "A survey of Knowledge Discovery and Data Mining process

models," Knowledge Engineering Review, vol. 21, pp. 1-24, 2006.

15. C. Romero, S. Ventura, and E. García, "Data mining in course management systems: Moodle

case study and tutorial," Computers & Education, vol. 51, pp. 368-384, 2008.

16. S.-H. Liao, P.-H. Chu, and P.-Y. Hsiao, "Data mining techniques and applications–A decade

review from 2000 to 2011," Expert Systems with Applications, vol. 39, pp. 11303-11311, 2012.

17. L. Tsantis and J. Castellani, "Enhancing learning environments through solution-based

knowledge discovery tools: Forecasting for self-perpetuating systemic reform," Journal of

Special Education Technology, vol. 16, pp. 39-52, 2001.

18. C. Romero, S. Ventura, A. Zafra, and P. d. Bra, "Applying Web usage mining for personalizing

hyperlinks in Web-based adaptive educational systems," Computers & Education, vol. 53, pp.

828-840, 2009.

19. C. Romero, S. Ventura, and P. De Bra, "Knowledge discovery with genetic programming for

providing feedback to courseware authors," User Modeling and User-Adapted Interaction, vol.

14, pp. 425-464, 2004.

20. Y. Wang, "Web mining and knowledge discovery of usage patterns," CS 748T Project, 2000.

21. S. Cetintas, L. Si, Y. P. Xin, and C. Hord, "Automatic detection of off-task behaviors in

intelligent tutoring systems with machine learning techniques," Learning Technologies, IEEE

Transactions on, vol. 3, pp. 228-236, 2010.

22. I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques:

Morgan Kaufmann, 2005.

23. C. Romero, S. Ventura, M. Pechenizkiy, and R. S. Baker, Handbook of educational data

mining: Taylor & Francis US, 2011.

24. J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques: Morgan kaufmann, 2006.

25. C. Romero and S. Ventura, "Educational data mining: a review of the state of the art," Systems,

Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 40, pp.

601-618, 2010.

26. R. Baker and K. Yacef, "The state of educational data mining in 2009: A review and future

visions," Journal of Educational Data Mining, vol. 1, pp. 3-17, 2009.

27. N. Sael, A. Marzak, and H. Behja, "Multilevel clustering and association rule mining for

learners’ profiles analysis," 2013.

28. N. Anozie and B. W. Junker, "Predicting end-of-year accountability assessment scores from

monthly student records in an online tutoring system," in Proceedings of the American

Association for Artificial Intelligence Workshop on Educational Data Mining (AAAI-06), July

17, 2006, Boston, MA, 2006, pp. 1-6.

29. L. Razzaq, M. Feng, N. T. Heffernan, K. R. Koedinger, B. Junker, G. Nuzzo-Jones, et al., "A

web-based authoring tool for intelligent tutors: blending assessment and instructional

assistance," in Intelligent Educational Machines, ed: Springer, 2007, pp. 23-49.

30. A. Peña-Ayala and L. Cárdenas, "How Educational Data Mining Empowers State Policies to

Reform Education: The Mexican Case Study," in Educational Data Mining, ed: Springer, 2014,

pp. 65-101.

31. J.A. Lara, D. Lizcano, M.A. Martínez, J. Pazos, and T. Riera, "A System for Knowledge

Discovery in E-Learning Environments within the European Higher Education Area-

Application to student data from Open University of Madrid, UDIMA," Computers &

Education, 2013.

32. M. J. Berry and G. Linoff, Data mining techniques: For marketing, sales, and customer

support: John Wiley & Sons, Inc., 1997.

33. M.-S. Chen, J. S. Park, and P. S. Yu, "Data mining for path traversal patterns in a web

environment," in Distributed Computing Systems, 1996., Proceedings of the 16th International

Conference on, 1996, pp. 385-392.

34. J. Alcalá-Fdez, L. Sánchez, S. García, M. J. del Jesús, S. Ventura, J. Garrell, et al., "KEEL: a

software tool to assess evolutionary algorithms for data mining problems," Soft Computing, vol.

13, pp. 307-318, 2009.

35. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful

knowledge from volumes of data," Communications of the ACM, vol. 39, pp. 27-34, 1996.

36. G. Siemens and R. S. d. Baker, "Learning analytics and educational data mining: Towards

communication and collaboration," in Proceedings of the 2nd International Conference on

Learning Analytics and Knowledge, 2012, pp. 252-254.

37. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of

association rules," Information Systems, vol. 29, pp. 343-364, 2004.

38. D. Agrawal, S. Das, and A. El Abbadi, "Big data and cloud computing: current state and future

opportunities," in Proceedings of the 14th International Conference on Extending Database

Technology, 2011, pp. 530-533.

39. H. Chen, R. H. Chiang, and V. C. Storey, "Business Intelligence and Analytics: From Big Data

to Big Impact," MIS Quarterly, vol. 36, pp. 1165-1188, 2012.

40. P. Zikopoulos, C. Eaton, D. DeRoos, T. Deutsch, and G. Lapis, "Understanding big data," New

York et al: McGraw-Hill, 2012.

41. M. Bienkowski, M. Feng, and B. Means, "Enhancing teaching and learning through educational

data mining and learning analytics: An issue brief," Washington, DC: SRI International, 2012.

42. R. Nisbet, J. Elder IV, and G. Miner, Handbook of statistical analysis and data mining

applications: Access Online via Elsevier, 2009.

43. P. Guide, "Getting Started with Big Data," 2013.

44. H. Kalia, S. Dehuri, and A. Ghosh: A Survey on Fuzzy Association Rule Mining. International

Journal of Data Warehousing and Mining 9(1): 1-27 (2013)

45. F. Waas, R. Wrembel, T. Freudenreich, M. Thiele, C. Koncilia, and P. Furtado: On-Demand

ELT Architecture for Right-Time BI: Extending the Vision. International Journal of Data

Warehousing and Mining 9(2): 21-38 (2013)

46. A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli, J.-N. Mazon, F. Naumann, T. Bach

Pedersen, S. Rizzi, J. Trujillo, P. Vassiliadis, and G. Vossen: Fusion Cubes: Towards Self-

Service Business Intelligence. International Journal of Data Warehousing and Mining 9(2): 66-

88 (2013)

47. P. Williams, C. Soares, and J.E. Gilbert: A Clustering Rule Based Approach for Classification

Problems. International Journal of Data Warehousing and Mining 8(1): 1-23 (2012)

48. R.V. Priya and A. Vadivel: User Behaviour Pattern Mining from Weblog. International Journal

of Data Warehousing and Mining 8(2): 1-22 (2012)

49. T. Kwok, K.A. Smith, S. Lozano, and D. Taniar: Parallel Fuzzy c-Means Clustering for Large

Data Sets, Proceedings of the 8th International Euro-Par Conference (Euro-Par 2002), Lecture

Notes in Computer Science, Volume 2400, Springer, pp: 365-374, 2002.

50. O. Daly and D. Taniar: Exception Rules Mining Based on Negative Association Rules,

Proceedings of the International Conference on Computational Science and Its Applications

(ICCSA 2004), Part IV, Lecture Notes in Computer Science, Volume 3046, Springer, pp: 543-

552, 2004.

51. D. Taniar, W. Rahayu, V.C.S. Lee, and O. Daly: Exception rules in association rule mining,

Applied Mathematics and Computation, 205(2): 735-750 (2008)

52. M.Z. Ashrafi, D. Taniar, and K.A. Smith: Redundant association rules reduction techniques,

International Journal of Business Intelligence and Data Mining, 2(1): 29-63 (2007)
