An Approachable Analytical Study on Big Educational
Data Mining
Saeed Aghabozorgi1, Hamidreza Mahroeian2, Ashish Dutt1, Teh Ying Wah1, and
Tutut Herawan1,3
1Department of Information System
University of Malaya 50603 Pantai Valley, Kuala Lumpur, Malaysia
2University of Otago New Zealand
3AMCS Research Center, Yogyakarta, Indonesia
{saeed,teh,tutut}@um.edu.my,
Abstract. Data in education continues to grow. More institutes now store terabytes and even petabytes of educational data. Data complexity in education is increasing as people store both structured data in relational format and unstructured data such as Word or PDF files, images, videos and geo-spatial data. Indeed, learning developers, universities, and other educational sectors confirm that a tremendous amount of the data captured is in unstructured or semi-structured format. Educators, students, instructors, tutors, research developers and people who deal with educational data are also challenged by the velocity of different data types. Organizations and institutes that process streaming data, such as click streams from web sites, need to update data in real time to serve the right advert or present the right offers to their customers. This analytical study is oriented to the challenges of big educational data and to the analysis involved in uncovering or extracting knowledge from large data sets by using different educational data mining approaches and techniques.
Keywords: Big Data; Educational Data; Educational Data Mining; Data Mining; Analytical Study.
1 Introduction
Big data can be considered as the study of voluminous amounts of didactic data, in physical or digital format, stored in diverse repositories ranging from the tangible bookkeeping records of an educational institution to class test or examination records to alumni records [1]. These records continue to grow in size and variety. As the old adage goes, we learn from our mistakes; in a similar fashion, businesses today are operated on the basis of decisions drawn from the data they have collected. Prediction, association, clustering and many other commonly occurring analyses inform the business decisions that corporations take each day to enhance productivity and mutual growth [2]. These significant decisions depend exclusively on the data collected during business operations and on human judgment. The concept of big data has now been applied to various sectors such as government, business and hospital management, to name a few, but little research has been done on its application in the educational sector. This is what we aim to examine through this research work. Tomes have been written on the efficacy of Big Data and on the technologies that can be used to harness the sheer strength it exudes, yet research on the application of big data in the educational sector remains very limited. Utilizing different data for making decisions is not a new concept; corporations use complicated calculations on data generated by different customers for business intelligence or analytics. Various techniques used in business intelligence can distinguish historical trends and customer patterns from data and can generate different models that result in the prediction of future patterns and trends [3]. These techniques consist of proven methodologies from computer science, mathematics and statistics used for deriving non-redundant information from large-scale datasets (big data) [4].
One clear example of exploiting data to discover online behavior is Web analytics, whose methods log and report visits to a Web page, specific region or particular domain and the links that were clicked through. Web analytics are applied to understand how people use the Web, but corporations have utilized more complicated approaches and tools to track more sophisticated user interactions with their websites [5, 6]. Examples of web analytics include analyzing the purchasing habits of consumers and applying recommendation algorithms in commercial websites and search engines so that they can recommend the product a consumer is most likely to want; notable examples are Netflix and Amazon. The same concept is now being applied to various e-learning systems. For example, Edmodo is a free open source LMS that is able to predict similar books or resources based on the learner’s web activity on the e-LMS [7].
New approaches and methods are considered imperative for the extraction and analysis tasks described above, so that they integrate seamlessly with the unstructured data that these information systems generate. Big data is voluminous, and it would be futile to bind it within a specific numeric boundary. One means of defining it is its net usability worth. According to Manyika et al. [10], a data set whose computational size exceeds the processing limit of software can be categorized as big data. Several past studies have provided detailed insights into the application of traditional data mining algorithms such as clustering, prediction and association to tame the sheer volume of big data. Recent advances in the machine learning field have provided unique approaches to knowledge discovery in datasets. These algorithms have been successful in finding correlations in unstructured data, and one of their applications has been predictive modeling. Such models can be treated as virtual prototypes of a real working model. When fed with real datasets, such models can help reveal problems that can then be promptly addressed, thus mitigating the operational costs of both human and machine labor.
Two specific fields that are significant to the exploitation of big data in education are educational data mining and learning analytics. Although there is no hard and fast distinction between these two areas, they have had somewhat different research histories and are developing as discrete research areas. In general, educational data mining tries to uncover new patterns in captured data and to build new algorithms or new models, whilst learning analytics applies known predictive models in educational systems [1, 4]. As can be seen in Figure 1, educational data such as log files, user (learner) interaction data, and social network data are expected to grow in the near future. This research study is oriented to the challenges of big educational data and to the analysis involved in uncovering or extracting knowledge from large data sets by using different educational data mining approaches and techniques. It is arranged as follows. The next section gives the background of the study, including the importance of education and educational data, the nature of big data, and a basic understanding of data mining or knowledge discovery techniques. Section 3 discusses, from the perspective of big educational data mining, the concepts of educational data mining, big educational data, and big data mining. Section 4 details the major challenges concerned with big educational data mining, and finally the related discussion and conclusion are outlined.
Fig 1. Growth of different Educational Data
2 Rudimentary
2.1 Education
Learning providers, institutes, universities, schools and colleges have always had the ability to generate huge amounts of educational data [8]. Even a small kindergarten that caters only to a play group of children aged between 4 and 6 years can produce enormous quantities of data, ranging from academic records to peer activities, classroom activities and so forth. After the explosion of the buzzword “Big data” in different industrial sectors, researchers and industry workers are converging on the areas that could presumably be affected by this surge [9]. Recent advances in technology have made it possible to explore previously unknown information that lay buried deep within heaps of data sets [10]. However, the most basic question that needs to be answered first is: “Is there really any big data in education?” Or are we simply looking at an impasse?
2.2 Big Data
There are a number of similar definitions of big data. Perhaps the most well-known and popular version is derived from IBM, which proposed that big data could be differentiated by any or all of three V words used to examine situations, events, and so on: volume, variety, and velocity [1, 9, 11].
Volume refers to the larger quantities of data being produced from a wide range of sources. For instance, big data can comprise data captured from the Internet of Things (IoT). As initially pictured, IoT is associated with the data collected from a range of different devices and sensors networked together over the Internet [12]. Big data can also refer to the explosion of information accessible on common social media such as Facebook and Twitter [13].
Variety refers to the use of numerous types of data to investigate a situation or event. On the IoT, millions of devices generating a steady flow of data result in not only a large volume of data but also different kinds of data describing different situations. Furthermore, people on the Internet produce a highly varied set of structured, semi-structured as well as unstructured data [9].
Velocity refers to the rapid increase in both structured and unstructured data over time, which makes more frequent decision making about that data essential [1]. As the world becomes more global and developed, and as the IoT grows, data capture and decision making about those things become more frequent as they progress throughout the world. Additionally, the velocity of social media use is on an obvious upward trend; a clear example is the 250 million tweets posted per day. As decisions are made using big data, those decisions can eventually have a substantial impact on the next data that is captured and analyzed, adding another dimension to the velocity of big data [1, 10].
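To put the velocity figure in perspective, the short calculation below converts the 250 million tweets per day quoted above into an average per-second rate. It is only a back-of-the-envelope sketch: it assumes that tweets are spread evenly across the day, which real traffic is not, and the function name is ours.

```python
TWEETS_PER_DAY = 250_000_000      # figure quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds


def average_rate_per_second(events_per_day: int) -> float:
    """Average number of events per second, assuming a uniform spread."""
    return events_per_day / SECONDS_PER_DAY


rate = average_rate_per_second(TWEETS_PER_DAY)
print(f"{rate:.0f} tweets per second on average")
```

Even under this simplifying assumption, a system that analyzes the stream must keep up with several thousand new records every second.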
2.3 Data Mining
Data mining, or knowledge discovery in databases (popularly known as KDD), is the automatic extraction of implicit and interesting patterns from vast amounts of data [14]. Data mining is recognized as a multidisciplinary field in which a number of computing paradigms converge, such as decision tree construction, rule induction, artificial neural networks, instance-based learning, Bayesian learning, and logic programming. In addition, functional data mining techniques and methods include statistics, visualization, clustering, classification and association rule mining [15,16]. These techniques discover new, implicit and practical knowledge based on students’ usage data.
Data mining has been broadly applied in different kinds of educational systems. On the one hand, there are traditional classroom environments such as special education [17] and higher education [18]. On the other hand, there is computer-based and web-based education, such as the well-known learning management systems (LMS) [19], web-based adaptive hypermedia systems [20] and intelligent tutoring systems (ITS) [21]. The major difference between the two is the data accessible in each system. Traditional classrooms only have information about student attendance, the basic course syllabus, course objectives and learners’ plan data. Web and computer-based education, however, has much more information readily available, because these systems can record all the data pertaining to specific students’ actions and interactions in log files and databases [22].
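As an illustration of how the log files of a web-based system can be turned into usage data, the sketch below counts the actions recorded for each student. The log format ("timestamp,student_id,action"), the sample entries and the function name are assumptions made for this example; real LMS logs differ from system to system.

```python
from collections import Counter, defaultdict

# Hypothetical log lines in the form "timestamp,student_id,action".
SAMPLE_LOG = """\
2014-03-01T09:00:12,s001,login
2014-03-01T09:01:40,s001,view_page
2014-03-01T09:02:05,s002,login
2014-03-01T09:05:31,s001,submit_quiz
2014-03-01T09:07:19,s002,view_page
2014-03-01T09:09:44,s002,view_page
"""


def count_actions(log_text: str) -> dict:
    """Count how often each student performed each action."""
    counts = defaultdict(Counter)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _timestamp, student_id, action = line.split(",")
        counts[student_id][action] += 1
    return dict(counts)


actions = count_actions(SAMPLE_LOG)
for student, counter in sorted(actions.items()):
    print(student, dict(counter))
```

Per-student counts of this kind are the sort of usage data that the techniques listed above (clustering, classification, association rule mining) take as input.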
Fig. 2. Applying data mining to the design of educational systems
In order to improve learning effectiveness, the application of data mining approaches and techniques to educational systems can be viewed as a formative evaluation technique, that is, the evaluation of an educational program while it is still in its development phase, with the purpose of continually enhancing the program. Auditing the way students use the educational system is perhaps one common way to assess instructional design in this manner. It helps learning developers improve the instructional materials, and it draws on different data types such as log files, performance data and transactions [23]. Data mining techniques should be applied to collect information that can assist instructional designers and developers in building an educational foundation for judgments when designing or improving an environment’s instructive approach. The application of data mining to the design of educational systems is an iterative cycle of hypothesis formation, testing, and refinement (see Figure 2). Extracted knowledge should go through the loop towards guiding, facilitating, and enhancing learning as a whole. In this process, the aim is not just to turn data into knowledge, but also to filter mined knowledge for decision making [16,24].
[Figure 2 labels: Users (instructors, learners, students, course administrators, academic researchers, educators); Educational system (traditional classrooms, e-learning systems, LMSs, intelligent tutoring systems, web-based adaptive systems); Data mining techniques (visualization, clustering, classification, statistics, association rule mining, sequence mining); Educational data; Knowledge components; User transaction; Log files; User performance; Identity type data]
As represented in Figure 2, educators and educational designers (whether in school districts, curriculum companies, or universities) design, plan, create, and maintain educational systems. Students use those educational systems to learn.
Building off of the available information about courses, students, usage, and
interaction, data mining techniques can be applied in order to discover useful
knowledge that helps to improve educational designs. The discovered knowledge can
be used not only by educational designers and teachers, but also by end users—
students. Hence, the application of data mining in educational systems can be oriented to supporting the specific needs of each of these categories of stakeholders[23].
3 Big Educational Data Mining
3.1 Educational Data Mining
Educational Data Mining, popularly known as EDM, is a field that applies statistical, machine-learning, and data-mining (DM) algorithms to different types of educational data. Its major objective is to analyze these types of data in order to resolve educational research issues [25,26]. EDM is concerned with developing methods to explore the relationships between the unique types of data produced in educational settings and with using these methods to better understand students and the settings in which they learn. The increase in both instrumented educational software and state databases of student information has created large repositories of data reflecting how students learn [26]. In addition, the use of the Internet in education has created a new context, known as e-learning or web-based education, in which large amounts of data about teaching–learning interaction are endlessly generated and ubiquitously available [23]. All this information provides a gold mine of educational data. EDM seeks to tap these untouched data repositories to better understand learners and learning abilities, and to develop computational approaches that combine data and theory to transform practice to benefit learners. EDM has emerged as a prolific research area in recent years for researchers all over the world from different and related research areas [7].
Educational Data Mining can be extremely helpful in drawing inferences, making predictions and establishing students’ behavior, attitude and focus on their educational goals. The results obtained by applying traditional data mining algorithms to the educational context can help enhance the educational system, as all stakeholders can look into the trends found once analytic reasoning is applied to data on student-related parameters [25]. Regression techniques are commonly used to analyze such data. When the data are reduced to statistical figures for analysis, the results can usually be plotted on graphs, and trends can be found in the form of lines or clusters of data points that indicate a concentration of some student behavior in learning, researching or a related activity [25]. EDM involves various groups of users such as learning developers, instructors, educators and researchers. Different groups consider educational data from different angles, based on their mission, vision, and major purpose for using data mining, as depicted in Table 1.
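The regression-based trend analysis mentioned above can be sketched with an ordinary least-squares line fit. The example below is a minimal illustration only: the data (weekly study hours and exam scores) are made up, and the function name is ours rather than part of any tool discussed in this paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: weekly study hours and exam scores for six students.
hours = [2, 4, 5, 7, 8, 10]
scores = [52, 58, 61, 70, 74, 81]

slope, intercept = fit_line(hours, scores)
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
```

The fitted line is the kind of trend that would be plotted over the data points; a positive slope here would suggest that scores rise with study time in this toy sample.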
Table 1. EDM Users/Stakeholders
User/Actors | Objectives for using data mining
Learners/Students/Pupils | To personalize e-learning, to recommend activities, resources and learning tasks to learners that could further improve their learning, to suggest interesting learning experiences to the students [27]
Educators/Instructors/Teachers/Tutors | To get objective feedback about instruction, to analyze students’ learning and behavior, to detect which students need support, to predict student performance, to classify learners into groups [28]
Course Developers/Educational Researchers | To evaluate and maintain courseware, to improve student learning, to evaluate the structure of course content and its effectiveness in the learning process [29]
Organizations/Learning Providers/Universities/Private Training Companies | To enhance the decision processes in higher learning institutions, to streamline the efficiency of the decision making process, to achieve specific objectives [30]
Administrators/School District Administrators/Network Administrators/System Administrators | To develop the best way to organize institutional resources and their educational offer, to utilize available resources more effectively, to enhance educational program offers and determine the effectiveness of the distance learning approach [31]
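One of the educator objectives in Table 1, classifying learners into groups, can be illustrated with a very small clustering sketch. The code below runs a one-dimensional k-means on made-up average quiz scores; the data, the initial centers and the function name are assumptions chosen for illustration, and k-means is only one of many possible grouping techniques.

```python
def kmeans_1d(values, centers, iterations=20):
    """Very small 1-D k-means: returns (final centers, label per value)."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else old)
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


# Made-up average quiz scores for eight learners.
quiz_scores = [35, 40, 42, 68, 70, 73, 90, 94]
centers, labels = kmeans_1d(quiz_scores, centers=[30, 60, 100])
print(centers)
print(labels)
```

Each label identifies the group a learner falls into, which an instructor could use, for example, to decide which students need additional support.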
Today, there exists a wide variety of educational data sets that can be downloaded for free from the Internet. Some widely acclaimed and used repositories are PSLC DataShop (the world’s largest repository of learning interaction data), Data.gov (the official website of the United States Government on educational data sets), NCES data sets (from the primary federal entity for collecting and analyzing data related to education in the United States) [26,32], the Barro-Lee data set (provided by the researchers Barro and Lee, whose contribution has been discussed in section 1), the UNISTATS dataset (a website that provides comparable sets of information about full or part time undergraduate courses and is designed to meet the information needs of prospective students), SABINS (the School Attendance Boundary Information System, which provides, free of charge, aggregate census data and GIS-compatible boundary files for school attendance areas, or school catchment areas, for selected areas in the United States for the 2009-10, 2010-11 and 2011-12 school years), UIS (a UNESCO initiative), EdStats (a World Bank initiative), the Education Human Development Network (a World Bank initiative), the IPEDS Data Center (the primary source for data on colleges, universities, and technical and vocational postsecondary institutions in the United States), and TLRP [33].
3.1.1 Analysis of current tools being used for educational data sets
At present, statistical tools are predominantly used to quantify and assess educational data sets. Prominent ones are RapidMiner, SAS, IBM SPSS and KEEL [34] (a knowledge extraction tool based on evolutionary learning). Programming languages such as R are mostly used for statistical analysis and play a pivotal role in programming custom tests that may not be available in commercial software packages. There are also some online, web-based data exploration tools, typically Java-based, that give the user the freedom to choose from varied dataset types and see a graphical representation of them. One of these is the Education Data Explorer provided by the Oregon Department of Education, United States. Another is the Educational Data Analysis Tool (EDAT), which allows users to download NCES survey datasets to their computers. EDAT guides the user through selecting a survey, population, and variables relevant to the analysis [30].
3.1.2 Educational data set problem and possible solutions
Does the problem really exist, or are we chasing a chimera? “Education for All”, a global monitoring report prepared by the United Nations, is the prime instrument to assess global progress towards its goals. There is a flurry of activity around big data and how it is touching and transforming every aspect of our lives. Analysis of these large scale datasets can help improve the robustness and generalizability of educational research. The problem with most large scale secondary data sets used in higher education research is that they are constructed using complex sample designs that often cluster lower level units (students) within higher level units (colleges) to achieve efficiencies in the sampling process [35]. As shown in Figure 3, the term “Big Educational Data Mining” (BEDM) can be proposed for the extraction of useful big educational data from vast quantities of different large data sets.
Fig 3. Big Educational Data Mining (BEDM), Extracting new knowledge from Big Data Sets
3.2 Big Educational Data
Education has always had the capacity to produce a tremendous amount of data compared to any other industry. First, academic study requires many hours of schoolwork and homework over a number of years. These extended interactions with materials produce a huge quantity of data. Second, education content is tailor-made for big data, generating cascading effects of insights thanks to the high correlation between concepts [31]. Recent advancements in technology and data science have made it possible to unlock and explore these large data sets [15]. The benefits range from more effective self-paced learning to tools that enable instructors to pinpoint interventions, create productive peer groups, and free up class time for creativity and problem solving. For instance, as represented in Table 2, educational data can be divided into five categories: one pertaining to student identity and onboarding, and four student activity-based data sets that have the potential to improve learning outcomes. They are listed below in order of how difficult they are to attain:
Table 2. Educational Data Type classes
No. | Data Type | Description
1 | Identity Data | Personal information, authority, domain rights, geographical information
2 | User Interaction Data | Engagement metrics, click rate, page views, bounce rate, etc.
3 | Inferred Content Data | How well does a piece of content perform across a group, or for any one subgroup, of students? What measurable student proficiency gains result when a certain type of student interacts with a certain piece of content?
4 | System-Wide Data | Rosters, grades, disciplinary records, and attendance information are all examples of system-wide data.
5 | Inferred Student Data | Exactly what concepts does a student know, at exactly what percentile of proficiency? What is the probability that a student will pass next week’s quiz, and what can she do right this moment to increase it?
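The user interaction data in row 2 of Table 2 can be made concrete with a small sketch that derives page views, views per session and bounce rate from raw page-view events. The event format and sample data are hypothetical, and bounce rate is taken here to mean the share of sessions with exactly one page view, which is a common definition but an assumption of this example.

```python
from collections import Counter

# Hypothetical page-view events as (session_id, page) pairs.
events = [
    ("a", "/course/1"), ("a", "/course/1/quiz"), ("a", "/course/1/forum"),
    ("b", "/course/2"),
    ("c", "/course/1"), ("c", "/course/1/quiz"),
    ("d", "/course/3"),
]


def engagement_metrics(page_views):
    """Return total page views, views per session, and bounce rate."""
    views_per_session = Counter(session for session, _page in page_views)
    sessions = len(views_per_session)
    if sessions == 0:
        return {"page_views": 0, "views_per_session": 0.0, "bounce_rate": 0.0}
    # A "bounce" is a session that viewed exactly one page.
    bounces = sum(1 for n in views_per_session.values() if n == 1)
    return {
        "page_views": len(page_views),
        "views_per_session": len(page_views) / sessions,
        "bounce_rate": bounces / sessions,
    }


metrics = engagement_metrics(events)
print(metrics)
```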
Two areas that are specific to the use of big data in education are educational data
mining and learning analytics. Although there is no hard and fast distinction between
these two fields, they have had somewhat different research histories and are
developing as distinct research areas. Generally, educational data mining looks for
new patterns in data and develops new algorithms and/or new models, while learning analytics applies known predictive models in instructional systems. Practical examples of big data in the educational context are the following:
• One clear example is an education initiative. Analysts estimate that £16 billion is wasted in productivity due to under-educated citizens. In response, the UK government gathered data on outcomes of Kindergarten-12 education (elementary and high school) as well as higher education (university). The data pertained to student school performance and “success” afterwards as measured by employment [36].
• The government increasingly contributes to the open data movement; it is willing to release “dirty data,” which is raw and not cleansed. Open data enables individuals and entrepreneurs to use public data to innovate. Data visualization tools enable parents to understand schools’ outcomes, so they can select appropriate schools for their children.
• Universities can use data in exciting ways; they analyze students’ social media sharing, patterns in checking out library materials, and what courses they take (and the outcomes they achieve). This data helps them steer students to courses that are aligned with their goals. It also helps with student retention.
• Big data enables interesting insights and correlations, such as that students who have high library fines tend to perform worse on tests. Universities also correlate performance data with socioeconomic and email data, so they can learn what student characteristics predict the best performance at their schools, and they use this to guide their recruitment. They are also starting to be able to predict which students will drop out before graduating, which helps them give additional support [9].
• Cost drivers [of education] are key to big data adoption in the UK, which has developed the most comprehensive database of pupils (schoolchildren) in the world. It traces the performance of 600,000 pupils from 3,000 elementary schools through to their careers. It has ten years of data on pupils’ exams, tests, socioeconomic status, geography, transport, free meals, behavior issues and many others. It is a rich dataset from which the government can learn and improve schools. It can answer political questions. The government is also combining its data with health, crime and welfare datasets. It studies what students’ lives are like outside school, to try to develop a fuller picture of factors that affect performance. This can help challenge conventional thinking and guide policy.
• This initiative is teaching us many things. Socioeconomic status is not as important as we thought; school performance and responsiveness are very important. Schools can use data to change. For example, science, technology, engineering and math courses are far more important than we thought, even when students don’t intend to pursue STEM careers.
• Privacy is an issue with these databases, but the government believes that the advantages outweigh the pupils’ compromised privacy [37].
• Another traditional belief is that poor pupils do poorly and that schools need more money to increase performance. The data are showing that how the money is invested is more important than how much money is in the school’s budget. We are starting to be able to measure return on outcomes.
• The UK example is more complex, but it effectively illustrates how internal and external data can be mashed up to address complex problems such as school performance. It’s an excellent example of big data.
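The kind of correlation mentioned in the examples above, such as library fines against test performance, can be quantified with the Pearson correlation coefficient. The sketch below uses made-up figures purely for illustration; it does not reproduce any of the cited data, and a correlation found this way says nothing by itself about cause and effect.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("each variable must vary")
    return cov / math.sqrt(var_x * var_y)


# Made-up data: library fines (in pounds) and test scores for six students.
fines = [0, 2, 5, 8, 12, 20]
test_scores = [88, 85, 80, 74, 70, 61]

r = pearson(fines, test_scores)
print(f"r = {r:.2f}")
```

A value of r close to -1 in this toy sample would correspond to the pattern described in the text: higher fines going together with lower scores.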
3.3 Big Data Mining
In typical data mining systems, the mining procedures require computationally intensive computing units for data analysis and comparisons. A computing platform therefore needs efficient access to at least two types of resources: data and computing processors. For small scale data mining tasks, a single desktop computer, which contains a hard disk and CPU processors, is sufficient to fulfill the data mining goals. Indeed, many data mining algorithms are designed for this type of problem setting. For medium scale data mining tasks, data are typically large (and possibly distributed) and cannot fit into the main memory. Common solutions are to rely on parallel computing [43], [33] or collective mining [12] to sample and aggregate data from different sources and then use parallel programming (such as the Message Passing Interface) to carry out the mining process. For Big Data mining, because the data scale is far beyond the capacity that a single personal computer (PC) can handle, a typical Big Data processing framework will rely on cluster computers with a high-performance computing platform, with a data mining task being deployed by running parallel programming tools, such as MapReduce or Enterprise Control Language (ECL), on a large number of computing nodes (i.e., clusters). The role of the software component is to make sure that a single data mining task, such as finding the best match of a query from a database with billions of records, is split into many small tasks, each of which runs on one or multiple computing nodes. For example, as of this writing, the world’s most powerful supercomputer, Titan, which is deployed at Oak Ridge National Laboratory in Tennessee, contains 18,688 nodes, each with a 16-core CPU. Such a Big Data system, which blends both hardware and software components, is hardly available without the support of key industrial stakeholders.
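The idea of splitting one "best match" query into many small tasks can be sketched in the map/reduce style. The example below is a single-process illustration only: the records are plain numbers, the match criterion (smallest absolute distance to the query) and all function names are assumptions made for this sketch, and a real framework such as MapReduce would run the map step on separate computing nodes.

```python
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Split the record list into consecutive chunks (one per 'node')."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def best_in_chunk(chunk, query):
    """Map step: the record in one chunk that is closest to the query."""
    return min(chunk, key=lambda record: abs(record - query))


def closer_of(query):
    """Build the reduce step: keep whichever candidate is closer to the query."""
    def pick(a, b):
        return a if abs(a - query) <= abs(b - query) else b
    return pick


def best_match(records, query, chunk_size=4):
    """Find the record closest to the query via per-chunk map and a final reduce."""
    if not records:
        raise ValueError("records must not be empty")
    chunks = split_into_chunks(records, chunk_size)
    local_best = [best_in_chunk(chunk, query) for chunk in chunks]  # map
    return reduce(closer_of(query), local_best)                     # reduce


records = [12, 97, 45, 63, 8, 71, 30, 88, 54, 19]
print(best_match(records, query=50))
```

Because each chunk is processed independently, the map step is the part that a cluster can distribute; only the small list of per-chunk winners has to be brought together for the reduce step.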
In fact, for decades, companies have been making business decisions based on transactional data stored in relational databases [10]. Big Data mining offers opportunities to go beyond traditional relational databases and rely on less structured data: weblogs, social media, e-mail, sensors, and photographs that can be mined for useful information [1]. Major business intelligence companies, such as IBM, Oracle, Teradata, and so on, have all featured their own products to help customers acquire and organize these diverse data sources and coordinate them with customers’ existing data to find new insights and capitalize on hidden relationships.
4 Major Challenges in Big Educational Data Mining
4.1 Is education data big enough to call it big data?
Startups like Knewton [38] and Desire2Learn [10] have been founded on the concept of Big Data. We saw similar e-commerce startups during the e-commerce boom of the early nineties, and history is a mute witness to the fate of some of them; a few have perished by now. However, business startups founded on big data in the educational context should not face a similar fate, because their foundation rests on didactic unstructured data that is already present in information systems. Perhaps one of the reasons some of the ill-fated e-commerce startups failed was that their business model did not rest on the availability of a constant flow of data from which information could be mined. That is not the case here. All we need is specialized algorithms designed to work with educational datasets, because we already have the data. Companies like Yahoo, Google, Dell and HP, to name a few, have ventured into the open-source development of big data software such as the Apache Foundation’s Hadoop, and they facilitate collective learning through contests like hackadays or hackathons [25,9]. We also need to understand that there is a gap between the application of big data in commerce and that in the education sector. While the former has seen various advances, for the latter we are still dependent on traditional data mining algorithms. The problem with using such algorithms is that they may not fit the dataset, which can cause a loss of valuable predictions that could otherwise have been obtained with data mining algorithms suited to the educational dataset. Educational experts have identified various deployment and implementation barriers to harnessing the power of big data in education and learning analytics when general data mining algorithms are applied; the most important include technical lacunae, institutional velocity, and legal and quite often ethical issues. For big data to be meaningful, it will require the seamless integration of specifically tailored algorithms that can tame the power of this raging beast into knowledge that will be useful to both the learner and the educator [39].
4.2 BDM: Challenges in applying DM approaches on Big Data (from the
educational perspective)
A conceptual view of the Big Data processing framework is depicted in Fig. 4. It
comprises three tiers, from the inside out: data accessing and computing (Tier I), data
privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III). The
challenges at Tier I focus on data accessing and the actual computing procedures.
Fig. 4. A conceptual view of the Big Data processing framework [4]
Because Big Data are often stored at different locations and data volumes may grow
continuously, an effective computing platform has to take distributed large-scale data
storage into consideration [4,11]. For example, typical data mining algorithms require
all data to be loaded into main memory. This is becoming a clear technical barrier for
Big Data, because moving data across different locations is expensive (e.g., subject to
intensive network communication and other I/O costs), even if a very large main memory
is available to hold all the data for computing.
The challenges at Tier II center on semantics and domain knowledge for different Big
Data applications [40]. Such information can provide additional benefits to the mining
process, but it can also add technical barriers to Big Data access (Tier I) and to the
mining algorithms (Tier III). For example, depending on the application domain, the data
privacy and information sharing mechanisms between data producers and data consumers can
differ significantly. Sharing sensor network data for applications such as water quality
monitoring may not be discouraged, whereas releasing and sharing mobile users’ location
information is clearly not acceptable for the majority, if not all, applications [41].
In addition to these privacy issues, the application domain can provide information that
benefits or guides the design of Big Data mining algorithms. For example, in market
basket transaction data, each transaction is considered independent, and the discovered
knowledge is typically represented by highly correlated items, possibly with respect to
different temporal and/or spatial restrictions. In a social network, on the other hand,
users are linked and share dependency structures. The knowledge is then represented by
user communities, leaders in each group, social influence models, and so on. Therefore,
understanding semantics and application knowledge is important both for low-level data
access and for high-level mining algorithm design [16].
At Tier III, the data mining challenges concentrate on designing algorithms that tackle
the difficulties raised by Big Data volumes, distributed data distributions, and complex
and dynamic data characteristics. The circle at Tier III contains three stages [4].
First, sparse, heterogeneous, uncertain, incomplete, and multi-source data are
preprocessed by data fusion techniques.
Second, complex and dynamic data are mined after pre-processing.
Third, the global knowledge obtained by local learning and model fusion is tested,
and relevant information is fed back to the pre-processing stage.
The model and parameters are then adjusted according to the feedback. Throughout the
process, information sharing not only ensures the smooth development of each stage but is also a purpose of Big Data processing [30].
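As a rough illustration of the local learning and model fusion idea, the following Python sketch assumes that student scores are held at several sites and cannot be pooled in one main memory. Each site summarizes its own data in a single streaming pass, and only the small summaries are fused into global statistics. The names `Summary`, `local_summary` and `fuse`, as well as the sample scores, are hypothetical and are not taken from [4].

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Summary:
    """Sufficient statistics for the mean and variance of one partition."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0


def local_summary(scores: Iterable[float]) -> Summary:
    """Local learning: one streaming pass, so a site never holds all data at once."""
    s = Summary()
    for x in scores:
        s.n += 1
        s.total += x
        s.total_sq += x * x
    return s


def fuse(summaries: Iterable[Summary]) -> Summary:
    """Model fusion: combine per-site summaries without moving raw records."""
    g = Summary()
    for s in summaries:
        g.n += s.n
        g.total += s.total
        g.total_sq += s.total_sq
    return g


def mean_and_variance(s: Summary) -> Tuple[float, float]:
    """Global mean and population variance from the fused statistics."""
    mean = s.total / s.n
    return mean, s.total_sq / s.n - mean * mean


# Hypothetical exam scores stored at three separate campuses.
sites = [[60.0, 70.0], [80.0], [50.0, 90.0, 70.0]]
global_summary = fuse(local_summary(site) for site in sites)
mean, variance = mean_and_variance(global_summary)
```

Only three numbers per site cross the network here, which is the point of the sketch. The sum-of-squares formula is chosen for brevity and can lose precision on very large or poorly scaled data, so a production system would likely prefer a numerically stable merge.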
4.3 EDM: Challenges in applying DM approaches on education data
Recent advances in information technology have seen the proliferation of software that
can generate a fully functional website, complete with a back-end database system, in
less than an hour. This has led to rapid growth of e-learning systems, most of them
based on cloud technology. Most of these systems have incorporated recommendation
features like those used by their business-oriented counterparts, and both generate
voluminous amounts of data. Online learning systems have offered the educator, developer
and researcher opportunities to create personalized learning systems, but these
personalizations rely on traditional data mining algorithms [7]. One problem is that
most e-learning systems cannot ascertain, from an educational point of view, that they
are used by learners who each have an individual learning style. When a learner
interacts with an LMS, he or she leaves behind a trail of breadcrumbs in the form of log
text files, for example records of interactions within the LMS forum with other students
or with the course facilitator [33]. It follows that, to mine these data, it is
imperative to identify the correct dataset to use so that logical conclusions can be
derived from it. Until now, there have been few instances where data mining methods
[44-48] have been introduced within e-learning systems to facilitate learner progress.
A second problem, from a developer’s point of view, is how to classify the individual
learning style of a learner so as to provide a truly personalized learning environment.
A further challenge, as mentioned in previous sections, is how to develop specific data
mining algorithms [49-52] that cater to the learning analytics domain. What matters at
this point, then, is to find methodologies that can help clean educational datasets so
that they can be processed further [23, 42].
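To make the cleaning step concrete, the following Python sketch shows one possible way to prepare LMS forum logs before mining. The log layout (timestamp, learner_id, action, target), the sample rows, and the function name `clean_log` are assumptions made for illustration only; real LMS exports differ. The sketch drops rows with unparseable timestamps or missing learner identifiers, normalizes identifiers, removes exact duplicates, and then counts forum interactions per learner.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical LMS forum log; the column layout is an assumption for illustration.
RAW_LOG = """timestamp,learner_id,action,target
2014-03-01T09:00:00,s001,forum_post,facilitator
2014-03-01T09:00:00,s001,forum_post,facilitator
2014-03-01T09:05:00, S002 ,forum_reply,s001
2014-03-01T09:07:00,,forum_post,s002
not-a-timestamp,s003,forum_post,s001
2014-03-01T09:10:00,s002,forum_post,facilitator
"""


def field(row, name):
    """Return a trimmed field value, treating missing fields as empty."""
    return (row.get(name) or "").strip()


def clean_log(text):
    """Return de-duplicated (time, learner, action, target) tuples."""
    seen = set()
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        learner = field(row, "learner_id").lower()
        if not learner:
            continue  # drop rows with no learner identifier
        try:
            when = datetime.fromisoformat(field(row, "timestamp"))
        except ValueError:
            continue  # drop rows with unparseable timestamps
        key = (when, learner, field(row, "action"), field(row, "target").lower())
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        rows.append(key)
    return rows


cleaned = clean_log(RAW_LOG)
# Simple per-learner feature: number of forum interactions after cleaning.
interactions = Counter(learner for _, learner, _, _ in cleaned)
```

Counts of this kind are only a starting feature. Classifying learning styles, as discussed above, would require richer features and algorithms suited to educational data.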
5 Discussion and Conclusion
In this analytical study, we have discussed the importance of education, the growth of
educational data into big data, and the big data mining tools and techniques used to
mine these vast amounts of data. Moreover, the challenges involved in mining and
extracting big educational data have been addressed from different educational data
mining perspectives. Working with big data using data mining and analytics is rapidly
becoming common in the commercial sector. Tools and techniques once confined to
research laboratories are being adopted by forward-looking industries, most notably
those serving end users through online systems [4,43]. Higher education institutions
are applying learning analytics to improve the services they provide and to improve
visible and measurable targets such as grades and retention. K–12 schools and school
districts are starting to adopt such institution-level analyses for detecting areas for
improvement, setting policies, and measuring results. Now, with advances in adaptive
learning systems, possibilities exist to harness the power of feedback loops at the level
of individual teachers and students [40]. Measuring and making visible students’ learning and assessment activities opens up the possibility for learners to develop
skills in monitoring their own learning and to see directly how their effort contributes to
their success. Teachers gain views into students’ performance that help them adapt
their teaching or initiate interventions in the form of tutoring, tailored assignments,
and the like. Personalized adaptive learning systems enable educators to quickly see
the effectiveness of their adaptations and interventions, providing feedback for
continuous improvement. The practical application of open source data mining tools
in an educational setting can help both the researcher and the developer to compare
distinct prototypes that share the same design functionality. The results thus obtained
could then be integrated into the existing in-house educational framework used by
institutions, so as to keep pace with the rapid adoption of blended learning
environments. Open source tools for adaptive learning systems, commercial offerings, and increased understanding of what data reveal are leading to fundamental shifts in
teaching and learning systems. As content moves online and mobile devices for
interacting with content enable teaching to be always on, educational data mining and
learning analytics will enable learning to be always assessed. Educators at all levels
will benefit from understanding the possibilities of the developments in the use of big
data described herein. Beyond the challenges of this new field, introduced here as big
educational data mining, the importance of analyzing big educational data captured and
extracted from large-scale data sets using multiple big data and data mining approaches
has to be considered in further studies.
Acknowledgments. This work is supported by University of Malaya High Impact
Research Grant no vote UM.C/625/HIR/MOHE/SC/13/2 from Ministry of Education
Malaysia.
References
1. S. Sagiroglu and D. Sinanc, "Big data: A review," in Collaboration Technologies and Systems
(CTS), 2013 International Conference on, 2013, pp. 42-47.
2. A. Peña-Ayala, "Educational data mining: A survey and a data mining-based analysis of recent
works," Expert systems with applications, 2013.
3. G. Siemens and P. Long, "Penetrating the fog: Analytics in learning and education," Educause
Review, vol. 46, pp. 30-32, 2011.
4. X. Wu, X. Zhu, G. Wu, and W. Ding, "Data mining with big data," 2012.
5. C. Bizer, P. Boncz, M. L. Brodie, and O. Erling, "The meaningful use of big data: four
perspectives--four challenges," ACM SIGMOD Record, vol. 40, pp. 56-60, 2012.
6. A. Abraham, "Business intelligence from web usage mining," Journal of Information &
Knowledge Management, vol. 2, pp. 375-390, 2003.
7. C. Romero and S. Ventura, "Educational data mining: A survey from 1995 to 2005," Expert
Systems with Applications, vol. 33, pp. 135-146, 2007.
8. J. Gobert, M. Sao Pedro, R. Baker, E. Toto, and O. Montalvo, "Leveraging educational data
mining for real time performance assessment of scientific inquiry skills within microworlds,"
Journal of Educational Data Mining (accepted), 2012.
9. S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, "Big data, analytics and
the path from insights to value," MIT Sloan Management Review, vol. 52, pp. 21-31, 2011.
10. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, et al., "Big data: The next
frontier for innovation, competition, and productivity," 2011.
11. O. Trelles, P. Prins, M. Snir, and R. C. Jansen, "Big data, but are we ready?," Nature reviews
Genetics, vol. 12, pp. 224-224, 2011.
12. G. Bathuriya and M. Sai Nandhinee, "Implementation of Big Data for Future Education
Developement Data Mining Data Analytics."
13. D. Centola, "The spread of behavior in an online social network experiment," Science, vol. 329,
pp. 1194-1197, 2010.
14. L.A. Kurgan and P. Musilek, "A survey of Knowledge Discovery and Data Mining process
models," Knowledge Engineering Review, vol. 21, pp. 1-24, 2006.
15. C. Romero, S. Ventura, and E. García, "Data mining in course management systems: Moodle
case study and tutorial," Computers & Education, vol. 51, pp. 368-384, 2008.
16. S.-H. Liao, P.-H. Chu, and P.-Y. Hsiao, "Data mining techniques and applications–A decade
review from 2000 to 2011," Expert Systems with Applications, vol. 39, pp. 11303-11311, 2012.
17. L. Tsantis and J. Castellani, "Enhancing learning environments through solution-based
knowledge discovery tools: Forecasting for self-perpetuating systemic reform," Journal of
Special Education Technology, vol. 16, pp. 39-52, 2001.
18. C. Romero, S. Ventura, A. Zafra, and P. d. Bra, "Applying Web usage mining for personalizing
hyperlinks in Web-based adaptive educational systems," Computers & Education, vol. 53, pp.
828-840, 2009.
19. C. Romero, S. Ventura, and P. De Bra, "Knowledge discovery with genetic programming for
providing feedback to courseware authors," User Modeling and User-Adapted Interaction, vol.
14, pp. 425-464, 2004.
20. Y. Wang, "Web mining and knowledge discovery of usage patterns," CS 748T Project, 2000.
21. S. Cetintas, L. Si, Y. P. Xin, and C. Hord, "Automatic detection of off-task behaviors in
intelligent tutoring systems with machine learning techniques," Learning Technologies, IEEE
Transactions on, vol. 3, pp. 228-236, 2010.
22. I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques:
Morgan Kaufmann, 2005.
23. C. Romero, S. Ventura, M. Pechenizkiy, and R. S. Baker, Handbook of educational data
mining: Taylor & Francis US, 2011.
24. J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques: Morgan Kaufmann, 2006.
25. C. Romero and S. Ventura, "Educational data mining: a review of the state of the art," Systems,
Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 40, pp.
601-618, 2010.
26. R. Baker and K. Yacef, "The state of educational data mining in 2009: A review and future
visions," Journal of Educational Data Mining, vol. 1, pp. 3-17, 2009.
27. N. Sael, A. Marzak, and H. Behja, "Multilevel clustering and association rule mining for
learners’ profiles analysis," 2013.
28. N. Anozie and B. W. Junker, "Predicting end-of-year accountability assessment scores from
monthly student records in an online tutoring system," in Proceedings of the American
Association for Artificial Intelligence Workshop on Educational Data Mining (AAAI-06), July
17, 2006, Boston, MA, 2006, pp. 1-6.
29. L. Razzaq, M. Feng, N. T. Heffernan, K. R. Koedinger, B. Junker, G. Nuzzo-Jones, et al., "A
web-based authoring tool for intelligent tutors: blending assessment and instructional
assistance," in Intelligent Educational Machines, ed: Springer, 2007, pp. 23-49.
30. A. Peña-Ayala and L. Cárdenas, "How Educational Data Mining Empowers State Policies to
Reform Education: The Mexican Case Study," in Educational Data Mining, ed: Springer, 2014,
pp. 65-101.
31. J.A. Lara, D. Lizcano, M.A. Martínez, J. Pazos, and T. Riera, "A System for Knowledge
Discovery in E-Learning Environments within the European Higher Education Area-
Application to student data from Open University of Madrid, UDIMA," Computers &
Education, 2013.
32. M. J. Berry and G. Linoff, Data mining techniques: For marketing, sales, and customer
support: John Wiley & Sons, Inc., 1997.
33. M.-S. Chen, J. S. Park, and P. S. Yu, "Data mining for path traversal patterns in a web
environment," in Distributed Computing Systems, 1996., Proceedings of the 16th International
Conference on, 1996, pp. 385-392.
34. J. Alcalá-Fdez, L. Sánchez, S. García, M. J. del Jesús, S. Ventura, J. Garrell, et al., "KEEL: a
software tool to assess evolutionary algorithms for data mining problems," Soft Computing, vol.
13, pp. 307-318, 2009.
35. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful
knowledge from volumes of data," Communications of the ACM, vol. 39, pp. 27-34, 1996.
36. G. Siemens and R. S. d. Baker, "Learning analytics and educational data mining: Towards
communication and collaboration," in Proceedings of the 2nd International Conference on
Learning Analytics and Knowledge, 2012, pp. 252-254.
37. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of
association rules," Information Systems, vol. 29, pp. 343-364, 2004.
38. D. Agrawal, S. Das, and A. El Abbadi, "Big data and cloud computing: current state and future
opportunities," in Proceedings of the 14th International Conference on Extending Database
Technology, 2011, pp. 530-533.
39. H. Chen, R. H. Chiang, and V. C. Storey, "Business Intelligence and Analytics: From Big Data
to Big Impact," MIS Quarterly, vol. 36, pp. 1165-1188, 2012.
40. P. Zikopoulos, C. Eaton, D. DeRoos, T. Deutsch, and G. Lapis, "Understanding big data," New
York et al: McGraw-Hill, 2012.
41. M. Bienkowski, M. Feng, and B. Means, "Enhancing teaching and learning through educational
data mining and learning analytics: An issue brief," Washington, DC: SRI International, 2012.
42. R. Nisbet, J. Elder IV, and G. Miner, Handbook of statistical analysis and data mining
applications: Academic Press, 2009.
43. P. Guide, "Getting Started with Big Data," 2013.
44. H. Kalia, S. Dehuri, and A. Ghosh, "A Survey on Fuzzy Association Rule Mining," International
Journal of Data Warehousing and Mining, vol. 9, no. 1, pp. 1-27, 2013.
45. F. Waas, R. Wrembel, T. Freudenreich, M. Thiele, C. Koncilia, and P. Furtado, "On-Demand
ELT Architecture for Right-Time BI: Extending the Vision," International Journal of Data
Warehousing and Mining, vol. 9, no. 2, pp. 21-38, 2013.
46. A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli, J.-N. Mazon, F. Naumann, T. Bach
Pedersen, S. Rizzi, J. Trujillo, P. Vassiliadis, and G. Vossen, "Fusion Cubes: Towards
Self-Service Business Intelligence," International Journal of Data Warehousing and Mining,
vol. 9, no. 2, pp. 66-88, 2013.
47. P. Williams, C. Soares, and J.E. Gilbert, "A Clustering Rule Based Approach for Classification
Problems," International Journal of Data Warehousing and Mining, vol. 8, no. 1, pp. 1-23, 2012.
48. R.V. Priya and A. Vadivel, "User Behaviour Pattern Mining from Weblog," International Journal
of Data Warehousing and Mining, vol. 8, no. 2, pp. 1-22, 2012.
49. T. Kwok, K.A. Smith, S. Lozano, and D. Taniar, "Parallel Fuzzy c-Means Clustering for Large
Data Sets," in Proceedings of the 8th International Euro-Par Conference (Euro-Par 2002), Lecture
Notes in Computer Science, vol. 2400, Springer, 2002, pp. 365-374.
50. O. Daly and D. Taniar, "Exception Rules Mining Based on Negative Association Rules," in
Proceedings of the International Conference on Computational Science and Its Applications
(ICCSA 2004), Part IV, Lecture Notes in Computer Science, vol. 3046, Springer, 2004, pp.
543-552.
51. D. Taniar, W. Rahayu, V.C.S. Lee, and O. Daly, "Exception rules in association rule mining,"
Applied Mathematics and Computation, vol. 205, no. 2, pp. 735-750, 2008.
52. M.Z. Ashrafi, D. Taniar, and K.A. Smith, "Redundant association rules reduction techniques,"
International Journal of Business Intelligence and Data Mining, vol. 2, no. 1, pp. 29-63, 2007.