Scientific Paper-2

Prediction of Student Learning Interests Using Text Analytics

Prethiviraj Elango1, Mithun Rajkumar Antony2 and Krishna Ramanathan3

Faculty of Engineering and IT, University of Technology Sydney

Sydney, Australia

{1Prethiviraj.Elango, 2Mithun.RajkumarAntony, 3Krishna.Ramanathan}@student.uts.edu.au

Abstract – Online collaborative learning is popular because of its advantages over traditional classroom learning. This mode of learning can deliver certain benefits, provided the approach is of good quality. However, there are limitations in using the vast amount of available student data, and there is no clear evidence of that data being put to use for various purposes. The existing literature reports various advantages of using text analytics to enhance patterns of learning in education; building on this literature, this paper proposes a process model to collect and analyze student data from the online learning environment. The proposed work uses a data analytics tool called RapidMiner for text processing to indicate students’ interest in various areas of study based on their available data. Furthermore, this report is based on the proof of concept of a project that is simple enough to target University of Technology Sydney (UTS) and other educational stakeholders.

Keywords – Text analytics, online learning, Prediction accuracy

Roberto Martinez-Maldonado

Connected Intelligence Centre, University of Technology Sydney

Sydney, Australia

[email protected]

I. INTRODUCTION

The online learning environment for students is a significant change in the present day. University of Technology Sydney (UTS) gives students an opportunity for this kind of online engagement through the UTS Online software, which supports collaboration between students and professors. UTS Online has some limitations: students can participate only in the discussion boards of their enrolled subjects, and professors can only post subject updates, publish student marks and upload subject materials. Professors cannot monitor student activities, interests, intentions and so on, as UTS Online does not provide any means to do so.

These limitations can be overcome by a project called CIC Around. This project is currently in progress and is handled by the UTS Connected Intelligence Centre (CIC). CIC operates under UTS and handles multiple projects for UTS, of which CIC Around is one. CIC works at the intersection of human sense-making and computational analysis. Its research covers domains such as education, learning analytics, research analytics, and human-centered and transdisciplinary research, and its main aim is to conduct research that answers open questions in these domains.

In the CIC Around project, UTS CIC is running a participatory design process to build an online WordPress multisite environment that will support student learning, online collaboration with peers, collaboration with industry partners and community building among students. Students can create groups for various study purposes, and professors are given the opportunity to monitor students’ progress. The project will overcome the existing limitations of UTS Online because it is a participatory process that helps students take a more active part in the online learning environment. It will be especially helpful to students who are studying block-mode subjects. Understanding the interests of the wider UTS student community in the online learning environment will help in analyzing student data for the future enhancement of CIC Around.

A proof of concept of UTS CIC Around with the WordPress plugins BuddyPress and bbPress has been performed. Following the proof of concept, a process model for predicting students’ interests in this online learning environment is proposed. The prediction accuracy is based on the rate of students’ interest in areas of learning outside their own. This proposal will help University authorities refine particular courses based on the interest level of students. The data analytics tool RapidMiner has been used, as explained in detail in the following sections.

The rest of the paper is organized as follows: Section II presents the motivation, Section III the methodology, Section IV related work, Section V the existing process, Section VI the proposed process, and Section VII the conclusion and future implications.

II. MOTIVATION FOR THIS RESEARCH

The main aim of this research is to give educational stakeholders a clear insight into using text analytics effectively and efficiently. The objective of this paper is to improve the efficiency of online learning, which brings together students from various locations, based on their interests. As technology advances, educational stakeholders need to use it to enhance the existing pattern of learning. Some Universities have their own processes and norms for enhancing their students’ pattern of learning. However, in many cases Universities cannot predict students’ interests or know exactly what students think of their selected subjects.

A student can choose from a vast number of subjects depending on their selected course. As a simple example, University of Technology Sydney has refined its Information Technology course into four majors, namely Business Information Systems, Data Analytics, Networking and Software Development, based on student participation in its Subject Feedback Survey (SFS). UTS may also do further internal work to enhance its courses.

In the survey, however, students provide feedback only about their enrolled subjects. This is a direct approach without any technical means, and feedback on enrolled subjects does not reveal students’ actual interests. Conducting surveys to learn student interests is also a tedious process: University authorities cannot realistically collect survey data manually to learn those interests and refine their courses. This is the starting point for research in this area, which will be useful for various educational stakeholders.

This research follows a similar idea to the technique described above, but collects student data in an alternative way, from the online learning environment. It will also help overcome the flaw of not knowing a student’s interests. A data analytics tool is used here to make the process involved in this research clear.

On the whole, the ultimate motivation of this research is to predict student interests in various domain areas accurately and to use these predictions to refine the subjects in their course. The research will also provide a better understanding of student data, which will be helpful in analyzing various patterns in future.

III. METHODOLOGY

In this research, many articles related to text analytics were reviewed. Based on the findings, an enhancement to the existing approaches to managing student data is proposed, together with what can be done with student data drawn from online participation. The proposed idea was first shaped around the available student data, taking various factors into account.

The research focuses on two processes. The first is collecting student data from the online learning environment in order to enhance student learning as a whole. The data should be retrieved from the back end through reporting, according to the specifications set by the stakeholders’ purposes. After examining various criteria, text analytics was chosen because it helps in collecting, measuring and analyzing students’ data and in finding similar patterns within it. The second focuses on the data analytics tool RapidMiner, in which bulk student data is processed using its keyword search option. The aim of applying text analytics to student data is to derive high-quality information from the text that students enter. Similar patterns of text are thereby structured, which supports interpretation of the output.

These two processes will be useful in enhancing student learning, so they were adopted and taken forward with a demonstration. Both are described in detail in the proposed process section.

IV. RELATED WORK

Early applications of text mining in higher education were less effective than later ones, as they were not user-friendly and were very expensive. Text mining has several applications, and users tend to prefer different ways of working with a mining tool depending on their knowledge. Text mining also has a great effect in higher education, where teachers can analyze learners’ activities and help them efficiently. It is also used as a major tool to refine the curriculum of a course at a university or under any educational standard.

In “Qualitative Text Mining in Student’s Service Learning Diary”, the author analyzes students’ service-learning activities in order to assess their outcomes from e-learning and to provide reflection to students based on their interaction with e-learning tools such as online discussion boards and online exams. The author also notes that the curriculum of a course can be updated using text mining technologies, which makes the course more refined than a large syllabus with unrelated content. The author also introduces the following computer technologies (Hsu, 2012).

Instructional design

Instructional design provides a blueprint for teaching and a way to examine the teaching standards of every teacher. It is used to identify learners who are at high risk of dropping out of a subject. Once such a learner is identified, a tailored approach and strategies are used to make teaching practice more effective. The authors narrowed down the concept of instructional design in “Designing instructional feedback for different learning outcomes”. They state that an instructional event in which a particular student is selected for motivation has to follow a pre-test, practice and a post-test (Smith et al., 1993).

Text mining prediction

In their book “Text mining predictive methods for analyzing unstructured information”, the authors indicate that data mining technology can find patterns in structured databases but not in semi-structured ones. Hearst observed that data mining alone would not satisfy human needs for learning and teaching information. However, applying text mining with appropriate linguistic and statistical methods to text data helps us obtain new information (Weiss et al., 2005).

The professor followed this research method in the study and says: “Initially apply the instructional design model followed by text mining procedures. The model has to combine 3 aspects of view: professor in action research, student teacher in curriculum and instructional development and design students in motivational learning evaluation”, as shown in the figure below.

Figure 1: Research models in three points of view

In the paper “The Application of Data Mining Technology in Distance Learning Evaluation”, the authors (Ai et al., 2010) list the kinds of knowledge gathered through text mining:

A. Generalized knowledge

This is a very general description of the characteristics of any text that the mining tool can generate (in our case, the mining tool is RapidMiner). It generally reflects the common nature of similar things, refines abstract data and so on.

B. Related knowledge

This knowledge is gathered when one item of data depends on, or is associated with, other similar data.

C. Category knowledge

This is similar to related knowledge, but differs in that the gathered texts are categorized according to different characteristics of the knowledge. The most widely used form of data classification is a tree view.

D. Predictive knowledge

This can also be called future knowledge, which is predicted from past and current data. Common predictive methods are statistical methods, neural networks and machine learning.

E. Bias-based knowledge

This is exceptional knowledge, gathered as a description of the differences between the characteristics of attributes.

They also describe the use of E-Portfolio with text mining as an application to evaluate students’ learning behavior. Used by itself, E-Portfolio is an inefficient way to evaluate learning behavior because it is evaluated manually by the teacher, and it is also limited in handling large numbers of students. The figure below shows that text mining used with E-Portfolio helps the teacher gather knowledge about the learning objectives associated with the analysis. Through the recorded set of mined data, the teacher can easily understand the regulatory standards and analyze the results of students’ learning behaviors, which further increases the efficiency of learning evaluation (Ai et al., 2010).

Figure 2: Application of data mining technology in E-Portfolio

The MCMS (Mining Course Management Systems) project at Thames Valley University recommends building a knowledge management system based on data mining. Data mining techniques are applied to track individual student performance and to refine the curriculum according to student activity. Text mining is used to present the data mined by MCMS in a human-understandable way for better decision making (Oussena, 2008).

Model-driven data integration is applied in MCMS to bring data from different systems into a single data warehouse for analysis (Kim et al., 2009). The data in the warehouse should always be pre-processed and transformed before any mining technique is applied; having the data ready in this way increases the efficiency of the data mining process. The knowledge gathered is then used by the university to take a more advanced approach to predicting individual behavior and instructing students. Text mining is applied here to narrow down students’ interaction with the online learning (e-learning) tool. When a knowledge management system and a text mining process are used together, a university achieves a high level of data efficiency, which helps it choose the most advanced approach to understanding its students’ needs.

Figure 3: Workflow of MCMS

Gabrilson determines students’ test scores with a data mining prediction technique that uses an effective factor; this factor is later adjusted according to the student’s performance in the succeeding year (Gabrilson, 2003). Luan groups students into two categories: those who can easily deal with their courses and those who take longer to complete a course (Luan, 2002). Such grouping helps universities make better decisions on refining their curriculum, teaching time and so on. To understand the factors that determine student retention, universities usually collect data about a student’s academic history, behavior and perceptions; for instance, Superby et al. used different classifiers to predict student characteristics, which yielded very low accuracy (Superby et al., 2006). In “Use Data Mining To Improve Student Retention In Higher Education”, the authors describe student retention as the biggest challenge, as it determines better academic programs and better revenue for universities (Oussena et al., 2010). A simple formula for student retention was developed by Seidman (Seidman, 1996):

Retention = Early Identification + (Early + Intensive + Continuous) Intervention

This formula shows that early detection of at-risk students, together with regular interaction, is the most recommended way to increase student retention. Tinto has provided five strategies to raise student retention to the next level:

• Understanding the expectations of the student.

• Conducting counselling sessions to help students choose their courses.

• Providing academic and social support, especially before the start of the first semester.

• Motivating students by explaining their capabilities.

• Active interaction with the available learning sources.

Dhanalakshmi et al. introduce the idea of opinion mining from student feedback data. As stakeholders’ opinions are a major factor in individual decision making, the authors use this technique to understand their students better and to refine the curriculum. The result of opinion mining depends on how well the data is preprocessed, that is, on the stages the data goes through when it is prepared before classification (Dhanalakshmi et al., 2016).

In another study, linear regression was used to identify the variable associated with academic performance, which showed that previous academic performance was the most important variable (Oussena, 2008).

V. EXISTING PROCESS

In general, the existing text analytics process turns unstructured information into structured information and extracts meaningful information from the entered text; the information contained in the text is then used by various data mining algorithms. Information is extracted by summarizing the number of words in the document. The summarized words can then be analyzed to find similarities and relationships between them. The most common method in text analytics is to convert the text to numbers for clustering and predictive data mining, and this conversion is also helpful in various other analyses. Text mining also includes sentiment analysis, document summarization, entity relation modelling, text clustering and text categorization. The figure below gives an overall view of the text analytics process:

Figure 4: Text analytics process
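The word-counting and similarity steps of the existing process can be illustrated with a minimal Python sketch. This is not part of the RapidMiner workflow used later; the function names `word_counts` and `cosine_similarity` and the three sample posts are assumptions made purely for illustration, and cosine similarity is only one of several possible similarity measures.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lower-case the text and count each alphabetic token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Dummy posts, for illustration only.
post_1 = word_counts("Marketing analytics is a great topic")
post_2 = word_counts("I enjoy marketing and analytics")
post_3 = word_counts("Routers forward packets")

print(cosine_similarity(post_1, post_2))  # shares two words with post_1
print(cosine_similarity(post_1, post_3))  # shares no words with post_1
```

Turning each text into a vector of counts is what makes unstructured text comparable: posts that share vocabulary score above zero, while posts with no words in common score exactly zero.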

VI. PROPOSED PROCESS

This proposal illustrates the use of text analytics with student data and builds on the existing text analytics process. As we are dealing with student data from the online learning environment, the first thing we need to do is collect student information such as their posts, their comments, their participation in discussions and micro-level information such as the pages they visit, the pages they like and the topics they are most interested in. All the data we collect from the relational databases will be in an unstructured format and will be retrieved as documents. To make it structured we can use the vector representation feature, which brings those documents into a common database where they are converted into a structured format. Collecting this structured data is very important because we are going to look for similar patterns and relationships in it. The main purpose is to find a student’s interest in subjects that are not part of their course.

Example: A student is enrolled in the Information Technology course but is more interested in marketing-related topics. If that student participates in more and more marketing-related activities, we can conclude that this Information Technology student is also interested in marketing subjects. Many other Information Technology students might share this interest. If so, it becomes clear that a considerable number of Information Technology students are interested in marketing. By identifying these similarities and patterns, Universities have the opportunity to refine the Information Technology course by including marketing subjects. Likewise, many students in one particular course may be equally interested in other areas. With the help of text analytics, the course can therefore be refined periodically according to current trends, circumstances and student behavior.
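The reasoning in the example above can be sketched in a few lines of Python. This is a hedged illustration rather than the proposed implementation: the keyword set `MARKETING_TERMS`, the function `interest_share`, the tuple layout of a post and the sample data are all hypothetical, and simple keyword matching stands in for whatever topic detection a real deployment would use.

```python
import re
from collections import defaultdict

# Hypothetical keyword list for one topic area; a real list would need curating.
MARKETING_TERMS = {"marketing", "branding", "advertising"}


def interest_share(posts, topic_terms):
    """Return, per course, the fraction of students with at least one post
    that contains a topic term.

    posts: iterable of (student_id, course, text) tuples.
    """
    students = defaultdict(set)    # course -> every student seen
    interested = defaultdict(set)  # course -> students who mentioned the topic
    for student_id, course, text in posts:
        students[course].add(student_id)
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        if tokens & topic_terms:
            interested[course].add(student_id)
    return {c: len(interested[c]) / len(students[c]) for c in students}


# Dummy data, for illustration only.
posts = [
    ("s1", "Information Technology", "Loved the talk on digital marketing"),
    ("s2", "Information Technology", "Any advice on database indexing?"),
    ("s3", "Information Technology", "Branding matters for startups"),
    ("s4", "Telecommunication", "5G antenna design notes"),
]
shares = interest_share(posts, MARKETING_TERMS)
print(shares)
```

Counting interested students rather than raw word occurrences keeps one very active student from dominating the result, which matters if the figure is meant to support a decision about the whole course.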

Demonstration: To predict students’ interest in different areas of the online learning environment, we use the RapidMiner software platform. It is open source software that is useful for machine learning, business analysis, text analysis, predictive analysis and data mining. On this platform we demonstrate how effective the text mining process is on data from the online learning environment. Once RapidMiner is installed, we load the student information extracted from the online learning environment into RapidMiner. The extraction can be done from any Business Intelligence tool, such as online analytical processing or data warehousing. Before loading the extracted file, we look for the extension needed for text processing by clicking the Extensions icon, as in the screenshot below:

Figure 5: RapidMiner Extensions

After clicking the Extensions icon, we install the text processing package. Once it is installed, the next step is to drag the Process Documents from Files operator from the text processing package and drop it onto the work area, as shown below:

Figure 6: Dropping Process Documents from Files to the RapidMiner workspace

After this, we select the parameters for the Process Documents from Files operator. This selection is shown in the screenshot below:

Figure 7: Parameter selection

In the above screenshot, under text directories we provide the file path on the local computer. Here, we compare two extracted files of student data from the online learning environment. The data used is dummy data for demonstration purposes. One file holds student data from the Information Technology department and the other holds student data from the Telecommunication department. The extraction is based on student information, online participation, intentions, the topics students are most interested in, the pages they like and so on. Both sets of student data are loaded into the RapidMiner tool as in the screenshot below:

Figure 8: Loading dummy student data
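For readers who want to reproduce the loading step outside RapidMiner, the following Python sketch reads text files from one directory per department and tags each document with its department. It assumes that each directory is paired with a class label and that the files are plain `.txt` files; the function name `load_corpus`, the directory names and the file contents are hypothetical, and the temporary directory exists only so that the example runs on its own.

```python
import tempfile
from pathlib import Path


def load_corpus(text_directories):
    """Read every .txt file in each directory and tag it with its class label.

    text_directories: dict mapping a class label to a directory path.
    Returns a list of (label, file_name, text) tuples.
    """
    corpus = []
    for label, directory in text_directories.items():
        for path in sorted(Path(directory).glob("*.txt")):
            corpus.append((label, path.name, path.read_text(encoding="utf-8")))
    return corpus


# Build two dummy department folders so the sketch is self-contained.
with tempfile.TemporaryDirectory() as root:
    it_dir = Path(root) / "it"
    tel_dir = Path(root) / "telecom"
    it_dir.mkdir()
    tel_dir.mkdir()
    (it_dir / "posts.txt").write_text("marketing analytics", encoding="utf-8")
    (tel_dir / "posts.txt").write_text("antenna design", encoding="utf-8")
    corpus = load_corpus(
        {"Information Technology": it_dir, "Telecommunication": tel_dir}
    )

print(corpus)
```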

Once the dummy student data has been loaded, we need to select an option for vector creation. Figure 4 shows the vector creation. Once the file is loaded, we need to specify which kind of vector creation is to be done. Documents are represented by vectors. When the texts are processed, they start as an unstructured, ordered list of pairs, which is then converted into a structured form with the help of the document vector model. This conversion is done by counting the words in the documents. There are four options for counting words, explained below:

Binary Term Occurrences: This is the simplest option; it records whether or not the selected word is present in the document.

Term Occurrences: This option is related to binary term occurrences; it counts how often a word occurs in a document.

Term Frequency: This gives the fraction of the document’s length that is taken up by the particular term.

TF-IDF: This is the most advanced option in the RapidMiner tool and stands for term frequency-inverse document frequency. Term frequency is as explained above. Inverse document frequency is based on the document frequency, which is the number of documents that a word occurs in, and is used to determine how characteristic a word is. In our demonstration we have selected this option, which performs both tasks together.

The next step is to decide which process should run inside the loop. The process we have selected is Tokenization, whose main purpose is to cut the text into individual terms or words. Different separators can be used, as highlighted in Figure 9.
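The four vector-creation options described above can be compared side by side in a short Python sketch. It uses plain textbook definitions, with TF-IDF taken as term frequency multiplied by the natural logarithm of (number of documents / document frequency); this is an assumption, and RapidMiner may normalise or smooth its vectors differently. The function `vectorise`, its scheme names and the two token lists are hypothetical.

```python
import math
from collections import Counter


def vectorise(docs, scheme="tf-idf"):
    """Turn token lists into row vectors over a shared, sorted vocabulary.

    docs: list of token lists. Returns (vocabulary, rows).
    """
    vocab = sorted({t for d in docs for t in d})
    counts = [Counter(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    rows = []
    for d, c in zip(docs, counts):
        row = []
        for t in vocab:
            tf = c[t] / len(d) if d else 0.0
            if scheme == "binary":
                value = 1.0 if c[t] > 0 else 0.0
            elif scheme == "occurrences":
                value = float(c[t])
            elif scheme == "tf":
                value = tf
            elif scheme == "tf-idf":
                value = tf * math.log(n_docs / df[t])
            else:
                raise ValueError(f"unknown scheme: {scheme}")
            row.append(value)
        rows.append(row)
    return vocab, rows


# Dummy token lists, for illustration only.
docs = [
    ["marketing", "database", "marketing", "cloud"],  # Information Technology
    ["antenna", "cloud", "signal", "signal"],         # Telecommunication
]
vocab, tfidf_rows = vectorise(docs, "tf-idf")
print(vocab)
print(tfidf_rows[0])
```

In this toy data the word "cloud" occurs in both documents, so its TF-IDF weight is zero, while "marketing" occurs only in the first document and keeps a positive weight. This is the sense in which inverse document frequency picks out characteristic words.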

Figure 9: Selection of a separator

A number of separators are available in the RapidMiner tool. The first is non letters, which splits on white space, punctuation, symbols and so on. The next is specify characters, in which we can choose the separator characters ourselves. Apart from these two, we can also use separators such as regular expression, linguistic sentences and linguistic tokens. In our demonstration we have selected the non letters separator. We can also perform further operations under text processing. For instance, we have selected the filtering option called Filter Stopwords (English), which helps to remove articles, conjunctions, pronouns and so on. As we are performing multiple text processing operations in RapidMiner, we have to set the break after option on our second and third operations, Tokenization and Filter Stopwords respectively. The next step is to run the selected operations in RapidMiner. Once they have run, we can see the original text next to the processed text, as in Figure 10.
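The effect of the non letters separator followed by stop word filtering can be approximated in Python as below. This is a sketch under stated assumptions: only ASCII letters are treated as letters, tokens are lower-cased so that they match the stop word list (a step the text above does not mention), and `STOP_WORDS` is a tiny illustrative list, not the list RapidMiner uses. The function names are hypothetical.

```python
import re

# A small illustrative stop word list; real English lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "i", "it", "on"}


def tokenize_non_letters(text):
    """Split on every run of non-letter characters and lower-case the tokens."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list, keeping the order."""
    return [t for t in tokens if t not in stop_words]


raw = "I enjoyed the Marketing workshop, and the 2 guest talks!"
tokens = tokenize_non_letters(raw)
processed = filter_stop_words(tokens)

print(tokens)
print(processed)
```

Because digits and punctuation count as non-letters, the "2" and the trailing "!" disappear during tokenization, before any stop word is removed.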

Figure 10: Outcome of text analytics

The color changes from word to word because we have used the tokenizer, which separates the text into individual words and terms. The same procedure can be repeated for every document. In our demonstration we have used two files containing student data from the Information Technology and Telecommunication departments. The output also includes the example set, which consists of one row for each document and one column for each word. In addition, some meta information is provided, such as the file information, file date, extension path and the group or class to which the document belongs, held in the label attribute. If we want to generate a classification model, this is possible with the classification models available within RapidMiner. In our demonstration we have used the Naïve Bayes classification model, which is available in the modelling package of the RapidMiner tool. Once the classification model is selected, the RapidMiner working area looks like the screenshot below.

Figure 11: Selecting a classification model

After adding the classification model, we can view the output in different ways, such as table view, plot view and distribution table. The view we have selected here is plot view, which compares the word counts in the Information Technology student data with those in the Telecommunication student data. From the overall extraction of almost all student data, we have compared only two departments’ student data to learn their interest in the Marketing area. On selecting the word marketing, we can conclude that more Information Technology students are interested in the marketing area, as the graph shows. Knowing this, University authorities can refine the Information Technology subjects by adding some Marketing subjects to the curriculum.

Figure 12: Plot view of processed text data

Thus, with the help of text processing it is easier to identify students’ interests in the online learning environment. Similarly, we can compare various patterns among students according to the university’s specifications.
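To make the classification step more concrete, the following Python sketch implements a small multinomial Naïve Bayes text classifier with add-one smoothing. It is an illustration of the general technique, not a reproduction of the RapidMiner operator, which may model the attributes differently. The class name `NaiveBayesText`, the department labels used as classes and the training tokens are all dummy assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # class -> word counts
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Dummy training data, for illustration only.
train_docs = [
    ["marketing", "database", "cloud"],
    ["software", "marketing", "agile"],
    ["antenna", "signal", "spectrum"],
    ["signal", "router", "antenna"],
]
train_labels = [
    "Information Technology",
    "Information Technology",
    "Telecommunication",
    "Telecommunication",
]
model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict(["marketing", "cloud"]))
print(model.predict(["signal", "antenna"]))
```

Working with log probabilities avoids numerical underflow on long documents, and add-one smoothing prevents a single word that is unseen for one class from forcing that class's probability to zero.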

VII. CONCLUSION & FUTURE IMPLICATIONS

The proposed research gives a clear insight into using the available student data effectively and efficiently. The attributes discussed in this research will benefit educational stakeholders by letting them focus more on students’ academics based on the students’ predicted interests. The prediction of students’ interests depends largely on their online participation, which will also help in producing valuable outcomes if research is carried out in similar areas. Future implications include evaluating the performance of students individually, so that the lecturer can give the needed assistance to a particular student; improving study materials; and, in some cases, evaluating the performance of the lecturer. For these purposes, learning analytics tools that focus solely on individual enhancement of learning can be used.

VIII. REFERENCES

Ai Yubing, Zhang Jianping, 2010. ‘The Application of Data Mining Technology in Distance Learning Evaluation’. International Forum on Information Technology.

Cristianini, N., Shawe-Taylor, J., 2000. ‘An Introduction to Support Vector Machines and other kernel-based learning methods’. Cambridge University Press.

Dhanalakshmi, V., Dhivya Bino, 2016. ‘Opinion mining from student feedback data using supervised learning algorithms’. 3rd MEC International Conference on Big Data and Smart City.

Gabrilson, S., Fabro, D. D. M., Valduriez, P., 2008. ‘Towards the efficient development of model transformations using model weaving and matching transformations’. Office of Information Technology, Georgia Department of Education.

Hsu Chia-Ling, 2012. ‘Qualitative Text Mining in Student’s Service Learning Diary’. Third International Conference on Innovations in Bio-Inspired Computing and Applications.

Kim, H., Zhang, Y., Oussena, S., Clark, T., 2009. ‘A Case Study on Model Driven Data Integration for Data Centric Software Development’. In Proceedings of ACM First International Workshop on Data-intensive Software Management and Mining.

Luan, J., 2002. ‘Data mining and knowledge management in higher education – potential applications’. In Proceedings of AIR Forum, Toronto, Canada.

Mazon, J. N., Trujillo, J., Serrano, M., Piattini, M., 2005. ‘Applying MDA to the development of data warehouses’. DOLAP 2005.

Oussena, S., 2008. ‘Mining Courses Management Systems’. Thames Valley University.

Smith, P. L., Ragan, T. J., 1993. ‘Instructional design’. Macmillan, New York.

Pathros Ibarra García, E., 2011. ‘Model Prediction of Academic Performance for First Year Students’. Mexican International Conference.

Weiss, S. M., Indurkhya, N., Zhang, T., Damerau, F., 2005. ‘Text mining predictive methods for analyzing unstructured information’. Springer Science+Business Media, Inc., New York.

Schönbrunn, K., Hilbert, A., 2006. ‘Data Mining in Higher Education’. Advances in Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, Proceedings of the 30th Annual Conference of the Gesellschaft für Klassifikation e.V., Berlin.

Seidman, A., 1996. ‘Retention Revisited: RET = E Id + (E + I + C)Iv’. College and University, 71(4), Spring, pp. 18-20.

National Audit Office, 2007. ‘Staying the course: the retention of students in higher education’.

Superby, J. F., Vandamme, J-P., Meskens, N., 2006. ‘Determination of factors influencing the achievement of the first-year university students using data mining methods’. Workshop on Educational Data Mining.

Tinto, V., 2000. ‘Taking student retention seriously: rethinking the first year of college’. NACADA Journal, Vol. 19 No. 2, pp. 5-10.

Thomas, L., 2002. ‘Student retention in higher education: the role of institutional habitus’. Journal of Education Policy, Vol. 17 No. 4, August, pp. 423-442.

Yorke, M., Longden, B., 2004. ‘Retention and student success in higher education’. Society for Research in Higher Education.