time table scheduling in data mining

  • 7/21/2019 time table scheduling in data mining

    Chapter 1

    INTRODUCTION

    This chapter introduces timetabling. It describes the basic concepts and types of timetable, followed by the objectives of this study and the thesis outline.

    1.1 Introduction to Data Mining

    Data mining is the process of analyzing data held in large databases. As the name suggests, it means searching for valuable information in a large database. Data mining is also known as knowledge discovery.

    Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

    Data mining uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. The first and simplest analytical step is to describe the data: summarize its statistical attributes (such as means and standard deviations), review it visually using charts and graphs, and look for potentially meaningful links among variables (such as values that often occur together). But data description alone cannot provide an action plan. A predictive model must be built from patterns determined from known results and then tested on results outside the original sample. A good model should never be confused with reality (a road map isn't a perfect representation of the actual road), but it can be a useful guide to understanding a business. The final step is to verify the model empirically. For example, from a database of customers who have already responded to a particular offer, one can build a model predicting which prospects are likeliest to respond to the same offer.

    1.1.1 Importance of Data Mining

    Data mining can be described simply as a process that involves searching, collecting, filtering and analyzing data. This is not the standard or accepted definition, but it covers the whole process. Large amounts of data can be retrieved from various websites and databases, in the form of data relationships, correlations and patterns. With the advent of computers, the internet and large databases, it has become possible to collect large amounts of data. The collected data can be analyzed to identify relationships and find solutions to existing problems. Governments, private companies, large organizations and businesses of all kinds collect large volumes of data for business and research purposes. The data can be stored for future use, so that the information is available whenever it is required. Note that finding and searching for information across websites, databases and other internet sources may take a long time.

    1.1.2 How does Data mining work?

    While large-scale information technology has been evolving separate transaction and

    analytical systems, data mining provides the link between the two. Data mining software

    analyzes relationships and patterns in stored transaction data based on open-ended user queries.

    Several types of analytical software are available: statistical, machine learning, and neural

    networks. Data mining consists of five major elements:

    1. Extract, transform, and load transaction data onto the data warehouse system.
    2. Store and manage the data in a multidimensional database system.
    3. Provide data access to business analysts and information technology professionals.
    4. Analyze the data by application software.
    5. Present the data in a useful format, such as a graph or table.

    1.1.3 Data Mining (KDD) Process

    1. Understand the application domain
    2. Identify data sources and select target data
    3. Pre-process: cleaning, attribute selection
    4. Data mining to extract patterns or models
    5. Post-process: identifying interesting or useful patterns
    6. Incorporate patterns in real world tasks

    Figure 1.1 Data mining process

    1.1.4 Data Mining Techniques

    a) Classification

    Classification consists of examining the features of a newly presented object and assigning it to a predefined class. The classification task is characterized by well-defined classes and a training set consisting of pre-classified examples. The task is to build a model that can be applied to unclassified data in order to classify it. Examples of classification tasks include:

    • Classification of credit applicants as low, medium or high risk
    • Classification of mushrooms as edible or poisonous
    • Determination of which home telephone lines are used for internet access
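
    As a rough illustration of building a model from pre-classified examples, the sketch below uses a one-nearest-neighbour rule, which is only one of many possible classifiers and is not prescribed by the text. The function name, the two features (income and debt) and the training values are all made up for the example.

```python
import math

def nearest_neighbour_classify(training_set, features):
    """Assign `features` the class of the closest pre-classified example."""
    closest = min(training_set, key=lambda example: math.dist(example[0], features))
    return closest[1]

# Hypothetical training set: (income, debt) in arbitrary units -> risk class.
training_set = [
    ((60.0, 5.0), "low"),
    ((40.0, 15.0), "medium"),
    ((20.0, 30.0), "high"),
]

# Classify two previously unseen (made-up) credit applicants.
risk_a = nearest_neighbour_classify(training_set, (55.0, 8.0))
risk_b = nearest_neighbour_classify(training_set, (22.0, 28.0))
print(risk_a, risk_b)
```

    The classes are fixed in advance by the training set; the model only decides which of them a new object belongs to.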

    b) Clustering

    Clustering is the task of segmenting a diverse group into a number of similar subgroups or clusters. What distinguishes clustering from classification is that clustering does not rely on predefined classes: the records are grouped together on the basis of self-similarity. Clustering is often done as a prelude to some other form of data mining or modeling. For example, clustering might be the first step in a market segmentation effort: instead of trying to come up with a one-size-fits-all rule for which promotion works best, the customers are first divided into clusters, and the question of what kind of promotion works best is then asked for each cluster.
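
    Grouping records by self-similarity can be sketched with the standard k-means procedure (Lloyd's algorithm). This is the plain textbook version, not the modified k-means algorithm this thesis proposes; the function name, the naive initialisation and the sample points are assumptions made for illustration.

```python
import math

def k_means(points, k, iterations=20):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive init: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

# Made-up 2-D records forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

    No class labels are supplied; the two groups emerge from the distances between the records alone.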

    c) Association Rules

    An association rule is a rule which implies certain association relationships among a set of objects (such as "occur together" or "one implies the other") in a database. Given a set of transactions, where each transaction is a set of literals (called items), an association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y. An example of an association rule is: "30% of farmers that grow wheat also grow pulses; 2% of all farmers grow both of these items". Here 30% is called the confidence of the rule, and 2% the support of the rule. The problem is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
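
    The two measures can be computed directly from their definitions. The sketch below does so for a made-up set of ten transactions (not the farmer figures quoted above); the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing X that also contain Y, for X => Y."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up data: each set lists the crops grown by one farmer.
transactions = [
    {"wheat", "pulses"}, {"wheat", "pulses"}, {"wheat"}, {"wheat", "rice"}, {"wheat"},
    {"rice"}, {"rice", "pulses"}, {"maize"}, {"maize", "rice"}, {"pulses"},
]

rule_support = support(transactions, {"wheat", "pulses"})          # 2 of 10 farmers
rule_confidence = confidence(transactions, {"wheat"}, {"pulses"})  # 2 of 5 wheat growers
print(rule_support, rule_confidence)
```

    A rule-mining algorithm would keep only those rules whose support and confidence both reach the user-specified minimums.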

    d) Regression

    Regression is a data mining (machine learning) technique that fits an equation to a dataset in order to predict a number. Age, weight, distance, temperature, income, or sales could all be predicted using regression techniques. The simplest form of regression, linear regression, uses the formula of a straight line (y = mx + b) and determines the appropriate values for m and b to predict the value of y based upon a given value of x. Regression functions are used to determine the relationship between the dependent variable (target field) and one or more independent variables. The dependent variable is the one whose values you want to predict, whereas the independent variables are the variables that you base your prediction on.
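
    For the straight-line case, m and b can be obtained with the ordinary least-squares formulas. The sketch below assumes a single independent variable and uses made-up data points; the function name is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Made-up observations of the independent (xs) and dependent (ys) variable.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)

predicted = m * 5.0 + b  # predict y for a new value x = 5
print(m, b, predicted)
```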

    1.1.5 Advantages of data mining

    • Provides new knowledge from existing data
        o Public databases
        o Government sources
        o Company databases
        o Old data can be used to develop new knowledge
    • New knowledge can be used to improve services or products
    • Improvements lead to:
        o Bigger profits
        o More efficient service

    1.1.6 Disadvantages of data mining

    • User privacy/security
    • Amount of data is overwhelming
    • Great cost at implementation stage
    • Possible misuse of information
    • Possible inaccuracy of data

    1.2 Scheduling and Timetabling

    1.2.1 Scheduling

    Scheduling is one of the important tasks encountered in real-life situations. Scheduling problems come in many forms, such as personnel scheduling, production scheduling and educational timetable scheduling. Educational timetable scheduling is difficult because of the many constraints that must be satisfied to obtain a feasible solution. The problem is known to be NP-hard. NP stands for nondeterministic polynomial time, and NP-hardness means that no exact algorithm is known that solves timetable scheduling problems in polynomial time. Methodologies such as Genetic Algorithms (GAs) and Evolutionary Algorithms (EAs) have been used with mixed success.

    Scheduling theory is concerned with the optimal allocation of scarce resources to activities over time. The practice of this field dates back to the first time two humans contended for a shared resource and developed a plan to share it without bloodshed. The theory of the design of algorithms for scheduling is younger, but still has a significant history: the earliest papers in the field were published more than 40 years ago.

    Scheduling problems arise in a variety of settings, as illustrated by the following examples:

    • Consider the central processing unit of a computer that must process a sequence of jobs that arrive over time.
    • Consider a team of five astronauts preparing for the reentry of their space shuttle into the atmosphere.
    • Consider a factory that produces different sorts of gadgets. Each gadget must first be processed by machine 1, then machine 2, then machine 3, where different gadgets require different amounts of processing time on different machines.
    • Consider an academic environment, which requires the scheduling of a given set of courses and meetings between students and lecturers. Each course takes place in a particular hall and each hall has its capacity. We must also make sure that students or lecturers are not booked for more than one appointment at the same time.

    1.2.2 Scheduling of timetabling

    The general area of scheduling has been the subject of intense research for a number of decades.

    Scheduling and timetabling are typically viewed as two separate activities, with the term

    scheduling used as a generic term to cover specific types of problems in this area. Consequently,

    timetable constructions can be considered as a special case of generic scheduling activity.

    In the most general terms, scheduling can be described as the constrained allocation of resources

    to objects, being placed in space-time in such a way that the total cost of a set of the resources

    used can be minimized. Examples of this problem set can be seen in transport scheduling and

    delivery vehicle routing, where the business-driven objective is to minimize the total cost function.

    Timetable construction is the allocation, subject to constraints, of given resources to objects

    being placed in space-time in such a way as to satisfy or nearly satisfy a desirable set of possible

    objectives. Class timetables and exam timetables are examples of these problems where all hard

    constraints must be satisfied to generate a valid solution. [1]

    Thus the term scheduling covers all aspects of the activity of allocating resources and, at the

    same time, satisfying some predetermined objective. However, due to the enormity of the

    problem, it becomes necessary to classify the scheduling problem into specialized activities such

    as timetabling. Thus, in practical terms the timetabling problem can be described as scheduling a

    sequence of lectures between teachers and students in a prefixed time period (typically a week),

    satisfying a set of varying constraints.

    1.3 What is Timetabling?

    Timetabling problems are a specific type of scheduling problem and are mainly concerned with the assignment of events to timeslots subject to constraints, with the resultant solution constituting a timetable. Wren (1996) defined timetabling in the following way:

        "Timetabling is the allocation, subject to constraints, of given resources to objects being placed in space time, in such a way as to satisfy as nearly as possible a set of desirable objectives."

    Based on the definition given by Wren (1996), we need to know whether sufficient resources are available for a given event to take place at its specified time, as well as which resources are allocated. The goal is to optimize some objective function that depends on the application domain at hand. For example, in examination timetabling the function to optimise is usually the gap between two examinations that a student has to sit, i.e. the aim is to spread each student's examinations throughout the examination period. The basic terminology used in timetabling problems is summarised in Table 1.1.

    Table 1.1 Basic terminology used in timetabling

    Terminology         Definition
    Event               An activity to be scheduled. Examples include examinations and courses.
    Timeslot (period)   An interval of time in which events can be scheduled.
    Resource            Resources required by events. Examples include rooms and equipment (e.g. projectors).
    Constraint          A restriction on scheduling the events. Examples include room capacity and specific timeslots.
    Individual          A person who has to attend the events.
    Conflict            Two events clash with each other if they have at least one individual in common and are scheduled in the same timeslot.
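
    The definition of a conflict in Table 1.1 translates almost directly into code. The sketch below assumes that each event carries a set of individuals and a timeslot; the data structures, names and sample events are hypothetical.

```python
from itertools import combinations

def find_conflicts(attendees, timeslot_of):
    """Return pairs of events that share at least one individual and a timeslot."""
    conflicts = []
    for a, b in combinations(sorted(attendees), 2):
        if timeslot_of[a] == timeslot_of[b] and attendees[a] & attendees[b]:
            conflicts.append((a, b))
    return conflicts

# Made-up events: who must attend each one, and when it is scheduled.
attendees = {
    "Maths": {"ann", "bob"},
    "Physics": {"bob", "cat"},
    "Art": {"dan"},
}
timeslot_of = {"Maths": "Mon-09", "Physics": "Mon-09", "Art": "Mon-09"}

conflicts = find_conflicts(attendees, timeslot_of)
print(conflicts)
```

    Here Maths and Physics clash because bob has to attend both in the same timeslot, while Art shares the timeslot without sharing an individual and therefore does not clash.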

    The constraints in timetabling can be divided into two categories: hard and soft. Hard constraints

    cannot be violated. Soft constraints are not essential but their satisfaction is highly desirable in

    order to produce a good quality timetable.

    1.3.1 Hard Constraints: Hard constraints [9] [15] are constraints that cannot physically be violated; a timetable that violates any of them can never be acceptable. For example, a lecturer cannot be in two places at once. The hard constraints are listed below:

    1. Classrooms must not be double booked.

    2. Every class must be scheduled exactly once.

    3. Lecturers must not be double booked.

    4. A lecturer must not be booked when he/she is unavailable.

    5. Some classes need to be held consecutively, for example labs.

    6. Some classes require particular rooms; for example, experiments must be held in particular laboratories.

    7. Classrooms must be large enough to hold the class scheduled in it.
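
    A feasibility check for some of these constraints can be sketched as follows. The sketch covers only constraints 1, 3, 4 and 7; the `Booking` record, the function name and the sample data are assumptions made for illustration and do not describe the system developed in this thesis.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Booking:
    # Hypothetical record for one scheduled class.
    course: str
    lecturer: str
    room: str
    timeslot: str
    class_size: int

def hard_violations(bookings, room_capacity, unavailable):
    """List violations of hard constraints 1, 3, 4 and 7 as readable strings."""
    violations = []
    room_use = Counter((b.room, b.timeslot) for b in bookings)
    lecturer_use = Counter((b.lecturer, b.timeslot) for b in bookings)
    violations += [f"room {r} double booked at {t}"
                   for (r, t), n in room_use.items() if n > 1]
    violations += [f"lecturer {name} double booked at {t}"
                   for (name, t), n in lecturer_use.items() if n > 1]
    for b in bookings:
        if b.timeslot in unavailable.get(b.lecturer, set()):
            violations.append(f"lecturer {b.lecturer} unavailable at {b.timeslot}")
        if b.class_size > room_capacity[b.room]:
            violations.append(f"room {b.room} too small for {b.course}")
    return violations

# Made-up timetable with three deliberate problems.
bookings = [
    Booking("Algebra", "Smith", "R1", "Mon-09", 30),
    Booking("Physics", "Jones", "R1", "Mon-09", 25),    # R1 is already taken
    Booking("Chemistry", "Smith", "Lab", "Tue-10", 50),  # Lab only seats 40
]
room_capacity = {"R1": 40, "Lab": 40}
unavailable = {"Jones": {"Mon-09"}}

violations = hard_violations(bookings, room_capacity, unavailable)
print(violations)
```

    A timetable is feasible with respect to the checked constraints only when the returned list is empty.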

    1.3.2 Soft constraints: Some constraints [9] [15] are less straightforward to define. Usually, these constraints should be fulfilled as far as possible. A timetable that violates them is still usable, but it is inconvenient for students or teachers. The soft constraints are listed below:

    1. Teachers may prefer specific time slots.

    2. Teachers may prefer specific rooms.

    3. Certain kinds of subjects should not be in contiguous time slots.

    4. Some lecturers do not wish to have classes assigned consecutively in time.

    5. There are preferred hours in which a lecturer's classes might be scheduled.

    6. Most students and some lecturers do not wish to have empty periods in their timetables.

    7. Classes should be distributed evenly over the week.

    8. Classrooms that are much larger than the class should not be booked.

    9. More than one member of staff might need to be assigned to a particular class.

    It is desirable for a timetable to satisfy all hard and soft constraints. In practice it is usually difficult to meet all of them: hard constraints must not be violated in any case, so some soft constraints are sacrificed in order to find feasible timetables.

    1.4 Classification of Educational Timetabling Problems

    Schaerf (1999a) classified educational timetabling into three main classes: school timetabling, course timetabling and examination timetabling. They share the basic characteristics of the general timetabling problem, but there can still be significant differences between them. Each has its own constraints, requirements and rules. More details on educational timetabling can be found in Burke et al. (2004e). This section discusses a classification of educational timetabling and its properties. Educational timetabling is divided here into two categories: school timetabling and university timetabling (which consists of examination timetabling and course timetabling).

    1.4.1 School Timetabling

    The school timetabling problem is concerned with the weekly scheduling of all the lessons of a school. The problem consists of a set of teachers, classes, subjects/lessons and predefined weekly periods. The task is to assign lessons to periods, and a teacher to a particular class at a given time, while satisfying a set of constraints in order to produce a feasible timetable. Examples of constraints in the school timetabling problem are capacities, locations, teacher loads, rest time between two lessons and other personal preferences. Examples of research on school timetabling can be found in Abramson (1991), who employed simulated annealing; Carrasco and Pato (2001), who employed a multi-objective genetic algorithm; and Legierski (2003), who applied a constraint-based approach.

    1.4.2 University Timetabling

    The university timetabling problem can be grouped into two categories: (i) course (or lecture)

    timetabling and (ii) examination timetabling. The course timetabling problem is the process of

    assigning timeslots and rooms so that meetings between lecturers and students can take place.

    The examination timetabling problem refers to the assignment of timeslots and rooms so that

    students can take examinations. These two (examination and course) timetabling problems are

    fairly similar in some superficial ways, but there are some distinct underlying differences

    between them. In examination timetabling, several examinations can be assigned to one (large)

    room at the same time. However, this is not possible for course timetabling where only one

    course can be assigned to one room.

    a) The Examination Timetabling Problem

    The examination timetabling problem represents a major administrative activity for academic institutions. It is often a difficult and demanding process and it affects a significant number of people. Romero (1982) reports that three broad categories of people are affected by its outcome: administrators, academic staff and students. Many universities are seeing an increasing number of student enrolments into a wider variety of courses and an increasing number of combined degree courses. This contributes to the growing challenge of developing examination timetabling software that caters for the broad spectrum of constraints and demands of educational institutions across the world. The quality of a timetable should therefore be evaluated from several points of view.

    Carter and Laporte (1996) defined the examination timetabling problem as:

        "The assigning of examinations to a limited number of available time periods in such a way that there are no conflicts or clashes"

    The examination timetabling problem is very common in both schools and universities. It is concerned with allocating a set of examinations to a limited number of timeslots (periods), subject to a set of constraints. Carter et al. (1994) stated that the basic challenge of examination timetabling is to schedule examinations over a limited number of timeslots so as to avoid conflicts and to satisfy a number of side constraints. Here, avoiding conflicts is a hard constraint and the side constraints are soft constraints.
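
    One common way to model the no-conflict requirement is graph colouring: examinations are vertices, two examinations are joined by an edge when some student sits both, and colours stand for timeslots. The greedy largest-degree-first sketch below is an illustration under that assumed model, not the method used in this thesis; the function name and enrolment data are made up, and the greedy rule does not guarantee the minimum number of timeslots.

```python
def greedy_timeslots(enrolments):
    """Greedy graph colouring: exams are vertices, shared students are edges."""
    exams = sorted({e for chosen in enrolments.values() for e in chosen})
    clashes = {e: set() for e in exams}
    for chosen in enrolments.values():
        for e in chosen:
            clashes[e].update(x for x in chosen if x != e)
    slot_of = {}
    # Place the exams with the most clashes first (largest-degree heuristic).
    for exam in sorted(exams, key=lambda e: (-len(clashes[e]), e)):
        used = {slot_of[n] for n in clashes[exam] if n in slot_of}
        slot = 0
        while slot in used:  # lowest timeslot not used by a clashing exam
            slot += 1
        slot_of[exam] = slot
    return slot_of

# Made-up enrolments: student -> examinations to sit.
enrolments = {
    "ann": ["Maths", "Physics"],
    "bob": ["Physics", "Art"],
    "cat": ["Maths"],
}
slot_of = greedy_timeslots(enrolments)
print(slot_of)
```

    Any assignment produced this way is free of student clashes, because an examination never receives a timeslot already given to one of its neighbours.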

    The generally accepted hard constraints for the examination timetabling problem are (i) there must be enough seating capacity and (ii) no student should be required to sit two examinations at the same time. Solutions that satisfy all the hard constraints are called feasible. On the other hand, there may be requirements that are not essential; these are referred to as soft constraints. Common soft constraints are (i) students should not be scheduled to sit more than one examination in a day and (ii) each student's examinations should be spread as evenly as possible over the schedule.

    In a real-world situation it is usually impossible to satisfy all the soft constraints. The quality of a solution is therefore measured by a penalty function that reflects the extent to which the timetable violates its soft constraints; minimizing this penalty increases the quality of the solution.
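
    A penalty function of this kind might look like the sketch below, which scores only soft constraint (i), more than one examination per student per day. The weighting scheme, the function name and the data are assumptions for illustration; the text does not specify a particular penalty function.

```python
from collections import Counter

def soft_penalty(exam_day, enrolments, weight=1):
    """Add `weight` for every exam beyond the first that a student sits on one day."""
    penalty = 0
    for exams in enrolments.values():
        per_day = Counter(exam_day[e] for e in exams)
        penalty += weight * sum(n - 1 for n in per_day.values() if n > 1)
    return penalty

# Made-up timetable (exam -> day) and enrolments (student -> exams).
exam_day = {"Maths": 1, "Physics": 1, "Art": 2}
enrolments = {"ann": ["Maths", "Physics"], "bob": ["Maths", "Art"]}

penalty = soft_penalty(exam_day, enrolments)
print(penalty)
```

    A lower value indicates a better timetable; a value of zero means that this particular soft constraint is fully satisfied.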

    b) The Course Timetabling Problem

    Carter and Laporte (1998) defined course timetabling as:

        "a multi-dimensional assignment problem in which students, teachers (or faculty members) are assigned to courses, course sections or classes; events (individual meetings between students and teachers) are assigned to classrooms and times"

    In course timetabling (which is also sometimes known as class/teacher timetabling), a set of

    courses is scheduled into a given number of rooms and timeslots within a week and, at the same

    time, students and teachers are assigned to courses so that the meetings can take place. Some

    combinatorial models which draw upon graph colouring for simple class-teacher timetabling

    problems can be found in de Werra (1996b, 1997b). As in examination timetabling, course

    timetabling also involves hard and soft constraints. Examples of hard constraints for the course

    timetabling problem are:

    1. A student or a teacher cannot be in two places at the same time.

    2. Only one course is allowed to be assigned to a timeslot in each classroom.

    3. The classroom capacity should be equal to or greater than the number of students attending the course at a particular timeslot.

    Some related soft constraints for course timetabling reported by Socha et al. (2002) are:

    1. Teachers may prefer specific time slots.

    2. Teachers may prefer specific rooms.

    3. Certain kinds of subjects should not be in contiguous time slots.

    Some combinations of assignments lead to acceptable timetables, others do not. Such restrictions follow from conditions imposed by rooms, students or teachers. As stated earlier, in university course timetabling a set of courses and associated events is assigned to a set of rooms and time periods within a week and, at the same time, students and teachers are assigned to the courses so that the appropriate lessons can take place, subject to a variety of hard and soft constraints.

    1.5 Need of Study

    Organizations such as universities and schools use timetables to schedule classes and lectures, assigning times and places to them in a way that makes the best use of available resources. Universities in particular increasingly have to deal with a large number of courses and flexible degree structures. A timetable that is not well designed is inconvenient and expensive in terms of wasted time and money. Timetabling is a search for good solutions in a space of possible timetables.

    Traditionally, educational staff solved the problem manually. Making a timetable is a slow, laborious task, performed by people working on the strength of their knowledge of the resources and constraints of a specific institution. Generating a university's timetable is a tedious job with many constraints to be satisfied, including the differing requirements of different departments or universities. Timetable generation is therefore regarded as a complex problem, and the result is often unsatisfactory, i.e. it does not meet all the requirements. These difficulties have motivated the scientific study of the problem and the development of semi-automated solution techniques. Such programs build a set of timetables but still do not solve the whole problem.

    The construction of automated course timetables for academic institutions is a very difficult problem, with many constraints to be respected and a huge search space to be explored even when the problem input is not particularly large, because the number of possible timetables grows exponentially. Moreover, the problem itself does not have a widely agreed definition, since different departments face different variations of it. The problem has therefore proven to be very complex. Timetables are considered feasible provided the so-called hard constraints are respected. However, to obtain high-quality timetabling solutions, soft constraints, which express a set of desirable conditions for classes and teachers, should also be satisfied. This work uses a modified k-means clustering algorithm with the aim of producing a more accurate timetable schedule, with high precision and high recall, in less execution time.

    1.6 Research Objectives

    The main objective of this thesis work is to fully utilize the resources of the university in an automated timetable generator. The goals of the thesis work are:

    1. To analyse the problems that exist in timetabling.

    2. To create a system that utilizes resources efficiently and effectively, removing redundancy and ambiguity, so that the system is cost-effective and user-friendly.

    3. To compare the existing system with the new one.

    In this work we measure the accuracy, precision and recall of the generated timetable and compare the execution time for timetable generation. We use an improved k-means clustering algorithm, and our analysis indicates that its execution time is less than that of standard k-means clustering.

    1.7 Scope of Work

    A timetable is crucial for educational institutions and schools. We have developed a simple and economical solution to the problem of adjusting timetables. Its interactive interface allows a timetable to be generated quickly and smoothly. The software is designed so that the whole timetable can be made easily and the most effective time schedules can be fixed according to the school's needs and suggestions, avoiding unnecessary delays and confusion.

    1.8 Structure of dissertation

    The dissertation has been organized into six chapters. A brief description of the content of

    these chapters is given in the following paragraphs:

    Chapter 1 provides an overview and introduction to timetable scheduling. It also introduces

    various timetabling problems and concentrates upon particular research issues concerned with

    university timetabling problems. This chapter presents the need of study and the aims of the

    research.

    Chapter 2 introduces the background of timetabling problems. It reviews and analyses the current published research on the subject of university timetabling.

    Chapter 3 introduces the various techniques for solving the timetabling problem.

    Chapter 4 introduces the methodology used in this research work for solving the timetable scheduling problem.

    Chapter 5 shows the working of the new system with snapshots of results. It also compares the new system with the existing system and reports the accuracy, precision, recall and execution time of the new system.

    Chapter 6 presents the conclusion and future work.

    Chapter 2

    SURVEY OF LITERATURE

    This chapter discusses the literature reviewed during the research work. Various research papers and journals were studied during this period.

    2.1 Background

    Timetabling is known to be a non-polynomial complete problem i.e. there is no known efficient

    way to locate a solution. Also, the most striking characteristic of NP-complete problems is that

    no efficient (polynomial-time) algorithm for solving them exactly is known. Hence, a heuristic

    approach is chosen to find a solution to a timetabling problem. Such an approach leads to a set of

    good solutions, but not necessarily to the best solution. In a general educational timetabling

    problem, a set of events (e.g. courses and exams) is assigned to a certain number of timeslots

    (time periods) subject to a set of constraints, which often makes the problem very difficult to

    solve in real-world circumstances [2]. In fact, large-scale timetables such as university

    timetables may need many hours of work by qualified people or teams to produce high-quality

    timetables that satisfy the constraints [7] and optimize the timetable objectives at the same time.

    These constraints are of two types: hard and soft. Hard constraints are those that cannot be

    violated while a timetable is being computed. For example, for a teacher to be scheduled in a time

    slot, the teacher must be available in that time slot. A solution is acceptable only when no hard

    constraint is violated. Soft constraints, on the other hand, are those that should be satisfied

    in the solution as far as possible. For example, although a teacher's scheduling preferences are

    taken into account, the priority is a valid timetable, and this can leave a teacher free for a

    time slot. Thus, while addressing the timetabling problem, hard constraints must be adhered to,

    and at the same time an effort is made to satisfy as many soft constraints as possible. Due to the

    complexity of the problem, most of the work done concentrates on heuristic algorithms that try to

    find good approximate solutions [8]. These include Genetic Algorithms (GA) [8], Tabu Search [10],

    Simulated Annealing [11] and, more recently, Scatter Search methods. Heuristic optimization

    methods are explicitly aimed at good feasible solutions that may not be optimal, for cases where

    the complexity of the problem or the limited time available does not allow an exact solution.

    Generally, two questions arise: (i) how fast is the solution computed, and (ii) how close is the

    solution to the optimal one? A trade-off between time and quality is often required. It is handled

    by running simpler algorithms more than once, by comparing the results with those obtained from

    more complicated algorithms, and by comparing the effectiveness of different heuristics.

    Heuristic methods are evaluated empirically because of the analytical difficulty of deriving

    worst-case results for these problems. In its simplest form, the scheduling task consists of

    mapping class, teacher and room combinations (which have already been pre-allocated) onto time

    slots.
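    The distinction between hard and soft constraints can be illustrated with a minimal sketch. The

    Lesson record, the teacher names and the two checking functions below are hypothetical and are

    not taken from any of the cited systems; the sketch only assumes that a timetable is acceptable

    when it has no hard violations and that soft violations are counted as a penalty.

```python
from dataclasses import dataclass

# All names and data here are hypothetical, chosen only for illustration.


@dataclass(frozen=True)
class Lesson:
    teacher: str
    class_name: str
    slot: int  # index of the time slot the lesson is assigned to


def hard_violations(lessons, availability):
    """Count hard-constraint violations.

    A violation is a lesson placed in a slot where its teacher is not
    available, or a teacher or class appearing twice in the same slot.
    """
    violations = 0
    seen = set()
    for lesson in lessons:
        if lesson.slot not in availability[lesson.teacher]:
            violations += 1
        for key in (("teacher", lesson.teacher), ("class", lesson.class_name)):
            if (key, lesson.slot) in seen:
                violations += 1
            seen.add((key, lesson.slot))
    return violations


def soft_penalty(lessons, preferred):
    """Count lessons placed outside a teacher's preferred slots, if any are stated."""
    penalty = 0
    for lesson in lessons:
        wanted = preferred.get(lesson.teacher)
        if wanted is not None and lesson.slot not in wanted:
            penalty += 1
    return penalty


def evaluate(lessons, availability, preferred):
    """Return (is_acceptable, soft_penalty) for a candidate timetable."""
    return hard_violations(lessons, availability) == 0, soft_penalty(lessons, preferred)


availability = {"Rao": {0, 1, 2}, "Sen": {1, 2}}
preferred = {"Rao": {0, 1}}
timetable = [Lesson("Rao", "10A", 0), Lesson("Sen", "10B", 1), Lesson("Rao", "10B", 2)]
result = evaluate(timetable, availability, preferred)
```

    Keeping the two measures separate reflects the rule stated above: a solution is rejected as soon

    as a hard constraint is broken, while soft violations only lower its quality.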

    2.2 Literature Review

    Many approaches and models have been proposed for dealing with the variety of timetable

    problems. Problems range from the construction of semester or annual timetables in schools,

    colleges and universities to exam timetabling at the end of the period. Early timetabling was

    carried out manually, and a typical timetable, once constructed, remained static, with only a

    few changes needed to fine-tune it every semester or year. However, the nature of education has

    changed substantially over the years, and the requirements of timetables have become much more

    complicated than they used to be. Consequently, the need for automated timetable generation is

    increasing, and the development of a timetable generation system that produces valid solutions

    is essential. As a result, during the last 30 years, many papers on automated timetabling have

    been published in conference proceedings and journals. In addition, several applications have

    been developed and implemented with varying degrees of success.

    The early techniques used in solving timetabling problems were based on a simulation of the

    human approach to resolving the problem. These included techniques based on successive

    augmentation, which were called direct heuristics. These techniques were based on the idea of

    creating a partial timetable by scheduling the most constrained lecture first and then extending

    this partial solution lecture by lecture until all lectures were scheduled [3]. The next step was

    for researchers to apply general techniques such as integer and linear programming, graph

    coloring and network flow to the timetable problem. The first two papers on timetable

    construction using these general techniques are generally attributed to Kuhn and Haynes. Kuhn's

    paper adopts a mathematical approach to the fundamental timetable problem, in contrast to

    Haynes's paper, which concentrates on the more practical aspects of scheduling events for a

    conference. Interest in timetable solution generators increased dramatically in the 1960s,

    mainly due to the wider availability of computers to perform the number crunching required by

    the algorithms developed [4].

    The first non-heuristic approach was developed by Gotlieb in 1963, who discussed the now famous

    process of reducing the availability array in a paper presented at the Munich IFIP congress.

    This was arguably the first paper on this partitioning approach. It was further enhanced by

    Berghuis, who introduced the concept of virtual classes or teachers to obtain the classical

    bipartite problem. Typically these papers were based on a heuristic approach. Many papers

    followed this work, discussing the problem but contributing little that was new.

    Around the late 1960s, attempts to limit the general problem by considering case examples began

    to be published. For instance, Lawrie in 1969 developed a model for the school timetable problem

    using an integer linear programming approach.

    During the 1970s several authors adopted a heuristic approach to tackling the timetable

    problem. For example, Junginger in 1972 reduced the timetable problem to a three-dimensional

    transport problem. Schmidt and Strohlein in 1973 predicted that the generation of timetables by

    computer would be heavily influenced by the devices at hand, with timetable programming moving

    from remote handling in huge computing centers to microcomputer centres owned by schools and

    handled directly by teachers on their desktops.

    The major general techniques that seem to have been prevalent in the 1970s and 1980s have their

    roots in artificial intelligence and are based on simulated annealing, tabu search and genetic

    algorithm methods. Papers in the literature typically described a substantial software

    implementation, supported by the results of applying the method in one or more cases.

    Furthermore, a number of important surveys of the timetabling literature were published in the

    1980s [4].

    De Werra in 1985 listed the various timetabling problems in a formal way and provided different

    formulations in an attempt to solve them. He also described the approaches considered most

    important at that time. Carter in 1986 presented a survey of actual applications of timetabling

    at several universities. He also provided a tutorial guide for practitioners on selecting and/or

    designing an algorithm for their own institutions [5].

    Junginger in 1986 described research work in Germany on the school timetable problem and the

    underlying approaches, which were based on direct heuristics. Corne et al. in 1994 provided a

    survey of Genetic Algorithm applications to timetabling, discussed future perspectives of such

    approaches and compared the results obtained with those of other approaches. Although papers

    published in the 1990s solved timetable problems using the artificial intelligence based

    techniques above, a new approach, also rooted in artificial intelligence, was emerging and has

    since gained prominence: Constraint Satisfaction Programming (CSP) [6].

    Abramson in 1991 used Simulated Annealing as an optimization technique. He discussed the

    possibility of adding cost components in order to include the more complex scheduling

    constraints that arise in schools, and described how weighting the cost components allows one

    component to be made more important than others. He implemented the method on a parallel

    computer system and showed that the speed of the algorithm improved along with the results.

    Cooper and Kingston in 1993 described a computer program that solved a problem from a large and

    highly constrained high school without any simplifications. They provided a timetable

    specification language that helped to handle many constraints in a uniform way. Schaerf in 1999

    provided a survey of the different techniques used in timetable generation, stressing constraint

    satisfaction techniques as an important addition to the tools used in solving the timetabling

    problem [10].

    Kennedy and Eberhart in 1995 developed the Particle Swarm Optimization (PSO) algorithm for

    optimization. Shu-Chuan Chu and Yi-Tin Chen in 2006 generated school timetables using PSO. They

    observed that PSO has many successful applications in continuous optimization problems. The main

    contribution of their work is the use of PSO to solve the discrete problem of timetable

    scheduling [7].

    Ahmed Hamdi Abu Absa and Dr. Sana Wafa Ai Sayegh [8] explained the implementation of a Genetic

    Algorithm (GA) used in a university timetable generator. The paper presents a program, written

    in Java, that creates an efficient timetable without constraint violations for a simple

    university timetabling problem. The study tested the effects of mutation rate and population

    size and concluded that the Genetic Algorithm approach is very effective and useful for lecture

    timetabling problems.

    Alberto Colorni et al. [9] analyzed the results of an automated timetabling problem solved by a

    genetic algorithm. They described the automated timetabling problem as representative of the

    class of multi-constrained, NP-hard, combinatorial optimization problems with real-world

    application. The paper compares two versions of the genetic algorithm, with and without local

    search, both to a handmade timetable and to two other approaches based on simulated annealing

    and tabu search. The results show that the genetic algorithm with local search and tabu search

    with relaxation perform better than simulated annealing and handmade timetables. In their tests,

    both the didactic requirements and the teachers' preferences were better satisfied, and the

    total cost of the timetables produced by the new system was much lower than that of the handmade

    timetable.

    E.K. Burke et al. [10] discussed automatic timetable generation using traditional methods such

    as graph coloring and advanced methods such as genetic algorithms. The paper addresses

    examination timetabling and argues that genetic algorithms are very useful general-purpose

    optimization tools that may be applied to a wide range of very difficult problems.

    Khaled Mahar [2] proposed a genetic algorithm with a simple representation that handles all the

    university timetables at once and is easily modified to create an accurate timetable that

    satisfies the constraints that must not be broken. The algorithm was applied to create

    timetables for the college of the Arab Academy for Science and Technology in Egypt; the results

    were very satisfactory, and no hard constraint violation was encountered. The program was tested

    with different population sizes and crossover and mutation rates. The paper also provides an

    overview of different techniques for the automatic generation of university timetables, such as

    tabu search, simulated annealing, genetic algorithms, graph coloring heuristics, constraint

    programming and network flow models. The paper shows that as the population size increases the

    cost changes faster, but a large population consumes too much running time and memory.

    Dipti Srinivasan et al. [11] presented an evolutionary algorithm based approach to solving a

    large, constrained university timetabling problem. Heuristics and context-based reasoning were

    also used to obtain feasible timetables in a reasonable time. The complete course timetabling

    system presented in the paper was validated, tested and discussed using data from a university.

    The results showed that the intelligent adaptive mutation operator led to a more than tenfold

    improvement in the performance of the evolutionary algorithm.

    Prof. Swapna Borde et al. [12] presented a hybrid algorithm for the university timetabling

    problem that combines two techniques: an If-Else algorithm and a Graph Coloring algorithm. The

    first is based on simple if-else statements, which can easily be written in any programming

    language, and the second is based on connected graphs. The hybrid algorithm removes the

    disadvantages of the individual methods and provides a more efficient timetable generation

    algorithm.

    Hana Rudova et al. [13] focused on hard constraints with preference propagation for soft

    constraints. They extended constraint logic programming with a technique for the partial

    satisfaction of soft constraints and applied this method to the timetabling problem of Purdue

    University. The model and search methods applied to the large lecture room component are

    presented, and the computational results are analyzed. Their solution satisfied the course

    requests of 98% of students.

    Ashish Jain, Dr. Suresh Jain and Dr. P.K. Chande [14] described the various genetic operators,

    such as selection, mutation and crossover. Selection picks the best chromosomes from a group of

    chromosomes on the basis of a fitness function, and crossover exchanges timetable information

    between chromosomes as required. The paper claims that an evolutionary, genetic algorithm based

    approach is an effective and powerful method for solving the course timetabling problem.
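    The three operators can be sketched as follows. This is an illustration only and not the

    implementation of [14]: the chromosome encoding (a list whose i-th gene is the time slot of

    lesson i), the choice of tournament selection and one-point crossover, and all function names

    are assumptions made for the example.

```python
import random

# Hypothetical encoding: chromosome[i] is the time slot assigned to lesson i.
# The fitness function is supplied by the caller; higher is better.


def tournament_select(population, fitness, k=2, rng=random):
    """Selection: return the fittest of k randomly chosen chromosomes."""
    return max(rng.sample(population, k), key=fitness)


def one_point_crossover(parent_a, parent_b, rng=random):
    """Crossover: swap the tails of two parents at a random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def mutate(chromosome, n_slots, rate=0.05, rng=random):
    """Mutation: reassign each gene to a random slot with probability `rate`."""
    return [rng.randrange(n_slots) if rng.random() < rate else gene
            for gene in chromosome]
```

    In a full genetic algorithm these operators would be applied repeatedly to a population, with

    the fitness function penalizing constraint violations.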

    Yao-Te Wang et al. [15] proposed a practical automatic timetabling system based on students'

    needs, in which the process is divided into two stages. In the first stage, students' needs in

    course selection are determined, and associations among the courses selected by students are

    extracted using association mining. In the second stage, a genetic algorithm is used to arrange

    the course timetable. The study starts from students' willingness in course selection, analyzes

    student learning performance and teachers' preferred schedules, determines the cost function

    value of each class period, and then applies the genetic algorithm to exchange class periods and

    produce an optimal course timetable. According to the authors, the proposed system can not only

    efficiently replace conventional manual timetabling but also produce course timetables that

    truly fulfill users' needs and increase the satisfaction of students and teachers. It is also

    said to improve the interaction among students, teachers and the school, creating a good

    relationship among the three parties.

    Kuldeep Kumar et al. [6] noted that the university timetabling problem has a very large number

    of feasible solutions. Some method is therefore required that allows the overall quality of

    different solutions to be measured, so that they can be compared and the best one selected. They

    consider the Genetic Algorithm a generally appropriate technique for university timetabling

    problems, because it yields a number of alternative solutions that satisfy most of the hard

    constraints.

    Nikita Desai [16] observed that a large number of tools solve the timetabling problem based on

    the resources provided by the user, and considered the resources that these tools ignore. The

    tools focus mainly on the specifications of classrooms, teachers and subjects but are not able

    to accommodate resources related to human factors, such as the fondness, hostility and weakness

    of teachers and students. She presented a survey carried out to find the preferences of teachers

    and concluded with rules extracted using a classification method. Her analysis suggests that

    these rules allow resources to be properly utilized for accurate timetable generation.

    Carpente [17] presented an application that solves the complex school timetabling process,

    starting from the available resources and adjusting them so that they are fully utilized in the

    automatically generated solution. The application interacts with the Academic Administration

    Official Systems (AAOS), which simplifies the difficult phase of entering the data, and complete

    solutions are provided efficiently by different heuristic techniques. The application can easily

    be updated through its user interface.

    Ho Sheau Fen, Irene et al. [18] described a PSO technique for solving the university

    timetabling problem, in which constraint-based reasoning is applied to PSO. The proposed

    algorithm was tested using real data from Teknologi University, Malaysia. The results were

    compared against standard PSO and hybrid PSO-local search. The proposed algorithm gave better

    results than the others, but the computational time it needed to generate a solution was

    slightly longer than that of hybrid PSO-local search and standard PSO.

    Ruey-Maw Chen et al. [19] explained that the course timetabling problem is NP-complete. They

    used PSO to solve it because of its fast convergence, small number of parameters and ability to

    fit dynamic environmental characteristics. They evaluated the performance of PSO and SPSO with

    and without local search. They report that, once local search is added, the outcomes are

    significantly better than those obtained using PSO or SPSO alone. Moreover, SPSO with local

    search performs better than PSO with local search.

    Elizabeth Montero et al. [20] analyzed PSO for a dynamic problem in which new courses and exams

    can appear during the semester. They use a forward checking algorithm, and this approach can

    efficiently handle the creation of new courses or exams after the initial, static start-up

    planning.

    Danial Qarouni-Faral et al. [21] used swarm intelligence, which is based on social-psychological

    principles and also contributes to engineering applications. The paper applies PSO to the

    classic timetabling problem. The results show that the number of errors decreases in comparison

    with previous approaches.

    LI Lin et al. [22] addressed the course-scheduling problem with the PSO algorithm. The paper

    analyzes the application of PSO in a course scheduling system and adds an improved greedy

    strategy for selecting the initial particle swarm, which yields an initial population with

    higher performance and so promotes the efficiency of the algorithm. The initial population is

    much closer to the optimal solution, and the convergence speed of the algorithm is enhanced.

    Meanwhile, various kinds of hard and soft constraints are taken into account, and the difficulty

    of course scheduling is reduced.

    ZHU Jihrong et al. [23] explained a new adaptive particle swarm optimization algorithm. In this

    algorithm, every particle chooses its inertial factor according to its own fitness and that of

    the optimal particle: the better the fitness, the smaller the inertial factor the particle

    chooses. The simulation results show that the proposed algorithm is effective and robust, has

    the advantage of global convergence and can effectively alleviate the problem of premature

    convergence. The experimental results also show that the suggested algorithm is greatly superior

    to PSO and APSO in terms of robustness.

    R.C. Eberhart et al. [24] compared two methods of particle swarm optimization: one using an

    inertia weight and one using a constriction factor. Five benchmark functions are used for the

    comparison. They conclude that the best approach is to use the constriction factor while

    limiting the maximum velocity Vmax to the dynamic range of the variable Xmax on each dimension.

    The results also indicate that improved performance can be obtained by carefully selecting the

    inertia weight w and the coefficients c1 and c2.
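    The two velocity updates being compared can be written down in a few lines. The sketch below

    is one-dimensional and uses the commonly quoted form of Clerc's constriction coefficient; the

    function names and the default values of w, c1 and c2 are illustrative assumptions, not the

    settings reported in [24].

```python
import math


def constriction_factor(c1=2.05, c2=2.05):
    """Clerc's constriction coefficient; defined only for c1 + c2 > 4."""
    phi = c1 + c2
    if phi <= 4:
        raise ValueError("c1 + c2 must exceed 4")
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))


def velocity_inertia(v, x, pbest, gbest, r1, r2, w=0.7, c1=1.5, c2=1.5):
    """Inertia-weight update: the previous velocity is scaled by w."""
    return w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)


def velocity_constriction(v, x, pbest, gbest, r1, r2, c1=2.05, c2=2.05):
    """Constriction update: the whole expression is scaled by chi."""
    chi = constriction_factor(c1, c2)
    return chi * (v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x))


def clamp(v, vmax):
    """Limit a velocity component to the range [-vmax, vmax]."""
    return max(-vmax, min(vmax, v))
```

    Here r1 and r2 stand for the uniform random numbers drawn at each update, and clamp() expresses

    the velocity limit; setting vmax to the dynamic range of the variable corresponds to the

    recommendation summarized above.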

    Almost all of the papers in the literature describe a substantial software implementation,

    supported by results from applying the method to one or more test cases. The results are

    measured against manual results but, unfortunately, the absence of a common definition of the

    various problems and of widely accepted benchmarks prevents the comparison of the algorithms

    with each other. The computational complexity of the proposed systems is assessed only through

    computing time, and comparisons are difficult because hardware varies from case to case.

    Furthermore, there seems to be a substantial gap between theoretical discussion and the

    application of software to test cases on the one hand, and obtaining effective and realistic

    timetables that can be used in everyday operations on the other. Therefore, a timetable

    generator that is practical and effective needs to be flexible enough to overcome these

    problems.

    Chapter 3

    TECHNIQUES

    3.1 Techniques Applied to the Timetabling Problem

    A timetabling problem can be defined as the scheduling of a certain number of lectures, each

    attended by a specific group of students and given by a teacher, over a definite period of

    time. Each lecture requires certain resources, which are limited in number, and must fulfill

    certain specific requirements. In particular, the automatic construction of a timetable is

    extremely difficult because of the diversity of constraints that must be taken into account.

    The most usual methods for solving this problem are inherited from operations research (OR),

    such as graph coloring and mathematical programming, or from Genetic Algorithms [9]. These

    well-known and widely used methods have given good results. However, OR-inherited methods

    generally lack flexibility (i.e. modifying the data may make it necessary to reconsider the

    initial model); moreover, it is difficult to find a model that includes all the constraints.

    With local search methods (where most of the constraints are put in the objective function) or

    Genetic Algorithms (where the constraints act through the fitness function), the user

    frequently obtains solutions by tuning rather than by defining a search strategy dedicated to

    the problem.

    This section divides the techniques applied to university timetabling problems into six

    categories, including constraint-based methods, graph-based approaches, cluster-based methods,

    heuristic-based approaches and genetic approaches. These categories are discussed in the

    following sub-sections. Many other approaches to timetabling also exist, such as

    population-based approaches, meta-heuristic methods, multi-criteria approaches,

    hyper-heuristic/self-adaptive approaches, case-based reasoning, knowledge-based and fuzzy-based

    approaches.

    3.1.1 Constraint-based Methods

    Reasoning approaches are considered new methodologies in problem solving. Two types of

    reasoning approaches are applied to the UCT problem: Case-Based Reasoning (CBR) [28] approaches

    and Constraint-Based Reasoning approaches. CBR approaches solve new timetabling problems by

    reusing previous timetables and previous construction methods, selected by means of similarity

    measures. The big challenge for these approaches is a definition for

    Cluster methods were classified as one of the four major approaches by Carter and Laporte

    (1996). The idea of the cluster method was first introduced by Desroches et al. (1978). White

    and Chan (1979) and White and Haddad (1983) describe cluster methods that can be thought of as

    a three-phase approach. In the first phase, the examinations are grouped into timeslots to

    construct a feasible timetable. The second phase attempts to reduce second-order conflicts by

    considering permutations of timeslots. The third phase aims to improve the solution quality

    further by moving individual examinations between timeslots, for example by means of a

    hill-climbing local search.

    3.1.3 Graph-based Approaches

    Graph coloring is concerned with coloring the vertices of a given graph using a given number of

    colors. Consider the examination timetabling problem. All the examinations must be scheduled

    within a limited number of timeslots in such a way that any clashing examinations (i.e.

    examinations that have at least one student in common) are placed in different timeslots. This

    problem can be viewed as a graph coloring model in which the vertices represent the

    examinations, the colors represent the timeslots and the edges represent the conflicts between

    examinations. Each vertex of the graph is colored using one of p colors so that no two vertices

    connected by an edge are assigned the same color; normally only a limited number of colors is

    available.

    The concepts and terms that relate to a graph are defined before progressing with the

    explanation of this model. An undirected graph G = (V, E) consists of a set of vertices,

    V = {v1, ..., vn}, and a set of edges, E. If (vi, vj) is an edge in a graph G = (V, E), then

    vertex vi is adjacent to vertex vj (Burke et al., 2004a). Figure 3.1 shows the representation of

    an undirected graph on the vertex set {v1, v2, v3, v4, v5}.

    Figure 3.1 An undirected graph G = (V, E)

    Other related definitions are:

    The degree of a vertex is the number of edges connected to it. For example, from Figure 3.1,

    vertex v1 has a degree of 3.

    The chromatic number of a graph is the minimum number of colors necessary to color the

    vertices, so that no two vertices connected by an edge are both assigned the same color.

    For a better understanding about the relationship between the graph coloring problem and the

    timetabling problem, an example of course timetabling is presented in Figure 3.2.

    From Figure 3.2, we can see that there are five different courses coded as A, B, C, D and E. One

    possible goal is to find the minimum number of timeslots needed to schedule the five courses.

    The edges represent clashes between courses: if there is an edge between two vertices, the

    corresponding courses cannot be scheduled in the same timeslot. In our example, course A cannot

    be scheduled at the same time as courses B and C, course B cannot be scheduled at the same time

    as courses A and D, and so on. Three colors (timeslots) are needed to schedule this problem.

    Courses A and E could be colored red, course D yellow, course B blue, and course C yellow or

    blue. The colors correspond to timeslots. The graph coloring problem is concerned with finding

    the chromatic number of a graph (the minimum number of colors required to color the graph). From

    the graph in Figure 3.2, it is easy to see that the chromatic number is 3.

    Figure 3.2 A graph model for a simple course timetabling problem

    A variety of graph coloring based heuristics for constructing a clash-free timetable are

    available in the literature.
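    One such heuristic, greedy coloring with a largest-degree-first ordering, can be sketched as

    follows. Because the figure itself is not reproduced in this text, the edge list is an

    assumption chosen to agree with the description of Figure 3.2 (A clashes with B and C, B

    clashes with D, and three timeslots are needed); the function names are also hypothetical.

```python
# Hypothetical conflict graph: the edge list is an assumption consistent with
# the textual description of Figure 3.2, not a copy of the figure.
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("D", "E")]


def build_adjacency(edges):
    """Map every course to the set of courses it clashes with."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def greedy_coloring(adjacency):
    """Color courses in order of decreasing degree; returns {course: timeslot}."""
    order = sorted(adjacency, key=lambda c: (-len(adjacency[c]), c))
    slot_of = {}
    for course in order:
        used = {slot_of[n] for n in adjacency[course] if n in slot_of}
        slot = 0
        while slot in used:  # smallest timeslot not taken by a clashing course
            slot += 1
        slot_of[course] = slot
    return slot_of


adjacency = build_adjacency(EDGES)
timetable = greedy_coloring(adjacency)
```

    A greedy coloring only gives an upper bound on the chromatic number in general. For this small

    graph it uses three timeslots, which is also the minimum because B, D and E clash pairwise under

    the assumed edge list.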

    3.1.4 Heuristic Approach

    Heuristic-based approaches use heuristic concepts and heuristic search to construct solutions

    for many problems and give good results. In recent decades a large body of literature has

    appeared on heuristic approaches to timetabling problems, and many studies discuss heuristics

    in related fields. Some heuristic approaches employ heuristic ordering, where a heuristic is

    used to measure the difficulty of scheduling a particular course and to resolve conflicts with

    other courses [2]. These approaches order the courses using heuristics and then assign them

    sequentially to suitable time slots, so that the courses in each period are conflict-free.

    In its simplest form, the scheduling task consists of mapping class, teacher and room

    combinations (which have already been pre-allocated) onto time slots.

    One possible approach is as follows: We define a tuple as a particular combination of identifiers

    such as class, teacher and room, which is supplied as an input to the problem.[2] The problem

    now becomes one of mapping of tuples onto period slots such that tuples which occupy the same

    period slot are disjoint (have no identifiers in common). If tuples are assigned arbitrarily to

    periods, then in anything but the most trivial cases, a number of clashes will exist. We can use

    the number of clashes in a timetable as an objective measure of the quality of the schedule. Thus,

    we adopt the number of clashes as the cost of any given schedule. It is simple to measure the cost

    of a schedule. For each period of the week, we make a count of the number of occurrences of

    each class, teacher and room identifier. The cost of the entire timetable is the sum of each of the

    individual costs. This procedure is discussed in more detail in Abramson [21]. The proposed

    algorithm aids solving the timetabling problem while giving importance to teacher availability.

    This algorithm uses a heuristic approach to give a general solution to the school timetabling problem. As user input it takes the number of subjects, the number of teachers, the subjects each teacher takes, the number of days in the week for which the timetable is to be set, the number of time slots in a day and the maximum number of lectures a teacher can conduct in a week. It first uses a randomly generated subject sequence to make a temporary timetable. While generating this sequence, care is taken to avoid repeating a subject within a day. After this, teacher availability is checked for each subject in its allocated slot. Whenever a teacher is available for the subject at the allocated slot, the subject and the teacher are entered into the output data structure and marked as final. Before the subject is allocated to the output data structure, a check is also made against the maximum number of lectures the teacher can conduct. If the teacher has been allocated more than the allowed maximum number of lectures, the subject is moved into a Clash data structure. To avoid cycling and to improve the search, this variable selection criterion can be randomized. Several methods [22] can be applied, e.g.:

    a random walk technique (with a given probability p, a random variable is selected),

    not the worst variable, but a random selection among variables that are bad enough (e.g., from the top N worst variables), or

    a selection of a variable according to a probability based on the above-mentioned criteria (e.g., roulette wheel selection).
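
    The three randomization methods listed above can be sketched as follows. This is an illustrative sketch only: the function name select_variable, the "badness" score (higher means worse) and the default parameter values are assumptions and are not taken from [22].

```python
import random


def select_variable(badness, rng, p=0.1, top_n=3, strategy="random_walk"):
    """Pick the next variable to repair.

    `badness` maps a variable name to a non-negative score
    (hypothetical measure; higher means worse).
    """
    variables = list(badness)
    ranked = sorted(variables, key=badness.get, reverse=True)  # worst first
    if strategy == "random_walk":
        # With probability p pick any variable, otherwise the worst one.
        return rng.choice(variables) if rng.random() < p else ranked[0]
    if strategy == "top_n":
        # Pick uniformly among the N worst variables.
        return rng.choice(ranked[:top_n])
    if strategy == "roulette":
        # Selection probability is proportional to the badness score.
        weights = [badness[v] for v in variables]
        if sum(weights) == 0:
            return rng.choice(variables)
        return rng.choices(variables, weights=weights, k=1)[0]
    raise ValueError(f"unknown strategy: {strategy}")


badness = {"maths": 5, "physics": 3, "chemistry": 0}
rng = random.Random(0)
print(select_variable(badness, rng, strategy="top_n", top_n=2))
```

    With p = 0 the random walk strategy degenerates to always choosing the worst variable, which is the deterministic behaviour that the randomization is meant to avoid.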

    The main advantage of these orderings is that they are easy to implement. After the courses have been ordered, a variety of approaches can be used to choose the best time slot for each course.

    The disadvantages of the SA method are that it needs a long time to reach good solutions and that some of its parameters must be supplied with care [16]. Another meta-heuristic used to solve the timetabling problem is the Tabu Search (TS) method, which remembers the features of prior solutions to avoid visiting them again. This reduces the search space and yields results relatively quickly.

    3.1.5 Genetic approaches

    Genetic Searching (GS) algorithms are another meta-heuristic approach employed to obtain high-quality timetables. Many papers in the literature apply genetic algorithms to timetabling problems, such as [28].

    In general, a genetic searching method starts by producing randomized timetables, which form the parent population for the timetabling problem. Each generated timetable is then converted to a consistent timetable by eliminating the courses that conflict with other courses. Some initial timetables may be empty, with no courses scheduled. After that, a selection criterion is applied to choose the timetables that are used to produce a new parent population by means of genetic operators [6]. This operation is repeated until the produced solution contains all scheduled courses and the soft constraints are satisfied to the maximum degree. The general genetic algorithm is as follows:

    Create a Random Initial State

    An initial population is created from a random selection of solutions (which are analogous to chromosomes).

    Evaluate Fitness

    A fitness value is assigned to each solution (chromosome) depending on how close it actually is to solving the problem (thus arriving at the answer to the desired problem). (These solutions are not to be confused with answers to the problem; think of them as possible characteristics that the system would employ in order to reach the answer.)

    Reproduce (& Children Mutate)

    Chromosomes with a higher fitness value are more likely to reproduce offspring (which can mutate after reproduction). The offspring is a product of the father and mother, and its composition consists of a combination of genes from both (this process is known as crossing over).

    Next Generation

    If the new generation contains a solution that produces an output close enough or equal to the desired answer, then the problem has been solved. If this is not the case, the new generation goes through the same process as its parents did. This continues until a solution is reached.
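
    The four steps above can be sketched on a toy timetabling instance. This is a minimal sketch under stated assumptions, not the method of [28] or [6]: the lessons, the number of periods, tournament selection, one-point crossover, keeping the best chromosome unchanged and all parameter values are hypothetical choices made for illustration. A chromosome is a list that gives the period assigned to each lesson.

```python
import random

# Hypothetical toy instance: each lesson is a (class, teacher, room) tuple.
LESSONS = [("C1", "T1", "R1"), ("C2", "T2", "R2"),
           ("C1", "T2", "R1"), ("C2", "T1", "R2")]
PERIODS = 3  # number of available time slots


def clashes(chromosome):
    """Count pairs of lessons that share both a period and an identifier."""
    count = 0
    for i in range(len(LESSONS)):
        for j in range(i + 1, len(LESSONS)):
            same_period = chromosome[i] == chromosome[j]
            if same_period and set(LESSONS[i]) & set(LESSONS[j]):
                count += 1
    return count


def fitness(chromosome):
    # Higher is better; 0 means the timetable is clash-free.
    return -clashes(chromosome)


def evolve(rng, population_size=20, mutation_rate=0.1, max_generations=200):
    # Create a random initial state.
    population = [[rng.randrange(PERIODS) for _ in LESSONS]
                  for _ in range(population_size)]
    for _ in range(max_generations):
        # Evaluate fitness and stop once a clash-free solution exists.
        best = max(population, key=fitness)
        if fitness(best) == 0:
            return best
        next_generation = [best]  # assumption: keep the best unchanged
        while len(next_generation) < population_size:
            # Fitter chromosomes are more likely to reproduce (tournament).
            mother = max(rng.sample(population, 3), key=fitness)
            father = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, len(LESSONS))  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Each gene of the child may mutate to a random period.
            child = [rng.randrange(PERIODS) if rng.random() < mutation_rate
                     else gene for gene in child]
            next_generation.append(child)
        population = next_generation
    return max(population, key=fitness)


best = evolve(random.Random(1))
print(best, clashes(best))
```

    The sketch omits the repair step described earlier (removing conflicting courses from each generated timetable) and handles conflicts only through the fitness value.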

    Chapter 4

    EXPERIMENTAL PROCEDURES

    This chapter describes the experimental procedures followed and processing parameters

    selected in the present study.

    4.1 Methodology

    Making a class schedule is an NP-complete problem. It can be approached with a heuristic search algorithm or a genetic algorithm, but these work well only for simple cases. For more complex inputs and requirements, finding a reasonably good solution can take a long time, or may not be feasible at all. In this dissertation work we use the improved K-means clustering algorithm and decision tree techniques to solve the timetabling problem.

    4.2 Research Design

    The thesis work was carried out through a number of stages, from problem selection to a literature review of the state-of-the-art technology specific to an automated timetable generator on the Java platform. Most of the time was spent identifying and selecting the problem and reviewing the literature. Selecting the optimization algorithms and understanding how they work also took considerable time. We divided the overall research into four stages, as shown in Figure 4.1 below:

    [Figure 4.1 Research Methodology. Stages, in order: Problem Identification & Selection; Literature Review; Select Appropriate Algorithm Toolbox; Optimization Results]

    In this research work we used the improved K-means clustering algorithm to cluster the data set. Clustering means finding groups of objects such that the objects in one group are similar to one another and different from the objects in other groups. The traditional K-means algorithm is a widely used clustering algorithm with a wide range of applications. The improved K-means clustering algorithm analyzes the advantages and disadvantages of the traditional K-means clustering algorithm and improves it by choosing better initial cluster centers and by determining the K value. Simulation experiments show that the improved clustering algorithm is more stable during the clustering process and also reduces, or even avoids, the impact of noise data in the data set, so that the final clustering result is more accurate and effective.

    We also used the decision tree technique of data mining to classify the clustered data set. A decision tree is a flow-chart-like tree structure in which internal nodes are drawn as rectangles and leaf nodes as ovals. Every internal node has two or more child nodes and contains a split, which tests the value of an expression of the attributes. Arcs from an internal node to its children are labeled with the distinct outcomes of the test. Each leaf node has a class label associated with it.
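
    The structure described above can be sketched as a small data structure. This is an illustrative sketch only: the class names Node and Leaf, the attribute names and the class labels are hypothetical and do not come from the tree actually built in this work.

```python
class Leaf:
    """A leaf node carries a class label."""

    def __init__(self, label):
        self.label = label


class Node:
    """An internal node tests one attribute (the split)."""

    def __init__(self, attribute, children):
        self.attribute = attribute  # attribute tested by this split
        self.children = children    # maps each test outcome (arc) to a subtree


def classify(tree, record):
    """Follow the arcs that match the record until a leaf is reached."""
    while isinstance(tree, Node):
        outcome = record[tree.attribute]
        tree = tree.children[outcome]
    return tree.label


# Hypothetical tree: every internal node has two child nodes.
tree = Node("teacher_available", {
    "no": Leaf("leave unassigned"),
    "yes": Node("experience", {
        "senior": Leaf("assign preferred slot"),
        "junior": Leaf("assign remaining slot"),
    }),
})

print(classify(tree, {"teacher_available": "yes", "experience": "senior"}))
```

    Classifying a record is a walk from the root to a leaf, so the path taken reads directly as a rule, for example "if the teacher is available and senior, assign the preferred slot".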

    This dissertation work is implemented on the Java platform. Java is a computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that code that runs on one platform does not need to be recompiled to run on another. Java applications are typically compiled to bytecode (class files) that can run on any Java virtual machine (JVM) regardless of computer architecture. Java is, as of 2014, one of the most popular programming languages in use. The main steps of this research work are the following:

    1. Dynamically/manually create the database.
    2. Connect the database with the Java NetBeans IDE.
    3. Preprocess the data set with the clustering algorithm.
    4. Classify the data set clusters using a decision tree.
    5. Knowledge discovery: the timetable is scheduled and generated.

    4.2.1 Dynamically/manually create the database

    In this dissertation work, we first create the database for teacher registration, student registration, the subjects of the seven semesters of B.Tech Computer Science, teacher attendance, the room numbers of the college and the time slots. We call this database the timetable scheduling database. It has 13 tables in total. The teacher table holds all the information about the teachers; we enter a teacher's information into this table when we perform the registration procedure for that teacher, and we also take a photograph of the teacher at the time of registration. The student table contains the students' registration information; we enter a student's information into it when a new student takes admission in the college or during registration. The database also has the tables semesterfirst, semestersecond, semesterthird, semesterforth, semesterfifth, semestersixth and semesterseventh, which hold information about the subjects of the respective semesters. An attendance table contains information about the presence or absence of teachers, and a timetable table holds the information regarding the time slots of the college.
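
    The tables named above could be laid out roughly as follows. This is a sketch under loud assumptions: it uses an in-memory SQLite database from Python as a stand-in for the database used in this work, every column name is a hypothetical placeholder, and only the eleven tables named in the text are created.

```python
import sqlite3

# Table names are taken from the text; everything else is hypothetical.
SEMESTER_TABLES = [
    "semesterfirst", "semestersecond", "semesterthird", "semesterforth",
    "semesterfifth", "semestersixth", "semesterseventh",
]


def create_schema(conn):
    """Create the tables named in the text with placeholder columns."""
    conn.execute(
        "CREATE TABLE teacher (id INTEGER PRIMARY KEY, name TEXT, photo BLOB)")
    conn.execute(
        "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, semester INTEGER)")
    for table in SEMESTER_TABLES:
        # One table per semester, listing that semester's subjects.
        conn.execute(
            f"CREATE TABLE {table} (subject_code TEXT PRIMARY KEY, subject_name TEXT)")
    conn.execute(
        "CREATE TABLE attendance ("
        "teacher_id INTEGER REFERENCES teacher(id), day TEXT, present INTEGER)")
    conn.execute(
        "CREATE TABLE timetable ("
        "slot_id INTEGER PRIMARY KEY, day TEXT, start_time TEXT, end_time TEXT)")


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0] for row in
    conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)
```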

    4.2.2 Connect the database with java net-beans IDE

    After creating the database we need to connect it with Java NetBeans. The connection lets us enter data into the database dynamically through the user interface, such as teachers' registration information, student registration information and teacher attendance, and fetch data from the database through the same interface when required. Using this connection, we fetch data from the database and schedule the timetable for the seven semesters of courses.

    4.2.3 Pre process the data set with clustering algorithm

    After entering the data into the database, we need to preprocess the data set using a clustering algorithm. In this dissertation work we use the improved K-means clustering algorithm to create the clusters for teachers, subjects, rooms and time slots. We use it instead of the traditional K-means algorithm because it has a number of advantages over K-means and overcomes the disadvantages of the K-means clustering algorithm. The K-means and improved K-means algorithms are described in the following sections.

    4.2.3.1 K-mean algorithm

    The K-means clustering algorithm was proposed by J. B. MacQueen in 1967 to deal with the problem of data clustering. The algorithm is relatively simple, so it has had a wide influence on scientific research and industrial applications [30]. It is based on decomposition: taking K as a parameter, it divides n objects into K clusters such that the similarity between clusters is relatively low, while minimizing the total distance between the values in each cluster and the cluster center. The center of each cluster is the mean value of the cluster, and similarity is calculated with respect to this mean value of the cluster's objects. The similarity measure chosen for the algorithm is the reciprocal of the Euclidean distance.

    a) Procedure of K-means Algorithm

    1. Distribute all objects among K different clusters at random.
    2. Calculate the mean value of each cluster, and use this mean value to represent the cluster.
    3. Redistribute each object to the closest cluster according to its distance to the cluster center.
    4. Update the mean value of each cluster, that is, recalculate the mean value of the objects in each cluster.
    5. Calculate the criterion function E, and repeat steps 3 and 4 until the criterion function converges.

    Usually, the K-means criterion function adopts the squared error criterion, defined as:

    $$E = \sum_{j=1}^{K} \sum_{\substack{i=1 \\ x_i \in C_j}}^{n} \lVert x_i - m_j \rVert^2 \qquad (1)$$

    Here E is the total squared error of all the objects in the data set, x_i is a data object belonging to cluster C_j, and m_j is the mean value of cluster C_j (x and m are both multi-dimensional). The purpose of this criterion is to make the generated clusters as compact and as independent as possible.
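
    The procedure and the criterion function E can be sketched as follows. This is a minimal sketch of the traditional algorithm under stated assumptions: the function names, the convergence tolerance, the iteration cap and the rule of re-seeding an empty cluster with a random object are illustrative choices that are not part of the description above.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def k_means(points, k, rng, max_iterations=100, tolerance=1e-12):
    # Step 1: distribute all objects among K clusters at random.
    labels = [rng.randrange(k) for _ in points]
    previous_error = None
    for _ in range(max_iterations):
        # Steps 2 and 4: the mean value of each cluster represents it.
        centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty cluster is re-seeded with a random object.
            centers.append(mean(members) if members else rng.choice(points))
        # Step 3: move every object to the cluster with the closest center.
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Step 5: criterion function E, the total squared error.
        error = sum(squared_distance(p, centers[label])
                    for p, label in zip(points, labels))
        if previous_error is not None and abs(previous_error - error) < tolerance:
            break  # E no longer changes, so the algorithm has converged
        previous_error = error
    return labels, centers, error


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, error = k_means(points, k=2, rng=random.Random(0))
print(labels, round(error, 3))
```

    On the two well-separated groups in the example, the first three points end up in one cluster and the last three in the other; which cluster is numbered 0 depends on the random start.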

    b) Analysis of the Performance of K-means Algorithm

    Advantages:

    1. The K-means algorithm is a classic algorithm for solving clustering problems; it is relatively simple and fast.
    2. For large data collections the algorithm is relatively flexible and highly efficient, because its complexity is O(nkt), where n is the number of objects, k is the number of clusters and t is the number of iterations. Usually k << n and t << n. The algorithm usually ends at a local optimum.
    3. Because it relies on the Euclidean distance, it can process only numerical values, for which it has good geometric and statistical meaning.

    Disadvantages:

    The inherent properties of the K-means clustering algorithm determine its limitations, which are as follows:

    1. The K value is the most important parameter of the K-means clustering algorithm, yet there is no applicable basis for choosing the value of K (the number of clusters to generate). The algorithm is also sensitive to initial values: different initial values may generate different clusters.
    2. The K-means clustering algorithm depends heavily on the initial cluster centers. If an initial cluster center is far away from the true cluster centers of the data, the number of iterations can grow very large, and the final clustering result is more likely to fall into a local optimum, giving incorrect clustering results.
    3. The K-means clustering algorithm is strongly sensitive to noise data objects. If the data set contains a certain amount of noise data, it will affect the final clustering results and lead to errors.
    4. It is very difficult for the K-means clustering algorithm to discover clusters of arbitrary shape.
    5. The K-means clustering algorithm is limited by the amount of data. In each iteration, the cluster to which every data object belongs must be adjusted and the cluster centers recomputed, so for very large amounts of data the K-means clustering algorithm is not well suited.

    4.2.3.2 The Research Point of K-means Clustering Algorithm

    Research on the K-means clustering algorithm mainly concerns the following two aspects.

    First, the determination of the K value. As the above analysis shows, the K value and the initial cluster centers have a far-reaching impact on the whole clustering process and on the final clustering results, yet in practical applications the K value is very difficult to determine directly or in a single attempt [30]. In particular, when the amount of data to be processed is very large, determining the K value for the K-means algorithm becomes very difficult. At present, two relatively effective approaches for determining the K value are a cost function based on distance and a propagation clustering algorithm based on nearest neighbors. The former finds the minimum of the cost function and thereby obtains the corresponding K value. The latter uses a nearest-neighbor clustering algorithm to calculate an appropriate number of cluster centers; this number serves as the maximum K value for the K-means clustering algorithm, from which the optimal value of K is obtained.

    Second, the choice of initial cluster centers. The K-means clustering algorithm solves the problem iteratively: apart from the first step, each step is expected to improve the clustering result to some extent; otherwise the iteration terminates. The traditional K-means clustering algorithm takes the cluster squared error, and whether the criterion function value still changes, as its termination condition. However, the clustering results obtained with this criterion function easily fall into a local minimum, because the search only ever moves in the direction that decreases the criterion function value [31]. The improvement of the K-means algorithm is therefore mainly reflected in the following two aspects:

    Optimize the initial cluster centers: find a set of data points that reflects the characteristics of the data distribution to serve as the initial cluster centers, so as to support the division of the data to the greatest extent.

    Optimize the calculation of the cluster centers and of the distance from data points to the cluster centers, so that it better matches the goal of clustering.

    4.2.3.3 Improved K-means Clustering Algorithm

    a) Related Concept

    Definition 1: The distance between data points and the cluster center. The distance between data point xi and cluster center kj is defined as follows [5]:

    (2)

    where w represents the number of attributes of the data point xi.

    Definition 2: The density parameter. The number of data points contained within a given scope is defined as the density parameter. The scope is a circle centered on a space point xi that has not yet been counted, with the given radius. The greater the density around xi, the greater the value of the density parameter.

    Definition 3: The core data points. If the y-neighborhood of a data point contains at least PTS_min data points, then the data point is called a core data point.

    Definition 4: The cluster center. Unlike the traditional adjustment of cluster centers, the improved clustering algorithm adds the weight of each data point to the cluster center. Data points near the center of the cluster carry more weight; conversely, data points far from the cluster center carry less weight. The cluster center is defined as follows:

    (3)

    where j represents the jth cluster, h is the number of data points in the cluster, and djh represents the distance between the hth data point belonging to cluster c and the cluster center, subject to a restriction on the order of dj1, dj2, ..., djh.

    Definition 5: The Euclidean distance between data points and the cluster center. The distance between a data point and the cluster centers determines the cluster to which the data point belongs. The Euclidean distance is defined as follows:

    (4)

    where j represents the jth cluster cj, i represents the ith data point xi, and dji is the Euclidean distance between data point xi and the cluster center cj; the two remaining terms are the squared error of the cluster cj and the sum of the squared errors of the K clusters c.

    b) Improved K-means Algorithm Description

    Algorithm 1: Improved K-means Algorithm

    Input: a data set X containing n data points; the number of clusters k.

    Output: k clusters that meet the convergence of the criterion function.

    Program process:

    Step 1. Initialize the cluster centers.

    Step 1.1 Select a data point xi from data set X, mark it as counted, and compute the distance between xi and every other data point in X. For each data point that meets the distance threshold, mark that data point as counted and add 1 to the density value of xi.

    Step 1.2 Select a data point that is not yet marked as counted, mark it as counted and compute its density value. Repeat Step 1.2 until all the data points in X have been marked as counted.

    Step 1.3 Select the data points whose density value is greater than the threshold and add them to the corresponding high-density area set D.

    Step 1.4 From the high-density area set D, pick the data point with relatively high density and add it to the initial cluster center set. Then find the remaining k-1 data points such that the distances among the k initial cluster centers are the largest.

    Step 2 Assign the n data points of data set X to the closest cluster.

    Step 3 Adjust each cluster center by formula (3).

    Step 4 Calculate the distance of each data object from each cluster center by formula (4), and redistribute the n data points to the corresponding clusters.

    Step 5 Adjust each cluster center by formula (3).

    Step 6 Calculate the criterion function E using formula (1) and check whether it has converged. If it has converged, stop and output the clusters; otherwise, jump to Step 4.
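
    Step 1 of Algorithm 1 (choosing the initial cluster centers from high-density points) can be sketched as follows. This is a simplified illustration and not the algorithm as published: the names radius and density_threshold are hypothetical parameters, the density of every point is computed directly instead of through the marking scheme of Steps 1.1 and 1.2, and the phrase about making the distances among the k centers "the largest" is read here as a greedy farthest-first choice, which is an assumption. The weighted center update of formula (3) is not reproduced.

```python
import math


def initial_centers(points, k, radius, density_threshold):
    """Pick k initial cluster centers from the high-density points."""
    n = len(points)
    # Steps 1.1 and 1.2: the density of a point is the number of other
    # points that lie within `radius` of it.
    density = [
        sum(1 for j in range(n)
            if j != i and math.dist(points[i], points[j]) <= radius)
        for i in range(n)
    ]
    # Step 1.3: the high-density area set D.
    dense = [i for i in range(n) if density[i] > density_threshold]
    if len(dense) < k:
        raise ValueError("not enough high-density points for k centers")
    # Step 1.4: start with the densest point of D ...
    centers = [points[max(dense, key=lambda i: density[i])]]
    # ... then repeatedly add the point of D that is farthest from the
    # centers chosen so far (assumption: greedy farthest-first selection).
    while len(centers) < k:
        remaining = [points[i] for i in dense if points[i] not in centers]
        centers.append(
            max(remaining, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]  # the last point is isolated noise
print(initial_centers(points, k=2, radius=1.5, density_threshold=2))
```

    Because the isolated point has no neighbors within the radius, it never enters the set D and therefore cannot become an initial center, which is how this initialization reduces the influence of noise data.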

    4.2.4 Classify the data set clusters using Decision Tree

    After making the clusters for the data set, we need to classify the clusters so that we can assign teachers to courses, subjects to teachers and classrooms to classes without any clash. In this dissertation work we classify the data set clusters using the decision tree technique. With the decision tree technique we also try to satisfy the soft constraints on the timetable, such as assigning subjects to teachers according to their choice, giving preference to more experienced teachers, and assigning classrooms and time slots according to the teachers' choice.

    4.2.4.1 Classification

    Classification consists of examining the features of a newly presented object and assigning it to a predefined class. The classification task is characterized by well-defined classes and a training set consisting of pre-classified examples. The task is to build a model that can be applied to unclassified data in order to classify it. Examples of classification tasks include:

    Classification of credit applicants as low, medium or high risk

    Classification of mushrooms as edible or poisonous

    Determination of which home telephone lines are used for internet access

    Predictive modeling can sometimes, but not necessarily desirably, be seen as a "black box" that makes predictions about the future based on information from the past and present. Some models are better than others in terms of accuracy, and some are better than others in terms of understandability; for example, models range from easy to understand to incomprehensible in the following order: decision trees, rule induction, regression models, and neural networks. Classification is one kind of predictive modeling. More specifically, classification is the