Proceedings of the ICICIS 2016 Conference
Gaborone, Botswana, May 18-20, 2016
Edited by
Oduronke Eyitayo
George Anderson
Department of Computer Science
University of Botswana
ICICIS 2016: Proceedings of the 1st International Conference on The Internet, Cyber Security and Information
Systems, Grand Palm Hotel, Gaborone, Botswana
May 18-20, 2016
Jointly organised by:
Department of Computer Science, University of Botswana and Department of Applied Information Systems, University of Johannesburg, South Africa
International Programme Committee Chairs:
Audrey N Masizana, PhD, University of Botswana
Barnabas Gatsheni, PhD, University of Johannesburg, South Africa

General Chairs:
Ezekiel Uzor Okike, PhD, University of Botswana
Kennedy Njenga, PhD, University of Johannesburg, South Africa
Review Committee
O. T. Eyitayo, PhD (Chair), University of Botswana
G. Anderson, PhD (Co-Chair), University of Botswana
S. D. Asare, University of Botswana
S. Browne, PhD, National University of Ireland Galway, Ireland
D. Garg, University of Botswana
G. Malema, PhD, University of Botswana
T. M. Mogotlhwane, PhD, University of Botswana
G. Mosweunyane, PhD, University of Botswana
P. Motlogelwa, University of Botswana
T. Motshegwa, PhD, University of Botswana
K. Njenga, PhD, University of Johannesburg, South Africa
T. Seipone, University of Botswana
Q. Sello, University of Botswana
E. Thuma, PhD, University of Botswana
M. van den Bergh, PhD, University of Johannesburg, South Africa
Sponsors
United States of America – Botswana Embassy
Ministry of Transport & Communications
Botswana Innovation Hub
Botswana Fibre Networks
Botswana Communications Regulatory Authority
Bit Brands Digital Agency
ISBN 978-99968-0-430-4
Copyright © 2016 Department of Computer Science, University of Botswana Published by: Department of Computer Science University of Botswana Private Bag UB 00704 Gaborone, Botswana
Table of Contents
Information Systems Track
Modeling and Simulation of a Hybrid Mobile Target Tracking System for Livestock
Obakeng Maphane, Oduetse Matsebe, Molaletsa Namoshe ........................................................... 1
A Distributed Computational MapReduce Algorithm for Big Data Electronic Health Records
Sreekanth Rallapalli, Radhika Kidambi, Suryakanthi Tangirala ....................................................... 11
A Collaborative Tool for MPhil/PhD Student Dissertation Workflow
Bigani Sehurutshi, Oduronke T. Eyitayo ............................................................................................... 22
Evaluating the Effect of Privacy Preserving Record Linkage on Student Exam Record Data Matching
George Anderson, Tsholofetso Taukobong, Audrey Masizana ............................................................. 35
Ontological Perspectives in Information System, Information Security and Computer Attack
Incidents (CERTS/CIRTS)
Ezekiel Uzor Okike, Tshiamo Motshegwa, Molly Nkamogelang Kgobathe ......................................... 46
Cybersecurity Track
Big Data Forensics As A Service
Oteng Tabona, Andrew Blyth ................................................................................................................ 61
Information Security Policy Violation: The Triad of Internal Threat Agent Behaviors
Maureen van den Bergh, Kennedy Njenga ........................................................................................... 69
Challenges in Password Usability - Users Perspective
Tiroyamodimo Mogotlhwane, Kagiso Ndlovu....................................................................................... 82
Enhancing the Least Significant Bit (LSB) Algorithm for Steganography
Oluwaseyi Osunade, Ganiyu Idris Adeniyi ............................................................................................. 90
A Security Model for Mitigating Multifunction Network Printers Vulnerabilities
Jean-Pierre Kabeya Lukusa.................................................................................................................. 103
Proceedings of the 1st International Conference on
the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016
Copyright © Department of Computer Science, University of Botswana, 2016 1
IC1012
Modeling and Simulation of a Hybrid Mobile Target
Tracking System for Livestock
Obakeng Maphane, Oduetse Matsebe, Molaletsa Namoshe
Department of Mechanical & Energy Engineering College of Engineering & Technology
Botswana International University of Science and Technology Private Bag 16, Palapye, Botswana
maphaneo@biust.ac.bw; matsebeo@biust.ac.bw; namoshem@biust.ac.bw
ABSTRACT
Wireless Sensor Networks (WSNs) have enjoyed widespread application in the Internet of Things (IoT), especially indoors, and research has now expanded their use to outdoor applications. Their strength lies in canvassing large areas, using wireless nodes attached to sensors that relay the collected data to a sink node. The Global System for Mobile Communications (GSM) / General Packet Radio Service (GPRS) network is one of the oldest and fastest-growing telecommunications technologies. It covers wide ranges; unfortunately, the signal deteriorates with distance from Base Stations (BSTs), and it is expensive to use for tracking. This paper presents the concept and simulation of a high-level solution to livestock tracking using a hybrid mobile target tracking system, in which GSM and WSN are combined for the purpose of livestock tracking, leveraging the scalability of the WSN and the coverage range of GSM. The proposed system combines strategically located static nodes and mobile nodes embedded within a WSN, handshaking with the GSM/GPRS network through BSTs on a timely basis. The collected data is relayed hierarchically to a backend database, from mobile node through static nodes to BST and database. The hybrid system paves the way for research in livestock behavior, virtual fencing, Foot and Mouth Disease (FMD) monitoring, GSM-to-WSN connection, WSN optimization and applications of GSM technology for tracking or in the IoT. The simulation results presented show that the system is viable.
Keywords: Wireless Sensor Networks, Foot and Mouth, GSM Technology, Internet of Things, Solar Power, Livestock Tracking and Management.
1 INTRODUCTION
1.1 Motivation
Traditional methods of livestock management have proved costly, slow and prone to failure; consequently, they are a major contributor to the decline in the nation's beef industry, quite apart from market restrictions. Electronic agriculture, on the other hand, has taken off in strides: many mobile and web-based applications are being developed to improve the industry, leveraging the speed of technology and the coverage of the internet to promote and revive agriculture. Electronic agriculture is the fusion of electronics into the management and development of tools that improve agricultural practices.
1.2 Background
a) Livestock management
Since humans first domesticated buffaloes for meat and milk, they have managed them by housing them in kraals and monitoring them even during grazing periods. For a long time this was the only way of monitoring livestock, and rightly so, because livestock was the main source of livelihood; over time, however, urbanization has forced people to balance livestock management with paid jobs, especially in African countries (mainly Botswana). Management became even more costly when foreign markets introduced requirements to trace livestock movements, which farmers could then only estimate. Technology was introduced to help, one example being the ingestible Radio Frequency Identification (RFID) bolus used for identification and historic records. The bolus, inserted in the animal's gut, contains an RFID tag that veterinary officers read during routine vaccinations and assessments, before travel documentation is issued for transportation to abattoirs. This provided unique identification for livestock, gave owners some degree of security against stock theft (Moreki et al., 2012) and gave a limited account of an animal's history based on when and where it was scanned. It had its challenges, however. The insertion process required trained officers, and the Ministry of Agriculture's (MoA) Department of Veterinary Services (DVS) had a limited number of them. It was costly for the Ministry to deploy and track livestock, and above all the scheme left gaps in tracing an animal's movement, because officers only knew where the animal had been scanned, not where it had been since it was tagged (Moreki et al., 2012; Sunday Standard Reporter, 2012). Recently, research applying WSN, GPS and, on some occasions, the GSM/GPRS network has taken off in agriculture.
b) GSM Network
The telephone was invented by Alexander Graham Bell in the late 19th century; GSM itself, developed in the late 20th century for routing calls, is one of the fastest-developing technologies, having moved from a circuit-switched network to the data era in a remarkably short period of time. Given the architecture of the network, it is possible to track a user to the nearest cell tower, giving a rough estimate of the user's location, because the Base Station Towers (BSTs) are stationary and in known locations. After the introduction of mobile devices this became more challenging, but it remained possible to some degree. The GSM network can track devices over large ranges, in the order of kilometers (Behzad et al., 2014), but it is prone to network loss and hence to loss of the target device (Ficek et al., 2013). Network traffic is another challenge in using GSM for tracking, because it introduces bottlenecks into the system. Finally, such a system is expensive to install and run, owing to the size and cost of the equipment.
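The coarse cell-tower localization described above can be sketched in a few lines. This is an illustrative cell-ID scheme only; the tower names, coordinates and signal readings below are hypothetical, not taken from any real deployment.

```python
# Hypothetical BST positions (fixed and known), as (x, y) in kilometers.
BST_POSITIONS = {
    "BST-A": (0.0, 0.0),
    "BST-B": (8.0, 0.0),
    "BST-C": (4.0, 7.0),
}

def estimate_position(signal_strengths):
    """Cell-ID localization: return the position of the tower with the
    strongest received signal as a rough estimate of device location."""
    serving = max(signal_strengths, key=signal_strengths.get)
    return BST_POSITIONS[serving]

# A device reporting received signal strength (dBm) per tower:
readings = {"BST-A": -95.0, "BST-B": -70.0, "BST-C": -88.0}
print(estimate_position(readings))  # the serving cell's coordinates
```

The estimate is only as fine as the cell size, which is why the paper pairs GSM with a WSN for closer-range tracking.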
c) Wireless Sensor Network
WSNs have taken off exponentially, and their increased application in automated home security has sparked huge interest. The silicon boom encouraged the growth of miniature microchips and semiconductors, contributing to the growth of WSN applications: areas that were once hard to access because of equipment size are now reachable, and sensor node power consumption has been reduced significantly. WSNs were initially used in research to track wildlife movements, deep-sea animal
movements, underground mine conditions and veldt fires. They have recently been introduced into agriculture, monitoring farming areas and tracking livestock (Nagl et al., 2003; Huircan et al., 2010; Raizman et al., 2013). WSNs are widely scalable and provide short- to medium-range coverage of the deployment area. The challenge in applying them is limited power, which reduces their life span (Nagl et al., 2003). WSNs have been deployed for tracking in two ways. In the first, static sensors monitor an area in a pre-defined network structure; they relay any data identified, being programmed to pass information from the sensors through the network to the base station. The second method is the ad-hoc network: a mobile WSN in which the tracked targets, tagged with sensors, constitute the nodes of a dynamic mesh network.
1.3 Related Works

Research on tracking mobile targets has been around for quite some time; it took off after the development of the GPS system, built initially by the US military to improve troop location and the identification of target zones.
Chakole et al. (2013) developed an application of the hybrid system for tracking and monitoring vehicles in order to improve service delivery during accidents; the vehicle information is relayed to a database and displayed via a Graphical User Interface (GUI). Behzad et al. (2014) developed a similar hybrid system for monitoring and tracking vehicles; their focus was on vehicle security and on providing more features than current vehicle tracking systems. The authors developed a low-cost tracker that also monitors the status of the vehicle while parked: through a hidden button for system activation, the system warns the user of movements and turns off the ignition if any suspicious motion is detected. It also gives the user the ability to control the ignition system through text messages in case the vehicle is stolen.
Other applications of this hybrid network are in smart home monitoring, as outlined in (Xu et al., 2010) and (Liu, 2014); both developed systems that use a ZigBee WSN connected to GPRS and linked to a database. Xu et al. (2010) focused on improving a Dijkstra routing algorithm and testing it on their system to find the shortest path over which to relay data, while Liu (2014) focused on hardware design and analysis for a low-cost, low-power, high-rate ZigBee network for smart home technology.

Ficek et al. (2013) reviewed tracking in mobile networks and presented a Short Message Service (SMS)-based active tracking system for obtaining the position of mobile user terminals through the pre-existing GSM network. The system can be deployed in an academic environment using off-the-shelf components; it can also operate across platforms and across borders for roaming customers.
2 PROPOSED HYBRID SYSTEM DESIGN
The system is developed from off-the-shelf components. A mobile asset tracking system built on wireless sensor networks usually has a mesh or star topology, with either clustered or dynamic routing. To extend the coverage distance, GSM/GPRS is combined with the WSN; the system is composed mainly of the following components:
2.1 Hardware Design
a) Mobile Node (ear tag)
The mobile tags are attached to the animals' ears, encased in a plastic cover. These mobile nodes frequently receive the target's GPS coordinates and append to them, in a packet, the RFID tag identification from the current Livestock Identification and Tracking System (LITS) and the owner's details; the packet is then transferred to the static nodes upon connection. The mobile node is composed of a GPS receiver, an ARM microcontroller, a GSM module, a hybrid power system (coin battery plus thin-film solar) and a memory card.
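The packet assembly just described can be sketched as follows. The field names and identifiers are illustrative assumptions, not the actual on-air format used by the system.

```python
import time

def build_packet(gps_fix, rfid_id, owner_id):
    """Assemble the mobile-node payload described above: GPS coordinates
    plus the existing LITS RFID identifier and the owner's details.
    Field names are hypothetical, chosen only for this sketch."""
    lat, lon = gps_fix
    return {
        "rfid": rfid_id,         # existing LITS ear-tag identifier
        "owner": owner_id,       # owner details
        "lat": lat,              # GPS latitude of the animal
        "lon": lon,              # GPS longitude of the animal
        "ts": int(time.time()),  # time of fix, for historic records
    }

# Example with made-up identifiers:
packet = build_packet((-22.546, 27.125), "BW-LITS-0012345", "FARMER-77")
```

The packet is buffered on the node's memory card until a static node comes into range.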
b) Static Nodes (gateway)
These parts of the WSN are strategically placed around the BSTs; they act as WSN gateways, sinks or base stations for the mobile nodes. They have stronger coverage, with external antennae and a much larger solar panel power system, and their location and distance from the BST are fixed. Each static node comprises a larger hybrid power module (battery plus solar power system), a microcontroller, a memory card and a GSM module.
c) GPS module & GSM/GPRS module
For mobile nodes, a SIM968 combined GPS/GSM module (SIM Tech, 2007) is used to receive GPS coordinates and to communicate with the static nodes. For static nodes, since their location is known and stationary, no GPS receiver is needed; only a SIM900DS GSM module is used (SIM Tech, 2007). Both modules are portable and equipped with a powerful single-processor ARM926EJ-S core. The SIM900DS has dual-SIM capability, and both are quad-band modules, which allows more than one frequency in a network to be used to implement the system, improving the chances of transmission and reducing connection time; furthermore, multiple bands can be used to relay data.
d) Electronic RFID tag
These are already in circulation; the system will adopt the current tags to reduce costs, ease adoption and keep to current protocols when the system comes into use. The tags need not be read separately, since they are encased by the tag upgrade; all that is required is to install the casing with the existing RFID number stored at installation. The proposed system is designed for farmers and will reduce costs for government; both parties, however, can choose what works for them without compromising the other.
e) Hybrid Powering Module
These modules mainly serve the WSN. Research has shown that power management is a major challenge for WSNs (Nagl et al., 2003); therefore, in order not to lose focus on the main task of real-time tracking, the authors adopt a hybrid power system (chemical batteries plus solar power) to ensure a longer node life span (Huircan et al., 2010). Thin-film solar cells layered on the surface of the ear tag casing will collect and convert solar radiation into power for the mobile nodes; for the static nodes, small to medium solar panels will be combined with larger batteries.
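The benefit of the hybrid supply can be seen with a back-of-envelope lifetime estimate. All figures below (battery capacity, average draw, daily solar harvest) are assumed purely for illustration and are not measurements from the proposed hardware.

```python
def node_lifetime_days(battery_mah, avg_draw_ma, solar_harvest_mah_per_day):
    """Estimate node lifetime in days for a hybrid (battery + solar)
    supply. If the daily solar harvest covers the daily draw, the
    battery never depletes and the lifetime is unbounded."""
    daily_draw_mah = avg_draw_ma * 24
    net_drain = daily_draw_mah - solar_harvest_mah_per_day
    if net_drain <= 0:
        return float("inf")  # solar sustains the node indefinitely
    return battery_mah / net_drain

# Illustrative figures only: a 620 mAh coin cell, 2 mA average draw,
# and a thin-film cell harvesting 30 mAh per day.
print(node_lifetime_days(620, 2.0, 30.0))
```

Even a modest harvest offsets a large share of the daily draw, which is why the authors expect the hybrid supply to extend node life span.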
2.2 Software Design
a) Nodes Software
Each component in the network has a specific functionality that outlines how it combines with the others to form the network. Figure 1 (Appendix A) shows the individual node flow diagrams, outlining the separate nodes and their data flow; together, these nodes form the network (WSN) software of the proposed system. The flow diagrams showing the functions, decision trees and data flow within the system are given in Figure 1 for the mobile node, static node and BST node, from left to right respectively.

The envisioned overall pseudo code of the hybrid system's software, focusing on the networking part, is outlined in Algorithm 1. It is not detailed; the variables are named for ease of understanding and are subject to change over time. The pseudo code demonstrates the proposed approach to data structure and transfer, from the livestock nodes to the abstracted database. The research is not focused on optimizing routing algorithms or database structure, but on the main task of tracking and storing historic data of livestock movement in real time.
b) Pseudo code for routing algorithm
Algorithm 1: Summarized algorithm of the system
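Algorithm 1 appears as a figure in the original proceedings and is not reproduced here. The sketch below illustrates only the hierarchical relay the text describes (mobile node to static node to BST to database); the node names, record fields and in-range test are all hypothetical, not the actual protocol.

```python
def relay_to_database(packets, static_nodes, database):
    """Sketch of the relay in Algorithm 1: a mobile node hands its
    buffered packets to the first static node in range, which forwards
    them via the BST to the backend database."""
    for packet in packets:
        gateway = next(
            (node for node in static_nodes if node["in_range"]), None)
        if gateway is None:
            # No static node in range: the packet stays buffered on the
            # mobile node's memory card for a later pass.
            continue
        # The static node appends its own identifier and uplinks via GSM.
        database.append({**packet, "via": gateway["id"]})

# Example with made-up nodes: S1 is out of range, S2 is the gateway.
db = []
nodes = [{"id": "S1", "in_range": False}, {"id": "S2", "in_range": True}]
relay_to_database([{"rfid": "BW-001", "lat": -22.5, "lon": 27.1}], nodes, db)
```

Buffering on connection loss matches the paper's aim of keeping a gap-free movement history even when a node drifts out of coverage.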
c) User Interface
The system will have multiple user interfaces: cell phone messaging, an Android application for mobile devices and a web application showing the positions of livestock overlaid onto Google Maps. These interfaces will provide farmers with alert messages and visual tracking of their livestock. GSM
messaging is aimed specifically at less technologically savvy and older users, who still find it difficult to use smart phones; it also provides access in areas where 3G+ coverage is not available. The web application gives a much larger viewing area and more access for users who have internet access on their mobile devices, e.g. laptops and tablets.
3 EXPERIMENTAL SETUP: MODELING AND SIMULATION
Modeling and simulation of the proposed hybrid system are performed with Network Simulator 2 (NS-2), using Tcl scripts generated by the NSG2 Java applet to model the WSN scenarios. This provides insight into the applicability of the WSN part of the system; however, it has limitations, outlined below:

- The simulator does not give information on connection loss.
- It does not recover lost nodes or show how they communicate once they have gone beyond coverage.
- Node movement is preplanned using waypoint configuration, not random as livestock movements would be in reality.
- Configuration via the applet is cluttered and can get confusing, with all the crossing lines; see Figure 3.
- The distance and direction of the nodes are not displayed as they move, which would be a good indicator of tracking.
- The simulator does not explicitly indicate or generate tracking errors.
The main algorithm, in which a BST sits at the center and is accessed only by the surrounding static nodes, while mobile nodes have access only to static nodes, simulates properly and gives hope that it will reduce connection costs and improve coverage area. Figure 3 (Appendix C) shows a snapshot of the scenario setup, with node 0 representing the BST, nodes 1-3 representing static nodes and the remaining nodes being the mobile nodes, i.e. the livestock.

Figure 4 (Appendix D) shows a snapshot of the simulation output as the system runs; the rings around the nodes depict a connection, and the arrows show data transfer between the nodes. The simulation can be run in both forward and reverse mode, in case one misses something during review, and the running speed can be changed.
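The preplanned waypoint mobility used in these scenarios can be mimicked in a few lines: a node moves at constant speed along straight segments between successive waypoints. This is a generic sketch of the waypoint model, not the simulator's actual implementation, and the coordinates and speed below are made up.

```python
def position_at(waypoints, speed, t):
    """Linear waypoint mobility: return the (x, y) position of a node
    moving at `speed` along the polyline of `waypoints` at time `t`."""
    x, y = waypoints[0]
    remaining = speed * t  # distance travelled so far
    for nx, ny in waypoints[1:]:
        seg = ((nx - x) ** 2 + (ny - y) ** 2) ** 0.5
        if remaining <= seg:
            frac = remaining / seg  # partial progress along this segment
            return (x + frac * (nx - x), y + frac * (ny - y))
        remaining -= seg
        x, y = nx, ny
    return (x, y)  # past the last waypoint: the node stays put

# A node walking a 10 m x 10 m "L" at 1 m/s, sampled at t = 15 s:
print(position_at([(0, 0), (10, 0), (10, 10)], 1.0, 15.0))  # (10.0, 5.0)
```

Real livestock movement is erratic rather than piecewise-linear, which is exactly the limitation the paper notes about the simulator.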
4 RESULTS AND DISCUSSION
The simulations show movement and communication between the nodes as they move through the network, indicating the path that data packets take from node to sink through the rings forming around the transmitting and receiving nodes. This indicates successful transmission and reception of data during motion and, at the same time, outlines the dynamic mesh network of the system. When a node goes out of range it stops transmitting, and even when it drifts back into range it does not show any communication; this indicates loss of the connection and, by extension, of the node or the livestock. Figure 4 (Appendix D) shows the nodes and their connection rings during a running simulation, but it shows neither the coordinates of the nodes nor the cardinal points from which one could work out their direction of travel. This may be due to limitations of the software, but it is necessary for
the interpretation of the viability of the system. The planned movements in the simulation do not depict well the erratic movements of livestock, but they give an idea of how the system will respond to mobile nodes.
5 CONCLUSION
This paper has presented the concept and simulation of a high-level solution to livestock tracking using a hybrid mobile target tracking system. Simulation results show that the system is viable, and that mobile node tracking using the combination of networks is achievable. The hybrid system provides both identification conforming to current standards and real-time, 24/7 livestock tracking, with no gaps in animal movements. One limitation of the simulator is that it does not quantify tracking errors; fortunately, tracking livestock does not require the extreme locating precision that, say, missile guidance does. Consequently, the close-proximity location provided by the system would be very helpful in reducing the costs of tracking livestock and keeping records of their movements. The challenges identified, namely node direction, node coordinates and node loss, will be worked on in future to improve the accuracy of the system. An algorithm to optimize the location and tracking of livestock and to improve power management will be incorporated into this research in future development. Simulations of the individual components will also be presented, together with detailed descriptions of the findings and of the changes made before deployment. A technique to quantify tracking errors will also be developed.
REFERENCES
Behzad, M., Sana, A., Khan, M. A., Walayat, Z., Qasim, U., Khan, Z. A., & Javaid, N. (2014). Design
and Development of a Low Cost Ubiquitous Tracking System. Procedia Computer
Science, 34, 220-227.
Chakole, S. S., Kapur, V. R., & Suryawanshi, Y. A. (2013, April). ARM hardware platform for vehicular
monitoring and tracking. In Communication Systems and Network Technologies (CSNT),
2013 International Conference on (pp. 757-761). IEEE.
Ficek, M., Pop, T., & Kencl, L. (2013). Active tracking in mobile networks: An in-depth view. Computer
Networks, 57(9), 1936-1954.
Huircán, J. I., Muñoz, C., Young, H., Von Dossow, L., Bustos, J., Vivallo, G., & Toneatti, M. (2010).
ZigBee-based wireless sensor network localization for cattle monitoring in grazing
fields. Computers and Electronics in Agriculture, 74(2), 258-264.
Liu, Z. Y. (2014). Hardware design of smart home system based on ZigBee wireless sensor
network. AASRI Procedia, 8, 75-81.
Long Distance Post. (1996). History of GSM and More. Belmout: LDpost.
Moreki, J. C., Ndubo, N. S., Ditshupo, T., & Ntesang, J. B. (2012). Cattle Identification and
Traceability in Botswana. Journal of Animal Science Advances, 2(12), 925-933.
Nagl, L., Schmitz, R., Warren, S., Hildreth, T. S., Erickson, H., & Andresen, D. (2003, September).
Wearable sensor system for wireless state-of-health determination in cattle. In Proceeding of
the 25th Annual International Conference of the IEEE EMBS, Cancun, Mexico (pp. 3012-
3015).
Raizman, E. A., Rasmussen, H. B., King, L. E., Ihwagi, F. W., & Douglas-Hamilton, I. (2013).
Feasibility study on the spatial and temporal movement of Samburu's cattle and wildlife in
Kenya using GPS radio-tracking, remote sensing and GIS. Preventive veterinary
medicine, 111(1), 76-80.
SIM Com. (n.d.). GSM GPRS modules. SIMCom.
SIM Tech. (2007). SIM968 combo module. SIMCom.
Sunday Standard Reporter. (2012, May 12). Electronic ear tags to replace bolus. Retrieved January 27, 2016, from http://www.sundaystandard.info/electronic-ear-tags-replace-bolus
Xu, M., Ma, L., Xia, F., Yuan, T., Qian, J., & Shao, M. (2010, October). Design and implementation of
a wireless sensor network for smart homes. In Ubiquitous Intelligence & Computing and 7th
International Conference on Autonomic & Trusted Computing (UIC/ATC), 2010 7th
International Conference on (pp. 239-243). IEEE.
APPENDIX
A: Node flow diagram
Figure 1: Node flow diagrams
B: Pseudo code algorithm for the system
Figure 2: Summarized algorithm of the system
C: Simulation setup using NSG2
Figure 3: WSN scenario setup in NSG_2
D: Simulation output
Figure 4: Simulation Output
IC1013
A Distributed Computational MapReduce Algorithm for Big Data Electronic
Health Records
Sreekanth Rallapalli Network & Infrastructure Management, Faculty of Computing
Botho University Gaborone, Botswana
rallapalli.sreekanth@bothouniversity.ac.bw
Radhika Kidambi Department of Computer Science
AIMS Institutions Bangalore, India
kidambiradhika@gmail.com
Suryakanthi Tangirala Department of Accounting & Finance, Faculty of Business
University of Botswana Gaborone, Botswana
suryakanthi.tangirala@mopipi.ub.bw
ABSTRACT
Recent advances in technology and architecture have led to Big Data analysis. Big data for small and mid-size organizations can be implemented through cloud computing and processed through MapReduce, a programming technique for big data processing that requires network-attached storage and parallel processing. Designing an efficient algorithm for processing big data on the cloud is a challenging task. Health care generates huge amounts of data in the form of Electronic Health Records (EHR), and these data have to be processed on the cloud to minimize processing cost. An efficient, scalable MapReduce algorithm is required to process the large volumes of EHR data generated from various sources. Cloud computing lets you process big data without having to buy or maintain your own cluster or data centre. Divide-and-conquer and branch-and-bound algorithms proposed by researchers confirm the effectiveness and scalability of MapReduce algorithms, and recent research confirms that to provide the massive computational and storage resources demanded by big data at reasonable power costs, we must rely on parallel and distributed computation. In this paper we study existing algorithms for processing EHR on a cloud computing platform. We then propose ESPD-CIMAC, an efficient, scalable, parallel and distributed computational MapReduce algorithm for cloud computing to process EHR using Hadoop clusters.

Keywords: Big Data; Cloud Computing; EMR; Hadoop Clusters; MapReduce.
1 INTRODUCTION

Large amounts of medical data are being generated by hospital systems, clinical systems and other medical devices; such data are popularly known as Big Data (Manyika et al., 2011). Big data is currently managed and analysed using database management systems, but traditional database management systems (Evans & Hutley, 2010) cannot handle Big Data: they expect structured data, whereas Big Data is unstructured or semi-structured.
As the data is vastly scalable, an efficient, scalable algorithm is required to process it. While building scalable solutions, network bottlenecks and the low performance of hardware nodes have to be taken into consideration (Wang & Liu, 2011). The medical history of a patient stored in digital format is referred to as an Electronic Health Record (EHR). Healthcare needs scalable, distributable solutions on the cloud, and cloud computing is a promising technology that provides a shared pool of configurable computing resources managed with minimal management effort (Mell & Grance, 2010).

Various frameworks have been developed for big data computing (Ekanayake et al., 2010; Howe et al., 2010; Mihaylov et al., 2012; Low et al., 2012; Ewen et al., 2012; Zhang et al., 2012). MapReduce (The Apache Software Foundation, 2013) provides efficient methods for building scalable, distributable solutions for health care data. The most important issue is to move the big data to the cloud efficiently: cost-minimizing data migration using online algorithms (Zhang et al., 2013) gives the cloud the flexibility to choose a data centre for effective data processing. A MapReduce framework on a single cluster may not be suitable for distributed data and resources (Condie et al., 2010); distributed MapReduce architectures are preferred when data is aggregated from various sources and from different computing nodes.

A cloud MapReduce architecture is used to process healthcare records. Rack servers connected to a top-of-rack switch, which uplinks to other switches of similar bandwidth, form a cluster (Zhou et al., 2009). In this paper we propose ESPD-CIMAC, an efficient, scalable MapReduce algorithm for processing EHR using Hadoop clusters. The paper is organized as follows. Section 2 covers preliminaries concerning Hadoop clusters. Section 3 relates to EHR, and Section 4 to MapReduce and health care issues. Section 5 studies SOA-based cloud computing features, and Section 6 reviews the literature on MapReduce algorithms. In Section 7 we propose an efficient, scalable, iterative MapReduce algorithm for cloud computing to process Big Data EHR using Hadoop clusters. Section 8 presents the experimental results, and Section 9 concludes.
2 HADOOP CLUSTERS
Hadoop clusters are built from rack servers connected to a top-of-rack switch, whose uplinks connect to another set of switches of equal bandwidth. This forms a cluster in a network. Such a cluster can be set up on the cloud so that its workflow retrieves the required results from large data sets. In our case we load the EHR data into the cluster and run queries against it.
The workflow of the cluster is as follows: the Hadoop Distributed File System (HDFS) writes the loaded data into the cluster, the data is analysed with MapReduce algorithms, and HDFS writes the results back to the cluster and then reads them out. If, from a large EHR data set, we need to find how many patients were diagnosed with heart disease, this can be analysed and processed very quickly using Hadoop: it divides the data into smaller chunks, processes them across multiple machines, and thus produces the result quickly. The typical architecture of a Hadoop cluster is shown in Figure 1.
Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016
Copyright © Department of Computer Science, University of Botswana, 2016 13
Figure 1. Hadoop Cluster Architecture
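The chunk-and-parallelise idea behind the heart-disease query above can be sketched in a few lines of plain Python. This is a single-machine illustration only, not Hadoop itself; the record format and chunk size are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical EHR records: (patient_id, diagnosis) pairs.
records = [
    ("p1", "heart disease"),
    ("p2", "diabetes"),
    ("p3", "heart disease"),
    ("p4", "asthma"),
]

def count_heart_disease(chunk):
    """Count heart-disease diagnoses in one chunk, as a worker node would."""
    return sum(1 for _, dx in chunk if dx == "heart disease")

# Split the data into fixed-size chunks (Hadoop uses large HDFS blocks).
chunk_size = 2
chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

# Process the chunks in parallel and combine the partial counts.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(count_heart_disease, chunks))

print(total)  # 2
```

Because each chunk is counted independently, the same computation scales out to many machines: only the small partial counts need to be combined at the end.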
For faster parallel processing, the data has to be loaded into the Hadoop cluster. The client breaks the data into smaller chunks, which are sent to different machines and processed in parallel. To limit data loss and preserve network performance, the Hadoop administrator manually defines the rack number of each slave data node in the cluster.

EHR data loaded into the cluster is divided into data chunks such as File A, File B, File C, and so on. In this section we describe how these chunks are loaded into HDFS. The client consults the name node and then writes a block to one data node; that data node replicates the block to the other nodes decided by the name node. The same repeats for the next set of blocks until all the chunks have been processed. Writing files to HDFS is shown in Figure 2: large EHR data sets are sent to the client as chunks, and the replication factor for blocks is 3 by default. Hadoop writes the data to its nodes and keeps replicas so the data is safe; if a node fails for any reason, the data is still available on other nodes.
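The replication behaviour described above can be illustrated with a toy placement scheme. This is only a sketch under simplifying assumptions (round-robin placement, invented node and block names); real HDFS placement is rack-aware and managed by the name node.

```python
# Hypothetical round-robin block placement with a replication factor of 3
# (HDFS's default); real HDFS placement is rack-aware.
REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["File A", "File B", "File C"]

placement = {
    block: [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]
    for i, block in enumerate(blocks)
}

# If any single node fails, every block still has at least two live replicas,
# so the data remains readable.
failed = "node2"
for block, replicas in placement.items():
    live = [n for n in replicas if n != failed]
    assert len(live) >= REPLICATION - 1

print(placement["File A"])  # ['node1', 'node2', 'node3']
```

With three replicas per block, the cluster tolerates the loss of any single data node without losing access to any block.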
Figure 2: EHR files writing to HDFS
3 ELECTRONIC HEALTH RECORDS (EHR)
An EHR contains the complete data relevant to a patient: medical history, demographics, problems, medications, clinical observations, signs and symptoms, immunisation reports, radiology reports, laboratory data, billing information, personal data, and patient progress data. EHR makes storage and retrieval of this data efficient, and EHR systems can enhance patient care (Jamoom et al., 2012). The various components of EHR are shown in Figure 3.
Figure 3. EHR Components
Clinical documents contain varied information (Electronic Health Records Overview, 2006). These documents are critical for patient care and validate the care provided. HL7 International has developed an XML-based standard electronic document that defines the structure of such documents (Dolin et al., 2006). The benefits of EHR include fewer billing errors (Wang et al., 2003), reduced costs (Johnston et al., 2004; Menachemi & Brooks, 2006), more accurate diagnosis (Jamoom et al., 2012), and many more. Despite these benefits there are certain barriers to EHR adoption (Rahman & Reddy, 2015), yet EHR will be indispensable for patient care in hospitals and medical practices.
4 BIG DATA, MAPREDUCE AND HEALTHCARE
Healthcare organisations generate voluminous patient data in many different formats (Hu et al., 2014), and healthcare data sources need to follow the standards set by the industry. Big data helps organisations reduce costs and improve the quality of patient care, and big data technologies have enabled applications that are successfully used across the health sciences. Various health information systems have been proposed by researchers (Fernandez-Luque et al., 2009; Duan et al., 2011; Hoens et al., 2013; Wiesner & Pfeifer, 2014). Healthcare is a field that needs scalable, distributable solutions to process data efficiently and distribute it securely. Many problems in healthcare data need to be addressed, such as finding patients with common symptoms, analysing laboratory records, relating treatments to reports, and assessing patient responsiveness to prescribed drugs. Since 80% of healthcare data is unstructured (Miliard, 2011), applications that can process such data are required, and MapReduce has emerged as one solution to these issues (Bhatotia et al., 2011; Mazur et al., 2011; Yan et al., 2012). Big data analytics can be applied to EHR data to predict patient risk for various health issues and to help diagnose certain diseases at an early stage.

Programming with data-oriented tools such as SQL and statistical languages is required for big data analysis in healthcare, which must transform the data into knowledge: statistical, contextual, quantitative, predictive and cognitive models can be developed to derive information from huge data sets. Modern big data technologies can collect data from millions of patients, identify clusters and correlations, and analyse the data in a short time using statistical machine learning and modelling techniques (Baah et al., 2006). By applying statistical modelling or machine learning it is possible to tell whether a patient is likely to fall sick again, given a valid range of data sets. The McKinsey Global Institute estimated the potential value of big data in healthcare at $300 billion a year (Kayyali et al., 2013). A big data analytics platform for healthcare is suggested in Figure 4.
Figure 4 Big Data analytics on Healthcare
5 CLOUD COMPUTING
In the healthcare ecosystem, a cloud computing environment can benefit all of the components (Parakala & Udhas, 2011). Because healthcare data comes in both structured and unstructured forms, the database used to store it should be capable of handling unstructured data; a NoSQL database can be used efficiently for this, and data integration can be based on Hadoop MapReduce (Apache Hadoop, 2012). Cloud Infrastructure as a Service should, as basic requirements, be secure, scalable and multi-tenant, be location independent, and provide on-demand virtual networks. Figure 5 shows the basic requirements of cloud IaaS.
Figure 5: Cloud infrastructure (Source: http://www.slideshare.net/bradhedlund/architecting-data-center-networks-in-the-era-of-big-data-and-cloud-13033773)
Many read and write requests are generated while processing big data on the cloud, with thousands of entities such as application servers accessing the data. To avoid failures and keep the service running, the read and write load must be balanced, and a large number of servers must be kept ready for distribution. Cloud computing makes it possible to have large-scale, on-demand infrastructure that can provide resources for different workloads, and big data can be offered as a service on the cloud (Low et al., 2012). For parallel analysis, data on the cloud needs to be partitioned, distributed, configured and then loaded into memory. Hadoop can be deployed on the cloud to perform massive data processing, provided an efficient algorithm is designed for the Hadoop data store. Cloud computing provides opportunities for growth even though there are barriers to various big data services (Moretti et al., 2008); Table 1 lists the growth opportunities for cloud computing against these barriers.
We use parallel computing to process large data sets on the cloud. MapReduce is a parallel programming model supported by capacity-on-demand clouds (Gunarathne et al., 2010); for a large collection of data stored in the cloud, MapReduce can compute an inverted index in parallel. A reliable SOA infrastructure is required to integrate healthcare records: applications can share information continuously to support essential healthcare business processes, cloud computing provides scalable infrastructure (hardware, software and healthcare applications), and SOA delivers the software as a service to all healthcare providers. By implementing SOA-based cloud healthcare (Rallapalli & Gondkar, 2016), organisations can minimise the security risk involved in exchanging information. Assume that each node i in the cloud stores the EHR records r(i,1), r(i,2), r(i,3), ..., and that each EHR record contains patient laboratory information p(i,1), p(i,2), p(i,3), .... To retrieve the records of patients with similar laboratory information we use an inverted index, listed as follows:
{w1 : r(1,1), r(1,2), r(1,3), ...}
{w2 : r(2,1), r(2,2), r(2,3), ...}
...
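The inverted-index listing above can be made concrete with a small sketch. The record identifiers and lab-result terms here are invented for illustration; a real index would be built in parallel by MapReduce over many nodes.

```python
from collections import defaultdict

# Hypothetical records r(i,j), each holding a set of lab-result terms w.
records = {
    "r(1,1)": {"glucose-high", "hdl-low"},
    "r(1,2)": {"glucose-high"},
    "r(2,1)": {"hdl-low"},
}

# Build the inverted index: each term w maps to the list of records
# containing it, so patients with similar lab results can be looked up
# by term instead of by scanning every record.
index = defaultdict(list)
for rid, terms in records.items():
    for w in sorted(terms):
        index[w].append(rid)

print(dict(index))
# {'glucose-high': ['r(1,1)', 'r(1,2)'], 'hdl-low': ['r(1,1)', 'r(2,1)']}
```

A lookup for a term is then a single dictionary access rather than a scan of every record, which is what makes the inverted index worth computing in parallel.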
To analyse terabytes and petabytes of data in a limited amount of time and to perform statistical analysis, hundreds of servers are needed, and the data must be distributed across them so that it can be analysed in parallel. The suggested big data EHR processing on the cloud is shown in Figure 6.
Figure 6: Big Data EHR processing on cloud
Barrier                          | Opportunity
Availability                     | Multiple cloud providers
Data lock-in                     | Standardised APIs; hybrid cloud computing
Data confidentiality             | Encryption, firewalls, VLANs
Data transfer bottleneck         | Higher-bandwidth switches; transporting disks
Performance unpredictability     | Improved virtual machine support
Scalable storage                 | Available scalable stores
Bugs in large distributed systems| Debuggers for VMs
Scaling quickly                  | Auto-scaling
Software licensing               | Pay-per-use licences

Table 1: Barriers to big data on the cloud and corresponding opportunities
6 LITERATURE REVIEW OF UNCERTAIN DATA ALGORITHMS
Healthcare data is largely unstructured because many images are stored in the database, so this section reviews algorithms for uncertain data. The main challenge with uncertain data is modelling it and integrating it with various applications; working models for uncertain data were proposed in (Sarma et al., 2006; Aggarwal, 2010). The second challenge is building data management and processing applications for uncertain data. For analysing large data sets with iterative algorithms such as data clustering, HaLoop has been proposed (Bu et al., 2010). To understand the performance of such an application, the MapReduce programmer needs to write complex code; for data clustering algorithms, classification and regression trees can be applied to improve performance.
7 EFFICIENT, SCALABLE, PARALLEL AND DISTRIBUTED COMPUTATIONAL MAPREDUCE ALGORITHM FOR CLOUD COMPUTING
A MapReduce program has two functions, Map() and Reduce(), with the general signatures:
map (s1, t1) -> [<s2, t2>]
reduce (s2, {t2}) -> [<s3, t3>]
A MapReduce system such as Hadoop reads the input data, performs the computation, writes the results to the Hadoop Distributed File System, and creates chunks of blocks that run across a cluster of machines. The system runs a process called the JobTracker on the master node to monitor job progress, and a set of processes called TaskTrackers on the worker nodes to perform the actual Map and Reduce tasks.
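The map/reduce signatures above can be modelled in a few lines of Python. This is a minimal single-process sketch of the data flow (map, shuffle, reduce), not Hadoop itself; the input records are invented for the example.

```python
from itertools import groupby
from operator import itemgetter

# map(s1, t1) -> [(s2, t2)]: emit (diagnosis, 1) for each patient record.
def map_fn(record_id, diagnosis):
    return [(diagnosis, 1)]

# reduce(s2, {t2}) -> [(s3, t3)]: sum the counts for one key.
def reduce_fn(key, values):
    return [(key, sum(values))]

inputs = [("p1", "heart disease"), ("p2", "diabetes"), ("p3", "heart disease")]

# Map phase: apply map_fn to every input record.
pairs = [kv for rec in inputs for kv in map_fn(*rec)]
# Shuffle phase: group the intermediate pairs by key (here, by sorting).
pairs.sort(key=itemgetter(0))
# Reduce phase: apply reduce_fn to each key and its grouped values.
result = [out
          for key, group in groupby(pairs, key=itemgetter(0))
          for out in reduce_fn(key, [v for _, v in group])]

print(result)  # [('diabetes', 1), ('heart disease', 2)]
```

In Hadoop, the map and reduce calls run on TaskTracker worker nodes and the shuffle is performed by the framework between the two phases; the logical data flow is the same.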
Let us first discuss the various iterative models proposed to improve MapReduce processing. Twister, HaLoop and iMapReduce focus on reducing job start-up costs and caching structured data. To provide an efficient, scalable, distributed, iterative algorithm for cloud computing we need to separate structure and state data in the application program interface. The iterative map model supports algorithms such as k-means and PageRank, which compute iterative functions. The separation of structure and state data can be achieved by extending the map function to operate on the structure with key-value pairs in incremental MapReduce.
The algorithm below reads a chunk of EHR data. For iterative algorithms, the structure and state key-value pairs vary as (structure key, value) and (state key, value); relations such as one-to-one and many-to-one may exist between the state key and the structure key. An iterative algorithm requires two data sets: the loop-invariant structure data and the loop-variant state data, both needed for efficient and scalable computation. SOA-based cloud computing exchanges information through loosely coupled software components. Iterative algorithms are generally used for ranking the data in the clusters. Let j, k denote vertex numbers; in this algorithm we combine a clustering algorithm with a generalized iterative matrix-vector algorithm.
Algorithm 1: Parallel computing algorithm
Input: data as List<vertex number>
Output: incrementally computed results of the iterative algorithm
1. Begin with n Hadoop clusters, each containing chunks of EHR data, numbered 1 through n.
2. In the Map phase, input the vertex numbers; in the Reduce phase, output each vertex by summing over all its neighbours.
3. Compute the cluster distances using k-means for MapReduce.
4. Algorithms such as generalized iterated matrix-vector multiplication in MapReduce can be applied to the data sets.
5. Run a sequence of jobs J1, J2, J3, ... which incrementally produce the results of the iterative algorithm.
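One iteration of the k-means step mentioned in Algorithm 1 can be expressed in MapReduce style: the map phase assigns each point to its nearest centroid, and the reduce phase recomputes each centroid as the mean of its assigned points. The 1-D lab values and starting centroids below are purely illustrative.

```python
# One k-means iteration in MapReduce style, on illustrative 1-D lab values.
points = [1.0, 1.5, 8.0, 9.0]
centroids = [2.0, 7.0]

# Map: emit (nearest-centroid-index, point) for every point.
assignments = {}
for p in points:
    c = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
    assignments.setdefault(c, []).append(p)

# Reduce: recompute each centroid as the mean of its assigned points.
new_centroids = [sum(ps) / len(ps) for c, ps in sorted(assignments.items())]

print(new_centroids)  # [1.25, 8.5]
```

In a Hadoop job sequence J1, J2, J3, ..., each job would perform one such iteration, reading the previous centroids and writing the new ones until they converge.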
8 EXPERIMENTAL RESULTS
The experimental setup consists of four nodes connected by a LAN through a managed switch. One node acts as the master, which supervises the data and the flow of control over all other nodes in the Hadoop cluster. All nodes run on Intel Core 2 Duo processors, use the Ubuntu Linux operating system, and have Java JDK 7 installed. Apache Hadoop, available as open source, was installed on each node following the official installation guides. The experimental setup is shown in
Figure 7.
Figure 7 Experimental setup of 4 nodes
For the experiments we used a public electronic health records database containing in total 100,000 patients, 361,760 admissions, and 107,535,387 lab observations; the total file size was 1.4 GB. The experiments were run on Amazon EC2. To process these EHR data, algorithms such as Apriori can be implemented for incremental one-step processing with iterative MapReduce. The baseline MapReduce computation takes 800 seconds, whereas our proposed algorithm takes only 100 seconds.
9 CONCLUSION
In this paper we have described how an iterative, distributed, computational MapReduce algorithm is more efficient than various other algorithms for bulk data processing. By implementing this algorithm on SOA-based cloud computing we can significantly reduce the runtime compared with the general MapReduce algorithm.
REFERENCES
Aggarwal, C. C. (Ed.). (2010). Managing and mining uncertain data (Vol. 35). Springer Science &
Business Media.
Apache Hadoop (2012). Retrieved from: http://hadoop.apache.org.
Baah, G. K., Gray, A., & Harrold, M. J. (2006, November). On-line anomaly detection of deployed
software: a statistical machine learning approach. In Proceedings of the 3rd international
workshop on Software quality assurance (pp. 70-77). ACM.
Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U. A., & Pasquin, R. (2011, October). Incoop:
MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on
Cloud Computing (p. 7). ACM.
Bu, Y., Howe, B., Balazinska, M., & Ernst, M. D. (2010). HaLoop: efficient iterative data processing on
large clusters. Proceedings of the VLDB Endowment, 3(1-2), 285-296.
Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., & Sears, R. (2010, April).
MapReduce Online. In NSDI (Vol. 10, No. 4, p. 20).
Dolin, R. H., Alschuler, L., Boyer, S., Beebe, C., Behlen, F. M., Biron, P. V., & Shabo, A. (2006). HL7
clinical document architecture, release 2. Journal of the American Medical Informatics
Association, 13(1), 30-39.
Duan, L., Street, W. N., & Xu, E. (2011). Healthcare information systems: data mining methods in the
creation of a clinical recommender system. Enterprise Information Systems, 5(2), 169-181.
Electronic Health Records Overview (2006) National Institute of Health. National Center for Research
Resources. MITRE Center for Enterprise Modernization, Mclean Virginia.
Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S. H., Qiu, J., & Fox, G. (2010, June). Twister: a
runtime for iterative mapreduce. In Proceedings of the 19th ACM International Symposium
on High Performance Distributed Computing (pp. 810-818). ACM.
Evans, D., & Hutley, R. (2010). The Explosion of Data. White Paper.
Ewen, S., Tzoumas, K., Kaufmann, M., & Markl, V. (2012). Spinning fast iterative data
flows. Proceedings of the VLDB Endowment, 5(11), 1268-1279.
Fernandez-Luque, L., Karlsen, R., & Vognild, L. K. (2009, August). Challenges and opportunities of
using recommender systems for personalized health education. In MIE (pp. 903-907).
Gunarathne, T., Wu, T. L., Qiu, J., & Fox, G. (2010, June). Cloud computing paradigms for pleasingly
parallel biomedical applications. In Proceedings of the 19th ACM International Symposium
on High Performance Distributed Computing (pp. 460-469). ACM.
Hoens, T. R., Blanton, M., Steele, A., & Chawla, N. V. (2013). Reliable medical recommendation
systems with patient privacy. ACM Transactions on Intelligent Systems and Technology
(TIST), 4(4), 67.
Hu, H., Wen, Y., Chua, T. S., & Li, X. (2014). Toward scalable systems for big data analytics: a
technology tutorial. Access, IEEE, 2, 652-687.
Jamoom, E., Beatty, P., Bercovitz, A., Woodwell, D., Palso, K., & Rechtsteiner, E. (2012). Physician
adoption of electronic health record systems: United States, 2011. NCHS data brief, (98), 1-8.
Jamoom, E., Patel, V., King, J., & Furukawa, M. (2012, August). National perceptions of EHR adoption:
Barriers, impacts, and federal policies. In National conference on health statistics.
Johnston, D., Pan, E., & Walker, J. (2004). The value of CPOE in ambulatory settings. J Healthc Inf
Manag, 18(1), 5-8.
Kayyali, B., Knott, D., & Van Kuiken, S. (2013). The big-data revolution in US health care: Accelerating
value and innovation. Mc Kinsey & Company, 1-13.
Li, B., Mazur, E., Diao, Y., McGregor, A., & Shenoy, P. (2011, June). A platform for scalable one-pass
analytics using MapReduce. In Proceedings of the 2011 ACM SIGMOD International
Conference on Management of data (pp. 985-996). ACM.
Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., & Hellerstein, J. M. (2012). Distributed
GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of
the VLDB Endowment, 5(8), 716-727.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data:
The next frontier for innovation, competition, and productivity.
Mell, P., & Grance, T. (2010). The NIST definition of cloud computing. Communications of the
ACM, 53(6), 50.
Menachemi, N., & Brooks, R. G. (2006). Reviewing the benefits and costs of electronic health records
and associated patient safety technologies. Journal of medical systems, 30(3), 159-168.
Mihaylov, S. R., Ives, Z. G., & Guha, S. (2012). REX: recursive, delta-based data-centric
computation. Proceedings of the VLDB Endowment, 5(11), 1280-1291.
Miliard, M. (2011) IBM Unveils New Watson-Based Analytics. Healthcare IT News. Retrieved from:
http://www.healthcareitnews.com/news/ibm-unveils-new-watson-based-analytics-
capabilities.
Moretti, C., Bulosan, J., Thain, D., & Flynn, P. J. (2008, April). All-pairs: An abstraction for data-
intensive cloud computing. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE
International Symposium on (pp. 1-11). IEEE.
Parakala, K., & Udhas, P. (2011). The Cloud: Changing the Business Ecosystem. KPMG India study.
Rahman, R., & Reddy, C. K. (2015). Electronic health records: a survey. Healthcare Data Analytics, 36,
21.
Rallapalli, S., & Gondkar, R. R. (2016). A Study on Cloud Based SOA Suite for Electronic Healthcare
Records Integration. In Proceedings of 3rd International Conference on Advanced Computing,
Networking and Informatics (pp. 143-150). Springer India.
Sarma, A. D., Benjelloun, O., Halevy, A., & Widom, J. (2006, April). Working models for uncertain
data. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference
on (pp. 7-7). IEEE.
The Apache Software Foundation (2013) Hadoop MapReduce Tutorial. Retrieved from:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
Wang, F., & Liu, J. (2011). Networked wireless sensor data collection: issues, challenges, and
approaches. Communications Surveys & Tutorials, IEEE, 13(4), 673-687.
Wang, S. J., Middleton, B., Prosser, L. A., Bardon, C. G., Spurr, C. D., Carchidi, P. J., ... & Kuperman, G.
J. (2003). A cost-benefit analysis of electronic medical records in primary care. The American
journal of medicine, 114(5), 397-403.
Wiesner, M., & Pfeifer, D. (2014). Health recommender systems: concepts, requirements, technical
basics and challenges. International journal of environmental research and public
health, 11(3), 2580-2607.
Zhang, Y., Gao, Q., Gao, L., & Wang, C. (2012). iMapReduce: A distributed computing framework for
iterative computation. Journal of Grid Computing, 10(1), 47-68.
Zhang, L., Wu, C., Li, Z., Guo, C., Chen, M., & Lau, F. (2013). Moving big data to the cloud: an online
cost-minimizing approach. Selected Areas in Communications, IEEE Journal on, 31(12), 2710-
2721.
Zhou, Y., Cheng, H., & Yu, J. X. (2009). Graph clustering based on structural/attribute
similarities. Proceedings of the VLDB Endowment, 2(1), 718-729.
Yan, C., Yang, X., Yu, Z., Li, M., & Li, X. (2012, June). Incmr: Incremental data processing based on
mapreduce. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on (pp.
534-541). IEEE.
IC1014
A Collaborative Tool for MPhil/PhD Student Dissertation Workflow
Bigani Sehurutshi, Oduronke T. Eyitayo
Department of Computer Science University of Botswana Gaborone, Botswana
Bigani.Sehurutshi@mopipi.ub.bw; eyitayoo@mopipi.ub.bw
ABSTRACT
Over the years, experience at the University of Botswana (UB) has shown that workflow is a major problem: current workflow methods do not provide adequate support for MPhil/PhD students' dissertations, and the current system has many bottlenecks. There is a need to mitigate these limitations by improving operational efficiency and effectiveness through redesigned, more efficient workflows. In this paper we propose a model to improve the process by first analysing and assessing the current processes, then identifying opportunities for improvement and potential solutions to the challenges users experience in their projects, and lastly simulating improved workflow scenarios and evaluating the solution. An evaluation was carried out to determine whether the new system was beneficial. Testing revealed that the prototype student research management system was regarded as easy to use and very useful, making it a clear improvement on the current manual system. Benefits of the workflow improvements include increased student satisfaction, faster throughput through efficient workflows, and better asset utilisation by removing constraints related to people and process inefficiencies.
Key words: Workflow Reengineering, Process Modelling, Usability Testing, Prototype Design,
Processmaker
1 INTRODUCTION
The University of Botswana (UB) has a student enrolment of about sixteen thousand, distributed among seven faculties: Business, Education, Engineering and Technology, Humanities, Science, Health Sciences and Social Sciences. Each faculty is composed of departments offering programmes from diploma and bachelor's degree through to master's and doctoral degrees. In the 2015/2016 academic session, the University has 1704 Masters/MPhil students and 96 PhD students.
A student dissertation, in our context, refers to research in which scholars extend their knowledge to make contributions to their respective areas. Research projects are made up of hundreds of processes. Yan et al. (2012) emphasised the need to improve the quality of thesis supervision and instruction. Apart from the supervisory approach, one major problem that has not received much attention is the workflow of a student's dissertation. Although many academic staff have devised their own paper-based guidelines and pro-forma documents, based on experience and best practice, to ease and control the supervision process between all involved parties, these have not been supported by a central online collaborative system that can help them
to easily monitor and control the whole workflow (from identifying a project idea to final assessment) and support smooth data handover between all involved parties. The current research processes are inefficient.
One problem encountered over the years is students not graduating in the year they were expected to, because there was no follow-up on reports from internal and external examiners. There is no clear way of tracking progress between the examiners and the School of Graduate Studies, and reports dispatched through a courier sometimes do not reach the examiner. Having in place a system that monitors workflow can therefore help with the problems identified above. Support for dissertations is currently handled manually. The main objective of this work is to study the current processes and model a better workflow, along with a prototype, for the whole process of student project management, from the registration of topics to the final submission of the dissertation.
The specific objectives of the research were:
- To assess the current state of operations for student research project management in the University
- To model a workflow to make it more efficient
- To design, develop and evaluate a prototype system based on the workflow
2 LITERATURE REVIEW
2.1 Modelling a Workflow
To study and understand processes, one constructs models using a particular modelling technique. It is important to identify the purpose of the model: to choose the right technique, the modeller must know what the model is for, since different techniques suit different purposes. There are many process modelling techniques, the most widely used being flowcharts, data flow diagrams, Petri nets and workflows.
A workflow is the automation of a business process, in whole or in part, during which documents, information and tasks are passed from one participant to another; participants may be people or automated processes (WFMC Documentation, 1996). A process is a set of tasks that need to be carried out together with a set of conditions that define the order of the tasks. Workflow management involves managing the flow of work so that the work is done at the right time by the proper persons. Workflow management systems aim to help business goals be achieved with high efficiency by sequencing work activities and invoking the appropriate human or information resources associated with these activities (WFMC Documentation, 1996). They also ensure the integration of people and programs.
2.2 Theses Management Systems
Romdhani et al. (2011) at Edinburgh Napier University proposed an integrated, collaborative online supervision system for final-year and dissertation projects, initiated in order to provide a high-quality supervisory process and an effective supervisory relationship. They suggested that the supervisory process needs to be supported by a central electronic system to record, monitor and revisit supervision and to enhance student learning. From the student's perspective, they argued, having a unique electronic supervision
system alongside traditional face-to-face and paper-based supervision methods can ensure reliable assessment and smooth transfer of data and reports between all parties. The system minimised administrative overheads and gave better control of project progression and monitoring (Romdhani et al., 2011).
In a study by Yan et al. (2012), a web-based system was designed to support the master's degree thesis research process and knowledge sharing. The study identified the main steps of the research process and presented an instructional model based on an analysis of practical thesis research workflow. An audit of one hundred Chinese universities found that universities differ in thesis time management and process organisation, and that most follow the generic steps of topic selection, thesis writing, oral examination and evaluation of excellent theses. The authors modelled the thesis research process as a combination of problem-based learning and thesis management (Yan et al., 2012). A web-based supporting system called THEOL was designed according to this instructional mode for the master's degree thesis. The system features three key modules: research process support, research group management and knowledge sharing, with functions to support the whole thesis research process, multi-supervision by teachers, and rich resource sharing throughout.
2.3 Methodology

The study is composed of five phases. The first phase modelled the current manual system of managing research in faculties at the University of Botswana. The second phase turned the user requirements gathered in phase one into workflow models; flow charts were used to model the steps. The third phase improved the workflows by eliminating human-dependent steps that introduced inefficiency. The fourth phase used the formulated workflows to develop a prototype of the research management system; ProcessMaker software was used to design the prototype. The final phase evaluated the prototype's interface design through heuristic evaluation, perceived ease of use and perceived usefulness.
3 CURRENT STATE OF OPERATIONS FOR STUDENT RESEARCH PROJECT MANAGEMENT

The major processes in the MPhil and PhD programmes are admission, proposal defence, submission of the title and abstract of the thesis, submission of the thesis for examination, entry into the examination, appointment of examiners and the board of examiners, the oral examination, and results.
With all of these processes executed manually, there is considerable inefficiency throughout. Delays are caused by processes that depend on people who do not follow procedures. Most processes are controlled by the coordinator, so automating the current processes as they stand would leave the coordinator in control and make the system only semi-automated. Due to a lack of monitoring and communication, scheduled tasks are postponed when students do not show up to see their supervisors. Some do not even submit their milestones, making it difficult to know the status of their projects. Other stakeholders may forget deadlines, or forget that a dissertation is with them. Examiners are given a month to return their reports, but in some cases reports
arrive after five months. There is no clear communication between the examiners and the School of Graduate Studies. Reports dispatched through a courier may not even reach the examiner, and follow-up is sometimes not done. A proposed re-engineered workflow is shown in Figure 2.
3.1 Improved Workflows

ProcessMaker contains two main components: a design environment and a run-time engine. The design environment includes tools to map processes, define business rules, create dynamic forms and add input and output documents. A web-based application was created, in which a client uses a browser to access services from the server. This approach relieves the developer of installing the application on every end user's computer, and changes to the application logic and database happen in one place (on the server) without affecting end users' machines.
For this project, the open-source ProcessMaker was customised to meet the needs of a student project management system. The system requires processes to be created before their interfaces. In ProcessMaker, a process is a collection of tasks with inputs that create outputs of value to the students doing research and to end users within the University of Botswana.
Alongside the major processes, child processes were created to ease the pressure on them. It is recommended to break large processes into separate master and child processes, reducing the complexity of the process map and giving sub-processes room to handle exceptional situations and activities; the functionality of a useful process can also be hooked into another process. The sub-processes include topic submission (under the topic registration process), meeting reports (under project writing), progress reports and oral examination. Sub-processes are divided into synchronous and asynchronous.
A synchronous sub-process pauses the master process at the point where the child is invoked; the master resumes where it stopped once the child completes. An asynchronous sub-process does not pause the master process, and the two have no dependency on each other. All the sub-processes in this system are synchronous.
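The distinction between the two sub-process types can be sketched in plain Python. This is only an illustrative analogy (ProcessMaker implements this behaviour internally, and the function names below are our own):

```python
import threading
import time

def run_child(name, log, duration=0.05):
    """Simulate a child process such as 'progress report' completing some work."""
    time.sleep(duration)
    log.append(f"child:{name}:done")

def synchronous_master(log):
    # Synchronous: the master pauses at the call site until the
    # child completes, then resumes where it stopped.
    log.append("master:paused")
    run_child("progress report", log)  # blocks the master
    log.append("master:resumed")
    return log

def asynchronous_master(log):
    # Asynchronous: the child runs independently; the master
    # continues immediately, with no dependency between the two.
    t = threading.Thread(target=run_child, args=("meeting report", log))
    t.start()
    log.append("master:continued")
    t.join()  # joined here only so the demo finishes deterministically
    return log
```

In the synchronous case the child always finishes before the master resumes; in the asynchronous case the master logs its next step without waiting for the child.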
Figures 3, 4 and 5 show a few of the processes and child processes.
Figure 1: PhD Dissertation current workflow
Figure 2: PhD re-engineered workflow
Table 1 shows a comparison of the original workflow and the improved system.
Table 1 Comparison of the original workflow and the improved system.
Features | Current system | Improved system
Prefill of data | Not available | Data filled using forms
Documents storage | Files, hard copies | Electronic databases
Documents sharing | Not available | Electronic sharing, enhanced retrieval of documents
Reminders to users | Sent by coordinators | Emails included in the system remind users of pending tasks
Assignment of tasks | Coordinator assigns users tasks | Electronic assignments
Task deadlines | Done by the administrators | Processes have a validity period
Reports | No reporting | Done by management staff members
Figure 3: Progress Report sub process
Figure 4: Student Examination Workflow
Figure 5: Oral Examination workflow
4 PROTOTYPE DEVELOPMENT

The various processes were designed; Figures 6, 7, 8 and 9 show some examples of the forms used in the development. The dissertation examination starts with a student indicating readiness to submit: the student fills in the date and makes a request to the coordinator, as shown in Figure 6. The coordinator then opens the submission for the students who made requests, and each student uploads the work and sends it as specified in the form shown in Figure 7.
The remaining steps proceed as follows. The submission by the student is routed to the supervisor, who indicates approval of the submission and adds any comments for the coordinator. The coordinator then sends the submission to the School of Graduate Studies (SGS) when satisfied with it.
Figure 6: Student Examination Request Form
Figure 7: Student Thesis submission Form
The SGS then accepts the submission and indicates the number of hard copies received from the student. The dissertations are logged as they are sent to the examiners, together with the dates the reports are expected. The report can also be submitted online, as shown in Figure 8. When a report arrives, the SGS notifies the coordinator of the report from the examiners and indicates the duration. The coordinator prepares a brief report for the student, omitting confidential information, and informs the student that the reports are ready for collection, as shown in Figure 9.
Figure 8: Examiner’s report form
The student then receives the report and sends the corrected version back to the supervisor. The supervisor indicates the approval status of the corrected version, confirming that all the necessary corrections have been taken into account; this is routed through the internal examiner. The internal examiner indicates whether he or she is satisfied with the corrections and records the student's final result. The SGS then notifies the coordinator and the student of the outcome, and post-examination submissions are made.
Figure 9: Detailed Examiner’s report
5 PROTOTYPE EVALUATIONS

Three types of evaluation were done to test the prototype system: usability/heuristic evaluation, perceived ease of use and perceived usefulness. A system can only be said to be effective and efficient if it meets usability criteria for specific types of users carrying out specific tasks (Agarwal, 2002). Usability is associated with positive effects, including error reduction, enhanced accuracy and positive user attitudes (Agarwal, 2002); a system that passes usability testing can therefore be called effective and efficient. In the last phase of the study, the prototype system was evaluated using heuristic evaluation and end-user evaluation. Heuristic evaluation is considered a practical, inexpensive method of identifying usability problems and assisting in the refinement of system design (Laurie et al., 2002). Thirteen usability evaluators from the Department of Computer Science and the Department of Library and Information Studies evaluated the prototype using heuristic evaluation. Eight usability factors were applicable to the study (Table 2). Only two usability factors, user control and freedom and consistency with standards, violated heuristics by scoring negative responses, and even their overall severity scores fell between "no usability problem" and "minor usability problem". The problems included complex wording, no cancel buttons, no default values in the data fields and no undo buttons. The overall severity ratings ranged across no usability problem, cosmetic problem and minor usability problem, suggesting that the prototype was very usable and hence efficient and effective. Most of the reported problems were resolved.
Table 2: Summary Heuristic Evaluation results
Usability factor | Positive (Y) | Negative (N) | Not Applicable (NA)
Visibility of the system status | 64 | 1 | 0
Match between system and real world | 31 | 8 | 0
User control and freedom | 25 | 38 | 3
Consistency and standards | 23 | 39 | 5
Error prevention | 36 | 16 | 0
Recognition and recall | 54 | 15 | 9
Aesthetic and minimal design | 26 | 13 | 1
Help users recognise, diagnose and recover from errors | 21 | 5 | 0
The problems listed by the evaluators, together with the design solutions adopted, are shown in Table 3. The evaluators also recommended:
specifying the labels and button names;
adding titles for pop-up messages;
training users.
Six questions on perceived ease of use of the prototype were included in the evaluation. The same thirteen evaluators who performed the usability and heuristic evaluation responded to the six questions on a scale from "strongly agree" to "strongly disagree"; the responses are summarised in Table 4.
Table 3: Usability problems and design solutions

Usability problem | Design solution
Ambiguous words such as "area of expertise" and "cluster" | Combo-box used to list the areas
Unclear difference between userID and StudentId | userID changed to username
Date format not clear in date field | Date format fixed
No cancel buttons in forms, only submit buttons | Cancel buttons created
No clear title for the form where the coordinator receives submissions from students | Title created
No undo and redo buttons | Not implemented
No option of using the keyboard instead of the mouse | Not implemented
No default values for data fields | Default values added
No dots used to indicate length | Not implemented
In addition to the heuristic evaluation, evaluators answered six questions each on perceived ease of use and perceived usefulness of the prototype. On the whole, they perceived the prototype system as easy to use and as a useful tool for the management of research tasks.
Table 4: Percentage summary of perceived ease of use

Item | SA* | A* | N* | DA* | SDA* | Total % (N)
I find the system easy to use. | 46.2 | 46.2 | 7.7 | 0 | 0 | 100 (13)
Learning to operate the system is easy for me. | 15.4 | 76.9 | 7.7 | 0 | 0 | 100 (13)
I find it easy to get the system to do what I want it to do. | 23.1 | 46.2 | 30.8 | 0 | 0 | 100 (13)
The system is flexible to interact with. | 23.1 | 53.8 | 23.1 | 0 | 0 | 100 (13)
I can easily remember how to perform tasks. | 23.1 | 61.5 | 15.4 | 0 | 0 | 100 (13)
My interaction with the system is clear and understandable. | 15.4 | 46.2 | 38.5 | 0 | 0 | 100 (13)

*SA – Strongly Agree; A – Agree; N – Neutral; DA – Disagree; SDA – Strongly Disagree
It can be seen from Table 4 that users found the system easy to use and flexible, and found it easy to remember how to perform tasks; agreement on these items ranged between 77% and 92%. However, lower agreement, between 61% and 69%, was obtained for getting the system to do what the user wants and for clarity of interaction. These ratings most likely reflect the usability problems, which have since been attended to.
5.1 Perceived Potential Usefulness
Perceived potential usefulness was measured using six items on a 5-point scale: 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree and 5 = strongly agree. Twenty participants from different departments completed the questionnaires after being walked through the online prototype. Of the 20 participants, 85% were from the Faculty of Science and 15% from other faculties; within the Faculty of Science, 71% were from the Department of Computer Science. Graduate students made up 20% of the participants and undergraduates 75%. Sampling was by convenience: those who were willing and available to evaluate the system. As shown in Table 5, over 80% of the evaluators responded with either "agree" or "strongly agree" on all items related to perceived usefulness. Overall, all evaluators felt the system would be useful; none disagreed or strongly disagreed. On reducing delays, all participants agreed, which supports the expectation that the system will reduce delays and improve efficiency.
Table 5 Percentage summary of perceived potential usefulness
Item | SA | A | N | DA | SDA | Total % (N)
The system would allow me to complete my tasks more quickly | 25 | 55 | 20 | 0 | 0 | 100 (20)
Using the system would increase effectiveness of performing tasks | 40 | 45 | 15 | 0 | 0 | 100 (20)
Using the system would give me more time for issues other than administrative tasks | 55 | 35 | 10 | 0 | 0 | 100 (20)
Using the system would give me more visibility over my tasks | 45 | 40 | 15 | 0 | 0 | 100 (20)
Using the system would reduce delays for the same amount of effort | 45 | 55 | 0 | 0 | 0 | 100 (20)
I would find the system useful in the process of my research work | 40 | 45 | 15 | 0 | 0 | 100 (20)
Participants perceived the prototype system as very useful, and the prototype appeared to meet participants' need for a research project management system. The evaluators' narrative comments also support the potential usefulness of the system.
6 CONCLUSION
Student research is a capstone of the study process at the University of Botswana. This study examined the current setup and proposed a better workflow system to improve the process, leading to the development of a prototype student project management system for the University of Botswana. Five phases were followed in developing the prototype. The first phase gathered information about the current state of student research projects. The second phase used this information to model the current processes. The third phase improved the processes designed in the second phase by providing a new workflow. The last two phases were prototype development and evaluation, and their results showed the concept was viable. The work as a whole demonstrated a viable research project management system that met users' expectations. The prototype should prove useful for monitoring, supervising and managing students' research projects; in future developments of the institution's systems, the student research workflow should therefore be incorporated.

There are, however, several causes of delay, both human and system-related; this research focuses mainly on those caused by administrative inefficiencies, which can readily be addressed with proper process flows and reminders.
REFERENCES
Abdelgader, F. M., Dawood, O. O., & Mustafa, M. M. (2013). Comparison of the workflow management systems. The International Arab Conference on Information Technology (pp. 1-5). Khartoum: theIRED.
Abiddin, Z. N., Ismail, A., & Ismail, A. (2011). Effective supervisory approach in enhancing postgraduate research studies. International Journal of Humanities and Social Sciences, 1(2), 206-217.
Agarwal, R., & Venkatesh, V. (2002). Assessing a firm's web presence: A heuristic evaluation procedure for the measurement of usability. Information Systems Research, 13(2), 168-186.
Aguilar-Savén, R. S. (2004). Business process modelling: Review and framework. International Journal of Production Economics, 90(2), 129-149.
Bakar, M. A., Jailani, N., Shukur, Z., & Yarim, N. F. (2011). Final year supervision management system as a tool for monitoring computer science projects. Procedia - Social and Behavioral Sciences, 273-281.
Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13(3), 319-340.
Laurie, K. R. (2002). Structured heuristic evaluation of online documentation. IEEE Professional Communication Society.
Nielsen, J. (1994). Usability Inspection Methods. New York: John Wiley & Sons.
Romdhani, I., Tawse, M., & Habibullah, S. (2011). Student project performance management system for effective final year and dissertation projects supervision. London International Conference on Education, London.
Sommerville, I. (2011). Software Engineering. Boston: Addison-Wesley.
Tahir, M. I., Ghani, A. N., Atek, E. S., & Manaf, Z. (2012). Effective supervision from research students' perspective. International Journal of Education, 4(2).
WFMC Documentation. (1996). The Workflow Reference Model. Retrieved January 2014, from http://www.aiai.ed.ac.uk/project/wfmc/ARCHIVE/DOCS/glossary/glossary.html
Yan, Y., Han, X., Yang, Y., & Zhou, Q. (2012). On the design of an advanced web-based system for supporting thesis research process and knowledge sharing. Journal of Educational Technology Development and Exchange, 5(2), 111-124.
IC1015
Evaluating the Effect of Privacy Preserving Record Linkage on Student
Exam Record Data Matching
George Anderson, Tsholofetso Taukobong, Audrey Masizana
Department of Computer Science University of Botswana Gaborone, Botswana
andersong@mopipi.ub.bw; mphot@mopipi.ub.bw; masizana@mopipi.ub.bw
ABSTRACT
Data matching identifies which record pairs from two different databases represent the same entities. The data matching process improves data quality, enriches data and allows analysis that would otherwise be impossible from one individual database. While there is a need for data matching, issues of preserving privacy and maintaining confidentiality need to be adequately addressed. Numerous research studies have been carried out and different approaches proposed to address these issues, resulting in a relatively new research area termed Privacy Preserving Record Linkage (PPRL). The broad approach is to transform record data using some sort of one-way function into an encoded representation, which is then used for matching. In this paper, we study the impact of such privacy preserving data matching on the quality of data matching, as determined by standard metrics, when applied to a university data set comprising student exam records and student registration records. Our results demonstrate that the quality of data matching does not suffer, while the benefits of privacy are maintained.

Key words: Privacy Preserving Record Linkage, Data Matching, Bloom Filters, Computational University Administration.
1 INTRODUCTION

Record linkage (also known as data matching) involves identifying records that correspond to the same entities across several databases (Christen, 2010). Entities could be patients, customers, people being counted in a census, and so on. The process usually involves linking records using a set of common fields. What makes data matching a challenging field of its own is that the records in the various databases might not have usable unique identifiers (matching primary keys), due to errors. As such, other fields have to be used, such as surnames. However, surnames might contain errors, such as typographical errors, and some fields, such as a post office box number, might appear in one database but not in the other. Data matching has been used to solve problems in a variety of domains (Christen, 2010): national census, where data quality can be improved by matching records across censuses carried out at different points in time, and where a richer source of data can be built by integrating census databases with other databases, such as crime; health, where records across hospital, clinic, ambulance and mortuary databases can be matched to give a richer understanding of a patient's health over her lifetime; national security, where, for example, terrorists have to be identified using the online and financial records they leave behind as they carry out their activities; and bibliographic databases, such as Google Scholar, which must match bibliographic records in different formats so that research papers are attributed to the correct authors and information such as citation counts is accurate.
Standard data matching techniques assume that all data required for matching is available in a readable, unencoded and unencrypted form. When data matching is done internally in an organization, and all data belongs to that organization, the employees involved are made aware of all regulations and policies concerning the handling of data. However, when data from two or more organizations has to be matched, or matching is outsourced, issues of privacy become a concern. For example, if health records from two hospitals are matched by a third party and data privacy is not well addressed, it may emerge that a well-known politician has a terminal medical condition, and this information might be made public outside proper legal procedure.
To address such issues, Privacy Preserving Record Linkage (also known as privacy preserving data matching) encodes or encrypts the databases before data matching is carried out, either by one of the two participating organizations or by a third party (Schnell et al., 2009; Christen, 2010).
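One common realisation of this one-way transformation, the Bloom-filter encoding of Schnell et al. (2009), can be sketched as follows. The filter length m, the number of hash functions k, and the salted-SHA1 construction are illustrative choices, not the exact parameters of any particular study:

```python
import hashlib

def bigrams(s):
    """Split a string into overlapping 2-grams, padded at both ends."""
    s = f"_{s.lower()}_"
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(s, m=100, k=4):
    """Encode a string's bigrams into an m-bit Bloom filter, setting k
    bit positions per bigram via salted SHA-1 digests."""
    bits = [0] * m
    for g in bigrams(s):
        for i in range(k):
            h = hashlib.sha1(f"{i}:{g}".encode()).hexdigest()
            bits[int(h, 16) % m] = 1
    return bits

def dice(b1, b2):
    """Dice coefficient of two bit vectors: 2|A and B| / (|A| + |B|).
    Similar strings share bigrams, hence share set bits."""
    common = sum(x & y for x, y in zip(b1, b2))
    return 2 * common / (sum(b1) + sum(b2))
```

Only the bit vectors are exchanged for matching; similar names such as "smith" and "smyth" still yield a high Dice similarity because they share most of their bigrams.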
This research contributes to this field by applying a Privacy Preserving Record Linkage technique to
match student registration against examination records and evaluating its impact.
Research has been carried out at the University of Botswana to evaluate the potential for data matching in processing student exam results (Anderson et al., 2013). This involved matching student entities between student registration lists, which do not contain errors, and student exam records, which contain errors introduced by students when completing their exam forms. Various metrics, such as precision, recall and pairs quality, were used to evaluate the approach. The results demonstrated great potential for the application of data matching in this domain.
The current study takes the next step, to evaluate the impact of privacy preserving data matching on
the same problem. Incorporating privacy preserving data matching into our data matching system
would enable the data matching exercise to be overseen by people who are not at the same level of
responsibility as lecturers, such as teaching assistants or student assistants. In this paper, we
describe our study, including the experiments conducted, and our results.
The rest of this paper is organized as follows. Section 2 gives a background on data matching. Section
3 discusses our problem environment. Section 4 describes the privacy-preserving approach we used.
Section 5 describes our experiments and discusses our results. Section 6 concludes.
2 BACKGROUND

While there is a need to match data from separate databases in order to improve data quality, enrich data and allow analysis that would otherwise be impossible from one individual database, issues of preserving privacy and maintaining confidentiality need to be adequately addressed. Numerous studies have been conducted and different approaches proposed to address these issues, resulting in the research area termed Privacy Preserving Record Linkage (Christen, 2010).

Privacy Preserving Record Linkage (PPRL) endeavours to provide a way in which two or more organizations can perform record linkage without revealing any information to either party besides the matched records; for example, two businesses identifying whether they have common customers without revealing customer identities or any other confidential knowledge derived from the matched data (Christen, 2010).
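As a toy illustration of this scenario, exact matching on keyed digests lets two parties compare customer lists without exchanging raw identifiers. This sketch assumes a secret key shared by the two businesses (but withheld from any third-party matcher) and ignores the frequency and collusion attacks a real protocol must address; the function names are our own:

```python
import hashlib
import hmac

def encode_ids(ids, key):
    """Map each customer identifier to its keyed HMAC-SHA256 digest.
    Only the digests ever leave the owning organization."""
    return {hmac.new(key, i.lower().encode(), hashlib.sha256).hexdigest(): i
            for i in ids}

def common_customers(ids_a, ids_b, key):
    """Each party encodes its own list with the shared key; comparing
    digests reveals only the intersection, which each party can then
    decode back to its own plaintext identifiers."""
    enc_a = encode_ids(ids_a, key)
    enc_b = encode_ids(ids_b, key)
    shared = enc_a.keys() & enc_b.keys()
    return sorted(enc_a[h] for h in shared)
```

Because the digest is keyed, a third party holding only the digests cannot mount a simple dictionary attack, unlike with a plain unsalted hash.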
Figure 1 illustrates the record linkage process in a privacy-preserving context. Since data preprocessing is done independently by the two database owners, it is very important that they agree
on the approaches to use and on the common attributes to be used for linkage.
Figure 1. Record linkage process under privacy preserving context.
Adapted from (Vatsalan et al., 2013a).
Vatsalan et al. (2013b) describe a taxonomy of PPRL techniques, given in Figure 2, which categorizes them into five main areas. These are described in detail below.
Privacy aspects
i. Number of parties: a protocol may involve just the two database owners (a two-party protocol) or a third party known as a linkage unit (a three-party protocol).
ii. Adversary model: two adversary models commonly used in cryptography are employed here: Honest-But-Curious (HBC) behaviour, where the database owners follow the protocol but also want to learn about each other's data; and malicious behaviour, where the database owners may behave arbitrarily.

Privacy technique: a number of privacy techniques are used in PPRL, such as SMC (Secure Multi-Party Computation), phonetic encoding, Bloom filters and others.
Linkage techniques: the techniques a linkage unit uses during the different steps of the PPRL process determine the computational requirements and the quality of the matched results.
i. Indexing: indexing or blocking algorithms are needed to reduce computational complexity during comparisons (Vatsalan et al., 2013a; Anderson et al., 2013).
ii. Comparisons and matching: matching can be exact, considering only exact value matches (a similarity of 1 for an exact match and 0 for a non-match), or approximate, considering partial similarities with values between 0 and 1. Various approximate string comparison functions have been used, e.g. edit distance and q-grams (common substrings). Durham et al. (2011) detail a comparison of the different PPRL string comparison techniques.
iii. Classification: many techniques are used for classification, including threshold-based, rule-based and machine-learning-based approaches.
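As an illustration of approximate comparison on unencoded values, the Dice coefficient over q-gram sets yields a similarity between 0 and 1 (a minimal sketch; production comparators typically also pad and normalise strings):

```python
def qgrams(s, q=2):
    """Set of overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=2):
    """Dice coefficient of two q-gram sets: 1.0 for identical strings,
    0.0 for strings sharing no q-grams, partial values in between."""
    ga, gb = qgrams(a.lower(), q), qgrams(b.lower(), q)
    if not ga or not gb:                      # strings shorter than q
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

For example, "peter" and "pedro" share one bigram ("pe") out of four each, giving a similarity of 0.25, while unrelated strings score 0.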
Figure 2. Taxonomy of PPRL techniques adapted from (Vatsalan et al., 2013b).
Theoretical analysis: this considers estimated measures for aspects such as scalability to large databases, quality of linkage results (accuracy, precision, recall, F-measure), and privacy vulnerabilities of the various PPRL methods employed (Vatsalan et al., 2013a; Vatsalan et al., 2013b).
i. Scalability to large databases: measured in terms of the computational effort and communication costs of the overall PPRL process, normally using 'big O' notation.
ii. Quality of linkage: defined in terms of the fault tolerance of the linkage technique to data errors and discrepancies, whether matching is field-based or record-based, and the data types involved.
iii. Privacy vulnerabilities: the vulnerability of different PPRL techniques is assessed by examining the privacy attacks each is susceptible to. For instance, Bloom-filter-based techniques have been shown to be at risk of cryptanalysis attacks, whereby an intruder can map individual encoded values back to their original values (Niedermeyer, 2014). Other privacy attacks include dictionary attacks, frequency attacks and composition attacks, discussed in detail in (Vatsalan et al., 2015).
Evaluation: evaluation is based on the same three aspects of scalability, linkage quality and privacy. Scalability measures based on platform and infrastructure include runtime, memory space and communication size, while those based on the number of generated record pairs are reduction ratio, pairs completeness and pairs quality (Vatsalan et al., 2013a). These measures evaluate the efficiency and effectiveness of the linkage algorithm. Quality of linkage
Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016
Copyright © Department of Computer Science, University of Botswana, 2016 39
is generally evaluated in terms of accuracy measures such as precision, recall, and F-measure, as record-pair classification is a highly imbalanced classification problem.
Practical aspects: This looks at three aspects: the implementation techniques used to prototype and implement the solution, the datasets, and the application area. With privacy issues making it difficult to acquire real data containing personal information, synthetically produced datasets are usually used (Randall et al., 2014). The application area aspect considers whether certain techniques target specific application areas or are developed generally without any target areas.
Our work focuses on privacy techniques (q-gram and hash function encoding), linkage techniques
(comparison and matching), theoretical analysis (we consider the quality of linkage using precision
and recall), and evaluation.
Various studies (Vatsalan et al., 2014; Randall et al., 2014; Vatsalan & Christen, 2014) have evaluated
the scalability, linkage quality and privacy issues of Bloom filters in PPRL by applying the technique to
large real-world datasets. Randall et al. (2014) used hospital admissions data from two Australian
hospitals, with about 7 million records from one hospital linked with 20 million from the other. The
researchers compared unencrypted linkage using bigrams on personal identifiers against encrypted
linkage using trigrams on Bloom filters. The results showed high linkage quality for both, with
almost no difference in quality between the two linkages (encrypted and unencrypted), demonstrating
that it is possible to achieve high linkage quality even in a privacy-preserving context.
Evaluating any PPRL technique normally entails assessing its performance in addressing the three
main challenges of record linkage, that is scalability, linkage quality and privacy issues. Of the three,
privacy is the most difficult to evaluate. Privacy evaluation entails calculating the probability of an
attack; in other words, the risk of an adversary correctly identifying original human data using a
publicly available dataset such as a telephone directory (Vatsalan et al., 2013b; Vatsalan et al., 2014).
Other studies (Niedermeyer, 2014; Vatsalan et al., 2014) evaluated the privacy issues associated
with Bloom filter based linkage by simulating a cryptanalysis, which is the privacy attack the Bloom
filter based approach is most susceptible to (Kuzu et al., 2011). Through these privacy evaluations
and other performance measures, the method can be better understood and its security enhanced,
as it has been compared with other PPRL techniques and seen to give better results despite its
limitations (Vatsalan et al., 2014; Randall et al., 2014; Schnell et al., 2013; Vatsalan & Christen, 2014;
Karakasidis & Verykios, 2011).
Our work is different from these, because to the best of our knowledge, no one has evaluated PPRL
using Bloom filters on our data set.
3 DESCRIPTION OF ENVIRONMENT
The research data comes in the form of real-world datasets. The datasets are from the University of
Botswana Computer Science ICT121 course offering (Computing Skills Fundamentals I) exam results
for the Faculty of Education group from 2007 to 2011. The exams are administered with special
answer forms that are filled by shading using HB pencils (Anderson et al., 2013). Students use the
answer forms to provide their answers as well as their student details such as student ID, surname,
initials, program of study etc. The answer sheets are then graded by scanning them through an OMR
(Optical Mark Recognition) scanner which, together with the scanner software, creates CSV (Comma
Separated Value) data files which may be used as they are or converted to spreadsheet files. These
data files normally contain errors as some students shade their details incorrectly. Common
mistakes such as swapping of student ID digits or leaving spaces and how the scanner reacts to them
are discussed in (Anderson et al., 2013). The exam data records are matched with student
registration records from University of Botswana Academic Student Administration System (ASAS), in
order to ensure every student gets the right mark.
These datasets contain confidential student details and are considered highly sensitive, hence the
need to apply a privacy preserving record linkage technique for their linkage. 4116 student
records were used.
4 PRIVACY PRESERVING RECORD LINKAGE APPROACH
The privacy preserving approach we adopted for our experiments is to use Bloom filters to encode
our database records. We chose Bloom filters because they are known to work well for a wide
variety of data set types (Christen, 2010) and are easy to implement, therefore giving us a good
point of reference.
The Bloom filter, conceived by Burton Howard Bloom in 1970, is a data structure for efficiently
checking set membership (Niedermeyer et al., 2014; Bloom, 1970). It is a single array of l bits,
where l is the array length. Initially all bits are set to zero. In order to store a given set
S = {s1, ..., sn} of elements in a Bloom filter, k independent hash functions h1, ..., hk are
defined such that each hash function maps onto the domain between 0 and l-1, and all bits having
indices hj(si) for 1 <= j <= k are set to 1. If a bit had been set to 1 before, it retains the 1.
Membership of an element x in set S is checked by mapping x with the same hash functions: if the
bit indices h1(x), ..., hk(x) in the Bloom filter are all 1, then x is believed to be a member of S.
However, there is a chance that this is a false positive, which arises when the indices
h1(x), ..., hk(x) were set to 1 by different elements si. On the other hand, if at least one of the
bits turns out to be 0, then x is definitely not a member of the set S. Bloom filters can also be
used to determine an approximate match between two sets (Schnell et al., 2009; Niedermeyer et al., 2014).
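The insert and membership-test operations described above can be sketched in Python. This is an illustrative sketch, not code from the cited papers: the k independent hash functions are simulated here by seeding the standard library's blake2b hash, an assumption made only to keep the example self-contained.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an array of l bits with k hash functions."""

    def __init__(self, l=1000, k=10):
        self.l = l            # filter length in bits
        self.k = k            # number of hash functions
        self.bits = [0] * l   # initially all bits are zero

    def _indices(self, item):
        # Simulate k independent hash functions h1..hk by varying a seed.
        for seed in range(self.k):
            digest = hashlib.blake2b(item.encode(),
                                     salt=seed.to_bytes(16, "big")).digest()
            yield int.from_bytes(digest[:8], "big") % self.l

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1   # a bit already set to 1 retains the 1

    def __contains__(self, item):
        # All k bits set -> "probably in S"; any 0 bit -> definitely not in S.
        return all(self.bits[i] for i in self._indices(item))
```

Every element that has been added is always reported as present; a 0 bit at any of its k positions proves non-membership, while an all-1 result may occasionally be a false positive.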
An approach to privacy preserving record linkage using Bloom filters was proposed by Schnell et al.
(2009). Their approach is to split identifier strings into q-grams, that is, sub-strings of length q,
use hash functions (for example MD5 or SHA-1) that only the two database owners know to map the
q-grams into Bloom filters, and then send the Bloom filters to the linkage unit. Dice coefficients
are used to generate a similarity score between two records; these make use of the number of 1-bits
in the Bloom filters for comparison and matching (Schnell et al., 2009). With q = 3, the trigram set
of the name “morapedi” is “mor”, “ora”, “rap”, “ape”, “ped”, “edi”.
A simple example is illustrated as follows. Two names (paula, paul) are mapped, using bigrams, to
two Bloom filters (F1, F2) of 14 bits each using two hash functions (k = 2). Let a be the number of
1-bits in F1, b the number of 1-bits in F2, and h the number of 1-bit positions common to both
Bloom filters. Table 1 shows the hashing values for the two strings.
Table 1: Bloom filter hashing example
String 2-gram F1 hash bit number F2 hash bit number
paula pa 0 6
au 2 7
ul 6 11
la 8 13
paul pa 0 6
au 2 7
ul 6 11
Table 2: Resulting Bloom filters for the example in Table 1
paula 1 0 1 0 0 0 1 1 1 0 0 1 0 1
paul 1 0 1 0 0 0 1 1 0 0 0 1 0 0
For “paula”, a = 7; for “paul”, b = 5. The two strings have h = 5 common 1-bits between them. Hence
the Dice coefficient is

D(F1, F2) = 2h / (a + b) = (2 × 5) / (7 + 5) = 10/12 ≈ 0.83,

giving 0.83 as the approximate similarity of the two strings.
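This comparison step can be reproduced directly from the bit vectors in Table 2. The following sketch (our own illustration, not code from the cited papers) computes the Dice coefficient from the two filters:

```python
# Bloom filters from Table 2 (14 bits, k = 2 hash functions).
F1 = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]   # "paula"
F2 = [1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]   # "paul"

def dice(f1, f2):
    """Dice coefficient 2h / (a + b), from counts of 1-bits."""
    a = sum(f1)                                # 1-bits in F1
    b = sum(f2)                                # 1-bits in F2
    h = sum(x & y for x, y in zip(f1, f2))     # common 1-bits
    return 2 * h / (a + b)

print(round(dice(F1, F2), 2))  # → 0.83
```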
Research by Schnell et al. (2013) suggests using longer Bloom filters of about 500 to 1000 bits and
many more than just two hash functions for effective comparisons and more efficient secure
encodings. Variations of Bloom filter based PPRL have been used in a number of applications such as
legal applications, health applications, and computer networking applications (Schnell et al., 2009;
Schnell, 2013; Niedermeyer et al., 2014).
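As a sketch of the encoding step used in such approaches, the following splits a string into q-grams and hashes each into a 1000-bit Bloom filter with k hash functions. The function names are our own, and seeded blake2b hashes from the Python standard library stand in for k independent hash functions such as MurmurHash3; this is an assumption for illustration, not our actual implementation.

```python
import hashlib

def qgrams(s, q=2):
    """Split a string into overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def encode(s, l=1000, k=30, q=2):
    """Encode the q-grams of s into an l-bit Bloom filter using k hash functions."""
    bits = [0] * l
    for gram in qgrams(s, q):
        for seed in range(k):
            digest = hashlib.blake2b(gram.encode(),
                                     salt=seed.to_bytes(16, "big")).digest()
            bits[int.from_bytes(digest[:8], "big") % l] = 1
    return bits

print(qgrams("morapedi", q=3))  # → ['mor', 'ora', 'rap', 'ape', 'ped', 'edi']
```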
5 EXPERIMENTS AND RESULTS
Experiments were conducted using a 4116 record data set, each record representing a student’s
exam record. Another data set represented student registration records. For a scenario where
privacy preservation was not used, each record pair, one from each database, was compared using a
Levenshtein edit distance similarity score. We implemented our system in Python; the Python
Levenshtein library (Haapala, 2015) was used for the similarity score (specifically its ratio
function). The similarity score ranges from 0 to 1 (a real number). Two fields were used: ID number
and surname. The scores from the two corresponding fields were added. A threshold was then used to
determine, for each threshold value, the precision and recall. The threshold varied from 0.0 to 1.98
in steps of 0.02, giving 100 threshold values. This range arises because the combined similarity
score for the two fields ranges from 0.0 to 2.0.
For privacy preservation, Bloom filters were used to encode the records in both databases. The
Bloom filters had a length of 1000 bits, since such a long length serves to reduce the number of
false positives, was shown to work well in the literature, and was used in the experiments
evaluating Bloom filters by Schnell et al. (2009). To hash the strings into the Bloom filters, the
mmh3 hash function in the Python MurmurHash3 library (Appleby, 2016) was used. The number of hash
functions used was 10, 30, and 60, making for three experiments, in order to evaluate performance
for a varying number of hash functions, which, together with long Bloom filters, serves to reduce
the number of false positives. Each student record comprised two strings, each of which was broken
up into 2-grams, and each 2-gram was hashed into a 1000-bit Bloom filter using k MurmurHash3 hash
functions. A
false positive (FP) is a record pairing which is identified as a match, but is actually a non-match. A
true positive (TP) is a record pairing which is identified as a match and is actually a match. A
true negative (TN) is a record pairing identified as a non-match and is actually a non-match. A
false negative (FN) is a record pairing identified as a non-match but is actually a match. Precision
is the fraction of record pairs identified as matches that are true positives, i.e. TP/(TP+FP).
Recall is the fraction of actual matches identified as true positives, i.e. TP/(TP+FN)
(Manning et al., 2008). Precision and recall are calculated for each threshold.
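The per-threshold computation described above can be sketched as follows. This is a simplified stand-in for our pipeline: a plain edit-distance similarity normalised to [0, 1] replaces the ratio function of the python-Levenshtein library, and the helper names and sample record pairs are our own illustrative assumptions.

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalised similarity in [0, 1]: 1 for identical strings."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

def precision_recall(scored_pairs, threshold):
    """scored_pairs: list of (similarity_score, is_true_match) tuples."""
    tp = sum(1 for s, m in scored_pairs if s >= threshold and m)
    fp = sum(1 for s, m in scored_pairs if s >= threshold and not m)
    fn = sum(1 for s, m in scored_pairs if s < threshold and m)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Sweeping the threshold upward in small steps and recording (precision, recall) at each value traces out the precision-recall curve used in the figures.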
Figures 3 to 5 show the results of experiments for the various configurations. The red curve is the
precision-recall curve for the non-privacy preserving experiment and the blue curves are for the
privacy-preserving configuration using Bloom filters. We employ visual inspection in order to
compare the two curves; this approach was used in the literature to evaluate Bloom filters
(Schnell et al., 2009). A visual inspection shows that the Bloom filter configuration performs
almost the same as the non-privacy preserving configuration. In the figures, m represents the Bloom
filter length (number of bits) while k represents the number of hash functions used for each q-gram.
Therefore, there is negligible negative impact on the quality of record linkage.
6 DISCUSSION AND CONCLUSION
There are laws and regulations guiding the use of data that contains sensitive information such as
student records. Only with the assurance of use of privacy preserving record linkage can institutions
freely allow the use of their student data for data matching. Privacy preserving record linkage on
student data can be used for research such as determining the preparedness of secondary school
leavers for tertiary education in general, or for specific tertiary programs, by linking their
first-year tertiary results with their secondary results. Alternatively, linkages can be used to
determine the degree to which a student can successfully complete a specific course, for example a
programming course, by linking results from other courses taken. These can help in understanding
failure rates in certain courses or programs without exposing the details of the failing students.
Privacy preservation also helps mitigate bias, as matched records can be evaluated without
knowledge of who the actual individual students are.
We have demonstrated that Privacy Preserving Data Matching has negligible effect on performance
of data matching using our data set by conducting extensive experiments and using a visual analysis.
Future work will involve a numerical analysis using a metric such as Area Under Precision Recall
Curve. Future work will also involve development of a protocol for privacy-preserving record linkage
in our context. This will detail how many data-handling entities are required, which ones do the
hashing, and which ones do the matching.
Figure 3: Precision-Recall Curves for Non-Privacy Preservation (Red) and Privacy Preservation
(Blue) With 1000 bit Bloom Filters and 30 hash functions.
Figure 4: Precision-Recall Curves for Non-Privacy Preservation (Red) and Privacy Preservation
(Blue) With 1000 bit Bloom Filters and 10 hash functions.
Figure 5: Precision-Recall Curves for Non-Privacy Preservation (Red) and Privacy Preservation
(Blue) With 1000 bit Bloom Filters and 60 hash functions.
ACKNOWLEDGMENTS The authors would like to thank the anonymous reviewers for their useful comments.
REFERENCES Anderson, G., Masizana, A.N., & Mpoeleng, D. (2013). An Exact and Inexact Approach for Saving
Time and Preventing Errors in Processing of Student Exam Results at the University of
Botswana. International Journal on Information Technology (IREIT), 1(3), 179-185.
Appleby, A. (2016). MurmurHash3. Retrieved from:
https://github.com/aappleby/smhasher/wiki/MurmurHash3.
Bloom, B. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the
ACM 13(7), 422–426.
Christen, P. (2012). Data Matching, Berlin: Springer-Verlag.
Durham, E., Xue, Y., Kantarcioglu, M., & Malin, B. (2011). Quantifying the correctness, computational
complexity, and security of privacy-preserving string comparators for record linkage.
Information Fusion, 13(4), 245-259.
Haapala, A. (2015). Python Levenshtein. Retrieved from: http://github.com/ztane/python-
Levenshtein.
Karakasidis, A., & Verykios, V.S. (2011) Secure blocking + secure matching = secure record linkage.
Journal of Computing Science and Engineering, 5(3), 223–35.
Kuzu, M., Kantarcioglu, M., Durham, E., & Malin, B. (2011, July). A constraint satisfaction
cryptanalysis of bloom filters in private record linkage. In Privacy Enhancing Technologies
(pp. 226-245). Springer Berlin Heidelberg.
Manning, C.D., Raghavan, P., & Schutze, H. (2008). An Introduction to Information Retrieval,
Cambridge University Press.
Niedermeyer, F., Steinmetzer, S., Kroll, M., & Schnell, R. (2014). Cryptanalysis of basic bloom filters
used for privacy preserving record linkage. Journal of Privacy and Confidentiality, 6(2), 3.
Randall, S. M., Ferrante, A. M., Boyd, J. H., Bauer, J. K., & Semmens, J. B. (2014). Privacy-preserving
record linkage on large real world datasets. Journal of Biomedical Informatics, 50, 205-212.
Schnell, R., Bachteler, T., & Reiher, J. (2009). Privacy-preserving record linkage using Bloom filters.
BMC Medical Informatics and Decision making, 9(1), 41.
Vatsalan, D., Christen, P., & Verykios, V. (2013) Tutorial on Techniques for Scalable Privacy
Preserving Record Linkage, Presented at the 22nd ACM International Conference on
Information and Knowledge Management (CIKM 2013), San Francisco, October 2013.
Retrieved from: https://cs.anu.edu.au/people/Peter.Christen/cikm2013pprl-tutorial/cikm-
2013-pprl-tutorial-slides.pdf
Vatsalan, D., Christen, P., & Verykios, V. S. (2013). A taxonomy of privacy-preserving record linkage
techniques. Information Systems, 38(6), 946-969.
Vatsalan, D., Christen, P., O'Keefe, C. M., & Verykios, V. S. (2014). An evaluation framework for
privacy-preserving record linkage. Journal of Privacy and Confidentiality, 6(1), 35-75.
Vatsalan, D., & Christen, P. (2014). Scalable privacy-preserving record linkage for multiple databases.
In Proceedings of the 23rd ACM International Conference on Conference on Information and
Knowledge Management (pp. 1795-1798). ACM.
IC1016
Ontological Perspectives in Information System, Information Security and
Computer Attack Incidents (CERTS/CIRTS)
Ezekiel Uzor Okike, Tshiamo Motshegwa, Molly Nkamogelang Kgobathe
Department of Computer Science Faculty of Science
University of Botswana Gaborone, Botswana
okikeue@mopipi.ub.bw; tshiamo.motshegwa@mopipi.ub.bw; molly.kgobathe@mopipi.ub.bw
ABSTRACT
Ontological methodologies are used in almost every field of study, including philosophy, medicine, science, and engineering based disciplines. This paper is motivated by the need to address pertinent issues and adopt necessary and useful ontology based research approaches in Information Systems. The paper aims to discuss ontology by defining its uses, types, methodologies, and applications, especially in Information Systems and Information Security. The paper discusses techniques, applications, and uses of ontologies in computer science, multiagent systems, and particularly in information systems from three perspectives, namely Information Systems (IS) research methods, formal specification, and Information Security. The paper concludes with the view that these three perspectives are still needed in the Information Systems (IS) research and education agenda. IS research methods accommodate surveys, case studies, and experiments as in other disciplines; however, the researcher must appropriately demonstrate the need for and usefulness of the chosen research method in IS research. On the Information Security side, the paper takes the view that the application of ontological approaches and models could assist in the development of information system security tools for sharing cyber-attack incident data and information.

Keywords: Ontologies, Information Systems, Information Security, Formal Specification, Multiagent
Systems and CERTs/CIRTs
1 INTRODUCTION
Knowledge-Based Systems (KBS) are computer programs that reason using a knowledge base to
solve complex problems (Hayes-Roth, Waterman, & Lenat, 1983). The development of effective
Knowledge Management Systems (KMS) has become a critical issue in applied domains (Wu, 2005).
As a result, the need to adopt ontological approaches in Artificial Intelligence, Computer Science,
and Information Systems has been widely discussed (Chandrasekaran, 1999; Pereira & Santos, 2009;
Raskin, 2001). As explained by Chau (2007), ontology is concerned with
the detailed description of the architecture, the development and the implementation of the
systems prototype using both forward chaining and backward chaining during the inference process.
In order to realize the objective of semantic match for knowledge search, ontology may be divided
into information ontology and domain ontology. Enterprises with ontology based knowledge
management applications focus on Knowledge Processes and Knowledge Meta Processes (Staab, 2003).
This paper is concerned with research methods in computing science, and especially in Information
systems. The paper also proposes an ontological semantic approach for sharing information among
Computer Emergency Response Teams (West-Brown, Stikvoort, Kossakowski, Killcrece, & Ruefle,
1998) by developing what we coin the CERTS Ontology, or Computer Incident Response Teams (CIRTS) Ontology.
1.1 The Statement of the Problem
The main challenges in Computing and Information Systems research lie in defining a research problem and in selecting an appropriate research methodology. Upcoming researchers (especially graduate students) and younger researchers in the domains of computing and Information Systems are often unable to apply ontological research methods and formal models due to their weak grasp of ontology as a basis of research and as a formalism.
1.2 Study Objective
The objective of this paper is to examine ontological research methods and ontology applications in
computing, Information systems, and Information security and to propose an ontological semantic
approach for sharing information among Computer Emergency Response Teams (CERTS), also
referred to as Computer Security Incident Response Teams (CIRTS). This is done in order to
demonstrate the usefulness of ontological research methods, and formal ontological models in
Information systems and Information security.
1.3 Methodology
The approach adopted in this paper is to review the literature relating to ontology, Information
Systems, and Information Security in order to obtain the general perceptions of researchers with
respect to ontological methods and perspectives in computing and Information Systems. Using this
background, we then propose an ontology for CERTS.
The rest of this paper is organised as follows. Section 2 presents a formal definition of ontology, its
uses, types, methods and techniques. Section 3 deals with ontological applications in regards to the
choice of research methods, Information systems and information security. Section 4 examines
ontological applications in multi-agent systems. Section 5 concludes the discussion with the view
that ontological research methods enable the definition of research problems and the selection of
appropriate research methods in Information Systems, including the use of surveys, case studies,
and experiments, and proposes an ontology for the CERTS domain.
2 ONTOLOGIES
2.1 What is Ontology?
Ontology is a formal, explicit specification of a shared conceptualization, used to encourage
standardization of the terms for representing knowledge about a domain (Kang et al., 2009).
Ontology describes the logical structure of a domain, its concepts, and the relations between them
(Silvonen, 2002). Ontology has also been referred to as a frank technology for representing
knowledge (Abburu & Babu, 2013).
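As an illustrative sketch of this view of ontology as concepts plus relations, a tiny domain fragment can be modelled as a set of concepts with is-a relations, over which simple reasoning such as transitive subsumption is possible. The concept names below are hypothetical examples, not drawn from any cited ontology:

```python
# A toy ontology fragment: concepts linked by is-a (subclass) relations.
# The concept names are hypothetical examples for illustration only.
IS_A = {
    "SecurityIncident": "Event",
    "MalwareIncident": "SecurityIncident",
    "RansomwareIncident": "MalwareIncident",
}

def is_subclass_of(concept, ancestor):
    """Follow is-a links transitively to test subsumption."""
    while concept in IS_A:
        concept = IS_A[concept]
        if concept == ancestor:
            return True
    return False

print(is_subclass_of("RansomwareIncident", "Event"))  # → True
```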
2.2 Uses of Ontology
Fensel et al. (2001) state that the rapid growth of online information on intranets and the
Web has led to information overload. There needs to be some automatic, meaning-directed or
semantic information processing of online documents. As a solution, an ontology knowledge base
provides innovative tools for semantic information processing and thus for much more selective,
faster, and meaningful user access.
Ontology was adopted by early Artificial Intelligence (AI) researchers, who adapted its application
from mathematical logic and argued that AI researchers could create new ontologies as computational
models that allow certain kinds of automated reasoning (Gruber, 2009).
Furthermore, Gruber (1992) offers the following definition of ontology: "A specification of a
representational vocabulary for a shared domain of discourse — definitions of classes, relations,
functions, and other objects". Ontologies are also seen as defining a common vocabulary in which
shared knowledge is represented. They are widely used to support the sharing and reuse of formally
represented knowledge in AI systems (Gruber, 1992).
Kabilan (2007) and Gruber (2009) also list the following as frequent uses of ontologies:
1. To share common understanding of the structure of information among people or software
agents.
2. To enable reuse of domain knowledge.
3. To make domain assumptions explicit.
4. To separate domain knowledge from the operational knowledge.
5. To analyse domain knowledge.
2.3 Types of Ontologies
Ontologies can be classified based on their scope or domain granularity, the direction of taxonomy
construction, or the type of data sources, as shown in Figure 1 (Roussey et al., 2011).
1. Domain Granularity
Figure 1: Ontology Categories (Antonio Zilli et al, 2009)
Zilli et al. (2009) noted that the top-level ontology contains concepts that enjoy general
agreement or stable standards; the domain ontology has concepts that define the main focus of
interest in the domain; the task ontology deals with sub-concepts that are needed to solve
problems in the main domain; and the application ontology deals with concepts that exercise the
fastest rate of exchanging data.
2. Taxonomy construction direction
Several approaches could be followed to build the concept taxonomy. One could either use the
bottom–up approach, top-down approach or the middle out approach (Catherine Roussey et al,
2011).
Bottom-Up approach: defines first the most specific concepts and then generalizes towards the most
general concepts.
Top-Down approach: defines first the most general concepts and then specializes towards the most
specific concepts in order to build the ontology.
Middle-Out approach: defines the concepts of the central area first and then moves towards the more
general and/or more specific concepts to build the ontology.
3. Type of sources
Ontologies can also be described according to the sources used to obtain the knowledge
(Roussey et al., 2011). The knowledge can be based on:
Text: unstructured data given to a computer system for processing.
Thesaurus: concepts formed from words or linguistic relations to build the ontology.
Relational Database: structured and accurate data stores from which ontologies are built.
UML Diagrams: formally described UML classes used to define concepts to build ontologies.
2.4 Ontology Engineering/Methodologies
Apparently, by 1995 there were no standard methodologies for building ontologies, nor were there
many research publications in this area (Uschold & Gruninger, 1996). The same authors therefore
proposed a methodology named the Enterprise Ontology Modelling Process, which has the following
phases.
1. Identify Purpose and Scope: deals with the main reason why the ontology is being built.
2. Building the ontology: segmented into three steps:
i) Ontology capture: deals with identifying the key concepts and relationships in
the domain of interest.
ii) Ontology coding: deals with representing the knowledge using a formal
language for the ontology.
iii) Integrating existing ontologies: incorporates both the coding and capturing
processes with the logic of how to use the ontology.
3. Evaluation: gives a technical judgement on the ontology.
4. Documentation: states the guidelines for each purpose.
As the years went by, it came to be known that the ontology development process can be carried out
in many ways following the IEEE standard for developing Software Life Cycle Processes
(Mohammad Nazir Ahmad et al., 2012). Two approaches used for building an ontology are:
1. Building the ontology from scratch, or
2. Building ontologies from existing ontologies or from different data sources (Giannopoulou,
2008).
Many researchers have proposed methodologies addressing ontology development.
Therefore, when designing an ontology knowledge base, one could choose to follow any one of the
methodologies shown in Figure 2.
Figure 2: Phases of methodologies for building ontologies (Mohammad Nazir Ahmad et al,
2012)
2.5 Ontology Techniques
The focus of modern information systems is moving from “data processing” towards “concept
processing”, meaning that the basic unit of processing is less and less an atomic piece of data and is
becoming more a meaningful concept which carries an explanation and exists in a context with other
concepts (Janez Brank et al, 2005).
The first key characteristic of the standardization of ontologies is the development of the ontology
mark-up formats and associated standards over the time, which also shows the evolving demands
for semantic mark-up (Nordmann, 2009). Nordmann (2009) further explains that the largest
distributed pile of data currently is the internet, which is processed by a lot of different involved
applications. This involves static data, as well as web services interacting with each other and
different data sources to build new services. Figure 3 below shows the history of the languages of
technologies that were used to define ontologies over the years.
Figure 3: History of ontology related technologies (Nordmann, 2009)
2.6 Ontology Matching
Euzenat and Shvaiko (2007) define ontology matching as providing a common conceptual basis for
organizing classifications that can be used to compare (logically) different existing ontology
matching systems as well as for designing new ones, taking advantage of state-of-the-art solutions.
2.7 Ontology Mapping
Godugula (2008) explains ontology mapping as a technique that has become quite useful for matching
semantics between ontologies or schemas that were designed independently of each other. Ontology
mapping is done by analysing various properties of ontologies, such as syntax, semantics, and
structure, in order to deduce alternate semantics that may apply to other ontologies, and therefore
create a mapping (Godugula, 2008).
3 ONTOLOGY APPLICATIONS
According to Viinikkala (2004) and Roussey et al. (2011), the term ontology has become popular,
especially in information systems domains such as knowledge engineering, natural language
processing, cooperative information systems, intelligent information integration, web technologies,
database design, and knowledge management. Viinikkala (2004) further noted that one strongly
pursued goal in the information systems ontology domain is that of establishing methods for
automatically generating ontologies, and suggests that automation requires a higher degree of
accuracy in the description of its procedures, with ontology being a mechanism to help achieve
this. Therefore, an information system ontology needs to be designed for at least one specific or
practical application. Kabilan (2007) justifies ontology as a software artefact or formal language
designed with a specific set of uses and computational environments in mind.
3.1 Ontology in Information Systems We propose the use of ontologies in Information systems from three perspectives, namely:
(i) Information systems research
(ii) Information systems development (Information systems engineering)
(iii) Information systems security
3.1.1 Ontologies in Information systems domain research. In this regard we consider the choice of research methods and the use of formal specifications as
crucial in IS research.
(i) Choice of research methods
Consider the ontological research method shown in figure 4 below. When considering research
philosophy or methodology, ontological perspectives come into play, since ontology is also "the
science or study of being" and deals with the nature of reality (Blaikie, 1993). In simple language, and
from a philosophical point of view, ontology is a system of belief that reflects an individual's
interpretation of what constitutes a fact and what does not. In this regard, a researcher
should decide whether the entities being studied are objective or subjective. Objectivism and subjectivism
have been identified as two important aspects of ontology. According to Saunders, Lewis, and
Thornhill (2009), objectivism "portrays the position that social entities exist in reality external to
social actors concerned with their existence". Bryman (2003) further adds that objectivism
"is an ontological position that asserts that social phenomena and their meanings
have an existence that is independent of their social actors".
With regard to subjective research, subjectivism (the constructive interpretation of research results)
holds that social phenomena are created from the perceptions and consequent actions of the social
actors. Bryman (2003) formally defines constructive interpretation (constructionism) as "an
ontological position which asserts that social phenomena and their meanings are continually being
accomplished by social actors".
Our argument at this point is that Information systems research should also follow ontological
research methods, applied with care. The impact of ontology on the choice of research
methods is shown in figure 4 below. From the bottom, the researcher has to choose an appropriate
research method and decide whether it needs a quantitative or qualitative approach, or both; then decide
on the research strategy (experimental, case study, deduction, induction); then decide on the
approach depending on the strategy (empirical, interpretivist); and finally follow an ontological
research approach. In clear terms, information systems research permits the use of surveys,
case studies and experiments, depending on the research design and approach (Gable, 1994; Choudrie
& Dwivedi, 2005; Walsham & Sahay, 2006; Glasow, 2005).
Figure 4. Impact of ontology on the choice of research methods
(ii) The use of formal specifications
Ontologies are explicit formal specifications of the terms in a domain and the relations among them
(Gruber, 1993). Using formal specifications in IS research guarantees the ability to prove the
reliability and workability of our theories. The theoretical framework of an IS research project may be
established as a provable system during specification. Indeed, the concepts of domains and relations
are mathematical models whose applications play vital roles in IS research. In the relational model
the basic concepts are relations, Cartesian products, attributes, keys, domains and N-Uplets.
Mathematically, let A and B be two sets.

Define the Cartesian product of the sets A and B as the set of all ordered pairs (a, b) such that a ∈ A and b ∈ B:

    A × B = {(a, b) : a ∈ A and b ∈ B}    (1)

Any subset of A × B defines a relation on A and B. That is, a relation R is a set of ordered pairs (x, y) such that (x, y) ∈ R ⊆ A × B.

Define the domain of a relation R as the set:

    dom(R) = {x : (x, y) ∈ R, for some y}    (2)

Define the range of a relation R as the set:

    ran(R) = {y : (x, y) ∈ R, for some x}    (3)
The basic relational concepts used in information systems include relations, domains, attributes, keys and N-Uplets. Consider for example
the STUDENT table shown in Figure 5 below. In this table, domains represent columns or attributes,
while N-Uplets represent rows or records.
FirstName LastName StudentID Level of study Department
Abel Richard 20130123 200 Information Systems
Luke Anderson 20130145 200 Computer Science
Mary Paledi 20130167 200 Information Systems
Figure 5. STUDENT table
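To make these definitions concrete, the sketch below (an illustrative Python example, not part of the original discussion) models the STUDENT relation of Figure 5 as a set of tuples and computes the domain and range of a binary relation derived from it, following definitions (2) and (3) above:

```python
# The STUDENT relation of Figure 5 as a set of N-Uplets (tuples).
ATTRIBUTES = ("FirstName", "LastName", "StudentID", "LevelOfStudy", "Department")

STUDENT = {
    ("Abel", "Richard", 20130123, 200, "Information Systems"),
    ("Luke", "Anderson", 20130145, 200, "Computer Science"),
    ("Mary", "Paledi", 20130167, 200, "Information Systems"),
}

def column(relation, attributes, name):
    """Project a single attribute (column) of a relation."""
    i = attributes.index(name)
    return {row[i] for row in relation}

def domain(relation):
    """Domain of a binary relation R: {x : (x, y) in R, for some y}."""
    return {x for (x, y) in relation}

def rng(relation):
    """Range of a binary relation R: {y : (x, y) in R, for some x}."""
    return {y for (x, y) in relation}

# A binary relation R pairing each StudentID with its Department:
# a subset of the Cartesian product of the two attribute value sets.
R = {(row[2], row[4]) for row in STUDENT}
print(sorted(domain(R)))  # the StudentID column
print(sorted(rng(R)))     # the Department column
```

Here the domain of R is exactly the StudentID column and the range is the set of distinct Department values.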
The application of these definitions in IS development can be seen in the development of logical
schemas and tables, functional dependency principles and the normalization of tables in database
systems, as well as in the development of data models for specific applications. The need for
proper modelling of data objects, entities and their relations cannot be overemphasized in IS projects,
as the success of a design depends largely on its data models; improperly modelled systems
compromise system functionality. Therefore the foundations of IS models are strongly based in
ontology. Moreover, the foundations of computing paradigms such as Object-Oriented Analysis and
Design (OOAD) and their tools are deeply rooted in ontology. More aspects of formal ontology in
Information systems are discussed in (Guarino, 1998).
3.1.2 Ontology in Information system development (Information Systems Engineering)
Ontology as a theory of domains presents a highly structured system of concepts covering processes,
objects and attributes with all of their complex relations.
In Information Systems development, software development processes involve at least four steps:
1. Systems analysis (Feasibility study, Requirements Engineering, Data Modeling),
2. Systems design (interface designs, input/output, files and database design –
tables/schemas/relations/normalization),
3. System implementation (programming, testing, deployment- installation/system
conversion/documentation),
4. System maintenance, and evaluation.
Ontology finds useful applications in software development processes, especially in systems analysis
and data modelling. For example, DiLeo, Jacobs, and DeLoach (2002) situate building an ontology as
part of the systems analysis phase when incorporating ontologies as part of extending the MaSE
methodology (DeLoach & Kumar, 2005).
As presented in Raskin (2001), an ontology may divide the root concept ALL into EVENT, OBJECT
and PROPERTY; EVENT into MENTAL-EVENT, PHYSICAL-EVENT and SOCIAL-EVENT; OBJECT into
INTANGIBLE-OBJECT, MENTAL-OBJECT, PHYSICAL-OBJECT and SOCIAL-OBJECT; and PROPERTY into
ATTRIBUTE, ONTOLOGY-SLOT and RELATION, as shown in figure 7 below. This approach is utilized in
Object-Oriented Analysis and Design (OOAD) (Booch, 1993), with all its modelling tools such
as the Unified Modelling Language (UML) and facilities including use cases, activity diagrams, use
case descriptions, functional models and structural models (classes, class diagrams, object diagrams,
patterns, associations, attributes and operations) (Tegarden, Dennis, & Wixom, 2013).
Figure 6: Ontology building as part of system analysis (DiLeo, Jacobs, & DeLoach, 2002)
Figure 7. ALL Tree hierarchy illustrating design principle of Ontology (Classification)
Adapted from (Raskin, 2001)
ALL
  EVENT: MENTAL-EVENT, PHYSICAL-EVENT, SOCIAL-EVENT
  OBJECT: INTANGIBLE-OBJECT, MENTAL-OBJECT, PHYSICAL-OBJECT, SOCIAL-OBJECT
  PROPERTY: ATTRIBUTE, ONTOLOGY-SLOT, RELATION
3.1.3 Ontology in Information system security
Raskin (2001) discusses ontological needs in information security from two perspectives, namely:
(i) The inclusion of natural language data sources as an integral part of the overall data sources
in information security applications. NLP is extensively used today in writing system
administration logs, in information hiding, and in scanning documents to detect possible intellectual
property breaches.
(ii) The formal specification of the information security community's know-how, to support
routine and time-efficient measures for preventing and counteracting attacks on computer
systems. Sophisticated algorithms are also used as encryption tools in information
security. Security in Information systems can be enforced at different levels. In fact, the
basic concepts of the relational model (namely relations, domains, attributes, keys and
N-Uplets) allow for appropriate security checks at the design level, such as
domain integrity, referential integrity and N-Uplet checks, all of which are derived from
formal specifications in design. (Note that in a relational table, domains represent
columns or attributes, while N-Uplets represent rows or records.)
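As an illustration of these design-level checks, the following Python sketch (illustrative only; the relations and domain predicates are hypothetical) implements simple domain-integrity and referential-integrity checks over relations represented as collections of tuples:

```python
def check_domain_integrity(relation, attributes, domains):
    """Domain integrity: every attribute value must satisfy the predicate
    declared for that attribute's domain. Returns the violations found."""
    violations = []
    for row in relation:
        for name, value in zip(attributes, row):
            if not domains[name](value):
                violations.append((name, value))
    return violations

def check_referential_integrity(child, fk_index, parent, key_index):
    """Referential integrity: every foreign-key value in the child relation
    must appear as a key value in the parent relation."""
    parent_keys = {row[key_index] for row in parent}
    return [row for row in child if row[fk_index] not in parent_keys]

# Hypothetical DEPARTMENT (parent) and STUDENT (child) relations.
DEPARTMENT = {("Information Systems",), ("Computer Science",)}
STUDENT = [
    ("Abel", "Richard", 20130123, 200, "Information Systems"),
    ("Mary", "Paledi", 20130167, 200, "Mathematics"),  # dangling reference
]

# Department (index 4 in STUDENT) must reference a key of DEPARTMENT (index 0).
print(check_referential_integrity(STUDENT, 4, DEPARTMENT, 0))
```

A production DBMS enforces such constraints declaratively, but the checks derive from the same formal specifications described above.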
Furthermore, Pereira and Santos (2009) suggest that ontologies contribute to unifying the terminologies
involved in the classification and storage of security data, promote the exchange of security information,
support the browsing and searching of semantic content, promote interoperability to facilitate
knowledge management and configuration, and provide support for the construction of models or
theories of specific domains.
As security is critical in information systems, approaches that guarantee provable and reliable tools,
such as those found in ontology, should be emphasized. Hence the adequate dissemination of
essential ontological knowledge among researchers, Information systems engineers and industry
practitioners is a sine qua non for providing security for vital data and information.
3.1.4 Ontologies in Multiagent Systems From an Artificial Intelligence perspective, agents are communicative, intelligent, rational and possibly intentional entities. From the computing perspective, they are autonomous, asynchronous, communicative, distributed and possibly mobile processes (Pitt & Mamdani, 1991). Multiagent systems (Wooldridge, 2002) are modular distributed systems with decentralized data. Agents in a Multiagent system have incomplete information or capabilities and have to interact using agent communication languages (Labrou, Finin, & Peng, 1999) and interaction protocols (Huget & Koning, 2003) to further their goals. There are numerous Agent-Oriented Software Engineering (AOSE) methodologies; the most common are MaSE (DeLoach & Kumar, 2005), GAIA (Wooldridge, Jennings, & Zambonelli, 2005), PROMETHEUS (Padgham & Winikoff, 2005) and TROPOS (Bresciani, Giorgini, Giunchiglia, Mylopoulos, & Perini, 2004).
Ontologies have since been integrated successfully into some of these methodologies and used extensively in the development of Multiagent systems. For example, Tran and Low (2008) introduce MOBMAS, a methodology for ontology-based multi-agent systems development. MOBMAS is claimed to be the first methodology that explicitly identified and implemented the various ways in which ontologies can be used in the MAS development process and integrated into the MAS model definitions.
There have also been extensions of Multiagent systems methodologies like MaSE (DiLeo, Jacobs, & DeLoach, 2002) to use ontologies for information domain specification as part of the systems analysis phase. Agent development frameworks for building Multiagent systems, like JADE (Bellifemine & Giovanni, 2007), also offer tools for developing these ontologies.
4. ONTOLOGY IN CERTS: A PROPOSAL FOR ONTOLOGIES IN PROTECTION STRUCTURES (COMPUTER EMERGENCY RESPONSE TEAM APPLICATIONS)
This paper proposes the development of an ontology for representing and sharing incidents between
CERTs. CERTs are protection structures for critical infrastructure and services (West-Brown,
Stikvoort, Kossakowski, Killcrece, & Ruefle, 1998).
There is a proliferation of critical systems in utility services, e.g. power distribution,
telecommunications, water and others. Critical systems also drive most economic activity, for
example in banks and financial institutions, stock markets, tax systems and others. Furthermore,
it is widely known and accepted that business in general is now increasingly done
online through e-commerce, as are other services through eServices.
Increasingly also, most governments have developed eGovernment strategies that highlight strategic
aims such as improving service and content delivery through the development of a core set of critical
infrastructures. Governments are investing heavily in undersea cables, national
telecommunications infrastructure, backbone networks and increased bandwidth and connectivity.
These developments, coupled with the diffusion of devices, deregulation and falling bandwidth
prices, have led to an increase in the roll-out and consumption of government public service offerings
and information. These developments are also employed to use technology as a leveller for social
equality, homogenising quality of service across societies in areas like health, education and
agriculture through, for example, eHealth and telemedicine, tele-education and distance learning.
Governments are also engaged in open government initiatives and are opening up datasets and
statistics to facilitate business and innovation.
As a result, and as a matter of strategic security imperative in the face of this proliferation of
critical systems, infrastructure and services, most countries are now developing, or have developed,
Computer Emergency Response Teams (CERTs) to handle cybersecurity incidents as part of their
frameworks for critical information infrastructure protection.
Against this backdrop, there is a need to develop information systems that allow a common
representation of knowledge and the sharing of information between CERTs and between countries.
Ontologies can be used for these purposes. Such systems, if based on Multiagent systems, could also
provide mechanisms for sharing ontologies and for ontology negotiation. This paper is part of our
ongoing research in this direction.
5. CONCLUSION In concluding this paper, we suggest that ontological methods are useful in Information systems
research, Information systems development and Information systems security. The Information
Systems (IS) field makes extensive use of ontological models in terms of formalism (formal
specifications), theories, software development processes, tools, metrics and several applications.
Ontological research approaches provide good guidance in defining research problems as
well as in selecting appropriate research methods. Therefore, the need to explore
the potential of ontology in Information systems cannot be overemphasized. Surveys, case studies
and experiments are all permitted in IS research; however, the researcher must appropriately
demonstrate the need for and usefulness of the chosen research method. On the information
security side, we suggest that the application of ontological approaches and models could assist in
the development of reliable information security tools to withstand cyber-attacks on data and
information. We propose in this paper an application of ontology in CERTs/CIRTs.
REFERENCES
Zilli, A., et al. (2009). Semantic Knowledge Management: An Ontology-Based Framework. New York:
Information Science Reference.
Bellifemine, F. L., & Giovanni, C. G. (2007). Developing Multi-Agent Systems with JADE. Wiley.
Blaikie, N. (1993). Approaches to Social Enquiry. Cambridge: Polity Press.
Booch, G. (1993). Object-Oriented Analysis and Design with Applications. Addison-Wesley
Professional.
Bresciani, P., Giorgini, P., Giunchiglia, F., Mylopoulos, J., & Perini, A. (2004). TROPOS:an agent
oriented software development methodology. Journal of Autonomous Agents and Multi-
Agent Systems, 8(3), 203–236.
Bryman, A. (2003). Business Research Methods. Oxford: Oxford University Press.
Roussey, C., et al. (2011). An introduction to ontologies and ontology engineering. In Ontologies in
Urban Development Projects (p. 241). London: Springer-Verlag.
Chandrasekaran, B., et al. (1999). What are ontologies, and why do we need them? IEEE Intelligent
Systems, 14(1), 20-26.
Chau, K. (2007). An ontology-based knowledge management system for flow and water quality
modeling. Advances in Engineering Software 38, 172–181.
DeLoach, S., & Kumar, M. (2005). Multi-agent systems engineering: an overview and case study. In P.
G. B. Henderson-Sellers (Ed.), Agent-Oriented Methodologies (pp. 236–276). IDEA Group
Publishing.
Fensel, D., et al. (2001). On-To-Knowledge: Ontology-based tools for knowledge management.
Free University Amsterdam (VUA), Mathematics and Informatics, De Boelelaan 1081a, NL-1081 HV
Amsterdam, The Netherlands. Retrieved September 09, 2015, from
http://www.cs.vu.nl/~frankh/postscript/eBeW00.pdf
DiLeo, J., Jacobs, T., & DeLoach, S. A. (2002). Integrating Ontologies into Multiagent Systems
Engineering. AOIS '02, Agent-Oriented Information Systems, Proceedings of the Fourth
International Bi-Conference Workshop on Agent-Oriented Information (AOIS-2002 at
AAMAS02). Bologna, Italy: Springer. Retrieved from http://SunSITE.Informatik.RWTH-
Aachen.DE/Publications/CEUR-WS/Vol-59/1DiLeo.pdf
Euzenat, J., & Shvaiko, P. (2007). Ontology Matching. Springer-Verlag.
Giannopoulou, E. G. (2008). Building ontology from knowledge base systems. In Data Mining in
Medical and Biological Research (p. 320). InTech.
Godugula, S. (2008, June 15). Survey of Ontology Mapping Techniques. Software Quality and
Assurance, p. 14.
Gruber, T. (1993). A translation approach to portable ontology specifications. Knowledge
Acquisition, 5(2), 199-220.
Gruber, T. (2009). Ontology. In L. Liu & M. T. Özsu (Eds.), Encyclopedia of Database Systems.
Springer-Verlag.
Guarino, N. (1998). Formal Ontology and Information Systems. Proceedings of FOIS '98, (pp. 3-15).
Trento, Italy 6-8 June .
Hayes-Roth, F., Waterman, D., & Lenat, D. (1983). Building Expert Systems. Addison-Wesley.
Huget, M.-P., & Koning, J.-L. (2003). Interaction protocol engineering. In M.-P. Huget (Ed.),
Communication in Multiagent Systems (Vol. 2650, pp. 179-195). Springer.
Jurisica, I., et al. (1999). Using ontologies for knowledge management: An information systems
perspective. Annual Conference of the American Society for Information Science (p. 15).
Washington, DC: University of Toronto, Toronto, Ontario, Canada.
Brank, J., et al. (2005). A survey of ontology evaluation techniques. International Multi-conference
on Information Society (pp. 166-169). Retrieved October 28, 2015, from
http://ai.ia.agh.edu.pl/wiki/_media/pl:miw:2009:brankevaluationsikdd2005.pdf
Kabilan, V. (2007). Ontology for Information Systems (O4IS) Design Methodology: Conceptualizing,
Designing and Representing Domain Ontologies. The Royal Institute of Technology (KTH),
Department of Computer and Systems Sciences.
Ye, K., et al. (2009). Ontologies for crisis contagion management in financial institutions. Journal of
Information Science, 35(5), 548-562.
Labrou, Y., Finin, T., & Peng, Y. (1999). Agent communication languages:The current landscape. IEEE
Intelligent Systems, 14(2), 45–52.
Uschold, M., & Gruninger, M. (1996). Ontologies: Principles, methods and applications. Knowledge
Engineering Review, 11(2), 93-136.
Ahmad, M. N., et al. (2012). Ontology-Based Applications for Enterprise Systems and Knowledge
Management. IGI Global.
Nordmann, K. (2009, May 13). Standardization of Ontologies. Retrieved from http://kore-
nordmann.de: http://kore-
nordmann.de/talks/09_04_standardization_of_ontologies_paper.pdf
Padgham, L., & Winikoff, M. (2005). Prometheus: a practical agent-oriented methodology. In P. G. B.
Henderson-Sellers (Ed.), Agent- Oriented Methodologies (pp. 107-135). IDEA Group
Publishing.
Pereira, T., & Santos, H. (2009). An Ontology based approach to Information Security. In M. S. F.
Sartori (Ed.), Metadata and Semantic Research (pp. 183-192). Springer Berlin Heidelberg:
Springer.
Pitt, J., & Mamdani, A. (1991). A protocol-based semantics for an agent communication language.
Proceedings of the 16th International Joint Conference on Artificial Intelligence (pp. 486-491).
Stockholm, Sweden: Morgan Kaufmann.
Raskin, V., et al. (2001). Ontology in information security: A useful theoretical foundation and
methodological tool. Proceedings of the 2001 Workshop on New Security Paradigms (pp. 53-59).
Cloudcroft, New Mexico: ACM. doi:10.1145/505168.505183
Saunders, M., Lewis, P., & Thornhill, A. (2009). Research Methods for Business Students (5th ed.).
Prentice Hall.
Silvonen, P. (2002, October 21). Ontologies and Knowledge Base. Retrieved October 22, 2015, from
http://www.ling.helsinki.fi:
http://www.ling.helsinki.fi/~stviitan/documents/Ontologies_and_KB/ontology.html
Steffen Staab, R. S. (2003). Knowledge Processes and Meta Processes in Ontology-Based Knowledge
Management. In C. W. Holsapple (Ed.), Handbook on Knowledge Management (Series
International Handbooks on Information Systems ed., Vol. 2, pp. 47-67). Berlin Heidelberg
GmbH, Karlsruhe, Germany: Springer. doi: 10.1007/978-3-540-24748-7
Abburu, S., & Babu, G. S. (2013). A framework for ontology based knowledge management.
International Journal of Soft Computing and Engineering (IJSCE), 3(3), 21-25.
Tegarden, D., Dennis, A., & Wixom, B. H. (2013). Systems Analysis and Design with UML.
Singapore: Wiley.
Tran, Q.-N. N., & Low, G. (2008). MOBMAS: A methodology for ontology-based multi-agent systems
development. Information and Software Technology, 50(7-8), 697 - 722.
Viinikkala, M. (2004, March 22). Ontology in Information Systems. 8109103 Ohjelmistotuotannon
teoria, p. 17.
West-Brown, M. J., Stikvoort, D., Kossakowski, K.-P., Killcrece, G., & Ruefle, R. (1998). Handbook for
Computer Security Incident Response Teams (CSIRTs). Pittsburgh: Carnegie Mellon Software
Engineering Institute.
Wooldridge, M. (2002). An Introduction to MultiAgent Systems. John Wiley & Sons.
Wooldridge, M., & Jennings, N. (2005). Multi-agent systems as computational organizations: the
Gaia methodology. In B. Henderson-Sellers (Ed.), Agent-Oriented Methodologies (pp. 136-171).
IDEA Group Publishing.
Wu, J. (2005). A framework for ontology-based knowledge management system. Institute of Systems
Engineering (p. 9). Dalian, China: Dalian University of Technology.
ICI002
Big Data Forensics As A Service
Oteng Tabona, Andrew Blyth
Information Security Research Group University of South Wales
Pontypridd, United Kingdom oteng.tabona@southwales.ac.uk; andrew.blyth@southwales.ac.uk
ABSTRACT
The endless human reliance on computers, the proliferation of relatively inexpensive computing devices, and the continuous advancement of technology have given rise to Big Data. Today, digital forensic investigations involve not only a PC but numerous digital devices, which together hold huge amounts of data. These developments have affected digital forensics, because current forensic tools fall short in many ways when dealing with Big Data. In this paper we discuss Big Data challenges, outline some requirements for a Big Data forensic tool, and then design a Big Data forensic framework. A few experiments carried out on the platform show that the framework can be used as a forensic tool. Key words: Big Data Forensics, Digital Forensics as a Service, Forensic cloud, Digital Forensics, Hadoop
1 INTRODUCTION The cost of digital devices has dropped drastically over the years, making them affordable to most
people. Digital devices are becoming people's lifelines; many people find it hard to spend a day
without one. They depend on them for various purposes such as communication,
entertainment, access to information and health applications. Each digital device is designed for a
specific purpose, so consumers normally own a number of them for different use cases. For
example, a person might own a smartphone, tablet, laptop, desktop, game console and TV, all for
different purposes.
Digital devices have become a very important part of crime investigations. However, the proliferation
of their usage has increased the amount of data acquired for digital forensic investigations.
Records from the Regional Computer Forensics Laboratory (RCFL, 2013) indicate that the size of
evidence increased by over 500% over the seven-year period from 2006. The collected evidence
data is huge in size and heterogeneous, and needs new, efficient algorithms to analyse it. We classify
this data as Big Data. Big Data was described by Laney (2012) as high-volume, high-velocity and high-
variety data. Big Data has affected digital forensics in a number of ways, discussed in the
following paragraphs.
Traditional forensic tools are typically based on a single workstation. The massive amount of
evidence collected for investigations nowadays cannot be examined on a single workstation because
of storage and processing-power demands. In addition, the number of devices that can be seized per
investigation has increased. This is a concern because current forensic tools are not designed to
investigate multiple devices at the same time, so the opportunity to carry out cross-drive analysis
and find correlations between these devices is missed.
Furthermore, various data sources generate data in different formats. Unstructured data constitutes
80% of the data produced today (IBM, 2010), yet existing tools cannot efficiently analyse it
because the processing power required is beyond the capability of a single workstation.
The lack of unstructured-data support means that this data is often examined manually, which can
be costly in time and resources and increases the likelihood of errors.
The existing forensic tools also lack the capability to share case output. Sharing case output helps
in identifying organised crimes that span multiple law enforcement regions; without this
capability, criminal organisations that appear in different regions will not be discovered.
1.1 Contribution This paper presents a digital forensic platform to address the Big Data challenges in forensics.
The Hadoop framework is used to store and process the evidence collection. We also adopt an 'as a
service' model so that our platform works in the cloud. A cloud environment provides further
scalability beyond what Hadoop offers and thus helps in investigations involving huge amounts of
forensic evidence. The framework also encompasses an intelligence-sharing framework, which offers
the capability to search previous case evidence from other LEAs.
The rest of this paper is structured as follows: this section has outlined the Big Data challenges in
digital forensics. Section 2 discusses the requirements of a Big Data forensic tool. Section 3
reviews research related to this work. Section 4 presents the architecture of the proposed
framework. A few experiments carried out on the framework are shown in section 5, after which
the paper is concluded.
2 REQUIREMENTS
In this section we detail the requirements of our Big Data forensic framework and explain how these
requirements address the Big Data challenges discussed in the introduction above.
2.1 Big Data technology Big Data requires a new generation of technologies and architectures, designed to efficiently extract
value from large volumes of heterogeneous data by allowing high-velocity discovery and/or analysis
(IDC, 2011). Traditional software and techniques cannot efficiently analyse such huge volumes of data.
Recently there has been a rise in the number of Big Data solutions, which offer features
such as scalability, reliability and availability. The solution considered for this study is
Hadoop, an open-source, distributed, batch-processing and fault-tolerant system that is
capable of storing and analysing massive amounts of data (White, 2011).
Hadoop provides the high level of scalability needed for Big Data processing, allowing the
addition of computing nodes whenever demand increases. Data in Hadoop can be processed
with MapReduce, which processes data in parallel, thereby achieving scalability in data-intensive
analysis (Dean & Ghemawat, 2008). Another fundamental feature of MapReduce is that
code is sent to the nodes that hold the data instead of the data being transferred to the computing
nodes. This approach reduces the chance of a network bottleneck because the code is much smaller
than the data.
Another benefit of the Hadoop framework is that it is fault-tolerant: data remains available even when individual data nodes fail. Data in Hadoop is normally replicated to multiple data nodes (three copies by default) for backup purposes.
2.2 Cross-drive analysis
Cross-drive analysis gives an investigator the opportunity to search across multiple evidence sources to find interesting patterns. This analysis makes it possible to build an exhaustive timeline, which gives a detailed picture of past events.
2.3 Collaboration
Digital forensic investigators face a complex and wide range of problems that require people with different expertise to work together and share know-how. Collaboration gives investigators the opportunity to learn from each other, leading to improved problem-solving strategies. This feature will also reduce the duplication of processes that would otherwise affect the efficiency of the investigation.
Collaboration addresses the current practice whereby each investigator is assigned an evidence source and the result sets are only grouped to find connections after the individual investigations are complete (Baar, Beek & Eijk, 2014). The problem with this procedure is that results can easily be missed. With the proposed approach, however, results are aggregated automatically.
The collaboration platform also encourages specialisation. Examiners who specialise in image analysis can concentrate on that area alone, while those who specialise in textual analysis can focus on text examination only. The collaboration feature will also allow trusted investigators in any location to be given access to the platform and assist with examinations.
2.4 Intelligence sharing
The current forensic set-up makes it hard to find correlations between cases. The intelligence sharing feature allows investigators to search old cases for links; in doing so, examiners can discover criminal gangs.
2.5 Knowledge sharing
The technological landscape is evolving at a very high rate; for example, every year new smartphones are released with new features. These features often need to be studied to find suitable ways to discover evidence. The knowledge sharing framework is designed to allow individuals to share investigative techniques and strategies. When investigators face a new challenge, they can consult the framework to see if there is any shared information pertaining to the issue at hand.
3 LITERATURE REVIEW
The idea of using a cloud platform to address the Big Data challenge in the field of digital forensics is relatively new. A few research papers concentrate on providing digital forensics from a cloud environment to tackle this challenge; those of interest are discussed below.
Roussev et al. (2009) presented an MPI MapReduce (MMR) model to deal with large forensic collections. MMR provides linear scaling for both CPU-intensive processing and indexing. Miller et al. (2014) designed Forensicloud, an architecture for cloud-based digital forensic analysis. The Forensicloud framework encompasses existing forensic tools running in a cloud environment, providing faster processing capabilities as well as collaborative functions (Miller et al., 2014). Roussev and Richard (2004) proposed a lightweight distributed framework to improve investigation turnaround time. They emphasised that single workstations cannot cope with the performance demands, and suggested that it is not possible to upgrade the performance of these tools further as they have already reached their limit.
Baar, Beek and Eijk (2014) presented a Digital Forensics as a Service (DFaaS) model, which is used by the Netherlands Forensic Institute (NFI). The authors report that the DFaaS model has significantly reduced case backlogs and improved the efficiency of investigations by improving the traditional investigation process. Some of the key changes include freeing up investigators' time by assigning administrative roles to administrators, sharing information between the investigator, analyst and detective, and archiving acquired knowledge for future use.
Federici (2013) designed a conceptual digital forensic framework called AlmaNebula, which leverages the power and storage capacity of private or community clouds to process digital evidence. Federici (2013) foresees cloud computing platforms as a solution to existing tools, which cannot scale to meet both storage and processing power demands. The Sleuth Kit Hadoop framework (Carrier, 2012) is a prototype project that incorporates The Sleuth Kit into a Hadoop cluster to speed up digital forensic investigation time. However, the development of this platform seems to have stopped, and more work is needed to complete it. Raghavan, Clark and Mohay (2009) made a case for merging different evidence sources, highlighting that by integrating multiple sources investigators will be able to reconstruct a precise image of the past. They presented an architecture for integrating evidence information from different sources irrespective of the logical type of its contents.
4 ARCHITECTURE
The architecture of the proposed Big Data forensic framework is shown in Figure 1. The architecture is divided into three blocks: storage, processing, and applications. The storage dimension stores the evidence, while the processing dimension analyses it. The application block holds a number of applications that are applied to the evidence. The framework is implemented in a cloud environment to increase the overall scalability of the system.
4.1 Storage
Digital forensic images will be acquired using existing imaging tools such as FTK Imager. The acquired images are inserted into the framework and stored in HDFS (Hadoop Distributed File System) by an Ingester application. After the insertion, the Ingester initialises a File System Parser (FSP) program. The FSP reads the image and populates an HBase table for the case being investigated.
The main requirement of the storage system is to maintain the integrity of the evidence. Hadoop was created with the notion of write once, read many times; the assumption is that once a file is written it will not be modified. This requirement is monitored throughout the investigation of the evidence.
Figure 1 Big Data forensic framework
4.2 Processing
The MapReduce framework is used to process evidence in HDFS and HBase. Client applications are written to implement the Map and Reduce code. MapReduce operates by sending code to the nodes that hold the data. The Map function processes the data in parallel, and the Map output is then combined by the Reduce function. The resulting Reduce output can then be processed by client applications (such as timeline visualisation) to discover suspicious acts.
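The Map/Reduce contract used by these client applications can be illustrated with a minimal local simulation. This is plain Python rather than the actual Hadoop API, and the record format and example data are purely hypothetical:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal local simulation of the MapReduce contract:
    map each record to (key, value) pairs, shuffle by key,
    then reduce each group of values to a single result."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map phase
            groups[key].append(value)       # shuffle: group values by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical example: count recovered files by extension.
files = ["report.doc", "cat.jpg", "dog.jpg", "notes.txt"]
by_extension = run_mapreduce(
    files,
    map_fn=lambda name: [(name.rsplit(".", 1)[-1], 1)],
    reduce_fn=lambda ext, counts: sum(counts),
)
print(by_extension)  # {'doc': 1, 'jpg': 2, 'txt': 1}
```

In a real Hadoop job the map and reduce functions would be distributed to the nodes holding the data blocks, but the data flow is the same.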
4.3 Applications
The application block features client applications. The early development of the framework will include a few applications; additional programs can be added later to extend the functionality of the framework. For example, the initial development only incorporates widely used file systems; parsers for the remaining file systems can be written and added to the framework later.
Several examiners are allowed to investigate the same case together. Their role is to run client
applications such as timeline and network analysis. The output from these analyses can be further
interpreted using various techniques such as visualisation.
The Intelligence Sharing application is a specialised feature that identifies objects from a case. These objects can include credit card numbers, email addresses, phone numbers, people's names and so on. The identified objects are represented in a format such as XML and stored in an external database. This external database is accessible to other investigators using the same framework. The Intelligence Sharing feature is also used to identify links between the case being investigated and previous cases; such links are missed by current techniques because no comparable framework exists.
4.4 The framework as a Big Data forensic tool
In this section we review how this framework addresses the Big Data challenges in digital forensics.
The cloud environment together with Hadoop offers the high scalability needed to accommodate large forensic collections. This feature makes it possible to store evidence from multiple sources together. With the evidence stored in a common area, it is practical to perform cross-drive analysis to find any existing connections.
In addition to the large storage, the framework also provides processing power beyond that of a single workstation. Vast computing power allows most processes to be automated; current tools still perform some crucial analyses manually and are thus ineffective.
Furthermore, both the storage and processing dimensions of the proposed framework can handle unstructured data. Existing forensic tools struggle to deal with unstructured data, despite it being the most common data type.
Additional features such as collaboration and intelligence sharing are incorporated to facilitate investigations. Collaboration allows a pool of investigators to come together and share know-how on a case. Intelligence sharing gives more insight into the case being investigated; a link between it and other cases can trigger leads that are currently impossible to discover.
5 EXPERIMENTS
In this section, we evaluate whether Hadoop can be used for digital forensics. In our experiment we took the Enron dataset (Cohen, 2015) and copied it onto an empty 8 GB FAT32 USB drive. The USB device was then imaged using FTK Imager. The resulting image was ingested into HDFS and its md5sum hash noted. A FAT32 file system parser was then initiated to recover the files and insert them into HBase. We then carried out a few analyses (network analysis and word count), which are discussed below. The total number of files inserted into the HBase table was 88278.
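The integrity check underlying this experiment (comparing the md5sum recorded at ingestion with the hash of the image after analysis) can be sketched as follows. The helper names and file paths are illustrative, not part of the actual framework:

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a (possibly very large) image file
    by reading it in 1 MiB chunks, so it never loads fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def integrity_preserved(ingest_hash, image_path):
    """True if the image hash after analysis matches the hash
    noted when the image was ingested."""
    return md5_of(image_path) == ingest_hash
```

A matching pair of hashes indicates that the analyses did not modify the evidence image.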
5.1 Network analysis
Network analysis was used to identify communication relationships and social circles. To achieve this, we wrote a MapReduce program. The Map code identified the sender of each email and all its recipients. The Reduce code mapped each sender to all the recipients that received emails from that sender. The Reduce output was then processed into a form that can be visualised using the vis.js library (Almende B.V., 2016).
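The Map and Reduce steps just described can be sketched in plain Python as a local stand-in for the MapReduce job. The email record layout is an assumption; the addresses are taken from the Enron examples discussed in this section:

```python
from collections import defaultdict

def map_email(email):
    """Map: emit a (sender, recipient) pair for each recipient of an email."""
    for recipient in email["to"]:
        yield email["from"], recipient

def reduce_sender(sender, recipients):
    """Reduce: collapse all recipients seen for one sender into a sorted list."""
    return sorted(set(recipients))

emails = [
    {"from": "a..hope@enron.com",
     "to": ["michelle.cash@enron.com", "sally.beck@enron.com"]},
    {"from": "a..price@enron.com",
     "to": ["sally.beck@enron.com"]},
]

# Shuffle: group the mapped pairs by sender, then reduce each group.
shuffled = defaultdict(list)
for email in emails:
    for sender, recipient in map_email(email):
        shuffled[sender].append(recipient)
graph = {sender: reduce_sender(sender, rs) for sender, rs in shuffled.items()}
# graph now maps each sender to the distinct recipients of their emails,
# ready to be exported (e.g. as vis.js nodes and edges) for visualisation.
```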
The visualised output was analysed and a number of relationships could be recognised (see Figure 2). For example, a..hope@enron.com, a..howard@enron.com and a..price@enron.com were responsible for sending email to many individuals. On the other hand, michelle.cash@enron.com, shelley.corman@enron.com and sally.beck@enron.com received emails from one or more of the individuals responsible for sending many mails (a..hope@enron.com, a..howard@enron.com and a..price@enron.com). In an investigation, the identified group of individuals would be examined further for more information.
Figure 2 Network analysis
5.2 Word count
A word count analysis was performed on the data to determine word frequencies, again using a MapReduce program. The following 20 words (see Table 1) appeared in more emails than others.
Pat, Bill, Baughman, Don, Jr., Mr., Mrs., Andrea, Call, Friend, Janice, Laddie, Lalena, Marc, Mary, Matlock, NEIGHBOUR, Patsy, Randy, Reagan
Table 1 Most frequent words
In an investigation, word count can be used to generate keywords. The keywords will then be used
to search for files that contain them.
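A word-count Map/Reduce of this kind can be sketched locally as follows. The tokenisation rule is an assumption; the real job would run over the recovered email bodies stored in HBase:

```python
import re
from collections import Counter

def map_words(document):
    """Map: tokenise one email body and emit (word, 1) pairs."""
    for word in re.findall(r"[A-Za-z']+", document.lower()):
        yield word, 1

def word_frequencies(documents):
    """Shuffle/reduce: sum the counts emitted for each word."""
    counts = Counter()
    for document in documents:
        for word, one in map_words(document):
            counts[word] += one
    return counts

freq = word_frequencies(["Call Pat.", "Pat, call Bill."])
print(freq["call"], freq["pat"])  # 2 2
```

The most frequent words in the resulting counter can then serve as candidate keywords for follow-up searches.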
In this paper we carried out network analysis to identify social structures from the Enron data, and a word count to find word frequencies in the emails. A few more analyses can be implemented and run on the evidence. However, the main objective of this experiment was to evaluate whether Hadoop can be used for digital forensics. To verify this, we calculated the hash value of the final image (after analysis) and compared it with the original hash value. The two hash values match, which shows that Hadoop can be used for digital forensics. This experiment does not present a Big Data case; it was important to evaluate the feasibility of using Hadoop for forensics before experimenting with one. Follow-up experiments will present a Big Data case.
6 CONCLUSION
The Big Data explosion has significantly affected digital forensics, as current tools cannot cope with the demand to analyse such data. In this paper we identified the current Big Data challenges in digital forensics, derived the essential requirements of a Big Data forensic tool, and selected Hadoop as a suitable technology. We carried out a few experiments to show that Hadoop can be used for digital forensics. The output from the experiments shows that the integrity of the evidence is preserved in Hadoop, which makes it suitable for forensics.
The implementation of the platform is still ongoing. The next stage is to implement more file system parsers so that the platform can hold evidence from a variety of data sources. When this stage is complete, a Big Data case will be designed and further experiments carried out.
ACKNOWLEDGMENTS
The authors would like to thank the Botswana International University of Science and Technology (BIUST) for their support.
REFERENCES
Almende B.V. (2016). Vis.js. Retrieved from http://visjs.org/#
Baar, R., Beek, H., & Eijk, E. (2014). Digital forensics as a service: A game changer. Digital Investigation, 11, S54-S62. Retrieved from http://www.sciencedirect.com/science/article/pii/S1742287614000127
Carrier, B. (2012). Sleuth Kit Hadoop Framework. Retrieved from http://www.sleuthkit.org/tsk_hadoop/
Cohen, W., W. (2015). Enron Email Dataset. Retrieved from http://www.cs.cmu.edu/~enron/
Dean, J. & Ghemawat, S. (2008). Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51, 107–113. doi: http://doi.acm.org/10.1145/1327452.1327492
Federici, C. (2013). AlmaNebula: A computer forensics framework for the cloud. Procedia Computer Science, 19, 139-146. Retrieved from http://www.sciencedirect.com/science/article/pii/S1877050913006315
IBM. (2010). The enterprise answer for managing unstructured data. Retrieved from https://www-304.ibm.com/events/idr/idrevents/detail.action?meid=6320
IDC. (2011). Extracting value from chaos. Retrieved from https://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
Laney, D. (2012). The importance of 'big data': A definition.
Miller, C., Glendowne, D., Dampier, D., & Blaylock, K. (2014). Forensicloud: An architecture for digital forensic analysis in the cloud. Journal of Cyber Security, 3, 231–262.
Raghavan, S., Clark, A. J., & Mohay, G. M. (2009). FIA: An open forensic integration architecture for composing digital evidence. Second International Conference on e-Forensics: Forensics in Telecommunications, Information and Multimedia, 83-94. Retrieved from http://eprints.qut.edu.au/28073/
Regional Computer Forensics Laboratory (2013). The RCFL Program's annual report for Fiscal Year 2012. Retrieved from https://www.rcfl.gov/downloads/documents/2012-rcfl-national-report/view
Roussev, V., & Richard III, G. G. (2004). Breaking the performance wall: The case for distributed digital forensics. Proceedings of the 2004 Digital Forensics Research Workshop (DFRWS 2004), 1-16.
Roussev, V., Wang, L., Richard, G., & Marziale, L. (2009). A cloud computing platform for large-scale forensic computing. Advances in Digital Forensics V: Fifth IFIP WG 11.9 International Conference on Digital Forensics, 306, 201-214.
White, T. (2011). Hadoop: The Definitive Guide. O'Reilly Media, Inc.
IC1007
Information Security Policy Violation:
The Triad of Internal Threat Agent Behaviors
Maureen van den Bergh, Kennedy Njenga
Department of Applied Information Systems
University of Johannesburg, Johannesburg, South Africa
maureenvdb@uj.ac.za; knjenga@uj.ac.za
ABSTRACT
Behavioral information security studies that primarily pursue the classification of information security policy violation behaviors have received little attention in the literature. A growing number of researchers and security reports have advocated research into this area. This paper endeavours to address this gap in the literature by conceptualizing the triad of internal threat agent behaviors via a thematic analysis of the literature. The triad of internal threat agent behaviors represents three classes of security behaviors, namely misbehavior, non-malicious deviant behavior, and malicious deviant behavior. This distinction could potentially improve the effectiveness of corrective actions in mitigating the risks associated with information security policy violations.
Key words: information security policy, misbehavior, non-malicious deviant behavior, malicious
deviant behavior, internal threat agent
1 INTRODUCTION
Information Systems (IS) security remains a top priority and a challenge for information security (InfoSec) managers (D'Arcy, Hovav, & Galletta, 2009; Guo, Yuan, Archer, & Connelly, 2011; Johnston et al., 2015; Loch, Carr, & Warkentin, 1992). With the financial loss of security
breaches ever increasing and threats constantly evolving (The Global State of Information Security®
Survey, 2015), IS security violations by employees continually represent a problem that creates
tremendous risks and costs for organizations (Vance, Lowry, & Egget, 2015). Information security
managers and organizational executives struggle with enforcing policies designed to protect
information and information assets from intentional or unintentional security violations (D'Arcy et
al., 2009; Johnston et al., 2015).
Because employees are the cause of many IS security incidents (Ernst & Young's Global Information Security Survey, 2014; The Global State of Information Security® Survey, 2015), they are a major risk to IS security and therefore an organization's first priority when it comes to IS security (Ernst & Young's Global Information Security Survey, 2014). Internal organizational employees who are either the cause of, or contribute to, security incidents are referred to as "Internal Threat Agents (ITAs)" (Verizon Data Breach Investigations Report, 2012). The impact of these security incidents could be significant, because insiders are more likely to steal sensitive data of a non-financial nature
or intellectual property (The Global State of Information Security® Survey, 2015; US State of Cybercrime Survey, 2014; Verizon Data Breach Investigations Report, 2012). Also, because of their privileged insider position, ITAs could avoid discovery due to their intimate knowledge of organizational security efforts (Verizon Data Breach Investigations Report, 2012; Whitman & Mattord, 2012).
In the research stream of behavioral InfoSec studies (Crossler et al., 2013; Furnell & Clarke, 2012), the purpose is generally to improve our understanding of ITA security behaviors (D'Arcy et al., 2009; Herath & Rao, 2009; Workman, Bommer, & Straub, 2008), to mitigate the risks associated with those behaviors (Colwill, 2009; Safa et al., 2015; Safa, Von Solms, & Furnell, 2016), and to change them (Beautement & Sasse, 2009; Furnell, Papadaki, & Thomson, 2009; Johnston et al., 2015).
Although these behavioral InfoSec studies provide multiple insights into ITA behavior, the concept of differentiating ITA security behaviors has recently emerged, along with the importance of applying the right kind of corrective actions to different classes of security behaviors (Crossler et al., 2013; Verizon Data Breach Investigations Report, 2012).
Although the classification of ITA security behaviors is not the primary focus of the following studies, they do refer to ITAs and their behavioral intent. Intent is described with terminology such as accidental (Loch et al., 1992; Vroom & von Solms, 2004), passive (Willison & Warkentin, 2013), and unintentional (Crossler et al., 2013; Verizon Data Breach Investigations Report, 2012), with the opposite described as intentional (Crossler et al., 2013; Loch et al., 1992; Vroom & von Solms, 2004; Willison & Warkentin, 2013), deliberate (Verizon Data Breach Investigations Report, 2012), and knowingly (Guo et al., 2011).
A study by Stanton, Stam, Mastrangelo, and Jolton (2005) proposed and tested a taxonomy that
includes six types of security behaviors. These behaviors are differentiated according to
intentionality and technical expertise. Intentionality in turn is described as malicious, neutral and
beneficial.
In the literature, little attention has been given to behavioral InfoSec studies that primarily pursue the classification of ITA security behaviors. This paper endeavours to address this gap by conceptualizing the triad of ITA behaviors via a thematic analysis of the literature. We propose three classes of behaviors, namely misbehavior (MB), non-malicious deviant behavior (NDB) and malicious deviant behavior (MDB). MB is defined as unintentional, non-malicious information security policy violations; NDB as intentional, non-malicious violations; and MDB as intentional, malicious violations.
Applying the right kind of corrective action to a specific behavior is important (Crossler et al., 2013). For example, trying to address malicious deviant behavior with awareness and training is inappropriate, because this type of behavior is intentional and harmfully motivated, whereas using awareness and training to address misbehavior is appropriate.
In pursuance of the above goal, this paper is divided into five sections. The first section introduces the triad of ITA behaviors. The second section discusses an emergent perspective on ITA behavior. The third section reflects on who ITAs are and the risks they pose to IS security. The fourth section theorizes MB, NDB, and MDB, and presents the triad of ITA behaviors. The fifth section discusses the application of the right kind of corrective actions to MB, NDB and MDB. The paper closes with a discussion and conclusion.
2 AN EMERGENT PERSPECTIVE ON ITA BEHAVIOR
An article titled "Future directions for behavioral information security research" by Crossler et al.
(2013) and the Verizon Data Breach Investigations Report (2012) both propose future research to separate insider misbehavior from deviant behavior. In the Computer Security Institute (2011) survey, respondents were asked for the first time to differentiate between non-malicious insider actions and malicious insider actions (previous surveys did not do so). Separating behaviors may improve the success, and sometimes the applicability, of corrective actions toward misbehavior and deviant behavior (Crossler et al., 2013).
While the classification of ITA security behaviors is not the primary aim of InfoSec studies, some refer to ITAs and their behavioral intent in sections of their studies. For example, in their pursuit to
determine the threats to IS, Loch et al. (1992) included the human perpetrator’s accidental and
intentional intent as part of their threat taxonomy. In the organizational context, Willison and
Warkentin (2013) focused on a holistic approach to insider computer abuse, and considered the
thought processes of human perpetrators preceding deterrence. As part of their investigation they
extended Loch et al.’s (1992) threat taxonomy, focussing on the human perpetrator. They agreed
with Loch et al.’s taxonomy of behavior as intentional, but differed on the term “accidental” by
replacing it with the term passive. They then proceeded to expand the taxonomy to passive non-
volitional noncompliance, volitional but not malicious noncompliance, and intentional malicious
computer abuse.
Vroom and von Solms (2004) explored the role of auditing in organizational InfoSec. They singled out
the human factor involved in organizational asset security, and how difficult it would be to audit the
behavior of these employees. They concur with Loch et al. (1992) and also refer to the ITA’s intent as
accidental and intentional. Vroom and von Solms (2004) add to the granularity of intent by referring to the accidental non-malicious and the intentional malicious employee.
The Verizon Data Breach Investigations Report (2012) clearly states its classification of ITA behaviors as unintentional, inappropriate but not malicious, and deliberately malicious. It shares the classification of "unintentional" with Crossler et al. (2013), but the similarity ends there: while the Verizon Data Breach Investigations Report (2012) uses the classification "deliberate", Crossler et al. (2013) rather agree with Loch et al. (1992), Vroom and von Solms (2004), and Willison and Warkentin (2013) on a classification of "intentional".
One study, Guo et al. (2011), theorised and tested a model of intentional violation but focused only on non-malicious security violations (NMSV), while a study by Stanton et al. (2005) produced a taxonomy that included six types of security behaviors, differentiated according to intentionality and technical expertise. In their two-factor taxonomy, Stanton et al. (2005) classified three intentions: malicious, neutral and beneficial.
The studies mentioned above use a variety of terminologies to describe ITA security behaviors. Despite this, they also seem to inadvertently suggest a classification and differentiation of ITA security behaviors, through their agreement on, and disagreement about, behavior terminologies.
Table 1 shows the results from the thematic analysis of literature, as described above.
Table 1 Thematic Analysis of Literature
Author/s | Year | Source | Perpetrator | Intent / Behavior
Loch et al. | 1992 | Internal | Human | Accidental; Intentional
Willison and Warkentin | 2013 | Internal | Human | Passive non-volitional non-compliance; Volitional but not malicious non-compliance; Intentional malicious computer abuse
Verizon Data Breach Investigations Report | 2012 | Internal | Human | Unintentional; Inappropriate but not malicious; Deliberately malicious
Guo et al. | 2011 | Internal | End-users | Knowingly violate: non-malicious security violations
Crossler et al. | 2013 | Insider | - | Intentional deviant behavior; Unintentional misbehavior
Stanton et al. | 2005 | Information technology users | - | Intentionally malicious; Intentionally beneficial; Neutral
Vroom & von Solms | 2004 | Employees | - | Accidental non-malicious; Intentional malicious
The security behavior of ITAs is key to an organization's efforts to decrease policy violations. It is central to an organization's security efforts that ITAs behave and act responsibly with regard to the information security policy (ISP). The next section explores this.
3 INTERNAL THREAT AGENTS
The threats to IS security come from internal and external sources, and from both human and non-human perpetrators (Loch et al., 1992). Of these, the internal human threat, also referred to as the "Internal Threat Agent (ITA)", is the cause of many IS security incidents (Ernst & Young's Global
Information Security Survey, 2014; The Global State of Information Security® Survey, 2015).
ITAs are the internal organizational employees who are either the source of a security incident or
contribute to such an incident. ITAs include principal management, employees, contractors and
interns (Verizon Data Breach Investigations Report, 2012). Principal management are persons with the primary duty of managing a department, unit and/or sub-division; employees are persons who do not manage other employees; and contractors are persons or firms that undertake a contract to provide materials or labour to perform a service or do a job. Lastly, interns are students or trainees who work in organizations either to obtain work experience or to fulfil a requirement for their studies.
ITAs are a significant risk to IS security (Ernst & Young's Global Information Security Survey, 2014),
with insider theft as one of the top causes of data breaches (Ernst & Young's Global Information
Security Survey, 2014). "Careless or unaware employees" are a vulnerability that increases organizational risk exposure, and such employees are therefore an organization's first priority when it comes to IS security (Ernst & Young's Global Information Security Survey, 2014).
Surveys such as the CSI/FBI Computer Crime and Security Survey (2010), The Global State of Information Security® Survey (2015), the US State of Cybercrime Survey (2014), and the Vormetric Data Security Report (2015) state that insider agents contribute a much smaller share of the overall number of incidents than external agents, but this number is not indicative of the demise of insider misconduct. External agents may launch a single sting attack against hundreds of victims (industrialised attacks), whereas internal agents have a much smaller number of potential targets (Verizon Data Breach Investigations Report, 2012). Also, because of their privileged insider position, internal threats could avoid discovery due to their intimate knowledge of organizational security efforts (Verizon Data Breach Investigations Report, 2012; Whitman & Mattord, 2012).
Nonetheless, compared to external agents, the possible impact of insider incidents on an
organization is significant. The reason is the higher probability for insiders to steal sensitive
data of a non-financial nature or to acquire intellectual property (The Global State of Information
Security® Survey, 2015; US State of Cybercrime Survey, 2014; Verizon Data Breach Investigations
Report, 2012). Insider crimes can also have a greater financial impact than outsider crimes.
Security incidents often result in increased data loss, through compromised employee and
customer records or the unintentional release of sensitive information, especially via the Internet.
Other threats arise from peer-to-peer file sharing, or abuse of system access privileges by others
as a result of carelessly written-down or easy-to-guess passwords (K. H. Guo et al., 2011;
The Global State of Information Security® Survey, 2015).
IS security remains a high priority for InfoSec managers (Loch et al., 1992), and a major challenge
(K. H. Guo et al., 2011), because the financial losses from security breaches are ever increasing amid
constantly evolving threats (The Global State of Information Security® Survey, 2015). A study by
Garg, Curtis, and Halper (2003) on the economic impact of security breaches concluded that the
financial loss to an organization is much higher than initially reported by other self-reporting
organizational surveys: more in the range of 0.5% to 1.0% of annual sales. 8.7% of respondents to
The Global State of Information Security® Survey (2015) report that, compared to the 2014 survey,
financial losses are up by 15% in 2015, and 32% of respondents to this survey indicate that
insider crimes are more costly than outsider crimes.
Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016
Copyright © Department of Computer Science, University of Botswana, 2016 74
Data breaches are also a major challenge, with insiders causing damage by unintentionally exposing
sensitive information, such as confidential records, customer records and employee records (US
State of Cybercrime Survey, 2014). The Global State of Information Security® Survey (2015) also
reports that security incidents result in increased data loss, especially caused by compromised
employee and customer records. "In our heavily networked world, organizations across the globe
are under attack 24/7/365" (Computer Security Institute, 2011).
Surveys such as the CSI/FBI Computer Crime and Security Survey (2010), The Global State of Information Security® Survey (2015), The US State of Cybercrime Survey (2014), and The Vormetric Data Security Report (2015) report an increase in insider security incidents, while 62% of respondents to The Insider Threat Spotlight Report (2015) indicate an increase in the frequency of insider threats during the last 12 months. Next follows a discussion of the classification of ITA behavior in terms of MB, NDB, and MDB.
4. THEORIZING INTERNAL THREAT AGENT BEHAVIOR
4.1 Classification of Behaviors
The terminologies from the thematic analysis of the literature, as presented in Table 1 under Section
2, are homogenized across their numerous similarities to classify ISP violation behaviors. This
study classifies ISP violation behaviors with the source as internal, the agents as human, their
intent as either unintentional or intentional, and the three categories as misbehavior, non-malicious
deviant behavior, and malicious deviant behavior. The human threat agent is differentiated as
management, employees, contractors and interns (Verizon Data Breach Investigations Report,
2012). Figure 1 illustrates this classification.
Figure 1 Classification of ITA Behaviors
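The classification above can also be sketched as a small data model. The following Python sketch is purely illustrative; the class and field names are the author of this edit's own shorthand, not part of the original study.

```python
from dataclasses import dataclass
from enum import Enum

class BehaviorClass(Enum):
    MB = "misbehavior"                      # unintentional, non-malicious
    NDB = "non-malicious deviant behavior"  # intentional, non-malicious
    MDB = "malicious deviant behavior"      # intentional, malicious

@dataclass
class ITAViolation:
    agent: str          # "management", "employee", "contractor", or "intern"
    intentional: bool
    malicious: bool

    def classify(self) -> BehaviorClass:
        # Unintentional violations are misbehavior by definition;
        # intentional ones split on malicious intent.
        if not self.intentional:
            return BehaviorClass.MB
        return BehaviorClass.MDB if self.malicious else BehaviorClass.NDB

# A contractor who knowingly shares a password, without malice, is NDB.
print(ITAViolation("contractor", intentional=True, malicious=False).classify())
```

The two boolean axes (intent, malice) are sufficient to recover the three classes, since the fourth combination (unintentional yet malicious) does not arise.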
4.2 Misbehavior
We define MB as unintentional, non-malicious information security policy violations. This means that
while misbehaving insiders participate in the behavior, the violation of policy is unintentional and
occurs without knowledge of the violation. However, despite the unintentional violation by the
misbehaving ITA, an ISP violation has still taken place, and therefore, just like NDB and MDB, MB
requires a response. Even though misbehaving ITAs do not purposefully choose to violate policies,
the result may well put organizational information and information assets at risk or still cause
possible damage to IS security. For example, indirect consequences of MB include creating weaknesses
that could allow hackers to infect internal systems with viruses or spyware, or allow them to bypass
the firewall and access confidential data (Crossler et al., 2013).
MB typically includes human error, ignorance, uninformed violations, accidental data entry, forgetful
oversights, inadvertent data breaches, and unintentional actions (K. H. Guo, 2013; Siponen & Vance,
2010).
4.3 Non-malicious Deviant Behavior
We define NDB as intentional, non-malicious information security policy violations. NDB is behavior
engaged in by ITAs who knowingly violate ISPs, but without hateful intent. While non-malicious
deviant ITAs do not want to cause loss of operations or security breaches, the result may be exactly
what the insider never intended, causing possible damage anyway.
NDB includes such behavior as failing to perform backups or delaying backups, accessing websites
unrelated to work requirements using corporate computers, clicking phishing email links, opening
potentially unsecure attachments, choosing passwords that are not up to standard, not changing
passwords regularly, password sharing, failing to log off when leaving the computer, not shredding
sensitive information, and mistakenly uploading confidential data onto unsecure servers or the web
(Aytes & Connolly, 2004; Crossler et al., 2013; D'Arcy et al., 2009; Warkentin & Willison, 2009).
4.4 Malicious Deviant Behavior
We define MDB as intentional, malicious information security policy violations. MDB is behavior
engaged in by ITAs who also knowingly violate ISPs; whereas NDB does not contain hateful intent,
MDB does. Malicious deviant ITAs intend damage and/or security breaches. Their intent is
malevolent, with possible goals of putting the organization at risk, degrading or disrupting services
to customers, and causing corporate failure. For example, such deviant behavior may directly be
responsible for loss of profits, credibility or competitive advantage (Crossler et al., 2013).
MDB includes such behavior as espionage, sabotage, embezzlement, identity theft, intellectual
property theft, stealing sensitive information, malicious data breaches, data corruption, data theft,
data destruction, and fraud (Aytes & Connolly, 2004; Crossler et al., 2013; K. H. Guo et al.,
2011; Vance et al., 2015; Verizon Data Breach Investigations Report, 2012; Warkentin & Willison,
2009; Willison & Warkentin, 2013).
4.5 Triad of ITA Behaviors
The triad of ITA behaviors represents the interrelated yet independent nature of MB, NDB and
MDB. All three classes violate ISPs, whatever their intent. NDB is interrelated with both MB and
MDB: NDB and MB are non-malicious in their behavioral intent, while NDB and MDB both
intentionally violate policies. Each class of ITA security behavior also represents an independent
unit in the overall phenomenon of ITA security behaviors.
Figure 2 Triad of ITA Behaviors
5 CORRECTIVE ACTIONS
Despite the fact that organisations are implementing technical controls (Aytes & Connolly, 2004;
Ayuso, Gasca, & Lefevre, 2012; Choo, 2011; Hansen, Lowry, Meservy, & McDonald, 2007; Zafar &
Clark, 2009), ISPs (Pfleeger & Caputo, 2012; Siponen, Mahmood, & Pahnila, 2009; Siponen,
Mahmood, & Pahnila, 2014; Whitman & Mattord, 2012), compliance approaches (Siponen et al.,
2014; Whitman & Mattord, 2012), and countermeasures (D'Arcy et al., 2009; Herath & Rao, 2009;
Straub, 1990) such as awareness, education, training, software programmes, penalties and
pressures, employees seldom comply with policies (Siponen et al., 2014).
Some IS security studies also attempted to understand the security behavior of individuals via
concepts such as (Crossler et al., 2013): neutralization (Siponen & Vance, 2010), disgruntlement
(Willison & Warkentin, 2013), shame and moral beliefs (Siponen, Vance, & Willison, 2012), self-
control (Hu, Xu, Dinev, & Ling, 2011), accountability (Siponen et al., 2012; Vance et al., 2015), fear
(Johnston et al., 2015), organizational culture (Hu, Dinev, Hart, & Cooke, 2012), and rational choice
(Aytes & Connolly, 2004; Bulgurcu, Cavusoglu, & Benbasat, 2010).
A number of IS security studies have investigated the effectiveness of deterrence on deviant
behavior (D'Arcy et al., 2009; Herath & Rao, 2009; Hu et al., 2011; Straub, 1990; Straub & Welke,
1998), and found, for example, that countermeasures that include deterrent administrative
procedures and pre-emptive security software resulted in lowered computer abuse (Straub, 1990),
and that when users are aware of security countermeasures, this influences the perceived certainty
and severity of organizational IS misuse sanctions, which leads to reduced IS misuse intention
(D'Arcy et al., 2009).
The common denominator of the current approaches to mitigating ISP violations is that the studies
were conducted using survey samples that did not differentiate the security behavior of ITAs. It may
no longer be sufficient to have a "one-size-fits-all" approach to addressing security behavior within
organizations, as different security behaviors require different corrective actions.
To reduce the number of incidents caused by ITAs, organizations ought to address MB, NDB and
MDB by applying the correct type of corrective action or actions to each class of behavior. For
example, attempting to address MDB with compliance approaches (Siponen et al., 2014; Whitman &
Mattord, 2012) and countermeasures (D'Arcy et al., 2009; Herath & Rao, 2009; Straub, 1990) such
as awareness, education and training is inappropriate, because this type of behavior intends damage
and/or security breaches, and these threat agents know well enough that they are violating the ISP.
Awareness campaigns, educational programs, and training sessions, by contrast, are appropriate for
MB, because they address ignorance of the ISP and its content. Although the NDB agent knowingly
violates ISPs, the behavior is without hateful intent; awareness campaigns, educational programs,
and training sessions aimed at explaining the risks and consequences associated with ISP violations
are therefore appropriate for this class of security behavior as well.
Organizational culture positively influences employees' attitudes towards compliance with ISPs (Hu
et al., 2012), and increasing accountability can reduce access policy violation intentions (Vance et al.,
2015). Accountability (Siponen et al., 2012; Vance et al., 2015) and organizational culture (Hu et al.,
2012) are thus well suited to addressing MB and NDB. Although misbehaving ITAs do not violate
policy intentionally, a positive attitude towards compliance may lead them to seek information and
attend training programs, while the non-malicious deviant ITA may rethink their deviant behavior
in a culture that advocates secure practices and accountability.
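The pairing of behavior classes with corrective actions discussed above can be summarized as a simple lookup table. The mapping below is a hypothetical sketch distilled from this discussion, not a prescription from any of the cited studies.

```python
# Illustrative mapping from behavior class to candidate corrective actions.
# MB and NDB respond to education and culture; deterrence suits the
# intentional classes; compliance approaches alone do not fit MDB.
CORRECTIVE_ACTIONS = {
    "MB":  ["awareness campaigns", "education", "training",
            "organizational culture", "accountability"],
    "NDB": ["awareness campaigns", "education", "training",
            "organizational culture", "accountability", "deterrence"],
    "MDB": ["deterrence", "sanctions", "technical controls"],
}

def corrective_actions(behavior_class: str) -> list[str]:
    """Look up actions for one class instead of a one-size-fits-all set."""
    return CORRECTIVE_ACTIONS[behavior_class]

print(corrective_actions("MDB"))  # → ['deterrence', 'sanctions', 'technical controls']
```

The point of the table is structural: each class keys its own action list, making the "no one-size-fits-all" argument operational.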
Applying corrective actions aimed at a specific class of behavior could improve the effectiveness of
corrective actions in mitigating the risks associated with ISP violations.
6 DISCUSSION
IS security remains a high priority for information security executives (Loch et al., 1992). It is key for
an organization’s security efforts that ITAs behave and act responsibly with regards to the
information security policy (ISP). Organizations need to be able to clearly differentiate the security
behaviors of their ITAs, and then apply appropriate corrective actions towards the different
behaviors.
To this end, we classified ITA security behaviors as misbehavior, non-malicious deviant behavior, and
malicious deviant behavior. The implications of this differentiation are key to improving the
effectiveness of applied corrective actions. Organizations need to understand that it may no longer
be sufficient to have a "one-size-fits-all" approach to address ITA security behaviors, and to mitigate
the risks associated with ISP violations. Research by Crossler et al. (2013) and reports such as the
Verizon Data BREACH Investigations Report (2012) and Computer Security Institute (2011) support
this viewpoint.
While technical controls (Aytes & Connolly, 2004; Ayuso et al., 2012; Choo, 2011; Hansen et al.,
2007; Zafar & Clark, 2009), ISPs (Pfleeger & Caputo, 2012; Siponen et al., 2009; Siponen et al., 2014;
Whitman & Mattord, 2012), compliance approaches (Siponen et al., 2014; Whitman & Mattord,
2012), and countermeasures (D'Arcy et al., 2009; Herath & Rao, 2009; Straub, 1990) have been
suggested to reduce ISP violations, applying the right kind of corrective actions towards a specific
behavior is important (Crossler et al., 2013).
Organizations should accept that security incidents caused by ITAs cannot be eliminated, but the
number of incidents can be reduced by addressing the related ITA behaviors. Organizations should
consider not addressing ISP violation behaviors as a collective, but rather applying corrective actions
to MB, NDB and MDB individually. Organizations that understand their employees and their
security behaviors could address ISP violations more effectively.
MB, as defined by this study, covers the unintentional, non-malicious ISP-violating ITA. The
consequences of their ignorance or uninformed violations (K. H. Guo, 2013; Siponen & Vance,
2010) include indirectly creating weaknesses that could allow hackers to infect internal systems with
viruses or spyware, or to bypass the firewall and access confidential data (Crossler et al.,
2013). Countermeasures such as awareness, education, and training programs (D'Arcy et al., 2009;
Herath & Rao, 2009; Straub, 1990) are appropriate for MB, because these countermeasures address
ignorance of the ISP and its content. NDB, as defined by this study, covers the intentional,
non-malicious ISP-violating ITA. Although the NDB agent knowingly violates ISPs, the behavior is
without hateful intent; therefore awareness campaigns, educational programs, and training sessions
aimed at explaining the risks and consequences associated with ISP violations are also appropriate
for this class of security behavior.
Both NDB and MDB knowingly violate ISPs, and while deterrence was found to be effective in
reducing deviant behavior (D'Arcy et al., 2009; Straub, 1990), some IS security studies that
attempted to understand the security behavior of individuals deduced that employees would use
techniques of neutralization (Siponen & Vance, 2010) to moderate their violation. NDB does not
contain hateful intent, so shame and moral beliefs (Siponen et al., 2012), accountability
(Siponen et al., 2012; Vance et al., 2015), fear (Johnston et al., 2015), and organizational culture (Hu
et al., 2012) could address NDB. MDB agents, however, do harbor hateful intent and, because of
disgruntlement (Willison & Warkentin, 2013), could continue their malevolent behavior and possibly
be directly responsible for loss of profits, credibility or competitive advantage (Crossler et al., 2013).
7 CONCLUSION
We established the potential to apply countermeasures and corrective actions to ITA security
behaviors more effectively, and thereby to decrease the number of ISP violation incidents.
While behavioural InfoSec studies aim to improve our understanding of, mitigate the risks associated
with, and change ITA security behaviors, they should consider applying countermeasures and
corrective actions not to survey samples as a collective, but to categorized survey samples. Thus, firstly
categorize participants according to their behavioral class of MB, NDB or MDB. But, as mentioned
by Crossler et al. (2013), such a strategy might prove difficult. This would be the subject of future
research. Empirical research could support the conceptualization of the triad of ITA behaviors, and
test the applicability and effectiveness of countermeasures and corrective actions.
REFERENCES
Aytes, K., & Connolly, T. (2004). Computer Security and Risky Computing Practices: A Rational Choice Perspective. Journal of Organizational and End User Computing, 16(3), 22-40.
Ayuso, P. N., Gasca, R. M., & Lefevre, L. (2012). FT-FW: A cluster-based fault-tolerant architecture for stateful firewalls. Computers & Security, 31(4), 524-539.
Beautement, A., & Sasse, A. (2009). The economics of user effort in information security. Computer Fraud & Security, 2009(10), 8-12. doi: http://dx.doi.org/10.1016/S1361-3723(09)70127-7
Bulgurcu, B., Cavusoglu, H., & Benbasat, I. (2010). Information Security Policy Compliance: An empirical study of Rationality-based beliefs and information security awareness. MIS Quarterly, 34(3), 523-548.
Choo, K. K. R. (2011). The cyber threat landscape: Challenges and future research directions. Computers and Security, 30(8), 719-731. doi: 10.1016/j.cose.2011.08.004
Colwill, C. (2009). Human factors in information security: The insider threat – Who can you trust these days? Information Security Technical Report, 14(4), 186-196. doi: http://dx.doi.org/10.1016/j.istr.2010.04.004
Computer Security Institute. (2010). 2010/2011 Computer Crime and Security Survey. from http://gatton.uky.edu/FACULTY/PAYNE/ACC324/CSISurvey2010.pdf
Computer Security Institute. (2011). 2010/2011 Computer Crime and Security Survey. from http://gatton.uky.edu/FACULTY/PAYNE/ACC324/CSISurvey2010.pdf
Crossler, R. E., Johnston, A. C., Lowry, P. B., Hu, Q., Warkentin, M., & Baskerville, R. (2013). Future directions for behavioral information security research. Computers & Security, 32(1), 90-101.
D'Arcy, J., Hovav, A., & Galletta, D. (2009). User Awareness of Security Countermeasures and Its Impact on Information Systems Misuse: A Deterrence Approach. Information Systems Research, 20(1), 79-98.
Ernst & Young's Global Information Security Survey. (2014). EY’s Global Information Security Survey 2014. from http://www.ey.com/Publication/vwLUAssets/EY-global-information-security-survey-2014/$FILE/EY-global-information-security-survey-2014.pdf
Furnell, S., & Clarke, N. (2012). Power to the people? The evolving recognition of human aspects of security. Computers & Security, 31(8), 983-988.
Furnell, S., Papadaki, M., & Thomson, K.-L. (2009). Scare tactics – A viable weapon in the security war? Computer Fraud & Security, 2009(12), 6-10. doi: http://dx.doi.org/10.1016/S1361-3723(09)70151-4
Garg, A., Curtis, J., & Halper, H. (2003). Quantifying the financial impact of IT security breaches. Information Management & Computer Security, 11(2), 74-83.
Guo, K. H. (2013). Security-related behavior in using information systems in the workplace: A review and synthesis. Computers & Security, 32, 242-251. doi: http://dx.doi.org/10.1016/j.cose.2012.10.003
Guo, K. H., Yuan, Y., Archer, N. P., & Connelly, C. E. (2011). Understanding Nonmalicious Security Violations in the Workplace: A Composite Behavior Model. Journal of Management Information Systems, 28(2), 203-236.
Hansen, J. V., Lowry, P. B., Meservy, R. D., & McDonald, D. M. (2007). Genetic programming for prevention of cyberterrorism through dynamic and evolving intrusion detection. Decision Support Systems, 43(4), 1362-1374. doi: 10.1016/j.dss.2006.04.004
Herath, T., & Rao, H. R. (2009). Encouraging information security behaviors in organizations: Role of penalties, pressures and perceived effectiveness. Decision Support Systems, 47, 154-165. doi: 10.1016/j.dss.2009.02.005
Hu, Q., Dinev, T., Hart, P., & Cooke, D. (2012). Managing Employee Compliance with Information Security Policies: The Critical Role of Top Management and Organizational Culture. Decision Sciences, 43(4), 615-659.
Hu, Q., Xu, Z., Dinev, T., & Ling, H. (2011). Does Deterrence Work in Reducing Information Security Policy Abuse by Employees? Communications of the ACM, 54(6), 54-60.
Insider Threat Spotlight Report. (2015).
Johnston, A. C., Warkentin, M., & Siponen, M. (2015). An Enhanced Fear Appeal Rhetorical Framework: Leveraging Threats to the Human Asset Through Sanctioned Rhetoric. MIS Quarterly, 39(1), 113-134.
Loch, K. D., Carr, H. H., & Warkentin, M. E. (1992). Threats to Information Systems: Today's Reality, Yesterday's Understanding. MIS Quarterly, June, 173-186.
Pfleeger, S. L., & Caputo, D. D. (2012). Leveraging behavioral science to mitigate cyber security risk. Computers & Security, 31(4), 597-611.
Safa, N. S., Sookhak, M., Von Solms, R., Furnell, S., Ghani, N. A., & Herawan, T. (2015). Information security conscious care behaviour formation in organizations. Computers & Security, 53, 65-78. doi: http://dx.doi.org/10.1016/j.cose.2015.05.012
Safa, N. S., Von Solms, R., & Furnell, S. (2016). Information security policy compliance model in organizations. Computers & Security, 56, 70-82. doi: http://dx.doi.org/10.1016/j.cose.2015.10.006
Siponen, M., Mahmood, M. A., & Pahnila, S. (2009). Are Employees Putting Your Company At Risk By Not Following Information Security Policies? Communications of the ACM, 52(12), 145-147. doi: 10.1145/1610252.1610289
Siponen, M., Mahmood, M. A., & Pahnila, S. (2014). Employees’ adherence to information security policies: An exploratory field study. Information & Management, 51, 217-224.
Siponen, M., & Vance, A. (2010). Neutralization: New insights into the problem of employee information systems security policy violations. MIS Quarterly, 34(3), 487-502.
Siponen, M., Vance, A., & Willison, R. (2012). New insights into the problem of software piracy: The effects of neutralization, shame, and moral beliefs. Information & Management, 49(7/8), 334-341. doi: 10.1016/j.im.2012.06.004
Stanton, J. M., Stam, K. R., Mastrangelo, P., & Jolton, J. (2005). Analysis of end user security behaviors. Journal of Computers and Security, 24, 124-133.
Straub, D. W. (1990). Effective IS Security: An Empirical Study. Information Systems Research, 1(3), 255-276. doi: http://dx.doi.org/10.1287/isre.1.3.255
Straub, D. W., & Welke, R. J. (1998). Coping with Systems Risk: Security Planning Models for Management Decision-Making. MIS Quarterly, 22(4), 441-469.
The Global State of Information Security® Survey. (2015). The Global State of Information Security® Survey. from http://www.pwc.com/gx/en/consulting-services/information-security-survey/download.jhtml and http://www.pwc.com/gx/en/consulting-services/information-security-survey/key-findings.jhtml
US State of Cybercrime Survey. (2014). 2014 US State of Cybercrime Survey. from http://resources.sei.cmu.edu/library/asset-view.cfm?assetID=298318
Vance, A., Lowry, P. B., & Egget, D. (2015). Increasing Accountability through User-Interface Design Artifacts: A New Approach to Addressing the Problem of Access-Policy Violations. MIS Quarterly, 39(2), 345.
Verizon Data Breach Investigations Report. (2012). from http://www.verizonenterprise.com/resources/reports/rp_data-breach-investigations-report-2012-ebk_en_xg.pdf
Vormetric Data Security Report. (2015). 2015 Vormetric Insider Threat Report: Trends and Future Directions in Data Security GLOBAL EDITION.
Vroom, C., & von Solms, B. (2004). Towards information security behavioural compliance. Computers & Security, 23(3), 191-198.
Warkentin, M., & Willison, R. (2009). Behavioral and policy issues in information systems security: the insider threat. European Journal of Information Systems, 18, 101-105. doi: 10.1057/ejis.2009.12
Whitman, M. E., & Mattord, H. J. (2012). Principles of Information Security, International Edition (Fourth ed.): Course Technology CENGAGE Learning.
Willison, R., & Warkentin, M. (2013). Beyond deterrence: An expanded view of employee computer abuse. MIS Quarterly, 37(1), 1-20.
Workman, M., Bommer, W. H., & Straub, D. W. (2008). Security lapses and the omission of information security measures: A threat control model and empirical test. Computers in Human Behavior, 24(6), 2799-2816.
Zafar, H., & Clark, J. G. (2009). Current State of Information Security Research In IS. Communications of the Association for Information Systems, 24, 571-596.
IC1009
Challenges in Password Usability - Users' Perspective
Tiroyamodimo Mogotlhwane, Kagiso Ndlovu
Department of Computer Science, University of Botswana, Gaborone, Botswana
mogotlhw@mopipi.ub.bw; kagiso.ndlovu@mopipi.ub.bw
ABSTRACT
One of the most popular authentication methods in computing is the use of passwords. A password is a user credential used to verify that the person entering a computer system is the one authorized to do so. As the number of applications that an individual needs access to increases, so does the number of passwords they are expected to have. Increasingly, users must hold passwords for many applications such as ATMs, online banking, mobile banking, and different web applications. This brings challenges, because it becomes very difficult, if not impossible, to memorize so many passwords, coupled with the fact that, ideally, a password must not be written down anywhere. Some of the methods people use to remember their many passwords are writing them down, reusing the same password across many applications, and storing them in mobile devices. Web-based applications have now been developed to help with password management, more so because, ideally, each application requires its own password. The challenges users face in maintaining passwords, and possible solutions according to the literature, are discussed. At present, the password principle relies mainly on an individual remembering it. Recent research has shown that a human being's ability to remember is influenced by how frequently they use what is remembered; hence frequently used passwords are the ones likely to be remembered easily. A theoretical model is proposed to help address problems associated with password maintenance. The model calls for an inner application that can be embedded before supplied-password authentication takes place. The increase in online application systems requires high security to maintain the quality of data, and it is bringing new challenges to the use of passwords as a means of protecting personal information.
Key words: date-time-stamp; authentication; password; challenges; management
1 INTRODUCTION
The use of computer-based information systems has significantly increased. Many organizations
have several information systems driving their businesses. Such systems are at times linked with
external stakeholders such as customers, clients, and suppliers, and they store organizational data.
There is a need to protect this data by controlling access to such systems. In the past, data was
stored in paper folders or steel cabinets kept in secured physical locations, often called strong
rooms, where only certain individuals had access; these strong rooms were locked and not everyone
had access to their keys. One of the commonly used methods
of controlling access to digital data devices and systems is the use of passwords. There is an increase
in applications that require a user to remember multiple passwords (Sharma, Sharma, & Dave,
2015). However, with each application requiring its own password, it is a challenge for a user to
remember them all.
2 AUTHENTICATION AND SECURITY
2.1 Authentication
Authentication in computing is a process by which a user of a system provides some information
pertaining to them, which is then matched against information stored by an operating system or
server (Rause, 2015). A user is assigned a login name and password as the credentials to use in a
human-computer interaction environment. The login name is displayed as the user enters the
authentication details, while the password is masked from display. A user is expected to memorize
the password so that no one else will know it; even if another user knows the login name, the
password will be known only by its rightful owner.
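The credential-matching process described above can be sketched as follows. This is a minimal illustration, assuming an in-memory credential store; a production system would use a dedicated password-hashing library and persistent storage, and the function names here are the sketch's own.

```python
import hashlib
import hmac
import os

# In-memory stand-in for the server's credential store: login -> (salt, hash).
_store: dict[str, tuple[bytes, bytes]] = {}

def register(login: str, password: str) -> None:
    """Store a salted hash of the password, never the password itself."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    _store[login] = (salt, digest)

def authenticate(login: str, password: str) -> bool:
    """Recompute the salted hash and compare in constant time."""
    if login not in _store:
        return False
    salt, digest = _store[login]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)

register("alice", "correct horse battery staple")
print(authenticate("alice", "correct horse battery staple"))  # True
print(authenticate("alice", "123456"))                        # False
```

Storing only a salted, slow hash means that even the operating system or server never needs to keep the password itself, which is exactly why the credential match can be performed without revealing it.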
Computer-based information systems require that the information and data stored in them be
secure and protected from those who are not supposed to access it. Security is required to deny
unauthorized persons or agents access to a resource and to ensure that only authorized people can
access it. There are other methods of authenticating users to computer systems, but the password is
the most common one as it is not expensive to implement (Wenstrom, 2002; Taneski, Brumen, &
Hericko, 2014). Although popular, authentication using login names and passwords is not an
effective security measure. In most cases, the reliability of the password mechanism depends on
the individual using it, although human errors are common in the use of information systems.
In 2012, IEEE unintentionally left about 100,000 passwords publicly exposed. It was discovered that
even IEEE members used passwords that are easy to guess, the most common being
"123456", "ieee2012", and "12345678" (Mills, 2012; Symantec, 2013).
2.2 Password Management
Due to the increase in security threats to information stored in computerized information systems,
organizations have developed policies relating to the use of passwords. The aim of a password policy
is to make it difficult for those who might want to guess a password to gain access.
The biggest problem with authentication using a login name and password is that anyone who knows
those credentials can log in to the system: the computer system knows the credentials, not
the person to whom the credentials were given. Cyber criminals are always on the lookout to steal
these login details along with other personal information such as credit card numbers. Identity theft
is a global problem that is on the increase. Cyber criminals write applications which they can use to
perform identity theft; it is estimated that there are about 130 million such application systems in
use today, up from about 1 million in 2007 (Anderson, 2013). The demand
for such applications is high because stolen personal information can be used to imitate the real
person and perform illegal acts (Saunders & Zucker, 1999). Personal information such as credit card
details, email, and addresses is targeted by cyber criminals for the purpose of committing crime
while pretending to be the original person. Selling credit card numbers is big business, as shown by
the recent exposure of a website that was selling such details stolen from people across the world
(Dean, 2016). Like other cyber-related crimes, online identity theft is very difficult to investigate
as it requires interstate cooperation.
Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016
Copyright © Department of Computer Science, University of Botswana, 2016 84
Password policies are designed to make it difficult for cyber criminals to access systems and get hold of passwords. Many organizations have developed password policies to protect customer information and other business data kept within their systems. The things that a password policy tries to enforce, as exemplified by the Microsoft Developer Network (MSDN, 2015), include:
Password complexity
Password expiration
Password complexity may specify what can or cannot be a password, the minimum and maximum number of characters, and the special characters that each password must contain. Complexity is enforced by making the password as long and as complex as possible.
Password expiration defines the lifespan of a password, the idle time after which a logged-on but inactive user is locked out, and the lockout duration.
If a password is not carefully selected, it can be the weakest point through which criminals access the system. A password that is not easy to obtain is said to be strong. A password policy must enforce the following to make passwords strong (MSDN, 2015):
Have more than 8 characters
Include letters, numbers and symbols
Not be a name
Be changed regularly
Not be written down anywhere
Be memorized
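Of these rules, the first three are machine-checkable. A minimal validator is sketched below (the function name and inputs are illustrative; memorization and not writing the password down are behavioural requirements that cannot be enforced in code):

```python
def is_strong(password, forbidden_names=()):
    """Check the machine-verifiable rules of the policy above."""
    has_number = any(c.isdigit() for c in password)
    has_letter = any(c.isalpha() for c in password)
    has_symbol = any(not c.isalnum() for c in password)
    not_a_name = password.lower() not in {n.lower() for n in forbidden_names}
    return len(password) > 8 and has_number and has_letter and has_symbol and not_a_name

print(is_strong("Tr0ub4dor&3"))  # True: long enough, mixed character classes
print(is_strong("password1"))    # False: no symbol
```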
2.3 Password Challenges
The number of different applications that a single user needs to use has increased significantly. Each system requires its own password and may even have a different password policy: one system may require 8-character passwords while another may require 6. For example, an average user needs a password for their bank account, credit card, official email account, personal email account, and so on. Each of these should have its own password, and it is very difficult for a user to remember all of them. A common solution is to use a similar password for all applications. It has been shown that in some instances 75% of social network usernames and passwords were similar to those of users' email accounts (MSDN, 2015). Other research has shown that over 50% of Americans use the same password for various online applications (SecurityWeek, 2010). The main problem with using the same password for different systems is that once a hacker obtains that password, they can access every system where it is used. It is like having the same key for your house, car and office.
Another common approach to the challenge of multiple passwords across different devices is to write each system or device password down. Under this approach a strong password is established for each application, and the security of the approach depends on the security of the place where the password is written: diaries are often left on desks, and mobile phones get lost or are stolen. Writing a password down on paper is the second most common method that people use to manage their passwords, the most popular being memorization (Wlasuk, 2012).
A password is not supposed to be shared. In the banking industry, any entry into your banking details made with correct credentials is a valid transaction; a customer who shares a password with a friend cannot hold the bank accountable if the friend performs unauthorized transactions on the account. One of the conditions that Standard Bank gives its customers for protecting their passwords is not to disclose them to anyone (Cobb, 2012). Password sharing still remains a big problem for many organizations today (Imprivata, 2014).
It is not only information systems that require authentication; mobile devices and cloud-based systems require it too. Increasingly, employees bring their own devices to the workplace and use them to access corporate data. This increases the complexity of managing and controlling access to corporate data and systems.
3 PASSWORD AUTHENTICATION PROCESS
There are numerous password authentication protocols (PAP) currently in use, ranging from simple to more complex. In the simplest protocol, a user is assigned a login name and an initial unique (individual) password. The login name is publicly available but the password is not. The user is usually expected to change the initial password to something that only they know and can remember. The login name and password are then stored for reference during future login sessions. Authentication is performed when a user wants to use the system or device: the user is asked to provide their credentials, which are compared with what is stored. If they match, access is allowed; if not, access is denied.
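A minimal sketch of this compare-and-decide step follows (names are illustrative; a real system would also salt the stored hash rather than compare bare digests):

```python
import hashlib

# Stored credentials: the login name is public; only a hash of the password is kept.
users = {"alice": hashlib.sha256(b"s3cret-pass").hexdigest()}

def authenticate(login, password):
    """Compare the supplied credentials with the stored record."""
    stored = users.get(login)
    supplied = hashlib.sha256(password.encode()).hexdigest()
    return stored is not None and stored == supplied

print(authenticate("alice", "s3cret-pass"))  # True: access allowed
print(authenticate("alice", "wrong"))        # False: access denied
```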
Figure 1. Password Authentication Protocol (The TCP/IP Guide, 2005)
4 PROPOSED SOLUTION
The password still remains the cheapest method of authentication. There is a need to keep exploring innovative ways to make this method more secure against the current rise in cyber-crime, such as password phishing. This paper proposes an improvement to PAP that incorporates a date-time stamp in the validation process.
The date time stamp (DTS) is the numerical combination of the current year, month, day and time. If the numerical values of these are extracted and combined to produce a number, the number will be unique, especially when time is measured to the smallest units. For example:
Year: 2015
Month: 04
Day: 17
Time (hour:minute:second): 12:53:55
The above can be converted to the date-time stamp 20150417125355. Because time only moves forward, this numeric value can never be repeated in the future. The strength of the value can be improved by measuring the seconds component more precisely.
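The DTS above can be produced by concatenating the calendar fields, as sketched below (a live system would pass the current time; the function name is illustrative):

```python
from datetime import datetime

def date_time_stamp(moment):
    """Combine year, month, day, hour, minute and second into one numeric string."""
    return moment.strftime("%Y%m%d%H%M%S")

# The worked example above: 17 April 2015, 12:53:55
print(date_time_stamp(datetime(2015, 4, 17, 12, 53, 55)))  # 20150417125355
```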
The proposed system performs authentication by combining a user's password with the date-time stamp value. Upon login, the DTS is captured and combined with the user's password. An operation, preferably a mathematical one, performs the combination to produce a unique value, which is sent to the responder together with the DTS for validation.
At the responder, the same operation is performed using the stored password and the received DTS. The value obtained at the responder is then compared with the value received; if they are the same, the user is validated, otherwise access is denied.
Since an operation is performed on the password, combining it with the unique DTS value, the password does not have to be very strong; it can be something easy to remember. The strength of this method depends on the operation performed on the password and the DTS, and the method relies on that mathematical operation being very difficult to guess.
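The combining operation is deliberately left open. As one hypothetical choice, the sketch below keys an HMAC with the password and applies it to the DTS, so the responder can repeat the computation and compare results:

```python
import hashlib
import hmac

def combine(password, dts):
    # Hypothetical combining operation: an HMAC over the DTS, keyed with the
    # password. The scheme only requires the operation to be hard to invert.
    return hmac.new(password.encode(), dts.encode(), hashlib.sha256).hexdigest()

# Initiator: capture the DTS, combine it with the password, send value and DTS.
dts = "20150417125355"
sent_value = combine("easy2remember", dts)

# Responder: repeat the operation with the stored password and the received DTS.
expected = combine("easy2remember", dts)
print(hmac.compare_digest(sent_value, expected))     # True: user validated
print(combine("wrong-password", dts) == sent_value)  # False: access denied
```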
Figure 2. Additional functionality to PAP
5 CONCLUSIONS
The increase in cyber-crime is a global problem, made even more difficult by the proliferation of web-based applications, and the internet does not recognize territorial boundaries. Increasingly, human beings are challenged by the need to protect their data and identity online. Though there are advanced methods of providing online security, authentication using passwords remains the most popular. The use of passwords for authentication has numerous challenges, so there is a need to strengthen password methodologies to provide a secure platform for using modern devices and other information systems. Mobile devices and remote access to information systems need to be provided with secure environments.
The proposed solution is still a theoretical concept that needs to be developed. For example, the nature of the operation to be performed must be defined, as must the tools required to capture the necessary data and perform the operation. Global time differences also have to be considered, to decide which time zone the authentication will use, as the concept is explored further. The method assumes that the complexity or strength of the password is enhanced by making the operation complex. The benefit of this approach is that it will reduce the need for people to remember multiple passwords or change their passwords regularly. More work is still needed to validate this theoretical concept.
REFERENCES
Anderson, C. J. (2013, April 14). Identity theft growing, costly to victims. Retrieved April 10, 2015,
from USA Today: http://www.usatoday.com/story/money/personalfinance/2013/04/14/identity-
theft-growing/2082179/
Cobb, S. (2012, December 4). Password handling: challenges, costs, and current behavior (now with
infographic). Retrieved April 15, 2015, from welivesecurity:
http://www.welivesecurity.com/2012/12/04/password-handling-challenges-costs-current-
behavior-infographic/
Dean, J. (2016, February 19). Website selling stolen credit cards is shut down. Retrieved April 25,
2016, from The Times (London):
http://www.lexisnexis.com/lnacui2api/delivery/rwBibiographicDelegate.do
Imprivata. (2014). Eliminate password sharing with single sign on technology. Retrieved April 14,
2015, from imprivata: http://www.imprivata.com/password_sharing
Mills, E. (2012, September 25). Researcher says 100,000 passwords exposed on IEEE site. Retrieved
March 17, 2016, from CNET: http://www.cnet.com/news/researcher-says-100000-
passwords-exposed-on-ieee-site/
MSDN. (2015). Password policy. Retrieved April 12, 2015, from Microsoft Developer Network:
https://msdn.microsoft.com/en-us/library/ms161959.aspx
Profis, S. (2014, November 25). The guide to password security (and why you should care). Retrieved
April 17, 2015, from CNET: http://www.cnet.com/how-to/the-guide-to-password-security-
and-why-you-should-care/
Rause, M. (2015). Authentication. Retrieved April 5, 2015, from TechTarget:
http://searchsecurity.techtarget.com/definition/authentication
Saunders, K. M., & Zucker, B. (1999). Counteracting Identity Fraud in the Information Age: The
Identity Theft and Assumption Deterrence Act. International Review of Law, Computers &
Technology, 13(2), 183-192.
SecurityWeek. (2010, August 16). Study Reveals 75 Percent of Individuals Use Same Password for
Social Networking and Email. Retrieved April 13, 2015, from Security Week:
http://www.securityweek.com/study-reveals-75-percent-individuals-use-same-password-
social-networking-and-email
Sharma, A., Sharma, S., & Dave, M. (2015). Identity and access management- a comprehensive study.
(pp. 1481 - 1485). Noida: IEEE Explore.
Standard Chartered Bank. (2014). Take control of your online security . Retrieved April 14, 2015, from
Standard Chartered: https://www.sc.com/en/online-banking/security/how-to-protect-
yourself.html
Symantec. (2013). Reaping the Benefits of Strong, Smarter User Authentication. Retrieved March 18,
2016, from Symantec:
http://www.symantec.com/content/en/us/enterprise/white_papers/b-reaping-the-benefits-
en-us.pdf
Taneski, V., Brumen, B., & Hericko, M. (2014). The Effect of Educating Users on Passwords: a
Preliminary Study. ACM Transactions on Applied Perception, 2, 1-8.
The TCP/IP Guide. (2005, September 20). PPP Authentication Protocols: Password Authentication
Protocol (PAP) and Challenge Handshake Authentication Protocol (CHAP) . Retrieved April 15,
2015, from The TCP/IP Guide:
http://www.tcpipguide.com/free/t_PPPAuthenticationProtocolsPasswordAuthenticationPr-
2.htm#Figure_29
UCSC. (2015, March). UCSC Password Strength and Security Standards. Retrieved April 16, 2015, from
University of California: http://its.ucsc.edu/policies/password.html
USA TODAY. (2013, April 14). Identity theft growing, costly to victims. Retrieved April 25, 2016, from
USA TODAY.
Wenstrom, M. (2002, February 22). Examining Cisco AAA Security Technology. Retrieved April 12,
2015, from CISCO: http://www.ciscopress.com/articles/article.asp?p=25471&seqNum=3
Wlasuk, A. (2012, February 10). Password Purgatory - Are we Ever Going to Get Passwords Right?
Retrieved April 14, 2015, from Security Week: http://www.securityweek.com/password-
purgatory-are-we-ever-going-get-passwords-right
IC1010
Enhancing the Least Significant Bit (LSB) Algorithm for Steganography
Oluwaseyi Osunade, Ganiyu Idris Adeniyi
Department of Computer Science University of Ibadan
Ibadan, Nigeria
seyiosunade@gmail.com
ABSTRACT
Various steganography algorithms have been proposed and implemented for hiding the existence of data in a cover object, from algorithms that work in the transform domain to those that work in the spatial domain, such as Least Significant Bit (LSB), which uses the three colours (RED, GREEN and BLUE) present in an image. Since three colours are present in each pixel of an image, this project proposes a new algorithm that chooses only two of the three colours (GREEN and BLUE) that make up a pixel to hide data. The proposed algorithm successfully hides the data in the two colours (GREEN and BLUE) of an image with no significant changes in the resulting colours. The experimental results show the effectiveness of the proposed algorithm and that it strikes a balance between security and image quality. It should be noted that this work considers only images as the cover object; other forms of cover object are not considered here. It should also be noted that the algorithm hides data of 8 to 1024 bytes using two images of different sizes, which shows no effect on the effectiveness of the algorithm.
Keywords: Steganography, least significant bit, colour, data, algorithm
1 INTRODUCTION
Due to continuously changing global technology trends, data is constantly moving from one host system to another over the network or the internet, and thus the security of this data is highly important.
It is generally accepted that the security of the data can be achieved by using encryption and steganography. In cryptography, the data is transformed into another form before transmission in order to hide its content from unauthorized users. Steganography, on the other hand, deals with hiding the existence of the data in a cover object such as text, image, audio/video or protocol, rather than transforming the data itself, thereby making people unaware that communication is taking place.
The application of steganography will continue to play a vital role in protecting data across hosts because of its unsuspicious methodology. Various steganography algorithms have been proposed and implemented, but most do not hide the data effectively: usually a slight distortion in the image used to hide the data gives it away. There is therefore a need for an algorithm that introduces only the slightest distortion, which is what led to the newly proposed algorithm.
1.1 An Overview of Ancient Steganography
Steganography can be traced back to ancient times. Early attempts at steganography made use of
chemicals and even human bodies to convey information. In practice, modern steganography has
gone beyond the use of physical bodies and chemicals but in principle, it is still the same as the
ancient steganography. Some of the records are outlined below:
Herodotus (484 BC – 425 BC) is one of the earliest Greek historians. His great work, The
Histories, is the story of the war between the huge Persian Empire and the much smaller
Greek city-states. Herodotus recounts the story of Histiaeus, who wanted to encourage
Aristagoras of Miletus to revolt against the Persian King. In order to convey his plan securely,
Histiaeus shaved the head of his messenger, wrote the message on his scalp, and then waited
for the hair to regrow. The messenger, apparently carrying nothing contentious, could travel
freely. Arriving at his destination, he shaved his head and pointed it at the recipient.
Pliny the Elder (23 AD – 79 AD) explained how the milk of the thithymallus plant dried to
transparency when applied to paper but darkened to brown when subsequently heated,
thus recording one of the earliest recipes for invisible ink. The ancient Chinese wrote notes
on small pieces of silk that they then wadded into little balls and coated in wax, to be
swallowed by a messenger and retrieved at the messenger's gastrointestinal convenience.
Giovanni Batista Porta (1535 - 1615) described how to conceal a message within a
hardboiled egg by writing on the shell with an ounce of alum and a pint of vinegar. The
solution penetrates the porous shell, leaving no visible trace, but the message is stained on
the surface of the hardened egg albumen, so it can be read when the shell is removed.
2 RELATED WORKS
Steganography is the art and science of hiding messages in such a way that no one apart from the intended recipient knows of the existence of the message (Divya & Ram, 2012). The term 'hiding' refers to the process of making the information imperceptible or keeping its existence secret.
Steganography is derived from two Greek words: 'steganos', which literally means 'covered', and 'graphy', meaning 'writing', i.e. covered writing. Steganography refers to the science of 'invisible' communication, hiding secret information in various file formats. There exists a large variety of steganographic techniques; some are more complex than others, but all of them have their respective strong and weak points (Lokeswara et al., 2011). Different applications have different requirements of the steganography technique to be used.
Hiding data is the process of embedding information into digital content without causing perceptual
degradation. In data hiding, three famous techniques can be used: watermarking, steganography and cryptography. Steganography is defined as covered writing in Greek. It involves any process that hides data or information within other data (Rosziati & Teoh, 2011).
The main advantage of steganography over the other well-known techniques is its simple security mechanism: the steganographic message is integrated invisibly and covered inside other harmless sources.
Steganography can be considered a branch of cryptography that tries to hide messages within others, avoiding the perception that there is some kind of message. Steganographic techniques can use cover files of any kind, although image, sound and video files are the most used today. Similarly, the information to hide can be text, images, video, sound, etc. There are two approaches to implementing steganography algorithms: methods that work in the spatial domain (altering the desired characteristics of the file itself) and methods that work in the transform domain (performing a series of changes to the cover image before hiding information) (Juan & Jesus, 2009).
Research has shown that methods working in the spatial domain are simpler and faster to implement than those working in the transform domain, which are more robust in terms of resistance to attacks. In the spatial domain, the message or data to be transferred is embedded directly into the image used as cover object, whereas in the transform domain, as the name implies, the image is first transformed before the data or message is embedded into it.
Image steganography can be implemented in the transform domain or the spatial domain, using any of these three methods:
Non-Filtering: This method embeds the data into the cover object starting from the first pixel of the image used as cover object.
Randomized: In this method both the sender and receiver of the image use a password, denominated the stego-key, that is employed as the seed for a pseudo-random number generator, which then creates the sequence used as an index to access the image pixels.
Filtering: In this method, the algorithm filters the cover image using a default filter and hides information in the areas that get a better rate (Roque, Juan & Jesus, 2009).
2.1 Steganography Algorithms
Most of the algorithms that work in the spatial domain use the Least Significant Bit (LSB) method or one of its derivatives for information hiding, i.e., hiding one bit of information in the least significant bit of each colour of a pixel. However, this method cannot withstand some types of statistical analysis (such as RS or Sample Pairs). The problem stems from the fact that modifying all three colours of a pixel produces a major distortion in the resulting colour. This distortion is not visible to the human eye, but it is detectable by statistical analysis (Juan & Jesus, 2009).
As noted above, spatial-domain methods are simpler and faster to implement than transform-domain methods, which are more robust in
terms of resistance to attacks. Therefore, this project focuses on the Least Significant Bit (LSB) method and its derivative, the Selected Least Significant Bit (SLSB) algorithm.
2.1.1 Least Significant Bit Algorithm
In the Least Significant Bit algorithm, both the data and the image used as cover object are converted from their pixel format to binary, and the least significant bits of the image are substituted with the bits of the data to be transferred, so as to encode the message that needs to be hidden. The data bits replace the least significant bit of each colour of the image (Lokeswara et al., 2011).
For instance, suppose the data 'AID', with the following properties, is to be stored in the first 8 pixels of a 200 by 400 pixel image with 24 bits per pixel.
Table 1: Three letters with their ASCII values and corresponding binary values
LETTER ASCII VALUE BINARY VALUE
A 065 01000001
I 073 01001001
D 068 01000100
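Each binary value of this kind is simply the 8-bit form of a letter's ASCII code, for example:

```python
# 8-bit binary representation of a character's ASCII code
def to_binary(ch):
    return format(ord(ch), "08b")

print(to_binary("A"))  # 01000001
```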
To hide ‘AID’ with the Binary Code (01000001 01100100 01101001) using Least Significant Bit
Algorithm, each bit with the least significant bit of each colour that made up the Pixel is flipped.
On average, only half of the replaced least significant bits actually change value. Since there are 256 possible intensities of each primary colour, changing the LSB of a pixel results in only small changes in the intensity of the colours. These changes cannot be perceived by the human eye, so the data is successfully hidden.
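A minimal sketch of this LSB substitution over a list of (R, G, B) tuples (function names are illustrative):

```python
def lsb_embed(pixels, bits):
    """Replace the least significant bit of each colour component,
    in R, G, B order across successive pixels, with one data bit."""
    out = [list(p) for p in pixels]
    i = 0
    for p in out:
        for ch in range(3):
            if i < len(bits):
                p[ch] = (p[ch] & ~1) | bits[i]
                i += 1
    return [tuple(p) for p in out]

def lsb_extract(pixels, n_bits):
    """Read the least significant bits back in the same order."""
    return [p[ch] & 1 for p in pixels for ch in range(3)][:n_bits]

bits_a = [int(b) for b in "01000001"]  # the letter 'A'
stego = lsb_embed([(200, 120, 64), (10, 255, 0), (33, 44, 55)], bits_a)
print(lsb_extract(stego, 8) == bits_a)  # True: the hidden bits are recoverable
```

Each component changes by at most 1 in intensity, which is why the substitution is invisible to the eye.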
Figure 1: Least Significant Bit method adapted from (Roque, Juan & Jesus, 2009)
2.1.2 Selected Least Significant Bit Algorithm
In the Selected Least Significant Bit algorithm, both the data and the image used as cover object are converted from pixel format to binary. The least significant bit of one colour (BLUE) of each pixel is substituted with a bit of the data to be transferred, encoding the message to be hidden; only the least significant bit of one colour per pixel is changed (Juan & Jesus, 2009).
Only one third (1/3) of the pixels' least significant bits are used, so hiding data using the Selected Least Significant Bit method takes more pixels than the Least Significant Bit method, since only the blue component's least significant bit is replaced. As a result, the human eye cannot perceive the changes, and the data remains successfully hidden and inconspicuous.
3 PROPOSED ALGORITHM FOR NEW SELECTED LEAST SIGNIFICANT BIT
In this technique, a new steganography algorithm based on selecting the least significant bits of two colours (Green and Blue) in each pixel is proposed. Images in a computer system are represented as arrays of values. These values represent the intensities of the three colours R (Red), G (Green) and B (Blue), and each pixel is a combination of these three components. In this scheme, bits of the last two components (Green and Blue) of the image's pixels are replaced with data bits. The blue colour is selected because research by Hecht (2006) reveals that the visual perception of intensely blue objects is less distinct than the perception of red and green objects. Green is chosen in combination with Blue because it gives more room for the length of the data to be embedded.
3.1 Proposed Procedure for Embedding Phase
To embed data into an image, the following procedure is performed.
Step 1: Extract all the pixels in the image and store them in an array called Pixel-array.
Step 2: Extract all the characters in the given text file and store them in an array called Character-array.
Step 3: Extract all the characters from the stego-key and store them in an array called Key-array.
Step 4: Choose the first pixel, pick characters from Key-array and place them in the first and second components of the pixel. If there are more characters in Key-array, place the rest in the first components of the next pixels.
Step 5: Place a terminating symbol to indicate the end of the key.
Step 6: Place the characters of Character-array in the first and second components (Blue and Green channels) of the next pixels, replacing those components.
Step 7: Repeat Step 6 until all the characters have been embedded.
Step 8: Again place a terminating symbol to indicate the end of the data.
Step 9: The obtained image now hides all the input characters.
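Read literally, the steps above place whole character codes into the Blue and Green components of successive pixels. A sketch under that reading (the channel order and the use of a 0 byte as terminating symbol are assumptions):

```python
END = 0  # assumed terminating symbol

def embed(pixels, stego_key, message):
    """Steps 1-9: write the key, a terminator, the message and a terminator
    into the Blue and Green components of successive (R, G, B) pixels."""
    payload = ([ord(c) for c in stego_key] + [END]
               + [ord(c) for c in message] + [END])
    out = [list(p) for p in pixels]
    i = 0
    for p in out:
        for ch in (2, 1):  # Blue first, then Green (an assumption)
            if i < len(payload):
                p[ch] = payload[i]
                i += 1
    return [tuple(p) for p in out]

stego = embed([(10, 20, 30)] * 4, "k", "m")
print(stego[0])  # (10, 0, 107): Blue holds 'k' (107), Green holds the terminator
```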
Figure 2: Proposed Selected Least Significant Bit method
3.2 Proposed Procedure for Extraction Phase
To extract data from the stego-image, the following procedure is performed.
Step 1: Consider three arrays: Character-array, Key-array and Pixel-array.
Step 2: Extract all the pixels in the given image and store them in the array called Pixel-array.
Step 3: Start scanning from the first pixel, extract the key characters from the first and second (Blue and Green) components of the pixels, and place them in Key-array. Follow Step 3 up to the terminating symbol, then follow Step 4.
Step 4: If the extracted key matches the key entered by the receiver, continue; otherwise, terminate the program and display the message "Key is not correct".
Step 5: If the key is valid, continue scanning the next pixels, extract the secret message characters from the first and second (Blue and Green) components, and place them in Character-array. Follow Step 5 up to the terminating symbol, then follow Step 6.
Step 6: Extract the secret message from Character-array.
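The matching extraction, under the same assumptions as the embedding sketch (Blue-then-Green read order, 0-byte terminator), can be sketched as:

```python
END = 0  # assumed terminating symbol

def extract(pixels, receiver_key):
    """Steps 1-6: rebuild the byte stream from the Blue and Green components,
    verify the stego-key, then return the hidden message."""
    stream = []
    for p in pixels:
        stream += [p[2], p[1]]  # Blue, then Green
    key_end = stream.index(END)
    if "".join(chr(v) for v in stream[:key_end]) != receiver_key:
        raise ValueError("Key is not correct")
    rest = stream[key_end + 1:]
    return "".join(chr(v) for v in rest[:rest.index(END)])

# A stego pixel list built under the same scheme: key 'k', message 'hi'
stego = [(10, 0, 107), (10, 105, 104), (10, 20, 0)]
print(extract(stego, "k"))  # hi
```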
Figure 3: Steganography Mechanism Receiver
3.3 Interface Design
The user interface is generally the means of communication between the user and the system, i.e., it enables the user to access the system. It is important that this communication be as meaningful and friendly as possible.
Based on the proposed algorithm, we developed a simple interface using Java GUI tools (NetBeans and Eclipse), since the system is implemented in the Java programming language. The interface is very simple to use, with the following buttons:
ENCODE: When clicked, this button opens a text box where the user is asked to input the data to be hidden in the cover object.
ENCODE NOW: When clicked, this button opens a dialog box for the user to browse for the preferred cover object (image).
DECODE: When clicked, this button opens a dialog box for the user to browse for the stego image that has the data embedded in the cover object (image).
DECODE NOW: When clicked, this button decodes the stego image.
EXIT: This button is used to terminate the application.
RESULTS
Histograms are useful tools for analyzing and comparing significant changes in the frequency of appearance of the colours of the cover image and the steganographic images, giving a quick summary of the tonal range present in any given image. A histogram plots the tones in the image from black (on the left) to white (on the right); a histogram with many dark pixels is skewed to the left, and one with many lighter tones is skewed to the right.
For efficient analysis and comparison, two different images were used, and a detailed analysis of four components of each image (Brightness, Red, Green and Blue) was carried out.
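The per-channel comparison described above can be sketched in code. This is a hypothetical illustration, not the paper's tool (the paper used histogram plots): it assumes pixels packed as 0xRRGGBB integers and measures tonal drift as the sum of absolute bin differences.

```java
// Illustrative sketch: build a 256-bin frequency table per colour channel and
// measure how far a stego image's tonal distribution drifts from the cover's.
// Assumption: pixels are 24-bit ints packed as 0xRRGGBB.
public class HistogramSketch {

    // 256-bin histogram of one channel; shift = 16 (Red), 8 (Green), 0 (Blue).
    static int[] histogram(int[] pixels, int shift) {
        int[] bins = new int[256];
        for (int p : pixels) {
            bins[(p >> shift) & 0xFF]++;
        }
        return bins;
    }

    // Sum of absolute bin differences: 0 means identical tonal distributions.
    static int totalDifference(int[] h1, int[] h2) {
        int diff = 0;
        for (int i = 0; i < 256; i++) {
            diff += Math.abs(h1[i] - h2[i]);
        }
        return diff;
    }

    public static void main(String[] args) {
        int[] cover = {0x112233, 0x112233, 0x445566};
        int[] stego = {0x112232, 0x112233, 0x445566}; // one Blue LSB flipped
        System.out.println(totalDifference(
            histogram(cover, 0), histogram(stego, 0))); // prints 2
    }
}
```

A single flipped LSB moves one pixel between adjacent bins, which is why LSB-family embedding leaves the histogram visually unchanged.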
Figure 4: Original Image
Figure 5: Original Image (Luminosity) Figure 6: Original Image (Blue)
Figure 7: Original Image (Green) Figure 8: Original Image (Red)
Stego Image using LSB Stego Image using SLSB Stego Image using NEW SLSB
Figure 9: Stego Images
LSB Image SLSB Image NEW SLSB Image
Figure 10: Luminosity channel of Stego Images
LSB Image SLSB Image NEW SLSB image
Figure 11: Red channel of Stego Images
LSB Image SLSB Image NEW SLSB image
Figure 12: Green channel of Stego Images
LSB Image SLSB Image NEW SLSB Image
Figure 13: Blue channel of Stego Images
As can be seen from the experimental results (histogram analysis) above, the algorithms were tested using an image. The results show that all the algorithms successfully hide the data, with no noticeable difference in the colour frequencies or the sizes of the resulting images.
Table 2: Image size before and after encoding using different algorithms

No  Original Image (Size)  Hidden Data File (Size)  LSB Algorithm (Size)  SLSB Algorithm (Size)  NEW SLSB Algorithm (Size)
1   11.97 kb               32 bytes                 111 kb                111 kb                111 kb
2   5.97 kb                32 bytes                 77.4 kb               77.4 kb               77.4 kb
4 CONCLUSION
The results of the experiments performed have shown the effectiveness of the proposed algorithm. The experimental results show that the algorithm strikes a balance between the Least Significant Bit (LSB) algorithm and the Selected Least Significant Bit (SLSB) algorithm, achieving a balance between security and image quality. There is no loss of the hidden data whatsoever, and the new method retains the quality of the image.
This research work only considers images as the cover object. Other forms of cover object are not
considered here. The algorithm only hides data between 8 bytes and 1024 bytes in size. Future work will investigate how to use the algorithm with other forms of cover object, i.e. text and video, and how to hide data of larger sizes.
REFERENCES
Arvind K. and Kim P. (2010). “Steganography- A Data Hiding Technique” International Journal of
Computer Applications ISSN 0975 – 8887, Volume 9– No.7, November 2010.
Chen P. and Wu W. (2009). A modified side match scheme for image steganography, International
Journal of Applied Science & Engineering 7 (2009) 53-60.
Divya S.S and Ram M. (2012). Hiding text in audio using multiple lsb steganography and provide
security using cryptography. International journal of scientific & technology research volume
1, issue 6, July 2012.
El-Emam N. (2007) Hiding a large amount of data with high security using steganography algorithm,
Journal of Computer Science 3 (2007) 223-232.
Fridrich J, Du R and Meng L. (2000) “Steganalysis of LSB Encoding in Color Images,” Proc. IEEE Int’l
Conf. Multimedia and Expo, CD-ROM, IEEE Press, Piscataway, N.J., 2000.
Gandharba S. and Saroj K.L. (2012). A Technique for Secret Communication Using a New Block Cipher with Dynamic Steganography. International Journal of Security and Its Applications, Vol. 6, No. 2, April 2012.
Hecht, E. 2006. Optics. Delhi, India: Pearson Education.
John M. and Manimurugan S. (2012). A Survey on Various Encryption Techniques. International Journal of Soft Computing and Engineering (IJSCE), ISSN 2231-2307, 2(1), March 2012.
Juan J. and Jesus M. (2009). SLSB: Improving the Steganographic Algorithm LSB. Universidad
Nacional de Educación a Distancia (Spain).
Lokeswara V., Subramanyam A. and Chenna P. (2011). Implementation of LSB Steganography and its
Evaluation for Various File Formats. Int. Journal Advanced Networking and Application, 2(5),
Pages: 868-872
Lou D., Liu J. and Tso H. (2008). Evolution of information-hiding technology, in H. Nemati (Ed.), Premier Reference Source - Information Security and Ethics: Concepts, Methodologies, Tools and Applications, New York: Information Science Reference, 2008, pp. 438-450.
Mauro B., Franco B., Vito C. and Alessandro P. (1999). A DCT-domain system for robust image watermarking. Dipartimento di Ingegneria Elettronica, Universita di Firenze, via di S. Marta 3, 50139 Firenze, Italy.
Morkel T., Eloff J. and Olivier M. (2005). An overview of image steganography. Information and Computer Security Architecture (ICSA) Research Group, Department of Computer Science, University of Pretoria, 0002, Pretoria, South Africa.
Rosziati I. and Teoh S. (2011). Steganography Algorithm to Hide Secret Message inside an Image.
Computer Technology and Application 2 (2011) 102-108.
Stefan K. and Fabien A. (2000). Information Hiding Techniques for Steganography and Digital Watermarking. Boston: Artech House, pp. 43-82, 2000.
Thomas A. (2005). Implementing Steganographic Algorithms: An Analysis and Comparison of Data
Saturation
Vijay K. and Vishal S. (2005). A steganography algorithm for hiding image in image by improved LSB substitution by minimize detection. Journal of Theoretical and Applied Information Technology.
Wu P and Tsai W. (2003). A steganographic method for images by pixel-value differencing, Pattern
Recognition Letters 24 (2003) 1613-1626.
IC1011
A Security Model for Mitigating Multifunction Network Printers
Vulnerabilities
Jean-Pierre Kabeya Lukusa
Department of Network and Infrastructure Management Botho University
Gaborone, Botswana Jean-Pierre.Lukusa@bothouniversity.ac.bw
ABSTRACT
With the ability to incorporate a wide range of functions, Network Printers have become not only one of the most essential tools in today's businesses but also one of the most neglected components in network security defenses. An efficient network security architecture design therefore necessitates the integration of key security implementations by means of formal security models conceived with security policies that take into consideration multifunction network printer (MNP) security liabilities. This paper presents a novel approach aimed at enforcing policy-constrained security mechanisms using a multilevel printer security architecture. The proposed security model ensures discretionary access control (DAC) and a secure flow of information to and from entities connected to the network, to provide a trusted computing base (TCB). Access to the printer by subjects is controlled by means of security clearance matrices that can then be applied to security classes under which network resources can be grouped. Lastly, a validation of the model is presented using simple set-theoretic concepts to assess the resilience of the implemented security defense model.
Keywords: Printer security, access control matrices, security architecture design, information flow
control, trusted computing base (TCB)
1. INTRODUCTION
1.1. Background
The modern printer has over the years evolved into an embedded device that is capable of
incorporating a wide range of functionalities that go beyond what an otherwise conventional printer
would be thought capable of doing. This unique ability of incorporating multiple functions into a
single unit has earned it the name Multi-function Printer (MFP). For the sake of generality, the term Multifunction Network Printer (MNP) has been adopted in this text to better represent it as a network-integratable embedded1 device. MNPs have, in spite of the commendable efforts made by the "go-green" (Bansal, 2000; Di Giuli, 2014) corporations2, managed to become one of the most essential tools in today's businesses and homes (Infotrends, 2011). Due to this increasing market demand for a multipurpose printer and the need to incorporate a wide range of functionalities into a compact
1 An object containing a special purpose computing system.
2 Encourage conservation of paper by advocating printing only when absolutely necessary (a.k.a. Green
Computing).
unit, most manufacturers have opted to integrate disk drives into their printer designs to record and store latent and/or residual data, thus effectively turning even the most secured printers into a dormant security liability.
Nowadays, a typical MNP is capable of printing, scanning, copying or faxing documents from both
electronic and hard sources. These documents more often than not contain potentially sensitive information that, if not properly secured, may fall into the wrong hands (Forbes, 2013). Therefore, identified security flaws in the mechanisms preventing unauthorized access to files residing within these printers, together with illegal flows of information, heighten the vulnerability of inter-process communications and thus potentially compromise privacy and integrity across the network (Gonsalves, 2013; Vail, 2003).
1.2. Problem of Interest
In highlighting the problem of interest, this paper takes cognizance of the ISO/IEC 154083 standard (Chen, 2015). It would thus be of consequence to clarify that the focus of the paper is not on the internal or architectural security design flaws (Cui, 2013; Forbes, 2013) in MNPs, which are otherwise conventionally addressed by the aforementioned standard, but rather on potential security loopholes
born from complexities inherent to MNPs. These loopholes can be roughly grouped into risks linked
with (i) control security, (ii) data security, and (iii) network security. Risk, in this sense, can be viewed as a component that grows with the magnitude of the resulting causation threats and vulnerabilities subject4 to assets (Bishop, 2012; Pfleeger, 2011), as presented by the following formula:
Risk = Threats × Vulnerabilities × Assets (1)
1.3. Focus of this Paper
In order to best describe the proposed security architecture, a formal mathematical model
(Landwehr, 1981) is presented to demonstrate its potential implementation. This paper primarily
focuses on: (i) devising a multilevel printer security5 mechanism for controlling access by subjects with
different security clearances; (ii) safeguarding privacy and integrity of data stored on the printers; (iii)
providing audit trails for all transient inter-process communications; and (iv) providing protection
against printer denial of service. The goal is to attain optimum security without compromising the
balance between protection and usability (Vail, 2003).
2. IDENTIFICATION OF TRUSTED COMPUTING BASE6 FUNCTIONS IN MNPs
2.1. Definition of Terms Used
A Subject – is an active network resource capable of exchanging data or control information with an MNP.
A Network User – is a person authorised to use a given network.
A User Identifier – is a unique character string used to identify a given network user.
3 The Common Criteria for Information Technology Security Evaluation.
4 An active entity (i.e. user/user process) that interacts with an MNP.
5 Multilevel security deals with the protection of information to which different security level clearance
classifications have been ascribed. 6 A set of hardware, software, or firmware factory implemented protection mechanisms within a given MNP
that are responsible for enforcing security policies (Bishop, 2003).
Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016
Copyright © Department of Computer Science, University of Botswana, 2016 105
A Security Class – is a security attribute that can be assigned to all network resources to which a sensitivity level can be ascribed (e.g. ADMIN, POWER-USER, DOMAIN USER, etc.). It provides a basis for determining access from subject(s) to MNP(s), and allows us to define the set of security classes as a bounded7 lattice of sensitivity levels. This is important as it defines the set of permissible information flows/transactions between subject(s) and MNP(s).
A Classification – is a designation attached to an MNP used for a given security class that reflects its relative value and vulnerability levels as a network asset.
An I/O Interface – is a point of transit for data/control located on an MNP. Each I/O interface belongs to a given classification.
An Operation – is a unit function that can be assigned to a given MNP and performed by an authenticated subject. These include, but are not limited to, the following:
a. Print – reproducing text and/or image from digital to hard-copy.
b. Scan – capturing images from hard-copy onto a digital format.
c. Fax – transmitting or receiving an electronic copy of a document.
d. Email – an electronic transmission or reception of a document.
A Reference Monitor (RM) – is used to mediate all information flows/transactions to a given MNP by subject(s).
A Reference Validation Mechanism (RVM) – is used to represent an implementation of the RM concept.
2.2. Security Requirements for Connecting Trusted and Untrusted Subject(s) to MNP(s)
This section presents some scenarios aimed at highlighting specific security requirements for
connecting trusted (t) or untrusted (u) subject(s) to MNP(s). Before proceeding further, it is important
to point out that creating a perfectly secured network is an ultimate, albeit unachievable, goal, as there will always be some element of risk, however minute (Suh-Lee, 2015). For instance, a trusted
subject or MNP may be defined as one that is deemed to have provided sufficient credible evidence
that it meets a finite set of security requirements (Bishop, 2003) and thus also making it in a way
trustworthy. Consequently, a subject or MNP deemed as ‘trusted’ would remain in that state
provided that the link between the ‘credible evidence’ and the finite set of ‘security requirements’ is
maintained. For this reason, ‘trust’ within the context of security should never be thought of as a
given.
Let X represent the set of security states for a given MNP such that X = {xt, xu}, and Y represent those of its connecting subject(s) such that Y = {yt, yu}, where xt and yt represent trusted security states and xu and yu untrusted security states. Figure 1 illustrates, using a Hasse diagram, four partially ordered sets that can result from the interaction between given elements of X and those of Y. From the diagram, the following four possible cases can be deduced, with the assumption that all involved MNPs have a resident TCB:
i. Threat T1: {xt, yt} – A trusted subject connecting to a trusted MNP.
a. Requirement T1-A: the subject is permitted to begin a session only if a unique user identifier8 has been supplied to the MNP and has been successfully validated and authenticated by the MNP.
b. Requirement T1-B: an active subject must belong to a classification that allows or prohibits access to both operations and/or I/O interfaces provided by the MNP.
7 An ordered set with both join (i.e. least upper bound) and meet (i.e. greatest lower bound) semi-lattices
8 May be presented as a user id and password, biometric inputs such as finger vein scan, smart card, etc…
c. Requirement T1-C: an inactive session albeit authenticated must have a finite expiration period.
d. Requirement T1-D: a valid subject identifier must have a finite print/copy quota in any given session.
e. Requirement T1-E: Mandatory security policies and flow control functions must be implemented on both MNP and its subjects.
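The clearance checks behind requirements T1-A and T1-B can be sketched as an access matrix lookup. This is a hypothetical illustration only: the class names, user identifiers and API below are invented, as the paper prescribes the policy rather than a concrete implementation.

```java
import java.util.*;

// Hypothetical sketch of the clearance-matrix check in T1-A/T1-B: a subject
// may invoke an operation only if its identifier is validated (T1-A) and its
// security class is granted that operation in the access matrix (T1-B).
// All class names, identifiers and method names here are illustrative.
public class ClearanceMatrixSketch {

    enum Operation { PRINT, SCAN, FAX, EMAIL }

    // Access matrix: security class -> operations it is cleared for.
    static final Map<String, Set<Operation>> MATRIX = Map.of(
        "ADMIN",       EnumSet.allOf(Operation.class),
        "POWER-USER",  EnumSet.of(Operation.PRINT, Operation.SCAN, Operation.FAX),
        "DOMAIN-USER", EnumSet.of(Operation.PRINT)
    );

    // Registered user identifiers and their security classes.
    static final Map<String, String> USERS = Map.of(
        "u100", "ADMIN",
        "u200", "DOMAIN-USER"
    );

    static boolean mayPerform(String userId, Operation op) {
        String securityClass = USERS.get(userId);   // T1-A: validate identifier
        if (securityClass == null) {
            return false;                           // unknown subject: deny
        }
        // T1-B: the classification must allow the requested operation.
        return MATRIX.getOrDefault(securityClass, Set.of()).contains(op);
    }

    public static void main(String[] args) {
        System.out.println(mayPerform("u200", Operation.PRINT)); // true
        System.out.println(mayPerform("u200", Operation.SCAN));  // false
        System.out.println(mayPerform("u999", Operation.PRINT)); // false
    }
}
```

Grouping resources under security classes, as the paper proposes, keeps the matrix small: one row per class rather than one per user.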
{xt, xu, yt, yu}
{xt, yt} {xt, yu} {xu, yt} {xu,yu}
{xt} {xu} {yt} {yu}
Ф
Figure 1: Hasse diagram of partial order derived from X and Y
The security analysis mapping for T1 can be defined as: T1 → {T1-A, T1-B, T1-C, T1-D, T1-E}. The set {T1-A, T1-B, T1-C, T1-D, T1-E} is known as the security target reference mapping to threat T1.
ii. Threat T2: {xt, yu} – An untrusted subject connecting to a trusted MNP.
a. Requirement T2-A: a valid justification of the security analysis mapping for T1 must be presented in accordance with ITSEC (Commission of the European Communities, 1991).
b. Requirement T2-B: the classification of each untrusted subject must fall within the range of sensitivity levels for which the MNP is trusted.
c. Requirement T2-C: the MNP must provide an audit trail (i.e. maintain an event log) of all past actions performed on behalf of subject(s).
d. Requirement T2-D: an appropriate user classification with matching sensitivity level must be defined for all subjects connecting to an MNP during non-business hours.
e. Requirement T2-E: appropriate network-level security must be enforced to ensure that discretionary access control9 as well as port and protocol access control is observed at all network layers.
The security analysis mapping for T2 can be defined as: T2 → {T1, T2-A, T2-B, T2-C, T2-D, T2-E}.
iii. Threat T3: {xu, yu} – An untrusted subject connecting to an untrusted MNP.
a. Assumption T3-A: due to the lack of discretionary access control, trust association with subject(s) cannot be reciprocated by the MNP and vice-versa.
b. Assumption T3-B: as a result of T3-A, no security policy may be effected. This also means that the presence of a TCB on the MNP is symbolic (i.e. not functional).
9 Also known as identity-based access control (IBAC) – granting restricted access to subjects on the basis of their identity and/or the groups to which they belong.
The security analysis mapping for T3 can be defined as: T3 → {T3-A, T3-B}.
iv. Threat T4: {xu, yt} – A trusted subject connecting to an untrusted MNP.
a. Requirement T4-A: a valid justification of the security analysis for T2 must be validated in accordance with ITSEC (Commission of the European Communities, 1991).
b. Requirement T4-B: upon initiation of an active session, subjects need to secure their transactions using either data protection mechanisms10 or printer job locking11.
2.3. MNP Security Mis-configurations: A Review of Possible Problem Areas
In order to best address possible problem areas inherent to MNP, it is necessary to look at security
control management in terms of its access (Dohi, 2012), information flow (Denning, 1975; Stoughton,
1981), and cryptographic (Kahate, 2013) control. These are briefly discussed in the following sections
as potential security sore spots leading to network threats and/or vulnerability in MNP.
i. Devising Generic MNP-Based Security Mechanisms for Controlling Access by Subjects
When inspecting access control vulnerability areas, one needs to describe them in terms of configurable authentication12, authorization13, and accountability14 features. On a generic MNP, for instance, these can be controlled by enabling, amongst others, features such as discretionary copy/print/scan/fax account tracking, subject authentication for both remote and local access, auto log-off of idle processes, function restrictions, event log history, printer driver user-data encryption, non-business-hours user account tracking, etc.
ii. Safeguarding Data Privacy and Integrity
While it is equally important to ensure confidentiality of data by taking simple measures such as not leaving personal documents lying in the MNP's output tray, the emphasis in this section is on the implementation of appropriate data security policies to safeguard latent and/or residual data stored on MNPs' resident drives. Amongst others, these can be controlled by enabling features such as disk-drive password protection, hard-disk data encryption, hard-disk data overwriting, temporary data deletion, timed data auto-deletion, etc.
iii. Providing Audit Trails for Transient Inter-process Communications
Provision of audit trails for transient inter-process communications on MNPs is often realized through the integration of reference monitor functions on the MNP. Control is achieved here by enabling features such as IP address filtering, port and protocol access control, SSL15/TLS16 encryption, IPSec support for secured session tunneling, IEEE 802.1x support, NDS17 authentication, etc.
iv. Protection Against Printer Denial of Service
Most MNP denial-of-service attacks discussed in the literature (Ormazabal, 2014a; Ormazabal, 2014b; Ormazabal, 2015) can be grouped into two broad categories. The first is often achieved by gaining unlawful access to the printer via unsecured ports (such as HTTP or Telnet) with the intent of damaging or unlawfully restricting access to services or functionalities that were otherwise provisioned for authenticated subjects. The second type is achieved by flooding known printer interfaces/ports (such as port 9100) with random data with the intent of exhausting its resources and thus effectively preventing it from provisioning any further service. These attacks are often circumvented by simply observing the basic access control measures discussed earlier and frequently applying printer firmware patches.
10 Includes hard-disk password protection, disk data encryption, hard-disk data overwrite, latent data auto-deletion, etc.
11 Ensuring that jobs from an authenticated subject are put on hold until a matching identifier is provided physically at the machine.
12 Confirming a subject's identity.
13 Determining what the subject can do.
14 Associating subjects with their actions.
15 Secure Socket Layer.
16 Transport Layer Security.
17 Novell Directory Services.
3. MEASURING SECURITY AS A RESULT OF THE INTERACTION BETWEEN SUBJECTS AND MNP
3.1. Preamble
This section presents an adaptation of the method for quantifying security risk presented by Suh-Lee and Jo (2015). Before proceeding to the calculations, there is a need to define the following terms:
i. A Danger Zone – is a network segment where trusted members attached to the segment have frequent interactions with an untrusted host. The set of nodes belonging to a given Danger Zone is defined as: (2)
ii. A Zone Proximity Value – is an integer value that indicates the proximity of a trusted member of the zone to its untrusted member. The smaller the value, the closer the member is to the untrusted node and, therefore, the higher the risk in the zone (Suh-Lee, 2015).
iii. The Proximity Value of a node H from the Danger Zone – is defined as: (3)
iv. The Proximity-adjusted Vulnerability Score of a Host H (Suh-Lee, 2015) – is defined as the sum of the adjusted scores of the vulnerabilities found in the host: (4), where each vulnerability's severity score18 is adjusted by the host's proximity value: (5)
v. The Relative Cumulative Risk (RCR) of the vulnerability – is defined as: (6)
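The computation described by these definitions can be sketched in code. The method and proximity semantics are from Suh-Lee and Jo (2015), but the concrete formula below (each vulnerability's CVSS score divided by the host's zone-proximity value, summed over the host) and the sample CVSS values are assumptions made purely for illustration.

```java
// Hedged sketch of the risk computation in Section 3.1. Assumed reading of
// (4)-(6), for illustration only: the proximity-adjusted score of each
// vulnerability is its CVSS score (0-10) divided by the host's zone-proximity
// value, and the Relative Cumulative Risk (RCR) is the sum of these scores.
public class RcrSketch {

    // Proximity-adjusted score of one vulnerability (assumed form of (5)).
    static double adjusted(double cvss, int proximity) {
        return cvss / proximity;
    }

    // Assumed form of (4)/(6): accumulate adjusted scores over the host.
    static double rcr(double[] cvssScores, int proximity) {
        double sum = 0.0;
        for (double v : cvssScores) {
            sum += adjusted(v, proximity);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical CVSS findings; proximity values echo the proximity map
        // (MNP2 is one hop from the untrusted node, MNP1 is five hops away).
        double[] vulns = {7.5, 9.8};
        System.out.println("MNP2-like host: " + rcr(vulns, 1)); // nearer, riskier
        System.out.println("MNP1-like host: " + rcr(vulns, 5)); // farther, safer
    }
}
```

Whatever the exact form of (5), the division by proximity captures the key property used in Section 3.2: identical vulnerabilities weigh more on a host sitting closer to the untrusted node.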
3.2. Evaluating the Relative Cumulative Risk of the Interaction between Members of the Danger Zones
To demonstrate the effect of the interactions between subjects and MNPs located in various sections
of the network and estimate their relative risk, an emulation of a typical deployment environment is
18
The Common Vulnerability Scoring System v3 (CVSS): is a measurable vulnerability severity score ranging from 0 to 10, with 7.0 to 10.0 being the highest, often used to prioritize responses and resources according to threat.
used as represented in Figure 2.
Under the assumption that the internal network is adequately policed and fulfils the security requirements stated in the previous section, the three Multifunction Network Printers (MNPs) have been placed in three different segments of the network. From Figure 2, the following two danger zones (i.e. DZ1 and DZ2) can be identified.
The first zone is positioned directly behind inbound connections originating from the Internet via the ISP-supplied gateway router, inbound through FW1 and onto the DMZ segment containing the employees' e-mail server (EMS), the public hall printer (MNP2) and the webserver (WS), which also has a backend connection to the database server (DS).
[Diagram nodes – Internet; Gateway Router (GW); External UTM/FW (FW1); Internal FW (FW2); Proxy/Email Filter (PRX); DMZ: E-Mail Server (EMS), Web Server (WS), Main Hall Printer (MNP2); Subnet 1 Workstations: User Stations (US1, US2), System Admin Workstation (AUS3), Sales Printer (MNP3); Subnet 2 Servers: DB Server (DS), AD Server (AD), File Server (FS), Management Server (MS), Management Printer (MNP1)]
Figure 2: The Test Network Diagram
Excluding all non-printing resources, our zone definition and proximity value assignment would be: DZ1: {Internet = 0, GW = 0, FW1 = 1, FW2 = 1, MNP2 = 1}.
The second zone comprises subnets 1 and 2. The first subnet has outbound connectivity to the Internet through the proxy server via FW1. Similarly to the representation of zone 1 above, our zone definition and proximity assignment in this case is: DZ2: {Internet = 0, GW = 0, FW1 = 1, FW2 = 3, MNP3 = 4, MNP1 = 5}.
Determining proximity values for the printers relative to the danger zones, using (3) and (5), generates the following proximity map.
Figure 3: Proximity Map for Test Network
The proximity adjusted vulnerability scores for MNP1, MNP2, and MNP3 can be calculated as
follows:
Applying (4) and (5) to the vulnerabilities found on each printer, with the proximity values from Figure 3, yields the proximity-adjusted vulnerability scores and RCR values for MNP1, MNP2 and MNP3.
From the above calculations, we can conclude that MNP2 has the highest risk rank (RCR = 45.49) of the three printers, followed by MNP3 (RCR = 28.73). MNP1 is found to be the resource least at risk (RCR = 2.33).
This conclusively demonstrates how exposure to different vulnerability levels can elevate the
relative risk rankings of resources that may otherwise be assumed to have been properly secured.
The presented method (Suh-Lee, 2015) therefore establishes that while accurate configuration of
printers is important, it is equally important to remedy existing network vulnerabilities to ensure
that risks are kept at their lowest.
4. CONCLUSION
The paper presented a multi-level network printer security architecture that relies on robust policy-constrained security mechanisms for discretionary control of both trusted and untrusted entities by means of the TCB. The developed model further demonstrates the need for a secured and trustworthy network environment, since the nature of such an environment tends to measurably
[Figure 3 data: DZ1 – Internet: 0, GW: 0, FW1: 1, FW2: 1, MNP2: 1; DZ2 – Internet: 0, GW: 0, FW1: 1, FW2: 3, MNP3: 4, MNP1: 5]
reduce the risk ranking ascribed to a given MNP regardless of how well it is thought to have enforced
access and information flow control mechanisms. It is hoped that by focusing on accurate
configuration, good use of discretionary security policy implementations, and strategic placement of
MNPs; one can greatly improve both the security and trustworthiness of MNPs while still
maintaining the balance between protection and usability.
ACKNOWLEDGEMENTS
I would like to thank Botho University for sponsoring this paper's presentation costs. I would also like to thank colleagues and friends who have taken the time to review it and provide much-needed feedback.
REFERENCES
Bansal, P., & Roth, K. (2000). "Why companies go green: A model of ecological responsiveness."
Academy of Management Journal 43 (4): 717-736.
Bishop, M. (2012). "An Overview of Computer Security." In Computer Security: Art and Science, 3-24. Cape Town: Addison-Wesley.
Bishop, M. (2003). "Assurance." In Computer Security: Art and Science, 475-544. Cape Town: Addison-Wesley.
Chen, H., Bao, D., Goto, Y., & Cheng, J. (2015). "A Supporting Environment for IT System Security
Evaluation Based on ISO/IEC 15408 and ISO/IEC 18045." Computer Science and its
Applications 1359-1366.
Commission of the European Communities. (1991). Information Technology Security Evaluation
Criteria. Brussels: Commission of the European Communities.
Cui, A., Costello, M., & Stolfo, S. J. (2013). "When Firmware Modifications Attack: A Case Study of
Embedded Exploitation." NDSS.
Denning, D. E. R. 1975. "Secure Information Flow in Computer Systems." Ph. D. Dissertation. Purdue
Univ.
Di Giuli, A., & Kostovetsky, L. (2014). "Are red or blue companies more likely to go green? Politics
and corporate social responsibility." Journal of Financial Economics 111 (1): 158-180.
Dohi, M. 2012. Printing system, information processing apparatus, printing apparatus, print
management method, and storage medium. Washington, DC: U.S. Patent 8,161,297. April
17.
Forbes. (2013). "The Hidden IT Security Threat Multifunction Printers." February 7. Accessed
December 26, 2015. http://www.forbes.com/sites/ciocentral/2013/02/07/the-hidden-it-
security-treat-multifunction-printers/?sf9393024=1.
Gonsalves, A. (2013). "Printers Join Fray in Network Vulnerability Landscape." CSO Online. January
29. Accessed December 26, 2015. http://www.csoonline.com/article/2132861/access-
control/printers-join-fray-in-network-vulnerability-landscape.html.
Grubb, B. (2013). "Security Fears over Exposure of Web-accessible Printers." January 29. Accessed
December 26, 2015. http://www.theage.com.au/it-pro/security-it/security-fears-over-
exposure-of-webaccessible-printers-20130129-2dhxo.html.
Infotrends. (2011). "Placements of Printers & MFP Devices Grew In U.S. and Western Europe Despite
Challenging Economy." May 24. Accessed December 26, 2015.
http://www.infotrends.com/public/content/press/2011/05.24.2011c.html.
Kahate, A. (2013). Cryptography and network security. New Delhi: Tata McGraw-Hill Education.
Landwehr, C. E. (1981). "Formal models for computer security." Computer Surveys 13 (3): 247-275.
Ormazabal, G. S., & Schulzrinne, H. G. (2014a). Malicious user agent detection and denial of service (DoS) detection and prevention using fingerprinting. DC: U.S. Patent 8,689,328. Apr 1.
Ormazabal, G. S., & Schulzrinne, H. G. (2014b). Denial of service detection and prevention using dialog level filtering. DC: U.S. Patent 8,719,926. May 6.
Ormazabal, G. S., Schulzrinne, H. G., Yardeni, E., & Patnaik, S. B. (2015). Prevention of denial of
service (DoS) attacks on session initiation protocol (SIP)-based systems using return
routability check filtering. DC: U.S Patent 8,966,619. Feb 24.
Pfleeger, C. P., & Pfleeger, S. L. (2011). "Administering Security." In Security in Computing, 524-545.
Boston: Prentice Hall Professional Technical Reference.
Savage, C., Petro, C., & Goldsmith, S. (2015). System for Providing Session-based Network Privacy,
Private, Persistent Storage, and Discretionary Access Control for Sharing Private Data.
Washington, DC: U.S. Patent 20,150,333,917. November 19.
Stoughton, A. (1981). "Access Flow: Protection model which integrates access control and
information flow." IEEE Symp. Security and Privacy. 9-9.
Suh-Lee, C., & Jo, J. (2015). "Quantifying security risk by measuring network risk conditions." 2015
IEEE/ACIS 14th International Conference. Las Vegas. 9-14.
Vail, V. T. (2003). "Printer Insecurity: Is it Really an Issue?" SANS Institute InfoSec Reading Room, May
28: 1-12.