International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 03 Issue: 04 | Apr-2016 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | Page 1411
MALICIOUS DATA MINING FROM CYBER TEXT DATA
M. Bhavana (1), M. Ashok Kumar (2), K. Nikhil (3), A. Naga Kiran (4), Mrs. Y. Padma (5)
(1, 2, 3, 4) IV/IV B.Tech, Department of IT, P.V.P. Siddhartha Institute of Technology, Vijayawada.
(5) Asst. Professor, Department of IT, P.V.P. Siddhartha Institute of Technology, A.P, INDIA.
Abstract: As technology advances, criminals gain newer ways
of committing crimes. Social networking sites have seen a
tremendous rise in use in recent years, and they widen the
scope for criminal activities such as robbery, killing and
abetment of suicide, because accounts are user-protected
and messages, mails and documents can be passed among a
large number of people. The main objective of our project is
to analyse such criminal information in mails and
documents in order to aid crime department
investigators. The model helps the investigators by
displaying the malicious message contents of the respective
users, together with the frequencies of the malicious words,
so that cases can be solved in a shorter span of time.
Keywords: Malicious, Crime, Email, Investigation,
Data Mining, Forensic.
1. INTRODUCTION
At present, everyone, including professionals, students,
professors, teachers and also criminals, communicates
over the internet via emails, social networking sites,
messengers and so on. Because this means of communication
is trouble-free, criminals can carry out many illegal
activities very easily, including bomb blasts, robbery,
fraud, drug dealing and many more. To find the culprits,
forensic experts and investigators often go through various
chat sites and analyse the chat data exchanged between
suspects. The main concept of our project is to collect
chat data between such suspects and to help crime
department investigators by analysing large amounts of chat
data and finding the malicious data hidden in it. Our
project can analyse not only structured data but also
unstructured data. Its main objective is to assist
investigation departments by obtaining in advance the
information that a culprit is transferring through
internet-based communication. The process detects malicious
messages between suspects and counts their frequencies
through several steps: pre-processing, extraction of the
pre-processed keywords, comparison of the extracted
keywords with the suspected words loaded in a dictionary,
and identification of the actual culprits.
Problem Statement:
To investigate crimes, crime departments go through
internet-based information transfer applications such as
emails and chat messengers. Since such data is large in
volume and unstructured, finding the actual culprit among
the suspects is a very challenging and tedious task. The
problem can be divided into the following parts:
Determining whether a piece of information is malicious.
Associating the resulting information with a specific
email address or person.
Detecting the criminals and the extent of their involvement.
2. DATA ANALYSIS
Generally, the chat data exchanged between users is
unstructured and high in volume, which makes it very
difficult to handle with artificial intelligence
techniques. We therefore used data mining techniques and
classified the data with respect to certain attributes. We
analysed the data by taking the email addresses of the
users and their chat content in Excel format. We used the
NetBeans IDE to handle the classified data, with the data
mining approach implemented in Java. For data
pre-processing and extraction, the Porter stemmer algorithm
was used, which strips the suffixes from the words in the
chat content and converts the words to their root forms.
Stop words were also deleted in the pre-processing step,
and the resulting words were extracted for further
checking. Since we had already stored many suspicious words
in the database, we compared the extracted words with these
dictionary words, which are held in a structured format. We
observed that the output correctly displays each user's
email address and chat content along with the frequencies
of the malicious terms found.
3. OUR APPROACHES
3.1 Collection of email and message
datasets:
The fundamental step is to collect datasets that contain a
large amount of information about mails sent to suspects.
We gathered a large number of sample datasets containing
this information in order to carry out the data analysis
and data mining.
3.2 Creation of database and uploading suspected
words into our created database:
Initially we created a database using MySQL. We then
collected suspected words from various sources on the web
into a text file, which holds all the malicious words in an
organized form and includes words in several formats, such
as jumbled and mixed words. We loaded all these malicious
words into our database in order to detect suspicious users
and their respective data and email information.
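As a minimal sketch of this step, assuming the suspected words are held one per line in the text file, the loader below normalises each word and collects it into an in-memory set that stands in for the database table. The class name SuspectWordLoader and the method load are hypothetical, the sample words are illustrative, and the MySQL storage itself is not shown.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: turns the lines of a word-list file into a set.
// The set stands in for the database table described in the text.
final class SuspectWordLoader {

    static Set<String> load(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            // Normalise case and whitespace so later lookups are consistent.
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                words.add(word);   // duplicates are dropped by the set
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Illustrative input only; a real list would be read from the file.
        List<String> lines = Arrays.asList("Bomb", "  robbery ", "", "FRAUD", "bomb");
        System.out.println(load(lines));   // prints [bomb, robbery, fraud]
    }
}
```

Storing the words in lower case means the chat text only has to be lower-cased once before comparison.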
3.3 Uploading chat datasets into our model:
The data needs to be adapted to the software we are using.
The datasets, which are in MS-Excel format and contain the
to and from attributes along with the message contents
exchanged between the suspects, were uploaded into our
model. Since we implemented our model in Java, we converted
the datasets into table format.
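The table format can be pictured with a small sketch. It assumes, purely for illustration, that the Excel sheet has been exported as tab-separated text with a header row followed by from, to and message columns; the class ChatRecord and its parse method are hypothetical names, and reading the .xls file directly is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical row type: one chat or mail message with sender and recipient.
final class ChatRecord {
    final String from;
    final String to;
    final String message;

    ChatRecord(String from, String to, String message) {
        this.from = from;
        this.to = to;
        this.message = message;
    }

    // Assumes tab-separated lines: a header row, then from, to, message.
    static List<ChatRecord> parse(List<String> lines) {
        List<ChatRecord> table = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {        // skip the header row
            String[] cells = lines.get(i).split("\t", 3);
            if (cells.length == 3) {                    // ignore malformed rows
                table.add(new ChatRecord(cells[0].trim(), cells[1].trim(), cells[2].trim()));
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<ChatRecord> table = parse(Arrays.asList(
                "from\tto\tmessage",
                "a@example.com\tb@example.com\tMeet at the usual place"));
        System.out.println(table.get(0).from + " -> " + table.get(0).to);
    }
}
```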
3.4 Data pre-processing and extracting:
The gathered information has to be converted into a form
that the software or language we are using can understand.
For that purpose, we perform several tasks such as the
removal of unwanted text, symbols and words that are
generally not useful for text mining. Since text files
contain unwanted and inconsistent data, a cleaning
procedure is needed first. Pre-processing and extraction
are done by program code that removes punctuation marks,
stop words and other information not required for the
subsequent checking process. For this data pre-processing
we used two algorithms that help clean the data. The main
algorithms are:
3.4.1 Porter Stemming algorithm:
Stemming is the term used in linguistic
morphology and information retrieval to describe the
process for reducing inflected (or sometimes derived)
words to their word stem, base or root form, generally a
written word form. The stem need not be identical to
the morphological root of the word; it is usually
sufficient that related words map to the same stem, even
if this stem is not in itself a valid root. Algorithms for
stemming have been studied in computer science since
the 1960s. Many search engines treat words with the
same stem as synonyms as a kind of query expansion, a
process called conflation.
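To give a feel for how suffix stripping works, the sketch below implements only step 1a of Porter's original algorithm, which handles plural endings. It is not the full stemmer and is not taken from the project's code; the class name PorterStep1a is hypothetical. The outputs show that a stem such as "poni" need not be a valid word.

```java
import java.util.Locale;

// Sketch of step 1a of Porter's original algorithm only (plural endings).
// The remaining steps of the stemmer are deliberately left out.
final class PorterStep1a {

    static String apply(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("sses")) {
            return w.substring(0, w.length() - 2);   // SSES -> SS
        }
        if (w.endsWith("ies")) {
            return w.substring(0, w.length() - 2);   // IES -> I
        }
        if (w.endsWith("ss")) {
            return w;                                // SS -> SS (unchanged)
        }
        if (w.endsWith("s")) {
            return w.substring(0, w.length() - 1);   // S -> removed
        }
        return w;
    }

    public static void main(String[] args) {
        // Prints: caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
        for (String w : new String[] {"caresses", "ponies", "caress", "cats"}) {
            System.out.println(w + " -> " + apply(w));
        }
    }
}
```

The rules are tried longest suffix first, so "caress" is matched by the SS rule before the plain S rule can remove its final letter.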
3.4.2 Catching the Stop words
Most search engines ignore extremely common words in order
to save disk space or to speed up search results. These
filtered words are known as 'stop words'. They are the
words used most generally in everyday language: they help
to form the sentences of a conversation but carry little
meaning of their own. Such words are of no use to us in the
chat data, so we apply this step to eliminate them.
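A minimal sketch of the cleaning and stop-word steps together is given below. The stop-word list is a tiny illustrative sample rather than the list used in the project, and the class name StopWordFilter and method keywords are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordFilter {

    // A tiny illustrative stop-word list; a real list would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "at", "in", "is", "of", "the", "to", "we", "will"));

    static List<String> keywords(String message) {
        List<String> kept = new ArrayList<>();
        // Lower-case, then split on anything that is not a letter or digit,
        // which drops punctuation and symbols in the same pass.
        String[] tokens = message.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [meet, bank, bring, tools]
        System.out.println(keywords("We will meet at the bank, and bring the tools!"));
    }
}
```

Holding the stop words in a hash set keeps each lookup at constant time, which matters when the chat data is large.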
3.5 Inspecting the data
After pre-processing and extraction, the resulting data
is scrutinised by applying mining procedures such as
comparing the resulting words with the suspected words
residing in the dictionary and presenting the output. The
entire process is illustrated by the screens described at
the end of this section.
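The comparison step can be sketched as a lookup of each extracted keyword in the dictionary, with a running count per matched word. The sketch assumes the keywords and the dictionary words have been normalised in the same way (same case, same stemming); the class name SuspectTermCounter and the sample words are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class SuspectTermCounter {

    // Counts how often each dictionary word occurs among the extracted keywords.
    static Map<String, Integer> count(List<String> keywords, Set<String> dictionary) {
        Map<String, Integer> frequencies = new TreeMap<>();   // sorted by word
        for (String keyword : keywords) {
            if (dictionary.contains(keyword)) {
                frequencies.merge(keyword, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("bomb", "fraud"));
        List<String> keywords = Arrays.asList("plan", "bomb", "station", "bomb", "fraud");
        System.out.println(count(keywords, dictionary));   // prints {bomb=2, fraud=1}
    }
}
```

Running the counter once per sender address would give the per-user frequencies that the model displays.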
First, we took some chat datasets from online sources and
loaded the suspected words into the database. We also
created some chat datasets of our own for better
understanding. Our entire model can be followed through
the following screens:
The admin logs in to the account in order to
perform further actions.
The list of suspected words is fed into the database.
Data pre-processing steps are performed on the extracted
and analysed data.
The results show the message contents of suspicious users
and their respective mail IDs.
4. CONCLUSION
In this paper, we have presented a model that performs
term mining on criminals' cyber text data. The model takes
suspicious persons' emails and chat message content in
Excel table format and performs pre-processing, extraction
and analysis. Finally, it provides information about the
actual culprits' malicious actions, as shown in the screens
above, which helps investigation departments to search
specific mails rather than examine all suspicious people's
cyber text data.
5. FUTURE WORKS
Our model is currently used only on data collected after
a crime has taken place. To help prevent such criminal
activities, a model is needed in which mining is done over
online streaming data. Such a model would place certain
limits on users: whenever the suspicious word count in the
messages between users increases, the respective users
would, after a limited number of warning alerts,
automatically be blocked from sending and receiving mails
or chats. To achieve this, some malicious words have to be
given high priority, and if the words of highest priority
are used often, the respective user's details and location
would automatically be transferred to criminal
investigation organizations. The remaining emails and user
details would be removed from the database so that its
growing volume can be controlled. This would also make it
easy for the forensic department to check only the
specified mails rather than all of them.
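One possible shape for the proposed warning and blocking rule is sketched below. It is only an illustration of the idea: the class name WarningMonitor, the two thresholds and their values are assumptions, and word priorities and the forwarding of user details are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical monitor for the proposed streaming model. The thresholds
// are illustrative assumptions, not values taken from the paper.
final class WarningMonitor {
    enum Action { NONE, WARN, BLOCK }

    private final int wordLimit;     // suspicious words tolerated before warnings start
    private final int maxWarnings;   // warnings issued before the user is blocked
    private final Map<String, Integer> wordCounts = new HashMap<>();
    private final Map<String, Integer> warnings = new HashMap<>();

    WarningMonitor(int wordLimit, int maxWarnings) {
        this.wordLimit = wordLimit;
        this.maxWarnings = maxWarnings;
    }

    // Records the suspicious words found in one message and decides what to do.
    Action record(String user, int suspiciousWords) {
        if (warnings.getOrDefault(user, 0) > maxWarnings) {
            return Action.BLOCK;                 // already blocked
        }
        if (suspiciousWords <= 0) {
            return Action.NONE;                  // clean message
        }
        int total = wordCounts.merge(user, suspiciousWords, Integer::sum);
        if (total <= wordLimit) {
            return Action.NONE;                  // still under the limit
        }
        int issued = warnings.merge(user, 1, Integer::sum);
        return issued > maxWarnings ? Action.BLOCK : Action.WARN;
    }

    public static void main(String[] args) {
        WarningMonitor monitor = new WarningMonitor(2, 2);
        int[] perMessage = {1, 2, 1, 1};
        for (int n : perMessage) {
            System.out.println(monitor.record("a@example.com", n));
        }
    }
}
```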