Messenger account security using natural language processing -THE CANID SCRIPT- by Pushkar Gupta PEC University Of Technology, Chandigarh, India

Messenger account security using natural language processing

-THE CANID SCRIPT-

by Pushkar Gupta

PEC University Of Technology, Chandigarh, India

Index

Abstract

Introduction

Script components

  The Environment

  Parameters Scanned

  Contact Statistics and formal contacts

  Characteristics Database

  Initialized and Uninitialized contacts

  Suspicion and Buffer database

  Time Limit

  Message authenticity script

  Script mechanism infographic

Some Algorithms

  Determining who has been cursed

  Finding a trend

Scope for improvement

Source Code

Contact Info

-Abstract-

Whenever two people communicate with each other using speech, each person has a particular style of talking, based on the social relationship between the two. This is a personality trait which human beings unconsciously possess. The pattern of using particular words, tones, etc. constitutes the style of speech of a particular person, which varies as the person talks to different people. This style can be used to verify the identity of a person. For example, if the phone rings and the caller identifies himself as your colleague at work, you can judge whether the person is actually who he/she claims to be by analysing his/her speech (this part happens subconsciously in your brain).

Based on this fact, an algorithm can be made and used for a messaging application (email, chat messenger) which helps the software to check whether a message sent from a user's account was sent by the actual user. In other words, it can detect an intrusion into someone's messaging account and block unauthorized messaging. This script belongs to the field of artificial intelligence. It tracks the typing habits of its users and uses its database to analyse the authenticity of users sending messages. This project aims to create a different and more intelligent way to provide user account security.

-Introduction-

On most software applications where a user account is maintained, the account is protected by a username and a password; sometimes the password is optional (smartphone applications). This project provides an alternate security system for user accounts on any chat application.

Its advantage is that it does not depend on the password staying secret. Passwords can be broken, but this concept is designed to detect and block an intrusion even after a successful login. Specifically, it blocks the kind of intrusion in which someone gains access to another person's account and starts sending random messages, creating an embarrassing situation for the account holder.

Such intrusions occur quite often. The hacker is generally a friend who is doing it all for fun and means no harm but, as stated above, it could lead to an embarrassing situation. Once password authentication is passed (friends might know each other's passwords), the user gets a lot of privileges. If password authentication were required at every step, the application would become a headache to use. So, I have created a system that protects against 'random message sending' intrusions while keeping the UI smooth.

For this security system (a set of scripts), a full chat messenger website has been created for the scripts to run on.

This paper explains the mechanism of the entire messenger application, focussing mainly on the various components of the intrusion detection script. After the components, some of the sub-functions that require further explanation are discussed. As this project focuses on the concept and idea of an intrusion detection algorithm, less emphasis has been laid on code efficiency. The 'Scope For Improvement' section describes some of the possible improvements. Some parts and subscripts of this project have been licensed under the GNU General Public License version 3 and their source code has been hosted on GitHub. The respective links are provided at the end of this paper.

The messenger application is a website coded in HTML, jQuery and CSS. The server-side programs and the intrusion detection script have been coded in PHP, and all the databases have been created using MySQL.

-Script Components and Concepts-

The Environment:

In this project a chat messenger application has been created as an environment and a platform for the script to run on. It is a website where every user can create his/her own user account, which requires a unique username and a password to log in. A user has to send requests to other users to add them to his/her contact list, and then the two users can chat with each other. The communication scheme employs a message passing model, where each message passes through a server, where it is stored. The server in turn writes the message to the receiver's chat portal if the message is authentic. When user 'A' sends a message to user 'B', the message is first sent to the server, which processes the current message, analyses it for authenticity, updates the database, and either sends it to user 'B' or blocks it.

The script is uploaded to a main server which processes each message.

Parameters scanned during message processing:

The script keeps track of the typing habits and patterns of its users, by monitoring the following parameters.

General Characteristics:

1) Use of chat abbreviations and/or acronyms.

Why? - People often use chat abbreviations during a casual chat.

2) Use of words that don't belong to the English language (ex. communicating in a foreign language but using the English alphabet set).

Why? - People who don't have English as their mother tongue often chat with their friends and family in some other language, but use the English alphabet characters.

3) If the current message is a continuation of the previous message.

Ex. Message 1 - "I want to meet u"
    Message 2 - "tomorrow"

Why? - Some people divide a message into parts and send them as a series of several messages. This is more of a personality trait.

4) If a special typing pattern is present in the message. Ex. "tYpInG LiKe ThIs", using too many spaces, making excessive use of '!,?,.' characters, etc.

The script to determine this has been explained in the Algorithms section.

5) When someone types too fast and makes spelling errors, the following pattern is observed:

hello --> hrllo ('e' and 'r' are placed adjacent to each other on the QWERTY keyboard)

find --> fimd ('n' and 'm' are adjacent to each other)

Therefore, if the incorrect letter is adjacent to the correct letter, the user might be typing too fast and is possibly in a casual conversation.

Cursing/Abusing characteristics

1) Who is being cursed in the sentence. Three cases arise from this:

a) The sender himself. (Someone abusing him/herself)

b) The receiver.

c) A third party, and/or there is no one to whom the curse is addressed.

Why? - Most intrusion attacks consist of sending messages which have cuss words in them, and often consist of the sender cursing him/herself.

The script to determine this has been explained in the Algorithms section.

2) If the curse is in response to a curse sent in a previous message by the receiver.

Why? - Some people generally curse only when replying to a previous message containing strong language.

3) If the sentence conveys no information other than cursing.

Ex. "You d**k !" -> satisfies this criterion

"A di** like you deserves to die !" -> doesn't

Why? - Again, most non-authentic messages are short ones, containing only strong language.

4) If "!,?,." characters are used excessively in the sentence containing curse words.

Contact Statistics and formal contacts

If the user uses fewer acronyms, typing trends, etc. for a particular contact but uses them liberally while chatting with others, this gives vital information about the sender-receiver relationship. Such contacts are said to be formal contacts.

Characteristics Database

All of the above message characteristics are stored in two databases:

1) general_info -> General characteristics

2) curse_info -> Curse characteristics

Every property is saved as a percentage: the number of messages in which it is present, relative to the total number of messages.

These percentages are later used by the verification script to check the authenticity.
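The percentage bookkeeping described above can be sketched as follows. This is a minimal illustration in Python rather than the project's PHP/MySQL code; the class name `CharacteristicStats`, its methods, and the characteristic names are assumptions made for the example.

```python
from collections import Counter


class CharacteristicStats:
    """Running counts of how many messages showed each characteristic."""

    def __init__(self):
        self.total_messages = 0
        self.hits = Counter()

    def record(self, present):
        """Register one message; `present` is the set of characteristics seen in it."""
        self.total_messages += 1
        for name in present:
            self.hits[name] += 1

    def percentage(self, name):
        """Share of messages (0 to 100) in which `name` was present."""
        if self.total_messages == 0:
            return 0.0
        return 100.0 * self.hits[name] / self.total_messages


stats = CharacteristicStats()
stats.record({"acronym"})
stats.record({"acronym", "continuation"})
stats.record(set())
stats.record({"haste_typo"})
print(stats.percentage("acronym"))       # 50.0
print(stats.percentage("continuation"))  # 25.0
```

Storing the two counts and deriving the percentage on demand avoids rounding drift when the value is updated after every message.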

Initialized and Uninitialized contacts

Initialized contacts are those for whom the script checks message authenticity; for uninitialized contacts it does not.

The question then arises of when the script initializes a contact for authenticity checking. By default, initialization occurs when user 'A' has sent a particular fixed number of messages to user 'B'.

In some cases, however, initialization can occur before that.

Ex. You may not send a lot of messages to someone whom you know formally. In such cases, the script looks out for the following factors:

1) For how many days that person has been in the user's contact list.

2) If any messages have been sent, whether they are 'clean' (based on the characteristics described above).

3) Whether the user sends more messages to other people than to this particular person.

The script checks these parameters every time a message is sent to an uninitialized contact. If the contact goes from uninitialized to initialized, the database is updated and the current message is tested for authenticity based on its characteristics.
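A rough sketch of this initialization decision is given below. The paper does not state the threshold values or how the three factors are combined, so the constants and the rule that all three factors must hold are assumptions; `ContactHistory` and `should_initialize` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class ContactHistory:
    messages_sent: int             # messages from user A to contact B so far
    days_in_contact_list: int      # how long B has been in A's contact list
    all_messages_clean: bool       # no acronyms, trends, curses, etc. so far
    avg_messages_to_others: float  # A's average message count per other contact


# Hypothetical thresholds; the paper does not give the values it uses.
DEFAULT_MESSAGE_COUNT = 50
MIN_DAYS_FOR_EARLY_INIT = 30


def should_initialize(history: ContactHistory) -> bool:
    """Decide whether authenticity checking should start for this contact."""
    # Default rule: a fixed number of messages has been sent.
    if history.messages_sent >= DEFAULT_MESSAGE_COUNT:
        return True
    # Early rule (assumed to require all three factors): an old contact,
    # only clean messages, and noticeably more traffic to other contacts.
    return (
        history.days_in_contact_list >= MIN_DAYS_FOR_EARLY_INIT
        and history.all_messages_clean
        and history.avg_messages_to_others > history.messages_sent
    )


print(should_initialize(ContactHistory(5, 45, True, 40.0)))  # True
```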

Suspicion and The Buffer Database

The algorithm uses two variables, "general_suspicion" and "curse_suspicion", which represent the software's current amount of suspicion of an intrusion. Both variables vary between 0 and an upper limit. If at any time the sum of the two variables exceeds the upper limit, an intrusion is flagged and the message is blocked. Hence, it is possible that the script assigns a "general_suspicion" and/or a "curse_suspicion" value to the current message but, as the sum is not beyond the upper limit, the message is not blocked.

Thus, if the script cannot find a concrete variation in texting behaviour, a suspicion value is assigned and stored, and the message is passed like a normal message.

If the message is flagged as suspicious, the current message characteristics are written to the buffer database rather than the normal database. The buffer database simply holds the message characteristics of suspicious messages. If the script later reduces both suspicion values to zero, the buffer table values are added to the normal database. If an intrusion is detected, all the buffer entries are simply deleted.
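The interplay between the two suspicion values and the buffer can be sketched as below. This is an illustrative Python model, not the project's code: `UPPER_LIMIT = 100` is a made-up value, the suspicion values are assumed to be supplied by the caller after the checker has run, and in-memory lists stand in for the MySQL tables.

```python
UPPER_LIMIT = 100  # hypothetical; the paper does not state the actual limit


class SuspicionTracker:
    """Routes message characteristics to the normal or the buffer store."""

    def __init__(self):
        self.database = []  # characteristics of trusted messages
        self.buffer = []    # characteristics of suspicious messages

    def process(self, characteristics, general_suspicion, curse_suspicion):
        """Return 'blocked', 'suspicious' or 'authentic' for one message."""
        # Both values are kept between 0 and the upper limit.
        general = min(max(general_suspicion, 0), UPPER_LIMIT)
        curse = min(max(curse_suspicion, 0), UPPER_LIMIT)

        if general + curse > UPPER_LIMIT:
            # Intrusion: discard everything collected while suspicious.
            self.buffer.clear()
            return "blocked"
        if general > 0 or curse > 0:
            # Not conclusive: deliver the message, park its characteristics.
            self.buffer.append(characteristics)
            return "suspicious"
        # Both values are zero: buffered entries are now trusted too.
        self.database.extend(self.buffer)
        self.buffer.clear()
        self.database.append(characteristics)
        return "authentic"


tracker = SuspicionTracker()
print(tracker.process({"id": 1}, 0, 0))    # authentic
print(tracker.process({"id": 2}, 20, 0))   # suspicious
print(tracker.process({"id": 3}, 0, 0))    # authentic
print(len(tracker.database))               # 3
```

Keeping suspicious characteristics out of the main store means an intruder's messages cannot shift the stored profile before the intrusion is confirmed or ruled out.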

Time Limit

If a message is flagged as perfectly authentic (both suspicion values are zero), any other message sent by the user within a specific time limit (8-10 sec) is not scanned for authenticity. It is assumed that an intrusion is unlikely to have occurred in such a small time interval. Only the message characteristics database is updated.

This technique saves the server the heavy computational task of processing authentic messages again and again, and reduces the chance of erroneous intrusion detections.
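A small sketch of this time-limit gate follows. Timestamps are passed in explicitly (in seconds) so the behaviour is easy to follow; the class name `ScanGate` and the choice of 8 seconds from the stated 8-10 second range are assumptions for illustration.

```python
TIME_LIMIT_SECONDS = 8.0  # the paper gives a range of 8-10 seconds


class ScanGate:
    """Decides whether a message must go through the authenticity script."""

    def __init__(self, time_limit=TIME_LIMIT_SECONDS):
        self.time_limit = time_limit
        self.last_clean_at = None  # time of the last perfectly authentic scan

    def needs_scan(self, now):
        """True unless a perfectly authentic message was scanned recently."""
        if self.last_clean_at is None:
            return True
        return now - self.last_clean_at > self.time_limit

    def record_scan(self, now, general_suspicion, curse_suspicion):
        """Remember the outcome of a scan that was actually performed."""
        if general_suspicion == 0 and curse_suspicion == 0:
            self.last_clean_at = now
        else:
            self.last_clean_at = None


gate = ScanGate()
gate.record_scan(0.0, 0, 0)   # first message scanned, perfectly authentic
print(gate.needs_scan(5.0))   # False: within the time limit
print(gate.needs_scan(9.0))   # True: the time limit has passed
```

In this sketch an unscanned message does not extend the window, since only a scanned message can be flagged as perfectly authentic.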

Message authenticity script

When a message needs to be checked, the message authenticity script comes into play. It is divided into two parts. One part analyses the general characteristics and updates the value of general_suspicion, while the other updates curse_suspicion. The general and curse characteristics of the user are retrieved from the database and, by applying a set of conditions, the appropriate suspicion value is generated.

General characteristics checker

The following conditions are tested:

1) Does the current message use acronyms, short words, etc., while the previous messages show little usage of such words?

2) Use of spam and sentence completions.

3) Is the contact formal? This is determined using the contact statistics.

Curse characteristics checker

The following conditions are tested for each of the cases:

a) 1st person abused

b) 2nd person abused

c) 3rd person abused

The conditions:

1) Is this a curse response?

2) Does the curse occupy the whole message (no other information), and does the database suggest otherwise?

3) Punctuation used.

If it is the first instance of this type of curse, then:

1) Is the curse directed at the user him/herself, and is there no other information in the message?

2) Is it a response curse?

3) Are there instances of other types of cursing?

These conditions and the suspicion generated for each can be understood much better by reading the source code.

Script Mechanism Infographic

The script checks and maintains the characteristics database according to a primary key, which is of the form: sender.userid+'_'+receiver.userid
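As a small illustration of this key format, the helper below builds the key for a sender/receiver pair. The function name `characteristics_key` and the numeric IDs are assumptions; the point of the example is that the key is directional, so A's habits when writing to B are stored separately from B's habits when writing to A.

```python
def characteristics_key(sender_id, receiver_id):
    """Build the primary key sender.userid + '_' + receiver.userid."""
    return f"{sender_id}_{receiver_id}"


print(characteristics_key(17, 42))  # 17_42
print(characteristics_key(42, 17))  # 42_17
```

If user IDs were strings that could themselves contain '_', the key would become ambiguous, so this sketch assumes numeric IDs.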

-Algorithms-

Determining who is being cursed in a sentence: (Cursing characteristics)

Principle

By analysing the position (and type) of a pronoun relative to the curse word in a sentence, some estimate can be made of who is being cursed in the sentence. If the person is cursing himself, pronouns like "me, myself, etc." are used. "You, your, etc." are used if the 2nd person is being cursed. "He, them, etc." are used for the 3rd person.

In a sentence, if a pronoun lies within close proximity (in terms of number of words) to the curse word, it can be said that the person being cursed corresponds to the type of the pronoun. There are cases where different types of pronouns are present in the same sentence; in such cases we need to assign some priority to these pronoun types. Sometimes a priority can be assigned; otherwise the sentence is flagged as a 3rd person or generic case.

Input

The current message is type coded into a string of alphanumeric characters. Each word, based on whether it is an acronym, a pronoun, a cuss word, etc., is transformed into one alphanumeric character.

Sentences are separated using the '.' character. The following set of rules maps each word to its type.

Word type                                                     Code
Normal word                                                   0
Cuss word                                                     1
Acronym                                                       2
Normal Haste                                                  3
Cuss Haste                                                    4
Normal short                                                  5
Cuss short                                                    6
1st person pronoun                                            7
2nd person pronoun                                            8
Other pronoun                                                 9
Article                                                       a
To be verb                                                    b
Spam (spelling mistake + other languages + non dictionary)    c
Number                                                        d

Ex. "I want to meet Ramesh on Sunday. I want you to arrange an appointment."

is converted to -> .7000c00.70800a0.

Such a type coded string is taken as the input.
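The type coding step can be sketched as follows. The word lists here are tiny placeholders standing in for the project's dictionary, acronym and cuss word databases, and the haste and short codes (3 to 6) are left out because they need the spelling analysis discussed later; all names in the block are illustrative.

```python
import re

# Placeholder lexicons; the real script looks words up in its MySQL databases.
CUSS = {"damn", "idiot"}
ACRONYMS = {"lol", "brb", "omg"}
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "your", "yours", "yourself"}
OTHER_PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}
ARTICLES = {"a", "an", "the"}
TO_BE = {"is", "am", "are", "was", "were", "be", "been"}
DICTIONARY = {"want", "to", "meet", "on", "sunday", "arrange", "appointment"}

WORD = re.compile(r"[A-Za-z0-9']+")


def code_word(word):
    """Map one word to its single-character type code."""
    w = word.lower()
    if w.isdigit():
        return "d"
    if w in CUSS:
        return "1"
    if w in ACRONYMS:
        return "2"
    if w in FIRST_PERSON:
        return "7"
    if w in SECOND_PERSON:
        return "8"
    if w in OTHER_PRONOUNS:
        return "9"
    if w in ARTICLES:
        return "a"
    if w in TO_BE:
        return "b"
    if w in DICTIONARY:
        return "0"
    return "c"  # not found anywhere: treated as spam


def type_code(message):
    """Encode a message, one character per word, with '.' between sentences."""
    sentences = [s for s in message.split(".") if s.strip()]
    coded = ["".join(code_word(w) for w in WORD.findall(s)) for s in sentences]
    return "." + ".".join(coded) + "."


print(type_code("I want to meet Ramesh on Sunday. I want you to arrange an appointment."))
# .7000c00.70800a0.
```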

Output

The function returns a 3-bit number, specifying who has been cursed in the message.

The bits 'D2 D1 D0' stand for '1st person, 2nd person, general or 3rd person' respectively.

1 -> yes, 0 -> no.

Program Flowchart

Assigning Pronoun Priority (the crux)

1) The position of the cuss word in the current sentence is stored. (It is assumed that curse words occur only consecutively; the entire group is treated as a single word.)

2) The first pronoun occurrence is found on both the left and the right side of the curse word.

3) The number of other words, like to-be verbs and spam, is noted.

4) The type of each pronoun on each side is noted.

5) If there is only one type of pronoun, and it is separated from the curse word by no more than 4 words (in total) and 2 spam words, the output is the type of that pronoun.

6) If different types of pronouns are present, the one closer to the curse word is given priority, provided it is no more than 4 words + 2 spam words away.

7) If no priority can be assigned on the basis of the 1st and 2nd person pronouns present, priority is given to the 3rd person.
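The priority steps above can be sketched for a single type-coded sentence as shown below. This is one possible reading, not the hosted source code: the distance rule is interpreted as "at most 4 words in between, of which at most 2 are spam", a tie between different pronoun types falls back to the general case, and a sentence without a cuss word returns 0. The function names are illustrative.

```python
CUSS_CODES = set("146")   # cuss word, cuss haste, cuss short
PRONOUN_BITS = {"7": 0b100, "8": 0b010, "9": 0b001}  # D2, D1, D0
GENERAL = 0b001
MAX_GAP_WORDS = 4
MAX_GAP_SPAM = 2


def nearest_pronoun(codes, start, step):
    """Walk from `start` in direction `step` to the first pronoun.

    Returns (pronoun_code, words_in_between, spam_in_between) or None.
    """
    gap_words = 0
    gap_spam = 0
    i = start
    while 0 <= i < len(codes):
        ch = codes[i]
        if ch in PRONOUN_BITS:
            return ch, gap_words, gap_spam
        gap_words += 1
        if ch == "c":
            gap_spam += 1
        i += step
    return None


def who_is_cursed(codes):
    """Return the 3-bit D2D1D0 result for one type-coded sentence."""
    positions = [i for i, ch in enumerate(codes) if ch in CUSS_CODES]
    if not positions:
        return 0
    # Treat the run of consecutive cuss words as a single word.
    start = end = positions[0]
    while end + 1 < len(codes) and codes[end + 1] in CUSS_CODES:
        end += 1

    found = [nearest_pronoun(codes, start - 1, -1),
             nearest_pronoun(codes, end + 1, 1)]
    near = [f for f in found
            if f and f[1] <= MAX_GAP_WORDS and f[2] <= MAX_GAP_SPAM]
    if not near:
        return GENERAL
    if len({f[0] for f in near}) == 1:
        return PRONOUN_BITS[near[0][0]]
    # Two different pronoun types: the closer one wins, a tie stays general.
    near.sort(key=lambda f: f[1])
    if near[0][1] == near[1][1]:
        return GENERAL
    return PRONOUN_BITS[near[0][0]]


print(format(who_is_cursed("81"), "03b"))     # 010
print(format(who_is_cursed("7b0a1"), "03b"))  # 100
```

Here "81" could stand for a message such as "You idiot" and "7b0a1" for "I am such an idiot", using the type codes listed earlier.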

Finding a typing personality trait. (Trend finding)

Objective: To find a pattern in the typing habits of a person. The patterns include excessive use of punctuation, emojis, etc.

Input

The current message in its original form.

Output

1 if the function finds a trend (pattern) or 0 if it does not.

Trends

1) Use of punctuation marks '!', '?', '.', etc. more than twice consecutively at a position.

2) Use of three or more emoticons in a message.

3) According to the case of letters in a word. The following cases arise:

a) The starting letter is in lowercase and any other letter is in uppercase.

b) Some of the other letters are in uppercase while the rest are in lowercase.

Words like oKay, BinGo, paiNT, etc.

If more than two such words are present, the message is said to have a trend in it.

4) Use of more than one space character ' ' to separate two words. If such spaces are frequent in a sentence, the message is said to have a trend in it.
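The four trend checks can be sketched with regular expressions as shown below. The emoticon pattern covers only a handful of common ASCII emoticons, and `MIN_SPACE_RUNS = 2` is a made-up reading of "frequent"; both are assumptions, as are the function names.

```python
import re

PUNCT_RUN = re.compile(r"[!?.]{3,}")                # e.g. "???" or "!!!"
EMOTICON = re.compile(r"[:;=][-']?[)(DPpOo/\\|]")   # e.g. ":)", ";-)", ":D"
WORD = re.compile(r"[A-Za-z]+")
MULTI_SPACE = re.compile(r"(?<=\S) {2,}(?=\S)")     # 2+ spaces between words
MIN_SPACE_RUNS = 2  # hypothetical threshold for "frequent"


def is_mixed_case(word):
    """True for words like oKay, BinGo, paiNT (not OK, not Hello)."""
    upper_after_first = any(c.isupper() for c in word[1:])
    has_lower = any(c.islower() for c in word)
    return upper_after_first and has_lower


def has_trend(message):
    """Return 1 if the message shows a typing trend, otherwise 0."""
    if PUNCT_RUN.search(message):
        return 1
    if len(EMOTICON.findall(message)) >= 3:
        return 1
    mixed_words = sum(1 for w in WORD.findall(message) if is_mixed_case(w))
    if mixed_words > 2:
        return 1
    if len(MULTI_SPACE.findall(message)) >= MIN_SPACE_RUNS:
        return 1
    return 0


print(has_trend("tYpInG LiKe ThIs"))   # 1
print(has_trend("See you tomorrow."))  # 0
```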

-Scope For Improvement-

As stated in the previous sections, this research project is based on providing an alternate security system; therefore the code for accessing databases and processing text is not efficient enough to run in commercial applications. This section gives some of the possible code improvements for making the application more efficient, faster and more secure.

1) Accessing the dictionary database

During text processing, every word present in the current sentence is looked up in the database to get its type, which is further used for finding the general and curse characteristics.

The database is maintained using MySQL, and to sort and organize the long list of words and their types, the following keys are used:

a) Length: The length of the word

b) Number of vowels

- A better RDBMS can be created to manage hundreds of thousands of words for the dictionary.

2) Finding spelling mistakes that are a result of typing too fast

This special case of spelling mistakes was discussed in the previous sections, and the sub-functions used to detect these can be modified to increase performance.

When a spelling mistake is detected, the following possibilities arise:

a) It is a chat acronym

-> Then the acronym database is used.

b) It is a 'short' word.

-> To incorporate a large set of word variations that users might use, a sound-key is used to detect words that are similar to each other. A word might be detected as a spelling mistake (spam), but could actually be a new chat word. So the software maintains a sound-key field that is checked.

c) It is a spelling mistake (not a result of typing too fast)

-> Flagged as spam.

d) Spelling mistake as a result of typing too fast.

-> To detect this, a set of words 'similar' to the input word is chosen. Then, for every word in the set, the difference from the input word is examined. If a word is found that differs from the input word in some letters, and each differing letter is adjacent on the QWERTY keyboard to the corresponding letter of the input word, the input is flagged as a 'typing too fast' spelling mistake.

Ex. Original word = 'flpw'

Correct word = 'flow' ('p' and 'o' are adjacent)

To get the set of similar words, fields like the number of characters present in the word, ASCII sum, etc. are used. There is scope for increasing code performance in this function.
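The adjacency test for 'typing too fast' mistakes can be sketched as below. As a simplification, only left/right neighbours within a keyboard row are treated as adjacent (which covers the paper's examples), and the candidate list is passed in directly instead of being selected by length or ASCII sum; the function names are illustrative.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_adjacency(rows):
    """Map each letter to its left/right neighbours on the same row."""
    adjacent = {}
    for row in rows:
        for left, right in zip(row, row[1:]):
            adjacent.setdefault(left, set()).add(right)
            adjacent.setdefault(right, set()).add(left)
    return adjacent


ADJACENT = build_adjacency(QWERTY_ROWS)


def is_haste_typo(typed, candidate):
    """True if `typed` differs from `candidate` only by adjacent-key slips."""
    if len(typed) != len(candidate):
        return False
    diffs = [(t, c) for t, c in zip(typed.lower(), candidate.lower()) if t != c]
    if not diffs:
        return False  # identical words are not a mistake at all
    return all(t in ADJACENT.get(c, set()) for t, c in diffs)


def find_haste_correction(typed, similar_words):
    """Return the first similar word that explains `typed`, or None."""
    for word in similar_words:
        if is_haste_typo(typed, word):
            return word
    return None


print(is_haste_typo("hrllo", "hello"))                          # True
print(find_haste_correction("flpw", ["flaw", "flow", "flew"]))  # flow
```

Building the adjacency map once and comparing position by position keeps each candidate check linear in the word length, which matters when the set of similar words is large.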

3) A generic trend finder script

The trend script covers only a small number of typing-habit cases. If a function could compare a sentence with its ideal version (according to correct grammar and punctuation), find the differences, and decide from those differences whether the sentence contains a trend, the script would become more reliable.

4) Extending for other languages

This project assumes that the users use English in all messages. All the functions work according to the rules of English grammar. Use of other languages in messaging will result in a high spam value. So, the script can judge whether the user uses other languages, but cannot perform most of its functions on such messages. To incorporate other languages, new functions based on other grammatical rules are required.

-Source Code-

The source code for the following functions has been hosted on GitHub. The links are:

a) Curse addresser:

https://github.com/pushkarmoi/curse_addresser

b) General characteristics checker:

https://github.com/pushkarmoi/generalchar_checker

c) Curse characteristics checker:

https://github.com/pushkarmoi/curse_checker

-Contact Information-

For queries, proposals, etc.

Email: [email protected]

Pushkar Gupta

PEC University of Technology,

Chandigarh,

India.