

UNIVERSITY OF SURREY
DEPARTMENT OF COMPUTING

BSC. (HONOURS) COMPUTING AND INFORMATION TECHNOLOGY

FINAL YEAR PROJECT 2002-2003

Text Classification of USENET messages for a Conversation Visualisation System

CS300: FINAL REPORT

JOLYON HUNTER

[email protected]

URN: 1930192

SUPERVISOR: Dr. Andrew Salway    DATE:

STUDENT: Jolyon Hunter    DATE:


BSC. (HONOURS) COMPUTING AND INFORMATION TECHNOLOGY
FINAL YEAR PROJECT

FINAL REPORT

ABSTRACT

Text Classification of USENET messages for a Conversation Visualisation System

The widespread use of computers, and the Internet in particular, as a means of communication has seen more and more people connecting around the globe in online discussions. These discussions are usually text-based, fast, simple and global in scope, yet they lack the non-verbal elements, the extra levels of meaning and the emotion which face-to-face, person-to-person communication provides. Surely, with the technology available to modern man, these meanings or emotions can be depicted visually?

    This report details the investigation of how messages or conversations within USENET newsgroups

    can be classified automatically as part of a system to visually represent online discussions. The project

    aims and objectives are defined before the nature and history of USENET is described to provide

    some context to the investigation. The current state-of-the-art in Visualisation Designs and Text

    Classification methods is described before a period of observation is undertaken in an attempt to

    define phenomena to visualise and features (i.e. particular words in the text) which characterise those

    phenomena.

    Once candidate features have been defined, a corpus of USENET messages from a range of

    newsgroups is collected in order to conduct in-depth analysis into those features. This analysis

    involves the use of the language engineering workbench System Quirk (to glean word frequencies

    from the sample), and the clustering toolkit gCLUTO (to aid interpretation of the data). From this

    analysis, example rules are defined which can be programmed into a system to classify the text

according to defined classes. An example conversation visualisation system is specified, before a system to classify text is developed, implemented, tested and evaluated.


ACKNOWLEDGEMENTS

If we knew what we were doing, it wouldn't be called research, would it?

    ALBERT EINSTEIN

Dr. Andrew Salway, for his continued support and advice with all aspects of this project. Also for keeping me sane, and making me realise that I did know what I was doing.

CodeZebra: Sara & Erik in particular, thanks for letting me participate.

    The Smashing Pumpkins, Dave Matthews Band and the rest, for getting me through the tough times;

    Starbucks Coffee Co. for the even tougher times;

Personal thanks must go to

My Dad, and my sisters Eliane & Abigail, for their continuing support and love. I could not have done this without you.

    And of course

Mum, always with me in spirit.

Forever my inspiration in everything. This is dedicated to you.


TABLE OF CONTENTS

    Abstract ...............................................................................................................................................................1

    Acknowledgements ...........................................................................................................................................2

    Table of Contents ...........................................................................................................................................3

    1. Introduction .................................................................................................................................................5

    1.1 Project Overview.............................................................................................................................5

    1.2 A History of USENET ..................................................................................................................7

    2. Systems to Visualise Conversation .....................................................................................................10

    2.1 Visualisation Designs....................................................................................................................10

    2.1.1 Loom ................................................................................................................................11

    2.1.2 CodeZebra.......................................................................................................................14

    2.1.3 Conversation Map..........................................................................................................16

    2.1.4 Netscan ............................................................................................................................18

    2.1.5 Visualisation Designs: How they compare.................................................................20

    2.2 Methods for Classifying Text ......................................................................................................21

    2.2.1 Smokey.............................................................................................................................22

    2.2.2 WEBSOM .......................................................................................................................24

    2.2.3 CLUTO............................................................................................................................26

    2.3 Discussion ......................................................................................................................................28

3. Features for classifying USENET conversations...........................................................................29

    3.1 Introduction ...................................................................................................................................29

    3.2 Initial Observations of USENET Newsgroups .......................................................................30

    3.3 Statistical Corpus Analysis...........................................................................................................33


    3.3.1 The Analysis Process .....................................................................................................35

    3.3.2 Cyberspeak: Features for classification?.....................................................................37

    3.3.3 Personal Pronouns: Features for classification?........................................................42

    3.3.4 Synonyms of Agreement/ Disagreement: Features for classification?...................47

    3.4 Summary: Creating some Rules?.................................................................................................51

    3.5 Further Analysis: Phenomena Classification ............................................................................54

    3.5.1 Summary: Creating more rules?...................................................................................59

    4. Developing a System to Classify text for use in a Conversation Visualisation System ......62

    4.1 Requirements Analysis .................................................................................................................63

    4.2 Design .............................................................................................................................................65

    4.3 Implementation .............................................................................................................................67

    4.4 Testing.............................................................................................................................................69

4.5 Evaluation ......................................................................................................................................71

    5. Closing Remarks ......................................................................................................................................75

    5.1 Achievements.................................................................................................................................76

    5.2 Discussion & Future Work..........................................................................................................78

    References.......................................................................................................................................................79

    Bibliography...................................................................................................................................................83

    Appendices .....................................................................................................................................................86

    APPENDIX 1: Supporting Material for Chapter 3 Analysis .......................................................87

    APPENDIX 2: gCLUTO Input files...............................................................................................99

    APPENDIX 3: Rule-Based Processor PERL code.................................................................101

    APPENDIX 4: System Testing & Evaluation Questionnaire....................................................103

    APPENDIX 5: Ten Sample Messages...........................................................................................106


1. INTRODUCTION

Cyberspace: A consensual hallucination experienced daily by billions of legitimate operators, in every nation...

A graphic representation of data abstracted from the banks of every computer in the human system...

WILLIAM GIBSON, Neuromancer (1984:51)

    1.1 Project Overview

Cyber-culture, and the social interaction that takes place on the Internet (or in Cyberspace), is a burgeoning area of research, not just for Computer Scientists, but for psychologists, sociologists and linguists alike. The principal means of communication and interaction on the Internet are online discussions, in the form of chat (IRC, or Internet Relay Chat), Instant Messaging and newsgroup discussion (USENET).

    Although newsgroups in themselves can be extremely valuable sources of information, they tend to

    be relatively inaccessible to outsiders or the general Internet user. For example, a discussion

centred around a topic requiring a high amount of technical knowledge in, say, Computing, might be inaccessible to a student of Archaeology (yet the information contained in the discussion may be of relevance to the Archaeology student; the problem is in accessing it). This has prompted some

    academics to investigate ways and means of visualising these online discussions, so as to convey the

    social qualities as well as the informative qualities of the discussion.

Pioneers in the field of visualising online discussions come from the Sociable Media Group (SMG) [40] at the Massachusetts Institute of Technology's Media Lab, the Social Technologies Group at UC Berkeley [41], and more recently from Microsoft Research [26], all of whom will be covered in detail in

    Section 2.

    Given the huge range of social interactions concerning many different topics, ranging from pet care

    advice and recipes to debates about the origin of life, it is clear that opportunities for research exist in


exploring new ways to classify, and ultimately display, these interactions in a more intuitive way than

    is currently possible.

    To facilitate this research aim, it will be necessary to carry out an in-depth examination of past and

    current research in the area of visualising online conversations. As stated in the project inception

document, preliminary investigations into this research suggest that the focus tends to be more on

    the visualisation process and the interface considerations of designing systems, rather than the actual

    means of classifying text. From these research papers it should be possible to determine what

    phenomena others have attempted to visualise, and this information will be useful to bear in mind

    when investigating USENET messages in greater detail.

    This project deals with automatic classification of USENET messages, which requires an analysis of

    textual features which exist in messages, and an investigation into whether these can be meaningfully

represented in machine language, i.e. coded into a program for automatic extraction.
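As a minimal sketch of what "coding features into a program for automatic extraction" might look like (this is an illustration in Python, not the report's actual rule-based processor, and the feature words shown are invented examples in the spirit of the cyberspeak, pronoun and agreement features analysed later):

```python
import re
from collections import Counter

# Hypothetical candidate feature words; the report's real feature sets are
# derived from corpus analysis in Chapter 3, not this list.
FEATURES = {"lol", "imho", "btw", "i", "you", "we", "agree", "disagree"}

def extract_features(message: str) -> Counter:
    """Count occurrences of candidate feature words in a message body."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return Counter(t for t in tokens if t in FEATURES)

counts = extract_features("IMHO you are right - I agree, lol!")
```

A frequency profile like `counts` is the kind of per-message data that a classifier, or a visualisation system, could then consume.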

    An aim of this project is to make the automatic collection of that data much easier, and therefore aid

    the design and implementation of a conversation visualisation system. Through greater automatic

    classification of the text contained within messages, it should be easier to use the data to construct

    meaningful visualisations: An ideal situation would be where one could go from numerical to

    graphical representations and back again without losing accuracy of the data. To facilitate this, a

    corpus of text (a minimum of 250,000 words) will be collated from one or a number of USENET

    newsgroups and subsequently analysed. The aim of this analysis will be to determine patterns and

    heuristics that will enable the creation of a rule set. These rules will then be incorporated into a

    system that will facilitate the automatic classification of messages and conversations.
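To make the idea of a rule set concrete, a toy rule-based classifier over per-message feature counts might look like the following (the class names, thresholds and feature words are hypothetical placeholders, not the rules derived from the analysis in Chapter 3):

```python
def classify(feature_counts: dict, total_words: int) -> str:
    """Toy rule-based classifier: assign a message to a class based on
    the relative frequency of feature words. All thresholds are
    illustrative, not the report's derived rules."""
    if total_words == 0:
        return "other"

    def rate(word: str) -> float:
        return feature_counts.get(word, 0) / total_words

    if rate("agree") > 0.01:
        return "agreement"
    if rate("disagree") > 0.01:
        return "disagreement"
    if rate("lol") + rate("imho") > 0.02:
        return "cyberspeak"
    return "other"
```

Each rule here is a simple threshold on a relative frequency; the actual rules would be tuned against the collected corpus.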

    The following report details the process of research, analysis and system development concerning the

    classification of USENET messages for a conversation visualisation system. However, in order to

    provide some context to the focus of this report, it is first necessary to understand more about

    USENET itself.


1.2 A History of USENET

Throughout human existence, a gifted few visionaries have influenced thinking and furthered the development of our species: one might include Darwin, da Vinci or Einstein in this group, but in recent times one might consider the relatively unknown J.C.R. Licklider. In his and Robert Taylor's The Computer as a Communication Device (1968) [25], Licklider foresaw many innovations which today we take for granted, such as team working via computers distributed over great distances, and video conferencing. It was when he became Director of the Information Processing Techniques

Office (IPTO) at the Pentagon's Advanced Research Projects Agency (ARPA) that he put in place

    the financial priorities which would eventually lead to the development of what we know today as

    The Internet. Effectively he held the purse-strings, but at the same time created a working

    environment where graduate students ran a multi-million dollar research project.

    The ARPANET developed out of this endeavour in the late 1960s and was originally intended as a

    military resource for sharing supercomputers across the United States. Academic institutions soon

    saw the benefits of this network for research and the sharing of knowledge, and in 1969 researchers

    at four universities connected their individual campuses up to ARPANET, becoming the first

    hosts on the network. Within a couple of years, most US Universities were connected and a

phenomenon known as e-mail had become the most popular application of the network. Over the

    following 30 years, the network grew exponentially and various protocols were developed for

particular types of communication between computers. Protocols such as the Transmission Control Protocol/Internet Protocol [TCP/IP] (the basis of the Internet), and the Unix-to-Unix Copy Protocol [UUCP] upon which USENET was developed.

    USENET was born in 1979 when Duke University graduate students Tom Truscott and Jim Ellis

implemented an idea they had about linking discussions within the UNIX community [37]. Using the

    UUCP they managed to get computers to communicate via auto-dial modems: Steve Bellovin of the

University of North Carolina created scripts which meant computers could automatically dial each other up and search for changes in the date stamps of files; if these date stamps were different, then the

    files were copied from one computer to the other. Truscott and Ellis described their ideas and plans

    for construction at the January 1980 USENIX (UNIX Users Group) conference; they emphasised

    the collaborative nature of their efforts and welcomed a collaborative development process. These

    basic concepts were built upon (principally by Stephen Daniel, a fellow Duke graduate) and

    developed using the C programming language. His program for accessing USENET, called A


    News, debuted at the summer 1980 USENIX conference. USENET was seen to be an alternative

to the expensive, clique-like ARPANET. Essentially, Truscott, Ellis and Bellovin created a 'poor man's ARPANET': an electronic community created cheaply and without the political problems associated with the actual ARPANET, which grew exponentially over the following years [14].

USENET developed into one of the largest asynchronous communications media ever created: it is communication structured in turns, the major benefit of this being that people from across the globe can discuss or collaborate on a particular topic regardless of the restrictions of daily schedule or

    time zone. Essentially it is an informal worldwide network of discussion groups called newsgroups

    a distributed database of messages, passed between clients and servers using the Network News

    Transfer Protocol (NNTP which superseded UUCP). Because of its distributed nature, there is no

central authority dictating boundaries or policing the USENET. By technical definition, anarchy exists; however, order and structure do exist. USENET is organised into subject hierarchies, with the first few letters of the newsgroup's name indicating the major subject category and subtopic names representing sub-categories. For example, rec.music.jazz represents recreational (rec) discussions about music (music), with the further specific sub-topic of jazz music (jazz).
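The hierarchy naming scheme is simple enough to sketch in code. The following is a hypothetical illustration (not part of the report's system) that splits a dotted newsgroup name into its successive hierarchy levels:

```python
# Split a newsgroup name into its hierarchy levels, from the major
# subject category down to the most specific sub-topic.
def hierarchy_levels(group: str) -> list[str]:
    parts = group.split(".")
    # Each dotted prefix is one level: "rec", "rec.music", "rec.music.jazz"
    return [".".join(parts[: i + 1]) for i in range(len(parts))]

levels = hierarchy_levels("rec.music.jazz")
```

The first element is always the major category (rec, comp, sci, and so on), which is why the first few letters of a group's name are enough to place it in the hierarchy.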

Figure 1: A typical Newsgroup Reader - messages are downloaded from a news server over NNTP. On the left is a list of subscribed newsgroups, with the top frame depicting the subject lines of messages and threads within that group. Below is a preview pane displaying the selected message.


Conferencing systems, of which USENET is the largest example, are essentially a refinement on e-mail discussion lists: a series of messages strung together and related through similar topics. The distinction can be made, though, that e-mail is a 'push' technology (people receive it regardless of whether they asked for it or actually want it), whereas newsgroups are a form of 'pull' technology, where users can select the groups and messages they want and request only those groups

    or messages. Figure 1 depicts a typical Newsgroup reader downloading messages from a news server

using NNTP. A more recent alternative for accessing USENET exists in the form of Google Groups, who provide a fully searchable archive of messages dating back to 1981, displayed in HTML format (see Figure 2). Google also provide a good overview of the associated terminology of USENET via their website [11]. This more user-friendly interface for USENET is a useful starting point for some

    initial observations into the structure and content of messages, and the language used in newsgroups,

    which will provide a basis for more in-depth investigation in the sections to come.

Figure 2: A view of "rec.music.jazz" from the Google Groups website (http://groups.google.com/) which provides searchable, archived newsgroups browsable in HTML format


2. SYSTEMS TO VISUALISE CONVERSATION

The soul never thinks without an image

    ARISTOTLE

2.1 Visualisation Designs

    Having established an understanding of USENET, it is time to investigate what is essential in the

    design of a system which can be used to classify text for use in a conversation visualisation system. It

    is first pertinent to identify the relevant research and projects already undertaken in relation to this

    topic. By examining the state-of-the-art it may be possible to further enhance or combine

    techniques for the purposes of this project, and any future implementations.

    This section attempts to compare and contrast the different systems which already exist, both in the

    field of text classification and visualisation systems, and will distinguish the key areas related to this

    project. Armed with this knowledge it should be possible to proceed and develop a method (or

    system) for classifying the text of USENET messages.

    It is perhaps worth mentioning that a good overview of designing graphical representations for

persistent conversations can be found in the paper Visualizing Conversation [7], written by Judith Donath, Karrie Karahalios and Fernanda Viégas.


    2.1.1 Loom

The original Loom project began as a class project within MIT's Sociable Media Group, and was taken on as a project by graduate student Karrie Karahalios [19, 20]. The underlying vision for the

    project was the development of a tool for visualising Usenet newsgroups. This was to be achieved by

    observing patterns in key events of the newsgroups: Such events included the beginning and end of

conversation threads, the tone of messages, and the entrances and exits made by participants in conversations, as well as observing 'the path traversed by users as they create their social fabric' [20] (the concept of fabric also being used in a message thread visualisation metaphor).

    The first implementation of Loom focused on visualisations of the mood or tone of the messages.

    The system itself has a simple structure: Newsgroup messages are collected then filtered through

what Karahalios calls a 'general classifier'. The designated mood clusters arising from this classification are then represented visually. (For the purposes of this system, four clusters, comprising 'angry', 'peaceful', 'news' and 'other', were chosen; these are consequently represented in the visual output [Figure 3].) Clicking on each individual container enables the user to view the message being

    visually represented. Karahalios admits the limitations of the system in its infancy (please refer to the

Loom website [20]), and apart from the obvious display improvements she recommends further work

    on classification, in particular breaking down the categories further.
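A toy version of Loom's pipeline (classify a message into one of the four mood clusters, then map the cluster to a display colour) might be sketched as follows. The colour mapping follows Figure 3 (red, green, yellow and blue for angry, peaceful, news and other); the keyword lists and the classifier itself are invented stand-ins, not Karahalios's actual 'general classifier':

```python
# Colours as in Figure 3: angry=red, peaceful=green, news=yellow, other=blue.
CLUSTER_COLOURS = {"angry": "red", "peaceful": "green",
                   "news": "yellow", "other": "blue"}

# Illustrative word lists only; Loom's real classifier is not keyword-based
# in this simple way.
ANGRY_WORDS = {"hate", "stupid", "idiot"}
PEACEFUL_WORDS = {"thanks", "peace", "agree"}
NEWS_WORDS = {"announce", "release", "report"}

def mood_cluster(message: str) -> str:
    """Assign a message to one of the four mood clusters."""
    words = set(message.lower().split())
    if words & ANGRY_WORDS:
        return "angry"
    if words & PEACEFUL_WORDS:
        return "peaceful"
    if words & NEWS_WORDS:
        return "news"
    return "other"

colour = CLUSTER_COLOURS[mood_cluster("thanks for the help, peace")]
```

The point of the sketch is the shape of the pipeline (collect, classify, colour), which is exactly where better text classification would slot in.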

    Subsequent implementations focused on abstractions of message threads, and how they can be

represented visually. Loom 2 is an evolution of the Loom project and is overseen by Karahalios' supervisor for Loom, and prominent SMG professor, Judith Donath. It was Donath's article in the Communications of the ACM [8] which provided a starting point for this project. In it she notes how

    the nature of Usenet makes it a vast source of information, but the current means of accessing it lack

    visual appeal, and tend to obscure many of the cues that aid us as human beings in our social

interactions. Donath maintains that more focus on these cues, coupled with improved visualisation

    could help the viewer perceive the online space more intuitively, and create a legible social

environment within which they can interact. Ultimately this is the aim of the Loom2 project: to use

    the salient features of social interaction to create a legible interactive visualisation of Usenet.


Figure 3: The original "Loom" visualising mood. It is possible to identify the four classification clusters of angry, peaceful, news and other, represented here by the red, green, yellow and blue colours respectively. This implementation displays the individual postings as coloured containers over a grid which represents a calendar.

In order to achieve this aim, the project members set themselves numerous questions to answer in the development of their system, principally based in the domain of social structure and activity. The focus of the Loom project being visualisation to aid social interaction, the data itself acted only as a starting point. At a fundamental level, Loom's data gathering technique comprises a subset of newsgroups collected from a news server and parsed, then stored in an Oracle database. This was just a small sample of Usenet, restricted to English-language newsgroups with a minimum number of posts in the last month.

Another visualisation project, developed by Donath and Rebecca Xiong at MIT, is PeopleGarden [48]. This system uses the metaphor of flowers in a garden to represent people and their posts within a newsgroup (to create what they term 'data portraits'). The abstract representations of a user's


    interaction history are illustrated in Figure 4: Two groups are represented, the left-hand side being a

    group with one dominant voice, the right being a more evenly distributed, democratic group.

Figure 4: "PeopleGarden". Each participant is represented by a flower. The stems of the flowers represent length of time in the newsgroup; the petals represent initial posts (blue) and replies (pink). The older posts are represented by less intense shades. This figure represents two groups: on the left-hand side a group with a single dominant voice, on the right a more democratic group.

PeopleGarden provides a good example of a visualisation metaphor, based upon the activity and participation of the users (much like Loom), but again it could be argued that more focus is needed on the analysis of the content of messages, the text itself rather than just the participants, in order to provide another level of the context that Donath et al. are seeking. More information about PeopleGarden can be found in Xiong and Donath's paper, and at the project's website [48].


2.1.2 CodeZebra

CodeZebra [4] is an ongoing, international research project, attempting to develop an entirely new kind

    of chat environment in visual 3-D space. In a similar vein as Loom, CodeZebra depicts navigation

    and relationships between groups, individuals, topics and conversations, using a central concept of

    animal print metaphors. Indeed, CodeZebra owes much to the natural world for this as users can

    play games with each other, create conversational histories and generate their own patterns based on

metaphors which permeate the CodeZebra world. For example, users who post long, considered responses might be classified as 'zebras', a typically 'academic' animal, whereas users who tend to frequent the same threads as others, and perhaps pick on other users, might be classed as 'hyenas', typically pack-hunters in the real world. Similarly there are classifications of 'butterfly', 'ocelot', 'cheetah' and 'peacock' amongst others, each with their own unique characteristics.
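A behaviour-to-animal mapping of this kind could be sketched as a simple function over posting statistics. The thresholds and criteria below are invented for illustration; they are not CodeZebra's actual classification logic:

```python
def animal_for(avg_post_length: float, same_user_reply_rate: float) -> str:
    """Map simple posting statistics to a CodeZebra-style animal pattern.
    Both thresholds are hypothetical placeholders.
    avg_post_length: mean words per post.
    same_user_reply_rate: fraction of a user's replies aimed at the
    same small set of other users."""
    if avg_post_length > 200:           # long, considered responses
        return "zebra"
    if same_user_reply_rate > 0.5:      # frequents the same threads/users
        return "hyena"
    return "butterfly"
```

In a real system these inputs would themselves come from automatic classification of the message text, which is the gap this project addresses.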

The need for this system originates from an inherent weakness of existing chat media: once a large number of people interact with each other they generate a lot of text, but it is all 2-dimensional, and the social and emotional context is sometimes unclear or lost altogether. The aim

    of CodeZebra is to create a dynamic visual depiction of the underlying associations between issues

    and topics, through providing interactive games and discussion structures. The 3-D environment

    attempts to create a visual guide to what is being said, by whom and with what emotional tones. Part

of the driving force behind CodeZebra is the bringing together of the worlds of Art and Science: in essence, the project creators say, artists construct metaphors, yet there are problems of representation within science; therefore a move away from realism and further into abstraction is necessary to aid communication between these disciplines. The metaphor of patterns is also important, as patterns show

    links between topics, emotional characteristics of topics and show relationships between postings.

A graphical explanation of CodeZebra's second prototype can be found below in Figure 5, and the software itself can be accessed online at the CodeZebra website [4]. The third prototype has just been released, and the system underwent a globally interactive road test during the recent Dutch Electronic Art Festival (DEAF) in Rotterdam, the Netherlands [3].

Head of the project is Sara Diamond from the Banff New Media Institute [1] in Canada, and research partners include V2 Labs (NL) [43], the SMARTLab Centre (UK) [34] and the University of Surrey (in particular Dr. Andrew Salway). CodeZebra's links with the University of Surrey, in particular with the supervisor of this project, will facilitate the sharing of information and the spirit of research


collaboration in the evolution of this project. With CodeZebra's emphasis being on visualisation, there is ultimately an essential need for the underlying system to analyse and classify messages accurately. Fundamentally, this is where it is hoped that this project will contribute to this

    field of research.

Figure 5: CodeZebra Prototype 2 explained


    2.1.3 Conversation Map

    The Conversation Map system is the brainchild of Warren Sack and formed the basis for his PhD

    thesis30 at the Massachusetts Institute of Technology (MIT) in 2000. The system attempts to map

    message threads alongside Social and Semantic Networks in the same application window (see Figure

    6).

    In Discourse Diagrams: Interface Design for Very Large-Scale Conversations 31, Sack classifies the

internet conversations taking place between many thousands of people at any one time as "Very Large-Scale Conversations". He alleges that it is the sheer size and variety of these conversations,

    such as those conducted within newsgroups, which render them difficult to understand or critically

    reflect upon for interested observers and participants alike. The system attempts to facilitate greater

    understanding through creating visualisations of the Social and Semantic networks in action in a

    particular thread of messages. In the aforementioned paper, Sack discusses the design criteria

    necessary in order to transform these social scientific representations into what he terms interface

    devices. The Conversation Map system he developed is such a device, and is essentially a browsing

    system for Very Large-Scale Conversations (VLSCs).

A social network involves drawing edges between a set of nodes labelled with the names of people, with the edges between the nodes representing interaction between those people. In a similar manner, semantic networks involve nodes and edges, with the nodes labelled with words or concepts and the edges representing the semantic relationships between those words or concepts.

Social and semantic networks are usually used in the social sciences as analytic devices, that is, as scientific models or representations of hypotheses. What Sack proposes is that they be used as generative devices, i.e. as a means of exploring VLSCs by serving as an interface into the archive of a discussion32. Effectively, this means they are diagrams of discourse.


Figure 6: The Conversation Map system. The upper left part of the screen shows the social network present within a group; the upper right shows the semantic network. In between is a list of themes present in the group, and below is a representation of the message threads within the group.

    Conversation Map can be used in a similar way to regular newsgroup readers like Eudora, Agent and

    Outlook Express, but has features which others do not provide. It performs a series of

computational linguistic and sociological analyses on the messages of the newsgroup, and presents its

    results in the form of a graphical interface (see Figure 6).

    In discussing his work (at an HCI Seminar 2000-2001 held at Stanford 32) Sack notes how a

    newcomer to a VLSC might want access to previous discussions as well as the social layout of the

    group in order to determine which participants and threads are relevant to the newcomer. This also

distinguishes which topics or themes bring participants together, or divide them as the case may be. Conversation Map does this by focusing more on summarisation, navigation and browsing rather

    than information retrieval.


2.1.4 Netscan

    Netscan began in 1994 as a graduate project at the University of California Los Angeles (UCLA)

    conducted by Marc Smith. Today it continues in the industrial research arena under the banner of

Microsoft Research. Compared to Loom and Conversation Map, Netscan has a much more grandiose aim: to map the entirety of Usenet.

    Smith, Fiore and Tiernan10 are quick to note that current newsgroup browsers offer very little (if any)

information about the history of authors in a newsgroup: for example, their activity, what other newsgroups they post to, or which other participants they converse with. They argue that it is these features which, if they were available in browsers, would provide a guide to the social context and

    interactional history that is typically used by participants in physical spaces. Current browsers present

information about the messages themselves (subject line, posting date and line count), which causes the

    user to focus their attention on the structure of the medium itself, rather than the qualities or value

    of the participants.

One of Netscan's aims is to develop a newsgroup interface which allows the user to sort and search by salient behavioural features, such as the fraction of each author's messages that initiated conversations as opposed to replies, or the number of days on which each author contributed a

    message to that particular newsgroup. By combining these features with the regular features which

    depict structure and thread development, Smith, Fiore and Tiernan believe they can extract valuable

    content out of the large social cyberspaces that comprise Usenet.

    Netscan connects to a news server that carries nearly 50,000 Usenet newsgroups and collects all of

    the messages in all the newsgroups. Information is gathered from message headers and

    stored/ maintained in a database. The analyser part of Netscan is then able to read selected

    portions of the database and generate reports and analysis of selected newsgroups over a specific

    period of time. A demonstration of this can be found on the main page of the Netscan website 35.

The Netscan website demonstrates a few of the tools/interfaces that Smith has developed in order to try and map Usenet, and Figure 7 depicts a boxplot representation of the rec.music hierarchy.


Figure 7: A Netscan boxplot of "rec.music". This shows the sub-groups present within the rec.music hierarchy; the size of each sub-group represents the number of posts to that group in the last month. The colours represent the increase (green) or decrease (red) in posting within that group compared to the previous month; the intensity of the colour indicates the level of increase/decrease.

It could be argued that Netscan, like the systems mentioned already, focuses upon the more sociological/visual representations of conversations, whereas there needs to be more research into automatic text analysis which incorporates the content of the messages as well as the features that Netscan has already extracted. Marc Smith notes how Netscan has succeeded in its initial

    aims, but that there is also scope for improvement with regard to content analysis36; Smith suggests

    that content analysis of message bodies could lead to the mapping of the diffusion of topics across

    USENET, and indeed other communications media. He further suggests that this could possibly

    assist studies into informal communication networks and the transmission of folk beliefs, as well as

    the development of academic disciplines.


    2.1.5 Visualisation Designs: How they compare

At a glance, here is how the various visualisation systems covered in the previous sections compare by feature. It is clear that they all offer something unique, yet there is a central theme to what they all require: more investigation into automatically classifying text.

Feature comparison (Loom, PeopleGarden, CodeZebra, Conversation Map, Netscan):

Participation analysis?                                         three of the five systems
Semantic structures represented?                                two systems
Social structures represented?                                  two systems
Message information extraction?                                 two systems
Message content analysis?                                       two systems
Author profiling/representation?                                three systems
Automatic classification of text?                               none
Requires more research into automatic text classification?      all five systems


    2.2 Methods for Classifying Text

    It is clear from the previous section that visualisation systems are plentiful, and vary greatly in how

    they visually represent conversations. Before being able to visualise such conversations, it is necessary

to impose some form of classification system with regard to the content of messages or

    conversations. Some initial work has been conducted into the area of automatic text classification,

    and the following section will introduce a number of different methods which could potentially be

    used or adapted for use in a future automatic text classification system for conversation visualisation.

Firstly, a rule-based method of classifying text exists in the form of Ellen Spertus' Smokey, a system which automatically classifies abusive "flame" messages. Next, Teuvo Kohonen's WEBSOM will be covered; this is a method for organising text documents and preparing visual maps of them to facilitate information retrieval, and is based on the Self-Organising Map (SOM) data processing method, which has its roots in neural computing. Finally, there is George Karypis' CLUTO clustering toolkit. This software package enables the user to cluster low- and high-dimensionality data, and in its recently released gCLUTO form provides a user-friendly method for analysing the characteristics of various clusters. From a more detailed look at these

    methods and systems, it should be possible to draw inspiration for a future system to classify

    USENET messages for a conversation visualisation system.


    2.2.1 Smokey

    Smokey was the brainchild of MIT graduate Ellen Spertus38. The system focuses upon abusive

    email messages, otherwise known as flames. The system aims to identify abusive, insulting or

offensive messages by looking for specific words, their context, and syntactic constructs. The

    prototype system builds a 47-element feature vector based on the semantics and syntax of each

    particular sentence. It combines these vectors for sentences within a message, so that effectively the

    resulting vector represents all the constituent parts (sentences) of the message.

    The messages are converted to one sentence per line and delimiters are put between messages

    (manually). The resulting text is then run through a parser developed by the Microsoft Natural

Language Processing Research Group38, and the parser output is converted by sed and awk scripts into Lisp s-expressions. The s-expressions are then processed through a set of rules developed in Emacs Lisp, producing the aforementioned feature vector for each message. The feature vectors were then evaluated using simple rules produced by Quinlan's C4.5 decision tree generator.

    Spertus reports that with a training set of 720 messages, and using the process detailed above, the

    system was able to correctly categorise 64% of flames and, in a separate 460 message set, 98% of

non-flames. It could be argued that this success rate could be improved, and Spertus acknowledges the limitations of Smokey with regard to elements such as grammatical errors, innuendo and sarcasm. For example, Smokey could mistake the statement "I'm glad to see that your incessant name-calling and whining hasn't stopped" for praise, because it starts with "I'm glad".

    The messages used in the prototype came from the webmasters of a number of controversial

    websites, including The Right Side of the Web (a conservative resource), NewtWatch (a site which

    criticises Newt Gingrich) and Fairness and Accuracy in Reporting (FAIR, a media watch group best

    known for questioning the claims of Rush Limbaugh). All of the messages from these sources were

in the form of email messages. Spertus recognises that there are many different types of flames in real-time communication, as facilitated by the WWW, mailing lists and Usenet newsgroups. However, she also notes that publicly posted messages, such as those found in newsgroups, tend to be more clever and less personal (i.e. more indirect) than the private email flames which Smokey focuses upon; thus, Spertus claims, publicly posted flame messages are harder to reliably

    detect. First instincts might tell a researcher to look for obscene expressions or vulgarities within the


    messages, to identify them as flames. However, Spertus found that only 12% of the flames actually

    contained vulgarities and over a third of the vulgar messages were not actually flames.

The most important element of Smokey is the set of rules used to process the textual input and classify the output. These are listed in full in Spertus' paper38, but examples include imperative statements containing "Look" (e.g. "Look forward to hearing from you"), and phrases containing a negative word near "you". The clear advantage of a method such as this is that a set of rules is relatively easily converted into a programmable algorithm. However, the scope for such rules would depend upon an in-depth analysis of USENET messages to determine if any patterns or features exist which could potentially be represented in rule form, and coded into a future system.
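The flavour of such rule-based feature extraction can be sketched in a few lines of Python. This is an illustrative sketch only: the rule names and patterns below are invented for the example and are not Spertus' actual 47 features, though the per-sentence vectors are combined into a message vector in the manner she describes.

```python
import re

# Invented illustrative rules; Smokey's real rule set is far richer.
RULES = {
    "insult_you": re.compile(r"\byou\b.{0,30}\b(?:idiot|stupid|moron|loser)\b", re.I),
    "imperative": re.compile(r"^(?:get|learn|go)\b", re.I),
    "praise_glad": re.compile(r"^i'?m glad\b", re.I),
    "shouting": re.compile(r"\b[A-Z]{4,}\b"),
}

def sentence_vector(sentence):
    """Binary feature vector for a single sentence."""
    s = sentence.strip()
    return {name: int(bool(rx.search(s))) for name, rx in RULES.items()}

def message_vector(sentences):
    """Combine per-sentence vectors so a feature fires for the whole
    message if it fires in any constituent sentence."""
    combined = {name: 0 for name in RULES}
    for sentence in sentences:
        for name, value in sentence_vector(sentence).items():
            combined[name] |= value
    return combined
```

Note how this simple combination already reproduces the false positive discussed earlier: a message beginning "I'm glad..." triggers the praise rule regardless of the sarcasm that follows.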


    2.2.2 WEBSOM

The Self-Organising Map24 (the SOM in WEBSOM) was developed by Teuvo Kohonen at the Helsinki University of Technology, Finland. This is a method which facilitates the automatic organisation of

    complex datasets, and provides a means of creating visual maps of these datasets to aid in the

    retrieval of information. It has also been the basis of more than 3000 published studies worldwide,

    with applications ranging from image analysis and speech recognition to medical diagnoses and

    telecommunications applications.

    With WEBSOM45, documents are automatically arranged into a visual map using the SOM method.

Before this can take place, the textual content of the documents is converted into numerical form: several hundred numeric values correspond to each document. A certain combination of these values within a document corresponds to a word; the larger this combination is in the document's encoding, the more frequently that word occurs. Words are also weighted according to how common they are. A pre-processing stage removes punctuation marks, numbers and other non-standard characters, and the most common and the rarest words are excluded.
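The described pre-processing and encoding steps can be illustrated with a small sketch. This is not WEBSOM's actual algorithm (which uses random projections and much larger vocabularies); it simply shows the stripping of non-letter characters, the exclusion of the most common and rarest words, and frequency-based weighting, with illustrative thresholds (`min_count`, `top_common`) chosen for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Strip punctuation, numbers and other non-letter characters."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()

def build_vectors(documents, min_count=2, top_common=2):
    counts = Counter(w for doc in documents for w in tokenize(doc))
    common = {w for w, _ in counts.most_common(top_common)}  # drop most common
    vocab = sorted(w for w, c in counts.items()
                   if c >= min_count and w not in common)    # drop rarest
    weights = {w: 1.0 / counts[w] for w in vocab}            # frequency weighting
    vectors = []
    for doc in documents:
        tf = Counter(t for t in tokenize(doc) if t in vocab)
        vectors.append([tf[w] * weights[w] for w in vocab])
    return vocab, vectors
```

The resulting numeric vectors are the kind of input a SOM can then organise into a two-dimensional map.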

    A zoomed view of WEBSOM can be found in Figure 8. Here, the user has already clicked on the

    map image with their mouse. This zoomed view shows white points or nodes which allow the user

    to investigate the contents of individual map units (threads of messages). The Arrow button in the

    top left-hand corner allows the user to move to a nearby cluster. The map image contains labels

which are examples of the core vocabulary of the area in question. The labels give a general idea of the

    topics in the document collection. The colouring of areas of the map represents the density of

    documents in that area. Light areas contain more documents.

As the WEBSOM site demonstrates, this tool is especially good with large bodies of text such as USENET, and has great potential in the classification of USENET messages right down to individual words. With the SOM base, there is also the potential for it to be adapted into a system incorporating an adaptive/learning neural network which could dynamically adapt to the ever-changing environment of USENET.


Figure 8: "WEBSOM", a zoomed view. Each white dot is a map node. Colour denotes the density, or clustering tendency, of the documents: white areas are clusters and dark areas empty space, "ravines", between the clusters. The text on the right-hand side denotes which topics are being represented.


    2.2.3 CLUTO

    This software facilitates the clustering of low and high-dimensionality data, and has application areas

    including customer purchasing transactions, Geographic Information Systems (GIS), general science

and information retrieval. In the sense of information retrieval, clustering is based on the idea that objects (usually documents, but in this case keywords) clustered together, that is, closely associated with each other, tend to be relevant to the same query or topic domain. Formally, this is known as the Clustering Hypothesis:

    Closely associated documents tend to be relevant to the same requests 29
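The hypothesis is usually operationalised with a similarity measure over document vectors, cosine similarity being the common choice. A minimal version is easy to sketch, assuming documents have already been converted to numeric term vectors:

```python
import math

# Cosine similarity over term vectors: under the clustering hypothesis,
# documents about the same topic should score close to 1.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A clustering algorithm then groups objects so that within-cluster similarities are high and between-cluster similarities are low.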

    The CLUTO21 package includes three different classes of clustering algorithms which operate either

in the objects' similarity space or in the objects' feature space. These algorithms are based on the partitional, agglomerative and graph-partitioning paradigms. Fundamentally, the software package allows a user to input data in the form of a matrix; the software then processes this data into

    clusters and enables the user to analyse the discovered clusters through looking at the relations

    between objects within each cluster and the relations between different clusters. CLUTO is

    particularly good at identifying the features which best describe or distinguish each cluster. Based

    upon studies13 and 22 conducted into the effectiveness of the algorithms, it is generally held that

    partitional clustering offers the best results.

CLUTO is freely available to download21 for Linux, Sun and Windows platforms (the

    current version is 2.1). The package features two programs, vcluster (which takes as an input the

    multidimensional representation of the objects that need to be clustered) and scluster (which takes

    the similarity matrix/ graph between these objects as its input). Essentially both programs perform

    the same functions, and are invoked via the command line using the following call:

vcluster [optional parameters] MatrixFile NClusters

where MatrixFile is the name of the file that stores the n objects to be clustered. With vcluster, each one of those objects is considered to be a vector within m-dimensional space, and the collection of these objects is represented in an n x m matrix. NClusters is the number of clusters that are required

    (default is usually 10). The optional parameters which can be specified are given in detail and


explained in the CLUTO manual; these include directing output to named files, setting different clustering methods, the number of clusters, etc.
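As a rough illustration of preparing input for vcluster, the following sketch writes a dense matrix file of the shape described above: a header line giving n (objects) and m (dimensions), then one row of m values per object. The exact file format (and the sparse variant) should be checked against the CLUTO manual, and the file name here is invented for the example.

```python
# Write a dense matrix file of the shape vcluster expects: a header line
# with n and m, then one row of m space-separated values per object.
# Verify the exact format against the CLUTO manual before relying on it.
def write_cluto_matrix(path, rows):
    n, m = len(rows), len(rows[0])
    with open(path, "w") as f:
        f.write(f"{n} {m}\n")
        for row in rows:
            f.write(" ".join(f"{value:g}" for value in row) + "\n")

# Afterwards, from the shell:  vcluster msgs.mat 10
```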

    Not only is CLUTO a powerful and free piece of software, but it can also be used within bespoke

    applications thanks to its availability as a set of C libraries. CLUTO can effectively become an

integral part of a larger information retrieval or text classification system.

As of February 2003, Version 0.5 (alpha release) of gCLUTO21 became available. This is a GUI-based version of CLUTO intended to make the clustering toolkit more user-friendly (see the screenshot in Figure 9). Essentially this eliminates the command line-based operation of CLUTO and turns it into a point-and-click interface; the command-line input depicted earlier is simply replaced by a standard Windows-style GUI Browse button for input files.

Even more recently, a web-based version was released: wCLUTO21. This is a web-enabled data clustering application designed for the clustering and analysis of gene-expression datasets, and is based on the CLUTO toolkit.

Figure 9: gCLUTO, a graphical clustering toolkit based on CLUTO. Here we see the clustering solution at the top of the screen, with the Matrix and Mountain visualisations of these clusters at the bottom left and right respectively.

Clearly, from all of these different implementations, and the fact that CLUTO's libraries can be accessed from within a bespoke C/C++ program, this toolkit is highly versatile and could easily be incorporated into a future implementation of a system to automatically classify text for a conversation visualisation system. Not only is this a very powerful statistical analysis tool, it also has massive potential as part of a future system: real-time clustering of text would be of immense use in identifying text with similar or dissimilar characteristics, and therefore in classifying that text.


    2.3 Discussion

    In this chapter the state-of-the-art in visualisation systems has been characterised. What is apparent

    from current systems is that the focus so far has generally been biased towards the actual

    visualisation requirements of the system, rather than the underlying processing and pre-processing

    needed in order to classify text for use in such a visualisation system. Donath and her contemporaries

    at MIT have tended to focus more on the social context of the interactions involved in online

    communication, as has Warren Sack to a certain extent. Others such as Smith et al have focused

    more on the analytical aspects of such a system, yet there seems to have been little (if any) research

    into the bare-bones computational analysis aspects of a visualisation system. To a certain extent, the

    CodeZebra team are now pioneering this work, but it is hoped that this report will also be of value in

    this field.

When considering this topic from a computational analysis viewpoint, it becomes clear that to successfully analyse the text involved (in order to classify it), one must have an idea of some techniques for analysing text. Therefore, rather than attempting to reinvent the wheel, it is only practical to assess the current state-of-the-art with regard to text classification as well. Section 2.2

    described three different techniques and systems for text analysis, the rule-based Smokey, the neural-

    network inspired WebSOM and the clustering toolkit of CLUTO. The example set by Smokey is an

    interesting one, and is a good foundation upon which to base a simple first system. WebSOM and

the possible use of neural/Bayesian networks in a future system would be an interesting and rewarding avenue of research, but is beyond the scope and timescale of this particular project. However, the CLUTO toolkit has great potential, and in the case of this project gCLUTO would provide an excellent means of analysing USENET text; even so, implementing a fully functional system using the CLUTO libraries is probably also beyond the scope of this project.

    In conclusion, it would be worthwhile attempting to develop a simple rule-based classification system

    as a basis for this project, with the possibility that elements of CLUTO could be incorporated in

some way. Future implementations could use CLUTO's libraries to provide on-the-fly

    clustering to aid classification, and similarly some form of Self-Organising Map might alternatively be

    used in a learning system. Before any system can be attempted, it is essential to conduct some in-

    depth analysis of USENET messages to see if any patterns or heuristics exist, and whether any rules

    can be derived from them. This analysis is described in the following chapter.


3. FEATURES FOR CLASSIFYING USENET CONVERSATIONS

You and me are just different points of view

    CHANG-TZU

    3.1 Introduction

    It has been established that current visualisation systems are sophisticated, yet little research has been

    conducted into the underlying system which classifies text for those visualisations. After

    characterising the fields of visualisation systems and text classification methods, it is important to

realise that any potential system development begins at an atomic level: with the text itself.

    In Chapter 1, the immense size and scope of USENET was described along with its historical

    context. With such a large source of data, it is necessary to form some idea of the structure and

    specific features which exist within newsgroups. To facilitate this understanding, some initial

observations of USENET are documented in Section 3.2; this aims to determine what key features possibly exist, and which features might be worthy of further in-depth investigation.

    Following this period of observation, a corpus of USENET messages was collected at random and

    analysed using the processes documented in section 3.3. The final sections of this chapter deal with

    the statistical analysis of the corpus, ending with the formulation of some example rules for a

    potential system implementation.


3.2 Initial Observations of USENET Newsgroups

    In order to get a general feel for USENET messages and their constructs, a period of Initial

    Observation of USENET messages was undertaken. This consisted of examining various

    newsgroups and messages contained within them in an attempt to identify key (or simply interesting)

    features existing in these messages. For

    the purposes of these observations,

    Google Groups provided an effective

    way of navigating and reading groups

    and messages. Google Groups archives

    USENET in its entirety back to 1981,

    and makes groups and articles

    (messages) available in HTML format,

    browsable and searchable via their

    website11. Originally Deja News created

    this user-friendly interface for USENET

    discussions in 1995; it was later acquired

    by Google Inc. in 2001.

    Initially, the most striking aspect of

USENET is the variety of groups which exist, each one facilitating an even more diverse range of topics (Figure 10 shows the Google Groups listing for the most popular top-level domains; there are hundreds more top-level and sub-level groups). This variety is also reflected in the composition and structure of the messages themselves.

    Posts can range from simple one-line questions with short replies, basic statements of opinion or

    thread-based opinion polls to lengthy, thought-out responses which emphasise the knowledge and

    writing style of the author. Frequently, message replies quote the original or previous message in their

    body, some people preferring to reply after the quote, some before, and some taking the message one

    section at a time by interspersing their replies with quoted text.

Figure 10: The Google Groups main page at http://groups.google.com/. This site was used for initial observations of USENET newsgroups and the messages within them. Google Groups archives all of USENET back to 1981, more than 700 million messages.

    In the more free-spirited discussion groups, personal opinions are highly valued by authors, with

subsequent personal insults being the reward if an author's views or opinions are disagreeable to

    another author. Such insults or flames are characterised by the use of profanity and capitalised

    words (which indicate shouting in the context of online conversation). Another feature which is


    specific to communication over the internet is the popularity of acronyms and abbreviations for

    popular or frequently used words and phrases, sometimes known under the umbrella term

    Cyberspeak. Some examples of these include Laughing-Out-Loud (LOL), Rolling-On-The-Floor-

Laughing-My-Ass-Off (ROTFLMAO), and By-The-Way (BTW). These features are generally net-specific, so they would be interesting to investigate further with respect to this project, to see how widespread their usage is and whether they can be used to classify text.

    Also notable are the representations of emotion in the form of emotional icons or emoticons.

These are typically constructed from two or more punctuation symbols; for example, a colon, a dash and a closing bracket together represent a smiling face, :-). These features are also specific to online

    communications, and as they provide some representation of facial expressions or facial actions (in

    this non verbal medium), then they could also be indicators of emotion or mood.

In some cases, there are also textual representations of expressions, such as grinning (*g*, for example), which might also be worth investigating. The use of asterisks (*) around words to provide emphasis is also a feature, for example:

The Beatles were ok, but I *really* like the Rolling Stones' music

Sometimes this is used interchangeably with all-capital letters, which in the context of online conversations represent shouting or heavy emphasis. Coupled with this, it might be worthwhile investigating

    the use of exclamation marks and other punctuation, to see if these can aid the classification of text.
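Several of these surface features can be counted with simple regular expressions. The following sketch is illustrative only: the emoticon and acronym inventories are tiny samples, and real USENET text would need much fuller lists.

```python
import re

# Small illustrative inventories of the net-specific features observed above.
FEATURES = {
    "emoticons": re.compile(r"[:;]-?[)(DP]"),              # :-) ;-( :D etc.
    "acronyms": re.compile(r"\b(?:LOL|ROTFLMAO|BTW|IMHO)\b"),
    "emphasis": re.compile(r"\*\w+\*"),                    # *really*
    "shouting": re.compile(r"\b[A-Z]{4,}\b"),              # runs of capitals
    "exclamations": re.compile(r"!+"),
}

def surface_features(text):
    """Return a count of each surface feature found in the text."""
    return {name: len(rx.findall(text)) for name, rx in FEATURES.items()}
```

Counts like these could feed directly into the kind of feature vector a rule-based classifier builds per message.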

    With regard to the overall content of USENET, the newsgroups concerning more academic or

    technical subjects tend to be populated by a more civilised readership; yet sometimes certain

    subjects are inaccessible to visiting outsiders due to their limited knowledge in that domain.

Responses and initial posts in groups like these are usually well structured and fairly detailed, as indicated by sentence and paragraph length and structure.

    With close-knit groups, such as highly technical academic groups, there is an element of prior

    knowledge of subjects which facilitates the telling of jokes based around that knowledge. Groups

    such as this, where posters are generally known to each other or articulate their views soundly, are on

    the whole more civil than other groups (where for example, a new poster might get abusive replies

    for asking a valid question). See also section 2.2.1 regarding Smokey38, a classification system

    developed to aid the identification of these abusive flame messages.


    Each message posted to USENET also has a unique structure identifying it. This is in the form of header information which describes when the message was posted and the server it was posted to, and which uniquely identifies the user who posted the message. Another feature of messages, this time dependent on the individual user, is the signature which users sometimes attach to their messages. These are usually text-based and can provide information about the poster (i.e. email, website or telephone information) but can also sometimes include elements of "ASCII art" (pictures made up from standard ASCII characters). Another popular trait is to include a favourite quote in signatures.

    When users reply to messages, the previous message is often quoted to provide context to the reply (the standard message quotation starting with the ">" symbol, with multiple quotes resulting in many of these). Such replies might start with a question or include questions, with some replies containing URLs referring the reader to extra information on websites. However, some threads, whilst lengthy (sometimes hundreds of messages long), can tend to be just idle chit-chat between two or more posters.

    There are also more specific linguistic features worth investigating: the use of imperative statements ("Learn x before you talk about it" or "Get with the program!"), which also links to the use of personal pronouns. Directly addressing a fellow user as "you" could be an indicator of response tone (this could depend upon some understanding of the context of the message or conversation).

    From the features detailed above there appear to be a number of potential phenomena to visualise, some of which have been attempted by a few of the systems mentioned in section 2. These include conversational tone, mood, emotion, themes across groups, contributor information (dominance, reliability, links to others, activity etc.) and issues of formality or informality. The scope of these phenomena, and of the features to identify them, is potentially huge; therefore a select few of these features will be examined when analysing the corpus of USENET messages.


    3.3 Statistical Corpus Analysis

    The aim of this stage of analysis is to determine whether the features identified offer an insight into, or an indication of, how text can be classified, through the derivation of a set of rules based upon any patterns which exist (or do not exist) within the corpus. Such an insight might be supported, for example, by a significant distribution of a cue word across the sample corpus. The process used is documented in the following sections, but first it is necessary to describe the origin of the sample corpus being used in this project.

    The University of Surrey news feed (news.surrey.ac.uk) was used to collect a corpus of messages with which to conduct this research. The Tin [42] newsreader under UNIX has functionality which allows the archiving (saving) of all messages within subscribed newsgroups to a specified location, and so was utilised for this project. For the purposes of this analysis, newsgroups were picked at random from a variety of different topics. The list of archived newsgroups included alt.books, alt.music, alt.music.lyrics, alt.music.smash-pumpkins, alt.politics.democrats, alt.politics.usa, alt.politics.usa.republicans, sci.environment and uk.media. Tin stored each newsgroup in a folder named after the group, and each article (message) was uniquely numbered and stored in that folder.

    The corpus amounted to a total of 6005 messages, containing 4,303,358 words. It should be noted that message header information (information used by mail servers and clients to identify and route messages) was also included in this count; however, even if header information accounted for a third of this number, the total massively exceeds the anticipated minimum of 250,000 words stated in this project's Inception Document.

    Using System Quirk [39], a language engineering workbench/tool developed by the University of Surrey, it was possible to determine the frequency of certain words within the corpus (using the Kontext tool). These keywords were specified in a separate file known as a "Word List". Kontext allows various functions to be performed on a corpus of text, from Indexing (counting the frequency of words) to the more complicated Keyword In Context (KWIC) function, which allows a user-specified number of words either side of the keyword to be displayed (thus displaying some of the context in which the word is being used). For the purposes of this project, Kontext's frequency counting functionality will be employed.


    As a precursor to this analysis, Monday 27th and Tuesday 28th January 2003 saw a collaborative workshop take place between myself, Sotiris Rompas (UniS MSc student), Dr. Andrew Salway (UniS) and Erik Kemperman, an AI programmer from V2 Labs [43] in the Netherlands and a key member of the CodeZebra project team (see section 2.1.2). The idea behind this workshop was to share ideas and get a feel for possible classification cues existing in the text of USENET messages. For this purpose, the corpus created for this project was used. Full details of the work undertaken were documented [23], and included some basic work with cyberspeak and personal pronoun wordlists, and also a look at the theme of agreement and disagreement using a wordlist comprised of synonyms of "agree" and "disagree" obtained from Princeton University's WordNet thesaurus [45].

    The following subsections continue and expand upon the initial groundwork established in the workshop by examining the features of cyberspeak, personal pronouns and synonyms of agreement and disagreement in greater detail, through a comprehensive analysis of the entire corpus (using System Quirk) and further clustering analysis using CLUTO. The analysis process is detailed in the next section, with the following sections dealing with the individual phenomena to be analysed.


    3.3.1 The Analysis Process

    In order to glean meaningful results from the data produced by System Quirk, the results (i.e. frequencies produced by Indexing) were transferred into a Microsoft Excel spreadsheet. With figures for the number of words within all messages in each group, it was also possible to calculate relative frequencies of the keywords. All of this data is tabulated at the end of this report in Appendix 1, and should be referred to as a means of complementing the discussion in the following sections.

    As described earlier (in section 2.2.3), the CLUTO clustering toolkit offers many functions and is extremely versatile; therefore the decision was made to utilise some of the basic functionality of this package in order to analyse the data obtained in the first stage of analysis. In theory, clustering these results should enable us to discern whether or not the keywords contained in the System Quirk wordlists are being used in a similar way (i.e. have associated characteristics, such as similar relative frequencies) across all or a portion of the newsgroups in question. For the purposes of this project it was decided to use the newly released gCLUTO, the more user-friendly package developed by the inventors of CLUTO [21]. Whilst gCLUTO is a separate application, it is still based heavily upon the CLUTO toolkit, and as such it proved worthwhile experimenting with the original command-line CLUTO and referencing the CLUTO manual when interpreting gCLUTO's output.

    The input file for CLUTO and gCLUTO is in the form of a matrix (a ".mat" file). The internal format of the file itself is straightforward, with the first line containing two numbers: the first being the number of rows, the second the number of columns in the matrix. Below this line the values for the cells of the matrix can be found (for an example see Appendix 2). Two other files are needed (".clabel" and ".rlabel" files) for labelling the columns and rows respectively, and these contain the appropriate labels, one per line. In CLUTO, each of these files is specified using the command described earlier, whereas in gCLUTO the user can select these files using a point-and-click interface.
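    As a minimal sketch (not the project's actual tooling), the three input files described above could be produced as follows; the filenames, newsgroups, keywords and values here are purely illustrative.

```python
# Illustrative sketch of writing CLUTO/gCLUTO input files: a dense ".mat"
# matrix plus ".rlabel" and ".clabel" label files, one label per line.
rows = ["alt.music", "alt.politics.usa"]   # newsgroups (one per matrix row)
cols = ["lol", "btw"]                      # keywords (one per matrix column)
matrix = [[0.12, 0.05],
          [0.22, 0.09]]                    # example relative frequencies

with open("cyberspeak.mat", "w") as f:
    f.write(f"{len(rows)} {len(cols)}\n")  # first line: #rows #columns
    for row in matrix:
        f.write(" ".join(str(v) for v in row) + "\n")

with open("cyberspeak.rlabel", "w") as f:  # one row label per line
    f.write("\n".join(rows) + "\n")

with open("cyberspeak.clabel", "w") as f:  # one column label per line
    f.write("\n".join(cols) + "\n")
```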

    The relative frequency data contained in the Excel spreadsheet was converted into the matrix input format used by gCLUTO (and CLUTO), and fed into the program using the default setup for the Repeated Bisection clustering method. This method involves the input matrix being split into two clustered groups; one of those groups is then selected and further bisected, and this continues until the specified number of clusters is produced (by default 10). CLUTO uses the Repeated Bisection clustering method by default (others can be specified; please refer to CLUTO's manual for more


    details [21]). It was felt that leaving everything at the defaults would give a satisfactory analysis, as the purpose of using CLUTO in this case is only to interpret/visualise the data rather than to perform in-depth analysis using different clustering techniques. Of course, given the versatility and options of the CLUTO toolkit and its derivatives, such in-depth analysis would not be difficult to achieve if so desired.
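    The repeated-bisection procedure described above can be illustrated with a toy sketch: start with one cluster and repeatedly split the largest remaining cluster until the requested number of clusters exists. Note that CLUTO optimises a criterion function when bisecting; splitting around the mean, as below, is an invented simplification for illustration only.

```python
def bisect(values):
    """Split a list of numbers into two non-empty groups around their mean."""
    pivot = sum(values) / len(values)
    low = [v for v in values if v <= pivot]
    high = [v for v in values if v > pivot]
    return (low, high) if low and high else (values[:1], values[1:])

def repeated_bisection(values, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(values)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)   # pick the largest cluster
        clusters.extend(bisect(clusters.pop(0)))
    return clusters

clusters = repeated_bisection([0.10, 0.12, 0.50, 0.52, 0.90, 0.95], 3)
```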

    This in theory provides a means of determining whether similarities or patterns exist within and between the groups; if so, clustering these features should depict this. If similarities or differences do exist, then it should facilitate the derivation of some rules upon which a suitable automatic text classification system could be built.

    gCLUTO in particular produces some interesting visualisations of the data. Alongside the clustering solution, the user can create a matrix visualisation and/or a mountain visualisation. The mountain visualisation represents the relative similarity of clusters as well as their size, internal similarity and internal deviation. Each cluster is represented by a peak in the terrain, with the shape of the peak being a Gaussian curve giving a rough estimation of the distribution of data within that cluster. Its height is proportional to the cluster's internal similarity, and its volume is proportional to the number of elements within the cluster. The colour of the peaks is also important, with red indicating low deviation blending into blue, which represents high deviation. Using the "Show Features" function, the three most descriptive features of the cluster are displayed above the cluster itself, and detailed information is available about each cluster by clicking on its number (for reasons of visual clarity, this feature was turned off for the included screenshots). The matrix visualisation uses colours to graphically represent the data in the matrix, with white representing values near zero, green representing negative values and red representing positive values (different shades of red and green being lighter/darker depending on how close to zero they are). With tree-building enabled, a column tree and a row tree are generated; these trees effectively depict the relationships between the columns and rows, showing the relationships between the discovered clusters. Examples of both of these visualisations will be produced and explained in the following sections, and an example of each can be seen in the gCLUTO screenshot in Figure 9. Clustered features are divided by horizontal black lines across the matrix visualisation area.

    The following sub-sections contain details of all of the stages of analysis mentioned above. The

    results tables and graphs being discussed are available in Appendix 1 and should be referred to whilst

    reading these sections.


    3.3.2 Cyberspeak: Features for classification?

    Cyberspace: a virtual world which relies largely upon written rather than spoken communication. The receiver (in this case the reader) is unable to see the speaker, hence non-verbal message cues are lost and the meaning of some messages is distorted or lost altogether. With the reader unable to see a speaker grin, chuckle, wink or smile, a means of communication unique to cyberspace evolved as a means of enhancing meaning. This came to be known as "Cyberspeak" (sometimes called "Netspeak").

    This consists of a number of dynamic elements, which include emotional ASCII icons, otherwise known as "emoticons" (see Figure 11), which can be used to represent facial expressions or actions. These emoticons are one aspect of cyberspeak; another is the use of cyberspeak acronyms: short words made up of the first letters of a lengthy phrase. For example, "by the way" is abbreviated to "BTW" and "also known as" to "AKA"; in the case of non-verbal cues, "laughing out loud" can be abbreviated to "LOL". Comprehensive lists of these acronyms exist in various online dictionaries of Netspeak [16, 17] and there are a number of good sites for the uninitiated [15, 28, 46].

    Figure 11: Example of "emoticons" representing smiling, winking and sticking out the tongue respectively.
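    The acronym-to-phrase mapping just described can be sketched as a small lookup table; the dictionary below only covers the examples given above and is not a complete Netspeak glossary.

```python
# Illustrative sketch: expanding common cyberspeak acronyms to full phrases.
ACRONYMS = {"btw": "by the way",
            "aka": "also known as",
            "lol": "laughing out loud"}

def expand(text):
    """Replace any known cyberspeak acronym with its full phrase."""
    return " ".join(ACRONYMS.get(word.lower(), word) for word in text.split())

expanded = expand("btw I saw it lol")  # "by the way I saw it laughing out loud"
```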

    From the initial observations in section 3.2, it became clear that these acronyms could possibly provide a feature for classifying phenomena such as mood or emotion. For example, excessive use of "LOL" in a particular group or by a particular person could be a good indicator of emotion. It is interesting to note the prominence of some of these features in everyday life; for example, the "LOL" acronym can now be found in the Oxford English Dictionary [2], perhaps a sign of just how much the Internet and associated technologies have infiltrated modern-day society.

    For the purposes of this analysis, a selection of acronyms was used as the wordlist for System Quirk. These are given in Table 1, and the full results can be found in Appendix 1b. From the results of the wordlist analysis it is plain to see that the acronyms specified do exist, although not in large numbers. There appears to be no clear pattern across all groups, with only "LOL", "BTW" and "AKA" registering frequencies above ten (note that the smallest group contained more than 22,000 words). In order to get a more accurate view of

    Table 1: Cyberspeak wordlist for System Quirk analysis: aka, bbl, brb, btw, lmao, lol, rotf, rotflmao, stfu, ttfn, ttyl, wtf.


    the distribution of the words, the relative frequencies for each group were calculated (i.e. the frequency of the feature divided by the number of words within the group; see Appendix 1b for these results).
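    The calculation just described (keyword frequency divided by the group's total word count) can be sketched as follows; the sample text and keyword are invented for illustration.

```python
import re

def relative_frequency(keyword, text):
    """Frequency of a keyword divided by the total word count of the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return words.count(keyword) / len(words)

group_text = "LOL that was funny. BTW the meeting is at noon, lol."
rf_lol = relative_frequency("lol", group_text)  # 2 occurrences / 11 words
```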

    It is noticeable that, with relative frequencies, one can determine a stark difference in the usage of "LOL" across groups. The alt.politics.democrats newsgroup has the highest relative usage of any group (0.227 × 10⁻³), even though this group is almost a third of the size of the largest group in the corpus (which is, somewhat contentiously, alt.politics.usa.republican!).

    This is perhaps a representation of the characteristics of the newsgroups selected: it might be the case that the participants of the alt.politics.democrats group are more light-hearted and jovial (or "liberal", as a cynical Republican might suggest), or it might simply be a case of the threads captured at that time being particularly humorous in nature. "LOL" can generally be assumed to be a positive response if we discount the use of sarcasm; however, doing so would defeat the point of this project.

    Examining acronyms could aid the visualisation of phenomena such as formality/informality, emotion and mood. For example, "LOL" would likely have positive or happy connotations; similarly, the complete lack of cyberspeak phenomena might indicate a more formal message or subdued group. The difficulty of examining cyberspeak from a language-use perspective is that it is a feature of online communication spaces, and is a relatively new phenomenon which has arguably been researched less than standard language features such as specific words or phrases. One could argue that cyberspeak could warrant an entire research project in itself, but the fact that it is cyber-specific and not widely used means it is difficult to derive generic rules for classification without knowing more about the context of the rest of the message, and of the entire communication space.

    Similarly, research into emoticons would be interesting; however, in this case System Quirk could not recognise the combinations of punctuation marks and symbols which constitute emoticons, and therefore this could not be pursued further. However, it might turn out to be the case that they are not a widely used feature across USENET. There is a chance that emoticons may have been more prominent in the past, when USENET was the domain of the highly technologically literate, but with the proliferation of Internet access in the early 21st century there has been a socio-demographic shift in the user base of USENET; specifically, non-technical everyday people now have access, and utilise all aspects of Internet communication, including USENET. Further research into this area would help to support or refute this theory, but is outside the scope of this particular project.
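    As a workaround for tools that cannot index punctuation sequences, emoticons could be counted with a regular expression. The pattern below only covers the three emoticons of Figure 11 (smile, wink, tongue) and is illustrative; real emoticon inventories are far larger.

```python
import re

# Matches ":)", ";)", ":P", ";P", with or without a "-" nose, e.g. ";-)".
EMOTICON = re.compile(r"[:;]-?[)P]")

sample = "Nice one ;-) see you later :) :P"
count = len(EMOTICON.findall(sample))  # 3
```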


    The results produced from gCLUTO processing are useful for visualising the data and identifying the clusters that exist. Despite the fact that CLUTO is generally more useful for larger datasets than the results produced from the cyberspeak analysis, clusters did exist which show interesting relationships between the newsgroups. The matrix visualisation produced can be seen in Figure 12, and it is clear to see similarities between groups from the hierarchical column tree created at the top. On the left-hand branch, the first leaf nodes clustered depict the relationship between alt.music.lyrics & alt.music, then between those two groups and uk.media. On the right-hand branch, alt.politics.usa & alt.politics.usa.republican are related, alt.politics & alt.music.smash-pumpkins are related, and sci.environment & alt.books are related. These pairings can then be connected as you read up the tree. What this suggests is that within the clusters discovered, the relative frequencies share similarities, which could be an indicator that the words are being used in a similar way across the related newsgroups. If one glances across the matrix visualisation it is clear to see that "BTW" is the most widespread word from the list, occurring in all but two of the newsgroups. Interestingly, the darkest shades of red (i.e. the largest values) all occur for keywords within the politics/debate groups (alt.politics.usa, alt.politics.usa.republican, alt.politics.democrats and sci.environment); alongside this, it is worth noting that most of the cyberspeak words that did exist in the sample occurred in these same groups, perhaps suggesting that their use within these groups is comparable.

    The mountain visualisation depicted in Figure 13 clearly shows the discovered clusters (represented by the peaks), but also displays an area marked in red which is not at the top of a peak. This indicates a low deviation across the values in the cluster, but also a low level of internal similarity (i.e. the features within that cluster aren't very similar, yet they do not differ greatly in value; this is to be expected with very few values).


    Figure 12: A gCLUTO Matrix Visualisation of clustered features for "Cyberspeak" words within the indicated nine newsgroups. Here it is clear to see similarities between groups from the hierarchical tree created at the top. On the left-hand branch, the first leaf nodes clustered depict similarities between alt.music.lyrics & alt.music, then between those two groups and uk.media. On the right-hand branch alt.politics.usa & alt.politics.usa.republican are related, alt.politics & alt.music.smash-pumpkins are related, and sci.environment & alt.books are related. These pairings can then be connected as you read up the tree.


    Figure 13: A gCLUTO Mountain Visualisation of clustered features for "Cyberspeak". Each peak in the terrain represents a discovered cluster. Height is proportional to the cluster's internal similarity, and volume is proportional to the number of elements within the cluster, so it is possible to see here that there are several distinct clusters. Colour is also important, with red indicating low deviation blending into blue, which represents high deviation.


    3.3.3 Personal Pronouns: Features for classification?

    personal pronoun (n.): A pronoun designating the person speaking (I, me, we, us), the person spoken to (you), or the person or thing spoken about (he, she, it, they, him, her, them). [6]

    The nature of personal pronouns means that they are more common in spoken conversations and, in this case, in the more conversational style associated with Internet-based communication. With the above definition in mind, it is understandable why the usage of these pronouns could possibly be an indicator of any number of phenomena within a conversation: tone, mood, emotion, inclusive behaviour, and exclusivity (perhaps related to a sense of community or belonging which may or may not exist within a newsgroup). It is these pronouns, constructed into phrase rules (by associating other words and constructing key phrases), which form the basis of Ellen Spertus' Smokey [38] project. For more details about Smokey please see section 2.2.1.
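    The "phrase rule" idea (pairing a pronoun such as "you" with other words to build key phrases) can be sketched as follows. The rule below is invented for illustration, loosely in the spirit of such rules; it is not one of Smokey's actual rules.

```python
import re

# Hypothetical phrase rule: "you" followed somewhere by an insult word.
HOSTILE_RULE = re.compile(r"\byou\b.*\b(idiot|clueless|fool)\b", re.IGNORECASE)

def looks_hostile(sentence):
    """Flag a sentence if it matches the hostile phrase rule."""
    return HOSTILE_RULE.search(sentence) is not None

flag = looks_hostile("You are clearly clueless about this")  # True
```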

    Substantial areas for future research exist in the positioning of personal pronouns within a sentence, i.e. the words surrounding the pronoun and the sentence structure can provide context to the pronoun's usage. However, in this case the project is approached from a computer science perspective, and as such deals primarily with the scope for automatically classifying text (and to do this at a computational level it is essential to start with the basics and look at individual words). With this in mind, the frequencies and relative frequencies of the keywords in a personal pronoun wordlist (see Table 2) were extracted in the same manner as detailed in section 3.3.1. These results can be found in Appendix 1c.

    Table 2: Personal pronouns wordlist for System Quirk analysis: you, I, he, she, it, they, us, we, them, her, him, me.

    Significant similarities exist in the usage of the pronoun "I" in six out of the nine newsgroups, the exceptions being the music-related groups (alt.music, alt.music.lyrics and alt.music.smash-pumpkins, which have higher relative frequencies) and alt.books (which has a lower relative frequency). This suggests that "I" is used in a similar way across the political/debate newsgroups within the sample. A similar trend is evident with the use of "we", "us", "them" and "me" in the political/debate groups, with all of these keywords having similar relative frequencies for each of the four groups (namely alt.politics.usa, alt.politics.democrats, alt.politics.usa.