

UNIVERSITY OF SURREY
DEPARTMENT OF COMPUTING

BSC. (HONOURS) COMPUTING AND INFORMATION TECHNOLOGY

FINAL YEAR PROJECT 2002-2003

Text Classification of USENET messages for a Conversation Visualisation System

CS300: FINAL REPORT

JOLYON HUNTER

[email protected]

URN: 1930192

SUPERVISOR: Dr. Andrew Salway    DATE:

STUDENT: Jolyon Hunter    DATE:


BSC. (HONOURS) COMPUTING AND INFORMATION TECHNOLOGY
FINAL YEAR PROJECT

FINAL REPORT

ABSTRACT

Text Classification of USENET messages for a Conversation Visualisation System

The widespread use of computers, and the Internet in particular, as a means of communication has seen more and more people connecting around the globe in online discussions. These discussions are usually text-based, fast, simple and global in scope, yet they lack the non-verbal elements, the extra levels of meaning and the emotion which face-to-face, person-to-person communication provides. Surely, with the technology available to modern man, these meanings or emotions can be depicted visually?

    This report details the investigation of how messages or conversations within USENET newsgroups

    can be classified automatically as part of a system to visually represent online discussions. The project

    aims and objectives are defined before the nature and history of USENET is described to provide

    some context to the investigation. The current state-of-the-art in Visualisation Designs and Text

    Classification methods is described before a period of observation is undertaken in an attempt to

    define phenomena to visualise and features (i.e. particular words in the text) which characterise those

    phenomena.

    Once candidate features have been defined, a corpus of USENET messages from a range of

    newsgroups is collected in order to conduct in-depth analysis into those features. This analysis

    involves the use of the language engineering workbench System Quirk (to glean word frequencies

    from the sample), and the clustering toolkit gCLUTO (to aid interpretation of the data). From this

    analysis, example rules are defined which can be programmed into a system to classify the text

according to defined classes. An example conversation visualisation system is specified, before a system to classify text is developed, implemented, tested and evaluated.


ACKNOWLEDGEMENTS

If we knew what we were doing, it wouldn't be called research, would it?

    ALBERT EINSTEIN

Dr. Andrew Salway, for his continued support and advice with all aspects of this project. Also for keeping me sane, and making me realise that I did know what I was doing.

CodeZebra: Sara & Erik in particular, thanks for letting me participate.

    The Smashing Pumpkins, Dave Matthews Band and the rest, for getting me through the tough times;

    Starbucks Coffee Co. for the even tougher times;

Personal thanks must go to

My Dad, and my sisters Eliane & Abigail, for their continuing support and love. I could not have done this without you.

    And of course

Mum, always with me in spirit.

Forever my inspiration in everything. This is dedicated to you.


TABLE OF CONTENTS

    Abstract ...............................................................................................................................................................1

    Acknowledgements ...........................................................................................................................................2

    Table of Contents ...........................................................................................................................................3

    1. Introduction .................................................................................................................................................5

    1.1 Project Overview.............................................................................................................................5

    1.2 A History of USENET ..................................................................................................................7

    2. Systems to Visualise Conversation .....................................................................................................10

    2.1 Visualisation Designs....................................................................................................................10

    2.1.1 Loom ................................................................................................................................11

    2.1.2 CodeZebra.......................................................................................................................14

    2.1.3 Conversation Map..........................................................................................................16

    2.1.4 Netscan ............................................................................................................................18

    2.1.5 Visualisation Designs: How they compare.................................................................20

    2.2 Methods for Classifying Text ......................................................................................................21

    2.2.1 Smokey.............................................................................................................................22

    2.2.2 WEBSOM .......................................................................................................................24

    2.2.3 CLUTO............................................................................................................................26

    2.3 Discussion ......................................................................................................................................28

3. Features for classifying USENET conversations...........................................................................29

    3.1 Introduction ...................................................................................................................................29

    3.2 Initial Observations of USENET Newsgroups .......................................................................30

    3.3 Statistical Corpus Analysis...........................................................................................................33


    3.3.1 The Analysis Process .....................................................................................................35

    3.3.2 Cyberspeak: Features for classification?.....................................................................37

    3.3.3 Personal Pronouns: Features for classification?........................................................42

    3.3.4 Synonyms of Agreement/ Disagreement: Features for classification?...................47

    3.4 Summary: Creating some Rules?.................................................................................................51

    3.5 Further Analysis: Phenomena Classification ............................................................................54

    3.5.1 Summary: Creating more rules?...................................................................................59

    4. Developing a System to Classify text for use in a Conversation Visualisation System ......62

    4.1 Requirements Analysis .................................................................................................................63

    4.2 Design .............................................................................................................................................65

    4.3 Implementation .............................................................................................................................67

    4.4 Testing.............................................................................................................................................69

4.5 Evaluation ......................................................................................................................................71

    5. Closing Remarks ......................................................................................................................................75

    5.1 Achievements.................................................................................................................................76

    5.2 Discussion & Future Work..........................................................................................................78

    References.......................................................................................................................................................79

    Bibliography...................................................................................................................................................83

    Appendices .....................................................................................................................................................86

    APPENDIX 1: Supporting Material for Chapter 3 Analysis .......................................................87

    APPENDIX 2: gCLUTO Input files...............................................................................................99

    APPENDIX 3: Rule-Based Processor PERL code.................................................................101

    APPENDIX 4: System Testing & Evaluation Questionnaire....................................................103

    APPENDIX 5: Ten Sample Messages...........................................................................................106


1. INTRODUCTION

Cyberspace: A consensual hallucination experienced daily by billions of legitimate operators, in every nation...

A graphic representation of data abstracted from the banks of every computer in the human system...

WILLIAM GIBSON, Neuromancer (1984:51)

    1.1 Project Overview

Cyber-culture, and the social interaction that takes place on the Internet (or in Cyberspace), is a burgeoning area of research, not just for Computer Scientists, but for psychologists, sociologists and linguists alike. The principal means of communication and interaction on the Internet are online discussions, in the form of chat (IRC, or Internet Relay Chat), Instant Messaging and newsgroup discussion (USENET).

    Although newsgroups in themselves can be extremely valuable sources of information, they tend to

    be relatively inaccessible to outsiders or the general Internet user. For example, a discussion

centred around a topic requiring a high amount of technical knowledge in, say, Computing, might be inaccessible to a student of Archaeology (yet the information contained in the discussion may be of relevance to the Archaeology student; the problem is in accessing it). This has prompted some

    academics to investigate ways and means of visualising these online discussions, so as to convey the

    social qualities as well as the informative qualities of the discussion.

Pioneers in the field of visualising online discussions come from the Sociable Media Group (SMG) [40] at the Massachusetts Institute of Technology's Media Lab, the Social Technologies Group at UC Berkeley [41], and more recently from Microsoft Research [26], all of whom will be covered in detail in

    Section 2.

    Given the huge range of social interactions concerning many different topics, ranging from pet care

    advice and recipes to debates about the origin of life, it is clear that opportunities for research exist in


exploring new ways to classify, and ultimately display, these interactions in a more intuitive way than

    is currently possible.

    To facilitate this research aim, it will be necessary to carry out an in-depth examination of past and

    current research in the area of visualising online conversations. As stated in the project inception

document, preliminary investigations into this research suggest that the focus tends to be more on

    the visualisation process and the interface considerations of designing systems, rather than the actual

    means of classifying text. From these research papers it should be possible to determine what

    phenomena others have attempted to visualise, and this information will be useful to bear in mind

    when investigating USENET messages in greater detail.

    This project deals with automatic classification of USENET messages, which requires an analysis of

    textual features which exist in messages, and an investigation into whether these can be meaningfully

represented in machine language, i.e. coded into a program for automatic extraction.
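As a minimal sketch of what "coding features into a program for automatic extraction" might look like (this is an illustration in Python, not the report's actual rule-based processor, and the feature words shown are invented examples in the spirit of the cyberspeak, pronoun and agreement features analysed later):

```python
import re
from collections import Counter

# Hypothetical candidate feature words; the report's real feature sets are
# derived from corpus analysis in Chapter 3, not this list.
FEATURES = {"lol", "imho", "btw", "i", "you", "we", "agree", "disagree"}

def extract_features(message: str) -> Counter:
    """Count occurrences of candidate feature words in a message body."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return Counter(t for t in tokens if t in FEATURES)

counts = extract_features("IMHO you are right - I agree, lol!")
```

A frequency profile like `counts` is the kind of per-message data that a classifier, or a visualisation system, could then consume.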

    An aim of this project is to make the automatic collection of that data much easier, and therefore aid

    the design and implementation of a conversation visualisation system. Through greater automatic

    classification of the text contained within messages, it should be easier to use the data to construct

    meaningful visualisations: An ideal situation would be where one could go from numerical to

    graphical representations and back again without losing accuracy of the data. To facilitate this, a

    corpus of text (a minimum of 250,000 words) will be collated from one or a number of USENET

    newsgroups and subsequently analysed. The aim of this analysis will be to determine patterns and

    heuristics that will enable the creation of a rule set. These rules will then be incorporated into a

    system that will facilitate the automatic classification of messages and conversations.
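To make the idea of a rule set concrete, a toy rule-based classifier over per-message feature counts might look like the following (the class names, thresholds and feature words are hypothetical placeholders, not the rules derived from the analysis in Chapter 3):

```python
def classify(feature_counts: dict, total_words: int) -> str:
    """Toy rule-based classifier: assign a message to a class based on
    the relative frequency of feature words. All thresholds are
    illustrative, not the report's derived rules."""
    if total_words == 0:
        return "other"

    def rate(word: str) -> float:
        return feature_counts.get(word, 0) / total_words

    if rate("agree") > 0.01:
        return "agreement"
    if rate("disagree") > 0.01:
        return "disagreement"
    if rate("lol") + rate("imho") > 0.02:
        return "cyberspeak"
    return "other"
```

Each rule here is a simple threshold on a relative frequency; the actual rules would be tuned against the collected corpus.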

    The following report details the process of research, analysis and system development concerning the

    classification of USENET messages for a conversation visualisation system. However, in order to

    provide some context to the focus of this report, it is first necessary to understand more about

    USENET itself.


1.2 A History of USENET

Throughout human existence, a gifted few visionaries have influenced thinking and furthered the development of our species: one might include Darwin, da Vinci or Einstein in this group, but in recent times one might consider the relatively unknown J.C.R. Licklider. In his and Robert Taylor's The Computer as a Communication Device (1968) [25], Licklider foresaw many innovations which today we take for granted, such as team working via computers distributed over great distances, and video conferencing. It was when he became Director of the Information Processing Techniques

Office (IPTO) at the Pentagon's Advanced Research Projects Agency (ARPA) that he put in place

    the financial priorities which would eventually lead to the development of what we know today as

    The Internet. Effectively he held the purse-strings, but at the same time created a working

    environment where graduate students ran a multi-million dollar research project.

    The ARPANET developed out of this endeavour in the late 1960s and was originally intended as a

    military resource for sharing supercomputers across the United States. Academic institutions soon

    saw the benefits of this network for research and the sharing of knowledge, and in 1969 researchers

    at four universities connected their individual campuses up to ARPANET, becoming the first

    hosts on the network. Within a couple of years, most US Universities were connected and a

phenomenon known as e-mail had become the most popular application of the network. Over the

    following 30 years, the network grew exponentially and various protocols were developed for

particular types of communication between computers. Protocols such as the Transmission Control Protocol/Internet Protocol [TCP/IP] (the basis of the Internet), and the Unix-to-Unix Copy Protocol [UUCP] upon which USENET was developed.

    USENET was born in 1979 when Duke University graduate students Tom Truscott and Jim Ellis

implemented an idea they had about linking discussions within the UNIX community [37]. Using the

    UUCP they managed to get computers to communicate via auto-dial modems: Steve Bellovin of the

University of North Carolina created scripts which meant computers could automatically dial each other up and search for changes in the date stamps of files; if these date stamps were different, then the

    files were copied from one computer to the other. Truscott and Ellis described their ideas and plans

    for construction at the January 1980 USENIX (UNIX Users Group) conference; they emphasised

    the collaborative nature of their efforts and welcomed a collaborative development process. These

    basic concepts were built upon (principally by Stephen Daniel, a fellow Duke graduate) and

    developed using the C programming language. His program for accessing USENET, called A


    News, debuted at the summer 1980 USENIX conference. USENET was seen to be an alternative

to the expensive, clique-like ARPANET. Essentially, Truscott, Ellis and Bellovin created a 'poor man's ARPANET': an electronic community created cheaply and without the political problems associated with the actual ARPANET, which grew exponentially over the following years [14].

USENET developed into one of the largest asynchronous communications media ever created: it is communication structured in turns, the major benefit of this being that people from across the globe can discuss or collaborate on a particular topic regardless of the restrictions of daily schedule or

    time zone. Essentially it is an informal worldwide network of discussion groups called newsgroups

    a distributed database of messages, passed between clients and servers using the Network News

    Transfer Protocol (NNTP which superseded UUCP). Because of its distributed nature, there is no

central authority dictating boundaries or policing the USENET. By technical definition, anarchy exists; however, order and structure do exist. USENET is organised into subject hierarchies, with the first few letters of the newsgroup's name indicating the major subject category and subtopic names representing sub-categories. For example, rec.music.jazz represents recreational (rec) discussions about music (music), with the further specific sub-topic of jazz music (jazz).
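The hierarchy naming scheme is simple enough to sketch in code. The following is a hypothetical illustration (not part of the report's system) that splits a dotted newsgroup name into its successive hierarchy levels:

```python
# Split a newsgroup name into its hierarchy levels, from the major
# subject category down to the most specific sub-topic.
def hierarchy_levels(group: str) -> list[str]:
    parts = group.split(".")
    # Each dotted prefix is one level: "rec", "rec.music", "rec.music.jazz"
    return [".".join(parts[: i + 1]) for i in range(len(parts))]

levels = hierarchy_levels("rec.music.jazz")
```

The first element is always the major category (rec, comp, sci, and so on), which is why the first few letters of a group's name are enough to place it in the hierarchy.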

Figure 1: A typical Newsgroup Reader - messages are downloaded from a news server over NNTP. On the left is a list of subscribed newsgroups, with the top frame depicting the subject lines of messages and threads within that group. Below is a preview pane displaying the selected message.


Conferencing systems, of which USENET is the largest example, are essentially a refinement on e-mail discussion lists: a series of messages strung together and related through similar topics. The distinction can be made, though, that e-mail is a 'push' technology (people receive it regardless of whether they asked for it or actually want it), whereas newsgroups are a form of 'pull' technology, where users can select the groups and messages they want and request only those groups

    or messages. Figure 1 depicts a typical Newsgroup reader downloading messages from a news server

using NNTP. A more recent alternative for accessing USENET exists in the form of Google Groups, who provide a fully searchable archive of messages dating back to 1981, displayed in HTML format (see Figure 2). Google also provide a good overview of the associated terminology of USENET via their website [11]. This more user-friendly interface for USENET is a useful starting point for some

    initial observations into the structure and content of messages, and the language used in newsgroups,

    which will provide a basis for more in-depth investigation in the sections to come.

Figure 2: A view of "rec.music.jazz" from the Google Groups website (http://groups.google.com/) which provides searchable, archived newsgroups browsable in HTML format


2. SYSTEMS TO VISUALISE CONVERSATION

The soul never thinks without an image

    ARISTOTLE

2.1 Visualisation Designs

    Having established an understanding of USENET, it is time to investigate what is essential in the

    design of a system which can be used to classify text for use in a conversation visualisation system. It

    is first pertinent to identify the relevant research and projects already undertaken in relation to this

    topic. By examining the state-of-the-art it may be possible to further enhance or combine

    techniques for the purposes of this project, and any future implementations.

    This section attempts to compare and contrast the different systems which already exist, both in the

    field of text classification and visualisation systems, and will distinguish the key areas related to this

    project. Armed with this knowledge it should be possible to proceed and develop a method (or

    system) for classifying the text of USENET messages.

    It is perhaps worth mentioning that a good overview of designing graphical representations for

persistent conversations can be found in the paper Visualizing Conversation [7], written by Judith Donath, Karrie Karahalios and Fernanda Viégas.


    2.1.1 Loom

The original Loom project began as a class project within MIT's Sociable Media Group, and was taken on as a project by graduate student Karrie Karahalios [19, 20]. The underlying vision for the

    project was the development of a tool for visualising Usenet newsgroups. This was to be achieved by

    observing patterns in key events of the newsgroups: Such events included the beginning and end of

conversation threads, the tone of messages, and the entrances and exits made by participants in conversations, as well as observing 'the path traversed by users as they create their social fabric' [20] (the concept of fabric also being used in a message thread visualisation metaphor).

    The first implementation of Loom focused on visualisations of the mood or tone of the messages.

    The system itself has a simple structure: Newsgroup messages are collected then filtered through

what Karahalios calls a 'general classifier'. The designated mood clusters arising from this classification are then represented visually. (For the purposes of this system, four clusters, comprising 'angry', 'peaceful', 'news' and 'other', were chosen; these are consequently represented in the visual output [Figure 3].) Clicking on each individual container enables the user to view the message being

    visually represented. Karahalios admits the limitations of the system in its infancy (please refer to the

Loom website [20]), and apart from the obvious display improvements she recommends further work

    on classification, in particular breaking down the categories further.
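A toy version of Loom's pipeline (classify a message into one of the four mood clusters, then map the cluster to a display colour) might be sketched as follows. The colour mapping follows Figure 3 (red, green, yellow and blue for angry, peaceful, news and other); the keyword lists and the classifier itself are invented stand-ins, not Karahalios's actual 'general classifier':

```python
# Colours as in Figure 3: angry=red, peaceful=green, news=yellow, other=blue.
CLUSTER_COLOURS = {"angry": "red", "peaceful": "green",
                   "news": "yellow", "other": "blue"}

# Illustrative word lists only; Loom's real classifier is not keyword-based
# in this simple way.
ANGRY_WORDS = {"hate", "stupid", "idiot"}
PEACEFUL_WORDS = {"thanks", "peace", "agree"}
NEWS_WORDS = {"announce", "release", "report"}

def mood_cluster(message: str) -> str:
    """Assign a message to one of the four mood clusters."""
    words = set(message.lower().split())
    if words & ANGRY_WORDS:
        return "angry"
    if words & PEACEFUL_WORDS:
        return "peaceful"
    if words & NEWS_WORDS:
        return "news"
    return "other"

colour = CLUSTER_COLOURS[mood_cluster("thanks for the help, peace")]
```

The point of the sketch is the shape of the pipeline (collect, classify, colour), which is exactly where better text classification would slot in.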

    Subsequent implementations focused on abstractions of message threads, and how they can be

represented visually. Loom 2 is an evolution of the Loom project and is overseen by Karahalios' supervisor for Loom, and prominent SMG professor, Judith Donath. It was Donath's article in the Communications of the ACM [8] which provided a starting point for this project. In it she notes how

    the nature of Usenet makes it a vast source of information, but the current means of accessing it lack

    visual appeal, and tend to obscure many of the cues that aid us as human beings in our social

interactions. Donath maintains that more focus on these cues, coupled with improved visualisation

    could help the viewer perceive the online space more intuitively, and create a legible social

environment within which they can interact. Ultimately this is the aim of the Loom2 project: to use

    the salient features of social interaction to create a legible interactive visualisation of Usenet.


Figure 3: The original "Loom" visualising mood. It is possible to identify the four classification clusters of angry, peaceful, news and other, represented here by the red, green, yellow and blue colours respectively. This implementation displays the individual postings as coloured containers over a grid which represents a calendar.

In order to achieve this aim, the project members set themselves numerous questions to answer in the development of their system, principally based in the domain of social structure and activity. The focus of the Loom project being visualisation to aid social interaction, the data itself acted only as a starting point. At a fundamental level, Loom's data gathering technique comprises a subset of newsgroups collected from a news server and parsed, then stored in an Oracle database. This was just a small sample of Usenet, restricted to English-language newsgroups with a minimum number of posts in the last month.

Another visualisation project, developed by Donath and Rebecca Xiong at MIT, is PeopleGarden [48]. This system uses the metaphor of flowers in a garden to represent people and their posts within a newsgroup (to create what they term 'data portraits'). The abstract representations of a user's


    interaction history are illustrated in Figure 4: Two groups are represented, the left-hand side being a

    group with one dominant voice, the right being a more evenly distributed, democratic group.

Figure 4: "PeopleGarden". Each participant is represented by a flower. The stems of the flowers represent length of time in the newsgroup; the petals represent initial posts (blue) and replies (pink). The older posts are represented by less intense shades. This figure represents two groups: on the left-hand side a group with a single dominant voice, on the right a more democratic group.

PeopleGarden provides a good example of a visualisation metaphor, based upon the activity and participation of the users (much like Loom), but again it could be argued that more focus is needed on the analysis of the content of messages, the text itself rather than just the participants, in order to provide another level of the context that Donath et al. are seeking. More information about PeopleGarden can be found in Xiong and Donath's paper, and at the project's website [48].


2.1.2 CodeZebra

CodeZebra [4] is an ongoing, international research project, attempting to develop an entirely new kind

    of chat environment in visual 3-D space. In a similar vein as Loom, CodeZebra depicts navigation

    and relationships between groups, individuals, topics and conversations, using a central concept of

    animal print metaphors. Indeed, CodeZebra owes much to the natural world for this as users can

    play games with each other, create conversational histories and generate their own patterns based on

metaphors which permeate the CodeZebra world. For example, users who post long, considered responses might be classified as 'zebras', a typically 'academic' animal, whereas users who tend to frequent the same threads as others, and perhaps pick on other users, might be classed as 'hyenas', typically pack-hunters in the real world. Similarly there are classifications of 'butterfly', 'ocelot', 'cheetah' and 'peacock' amongst others, each with their own unique characteristics.
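A behaviour-to-animal mapping of this kind could be sketched as a simple function over posting statistics. The thresholds and criteria below are invented for illustration; they are not CodeZebra's actual classification logic:

```python
def animal_for(avg_post_length: float, same_user_reply_rate: float) -> str:
    """Map simple posting statistics to a CodeZebra-style animal pattern.
    Both thresholds are hypothetical placeholders.
    avg_post_length: mean words per post.
    same_user_reply_rate: fraction of a user's replies aimed at the
    same small set of other users."""
    if avg_post_length > 200:           # long, considered responses
        return "zebra"
    if same_user_reply_rate > 0.5:      # frequents the same threads/users
        return "hyena"
    return "butterfly"
```

In a real system these inputs would themselves come from automatic classification of the message text, which is the gap this project addresses.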

The need for this system originates from an inherent weakness of existing chat media: once a large number of people interact with each other they generate a lot of text, but it is all 2-dimensional, and the social and emotional context is sometimes unclear or lost altogether. The aim

    of CodeZebra is to create a dynamic visual depiction of the underlying associations between issues

    and topics, through providing interactive games and discussion structures. The 3-D environment

    attempts to create a visual guide to what is being said, by whom and with what emotional tones. Part

of the driving force behind CodeZebra is the bringing together of the worlds of Art and Science: in essence, the project creators say, artists construct metaphors, yet there are problems of representation within science; therefore a move away from realism and further into abstraction is necessary to aid communication between these disciplines. The metaphor of patterns is also important, as patterns show

    links between topics, emotional characteristics of topics and show relationships between postings.

A graphical explanation of CodeZebra's second prototype can be found below in Figure 5, and the software itself can be accessed online at the CodeZebra website [4]. The third prototype has just been released, and the system underwent a globally interactive road test during the recent Dutch Electronic Art Festival (DEAF) in Rotterdam, the Netherlands [3].

Head of the project is Sara Diamond from the Banff New Media Institute [1] in Canada, and research partners include V2 Labs (NL) [43], the SMARTLab Centre (UK) [34] and the University of Surrey (in particular Dr. Andrew Salway). CodeZebra's links with the University of Surrey, in particular with the supervisor of this project, will facilitate the sharing of information and the spirit of research


collaboration in the evolution of this project. With CodeZebra's emphasis being on visualisation, there is ultimately an essential need for the underlying system to analyse and classify messages accurately. Fundamentally, this is where it is hoped that this project will contribute to this

    field of research.

Figure 5: CodeZebra Prototype 2 explained


    2.1.3 Conversation Map

    The Conversation Map system is the brainchild of Warren Sack and formed the basis for his PhD

    thesis30 at the Massachusetts Institute of Technology (MIT) in 2000. The system attempts to map

    message threads alongside Social and Semantic Networks in the same application window (see Figure

    6).

    In Discourse Diagrams: Interface Design for Very Large-Scale Conversations 31, Sack classifies the

internet conversations taking place between many thousands of people at any one time as "Very Large-Scale Conversations". He alleges that it is the sheer size and variety of these conversations,

    such as those conducted within newsgroups, which render them difficult to understand or critically

    reflect upon for interested observers and participants alike. The system attempts to facilitate greater

    understanding through creating visualisations of the Social and Semantic networks in action in a

    particular thread of messages. In the aforementioned paper, Sack discusses the design criteria

    necessary in order to transform these social scientific representations into what he terms interface

    devices. The Conversation Map system he developed is such a device, and is essentially a browsing

    system for Very Large-Scale Conversations (VLSCs).

A social network involves drawing edges between a set of nodes labelled with the names of people, with the edges between the nodes representing interaction between those people. In a similar manner, semantic networks involve nodes and edges, with the nodes labelled with words or concepts and the edges representing the semantic relationships between those words or concepts.

Social and semantic networks are usually used in the social sciences as analytic devices, that is, as scientific models or representations of hypotheses. What Sack proposes is that they be used as generative devices, i.e. as a means of exploring VLSCs by serving as an interface into the archive of a discussion32. Effectively, this means they are diagrams of discourse.


Figure 6: The Conversation Map system. The upper left part of the screen shows the social network present within a group; the upper right shows the semantic network. In between is a list of themes present in the group, and below is a representation of the message threads within the group.

    Conversation Map can be used in a similar way to regular newsgroup readers like Eudora, Agent and

    Outlook Express, but has features which others do not provide. It performs a series of

computational linguistic and sociological analyses on the messages of the newsgroup, and presents its

    results in the form of a graphical interface (see Figure 6).

    In discussing his work (at an HCI Seminar 2000-2001 held at Stanford 32) Sack notes how a

    newcomer to a VLSC might want access to previous discussions as well as the social layout of the

    group in order to determine which participants and threads are relevant to the newcomer. This also

distinguishes which topics or themes bring participants together, or divide them as the case may be. Conversation Map does this by focusing more on summarisation, navigation and browsing rather

    than information retrieval.


2.1.4 Netscan

    Netscan began in 1994 as a graduate project at the University of California Los Angeles (UCLA)

    conducted by Marc Smith. Today it continues in the industrial research arena under the banner of

Microsoft Research. Compared to Loom and Conversation Map, Netscan has a much more grandiose aim: to map the entirety of Usenet.

    Smith, Fiore and Tiernan10 are quick to note that current newsgroup browsers offer very little (if any)

information about the history of authors in a newsgroup: for example, their activity, what other newsgroups they post to, or which other participants they converse with. They argue that it is these features which, if they were available in browsers, would provide a guide to the social context and

    interactional history that is typically used by participants in physical spaces. Current browsers present

information about the messages themselves (subject line, posting date and line count), which causes the

    user to focus their attention on the structure of the medium itself, rather than the qualities or value

    of the participants.

One of Netscan's aims is to develop a newsgroup interface which allows the user to sort and search by salient behavioural features, such as the fraction of each author's messages that initiated conversations as opposed to replies, or the number of days on which each author contributed a

    message to that particular newsgroup. By combining these features with the regular features which

    depict structure and thread development, Smith, Fiore and Tiernan believe they can extract valuable

    content out of the large social cyberspaces that comprise Usenet.

    Netscan connects to a news server that carries nearly 50,000 Usenet newsgroups and collects all of

    the messages in all the newsgroups. Information is gathered from message headers and

    stored/ maintained in a database. The analyser part of Netscan is then able to read selected

    portions of the database and generate reports and analysis of selected newsgroups over a specific

    period of time. A demonstration of this can be found on the main page of the Netscan website 35.

The Netscan website demonstrates a few of the tools/interfaces that Smith has developed in order to try and map Usenet, and Figure 7 depicts a boxplot representation of the rec.music hierarchy.


Figure 7: A Netscan boxplot of "rec.music". This shows the sub-groups present within the rec.music hierarchy; the size of each sub-group represents the number of posts to that group in the last month. The colours represent the increase (green) or decrease (red) in posting within that group compared to the previous month; the intensity of the colour indicates the level of increase/decrease.

It could be argued that Netscan, like the systems mentioned already, focuses upon the more sociological/visual representations of conversations, whereas there needs to be more research into automatic text analysis which incorporates the content of the messages as well as the features that Netscan has already extracted. Marc Smith notes how Netscan has succeeded in its initial

    aims, but that there is also scope for improvement with regard to content analysis36; Smith suggests

    that content analysis of message bodies could lead to the mapping of the diffusion of topics across

    USENET, and indeed other communications media. He further suggests that this could possibly

    assist studies into informal communication networks and the transmission of folk beliefs, as well as

    the development of academic disciplines.


    2.1.5 Visualisation Designs: How they compare

At a glance, here is how the various visualisation systems covered in the previous sections compare by feature. It is clear that they all offer something unique, yet there is a central theme to what they all require: more investigation into automatically classifying text.

Feature comparison (Loom, PeopleGarden, CodeZebra, Conversation Map, Netscan):

Participation analysis?                                         three of the five systems
Semantic structures represented?                                two systems
Social structures represented?                                  two systems
Message information extraction?                                 two systems
Message content analysis?                                       two systems
Author profiling/representation?                                three systems
Automatic classification of text?                               none
Requires more research into automatic text classification?      all five systems


    2.2 Methods for Classifying Text

    It is clear from the previous section that visualisation systems are plentiful, and vary greatly in how

    they visually represent conversations. Before being able to visualise such conversations, it is necessary

to impose some form of classification system with regard to the content of messages or

    conversations. Some initial work has been conducted into the area of automatic text classification,

    and the following section will introduce a number of different methods which could potentially be

    used or adapted for use in a future automatic text classification system for conversation visualisation.

Firstly, a rule-based method of classifying text exists in the form of Ellen Spertus' Smokey, a system which automatically classifies abusive "flame" messages. Next, Teuvo Kohonen's WEBSOM will be covered; this is a method for organising text documents and preparing visual maps of them to facilitate information retrieval, and is based on the Self-Organising Map (SOM) data processing method, which has its roots in neural computing. Finally, there is George Karypis' CLUTO clustering toolkit. This software package enables the user to cluster low- and high-dimensionality data, and in its recently released gCLUTO form provides a user-friendly method for analysing the characteristics of various clusters. From a more detailed look at these

    methods and systems, it should be possible to draw inspiration for a future system to classify

    USENET messages for a conversation visualisation system.


    2.2.1 Smokey

    Smokey was the brainchild of MIT graduate Ellen Spertus38. The system focuses upon abusive

    email messages, otherwise known as flames. The system aims to identify abusive, insulting or

offensive messages by looking for specific words, their context, and syntactic constructs. The

    prototype system builds a 47-element feature vector based on the semantics and syntax of each

    particular sentence. It combines these vectors for sentences within a message, so that effectively the

    resulting vector represents all the constituent parts (sentences) of the message.

    The messages are converted to one sentence per line and delimiters are put between messages

    (manually). The resulting text is then run through a parser developed by the Microsoft Natural

Language Processing Research Group38, and the parser output is converted by sed and awk scripts into Lisp s-expressions. The s-expressions are then processed through a set of rules developed in Emacs Lisp, producing the aforementioned feature vector for each message. The feature vectors were then evaluated using simple rules produced by Quinlan's C4.5 decision tree generator.

    Spertus reports that with a training set of 720 messages, and using the process detailed above, the

    system was able to correctly categorise 64% of flames and, in a separate 460 message set, 98% of

non-flames. It could be argued that this success rate could be improved, and Spertus acknowledges the limitations of Smokey with regard to elements such as grammatical errors, innuendo and sarcasm. For example, Smokey could mistake the statement "I'm glad to see that your incessant name-calling and whining hasn't stopped" for praise, because it starts with "I'm glad".

    The messages used in the prototype came from the webmasters of a number of controversial

    websites, including The Right Side of the Web (a conservative resource), NewtWatch (a site which

    criticises Newt Gingrich) and Fairness and Accuracy in Reporting (FAIR, a media watch group best

    known for questioning the claims of Rush Limbaugh). All of the messages from these sources were

in the form of email messages. Spertus recognises that there are many different types of flames in real-time communication, as facilitated by the WWW, mailing lists and Usenet newsgroups. However, she also notes that publicly posted messages, such as those found in newsgroups, tend to be more clever and less personal (i.e. more indirect) than the private email flames which Smokey focuses upon; thus, Spertus claims, publicly posted flame messages are harder to reliably

    detect. First instincts might tell a researcher to look for obscene expressions or vulgarities within the


    messages, to identify them as flames. However, Spertus found that only 12% of the flames actually

    contained vulgarities and over a third of the vulgar messages were not actually flames.

The most important element of Smokey is the set of rules used to process the textual input and classify the output. These are listed in full in Spertus' paper38, but examples include imperative statements containing "Look" (e.g. "Look forward to hearing from you"), and phrases containing a negative word near "you". The clear advantage of a method such as this is that a set of rules is relatively easily converted into a programmable algorithm. However, the scope for such rules would depend upon an in-depth analysis of USENET messages to determine if any patterns or features exist which could potentially be represented in rule form, and coded into a future system.
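The flavour of such rule-based feature extraction can be sketched in a few lines of Python. This is an illustrative sketch only: the rule names and patterns below are invented for the example and are not Spertus' actual 47 features, though the per-sentence vectors are combined into a message vector in the manner she describes.

```python
import re

# Invented illustrative rules; Smokey's real rule set is far richer.
RULES = {
    "insult_you": re.compile(r"\byou\b.{0,30}\b(?:idiot|stupid|moron|loser)\b", re.I),
    "imperative": re.compile(r"^(?:get|learn|go)\b", re.I),
    "praise_glad": re.compile(r"^i'?m glad\b", re.I),
    "shouting": re.compile(r"\b[A-Z]{4,}\b"),
}

def sentence_vector(sentence):
    """Binary feature vector for a single sentence."""
    s = sentence.strip()
    return {name: int(bool(rx.search(s))) for name, rx in RULES.items()}

def message_vector(sentences):
    """Combine per-sentence vectors so a feature fires for the whole
    message if it fires in any constituent sentence."""
    combined = {name: 0 for name in RULES}
    for sentence in sentences:
        for name, value in sentence_vector(sentence).items():
            combined[name] |= value
    return combined
```

Note how this simple combination already reproduces the false positive discussed earlier: a message beginning "I'm glad..." triggers the praise rule regardless of the sarcasm that follows.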


    2.2.2 WEBSOM

The Self-Organising Map24 (the SOM in WEBSOM) was developed by Teuvo Kohonen at the Helsinki University of Technology, Finland. This is a method which facilitates the automatic organisation of

    complex datasets, and provides a means of creating visual maps of these datasets to aid in the

    retrieval of information. It has also been the basis of more than 3000 published studies worldwide,

    with applications ranging from image analysis and speech recognition to medical diagnoses and

    telecommunications applications.

    With WEBSOM45, documents are automatically arranged into a visual map using the SOM method.

Before this can take place, the textual content of the documents is converted into numerical form: several hundred numeric values correspond to each document. A certain combination of these values within a document corresponds to a word; the larger this combination is in the document's encoding, the more frequently that word occurs. Words are also weighted according to how common they are. A pre-processing stage removes punctuation marks, numbers and other non-standard characters, and the most common and the rarest words are excluded.
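The described pre-processing and encoding steps can be illustrated with a small sketch. This is not WEBSOM's actual algorithm (which uses random projections and much larger vocabularies); it simply shows the stripping of non-letter characters, the exclusion of the most common and rarest words, and frequency-based weighting, with illustrative thresholds (`min_count`, `top_common`) chosen for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Strip punctuation, numbers and other non-letter characters."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()

def build_vectors(documents, min_count=2, top_common=2):
    counts = Counter(w for doc in documents for w in tokenize(doc))
    common = {w for w, _ in counts.most_common(top_common)}  # drop most common
    vocab = sorted(w for w, c in counts.items()
                   if c >= min_count and w not in common)    # drop rarest
    weights = {w: 1.0 / counts[w] for w in vocab}            # frequency weighting
    vectors = []
    for doc in documents:
        tf = Counter(t for t in tokenize(doc) if t in vocab)
        vectors.append([tf[w] * weights[w] for w in vocab])
    return vocab, vectors
```

The resulting numeric vectors are the kind of input a SOM can then organise into a two-dimensional map.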

    A zoomed view of WEBSOM can be found in Figure 8. Here, the user has already clicked on the

    map image with their mouse. This zoomed view shows white points or nodes which allow the user

    to investigate the contents of individual map units (threads of messages). The Arrow button in the

    top left-hand corner allows the user to move to a nearby cluster. The map image contains labels

which are examples of the core vocabulary of the area in question. The labels give a general idea of the

    topics in the document collection. The colouring of areas of the map represents the density of

    documents in that area. Light areas contain more documents.

As the WEBSOM site demonstrates, this tool is especially good with large bodies of text such as USENET, and has great potential in the classification of USENET messages right down to individual words. With the SOM base, there is also the potential for it to be adapted into a system incorporating an adaptive/learning neural network which could dynamically adapt to the ever-changing environment of USENET.


Figure 8: "WEBSOM", a zoomed view. Each white dot is a map node. Colour denotes the density, or clustering tendency, of the documents: white areas are clusters and dark areas empty space, "ravines", between the clusters. The text on the right-hand side denotes which topics are being represented.


    2.2.3 CLUTO

    This software facilitates the clustering of low and high-dimensionality data, and has application areas

    including customer purchasing transactions, Geographic Information Systems (GIS), general science

and information retrieval. In the sense of information retrieval, clustering is based on the idea that objects (usually documents, but in this case keywords) clustered together, that is, closely associated with each other, tend to be relevant to the same query or topic domain. Formally, this is known as the Clustering Hypothesis:

    Closely associated documents tend to be relevant to the same requests 29
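The hypothesis is usually operationalised with a similarity measure over document vectors, cosine similarity being the common choice. A minimal version is easy to sketch, assuming documents have already been converted to numeric term vectors:

```python
import math

# Cosine similarity over term vectors: under the clustering hypothesis,
# documents about the same topic should score close to 1.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A clustering algorithm then groups objects so that within-cluster similarities are high and between-cluster similarities are low.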

    The CLUTO21 package includes three different classes of clustering algorithms which operate either

in the objects' similarity space or in the objects' feature space. These algorithms are based on the partitional, agglomerative and graph-partitioning paradigms. Fundamentally, the software package allows a user to input data in the form of a matrix; the software then processes this data into

    clusters and enables the user to analyse the discovered clusters through looking at the relations

    between objects within each cluster and the relations between different clusters. CLUTO is

    particularly good at identifying the features which best describe or distinguish each cluster. Based

    upon studies13 and 22 conducted into the effectiveness of the algorithms, it is generally held that

    partitional clustering offers the best results.

CLUTO is freely available to download21 for Linux, Sun and Windows platforms (the

    current version is 2.1). The package features two programs, vcluster (which takes as an input the

    multidimensional representation of the objects that need to be clustered) and scluster (which takes

    the similarity matrix/ graph between these objects as its input). Essentially both programs perform

    the same functions, and are invoked via the command line using the following call:

vcluster [optional parameters] MatrixFile NClusters

where MatrixFile is the name of the file that stores the n objects to be clustered. With vcluster, each one of those objects is considered to be a vector within m-dimensional space, and the collection of these objects is represented in an n x m matrix. NClusters is the number of clusters that are required

    (default is usually 10). The optional parameters which can be specified are given in detail and


explained in the CLUTO manual; these include directing output to named files, setting different clustering methods, the number of clusters, etc.
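As a rough illustration of preparing input for vcluster, the following sketch writes a dense matrix file of the shape described above: a header line giving n (objects) and m (dimensions), then one row of m values per object. The exact file format (and the sparse variant) should be checked against the CLUTO manual, and the file name here is invented for the example.

```python
# Write a dense matrix file of the shape vcluster expects: a header line
# with n and m, then one row of m space-separated values per object.
# Verify the exact format against the CLUTO manual before relying on it.
def write_cluto_matrix(path, rows):
    n, m = len(rows), len(rows[0])
    with open(path, "w") as f:
        f.write(f"{n} {m}\n")
        for row in rows:
            f.write(" ".join(f"{value:g}" for value in row) + "\n")

# Afterwards, from the shell:  vcluster msgs.mat 10
```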

    Not only is CLUTO a powerful and free piece of software, but it can also be used within bespoke

    applications thanks to its availability as a set of C libraries. CLUTO can effectively become an

integral part of a larger information retrieval or text classification system.

As of February 2003, Version 0.5 (alpha release) of gCLUTO21 became available. This is a GUI-based version of CLUTO intended to make the clustering toolkit more user-friendly (see the screenshot in Figure 9). Essentially this eliminates the command line-based operation of CLUTO and turns it into a point-and-click interface; the command-line input depicted earlier is simply replaced by a standard Windows-style GUI Browse button for input files.

Even more recently, a web-based version was released: wCLUTO21. This is a web-enabled data clustering application designed for the clustering and analysis of gene-expression datasets, and is based on the CLUTO toolkit.

Figure 9: gCLUTO, a graphical clustering toolkit based on CLUTO. Here we see the clustering solution at the top of the screen, with the Matrix and Mountain visualisations of these clusters at the bottom left and right respectively.

Clearly, from all of these different implementations, and the fact that CLUTO's libraries can be accessed from within a bespoke C/C++ program, this toolkit is highly versatile and could easily be incorporated into a future implementation of a system to automatically classify text for a conversation visualisation system. Not only is this a very powerful statistical analysis tool, it also has massive potential as part of a future system: real-time clustering of text would be of immense use in identifying text with similar or dissimilar characteristics, and therefore in classifying that text.


    2.3 Discussion

    In this chapter the state-of-the-art in visualisation systems has been characterised. What is apparent

    from current systems is that the focus so far has generally been biased towards the actual

    visualisation requirements of the system, rather than the underlying processing and pre-processing

    needed in order to classify text for use in such a visualisation system. Donath and her contemporaries

    at MIT have tended to focus more on the social context of the interactions involved in online

    communication, as has Warren Sack to a certain extent. Others such as Smith et al have focused

    more on the analytical aspects of such a system, yet there seems to have been little (if any) research

    into the bare-bones computational analysis aspects of a visualisation system. To a certain extent, the

    CodeZebra team are now pioneering this work, but it is hoped that this report will also be of value in

    this field.

When considering this topic from a computational analysis viewpoint, it becomes clear that to successfully analyse the text involved (in order to classify it), one must have an idea of some techniques for analysing text. Therefore, rather than attempting to reinvent the wheel, it is only practical to assess the current state-of-the-art with regard to text classification as well. Section 2.2

    described three different techniques and systems for text analysis, the rule-based Smokey, the neural-

    network inspired WebSOM and the clustering toolkit of CLUTO. The example set by Smokey is an

    interesting one, and is a good foundation upon which to base a simple first system. WebSOM and

the possible use of neural/Bayesian networks in a future system would be an interesting and rewarding avenue of research, but is beyond the scope and timescale of this particular project. However, the CLUTO toolkit has great potential, and in the case of this project gCLUTO would provide an excellent means of analysing USENET text; even so, implementing a fully functional system using the CLUTO libraries is probably also beyond the scope of this project.

    In conclusion, it would be worthwhile attempting to develop a simple rule-based classification system

    as a basis for this project, with the possibility that elements of CLUTO could be incorporated in

some way. Future implementations could use CLUTO's libraries to provide on-the-fly

    clustering to aid classification, and similarly some form of Self-Organising Map might alternatively be

    used in a learning system. Before any system can be attempted, it is essential to conduct some in-

    depth analysis of USENET messages to see if any patterns or heuristics exist, and whether any rules

    can be derived from them. This analysis is described in the following chapter.


3. FEATURES FOR CLASSIFYING USENET CONVERSATIONS

You and me are just different points of view

    CHANG-TZU

    3.1 Introduction

    It has been established that current visualisation systems are sophisticated, yet little research has been

    conducted into the underlying system which classifies text for those visualisations. After

    characterising the fields of visualisation systems and text classification methods, it is important to

realise that any potential system development begins at an atomic level: with the text itself.

    In Chapter 1, the immense size and scope of USENET was described along with its historical

    context. With such a large source of data, it is necessary to form some idea of the structure and

    specific features which exist within newsgroups. To facilitate this understanding, some initial

observations of USENET are documented in Section 3.2; this aims to determine what key features possibly exist, and which features might be worthy of further in-depth investigation.

    Following this period of observation, a corpus of USENET messages was collected at random and

    analysed using the processes documented in section 3.3. The final sections of this chapter deal with

    the statistical analysis of the corpus, ending with the formulation of some example rules for a

    potential system implementation.


3.2 Initial Observations of USENET Newsgroups

    In order to get a general feel for USENET messages and their constructs, a period of Initial

    Observation of USENET messages was undertaken. This consisted of examining various

    newsgroups and messages contained within them in an attempt to identify key (or simply interesting)

    features existing in these messages. For

    the purposes of these observations,

    Google Groups provided an effective

    way of navigating and reading groups

    and messages. Google Groups archives

    USENET in its entirety back to 1981,

    and makes groups and articles

    (messages) available in HTML format,

    browsable and searchable via their

    website11. Originally Deja News created

    this user-friendly interface for USENET

    discussions in 1995; it was later acquired

    by Google Inc. in 2001.

    Initially, the most striking aspect of

USENET is the variety of groups which exist, each one facilitating an even more diverse range of topics (Figure 10 shows the Google Groups listing for the most popular top-level domains; there are hundreds more top-level and sub-level groups). This variety is also reflected in the composition and structure of the messages themselves.

    Posts can range from simple one-line questions with short replies, basic statements of opinion or

    thread-based opinion polls to lengthy, thought-out responses which emphasise the knowledge and

    writing style of the author. Frequently, message replies quote the original or previous message in their

    body, some people preferring to reply after the quote, some before, and some taking the message one

    section at a time by interspersing their replies with quoted text.

Figure 10: The Google Groups main page at http://groups.google.com/. This site was used for initial observations of USENET newsgroups and the messages within them. Google Groups archives all of USENET back to 1981, more than 700 million messages.

    In the more free-spirited discussion groups, personal opinions are highly valued by authors, with

subsequent personal insults being the reward if an author's views or opinions are disagreeable to

    another author. Such insults or flames are characterised by the use of profanity and capitalised

    words (which indicate shouting in the context of online conversation). Another feature which is


    specific to communication over the internet is the popularity of acronyms and abbreviations for

    popular or frequently used words and phrases, sometimes known under the umbrella term

    Cyberspeak. Some examples of these include Laughing-Out-Loud (LOL), Rolling-On-The-Floor-

Laughing-My-Ass-Off (ROTFLMAO), and By-The-Way (BTW). These features are generally net-specific, so they would be interesting to investigate further with respect to this project, to see how widespread their usage is and whether they can be used to classify text.

    Also notable are the representations of emotion in the form of emotional icons or emoticons.

These are typically constructed from two or more punctuation symbols; for example, a colon, a dash and a closing bracket together represent a smiling face, :-). These features are also specific to online

    communications, and as they provide some representation of facial expressions or facial actions (in

    this non verbal medium), then they could also be indicators of emotion or mood.

In some cases, there are also textual representations of expressions, such as grinning (*g*, for example), which might also be worth investigating. The use of asterisks (*) around words to provide emphasis is also a feature, for example:

The Beatles were ok, but I *really* like the Rolling Stones' music

Sometimes this is used interchangeably with all-capital letters, which in the context of online conversations represent shouting or heavy emphasis. Coupled with this, it might be worthwhile investigating

    the use of exclamation marks and other punctuation, to see if these can aid the classification of text.
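Several of these surface features can be counted with simple regular expressions. The following sketch is illustrative only: the emoticon and acronym inventories are tiny samples, and real USENET text would need much fuller lists.

```python
import re

# Small illustrative inventories of the net-specific features observed above.
FEATURES = {
    "emoticons": re.compile(r"[:;]-?[)(DP]"),              # :-) ;-( :D etc.
    "acronyms": re.compile(r"\b(?:LOL|ROTFLMAO|BTW|IMHO)\b"),
    "emphasis": re.compile(r"\*\w+\*"),                    # *really*
    "shouting": re.compile(r"\b[A-Z]{4,}\b"),              # runs of capitals
    "exclamations": re.compile(r"!+"),
}

def surface_features(text):
    """Return a count of each surface feature found in the text."""
    return {name: len(rx.findall(text)) for name, rx in FEATURES.items()}
```

Counts like these could feed directly into the kind of feature vector a rule-based classifier builds per message.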

    With regard to the overall content of USENET, the newsgroups concerning more academic or

    technical subjects tend to be populated by a more civilised readership; yet sometimes certain

    subjects are inaccessible to visiting outsiders due to their limited knowledge in that domain.

Responses and initial posts in groups like these are usually well structured and fairly detailed, as indicated by sentence and paragraph length and structure.

    With close-knit groups, such as highly technical academic groups, there is an element of prior

    knowledge of subjects which facilitates the telling of jokes based around that knowledge. Groups

    such as this, where posters are generally known to each other or articulate their views soundly, are on

    the whole more civil than other groups (where for example, a new poster might get abusive replies

    for asking a valid question). See also section 2.2.1 regarding Smokey38, a classification system

    developed to aid the identification of these abusive flame messages.


    Each message posted to USENET also has a unique structure identifying it. This is in the form of header information which describes when the message was posted and the server it was posted to, and which uniquely identifies the user who posted the message. Another feature of messages, this time dependent on the individual user, is the signature which users sometimes attach to their messages. These are usually text-based and can provide information about the poster (i.e. email, website or telephone information) but can also sometimes include elements of "ASCII art" (pictures made up from standard ASCII characters). Another popular trait is to include a favourite quote in signatures.

    When users reply to messages, the previous message is often quoted to provide context to the reply (the standard message quotation starting with the ">" symbol, with multiple quotes resulting in many of these). Such replies might start with a question or include questions, with some replies containing URLs referring the reader to extra information on websites. However, some threads, whilst lengthy (sometimes hundreds of messages long), can tend to be just idle chit-chat between two or more posters.

    There are also more specific linguistic features worth investigating: the use of imperative statements ("Learn x before you talk about it" or "Get with the program!"), which also links to the use of personal pronouns. Directly addressing a fellow user as "you" could be an indicator of response tone (this could depend upon some understanding of the context of the message or conversation).

    From the features detailed above there appear to be a number of potential phenomena to visualise, some of which have been attempted by a few of the systems mentioned in section 2. These include conversational tone, mood, emotion, themes across groups, contributor information (dominance, reliability, links to others, activity etc.) and issues of formality or informality. The scope of these phenomena, and of the features to identify them, is potentially huge; therefore a select few of these features will be examined when analysing the corpus of USENET messages.


    3.3 Statistical Corpus Analysis

    The aim of this stage of analysis is to determine whether the features identified offer an insight into, or an indication of, how text can be classified, through the derivation of a set of rules based upon any patterns which exist (or do not exist) within the corpus. Such an insight might be supported, for example, by a significant distribution of a cue word across the sample corpus. The process used is documented in the following sections, but first it is necessary to describe the origin of the sample corpus being used in this project.

    The University of Surrey news feed (news.surrey.ac.uk) was used to collect a corpus of messages with which to conduct this research. The Tin [42] newsreader under UNIX has functionality which allows the archiving (saving) of all messages within subscribed newsgroups to a specified location, and so was utilised for this project. For the purposes of this analysis, newsgroups were picked at random from a variety of different topics. The list of archived newsgroups included alt.books, alt.music, alt.music.lyrics, alt.music.smash-pumpkins, alt.politics.democrats, alt.politics.usa, alt.politics.usa.republicans, sci.environment and uk.media. Tin stored each newsgroup in a folder named after the group, and each article (message) was uniquely numbered and stored in that folder.

    The corpus amounted to a total of 6005 messages, containing 4,303,358 words. It should be noted that message header information (information used by mail servers and clients to identify and route messages) was also included in this count; however, even if header information accounted for a third of this number, the total massively exceeds the anticipated minimum of 250,000 words stated in this project's Inception Document.

    Using System Quirk [39], a language engineering workbench/tool developed by the University of Surrey, it was possible to determine the frequency of certain words within the corpus (using the Kontext tool). These keywords were specified in a separate file known as a "Word List". Kontext allows various functions to be performed on a corpus of text, from Indexing (counting the frequency of words) to the more complicated Keyword In Context (KWIC) function, which allows a user-specified number of words either side of the keyword to be displayed (thus displaying some of the context in which the word is being used). For the purposes of this project, Kontext's frequency counting functionality will be employed.


    As a precursor to this analysis, Monday 27th and Tuesday 28th January 2003 saw a collaborative workshop take place between myself, Sotiris Rompas (UniS MSc student), Dr. Andrew Salway (UniS) and Erik Kemperman, an AI programmer from V2 Labs [43] in the Netherlands and a key member of the CodeZebra project team (see section 2.1.2). The idea behind this workshop was to share ideas and get a feel for possible classification cues existing in the text of USENET messages. For this purpose, the corpus created for this project was used. Full details of the work undertaken were documented [23], and included some basic work with cyberspeak and personal pronoun wordlists, and also a look at the theme of agreement and disagreement using a wordlist comprised of synonyms of "agree" and "disagree" obtained from Princeton University's WordNet thesaurus [45].

    The following subsections continue and expand upon the initial groundwork established in the workshop by examining the features of cyberspeak, personal pronouns and synonyms of agreement and disagreement in greater detail, through a comprehensive analysis of the entire corpus (using System Quirk) and further clustering analysis using CLUTO. The analysis process is detailed in the next section, with the following sections dealing with the individual phenomena to be analysed.


    3.3.1 The Analysis Process

    In order to glean meaningful results from the data produced by System Quirk, the results (i.e. frequencies produced by Indexing) were transferred into a Microsoft Excel spreadsheet. With figures for the number of words within all messages in each group, it was also possible to calculate relative frequencies of the keywords. All of this data is tabulated at the end of this report in Appendix 1, and should be referred to as a means of complementing the discussion in the following sections.

    As described earlier (in section 2.2.3), the CLUTO clustering toolkit offers many functions and is extremely versatile; therefore the decision was made to utilise some of the basic functionality of this package in order to analyse the data obtained in the first stage of analysis. In theory, clustering these results should enable us to discern whether or not the keywords contained in the System Quirk wordlists are being used in a similar way (i.e. have associated characteristics, such as similar relative frequencies) across all or a portion of the newsgroups in question. For the purposes of this project it was decided to use the newly released gCLUTO, the more user-friendly package developed by the inventors of CLUTO [21]. Whilst gCLUTO is a separate application, it is still based heavily upon the CLUTO toolkit, and as such it proved worthwhile experimenting with the original command-line CLUTO and referencing the CLUTO manual when interpreting gCLUTO's output.

    The input file for CLUTO and gCLUTO is in the form of a matrix (a ".mat" file). The internal format of the file itself is straightforward, with the first line containing two numbers: the first being the number of rows, the second the number of columns in the matrix. Below this line the values for the cells of the matrix can be found (for an example see Appendix 2). Two other files are needed (".clabel" and ".rlabel" files) for labelling the columns and rows respectively, and these contain the appropriate labels, one per line. In CLUTO, each of these files is specified using the command described earlier, whereas in gCLUTO the user can select these files using a point-and-click interface.
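    As a minimal sketch (not the project's actual tooling), the three input files described above could be produced as follows; the filenames, newsgroups, keywords and values here are purely illustrative.

```python
# Illustrative sketch of writing CLUTO/gCLUTO input files: a dense ".mat"
# matrix plus ".rlabel" and ".clabel" label files, one label per line.
rows = ["alt.music", "alt.politics.usa"]   # newsgroups (one per matrix row)
cols = ["lol", "btw"]                      # keywords (one per matrix column)
matrix = [[0.12, 0.05],
          [0.22, 0.09]]                    # example relative frequencies

with open("cyberspeak.mat", "w") as f:
    f.write(f"{len(rows)} {len(cols)}\n")  # first line: #rows #columns
    for row in matrix:
        f.write(" ".join(str(v) for v in row) + "\n")

with open("cyberspeak.rlabel", "w") as f:  # one row label per line
    f.write("\n".join(rows) + "\n")

with open("cyberspeak.clabel", "w") as f:  # one column label per line
    f.write("\n".join(cols) + "\n")
```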

    The relative frequency data contained in the Excel spreadsheet was converted into the matrix input format used by gCLUTO (and CLUTO), and fed into the program using the default setup for the Repeated Bisection clustering method. This method involves the input matrix being split into two clustered groups; one of those groups is then selected and further bisected, and this continues until the specified number of clusters is produced (by default 10). CLUTO uses the Repeated Bisection clustering method by default (others can be specified; please refer to CLUTO's manual for more


    details [21]). It was felt that leaving everything at the defaults would give a satisfactory analysis, as the purpose of using CLUTO in this case is only to interpret/visualise the data rather than to perform in-depth analysis using different clustering techniques. Of course, given the versatility and options of the CLUTO toolkit and its derivatives, such in-depth analysis would not be difficult to achieve if so desired.
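    The repeated-bisection procedure described above can be illustrated with a toy sketch: start with one cluster and repeatedly split the largest remaining cluster until the requested number of clusters exists. Note that CLUTO optimises a criterion function when bisecting; splitting around the mean, as below, is an invented simplification for illustration only.

```python
def bisect(values):
    """Split a list of numbers into two non-empty groups around their mean."""
    pivot = sum(values) / len(values)
    low = [v for v in values if v <= pivot]
    high = [v for v in values if v > pivot]
    return (low, high) if low and high else (values[:1], values[1:])

def repeated_bisection(values, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(values)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)   # pick the largest cluster
        clusters.extend(bisect(clusters.pop(0)))
    return clusters

clusters = repeated_bisection([0.10, 0.12, 0.50, 0.52, 0.90, 0.95], 3)
```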

    This in theory provides a means of determining whether similarities or patterns exist within and between the groups; if so, clustering these features should depict this. If similarities or differences do exist, then it should facilitate the derivation of some rules upon which a suitable automatic text classification system could be built.

    gCLUTO in particular produces some interesting visualisations of the data. Alongside the clustering solution, the user can create a matrix visualisation and/or a mountain visualisation. The mountain visualisation represents the relative similarity of clusters as well as their size, internal similarity and internal deviation. Each cluster is represented by a peak in the terrain, with the shape of the peak being a Gaussian curve giving a rough estimation of the distribution of data within that cluster. Its height is proportional to the cluster's internal similarity, and its volume is proportional to the number of elements within the cluster. The colour of the peaks is also important, with red indicating low deviation blending into blue, which represents high deviation. Using the "Show Features" function, the three most descriptive features of the cluster are displayed above the cluster itself, and detailed information is available about each cluster by clicking on its number (for reasons of visual clarity, this feature was turned off for the included screenshots). The matrix visualisation uses colours to graphically represent the data in the matrix, with white representing values near zero, green representing negative values and red representing positive values (different shades of red and green being lighter/darker depending on how close to zero they are). With tree-building enabled, a column tree and a row tree are generated; these trees effectively depict the relationships between the columns and rows, showing the relationships between the discovered clusters. Examples of both of these visualisations will be produced and explained in the following sections, and an example of each can be seen in the gCLUTO screenshot in Figure 9. Clustered features are divided by horizontal black lines across the matrix visualisation area.

    The following sub-sections contain details of all of the stages of analysis mentioned above. The

    results tables and graphs being discussed are available in Appendix 1 and should be referred to whilst

    reading these sections.


    3.3.2 Cyberspeak: Features for classification?

    Cyberspace: a virtual world which relies largely upon written rather than spoken communication. The receiver (in this case the reader) is unable to see the speaker, hence non-verbal message cues are lost and the meaning of some messages is distorted or lost altogether. With the reader unable to see a speaker grin, chuckle, wink or smile, a means of communication unique to cyberspace evolved as a means of enhancing meaning. This came to be known as "Cyberspeak" (sometimes called "Netspeak").

    This consists of a number of dynamic elements, which include emotional ASCII icons, otherwise known as "emoticons" (see Figure 11), which can be used to represent facial expressions or actions. These emoticons are one aspect of cyberspeak; another is the use of cyberspeak acronyms: short words made up of the first letters of a lengthy phrase. For example, "by the way" is abbreviated to "BTW" and "also known as" to "AKA"; in the case of non-verbal cues, "laughing out loud" can be abbreviated to "LOL". Comprehensive lists of these acronyms exist in various online dictionaries of Netspeak [16, 17] and there are a number of good sites for the uninitiated [15, 28, 46].

    Figure 11: Example of "emoticons" representing smiling, winking and sticking out the tongue respectively.
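    The acronym-to-phrase mapping just described can be sketched as a small lookup table; the dictionary below only covers the examples given above and is not a complete Netspeak glossary.

```python
# Illustrative sketch: expanding common cyberspeak acronyms to full phrases.
ACRONYMS = {"btw": "by the way",
            "aka": "also known as",
            "lol": "laughing out loud"}

def expand(text):
    """Replace any known cyberspeak acronym with its full phrase."""
    return " ".join(ACRONYMS.get(word.lower(), word) for word in text.split())

expanded = expand("btw I saw it lol")  # "by the way I saw it laughing out loud"
```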

    From the initial observations in section 3.2, it became clear that these acronyms could possibly provide a feature for classifying phenomena such as mood or emotion. For example, excessive use of "LOL" in a particular group or by a particular person could be a good indicator of emotion. It is interesting to note the prominence of some of these features in everyday life; for example, the "LOL" acronym can now be found in the Oxford English Dictionary [2], perhaps a sign of just how much the Internet and associated technologies have infiltrated modern-day society.

    For the purposes of this analysis, a selection of acronyms was used as the wordlist for System Quirk. These are given in Table 1, and the full results can be found in Appendix 1b. From the results of the wordlist analysis it is plain to see that the acronyms specified do exist, although not in large numbers. There appears to be no clear pattern across all groups, with only "LOL", "BTW" and "AKA" registering frequencies above ten (note that the smallest group contained more than 22,000 words). In order to get a more accurate view of

    Table 1: Cyberspeak wordlist for System Quirk analysis: aka, bbl, brb, btw, lmao, lol, rotf, rotflmao, stfu, ttfn, ttyl, wtf.


    the distribution of the words, the relative frequencies for each group were calculated (i.e. the frequency of the feature divided by the number of words within the group; see Appendix 1b for these results).
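    The calculation just described (keyword frequency divided by the group's total word count) can be sketched as follows; the sample text and keyword are invented for illustration.

```python
import re

def relative_frequency(keyword, text):
    """Frequency of a keyword divided by the total word count of the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return words.count(keyword) / len(words)

group_text = "LOL that was funny. BTW the meeting is at noon, lol."
rf_lol = relative_frequency("lol", group_text)  # 2 occurrences / 11 words
```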

    It is noticeable that, with relative frequencies, one can determine a stark difference in the usage of "LOL" across groups. The alt.politics.democrats newsgroup has the highest relative usage of any group (0.227 × 10⁻³), even though this group is almost a third of the size of the largest group in the corpus (which is, somewhat contentiously, alt.politics.usa.republican!).

    This is perhaps a representation of the characteristics of the newsgroups selected: it might be the case that the participants of the alt.politics.democrats group are more light-hearted and jovial (or "liberal", as a cynical Republican might suggest), or it might simply be a case of the threads captured at that time being particularly humorous in nature. "LOL" can generally be assumed to be a positive response if we discount the use of sarcasm; however, doing so would defeat the point of this project.

    Examining acronyms could aid the visualisation of phenomena such as formality/informality, emotion and mood. For example, "LOL" would likely have positive or happy connotations; similarly, the complete lack of cyberspeak phenomena might indicate a more formal message or subdued group. The difficulty of examining cyberspeak from a language-use perspective is that it is a feature of online communication spaces, and is a relatively new phenomenon which has arguably been researched less than standard language features such as specific words or phrases. One could argue that cyberspeak could warrant an entire research project in itself, but the fact that it is cyber-specific and not widely used means it is difficult to derive generic rules for classification without knowing more about the context of the rest of the message, and of the entire communication space.

    Similarly, research into emoticons would be interesting; however, in this case System Quirk could not recognise the combinations of punctuation marks and symbols which constitute emoticons, and therefore this could not be pursued further. However, it might turn out to be the case that they are not a widely used feature across USENET. There is a chance that emoticons may have been more prominent in the past, when USENET was the domain of the highly technologically literate, but with the proliferation of Internet access in the early 21st century there has been a socio-demographic shift in the user base of USENET; specifically, non-technical everyday people now have access, and utilise all aspects of Internet communication, including USENET. Further research into this area would help to support or refute this theory, but is outside the scope of this particular project.
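    As a workaround for tools that cannot index punctuation sequences, emoticons could be counted with a regular expression. The pattern below only covers the three emoticons of Figure 11 (smile, wink, tongue) and is illustrative; real emoticon inventories are far larger.

```python
import re

# Matches ":)", ";)", ":P", ";P", with or without a "-" nose, e.g. ";-)".
EMOTICON = re.compile(r"[:;]-?[)P]")

sample = "Nice one ;-) see you later :) :P"
count = len(EMOTICON.findall(sample))  # 3
```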


    The results produced from gCLUTO processing are useful for visualising the data and identifying the clusters that exist. Despite the fact that CLUTO is generally more useful for larger datasets than the results produced from the cyberspeak analysis, clusters did exist which show interesting relationships between the newsgroups. The matrix visualisation produced can be seen in Figure 12, and it is clear to see similarities between groups from the hierarchical column tree created at the top. On the left-hand branch, the first leaf nodes clustered depict the relationship between alt.music.lyrics & alt.music, then between those two groups and uk.media. On the right-hand branch, alt.politics.usa & alt.politics.usa.republican are related, alt.politics & alt.music.smash-pumpkins are related, and sci.environment & alt.books are related. These pairings can then be connected as you read up the tree. What this suggests is that within the clusters discovered, the relative frequencies share similarities, which could be an indicator that the words are being used in a similar way across the related newsgroups. If one glances across the matrix visualisation it is clear to see that "BTW" is the most widespread word from the list, occurring in all but two of the newsgroups. Interestingly, the darkest shades of red (i.e. the largest values) all occur for keywords within the politics/debate groups (alt.politics.usa, alt.politics.usa.republican, alt.politics.democrats and sci.environment); alongside this, it is worth noting that most of the cyberspeak words that did exist in the sample occurred in these same groups, perhaps suggesting that their use within these groups is comparable.

    The mountain visualisation depicted in Figure 13 clearly shows the discovered clusters (represented by the peaks), but also displays an area marked in red which is not at the top of a peak. This indicates a low deviation across the values in the cluster, but also a low level of internal similarity (i.e. the features within that cluster aren't very similar, yet they do not differ greatly in value; this is to be expected with very few values).


    Figure 12: A gCLUTO Matrix Visualisation of clustered features for "Cyberspeak" words within the indicated nine newsgroups. Here it is clear to see similarities between groups from the hierarchical tree created at the top. On the left-hand branch, the first leaf nodes clustered depict similarities between alt.music.lyrics & alt.music, then between those two groups and uk.media. On the right-hand branch alt.politics.usa & alt.politics.usa.republican are related, alt.politics & alt.music.smash-pumpkins are related, and sci.environment & alt.books are related. These pairings can then be connected as you read up the tree.


    Figure 13: A gCLUTO Mountain Visualisation of clustered features for "Cyberspeak". Each peak in the terrain represents a discovered cluster. Height is proportional to the cluster's internal similarity, and volume is proportional to the number of elements within the cluster, so it is possible to see here that there are several distinct clusters. Colour is also important, with red indicating low deviation blending into blue, which represents high deviation.


    3.3.3 Personal Pronouns: Features for classification?

    personal pronoun (n.): A pronoun designating the person speaking (I, me, we, us), the person spoken to (you), or the person or thing spoken about (he, she, it, they, him, her, them). [6]

    The nature of personal pronouns means that they are more common in spoken conversations and, in this case, in the more conversational style associated with Internet-based communication. With the above definition in mind, it is understandable why the usage of these pronouns could possibly be an indicator of any number of phenomena within a conversation: tone, mood, emotion, inclusive behaviour, and exclusivity (perhaps related to a sense of community or belonging which may or may not exist within a newsgroup). It is these pronouns, constructed into phrase rules (by associating other words and constructing key phrases), which form the basis of Ellen Spertus' Smokey [38] project. For more details about Smokey please see section 2.2.1.
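    The "phrase rule" idea (pairing a pronoun such as "you" with other words to build key phrases) can be sketched as follows. The rule below is invented for illustration, loosely in the spirit of such rules; it is not one of Smokey's actual rules.

```python
import re

# Hypothetical phrase rule: "you" followed somewhere by an insult word.
HOSTILE_RULE = re.compile(r"\byou\b.*\b(idiot|clueless|fool)\b", re.IGNORECASE)

def looks_hostile(sentence):
    """Flag a sentence if it matches the hostile phrase rule."""
    return HOSTILE_RULE.search(sentence) is not None

flag = looks_hostile("You are clearly clueless about this")  # True
```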

    Substantial areas for future research exist in the positioning of personal pronouns within a sentence, i.e. the words surrounding the pronoun and the sentence structure can provide context to the pronoun's usage. However, in this case the project is approached from a computer science perspective, and as such deals primarily with the scope for automatically classifying text (and to do this at a computational level it is essential to start with the basics and look at individual words). With this in mind, the frequencies and relative frequencies of the keywords in a personal pronoun wordlist (see Table 2) were extracted in the same manner as detailed in section 3.3.1. These results can be found in Appendix 1c.

    Table 2: Personal pronouns wordlist for System Quirk analysis: you, I, he, she, it, they, us, we, them, her, him, me.

    Significant similarities exist in the usage of the pronoun "I" in six out of the nine newsgroups, the exceptions being the music-related groups (alt.music, alt.music.lyrics and alt.music.smash-pumpkins, which have higher relative frequencies) and alt.books (which has a lower relative frequency). This suggests that "I" is used in a similar way across the political/debate newsgroups within the sample. A similar trend is evident with the use of "we", "us", "them" and "me" in the political/debate groups, with all of these keywords having similar relative frequencies for each of the four groups (namely alt.politics.usa, alt.politics.democrats, alt.politics.usa.