Clustering the Short Stories of Edgar Allan Poe Using Word Groups and Formal Concept Analysis

  • 8/4/2019 Clustering the Short Stories of Edgar Allan Poe Using Word Groups and Formal Concept Analysis

    Clustering the Short Stories of Edgar Allan Poe Using Word Groups and Formal Concept Analysis

    Tuesday, June 23rd, 2009
    Digital Humanities 2009
    University of Maryland, College Park

    Roger Bilisoly, Ph.D.
    Department of Mathematical Sciences
    Central Connecticut State University
    New Britain, Connecticut

    Why analyze Edgar Allan Poe?

    Poe wrote in many styles and used many themes; see Hoffman (1998).

    Some styles: Horror (The Black Cat), Detective (The Murders in the Rue Morgue), Satire (A Predicament), Science Fiction (The Unparalleled Adventure of One Hans Pfaall), etc.

    Poe's styles are distinctive, so his stories should be easy for a computer to cluster.

    There are only ~70 short stories. These take up only ~750 pages in Poe (1992).

    Poe is available on the Web for free. He lived 1809-1849, so his original publications are out of copyright.

    Project Gutenberg has Poe (and many other authors) at http://www.gutenberg.org/wiki/Main_Page; see Poe (2000).

    First Goal: Pick a text metric.

    Text similarity is used in information retrieval (IR) to help match queries and texts. This can be used to measure distances between two texts.

    There are many ways to perform IR.
    Vector Space Models: the term-document matrix allows a geometric approach in high dimensional spaces (e.g., $\mathbb{R}^{20,591}$ for Poe's short stories).
    Probabilistic Models: find $\max_i P(\text{Text}_i \mid \text{Query})$.
    Language Models: find $\max_i P(\text{Query} \mid \text{Language Model}_i)$.
    See Grossman and Frieder (2004) for these and additional approaches.

    IR is well tested. Search engines such as Google.com are profitable and competitive.

    However, the IR approaches have drawbacks:
    1. The above text distances lack intuitive appeal: e.g., angles in 20,591 dimensional space are hard to comprehend directly. Hence, linking text distances to a human's experience of reading can be difficult.
    2. There are problems with both sparseness and complexity of language.
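    To make the vector space distance concrete, here is a minimal Python sketch of the angle between two term-count vectors. It is an illustration only, not the method used in this work: the whitespace tokenizer and the function names term_vector, cosine_similarity, and angle_degrees are assumptions introduced for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Count lowercase whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def angle_degrees(a, b):
    """Angle between the two vectors, clamped against rounding error."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))
```

    Two texts with no words in common are 90 degrees apart, and identical texts are 0 degrees apart. Values in between are exactly the kind of number that is hard to relate to the experience of reading.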

    Second Goal: Find clusters that are understandable to humans.

    This researcher uses formal concept theory.

    Formal concepts have the form {{objects}, {attributes}}. See Carpineto and Romano (2004) for a detailed exposition.

    In the example below, rows represent authors (objects) and columns represent attributes.

    Idea: look for maximal submatrices (the rows and columns need not be adjacent) that contain all 1s. This can be done efficiently by Ganter's algorithm. For programming details see Fu and Nguifo (2004).

    Author    Poet  ShortStories  Novelist  USA  UK  Male  Female
    Poe       1     1             0         1    0   1     0
    Stowe     0     0             1         1    0   0     1
    Dickens   0     0             1         0    1   1     0
    Eliot     0     0             1         0    1   0     1
    Whitman   1     0             0         1    0   1     0

    {{Poe, Whitman}, {Poet, USA, Male}} is a formal concept (shown in red on the original slide). Other examples:
    {{Poe}, {Poet, Short Stories, USA, Male}}
    {{Stowe, Eliot}, {Novelist, Female}}
    {{Stowe, Dickens, Eliot}, {Novelist}}
    {{Poe, Dickens, Whitman}, {Male}}
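    The concepts of this small table can be checked by brute force. The Python sketch below is not Ganter's algorithm: it simply closes every subset of authors, which is only feasible for a handful of objects. The names CONTEXT, common_attributes, objects_having, and all_concepts are introduced here for illustration.

```python
from itertools import combinations

ATTRIBUTES = ["Poet", "ShortStories", "Novelist", "USA", "UK", "Male", "Female"]

# Incidence table from the slide, stored as the set of attributes per author.
CONTEXT = {
    "Poe": {"Poet", "ShortStories", "USA", "Male"},
    "Stowe": {"Novelist", "USA", "Female"},
    "Dickens": {"Novelist", "UK", "Male"},
    "Eliot": {"Novelist", "UK", "Female"},
    "Whitman": {"Poet", "USA", "Male"},
}


def common_attributes(objects):
    """Attributes shared by every object in the set (all attributes if it is empty)."""
    shared = set(ATTRIBUTES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return frozenset(shared)


def objects_having(attributes):
    """Objects that possess every attribute in the set."""
    return frozenset(o for o, attrs in CONTEXT.items() if attributes <= attrs)


def all_concepts():
    """Close every subset of objects; the distinct (extent, intent) pairs are the concepts."""
    concepts = set()
    names = list(CONTEXT)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset)
            extent = objects_having(intent)
            concepts.add((extent, intent))
    return concepts
```

    Calling all_concepts() returns every pair listed above, together with the two extremes: all authors with no shared attribute, and all attributes with no author.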

    Example Continued: All the concepts together form a Galois lattice.

    [Figure: Galois lattice for the author example, with node labels Poe, Short Story; Stowe; Dickens; Eliot; Whitman, Poet; American; Male; Female; English; Novelist.]

    We see that Dickens is Male and English (going upwards) and all English are Novelists (going upwards again).
    We see that Males are Whitman and Dickens (going downwards), as well as Poe (going downwards again).

    These concepts are only true for the incidence matrix on the preceding slide. Adding authors or attributes usually changes the formal concepts.

    Complication: Word rates can be a function of text length.

    Example: Compare the word diversity (an inverse rate) of The Black Cat to The Unparalleled Adventure of One Hans Pfaall.

    Word diversity = (# tokens)/(# types)

    [Figure: word diversity curves for the two stories. The top line represents The Black Cat and the bottom line Hans Pfaall.]

    Clearly the former is higher over its range even though the ending values are 3.17 < 5.61: the final value for The Black Cat is 3.17 tokens per type, and for Hans Pfaall it is 5.61.

    An approximate solution is to consider sets of stories close in size. Three groups are used here: 2001 to 3000 words; 3001 to 4200; and 4201 to 6000. (This includes 44 of Poe's short stories.)

    See Section 4.6 and Figure 4.6 of Bilisoly (2008) and chapter 1 of Baayen (2001).
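    The ratio (# tokens)/(# types) can be tracked as a story is read from start to finish, which shows why its final value depends on length. The Python sketch below is one way to compute such a curve. The regular-expression tokenizer and the names tokenize, diversity_curve, and size_group are assumptions for illustration; only the three word-count ranges come from the slide.

```python
import re

# Word-count ranges of the three size groups named on the slide.
SIZE_GROUPS = [(2001, 3000), (3001, 4200), (4201, 6000)]


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (an assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def diversity_curve(tokens):
    """Running (# tokens)/(# types) after each successive token."""
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append(count / len(seen))
    return curve


def size_group(n_tokens):
    """Index of the size group containing n_tokens, or None if it falls outside all three."""
    for index, (low, high) in enumerate(SIZE_GROUPS):
        if low <= n_tokens <= high:
            return index
    return None
```

    The last element of diversity_curve(tokenize(story)) is the story's final tokens-per-type value. Comparing stories only within the same size_group is the approximate fix described above.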

    A Galois lattice for Poe stories (as objects) and word groups (as attributes).

    Words are grouped by five themes, each of which is evocative for a human.
    Death: death, corpse, dead, murder, died, die, deceased, ...
    Body: eyes, head, hand, body, feet, heart, face, eye, ...
    Spiritual: soul, god, spirit, heaven, moral, angel, devil, ...
    Horror: horror, terror, fear, horrible, anxiety, fearful, ...
    Family: family, wife, mother, daughter, father, uncle, ...

    Word groups were formed using a thematic thesaurus. These are available online: e.g., WordNet 2.1.

    Dimensionality using groups is much lower, and frequencies are much higher, so sparseness is no longer a problem.
    Word groups are easier to link to literary ideas, and words have been analyzed by human critics, e.g., see Clough (1930).

    For the incidence matrix, let 1 = story in the top 25% for that word group, 0 otherwise (other percentiles have been tested).
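    The step from word groups to an incidence matrix can be sketched in Python as follows. This is a hedged illustration: the word sets contain only the words shown on the slide (the full groups are longer), and ranking stories and keeping the top fraction is an assumed reading of "top 25%". The names WORD_GROUPS, rates_per_thousand, and incidence are hypothetical.

```python
# Partial word groups: only the words listed on the slide.
WORD_GROUPS = {
    "Death": {"death", "corpse", "dead", "murder", "died", "die", "deceased"},
    "Body": {"eyes", "head", "hand", "body", "feet", "heart", "face", "eye"},
    "Spiritual": {"soul", "god", "spirit", "heaven", "moral", "angel", "devil"},
    "Horror": {"horror", "terror", "fear", "horrible", "anxiety", "fearful"},
    "Family": {"family", "wife", "mother", "daughter", "father", "uncle"},
}


def rates_per_thousand(tokens):
    """Occurrences of each word group per 1,000 tokens of one story."""
    total = len(tokens)
    if total == 0:
        return {group: 0.0 for group in WORD_GROUPS}
    return {
        group: 1000.0 * sum(token in words for token in tokens) / total
        for group, words in WORD_GROUPS.items()
    }


def incidence(rates_by_story, top_fraction=0.25):
    """1 if a story ranks in the top fraction for a group, else 0 (assumed cutoff rule)."""
    stories = list(rates_by_story)
    n_top = max(1, round(top_fraction * len(stories)))
    matrix = {story: {} for story in stories}
    for group in WORD_GROUPS:
        ranked = sorted(stories, key=lambda s: rates_by_story[s][group], reverse=True)
        top = set(ranked[:n_top])
        for story in stories:
            matrix[story][group] = 1 if story in top else 0
    return matrix
```

    Here rates_by_story maps each story title to the dictionary returned by rates_per_thousand, computed over stories from a single size group. The resulting 0/1 matrix is the input to the concept computation.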

    Galois Lattice for Poe

    [Figure: lattice diagram. Its nodes are listed below as word groups: stories.]

    One word group:
    Family: 4 Beasts, Amontillado, Eleonora, Morella, 3 Sundays, Frenchman's Sling
    Horror: Red Death, Amontillado, Imp of Perverse, Tell-Tale Heart, Morella, Eiros and Charmion
    Spiritual: 4 Beasts, Imp of Perverse, Eleonora, Morella, Eiros and Charmion, Frenchman's Sling
    Body: Red Death, Amontillado, Tell-Tale Heart, Eleonora, Morella, Frenchman's Sling
    Death: 4 Beasts, Red Death, Imp of Perverse, Tell-Tale Heart, Eleonora, Morella

    Two word groups:
    Spiritual, Family: 4 Beasts, Eleonora, Morella, Frenchman's Sling
    Spiritual, Horror: Imp of Perverse, Morella, Eiros and Charmion
    Body, Family: Amontillado, Eleonora, Morella, Frenchman's Sling
    Body, Horror: Red Death, Amontillado, Tell-Tale Heart, Morella
    Death, Horror: Red Death, Imp of Perverse, Tell-Tale Heart, Morella
    Death, Spiritual: 4 Beasts, Imp of Perverse, Eleonora, Morella
    Death, Body: Red Death, Tell-Tale Heart, Eleonora, Morella

    Three word groups:
    Body, Horror, Family: Amontillado, Morella
    Body, Spiritual, Family: Eleonora, Morella, Frenchman's Sling
    Death, Spiritual, Family: 4 Beasts, Eleonora, Morella
    Death, Spiritual, Horror: Imp of Perverse, Morella
    Death, Body, Horror: Red Death, Tell-Tale Heart, Morella

    Four word groups:
    Death, Body, Spiritual, Family: Eleonora, Morella

    Five word groups:
    Death, Body, Spiritual, Horror, Family: Morella
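    Every multi-group node of this lattice is the intersection of the single-group story sets, so the diagram can be checked mechanically. The Python sketch below copies the five single-group sets from the lattice, with titles abbreviated as on the slide. The names GROUP_EXTENTS, stories_in_all, group_count, and lattice_nodes are introduced for illustration.

```python
from itertools import combinations

# Stories listed under each single word group in the lattice.
GROUP_EXTENTS = {
    "Death": {"4 Beasts", "Red Death", "Imp of Perverse", "Tell-Tale Heart",
              "Eleonora", "Morella"},
    "Body": {"Red Death", "Amontillado", "Tell-Tale Heart", "Eleonora",
             "Morella", "Frenchman's Sling"},
    "Spiritual": {"4 Beasts", "Imp of Perverse", "Eleonora", "Morella",
                  "Eiros and Charmion", "Frenchman's Sling"},
    "Horror": {"Red Death", "Amontillado", "Imp of Perverse", "Tell-Tale Heart",
               "Morella", "Eiros and Charmion"},
    "Family": {"4 Beasts", "Amontillado", "Eleonora", "Morella", "3 Sundays",
               "Frenchman's Sling"},
}


def stories_in_all(groups):
    """Stories that appear under every one of the given word groups."""
    return set.intersection(*(GROUP_EXTENTS[g] for g in groups))


def group_count(story):
    """Number of word groups in which the story appears."""
    return sum(story in extent for extent in GROUP_EXTENTS.values())


def lattice_nodes():
    """Map each non-empty story set to the largest set of groups that yields it."""
    nodes = {}
    names = list(GROUP_EXTENTS)
    for size in range(1, len(names) + 1):
        for groups in combinations(names, size):
            extent = frozenset(stories_in_all(groups))
            if extent:
                nodes[extent] = groups  # larger group sets overwrite smaller ones
    return nodes
```

    For example, stories_in_all(["Death", "Body", "Horror"]) gives Red Death, Tell-Tale Heart, and Morella, and group_count("Morella") is 5, in agreement with the bottom node of the lattice.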

    Morella appears in all 5 word groups (making it quintessential Poe). The Tell-Tale Heart and The Masque of the Red Death appear in 3.

    The Tell-Tale Heart is about a man who kills his older roommate and hides the body under the floor; when the police visit, he cracks and shows them the body.
    This story ranks 3rd (of 13) in Death (5.12 per K), 1st in Body (18.18 per K), and 1st in Horror (9.32 per K), where "per K" means per thousand words.
    This story is, in fact, considered one of Poe's iconic tales.
    It forms a concept with Morella and The Masque of the Red Death. These three stories have the same genre: horror.

    Morella is the narrator's wife, who is obsessed with mysticism and dies during childbirth. The daughter grows up to resemble Morella more and more, and upon baptism she cries "I am here," and then dies herself.
    Poe has several stories about wives who die: Berenice, Ligeia, The Oval Portrait, The Oblong Box and Eleonora (who dies before she and the narrator can marry). So this plot is one Poe explored several times.
    Morella ranks 1st in Death (8.94 per K), 3rd in Body (13.18 per K), 1st in Spiritual (8.47 per K), 2nd in Horror (5.65 per K), and 2nd in Family (5.18 per K).

    This Morella paragraph has all five word groups: spiritual, body, family, death and horror.

    And as years rolled away, and I gazed day after day upon her holy, and mild, and eloquent face, and poured over her maturing form, day after day did I discover new points of resemblance in the child to her mother, the melancholy and the dead. And hourly grew darker these shadows of similitude, and more full, and more definite, and more perplexing, and more hideously terrible in their aspect. For that her smile was like her mother's I could bear; but then I shuddered at its too perfect identity, that her eyes were like Morella's I could endure; but then they, too, often looked down into the depths of my soul with Morella's own intense and bewildering meaning. And in the contour of the high forehead, and in the ringlets of the silken hair, and in the wan fingers which buried themselves therein, and in the sad musical tones of her speech, and above all -- oh, above all, in the phrases and expressions of the dead on the lips of the loved and the living, I found food for consuming thought and horror, for a worm that would not die.

    References

    Word Frequency Distributions. R. Harald Baayen (2001).

    Practical Text Mining with Perl. Roger Bilisoly (2008).

    The Use of Color Words by Edgar Allan Poe. Wilson Clough (1930). PMLA, 45(2).

    Concept Data Analysis: Theory and Applications. Claudio Carpineto and Giovanni Romano (2004).

    A Lattice Algorithm for Data Mining. Huaiguo Fu and Engelbert Mephu Nguifo (2004). http://www.cril.univ-artois.fr/~mephu/fu-mephu_ISI_04.pdf

    Information Retrieval: Algorithms and Heuristics. David A. Grossman and Ophir Frieder (2004).

    Poe Poe Poe Poe Poe Poe Poe. Daniel Hoffman (1998).

    The Collected Tales and Poems of Edgar Allan Poe. The Modern Library (1992).

    The Works of Edgar Allan Poe, Volumes 1 through 5. Edgar Allan Poe (2000). Project Gutenberg, EText Nos. 2147-2151. http://www.gutenberg.org/browse/authors/p#a481

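    Appendix: Core Mathematica Code for Ganter's Algorithm

    primeA[v_, r_] := Module[{maxP, product},
      (* Output is an Attribute *)
      (* v is a 0/1 vector over objects; r is the incidence matrix *)
      product = v.r;
      maxP = Fold[Plus, 0, v];
      (* The comparison operator was lost in extraction; == is the reading
         consistent with the prime operator: an attribute is kept when every
         selected object has it *)
      Return[Map[If[# == maxP, 1, 0] &, product]]]

    nextA[v_, r_] := Module[{i, new, first},
      (* Output is an Attribute *)
      (* The next line is truncated in the source: the text between "If[i" and
         "-1, Continue[]" is missing and is not reconstructed here *)
      Do[first = v*Table[If[i-1, Continue[], Null];
        If[Min[first[[1;;i0-1]] - new[[1;;i0-1]]] >= 0, Break[], Null],
        {i0, Length[v], 1, -1}];
      Return[new]]

    compare[v1_, v2_] := Module[{ans, idiff = 0},
      (* Consider v1 and v2 as binary numbers. Then v1 > v2 returns 1, equality
         returns 0, and v1 < v2 returns -1 *)
      Do[If[v1[[i]] == v2[[i]], Null, idiff = i; Break[]], {i, 1, Length[v1]}];
      Return[If[idiff > 0, Sign[v1[[idiff]] - v2[[idiff]]], 0]]]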
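    Because part of the Mathematica listing did not survive, the following Python sketch shows the textbook form of Ganter's NextClosure procedure on the five-author example. It is not a transcription of the code above. The data layout and the names ROWS, closure, next_closure, and all_intents are assumptions made for this illustration.

```python
ATTRIBUTES = ["Poet", "ShortStories", "Novelist", "USA", "UK", "Male", "Female"]

# Incidence matrix of the author example; columns follow ATTRIBUTES.
ROWS = {
    "Poe": (1, 1, 0, 1, 0, 1, 0),
    "Stowe": (0, 0, 1, 1, 0, 0, 1),
    "Dickens": (0, 0, 1, 0, 1, 1, 0),
    "Eliot": (0, 0, 1, 0, 1, 0, 1),
    "Whitman": (1, 0, 0, 1, 0, 1, 0),
}
N = len(ATTRIBUTES)


def closure(attrs):
    """Double prime: attributes shared by all objects having every attribute in attrs."""
    objects = [row for row in ROWS.values() if all(row[j] for j in attrs)]
    return frozenset(j for j in range(N) if all(row[j] for row in objects))


def next_closure(current):
    """Closed attribute set that follows `current` in lectic order, or None at the end."""
    kept = set(current)
    for i in reversed(range(N)):
        if i in kept:
            kept.discard(i)
        else:
            candidate = closure(kept | {i})
            # Accept only if closing added no attribute with index below i.
            if all(j >= i for j in candidate - kept):
                return candidate
    return None


def all_intents():
    """Enumerate every concept intent exactly once, starting from the closure of the empty set."""
    intents = []
    current = closure(frozenset())
    while current is not None:
        intents.append(current)
        current = next_closure(current)
    return intents
```

    Each intent is a set of column indices into ATTRIBUTES. The matching extent is the set of authors whose rows contain a 1 in all of those columns. The procedure visits each closed set once and needs no memory of earlier results, which is the source of its efficiency.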

    Note that 12 slides in a 4 by 3 rectangle fill an area of 33 by 34 inches, which easily fits in a square meter. A few useful slides to bring along (but not to make part of the poster) are included after this slide.

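    Ganter's Algorithm:

    The prime operator $'$

    Let $X$ be a subset of $O$, then define
    $X' = \{\, a \in A : x R a \ \text{for all } x \in X \,\}$

    Let $Y$ be a subset of $A$, then define
    $Y' = \{\, o \in O : o R y \ \text{for all } y \in Y \,\}$

    Concepts of the context $(O, A, R)$ are pairs of sets $(X, Y)$ where
    $X \subseteq O,\ Y \subseteq A,\ X' = Y \text{ and } Y' = X$

    Example: Concept = {{Stowe, Dickens, Eliot}, {Novelist}}

    Definition from Concept Data Analysis by Carpineto and Romano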
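    As a check of the definition, the example concept can be worked through against the five-author table from the earlier slide. The second computation is a counterexample chosen here for illustration: it shows a pair that fails the definition.

```latex
\begin{align*}
X  &= \{\text{Stowe}, \text{Dickens}, \text{Eliot}\} \\
X' &= \{\text{Novelist}, \text{USA}, \text{Female}\}
      \cap \{\text{Novelist}, \text{UK}, \text{Male}\}
      \cap \{\text{Novelist}, \text{UK}, \text{Female}\}
    = \{\text{Novelist}\} = Y \\
Y' &= \{\, o \in O : o \, R \, \text{Novelist} \,\}
    = \{\text{Stowe}, \text{Dickens}, \text{Eliot}\} = X \\[1ex]
\{\text{Poe}, \text{Dickens}\}' &= \{\text{Male}\}, \qquad
\{\text{Male}\}' = \{\text{Poe}, \text{Dickens}, \text{Whitman}\} \neq \{\text{Poe}, \text{Dickens}\}
\end{align*}
```

    So ({Stowe, Dickens, Eliot}, {Novelist}) is a concept, while ({Poe, Dickens}, {Male}) is not. Applying the prime operator twice enlarges the latter to the concept {{Poe, Dickens, Whitman}, {Male}}.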

    Inclusiveness of Concept Lattices

    [Figure: this is for a Galois lattice that includes all 70 short stories.]

    Poe's Short Stories

    1. The Unparalleled Adventure of One Hans Pfaall
    2. The Gold Bug
    3. Four Beasts in One
    4. The Murders in the Rue Morgue
    5. The Mystery of Marie Rogêt
    6. The Balloon-Hoax
    7. MS. Found in a Bottle
    8. The Oval Portrait
    9. The Purloined Letter
    10. The Thousand-and-Second Tale of Scheherezade
    11. A Descent into the Maelström
    12. Von Kempelen and his Discovery
    13. Mesmeric Revelation
    14. The Facts in the Case of M. Valdemar
    15. The Black Cat
    16. The Fall of the House of Usher
    17. Silence -- a Fable
    18. The Masque of the Red Death
    19. The Cask of Amontillado
    20. The Imp of the Perverse
    21. The Island of the Fay
    22. The Assignation
    23. The Pit and the Pendulum
    24. The Premature Burial
    25. The Domain of Arnheim
    26. Landor's Cottage
    27. William Wilson
    28. The Tell-Tale Heart
    29. Berenice
    30. Eleonora
    31. Ligeia
    32. Morella
    33. A Tale of the Ragged Mountains
    34. The Spectacles
    35. King Pest
    36. Three Sundays in a Week
    37. The Devil in the Belfry
    38. Lionizing
    39. X-ing a Paragrab
    40. Metzengerstein
    41. The System of Doctor Tarr and Professor Fether
    42. How to Write a Blackwood Article
    43. A Predicament
    44. Mystification
    45. Diddling
    46. The Angel of the Odd
    47. Mellonta Tauta
    48. The Duc de L'Omlette
    49. The Oblong Box
    50. Loss of Breath
    51. The Man That Was Used Up
    52. The Business Man
    53. The Landscape Garden
    54. Maelzel's Chess-Player
    55. The Power of Words
    56. The Colloquy of Monos and Una
    57. The Conversation of Eiros and Charmion
    58. Shadow -- A Parable
    59. Philosophy of Furniture
    60. A Tale of Jerusalem
    61. The Sphinx
    62. Hop Frog
    63. The Man of the Crowd
    64. Never Bet the Devil Your Head
    65. Thou Art the Man
    66. Why the Little Frenchman Wears his Hand in a Sling
    67. Bon-Bon
    68. Some Words with a Mummy
    69. Literary Life of Thingum Bob Esq.
    70. Morning on the Wissahiccon