Social Network Analysis. Computational Journalism week 10


Jonathan Stray, Columbia University, Fall 2015. Syllabus at http://www.compjournalism.com/?p=133


Frontiers of Computational Journalism

Columbia Journalism School
Week 9: Social Network Analysis

November 20, 2015

Network

A set of people and a set of connections between pairs of them.

Types of connections

Social network analysis: only one type of connection between individuals (e.g. "friend")

Link analysis: multiple types of connections
• friend
• brother
• employer
• went to university with
• sold a car to
• owns 51% of

Link analysis is much more relevant to journalism, because it allows representation of much more detail and context.
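A minimal sketch of the difference in data terms, in Python. All names and relationships here are hypothetical, invented for illustration. A social network is a set of untyped pairs, while a link-analysis graph attaches a relationship type to every edge.

```python
# Hypothetical example data: every name and relationship is invented.

# Social network analysis: a single, untyped relation ("friend").
friend_edges = {("Ann", "Bob"), ("Bob", "Cat")}

# Link analysis: each edge is (source, target, relationship type).
typed_edges = [
    ("Ann", "Bob", "friend"),
    ("Ann", "Cat", "employer"),
    ("Bob", "Acme Ltd", "owns 51% of"),
    ("Cat", "Bob", "sold a car to"),
]

def links_of(entity, edges):
    """Return (other entity, relationship, direction) for each edge touching entity."""
    found = []
    for source, target, kind in edges:
        if source == entity:
            found.append((target, kind, "out"))
        elif target == entity:
            found.append((source, kind, "in"))
    return found

bob_links = links_of("Bob", typed_edges)
```

Keeping the relationship type and direction on each edge is what lets a reporter ask "who owns what" separately from "who knows whom."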

People Act in Groups

Family and friendships: I am most closely connected to a small set of people, who are usually closely connected to each other.

Business: I am much more likely to do business with people I already know.

Influence: I listen to people I know more than I listen to strangers.

Norms: what is right depends on what the people around me think.

People tend to marry, do business with, spend time with, etc. people from similar backgrounds... and people who have social ties tend to be similar.

Homophily

Homophily is the principle that contact between similar people occurs at a higher rate than among dissimilar people. The pervasive fact of homophily means that cultural, behavioral, genetic, or material information that flows through networks will tend to be localized. Homophily implies that distance in terms of social characteristics translates into network distance, the number of relationships through which a piece of information must travel to connect two individuals.

- McPherson, Smith-Lovin, Cook, Birds of a feather: homophily in social networks
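One rough way to check for homophily in a dataset is to compare the share of edges that join similar people with the share expected if ties ignored similarity. The sketch below uses a hypothetical six-person network with an invented "red"/"blue" attribute. It illustrates the idea only and is not the measure used by McPherson et al.

```python
from itertools import combinations

# Hypothetical toy data: six people, each with one invented attribute.
group = {"a": "red", "b": "red", "c": "red",
         "d": "blue", "e": "blue", "f": "blue"}
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("c", "d")]

# Observed: fraction of edges that connect two people with the same attribute.
same = sum(1 for u, v in edges if group[u] == group[v])
observed = same / len(edges)

# Baseline: fraction of all possible pairs that share the attribute,
# i.e. what we would see if ties formed without regard to similarity.
pairs = list(combinations(group, 2))
baseline = sum(1 for u, v in pairs if group[u] == group[v]) / len(pairs)

print(f"same-attribute edges: {observed:.2f}, baseline: {baseline:.2f}")
```

An observed rate well above the baseline is consistent with homophily, though it cannot say whether similar people sought each other out or became similar after connecting.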

Structure Relates to Behavior

In a 1951 experiment, researchers had five people work together, only allowed to communicate according to one of the patterns above. They were each given a card with several symbols on it. The task was to determine which symbol was in common between all of the cards. It was repeated many times.

How did the groups organize themselves? Which patterns were fastest?

From H. Leavitt, Some effects of certain communication patterns on group performance, Journal of Abnormal Psychology 46(1)

Correlation of different types of info

Suppose you have a record of phone numbers called, a database of political campaign donations, and a list of government appointees. Put them together, and you have this story:

WASHINGTON - Time and again, Texas Gov. Rick Perry picked up his office phone in the months before he would announce his bid for the presidency. He dialed wealthy friends who were his big fundraisers and state officials who owed him for their jobs.

Perry also met with a Texas executive who would later co-found an independent political committee that has promised to raise millions to support Perry but is prohibited from coordinating its activities with the governor.

- Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011
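Combining the three sources is essentially a join on identity. The sketch below shows one way it might look. Every record is a hypothetical placeholder, and the `directory` that maps phone numbers to names is an assumption: the slide does not say how numbers were identified, and in practice that step is usually manual reporting.

```python
from collections import Counter

# Hypothetical placeholder records, not real data.
calls = ["555-0101", "555-0102", "555-0101", "555-0199"]  # numbers dialed
directory = {"555-0101": "Donor One", "555-0102": "Official Two"}  # assumed lookup
donations = {"Donor One": 50000}   # person -> total donated
appointees = {"Official Two"}      # people appointed to state jobs

# Count calls per identified person; unidentified numbers are skipped.
call_counts = Counter(directory[n] for n in calls if n in directory)

# One row per person called, annotated from the other two sources.
leads = [
    {
        "person": person,
        "calls": n_calls,
        "donated": donations.get(person, 0),
        "appointee": person in appointees,
    }
    for person, n_calls in call_counts.most_common()
]
```

Each row in `leads` is a starting point for reporting, not a finding.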

Social Network Analysis in Journalism

• Identify people or communities
• Track money and criminal networks
• Understand spread of information and behavior
• Illustrate complex stories

Useful in all areas where CS intersects journalism! (Reporting, communication, filtering, effect tracking)

Two major analysis methods

…after you have the network data, which may be a very manual process.

• Look at a visualization
• Apply algorithm

In both cases, the results are not interpretable without context!

Force-Directed Layout

Each edge is a "spring" with a fixed preferred length. Plus a global repulsive force that pushes all nodes apart.
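A minimal sketch of such a layout, assuming a linear spring force and an inverse-square repulsion. The constants (`spring`, `repulse`, `max_step`) and the three-node example are arbitrary choices for illustration; production layout tools use more refined force models and cooling schedules.

```python
import math
import random

def force_layout(nodes, edges, length=1.0, spring=0.1, repulse=0.1,
                 steps=500, max_step=0.1, seed=0):
    """Return {node: [x, y]} after a simple spring-and-repulsion simulation."""
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Global repulsion: every pair of nodes pushes apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulse / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Springs: each edge pulls (or pushes) toward its preferred length.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = spring * (dist - length)
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist
        # Move each node along its net force, capping the step for stability.
        for n in nodes:
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            scale = min(1.0, max_step / size) if size > 0 else 0.0
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale
    return pos

# A three-node chain a - b - c: the unconnected ends should drift apart.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
layout = force_layout(nodes, edges)
```

Because the starting positions are random, different seeds give different pictures of the same graph, which is one reason layouts need careful interpretation.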

From The Effect of Graph Layout on Inference from Social Network Data, Blythe et al.


We asked respondents three questions about the same five focal nodes in each sociogram:

1) how many subgroups were in the sociogram
2) how "prominent" was each player in the sociogram
3) how important a "bridging" role did each player occupy in the sociogram

Centrality

Often identified with "influence" or "power." Often important in journalism.

We can visualize the graph and use our eyes, or we can compute centrality values algorithmically.

Degree centrality: number of edges

Models: cases where the number of connections is important. Example: which celebrity can reach the most people at once?
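As a small illustration, degree centrality on a hypothetical five-node network stored as an adjacency dict (the node names and edges are made up):

```python
# Hypothetical toy network: node -> set of neighbors (undirected).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

# Degree centrality is just the number of edges at each node.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
most_connected = max(degree, key=degree.get)
```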

Closeness centrality: average distance to all other nodes

Models: cases where time taken to reach a node is important. Example: who finds out about gossip first?
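A sketch of the average-distance calculation using breadth-first search, on a hypothetical toy network. It assumes an unweighted, connected graph. With this definition a lower value means more central; many libraries report the reciprocal instead.

```python
from collections import deque

# Hypothetical toy network: node -> set of neighbors (undirected, connected).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

def distances_from(graph, source):
    """Breadth-first search: number of hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def average_distance(graph, node):
    """Mean hop count from node to all other nodes (assumes a connected graph)."""
    dist = distances_from(graph, node)
    others = [d for n, d in dist.items() if n != node]
    return sum(others) / len(others)

closeness = {node: average_distance(graph, node) for node in graph}
closest = min(closeness, key=closeness.get)
```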

Betweenness centrality: number of shortest paths that pass through node

Models: cases where control over transmission is important. Example: who has the most power to make introductions?
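A straightforward (not optimized) sketch on a hypothetical toy network. When several shortest paths tie, each pair's credit is split across them, which is the usual convention but an assumption beyond the one-line definition above. Faster methods such as Brandes' algorithm exist for large graphs.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy network: node -> set of neighbors (undirected).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

def bfs_counts(graph, source):
    """Return (distance, number of shortest paths) from source to each node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]  # every shortest path to u extends to v
    return dist, paths

def betweenness(graph):
    info = {s: bfs_counts(graph, s) for s in graph}
    score = {v: 0.0 for v in graph}
    for s, t in combinations(graph, 2):
        dist_s, paths_s = info[s]
        dist_t, paths_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected
        for v in graph:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            # v lies on a shortest s-t path if the two legs add up exactly.
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += paths_s[v] * paths_t[v] / paths_s[t]
    return score

between = betweenness(graph)
```

In this toy network A and D score highest because every path between the triangle A-B-C and node E must pass through them.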

Eigenvector centrality: how likely you are to end up at a node on a random walk

(same idea as PageRank)

Models: cases where importance of neighbors is important. Example: the private adviser to the president
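A sketch of the random-walk idea using PageRank-style power iteration. Note that this is PageRank with a damping factor, not the classical eigenvector centrality of the adjacency matrix. The damping value 0.85 and the toy network are assumptions for illustration, and the code assumes every node has at least one neighbor.

```python
# Hypothetical toy network: node -> set of neighbors (undirected).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

def pagerank(graph, damping=0.85, iterations=100):
    """Probability of finding a random walker at each node.

    At each step the walker follows a random edge with probability `damping`,
    otherwise jumps to a uniformly random node.
    """
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, neighbors in graph.items():
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                new_rank[neighbor] += share
        rank = new_rank
    return rank

rank = pagerank(graph)
```

Unlike degree, a node's score here depends on how well connected its neighbors are, not only on how many it has.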

Journalism centrality: how important is this person to this story?

Who is "important"? What type of person do you want to identify in the network?

It is often assumed that we're after the "influential." But sociology says "power" is a complicated thing, difficult to define and measure.

Network analysis has mostly ignored this problem. I know of no successful use of centrality metrics in journalism. Maybe you'll be the first.

Finding Communities

No one definition of "community." Could mean a town, or a club, or an industry network.

But for our purposes, a community is "a group of people with pre-existing patterns of association."

In social network analysis, that translates into clusters in the graph.

Friends/followers

Co-consumption: Network of political book sales, Orgnet.com

Communications network: Exploring Enron, Jeffrey Heer

Web link structure: Map of Iranian Blogosphere, Berkman Center

Individual time/location trails: CitySense, Sense Networks

Warning: no network is ever "complete." Otherwise there would be 7 billion people in it.

Mathematical definitions of "cluster"

You've already seen several! If you can compute distance between any two items, you can cluster.

But in social networks, not everyone is connected to everyone else...

Modularity

Are there more intra-group edges than we would expect randomly?

Modularity

$n$ = number of vertices
$k_i$ = degree of vertex $i$
$A_{ij}$ = 1 if edge between $i, j$, 0 otherwise
$g_{ij}$ = 1 if $i, j$ in same group, 0 otherwise

There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph. If they go between random vertices then the number of edges between $i, j$ is $k_i k_j / 2m$.


Modularity

$$Q = \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) g_{ij}$$

If $Q > 0$ then there are "excess" edges inside the groups (and fewer edges between them).
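A direct translation of the formula into Python, as a sketch on a hypothetical graph of two triangles joined by a single edge. It computes Q exactly as written on the slide, without the 1/(2m) normalization that many references include, so only the sign and relative size of the values are meaningful here.

```python
# Hypothetical toy network: two triangles (0-1-2 and 3-4-5) joined by edge 2-3.
graph = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}

def modularity(graph, group_of):
    """Q = sum over i, j of (A_ij - k_i * k_j / 2m) * g_ij, unnormalized."""
    degree = {node: len(neighbors) for node, neighbors in graph.items()}
    two_m = sum(degree.values())  # the sum of degrees equals 2m
    q = 0.0
    for i in graph:
        for j in graph:
            if group_of[i] != group_of[j]:
                continue  # g_ij = 0
            a_ij = 1 if j in graph[i] else 0
            q += a_ij - degree[i] * degree[j] / two_m
    return q

# Grouping by triangle versus an arbitrary grouping that cuts across them.
by_triangle = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
scrambled = {0: "x", 3: "x", 4: "x", 1: "y", 2: "y", 5: "y"}

q_good = modularity(graph, by_triangle)
q_bad = modularity(graph, scrambled)
```

Grouping by triangle gives a positive Q, while the scrambled grouping gives a negative one, matching the "excess edges" reading.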

Modularity algorithm

• Look for a division of nodes into two groups that maximizes Q
• Can find this through eigenvector technique
• Possible that no division has Q > 0, in which case the graph is a single community
• If a division with Q > 0 found, split
• Recursively split sub-graphs
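To make the first step concrete, the sketch below finds the best two-way division by brute force, trying every split of a small hypothetical graph. This swaps in exhaustive search for the eigenvector technique, so it only works for a handful of nodes, and the recursive splitting step is omitted. Q is computed as on the earlier slide, without normalization.

```python
from itertools import combinations

def modularity(graph, group):
    """Unnormalized Q for the division: `group` versus all other nodes."""
    degree = {node: len(neighbors) for node, neighbors in graph.items()}
    two_m = sum(degree.values())
    q = 0.0
    for i in graph:
        for j in graph:
            if (i in group) == (j in group):  # same side of the division
                a_ij = 1 if j in graph[i] else 0
                q += a_ij - degree[i] * degree[j] / two_m
    return q

def best_split(graph):
    """Try every two-group division; return (Q, group) or (0.0, None)."""
    nodes = sorted(graph)
    first, rest = nodes[0], nodes[1:]
    best_q, best_group = 0.0, None
    # Always put `first` in the group to avoid testing mirror-image splits;
    # sizes stop short of len(rest) so the group is never the whole graph.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            group = {first, *extra}
            q = modularity(graph, group)
            if q > best_q + 1e-12:
                best_q, best_group = q, group
    return best_q, best_group

# Hypothetical toy networks.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
one_triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}

split_q, split_group = best_split(two_triangles)
single_q, single_group = best_split(one_triangle)
```

The two-triangle graph splits at the bridge, while the lone triangle has no division with Q > 0 and stays a single community.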

The Hairball problem

Real social networks are big, with complex, overlapping communities in the central component. Modularity and other community detection algorithms give poor results.

K-core Decomposition

Find the nodes at the "center" of a network.

for k = 1 to maximum node degree
    repeat
        remove all nodes with degree < k
    until all remaining nodes have degree >= k
    set "core number" of remaining nodes to k
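The pseudocode above can be sketched in Python as follows. The toy network is hypothetical, and this simple version favors readability over speed (faster k-core algorithms exist).

```python
def core_numbers(graph):
    """Return {node: core number} by repeatedly peeling low-degree nodes."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {node: set(neighbors) for node, neighbors in graph.items()}
    core = {node: 0 for node in graph}
    max_degree = max((len(nbrs) for nbrs in graph.values()), default=0)
    for k in range(1, max_degree + 1):
        # Repeat: remove all nodes with degree < k, until none are left.
        while True:
            low = [n for n, nbrs in remaining.items() if len(nbrs) < k]
            if not low:
                break
            for node in low:
                for neighbor in remaining.pop(node):
                    if neighbor in remaining:
                        remaining[neighbor].discard(node)
        if not remaining:
            break
        # Everything that survived belongs to at least the k-core.
        for node in remaining:
            core[node] = k
    return core

# Hypothetical toy network: a triangle A-B-C with a tail A-D-E.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}
core = core_numbers(graph)
```

Here the triangle forms the 2-core, while the tail nodes D and E are peeled off at k = 2 and keep core number 1.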

K-core Decomposition

Carmi et al., A model of Internet topology using k-shell decomposition

Protest Dynamics on Twitter

González-Bailón et al., The Dynamics of Protest Recruitment through an Online Network

k-core number vs. maximum cascade size. Color = sent at least one tweet which reached this fraction of users (orange = reached all users)

Key insight: triangles not edges

Simmel's theory of sociology (early 20th C.) says the relationship between two people cannot be understood without context.

Idea: count shared triangles

1. For each node A and each of A's friends B, count the number of triangles involving A and B (= number of shared friends of A and B).
2. Rank A's friends (each B) by number of shared friends (number of C's for A, B) to create a "top friends" list for A.
3. Keep the edge between nodes A, D only if there is some threshold percentage overlap in their top friends lists.

Simmelian Backbones
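A rough sketch of the three steps, on a hypothetical graph of two triangles joined by one bridging edge. The parameter names `top_k` and `min_overlap`, the tie-breaking rule, and the definition of overlap as shared entries divided by `top_k` are all assumptions made for illustration; the published method may define these differently.

```python
def simmelian_backbone(graph, top_k=2, min_overlap=0.5):
    """Keep edges whose endpoints share enough of their 'top friends'."""
    top = {}
    for a, friends in graph.items():
        # Step 1: triangles through edge (a, b) = shared friends of a and b.
        shared = {b: len(friends & graph[b]) for b in friends}
        # Step 2: rank a's friends by shared-friend count (ties broken by name).
        ranked = sorted(friends, key=lambda b: (-shared[b], b))
        top[a] = set(ranked[:top_k])
    backbone = set()
    for a, friends in graph.items():
        for b in friends:
            if a < b:  # visit each undirected edge once
                # Step 3: keep the edge only if the top lists overlap enough.
                overlap = len(top[a] & top[b]) / top_k
                if overlap >= min_overlap:
                    backbone.add((a, b))
    return backbone

# Hypothetical toy network: triangles 0-1-2 and 3-4-5 joined by edge 2-3.
graph = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}
backbone = simmelian_backbone(graph)
```

The bridge 2-3 sits in no triangle, so it is dropped and the two tightly knit groups are left as separate pieces.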

SNA in journalism

• ICIJ Offshore Tax Haven leak
• ICIJ human tissue investigation
• Organized Crime and Corruption Reporting Project
• WSJ Galleon's Web insider trading story
• SCMP's Who Runs Hong Kong
• Muckety.com

The other challenge was the data itself. How to separate the extraordinary from the routine and find the public interest inside a maze of more than 37,000 offshore company holders? A first step was to build as many lists as possible of public figures: Politburo members, military commanders, mayors of large cities, billionaires listed in Forbes and Hurun's rankings of the mega-wealthy and so-called princelings (relatives of the current leadership or former Communist Party elders).

Through painstaking database work, a reporter in Spain cross-referenced the lists of notable Chinese against the names of offshore clients listed within ICIJ's Offshore Leaks data. The added difficulty was that in most cases, names in the offshore files were registered in Romanized form, not Chinese characters. This made making exact matches extremely hard, because Romanized spellings from Chinese characters tend to vary widely: Wang might be spelled Wong, Zhang could be Cheung, and Ye might be spelled Yeh. Addresses and ID numbers helped confirm many identities, but many other names were dropped because the reporting team could not be 100 percent sure that the person was a correct match.

A picture slowly began to emerge: China's elites were aggressively using offshore havens to hold assets, list companies in the world's stock exchanges, buy and sell real estate and conduct their business away from Beijing's red tape and capital controls.

How We Did Offshore Leaks China, ICIJ
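The cross-referencing step can be sketched as matching on a normalized name key. This is not ICIJ's actual method. The variant table holds only the three spelling pairs mentioned in the passage, the sample names are hypothetical, and the code assumes the surname comes first. Its output is a list of candidates for manual confirmation (for example against addresses or ID numbers), not confirmed matches.

```python
# Spelling variants taken from the passage above; a real table would be far larger.
SURNAME_VARIANTS = {"wong": "wang", "cheung": "zhang", "yeh": "ye"}

def name_key(name):
    """Normalize a romanized name; assumes the surname is the first word."""
    parts = name.lower().replace("-", " ").split()
    surname = SURNAME_VARIANTS.get(parts[0], parts[0])
    given = "".join(parts[1:])  # ignore spacing and hyphens in given names
    return (surname, given)

def candidate_matches(notables, offshore_clients):
    """Pair each notable with offshore clients whose normalized names agree."""
    by_key = {}
    for client in offshore_clients:
        by_key.setdefault(name_key(client), []).append(client)
    return [
        (person, client)
        for person in notables
        for client in by_key.get(name_key(person), [])
    ]

# Hypothetical names, invented for illustration.
notables = ["Wang Xiao-ming", "Ye Lan"]
offshore_clients = ["Wong Xiaoming", "Yeh Lan", "Li Na"]
candidates = candidate_matches(notables, offshore_clients)
```

Loosening the match this way trades missed matches for false positives, which is why the confirmation step described in the passage matters.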

Analyzing the Data behind Skin and Bone, ICIJ

Who Runs HK? The Fight over Stanley Ho's Fortune, South China Morning Post, 2010

SNA that could be used in Journalism

• The Network of Global Corporate Control paper
• Network of campaign finance contributions (Super PACs)
• International financial system / HFT
• "Revolving door" / regulatory capture
• Political elite in any country
• Find audience for story, akin to targeted marketing
• ...

Vitali, Glattfelder, Battiston, The Network of Global Corporate Control

SNA in journalism

• Visualization widely used
• Link analysis successful in investigative reporting
• Most of the work required to do these types of stories is traditional research, not algorithmically guided.
• I am not aware of successful application of centrality metrics or community detection algorithms.
• This may change as the graphs journalism examines get bigger...
• Would it be possible to use community detection to find the "right" audience for a story?