Download pdf - Inferring user traits via unsupervised methodscs229.stanford.edu/proj2017/final-posters/5135626.pdf · 3Department of Physics, [email protected] Data Acquisition Data and Feature

CharacterizingtheEthereumaddressspaceInferringusertraitsviaunsupervisedmethods

JamesPayette1,SamuelSchwager2,JosephMurphy31DepartmentofComputerScience,[email protected],[email protected],[email protected]

DataAcquisition

DataandFeatureSet ModelsandAnalysis

ResultsandDiscussion

OngoingInvestigations References

Successful, efficient data acquisition was a majormilestone for our project• Using etherscan.io, we recursively scraped data from

the publically available blockchain, eventuallyaggregating a data set of 250,000 unique addresses.

• Queried the etherscan API for an address’ ethereumbalance and all of their transactions

• Used this information to build our feature vectors

Each row of our overall design matrix corresponds to thefeature vector for a single Ethereum address and each columnto a single feature. The dataset is normalized to the samplemean and unit sample variance.

Our feature set is made up of 34 features which are calculatedbased on the raw data returned by API calls to etherscan.io.Feature selection for this problem was difficult since for eachaddress we only only have access to the information containedin each transaction. We selected features that would help todistinguish different types of ethereum users (i.e. industrialusers probably move more money and have more transactionsthan hobby users). We tried to select features that, whenaggregated, would paint a descriptive picture of the user.

Features include: Total Ether, number of transactions,transactions per month, average Ether transaction, etc.

Recent public and commercial interest incryptocurrencies has made their published, yetanonymous ledgers, or “blockchains”, objects ofsupreme interest. Successfully identifying or evencharacterizing users, known only by their addresses,would have enormous security implications [1]. Weexamine the blockchain of Ethereum with theobjective of clustering addresses into distinct“behavior groups” to qualitatively infer their traits. AnexampletransactionontheEthereumblockchain[2]

TheEthereum addressspace

The main objective of our quantitative analysis was to use clustering evaluation metricsand Principal Component Analysis (PCA) to determine an informed estimate for theoptimal number of clusters with which to examine as behavior groups.

[1] Monaco, John V. "Identifying bitcoin users bytransaction behavior." SPIE Defense+ Security. InternationalSociety for Optics and Photonics, 2015.[2] Wood, Gavin. "Ethereum: A secure decentralisedgeneralised transaction ledger." Ethereum Project YellowPaper 151 (2014).[3] Kodinariya, Trupti M., and Prashant R. Makwana."Review on determining number of Cluster in K-MeansClustering." International Journal 1.6 (2013): 90-95.[4] Tibshirani, Robert, Guenther Walther, and TrevorHastie. "Estimating the number of clusters in a data…

• PCAfindsthatonly33% ofthevarianceisexplainedbythefirsttwocomponents• K-meansclusteringusedoverothermethodsforitsscalability,versatility• UseunsupervisedmetricCalinski HarabazScoreasmeasureofclusterdefinition• “Elbow”ofCalinskiHarabazplotgivesinsightonoptimalnumberofclusters[3]• FurtherinvestigateoptimalnumberofclustersviaSilhouetteScores

Acknowledgements

Determining the optimal number of K-meansclusters is not always a well-defined problem[3], [4]. Employing various evaluationtechniques, we estimate the best number ofbehavior groups lies roughly between 8 to 20.

…set via the gap statistic." Journal of the Royal Statistical Society:Series B (Statistical Methodology) 63.2 (2001): 411-423.[5] Meiklejohn, Sarah, et al. "A fistful of bitcoins: characterizingpayments among men with no names." Proceedings of the 2013conference on Internet measurement conference. ACM, 2013.

Figures at right: Silhouette scores range from 0to 1 (-1 = misclassification). Scores closer to 1indicate a confident cluster mapping (i.e. shortdistance to cluster centroid, far from neighbors).Left: Silhouette scores of clusters with size >100,average score (dotted red line). Right: Mappingof clusters to 2 principal component space.Adapted from Scikit-Learn starter code.

We would like to thank the entire CS 229 teachingstaff for their instruction and useful insights thisquarter. We look forward to presenting our finalresults in the forthcoming report.

With our quantitative analysis complete, we have aninformed estimate of the number of behavior groups toconsider. Using heuristics, we will qualitatively analyzethe clusters based on their locations in feature space tocharacterize their traits [5]. The qualitative analysis willbe included in our final report (in preparation). Longterm applications of this work include exploringgenerative models to learn specific behavior groupcharacteristics in order to “impersonate” other users.

Inertiaisthesumofsquareddistancesofsamplestotheirclosestclustercenter.“Elbow”similartoCHScore.

Naively, we expect a handful of clustersto explain the unique behavior groups inthe address space. This is confirmed bythe “elbows” of the CH Score and Inertiafigures. However, considering theSilhouette analysis, we see that theoccupations of clusters is highlydisproportionate in this regime. This maynot be entirely troublesome, as there islikely a biased distribution of users.