Applications of HPCC Systems at Clemson University
Amy Apon, PhD ● Linh Ngo, PhD ● Michael Payne Big Data Systems Laboratory
Clemson University
Applications of HPCC Systems at Clemson University Clemson Strengths and Opportunities
PhD-level faculty & research staff Talented students
Significant industry collaborators
Palmetto – Top 5 in US Academic Supercomputers
~2000 nodes, 20K cores, 600 GPUs 100Gb Internet connectivity
Facilities People
Applications of HPCC Systems at Clemson University Big Data Systems Lab Overview
Perform World Class Research on the Systems and Enabling Information
Technology for Advanced Data Analytics
Big Data Systems Lab Research Areas
Systems and Architectures
Tools and Operations
Data Analytics and Applications
Big Data Systems Lab Vision
Applications of HPCC Systems at Clemson University Effect of High Performance Computing on Academic Research Productivity
Motivation: There is a lot of pressure on federal funding We propose efficiency as a measure from which to gain insights on return on investment We show that locally-available HPC has a positive effect on the ability of a university to do research
Motivation: Government and business need information about public sentiment. Research: We develop and apply methods to analyze large amounts of textual data to enable inquiry of social and business problems.
Applications of HPCC Systems at Clemson University Text mining of news reports and social media for business intelligence
Shared Execution Environment
Temporary Local Storage
User Privileges Only
Applications of HPCC Systems at Clemson University Shared Computing Resources among Researchers
Applications of HPCC Systems at Clemson University
Linh Ngo, PhD HPCC Systems in a Shared Research Computing Environment
Shared Execution
Environment
Temporary Local Storage
User Privileges Only
Applications of HPCC Systems at Clemson University Shared Computing Resources among Researchers
How to provision and configure an HPCC cluster dynamically for research purposes?
• Step 1: Configure, install, and deploy HPCC as a non-root user
• Step 2: Dynamically provision HPCC cluster in a shared research environment
binutils ICU
XALAN APR
… /
/usr
/lib64
/lib
…
/opt
/
/home
$USER …
/parallel_scratch
/local_scratch
Applications of HPCC Systems at Clemson University Installation and Configuration of Dependencies
Administrative privileges Non-administrative privileges
…
/ etc
init.d HPCCSystems
opt HPCCSystems
var lib
HPCCSystems
log HPCCSystems
configmgr mydafilesrv mydafilesrv
…
/ home
$USER
hpcc local_scratch
hpcc
parallel_scratch
lib HPCCSystems
log
$USER
lock pid
$USER
Applications of HPCC Systems at Clemson University Resolving Non-default Installation Path Conflicts
Remove/relax root-level settings: i.e.: is_root
Reduce default configuration settings for resource
requirements: depended on resource allocation requests
Applications of HPCC Systems at Clemson University Non-root Deployment
mydafilesrv mydafile myeclc mythor myroxie
…
PBS_NODEFILE environment.xml
1
2 3
4
5
user.palmetto.clemson.edu
Applications of HPCC Systems at Clemson University Dynamic Provisioning
Deploy to /local_scratch or /parallel_scratch?
Applications of HPCC Systems at Clemson University
Michael Payne
Using HPCC Systems to Manage Academic Data LexisNexis Summer 2014 Internship
• Research in Scholarly Data requires academic data from many different sources, which store data under various formats
• Aggregating these sources into a useful and cohesive structure requires a data-intensive approach to preprocessing, integration and analysis
• HPCC Systems is a platform to streamline this process
Applications of HPCC Systems at Clemson University Using HPCC Systems to Manage Academic Data
Higher Education
Institutions
Research
High Performance Computing Capability
Funding Support
Applications of HPCC Systems at Clemson University Categories of Scholarly Data
Higher Education
Institutions
Research
High Performance Computing Capability
Funding Support
Detailed Award (XML) Federal Funding (tab-delimited) Expenditures (tab-delimited) Institution Patent (Excel) Degrees Conferred (Excel)
NIH Award Data (CSV/Excel) Top500 Supercomputer affiliated with academic institutions (XML)
Institutions from States with EPSCoR status (tab-delimited)
Institutional Information with Carnegie’s Research Classifications (multi-sheet Excel)
Enrollment (multi-sheet Excel) Financial (multi-sheet Excel) Faculty (multi-sheet Excel)
Detailed list of articles by discipline with abstracts, references, and disambiguated authors. (XML)
Applications of HPCC Systems at Clemson University Scholarly Data Description
Institution name/email
Name similarity Address
PI/A
utho
r nam
e In
stitu
tion
nam
e
Name similarity
Match name with WoS ‘s Organization-Enhanced name
Ackn
owle
dgm
ent a
ttrib
utes
(200
8 on
war
d, a
utom
atic
)
Applications of HPCC Systems at Clemson University Examples of Scholarly Data Links
• Porting data analytic processes to ECL
• Applying Machine Learning techniques for article abstract classification
Applications of HPCC Systems at Clemson University Ongoing Work
LexisNexis Internship Machine Learning
Manager Timothy Humphrey
Mentor Arjuna Chala
Applications of HPCC Systems at Clemson University Summer 2014 Internship - Logistic Regression for Dense Matrices
• Prediction using continuous and discrete values
• No distributional assumptions on the predictors • May not be
normally distributed or linearly related
• Relationship between the discrete variable and the predictor is non-linear
Applications of HPCC Systems at Clemson University Logistic Regression
Matrices can be partitioned
Schemes must be compatible
There are multiple choices!
X = 4 x 4 4 x 1 4 x 1
2 x 3
X = 2 x 1 1 x 1 2 x 1
3 x 1
Applications of HPCC Systems at Clemson University Parallel Block Basic Linear Algebra Subprograms (PB-BLAS)
• Logistic Runtimes
• Hard Coded Mapping
• Full Higgs Dataset 11,000,000 x 28
0
5
10
15
20
25
30
35
Higgs 1,000 Higgs 10,000 Higgs 100,000
Tim
e in
Min
utes
PB-BLAS Non PB-BLAS
Applications of HPCC Systems at Clemson University Machine Learning in ECL
• Logistic Runtimes
• Auto Mapping
• Full Elsevier Dataset 100,000 x 3,291 0
5
10
15
20
25
Elsevier 100
Tim
e in
Min
utes
PB-BLAS Non PB-BLAS
Applications of HPCC Systems at Clemson University Machine Learning in ECL
050
100150200250300350400450
Elsevier 1,000
Tim
e in
Min
utes
PB-BLAS Non PB-BLAS
Applications of HPCC Systems at Clemson University Machine Learning in ECL
• Logistic Runtimes
• Auto Mapping
• Full Elsevier Dataset 100,000 x 3,291
• Logistic Regression code and supporting functions have been documented and merged to ECL-ML GitHub repository
• Auto block vector mapping function for any user that wants to use PB-BLAS
• Ready to use element wise multiplication in PB-BLAS • Updated debugging statements that a clear
understanding of errors • Test functions for both block vector mapping function • Sample code for using logistic regression • Currently working on K-means implementation that
utilizes PB-BLAS
Applications of HPCC Systems at Clemson University Project Summary
Linh Ngo, PhD ● Alex Herzog, PhD ● Michael Payne ● Amy Apon, PhD
{lngo, aherzog, mpayne3, aapon}@clemson.edu
Big Data Systems Laboratory Clemson University