Learn how to index your Big Data to get the speed that you want and need. With HPCC Systems, use fewer machines and do more work faster than Hadoop. To install HPCC Systems in just 5 minutes, watch this YouTube video: http://www.youtube.com/watch?v=8SV43DCUqJg
HPCC Systems Load, Index & Query
Big Data the EZ way
By Fujio Turner
@myhousehippo
Comparison
Hadoop: Java, Petabytes, 1-80,000 Jobs/day, Since 2005
HPCC Systems: C++, Exabytes, Non-Indexed 4X-13X, Since 2000, Indexed: 2K-3K Jobs/sec
Business Development Customers 1 20
Non-Indexed Full Data Set
http://hpccsystems.com/why-hpcc/benchmarks
Map/Reduce
SQL w/ JOINS
GraphDB
Machine Learning
Simple to Complex Queries
Plugins BI Transport
Security Query
Encrypted on disk
“I’m sub-second fast.”
“I can query all or part of your data.”
Thor
Roxie
Hard Disk
Index (optional)
Hard Disk
Index (optional)
In-memory Index
SSD
Either/Both
Architecture
Data Query File
Example 2
Example 1
HPCC Systems Sample Data for Examples 1 & 2
Sample Data
http://hpccsystems.com/download/docs/learning-ecl
More Examples
Typical
1. Schema
CREATE TABLE layout_person (
  PersonID INT(15) NOT NULL,
  FirstName VARCHAR(15) NOT NULL,
  LastName VARCHAR(25) NOT NULL,
  PRIMARY KEY (PersonID)
);
2. Load
INSERT INTO `layout_person` (`PersonID`, `FirstName`, `LastName`) VALUES (1, 'Joe', 'Smith');
3. Query
SELECT * FROM `layout_person`;
1. Load
2. Query w/ Applied Schema on Read
Layout_Person := RECORD
  UNSIGNED1 PersonID;
  STRING15 FirstName;
  STRING25 LastName;
END;
allPeople := DATASET('~file', Layout_Person, THOR);
allPeople;
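What "schema on read" means for a fixed-width file can be sketched in Python. This is only a model, not ECL and not how HPCC is implemented: it assumes each row is stored as 1 byte for PersonID, 15 space-padded bytes for FirstName and 25 for LastName, and the names LAYOUT_PERSON, encode_person and read_people, along with the second sample row, are made up for the illustration.

```python
import struct

# Assumed byte mapping of the slide's Layout_Person record:
# UNSIGNED1 PersonID -> 1 byte, STRING15 FirstName -> 15 bytes, STRING25 LastName -> 25 bytes.
LAYOUT_PERSON = struct.Struct("<B15s25s")


def encode_person(person_id, first, last):
    """Pack one row, padding the strings with spaces to their fixed width."""
    return LAYOUT_PERSON.pack(
        person_id,
        first.ljust(15).encode("ascii"),
        last.ljust(25).encode("ascii"),
    )


def read_people(raw):
    """Apply the layout at read time: cut the raw bytes into fixed-size rows."""
    people = []
    for person_id, first, last in LAYOUT_PERSON.iter_unpack(raw):
        people.append(
            (person_id, first.decode("ascii").rstrip(), last.decode("ascii").rstrip())
        )
    return people


# The "file" is just bytes; no schema was declared when it was written.
raw_file = encode_person(1, "Joe", "Smith") + encode_person(2, "Ann", "Jones")
all_people = read_people(raw_file)
```

The row size here (LAYOUT_PERSON.size, 41 bytes) is the same kind of number the upload form asks for as "Size of each row".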
Structured or
Semi-structured or
Unstructured
All data has: 1. Origin 2. DateTime 3. Info
Administrator Web GUI on Port 8010
IP / URL of HPCC install
1. Upload file*
2. Distribute to cluster
3. Name of file in cluster
4. Size of each row
5. Push to cluster
* 2 GB file size limit through the web; no limit if uploaded via SOAP
Load Data
In Thor Cluster
Loaded
Query: Example 1
Schema
Layout_People := RECORD
  STRING15 FirstName;
  STRING25 LastName;
  STRING15 MiddleName;
  STRING5 Zip;
  STRING42 Street;
  STRING20 City;
  STRING2 State;
END;
Data
allPeople := DATASET('~test::originalperson', Layout_People, THOR);
File Location ('~test::originalperson'): “FROM Table”
File Type: THOR
Query
Smiths := allPeople(LastName = 'Smith'); // WHERE `LastName` = 'Smith'
Smiths; // Output
SQL equivalents: “USE DATABASE;” “SELECT * ….”
1. Go to playground
2. Edit ECL
3. Pick “thor” Cluster
4. Submit
http://www.meetup.com/HPCC-SV/pages/ECL_EXAMPLE_1/
Practice
Why Index?
Full Table or Data Scan
++and
from date to date
Indexing: Example 2
Make Index
Ex. Creating an index by “STATE”
allPeople := DATASET('~test::originalperson', {Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}}, THOR);
RecPtr: File Position Number, a pseudo record ID (“Alter Table”, new column)
datax := INDEX(allPeople, {State, RecPtr}, '~test::key_person');
'~test::key_person': Index Filename
BUILDINDEX(datax);
http://www.meetup.com/HPCC-SV/pages/ECL_EXAMPLE_2a_-_Create_Index
Query w/ Index
Data
allPeople := DATASET('~test::originalperson', {Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}}, THOR);
datax := INDEX(allPeople, {State, RecPtr}, '~test::key_person');
Query
filterdata := FETCH(allPeople, datax(State = 'NJ'), RIGHT.RecPtr); // WHERE `State` = 'NJ' from Index
filterdata; // Output
http://www.meetup.com/HPCC-SV/pages/ECL_EXAMPLE_2b_-_Query_with_Index
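The idea behind Example 2 (index a field together with each row's file position, then fetch only the matching rows) can be modeled in a few lines of Python. This is a conceptual sketch, not ECL and not a description of HPCC's on-disk index format. It assumes a cut-down two-field row (LastName, State), and the names ROW, make_file, build_index and fetch, as well as the sample rows, are hypothetical.

```python
import io
import struct
from collections import defaultdict

# Hypothetical cut-down layout: STRING25 LastName + STRING2 State = 27 bytes per row.
ROW = struct.Struct("<25s2s")


def make_file(rows):
    """Write (last_name, state) pairs as fixed-width rows."""
    return b"".join(
        ROW.pack(last.ljust(25).encode("ascii"), state.encode("ascii"))
        for last, state in rows
    )


def build_index(raw):
    """Loosely like BUILDINDEX on {State, RecPtr}: map each State to its rows' byte offsets."""
    index = defaultdict(list)
    for offset in range(0, len(raw), ROW.size):
        _, state = ROW.unpack_from(raw, offset)
        index[state.decode("ascii")].append(offset)
    return index


def fetch(raw, index, state):
    """Loosely like FETCH: seek straight to the indexed offsets instead of scanning every row."""
    f = io.BytesIO(raw)
    matches = []
    for offset in index.get(state, []):
        f.seek(offset)
        last, st = ROW.unpack(f.read(ROW.size))
        matches.append((last.decode("ascii").rstrip(), st.decode("ascii")))
    return matches


raw = make_file([("Smith", "NJ"), ("Jones", "CA"), ("Lee", "NJ")])
state_index = build_index(raw)
filter_data = fetch(raw, state_index, "NJ")
```

Only the two NJ rows are read from the data file; the CA row is never touched, which is the saving over the full scan in Example 1.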
SuperFile: organizing your files
Real File
2013-06-06 Twitter
2013-06-07 Twitter
2013-06-08 Twitter
Logical File
2013-06 Twitter
2013-06-06 ……….. -07 ……….. -08
+ Append new real files
Creating a SuperFile
1. Create New or Update Existing Super File
2. Super File Name
2b. Add new file to existing superfile
3. Create Superfile
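A superfile can be thought of as one logical name over an ordered list of real files. The Python below is a toy model of that idea only; it is not the HPCC API, and the class SuperFile, its methods and the sample rows are invented for the illustration.

```python
class SuperFile:
    """Toy model: a logical file name over an ordered list of real sub-files."""

    def __init__(self, name):
        self.name = name
        self.subfiles = []  # list of (subfile_name, rows)

    def add(self, subfile_name, rows):
        """Append a new real file; the files already added are left untouched."""
        self.subfiles.append((subfile_name, list(rows)))

    def read(self):
        """Reading the superfile yields the rows of every sub-file in order."""
        for _, rows in self.subfiles:
            yield from rows


june = SuperFile("2013-06 Twitter")
june.add("2013-06-06 Twitter", ["tweet a", "tweet b"])  # made-up sample rows
june.add("2013-06-07 Twitter", ["tweet c"])
june.add("2013-06-08 Twitter", ["tweet d"])
all_rows = list(june.read())
```

A query written against "2013-06 Twitter" keeps working as each day's file is appended, which is the point of the "+ Append new real files" step.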
2013-06-06 Twitter
2013-06-07 Twitter
2013-06-08 Twitter
2013-06 Twitter
2013 Twitter
SuperKeys: organizing your indexes
2013-06-06 Twitter
2013-06-07 Twitter
2013-06-08 Twitter
2013 Twitter
SuperKeys: No Sub-Super Files or Keys in Roxie
When and where NOT to Index
Filtered Data
80-100% Queries @ Roxie
Index Here
Do Not Index Here
100% of Data Enters Here
100% of Data Enters Here
Do Not Index Here
• Query 100% of all data
• Lots of Regular Expressions
• Few or No DateTime Data