PK Chunking Divide and conquer massive objects in salesforce Daniel Peter Lead Applications Engineer, Kenandy Inc. Co-organizer of the Bay Area Salesforce Developer User Group @danieljpeter

PK chunking presentation from Tahoe Dreamin' 2016


PK Chunking: Divide and conquer massive objects in Salesforce

Daniel Peter
• Lead Applications Engineer, Kenandy Inc.
• Co-organizer of the Bay Area Salesforce Developer User Group
• @danieljpeter

Why shouldn’t I leave?

Because you need to learn how to avoid these errors!

Query not “selective” enough:
• Non-selective query against large object type (more than 100000 rows).

Query takes too long:
• No response from the server
• Time limit exceeded
• Your request exceeded the time limit for processing.

Too much data returned in query:
• Too many query rows: 50001
• Remoting response size exceeded maximum of 15 MB.

GET THE DATA

Sounds great. How?

Not so fast... first we need some prerequisite knowledge!

• Database Indexes
• Salesforce Ids

Database indexes (prereq)

“Allow us to quickly locate rows without having to scan every row in the database”

(paraphrased from wikipedia)

Location, location, location

Salesforce Ids (prereq)

• Composite key containing multiple pieces of data.
• Uses base 62 numbering instead of the more common base 10.
• Fastest way to find a database row.

Is it time to go skiing yet?

Salesforce Ids (prereq)

Base 10 vs Base 62: number of distinct values for a given number of digits

Digits  Base 10                    Base 62
1       10                         62
2       100                        3,844
3       1,000                      238,328
4       10,000                     14,776,336 (million)
5       100,000                    916,132,832 (million)
6       1,000,000 (million)        56,800,235,584 (billion)
7       10,000,000 (million)       3,521,614,606,208 (trillion)
8       100,000,000 (million)      218,340,105,584,896 (trillion)
9       1,000,000,000 (billion)    13,537,086,546,263,552 (quadrillion)

(sorry for covering you, logo)

Base 62 digits: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Base 10 digits: 0123456789
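
The digit-count comparison can be reproduced with a few lines of JavaScript. This is only an illustrative sketch: the function name `capacity` is made up for this example. It uses BigInt because 62^9 is larger than the biggest integer a regular JavaScript number can represent exactly (2^53 - 1).

```javascript
// Number of distinct values that fit in `digits` digits of the given base.
// BigInt keeps the result exact even for 62^9.
function capacity(base, digits) {
  return BigInt(base) ** BigInt(digits);
}

// Print one row per digit count: digits, base 10 capacity, base 62 capacity.
for (let digits = 1; digits <= 9; digits++) {
  console.log(digits, capacity(10, digits).toString(), capacity(62, digits).toString());
}
```

Nine base 62 digits cover roughly 13.5 quadrillion values, against one billion for nine base 10 digits.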

Salesforce Ids (prereq)

MO’ NUMBERS

Base 62

Prerequisites complete!

How does PK Chunking work?

Analogy: fetching people in a city.

Fetching people in a city: problems

Non-selective
Request: “get me all the people who are female”

Response: “yer trippin’!”

Fetching people in a city: problems

Timeout
Request: “find me a 7 foot tall person in a pink tuxedo in Beijing”

Response: (after searching all day) “I can’t find any! I give up!”

Finding people in a city: problems

Too many people found
Request: “find me all the men in San Francisco with beards”

Response: (after searching for 10 mins) “The bus is full!”

PK Chunking fixes those problems

Divide and conquer!
Parallelism!

Fetching people in a city: solutions

Non-selective
Request: “get me all the people who are female, in your small search area”

Response: “¡Con mucho gusto!”

Fetching people in a city: solutions

Timeout
Request: “find me a 7 foot tall person in a pink tuxedo in Beijing, in your small search area”

Response:
SP1: “Didn’t find any, sorry!”
SP2: “Didn’t find any, sorry!”
SP3: “Found one!”
SP4: “Didn’t find any, sorry!”

Finding people in a city: solutions

Too many people found
Request: “find me all the men in San Francisco with beards, in your small search area”

Response:
SP1: 30 people in our bus
SP2: Didn’t find any
SP3: 50 people in our bus

Technical details

2 different implementations

QLPK: Query Locator PK Chunking

Base62PK: Base62 PK Chunking

QLPK

Salesforce SOAP or REST API – AJAX toolkit works great.

Create and leverage a server-side cursor. Similar to an Apex query locator.

Analogy: Print me a phone book of everyone in the city so I can flip through it.

QLPK – AJAX Toolkit Request

QLPK – AJAX Toolkit Response

Chunk the database, in size of your choice, by offsetting the queryLocator:

01gJ000000KnRpDIAV-50000
01gJ000000KnRpDIAV-100000
…
01gJ000000KnRpDIAV-39950000
01gJ000000KnRpDIAV-40000000
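
As a rough sketch, the list of offset locators can be generated like this. It assumes the boundaries follow the `<locator id>-<offset>` pattern shown above. The function name `buildLocators` and its parameters are hypothetical, and how each string is then handed back to the API is not covered here.

```javascript
// Build the offset query locators that mark the chunk boundaries.
//   locatorId: the id part of the queryLocator returned by the initial query
//   totalSize: total number of records the query matched
//   chunkSize: number of records per chunk
// Only offsets up to totalSize are produced; a trailing partial chunk
// (when totalSize is not a multiple of chunkSize) is not handled here.
function buildLocators(locatorId, totalSize, chunkSize) {
  const locators = [];
  for (let offset = chunkSize; offset <= totalSize; offset += chunkSize) {
    locators.push(locatorId + '-' + offset);
  }
  return locators;
}

// Same numbers as the example in this deck: 40,000,000 records, 50,000 per chunk.
const locators = buildLocators('01gJ000000KnRpDIAV', 40000000, 50000);
console.log(locators.length, locators[0], locators[locators.length - 1]);
```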

QLPK – The Chunks

800 chunks x 50,000 records = 40,000,000 total records

Analogy: we have exact addresses for clusters of 50k people to give to 800 different search parties.

QLPK – How to use in a query?

Perform 800 queries with the Id ranges in the where clause:

SELECT Id, Autonumber__c, Some_Number__c
FROM Large_Object__c
WHERE Some_Number__c > 10 AND Some_Number__c < 20
  AND Id >= 'a00J000000BWNYk' AND Id <= 'a00J000000BWO4z'

database so hard, take 800 queries to find me

THAT SPLIT CRAY

QLPK – Parallelism

Yeah it’s 800 queries, but…

They all went out at once, and they might all come back at once.

Analogy: We hired 800 search parties and unleashed them on the city at the same time.

Shift Gears

QLPK -> Base62PK

Base62PK

Get the first and last Id of the database and extrapolate the ranges in between.

Analogy: Give me the highest and lowest address of everyone in the city and I will make a phonebook with every possible address in it. Then we will break that into chunks.

Base62PK – first and last Id

Get the first Id:
SELECT Id FROM Large_Object__c ORDER BY Id ASC LIMIT 1

Get the last Id:
SELECT Id FROM Large_Object__c ORDER BY Id DESC LIMIT 1

Even on H-U-G-E databases these return F-A-S-T. No problem.

Base62PK – extrapolate

1. Decompose the 15 digit first/last Ids: split off the last 9 digits from the 6 digit prefix.

2. Convert the 9 digit base 62 numbers into Long Integers.

3. Add the chunk size to the first number until you hit or exceed the last number.

4. The last chunk may be smaller.

5. Convert those Long Integers back to base 62 and re-compose the 15 digit Ids.
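
The steps above can be sketched in JavaScript as follows. This is a minimal illustration and not the code from the presentation: the helper names are hypothetical, and the digit ordering is the 0-9, A-Z, a-z sequence shown on the base 62 slide. BigInt stands in for the Long Integer, because 9 base 62 digits can exceed what a regular JavaScript number holds exactly.

```javascript
// Base 62 digit ordering as shown earlier in the deck: 0-9, A-Z, a-z.
const ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Step 2: base 62 text -> integer.
function base62ToBigInt(text) {
  let value = 0n;
  for (const ch of text) {
    const digit = ALPHABET.indexOf(ch);
    if (digit < 0) throw new Error('Not a base 62 digit: ' + ch);
    value = value * 62n + BigInt(digit);
  }
  return value;
}

// Step 5: integer -> base 62 text, left-padded with '0' to `width` digits.
function bigIntToBase62(value, width) {
  let text = '';
  while (value > 0n) {
    text = ALPHABET[Number(value % 62n)] + text;
    value /= 62n; // BigInt division truncates
  }
  return text.padStart(width, '0');
}

// Steps 1-5: split the Ids, walk from first to last in chunkSize steps,
// and re-compose inclusive { first, last } Id ranges.
function chunkIdRange(firstId, lastId, chunkSize) {
  const prefix = firstId.slice(0, 6);
  if (lastId.slice(0, 6) !== prefix) {
    throw new Error('First and last Id have different prefixes');
  }
  const last = base62ToBigInt(lastId.slice(6, 15));
  const size = BigInt(chunkSize);
  const chunks = [];
  for (let start = base62ToBigInt(firstId.slice(6, 15)); start <= last; start += size) {
    // The final chunk is clipped to the last Id, so it may be smaller.
    const end = start + size - 1n < last ? start + size - 1n : last;
    chunks.push({
      first: prefix + bigIntToBase62(start, 9),
      last: prefix + bigIntToBase62(end, 9)
    });
  }
  return chunks;
}
```

Each `{ first, last }` pair can then be used as the `Id >=` and `Id <=` bounds of one chunk query.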

Base62PK – benefits

• High performance! Calculates the Ids instead of querying for them.

Base62PK – issues

• Digits 4 and 5 of the Salesforce Id are the pod identifier. If the Ids in your org have different pod Ids, this technique will break unless enhanced.

• Fragmented Ids lead to sparsely populated ranges. You will search entire ranges of Ids which have no records.
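
A quick guard against the pod Id problem could look like the sketch below. The function names are hypothetical. It only compares the first and last Id, so it does not prove that every Id in between shares the same pod identifier.

```javascript
// Digits 4 and 5 of the Id (zero-based indexes 3 and 4) are the pod identifier.
function podId(id) {
  return id.substring(3, 5);
}

// True when both Ids carry the same pod identifier.
function samePod(firstId, lastId) {
  return podId(firstId) === podId(lastId);
}

// Example with the Ids used in the chunk query earlier in this deck.
console.log(podId('a00J000000BWNYk'), samePod('a00J000000BWNYk', 'a00J000000BWO4z'));
```

If the check fails, QLPK is the safer choice for that object.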

So which do I pick?

QLPK

or

Base62PK

So which do I pick?

Heterogeneous Pod Ids | Homogeneous Pod Ids | Low Id Fragmentation (<1.5x) | Medium Id Fragmentation (1.5x - 3x) | High Id Fragmentation (>3x)

QLPK X X X

Base62PK X X

How do I implement?

• Needs to be orchestrated via JS in your page.
• Doesn’t work on Lightning Component Framework. No support for real parallel controller actions. (boxcar’ed)
• Has to be Visualforce or Lightning / Visualforce hybrid.

How do I implement?

• Use RemoteActions to get the chunk queries back into your page.
• Can be granular or aggregate queries!
• Process each chunk query appropriately when it comes back. EX: update totals on a master object or push into a master array.

// Fire one remoting call per chunk, all at once.
function queryChunks() {
    for (var i = 0; i < chunkList.length; i++) {
        queryChunk(i);
    }
}

function queryChunk(chunkIndex) {
    var chunk = chunkList[chunkIndex];

    Visualforce.remoting.Manager.invokeAction(
        '{!$RemoteAction.Base62PKext.queryChunk}',
        chunk.first, chunk.last,
        function (result, event) {
            // Push this chunk's values into the master array.
            for (var i = 0; i < result.length; i++) {
                objectAnums.push(result[i].Autonumber__c);
            }

            // Once every chunk has come back, finish up.
            queryChunkCount++;
            if (queryChunkCount == chunkList.length) {
                allQueryChunksComplete();
            }
        },
        // buffer: false sends each call in its own request instead of grouping them.
        {escape: false, buffer: false}
    );
}

@RemoteAction
public static List<Large_Object__c> queryChunk(String firstId, String lastId) {
    // The Ids come from the page, so escape them before building dynamic SOQL.
    String SOQL = 'SELECT Id, Autonumber__c, Some_Number__c ' +
                  'FROM Large_Object__c ' +
                  'WHERE Some_Number__c > 10 AND Some_Number__c < 20 ' +
                  'AND Id >= \'' + String.escapeSingleQuotes(firstId) + '\' ' +
                  'AND Id <= \'' + String.escapeSingleQuotes(lastId) + '\' ';

    return Database.query(SOQL);
}

Landmines

• Timeouts – retries
    • Cache warming means if at first you fail, try and try again!
• Concurrency
    • Beware: ConcurrentPerOrgApex Limit exceeded
    • Keep your individual chunk queries lean. < 5 secs.
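
One way to handle both landmines is to cap how many chunk queries are in flight and to retry a chunk when it fails. The sketch below is an assumption-laden illustration, not part of the original code. `runChunks`, `limit`, and `maxAttempts` are hypothetical names, and the default values are arbitrary rather than Salesforce limits. `task` stands for any function that returns a Promise for one chunk, for example a wrapper around the remoting call.

```javascript
// Run task(chunk) for every chunk with at most `limit` calls in flight,
// retrying each failed chunk until it succeeds or maxAttempts is reached.
async function runChunks(chunks, task, { limit = 10, maxAttempts = 3 } = {}) {
  const results = new Array(chunks.length);
  let next = 0; // index of the next chunk nobody has picked up yet

  async function worker() {
    while (next < chunks.length) {
      const index = next++;
      for (let attempt = 1; ; attempt++) {
        try {
          results[index] = await task(chunks[index]);
          break;
        } catch (err) {
          // Give up on the whole run once a chunk has used all its attempts.
          if (attempt >= maxAttempts) throw err;
        }
      }
    }
  }

  // Start up to `limit` workers; each one pulls chunks until none are left.
  const workers = [];
  for (let i = 0; i < Math.min(limit, chunks.length); i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

Retrying straight away fits the cache warming point: a chunk that timed out once may come back quickly on the next try.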

Demos

Harrah’s internet doesn’t like 800 parallel http connections.

Video:

https://www.youtube.com/watch?v=KqHOStka0eg

How did you figure this out?

Had to meet requirements for Kenandy’s largest customer. $2.5B / yr manufacturer.

High visibility project.

Necessity mother of invention!

How did you figure this out?

Query Plan Tool

How did you figure this out?

Debug logs from real execution

How did you figure this out?

QLPK

Ran into an org that had a mixture of sandbox and production IDs. Base62PK broke!

Why doesn’t Salesforce do this?

They do! (kinda)

The Bulk API uses a similar technique, but it is more asynchronous and wrapped in a message container to track progress.
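
For comparison, the Bulk API's built-in PK chunking is switched on with a request header when the job is created. The sketch below only builds the headers and sends nothing. `pkChunkingHeaders` is a hypothetical helper, and the session Id and chunk size are placeholders. See the Bulk API documentation linked at the end for the authoritative details.

```javascript
// Build HTTP headers for a Bulk API job-creation request with PK chunking on.
//   sessionId: a valid Salesforce session Id (placeholder in this example)
//   chunkSize: number of records per chunk
function pkChunkingHeaders(sessionId, chunkSize) {
  return {
    'X-SFDC-Session': sessionId,
    'Content-Type': 'application/xml; charset=UTF-8',
    'Sforce-Enable-PKChunking': 'chunkSize=' + chunkSize
  };
}

console.log(pkChunkingHeaders('<session id>', 50000));
```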

Final Questions?

Thank you!

More info:

Article on Salesforce Developers Blog:
https://developer.salesforce.com/blogs/developer-relations/2015/11/pk-chunking-techniques-massive-orgs.html

GitHub repo:
https://github.com/danieljpeter/pkChunking

Bulk API documentation:
https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/async_api_headers_enable_pk_chunking.htm