As enterprises continue to push more of their data to the cloud, Salesforce has seen data volumes in its tenant orgs grow at an exponential rate. How do you manage such volumes efficiently? How do you build queries and reports that respond in a timely manner?
Moyez Thanawalla's Dreamforce 2017 presentation on Large Data Volumes in Salesforce
1. Tools, Techniques and Solutions To
Avoid A Big-Data Blowout In Your Org
moyez@t.digital, @moyezthanawalla
Moyez Thanawalla, President – Thanawalla Digital
4. What Prompted Me To Speak About Large Data in Salesforce?
AT&T Uverse:
• Exponential Record Growth.
• Expected to double in size next year
• Slow queries, mostly relegated to overnight batch jobs
• 48 hour turn-around to get leads allocated to dealers
• Client needs to react much, much faster (minutes instead of days) to ad-hoc business needs
• Yes, Salesforce CAN go there
5. By [2020], our accumulated digital universe of data will grow from
4.4 zettabytes today to around 44 zettabytes, or
44 trillion gigabytes.
Even on a logarithmic scale, data is growing at an exponential rate…
7. …And Salesforce Orgs are Leading The Way
“The truth is that as salesforce.com popularity has
skyrocketed, so too has the size of databases
underlying custom and standard app implementations
on our cloud platforms. It might surprise you to learn
that our team works regularly with customers that have
large Force.com objects upwards of 10 million
records.”
Steve Bobrowski, Salesforce Customer Centric Engineering Group
8. Your Six Steps To Database Success
Step 1. Understand What You Can Control…(and what you can’t)
Step 2. Understand How your Data is Conceptualized
Step 3. Understand and Leverage Indexes
Step 4. Ask for Skinny Tables
Step 5. Develop Metadata Tables Where Possible
Step 6. With Lightning, Push Processing to Client-Side
9. Step 1. Understand What You Can Control…(and what you can’t)
“As a customer, you also cannot optimize the SQL that underlies many application operations because it is generated by the system, not written by each tenant.”
10. …And Managing Large Volumes in Salesforce is Different..
Multitenancy and Metadata
11. Step 2. Understand How your Data is Conceptualized
In Agile, the Class-diagrams of Domain
Modelling, derived from the Use-Cases, have
usually replaced Entity-Relationship modelling; but the
need for planning has not diminished. We still need to
understand the data and what it’s
supposed to do and what are the best and safest ways
to manage, store, and protect it.
….in other words…Are class-diagrams the enemy of database design?
13. Step 3. Understand and Leverage Indexes
Salesforce supports custom indexes to speed up queries, and you can create custom
indexes by contacting Salesforce Customer Support.
On Most Objects…
• RecordTypeId
• Division
• CreatedDate
• SystemModstamp
• Name
• Email (for contacts and leads)
• Foreign key relationships
• The unique Salesforce record ID
Salesforce also supports custom indexes on custom fields, except for:
• multi-select picklists
• text areas (long)
• text areas (rich)
• non-deterministic formula fields
• encrypted text fields
Declaring a field as an External ID causes an index to be created on that field. You can create External IDs only on the following field types:
• Auto Number
• Email
• Number
• Text
15. What Does The Query Optimizer Tell Me?
If the cost of the table scan is lower than the cost of the index, and the query is timing out, you will need to analyze further: either use other filters to improve selectivity, or check whether the query has another selective filter that is not indexed but is a candidate for an index.
16. What Are The Criteria for a Selective Query?
Does Your Query Have an Index?
• If the filter is on a standard field, it will have an index if it is a primary key (Id, Name, OwnerId), a foreign key (CreatedById, LastModifiedById, lookup, master-detail relationship), or an audit field (CreatedDate, SystemModstamp). Custom fields will have an index if they have been marked as Unique or External ID.
• If the filter doesn't have an index, it won't be considered for optimization.
• If the filter has an index, determine how many records it would return:
For a standard index, the threshold is 30 percent of the first million targeted records and 15 percent of all records after that first million. In addition, the selectivity threshold for a standard index maxes out at 1 million total targeted records, which you could reach only if you had more than 5.6 million total records.
For a custom index, the selectivity threshold is 10 percent of the first million targeted records and 5 percent of all records after that first million. In addition, the selectivity threshold for a custom index maxes out at 333,333 targeted records, which you could reach only if you had more than 5.6 million records.
If the filter exceeds the threshold, it won't be considered for optimization.
If the filter doesn't exceed the threshold, this filter IS selective, and the query optimizer will consider it for optimization.
• If the filter uses an operator that is not optimizable, it won’t be considered for optimization.
Operators that are not optimizable include !=, a leading % wildcard, and null value comparisons.
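The threshold rules above reduce to a small piece of arithmetic. The following Python sketch is only an illustration of the percentages and caps quoted on this slide; the function names are hypothetical, and the real query optimizer runs inside Salesforce and may weigh other factors.

```python
MILLION = 1_000_000


def selectivity_threshold(total_records: int, custom_index: bool = False) -> int:
    """Largest number of targeted records a filter may return and still be selective.

    Percentages and caps are the ones quoted on the slide; this is not Salesforce code.
    """
    if custom_index:
        first_pct, rest_pct, cap = 10, 5, 333_333
    else:
        first_pct, rest_pct, cap = 30, 15, 1_000_000
    first_million = min(total_records, MILLION)
    remainder = max(total_records - MILLION, 0)
    # Integer arithmetic avoids floating-point rounding on large counts.
    threshold = (first_million * first_pct + remainder * rest_pct) // 100
    return min(threshold, cap)


def is_selective(targeted_records: int, total_records: int, custom_index: bool = False) -> bool:
    """A filter is selective when it does not exceed the threshold."""
    return targeted_records <= selectivity_threshold(total_records, custom_index)
```

Under these rules, a filter on an object with 2 million records may target up to 450,000 records through a standard index, but only 150,000 through a custom index. Either cap is reached at roughly 5.67 million total records, which matches the "more than 5.6 million" figure above.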
23. Step 4. Ask for Skinny Tables
Salesforce uses the concept of “Skinny Tables” to speed up queries by avoiding joins
Characteristics…
• Must be enabled by Salesforce
• Is a collection of frequently used fields
• Records are kept in sync with the underlying table structure
• Contains both standard and custom fields
• Does not include soft-deleted records
• Ideal when your table size grows over a million records
• The unique Salesforce record ID
Considerations…
• Can be created on all custom objects…
• but only on certain standard objects
• Skinny tables can contain the following field types: Checkbox, Date, Date/Time, Email, Number, Percent, Phone, Picklist, Multi-select Picklist, Text, Text Area, Text Area (long), and URL
24. Step 5. Develop Metadata Tables Where Possible
Can you infer aggregate abstractions in your
data? If so, pull those away into a metadata table,
and query, sort and report on *that* table instead.
25. Step 6. With Lightning, Push Processing to Client-Side
If you are moving Excel tables to Salesforce and the user wants to ‘filter on the fly’, consider doing a broad query against Salesforce and loading the data into a Lightning Component (array or grid), where the user can further filter the data in an Excel-like manner.
26. Your Six Steps To Database Success
Step 1. Understand What You Can Control…(and what you can’t)
Step 2. Understand How your Data is Conceptualized
Step 3. Understand and Leverage Indexes
Step 4. Ask for Skinny Tables
Step 5. Develop Metadata Tables Where Possible
Step 6. With Lightning, Push Processing to Client-Side
27. Want To Know More?
Salesforce Best Practices For Large Data Volume:
• https://resources.docs.salesforce.com/sfdc/pdf/salesforce_large_data_volumes_bp.pdf
Trailhead:
• https://trailhead.salesforce.com/en/modules/database_basics_dotnet/units/writing_efficient_queries
Query Plan Tool Details:
• https://help.salesforce.com/articleView?id=000199003&language=en_US&type=1
Editor's Notes
Thanawalla Digital…Salesforce Architects and Engineers.
https://www.entrepreneur.com/article/273561
In May of this year, Entrepreneur magazine rang the alarm bell on the need to tackle big data in your org NOW. Their approach suggested that the problem is two-fold. First, the data itself is growing at an accelerating rate. That is, we want to store more information about each transaction and identify MORE touchpoints on MANY MORE clients and prospects than ever before. C-suite executives want to know that we are amassing all needles in every haystack, and rigorously building an ever more complex understanding of our markets and clients.
BUT!!!, the article goes on, that’s only HALF the story. The more data we accumulate, the more efficient our processing engines MUST be in order to tackle the reporting and tracking requirements set by our CMOs, CFOs…and down to our line managers. That’s where our companies are failing today. We ARE gathering more needles in more haystacks than ever before, but our ability to extract those needles IN-THE-MOMENT is significantly hampered by the data structures that we choose, and how we choose to access that data once it is in our possession.
In April of this year, Salesforce Customer Success invited us in to look at a problem that one of their premier clients was facing. Their database, mainly lead records, had grown…and continues to grow…at an exponential rate. This, in itself, does not usually cause a problem, but in this case, the number of records had already reached into the tens of millions, and the database was…is…growing at an exponential rate. The client was feeling real pain caused by delays in allocating leads. From the time a request to allocate leads came in…to the time that the leads were allocated…was typically 4 days or more. This time was expected to deteriorate even further as the number of records continued to grow. This is not uncommon. Your business will face a similar issue, perhaps even as soon as next year…
https://www.youtube.com/watch?v=0kTH15TsxDU&feature=youtu.be
Ray Kurzweil, author of The Singularity is Near, shows us how large this problem of data-doubling really is. He makes the point that if you take 30 steps of equal size…..say, 1 meter each…. to reach the end of the hall, at the end of the 30 steps, you’ll be at the end of the hall………
…….on the other hand, if you take 30 steps, each one twice the size of the previous one………the doubling of data size in our example……at the end of the 30 steps,
https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#ab59fe817b1e
you would have circled the earth 26 times. This, then, is the challenge that you face as your company’s database administrator.
AND…..Salesforce Orgs are not exempt from this geometric growth….
https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#ab59fe817b1e
Steve Bobrowski is an Architect Evangelist within the Salesforce Customer Centric Engineering group. Recently, he articulated Salesforce’s own experience with data-growth………[read]…..This number…that is, the number of TENANTS whose data exceeds 10 million records is also growing..
So that’s all well and good, but what CAN we do about storing and retrieving an ever-increasing number of records in our Salesforce database?
Today, we’ll talk about the six key concepts that you should architect around:
The key to understanding how to tackle the problem of an ever-expanding data set is to understand what you CAN control, and what you can’t. For those of you who come from a traditional database architecture background, you understand relational databases, indexes, SQL queries and the like. You may have also run your own queries on your local databases in Microsoft Access or SQL Server. But optimizing your database and your queries in a multi-tenant org is fundamentally different. For one thing, you don’t control your own SQL query. In fact, you can only provide abstract inputs into the THING that ultimately generates the SQL queries that extract your data. There are multiple reasons for this, not the least of which is how the data in Salesforce is ACTUALLY laid out, compared to how you THINK it’s laid out…
In Salesforce, your data for a single table is stored in multiple places. This architecture is necessary to (1) accommodate multiple tenants on the same server, and (2) abstract and maintain indexes and differing numbers (and types) of fields in the same physical table. For instance, all standard objects and their standard fields (that is, those items that EVERY tenant has in common) are, simply enough, stored in one table. However, the custom fields for these same standard objects are relegated, by necessity, to another table. You can see, then, that if you run a query that returns a combination of standard and custom fields from a standard object, Salesforce has to first translate the query into TWO Oracle SQL queries, execute those queries, and aggregate the results before showing them to you on your list-view page or report.
Similarly, custom objects and their fields are stored in other underlying SQL tables altogether. There are additional tables that store pivot tables for fields, and tables that store indexes and relationships. For today’s discussion, the index plays a front-and-center role.
Instead of attempting to manage a vast, ever-changing set of actual database structures for each application and tenant, the platform storage model manages virtual database structures using a set of metadata, data, and pivot tables.
Thus, if you apply traditional performance-tuning techniques based on the data and schema of your organization, you might not see the effect you expect on the actual, underlying data structures.
https://www.red-gate.com/simple-talk/sql/database-administration/how-to-get-database-design-horribly-wrong/
Robert Sheldon, in his article “How to Get Database Design Horribly Wrong”, points out that in most companies, the Agile methods of communication ignore the schema diagram in favor of class diagrams, which obscure the underlying intelligence of our database structure. As we get used to seeing class diagrams instead of schemas, we tend to slowly forget how our database is laid out at the database layer…in addition (next slide)…
https://www.red-gate.com/simple-talk/sql/database-administration/how-to-get-database-design-horribly-wrong/
He makes the point that you must keep your data clean and normalized. That is, follow the rules of data sanitation. Duplicate data must be rigorously prevented from entering your system, and duplicates that exist within your database today must be rooted out and eliminated. The other side of that same coin is to ensure that your data is normalized. Within the Salesforce paradigm, tables have parent/child relationships. Leverage this capability to ensure that you store a client’s billing address only once, and the shipping address only once, and that any time you need that address on an order, you look it up on the account object. Do not store one piece of data in multiple locations.
The last point in Robert Sheldon’s essay is to…….(next slide)
Keeping Your Data Clean
Why? How?
Keeping Your Data Relational
Don’t Store Your Data in Multiple Places
Index Your Database
What is an Index, and Why do I Care?
Optimize Your Queries
How?
Certain standard fields on virtually all objects that you might query are already indexed. That makes them great as the “WHERE” part of any SOQL query as well as the filter part of a list or report. In addition, if you create certain TYPES of custom fields, these too are automatically indexed for you. Everything else…that is, fields that don’t fall into these categories…MAY be indexed by asking Salesforce to index them for you. Open a case, and include in that request the org ID, the API name of the object, and the API name of the field within the object that you want indexed. Here (in the center column), you see the types of fields that Salesforce CANNOT index.
The Query Plan Tool is a button on the Developer Console that allows you to see the projected cost of a query. To enable the button, go to ‘Help’ on the Developer Console, and under ‘Preferences’ select Enable Query Plan Tool.
DEMO….show them how to enable the QUERY PLAN TOOL.
Why should you care about optimizing your queries? The biggest reason is this: if your query is not optimized, that is, it’s running a full table scan in order to extract your data, then…even if it’s performing reasonably well today…you risk the query timing out when your database grows. That is, the search is not sustainable long term. Your objective should always be to make sure that you have selective queries in your searches.
Selective: select name from account where name = 'GenePoint'
Not Selective because operation is not optimizable: select name from account where name != 'GenePoint'
Not considered for optimization because unindexed: select name from account where billingcity = 'paris'
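The reasoning behind these three verdicts can be written as a simple checklist. The Python sketch below is a hypothetical illustration, not the optimizer itself: the function name and the set of indexed fields are assumptions taken from the standard fields named on the earlier slide, and a filter that passes these checks still has to meet the selectivity threshold.

```python
# Assumed set of indexed Account fields, based on the standard indexes listed
# earlier (primary keys, foreign keys, audit fields). Hypothetical, not exhaustive.
INDEXED_FIELDS = {
    "id", "name", "ownerid", "createdbyid",
    "lastmodifiedbyid", "createddate", "systemmodstamp",
}


def optimizer_verdict(field, operator, value, indexed_fields=INDEXED_FIELDS):
    """Apply the slide's checklist to one WHERE-clause filter."""
    if field.lower() not in indexed_fields:
        return "not considered: field is not indexed"
    if operator == "!=":
        return "not considered: operator is not optimizable"
    if value is None:
        return "not considered: null value comparison"
    if operator.upper() == "LIKE" and str(value).startswith("%"):
        return "not considered: leading % wildcard"
    return "candidate: check the selectivity threshold"
```

For the three queries above, `optimizer_verdict("name", "=", "GenePoint")` returns the candidate verdict, the `!=` variant is rejected for its operator, and the `billingcity` filter is rejected because the field is not in the indexed set.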
In the case of our client, we identified metadata that could easily be extracted to a lookup table, which allowed queries to be executed in real time against a table that was significantly smaller. For instance, if your leads can be aggregated into districts or neighborhoods, and you are able to assign them as neighborhoods to an agent, you can filter your leads at the neighborhood level, and then, when you have determined the neighborhoods to assign (after filtering and sorting through the available neighborhoods), you can execute a final routine to change the owner of the leads associated with the selected neighborhoods. In some cases, we were able to run queries against a significantly smaller table (400k records) instead of doing the same thing against 80 million records in the lead table.
You are able to achieve this sort of improvement if you look at your queryable tables with an eye toward the metadata contained within the table, and ask the question: can we abstract the metadata away into a smaller table, run our queries against the smaller table, and recover the equivalent records in the original table at the end?
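As a rough illustration of the pattern described above, here is a minimal Python sketch using in-memory dictionaries in place of Salesforce objects. The field names (`neighborhood`, `district`, `owner`) and both function names are hypothetical; in an org, the summary would be a custom object and the final step an owner-update routine.

```python
from collections import defaultdict


def build_neighborhood_summary(leads):
    """Aggregate lead rows into one small summary row per neighborhood."""
    summary = defaultdict(lambda: {"district": None, "lead_count": 0})
    for lead in leads:
        row = summary[lead["neighborhood"]]
        row["district"] = lead["district"]
        row["lead_count"] += 1
    return dict(summary)


def assign_neighborhoods(leads, neighborhoods, new_owner):
    """Final routine: change the owner of every lead in the selected neighborhoods."""
    selected = set(neighborhoods)
    changed = 0
    for lead in leads:
        if lead["neighborhood"] in selected:
            lead["owner"] = new_owner
            changed += 1
    return changed


# Hypothetical sample data standing in for the large lead table.
sample_leads = [
    {"id": 1, "neighborhood": "Oak Lawn", "district": "North", "owner": "queue"},
    {"id": 2, "neighborhood": "Oak Lawn", "district": "North", "owner": "queue"},
    {"id": 3, "neighborhood": "Uptown", "district": "North", "owner": "queue"},
    {"id": 4, "neighborhood": "Bishop Arts", "district": "South", "owner": "queue"},
]
sample_summary = build_neighborhood_summary(sample_leads)
# Filtering and sorting happen on the small summary, not on the lead rows.
picked = [name for name, row in sample_summary.items() if row["district"] == "North"]
assign_neighborhoods(sample_leads, picked, "agent-42")
```

The interactive filtering only ever touches the summary rows; the large table is visited once, at the end, for the selected neighborhoods.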
Not all data should be filtered on the server. With Lightning Components, an architect has the ability to move significant processing away from the server side by executing broad filters against the target data, loading that data into client-side tables, and allowing the user to apply Excel-style column filters to suit their needs. This is particularly useful where the user needs to apply filters in an ad-hoc manner.
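The client-side half of this idea is just filtering rows that are already in memory. A real Lightning Component would do this in JavaScript; the Python sketch below only illustrates the logic, and the function name, column names, and sample rows are all hypothetical.

```python
def apply_column_filters(rows, filters):
    """Keep rows that satisfy every per-column predicate, like stacked spreadsheet filters.

    `rows` is the result of one broad server-side query, already loaded client-side.
    `filters` maps a column name to a function that returns True for values to keep.
    """
    return [
        row for row in rows
        if all(predicate(row[column]) for column, predicate in filters.items())
    ]


# Hypothetical rows returned by a single broad query.
loaded_rows = [
    {"name": "Acme", "city": "Dallas", "revenue": 120},
    {"name": "Globex", "city": "Austin", "revenue": 80},
    {"name": "Initech", "city": "Dallas", "revenue": 40},
]

# The user stacks ad-hoc filters without another round trip to the server.
dallas_large = apply_column_filters(
    loaded_rows,
    {"city": lambda c: c == "Dallas", "revenue": lambda r: r >= 100},
)
```

Each new combination of filters re-runs against the loaded rows only, so the server sees one broad query rather than one query per filter change.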
So what we’ve talked about are the six steps that we use at my company to look at a client’s database…with a critical eye towards significantly improving their capability to grow their Salesforce database without their business grinding to a halt.