Companies want to validate products early, with little time for good engineering and performance work. Yet good code can provide 10-100x speed up which brings tremendous value to clients. We get help from modern hardware and algorithms but we need to know what are its strengths and limitations so we can consciously decide when to invest in engineering and what added value to expect.
9. 0.1X DEVELOPER NEEDED
THE 0.1X DEVELOPER
▸ Let’s not add that feature
▸ Let’s not adopt this new technology
▸ Let’s not automate this
▸ Can we approximate the result so no one notices?
▸ Can we use a small lookup table?
▸ Benji Weber: https://benjiweber.co.uk/blog/2016/01/25/why-i-strive-to-be-a-0-1x-
engineer/
▸ Mike Acton: https://twitter.com/mike_acton/status/989001065893801984
16. NOW, WE ARE GOING TO INTRODUCE SOME
NEW CANDIDATES FOR "STATE-OF-THE-ART"
ALGORITHMS, BY WRITING SOME FOR-
LOOPS AND RUNNING THEM ON MY LAPTOP.
Frank McSherry
SCALABILITY! BUT AT WHAT COST?
36. NUMA
RAM READ THROUGHPUT - PREDICTABLE PATTERN
@Benchmark
public int readByteBufferInOrder() {
Numa.runOnNode(threadNumaNode);
int result = 0;
for (int i = 0; i < byteBuffer.limit(); i += 8) {
result += byteBuffer.getLong(i);
}
return result;
}
39. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
0
150MS
300MS
450MS
600MS
Selectivity (%)
0 10 20 30 40 50 60 70 80 90 100
533ms
43ms 43ms
40. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
@Benchmark
public int plainIf() {
int result = 0;
for (int i = 0; i < inputSize; i++) {
if (onHeapValues.getInt(i * 4) < predicate) {
result++;
}
}
return result;
}
SIMPLE PREDICATE
42. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
MASK = 0111 1111
~MASK = 1000 0000
42 = 0010 1010
13 = 0000 1101
13 ^ MASK = 0111 0010
+ 42 = 1001 1100
&~MASK = 1000 0000
43. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
@Benchmark
public int plainIf() {
int result = 0;
for (int i = 0; i < inputSize; i++) {
if (onHeapValues.getInt(i * 4) < predicate) {
result++;
}
}
return result;
}
SIMPLE PREDICATE
44. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
@Benchmark
public int plainIf() {
int result = 0;
for (int i = 0; i < inputSize; i++) {
result += RETURN_1_IF_MATCH(
onHeapValues.getInt(i * 4),
predicate
)
}
return result;
}
BRANCHLESS (PSEUDOCODE)
45. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
@Benchmark
public int plainArithmetic() {
int result = 0;
int mask = 0b0111_1111;
for (int i = 0; i < inputSize; i++) {
int x = onHeapValues.getInt(i * 4) ^ mask;
int z = (predicate + x) & ~mask;
result += Integer.bitCount(z);
}
return result;
}
BRANCHLESS
46. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
0
150MS
300MS
450MS
600MS
0 10 20 30 40 50 60 70 80 90 100
CONDITIONAL ARITHMETIC
97ms
533ms
43ms 43ms
47. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
0
150MS
300MS
450MS
600MS
0 10 20 30 40 50 60 70 80 90 100
CONDITIONAL ARITHMETIC
PACKED
11ms
97ms
533ms
43ms 43ms
49. BRANCHLESS DOOM
“The mov-only DOOM renders approximately
one frame every 7 hours,
so playing this version requires somewhat increased patience.”
50. CONCLUSION
‣ THERE ARE TWO HATS YOU CAN WEAR
‣ 0.1X CONCENTRATES ON USER VALUE
‣ 10X ON TECHNOLOGY
‣ THESE ARE NOT EXCLUSIVE
‣ LEARN ALGORITHMS AND DATA STRUCTURES (LINDY’S!)
‣ EXPERIMENT
‣ DEVELOP INTUITION OF WHAT IS POSSIBLE
51. PAPERS
REFERENCES
‣Efficiently Compiling Efficient Query Plans for Modern
Hardware
‣Constant-Time Query Processing
‣https://danluu.com/branch-prediction/
‣StarCraft II as an AI research environment
‣Multiple Byte Processing with FullWord Instructions
‣http://www.vldb.org/2017/keynotes.php
52. COPYRIGHTS
COPYRIGHTS
▸ Houston opportunities by XKCD.com
▸ A/B Testing example: https://en.wikipedia.org/wiki/A/
B_testing#/media/File:A-B_testing_example.png by Maxime
Lorant
▸ AMD for EPYC diagrams
▸ Martin Thompson (https://mechanical-
sympathy.blogspot.com/2013/02/cpu-cache-flushing-
fallacy.html) for 2 socket CPU diagram