While compute becomes faster and cheaper we are tempted to abandon sanity and shield ourselves from reality and laws of physics. The resulting mess of monstrous Slack instances rampaging across our RAM should makes us stop (because our computers did it already) and wonder where did we go wrong? Rising developer salaries and time to market pace are tempting us to abandon all hope for optimising our code and understanding our systems.
Contrary to what casual reader could think this is a deeply technical presentation. We will gaze into hardware counters, NUMA nodes, vector registers and that darkness will stare back at us.
All this to get a taste of what is possible on current hardware, to learn the COST of scalability and forever change how you feel when accessing invoice list in your local utilities provider UI so that after 20s of waiting all 12 elements will be displayed (surely Cthulhu must be eating their compute because it is NOT possible Tauron hosts it’s billing services on FIRST GEN IPHONE).
16. NOW, WE ARE GOING TO INTRODUCE SOME
NEW CANDIDATES FOR "STATE-OF-THE-ART"
ALGORITHMS, BY WRITING SOME FOR-
LOOPS AND RUNNING THEM ON MY LAPTOP.
Frank McSherry
SCALABILITY! BUT AT WHAT COST?
37. NUMA
RAM READ THROUGHPUT - PREDICTABLE PATTERN
@Benchmark
public int readByteBufferInOrder() {
Numa.runOnNode(threadNumaNode);
int result = 0;
for (int i = 0; i < byteBuffer.limit(); i += 8) {
result += byteBuffer.getLong(i);
}
return result;
}
44. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
MASK = 0111 1111
~MASK = 1000 0000
42 = 0010 1010
13 = 0000 1101
13 ^ MASK = 0111 0010
+ 42 = 1001 1100
&~MASK = 1000 0000
45. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
@Benchmark
public int plainIf() {
int result = 0;
for (int i = 0; i < inputSize; i++) {
if (onHeapValues.getInt(i * 4) < predicate) {
result++;
}
}
return result;
}
46. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
@Benchmark
public int plainArithmetic() {
int result = 0;
int mask = 0b0111_1111;
for (int i = 0; i < inputSize; i++) {
int x = onHeapValues.getInt(i * 4) ^ mask;
int z = (predicate + x) & ~mask;
result += Integer.bitCount(z);
}
return result;
}
47. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
0
150MS
300MS
450MS
600MS
0 10 20 30 40 50 60 70 80 90 100
CONDITIONAL ARITHMETIC
97ms
533ms
43ms 43ms
48. MULTI BYTE PROCESSING WITH FULL WORD INSTRUCTIONS
0
150MS
300MS
450MS
600MS
0 10 20 30 40 50 60 70 80 90 100
CONDITIONAL ARITHMETIC
PACKED
11ms
97ms
533ms
43ms 43ms
50. BRANCHLESS DOOM
“The mov-only DOOM renders approximately one
frame every 7 hours, so playing this version requires
somewhat increased patience.”
51. CONCLUSION
‣ COMPUTE IS GETTING FASTER
‣ DEVELOPERS ARE GETTING MORE EXPENSIVE
‣ (GOOD :))
‣ SOFTWARE & LIBRARIES ARE GETTING SLOWER
‣ OPPORTUNITIES AHEAD
52. CONCLUSION
‣ O(NLOGN) ALGORITHM IS 50 000 TIMES FASTER
THAN O(N2)
‣ DO WE WANT TO ADD 50 000 TIMES MORE
SERVERS OR OPTIMISE?
53. PAPERS
REFERENCES
‣Efficiently Compiling Efficient Query Plans for Modern
Hardware
‣Constant-Time Query Processing
‣https://danluu.com/branch-prediction/
‣StarCraft II as an AI research environment
‣Multiple Byte Processing with FullWord Instructions
‣http://www.vldb.org/2017/keynotes.php
54. TEXT
COPYRIGHTS
▸ Houston opportunities by XKCD.com
▸ A/B Testing example: https://en.wikipedia.org/wiki/A/
B_testing#/media/File:A-B_testing_example.png by
Maxime Lorant
▸ AMD for EPYC diagrams