In memory databases presentation

John Sullivan, CEO of INMemory.Net, gives an overview of in-memory databases.

Published in: Data & Analytics

  1. 1. In Memory Databases An Overview By John Sullivan john@inmemory.net
  2. 2. Row Store
  3. 3. Features • Data is stored sequentially by Row • Essentially an Array / List Structure • Easy to Add / Update / Insert /Delete • Need to read entire Row to get to one Column’s Data
  4. 4. Column Store
  5. 5. Features • Data is stored by Column • Faster to Read a few Columns • Very Hard to Update / Insert • Reading Data Sequentially from Column, CPU Cache Friendly
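The row store and column store trade-offs above can be sketched in a few lines of Python. This is only an illustration under assumed data: the three-column table, the column names and the helper functions are hypothetical and do not come from the slides.

```python
# Hypothetical three-column table (id, name, amount), used only for illustration.
rows = [
    (1, "Alice", 30.0),
    (2, "Bob", 12.5),
    (3, "Carol", 7.5),
]

COLUMN_NAMES = ("id", "name", "amount")


def sum_amount_row_store(table):
    """Row store: every whole record is visited to reach a single column."""
    return sum(record[2] for record in table)


# Column store: one contiguous list per column.
columns = {name: [record[i] for record in rows] for i, name in enumerate(COLUMN_NAMES)}


def sum_amount_column_store(cols):
    """Column store: only the 'amount' list is read, sequentially."""
    return sum(cols["amount"])


def insert_row(cols, record):
    """An insert writes to every column list instead of one record."""
    for name, value in zip(COLUMN_NAMES, record):
        cols[name].append(value)
```

Both sum functions return the same total; the difference is how much unrelated data each one has to walk past to get there.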
  6. 6. Compressed Column Store
  7. 7. Compressed Column Store • Column Array is converted into 2 arrays –One array contains a list of sorted Unique Values –Another array containing an integer index to the values
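A minimal Python sketch of the two-array layout described above, assuming the column fits in memory. The function names and the sample city column are hypothetical.

```python
from bisect import bisect_left


def encode_column(values):
    """Split a column into sorted unique values plus one integer index per row."""
    unique_values = sorted(set(values))
    # Binary search gives each row the position of its value in the sorted array.
    indexes = [bisect_left(unique_values, v) for v in values]
    return unique_values, indexes


def decode_column(unique_values, indexes):
    """Rebuild the original column from the two arrays."""
    return [unique_values[i] for i in indexes]


cities = ["Dublin", "Cork", "Dublin", "Galway", "Cork", "Dublin"]
unique_values, indexes = encode_column(cities)
```

Because the unique values are sorted, comparing two index integers gives the same ordering as comparing the underlying values.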
  8. 8. SQLite • Opened by Special Filename :memory: • Designed for Single Process / File • Great for embedded systems / mobile devices, e.g. iOS Apps • Row Store, No Column Store • One Writer only. Non Server Based. • Free & Open Source
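As a small illustration of the special `:memory:` filename, the following sketch uses Python's standard `sqlite3` module. The `orders` table and its rows are made up for the example.

```python
import sqlite3

# ":memory:" asks SQLite for a private database that lives only in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 3), ("acme", 2), ("zenith", 5)],
)
totals = conn.execute(
    "SELECT customer, SUM(qty) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()  # the database disappears when the connection closes
```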
  9. 9. Excel • Power Pivot, Introduced in Excel 2010 • Non SQL Query Language • Data Analysis Expressions (DAX) • Syntax similar to Excel Formulae • Requires Pro version of Office or Excel
  10. 10. Tableau • Primarily a Visualization Tool • Tableau Data Extracts (TDE) • Compressed Column Store • Generates one table flat Extract from Source ( that may involve joins ) • Uses ODBC / OLEDB For Extraction • Only loads required columns from Extract
  11. 11. Qlik • One of the Original Developers in Compressed Columnar In Memory Analytics • Nice Dashboards • Incremental Updates • Autojoins Fields based on Field Name • Scripting Language for Generating QVD Files
  12. 12. Qlik Load Script Example Companies: LOAD id AS COMPANY_ID, name as COMPANY_NAME, postcode AS COMPANY_POSTCODE, address AS COMPANY_ADDRESS, If(id > 100, 1, 0) AS FLAG_NATIONAL; SQL SELECT id, name, postcode, address FROM database.Companies;
  13. 13. MonetDB • Pioneer in Columnar Databases • Research Focussed out of the Netherlands • Open Source • Can Cache Expensive Computations and Reuse • Early versions were used by Data Distilleries, which got bought out by SPSS • R Integration
  14. 14. SQL Server Enterprise • ColumnStore Indexes –Data is stored by column. –Blocks of 1,048,576 Values • In-Memory OLTP –(MEMORY_OPTIMIZED=ON) after Create Table –Data/Delta files of 128 MB
  15. 15. Oracle • TimesTen – Works with Oracle Database as a Cache – Telecoms and Financial Companies • Oracle 12 Enterprise – Row & Column Formats – In Memory Columnstore • Exalytics
  16. 16. SAP Hana • Pure In Memory Database • In Memory OLTP Rowstore • In Memory Columnstore – Up to 2^31 rows per block • Cluster Large Fact tables across nodes • Hana One Available on EC2 & IBM
  17. 17. SAP Hana Architecture
  18. 18. MemSQL • Pure In Memory Database • MySQL Wire Protocol Compatible • Lockfree Linked Lists and Skiplists • SQL Queries compiled into C++ • Split Large Tables Across Nodes • Column Store Aimed at Analytics • Apache Spark Integration
  19. 19. Skiplists
  20. 20. Clustered Databases • Amazon Redshift • EMC Greenplum • IBM Netezza • HP Vertica • Teradata
  21. 21. Other In Memory Players • Sisense BI Focussed • Parstream Cisco Owned • Domo SAAS BI Company. Omniture Founder • Iri • InsightSquared BI Focussed • VoltDB Java Stored Procedure Unit of Exec • Infobright Open Source, based on MySQL • KDB Focussed on HFT / Terse
  22. 22. InMemory.Net public static void testDoublePerformance() { double total = 0; for (int kk = 0; kk < 1000000000; kk++) { total += kk; } Console.WriteLine(total); }
  23. 23. Results • Ran in about 2.5 seconds for a billion Rows • 400 million rows per second on a Single Core • About 50% of performance of C++ Prog. • 1.6 billion / second when running using 4 Cores • 2.0 billion / second when running with HT Cores
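The throughput figures on this slide follow from simple arithmetic, worked through below. The constant names are hypothetical; the numbers are the ones quoted on the slide.

```python
ROWS = 1_000_000_000          # loop iterations in the benchmark
SINGLE_CORE_SECONDS = 2.5     # reported run time on one core

# Rows processed per second on a single core.
single_core_rate = ROWS / SINGLE_CORE_SECONDS

# Reported multi-core rates, in rows per second.
four_core_rate = 1_600_000_000
hyperthreaded_rate = 2_000_000_000

# Speed-up relative to the single-core rate.
speedup_four_cores = four_core_rate / single_core_rate
speedup_hyperthreaded = hyperthreaded_rate / single_core_rate
```

On these figures the 4-core run scales linearly (4x), and hyper-threading adds roughly one more core's worth of throughput (5x).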
  24. 24. Initial Version • InMemoryColumn<T> { Dictionary <T,int> initialValuesDict; List <int> initialIndexes; T [] finalValues; int [] finalIndexes; • }
  25. 25. Next Version • InMemoryColumn<T> { Dictionary <T,int> initialValuesDict; int [][] initialIndexes; T [] finalValues; int [] finalIndexes; • }
  26. 26. Final Version • InMemoryColumn<T> { Dictionary <T,int> initialValuesDict; byte/ushort/int [][] initialIndexes; T [] finalValues; byte/ushort/int [] finalIndexes; • }
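The `byte/ushort/int` choice in the final version can be sketched as picking the narrowest unsigned integer type that can hold every dictionary index. The sketch below uses Python's `array` module as a stand-in for the typed C# arrays; `pick_typecode` and `build_indexes` are hypothetical names, not part of InMemory.Net.

```python
from array import array


def pick_typecode(unique_count):
    """Choose the narrowest unsigned type able to index unique_count values."""
    if unique_count <= 1 << 8:
        return "B"   # unsigned char, 1 byte (like C# byte)
    if unique_count <= 1 << 16:
        return "H"   # unsigned short (like C# ushort)
    return "I"       # unsigned int


def build_indexes(values):
    """Dictionary-encode a column, storing indexes in the narrowest array type."""
    unique_values = sorted(set(values))
    position = {value: i for i, value in enumerate(unique_values)}
    indexes = array(pick_typecode(len(unique_values)), (position[v] for v in values))
    return unique_values, indexes
```

A column with at most 256 distinct values then costs one byte per row, regardless of how wide the underlying values are.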
  27. 27. ANTLR to Parse Queries grammar Expr; prog: (expr NEWLINE)* ; expr: expr ('*'|'/') expr | expr ('+'|'-') expr | INT | '(' expr ')' ; NEWLINE : [\r\n]+ ; INT : [0-9]+ ;
  28. 28. Example Rule from Grammar mainquery [ImpVars vars] returns [InMemoryQuery query ] : { $query = new InMemoryQuery(); } SELECT1 (CACHE {$query.setCache();} )? (NOCACHE {$query.setNoCache();} )? (DISTINCT {$query.setDistinct();} )? fieldclause [$query,$vars] ( (INTO label { $query.setInto ($label.text2 ) ;})? FROM tableclause [$query,$vars] ( (COMMA|CROSS JOIN ) tableclause [$query,$vars] ) * (WHERE whereclause [$query,$vars])? (GROUP BY groupclause [$query,$vars])? (HAVING havingclause [$query,$vars])? (ORDER BY orderclause [$query,$vars])? (LIMIT limitclause [$query,$vars])? )? ;
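To show what the `Expr` grammar accepts, here is a hand-written recursive-descent parser in Python for the same expression language. It is not ANTLR output and not how InMemory.Net parses queries; it is only a sketch of the precedence the grammar encodes (`*` and `/` bind tighter than `+` and `-`, all left-associative). All names are hypothetical.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split an expression into integer, operator and parenthesis tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Return a nested (operator, left, right) tuple tree for one expression."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def atom():
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        token = take()
        if token == "(":
            node = additive()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            take()
            return node
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")

    def multiplicative():
        node = atom()
        while peek() in ("*", "/"):
            op = take()
            node = (op, node, atom())
        return node

    def additive():
        node = multiplicative()
        while peek() in ("+", "-"):
            op = take()
            node = (op, node, multiplicative())
        return node

    tree = additive()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

For example, `parse("1+2*3")` groups the multiplication first, and `parse("8-3-2")` groups from the left.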
  29. 29. Code Generation • Generate C# To Evaluate Query • Compiled Code undergoes JIT for fast exec • Parameterize Constants – Simplify complex Constant Expressions • Generic Table / Column Naming • Reuse Generated Code
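The ideas of parameterizing constants and reusing generated code can be illustrated with a toy generator. The slides describe generating C# that is then JIT-compiled; the Python sketch below only mimics the shape of that approach, and `compile_filter_count` and its cache are hypothetical.

```python
_compiled_cache = {}  # generated functions, keyed by the query shape


def compile_filter_count(op):
    """Return a function that counts rows where column <op> constant.

    The constant is a parameter of the generated function, not part of its
    source, so queries differing only in the constant share one function.
    """
    if op not in ("==", "<", ">"):
        raise ValueError(f"unsupported operator: {op}")
    if op not in _compiled_cache:
        source = (
            "def run(col0, const0):\n"
            "    total = 0\n"
            "    for value in col0:\n"
            f"        if value {op} const0:\n"
            "            total += 1\n"
            "    return total\n"
        )
        namespace = {}
        exec(compile(source, "<generated>", "exec"), namespace)
        _compiled_cache[op] = namespace["run"]
    return _compiled_cache[op]
```

Calling `compile_filter_count("==")` twice returns the same function object, so `WHERE employee = 1` and `WHERE employee = 7` would pay the generation cost only once. The generic `col0` and `const0` names play the role of the generic table and column naming mentioned on the slide.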
  30. 30. Detail Queries • Detail Query –Initial List Algorithm –Improved by using Arrays of Arrays –Only one thread works on one Array
  31. 31. SELECT customerid FROM Orders for (int tab1_counter = rowStart; tab1_counter < rowEnd; tab1_counter++) { groupRowD1 = groupRowCount >> 14; groupRowD2 = groupRowCount & 16383; if (groupRowD2 == 0) { if (groupRowD1 > 0) { blockCounts[groupRowD1 - 1] = 16384; } lock (lock_newBlockObject) { groupRowCount = nextRecordD1 << 14; nextRecordD1++; } groupRowD1 = groupRowCount >> 14; t_total0[groupRowD1] = new byte[16384]; total0 = t_total0[groupRowD1]; } total0[groupRowD2] = val_t1_c1[tab1_counter]; groupRowCount++; if ((groupRowCount & 16383) == 0) { blockCounts[groupRowD1] = 16384; } }
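The generated loop above addresses its output with `>> 14` and `& 16383`, that is, blocks of 16,384 entries. A single-threaded Python sketch of that arrays-of-arrays addressing follows; the `BlockedList` class is hypothetical, and it leaves out the lock the generated code uses to hand each thread its own block.

```python
BLOCK_BITS = 14
BLOCK_SIZE = 1 << BLOCK_BITS   # 16384 entries per block
BLOCK_MASK = BLOCK_SIZE - 1    # 16383


class BlockedList:
    """Append-only list stored as fixed-size blocks (an array of arrays)."""

    def __init__(self):
        self.blocks = []
        self.count = 0

    def append(self, value):
        block_no = self.count >> BLOCK_BITS   # which block
        offset = self.count & BLOCK_MASK      # position inside the block
        if offset == 0:
            # The previous block is full (or there is none): allocate a new one.
            self.blocks.append([0] * BLOCK_SIZE)
        self.blocks[block_no][offset] = value
        self.count += 1

    def __getitem__(self, i):
        if not 0 <= i < self.count:
            raise IndexError(i)
        return self.blocks[i >> BLOCK_BITS][i & BLOCK_MASK]
```

Growing the result never copies existing data: a full block is left in place and a fresh one is allocated, which is what lets separate threads fill separate blocks.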
  32. 32. Aggregative Queries • Group Cardinality =1 • Group Cardinality < 500k – Use Arrays of Arrays, – Lookup Key being Group Index • Group Cardinality > 500k – Use Dictionaries to Correlate Group Index -> Storage – Arrays of Arrays
  33. 33. SELECT customer, SUM(1) FROM orders WHERE employee=1 GROUP BY customer for (int tab1_counter = rowStart; tab1_counter < rowEnd; tab1_counter++, newRow = false) { if ((val_t1_c2[tab1_counter] == const_0_t1_c2)) { rowIndex = val_t1_c1[tab1_counter]; if (groupRowExists[rowIndex] == 0) newRow = true; groupRowExists[rowIndex] = 1; total1[rowIndex] += const_0; if (newRow) { total0[rowIndex]=val_t1_c1[tab1_counter]; } } }
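A Python rendering of the same idea may make the generated loop easier to follow: because each column is dictionary-encoded, the customer's value index can be used directly as the slot in the totals array. The function and argument names are hypothetical, and this sketch covers only the low-cardinality array case.

```python
def count_by_customer(customer_idx, employee_idx, employee_const, group_count):
    """SELECT customer, SUM(1) ... WHERE employee = const GROUP BY customer.

    customer_idx / employee_idx hold one dictionary index per row;
    group_count is the number of distinct customer values.
    """
    group_exists = bytearray(group_count)  # 1 once a group has been seen
    totals = [0] * group_count
    for row in range(len(customer_idx)):
        if employee_idx[row] == employee_const:
            group = customer_idx[row]      # the value index doubles as the group slot
            group_exists[group] = 1
            totals[group] += 1
    return {g: totals[g] for g in range(group_count) if group_exists[g]}
```

No hashing is needed in the inner loop; each qualifying row costs one comparison and two array writes.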
  34. 34. COUNT DISTINCT • Initial Algorithm used Byte [] • Used lots of Memory on Large Cores • Upgraded to 1 [] across all Cores • Interlocked.CompareExchange to set Bit • Hashmap for initial Values • Then switch to byte []
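One way to picture the bit-per-value approach is the single-threaded sketch below, which marks each dictionary index in a bitmap and counts the bits that flip. It is an assumption about the general technique, not the InMemory.Net implementation: the real version shares the array across cores and sets bits with `Interlocked.CompareExchange`, which is omitted here.

```python
def count_distinct(indexes, unique_count):
    """Count distinct dictionary indexes using one bit per possible value."""
    bits = bytearray((unique_count + 7) // 8)
    distinct = 0
    for i in indexes:
        byte_no = i >> 3          # which byte holds this value's bit
        mask = 1 << (i & 7)       # which bit inside that byte
        if not bits[byte_no] & mask:
            bits[byte_no] |= mask
            distinct += 1
    return distinct
```

With one bit per value, a column with a million distinct values needs about 122 KB of scratch space instead of a megabyte of bytes.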
  35. 35. Subqueries • Subquery in Table clause can be materialized into temp table ( CACHE ) • Simplify Subquery ( NOCACHE) Only Fields Parent SELECT Requires Pass Through Parent WHERE Clause
  36. 36. JOINS • LEFT & INNER JOIN SUPPORT • Merge Parent & Child Column Values • Parent Value -> Child Indexes • ONE to ONE – Join becomes an Array Lookup • ONE to Many – Join Becomes for Loop
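The "Parent Value -> Child Indexes" step can be sketched as follows, assuming both join columns have already been merged into one shared dictionary so that equal values have equal indexes. All function names are hypothetical.

```python
def build_child_lookup(value_count, child_idx):
    """Map each merged value index to the child rows holding that value."""
    lookup = [[] for _ in range(value_count)]
    for child_row, value in enumerate(child_idx):
        lookup[value].append(child_row)
    return lookup


def inner_join(parent_idx, lookup):
    """One-to-many join: a for loop over the child rows of each parent row."""
    pairs = []
    for parent_row, value in enumerate(parent_idx):
        for child_row in lookup[value]:
            pairs.append((parent_row, child_row))
    return pairs


def left_join(parent_idx, lookup):
    """Like inner_join, but parents without children are kept with None."""
    pairs = []
    for parent_row, value in enumerate(parent_idx):
        children = lookup[value]
        if not children:
            pairs.append((parent_row, None))
        for child_row in children:
            pairs.append((parent_row, child_row))
    return pairs
```

In the one-to-one case every inner list has at most one entry, so the lookup collapses to a flat array and the join is a single array read per parent row.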
  37. 37. Query Simplification • Rewrite Aggregate Queries with Expressions SELECT SUM(1) / SUM (qty ) FROM Orders becomes SELECT SUM(1) as A, SUM(QTY) as B from Orders followed by SELECT A/B FROM TEMP_QUERY
  38. 38. More Simplifications • Group Expressions with 1 Database Field e.g. Group by Month ( OrderDate ) Inner Join OrderDate to Table of Its Unique Values and Month ( OrderDate ) • Remove Redundant Group By Parts Group BY OrderDate , Month ( OrderDate ) becomes Group BY OrderDate
  39. 39. HAVING Clause • Convert to two Queries • One Query without Having Clause • Having Clause becomes Where of Second Query
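The two-query rewrite can be checked against an ordinary HAVING query. The sketch below uses Python's `sqlite3` purely as a convenient SQL engine; the `orders` data and the `temp_query` table name are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 3), ("acme", 4), ("zenith", 1), ("orbit", 9)],
)

# Original form: a single query with a HAVING clause.
with_having = conn.execute(
    "SELECT customer, SUM(qty) AS total FROM orders "
    "GROUP BY customer HAVING SUM(qty) > 5 ORDER BY customer"
).fetchall()

# Rewritten form, step 1: the same query without HAVING, materialized.
conn.execute(
    "CREATE TEMP TABLE temp_query AS "
    "SELECT customer, SUM(qty) AS total FROM orders GROUP BY customer"
)
# Step 2: the HAVING condition becomes the WHERE of a second query.
two_step = conn.execute(
    "SELECT customer, total FROM temp_query WHERE total > 5 ORDER BY customer"
).fetchall()
conn.close()
```

Both forms return the same rows, so the engine only needs to support WHERE filtering over an already aggregated result.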
  40. 40. Function List String Functions CAST | CAST_STR_AS_INT | CAST_STR_AS_DECIMAL | CHAR | CHARINDEX | COALESCE | CONCAT | CSTR | ENDSWITH | INSERT | ISNULL | ISNULLOREMPTY | LEFT | LEN | LCASE | LTRIM | REMOVE | REPLACE | REVERSE | RIGHT | RTRIM | SUBSTRING | STARTSWITH | TRIM | UCASE Date Functions CDATE | DATEADD | DATEDIFF | DATEDIFFMILLISECOND | DATEPART | DATESERIAL | DAY | DAYOFWEEK | MONTH | TRUNC | YEAR Math ABS | CAST_NUM_AS_BYTE | CAST_NUM_AS_DECIMAL | CAST_NUM_AS_DOUBLE | CAST_NUM_AS_INT | CAST_NUM_AS_LONG | CAST_NUM_AS_SHORT | CAST_NUM_AS_SINGLE | FLOOR | LOG | MAX | MAXLIST | MIN | MINLIST | POWER | RAND | ROUND | SIGN | SQRT Trigonometric ASIN | ACOS | ATAN | ATAN2 | COS | COSH | SIN | SINH | TAN | TANH Aggregate Functions MIN | MAX | COUNT | AVG | SUM | COUNT ( DISTINCT() ) | MINLIST | MAXLIST Statistical Functions STDEV| STDEVP | VAR | VARP
  41. 41. Special Cases • SELECT COUNT ( DISTINCT CUSTOMER ) FROM ORDERS Answer is No of Customer Values • SELECT DISTINCT CUSTOMER FROM ORDERS Answer is List of Customer Unique Values
  42. 42. Importing Data DATASOURCE a1=ODBC 'dsn=ir_northwind' IMPORT Customers=a1.customers IMPORT Products=a1.{SELECT * FROM Products} IMPORT orders=a1.'somequery.sql' SAVE
  43. 43. Importing Data II • ODBC / OLEDB / DOT NET Providers • Special ME Datasource • Existing In Memory Databases • UNION ALL Between Sources • SLURP Command • Variables, Expressions & IF
  44. 44. Interfacing to the Database • Native Dot Net API • Dot Net Data Provider • COM/ ACTIVEX API • ODBC Driver C / C++ IO Licensed ODBC Kit Parameterized Queries + Cursor Support
  45. 45. Hard Learned Lessons • Allocate and Store Variables Relating to One Thread Sequentially. Don’t intermix • Xeon Servers with Maxed out memory can have slower memory access speed – 1 Rank 1,866 MHz – 2 Ranks 1,600 MHz – 3 Ranks 1,333 MHz
  46. 46. Bitcoin Mining / HFT • CPUS • GPUs • FPGAs • Dedicated Mining Chip
  47. 47. GPU & InMemory Databases • GPUDB, MAPD – Good for Visualising Billions of Points – GPUs can run thousands of Cores on Data – GPU to Main Memory Bottleneck – Potentially more Data Reduction • Blazegraph, Graphsql Fast Graph Database that can use GPU
  48. 48. FPGA Potential • Field-Programmable Gate Array – is an integrated circuit designed to be configured by a customer or a designer after manufacturing – Programmable Integrated Circuit • Could be used to enhance In Memory DBs • Intel bought Altera back in June 2015 – Will roll technology out into Data Center
  49. 49. Hardware Transactional Memory • Simplifies Concurrent Programming – Group of Load & Store Instructions – Can Execute Atomically • Hardware or Software Transactional Memory • Intel TSX – Transactional Synchronization Extensions – Available in some Skylake Processors – Added to Haswell/Broadwell but Disabled
  50. 50. 3D XPoint Memory • Announced by Intel & Micron June 2015 • 1000 times more Durable than Flash • Like DRAM that has Permanence • Latency 10 times faster than NAND SSD • 4-6 Times slower than DRAM
  51. 51. Thanks for help with Market Research • Dan Khasis • Niall Dalton • Jeff Cordova – Wavefront • SapHanaTutorial.com
