Slides of IMC Essentials workshop.
The workshop covers fundamental capabilities of in-memory computing platforms that boost high-load applications and services, and bring existing IT architecture to the next level by storing and processing a massive amount of data both in RAM and, optionally, on disk.
The capabilities and benefits of such platforms will be demonstrated with the usage of Apache Ignite, which is the in-memory computing platform that is durable, strongly consistent, and highly available with powerful SQL, key-value and processing APIs.
The Apache Ignite Platform
Apache Ignite is a memory-centric data platform that is used to build fast, scalable & resilient solutions.
At the heart of the Apache Ignite platform lies a distributed memory-centric data storage platform with ACID semantics, and powerful processing APIs including SQL, Compute, Key/Value and transactions. Built with a memory-centric approach, this enables Apache Ignite to leverage memory for high throughput and low latency whilst utilising local disk or SSD to provide durability and fast recovery.
The main difference between the memory-centric approach and the traditional disk-centric approach is that the memory is treated as a fully functional storage, not just as a caching layer, like most databases do. For example, Apache Ignite can function in a pure in-memory mode, in which case it can be treated as an In-Memory Database (IMDB) and In-Memory Data Grid (IMDG) in one.
On the other hand, when persistence is turned on, Ignite begins to function as a memory-centric system where most of the processing happens in memory, but the data and indexes get persisted to disk. The main difference here from the traditional disk-centric RDBMS or NoSQL system is that Ignite is strongly consistent, horizontally scalable, and supports both SQL and key-value processing APIs.
Apache Ignite platform can be integrated with third-party databases and external storage mediums and can be deployed on any infrastructure. It provides linear scalability, built-in fault tolerance, comprehensive security and auditing alongside advanced monitoring & management.
The Apache Ignite platform caters for a range of use cases including: Core banking services, Real-time product pricing, reconciliation and risk calculation engines, analytics and machine learning.
Ignite Data Grid is a distributed key-value store that enables storing data both in memory and on disk within distributed clusters and provides extensive APIs. Ignite Data Grid can be viewed as a distributed partitioned hash map with every cluster node owning a portion of the overall data. This way the more cluster nodes we add, the more data we can store.
Apache Ignite incorporates distributed SQL database capabilities as a part of its platform. The database is horizontally scalable, fault tolerant and SQL ANSI-99 compliant. It supports all SQL, DDL, and DML commands including SELECT, UPDATE, INSERT, MERGE, and DELETE queries. It also provides support for a subset of DDL commands relevant for distributed databases.
Data sets as well as indexes can be stored both in RAM and on disk thanks to the durable memory architecture. This allows executing distributed SQL operations across different memory layers achieving in-memory performance with durability of disk.
You can interact with Apache Ignite using SQL language via natively developed APIs for Java, .NET and C++, or via the Ignite JDBC or ODBC drivers. This provides a true cross-platform connectivity from languages such as PHP, Ruby and more.
Ignite In-Memory Compute Grid allows executing distributed computations in a parallel fashion to gain high performance, low latency, and linear scalability. Ignite compute grid provides a set of simple APIs that allow users distribute computations and data processing across multiple computers in the cluster.
The disk-centric systems, like RDBMS or NoSQL, generally utilize the classic client-server approach, where the data is brought from the server to the client side where it gets processed and then is usually discarded. This approach does not scale well as moving the data over the network is the most expensive operation in a distributed system.
A much more scalable approach is collocated processing that reverses the flow by bringing the computations to the servers where the data actually resides. This approach allows you to execute advanced logic or distributed SQL with JOINs exactly where the data is stored avoiding expensive serialization and network trips.
https://ignite.apache.org/collocatedprocessing.html
Collocation of computations with data allow for minimizing data serialization within network and can significantly improve performance and scalability of your application. Whenever possible, you should always make best effort to colocate your computations with the cluster nodes caching the data that needs to be processed.
Let's assume that a blizzard is approaching New York City. You, as a telecommunication company has to warn all the people sending a message to everyone with precise instructions on how to behave during such weather conditions. There are around 8 million New Yorkers in your database that have to receive the text message.
With the client-server approach the company has to connect to the database, move all 8 million (!) records from there to a client application that will text to everyone. This is highly inefficient that wastes network and computational resources of company's IT infrastructure.
However, if the company initially collocates all the cities it covers with the people who live there then it can send a single computation (!) to the cluster node that stores information about all New Yorkers and send the text message from there. This approach avoids 8 million records movement over the network and helps utilizing cluster resources for computation needs. That's the collocated processing in action!
https://github.com/techbysample/gagrid
GA Grid (Beta) is an in memory Genetic Algorithm (GA) component for Apache Ignite. A GA is a method of solving optimization problems by simulating the process of biological evolution. GA Grid provides a distributive GA library built on top of a mature and scalable Apache Ignite platform. GAs are excellent for searching through large and complex data sets for an optimal solution. Real world applications of GAs include: automotive design, computer gaming, robotics, investments, traffic/shipment routing and more.
Glossary
Chromosome is a sequence of Genes. A Chromosome represents a potential solution.
Crossover is the process in which the genes within chromosomes are combined to derive new chromosomes.
Fitness Score is a numerical score that measures the value of a particular Chromosome (ie: solution) relative to other Chromosome in the population.
Gene is the discrete building blocks that make up the Chromosome.
Genetic Algorithm (GA) is a method of solving optimization problems by simulating the process of biological evolution. A GA continuously enhances a population of potential solutions. With each iteration, a GA selects the 'best fit' individuals from the current population to create offspring for the next generation. After subsequent generations, a GA will "evolve" the population toward an optimal solution.
Mutation is the process where genes within a chromosomes are randomly updated to produce new characteristics.
Population is the collection of potential solutions or Chromosomes.
Selection is the process of choosing candidate solutions (Chromosomes) for the next generation.
DEMO: run several ML samples from the standard distribution.
Main benefits:
No ETL – online “in place” ML
In-memory speed & scale
Large scale parallelization
Optimized ML/DL algorithms
Last-mile GPU optimization
The rationale for building ML Grid is quite simple. Many users employ Ignite as the central high-performance storage and processing systems for various data sets. If they wanted to perform ML or Deep Learning (DL) on these data sets (i.e training sets or model inference) they had to ETL them first into some other systems like Apache Mahout or Apache Spark.
The roadmap for ML Grid is to start with core algebra implementation based on Ignite co-located distributed processing. The initial version was released with Ignite 2.0. Future releases will introduce custom DSLs for Python, R and Scala, growing collection of optimized ML algorithms such as Linear and Logistic Regression, Decision Tree/Random Forest, SVM, Naive Bayes, as well support for Ignite-optimized Neural Networks and integration with TensorFlow.
Current beta version of Apache Ignite Machine Learning Grid (ML Grid) supports a distributed machine learning library built on top of highly optimized and scalable Apache Ignite platform and implements local and distributed vector and matrix algebra operations as well as distributed versions of widely used algorithms.
Apache Ignite memory-centric platform is based on the Durable Memory architecture that allows storing and processing data and indexes both in memory and on disk when the Ignite Persistent Store feature is enabled. The memory architecture helps achieve in-memory performance with durability of disk using all the available resources of the cluster.Ignite's durable memory is built and operates in a way similar to the Virtual Memory of operating systems such as Linux. However, one significant difference between these two types of architectures is that Durable Memory always keeps the whole data set and indexes on disk if the Ignite Persistent Store is used, while Virtual Memory uses the disk for swapping purposes only.
In-Memory
• Off-Heap memory
• Removes noticeable GC pauses
• Automatic Defragmentation
• Predictable memory consumption
• Boosts SQL performance
On Disk
• Optional Persistence
• Support of flash, SSD, Intel 3D Xpoint
• Stores superset of data
• Fully Transactional
◦ Write-Ahead-Log (WAL)
• Instantaneous Cluster Restarts
Ignite Native Persistence is a distributed ACID and SQL-compliant disk store that transparently integrates with Ignite's Durable Memory as an optional disk layer storing data and indexes on SSD, Flash, 3D XPoint, and other types of non-volatile storages.
With the Ignite Persistence enabled, you no longer need to keep all the data and indexes in memory or warm it up after a node or cluster restart because the Durable Memory is tightly coupled with persistence and treats it as a secondary memory tier. This implies that if a subset of data or an index is missing in RAM, the Durable Memory will take it from the disk.
B-tree is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
B+Tree is a central part of the whole Ignite Virtual memory architecture because even basic key-value operations work via it (cache.get and cache.put)! Move to the next slide.
On the previous slide we explained how to look up a value inside of the virtual memory. However, how does the virtual memory know where to put a new value?
In fact, Ignite uses a special data structure called Free List to support this. Basically, a free list is a doubly linked list that stores references to pages of approximately equal free space. For instance, there is a free list that stores all the data pages that have up to 75% free space and a list that keeps track of the index pages with 25% capacity left.
Data and index pages are tracked in separate free lists.
Ignite Native Persistence is a distributed ACID and SQL-compliant disk store that transparently integrates with Ignite's Durable Memory as an optional disk layer storing data and indexes on SSD, Flash, 3D XPoint, and other types of non-volatile storages.
With the Ignite Persistence enabled, you no longer need to keep all the data and indexes in memory or warm it up after a node or cluster restart because the Durable Memory is tightly coupled with persistence and treats it as a secondary memory tier. This implies that if a subset of data or an index is missing in RAM, the Durable Memory will take it from the disk.