Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
Thank you all for coming. Thanks for your time.
My name is Yang, or 李扬 in Chinese. Co-founder and CTO of Kyligence. Also a PMC member of Apache Kylin.
The topic today is Apache Kylin 2.0. It is in beta at the moment and is planned for release in April.
I will first briefly introduce what Apache Kylin is, and then talk about its important new features, including snowflake schema support, Spark cubing, and streaming.
So what is Apache Kylin? It is an OLAP engine on Hadoop, perhaps the most popular one at the moment. If you google “OLAP on Hadoop”, Kylin is the first result, at least when I tried this morning.
It sits on top of the Hadoop infrastructure and exposes your relational data to upper-layer applications via a standard SQL interface.
Kylin can handle very big data sets and is very fast in terms of query latency. For example, the biggest Kylin instance we know of in production is at toutiao.com, the number one news feed app in China. It has a table of 3,000 billion rows, and with Kylin, the average query response time is below 1 second.
Kylin can handle very complex data models, and we will talk more about snowflake support soon. The widest cube we know of is at CPIC (China Pacific Insurance, 中国太平洋保险), one of the top 3 insurance groups in China. It contains more than 60 dimensions.
And Kylin provides standard JDBC / ODBC / REST API interfaces. It integrates very well with existing BI tools, like Tableau and Power BI, you name it.
Why is Kylin so fast? We can show it with an example. Imagine a retail scenario: I want to report revenue by “returnflag” and “orderstatus” within a date range, to see the total amount of successful transactions, canceled transactions, returned transactions, etc. A typical SQL query will look like this.
To execute it, we compile it into a relational expression like the diagram on the left. This is the so-called execution plan. From the execution plan, we can see that executing the query involves scanning all the rows in the tables, joining them together, applying the date range filter, summing revenue by “returnflag” and “orderstatus”, and finally producing the sorted result.
It is easy to see that the time complexity of such an execution is at least O(N), where N is the total number of rows in the tables, because each table row is visited at least once. And that assumes the joined tables are perfectly partitioned and indexed, such that the expensive join operator can also finish in linear time, which is not very likely in real cases.
Anyway, O(N) is the best you can get when doing ad-hoc SQL processing.
How can Kylin go beyond O(N)? That is by precalculation.
So if I know the query pattern in advance, I can precalculate the Aggregate, Join, and Table Scan operators to create a cuboid. If cuboid sounds unfamiliar, you can think of it as a materialized summary table.
The summary table is the transaction amount grouped by “returnflag”, “orderstatus”, and “date”. There is a fixed number of return flags and order statuses, and let’s say the date range is limited too: for 3 years, there are about 1000 days. That means the number of rows in the summary table is at most “flags x statuses x days”, which is a constant in big O notation.
That means, if we execute the same SQL on the precalculated cuboid, the maximum number of rows to process is a constant. And that is why Kylin can be faster.
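To make the idea concrete, here is a minimal Python sketch of precalculation. It is not Kylin code: the sample rows, the function names `build_cuboid` and `query_cuboid`, and the in-memory dictionary are all assumptions made for illustration. The cuboid is built once by scanning every row, and the report is then answered from the cuboid alone.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (returnflag, orderstatus, day, revenue).
rows = [
    ("N", "O", date(2017, 1, 1), 100.0),
    ("N", "O", date(2017, 1, 1), 50.0),
    ("R", "F", date(2017, 1, 2), 30.0),
    ("N", "F", date(2017, 1, 3), 20.0),
    ("N", "O", date(2017, 1, 3), 70.0),
]


def build_cuboid(rows):
    """Precalculate SUM(revenue) grouped by (returnflag, orderstatus, day).

    This is the one-off O(N) pass over the raw rows.
    """
    cuboid = defaultdict(float)
    for flag, status, day, revenue in rows:
        cuboid[(flag, status, day)] += revenue
    return dict(cuboid)


def query_cuboid(cuboid, start, end):
    """Revenue by (returnflag, orderstatus) for start <= day <= end.

    Only cuboid rows are read, so the work is bounded by
    flags x statuses x days, not by the number of raw rows.
    """
    result = defaultdict(float)
    for (flag, status, day), revenue in cuboid.items():
        if start <= day <= end:
            result[(flag, status)] += revenue
    return dict(sorted(result.items()))


cuboid = build_cuboid(rows)
report = query_cuboid(cuboid, date(2017, 1, 1), date(2017, 1, 2))
print(report)
```

However many raw rows are loaded, the cuboid never grows beyond one entry per distinct (flag, status, day) combination, which is the constant bound described above.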
So Kylin is all about precalculation. The core idea is based on the classic cube theory and is developed from there into the SQL on big data domain.
Kylin provides Model and Cube to help you define the space of precalculation. It has a Build Engine that executes the precalculation in a distributed system using MR or Spark, and a Query Engine that lets you run SQL on top of the precalculated result.
The key here is modeling. If you have a good understanding of your data and business analysis requirements, you can capture the necessary precalculation with the right cube.
Once you have the right precalculation, most of your queries (if not all) can be transformed into cube queries like the one we have just seen. The execution time complexity can then be reduced to O(1), which gives very fast query speed regardless of the original data size.
That is the brief introduction of Apache Kylin.
Next we will talk about the Snowflake Schema support, which is the most important new feature in Kylin 2.0.
Kylin 1.0 has a limitation that it only supports the star schema data model. That means it only allows precalculating joins with 1 level of lookup tables, as the diagram shows. This makes it difficult to support real-world business cases, which are often more complicated than a star schema.
Kylin 2.0 introduces big changes to the model metadata and supports the snowflake data model out of the box. It allows precalculating joins across unlimited levels of lookup tables, as the diagram shows.
Also there were many bug fixes and improvements regarding joins and sub-queries. As a result, Kylin 2.0 is able to support very complex data models and queries.
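As a rough illustration of the difference between the two models, the Python sketch below flattens a small snowflake before aggregation. The tables (orders, customer, nation, region), their columns, and the `flatten` helper are hypothetical, and this shows the concept only, not how Kylin stores or builds its model. A star schema would stop after the first join; a snowflake keeps following foreign keys through lookup tables of lookup tables.

```python
# Hypothetical snowflake: orders -> customer -> nation -> region.
region = {1: {"region_name": "ASIA"}, 2: {"region_name": "EUROPE"}}
nation = {
    10: {"nation_name": "CHINA", "region_id": 1},
    20: {"nation_name": "FRANCE", "region_id": 2},
}
customer = {
    100: {"cust_name": "alice", "nation_id": 10},
    200: {"cust_name": "bob", "nation_id": 20},
}
orders = [
    {"order_id": 1, "cust_id": 100, "amount": 40.0},
    {"order_id": 2, "cust_id": 200, "amount": 60.0},
    {"order_id": 3, "cust_id": 100, "amount": 25.0},
]

# Each step is (foreign key column, lookup table). Order matters: a lookup's
# own foreign key only becomes available after the previous join.
JOIN_CHAIN = [("cust_id", customer), ("nation_id", nation), ("region_id", region)]


def flatten(fact_row, join_chain):
    """Join one fact row through every level of lookup tables."""
    flat = dict(fact_row)
    for foreign_key, lookup in join_chain:
        flat.update(lookup[flat[foreign_key]])
    return flat


flat_rows = [flatten(row, JOIN_CHAIN) for row in orders]

# With the joins done ahead of time, a dimension three levels away from the
# fact table can be aggregated like any other column.
revenue_by_region = {}
for row in flat_rows:
    name = row["region_name"]
    revenue_by_region[name] = revenue_by_region.get(name, 0.0) + row["amount"]
print(revenue_by_region)
```

With only the first entry of `JOIN_CHAIN`, the flattened rows would carry customer columns but no nation or region, which is the star schema limitation described above.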
To demonstrate the greatly improved modeling and SQL capability, we made the effort to run the TPC-H queries on Kylin 2.0. For those who are not familiar with TPC-H, the following is from the TPC-H website: it is a popular decision support benchmark for commercial RDBMS and DW solutions. It includes both queries and data that have broad industry-wide relevance. The queries are designed to examine large volumes of data, to have a high degree of complexity, and to give answers to critical business questions.
Kylin 2.0 can run all 22 TPC-H queries. We have put up a page with all the steps and resources needed to reproduce the work, so anyone who wants to verify it can give it a try in their own environment. This shows that with a precalculated cube, Kylin is also very flexible and can answer very complex queries.
Another note: the goal here is not to compare performance with other TPC-H results. On one hand, according to the TPC-H spec, precalculation across tables is not permitted, so in that sense Kylin does not qualify as a valid system to compare with other TPC-H results. On the other hand, we haven’t done performance tuning for the TPC-H queries yet; we just had enough time to let all the queries pass. There is still a lot of room for performance improvement.
Let me quickly show you some interesting TPC-H queries. On the left side is the SQL. Its font size is very small and it is not intended to be read. I highlighted the sub-queries in different colors, so we get a feeling for the complexity. On the right side is the relational expression of the query in tree form. In the tree, I marked precalculated nodes with a white background and red text, so they look lightweight. The other, solid nodes are subject to online calculation and will be the cause if a query runs slowly.
This is TPC-H query 07. As you can see, almost all of the relational operators are precalculated. The remaining Sort / Proj / Filter operators work on very few records, thus the query is very fast. It takes Kylin only 0.17 seconds to run, while on the same hardware and the same data set, Hive+Tez takes 35.23 seconds for this query. That shows the difference between precalculation and online calculation.
This is TPC-H query 11. It has 4 sub-queries in terms of SQL and is more complex than the previous query. Its relational expression tree is more complex too: it includes two branches, each loading from a precalculated cuboid. Finally, the results of the branches are joined again, which is a very heavy online computation. With the share of online calculation increased, the query takes longer for Kylin: 3.42 seconds. A full online calculation is still much slower and takes 15.87 seconds to run.
This is an even more complex query, TPC-H query 12. It contains 5 sub-queries, and the SQL font size is even smaller than on the previous pages, to fit the SQL on screen. The relational form is uglier too: it reads 3 cuboids in 3 branches and joins them together online. There are 2 heavy online join operators. As expected, the more online calculation there is, the slower the query gets. It takes 7.66 seconds to run in Kylin and 12.64 seconds in Hive+Tez. As more heavy operators stay online, the advantage of precalculation decreases.
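To put the three timings side by side, this small Python calculation derives the speedup ratio for each query from the numbers quoted above. The only assumption is that the 15.87 second "full online calculation" figure for query 11 is the Hive+Tez run, like the other two.

```python
# (Kylin seconds, Hive+Tez seconds) as quoted in the talk.
# Assumption: the Q11 "full online calculation" time is the Hive+Tez run.
timings = {
    "Q07": (0.17, 35.23),
    "Q11": (3.42, 15.87),
    "Q12": (7.66, 12.64),
}

# Speedup = online calculation time divided by Kylin time.
speedup = {query: hive / kylin for query, (kylin, hive) in timings.items()}

for query, ratio in speedup.items():
    print(f"{query}: Hive+Tez / Kylin = {ratio:.1f}x")
```

The ratio drops from roughly 200x for query 07 to under 5x for query 11 and under 2x for query 12, which matches the trend of more operators being left to online calculation.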
So to sum it up, Kylin 2.0 is much more than multidimensional analysis. With snowflake schema enabled, it can support very complex relational models and answer very complex queries. It can run the TPC-H queries, and has enhanced or added many other SQL features like Percentile functions, Window functions, and Time functions.
Compared to Kylin 1.0, Kylin 2.0 is a big step forward in terms of SQL maturity. Given the right model design, it can answer queries of any complexity. Compared to other data warehouses on Hadoop, Kylin 2.0 may still be a little behind in terms of SQL, e.g. it does not have UDFs yet, but the gap is very small now. And in terms of query latency, because of precalculation, Kylin has a big advantage.
Next we will talk about Spark Cubing, which cuts the build time by half.
Since 1.5, we have been trying to do cubing on Spark. However, at that time the attempt was not successful.
The first attempt was a port of the MR in-mem cubing algorithm onto Spark, which was the easiest thing to implement at that time. The algorithm uses as much memory as it can and builds the whole cube in one round, thus Spark’s memory cache is not an advantage, since the MR job is doing the same. The result was no obvious improvement compared to MapReduce.
So 2.0 did a complete rework based on the layered cubing algorithm. In layered cubing, the cube is calculated layer by layer, and each layer’s output is the next layer’s input. We use an RDD to abstract the layer data, so the parent RDD can be cached to speed up the processing of the next layer. An RDD can be exported to a sequence file using the same format as MapReduce, which keeps the cube data format compatible. Implementation-wise, the original mapper translates into “flatMap” and the original reducer translates into “reduceByKey”, so most of the code gets reused.
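The layer-by-layer idea can be sketched in plain Python without Spark. This is a simplified illustration, not Kylin's implementation: the dimension names, the sample rows, and the functions `aggregate`, `base_cuboid`, `child_cuboid`, and `layered_cube` are all hypothetical. The `aggregate` helper stands in for the role that “reduceByKey” plays, and each child cuboid is computed from a cuboid of the previous layer instead of from the raw rows.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("flag", "status", "day")  # hypothetical dimensions


def aggregate(pairs):
    """Sum values per key; a stand-in for Spark's reduceByKey."""
    out = defaultdict(float)
    for key, value in pairs:
        out[key] += value
    return dict(out)


def base_cuboid(rows):
    """Top layer: group the raw rows by all dimensions."""
    return aggregate((tuple(r[d] for d in DIMS), r["revenue"]) for r in rows)


def child_cuboid(parent_dims, parent, child_dims):
    """Derive a child cuboid by dropping dimensions from its parent."""
    keep = [parent_dims.index(d) for d in child_dims]
    return aggregate(
        (tuple(key[i] for i in keep), value) for key, value in parent.items()
    )


def layered_cube(rows):
    """Build every cuboid, one layer at a time, from the previous layer."""
    cube = {DIMS: base_cuboid(rows)}
    parent_layer = [DIMS]
    for size in range(len(DIMS) - 1, -1, -1):
        layer = []
        for child_dims in combinations(DIMS, size):
            # Any cuboid of the previous layer containing all child
            # dimensions can serve as the parent.
            parent_dims = next(
                p for p in parent_layer if set(child_dims) <= set(p)
            )
            cube[child_dims] = child_cuboid(
                parent_dims, cube[parent_dims], child_dims
            )
            layer.append(child_dims)
        parent_layer = layer
    return cube


rows = [
    {"flag": "N", "status": "O", "day": 1, "revenue": 100.0},
    {"flag": "N", "status": "O", "day": 2, "revenue": 50.0},
    {"flag": "R", "status": "F", "day": 1, "revenue": 30.0},
]
cube = layered_cube(rows)
print(cube[("flag",)])
```

Because each layer reads only the previous layer's output, keeping that output in memory is what lets the next layer start quickly, which is the benefit the RDD cache provides in the Spark version.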
This is a sample of a Spark job calculating the 3rd layer of a cube. The first 2 stages are skipped, because they have been calculated and cached by the previous layer’s job. The latter two stages do the additional processing to calculate the 3rd layer of the cube.
Now compare Spark cubing performance with MR layered cubing. The red bar is Spark and the blue bar is MR.
The test was done on a 4-node cluster, with Spark 1.6.3 on YARN and 24 vcores and 30 GB memory allocated to Spark. We tested 3 data sets of increasing size: 0.15 GB, 2.5 GB, and 8 GB. The first two data sets are very small compared to 30 GB of memory, so Spark could cache everything in memory, and as a result the Spark build time is about 50% of MR layered cubing. The 8 GB data set is much bigger, and as cubing causes data expansion, it does not all fit in memory, thus the advantage of Spark decreases a little, as the diagram shows.
Now compare Spark cubing to MR in-mem cubing. Again the red bar is Spark, and the blue bar is MR in-mem cubing.
They are almost equally fast. However, remember that in-mem cubing is very picky about data distribution. It is only effective when the data is sharded or nearly sharded. If you force in-mem cubing to run on random data, it performs even slower than MR layered cubing. In the diagram, the data sets are all sharded for in-mem cubing.
On the other hand, Spark cubing shows good performance consistently, regardless of how the data is distributed. So in general, Spark cubing is still a better choice than in-mem cubing.
The last thing is near real-time streaming.
This is actually a 1.6 feature. However, I feel that it was not marketed well enough, so here it is again.
Starting from 1.6, Kylin can connect to Kafka as a source, just like Hive. Using the in-mem cubing algorithm, we can trigger micro incremental builds very frequently, e.g. every 2 minutes. The result is many small cube segments, which can be queried to give near real-time results.
To show this really works, we have put up a demo site that analyzes Twitter messages in real time.
It runs on an 8-node AWS cluster with 3 Kafka brokers. The input is the Twitter sample feed, which has 10+ K messages per second. The cube is of average complexity, with 9 dimensions and 3 measures. With this setup, an incremental build is triggered every 2 minutes and finishes in 3 minutes. That’s why you will see 2 jobs running at the same time, as the screenshot shows. That is perfectly OK as long as your cluster continues to complete jobs at a fixed rate.
As a result, the system has about a 5-minute delay in terms of real-time-ness.
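The two numbers here, 2 concurrent jobs and about 5 minutes of delay, follow from simple arithmetic on the trigger interval and the build duration. The Python sketch below spells that out. The function names are hypothetical, and the delay model is my reading of the setup: a message that arrives just after a trigger waits one full interval for the next build, and then waits for that build to finish.

```python
import math


def concurrent_builds(trigger_interval_min, build_duration_min):
    """Peak number of builds in flight when a new one starts every interval."""
    return math.ceil(build_duration_min / trigger_interval_min)


def worst_case_delay_min(trigger_interval_min, build_duration_min):
    """Assumed worst case: wait for the next trigger, then for its build."""
    return trigger_interval_min + build_duration_min


jobs = concurrent_builds(2, 3)
delay = worst_case_delay_min(2, 3)
print(f"jobs in flight: {jobs}, worst-case delay: {delay} minutes")
```

The same arithmetic shows why the completion rate matters: if builds took longer than the cluster could sustain, the number of jobs in flight would keep growing instead of staying at 2.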
The demo shows Twitter message trends by language and by device. The left diagram is the message trend by language over a whole day. We can see English message volume goes up during US daytime, and meanwhile Asia message volume goes down because it’s night in Asia.
There’s also a tag cloud, showing the most recent hot topics, and below it the trend of the hottest tags.
All the charts are real-time up to the latest 5 minutes.
To summarize.
Apache Kylin is about to reach its 2.0 release. It has rich features like snowflake schema support, the ability to run the TPC-H queries, and many enhanced SQL functions. It has Spark cubing that halves the build time, and has streaming capability. 2.0 is still in beta at the moment, and there is a beta package which you can download and try. We are very eager to hear feedback from the community, good or bad. After fixing any critical issues, the plan is to release in April.
As to Kylin’s roadmap, there is a lot we want to do. Hadoop 3.0 with erasure coding could greatly reduce cube storage, which is something we will definitely catch up on. Spark cubing has a lot of room for improvement too, e.g. at the moment it does not source from Kafka yet. Connecting more sources is another frequently requested feature; we could source from JDBC, or maybe SparkSQL. Alternative storage, like Kudu? Last but not least, a lambda architecture to support true real-time.
That is all. Thanks for your time again.