Astra is a distributed SQL database for data analysis and prediction. We aim to achieve near-real-time data analysis and to deliver the components of a Data Lake as a Service, of which Astra is a part. Another feature of Astra is its integration with machine learning to support many kinds of data analysis.
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analysis, Astra Project
1. Rakuten Technology Conference 2017
A Distributed SQL Database
For Data Analysis, Astra Project
2017-10-28
Yosuke Hara (原 陽亮)
Rakuten Institute of Technology
Rakuten, Inc. rev. 1.0.5
2. R&D Projects
Skylab: A Microservices Framework
LeoFS: A Distributed Storage
Astra: A Distributed SQL Database For Data Analytics
6. Initial Concept
Provides Components of DataLake as a Service
[Figure: Data Science + DataLake. Labels: Self-Service Analytics, Data Governance, Job Scheduler, Distributed Computing, Data Store, Astra, Skylab, Spark, Hadoop]
7. Current Concept
Advanced Data Analysis In Semi-Realtime At Low Cost
[Figure: data flow. Labels: Streaming Data, Un/Semi-Structured Data, Store Data Into Astra, Aggregate and Analyze Data, Find Insights, Data, Intelligence, Action, Tools / Apps, Automated Systems]
8. Current Concept: Depends on Single Source Of Truth
Astra’s Components
- Distributed Computing For Massive-Parallel Processing
- Distributed Database For Aggregation and Analysis
- Distributed Storage (DataLake Store)
Self-Service Analytics
Data Governance
In-place Analysis
10. Database
SQL Engine
Data Science
Analysis Functions
On The Distributed
Computing
Reliability, Scalability, and
Massive Parallel Processing
Ad-hoc Query
Various Data
Without Limit
Data Store
Unified Components
11. The Features - ANSI SQL99 Standard
Conforms To ANSI SQL99 Standard
• Communication With Any BI / Data Visualization Tools, and Apps
• Able To Call All Astra’s Functions, UDFs and ML With SQL
astra:test> SELECT workclass, COUNT(income)
-> AS income_count
-> FROM adult_income
-> WHERE income = '<=50K'
-> GROUP BY workclass
-> ORDER BY workclass;
workclass | income_count
------------------+--------------
? | 2534
Federal-gov | 871
Local-gov | 2209
Never-worked | 10
Private | 26519
Self-emp-inc | 757
Self-emp-not-inc | 2785
State-gov | 1451
Without-pay | 19
(9 rows)
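The query above is plain ANSI SQL, so it can be tried against any SQL engine. The following is a minimal sketch that runs the same statement with Python's built-in sqlite3 module as a stand-in for Astra. The sample rows and the function name count_low_income_by_workclass are hypothetical and are not the dataset shown on the slide.

```python
import sqlite3

# Hypothetical sample rows (workclass, income); NOT the dataset from the slide.
SAMPLE_ROWS = [
    ("Private", "<=50K"),
    ("Private", ">50K"),
    ("Private", "<=50K"),
    ("State-gov", "<=50K"),
    ("Local-gov", ">50K"),
    ("Local-gov", "<=50K"),
]

# Same statement as on the slide.
QUERY = """
SELECT workclass, COUNT(income) AS income_count
FROM adult_income
WHERE income = '<=50K'
GROUP BY workclass
ORDER BY workclass;
"""


def count_low_income_by_workclass(rows):
    """Load rows into an in-memory table and run the slide's query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE adult_income (workclass TEXT, income TEXT)")
        conn.executemany("INSERT INTO adult_income VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


for workclass, income_count in count_low_income_by_workclass(SAMPLE_ROWS):
    print(f"{workclass:<16} | {income_count}")
```

SQLite is used only because it ships with Python; a real client would connect to Astra over ODBC/JDBC instead.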
12. The Features - Advanced Data Analytics
Advanced Data Analytics On The Distributed Computing, Massive-Parallel Processing
• Built-In Analysis Functions and UDF
• Machine Learning
[Figure: analysis cycle. Labels: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, Feedback]
Able To Repeat Trial And Error w/o Limit
13. The Features - Availability and Scalability
High Availability
• Automated Data Replication And Recovery, and Failover
High Scalability
• An Elastic Cluster - Nodes That Can Flexibly Attach And Detach
[Figure: cluster diagram. Labels: Clients, Request, Response, HTTP, Coordinator(s), Worker (x5), Message with Gossip Protocol, Monitoring Resources, Scheduling Jobs, Requesting Jobs, Circuit Breaker]
Figure: Akka Circuit breaker
* Circuit Breaker: martinfowler.com/bliki/CircuitBreaker.html
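The slide names the Circuit Breaker pattern without showing code. The following is a minimal sketch of the pattern as described in the referenced article, not Astra's or Akka's implementation. The class name, the parameters max_failures and reset_timeout, and the injectable clock are assumptions made for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=5.0, clock=time.monotonic):
        self.max_failures = max_failures    # failures allowed before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call (half-open) or too many failures (closed)
            # opens the circuit and restarts the timeout.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A coordinator could wrap each job request to a worker in breaker.call(...), so that a failing worker is skipped quickly instead of being retried on every request.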
15. High-level Architecture
SQL Engine
Workers
Database Layer
DataStore Layer
Astra CLI
Clients
SQL over ODBC/JDBC
Astra DataStore
AstraSQL
AstraBase
- Original Data
- Semi-Structured Data
- Cold Data
- Columnar Tables
- Metadata Store
- Record Operation
- Record Set Cache (Hot Data)
- Distributed Computing
- Data Analysis
- Data Converter
- Semi-Structured Data To
Columnar Table
Original Data Load
Operate Astra
Multi-Coordinator
16. LeoFS For Astra DataStore
LeoFS is a software-defined storage (SDS) for DataLake and Web.
LeoFS is an Enterprise Open Source Storage: a highly available, distributed, eventually consistent object/blob store.
Goals:
- High Availability
- High Cost Performance Ratio
- High Scalability
17. Processing Flow - Store a CSV file, Then Query Data
[Store a CSV File]
1-1. Put Original Data w/AstraCLI
1-2. Create Metadata
2. Store the Data and Metadata
[Execute Query]
3. Query Data For Aggregation Or Data Analysis
[Convert Data Format At Async]
4. Request Converting Data Format of a Table
5. Convert Data Format of a Table and Change Table’s Metadata
6. Store Converted Data
[Figure: flow diagram. Components: AstraCLI, AstraSQL, AstraBase Coordinator(s), AstraBase Workers, Resource Monitor + Scheduler, Astra DataStore (LeoFS). Interfaces: REST-API, S3-API, gRPC, O/JDBC]
18. Processing Flow - Query for Advanced Analysis
Phases: [Execute Query], [Retrieve Records], [Reply]
1. Execute SQL For Data Analysis
2-1. Request Data Analysis to AstraBase
2-2. Request Message to AstraBase’s Workers
3-1. Retrieve Target Records from the Cache
3-2. Retrieve Target Records From LeoFS (Cache Miss)
4. Process Data Analysis in Parallel
5. Reply To AstraBase Coordinator, Then Summarize the Result on the Coordinator
[Figure: flow diagram. Components: AstraSQL, AstraBase Coordinator(s), AstraBase Workers, Resource Monitor + Scheduler, Astra DataStore (LeoFS). Interfaces: O/JDBC, gRPC, S3-API]
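The query flow on this slide (fan out to workers, read from the cache or fall back to the store, compute in parallel, summarize on the coordinator) can be sketched as below. This is an illustration under stated assumptions, not Astra's code: Worker, coordinator_mean, the dict used as a stand-in for LeoFS, and the round-robin assignment are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Hypothetical worker with a record-set cache in front of a slower store."""

    def __init__(self, store):
        self.store = store          # stand-in for LeoFS: partition -> list of numbers
        self.cache = {}
        self.cache_misses = 0
        self._lock = threading.Lock()

    def load(self, partition):
        with self._lock:
            # Step 3-1: try the cache; step 3-2: read the store on a cache miss.
            if partition not in self.cache:
                self.cache_misses += 1
                self.cache[partition] = self.store[partition]
            return self.cache[partition]

    def partial_sum_count(self, partition):
        # Step 4: each worker computes a partial result for its partition.
        records = self.load(partition)
        return sum(records), len(records)


def coordinator_mean(workers, partitions):
    """Steps 2-2 and 5: fan tasks out to workers, then summarize the partials."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [
            pool.submit(workers[i % len(workers)].partial_sum_count, p)
            for i, p in enumerate(partitions)
        ]
        partials = [f.result() for f in futures]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

Running the same query twice with the same workers touches the store only on the first run; the second run is served from each worker's cache.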
19. To Handle Plural Data Formats In A Table
1. Data Load
- Store Files Into Astra (Original Data, Semi-Structured Files)
- Data Validation
- Data Verification
- Data Type Inference
- Partition Into Plural Chunks (Data is partitioned by a condition of a specified column)
- Store Chunks and Metadata
2. Data Conversion At Async
- CSV / TSV / JSON To Parquet / CarbonData SerDes
Able To Do Self Data Analytics Even During Data Conversion
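Partitioning loaded data into chunks by the value of a specified column can be sketched as follows. The function name partition_csv, the chunk_size parameter, and the equality-based partition condition are assumptions; the slide does not say which conditions Astra supports.

```python
import csv
import io
from collections import defaultdict


def partition_csv(text, column, chunk_size=2):
    """Group CSV rows by one column's value, then split each group into chunks."""
    reader = csv.DictReader(io.StringIO(text))
    # Minimal validation: the partition column must exist in the header.
    if column not in (reader.fieldnames or []):
        raise ValueError(f"unknown partition column: {column}")
    groups = defaultdict(list)
    for row in reader:
        groups[row[column]].append(row)
    # Each partition becomes a list of chunks of at most chunk_size rows.
    return {
        key: [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
        for key, rows in groups.items()
    }
```

Each chunk could then be stored as a separate object, which lets a later asynchronous step convert chunks to a columnar format one at a time.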
20. Data Storage
Supports Data Format and SerDes
- CSV, TSV, and Custom Delimiter Files
- JSON
- RegEx SerDes for Unstructured Data
- Parquet SerDes (A Columnar Storage Format)
- CarbonData SerDes (A Columnar Storage Format)
Supports Compression Methods
- SNAPPY
- ZLIB
- GZIP
- LZO
Supports Plural Data Formats And SerDes
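A RegEx SerDes turns unstructured lines into named columns by matching a pattern. The sketch below shows the idea with Python's re module and GZIP compression from the standard library. The log line layout, LINE_PATTERN, and the JSON-lines payload are hypothetical choices for illustration, not Astra's on-disk format.

```python
import gzip
import json
import re

# Hypothetical pattern for an access-log-like line:
#   host [time] "request" status
LINE_PATTERN = re.compile(
    r'(?P<host>\S+) \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)


def deserialize(line):
    """Parse one unstructured line into a row dict, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    return row


def serialize_rows(rows):
    """Encode rows as JSON lines and compress them with GZIP."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in rows)
    return gzip.compress(payload.encode("utf-8"))


def deserialize_rows(blob):
    """Inverse of serialize_rows."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]
```

Lines that do not match return None, so a loader can count or reject them during data validation.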
21. An Example of METADATA as JSON
Table Schema
CSV Format
Parquet Format
Data Type Inference
Stores Each File Into Astra Data Store, LeoFS
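The metadata example on this slide did not survive extraction, so the following is only a sketch of how data type inference over a CSV file could produce table metadata as JSON. The field names (table, format, columns), the type names (bigint, double, string), and the functions infer_type and build_metadata are hypothetical. The sketch assumes every row has as many fields as the header.

```python
import csv
import io
import json


def _parses(cast, values):
    """Return True if cast() accepts every value."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_type(values):
    """Pick the narrowest type that fits all non-empty values of a column."""
    present = [v for v in values if v != ""]
    if not present:
        return "string"
    if _parses(int, present):
        return "bigint"
    if _parses(float, present):
        return "double"
    return "string"


def build_metadata(table, csv_text):
    """Infer a schema from the CSV header and body; return it as a dict."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    columns = [
        {"name": name, "type": infer_type([r[i] for r in body])}
        for i, name in enumerate(header)
    ]
    return {"table": table, "format": "csv", "columns": columns}


sample = "age,hours,workclass\n39,40.5,State-gov\n50,13,Private\n"
print(json.dumps(build_metadata("adult_income", sample), indent=2))
```

Python's int() and float() are more permissive than most SQL parsers (for example, they accept surrounding whitespace), so a production loader would need stricter checks.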
22. Machine Learning on Astra - Modeling
[Request A Modeling]
1. Request A Modeling To An Initiator Of AstraBase
[Create A Model, Then Store It]
2. Generate Tasks From A Job On A Coordinator
3. Request A Task To Workers
4-1. Execute Function(s) In Parallel On Each Worker
4-2. Load Data From Data Store If Not Exists On Cache
5. Summarize The Result On A Coordinator, Then Store The Model Into The Cluster To Reuse
[Figure: flow diagram. Components: AstraSQL, AstraBase Coordinator(s), AstraBase Workers, Resource Monitor + Scheduler, Astra DataStore (LeoFS). Interfaces: O/JDBC, gRPC, S3-API]
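The modeling flow above (split a job into tasks, run them in parallel on workers, summarize on a coordinator) fits models whose training reduces to sums over the data. As a minimal sketch, simple linear regression is used here as a stand-in model; the slide does not say which algorithms Astra distributes this way. The names partial_stats and fit_line are hypothetical, and threads stand in for remote workers.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(points):
    """Worker task: sufficient statistics for one partition of (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return n, sx, sy, sxx, sxy


def fit_line(partitions, max_workers=4):
    """Coordinator: one task per partition, then combine partials into a model."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_stats, partitions))
    # Element-wise sum of the per-partition statistics.
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*partials))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x values are constant; slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return {"slope": slope, "intercept": intercept}
```

The returned dict is small, so it can be stored and reused for prediction without touching the training data again.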
27. Future Plans
By Oct/E, 2017: Alpha
Nov, 2017 - June/E, 2018: 1st Beta, 2nd Beta
Q3 2018: Publish It
- Alpha
- Un/Semi-Structured Data and Parquet SerDes Support
- BI Tools and Visualization Tools Integration
- 1st Beta, Step-Growth Phase
- Record Set Cache
- Distributed Computing For UDF and ML
- Other SerDes Support