Stratio Crossdata is a distributed data platform that allows for both batch and streaming queries across multiple data stores. It uses Spark to enable operations not natively supported and provides connectors to integrate different data sources. The platform aims to simplify deployment, administration and querying for clients through its metadata management and support for features like full text search, joins and streaming queries.
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities
1. Stratio Meta
An efficient distributed datahub with batch and
streaming query capabilities
Daniel Higuero
Alvaro Agea
dhiguero@stratio.com
alvaro@stratio.com
#CassandraSummit 2014
2. Stratio Crossdata
An efficient distributed datahub with batch and
streaming query capabilities
3. Who are we?
STRATIO
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 50+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
7. What our clients demand?
o Easy deployment
o Easy administration
o Read/write performance
o Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
8. What our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
12. Connecting to the outside world
o Crossdata defines an IConnector extension interface
o Users can easily add new connectors to support
  • Different datastores
  • Different processing engines
  • Different versions
o Each connector defines its capabilities
Our planner will choose the best connector for each query
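The connector model above can be illustrated with a minimal Python sketch. This is not Crossdata's actual IConnector API, which the slides do not show: the `Connector` class, the capability names, and the `cost` field are all hypothetical, and picking the cheapest capable connector is only one assumed reading of "best".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical stand-in for a connector and the capabilities it declares."""
    name: str
    capabilities: frozenset
    cost: int  # assumed relative cost; lower is preferred


def choose_connector(connectors, required):
    """Return the cheapest connector whose capabilities cover `required`."""
    candidates = [c for c in connectors if required <= c.capabilities]
    if not candidates:
        raise LookupError(f"no connector supports {sorted(required)}")
    return min(candidates, key=lambda c: c.cost)


# Illustrative connectors only; names and capability sets are made up.
connectors = [
    Connector("native-cassandra", frozenset({"select", "filter_pk"}), cost=1),
    Connector("spark-cassandra",
              frozenset({"select", "filter_pk", "join", "group_by"}), cost=10),
]

simple = choose_connector(connectors, frozenset({"select", "filter_pk"}))
complex_ = choose_connector(connectors, frozenset({"select", "join"}))
print(simple.name, complex_.name)
```

In this sketch a simple lookup goes to the native connector, while a query that needs a join falls through to the Spark-backed one, since it is the only connector declaring that capability.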
13. Query execution
Parsing → Validation → Planning → Execution
Connectors: C*, Connector1, Connector2, Connector3
Our planner will choose the best connector for each query
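The four stages named on this slide can be sketched as a chain of small Python functions. Everything here is a toy under stated assumptions: the one-statement grammar, the dictionary-based catalog, and the function names are hypothetical and do not reflect Crossdata's real parser or planner.

```python
def parse(text):
    """Parse a toy 'SELECT cols FROM table' statement into a dict."""
    tokens = text.strip().rstrip(";").split()
    if len(tokens) != 4 or tokens[0].upper() != "SELECT" or tokens[2].upper() != "FROM":
        raise ValueError(f"cannot parse: {text!r}")
    return {"columns": tokens[1].split(","), "table": tokens[3]}


def validate(query, catalog):
    """Check that the table and its columns exist in the catalog metadata."""
    if query["table"] not in catalog:
        raise ValueError(f"unknown table {query['table']}")
    known = catalog[query["table"]]["columns"]
    unknown = [c for c in query["columns"] if c != "*" and c not in known]
    if unknown:
        raise ValueError(f"unknown columns {unknown}")
    return query


def plan(query, catalog):
    """Attach the connector that should run the query (here a fixed lookup)."""
    return {"connector": catalog[query["table"]]["connector"], "query": query}


def execute(planned, connectors):
    """Hand the planned query to the chosen connector."""
    return connectors[planned["connector"]](planned["query"])


def run(text, catalog, connectors):
    """Parsing -> Validation -> Planning -> Execution."""
    return execute(plan(validate(parse(text), catalog), catalog), connectors)


# Hypothetical catalog and a fake connector that only describes what it would do.
catalog = {"demo.users": {"columns": ["name", "age"], "connector": "C*"}}
connectors = {"C*": lambda q: f"C* runs {q['columns']} on {q['table']}"}
result = run("SELECT name,age FROM demo.users;", catalog, connectors)
print(result)
```

Keeping validation separate from planning means a query that references an unknown column is rejected before any connector is contacted.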
14. Multi-cluster support
o Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
  • Multiple clusters can coexist to optimize platform performance
    - E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
  • A table is saved in a unique datastore
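One way to picture a single catalog spanning several clusters is a mapping from table name to exactly one cluster. The `Catalog` class and the cluster names below are hypothetical; this is a sketch of the "a table is saved in a unique datastore" rule, not Crossdata's metadata store.

```python
class Catalog:
    """Hypothetical single catalog mapping each table to exactly one cluster."""

    def __init__(self):
        self._table_to_cluster = {}

    def register(self, table, cluster):
        """Record where a table lives; refuse to place it in a second cluster."""
        existing = self._table_to_cluster.get(table)
        if existing is not None and existing != cluster:
            raise ValueError(f"{table} already lives in {existing}")
        self._table_to_cluster[table] = cluster

    def cluster_for(self, table):
        """Return the cluster that a query on `table` must be routed to."""
        return self._table_to_cluster[table]


catalog = Catalog()
catalog.register("demo.users", "read-optimized")
catalog.register("demo.events", "write-optimized")
print(catalog.cluster_for("demo.users"))
```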
22. Streaming queries: windows syntax
SELECT fieldGroup,avg(Field2)
FROM eph_table
WITH WINDOW 5 minutes
WHERE field1=100 AND field2>100
GROUP BY fieldGroup;
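To make the semantics of the windowed query concrete, here is a minimal Python sketch of what it computes. It assumes tumbling (non-overlapping) 5-minute windows aligned to the epoch, which the slide does not specify, and the event layout and the `windowed_avg` helper are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # WITH WINDOW 5 minutes


def windowed_avg(events):
    """Average field2 per (window, fieldGroup) for rows passing the WHERE clause.

    `events` is an iterable of (timestamp_seconds, row_dict) pairs.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (window_start, group) -> [sum, count]
    for ts, row in events:
        # WHERE field1 = 100 AND field2 > 100
        if row["field1"] == 100 and row["field2"] > 100:
            window_start = ts - ts % WINDOW_SECONDS
            acc = sums[(window_start, row["fieldGroup"])]
            acc[0] += row["field2"]
            acc[1] += 1
    # GROUP BY fieldGroup, evaluated once per window
    return {key: total / count for key, (total, count) in sums.items()}


events = [
    (10, {"fieldGroup": "a", "field1": 100, "field2": 150}),
    (20, {"fieldGroup": "a", "field1": 100, "field2": 250}),
    (30, {"fieldGroup": "b", "field1": 100, "field2": 90}),    # fails field2 > 100
    (310, {"fieldGroup": "a", "field1": 100, "field2": 300}),  # falls in the next window
]
result = windowed_avg(events)
print(result)
```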
23. Joining batch and streaming
SELECT * FROM demo.temporal
WITH WINDOW 10 secs
INNER JOIN demo.users
ON users.name = temporal.name;

The query is decomposed into three parts:
• Streaming: SELECT * FROM demo.temporal WITH WINDOW 10 secs
• Batch: SELECT * FROM demo.users
• Join: INNER JOIN ON users.name = temporal.name
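The join between a streaming window and a batch table can be sketched in Python as follows. This is an illustration under assumptions, not Crossdata's execution strategy: it treats one 10-second window as an in-memory list, uses a simple hash join on `name`, and the row layouts and the `join_window_with_table` helper are hypothetical.

```python
def join_window_with_table(window_rows, users):
    """Inner-join one window of streaming rows with a batch table on `name`."""
    # Build a hash index over the batch side once per window.
    users_by_name = {}
    for user in users:
        users_by_name.setdefault(user["name"], []).append(user)

    joined = []
    for event in window_rows:
        # Rows without a matching user are dropped (inner join).
        for user in users_by_name.get(event["name"], []):
            joined.append({"temporal": event, "users": user})
    return joined


users = [{"name": "ann", "age": 30}, {"name": "bob", "age": 41}]
window = [{"name": "ann", "value": 1}, {"name": "zed", "value": 2}]
result = join_window_with_table(window, users)
print(result)
```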
25. Full text search with
o Clients request the ability to perform full text searches
o We have developed an integration between Lucene and Cassandra
o C* users can now enjoy all Lucene features:
  • Full text searches, range queries, fuzzy queries…
https://github.com/Stratio/stratio-cassandra
29. Why Spark?
o Stratio Crossdata uses Spark to perform operations that are not natively supported
o Spark brings several benefits over Hadoop
o In-memory processing
o RDD abstraction
o Simpler API
o Increased flexibility (e.g., no need for identity mapping)
30. What about Spark SQL?
o Different approach to query execution
  • We only use Spark when it speeds up queries
    - Native drivers are faster for simple queries
    - Spark SQL has limited RDD sources
  • Avoid some Spark limitations
  • Several batch and streaming contexts in a single JVM (SPARK-2243)
36. Stratio Crossdata ODBC
o Well-known interface standard (for BI tools, external apps, …)
o We have implemented it for Crossdata using the Simba SDK
o ODBC opens the full potential of Stratio Crossdata to the external world
o Currently tested with Tableau, QlikView and MS Excel
One ODBC for all datastores!
38. The future
o Security
o Query optimizer and smart query planner
o Leverage system statistics
o Support for UDFs
o Become an Apache project
https://github.com/Stratio/stratio-meta
39. We are looking for an Apache Champion
Can you help us?
40. A wish list for Cassandra
o Ability to stop running queries
o Interactive users are unpredictable
o Some exception paths are not clear or defined (e.g., secondary indexes)
o Distribute some of the operations currently performed on the coordinator
  • E.g., aggregations like count(*)