SlideShare a Scribd company logo
1 of 59
Download to read offline
Big data ° Real time
The open big data serving engine; store, search,
rank and organize big data at user serving time.
By Vespa architect @jonbratseth
This deck
What’s big data serving?
Vespa - the big data serving engine
Vespa architecture and capabilities
Using Vespa
Big data maturity levels
Latent Data is produced but not systematically leveraged
Example Logging: Movie streaming events are logged.
Analysis Data is used to inform decisions made by humans
Example Analytics: Lists of popular movies are compiled to create curated recommendations for user segments.
Learning Data is used to make decisions offline
Example Machine learning: Lists of movie recommendations per user segment are automatically generated.
Acting Automated data-driven decisions online
Example Big data serving: Personalized movie recommendations are computed when needed by that user.
Big data serving
Selection, organization and machine-learned model inference
● Over many, constantly changing data items (thousands to billions)
● With low latency (~100 ms) and high load (thousands of queries/second)
In short: AI + big data + online
Why use big data?
Big data serving: AI + big data + online
Necessary in use cases like search, recommendation and many others, but
being able to consider relevant data always improves decision making
Intuition AI: Data compressed into a function (regression, ANN etc.)
Deliberate reasoning: Look up relevant data to make informed decisions
Just like humans, having “system 1” and “system 2”, AI need both
Why make decisions online?
Decisions use up to date information
Decisions are made now, and see the current state of the world
Big data serving: AI + big data + online
No wasted computation
Only those decisions that are needed will be made
Fine-grained decisions
A separate computation is made for each specific case
Architecturally simple
Just write data, and send the queries, in real time to the big data serving component
Big data serving: What is required?
Real-time actions: Find data and make inferences in tens of milliseconds.
Realtime knowledge: Handle data updates at a high continuous rate.
Scalable: Handle high requests rates over big data sets.
Always available: Recover from hardware failures without human intervention.
Online evolvable: Change schemas, logic, models, hardware while online.
Integrated: Data feeds from Hadoop, learned models from TensorFlow etc.
Mutable state x distributed computing x low latency x high availability
Making big data serving universally available
Open source, available on https://vespa.ai (Apache 2.0 license)
Provenance:
● Web search: The canonical big data serving use case
● Yahoo search made Hadoop and Vespa to solve it
● Core idea of both: Move computation to data
Introducing ...
Example usage: Vespa at Verizon Media
TechCrunch, Huffington Post, Aol, Engadget, Gemini, Yahoo News, Yahoo Sports, Yahoo Finance,
Yahoo Mail, etc.
Hundreds of Vespa applications,
… serving over a billion users
… over 450,000 queries per second
… over billions of content items
… including
● Selecting and serving the
personalized content of all
landing pages and apps
● Selecting and serving the
personalized ads on the
world’s 3rd largest ad network
Big data serving use case: Search
Data items: Text documents
Query: Keywords
Model(s) evaluated: Relevance
Selected items: By relevance
Vespa: Full text indexes, GBDT models, text match relevance features,
snippeting, linguistics, 2-phase ranking, text processing, ...
“Search 2.0”: Convert text to tensors and use vector similarity and neural nets
Vespa: Native tensor data model and computation engine
Approximate nearest neighbour vector search in combination with text
Support for fast, distributed evaluation of transformer models
Big data serving use case: Recommendation
Data items: Anything that can be recommended to somebody
Query: Filters + user/context model
Model(s) evaluated: Recommendation
Selected items: By recommendation score
Vespa: Native tensor data model and computation engine
Built-in support for models in TensorFlow, Onnx and XGBoost
Fast vector similarity search (parallel WAND)
Fast vector similarity brute force computation
Approximate nearest neighbour search with filters
Big data serving use case: Finance prediction
Data items: Assets (e.g stock)
Query: World state update
Model(s) evaluated: Price predictor
Selected items: By largest price change
Result:
● Find the assets changing most in response to an event
● … using completely up-to-date information
● … faster than anybody else
Big data serving use case: Question answering
https://blog.vespa.ai/efficient-open-domain-question-answering-on-vespa/
https://blog.vespa.ai/from-research-to-production-scaling-a-state-of-the-art-machine-learning-system/
How is big data serving different from analytics
Analytics (e.g ElasticSearch) Big data serving (Vespa)
Response time in low seconds Response time in low milliseconds
Low query rate High query rate
Time series, append only Random writes
Down time, data loss acceptable HA, no data loss, online redistribution
Massive data sets (trillions of docs) are cheap Massive data sets are more expensive
Analytics GUI integration Machine learning integration
VS
Where are we?
What’s big data serving?
Vespa - the big data serving engine
Vespa architecture and capabilities
Using Vespa
Vespa is
A platform for low latency computations over large, evolving data sets
• Search and selection over structured and unstructured data
• Scoring/relevance/inference: NL features, advanced ML models, TensorFlow, Onnx etc.
• Query time organization and aggregation of matching data
• Real-time writes at a high sustained rate
• Live elastic and auto-recovering stateful content clusters
• Processing logic container (Java)
• Managed clusters: One to hundreds of nodes
Typical use cases: text search, personalization / recommendation / targeting, real-time data display, ++
Vespa architecture
Container node
Query
Application
Package
Admin &
Config
Content node
Deploy
- Configuration
- Components
- ML models
Scatter-gather
Core
sharding
models models models
1) Parallelization
2) Prepare data structures at write time and in the background: Posting-lists, B-trees, HNSW
3) Move execution to data nodes
Scalable low latency execution:
How to bound latency in three easy steps
Evaluating ML models on data nodes avoids scaling bottlenecks
Latency: 100ms @ 95%
Throughput: 500 qps
10Gbps network
Takeaway:
Without distributing computation
to data you run out of
datacenter bandwidth
surprisingly quickly
Query execution and data storage
● Document-at-a-time evaluation over all query operators
● index string fields:
○ positional text indexes (dictionaries + posting lists), and
○ B-trees in memory containing recent changes
● attribute fields:
○ In-memory forward dense data, optionally with B-trees
○ For search, grouping and ranking
● index vector (1d-dense tensor) fields: Persistent, real-time HNSW indexes
● Transaction log for persistence+replay
● Separate store of raw data for serving+recovery+redistribution
● One instance of all of this per doc schema
Approximate nearest neighbor vector search
Billions of vectors, of thousands of numbers, in milliseconds
Efficient when combined with text and filters
Vectors can be updated in real time, thousands of writes/second per node
2-3 times faster than ElasticSearch+FASS (benchmark)
https://github.com/vespa-engine/vespa/pull/15552/files#diff-4c3722c7699f675ceebf9
4e0d0f3e04af571dd165b9e3a5046f57a4f23ce4ec9
Achieved by Vespa embedding its own modified HNSW implementation in C++
Data distribution
Vespa auto-distributes data over
● A set of nodes
● With a certain replication factor
● Optionally: In multiple node groups
● Optionally: With locality (e.g personal search)
Changes to nodes/configuration -> Automatic online data redistribution
No need to manually partition data or manage partition placement
Distribution based on CRUSH algorithm: Minimal data movement without registry
Inference in Vespa
Tensor data model: Multidimensional collections of
numbers: In queries, documents, and models
Tensor math operations express all common
machine-learned models with join, map, reduce etc.
Tensor dimensions may be sparse (mapped) or dense
(indexed): tensor<float>(key{}, x[1000])
Math operations work the same over both.
Model learning integration: Deploy TensorFlow,
ONNX (SciKit, Caffe2, PyTorch etc.), XGBoost and
LightGBM models directly on Vespa
Vespa execution engine optimized for repeated
execution of models over many data items and running
many inferences in parallel
map(
join(
reduce(
join(
Placeholder,
Weights_1,
f(x,y)(x * y)
),
sum,
d1
),
Weights_2,
f(x,y)(x + y)
),
f(x)(max(0,x))
)
Placeholder Weights_1
matmul Weights_2
add
relu
Releases
New production releases of Vespa are published Monday to Thursday every week
All development is in the open: https://github.com/vespa-engine/vespa
Releases:
● Have passed our suite of ~1100 functional tests and ~75 performance tests
● Are already running the ~150 production applications in our cloud service
Releases are backwards compatible, unless it’s a major version change (bi-yearly)
-> Upgrades can happen live node by node
Big Data Serving and Vespa intro summary
Making the best use of big data often means making decisions in real time
Vespa is the only open source platform optimized for such big data serving
Available on https://vespa.ai
Quick start: Run a complete application (on a laptop or AWS) in 10 minutes
http://docs.vespa.ai/documentation/vespa-quick-start.html
Tutorial: Make a scalable blog search and recommendation engine from scratch
http://docs.vespa.ai/documentation/tutorials/blog-search.html
Where are we?
What’s big data serving?
Vespa - the big data serving engine
Vespa architecture and capabilities
Using Vespa
Installing Vespa
Rpm packages or Docker images
All nodes have the same packages/image
CentOS (On Mac and Win inside Docker or VirtualBox)
1 config variable:
https://docs.vespa.ai/documentation/vespa-quick-start.html
https://docs.vespa.ai/documentation/vespa-quick-start-centos.html
https://docs.vespa.ai/documentation/vespa-quick-start-multinode-aws.html
Configuring Vespa: Application packages
Manifest-based configuration
All of the application: system config, schemas, jars, ML models
deployed to Vespa:
○ vespa-deploy prepare [application-package-path]
○ vespa-deploy activate
Deploying again carries out changes made
Most changes happen live (including Java code changes)
If actions needed: List of actions needed are returned by deploy prepare
https://docs.vespa.ai/documentation/cloudconfig/application-packages.html
A complete application package, 1: Services/clusters
./services.xml ./hosts.xml
A complete application
package, 2: Schema(s)
./searchdefinitions/music.sd:
https://docs.vespa.ai/documentation/search-definitions.html
Calling Vespa: HTTP(S) interfaces
POST docs/individual fields:
to http://host1.domain.name:8080/document/v1/music/music/docid/1
(or use the Vespa Java HTTP client for high throughput)
GET single doc: http://host1.domain.name:8080/document/v1/music/music/docid/1
GET query result: http://host1.domain.name:8080/search/?query=track:place
{
"fields": {
"artist": "War on Drugs",
"album": "A Deeper Understanding",
"track": "Thinking of a Place",
"popularity": 0.97
}
}
Operations in production
No single point of failure
Automatic failover + data recovery -> no time-critical ops needed
Log collection to config server
Metrics integration
● Prometheus integration in https://github.com/vespa-engine/vespa_exporter
● Or, access metrics from a web service on each node
Matching
Matching finds all the documents matching a query
Query = Tree of operators:
● TERM, AND, OR, PHRASE, NEAR, RANK, WeightedSet, …
● NearestNeighbor, RANGE, WAND
Goal of matching: a) Selecting a subset of data, b) Skipping for performance
Queries are evaluated in parallel:
over all clusters, document types, partitions, and N cores
Queries are passed in HTTP requests (YQL), or constructed in Searchers
Execution
Low latency computation over large data sets
… by parallelization over nodes and cores
... pushing execution to the data
... and preparing data structures at write time
Container
Execution middleware
Query
Content partition
Matching+1st ranking
Grouping & aggregation
2nd phase ranking
Content fetch + snippeting
...
Ranking/inference
It’s just math
Ranking expressions: Compute a score from features
a + b * log(c) - if( e > f, g, h)
● Feature values are scalars or tensors
● Constant features (in application package - model parameters)
● Document features
● Query features
● Match features: Computed from doc+query data at matching time
First-phase ranking: Computed during matching, on each match
Second-phase ranking: Optional re-ranking of top n on each partition
https://docs.vespa.ai/documentation/ranking.html
Match feature examples
● bm25, or nativeRank feature: Pretty good text ranking out of the box
● Text ranking: fieldMatch feature set
○ Positional info
○ Text segmentation
● Multivalue text field signal aggregation:
○ elementCompleteness
○ elementSimilarity
● Geo distance
○ closeness
○ distance
○ distanceToPath
● Time ranking:
○ freshness
○ age
https://docs.vespa.ai/documentation/reference/rank-features.html
fieldMatch text ranking feature set
Accurate proximity based text matching features
Highest on the quality-cost tradeoff curve: Usually for second-phase ranking
fieldMatch feature: Aggregate text relevance score
Fine-grained fieldMatch sub-features: Useful for ML ranking
Machine learned scoring
Example: Text search
● Supervised machine-learned ranking of matches to a user query
Example: Recommendation/personalization
● Query is a user+context in some vector/tensor space
● Document belongs to same space
● Evaluate machine-learned model on all documents
○ ...ideally - optimizations to reduce cost: 2nd phase, WAND, match-phase, clustering, …
● Reinforcement learning
“Search 2.0”
Gradient boosted decision trees
● Commonly used for supervised learning of text search ranking
● Defer most “Natural language intelligence” to ranking instead of matching ->
better result at higher cpu cost … but modern hardware has sufficient power
● Ranking function: Sum of decision trees
● A few hundreds/thousand trees
● Written as a sum of nested if expressions on scalars
● Vespa can read XGBoost models
● Special optimizations for GBDT-shaped ranking expressions
● Training: Issue queries which requests ranking features in the response
… however
Tensors
A data type in ranking expressions (in addition to scalars)
Makes it possible to deploy large and complex ML models to Vespa
● Deep neural nets
● FTRL (regression models with millions of parameters)
● Word2vec models
● etc.
https://docs.vespa.ai/documentation/tensor-intro.html
What is a tensor?
Tensor: A multidimensional array which can be used for computation
Textual form: { {address}:double, .. } where address is {identifier:value},...
Examples
● 0-dimensional: A scalar {{}:0.1}
● 1-dimensional: A vector {{x:0}:0.1, {x:1}:0.2}
● 2-dimensional: A matrix {{x:0,y:0}:0.1, {x:0,y:1}:0.2}
Indexed tensor dimensions: Values addressed by numbers, continuous from 0
Mapped tensor dimensions: Values addressed by identifiers, sparse
Tensor sources
Tensors may be added to documents
field my_tensor type tensor(x{},y[10]) { ... }
… queries
query.getRanking().getFeatures()
.put("my_tensor_feature", Tensor.from("{{x:foo,y:0}:1.3}"));
… and application packages
constant tensor_constant {
file: constants/constant_tensor_file.json.lz4
type: tensor(x{})
}
… or be created on the fly from other doc fields
From document weighted sets
tensorFromWeightedSet(source, dimension)
From document vectors
tensorFromLabels(source, dimension)
From single attributes
concat(attribute(attr1), attribute(attr2), dimension)
Tensor computation
A few primitive operations
map(tensor, f(x)(expr))
reduce(tensor, aggregator, dim1, dim2, ...)
join(tensor1, tensor2, f(x,y)(expr))
tensor(tensor-type-spec)(expr)
rename(tensor, from-dims, to-dims)
concat(tensor1, tensor2, dim)
… composes into many high-level operations
https://docs.vespa.ai/documentation/reference/tensor.html
The tensor join operator
Naming is awesome, or computer science strikes again!
Generalization of other tensor products:
Hadamard, tensor product, inner, outer matrix product
Like the regular tensor product, it is associative:
a * (b * c) = (a * b) * c
Unlike the tensor product, it is also commutative:
a * b = b * a
Use case: FTRL
sum( // model computation:
tensor0 * tensor1 * tensor2 // feature combinations
* tensor3 // model weights application
)
Where tensors 0, 1, 2 come from the document or query:
and tensor 3 comes from the application package:
Use case: Neural net
rank-profile nn_tensor {
function nn_input() {
expression: concat(attribute(user_item_cf), query(user_item_cf), input)
}
function hidden_layer() {
expression: relu(sum(nn_input * constant(W_hidden), input) + constant(b_hidden))
}
function final_layer() {
expression: sigmoid(sum(hidden_layer * constant(W_final), hidden) + constant(b_final))
}
first-phase {
expression: sum(final_layer)
}
}
TensorFlow, ONNX and XGBoost integration
1) Save models directly to
<application package>/models/
2) Reference model outputs in ranking
expressions:
Faster than native TensorFlow evaluation
More scalable as evaluation happens at
content partitions
https://docs.vespa.ai/documentation/tensorflow.html
https://docs.vespa.ai/documentation/onnx.html
https://docs.vespa.ai/documentation/xgboost.html
map(
join(
reduce(
join(
Placeholder,
Weights_1,
f(x,y)(x * y)
),
sum,
d1
),
Weights_2,
f(x,y)(x + y)
),
f(x)(max(0,x))
)
Placeholder Weights_1
matmul Weights_2
add
relu
Grouping and aggregation
Organizing data at request time
…
For navigational views, visualization, grouping, diversity etc.
Evaluated over all matches
… distributed over all partitions
Any number of levels and parallel groupings (may become expensive)
https://docs.vespa.ai/documentation/grouping.html
Grouping operations
all: Perform an operation on a list
each: Perform an operation on each item in a list
group: Create a new list level
max: Limit the number of elements in a list
order: Order a list
output: Add some data to the output produced by the current list/element
Grouping aggregators and expressions
Aggregators: count, sum, avg, max, min, xor, stddev, summary
(summary: Output data from a document)
Expressions:
● Standard math
● Static and dynamic bucketing
● Time
● Geo (zcurve)
● Access attributes + relevance score of documents
Grouping examples
Group hits and output the count in each group :
Group hits and output the best in each group:
Group into fixed buckets, then on attribute “a”, and count hits in leafs:
Group into today, yesterday, last week and month, group each into separate days:
https://docs.vespa.ai/documentation/reference/grouping-syntax.html
Container for Java components
● Query and result processing, federation, etc.: Searchers
● Document processors
● General request handlers
● Any Java component (no Vespa interface/class needed)
● Dependency injection, component config
● Hotswap of code, without disrupting traffic
● Query profiles
● HTTP serving through embedding Jetty
https://docs.vespa.ai/documentation/jdisc/
Summary
Making the best use of big data often implies making decisions in real time
Vespa is the only open source platform optimized for such big data serving
Available on https://vespa.ai
Quick start: Run a complete application (on a laptop or AWS) in 10 minutes
http://docs.vespa.ai/documentation/vespa-quick-start.html
Tutorial: Make a scalable blog search and recommendation engine from scratch
http://docs.vespa.ai/documentation/tutorials/blog-search.html
Questions?
By Vespa architect @jonbratseth

More Related Content

What's hot

Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkItai Yaffe
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...Dataconomy Media
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Robbie Strickland
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoTaro L. Saito
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in SparkDatabricks
 
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ..."Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...Dataconomy Media
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoTJim Haughwout
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesDatabricks
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...Databricks
 
Using Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureUsing Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureOliver Buckley-Salmon
 
Introduction to TitanDB
Introduction to TitanDB Introduction to TitanDB
Introduction to TitanDB Knoldus Inc.
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...Databricks
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
 

What's hot (20)

Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. Tokyo
 
Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
 
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ..."Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
 
Using Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureUsing Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architecture
 
Introduction to TitanDB
Introduction to TitanDB Introduction to TitanDB
Introduction to TitanDB
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
 
Lambda architecture
Lambda architectureLambda architecture
Lambda architecture
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 

Similar to Big data serving: Processing and inference at scale in real time

Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Riccardo Zamana
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsMichael Häusler
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series DatabasePramit Choudhary
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
Designing Artificial Intelligence
Designing Artificial IntelligenceDesigning Artificial Intelligence
Designing Artificial IntelligenceDavid Chou
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Dayprogrammermag
 
Big data on AWS
Big data on AWSBig data on AWS
Big data on AWSStylight
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesSigmoid
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist SoftServe
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysDemi Ben-Ari
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesDenodo
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...Big Data Value Association
 

Similar to Big data serving: Processing and inference at scale in real time (20)

Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
Designing Artificial Intelligence
Designing Artificial IntelligenceDesigning Artificial Intelligence
Designing Artificial Intelligence
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
 
Big data on AWS
Big data on AWSBig data on AWS
Big data on AWS
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
 

More from Itai Yaffe

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingItai Yaffe
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationItai Yaffe
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsItai Yaffe
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Itai Yaffe
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Itai Yaffe
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsItai Yaffe
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your DataItai Yaffe
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesItai Yaffe
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Itai Yaffe
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidItai Yaffe
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Itai Yaffe
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for DruidItai Yaffe
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidItai Yaffe
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerItai Yaffe
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Itai Yaffe
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureItai Yaffe
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentItai Yaffe
 
Serverless data processing built for internet SCALE
Serverless data processing built for internet SCALEServerless data processing built for internet SCALE
Serverless data processing built for internet SCALEItai Yaffe
 
Ask me anything - Women in Big Data Israel
Ask me anything - Women in Big Data IsraelAsk me anything - Women in Big Data Israel
Ask me anything - Women in Big Data IsraelItai Yaffe
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...Itai Yaffe
 

More from Itai Yaffe (20)

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data Processing
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark Applications
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management Monoliths
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your Data
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening Notes
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for Druid
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and Druid
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own Docker
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructure
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless Environment
 
Serverless data processing built for internet SCALE
Serverless data processing built for internet SCALEServerless data processing built for internet SCALE
Serverless data processing built for internet SCALE
 
Ask me anything - Women in Big Data Israel
Ask me anything - Women in Big Data IsraelAsk me anything - Women in Big Data Israel
Ask me anything - Women in Big Data Israel
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
 

Recently uploaded

YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 

Recently uploaded (17)

YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 

Big data serving: Processing and inference at scale in real time

  • 1. Big data ° Real time The open big data serving engine; store, search, rank and organize big data at user serving time. By Vespa architect @jonbratseth
  • 2. This deck What’s big data serving? Vespa - the big data serving engine Vespa architecture and capabilities Using Vespa
  • 3. Big data maturity levels Latent Data is produced but not systematically leveraged Example Logging: Movie streaming events are logged. Analysis Data is used to inform decisions made by humans Example Analytics: Lists of popular movies are compiled to create curated recommendations for user segments. Learning Data is used to make decisions offline Example Machine learning: Lists of movie recommendations per user segment are automatically generated. Acting Automated data-driven decisions online Example Big data serving: Personalized movie recommendations are computed when needed by that user.
  • 4. Big data serving Selection, organization and machine-learned model inference ● Over many, constantly changing data items (thousands to billions) ● With low latency (~100 ms) and high load (thousands of queries/second) In short: AI + big data + online
  • 5. Why use big data? Big data serving: AI + big data + online Necessary in use cases like search, recommendation and many others, but being able to consider relevant data always improves decision making Intuition AI: Data compressed into a function (regression, ANN etc.) Deliberate reasoning: Look up relevant data to make informed decisions Just like humans, having “system 1” and “system 2”, AI need both
  • 6. Why make decisions online? Decisions use up to date information Decisions are made now, and see the current state of the world Big data serving: AI + big data + online No wasted computation Only those decisions that are needed will be made Fine-grained decisions A separate computation is made for each specific case Architecturally simple Just write data, and send the queries, in real time to the big data serving component
  • 7. Big data serving: What is required? Real-time actions: Find data and make inferences in tens of milliseconds. Realtime knowledge: Handle data updates at a high continuous rate. Scalable: Handle high requests rates over big data sets. Always available: Recover from hardware failures without human intervention. Online evolvable: Change schemas, logic, models, hardware while online. Integrated: Data feeds from Hadoop, learned models from TensorFlow etc. Mutable state x distributed computing x low latency x high availability
  • 8. Making big data serving universally available Open source, available on https://vespa.ai (Apache 2.0 license) Provenance: ● Web search: The canonical big data serving use case ● Yahoo search made Hadoop and Vespa to solve it ● Core idea of both: Move computation to data Introducing ...
  • 9. Example usage: Vespa at Verizon Media TechCrunch, Huffington Post, Aol, Engadget, Gemini, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Mail, etc. Hundreds of Vespa applications, … serving over a billion users … over 450,000 queries per second … over billions of content items … including ● Selecting and serving the personalized content of all landing pages and apps ● Selecting and serving the personalized ads on the world’s 3rd largest ad network
  • 10. Big data serving use case: Search Data items: Text documents Query: Keywords Model(s) evaluated: Relevance Selected items: By relevance Vespa: Full text indexes, GBDT models, text match relevance features, snippeting, linguistics, 2-phase ranking, text processing, ... “Search 2.0”: Convert text to tensors and use vector similarity and neural nets Vespa: Native tensor data model and computation engine Approximate nearest neighbour vector search in combination with text Support for fast, distributed evaluation of transformer models
  • 11. Big data serving use case: Recommendation Data items: Anything that can be recommended to somebody Query: Filters + user/context model Model(s) evaluated: Recommendation Selected items: By recommendation score Vespa: Native tensor data model and computation engine Built-in support for models in TensorFlow, Onnx and XGBoost Fast vector similarity search (parallel WAND) Fast vector similarity brute force computation Approximate nearest neighbour search with filters
  • 12. Big data serving use case: Finance prediction Data items: Assets (e.g stock) Query: World state update Model(s) evaluated: Price predictor Selected items: By largest price change Result: ● Find the assets changing most in response to an event ● … using completely up-to-date information ● … faster than anybody else
  • 13. Big data serving use case: Question answering https://blog.vespa.ai/efficient-open-domain-question-answering-on-vespa/ https://blog.vespa.ai/from-research-to-production-scaling-a-state-of-the-art-machine-learning-system/
  • 14. How is big data serving different from analytics Analytics (e.g ElasticSearch) Big data serving (Vespa) Response time in low seconds Response time in low milliseconds Low query rate High query rate Time series, append only Random writes Down time, data loss acceptable HA, no data loss, online redistribution Massive data sets (trillions of docs) are cheap Massive data sets are more expensive Analytics GUI integration Machine learning integration VS
  • 15. Where are we? What’s big data serving? Vespa - the big data serving engine Vespa architecture and capabilities Using Vespa
  • 16. Vespa is A platform for low latency computations over large, evolving data sets • Search and selection over structured and unstructured data • Scoring/relevance/inference: NL features, advanced ML models, TensorFlow, Onnx etc. • Query time organization and aggregation of matching data • Real-time writes at a high sustained rate • Live elastic and auto-recovering stateful content clusters • Processing logic container (Java) • Managed clusters: One to hundreds of nodes Typical use cases: text search, personalization / recommendation / targeting, real-time data display, ++
  • 18. Container node Query Application Package Admin & Config Content node Deploy - Configuration - Components - ML models Scatter-gather Core sharding models models models 1) Parallelization 2) Prepare data structures at write time and in the background: Posting-lists, B-trees, HNSW 3) Move execution to data nodes Scalable low latency execution: How to bound latency in three easy steps
  • 19. Evaluating ML models on data nodes avoids scaling bottlenecks Latency: 100ms @ 95% Throughput: 500 qps 10Gbps network Takeaway: Without distributing computation to data you run out of datacenter bandwidth surprisingly quickly
  • 20. Query execution and data storage ● Document-at-a-time evaluation over all query operators ● index string fields: ○ positional text indexes (dictionaries + posting lists), and ○ B-trees in memory containing recent changes ● attribute fields: ○ In-memory forward dense data, optionally with B-trees ○ For search, grouping and ranking ● index vector (1d-dense tensor) fields: Persistent, real-time HNSW indexes ● Transaction log for persistence+replay ● Separate store of raw data for serving+recovery+redistribution ● One instance of all of this per doc schema
  • 21. Approximate nearest neighbor vector search Billions of vectors, of thousands of numbers, in milliseconds Efficient when combined with text and filters Vectors can be updated in real time, thousands of writes/second per node 2-3 times faster than ElasticSearch+FASS (benchmark) https://github.com/vespa-engine/vespa/pull/15552/files#diff-4c3722c7699f675ceebf9 4e0d0f3e04af571dd165b9e3a5046f57a4f23ce4ec9 Achieved by Vespa embedding its own modified HNSW implementation in C++
  • 22. Data distribution Vespa auto-distributes data over ● A set of nodes ● With a certain replication factor ● Optionally: In multiple node groups ● Optionally: With locality (e.g personal search) Changes to nodes/configuration -> Automatic online data redistribution No need to manually partition data or manage partition placement Distribution based on CRUSH algorithm: Minimal data movement without registry
  • 23. Inference in Vespa Tensor data model: Multidimensional collections of numbers: In queries, documents, and models Tensor math operations express all common machine-learned models with join, map, reduce etc. Tensor dimensions may be sparse (mapped) or dense (indexed): tensor<float>(key{}, x[1000]) Math operations work the same over both. Model learning integration: Deploy TensorFlow, ONNX (SciKit, Caffe2, PyTorch etc.), XGBoost and LightGBM models directly on Vespa Vespa execution engine optimized for repeated execution of models over many data items and running many inferences in parallel
  • 24. map( join( reduce( join( Placeholder, Weights_1, f(x,y)(x * y) ), sum, d1 ), Weights_2, f(x,y)(x + y) ), f(x)(max(0,x)) ) Placeholder Weights_1 matmul Weights_2 add relu
  • 25. Releases New production releases of Vespa are published Monday to Thursday every week All development is in the open: https://github.com/vespa-engine/vespa Releases: ● Have passed our suite of ~1100 functional tests and ~75 performance tests ● Are already running the ~150 production applications in our cloud service Releases are backwards compatible, unless it’s a major version change (bi-yearly) -> Upgrades can happen live node by node
  • 26. Big Data Serving and Vespa intro summary Making the best use of big data often means making decisions in real time Vespa is the only open source platform optimized for such big data serving Available on https://vespa.ai Quick start: Run a complete application (on a laptop or AWS) in 10 minutes http://docs.vespa.ai/documentation/vespa-quick-start.html Tutorial: Make a scalable blog search and recommendation engine from scratch http://docs.vespa.ai/documentation/tutorials/blog-search.html
  • 27. Where are we? What’s big data serving? Vespa - the big data serving engine Vespa architecture and capabilities Using Vespa
  • 28. Installing Vespa Rpm packages or Docker images All nodes have the same packages/image CentOS (On Mac and Win inside Docker or VirtualBox) 1 config variable: https://docs.vespa.ai/documentation/vespa-quick-start.html https://docs.vespa.ai/documentation/vespa-quick-start-centos.html https://docs.vespa.ai/documentation/vespa-quick-start-multinode-aws.html
  • 29. Configuring Vespa: Application packages Manifest-based configuration All of the application: system config, schemas, jars, ML models deployed to Vespa: ○ vespa-deploy prepare [application-package-path] ○ vespa-deploy activate Deploying again carries out changes made Most changes happen live (including Java code changes) If actions needed: List of actions needed are returned by deploy prepare https://docs.vespa.ai/documentation/cloudconfig/application-packages.html
  • 30. A complete application package, 1: Services/clusters ./services.xml ./hosts.xml
  • 31. A complete application package, 2: Schema(s) ./searchdefinitions/music.sd: https://docs.vespa.ai/documentation/search-definitions.html
  • 32. Calling Vespa: HTTP(S) interfaces POST docs/individual fields: to http://host1.domain.name:8080/document/v1/music/music/docid/1 (or use the Vespa Java HTTP client for high throughput) GET single doc: http://host1.domain.name:8080/document/v1/music/music/docid/1 GET query result: http://host1.domain.name:8080/search/?query=track:place { "fields": { "artist": "War on Drugs", "album": "A Deeper Understanding", "track": "Thinking of a Place", "popularity": 0.97 } }
  • 33. Operations in production No single point of failure Automatic failover + data recovery -> no time-critical ops needed Log collection to config server Metrics integration ● Prometheus integration in https://github.com/vespa-engine/vespa_exporter ● Or, access metrics from a web service on each node
  • 34. Matching Matching finds all the documents matching a query Query = Tree of operators: ● TERM, AND, OR, PHRASE, NEAR, RANK, WeightedSet, … ● NearestNeighbor, RANGE, WAND Goal of matching: a) Selecting a subset of data, b) Skipping for performance Queries are evaluated in parallel: over all clusters, document types, partitions, and N cores Queries are passed in HTTP requests (YQL), or constructed in Searchers
  • 35. Execution Low latency computation over large data sets … by parallelization over nodes and cores ... pushing execution to the data ... and preparing data structures at write time Container Execution middleware Query Content partition Matching+1st ranking Grouping & aggregation 2nd phase ranking Content fetch + snippeting ...
  • 36. Ranking/inference It’s just math Ranking expressions: Compute a score from features a + b * log(c) - if( e > f, g, h) ● Feature values are scalars or tensors ● Constant features (in application package - model parameters) ● Document features ● Query features ● Match features: Computed from doc+query data at matching time First-phase ranking: Computed during matching, on each match Second-phase ranking: Optional re-ranking of top n on each partition https://docs.vespa.ai/documentation/ranking.html
  • 37. Match feature examples ● bm25, or nativeRank feature: Pretty good text ranking out of the box ● Text ranking: fieldMatch feature set ○ Positional info ○ Text segmentation ● Multivalue text field signal aggregation: ○ elementCompleteness ○ elementSimilarity ● Geo distance ○ closeness ○ distance ○ distanceToPath ● Time ranking: ○ freshness ○ age https://docs.vespa.ai/documentation/reference/rank-features.html
  • 38. fieldMatch text ranking feature set Accurate proximity based text matching features Highest on the quality-cost tradeoff curve: Usually for second-phase ranking fieldMatch feature: Aggregate text relevance score Fine-grained fieldMatch sub-features: Useful for ML ranking
  • 39. Machine learned scoring Example: Text search ● Supervised machine-learned ranking of matches to a user query Example: Recommendation/personalization ● Query is a user+context in some vector/tensor space ● Document belongs to same space ● Evaluate machine-learned model on all documents ○ ...ideally - optimizations to reduce cost: 2nd phase, WAND, match-phase, clustering, … ● Reinforcement learning
  • 41. Gradient boosted decision trees ● Commonly used for supervised learning of text search ranking ● Defer most “Natural language intelligence” to ranking instead of matching -> better result at higher cpu cost … but modern hardware has sufficient power ● Ranking function: Sum of decision trees ● A few hundreds/thousand trees ● Written as a sum of nested if expressions on scalars ● Vespa can read XGBoost models ● Special optimizations for GBDT-shaped ranking expressions ● Training: Issue queries which requests ranking features in the response
  • 43. Tensors A data type in ranking expressions (in addition to scalars) Makes it possible to deploy large and complex ML models to Vespa ● Deep neural nets ● FTRL (regression models with millions of parameters) ● Word2vec models ● etc. https://docs.vespa.ai/documentation/tensor-intro.html
  • 44. What is a tensor? Tensor: A multidimensional array which can be used for computation Textual form: { {address}:double, .. } where address is {identifier:value},... Examples ● 0-dimensional: A scalar {{}:0.1} ● 1-dimensional: A vector {{x:0}:0.1, {x:1}:0.2} ● 2-dimensional: A matrix {{x:0,y:0}:0.1, {x:0,y:1}:0.2} Indexed tensor dimensions: Values addressed by numbers, continuous from 0 Mapped tensor dimensions: Values addressed by identifiers, sparse
  • 45. Tensor sources Tensors may be added to documents field my_tensor type tensor(x{},y[10]) { ... } … queries query.getRanking().getFeatures() .put("my_tensor_feature", Tensor.from("{{x:foo,y:0}:1.3}")); … and application packages constant tensor_constant { file: constants/constant_tensor_file.json.lz4 type: tensor(x{}) }
  • 46. … or be created on the fly from other doc fields From document weighted sets tensorFromWeightedSet(source, dimension) From document vectors tensorFromLabels(source, dimension) From single attributes concat(attribute(attr1), attribute(attr2), dimension)
  • 47. Tensor computation A few primitive operations map(tensor, f(x)(expr)) reduce(tensor, aggregator, dim1, dim2, ...) join(tensor1, tensor2, f(x,y)(expr)) tensor(tensor-type-spec)(expr) rename(tensor, from-dims, to-dims) concat(tensor1, tensor2, dim) … composes into many high-level operations https://docs.vespa.ai/documentation/reference/tensor.html
  • 48. The tensor join operator Naming is awesome, or computer science strikes again! Generalization of other tensor products: Hadamard, tensor product, inner, outer matrix product Like the regular tensor product, it is associative: a * (b * c) = (a * b) * c Unlike the tensor product, it is also commutative: a * b = b * a
  • 49. Use case: FTRL sum( // model computation: tensor0 * tensor1 * tensor2 // feature combinations * tensor3 // model weights application ) Where tensors 0, 1, 2 come from the document or query: and tensor 3 comes from the application package:
  • 50. Use case: Neural net rank-profile nn_tensor { function nn_input() { expression: concat(attribute(user_item_cf), query(user_item_cf), input) } function hidden_layer() { expression: relu(sum(nn_input * constant(W_hidden), input) + constant(b_hidden)) } function final_layer() { expression: sigmoid(sum(hidden_layer * constant(W_final), hidden) + constant(b_final)) } first-phase { expression: sum(final_layer) } }
  • 51. TensorFlow, ONNX and XGBoost integration 1) Save models directly to <application package>/models/ 2) Reference model outputs in ranking expressions: Faster than native TensorFlow evaluation More scalable as evaluation happens at content partitions https://docs.vespa.ai/documentation/tensorflow.html https://docs.vespa.ai/documentation/onnx.html https://docs.vespa.ai/documentation/xgboost.html
  • 52. map( join( reduce( join( Placeholder, Weights_1, f(x,y)(x * y) ), sum, d1 ), Weights_2, f(x,y)(x + y) ), f(x)(max(0,x)) ) Placeholder Weights_1 matmul Weights_2 add relu
  • 53. Grouping and aggregation Organizing data at request time … For navigational views, visualization, grouping, diversity etc. Evaluated over all matches … distributed over all partitions Any number of levels and parallel groupings (may become expensive) https://docs.vespa.ai/documentation/grouping.html
  • 54. Grouping operations all: Perform an operation on a list each: Perform an operation on each item in a list group: Create a new list level max: Limit the number of elements in a list order: Order a list output: Add some data to the output produced by the current list/element
  • 55. Grouping aggregators and expressions Aggregators: count, sum, avg, max, min, xor, stddev, summary (summary: Output data from a document) Expressions: ● Standard math ● Static and dynamic bucketing ● Time ● Geo (zcurve) ● Access attributes + relevance score of documents
  • 56. Grouping examples Group hits and output the count in each group : Group hits and output the best in each group: Group into fixed buckets, then on attribute “a”, and count hits in leafs: Group into today, yesterday, last week and month, group each into separate days: https://docs.vespa.ai/documentation/reference/grouping-syntax.html
  • 57. Container for Java components ● Query and result processing, federation, etc.: Searchers ● Document processors ● General request handlers ● Any Java component (no Vespa interface/class needed) ● Dependency injection, component config ● Hotswap of code, without disrupting traffic ● Query profiles ● HTTP serving through embedding Jetty https://docs.vespa.ai/documentation/jdisc/
  • 58. Summary Making the best use of big data often implies making decisions in real time Vespa is the only open source platform optimized for such big data serving Available on https://vespa.ai Quick start: Run a complete application (on a laptop or AWS) in 10 minutes http://docs.vespa.ai/documentation/vespa-quick-start.html Tutorial: Make a scalable blog search and recommendation engine from scratch http://docs.vespa.ai/documentation/tutorials/blog-search.html