We started as an ad network. The challenge was to recommend the best product (out of millions) to the right person at a given moment (thousands of users within a second). We have delivered 5 billion ad views in 24 months. To put that scale in context: if we served 1 ad per second, it would take about 160 years to serve 5 billion ads.
So we needed a solution. SQL databases did not work. Popular NoSQL databases did not work. Standard data warehouse approaches (pre-aggregations, creating schemas) did not work either.
Rethinking all the problems with the huge data streams flowing to us every second, we built a complete solution based on open-source technologies and fresh, smart ideas from our engineering team. It is called deep.bi, and we are now making it available to other companies.
deep.bi lets high-growth companies solve fast data problems by providing scalable, flexible and real-time data collection, enrichment and analytics.
It was built using:
- Node.js - API
- Kafka - collecting and distributing data
- Spark Streaming - ETL, data enrichments
- Druid - real-time analytics
- Cassandra - user events store
- Hadoop + Parquet + Spark - raw data store + ad-hoc queries
2. We started as an ad network
The challenge was to recommend
the best product (out of millions)
to the right person at a given moment
(thousands of users within a second)
4. To put the scale in context:
If we served 1 ad per second, it would take
160 years
to serve 5 billion ads
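The arithmetic behind this figure can be checked with a short sketch. It assumes a 365-day year; the function name is illustrative and not part of deep.bi.

```python
# Rough check of the "160 years" figure, assuming 365-day years.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def years_to_serve(total_ads: int, ads_per_second: float = 1.0) -> float:
    """Years needed to serve total_ads at a constant rate."""
    return total_ads / ads_per_second / SECONDS_PER_YEAR


years_at_one_per_second = years_to_serve(5_000_000_000)

# Average rate needed to serve the same volume in 24 months instead.
average_ads_per_second = 5_000_000_000 / (2 * SECONDS_PER_YEAR)

print(f"{years_at_one_per_second:.1f} years at 1 ad/s")
print(f"{average_ads_per_second:.0f} ads/s on average over 24 months")
```

At one ad per second the result is a little under 160 years, which matches the rounded figure on the slide.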
5. So we needed a solution
SQL databases did not work
Popular NoSQL databases did not work
Standard data warehouse approaches
(pre-aggregations, creating schemas) did not work either
6. Re-thinking all the problems with
huge data streams flowing to us every second
we have built a complete solution
based on open-source technologies
and fresh, smart ideas from our engineering team
It is called deep.bi
and now we make it available to other companies
7. DEEP.BI = BIG DATA FAST DATA SOLUTION
high velocity
high volume
8. deep.bi lets high-growth companies
solve fast data problems by providing
scalable, flexible and real-time
data collection, enrichment and analytics
9. deep.bi – complete data processing flow
collect: unstructured, raw data from many sources (page views, IoT events, IP, URL, cookie, transactions, call detail records, etc.)
enrich: data enrichment, transformation and integration
analyze: find patterns, build models, predict behavior
10. How to predict the best offer
based on online data – case study.
11. Collect website, campaigns and CRM data
Website: Google Analytics
Campaigns: agency reports
Apps: dedicated monitoring tools
Other systems: call center IVR, emails
Data is stored in silos. Reporting tools provide aggregated reports that are impossible to integrate around a single customer.
Instead of integrating current reporting tools, we need to gather all the single events that our customers generate.
12. Collecting raw web data is not enough
2015-05-15T00:26:41.328Z,3,D,[ip_hidden],i1xszg0f-19hqrje,"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36","[url_hidden]",7279848891,@906,"https://www.google.pl/",vuser-history-allegro-1-hc20150509.1,"122_100003_Park@700:html_620x100_single_banner:See offer"
IP, URL, cookie, user-agent, timestamp
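A raw record like this can be split into named fields before enrichment. The sketch below uses Python's standard csv module on a shortened copy of the sample line. The column positions and field names are assumptions inferred from the sample, not a documented deep.bi format.

```python
import csv
from datetime import datetime

# First ten fields of the sample record from the slide.
RAW_EVENT = (
    '2015-05-15T00:26:41.328Z,3,D,[ip_hidden],i1xszg0f-19hqrje,'
    '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/42.0.2311.152 Safari/537.36","[url_hidden]",'
    '7279848891,@906,"https://www.google.pl/"'
)


def parse_event(line: str) -> dict:
    """Split one CSV record; column positions are assumed, not documented."""
    row = next(csv.reader([line]))  # handles commas inside quoted fields
    return {
        "timestamp": datetime.strptime(row[0], "%Y-%m-%dT%H:%M:%S.%fZ"),
        "ip": row[3],
        "cookie": row[4],
        "user_agent": row[5],
        "url": row[6],
    }


event = parse_event(RAW_EVENT)
print(event["timestamp"], event["cookie"])
```

The quoted user-agent contains a comma, which is why a CSV parser is used here rather than a plain split on commas.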
13. Enrich raw web and mobile data
50+ pieces of information from one interaction
Purchase intent
Device
Time
Location
ISP
Online context
Weather*
Demographics
* Coming soon
14. We can learn quite a few things from user IP
Example use:
• international travellers
• townspeople
• people in mountains
• rainy day
• Country
• Region
• City
• ZIP Code
• Population
• Latitude & Longitude
• Time zone
• IDD prefix to call the city from another country
• Phone area code
• Mobile Country Code (MCC)
• Mobile Network Code (MNC)
• Elevation
• Weather at the moment of event
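As a rough illustration of how IP attributes could turn into the example segments, here is a minimal sketch. The lookup table, its placeholder values, the thresholds and the tag names are all hypothetical. A real deployment would rely on a geolocation database rather than a hard-coded list.

```python
import ipaddress

# Hypothetical lookup table keyed by documentation-only IP ranges.
# All attribute values are made-up placeholders.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"),
     {"country": "PL", "city": "Big City",
      "population": 1_500_000, "elevation_m": 110}),
    (ipaddress.ip_network("198.51.100.0/24"),
     {"country": "AT", "city": "Mountain Village",
      "population": 4_000, "elevation_m": 1_200}),
]


def enrich_ip(ip: str) -> dict:
    """Return the attributes of the first matching network, or {}."""
    address = ipaddress.ip_address(ip)
    for network, attributes in GEO_TABLE:
        if address in network:
            return dict(attributes)
    return {}


def segment_tags(attributes: dict, home_country: str = "PL") -> list:
    """Derive example segments; thresholds are illustrative assumptions."""
    tags = []
    if not attributes:
        return tags
    if attributes["country"] != home_country:
        tags.append("international traveller")
    if attributes["population"] >= 100_000:
        tags.append("townsperson")
    if attributes["elevation_m"] >= 800:
        tags.append("in the mountains")
    return tags


city_user = segment_tags(enrich_ip("192.0.2.17"))
mountain_user = segment_tags(enrich_ip("198.51.100.5"))
unknown_user = segment_tags(enrich_ip("203.0.113.9"))
print(city_user, mountain_user, unknown_user)
```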
15. ISP tells us more than we could expect
Example use:
• competitors’ users -> acquisition
• our users -> retention/up-selling/cross-selling
• people from a particular company or company type
• ISP name or Organization name
• Organization type:
• Commercial
• Organization
• Government
• Military
• University/College/School
• Library
• Content Delivery Network
• Fixed Line ISP
• Mobile ISP
• Data Center/Web Hosting/Transit
• Search Engine Spider
• Reserved
• Mobile brand
• Net speed
16. Detailed information about user device
Example use:
• smartphone users
• Apple users
• Samsung Galaxy users
• Google browser users
• Device Type
• Device Brand
• Device Model
• Device Operating System
• Operating System Producer
• Browser
• Browser Producer
17. Besides user features, track user behavior too.
Deeper understanding of people’s behavior:
• RFM Segmentation (Recency, Frequency, Monetary)
• Shopping cart analysis
• Purchase sequence analysis
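RFM segmentation, the first item above, can be sketched in a few lines. The input shape (user id, purchase date, amount) and the sample purchases are assumptions made for illustration.

```python
from datetime import date


def rfm(purchases, today):
    """Compute (recency_days, frequency, monetary) per user.

    purchases: iterable of (user_id, purchase_date, amount) tuples.
    """
    result = {}
    for user, day, amount in purchases:
        recency, frequency, monetary = result.get(user, (None, 0, 0.0))
        days_ago = (today - day).days
        # Recency is the age of the most recent purchase.
        recency = days_ago if recency is None else min(recency, days_ago)
        result[user] = (recency, frequency + 1, monetary + amount)
    return result


# Made-up sample purchases.
sample = [
    ("u1", date(2015, 5, 1), 100.0),
    ("u1", date(2015, 5, 10), 50.0),
    ("u2", date(2015, 3, 1), 20.0),
]
scores = rfm(sample, today=date(2015, 5, 15))
print(scores)
```

Segments are then typically formed by bucketing each of the three values, for example into quantiles.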
18. User behavior and characteristics
help predict the next best action/offer
What product should we recommend?
How could this purchase path end?
19. So, how to build tailored recommendations?
Pick an algorithm that is suitable for the problem
Product [ feature_1, feature_2, …, feature_N]
User [ feature_1, feature_2, …, feature_N]
User [ product_1, product_2, …, product_N]
20. Simple rules: if a user has certain features, serve this group of products
Manual segment creation: analysts find segments of users and match them with product segments
Simple feature matching: take the user's weighted feature vector and match it with product feature vectors
Manual / people managed rules
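Simple feature matching can be sketched as a cosine similarity between a weighted user vector and each product vector. The feature names, weights and products below are invented for illustration, and cosine similarity is one reasonable choice of matching score rather than the one confirmed by the slides.

```python
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(a[key] * b[key] for key in set(a) & set(b))
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user_vector: dict, products: dict, top_n: int = 2) -> list:
    """Return the names of the top_n most similar products."""
    ranked = sorted(
        products.items(),
        key=lambda item: cosine(user_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_n]]


# Hypothetical user and product feature vectors.
user = {"smartphone": 0.9, "samsung": 0.7, "travel": 0.2}
catalog = {
    "laptop_sleeve": {"laptop": 1.0},
    "travel_bag": {"travel": 1.0},
    "galaxy_case": {"smartphone": 1.0, "samsung": 1.0},
}
best = recommend(user, catalog)
print(best)
```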
21. Find segments automatically (e.g. k-means)
Product features based recommendations
User features based recommendations
Combined product- and user-based recommendations (collaborative filtering, deep learning)
Machine learning-supported recommendations
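To make automatic segment discovery concrete, here is a minimal k-means sketch in plain Python. The two features per user, the sample points and the naive initialisation (the first k points) are assumptions. A production system would more likely use a library implementation with a better initialisation.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points (naive)."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(point, centroids) for point in points]
    return centroids, labels


# Hypothetical users described by (visits per week, items per basket).
users = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(users, k=2)
print(labels)
```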
24. Complex data model for query optimization
dimensions have to be split across several tables, based on the reports needed
dimensions to aggregate have to be cherry-picked in advance, based on cardinality
indexing every dimension column is a must
Impossible to add high-cardinality dimensions
no way to analyze per user (millions of them)
no way to even add all of user-agent, url, geo-info, ...
Problems with SQL and NoSQL databases
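A back-of-the-envelope sketch shows why high-cardinality dimensions defeat pre-aggregation. The number of pre-aggregated rows is bounded by the product of the dimension cardinalities, capped at the number of events. The cardinalities below are illustrative assumptions.

```python
import math


def max_rollup_rows(cardinalities: dict, n_events: int) -> int:
    """Upper bound on rows left after pre-aggregating by these dimensions."""
    return min(math.prod(cardinalities.values()), n_events)


N_EVENTS = 5_000_000_000

# Hypothetical low-cardinality dimensions.
low_cardinality = {"country": 200, "device_type": 5, "browser": 20}
# Adding a per-user dimension (assumed 10 million users).
with_user = dict(low_cardinality, user_id=10_000_000)

rows_low = max_rollup_rows(low_cardinality, N_EVENTS)
rows_with_user = max_rollup_rows(with_user, N_EVENTS)
print(rows_low, rows_with_user)
```

With only low-cardinality dimensions the rollup is tiny. Once a per-user dimension is included, the bound reaches the raw event count, so pre-aggregation no longer guarantees any reduction.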
25. Complex data loading process
needs to pre-aggregate in memory
non-trivial reliability issues
hard to parallelize
There is always latency
pre-aggregation happens in the memory of the loading job
Problems with SQL and NoSQL databases
26. deep.bi – real-time big data architecture
Sources: customer databases, event sources*
Real-time data ingestion (Kafka): raw data stream
Data transformation & enrichment (Node.js, Spark Streaming): transformed data stream
High performance, multi-purpose storage:
- Real-time OLAP store: Druid
- Operational store: Cassandra
- Raw data store: Hadoop, Parquet, Spark
Outputs: web analytics dashboard, deep.bi API, ETL, customer analytics dashboard
*e.g. mobile apps, websites, marketing campaigns, IoT (beacons, wearables)
27. Data Collection APIs
DEEP: data enrichment, storage & analytics
Client’s DEEP Data Space
Ingestion API
End-user browser
Web Data Collection API (HTML or JS)
Trackers pass event data with <DEEP tracker>
Mobile Data Collection API (HTML, JS or Native SDK)
Trackers pass event data with
29. Publish-subscribe service
The nervous system of enterprise data
decouple producers from consumers
reliably buffer data
send now, process later.
Scalable, distributed, replicated log system
Pause components, restart processing
Powered by:
web giants like LinkedIn, Twitter, Netflix, Uber, Spotify or Pinterest
>10M messages/second
Apache Kafka
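The ideas of decoupling, buffering and restartable processing can be illustrated with a tiny in-memory append-only log. This is a conceptual sketch only. It does not use the Kafka client API, and the class and method names are hypothetical.

```python
class Log:
    """Append-only list of messages, addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, message) -> int:
        self._records.append(message)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset: int, max_records: int) -> list:
        return self._records[offset:offset + max_records]


class Consumer:
    """Tracks its own offset, so it can pause and resume independently."""

    def __init__(self, log: Log):
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 10) -> list:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for event in ["view:1", "view:2", "click:1"]:
    log.append(event)            # producer sends now, nobody listens yet

dashboard = Consumer(log)
first_batch = dashboard.poll(max_records=2)   # process later
log.append("purchase:1")         # producer keeps going while consumer pauses
second_batch = dashboard.poll()  # consumer resumes where it stopped

archiver = Consumer(log)         # a second consumer reads the same log
archived = archiver.poll()
print(first_batch, second_batch, archived)
```

Because each consumer keeps its own offset, producers never wait for consumers, and a consumer that stops can pick up from the last offset it processed.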
30. Scalable, fault-tolerant stream processing system
With simple programming model & rich API & integrations
Powered by:
Yahoo, Netflix, eBay
NASA, Intel, Cisco
It is our fundamental technology for streaming applications
sessionize events
detect fraud
attribute purchases to clicks or views
load & read external stores like Druid, Hadoop, Cassandra
Apache Spark Streaming
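Sessionizing events, the first use listed above, comes down to grouping each user's events and starting a new session after a period of inactivity. The sketch below shows that logic in plain Python rather than with the Spark Streaming API. The 30-minute gap and the event shape are assumptions.

```python
def sessionize(events, gap_seconds=1800):
    """Group (user_id, timestamp_seconds) events into per-user sessions.

    A new session starts when the gap to the previous event of the same
    user exceeds gap_seconds.
    """
    sessions = {}
    for user, timestamp in sorted(events):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and timestamp - user_sessions[-1][-1] <= gap_seconds:
            user_sessions[-1].append(timestamp)
        else:
            user_sessions.append([timestamp])
    return sessions


# Made-up events: user "a" returns after almost an hour of inactivity.
sample_events = [("a", 600), ("b", 100), ("a", 0), ("a", 4000)]
sessions = sessionize(sample_events)
print(sessions)
```

In a streaming job the same rule would be applied incrementally, with per-user state kept between batches.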
31. Open Source Streaming Data Store for Interactive Analytics at Scale
denormalized data
no more snowflake or star-schema!
Build real-time dashboards, analytic applications, exploratory tools on it.
It’s FAST!
aggregate, drill down, slice and dice in under a second
advanced column-store with compression
sophisticated approximate algorithms
It’s SCALABLE
horizontally scalable - just add more machines
replicated, highly-available
Over 100 PBs of data, millions of events/second
Druid – Real-time OLAP Store
32. Ingest historical & real-time data
data available for exploration in milliseconds
can store years of data in very optimized storage
Powered by
eBay, Netflix, PayPal, Yahoo
Cisco
It is our core data store of all events, historical and real-time data
Druid – Real-time OLAP Store
33. Apache Spark for batch-processing: fast and general engine for
large-scale data processing
Replaces MapReduce and can be 10x-100x faster!
Number 1 open-source project in big data space (contributors, commits)
In-memory processing (if possible)
Spark SQL for SQL processing
Apache Parquet - an optimized storage format
columnar – read only columns you need
compressed – specialized compression for data type + generic compression
2x-4x: 600 GB data -> 150 GB data
Hadoop can be optimized by 2 orders of magnitude: from hours to seconds!
Hadoop Optimized
34. Thank you!
Share your thoughts, challenges
or case studies with us.
Or drop us a line: hello@deep.bi
36. Let’s assume we want to find users who:
Were interested in smartphones
Use Samsung products
Live in cities with a population over 1M people
Are women
Were traveling abroad
Came from our display campaign
So, we have a combination of 6 (k) dimensions from 50 (n).
Using the combination formula: we will have…
Complexity of multidimensional queries
37. … similar number of possible combinations:
15,890,700
as in Lotto (6 from 49).
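The figure follows from the standard combination formula, n! / (k! (n - k)!). A short check using Python's math.comb (available since Python 3.8):

```python
import math

# Choosing k = 6 dimensions out of n = 50.
dimension_combinations = math.comb(50, 6)

# Lotto for comparison: 6 numbers out of 49.
lotto_combinations = math.comb(49, 6)

print(f"{dimension_combinations:,} vs {lotto_combinations:,}")
```

Both counts are in the same range, roughly 15.9 million versus 14 million.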
RFM segmentation (Recency, Frequency, Monetary)
Assessing customers' revenue potential
Analyzing migration between segments
Shopping cart analysis
Understanding what baskets customers build.
Understanding which product categories most often sell together.
Purchase sequence analysis
Understanding how customer behavior unfolds over time.
Which sequences precede a purchase.
Which sequences precede abandonment.
Uplift models:
Will buy after a recommendation: target group
Will buy without a recommendation: unnecessary expense
Will not buy after a recommendation: lost customer
Source: http://saasaddict.walkme.com/saas-2015-new-shifts-will-see/
1. Companies Will Be Investing More in Personal Consumer Research
Currently a lot of consumer research is performed in a very static manner, through surveys and analysis of raw data. What more companies will be investing in is in personalization and customization in their services. They will also focus on getting to know their customers more personally, usually through social media, through the use of Big Data (see more on that below) and through direct engagement (via email and social media). Details like purchasing motivations, lifestyle, and desires are all important.
Relevant marketing strategies seek to improve customer satisfaction and motivate customers to value your brand as more than just a service.
2. Cloud Data Services Will Overtake Traditional Means of Storage
According to Forrester research, Microsoft will be generating more revenue from its cloud services compared to its traditional on-premise application. Traditional services are limited by their on-premise storage space, while cloud data services are much more open. This will allow for businesses to look into contracting cloud services for meaningful growth while it is still relatively inexpensive.
One challenge to watch out for is that cloud data breaches are a legitimate issue. Expect companies to invest heavily in shoring up their securities to avoid breaches.
3. More SaaS Apps Will Specialize in Specific Industries
Industries like healthcare, manufacturing, and retail will be developing more apps in their specific fields. One of the challenges to this new approach is that it burdens the customer with a deeper, more complex experience to acclimate to. However, a benefit to specialized SaaS is that companies will have a built-in userbase which gives them a head start when developing features. It also benefits enterprise customers.
The reason that this trend is important is because consumers are demanding more apps that are relevant to specific needs. Generalized apps avoid getting too complex in any one area which can alienate consumers by not providing solutions they desire.
4. New Alternatives to Multitenancy Will Develop
Allowing multiple customers to share a single application instance is useful for managing data on cloud services. While the traditional sense allowed for multiple users to be plugged in, and had individual views, alternatives that allow for more personalized experiences are being developed. For example, Salesforce.com is offering a new 'Superpod' service for enterprises. This allows companies to have their own dedicated infrastructure inside their data centers, rather than connect to a single server-side instance.
These new hybrid services give enterprises more options leading into the future, allow for more innovation in developing delivery systems, and thus free up the bottleneck in the cloud service market. They also give consumers more options.
5. A Bigger Emphasis on Big Data Analytics
According to IDC reports, there is a trend leading towards a greater use of data-as-a-service (DaaS) with spending reaching $215 billion in 2015. DaaS will leverage cloud to deliver their services. They also predict that more companies will be using big data analytics as a part of their commercial and open data sets.
Cloud storage offers more flexibility for enterprise access and overall capacity. Since the relative cost of cloud storage per unit is decreasing, more companies are becoming interested in big data analysis, which makes it a perfect opportunity to begin implementing open data set technologies.