My keynote presentation on how we developed FiloDB, a distributed, Prometheus-compatible time series database, productionized it at Apple, and scaled it out to handle a huge volume of operational data, built on a stack of Kafka, Cassandra, and Scala/Akka.
4. This is not a contribution
Requirements
• Massive scale, billions of metrics
• Resiliency and maximum uptime
• Real time (seconds, not minutes)
• Low latency querying
• High concurrency (thousands of dashboards, alerts)
• Easy debugging - flexible ad-hoc queries
5. What Users Wanted
• Flexible data model and queries, tag-based querying
• User-defined “tags” on metrics and data
• Prevents abuse of hierarchical system
• Can query across regions, other boundaries
• Flexible rollups
• Longer views of fine-grained data, or flexible retention policies
6. Design for the Cloud
• Internal cloud @Apple similar to public cloud
• Containers and “stateless” apps
• Use of Docker, etc. promotes more containers = more metrics
• Stateless = more frequent restarts, more UUIDs => more metrics
• Leverage hosted cloud services
• Hosted Cassandra, Kafka, other data services
• Let someone else manage persistent storage
7. Where are we going?
[Diagram: Events, Metrics, and Tracing feed Dashboards and Real-time Debugging, leading toward Real-time ML/AI, Actionable Insights, and ???]
8. This is not a contribution
(Re)Introducing FiloDB
A Prometheus-compatible, Distributed, In-Memory
Time Series Database
OPEN SOURCE!
http://www.github.com/filodb/FiloDB
Built on the proven reactive SMACK stack.
9. Core Principles
• Designed for Cloud Infrastructure
• Built for Scale and Resiliency
• Flexible Data Model
• Multi-Tenant
10. Proudly built on the Reactive Stack
11. In-Memory Time Series
13. Facebook Gorilla
• Keep the most recent time series data IN MEMORY, stored using efficient time series encoding techniques
• Serve queries using a separate process
• Allows dense, massively scalable TS storage + very fast, rich queries of recent data
• https://github.com/facebookarchive/beringei
15. Data Flow on a Node
[Diagram: incoming Records flow through Monix / RX ingestion with back pressure into per-shard (Shard 0, Shard 1) Indexes and Write buffers, which are encoded into Chunks]
16. Columnar Compression
[Diagram: the row-based layout interleaves repeated (timestamp, value) pairs; the column-based layout stores all timestamps t1..t8 together and all values v1..v8 together]
• Compressing all timestamps together is much more efficient
17. Delta-Delta Encoding
• Encode increasing numbers (timestamps, counters) as deltas from the expected slope
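To make the idea concrete, here is a rough sketch of delta-of-delta encoding (illustrative Python, not FiloDB's actual code): regularly spaced timestamps reduce to near-zero residuals, which compress extremely well.

```python
# Illustrative sketch of delta-of-delta encoding for a column of timestamps.
# Regularly increasing values reduce to tiny residuals.

def encode(ts):
    """Store the first value, the first delta, then deltas of deltas."""
    if len(ts) < 2:
        return list(ts)
    deltas = [b - a for a, b in zip(ts, ts[1:])]
    dod = [b - a for a, b in zip(deltas, deltas[1:])]
    return [ts[0], deltas[0]] + dod

def decode(enc):
    """Reverse the encoding by re-accumulating deltas."""
    if len(enc) < 2:
        return list(enc)
    out, delta = [enc[0]], enc[1]
    for d in [0] + enc[2:]:
        delta += d
        out.append(out[-1] + delta)
    return out
```

For timestamps arriving every 10 seconds, the encoded stream is mostly zeros, which is why this technique gives such dense storage for time series.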
18. Results
• Millions of time series and billions of samples per node
• Up to 1 million samples/sec per node peak ingestion rate (measured during recovery)
• Up to 8x better than the previous system (Storm/HBase)
• Storage density of ~3 bytes per metric sample
• About 10x better than the previous system (HBase)
19. Tackling Heap Issues
• 60+ second GC pauses / OOM
• Filled up old gen; GC stuck finding a tiny bit of free space
• Solution: move as many permanent objects off-heap as possible
• Too high a rate of allocation on ingest
• Temporary objects only, but producing too many
• Solution: switch from Protobuf to a custom, no-allocation BinaryRecord
20. Off-heap Data Structures
• BinaryVector - one compressed column of data (say timestamps, or values)
• BinaryRecord - one ingestion data record, variable schema
• OffheapLFSortedIDMap - offheap lightweight sorted map
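A rough sketch of the idea behind a flat binary record (illustrative Python, not FiloDB's actual layout): pack each ingestion record into one contiguous byte buffer instead of allocating per-field objects, so ingest produces almost no garbage.

```python
# Illustrative sketch: a flat, fixed-layout binary record packed into one
# byte buffer, avoiding per-field object allocation on ingest.
import struct

REC = struct.Struct("<qd")  # timestamp: int64, value: float64

def write_record(buf, offset, timestamp, value):
    """Pack one record into the buffer; return the next write offset."""
    REC.pack_into(buf, offset, timestamp, value)
    return offset + REC.size

def read_record(buf, offset):
    """Read one (timestamp, value) record without intermediate objects."""
    return REC.unpack_from(buf, offset)
```

On the JVM the same idea is implemented with raw memory offsets; the point is that the record is a region of bytes plus a schema, not a graph of objects.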
21. Moving Object Graphs
[Diagram: the on-heap TSPartition holds a ConcurrentSkipListMap of ID -> ChunkSetInfo entries plus WriteBuffer and Vector objects, which point at off-heap write buffer and chunk blocks]
22. Moving Object Graphs
[Diagram: after the move, the on-heap TSPartition keeps only a PartID; the ChunkSetInfos and a ChunkMap of pointers to write buffers and chunks now live entirely off-heap]
23. Flexible Distributed Queries
24. Prometheus Compatible
• Don't reinvent a popular time series query language
• The Prom HTTP API gives out-of-the-box Grafana support
sum(http_requests{partition="P2",dc="DC0",job="A0"}) by (host)
• Filtering/indexing on many time series
• Time windowing-based aggregation with multiple windows
• Group by
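To illustrate what time-window aggregation means, here is a rough sketch of a `sum_over_time`-style evaluation (illustrative Python; the function name follows PromQL's convention, the code is not FiloDB's):

```python
# Illustrative sketch of windowed aggregation: for each evaluation step,
# sum the samples that fall inside the lookback window ending at that step.

def sum_over_time(samples, start, end, step, window):
    """samples: list of (timestamp, value); returns one sum per step."""
    out = []
    for t in range(start, end + 1, step):
        total = sum(v for ts, v in samples if t - window < ts <= t)
        out.append((t, total))
    return out
```

Supporting multiple windows just means evaluating the same series with different `window` values, which the engine can do over one pass of the raw chunks.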
25. Queries to Logical Plan
sum(http_requests{partition="P2",dc="DC0",job="A0"}) by (host)
AST:
Aggregate(Sum,
  PeriodicSeries(
    RawSeries(
      IntervalSelector(t1, t2, step),
      List(ColumnFilter(partition, Equals(P2)),
           ColumnFilter(dc, Equals(DC0)),
           ColumnFilter(job, Equals(A0)),
           ColumnFilter(__name__, Equals(http_requests))))))
26. Physical Plan Execution
• Location transparency of Akka actors is crucial here
[Diagram: a ReduceAggregateExec(Sum) fans out to a SelectRawPartitionsExec -> PeriodicSamplesMapper -> AggregateMapReduce pipeline on each shard (Shard 0, Shard 1), reading chunks locally]
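The map-reduce shape of that plan can be sketched roughly like this (illustrative Python, not FiloDB's executors): each shard computes partial sums grouped by the `by`-clause tag, and a reduce step merges the partials.

```python
# Illustrative sketch of distributed sum(...) by (host): per-shard partial
# aggregation followed by a central reduce, mirroring the exec plan shape.
from collections import defaultdict

def aggregate_map(samples):
    """Per-shard step: partial sums grouped by the 'host' tag."""
    partial = defaultdict(float)
    for tags, value in samples:
        partial[tags["host"]] += value
    return dict(partial)

def reduce_aggregate(partials):
    """Reduce step: merge the partial sums from every shard."""
    out = defaultdict(float)
    for partial in partials:
        for host, v in partial.items():
            out[host] += v
    return dict(out)
```

Because only the small partial results cross the network, the reduce step can run on any node, which is what makes the plan change on the next slide possible without code changes.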
27. Physical Plan Execution
• Look ma, plan change, no code changes!
[Diagram: the same per-shard pipelines, but the ReduceAggregateExec(Sum) now runs in a separate Query Service above the shards]
28. Actor Hierarchy
[Diagram: on each node (Node 1, Node 2), a NodeCoordinatorActor fronts an IngestionActor and a QueryActor over the MemStore; clients reach nodes via HTTP or CLI / Akka Remote]
29. Comparisons
• Queries possible on FiloDB and not on the old system:
• Tag-based querying (filter, group by, etc. based on flexible tags)
• Histograms and quantiles
• Group by and topK queries
• Flexible time series joins
• 100's of millions of samples queried/sec
30. Datasets and Data Model
31. What Kind of Data Works?
• High cardinality of individual time series (operational metrics, devices, business metrics)
• Many data points in each series, append only
[Diagram: Series1 {k1=v1, k2=v2}, Series2 {k1=v3, k2=v4}, and Series3 {k1=v5, k2=v6}, each extending along the time axis]
32. Flexible Tags
• Each time series is defined by a metric name and a unique combination of keys and values
• An index on tags allows filter/search by any combination of tags
memstore_partitions_queried {dataset=timeseries, host=MacBook-Pro-229.local, shard=0}
memstore_partitions_queried {dataset=timeseries, host=MacBook-Pro-229.local, shard=1}
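The identity rule above can be sketched roughly as follows (illustrative Python; the function names are hypothetical, not FiloDB's API): a series key is the metric name plus its tags in sorted order, so the same tags in any order name the same series, and tag-based queries are just filters over those tags.

```python
# Illustrative sketch: a time series is identified by metric name plus the
# sorted set of tag key-value pairs.

def series_key(metric, tags):
    """Canonical series identity: metric name + sorted tags."""
    pairs = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return metric + "{" + pairs + "}"

def matches(tags, filters):
    """Equality-based tag filtering, the basis of tag-based querying."""
    return all(tags.get(k) == v for k, v in filters.items())
```

The two example series above differ only in `shard`, so they get distinct keys, while a filter on `dataset=timeseries` matches both.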
33. Flexible Schemas and Datasets
• Datasets allow for namespacing different schemas, ingestion sources, SLAs, # of shards, and offheap memory isolation
• Main dataset with 2-day retention
• Pre-aggregates dataset with 1-week retention
• Histograms dataset with a schema for efficient histogram storage
• OpenTracing dataset - start, end, span duration, etc.
• Historical data using a different schema
34. The Hard Stuff: Recovery and Persistence
35. What is persisted?
• Raw time series data - in a custom format designed for efficient ingestion and recovery - is stored in and ingested from Apache Kafka
• Compressed, columnar time series data is written periodically to a ColumnStore, typically Cassandra
• Time series metadata for reconstructing each node's index is persisted as well
37. Recovery
[Diagram: five Kafka shards (Shard0-Shard4) feed three FiloDB nodes; when the node owning S2 FAILS, a new FiloDB node takes over S2, replaying from Kafka and on-demand paging from the CassandraChunkSink, while queries fail over to the other DC]
38. Recovery
• The most recent raw data - before encoding - is recovered by replaying Apache Kafka partitions
• Index metadata is recovered
• Compressed data is loaded on demand from Cassandra. This works because most data written is never queried.
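The on-demand paging idea can be sketched roughly like this (illustrative Python; the class and method names are hypothetical, not FiloDB's): serve recent samples from memory, and touch the column store only when a query reaches further back than what memory holds.

```python
# Illustrative sketch of on-demand paging: in-memory recent data, with a
# fallback read from a (stubbed) column store only for older time ranges.

class TimeSeriesStore:
    def __init__(self, column_store):
        self.memory = {}                 # series -> list of (timestamp, value)
        self.column_store = column_store # older, compressed chunks

    def ingest(self, series, samples):
        self.memory.setdefault(series, []).extend(samples)

    def query(self, series, start, end):
        in_mem = self.memory.get(series, [])
        oldest = in_mem[0][0] if in_mem else float("inf")
        paged = []
        if start < oldest:
            # Page in older chunks only when the query actually needs them.
            paged = self.column_store.read(series, start, min(end, oldest))
        return [s for s in paged + in_mem if start <= s[0] <= end]
```

Since most dashboards query only recent data, the common case never touches Cassandra at all.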
39. FiloDB vs Alternatives
40. vs Prometheus
• FiloDB supports PromQL and the Prometheus HTTP query API
• Prometheus is single-node only
• FiloDB is multi-schema and multi-tenant
• FiloDB is designed to run as a resilient, distributed, high-uptime cloud service
• Open-source FiloDB is not as rich feature-wise (yet)
41. vs InfluxDB
• FiloDB's data model is very close to Influx's: multi-schema, multiple columns, namespaces, tags on series

                  FiloDB                     InfluxDB
  Clustering      Peer-to-peer distributed   Single node (OSS), clustered ($$)
  Query language  PromQL                     SQL (PromQL coming)
  Maturity        New                        Established
42. vs Cassandra
• C*: very well established and widely used, robust
• Like FiloDB: real time, distributed, low latency
• C*: very simple queries, ideally to one partition
• FiloDB: complex PromQL queries - topK, groupBy, time series joins and windowing
• FiloDB: much higher storage density and ingestion throughput for time series
43. vs Druid
• Druid and FiloDB have different data models
• Druid is an OLAP database with an explicit time dimension; dimensions are fixed
• FiloDB supports millions/billions of time series with flexible tags
• FiloDB stores raw data; Druid stores roll-ups
44. Tradeoffs and Lessons
Tradeoffs of using the JVM
• Pluses: solid, proven libraries for building
distributed and data systems
• Apache Lucene
• Akka Cluster
• Minuses: Lack of low-level memory layout and
control
• The devil you know best
46. JVM Production Tips
• Get to know the different GCs (Eden, OldGen, G1GC, etc.) really, really well
• SJK (https://github.com/aragozin/jvm-tools)
• Runtime visibility
• Multiple APIs to access cluster state
• JMX beans
• Measure, measure, measure!! (Use JMH)
47. Current Status
• Development at github.com/filodb/FiloDB
• Time/value schema ingestion and querying is stable
• Looking for partners to work together, add integrations, etc.
48. Try it out today
• Ingest data using https://github.com/influxdata/telegraf
• Expose a Prometheus HTTP read endpoint in your apps
• Use Grafana to visualize metrics
49. Roadmap
• Speed and efficiency improvements in the core FiloDB database
• Histogram optimizations
• Improved cluster state management
• Support for Spark/ML/AI jobs and metrics. How can we improve observability for data engineers?
• Support for non-metrics schemas
• Long-term storage
50. Thank you!
• Note: we are hiring! If you love reactive systems and distributed systems, and love to push the performance envelope… there's a place for you.
52. On Heap vs Off Heap
[Diagram: incoming Records feed on-heap TSPartitions via a PartitionMap; the Index, Write buffers, Chunks, and ChunkMaps live off-heap, with the Lucene index memory-mapped (MMap) from index files]