Paul Dix, CTO and co-founder of InfluxData, discussed the future of InfluxDB and the release of InfluxDB 2.0 Open Source. He explained that the database core has been rebuilt from the ground up to address limitations of the original InfluxDB, such as the lack of distributed features in open source and poor performance on high-cardinality analytics data. The new database, called InfluxDB IOx, uses a columnar data store persisted as Parquet files and is designed to be distributed, federated, and able to run analytics at scale on high-cardinality data.
23. Requirements
• No limits on cardinality
• Analytics performance
• Separate compute from storage and tiered storage
• Operator defined Replication & Partitioning
• Able to run without locally attached storage
• Bulk data import and export
• Subscriptions
• Federated by design
• Embeddable scripting
• Greater compatibility
50. In-memory Perf Preview (tracing example)
• env - production or staging environment
• data_centre - the region within a cloud vendor
• cluster - a specific cluster, e.g., a k8s cluster
• user_id - an id associated with the user that issued a request that was traced
• request_id - an id associated with a single request that started a trace
• trace_id - a single id associated with all spans in the trace
• node_id - the id of the compute node that the trace execution ran across
• pod_id - the ids of the containers that the trace execution ran across
• span_id - a random id for every sample generated in the trace
51. Test data cardinalities
104,998,932 rows
• env - 2
• data_centre - 20
• cluster - 200
• user_id - 200,000
• request_id - 2,000,000
• trace_id - 10,000,000
• node_id - 2,000
• pod_id - 20,000
• span_id - ∞ (a new one for each sample row)
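The editor's notes later point out that for data like this, an inverted index keyed on individual series can grow larger than the time series data itself. A rough way to see why: under the old tag model, the worst-case series cardinality is the product of the tag cardinalities. A quick back-of-the-envelope in Python (illustrative arithmetic only; real series counts depend on which tag combinations actually occur, and span_id alone already forces one series per row):

```python
from math import prod

# Test-data tag cardinalities from the slide above (span_id excluded:
# it is unique per row, so it alone means one series per row).
cardinalities = {
    "env": 2,
    "data_centre": 20,
    "cluster": 200,
    "user_id": 200_000,
    "request_id": 2_000_000,
    "trace_id": 10_000_000,
    "node_id": 2_000,
    "pod_id": 20_000,
}

# Worst-case number of distinct series if every combination occurred.
worst_case_series = prod(cardinalities.values())
print(f"{worst_case_series:.2e}")  # vastly more than the ~105M actual rows
```

Even though real data only realizes a fraction of these combinations, identifier-like tags push the series count toward the row count, which is exactly the regime where per-series indexing stops paying for itself.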
53. Find spans for a trace
SELECT * FROM "traces"
WHERE "trace_id" = '0000MjNg' AND
      "time" >= '2020-10-30 15:12' AND
      "time" < '2020-10-30 16:12';
54. Find spans for a trace (result)
Returned in: 84.666665 ms, roughly 1.1B rows/sec
56. Flexible Replication Rules
• Synchronous & Asynchronous
• Push & Pull
• Request by request, batch, or bulk
• Partition to servers, groups of servers
• Total operator control via RESTful API
76. Get Involved
• Star & watch the repo at github.com/influxdata/influxdb_iox
• Find the InfluxDB IOx topic on community.influxdata.com
• Join the #influxdb_iox channel in our community Slack
• Join us on the 2nd Wednesday of every month at 8:30 AM Pacific Time for a tech talk on InfluxDB IOx: influxdata.com/community-showcase/influxdb-tech-talks/
• We’re hiring for Rust, distributed systems, and columnar databases expertise. Email recruiting@influxdata.com and CC me at paul@influxdata.com.
Editor's Notes
Today I want to talk to you about the future of InfluxDB. But before that, let’s talk about some of the big news!
InfluxDB 2.0 open source is now released! This represents a multi-year effort. With our cloud offering, our goal was to switch to a continuous, services-based, cloud-first delivery model that could be billed by usage, not by servers. This means that we ship production code every business day and make continuous incremental improvements, and our Cloud 2 customers only pay for what they use.
For our open source, we wanted to ship an all-in-one database, monitoring system, visualization engine, and scripting scheduler. Flux, our new scripting and query language was the center of this effort. With it, users can now do more than ever before within the database. They can even call out to third-party APIs to bring in more data, send data out, trigger action, or send alerts. This can happen at query time in ad-hoc queries, or scheduled through the Task scheduling system.
Our goal was to ship the same API in our cloud offering and in open source. We think you’ll love the open source InfluxDB 2.0 for local development and deployment at the edge or on single servers within your cloud or data center environment. Ryan Betts, our VP of Engineering will be covering more of the details in the talk right after mine.
For my talk, I wanted to tell you about what we’re thinking for The Future. I realize it may be early to start thinking about this with 2.0 open source just being released today, but I’m very excited about the work some of us have been doing and I want to share it publicly.
But before we get into the future, I need to talk about the past. Specifically, November 12th, 2013, which is the day I gave the first talk about InfluxDB and introduced it to the world. While the next 60 seconds will likely be review for all of you, I hope you’ll bear with me as I set the stage.
The talk was titled: Introducing InfluxDB, an open source distributed time series database.
In that talk I sought to define what I meant by time series data. I pointed to some specific examples.
Metrics were the first and most obvious example, as that was what most people thought of when I talked about time series.
I went further to give more examples of events. All things I thought could be analyzed, inspected, visualized and summarized as a time series.
I would later add sensor data to this list of time series examples.
And I talked about two different kinds of time series
More broadly, I claimed that all data you perform analytics on is time series data. Meaning, anytime you’re doing data analysis, you’re doing it either over time or as a snapshot in time.
I saw time series as a useful abstraction for solving problems and building applications in a number of different use cases.
The vision I laid out then is still one I have today, which is that InfluxDB should be useful for all kinds of time series data. It should also be the building block upon which future monitoring, analytics, sensor data and time series applications can be built.
So where are we today? Some of what I’ll say is generally about the platform and some of it will be specific to open source.
Easy to write data in with libraries in many languages. Easy to query using either InfluxQL or Flux.
With the addition of Flux, there are so many more things that InfluxDB can do beyond what a normal declarative query language can provide. It’s great for analytics. However, the caveat is that this is only true for lower-cardinality data. That is, you don’t have too many unique time series, and your tag values don’t have too many unique values.
InfluxDB lacking distributed features in open source means that it is frequently not chosen as a building block for time series applications. This limitation is unfortunate, but at the time it was a necessary choice that enabled us to build a business to support our open source efforts. However, it definitely gets in the way of our broader platform vision. InfluxDB should be a platform that is adopted by a very wide audience, well beyond our paying customer base.
We want to push what’s possible with InfluxDB forward. Ideally for both our open source users and our paying customers.
No limits on cardinality. Write any kind of event data and don’t worry about what a tag or field is.
Best-in-class performance on analytics queries in addition to our already well-served metrics queries.
Tiered data storage. The DB should use cheaper object storage as its long-term durable store.
Operator control over memory usage. The operator should be able to define how much memory is used for each of buffering, caching, and query processing.
Operator-controlled replication. The operator should be able to set fine-grained replication rules on each server.
Operator-controlled partitioning. The operator should be able to define how data is split up amongst many servers and on a per-server basis.
Operator control over topology including the ability to break up and decouple server tasks for write buffering and subscriptions, query processing, and sorting and indexing for long term storage.
Designed to run in an ephemeral containerized environment. That is, it should be able to run with no locally attached storage.
Bulk data export and import.
Fine-grained subscriptions for some or all of the data.
Broader ecosystem compatibility. Where possible, we should aim to use and embrace emerging standards in the data and analytics ecosystem.
Run at the edge and in the datacenter. Federated by design.
Embeddable scripting for in-process computation.
Not only does it expand the index, for cases like tracing where you have new values all the time, the index becomes larger than the time series data itself.
One way around this is to use fields rather than tags, but that is a limiting choice since you don’t have control over how data is organized in the DB, and thus how you might want to organize it outside of the tag system.
In order to support high cardinality use cases, we’d need to ditch the inverted index and also our indexing by individual time series. As our VP of Engineering, Ryan Betts, says: InfluxDB over indexes for these use cases.
InfluxDB uses memory mapped files for the inverted index and for the time series data storage. Many modern databases have been built using this because it gives you speed of development and offloads memory management to the OS.
The downside is that you lose fine-grained control over how memory is used and allocated. Mmap has also proven tricky in containerized environments.
Finally, we want to be able to run with or without locally attached storage. The way that TSM and TSI organize data doesn’t lend itself well to having some data in object storage, some in memory, and some cached on local SSD.
Once I realized that a gradual refactor wasn’t possible, I started thinking about what it would look like to start new in 2020 rather than 2013. What tools exist today that weren’t at my disposal seven years ago? What other open source could I bring to bear that would speed this effort up?
So we’re building a new core for InfluxDB. And here’s the first thing to know about it.
This project is written in Rust. I’ve written about my excitement for the language before. I think Rust is the future of systems software. It gives us the fine grained control over memory that we’re looking for, but with the safety of a higher level language.
Even better, its model for programming concurrent applications eliminates data races, and most server software, including this project, is heavily concurrent. Within our Go codebase, data races have been the source of a number of very hard-to-track-down bugs over the years. Rust’s error handling also helps developers write correct software and reduces the number of runtime bugs you might otherwise create.
Also, it’s embeddable into other languages and systems. This means we can embed it into InfluxDB or other parts of our stack or other analytics systems. We could even compile it down to web assembly and run it in the browser.
There’s so much to love about Rust, but this talk isn’t about that. Ultimately, I want this project to form the basis of future analytics systems for the next few decades and beyond. I remember a blog post Bryan Cantrill wrote about Rust in which he talked about software with longevity; he felt that Rust was a language that would ultimately help you build that kind of software. That’s the bet we’re making here.
The project is InfluxDB IOx, which is short for iron oxide so it’s pronounced InfluxDB eye-ox.
We’ll take a look at the high level architecture of it, but I just want to caveat this. This project is very early stage. We’ve largely been in research mode validating our assumptions on performance, compression and functionality. We’re not producing builds yet and we don’t have documentation up yet. But there’s a project README and you can build from source.
We wanted to open this up early so that our community of users could see what we’re doing.
The second thing to know is that this project is built around Apache Arrow. Arrow is an in-memory columnar data specification. But it’s also a persistence format via Apache Parquet, which is widely used both inside and outside the Arrow ecosystem. Most data warehouses and big data processing systems can read and write Parquet data.
Arrow also includes Arrow Flight, an RPC specification and high-performance client/server framework for transferring large datasets over the network.
Within the Rust part of Arrow is another project called DataFusion, which is a columnar SQL execution engine. We’re building on top of that and contributing to it.
We’re using all of these tools. The big headline with Arrow is that we’re no longer creating this database by ourselves. With Arrow as the core, we’re working with contributors around the world who are using these libraries in their own data systems.
This is the big architectural change. InfluxDB IOx is an in-memory columnar database that uses object storage for persistence with data stored in Parquet files.
We looked at the existing open source columnar databases when we were starting out. We wondered if they could form the basis of a future InfluxDB backend. What we found was that they weren’t optimized for time series. Specifically, they have varying degrees of dictionary support, which is critical for our use case, little support for querying directly on compressed in-memory data with late materialization, and they weren’t optimized for windowed aggregates and computation on time. They seem to be built around a pure analytics use case that asks questions about aggregations at a single point in time.
Further, they weren’t built with our core need of being able to run in an ephemeral environment with no locally attached storage, using object store for all persistence. Our evaluation pointed to a missing solution in the open source market.
It’s not a storage engine. We’re not building our own storage engine short of buffering data in memory and writing it out to Parquet files. The persistence formats we’re using under the hood are Flatbuffers for the write ahead log and Parquet files for immutable blocks of data.
With Parquet and object storage for persistence, this opens up how you can interact with your data. Backup and restore is outside the concerns of InfluxDB IOx. You can create any kind of backup & restore system you’d like. An IOx server can read some or all of its data from object storage on startup.
Bulk data transfers become trivial. Clients can get Parquet files directly from object storage and they can send Parquet files to InfluxDB IOx to organize in object storage for later query workloads. Thanks to Apache Arrow, there are libraries in many languages to work with Parquet and the support is getting better month over month. Notably, Python, C++ and Java are first class citizens in the Arrow ecosystem. They represent the gold standard of functionality. We’ll help bring Rust up to the same level of compatibility.
Training a machine learning model? Ask IOx where the Parquet files are that have the data you’re looking for, get them directly from object storage, and have the data in your Python library of choice, all with a few lines of code.
I should mention that I’m referring to object store, but there are other abstractions
I want to talk quickly about how data is organized in InfluxDB IOx. I think this is important because it shows the flexibility you have as an operator and a user and it lets you optimize for having large blocks of immutable unchanging data, which is really what time series is all about. If you’re updating your data, that means you’re literally rewriting history. Sometimes you might do this, but that’s not what we’re optimizing for. We’re optimizing for history being a fixed thing that you can work with easily and modify on the fly at query time.
That means that you have blocks of data that you can move around to other servers, send out to clients, and represent compactly in object storage.
First you have the partition key, which is generated for each line that comes in. It can use any of the metadata or actual data to generate a string that represents the partition key. You could have the measurement name, tag key information or field information or time/date formatting.
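As a sketch of that idea (the placeholder syntax and function here are hypothetical, not the actual IOx API), a partition key template might combine the measurement name, a tag value, and a time format:

```python
from datetime import datetime, timezone

def partition_key(measurement: str, tags: dict, time_ns: int, template: str) -> str:
    """Render a partition key from a hypothetical template: {measurement} and
    {tag:<key>} placeholders, plus strftime codes for the line's timestamp."""
    out = template.replace("{measurement}", measurement)
    for key, value in tags.items():
        out = out.replace("{tag:%s}" % key, value)
    ts = datetime.fromtimestamp(time_ns / 1e9, tz=timezone.utc)
    return ts.strftime(out)

# Lines written in the same hour for the same env share a partition key.
key = partition_key(
    "traces",
    {"env": "production", "data_centre": "us-east"},
    1_604_070_720_000_000_000,  # 2020-10-30T15:12:00Z as nanoseconds
    "{measurement}-{tag:env}-%Y-%m-%dT%H",
)
print(key)  # traces-production-2020-10-30T15
```

The point of making the key a function of metadata, data, and time is that the operator decides the grouping, rather than the database imposing one fixed layout.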
Partitions are logical groupings of data based on the same partition key. When a partition is snapshotted, you create an immutable block of data. A partition can have multiple blocks, but ideally you’re buffering up everything to snapshot once into a single block. You can always compact blocks later, but this can be a separate process completely outside of the DB.
Blocks have tables of data where a table is once again a logical concept. At the physical level, you have individual Parquet files, which have one table in each and you have in-memory compressed segments that are optimized for query speed with some compression via encoding schemes.
One table per measurement. Tags and fields become columns. One table per Parquet file.
This means that tag and field names must be unique within a measurement.
Schema gets defined and created on the fly as you write data in.
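A minimal sketch of what schema-on-write means here: each line-protocol line names its measurement (the table), and its tags and fields simply become columns (toy parser, no escaping or edge cases; not the real IOx code):

```python
def parse_line(line: str):
    """Map one line-protocol line to (table, columns) — schema on write.
    Simplified: assumes no escaped spaces/commas in tags or fields."""
    head, fields_part, ts = line.rsplit(" ", 2)
    measurement, *tag_pairs = head.split(",")
    columns = {"time": int(ts)}
    for pair in tag_pairs:                   # tags become columns
        key, value = pair.split("=", 1)
        columns[key] = value
    for pair in fields_part.split(","):      # fields become columns too
        key, value = pair.split("=", 1)
        if value.endswith("i"):
            columns[key] = int(value[:-1])   # integer field, e.g. 1500i
        elif value.startswith('"'):
            columns[key] = value.strip('"')  # string field
        else:
            columns[key] = float(value)      # float field
    return measurement, columns

table, cols = parse_line(
    'traces,env=production,trace_id=0000MjNg duration_ns=1500i 1604070720000000000'
)
print(table, cols)
```

The first line that arrives defines the columns; later lines can add new columns, so there is no upfront schema declaration and no distinction the writer has to agonize over.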
But it’s a start. And we know that we can switch to Parquet as our persistence format without any fear of some sort of data explosion.
We break data up into partitions. How data is partitioned can change over time, because each partition is self describing in terms of the summary metadata that specifies what tables it has, what columns each of those tables has, and what the summary information is for each of those columns like min, max, count, sum and potentially even bloom filters for identifiers.
This summary data is used by the planner at query time. Partition summaries are kept in memory and the query is analyzed to determine which partitions need to be queried to produce a result. Once in a partition, we brute force query against it, and if we have it in our segment store, that happens against compressed data without decompressing it. That is, we perform late materialization and only decompress the values we use.
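The pruning step can be sketched as a simple interval-overlap check against each partition's column summaries (illustrative names and layout, not the actual IOx structures):

```python
from dataclasses import dataclass

@dataclass
class ColumnSummary:
    min: int
    max: int

# In-memory partition summaries: partition key -> column -> [min, max].
# Hypothetical 2h time partitions for 2020-10-30 (values are epoch ns).
summaries = {
    "traces-2020-10-30T14": {"time": ColumnSummary(1604066400_000000000, 1604073599_999999999)},
    "traces-2020-10-30T16": {"time": ColumnSummary(1604073600_000000000, 1604080799_999999999)},
    "traces-2020-10-30T18": {"time": ColumnSummary(1604080800_000000000, 1604087999_999999999)},
}

def prune(summaries, column, lo, hi):
    """Keep partitions whose [min, max] summary overlaps the query range [lo, hi)."""
    return [
        key for key, cols in summaries.items()
        if column in cols and cols[column].min < hi and cols[column].max >= lo
    ]

# The 15:12-16:12 lookup from the earlier slide touches only two partitions.
lo = 1604070720_000000000  # 2020-10-30T15:12Z
hi = 1604074320_000000000  # 2020-10-30T16:12Z
print(prune(summaries, "time", lo, hi))
```

Real summaries would also carry count, sum, and possibly bloom filters per column, so the same check works for identifier columns like trace_id, not just time.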
This means that the partitioning scheme you choose has great impact on what your queries look like. This is why we let the users define it when they create a database/bucket. It can change on a per-database basis.
We can likely do better. We’re using RLE for the span IDs and trace IDs, and we’d be better off just going with dictionary encoding without the RLE.
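To see why, here's a toy illustration (not the actual IOx encoder): RLE stores one (count, value) pair per run, so a column of effectively unique IDs gains nothing and pays the per-run overhead, while a sorted low-cardinality column collapses dramatically:

```python
import random

def rle_encode(values):
    """Run-length encode a sequence into [count, value] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1
        else:
            runs.append([1, v])
    return runs

random.seed(42)
# span_id-like column: effectively unique values, so no runs to exploit.
span_ids = [f"{random.getrandbits(64):016x}" for _ in range(10_000)]
# trace_id-like column sorted by the partition/sort key: long runs.
trace_ids = sorted(random.choice(["t1", "t2", "t3"]) for _ in range(10_000))

print(len(rle_encode(span_ids)))   # ~10,000 runs: pure overhead for unique IDs
print(len(rle_encode(trace_ids)))  # a handful of runs: RLE wins when values repeat
```

For unique identifiers, a plain dictionary (one integer code per row) avoids carrying a run length that is almost always 1.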
Notice that we have time in this example. If you’re looking up by some trace ID, where’d you get it? From a log line? You’ll have a timestamp associated with it. Use it.
If you’re partitioning your data by time, and in most cases this will be at least one of the criteria by which you partition your data, you can quickly narrow down the blocks of data to query against. If you have 2h partitions, then you’ll be able to find the spans you’re looking for by querying at most 2 partitions.
This returns the 10 rows in about 85 milliseconds. If you do the rough math on this it means it was able to brute force on about 1.1B rows/sec. Note that we didn’t actually process all those rows. It was operating on compressed data.
We can likely get this down by a bit more by removing the RLE compression for trace ID and span. Maybe another 2x improvement.
The specifics of the compressed in memory columnar store will definitely be the subject of some future tech talks.
Here’s what I think the real future is. The example I just showed takes a data center centric view. It assumes that all your data is getting pushed up to some central cluster. I think the future is federated. It operates at the edges as single nodes, it operates in factories in small clusters, and it operates in many data centers worldwide.
You’ll likely have high-precision data that doesn’t make sense to replicate up to a central place, or at least you’ll only replicate it in highly compressed form. The future distributed time series system isn’t a cluster that runs in a data center, even if it has rack-aware capabilities and multi-region routing.
There’s no limit to the scale of time series data that we’ll be collecting over the coming decades. We need flexibility in how it’s replicated, queried, and stored.
* Created InfluxDB because we saw so many people re-inventing the wheel and we wanted Influx to be the basis of it
* However, the lack of distributed features left a gap in the market
* Infrastructure projects that fall under source available or community licenses severely limit the audience and what you can build
* InfluxDB IOx is dual-licensed under MIT and Apache 2 as is common in the Rust community. No community license, no source available license, no restrictions. You can build new projects using this code, you can build new businesses using this code, you can do whatever you want with it.
Conway’s law says that you ship your org chart. That is, if you create two teams to build a system, you’ll get a system comprised of two parts.
I propose Dix’s maxim as it relates to open source and licensing generally, which is that your licensing strategy is your commercialization strategy, whether by accident or design.
The architecture approaches for IOx are deliberate choices because of not only the functionality and operational properties we wanted in the system, but also in how we plan to commercialize it.
InfluxDB IOx is designed to be a shared-nothing server that has an API giving the operator total control over how it behaves. However, the operator must make those changes as they are needed. Who does this operation and coordination?
In the simplest setup of a single server, you don’t worry about it. In a two-server setup you can likely get by with shell scripts and a cron job.
But the more complex your environment becomes, the more complicated this coordination becomes. It was a design goal for us to separate the core database work from the operational work across a fleet of servers. We will create this software for our own needs to operate our cloud environment. However, our cloud environment may be different from yours. This is why the operational coordination is kept separate: so there is maximum flexibility in topology and configuration.
We plan to run the InfluxDB IOx open source bits as is in our own cloud. We won’t be running a fork, we’ll be running right off the main branch.
At the beginning of this talk I mentioned my introduction of InfluxDB to the world. And I titled it this.
I’ll be giving more talks about InfluxDB IOx over the coming months. But here’s how I’m thinking about it. Yes, it’s a distributed time series database. But it’s a lot more than just that.
It’s federated and this is a core part of its design. With time series and analytics data, the future is federated. The scale is larger than you’ll want to manage and push up to a single cluster. You’ll have edge, multiple data centers, and many thousands of potential nodes all communicating with each other.