Arrow Flight is a proposed RPC layer for Apache Arrow that allows for efficient transfer of Arrow record batches between systems. It uses gRPC as the foundation to define streams of Arrow data that can be consumed in parallel across locations. Arrow Flight supports custom actions that can be used to build services on top of the generic API. By extending gRPC, Arrow Flight aims to simplify the creation of data applications while enabling high-performance data transfer and locality awareness.
2. Apache Arrow
Why Arrow Flight: Arrow Promises Interoperability
• But its primary medium is in-memory
• Some work to support shared memory in-process
• But not all systems can be collocated
– Especially in a modern K8s/containerized deployment
• Shared memory has other problems:
– Reference management and security are complex
– Different requirements for long-term datasets versus ephemeral datasets
Arrow Needs an RPC layer to simplify the creation of Data Applications
3. Apache Arrow
Arrow Messaging Paradigm: Batch Streams
Primary Communication:
• A Stream of Arrow Record Batches
• Bulk transfer targeting efficient movement
• Effectively Peer to Peer
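[Diagram: Client and Server. Put: the client sends Header, Data, Data, Data, end and the server answers Thanks. Get: the client sends a Descriptor and the server returns Header, Data, Data, Data, end.]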
Specific Methods:
• Put Stream: Client sends a stream to server
• Get Stream: Server sends a stream to client
• Both Initiated by Client
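The Put/Get exchange above can be sketched in a few lines of plain Python. This is only a toy model under stated assumptions: the message tuples, `encode_stream`, `decode_stream`, and `ToyServer` are hypothetical names invented for illustration and are not part of the Arrow Flight API; lists stand in for real Arrow record batches.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical message shape: (kind, payload), where kind is
# "header" (carries the schema), "data" (one record batch), or "end".
Message = Tuple[str, object]


def encode_stream(schema: object, batches: Iterable[object]) -> Iterator[Message]:
    """Yield a header, then one data message per batch, then an end marker."""
    yield ("header", schema)
    for batch in batches:
        yield ("data", batch)
    yield ("end", None)


def decode_stream(messages: Iterable[Message]) -> Tuple[object, List[object]]:
    """Consume a stream and return (schema, batches); reject malformed streams."""
    it = iter(messages)
    kind, schema = next(it)
    if kind != "header":
        raise ValueError("stream must start with a header")
    batches: List[object] = []
    for kind, payload in it:
        if kind == "end":
            return schema, batches
        if kind != "data":
            raise ValueError(f"unexpected message kind: {kind}")
        batches.append(payload)
    raise ValueError("stream finished without an end marker")


class ToyServer:
    """In-memory stand-in for a server; both calls are initiated by the client."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[object, List[object]]] = {}

    def put(self, descriptor: str, messages: Iterable[Message]) -> str:
        # Put Stream: the client sends a stream, the server acknowledges it.
        self._store[descriptor] = decode_stream(messages)
        return "Thanks"

    def get(self, descriptor: str) -> Iterator[Message]:
        # Get Stream: the client sends a descriptor, the server streams back.
        schema, batches = self._store[descriptor]
        return encode_stream(schema, batches)
```

A client would call `server.put("demo", encode_stream(schema, batches))` and later `decode_stream(server.get("demo"))` to read the same batches back.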
4. Apache Arrow
[Diagram: a Flight whose Streams are spread across Location 1 and Location 2; each Endpoint is retrieved with a Ticket]
Arrow Messaging Paradigm: Stream Management
• Parallel consumption and locality awareness
– A flight is composed of streams
– Each stream has a FlightEndpoint: an opaque stream ticket along with a consumption location
– Systems can take advantage of location information to improve data locality
• Flights have two reference systems:
– Dotted path namespace for simple services (e.g. marketing.yesterday.sales)
– Arbitrary binary command descriptor (e.g. “select a,b from foo where c > 10”)
• Support for Stream Listing
– ListFlights(Criteria)
– GetFlightInfo(FlightDescriptor)
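As a rough illustration of the stream-management ideas above, the sketch below models descriptors, endpoints, and lookup in plain Python. The class names echo the slide's terms, but the fields, the `ToyRegistry` class, the prefix-based listing criteria, and `plan_reads` are assumptions made for this example, not the actual Flight definitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FlightDescriptor:
    """Either a dotted path or an opaque binary command (assumed fields)."""
    path: Optional[Tuple[str, ...]] = None
    command: Optional[bytes] = None

    @classmethod
    def for_path(cls, dotted: str) -> "FlightDescriptor":
        return cls(path=tuple(dotted.split(".")))

    @classmethod
    def for_command(cls, command: bytes) -> "FlightDescriptor":
        return cls(command=command)


@dataclass(frozen=True)
class FlightEndpoint:
    ticket: bytes   # opaque to the client; only the server interprets it
    location: str   # where this stream can be consumed


@dataclass(frozen=True)
class FlightInfo:
    descriptor: FlightDescriptor
    endpoints: Tuple[FlightEndpoint, ...]


class ToyRegistry:
    """Hypothetical in-memory catalogue of flights."""

    def __init__(self) -> None:
        self._flights: Dict[FlightDescriptor, FlightInfo] = {}

    def register(self, info: FlightInfo) -> None:
        self._flights[info.descriptor] = info

    def get_flight_info(self, descriptor: FlightDescriptor) -> FlightInfo:
        return self._flights[descriptor]

    def list_flights(self, prefix: str = "") -> List[FlightInfo]:
        # Assumed criteria: a dotted-path prefix; an empty prefix lists everything.
        wanted = tuple(prefix.split(".")) if prefix else ()
        return [
            info
            for descriptor, info in self._flights.items()
            if not wanted
            or (descriptor.path is not None and descriptor.path[: len(wanted)] == wanted)
        ]


def plan_reads(info: FlightInfo, local_location: str) -> List[FlightEndpoint]:
    """Order endpoints so those at the caller's own location come first."""
    return sorted(info.endpoints, key=lambda e: e.location != local_location)
```

Because every endpoint carries its own ticket and location, a consumer could hand each endpoint to a separate worker and read the streams in parallel; `plan_reads` only shows the locality-ordering half of that idea.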
5. Apache Arrow
Arrow Messaging Paradigm: Data as a Service Customization
• Arrow Flight also supports a simple Generic Messaging Framework
– Support Customization and Extensibility within the Arrow Flight context
• ListActions()
– Each Data Service can expose actions along with descriptions about what they support
– Each action should describe how to structure the action and corresponding result
– Normal HTTP/2 exceptions can be used to manage error states
• DoAction(Action) => Result
– Generic Containers that can carry Data Service specific operations
– Examples might include: forget stream, load stream from disk
• Actions and Results each have:
– ActionType String token
– Body: JSON body of instruction
• Arrow Flight Clients can be written without knowledge of custom Actions/Results
– Lightweight wrappers can be built for Data Services as needed
– Or Simply use existing JSON tooling on top of generic API
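The action mechanism described above can be approximated with a small dispatcher. This is a minimal sketch, assuming an action is just a type token plus a JSON body; `ToyDataService`, the `forget_stream` token, and the shape of its JSON arguments are hypothetical and chosen only to mirror the "forget stream" example.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Action:
    type: str   # ActionType string token
    body: str   # JSON body of the instruction


@dataclass
class Result:
    type: str
    body: str   # JSON body of the outcome


class ToyDataService:
    """Hypothetical service exposing custom actions through a generic API."""

    def __init__(self) -> None:
        self.streams: Set[str] = {"demo"}
        self._handlers: Dict[str, Tuple[str, Callable[[dict], dict]]] = {}
        self.register(
            "forget_stream",
            'Drop a stream. Body: {"name": <string>}. Result: {"forgotten": <bool>}',
            self._forget_stream,
        )

    def register(self, action_type: str, description: str,
                 handler: Callable[[dict], dict]) -> None:
        self._handlers[action_type] = (description, handler)

    def list_actions(self) -> List[Tuple[str, str]]:
        # Each entry describes how to structure the action and its result.
        return [(t, desc) for t, (desc, _) in self._handlers.items()]

    def do_action(self, action: Action) -> Result:
        if action.type not in self._handlers:
            raise KeyError(f"unknown action: {action.type}")
        _, handler = self._handlers[action.type]
        outcome = handler(json.loads(action.body))
        return Result(action.type, json.dumps(outcome))

    def _forget_stream(self, args: dict) -> dict:
        existed = args["name"] in self.streams
        self.streams.discard(args["name"])
        return {"forgotten": existed}
```

A generic client needs no knowledge of `forget_stream`: it can call `list_actions()`, read the description, and build the JSON body with ordinary JSON tooling before calling `do_action`.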
6. Apache Arrow
But How? gRPC as a Foundation
• Generic RPC generation framework
• Built on HTTP/2 Standard
• Many language bindings (see right)
• Supports security & compression
• Uses Protobuf as primary format
• Designed primarily for application messaging
7. Apache Arrow
Extend gRPC To Better Work With Arrow Streams
• Streams are valid Protobuf Objects so systems that don’t have custom processing can still consume Arrow streams
– The entirety of the Arrow RecordBatch is a single length-delimited Protobuf “bytes” field.
• For high performance situations, do direct byte encoding and one-copy reads/zero-copy writes to avoid extra copies/overhead
– Java Flight implementation cuts through multiple layers to achieve this using currently released gRPC (despite no formal support for it).
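To make the "single length-delimited bytes field" point concrete, the sketch below hand-encodes one such field in the standard Protobuf wire format (a varint tag with wire type 2, a varint length, then the raw payload) and reads it back through a `memoryview` so the payload is sliced rather than copied. The field number `BODY_FIELD = 1` is an assumption for illustration, not the number used by the Flight proposal, and this Python code is not how the Java implementation achieves its copy avoidance.

```python
from typing import Tuple

WIRE_TYPE_LEN = 2   # Protobuf wire type for length-delimited fields
BODY_FIELD = 1      # hypothetical field number for the record batch bytes


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: memoryview, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_bytes_field(field_number: int, payload: bytes) -> bytes:
    """Emit tag, length, and payload for one length-delimited field."""
    tag = (field_number << 3) | WIRE_TYPE_LEN
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def read_bytes_field(buf: bytes) -> Tuple[int, memoryview]:
    """Return (field number, payload view) without copying the payload."""
    view = memoryview(buf)
    tag, pos = decode_varint(view, 0)
    if tag & 0x07 != WIRE_TYPE_LEN:
        raise ValueError("not a length-delimited field")
    length, pos = decode_varint(view, pos)
    return tag >> 3, view[pos:pos + length]
```

Any generic Protobuf reader would see the payload as an ordinary bytes value, while a specialised reader can hand the sliced view straight to Arrow-aware code.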
8. Apache Arrow
Check it out
• Arrow Flight Proposal
– https://github.com/jacques-n/arrow
• Example Usage in Dremio Formation
– https://github.com/jacques-n/formation