This package is an attempt to make a minimal interface for fast tables and relational operations in Julia. The primary goals are to provide:
- A minimalistic extension of core Julia concepts to support relational algebra. The common
definition of a relation is as a collection of named tuples. Here, a
Tableis just anAbstractVectorofNamedTuples, a Cartesian product of tables is anAbstractMatrixofNamedTuples, and we my support other collections ofNamedTuples in the future. - Standard Julia operations are fast. A strongly typed data structure provides fast
iteration of rows. Column-based storage and (some) lazy evaluation provides for efficient
operations for analytics workloads, and operations like
filter,mapandfindallare specialized for speed. - An extensible system of acceleration indexes allows for maximal speed, no matter the
problem or domain. Common acceleration indexes like hash- or sort-based indexes are
provided by the package, along with
Predicates that know how to take advantage of the structure. The system is extensible such that an external package may introduce e.g. accelerated spatial indexing, or inverted indexing to support accelearted textual search, etc.
The package is a work-in-progress and currently only provides a rather low-level interface to relational algebra operations and acceleration indexes. A higher-level interface for end users may be desirable.
A "table" is a finite relation, which is simply a collection of named tuples (rows). Here,
we think of this as any AbstractVector{<:NamedTuple{names}} (though we may extend this to
AbstractDicts later). The package provides a concrete Table type which stores data as
columns, but presents it to Julia as rows of NamedTuple (i.e. a relation).
Constructing a Table is straight forward, such as:
Table(a = [1,2,3], b = [2.0, 4.0, 6.0])Relational algebra consists of a small set of operations. One can "project" a table to fewer columns using:
project(table, (:a, :b))
# or
Project{(:a, :b)}()(table)Similarly columns can be renamed with the rename function (or Rename singleton type).
Rows can be manipulated with the standard Julia AbstractVector interface. Operations like
vcat, union, intersect, setdiff will produce new vectors of named tuples. We can
"select" certain rows using filter, or find the row indexes of certain rows using findall,
or use map to construct a boolean array indicating the image of matching- and non-matching rows.
For example, we may chose to use a perfectly standard Julia filter operation to obtain only
the rows where the values in column a are equal to 1 like so:
filter(row -> row.a == 1, table)To make certain operations faster, tables may provide a set of "acceleration indexes" which allow mapping of some condition to the row indices. These might include a mapping from the hash of some columns to a row index, or the row indices sorted in a certain order, or simply the knowledge that the values in one or more columns is unique.
An acceleration index provides information about one-or-more columns of a Table to allow
for fast lookup of relevant rows. These indexes may accelerate operations such as filtering
rows, performing grouping and aggregates, or joins with other tables, and are typically
attached to a Table with the accelerate function like so:
table2 = accelerate(table, HashIndex{(:a,)})Multiple indexes may be attached to a single Table. The built-in set of AbstractIndexes
are:
SortIndex- prodides the ordering (with respect toisless) of the rows of one or more columns.UniqueSortIndex- like aSortIndex, but also with the guarantee that the elements of the specified column(s) are unique.HashIndex- provides aDictmapping from the values of one or more columns to the corresponding row indices.UniqueHashIndex- like aHashIndex, but also with the guarantee that the elements of the specified column(s) are unique.UniqueIndex- demarks simply that the elements of certain row(s) are unique.NoIndex- an internal placeholder used where no suitable index is found.
External packages are free to create subtypes of AbstractIndex{names} and specialized
algorithms that make use of the new accelerations.
Julia needs a way of determining how to use an acceleration index to make a given operation faster. We might attempt to use an anonymous function to indicate which rows to select, like our earlier example:
filter(row -> row.a == 1, table)Unfortunately, this obscures a lot of information about our query such as that we only care
about the values in column :a, and that we are performing an equality comparison to 1.
This means we have no way of knowing how to use one of acceleration indexes to make this
operation faster.
To solve this, this package provides the Predicate abstract type.
Subtypes of Predicate{names} are Functions that always return a Bool and depend only on
the values in the columns names. Each Predicate may be used with filter, findall or
map in combination with an acceleration index to achieve a faster result than a full table
scan. The package will attempt to use the best acceleration index for the given Predicate
and operation. Some example usage:
filter(IsEqual(a = 1), table)
findall(IsLess(b = 2.0), table)
map(In(c = 3..10), table)The built-in predicates designed to work on a single table are:
IsEqual- returnstrueif a the element is equal to the specified value. For example,filter(IsEqual(a = 1), table)will use hash- or sort-based acceleration indexes for fast searches of rows where the column:aisisequalto1. Multiple columns may be specified (such asIsEqual(a = 1, b = 2.0)).IsLess- returnstrueif the element is less than the specified value. For example,filter(IsEqual(a = 1), table)will return all the rows where the values of column:ais less than1according toisless, and is accelerated by sort-based indexes. Multiple columns may be specified.IsLessEqual- similar toLessThan, but also accepts equal elements.IsGreater- similar toLassThan, but only accepts greater elements.IsGreaterEqual- similar toLassThan, but only accepts greater or equal elements.In- returnstrueif the element is in a given collection. Specializes for speed onInterval(which is a minimalistic, v0.7-compatible version of IntervalSets.jl). This allows for fast "windowing" type searches such asfilter(In(a = 3..10), table)in combination with sort-based indexes. Multiple columns may be provided, for a "box"-style search.
The final relational algebra operation we haven't mentioned is the Cartesian product of two
tables. This package provides the ProductTable type which is an AbstractMatrix of
NamedTuples. Example usage:
t3 = ProductTable(t1, t2)So far, we haven't mentioned a "join" operation. In relation algebra, a join is simply the
combination of a Cartesian product operation followed by a filter over a Predicate such
as equality between two columns. Predicates that relate multiple columns of a table are
provided and can be used to accelerate filter, findall and map of a ProductTable,
i.e. facilitate fast joins. For example
t4 = filter(Equals(:a, :b), t3)will join t1 and t2 on equality of columns :a and :b. The set of built-in comparitive
Predicates are
Equals- values in two columns are equal according toisequalLessThan- value in first column is less than second according toislessLessEqualThan- value in first column is less than or equal to secondGreaterThan- value in first column is greater than secondGreaterEqualThan- value in first column is greater than or equal to secondWithin- values in the two columns are within a specified distance of each other
The joining operation can occur in one of three ways
- No acceleration index is used, so a full double loop over the two tables is required.
- A loop is perfomed over the rows of one table (say
t1) and for each of these rows a newPredicateis created to query the second table (e.g.t2). For example,Equalsmay result in a series ofIsEqualsearches on the second table, orWithinmay create a series ofIn(::Interval)searches of the second table for windowing-style joins. - A sort-merge join is perfomed where two sort-based indexes are available and the predicate
is
IsEqual.
This package needs some more concise surface-level syntax, such as some concept of join
and groupby and so-on. One possibility is to integrate this with SplitApplyCombine.jl.
So far all Tables are vectors and ProductTables are limited to 2D. Ideally we could take
the Cartesian product of arbitrarily many tables.
For ultimate speed of complex queries, we will probably need to make some more operations behave in a lazy, streaming fashion so e.g. multiple filtering operations may occur in a single loop.
All the predicates are using isequal and isless, for simplicity. We should really
support sensible data behavior with NaN and missing.
As always, performance can be optimized. There are some simple benchmarks in the
benchmarks/ directory.
I/O operations (including show) should be provided somehow (hopefully using the package
ecosystem).