Commit 2999279

Minor, adding conclusion.
1 parent 48d3c85 commit 2999279

1 file changed

+44
-7
lines changed

paper/git4data/git4data.tex

Lines changed: 44 additions & 7 deletions
@@ -137,6 +137,14 @@ \section{Introduction}\label{sec:intro}
137137
clone/branch, push/pull, diff, merge, revert, on terabytes of data almost
138138
instantly.
139139

140+
The data version control system in a relational database system also unlocks
141+
many AI applications on structured data. Relational databases hold very
142+
high-quality, high-value datasets but are often not flexible enough for data engineers
143+
to conduct experiments. MatrixOne allows data engineers to label data,
144+
to make hypothetical changes to data, to compare and review these changes,
145+
to join or aggregate different versions of data with the full power of SQL,
146+
all without any disruption to existing business applications.
147+
140148
In the rest of this paper, we will first introduce the version control operations
141149
supported by MatrixOne. We explain the semantics of these operations and
142150
walk through a typical day-to-day workflow of a data engineer using MatrixOne.
@@ -208,8 +216,8 @@ \section{Version Control Operations}\label{sec:vcop}
208216
SNAPSHOT T{snapshot='sn2'}
209217
\end{verbatim}
210218
Restore will overwrite all modifications in $TClone_{sn3}$ and completely replace
211-
data of \texttt{TClone} with data of $T_{sn2}$. It is equivalent to
212-
\texttt{git reset --hard sn2}.
219+
data of \texttt{TClone} with data of $T_{sn2}$. It is equivalent to performing
220+
\texttt{git reset --hard sn2} on \texttt{TClone}.
213221

214222
Users can diff two snapshots, which may or may not be of the same table, using
215223
\begin{verbatim}
@@ -235,8 +243,8 @@ \section{Version Control Operations}\label{sec:vcop}
235243
Each row in the result of \texttt{SNAPSHOT DIFF} represents a potential conflict.
236244
\texttt{SNAPSHOT DIFF} does not require that the two snapshots be branched from a
237245
common base revision as long as they have the same schema, that is, the same
238-
column names and types in the same order, and same primary key definition
239-
(if the tables have primary key). Later in the paper we will see that when
246+
column names and types in the same order, and the same primary key definition
247+
if the tables have one. Later in the paper we will see that when
240248
two snapshots share a common base revision, MatrixOne can perform diff and merge
241249
between them very efficiently.
242250

@@ -537,7 +545,7 @@ \subsection{Two Way Merge}
537545
section \ref{sec:vcop} by simply observing that rows in the common objects
538546
of the two tables will cancel each other out.
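
To see why, consider a simplified model, which is an assumption of this sketch rather than a description of the implementation: treat each table as the set of rows stored in its objects, ignoring deletions. Let $C$ denote the rows in the objects common to both tables, and let $A$ and $B$ (hypothetical names introduced here) denote the rows in the objects private to the first and the second table, respectively. If a diff reports the rows that appear in exactly one of the two tables, its result is
\[
(C \cup A) \mathbin{\triangle} (C \cup B) = (A \mathbin{\triangle} B) \setminus C,
\]
where $\triangle$ is the symmetric difference. Every row of $C$ appears on both sides and drops out, so under this model only the private objects need to be scanned.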
539547

540-
\subsection{Discussion}
548+
\subsection{Discussion} \label{sec:discussion}
541549
We discuss some interesting issues and possible future work related to
542550
the implementation of version control operations of MatrixOne.
543551

@@ -607,6 +615,17 @@ \subsubsection{Large Object Types}
607615
like S3. MatrixOne does not manage changes of the external resource.
608616
A datalink value is changed only if the URL is changed.
609617

618+
\subsubsection{Schema Change}
619+
Users can make schema changes on a table using the \texttt{ALTER TABLE}
620+
statement. In particular, MatrixOne supports \texttt{RESTORE TABLE}
621+
to a snapshot that was taken before the schema change. However,
622+
if a user alters the schema of a table or of its clone, MatrixOne
623+
will not be able to perform diff or merge between the two tables
624+
because the schemas of the two tables are different. To use data
625+
version control on such a table, it is generally advised to make
626+
schema changes on a table before cloning it.
627+
628+
610629
\section{Experimental Results}
611630
We performed a series of experiments to evaluate the performance of
612631
our version control operations. We used the TPC-H 1TB dataset on one
@@ -622,8 +641,26 @@ \section{Experimental Results}
622641
TODO: really do the work.
623642

624643
\section{Conclusion and Future Work}
625-
626-
TODO: You really want to say something
644+
MatrixOne has a powerful snapshot system, and based on it we have
645+
developed a version control system for data. We support all common
646+
data version control operations, such as clone, tag, diff, merge, and
647+
revert, on large amounts of data. Teams of data engineers can cooperate
648+
and work on the same dataset. They can work on the same table, and
649+
database transactions will handle concurrency and consistency.
650+
They can also fork a table, make modifications, and merge the changes
651+
back to the original table. The fork/merge model allows data
652+
engineers to publish a ``complete and clean'' revision of a dataset.
653+
Data engineers are free to experiment, saving intermediate results
654+
and reverting or rolling back bad changes without fear of losing data.
655+
\texttt{SNAPSHOT DIFF} allows data engineers to conduct
656+
data review on changes between two snapshots. All these operations
657+
are very efficient in both time and storage space.
658+
659+
Section \ref{sec:discussion} discusses some interesting issues
660+
and possible improvements. Better or smarter conflict
661+
resolution strategies are one of the important areas to work on.
662+
We will continue to work with customers on real-world use
663+
cases to further improve our version control system.
627664

628665

629666
%% \begin{table}
