@@ -137,6 +137,14 @@ \section{Introduction}\label{sec:intro}
137137clone/branch, push/pull, diff, merge, revert, on terabytes of data almost
138138instantly.
139139
140+ A data version control system inside a relational database also unlocks
141+ many AI applications on structured data. Relational databases hold
142+ high-quality, high-value datasets but are often not flexible enough for data engineers
143+ to conduct experiments. MatrixOne allows data engineers to label data,
144+ to make hypothetical changes to data, to compare and review these changes,
145+ and to join or aggregate different versions of data with the full power of SQL,
146+ all without any disruption to existing business applications.
147+
140148In the rest of this paper, we will first introduce the version control operations
141149supported by MatrixOne. We explain the semantics of these operations and
142150walk through a typical day to day workflow of a data engineer using MatrixOne.
@@ -208,8 +216,8 @@ \section{Version Control Operations}\label{sec:vcop}
208216 SNAPSHOT T{snapshot='sn2'}
209217\end {verbatim }
210218Restore will overwrite all modifications in $ TClone_{sn3}$ and completely replace
211- data of \texttt {TClone } with data of $ T_{sn2}$ . It is equivalent to
212- \texttt {git reset --hard sn2 }.
219+ data of \texttt {TClone } with data of $ T_{sn2}$ . It is equivalent to performing
220+ \texttt {git reset --hard sn2 } on \texttt {TClone }.
213221
214222Users can diff two snapshots, which may or may not be of the same table, using
215223\begin {verbatim }
@@ -235,8 +243,8 @@ \section{Version Control Operations}\label{sec:vcop}
235243Each row in the result of \texttt {SNAPSHOT DIFF } represents a potential conflict.
236244\texttt {SNAPSHOT DIFF } does not require that the two snapshots be branched from a
237245common base revision as long as they have the same schema, that is, the same
238- column names and types in the same order, and same primary key definition
239- ( if the tables have primary key) . Later in the paper we will see that when
246+ column names and types in the same order and the same primary key definition
247+ if the tables have one. Later in the paper we will see that when
240248two snapshots share a common base revision, MatrixOne can perform diff and merge
241249between them very efficiently.
242250
@@ -537,7 +545,7 @@ \subsection{Two Way Merge}
537545section \ref {sec:vcop } by simply observing that rows in the common objects
538546of the two tables will cancel each other out.
539547
540- \subsection {Discussion }
548+ \subsection {Discussion } \label {sec:discussion }
541549We discuss some interesting issues and possible future work related to
542550the implementation of version control operations of MatrixOne.
543551
@@ -607,6 +615,17 @@ \subsubsection{Large Object Types}
607615like S3. MatrixOne does not manage changes of the external resource.
608616A datalink value is changed only if the URL is changed.
609617
618+ \subsubsection {Schema Change }
619+ Users can change the schema of a table using the \texttt {ALTER TABLE }
620+ statement. In particular, MatrixOne supports \texttt {RESTORE TABLE }
621+ to a snapshot that was taken before the schema change. However,
622+ if a user alters the schema of a table or of its clone, MatrixOne
623+ will not be able to perform diff or merge between the two tables
624+ because their schemas are then different. To use data
625+ version control on such a table, it is generally advisable to make
626+ schema changes on a table before cloning it.
627+
628+
610629\section {Experimental Results }
611630We performed a series of experiments to evaluate the performance of
612631our version control operations. We used the TPCH 1TB dataset on one
@@ -622,8 +641,26 @@ \section{Experimental Results}
622641TODO: really do the work.
623642
624643\section {Conclusion and Future Work }
625-
626- TODO: You really want to say something
644+ MatrixOne has a powerful snapshot system, and on top of it we have
645+ developed a version control system for data. We support all common
646+ data version control operations, such as clone, tag, diff, merge, and
647+ revert, on large amounts of data. Teams of data engineers can cooperate
648+ and work on the same dataset. They can work on the same table, and
649+ database transactions will handle concurrency and consistency.
650+ They can also fork a table, make modifications, and merge the changes
651+ back to the original table. The fork/merge model allows data
652+ engineers to publish a ``complete and clean'' revision of a dataset.
653+ Data engineers are free to experiment, saving intermediate results
654+ and reverting or rolling back bad changes without fear of losing data.
655+ \texttt {SNAPSHOT DIFF } allows data engineers to review
656+ the changes between two snapshots. All these operations
657+ are very efficient in both time and storage space.
658+
659+ Section \ref {sec:discussion } discusses some interesting issues
660+ and possible improvements. Better or smarter conflict
661+ resolution strategies are one of the important areas to work on.
662+ We will continue to work with customers on real-world use
663+ cases to further improve our version control system.
627664
628665
629666% % \begin{table}