How to Migrate from Subversion
This page provides tips for migrating Subversion repositories to Git.
Conversion from Subversion to Git is typically done using the git svn command that comes with Git.
It is not necessary to use the separate svn2git tool.
Note
Command-line examples in this document assume a bash prompt and that the current working directory persists across all sections.
The git svn command provides an init sub-command to configure
a Git repository for incremental bidirectional operation with a
Subversion repository. However, for a one-time permanent conversion
we suggest manual configuration.
Create a bare local Git repository to perform the initial conversion:
$ mkdir example-svn.git
$ cd example-svn.git
$ git --bare init
Configure git svn with the Subversion repository top level URL.
For example:
$ git config svn-remote.svn.url https://ncisvn.nci.nih.gov/svn/example
prepares a Git repository to convert the example Subversion repository.
Alternatively, if one has a copy of the Subversion repository server files, see Tip: Fetch Locally below.
Subversion records commit authorship by username but Git records
it by name and email address, e.g. Your Name <[email protected]>.
In order to convert one must construct a map from username to name and email.
Use svn log to list all user names recorded in the Subversion repository.
For example, the command:
$ url=https://ncisvn.nci.nih.gov/svn/example &&
  svn log -q "$url" |grep '^r' |cut -f 2 -d '|' |sed 's/^ *//;s/ *$//' |sort|uniq
user1
user2
...
lists all user names that ever committed to the example repository.
From this list create an authors file, say authors.txt, of the form:
$ cat authors.txt
user1 = Name One <[email protected]>
user2 = Name Two <[email protected]>
...
The authors file must contain a complete map or the tool may complain
later during conversion.
If there are any unknown entries choose something reasonable.
For example, Subversion repositories converted from CVS by cvs2svn
may contain a "(no author)" user. Map this to a name such as:
(no author) = no_author <no_author@no_author.cvs2svn>
to indicate that the commit was generated by the tool. If no reasonable authorship information is available for an unknown username one may map the name to itself:
unknown = unknown <unknown>
but this should be a last resort.
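A quick completeness check can catch missing entries before the fetch step. The following is a minimal sketch, assuming the username list from svn log has been saved one name per line to a file called users.txt (a hypothetical name not used elsewhere on this page). The sample data is made up so that the snippet runs on its own:

```shell
# Sketch: report usernames that have no entry in authors.txt.
# users.txt is a hypothetical file holding one Subversion username per line.
work=$(mktemp -d)
cat > "$work/users.txt" <<'EOF'
user1
user2
(no author)
EOF
cat > "$work/authors.txt" <<'EOF'
user1 = Name One <one@example.invalid>
(no author) = no_author <no_author@no_author.cvs2svn>
EOF

# Keep the part of each authors.txt line before " = ", then list the
# usernames that appear in users.txt but not among those keys.
sed 's/ = .*$//' "$work/authors.txt" | sort -u > "$work/mapped.txt"
sort -u "$work/users.txt" | comm -23 - "$work/mapped.txt" > "$work/missing.txt"
cat "$work/missing.txt"
```

With this sample data the only name reported is user2. An empty result suggests the map covers every listed username.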
Configure git svn with the authors file to map commit authorship:
$ git config svn.authorsfile "authors.txt"
Subversion represents branches and tags as directories. One must identify the repository directories that directly hold the actual project content. These are typically the directories developers check out to work in.
Each such directory must be mapped to a Git branch or tag reference
(refs/heads/<branch> or refs/tags/<tag>) using a command of
the form:
$ git config --add svn-remote.svn.fetch $dir:$ref
where $dir is the Subversion repository directory (not including
the top-level repository URL) and $ref is the Git reference. One
may also map "container" directories whose immediate subdirectories
are all branches or tags using commands of the form:
$ git config --add svn-remote.svn.branches $bdir/*:refs/heads/*
$ git config --add svn-remote.svn.tags $tdir/*:refs/tags/*
where $bdir and $tdir are branch and tag container
directories.
A "standard" layout under the repository top-level is:
- trunk: Main Development Path
- branches/*: Branches, e.g. branches/release-1.0
- tags/*: Tags, e.g. tags/release-1.0.0
where * refers to all immediate subdirectories (no /).
If the Subversion repository uses this layout, configure git svn
to use it as follows:
$ git config --add svn-remote.svn.fetch trunk:refs/heads/trunk
$ git config --add svn-remote.svn.branches branches/*:refs/heads/*
$ git config --add svn-remote.svn.tags tags/*:refs/tags/*
However, Subversion does not enforce this layout so many repositories have different layouts.
Some projects modify the Standard Layout by adding a subdirectory named for the project:
- trunk/example: Main Development Path
- branches/*/example: Branches, e.g. branches/release-1.0/example
- tags/*/example: Tags, e.g. tags/release-1.0.0/example
In this case configure git svn as:
$ git config --add svn-remote.svn.fetch trunk/example:refs/heads/trunk
$ git config --add svn-remote.svn.branches branches/*/example:refs/heads/*
$ git config --add svn-remote.svn.tags tags/*/example:refs/tags/*
See the "CONFIGURATION" section of the git svn documentation for a similar example.
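To make the role of * in these rules concrete, the following sketch mimics how a single-* rule relates a Subversion directory to a Git reference. It is an illustration of the mapping described above, not git svn's actual implementation, and the map_path helper is hypothetical:

```shell
# Hypothetical helper: map an SVN path to a Git ref for one glob rule.
# Handles only a single "*" on each side, as in the rules above.
map_path() {
  rule=$1 path=$2
  svn_glob=${rule%%:*}      # e.g. branches/*/example
  ref_glob=${rule#*:}       # e.g. refs/heads/*
  prefix=${svn_glob%%\**}   # text before the "*", e.g. branches/
  suffix=${svn_glob#*\*}    # text after the "*", e.g. /example
  case $path in
    "$prefix"*"$suffix") ;;
    *) return 1 ;;
  esac
  name=${path#"$prefix"}
  name=${name%"$suffix"}
  case $name in
    */*|'') return 1 ;;     # "*" stands for one directory level only
  esac
  printf '%s\n' "${ref_glob%%\**}$name${ref_glob#*\*}"
}

map_path 'branches/*/example:refs/heads/*' branches/release-1.0/example
map_path 'branches/*:refs/heads/*' branches/people/user1/test ||
  echo "no match: needs an explicit fetch rule"
```

The first call prints refs/heads/release-1.0. The second finds no match because people/user1/test spans more than one directory level.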
Some projects follow no rigid layout at all. For example, consider a repository with directories:
- docs: Documentation outside main history
- trunk/example: Main Development Path
- branches/release-0.9: Release 0.9 branch
- branches/release-1.0: Release 1.0 branch
- branches/people/user1/test: A test branch for user1
- tags/release-0.9.0: Release 0.9.0 tag
- tags/release-1.0.0: Release 1.0.0 tag
- tags/people/user1/before-x: A reference tag for user1
In this example there is an unrelated docs history, the
example project name subdirectory is used under trunk but not
inside branches, branches/ is not strictly a branch container
directory because its immediate people/ subdirectory does not
directly contain the project content, and tags/ is not strictly
a tag container for the same reason.
One must configure git svn as:
$ git config --add svn-remote.svn.fetch docs:refs/heads/docs
$ git config --add svn-remote.svn.fetch trunk/example:refs/heads/trunk
$ git config --add svn-remote.svn.fetch branches/release-0.9:refs/heads/release-0.9
$ git config --add svn-remote.svn.fetch branches/release-1.0:refs/heads/release-1.0
$ git config --add svn-remote.svn.fetch branches/people/user1/test:refs/heads/people/user1/test
$ git config --add svn-remote.svn.fetch tags/release-0.9.0:refs/tags/release-0.9.0
$ git config --add svn-remote.svn.fetch tags/release-1.0.0:refs/tags/release-1.0.0
$ git config --add svn-remote.svn.fetch tags/people/user1/before-x:refs/tags/people/user1/before-x
to specify the entire directory to branch map explicitly. Note that we
use only svn-remote.svn.fetch to specify individual branches and tags.
We do not use svn-remote.svn.branches or svn-remote.svn.tags
because there are no branch or tag container directories.
Some projects have no branch structure at all and keep their content
right at the top level of their repositories. For such repositories
one must configure git svn as:
$ git config --add svn-remote.svn.fetch :refs/heads/trunk
to convert the entire repository as a single branch.
In any of the above layout configurations the Subversion repository
may have a file or subdirectory inside the project content that one
wishes to exclude from conversion. For example, a Subversion
repository using the Standard Layout may have an artifacts
directory under trunk, branches, and tags containing
pre-built binaries that are not really part of the project history.
These should be excluded because:
- Large artifacts tend to be generated by tools, are not a source of original human-written content, and therefore do not belong in source code version control.
- The GitHub File Size Limits enforcement will not allow large files to be pushed to repositories at all.
Any set of paths under the Subversion repository root directory may be
excluded by configuring a Perl regular expression to match them. In
this example we exclude the artifacts directories by configuring
git svn with:
$ git config svn-remote.svn.ignore-paths '^(trunk|(branches|tags)/[^/]+)/artifacts'
Alternatively, one may use the approach below to Redact Proprietary Information.
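Before fetching, one may wish to try the ignore-paths expression above against a few representative paths. The sketch below does this with grep -E on made-up sample paths. Note that grep -E uses POSIX extended syntax rather than Perl syntax; this particular expression happens to use only constructs common to both:

```shell
# Sketch: preview which paths the ignore-paths expression would match.
# The paths listed here are sample data, not taken from a real repository.
pattern='^(trunk|(branches|tags)/[^/]+)/artifacts'
matched=$(printf '%s\n' \
  trunk/artifacts/tool.zip \
  trunk/src/main.c \
  branches/release-1.0/artifacts/tool.zip \
  tags/release-1.0.0/artifacts \
  branches/release-1.0/src/artifacts/readme.txt |
  grep -E "$pattern")
printf '%s\n' "$matched"
```

Only the three paths with artifacts directly under trunk, a branch, or a tag are printed. The expression has no trailing anchor, so it would also match a sibling such as trunk/artifacts-old; append (/|$) if that is not wanted.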
Before proceeding with this step, submit the configuration constructed above for evaluation of the Conversion Input Checklist.
After all above configuration is complete one may run the command:
$ git svn fetch
to fetch the history from the Subversion repository and store it in
the local Git repository. If the command terminates with an unknown
username error, return to the Map Users to Authors section, update
the authors.txt, and run git svn fetch again to pick up where
it left off.
Optionally check converted content using Tip: List All Blobs below.
After the above steps we have an example-svn.git repository
containing the raw conversion result. However, additional filtering
should be performed to clean up and finalize the Git history.
Clone the original Git repository:
$ cd ..
$ git clone --mirror example-svn.git example.git
$ cd example.git
Now perform any filter operations in the clone and leave the original repository untouched in case the filtering goes awry and needs to be restarted from scratch.
The raw git svn conversion includes meta-data in commit messages
needed for incremental bidirectional operation between the Subversion
repository and the Git repository. It has the form:
git-svn-id: https://ncisvn.nci.nih.gov/svn/example/trunk@1234 2465a239-a9df-dca0-63d6-4bfd70964974
where the URL specifies the repository directory whose content is
represented in the commit as of revision @1234 and the UUID is
that of the Subversion repository.
We do not need these meta-data in their raw form for a one-time
conversion, but we purposely did not use the svn.noMetadata
option that git svn offers. We do want to preserve the SVN
revision number of each commit in its message. This is important
because the revision numbers may be referenced by other commit
messages, mailing list archives, issue trackers, etc. and anyone
reading such references in the future must be able to find the
corresponding Git commit.
Use the git filter-branch command to rewrite every commit message:
$ refs=$(git show-ref --heads --tags | cut -f 2 -d ' ') &&
git filter-branch --msg-filter '
sed "/^git-svn-id:/ {s/[^@]*@/SVN-Revision: /;s/ *[a-f0-9-]*$//;}"
' $refs
and replace meta-data lines like the above with just:
SVN-Revision: 1234
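The effect of the sed expression can be checked on a single message before rewriting any history. A minimal sketch, using the sample meta-data line from above and a made-up subject line:

```shell
# Sketch: apply the message filter's sed expression to one sample message.
msg='Fix the build

git-svn-id: https://ncisvn.nci.nih.gov/svn/example/trunk@1234 2465a239-a9df-dca0-63d6-4bfd70964974'
filtered=$(printf '%s\n' "$msg" |
  sed '/^git-svn-id:/ {s/[^@]*@/SVN-Revision: /;s/ *[a-f0-9-]*$//;}')
printf '%s\n' "$filtered"
```

The subject line passes through unchanged and the last line becomes SVN-Revision: 1234.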
The git filter-branch command leaves behind copies of references
to the original history in the local repository, so clean them up:
$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
If the repository contains proprietary information it must be removed before publication. One may not simply add a commit to remove the content because it will still be present in history. If there is a known secret string then run a command like:
$ git rev-list --reverse --topo-order --all |
  while read commit; do
    git diff-tree --no-commit-id --root -r $commit --diff-filter=AM --name-only -z |
    xargs -0 git --no-pager grep "OurSecret" $commit --
  done
to grep for it throughout history. Then use the git filter-branch command to filter the content out of every version in history. There are multiple ways to do this:
- Use the --tree-filter option to perform filesystem operations:
  $ refs=$(git show-ref --heads --tags | cut -f 2 -d ' ') &&
    git filter-branch --tree-filter '
      if test -f file-with-secret.txt; then
        sed -i -e "s/OurSecret/xxxxxxxxx/" file-with-secret.txt
      fi
    ' $refs
  This is very flexible but can take a long time on large projects with long histories because it checks out every version to disk.
- Use the --index-filter option to operate on the Git index:
  $ refs=$(git show-ref --heads --tags | cut -f 2 -d ' ') &&
    git filter-branch -f --index-filter '
      files=$(git ls-files -s -- files-with-secret*.txt) &&
      if test -n "$files"; then
        echo "$files" |
        while read mode obj _ name; do
          out=$(git cat-file blob $obj |
                sed -e "s/OurSecret/xxxxxxxxx/" |
                git hash-object -w -t blob --stdin) &&
          git update-index --cacheinfo $mode $out $name
        done
      fi
    ' $refs
  This is more complex but may run faster than a tree filter. Note the -w option to git hash-object, which writes the rewritten blob into the object database so that the index entry refers to an object that exists.
The git filter-branch command leaves behind copies of references
to the original history in the local repository, so clean them up:
$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
Optionally use the following approach to publish old branches in the Git repository without an explicit branch name. This avoids obscuring active development branches in a list containing a large number of old branches that are no longer maintained.
WARNING: We recommend this step only for advanced users. This section makes use of advanced Git concepts and tools without explaining them in full detail. If you are not comfortable with them, skip this section and move on to Update Content below.
Git represents a branch as a refs/heads/<branch> reference to the
commit currently at the head of the branch. Each commit refers to its
immediate predecessors (parents) so the one reference to the branch
head holds its entire history. In order for the branch to appear in a
repository it must be reachable from a named branch head. The trunk
is not special in any way and is simply held as refs/heads/trunk.
At this point in the conversion the repository has a trunk and
perhaps a few branches, say release-0.1 and release-1.0. We
can visualize the history as:
...o---B-----A----o----o----o----o trunk
    \                   \
     W--X release-0.1    o----o release-1.0
In this example commits W and X are useful as a record of an
old release-0.1 branch but we wish to avoid publishing
refs/heads/release-0.1 in the public repository. In order to keep
the old branch alive we can filter history to make it reachable from
trunk. Identify commit A as the oldest commit in trunk
that is newer than commit X, and B the parent of A
(A=after, B=before). Insert a merge commit M between B and
A to make X reachable from A and therefore trunk.
After filtering the history will be:
...o---B--M--A----o----o----o----o trunk
    \    /              \
     W--X                o----o release-1.0
Construct merge commit M with parents B and X and a commit
message mentioning release-0.1 to make it easy to find in
git log if anyone later wants to check out the old branch.
One may achieve this by running bash code:
branch="release-0.1" &&
X=$(git rev-parse --verify $branch) &&
d="$(git log "$X" --pretty=%ad -n 1)" &&
A=$(git rev-list trunk --first-parent --since="$d" --reverse |head -1) &&
B=$(git rev-parse --verify "$A^1") &&
M=$(echo "Merge end of '$branch' branch

This commit was manufactured during conversion from SVN
to merge the end of the '$branch' branch." |
GIT_AUTHOR_NAME="Your Name" \
GIT_AUTHOR_EMAIL="[email protected]" \
GIT_AUTHOR_DATE="$d" \
GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" \
GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL" \
GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE" \
git commit-tree "$B^{tree}" -p $B -p $X
) &&
echo "# Graft merge commit $M as parent of $A
$A $M" >> info/grafts
to construct the merge commit and add a Git info/grafts entry to
tell git filter-branch where to insert it in history. Note that we
use B^{tree} as the tree object of M to tell Git that the
merge does not really make any changes. It is only for recording
history.
In practice repeat the above to graft in a merge commit for every old branch to be hidden. Run git filter-branch to resolve the grafts:
$ refs=$(git show-ref --heads --tags | cut -f 2 -d ' ') &&
  git filter-branch -f $refs &&
  rm info/grafts
The git filter-branch command leaves behind copies of references
to the original history in the local repository, so clean them up as above.
Finally, remove the old branch references.
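One way to remove a reference is git update-ref -d (git branch -D would also work). The following is a minimal sketch in a throwaway repository so that nothing here touches the converted history. The scratch repository, identity, and commit are placeholders:

```shell
# Sketch: delete an old branch reference in a scratch repository.
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"
git symbolic-ref HEAD refs/heads/trunk
git -c user.name="Example" -c user.email="example@example.invalid" \
  commit -q --allow-empty -m "Initial commit"
git branch release-0.1

# In the real conversion the commits stay reachable through the merge
# grafted into trunk; only the branch name is removed here.
git update-ref -d refs/heads/release-0.1
remaining=$(git for-each-ref --format='%(refname)' refs/heads/)
printf '%s\n' "$remaining"
```

Only refs/heads/trunk remains afterwards. In example.git one would run the same git update-ref -d command once for each branch that was merged in as above.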
Before proceeding with this step, provide the above conversion results for evaluation of the Conversion Output Checklist.
The Git repository is now almost ready for publication and new development. A few details may need to be addressed by adding new commits directly in Git:
- If the Subversion repository used svn:ignore directory properties, manually add a corresponding .gitignore file to each such directory.
- If the Subversion repository used svn:eol-style file properties, one may need to add a .gitattributes file.
- If the project build system assumes it is building in a Subversion checkout, fix it to be aware that it is now building in a Git work tree.
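When writing the .gitignore files, keep in mind that Subversion applies svn:ignore patterns only to the directory carrying the property, whereas a .gitignore pattern without a slash also applies to subdirectories. The following is a minimal sketch of a conversion that anchors each pattern with a leading slash. The function name is hypothetical and the property value is sample data; in a Subversion checkout it could come from svn propget svn:ignore:

```shell
# Sketch: turn an svn:ignore property value into .gitignore lines.
# Hypothetical helper; reads patterns on stdin, one per line.
svn_ignore_to_gitignore() {
  # Drop empty lines and prefix "/" so each pattern matches only in
  # the directory that holds the .gitignore file.
  sed -e '/^[[:space:]]*$/d' -e 's|^|/|'
}

gitignore=$(printf '%s\n' '*.o' '' 'build' | svn_ignore_to_gitignore)
printf '%s\n' "$gitignore"
```

For the sample input this produces the two lines /*.o and /build, which would go into the .gitignore file of the directory that carried the property.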
The bare repository used above has no work tree in which to make these changes, so clone it:
$ cd ..
$ git clone example.git example
$ cd example
After committing changes, push them back to the bare repository like any upstream:
$ git push origin master
$ cd ../example.git
One may publish changes to the final public repository by pushing directly from
the bare repository containing the filtered history. Since it was created by
cloning from the conversion repository it already has an origin remote, so
add another remote for GitHub:
$ git remote add github [email protected]:NCIP/example.git
Finally, use git push github ... to push each branch, e.g.:
$ git push github master
Here are some optional tips to aid in the above migration steps.
After running git svn fetch in the Fetch History step one may wish to check that there were no mistakes in the Map Directories to Branches section that brought in unexpected content. Run the bash command:
$ git rev-list --reverse --topo-order --all |
  while read commit; do
    echo "$commit $(git diff-tree --always --pretty=format:'%ai' $commit | head -1)" 1>&2
    git diff-tree --no-commit-id --root -r $commit |
    while read src_mode dst_mode src_obj dst_obj st file; do
      if test "$src_obj" != "$dst_obj" -a "$dst_mode" != "000000" -a "$dst_mode" != "160000"; then
        echo "$(git cat-file -s $dst_obj)"$'\t'"$dst_obj $file"
      fi
    done
  done |sort -n > "blobs.txt"
Look in the generated blobs.txt to see a list of all file content (blobs) in the entire history, sorted by size. Each line will be of the form:
<size> TAB <sha1> SP <path>
Check that every <path> looks like it belongs inside the project content and is not part of Subversion branch structure.
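Because the list is sorted by size, the largest blobs are at the end. To pick out only entries above a given size, one may filter the file with awk. A minimal sketch on made-up sample entries in the format above; the 100 MiB threshold is a hypothetical value chosen for illustration:

```shell
# Sketch: list blobs larger than a size limit from a blobs.txt-style file.
# The entries below are sample data in the "<size> TAB <sha1> SP <path>" form.
work=$(mktemp -d)
printf '%s\t%s %s\n' \
  120 1111111111111111111111111111111111111111 README.txt \
  4096 2222222222222222222222222222222222222222 src/main.c \
  157286400 3333333333333333333333333333333333333333 artifacts/tool.zip \
  > "$work/blobs.txt"

limit=$((100 * 1024 * 1024))   # hypothetical threshold: 100 MiB
large=$(awk -F '\t' -v limit="$limit" '$1 + 0 > limit { print $2 }' "$work/blobs.txt")
printf '%s\n' "$large"
```

Only the artifacts/tool.zip entry is printed for this sample. Paths reported this way are candidates for the ignore-paths configuration described earlier.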
Conversion time during git svn fetch can be reduced significantly
when one has access to a local copy of the Subversion repository
server-side representation (perhaps obtained as a zip from an admin).
With such access, configure git svn to fetch from the local repository:
$ git config svn-remote.svn.url file:///c:/path/to/local/svn/example
and to convert history as if it were taken from the original url:
$ git config svn-remote.svn.rewriteRoot https://ncisvn.nci.nih.gov/svn/example