This repository was archived by the owner on Jul 8, 2020. It is now read-only.

How to Migrate from Subversion

Brad King edited this page Jul 2, 2013 · 3 revisions

This page provides tips for migrating Subversion repositories to Git.

Conversion from Subversion to Git is typically done using the git svn command that comes with Git.

It is not necessary to use the separate svn2git tool.

Note

Command-line examples in this document assume a bash prompt and that the current working directory persists across all sections.

The git svn command provides an init sub-command to configure a Git repository for incremental bidirectional operation with a Subversion repository. However, for a one-time permanent conversion we suggest manual configuration.

Create a bare local Git repository to perform the initial conversion:

$ mkdir example-svn.git
$ cd example-svn.git
$ git --bare init

Configure git svn with the Subversion repository top level URL. For example:

$ git config svn-remote.svn.url https://ncisvn.nci.nih.gov/svn/example

prepares a Git repository to convert the example Subversion repository.

Alternatively, if one has a copy of the Subversion repository server files, see Tip: Fetch Locally below.

Subversion records commit authorship by username but Git records it by name and email address, e.g. Your Name <[email protected]>. In order to convert, one must construct a map from username to name and email. Use svn log to list all user names recorded in the Subversion repository. For example, the command:

$ url=https://ncisvn.nci.nih.gov/svn/example &&
  svn log -q "$url" |grep '^r' |cut -f 2 -d '|' |sed 's/^ *//;s/ *$//' |sort|uniq
user1
user2
...

lists all user names that ever committed to the example repository. From this list create an authors file, say authors.txt, of the form:

$ cat authors.txt
user1 = Name One <[email protected]>
user2 = Name Two <[email protected]>
...

The authors file must contain a complete map or the tool may complain later during conversion. If there are any unknown entries choose something reasonable. For example, Subversion repositories converted from CVS by cvs2svn may contain a "(no author)" user. Map this to a name such as:

(no author) = no_author <no_author@no_author.cvs2svn>

to indicate that the commit was generated by the tool. If no reasonable authorship information is available for an unknown username one may map the name to itself:

unknown = unknown <unknown>

but this should be a last resort.
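Completeness of the map can be checked before conversion rather than discovered when the tool stops on an unknown username. The following is a minimal sketch and not part of git svn: the helper names missing_authors and make_authors_skeleton and the placeholder domain example.invalid are assumptions made for illustration. Both helpers read a username list, such as the output of the svn log pipeline above, on standard input.

```shell
# Print every username read from stdin that has no "username = ..." line
# in the authors file given as $1 (hypothetical helper, not a git svn feature).
missing_authors() {
  awk -v authors="$1" '
    BEGIN {
      # Load the usernames already mapped: everything before the first " = ".
      while ((getline line < authors) > 0) {
        sub(/ = .*$/, "", line)
        known[line] = 1
      }
    }
    !($0 in known)   # pass through usernames that are not mapped yet
  '
}

# Emit a placeholder entry for each username read from stdin.
# Every generated name and address is a stand-in to be edited by hand.
make_authors_skeleton() {
  while IFS= read -r user; do
    if [ -n "$user" ]; then
      printf '%s = %s <%s@example.invalid>\n' "$user" "$user" "$user"
    fi
  done
}
```

Piping the username list into missing_authors authors.txt prints nothing once the map is complete. Any names it does print can be piped into make_authors_skeleton to produce lines to append and then correct by hand.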

Configure git svn with the authors file to map commit authorship:

$ git config svn.authorsfile "authors.txt"

Subversion represents branches and tags as directories. One must identify the repository directories that directly hold the actual project content. These are typically the directories developers check out to work.

Each such directory must be mapped to a Git branch or tag reference (refs/heads/<branch> or refs/tags/<tag>) using a command of the form:

$ git config --add svn-remote.svn.fetch $dir:$ref

where $dir is the Subversion repository directory (not including the top-level repository URL) and $ref is the Git reference. One may also map "container" directories whose immediate subdirectories are all branches or tags using commands of the form:

$ git config --add svn-remote.svn.branches $bdir/*:refs/heads/*
$ git config --add svn-remote.svn.tags     $tdir/*:refs/tags/*

where $bdir and $tdir are branch and tag container directories.

A "standard" layout under the repository top-level is:

  • trunk: Main Development Path
  • branches/*: Branches e.g. branches/release-1.0
  • tags/*: Tags e.g. tags/release-1.0.0

where * refers to all immediate subdirectories (no /). If the Subversion repository uses this layout, configure git svn to use it as follows:

$ git config --add svn-remote.svn.fetch trunk:refs/heads/trunk
$ git config --add svn-remote.svn.branches branches/*:refs/heads/*
$ git config --add svn-remote.svn.tags tags/*:refs/tags/*

However, Subversion does not enforce this layout so many repositories have different layouts.

Some projects modify the Standard Layout by adding a subdirectory named for the project:

  • trunk/example: Main Development Path
  • branches/*/example: Branches e.g. branches/release-1.0/example
  • tags/*/example: Tags e.g. tags/release-1.0.0/example

In this case configure git svn as:

$ git config --add svn-remote.svn.fetch trunk/example:refs/heads/trunk
$ git config --add svn-remote.svn.branches branches/*/example:refs/heads/*
$ git config --add svn-remote.svn.tags tags/*/example:refs/tags/*

See the "CONFIGURATION" section of the git svn documentation for a similar example.

Some projects follow no rigid layout at all. For example, consider a repository with directories:

  • docs: Documentation outside main history
  • trunk/example: Main Development Path
  • branches/release-0.9: Release 0.9 branch
  • branches/release-1.0: Release 1.0 branch
  • branches/people/user1/test: A test branch for user1
  • tags/release-0.9.0: Release 0.9.0 tag
  • tags/release-1.0.0: Release 1.0.0 tag
  • tags/people/user1/before-x: A reference tag for user1

In this example there is an unrelated docs history, the example project name subdirectory is used under trunk but not inside branches, branches/ is not strictly a branch container directory because its immediate people/ subdirectory does not directly contain the project content, and tags/ is not strictly a tag container for the same reason. One must configure git svn as:

$ git config --add svn-remote.svn.fetch docs:refs/heads/docs
$ git config --add svn-remote.svn.fetch trunk/example:refs/heads/trunk
$ git config --add svn-remote.svn.fetch branches/release-0.9:refs/heads/release-0.9
$ git config --add svn-remote.svn.fetch branches/release-1.0:refs/heads/release-1.0
$ git config --add svn-remote.svn.fetch branches/people/user1/test:refs/heads/people/user1/test
$ git config --add svn-remote.svn.fetch tags/release-0.9.0:refs/tags/release-0.9.0
$ git config --add svn-remote.svn.fetch tags/release-1.0.0:refs/tags/release-1.0.0
$ git config --add svn-remote.svn.fetch tags/people/user1/before-x:refs/tags/people/user1/before-x

to specify the entire directory to branch map explicitly. Note that we use only svn-remote.svn.fetch to specify individual branches and tags. We do not use svn-remote.svn.branches or svn-remote.svn.tags because there are no branch or tag container directories.

Some projects have no branch structure at all and keep their content right at the top level of their repositories. For such repositories one must configure git svn as:

$ git config --add svn-remote.svn.fetch :refs/heads/trunk

to convert the entire repository as a single branch.
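Because each --add appends another value, a mistyped mapping stays in the configuration until it is removed, so it can help to list the accumulated mappings before fetching. A minimal sketch, assuming it is run inside the conversion repository; the function name show_svn_mapping is hypothetical:

```shell
# List every directory-to-reference mapping configured for the "svn" remote.
# Prints one "key value" pair per line; prints nothing if none is configured.
show_svn_mapping() {
  git config --get-regexp '^svn-remote\.svn\.(fetch|branches|tags)$' || true
}

# Example: show_svn_mapping
```

If the list contains a wrong entry, one option is to clear that key with git config --unset-all and add the intended values again.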

In any of the above layout configurations the Subversion repository may have a file or subdirectory inside the project content that one wishes to exclude from conversion. For example, a Subversion repository using the Standard Layout may have an artifacts directory under trunk, branches, and tags containing pre-built binaries that are not really part of the project history. These should be excluded because:

  • Large artifacts tend to be generated by tools, are not a source of original human-written content, and therefore do not belong in source code version control.
  • The GitHub File Size Limits enforcement will not allow large files to be pushed to repositories at all.

Any set of paths under the Subversion repository root directory may be excluded by configuring a Perl regular expression to match them. In this example we exclude the artifacts directories by configuring git svn with:

$ git config svn-remote.svn.ignore-paths '^(trunk|(branches|tags)/[^/]+)/artifacts'
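The expression can be tried against sample paths before starting a long fetch. The sketch below uses grep -E as a stand-in for Perl's regular-expression engine, an assumption that holds for this simple pattern but not for Perl-only syntax. The function name would_ignore and the sample paths are hypothetical.

```shell
# Print the paths read from stdin that the ignore-paths expression matches.
# grep -E approximates the Perl regex here; Perl-only constructs would differ.
would_ignore() {
  grep -E '^(trunk|(branches|tags)/[^/]+)/artifacts' || true
}

would_ignore <<'EOF'
trunk/artifacts/tool.bin
trunk/src/main.c
branches/release-1.0/artifacts/tool.bin
tags/people/user1/artifacts/tool.bin
EOF
```

Only the first and third sample paths are printed. The last one is not matched because [^/]+ spans a single directory level, so a layout with deeper branch or tag directories needs a broader expression.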

Alternatively, one may use the approach below to Redact Proprietary Information.

Before proceeding with this step, submit the configuration constructed above for evaluation of the Conversion Input Checklist.

After all above configuration is complete one may run the command:

$ git svn fetch

to fetch the history from the Subversion repository and store it in the local Git repository. If the command terminates with an unknown username error, return to the Map Users to Authors section, update the authors.txt, and run git svn fetch again to pick up where it left off.

Optionally, check the converted content using Tip: List All Blobs below.

After the above steps we have an example-svn.git repository containing the raw conversion result. However, additional filtering should be performed to clean up and finalize the Git history. Clone the original Git repository:

$ cd ..
$ git clone --mirror example-svn.git example.git
$ cd example.git

Now perform any filter operations in the clone and leave the original repository untouched in case the filtering goes awry and needs to be restarted from scratch.

The raw git svn conversion includes meta-data in commit messages needed for incremental bidirectional operation between the Subversion repository and the Git repository. It has the form:

git-svn-id: https://ncisvn.nci.nih.gov/svn/example/trunk@1234 2465a239-a9df-dca0-63d6-4bfd70964974

where the URL specifies the repository directory whose content is represented in the commit as of revision @1234 and the UUID is that of the Subversion repository.

We do not need these meta-data in their raw form for a one-time conversion, but we purposely did not use the svn.noMetadata option that git svn offers. We do want to preserve the SVN revision number of each commit in its message. This is important because the revision numbers may be referenced by other commit messages, mailing list archives, issue trackers, etc. and anyone reading such references in the future must be able to find the corresponding Git commit.

Use the git filter-branch command to rewrite every commit message:

$ refs=$(git show-ref --heads --tags | cut -f 2 -d ' ') &&
  git filter-branch --msg-filter '
       sed "/^git-svn-id:/ {s/[^@]*@/SVN-Revision: /;s/ *[a-f0-9-]*$//;}"
  ' $refs

and replace meta-data lines like the above with just:

SVN-Revision: 1234
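The message filter can be tried on a single message before rewriting the whole history. This sketch wraps the same sed expression in a function (the name rewrite_svn_id is hypothetical) and feeds it a sample message ending in the meta-data line shown earlier:

```shell
# Apply the msg-filter sed expression to a commit message read from stdin.
rewrite_svn_id() {
  sed '/^git-svn-id:/ {s/[^@]*@/SVN-Revision: /;s/ *[a-f0-9-]*$//;}'
}

printf '%s\n' \
  'Fix a typo' \
  '' \
  'git-svn-id: https://ncisvn.nci.nih.gov/svn/example/trunk@1234 2465a239-a9df-dca0-63d6-4bfd70964974' |
rewrite_svn_id
# The last line of the output is: SVN-Revision: 1234
```

Lines that do not start with git-svn-id: pass through unchanged. The expression removes everything up to the first @ on the line, so it assumes the repository URL itself contains no @ character.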

The git filter-branch command leaves behind copies of references to the original history in the local repository, so clean them up:

$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

If the repository contains proprietary information it must be removed before publication. One may not simply add a commit to remove the content because it will still be present in history. If there is a known secret string then run a command like:

$ git rev-list --reverse --topo-order --all |
while read commit; do
  git diff-tree --no-commit-id --root -r $commit --diff-filter=AM --name-only -z |
  xargs -0 git --no-pager grep "OurSecret" $commit --
done

to grep for it throughout history. Then use the git filter-branch command to filter the content out of every version in history. There are multiple ways to do this:

  1. Use the --tree-filter option to perform filesystem operations:

    $ refs=$(git show-ref --heads --tags | cut -f 2 -d ' ') &&
    git filter-branch --tree-filter '
      if test -f file-with-secret.txt; then
        sed -i -e "s/OurSecret/xxxxxxxxx/" file-with-secret.txt
      fi
    ' $refs
    

    This is very flexible but can take a long time on large projects with long histories because it checks out every version to disk.

  2. Use the --index-filter option to operate on the Git index:

    $ refs=$(git show-ref --heads --tags | cut -f 2 -d ' ') &&
    git filter-branch -f --index-filter '
      files=$(git ls-files -s -- files-with-secret*.txt) &&
      if test -n "$files"; then
        echo "$files" |
        while read mode obj _ name; do
          out=$(git cat-file blob $obj |
                sed -e "s/OurSecret/xxxxxxxxx/" |
                git hash-object -t blob --stdin) &&
          git update-index --cacheinfo $mode $out $name
        done
      fi
    ' $refs
    

    This is more complex but may run faster than a tree filter.

The git filter-branch command leaves behind copies of references to the original history in the local repository, so clean them up:

$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
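After filtering and cleanup it is worth confirming that the string is really gone. One possible check, sketched below, uses the git log -S option, which lists commits that change the number of occurrences of a string. The function name secret_is_gone is hypothetical. Because --all also covers refs/original/, the check is only meaningful after those references have been removed.

```shell
# Succeed only if no commit reachable from any reference adds or removes
# the given string. Run inside the filtered repository.
secret_is_gone() {
  test -z "$(git log --all --oneline -S"$1")"
}

# Example: secret_is_gone "OurSecret" && echo "no trace left in history"
```

This is a quick sanity check rather than proof. Repeating the history-wide grep shown earlier and confirming that it prints nothing is a reasonable second check.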

Optionally use the following approach to publish old branches in the Git repository without an explicit branch name. This avoids obscuring active development branches in a list containing a large number of old branches that are no longer maintained.

WARNING: We recommend this step only for advanced users. This section makes use of advanced Git concepts and tools without explaining them in full detail. If you are not comfortable with them, skip this section and move on to Update Content below.

Git represents a branch as a refs/heads/<branch> reference to the commit currently at the head of the branch. Each commit refers to its immediate predecessors (parents) so the one reference to the branch head holds its entire history. In order for the branch to appear in a repository it must be reachable from a named branch head. The trunk is not special in any way and is simply held as refs/heads/trunk.

At this point in the conversion the repository has a trunk and perhaps a few branches, say release-0.1 and release-1.0. We can visualize the history as:

...o---B-----A----o----o----o----o  trunk
    \                   \
     W--X  release-0.1   o----o  release-1.0

In this example commits W and X are useful as a record of an old release-0.1 branch but we wish to avoid publishing refs/heads/release-0.1 in the public repository. In order to keep the old branch alive we can filter history to make it reachable from trunk. Identify commit A as the oldest commit in trunk that is newer than commit X, and B the parent of A (A=after, B=before). Insert a merge commit M between B and A to make X reachable from A and therefore trunk. After filtering the history will be:

...o---B--M--A----o----o----o----o  trunk
    \    /              \
     W--X                o----o  release-1.0

Construct merge commit M with parents B and X and a commit message mentioning release-0.1 to make it easy to find in git log if anyone later wants to check out the old branch. One may achieve this by running bash code:

branch="release-0.1" &&
X=$(git rev-parse --verify $branch) &&
d="$(git log "$X" --pretty=%ad -n 1)" &&
A=$(git rev-list trunk --first-parent --since="$d" --reverse |head -1) &&
B=$(git rev-parse --verify "$A^1") &&
M=$(echo "Merge end of '$branch' branch

This commit was manufactured during conversion from SVN
to merge the end of the '$branch' branch." |
    GIT_AUTHOR_NAME="Your Name" \
    GIT_AUTHOR_EMAIL="[email protected]" \
    GIT_AUTHOR_DATE="$d" \
    GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" \
    GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL" \
    GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE" \
    git commit-tree "$B^{tree}" -p $B -p $X
) &&
echo "# Graft merge commit $M as parent of $A
$A $M" >> info/grafts

to construct the merge commit and add a Git info/grafts entry to tell git filter-branch where to insert it in history. Note that we use B^{tree} as the tree object of M to tell Git that the merge does not really make any changes. It is only for recording history.

In practice repeat the above to graft in a merge commit for every old branch to be hidden. Run git filter-branch to resolve the grafts:

$ refs=$(git show-ref --heads --tags | cut -f 2 -d ' ') &&
  git filter-branch -f $refs &&
  rm info/grafts

The git filter-branch command leaves behind copies of references to the original history in the local repository, so clean them up as above. Finally, remove the old branch references.
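Removing a reference before its head is reachable from trunk would lose that branch's history, so it is safer to check first. A minimal sketch, assuming the trunk reference is refs/heads/trunk as configured above; the function name hide_branch is hypothetical:

```shell
# Delete refs/heads/<branch> only if its head commit is an ancestor of trunk,
# so the old history stays reachable after the reference is gone.
hide_branch() {
  branch=$1
  if git merge-base --is-ancestor "refs/heads/$branch" refs/heads/trunk; then
    git update-ref -d "refs/heads/$branch"
  else
    echo "not removing $branch: head is not reachable from trunk" >&2
    return 1
  fi
}

# Example: hide_branch release-0.1
```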

Before proceeding with this step, provide the above conversion results for evaluation of the Conversion Output Checklist.

The Git repository is now almost ready for publication and new development. A few details may need to be addressed by adding new commits directly in Git:

  • If the Subversion repository used svn:ignore directory properties, manually add a corresponding .gitignore file to each such directory.
  • If the Subversion repository used svn:eol-style file properties, one may need to add a .gitattributes file.
  • If the project build system assumes it is building in a Subversion checkout, fix it to be aware that it is now building in a Git work tree.
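For the end-of-line property, the Subversion values correspond roughly to Git text attributes. The sketch below assumes a hypothetical input of tab-separated lines, each holding a path pattern and its end-of-line style value, prepared by hand or from svn propget output. The function name eol_to_gitattributes and the mapping itself are an illustration rather than an exact equivalence.

```shell
# Convert "pattern<TAB>eol-style value" lines on stdin to .gitattributes lines.
# Values without an obvious Git counterpart are emitted as comments for review.
eol_to_gitattributes() {
  awk -F '\t' '
    $2 == "native" { print $1 " text";          next }
    $2 == "LF"     { print $1 " text eol=lf";   next }
    $2 == "CRLF"   { print $1 " text eol=crlf"; next }
    { print "# unhandled eol-style for " $1 ": " $2 }
  '
}

# Example: printf '*.c\tnative\n*.bat\tCRLF\n' | eol_to_gitattributes
```

The output is meant to be reviewed before it is committed as .gitattributes.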

The bare repository used above has no work tree in which to make these changes, so clone it:

$ cd ..
$ git clone example.git example
$ cd example

After committing changes, push them back to the bare repository like any upstream:

$ git push origin master
$ cd ../example.git

One may publish changes to the final public repository by pushing directly from the bare repository containing the filtered history. Since it was created by cloning from the conversion repository it already has an origin remote, so add another remote for GitHub:

$ git remote add github [email protected]:NCIP/example.git

Finally, use git push github ... to push each branch, e.g.:

$ git push github master

Here are some optional tips to aid in the above migration steps.

After running git svn fetch in the Fetch History step one may wish to check that there were no mistakes in the Map Directories to Branches section that brought in unexpected content. Run the bash command:

$ git rev-list --reverse --topo-order --all |
while read commit; do
  echo "$commit $(git diff-tree --always --pretty=format:'%ai' $commit | head -1)" 1>&2
  git diff-tree --no-commit-id --root -r $commit |
  while read src_mode dst_mode src_obj dst_obj st file; do
    if test "$src_obj" != "$dst_obj" -a "$dst_mode" != "000000" -a "$dst_mode" != "160000"; then
      echo "$(git cat-file -s $dst_obj)"$'\t'"$dst_obj $file"
    fi
  done
done |sort -n > "blobs.txt"

Look in the generated blobs.txt to see a list of all file content (blobs) in the entire history, sorted by size. Each line will be of the form:

<size> TAB <sha1> SP <path>

Check that every <path> looks like it belongs inside the project content and is not part of Subversion branch structure.
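The same file also helps to spot content that would run into the file size limits mentioned earlier. A minimal sketch, assuming the blobs.txt format above; the function name large_blobs is hypothetical and the byte threshold is a parameter rather than a statement of any particular limit.

```shell
# Print each distinct line of the blobs file whose size field exceeds the limit.
# Usage: large_blobs <limit-in-bytes> <blobs-file>
large_blobs() {
  awk -F '\t' -v limit="$1" '$1 + 0 > limit + 0 && !seen[$2]++' "$2"
}

# Example: large_blobs 50000000 blobs.txt
```

Any path reported here is a candidate for the ignore-paths configuration before the fetch is repeated.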

Conversion time during git svn fetch can be reduced significantly when one has access to a local copy of the Subversion repository server-side representation (perhaps obtained as a zip from an admin). With such access, configure git svn to fetch from the local repository:

$ git config svn-remote.svn.url file:///c:/path/to/local/svn/example

and to convert history as if it were taken from the original url:

$ git config svn-remote.svn.rewriteRoot https://ncisvn.nci.nih.gov/svn/example
