git merge multiple copies preserving history

I have a project which has multiple copies of some files in different places. For example:

src/location1/foobar.h
src/location1/foobar.cpp
src/location2/foobar.h
src/location2/foobar.cpp

I am extracting these into their own library, so I wish to end up with:

src/location3/foobar.h        combining multiple versions of foobar.h
src/location3/foobar.cpp      combining multiple versions of foobar.cpp

I've passed the first hurdle of removing all unwanted files using:

git filter-repo --path-glob \*foobar\*

In the process I discovered that filter-branch has recently been superseded by the superior filter-repo (worth repeating, as filter-branch still appears in many top answers here).

I now want to combine the copies into one preserving all their histories. The two candidates for this are merge and merge-file.

merge-file requires the common ancestor of each file to be identified which is a pain as it was probably:

src/location3/foobar.h

which is somewhere unknown in the commit history. We have git merge-base to find the best common ancestor.

I'm not clear how to specify the file version for git merge-file. I want to do something like:

git mv src/location1/foobar.h src/newlocation/foobar.h
git commit
git merge-file src/newlocation/foobar.h src/location3/foobar@<commitid> src/location2/foobar.h
...
git merge-file src/newlocation/foobar.h src/location3/foobar@<commitid> src/location3/foobar.h

This is quite painstaking and has to be repeated for each file. Another way is to create multiple temporary branches:

git checkout -b newlibbranch
git mv src/location1/foobar.h src/newlocation/foobar.h
git mv src/location1/foobar.cpp src/newlocation/foobar.cpp
git commit
git checkout oldversion
git checkout -b v2
git mv src/location2/foobar.h src/newlocation/foobar.h
git mv src/location2/foobar.cpp src/newlocation/foobar.cpp
git commit
git checkout newlibbranch
git merge --allow-unrelated-histories v2

This is also quite painstaking, though it is possibly scriptable. There is also a practical problem: the merge produces a "rename/rename" conflict rather than a merge of the actual file contents. This seems to be solved by adding --allow-unrelated-histories.

So my questions are:

Regarding the task:

  1. Is there a better way? Perhaps a merge tool I am unaware of, just as I was unaware of filter-repo.
  2. Am I correct in thinking that the multiple-merge-branches approach is better than git merge-file?

Regarding merge-file:

  1. How do I specify a particular version of a file for git merge-file?
  2. Is there a command or script which finds the common ancestor automatically? Something like:
      git merge-file-wrapper location1 location2   -->

      base = `git merge-base location1 location2`
      git merge-file location1 $base location2

Could it be that this does not exist because there are some hidden pitfalls?
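
On the merge-file questions: git merge-file takes three ordinary files (current, base, other) rather than revisions, so a particular version has to be written out to a file first, for example with git show <commit>:<path>. Note also that git merge-base works on commits, not on individual files, which may be part of why no ready-made wrapper exists. The sketch below tries this in a throwaway repository. It assumes the second copy began as a verbatim copy and takes its content in the commit that first added it as the base. That heuristic, the five-line file contents and the names base.h and merged.h are illustrative assumptions only.

```shell
#!/bin/sh
set -e

# Throwaway repository; identity comes from the environment so commits work anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
demo=$(mktemp -d)
cd "$demo"
git init -q

# Commit 1: the original file.
mkdir -p src/location1 src/location2
printf 'one\ntwo\nthree\nfour\nfive\n' > src/location1/foobar.h
git add . && git commit -q -m "add foobar.h"

# Commit 2: a verbatim copy appears in a second location.
cp src/location1/foobar.h src/location2/foobar.h
git add . && git commit -q -m "copy foobar.h to location2"

# Commit 3: the two copies diverge (first line in one, last line in the other).
printf 'ONE\ntwo\nthree\nfour\nfive\n' > src/location1/foobar.h
printf 'one\ntwo\nthree\nfour\nFIVE\n' > src/location2/foobar.h
git commit -q -am "diverge"

# Heuristic (an assumption): the oldest commit that added the second copy
# holds the content both copies descend from.
added=$(git log --diff-filter=A --format=%H -- src/location2/foobar.h | tail -n 1)

# Write that version of the file out as a plain file to use as the base.
git show "$added:src/location2/foobar.h" > base.h

# -p prints the merge result instead of overwriting the first file.
git merge-file -p src/location1/foobar.h base.h src/location2/foobar.h > merged.h
cat merged.h
```

If the copy was edited in the same commit that introduced it, this base is only approximate and more conflicts should be expected.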



Solution 1:[1]

I haven't found any automated tool to do this so there may be a gap in the ecosystem for one.

In my case I had multiple files to move, some of which had more copies than others. This adds some interesting complexity but is not uncommon when refactoring to remove duplication.

What I did in the end was:

  • Write a script that creates new branches in which each variant is moved to its new location.

  • The script first identifies the files to be moved.

  • It then finds the file with the most copies and creates that many branches.

  • On each branch it tries to move one copy of each file to its new location.

  • I then merged the branches manually.

    Most of these merges were trivial things such as changing a namespace for each sub-project.

The result is a single set of files which have all the changes I wanted and all the change history from each of them.

To make this a bit more concrete:

  • Step 1: use filter-repo to create a project containing just the files of interest

    (note this should be done on a fresh clone of the project)

     git filter-repo --path-glob \*ThingIWant1\* --path-glob \*AnotherThingIWant\* 
     git filter-repo --invert-paths --path-glob \*ThingIDontWant\*
  • Step 2: create branches
    #!/bin/bash
    
    # find unique filenames
    MAXLOCS=0
    FILES=`find . -not -path '*/.*' -type f | grep -v makebranch | xargs -I file basename file | sort -u`
    for FILE in $FILES; do
        echo FILE=$FILE
        # find number of locations for each filename
        NUMLOCS=`find . -not -path '*/.*' -name "$FILE" | wc -l`
        if [ $NUMLOCS -gt $MAXLOCS ]; then
            MAXLOCS=$NUMLOCS
        fi
    done
    echo "$MAXLOCS branches required"
    
    # for each branch
    #  move one location of each file to its final destination
    L=0
    while [ $L -lt $MAXLOCS ]; do
        git checkout develop
        git checkout -b ps$L
        for FILE in $FILES; do
            echo FILE=$FILE
            LOCS=( $(find . -not -path '*/.*' -name "$FILE") )
            NUMLOCS=${#LOCS[@]}
            if [ $L -lt $NUMLOCS ]; then
                LOC=${LOCS[$L]}
                echo "mv $LOC"
                # Move source files to one place and test files to another
                # In my case we have src and test
                # (test grep's exit status directly; `[ $? ]` is always true)
                if echo "$LOC" | grep -q /src/; then
                    mkdir -p FinalDestinationForSource
                    git mv "$LOC" "FinalDestinationForSource/$FILE"
                    if [ $? -ne 0 ]; then
                        echo "BAD: git mv $LOC FinalDestinationForSource/$FILE"
                    fi
                else
                    mkdir -p FinalDestinationForTests
                    git mv "$LOC" "FinalDestinationForTests/$FILE"
                    if [ $? -ne 0 ]; then
                        echo "BAD: git mv $LOC FinalDestinationForTests/$FILE"
                    fi
                fi
            fi
        done
        git add -u
        git status
        git commit -m "#Ticket: move Things to new location $L"
        ((L = L + 1))
    done
  • Step 3: merge each branch
    git checkout ps0
    git merge ps1 -X rename-threshold=5%
    # resolve manually... then
    git commit
    git merge ps2 -X rename-threshold=5%
    # resolve manually... then
    git commit

The rename-threshold helps convince git that the files share the same origin. Otherwise one version may simply replace the other without retaining the change history linking them. I think the result is equivalent to linking multiple commits using git commit-tree which would be another way to solve this problem.

You can verify the history using git blame to see where each line came from in each file and git log to see the actual commits.
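
The branch-per-copy merge and the history check can be tried out in a throwaway repository. The following is a minimal sketch, not the script I actually used: the directory names loc1, loc2 and lib, the file contents and the conflict resolution are all made up for illustration. Each move here is an exact rename, so the rename-threshold option is kept only to mirror the commands above.

```shell
#!/bin/sh
set -e

# Throwaway repository; identity comes from the environment so commits work anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

# Two copies of the same header that differ only in their namespace line.
mkdir -p loc1 loc2
printf 'namespace one {\nint foo();\nint bar();\nint baz();\n}\n' > loc1/foobar.h
printf 'namespace two {\nint foo();\nint bar();\nint baz();\n}\n' > loc2/foobar.h
git add . && git commit -q -m "two copies"
git branch -M develop

# One branch per copy; each branch moves its copy to the shared destination.
i=0
for loc in loc1 loc2; do
    git checkout -q -b "ps$i" develop
    mkdir -p lib
    git mv "$loc/foobar.h" lib/foobar.h
    git commit -q -m "move copy $i"
    i=$((i + 1))
done

# Merge ps1 into ps0. A conflict is expected because the namespace lines differ;
# it is resolved here by writing out the combined file (hypothetical resolution).
git checkout -q ps0
if ! git merge -X rename-threshold=5% -m "merge ps1" ps1; then
    printf 'namespace lib {\nint foo();\nint bar();\nint baz();\n}\n' > lib/foobar.h
    git add -A
    git commit -q --no-edit
fi

# Inspect the result: commits from both branches should appear for the merged file.
git --no-pager log --graph --format=%s -- lib/foobar.h
git --no-pager blame lib/foobar.h
```

With real files the manual resolution is of course the interesting part; the point of the sketch is only that both "move copy" commits end up in the history of lib/foobar.h.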

Raymond Chen has a series of blog posts on this which may be of interest. He approaches this task using commit-tree. I think that would work, but it's a little too low-level an approach for my case.

  • Step 4: merge your library into the project it belongs in

    This is included for completeness as you may be moving files to another project. See "How do you merge two Git repositories?" for more details.

    cd targetProject
    git remote add sourceProject /path/to/sourceProject
    git fetch sourceProject
    git merge --allow-unrelated-histories sourceProject/ps0
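
To see Step 4 end to end without touching a real project, here is a small sketch using two throwaway repositories. The repository contents and commit messages are invented for the example; only the remote add, fetch and merge commands come from the step above.

```shell
#!/bin/sh
set -e

# Identity comes from the environment so commits work in both repositories.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Hypothetical source project: the extracted library on branch ps0.
git init -q "$work/sourceProject"
cd "$work/sourceProject"
echo 'int foo();' > foobar.h
git add . && git commit -q -m "library: add foobar.h"
git branch -M ps0

# Hypothetical target project with its own, unrelated history.
git init -q "$work/targetProject"
cd "$work/targetProject"
echo 'int main() { return 0; }' > main.cpp
git add . && git commit -q -m "app: add main.cpp"

# The commands from Step 4, run inside the target project.
git remote add sourceProject "$work/sourceProject"
git fetch -q sourceProject
git merge --allow-unrelated-histories -m "Merge library history" sourceProject/ps0

# Both histories are now reachable from the target branch.
git --no-pager log --graph --format=%s
```

Because the two trees share no paths in this sketch the merge is clean; overlapping paths would need resolving as in Step 3.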

I think this area is ripe for contributing a script to add a new merge facility to git.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bruce Adams