Oh Shit

2021-05-08

Those were my exact words when I figured out my nice project had turned sour. But behold! Sour can be a good thing. So let me explain…

First there was code

You know, nowadays source code is usually managed with some version control system. Usually that system is git, and on top of that the code is hosted at some service like GitHub or GitLab.

That’s all fine, but sometimes I would like to track what’s happening in certain projects I find interesting, or just preserve the pearls for pi… oh I mean for future generations.

I have run into issues with third-party code a few times:

  • Repository disappears in a puff: the author removes it and there’s very little I can do
  • I forgot what I liked, and my notes are missing just the crucial piece of information
  • Oblivion just spreads its nasty arm

Whatever the reason is, my current flow is not enough!

Cheer for the savior

Well, I quickly made a hack to solve the problem. A simple Python script, which calls the “git” command and does its magic. It’s like 1-2-3:

  1. Repository does not exist, so clone it
  2. Repository exists, just do “git pull”
  3. ????
  4. Profit!

And it worked like a charm. It took me something like 15 minutes. Just set it to run from crontab once a day and get cloned git repos. So everything is fine. Or not.
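The clone-or-pull flow above could look roughly like this. It is a minimal sketch and not the actual script: the function names sync_command and sync are made up for illustration, it assumes Python 3 and a “git” binary on the PATH, and the --ff-only flag is my own assumption for keeping a mirror from ever creating merge commits.

```python
import subprocess
from pathlib import Path


def sync_command(url, dest):
    """Return the git command line that mirrors url into dest.

    Step 1: dest is not a repository yet, so clone it.
    Step 2: dest already holds a repository, so pull.
    """
    dest = Path(dest)
    if (dest / ".git").is_dir():
        # Assumption: a mirror has no local changes, so fast-forward only.
        return ["git", "-C", str(dest), "pull", "--ff-only"]
    return ["git", "clone", url, str(dest)]


def sync(url, dest):
    """Run the clone or pull; raises CalledProcessError if git fails."""
    subprocess.run(sync_command(url, dest), check=True)
```

Calling sync(url, dest) for every repository on a list, from a daily cron job, would give the behaviour described above.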

Storage

The problem with this script is that it needs to be run somewhere. And the “somewhere” should fulfill these requirements:

  • Have enough storage space for multiple git repositories
  • Always on, so no need for manual steps or triggers, just cron to do its job
  • Be able to run the script

Well, the first two were easy. I selected my NAS box, which is always on, runs Linux, and is rooted so I have full control of it. On top of all this it has terabytes of storage.

Obstacle 1

It had Python 2.7, which was enough. But it didn’t have git. Well, it’s just an ARM box, so I thought I might find some pre-built binary. And I did find one. But unfortunately GitHub requires a somewhat more recent SSL version than those binaries provided. Thus clone just failed and I faced a dead end.

Well, I know I can hack my way around, so I tried to build a git binary. I got an ARM toolchain from Bootlin and decided on the armv5-eabi musl version to produce static binaries.

The first thing I needed to do was build OpenSSL. That was as easy as:

./Configure --prefix=$HOME/install/ssl linux-armv4
make CC=arm-buildroot-linux-musleabi-gcc RANLIB=arm-buildroot-linux-musleabi-ranlib LD=arm-buildroot-linux-musleabi-ld MAKEDEPPROG=arm-buildroot-linux-musleabi-gcc PROCESSOR=ARM -j

The next step was to compile the required zlib, because git uses compression for its packfiles:

mkdir build
cd build
cmake .. -DBUILD_SHARED_LIBS=0 -DCMAKE_C_COMPILER=arm-buildroot-linux-musleabi-gcc -DCMAKE_CXX_COMPILER=arm-buildroot-linux-musleabi-g++ -DOPENSSL_ROOT_DIR=$HOME/install/ssl -DCMAKE_FIND_ROOT_PATH=$HOME/install/armv5-eabi--musl--stable-2020.08-1/arm-buildroot-linux-musleabi/

That needed some hacks to make SSL available inside the installed toolchain, but it was just a matter of a symbolic link. Finally I could compile git! And that was trickier:

LDFLAGS="-static" PKG_CONFIG="pkg-config --static" ./configure --host=arm-buildroot-linux-musleabi --with-zlib=$HOME/install/zlib --prefix=$HOME/install/git

I can’t remember any more whether I needed to do something else; I might have changed something in configure.ac.

In the end I got a git binary, and it was working! But my joy was premature. I hit the next problem:

git: 'remote-https' is not a git command. See 'git --help'.

I tried to solve that but got really frustrated.

Obstacle 2

The next thing was to learn how git works. I checked shit. But it didn’t solve my issue: cloning and updating.

There were other options as well, but nothing seemed to work or be suitable. So I thought: how hard would it be to implement “git clone” in Python?

Well, there’s documentation for the HTTP protocol. It took me a while to get working, the biggest thing being support for the “smart” protocol, since GitHub has deprecated the “dumb” one. Finally I got it working, got the first references and could download my first packfile!
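To give an idea of what “getting the first references” involves: with the smart protocol, the response to GET <repo>/info/refs?service=git-upload-pack is framed as pkt-lines, where every packet starts with four hex digits giving its total length and “0000” is a flush packet. Below is a minimal sketch of parsing such a response. The function names are made up for illustration, it only covers the original (version 0) advertisement, and it is not the code I actually wrote.

```python
def parse_pkt_lines(data):
    """Split raw bytes into pkt-line payloads; None marks a flush packet."""
    packets = []
    pos = 0
    while pos < len(data):
        # The 4 hex digits count themselves plus the payload.
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            packets.append(None)
            pos += 4
        elif length < 4:
            # Lengths 1 to 3 are special packets in newer protocol versions.
            raise ValueError("unsupported pkt-line length %d" % length)
        else:
            packets.append(data[pos + 4:pos + length])
            pos += length
    return packets


def parse_ref_advertisement(data):
    """Return {refname: sha} from a version 0 info/refs response."""
    refs = {}
    for payload in parse_pkt_lines(data):
        if payload is None or payload.startswith(b"# service="):
            continue
        # The first ref line carries the capability list after a NUL byte.
        line = payload.split(b"\0", 1)[0].rstrip(b"\n")
        sha, _, name = line.partition(b" ")
        refs[name.decode()] = sha.decode()
    return refs
```

The sketch takes the response body as bytes, so it can be tried on a saved response without touching the network.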

But the packfile format was not as trivial as I thought. The problem was that the size entry gives the UNCOMPRESSED size. I would have been happy to get the compressed size. Why? Well, the file has a header, which is easy to parse. Then it has a data entry, which has a type and a size. Both are easy to parse. But then it’s followed by the data, zlib-compressed. The length of that data is not the same as the “size”, since the “size” refers to the uncompressed data and we’re handling compressed data! Since zlib ignores the extra data, it should be OK to just feed it the raw contents and then hope we can determine where the compressed data ends. Fortunately Python has “unused_data” in its zlib Decompress object. I was able to decompress many objects, but then something went wrong. I believe there’s either a bug in my implementation or Python’s zlib doesn’t work the way I expect it to.

A few entries don’t help that much, and I know I could fix it. But sometimes it’s just better to abandon the way you’re heading, since it would be very painful to continue. And those were just the first steps toward my clone feature. So I ditched that approach.
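The “unused_data” trick can be sketched like this. It is a simplified illustration, not my original code: the names read_pack_header and read_entry are made up, and it only handles plain objects (commit, tree, blob, tag). Delta entries (types 6 and 7) carry extra bytes between the entry header and the zlib stream, so a parser that feeds those straight to zlib will trip over them; the sketch simply refuses them.

```python
import struct
import zlib

OBJ_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag"}


def read_pack_header(data):
    """Return (version, object_count) from the 12-byte pack header."""
    magic, version, count = struct.unpack(">4sII", data[:12])
    if magic != b"PACK":
        raise ValueError("not a packfile")
    return version, count


def read_entry(data, pos):
    """Read one non-delta entry at pos; return (type, content, next_pos)."""
    byte = data[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # bits 6-4: object type
    size = byte & 0x0F              # bits 3-0: low bits of uncompressed size
    shift = 4
    while byte & 0x80:              # high bit set: another size byte follows
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    if obj_type not in OBJ_TYPES:
        raise NotImplementedError("delta or unknown type %d" % obj_type)
    # Feed everything that is left; zlib stops at the end of its own stream
    # and hands the remainder back in unused_data.
    d = zlib.decompressobj()
    content = d.decompress(data[pos:]) + d.flush()
    if not d.eof or len(content) != size:
        raise ValueError("truncated or corrupt entry")
    consumed = len(data) - pos - len(d.unused_data)
    return OBJ_TYPES[obj_type], content, pos + consumed
```

Slicing data[pos:] for every entry copies the rest of the file each time, which is fine for a sketch but slow for a big pack.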

This is my free time

So I had failed to compile a usable git and failed to implement my own clone. What was left?

Fortunately GitHub has its own library for git, called libgit2. I could write a simple C app to do the cloning for me. Actually I was surprised how easy it was:

#include <stdio.h>
#include <git2.h>

int main(int argc, const char **argv)
{
    git_repository *repo = NULL;
    const char *url;
    const char *path;
    git_clone_options clone_opts = GIT_CLONE_OPTIONS_INIT;

    if (argc <= 2) {
        printf("Usage: %s url path\n", argv[0]);
        return 1;
    }
    url = argv[1];
    path = argv[2];

    git_libgit2_init();

    printf("Cloning '%s' to '%s'...\n", url, path);

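    int error = git_clone(&repo, url, path, &clone_opts);
    if (error) {
        /* git_error_last() may return NULL, so guard before dereferencing */
        const git_error *e = git_error_last();
        fprintf(stderr, "ERROR: %d/%d: %s\n", error,
                e ? e->klass : 0, e ? e->message : "unknown error");
    }

    /* Release the repository handle (a no-op when the clone failed) */
    git_repository_free(repo);
    git_libgit2_shutdown();

    /* libgit2 error codes are negative; map them to a plain exit status */
    return error ? 1 : 0;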
}

That’s about it. And it was working on my target, after I managed to provide recent ca-certificates for it.

Next step was doing “git pull”. That ended up being an adventure, which I haven’t finished yet. Unfortunately libgit2 doesn’t provide “git pull” as is. You need to perform multiple steps by yourself.

I ended up coding fetch, merge, and a few intermediate steps. My merge was not a real one; it was more of a fake. Without it I would just fetch the changes and see them as uncommitted! Now my code is quite big, and I hope to get it working and to produce the smallest possible git mirroring environment.

I finally managed to parse the signature and commit message from FETCH_HEAD and successfully create a commit. However, this worked only with one commit; when there were multiple commits, it failed again. So I think I need to iterate over all commits between HEAD and FETCH_HEAD and create a commit for each of them separately.

What’s next?

For now I have been focusing on making it work, but once the basics are in shape, I believe this is going to be quite useful for me. I can have a list of repositories to clone and update automatically every day. This prevents me from losing any data.

This is just a recent thing I’m working on, so unfortunately there is not much more code to share. But I’ll post a link here as soon as I manage to merge data. For now, thanks to all the people writing awesome apps, libs, blogs and helpful information.

EDIT 2021-05-10 Gitclone Python version

I published my Gitclone Python script. I decided to run it on another server, where I have a proper git command available, and at the same time investigate a longer-term solution.

I don’t know yet what that solution will be, but so far my options are:

  • libgit2 based solution
  • get git working on my NAS box
  • a new NAS box where I can have newer git and Python
  • some other host to serve all this