GitHub-backup: backs up everything GitHub knows about a repository or a user

joeyh · on Feb 8, 2016

There are a lot of different distributed issue trackers built on top of git. None widely known or used and all incompatable. If GitHub included issues and pull requests in a git repo (either in branch or in a separate git repo as they do for wikis), it would become an instant de-facto standard.

With such a standard, many new things would immediately spring up to use it. Bug reporting, and managing bug reports, and forwarding bug reports between related projects is a big pain point, largely due to the fractured mess of bug tracking systems. A thousand flowers would bloom.

The only reason GitHub has to not do this, as far as I can tell, is that keeping the issues locked in their silo makes it a little bit harder for competition to migrate repositories from GitHub. Although I hear GitLab migrates issues anyway, and the API makes this not hard (although perhaps needing lots of time due to rate limiting). And of course, they probably slapped a SQL database down on day 1 for issues without thinking too much about it, and so it would be some effort to move them into the git repo now.

Not holding my breath, which is why I wrote github-backup. Well, also because it recursively backs up forks, and I've lost changes in deleted forks enough times to want to back them up automatically.

sytse · on Feb 8, 2016

At GitLab we indeed have an importer for GitHub issues (so you can import repos, pull requests and wikis in one go). But issues in a git repo would be awesome. My thoughts on this are in https://gitlab.com/gitlab-org/gitlab-ce/issues/4084

jstepka · on Feb 8, 2016

What you're after is something like Fossil;

* http://www.fossil-scm.org/

... which is a distributed version control system that embeds issue and wiki data.

What would be a great next step for version control (similar to how we went from cvs to svn to git) would be to embed issue and PR data into the repository data structure.

I haven't worked on Git / Bitbucket in a while (at Docker now), so I haven't been tracking it very close... what would be ideal, maybe it's already there... would be to design and build on the popularity of Git with a module that stores the issue / PR data giving you true portability. Over time that module could be made mandatory giving you the flexibility you're after.

snuxoll · on Feb 8, 2016

Never heard of fossil until now, looks like an interesting project. I can see the reasoning for the lack of a rebase command but man things like git merge --squash make our commit history so much cleaner (but again, the argument raised is that the developers care about history as it happened, not as they wanted it to happen).

toomuchtodo · on Feb 8, 2016

Would you mind if I integrated this into Archive Team's ArchiveBot? It currently uses youtube-dl to grab audio and video content, and your tool would be very helpful for snapshotting Github.

joeyh · on Feb 8, 2016

Of course you're welcome to do that. https://joeyh.name/blog/entry/I_am_archiveteam/ ;)

toomuchtodo · on Feb 8, 2016

OH MY! :)

codemac · on Feb 8, 2016

Do you have any suggestions for what you think distributed issue trackers should look like?

My favorite was simple defects with the prophet database backend, but it was very "outside" of git and this gave people heartache.

Are there any that you have liked? Ikiwiki changed what I thought version control could be, but I never really used it for a project because I struggled with spam. I'd love to hear your thoughts on the more targeted issue tracker approaches.

WorldMaker · on Feb 8, 2016

I've experimented with a number of them over the years. I even put some effort into my own.

Of the off-the-shelf ones, I like the YAML-based ones the best for the artifacts they store in the repository. ditz [1] is the grand-daddy in that space.

However, given that the majority of an issue is formatted text, I think the best ones are actually based on a markup language of some sort with an easy to parse frontmatter. The one I built way back when was based on ReStructuredText and used it's nice, easy to RegEx, definition list format as frontmatter. These days I'd probably take the semi-standard "Jekyll" frontmatter approach of YAML+Markdown, and I'm somewhat surprised there still hasn't been a big tool yet (that I've seen) take that approach. (That said, you can admirably fake it with a custom Jekyll collection as-is, so maybe it's something we could build as an interesting template and/or plugin...)

[1] http://ditz.rubyforge.org/

SOLAR_FIELDS · on Feb 8, 2016

Your third point is absolutely the reason that they don't have this. This is a tool on Github that helps people migrate away from Github.

mintplant · on Feb 8, 2016

Thanks for this!

A couple years ago I was having trouble falling asleep at night. I found a series of music mixes that helped me drift off, and they were working well until one day the creator decided to scrub every trace of themselves from the internet and disappear. I generally believe that people should be able to delete what they put online, but ever since then I've maintained my own personal archive of things I wouldn't want to lose access to forever.

I use HTTrack for backing up static sites, youtube-dl for YouTube, SoundCloud, and the like, and now I'll be using this for repos on GitHub. Any more good archiving tools?

meowface · on Feb 8, 2016

>I generally believe that people should be able to delete what they put online

I might get some flak for this, but I think the opposite. I think when someone posts something online, it should remain online forever. It's a tragedy that information ever gets deleted. The fact that someone could publish something thousands or even millions of people enjoy and then take it away from them is a shame. Even if it's something only a handful of people enjoy.

Even in cases of blatant slander, I don't think censorship or deletion is ever justifiable. In cases of proven slander/libel, some sort of bright red notice should be placed above and below the content indicating a court found it to be false and libelous.

People do make mistakes, but I think it's better for information to always be "append-only". If I had some old embarrassing blog posts, I wouldn't delete them, but rather add a warning or disclaimer that I no longer hold those views and regret making that post.

I have a lot of embarrassing things about me on the Internet that I wouldn't want an employer to find, from when I was much younger, but which they can find through careful Googling. I just have to accept those things happened, and to try and make a case for why these were mistakes from when I was a teenager and not how I am today.

I totally support archives and web scraping, and am appalled by the EU's "right to be forgotten" ruling.

breakingcups · on Feb 8, 2016

There can be a lot of reasons to want to delete something you've posted online, a stalker for example. I think it's careless to dismiss all such reasons in one swoop.

meowface · on Feb 9, 2016

I agree that's a valid reason, but on principle, I think they still shouldn't have that option. They should address the stalking directly. Removing content isn't going to dissuade a stalker; it'll probably only make them more interested, honestly.

madawan · on Feb 8, 2016

You'll still need to find ways to deal with doxxing and child pornography. Also revenge porn.

meowface · on Feb 9, 2016

Yes, there are certainly some exceptions.

ivank · on Feb 8, 2016

> Any more good archiving tools?

For backing up websites, I released https://github.com/ludios/grab-site. Compared to HTTrack, it makes it easy to add ignores to skip unwanted URLs after a crawl has already started. It also saves to WARC instead of trying to fit the site to an on-disk directory structure, which is not always possible or useful (e.g. directory with > 100K files).

userbinator · on Feb 8, 2016

Not really a "tool" per se, but the vast collection of archive.org is also worth mentioning. There's plenty of great reading, listening, and watching material there, and it also crawls and archives websites too.

NotUsingLinux · on Feb 8, 2016

This shows (again) the fundamental issue with digital identity.

People try to save their data or 'status' like stars from one provider to another, while painfully experiencing that vendors only lockin their data, but it belongs to them.

It will be very interesting when there rises a plattform for digital trust which gians enough users. Could be comming from the programming community first.

Does anyone see things like GitTorrent as a soulution to the problem?

https://github.com/cjb/GitTorrent

slantedview · on Feb 8, 2016

So you backup things like stars. Can you restore?

blainesch · on Feb 7, 2016

In the "why" section you list 3 reasons.

> In case something happens to GitHub. More generally because keeping your data in the cloud and relying on the cloud to back it up is foolish

I disagree, cloud backups are more reliable.

> In case someone takes down a repository that you were interested in. If you run github-backup with your username, it will back up all the repositories you have watched and starred.

If the repo goes down it's already going to be a lot harder to use, but it seems easier to fork a repo.

> So you can keep working on your repository while on a plane, or on a remote beach or mountaintop. Just like Linus intended.

When have you not been able to work on a local repo locally?

Overall, I think I'm missing the point.

pyre · on Feb 7, 2016

> I disagree, cloud backups are more reliable.

You're taking this out of context. It does not download this data from Github, and then remove the data from Github. So the data is still "backed up" in the cloud, but is also now backed up locally. How is this not more reliable?

This also does not preclude the ability to push your backup to another cloud provider (e.g. S3, Dropbox, etc) to distribute your backups across the cloud.

> If the repo goes down it's already going to be a lot harder to use, but it seems easier to fork a repo.

A fork does not maintain things like comments and issues on the original repo, which github-backup does backup.

> When have you not been able to work on a local repo locally?

Github does not stick things like Github Comments or Github Issues directly into your repository. If you are (e.g.) relying on Github Issues to track things to do on your repo, then you're out of luck on a plane without access to that.

> Overall, I think I'm missing the point.

It seems like you've ignored the list of things that github-backup actually does, and commented on it from that point of view. He lists the things that github-backup does, but you have commented on none of them. For example, why did I have to explain to you that it backs up Github Comments and Github Issues on your repository? Were you under the impression that these things were already in your repo? Do you use Github without these things to the point were you didn't know that they existed, or can't see the use-case where someone would use them? I'm a little confused.

detaro · on Feb 7, 2016

Not the author, but possible answers:

> I disagree, cloud backups are more reliable.

Cloud backups, maybe. Data that's only in the cloud (e.g. on GitHub) is not more reliable than data that is in the cloud and backed-up by you to somewhere else (to a different cloud, if you want).

> If the repo goes down it's already going to be a lot harder to use, but it seems easier to fork a repo.

Which means you have to maintain an up-to-date fork, you loose the issues etc from the original repo, and if a repo is DMCAd your GitHub-fork is gone as well.

> When have you not been able to work on a local repo locally?

If you wanted to read the issues, or PRs by other people that you didn't manually copy.

I really like the project, it should make it easier to make sure I always have a copy of important stuff available. Just need to figure out what's the best way of telling it which repos I care enough about (I star to much stuff).

EDIT: take a look at the issues though, it has some annoying limitations

lucaspiller · on Feb 8, 2016

> In case something happens to GitHub. More generally because keeping your data in the cloud and relying on the cloud to back it up is foolish

Instead of the cloud, I read that as "the hands of a company that may get sold, shutdown, agendas may change, etc". I don't think that will happen with GitHub anytime soon, but 10 years ago I would have said the same about Sourceforge :-)

fapjacks · on Feb 8, 2016

That is usually how it goes, isn't it? People kill the golden goose because they start to think their platform is invincible and the laws of physics don't apply to them anymore. Then they become vulnerable to all kinds of problems.

realusername · on Feb 7, 2016

It could help for example in the case of a DCMA takedown, thanks to the backup, you won't lose all the issues this way.

sebastianavina · on Feb 8, 2016

in case you dont want to clone a repo... i guess...