Copilot gives an idea why Microsoft paid so much for GitHub. They were after data: tons of food for their AI, and millions of contributors that now 'work' for MS for free.
You publish your code under GPLv3, even AGPLv3? So what? The AI learns from your code and uses it to generate code that is possibly proprietary. Does the GPL forbid this practice? (I don't think so)

That's the M$ way to break copyright law.

It's time for alternatives like @codeberg .

@t0k Is there anything that could stop MS from cloning random repos from the web and feeding its AI with them? Nope. That does not sound like a plausible reason for why they bought GitHub.
Also, that's a bit like claiming that all code that I will write in my life is GPL if I learned to write code from looking at GPL code.

@Alexmitter 1) Private repos on other platforms cannot just be cloned. Owning GH makes a difference there. Also this way they get *everything*, including metadata, social network etc.

Also: Try cloning all public GH content sometime. Your IP will get blocked quickly.

2) GPL allows you explicitly to study the code and learn from it.

The AI does some sort of statistical copy-pasting on a scale that is not comparable to the human brain.

@t0k 1. Why should MS have any interest in private repos?

And I can clone all public GH content, I just should not do that all at once.

2. Not for nothing is programming often nicknamed the art of copy-pasting from Stack Overflow. Humans copy-paste code all the time.

@Alexmitter

1. It's called industrial #espionage.

But with #Copilot, #Microsoft will get access to codebases that are NOT under #GitHub.

The editor will send them the sources, one file after another.

@t0k

@Shamar 1. There is no need for Microsoft to use such a feature for large-scale industrial espionage; they have around a million easier ways to do such a thing. Do you guys not even try to think before you type? Seriously.

@Alexmitter

How many ways are... fully legal like this?

You literally send them your code! On purpose! I mean... ok it's fooling dumb boys, but it's legal! You can't complain after!

@Shamar With such a thing come similar terms and conditions to both parties as in the hosting of private repos for companies or any kind of servers as a service.
It is not like Microsoft would have any benefit from a large set of random code files; that is less useful than the snippets of Windows source code floating around on the web.

@Alexmitter

Yeah, because #Microsoft is unable to automatically relate single files from a large project.

I mean... that would need NLP experts and a lot of expensive hardware! 🤣

Except in those languages that declare the module name at the beginning of each file, at least.

But sure, they "shall do no evil", right? 😇

@Shamar I am not saying that Microsoft is a good company. I am saying that your accusations on them abusing this service are idiotic at best and purely moronic at worst. It is pure over-engineering to accomplish something that would be illegal by law, completely pointless for Microsoft and a total waste of time and money to them.

Pulling weird accusations against Microsoft out of your arse hurts everyone who wants to bring up actual arguments against them. Stop it.

@Alexmitter

Dude, you know nothing about #US hegemony in technology.

This would be in no way the worst they did or do.

@Shamar Again, they have easier, better, more cost-effective ways to do industrial espionage.

@Alexmitter Whether MS is actually *violating* copyright law is an interesting question. Maybe they just hacked it.

@Alexmitter Of course plain code was not the only reason for the acquisition, but next to the millions of "produsers" and future lock-in effects it is a very valuable asset.

@t0k It's just one of many git frontends; you make it sound like it's more of a deal than it really is.

@Alexmitter The reason why I don't like MS and what they do is not based on law. It's of an ethical nature. There's a big imbalance between BigTech and small code writers. The latter, the "produsers", produce the value for BigTech yet get not much more than fancy UIs and a little bit of disrespect in return.

(GPL was a way to object to this imbalance.)

@t0k This is not a discussion of whether Microsoft acts ethically or not. This is about copyright law, the GPL and your theory on the purchase of GitHub.

@t0k I think you're right that they wanted the data, but I also think they wanted the potential customer base to sell into.

I don't understand what you mean by breaking copyright law, though. Plenty of the code on Github isn't owned by the people uploading it, and therefore they cannot give Microsoft any more rights to it than they would have had if Microsoft had just downloaded it from Github without needing to buy Github.

@freakazoid Circumventing copyright law is probably a better wording.

There are two somewhat independent things that nevertheless play together:
1) Microsoft got access to an enormous code base. Even though many repos on GH are public, accessing and indexing ALL of it as a third party is difficult (for example, GH blocks your IP).

@freakazoid
2) Circumventing copyright law: I believe that Copilot basically does some context-sensitive statistical copy-pasting. In some sense this automatically rewrites existing code (protected by copyright) into another piece of code.
Like a compiler translates code into binary. There's an automated transformation process. The question is: Is the output also protected by copyright? In the case of compilers it is.
I speculate that in the view of MS this transformation strips away copyright.

@t0k That's interesting, given that in the past MS has had a fairly maximalist interpretation of GPL, where they would not allow their engineers to even LOOK at GPL code for fear of subjecting their other code to potential copyright claims. But perhaps they believe that training algorithms is the equivalent of "clean room" reverse engineering, where one team documents the thing, then leaves, and another team uses the documentation to implement a new thing. But I would find that surprising.

@t0k Google Books got shut down because it wasn't considered permissible to even make copyrighted works searchable without permission. But in that case copying wasn't allowed at all.

I suspect that there will be a big legal fight over just this issue when companies start using AIs to produce works that would be copyrightable if they were produced by a human. Is AI-generated music trained on pop songs a derivative work of every song?

@t0k If that's the case, wouldn't every piece produced by a composer who had ever listened to copyrighted music also constitute a derivative work? For music, the standard has been the amount of the song that sounds the same, and it's primarily focused on "sampling."

Maybe similar could be applied to software, but I would personally oppose expanding IP even further. I don't think we need IP at all.

@t0k while I strongly agree with your sentiment, I disagree with your conclusion: the fact is they could have done it just as easily (and cheaper!) by pulling code from Codeberg, GitLab, 0xacab, or any other place where code is publicly available.

The food for their AIs is right there for the taking, whether they own GitHub or not.

@marie_joseph

Not sure it's needed.

How could an algorithmic transformation of copyrighted material NOT be a derivative work?

It would be a giant loophole in #Copyright: you would just zip an mp3 and it would be public domain!

@t0k

@Shamar @t0k if one understands that ML algorithms are fundamentally just algorithms like any other, sure, but do the people writing, interpreting, and enforcing copyright laws understand this? Because it seems most people in general do not.

@marie_joseph

I think it's up to us to explain this.

A long time ago I started using "statistical programming" instead of AI, ML, DL and similar anthropomorphic locutions.

I always explain that ANNs are just peculiar virtual machines whose programs (the numerical matrices of weights and activation thresholds) are programmed statistically through data samples.

As far as I can see, people understand this quite well.

The calibration of an ANN (improperly named "training") is just a form of compilation: the readable data are turned into opaque binaries that only that specific topology can execute "correctly" (for whatever "correctly" can mean in this context).

Calling it "statistical programming" also makes clear the programmers' responsibility and how fragile and buggy the opaque programs are.

@t0k

@t0k We'd need a court case, but I don't see any sound argument where this isn't creating output that'd be under the GPL.

@t0k

Just had a discussion about this around the water cooler at work (Teams chat), and the instrumentation that shares code back to the M-ship makes it a nonstarter here: security risk.

Nobody can stop anybody from learning from open source code and applying that knowledge to create closed code. Microsoft could have done this any time they wanted without buying Github until Github found out, at which time Github could have ended the experiment.

Buying Github guaranteed continued access.

@t0k Code produced by an AI trained on code is a derivative work, and the copyright still holds.

Their lawyers may think it's worth a gamble, since this hasn't been tested in court yet, but I think there's an excellent chance the findings would be for the owner of the original code.

That said, how would anyone find out?

@raboof Yes! But I haven't been able to think of a suitable way to make trap code—and it's quite likely the generated code would be closed-source anyhow, making discovery less likely.

@t0k was this a crosspost? that username is broken.

@snailerotica No, it was not a cross-post. Probably it was me messing up. I meant to mention @codeberg .

@t0k

I’m not a lawyer and I don’t have confidence over my legal understanding of the matter, so I’m just gonna leave these here as some extra argument and discussion on the topic:

juliareda.eu/2021/07/github-co

@Mehrad Thanks! I'm also not a lawyer, but I sense some false statements in the article. For instance, the article claims that copyleft would not be necessary if copyright law did not exist at all. That is imprecise at best, false at worst: for example, copyleft also enforces back-contributions to projects. Without copyright law this would not be possible in the same way. Maybe the intention was to say "without copyright law copyleft cannot exist". But then we're lost in contradictions.

@t0k
> For example copyleft also enforces back-contributions to projects

Unfortunately this is only a myth.

You are just required to deliver the code to whomever you deliver the software. You are not required to back-contribute anything.
@Mehrad

@kirschwipfel @Mehrad Yeah, true for GPL... Depends on the trigger condition though. For instance, AGPL already triggers on user interaction. I think technically it would be possible to create a copyleft licence different from the GPL family which triggers earlier. But that slowly goes out of my area of expertise.

@t0k I don't think that the licence chosen by the developer has anything to do with that as I'm quite sure there's some clause in Github's Terms of Use allowing something like that.

@pixelcode I'm afraid of that too. Would be good to find out actually.
The question is: What is stronger in court? The abusive Terms of Use backed by $$ or people's intellectual property rights?

@t0k When creating a GitHub account the user accepts those terms so Microsoft would most likely win 😐

@t0k I've heard this line of argument before but it does not make sense to me. Owning Github doesn't give MS special access to the public code there. They can scrape Codeberg too if they like. Also, citation needed for your implication that Copilot will violate license terms.

@pbx
1/2) Owning GH does give MS special access. Try to scrape all GitHub content yourself. You'll see it's not so easy. You get blocked before you have even got a tiny fraction.
Probably the bandwidth of content pushed to GitHub is already hard to keep up with.
Not even talking about all the metadata.

Scraping Codeberg might work since it is so small. GH is some orders of magnitude bigger.

@pbx 2/2) I don't have a citation because I'm not citing anybody. Maybe I should rephrase to "That's the M$ way to circumvent copyright law", since I'm not aware of any precedent.
However, I must assume that the AI does some sort of statistical and context-aware copy-paste. It can in no way be compared to what a human brain would do.
Then, when you process code with a compiler, the binary output is also protected by copyright. That can be compared to what Copilot is doing.

@t0k GitHub doesn't have to care about any license you're using. When pushing code to GitHub you're giving them a license, see https://docs.github.com/en/github/site-policy/github-terms-of-service#4-license-grant-to-us

@lanodan @t0k Do note that the license is to distribute, not to use, and I believe you would need a license to use it for training an ML model.

@ignaloidas @t0k Uuuh.

> If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).
> You may grant further rights if you adopt a license. If you are uploading Content you did not create or own, you are responsible for ensuring that the Content you upload is licensed under terms that grant these permissions to other GitHub Users.
@ignaloidas @t0k Actually now that I'm even more carefully reading it than usual… are they throwing out the restrictions on licenses?

@lanodan @t0k If that is the case (which I believe it is not), then they can use any AGPL'd code hosted there, from what I read. Which is very obviously wrong.

@lanodan Thanks! That's a good point. So then, this is the actual reason to switch to alternatives like @codeberg .

@t0k additionally, don't depend on the law to protect your IP. take reasonable precautions and make sure you trust your supply chain and infrastructure. read the terms of service &c.

personally, i'm on a different side. i directly oppose copyright. i don't necessarily like what github is doing here, but i've also been careful not to give them copies of my code. i think assuming copyright cannot or will not be enforced gives a more careful perspective, it means you have to trust your host to share your interests.

https://xj-ix.luxe/chaotic-software/

@welt Yeah, there are many good alternatives actually. And with a federated issue tracker etc. this is all gonna be awesome! 😄

@welt Yes I know 😉 . But as far as I know the social networking features like issue tracking and discussions are not federated. And that's in the end why I'm using something like Gitea at all. Otherwise I would just use plain git without any Web UI.

Honestly I never tried the mail as suggested in the article. But somehow I just cannot imagine how it can replace the structure that an issue tracker can give. Correct me if I'm wrong.

@t0k There's a webui for email that SourceHut has, as well as a ticketing system that's pretty much just an issue page.