#Github #Copilot gives an idea why #Microsoft paid so much for Github. They were after data: Tons of food for their AI, millions of contributors that now 'work' for MS for free.
You publish your code under GPLv3, even AGPLv3? So what? The AI learns from your code and uses it to generate code that is possibly proprietary. Does #GPL forbid this practice? (I don't think so)
That's the M$ way to break copyright law.
It's time for alternatives like @codeberg .
@t0k Is there anything that could stop MS from cloning random repos from the web and feeding its AI with them? Nope. So that does not sound like a plausible reason for why they bought GitHub.
Also, that's a bit like claiming that all code that I will write in my life is GPL if I learned to write code from looking at GPL code.
@Alexmitter 1) Private repos on other platforms cannot just be cloned. Owning GH makes a difference there. Also, this way they get *everything*, including metadata, the social network, etc.
Also: Try once to clone all public GH content. Your IP will get blocked quickly.
2) GPL allows you explicitly to study the code and learn from it.
The AI does some sort of statistical copy-pasting on a scale that is not comparable to the human brain.
@t0k 1. Why should MS have any interest in private repos?
And I can clone all public GH content, I just should not do that all at once.
2. It's not for nothing that programming is often nicknamed the art of copy-pasting from Stack Overflow. Humans copy-paste code all the time.
@Shamar 1. There is no need for Microsoft to use such a feature for large-scale industrial espionage; they have around a million easier ways to do such a thing. Do you guys not even try to think before you type? Seriously.
How many ways are... fully legal like this?
You literally send them your code! On purpose! I mean... ok it's fooling dumb boys, but it's legal! You can't complain after!
@Shamar With such a thing come similar terms and conditions to both parties as in the hosting of private repos for companies or any kind of servers as a service.
It is not like Microsoft would have any benefit from a large set of random code files; that is less useful than the snippets of Windows source code floating around on the web.
Yeah, because #Microsoft is unable to automatically relate single files from a large project.
I mean... that would need NLP experts and a lot of expensive hardware! 🤣
Except in those languages that declare the module name at the beginning of each file, at least.
But sure, they "shall do no evil", right? 😇
@Shamar I am not saying that Microsoft is a good company. I am saying that your accusations about them abusing this service are idiotic at best and purely moronic at worst. It would be pure over-engineering to accomplish something that is illegal, completely pointless for Microsoft, and a total waste of their time and money.
Pulling weird accusations out of your arse against Microsoft hurts everyone who wants to bring up actual arguments against them. Stop it.
@Alexmitter Whether MS is actually *violating* copyright law is an interesting question. Maybe they just hacked it.
@Alexmitter Of course plain code was not the only reason for the acquisition, but next to the millions of "produsers" and future lock-in effects it is a very valuable asset.
@t0k It's just one of many git frontends; you make it sound like it's more of a deal than it really is.
@Alexmitter The reason why I don't like MS and what they do is not based on law. It's of an ethical nature. There's a big imbalance between BigTech and small code writers. The latter - the "produsers" - produce the value for BigTech yet get not much more than fancy UIs and a little bit of disrespect in return.
(GPL was a way to object to this imbalance.)
@t0k This is not a discussion about whether Microsoft acts ethically or not. This is about copyright law, the GPL and your theory on the purchase of Github.
@t0k I think you're right that they wanted the data, but I also think they wanted the potential customer base to sell into.
I don't understand what you mean by breaking copyright law, though. Plenty of the code on Github isn't owned by the people uploading it, and therefore they cannot give Microsoft any more rights to it than they would have had if Microsoft had just downloaded it from Github without needing to buy Github.
@freakazoid Circumventing copyright law is probably a better wording.
There are two somewhat independent things that nevertheless play together:
1) Microsoft got access to an enormous code base. Even though many repos on GH are public, accessing and indexing ALL of it as a third party is difficult (for example, GH blocks your IP).
2) Circumventing copyright law: I believe that #Copilot basically does some context-sensitive statistical copy-pasting. In some sense this automatically rewrites existing code (protected by copyright) into another piece of code.
Like a compiler translates code into binary. There's an automated transformation process. The question is: Is the output also protected by copyright? In the case of compilers it is.
I speculate that in the view of MS this transformation strips away copyright.
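To make the "context-sensitive statistical copy-pasting" idea concrete, here is a deliberately tiny sketch. It is only a toy bigram model, not a description of how Copilot works internally; the corpus and the names `build_table` and `complete` are made up for this illustration.

```python
from collections import Counter, defaultdict

def build_table(corpus_lines):
    """Count, for every token, which tokens followed it in the corpus."""
    table = defaultdict(Counter)
    for line in corpus_lines:
        tokens = line.split()
        for current, following in zip(tokens, tokens[1:]):
            table[current][following] += 1
    return table

def complete(table, prompt, max_tokens=10):
    """Extend a non-empty prompt by repeatedly appending the most frequent follower."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        candidates = table.get(tokens[-1])
        if not candidates:
            break  # this token was never followed by anything in the corpus
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)

# Hypothetical "training" code, standing in for lines from public repositories.
CORPUS = [
    "for item in items : total += item",
    "for item in items : print ( item )",
    "for item in items : total += item",
]

table = build_table(CORPUS)
completion = complete(table, "for item", max_tokens=6)
print(completion)
```

In this toy setting the completion of `for item` reproduces one of the corpus lines verbatim, which is the situation the copyright question above is about. Whether a large model behaves like this in practice is exactly what is being debated in this thread.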
@t0k That's interesting, given that in the past MS has had a fairly maximalist interpretation of GPL, where they would not allow their engineers to even LOOK at GPL code for fear of subjecting their other code to potential copyright claims. But perhaps they believe that training algorithms is the equivalent of "clean room" reverse engineering, where one team documents the thing, then leaves, and another team uses the documentation to implement a new thing. But I would find that surprising.
@t0k Google Books got shut down because it wasn't considered permissible to even make copyrighted works searchable without permission. But in that case copying wasn't allowed at all.
I suspect that there will be a big legal fight over just this issue when companies start using AIs to produce works that would be copyrightable if they were produced by a human. Is AI-generated music trained on pop songs a derivative work of every song?
@t0k If that's the case, wouldn't every piece produced by a composer who had ever listened to copyrighted music also constitute a derivative work? For music, the standard has been the amount of the song that sounds the same, and it's primarily focused on "sampling."
Maybe similar could be applied to software, but I would personally oppose expanding IP even further. I don't think we need IP at all.
@t0k while I strongly agree with your sentiment, I disagree with your conclusion: the fact is they could have done it just as easily (and cheaper!) by pulling code from CodeBerg, Gitlab, 0xacab, any other place where any code is publicly available.
The food for their AIs is right there for the taking, whether they own GitHub or not.
I think it's up to us to explain this.
A long time ago I started using "statistical programming" instead of AI, ML, DL and similar anthropomorphic locutions.
I always explain that ANNs are just peculiar virtual machines whose programs (the numerical matrices of weights and activation thresholds) are programmed statistically through data samples.
As far as I can see, people understand this quite well.
The calibration of an ANN (improperly named "training") is just a form of compilation: the readable data are turned into opaque binaries that only that specific topology can execute "correctly" (for whatever "correctly" can mean in this context).
Calling it "statistical programming" also makes clear the programmers' responsibility and how fragile and buggy the opaque programs are.
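A minimal sketch of that view, assuming a single artificial neuron and the classic perceptron update rule (my choice for the illustration, not something claimed above): the "program" is just two weights and a threshold, and it is produced from data samples instead of being written by hand. The names `run` and `calibrate` are hypothetical.

```python
def run(weights, threshold, inputs):
    """Execute the 'program' (weights + threshold) on one input vector."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def calibrate(samples, epochs=20, rate=1):
    """Perceptron rule: adjust weights and threshold whenever the output is wrong."""
    weights = [0, 0]
    threshold = 0
    for _ in range(epochs):
        for inputs, expected in samples:
            error = expected - run(weights, threshold, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            threshold -= rate * error
    return weights, threshold

# Data samples describing logical AND; nobody writes the rule itself.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, threshold = calibrate(AND_SAMPLES)
outputs = [run(weights, threshold, inputs) for inputs, _ in AND_SAMPLES]
print(weights, threshold, outputs)
```

The resulting numbers are the "opaque binary": they only mean something when executed by `run` with this exact topology, and nothing in them tells you which samples they were calibrated from.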
@t0k We'd need a court case, but I don't see any sound argument where this isn't creating output that'd be under the GPL.
Just had a discussion about this around the water cooler at work (Teams chat), and the instrumentation that shares code back to the M-ship makes it a nonstarter here: security risk.
Nobody can stop anybody from learning from open source code and applying that knowledge to create closed code. Microsoft could have done this any time they wanted without buying Github until Github found out, at which time Github could have ended the experiment.
Buying Github guaranteed continued access.
@t0k Code produced by an AI trained on code is a derivative work, and the copyright still holds.
Their lawyers may think it's worth a gamble, since this hasn't been tested in court yet, but I think there's an excellent chance the findings would be for the owner of the original code.
That said, how would anyone find out?
@raboof Yes! But I haven't been able to think of a suitable way to make trap code—and it's quite likely the generated code would be closed-source anyhow, making discovery less likely.
I’m not a lawyer and I don’t have confidence over my legal understanding of the matter, so I’m just gonna leave these here as some extra argument and discussion on the topic:
@Mehrad Thanks! I'm also not a lawyer, but I sense some false statements in the article. For instance, the article claims that copyleft would not be necessary if copyright law did not exist at all. That is imprecise at best and false at worst: for example, copyleft also enforces back-contributions to projects. Without copyright law this would not be possible in the same way. Maybe the intention was to say "without copyright law copyleft cannot exist". But then we're lost in contradictions.
@kirschwipfel @Mehrad Yeah, true for GPL... It depends on the trigger condition, though. For instance, AGPL already triggers on user interaction. I think it would technically be possible to create a copyleft licence different from the GPL family which triggers earlier. But that slowly goes beyond my area of expertise.
@pixelcode I'm afraid of that too. Would be good to find out actually.
@t0k When creating a GitHub account the user accepts those terms so Microsoft would most likely win 😐
@t0k I've heard this line of argument before but it does not make sense to me. Owning Github doesn't give MS special access to the public code there. They can scrape Codeberg too if they like. Also, citation needed for your implication that Copilot will violate license terms.
1/2) Owning GH does give MS special access. Try to scrape all Github content yourself. You'll see it's not so easy. You get blocked before you even got a tiny fraction.
Probably the sheer bandwidth of content pushed to Github is already hard to keep up with.
Not even talking about all the metadata.
Scraping Codeberg might work since it is so small. GH is some orders of magnitude bigger.
@pbx 2/2) I don't have a citation because I'm not citing anybody. Maybe I should rephrase to "That's the M$ way to circumvent copyright law", since I'm not aware of any precedent.
However, I must assume that the AI does some sort of statistical and context-aware copy-paste. It can by no way be compared to what a human brain would do.
Then, when you process code with a compiler, the binary output is also protected by copyright. That can be compared to what Copilot is doing.
@welt Yeah, there are many good alternatives actually. And with a federated issue tracker etc. this is all gonna be awesome! 😄
@welt Yes I know 😉 . But as far as I know the social networking features like issue tracking and discussions are not federated. And that's in the end why I'm using something like Gitea at all. Otherwise I would just use plain git without any Web UI.
Honestly I never tried the mail as suggested in the article. But somehow I just cannot imagine how it can replace the structure that an issue tracker can give. Correct me if I'm wrong.