Instructions for using bigcode/starcoder with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bigcode/starcoder with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoder")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
```
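To sanity-check the pipeline, you can generate a completion for a short code prompt. A minimal sketch, with an illustrative prompt and generation settings that are not part of the model card (note that bigcode/starcoder is a gated checkpoint, so you may need to accept its license on the Hub and log in before the weights can download):

```python
# Quick generation check with the pipeline loaded above.
# The prompt and sampling settings below are examples, not recommended defaults.
prompt = "def fibonacci(n):"
outputs = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.2)
print(outputs[0]["generated_text"])
```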
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigcode/starcoder with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigcode/starcoder"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "bigcode/starcoder",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
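The same OpenAI-compatible endpoint can also be queried from Python. A minimal sketch using the `openai` client package (assumes `pip install openai` and the server started above on localhost:8000; the prompt and settings are illustrative, and vLLM ignores the API key by default):

```python
# Query the vLLM server's OpenAI-compatible completions endpoint.
# Assumes the server above is running on localhost:8000; the API key is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="bigcode/starcoder",
    prompt="def fibonacci(n):",  # illustrative prompt
    max_tokens=128,
    temperature=0.2,
)
print(completion.choices[0].text)
```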
Use Docker

```bash
docker model run hf.co/bigcode/starcoder
```
- SGLang
How to use bigcode/starcoder with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "bigcode/starcoder" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "bigcode/starcoder",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
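As with vLLM, the SGLang server can be called from Python instead of curl. A minimal sketch using `requests` (assumes the server started above is listening on localhost:30000; the prompt and settings are illustrative):

```python
# Call the SGLang server's OpenAI-compatible completions endpoint.
# Assumes the server above is listening on localhost:30000; values below are examples.
import requests

response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "bigcode/starcoder",
        "prompt": "def fibonacci(n):",
        "max_tokens": 128,
        "temperature": 0.2,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```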
Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "bigcode/starcoder" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "bigcode/starcoder",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
- Docker Model Runner
How to use bigcode/starcoder with Docker Model Runner:
```bash
docker model run hf.co/bigcode/starcoder
```
🚩 Report: Legal issue(s)
StarCoder is trained on The Stack v2.0, which was built illegally on copies of copyrighted code with no license. It does not honor opt-out requests made in the last year, which is most of them.
Per https://huggingface.co/spaces/bigcode/in-the-stack, my copyrighted repositories such as https://github.com/arborelia/advent2020 are still being distributed as part of The Stack v2.0.1. It tells me that the code has been removed from The Stack v2.1, but:
- HuggingFace is still distributing v2.0.1
- StarCoder2 is trained on v2.0.1, not v2.1, despite the clause in The Stack requiring users to update
- Both v2.0.1 and v2.1 still contain copyrighted code with no license, as well as code with attribution clauses that it does not attribute correctly. Copyright is not an "opt-out" system in the first place.
It appears that, between The Stack v1 and v2, HuggingFace chose to ignore the copyright status of code, including code I have written that HuggingFace has no license to use and that I have explicitly asked HuggingFace not to use.
HuggingFace must cease distributing The Stack v2 and StarCoder2.