"PyPI is growing fast. If this dangerous expansion not stopped, our advanced machine learning models predict that in only 8 years the number of packages will outnumber human beings."
This is one of the funniest things I've read all week.
The WITNESS THIS INEVITABLE FUTURE button that slowly, then steadily started increasing the date on the graphs is executed very well. Truly hysterical, I laughed out loud.
Thank you, I’m glad you appreciate it. I spent way more time on it than I would like to admit. The original version actually did run a small model trained on the data in the browser to generate the predictions.
Security researchers wanting to investigate what kinds of credentials other people are looking for, for example.
Or as another example, someone who wants to research how often people copy paste example config files without replacing things like credentials given in the example.
So you found 57 live AWS keys[1]. That's out of 1631 according to your stats (If I didn't misread something). I wonder how many of the top two in the list are live :D
I know that for myself and my team, we often use placeholder values for config files, but they would usually be tokenized, so it's easy to tell they were placeholders. For example "%%USERNAME%%" or "%%PASSWORD%%".
> If this dangerous expansion not stopped, our advanced machine learning models predict that in only 8 years the number of packages will outnumber human beings.
Incredibly there seems to be no sign yet of an S-curve type saturation (which would be only normal at this point of the Python hype cycle).
It would be interesting to control for that by repeating the analysis with other languages, to see whether the exponential growth is Python-specific or really the exponential growth of all open source.
Python’s growth is not limited by the number of software developers, as is the case for Java or what have you. It’s used by many other professions, and it just keeps exploding.
I've been thinking about this today. Function equivalence is one thing, but detecting similar code fragments is another.
Like why not assign a character to each keyword: 'if' = 'i', 'while' = 'w', 'true' = 't', etc. Then reduce code to this, removing all whitespace, comments, and identifiers, so `if True: while True: pass` would become `itwtp`.
Then the similarity is just an edit distance between the two resulting strings?
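Something like this minimal sketch, maybe (the keyword-to-letter mapping and helper names here are just illustrative):

    import io
    import keyword
    import string
    import tokenize

    # Assign an arbitrary letter to each keyword.
    KEYWORD_CHARS = {kw: string.ascii_letters[i] for i, kw in enumerate(keyword.kwlist)}

    def fingerprint(source: str) -> str:
        """Keep only keywords; drop whitespace, comments and identifiers."""
        chars = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and tok.string in KEYWORD_CHARS:
                chars.append(KEYWORD_CHARS[tok.string])
        return "".join(chars)

    def edit_distance(a: str, b: str) -> int:
        """Plain Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    # Two fragments with similar control flow end up with a small distance.
    a = fingerprint("if x:\n    while True:\n        pass\n")
    b = fingerprint("if ready:\n    while running:\n        continue\n")
    print(a, b, edit_distance(a, b))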
Very interesting! Just a heads up that your charts on dark mode under Growth have yellow text on a white background which is near-impossible to read for me on hover
One potential readability improvement - if the numbers in the tables were all right aligned. More controversial opinion - if you standardized all of the sizes onto either TiB or GiB.
Anyway, very cool. I am shocked at how many header files are present.
Another idea - how many unique files are there between releases and how many unique files are there total? Take a sha hash of every file, every commit. Calculate how many shas are shared between releases vs novel (ie # files churned per release). Can then also calculate on the global uniqueness over time.
Of course, this means calculating billions of sha sums, so it could take forever, unless you had some cute trick to rip the value out of the git repos directly. Maybe you could even beat the odds and find that 1 in a quadrillion hash collision.
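For what it's worth, git already stores blob hashes, so the "cute trick" might just be pulling them out with `git ls-tree` rather than re-hashing anything. A rough sketch, with a placeholder repo path and hypothetical release tags:

    import subprocess

    def blob_shas(repo: str, ref: str) -> set[str]:
        """Set of git blob SHAs reachable from a ref."""
        out = subprocess.run(
            ["git", "-C", repo, "ls-tree", "-r", ref],
            capture_output=True, text=True, check=True,
        ).stdout
        # Each line looks like: "<mode> blob <sha>\t<path>"
        return {line.split()[2] for line in out.splitlines() if " blob " in line}

    seen: set[str] = set()
    for tag in ["v1.0.0", "v1.1.0", "v2.0.0"]:       # hypothetical release refs
        shas = blob_shas("path/to/mirror-repo", tag)  # placeholder path
        novel = shas - seen
        print(f"{tag}: {len(shas)} files, {len(novel)} not seen in earlier releases")
        seen |= shas

    print(f"{len(seen)} unique files across all releases")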
I tried on release day. Managed to break my entire Linux system. That was an annoying day.
And pip install some TensorFlow nonsense has broken my projects many, many times. At this point I try really hard to avoid ever depending on anything in that ecosystem.
Is there something about Bazel's design choices that makes it so complicated? I felt like an idiot when I tried to build some old version of TensorFlow with some old non-default flag. I found even learning CMake easier.
It does, but that’s accounted for in the chart (tensorflow- and tf- prefixes). That will include other projects, but it doesn’t significantly contribute to the overall % as they are dwarfed by the official projects.
Although the same query can be run directly from Parquet files with clickhouse-local:
clickhouse-local --query "SELECT project_name, sum(lines), sum(size) FROM '*.parquet' WHERE skip_reason = '' AND project_name ILIKE '%ClickHouse%' GROUP BY project_name ORDER BY 2 DESC"
BTW RealPython is quite annoying about wanting you to create an account. I'm sure they didn't use to be as bad, but I can barely read that page in its entirety without being forced to create an account.
Not quite: the aim is to keep the size of the repositories within a manageable bound. This is achieved by heuristically excluding some text files that won’t compress well based on their content and size before committing.
The “very long lines” exclusion is for text files that are very large but contain fewer than 5 or so lines. If I remember correctly there are some “py-armour” files that are basically just big dumps of base64-encoded Python byte code on two or so lines.
These are likely unique files and so won’t compress well, which bloats the size of the repositories. If you’re interested you can use the SQL console on the datasets page to take a look at the specific files.
No, if you check above: "the aim is to keep the size of the repositories within a manageable bound. This is achieved by heuristically excluding some text files that won’t compress well based on their content and size before committing."
GitHub has a max size limit per repository. You can push a 150 GB git repository to GitHub: it won't stop you, but you'll get a message from support telling you to remove it within 48 hours.
By "heuristically excluding some text files that won’t compress well based on their content and size before committing", we are able to keep the size bound of the repositories within those set by Github whilst keeping the majority of useful code.
We have excluded 26,610 .py files for being too large, whilst keeping 414,228,665 .py files. The average size for an excluded .py file is 15 MiB, vs ~10KiB for included files.
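In spirit, the heuristic looks something like this sketch (the thresholds are made up for illustration, not the actual values used):

    def skip_reason(content: bytes) -> str:
        """Return a reason to exclude the file, or '' to keep it."""
        size = len(content)
        if size > 5 * 1024 * 1024:
            return "too-large"        # huge files bloat the repo and compress poorly
        if size > 500 * 1024 and len(content.splitlines()) < 5:
            return "very-long-lines"  # e.g. megabytes of base64 dumped onto a couple of lines
        return ""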
One small change that can make any graph easier to understand: Label the units on the axes. I think that the y axis is # of projects, but I'm not totally sure.
The website pulls it directly from the JSON “index” files in each mirror repo at build time, and there isn’t a metadata dataset with the versions yet. You can use the parquet indexes for this I guess, but it could be simpler.
The problem is that it’s pretty huge and would need to be an artefact. I’ll have a think.
This reminded me about https://github.com/cdnjs/cdnjs/ - every version of every popular JS library in one repository - one of the largest repositories on GitHub by size.
Some people don’t put their code on GitHub, since they object to GitHub’s ToS, especially the terms pertaining to analysis and use by Copilot, and the various other uses to which Microsoft may see fit to put the code.
Will Microsoft see this as a free license to use all of PyPI?
The Copilot team could pull the code from PyPI and use it to train their models, regardless of whether it's on GitHub. If you don't want AI trained on your code, then either don't publish it, or publish it somewhere that forbids (and preferably enforces a ban on) AI companies indexing or training on it. But good luck with that... it's public code. Don't publish it if you don't want humans or machines to read it.
They could do that, just as much as someone could pull videos and songs from YouTube and use those to train a model. It's public content, so if people don't want humans and machines to access it, they shouldn't publish it on a public platform.
One can argue about ToS and copyright, about different interpretations of fair use, derivative work, DRM protections, and so on. Usually people are not interested in discussing the finer details of those things. Most people seem to want to perceive it as either public or not public, in which case YouTube is just as public as PyPI.
Because you're pushing the code to GitHub, so they need to enumerate their rights in terms of what they can do with it once you push it there. But if you publish your code to PyPI, the relevant ToS is the PyPI ToS, which has no such clause forbidding either PyPI or others from using the code how they'd like to (and as mentioned by other comments in this thread, the ToS actually explicitly grants others the right to republish the code).
Because most companies/people are (AFAIK) under the impression that training on public data will fall under "Fair Use" because it's substantially transformative; in the case that it isn't, then you've already agreed to it on GitHub.
It's a fallback clause: "fair use" is irrelevant if you've already given GitHub permission to use it. By adding that clause, you can no longer argue that it isn't fair use for them to use the code you put on GitHub after agreeing to their terms.
Which would mean nothing at all if any of the “training models on code doesn’t require permission in the first place” theories (Fair Use or otherwise) is true, and pretty much all current models collapse into illegality if at least one of those theories isn’t true.
You can’t use a license to bind people who don't need a license.
Yes, or there should be a ROBOTS.TXT file that describes how the code in a directory may be indexed or used by machines (e.g. malware scanning okay, no LLM training, etc.) But you're probably correct that such rights should just be covered by the license itself.
Your reciprocity suggestion could also work, since it would mean any LLM trained on even a single file of GPL code would be "poisoned" in the sense that any code it produced would also be GPL. This would make people wary of using the LLM for writing code, and thus would make the creators of it wary of training it on the code.
Humans are also not allowed to simply regurgitate GPL code verbatim, even if they do it by committing the code to memory. There's a reason clean-room implementations are a thing: sometimes even the appearance of someone possibly remembering code verbatim is risky enough to warrant extra care. That said, usually the risks are acceptable even without extra measures because you can hold the humans responsible.
Now, just because you don't understand a language model and call its fitting procedure 'learning' doesn't mean that it is doing anything even remotely similar. And even if it is, it has no legal responsibilities, so if you want to distribute it, then you as the distributor need to assume responsibility for all the copyrights such an act could possibly infringe.
There are measures you can take to try to prevent the information from any one code base from entering the model verbatim, by severely restricting the model size or carefully obfuscating the data, but to my knowledge nobody has used any method that gives any proven measure of security.
If an AI model regurgitates GPL licensed code verbatim, that code is already protected by copyright and there is no need to update the GPL to cover it specifically.
This can be solved with automated tools that review your code for copyright violations.
But to be honest, this is a non-issue. Copilot (and co) rarely outputs copyrighted code verbatim, and when it does, it's usually too small or too trivial to fall under copyright protection. I made a prediction last year that no one would get sued for using Copilot-generated code, and I believe it has held so far[0].
I understand language models quite well. I have published research on this at renowned conferences.
I also know how they learn. And I know how the biological brain learns. The differences are just technical. The underlying concept is really the same: both learn by adjusting the connection strengths between neurons based on what they see.
Legal responsibility is something different. That is up to politics to decide.
My argument is purely on the fact that what humans do and what machines do is extremely similar, and the differences are just technical and don't really matter here. This is often misunderstood by people who think that machines don't really learn like humans.
Computers aren’t humans. There’s no reason licenses should treat them the same way when completely different considerations of fairness and economic impact apply.
You can very much be sued for working on a proprietary project and then trying to reproduce it somewhere else. That's why clean-room reimplementations have to make sure that no contributor has seen the original code.
I think it's an open question if that would actually work. I would guess that if the courts decided that worked, we'd see a GPL 4 with that sort of clause.
They are within their rights to make that choice, but when you publish a package to PyPI you agree to their terms which gives anyone the right to mirror, distribute and otherwise use the code you’ve published.
Do the rights you find in the PyPI terms provide everything you need to comply with the GitHub terms? Ultimately it's tricky to understand what GitHub really means by its terms (they say "User-Generated Content" a lot).
Code which has a "can't be placed on github" license restriction is definitely not open source, regardless of what other terms the license purports to have.
> Some people don’t put their code on GitHub, since they object to GitHub’s ToS, especially those pertaining to analysis and use by Copilot
Copilot was trained on Github code under a “training models doesn’t require permission” theory before there was anything about it in the ToS, and basically every other large model has taken a similar approach to publicly-accessible data of all kinds.
> Will Microsoft see this as a free license to use all of PyPI
Microsoft doesn’t think they need a license for model training.
You have very serendipitous timing, I was needing to solve a similar task just the other day and had a hell of a time getting it worked out. Thanks for taking the time to post it! :) Straightforward, but seeing a "worked example" helps a ton.
I’ve recently worked on something similar that involved cloning multiple repositories and found GNU parallel to be ideal for that task. Parallel gives more control than xargs.
Yes, for me it is better because if you do it your way you have to keep your ssh connection open until all of the git clones have been done, which in this case takes several hours.
(Or you could also run your way in tmux or screen.)
With task-spooler, it puts all of the commands (in this case, the individual git clone commands for each of the repos) in a queue and it runs the commands independently of my ssh session, so I can quickly add a bunch of jobs like this to the queue and immediately disconnect my ssh session.
I'm all for archivism, but wouldn't this get taken down by GitHub? If what your website says is true, you're storing 300GB+ of code on GitHub. I've heard stories from people who've also tested the limits, and they've had emails from GitHub asking them to cease activity.
I was in contact with them, and they are apparently OK with having the repositories be split up. There are over 230 of them, each under 1.3 GB in size.
I’m working on distributing this data without GitHub - git packfiles are a fantastic way of compressing this data, and you can serve those easily enough from a bucket.
highly recommend clicking the "witness the inevitable future" button
As a Python oldie coming back into Python, I've been surprised by dataclasses. Are they basically "backwards-compatible better classes"? Any strong blog posts or readings that people can share about better typed Python / OSS module publishing practices?
You reminded me of this article[0] where the author asks: why not dataclasses by default? I am inclined to agree; dataclasses feel Pythonic in that they remove boilerplate with reasonable defaults (ordering, hashing, etc.).
Yes, I really wish this was the reality. I would even be very happy with a new `data` keyword for this purpose. I'm teaching `dataclasses` in my CS1, and it's nice, but frustrating that A) students have to remember the import, and B) all my slides always have to be a bit longer thanks to the `@dataclass` line taking up space. This is a real problem for some of my longer examples...
Welllll, I'm not seriously proposing this, but if you had a new extension, like .ty or .datapy or something, you could put whatever syntax you wanted in there and compile it down to .py...
Yes please keep stacking compilers on top of compilers. I loved dealing with Qt builds in CMake. Can't wait for Python's stable and consistent packaging tools to handle not-Python code that compiles down to Python code.
Dataclasses are orthogonal to typing (IMO), they just use types for their evocative syntax for fields.
Dataclasses are nice - they are a pared down version of the attrs library, so a simple way to create data-only or mainly-data records through classes. They are not intended to replace all classes.
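For anyone who hasn't used them, a minimal generic example of the boilerplate they remove (nothing project-specific here):

    from dataclasses import dataclass, field

    @dataclass(order=True)
    class Point:
        x: float
        y: float
        tags: list[str] = field(default_factory=list)

    # __init__, __repr__, __eq__ and the ordering methods are generated for you:
    p = Point(1.0, 2.0)
    assert p == Point(1.0, 2.0)
    assert Point(0.0, 0.0) < p
    print(p)  # Point(x=1.0, y=2.0, tags=[])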
GitHub has automation to detect and mitigate repos with AWS/GCP API keys. They probably don't (yet?) have equivalents for OpenAI or other similar services that aren't as popular.
Azure secrets work differently, and the keys are (traditionally) stored outside of your project directory. I'm sure someone could find a way to include them (and probably include multiple projects rather than a single project), but they would have to go out of their way to do this.
I can’t remember how I set this up right now, but I think it’s unique keys. And someone did manage to accidentally publish a master list of 500 individual keys at one point, which definitely boosts the numbers.
This analysis will be re-done with some other data soon.
"If I upload Content covered by a royalty-free license included with such Content, giving the PSF the right to copy and redistribute such Content unmodified on PyPI as I have uploaded it, with no further action required by the PSF (an "Included License"), I represent and warrant that the uploaded Content meets all the requirements necessary for free redistribution by the PSF and any mirroring facility, public or private, under the Included License.
If I upload Content other than under an Included License, then I grant the PSF and all other users of the web site an irrevocable, worldwide, royalty-free, nonexclusive license to reproduce, distribute, transmit, display, perform, and publish the Content, including in digital form."
There's a popular package in a subfield (which I'm not going to name) which disallows redistribution in its license, but is uploaded by its author to PyPI. I wonder how that interacts with PyPI's terms (it has a license, so at first glance it would hit the first clause, but that license doesn't allow redistribution, so the upload would instead seem to fall under the second clause, which contradicts the license the package presents to its users), and whether the DMCA could validly be used to take down the GitHub repositories? It would also be interesting to see how many other packages have such issues.
It's not that I didn't think "it was easy enough to use PyPI to do analytics and analysis": it is near impossible for the layman to use PyPI for analytics and analysis in its current form. The volume of data is very unwieldy, there are numerous quirks with extracting packages, and no tooling exists to help you in any way.
This means no analysis has been done on the contents of PyPI. In turn this means malicious packages are harder to detect (and for sure still present somewhere in there), it means people publish an absolutely crazy number of credentials to PyPI on a daily basis without ever knowing (+ no simple way to find concrete ways to improve this) and it means there is a lack of exploration on the impacts of language features/changes on the ecosystem.
To me the GitHub aspect isn't important or interesting. Would it make any difference if it was distributed from a series of git repositories hosted on S3? It's the git aspect that is interesting, because it lowers the barrier for anyone to access the corpus of already public, already mirrored and already automatically-scanned-by-bad-actors code that is on PyPI.
While this project is more "a number of things glued-together" than "a groundbreaking invention", I have to disagree with the triviality aspect. Most problems we deal with can be reduced to 'copying X from one place to another' (sorting?), and the devil is always in the details.
> I don't like this, it's this kind of stunt that makes me reluctant to publish my code in general.
Isn't this quite circular? People using code you publish publicly makes you reluctant to publish code publicly?
> Isn't this quite circular? People using code you publish publicly makes you reluctant to publish code publicly?
Not that I disagree with this project, but just to maybe help see it from a little different perspective...
When people publish their code, I think they typically expect it's going to be used like
import my_package
my_package.do_something_cool()
So it is a little weird when things like this come along and change that expectation.
It's kind of like, "I scanned millions of Facebook photos for soda cans to see if people prefer Pepsi or Coke!" People didn't post those photos to be part of a project, they just wanted to share some pictures with their friends.
It's not unusual to want to change certain behaviours of a project, e.g. by subclassing something within it. It's also worth at least having some idea of the code you're running before you run it, particularly if you don't know the developer, for many reasons but for e.g. [0].
I'm not really sold on the perspective that if you're a sophisticated enough developer to know+upload+publish on PyPI, you wouldn't expect someone to read your code. In many ways that's kind of the point. Not to say such people don't exist, but they're probably a small minority.
According to the stats on the original link, there are over 25,000 identified secret ids/keys/tokens in the data. And it looks like that's just identifiable secrets, e.g. "Google API Keys" that I'm guessing are identifiable because they have a specific pattern, and may be missing other secrets that use less recognizable patterns.
I mean, sure, compared to the 478,876 Projects claimed on https://pypi.org/, that's a pretty small minority. On the other hand, I'd guess many Python packages don't use these particular services, or even need to connect to a remote service at all, so the area for this class of mistake should be smaller.
And mistakes do happen, but that's a pretty big thing to miss if you are knowingly publishing your code with the expectation other people will be reading it.
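For illustration, the "specific pattern" detection mentioned above is roughly this kind of thing (these are commonly cited key formats, not necessarily the rules the project actually uses):

    import re

    # Commonly cited formats for AWS access key IDs and Google API keys.
    PATTERNS = {
        "AWS Access Key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
        "Google API Key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    }

    def find_secrets(text: str) -> list[tuple[str, str]]:
        """Return (kind, match) pairs for anything that looks like a key."""
        return [(name, m) for name, rx in PATTERNS.items() for m in rx.findall(text)]

    # Placeholder value shaped like a key, not a real credential:
    print(find_secrets('aws_key = "AKIAABCDEFGHIJKLMNOP"'))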
Thanks for that, I can definitely appreciate this perspective. I’d say it’s more akin to uploading photos to a shared public host like imgur rather than Facebook, but regardless I can see how someone’s expectations of who/what would use it might be different than mine.
Kinda like you say yourself, the service is probably the least interesting part.
It doesn't really matter whether it's a public repository or something your friends shared only within their network. When it comes to what people expect and how they'll feel about breaking those expectations, the only difference is that a smaller network of generally like-minded people _may_ already be cool with it, or at least it's easier to ask.
I'm not even saying they're right to feel weird about it. Just that people are going to feel what they're going to feel, and doing something they didn't expect is a sure-fire way to get them to feel _something_.
Apart from being easy to analyze, why is Python so interesting?
It's a nice-to-play-with language, very useful for researchers, but not practical for enterprise code. There are too many ways to do the same thing, it's not type safe, and I personally don't know many real Python pros; the majority are just using Python to play with.
Not all code is enterprise code. For the last 5 years of my professional career I use Python almost exclusively.
>There are too many ways to do the same thing, (...)
Fair, but that's a funny statement to make because Python from the start tried to have just one obvious solution for every problem. Maybe that's just what happens with languages over time.
>not type safe
Nitpick: Python is dynamically typed, but is actually quite type safe as used in practice (i.e. type errors are usually caught at runtime instead of silently doing the wrong thing). YMMV of course.
>and I personally don't know many real python pros, the majority are just using python to play with.
The beautiful thing about Python is that you don't have to be a pro to use it effectively :). And I think this may be a result of your professional bubble - for example I don't know any Java pros, but I have no doubts there are many.