"PyPI is growing fast. If this dangerous expansion not stopped, our advanced machine learning models predict that in only 8 years the number of packages will outnumber human beings."
This is one of the funniest things I've read all week.
The WITNESS THIS INEVITABLE FUTURE button that slowly, then steadily started increasing the date on the graphs is executed very well. Truly hysterical, I laughed out loud.
Thank you, I’m glad you appreciate it. I spent way more time on it than I would like to admit. The original version actually did run a small model trained on the data in the browser to generate the predictions.
Security researchers wanting to investigate what kinds of credentials other people are looking for, for example.
Or as another example, someone who wants to research how often people copy paste example config files without replacing things like credentials given in the example.
So you found 57 live AWS keys[1]. That's out of 1631 according to your stats (If I didn't misread something). I wonder how many of the top two in the list are live :D
I know that for myself and my team, we often use placeholder values for config files, but they would usually be tokenized, so it's easy to tell they were placeholders. For example "%%USERNAME%%" or "%%PASSWORD%%".
> If this dangerous expansion not stopped, our advanced machine learning models predict that in only 8 years the number of packages will outnumber human beings.
Incredibly there seems to be no sign yet of an S-curve type saturation (which would be only normal at this point of the Python hype cycle).
It would be interesting to control for that by repeating the analysis with other languages, to see whether the exponential growth is Python-specific or really the exponential growth of all open source.
Python’s growth is not limited by the number of software developers, as is the case for Java or what have you. It’s used by many other professions, and it just keeps exploding.
I've been thinking about this today. Function equivalence is one thing, but detecting similar code fragments is another.
Like why not assign a character to each keyword: 'if' = 'i', 'while' = 'w', 'true' = 't', etc. Then reduce code to this, removing all whitespace, comments, and identifiers, so `if True: while True: pass` would become `itwtp`.
Then the similarity is just an edit distance between the two resulting strings?
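Something like this minimal sketch, maybe (the keyword-to-letter mapping and helper names here are just illustrative):

    import io
    import keyword
    import string
    import tokenize

    # Assign an arbitrary letter to each keyword.
    KEYWORD_CHARS = {kw: string.ascii_letters[i] for i, kw in enumerate(keyword.kwlist)}

    def fingerprint(source: str) -> str:
        """Keep only keywords; drop whitespace, comments and identifiers."""
        chars = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and tok.string in KEYWORD_CHARS:
                chars.append(KEYWORD_CHARS[tok.string])
        return "".join(chars)

    def edit_distance(a: str, b: str) -> int:
        """Plain Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    # Two fragments with similar control flow end up with a small distance.
    a = fingerprint("if x:\n    while True:\n        pass\n")
    b = fingerprint("if ready:\n    while running:\n        continue\n")
    print(a, b, edit_distance(a, b))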
Very interesting! Just a heads up that your charts on dark mode under Growth have yellow text on a white background which is near-impossible to read for me on hover
One potential readability improvement - if the numbers in the tables were all right aligned. More controversial opinion - if you standardized all of the sizes onto either TiB or GiB.
Anyway, very cool. I am shocked at how many header files are present.
Another idea - how many unique files are there between releases and how many unique files are there total? Take a sha hash of every file, every commit. Calculate how many shas are shared between releases vs novel (ie # files churned per release). Can then also calculate on the global uniqueness over time.
Of course, this means calculating billions of sha sums, so it could take forever, unless you had some cute trick to rip the value out of the git repos directly. Maybe you could even beat the odds and find that 1 in a quadrillion hash collision.
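For what it's worth, git already stores blob hashes, so the "cute trick" might just be pulling them out with `git ls-tree` rather than re-hashing anything. A rough sketch, with a placeholder repo path and hypothetical release tags:

    import subprocess

    def blob_shas(repo: str, ref: str) -> set[str]:
        """Set of git blob SHAs reachable from a ref."""
        out = subprocess.run(
            ["git", "-C", repo, "ls-tree", "-r", ref],
            capture_output=True, text=True, check=True,
        ).stdout
        # Each line looks like: "<mode> blob <sha>\t<path>"
        return {line.split()[2] for line in out.splitlines() if " blob " in line}

    seen: set[str] = set()
    for tag in ["v1.0.0", "v1.1.0", "v2.0.0"]:       # hypothetical release refs
        shas = blob_shas("path/to/mirror-repo", tag)  # placeholder path
        novel = shas - seen
        print(f"{tag}: {len(shas)} files, {len(novel)} not seen in earlier releases")
        seen |= shas

    print(f"{len(seen)} unique files across all releases")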
I tried on release day. Managed to break my entire Linux system. That was an annoying day.
And pip install some TensorFlow nonsense has broken my projects many, many times. At this point I try really hard to avoid ever depending on anything in that ecosystem.
Is there something about Bazel's design choices that makes it so complicated? I felt like an idiot when I tried to build some old version of TensorFlow with some old non-default flag. I found even learning CMake easier.
It does, but that’s accounted for in the chart (tensorflow- and tf- prefixes). That will include other projects, but it doesn’t significantly contribute to the overall % as they are dwarfed by the official projects.
Although the same query can be run directly from Parquet files with clickhouse-local:
clickhouse-local --query "SELECT project_name, sum(lines), sum(size) FROM '*.parquet' WHERE skip_reason = '' AND project_name ILIKE '%ClickHouse%' GROUP BY project_name ORDER BY 2 DESC"
BTW RealPython is quite annoying about wanting you to create an account. I'm sure they didn't use to be as bad, but I can barely read that page in its entirety without being forced to create an account.
Not quite: the aim is to keep the size of the repositories within a manageable bound. This is achieved by heuristically excluding some text files that won’t compress well based on their content and size before committing.
The “very long lines” exclusion is for text files that are very large but contain fewer than 5 or so lines. If I remember correctly there are some “py-armour” files that are basically just big dumps of base64-encoded Python byte code on two or so lines.
These are likely unique files and so won’t compress well, which bloats the size of the repositories. If you’re interested you can use the SQL console on the datasets page to take a look at the specific files.
No, if you check above: "the aim is to keep the size of the repositories within a manageable bound. This is achieved by heuristically excluding some text files that won’t compress well based on their content and size before committing."
GitHub has a max size limit per repository. You can push a 150 GB git repository to GitHub: it won't stop you, but you'll get a message from support telling you to remove it within 48 hours.
By "heuristically excluding some text files that won’t compress well based on their content and size before committing", we are able to keep the size bound of the repositories within those set by Github whilst keeping the majority of useful code.
We have excluded 26,610 .py files for being too large, whilst keeping 414,228,665 .py files. The average size for an excluded .py file is 15 MiB, vs ~10KiB for included files.
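In spirit, the heuristic looks something like this sketch (the thresholds are made up for illustration, not the actual values used):

    def skip_reason(content: bytes) -> str:
        """Return a reason to exclude the file, or '' to keep it."""
        size = len(content)
        if size > 5 * 1024 * 1024:
            return "too-large"        # huge files bloat the repo and compress poorly
        if size > 500 * 1024 and len(content.splitlines()) < 5:
            return "very-long-lines"  # e.g. megabytes of base64 dumped onto a couple of lines
        return ""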
One small change that can make any graph easier to understand: Label the units on the axes. I think that the y axis is # of projects, but I'm not totally sure.
The website pulls it directly from the JSON “index” files in each mirror repo at build time, and there isn’t a metadata dataset with the versions yet. You can use the parquet indexes for this I guess, but it could be simpler.
The problem is that it’s pretty huge and would need to be an artefact. I’ll have a think.
This reminded me about https://github.com/cdnjs/cdnjs/ - every version of every popular JS library in one repository - one of the largest repositories on GitHub by size.
Some people don’t put their code on GitHub, since they object to GitHub’s ToS, especially the terms pertaining to analysis and use by Copilot, and the various other uses to which Microsoft may see fit to put the code.
Will Microsoft see this as a free license to use all of PyPI?
The Copilot team could pull the code from PyPI and use it to train their models, regardless of whether it's on GitHub. If you don't want AI trained on your code, then either don't publish it, or publish it somewhere that forbids (and preferably enforces a ban on) AI companies indexing or training on it. But good luck with that... it's public code. Don't publish it if you don't want humans or machines to read it.
They could do that, just as much as someone could pull videos and songs from YouTube and use those to train a model. It's public content, so if people don't want humans and machines to access it, they shouldn't publish it on a public platform.
One can argue about ToS and copyright, about different interpretations of fair use, derivative work, DRM protections, and so on. Usually people are not interested in discussing the finer details of those things. Most people seem to want to perceive it as either public or not public, in which case YouTube is just as public as PyPI.
Because you're pushing the code to GitHub, so they need to enumerate their rights in terms of what they can do with it once you push it there. But if you publish your code to PyPI, the relevant ToS is the PyPI ToS, which has no such clause forbidding either PyPI or others from using the code how they'd like to (and as mentioned by other comments in this thread, the ToS actually explicitly grants others the right to republish the code).
Because most companies/people are (AFAIK) under the impression that training on public data will fall under "Fair Use" because it's substantially transformative; in the case that it isn't, then you've already agreed to it on GitHub.
It's a fallback clause: "fair use" is irrelevant if you've already given GitHub permission to use it. By adding that clause, you can no longer argue that it isn't fair use for them to use the code you put on GitHub after agreeing to their terms.
Which would mean nothing at all if any of the “training models on code doesn’t require permission in the first place” theories (Fair Use or otherwise) is true, and pretty much all current models collapse into illegality if at least one of those theories isn’t true.
You can’t use a license to bind people who don't need a license.
Yes, or there should be a ROBOTS.TXT file that describes how the code in a directory may be indexed or used by machines (e.g. malware scanning okay, no LLM training, etc.) But you're probably correct that such rights should just be covered by the license itself.
Your reciprocity suggestion could also work, since it would mean any LLM trained on even a single file of GPL code would be "poisoned" in the sense that any code it produced would also be GPL. This would make people wary of using the LLM for writing code, and thus would make the creators of it wary of training it on the code.
Humans are also not allowed to simply regurgitate GPL code verbatim, even if they do it by committing the code to memory. There's a reason clean-room implementations are a thing: sometimes even the appearance of someone possibly remembering code verbatim is risky enough to warrant extra care. That said, usually the risks are acceptable even without extra measures because you can hold the humans responsible.
Now, just because you don't understand a language model and call its fitting procedure 'learning' doesn't mean that it is doing anything even remotely similar. And even if it is, it has no legal responsibilities, so if you want to distribute it, then you as the distributor need to assume responsibility for all the copyrights such an act could possibly infringe.
There are measures you can take to try to prevent the information from any one code base from entering the model verbatim, by severely restricting the model size or carefully obfuscating the data, but to my knowledge nobody has used any method that gives any proven measure of security.
If an AI model regurgitates GPL licensed code verbatim, that code is already protected by copyright and there is no need to update the GPL to cover it specifically.
This can be solved with automated tools that review your code for copyright violations.
But to be honest, this is a non-issue. Copilot (and co) rarely outputs copyrighted code verbatim, and when it does, it's usually too small or too trivial to fall under copyright protection. I made a prediction last year that no one would get sued for using Copilot-generated code, and I believe it has held so far[0].
I understand language models quite well. I have published research on this at renowned conferences.
I also know how they learn. And I know how the biological brain learns. The differences are just technical. The underlying concept is really the same: both learn by adjusting the connection strengths between neurons based on what they see.
Legal responsibility is something different. That is up to politics to decide.
My argument is purely on the fact that what humans do and what machines do is extremely similar, and the differences are just technical and don't really matter here. This is often misunderstood by people who think that machines don't really learn like humans.
Computers aren’t humans. There’s no reason licenses should treat them the same way when completely different considerations of fairness and economic impact apply.
You can very much be sued for working on a proprietary project and then trying to reproduce it somewhere else. That's why clean-room reimplementations have to make sure that no contributor has seen the original code.
I think it's an open question if that would actually work. I would guess that if the courts decided that worked, we'd see a GPL 4 with that sort of clause.
They are within their rights to make that choice, but when you publish a package to PyPI you agree to their terms which gives anyone the right to mirror, distribute and otherwise use the code you’ve published.
Do the rights you find in the PyPI terms provide everything you need to comply with the GitHub terms? Ultimately it's tricky to understand what GitHub really means by its terms (they say "User-Generated Content" a lot).
Code which has a "can't be placed on github" license restriction is definitely not open source, regardless of what other terms the license purports to have.
> Some people don’t put their code on GitHub, since they object to GitHub’s ToS, especially those pertaining to analysis and use by Copilot
Copilot was trained on Github code under a “training models doesn’t require permission” theory before there was anything about it in the ToS, and basically every other large model has taken a similar approach to publicly-accessible data of all kinds.
> Will Microsoft see this as a free license to use all of PyPI
Microsoft doesn’t think they need a license for model training.
You have very serendipitous timing, I was needing to solve a similar task just the other day and had a hell of a time getting it worked out. Thanks for taking the time to post it! :) Straightforward, but seeing a "worked example" helps a ton.
I’ve recently worked on something similar that involved cloning multiple repositories and found GNU parallel to be ideal for that task. Parallel gives more control than xargs.
Yes, for me it is better because if you do it your way you have to keep your ssh connection open until all of the git clones have been done, which in this case takes several hours.
(Or you could also run your way in tmux or screen.)
With task-spooler, it puts all of the commands (in this case, the individual git clone commands for each of the repos) in a queue and it runs the commands independently of my ssh session, so I can quickly add a bunch of jobs like this to the queue and immediately disconnect my ssh session.
I'm all for archivism, but wouldn't this get taken down by GitHub? If what your website says is true, you're storing 300GB+ of code on GitHub. I've heard stories from people who've also tested the limits, and they've had emails from GitHub asking them to cease activity.
I was in contact with them, and they are apparently OK with having the repositories be split up. There are over 230 of them, each under 1.3 GB in size.
I’m working on distributing this data without GitHub - git packfiles are a fantastic way of compressing this data, and you can serve those easily enough from a bucket.
highly recommend clicking the "witness the inevitable future" button
As a Python oldie coming back into Python, I've been surprised by dataclasses. Are they basically "backwards-compatible better classes"? Any strong blog posts or readings that people can share about better typed Python / OSS module publishing practices?
You reminded me of this article[0] where the author asks: why not dataclasses by default? I am inclined to agree; dataclasses feel Pythonic in that they remove boilerplate with reasonable defaults (ordering, hashing, etc.).
Yes, I really wish this was the reality. I would even be very happy with a new `data` keyword for this purpose. I'm teaching `dataclasses` in my CS1, and it's nice, but frustrating that A) students have to remember the import, and B) all my slides always have to be a bit longer thanks to the `@dataclass` line taking up space. This is a real problem for some of my longer examples...
Welllll, I'm not seriously proposing this, but if you had a new extension, like .ty or .datapy or something, you could put whatever syntax you wanted in there and compile it down to .py...
Yes please keep stacking compilers on top of compilers. I loved dealing with Qt builds in CMake. Can't wait for Python's stable and consistent packaging tools to handle not-Python code that compiles down to Python code.
Dataclasses are orthogonal to typing (IMO), they just use types for their evocative syntax for fields.
Dataclasses are nice - they are a pared down version of the attrs library, so a simple way to create data-only or mainly-data records through classes. They are not intended to replace all classes.
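For anyone who hasn't used them, a minimal generic example of the boilerplate they remove (nothing project-specific here):

    from dataclasses import dataclass, field

    @dataclass(order=True)
    class Point:
        x: float
        y: float
        tags: list[str] = field(default_factory=list)

    # __init__, __repr__, __eq__ and the ordering methods are generated for you:
    p = Point(1.0, 2.0)
    assert p == Point(1.0, 2.0)
    assert Point(0.0, 0.0) < p
    print(p)  # Point(x=1.0, y=2.0, tags=[])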
GitHub has automation to detect and mitigate repos with AWS/GCP API keys. They probably don't (yet?) have equivalents for OpenAI or other similar services that aren't as popular.
Azure secrets work differently, and the keys are (traditionally) stored outside of your project directory. I'm sure someone could find a way to include them (and probably include multiple projects rather than a single project), but they would have to go out of their way to do this.
I can’t remember how I set this up right now, but I think it’s unique keys. And someone did manage to accidentally publish a master list of 500 individual keys at one point, which definitely boosts the numbers.
This analysis will be re-done with some other data soon.
"If I upload Content covered by a royalty-free license included with such Content, giving the PSF the right to copy and redistribute such Content unmodified on PyPI as I have uploaded it, with no further action required by the PSF (an "Included License"), I represent and warrant that the uploaded Content meets all the requirements necessary for free redistribution by the PSF and any mirroring facility, public or private, under the Included License.
If I upload Content other than under an Included License, then I grant the PSF and all other users of the web site an irrevocable, worldwide, royalty-free, nonexclusive license to reproduce, distribute, transmit, display, perform, and publish the Content, including in digital form."
There's a popular package in a subfield (which I'm not going to name) which disallows redistribution in its license, but is uploaded by its author to PyPI. I wonder how that interacts with PyPI's terms (it has a license, so at first glance it would hit the first clause, but that license doesn't allow redistribution, so the upload would instead seem to fall under the second clause, which contradicts the license the package presents to its users), and whether the DMCA could validly be used to take down the GitHub repositories? It would also be interesting to see how many other packages have such issues.
It's not that I didn't think "it was easy enough to use PyPI to do analytics and analysis": it is near impossible for the layman to use PyPI for analytics and analysis in its current form. The volume of data is very unwieldy, there are numerous quirks with extracting packages, and no tooling exists to help you in any way.
This means no analysis has been done on the contents of PyPI. In turn this means malicious packages are harder to detect (and for sure still present somewhere in there), it means people publish an absolutely crazy number of credentials to PyPI on a daily basis without ever knowing (+ no simple way to find concrete ways to improve this) and it means there is a lack of exploration on the impacts of language features/changes on the ecosystem.
To me the GitHub aspect isn't important or interesting. Would it make any difference if it was distributed from a series of git repositories hosted on S3? It's the git aspect that is interesting, because it lowers the barrier for anyone to access the corpus of already public, already mirrored and already automatically-scanned-by-bad-actors code that is on PyPI.
While this project is more "a number of things glued-together" than "a groundbreaking invention", I have to disagree with the triviality aspect. Most problems we deal with can be reduced to 'copying X from one place to another' (sorting?), and the devil is always in the details.
> I don't like this, it's this kind of stunt that makes me reluctant to publish my code in general.
Isn't this quite circular? People using code you publish publicly makes you reluctant to publish code publicly?
> Isn't this quite circular? People using code you publish publicly makes you reluctant to publish code publicly?
Not that I disagree with this project, but just to maybe help see it from a little different perspective...
When people publish their code, I think they typically expect it's going to be used like
import my_package
my_package.do_something_cool()
So it is a little weird when things like this come along and change that expectation.
It's kind of like, "I scanned millions of Facebook photos for soda cans to see if people prefer Pepsi or Coke!" People didn't post those photos to be part of a project, they just wanted to share some pictures with their friends.
It's not unusual to want to change certain behaviours of a project, e.g. by subclassing something within it. It's also worth at least having some idea of the code you're running before you run it, particularly if you don't know the developer, for many reasons but for e.g. [0].
I'm not really sold on the perspective that if you're a sophisticated enough developer to know+upload+publish on PyPI, you wouldn't expect someone to read your code. In many ways that's kind of the point. Not to say such people don't exist, but they're probably a small minority.
According to the stats on the original link, there are over 25,000 identified secret ids/keys/tokens in the data. And it looks like that's just identifiable secrets, e.g. "Google API Keys" that I'm guessing are identifiable because they have a specific pattern, and may be missing other secrets that use less recognizable patterns.
I mean, sure, compared to the 478,876 Projects claimed on https://pypi.org/, that's a pretty small minority. On the other hand, I'd guess many Python packages don't use these particular services, or even need to connect to a remote service at all, so the area for this class of mistake should be smaller.
And mistakes do happen, but that's a pretty big thing to miss if you are knowingly publishing your code with the expectation other people will be reading it.
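For illustration, the "specific pattern" detection mentioned above is roughly this kind of thing (these are commonly cited key formats, not necessarily the rules the project actually uses):

    import re

    # Commonly cited formats for AWS access key IDs and Google API keys.
    PATTERNS = {
        "AWS Access Key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
        "Google API Key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    }

    def find_secrets(text: str) -> list[tuple[str, str]]:
        """Return (kind, match) pairs for anything that looks like a key."""
        return [(name, m) for name, rx in PATTERNS.items() for m in rx.findall(text)]

    # Placeholder value shaped like a key, not a real credential:
    print(find_secrets('aws_key = "AKIAABCDEFGHIJKLMNOP"'))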
Thanks for that, I can definitely appreciate this perspective. I’d say it’s more akin to uploading photos to a shared public host like imgur rather than Facebook, but regardless I can see how someone’s expectations of who/what would use it might be different than mine.
Kinda like you say yourself, the service is probably the least interesting part.
It doesn't really matter whether it's a public repository or something your friends shared only within their network. When it comes to what people expect and how they'll feel about breaking those expectations, the only difference is that a smaller network of generally like-minded people _may_ already be cool with it, or at least it's easier to ask.
I'm not even saying they're right to feel weird about it. Just that people are going to feel what they're going to feel, and doing something they didn't expect is a sure-fire way to get them to feel _something_.
Apart from being easy to analyze, why is Python so interesting?
It's a nice-to-play-with language, very useful for researchers, but not practical for enterprise code. There are too many ways to do the same thing, it's not type safe, and I personally don't know many real Python pros; the majority are just using Python to play with.
Not all code is enterprise code. For the last 5 years of my professional career I use Python almost exclusively.
>There are too many ways to do the same thing, (...)
Fair, but that's a funny statement to make because Python from the start tried to have just one obvious solution for every problem. Maybe that's just what happens with languages over time.
>not type safe
Nitpick: Python is dynamically typed, but is actually quite type safe as used in practice (i.e. type errors are usually caught at runtime instead of silently doing the wrong thing). YMMV of course.
>and I personally don't know many real python pros, the majority are just using python to play with.
The beautiful thing about Python is that you don't have to be a pro to use it effectively :). And I think this may be a result of your professional bubble - for example I don't know any Java pros, but I have no doubts there are many.