
Simple but probably wrong solution: why not ban obfuscation libraries, compressed code, and self-loading code within the PyPI ecosystem? Any package that even refers to illegible non-source techniques gets flagged and blocked. It seems the whole PyPI ecosystem is undisciplined and could be tightened up. Why can't we make progress here?



You can pip install complex standalone executables, such as nodejs, and this is relied on across the entire ecosystem.

In fact, most packages are now wheels, which are not sources: they are compressed, and may contain binaries for compiled extensions, something extremely popular (the scientific and AI stacks exist only because of this).

Some packages need to be compiled after the fact, something that setup.py will trigger, and some even embed a fallback compiler, like some Cython-based packages.

Also, remember there are very few people working on PyPI and there is no moderation: anybody can publish anything, so you would need a bulletproof automated heuristic. That's either impractical or too expensive.

If you want a secure package distribution platform, there are commercial ones, such as Anaconda. You get what you pay for.


I guess we're finding out the other side of that blade, huh?


Self-loading code is a huge part of the value-add of Python libraries. Many of the popular libraries (e.g. Numpy and friends) trigger a bewildering chain of events to compile from source if not installing from pre-built wheels. And if you do have wheels, you have opaque binary blobs. So pick your poison: compile-on-install with a possible backdoor, or prebuilt .so/.dylib/.pyc with a possible backdoor.

The most obvious (but not necessarily easiest) approach is to phase out setup.py and move everything to the declarative pyproject.toml approach. This is not just better for metadata (setup scripts make it really hard to statically infer what deps a lib has), it also allows for better control over what installers/toolchains run on install.
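
As a small illustration of the "statically infer" point, a declarative pyproject.toml can be read with nothing more than a TOML parser and zero code execution (the project and dependencies below are made up):

  import tomllib  # stdlib TOML parser, Python 3.11+

  # Hypothetical pyproject.toml contents for some library.
  PYPROJECT = """
  [build-system]
  requires = ["setuptools>=68", "wheel"]
  build-backend = "setuptools.build_meta"

  [project]
  name = "example-lib"
  version = "1.0.0"
  dependencies = ["requests>=2.31", "numpy"]
  """

  meta = tomllib.loads(PYPROJECT)
  print(meta["project"]["dependencies"])   # ['requests>=2.31', 'numpy']
  print(meta["build-system"]["requires"])  # build requirements are static too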

Attackers still have quite a lot of latitude during the build phase, but at least libraries have the option to specify declaratively what permissions they need (and presumably the user has the option to forbid them).

Also eval/exec are terrible and I wish there were a mode to disable their usage, but I don't know if the Python runtime has some deep dependency on them. Maybe there's a way to restrict it so that only low-level frames can call the eval opcode.


Would it be possible for the wheels to be built in a more trusted / hardened environment? Having a binary blob isn't as serious when it comes from a trusted source. Almost all Debian-like Linux distributions have this feature (a binary-downloading package manager).

The hardening could mitigate on-compilation hacking.

Obviously, this leaves "compile in the backdoor and wait for the user to fall into it", but at least it isn't an issue of compiling on the user's computer and it isn't an issue of binary blobs. And possibly there's a greater chance of detection if actual source code has to be available to compile.


>Also eval/exec are terrible and I wish there were a mode to disable their usage,

You can use audit hooks in the sys module (as long as you load it first) to disable eval/exec/process spawning or even arbitrary imports or network requests.
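
For what it's worth, a minimal sketch of that (PEP 578 audit hooks, Python 3.8+). Caveat: the import machinery itself triggers the "exec"/"compile" events when it loads modules, so this has to be installed after everything you need is already imported, and it's a tripwire rather than a real sandbox:

  import sys

  # Real CPython audit event names we want to refuse.
  BLOCKED = {"exec", "compile", "subprocess.Popen", "os.system", "socket.connect"}

  def deny(event, args):
      if event in BLOCKED:
          raise RuntimeError(f"blocked audit event: {event} {args!r}")

  sys.addaudithook(deny)  # hooks cannot be removed once added

  eval("1 + 1")  # now fails: dynamic compilation/execution trips the hook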


I’ve been building Packj [1] to flag PyPI/NPM/Ruby packages that contain suspicious decode+exec and other “risky” APIs using static analysis. It also uses strace-based dynamic analysis to monitor install-time filesystem/network activities. We have detected a bunch of malware with the tool.
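
Purely as an illustration of the static side (this is a toy, not how Packj actually works), flagging decode+exec patterns can start with a simple AST walk:

  import ast

  SUSPICIOUS_CALLS = {"exec", "eval", "compile", "__import__"}
  DECODERS = {"b64decode", "b16decode", "unhexlify", "decompress"}

  def risky_calls(source, filename="<pkg>"):
      """Yield (lineno, name) for calls to exec/eval-style and decoder APIs."""
      tree = ast.parse(source, filename=filename)
      for node in ast.walk(tree):
          if isinstance(node, ast.Call):
              func = node.func
              name = func.id if isinstance(func, ast.Name) else (
                  func.attr if isinstance(func, ast.Attribute) else None)
              if name in SUSPICIOUS_CALLS | DECODERS:
                  yield node.lineno, name

  sample = "import base64\nexec(base64.b64decode(payload))\n"
  print(list(risky_calls(sample)))  # [(2, 'exec'), (2, 'b64decode')]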

1. https://github.com/ossillate-inc/packj flags malicious/risky packages.


I don’t think this would work well for shipping packages that use proprietary libraries. But at the very least they could be flagged, yes.


The short answer is that this can’t be easily mitigated at the package index level, at least not without massive breaking changes to the Python packaging ecosystem: PyPI would have to ban all setup.py based source distributions.

Even then, that only pushes the problem down a layer: you’re still fundamentally installing third party code, which can do whatever it pleases. The problem then becomes one of static analysis, for which precision is the major limitation (in effect, just continuing the cat-and-mouse game.)


Why would you think that would change a thing? Also, obfuscation has legitimate uses by people making stuff they don't want easily reversed. This isn't a python specific problem.


Yeah, just get rid of anything that has a binary blob. Cool. And then when PyPI gets swapped out for whatever immediately replaces it because PyPI is useless, then at least PyPI will be secure.


For instance, F-Droid only permits software that can be verifiably compiled by them.

It essentially bans binary blobs yet it is very useful.


F-Droid also has very different goals and lives in a much smaller and in some ways much saner ecosystem.


Yes, but PyPI has 4 million releases to check, and the scientific and machine-learning wheels are very hard to compile (scipy contains C, Fortran, and assembly code, and must be compiled for Mac, Linux, and Windows).

Providing a build env for that would make it prohibitively complicated and expensive, and would basically mirror GitHub CI.

That's the reason Continuum is making money: they sell a Python package distribution channel that is checked and locked.


F-Droid is infamous for taking weeks to build a new version of any app.


I'm given to understand that's more about having an offline signing process than the actual builds.


> PyPI is useless

Why would it be "useless"? Explain your reasoning please.


Most of the libraries I use include compiled C/C++/Fortran/Rust code. Pandas, scipy, scikit-learn, … If I were limited to pure-Python libraries, I would probably rather swap languages, or at least package managers, at great inconvenience.

That being said, I don’t think PyPI would be «useless» - this was the state a few years ago, and we had to compile all the libraries ourselves. I don’t want to go back.


None of those packages are downloading and running CRAP.EXE within the setup.py process; that's not how native extensions work. It should be possible to flag packages that download things when setup.py runs, let alone packages that run exec within setup.py. A Python package that really needs you to run a Windows installer for its dependencies should have you do that separately.


Yes, but the problem here is the obfuscation of the malware code loading. There is no need to trigger it in the setup.py process: as long as you have it in the lib, you can always put a call in a .pth file somewhere and run your malware as soon as any Python is executed.
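
Concretely, the mechanism is just this (a harmless stand-in; the file name is made up):

  # Hypothetical contents of site-packages/totally_innocuous.pth.
  # The site module executes any .pth line that begins with "import"
  # on every interpreter startup, before any of your own code runs.
  import sys; sys.stderr.write("this ran at interpreter startup\n")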


It should be possible to test packages for that also. If you are testing setup.py to see that no network access or exec occurs, you could similarly run the Python interpreter after install and ensure no network / exec() happens at that point either, assuming one has not imported the package. Or just disallow unfamiliar .pth files from being installed altogether (outside of those generated by setuptools / etc. for normal execution).
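
A rough sketch of the "run it and watch" half of that, again with audit hooks (the package name is hypothetical, and strace or a throwaway container is still needed to catch native code that bypasses Python-level hooks):

  import importlib
  import sys

  # Network and process-spawning audit events worth flagging on first import.
  # ("exec" is too noisy to watch here: the import machinery fires it for
  # every module it loads.)
  WATCH = {"socket.connect", "urllib.Request", "subprocess.Popen", "os.system"}

  seen = []

  def record(event, args):
      if event in WATCH:
          seen.append((event, args))

  sys.addaudithook(record)
  importlib.import_module("package_under_test")  # hypothetical package name

  print("suspicious events on import:", seen)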


Given that every attempt to sandbox Python has failed, and that every system exposed to the public has been pwned, I assume this is a cat-and-mouse game we can't win.

At best, I suppose we could put checks in place to catch the low-hanging fruit. But we are, after all, allowing a Turing-complete and highly dynamic language to execute.

> or just disallow unfamiliar .pth files from being installed altogether

That would kill the entire plugin ecosystem.

Now, the next thing could be to have a permission system, requesting access to the network, fs, .pth, etc. It would not be a bad idea, given that we are, after all, installing things that are as powerful as apps.

But it would be a gigantic effort, and users still would just accept without reading, like they do with apps.


Sure, I didn’t intend to claim that. It's just a hassle for me to compile my own C code, which I'd have to do if binaries weren't bundled. That's why Anaconda Python took off on Windows - it's hard work to compile scipy on Windows!


PyPI delivers wheel files for pre-built binaries, and that's the only way one is supposed to distribute pre-built binary executables or shared libraries. The issue of "runs malicious code in setup.py" does not apply in that case because setup.py isn't invoked.
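
Which also points at a cheap mitigation you can apply today: tell pip to refuse sdists entirely, so nothing gets built (and no setup.py runs) on your machine. A sketch, with a made-up package name:

  import subprocess
  import sys

  # --only-binary=:all: makes pip fail rather than fall back to building
  # from an sdist, so no setup.py or build backend ever executes locally.
  subprocess.check_call([
      sys.executable, "-m", "pip", "install",
      "--only-binary", ":all:",
      "example-lib",
  ])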


> great inconvenience

It's a convenience/security trade-off, I see.

The only solution I've ever seen to that requires investing trust in an "authority" which then becomes corrupt and censorial. One simply expands the dilemma to a triad: security/freedom/convenience.

If I am not mistaken, the PyPI "Cheese Shop" is owned by the Python Software Foundation, a 501(c)(3) nonprofit organisation which constitutionally values Software Freedom highly. It seems natural that convenience would be sacrificed if security is of concern.


Such an authority in the Linux world used to be a distribution. Installing a binary blob provided by Debian build servers is based on decades of trust.

But there is a tradeoff between having things thoroughly vetted and tested, and moving fast.


Interesting point. So as dimensions we now have

  - security
  - freedom
  - convenience
  - speed/newness

Who can build me a UI with four sliders that selects the packages I can install? Bonus: when I move a slider it highlights all the potential packages that changed status with reasons why they are now included/excluded.


You're an HN reader, so you should be able to knock this out over a weekend /s


You're right, the prototype GUI is a weekend of work. But you also know that's not where the work is :) Now that some more intelligent comments are coming in, we can talk about the analysis and tagging of thousands of packages, dealing with backward compatibility, and what happens when naughty malware just hops to another level of trust.

But none of that is a call to give up. We just need to think seriously about the problem we face.


Windows S Mode has PyPI restricted to pure Python due to Device Guard. I'm happy to leave it on ($250 laptop). Indeed, Numpy has been a recurring blocker, maybe 3 times now. But general peace of mind is the only way I've known Python/PyPI, so I'm pretty happy with it. I have a few RasPis that I can use as auxiliary devices as well, which I think is a pretty cool tradeoff: a hardware sandbox. Not gone there yet, beyond just configuring SSH/xRDP so I'm ready if the day comes.

But I've made a ton of web apps and tools anyway, including a little process launcher that plays the role of poor man's Docker.

It'd be nice if those popular systems had a pure-Python capability anyway, a similar analogy being software-rendered 3D back in the day.


A simple warning when a library adds a binary blob should be enough. You don't need to ban them entirely.


They can just load the real payload as a second stage.


Are they to test this condition for each input to each program?


Even simpler solution: Require cryptographic signatures of the developers of projects along with hash-verified downloads via pip.
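
The hash half of this already exists in pip (--require-hashes plus --hash=sha256:... entries in a requirements file); a rough sketch of the same check in plain Python, with the file name and digest as placeholders:

  import hashlib

  def verify_artifact(path, expected_sha256):
      # Stream the file so large wheels don't have to fit in memory.
      digest = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              digest.update(chunk)
      if digest.hexdigest() != expected_sha256:
          raise RuntimeError("hash mismatch for " + path)

  # Both arguments are hypothetical; the digest would be pinned in your
  # requirements or published by the project out of band.
  # verify_artifact("example_lib-1.0.0-py3-none-any.whl", "<sha256 hex digest>")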

The problem is a failure to understand security.


A malicious author could embed malicious code in the package and still get the package signed. Hashing won't prevent this sort of thing on PyPI; it just addresses in-transit and alternate-supplier attacks.


Requiring anything from open source authors is a losing proposition. Items of interest just won't end up on PyPI. IIRC this chain of events has already happened on another distribution platform.



