Finding secrets by decompiling Python bytecode in public repositories (jse.li)
165 points by gilad on May 31, 2020 | 51 comments



Public Service Announcement.

While for the other secrets, I’ve got nothing, there is never a reason to have AWS secret keys in your code or in application-specific configuration files.

Every AWS SDK will automatically read your keys from your config file in your home directory locally. Just run

  aws configure 
When you run your code on EC2, Lambda or ECS, the same SDKs will automatically get the keys associated with the attached role.
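For example, a minimal sketch with boto3 (the service call is just an illustration): no keys appear anywhere in the code, the SDK finds them in ~/.aws/credentials locally or from the attached role on AWS.

  # hedged sketch: boto3 resolves credentials from ~/.aws/credentials,
  # env vars, or the attached IAM role -- nothing is hardcoded here
  import boto3
  s3 = boto3.client("s3")
  print(s3.list_buckets()["Buckets"])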


For other, non-AWS credentials, just use environment variables.
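e.g. in Python (a tiny sketch; DB_PASSWORD is just a made-up name):

  import os
  # hypothetical variable name; raises KeyError if it was never exported
  db_password = os.environ["DB_PASSWORD"]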


They have to be set up somewhere, though.


It should be done out of band.

The repo should have a `README` that says the secrets are in the team 1Password account: talk to <team member> to get a 1Password account and get added to the group that has access to the vault with the creds.

The repo should have a source script that will pull the credentials[1] and `export` them to your ENV. `direnv` can make that happen automatically[2], or you can run that script from your `.bashrc` or similar.

You can do something similar with your favorite secrets manager. I've used a similar approach before, with good results.

[1] https://support.1password.com/command-line-getting-started/

[2] https://direnv.net/


If you use AWS, “your favorite secrets manager” should be AWS Secrets Manager, and you should use an AWS Role to limit access to the secrets.

Have your code fetch any secrets it needs from AWS Secrets Manager, using the name of the secret (which does not have to be kept secret, so it can be in your source repo).

That way, you don’t have to put secrets in your environment, with the risk of leaking them.
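Roughly like this with boto3 (a hedged sketch; the secret name is made up, and only the attached role needs permission to read it):

  import boto3
  client = boto3.client("secretsmanager")
  # "prod/db-password" is a hypothetical secret name -- safe to keep in the repo
  secret = client.get_secret_value(SecretId="prod/db-password")["SecretString"]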


For comparison, open source password managers are zero cost, 1Password is fixed cost (even more fixed if one buys the software, instead of the subscription), and in contrast https://aws.amazon.com/secrets-manager/pricing/ is $0.40 per secret per month, plus a tiny but not zero cost per API access.

I'm just pointing out that AWS Secrets Manager is not an automatic, no-brainer win.


SSM Parameter Store (not Secrets Manager) is also “zero cost” and you can have a parameter of type “SecureString”.

The other solutions don’t integrate with AWS IAM. Something has to grant access to the password vault. In the case of Secrets Manager/Parameter Store you just grant access to the role attached to your EC2 instance/ECS cluster/Lambda.
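Same idea, sketched with boto3 (the parameter name is made up):

  import boto3
  ssm = boto3.client("ssm")
  # SecureString parameters are decrypted on read when WithDecryption=True
  value = ssm.get_parameter(Name="/prod/db-password", WithDecryption=True)["Parameter"]["Value"]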


$0.40 per secret seems really high. I guess they aren't doing per-user pricing like other platforms, so maybe it would be cost effective for large teams?


This would be solved if python used an (OS-specific) cache directory for its .pyc files. I have always disliked .pyc files... here's a concrete reason!

Question: what does python do if it doesn't have write permission in the current working directory? Not write the cache?


It already (since 3.2, as the article points out) uses a cache directory. Why would an OS-specific one help?

You can also set PYTHONPYCACHEPREFIX if you want to use 'a mirror directory tree at this path, instead of in __pycache__ directories within the source tree' - https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPY...
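For what it's worth, the setting is also visible from inside the interpreter (assuming Python 3.8+):

  import sys
  # mirrors PYTHONPYCACHEPREFIX / -X pycache_prefix; None means .pyc files
  # go into __pycache__ directories next to the source as usual
  print(sys.pycache_prefix)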

The PEP mentions your case at https://www.python.org/dev/peps/pep-3147/#case-5-read-only-f... . But honestly, I don't really follow the answer. I think it means "ignore creation and write failures."


> Why would an OS-specific one help?

E.g., if you want to back up your home dir, and omit caches, since they can be regenerated. It's a lot easier if programs write their cache data to ~/.cache / $XDG_CACHE_HOME than if they intermix it / scatter it about.


But how does it help to have one directory for all Linux-based OSes (or one directory for RHEL, one for Ubuntu, one for Debian, etc.), and one for FreeBSD and one for OpenIndiana?

At least, that's what I interpret "OS-specific" to mean.


More along the line of the original comment you're replying to, if the cache were in, say, ~/.cache, then it wouldn't get swept up in the repository's commits, since the cache data would no longer be inside the repository's working directory. Then it would never get uploaded to GitHub, and this security issue would never happen.

I have seen a surprising number of people — some who are engineers by profession too, and ought to know better — just git add everything, and then commit it all without looking. One should review the diff one has staged to see if it is correct, but alas…


That's possible with 3.8's PYTHONPYCACHEPREFIX, yes?

Perhaps it's worthwhile for someone to blog about this more/promote this as a best practice? Though what's missing is the hook to connect it as appropriate for the given platform.

I see now that "OS-specific" was meant to be interpreted as "the OS-defined mechanism to find a cache directory", not "a cache directory which differs for each operating system".

I would not have been confused by the term "platform dependent", which is what Python's tmpdir documentation uses, as in: "The default directory is chosen from a platform-dependent list" at https://docs.python.org/3/library/tempfile.html?highlight=tm... .


Windows exists, too. Probably would be in %APPDATA% or %TMP%


I suspect his thought was to use something like "%UserProfile%/pycache/" on Windows and "$HOME/.pycache/" on Linux.


That env var is new in 3.8. I've been looking forward to it to stop Docker writing root-owned cache files back to a bind-mounted file system. I might set it to $HOME/.cache/python globally as well.


> what does python do if it doesn't have write permission in the current working directory? Not write the cache?

In the case where the interpreter can't find any place it has write access to write the cache to, it will not write it, yes. That means it will have to re-parse the source file into bytecode (and fail to write the bytecode to the cache) every time it is loaded.


It makes more sense to use a second stream or named stream.


That sounds like a recipe for even more people accidentally storing and distributing that bytecode without wanting to do so, because it won't be immediately visible that it's there; many tools don't show or highlight the existence of those streams.

I'd argue that it's more appropriate to follow the 'explicit is better than implicit' from the 'guidelines' of https://www.python.org/dev/peps/pep-0020/ .


Highly recommend "export PYTHONDONTWRITEBYTECODE=1" in your bashrc and just forget about it. Pyc files are still an important optimization on modern machines in some circumstances (especially with huge oft-restarted apps), but the autogeneration behaviour has always been a pain in the ass.
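If you'd rather control it per process than in your shell profile, the same switch is exposed in the interpreter (a small sketch):

  import sys
  # equivalent to PYTHONDONTWRITEBYTECODE=1 / python -B for modules imported after this point
  sys.dont_write_bytecode = True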

The bulk of your pycs are generated during package install. What tends to remain in the usual case is a handful of files representing app code or similar.


Why do you highly recommend this? You can set git (or other VCS) settings to avoid committing .pyc files. They're not large files; I don't see the downside.


Difficulty with git basics is the least of your worries. For example, pycs are fundamentally racy: it is quite possible to have a .py newer than a .pyc, depending on how unlucky you were with an in-progress deployment, or some tool that never updated the .py timestamp. Python continues to execute the pyc even though the code changed, since the minimal benefit of the pyc would be rendered significantly inert should Python use any kind of strong check (instead of just comparing second-granularity stat() output) to ensure the cached bytecode matches the source. In this way, without your permission, Python silently plays undesirable code execution roulette with your computer every time it starts, for as long as you have the feature enabled.

I have lost count of the number of times I've seen someone lose an hour due to it. I can also count many instances of QA environments becoming inexplicably bricked by it. The correct fix for this requires opening the .py and hashing its content, at least doubling the amount of IO required to start a program. They were a great feature when parsing small files was noticeably slow, but this hasn't been true for almost 20 years.

It's therefore worth turning the question around: why do you think pyc files are useful?


Python 3.7 added support for hash-based cache files as an alternative to time-based.

https://docs.python.org/3/reference/import.html#pyc-invalida...

Verifying a hash is a bit slower than checking the timestamp but far faster than parsing and byte compiling the source file, so I don't think this option is "significantly inert".
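Opting in looks roughly like this (a sketch, assuming Python 3.7+):

  import py_compile
  # CHECKED_HASH embeds a hash of the source in the .pyc; the import system
  # re-hashes the .py on load instead of comparing timestamps
  py_compile.compile("secrets.py",
                     invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH)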


Managed to miss this, thanks. I'd be interested to hunt out the BPO ticket at some stage to see if they benchmarked on NFS or spinning rust.


Huh. I hadn't bothered to read the PEP, which is https://www.python.org/dev/peps/pep-0552/ on "Deterministic pycs."

> The current Python pyc format is the marshaled code object of the module prefixed by a magic number [7], the source timestamp, and the source file size. The presence of a source timestamp means that a pyc is not a deterministic function of the input file’s contents—it also depends on volatile metadata, the mtime of the source. Thus, pycs are a barrier to proper reproducibility.

That is, they were made for a quite different use case than you or I were talking about.

I looked at the PEP to see if it gave timing numbers. No luck - would be a good blog post if I were still blogging. It does say:

> The hash-based pyc format can impose the cost of reading and hashing every source file, which is more expensive than simply checking timestamps. Thus, for now, we expect it to be used mainly by distributors and power use cases.


In the last 12 years of writing Python, I have only had issues with .pyc files a handful of times, and always with Python < 2.7. Anecdotally, this experience is shared by everyone I have worked with.

If you’re seeing this regularly, it suggests there may be something unique or uncommon in your set-up. You may wish to isolate and change whatever that is.


Now that you mention it, I just realized I never have problems related to .pyc files anymore, ever since I switched to Python 3 a few years ago. I remember I used to have problems with deleting database migration files, because Python would load the .pyc files of deleted migration scripts unless I also deleted the .pyc files (which I often forgot to do).


The Django development server's asynchronous auto reloader is neither unique nor uncommon.


When doing dev, there can be other problems with pyc files. One of them is deleting the original py file but still being able to import the module, because Python will import any pyc file if it exists.

So if performance allows you to do so, disabling them when you dev is not a bad idea, as long as you keep them enabled when you run CI.

IMO, when you dev, you should set PYTHONHASHSEED (so hash randomization is predictable), PYTHONDEVMODE (verbose warnings + tooling + sys.flags.dev_mode=True) and PYTHONDONTWRITEBYTECODE (removes the need to clean them up).

In CI and prod, you should make sure those are NOT set, and use PYTHONOPTIMIZE=2 (removes asserts, __debug__ blocks and docstrings) if you trust your dependencies to be well written (which I would check by running tests in a pre-push hook).
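A quick way to check what actually took effect in a given interpreter (a sketch; all of these live in sys.flags):

  import sys
  print(sys.flags.dev_mode)             # PYTHONDEVMODE / -X dev
  print(sys.flags.dont_write_bytecode)  # PYTHONDONTWRITEBYTECODE / -B
  print(sys.flags.optimize)             # PYTHONOPTIMIZE / -O / -OO (2 strips asserts and docstrings)
  print(sys.flags.hash_randomization)   # affected by PYTHONHASHSEED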


Just reading the script from TFA, it attempts to find secrets.pyc and decompile it, but doesn't even check whether secrets.py is also in the repo. A glance at search results (I just used GitHub's web interface, didn't bother to run the code) tells me that when secrets.pyc is committed, secrets.py comes with it the vast majority of the time.

I guess the author did find cases where secrets.pyc is committed but secrets.py is not? It's hard to fathom how that could have happened (especially inside "organization" settings). Sounds like the result of absolute rookies in both Python and git following a tutorial with a step "add secrets.py to .gitignore" that unfortunately takes ignoring __pycache__ and *.pyc for granted, which is too much to ask of some people.

> it is very easy for an experienced programmer to accidentally commit their secrets

No, it doesn't take an experienced programmer to put __pycache__ and *.pyc in a global ignore, or use a gitignore boilerplate at project creation, or notice random unwanted files during code review.


Seems to me quite easy to forget ignoring __pycache__ and *.pyc, have the secrets pushed to the repo, then never get around to removing them from history.


Fix your process then. Use a global ignore file. Add a language-specific gitignore boilerplate as the first thing when you create a new project. Scan for files that don't belong during code review (do I even need to suggest this).

> never get to remove them from history.

Scrubbing specific files from git history isn't hard.

s/__pycache__ and *.pyc/secrets.py/g and people will also commit it in. PEBCAK.


Of course there are trivial solutions to this issue.

Nonetheless, this is a common mistake, whether you believe it or not. And if it is common, then it will be exploited.


The premise of my original post is that ignoring secrets.py but not secrets.pyc is probably not very common. TFA claims "thousands of GitHub repositories contain secrets hidden inside their bytecode", which is probably true, but at least the vast majority of those have secrets.py in plain sight as well, no decompiling necessary; and TFA doesn't actually demonstrate any effort to filter those out.


I think I am an experienced developer (not Python) and this would never cross my mind.


It would never cross your mind not to commit .pyc files to source control? They're not even source. Committing .pyc files is to Python what committing .o files is to C.


> They're not even source.

To be clear, they’re not even text. You don’t need to know Python at all to realize something’s not right when you’re committing unknown binaries to source control.


When you review the list of changes (including added files), you notice the files that shouldn't be there; so you would see them before committing. And even if you forget, git also lists new files after calling "git commit"


I actually insist at work that most repositories don’t have a .gitignore; just set up a reasonable global one and you’re done. OSS or repos with a large number of contributors are generally exceptions here.


That sounds like poor advice with unclear rationale behind it. There are definitely project specific ignores that you’d want to set up, and if you’re working with projects with multiple different languages, “one gitignore to rule them all” fast becomes a mess.


Sorry, to clarify: if there's good reason to have project-specific ignores, that's fine. But most projects have similar/overlapping ignores.


Sure, but reasonable for me is different from reasonable for you. In addition, I might miss files that are repo-specific. Insisting repos don't have a .gitignore is terrible advice. It costs little, if anything, to maintain one.


I think you are misunderstanding. The secret does not need to be hardcoded in the python file. If it's read in from an environment variable or some other external source, it will also be in the pyc


Of course not, that would mean env vars are hard-coded into byte code at compile time, which would be completely crazy. A pyc file is just a parsed series of op codes that the interpreter could dispatch directly, so that it doesn't have to parse source files every single time.

It's very easy to verify:

secrets.py:

  import os
  SECRET = os.getenv('SECRET')
Then

  $ python -m compileall secrets.py
  $ uncompyle6 __pycache__/secrets.cpython-38.pyc
  # uncompyle6 version 3.7.0
  # Python bytecode 3.8 (3413)
  # Decompiled from: Python 3.8.2 (default, Mar 10 2020, 12:58:02)
  # [Clang 11.0.0 (clang-1100.0.33.17)]
  # Embedded file name: secrets.py
  # Compiled at: ...
  # Size of source mod 2**32: 40 bytes
  import os
  SECRET = os.getenv('SECRET')
  # okay decompiling __pycache__/secrets.cpython-38.pyc


That’s totally incorrect. .pyc files just contain a representation of the _code_ and not any values that don’t exist in the code.

So a snippet like “os.environ[‘my_super_secret’]” won’t contain anything else than the bytecode to fetch that environment variable.


Radare2 also supports Python bytecode of different versions [1].

[1] https://github.com/radareorg/radare2/tree/master/libr/asm/ar...


Another possible place to look for secrets is in public docker images. Bots are scanning GitHub repos for secrets all the time, but what about dockerhub (and other docker image repositories)? I accidentally left a secret in my public docker image once, and that's made me quite paranoid about it now.


I scanned everything pushed to dockerhub for a few weeks but didn’t find too much interesting stuff showing up


I read "Finding secrets by decompiling Python bytecode in public restrooms" by accident. It never occurred to me that anyone would do THAT in there.


there's bytecode sharpied on the walls



