The Python Package Index is now a GitHub secret scanning integrator (github.blog)
372 points by rbanffy on March 24, 2021 | 114 comments



These secret scanning integrations have been very helpful. We recently had a client ask us to take open source a project that had started a few years ago as closed source. We of course checked over the current version of the code, and we have had linters in place to look for secrets for a while, but not in the very early days of the project. In that one codebase we had:

- An AWS IAM token for S3 upload access to a throwaway dev bucket. The bucket had already been deleted, but still... Got an email about it informing me the IAM token had been revoked by AWS within 5 minutes.

- A Slack webhook notification URL/secret. Committed as an example on a working branch and then git rm'ed, but still active. Got an email about it and the token was revoked by Slack automatically within 5 minutes.

- A Mapbox API token. This one was funny. The token was indeed in there and functional but was in the docs/sample code for a dependency. Still, we got an email within the hour about it and were able to investigate.

Edit: In this case we intentionally kept the commit history. A safer alternative (and one we normally practice) is to start a fresh repo for the open source variant.


When I helped to take Zulip open-source in 2015, I wrote a simple script that scrubbed secrets from the commit history using git fast-export and git fast-import. We replaced all our secrets with xxxxxxx placeholders, replaced internal customer references with dummy names, deleted and renamed certain files, and even did some code replacements that caused certain commit diffs to become empty so those commits could be removed from the history.

https://github.com/zulip/zulip/blob/3.3/tools/zanitizer

https://github.com/zulip/zulip/blob/3.3/tools/zanitizer_conf...

The script was really fast (all ~10000 commits in a few minutes), which allowed us to iterate quickly on its configuration as we audited using gitk and other tools for remaining items to scrub.

Doing this work allowed us to release with an essentially complete history going back to the first commit in 2012, which has been a really valuable resource for understanding why various Zulip subsystems were written the way they were.

Nowadays there are other tools for scrubbing history that might be more polished, like BFG: https://rtyley.github.io/bfg-repo-cleaner/
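
The general shape, if anyone wants to roll their own, is a small filter spliced between git fast-export and git fast-import. A minimal sketch (not the actual zanitizer, and the replacement table here is invented):

    #!/usr/bin/env python3
    # scrub.py -- usage: git fast-export --all | ./scrub.py | git fast-import
    # Rewrites blob (and commit message) contents and fixes up the declared lengths.
    import sys

    REPLACEMENTS = {
        b"hooks.slack.com/services/T000/B000/SECRET": b"xxxxxxx",  # made-up examples
        b"AKIAFAKEFAKEFAKEFAKE": b"xxxxxxx",
    }

    def scrub(payload: bytes) -> bytes:
        for old, new in REPLACEMENTS.items():
            payload = payload.replace(old, new)
        return payload

    inp, out = sys.stdin.buffer, sys.stdout.buffer
    while True:
        line = inp.readline()
        if not line:
            break
        if line.startswith(b"data "):              # raw payload follows this header
            payload = scrub(inp.read(int(line.split()[1])))
            out.write(b"data %d\n" % len(payload)) # re-declare the (possibly new) length
            out.write(payload)
        else:
            out.write(line)

The re-declared length is the fiddly bit: a naive search-and-replace over the raw stream breaks as soon as a replacement changes a blob's size.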


Nice tooling. I've used bfg when we knew what patterns to look for. This project didn't generally access private data and had a reasonably well behaved team for most of its life (the pre-linter & code-review commits were my own damn fault). Since it was low risk, I just did a few manual `git log -S ...` searches and moved on. I was still very happy to have GitHub catch my throwaway credentials and remind me in the most obvious way that these things go in `ENV` and not IN code, even in examples!


> A safer alternative (and one we normally practice) is to start a fresh repo for the open source variant.

Note that it's also possible to go back and rewrite history (e.g. if you know what the tokens are and where/when they were committed), to preserve Git history while cleaning out tokens. It can be mildly slow or complicated, but there are tools to automate it, such as BFG Repo Cleaner[0] which is relatively easy to use (once you learn it).

There are other awesome rewriting tools, like git filter-repo[1], but that operates solely on the structure of the repository (i.e. it can manipulate basically anything except file contents). Great for removing unwanted files or directories extremely fast, but not good for removing tokens (unless you want to remove the entire file the token was in).

    [0] https://rtyley.github.io/bfg-repo-cleaner/
    [1] https://github.com/newren/git-filter-repo


Learning so many options from this thread. I've used these tools when I knew what to look for, but that's been the tricky bit.

psanford also mentioned truffleHog and others, and lstamour mentioned https://github.com/cloud-gov/caulking which is built on gitleaks and looks good. caulking's customized list of patterns for gitleaks is here: https://github.com/cloud-gov/caulking/blob/master/local.toml It looks like it would have found the keys in my example case, no problem.


I've found that git filter-repo actually can modify file contents, using --replace-text and a file containing replacements.


An overlooked vector is old commits. It’s often better to squash all commits before taking a project open source, which is a real shame for obvious reasons.

Commit histories can spill a lot of secrets that are easy to overlook.


There are tools available to help look for this sort of thing (for both you and any potential attackers). TruffleHog[1] is the first one that comes to mind for me.

I also like shhgit[2] for looking for secrets in repositories. (I don't think shhgit will look back in the git history for you though).

[1]: https://github.com/dxa4481/truffleHog

[2]: https://github.com/eth0izzle/shhgit


Another idea is to use a git commit hook, such as https://github.com/cloud-gov/caulking


Thanks! I knew they existed but hadn't investigated for one that would look over past history. Will try out truffleHog.


I've found the entropy detection in trufflehog to be pretty noisy. When I've run it I generally disable that.


Absolutely this!

Same problem here with inner source that is going open source.

I feel sorry for all our internal committers; however, I know of "secrets" that went into the commit history. We are still considering our options, but we tend to opt for deleting our commit history entirely and building a wall of fame for the former committers.


My current fear is versioned backup systems. KeePass files may have secure master keys now, but maybe the version saved 18 months ago did not.

1. Get an old copy. 2. Run a dictionary attack. 3. Prosper.


FYI pypi tokens look like pypi-9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9

The integration means that GitHub knows to recognize this format, and calls some API of pypi.org when it finds one so PyPI can revoke it.

As always, please allow me to lament that we don't have a standard for this, such as secret-token:pypi.org/9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9, which would let any system know that this string is a secret and that pypi.org should be notified (for example via POST pypi.org/.well-known/compromised-secret). See also https://news.ycombinator.com/item?id=25978185
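
For what it's worth, a scanner could then handle every issuer in a dozen lines, with no registry of formats. Purely hypothetical sketch of that scheme (the URI shape and the .well-known path are part of the proposal, not anything that exists today):

    import re
    import urllib.request

    # Hypothetical self-describing format: secret-token:<issuer-___domain>/<token>
    SECRET_TOKEN_RE = re.compile(r"secret-token:([a-z0-9.-]+)/([A-Za-z0-9._~-]+)")

    def report_leaks(text):
        for issuer, token in SECRET_TOKEN_RE.findall(text):
            # Notify the issuer via the hypothetical well-known endpoint.
            req = urllib.request.Request(
                f"https://{issuer}/.well-known/compromised-secret",
                data=token.encode(),
                method="POST",
            )
            urllib.request.urlopen(req)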


Hey there! I designed and implemented PyPI's tokens (although not the secret scanning integration).

They're actually just macaroons[1] internally, which means that they could easily be upgraded at some point to include a reporting URL like you mention.

Just as a tidbit: they were originally prefixed with "pypi:" rather than "pypi-", but that colon caused problems for a few packaging utilities. Any sort of in-band signaling like that is unlikely to gain widespread adoption for exactly that reason :-)

[1]: https://en.wikipedia.org/wiki/Macaroons_(computer_science)
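
If anyone wants to play with the idea, the general flavour with the pymacaroons library looks roughly like this (a toy sketch, not Warehouse's actual token code, and the caveat strings are made up):

    from pymacaroons import Macaroon, Verifier

    # Mint a token whose permissions ride along inside it as caveats.
    m = Macaroon(___location="pypi.org", identifier="key-id-1", key="server-side-secret")
    m.add_first_party_caveat("project: sampleproject")   # invented caveat format
    token = "pypi-" + m.serialize()

    # Later, the server verifies the signature and enforces the caveats.
    v = Verifier()
    v.satisfy_exact("project: sampleproject")
    v.verify(Macaroon.deserialize(token[len("pypi-"):]), "server-side-secret")

A reporting URL could ride along the same way, as just another caveat.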


Interesting. I can get the "pypi.org" ___domain from the base64-encoded part, but I don't see anything about revocation in the paper.

Your reporting endpoint seems protected by a secret key that GitHub holds. Any reason PyPI can't accept anonymous submission of compromised tokens? If I find a PyPI token on my own server, can I not post it to https://pypi.org/_/github/disclose-token without getting a key from you first?


> I don't see anything about revocation in the paper.

I don't believe it's something standardized or considered by the original whitepaper. Macaroons have the ability to contain arbitrary data, however, so it wouldn't be difficult to add revocation information to them.

> If I find a PyPI token on my own server, can I not post it to https://pypi.org/_/github/disclose-token without getting a key from you first?

I wasn't part of the design, but my first thought goes to preventing the endpoint's use as an oracle: after a compromise, a malicious agent might find it useful to have an unlimited endpoint to test their stolen credentials against. Restricting use to a limited set of trusted entities avoids that.


I don't think allowing revocation of a token by any bearer of the token is much of a security issue. Consider a real world example, if one finds a credit card someone dropped on the street it can be reported as lost and revoked by the issuer even though the reporter is not the owner.

As for the endpoint being an oracle, it doesn't really need to tell the reporting client anything other than that the revocation request has been received.


> I don't think allowing revocation of a token by any bearer of the token is much of a security issue. Consider a real world example, if one finds a credit card someone dropped on the street it can be reported as lost and revoked by the issuer even though the reporter is not the owner.

Whether or not it's a security issue depends on how the token is being used. Allowing potentially arbitrary parties to revoke tokens right before, say, a critical security release feels like a potential issue to me. Then again, I suppose they could do that by proxy by just publishing it on GitHub and letting the secret scanner do the work.

Long story short: I'm idly speculating. For all I know, they did it because allowing arbitrary parties to report leaked secrets would result in unacceptably high FP rates. I wasn't privy to the decision.


> Allowing potentially arbitrary parties to revoke tokens right before, say, a critical security release feels like a potential issue to me

If the third-party has the token, they can make releases *adding* critical security issues.


Doesn't every other endpoint work as the oracle you describe? Are you worried about rate-limiting specifically?

Also, the endpoint sends a 204 with no information about the validity of tokens, making it not much of an oracle. I think the payload is processed in the background too, preventing timing attacks.


> Doesn't every other endpoint work as the oracle you describe? Are you worried about rate-limiting specifically?

Rate-limiting was just the easy example. Other endpoints are subject to additional constraints: tokens don't directly carry their user information (IIRC), so someone with a collection of stolen tokens may not know which projects they can control. Similarly, tokens are scoped, so "create a new project" isn't something an arbitrary token can necessarily be used for to gain more information about its rightful owner.

Like I said, I don't know too much about the actual design decisions for that endpoint! That was an educated guess, based on what I might have done.


According to the documentation (https://docs.github.com/en/developers/overview/secret-scanni...), secret issuers specify a regex that can detect secrets they've issued. "Be as precise as possible, because this will reduce the number of false positives" - that's the guideline from GitHub. Github runs the regex on every commit that is uploaded and informs the secret provider when a match occurs.
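
In other words, the scanning half is conceptually just this (the pattern is a guess based on the token shown upthread, not the real regex):

    import re

    # Assumed shape: "pypi-" followed by a long base64-ish blob.
    PYPI_TOKEN_RE = re.compile(r"pypi-[A-Za-z0-9_-]{40,}")

    def candidate_tokens(commit_text):
        """Return suspected tokens; the issuer decides whether they're live."""
        return PYPI_TOKEN_RE.findall(commit_text)

    print(candidate_tokens('PASSWORD = "pypi-' + "A" * 60 + '"'))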


I see that they document the alerting endpoint there. The only piece missing is building the URL from the token format. I hope we get there someday, and everyone can deploy this without having to replicate GitHub's registry of token formats.

This page also mentions that they "strongly recommend you implement signature validation in your secret alert service", but I'm not sure why. Isn't the fact that they send valid tokens proof that they have really found a leak?


So, you can submit an overly generous (or specifically crafted) regex to get notified of tokens that someone else issued if you know their format?


I wonder if false-positives often result in GitHub sending secrets to the wrong service.


I wonder if any of those services have a combination of bad regexes and bad validation and could be SQL injected by committing a malicious faux-token to GitHub.


One cool data format standard I only recently learned about is multihash[1] - a self-describing hash format: the first byte represents the hashing algorithm, the second byte represents the length of the hash, and the subsequent [length] bytes are the actual hash.

Something similar for tokens would be really useful.

[1] https://multiformats.io/multihash/
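
Decoding one is close to trivial. Simplified sketch (the real spec encodes both the code and the length as varints, so they aren't strictly single bytes):

    def parse_multihash(data: bytes):
        """Split a multihash-style value into (algorithm code, digest).

        Simplified: assumes the code and length each fit in one byte.
        """
        code, length = data[0], data[1]
        digest = data[2:2 + length]
        if len(digest) != length:
            raise ValueError("truncated multihash")
        return code, digest

    # 0x12 is sha2-256 in the multihash table, 0x20 = 32-byte digest.
    code, digest = parse_multihash(bytes([0x12, 0x20]) + bytes(32))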


Until someone stores a secret without the prefix - because it's always the same, right?


As long as the API wrappers don't mess this up, this has no reason to happen.


The headline sounds insidious (How dare PyPI and GitHub secretly scan me! I'm glad someone has revealed this dastardly collusion!) but it turns out they're actually doing something great.


Naming things is the hardest thing to do in computer science.


That, and cache invalidation.


That, and off-by-one errors


There are actually only two hard problems in computer science:

0) Cache invalidation

1) Naming things

5) Asynchronous callbacks

2) Off-by-one errors

3) Scope creep

6) Bounds checking


7) Project estimation


-1) Keeping secrets


Luckily, building better garbage collectors is easy: keep a reference count of the pointers to each cons.


Ha ha! I get the reference:

http://people.cs.uchicago.edu/~wiseman/humor/ai-koans.html

Moon instructs a student

One day a student came to Moon and said: “I understand how to make a better garbage collector. We must keep a reference count of the pointers to each cons.”

Moon patiently told the student the following story:

“One day a student came to Moon and said: ‘I understand how to make a better garbage collector...

[Ed. note: Pure reference-count garbage collectors have problems with circular structures that point to themselves.]


old heads and new alike ~~grok~~ vibe




4294967295) Integer underflows


NaN) Javascript


7) February 29th.


7) Timezones

FTFY


7.0000001) leap seconds


9000) communicating


I thought it was the second hardest. At least that's what I remember, since I last checked.


We see what you did there.


Yes brother I agree!


@dang, in re: this comment, any hopes of editing the title to say "secret-scanning" with a hyphen? Might add some clarity.


I was seriously impressed when a few days ago I accidentally pushed my secret Discord bot token to Github and literally one second later I received a Discord message and an email letting me know that I leaked my token and that they deactivated it.


The API keys I’ve used (admittedly not many) all seem to be long random text strings. How does GitHub detect them? By them being used (i.e. in API code), or do they actually have a known format?


GitHub documents the process over at https://docs.github.com/en/developers/overview/secret-scanni.... You specify a regex, and you check if the secret is valid on your end.
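
The provider's half is basically a small HTTPS endpoint GitHub can POST matches to. A rough sketch of what an issuer might run (the field name and the in-memory "token store" are illustrative only; the linked docs describe the real payload and the signature check you're supposed to do first):

    # Toy issuer-side alert receiver, not PyPI's actual implementation.
    from flask import Flask, request

    app = Flask(__name__)
    ISSUED_TOKENS = {"pypi-example-not-a-real-token"}   # stand-in for a real store

    @app.route("/_/github/disclose-token", methods=["POST"])
    def disclose_token():
        # A real service should verify GitHub's request signature before this.
        for match in request.get_json():
            token = match.get("token", "")       # illustrative field name
            if token in ISSUED_TOKENS:           # only act on tokens we issued
                ISSUED_TOKENS.discard(token)     # i.e. revoke it and email the owner
        return "", 204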


There must be an astounding number of false positives for common patterns like an N-length string of base64 chars. Could someone upload a malicious file with millions of matching strings and watch GitHub DDoS a company's verification endpoint?


I imagine the scanning would be rate-limited on per-repo basis.


Probably also a max false positive rate; this isn't a guarantee, just a service, so if it detects X false positives it could just exclude the repo entirely as problematic.


Yeah, that would be reasonable.


"Now you have 2 problems."


This is a difficult problem indeed, but thankfully it is just as difficult for the malicious actors as it is for the "good guys". Since various bad guys have presumably been scanning public repos for years already, GitHub and PyPA adding this feature is leveling the playing field, even if it is not a 100% accurate search algorithm.


Not sure how these particular scanners do it, but during security assessments you sometimes use tools that will find all strings in an application package with high entropy.

Usually it's junk, but occasionally you do get lucky and find tokens.
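
The usual measure is Shannon entropy over each candidate string, roughly:

    import math
    from collections import Counter

    def shannon_entropy(s):
        """Bits per character; random base64 sits near 6, English far lower."""
        counts = Counter(s)
        return -sum(n / len(s) * math.log2(n / len(s)) for n in counts.values())

    print(shannon_entropy("get_user_profile"))                  # ~3.5, ordinary identifier
    print(shannon_entropy("9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1"))  # ~4.6, likely random

Hence the noise: hashes, UUIDs, and minified assets score just as high as real keys.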


PyPI API keys have a known format, they start with "pypi-".


Wow, today I learned this acronym: PyPI -> Python Package Index, after using Python for over a decade. Thanks!


Just don't confuse it with PyPy, which is entirely different...


And don't pronounce PyPI as pie-pie. It's pie-P-I.


Ah. The fat detective.


it was much easier when it was just called the cheese shop


That's why you pronounce it "Cheese Shop"


Pronounced pie-pee-aye, and not pee-pee, pie-pee or any of the other ways I heard it pronounced at work :)


Right, I used to pronounce it as pie-pie. Might continue to do so but at least I know what it stands for :D


I call it pie-pie because that makes the most sense and sounds the least weird.


Based on my workplace, I'm pretty sure it's "pee-pee". Just like 'Qt' is "cue-tee".

There's no winning these battles..


> Just like 'Qt' is "cue-tee".

How else would you want to pronounce it?


According to them, it's just "cute"


Can someone explain what exactly this means?


> From today, GitHub will scan every commit to a public repository for exposed PyPI API tokens. We will forward any tokens we find to PyPI, who will automatically disable them and notify their owners.


If you commit your AWS secrets/tokens, or similar, inside a Python script, it will now be discovered by GitHub automatically.

They have integrations with a bunch of services to recognize the tokens and disable them. This means malicious users can't copy/paste them, spin up servers, and leave you with a big bill. (Ideally, anyway; of course it could still happen, but the aim is to prevent that kind of thing.)


Though this has been true for a while, it's not what this announcement is about. This is specifically announcing automated scanning and reporting of PyPI keys, which, if exposed, could allow a bad actor to distribute compromised Python packages via PyPI (i.e. pip).


And this is a potentially huge security issue. Think about all the systems and software that rely on Python packages.


It should reduce the possibility of PyPI packages being taken over as a result of their owners being careless with their PyPI credentials.

I think it's good: the risk of a package being taken over is low, but it's very damaging if it occurs in a widely used package.


If you accidentally commit your PyPI private token to git and push it to GitHub, PyPI will detect this and disable the token within seconds (because there are absolutely bots who will try to find it and abuse it).


I presume it means that if someone accidentally pushes a token to a public GitHub repo, then it can't be used to hijack the PyPI packages corresponding to that token and make them malicious.


Is there some best practice on creating a format for secret keys? If I create an API with secret keys, should I make them something like z77dj3kl-secret-pk-[secret-stuff]?

Is there an argument (security by obscurity?) that that just makes the key easier to spot and abuse?

Or would it be better to encode it in the secret bits somehow, e.g. add 16 control bits that have known values?
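
One pattern I've seen (illustrative sketch only, not any particular provider's scheme): a recognisable prefix plus a short checksum, so scanners can match the prefix with a regex and discard most false positives offline before bothering the issuer:

    import secrets
    import zlib

    PREFIX = "acme_sk_"   # hypothetical issuer prefix

    def new_token():
        body = secrets.token_urlsafe(24)
        checksum = format(zlib.crc32(body.encode()), "08x")
        return PREFIX + body + checksum

    def looks_like_ours(s):
        """Cheap offline check: right prefix and the checksum matches."""
        if not s.startswith(PREFIX) or len(s) <= len(PREFIX) + 8:
            return False
        body, checksum = s[len(PREFIX):-8], s[-8:]
        return format(zlib.crc32(body.encode()), "08x") == checksum

    assert looks_like_ours(new_token())

Arguably the prefix also makes a leaked key easier for an attacker to spot, but attackers already grep for high-entropy strings, so the defenders probably gain more from it than the attackers do.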


FWIW, there's a new RFC specifying a URI scheme for this: https://tools.ietf.org/html/rfc8959


It would be nice instead if the git command prevented you from committing a file with a token in it.
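
You can get most of the way there today with a pre-commit hook. Rough sketch (.git/hooks/pre-commit can be any executable; the patterns here are just examples, and tools like gitleaks/caulking mentioned upthread do this far more thoroughly):

    #!/usr/bin/env python3
    # .git/hooks/pre-commit -- refuse to commit if the staged diff adds a likely token.
    import re
    import subprocess
    import sys

    TOKEN_PATTERNS = [
        re.compile(r"pypi-[A-Za-z0-9_-]{40,}"),   # example pattern
        re.compile(r"AKIA[0-9A-Z]{16}"),          # AWS access key id
    ]

    staged = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in staged.splitlines():
        if line.startswith("+") and any(p.search(line) for p in TOKEN_PATTERNS):
            sys.exit("pre-commit: staged changes look like they contain a secret")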


Haven't seen it mentioned here, and it's not specifically mentioned in the docs where they explain secret scanning: GitHub also does secret scanning on all public gists. Seeing as how every gist is just a repository under the hood, it makes sense.

I've seen devs share a snippet of code with an AWS Access Key ID/Secret in it using gists, and we immediately got a notice from Amazon about that key being compromised.


This makes me wonder if GitHub should do basic code sanity checks on every repo. Things like checking for division by zero, infinite loops, etc. They'd have to be very conservative checks so as not to trigger false positives. But if there is benefit in secret scanning for all public repos, there must be benefit in detecting other types of programmer mistakes.


They acquired LGTM (https://github.com/marketplace/lgtm) not too long ago, so expect this to happen.


Do any APIs standardise on a simple secret key pattern that can be easily identified as a secret? For example, all secrets have a "secret-" prefix? Or is this idea unworkable?

I usually try and prefix e.g. fields in config files with "secret" to make it obvious they shouldn't be committed.


There was a discussion a while ago about IETF RFC 8959 which proposes a secret-token URI that might be of interest: https://news.ycombinator.com/item?id=25978185


They've got a decent list of partner companies, which you can find over here:

https://docs.github.com/en/code-security/secret-security/abo...

Glad they've got our backs.


In case anyone is interested, it looks like this is the implementation on the PyPI side: https://github.com/pypa/warehouse/pull/8563


    > Fixes #6051
    > See #7124 reverted in #8555 due to #8554 which is addressed in #8562 (pfew...)
    > Should not be merged before #8562: EDIT: 
    > 
    > Re-revert of the code. The bug that caused revert was splitted into #8562
Software development in a nutshell, everyone.


This is great. Hopefully we will get GitHub Packages support for Python soon. https://github.com/features/packages


It's on their public roadmap: https://github.com/github/roadmap/issues/94

Unfortunately it's marked as "Future," so it's still a ways out.


Are the regexes behind the GitHub secret scanning open source? It would be great to check my code that isn't on GitHub.


As a non-Python person:

Is it an easy mistake to make, for someone to inadvertently commit and push a "secret PyPI token"?


I can certainly imagine putting a token into a deploy script in the same directory as a python package's repo. From there, it's a typo away from getting added and committed to the repo. So, it's better to keep those tokens elsewhere.
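
e.g. have the deploy script pull the token from the environment (or leave it to ~/.pypirc or a keyring) rather than hardcoding it. Rough sketch, with PYPI_TOKEN being whatever env var name you pick:

    # deploy.py -- toy example: never embed the token next to code that gets committed.
    import glob
    import os
    import subprocess
    import sys

    token = os.environ.get("PYPI_TOKEN")
    if not token:
        sys.exit("Set PYPI_TOKEN in the environment before deploying.")

    # Twine reads credentials from TWINE_USERNAME / TWINE_PASSWORD.
    subprocess.run(
        ["twine", "upload", *glob.glob("dist/*")],
        env={**os.environ, "TWINE_USERNAME": "__token__", "TWINE_PASSWORD": token},
        check=True,
    )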


Isn't it totally verboten to put secret tokens / passwords into scripts? Regardless of language?

When I write, say, bash scripts which do work using ssh, I don't specify a password: The user running the script will provide their own manually, or use ssh-copy-id, or edit the authorized_keys file on the target machine if they want to save themselves some typing. That is - authentication is decoupled from my script's actual work. Why is that not how things work with PyPI?


> Isn't it totally verboten to put secret tokens / passwords into scripts?

No, it's not “totally verboten” (forbidden by whom?), and people do it all the time. Mostly, perhaps, for stuff they aren't planning to share, but plans change.

It doesn't help that lots of example code embeds placeholders for secrets directly (with notes to replace with your actual credentials), so lots of stuff gets embedded in the course of copy-and-paste coding.


It is. But even if it is strongly discouraged, some people will commit them anyway. Look at any beginner's repository: there is a high chance it contains files compiled from the repo's source (executables, .pyc, ...), the developer's IDE config (.vscode, ...), __MACOSX, ...


> Isn't it totally verboten to put secret tokens / passwords into scripts?

It's only a rule because people have made the mistake enough to learn the lesson...


If you are trying to publish your package for other people to download through the `pip` package manager, then yeah.

Most Python devs will probably never publish to PyPI, but this can save some headaches for those who do, especially for the first time.


Secrets in general leak into source code all the time, nothing specific about PyPI.


I think not. The standard tools read the token from ~/.pypirc (or prompt on the console if it's absent). Inadvertent commits of the token probably only happen if you have a custom script with a hardcoded token.


> to help keep their customers safe

The elimination of a distinction between “safety” and “security” is unhealthy imo, as it leads to a failure to distinguish between unintentional harm caused by nature, and intentional harm caused by other people.

E.g. “safety first” is only intelligible if it doesn’t also prevent you from trusting anyone (which is what would be implied by “security first” as a general priority).


I think you misunderstand what “X first” normally means: it means “X is most heavily weighted”, not “the smallest consideration in ___domain X outweighs all other considerations in all other domains”.


Do you lock your doors?


Sometimes. But I can’t say that I have a “security first” mindset, which seems analogous to “trust no one”.


Great news!

IMHO, Github should make it mandatory for integrated services to provide this feature.


Would love to see tighter integration with GitHub Secrets / Actions for publishing.


Not sure if this is what you're asking for, but the PyPA does maintain a GitHub Action for publishing to PyPI as well: https://github.com/pypa/gh-action-pypi-publish


good one


This is some epic-level brand building in action. Pretty soon, people just entering our industry will mistakenly believe that GitHub's ownership (Microsoft) wants open source to exist and thrive.





