The Python Package Index is now a GitHub secret scanning integrator (github.blog)
372 points by rbanffy on March 24, 2021 | 114 comments



These secret scanning integrations have been very helpful. We recently had a client ask us to take open source a project that had started a few years ago as closed source. We of course checked over the current version of the code, and we have had linters in place to look for secrets for a while, but not in the very early days of the project. In that one codebase we had:

- An AWS IAM token for S3 upload access to a throwaway dev bucket. The bucket had already been deleted, but still... Got an email about it informing me the IAM token had been revoked by AWS within 5 minutes.

- A Slack webhook notification URL/secret. Committed as an example on a working branch and then git rm'ed, but still active. Got an email about it and the token was revoked by Slack automatically within 5 minutes.

- A Mapbox API token. This one was funny. The token was indeed in there and functional but was in the docs/sample code for a dependency. Still, we got an email within the hour about it and were able to investigate.

Edit: In this case we intentionally kept the commit history. A safer alternative (and one we normally practice) is to start a fresh repo for the open source variant.


When I helped to take Zulip open-source in 2015, I wrote a simple script that scrubbed secrets from the commit history using git fast-export and git fast-import. We replaced all our secrets with xxxxxxx placeholders, replaced internal customer references with dummy names, deleted and renamed certain files, and even did some code replacements that caused certain commit diffs to become empty so those commits could be removed from the history.

https://github.com/zulip/zulip/blob/3.3/tools/zanitizer

https://github.com/zulip/zulip/blob/3.3/tools/zanitizer_conf...

The script was really fast (all ~10000 commits in a few minutes), which allowed us to iterate quickly on its configuration as we audited using gitk and other tools for remaining items to scrub.

Doing this work allowed us to release with an essentially complete history going back to the first commit in 2012, which has been a really valuable resource for understanding why various Zulip subsystems were written the way they were.

Nowadays there are other tools for scrubbing history that might be more polished, like BFG: https://rtyley.github.io/bfg-repo-cleaner/
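
The general shape, if anyone wants to roll their own, is a small filter spliced between git fast-export and git fast-import. A minimal sketch (not the actual zanitizer, and the replacement table here is invented):

    #!/usr/bin/env python3
    # scrub.py -- usage: git fast-export --all | ./scrub.py | git fast-import
    # Rewrites blob (and commit message) contents and fixes up the declared lengths.
    import sys

    REPLACEMENTS = {
        b"hooks.slack.com/services/T000/B000/SECRET": b"xxxxxxx",  # made-up examples
        b"AKIAFAKEFAKEFAKEFAKE": b"xxxxxxx",
    }

    def scrub(payload: bytes) -> bytes:
        for old, new in REPLACEMENTS.items():
            payload = payload.replace(old, new)
        return payload

    inp, out = sys.stdin.buffer, sys.stdout.buffer
    while True:
        line = inp.readline()
        if not line:
            break
        if line.startswith(b"data "):              # raw payload follows this header
            payload = scrub(inp.read(int(line.split()[1])))
            out.write(b"data %d\n" % len(payload)) # re-declare the (possibly new) length
            out.write(payload)
        else:
            out.write(line)

The re-declared length is the fiddly bit: a naive search-and-replace over the raw stream breaks as soon as a replacement changes a blob's size.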


Nice tooling. I've used bfg when we knew what patterns to look for. This project didn't generally access private data and had a reasonably well behaved team for most of its life (the pre-linter & code-review commits were my own damn fault). Since it was low risk, I just did a few manual `git log -S ...` searches and moved on. I was still very happy to have GitHub catch my throwaway credentials and remind me in the most obvious way that these things go in `ENV` and not IN code, even in examples!


> A safer alternative (and one we normally practice) is to start a fresh repo for the open source variant.

Note that it's also possible to go back and rewrite history (e.g. if you know what the tokens are and where/when they were committed), to preserve Git history while cleaning out tokens. It can be mildly slow or complicated, but there are tools to automate it, such as BFG Repo Cleaner[0] which is relatively easy to use (once you learn it).

There are other awesome rewriting tools, like git filter-repo[1], but that operates solely on the structure of the repository (i.e. it can manipulate basically anything except file contents). Great for removing unwanted files or directories extremely fast, but not good for removing tokens (unless you want to remove the entire file the token was in).

    [0] https://rtyley.github.io/bfg-repo-cleaner/
    [1] https://github.com/newren/git-filter-repo


Learning so many options from this thread. I've used these tools when I knew what to look for, but that's been the tricky bit.

psanford also mentioned truffleHog and others, and lstamour mentioned https://github.com/cloud-gov/caulking which is built on gitleaks and looks good. caulking's customized list of patterns for gitleaks is here: https://github.com/cloud-gov/caulking/blob/master/local.toml It looks like it would have found the keys in my example case, no problem.


I've found that git filter-repo actually can modify file contents, using --replace-text and a file containing replacements.


An overlooked vector is old commits. It’s often better to squash all commits before taking a project open source, which is a real shame for obvious reasons.

Commit histories can spill a lot of secrets that are easy to overlook.


There are tools available to help look for this sort of thing (for both you and any potential attackers). TruffleHog[1] is the first one that comes to mind for me.

I also like shhgit[2] for looking for secrets in repositories. (I don't think shhgit will look back in the git history for you though).

[1]: https://github.com/dxa4481/truffleHog

[2]: https://github.com/eth0izzle/shhgit


Another idea is to use a git commit hook, such as https://github.com/cloud-gov/caulking


Thanks! I knew they existed but hadn't investigated for one that would look over past history. Will try out truffleHog.


I've found the entropy detection in trufflehog to be pretty noisy. When I've run it I generally disable that.


Absolutely this!

Same problem here with inner source that is going open source.

I feel sorry for all our internal committers; however, I know of "secrets" that went into the commit history. We are still considering our options, but we tend to opt for deleting our commit history entirely and building a wall of fame for the former committers.


My current fear is versioned backup systems. KeePass files may have secure master keys now, but maybe the version saved 18 months ago did not.

1. Get an old copy. 2. Run a dictionary attack. 3. Prosper.


FYI pypi tokens look like pypi-9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9

The integration means that GitHub knows to recognize this format, and calls some API of pypi.org when it finds one so PyPI can revoke it.

As always, please allow me to lament that we don't have a standard for this, such as secret-token:pypi.org/9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9, which would let any system know that this string is a secret and that pypi.org should be notified (for example via POST pypi.org/.well-known/compromised-secret). See also https://news.ycombinator.com/item?id=25978185
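
For what it's worth, a scanner could then handle every issuer in a dozen lines, with no registry of formats. Purely hypothetical sketch of that scheme (the URI shape and the .well-known path are part of the proposal, not anything that exists today):

    import re
    import urllib.request

    # Hypothetical self-describing format: secret-token:<issuer-___domain>/<token>
    SECRET_TOKEN_RE = re.compile(r"secret-token:([a-z0-9.-]+)/([A-Za-z0-9._~-]+)")

    def report_leaks(text):
        for issuer, token in SECRET_TOKEN_RE.findall(text):
            # Notify the issuer via the hypothetical well-known endpoint.
            req = urllib.request.Request(
                f"https://{issuer}/.well-known/compromised-secret",
                data=token.encode(),
                method="POST",
            )
            urllib.request.urlopen(req)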


Hey there! I designed and implemented PyPI's tokens (although not the secret scanning integration).

They're actually just macaroons[1] internally, which means that they could easily be upgraded at some point to include a reporting URL like you mention.

Just as a tidbit: they were originally prefixed with "pypi:" rather than "pypi-", but that colon caused problems for a few packaging utilities. Any sort of in-band signaling like that is unlikely to gain widespread adoption for exactly that reason :-)

[1]: https://en.wikipedia.org/wiki/Macaroons_(computer_science)
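
If anyone wants to play with the idea, the general flavour with the pymacaroons library looks roughly like this (a toy sketch, not Warehouse's actual token code, and the caveat strings are made up):

    from pymacaroons import Macaroon, Verifier

    # Mint a token whose permissions ride along inside it as caveats.
    m = Macaroon(___location="pypi.org", identifier="key-id-1", key="server-side-secret")
    m.add_first_party_caveat("project: sampleproject")   # invented caveat format
    token = "pypi-" + m.serialize()

    # Later, the server verifies the signature and enforces the caveats.
    v = Verifier()
    v.satisfy_exact("project: sampleproject")
    v.verify(Macaroon.deserialize(token[len("pypi-"):]), "server-side-secret")

A reporting URL could ride along the same way, as just another caveat.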


Interesting. I can get the "pypi.org" ___domain from the base64-encoded part, but I don't see anything about revocation in the paper.

Your reporting endpoint seems protected by a secret key that GitHub holds. Any reason PyPI can't accept anonymous submission of compromised tokens? If I find a PyPI token on my own server, can I not post it to https://pypi.org/_/github/disclose-token without getting a key from you first?


> I don't see anything about revocation in the paper.

I don't believe it's something standardized or considered by the original whitepaper. Macaroons have the ability to contain arbitrary data, however, so it wouldn't be difficult to add revocation information to them.

> If I find a PyPI token on my own server, can I not post it to https://pypi.org/_/github/disclose-token without getting a key from you first?

I wasn't part of the design, but my first thought goes to preventing the endpoint's use as an oracle: after a compromise, a malicious agent might find it useful to have an unlimited endpoint to test their stolen credentials against. Restricting use to a limited set of trusted entities avoids that.


I don't think allowing revocation of a token by any bearer of the token is much of a security issue. Consider a real world example, if one finds a credit card someone dropped on the street it can be reported as lost and revoked by the issuer even though the reporter is not the owner.

As for the endpoint being an oracle, it doesn't really need to tell the reporting client anything other than that the revocation request has been received.


> I don't think allowing revocation of a token by any bearer of the token is much of a security issue. Consider a real world example, if one finds a credit card someone dropped on the street it can be reported as lost and revoked by the issuer even though the reporter is not the owner.

Whether or not it's a security issue depends on how the token is being used. Allowing potentially arbitrary parties to revoke tokens right before, say, a critical security release feels like a potential issue to me. Then again, I suppose they could do that by proxy by just publishing it on GitHub and letting the secret scanner do the work.

Long story short: I'm idly speculating. For all I know, they did it because allowing arbitrary parties to report leaked secrets would result in unacceptably high FP rates. I wasn't privy to the decision.


> Allowing potentially arbitrary parties to revoke tokens right before, say, a critical security release feels like a potential issue to me

If the third-party has the token, they can make releases *adding* critical security issues.


Doesn't every other endpoint work as the oracle you describe? Are you worried about rate-limiting specifically?

Also, the endpoint sends a 204 with no information about the validity of tokens, making it not much of an oracle. I think the payload is processed in the background too, preventing timing attacks.


> Doesn't every other endpoint work as the oracle you describe? Are you worried about rate-limiting specifically?

Rate-limiting was just the easy example. Other endpoints are subject to additional constraints: tokens don't directly carry their user information (IIRC), so someone with a collection of stolen tokens may not know which projects they can control. Similarly, tokens are scoped, so "create a new project" isn't something an arbitrary token can necessarily be used for to gain more information about its rightful owner.

Like I said, I don't know too much about the actual design decisions for that endpoint! That was an educated guess, based on what I might have done.


According to the documentation (https://docs.github.com/en/developers/overview/secret-scanni...), secret issuers specify a regex that can detect secrets they've issued. "Be as precise as possible, because this will reduce the number of false positives" - that's the guideline from GitHub. Github runs the regex on every commit that is uploaded and informs the secret provider when a match occurs.
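
In other words, the scanning half is conceptually just this (the pattern is a guess based on the token shown upthread, not the real regex):

    import re

    # Assumed shape: "pypi-" followed by a long base64-ish blob.
    PYPI_TOKEN_RE = re.compile(r"pypi-[A-Za-z0-9_-]{40,}")

    def candidate_tokens(commit_text):
        """Return suspected tokens; the issuer decides whether they're live."""
        return PYPI_TOKEN_RE.findall(commit_text)

    print(candidate_tokens('PASSWORD = "pypi-' + "A" * 60 + '"'))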


I see that they document the alerting endpoint there. The only piece missing is building the URL from the token format. I hope we get there someday, and everyone can deploy this without having to replicate GitHub's registry of token formats.

This page also mentions that they "strongly recommend you implement signature validation in your secret alert service", but I'm not sure why. Isn't the fact that they send valid tokens proof that they have really found a leak?


So, you can submit an overly generous (or specifically crafted) regex to get notified of tokens that someone else issued if you know their format?


I wonder if false-positives often result in GitHub sending secrets to the wrong service.


I wonder if any of those services have a combination of bad regexes and bad validation and could be SQL injected by committing a malicious faux-token to GitHub.


One cool data format standard I only recently learned about is multihash[1] - a self-describing hash format: the first byte represents the hashing algorithm, the second byte represents the length of the hash, and the subsequent [length] bytes are the actual hash.

Something similar for tokens would be really useful.

[1] https://multiformats.io/multihash/
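
Decoding one is close to trivial. Simplified sketch (the real spec encodes both the code and the length as varints, so they aren't strictly single bytes):

    def parse_multihash(data: bytes):
        """Split a multihash-style value into (algorithm code, digest).

        Simplified: assumes the code and length each fit in one byte.
        """
        code, length = data[0], data[1]
        digest = data[2:2 + length]
        if len(digest) != length:
            raise ValueError("truncated multihash")
        return code, digest

    # 0x12 is sha2-256 in the multihash table, 0x20 = 32-byte digest.
    code, digest = parse_multihash(bytes([0x12, 0x20]) + bytes(32))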


Until someone stores a secret without the prefix - because it's always the same, right?


As long as the API wrappers don't mess this up, this has no reason to happen.


The headline sounds insidious (How dare PyPI and GitHub secretly scan me! I'm glad someone has revealed this dastardly collusion!) but it turns out they're actually doing something great.


Naming things is the hardest thing to do in computer science.


That, and cache invalidation.


That, and off-by-one errors


There are actually only two hard problems in computer science:

0) Cache invalidation

1) Naming things

5) Asynchronous callbacks

2) Off-by-one errors

3) Scope creep

6) Bounds checking


7) Project estimation


-1) Keeping secrets


Luckily, building better garbage collectors is easy: keep a reference count of the pointers to each cons.


Ha ha! I get the reference:

http://people.cs.uchicago.edu/~wiseman/humor/ai-koans.html

Moon instructs a student

One day a student came to Moon and said: “I understand how to make a better garbage collector. We must keep a reference count of the pointers to each cons.”

Moon patiently told the student the following story:

“One day a student came to Moon and said: ‘I understand how to make a better garbage collector...

[Ed. note: Pure reference-count garbage collectors have problems with circular structures that point to themselves.]


old heads and new alike ~~grok~~ vibe




4294967295) Integer underflows


NaN) Javascript


7) February 29th.


7) Timezones

FTFY


7.0000001) leap seconds


9000) communicating


I thought it was the second hardest. At least that's what I remember, since I last checked.


We see what you did there.


Yes brother I agree!


@dang, in re: this comment, any hopes of editing the title to say "secret-scanning" with a hyphen? Might add some clarity.


I was seriously impressed when a few days ago I accidentally pushed my secret Discord bot token to Github and literally one second later I received a Discord message and an email letting me know that I leaked my token and that they deactivated it.


The API keys I’ve used (admittedly not many) all seem to be long random text strings. How does GitHub detect them? By them being used (i.e. in API code), or do they actually have a known format?


GitHub documents the process over at https://docs.github.com/en/developers/overview/secret-scanni.... You specify a regex, and you check if the secret is valid on your end.
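
The provider's half is basically a small HTTPS endpoint GitHub can POST matches to. A rough sketch of what an issuer might run (the field name and the in-memory "token store" are illustrative only; the linked docs describe the real payload and the signature check you're supposed to do first):

    # Toy issuer-side alert receiver, not PyPI's actual implementation.
    from flask import Flask, request

    app = Flask(__name__)
    ISSUED_TOKENS = {"pypi-example-not-a-real-token"}   # stand-in for a real store

    @app.route("/_/github/disclose-token", methods=["POST"])
    def disclose_token():
        # A real service should verify GitHub's request signature before this.
        for match in request.get_json():
            token = match.get("token", "")       # illustrative field name
            if token in ISSUED_TOKENS:           # only act on tokens we issued
                ISSUED_TOKENS.discard(token)     # i.e. revoke it and email the owner
        return "", 204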


There must be an astounding number of false positives for common patterns like an N-length string of base64 chars. Could someone upload a malicious file with millions of matching strings and watch GitHub DDoS a company's verification endpoint?


I imagine the scanning would be rate-limited on per-repo basis.


Probably also a max false positive rate; this isn't a guarantee, just a service, so if it detects X false positives it could just exclude the repo entirely as problematic.


Yeah, that would be reasonable.


"Now you have 2 problems."


This is a difficult problem indeed, but thankfully it is just as difficult for the malicious actors as it is for the "good guys". Since various bad guys have presumably been scanning public repos for years already, GitHub and PyPA adding this feature is leveling the playing field, even if it is not a 100% accurate search algorithm.


Not sure how these particular scanners do it, but during security assessments you sometimes use tools that will find all strings in an application package with high entropy.

Usually it's junk, but occasionally you do get lucky and find tokens.
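
The usual measure is Shannon entropy over each candidate string, roughly:

    import math
    from collections import Counter

    def shannon_entropy(s):
        """Bits per character; random base64 sits near 6, English far lower."""
        counts = Counter(s)
        return -sum(n / len(s) * math.log2(n / len(s)) for n in counts.values())

    print(shannon_entropy("get_user_profile"))                  # ~3.5, ordinary identifier
    print(shannon_entropy("9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1"))  # ~4.6, likely random

Hence the noise: hashes, UUIDs, and minified assets score just as high as real keys.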


PyPI API keys have a known format, they start with "pypi-".


Wow, today I learned this acronym: PyPI -> Python Package Index, after using Python for over a decade. Thanks!


Just don't confuse it with PyPy, which is entirely different...


And don't pronounce PyPI as pie-pie. It's pie-P-I.


Ah. The fat detective.


it was much easier when it was just called the cheese shop


That's why you pronounce it "Cheese Shop"


Pronounced pie-pee-aye, and not pee-pee, pie-pee or any of the other ways I heard it pronounced at work :)


Right, I used to pronounce it as pie-pie. Might continue to do so but at least I know what it stands for :D


I call it pie-pie because that makes the most sense and sounds the least weird.


Based on my workplace, I'm pretty sure it's "pee-pee". Just like 'Qt' is "cue-tee".

There's no winning these battles..


> Just like 'Qt' is "cue-tee".

How else would you want to pronounce it?


According to them, it's just "cute"


Can someone explain what exactly this means?


> From today, GitHub will scan every commit to a public repository for exposed PyPI API tokens. We will forward any tokens we find to PyPI, who will automatically disable them and notify their owners.


If you commit your AWS secrets/tokens, or similar, inside a Python script, it will now be discovered by GitHub automatically.

They have integrations with a bunch of services to recognize the tokens and disable them. This means malicious users can't copy/paste them, spin up servers, and leave you with a big bill. (Ideally, anyway; of course it could still happen, but the aim is to prevent that kind of thing.)


Though this has been true for a while, it's not what this announcement is about. This is specifically announcing automated scanning and reporting of PyPI keys, which, if exposed, could allow a bad actor to distribute compromised Python packages via PyPI (i.e. pip).


And this is a potentially huge security issue. Think about all the systems and software that rely on Python packages.


It should reduce the possibility of PyPI packages being taken over as a result of their owners being careless with their PyPI credentials.

I think it's good: the risk of a package being taken over is low, but it's very damaging if it occurs in a widely used package.


If you accidentally commit your PyPI private token to git and push it to GitHub, PyPI will detect this and disable the token within seconds (because there are absolutely bots who will try to find it and abuse it).


I presume it means that if someone accidentally pushes a token to a public GitHub repo, then it can't be used to hijack the PyPI packages corresponding to that token and make them malicious.


Is there some best practice on creating a format for secret keys? If I create an API with secret keys, should I make them something like z77dj3kl-secret-pk-[secret-stuff]?

Is there an argument (security by obscurity?) that that just makes the key easier to spot and abuse?

Or would it be better to encode it in the secret bits somehow, e.g. add 16 control bits that have known values?
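
One pattern I've seen (illustrative sketch only, not any particular provider's scheme): a recognisable prefix plus a short checksum, so scanners can match the prefix with a regex and discard most false positives offline before bothering the issuer:

    import secrets
    import zlib

    PREFIX = "acme_sk_"   # hypothetical issuer prefix

    def new_token():
        body = secrets.token_urlsafe(24)
        checksum = format(zlib.crc32(body.encode()), "08x")
        return PREFIX + body + checksum

    def looks_like_ours(s):
        """Cheap offline check: right prefix and the checksum matches."""
        if not s.startswith(PREFIX) or len(s) <= len(PREFIX) + 8:
            return False
        body, checksum = s[len(PREFIX):-8], s[-8:]
        return format(zlib.crc32(body.encode()), "08x") == checksum

    assert looks_like_ours(new_token())

Arguably the prefix also makes a leaked key easier for an attacker to spot, but attackers already grep for high-entropy strings, so the defenders probably gain more from it than the attackers do.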


FWIW, there's a new RFC specifying a URI scheme for this: https://tools.ietf.org/html/rfc8959


It would be nice instead if the git command prevented you from committing a file with a token in it.
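
You can get most of the way there today with a pre-commit hook. Rough sketch (.git/hooks/pre-commit can be any executable; the patterns here are just examples, and tools like gitleaks/caulking mentioned upthread do this far more thoroughly):

    #!/usr/bin/env python3
    # .git/hooks/pre-commit -- refuse to commit if the staged diff adds a likely token.
    import re
    import subprocess
    import sys

    TOKEN_PATTERNS = [
        re.compile(r"pypi-[A-Za-z0-9_-]{40,}"),   # example pattern
        re.compile(r"AKIA[0-9A-Z]{16}"),          # AWS access key id
    ]

    staged = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in staged.splitlines():
        if line.startswith("+") and any(p.search(line) for p in TOKEN_PATTERNS):
            sys.exit("pre-commit: staged changes look like they contain a secret")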


Haven't seen it mentioned here, and it's not specifically mentioned in the docs where they explain secret scanning: GitHub also does secret scanning on all public gists. Seeing as how every gist is just a repository under the hood, it makes sense.

I've seen devs share a snippet of code with an AWS Access Key ID/Secret in it using gists, and we immediately got a notice from Amazon about that key being compromised.


This makes me wonder if GitHub should do basic code sanity checks on every repo. Things like checking for division by zero, infinite loops, etc. They'd have to be very conservative checks so as not to trigger false positives. But if there is benefit in secret scanning for all public repos, there must be benefit in detecting other types of programmer mistakes.


They acquired LGTM (https://github.com/marketplace/lgtm) not too long ago, so expect this to happen.


Do any APIs standardise on a simple secret key pattern that can be easily identified as a secret? For example, all secrets have a "secret-" prefix? Or is this idea unworkable?

I usually try and prefix e.g. fields in config files with "secret" to make it obvious they shouldn't be committed.


There was a discussion a while ago about IETF RFC 8959 which proposes a secret-token URI that might be of interest: https://news.ycombinator.com/item?id=25978185


They've got a decent list of partner companies, which you can find over here:

https://docs.github.com/en/code-security/secret-security/abo...

Glad they've got our backs.


In case anyone is interested, it looks like this is the implementation on the PyPI side: https://github.com/pypa/warehouse/pull/8563


    > Fixes #6051
    > See #7124 reverted in #8555 due to #8554 which is addressed in #8562 (pfew...)
    > Should not be merged before #8562: EDIT: 
    > 
    > Re-revert of the code. The bug that caused revert was splitted into #8562
Software development in a nutshell, everyone.


This is great. Hopefully we will get GitHub Packages support for Python soon. https://github.com/features/packages


It's on their public roadmap: https://github.com/github/roadmap/issues/94

Unfortunately it's marked as "Future," so it's still a ways out.


Are the regexes behind the GitHub secret scanning open source? It would be great to check my code that isn't on GitHub.


As a non-Python person:

Is it an easy mistake to make, for someone to inadvertently commit and push a "secret PyPI token"?


I can certainly imagine putting a token into a deploy script in the same directory as a python package's repo. From there, it's a typo away from getting added and committed to the repo. So, it's better to keep those tokens elsewhere.
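
e.g. have the deploy script pull the token from the environment (or leave it to ~/.pypirc or a keyring) rather than hardcoding it. Rough sketch, with PYPI_TOKEN being whatever env var name you pick:

    # deploy.py -- toy example: never embed the token next to code that gets committed.
    import glob
    import os
    import subprocess
    import sys

    token = os.environ.get("PYPI_TOKEN")
    if not token:
        sys.exit("Set PYPI_TOKEN in the environment before deploying.")

    # Twine reads credentials from TWINE_USERNAME / TWINE_PASSWORD.
    subprocess.run(
        ["twine", "upload", *glob.glob("dist/*")],
        env={**os.environ, "TWINE_USERNAME": "__token__", "TWINE_PASSWORD": token},
        check=True,
    )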


Isn't it totally verboten to put secret tokens / passwords into scripts? Regardless of language?

When I write, say, bash scripts which do work using ssh, I don't specify a password: The user running the script will provide their own manually, or use ssh-copy-id, or edit the authorized_keys file on the target machine if they want to save themselves some typing. That is - authentication is decoupled from my script's actual work. Why is that not how things work with PyPI?


> Isn't it totally verboten to put secret tokens / passwords into scripts?

No, it's not “totally verboten” (forbidden by whom?), and people do it all the time. Mostly, perhaps, for stuff they aren't planning to share, but plans change.

It doesn't help that lots of example code embeds placeholders for secrets directly (with notes to replace with your actual credentials), so lots of stuff gets embedded in the course of copy-and-paste coding.


It is. But even if it is strongly discouraged, some people will commit them anyway. Look at any beginner's repository: there is a high chance it contains files compiled from the repo's source (executables, .pyc, ...), the developer's IDE config (.vscode, ...), __MACOSX, ...


> Isn't it totally verboten to put secret tokens / passwords into scripts?

It's only a rule because people have made the mistake enough to learn the lesson...


If you are trying to publish your package for other people to download through the `pip` package manager, then yeah.

Most Python devs will probably never publish to PyPI, but this can save some headaches for those who do, especially for the first time.


Secrets in general leak into source code all the time, nothing specific about PyPI.


I think not. The standard tools read the token from ~/.pypirc (or prompt on the console if it's absent). Inadvertent commits of the token probably only happen if you have a custom script with a hardcoded token.


> to help keep their customers safe

The elimination of a distinction between “safety” and “security” is unhealthy imo, as it leads to a failure to distinguish between unintentional harm caused by nature, and intentional harm caused by other people.

E.g. “safety first” is only intelligible if it doesn’t also prevent you from trusting anyone (which is what would be implied by “security first” as a general priority).


I think you misunderstand what “X first” normally means: it means “X is most heavily weighted”, not “the smallest consideration in ___domain X outweighs all other considerations in all other domains”.


Do you lock your doors?


Sometimes. But I can’t say that I have a “security first” mindset, which seems analogous to “trust no one”.


Great news!

IMHO, Github should make it mandatory for integrated services to provide this feature.


Would love to see tighter integration with GitHub Secrets / Actions for publishing.


Not sure if this is what you're asking for, but the PyPA does maintain a GitHub Action for publishing to PyPI as well: https://github.com/pypa/gh-action-pypi-publish


good one


This is some epic-level brand building in action. Pretty soon, people just entering our industry will mistakenly believe that GitHub's ownership (Microsoft) wants open source to exist and thrive.





