FYI pypi tokens look like pypi-9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9 The...

woodruffw · on March 24, 2021

Hey there! I designed and implemented PyPI's tokens (although not the secret scanning integration).

They're actually just macaroons[1] internally, which means that they could easily be upgraded at some point to include a reporting URL like you mention.

Just as a tidbit: they were originally prefixed with "pypi:" rather than "pypi-", but that colon caused problems for a few packaging utilities. Any sort of in-band signaling like that is unlikely to gain widespread adoption for exactly that reason :-)

[1]: https://en.wikipedia.org/wiki/Macaroons_(computer_science)

remram · on March 24, 2021

Interesting. I can get the "pypi.org" domain from the base64-encoded part, however I don't see anything about revocation in the paper.
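For readers who want to try the same inspection, here is a minimal sketch. It assumes the part after the `pypi-` prefix is URL-safe base64 with padding stripped. The token below is a synthetic stand-in built inside the example (not a real PyPI token), and its payload layout is made up; only the strip-and-decode step is the point.

```python
import base64

PREFIX = "pypi-"

def decode_token_body(token: str) -> bytes:
    """Strip the prefix and base64-decode the rest (URL-safe alphabet assumed)."""
    if not token.startswith(PREFIX):
        raise ValueError("not a pypi- prefixed token")
    body = token[len(PREFIX):]
    # Restore any padding that was stripped from the encoded form.
    body += "=" * (-len(body) % 4)
    return base64.urlsafe_b64decode(body)

# Synthetic payload for illustration only; a real macaroon has its own binary layout.
fake_payload = b"\x02\x01\x08pypi.org\x02\x04demo"
fake_token = PREFIX + base64.urlsafe_b64encode(fake_payload).decode().rstrip("=")

raw = decode_token_body(fake_token)
print(b"pypi.org" in raw)
```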

Your reporting endpoint seems protected by a secret key that GitHub holds. Any reason PyPI can't accept anonymous submission of compromised tokens? If I find a PyPI token on my own server, can I not post it to https://pypi.org/_/github/disclose-token without getting a key from you first?

woodruffw · on March 24, 2021

> I don't see anything about revocation in the paper.

I don't believe it's something standardized or considered by the original whitepaper. Macaroons have the ability to contain arbitrary data, however, so it wouldn't be difficult to add revocation information to them.

> If I find a PyPI token on my own server, can I not post it to https://pypi.org/_/github/disclose-token without getting a key from you first?

I wasn't part of the design, but my first thought goes to preventing the endpoint's use as an oracle: after a compromise, a malicious agent might find it useful to have an unlimited endpoint to test their stolen credentials against. Restricting use to a limited set of trusted entities avoids that.

devman0 · on March 25, 2021

I don't think allowing revocation of a token by any bearer of the token is much of a security issue. Consider a real-world example: if one finds a credit card someone dropped on the street, it can be reported as lost and revoked by the issuer even though the reporter is not the owner.

As for the endpoint being an oracle, the endpoint doesn't really need to respond to the reporting client with anything other than an acknowledgement that the revocation request has been received.

woodruffw · on March 25, 2021

> I don't think allowing revocation of a token by any bearer of the token is much of a security issue. Consider a real-world example: if one finds a credit card someone dropped on the street, it can be reported as lost and revoked by the issuer even though the reporter is not the owner.

Whether or not it's a security issue depends on how the token is being used. Allowing potentially arbitrary parties to revoke tokens right before, say, a critical security release feels like a potential issue to me. Then again, I suppose they could do that by proxy by just publishing it on GitHub and letting the secret scanner do the work.

Long story short: I'm idly speculating. For all I know, they did it because allowing arbitrary parties to report leaked secrets would result in unacceptably high FP rates. I wasn't privy to the decision.

remram · on March 25, 2021

> Allowing potentially arbitrary parties to revoke tokens right before, say, a critical security release feels like a potential issue to me

If the third party has the token, they can make releases *adding* critical security issues.

remram · on March 25, 2021

Doesn't every other endpoint work as the oracle you describe? Are you worried about rate-limiting specifically?

Also, the endpoint sends a 204 with no information about the validity of tokens, making it not much of an oracle. I think the payload is processed in the background too, preventing timing attacks.
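A minimal sketch of that pattern (not PyPI's actual code): the handler always returns 204 and defers validation to a background step, so neither the response body nor its timing depends on whether the token was valid. The names `handle_disclosure`, `process_reports`, `is_valid`, and `revoke` are hypothetical.

```python
import queue

# Hypothetical in-process work queue; a real service would likely use a task broker.
pending_reports: "queue.Queue[str]" = queue.Queue()

def handle_disclosure(reported_token: str) -> int:
    """Accept a report and return the same status regardless of token validity."""
    pending_reports.put(reported_token)
    return 204  # No Content: nothing for the reporter to learn from

def process_reports(is_valid, revoke) -> int:
    """Drain the queue out of band; `is_valid` and `revoke` are caller-supplied."""
    revoked = 0
    while not pending_reports.empty():
        token = pending_reports.get()
        if is_valid(token):
            revoke(token)
            revoked += 1
    return revoked
```

The reporter sees an identical response for a live token and for garbage, which is what keeps the endpoint from being much of an oracle.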

woodruffw · on March 25, 2021

> Doesn't every other endpoint work as the oracle you describe? Are you worried about rate-limiting specifically?

Rate-limiting was just the easy example. Other endpoints are subject to additional constraints: tokens don't directly carry their user information (IIRC), so someone with a collection of stolen tokens may not know which projects they can control. Similarly, tokens are scoped, so "create a new project" isn't something that an arbitrary token can necessarily do to gain more information about its rightful owner.

Like I said, I don't know too much about the actual design decisions for that endpoint! That was an educated guess, based on what I might have done.

nindalf · on March 24, 2021

According to the documentation (https://docs.github.com/en/developers/overview/secret-scanni...), secret issuers specify a regex that can detect secrets they've issued. "Be as precise as possible, because this will reduce the number of false positives" - that's the guideline from GitHub. GitHub runs the regex on every commit that is uploaded and informs the secret provider when a match occurs.
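As a rough illustration of what such a regex might look like for a prefixed token format, here is a sketch. The pattern is hypothetical, not the one PyPI actually registered with GitHub; the character class and the minimum length of 32 are assumptions.

```python
import re

# Hypothetical pattern: the "pypi-" prefix followed by a long run of
# URL-safe base64 characters. The minimum length of 32 is an assumption.
TOKEN_RE = re.compile(r"\bpypi-[A-Za-z0-9_-]{32,}")

def find_candidate_tokens(text: str) -> list[str]:
    """Return every substring of `text` that looks like a prefixed token."""
    return TOKEN_RE.findall(text)

sample = 'password = "pypi-' + "A" * 40 + '"'
print(find_candidate_tokens(sample))
```

A fixed prefix plus a length floor is what keeps the pattern precise: short strings such as `pypi-short` are not matched.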

remram · on March 24, 2021

I see that they document the alerting endpoint there. The only piece missing is building the URL from the token format. I hope we get there someday, and everyone can deploy this without having to replicate GitHub's registry of token formats.

This page also mentions that they "strongly recommend you implement signature validation in your secret alert service", but I'm not sure why. Isn't the fact that they send valid tokens proof that they have really found a leak?

dragonwriter · on March 25, 2021

So, you can submit an overly generous (or specifically crafted) regex to get notified of tokens that someone else issued if you know their format?

kevincox · on March 24, 2021

I wonder if false-positives often result in GitHub sending secrets to the wrong service.

danudey · on March 24, 2021

I wonder if any of those services have a combination of bad regexes and bad validation and could be SQL injected by committing a malicious faux-token to GitHub.

l0b0 · on March 24, 2021

One cool data format standard I only recently learned about is multihash[1] - a self-describing hash format: the first byte represents the hashing algorithm, the second byte represents the length of the hash, and the subsequent [length] bytes are the actual hash.

Something similar for tokens would be really useful.

[1] https://multiformats.io/multihash/
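A minimal sketch of the layout described above, using sha2-256 (function code 0x12 in the multihash table). The helper names are hypothetical. The spec encodes both header fields as varints; treating each as a single byte, as here, only holds for values below 128.

```python
import hashlib

SHA2_256_CODE = 0x12  # multihash function code for sha2-256

def encode_multihash(data: bytes) -> bytes:
    """Prefix a sha2-256 digest with its algorithm code and length."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256_CODE, len(digest)]) + digest

def decode_multihash(mh: bytes) -> tuple[int, bytes]:
    """Split a single-byte-header multihash into (algorithm code, digest)."""
    if len(mh) < 2:
        raise ValueError("too short to be a multihash")
    code, length = mh[0], mh[1]
    digest = mh[2:]
    if len(digest) != length:
        raise ValueError("declared length does not match digest size")
    return code, digest

mh = encode_multihash(b"hello")
code, digest = decode_multihash(mh)
print(hex(code), len(digest))
```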

rlpb · on March 25, 2021

Until someone stores a secret without the prefix - because it's always the same, right?

remram · on March 25, 2021

As long as the API wrappers don't mess this up, this has no reason to happen.