
I do wonder if ProductHunt uses any CAPTCHA solution.

In spite of the flak that CAPTCHAs usually get, I still think they have a lot of value in fighting the majority of these spam attacks.

The common criticisms are:

- They impact usability, accessibility and privacy. Users hate them, etc.

These are all issues that can be improved. In the last few years there have been several CAPTCHAs that work without user input at all, and safeguard user privacy.

- They're not good enough, sophisticated (AI) bots can easily bypass them, etc.

Sure, but even traditional techniques are useful at stopping low-effort bots. Sophisticated ones can be fought with more advanced techniques, including ML. There are products on the market that do this as well.

- They're ineffective against dedicated attackers using mechanical turks, etc.

Well, sure, but these are entirely different attack methods. CAPTCHAs are meant to detect bots, and by definition, won't be effective against attackers who decide to use actual humans. Websites need different mechanisms to protect against that, but those are also edge cases and not the main cause of the spam we see today.




Lately I've been pondering how one might create a "probably a human"/skin-in-the-game system. For example, imagine visiting an "attestor" site where you can make a one-time donation of $5 to a charity of your choice, and in exchange it gives you some proof-you-spent-money tokens. Those tokens can be spent (burned) by some collaborating site (e.g. HN) to mark your account there as likely a human, or at least a bot whose owner will feel pain if it is banned.

This would be far more privacy-preserving than dozens of national-ID lookup systems, and despite the appearance of "money for speech" it could actually be _cheaper_ than whatever mix of time and bus-fare and paperwork a "free" system would involve.

____________

I imagine the big problems would be things like:

* How to handle fraudulent payments, e.g. someone buying tokens with a stolen credit card. Easiest fix would be some long waiting-period before the token becomes usable.

* How to protect against a fraudulent attestor site that just takes your money, or one whose tokens are value-less.

* How to protect against a fraudulent destination site that secretly harvests your proof-token for its own use, as opposed to testing/burning it properly. Possible social fix: Put in a fake token, if the site "accepts" then you know it's misbehaving.

* Handling decentralization, where multiple donation sites may be issuing their own tokens and multiple account-sites that may only want to support/trust a subset of those tokens.


> you can make a one-time donation of $5 to a charity of your choice ...

The Alcoholics Anonymous San Francisco website had to implement CAPTCHAs because scammers were making one-time donations to check that their stolen credit cards were still valid. Every morning we had to invalidate a dozen obviously fake donations.


Every SaaS platform with a reasonably cheap offering deals with these. I work for a recognizable SaaS, and there are checks that flag the accounts and report the credit cards used after a fairly low threshold of "add payment method" attempts. High levels of fraud usage hurt your reputation with payment processors, and that's bad for business.

It doesn't stop the truly determined ones, I'm sure, but it does add complexity. You don't need to be impossible to test cards on, you just need to be harder to use than someone else (like a lower-resource charity). We've even debated "fake accepting" some payment methods, after we're confident it's someone trying to find working credit card numbers, to add some false positives into the mix.


Yup. Charitable donations are a way to spend money without it pointing to you and thus a common test for a stolen card.


Definitely an issue. I don't really like the idea of a long-term Patreon-esque relationship between the individual user and the attestor/issuer site, but it could be done. The charitable giving is more of a means-to-an-end than a goal, functioning as a kind of "observed spending" which is harder to fake than, say, buying something from yourself on eBay.

If tokens had to mature for X days before being used that could deter laundering pretty handily, but stopping "tests" of cards would require hiding payment errors from the user for a certain period... which would not be a great experience.


what happens if you don't invalidate them?


You'll get a chargeback when the owner sees it


And if you get too many chargebacks your account gets closed


It's an unauthorised payment so I guess at that point the police get involved.


It seems to me that ideas like this are unworkable due to income inequality.

$5 isn't much for a wealthy westerner. It's a reasonable amount for an unemployed westerner. It's 12% of their weekly budget for someone earning median wage ($160/month) in Vietnam. But if you put in place regional pricing, it'll be cheap enough that spammers will just operate out of low income countries and buy thousands of cheap accounts.


> It seems to me that ideas like this are unworkable due to income inequality.

There's no reason you can't have an attestation entity that's based on volunteer hours, provided you can convince sites-in-general that your proof-units are good.

The core theme isn't about cash, but that:

1. There are kinds of activity someone can do which demonstrate some kind of distinct actual expenditure of time or effort (not self-dealing).

2. A trusted agent could attest to that activity.

3. Requiring (proof of) activity gives you a decent way to ward off the majority of bots/spam in a relatively simple way that doesn't become a complex privacy nightmare.

It's a similar outcome to sending CPU-bound challenges to a client, except without the deliberate-waste and without a strong bias towards people who can afford their own fast computer.


The issue is that it's another system that puts the "blame" and/or work on people instead of dealing with the root cause. So in that case nothing changes.

Because I wonder how people are going to do volunteering hours and get them recognized through red tape/bureaucracy if they're already struggling to survive.


It's still completely asymmetric in that the poorer you are, the harder you have to work simply to access the same resources online as everyone else.

And the poor get poorer.


Well, not in places like Vietnam (7% GDP growth/year) or a lot of the less developed world.


As a poor disabled citizen who also cares about privacy and freedom, I haven't heard a single idea for attestation that doesn't scare the shit out of me. But then, I'm a poor, disabled citizen, so my opinion doesn't hold much weight.


Given that being summarily blocked from participation by paranoid site-operators is already an existing scary problem, what kind of fix would you suggest that adds the least additional scariness?

Personally, I am particularly concerned with avoiding the scariness of a government agency that inherently knows all the websites all people are using.


I would at least hope we'd consider whether more 'fixes' are actually worth it. It seems to me that 40 years of 'fixes' haven't done much to combat spam and the like, and have instead just made it harder for people like me to access and browse the Internet.


I'm trying to discern a positive proposal from that, and the best I can extract seems to be "stop fighting and just let the bad people operate unopposed", which doesn't seem workable to me.

Security/anti-spam is probably not the biggest accessibility factor in the last 40 years of change anyway: it's easier to make an alternate CAPTCHA route than to convince management a phone app is unnecessary, or to correctly annotate everything with aria/alt-text properties in all the languages.


> I'm trying to discern a positive proposal from that, and the best I can extract seems to be "stop fighting and just let the bad people operate unopposed."

That's not a very generous reading, I think. I am suggesting that the "bad people" seem to be doing fine, so at a certain point we might want to ask ourselves how far we take this "fight" in terms of sacrificing accessibility and privacy (to only name 2 concerns) to stop some percentage of bad actors.

As someone who has been hurt by these efforts over the past 20+ years, and who has yet to hear a proposal for next steps that doesn't greatly worry me, I'm not going to be in favor of propositions just because "well we have to do something"

> It may also be blaming the wrong factors and growing pains. It's easier to make an alternate CAPTCHA route than to convince management to not rely on a phone app or to correctly annotate everything with aria-properties in all the languages.

We've had 20 years to make CAPTCHAs more accessible, yet they've gotten worse. Not to mention their efficacy being in question, hence the discussion about next steps (i.e., attestation).


Have you checked out the L402[0] protocol?

It's basically using the HTTP 402: Payment Required status code and serving up a Lightning Network payment invoice.

Edit to add: it basically solves all of the caveat issues you identified.

[0]: https://l402.org/
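
Roughly, the client-side flow looks something like this (a minimal sketch with an invented response shape and header; the spec at the link defines the real wire format and macaroon details):

    // Minimal sketch of a 402-payment flow. The response shape and the
    // Authorization header below are invented for illustration; the real
    // L402 spec (linked above) defines the actual headers and macaroons.
    async function fetchWithPayment(
      url: string,
      payInvoice: (invoice: string) => Promise<string>, // returns a proof of payment
    ): Promise<Response> {
      const first = await fetch(url);
      if (first.status !== 402) return first;

      const { invoice } = await first.json();   // hypothetical: invoice in the body
      const proof = await payInvoice(invoice);  // e.g. the Lightning preimage

      // Retry the request, presenting the proof of payment.
      return fetch(url, { headers: { Authorization: `Payment ${proof}` } });
    }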


>Possible social fix: Put in a fake token, if the site "accepts" then you know it's misbehaving.

IIUC the tokens would need to be cheaply verifiable by anyone as authentically issued, so a fake token would never be accepted (or if it somehow was, it would only tell you that the acceptor is fantastically lazy/incompetent).

I think that that verifiability, plus a guarantee that tokens will not be spent twice, plus visibility of all transactions, suffice: then anyone can check the public ledger x minutes after they spent theirs and verify that the acceptor sent it straight to the burn address after receiving it. IOW, blockchain suffices. OTOH, it would be nice not to have to need the public ledger.


I think blockchain is (as usual) the wrong tool for the job here, since it would dramatically increase code/bugs/complexity/onboarding-cost while also introducing new privacy risks. After all, we're already trusting that a given attestor/issuer isn't just handing out tokens willy-nilly.

By comparison, here's a simpler "single HTTP call" approach, where a site like HN makes a POST to the issuer's API, which would semantically be like: "Hey, here is a potential token T and a big random confirmation number C. If T is valid, burn it and record C as the cause. Otherwise change nothing. Finally tell me whether-or-not that same T was burned in the last 7 days along with the same C that I gave."

The benefits of this approach are:

1. The issuer just has to maintain a list of surviving tokens and a smaller short-lived list of recent (T,C) burning activity, and use easy standard DB transactions to stop conflicts or double-spending.

2. All the social-media site has to do is create a random number C for burning a given T, and temporarily remember the pair until it gets a yes-or-no answer.

3. A malicious social-media site cannot separate testing the token from spending it on a legitimate site, which deters a business model of harvest-and-resale. However it could spend it immediately for its own purposes, which is worth further discussion.

4. The idempotent API call resists connection hiccups and works for really basic retry logic, avoiding "wasted" tokens.

5. The issuer doesn't know how or where a given token is being used, beyond what it can infer from the POST request source IP. It certainly doesn't know which social-media account it just verified, unless the two sites collude or the social-media site is stupid and doesn't use random C values.
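
To make that concrete, here's a minimal sketch of the issuer side (all names and the in-memory storage are just illustrative assumptions; a real issuer would use a DB transaction):

    // Sketch of the issuer's idempotent burn-and-confirm endpoint described above.
    type BurnRecord = { confirmation: string; burnedAt: number };

    const liveTokens = new Set<string>();               // surviving (unspent) tokens
    const recentBurns = new Map<string, BurnRecord>();  // token -> burn record, kept ~7 days

    // "Here is token T and confirmation number C. If T is valid, burn it and
    //  record C. Either way, tell me whether T was burned recently with this same C."
    function burnToken(token: string, confirmation: string): { burned: boolean } {
      const prior = recentBurns.get(token);
      if (prior) {
        // Idempotent retry: only a caller presenting the same C gets a "yes".
        return { burned: prior.confirmation === confirmation };
      }
      if (!liveTokens.has(token)) return { burned: false }; // unknown or expired token
      liveTokens.delete(token);                             // burn it
      recentBurns.set(token, { confirmation, burnedAt: Date.now() });
      return { burned: true };
    }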


This is going in the right direction, but you identified the acceptor double-spend problem.

What about if, instead of the spender handing the token directly to the acceptor, the spender first makes an HTTP "I want to spend token 456" request to the issuer, which replies with a "receipt" that the spender then sends to the acceptor. The acceptor in turn sends a request to the issuer meaning: "If the token associated with this receipt is not yet burnt, burn it, record C next to it and report OK; if it was already recently burnt using C, also report OK (for idempotence); otherwise (if it was already burnt with some other C') report FAIL." The receipt not being valid as a spendable token cuts out the double-spend issue, at the cost of one extra HTTP request for the spender.


I think that's the right direction but it seems incomplete: The new indirect-thingamabob can still be retargeted to a different site. (Ex: I sign up to AcmeWidgetForum which falsely claims it needs to confirm I'm a real person, and AcmeWidgetForum secretly sends the data onwards to verify an unrelated spam-account on Slashdot.)

[Edit: This has a flaw, but I already typed it out and I think it makes an incremental advancement.] How about:

1. User earns Token (no change from before)

2. User visits the Site and begins the "offer proof" process, the Site generates and records two random numbers/UUIDs for the process. The first is the previously-discussed Confirmation Code, which is used for idempotency and is not shared with the User. The second is a Site Handshake code which the user must copy down.

3. User goes to the Attestor site and plugs in two pieces of information, the Token and the Site Handshake code. This returns a Burn Trigger (valid for X hours) which the user carries back to the Site.

4. User passes the Burn Trigger to the Site, and it calls the previously-discussed API with both the Confirmation Code and the Site Handshake. If the Site Handshake does not match what's on file for that Burn Trigger, the attempt immediately fails with a security error.

____

No, wait, that doesn't really work. Although it protects against EvilForum later leveraging the data into a spam account on Slashdot, it fails when EvilForum has pre-emptively started a spam account on Slashdot and is reusing Slashdot's chosen Site Handshake as its own.


>AcmeWidgetForum secretly sends the data onwards

It can't do this, because the only "data" it has from the spender is a receipt. A receipt is by design not a spendable token itself; this is trivial to make evident to any party (e.g., tokens are all 100 characters, receipts are all 50).


> It can't do this, because the only "data" it has from the spender is a receipt.

It can because nothing in that artifact binds it to the one and only one site that the user expects. The only thing keeping it from being used elsewhere is if everybody keeps it secret, and the malicious not-really-spending site simply won't obey that rule.

In scenario form:

1. User goes to Attestor, inputs a Token for an output of a Burn Trigger. (I object to "receipt" because that suggests a finalized transaction, and nothing has really happened yet.)

2. User submits that Burn Trigger to malicious AcmeWidgetForum, which (fraudulently) reports a successful burning and puts a "Verified" badge on the account.

3. In the background, AcmeWidgetForum acts like a different User and submits the Burn Trigger to InnocentSite, which sees no issue and burns it to create a new "verified" account.

Even if the User can somehow audit "which site actually claimed responsibility for burning my Token" and sees that "InnocentSite" shows up instead, most won't check, and even knowing that AcmeWidgetForum was evil won't do much to stop the site from harvesting more unwitting Users.


Ah, you're right. The receipt is "spendable" by the acceptor, since it contains nothing identifying the original spender.


What If: The Site chooses and exposes a public key (a simple one, like SSH, unrelated to TLS/DNS/certs) which the User carries over to create the Burn Trigger.

The Attestor generates a random secret associated with each Burn Trigger, and encrypts it with the supplied public key to create a non-secret Challenge. (Which is carried back by the User or else can be looked up by another API call.)

To burn/verify the Token, the Site would need to use its private key to reverse the process, turning the Challenge back into the secret. It would then supply the secret to the burn/verify API call. The earlier Confirmation Code would no longer be needed.

Thus AcmeWidgetForum would be the only site capable of using that Burn Trigger. (Unless they granted that ability to another site by sharing the same keypair, or stole a victim-site's keypair.)
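
For illustration, the encrypt/decrypt step could be sketched with stock RSA primitives like so (names and flow are my own assumptions, not a spec):

    import { generateKeyPairSync, publicEncrypt, privateDecrypt, randomBytes } from "crypto";

    // The Site publishes the public half of a keypair (done once, out of band).
    const { publicKey, privateKey } = generateKeyPairSync("rsa", { modulusLength: 2048 });

    // Attestor side: when issuing a Burn Trigger, bind it to the Site by
    // encrypting a fresh secret that only that Site's private key can recover.
    const secret = randomBytes(32);
    const challenge = publicEncrypt(publicKey, secret); // non-secret, carried by the User

    // Site side: recover the secret and present it with the burn/verify API call.
    const recovered = privateDecrypt(privateKey, challenge);
    console.log(recovered.equals(secret)); // true: only the keyholder can produce this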

... I know this is reinventing wheels, but I'm gonna choose to believe that there's some minor merit to it.


> Lately I've been pondering how one might create a "probably a human"/skin-in-the-game system.

This has the same energy as the "we need benchmarks for LLMs" startups. Like sure, it's obvious and you can imagine really complex cathedrals about it. But nobody wants that. They "just" want Apple and Google to provide access to the same APIs their apps and backends use, associating authentic phone activity with user accounts. You already get most of the way there by supporting iCloud login, which should illuminate that what you are really asking for is to play outside of Apple's ecosystem, a totally different ask.


There is the much slagged off but maybe effective Worldcoin.


Nothing like this or Worldcoin ever will be useful in any capacity for qualifying non-fraudsters, because fraudsters will have an infinite supply from people they tricked while non-fraudsters will only have what they've personally been given. So it'll basically do the opposite of what you want.


Worldcoin IDs are one per person and I've not heard of people tricked to give them away. Some have been bought for cash but there's a limit to how many of those will be available. In practice they are no good for verifying humans on blogs and the like though because only about 0.0003% of humans have one. But maybe something a bit like that that's easier to get?


> I've not heard of people tricked to give them away.

People can be tricked to give anything away.

> In practice they are no good for verifying humans on blogs and the like though because only about 0.0003% of humans have one.

Even if every human had one it'd still be useless.


There is a whole industry of CAPTCHA-solving services that mostly use humans in places where labor is cheap. Prices per reCAPTCHA vary somewhere between $0.001 and $0.002 on one of the popular ones. It doesn't require much sophistication to use it. For around $50/year you can spam a website with 100 comments per day, assuming it requires a CAPTCHA to be solved per comment. This price tag may leave the average script kiddie out of the game, but if your spam is earning you money somehow then this becomes easily profitable. I don't believe these services are "edge cases".
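
Spelled out, the arithmetic behind that figure is roughly:

    // Back-of-the-envelope cost of 100 CAPTCHA-gated comments per day for a year,
    // using the per-solve prices mentioned above.
    const solvesPerYear = 100 * 365;        // 36,500 solves
    const lowEnd = solvesPerYear * 0.001;   // $36.50/year
    const highEnd = solvesPerYear * 0.002;  // $73.00/year
    console.log(lowEnd, highEnd);           // ~$50/year sits in the middle of the range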


I'm aware of that, but CAPTCHAs are still a hurdle for most low-effort operations. I'm not so certain that ones using mechanical turks are not edge cases, since they would typically target the largest and most popular/profitable sites, and wouldn't bother with smaller websites.

Besides, CAPTCHAs shouldn't be the only protection against spam. There should still be content moderation tools, whether they're automated or not, to protect when/if CAPTCHAs don't work. Larger websites should know this and have the resources to mitigate it.

So saying that CAPTCHAs aren't worth it because they're not 100% accurate or effective is the wrong way of looking at this. They're just the first line of defense.


It's not about having to be 100% effective, it just has to be worth it considering the trade-off of introducing a hurdle for every legitimate user using your website.

I would probably value my time, spent solving an annoying reCAPTCHA tapping on slowly fading pictures of what an American would consider a school bus before being asked to try again, more than a fraction of a cent. Of course reCAPTCHA probably considers me an edge case using Firefox with tracking protection and not being signed into Google, but it's just rude to require users to deal with this on a common basis. A local government website here requires me to solve a reCAPTCHA every time to view or refresh a timetable even though it's already locked behind an identity verification step involving logging in through my bank.

It would be smart to put some sort of CAPTCHA or other verification step on a website when signing up with just an email, because otherwise the cost for someone to automate making a million accounts would be $0.00. But it should at least be properly implemented; I've run into websites that use the invisible reCAPTCHA v3, and when my Firefox browser inevitably fails the check, it doesn't even give me a challenge of any sort, just an error message, and I can't sign up or even sign in to my previously made account. A literal hurdle I can't get past as a legitimate user. If I were a spammer, though, apparently it would only cost less than a quarter of a cent to get past it.


Bad CAPTCHA implementations are not a reason to dismiss CAPTCHAs as a whole. These are all solvable technical problems. Yes, they will likely never be 100% accurate, but plenty can be done to improve the user experience and avoid the situations you're describing. There are alternative products on the market today that do a much better job at this than reCAPTCHA.


The problem is that website owners want to have their cake and eat it too. They want to make data public but not so public that it can be copied. It's the same problem as DRM which doesn't work. It's an inherent contradiction.

Web devs also bloat the hell out of sites with MBs of JavaScript and overcomplicated design. It would be far cheaper to just have a static site and use a CDN.


CAPTCHAs can protect public resources as well. But the main problem here is about preventing generated spam content, not scraping. This can be mitigated by placing CAPTCHAs only on pages with signup/login and comment forms.


>This can be mitigated by placing CAPTCHAs only on signup/login and comment forms.

If only...

An algorithmic Turing test is a more interesting problem. https://xkcd.com/810/

Traditional CAPTCHA-solving ability has already been surpassed by the bots, which is why there are now so many new and creative CAPTCHAs. Until someone trains a model to solve them, that is.


It's a never-ending arms race. CAPTCHA services are improving just as the attackers are. Just because they're not 100% accurate or effective doesn't mean they're worthless. They're just the first line of defense.


Just don't forget the scale doesn't stop at zero. They can be less than worthless and cost you actual humans visiting your site.


That'd be a nice way of looking at it, if serving content was cheap. It is not. I want to put my CV online, but I'm not willing to shell out tens of thousands every year to have it scraped for gigabytes per day. Doesn't happen, you say? It didn't before, definitely. Now there are so many scrapers building data sets that I've certainly had to block entire ranges of IPs due to repeated wasting of money.

It's like the classic "little lambda thing" that someone posts on HN and finds a $2 million invoice in their inbox a couple weeks later. Except instead of going viral your achievements get mulched by AI.


I have a personal website (including my CV in PDF), blog, and self-hosted email. A story I posted once made the HN frontpage and my e-mail is in my profile, meaning my content is read by more bots than humans.

My hosting costs are ca. $10 a month. Therefore I'm really curious: if hosting your CV requires "tens of thousands every year", what does your setup look like?


I imagine it is something about bandwidth costs


Gigabytes? How big is your CV?

>lambda thing

I never understood why anyone thought this was ever a good idea. QoS is a natural rate limit for non-critical resources.


There's a nearly fool-proof solution: manually verify every submission.

You can use automated systems as a first line of defense against spam, and then hire people to manually verify every submission that makes it through. You can even use that as opportunity to ensure a certain quality of submission, even if it was submitted by a person.

Any legitimate submissions that get caught in the initial spam filter can use a manual appeal process (perhaps emailing and pleading their case which will go into a queue to be manually reviewed).

Sure, it's not necessarily easy and submissions may take some time to appear on the site, but there would be essentially zero spam and low-quality content.


> You can use automated systems as a first line of defense against spam, and then hire people to manually verify every submission that makes it through. You can even use that as opportunity to ensure a certain quality of submission, even if it was submitted by a person.

The problem is, once you do manual upfront moderation, you lose a lot of the legal protections that UGC-hosting sites enjoy - manual approval means you are accepting the liability for anything that is published.


That's a good point, and perhaps a flaw in safe-harbor laws for UGC. It's a bit all or nothing, isn't it?


The article never talked about bot-generated products, only bot generated comments and upvotes. How does manual review address this exactly?


The bots are commenting and voting, not submitting products.


As someone who already often runs into them due to VPN use being flagged: please, no more. Think about how much human time has been wasted on these things.


The detection rules ultimately depend on each site. If you're using a VPN known to be the source of bot activity, you're likely being grouped with that traffic. Ideally the detection can be sophisticated enough to distinguish human users from bots based on other signals besides just their source IP, but sometimes this is not the case.

All these usability issues are solvable. They're not a reason to believe that the problem of distinguishing bots from humans can't be approached in a better way.


I was researching CAPTCHA-solving services yesterday that run on cheap 3rd world labour. I couldn't imagine a worse job than solving them all day.


It's an inherently repetitive task which makes it easy to automate with ML without requiring anything like AGI.


I wonder if this is like the recent article about people not buying from locked display cabinets:

https://news.ycombinator.com/item?id=41630482

how many humans does captcha send away?


You're thinking of the traditional CAPTCHA that requires an active effort from the user. Those do present a barrier to entry where some users give up.

But there's a new breed of them that work behind the scenes and are transparent to the user. It's likely that by the time the user has finished interacting with the form, or with whatever is being protected, that the CAPTCHA has already determined whether the user is a bot or not. They only block the action if they have reasons to suspect the user is a bot, in which case they can show a more traditional puzzle. How effective this is depends on the implementation, but this approach has received good feedback from users and companies alike.
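
As a generic illustration of how that "behind the scenes" approach tends to look on the client (hypothetical names and endpoint, not any particular vendor's implementation):

    // Collect passive behavioral signals while the user fills in the form, then
    // attach an opaque verdict token to the submission for the backend to verify.
    // The /assess endpoint and field names are invented for this sketch.
    const signals: { type: string; t: number }[] = [];
    for (const evt of ["mousemove", "keydown", "scroll", "touchstart"]) {
      document.addEventListener(evt, () => signals.push({ type: evt, t: Date.now() }), { passive: true });
    }

    async function attachVerdict(form: HTMLFormElement): Promise<void> {
      const res = await fetch("https://captcha.example/assess", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ signals }),
      });
      const { token } = await res.json(); // opaque; a puzzle is shown only if the score is suspicious
      const tokenField = form.querySelector<HTMLInputElement>('input[name="captcha_token"]');
      if (tokenField) tokenField.value = token;
    }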


And all too often said systems call humans bots.

Especially if you load a page in another tab while remaining on the page you were on.


I left a reply here[1] that also applies to your comment.

[1]: https://news.ycombinator.com/item?id=41717214


Bots can apparently now beat 100% of road sign CAPTCHAs, so unless you can cycle them around, it's not going to do much.


Annoyingly these captchas that apparently safeguard user privacy make websites completely unusable when using Firefox fingerprinting protection.


I'm not claiming that these systems are perfect. Those edge cases should be resolved.

But if we want the internet to remain usable, our best chance is to fight back and improve our bot detection methods, while also improving all the other shortcomings people have associated with CAPTCHAs. Both are solvable technical problems.

The alternatives of annoying CAPTCHAs that don't work well, or no protection at all, are far worse in comparison.


Well, so far it's being solved by fingerprinting everyone uniquely, and punishing people who use anti-fingerprinting with essentially unusable websites. So, the captcha is essentially window dressing.


I get that argument, as someone who uses those privacy-preserving methods. I've dealt with annoying CAPTCHAs for many years. The problem is that a CAPTCHA by definition is unable to do its job unless it can gather as much information as possible about the user. There are obvious privacy concerns here, but companies that operate under regulations like the GDPR are generally more conscious about this.

So what should be the correct behavior if the CAPTCHA can't gather enough information? Should it default to assuming the user is a bot or a human?

I think this decision should depend on each site, depending on how strict they want the behavior to be. So it's a configuration setting, rather than a CAPTCHA problem.

In a broader sense, think about the implications of not using a CAPTCHA. The internet is overrun with bots; they comprise an estimated 36% of global traffic[1]. Cases like ProductHunt are not unique, and we see similar bot statistics everywhere else. These numbers will only increase as AI gets more accessible, making the current web practically unusable for humans.

If you see a better alternative to CAPTCHAs I'd be happy to know about it, but to me it's clear that the path forward is for websites to detect who is or isn't a bot, and restrict access accordingly. So working on improving these tools, in both detection accuracy and UX, should be our main priority for mitigating this problem.

[1]: https://investors.fastly.com/news/news-details/2024/New-Fast...


So, I have a few objections here. First off, CAPTCHAs are not "by definition" about fingerprinting users. They are "by definition" a Turing test for distinguishing humans from bots. It just turns out that is hard to do, so CAPTCHAs pivoted to fingerprinting instead.

Secondly, sites often are unaware or not given the choice. Businesses are sold the idea that they are being protected against bots, when in fact they are turning away real users. Many I contacted were unaware this was happening. In fact, the servers in between are not even integrated in a way to support a reasonable fallback. For example, on some sites (FedEx, Kickstarter) the "captcha" is returned by a JSON API that is completely unable to handle it or present it to the user.

Thirdly, the fingerprinting is broadly applied with NO exceptions. You would think a simple heuristic would be "the user has used this IP for the past 5 years to authenticate to this website, with the same browser UA - we can probably let them through" but, no, they kick it over to a third party automated system, one that can completely break authentication, to fingerprint their users, on pages with personal information at that. They often don't offer any other options either, like additional auth challenges.

So, yeah, people are being told "well, we have to fingerprint users, we have no choice", and the ironic thing is the battle is being lost anyway, and real damage is being done in the false positives, esp. if the site is tech savvy.

But whatever. I'm aware I won't convince you, I'm aware I'm in the minority, and most people accept the status quo or are unaware of the abuses, but it's being implemented poorly, it isn't working, it's harming real people and the internet as a whole, and it is not an adequate fix.


Hey, thanks for taking the time to write such a thoughtful reply. I'm always open to counterarguments to what I'm saying, and happy to discuss them in a civil manner. I think such discussions are healthy, even without the expectation that we're going to convince one another.

I think our main disagreement is about what constitutes a "fingerprint", and whether CAPTCHAs can work without it.

Let's start from basic principles...

The "Turing test" in the CAPTCHA acronym is merely a vague historical descriptor of what these tools actually do. For one, the arbitrer in the original Turing test was a human. In contrast, the "Completely Automated" part means that the arbitrer in CAPTCHAs has to be a machine.

Secondly, the original Turing test involved a natural language conversation. This would be highly impractical in the context of web applications, and would also go against the "Completely Automated" part.

Furthermore, humans can be easily fooled by machines in such a test nowadays, as the original Turing test has been decidedly broken with recent AI advancements.

So taking all of this into account, since machines don't have reasoning capabilities (yet) to make the bot-or-not distinction in the same way that a human would, we have to instead provide them with inputs that they can actually process. This inevitably means that the more information we can gather about the user, the higher the accuracy of their predictions will be.

This is why I say that CAPTCHAs have to involve fingerprints _by definition_. They wouldn't be able to do their job otherwise.

Can we agree on this so far?

Now let's define what a fingerprint actually is. It's just a collection of data points about the user. In your example, the IP address and user agent are a couple of data points. The question is: are just these two alone enough information for a CAPTCHA to accurately do its job? The IP address can be shared by many users, and can be dynamic. The user agent can be easily spoofed, and is not reliable. So I think we can agree that the answer to that question is "no".

This means that we need much more information for a CAPTCHA to work. This is where device information, advanced heuristics and behavioral signals come into play. Is the user interacting with the page? How human-like are their interactions? Are there patterns in this activity that we've seen before? What device are they using (or claim to be using)? Can we detect a browser automation tool being used? All of these, and many more, data points go into making an accurate bot-or-not decision. We can't rely on any single data point in isolation, but all of them in combination gives us a better picture.

Now, this inevitably becomes a very accurate "fingerprint" of the user. Advertisers would love to get ahold of this data, and use it for tracking and targeting purposes. The difference is in how it is used. A privacy-conscious CAPTCHA implementation that follows regulations like the GDPR would treat this data as a liability rather than an asset. The data wouldn't be shared with anyone, and would be purged after it's not needed.

The other point I'd like to emphasize is that the internet is becoming more difficult and dangerous to use by humans. We're being overrun with bots. As I linked in my previous reply, an estimated 36% of all global traffic comes from bots. This is an insane statistic, which will only grow as AI becomes more accessible.

So all of this is to say that we need automated ways to tell humans and computers apart to make the internet safer and actually usable by humans, and CAPTCHAs are so far the best system we have for it. They're far from being perfect, and I doubt we'll ever reach that point. Can we do a better job at it? Absolutely. But the alternative of not using them is much, much worse. If you can think of a better way of solving these problems without CAPTCHAs, I'm all ears.

The examples you mention are logistical and systemic problems in organizations. Businesses need to be more aware of these issues, and how to best address them. They're not indicators of problems with CAPTCHAs themselves, but with how they're used and configured in organizations.

Sorry for the wall of text, but I hope I clarified some of my thoughts on this, and that we can find a middle ground somewhere. :) Cheers!

Another point I forgot to mention: it's certainly possible to not gather all these signals. We can present an actual puzzle to the user, confirm whether they solve it correctly, and use signals only from the interaction with the puzzle itself. There are two problems with this: it's incredibly annoying and disruptive to actual humans. Nobody wants to solve puzzles to access some content. This is also far from being a "Completely Automated" test... And the other problem is that machines have become increasingly good at solving these puzzles themselves. The standard image macro puzzle has been broken for many years. Object and audio recognition is now broken as well. You see some CAPTCHA implementations coming up with more creative puzzles, but these will all inevitably be broken as well. So puzzles are just not a user friendly or reliable way of doing bot detection.


I'm going to leave it at "agree to disagree".. But here's my wall of text anyway.

Until something more substantive is done to control who can fingerprint (let's assume this is even a reasonable solution), users are forced to deactivate fingerprinting protection, and Firefox can NOT roll it out by default (your captchas are the main blocker) - or even expose it as a user option in config and advertise it with caveats that you might get more challenges - right now you don't just get more challenges, you get a broken internet.

And, 36% of internet traffic being bot activity is pretty meaningless. I personally have no problem if 90% of the internet is bot activity. We have an enormous amount of bot traffic on our websites - I would say the majority - and I don't block any of it that respects our terms - a ton of it is obviously being used to train LLMs or improve search engines - more power to them. And honestly there's probably an opportunity for monetisation here. Some of it is security scans. Whatever. That is not a problem. Non-human users of the internet will inevitably arise as integration does, and I've written many a bot myself. Abuse is the problem. There are ways to tackle abuse that aren't fingerprinting: smarter heuristics (which are obviously not being used by the "captcha" companies or I would not be getting blocked on routine use of sites like FedEx or Drupal or my bank after following a link from that bank or service), hashcash, smarter actual Turing tests that verify not "human-like" spoofable profiles, but actual human-like competence... without fingerprinting. What we have right now is laziness and the fact that fingerprinting is profitable, so there is actually an incentive to discourage it by all parties involved. It'll never be perfect but what we have now is far far far from that.

I will say, BTW, that bots are not that hard to block. On a website I maintain we went from 1000+ bot accounts a month to 0 for many years, simply by adding an extra hand-rolled element to a generic captcha. The generic captchas are what bots bother to break in most cases. (That would probably not apply to massive services, but those also have the capacity to keep creating new custom ones and be a moving target - probably would just require one programmer full-time really.)
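
For what it's worth, the "extra hand-rolled element" can be as simple as a custom question or honeypot field that generic solving services never bother with (field names and the example question here are made up for illustration):

    // Server-side sketch: reject signups that fail a site-specific extra check.
    // Generic CAPTCHA-breaking services target the big, standardized widgets,
    // not one-off fields like these.
    function looksLikeBot(form: Record<string, string | undefined>): boolean {
      // Honeypot: a field hidden via CSS, so real users leave it empty.
      if (form["website_url"]) return true;
      // Site-specific question shown next to the form, e.g. "What colour is our logo?"
      if (form["site_question"]?.trim().toLowerCase() !== "purple") return true;
      return false;
    }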

And yes, businesses need to implement these "captcha" solutions better, but the people offering the solutions are not offering them with transparency as to the issues or clean integration with APIs. It's just get the contract, drop it in front of all traffic, move on.

And, for god's sake, implement the captcha sanely. Don't require third party javascript, cookies, etc. Have the companies proxy it through their website so standard security and privacy measures don't block it by default, which happens almost all the time. In fact, in many cases even the feedback when blocked is also blocked (facepalm). Don't block by default on a "suspicious" (i.e. generic) fingerprint, as happens quite often now. Actually SHOW a captcha so the user has a fighting chance and knows what is going on.


> These are all [CAPTCHA] issues that can be improved.

No. This is not a new issue. The problems have been there for many years. You can't claim "working on it" - which is not even what you are claiming.

By now, recognize that if the users themselves are fighting this crap or avoiding the sites and companies that use them, it's entirely deserved. By setting CAPTCHAs, you attack your users. (Witnessed in 2024: an insurance claims form which demands that a CAPTCHA be solved but shows no CAPTCHA. This crap is so common it can now be used to delay insurance claims!)


> You can't claim "working on it" - which is not even what you are claiming.

I can, actually. :) I'm part of the team at https://friendlycaptcha.com/ and we agree that most CAPTCHAs suck. But we also believe that these issues can be improved, if not outright solved—at least the UX aspects.

I was doing my best to avoid bringing up my employment, since these are my own opinions and I didn't want to promote anything, but I might as well mention that there are people working on this. There are similar solutions from Cloudflare, DataDome, and others.

If you're having an annoying CAPTCHA experience in 2024, that's mostly due to the particular website choosing to use an annoying CAPTCHA implementation, or not configuring it properly. As I've said numerous times in this thread, distinguishing bots from humans will never be 100% accurate, but the alternative of not doing that is far worse. So we'll have to live with this if we want the internet to remain usable, and our efforts should be directed towards making it as painless as possible for actual humans.



