I've been a member of HN for quite a while, but I usually focus on security and privacy topics. One of my good friends is visiting, and we wanted to work on something challenging together. Both of us find privacy policies overly confusing and annoying, so we decided to tackle the problem. We built a tool that crawls for privacy policies and uses guided machine learning to analyze them. We would love any feedback you have.
The site is pretty slow. I noticed that entering my site a second time didn't produce a result any faster than the first time. You might consider adding some caching.
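To make the caching suggestion concrete, here is a minimal sketch of a time-limited result cache. Everything in it is an assumption for illustration: `TTLCache`, `get_or_compute`, and the one-day lifetime are hypothetical names and values, not anything the site is known to use.

```python
import time


class TTLCache:
    """Hypothetical in-memory cache: keep each result for a fixed lifetime."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the expiry logic can be tested
        self._store = {}            # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: skip the slow crawl
        value = compute(key)        # slow path: crawl and classify the policy
        self._store[key] = (now + self.ttl, value)
        return value


# Assumed lifetime of one day; policies rarely change more often than that.
cache = TTLCache(ttl_seconds=24 * 60 * 60)
```

A second lookup of the same domain within the lifetime returns the stored result instead of crawling again.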
We deliberated on the categories to include here and went with what we thought made the most sense and was the most clear in policies. A lot of policies don't make the distinction you describe =/
You can go so much farther with this. How about letting me paste in any TOS and have it analyzed for important bits or things out of the ordinary? I'd love having my own personal robot lawyer to read over all the stuff I sign or agree to!
Thanks, and yes indeed, that's what we are hoping to work towards. We started out with a few other categories & classifiers (e.g. does the site claim intellectual property over what you post) but wanted to launch with "does this site sell my data." We can train it on TOSs and categorize there.
Do you have a background in this (re: JD)? This is a very complicated task. In fact it's such a tricky task that I think you might be better off just building a database of terms. A quick Googling didn't turn up anything so maybe this is something else you could offer. Have people submit which sites they want you to take a look at, take info from people... Fairly different from what you're doing but seems more likely to give people information that's useful. I wouldn't want to rely on a machine-generated evaluation that wasn't 100% right 100% of the time!
Would you trust such a lawyer? No offense to the developer, I think he did a terrific job, but trusting a machine to parse a contract for something fishy seems a little too... Well, risky.
Of course it'd be great when you have a positive match.
For example, http://www.privacyparrot.com/privacy states that they never "share any information about you" but then has an offhand mention about the site using Google Analytics.
I consider Google Analytics a great harm to my privacy, and brushing it off with "We give you a cookie so Google analytics works, but it's nothing personal." gives me a dishonest and careless impression.
I love this idea, but here's how I think it could be improved:
1. Consider porting the entire service to a browser extension for Chrome or Firefox, and making the homepage more of an information/FAQ center.
2. Demo video. A good demo video explaining why "John Doe" should worry about his personal information being sold would be more convincing - this is how you get your service to less savvy internet users who aren't primarily concerned with privacy.
3. Find a way around inconsistencies. It would be better to report if a website actually sells/uses your personal information rather than returning a simple search result with TOS findings. A website can tweak or flat out lie. You should try to account for this.
4. Are you planning to commercialize this in any way? How do you plan to fund it, if at all?
1. Seems we got a lot of feedback and requests for an extension. Sounds like a good idea to me too!
2. Yes, a demo video explaining why this is important would be helpful to folks.
3. Not sure how to address this. The tool reads and classifies terms of service. Any suggestions on how we can do what you suggest?
4. Potentially, want to explore a bit more and make sure it works well first. We have ideas for premium and related features/services that would key off of this well if it becomes popular and trusted.
As for 3: A simple feature to add would be to add a form for users to submit samples (links, comments).
BTW: Cool tool with a lot of potential! See also http://www.javacoolsoftware.com/eulalyzer.html (They analyze EULAs, not Privacy Policies though. Also I wasn't too impressed when I tested but a nifty idea and I only tested it a few years ago so it might have been improved a lot in the meantime.)
Hm...well, there are a few ways. What you could do is use an automated crawler with a Y|N function to tell you if it's being tracked. For example, when a user submits a website, your engine would send an automated message to the server and detect how much, if any of the information were stored, and then track it to see how it's used. If you've got the tech to do this, it would eliminate a huge amount of false positives and false negatives that the privacy policy's rhetoric generates. But that's just my suggestion.
Just curious, if it could highlight the offending phrases it used to figure out the difference between selling, not selling, and bankruptcy selling. This way when we put in our revisions we can better help it learn.
Also, if you aren't planning on making this a commercially viable product, could you release the source code? Things like this make the world better and safer (not to mention easier and funner). All in all though it was rather interesting. (Still trying out websites and I see myself doing this until the end of the day at work.)
Of the two sites of mine that I checked, one came up as "Danger! Warning! They're going to sell your information in case of a Bankruptcy!!!"
Why?
Reading one of the submitter's comments below, it seems to lump "sold the entire company, therefore the user database went with it" into the same category as "we're running out of money, so let's sell everybody's email addresses to spammers."
They're not in any way related. I'd suggest splitting out those two categories, as I suspect it will drop that "bankruptcy email fire sale" category down to somewhere near 0%.
That is a good point. Perhaps the way we've worded it is giving an incorrect impression.
It's tough. Most policies look something like this:
In the event that XXXXXXX is involved in a bankruptcy, merger, acquisition, reorganization or sale of assets, your information may be sold or transferred as part of that transaction
From what I gather, the example above implies that the company considers your data an asset and it "may be sold" as part of the transaction or bankruptcy.
Help me come up with a better way to describe this situation succinctly! =)
This may be OT, but I'd like to know more about organizational practices (and corresponding contractual language) that mitigate this risk.
My vague impression is that unless one takes deliberate steps to remove such information from the available... "asset pool", when bankruptcy strikes, all bets are off (in the U.S., at least).
Any effective limits after that point seem more often to be PR-based (bad PR decreasing, negating, or even outweighing the value of the information) than due to legal stricture. Or else a matter of getting a "one-off" restriction from a court proceeding.
This is just my impression from the news. I would welcome any clarification.
How about requiring the word "bankruptcy" to appear somewhere in the privacy policy before you put the word "bankruptcy" in big red letters next to the name of somebody else's website?
Err on the side of caution when you've got somebody else's reputation in your hands.
We've reworded the message to mention acquisitions as well for the time being. We're working on better ways to display the information to convey a sense of caution and encourage further reading rather than the feeling of danger associated with the direct sale of information.
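One way to act on the two suggestions above (split bankruptcy from acquisition, and only say "bankruptcy" when the policy does) is a plain keyword pass that also returns the matching sentences, which would cover the request to highlight the offending phrases. This is only a rough sketch: `EVENT_TERMS`, `TRANSFER`, and `transfer_events` are hypothetical names, the term lists are guesses, and it is not how the site's classifier works.

```python
import re

# Assumed keyword lists; a real policy corpus would need many more variants.
EVENT_TERMS = {
    "bankruptcy": ("bankruptcy", "insolvency"),
    "acquisition": ("merger", "acquisition", "reorganization", "sale of assets"),
}
TRANSFER = re.compile(r"\b(sold|sell|transferred|transfer)\b", re.IGNORECASE)


def transfer_events(policy_text):
    """Map each event category to the sentences that tie it to a data transfer."""
    found = {}
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", policy_text):
        if not TRANSFER.search(sentence):
            continue                      # sentence never mentions selling or transferring
        lowered = sentence.lower()
        for category, terms in EVENT_TERMS.items():
            if any(term in lowered for term in terms):
                found.setdefault(category, []).append(sentence.strip())
    return found
```

A policy that only talks about mergers would come back with an "acquisition" entry and no "bankruptcy" entry, so the big red word is shown only when the policy itself uses it.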
Why does it automatically add www in front of what I type in the URL box? If I type in news.ycombinator.com, it says "www.news.ycombinator.com does not exist".
If you made a Google Chrome or Firefox extension I'd install (both of) them. In fact, porting this service to an extension seems more logical and streamlined if it could push notifications in real time as you entered a website, the way Firefox can warn you if a website has a false certificate or is a registered scam/phishing website.
In a galaxy far away there was once conceived the idea of a machine readable privacy policy ... checking the interwebs reveals that http://www.w3.org/P3P/ was updated for the last time in 2007.
On another note: www dot freeprivacypolicy dot com [1] seems to generate the kind of privacy policies the site featured in this post sets out to parse. There is humor in that.
[1] don't want to feed page rank as privacyparrot says: "Your information may be sold during a bankruptcy"
Cool, wonder if we can get some training data for "out of the ordinary" and screen for companies that are big frauds and then short them (like http://www.muddywatersresearch.com/ does)
Also, when you suggest adding “.com”, I suggest you strip spaces from the search as well. For instance, I searched for “Less Wrong” and found nothing, and you suggested “Less Wrong.com”. That doesn’t exist, but “LessWrong.com” does.
How can someone trust you to parse the policies correctly? What if someone sues you for incorrectly interpreting a policy which they then use to make a decision?
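Both input problems reported in this thread (spaces in the name, and "www." being forced onto subdomains) could be handled by normalizing the input before crawling. A minimal sketch, assuming a hypothetical helper `candidate_hosts`; the rule of only guessing "www." for two-label names is a heuristic of mine, not the site's behavior.

```python
from urllib.parse import urlsplit


def candidate_hosts(user_input):
    """Return hostnames to try, in order, for whatever the user typed."""
    text = "".join(user_input.split()).lower()    # "Less Wrong" -> "lesswrong"
    if "://" not in text:
        text = "http://" + text                   # lets urlsplit find the host part
    host = urlsplit(text).hostname or ""
    if not host:
        return []
    if "." not in host:
        host += ".com"                            # bare name: guess .com
    candidates = [host]
    # Heuristic: only guess a www. variant for names like example.com,
    # never for something that already has a subdomain.
    if not host.startswith("www.") and host.count(".") == 1:
        candidates.append("www." + host)
    return candidates
```

With this, "news.ycombinator.com" is tried as typed, and "Less Wrong" becomes "lesswrong.com" with "www.lesswrong.com" as a fallback.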
A lawyer could probably advise them quite quickly as to a TOS that would limit their liability from such an action by specifically informing users of the site that the information should not be relied upon. I'm not a lawyer either, but it doesn't look like it would be a big deal to get a good TOS to prevent these issues.
If the site's privacy policy explicitly states they will never sell your information, it may be possible to sue if they do so. There is some precedent with a toysmart.com case back in 2000, though it was settled out of court.
The main point of checking for the bankruptcy clause is to raise awareness of the risks involved so people can make an informed decision before trying a new site.
I am just learning some of these machine learning tools and am rapt, so forgive me for asking, but would you be able to explain a little about what you are doing?
How are you generating features? Stanford parser? Are you using logistic regression or something more advanced?
I love the idea. I am interested in applying some of these concepts myself. Do you have any ideas that you are not able to pursue yourself, that I might take a crack at?
Works just like spam filtering: We're using a naive Bayesian classifier with training data. Built and tested a custom extractor that makes the most sense for the legalese of privacy policies.
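For anyone following along, here is a generic spam-filter-style naive Bayes classifier in plain Python. It is a sketch under assumptions, not the site's code: the word-level `tokenize` stands in for the custom extractor mentioned above, and the class name, labels, and four training sentences are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Stand-in feature extractor: lowercase word tokens only.
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.doc_counts = Counter()               # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus log likelihood of each word, with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, far too small for real use.
nb = NaiveBayes()
nb.train("we may sell your personal information to third parties", "sells")
nb.train("your information may be sold to advertisers and partners", "sells")
nb.train("we will never sell or rent your personal information", "does_not_sell")
nb.train("we do not sell your data to anyone", "does_not_sell")
```

Calling `nb.classify("we never sell your data")` picks "does_not_sell" on this toy data. A real classifier would need far more labeled policies, and features that cope with negation better than single words do.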
Ideas: email me at michaelaiello (at) michaelaiello.com
Would it be possible to also identify sites that share your information?
A great example:
facebook.com
Does not sell your private information.
But they obviously do share information and while this is apparent to most users, how many sites practice the same and users are not aware of it?
Also, any plans to capture changes in privacy policies over time? Oftentimes, site owners do not proactively notify users when their policies or legalese have changed.
We do plan to note changes in the policies. We've considered a feature that lets you subscribe to the policies you are interested in and get notices if they change, or if the news mentions that the site lost your data. What do you think?
I'd love a site that stored diffs of all legal docs for large organizations. I was trying to find old TOS of amazon.com the other day, and couldn't find them anywhere.
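Storing diffs between crawled versions is cheap to prototype with Python's standard `difflib`. A minimal sketch, assuming each snapshot of a policy is kept as plain text; `policy_diff` and the labels are hypothetical names.

```python
import difflib


def policy_diff(old_text, new_text, old_label="previous", new_label="current"):
    """Return unified-diff lines between two snapshots of a policy."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


old = "We never sell your data.\nWe use cookies."
new = "We may sell your data.\nWe use cookies."
for line in policy_diff(old, new):
    print(line)
```

An empty list means the policy did not change between crawls, so a subscription feature would only need to notify when the list is non-empty.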
It does not appear to like subdomains. I've been trying to get it to visit http://news.ycombinator.com and see what it thinks, but I keep getting back: "We were unable to connect to http://www.news.ycombinator.com. If it exists, please try again later."
This is very cool. Maybe you could make a scriptlet bookmark that pops something on the page you are viewing. Here's one that will redirect you to the privacy parrot page: javascript:___location.href="http://www.privacyparrot.com/privacy-policy-for-+___location.ho...;
I tried the policizer with some copied and pasted policies, but it frequently told me "CAN SELL" just because the text did not include any specifics regarding selling and bankruptcy.