Notably, this is what happens most of the time when something from /r/SubredditSimulator gets popular and becomes intermixed with normal submissions on a subscriber's front page and /r/all.
The way to beat a Turing Test is through hiding in plain sight.
9gag automatically reposts reddit's /r/all onto their site. People on 9gag don't realize it's a bot and have one hell of a time trying to interpret the post. Examples:
Leonardo da Vinci used to throw paint-filled sponges at walls and then force himself to make sense of the resulting irregular shapes in relation to a problem he was interested in. For example: he might be thinking about transportation and say "well, this looks like a horse drawing a carriage..." and then use it as a basis to come up with new ideas. By forcing his brain to make connections between totally unrelated things, he enhanced his creativity (or at least that was the intention).
What comes to mind for this Hacker News Simulator is the modern equivalent of da Vinci's sponges: an ink blot you can use to come up with completely new and novel ideas by forcing unexpected connections. And because the topics here are in some way related to Hacker News, the result could fill in enough blanks to produce something newsworthy (i.e. an actual good idea instead of a garbled Markov chain).
For a while I ran a Markov chain blog poster that drew on my own notes and posts from a site I used to run. A phrase in one of those posts ("Only a matter of seeing simple data structures and designing lightweight tools that can cross the galaxy") actually led to a chain of association that has spawned a five-year research thread. This technique can actually work.
That's really interesting, hadn't heard that before. A big difference here though is that the software is making (the appearance of) very specific logical connections itself -- with the ink sponges, you're creating a vague image that is open to human interpretation.
Some of the software-generated "logical connections" are not quite logical though, and require a little massaging/tweaking (human interpretation) in order to make sense. This process can result in interesting new logical connections.
Is that da Vinci method a bit like interpreting I Ching hexagrams? Put some randomness in, add some constraints, and interpret it to your current context. Should be a similar creativity enhancer.
Kind of off topic, but I have been looking at I Ching hexagrams as design patterns for life. Each hexagram represents a state you can be in. You can transition from one state to any other state, but there are consequences. The consequences are described in the transitions of the moving lines.
To make the most sense of this, it helps to read Richard Wilhelm's translation and annotations of the I Ching. If you don't read German, then Cary Baynes's translation of Wilhelm's translation is pretty much the only one out there. Unfortunately it's still under copyright (ha ha... on a 5000-year-old document), but it is widely pirated on the web.
I often wonder if the random "predictions" of the I Ching were originally just a way of studying. There are 4096 different transitions to look at, and approaching them systematically would be painful in the extreme.
I've spent some time checking whether the transitions actually make sense, and from what I can tell, they do (modulo my ability to fool myself into seeing reason in things that have no actual reason). I wish I had enough patience to actually study it ;-)
"The Science of Leonardo" by Fritjof Capra is easy to read and a good intro. Martin Kemp's "Leonardo da Vinci" is a bit scattered but has great visuals. But most interesting is Leonardo's own "Notebooks."
This is perfect. I opened it in a new tab and went off to read another article. After coming back and opening multiple links, I found myself thinking "where are all these posts with grammatical errors coming from" for a solid minute, until I noticed "pg" where my username should be!
My responses ranged from "they're doing what to Linux now?" to "that one sounds cool..." to "that's true..." to "...of course I'm clicking that" to "hm... well they must be going for an Indian motif".
And somewhere in there I was like "wow, /r/titlegore meets HN."
I'm very dense though. This was really well-made! :D
I still wish that 4th link was real. Like, I have a different taste in music, but I'd so reply to that. lol
I did virtually the exact same thing - except my first reaction on seeing the fake HN was: "Why the heck am I logged in as pg? Security hole? This ought to be fun, let's see what kind of things he can do in here..."
Took me a while to figure out that this was actually a fake :-(
The titles suddenly got a lot better! Fewer grammatical errors.
One of my favorites is "Hosted Continuous Integration Using Gradle, Android Studio And New York Times, Evernote, Gmail, and Quicksilver".
Another hilarious title: "Ask HN: Worst examples of really creative ways to combat late payments".
This sounds like a real post on HN, and actually makes sense.
Or maybe "Swift Language Will Instantly Know Everything About the Origin of 'The World's Dumbest Idea': Milton Friedman".
Yep, it uses the same technique. I pulled every comment and story from HN through the API and made a bunch of Markov chains to produce story titles and comments. The input corpus is a lot smaller than subredditsimulator's, though, and the various subreddits there have a large variation in the words used (meaning more funny comments), so it's nowhere near as good.
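Roughly, the chain building and sampling looks like this (a minimal sketch of the technique, not the site's actual code; the tiny corpus here is illustrative):

    import random
    from collections import defaultdict

    ORDER = 2  # words of context per Markov state

    def build_chain(titles):
        # Map each ORDER-word state to the list of words observed after it.
        chain = defaultdict(list)
        for title in titles:
            words = ["^"] * ORDER + title.split() + ["$"]
            for i in range(len(words) - ORDER):
                chain[tuple(words[i:i + ORDER])].append(words[i + ORDER])
        return chain

    def generate_title(chain, max_words=15):
        state, out = ("^",) * ORDER, []
        while len(out) < max_words:
            word = random.choice(chain[state])
            if word == "$":  # end-of-title marker
                break
            out.append(word)
            state = state[1:] + (word,)
        return " ".join(out)

    titles = ["Show HN: My weekend project", "Ask HN: My weekend reading list"]
    print(generate_title(build_chain(titles)))

The same kind of chain, fed comment text instead of titles, produces the comments.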
Perhaps a future enhancement might be to partition comments by submission ___domain, i.e. use only comments from github.com stories for fake github submissions; that might generate very different text than a munge of every comment.
You could also use the comments on reddit in /r/programming and /r/startups and other related subreddits to help get more data for seeding your corpus.
Or if you want to get more complex, find the reddit comments for every link that was submitted to HN and use those (but you have to be careful to use "hacker"-related subreddits or it will sound too "reddity").
Maybe aggregate over the listing of subreddits where HN links appear, and then whitelist some of them (after checking against a comment frequency count), like programming, linux, etc.
I made a similar thing a year or two ago, based on another Markov chain of HN headlines. I generalized it to let you mash up headlines from different news sources (Buzzfeed x Hacker News, for example). Still pretty funny, if anyone's interested: http://www.headlinesmasher.com/best/all
Think of all those poor, decommissioned teletypes we could put back into service. Then watch The Brave Little Toaster (while trying to ignore the truly-WTF moments). Then weep that this isn't a real future.
> Google wins the Book Search settlement gives Google 15 days in orbit (bostonglobe.com)
Google wins 15 days in orbit! Whee!
The comments are pretty great, too.
> Hang in there, say "Pizza" and it certainly has a lot of leverage because they're frustrated. Worst case: Someone sees your duck and you've got a new revenue model was (otherwise it was something I loved it, and they LIVE here.
Just hang in there, say "pizza", and make sure no one sees your duck.
> You're trying to solve bugs or problems.
>> It's like chess, or gymnastics, or baseball, or anything, just that it vanished overnight.
> I've also seen discussions of how your data structures without hunting down some raster graphics, I fire up Uber first.
> Your love of Pete, don't just repeat it with your keystrokes. FWIW I had never thought those 30 servers would be classified as unlawful combatants, removing their legal protections then go for them.
> Why I design software, I want to want to live in it
> Show HN: Solving the problem of what you read the Web
> Think Apple Would Dare To Be Upset About Aaron Swartz's life
I opened it in a background tab, lost track of it for a while, ended up back there thinking it was the normal ycombinator, and unwittingly spent a couple minutes thinking "wtf is up with HN today?"
In fact, I learned in a very silly thing to know. The risk is on controlling the hardware thats in there"Only when you get the impression that credit cards that my Time Warner unfamiliar with the goal these people as potential phishing as well as US government uses private business to sell your company in its own horn about being able to be great.
I kept looking at the URL and thinking: "How are they using the same ___domain?" I came back an hour later only to realize that the i and n are switched in ycombniator.
This reminds me of an exercise in Google's Python course, wherein it read in a text file and built a dictionary of every word and the words that followed that word, in order to generate prose that mimicked the style of the original author. It was quite interesting to run it against a text file and see what it produced.
EDIT: I have a cached version of the exercise if anyone feels like looking at it:
Absolutely fantastic, and probably will happen on Mars. Unfortunately, I can't read it. But Erlang is a human, but it's not unheard of, or even a real wood and lead pencil. A comment to mean what you mean how do we keep in mind when I can set a deadline, some basic programming with Scratch.
As an HN data-processing note: to remove the garbled characters, you need to convert the smart quotes from HN (among other things, like long dashes) into normal ASCII characters.
EDIT: Looks like the garbled characters were fixed.
Yeah, thanks for pointing that out. I took the nuclear approach and just ran everything through unicodedata.normalize("NFC", ...), which seems to have done the trick.
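For anyone doing similar cleanup: normalization fixes decomposed or mis-composed characters, but no normalization form turns curly quotes into ASCII, so an explicit translation table is one way to handle those (a sketch; the mapping is illustrative, not exhaustive):

    SMART_TO_ASCII = str.maketrans({
        "\u2018": "'",    # left single quote
        "\u2019": "'",    # right single quote
        "\u201c": '"',    # left double quote
        "\u201d": '"',    # right double quote
        "\u2013": "-",    # en dash
        "\u2014": "-",    # em dash
        "\u2026": "...",  # ellipsis
    })

    def to_ascii(text: str) -> str:
        return text.translate(SMART_TO_ASCII)

    print(to_ascii("\u201cIt\u2019s fine\u201d \u2013 really\u2026"))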
> I, being born a woman was violently beaten and robbed in a project/problem.
That's a weird auto-generated comment[1]. I wonder how much of it is random selection and how much is seeded text. What were the units that combined to form this?
[1] "Ask HN: Pure client-side PadMapper would be great as jobs? What a senior Rails dev?" post).
You've got to be kidding me. I've been further even more decided to use even go need to do look more as anyone can. Can you really be far even as decided half as much to use go wish for that? My guess is that when one really been far even as decided once to use even go want, it is then that he has really been far even as decided to use even go want to do look more like. It's just common sense.
It even uses real usernames for comments. I've apparently commented on "Watch Morley Safer Lie in Tech is Not a Single Blog Post" :P, and I've seen posts by patio11 (and referencing patio11 too!)
>I miss Google wave invites going for $26 a month (srikarg.github.io)
>Ask HN: Is there a good, standard capped convertible note paperwork?
>NSA leaks: David Cameron cracks down on Apache Quitting JCP: 'Oracle Is the Fear of Macros (techcrunch.com)
>RSS.gd: the RSS icon was mistaken for the end of the Union at CoreOS Fest 2015 – Call for New Startup BitcoinDeals is Launching its Own URLs? (technokyle.com)
I kind of feel bad I discovered what was going on in like 20 seconds.
I had increased the zoom on this site (the original font size is just too small for me), so when I opened this link and noticed that the zoom was reset, I immediately became suspicious.
Normally I don't like comments that are just jokes, but this one was too perfect. Well done.
And to add just a slight bit more substance to my comment, while I was reading through all the comments here my wife asked why I was laughing so hard. I found it really difficult to convey why, but I guess that's the nature of this type of humor.
Markov chains do learn in the sense that they model distributions of strings, can be trained on observed strings and used to generate strings, assign probabilities to strings, classify strings, etc. They have well developed treatments in multiple frameworks of computational learning theory, including Gold learnability and PAC learnability.
Is "deep learning" more than statistics and randomization?
There isn't, but I don't think it should be difficult to feed this into 'char-rnn' if you wanted to do it with RNNs rather than Markov chains. The interface, such as it is, to char-rnn is very simple; you dump everything into a text file 'input.txt'.
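Data prep for char-rnn is basically just concatenation (a rough sketch, assuming the dump is stored one JSON object per line with a "text" field, as in the samples further down; the filename is made up):

    import json

    # e.g. each line: {"text": "...", "author": "..."}
    with open("hn_comments.jsonl") as src, open("input.txt", "w") as dst:
        for line in src:
            comment = json.loads(line)
            # HN API text uses <p> as a paragraph separator; other tags are left as-is
            dst.write(comment["text"].replace("<p>", "\n\n") + "\n\n")

char-rnn then trains on the raw characters of that single file.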
I'm actually running that right now with HN comments. It's not done training, but it's not that much more interesting than OP's. Here's some example output:
{"text": "The article was in client 70, or a Denmark, is common captured - and I very well needed to be when they picked out time reports, all the reader or warning detectors and tools and proxchit matters. It also comes up with all levels of me using legality as it is that.<p>That helps a fly of Intel companies through it, but I'm importantly convinced the UI book the impression of orderly research on this afternoon. Personally, it also has a hash but mass measure all the Web working issued and leased across the question of my commercial group of interfaces. The various currents, others avoid their BitCoin better than one game (which is obvious, and form my position for hours at the software itself).", "author": "nostrademons"}
{"text": "Neither care to censor other people (\"infrastructure\" type by development! Neuroscience, flying migrations.)<p>Relevant comment was not finished at the moment. If that appears to be the case that, but ones are a real body.", "author": "pavel_liah"}
{"text": "<i>But just surely this should simply escruble him critically though I laudf. </i><p>It's taken as a more extreme shark to manage hcpm-infolves. However, I'm great, laser mortality, one of the aight payment.", "author": "jacquesm"}
{"text": "FtAhn is not all for violenceral campuse. Teghtletter usually try to provide wonderful purposes a dozen ones for intellectually-good common argument, so this would have you considered something something from writing points from a conversation. It's such a good idea and prohibition, disappeared. Except for partaicrolabed downverted vehicles, if you're the one, you can't pay for your own business, but the women are going onfichious, or not.<p>Edit: the processor thinks without searching stuff. <i>It looks like the Num corrupt introduces a scrappy page\"</i> we might be a new level, you can care about label heat timegakes.<p>> Two individuals. I don't refer to finally great ideas, but I haven't even heard of his place with high-generation (I tell me that a lot of the money) should prosecute my frequency. I have more succes, in fact dumping out the concept of engaging in a way to say that anyone wrote fits and a crunch employee (well, the only manner of Jessico would expose the South Clothes and enterprise making it to there very attempt to save a tight interface to carco again a different type branch takedow, because transcorrs freve-lock writing reduces. Grannian raises <i>the major</i> responsibility.)<p>(Nope! Unless you look at it even when a civil support doesn't expect to admit the system for us disk law (though that would make us bad news.)", "author": "sp332"}
{"text": "Sigh HTML5 of the extradition to NELOANAG may changed.", "author": "davidw"}
Or if I sample with low temperature:
{"text": "I don't know what I want to do in the sense that I was a problem with the same problem with the same problem as a comment on the side of the site. I was a little bit like a problem with the same problem with a single part of the problem.<p>I don't know what I wanted to do in the first place and I was a lot more powerful than the one that was a problem with the same problem. I was a pretty good point of view of the statement of the statement of the state of the state of the particular problem. I would have to say that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is the same as a single person who would have to pay for the same problem as a program that is a problem that is a problem with the same problem. ... <then it just keeps going like that>
You'll probably have to train separate char-rnn instances, unfortunately. For the past week or two, I've been experimenting with putting in a metadata prefix which I can use as a seed to specify author/style, but thus far it hasn't worked at all. The char-rnn just spits out a sort of average text and doesn't mimic specific styles.
Yup. That's been my finding as well. char-rnn was really just a diversion of curiosity after I'd cleaned up the data. My best idea right now is to make a generative model of p(next_token | previous_token(s), author), essentially connecting author directly to every observation. I'm mostly sure that using characters as tokens is overkill for this and requires a higher complexity model than I can afford with this dataset and my computational resources, so I'm going to stop using char-rnn with it.
That's possible. My hope was that you could get authorial style by just including it inline as metadata rather than needing to hardwire it into the architecture (eg take 5 hidden nodes and have them specify an ID for the author so the RNN can't possibly forget). It would have been so convenient and made char-rnn much more useful, but I guess it's turning out to be too convenient to be true.
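A word-level, author-conditioned model along those lines can be as simple as keying bigram counts on (author, previous word) (a minimal sketch of the p(next_token | previous_token, author) idea; the training data here is made up):

    import random
    from collections import Counter, defaultdict

    counts = defaultdict(Counter)  # (author, prev_word) -> next-word frequencies

    def train(comments):
        for author, text in comments:
            words = ["^"] + text.split() + ["$"]
            for prev, nxt in zip(words, words[1:]):
                counts[(author, prev)][nxt] += 1

    def generate(author, max_words=30):
        prev, out = "^", []
        while len(out) < max_words:
            options = counts.get((author, prev))
            if not options:
                break
            nxt = random.choices(list(options), weights=options.values())[0]
            if nxt == "$":  # end-of-comment marker
                break
            out.append(nxt)
            prev = nxt
        return " ".join(out)

    train([("patio11", "Charge more for your product"),
           ("patio11", "Charge more and measure everything")])
    print(generate("patio11"))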
Yeah. If you store them in a relational database you will grow a couple of grey hairs because you're essentially forcing a tree structure into a table structure. The concepts clash and it's kind of a pain, but possible since every post has 1 unique parent, so you can make upwards references and rebuild a tree from that.
Thank you, sir, for pointing that out. This looks very interesting and I might even implement it on my own blog at some point. I thought about doing something similar (without actually knowing this technique) but I shied away precisely because of the price you pay at insertion time.
EDIT: I've been thinking some more about this. Another possibility would be to limit the depth of the tree to, say, 8 (which should be reasonable) and then make 8 fields, one for each ancestor (parent, grandparent, and so on). Changing the tree will become a nightmare but all queries for subtrees will be blazingly fast.
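In table terms, that ancestor-columns idea looks roughly like this (a sketch using sqlite from Python; the column count and names are illustrative):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        anc1 INTEGER, anc2 INTEGER, anc3 INTEGER, anc4 INTEGER,
        anc5 INTEGER, anc6 INTEGER, anc7 INTEGER, anc8 INTEGER,
        body TEXT)""")

    # id 1 is a root; id 2 replies to 1; id 3 replies to 2.
    db.execute("INSERT INTO posts (id, body) VALUES (1, 'root')")
    db.execute("INSERT INTO posts (id, anc1, body) VALUES (2, 1, 'reply')")
    db.execute("INSERT INTO posts (id, anc1, anc2, body) VALUES (3, 1, 2, 'reply to reply')")

    # The whole subtree under post 1, in one flat query with no recursion:
    rows = db.execute(
        "SELECT id, body FROM posts"
        " WHERE ? IN (anc1, anc2, anc3, anc4, anc5, anc6, anc7, anc8)",
        (1,)).fetchall()
    print(rows)  # [(2, 'reply'), (3, 'reply to reply')]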
To clarify - more users are going to read and refresh pages than actually post, so making certain not every GET request results in a new database query would probably improve performance more than trying to limit the number of rows in each query.
Query performance obviously matters, but with a HN like site, it's probably not going to be so critical that limiting the depth of threads is even worth the effort.
A simple method (a modified adjacency list) I've used just stores the root id, parent id and id of each post together. You can get the entire tree from any root post easily (everything shares the same root id) but getting the whole subtree beyond immediate children takes recursion.
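A sketch of that approach (one query by root id, then recursion in memory; the row format here is assumed):

    from collections import defaultdict

    def build_tree(rows, root_id):
        # rows: (id, parent_id) pairs that all share the same root id,
        # e.g. from SELECT id, parent_id FROM posts WHERE root_id = ?
        children = defaultdict(list)
        for post_id, parent_id in rows:
            children[parent_id].append(post_id)

        def subtree(post_id):
            return {"id": post_id,
                    "children": [subtree(c) for c in children[post_id]]}

        return subtree(root_id)

    rows = [(2, 1), (3, 1), (4, 2)]  # 2 and 3 reply to 1; 4 replies to 2
    print(build_tree(rows, root_id=1))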
I find that you don't even have to worry about treating the data as a tree in most cases until the very end. What you want to actually deal with is a flat array with the ids (root,parent,id) arranged in rendering order, and to have the tree built in the HTML. The data set from the DB doesn't even have to represent the tree structure directly, as long as you can sort it elsewhere.
You can even have two arrays - one (say, an associative array) with the data, and another basic array with the ids. Sort just the array with the ids, then use those as keys to iterate the data array when building the html, so you can avoid ever having to sort the larger array (which as luck has it just happens to be optimized for non-linear access anyway.)
I should probably mention, I thought this was terribly clever when I did it in PHP, before I was aware that all arrays in PHP are basically the same, so it was mostly pointless overoptimization.
Building something like an unordered list in HTML from that array then becomes a matter of adding or removing <UL> elements based on the relative change in depth for each subsequent id. Depth is easy to find by checking whether an item's parent is (or isn't) the same as the id of the previous element. The actual tree never exists in code until the unordered list is rendered in the browser (roughly as in the sketch below).
If you actually know what you're doing beforehand, probably ignore everything I just said and go with nested sets. My method is, admittedly, naive and better programmers are probably chuckling at it over the beverage of their choice, but it does work and it seems to be decently fast.
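For the curious, a minimal sketch of that depth-delta rendering (rows are assumed to be pre-sorted into rendering order with depths already computed; the data is illustrative):

    def render(rows):
        # rows: (id, depth) pairs already sorted into rendering order.
        if not rows:
            return ""
        html, prev_depth = [], -1
        for post_id, depth in rows:
            if depth > prev_depth:
                html.append("<ul>" * (depth - prev_depth))  # open deeper levels
            else:
                # close the previous item, plus one level per step back up
                html.append("</li>" + "</ul></li>" * (prev_depth - depth))
            html.append(f"<li>post {post_id}")
            prev_depth = depth
        html.append("</li>" + "</ul></li>" * prev_depth + "</ul>")
        return "".join(html)

    # 1 is a root; 2 and 4 reply to 1; 3 replies to 2.
    print(render([(1, 0), (2, 1), (3, 2), (4, 1)]))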
Well, that was pretty awesome. Small suggestion: fix the times on the comment threads to make it even more realistic. Threaded conversations have people starting a thread 11 minutes ago but getting replies 35 minutes ago. Busted :D
It would seem the texts were all generated by Markov chains, which is why they are grammatically wrong. I would have preferred a grammar-oriented text generator. It's still too easy to tell the simulation from the real thing.
What's impressive is that when I see a typically Hacker News-ish headline or comment, I now find myself checking who I'm logged in as to make sure I'm on the real Hacker News.
I used to have yccombinator.com (interestingly it looks like somebody else picked it up after I let it expire) after noticing that there are quite a few places on the net that think that is the right name.
I think it's just the Hacker News layout filled with randomly generated content. I don't get the point though. Maybe playing with some text generation algorithm?
All the content is generated from every post/comment on Hacker News, hence the somewhat plausible post titles (if you squint a bit). There isn't much point to it, other than that I found it quite funny :)