The paper assumes that popular repositories are popular on their own merit. But as you can see from my chart, there are clear spikes: one when the repository was first released, one a few days afterward, and one in February for literally no reason. The causes of those spikes were Hacker News, /r/programming, and /r/webdev respectively.
There's also a huge selection bias problem with only looking at Top Repositories. While understandable from a scalability standpoint, it's also possible that repositories are popular because of good marketing.
Correct. A lot of disaffected programmers think "marketing" is a dirty word and they will have no hand in it. It's a type of magical thinking that forgets the movie Field of Dreams ("if you build it, they will come") is about a literal miracle. There's really no such thing in the real world.
"Advertising" open source projects doesn't typically involve taking out ads on Facebook, but you still need to perform the marketing task. Get out to events and stand up as a speaker--event organizers are always desperate for content. Write blog posts. Make videos demonstrating how to use your tech.
If you don't care about growing your project beyond yourself, sure, feel free to skip these things. But if you're really serious about growing your project, it should take about half your total time on the project.
Speaking at events really doesn't seem that effective to me at gaining exposure. One post on Reddit would net 25-30x the number of stars that a conference/large meetup gig would.
It's more useful for reaching out and making stronger real life contacts with people who might be interested in using your software, but the net isn't cast nearly as wide.
On multiple occasions for various projects over the years, I've received several thousand views to my sites from a Show HN or a Reddit post. In comparison, doing a talk is "only" 25 or so views. If all you care about is views, clearly posting online is better.
But I care about people actually using my software. And giving a talk usually ends with 3 or 4 people actually trying my project. Or if I'm running a class, it's 30 or 40. In the last two years, I'm pretty sure there is one person from all the Show HNs and Reddit posts who is using my project. From even the small number of local talks I've given, I have a small cohort of people not only using my software, but giving me feedback on it as well. We see each other once a month, at least. It's one of the greatest feelings ever to talk to a person, face to face, who is excited about your project and hasn't run away even after using it.
Also, I can't guarantee that a post I make will get to HN's front page where it will get so many views. But at this point in time, I get asked to do a talk about once a month. If I really put the effort in, I could probably be doing talks once a week.
And all those thousands of HN viewers never once offered to sponsor the continued development of my project. But my progress on giving in-person talks got me in front of just the right people and now I can work on my project full-time.
In a similar situation, my wife has a couple of sci-fi novels that she self-publishes online. The vast, vast majority of her sales have been in-person at book fairs. People at book fairs are ready to buy. They are there because they want to spend money. It's an easy sale. Online, you have to intrude on people and fight against the million other people self-publishing.
Yes, you cast a much wider net online. But it's a net with big holes and for much, much smaller fish. Follow the numbers, the important ones, not "views" or "stars" or other such things that aren't cold, hard cash. The cash says in-person marketing is significantly easier to execute than online.
> The paper assumes that popular repositories are popular on their own merit.
I read the paper and don't see this assumption. The authors seem well aware of marketing, calling the fastest growth cluster "viral", and conducting a tiny survey of developers finding ... that the developers posted to HN and social media.
Added: I agree that only looking at the most popular repos is a big problem. The authors (inadequately, in my view) acknowledge this in threats to validity.
> The authors seem well aware of marketing, calling the
> fastest growth cluster "viral"
Viral is a description of rate of growth. It makes no statement about the reason for the growth, so much of the viral growth may have been due to people searching GitHub for keywords that would describe a solution to the problem they're having. I know that's what I do.
Exactly. I have a reasonably popular GitHub project (Haraka, an SMTP server) and frankly the stars mean nothing.
What really means something for an open source project is two things:
- for ego only: the large scale users (e.g. one user I spoke to a couple of weeks ago is receiving over 200 million emails a day with it, with a single server doing over 150k concurrent connections - that's why I built Haraka and it's a good feeling knowing it can do that)
- for contributions: I don't want to be the sole developer - I have other stuff to do. Getting contributions from the community is a far better metric than stars. But more stars do tend to equal more contributions.
By "literally no reason," I mean that nothing done in the repository itself caused the exposure, which is exactly the problem with the results presented. (See the commit log: there were no commits shortly before the Feb 9 date of the threads: https://github.com/minimaxir/big-list-of-naughty-strings/com...)
I skimmed through this, but their conclusions are far too simplistic to be of any use. A company repo is more popular than a personal one? Are they implying forming a company around an unpopular repo will make it popular? It's likely that if there's a company behind it then it's being done as paid work, that it has specifications needed to interop with something else, with management behind it driving quality, perhaps to a deadline. Second, saying more contributors equates to greater success is a tautology. Or are they suggesting one can simply give strangers commit access and that alone will determine success? More likely, success and popularity attract contributors. And repos tend to get more stars after a release, so if one releases every hour they'll get the most stars, right? Fix a typo, it's version 79.0. Document a method, it's version 80.0. Or maybe it's announcements around a release that serve as marketing?
Lots of correlations in their conclusion, but no causation. The paper could be shortened to "Have more visible activity in your repo".
> Second, saying more contributors equates to greater success is a tautology. Or are they suggesting one can simply give strangers commit access and that alone will determine success? More likely, success and popularity attract contributors.
They don't say this. They note that the measure of popularity they're using (stargazer count) is weakly correlated with commits and contributors.
> And repos tend to get more stars after a release, so if one releases every hour they'll get the most stars, right?
While there is a mention of Hacker News, I was surprised to see that the authors apparently hadn't attempted to correlate popularity with Reddit links. Anything that I've had become remotely popular is because it got a few upvotes on HN or Reddit...and anecdotally, I've seen Reddit/HN launch niche libraries into the 1000+ star group, even if the library is relatively niche (hell, I'll star things that look cool and had interesting discussion, even if it's likely I'll never clone/fork the repo)
Meanwhile, libraries that were ubiquitous by the time GitHub became popular have relatively few stars. The most prominent example in my mind is ruby/rake, which has just 627 stars: https://github.com/ruby/rake
ruby/rake is in maintenance mode, so there really isn't a lot of reason for people to notice it or even star it. You can see how active it is, in the screenshots below:
And as the last screenshot shows, in the last year, they only changed 104 files with 227 commits from 23 contributors. And if you look at the churn graph, there wasn't a lot of churn between 2011 and 2015.
Now compare this to something like GitLab (18,000 stars), which is my go-to repo for showing high rates of activity. This is their churn for the last 7 days, which doesn't count added/deleted files and changes by merge commits. The reason for not counting added/deleted files is that I was told they are doing a lot of restructuring.
In the last 7 days, they changed 505 files with 390 commits, from 35 different contributors. In one week, they roughly doubled ruby/rake's activity for a whole year. What would be interesting to know is, of the 35 contributors, how many are GitLab employees.
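To put the two sets of numbers on the same footing, here is a rough sketch that converts ruby/rake's yearly totals into a per-week rate and compares them with GitLab's week. It only uses the figures quoted above; the function and variable names are made up for illustration, and contributor counts are left out because distinct contributors can't be averaged per week the way commits can.

```python
# Rough per-week comparison using only the figures quoted in this thread.
WEEKS_PER_YEAR = 365 / 7

rake_last_year = {"commits": 227, "files": 104}    # ruby/rake, one year
gitlab_last_week = {"commits": 390, "files": 505}  # GitLab, last 7 days


def weekly_ratio(one_week, one_year):
    """Ratio of a one-week total to the average weekly rate of a yearly total."""
    return {
        key: one_week[key] / (one_year[key] / WEEKS_PER_YEAR)
        for key in one_week
    }


ratios = weekly_ratio(gitlab_last_week, rake_last_year)
for key, ratio in ratios.items():
    print(f"{key}: GitLab's week is about {ratio:.0f}x ruby/rake's average week")
```

On raw totals, GitLab's week is about 1.7x rake's year in commits and about 4.9x in files changed, so "doubled" is a fair summary only if you average the two measures.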
I would also love to analyze GitHub's and Atlassian's Bitbucket development repos. I can't imagine they are iterating at the pace that GitLab is and I have yet to find another open source project that is going at their rate.
I also found both the methodologies and the conclusions to be very simplistic. They conclude that repository age isn't a factor and then give apple/swift as an example. I don't use swift, but my understanding is that it's been an ultra popular language for a long time, but only recently got open sourced. I don't think they account for project existence before GitHub appearance which may be a significant factor for a good amount of top projects...
> We also reported the existence of a strong correlation between stars and forks, a week correlation between stars and commits, and a week correlation between stars and contributors (RQ #2)
OK, yes, this is a lame thing to point out, but it just strikes me as weird: they spelled "weak" wrong.
I happen to have a scraper handy for tracking stars over time, and profiling the users who made those stars (https://github.com/minimaxir/get-profile-data-of-repo-starga...). Here is a chart of the daily number of Star events on the big-list-of-naughty-strings repository as of today: http://i.imgur.com/NzzjuKK.png
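For anyone who wants to build a similar chart, here is a minimal sketch of the counting step in Python. It is not the code from the scraper linked above. It assumes you already have the star timestamps as ISO 8601 UTC strings, and the function names, the sample timestamps, and the "more than three times the median day" spike rule are all hypothetical choices for illustration.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def daily_star_counts(starred_at):
    """Count star events per calendar day from ISO 8601 UTC timestamps."""
    days = (
        datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").date()
        for ts in starred_at
    )
    return dict(sorted(Counter(days).items()))


def spike_days(counts, factor=3.0):
    """Days whose count exceeds `factor` times the median daily count.

    The threshold is an arbitrary illustrative rule, not a standard one.
    """
    threshold = factor * median(counts.values())
    return [day for day, n in counts.items() if n > threshold]


# Made-up sample data, NOT real star events from any repository.
sample = [
    "2015-08-10T12:00:00Z",
    "2015-08-10T13:30:00Z",
    "2015-08-10T18:05:00Z",
    "2015-08-10T20:00:00Z",
    "2015-08-11T09:00:00Z",
    "2015-08-12T10:00:00Z",
    "2015-08-13T10:00:00Z",
]

counts = daily_star_counts(sample)
spikes = spike_days(counts)
```

Lining up the spike days against the dates of HN or Reddit threads is then a manual step; nothing in the star data itself says where the traffic came from.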
Here's another one of my repositories currently at 458 stars (https://github.com/minimaxir/facebook-page-post-scraper) which only got exposure after a Show HN last Thursday: http://i.imgur.com/8qpTrrb.png