
Tuesday, February 17, 2015

GG über alles

Reading Gerd Gigerenzer's "Mindless Statistics" (here).

In a café. Back in Berlin. A better blogger would include quotations from this splendid article, but we at pp are somewhat threadbare after months of showing the flag.

Thursday, August 2, 2012

ich wünschte ich wüßte...

Cathy and Cosma both feel that knowing specific programming languages is not essential. To quote Cathy, "you shouldn’t obsess over something small like whether they already know SQL." To put it politely, I reject this statement. To apply to a data science job without learning the five key SQL statements is a fool's errand. Simply put, I'd never hire such a person. To come to an interview and draw a blank trying to explain "left join" is a sign of (a) not smart enough or (b) not wanting the job enough or (c) not having recently done any data processing, or some combination of the above. If the job candidate is a fresh college grad, I'd be sympathetic. If he/she has been in the industry, you won't be called back. (One not-disclosed detail in the Cosma-Cathy dialogue is what level of hire they are talking about.)

Why do I insist that all (experienced) hires demonstrate a minimum competence in programming skills? It's not because I think smart people can't pick up SQL. The data science job is so much more than coding -- you need to learn the data structure, what the data mean, the business, the people, the processes, the systems, etc. You really don't want to spend your first few months sitting at your desk learning new programming languages.
Both Cathy and Cosma also agree that basic statistical concepts are easily taught or acquired. Many studies have disproven this point, starting with the Kahneman-Tversky work...

Terrific post by Kaiser Fung (of Junk Charts and Numbers Rule Your World) - not least for the thrill of discovery that Cosma Shalizi is, er, aggressively discussing...

Friday, March 23, 2012

Within any education category, richer people vote more Republican. In contrast, the pattern of education and voting is nonlinear. High school graduates are more Republican than non-HS grads, but after that, the groups with more education tend to vote more Democratic. At the very highest education level tabulated in the survey, voters with post-graduate degrees lean toward the Democrats. Except for the rich post-graduates; they are split 50-50 between the parties.
What does this say about America’s elites? If you define elites as high-income non-Hispanic whites, the elites vote strongly Republican. If you define elites as college-educated high-income whites, they vote moderately Republican.
There is no plausible way based on these data in which elites can be considered a Democratic voting bloc. To create a group of strongly Democratic-leaning elite whites using these graphs, you would need to consider only postgraduates (no simple college grads included, even if they have achieved social and financial success), and you have to go down to the below-$75,000 level of family income, which hardly seems like the American elites to me.

Andrew Gelman, Statistical Modeling... http://andrewgelman.com/2012/03/voting-patterns-of-americas-whites-from-the-masses-to-the-elites/

Saturday, February 11, 2012

infra dig

I recently read a discussion between Nathan Englander and Jonathan Safran Foer in the Guardian (they were talking about their new edition of the Haggadah, translated by NE and edited by JSF). JSF said he did not read reviews. A fortiori I'm guessing he does not check out GoodReads and then dig down to the stats on ratings. If one is going to indulge in this undignified practice I can't help feeling the best plan is to keep quiet about it. But! We have data! (You know my weakness for data.)

Lightning Rods was rejected by 17 editors when Bill Clegg sent it out; it had been rejected by another 4 or 5 editors years earlier; this looks like a unanimous rating of <= 2 stars.

Here is a dear little bar chart from GoodReads:



This isn't quite what I would expect the distribution to look like if people either loved it or hated it, which (editorial consensus notwithstanding) seemed to be the response among people who read it pre-publication, but there's certainly much more variation than among readers of The Last Samurai:


I contemplate the fact, though, that many of the people who HATED the book are of my mother's generation - and my mother HATES COMPUTERS.  She tried e-mail, grudgingly, for years; six years into the trial she had not gone online once to check out a website.  So she would certainly not sign up for GoodReads; if the sort of person likely to hate the book is also the sort of person unlikely to sign up for GoodReads, this would naturally affect the distribution.
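For what it's worth, here is a toy version of that selection effect in Python, with entirely made-up numbers: a readership split evenly between people who would rate the book 4-5 and people who would rate it 1-2, where most of the latter never sign up to rate anything at all.

    import numpy as np

    rng = np.random.default_rng(0)
    lovers = rng.choice([4, 5], size=500)   # would rate the book 4-5 stars
    haters = rng.choice([1, 2], size=500)   # would rate the book 1-2 stars
    # suppose only 20% of the haters ever join GoodReads and leave a rating
    observed = np.concatenate([lovers, haters[rng.random(500) < 0.2]])

    print("mean rating if everyone rated:", np.concatenate([lovers, haters]).mean())
    print("mean rating actually observed:", observed.mean())

The observed mean comes out a full point or so higher than the "true" one, purely because of who bothers to show up.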

How much easier life would have been, I can't help thinking, anyway, if the distribution among editors had matched that of readers on GoodReads. Or rather -- it's so complicated with editors. Bill said 16 out of 17 editors thought the book was funny and well written but they could not see publishing it, which maybe means they anticipated most readers giving it a rating <= 2 stars. Would an anticipated distribution like that of GoodReads have tipped the balance? (How much easier life would have been had the distribution of editors anticipating a distribution like that on GoodReads matched the distribution on GoodReads . . .) But regrets are fruitless. On with the show.

Tuesday, December 27, 2011

The psychologist Gerd Gigerenzer has shown that if conditional probabilities are reinterpreted as frequencies, people have no problem in interpreting their meaning (see the discussion "Risk School" in Nature 461, 29 October 2009). Gigerenzer has been promoting the idea that trigonometry be dropped from the high school math sequence (no one uses it except surveyors, physicists, and engineers) and probability theory be added. This sounds like a great idea to me.

Herbert Gintis reviewing Daniel Kahneman's Thinking, Fast and Slow over at our very dear friends at Amazon (HT, as too often, MR) [We at pp are huge fans of GG, not that it helps: we feel that if our very dear friends in the biz had but read GG, and then immersed themselves in the oeuvre of ET, we could have been a contender.] [This is not necessarily the most insightful quote from HG wrt DK, but we at pp are, as we say, huge fans of GG.]
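Gigerenzer's point is easiest to see side by side. A minimal sketch, with illustrative numbers of my own choosing (1% prevalence, 90% sensitivity, 9% false positive rate), not his:

    # illustrative numbers only: 1% prevalence, 90% sensitivity, 9% false positives
    prevalence, sensitivity, false_pos_rate = 0.01, 0.90, 0.09

    # conditional-probability phrasing (the one people stumble over)
    p_disease_given_positive = (prevalence * sensitivity) / (
        prevalence * sensitivity + (1 - prevalence) * false_pos_rate
    )
    print(f"P(disease | positive test) = {p_disease_given_positive:.2f}")

    # frequency phrasing (the one Gigerenzer recommends)
    population = 1000
    sick = round(population * prevalence)                    # 10 people are sick
    true_pos = round(sick * sensitivity)                     # 9 of them test positive
    false_pos = round((population - sick) * false_pos_rate)  # 89 healthy people also test positive
    print(f"Of {population} people, {sick} are sick; {true_pos + false_pos} test positive, "
          f"of whom only {true_pos} are sick.")

Same arithmetic both times; only the second version tends to be understood.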

Stop press!!!!!! New Yorkers take note! 


On Saturday January 21 at 2.00pm Edward Tufte will conduct an open forum answering questions about analytical design, art, the creative process, and public service. Free event, ET Modern.
On Monday January 23, 2012, Edward Tufte will give his one-day course, "Presenting Data and Information," at ET Modern. The Monday course filled up quickly and is now closed, so we've now added another course day: Sunday, January 22, 2012. See below for course information and registration.

Saturday, November 26, 2011

disgusted in topeka

pp has not had much to say about statistics lately.  So. Data. Cussedness thereof.

Lightning Rods took a long time to get published. It was very different from The Last Samurai, so different that 50% (at a guess) of readers who loved TLS hated the book. This is not encouraging to a publisher, whichever half of the 50% he happens to side with.

You'd never guess it now that the book has been published.  Reviews have been, for the most part, extremely enthusiastic.  (Sloth prevails over shameless self-promotion; I could throw in lots of links, but sloth, as I say, prevails.)  This does not really give an accurate picture of responses to the book.

My publicist, Tom Roberge, was swamped by requests for review copies. Not everyone who asked for an ARC wrote a review. Some loved the book. Others HATED it. The ones who hated it hated it so much they couldn't bring themselves to waste time writing a review.

The result being that, if you go by reviews, you'd be likely to see this as a book with a 3.8 GPA. A, A, A, A+, A+, A++, A-, B+, B+ . . . Because the people who HATED the book, the people who would give the book a C, C-, D+, or downright F -- hated it so much they couldn't write a review.




Wednesday, August 10, 2011

Andrew Gelman on the difference between information visualization and statistical graphics:

When I discuss the failings of Wordle (or of Nightingale’s spiral, or Kosara’s swirl, or this graph), it is not to put them down, but rather to highlight the gap between (a) what these visualizations do (draw attention to a data pattern and engage the viewer both visually and intellectually) and (b) my goal in statistical graphics (to display data patterns, both expected and unexpected). The differences between (a) and (b) are my subject, and a great way to highlight them is to consider examples that are effective as infovis but not as statistical graphics. I would have no problem with Kosara etc. doing the opposite with my favorite statistical graphics: demonstrating that despite their savvy graphical arrangements of comparisons, my graphs don’t always communicate what I’d like them to.

I’m very open to the idea that graphics experts could help me communicate in ways that I didn’t think of, just as I’d hope that graphics experts would accept that even the coolest images and dynamic graphics could be reimagined if the goal is data exploration.

To get back to our exchange with Kosara, I stand firm in my belief that the swirly plot is not such a good way to display time series data–there are more effective ways of understanding periodicity, and no I don’t think this has anything to do with dynamic vs. static graphics or problems with R. As I noted elsewhere, I think the very feature that makes many infographics appear beautiful is that they reveal the expected in an unexpected way, whereas statistical graphics are more about revealing the unexpected (or, as I would put it, checking the fit to data of models which may be explicitly or implicitly formulated). But I don’t want to debate that here. I’ll quarantine a discussion of the display of periodic data to another blog post.

The whole thing here.

Tuesday, November 9, 2010

under the hood

“You say autism, or Down syndrome, and people know somebody,” said Ms. Dopp, who stays home with Jackson and his three siblings. “When you try to explain 7q to people and they barely know what a chromosome is, it’s hard.”

via MR, here

Thursday, November 4, 2010

black swans

Andrew Gelman revisits his review of Taleb's The Black Swan.

And then there are parts of the review that make me really uncomfortable. As noted in the above quote, I was using the much-derided "picking pennies in front of a steamroller" investment strategy myself--and I knew it! Here's some more, again from 2007:

I'm only a statistician from 9 to 5

I try (and mostly succeed, I think) to have some unity in my professional life, developing theory that is relevant to my applied work. I have to admit, however, that after hours I'm like every other citizen. I trust my doctor and dentist completely, and I'll invest my money wherever the conventional wisdom tells me to (just like the people whom Taleb disparages on page 290 of his book).

Not long after, there was a stock market crash and I lost half my money. OK, maybe it was only 40%. Still, what was I thinking--I read Taleb's book and still didn't get the point!

Actually, there was a day in 2007 or 2008 when I had the plan to shift my money to a safer place. I recall going on the computer to access my investment account but I couldn't remember the password, was too busy to call and get it, and then forgot about it. A few weeks later the market crashed.

If only I'd followed through that day. Oooohhh, I'd be so smug right now. I'd be going around saying, yeah, I'm a statistician, I read Taleb's book and I thought it through, blah blah blah. All in all, it was probably better for me to just lose the money and maintain a healthy humility about my investment expertise.

Andrew was kind enough to have me to dinner (along with Jenny Davidson) while I was in New York; Andrew is probably one of the few who are more charismatic in person than in avatar (possibly because backed up by the exceptionally charismatic Caroline, Jakey and Zach). This in itself would be sufficient justification for blogging (at a purely personal level); the thing that is of real significance, though, is the fact that AG was able to write a review with self-determined word count -- and then revisit it in light of events. Show me the paper publication that lets reviewers write a review of the review years later, at a word count dictated by developments in the world rather than by paper constraints -- I don't think so.

Tuesday, August 17, 2010

Statisticians hate small numbers (samples); now there is another reason to hate small numbers. In one word, scams.

The FTC has shut down a scam in which the crooks have sneaked through 1.35 million fraudulent credit-card charges, each valued at $0.25 to $9 -- after letting it run for four years. What's shocking is that less than 5% of the victims (78,724) noticed and reported the charges. So, instead of stealing $1 million from one person, steal $1 from a million.

Kaiser Fung on Small Numbers and Scams
Granger causality is a standard statistical technique for determining whether one time series is useful in forecasting another. It is important to bear in mind that the term causality is used in a statistical sense, and not in a philosophical one of structural causation. More precisely, a variable A is said to Granger cause B if knowing the time paths of B and A together improves the forecast of B based on its own time path, thus providing a measure of incremental predictability. In our case the time series of interest are market measures of returns, implied volatility, and realized volatility, or variable B. . . . Simply put, Granger's test asks the question: Can past values of trader positions be used to predict either market returns or volatility?
A report on Granger causality, discussed by Andrew Gelman at Statistical Modeling...

AG:
I have nothing to say on the particulars, as I have no particular expertise in this area. But in general, I'd prefer if researchers in this sort of problem were to try to estimate the effects of interest (for example, the amount of additional information present in some forecast) rather than setting up a series of hypothesis tests. The trouble with tests is that when they reject, it often tells us nothing more than that the sample size is large. And when they fail to reject, it often tells us nothing more than that the sample size is small. In neither case is the test anything like a direct response to the substantive question of interest.
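For concreteness, here is what such a test looks like in practice: a minimal sketch on synthetic data using the statsmodels routine, with made-up series standing in for trader positions and volatility, not the report's actual analysis.

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(0)
    n = 500
    positions = rng.normal(size=n)   # stand-in for trader positions (variable A)
    volatility = np.zeros(n)         # stand-in for realized volatility (variable B)
    for t in range(2, n):
        # B depends on its own past and on lagged A, so A should Granger-cause B here
        volatility[t] = 0.5 * volatility[t - 1] + 0.4 * positions[t - 2] + rng.normal()

    # column order matters: the test asks whether the second column adds
    # predictive power for the first beyond the first's own past values
    grangercausalitytests(np.column_stack([volatility, positions]), maxlag=3)

Gelman's complaint applies directly: with 500 observations the p-values here come out tiny, which by itself says little about how much forecasting power is actually added.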

Thursday, August 12, 2010

ESPN's Bill Simmons (aka The Sports Guy) recently suggested that the primary cause of dwindling interest in Red Sox games by fans is that baseball games these days are too long. "It's not that fun to spend 30-45 minutes driving to a game, paying for parking, parking, waiting in line to get in, finding your seat ... and then, spend the next three-plus hours watching people play baseball", he says.

Erm, I always thought the reason I thought baseball games were too long was that I was not interested in baseball. Had not considered the possibility that a 3-hour game might put off people who actually liked the game.
Revolutions (News about R &c) offers a plot in ggplot2 to determine, anyway, whether the data support the claim that games are getting longer.

Monday, July 19, 2010

After being awarded a Rosenwald Fellowship, established by the clothing magnate Julius Rosenwald to aid black scholars, he attended the Institute for Advanced Study at Princeton but left after a year when, because of his race, he was not issued the customary invitation to become an honorary faculty member. At Berkeley, where the statistician Jerzy Neyman wanted to hire him in the mathematics department, racial objections also blocked his appointment.

Instead, Mr. Blackwell sent out applications to 104 black colleges on the assumption that no other schools would hire him. After working for a year at the Office of Price Administration, he taught briefly at Southern University in Baton Rouge, La., and Clark College in Atlanta before joining the mathematics department at Howard University in Washington in 1944.

Obituary of David Blackwell, NYT, ht Andrew Gelman

Thursday, July 15, 2010

I think some of the confusion that has arisen from Ed Tufte's work is that people read his book and then want to go make cool graphs of their own. But cool like Amis, not cool like Orwell. We each have our own styles, and I'm not trying to tell you what to do, just to help you look at your own writing and graphics so you can think harder about what you want your style to be.

Andrew Gelman at Statistical Modeling...

Saturday, June 26, 2010

Friday, May 28, 2010

Expressing the problem using a population distribution rather than a probability distribution has an additional advantage: it forces us to be explicit about the data-generating process.

Consider the disease-test example. The key assumption is that everybody (or, equivalently, a random sample of people) is tested. Or, to put it another way, we're assuming that the 10% base rate applies to the population of people who get tested. If, for example, you get tested only if you think it's likely you have the disease, then the above simplified model won't work.

This condition is a bit hidden in the probability model, but it jumps out (at least, to me) in the "population distribution" formulation. The key phrases above: "Of the 10 with the disease . . . Of the 90 without the disease . . . " We're explicitly assuming that all 100 people will get tested.


Andrew Gelman
on assumptions underlying calculations of conditional probability.
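To make the hidden condition concrete, a small sketch with an assumed test (90% sensitivity, 10% false positive rate; my numbers, not Gelman's), comparing the case where the 10% base rate really does apply to everyone tested with the case where only people who already suspect they are ill show up.

    def ppv(base_rate, sensitivity=0.9, false_pos_rate=0.1):
        # chance of having the disease given a positive test, at a given base rate
        true_pos = base_rate * sensitivity
        false_pos = (1 - base_rate) * false_pos_rate
        return true_pos / (true_pos + false_pos)

    print(ppv(0.10))  # everyone tested, base rate 10%: about 0.50
    print(ppv(0.40))  # only the worried get tested, base rate 40% (assumed): about 0.86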

Tuesday, May 18, 2010

damn lies and good graphs

maybe they're right that crappy chartjunk graphs are better than crappy non-chartjunk graphs. But I don't think it's appropriate to generalize to the claim that chartjunk graphs are better than good graphs.

Is chartjunk undeservedly maligned? Possibly, said a recent post on Infosthetics. Andrew Gelman offers the critique I wish I'd written (instead of, shame, shame, frivolously linking without comment). The rest here.

Sunday, May 16, 2010

Think about it. One absolutely cannot tell, by watching, the difference between a .300 hitter and a .275 hitter. The difference is one hit every two weeks. It might be that a reporter, seeing every game that the team plays, could sense that difference over the course of the year if no records were kept, but I doubt it. Certainly the average fan, seeing perhaps a tenth of the team's games, could never gauge two performances that accurately -- in fact if you see both 15 games a year, there is a 40% chance that the .275 hitter will have more hits than the .300 hitter in the games that you see. The difference between a good hitter and an average hitter is simply not visible -- it is a matter of record.

But the hitter is the center of attention. We notice what he does, bend over the scorecard with his name in mind. If he hits a smash down the third base line and the third baseman makes a diving stop and throws the runner out, then we notice and applaud the third baseman. But until the smash is hit, who is watching the third baseman? If he anticipates, if he adjusts for the hitter and moves over just two steps, then the same smash is a routine backhand stop -- and nobody applauds.

...

So if we can't tell who the good fielders are accurately from the record books, and we can't tell accurately from watching, how can we tell?


Bill James, 1977 Baseball Abstract, quoted in Michael Lewis, Moneyball
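James's 40% figure can be checked by simulation, under assumptions he does not spell out; here is a rough sketch that supposes 15 games seen, four at-bats per game, and hits falling as independent coin flips at .275 and .300.

    import numpy as np

    rng = np.random.default_rng(0)
    sims, games, at_bats_per_game = 100_000, 15, 4
    n = games * at_bats_per_game

    hits_275 = rng.binomial(n, 0.275, size=sims)
    hits_300 = rng.binomial(n, 0.300, size=sims)

    # fraction of 15-game samples in which the .275 hitter out-hits the .300 hitter
    print((hits_275 > hits_300).mean())

Under these (crude) assumptions the answer comes out around a third -- the same ballpark as James's 40%.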

Thursday, May 13, 2010

Gordon Lish in the House of Bayes

Andrew Gelman gives a statistician's take on the Lish method:

In any case, in the grand tradition of reviewing the review, I have some thoughts inspired by DeWitt, who quotes from this interview:

LRS: I was studying writing at college and then this professor showed up, a disciple of Gordon Lish, and we operated according to the Lish method. You start reading your work and then as soon as you hit a false note she made you stop.

Lipsyte: Yeah, Lish would say, "That's bullshit!"

If they did this for statistics articles, I think they'd rarely get past the abstract, most of the time. The methods are so poorly motivated. You're doing a so-called "exact test" because . . . why? And that "uniformly most powerful test" is a good idea because . . . why again? Because "power" is good? And that "Bayes factor"? Etc.

The rest here.

Saturday, May 8, 2010

ppv v fp

When asked about the accuracy of a mammogram, doctors cite the "false positive rate". Ignore the false positive rate; what patients really need to know is the "positive predictive value" (PPV), that is, the chance of having breast cancer given that one has a positive mammogram.

The PPV for mammography is 9 percent. Nine percent! You heard right. For every 100 patients who test positive, only 9 have breast cancer; the other 91 do not. This may sound like a travesty but it's easily explained: breast cancer is a rare disease, afflicting only 0.8% of women, so almost all women who take mammograms do not have cancer, and even a highly accurate test when applied to large numbers of cancer-free women will generate a lot of false alarms.


Junk Charts on Steven Strogatz's piece for the NYT on Bayesian reasoning
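The 9 percent can be reconstructed from Bayes' rule. The 0.8% prevalence is in the quote; the 90% sensitivity and 7% false positive rate below are assumed, roughly the figures usually attached to this example.

    prevalence = 0.008          # from the quote: 0.8% of women have breast cancer
    sensitivity = 0.90          # assumed: P(positive mammogram | cancer)
    false_positive_rate = 0.07  # assumed: P(positive mammogram | no cancer)

    ppv = (prevalence * sensitivity) / (
        prevalence * sensitivity + (1 - prevalence) * false_positive_rate
    )
    print(f"PPV = {ppv:.0%}")   # about 9%: of 100 positive mammograms, only ~9 are cancers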