Showing posts with label Rafe Donahue. Show all posts

Thursday, February 12, 2009

spring is just around the corner

... and we know what that means: Continuing Education courses through ASA!

Make your plans now to attend "Fundamental Statistics Concepts in Presenting Data: Principles for Constructing Better Graphics" in warm and cheerful Alexandria, VA on Friday April 17.

More information available at
http://www.amstat.org/education/learnstat/fscpd_pcbg.cfm


(Lifted from an e-mail from Rafe Donahue. Long-time readers of the blog may remember the day I meant to go to the gym, got a 102-page handout from Rafe and sat devouring the document for the next two hours. This is the course for which the unputdownable handout was written. Post here. Link to handout here.)

The ASA is doing its best to exclude credit-crunched riff-raff from the course:

*Cost:
$475 for ASA Members
$375 for Students
$615 for Nonmembers
*Registration fee includes course material and lunch on both days


but, well, if you can get sponsorship from big pharma or your local drug dealer it looks like a good day out.

(On a separate but not unrelated subject, I've just been reading Malcolm Gladwell's Outliers. Readers charmed by the logical incoherence and slapdash anecdotal style of The Tipping Point and Blink will not be disappointed by the new book. (Yes, yes, I know, a reader who failed to be charmed by Mr Gladwell's previous two books had no business buying the third; the title seemed to promise more in the way of statistical substance.) Anyway, the source of grievance is not really the existence or shortcomings of this particular book, but the non-existence of the brilliant book the Man from Tennessee could have written if given the nod. Outliers is on sale at Gatwick, "Fundamental Statistics Concepts in Presenting Data: Principles for Constructing Better Graphics" is available on PDF at Rafe's website and to anyone with $615 burning a hole in their pocket who happens to be in the DC area on April 17. But.

Look, the question ostensibly being addressed by Mr Gladwell is not
"How can I make lots of money selling intellectually underpowered blather to intellectually underpowered readers?" The question being addressed is "What are the secrets of success?" Mr Gladwell's view is that talent is being squandered; many more people could achieve excellence than actually do so. And one of the "secrets" is that success comes to people who work hard, who persevere with difficult subjects, who come from a cultural context where hard work is valued. Another is that cultures where the language of mathematics is simple, requiring little cognitive processing to learn and deploy, achieve strikingly better results in mathematics. But in that case surely Outliers itself was an opportunity to push the mass of readers toward a level of excellence not on offer in the educational system, a level their culture had persuaded them was confined to those with exceptional mathematical gifts. Edward Tufte has argued that if information design is used well it can support analysis in a way that a general audience can follow; instead of cluttering up the book with the textual equivalent of chartjunk, Gladwell could have shown readers that they had the capacity to deal with presentation and analysis of complex material. That is, he could have done what RD does in his hand-out. Well, we are the hollow men, we are the stuffed men, leaning together, headpiece filled with straw. Alas!)

Friday, July 18, 2008

...of the return

Woke up yesterday at 4.30am. Decided to go to the gym. I got a new bike the other day off Craigslist, which I've been keeping inside to prevent theft. Took it downstairs, set off for the gym, got to the gym, realised I had left the bike lock behind. Always a danger when a bike is not being actively protected from theft by its lock between times of active use.

Back to the apartment. It was a glorious day. Why not go for a bike ride? But first, why not clear the kitchen counter so it would be clear when I got back? And why not make the bed and clear the bedroom floor of boxes so it would be clear when I got back? (It's only 5.30am, after all.)

And while we're at it, why not check e-mails?

Big mistake.

In my Inbox was an e-mail from Rafe Donahue with a link to the PDF of a 102-page handout, Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics.

Needless to say, instead of going for a bike ride first and reading Principles for Constructing Better Graphics later, I'm unable to refrain from downloading and opening the document.

Big mistake.

Before I know it I'm up to page 48 and laughing out loud. I'm deep in an account of work done for a pharmaceutical company which wanted to know how many on the sales force were actually reading various monthly reports sent out to them, with a view to seeing whether the reports were improving sales. I read:



The May data month data appear to have been released on the Tuesday after the 4th of July holiday break. With the 4th landing on a Sunday, the following Monday was an off day. The May data month data are pushed to 50% cumulative utilization in about a week as well, with no individual day having more than the typical 20% usage.

We might now glance down to the cumulative row and note the dramatic spikes when new data were released and the within-week declines and the dismal weekends, although Sunday does seem to trump Saturday.

We might also glance to the top and notice the curious case of the March data month data still being used by at least someone after the April data month data had been released, and after the May data month data, and so on. Someone was still using the March data month data in early October, after five updated versions of this report had been issued! Why?

Looking at the May data month, the eagle-eye might notice some curiosities in the May data month utilization before the May data were released. Note the very, very small tick on approximately 19 May, and again on (approximately) 14, 23, 24, and 30 April: the May data month data had been viewed nearly two months before they were released!?! Furthermore, now note with some horror that all of the cumulative lines are red and show some utilization prior to the reports actually being issued!

This glaring error must certainly point to faulty programming on the author’s part, right? On the contrary, an investigation into the utilization database, sorting and then displaying the raw data based on date of report usage, revealed that hidden within the hundreds of thousands of raw data records were some reports that were accessed prior to the Pilgrims arriving at Plymouth Rock in 1620! How could such a thing be so?

[As I think I've said, I've been trying to get the LRB to let me write about hysterical realism and information design. James Wood is agin the type of novelist who wants to show how the world works instead of the inner life - but it seems to me that Principles for Constructing Better Graphics not only shows us something about how the world works and how to find out about it, but in the process shows us the inner life of (surprise!) a statistician who is interested in the graphical presentation of data. We read on, agog:]

The answer is that I lied when I told you that the database held a record of the date and time each report was accessed. I told you that because that is what I had been told. The truth is that the database held a record of the date and time that was present on the field agent’s laptop each time the report was accessed.
[It's hard not to love this.]

Some of the field agents’ laptops, therefore, held incorrect date and time values. Viewing all the atomic-level data reveals this anomaly. Simply counting the number of impulses in a certain time interval does nothing to reveal this issue with the data.

But I know that it is not possible for those usages to take place before the reports were released, so why not just delete all the records that have dates prior to the release date, since they must be wrong? True, deleting all the records prior to the release would eliminate that problem, but such a solution is akin to turning up the car radio to mask the sounds of engine trouble. The relevant analysis question that has been exposed by actually looking at the data is why are these records pre-dated?

We see variation in the data; we need to understand these sources of variation. We have already exposed day-of-the-week, weekday/weekend, and data-month variation. What is the source of the early dates? Are the field agents resetting the calendars in their computers? Why would they do that? Is there a rogue group of field agents who are trying to game the system? Is there some reward they perceive for altering the calendar? Is there a time-zone issue going on? Are the reports released very early in the morning on the east coast, when it is still the day before on the west coast? And, most vitally, how do we know that the data that appear to be correct actually are correct???
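[The handout's remedy, flagging the anomalous records and asking why rather than silently deleting them, takes only a few lines of code. A sketch in Python, with invented agent names, dates, and release date, since the actual utilization database is obviously not public:]

```python
from datetime import date

# Hypothetical records of (field agent, laptop date at access); the names,
# dates, and release date below are invented for illustration.
records = [
    ("agent_01", date(2004, 6, 8)),
    ("agent_02", date(2004, 5, 19)),   # "used" weeks before release
    ("agent_03", date(1620, 11, 11)),  # laptop clock wildly wrong
    ("agent_04", date(2004, 6, 9)),
]
release_date = date(2004, 6, 7)  # assumed release date of the May report

# Flag, don't delete: a pre-dated record is evidence about how the laptop
# clocks (and hence every timestamp in the data) may be wrong.
flagged = [(who, when) for who, when in records if when < release_date]
clean = [(who, when) for who, when in records if when >= release_date]

print(f"{len(flagged)} of {len(records)} records pre-date the release")
for who, when in flagged:
    print(f"  ask why: {who}, laptop date {when}")
```

[Deleting `flagged` would make the table look right; keeping it is what exposed the laptop-clock problem in the first place.]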




I was, naturally, unable to tear myself away. By the time I'd finished reading the weather had changed, it was cloudy and dull. Bad, bad, very bad.

Is it too late to change careers and be a statistician? Say it ain't so, Rafe, say it ain't so.

The full report here.

Sunday, November 18, 2007

One we made earlier

I got an e-mail from Rafe Donahue a couple of months ago with two plots that I couldn't get to display properly, and so time passed, time passed, but I am now having another shot at it, including also Rafe's comments:

...logistic regression allows one to predict a categorical outcome as a function of other variables. The output is typically a cumulative logit (log-odds), which is linear, but those are hard to understand. So I drew the probabilities and their confidence intervals.

So, you can see that as your TAC score goes up, you are less likely to be in the No Disease group, and if you have a TAC score of about 10,000 you have about a 50-50 chance of being in the Rutherford 1,2,3 club (disease) or the Rutherford 4,5,6 club (very bad disease).



I did the analysis in SAS and drew the plot in R.
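[For anyone wondering what converting the logits involves: a log-odds value x becomes a probability via 1/(1+e^-x). A minimal sketch in Python with invented coefficients, not Rafe's fitted SAS values, which the e-mail doesn't give; they are chosen so the 50-50 point between the two disease clubs sits near a TAC score of 10,000:]

```python
import math

def logistic(x):
    """Inverse logit: map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical cumulative-logit model: logit P(Y <= j) = alpha_j - beta * TAC.
# Both intercepts and the slope are made up for illustration.
alpha = {"no_disease": 2.0, "rutherford_123": 9.2}
beta = 9.2 / 10000  # puts the disease / very-bad-disease 50-50 point at TAC = 10,000

def group_probs(tac):
    p_le_none = logistic(alpha["no_disease"] - beta * tac)
    p_le_mild = logistic(alpha["rutherford_123"] - beta * tac)
    return {
        "No Disease": p_le_none,
        "Rutherford 1,2,3": p_le_mild - p_le_none,
        "Rutherford 4,5,6": 1.0 - p_le_mild,
    }

for tac in (0, 10000, 40000):
    print(tac, {k: round(v, 3) for k, v in group_probs(tac).items()})
```

[As TAC rises the No Disease probability falls away, exactly the behaviour described above.]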

Also attached is a plot that shows how to predict TAC score from disease group. This one shows the means, confidence intervals for the means, and prediction intervals for the individual scores.





Meanwhile Rafe has sent another e-mail with his thoughts on the study of biostatistics, it's not all cool plots was the gist, I know, I know, I know . . .

Sunday, September 9, 2007

and one and two and

Got this e-mail from Rafe Donahue, a biostatistician at Vanderbilt University:

Ok, so one day I was looking at heart rate data. You go to the doctor and the technician takes your pulse. You sit still for 15 seconds and they count. Then they multiply by 4.

Or they can count for 30 and multiply by two.

Or they can just count for 60 and then there is no difficult math involved.


Heck, you could count for 2 minutes and
*divide* by two.

When I was a little kid, I convinced myself, before I knew anything about statistics and probability, that you could count for 1 second and multiply by 60 and then get a pulse of 0 or 60 or 120 and then if you did a weighted average of a bunch of single seconds, it would all work out! A child prodigy I was; then I grew up and look what happened.


So I am looking at some heart rate data and I decide to draw the histogram and look: there are little spikes at 48 and 52 and 56 and 60 and 64 ... and smaller spikes at 50 and 54 and 58 and 62 ... and very few readings at odd numbers. So of course different places where the pulses are taken use different counting schema!

In fact, if you draw the histogram for the individual sites, you can see which ones did what! Goodness, who ever thought one would need to standardize _taking a pulse!_
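[The spikes are easy to reproduce. A sketch in Python with synthetic heart rates and a made-up mix of sites, assuming each site counts whole beats in its window and scales up to beats per minute:]

```python
import random
from collections import Counter

random.seed(0)

def measured_rate(true_bpm, count_seconds):
    """Count whole beats in the window, then scale up to beats per minute."""
    beats = int(true_bpm * count_seconds / 60)
    return beats * (60 // count_seconds)

# Made-up mix of sites: some count for 15 s, some for 30 s, some for 60 s.
true_rates = [random.gauss(70, 8) for _ in range(6000)]
readings = [measured_rate(r, random.choice([15, 30, 60])) for r in true_rates]

counts = Counter(readings)
mult4 = sum(n for bpm, n in counts.items() if bpm % 4 == 0)
even_not4 = sum(n for bpm, n in counts.items() if bpm % 2 == 0 and bpm % 4 != 0)
odd = sum(n for bpm, n in counts.items() if bpm % 2 == 1)
print(f"multiples of 4: {mult4}, other even: {even_not4}, odd: {odd}")
```

[The 15-second sites can only ever produce multiples of 4, the 30-second sites multiples of 2, so the histogram spikes fall exactly where Rafe describes.]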

Then someone I know sends me the attached picture. These are diastolic blood pressure readings from a clinical trial. These are the baseline values. There are something like 6000 readings total; it is a big trial. The guy who sent the plot added the smoothed density estimate.


At the end of the trial, the dbp values will be examined, probably by doing some t tests. And the assumptions will be that the data come from a Normal, or Gaussian, distribution. Ha!


So, what will be the impact of that digit preference? I'm not sure, but I know that if the rounding is not symmetric relative to the original distribution, there will be bias. In fact, we will probably be able to show that one can make a treatment difference arbitrarily big or small by choosing a suitable rounding scheme.


Go figure. So much for Normal data.
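[And the rounding bias is easy to demonstrate on synthetic data: apply two rounding schemes to the same Normal readings, one symmetric and one not, and the schemes alone separate the means by about 2.5 mmHg. A sketch with invented Normal data, not the trial's:]

```python
import random

random.seed(1)

def round_nearest(x, base):
    return base * round(x / base)

def round_down(x, base):
    return base * int(x // base)

# Invented Normal "diastolic blood pressure" readings, not the trial's data.
true_dbp = [random.gauss(85, 10) for _ in range(6000)]

nearest = [round_nearest(x, 5) for x in true_dbp]  # symmetric rounding
down = [round_down(x, 5) for x in true_dbp]        # asymmetric rounding

def mean(xs):
    return sum(xs) / len(xs)

print(f"true mean:          {mean(true_dbp):.2f}")
print(f"rounded to nearest: {mean(nearest):.2f}")  # roughly unbiased
print(f"rounded down:       {mean(down):.2f}")     # biased low, about 2.5 mmHg
```

[If one trial arm's sites happened to favour one scheme and the other arm's the other, that 2.5 mmHg would show up as a "treatment effect" before any drug did anything.]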


Here's the graph:

I ask Rafe if I can post this on pp and he says Sure. He comments that it is real world clinical data, but it's better not to name the pharmaceutical company ("although there is no doubt that they all look like this"), adding:

We need to make sure that the point is that the data are funky; no one is _trying_ to use them to be deceitful. But when you actually look at the data, sometimes things look different from what you might expect. And the downstream implications are pretty much unknown.

Oh, and in other news, you can read some news about the lottery in TN. They switched from a physical machine with numbered balls to a computer system. Naturally someone screwed up the programming and no one noticed. Then someone noticed something was seemingly goofy but they didn't know what to really check; they didn't know how to do the probability computations and, to make it worse, they didn't know that they didn't know how to do the probability computations. Here is the link where they asked me about the probabilities:
http://www.tennessean.com/apps/pbcs.dll/article?AID=2007709020373

Sunday, July 29, 2007

Bivariate Baseball Plot

Rafe Donahue, a biostatistician at Vanderbilt University, has sent me a link to an interactive website that uses the statistical graphics program R to produce a bivariate baseball plot. Devised in collaboration with Tatsuki Koyama, Jeffrey Horner and Cole Beck (as Rafe has pointed out in the comments), it works like this:

The user selects the team and year in which s/he is interested

then goes on to select from: Day of the Week, Opponent Team, Opponent League, Day/Night, Starting Pitcher (I know readers have seen drop-down menus before, but they are not usually this much fun), Opponent Starting Pitcher, Home/Away, Pitcher with Decision, Opponent Pitcher with Decision, Month, or First/Second Half.

R then produces a bivariate plot displaying the results:



As you'll have noticed from the menus, you can then print out your graphic as a PDF.

The Baseball Scoreplot blog explains how to read a baseball bivariate score plot, discusses known issues and analyses the graphic Rafe generated for the Astros, with Roger Clemens as starting pitcher.

The Astros’ opponents’ marginal distribution (on the left) shows how teams fare against teams that beat them: their average is just over 3.5 rpg compared with nearly 4.5 rpg for the Astros. Where the Astros were held to 1 run 27 times, their opponents were held to 1 or fewer on 42 occasions. Note that Clemens started 2 games that were shutouts and started 11 games where the opponents were held to fewer than 2 runs. He also started a game where the opponents scored 9 runs.

The joint distribution reveals details of Clemens’ abysmal run support. The bottom-left corner of the distribution shows five games which Clemens started in which the Astros lost 1-0, a pitcher’s nightmare. So, of the 11 games that Clemens started in which the opponents were held to one run, 5 failed to produce a single Houston run. In fact, Clemens was the only Astros pitcher to start a game in which the team lost 1-0.
(Graphic available on blog.)

We never see this kind of thing in fiction.