paperpools: Excel

Showing posts with label Excel. Show all posts

Wednesday, May 21, 2008

plots

Kevin Connolly has very kindly sent me a new improved plot of the gender mnemonics data using Excel 2007

which really is awfully nice.

Monday, May 19, 2008

Best excuse ever

My dear friend Rafe Donahue has sent many helpful suggestions with regard to ggplot2. He comments:

Don't think of the ± of 0.2 to each of the data at the repeat
locations as modifying the data.  You are not modifying the data, you
are providing a different plotting algorithm for those points that share
a ___location with other points.

a line of argument which it is hard not to love. I'm not saying it's not valid, no... and yet I see myself, down the years, explaining innocently that I was not actually modifying the data as such, I was just providing a different plotting algorithm etc. etc.

Meanwhile Hadley Wickham very kindly sent the correct line of code to change the y axis. I type this in, and by the simple procedure of providing a different plotting algorithm for those points that share a ___location with other points produce

which really is terribly nice. Further information on ggplot2 is available here.

Readers who have not spent much time with Excel charts may be inclined to accept uncritically the complaints of PP; a wealth of information on what can be achieved is available at Peltier Technical Services, here.

On the subject of providing a different plotting algorithm for data whose points overlap, Rafe reminds me that

As I said before, we did this in the
baseball plot data y putting them in little boxes.  Of course, you need
to decide on the size of the box, etc, but that is the price to pay.

For those who missed the great bivariate baseball score plot the first time round, this enables the user to select a team and one or more aspects of its game history and generate a bivariate plot, for example

You can create your own plots here.

Sunday, May 18, 2008

losing the plot

In the previous post I mentioned the results of a couple of tests on memorisation of articles in a language class. A reader pointed out that the plots given separately could be combined in a bivariate plot. Too true. This was weighing on my mind even as I posted the simple pair of plots in the last post.

The results, I remind you, were:

Test 1: 2 6 6 7 8 8 9 9 13 14 14 15 15 16 16 17 18 19
Test 2: 18 20 20 20 20 18 20 19 19 18 17 20 20 19 20 20 19 19

Since the data are currently in Excel, I run them through the chart wizard to generate a scatter plot and come up with:

which is not at ALL what I want. Why does the y axis start at 16.5? Why is it broken down in increments of .5, when the number of correct answers was always an integer? I want a chart that shows the area that's blank because NOBODY got a second score below 17.

The Excel Chart Wizard does not offer the option of customising the y axis. Since I always expect the worst of Excel, I assume there is nothing to be done. In my hour of shame, I come up with the dodgy solution of, ahem, adding a dummy set of results at 0,0. This produces

which is an improvement, but I am, needless to say, deeply mortified by the false result at the lower lefthand corner. I suddenly think: But what if I double-click on the y axis! Sure enough this brings up a dialogue box which lets me set the y axis to my own specifications, and...

Ha!

There's just one slight problem. I know this tiny database, which means I know that 9 people got 20 on the second test, whereas the line at 20 shows only 7 results. The chart has fallen victim to overplotting; the two people who got 6 on Test 1 and 20 on Test 2 have been collapsed into a single dot, as have the two who got 15 followed by 20. Excel has come through once, but I can't believe there's a way to jitter the plot points. I retreat cravenly to inserting bullet points by hand in the basic grid:

This is clunky, no doubt about it, but since I've been doing it all by hand it's easy to see the two people at 6 who got 20 and the two at 15 who got 20:

I then realise that I can achieve a similar result in the charts feature by tampering, yet again, with the data: if I replace the pairs (6,20; 6,20) with (5.8,20; 6.1,20) and use the same dodge on 15 I come up with

Good. Good. (I mention all this because Excel is what most readers are likely to have in the home; it's easy to assume that feeding data into a chart will generate a chart that displays all the data.)

At this point, needless to say, I do not feel happy about a chart that depends on fudging the data. I now do what I should have done in the first place, which is to take it all into R. How much better it would all look, I think, if I used Hadley Wickham's ggplot2 package!

So I put the data into R. Vanilla R produces a plot which throws up a y axis that starts at 17 and moves by increments of .5 to 20, which means it is necessary to rifle through much PDF documentation (which is, of course, why I did not take this very simple task to R in the first place). ylim produces the right axis but doesn't look very nice, so I load ggplot2 and get this

which is very pretty but has yet another y axis starting at 17 and going up to 20 in increments of .5. There passed a weary time, each tongue was parched and glazed each eye, in other words ylim does not do the business in ggplot2, some other method of tinkering is called for, I spend much time rifling through the documentation of ggplot2 both in PDF and at geom_point
(I knew this would happen) trying to work out what to do. Wickham's work is inspired not only by Tufte but by Lee Wilkinson's Grammar of Design, which means that the documentation discusses the rationale underlying the package, which is, of course, both interesting and admirable but unhelpful if you just want to know how to do in ggplot2 what ylim does in vanilla R. Finger in the page. geom_point does make it easy to jitter, so I try that out and get

which is actually not what I want at all, because I only want to jitter the four points where there is overlapping. I think there is a way to fix this (I think it is possible to select horizontal jitter), but how late it is, how late.

At this point, naturally, I begin to wonder whether it is not somewhat infra dig to put all this low-level milling about on display; how much better just to relegate it all to the drafts folder! Wait till I have worked through ggplot2 properly and at some later date post a series of handsome plots, drawing on a more interesting range of data sets, with an air of effortless ease. Yes.

(I revert to my paltry little Excel chart. Wouldn't it be better to have gridlines that divided the area in four? Would it be better if the numbers on the x axis were closer to the points, i.e. at the top of the plot?

Well, maybe. It's clear that about half the participants got under half the answers right on Test 1, and everyone got better than 75% right on Test 2, so that's quite nice. And it does look somewhat like a Smeg refrigerator into the bargain.)

One problem with writing novels is that you often find that there is some software somewhere that looks as though it might do some specific thing that you need for some particular chapter, which may well never be needed again. So you find yourself simultaneously at the embarrassing amateur stage of, who knows, maybe 10 or 15 different programs. So what you would really love to have is the literary equivalent of a director of photography - a technical advisor whose job it is to answer questions like 'How do I fix the axes in ggplot2?' But this is really at odds with the whole Weltanschauung of the publishing industry. But enough, enough.

I then think, but maybe it would be nice to see the two sets of data in a line plot. I am somewhat demoralised by my adventures with ggplot2, so I run them through Excel and get

which is, of course, hideous.

But also enlightening.

Participants got 3 minutes to learn the genders of 20 words. Pre-technique, half remembered fewer than half. Post-technique, half remembered 100%; all remembered 80% or better. ONE PERSON, who started with a score of 19, failed to raise the score. In a word unknown to the immortal bard, blimey.

I don't know how well they would have performed if they had been tested again after half an hour, or 5 hours, or 5 days; this is, one would have thought, an obvious question, but it was one that was not answered in the class.

Meanwhile, behind the scenes... I draft an e-mail to Hadley Wickham, pleading for help. I then realise that my dear dear friend Rafe Donahue, despite his exasperation with the sort of person who is seduced by pretty plots, is still my dear dear dear dear friend. I send an e-mail to my dear friend...

And meanwhile, what to my wondering eyes should appear, but a newsletter from Linotype celebrating the birthday of Adrian Frutiger. I mooch around the Linotype website, checking out the Akira Says column: the most recent essay is on Frutiger, but there are also essays on dashes (hyphens, en-dashes, em-dashes), small caps...

And I am OUTRAGED

because the thing is, when you see a book into print, an 'expert' will be given a month or so to go through the text to introduce 'correct' dashes, capitalisation and so on, which the author can then spend up to 6 months trying to remove

but the thing is, let's be sane. Fine-tuning the dashes and caps is never going to achieve significant improvement in the reader's grasp and retention of the text. When I say 'significant' I'm not poaching on statistical preserves, I'm talking about the kind of improvement displayed in a pair of tests on memorisation of gender. Text A gets 50% of readers, we'll say, getting things wrong 50% of the time; improved Text B gets 100% of readers getting things right 80% of the time or better. You're not going to see that, because, um, text has an extremely limited capacity to convey information in the first place. Whereas, of course, if you start with two sets of numbers

Test 1: 2 6 6 7 8 8 9 9 13 14 14 15 15 16 16 17 18 19
Test 2: 18 20 20 20 20 18 20 19 19 18 17 20 20 19 20 20 19 19

and convert them to some kind of graphic display (as above), you can dramatically improve your chances of conveying a pattern of change. And if the graphic display has the allure of a Smeg, it will dramatically improve the chances that the sort of reader who has hitherto loathed graphs will suddenly be downloading R, braving PDF documentation, collecting data on self, friends and relations for the sheer entertainment of turning it all into graphs.

The point being, if publishers hired statisticians instead of copy-editors and designers, so that the author spent a few months going over the text with a statistical expert instead of the sort of person who knows his en-dashes, it would still be a lot of work, but it would be worth it.

Meanwhile it's a dark, gloomy day.

Sunday, May 11, 2008

pin factories and such

Via Marginal Revolution, came to a post by Stephen Dubner of Freakonomics on specialisation -- linked, on the one hand, to our failure to see a connection between our patterns of behaviour and (say) global warming, and, on the other hand, to the system which enables one to fritter away hours, days, weeks, months at the keyboard snacking on food produced by the sweat of someone else's brow.

Dubner quotes the Babylonian Talmud:

Ben Zoma once saw a crowd on one of the steps of the Temple Mount. He said, Blessed is He that discerneth secrets, and blessed is He who has created all these to serve me. [For] he used to say: What labours Adam had to carry out before he obtained bread to eat! He ploughed, he sowed, he reaped, he bound [the sheaves], he threshed and winnowed and selected the ears, he ground [them], and sifted [the flour], he kneaded and baked, and then at last he ate; whereas I get up, and find all these things done for me.

And how many labours Adam had to carry out before he obtained a garment to wear! He had to shear, wash [the wool], comb it, spin it, and weave it, and then at last he obtained a garment to wear; whereas I get up and find all these things done for me. All kinds of craftsmen come early to the door of my house, and I rise in the morning and find all these before me.

I come across this while trying to sort out logistics for bringing things out of storage in the UK and transporting them to Berlin; have been putting this off for 4 years, because the piano needs a Luton van with a tail lift. A Luton van is the smallest available size of box van, a size just manageable for someone who has never driven anything larger than a car; I have driven one before (this was how I got the piano from London to the North in the first place), but that did not involve taking it on a ferry and driving a vehicle from drive-on-the-left Britain across drive-on-the-right Europe and back again.

The things in London went into storage when I went to NY in 2003; my editor had said I could work directly with the designer on my poker book, but he wanted to use his own designer. So I went to NY, and we negotiated a contract, and I could not get my publishers to provide the designer. The things in Leeds went into storage in 2000; my second agent, Andrew Wylie, had made tough noises when we met, claiming that the agency would bring ruthless efficiency to bear on -- this is the kind of thing that makes that book by Graham Greene look so good. Don't tell me about the past, tell me about the future.

It's a glorious day.

I had been thinking a while back that I would like to do an intensive course on driving an HGV, which can be done in Britain in a month for about £800, thinking how exceptionally helpful it would have been if I had been able to learn this at school. Thinking how much better off just about everyone would be if they were taught how to drive large vehicles with manual transmission at school. Asking myself: who expects to get through a lifetime without transporting stuff?

I've been spending a lot of time on Powerpoint, which the sort of job I might apply for tends to require. It makes driving an HGV look good, but the kind of job that calls for Powerpoint pays better. Have also been spending a lot of time on Excel pivot tables. These are moderately entertaining, especially when combined with MicroCharts, but I then get distracted and start playing around with Hadley Wickham's ggplot2. Had faintly hoped Powerpoint would look less deeply silly if I got to grips with it but it doesn't. Spent a long time working with hangul in Adobe Illustrator. Any wp program will let you select text and format it (Mellel will let you do more than most), but you can't select one element of a syllable in hangul, which the program perceives as a single unit: you must go into Illustrator, create a text box, type hangul, select with the Select tool, convert to outlines, select the element using the Direct Select tool, create a new layer, move the element to the new layer, and hey presto! format it. Question not the need.

This is what the hangul looked like when I was cutting and pasting and colouring in by hand back in 2005, pre-Illustrator:

There is one slight problem, which is that the font directly above (Seoul) is on my old Mac but not the new one and allegedly cannot be installed there. In a separate but related incident InDesign crashes when I try to open it in Leopard, but can't be installed on the old Mac (still on Tiger) because it does not have enough RAM. It will be obvious to the meanest intelligence that the author of this blog is precisely the sort of person who should leave design to the sweat of someone else's brow.

It's a glorious day.

You can type Chinese, Japanese and Korean in vanilla Illustrator, but it can't handle Greek, Russian, Arabic or Hebrew: you need the Eastern European version for the first two, the Middle Eastern version for the second. Same for InDesign. Poor head.

I read an agent's blog a while back (never a good idea); he thought query letters should not pose a problem for writers, because the query letter should simply express the writer's passion for the book. This is always hard to deal with. Quite a lot of working on a book feels pointless and stupid. Why am I sitting inside on a glorious day working on the deeply silly Powerpoint, mildly amusing pivot tables, delectable ggplot2 or possibly useful Illustrator? It's a glorious day. And even if (as I think) all these socially embedded means of presenting information show something interesting, and even if something even more interesting were to emerge if I dug deeper, how exactly is all this supposed to work in a book? This is the question I should be thinking about, no doubt, but instead I am wondering how long it will take to get Leopard on speaking terms with InDesign and whether it is worth adding more RAM to the old Mac.

So time goes by, and perhaps, in the end, there is a book for which hundreds of practical problems have been resolved - a feature that took hundreds of hours and thousands of dollars to fix, maybe, turns up on two or three pages. And other features that took hundreds of hours to get right turn out to be wrong for the book, but there was no way of knowing that ahead of time. The author is about as likely to be enthusiastic as a marathon runner who has just collapsed at the finish line. And the author is likelier, unfortunately, to like dealing with technicians - people with some kind of expertise, people obsessed with details, people who can reduce the amount of time it takes to get something right, freeing up time to go outside on glorious days.

Too much shadowboxing. Mr Ilya is seeing the world.

Wednesday, October 3, 2007

the bug in Excel

In Meknes, just checking emails to see if Justin Smith has set up an appointment with my dentist to fix my broken tooth on my return to Berlin (he has), I suddenly wonder what Joel Spolsky is up to (for this, I hear you say incredulously, you needed to go to Morocco?), and I find that Joel has a post on the bug in Excel! And why, if you multiply 77.1 x 850, you get 100,000 instead of 65,535! (It turns out that .1 is the binary equivalent of a repeating decimal.) All here.

I spent a few days in Chefchouen, in the Rif mountains. Many as yet inchoate thoughts on languages that fall outside formal educational systems for one reason or another - there are three or four Berber languages, of which only one has been written down, and there are no mechanisms in place to help travellers pick up enough of the basics to go to places where not much else is spoken.