html for ebooks

I’m starting to get really frustrated with Calibre, so I thought I’d look up how to format an ebook in HTML. This looks dead easy. And Mobipocket has free HTML-to-Mobipocket conversion tools; it looks like you can do epub conversion just by editing the HTML file.

Has anyone tried digging into this stuff? Honestly, I feel like an idiot for messing around with Calibre to convert from Word and LaTeX when it seems like it should be easy to create a clean HTML version of a book and go from there. I guess the Mobipocket conversion is probably the most important for most people, so if that tool can’t even do good work with clean HTML, maybe Calibre is still the best free tool for the job. I suppose I should try this out in my copious free time. (It also makes me think that maybe I should do my writing in HTML rather than LaTeX. I bet converting HTML to Word is enormously easier than LaTeX to Word. Plus no compiling. Hmm.)

when a number doesn’t equal itself

I spent a while on Wednesday getting screwed by floating point errors in R. Since it took me a bit of sifting through search results on “r number doesn’t equal itself” to find the problem, this post is basically a quick shot at raising the visibility of the solution.
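The short version, assuming the case you’ve hit is the usual one: binary floating point can’t represent most decimal fractions exactly, so two numbers that are mathematically equal can compare unequal. Here’s a minimal sketch of the problem and the fix in Python; the behavior in R is the same, and there the idiomatic fix is isTRUE(all.equal(x, y)) or an explicit tolerance.

    # The problem: values that "should" be equal aren't, bit for bit.
    a = 0.1 + 0.2
    b = 0.3
    print(a == b)        # False
    print(a - b)         # a tiny residue on the order of 1e-17, not 0

    # The fix: compare with a tolerance instead of with ==.
    import math
    print(math.isclose(a, b))       # True (Python 3.5+)
    print(abs(a - b) < 1e-9)        # True; the roll-your-own version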

misuse of the word “love”

I was coding up a Python script to do some data analysis, and I accomplished with some not-all-that-clever list comprehensions what would otherwise have taken a few lines of for loop. And I was pleased, and quoth unto myself, “Gosh, I love list comprehensions.”
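As a made-up illustration of the kind of thing I mean (this isn’t my actual analysis code): pulling out a subset of values, which would otherwise take a small for loop.

    # Made-up example: keep the reaction times from correct trials only.
    trials = [("correct", 0.52), ("error", 0.61), ("correct", 0.48)]

    # The for-loop version:
    correct_rts = []
    for outcome, rt in trials:
        if outcome == "correct":
            correct_rts.append(rt)

    # The list-comprehension version, in one line:
    correct_rts = [rt for outcome, rt in trials if outcome == "correct"]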

Then I thought, “Wait, am I just using list comprehensions to accomplish in Python what I could accomplish in Matlab with clever indexing tricks?” And I looked upon my code, and it was so.

This is not necessarily a win for Matlab: list comprehensions presumably have more general uses than indexing tricks, and the only reason I have ever bothered to use them is that I’m used to being able to grab sections of lists with one-liners even when the way I’m subdividing those lists is a little complicated. And given that I can do what I want with one-liners in either setting, Python’s overall neatness and superior text-processing utilities give it the win in this context, which is perhaps obvious in retrospect, since what I’m doing is mathematically light and text-processing heavy. But I spent so long using Matlab as my default general-purpose programming language that these things still strike me from time to time.

programming, science, and epistemology

If you’re going to do good science, release the computer code too

I don’t have any opinion about whether this is or isn’t an indictment of climate change research, but the general point here is absolutely right. All the work in my dissertation depends on a pretty substantial volume of bespoke Matlab code and probably an equally substantial volume of code written by others (AFNI, MVPA toolbox). AFNI’s developers, a very conscientious and dedicated group of programmers at the NIH, recently discovered a bug that severely inflates the results of certain statistical comparisons. I reported a fantastic correlation in a talk my first year of grad school that was due to a sign error in the analysis code. A large part of my confidence in the validity of my dissertation work is due to the fact that my advisor and I independently coded the analyses in different languages: he used Java, I used Matlab. We worked this way for reasons more to do with stubbornness than conscience, but we both feel pretty good about it now. My volume of scientific programming hasn’t dwindled in my postdoc; I write PsychoPy code to run experiments and Python, R, and Matlab code to analyze them. Psychologists who aren’t comfortable with programming will use E-Prime, Excel, and SPSS instead, but Microsoft, at least, won’t save them from these problems. The problem is well stated, and it extends well beyond climate change.

What to do about it is not as clear as the article might suggest. Making data and analysis code public is a clear step forward, but that’s a sizable step away from making it easy for people to verify the claims. To check my code, you’ve got to have AFNI and the MVPA toolbox installed — that’s a big investment! That’s assuming you have access to Matlab, which is a proprietary language. And a lot of my more resource-intensive analyses were done in parallel on a computing cluster with several dozen nodes; those aren’t so easy to check on your PC. Still, a clear step forward.

Once you’ve logged the data and programs, though, how many papers are going to get the same kind of attention as the high-profile results Dr. Ince is talking about? Again, no question that it’s better to check just high-profile results than nothing. But we should be realistic about the fact that most of these programs will not be checked. In cognitive neuroscience, at least, I don’t think there are enough reviewers competent to check them, to say nothing of how time-consuming it would be. (This is more of a problem for bespoke code; packages like AFNI, the MVPA toolbox, PsychoPy, and so on could presumably be dealt with by a specialized accreditation unit of some sort — but who has the incentive or the resources to create such a unit?)

To me, it’s not so clear what happens once a dispute arises over the correctness of the code. Maybe it’s always clear, or always devolves to mathematical arguments; that would be the best case. But is there always a way to establish ground truth? Who is responsible for making the final arbitration? Maybe communities tend to self-organize around these issues and produce a reliable consensus; Dr. Ince implies that was the case for the proof of the four-color theorem. I’m just not sure.

It’s also worth thinking about the new biases an open-code policy would encourage. As one commenter noted, “I do… sympathize with scientists who don’t want to release computer code to bullies who just want to pick it apart to find tiny errors and then blow them out of all proportion, claiming that they undermine the whole body of science behind climate change.” I think the scientific community is apt to separate gold from dross in these situations if mobilized in sufficient numbers, but I’m more worried about it in the context of smaller disputes. How do you set things up so an obstructionist dissenter can’t effectively filibuster someone’s publication by picking at the code? Imagine this from the perspective of an assistant professor who’s choosing between spending all his time rebutting attacks on his code and being forced to retract a paper. These sorts of problems are probably solvable, but open code alone won’t solve them.

Also, is an open-code policy especially fair to quantitative researchers? It doesn’t have to be — scientists should exalt truth over competitive advantage — but my wife, for example, is a developmental biologist who uses biochemical tools that seem to operate on a basis of absolute voodoo. It’s known why they’re supposed to work, but sometimes they behave strangely and no one knows why. No one’s going to do any kind of bug-hunting on her beautiful stain or PCR, but my fMRI results are up for grabs? If I have to spend more time convincing people that my imaging results are real than my friend down the hall has to spend convincing people that his single-unit neurophysiology experiments are real, the world may find itself short on imagers. (Although this would be bad for me, I’ll concede that this may be a good thing in general — there’s a strong case for saying we only want really good people in research that depends on extremely elaborate calculations. Again, it’s not that I think quantitative researchers have any right to shy away from scrutiny of their code — but these new openness requirements, if they ever actually become requirements, are going to have consequences, and the fact that openness seems like a good idea makes the potential consequences that much more worth thinking about.)

The comments section is intermittently interesting — it’s mostly climate change-specific, but there’s some more general-purpose insight there. I was struck by one serious error that no one seems to have caught:

“… if random computer code errors are affecting the reliability of conclusions about global warming, then they will be equally likely to be underestimating as overestimating the effects. So this line of argument does not really help the sceptics case.”

Wrong! The errors that underestimate the effects don’t get published. They’re probably also more apt to get debugged: if your code is producing results you don’t like, you’re going to make absolutely sure it’s right, but you’re much less likely to scrutinize code that, from your perspective, seems to be working fine.
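To put a number on that argument, here’s a toy simulation (the values are entirely invented; it’s just meant to show the selection effect): the coding errors themselves are symmetric, but only the estimates they push past some threshold of interest get written up, so the errors that survive to publication are anything but symmetric.

    # Toy simulation of the selection effect described above (all values invented).
    import random
    random.seed(0)

    true_effect = 1.0
    published = []
    for _ in range(100000):
        bug_bias = random.gauss(0, 0.5)     # a symmetric, random coding error
        estimate = true_effect + bug_bias
        if estimate > true_effect:          # only the more dramatic results get published
            published.append(estimate)

    # The average published estimate runs well above the true effect, even though
    # the errors were just as likely to shrink the estimate as to inflate it.
    print(sum(published) / len(published))  # roughly 1.4 with these settings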

resolutions to a couple of psychopy issues

After several hours of failing to get PEBL working, I’ve been coding up some scripts for a computerized testing battery in PsychoPy. I really like it overall; it addresses PyEPL’s major deficiency of not being able to create arbitrary stimuli online, and the learning curve is much gentler, at least for someone who doesn’t know much Python in the first place. It doesn’t enforce certain good habits that PyEPL does (e.g. segregating config files, obsessively logging everything), and I can’t figure out how to change the awful font choices in the editor or (which would be just as good) get it to run from plain Python rather than within the PsychoPy app, but overall I’m well pleased.

I’ve run into two real problems in the last week or so of semi-intensive PsychoPy programming that I wanted to share with the Internet (and, possibly, my future self). Neither is a problem in the sense of “PsychoPy fails at x”; they’re problems in the sense of “This is an approach that might seem reasonable, but it won’t work.”

  1. Vertices of ShapeStim shapes. This is actually documented, but I didn’t notice it for a while, so here it is again: vertices are offsets from the pos argument to the ShapeStim object. If you give the damn thing vertices with absolute screen positions, the resulting shape will have the right shape, but the size will be all messed up. You might only notice this if you need to put ShapeStims and RadialStims or PatchStims on the same screen, but you’ll notice then, mark my words. (There’s a sketch of the right way at the end of this post.)

    Related: I’ve confirmed the documentation’s admission that filling breaks for shapes that have concavities. Creating those shapes out of their convex components, however, works just fine. Also, it looks like giving a filled shape a depth argument of anything other than the default causes the lines, but not the fill, to take that depth — as far as I can tell, the fill just disappears in these situations.

  2. The proper way to wait for keypresses. Coming from a very old version of Psychtoolbox, or perhaps just from the Land of Bizarre Programming Practices, I’m accustomed to waiting a certain amount of time t (but no longer) for a keypress by doing something like this:

    while trialTime < t:
        # update trialTime
        # check for keypresses
        # if there are any, record RTs, process them, and break

    If you do this in PsychoPy, the natural idiom to use for the “check for keypresses” step is event.getKeys. This approach will miss a lot of keypresses. It’s not a subtle issue, luckily; your program will just fail to respond to something like four out of five keypresses. However, there’s actually no need to use this structure. event.waitKeys has a maxWait argument that serves just as well to enforce the time limit (something it took me a while to realize), and you can record RT just by resetting a clock before the waitKeys call and reading the time from that clock after it; there’s a sketch just below. I don’t understand why the polling loop over getKeys has this issue, but I haven’t detected any drawbacks to just using waitKeys instead.
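Here’s roughly what the waitKeys approach looks like in practice. This is a minimal sketch rather than code from my actual battery; it assumes you already have a window up, and the key list and time limit are made up.

    from psychopy import core, event

    t = 2.0                                    # made-up response window, in seconds
    rtClock = core.Clock()                     # create/reset the clock right before waiting
    keys = event.waitKeys(maxWait=t, keyList=['left', 'right'])
    if keys:
        key = keys[0]
        rt = rtClock.getTime()                 # time on the freshly reset clock = RT
    else:
        key, rt = None, None                   # no response within maxWait
    # (waitKeys also has a timeStamped argument that hands back (key, time) pairs
    # directly, if you'd rather let it do the clock work for you.)

And to make the ShapeStim point from item 1 concrete, here’s another minimal sketch, with made-up coordinates rather than my actual stimuli: define the vertices as offsets around (0, 0), and put the shape where you want it with pos, instead of baking absolute screen positions into the vertices.

    from psychopy import visual

    win = visual.Window(units='norm')
    # A small triangle defined relative to its own center...
    triangle = visual.ShapeStim(win,
                                vertices=[(-0.1, -0.1), (0.1, -0.1), (0.0, 0.1)],
                                pos=(0.5, 0.5),    # ...and placed on the screen via pos
                                fillColor='white')
    triangle.draw()
    win.flip()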