Via Charlie Stross: Fantasy author Jim Hines has posted data from a survey of professionally published novelists; Steven Saus has posted additional analyses. Data of note [with code abbreviations in brackets] include (a) number of short fiction sales [SFS], (b) number of rejections [R], and (c) years of writing [YW] prior to selling a first novel, as well as (d) age at publishing a first novel [A]. I admit grouchiness as impetus for the following analyses — I’m grateful for Jim’s work in collecting the data, but his scatterplot of year of publication of first novel vs. SFS is basically unreadable because of the one mutant^H^H^H^H^H^Hluminary who sold 400 short stories before selling a novel. According to scientific custom, the “narrative” below is idealized, i.e. mostly non-chronological but perhaps conceptually coherent.

I first excised two data points with incoherent values: one author claimed a zero value on YW and one a zero value on A. My next step was to plot histograms and descriptive statistics of a few pedestrian data transformations (PDF link from Seth Roberts). I looked at square root, log2, and reciprocal transformations of the four variables enumerated above. (Log2 and reciprocal weren’t applicable to SFS and R, which had many zero values.)
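For anyone who wants to try the same comparison, here is a minimal sketch in Python rather than R. The numbers are made up, not the survey data, and `describe`, `skewness`, and `excess_kurtosis` are names invented for this illustration. It uses the plain population-moment definitions of skewness and excess kurtosis, which may differ slightly from what a given stats package reports.

```python
import math

def skewness(xs):
    """Population skewness: third central moment over m2 ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(xs):
    """Population excess kurtosis: fourth central moment over m2 ** 2, minus 3."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

TRANSFORMS = {
    "raw": lambda x: x,
    "sqrt": math.sqrt,
    "log2": math.log2,
    "reciprocal": lambda x: 1.0 / x,
}

def describe(xs):
    """Return {transform name: (skewness, excess kurtosis)} for one variable."""
    out = {}
    for name, f in TRANSFORMS.items():
        # log2 and 1/x are undefined at zero, so skip them for variables with zeros.
        if name in ("log2", "reciprocal") and min(xs) <= 0:
            continue
        ys = [f(x) for x in xs]
        out[name] = (skewness(ys), excess_kurtosis(ys))
    return out

# Made-up, right-skewed counts with several zeros (NOT the survey data).
sfs = [0, 0, 0, 1, 2, 3, 5, 8, 20, 60]
summary = describe(sfs)
for name, (skew, kurt) in summary.items():
    print(f"{name:>10}: skew={skew:.2f}, excess kurtosis={kurt:.2f}")
```

A transformation looks better here when both numbers move toward zero, which is what a normal distribution would give.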

In all cases, the square root transformation seemed to bring the data closest to normality: it reliably reduced skewness and usually reduced kurtosis. So I used square root-transformed data in all subsequent analyses. The next step was to eliminate outliers like the one that made Jim’s original year-SFS plot so hard to read. I excised six more rows, each of which had a value at least three standard deviations away from the mean of at least one variable. Four had unusual values on SFS and two on R. That left me with 238 rows to work with.
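The three-standard-deviation filter can be sketched roughly as follows, again in Python with invented rows. Two assumptions to flag: the cutoff is computed on the square root-transformed values, and the sample standard deviation is used. The column names `sfs` and `r` and the function `drop_outliers` are hypothetical.

```python
import math
import statistics

def drop_outliers(rows, columns, threshold=3.0):
    """Keep rows whose sqrt-transformed value is within `threshold` sample
    standard deviations of the column mean on every listed column."""
    stats = {}
    for col in columns:
        vals = [math.sqrt(row[col]) for row in rows]
        stats[col] = (statistics.mean(vals), statistics.stdev(vals))

    kept = []
    for row in rows:
        is_outlier = False
        for col in columns:
            mean, sd = stats[col]
            # A constant column has sd == 0 and cannot produce outliers.
            z = abs(math.sqrt(row[col]) - mean) / sd if sd > 0 else 0.0
            if z >= threshold:
                is_outlier = True
                break
        if not is_outlier:
            kept.append(row)
    return kept

# Invented rows (NOT the survey data): 30 ordinary authors plus one
# who sold 400 short stories before a first novel.
rows = [{"sfs": i % 5, "r": (2 * i) % 7} for i in range(30)]
rows.append({"sfs": 400, "r": 3})

kept = drop_outliers(rows, ["sfs", "r"])
print(len(rows), "rows before,", len(kept), "rows after")
```

One design note: the means and standard deviations are computed once, before any row is dropped, so a single extreme value inflates the standard deviation it is judged against. Repeating the filter until nothing changes is a common alternative, but that is not something this post claims to have done.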

At this point, I did a bunch of plotting and backed away from parametric tests (specifically Pearson correlation). Probably because the sales data are so unalterably skewed, a nonparametric test (Spearman correlation) was still more sensitive than the Pearson correlation on transformed data. In what follows, I really should be plotting the ranks, but it seemed more grungy than it was worth to figure out how to do that properly in R (the software, not the rejection count) after midnight, so I’m still plotting square root data. However, the one significant rank-order correlation had a similar magnitude when I did the Pearson correlation on the square root data, where it wouldn’t have on the raw data, so the square root data isn’t such a bad visualization.
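To make the Pearson/Spearman distinction concrete, here is a small sketch in Python with made-up numbers. Spearman's rho is simply the Pearson correlation of the ranks, with tied values sharing their average rank. The helper names are invented for this example. It also shows why the square root transformation cannot change the Spearman result: a monotone transformation leaves the ranks untouched.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(xs):
    """1-based ranks; tied values all receive the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up skewed data (NOT the survey data).
xs = [0, 0, 1, 2, 5, 9, 40]
ys = [3, 1, 4, 8, 6, 12, 30]

rho_raw = spearman(xs, ys)
rho_sqrt = spearman([math.sqrt(x) for x in xs], [math.sqrt(y) for y in ys])
print("Pearson, raw: ", round(pearson(xs, ys), 3))
print("Spearman, raw:", round(rho_raw, 3))
print("Spearman, sqrt:", round(rho_sqrt, 3))
```

The `average_ranks` output is also what a rank plot would put on its axes.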

When I first looked at the year-SFS scatterplot, there seemed to be a strong positive relationship between year of publication and number of short fiction sales, even though the correlation coefficient was very small. I knew the mode of SFS was zero, so I figured there were probably a lot of overlapping points sitting on top of one another. So I figured I’d size each point such that its area was proportional to the amount of data residing there. Here’s my version of Jim’s plot:

You can see that the correlation is essentially zero, although the number of writers making a lot of short fiction sales before their first novel does seem to be rising by the year.
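The point-sizing trick is easy to get wrong, so here is a sketch of the arithmetic in Python with invented data. Since a circle's area grows with the square of its radius, making area proportional to the number of overlapping points means making the radius proportional to the square root of that number. `bubble_points` and `base_radius` are hypothetical names.

```python
import math
from collections import Counter

def bubble_points(xs, ys, base_radius=1.0):
    """Collapse duplicate (x, y) pairs into one point each.

    Returns (x, y, count, radius) tuples sorted by coordinates, with
    radius = base_radius * sqrt(count), so area is proportional to count.
    """
    counts = Counter(zip(xs, ys))
    return [
        (x, y, n, base_radius * math.sqrt(n))
        for (x, y), n in sorted(counts.items())
    ]

# Invented data (NOT the survey data): four authors share one spot.
years = [2000, 2000, 2000, 2000, 2005]
sales = [0, 0, 0, 0, 3]

points = bubble_points(years, sales)
for x, y, n, radius in points:
    print(x, y, "count:", n, "radius:", radius)
```

Whether a plotting function's size argument means radius, diameter, or area varies from library to library, so it is worth checking before passing these numbers along.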

Next, I wanted to see whether prior sales were related to how often an author’s first novel was rejected (maybe more experienced writers are less liable to suffer rejection) or how old they were when their first novel was published (maybe writing more short fiction delays — or accelerates? — selling your first novel):

There’s enough data that the incredibly low correlation coefficients from this incredibly skewed data are actually not so far from significant. (Perhaps most saliently, there are a few isolated cases of people selling several dozen stories and still getting their novel rejected several dozen times.) What about years of writing? You could imagine that the more short stories you sell, the less time you have to spend writing before you publish your novel (or, alternatively, the better a writer you are, the more short stories you sell AND the less time you have to spend writing before you publish your novel).

Although this correlation coefficient is highly significant and the rho value is much bigger, it’s still a pretty small effect — SFS and YW share a little less than 5% of their variance, whereas SFS and R or A share a little more than 3%. What’s notable is the direction. Since the data is just correlational, it’s not clear whether writing more stories causes you to spend longer in your writing career before you publish a novel, or whether taking a long time to sell a novel just gives you more opportunity to publish stories before you do — but in any case, it looks like writing short stories doesn’t accelerate the publication of your first novel. If anything, it slows it down.
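The "shared variance" figures are just squared correlation coefficients, so the arithmetic runs both ways. The snippet below, in Python, only converts the rounded percentages quoted above back into approximate correlation magnitudes. These are back-of-envelope values, not rho values reported by the analysis, and the function name is invented.

```python
import math

def correlation_magnitude(shared_variance):
    """|r| implied by a shared-variance fraction (r squared)."""
    return math.sqrt(shared_variance)

# Rounded fractions quoted in the text, used as approximate upper/lower bounds.
for label, fraction in [("SFS vs. YW (a little under)", 0.05),
                        ("SFS vs. R or A (a little over)", 0.03)]:
    print(f"{label}: |rho| about {correlation_magnitude(fraction):.3f}")
```

Because these are rank correlations, the variance being shared is, strictly speaking, the variance of the ranks.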

Anyway. Not much here — mostly an incrementally more sophisticated confirmation of Jim’s observation of null results. I didn’t feel like dealing with the categorical data earlier in the day, and it’s late, so I won’t now. But it was a nice opportunity to play around with a dataset that’s different from the kind I usually run into, and to teach myself a few new things in R.