Downloads and citations are count data: observations take only non-negative integer values, have no explicit upper limit, include a large number of zeros, and have a variance that far exceeds the mean. When the dependent variable is a count, the standard econometric tool, the linear regression model, is not appropriate because it is sensitive both to the large number of zeros and to the extreme values that are not uncommon in count data. Assumptions of normality are also difficult to justify for count data unless the sample is sufficiently large. A more suitable model would be based on the Poisson distribution, which specifically models the number of events occurring over a given time period, but it assumes that the mean of the count variable equals its variance. For downloads and citations, the variance significantly exceeds the mean: the mean for downloads is 110, while the variance is 60,085.²⁵ A better fit is thus a negative binomial regression, a generalization of the Poisson model that includes a parameter to control for over-dispersion and therefore yields more reliable confidence intervals than a Poisson regression, whose standard errors are understated when the data are over-dispersed. It is also appropriate in situations where the underlying count process is not independent²⁶ (Winkelmann, 2008). A drawback of the negative binomial model is its limited applicability to data with large numbers of zero observations (Mihaylova et al., 2011).

A second option, besides the negative binomial model, is the two-part model, which can account for excess zeros in count data (Winkelmann, 2008). The first part of the model estimates the probability that the count is positive (i.e., that a report is downloaded or cited at all), while the second part estimates the mean number of counts conditional on the count being positive.
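The two ideas above, over-dispersion relative to the Poisson benchmark and the two-part decomposition of the mean, can be illustrated with a short sketch. It is not the estimation code behind the results in this paper. The function names are hypothetical, the negative binomial is assumed to use the common NB2 variance function Var(y) = mean + alpha × mean², and the inputs to the two-part example are made-up values chosen only to show the arithmetic.

```python
"""Illustrative sketch only; not the authors' estimation code."""


def dispersion_ratio(mean: float, variance: float) -> float:
    """Variance-to-mean ratio; it equals 1 under a Poisson distribution."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    return variance / mean


def nb2_alpha_from_moments(mean: float, variance: float) -> float:
    """Method-of-moments dispersion parameter.

    ASSUMPTION: NB2 variance function Var(y) = mean + alpha * mean**2,
    so alpha = 0 recovers the Poisson case (variance equal to the mean).
    """
    if mean <= 0:
        raise ValueError("mean must be positive")
    return (variance - mean) / mean ** 2


def two_part_mean(prob_positive: float, mean_if_positive: float) -> float:
    """Unconditional mean implied by a two-part model:
    E[y] = Pr(y > 0) * E[y | y > 0]."""
    if not 0.0 <= prob_positive <= 1.0:
        raise ValueError("prob_positive must lie in [0, 1]")
    return prob_positive * mean_if_positive


if __name__ == "__main__":
    # Unconditional moments for downloads as reported in the text.
    mean_dl, var_dl = 110.0, 60085.0
    print("variance / mean:", round(dispersion_ratio(mean_dl, var_dl), 1))
    print("implied NB2 alpha:", round(nb2_alpha_from_moments(mean_dl, var_dl), 2))

    # HYPOTHETICAL inputs, not taken from the paper: an 80 percent chance
    # of any download and 150 downloads on average among downloaded reports.
    print("two-part mean:", two_part_mean(0.8, 150.0))
```

A ratio far above 1 signals that the Poisson equal-mean-variance assumption fails. The moment-based alpha ignores covariates, so a dispersion parameter estimated within a regression would generally differ. The last function shows why an effect conditional on being downloaded can be larger than the combined effect, since the latter is scaled by the probability of any download.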
Logit or probit models are typically used for the first part, while ordinary least squares, log-linear, or generalized least squares models are applied for the second. Two-part models appear to outperform other methods when there are large numbers of zeros in the count data. The results from both models are presented in the report.

V. Results

Our most parsimonious specification shows that costlier reports for middle income countries are downloaded more. Similar to Wagstaff (2012b), we find that more expensive policy reports tend to have more downloads. In fact, increasing the budget of a report from around $180,000 (the mean) to around $316,000 (an increase of half a standard deviation) increases the number of downloads on average by 23 (the combined effect of the two-part model) or by 30 conditional on the report being downloaded (the result of the two-part model regression). We also include dummies for the year of disclosure. As expected, reports that have been disclosed for a longer period of time are more likely to be downloaded (Table B).

Regional reports on larger and richer countries tend to be downloaded more. Anecdotal evidence suggests that some countries with large populations and middle income country status receive more downloads. One would also expect that richer countries are likely to have better internet availability,²⁷

²⁵ The same holds for citations, for which µ = 0.96 and σ² = 20.36.

²⁶ The existence of contagion or state dependence (that is, the occurrence of an event makes further occurrences likely) would cause over-dispersion. In the case of downloads, one person's download is unlikely to be observable by another person (no data of this kind is provided in D&R), making this possibility unlikely. On the other hand, a citation by one article is observable by others, and this positive contagion effect could drive the citation count of policy reports.

²⁷ Though the number of internet users is not significantly linked with increased downloads.
