Friday, December 9, 2016

R Vinaigrettes

Knuth's call for a literate programming style has spawned a new genre of statistical exposition, the R vignette, and thereby raised the dreary task of documenting computer code to the level of a minor art form, like finger painting or tap dancing.  These vignettes are intended to reveal something of the author's contribution to the greater glory of data analysis, usually in the form of an R package.

This development has been enormously successful, and yet there is a general unease within the research community, a feeling that many of the almost 10,000 packages currently on CRAN have not received adequate vetting, or vignetting.  In this spirit I would like to propose a new genre, the
R vinaigrette.  These would be brief communications that expose some feature, or bug, in the collective enterprise of statistical software.  As the name suggests there should be something piquant about a vinaigrette, some lemon juice to balance the oils, or mustard, or vinegar.  I would only insist that, like the vignette, the vinaigrette must be reproducible.  Ideally, they should also satisfy the Kolmogorov dictum that every single discovery should fit in a four-page Doklady note, since "the human brain is not capable of creating anything more complicated at one time." 

An example is now available at http://www.econ.uiuc.edu/~roger/research/ebayes/Bdecon.pdf

Monday, November 28, 2016

Optimal Transport on the London Tube

I've been reading Alfred Galichon's terrific new monograph on optimal transportation, and was
inspired over the fall break to look into his example in Section 8.4 on routes for the Paris metro.
Data for the London Underground was more easily accessible, so I made a toy tube router function
for R that takes an origin and destination and computes an "optimal" path by minimizing the
cumulative distance between stops.  An example path is illustrated in the figure below with the
lines color-coded; unfortunately, my current data sources don't account for links that are served by
multiple lines, so the routes typically overstate the number of line changes.  (It would be nice to
penalize line changes with a fixed cost, but this would have extended the project beyond the fall
break.)  Data and code are available here: http://www.econ.uiuc.edu/~roger/research/OT/tube.tar.gz
It is all very simple, just a linear program, but it makes you think about how one might scale it
up to the scheme used by Google Maps.
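
The LP itself is worth writing down.  Here is a bare-bones sketch, not the code in the tarball: it
assumes a hypothetical data frame links with columns from, to and dist listing adjacent stations,
with each link entered in both directions, and it uses the lpSolve package.

library(lpSolve)

tube_route <- function(links, origin, destination) {
    stations <- sort(unique(c(links$from, links$to)))
    n <- length(stations)
    m <- nrow(links)
    # Node-arc incidence matrix: +1 at the tail, -1 at the head of each arc
    A <- matrix(0, n, m)
    A[cbind(match(links$from, stations), 1:m)] <-  1
    A[cbind(match(links$to,   stations), 1:m)] <- -1
    # Unit flow: net outflow +1 at the origin, -1 at the destination, 0 elsewhere
    b <- rep(0, n)
    b[match(origin, stations)] <-  1
    b[match(destination, stations)] <- -1
    fit <- lp("min", links$dist, A, rep("=", n), b)
    links[fit$solution > 0.5, ]   # the arcs on the optimal path
}

Since the constraint matrix is a node-arc incidence matrix, the LP relaxation is integral and its
solution simply picks out the arcs of a shortest path.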



Wednesday, November 9, 2016

Election 11/9/16

Overnight I've become an orthodox Freudian:  the election pitted the American super-ego against its id, and the id won, decisively.  Neither economics nor statistics helps to explain this cataclysm.

Friday, September 16, 2016

Dawn of the δ-method

[Photograph: Worcester State Hospital]

Several years ago my colleague Steve Portnoy wrote a letter to the editor of the American Statistician
in response to an article that they had published called "Who invented the δ-method?"  The article
claimed priority for Robert Dorfman on the basis of an article appearing in 1938 called "A Note
on the δ-method for Finding Variance Formulae" published in the Biometric Bulletin.  Portnoy
pointed out that Joe Doob had written about the δ-method in a 1935 Annals paper titled "On the
limiting distribution of certain statistics," referring to it as the "well-known δ-method," citing
prior work by T.L. Kelley and Sewall Wright, and noting rather modestly that his Theorem 1
"shows an interpretation which can be given to the results obtained by this method."  It seems
plausible that Doob's is the first formal justification for the method, and it is puzzling, to put it
euphemistically, that Dorfman made no mention of Doob's article.  Perhaps this oversight can be
forgiven as a juvenile mistake since the Dorfman paper was written shortly after he finished his
undergraduate studies at Columbia, while working at the Worcester State Hospital, pictured above.
This august institution was reputed to be the first asylum for the insane in New England, and also happened to be the publisher of the Biometric Bulletin.  Dorfman later went on to earn a PhD at Berkeley, and taught at Harvard where he coauthored an influential book about linear programming with Paul Samuelson and Robert Solow.

Thursday, September 15, 2016

Bag of Little Bootstraps for QR

In the never-ending quest to speed up inference for large quantile regression problems,
I have started to look into the Kleiner et al. Bag of Little Bootstraps proposal.  After a
serious confusion on my part was corrected with the help of Xiaofeng Shao, I've come
to the following code fragment added to summary.rq in my quantreg package:


 else if (se == "BLB") { # Bag of Little Bootstraps
        # b observations per subsample, S disjoint subsamples of the n observations;
        # gamma, R, tau, x and y are inherited from the summary.rq environment.
        n <- length(y)
        b <- ceiling(n^gamma)
        S <- n %/% b
        U <- matrix(sample(1:n, b * S), b, S)   # column i holds the indices of subsample i
        Z <- matrix(0, NCOL(x), S)
        for (i in 1:S) {
            u <- U[, i]
            B <- matrix(0, NCOL(x), R)
            for (j in 1:R) {
                # multinomial weights reinflate the subsample of size b to full sample size n
                w <- c(rmultinom(1, n, rep(1/b, b)))
                B[, j] <- rq.wfit(x[u, ], y[u], tau, weights = w, method = "fnb")$coef
            }
            # bootstrap standard errors of the coefficients within subsample i
            Z[, i] <- sqrt(diag(cov(t(B))))
        }
        # average the subsample standard errors
        serr <- apply(Z, 1, mean)
    }

In the eventual implementation I managed to embed the inner loop in Fortran,
which helps to speed things up a bit, although it would eventually be helpful
to allow this to be distributed across cluster nodes.
I should also mention that the implicit assumption here, that BLB works for
moments (here, standard errors), appears to be (presently) beyond the scope
of the current theory.
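
A usage sketch, purely illustrative: it assumes the fragment above is wired into summary.rq
with se = "BLB", and that the tuning constants gamma (the subsample size exponent) and R
(the number of inner bootstrap replications) can be passed through.  The argument names follow
the fragment, not necessarily the released quantreg interface.

library(quantreg)
data(engel)                     # small example dataset shipped with quantreg
fit <- rq(foodexp ~ income, tau = 0.5, data = engel)
# gamma and R are the hypothetical tuning arguments of the fragment above
summary(fit, se = "BLB", gamma = 0.7, R = 100)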

Friday, March 25, 2016

Round of 16: What's Sweet about It?

The Statistics Department is running a March Madness contest, and I couldn't resist entering.
It is organized a little differently from the usual bracket picking, which made it more fun to
prepare an entry.  You are given a budget of 100 units, and you may pick as many teams as you
like, subject to the budget constraint: 1 seeds cost 25, 2 seeds cost 19, ..., 16 seeds cost 1.
I simulated 10,000 brackets, recorded the survival probabilities as in the earlier survival plot
on this blog, computed the expected number of wins for each team, normalized by cost, and ordered
the teams to produce the list below (a sketch of the selection step appears after the table).
The winner is the entry whose teams accumulate the largest number of wins.

                 EWins Seeds Cost      Bang CumCost CumEWins
Gonzaga         1.1619    11    4 0.2904750       4   1.1619
Pittsburgh      0.9041    10    4 0.2260250       8   2.0660
Cincinnati      1.0757     9    5 0.2151400      13   3.1417
Iowa            1.6385     7    8 0.2048125      21   4.7802
Syracuse        0.8112    10    4 0.2028000      25   5.5914
VA Commonwealth 0.7855    10    4 0.1963750      29   6.3769
West Virginia   2.4625     3   13 0.1894231      42   8.8394
Duke            2.2013     4   12 0.1834417      54  11.0407
Purdue          1.9888     5   11 0.1808000      65  13.0295
Connecticut     0.9040     9    5 0.1808000      70  13.9335
Butler          0.8602     9    5 0.1720400      75  14.7937
Indiana         1.8908     5   11 0.1718909      86  16.6845

Texas A&M       1.8507     3   13 0.1423615      99  18.5352
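
One way to reproduce a table like the one above is a greedy bang-for-the-buck rule; here is a
sketch, assuming a hypothetical data frame teams with columns EWins and Cost, the EWins column
coming from the simulated brackets.

# Order teams by expected wins per unit of cost and add them greedily while the budget allows
pick_teams <- function(teams, budget = 100) {
    teams <- teams[order(teams$EWins / teams$Cost, decreasing = TRUE), ]
    keep <- logical(nrow(teams))
    spent <- 0
    for (i in seq_len(nrow(teams))) {
        if (spent + teams$Cost[i] <= budget) {
            keep[i] <- TRUE
            spent <- spent + teams$Cost[i]
        }
    }
    picked <- teams[keep, ]
    cbind(picked, Bang = picked$EWins / picked$Cost,
          CumCost = cumsum(picked$Cost), CumEWins = cumsum(picked$EWins))
}

A greedy rule of this sort is not guaranteed to be the exact knapsack optimum, but it has the
virtue of fitting in a coffee break.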

Monday, March 14, 2016

Bracketology 2016

Again, it is time to fill in the brackets.  This year, again, I'm planning to join the Kaggle gaggle, but
I thought I would report here what this year's survival probabilities look like.  The situation is quite
different from last year, when Kentucky was the overwhelming favorite.  This year there is a quite
close race among three teams:  Kansas, MSU, and UNC are all above 0.10 probability of winning it all,
with 0.16, 0.12, and 0.11 respectively.  This is based on my 1000 simulations of the tournament with
my standard QR model, just like last year when Kentucky was at 0.40.  Here is a Tufte-style
sparklines graphic with this year's survival functions.

This year we have also posted a new bracket generator at

http://www.econ.uiuc.edu:8080/QBracketology/

that can be used to generate a random bracket according to this year's fitted model.  Thanks to
Ignacio Sarmiento Barbieri for help with the R shiny implementation of this.  Further details
about the methods underlying all of this are available here:

http://www.econ.uiuc.edu/~roger/research/bracketology/MM.html
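
For the curious, here is a bare-bones sketch of the simulation step behind such survival
probabilities.  The named strength vector and the logistic win probability are placeholders for
the fitted QR model, not the model itself.

# Simulate one single-elimination bracket.  "strength" is a named vector of fitted team
# strengths, ordered so that teams 1 and 2 meet in the first round, then 3 and 4, and so on.
simulate_bracket <- function(strength) {
    teams <- names(strength)
    rounds <- list()
    while (length(teams) > 1) {
        winners <- character(length(teams) / 2)
        for (g in seq_along(winners)) {
            i <- teams[2 * g - 1]
            j <- teams[2 * g]
            p <- plogis(strength[i] - strength[j])   # stand-in for P(i beats j)
            winners[g] <- if (runif(1) < p) i else j
        }
        rounds[[length(rounds) + 1]] <- winners
        teams <- winners
    }
    rounds   # winners of each successive round
}

Repeating this 1000 times and tabulating how far each team advances yields survival probabilities
of the sort reported above.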

Sunday, January 3, 2016

Edgeworth and the Newsvendor

Several months ago I received a new paper by Mukherjee, Brown and Rusmevichientong on
empirical Bayes methods for the "newsvendor problem."  It is a matter of considerable personal
embarrassment that I hadn't ever heard about the newsvendor problem.  In its simplest form we
have a newspaper vendor (remember them?) who must decide how many papers of various sorts
he should stock.  He sacrifices profits if he stocks too many or too few; the loss is linear in both
directions, but asymmetric.  He knows the distribution, F, of the (random) demand for each
paper, and of course his solution is to stock the a/(a+b) quantile of F, where a is the marginal
cost of stocking too few papers, and b is the marginal cost of stocking too many.  I learned this
bit of basic decision theory wisdom from Tom Ferguson's wonderful textbook in a course from
Bruce Hill eons ago, and I was aware that it appears in Raiffa and Schlaifer's text as well, but
I was surprised to learn that there was an extensive operations research literature going back
at least to the seminal paper of Arrow, Harris and Marschak in 1951.  Why I was surprised,
since this is a rather fundamental problem in inventory policy, is another question, but I won't
delve into that. *  Instead, I decided to look into the earlier history of this idea and began to see
quite a lot of references to Edgeworth's (1888) paper "The mathematical theory of banking."
More embarrassment ensued when I realized that I'd never read this paper.  Reading it revealed
a very clear formulation of the newsvendor problem applied to banking; unfortunately, Edgeworth's
solution seemed a little murky.  I've tried to untangle all this in a couple of pages below, but as
you will see, it resists a fully satisfactory untangling.
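
For the record, the one-line calculation behind the a/(a+b) quantile, in notation of my own
choosing: with demand D distributed as F and stock level q, the expected loss is

    R(q) = a E(D - q)^+  +  b E(q - D)^+ ,

and differentiating gives R'(q) = -a (1 - F(q)) + b F(q), which vanishes when F(q) = a/(a+b),
so the optimal stock is indeed the a/(a+b) quantile of F.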

Edgeworth.pdf

* I would like to stipulate that the simple static model underlying the newsvendor problem
is only a small part of the accomplishment of the AHM paper, which introduced dynamic (s, S)
rules as well.