Thursday, March 20, 2014

Mea Copula

The usual craziness last Sunday drove me to consider updating the bracketology methods that Gib
Bassett and I described here:
http://www.tandfonline.com/doi/abs/10.1198/jbes.2009.07093

Since the Intel (Kaggle.com) contest had reasonable looking data structures, it was reasonably
straightforward to modify my old R software to do a 2013-14 version of what we had done
earlier for 2003-4.  The Kaggle competition wanted submissions that made probability estimates
for all 2278 possible matchups for this year's tournament.  Our approach was to estimate a
pairwise comparison model for team scores for the entire pre-tournament season for a relatively
fine grid of tau values -- by (what else?) quantile regression.  The resulting QR models have a
design matrix that is 10724 by 704, but is very sparse, so on a grid of 1:199/200 the estimation
can all be done in about a minute on my desktop machine.  Then comes the fun of simulating
tournament brackets, and estimating probabilities.  For any pairing, ij,  we have an estimated
quantile function for team i's score and another estimated quantile function for team j's score.
Yesterday afternoon, as the deadline for the Intel contest loomed closer and closer, I lost focus
and decided to compute probability estimates based on what one might call the comonotonic
copula -- that is, under the assumption that if in our hypothetical game team i achieves quantile
tau in its scoring performance, then team j will also achieve quantile tau performance in this
game.  Thus, if team i's quantile function lies above the quantile function for team j, the
predicted probability will be 1; that is, we assert that team i will always beat team j.
But clearly team i might have a bad day, when team j has a good one, so another limiting
option would be to consider an independent copula:  each team gets an independent draw
from a uniform that then gets plugged into their quantile function.  Various compromises then
suggest themselves.  In the JBES paper we used a Frank copula model that implied a Kendall
correlation of about 0.27, so a mild positive correlation between the performances of the two teams.
Whereas the initial probability estimates from the comonotonic model put roughly half of the
2278 phats at 0 or 1, the Frank copula model produced a much more uniform distribution of them,
as illustrated by the histogram below.  Unfortunately, this didn't occur to me until after
the deadline passed for the Intel submissions.
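
For concreteness, here is a minimal R sketch of the three copula calculations described above.
The quantile functions Qi and Qj are hypothetical Gaussian stand-ins for the estimated QR
quantile functions, and theta = 2.6 is an assumed Frank parameter, chosen to give a Kendall
correlation of roughly 0.27; the Frank draws use the standard conditional inverse method.

taus <- 1:199/200
Qi <- qnorm(taus, mean = 72, sd = 10)   # hypothetical score quantiles for team i
Qj <- qnorm(taus, mean = 68, sd = 12)   # hypothetical score quantiles for team j
Q <- function(u, Qvals) approx(taus, Qvals, xout = u, rule = 2)$y

# Comonotonic copula: both teams realize the same tau in the game
p.comono <- mean(Qi > Qj)

# Independent copula: each team gets its own uniform draw
set.seed(42)
N <- 100000
p.indep <- mean(Q(runif(N), Qi) > Q(runif(N), Qj))

# Frank copula: draw (u, v) by conditional inversion, then plug into
# the quantile functions as before
theta <- 2.6
u <- runif(N)
t <- runif(N)
v <- -log(1 + t * (exp(-theta) - 1) /
    (exp(-theta * u) - t * (exp(-theta * u) - 1)))/theta
p.frank <- mean(Q(u, Qi) > Q(v, Qj))

With the Gaussian stand-ins, p.comono is an all-or-nothing verdict whenever one quantile
function dominates the other, while p.indep and p.frank are pulled back toward 1/2.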


Sunday, February 9, 2014

Empirical Bayes Bakeoff

The ISI has a new journal called Stat.  I'm not sure this is a good marketing appellation in the
modern world of search engines, but it has an interesting premise.  It promises quick turnaround,
prompt electronic publication, and a sanctioned blog forum in which readers can comment on
published articles.  It isn't intended for long expository pieces; rather it is meant for pithier,
perhaps more provocative pieces that wouldn't fit well in the current journals.  This is a niche
that Biometrika's Miscellanea section used to fill.  I decided to give this a whirl recently,
submitting a 5 page paper comparing some empirical Bayes methods in several simulation settings
of the Johnstone and Silverman "needles in haystacks" type.  In a couple of weeks I received a
favorable and very well informed referee report* and the paper was accepted.  It appeared after
a bit of copy-editing a week or two later.  Whether it attracts any Statblog attention remains
to be seen, but I certainly hope so.  It is freely available at:
http://onlinelibrary.wiley.com/doi/10.1002/sta4.v3.1/issuetoc  for a while at least.


* In the realm of referee reports, "favorable" and "well informed" are synonyms, so this sentence was a bit redundant.

Saturday, February 8, 2014

Winter Warming?

With all the complaining about the winter weather this year,  I became curious about trends in winter
weather in my home town.  So here is a plot of the last 100 years, or so, of minimum January temperatures in Grand Forks, North Dakota.  The three black lines are regression quartile fits
and the red line is the mean regression fit.  The mean and median fits both indicate that there is
roughly a 0.1 degree F per year warming trend over the entire period -- so 6 degrees F since I was
in first grade.  This year the minimum was -30F, so quite a bit below the median trend line.
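
For the record, here is a minimal sketch of how such a plot can be produced in R with the
quantreg package; the data frame gf, with columns year and mintemp, is a hypothetical stand-in
for the Grand Forks series.

library(quantreg)
plot(mintemp ~ year, data = gf, cex = 0.5)
f <- rq(mintemp ~ year, tau = c(0.25, 0.50, 0.75), data = gf)
for (i in 1:3) abline(coef = coef(f)[, i])           # three quartile fits in black
abline(lm(mintemp ~ year, data = gf), col = "red")   # mean regression fit in red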



Tuesday, January 14, 2014

Kondratiev Meets Pynchon

The ineffable Elif Batuman's twitter feed led me to this PLOS paper on long cycles in literary
and economic misery, confirming the iron law of Kondratiev that there are long cycles in the
reappearance of Kondratiev's long cycles.  Each reappearance is cause to celebrate Slutsky's
devastating critique of such nonsense in the paper:  The Summation of Random Causes as the
Source of Cyclic Processes, Econometrica, 1937, originally written and published in 1927
when Slutsky was employed in Kondratiev's "Conjuncture Institute" in Moscow.

Sunday, January 12, 2014

Momentary Lapses

One of my hobbies (in the sense of xkcd) is poking fun at method of moments methods.  For quite a
while my first lecture of the graduate econometric theory course here has had several cautionary tales
designed to make students wary of moment based statistical procedures.  Yeah, this blog was
bound to get mired down in matters statistical eventually; it is surprising that it took this long.

One of these cautionary tales was based on an American Statistician paper by Bruce Lindsay
and Prasanta Basak called: Moments Determine the Tail of a Distribution (But Not Much Else).
Rarely has there been a more accurately titled paper in the statistical literature.  The main
result of their paper provides bounds on how different two distribution functions, say F and G,
can be, given that they have the same first n moments.  They provide a very nice extension of an
old result of Akhiezer for the general case, and give an example showing that for the Gaussian
distribution, G, there is an F with the same first 60 moments, for which |F(0) - G(0)| > 0.1116.
This seems rather astonishing: even after matching the first 60 (!) moments we cannot expect
F and G to agree within 1/10 at the median of the distribution.

Whenever I thought about this, I wondered: what would these F look like?  Could we actually
calculate a particular example to illustrate this sort of phenomenon?  Such problems are usually
called Method of Moment Spaces problems, and it occurred to me a little while ago that it
would be easy to compute such animals by standard linear programming methods.  My first
attempt at this employed Mosek, my favorite general linear programming engine, and when
I tried to find an F to match the first 20 Gaussian moments, I discovered that the problem
was numerically pretty nasty:  the moments grow quite large and the resulting linear programming
problem becomes quite ill-conditioned.  This led me to wonder whether the solution that Mosek
produced, claiming an F that differed by 0.1846 at zero, was really reliable.  At this point I decided
that it might be fun to try to solve the same problem in Mathematica, since it was capable of
doing so in extended precision.  So with some help and encouragement from my colleague,
Steve Portnoy,  I wrote the following simple program:

GaussianMomentMatching[n_, m_] := Module[{fm, gm, x, A, c, b, B, p},
   (* u-th power of grid point v, used to build the moment constraints *)
   fm[u_, v_] := v^u;
   (* j-th moment of the standard Gaussian: (j-1)!! for even j, 0 for odd j *)
   gm[j_] := If[OddQ[j - 1], (j - 1)!!, 0];
   (* grid of 2m + 1 candidate support points on [-8, 8] *)
   x = 8 Table[j, {j, -m, m}]/m;
   (* constraint matrix: row u gives the u-th moment of a mass vector p *)
   A = Outer[fm, Range[0, n], x] /. Indeterminate -> 1;  (* fix 0^0 *)
   b = Table[gm[j], {j, n}];
   (* objective: total mass strictly below zero, i.e. F(0-) *)
   c = (1 - Sign[x])/2;
   c[[m + 1]] = 0;  (* exclude the mass at zero itself *)
   (* equality constraints: total mass 1 and the first n Gaussian moments *)
   B = Table[0, {2}, {n + 1}];
   B[[1]] = Prepend[b, 1];
   B = Transpose[B];  (* pairs {rhs, 0}; 0 signals equality *)
   p = LinearProgramming[c, A, B, Method -> "Simplex"];
   (* gap between F and the Gaussian at zero: Phi(0) = 1/2 minus F(0-) *)
   0.5 - N[c.p]
   ]

I am restricting attention to F's that have point masses on a grid from -8 to 8 with 2m + 1 
possible values, and matching the first n moments of the Gaussian.  For n = 20
moments with m = 250, this produces essentially identical results to Mosek, with a gap of 0.1846 at
zero.  And now the rational solution in the p's matches precisely.  For n = 30, I get
a gap of 0.155.  The Mathematica code is a bit slow, so I haven't tried to go beyond
30, but I'm still astonished by the fact that the gaps are so large.  Clearly Lindsay is
right when he says that moment matching for distributions with unbounded support
doesn't tell us much about the center of the distribution.  Of the 501 possible points
of support for these solutions only n+1 points get positive mass; this is a feature of
the linear programming formulation, since a basic optimal solution has at most as many positive
entries as there are equality constraints.  An entertaining aspect of the Mathematica solution
is that these p's with positive mass are (wildly) rational, e.g.

4016749983193866933980157909278598671875 / 1398521361440385903587657725093918246728368128.

So what do these matching solutions look like?  I've plotted the n = 20 solution in the figure
below.  Note that I'm really plotting the cube
root of the probability masses in an effort to make the small tail masses visible.  Not surprisingly,
the solution looks quite symmetric since the odd moments must all match zero, but a little mass
out in the tails can make the moments mimic the Gaussian moments exactly.
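
For readers who would rather experiment in double precision, here is a rough R version of the
same linear program using the lpSolve package (a sketch only -- the calculations above used
Mosek and Mathematica, and lpSolve can be expected to hit the same conditioning trouble as
n grows):

library(lpSolve)
n <- 20; m <- 250
x <- 8 * (-m:m)/m                         # grid of 2m + 1 support points on [-8, 8]
A <- outer(0:n, x, `^`)                   # moment constraint matrix (0^0 is 1 in R)
gm <- function(j) if (j %% 2 == 0) prod(seq(1, j - 1, by = 2)) else 0
b <- c(1, sapply(1:n, gm))                # total mass 1, then the first n Gaussian moments
cvec <- as.numeric(x < 0)                 # objective: mass strictly below zero
sol <- lp("min", cvec, A, rep("=", n + 1), b)
0.5 - sum(cvec * sol$solution)            # the gap |F(0) - Phi(0)|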

Another question one might ask about all of this:  how different can the quantiles of
F and G be?  Obviously, the medians agree exactly, but as is apparent from the figure above
there are other quantiles that can be different by as much as 0.5.