Thursday, March 20, 2014

Mea Copula

The usual craziness last Sunday drove me to consider updating the bracketology methods that Gib
Bassett and I described here:
http://www.tandfonline.com/doi/abs/10.1198/jbes.2009.07093#.Uys1PVxTq6Q

Since the Intel (Kaggle.com) contest had reasonable-looking data structures, it was fairly
straightforward to modify my old R software to do a 2013-14 version of what we had done
earlier for 2003-4.  The Kaggle competition wanted submissions that made probability estimates
for all 2278 possible matchups for this year's tournament.  Our approach was to estimate a
pairwise comparison model for team scores over the entire pre-tournament season on a relatively
fine grid of tau values -- by (what else?) quantile regression.  The resulting QR models have a
design matrix that is 10724 by 704 but very sparse, so on the grid (1:199)/200 the estimation
can all be done in about a minute on my desktop machine.  Then comes the fun of simulating
tournament brackets, and estimating probabilities.  For any pairing, ij, we have an estimated
quantile function for team i's score and another estimated quantile function for team j's score.
Yesterday afternoon as the deadline for the Intel contest loomed closer and closer, I lost focus
and decided to compute probability estimates based on what one might call the comonotonic
copula -- that is, under the assumption that if in our hypothetical game team i achieves quantile
tau in its scoring performance, then team j will also achieve quantile tau performance in this
game.  Thus, if team i's quantile function lies everywhere above the quantile function for team j, the
predicted probability will be 1; that is, we will assert that team i will always beat team j.
But clearly team i might have a bad day when team j has a good one, so another limiting
option would be to consider the independence copula:  each team gets an independent draw
from a uniform that then gets plugged into their quantile function.  Various compromises then
suggest themselves.  In the JBES paper we used a Frank copula model that implied a Kendall
correlation of about 0.27, so mild positive correlation between performance of the two teams.
In contrast to the initial probability estimates from the comonotonic model, which put roughly
half of the 2278 phats at 0 or 1, the Frank copula model produced a much more uniform distribution
of them, as illustrated by the histogram below.  Unfortunately, this didn't occur to me until after
the deadline passed for the Intel submissions.
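
For anyone who wants to see the three copula options side by side, here is a rough sketch in
Python (standard library only, not the R code actually used).  Everything specific in it is an
assumption made for illustration: the two quantile functions are made-up normal score
distributions tabulated on the (1:199)/200 grid rather than QR estimates, the function names are
hypothetical, the Frank parameter is found by solving the standard Kendall's tau formula
numerically for 0.27, and the win probability is a plain Monte Carlo estimate.

```python
import math
import random
from statistics import NormalDist

# Grid of quantile levels as in the post: 1/200, 2/200, ..., 199/200.
TAUS = [k / 200 for k in range(1, 200)]


def make_quantile_fn(mean, sd):
    """Tabulate a made-up score quantile function on TAUS (stand-in for a QR fit)."""
    dist = NormalDist(mean, sd)
    values = [dist.inv_cdf(t) for t in TAUS]

    def q(u):
        # Linear interpolation between grid points, clamped at the two ends.
        pos = min(max(u * 200 - 1, 0.0), len(values) - 1.0)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        return values[lo] + (pos - lo) * (values[hi] - values[lo])

    return q


def frank_kendall_tau(theta, steps=2000):
    """Kendall's tau for a Frank copula with theta > 0: 1 - (4/theta) * (1 - D1(theta))."""
    def f(t):
        return 1.0 if t == 0.0 else t / math.expm1(t)

    # Composite Simpson's rule for the Debye integral of t / (e^t - 1) on [0, theta].
    h = theta / steps
    total = f(0.0) + f(theta)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(k * h)
    debye1 = (total * h / 3) / theta
    return 1 - 4 / theta * (1 - debye1)


def frank_theta_for_tau(target, lo=1e-6, hi=50.0):
    """Bisection for theta, using the fact that tau is increasing in theta."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if frank_kendall_tau(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def draw_pair(rng, copula, theta=None):
    """Draw (u, v), the quantile levels achieved by teams i and j in one game."""
    u = rng.random()
    if copula == "comonotonic":
        return u, u  # both teams hit the same quantile
    w = rng.random()
    if copula == "independent":
        return u, w
    # Frank copula: invert the conditional distribution of v given u at level w.
    a = math.exp(-theta * u)
    v = -math.log1p(w * math.expm1(-theta) / (a - w * (a - 1))) / theta
    return u, v


def win_probability(q_i, q_j, copula, theta=None, n=20000, seed=1):
    """Monte Carlo estimate of P(team i outscores team j) under the chosen copula."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        u, v = draw_pair(rng, copula, theta)
        wins += q_i(u) > q_j(v)
    return wins / n


if __name__ == "__main__":
    q_i = make_quantile_fn(72.0, 10.0)  # hypothetical stronger team
    q_j = make_quantile_fn(68.0, 10.0)  # hypothetical weaker team
    theta = frank_theta_for_tau(0.27)
    for name in ("comonotonic", "independent", "frank"):
        print(name, win_probability(q_i, q_j, name, theta))
```

With these made-up inputs team i's quantile function sits above team j's at every tau, so the
comonotonic option returns exactly 1, while the independent and Frank options give something
strictly inside (0, 1).  The seed is fixed only so that repeated runs agree.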