HurdleDMR

HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multiple Regression (HDMR), as described in:

Kelly, Bryan, Asaf Manela, and Alan Moreira (2018). Text Selection. Working paper.

It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of Taddy (2015).

This tutorial explains how to use this package.

Setup

Install Julia

First, install Julia itself. The easiest way is to download an installer from https://julialang.org/downloads/. Alternatively, you can install JuliaPro from https://juliacomputing.com

Once installed, open julia in a terminal (or in Atom) and add the following packages:

In [ ]:
Pkg.clone("https://github.com/AsafManela/Lasso.jl")
Pkg.clone("https://github.com/AsafManela/HurdleDMR.jl")

Add parallel workers and make the package available to them

In [1]:
addprocs(Sys.CPU_CORES-2)                      # add local worker processes, leaving two cores free
import HurdleDMR; @everywhere using HurdleDMR  # load the package on the master and all workers

Example Data

Set up your data as an n-by-p covars matrix and a (sparse) n-by-d counts matrix. Here we generate some random data.

In [2]:
using CSV, GLM, DataFrames, Distributions
n = 100
p = 3
d = 4

srand(13)                                    # fix the random seed for reproducibility
m = 1+rand(Poisson(5),n)                     # total count per observation (at least 1)
covars = rand(n,p)                           # n-by-p covariates matrix
ηfn(vi) = exp.([0 + i*sum(vi) for i=1:d])    # exponentiated linear index for each of the d categories
q = [ηfn(covars[i,:]) for i=1:n]             # unnormalized category weights per observation
scale!.(q,ones(n)./sum.(q))                  # normalize each q[i] to a probability vector
# draw a sparse n-by-d counts matrix: multinomial draws with probabilities q[i] and total m[i]
counts = convert(SparseMatrixCSC{Float64,Int},hcat(broadcast((qi,mi)->rand(Multinomial(mi, qi)),q,m)...)')
covarsdf = DataFrame(covars,[:vy, :v1, :v2])
Out[2]:
Row  vy          v1        v2
1    0.693073    0.877116  0.401554
2    0.938163    0.737491  0.997271
3    0.755878    0.743268  0.595892
4    0.191058    0.296443  0.30533
5    0.00753542  0.360474  0.335553
6    0.410974    0.773871  0.657641
7    0.279942    0.154284  0.321258
8    0.208454    0.849653  0.22147
9    0.639872    0.926706  0.444675
10   0.269132    0.83785   0.0137366
11   0.704959    0.120137  0.401541
12   0.820248    0.379542  0.704862
13   0.752849    0.745383  0.907775
14   0.634401    0.383528  0.276991
15   0.370604    0.595542  0.0999965
16   0.400454    0.596132  0.00424357
17   0.331037    0.777271  0.963936
18   0.109256    0.45404   0.873842
19   0.0384627   0.358023  0.369017
20   0.655115    0.984853  0.284056
21   0.961148    0.425115  0.836061
22   0.313693    0.631212  0.33691
23   0.468663    0.203475  0.0895971
24   0.130995    0.416474  0.25323
25   0.510355    0.418123  0.542134
26   0.763879    0.501635  0.257008
27   0.139297    0.337539  0.143543
28   0.315356    0.838806  0.0502037
29   0.359166    0.992776  0.882517
30   0.511267    0.983282  0.976795
⋮

Distributed Multinomial Regression (DMR)

The Distributed Multinomial Regression (DMR) model of Taddy (2015) is a highly scalable approximation to the Multinomial using distributed (independent, parallel) Poisson regressions, one for each of the d categories (columns) of a large counts matrix, on the covars.
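
Concretely, Taddy's approximation replaces the joint multinomial likelihood with independent Poisson likelihoods that share a per-observation intensity term. A sketch in our notation (see Taddy, 2015, for the exact treatment): the multinomial model

$$c_i \sim \mathrm{MN}(q_i, m_i), \qquad q_{ij} = \frac{e^{\eta_{ij}}}{\sum_k e^{\eta_{ik}}}, \qquad \eta_{ij} = \alpha_j + x_i'\varphi_j$$

is approximated by d independent Poisson regressions,

$$c_{ij} \sim \mathrm{Po}\big(e^{\mu_i + \eta_{ij}}\big),$$

where $\mu_i$ absorbs each observation's total count (e.g. the plug-in estimate $\hat{\mu}_i = \log m_i$), so the d regressions decouple and can run in parallel.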

To fit a DMR:

In [3]:
m = dmr(covars, counts)
INFO: fitting 100 observations on 4 categories, 3 covariates 
INFO: distributed poisson run on local cluster with 18 nodes
Out[3]:
HurdleDMR.DMRCoefs([-0.277378 -1.10612 -1.18168 -0.746987; -3.94502 -1.7695 -0.387265 0.355981; -2.83456 -1.82878 0.0 0.212385; -2.72819 -0.983448 -0.346102 0.247442], true, 100, 4, 3)

or with a dataframe and formula

In [4]:
mf = @model(c ~ vy + v1 + v2)
m = fit(DMR, mf, covarsdf, counts)
INFO: fitting 100 observations on 4 categories, 3 covariates 
INFO: distributed poisson run on local cluster with 18 nodes
Out[4]:
HurdleDMR.DMRCoefs([-0.277378 -1.10612 -1.18168 -0.746987; -3.94502 -1.7695 -0.387265 0.355981; -2.83456 -1.82878 0.0 0.212385; -2.72819 -0.983448 -0.346102 0.247442], true, 100, 4, 3)

In either case, we can get the matrix of coefficients (one row for the intercept and one per covariate) as usual with

In [5]:
coef(m)
Out[5]:
4×4 SharedArray{Float64,2}:
 -0.277378  -1.10612   -1.18168   -0.746987
 -3.94502   -1.7695    -0.387265   0.355981
 -2.83456   -1.82878    0.0        0.212385
 -2.72819   -0.983448  -0.346102   0.247442

By default we only return the AICc-optimal coefficients. To also get back the entire regularization paths, run

In [6]:
paths = fit(DMRPaths, mf, covarsdf, counts)
INFO: fitting 100 observations on 4 categories, 3 covariates 
INFO: distributed poisson run on remote cluster with 18 nodes
Out[6]:
HurdleDMR.DMRPaths(Nullable{Lasso.GammaLassoPath}[Poisson GammaLassoPath (56) solutions for 4 predictors in 330 iterations):
                λ   pct_dev ncoefs
 [1]     0.108369       0.0      0
 [2]     0.098742 0.0275025      1
 [3]      0.08997 0.0509363      1
 [4]    0.0819773 0.0793996      2
 [5]    0.0746947  0.112825      3
 [6]     0.068059  0.148402      3
 [7]    0.0620129  0.179008      3
 [8]    0.0565038    0.2054      3
 [9]    0.0514842  0.228196      3
[10]    0.0469105  0.247908      3
[11]    0.0427431  0.264961      3
[12]    0.0389459  0.279718      3
[13]     0.035486  0.292484      3
[14]    0.0323336  0.303522      3
[15]    0.0294611   0.31306      3
[16]    0.0268439  0.321293      3
[17]    0.0244591  0.328392      3
[18]    0.0222863  0.334505      3
[19]    0.0203064  0.339762      3
[20]    0.0185024  0.344276      3
[21]    0.0168587  0.348147      3
[22]    0.0153611  0.351461      3
[23]    0.0139964  0.354294      3
[24]     0.012753  0.356711      3
[25]    0.0116201  0.358771      3
[26]    0.0105878  0.360524      3
[27]   0.00964719  0.362013      3
[28]   0.00879016  0.363277      3
[29]   0.00800927  0.364347      3
[30]   0.00729775  0.365253      3
[31]   0.00664944  0.366018      3
[32]   0.00605872  0.366664      3
[33]   0.00552048  0.367208      3
[34]   0.00503005  0.367667      3
[35]    0.0045832  0.368052      3
[36]   0.00417604  0.368376      3
[37]   0.00380505  0.368648      3
[38]   0.00346702  0.368876      3
[39]   0.00315902  0.369068      3
[40]   0.00287838  0.369228      3
[41]   0.00262267  0.369362      3
[42]   0.00238968  0.369474      3
[43]   0.00217739  0.369568      3
[44]   0.00198396  0.369646      3
[45]   0.00180771  0.369711      3
[46]   0.00164712  0.369766      3
[47]   0.00150079  0.369811      3
[48]   0.00136746  0.369849      3
[49]   0.00124598  0.369881      3
[50]   0.00113529  0.369907      3
[51]   0.00103444  0.369929      3
[52]   0.00094254  0.369948      3
[53]  0.000858808  0.369963      3
[54]  0.000782513  0.369975      3
[55]  0.000712997  0.369986      3
[56]  0.000649656  0.369995      3
, Poisson GammaLassoPath (48) solutions for 4 predictors in 243 iterations):
               λ   pct_dev ncoefs
 [1]    0.130638       0.0      0
 [2]    0.119032 0.0262986      2
 [3]    0.108458 0.0497641      2
 [4]   0.0988228 0.0695382      2
 [5]   0.0900437 0.0882711      3
 [6]   0.0820444  0.105895      3
 [7]   0.0747558  0.120782      3
 [8]   0.0681147  0.133357      3
 [9]   0.0620636  0.143978      3
[10]     0.05655  0.152945      3
[11]   0.0515263  0.160513      3
[12]   0.0469488  0.166898      3
[13]    0.042778  0.172277      3
[14]   0.0389778  0.176809      3
[15]   0.0355151  0.180623      3
[16]     0.03236  0.183831      3
[17]   0.0294852  0.186527      3
[18]   0.0268659  0.188791      3
[19]   0.0244792  0.190691      3
[20]   0.0223045  0.192285      3
[21]    0.020323  0.193621      3
[22]   0.0185176  0.194739      3
[23]   0.0168725  0.195676      3
[24]   0.0153736  0.196459      3
[25]   0.0140079  0.197114      3
[26]   0.0127635  0.197661      3
[27]   0.0116296  0.198118      3
[28]   0.0105964    0.1985      3
[29]  0.00965509  0.198818      3
[30]  0.00879736  0.199084      3
[31]  0.00801582  0.199305      3
[32]  0.00730372   0.19949      3
[33]  0.00665488  0.199643      3
[34]  0.00606368  0.199771      3
[35]    0.005525  0.199878      3
[36]  0.00503417  0.199967      3
[37]  0.00458695  0.200041      3
[38]  0.00417946  0.200102      3
[39]  0.00380817  0.200154      3
[40]  0.00346986  0.200196      3
[41]  0.00316161  0.200231      3
[42]  0.00288074  0.200261      3
[43]  0.00262482  0.200285      3
[44]  0.00239164  0.200306      3
[45]  0.00217917  0.200323      3
[46]  0.00198558  0.200337      3
[47]  0.00180919  0.200348      3
[48]  0.00164846  0.200358      3
, Poisson GammaLassoPath (44) solutions for 4 predictors in 216 iterations):
               λ   pct_dev ncoefs
 [1]     0.23463       0.0      0
 [2]    0.213786 0.0129647      2
 [3]    0.194794 0.0239732      2
 [4]    0.177489 0.0331495      2
 [5]    0.161721 0.0407988      2
 [6]    0.147354 0.0471745      2
 [7]    0.134264 0.0524878      2
 [8]    0.122336 0.0569149      2
 [9]    0.111468 0.0606031      2
[10]    0.101566 0.0636751      2
[11]   0.0925429 0.0662334      2
[12]   0.0843216 0.0683634      2
[13]   0.0768307 0.0701365      2
[14]   0.0700053 0.0720438      3
[15]   0.0637862 0.0738808      3
[16]   0.0581196 0.0754093      3
[17]   0.0529564 0.0766811      3
[18]   0.0482519 0.0777391      3
[19]   0.0439654  0.078619      3
[20]   0.0400596 0.0793508      3
[21]   0.0365008 0.0799593      3
[22]   0.0332582 0.0804653      3
[23]   0.0303036 0.0808859      3
[24]   0.0276115 0.0812355      3
[25]   0.0251586 0.0815261      3
[26]   0.0229236 0.0817676      3
[27]   0.0208871 0.0819683      3
[28]   0.0190316  0.082135      3
[29]   0.0173408 0.0822736      3
[30]   0.0158003 0.0823887      3
[31]   0.0143967 0.0824843      3
[32]   0.0131177 0.0825638      3
[33]   0.0119524 0.0826298      3
[34]   0.0108906 0.0826846      3
[35]  0.00992307 0.0827301      3
[36]  0.00904153 0.0827679      3
[37]  0.00823831 0.0827994      3
[38]  0.00750644 0.0828254      3
[39]  0.00683959 0.0828471      3
[40]  0.00623198 0.0828651      3
[41]  0.00567835   0.08288      3
[42]   0.0051739 0.0828924      3
[43]  0.00471426 0.0829028      3
[44]  0.00429546 0.0829113      3
, Poisson GammaLassoPath (49) solutions for 4 predictors in 232 iterations):
               λ   pct_dev ncoefs
 [1]    0.471659       0.0      0
 [2]    0.429758  0.023192      1
 [3]     0.39158 0.0526047      2
 [4]    0.356793 0.0795194      2
 [5]    0.325096  0.101828      2
 [6]    0.296216  0.120322      2
 [7]    0.269901  0.140525      3
 [8]    0.245923  0.159758      3
 [9]    0.224076  0.175704      3
[10]     0.20417  0.188926      3
[11]    0.186032  0.199892      3
[12]    0.169505  0.208987      3
[13]    0.154447  0.216532      3
[14]    0.140726  0.222791      3
[15]    0.128225  0.227984      3
[16]    0.116834  0.232293      3
[17]    0.106454  0.235868      3
[18]   0.0969973  0.238835      3
[19]   0.0883803  0.241297      3
[20]   0.0805288  0.243341      3
[21]   0.0733749  0.245037      3
[22]   0.0668565  0.246444      3
[23]   0.0609171  0.247613      3
[24]   0.0555054  0.248583      3
[25]   0.0505745  0.249387      3
[26]   0.0460816  0.250056      3
[27]   0.0419878   0.25061      3
[28]   0.0382577   0.25107      3
[29]    0.034859  0.251453      3
[30]   0.0317622   0.25177      3
[31]   0.0289406  0.252033      3
[32]   0.0263696  0.252252      3
[33]    0.024027  0.252433      3
[34]   0.0218925  0.252584      3
[35]   0.0199476  0.252709      3
[36]   0.0181755  0.252813      3
[37]   0.0165609  0.252899      3
[38]   0.0150896   0.25297      3
[39]   0.0137491   0.25303      3
[40]   0.0125277  0.253079      3
[41]   0.0114148   0.25312      3
[42]   0.0104007  0.253154      3
[43]  0.00947673  0.253182      3
[44]  0.00863484  0.253206      3
[45]  0.00786775  0.253225      3
[46]   0.0071688  0.253241      3
[47]  0.00653194  0.253255      3
[48]  0.00595166  0.253266      3
[49]  0.00542293  0.253275      3
], true, 100, 4, 3)

We can now select, for example, the coefficients that minimize cross-validated MSE (this takes a while):

In [7]:
coef(paths; select=:CVmin)
Out[7]:
4×4 Array{Float64,2}:
 -1.04233  -1.27898   -1.1979    -0.733303
 -3.08667  -1.61234   -0.368288   0.346255
 -2.03017  -1.65467    0.0        0.202824
 -1.92495  -0.857511  -0.329399   0.240294

Hurdle Distributed Multiple Regression (HDMR)

For highly sparse counts, as is often the case with text that is selected for various reasons, the Hurdle Distributed Multiple Regression (HDMR) model of Kelly, Manela, and Moreira (2018) may be superior to the DMR. It approximates a higher-dispersion multinomial using distributed (independent, parallel) Hurdle regressions, one for each of the d categories (columns) of a large counts matrix, on the covars. It allows potentially different sets of covariates to explain category inclusion ($h = 1\{c > 0\}$) and repetition ($c > 0$).
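
Loosely, for each category j the hurdle factors the likelihood into a binary inclusion part and a zero-truncated count part. A sketch in generic hurdle notation (not necessarily the paper's exact parameterization):

$$P(c_{ij} = c) = \begin{cases} 1 - p_{ij}, & c = 0 \\ p_{ij}\, \dfrac{\mathrm{Po}(c;\, \lambda_{ij})}{1 - \mathrm{Po}(0;\, \lambda_{ij})}, & c > 0 \end{cases}$$

where the inclusion probability $p_{ij}$ depends on the covariates in the zeros model (inzero) and the repetition rate $\lambda_{ij} = e^{\alpha_j + x_i'\varphi_j}$ depends on the covariates in the positives model (inpos), so the two parts can be estimated separately, each with its own covariates.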

Both the model for zeros and the model for positive counts are regularized by default using GammaLassoPath, picking the AICc-optimal segment of each regularization path.

An HDMR can be fitted with:

In [8]:
m = hdmr(covars, counts; inpos=1:2, inzero=1:3)
INFO: fitting 100 observations on 4 categories 
2 covariates for positive and 3 for zero counts
INFO: distributed hurdle run on local cluster with 18 nodes
Out[8]:
HurdleDMR.HDMRCoefs([-2.97869 -3.39104 -1.37637 -0.690629; 0.0 0.0 -0.724477 0.426105; 0.0 0.0 0.0 0.244869], [0.629668 -0.00316568 -0.521496 2.98134; -4.68583 -2.4084 0.0 0.0; -3.37301 -2.30744 0.0 0.0; -3.37706 -1.50431 0.0 0.0], true, 100, 4, 1:2, 1:3)

or with a dataframe and formula

In [9]:
mf = @model(h ~ vy + v1 + v2, c ~ vy + v1)
m = fit(HDMR, mf, covarsdf, counts)
INFO: fitting 100 observations on 4 categories 
2 covariates for positive and 3 for zero counts
INFO: distributed hurdle run on local cluster with 18 nodes
Out[9]:
HurdleDMR.HDMRCoefs([-2.97869 -3.39104 -1.37637 -0.690629; 0.0 0.0 -0.724477 0.426105; 0.0 0.0 0.0 0.244869], [0.629668 -0.00316568 -0.521496 2.98134; -4.68583 -2.4084 0.0 0.0; -3.37301 -2.30744 0.0 0.0; -3.37706 -1.50431 0.0 0.0], true, 100, 4, [1, 2], [1, 2, 3])

where the h ~ equation is the model for zeros (hurdle crossing) and the c ~ equation is the model for positive counts.

In either case, we can get the positive-counts and zeros coefficient matrices (each with one row for the intercept and one per covariate) as usual with

In [10]:
coefspos, coefszero = coef(m)
Out[10]:
([-2.97869 -3.39104 -1.37637 -0.690629; 0.0 0.0 -0.724477 0.426105; 0.0 0.0 0.0 0.244869], [0.629668 -0.00316568 -0.521496 2.98134; -4.68583 -2.4084 0.0 0.0; -3.37301 -2.30744 0.0 0.0; -3.37706 -1.50431 0.0 0.0])

By default we only return the AICc-optimal coefficients. To also get back the entire regularization paths, run

In [11]:
paths = fit(HDMRPaths, mf, covarsdf, counts)

coef(paths; select=:all)
INFO: fitting 100 observations on 4 categories 
2 covariates for positive and 3 for zero counts
INFO: distributed hurdle run on remote cluster with 18 nodes
Out[11]:
([-2.97869 0.0 0.0; -2.93456 -0.15307 0.0; … ; 0.0 0.0 0.0; 0.0 0.0 0.0]

[-3.39104 0.0 0.0; -3.29061 0.0 -0.251553; … ; -1.04786 -2.52399 -5.36357; -1.04459 -2.52915 -5.37187]

[-1.68781 0.0 0.0; -1.65155 -0.0792342 0.0; … ; 0.0 0.0 0.0; 0.0 0.0 0.0]

[-0.352618 0.0 0.0; -0.370878 0.037995 0.0; … ; 0.0 0.0 0.0; 0.0 0.0 0.0], [-3.87231 0.0 0.0 0.0; -3.72971 -0.30193 0.0 0.0; … ; 0.62351 -4.679 -3.36716 -3.37106; 0.629668 -4.68583 -3.37301 -3.37706]

[-2.86015 0.0 0.0 0.0; -2.74269 -0.19503 -0.0481219 0.0; … ; 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0]

[-0.521496 0.0 0.0 0.0; -0.448339 0.0 0.0 -0.13552; … ; 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0]

[2.98134 0.0 0.0 0.0; 2.86604 0.0 0.248068 0.0; … ; 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0])

Sufficient reduction projection

A sufficient reduction projection summarizes the counts, much like a sufficient statistic, and is useful for reducing the d dimensional counts in a potentially much lower dimension matrix z.
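
In rough notation (a sketch; see the paper for the exact definitions), for a covariate of interest with positive-part loadings $\varphi$ and zero-part loadings $\kappa$ across the d categories, the two SR projections are count-weighted averages of those loadings:

$$z_i^{+} = \frac{\varphi' c_i}{m_i}, \qquad z_i^{0} = \frac{\kappa' h_i}{\sum_j h_{ij}}, \qquad h_{ij} = 1\{c_{ij} > 0\}, \qquad m_i = \sum_j c_{ij}.$$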

To get a sufficient reduction projection in the direction of vy (covariate 1 in both the positives and zeros models) for the above example:

In [12]:
z = srproj(m,counts,1,1)
Out[12]:
100×3 Array{Float64,2}:
  0.311047    0.0       10.0
  0.311047    0.0       10.0
  0.426105    0.0        3.0
 -0.260452   -0.802801   9.0
  0.163324   -1.56194    6.0
  0.261736    0.0        7.0
  0.304361   -1.2042     7.0
 -0.0670014   0.0        7.0
  0.426105    0.0        8.0
  0.17042     0.0        9.0
  0.298263    0.0        9.0
  0.426105    0.0        4.0
  0.13846     0.0        4.0
  ⋮                         
 -0.119349   -1.56194    5.0
  0.426105    0.0        4.0
 -0.149186    0.0        4.0
  0.0923065  -0.802801   6.0
  0.110768   -0.802801   5.0
  0.13846     0.0        4.0
  0.11231     0.0       11.0
  0.0364953  -0.802801   7.0
 -0.0586262  -0.802801   8.0
  0.426105    0.0        2.0
  0.0923065  -1.77356    6.0
  0.261736    0.0        7.0

Here, the first column is the SR projection from the model for positive counts, the second is the SR projection from the model for hurdle crossing (zeros), and the third is the total count for each observation.
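
These projections can then serve as low-dimensional regressors in a forward regression. A minimal sketch (not part of the original tutorial; the column names zpos, zzero, and mtot are labels we assign ourselves):

In [ ]:
# attach the SR projection columns to the covariates dataframe
fwddf = hcat(covarsdf, DataFrame(zpos=z[:,1], zzero=z[:,2], mtot=z[:,3]))
# forward regression of vy on the projections and the remaining covariates
lm(@formula(vy ~ zpos + zzero + mtot + v1 + v2), fwddf)

The Counts Inverse Regression described next wraps exactly this kind of two-stage (backward, then forward) procedure in a single fit command.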

Counts Inverse Regression (CIR)

Counts inverse regression allows us to predict a covariate using the counts and the other covariates. Here we use HDMR for the backward regression and another model for the forward regression. This can be accomplished with a single command by fitting a CIR{HDMR,FM}, where the forward model is FM <: RegressionModel.

In [13]:
cir = fit(CIR{HDMR,LinearModel},mf,covarsdf,counts,:vy; nocounts=true)
INFO: fitting 100 observations on 4 categories 
2 covariates for positive and 3 for zero counts
INFO: distributed hurdle run on local cluster with 18 nodes
Out[13]:
HurdleDMR.CIR{HurdleDMR.HDMR,GLM.LinearModel}(1, [1, 2], HurdleDMR.HDMRCoefs([-2.97869 -3.39104 -1.37637 -0.690629; 0.0 0.0 -0.724477 0.426105; 0.0 0.0 0.0 0.244869], [0.629668 -0.00316568 -0.521496 2.98134; -4.68583 -2.4084 0.0 0.0; -3.37301 -2.30744 0.0 0.0; -3.37706 -1.50431 0.0 0.0], true, 100, 4, [1, 2], [1, 2, 3]), GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
       Estimate Std.Error   t value Pr(>|t|)
x1     0.596995  0.108965   5.47876    <1e-6
x2    -0.165407 0.0953801  -1.73418   0.0862
x3    -0.059985 0.0933614 -0.642503   0.5221
x4     0.283205  0.126589   2.23721   0.0276
x5     0.160959 0.0471665   3.41257   0.0010
x6   0.00293183 0.0116717  0.251192   0.8022

, GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
       Estimate Std.Error   t value Pr(>|t|)
x1     0.456718 0.0713401   6.40198    <1e-8
x2   -0.0414989 0.0975448 -0.425434   0.6715
x3    0.0921372 0.0909831   1.01269   0.3137

)

where nocounts=true means we also fit a benchmark model without the counts.

We can get the forward and backward model coefficients with:

In [14]:
coefbwd(cir)
Out[14]:
([-2.97869 -3.39104 -1.37637 -0.690629; 0.0 0.0 -0.724477 0.426105; 0.0 0.0 0.0 0.244869], [0.629668 -0.00316568 -0.521496 2.98134; -4.68583 -2.4084 0.0 0.0; -3.37301 -2.30744 0.0 0.0; -3.37706 -1.50431 0.0 0.0])
In [15]:
coeffwd(cir)
Out[15]:
6-element Array{Float64,1}:
  0.596995  
 -0.165407  
 -0.059985  
  0.283205  
  0.160959  
  0.00293183

The fitted model can be used to predict vy with new data

In [16]:
yhat = predict(cir, covarsdf[1:10,:], counts[1:10,:])
Out[16]:
10-element Array{Union{Float64, Missings.Missing},1}:
 0.545235
 0.532596
 0.567779
 0.353053
 0.329678
 0.524191
 0.465097
 0.444719
 0.561167
 0.532235

We can also predict using only the other covariates, which in this case amounts to a simple linear regression:

In [17]:
yhat_nocounts = predict(cir, covarsdf[1:10,:], counts[1:10,:]; nocounts=true)
Out[17]:
10-element Array{Union{Float64, Missings.Missing},1}:
 0.457317
 0.517999
 0.480777
 0.472548
 0.472676
 0.485196
 0.479915
 0.441864
 0.459232
 0.423214

Kelly, Manela, and Moreira (2018) show that the differences between DMR and HDMR can be substantial in some cases, especially when the counts are highly sparse.

Please refer to the paper for additional details and example applications.