HurdleDMR from R

HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multinomial Regression (HDMR), as described in:

Bryan Kelly, Asaf Manela & Alan Moreira (2021). Text Selection. Journal of Business & Economic Statistics (ungated preprint).

It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of Taddy (2015).

This tutorial explains how to use this package from R via the JuliaCall package that is available on CRAN.

Setup

Install Julia

First, install Julia itself. The easiest way to do that is to get the latest stable release from the official download page. An alternative is to install JuliaPro.

Once installed, open julia in a terminal, press ] to enter the package manager, and add the following packages:

pkg> add RCall HurdleDMR GLM Lasso

The JuliaCall package for R

Now, back to R.
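Install it from CRAN:

```r
install.packages("JuliaCall")
```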

Load the JuliaCall library and setup julia
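A minimal sketch (the jl shorthand is our own alias; julia_setup() locates Julia and may take a while on first run):

```r
library(JuliaCall)
julia <- julia_setup()   # starts an embedded Julia session
jl <- julia_command      # shorthand for evaluating Julia code from R
```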

jl allows us to evaluate julia code.
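For example:

```r
jl('println("Hello from Julia ", VERSION)')
```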

Example Data

The data should consist of either an n-by-p covars matrix or a DataFrame containing the covariates, together with a (sparse) n-by-d counts matrix.

For illustration, we'll analyse the State of the Union Addresses, which are roughly annual, and relate them to stock market returns.

The sotu.jl script compiles stock market excess returns and the State of the Union Address texts into a matching DataFrame covarsdf and a sparse document-term matrix counts.
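The script itself runs on the Julia side and is not reproduced here. To keep this walkthrough self-contained, the sketch below fabricates toy R stand-ins with the same shapes (random data, not the real SOTU series; Rem stands in for the market excess return):

```r
library(Matrix)
set.seed(13)
n <- 100   # number of documents (years)
d <- 50    # number of terms
covarsdf <- data.frame(Rem = rnorm(n))   # toy stand-in for excess returns
# toy stand-in for the sparse document-term matrix (a dgCMatrix)
counts <- rsparsematrix(n, d, density = 0.1,
                        rand.x = function(k) rpois(k, 2) + 1)
```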

Add parallel workers and make the HurdleDMR package available to workers
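A sketch following the pattern in the package README (the worker count is arbitrary; adjust it to your machine):

```r
jl("using Distributed")
jl("addprocs(4)")                   # add 4 local worker processes
jl("import HurdleDMR")
jl("@everywhere using HurdleDMR")   # make the package available on all workers
```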

First we need to convert the R sparseMatrix counts to Julia. We do this in pieces, because materializing the entire dense matrix representation could require more memory than we have.
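One way is to ship the triplet (i, j, x) representation of the dgCMatrix and rebuild the sparse matrix on the Julia side; a sketch (variable names are ours):

```r
trip <- summary(counts)   # triplet representation: columns i, j, x
julia_assign("cnt_i", as.integer(trip$i))
julia_assign("cnt_j", as.integer(trip$j))
julia_assign("cnt_x", trip$x)
julia_assign("n", nrow(counts))
julia_assign("d", ncol(counts))
jl("using SparseArrays")
jl("counts = sparse(Int.(cnt_i), Int.(cnt_j), cnt_x, Int(n), Int(d))")
# The covariates data.frame converts directly to a Julia DataFrame
# (JuliaCall needs the DataFrames Julia package for this).
julia_assign("covarsdf", covarsdf)
```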

Distributed Multinomial Regression (DMR)

The Distributed Multinomial Regression (DMR) model of Taddy (2015) is a highly scalable approximation to the Multinomial, using distributed (independent, parallel) Poisson regressions, one for each of the d categories (columns) of a large counts matrix, on the covariates.

To fit a DMR:
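With the covariates as a plain matrix, a sketch (assumes covarsdf contains only numeric columns):

```r
jl("covars = Matrix(covarsdf)")
jl("m = fit(DMR, covars, counts; parallel=true)")
```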

or with a DataFrame and formula, using the R data.frame we already converted to Julia above.
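A sketch using the package's @model macro, with the SOTU covariate Rem:

```r
jl("mf = fit(DMR, @model(c ~ Rem), covarsdf, counts; parallel=true)")
```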

We can get the coefficients matrix for each variable + intercept as usual.
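A sketch that pulls the matrix back into R:

```r
coefs <- julia_eval("coef(mf)")   # (p+1)-by-d matrix: intercept + covariates per category
```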

By default we only return the AICc-minimizing coefficients. To also get back the entire regularization paths, fit the paths variant of the model.
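A sketch, assuming the DMRPaths estimator as in the package README:

```r
jl("paths = fit(DMRPaths, @model(c ~ Rem), covarsdf, counts; parallel=true)")
```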

We can now select, for example, the coefficients that minimize 10-fold cross-validated MSE (this takes a little longer).
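The selector syntax depends on the installed HurdleDMR and Lasso.jl versions; with recent versions it should look something like this sketch:

```r
jl("using Lasso")   # for segment selectors such as MinCVmse
julia_eval("coef(paths; select=MinCVKfold{MinCVmse}(10))")
```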

Hurdle Distributed Multinomial Regression (HDMR)

For highly sparse counts, as is often the case with text that is selected for various reasons, the Hurdle Distributed Multinomial Regression (HDMR) model of Kelly, Manela, and Moreira (2021) may be superior to the DMR. It approximates a higher-dispersion Multinomial using distributed (independent, parallel) Hurdle regressions, one for each of the d categories (columns) of a large counts matrix, on the covars. It allows potentially different sets of covariates to explain category inclusion ($h = 1\{c > 0\}$) and repetition ($c > 0$), via the optional inpos and inzero keyword arguments.

Both the model for zeros and the model for positive counts are regularized by default, using GammaLassoPath, selecting the AICc-optimal segment of the regularization path.

HDMR can be fitted:
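A sketch with the formula interface, where the h ~ equation models inclusion (zeros) and the c ~ equation models repetition (positive counts):

```r
jl("mh = fit(HDMR, @model(h ~ Rem, c ~ Rem), covarsdf, counts; parallel=true)")
```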

We can get the coefficients matrix for each variable + intercept as usual, though now there is one set of coefficients for the repetition model and another for the inclusion model.
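A sketch:

```r
jl("coefspos, coefszero = coef(mh)")
coefspos  <- julia_eval("coefspos")    # positive-counts (repetition) model
coefszero <- julia_eval("coefszero")   # inclusion (zeros) model
```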

By default we only return the AICc-minimizing coefficients. To get the coefficients that minimize, say, the BIC criterion, we need the full regularization paths.
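A sketch (MinBIC is a segment selector defined in Lasso.jl; again, the exact syntax depends on the installed versions):

```r
jl("pathsh = fit(HDMRPaths, @model(h ~ Rem, c ~ Rem), covarsdf, counts; parallel=true)")
julia_eval("coef(pathsh; select=MinBIC())")
```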

Sufficient reduction projection

A sufficient reduction projection summarizes the counts, much like a sufficient statistic, and is useful for reducing the d-dimensional counts to a potentially much lower-dimensional matrix z.

To get a sufficient reduction projection in the direction of Rem for the above example, we use srproj.
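A sketch, where the trailing arguments give the index of the projection direction (here Rem, the first covariate) in the positive and zero models:

```r
z <- julia_eval("srproj(mh, counts, 1, 1)")   # n-by-3 matrix: [zpos zzero m]
```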

Here, the first column is the SR projection from the model for positive counts, the second is the SR projection from the model for hurdle crossing (zeros), and the third is the total count for each observation.

Counts Inverse Regression (CIR)

Counts inverse regression allows us to predict a covariate with the counts and other covariates. Here we use hdmr for the backward regression and another model for the forward regression. This can be accomplished with a single command, by fitting a CIR{HDMR,FM} where the forward model is FM <: RegressionModel.
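A sketch with a LinearModel forward model (GLM must be loaded on the Julia side; the formula and target follow the SOTU example):

```r
jl("using GLM")
jl("cir = fit(CIR{HDMR,LinearModel}, @model(h ~ Rem, c ~ Rem), covarsdf, counts, :Rem; nocounts=true)")
```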

where nocounts=true means we also fit a benchmark model without counts. The last few coefficients of the forward model are due to the text data: zpos is the SR projection summarizing the information in repeated use of terms, zzero is the SR projection summarizing the information in term inclusion, and m is the total count for each observation.

We can get the forward and backward model coefficients with
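A sketch, assuming the coeffwd and coefbwd accessors from the package README:

```r
julia_eval("coeffwd(cir)")   # forward (inverse-regression) model coefficients
julia_eval("coefbwd(cir)")   # backward (HDMR) model coefficients
```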

The fitted model can be used to predict Rem with new data.
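For example, on the first ten observations (a sketch):

```r
yhat <- julia_eval("predict(cir, covarsdf[1:10, :], counts[1:10, :])")
```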

We can also predict with only the other covariates, which in this case amounts to just a linear regression.
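A sketch:

```r
yhat_nocounts <- julia_eval("predict(cir, covarsdf[1:10, :], counts[1:10, :]; nocounts=true)")
```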

Kelly, Manela, and Moreira (2021) show that the differences between DMR and HDMR can be substantial in some cases, especially when the counts data is highly sparse.

Please reference the paper for additional details and example applications.