HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multinomial Regression (HDMR), as described in:
Bryan Kelly, Asaf Manela & Alan Moreira (2021). Text Selection, Journal of Business & Economic Statistics (ungated preprint).
It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of Taddy (2015).
This tutorial explains how to use this package.
First, install Julia itself. The easiest way to do that is to get the latest stable release from the official download page. An alternative is to install JuliaPro.
Once installed, open julia and press ] to activate the package manager, then add the package:
pkg> add HurdleDMR
To run this tutorial, which is a Jupyter notebook, you may need to add the IJulia package too.
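For example, at the same package manager prompt:
pkg> add IJulia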
This first cell installs all the packages used by this notebook in a temporary environment, to avoid interference from previously installed packages:
import Pkg
mytmpenv = mktempdir()
Pkg.activate(mytmpenv)
Pkg.add(["HurdleDMR",
"Lasso",
"CSV",
"DataDeps",
"DataFrames",
"FamaFrenchData",
"TextAnalysis",
"SparseArrays"])
Activating new environment at `/tmp/jl_FPZAVB/Project.toml`
  Updating registry at `~/.julia/registries/General`
  Updating registry at `~/.julia/registries/ManelaLabRegistry`
  Updating git-repo `git@gitlab.com:ManelaLab/ManelaLabRegistry.git`
 Resolving package versions...
  Updating `/tmp/jl_FPZAVB/Project.toml`
  [336ed68f] + CSV v0.8.2
  [124859b0] + DataDeps v0.7.6
  [a93c6f00] + DataFrames v0.22.2
  [bd2a388e] + FamaFrenchData v0.3.0
  [f9e53bcf] + HurdleDMR v1.3.2
  [b4fcebef] + Lasso v0.6.0
  [a2db99b7] + TextAnalysis v0.7.2
  [2f01184e] + SparseArrays
  Updating `/tmp/jl_FPZAVB/Manifest.toml`
  ⋮
Next, add parallel workers and make the package available to them:
using Distributed
addprocs(4)
@everywhere (import Pkg; Pkg.activate($mytmpenv)) # makes sure parallel workers use the same environment
import HurdleDMR; @everywhere using HurdleDMR
Activating environment at `/tmp/jl_FPZAVB/Project.toml`
      From worker 4:	Activating environment at `/tmp/jl_FPZAVB/Project.toml`
      From worker 5:	Activating environment at `/tmp/jl_FPZAVB/Project.toml`
      From worker 3:	Activating environment at `/tmp/jl_FPZAVB/Project.toml`
      From worker 2:	Activating environment at `/tmp/jl_FPZAVB/Project.toml`
The data should either be an n-by-p covars matrix or a DataFrame containing the covariates, and a (sparse) n-by-d counts matrix.
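For example, a minimal toy setup with random data, just to illustrate the expected shapes (not used below):

using SparseArrays
n, p, d = 100, 3, 25
toycovars = randn(n, p)                          # n-by-p covariate matrix
toycounts = sprand(n, d, 0.3, i -> rand(1:5, i)) # sparse n-by-d counts matrix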
For illustration we'll analyse the State of the Union addresses, which are roughly annual, and relate them to stock market returns. The sotu.jl script compiles stock market excess returns and the State of the Union Address texts into a matching DataFrame covarsdf and a sparse document-term matrix counts.
include("sotu.jl")
┌ Info: built SOTU corpus with 92 speeches
└ @ Main /home/user/.julia/dev/HurdleDMR/docs/src/tutorials/sotu.jl:122
┌ Info: built Document Term Matrix with bigrams occuring at least 20 times in the corpus
└ @ Main /home/user/.julia/dev/HurdleDMR/docs/src/tutorials/sotu.jl:127
┌ Info: SOTU + Rem loaded and saved to /home/user/.julia/datadeps/sotu
└ @ Main /home/user/.julia/dev/HurdleDMR/docs/src/tutorials/sotu.jl:181
(92×3 DataFrame
 Row │ Date   Rem      President
     │ Int64  Float64  String
─────┼───────────────────────────────────────
   1 │  1927    29.47  Calvin Coolidge
   2 │  1928    35.39  Calvin Coolidge
   3 │  1929   -19.54  Herbert Hoover
   4 │  1930   -31.23  Herbert Hoover
   5 │  1931   -45.11  Herbert Hoover
   6 │  1932    -9.39  Herbert Hoover
   7 │  1934     3.02  Franklin D. Roosevelt
   8 │  1935    44.96  Franklin D. Roosevelt
   9 │  1936    32.07  Franklin D. Roosevelt
  10 │  1937   -34.96  Franklin D. Roosevelt
  11 │  1938    28.48  Franklin D. Roosevelt
   ⋮ │   ⋮       ⋮               ⋮
  83 │  2010    17.37  Barack Obama
  84 │  2011     0.44  Barack Obama
  85 │  2012    16.27  Barack Obama
  86 │  2013    35.2   Barack Obama
  87 │  2014    11.71  Barack Obama
  88 │  2015     0.08  Barack Obama
  89 │  2016    13.3   Barack Obama
  90 │  2017    21.51  Donald J. Trump
  91 │  2018    -6.93  Donald J. Trump
  92 │  2019    28.28  Donald J. Trump
                         71 rows omitted,

  [1 ,   1]  =  1
  [18,   1]  =  2
  [30,   1]  =  1
  [34,   1]  =  1
  [42,   1]  =  1
  [47,   1]  =  1
  [50,   1]  =  1
  [54,   1]  =  16
  [55,   1]  =  1
  [74,   1]  =  1
  [77,   1]  =  1
  [82,   1]  =  1
  ⋮
  [49, 163]  =  2
  [51, 163]  =  2
  [53, 163]  =  2
  [54, 163]  =  3
  [57, 163]  =  1
  [64, 163]  =  1
  [69, 163]  =  2
  [72, 163]  =  1
  [77, 163]  =  1
  [85, 163]  =  2
  [88, 163]  =  1
  [89, 163]  =  2
  [92, 163]  =  2,

 ["administration congress", "al qaeda", "american families", "american family", "american leadership", "american life", "american people", "american workers", "armed forces", "army navy"  …  "war ii", "weapons mass", "welfare reform", "welfare system", "western europe", "white house", "world economy", "world peace", "world trade", "world war"])
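Text counts like these are typically very sparse. A quick check of the share of nonzero document-term cells:

using SparseArrays
nnz(counts) / length(counts)  # fraction of nonzero cells in the 92×163 matrix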
The Distributed Multinomial Regression (DMR) model of Taddy (2015) is a highly scalable approximation to the multinomial using distributed (independent, parallel) Poisson regressions, one for each of the d categories (columns) of a large counts matrix, on the covariates in covarsdf.
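To make the distributed-Poisson idea concrete, here is a rough, unregularized sketch of what dmr parallelizes, written with plain Poisson GLMs. It assumes you also add the GLM package to the environment; the actual implementation instead uses regularized fits via Lasso.jl.

using GLM
X = [ones(size(covarsdf, 1)) covarsdf.Rem]  # intercept plus the Rem covariate
mtot = vec(sum(counts, dims=2))             # document totals enter as Poisson offsets
# one independent Poisson regression per category (column) of counts
betas = [coef(glm(X, Vector(counts[:, j]), Poisson(); offset=log.(mtot)))
         for j in 1:size(counts, 2)]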
size(counts)
(92, 163)
covars = Matrix(covarsdf[:, ["Rem"]])
size(covars)
(92, 1)
To fit a DMR:
m = dmr(covars, counts)
┌ Info: fitting 92 observations on 163 categories, 1 covariates
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/dmr.jl:302
┌ Info: distributed poisson run on local cluster with 4 nodes
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/dmr.jl:314
DMRCoefs{Array{Float64,2},MinAICc}([-5.433977770164587 -5.169034226440705 … -5.37899826953734 -4.430372688887225; -0.03460506144624849 -0.017470711240303454 … 0.007352225999870346 0.004405577373050878], true, 92, 163, 1, MinAICc(2))
or with a dataframe and formula:
mf = @model(c ~ Rem)
m = fit(DMR, mf, covarsdf, counts)
┌ Info: fitting 92 observations on 163 categories, 1 covariates
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/dmr.jl:302
┌ Info: distributed poisson run on local cluster with 4 nodes
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/dmr.jl:314
TableCountsRegressionModel{DMRCoefs{Array{Float64,2},MinAICc},DataFrame,SparseMatrixCSC{Int64,Int64}} 1-part model: [c ~ Rem]
In either case, we can get the coefficient matrix for each variable plus the intercept as usual with
coef(m)
2×163 Array{Float64,2}:
 -5.43398    -5.16903    -5.48346  -5.87732    …  -5.379       -4.43037
 -0.0346051  -0.0174707   0.0       0.0118823      0.00735223   0.00440558
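Each column corresponds to one term: the first row is the intercept and the second row is the Rem loading. For example, to list the bigrams whose usage rises most with Rem, a sketch assuming sotu.jl also leaves the bigram list (the last element of the tuple shown above) in a vector, here called terms:

beta = coef(m)                       # 2×163: intercept row, then the Rem row
ix = sortperm(beta[2, :]; rev=true)  # order terms by their Rem coefficient
terms[ix[1:5]]                       # five bigrams most positively related to Rem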
By default we only return the AICc-optimal coefficients. To also get back the entire regularization paths, run
paths = fit(DMRPaths, mf, covarsdf, counts)
┌ Info: fitting 92 observations on 163 categories, 1 covariates
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/dmr.jl:362
┌ Info: distributed poisson run on remote cluster with 4 nodes
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/dmr.jl:384
TableCountsRegressionModel{DMRPaths,DataFrame,SparseMatrixCSC{Int64,Int64}} 1-part model: [c ~ Rem]
We can now select, for example, the coefficients that minimize the 10-fold cross-validated MSE (this may take a while):
coef(paths, MinCVKfold{MinCVmse}(10))
2×163 Array{Float64,2}:
 -5.44837      -5.23215  -5.48346  -5.87732    …  -5.31916     -4.39683
 -5.11703e-14   0.0       0.0       0.0118823      7.62288e-10  7.83015e-10
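Other segment selectors from the Lasso package plug in the same way; for example, a sketch selecting the BIC-minimizing segment instead of cross-validation (assuming Lasso's MinBIC selector):

using Lasso
coef(paths, MinBIC())  # BIC-minimizing segment of each path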
For highly sparse counts, as is often the case with text that is selected for various reasons, the Hurdle Distributed Multinomial Regression (HDMR) model of Kelly, Manela, and Moreira (2021) may be superior to the DMR. It approximates a higher-dispersion multinomial using distributed (independent, parallel) Hurdle regressions, one for each of the d categories (columns) of a large counts matrix, on the covars. It allows potentially different sets of covariates to explain category inclusion ($h = 1\{c > 0\}$) and repetition ($c > 0$), via the optional inpos and inzero keyword arguments.
Both the model for zeros and the model for positive counts are regularized by default using GammaLassoPath, picking the AICc-optimal segment of the regularization path.
To fit an HDMR:
m = hdmr(covars, counts)
┌ Info: fitting 92 observations on 163 categories
│ 1 covariates for positive and 1 for zero counts
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/hdmr.jl:498
┌ Info: distributed InclusionRepetition run on local cluster with 4 nodes
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/hdmr.jl:511
HDMRCoefs{InclusionRepetition,Array{Float64,2},MinAICc,UnitRange{Int64}}([-4.10095407845034 -2.463114021079464 … -4.9712845243783885 -3.9926360951822044; -0.02713335949136956 -0.03288064802519407 … 0.016349965746303912 0.025092391911020286], [-0.5918418471275423 -0.9311260670603407 … 0.15429960586669478 1.1944539960204115; 0.0 0.0032977922250339766 … 0.005415281988535716 -0.0033636080585655497], true, 92, 163, 1:1, 1:1, MinAICc(2))
or with a dataframe and formula
mf = @model(h ~ Rem + President, c ~ Rem + President)
m = fit(HDMR, mf, covarsdf, counts)
┌ Info: fitting 92 observations on 163 categories
│ 16 covariates for positive and 16 for zero counts
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/hdmr.jl:498
┌ Info: distributed InclusionRepetition run on local cluster with 4 nodes
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/hdmr.jl:511
TableCountsRegressionModel{HDMRCoefs{InclusionRepetition,Array{Float64,2},MinAICc,Array{Int64,1}},DataFrame,SparseMatrixCSC{Int64,Int64}} 2-part model: [h ~ Rem + President, c ~ Rem + President]
where the h ~ equation is the model for zeros (hurdle crossing) and the c ~ equation is the model for positive counts.
In either case, we can get the coefficient matrices for each variable plus the intercept as usual with
coefspos, coefszero = coef(m)
([-7.505779153456341 -3.282723357305929 … 0.0 -3.8422483146788986; 0.0 0.0 … 0.0 0.007227060912341828; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [-0.5918418471274728 2.2627069688748356 … 0.0 1.182793857457826; 0.0 0.0 … 0.0 0.0; … ; 0.0 -5.790836977953466 … 0.0 -0.5793429745060755; 0.0 -6.302814646916909 … 0.0 -0.6656455871872498])
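To see how much of the text loads on Rem in each part, a quick sketch counting the terms whose Rem coefficient was selected (nonzero); this assumes Rem occupies the second row of each coefficient matrix, right after the intercept:

# number of terms with a selected Rem coefficient in the positive and zero parts
count(!iszero, coefspos[2, :]), count(!iszero, coefszero[2, :])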
By default we only return the AICc-optimal coefficients. To also get back the entire regularization paths, run
paths = fit(HDMRPaths, mf, covarsdf, counts)
coef(paths, AllSeg())
┌ Info: fitting 92 observations on 163 categories
│ 16 covariates for positive and 16 for zero counts
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/hdmr.jl:310
┌ Info: distributed InclusionRepetition run on remote cluster with 4 nodes
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/hdmr.jl:341
([-3.813582932615021 0.0 … 0.0 0.0; -3.853810880797987 0.0 … 0.0 0.0; … ; -7.459296254466469 0.0 … 0.0 0.0; -7.505779153456341 0.0 … 0.0 0.0] [-2.636306292658036 0.0 … 0.0 0.0; -2.66359923548162 0.0 … 0.0 0.0; … ; -3.330246398413737 -0.010344788429923752 … 0.0 0.0; -3.330502415345261 -0.010368918223518313 … 0.0 0.0] [-5.305434834316585 0.0 … 0.0 0.0; -5.30494210534174 0.0 … 0.0 0.0; … ; -4.55911691859083 -0.030603425600868865 … -7.501561722796674 -7.945381925771689; -4.5589970403005 -0.030603999923886308 … -7.594708346805521 -8.038529181375855] ... [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0] [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0] [-3.8717217086546176 0.0 … 0.0 0.0; -3.856406187723185 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [-0.5918418471274729 0.0 … 0.0 0.0; -0.5800184432449021 0.0 … 0.0 -0.12726570066456563; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0] [-0.901336871518703 0.0 … 0.0 0.0; -0.9261859067147747 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0] [0.01871385751876236 0.0 … 0.0 0.0; 0.03268597629500669 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0] ... [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0] [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0] [1.166655306957609 0.0 … 0.0 0.0; 1.1458708457174278 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0])
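As noted above, inclusion and repetition can depend on different covariates. With the matrix interface, the inpos and inzero keyword arguments select covariate columns for each part; a sketch (here both point at the single Rem column, but the index vectors can differ):

# covars column 1 drives repetition (positive counts) and inclusion (zeros)
m2 = hdmr(covars, counts; inpos=[1], inzero=[1])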
A sufficient reduction projection summarizes the counts, much like a sufficient statistic, and is useful for reducing the d-dimensional counts to a potentially much lower-dimensional matrix z.
To get a sufficient reduction projection in the direction of the covariate of interest (here Rem, the first covariate in both parts) for the above example:
z = srproj(m, counts, 1, 1)
92×4 Array{Float64,2}:
  0.00411637    0.000362534  30.0  22.0
  0.00318      -0.000217747  28.0  33.0
  0.00667597   -0.000352564  44.0  34.0
  0.00613306   -0.000705127  10.0  17.0
 -0.0244185    -0.000544871  17.0  22.0
 -0.00692207   -0.000359282  13.0  20.0
  0.000594303   0.0           5.0  14.0
  0.00586085    0.000569696   6.0  14.0
  0.00420992    0.000419776   8.0  19.0
  0.0          -8.88432e-5    4.0  17.0
 -0.00500263   -7.94913e-5   21.0  19.0
 -0.00431083    0.000379797  14.0  21.0
  9.90504e-5    0.000705806  15.0  25.0
  ⋮
 -0.00235882   -0.00205056   27.0  35.0
 -0.00262085    0.000219573  34.0  38.0
 -0.0103014     0.000108327  61.0  39.0
 -0.00615453    0.000338392  31.0  42.0
 -0.00872007    8.85187e-5   36.0  47.0
  0.00123649    0.000137973  50.0  47.0
  0.000880195   6.03795e-5   39.0  48.0
 -0.00127769   -0.000349448  32.0  43.0
 -0.00193883   -0.00068754   23.0  43.0
 -0.000533283  -0.000279268  21.0  36.0
  0.000371439  -0.000265056  12.0  27.0
  0.000657006   2.7179e-5    11.0  32.0
Here, the first column is the SR projection from the model for positive counts, the second is the SR projection from the model for hurdle crossing (zeros), the third is each document's total count in excess of one, and the last column is its total number of nonzero counts.
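The last two columns can be checked directly against the counts matrix; a small sketch, assuming the column description above holds:

ℓ = vec(sum(counts .> 0, dims=2))     # nonzero counts per document
mex = vec(sum(counts, dims=2)) .- ℓ   # counts in excess of one per document
z[:, 3] ≈ mex, z[:, 4] ≈ ℓ            # should both be true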
Counts inverse regression allows us to predict a covariate from the counts and the other covariates. Here we use HDMR for the backward regression and another model for the forward regression to predict the excess market return Rem. This can be accomplished with a single command by fitting a CIR{HDMR,FM} model, where the forward model is some FM <: RegressionModel supported by the GLM package.
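The same command also accepts DMR as the backward model; for example, a sketch mirroring the call below:

cirdmr = fit(CIR{DMR,LinearModel}, @model(c ~ Rem), covarsdf, counts, "Rem"; nocounts=true)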
This time we'll fit a slightly more interesting model that also conditions on the president speaking.
mf = @model(h ~ President + Rem, c ~ President + Rem)
cir = fit(CIR{HDMR,LinearModel}, mf, covarsdf, counts, "Rem"; nocounts=true)
┌ Info: fitting 92 observations on 163 categories
│ 16 covariates for positive and 16 for zero counts
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/hdmr.jl:498
┌ Info: distributed InclusionRepetition run on local cluster with 4 nodes
└ @ HurdleDMR /home/user/.julia/packages/HurdleDMR/zD6G8/src/hdmr.jl:511
TableCountsRegressionModel{CIR{HDMR,LinearModel},DataFrame,SparseMatrixCSC{Int64,Int64}}

2-part model: [h ~ President + Rem, c ~ President + Rem]

Forward model coefficients for predicting Rem:
─────────────────────────────────────────────────────────────────────────────────────────────────────────
                                        Coef.  Std. Error      t  Pr(>|t|)    Lower 95%     Upper 95%
─────────────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept)                        17.4837      11.3688     1.54    0.1285    -5.17963      40.147
President: Calvin Coolidge         13.0088      13.4688     0.97    0.3374   -13.8407       39.8584
President: Donald J. Trump         -2.14014     11.22      -0.19    0.8493   -24.5069       20.2266
President: Dwight D. Eisenhower   -10.4126       8.31052   -1.25    0.2143   -26.9793        6.15409
President: Franklin D. Roosevelt   -6.29263      9.24355   -0.68    0.4982   -24.7193       12.134
President: George H.W. Bush       -22.2441      11.175     -1.99    0.0503   -44.5211        0.0329016
President: George W. Bush         -20.2675       8.51922   -2.38    0.0200   -37.2502       -3.28473
President: Gerald R. Ford          -5.36949     11.0539    -0.49    0.6286   -27.4051       16.6662
President: Harry S. Truman         -9.36385      8.82709   -1.06    0.2923   -26.9603        8.23263
President: Herbert Hoover         -36.1278      10.9276    -3.31    0.0015   -57.9116      -14.3439
President: Jimmy Carter           -14.2189      10.3702    -1.37    0.1746   -34.8916        6.45379
President: John F. Kennedy        -18.4067      11.4176    -1.61    0.1113   -41.1672        4.35383
President: Lyndon B. Johnson      -18.0297       9.02763   -2.00    0.0496   -36.026        -0.0335
President: Richard Nixon          -33.7795       9.89593   -3.41    0.0011   -53.5067      -14.0523
President: Ronald Reagan          -16.7344       8.7381    -1.92    0.0595   -34.1535        0.684648
President: William J. Clinton      -3.09842      8.2069    -0.38    0.7069   -19.4586       13.2617
zpos                              477.131      327.085      1.46    0.1490  -174.901      1129.16
zzero                            9002.65      2147.99       4.19    <1e-4   4720.72      13284.6
m                                  -0.0506128    0.0478887  -1.06    0.2941    -0.146077     0.0448514
ℓ                                   0.0367347    0.257807    0.14    0.8871    -0.477194     0.550663
─────────────────────────────────────────────────────────────────────────────────────────────────────────
Here, nocounts=true means we also fit a benchmark model without counts. The last few coefficients are due to the text data: zpos is the SR projection summarizing the information in repeated use of terms, zzero is the SR projection summarizing the information in term inclusion, m is the total number of excess counts, and ℓ is the total number of nonzero counts.
We can get the forward and backward model coefficients with
coefbwd(cir)
([-7.505779153456341 -3.282723357305929 … 0.0 -3.8422495598493; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.007227242879000873], [-0.5918418471274728 2.2627069688748356 … 0.0 1.182793857457826; 0.0 -4.757435052354321 … 0.0 0.9190527309128335; … ; 0.0 -6.302814646916909 … 0.0 -0.6656455871872498; 0.0 0.0 … 0.0 0.0])
coeffwd(cir)
20-element Array{Float64,1}:
   17.483687817158042
   13.008842271887312
   -2.1401445040275067
  -10.412622758329546
   -6.292626984306372
  -22.244123896283437
  -20.267477805867564
   -5.369486999860684
   -9.363847848691456
  -36.127771366622326
  -14.21891660087452
  -18.406686103312907
  -18.029749370757116
  -33.779498729871115
  -16.734437100226316
   -3.09841665796032
  477.13098745954227
 9002.652906112522
   -0.05061284356059193
    0.03673465672186445
The fitted model can be used to predict Rem with new data. Here we reuse the first 10 observations.
yhat = HurdleDMR.predict(cir, covarsdf[1:10,:], counts[1:10,:])
10-element Array{Union{Missing, Float64},1}:
  35.01327624412601
  29.846723755873985
 -21.60376552127853
 -24.869433502443925
 -33.695746592945596
 -25.101054383331906
  11.735827128269902
  19.331781366167014
  17.275545003481156
  10.8159318912295
We can also predict using only the other covariates, which in this case amounts to a simple linear regression:
yhat_nocounts = HurdleDMR.predict(cir, covarsdf[1:10,:], counts[1:10,:]; nocounts=true)
10-element Array{Union{Missing, Float64},1}:
  32.43000000000001
  32.43000000000001
 -26.317499999999995
 -26.317499999999995
 -26.317499999999995
 -26.317499999999995
  13.509166666666667
  13.509166666666667
  13.509166666666667
  13.509166666666667
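To gauge what the text adds in-sample, one can compare both prediction vectors to the realized Rem; a small sketch using mean squared error:

using Statistics
actual = covarsdf[1:10, :Rem]
mse(ŷ) = mean(skipmissing((actual .- ŷ) .^ 2))  # MSE, skipping missing predictions
mse(yhat), mse(yhat_nocounts)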
Kelly, Manela, and Moreira (2021) show that the differences between DMR and HDMR can be substantial in some cases, especially when the counts data are highly sparse. Please refer to the paper for additional details and example applications.