HurdleDMR.jl
HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multinomial Regression (HDMR), as described in:
Bryan Kelly, Asaf Manela & Alan Moreira (2021) Text Selection, Journal of Business & Economic Statistics (ungated preprint).
It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of Taddy (2015).
Setup
Install the HurdleDMR package, switch to Pkg mode (hit ])
pkg> add HurdleDMR
Add parallel workers and make package available to workers
using Distributed
addprocs(4)
import HurdleDMR; @everywhere using HurdleDMR
Setup your data into an n-by-p covars matrix, and a (sparse) n-by-d counts matrix. Here we generate some random data.
using CSV, GLM, DataFrames, Distributions, Random, LinearAlgebra, SparseArrays
n = 100
p = 3
d = 4
Random.seed!(13)
m = 1 .+ rand(Poisson(5),n)
covars = rand(n,p)
ηfn(vi) = exp.([0 + i*sum(vi) for i=1:d])
q = [ηfn(covars[i,:]) for i=1:n]
rmul!.(q,ones(n)./sum.(q))
counts = convert(SparseMatrixCSC{Float64,Int},hcat(broadcast((qi,mi)->rand(Multinomial(mi, qi)),q,m)...)')
covarsdf = DataFrame(covars,[:vy, :v1, :v2])
Distributed Multinomial Regression (DMR)
The Distributed Multinomial Regression (DMR) model of Taddy (2015) is a highly scalable approximation to the Multinomial using distributed (independent, parallel) Poisson regressions, one for each of the d categories (columns) of a large counts
matrix, on the covars
.
To fit a DMR:
m = dmr(covars, counts)
or with a dataframe and formula
mf = @model(c ~ vy + v1 + v2)
m = fit(DMR, mf, covarsdf, counts)
in either case we can get the coefficients matrix for each variable + intercept as usual with
coef(m)
By default we only return the AICc maximizing coefficients. To also get back the entire regulatrization paths, run
paths = fit(DMRPaths, mf, covarsdf, counts)
we can now select, for example the coefficients that minimize 10-fold CV mse (takes a while)
coef(paths, MinCVKfold{MinCVmse}(10))
HurdleDMR.DCR
— TypeAbstract Distributed Counts Regression (DCR) returned object
HurdleDMR.DMR
— TypeAbstract DMR returned object
HurdleDMR.DMRCoefs
— TypeRelatively light object used to return DMR results when we only care about estimated coefficients.
HurdleDMR.DMRPaths
— TypeRelatively heavy object used to return DMR results when we care about the regulatrization paths.
Distributions.ncategories
— MethodNumber of categories (terms/words/phrases) used for DMR estimation
HurdleDMR.dmr
— MethodShorthand for fit(DMR,covars,counts). See also fit(::DMR)
HurdleDMR.dmrpaths
— MethodShorthand for fit(DMRPaths,covars,counts). See also fit(::DMRPaths)
HurdleDMR.hasintercept
— MethodWhether the model includes an intercept in each independent counts (e.g. hurdle) regression
HurdleDMR.ncoefs
— MethodNumber of coefficient potentially including intercept used for each independent Poisson regression
HurdleDMR.ncovars
— MethodNumber of covariates used for DMR estimation
StatsAPI.coef
— Methodcoef(m::DMRCoefs)
Returns the coefficient matrices fitted with DMR using the segment selected during fit (MinAICc by default).
Example:
m = fit(DMR,covars,counts)
coef(m)
StatsAPI.fit
— Methodfit(DMR,covars,counts; <keyword arguments>)
dmr(covars,counts; <keyword arguments>)
Fit a Distributed Multinomial Regression (DMR) of counts on covars.
DMR fits independent poisson gamma lasso regressions to each column of counts to approximate a multinomial, picks a segement of each path, and returns a coefficient matrix (wrapped in DMRCoefs) representing point estimates for the entire multinomial (includes the intercept if one was included).
Example:
m = fit(DMR,covars,counts)
Arguments
covars
n-by-p matrix of covariatescounts
n-by-d matrix of counts (usually sparse)
Keywords
intercept::Bool=false
include an intercept in each poissonparallel::Bool=true
parallelize the poisson fitslocal_cluster::Bool=true
use local_cluster mode that shares memory across parallel workers that is appropriate on a single multicore machine, or remote cluster mode that is more appropriate when distributing across machines for which sharing memory is costly.verbose::Bool=true
showwarnings::Bool=false
select::SegSelect=MinAICc()
which path segment to pickkwargs...
additional keyword arguments passed along to fit(GammaLassoPath,...)
StatsAPI.fit
— Methodfit(DMRPaths,covars,counts; <keyword arguments>)
dmrpaths(covars,counts; <keyword arguments>)
Fit a Distributed Multinomial Regression (DMR) of counts on covars, and returns the entire regulatrization paths, which may be useful for plotting or picking coefficients other than the AICc optimal ones. Same arguments as fit(::DMR)
.
StatsAPI.fit
— Methodfit(DMR,@model(c ~ x1*x2),df,counts; <keyword arguments>)
Fits a DMR but takes a model formula and dataframe instead of the covars matrix. See also fit(::DMR)
.
c
must be specified on the lhs to indicate the model for counts.
StatsAPI.nobs
— MethodNumber of observations used. May be lower than provided after removing all zero obs.
StatsAPI.predict
— Methodpredict(m,newcovars; <keyword arguments>)
Predict counts using a fitted DMRPaths object and given newcovars.
Example:
m = fit(DMRPaths,covars,counts)
newcovars = covars[1:10,:]
countshat = predict(m, newcovars; select=MinAICc())
Arguments
m::DMRPaths
fitted DMRPaths model (DMRCoefs currently not supported)newcovars
n-by-p matrix of covariates of same dimensions used to fit m.
Keywords
select=MinAICc()
Seecoef(::RegularizationPath)
.kwargs...
additional keyword arguments passed along to predict() for each category j=1..size(counts,2)
Hurdle Distributed Multinomial Regression (HDMR)
For highly sparse counts, as is often the case with text that is selected for various reasons, the Hurdle Distributed Multinomial Regression (HDMR) model of Kelly, Manela, and Moreira (2018), may be superior to the DMR. It approximates a higher dispersion Multinomial using distributed (independent, parallel) Hurdle regressions, one for each of the d categories (columns) of a large counts
matrix, on the covars
. It allows a potentially different sets of covariates to explain category inclusion ($h=1{c>0}$), and repetition ($c>0$).
Both the model for zeroes and for positive counts are regularized by default, using GammaLassoPath
, picking the AICc optimal segment of the regularization path.
HDMR can be fitted:
m = hdmr(covars, counts; inpos=1:2, inzero=1:3)
or with a dataframe and formula
mf = @model(h ~ vy + v1 + v2, c ~ vy + v1)
m = fit(HDMR, mf, covarsdf, counts)
where the h ~ equation is the model for zeros (hurdle crossing) and c ~ is the model for positive counts
in either case we can get the coefficients matrix for each variable + intercept as usual with
coefspos, coefszero = coef(m)
By default we only return the AICc maximizing coefficients. To also get back the entire regularization paths, run
paths = fit(HDMRPaths, mf, covarsdf, counts)
coef(paths, AllSeg())
Syntax:
HurdleDMR.HDMR
— TypeAbstract HDMR returned object
HurdleDMR.HDMRCoefs
— TypeRelatively light object used to return HDMR results when we only care about estimated coefficients.
HurdleDMR.HDMRPaths
— TypeRelatively heavy object used to return HDMR results when we care about the regulatrization paths.
HurdleDMR.hdmr
— MethodShorthand for fit(HDMR,covars,counts). See also fit(::HDMR)
HurdleDMR.hdmrpaths
— MethodShorthand for fit(HDMRPaths,covars,counts). See also fit(::HDMRPaths)
HurdleDMR.ncoefspos
— MethodNumber of coefficient potentially including intercept used by model for positives
HurdleDMR.ncoefszero
— MethodNumber of coefficient potentially including intercept used by model for zeros
HurdleDMR.ncovarspos
— MethodNumber of covariates used for HDMR estimation of positives model
HurdleDMR.ncovarszero
— MethodNumber of covariates used for HDMR estimation of zeros model
HurdleDMR.posindic
— Methodposindic(A)
Returns an array of the same dimensions of indicators for positive entries in A.
HurdleDMR.posindic
— MethodSparse version simply replaces all the non-zero values with ones.
StatsAPI.coef
— Functioncoef(m::HDMRPaths, select::SegSelect=MinAICc())
Returns all or selected coefficient matrices fitted with HDMR.
Example:
m = fit(HDMRPaths,covars,counts)
coefspos, coefszero = coef(m, MinCVKfold{MinCVmse}(5))
StatsAPI.coef
— Methodcoef(m::HDMRCoefs)
Returns the coefficient matrices fitted with HDMR using the segment selected during fit (MinAICc by default).
Example:
m = fit(HDMR,covars,counts)
coefspos, coefszero = coef(m)
StatsAPI.fit
— Methodfit(HDMR,covars,counts; <keyword arguments>)
hdmr(covars,counts; <keyword arguments>)
Fit a Hurdle Distributed Multinomial Regression (HDMR) of counts on covars.
HDMR fits independent hurdle lasso regressions to each column of counts to approximate a multinomial, picks a segement of each path, and returns a coefficient matrix (wrapped in HDMRCoefs) representing point estimates for the entire multinomial (includes the intercept if one was included).
Example:
m = fit(HDMR,covars,counts)
Arguments
covars
n-by-p matrix of covariatescounts
n-by-d matrix of counts (usually sparse)
Keywords
inpos=1:p
indices of covars columns included in model for positive countsinzero=1:p
indices of covars columns included in model for zero countsintercept::Bool=false
include a intercepts in each hurdle regressionparallel::Bool=true
parallelize the poisson fitslocal_cluster::Bool=true
use local_cluster mode that shares memory across parallel workers that is appropriate on a single multicore machine, or remote cluster mode that is more appropriate when distributing across machines for which sharing memory is costly.verbose::Bool=true
showwarnings::Bool=false
select::SegSelect=MinAICc()
path segment selection criterionkwargs...
additional keyword arguments passed along to fit(Hurdle,...)
StatsAPI.fit
— Methodfit(HDMRPaths,covars,counts; <keyword arguments>)
hdmrpaths(covars,counts; <keyword arguments>)
Fit a Hurdle Distributed Multinomial Regression (HDMR) of counts on covars, and returns the entire regulatrization paths, which may be useful for plotting or picking coefficients other than the AICc optimal ones. Same arguments as fit(::HDMR)
.
StatsAPI.fit
— Methodfit(HDMR,@model(h ~ x1 + x2, c ~ x1),df,counts; <keyword arguments>)
Fits a HDMR but takes a model formula and dataframe instead of the covars matrix. See also fit(::HDMR)
.
h
and c
on the lhs indicate the model for zeros and positives, respectively.
StatsAPI.predict
— Methodpredict(m,newcovars; <keyword arguments>)
Predict counts using a fitted HDMRPaths object and given newcovars.
Example:
m = fit(HDMRPaths,covars,counts)
newcovars = covars[1:10,:]
countshat = predict(m, newcovars; select=MinAICc())
Arguments
m::HDMRPaths
fitted DMRPaths model (HDMRCoefs currently not supported)newcovars
n-by-p matrix of covariates of same dimensions used to fit m.
Keywords
select=MinAICc()
Seecoef(::RegularizationPath)
.kwargs...
additional keyword arguments passed along to predict() for each category j=1..size(counts,2)
Sufficient reduction projection
A sufficient reduction projection summarizes the counts, much like a sufficient statistic, and is useful for reducing the d dimensional counts in a potentially much lower dimension matrix z
.
To get a sufficient reduction projection in direction of vy for the above example
z = srproj(m,counts,1,1)
Here, the first column is the SR projection from the model for positive counts, the second is the the SR projection from the model for hurdle crossing (zeros), and the third is the total count for each observation.
Syntax:
HurdleDMR.srproj
— Functionsrproj calculates the MNIR Sufficient Reduction projection from text counts on to the attribute dimensions of interest (covars in mnlm). In particular, for counts C, with row sums m, and mnlm coefficients φj corresponding to attribute j, zj = C'φj/m is the SR projection in the direction of j. The MNIR paper explains how V=[v1 ... vK], your original covariates/attributes, are independent of text counts C given SR projections Z=[z1 ... z_K]. dir == nothing returns projections in all directions.
HurdleDMR.srproj
— Methodsrproj for hurdle dmr takes two coefficent matrices coefspos, coefszero, and a two specific directions and returns an n-by-4 matrix Z = [zpos zzero m l]. dirpos = 0 omits positive counts projections and dirzero = 0 omits zero counts projections. Setting any of these to nothing will return projections in all directions.
HurdleDMR.srproj
— Methodsrproj for hurdle dmr takes two coefficent matrices coefspos, coefszero, and a two specific directions and returns an n-by-4 matrix Z = [zpos zzero m l]. dirpos = 0 omits positive counts projections and dirzero = 0 omits zero counts projections. Setting any of these to nothing will return projections in all directions.
HurdleDMR.srproj
— Methodsrproj calculates the MNIR Sufficient Reduction projection from text counts on to the attribute dimensions of interest (covars in mnlm). In particular, for counts C, with row sums m, and mnlm coefficients φj corresponding to attribute j, zj = C'φj/m is the SR projection in the direction of j. The MNIR paper explains how V=[v1 ... vK], your original covariates/attributes, are independent of text counts C given SR projections Z=[z1 ... z_K]. dir == nothing returns projections in all directions.
HurdleDMR.srproj
— MethodLike srproj but efficiently interates over a sparse counts matrix, and only projects in a single direction (dir).
HurdleDMR.srprojX
— MethodBuilds the design matrix X for predicting covar in direction projdir hdmr version Assumes that covars include all variables for both positives and zeros models and indicates which variables are where with the index arrays inpos and inzero. inz=[1,2] if both zpos and zzero are included inz=[2] if zpos is dropped due to collinearity
HurdleDMR.srprojX
— MethodBuilds the design matrix X for predicting covar in direction projdir dmr version inz=[1] and testrank=false always for dmr, so variables are ignored and only here for convinence of unified calling function
Counts Inverse Regression (CIR)
Counts inverse regression allows us to predict a covariate with the counts and other covariates. Here we use hdmr for the backward regression and another model for the forward regression. This can be accomplished with a single command, by fitting a CIR{HDMR,FM} where the forward model is FM <: RegressionModel.
cir = fit(CIR{HDMR,LinearModel},mf,covarsdf,counts,:vy; nocounts=true)
where the nocounts=true
means we also fit a benchmark model without counts.
we can get the forward and backward model coefficients with
coefbwd(cir)
coeffwd(cir)
The fitted model can be used to predict vy with new data
yhat = predict(cir, covarsdf[1:10,:], counts[1:10,:])
We can also predict only with the other covariates, which in this case is just a linear regression
yhat_nocounts = predict(cir, covarsdf[1:10,:], counts[1:10,:]; nocounts=true)
Syntax:
HurdleDMR.CIR
— TypeCounts inverse regression (CIR) model supports both multinomial and hurdle inverse regressions and holds both the inverse and forward regression model estimates
HurdleDMR.coefbwd
— MethodReturns coefficients for backward model for counts as function of covariates
HurdleDMR.coeffwd
— MethodReturns coefficients of forward regression model. Set nocounts=true to get coefficients for the benchmark model without counts.
StatsAPI.fit
— Methodfit(CIR{DMR,FM},m,df,counts,projdir[,fmargs...]; <keyword arguments>)
Version of fit(CIR{DMR,FM}...) that takes a @model() and a dataframe instead of a covars matrix, and a projdir::Symbol specifies the dependent variable. See also fit(CIR...).
Example:
m = fit(CIR{DMR,LinearModel}, @model(c~x1+x2), df, counts, :x1; nocounts=true)
where c~
is the model for counts. x1
(projdir
) is the variable to predict. We can then predict with a dataframe as well
yhat = predict(m, df, counts)
yhatnc = predict(m, df, counts; nocounts=true)
StatsAPI.fit
— Methodfit(CIR{HDMR,FM},m,df,counts,projdir[,fmargs...]; <keyword arguments>)
Version of fit(CIR{HDMR,FM}...) that takes a @model() and a dataframe instead of a covars matrix, and a projdir::Symbol specifies the dependent variable. See also fit(CIR...).
Example:
m = fit(CIR{HDMR,LinearModel}, @model(h~x1+x2, c~x1), df, counts, :x1; nocounts=true)
where h~
is the model for zeros, c~
is the model for positives. x1
(projdir
) is the variable to predict. We can then predict with a dataframe as well
yhat = predict(m, df, counts)
yhatnc = predict(m, df, counts; nocounts=true)
StatsAPI.fit
— Methodfit(::CIR{BM,FM},covars,counts,projdir[,fmargs...]; <keyword arguments>)
Fit a Counts Inverse Regression (CIR) of covars[:,projdir] ~ counts + covars[:,~projdir]
.
CIR involves three steps:
- Fit a backward regression model BM<:DCR: counts ~ covars
- Calculate an sufficient reduction projection in direction projdir
- Fit a forward regression model FM<:RegressionModel:
covars[:,projdir] ~ srproj(counts) + covars[:,~projdir]
Example:
m = fit(CIR{DMR,LinearModel}, covars, counts, 1; nocounts=true)
yhat = predict(m, covars, counts)
yhatnc = predict(m, covars, counts; nocounts=true)
Arguments
covars
n-by-p matrix of covariatescounts
n-by-d matrix of counts (usually sparse)projdir
index of covars column used as dependent variable in forward modelfmargs...
optional arguments passed along to the forward regression model
Keywords
nocounts::Bool=false
whether to also fit a benchmark model without countsbmkwargs...
keyword arguments passed along to the backward regression model
StatsAPI.predict
— MethodPredict using a fitted Counts inverse regression (CIR) given new covars dataframe and counts. See also predict(::CIR)
.
StatsAPI.predict
— MethodPredict using a fitted Counts inverse regression (CIR) given new covars and counts.
Keywords
- Set
nocounts=true
to predict using a benchmark model without counts.
Hurdle
This package also provides a regularized Hurdle model (Mullahy, 1986) that can be fit using a fast coordinate decent algorithm, or simply by running two fit(GeneralizedLinearModel,...)
regressions, one for each of its two parts.
Syntax:
HurdleDMR.Hurdle
— TypeHurdle returned object
HurdleDMR.MinCVKfold
— TypeSelects the RegularizationPath segment according to CVSegSelect
with k
-fold cross-validation
StatsAPI.coef
— Methodcoef(m::Hurdle; select=MinAICc())
Returns a selected segment of the coefficient matrices of the fitted the TwoPartModel.
Example:
m = fit(Hurdle,GeneralizedLinearModel,X,y; Xpos=Xpos)
coefspos, coefszero = coef(m)
Keywords
kwargs...
are passed along to two coef() calls on the two model parts.
StatsAPI.coef
— MethodSelects the RegularizationPath segment coefficients according to S
with k
-fold cross-validation
StatsAPI.coef
— MethodSelects the RegularizationPath segment coefficients according to S
with k
-fold cross-validation
StatsAPI.fit
— Methodfit(Hurdle,M,f,df; fpos=Xpos, <keyword arguments>)
Takes dataframe and two formulas, one for each model part. Otherwise same arguments as fit(::Hurdle)
Example
fit(Hurdle,GeneralizedLinearModel,@formula(y ~ x1*x2), df; fpos=@formula(y ~ x1*x2+x3))
StatsAPI.fit
— Methodfit(Hurdle,M,X,y; Xpos=Xpos, <keyword arguments>)
Fit a Hurdle (Mullahy, 1986) of count vector y on X with potentially another covariates matrix Xpos used to model positive counts.
Example with GLM:
m = fit(Hurdle,GeneralizedLinearModel,X,y; Xpos=Xpos)
yhat = predict(m, X; Xpos=Xpos)
Example with Lasso regularization:
m = fit(Hurdle,GammaLassoPath,X,y; Xpos=Xpos)
yhat = predict(m, X; Xpos=Xpos, select=MinAICc())
Arguments
M::RegressionModel
counts
n-by-d matrix of counts (usually sparse)dzero::UnivariateDistribution = Binomial()
distribution for zeros modeldpos::UnivariateDistribution = PositivePoisson()
distribution for positives modellzero::Link=canonicallink(dzero)
link function for zeros modellpos::Link=canonicallink(dpos)
link function for positives model
Keywords
Xpos::Union{AbstractMatrix{T},Nothing} = nothing
covariates matrix for positives model or nothing to use X for both partsdofit::Bool = true
fit the model or just construct its shellwts::V = ones(y)
observation weightsoffsetzero::AbstractVector = similar(y, 0)
offsets for zeros modeloffsetpos::AbstractVector = similar(y, 0)
offsets for positives modeloffset::AbstractVector = similar(y, 0)
offsets for both model partsverbose::Bool=true
showwarnings::Bool=false
fitargs...
additional keyword arguments passed along to fit(M,...)
StatsAPI.predict
— Methodpredict(m,X; Xpos=Xpos, <keyword arguments>)
Predict using a fitted TwoPartModel given new X (and potentially Xpos).
Example with GLM:
m = fit(Hurdle,GeneralizedLinearModel,X,y; Xpos=Xpos)
yhat = predict(m, X; Xpos=Xpos)
Example with Lasso regularization:
m = fit(Hurdle,GammaLassoPath,X,y; Xpos=Xpos)
yhat = predict(m, X; Xpos=Xpos, select=MinAICc())
Arguments
m::Hurdle
fitted Hurdle modelX
n-by-p matrix of covariates of same dimensions used to fit m.
Keywords
Xpos::Union{AbstractMatrix{T},Nothing} = nothing
covariates matrix for positives model or nothing to use X for both partskwargs...
additional keyword arguments passed along to predict() for each of the two model parts.
Positive Poisson
This package also implements the PositivePoisson
distribution and the GLM necessary methods to facilitate fit with fit(::GeneralizedLinearModel
.
Syntax:
HurdleDMR.PositivePoisson
— TypePositivePoisson(λ)
A PositivePoisson distribution (aka zero-truncated Poisson, ZTP) descibes the number of independent events occurring within a unit time interval, given the average rate of occurrence λ
and, importantly, given that the number is not zero.
$P(X = k) = \frac{λ^k}{k!(1-e^{-λ})} e^{-λ}, \quad \text{ for } k = 1,2,\ldots.$
PositivePoisson() # PositivePoisson distribution with rate parameter 1
PositivePoisson(lambda) # PositivePoisson distribution with rate parameter lambda
params(d) # Get the parameters, i.e. (λ,)
mean(d) # Get the mean arrival rate, i.e. λ
External links:
API / Index
HurdleDMR.CIR
HurdleDMR.DCR
HurdleDMR.DMR
HurdleDMR.DMRCoefs
HurdleDMR.DMRPaths
HurdleDMR.HDMR
HurdleDMR.HDMRCoefs
HurdleDMR.HDMRPaths
HurdleDMR.Hurdle
HurdleDMR.MinCVKfold
HurdleDMR.PositivePoisson
Distributions.ncategories
HurdleDMR.coefbwd
HurdleDMR.coeffwd
HurdleDMR.dmr
HurdleDMR.dmrpaths
HurdleDMR.hasintercept
HurdleDMR.hdmr
HurdleDMR.hdmrpaths
HurdleDMR.ncoefs
HurdleDMR.ncoefspos
HurdleDMR.ncoefszero
HurdleDMR.ncovars
HurdleDMR.ncovarspos
HurdleDMR.ncovarszero
HurdleDMR.posindic
HurdleDMR.posindic
HurdleDMR.srproj
HurdleDMR.srproj
HurdleDMR.srproj
HurdleDMR.srproj
HurdleDMR.srproj
HurdleDMR.srprojX
HurdleDMR.srprojX
StatsAPI.coef
StatsAPI.coef
StatsAPI.coef
StatsAPI.coef
StatsAPI.coef
StatsAPI.coef
StatsAPI.fit
StatsAPI.fit
StatsAPI.fit
StatsAPI.fit
StatsAPI.fit
StatsAPI.fit
StatsAPI.fit
StatsAPI.fit
StatsAPI.fit
StatsAPI.fit
StatsAPI.fit
StatsAPI.nobs
StatsAPI.predict
StatsAPI.predict
StatsAPI.predict
StatsAPI.predict
StatsAPI.predict