2 Defining Categorical Variables

The table below shows how customers' responses to the survey question vary with the number of stars:

table(APP$Recommend, APP$Stars)

      0   1   2   3   4   5
No   55  80  46   8   1   0
Yes   0  10  41  92 104  43
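To see the association on the proportion scale, one can normalise the columns of this table; a quick check using the same APP data frame:

prop.table(table(APP$Recommend, APP$Stars), margin = 2)  ## proportion of No/Yes answers within each star rating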

It is clear that there is an association between the number of stars and the customers' responses to the recommendation question. We may therefore propose that Stars should be included in the linear predictor and fit the following logistic regression:

M4 <- glm(Recommend ~ 1 + Stars, family=binomial, data=APP)
M4

Coefficients:
(Intercept)        Stars
     -4.660        2.329

This means that for every additional star, the log-odds of a customer recommending the application increase by 2.329.
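Exponentiating the coefficients converts this statement to the odds scale; for instance:

exp(coef(M4))  ## the Stars entry, exp(2.329), is roughly 10.3: each additional star
               ## multiplies the odds of recommendation by about 10.3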

In the above analysis, we have treated the number of stars as a (discrete) numerical variable and assumed that the log-odds of recommendation is linear in the number of stars. We may instead treat the number of stars given by the customer as a categorical variable, since they choose one of six possible options: 0 stars, 1 star, …, 5 stars. In order to fit a logistic regression model with stars as a categorical variable, we first need to convert the Stars explanatory variable to the factor class:

APP2 <- APP  ## Copy the APP data.frame into a new object called APP2
APP2$Stars <- factor(x = APP2$Stars, levels = c("0","1","2","3","4","5"))

levels(APP2$Stars)
[1] "0" "1" "2" "3" "4" "5"

Here, the command factor takes two arguments: x is the vector of star entries from the data set (coerced to character), and levels is a character vector listing all of the options available for that categorical variable. The command levels returns the names of the options of a factor object.
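As a toy illustration of this behaviour (using a small hypothetical vector, not the APP data), note that a factor retains every declared level even when some are unobserved:

x <- c("2", "0", "5", "2")                              ## hypothetical star entries
f <- factor(x = x, levels = c("0","1","2","3","4","5"))
levels(f)  ## all six options are retained
table(f)   ## unobserved levels appear with a count of zero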

Since there is no change to the ordering of the levels, the above conversion of the explanatory variable from numerical to categorical can be equivalently achieved by the following command:

APP2$Stars <- as.factor(APP$Stars)

Look at the summaries of the data frames APP and APP2 to see how R interprets the Stars explanatory variable.
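For instance, comparing the Stars column in each data frame:

summary(APP$Stars)   ## numerical: minimum, quartiles, mean and maximum
summary(APP2$Stars)  ## factor: a count for each star category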

Given this re-formatted data frame, we can now fit the logistic model where the Stars explanatory variable is treated as a categorical variable:

M5 <- glm(Recommend ~ -1 + Stars, family=binomial, data=APP2)
M5

Coefficients:
  Stars0    Stars1    Stars2    Stars3    Stars4    Stars5
-19.5661   -2.0794   -0.1151    2.4423    4.6444   19.5661

Here, the formula contains -1 to remove the intercept term, so that the MLE of each coefficient denotes the log-odds of recommendation for that star category. Run summary(M5) to obtain more information about the estimates.
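For comparison, if the intercept is retained then R uses treatment contrasts by default; a sketch (M5b is an illustrative name, not a model fitted above):

M5b <- glm(Recommend ~ 1 + Stars, family=binomial, data=APP2)
coef(M5b)  ## the intercept is the log-odds for the baseline (0 stars); each remaining
           ## StarsK coefficient is the difference in log-odds between K stars and 0 stars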

The first thing to note is the extremely high uncertainty associated with the coefficients of Stars0 and Stars5. This occurs because no 0-star app users answered yes to the recommendation question and no 5-star app users answered no. Theoretically, the log-odds for these events are -∞ and +∞ respectively.
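Indeed, because the intercept-free parameterisation makes each MLE the empirical log-odds within its star category, the infinite values can be read directly off the cross-tabulation:

tab <- table(APP2$Recommend, APP2$Stars)
log(tab["Yes", ] / tab["No", ])  ## empirical log-odds per category: -Inf at 0 stars, +Inf at 5 stars

The finite entries reproduce the coefficients reported for M5 above.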

Secondly, the log-odds estimates for the other star categories increase from -2.08 to 4.64, at a steady rate of approximately 2.24 units per star. This rate lies within the 95% confidence interval for the Stars coefficient estimated in model M4.
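This claim can be checked by computing the interval directly; for example:

confint(M4)  ## profile-likelihood 95% confidence intervals for the coefficients of M4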

In this case, the two models are similar and describe the variability seen within the data equally well, so on grounds of parsimony it can be argued that we may prefer M4 over M5. In general, however, the relationship between a categorical explanatory variable and the response need not be linear, and it may not be possible to find a numerical explanatory variable that adequately describes the observed relationship. There is no single correct answer: the best model to select ultimately depends on how much detail you require the model to capture for its intended use.
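As a rough check of this comparison, one might set the two fits side by side; a sketch (M4 is nested within M5, so a likelihood-ratio test applies, although the boundary estimates at 0 and 5 stars mean the asymptotic p-value should be treated with caution):

AIC(M4, M5)                    ## penalised fit: smaller values are preferred
anova(M4, M5, test = "Chisq")  ## likelihood-ratio test of the linear model against the categorical model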