
We study how to model classification problems with ordinal labels using regression models.

Background

In many real-world applications, we encounter classification problems with ordinal labels. For example, we may want to rate an employee’s performance as below, meets, or exceeds expectations. Ordinal labels typically arise from product requirements that favor discrete outputs over continuous scores for ease of interpretation.

A straightforward approach is to treat the problem as a standard classification task, where each label is treated as a distinct class. However, this ignores the ordinal relationship between labels. As a result, the model incurs the same penalty for predicting a label of “exceeds expectations” (EE) instead of “meets expectations” (ME) as it would for predicting “below expectations” (BE) instead of ME. This is undesirable given the inherent ordering in ordinal labels.

So the question becomes: Can we preserve the ordinal nature of labels while still producing discrete outputs?

Motivation

To better understand how to optimize for ordinal labels, consider a three-class classification task with labels 1, 2, and 3. The penalty for each prediction might look like this:

              true
           1    2    3
  pred  1  0    a    b
        2  a    0    a
        3  b    a    0

Here, \(a < b\), reflecting that predicting label 3 instead of 1 is a more severe error than predicting label 2 instead of 1.

The symmetry and structure of this table suggest that the penalty is a function of the absolute difference between the true and predicted labels. Thus, even though the labels are categorical, we can treat them numerically and model the task as a regression problem.

Building a Regression Model with Ordinal Labels

The main challenge in using regression is how to convert continuous predictions back to discrete labels. A simple way is to round the predicted value to the nearest integer. For example, a prediction of 1.4 would be cast to label 1.

This approach uses fixed thresholds (e.g., 1.5 and 2.5) to determine class boundaries. Alternatively, we can treat the thresholds as tunable hyperparameters, allowing us to shift boundaries based on product needs or validation performance.
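
As a concrete sketch, the mapping from continuous predictions to labels might look like the following (the function name `to_labels` and the default thresholds are illustrative, assuming labels in {1, 2, 3}):

```python
import numpy as np

def to_labels(preds, thresholds=(1.5, 2.5)):
    """Map continuous predictions to ordinal labels {1, 2, 3}.

    With the default thresholds this is equivalent to rounding
    (and clipping) to the nearest label; passing tuned thresholds
    shifts the class boundaries instead.
    """
    t1, t2 = thresholds
    preds = np.asarray(preds)
    return np.where(preds < t1, 1, np.where(preds < t2, 2, 3))

# Example: a prediction of 1.4 falls below 1.5 and maps to label 1.
print(to_labels([0.7, 1.4, 1.6, 2.7]))  # [1 1 2 3]
```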

In our experiments, we use a dataset with three ordinal labels. To evaluate ordinal performance, we introduce a new metric called flip rate, defined as the proportion of label 1 instances misclassified as label 3, and vice versa.
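
A minimal sketch of how flip rate could be computed (the function name is ours, and we assume the denominator is the total number of instances):

```python
def flip_rate(y_true, y_pred):
    """Fraction of instances flipped between the extreme labels:
    true label 1 predicted as 3, or true label 3 predicted as 1."""
    flips = sum(
        (t == 1 and p == 3) or (t == 3 and p == 1)
        for t, p in zip(y_true, y_pred)
    )
    return flips / len(y_true)

print(flip_rate([1, 2, 3, 1], [3, 2, 3, 1]))  # 0.25
```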

Each experiment uses random-shuffle cross-validation, with a validation set used for early stopping.
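
For reference, one way to set up such a protocol with scikit-learn (the split counts and sizes below are illustrative, not the exact settings of our experiments):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

X = np.random.randn(300, 5)        # placeholder features
y = np.random.randint(1, 4, 300)   # placeholder ordinal labels in {1, 2, 3}

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train_idx, test_idx in cv.split(X):
    # Carve a validation set out of each training fold for early stopping.
    fit_idx, val_idx = train_test_split(train_idx, test_size=0.2,
                                        random_state=42)
    # Train on X[fit_idx], monitor early stopping on X[val_idx],
    # and evaluate on X[test_idx].
```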

Fixed Threshold

[Experiment Data Missing]

Tunable Threshold

[Experiment Data Missing]

Summary

While performance metrics for both approaches are comparable, the tunable threshold approach offers greater flexibility to align with product requirements.

Loss Function Design

We use Mean Squared Error (MSE) as the base loss function. However, MSE alone can be unstable, particularly when input features are noisy: a prediction that lands near a decision boundary can flip to a different class under small perturbations.

To address this, we introduce a regularization term that gently discourages predictions near the threshold.

Gaussian Regularization Near Threshold

Let \(p_1\) and \(p_2\) be the decision thresholds between classes. The regularization term is:

\(f_1(y_{\text{pred}}) + f_2(y_{\text{pred}})\)

where \(f_1\) and \(f_2\) are Gaussian PDFs centered at \(p_1\) and \(p_2\), respectively, with variance \(\sigma^2\).
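
A PyTorch sketch of the combined objective; the regularization weight `lam` and the specific values of `sigma` and the thresholds are our assumptions for illustration:

```python
import math
import torch

def regularized_mse(y_pred, y_true, p1=1.5, p2=2.5, sigma=0.1, lam=0.1):
    """MSE plus Gaussian bumps centered at the thresholds p1 and p2.

    The penalty peaks when a prediction sits exactly on a threshold
    and decays quickly away from it, nudging predictions out of the
    ambiguous region near the class boundaries.
    """
    mse = torch.mean((y_pred - y_true) ** 2)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    f1 = norm * torch.exp(-((y_pred - p1) ** 2) / (2 * sigma ** 2))
    f2 = norm * torch.exp(-((y_pred - p2) ** 2) / (2 * sigma ** 2))
    return mse + lam * torch.mean(f1 + f2)
```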

Cross-Entropy Loss with Soft Labels

Under MSE, the regression output is an unbounded real value, so the edge labels (1 and 3) correspond to unbounded intervals of the prediction range. Under distribution shift, predictions can drift far past the thresholds, producing extreme misclassifications.

To mitigate this, we consider a binary classification model trained with soft labels. For three classes:

  • Label 1 \(\rightarrow\) [1, 0]
  • Label 2 \(\rightarrow\) [0.5, 0.5]
  • Label 3 \(\rightarrow\) [0, 1]

We train using cross-entropy loss on these soft targets. At inference, we treat the confidence for class 1 as a pseudo-regression score and apply thresholds to map it back to class labels.
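
A sketch of this setup in PyTorch (the inference thresholds of 0.25 and 0.75 on the class-1 confidence are illustrative):

```python
import torch
import torch.nn.functional as F

# Soft targets over two outputs, [P(class 1), P(class 3)].
SOFT_TARGETS = {1: [1.0, 0.0], 2: [0.5, 0.5], 3: [0.0, 1.0]}

def soft_cross_entropy(logits, labels):
    """Cross-entropy between two-way logits and the soft targets above."""
    targets = torch.tensor([SOFT_TARGETS[int(l)] for l in labels])
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def score_to_labels(logits, thresholds=(0.25, 0.75)):
    """Treat P(class 1) as a pseudo-regression score: scores near 1
    map to label 1, near 0.5 to label 2, and near 0 to label 3."""
    score = F.softmax(logits, dim=-1)[:, 0]
    lo, hi = thresholds
    return torch.where(score > hi, 1, torch.where(score > lo, 2, 3))
```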

Beyond Ordinal Labels

Ordinal classification is a special case of structured classification, where labels have inherent relationships. More generally, consider next-token prediction in LLMs: although it is treated as a flat classification task, tokens often have semantic relationships (e.g., “cat” and “kitten”). Current models do not explicitly encode this structure.

Just as ordinal labels benefit from structure-aware modeling, it is worth exploring how to incorporate semantic relationships into loss functions for broader structured classification tasks.