Regression

Regression in Orange is, from the interface perspective, very similar to classification. Both require class-labeled data. Just as in classification, regression is implemented with learners and regression models (regressors). Regression learners are objects that accept data and return regressors. Regressors are then given data items and predict the value of a continuous class:

import Orange

data = Orange.data.Table("housing")
learner = Orange.regression.LinearRegressionLearner()
model = learner(data)

print("predicted, observed:")
for d in data[:3]:
    print("%.1f, %.1f" % (model(d)[0], d.get_class()))
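
The learner/regressor split described above can be illustrated without Orange. The following is a minimal sketch in plain Python of a one-feature least-squares learner. The names SimpleLinearLearner and SimpleLinearModel and the toy data are hypothetical, chosen for illustration only, and are not part of Orange's API:

```python
class SimpleLinearModel:
    """Regressor: holds fitted parameters and predicts a continuous value."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def __call__(self, x):
        return self.slope * x + self.intercept


class SimpleLinearLearner:
    """Learner: accepts data and returns a fitted regressor."""

    def __call__(self, xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        # Ordinary least squares for a single feature.
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sxx = sum((x - mean_x) ** 2 for x in xs)
        slope = sxy / sxx
        return SimpleLinearModel(slope, mean_y - slope * mean_x)


# Toy data, made up for illustration: y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

learner = SimpleLinearLearner()
model = learner(xs, ys)   # learner(data) returns a model
prediction = model(4.0)   # model(item) returns a predicted value
print(prediction)
```

The two calls at the end mirror the pattern in the Orange script: calling the learner on data yields a model, and calling the model on a data item yields a prediction.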

Handful of Regressors

Let us start with regression trees. Below is an example script that builds the tree from data on housing prices and prints out the tree in textual form:


import Orange

data = Orange.data.Table("housing.tab")
tree_learner = Orange.regression.SimpleTreeLearner(max_depth=2)
tree = tree_learner(data)
print(tree.to_string())

The script outputs the tree:

RM<=6.941: 19.9
RM>6.941
|    RM<=7.437
|    |    CRIM>7.393: 14.4
|    |    CRIM<=7.393
|    |    |    DIS<=1.886: 45.7
|    |    |    DIS>1.886: 32.7
|    RM>7.437
|    |    TAX<=534.500: 45.9
|    |    TAX>534.500: 21.9
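
To make the printed tree easier to read, here is a sketch that transcribes it into nested conditions in plain Python. It assumes that indentation marks the depth of a split and that the number after each colon is the value predicted at that leaf. The function name and the feature values passed to it are hypothetical:

```python
def predict_from_printed_tree(rm, crim, dis, tax):
    """Follow the printed splits from the root to a leaf and return its value."""
    if rm <= 6.941:
        return 19.9
    if rm <= 7.437:
        if crim > 7.393:
            return 14.4
        return 45.7 if dis <= 1.886 else 32.7
    return 45.9 if tax <= 534.5 else 21.9


# Hypothetical feature values, chosen only to exercise different branches.
few_rooms = predict_from_printed_tree(rm=6.0, crim=0.1, dis=4.0, tax=300.0)
mid_rooms = predict_from_printed_tree(rm=7.0, crim=0.1, dis=4.0, tax=300.0)
many_rooms = predict_from_printed_tree(rm=8.0, crim=0.1, dis=4.0, tax=300.0)
print(few_rooms, mid_rooms, many_rooms)
```

Each call walks exactly one path through the tree, so a prediction depends only on the features tested along that path.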

Following is the initialization of a few other regressors and their predictions for five randomly sampled data instances from the housing price data set:


import random

import Orange

random.seed(42)
data = Orange.data.Table("housing")
test = Orange.data.Table(data.domain, random.sample(data, 5))
train = Orange.data.Table(data.domain, [d for d in data if d not in test])

lin = Orange.regression.linear.LinearRegressionLearner()
rf = Orange.regression.random_forest.RandomForestRegressionLearner()
rf.name = "rf"
ridge = Orange.regression.RidgeRegressionLearner()

learners = [lin, rf, ridge]
regressors = [learner(train) for learner in learners]

print("y   ", " ".join("%5s" % l.name for l in regressors))

for d in test:
    print(("{:<5}" + " {:5.1f}"*len(regressors)).format(
        d.get_class(),
        *(r(d)[0] for r in regressors)))

Looks like the housing prices are not that hard to predict:

y    linreg    rf ridge
22.2   19.3  21.8  19.5
31.6   33.2  26.5  33.2
21.7   20.9  17.0  21.0
10.2   16.9  14.3  16.8
14.0   13.6  14.9  13.5

Cross Validation

Evaluation and scoring methods are available at Orange.evaluation:

import Orange

data = Orange.data.Table("housing.tab")

lin = Orange.regression.linear.LinearRegressionLearner()
rf = Orange.regression.random_forest.RandomForestRegressionLearner()
rf.name = "rf"
ridge = Orange.regression.RidgeRegressionLearner()
mean = Orange.regression.MeanLearner()

learners = [lin, rf, ridge, mean]

res = Orange.evaluation.CrossValidation(data, learners, k=5)
rmse = Orange.evaluation.RMSE(res)
r2 = Orange.evaluation.R2(res)

print("Learner  RMSE  R2")
for i in range(len(learners)):
    print("{:8s} {:.2f} {:5.2f}".format(learners[i].name, rmse[i], r2[i]))

We have scored the regressors with two measures of goodness of fit: root-mean-square error (RMSE) and the coefficient of determination, or R squared. Random forest has the lowest root-mean-square error:

Learner  RMSE  R2
linreg   4.88  0.72
rf       4.70  0.74
ridge    4.91  0.71
mean     9.20 -0.00

Not much difference here. Each regression method has a set of parameters. We have been running them with default parameters, and parameter tuning would likely help. Also, we have included MeanLearner in the list of our regressors; this regressor simply predicts the mean value from the training set and is used as a baseline.
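
To show why the mean baseline scores an R squared of about zero, the following sketch computes both measures for a mean predictor under k-fold cross validation in plain Python. It is an illustration under stated assumptions, not Orange's implementation: the target values are synthetic (not the housing data), the helper names are hypothetical, and predictions from all folds are pooled before scoring, which may differ from how Orange aggregates results:

```python
import math
import random


def rmse(actual, predicted):
    """Root-mean-square error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def cross_validate_mean(ys, k=5, seed=0):
    """Predict every held-out value with the mean of the training folds."""
    indices = list(range(len(ys)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    actual, predicted = [], []
    for fold in folds:
        held_out = set(fold)
        train = [ys[i] for i in indices if i not in held_out]
        train_mean = sum(train) / len(train)
        for i in fold:
            actual.append(ys[i])
            predicted.append(train_mean)
    return actual, predicted


# Synthetic target values, made up for illustration.
rng = random.Random(1)
ys = [rng.gauss(22.0, 9.0) for _ in range(200)]

actual, predicted = cross_validate_mean(ys, k=5)
baseline_rmse = rmse(actual, predicted)
baseline_r2 = r2(actual, predicted)
print("RMSE %.2f  R2 %.2f" % (baseline_rmse, baseline_r2))
```

Because each training-fold mean differs slightly from the mean of the held-out values, the baseline's R squared lands at or just below zero. A learner is useful only to the extent that it beats this baseline.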