Scoring (`scoring`)¶

A feature score is an assessment of the usefulness of the feature for predicting the dependent (class) variable. Orange provides classes that compute the common feature scores for classification and regression.

The script below computes the information gain of feature “tear_rate” in the Lenses data set (loaded into data):

>>> print Orange.feature.scoring.InfoGain("tear_rate", data)
0.548795044422

Calling the scorer by passing the variable and the data to the constructor, as above, is convenient. However, when scoring multiple variables, some methods run much faster if the scorer is constructed once, stored, and then called for each variable.

>>> gain = Orange.feature.scoring.InfoGain()
>>> for feature in data.domain.features:
...     print feature.name, gain(feature, data)
age 0.0393966436386
prescription 0.0395109653473
astigmatic 0.377005338669
tear_rate 0.548795044422

The speed gain is most noticeable in Relief, which computes the scores of all features in parallel.

The module also provides a convenience function score_all that computes the scores for all attributes. The following example computes feature scores, both with score_all and by scoring each feature individually, and prints out the best three features.

import Orange
voting = Orange.data.Table("voting")

def print_best_3(ma):
    for m in ma[:3]:
        print "%5.3f %s" % (m[1], m[0])

print 'Feature scores for best three features (with score_all):'
ma = Orange.feature.scoring.score_all(voting)
print_best_3(ma)

print

print 'Feature scores for best three features (scored individually):'
meas = Orange.feature.scoring.Relief(k=20, m=50)
mr = [ (a.name, meas(a, voting)) for a in voting.domain.attributes]
mr.sort(key=lambda x: -x[1]) #sort decreasingly by the score
print_best_3(mr)

The output:

Feature scores for best three features (with score_all):
0.613 physician-fee-freeze
0.255 el-salvador-aid
0.228 synfuels-corporation-cutback

Feature scores for best three features (scored individually):
0.613 physician-fee-freeze
0.255 el-salvador-aid
0.228 synfuels-corporation-cutback

It is also possible to score features that do not appear in the data but can be computed from it. Discretized features are a typical case:

import Orange
iris = Orange.data.Table("iris")

d1 = Orange.feature.discretization.Entropy("petal length", iris)
print Orange.feature.scoring.InfoGain(d1, iris)

Calling scoring methods¶

Scorers can be called with different types of arguments. For instance, when given the data, most scoring methods first compute the corresponding contingency tables. If these are already known, they can be given to the scorer instead of the data to save some time.
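The saving from precomputed contingency tables can be illustrated without Orange. The sketch below is plain Python and does not use the Orange API; the helper names `contingency` and `purity` and the toy rows are hypothetical. It counts (feature value, class) pairs once per feature and then scores from the counts alone, so a second scorer could reuse the same tables without touching the rows again.

```python
from collections import Counter

def contingency(rows, feature_index, class_index=-1):
    """Count (feature value, class value) pairs in one pass over the rows."""
    table = {}
    for row in rows:
        table.setdefault(row[feature_index], Counter())[row[class_index]] += 1
    return table

def purity(table):
    """Toy score: share of instances in the majority class of their feature value."""
    total = sum(sum(counts.values()) for counts in table.values())
    return sum(max(counts.values()) for counts in table.values()) / float(total)

# Hypothetical toy data: (outlook, windy, class)
rows = [
    ("sunny", "no", "play"),
    ("sunny", "yes", "play"),
    ("rainy", "no", "stay"),
    ("rainy", "yes", "stay"),
]

tables = [contingency(rows, i) for i in range(2)]  # computed once per feature
scores = [purity(t) for t in tables]               # scoring needs only the counts
```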

Not all classes accept all kinds of arguments. Relief, for instance, only supports the form with instances on the input.

Score.__call__(attribute, data[, apriori_class_distribution][, weightID])¶

Parameters:
    attribute (`Orange.feature.Descriptor` or int or string) – the chosen feature, either as a descriptor, index, or a name.
    data (Orange.data.Table) – data.
    weightID – id for meta-feature with weight.

All scoring methods support this form.

Score.__call__(attribute, domain_contingency[, apriori_class_distribution])

Parameters:
    attribute (`Orange.feature.Descriptor` or int or string) – the chosen feature, either as a descriptor, index, or a name.
    domain_contingency (`Orange.statistics.contingency.Domain`)

Score.__call__(contingency, class_distribution[, apriori_class_distribution])

Parameters:
    contingency (`Orange.statistics.contingency.VarClass`)
    class_distribution (`Orange.statistics.distribution.Distribution`) – distribution of the class variable. If `unknowns_treatment` is `IgnoreUnknowns`, it should be computed on instances where feature value is defined. Otherwise, class distribution should be the overall class distribution.
    apriori_class_distribution – Optional and most often ignored. Useful if the scoring method makes any probability estimates based on apriori class probabilities (such as the m-estimate).
Returns:	Feature score - the higher the value, the better the feature. If the quality cannot be scored, return `Score.Rejected`.
Return type:	float or `Score.Rejected`.

The code demonstrates using the different call signatures by computing the score of the same feature with GainRatio.

import Orange
titanic = Orange.data.Table("titanic")
meas = Orange.feature.scoring.GainRatio()

print "Call with variable and data table"
print meas(0, titanic)

print "Call with variable and domain contingency"
domain_cont = Orange.statistics.contingency.Domain(titanic)
print meas(0, domain_cont)

print "Call with contingency and class distribution"
cont = Orange.statistics.contingency.VarClass(0, titanic)
class_dist = Orange.statistics.distribution.Distribution( \
    titanic.domain.class_var, titanic)
print meas(cont, class_dist)

Feature scoring in classification problems¶

class Orange.feature.scoring.InfoGain¶: Information gain; the expected decrease of entropy. See the corresponding page on Wikipedia.
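As an illustration of the definition, information gain can be computed by hand from a contingency table. The sketch below is plain Python, not Orange's implementation; the function names are hypothetical, and the class counts for tear_rate are an assumption based on the commonly published Lenses data (24 instances, classes none, soft, hard).

```python
from math import log

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    n = float(sum(counts))
    return -sum(c / n * log(c / n, 2) for c in counts if c > 0)

def info_gain(table):
    """Expected decrease of class entropy; one row of class counts per feature value."""
    n = float(sum(sum(row) for row in table))
    class_counts = [sum(col) for col in zip(*table)]
    expected = sum(sum(row) / n * entropy(row) for row in table)
    return entropy(class_counts) - expected

# Assumed class counts (none, soft, hard) for each value of tear_rate
tear_rate = [
    [12, 0, 0],  # reduced
    [3, 5, 4],   # normal
]
gain = info_gain(tear_rate)
```

Under these assumed counts the result agrees with the score of about 0.5488 reported for tear_rate at the top of this page.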

class Orange.feature.scoring.GainRatio¶: Information gain ratio; information gain divided by the entropy of the feature’s values. Introduced in [Quinlan1986] in order to avoid overestimation of multi-valued features. It has been shown, however, that it still overestimates features with multiple values. See Wikipedia.
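The effect of the normalization can be seen on two hypothetical features that both separate the classes perfectly. The sketch below is plain Python, not Orange's implementation, and the function names and tables are made up for illustration. Both features have an information gain of 1 bit, but the feature with eight distinct values is divided by a larger entropy.

```python
from math import log

def entropy(counts):
    """Shannon entropy (in bits) of a list of counts."""
    n = float(sum(counts))
    return -sum(c / n * log(c / n, 2) for c in counts if c > 0)

def gain_ratio(table):
    """Information gain divided by the entropy of the feature's value distribution."""
    n = float(sum(sum(row) for row in table))
    class_counts = [sum(col) for col in zip(*table)]
    gain = entropy(class_counts) - sum(sum(r) / n * entropy(r) for r in table)
    split_entropy = entropy([sum(r) for r in table])
    return gain / split_entropy if split_entropy > 0 else 0.0

binary = [[4, 0], [0, 4]]              # two values, four instances each
id_like = [[1, 0]] * 4 + [[0, 1]] * 4  # eight values, one instance each

ratio_binary = gain_ratio(binary)    # 1 bit / 1 bit
ratio_id_like = gain_ratio(id_like)  # 1 bit / 3 bits
```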

class Orange.feature.scoring.Gini¶: Gini index is the probability that two randomly chosen instances will have different classes. See Gini coefficient on Wikipedia.
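The probability in this definition equals one minus the sum of squared class proportions (drawing with replacement). A minimal sketch in plain Python follows; it is not Orange's implementation. It assumes that a feature is scored by the decrease of this impurity after splitting on its values, and whether Orange applies any further normalization is not stated here.

```python
def gini(counts):
    """Probability that two instances drawn with replacement differ in class."""
    n = float(sum(counts))
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_decrease(table):
    """Impurity before the split minus the weighted impurity after it."""
    n = float(sum(sum(row) for row in table))
    class_counts = [sum(col) for col in zip(*table)]
    return gini(class_counts) - sum(sum(r) / n * gini(r) for r in table)

balanced = gini([5, 5])                         # two equally likely classes
pure = gini([10, 0])                            # a single class
perfect_split = gini_decrease([[5, 0], [0, 5]])
```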

class Orange.feature.scoring.Relevance¶: The potential value for decision rules.

class Orange.feature.scoring.Cost¶

Evaluates features based on the cost decrease achieved by knowing the value of feature, according to the specified cost matrix.

cost¶: Cost matrix, an instance of Orange.misc.CostMatrix.

If the cost of predicting the first class of an instance that is actually in the second is 5, and the cost of the opposite error is 1, then an appropriate score can be constructed as follows:

>>> meas = Orange.feature.scoring.Cost()
>>> meas.cost = ((0, 5), (1, 0))
>>> meas(3, data)
0.083333350718021393

Knowing the value of feature 3 would decrease the classification cost by approximately 0.083 per instance.
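One plausible reading of "cost decrease" is sketched below in plain Python. This is an assumption for illustration, not Orange's implementation: the expected cost of the cheapest constant prediction is computed once for the whole data and once within each feature value, and the score is the difference. The function names and the contingency table are hypothetical; `cost[p][a]` is taken to be the cost of predicting class `p` when the actual class is `a`, matching the matrix above.

```python
def min_expected_cost(class_counts, cost):
    """Lowest expected cost per instance over all constant predictions."""
    n = float(sum(class_counts))
    return min(
        sum(cost[p][a] * class_counts[a] for a in range(len(class_counts))) / n
        for p in range(len(cost))
    )

def cost_decrease(table, cost):
    """Cost without the feature minus the weighted cost within each feature value."""
    n = float(sum(sum(row) for row in table))
    class_counts = [sum(col) for col in zip(*table)]
    with_feature = sum(sum(row) / n * min_expected_cost(row, cost) for row in table)
    return min_expected_cost(class_counts, cost) - with_feature

cost = ((0, 5), (1, 0))
table = [[9, 1], [1, 9]]  # hypothetical class counts for two feature values
decrease = cost_decrease(table, cost)
```

With these made-up counts the cost drops from 0.5 to 0.3 per instance.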

class Orange.feature.scoring.Relief¶

Assesses features’ ability to distinguish between very similar instances from different classes. This scoring method was first developed by Kira and Rendell and then improved by Kononenko. The class Relief works on discrete and continuous classes and thus implements ReliefF and RReliefF.
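The core idea can be sketched with the basic Relief update: a feature gains weight when it differs on the nearest instance of another class (the miss) and loses weight when it differs on the nearest instance of the same class (the hit). The code below is a simplified plain-Python illustration for discrete features with one hit and one miss per instance. It is not ReliefF or RReliefF and not Orange's implementation, and the toy data is hypothetical.

```python
def relief(X, y):
    """Basic Relief using every instance as a reference, one hit and one miss each."""
    n_features = len(X[0])

    def diff(a, b, f):
        return 0.0 if a[f] == b[f] else 1.0

    def distance(a, b):
        return sum(diff(a, b, f) for f in range(n_features))

    weights = [0.0] * n_features
    for i, x in enumerate(X):
        hits = [X[j] for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda other: distance(x, other))
        miss = min(misses, key=lambda other: distance(x, other))
        for f in range(n_features):
            weights[f] += (diff(x, miss, f) - diff(x, hit, f)) / len(X)
    return weights

# Hypothetical data: feature 0 determines the class, feature 1 is noise
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]
weights = relief(X, y)
```

On this toy data the informative feature receives the highest possible weight and the noise feature the lowest.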

ReliefF is slow since it needs to find k nearest neighbours for each of m reference instances. Because ReliefF is normally computed for all features in the data set, Relief caches the results for all features when it is called to score any one of them. When called again, it uses the stored results if the domain and the data table have not changed (the data table version and the data checksum are compared). Caching only works if the same object is reused. Constructing a new instance of Relief for each feature, like this:

for attr in data.domain.attributes:
    print Orange.feature.scoring.Relief(attr, data)

runs much slower than reusing the same instance:

meas = Orange.feature.scoring.Relief()
for attr in data.domain.attributes:
    print meas(attr, data)
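Why reuse matters can be shown with a small stand-in. The class below is hypothetical plain Python, not Orange code: it computes the scores of all features on the first call and serves later calls from the cache. It uses `id(data)` as a simplified substitute for the version and checksum comparison described above.

```python
class CachingScorer(object):
    """Hypothetical scorer that computes all scores at once and caches them."""

    def __init__(self, score_all_features):
        self._score_all = score_all_features
        self._cache_key = None
        self._scores = None
        self.computations = 0  # number of full (expensive) computations

    def __call__(self, feature_index, data):
        key = id(data)  # stand-in for a data version and checksum comparison
        if key != self._cache_key:
            self._scores = self._score_all(data)
            self._cache_key = key
            self.computations += 1
        return self._scores[feature_index]

data = [[1, 2, 3], [4, 5, 6]]
column_sums = lambda d: [sum(col) for col in zip(*d)]  # placeholder "scores"

reused = CachingScorer(column_sums)
reused_scores = [reused(i, data) for i in range(3)]  # one full computation

fresh_runs = 0
for i in range(3):
    scorer = CachingScorer(column_sums)  # a new object has an empty cache
    scorer(i, data)
    fresh_runs += scorer.computations    # one full computation per feature
```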

k¶: Number of neighbours for each instance. Default is 5.

m¶: Number of reference instances. Default is 100. When -1, all instances are used as reference.

check_cached_data¶: Check if the cached data has changed, which may be slow on large tables. Defaults to True, but should be disabled when it is certain that the data will not change while the scorer is used.

class Orange.feature.scoring.Distance¶: The 1-D distance is defined as information gain divided by joint entropy H_{CA} (C is the class variable and A the feature):

1-D(C,A) = \frac{\mathrm{Gain}(A)}{H_{CA}}
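The formula can be evaluated directly from a contingency table, since the joint entropy is the entropy of all cell counts taken together. The sketch below is plain Python, not Orange's implementation; the function names are hypothetical and the table reuses the assumed Lenses counts for tear_rate (classes none, soft, hard).

```python
from math import log

def entropy(counts):
    """Shannon entropy (in bits) of a list of counts."""
    n = float(sum(counts))
    return -sum(c / n * log(c / n, 2) for c in counts if c > 0)

def one_minus_d(table):
    """Information gain divided by the joint entropy of class and feature."""
    n = float(sum(sum(row) for row in table))
    class_counts = [sum(col) for col in zip(*table)]
    gain = entropy(class_counts) - sum(sum(r) / n * entropy(r) for r in table)
    joint_entropy = entropy([c for row in table for c in row])
    return gain / joint_entropy

tear_rate = [[12, 0, 0], [3, 5, 4]]  # assumed counts: reduced, normal
score = one_minus_d(tear_rate)
```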

class Orange.feature.scoring.MDL¶: Minimum description length principle [Kononenko1995]. Let n be the number of instances, n_0 the number of classes, and n_{cj} the number of instances with feature value j and class value c. Then the MDL score for the feature A is

\mathrm{MDL}(A) = \frac{1}{n} \Bigg[ \log\binom{n}{n_{1.},\cdots,n_{n_0 .}} - \sum_j \log \binom{n_{.j}}{n_{1j},\cdots,n_{n_0 j}} \\ + \log \binom{n+n_0-1}{n_0-1} - \sum_j \log \binom{n_{.j}+n_0-1}{n_0-1} \Bigg]
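The formula can be evaluated term by term with log-gamma functions. The sketch below is plain Python, not Orange's implementation. The base of the logarithm is not given above, so base 2 is an assumption here, and the function names and tables are hypothetical.

```python
from math import lgamma, log

def log2_multinomial(counts):
    """log2 of the multinomial coefficient (sum(counts); counts...)."""
    n = sum(counts)
    return (lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)) / log(2)

def log2_binomial(n, k):
    """log2 of the binomial coefficient C(n, k)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

def mdl(table):
    """MDL score; `table` holds one row of class counts per feature value."""
    n = sum(sum(row) for row in table)
    n0 = len(table[0])
    class_counts = [sum(col) for col in zip(*table)]
    prior = log2_multinomial(class_counts) + log2_binomial(n + n0 - 1, n0 - 1)
    post = sum(
        log2_multinomial(row) + log2_binomial(sum(row) + n0 - 1, n0 - 1)
        for row in table
    )
    return (prior - post) / float(n)

informative = mdl([[10, 0], [0, 10]])  # feature separates the classes
uninformative = mdl([[1, 1], [1, 1]])  # feature carries no information
```

In this sketch a useless feature gets a negative score, because describing the class within each feature value costs more than describing it without the feature.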

Feature scoring in regression problems¶

class Orange.feature.scoring.Relief: Relief is used for regression in the same way as for classification (see Relief in classification problems).

class Orange.feature.scoring.MSE¶

Implements the mean square error score.

unknowns_treatment¶: Decides the treatment of unknown values. See Score.unknowns_treatment.

m¶: Parameter for m-estimate of error. Default is 0 (no m-estimate).
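As a rough illustration of an MSE-based score for a discrete feature in regression, the sketch below compares the error of predicting the overall mean with the error of predicting the mean within each feature value. It is plain Python with hypothetical names. The relative normalization is an assumption rather than Orange's documented definition, and the m-estimate is left out (as with m=0).

```python
def mse(values):
    """Mean squared deviation of the values from their own mean."""
    mean = sum(values) / float(len(values))
    return sum((v - mean) ** 2 for v in values) / len(values)

def mse_reduction(groups):
    """Relative drop in MSE; `groups` holds the target values per feature value."""
    all_values = [v for group in groups for v in group]
    n = float(len(all_values))
    within = sum(len(group) / n * mse(group) for group in groups)
    prior = mse(all_values)
    return (prior - within) / prior if prior > 0 else 0.0

useful = mse_reduction([[1.0, 1.0], [3.0, 3.0]])   # groups explain the target
useless = mse_reduction([[1.0, 3.0], [1.0, 3.0]])  # groups explain nothing
```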

Base Classes¶

The implemented feature scoring methods are subclasses of Score. Those that compute statistics on conditional distributions of class values given the feature values are derived from ScoreFromProbabilities.

class Orange.feature.scoring.Score¶

Abstract base class for feature scoring. Its attributes describe which types of features it can handle and which kind of data it requires.

Capabilities

handles_discrete¶: Indicates whether the scoring method can handle discrete features.

handles_continuous¶: Indicates whether the scoring method can handle continuous features.

computes_thresholds¶: Indicates whether the scoring method implements the threshold_function.

Input specification

needs¶

The type of data needed, indicated by one of the constants below. Classes that use DomainContingency will also handle generators. Those based on Contingency_Class will be able to take generators and domain contingencies.

Generator¶: Constant. Indicates that the scoring method needs an instance generator on the input as, for example, Relief.

DomainContingency¶: Constant. Indicates that the scoring method needs Orange.statistics.contingency.Domain.

Contingency_Class¶: Constant. Indicates that the scoring method needs the contingency (Orange.statistics.contingency.VarClass), feature distribution and the apriori class distribution (as most scoring methods).

Treatment of unknown values

unknowns_treatment¶

Defined in classes that are able to treat unknown values. It should be set to one of the values below.

IgnoreUnknowns¶: Constant. Instances for which the feature value is unknown are removed.

ReduceByUnknown¶: Constant. Features with unknown values are punished. The feature quality is reduced by the proportion of unknown values. For impurity scores the impurity decreases only where the value is defined and stays the same otherwise.

UnknownsToCommon¶: Constant. Undefined values are replaced by the most common value.

UnknownsAsValue¶: Constant. Unknown values are treated as a separate value.
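The four treatments can be mimicked on a list of (feature value, class) pairs. The sketch below is plain Python and only illustrates the descriptions above. The treatment names are hypothetical strings rather than the Orange constants, `None` marks an unknown value, and the reduction shown is the simple proportional one.

```python
from collections import Counter

def treat_unknowns(pairs, treatment):
    """Return the pairs as a scorer would see them under the given treatment."""
    known = [(v, c) for v, c in pairs if v is not None]
    if treatment == "ignore":     # like IgnoreUnknowns: drop the instances
        return known
    if treatment == "to_common":  # like UnknownsToCommon: use the most common value
        common = Counter(v for v, _ in known).most_common(1)[0][0]
        return [(common if v is None else v, c) for v, c in pairs]
    if treatment == "as_value":   # like UnknownsAsValue: a separate value
        return [("?" if v is None else v, c) for v, c in pairs]
    raise ValueError("unknown treatment: %r" % (treatment,))

def reduce_by_unknown(score_on_known, pairs):
    """Like ReduceByUnknown: scale the score by the share of known values."""
    share_unknown = sum(1 for v, _ in pairs if v is None) / float(len(pairs))
    return score_on_known * (1.0 - share_unknown)

pairs = [("a", "x"), ("a", "y"), ("b", "x"), (None, "y")]
ignored = treat_unknowns(pairs, "ignore")
replaced = treat_unknowns(pairs, "to_common")
separate = treat_unknowns(pairs, "as_value")
reduced = reduce_by_unknown(0.8, pairs)  # a quarter of the values are unknown
```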

Methods

__call__(): Abstract. See Calling scoring methods.

threshold_function(attribute, instances[, weightID])¶

Abstract.

Assess different binarizations of the continuous feature attribute. Return a list of tuples. The first element is a threshold (between two existing values), the second is the quality of the corresponding binary feature, and the third the distribution of instances below and above the threshold. Not all scorers return the third element.

To show the computation of thresholds, we shall use the Iris data set:

iris = Orange.data.Table("iris")
meas = Orange.feature.scoring.Relief()
for t in meas.threshold_function("petal length", iris):
    print "%5.3f: %5.3f" % t

best_threshold(attribute, instances)¶

Return the best threshold for binarization, that is, the threshold with which the resulting binary feature will have the optimal score.

The script below prints out the best threshold for binarization of a feature. ReliefF is used for scoring:

thresh, score, distr = meas.best_threshold("petal length", iris)
print "\nBest threshold: %5.3f (score %5.3f)" % (thresh, score)
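What the two methods compute can be sketched without Orange. The plain-Python code below places a candidate threshold midway between each pair of consecutive distinct values, scores the resulting binary feature, and picks the best one. It uses information gain instead of Relief to keep the example short; the function names and the toy data are hypothetical.

```python
from math import log

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    n = float(sum(counts))
    return -sum(c / n * log(c / n, 2) for c in counts if c > 0)

def threshold_scores(values, classes):
    """Return (threshold, information gain) for every candidate binarization."""
    labels = sorted(set(classes))

    def counts(subset):
        return [subset.count(label) for label in labels]

    distinct = sorted(set(values))
    n = float(len(values))
    results = []
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2.0  # midway between two existing values
        below = [c for v, c in zip(values, classes) if v < cut]
        above = [c for v, c in zip(values, classes) if v >= cut]
        after = (len(below) / n * entropy(counts(below))
                 + len(above) / n * entropy(counts(above)))
        results.append((cut, entropy(counts(classes)) - after))
    return results

def best_threshold(values, classes):
    """The candidate whose binary feature has the highest score."""
    return max(threshold_scores(values, classes), key=lambda t: t[1])

values = [1.0, 2.0, 3.0, 4.0]
classes = ["a", "a", "b", "b"]
candidates = threshold_scores(values, classes)
best_cut, best_score = best_threshold(values, classes)
```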

class Orange.feature.scoring.ScoreFromProbabilities¶

Bases: Score

Abstract base class for feature scoring methods that can be computed from contingency matrices.

estimator_constructor¶

conditional_estimator_constructor¶: The classes that are used to estimate unconditional and conditional probabilities of classes, respectively. Defaults use relative frequencies; possible alternatives are, for instance, ProbabilityEstimatorConstructor_m and ConditionalProbabilityEstimatorConstructor_ByRows (with estimator constructor again set to ProbabilityEstimatorConstructor_m), respectively.
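The difference between the default and the m-estimate can be shown numerically. The sketch below is plain Python and does not use the Orange estimator classes. It applies the standard m-estimate formula (n_c + m * p_c) / (n + m), where p_c is the prior probability of class c; the function names and counts are hypothetical.

```python
def relative_frequencies(counts):
    """Probability estimates as plain relative frequencies."""
    n = float(sum(counts))
    return [c / n for c in counts]

def m_estimate(counts, prior, m):
    """m-estimate: relative frequencies pulled towards the prior probabilities."""
    n = float(sum(counts))
    return [(c + m * p) / (n + m) for c, p in zip(counts, prior)]

counts = [3, 0]  # hypothetical class counts for one feature value
prior = [0.5, 0.5]

plain = relative_frequencies(counts)
smoothed = m_estimate(counts, prior, m=2)
unsmoothed = m_estimate(counts, prior, m=0)  # m=0 reduces to relative frequencies
```

With m=2 the unseen class no longer gets probability zero, which matters for scores that take logarithms of the estimates.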

Other¶

class Orange.feature.scoring.OrderAttributes(score=None)¶

Orders features by their scores.

score¶: A scoring method derived from Score. If None, Relief with m=5 and k=10 is used.

__call__(data, weight)¶

Score and order all features.

Parameters:
    data (`Table`) – a data table used to score features
    weight (`Descriptor`) – meta attribute that stores weights of instances

Orange.feature.scoring.score_all(data, score=Relief(k=20, m=50))¶

Assess the quality of features using the given measure and return a sorted list of tuples (feature name, measure).

Parameters:
    data (`Table`) – data table should include a discrete class.
    score (`Score`) – feature scoring function. Derived from `Score`. Defaults to `Relief` with k=20 and m=50.
Return type:	`list`; a sorted list of tuples (feature name, score)
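The shape of the returned list can be illustrated with a stand-in written in plain Python. The function below is hypothetical and is not the Orange `score_all`: it takes feature names and any scoring callable, and returns (feature name, score) tuples sorted by decreasing score. The scores in the example are the rounded information gains shown for the Lenses data at the top of this page.

```python
def rank_features(feature_names, score):
    """Score every feature and sort the (name, score) tuples by decreasing score."""
    scored = [(name, score(name)) for name in feature_names]
    return sorted(scored, key=lambda pair: -pair[1])

# Rounded information gains from the Lenses example
lenses_gain = {
    "age": 0.039,
    "prescription": 0.040,
    "astigmatic": 0.377,
    "tear_rate": 0.549,
}

ranked = rank_features(list(lenses_gain), lenses_gain.get)
best_three = ranked[:3]
```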

Bibliography

[Kononenko2007]

Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining, Woodhead Publishing, 2007.

[Quinlan1986]

J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.

[Breiman1984]

L Breiman et al: Classification and Regression Trees, Chapman and Hall, 1984.

[Kononenko1995]