Variable Descriptors (variable)

Every variable is associated with a descriptor that stores its name and other properties. Descriptors serve three main purposes:

  • conversion of values from textual format (e.g. when reading files) to the internal representation and back (e.g. when writing files or printing out);
  • identification of variables: two variables from different data sets are considered to be the same if they have the same descriptor;
  • conversion of values between domains or data sets, for instance from continuous to discrete data, using a pre-computed transformation.

Descriptors are most often constructed when loading the data from files.

>>> from Orange.data import Table
>>> iris = Table("iris")

>>> iris.domain.class_var
DiscreteVariable('iris')
>>> iris.domain.class_var.values
['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

>>> iris.domain[0]
ContinuousVariable('sepal length')
>>> iris.domain[0].number_of_decimals
1

Some variables are derived from others. For instance, discretizing a continuous variable gives a new, discrete variable. The new variable can compute its values from the original one.

>>> from Orange.preprocess import DomainDiscretizer
>>> discretizer = DomainDiscretizer()
>>> d_iris = discretizer(iris)
>>> d_iris[0]
DiscreteVariable('D_sepal length')
>>> d_iris[0].values
['<5.2', '[5.2, 5.8)', '[5.8, 6.5)', '>=6.5']

See Variable.compute_value for a detailed explanation.

Constructors

Orange maintains lists of existing descriptors for variables. This facilitates the reuse of descriptors: if two data sets refer to the same variables, they should be assigned the same descriptors so that, for instance, a model trained on one data set can make predictions for the other.

Variable descriptors are seldom constructed in user scripts. When needed, this can be done by calling the constructor directly or by calling the class method make. The difference is that make returns an existing descriptor if one has the same name and meets the other conditions, such as having the prescribed list of discrete values for DiscreteVariable:

>>> from Orange.data import ContinuousVariable
>>> age = ContinuousVariable.make("age")
>>> age1 = ContinuousVariable.make("age")
>>> age2 = ContinuousVariable("age")
>>> age is age1
True
>>> age is age2
False

The first line returns a new descriptor after not finding an existing descriptor for a continuous variable named “age”. The second reuses the first descriptor. The last creates a new one since the constructor is invoked directly.

The distinction does not matter in most cases, but it is important when loading the data from different files. Orange uses the make constructor when loading data.
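
The difference between make and the plain constructor can be illustrated with a minimal, library-independent sketch. The class SketchVariable and its _registry dictionary are hypothetical names used only for illustration; Orange's actual bookkeeping may differ, and this sketch assumes that the plain constructor does not register the new descriptor.

```python
class SketchVariable:
    """Hypothetical, simplified descriptor; not Orange's implementation."""

    # Assumption: existing descriptors are looked up by (class, name).
    _registry = {}

    def __init__(self, name):
        self.name = name

    @classmethod
    def make(cls, name):
        # Reuse an existing descriptor with this name, or create and store one.
        key = (cls, name)
        if key not in cls._registry:
            cls._registry[key] = cls(name)
        return cls._registry[key]


age = SketchVariable.make("age")   # nothing registered yet: a new descriptor
age1 = SketchVariable.make("age")  # found in the registry: the same object
age2 = SketchVariable("age")       # direct construction: always a new object
print(age is age1, age is age2)
```

Because identity of descriptors decides whether two variables are "the same", two files loaded through make end up sharing descriptors, while directly constructed descriptors never do.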

Base class

class Orange.data.Variable(name="", compute_value=None)

The base class for variable descriptors contains the variable’s name and some basic properties.

name

The name of the variable.

unknown_str

A set of values that represent unknowns in conversion from textual formats. Default is {"?", ".", "", "NA", "~", None}.

compute_value

A function for computing the variable’s value when converting from another domain which does not contain this variable. The base class defines a static method compute_value, which returns Unknown. Non-primitive variables must redefine it to return None.

source_variable

An optional descriptor of the source variable - if any - from which this variable is derived and computed via compute_value.

attributes

A dictionary with user-defined attributes of the variable.

master

The variable that this variable is a copy of. If a copy is made from a copy, the copy has a reference to the original master. If the variable is not a copy, it is its own master.

classmethod is_primitive()

True if the variable’s values are stored as floats. Non-primitive variables can appear in the data only as meta attributes.

str_val(val)

Return a textual representation of variable’s value val. Argument val must be a float (for primitive variables) or an arbitrary Python object (for non-primitives).

Derived classes must overload the function.

to_val(s)

Convert the given argument to a value of the variable. The argument can be a string, a number or None. For primitive variables, the base class provides a method that returns Unknown if s is found in unknown_str, and raises an exception otherwise. For non-primitive variables it returns the argument itself.

Derived classes of primitive variables must overload the function.

Parameters: s (str, float or None) – value, represented as a number, string or None
Return type: float or object

val_from_str_add(s)

Convert the given string to a value of the variable. The method is similar to to_val except that it only accepts strings and that it adds new values to the variable’s domain where applicable.

The base class method calls to_val.

Parameters: s (str) – symbolic representation of the value
Return type: float or object

compute_value

Method compute_value is usually invoked behind the scenes in conversion of domains:

>>> from Orange.data import Table
>>> from Orange.preprocess import DomainDiscretizer

>>> iris = Table("iris")
>>> iris_1 = iris[::2]
>>> discretizer = DomainDiscretizer()
>>> d_iris_1 = discretizer(iris_1)

>>> d_iris_1[0]
DiscreteVariable('D_sepal length')
>>> d_iris_1[0].source_variable
ContinuousVariable('sepal length')
>>> d_iris_1[0].compute_value
<Orange.feature.discretization.Discretizer at 0x10d5108d0>

The data is loaded and the instances at even indices are put into a new table, from which we compute discretized data. The discretized variable “D_sepal length” refers to the original as its source and stores a function for converting the original continuous values into discrete ones. This function (and the corresponding functions for other variables) is used for converting the remaining data:

>>> iris_2 = iris[1::2]
>>> d_iris_2 = Table(d_iris_1.domain, iris_2)
>>> d_iris_2[0]
[<5.2, [2.8, 3), <1.6, <0.2 | Iris-setosa]

In the first line we select the instances with odd indices in the original table, that is, the data which was not used for computing the discretization. In the second line we construct a new data table with the discrete domain d_iris_1.domain and using the original data iris_2. Behind the scenes, the values for those variables in the destination domain (d_iris_1.domain) that do not appear in the source domain (iris_2.domain) are computed by passing the source data instance to the destination variables’ Variable.compute_value.

This mechanism is used throughout Orange to compute all preprocessing on the training data and apply the same transformations to the testing data without hassle.

Note that even such conversions are typically not coded in user scripts but implemented within the provided wrappers and cross-validation schemes.
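
The conversion described above can be sketched without Orange. Everything in the following block is a hypothetical simplification: SketchVariable, convert_row and make_threshold_discretizer are illustrative names, a single threshold (the training mean) stands in for a real discretization, and NaN is assumed as the representation of an unknown value.

```python
import math


class SketchVariable:
    """Hypothetical stand-in for a variable descriptor (not Orange's class)."""

    def __init__(self, name, compute_value=None):
        self.name = name
        self.compute_value = compute_value


def convert_row(row, source_vars, dest_vars):
    """Express a row given in source_vars in terms of dest_vars."""
    values = dict(zip(source_vars, row))  # descriptors are compared by identity
    converted = []
    for var in dest_vars:
        if var in values:
            # The variable exists in the source domain: copy its value.
            converted.append(values[var])
        elif var.compute_value is not None:
            # Otherwise let the destination variable compute its own value.
            converted.append(var.compute_value(values))
        else:
            converted.append(math.nan)  # assumption: NaN marks an unknown
    return converted


def make_threshold_discretizer(source, threshold):
    """Return a function mapping the source value to 0.0 or 1.0."""
    def compute(values):
        return 0.0 if values[source] < threshold else 1.0
    return compute


length = SketchVariable("sepal length")
train = [[4.9], [5.1], [6.3], [7.0]]
# The threshold is "learned" on the training rows only.
threshold = sum(row[0] for row in train) / len(train)
d_length = SketchVariable(
    "D_sepal length", make_threshold_discretizer(length, threshold))

# A row that took no part in training is converted with the stored function.
converted = convert_row([6.4], [length], [d_length])
print(converted)
```

The point of the design is that the transformation travels with the descriptor, so any data expressed in the original variables can later be converted into the derived domain.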

Continuous variables

class Orange.data.ContinuousVariable(name="", number_of_decimals=None, compute_value=None)

Descriptor for continuous variables.

number_of_decimals

The number of decimals when the value is printed out (default: 3).

adjust_decimals

A flag regulating whether the number_of_decimals is being adjusted by to_val.

Initially, number_of_decimals is set to 3 and adjust_decimals is set to 2. When val_from_str_add is called for the first time with a string as an argument, number_of_decimals is set to the number of decimals in the string and adjust_decimals is set to 1. In subsequent calls of to_val, the number of decimals is increased if the string argument has a larger number of decimals.

If the number_of_decimals is set manually, adjust_decimals is set to 0 to prevent changes by to_val.
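
The three states of adjust_decimals can be followed in a small sketch. SketchContinuous is a hypothetical class written for illustration; it assumes that the adjustment happens in val_from_str_add and that values are given in plain decimal notation (no exponents).

```python
class SketchContinuous:
    """Hypothetical model of the number_of_decimals bookkeeping."""

    def __init__(self):
        self._number_of_decimals = 3
        self.adjust_decimals = 2  # 2: no string has been seen yet

    @property
    def number_of_decimals(self):
        return self._number_of_decimals

    @number_of_decimals.setter
    def number_of_decimals(self, n):
        self._number_of_decimals = n
        self.adjust_decimals = 0  # 0: set manually, never adjusted again

    def val_from_str_add(self, s):
        if self.adjust_decimals:
            # Count the digits after the decimal point, if any.
            decimals = len(s.strip().partition(".")[2])
            if self.adjust_decimals == 2 or decimals > self._number_of_decimals:
                self._number_of_decimals = decimals
            self.adjust_decimals = 1  # 1: adjusted, may only increase
        return float(s)


var = SketchContinuous()
var.val_from_str_add("5.1")    # first string: one decimal, flag becomes 1
var.val_from_str_add("4.25")   # more decimals: raised to 2
var.val_from_str_add("3")      # fewer decimals: unchanged
learned = var.number_of_decimals
var.number_of_decimals = 1     # manual setting: flag becomes 0
var.val_from_str_add("2.123")  # no longer adjusted
```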

make(name)

Return an existing continuous variable with the given name, or construct and return a new one.

is_primitive()

True if the variable’s values are stored as floats. Non-primitive variables can appear in the data only as meta attributes.

str_val(val)

Return the value as a string with the prescribed number of decimals.

to_val(s)

Convert a value, given as an instance of an arbitrary type, to a float.

val_from_str_add(s)

Convert a value from a string and adjust the number of decimals if adjust_decimals is non-zero.

Discrete variables

class Orange.data.DiscreteVariable(name="", values=(), ordered=False, base_value=-1, compute_value=None)

Descriptor for symbolic, discrete variables. Values of discrete variables are stored as floats; the numbers correspond to indices in the list of values.

values

A list of the variable’s values.

ordered

Some algorithms (and, in particular, visualizations) may sometimes reorder the values of the variable, e.g. alphabetically. This flag hints that the given order of values is “natural” (e.g. “small”, “middle”, “large”) and should not be changed.

base_value

The index of the base value, or -1 if there is none. The base value is used by some methods, for instance when creating dummy variables for regression.

classmethod make(name, values=(), ordered=False, base_value=-1)

Return a variable with the given name and other properties. The method first looks for a compatible existing variable: the existing variable must have the same name and both variables must have either ordered or unordered values. If values are ordered, the order must be compatible: all common values must have the same order. If values are unordered, the existing variable must have at least one common value with the new one, except when any of the two lists of values is empty.

If a compatible variable is found, it is returned, with missing values appended to the end of the list. If there is no explicit order, the values are ordered using ordered_values. If no compatible variable is found, the method constructs and returns a new variable descriptor.

Parameters:
  • name (str) – the name of the variable
  • values (list) – symbolic values for the variable
  • ordered (bool) – tells whether the order of values is fixed
  • base_value (int) – the index of the base value, or -1 if there is none
Returns:

an existing compatible variable or a newly constructed one
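
The compatibility rules for make can be written down as a short sketch. The functions is_compatible and merged_values are hypothetical helpers that only restate the conditions given above; they assume the names already match, and they leave out the reordering with ordered_values.

```python
def is_compatible(existing_values, existing_ordered, new_values, new_ordered):
    """Check the stated conditions for reusing an existing descriptor."""
    # Both variables must be ordered, or both unordered.
    if existing_ordered != new_ordered:
        return False
    common_existing = [v for v in existing_values if v in new_values]
    if new_ordered:
        # All common values must appear in the same relative order.
        common_new = [v for v in new_values if v in existing_values]
        return common_existing == common_new
    # Unordered: at least one common value, unless either list is empty.
    return bool(common_existing) or not existing_values or not new_values


def merged_values(existing_values, new_values):
    """Append values missing from the existing list to its end."""
    missing = [v for v in new_values if v not in existing_values]
    return list(existing_values) + missing


print(is_compatible(["small", "large"], True,
                    ["small", "medium", "large"], True))
print(merged_values(["small", "large"], ["small", "medium", "large"]))
```

Note that appending missing values keeps the indices of the existing values unchanged, so data already stored with the reused descriptor stays valid.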

is_primitive()

True if the variable’s values are stored as floats. Non-primitive variables can appear in the data only as meta attributes.

str_val(val)

Return a textual representation of the value (self.values[int(val)]) or “?” if the value is unknown.

Parameters: val (float, should be a whole number) – value
Return type: str

to_val(s)

Convert the given argument to a value of the variable (float). If the argument is numeric, its value is returned without checking whether it is integer and within bounds. Unknown is returned if the argument is one of the representations for unknown values. Otherwise, the argument must be a string and the method returns its index in values.

Parameters: s – value, represented as a number, string or None
Return type: float

val_from_str_add(s)

Similar to to_val, except that it accepts only strings and that it adds the value to the list if it does not exist yet.

Parameters: s (str) – symbolic representation of the value
Return type: float
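
The three conversion methods of a discrete variable can be summarized in a sketch. SketchDiscrete is a hypothetical class, not Orange's DiscreteVariable; it assumes that Unknown is represented by NaN and reuses the default set of unknown strings listed for the base class.

```python
import math

# Default representations of unknown values, as listed for unknown_str.
UNKNOWN_STRS = {"?", ".", "", "NA", "~", None}


class SketchDiscrete:
    """Hypothetical model of value conversion for a discrete variable."""

    def __init__(self, values):
        self.values = list(values)

    def to_val(self, s):
        if isinstance(s, (int, float)):
            return float(s)  # returned without any bounds check
        if s in UNKNOWN_STRS:
            return math.nan  # assumption: NaN stands for Unknown
        return float(self.values.index(s))

    def val_from_str_add(self, s):
        if s in UNKNOWN_STRS:
            return math.nan
        if s not in self.values:
            self.values.append(s)  # new symbolic values extend the list
        return float(self.values.index(s))

    def str_val(self, val):
        return "?" if math.isnan(val) else self.values[int(val)]


iris_class = SketchDiscrete(["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
index = iris_class.to_val("Iris-versicolor")        # index in values, as a float
added = iris_class.val_from_str_add("Iris-other")   # appended as a new value
print(index, added, iris_class.str_val(added))
```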

String variables

class Orange.data.StringVariable(name="", compute_value=None)

Descriptor for string variables. String variables can only appear as meta attributes.

make(name)

Return an existing string variable with the given name, or construct and return a new one.

is_primitive()

True if the variable’s values are stored as floats. Non-primitive variables can appear in the data only as meta attributes.

static str_val(val)

Return a string representation of the value.

to_val(s)

Return the value as a string. If it is already a string, the same object is returned.

val_from_str_add(s)

Return the value as a string. If it is already a string, the same object is returned.

Time variables

Time variables are continuous variables with value 0 on the Unix epoch, 1 January 1970 00:00:00.0 UTC. Positive numbers are dates after the epoch, and negative numbers are dates before it. Due to a limitation of Python's datetime module, only dates in 1 A.D. or later are supported.

class Orange.data.TimeVariable(*args, **kwargs)

TimeVariable is a continuous variable with Unix epoch (1970-01-01 00:00:00+0000) as the origin (0.0). Later dates are positive real numbers (equivalent to Unix timestamp, with microseconds in the fraction part), and the dates before it map to the negative real numbers.

Due to a limitation of Python's datetime module, only dates with year >= 1 (A.D.) are supported.

If time is specified without a date, Unix epoch is assumed.

If time is specified without a UTC offset, local time is assumed.

parse(datestr)

Return datestr, a datetime provided in one of ISO 8601 formats, parsed as a real number. Value 0 marks the Unix epoch, positive values are the dates after it, negative before.

If date is unspecified, epoch date is assumed.

If time is unspecified, 00:00:00.0 is assumed.

If timezone is unspecified, local time is assumed.
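
The mapping from ISO 8601 strings to real numbers can be approximated with the standard library alone. The function parse_iso_sketch below is a hypothetical helper, not the parse method itself; it supports only the formats accepted by Python's datetime.fromisoformat and time.fromisoformat, which is a subset of what the description above allows.

```python
from datetime import date, datetime, time, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_iso_sketch(datestr):
    """Hypothetical helper: ISO 8601 string to seconds since the Unix epoch."""
    try:
        # Full datetime, or date only (the time then defaults to 00:00:00).
        dt = datetime.fromisoformat(datestr)
    except ValueError:
        # Time only: attach the epoch date, keeping any UTC offset.
        dt = datetime.combine(date(1970, 1, 1), time.fromisoformat(datestr))
    if dt.tzinfo is None:
        # No UTC offset given: interpret the value as local time.
        dt = dt.astimezone()
    return (dt - EPOCH).total_seconds()


print(parse_iso_sketch("1970-01-02T00:00:00+00:00"))  # one day after the epoch
print(parse_iso_sketch("1969-12-31T12:00:00+00:00"))  # half a day before it
print(parse_iso_sketch("01:00:00+00:00"))             # time only, epoch date
```

Results for strings without a UTC offset depend on the local timezone of the machine, which is why the usage lines above all carry an explicit offset.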