The R package OpenML is an interface that makes interacting with the OpenML server as comfortable as possible. For example, users can download and upload files, run their implementations on specific tasks, and get predictions in the correct form directly via R commands. This tutorial shows the most important functions of the package and gives examples of standard workflows.
For general information on what OpenML is, please have a look at the README file or visit the official OpenML website.
After installation and before making practical use of the package, it is in most cases desirable to set up a configuration file to simplify further steps. Afterwards, there are different basic stages when using this package or OpenML, respectively:
- Listing (DataSets, Tasks, Flows, Runs, RunEvaluations, EvaluationMeasures, and TaskTypes): function names begin with listOML, and the result is a data.frame
- Downloading (DataSets, Tasks, Runs, Predictions, and Flows): function names begin with getOML
- Running models on tasks: runTaskMlr, with input OMLTask and Learner and output OMLMlrRun, OMLRun
- Uploading: uploadOMLRun
Installation works as in any other package using
install.packages("OpenML")
To install the current development version use the devtools package and run
devtools::install_github("openml/openml-r")
Using the OpenML package also requires a reader for the ARFF file format. By default farff is used. Alternatively, the RWeka package can be used. You can install the packages with the following calls.
install.packages(c("farff", "RWeka"))
All examples in this tutorial are given with a READ-ONLY API key.
With this key you can read all the information from the server but not write data sets, tasks, flows, and runs to the server. The key allows you to emulate uploading to the server but does not actually store any data. To write data to the server, you need a personal API key. How to obtain a key is shown in the configuration section.
Important: Please do not write meaningless data to the server such as copies of already existing data sets, tasks, or runs (such as the ones from this tutorial)! One instance of the Iris data set should be enough for everyone. :D
This section shows how to download a task from the server, print some information about it to the console, and produce a run that is then uploaded to the server. For detailed information on OpenML terminology (task, run, etc.) see the OpenML guide.
library("OpenML")
## temporarily set API key to read only key
setOMLConfig(apikey = "c1994bdb7ecb3c6f3c8f3b35f4b47f1f")
## OpenML configuration:
##   server           : https://www.openml.org/api/v1
##   cachedir         : C:\Users\GDaddy\AppData\Local\Temp\RtmpUzg5GZ/working_dir\RtmpMfS7jw/cache
##   verbosity        : 0
##   arff.reader      : farff
##   confirm.upload   : TRUE
##   apikey           : ***************************47f1f
# download a task (whose ID is 1L)
task = getOMLTask(task.id = 1L)
## Warning in getOMLDataSetById(data.id = data.id, cache.only = cache.only, : Data set has been deactivated.
task
##
## OpenML Task 1 :: (Data ID = 1)
##   Task Type            : Supervised Classification
##   Data Set             : anneal :: (Version = 2, OpenML ID = 1)
##   Target Feature(s)    : class
##   Tags                 : basic, study_1, study_41, study_7, study_73, study_89, test-tagging, testtag,...
##   Estimation Procedure : Stratified crossvalidation (1 x 10 folds)
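##   Evaluation Measure(s): predictive_accuracy
As the printout shows, the task contains information on the task type, the data set, the target feature(s), tags, the estimation procedure, and the evaluation measure(s).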
In the next line, randomForest is used as a classifier
and run with the help of the mlr package. Note
that one needs to run the algorithm locally and that mlr
will automatically load the package that is needed to run the specified
classifier.
# define the classifier (usually called "flow" within OpenML)
library("mlr")
lrn = makeLearner("classif.randomForest")
# upload the new flow (with information about the algorithm and settings);
# if this algorithm already exists on the server, one will receive a message
# with the ID of the existing flow
flow.id = uploadOMLFlow(lrn)
# the last step is to perform a run and upload the results
run.mlr = runTaskMlr(task, lrn)
run.id = uploadOMLRun(run.mlr)
Following this very brief example, we will explain the individual steps of the OpenML package in more detail in the next sections.
Interacting with the OpenML server requires an API key. For
demonstration purposes, we have created a public read-only
API key ("c1994bdb7ecb3c6f3c8f3b35f4b47f1f"), which
will be used in this tutorial to make the examples executable. However,
for full-fledged usage of the OpenML package, you need
your personal API key.
To receive your own API key, you need to register on the OpenML website.
You can set your own OpenML configuration either just temporarily for
the current R session via setOMLConfig or permanently via
saveOMLConfig. In order to create a permanent configuration
file using default values and at the same time setting your personal API
key, run
saveOMLConfig(apikey = "c1994bdb7ecb3c6f3c8f3b35f4b47f1f")
where "c1994bdb7ecb3c6f3c8f3b35f4b47f1f" should be replaced with your personal API key. Note that anybody who has access to your computer can read the configuration file and thus see your API key. With your API key, other users have full access to your account via the API, so please handle it with care!
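One way to keep the key out of scripts that you share is to read it from an environment variable at run time. The following is only a minimal sketch and not part of the package: the variable name OPENML_APIKEY and the helper getApiKey are assumptions made for illustration; only setOMLConfig comes from the text above.

```r
# Hypothetical helper: read the API key from an environment variable.
# The name OPENML_APIKEY is an assumption, not an OpenML convention.
getApiKey = function(var = "OPENML_APIKEY") {
  key = Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.")
  }
  key
}

# Usage (requires the OpenML package and the variable to be set):
# setOMLConfig(apikey = getApiKey())
```

With this approach the key lives in the environment of the R session instead of in the script itself; the caveat about other users of the same machine still applies to wherever the variable is defined.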
It is also possible to manually create a file
~/.openml/config in your home directory – you can use the R
command path.expand("~/.openml/config") to get the full
path to the configuration file on the operating system. The
config file consists of key = value pairs,
note that the values are not quoted. An exemplary minimal
config file might look as follows:
apikey=c1994bdb7ecb3c6f3c8f3b35f4b47f1f
The config file may contain the following information:
- server: https://www.openml.org/api/v1
- cachedir: file.path(tempdir(), "cache")
- verbosity:
  - 0: normal output
  - 1: info output (default)
  - 2: debug output
- arff.reader:
  - RWeka: this is the standard Java parser used in Weka
  - farff: the farff package provides a newer, faster parser without any Java requirements
- confirm.upload: when disabled (FALSE), one does not need to confirm the upload decision
- apikey: the API key used to access the server
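To make the key = value layout concrete, here is a small sketch that writes an exemplary config to a temporary file and reads it back with base R. The helper readKeyValueFile is hypothetical and written only for this illustration; it is not how the package itself parses the file, and the temporary file stands in for ~/.openml/config so that nothing in your home directory is touched.

```r
# Write an exemplary config to a temporary file (not to ~/.openml/config).
cfg.file = tempfile(fileext = ".config")
writeLines(c(
  "apikey=c1994bdb7ecb3c6f3c8f3b35f4b47f1f",
  "verbosity = 1",
  "arff.reader=farff"
), cfg.file)

# Hypothetical helper: parse unquoted key = value lines into a named list.
readKeyValueFile = function(path) {
  lines = trimws(readLines(path))
  lines = lines[nzchar(lines)]                 # drop empty lines
  keys = trimws(sub("=.*$", "", lines))        # text before the first "="
  values = trimws(sub("^[^=]*=", "", lines))   # text after the first "="
  setNames(as.list(values), keys)
}

cfg = readKeyValueFile(cfg.file)
cfg$arff.reader
```

All values come back as character strings, which matches the note above that values in the file are not quoted.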
If you manually modify the config file, you need to
reload the modified config file to the current R session
using loadOMLConfig(). You can query the current
configuration using
getOMLConfig()
## OpenML configuration:
##   server           : https://www.openml.org/api/v1
##   cachedir         : C:\Users\GDaddy\AppData\Local\Temp\RtmpUzg5GZ/working_dir\RtmpMfS7jw/cache
##   verbosity        : 0
##   arff.reader      : farff
##   confirm.upload   : TRUE
##   apikey           : ***************************47f1f
The configuration file and some related things are also explained in the OpenML Wiki.
Once the config file is set up, you are ready to go!
In this stage, we want to list basic information about the various OpenML objects: data sets, tasks, flows, runs, run results, evaluation measures, and task types.
For each of these objects, we have a function to query the
information, beginning with listOML. All of these functions
return a data.frame, even in case the result consists of a
single column or has zero observations (i.e., rows).
Note that the listOML* functions only list information
on the corresponding objects – they do not download the respective
objects. Information on actually downloading specific objects is covered
in the next section.
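Because the result is always a data.frame, downstream code can rely on nrow() and column access without special cases. The following sketch uses a small hand-made stand-in for a listing; the object listing and its three rows are typed in for illustration and are not fetched from the server.

```r
# Hand-made stand-in for a listing (assumption: not downloaded from OpenML).
listing = data.frame(
  data.id = c(2L, 3L, 4L),
  name = c("anneal", "kr-vs-kp", "labor"),
  stringsAsFactors = FALSE
)

# A query without matches still yields a data.frame, just with zero rows.
no.match = subset(listing, name == "iris")
is.data.frame(no.match)
nrow(no.match)

# Keeping a single column as a data.frame requires drop = FALSE in base R.
ids = listing[, "data.id", drop = FALSE]
is.data.frame(ids)
```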
To browse the OpenML database for appropriate data sets, you can use
listOMLDataSets() in order to get basic data
characteristics (number of features, instances, classes, missing values,
etc.) for each data set. By default, listOMLDataSets()
returns only data sets that have an active status on OpenML:
datasets = listOMLDataSets()  # returns active data sets
The resulting data.frame contains the following information for each of the listed data sets:
- the data set ID data.id
- the status ("active", "in_preparation" or "deactivated") of the data set
- the name of the data set
- further data characteristics (e.g., majority.class.size)
str(datasets)
## 'data.frame':    4390 obs. of  16 variables:
##  $ data.id                                : int  2 3 4 5 6 7 8 9 10 11 ...
##  $ name                                   : chr  "anneal" "kr-vs-kp" "labor" "arrhythmia" ...
##  $ version                                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ status                                 : chr  "active" "active" "active" "active" ...
##  $ format                                 : chr  "ARFF" "ARFF" "ARFF" "ARFF" ...
##  $ tags                                   : chr  "" "" "" "" ...
##  $ majority.class.size                    : int  684 1669 37 245 813 57 NA 67 81 288 ...
##  $ max.nominal.att.distinct.values        : int  7 3 3 13 26 24 NA 22 8 3 ...
##  $ minority.class.size                    : int  8 1527 20 2 734 1 NA 3 2 49 ...
##  $ number.of.classes                      : int  5 2 2 13 26 24 0 6 4 3 ...
##  $ number.of.features                     : int  39 37 17 280 17 70 6 26 19 5 ...
##  $ number.of.instances                    : int  898 3196 57 452 20000 226 345 205 148 625 ...
##  $ number.of.instances.with.missing.values: int  898 0 56 384 0 222 0 46 0 0 ...
##  $ number.of.missing.values               : int  22175 0 326 408 0 317 0 59 0 0 ...
##  $ number.of.numeric.features             : int  6 0 8 206 16 0 6 15 3 4 ...
##  $ number.of.symbolic.features            : int  33 37 9 74 1 70 0 11 16 1 ...
head(datasets[, 1:5])
##   data.id       name version status format
## 1       2     anneal       1 active   ARFF
## 2       3   kr-vs-kp       1 active   ARFF
## 3       4      labor       1 active   ARFF
## 4       5 arrhythmia       1 active   ARFF
## 5       6     letter       1 active   ARFF
## 6       7  audiology       1 active   ARFF
To find a specific data set, you can now query the resulting datasets object. Suppose we want to find the iris data set.
subset(datasets, name == "iris")
##      data.id name version status format tags majority.class.size max.nominal.att.distinct.values
## 53        61 iris       1 active   ARFF                       50                               3
## 812      969 iris       3 active   ARFF                      100                               2
## 2602   41510 iris       9 active   ARFF                       NA                               3
## 2603   41511 iris      10 active   ARFF                       50                               3
## 2636   41567 iris      11 active   ARFF                       NA                               3
## 2637   41568 iris      12 active   ARFF                       50                               3
## 2638   41582 iris      13 active   ARFF                       NA                               3
## 2639   41583 iris      14 active   ARFF                       50                               3
## 2889   41996 iris      15 active   ARFF                       NA                               3
## 2890   41997 iris      16 active   ARFF                       50                               3
## 2892   42002 iris      17 active   ARFF                       NA                               3
## 2893   42003 iris      18 active   ARFF                       50                               3
## 2896   42010 iris      19 active   ARFF                       NA                               3
## 2897   42011 iris      20 active   ARFF                       50                               3
## 2898   42015 iris      21 active   ARFF                       NA                               3
## 2899   42016 iris      22 active   ARFF                       50                               3
## 2900   42020 iris      23 active   ARFF                       NA                               3
## 2901   42021 iris      24 active   ARFF                       50                               3
## 2902   42025 iris      25 active   ARFF                       NA                               3
## 2903   42026 iris      26 active   ARFF                       50                               3
## 2904   42030 iris      27 active   ARFF                       NA                               3
## 2905   42031 iris      28 active   ARFF                       50                               3
## 2906   42035 iris      29 active   ARFF                       NA                               3
## 2907   42036 iris      30 active   ARFF                       50                               3
## 2908   42040 iris      31 active   ARFF                       NA                               3
## 2909   42041 iris      32 active   ARFF                       50                               3
## 2910   42045 iris      33 active   ARFF                       NA                               3
## 2911   42046 iris      34 active   ARFF                       50                               3
## 2912   42050 iris      35 active   ARFF                       NA                               3
## 2913   42051 iris      36 active   ARFF                       50                               3
## 2914   42055 iris      37 active   ARFF                       NA                               3
## 2915   42056 iris      38 active   ARFF                       50                               3
## 2920   42065 iris      39 active   ARFF                       NA                               3
## 2921   42066 iris      40 active   ARFF                       50                               3
## 2922   42070 iris      41 active   ARFF                       NA                               3
## 2923   42071 iris      42 active   ARFF                       50                               3
## 2934   42091 iris      43 active   ARFF                       NA                               3
## 2937   42097 iris      44 active   ARFF                       NA                               3
## 2938   42098 iris      45 active   ARFF                       50                               3
## 3190   42661 iris      46 active   arff                       NA                              NA
## 3206   42699 iris      47 active   ARFF                       NA                              NA
## 3207   42700 iris      48 active   ARFF                       50                              NA
## 3292   42851 iris      49 active   ARFF                       NA                              NA
## 3309   42871 iris      50 active   ARFF                       NA                              NA
##      minority.class.size number.of.classes number.of.features number.of.instances
## 53                    50                 3                  5                 150
## 812                   50                 2                  5                 150
## 2602                  NA                NA                  5                 150
## 2603                  50                 3                  5                 150
## 2636                  NA                NA                  5                 150
## 2637                  50                 3                  5                 150
## 2638                  NA                NA                  5                 150
## 2639                  50                 3                  5                 150
## 2889                  NA                NA                  5                 150
## 2890                  50                 3                  5                 150
## 2892                  NA                NA                  5                 150
## 2893                  50                 3                  5                 150
## 2896                  NA                NA                  5                 150
## 2897                  50                 3                  5                 150
## 2898                  NA                NA                  5                 150
## 2899                  50                 3                  5                 150
## 2900                  NA                NA                  5                 150
## 2901                  50                 3                  5                 150
## 2902                  NA                NA                  5                 150
## 2903                  50                 3                  5                 150
## 2904                  NA                NA                  5                 150
## 2905                  50                 3                  5                 150
## 2906                  NA                NA                  5                 150
## 2907                  50                 3                  5                 150
## 2908                  NA                NA                  5                 150
## 2909                  50                 3                  5                 150
## 2910                  NA                NA                  5                 150
## 2911                  50                 3                  5                 150
## 2912                  NA                NA                  5                 150
## 2913                  50                 3                  5                 150
## 2914                  NA                NA                  5                 150
## 2915                  50                 3                  5                 150
## 2920                  NA                NA                  5                 150
## 2921                  50                 3                  5                 150
## 2922                  NA                NA                  5                 150
## 2923                  50                 3                  5                 150
## 2934                  NA                NA                  5                 150
## 2937                  NA                NA                  5                 150
## 2938                  50                 3                  5                 150
## 3190                  NA                NA                  5                 150
## 3206                  NA                NA                  5                 150
## 3207                  50                 3                  5                 150
## 3292                  NA                NA                  7                 150
## 3309                  NA                NA                  7                 150
##      number.of.instances.with.missing.values number.of.missing.values number.of.numeric.features
## 53                                         0                        0                          4
## 812                                        0                        0                          4
## 2602                                       0                        0                          4
## 2603                                       0                        0                          4
## 2636                                       0                        0                          4
## 2637                                       0                        0                          4
## 2638                                       0                        0                          4
## 2639                                       0                        0                          4
## 2889                                       0                        0                          4
## 2890                                       0                        0                          4
## 2892                                       0                        0                          4
## 2893                                       0                        0                          4
## 2896                                       0                        0                          4
## 2897                                       0                        0                          4
## 2898                                       0                        0                          4
## 2899                                       0                        0                          4
## 2900                                       0                        0                          4
## 2901                                       0                        0                          4
## 2902                                       0                        0                          4
## 2903                                       0                        0                          4
## 2904                                       0                        0                          4
## 2905                                       0                        0                          4
## 2906                                       0                        0                          4
## 2907                                       0                        0                          4
## 2908                                       0                        0                          4
## 2909                                       0                        0                          4
## 2910                                       0                        0                          4
## 2911                                       0                        0                          4
## 2912                                       0                        0                          4
## 2913                                       0                        0                          4
## 2914                                       0                        0                          4
## 2915                                       0                        0                          4
## 2920                                       0                        0                          4
## 2921                                       0                        0                          4
## 2922                                       0                        0                          4
## 2923                                       0                        0                          4
## 2934                                       0                        0                          4
## 2937                                       0                        0                          4
## 2938                                       0                        0                          4
## 3190                                       0                        0                          4
## 3206                                       0                        0                          4
## 3207                                       0                        0                          4
## 3292                                       0                        0                          4
## 3309                                       0                        0                          4
##      number.of.symbolic.features
## 53                             1
## 812                            1
## 2602                           1
## 2603                           1
## 2636                           1
## 2637                           1
## 2638                           1
## 2639                           1
## 2889                           1
## 2890                           1
## 2892                           1
## 2893                           1
## 2896                           1
## 2897                           1
## 2898                           1
## 2899                           1
## 2900                           1
## 2901                           1
## 2902                           1
## 2903                           1
## 2904                           1
## 2905                           1
## 2906                           1
## 2907                           1
## 2908                           1
## 2909                           1
## 2910                           1
## 2911                           1
## 2912                           1
## 2913                           1
## 2914                           1
## 2915                           1
## 2920                           1
## 2921                           1
## 2922                           1
## 2923                           1
## 2934                           1
## 2937                           1
## 2938                           1
## 3190                           0
## 3206                           1
## 3207                           1
## 3292                           3
## 3309                           3
As you can see, there are many data sets called iris. We want to use the original data set with three classes, which is stored under the data set ID (data.id) 61. You can also have a closer look at the data set on the corresponding OpenML web page (https://www.openml.org/d/61).
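Picking one entry out of several equally named data sets can be sketched with plain data-frame operations. The example below uses four rows hand-copied from the listing above (the object iris.sets is an illustrative stand-in, not a server result) and assumes that the lowest version number among the three-class entries identifies the original upload.

```r
# Four of the iris entries from the listing, typed in by hand.
iris.sets = data.frame(
  data.id = c(61L, 969L, 41510L, 41511L),
  name = "iris",
  version = c(1L, 3L, 9L, 10L),
  number.of.classes = c(3L, 2L, NA, 3L),
  stringsAsFactors = FALSE
)

# subset() drops rows where the condition is NA, so entries without a
# recorded number of classes are excluded automatically.
three.class = subset(iris.sets, number.of.classes == 3L)

# Assumption: the earliest version is the original data set.
original = three.class[which.min(three.class$version), ]
original$data.id
```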
Each OpenML task is a bundle that encapsulates information on various objects:
- the data set
- the task type, e.g., "Supervised Classification" or "Supervised Regression"
- the target feature
- the estimation procedure
- the evaluation measure, e.g., "predictive accuracy" for a classification task
Listing the tasks can be done via
tasks = listOMLTasks()
The resulting data.frame contains for each of the listed tasks information on:
- the task.id
- the task.type
- the target.feature
- tags, which can be used for labelling the task
- the estimation.procedure (aka resampling strategy)
- the evaluation.measures used for measuring the performance of the learner / flow on the task
str(tasks)
## 'data.frame':    5000 obs. of  25 variables:
##  $ task.id                                : int  2 3 4 5 6 7 8 9 10 11 ...
##  $ task.type                              : chr  "Supervised Classification" "Supervised Classification" "Supervised Classification" "Supervised Classification" ...
##  $ data.id                                : int  2 3 4 5 6 7 8 9 10 11 ...
##  $ name                                   : chr  "anneal" "kr-vs-kp" "labor" "arrhythmia" ...
##  $ status                                 : chr  "active" "active" "active" "active" ...
##  $ format                                 : chr  "ARFF" "ARFF" "ARFF" "ARFF" ...
##  $ estimation.procedure                   : chr  "10-fold Crossvalidation" "10-fold Crossvalidation" "10-fold Crossvalidation" "10-fold Crossvalidation" ...
##  $ evaluation.measures                    : chr  "predictive_accuracy" NA "predictive_accuracy" "predictive_accuracy" ...
##  $ target.feature                         : chr  "class" "class" "class" "class" ...
##  $ cost.matrix                            : chr  NA NA NA NA ...
##  $ source.data.labeled                    : chr  NA NA NA NA ...
##  $ target.feature.event                   : chr  NA NA NA NA ...
##  $ target.feature.left                    : chr  NA NA NA NA ...
##  $ target.feature.right                   : chr  NA NA NA NA ...
##  $ quality.measure                        : chr  NA NA NA NA ...
##  $ majority.class.size                    : int  684 1669 37 245 813 57 NA 67 81 288 ...
##  $ max.nominal.att.distinct.values        : int  7 3 3 13 26 24 NA 22 8 3 ...
##  $ minority.class.size                    : int  8 1527 20 2 734 1 NA 3 2 49 ...
##  $ number.of.classes                      : int  5 2 2 13 26 24 0 6 4 3 ...
##  $ number.of.features                     : int  39 37 17 280 17 70 6 26 19 5 ...
##  $ number.of.instances                    : int  898 3196 57 452 20000 226 345 205 148 625 ...
##  $ number.of.instances.with.missing.values: int  898 0 56 384 0 222 0 46 0 0 ...
##  $ number.of.missing.values               : int  22175 0 326 408 0 317 0 59 0 0 ...
##  $ number.of.numeric.features             : int  6 0 8 206 16 0 6 15 3 4 ...
##  $ number.of.symbolic.features            : int  33 37 9 74 1 70 0 11 16 1 ...
For some data sets, there may be more than one task available on the OpenML server. For example, one can look for "Supervised Classification" tasks that are available for data set 61 via
head(subset(tasks, task.type == "Supervised Classification" & data.id == 61L)[, 1:5])
##      task.id                 task.type data.id name status
## 51        59 Supervised Classification      61 iris active
## 263      289 Supervised Classification      61 iris active
## 428     1823 Supervised Classification      61 iris active
## 535     1939 Supervised Classification      61 iris active
## 580     1992 Supervised Classification      61 iris active
## 3300    7306 Supervised Classification      61 iris active
A flow is the definition and implementation of a specific algorithm workflow or script, i.e., a flow is essentially the code / implementation of the algorithm.
flows = listOMLFlows()
str(flows)
## 'data.frame':    16365 obs. of  6 variables:
##  $ flow.id         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ full.name       : chr  "openml.evaluation.EuclideanDistance(1.0)" "openml.evaluation.PolynomialKernel(1.0)" "openml.evaluation.RBFKernel(1.0)" "openml.evaluation.area_under_roc_curve(1.0)" ...
##  $ name            : chr  "openml.evaluation.EuclideanDistance" "openml.evaluation.PolynomialKernel" "openml.evaluation.RBFKernel" "openml.evaluation.area_under_roc_curve" ...
##  $ version         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ external.version: chr  "" "" "" "" ...
##  $ uploader        : int  1 1 1 1 1 1 1 1 1 1 ...
flows[56:63, 1:4]
##    flow.id             full.name               name version
## 56      56         weka.ZeroR(1)         weka.ZeroR       1
## 57      57          weka.OneR(1)          weka.OneR       1
## 58      58    weka.NaiveBayes(1)    weka.NaiveBayes       1
## 59      59          weka.JRip(1)          weka.JRip       1
## 60      60           weka.J48(1)           weka.J48       1
## 61      61       weka.REPTree(1)       weka.REPTree       1
## 62      62 weka.DecisionStump(1) weka.DecisionStump       1
## 63      63 weka.HoeffdingTree(1) weka.HoeffdingTree       1
A run is an experiment, which is executed on a given combination of
task, flow and setup (i.e., the explicit parameter configuration of a
flow). The corresponding results are stored as a run result. Both
objects, i.e., runs and run results, can be listed via
listOMLRuns or listOMLRunEvaluations,
respectively. As each of those objects is defined with a task, setup and
flow, you can extract runs and run results with specific combinations of
task.id, setup.id and/or flow.id.
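Once run results are available as a data.frame, comparing runs is ordinary data-frame work. As a sketch, the following ranks a few runs by predictive accuracy. The four rows are typed in by hand in the shape returned by listOMLRunEvaluations rather than fetched from the server, and the object names results and ranked are chosen for illustration.

```r
# Hand-made stand-in for a run evaluation listing (not a server result).
results = data.frame(
  run.id = c(81L, 161L, 234L, 447L),
  flow.id = c(67L, 70L, 56L, 61L),
  flow.name = c("weka.BayesNet_K2(1)", "weka.SMO_PolyKernel(1)",
    "weka.ZeroR(1)", "weka.REPTree(1)"),
  predictive.accuracy = c(0.94, 0.96, 0.333, 0.927),
  stringsAsFactors = FALSE
)

# Sort from best to worst predictive accuracy.
ranked = results[order(results$predictive.accuracy, decreasing = TRUE), ]
ranked[, c("run.id", "flow.name", "predictive.accuracy")]

# Name of the flow behind the best run in this small sample.
best.flow = ranked$flow.name[1]
```

The same pattern applies to any other numeric measure column in the listing, as long as missing values are handled first.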
For instance, listing all runs for task 59 (supervised
classification on iris) can be done with
runs = listOMLRuns(task.id = 59L)  # must be specified with the task, setup and/or implementation ID
head(runs)
##   run.id task.id setup.id flow.id uploader error.message
## 1     81      59       12      67        1          <NA>
## 2    161      59       13      70        1          <NA>
## 3    234      59        1      56        1          <NA>
## 4    447      59        6      61        1          <NA>
## 5    473      59       18      77        1          <NA>
## 6    491      59        7      62        1          <NA>
# one of the IDs (here: task.id) must be supplied
run.results = listOMLRunEvaluations(task.id = 59L)
str(run.results)
## 'data.frame':    4283 obs. of  35 variables:
##  $ run.id                       : int  81 161 234 447 473 491 550 6088 6157 6158 ...
##  $ task.id                      : int  59 59 59 59 59 59 59 59 59 59 ...
##  $ setup.id                     : int  12 13 1 6 18 7 16 11 12 3 ...
##  $ flow.id                      : int  67 70 56 61 77 62 75 66 67 58 ...
##  $ flow.name                    : chr  "weka.BayesNet_K2(1)" "weka.SMO_PolyKernel(1)" "weka.ZeroR(1)" "weka.REPTree(1)" ...
##  $ flow.version                 : chr  "1" "1" "1" "1" ...
##  $ flow.source                  : chr  "weka" "weka" "weka" "weka" ...
##  $ learner.name                 : chr  "BayesNet_K2" "SMO_PolyKernel" "ZeroR" "REPTree" ...
##  $ data.name                    : chr  "iris" "iris" "iris" "iris" ...
##  $ upload.time                  : chr  "2014-04-07 00:05:11" "2014-04-07 00:55:32" "2014-04-07 01:33:24" "2014-04-07 06:26:27" ...
##  $ area.under.roc.curve         : num  0.983 0.977 0.5 0.967 0.978 ...
##  $ average.cost                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ build.cpu.time               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ build.memory                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ f.measure                    : num  0.94 0.96 0.167 0.927 0.947 ...
##  $ kappa                        : num  0.91 0.94 0 0.89 0.92 0.5 0.95 0.93 0.91 0.93 ...
##  $ kb.relative.information.score: num  1.39e+02 9.09e+01 -6.80e-05 1.31e+02 1.38e+02 ...
##  $ mean.absolute.error          : num  0.0384 0.2311 0.4444 0.0671 0.0392 ...
##  $ mean.prior.absolute.error    : num  0.444 0.444 0.444 0.444 0.444 ...
##  $ number.of.instances          : num  150 150 150 150 150 150 150 150 150 150 ...
##  $ precision                    : num  0.94 0.96 0.111 0.927 0.947 ...
##  $ predictive.accuracy          : num  0.94 0.96 0.333 0.927 0.947 ...
##  $ prior.entropy                : num  1.58 1.58 1.58 1.58 1.58 ...
##  $ recall                       : num  0.94 0.96 0.333 0.927 0.947 ...
##  $ relative.absolute.error      : num  0.0863 0.52 1 0.151 0.0881 ...
##  $ root.mean.prior.squared.error: num  0.471 0.471 0.471 0.471 0.471 ...
##  $ root.mean.squared.error      : num  0.16 0.288 0.471 0.211 0.178 ...
##  $ root.relative.squared.error  : num  0.339 0.611 1 0.447 0.377 ...
##  $ scimark.benchmark            : num  1981 1980 2011 1887 1998 ...
##  $ total.cost                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ unweighted.recall            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ usercpu.time.millis          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ usercpu.time.millis.testing  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ usercpu.time.millis.training : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weighted.recall              : num  NA NA NA NA NA NA NA NA NA NA ...
Analogously to the previous listings, one can list further objects simply by calling the respective functions.
listOMLDataSetQualities()
listOMLEstimationProcedures()
listOMLEvaluationMeasures()
listOMLTaskTypes()
Users can download data sets, tasks, flows and runs from the OpenML server. The package provides special representations for each object, which will be discussed here.
To download a data set directly, e.g., to run a few
preliminary experiments, use the function
getOMLDataSet. The function accepts a data set ID as input
and returns the corresponding OMLDataSet:
iris.data = getOMLDataSet(data.id = 61L)  # the iris data set has the data set ID 61
The following call returns an OpenML task object for a supervised classification task on the iris data:
task = getOMLTask(task.id = 59L)
task
## 
## OpenML Task 59 :: (Data ID = 61)
##   Task Type            : Supervised Classification
##   Data Set             : iris :: (Version = 1, OpenML ID = 61)
##   Target Feature(s)    : class
##   Tags                 : basic, study_1, study_41, study_50, study_7, study_89, testsuite, under100k, ...
##   Estimation Procedure : Stratified crossvalidation (1 x 10 folds)
##   Evaluation Measure(s): predictive_accuracy
The corresponding "OMLDataSet" object can be accessed by
task$input$data.set
## 
## Data Set 'iris' :: (Version = 1, OpenML ID = 61)
##   Collection Date         : 1936
##   Creator(s)              : R.A. Fisher
##   Default Target Attribute: class
and the type of the task can be shown with the next line
task$task.type
## [1] "Supervised Classification"
Also, it is possible to extract the data set itself via
iris.data = task$input$data.set$data
head(iris.data)
##   sepallength sepalwidth petallength petalwidth       class
## 0         5.1        3.5         1.4        0.2 Iris-setosa
## 1         4.9        3.0         1.4        0.2 Iris-setosa
## 2         4.7        3.2         1.3        0.2 Iris-setosa
## 3         4.6        3.1         1.5        0.2 Iris-setosa
## 4         5.0        3.6         1.4        0.2 Iris-setosa
## 5         5.4        3.9         1.7        0.4 Iris-setosa
Aside from tasks and data sets, one can also download flows by
calling getOMLFlow with the specific flow.id:
flow = getOMLFlow(flow.id = 2700L)
flow
## 
## Flow 'classif.randomForest' :: (Version = 47, Flow ID = 2700)
##  External Version         : R_3.1.2-734b029d
##  Dependencies             : mlr_2.9, randomForest_4.6.12
##  Number of Flow Parameters: 16
##  Number of Flow Components: 0
To download the results of one run, including all server- and user-computed
metrics, you have to specify the corresponding run ID. For all
runs related to a task, the ID can be
extracted from the runs object, which was created in the
previous section. Here we use a run of task 59, which has the
run.id 524027. Single OpenML runs can be downloaded with
the function getOMLRun:
task.list = listOMLRuns(task.id = 59L)
task.list[281:285, ]
##      run.id task.id setup.id flow.id uploader error.message
## 281 7244063      59  5275959    6952        1          <NA>
## 282 7245683      59  5277579    6952        1          <NA>
## 283 7245684      59  5277580    6952        1          <NA>
## 284 7245686      59  5277582    6952        1          <NA>
## 285 7245687      59  5277583    6952        1          <NA>
run = getOMLRun(run.id = 524027L)
run
## 
## OpenML Run 524027 :: (Task ID = 59, Flow ID = 2393)
##  User ID  : 970
##  Learner  : classif.randomForest(43)
##  Task type: Supervised Classification
Each OMLRun object is a list that stores
additional information on the run. For instance, the flow of the
previously downloaded run has some non-default settings for
hyperparameters, which can be obtained by:
run$parameter.setting  # retrieve the list of parameter settings
## $seed
##  (parameter of component 2393) seed = 1
## 
## $kind
##  (parameter of component 2393) kind = Mersenne-Twister
## 
## $normal.kind
##  (parameter of component 2393) normal.kind = Inversion
If the underlying flow has hyperparameters that differ from the default values of the corresponding learner, they are also shown; otherwise the default hyperparameters are used (but not explicitly listed).
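The relation between listed and unlisted hyperparameters can be illustrated with a minimal base R sketch. It does not use the OpenML package; the parameter names and values below are hypothetical stand-ins, chosen only to show how explicitly stored settings override defaults.

```r
# Hypothetical learner defaults (illustrative values, not read from OpenML)
defaults = list(ntree = 500, nodesize = 1, replace = TRUE)

# Hypothetical non-default settings, i.e., what a run would list explicitly
non.default = list(ntree = 100)

# Effective settings: stored values override defaults, the rest stay unchanged
effective = modifyList(defaults, non.default)
str(effective)
```

Only ntree would appear in such a listing; nodesize and replace keep their default values without being shown.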
All the data that served as input for the run, including data set IDs
and the URL to the data, is stored in input.data:
run$input.data
## 
## ** Data Sets **
##   data.id name                                                          url
## 1      61 iris https://www.openml.org/data/download/61/dataset_61_iris.arff
## 
## ** Files **
## data frame with 0 columns and 0 rows
## 
## ** Evaluations **
## data frame with 0 columns and 0 rows
Predictions made by an uploaded run are stored within the
predictions element and can be retrieved via
head(run$predictions, 10)
##    repeat fold row_id      prediction           truth confidence.Iris-setosa confidence.Iris-versicolor
## 1       0    0     43     Iris-setosa     Iris-setosa                      1                          0
## 2       0    0     14     Iris-setosa     Iris-setosa                      1                          0
## 3       0    0     37     Iris-setosa     Iris-setosa                      1                          0
## 4       0    0     23     Iris-setosa     Iris-setosa                      1                          0
## 5       0    0     10     Iris-setosa     Iris-setosa                      1                          0
## 6       0    0     99 Iris-versicolor Iris-versicolor                      0                          1
## 7       0    0     87 Iris-versicolor Iris-versicolor                      0                          1
## 8       0    0     97 Iris-versicolor Iris-versicolor                      0                          1
## 9       0    0     62 Iris-versicolor Iris-versicolor                      0                          1
## 10      0    0     92 Iris-versicolor Iris-versicolor                      0                          1
##    confidence.Iris-virginica
## 1                          0
## 2                          0
## 3                          0
## 4                          0
## 5                          0
## 6                          0
## 7                          0
## 8                          0
## 9                          0
## 10                         0
The output above shows predictions, ground truth information about classes and task-specific information, e.g., about the confidence of a classifier (for every observation) or in which fold a data point has been placed.
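Because each row carries its fold, its prediction and the true class, per-fold accuracy can be recomputed locally. The following sketch uses a small hand-made data frame as a stand-in for run$predictions. Only the column names fold, prediction and truth are taken from the output above; the rows themselves are made up for illustration.

```r
# Toy stand-in for run$predictions (made-up rows, column names as in the output)
preds = data.frame(
  fold = c(0, 0, 1, 1),
  prediction = c("Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"),
  truth = c("Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-setosa"),
  stringsAsFactors = FALSE
)

# Per-fold accuracy: share of rows in each fold where prediction equals truth
acc.per.fold = tapply(preds$prediction == preds$truth, preds$fold, mean)
acc.per.fold

# Overall accuracy across all rows
acc.overall = mean(preds$prediction == preds$truth)
acc.overall
```

With a downloaded run, replacing preds by run$predictions should work the same way, assuming the columns are named as shown above.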
The modularized structure of OpenML allows one to apply the implementation of an algorithm to a specific task, and there are multiple ways to do this.
If one is working with mlr, one can
specify an RLearner object and use the function
runTaskMlr to create the desired "OMLMlrRun"
object. The task is created the same way as in the previous
sections:
task = getOMLTask(task.id = 59L)
library("mlr")
lrn = makeLearner("classif.rpart")
run.mlr = runTaskMlr(task, lrn)
run.mlr
## $run
## 
## OpenML Run NA :: (Task ID = 59, Flow ID = NA)
## 
## $bmr
##   task.id    learner.id acc.test.join timetrain.test.sum timepredict.test.sum
## 1    iris classif.rpart          0.94               0.01                 0.03
## 
## $flow
## 
## Flow 'mlr.classif.rpart' :: (Version = NA, Flow ID = NA)
##  External Version         : R_4.2.1-v2.4b8be4e0
##  Dependencies             : R_4.2.1, OpenML_1.12, mlr_2.19.0, rpart_4.1.16
##  Number of Flow Parameters: 14
##  Number of Flow Components: 0
## 
## attr(,"class")
## [1] "OMLMlrRun"
Note that locally created runs don’t have a run ID or flow ID yet. These are assigned by the OpenML server after uploading the run.
If you are not using mlr, you will have to invest some
more time and effort to get things done since this is not supported yet.
So, unless you have good reasons to do otherwise, we strongly encourage
you to use mlr. If the algorithm you want to use is not
integrated in mlr yet, you can integrate it yourself (see
the tutorial)
or open an issue on the mlr
GitHub repository and hope someone else will do it for you.
The following section gives an overview of how one can contribute building blocks (i.e., data sets, flows and runs) to the OpenML server.
A data set contains information that can be stored on OpenML and used by OpenML tasks and runs. This example shows how a very simple data set can be taken from R, converted to an OpenML data set and afterwards uploaded to the server. The corresponding workflow consists of the following three steps:
(1) makeOMLDataSetDescription: create the description object of an OpenML data set
(2) makeOMLDataSet: convert the data set into an OpenML data set
(3) uploadOMLDataSet: upload the data set to the server
data("airquality")
dsc = "Daily air quality measurements in New York, May to September 1973.
  This data is taken from R."
cit = "Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983)
  Graphical Methods for Data Analysis. Belmont, CA: Wadsworth."
## (1) Create the description object
desc = makeOMLDataSetDescription(name = "airquality",
  description = dsc,
  creator = "New York State Department of Conservation (ozone data) and the National
    Weather Service (meteorological data)",
  collection.date = "May 1, 1973 to September 30, 1973",
  language = "English",
  licence = "GPL-2",
  url = "https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html",
  default.target.attribute = "Ozone",
  citation = cit,
  tags = "R")
## (2) Create the OpenML data set
air.data = makeOMLDataSet(desc = desc,
  data = airquality,
  colnames.old = colnames(airquality),
  colnames.new = colnames(airquality),
  target.features = "Ozone")
## (3) Upload the OpenML data set to the server
## Because this is a simple data set that is already available in R,
## please do not actually upload it to the server!
## The code would be:
#dataset.id = uploadOMLDataSet(air.data)
#dataset.id
Alternatively, you can enter data directly on the OpenML website.
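Before creating the description and the OMLDataSet object, it may be worth running a few local sanity checks on the data frame. The sketch below uses only base R and the airquality data from the example above. The particular checks are a suggestion, not something the OpenML package is known to require.

```r
data("airquality")
target = "Ozone"

# The target feature passed to makeOMLDataSet should be a column of the data
stopifnot(target %in% colnames(airquality))

# Dimensions of the data set
dim(airquality)

# Number of missing values per column
n.missing = colSums(is.na(airquality))
n.missing
```

Knowing in advance which columns contain missing values helps when writing the data set description.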
A flow is an implementation of a single
algorithm or a script. Each mlr
learner can be considered an implementation of a flow, which can be
uploaded to the server with the function uploadOMLFlow. If
the flow has already been uploaded to the server (either by you or
someone else), one receives a message that the flow already exists and
the flow.id is returned from the function. Otherwise, the
flow is uploaded and assigned its own flow.id, which the
function returns.
library("mlr")
lrn = makeLearner("classif.randomForest")
flow.id = uploadOMLFlow(lrn)
flow.id
In addition to uploading data sets or flows, one can also upload runs
(which have to be created beforehand, e.g., using mlr):
## choose 2 flows (i.e., mlr-learners)
learners = list(
  makeLearner("classif.kknn"),
  makeLearner("classif.randomForest")
)
## pick 3 arbitrary tasks
task.ids = c(57, 59, 2382)
for (lrn in learners) {
  for (id in task.ids) {
    task = getOMLTask(id)
    res = runTaskMlr(task, lrn)$run
    run.id = uploadOMLRun(res)  # upload results
  }
}
Before your run is uploaded to the server,
uploadOMLRun checks whether the flow that created this run
is already available on the server. If the flow does not exist on the
server, it will (automatically) be uploaded as well.
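The nested loop above creates one run per combination of learner and task. As a rough sketch in base R (no server access; the learner names and task IDs are copied from the example), the combinations can be enumerated up front, which shows how many runs would be created before anything is uploaded.

```r
learner.ids = c("classif.kknn", "classif.randomForest")
task.ids = c(57L, 59L, 2382L)

# One row per (learner, task) pair, i.e., per run the loop would create
grid = expand.grid(learner = learner.ids, task.id = task.ids,
  stringsAsFactors = FALSE)
nrow(grid)
grid
```

Reviewing such a grid first is one way to avoid uploading more runs than intended.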
Now you should have an idea of how to use our package. However, as there is always room for improvement, we are more than happy to receive your feedback. If you have any,
please open an issue in the issue tracker of our GitHub repository.