Using SciKit-Learn

IMPORTANT NOTE: Using SciKit-Learn is currently only possible on Linux. On OS X or Windows, the steps below may work if some form of Linux compatibility layer is available (e.g. Cygwin on Windows), but no instructions for this are provided yet.

SciKit-Learn (http://scikit-learn.org/stable/) is a collection of Machine Learning and other tools, implemented in Python. Because it is implemented in Python, SciKit-Learn cannot be integrated directly into the LearningFramework plugin.

Instead, SciKit-Learn can be used by running the training and application programs externally, in a separate process.

sklearn-wrapper

The sklearn-wrapper (https://github.com/GateNLP/sklearn-wrapper) software is necessary to apply a model or to automatically train a model from inside the LearningFramework.

The following steps are needed to prepare the sklearn-wrapper for use with the LearningFramework:

Installing sklearn-wrapper

Making sure the sklearn-wrapper commands work

From within the directory sklearn-wrapper, run the command:

./bin/sklearnWrapperApply.sh .

If this shows the error message “ERROR: No model path”, sklearn-wrapper should be ready to use.

Telling the LearningFramework how to run sklearn-wrapper commands

The LearningFramework needs to be able to run the sklearn-wrapper commands sklearnWrapperApply.sh and sklearnWrapperTrain.sh in order to use SciKit-Learn. To do so, it needs to know where sklearn-wrapper is installed, i.e. the path of the directory (called sklearn-wrapper by default) that was created when the zip file was extracted during installation or when the git repository was cloned. This is done by setting one of the following to the full path of that directory:

The setting in sklearn.yaml takes precedence over the Java property, which in turn takes precedence over the environment variable.

If any of these is set to a relative path, the LearningFramework will try to interpret it relative to the data directory in use.
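
The precedence and relative-path rules can be illustrated with a small sketch. This is not the LearningFramework's actual code: the function and its parameter names are hypothetical stand-ins for the sklearn.yaml setting, the Java property and the environment variable, and it only mirrors the behaviour described in the text.

```python
import os
from typing import Optional


def resolve_wrapper_dir(
    yaml_value: Optional[str],
    java_property: Optional[str],
    env_value: Optional[str],
    data_dir: str,
) -> Optional[str]:
    """Pick the sklearn-wrapper location from three optional settings.

    Illustrative only: all names here are hypothetical, not part of the
    LearningFramework API.
    """
    # Precedence: sklearn.yaml setting, then Java property, then
    # environment variable. The first non-empty value wins.
    for candidate in (yaml_value, java_property, env_value):
        if candidate:
            # A relative path is interpreted relative to the data directory.
            if not os.path.isabs(candidate):
                candidate = os.path.join(data_dir, candidate)
            return os.path.normpath(candidate)
    # Nothing was configured.
    return None
```

For example, if only the environment variable is set, and it holds a relative path such as tools/sklearn-wrapper, the sketch resolves it against the data directory.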

You should also set the environment variable SKLEARN_WRAPPER_PYTHON to the command used to run Python.

Using Exported Files for Training

When the training data is exported using [[LF_Export]] with the EXPORTER_MATRIXMARKET2_CLASS or EXPORTER_MATRIXMARKET2_REGRESSION exporters, two files are created in the data directory: deps.mtx, which holds the dependent variables (the targets), and indeps.mtx, which holds the independent variables (the features).

Both files are stored in MatrixMarket Coordinate Format (see http://math.nist.gov/MatrixMarket/formats.html).
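
To give an idea of what the coordinate format looks like, here is a minimal sketch of a parser for real-valued coordinate files, written in plain Python. The example matrix is made up for illustration and is not actual exporter output; the sketch also ignores variants of the format such as symmetric or pattern matrices.

```python
from typing import Dict, Tuple


def parse_mm_coordinate(text: str) -> Tuple[Tuple[int, int], Dict[Tuple[int, int], float]]:
    """Parse a real-valued MatrixMarket coordinate file given as a string.

    Returns ((rows, cols), {(row, col): value}) with 0-based indices.
    """
    lines = text.splitlines()
    header = lines[0].lower().split()
    if header[:3] != ["%%matrixmarket", "matrix", "coordinate"]:
        raise ValueError("not a MatrixMarket coordinate file")
    # Skip comment lines (starting with '%') and blank lines.
    body = [ln for ln in lines[1:] if ln.strip() and not ln.startswith("%")]
    # The size line holds: number of rows, number of columns, number of entries.
    rows, cols, nnz = (int(tok) for tok in body[0].split())
    entries = {}
    for ln in body[1 : 1 + nnz]:
        i, j, value = ln.split()
        # The file format uses 1-based indices; convert to 0-based.
        entries[(int(i) - 1, int(j) - 1)] = float(value)
    return (rows, cols), entries


# A made-up 3x4 matrix with three non-zero entries.
example = """%%MatrixMarket matrix coordinate real general
% comment lines start with a percent sign
3 4 3
1 1 1.0
2 3 0.5
3 4 2.0
"""
shape, entries = parse_mm_coordinate(example)
```

Only the non-zero entries are listed in the file, which is why the format suits sparse feature matrices. In practice scipy.io.mmread does this parsing and returns a sparse matrix.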

To import these files for use with SciKit-Learn and train a model, the following Python code can be used:

import scipy.io as sio
from sklearn.svm import SVC

# Paths of the exported files; adjust these to your data directory
depfile = "deps.mtx"
indepfile = "indeps.mtx"
deps = sio.mmread(depfile)
indeps = sio.mmread(indepfile)

# Any sklearn learning algorithm can be used here; SVC is just an example
model = SVC()
# sklearn can use the imported sparse matrix directly for the independent
# variables, but needs the targets as a dense one-dimensional array
targets = deps.toarray().ravel()
model.fit(indeps, targets)
# now store the trained model for later use ...
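
One common way to store a trained model is Python's standard pickle module. The sketch below is only one option, and it is an assumption that it fits your use. The helper names save_model and load_model are hypothetical, and nothing here is claimed to match the file format that sklearn-wrapper itself expects.

```python
import pickle
from typing import Any


def save_model(model: Any, path: str) -> None:
    """Serialize a trained model (or any picklable object) to a file."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path: str) -> Any:
    """Load a model previously written by save_model."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

A model saved this way should be loaded with the same versions of Python and SciKit-Learn that were used to train it, and pickle files should only be loaded from trusted sources.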