Using SciKit-Learn
IMPORTANT NOTE: Using SciKit-Learn is currently only possible on Linux. For OS X or Windows, the steps below may work if some form of linux compatibility is available (e.g. using Cygwin on Windows), but no instructions for how to do this are provided yet.
SciKit-Learn (http://scikit-learn.org/stable/) is a collection of Machine Learning and other tools, implemented in Python. For this reason SciKit-Learn cannot be directly integrated in the LearningFramework plugin.
Instead, the use of SciKit-Learn is possible by running the training and application programs externally, in a separate process:
- For training, the training data can be exported using the [[LF_Export]] PR and choosing the EXPORTER_MATRIXMARKET2_CLASS or EXPORTER_MATRIXMARKET2_REGRESSION exporters. Then, a model can separately be trained using SciKit-Learn (see below for how to import the MatrixMarket files into Python for use with SciKit-Learn).
- Alternately, the [[LF_TrainClassification]] or [[LF_TrainRegression]] PRs can be used to automatically export the training data as two MatrixMarket files, and use the sklearn-wrapper (see below) to externally train a model
- For application, the [[LF_ApplyClassification]] or [[LF_ApplyRegression]] PRs use the sklearn-wrapper (see below) to start a separate program which accepts data from the PR, classifies it and sends the classification back the the PR.
sklearn-wrapper
The sklearn-wrapper (https://github.com/GateNLP/sklearn-wrapper) software is necessary to apply a model or to automatically train a model from inside the LearningFramework.
The following steps are needed to prepare the sklearn-wrapper for use with the LearningFramework:
- Install sklearn-wrapper
- Make sure the sklearn-wrapper commands work on your system
- Tell the LearningFramework how to run the sklearn-wrapper commands
Installing weka-wrapper
- Get the latest release of the sklearn-wrapper: https://github.com/GateNLP/sklearn-wrapper/releases (Download the file called
sklearn-wrapper-x.y.zip
) or clone the github repository - if the zip file was downloaded, unzip the zip file somewhere where it will be accessible from the LearningFramework later
- this creates a directory
sklearn-wrapper
. A good location for this directory is parallel to the gateplugin-LearningFramework directory but any location can be configured. - See also the instructions in the sklearn-wrapper README file.
Making sure the sklearn-wrapper commands work
From within the directory sklearn-wrapper
, run the command:
./bin/sklearnWrapperApply.sh .
If this shows the error message “ERROR: No model path”, sklearn-wrapper should be ready to use.
Telling the LearningFramework how to run sklearn-wrapper commands
The LearningFramework needs to be able to run the sklearn-wrapper commands sklearnWrapperApply.sh
and sklearnWrapperTrain.sh
in order to use SciKit-Learn properly. For this the LearningFramework needs to know the
location of where sklearn-wrapper is installed, i.e. the path of the directory (called sklearn-wrapper by default) which was created when the zip file was extracted during installation or where the git repository was cloned. This can be done by setting one of the following to the full path to that directory:
- the environment variable
SKLEARN_WRAPPER_HOME
- the java property
gate.plugin.learningframework.sklearnwrapper.home
- the setting
sklearnwrapper.home
in a filesklearn.yaml
in the data directory used
The setting in sklearn.yaml
takes precedence over the java property which takes precedence of the environment variable.
If any of these are set to a relative path, then the LearningFramework will try to interpret that as relative to the data directory used.
You should also set the environment variable SKLEARN_WRAPPER_PYTHON
to indicate the command to run python.
Using Exported Files for Training
When the training data is exported using the [[LF_Export]] with the EXPORTER_MATRIXMARKET2_CLASS or EXPORTER_MATRIXMARKET2_REGRESSION exporters, two files are created in the data directory:
- deps.mtx contains the targets for all instances
- indeps.mtx contains the attributes / independet variables for all instances
Both files are stored in MatrixMarket Coordinate Format (see http://math.nist.gov/MatrixMarket/formats.html)
To import these files for use with SciKit-Learn and train a model the following Python code can be used:
import scipy.io as sio
## depfile = the path of the exported file deps.mtx
## indepfile = the path of the exported file indeps.mtx
deps = sio.mmread(depfile)
indeps = sio.mmread(indepfile)
## model = some learning algorithm e.g. sklearn.svm.SVC()
## sklearn can use the imported sparse matrix directly for the independent variables
## but needs the targets in a different shape and format
targets = deps.toarray().reshape(deps.shape[0],)
model.fit(indeps,targets)
## now store the trained model for later use ...