SVMlight is an implementation of Support Vector Machines (SVMs) in C. Its author, Thorsten Joachims, designed a special input format to represent training and test data. Many other programs also use this format.

Here is a brief introduction to the input format.

Note: I have omitted the description of the 'info' field used for additional annotation, since it may not be compatible with other similar parsers.

Each line has the following form:

<target> <feature>:<value> <feature>:<value> ... <feature>:<value>

The definition of each element:

<target> = +1 | -1 | 0 | <float>
<feature> = <integer>
<value> = <float>

Good cases:

-1 1:0.43 3:0.12 9284:0.2

Bad cases:

-1 0:0.1 1:1.0  # feature must be at least 1
-1 1:0.1 2.0:0.1  # feature must be an integer
-1 1:0.1 1:1.1  # feature must be unique
-1 3:0.1 2:0.1  # features must be in increasing order

This format represents a sparse matrix, and each line can be converted to the following training data structure:

case class Feature(
    dim: Int,      // feature id, starting from 1
    score: Float   // feature value
)

case class OneTrainData(
    target: Float,            // target to predict
    features: Array[Feature]  // features present on the line
)
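
The grammar and the bad cases above can be checked mechanically. The following is a minimal sketch, assuming Scala 2.12 or later. `SvmLightParser`, `parseLine`, and `parseFeature` are hypothetical names and not part of SVMlight. Dropping everything after '#' is also an assumption of this sketch, since the 'info' field is not handled here. The two case classes are repeated so that the block is self-contained.

```scala
import scala.util.Try

final case class Feature(dim: Int, score: Float)
final case class OneTrainData(target: Float, features: Array[Feature])

// Hypothetical helper object; the names are illustrative only.
object SvmLightParser {

  // Parses a single "<feature>:<value>" token.
  private def parseFeature(token: String): Either[String, Feature] =
    token.split(":", -1) match {
      case Array(d, v) =>
        for {
          dim   <- Try(d.toInt).toOption.toRight(s"feature must be an integer: $token")
          _     <- Either.cond(dim >= 1, (), s"feature must be at least 1: $token")
          score <- Try(v.toFloat).toOption.toRight(s"value must be a float: $token")
        } yield Feature(dim, score)
      case _ => Left(s"expected <feature>:<value>, got: $token")
    }

  // Parses one line; returns Left(message) when a rule is violated.
  def parseLine(line: String): Either[String, OneTrainData] = {
    // Assumption: anything after '#' is ignored.
    val tokens = line.takeWhile(_ != '#').trim.split("\\s+").toList.filter(_.nonEmpty)
    tokens match {
      case Nil => Left("empty line")
      case head :: rest =>
        for {
          target <- Try(head.toFloat).toOption.toRight(s"bad target: $head")
          features <- rest.foldLeft[Either[String, Vector[Feature]]](Right(Vector.empty)) {
            (acc, tok) => for { fs <- acc; f <- parseFeature(tok) } yield fs :+ f
          }
          // Strictly increasing ids also rules out duplicates.
          _ <- Either.cond(
            features.map(_.dim).sliding(2).forall(p => p.size < 2 || p(0) < p(1)),
            (),
            "features must be unique and in increasing order"
          )
        } yield OneTrainData(target, features.toArray)
    }
  }
}
```

With this sketch, `SvmLightParser.parseLine("-1 1:0.43 3:0.12 9284:0.2")` yields a `Right`, while each of the bad cases yields a `Left` carrying an error message.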

For a classification task, we can label positive examples as '1' and negative examples as '-1'. From each input example we extract a set of features, which we then serialize as an N-dimensional array.

In the SVMlight format, feature ids must start from 1 instead of 0 and appear in increasing order. Features that are missing from a line are treated as 0, so you can simply leave out any feature whose value is 0.
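
As an illustration of these rules, here is a minimal sketch that serializes a dense feature array into one line of the format. `SvmLightWriter` and `toLine` are hypothetical names, and the sketch assumes that position `i` of the array holds the value for feature id `i + 1`.

```scala
// Hypothetical helper object; the names are illustrative only.
object SvmLightWriter {

  // Formats one example. dense(i) is assumed to hold the value of feature id i + 1.
  def toLine(target: Float, dense: Array[Float]): String = {
    val pairs = dense.zipWithIndex.collect {
      // Zero-valued features are skipped; ids are shifted to start from 1.
      case (value, index) if value != 0.0f => s"${index + 1}:$value"
    }
    (target.toString +: pairs).mkString(" ")
  }
}
```

For example, `SvmLightWriter.toLine(-1f, Array(0.5f, 0f, 0.25f))` produces `-1.0 1:0.5 3:0.25`. Because the array is walked in index order, the ids come out in increasing order without a separate sort.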

Finally, the predicted result may fall in the range [-1, +1]; we can treat [0, 1] as 1 and [-1, 0) as -1.

Similarly, we may sometimes label positive examples as '1' and negative examples as '0'. In the prediction step we can then treat [0, 0.5) as 0 and [0.5, 1] as 1.
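
The two thresholding conventions can be written as a pair of small functions. This is only a sketch: `LabelMapping`, `toSignedLabel`, and `toBinaryLabel` are hypothetical names, and scores outside the stated ranges simply fall on whichever side of the threshold they land.

```scala
// Hypothetical helper object; the names are illustrative only.
object LabelMapping {

  // {-1, +1} convention: [0, 1] maps to 1 and [-1, 0) maps to -1.
  def toSignedLabel(score: Double): Int = if (score >= 0.0) 1 else -1

  // {0, 1} convention: [0.5, 1] maps to 1 and [0, 0.5) maps to 0.
  def toBinaryLabel(score: Double): Int = if (score >= 0.5) 1 else 0
}
```

Note that the boundary values 0 and 0.5 are assigned to the positive class in both conventions, matching the closed intervals given above.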