SVMlight is an implementation of Support Vector Machines (SVMs) in C. Its author, Thorsten Joachims, designed a special input format to represent training and test data. Many other programs have adopted the same format.

Here is a brief introduction to the input format.

Note: I have omitted the description of the additional annotation info, which may not be compatible with other similar parsers.

Each line follows this pattern:

<target> <feature>:<value> <feature>:<value> ... <feature>:<value>

The definition of each element:

<target> = +1 | -1 | 0 | <float>
<feature> = <integer>
<value> = <float>

Good cases:

-1 1:0.43 3:0.12 9284:0.2

Bad cases:

-1 0:0.1 1:1.0  # feature should be no less than 1
-1 1:0.1 2.0:0.1  # feature should be an integer
-1 1:0.1 1:1.1  # feature should be unique
-1 3:0.1 2:0.1  # features should be in increasing order
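The rules behind these bad cases can be checked mechanically. Below is a minimal sketch in Scala; the object name `SvmLightCheck` and the error messages are my own, not part of SVMlight. It strips anything after `#` instead of validating it, since annotations are out of scope here.

```scala
import scala.util.Try

object SvmLightCheck {
  /** Returns None for a well-formed line, or Some(reason) for the first problem found. */
  def check(line: String): Option[String] = {
    // Drop a trailing "# ..." part; it is not validated in this sketch.
    val tokens = line.takeWhile(_ != '#').trim.split("\\s+").toList
    tokens match {
      case Nil | List("") => Some("empty line")
      case target :: pairs =>
        if (Try(target.toFloat).isFailure) Some(s"target '$target' is not a number")
        else checkPairs(pairs, 0) // ids start at 1, so 0 means "nothing seen yet"
    }
  }

  // Walks the <feature>:<value> pairs, remembering the last feature id seen.
  private def checkPairs(pairs: List[String], lastId: Int): Option[String] = pairs match {
    case Nil => None
    case pair :: rest =>
      pair.split(":", -1) match {
        case Array(f, v) =>
          val id = Try(f.toInt).toOption
          if (id.isEmpty) Some(s"feature '$f' is not an integer")
          else if (id.get < 1) Some(s"feature ${id.get} is less than 1")
          else if (id.get == lastId) Some(s"feature ${id.get} is duplicated")
          else if (id.get < lastId) Some(s"feature ${id.get} is not in increasing order")
          else if (Try(v.toFloat).isFailure) Some(s"value '$v' is not a number")
          else checkPairs(rest, id.get)
        case _ => Some(s"'$pair' is not of the form <feature>:<value>")
      }
  }
}
```

Calling `SvmLightCheck.check` on the good case returns `None`, while each of the four bad cases returns a `Some` with the reason.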

This format represents a sparse matrix and could be converted to the following training data:

case class Feature(
    dim: Int,      // 1-based feature id
    score: Float   // feature value
)

case class OneTrainData(
    target: Float,            // target to predict
    features: Array[Feature]  // sparse features, ordered by dim
)
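To make the conversion concrete, here is a rough sketch of parsing one line into these classes. The two case classes are repeated so the snippet compiles on its own, and `SvmLightParser.parseLine` is a name I made up for illustration. It assumes the line is already well-formed and simply throws on malformed input.

```scala
case class Feature(dim: Int, score: Float)
case class OneTrainData(target: Float, features: Array[Feature])

object SvmLightParser {
  // Parses "<target> <feature>:<value> ..." into OneTrainData.
  // Anything after '#' is dropped, since annotations are not covered here.
  def parseLine(line: String): OneTrainData = {
    val tokens = line.takeWhile(_ != '#').trim.split("\\s+")
    val features = tokens.tail.map { tok =>
      val sep = tok.indexOf(':') // position of the feature/value separator
      Feature(tok.substring(0, sep).toInt, tok.substring(sep + 1).toFloat)
    }
    OneTrainData(tokens.head.toFloat, features)
  }
}
```

For the good case above, `SvmLightParser.parseLine("-1 1:0.43 3:0.12 9284:0.2")` yields a target of -1 and three features with dims 1, 3 and 9284.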

When doing classification, we can label positive examples as '1' and negative examples as '-1'. For each input, we extract a set of features and serialize them as an N-dimensional array.

In the SVMlight format, feature ids must start from 1 instead of 0 and must be in increasing order. Features that are missing from a line are treated as 0, so you can simply leave them out.
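Going the other way, a dense feature array can be written out as one line. This is only a sketch under my own assumptions: `SvmLightWriter.toLine` is a hypothetical helper, and it takes `dense(i)` to hold the value of feature id `i + 1`. Zero entries are skipped, which is what keeps the format sparse.

```scala
object SvmLightWriter {
  // Serializes one example as "<target> <feature>:<value> ...".
  // Array index i becomes feature id i + 1 because ids are 1-based.
  def toLine(target: Float, dense: Array[Float]): String = {
    val pairs = dense.zipWithIndex.collect {
      case (value, i) if value != 0.0f => s"${i + 1}:$value" // omit zeros
    }
    (target.toString +: pairs).mkString(" ")
  }
}
```

Walking the array in index order produces feature ids that are unique and increasing, so the output satisfies the rules above without any extra sorting.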

Finally, the predicted result could be in the range [-1, +1], and we can treat [0, 1] as 1 and [-1, 0) as -1.

For the same reason, we may sometimes label positive as 1 and negative as 0, and in the prediction step treat [0, 0.5) as 0 and [0.5, 1] as 1.
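Both conventions come down to a single threshold. A small sketch follows; the object and function names are mine, and the scores are assumed to already lie in the ranges described above.

```scala
object Labels {
  // -1/+1 convention: scores in [0, 1] become 1, scores in [-1, 0) become -1.
  def toSignLabel(score: Float): Int = if (score >= 0.0f) 1 else -1

  // 0/1 convention: scores in [0.5, 1] become 1, scores in [0, 0.5) become 0.
  def toBinaryLabel(score: Float): Int = if (score >= 0.5f) 1 else 0
}
```

Note that both boundary values, 0 in the first case and 0.5 in the second, go to the positive class, matching the intervals given above.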


Yu

