SVMlight is an implementation of Support Vector Machines (SVMs) in C. Its author, Thorsten Joachims, designed a dedicated input format for training and test data, and many other programs have since adopted it.
Here is a brief introduction to the input format.
Note: I have omitted the optional 'info' field used for additional annotation, because other parsers of similar formats may not support it.
Each line has the following synopsis:
<target> <feature>:<value> <feature>:<value> ... <feature>:<value>
The definition of each element:
<target> = +1 | -1 | 0 | <float>
<feature> = <integer>
<value> = <float>
For example, this is a valid line:
-1 1:0.43 3:0.12 9284:0.2
The following lines are invalid:
-1 0:0.1 1:1.0 # feature ids must be no less than 1
-1 1:0.1 2.0:0.1 # feature ids must be integers
-1 1:0.1 1:1.1 # feature ids must be unique
-1 3:0.1 2:0.1 # feature ids must be in increasing order
This format represents a sparse matrix, and each line can be converted to the following training data structure:
case class Feature(
  dim: Int,
  score: Float,
)

case class OneTrainData(
  target: Float, // target to predict
  features: Array[Feature],
)
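The rules above can be checked while reading a line into this structure. The following is only a minimal sketch, not SVMlight's own parser: the names SvmLightParser and parseLine are made up for illustration, and it assumes that anything after a '#' can simply be dropped.

```scala
import scala.util.Try

// Same shape as the case classes shown in this post.
case class Feature(dim: Int, score: Float)
case class OneTrainData(target: Float, features: Array[Feature])

// Hypothetical helper, not part of SVMlight itself.
object SvmLightParser {

  /** Parses one line; returns Left(message) when a format rule is violated. */
  def parseLine(line: String): Either[String, OneTrainData] = {
    // Assumption: everything after '#' is ignored.
    val tokens = line.takeWhile(_ != '#').trim.split("\\s+").toList
    tokens match {
      case Nil | ("" :: Nil) => Left("empty line")
      case targetStr :: pairs =>
        for {
          target   <- Try(targetStr.toFloat).toOption.toRight(s"bad target: $targetStr")
          features <- parsePairs(pairs)
          _        <- checkIds(features)
        } yield OneTrainData(target, features.toArray)
    }
  }

  // Parses a single "<feature>:<value>" token.
  private def parsePair(token: String): Either[String, Feature] =
    token.split(":") match {
      case Array(id, value) =>
        for {
          dim   <- Try(id.toInt).toOption.toRight(s"feature id is not an integer: $id")
          score <- Try(value.toFloat).toOption.toRight(s"value is not a number: $value")
        } yield Feature(dim, score)
      case _ => Left(s"expected <feature>:<value>, got: $token")
    }

  // Parses all pairs in order, stopping at the first error.
  private def parsePairs(tokens: List[String]): Either[String, List[Feature]] =
    tokens.foldLeft[Either[String, List[Feature]]](Right(Nil)) { (acc, tok) =>
      for { done <- acc; f <- parsePair(tok) } yield done :+ f
    }

  // Ids must be >= 1 and strictly increasing (which also makes them unique).
  private def checkIds(fs: List[Feature]): Either[String, Unit] = {
    val ids = fs.map(_.dim)
    if (ids.exists(_ < 1)) Left("feature ids must be >= 1")
    else if (ids.zip(ids.drop(1)).exists { case (a, b) => a >= b })
      Left("feature ids must be strictly increasing")
    else Right(())
  }
}
```

With this sketch, SvmLightParser.parseLine("-1 1:0.43 3:0.12 9284:0.2") yields a Right, while each of the four invalid lines yields a Left carrying an error message.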
For binary classification, we can label positive examples as '1' and negative examples as '-1'. From each input we extract a set of features and serialize them as an N-dimensional array.
In the svmlight format, feature ids must start from 1 rather than 0 and must appear in increasing order. Features that are missing from a line are treated as 0, so zero-valued features can simply be left out.
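As a rough sketch of this serialization step, assume the extracted features sit in a dense Array[Float] where position i holds the value of feature id i + 1. The names SvmLightWriter and toLine are hypothetical, chosen only for this example.

```scala
// Hypothetical helper, not part of SVMlight itself.
object SvmLightWriter {

  /** Turns a target and a dense feature array into one svmlight line. */
  def toLine(target: Float, dense: Array[Float]): String = {
    val pairs = dense.zipWithIndex.collect {
      // Skip zeros (missing features) and shift to 1-based feature ids.
      case (value, index) if value != 0f => s"${index + 1}:$value"
    }
    (target.toString +: pairs).mkString(" ")
  }
}
```

Because the array is walked from left to right, the emitted feature ids are automatically unique and increasing. For instance, SvmLightWriter.toLine(-1f, Array(0.5f, 0f, 0.25f)) drops the zero in the second position and keeps ids 1 and 3.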
Finally, the predicted result may fall in the range [-1, +1], in which case we can treat [0, 1] as 1 and [-1, 0) as -1.
In the same way, we may sometimes label positive examples as '1' and negative examples as '0'; in the prediction step we then treat [0, 0.5) as 0 and [0.5, 1] as 1.
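Both decision rules come down to a single threshold. A small sketch, with the hypothetical names Decision, toSignLabel and toBinaryLabel, and assuming the predicted score arrives as a Float:

```scala
// Hypothetical helper illustrating the two labeling conventions.
object Decision {

  /** Labels in {-1, +1}: scores in [0, 1] map to 1, scores in [-1, 0) map to -1. */
  def toSignLabel(score: Float): Int = if (score >= 0f) 1 else -1

  /** Labels in {0, 1}: scores in [0.5, 1] map to 1, scores in [0, 0.5) map to 0. */
  def toBinaryLabel(score: Float): Int = if (score >= 0.5f) 1 else 0
}
```

The boundary values follow the intervals given above: a score of exactly 0 counts as positive under the first convention, and a score of exactly 0.5 counts as positive under the second.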