SVMlight format

SVM^light is an implementation of Support Vector Machines (SVMs) in C. Its author, Thorsten Joachims, designed a dedicated input format for training and test data, and many other programs have since adopted it.

Here is a brief introduction to the input format.

Note: I have omitted the description of the optional info annotation, which may not be compatible with other, similar parsers.

Each line has the following form:

<target> <feature>:<value> <feature>:<value> ... <feature>:<value>

The elements are defined as:

<target> = +1 | -1 | 0 | <float>
<feature> = <integer>
<value> = <float>

Good cases:

-1 1:0.43 3:0.12 9284:0.2

Bad cases:

-1 0:0.1 1:1.0  # feature index must be at least 1
-1 1:0.1 2.0:0.1  # feature index must be an integer
-1 1:0.1 1:1.1  # feature indices must be unique
-1 3:0.1 2:0.1  # feature indices must be in increasing order

This format represents a sparse matrix, and each line can be converted to the following training data structure:

case class Feature(
    dim: Int,     // 1-based feature index
    score: Float  // feature value
)

case class OneTrainData(
    target: Float,            // target to predict
    features: Array[Feature]  // non-zero features, in increasing order of dim
)
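The conversion from a text line to this structure can be sketched in a few lines of Scala. This is only a minimal sketch, not the parser SVM^light itself uses: the `SvmLightParser` object and its method names are hypothetical, the two case classes are restated so the block is self-contained, and I assume that everything after a `#` is a comment that can be dropped. It rejects each of the bad cases listed above.

```scala
import scala.util.Try

// Restated from the post so that this sketch compiles on its own.
case class Feature(dim: Int, score: Float)
case class OneTrainData(target: Float, features: Array[Feature])

// Hypothetical helper object; the names are illustrative only.
object SvmLightParser {

  // Parse a single "feature:value" token.
  def parseFeature(token: String): Either[String, Feature] =
    token.split(":") match {
      case Array(d, v) =>
        for {
          dim   <- Try(d.toInt).toOption.toRight(s"feature index is not an integer: $d")
          score <- Try(v.toFloat).toOption.toRight(s"feature value is not a float: $v")
        } yield Feature(dim, score)
      case _ => Left(s"expected feature:value, got: $token")
    }

  // Parse all feature tokens, enforcing index >= 1 and strictly increasing order
  // (strictly increasing also implies that indices are unique).
  def parseFeatures(tokens: List[String]): Either[String, Vector[Feature]] =
    tokens.foldLeft[Either[String, Vector[Feature]]](Right(Vector.empty)) { (acc, token) =>
      for {
        seen <- acc
        f    <- parseFeature(token)
        _    <- Either.cond(f.dim >= 1, (), s"feature index must be at least 1: ${f.dim}")
        _    <- Either.cond(seen.lastOption.forall(_.dim < f.dim), (),
                  s"feature indices must be unique and increasing: ${f.dim}")
      } yield seen :+ f
    }

  // Parse one line into OneTrainData, or return a message describing the problem.
  def parseLine(line: String): Either[String, OneTrainData] = {
    // Assumption: text after '#' is a comment and is ignored.
    val body = line.takeWhile(_ != '#').trim
    if (body.isEmpty) Left("empty line")
    else {
      val tokens = body.split("\\s+").toList
      for {
        target   <- Try(tokens.head.toFloat).toOption
                      .toRight(s"target is not a number: ${tokens.head}")
        features <- parseFeatures(tokens.tail)
      } yield OneTrainData(target, features.toArray)
    }
  }
}
```

Calling `SvmLightParser.parseLine("-1 1:0.43 3:0.12 9284:0.2")` yields a `Right` holding three features, while each of the bad cases yields a `Left` with an error message.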

For a classification task, we can label positive examples as 1 and negative examples as -1. From each input example we extract a set of features and serialize them as an N-dimensional vector.

In the SVM^light format, feature ids must start from 1 rather than 0 and must appear in increasing order. Features whose value is 0 can simply be left out of the line, because a missing feature is treated as 0.
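To make the indexing rule concrete, here is a minimal sketch that serializes a dense feature vector into one line. The `SvmLightWriter` object and its `toLine` method are hypothetical names of my own, and I assume an integer class label as the target.

```scala
// Hypothetical helper object; the names are illustrative only.
object SvmLightWriter {

  // Serialize a dense feature vector as one SVM^light line.
  def toLine(target: Int, dense: Seq[Double]): String = {
    val pairs = dense.zipWithIndex.collect {
      // Position i in the vector becomes feature id i + 1; zeros are skipped.
      case (value, i) if value != 0.0 => s"${i + 1}:$value"
    }
    (target.toString +: pairs).mkString(" ")
  }
}
```

For example, `SvmLightWriter.toLine(-1, Seq(0.43, 0.0, 0.12))` gives `-1 1:0.43 3:0.12`: the zero in the second position is dropped, and the ids stay 1-based and increasing because they follow the order of the vector.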

Finally, if the predicted result lies in the range $[-1, +1]$, we can treat $[0, 1]$ as $1$ and $[-1, 0)$ as $-1$.

Similarly, we may sometimes label positive examples as $1$ and negative examples as $0$. In the prediction step we can then treat $[0, 0.5)$ as $0$ and $[0.5, 1]$ as $1$.
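The two thresholding rules can be sketched as follows. The `Thresholds` object and its function names are hypothetical, and the sketch assumes the score already lies in the stated range.

```scala
// Hypothetical helper object; the names are illustrative only.
object Thresholds {

  // Labels in {-1, +1}: scores in [0, 1] map to 1, scores in [-1, 0) map to -1.
  def toSignedLabel(score: Double): Int = if (score >= 0.0) 1 else -1

  // Labels in {0, 1}: scores in [0.5, 1] map to 1, scores in [0, 0.5) map to 0.
  def toBinaryLabel(score: Double): Int = if (score >= 0.5) 1 else 0
}
```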

Published by Yu on April 18, 2018
