Very nice article Aria. You quickly mention Pegasos as a scalable alternative to...

psb217 · on April 16, 2012

You may want to look at the paper: "P-packSVM: Parallel Primal grAdient desCent Kernel SVM" from ICDM 2009. It presents an extension of Pegasos to non-linear kernels. Evaluating the pairwise kernels <x_i,x_j> and continuously updating the estimate of the norm of the implicit weight vector w seem to be the main hurdles to achieving the performance gains seen with linear kernels.

The key takeaway from the paper (for me) was that the computation time on a single processor was not significantly better than that of the standard implementation provided by SVM-Light. However, with a variety of tricks permitted by the use of an SGD/Pegasos-like method, the authors were able to get significant speedup when using a compute cluster, allowing a good reduction in computation times (e.g. ~200x reduction on 512 processors).

brendano · on April 16, 2012

For NLP applications, which I think Aria's article is mostly concerned with, non-linear kernelized classifiers are often little better than linear ones. I think that's one part of the recent interest in SGD-style training algorithms (they work for linear cases nicely, less so for kernelized ones).

[deleted part about kernelizing pegasos, realized i dont know that area]

srconstantin · on April 16, 2012

Link to LaSVM paper: jmlr.csail.mit.edu/papers/volume6/bordes05a/bordes05a.pdf Also a good overview of SVM techniques in general.

srconstantin · on April 16, 2012

So...are you saying you need the dual formulation in order to allow a kernel model?

ogrisel · on April 16, 2012

No actually it's not the case. But I don't know how Pegasos can be adapted to use kernels. If you take figure 1 of the paper [1] you will see that the gradient of the objective function is used to update a single weights vector `w` at each step of the projected stochastic gradient descent. In a kernel model, all the support vectors cannot be collapsed into a single weight vector `w`. You would need to handle the kernel expansion against the support vectors explicitly. But then how to select the support vectors out of all the samples from the dataset while keeping the algorithm online? The Pegasos paper does not say mention it.

[1] http://eprints.pascal-network.org/archive/00004062/01/Shalev...

sparsevector · on April 16, 2012

The set of support vectors is just the set of training examples that have non-zero alpha parameters. To implement the gradient update you just evaluate the support vector machine on the example (using the explicit kernel expansion) and then if the example has signed margin less than 1 you add y * eta to the corresponding alpha value.

The difficulty with Pegasos for non linear kernels is the support set quickly becomes very large and so evaluating the model becomes very slow. Note that since the alpha values are not constrained to be non-negative (unlike the standard dual algorithms) the alpha values don't ever get clipped to zero--instead they just slowly converge to zero. It's still (I think) one of the fastest methods in terms of theoretical convergence guarantees but perhaps not as fast as LaSVM or something similar in practice.

However, there's been a more general trend in machine learning to use linear models with lots of features instead of kernel models, partially because of these sort scalability issues.

ogrisel · on April 16, 2012

Thanks for the reply. I was told by twitter that this is similar to the kernel perceptron which I don't know well either. There is a good introduction with python code snippet here:

http://www.mblondel.org/journal/2010/10/31/kernel-perceptron...

However it seems that you need to compute the kernel expansion on the full set of samples (or maybe just the accumulated past samples?): this does not sound very online to me...

sparsevector · on April 16, 2012

It's true you need to compute the kernel dot product between every example you see and every example in the support set (every example that ever previously evaluated to signed margin < 1). Whether it's online depends on your definition of "online" . It's definitely not online in the sense of using memory independent of the number of examples, since you have to keep around the support set. I think there are results showing the support set grows linearly with the size of the training set under reasonable assumptions. However, it is online in the sense that it operates on a stream of data, computing predictions and updates for each example one-by-one. It's also online in the sense that its analysis is based on online learning theory (e.g. mistake / regret bounds). A lot of learning theory papers use "online" in the latter two senses, which is confusing if you expect the former.