Self-Supervised Learning [pdf]

l-m-z · on June 16, 2019

This is from yesterday's (June the 15th) workshop on self-supervised learning at ICML. The video of this talk can be seen here: https://www.facebook.com/icml.imls/videos/2030095370631729/ (not sure if the video is also available on other platforms than facebook)

hadsed · on June 16, 2019

In fact I found the second and third talks more interesting than the first! Which sadly are not included in the slides from the OP.

make3 · on June 16, 2019

thanks kind stranger, awesome video

varelse · on June 16, 2019

See also "The Revolution Will Not Be Supervised" from CS294-158...

https://www.youtube.com/watch?v=PX11C5Vfo9U

picozeta · on June 16, 2019

Thanks for that, this is an interesting talk.

planckscnst · on June 16, 2019

If we can use GANs to produce believable images of a subject and we can identify whether a video frame can exist between two others, it seems like we can produce infinitely high framerate videos that look like they were captured that way. This also means we can make slow motion video when we didn't record at a high framerate. I think with that, colorization, and possibly similar tools for audio, we might see some amazing recreations of classic performances.

gwern · on June 16, 2019

Yep, you certainly can. 'In-betweening', like superresolution, has been a GAN thing for years now, because triplets of frames are a clean dataset but you also care more about perceptual plausibility than pixel error. People use in-betweening GANs to make things like 60 FPS anime. (Not entirely sure why, but they do.)

ionionionioa · on June 16, 2019

>People use in-betweening GANs to make things like 60 FPS anime

Animation seems like an especially poor fit to me, since the actual framerate is often much lower than the video's framerate. Framerate can vary between scenes and even within different parts of one scene! Typically the background is very low framerate (sometimes as low as 4 FPS), the foreground is higher framerate (typically 8-12 FPS), while pans, zooms, and 3D elements are at a full 24 FPS. Most of the additional frames from interpolation will therefore be exact duplicates of other frames.

This does little to improve the smoothness of the video. It just adds in artifacts. And, since the frames between two drawings will be interpolated while frames within one drawing will be unchanged, the framerate will be inconsistent and appear as judder.

Interpolation will never work for 2D animation. No way, no how. Any worthwhile system will need to modify existing frames rather than simply adding more in between the original frames. I can understand interpolation for live action (though I still dislike it), but it is absolutely god-awful for animation.

gwern · on June 16, 2019

I think that's wrong: the whole point of GANs is that they're quite intelligent and good at faking outputs. I've seen interpolated/in-betweened videos (mostly but not entirely live-action), and it looks realistic to me.

The reason I'm somewhat skeptical is that just because something looks realistic doesn't mean that it what is intended. It's a version of the 'zoom in, enhance, enhance' problem. It's like the _Hobbit_ problem: a GAN could perfectly well fake a 60FPS version of a 30FPS version of the _Hobbit_ such that you couldn't tell that it wasn't the actual 60FPS version that Peter Jackson shot... but the problem is that it's 60FPS and that just feels wrong for cinema. Animators, anime included, use the limitations of framerate and deliberate switches between animating 'on twos' etc, with reductions in framerates being deliberately done for action segments and sakuga and other reasons. An anime isn't simply a film which was unavoidably shot with a too-low framerate.

(This is less true of superresolution: in most cases, if an anime studio could have afforded to animate at a higher resolution originally, they would have; and you're not compromising any 'artistic vision' if you use a GAN to do a good upscaling job instead of a lousy bilinear upscale built into your video player.)

ionionionioa · on June 17, 2019

>interpolated/in-betweened

That's the problem: no matter how smart your algorithm is, you cannot make animation look smooth by only adding frames. Not even human animators could do that.

The framerate of animation is irrelevant. What matters is the number of drawings per second, not the number of frames. An intelligent system would interpolate between drawings, which would often require modifying or deleting frames from the source.

I'm not some purist claiming that this is an evil technology. It just plain doesn't apply to animation, except for pans or the rare scene animated at a full 24 FPS.

gwern · on June 18, 2019

I'm not following. (If it doesn't apply at all, how is anyone doing it...?) Of course you can identify drawings per second, much the same way a monitor can display a 24FPS video at 120hz without needing to be an 'intelligent system': you increase or decrease the number of duplicates as necessary. You in-between pairs of different frames, replacing all the identical ones which are simply displaying the same drawing.

HeWhoLurksLate · on June 16, 2019

What's the reason that we shouldn't have 60 FPS in movies?

taneq · on June 17, 2019

We're so used to 24fps movies that it's become a subconscious cue for identifying 'realistic' film. Higher frame rates like like video games because all of the high frame rate CGI we see is in games.

(IMO this is just something we have to push through. I hate the low frame rate of movies.)

cercatrova · on June 16, 2019

SVP [0], a Windows / Linux program, can handle animation (as well as film) quite well, interpolating at 60fps or greater. Try it out for yourself and see. It doesn't use GANs however, but a sufficiently complex algorithm that does interpolation.

[0] https://www.svp-team.com

ionionionioa · on June 17, 2019

Yes, and it does so by turning the system off almost entirely. For animation, people often disable interpolation for everything except pans. As I mentioned before, those are usually at 24 FPS already.

https://www.svp-team.com/wiki/SVP3:Watching_anime

cycrutchfield · on June 16, 2019

I mean, both things have been around much much longer than GANs. I think Freeman’s paper on superresolution is from the early 2000s.

usmannk · on June 16, 2019

That sounds interesting. Where can I read more about people doing this? (And other fun things with GANs and/or anime)

zitterbewegung · on June 16, 2019

Gwern has a website https://www.gwern.net

rrishi · on June 16, 2019

This is really interesting. Any place I can get to know more about this?

antpls · on June 16, 2019

MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Frame Interpolation and Enhancement : https://sites.google.com/view/wenbobao/memc-net

One of the author also worked on https://news.ycombinator.com/item?id=20188316

kachurovskiy · on June 16, 2019

This is already possible in some video editors, see e.g. DaVinci Resolve optical flow - https://youtu.be/4LsLaJq5-ow

spyder · on June 16, 2019

Yes:

AI Learns Video Frame Interpolation | Two Minute Papers #197

https://www.youtube.com/watch?v=T_g6S3f0Z5I

danielscrubs · on June 16, 2019

Scrubs in 4k!

shostack · on June 16, 2019

Couple this then be used to reduce the storage required for phones to capture slow motion video since it can simply be done with regular video in post server side?

p1esk · on June 16, 2019

Yann Lecun is also a fan of self-supervised learning: https://www.facebook.com/epflcampus/videos/yann-lecun-self-s...

p1esk · on June 16, 2019

So, is there any difference between "self-supervised" and "unsupervised" learning?

arugulum · on June 17, 2019

There's essentially a push to relabel/rename many "unsupervised" methods and "self-supervised". Yann Lecun is one of the more famous proponents of this (https://www.facebook.com/yann.lecun/posts/10155934004262143), but I've been seeing the term gain traction.

The reason for this is that people felt that unsupervised learning was a misleading name for many of the so-called unsupervised learning methods, such as language-modeling. They argue that there is a supervised training signal in these methods, the only difference is that the training signal comes from the model "input" itself, rather than an external label.

Ultimately, I'm not entirely sure if there is really a distinction between the two if you argue it all the way down to the details (is PCA unsupervised? or self-supervised since it's constructing a model with respect to its own inputs), but I think it's generally intuitive what self-supervised methods refer to and I'm on board for this renaming.

why_only_15 · on June 17, 2019

On page 32 of the PDF it says self supervision is:

> - A form of unsupervised learning where the data provides the supervision

> – In general, withhold some information about the data, and task the network with predicting it

> – The task defines a proxy loss, and the network is forced to learn what we really care about, e.g. a semantic representation, in order to solve it

nl · on June 17, 2019

If you want to be really strict in the definitions there is a difference. There isn't really unsupervised learning, but there are unsupervised techniques - clustering etc.

In self-supervised training you use some kind of measurable structure to build a loss function against.

But in common usage people say "unsupervised" to mean "self-supervised". For example Word2Vec is usually referred to as unsupervised when it is technically self-supervised.

I think this is really because the self-supervised name was invented well after the techniques became common-place.

make3 · on June 16, 2019

That guy's definition of self-supervised learning is what I believe most people in the field would call unsupervised learning.

6nf · on June 17, 2019

Is it different from reinforcement learning?

jokoon · on June 16, 2019

That reminds me of those neural network models that would learn to change themselves, a little like a ML algorithm that would learn about neural network that worked best.

I think google did something like this some years ago?

nl · on June 17, 2019

Neural Architecture Search with Reinforcement Learning[1]

This isn't really a unsupervised or self-supervised technique at all. It's a combination of supervised learning with reinforcement learning (which is a whole other thing too).

[1] https://ai.google/research/pubs/pub45826

neruotablet · on June 16, 2019

You mean Neuroevolution? I.e using Evolutionary Algorithms https://youtu.be/L--IxUH4fac to evolve NNs?

nl · on June 17, 2019

No. The Google technique was reinforcement learning based and didn't use evolutionary algorithms at all.

https://ai.google/research/pubs/pub45826

shgidi · on June 16, 2019

A post series that also summarizes this subject: https://link.medium.com/IhOvrqFEzX

hbarka · on June 16, 2019

I’m nowhere near this field, but for the experts in here I was wondering if Self-Supervised Learning changes the paradigm for the approach to self-driving cars?

ec109685 · on June 16, 2019

At what point are we able to deliver low resolution video and have the system make up on the fly believable high resolution version of it?

dodobirdlord · on June 16, 2019

This technology (or at least variants) already exists. Image upscaling with convolutional neural nets is an old trick at this point, but with Nvidia integrating real-time denoising into their RTX technology I suspect that real-time upscaling is right around the corner if someone hasn't done it already.

[0]https://developer.nvidia.com/optix [1]https://topazlabs.com/gigapixel-ai/

felipellrocha · on June 16, 2019

ENHANCE!

derefr · on June 16, 2019

Nah, the CSI "enhance" thing is "multi-frame super-resolution image recovery", a different (though related) ML technique.

Speaking of, though: you'd think, by now, that security cameras that capture footage at very low framerates for the sake of storage space, would have ASICS in them using those models to convolve together a bunch of grainy input frames into a stream of fewer, but very good and clean frames.

Any hardware on the market with this capability yet?

ClassyJacket · on June 17, 2019

It makes sense for entertainment but not for security cameras - then you're filling it in with made up information. A security camera is supposed to be a record of truth.

oneshot908 · on June 17, 2019

Imagine a world where low information sorts interpret a sampling of possible hi-res reconstructions from low-res security videos as ground truth. That to me is far scarier than the OpenAI and MIRI fear-mongering about GPT-2.

derefr · on June 17, 2019

It’s not made-up information; it’s parallax / compressed sensing, in the same way that you can see through the grate on the front of a microwave oven to what’s behind it by moving your eyes around.

If it’s good enough for generating accurate fMRI images from sequentially-overlaid magnetic flux readings, it’s definitely good enough for generating visuals from slightly suckier visuals.

amthewiz · on June 16, 2019

The DNN based techniques are new but the concept of models that can fill in the blanks is old. It used to be called content addressable memory or autoassociative memory.

person_of_color · on June 17, 2019

Is this the game changer we are looking for?

p1esk · on June 17, 2019

What are you looking for?

suyash · on June 16, 2019

Mind blowing progress in the field of self supervised learning.