TensorRT is an Nvidia tool that tries to automatically create an optimized version of your network for you. You can usually expect a good 25% speedup, sometimes up to 100%. It's inference-only, though. There are other tools like Triton that will also get you big inference improvements. These translate directly into FPS, of course.
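To make the speedup-to-FPS relation concrete, here's a minimal sketch. It assumes inference is the only bottleneck in the pipeline (in practice, data loading or rendering overhead would eat into the gain); the function name and numbers are my own for illustration.

```python
def fps_after_speedup(baseline_fps: float, speedup_pct: float) -> float:
    """FPS after an inference-only speedup of `speedup_pct` percent.

    If per-frame latency drops by a factor of (1 + s), throughput
    rises by the same factor -- assuming inference is the bottleneck.
    """
    return baseline_fps * (1 + speedup_pct / 100)

# A 25% speedup on a 40 FPS pipeline gets you to 50 FPS:
print(fps_after_speedup(40, 25))   # 50.0
# A 100% speedup (2x) doubles it:
print(fps_after_speedup(40, 100))  # 80.0
```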
The optimization really has two aspects: speed and memory. We can probably assume this model was developed in PyTorch, which is far from as memory-efficient or speed-efficient as it could be with proper tuning; see llama.cpp for an example of what that tuning can buy you. Additionally, there are many optimizations related to model stability that enable reduced precision, whether that's fp16, fp8, 4-bit quantization, or the 2-bit quantization currently sitting on the front page. If you don't need to compute at full precision, you can get really good boosts in inference.
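A toy sketch of why bit width matters: symmetric uniform quantization of some made-up weights onto an 8-, 4-, and 2-bit grid. This is a pure-Python illustration of the idea, not any real library's quantization scheme (real schemes use per-channel scales, calibration, etc.).

```python
def quantize(xs, nbits):
    """Symmetric uniform quantization: snap each float to the nearest
    point on a signed integer grid of 2**(nbits-1) - 1 levels per side,
    scaled by the max absolute value, then map back to floats."""
    qmax = 2 ** (nbits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax
    return [round(x / scale) * scale for x in xs]

weights = [0.9, -0.42, 0.07, -0.88, 0.31]
for nbits in (8, 4, 2):
    q = quantize(weights, nbits)
    err = max(abs(a - b) for a, b in zip(weights, q))
    print(f"{nbits}-bit: max rounding error {err:.4f}")
```

Fewer bits means fewer grid points and larger rounding error, which is exactly why low-bit quantization needs a model stable enough to tolerate that noise.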
Now about the hyper-parameters. That's what defines how the model gets into its state. Yes, there's stochasticity in models, and retraining them with all else equal can end up with different results (see: seed lottery), but there are also things you can adjust. Unfortunately, most people tune these parameters by searching for the best result on the test dataset, which is statistically improper because it leads to data leakage (you are the one leaking test-set information into the model, even though the model never directly sees the test data). This typically leads to overfitting in the generalized domain even if you don't see a change in validation performance. You'll never notice this, though, unless you actually get hands-on or test in zero- to few-shot situations, and people even do that improperly. But what are you going to do? If you can't get your results published because you'll be rejected for not doing the improper thing, you can't blame anyone for doing it, only the reviewers. I wish at least practitioners would learn this, because it really affects the products.
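The difference between the proper protocol and the leaky one fits in a few lines. The scores below are made-up numbers for illustration; the point is only *where* the argmax is taken.

```python
# Hypothetical accuracies for three learning-rate candidates.
val_scores  = {1e-2: 0.71, 1e-3: 0.78, 1e-4: 0.74}
test_scores = {1e-2: 0.69, 1e-3: 0.75, 1e-4: 0.76}

# Proper protocol: choose on validation, then read the test score ONCE
# for the chosen configuration.
chosen = max(val_scores, key=val_scores.get)
reported = test_scores[chosen]

# Leaky protocol: choose directly on test. The reported number is now
# a max over test-set noise, so it is optimistically biased.
leaky = max(test_scores.values())

print(f"chosen lr={chosen}, honest test={reported}, leaky test={leaky}")
```

Here the leaky protocol reports a higher number than the honest one even though no model ever "saw" the test data during training; the selection step is the leak.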
Still, my main point is that even without doing this, the search spaces for these hyper-parameters are quite large. If you're unfamiliar, "hyper-parameter" is a vague term that can range from the more obvious, like learning rate, model depth, width, and optimizer, to more abstract choices such as a sub-network (like the U-Net in this one), activation functions, data augmentations, normalization methods, initializations, cost functions, and more. If you are familiar, sorry. The big labs spend a significant amount of time tuning these, because if you have big models and big (high-quality) data, you can really get them to fit almost anything. With smaller models and smaller data, you have to be more careful about your design choices. Considering the hardware they mention and the times they used to train, we can absolutely assume that even the more obvious parameters like learning rate, LR schedule, and weight decay are non-optimal. Small labs are generally not doing grid search, or really any good search, because they don't have the capacity. But the training matters, a lot. The model only does 2k? Who says? The authors didn't train it on 8k images or to produce 8k images; whether it can is an open question. Can it be adapted to 4-bit quantization to significantly increase throughput, or are there instabilities?
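To see why small labs can't exhaustively search this, just count configurations. The search space below is a hypothetical example I'm making up, deliberately tiny compared to what the term "hyper-parameter" actually covers, and it already implies hundreds of full training runs for one grid sweep.

```python
import math

# A deliberately small, hypothetical search space.
search_space = {
    "lr":           [1e-2, 3e-3, 1e-3, 3e-4],
    "weight_decay": [0.0, 1e-4, 1e-2],
    "depth":        [4, 8, 12],
    "activation":   ["relu", "gelu", "silu"],
    "augmentation": ["none", "flip", "flip+crop"],
}

# An exhaustive grid is the product of the option counts: each
# configuration is one complete training run.
n_configs = math.prod(len(v) for v in search_space.values())
print(n_configs)  # 4 * 3 * 3 * 3 * 3 = 324
```

Add a second optimizer choice, a couple of schedules, and a few cost-function variants and you're in the thousands, which is why in practice people sample a handful of random configurations instead of sweeping the grid.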
Training and rendering are deeply intertwined, but I am in fact talking about both. The important thing is to keep the context of this work in mind: it is not a final product. To put it one way, LDM (what Stable Diffusion was built around) was released about this time two years ago. The original code is terribly inefficient; the authors unnecessarily double the memory load in their handling of the EMA weights, and you're not going to generate a 256x256 image quickly (I recently clocked it at about 1 image per second on an A100). But Stability was built around it, and that model is much better and much faster now. It took millions of dollars and a few years to get there (and counting). The general architecture is still the same in spirit, but there have been big changes, a lot of them in training, and those training changes are what enable all the distillation behind SD-Turbo. There's a much bigger picture at play here than a single paper with a good result, because a paper is just a stepping stone. If you had only seen LDM as the model that couldn't beat a GAN on just about any metric (quality, throughput (I recently clocked StyleGAN2 at about 90 fps, fwiw), memory), to the point that they even dropped StyleGAN2 results in favor of StyleGAN 1 in some instances, then we wouldn't have Stable Diffusion. That's my point. You need to properly contextualize research, because research isn't the final product, and that contextualization means understanding how to properly evaluate and compare it. Someone mentioned the Facebook one, and that's absolutely not a fair comparison: the Facebook version trained with 110 cameras and over 100k frames per participant (though, surprisingly, on only 4 A100s). The two works are fundamentally looking at different things.
So yeah, super-resolution, re-projection, interpolation, etc. are things that could make this into a better product, but those techniques are better suited to something much more finalized. This is research; it's not even at the pitch-to-Y-Combinator stage yet. Don't judge it like it's something Facebook is shipping into the metaverse tomorrow.