Hacker News
GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Human Synthesis (shunyuanzheng.github.io)
80 points by rbanffy on Dec 8, 2023 | 16 comments


The title is missing the key phrase "real-time" from the paper's title.

Optimization was one of the biggest bottlenecks. This will unlock fast reconstruction on more types of hardware, more efficiently.

GS is also a great compression technique; fittingly, Meta called their avatar system a "Codec".


I tried to make it fit and I inadvertently removed a vital organ. Sorry.


This seems like the type of thing you would want for VR.


Or for simulating crazy, impossible dynamic camera shots from a few fixed cameras. Or for creating a god's-eye security view of a premises, with virtual move/pan/tilt/zoom. Reality TV as if you had a cameraman in the middle of the action, from just a periphery of discrete fixed cameras. And of course your VR POV on the court of a basketball game, running after the ball, dodging players, calling fouls better than the refs -- in fact, why bother putting refs on the court/field?


Agreed, if you could stream a couple of shots that the endpoint then hydrates into spatial volumetric data, that would _kill_, particularly combined with some extrapolation for feature recovery so you could send low-res streams of all objects in a world.


If it was fast and high-resolution, yes.

But at 25 fps @ 2K it's not really usable yet. You'd want more like 140 fps (70 Hz per eye) at 8K, so you'd need roughly another 100x speed-up to make this VR-ready.
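Back-of-the-envelope, the ~100x figure checks out if you assume cost scales linearly with pixel count (the exact "2K"/"8K" resolutions below are assumptions):

```python
# Rough speedup math for the VR-ready claim. Assumed resolutions:
# "2K" ~ 2048x1080, "8K" ~ 7680x4320 (both approximations), and a
# naive linear-in-pixels cost model.
current_fps, target_fps = 25, 140
px_2k = 2048 * 1080
px_8k = 7680 * 4320

fps_factor = target_fps / current_fps    # 5.6x more frames per second
pixel_factor = px_8k / px_2k             # ~15x more pixels per frame
total = fps_factor * pixel_factor

print(round(total))  # ~84x, i.e. on the order of the quoted 100x
```

With slightly different resolution assumptions (or sublinear scaling) you land anywhere from ~50x to ~120x, so "another 100x" is a fair round number.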


> But with 25 fps @ 2K it's not really usable yet

Yet is the key word. Let's not forget the context here. This is a __research__ paper, and not one by a big tech company. Yes, a big university, but still. I'm absolutely certain you could get this to at least 30 fps, if not 45. They're doing this on a 3090, which tells you something about how much money that lab has. If they had a bunch, they'd be using A100s or H100s and at least a full node. What's actually really impressive is that they trained their model on a single 3090 in 15 hours. You could most certainly scale this model up, and then distill an even smaller version that would be much faster. If I were a betting man, I'd put money down that they are not implementing this in C, are not using custom CUDA kernels or TensorRT, and are not quantizing their model. They might not even be using fp16.

I don't know why everyone expects everything to be TRL 8+. What's wrong with low-TRL research? Especially from academia. This is definitely somewhere in TRL 3-5. You wanna see cool things? Take cool research and turn it into products. It's a whole other challenge. But we've seen even LLMs and diffusion models get a lot faster in a short time. Why not this too?


What if you mix splatting and diffusion models, so you get point-cloud-like, semi-3D image (re)construction?


Sure, why not? Or why not splatting + GAN since the GAN will be much more computationally light and you get high diversity from the splatting already? All worth a shot, right? Lots of things to explore.


Besides, while transistors aren’t shrinking that much, Moore’s Law is still a thing. I’d expect 50fps at 2K in less than 2 years.


If they solve the data-compression issue, I can see this becoming a hardware thing, à la triangle rasterisation or ray tracing.


Dunno. There's a bunch of other optimisations you're overlooking (super-resolution, reprojection, etc.).


Or, you know... TensorRT. It's an academic project by a lab that trained everything on a 3090. It's absolutely reasonable to believe this is nowhere near optimized. It's even reasonable to believe they haven't found the optimal hyper-parameters. That's kinda true of a lot of research, though some big labs do these optimizations, and for some reason people then overlook models that are nearly as good but built on 1/1000th the budget. There's a ton of missed opportunity.


We were talking about rendering - not training. Have I misunderstood your point?


Yes. This is a trained ML model.

TensorRT is an Nvidia tool that automatically creates an optimized version of your network for you. You can usually expect a good 25% speedup, sometimes up to 100%. It's inference-only, though. There are other tools, like Triton, that will also get you massive inference improvements. These translate directly into FPS, of course.

The optimization really has two aspects: one is speed and the other is quality. We can probably assume this model was developed in PyTorch, which is far from as memory- or speed-efficient as one could get with proper tuning -- see llama.cpp for an example. Additionally, there are many optimizations related to model stability that enable reduced precision, be that fp16, fp8, 4-bit quantization, or the 2-bit quantization currently sitting on the front page. If you don't need to compute at full precision, you can get really good boosts in inference speed.
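As a toy illustration of the precision trade-off (pure Python; real schemes like fp16/fp8 or GPTQ-style low-bit quantization are far more sophisticated than this uniform quantizer):

```python
# Toy uniform quantization: map floats in [lo, hi] onto n-bit integer
# codes and back, to show how reconstruction error grows as bit width
# shrinks. Not how production quantizers actually work.
def quantize(xs, bits, lo=-1.0, hi=1.0):
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    codes = [round((x - lo) / scale) for x in xs]   # n-bit integer codes
    deq = [lo + c * scale for c in codes]           # reconstructed floats
    return codes, deq

weights = [-0.73, -0.10, 0.02, 0.55, 0.98]          # made-up "weights"
for bits in (8, 4, 2):
    _, deq = quantize(weights, bits)
    err = max(abs(w, ) if False else abs(w - d) for w, d in zip(weights, deq))
    print(f"{bits}-bit max error: {err:.4f}")
```

The payoff is that an n-bit code is 32/n times smaller than a float32, and integer math is cheaper, which is where the inference speedups come from -- as long as the model tolerates the added error.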

Now, about the hyper-parameters: they're what define how the model gets into its state. Yes, there's stochasticity in training, and retraining with all else equal can give different results (see the seed lottery), but there are also things you can adjust. Unfortunately, most people tune these parameters by searching for the best result on the test set, which is statistically improper because it leads to data leakage (you are the one leaking test information into the model, even though it never sees the test data directly). This typically leads to overfitting in the generalized domain even if you don't see a change in validation performance. You'll never notice unless you actually get hands-on or test in zero-to-few-shot settings, and people often do even that improperly. But what are you going to do? If you can't get your results published because you'll get rejected for not doing the improper thing, you can't blame anyone for doing it, only the judges. I wish at least practitioners would learn this, because it really affects the products.
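The proper protocol being argued for can be sketched in a few lines (the `fit_and_score` function here is a hypothetical stand-in, not anything from the paper):

```python
import random

# Sketch of leak-free tuning: select hyper-parameters on a *validation*
# split, and touch the *test* split exactly once, after selection is
# frozen. The scoring function is a dummy stand-in for train+evaluate.
random.seed(0)
data = list(range(100))
random.shuffle(data)
train, val, test = data[:70], data[70:85], data[85:]

def fit_and_score(train_set, eval_set, lr):
    # Hypothetical: pretend lr closest to 0.1 trains best.
    return -abs(lr - 0.1)

candidates = [0.001, 0.01, 0.1, 1.0]
# Correct: selection uses only the validation split...
best_lr = max(candidates, key=lambda lr: fit_and_score(train, val, lr))
# ...and the test split is scored once, never searched over.
final_score = fit_and_score(train, test, best_lr)
print(best_lr)  # prints 0.1
```

Tuning `best_lr` directly against `test` is the leakage the comment describes: the number you report is then an optimistic estimate, even though the model "never saw" the test data.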

Still, my main point is that even without that, the search spaces for these hyper-parameters are quite large. If you're unfamiliar: "hyper-parameter" is quite a vague term, and ranges from the obvious (learning rate, model depth, width, optimizer) to more abstract choices such as a sub-network (like the U-Net in this one), activation functions, data augmentations, normalization methods, initializations, cost functions, and more. (If you are familiar, sorry.) The big labs spend a significant amount of time tuning these, because with big models and big (high-quality) data you can really get them to fit almost anything. With smaller models and smaller data, you have to be more careful about your design choices. Considering the hardware they mention and their training times, we can safely assume that even the obvious parameters like lr, lr schedule, and weight decay are non-optimal. Small labs generally aren't doing grid search, or really any good search, because they don't have the capacity. But the training matters, actually a lot. Model only does 2K? Who says? The authors didn't train it on 8K images or to produce 8K images; it's an open question whether it can. Can it be adapted to 4-bit quantization to significantly increase throughput, or are there instabilities?

Training and rendering are deeply intertwined, but I am in fact talking about both. The important thing is to keep in mind the context of this work: it is not a final product. Put it this way: LDM (what Stable Diffusion was built around) was released about this time two years ago. The original code is terribly inefficient -- the authors unnecessarily double the memory load in their handling of the EMA, and you're not going to generate a 256x256 image quickly (I recently clocked it at about 1 image per second on an A100). But they built Stability around it, and that model is much better and much faster now. It took millions of dollars and a few years to get there (and counting). The general architecture is still the same in spirit, but there have been big changes, a lot of them in training, and those changes enable all the distillation behind SD-Turbo.

There's a much bigger picture at play than a single paper with a good result, because this is a stepping stone. If you had judged LDM as the model that couldn't beat a GAN on just about any metric -- quality, throughput (I recently clocked StyleGAN2 at about 90 fps, fwiw), memory; hell, they even dropped StyleGAN2 results in preference of StyleGAN 1 in some instances -- then we wouldn't have Stable Diffusion. That's my point. You need to properly contextualize research, because research isn't the final product, and that contextualization means understanding how to properly evaluate and compare it. Someone mentioned the Facebook one, and that's absolutely not a fair comparison: that version trained with 110 cameras and over 100k frames per participant (though surprisingly used only 4 A100s). They are fundamentally looking at different things.

So yeah, super-resolution, reprojection, interpolation, etc. can make this into a better product, but those techniques are better suited to something much more finalized. It's research; it's not even at the pitch-to-Y-Combinator stage yet. Don't judge it like it's something Facebook is shipping into the metaverse tomorrow.


Where are they getting training data?



