In our last blog post, we showed how Ubicept Photon Fusion produced more trustworthy frames than AI-based video denoisers. To summarize, here’s an image that shows the difference:
The top two frames show the tradeoff between shorter exposures and higher noise (left) versus longer exposures and lower noise (right). The bottom two frames show how an AI-based video denoiser (left) wipes out important visual details, while our approach (right) reduces noise while preserving visual fidelity.
It’s important to emphasize, however, that visual fidelity is only a means to an end. After all, our company name comes from our mission to enable ubiquitous perception! Achieving that requires preserving the features that perception systems depend on, such as corners and edges. If those features are corrupted by blur or noise, the rest of the perception pipeline can fall apart. For example, here’s what happens when we run a multiscale feature detector (KAZE) on the image above:
In the Ubicept Photon Fusion frame, the detections correspond to meaningful visual structures. These are exactly what downstream algorithms (e.g., SLAM, optical flow, visual tracking) depend on to function reliably. The same can’t be said of the other frames!
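If you want to try this kind of check on your own footage, here is a minimal sketch using OpenCV's KAZE implementation. The filename is a placeholder and the detector is left at its default parameters, which may not match the exact settings we used:

```python
# Minimal sketch: run KAZE feature detection on a single frame with OpenCV.
# "frame.png" is a placeholder for whatever frame you want to inspect.
import cv2

frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Create the KAZE detector (default parameters) and find keypoints + descriptors.
kaze = cv2.KAZE_create()
keypoints, descriptors = kaze.detectAndCompute(frame, None)

# Draw the detections for visual inspection and save the result.
vis = cv2.drawKeypoints(
    cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR),
    keypoints,
    None,
    flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS,
)
cv2.imwrite("frame_kaze.png", vis)
print(f"Detected {len(keypoints)} KAZE keypoints")
```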
To some of you, all this discussion about preprocessing and features might seem dated. You might be thinking: with today’s deep learning systems capable of end-to-end perception, shouldn’t we be able to just feed the raw video into a neural network and let it figure everything out?
Rather than debate it in theory, why not show you first? The images below come from processing three input videos:
We fed these frames into RAFT for optical flow estimation and MiDaS for monocular depth estimation using their publicly available pre-trained models from GitHub:
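For reference, here is a hedged sketch of how these pre-trained models are typically loaded and run. The entry points, weights, and preprocessing below follow the publicly documented examples (RAFT via torchvision, MiDaS via torch.hub) and may not match our exact setup:

```python
# Hedged sketch: loading publicly available pre-trained RAFT and MiDaS models.
import torch
from torchvision.models.optical_flow import Raft_Large_Weights, raft_large

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Optical flow with RAFT (torchvision's pre-trained weights) ---
raft_weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=raft_weights).eval().to(device)
raft_transform = raft_weights.transforms()

def estimate_flow(frame1, frame2):
    """frame1/frame2: uint8 RGB tensors of shape (3, H, W); H and W divisible by 8."""
    batch1, batch2 = raft_transform(frame1.unsqueeze(0), frame2.unsqueeze(0))
    with torch.no_grad():
        # RAFT returns a list of flow estimates; the last one is the final refinement.
        flow = raft(batch1.to(device), batch2.to(device))[-1]
    return flow.squeeze(0).cpu()

# --- Monocular depth with MiDaS (loaded via torch.hub) ---
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").eval().to(device)
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

def estimate_depth(rgb):
    """rgb: HxWx3 uint8 RGB image (e.g., cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))."""
    batch = midas_transforms.dpt_transform(rgb).to(device)
    with torch.no_grad():
        pred = midas(batch)
        # Resize the prediction back to the input resolution.
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=rgb.shape[:2], mode="bicubic", align_corners=False
        ).squeeze()
    return pred.cpu().numpy()  # relative (inverse) depth map
```

In keeping with the point above, we ran the models as-is, frame by frame, with no fine-tuning or sensor-specific adjustments.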
We hope that the results speak for themselves, but here are some of our general observations:
You might be wondering why we chose to use pre-trained models. After all, they weren’t optimized for the characteristics of this particular sensor, so it’s not surprising that they struggled with the noisy raw inputs. It’s certainly plausible we could have achieved better results by capturing hours of data, annotating it with ground truth, and retraining or fine-tuning the models. But doing so wasn’t feasible. The reality is that very few computer vision teams (including ours) have the time and resources to do that for every new sensor, configuration, or deployment.
Even more importantly, retraining wouldn’t address the deeper issue: bandwidth. One thing we’ve learned over years of working in this field is that ultra-high frame rates can significantly improve perception—especially in low-light or fast-motion scenarios. The raw data shown earlier was captured at “only” 240 fps, but the SPAD sensors we use in our most advanced demonstrations can run orders of magnitude faster. At those speeds, raw data streams can easily exceed tens or even hundreds of gigabits per second. Feeding that directly into a deep neural network simply isn’t viable.
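To make the bandwidth point concrete, here is a back-of-the-envelope calculation. The resolution and frame rate are illustrative assumptions, not the specs of any particular device:

```python
# Back-of-the-envelope raw data rate for a single-photon sensor.
# Illustrative assumptions: a 1-megapixel binary (1 bit/pixel) SPAD array
# read out at 100,000 frames per second.
pixels = 1_000_000           # 1 MP sensor
bits_per_pixel = 1           # binary photon-detection frames
frames_per_second = 100_000  # 100 kfps readout

bits_per_second = pixels * bits_per_pixel * frames_per_second
print(f"{bits_per_second / 1e9:.0f} Gbit/s")  # -> 100 Gbit/s of raw data
```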
That’s exactly where Ubicept Photon Fusion comes in. It takes high-frame-rate, noisy data and transforms it into lower-frame-rate video that’s clean, compact, and physically grounded—with structure that downstream perception models can actually trust. We’ve improved its performance by orders of magnitude since our initial demos, and through evaluations with key partners, we’ve demonstrated that it can run in real time on edge GPUs across a range of hardware platforms. And for SPAD sensors, our FLARE technology (which you can read about on our technology page) complements this process by producing optimized encodings of photon-level data.