Better Perception with Ubicept Photon Fusion

April 25, 2025
Why deep learning can still benefit from cleaner data

Executive summary

  • Ubicept Photon Fusion preserves critical visual features that are often lost in conventional denoising or temporal averaging, producing clean, compact, and physically grounded video that downstream perception models can trust.
  • Pre-trained deep learning models struggle on raw data without costly, sensor-specific retraining—and even retraining doesn’t solve the bandwidth challenges of ultra-high-frame-rate sensors.
  • Ubicept bridges the gap by converting high-bandwidth, noisy inputs into efficient video streams that run in real time on edge GPUs, enabling modern perception models to operate out of the box.

The perception problem

In our last blog post, we showed how Ubicept Photon Fusion produced more trustworthy frames than AI-based video denoisers. To summarize, here’s an image that shows the difference:

The top two frames show the tradeoff between shorter exposures and higher noise (left) versus longer exposures and lower noise (right). The bottom two frames show how an AI-based video denoiser (left) wipes out important visual details, while our approach (right) reduces noise while preserving visual fidelity.

It’s important to emphasize, however, that visual fidelity is only a means to an end. After all, our company name comes from our mission to enable ubiquitous perception! Achieving that requires preserving the features that perception systems depend on, such as corners and edges. If those features are corrupted by blur or noise, the rest of the perception pipeline can fall apart. For example, here’s what happens when we run a multiscale feature detector (KAZE) on the image above:

In the Ubicept Photon Fusion frame, the detections correspond to meaningful visual structures. These are exactly what downstream algorithms (e.g., SLAM, optical flow, visual tracking) depend on to function reliably. The same can’t be said of the other frames!
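
If you want to run a similar check on your own footage, here is a minimal sketch using OpenCV's KAZE detector (the file name is a placeholder, and we leave the detector at its default parameters rather than matching our exact settings):

```python
import cv2

# Load a frame (placeholder path) and convert to grayscale,
# since KAZE operates on single-channel images.
frame = cv2.imread("frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detect keypoints with a default-parameter KAZE detector.
kaze = cv2.KAZE_create()
keypoints = kaze.detect(gray, None)

# Draw the detections to inspect whether they land on meaningful
# structure (corners, edges) or on noise.
annotated = cv2.drawKeypoints(frame, keypoints, None, color=(0, 255, 0))
cv2.imwrite("frame_kaze.png", annotated)
print(f"Detected {len(keypoints)} keypoints")
```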

But deep learning ...

To some of you, all this discussion about preprocessing and features might seem dated. You might be thinking: with today’s deep learning systems capable of end-to-end perception, shouldn’t we be able to just feed the raw video into a neural network and let it figure everything out?

Rather than debate it in theory, why not show you first? The images below come from processing three input videos:

  • Raw input frames from an IMX287-based machine vision camera at 240 fps with 1/240 s exposures
  • Averaged frames simulating the motion blur of the same camera shooting at 30 fps with 1/30 s exposures (see the averaging sketch after this list)
  • The result of applying Ubicept Photon Fusion on the raw input frames, as detailed in our previous post
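
To make the averaging step concrete, here is a minimal sketch of how 1/30 s exposures can be simulated by averaging groups of eight consecutive 240 fps frames (directory names are placeholders; our actual pipeline may differ in the details):

```python
import glob
import os

import cv2
import numpy as np

# Load the 240 fps raw frames (placeholder directory and pattern).
paths = sorted(glob.glob("raw_240fps/*.png"))
frames = [cv2.imread(p).astype(np.float32) for p in paths]

# Averaging 8 consecutive 1/240 s exposures approximates the light
# integration (and hence the motion blur) of one 1/30 s exposure at 30 fps.
group = 240 // 30  # 8 short exposures per simulated long exposure
os.makedirs("avg_30fps", exist_ok=True)
for i in range(0, len(frames) - group + 1, group):
    avg = np.mean(frames[i:i + group], axis=0)
    out = np.clip(avg, 0, 255).astype(np.uint8)
    cv2.imwrite(f"avg_30fps/frame_{i // group:05d}.png", out)
```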

We fed these frames into RAFT for optical flow estimation and MiDaS for monocular depth estimation using their publicly available pre-trained models from GitHub:

  • Even though the lighting here is relatively bright, the hand and pen illustrate the tradeoff between noise and blur.
  • Noise dominates in this frame, but averaging does a decent job of reducing it since there isn’t much motion.
  • This frame shows a combination of high noise from low light and high blur from fast motion.
  • And here's one more frame we included for fun!

We hope that the results speak for themselves, but here are some of our general observations:

  • Raw frames (first column) cause RAFT to produce incoherent flow fields due to overwhelming noise. MiDaS performs somewhat better, generating usable depth maps with some loss of fine detail.
  • Averaged frames (second column) result in decent performance when motion is minimal. However, once motion is introduced, the blur from averaging prevents either model from resolving meaningful structure.
  • Ubicept Photon Fusion (third column) provides consistently better results. It suppresses noise and preserves structural detail, allowing both models to extract more reliable and informative outputs. While not flawless, the improvement is clear.
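
If you would like to try a comparison like this yourself, here is a minimal sketch that loads pre-trained RAFT (here via torchvision's packaging, for convenience) and MiDaS (via PyTorch Hub) and runs them on a pair of frames. The file paths, resize target, and model variants are illustrative choices on our part, not a record of our exact evaluation script:

```python
import cv2
import torch
import torchvision.transforms.functional as TF
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"

def load_rgb(path):
    # Load a frame (placeholder paths below) as an RGB uint8 array.
    return cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)

# --- Optical flow with pre-trained RAFT ---
weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).to(device).eval()

img1 = TF.to_tensor(load_rgb("frame_000.png")).unsqueeze(0)
img2 = TF.to_tensor(load_rgb("frame_001.png")).unsqueeze(0)
# RAFT expects spatial dimensions divisible by 8.
img1 = TF.resize(img1, size=[520, 960], antialias=False)
img2 = TF.resize(img2, size=[520, 960], antialias=False)
img1, img2 = weights.transforms()(img1, img2)
with torch.no_grad():
    flow = raft(img1.to(device), img2.to(device))[-1]  # final refinement iteration

# --- Monocular depth with pre-trained MiDaS ---
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").to(device).eval()
midas_transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
with torch.no_grad():
    depth = midas(midas_transform(load_rgb("frame_000.png")).to(device))

print(flow.shape, depth.shape)
```

Converting the flow field and depth map into the color visualizations shown above is left out for brevity.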

But, again, deep learning ...

You might be wondering why we chose to use pre-trained models. After all, they weren’t optimized for the characteristics of this particular sensor, so it’s not surprising that they struggled with the noisy raw inputs. It’s certainly plausible we could have achieved better results by capturing hours of data, adding ground truth, and retraining or fine-tuning the models. But that wasn’t feasible. The reality is that very few computer vision teams (including ours) have the time and resources to do that for every new sensor, configuration, or deployment.

Even more importantly, retraining wouldn’t address the deeper issue: bandwidth. One thing we’ve learned over years of working in this field is that ultra-high frame rates can significantly improve perception—especially in low-light or fast-motion scenarios. The raw data shown earlier was captured at “only” 240 fps, but the SPAD sensors we use in our most advanced demonstrations can run orders of magnitude faster. At those speeds, raw data streams can easily exceed tens or even hundreds of gigabits per second. Feeding that directly into a deep neural network simply isn’t viable.
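
As a rough back-of-the-envelope illustration (the resolution, bit depth, and frame rate below are assumptions for the sake of the example, not the specs of any particular sensor):

```python
# Hypothetical SPAD array: ~1 Mpixel, 1-bit (binary) frames, 100,000 fps.
pixels = 1024 * 1024          # assumed resolution
bits_per_pixel = 1            # binary photon-detection frames
frames_per_second = 100_000   # assumed frame rate

raw_gbps = pixels * bits_per_pixel * frames_per_second / 1e9
print(f"Raw data rate: {raw_gbps:.0f} Gbit/s")  # roughly 105 Gbit/s
```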

That’s exactly where Ubicept Photon Fusion comes in. It takes high-frame-rate, noisy data and transforms it into lower-frame-rate video that’s clean, compact, and physically grounded—with structure that downstream perception models can actually trust. We’ve improved its performance by orders of magnitude since our initial demos, and through evaluations with key partners, we’ve demonstrated that it can run in real time on edge GPUs across a range of hardware platforms. And for SPAD sensors, our FLARE technology (which you can read about on our technology page) complements this process by optimizing the encoding of photon-level data.

