
Ball (in-play) tracking with SAM3

The failure mode

A normal ball detector answers a simpler question than the one we usually care about in sports footage. It finds every ball, but it usually does not tell us which one is active in the game.

That distinction matters because real broadcasts often contain more than one real ball. There can be spare balls on the sideline, balls held by ball kids, balls near the bench, and the actual ball moving through play. Returning all of them is the correct output for a detector that was trained only to detect balls. On top of that, there can be false positives from visual clutter, reflections, or other round objects.

In the clip below, around frame 470, two balls are sitting on the sideline. SAM3 detects and tracks them. They are real balls, so this is not a false positive in the usual sense. They are just not the ball in play. Around frame 100 there is also a genuine false-positive ball. We need to guard against both of these cases.

Video: the same soccer clip shown twice, first with every SAM3 ball candidate ("All ball candidates") and then with the selected ball-in-play track ("Selected ball in play").

A detector trained directly on a ball-in-play class is possible, but it is a significantly harder target than plain ball detection. The active ball usually does not look different from a spare ball. Its status comes from motion, timing, and game context, none of which is visible to a detector that operates on single frames.

The active ball can also change during a clip. Ball x can be in play at time a, then ball y can become active shortly after, for example after the first ball goes out and a ball kid throws in another one.

We therefore keep the two jobs (detection and selection) separate. SAM3 tracks ball candidates broadly, while a second pass uses heuristics to select the candidate that appears to be in play.

Candidate tracking vs ball-in-play selection

Splitting detection from selection changes the optimization target: the candidate tracker should mainly optimize for recall, because the selection step cannot recover an active ball that was never tracked in the first place.

That makes a noisy candidate view acceptable. Spare balls, duplicate-looking tracks, and some false positives are all easier to deal with downstream than a false negative on the actual ball.

The selector can then be stricter. It ranks the candidate tracks using temporal evidence such as motion, direction changes, candidate birth times, and whether a candidate keeps behaving like the game ball over a short window. It also has to allow handoffs, since a later candidate can become the ball in play if the old ball leaves play and a new one enters.

Why SAM3

SAM3 is not the obvious choice if the only goal is speed. It is slower than a small dedicated ball detector, and for a fixed setting a specialized model is probably the better engineering choice.

The reason it is useful here is that the fixed setting rarely stays fixed. Sports footage changes a lot across cameras, zoom levels, lighting, venues, compression, and level of play. A tracker trained for one distribution can degrade quickly when the video comes from a different setup.

In testing, SAM3 has been more robust across those shifts than most dedicated ball trackers we tried, probably because it was trained on an enormous dataset. The same model can track candidates in soccer, tennis, basketball, and other sports without retraining. It works in normal broadcast views as well as zoomed-in views where the ball occupies a very different number of pixels, and on professional as well as amateur footage.

The primitive we use from SAM3 is also a good fit for the split above. We can prompt for ball, let SAM3 propagate object IDs and masks over time, and treat new detections as new candidate tracks instead of forcing the model to return a single answer.
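As a rough illustration of "new detections become new candidate tracks", the sketch below groups per-frame output by object ID. It does not call SAM3. The input shape, an iterable of `(frame_index, {object_id: mask})` pairs, and the function names are assumptions made for this example, not the repository's actual data format.

```python
from collections import defaultdict


def collect_candidates(frames_output):
    """Group per-frame objects into candidate tracks keyed by object ID.

    frames_output is assumed to be an iterable of
    (frame_index, {object_id: mask}) pairs (hypothetical format).
    """
    tracks = defaultdict(dict)
    for frame_index, objects in frames_output:
        for object_id, mask in objects.items():
            # An unseen object ID simply starts a new candidate track.
            tracks[object_id][frame_index] = mask
    return dict(tracks)


def birth_frames(tracks):
    """Return the first frame in which each candidate track appears."""
    return {object_id: min(frames) for object_id, frames in tracks.items()}
```

Keeping every object ID as its own track means nothing is discarded at this stage, which matches the recall-first goal of the candidate tracker.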

Selecting the ball in play

Once the candidate tracks exist, selection becomes a smaller problem. For each candidate, we compute the mask centroid in every frame and pass that trajectory to a sport-specific heuristic. The heuristic returns frames where the candidate has evidence of being the live ball.
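The centroid step could look roughly like the sketch below. To stay dependency-free it represents a mask as nested lists of 0/1 values, which is an assumption for illustration; real masks would more likely be arrays. The names `mask_centroid` and `track_trajectory` are hypothetical.

```python
from typing import Optional, Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixels, a stand-in for a real mask array
Point = tuple[float, float]


def mask_centroid(mask: Mask) -> Optional[Point]:
    """Return the (x, y) mean of all foreground pixels, or None for an empty mask."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)


def track_trajectory(masks_by_frame: dict[int, Mask]) -> dict[int, Point]:
    """Map frame index to centroid, skipping frames where the mask is empty."""
    trajectory = {}
    for frame in sorted(masks_by_frame):
        centroid = mask_centroid(masks_by_frame[frame])
        if centroid is not None:
            trajectory[frame] = centroid
    return trajectory
```

Frames with an empty mask are dropped rather than interpolated, so any heuristic that consumes the trajectory has to tolerate gaps between frame indices.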

We use trajectory changepoints as the basic event. A ball that is actually in play tends to have moments where its direction changes because it was kicked, hit, bounced, passed, or otherwise interacted with. A spare ball sitting on the sideline may be detected perfectly, but it will not produce many useful events.
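One simple way to find such changepoints is to measure the turn angle between consecutive steps of the centroid trajectory. The sketch below does that. The angle and minimum-step thresholds are illustrative assumptions, and the actual changepoint detection used in the project may differ.

```python
import math


def direction_changepoints(trajectory, min_angle_deg=30.0, min_step=1.0):
    """Return frames where the heading turns by at least min_angle_deg.

    trajectory maps frame index -> (x, y) centroid. Thresholds are
    illustrative defaults, not tuned values.
    """
    frames = sorted(trajectory)
    events = []
    for prev, cur, nxt in zip(frames, frames[1:], frames[2:]):
        (x0, y0), (x1, y1), (x2, y2) = trajectory[prev], trajectory[cur], trajectory[nxt]
        ax, ay = x1 - x0, y1 - y0  # incoming step
        bx, by = x2 - x1, y2 - y1  # outgoing step
        if math.hypot(ax, ay) < min_step or math.hypot(bx, by) < min_step:
            continue  # nearly stationary: the heading would be mostly noise
        # Unsigned angle between the two steps, from cross and dot products.
        angle = abs(math.degrees(math.atan2(ax * by - ay * bx, ax * bx + ay * by)))
        if angle >= min_angle_deg:
            events.append(cur)
    return events
```

The minimum-step check is what keeps a stationary spare ball quiet: its centroid jitters by a fraction of a pixel, so no step is long enough to define a heading.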

The selector processes candidate tracks in chronological order. A track with no events is ignored. The first track with evidence becomes the active track, and later tracks can replace it if they produce stronger evidence over the comparison window. This is what allows the selected ball to change when a new ball enters play.
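The selection loop described above can be sketched as follows. This is a minimal version under stated assumptions: "stronger evidence" is taken to mean more event frames inside a fixed window starting at the challenger's first event, and the `Candidate` class, `window` default, and segment format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    track_id: int
    first_frame: int  # birth time of the track
    event_frames: list[int] = field(default_factory=list)


def count_events(candidate, start, end):
    """Number of event frames in the half-open range [start, end)."""
    return sum(start <= frame < end for frame in candidate.event_frames)


def select_in_play(candidates, window=60):
    """Return (start_frame, track_id) segments for the selected ball."""
    segments = []
    active = None
    for cand in sorted(candidates, key=lambda c: c.first_frame):
        if not cand.event_frames:
            continue  # no evidence of play, e.g. a spare ball on the sideline
        start = min(cand.event_frames)
        if active is None:
            active = cand
            segments.append((start, cand.track_id))
            continue
        # Compare both tracks over the window that starts at the challenger's first event.
        end = start + window
        if count_events(cand, start, end) > count_events(active, start, end):
            active = cand
            segments.append((start, cand.track_id))
    return segments
```

With this rule an old ball that has left play stops producing events, so a newly thrown-in ball wins the comparison as soon as it is kicked a few times.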

The important part is that the selector does not need to know the rules of every sport. It only needs a function that scores candidate tracks. Soccer, tennis, basketball, or another sport can provide different heuristics while keeping the same candidate-tracking and selection code.

Sport-specific heuristics

The sport-specific part is deliberately small: for each candidate track, return the frames where that track looks like it is involved in play. The rest of the selection code can stay the same across sports.
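One way to express that interface is a plain function type plus a registry keyed by sport name, as in the sketch below. Every name here is hypothetical, and the signature is simplified: a heuristic that also needs player masks would take extra context. The registered heuristic is a placeholder that only demonstrates the plumbing.

```python
from typing import Callable

Point = tuple[float, float]
Trajectory = dict[int, Point]  # frame index -> ball centroid
EventHeuristic = Callable[[Trajectory], list[int]]  # returns event frames

HEURISTICS: dict[str, EventHeuristic] = {}


def register(sport: str):
    """Decorator that stores a heuristic under a sport name."""
    def decorator(fn: EventHeuristic) -> EventHeuristic:
        HEURISTICS[sport] = fn
        return fn
    return decorator


@register("example")
def every_tenth_frame(trajectory: Trajectory) -> list[int]:
    """Placeholder heuristic, not a real sport rule."""
    return [frame for frame in sorted(trajectory) if frame % 10 == 0]


def event_frames(sport: str, trajectory: Trajectory) -> list[int]:
    """Look up the heuristic for a sport and run it on one candidate track."""
    try:
        heuristic = HEURISTICS[sport]
    except KeyError:
        raise ValueError(f"no heuristic registered for sport {sport!r}") from None
    return heuristic(trajectory)
```

Adding a sport then means registering one more function. The shared selection code only ever sees lists of event frames.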

For soccer, the useful signal is a direction change near a player. The code builds a union of SAM3 player masks for each frame, dilates that region a bit, and only counts a ball changepoint if the ball mask overlaps it. This works well as a proxy for kicks, tackles, and deflections near players.
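The player-proximity filter can be sketched as below. For simplicity, masks are represented as sets of `(x, y)` pixel coordinates and dilation uses a square neighbourhood. Both are assumptions for this example, as are the function names and the default radius.

```python
Pixel = tuple[int, int]


def dilate(pixels: set[Pixel], radius: int) -> set[Pixel]:
    """Grow a pixel set by `radius` pixels in every direction (square neighbourhood)."""
    offsets = [
        (dx, dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
    ]
    return {(x + dx, y + dy) for (x, y) in pixels for (dx, dy) in offsets}


def near_player_events(changepoints, ball_pixels, player_pixels, radius=2):
    """Keep only changepoints where the ball mask touches the dilated player region.

    ball_pixels:   frame -> set of ball pixels
    player_pixels: frame -> list of pixel sets, one per player
    """
    events = []
    for frame in changepoints:
        players = player_pixels.get(frame, [])
        union = set().union(*players)  # union of all player masks in this frame
        region = dilate(union, radius)
        if region & ball_pixels.get(frame, set()):
            events.append(frame)
    return events
```

The dilation radius controls how close "near a player" is. Too small and clean kicks are missed because the ball mask never quite touches the boot; too large and a ball rolling past a player starts to count.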

Tennis needs a different heuristic, because racket contact often happens away from the body mask. There we count fast enough direction changes directly from the ball trajectory.
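A trajectory-only version of "fast enough direction changes" might look like the sketch below. It is one possible reading, not the project's implementation: an event is a sign flip in either velocity component, with the ball moving at least `min_speed` pixels per frame on both sides of the flip. The threshold and function name are assumptions.

```python
import math


def fast_direction_changes(trajectory, min_speed=8.0):
    """Return frames where velocity reverses along x or y at sufficient speed.

    trajectory maps frame index -> (x, y) centroid. Velocities are in
    pixels per frame and account for gaps between observed frames.
    """
    frames = sorted(trajectory)
    events = []
    for prev, cur, nxt in zip(frames, frames[1:], frames[2:]):
        (x0, y0), (x1, y1), (x2, y2) = trajectory[prev], trajectory[cur], trajectory[nxt]
        dt_in, dt_out = cur - prev, nxt - cur
        vx_in, vy_in = (x1 - x0) / dt_in, (y1 - y0) / dt_in
        vx_out, vy_out = (x2 - x1) / dt_out, (y2 - y1) / dt_out
        fast = (
            math.hypot(vx_in, vy_in) >= min_speed
            and math.hypot(vx_out, vy_out) >= min_speed
        )
        reversed_axis = vx_in * vx_out < 0 or vy_in * vy_out < 0
        if fast and reversed_axis:
            events.append(cur)
    return events
```

The speed requirement is what separates a hit or bounce from a ball that is slowly rolling or being carried, since those can also change direction.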

Video: the same tennis clip shown twice, first with every SAM3 ball candidate ("All ball candidates") and then with the selected ball-in-play track ("Selected ball in play").

Basketball would probably need another version of the same idea, but the interface stays simple. The sport code marks useful event frames, and the shared selector handles candidate ranking and handoffs.

Cleaning up drifted tracks

One failure mode showed up repeatedly when using SAM3 for ball tracking: a ball track can drift onto a player, most often around the feet. After an occlusion, contact, or a few ambiguous frames, the track may stop following the ball and start following the boot or the lower body instead.

Frames 640-690 from the all-candidates render, where SAM3 starts tracking a player's foot as the ball and the cleanup heuristic can terminate the track.

The fix is a simple heuristic. The cleanup pass detects players with SAM3 as well, keeps player boxes and masks for every frame, and checks whether the centroid of a candidate ball mask has moved inside any player box. That condition is not enough by itself: a real ball can pass in front of a player, and in soccer it will often be close to a player's feet. The track only gets killed when this player-attachment signal persists over a short window and the SAM keep-alive signal has already dropped to its minimum value.
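The kill condition can be sketched as a run-length check. The code below assumes the keep-alive signal is available as an integer per frame and that boxes are `(x_min, y_min, x_max, y_max)` tuples. The `FrameObs` class, the window size, and the minimum value are all hypothetical.

```python
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


@dataclass
class FrameObs:
    frame: int
    centroid: tuple[float, float]  # ball mask centroid
    player_boxes: list[Box]
    keep_alive: int  # assumed per-frame keep-alive value for this track


def inside_any(point, boxes):
    """True if the point lies inside at least one box."""
    x, y = point
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in boxes)


def find_drift_start(observations, window=5, keep_alive_min=0):
    """Return the frame where a sustained player-attachment run began, or None.

    The track is only flagged when the centroid has been inside a player box
    for at least `window` consecutive observations AND keep-alive is at its
    minimum.
    """
    run_start = None
    run_length = 0
    for obs in observations:
        if inside_any(obs.centroid, obs.player_boxes):
            if run_length == 0:
                run_start = obs.frame
            run_length += 1
        else:
            run_start = None
            run_length = 0
        if run_length >= window and obs.keep_alive <= keep_alive_min:
            return run_start
    return None
```

Requiring both conditions is what protects a real ball at a player's feet: a short dribble may keep the centroid inside a box, but as long as keep-alive stays above its minimum the track survives.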

When that happens, the object is removed from the live candidate set, and the version of the track used for selection is trimmed back to the last frame before the drift started. The raw candidate render can still show what happened, but the selector does not get to use the frames where the ball track has turned into part of a player. Empirically this works well: in our testing it has not killed any actual ball tracks.
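Trimming amounts to building a second, shorter view of the trajectory while leaving the raw one untouched. A minimal sketch, assuming the trajectory is a dict from frame index to centroid and the drift start frame is already known (the function name is hypothetical):

```python
def trim_before(trajectory, drift_start):
    """Return a copy of the trajectory with only frames before drift_start."""
    return {frame: point for frame, point in trajectory.items() if frame < drift_start}
```

Because this returns a new dict, the raw track can still be rendered in full while the selector only sees the trimmed copy.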

Code

The code for this is available at github.com/holma91/sam3-ball-tracking.

The CLI takes a video and a sport, runs candidate tracking, selection, and rendering, then writes the all-candidates video, the selected ball-in-play video, and the intermediate artifacts used by the selector.

uv run sam3-ball-track examples/videos/soccer/clip-1.mp4 --sport soccer
uv run sam3-ball-track examples/videos/tennis/clip-1.mp4 --sport tennis
uv run sam3-ball-track examples/videos/soccer/clip-1.mp4 --sport soccer --sam-version sam3.1