Post by Chinmay Keshava Lalgudi.
Drone imagery offers an efficient way to gather data on mobile animals. Drones are used for population surveys, creating 3D models of habitat, and even studying how animals move and behave in their environment. While collecting this data is relatively easy, manually annotating it is painstaking and slow: analysing drone imagery often means spending hours in front of a computer annotating footage. In fact, we found that just drawing bounding boxes around sharks can take hours per minute of video – valuable time that would be better spent thinking about interesting scientific questions.
Our team has focused on developing a technique to analyse sharks in Santa Elena Bay, Costa Rica – a critical habitat for the endangered Pacific nurse shark. Using drones has provided us the rare opportunity to study the movement ecology of this understudied species.

Why deep learning models fail in the wild
Many studies use deep learning approaches to analyse drone imagery of animals, but they typically rely on species-specific models trained on data from a single habitat. These models often perform well in the exact setting they were trained on, yet they can be highly sensitive to changes in lighting, contrast, distractors (such as new vegetation or tidal movement), and individual variation. For example, a model trained on nurse shark imagery collected around midday might struggle on imagery from the exact same location collected around sunset.
Indeed, when we first trained our own models, we found exactly this. While we could train models with strong in-domain performance (on data collected in the same area under the same conditions), these specialised models struggled badly out of domain.
These models are also challenging to work with because they must constantly be trained and retrained as conditions change. Every time you want to study a new species, work in a different location, or analyse footage collected under different weather conditions, you essentially need to start from scratch – collect new training data, manually annotate thousands of images, and spend days or weeks training and validating a new model. This creates a major barrier for scientists trying to focus on biological questions. Therefore, we set out to develop a generalizable solution for researchers – something that doesn’t require training and is easy for anyone to use!
The power of foundation models
Powerful systems like GPT (which works with language) and CLIP (which connects images and text) have completely changed how we solve problems using AI. Instead of needing to be retrained for every new task, these models learn from huge, diverse datasets and can often handle new challenges right away — a skill known as “zero-shot” learning.
The foundation model Segment Anything Model 2 (SAM 2) caught our attention because of its video understanding capabilities. Unlike traditional image segmentation models that analyse each frame one-by-one, SAM 2 can use information from earlier frames to keep track of objects as they move through a video.
This temporal awareness is extremely important for biological research – when a shark briefly disappears behind a wave or becomes obscured by water glare, SAM 2 can use its “memory” of where the animal was in previous frames to maintain the track rather than losing it entirely. We found that SAM 2 worked especially well in our challenging coastal marine environment, where animals are always moving through shifting water conditions — with changing reflections, backgrounds, shadows, and surface ripples.
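To build intuition for why this memory matters, here is a minimal, self-contained sketch – not SAM 2's actual mechanism, which attends over learned feature embeddings rather than boxes. It shows a toy tracker that bridges short detection gaps by remembering the last confident box; all names and thresholds are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter) if inter else 0.0

def track_with_memory(detections, max_gap=3, min_iou=0.3):
    """detections: one box (or None, for a missed frame) per frame.
    Returns the frame indices kept in a single continuous track,
    tolerating up to `max_gap` consecutive missed frames – e.g. a
    shark briefly hidden by a wave or surface glare."""
    track, last_box, gap = [], None, 0
    for t, box in enumerate(detections):
        if box is None:
            gap += 1
            if gap > max_gap:
                break  # memory expired: the track is lost
            continue
        if last_box is None or iou(last_box, box) >= min_iou:
            track.append(t)
            last_box, gap = box, 0
    return track
```

A frame-by-frame detector with no memory would end the track at the first missed frame; the memory lets the track survive a two-frame occlusion and reattach when the animal reappears nearby.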
FLAIR
Our new study offers just this: a video processing pipeline called Frame Level AlIgnment and tRacking (FLAIR). FLAIR combines SAM 2's segmentation capabilities with CLIP's ability to classify images. By passing language prompts (e.g. "a shark swimming in clear water") through CLIP, our pipeline guides SAM 2 to focus on the right objects. The key innovation is FLAIR's alignment strategy – if a candidate shark is identified by CLIP and tracked by SAM 2 at multiple timepoints, it is likely a real shark rather than a false positive, such as a shadow or piece of debris.
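The alignment idea can be sketched as a temporal consistency filter. In this toy version the scores are made up, standing in for real CLIP image-text similarities, and the threshold values are purely illustrative:

```python
def align_candidates(scores_per_frame, score_thresh=0.5, min_hits=3):
    """scores_per_frame: dict mapping a candidate object id to its
    per-frame prompt-similarity scores (here mocked; in practice these
    would come from CLIP). A candidate is kept only if it matches the
    prompt at >= min_hits timepoints, suppressing one-off false
    positives such as shadows or floating debris."""
    kept = []
    for cid, scores in scores_per_frame.items():
        hits = sum(s >= score_thresh for s in scores)
        if hits >= min_hits:
            kept.append(cid)
    return kept

candidates = {
    "shark_1": [0.8, 0.7, 0.9, 0.6],  # consistently shark-like
    "shadow":  [0.6, 0.1, 0.2, 0.1],  # one spurious high score
}
print(align_candidates(candidates))  # only "shark_1" survives
```

A single high score is not enough: agreement between the language model and the tracker over time is what separates a real animal from a transient distractor.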
FLAIR dramatically outperforms traditional approaches, especially in challenging real-world conditions. FLAIR also generalizes to several shark species, including white and blacktip reef sharks, as well as other animals entirely, such as zebras!

We conducted a user study to see how much FLAIR really speeds up annotation time. While labelling a typical 5-minute drone video would take more than 20 hours of manual effort, FLAIR completes the entire segmentation process in just under an hour of near-fully automated processing.
We also show it’s possible to calculate biometrics like body length (essential for population demographics) and tail beat frequency (which can reveal energy expenditure and swimming efficiency) from the accurate segmentation masks our system generates. This opens doors to research questions that would otherwise be extremely time-consuming to study.
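As a rough sketch of how such biometrics could be derived from masks – the principal-axis length estimate and FFT-peak frequency estimate below are our illustrative assumptions, not necessarily the paper's exact method:

```python
import numpy as np

def body_length_m(mask, gsd_m_per_px):
    """Estimate body length from a binary segmentation mask: the extent
    of the mask's pixels along their principal axis, scaled by the
    drone's ground sample distance (metres per pixel, assumed known
    from altitude and camera parameters)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    # principal axis = top eigenvector of the pixel covariance matrix
    _, vecs = np.linalg.eigh(np.cov(pts.T))
    axis = vecs[:, -1]  # direction of largest variance
    proj = pts @ axis
    return (proj.max() - proj.min()) * gsd_m_per_px

def tailbeat_hz(lateral_disp, fps):
    """Dominant tail-beat frequency from the tail tip's lateral
    displacement over time, taken as the largest FFT peak (DC term
    excluded)."""
    sig = np.asarray(lateral_disp, float)
    sig -= sig.mean()
    spec = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fps)
    return freqs[1:][np.argmax(spec[1:])]
```

With per-frame masks from the pipeline, both quantities come essentially for free: length from any single good mask, and tail-beat frequency from how a landmark on the mask oscillates across frames.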
We built FLAIR to be accessible to everyone. To help researchers get started quickly, we provide two ways to use FLAIR: a Google Colab notebook and a Python workflow. With either option, you can import your video, enter a prompt, and start tracking!
To learn more, check out our paper and tool.
Post edited by swifenwe and Prayer Kanyile.

