The patent addresses a common problem in both NLP and computer vision: the selection of data for model training.
In computer vision, this data is often obtained from client video streams. A video stream is a sequence of images (or frames) ordered by time. Until now, training images were typically sampled at a fixed interval (for example, every X frames), without considering the content of the image or any other signal that would indicate whether a frame could help improve the model's performance. This patent establishes several pipelines to select the frames that are likely to have the greatest positive impact on the model. These pipelines include:
-Filtering with VLMs: A vision-language model (VLM) is asked questions about the scene to identify contextual or environmental elements. This method is particularly useful for night scenes or scenes with complex patterns, for example.
-Clustering in embedding space: Dimensionality reduction is applied to the image embeddings, which are then grouped into clusters. Using the median of each cluster, images that carry more information (or information different from the rest) can be identified.
-Optical flow: In a closed domain, the optical flow between two consecutive frames is computed, which reveals the objects in motion (given prior knowledge of which objects in the scene can move). A detector and a segmenter are then applied to obtain the segmentation masks of the objects under study. Finally, logical operations are used to find the frames with the largest difference between the optical-flow masks and the detection-plus-segmentation masks, which are the frames where the model performs worst.
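The mask comparison in the optical-flow pipeline can be sketched as follows. This is only a minimal illustration, not the patented method: it assumes the motion masks (for instance, thresholded optical-flow magnitude) and the detection-plus-segmentation masks have already been computed as binary arrays of the same shape. The function names `disagreement_score` and `select_frames`, and the choice of scoring by mismatched pixels relative to the union of both masks, are hypothetical; the text above only states that logical operations are used to find the largest difference.

```python
import numpy as np


def disagreement_score(motion_mask, detection_mask):
    """Share of pixels flagged by exactly one mask, relative to their union.

    Assumption: both inputs are binary arrays of identical shape.
    0.0 means the masks agree fully, 1.0 means they do not overlap at all.
    """
    motion = np.asarray(motion_mask).astype(bool)
    detected = np.asarray(detection_mask).astype(bool)
    union = np.logical_or(motion, detected).sum()
    if union == 0:
        # Nothing moved and nothing was detected: no evidence of a miss.
        return 0.0
    mismatch = np.logical_xor(motion, detected).sum()
    return float(mismatch / union)


def select_frames(motion_masks, detection_masks, top_k):
    """Return the indices of the top_k frames with the highest disagreement,
    together with the score of every frame."""
    scores = [
        disagreement_score(m, d)
        for m, d in zip(motion_masks, detection_masks)
    ]
    # Python's sort is stable, so ties keep their original frame order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k], scores
```

A frame in which something moves but nothing is detected scores 1.0 and is selected first, while a frame whose two masks coincide scores 0.0. Normalizing by the union keeps large and small objects comparable; a raw count of mismatched pixels would be an equally valid reading of "difference".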
Status: submitted to the European Patent Office.