The Kashtanka.pet project addresses the problem of efficiently searching for lost pets. Numerous platforms and social-network groups collect ads about missing and found cats and dogs, yet it is often nearly impossible for a human to find a specific lost pet among the millions of ads about found pets scattered across these websites. We are developing a system that helps pet owners and volunteers find lost pets with the help of AI: it crawls websites for ads about lost and found pets and retrieves pairs of ads announcing that the same pet was lost and then found. The retrieved pairs are then inspected and further processed by humans. On the poster we present the architecture of the Kashtanka.pet system.

We then address the problem of evaluating and improving the quality of the underlying AI models for lost-pet retrieval. Standard manual annotation of a dataset for our task would require finding the matching pairs of lost and found ads, which makes annotation prohibitively difficult. Instead, we generate matching pairs automatically by splitting the photo sets of ads that contain several photos into two parts. Simple random splitting, however, often leaves both parts with photos taken in the same place. This may encourage models to match the background rather than the pet, and it makes the setup unrealistically easy, since in reality a pet is lost and found in different places. To mitigate this, we propose a method that identifies ads whose photos were taken in several places and splits them accordingly (see the first sketch below). To estimate the quality of this method and select its hyperparameters, we additionally annotated pairs of photos from random ads, asking annotators whether the photos were taken in the same place or in different places.

Several methods for the retrieval task are proposed and compared. The pipeline currently deployed at kashtanka.pet is based on YOLOv4 for pet detection and cropping, EfficientNet for computing embeddings of the crops, and a GRU for aggregating those embeddings across the several images of one ad; the model is trained with a triplet loss on the target dataset (see the second sketch below). Two other methods employ the pre-trained multimodal BLIP and SLIP models, recently introduced improvements over the popular CLIP model (see the third sketch below). We found that, even without fine-tuning on data from our target domain, the image embeddings from the multimodal models significantly outperform the currently deployed pipeline.
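First sketch: a minimal Python illustration (not the exact method used in the project) of the place-aware splitting idea. Given background-oriented embeddings of an ad's photos, it seeds two groups with the most dissimilar pair of photos and assigns the remaining photos to the closer seed, refusing to split when all photos look alike. The embedding source, the min_gap hyperparameter, and the helper names are assumptions for illustration.

```python
# Hypothetical sketch of place-aware splitting of one ad's photos into two
# pseudo-ads ("lost" / "found"), so that photos taken in the same place stay
# on the same side. Not the project's actual implementation.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def place_aware_split(photo_embs: list[np.ndarray], min_gap: float = 0.15):
    """Return (part_a, part_b) photo indices, or None if the ad does not
    look like it contains photos taken in two distinct places."""
    n = len(photo_embs)
    if n < 2:
        return None
    sims = [[cosine_sim(photo_embs[i], photo_embs[j]) for j in range(n)]
            for i in range(n)]
    # Seed the two groups with the most dissimilar pair of photos.
    seed_a, seed_b = min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: sims[p[0]][p[1]],
    )
    # Reject ads whose photos all look alike; min_gap is a tunable hyperparameter.
    if sims[seed_a][seed_b] > 1.0 - min_gap:
        return None
    part_a, part_b = [seed_a], [seed_b]
    for k in range(n):
        if k in (seed_a, seed_b):
            continue
        (part_a if sims[k][seed_a] >= sims[k][seed_b] else part_b).append(k)
    return part_a, part_b
```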
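Second sketch: a minimal PyTorch illustration of the deployed pipeline's aggregation and training step, under assumed shapes and hyperparameters. Pet crops (produced by a detector such as YOLOv4) are encoded with an EfficientNet backbone, the sequence of per-photo embeddings of one ad is aggregated with a GRU, and the resulting ad embeddings are trained with a triplet loss. The margin, embedding size, and backbone variant are illustrative assumptions, not the production settings.

```python
# Hypothetical sketch of the EfficientNet + GRU ad encoder trained with a triplet loss.
import torch
import torch.nn as nn
import torchvision

class AdEncoder(nn.Module):
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        # weights=None keeps the sketch offline; pretrained weights would normally be loaded.
        backbone = torchvision.models.efficientnet_b0(weights=None)
        backbone.classifier = nn.Identity()            # 1280-d feature per crop
        self.backbone = backbone
        self.gru = nn.GRU(1280, emb_dim, batch_first=True)

    def forward(self, crops: torch.Tensor) -> torch.Tensor:
        # crops: (num_photos, 3, H, W) — pet crops from a single ad
        feats = self.backbone(crops)                   # (num_photos, 1280)
        _, h_n = self.gru(feats.unsqueeze(0))          # aggregate across the ad's photos
        return nn.functional.normalize(h_n[-1], dim=-1)  # (1, emb_dim)

model = AdEncoder()
# Toy forward/backward pass with random tensors standing in for real photo crops.
anchor   = model(torch.randn(4, 3, 224, 224))   # photos from the "lost" ad
positive = model(torch.randn(3, 3, 224, 224))   # same pet, "found" ad
negative = model(torch.randn(5, 3, 224, 224))   # a different pet
loss = nn.TripletMarginLoss(margin=0.3)(anchor, positive, negative)
loss.backward()
```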
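Third sketch: a minimal illustration of zero-shot retrieval with pre-trained multimodal image embeddings. CLIP is used here only as a widely available stand-in for the BLIP and SLIP models named in the abstract; the checkpoint name, the mean-pooling of per-photo embeddings into one ad embedding, and the cosine-similarity ranking are assumptions for illustration.

```python
# Hypothetical sketch of ranking "found" ads for one "lost" ad with frozen
# multimodal image embeddings (CLIP as a stand-in for BLIP/SLIP).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def ad_embedding(photo_paths: list[str]) -> torch.Tensor:
    """Mean-pool the normalized image embeddings of all photos in one ad."""
    images = [Image.open(p).convert("RGB") for p in photo_paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)                 # (num_photos, dim)
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return torch.nn.functional.normalize(feats.mean(dim=0), dim=-1)

def rank_found_ads(lost_photos: list[str], found_ads: dict[str, list[str]]):
    """Return found-ad ids sorted by cosine similarity to the lost ad."""
    query = ad_embedding(lost_photos)
    scores = {ad_id: float(query @ ad_embedding(photos))
              for ad_id, photos in found_ads.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```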