A vast amount of audio-visual data is available on the Internet thanks to video streaming services to which users upload their content. However, exploiting this data for supervised learning is difficult due to the lack of labels: neural networks require large amounts of labeled data to estimate their parameters reliably. Crowd-sourcing platforms such as Amazon Mechanical Turk and Figure Eight offer annotation services through remote workers, but such services are expensive and time-consuming, particularly for video streams with huge numbers of frames.
Unlabeled video in the wild presents a valuable, yet so far unharnessed, source
of information for learning vision tasks. We present the first attempt of fully
self-supervised learning of object detection using transcripts in videos without any
manual object annotation. We pose the problem as learning
from weakly and noisily labeled data, and propose a novel model that
can cope with high noise levels yet train a classifier to localize the object of
interest in the video frames, without any manual labeling involved.
Dr. Rami Ben-Ari is the computer vision and deep learning technical lead of the Video-AI group at the IBM Haifa Research Lab. He joined IBM Research in 2014 after holding several research positions at Israeli companies.
Rami holds a PhD in Applied Mathematics from Tel-Aviv University, in the field of computer vision. His main research area is computer vision, with a background in stereo vision, shape from X, optical flow, visual tracking, and segmentation. More recently, his research has focused on applications of deep learning in medical imaging.
Currently, he works on action recognition and multi-modal learning in video.
Rami serves as an adjunct professor at Bar-Ilan University in Israel and has authored and co-authored over 30 papers in peer-reviewed journals and conferences.