Machine learning is the science of getting computers to act without being explicitly programmed. While traditional software solutions rely on sets of rules for executing an algorithm in order to solve a problem, machine-learning-based systems rely on given examples, i.e. labeled data, from which the algorithm "learns" how to correctly identify new instances. The quality and quantity of the labeled data significantly affect the precision and recall of the resulting solution. Collecting large amounts of high-quality labeled data is a well-known challenge that raises several questions. For example: What labeled data is required for solving a specific task? How can we measure the quality of the generated labeled data? How can we measure the quality of the labeling output of a specific labeler? In this lecture we will examine the importance of high-quality labeled data, compare two main data-labeling procedures (exhaustive and retrospective labeling), and discuss the unique issues that arise when outsourcing large-scale labeling tasks to the crowd, e.g. on an outsourcing platform such as Crowdflower.
Yoav Kantor completed his M.Sc. at the Technion in 2013 and has worked at the IBM Research - Haifa lab ever since.
He is part of the Debating Technologies team, which develops a machine-learning-based system that, given a controversial topic, can automatically generate relevant persuasive arguments by scanning massive text corpora. He will be happy to share the experience and insights he gained from implementing an automatic large-scale labeling mechanism, which the project team uses on a daily basis.