Big datasets such as IoT device and server logs are increasingly stored in object storage for its low cost and effectively unlimited capacity. This talk discusses techniques for efficient SQL analytics on these datasets directly in object storage using Apache Spark. We first cover how to store the data using modern formats such as Parquet and ORC, how to apply techniques such as Hive-style partitioning, and how best to organize the data into objects for maximum efficiency. We then cover a technique called data skipping, which collects metadata for each object in order to skip objects that are irrelevant to a SQL query. These techniques can significantly reduce the amount of data scanned for selective queries, which improves performance and reduces cost.
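As a rough illustration of the data skipping idea described above, the following minimal sketch uses plain Python rather than Spark. It assumes min/max values per object as the collected metadata, which is only one possible kind of summary and is not necessarily what the talk presents. The object paths, the `temp` column, and the names `ObjectMeta`, `collect_metadata`, and `objects_to_scan` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ObjectMeta:
    """Hypothetical per-object summary: min and max of the 'temp' column."""
    path: str
    min_temp: float
    max_temp: float


def collect_metadata(objects):
    """Read each object's rows once and record min/max for 'temp'."""
    return [
        ObjectMeta(
            path,
            min(row["temp"] for row in rows),
            max(row["temp"] for row in rows),
        )
        for path, rows in objects.items()
    ]


def objects_to_scan(metadata, threshold):
    """Return paths of objects that may hold rows with temp > threshold.

    An object whose maximum is not above the threshold cannot match,
    so it is skipped without being read.
    """
    return [m.path for m in metadata if m.max_temp > threshold]


# Hypothetical objects laid out with Hive-style partitioning (dt=... in the path).
# In-memory rows stand in for the contents of Parquet or ORC objects.
objects = {
    "logs/dt=2019-01-01/part-0.parquet": [{"temp": 12.0}, {"temp": 18.5}],
    "logs/dt=2019-01-02/part-0.parquet": [{"temp": 21.0}, {"temp": 35.2}],
    "logs/dt=2019-01-03/part-0.parquet": [{"temp": 9.1}, {"temp": 14.3}],
}

metadata = collect_metadata(objects)

# Query: SELECT * FROM logs WHERE temp > 30
candidates = objects_to_scan(metadata, 30.0)
print(candidates)  # ['logs/dt=2019-01-02/part-0.parquet']
```

In this sketch only one of the three objects needs to be read for the query, which is the kind of reduction in scanned data that matters for selective queries.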
Paula Ta-Shma holds Ph.D. and M.Sc. degrees in computer science from the Hebrew University of Jerusalem. She is a member of the IBM Cloud and Data Technologies group at IBM Research, Haifa, where she leads research efforts on data skipping and big data layout on Cloud Object Storage. Her work has been presented at multiple industry conferences, including the Apache Spark Summit, the OpenStack Summit, and IBM Think, as well as at academic conferences such as FAST and SYSTOR. Dr. Ta-Shma’s primary areas of interest include big data analytics in the cloud, data ingestion and layout, cloud storage, and IoT.