Efficient SQL Analytics on Big Data in Object Storage

Dr. Paula Ta-Shma

Abstract

Big datasets such as IoT device and server logs are increasingly stored in object storage for its low cost and virtually unlimited capacity. This talk discusses techniques for efficient SQL analytics on these datasets directly in object storage using Apache Spark. We first cover how to store the data using modern formats such as Parquet and ORC, how to apply techniques such as Hive-style partitioning, and how best to organize the data into objects for maximum efficiency. We then cover a technique called data skipping, which collects metadata for each object in order to skip objects that are irrelevant to a SQL query. These techniques can significantly reduce the amount of data scanned for selective queries, which boosts performance and reduces cost.
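As a rough illustration of the data skipping idea described above, the sketch below keeps a minimum and maximum value per object for one column and uses them to decide which objects a range query needs to read. It is a minimal sketch in plain Python, not the implementation presented in the talk: the ObjectStats class, the objects_to_scan function, the "temp" column and the object names (laid out in Hive-style dt=... form) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ObjectStats:
    """Hypothetical per-object metadata: min and max of a single column."""
    name: str
    min_value: float
    max_value: float


def objects_to_scan(index, low, high):
    """Return names of objects whose [min, max] range overlaps [low, high].

    Objects whose range cannot contain a matching row are skipped.
    """
    return [
        stats.name
        for stats in index
        if stats.max_value >= low and stats.min_value <= high
    ]


# Hypothetical metadata index for a "temp" column over three objects.
index = [
    ObjectStats("dt=2020-01-01/part-0.parquet", 10.0, 25.0),
    ObjectStats("dt=2020-01-01/part-1.parquet", 30.0, 45.0),
    ObjectStats("dt=2020-01-02/part-0.parquet", 20.0, 35.0),
]

# Corresponds to a query such as: SELECT ... WHERE temp BETWEEN 40 AND 50
selected = objects_to_scan(index, 40.0, 50.0)
print(selected)
```

In this example only one of the three objects can contain rows with temp between 40 and 50, so the other two never need to be read from object storage.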

Speaker

Paula Ta-Shma holds Ph.D. and M.Sc. degrees in computer science from the Hebrew University of Jerusalem. She belongs to the IBM Cloud and Data Technologies group at IBM Research, Haifa, and leads research efforts on data skipping and big data layout on Cloud Object Storage. Her work has been presented at multiple industry conferences, including the Apache Spark Summit, the OpenStack Summit, and IBM Think, as well as at academic conferences such as FAST and SYSTOR. Dr. Ta-Shma’s primary areas of interest include big data analytics in the cloud, data ingestion and layout, cloud storage, and IoT.

Lecture languages

English, Hebrew

Topics

Cloud & Data, IoT, Storage

Duration options

1 hour

Travel/delivery options

In-country; Outside of country: Open for discussion; Remote via video conference

Country

Israel

Lecture booking request

Thank you for your interest in hosting an IBM speaker. Please fill out the following form with as much detail as possible. An IBM representative will reach out to discuss your booking request. All guest lectures are subject to availability, and agreements under this collaboration are not legally binding.