Ibotta Builds a Cost-Efficient, Self-Service Data Lake Using Qubole

April 2, 2019. Updated October 19th, 2021

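This blog is a customer guest post written by Ibotta. Ibotta is a mobile technology company that is transforming the traditional rebates industry by providing in-app cash-back rewards on receipts and online purchases of electronics, clothing, gifts, supplies, restaurant dining, and more for anyone with a smartphone.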

Today, Ibotta is one of the most-used shopping apps in the United States, driving more than $7 billion in purchases per year to companies like Target, Costco, and Walmart. Since its founding in 2012, Ibotta has been downloaded more than 27 million times and has paid out more than $500 million to its users.

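Maintaining a competitive edge in the eCommerce and retail industry is extremely difficult because it requires building an engaging and unique shopping experience for consumers.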

Ibotta’s Previous Data Infrastructure

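Prior to moving to a big data platform with Qubole, Ibotta's data and analytics infrastructure was based on a static and rigid cloud data warehouse. This approach worked as long as the datasets were well structured and in tabular format. However, as the business grew, newer and more complex data formats were developed and ingested.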

At the same time, Ibotta invested heavily in new data analytics teams such as data engineering, decision science, and machine learning. The teams needed access to the same data, but each team sought to interact with the data in a different way.

Data Engineering needed a set of tools that allowed it to perform Extract, Transform, and Load (ETL) processes in many different ways using MapReduce, Apache Hive, Spark, and/or Presto. The Machine Learning team wanted to use Spark for feature engineering and to train and deploy its models. Decision Science wanted to use SQL, R, and Python to extract insights and business recommendations from the data.

Moving Beyond Descriptive Analytics

Ibotta needed to grow beyond descriptive analytics, which was complementary to its products, into a pure data-driven company. The organization needed to be segmented so that Ibotta could adequately staff the appropriate groups to help achieve the following aspirations:

  • For the Data Engineering team: Design the data lake, manage technologies, provide data services, and create automated pipelines that feed into various data marts
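  • For the Machine Learning team: Create new product features and perform predictive and prescriptive analytics, with use cases ranging from personalization to optimization
  • For the Decision Science team: Develop and deliver a self-service insights platform for internal stakeholders and external client partners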

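Ibotta needed a way for every user to have self-service access to the data and to use the right tools for their use cases with big data engines like Spark, Hive, and Presto. At the same time, the Data Engineering team needed to be able to prepare data that is easy to consume. To address the various goals of its data teams, Ibotta built a cost-efficient, self-service data lake using a cloud-native platform.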

Building a Self-Service Data Lake

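Ibotta realized that the first step in building a self-service platform was to define which data was critical for the analytics teams to meet key business milestones. At the time, users were running their models on a combination of data from transactional systems and the data warehouse.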

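After the value of each dataset was defined, the Data Engineering team could start building pipelines that extracted data from the data warehouse and Amazon Aurora and converted it to JSON format, which was then stored in the raw storage area.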

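From there, other pipelines converted the JSON data into the Optimized Row Columnar (ORC) and Parquet columnar formats and stored the results in the optimized storage area. Using Airflow and its ability to monitor new partitions in the metastore, downstream pipelines could start running as soon as new data locations were exposed in the Hive metastore.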
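To make the row-to-columnar step concrete, here is a minimal, standard-library-only sketch of pivoting newline-delimited JSON records into a column-oriented layout. It is not Ibotta's pipeline, which according to the post ran on engines such as Hive and Spark and wrote ORC and Parquet files; the function name `rows_to_columns` and the sample fields are assumptions made up for illustration.

```python
import json


def rows_to_columns(json_lines):
    """Pivot newline-delimited JSON records into a column-oriented dict."""
    records = [json.loads(line) for line in json_lines if line.strip()]

    # Collect the union of field names, keeping first-seen order.
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)

    # One list per column; missing fields become None so columns stay aligned.
    return {field: [record.get(field) for record in records] for field in fields}


# Hypothetical sample records, not taken from the original post.
raw = [
    '{"user_id": 1, "retailer": "A", "amount": 2.5}',
    '{"user_id": 2, "retailer": "B"}',
]
columns = rows_to_columns(raw)
print(columns["amount"])  # [2.5, None]
```

Storing each field contiguously is what lets a columnar format read only the columns a query touches, which is the general motivation for converting raw JSON before analysts query it.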

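To mitigate the legacy data warehouse constraints, Ibotta now has ETL jobs that load data from Hive into Snowflake for use by its Business Intelligence (BI) tool, Looker. Ibotta uses Hive and Spark jobs to process raw data into production-ready tables used by the Decision Science team. All of this relies on Airflow's hooks into Qubole, which make it easy to automate jobs through the API. Airflow gives more control over orchestration than Cron and AWS Data Pipeline. It also provides performance benefits, including parallelization and the flexibility to schedule jobs as a directed acyclic graph (DAG) rather than assuming linear dependencies.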
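The difference between DAG scheduling and a linear job chain can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Airflow code and not Ibotta's actual DAG; the task names below are hypothetical, chosen only to mirror the stages described above.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract_warehouse": set(),
    "extract_aurora": set(),
    "json_to_parquet": {"extract_warehouse", "extract_aurora"},
    "build_production_tables": {"json_to_parquet"},
    "load_snowflake": {"json_to_parquet"},
}


def parallel_batches(graph):
    """Return lists of tasks that can run concurrently, in dependency order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose inputs are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(dependencies)
for batch in batches:
    print(batch)
```

Under these assumed dependencies, the two extracts form one batch and the two downstream loads form another, so five tasks complete in three steps instead of the five a strictly linear schedule would need.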

Leveraging Big Data for ML, ETL, and Ad Hoc Querying

Ibotta uses Qubole to provision and automate its big data clusters. Specifically, it uses Spark for machine learning and other complicated data processing tasks, Hive and Spark for ETL processes, and Presto for ad hoc queries such as exploratory analytics.

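Utilizing this platform, Ibotta has empowered the Decision Science team to build real-time dashboards for hundreds of users with its BI tool. Within four months of instituting its new data platform, Ibotta more than tripled the volume of data it processes, and it is passing more than 30,000 queries per week through Qubole.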

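With Qubole in place, Ibotta's Decision Science team was empowered immediately. The goals of self-service access to data and efficient scaling of compute resources in Amazon Elastic Compute Cloud (Amazon EC2) for big data workloads were achieved. Within a month, the Machine Learning team launched new prescriptive analytics features in the product, including a recommendation engine, an A/B testing framework, and an item-text classification process.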

Conclusion

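By using Qubole on AWS, Ibotta's teams are able to provision resources themselves without needing a central administration team. The big data clusters use a mix of 60 to 90 percent Spot instances with on-demand nodes, which, combined with Qubole's heterogeneous cluster capability, makes it easy and reliable to achieve the lowest running cost for big data workloads.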
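The effect of the Spot share on cluster cost can be illustrated with simple arithmetic. The sketch below is an assumption-laden example, not Ibotta's or Qubole's cost model: the node count and the hourly prices are invented placeholders, and real Spot prices vary by instance type and over time.

```python
def blended_hourly_cost(nodes, spot_fraction, on_demand_price, spot_price):
    """Hourly cost of a cluster that mixes Spot and on-demand nodes."""
    spot_nodes = round(nodes * spot_fraction)
    on_demand_nodes = nodes - spot_nodes
    return spot_nodes * spot_price + on_demand_nodes * on_demand_price


# Hypothetical figures: 10 nodes, $1.00/hour on demand, $0.30/hour on Spot.
NODES, ON_DEMAND, SPOT = 10, 1.00, 0.30

all_on_demand = blended_hourly_cost(NODES, 0.0, ON_DEMAND, SPOT)
low_mix = blended_hourly_cost(NODES, 0.6, ON_DEMAND, SPOT)   # 60% Spot
high_mix = blended_hourly_cost(NODES, 0.9, ON_DEMAND, SPOT)  # 90% Spot

for label, cost in [("0% Spot", all_on_demand), ("60% Spot", low_mix), ("90% Spot", high_mix)]:
    print(f"{label}: ${cost:.2f}/hour")
```

With these placeholder prices, each node moved from on-demand to Spot lowers the hourly bill by the price difference, which is why a higher Spot share translates directly into lower running cost as long as the workload tolerates Spot interruptions.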

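Additionally, auto-scaling and cluster lifecycle management provide significant savings on Ibotta's cloud infrastructure costs. This means that managing budget and ROI is much easier, and Ibotta can forecast how to scale different functions and projects accordingly.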

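Ibotta is focusing on delivering next-generation eCommerce features and products that help drive better user experiences and partner monetization. Qubole allows Ibotta to spend its time developing and productionizing scalable data products. More importantly, it can focus on bringing value to its users and customers.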

Want more information? Read Ibotta’s full story about building a self-service data lake with Qubole.
