Top Feature Stores for Machine Learning That Data Scientists Must Know in 2022


SOURCE: MOEZ-62905.MEDIUM.COM
OCT 22, 2022

Introduction

A feature store is a data management system that gives data scientists and engineers a central place to find and use data for machine learning. It enables data (features) to be shared across different machine learning pipelines, which can speed up the development of new models and improve model performance. Feature stores have recently emerged as an important component of the enterprise machine learning stack and are key to enabling:

  1. Automating feature computation
  2. Sharing and re-using features across different teams and projects
  3. Managing feature metadata
  4. Serving or extracting features offline, in real time, or on demand
  5. Monitoring the full lifecycle of features from generation to serving

The two main types of data in machine learning are:

  • Batch Data: In most cases, this data originates from data lakes or data warehouses. These are large chunks of data that have been saved for use by models; however, they are not necessarily kept up to date in real time. Example: data about a bank's customers, such as age, country, etc.
  • Real-time Data: This usually arrives via streaming and log events. It is stored online and is continually generated by many sources, such as the events logged on a system. Example: a bank transaction is logged in real time and fed to the feature store.

These two types of data are managed by two types of feature stores:

  • Offline Feature Stores: The offline store comprises preprocessed features from batch data and is used to build a historical source of features that data scientists can use in the model training pipeline. Because it keeps history, in most feature stores it can provide feature values for a given time range or point in time. It is normally stored in a data warehouse.
  • Online Feature Stores: The online store combines data from the offline store with real-time features from streaming data sources. It is built to be the most up-to-date collection of organized features, which can be used to feed the model in production with new features for inference.
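
To make the offline/online split concrete, here is a minimal in-memory sketch in plain Python. It is not the API of any particular feature store: the class names (OfflineStore, OnlineStore), the method names, and the integer timestamps are all assumptions made for illustration. The offline store keeps the full history and answers "what was known at time t" queries, while the online store keeps only the latest values and merges in streaming updates.

```python
from bisect import bisect_right
from collections import defaultdict


class OfflineStore:
    """Hypothetical append-only history of feature values, keyed by entity id."""

    def __init__(self):
        # entity_id -> list of (timestamp, features), kept sorted by timestamp
        self._history = defaultdict(list)

    def write(self, entity_id, timestamp, features):
        rows = self._history[entity_id]
        rows.append((timestamp, dict(features)))
        rows.sort(key=lambda row: row[0])

    def as_of(self, entity_id, timestamp):
        """Return the feature values known at or before `timestamp`, or None."""
        rows = self._history.get(entity_id, [])
        timestamps = [ts for ts, _ in rows]
        idx = bisect_right(timestamps, timestamp)
        return rows[idx - 1][1] if idx else None

    def latest(self):
        """Return the newest feature values for every entity."""
        return {eid: rows[-1][1] for eid, rows in self._history.items() if rows}


class OnlineStore:
    """Hypothetical store holding only the most recent values per entity."""

    def __init__(self):
        self._latest = {}

    def materialize(self, offline_store):
        # Copy the newest batch values from the offline store.
        for entity_id, features in offline_store.latest().items():
            self._latest.setdefault(entity_id, {}).update(features)

    def update(self, entity_id, features):
        # Merge in values arriving from a real-time source.
        self._latest.setdefault(entity_id, {}).update(features)

    def get(self, entity_id):
        return dict(self._latest.get(entity_id, {}))


# Batch data: customer attributes written at (made-up) times 1 and 5.
offline = OfflineStore()
offline.write("customer_1", 1, {"age": 41, "country": "DE"})
offline.write("customer_1", 5, {"age": 42, "country": "DE"})

# Online store: latest batch values plus a real-time transaction feature.
online = OnlineStore()
online.materialize(offline)
online.update("customer_1", {"last_txn_amount": 120.0})

training_view = offline.as_of("customer_1", 3)  # values as they were at time 3
serving_view = online.get("customer_1")         # freshest values for inference
```

In this sketch, training reads go through as_of so that a model only sees values that existed at the requested time, while inference reads go through get, which trades history for freshness.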

In this blog, we will review some of the most popular feature stores in 2022.

Open-Source Feast

The Feast feature store is an open source tool that provides a unified view of data for feature engineering, training, and serving machine learning models. It is designed to be scalable and extensible, and to support a wide variety of data sources and feature types.

Features

  • Feast automates the process of storing and serving your features from an online store to support real-time predictions.
  • Feast guarantees you’re serving the same data to models during training and inference, eliminating training-serving skew.
  • Feast integrates with your existing data pipelines and ML tooling. It runs on top of cloud managed services, reusing your existing infrastructure and spinning up resources as needed.
  • Feast brings standardization and consistency to your data engineering workflows across models and teams. Many teams use Feast as the foundation of their internal ML platforms.
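
The training-serving skew mentioned above typically appears when feature logic is written twice, once for training and once for serving. The sketch below illustrates the general idea of a single shared feature definition in plain Python. It does not use Feast's API, and the function names, field names, and the 30-day threshold are hypothetical.

```python
import math


def transaction_features(amount, account_age_days):
    """Single definition of the feature logic, shared by training and serving."""
    return {
        "log_amount": math.log1p(amount),
        "is_new_account": account_age_days < 30,
    }


def build_training_rows(historical_transactions):
    # Offline path: apply the shared definition to every historical record.
    return [
        transaction_features(t["amount"], t["account_age_days"])
        for t in historical_transactions
    ]


def build_serving_vector(request):
    # Online path: apply the same definition to one incoming request.
    return transaction_features(request["amount"], request["account_age_days"])


history = [
    {"amount": 100.0, "account_age_days": 10},
    {"amount": 25.0, "account_age_days": 400},
]
training_rows = build_training_rows(history)
serving_vector = build_serving_vector({"amount": 100.0, "account_age_days": 10})
```

Because both paths call the same function, an identical input produces an identical feature vector whether it is seen during training or at inference time.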


Feast is a Python library, so you can install it using pip. For more details, please see the quickstart guide.

Airbnb’s Zipline

Airbnb’s data management system, Zipline, was created with ML use cases in mind. Previously, machine learning specialists at Airbnb spent about 60% of their time gathering and creating transformations for ML jobs. Zipline cuts the time it takes to do this task from months to days by making the procedure declarative. In a straightforward configuration language, it enables data scientists to quickly create machine learning features. The system then provides users with access to point-in-time correct features for offline model training and online inference.
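
As a rough illustration of what "declarative" and "point-in-time correct" can mean together, the sketch below describes a feature as data and then computes it as of a given timestamp. This is not Zipline's configuration language or implementation. The WindowedFeature class, the compute_as_of function, the field names, and the integer timestamps are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowedFeature:
    """Declarative description of a feature: what to aggregate, over which window."""
    name: str
    source_field: str
    aggregation: str  # "sum" or "count"
    window: int       # look-back window, in the same units as event timestamps


def compute_as_of(feature, events, entity_id, as_of):
    """Aggregate only events in [as_of - window, as_of), so no future data leaks in."""
    values = [
        event[feature.source_field]
        for event in events
        if event["entity_id"] == entity_id
        and as_of - feature.window <= event["timestamp"] < as_of
    ]
    if feature.aggregation == "sum":
        return sum(values)
    if feature.aggregation == "count":
        return len(values)
    raise ValueError(f"unsupported aggregation: {feature.aggregation}")


# Feature definitions are plain data; the computation lives in one place.
spend_7d = WindowedFeature("spend_7d", "amount", "sum", window=7)
bookings_7d = WindowedFeature("bookings_7d", "amount", "count", window=7)

events = [
    {"entity_id": "user_1", "timestamp": 1, "amount": 50.0},
    {"entity_id": "user_1", "timestamp": 6, "amount": 80.0},
    {"entity_id": "user_1", "timestamp": 9, "amount": 30.0},
    {"entity_id": "user_2", "timestamp": 6, "amount": 10.0},
]

# Training example labeled at time 8: the event at time 9 must not be counted.
spend_at_8 = compute_as_of(spend_7d, events, "user_1", as_of=8)
bookings_at_8 = compute_as_of(bookings_7d, events, "user_1", as_of=8)

# Inference at time 10: the window has moved, so the event at time 1 drops out.
spend_at_10 = compute_as_of(spend_7d, events, "user_1", as_of=10)
```

The same definition serves both offline training (as of each label's timestamp) and online inference (as of now), which is the property the paragraph above describes.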

To learn more about Zipline, watch this amazing talk by Varant Zanoyan.

Uber’s Michelangelo Palette

Michelangelo is a machine learning platform created by Uber, with a focus on sharing machine learning feature pipelines across teams within Uber.

Michelangelo Palette is the feature store within Michelangelo. Uber uses it internally to make it easy to add features to machine learning models that are already in use, since managing features is one of the biggest bottlenecks in productizing machine learning models.


To learn more about Uber’s Michelangelo platform and how they have built it, watch this presentation by Amit Nene and Eric Chen. This talk discusses the infrastructure built by Uber for the Michelangelo ML Platform that enables a general approach to feature engineering across diverse data systems.

Open-Source Hopsworks

Hopsworks is widely used as a stand-alone feature store. A feature store provides offline and online stores for large volumes of historical feature values and real-time access for current feature values. A feature store also provides an API for creating, reading, updating, and deleting feature values, and retrieving training data and feature vectors.

In addition to Hopsworks' built-in offline store, which is based on Apache Hudi, you can use your own data warehouse or data lake as an offline store. The offline store is where historical feature values are kept, and it is also where training data and features for offline batch scoring are retrieved from.

Hopsworks' online feature store is built on RonDB, a database optimized for feature store use cases. The online store contains the latest feature values and is used to provide feature vectors to deployed models at runtime.


To start for free, check out this official guide.

Conclusion

In the context of machine learning and MLOps, feature stores are becoming a more popular part of data architecture. A feature store's purpose is to combine data from diverse sources and preprocess it into features that both the model training pipeline and the model serving pipeline can use.

I write about data science, machine learning, and PyCaret. If you would like to be notified automatically, you can follow me on Medium, LinkedIn, and Twitter.
