Today, businesses generate massive amounts of data every second across hundreds of disparate sources. As organizations attempt to capitalize on new ML/AI capabilities, each new system or tool fragments the journey from data to business value even further, making data preparation costly, slow, and labor-intensive. Data engineering teams struggle to keep up with both the business and technical demands.
In machine learning, a "feature" is an input variable, similar to the explanatory ("x") variable in simple linear regression. A machine learning project might use hundreds or even millions of features, and those features must be paired with labels (similar to the dependent "y" variable) to train a model. A robust and scalable ML/AI development program requires improving the feature extraction process, an early step in ML that prepares data for analysis by abstracting complex schemas and their data into basic objects and attributes.
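The feature/label pairing above can be made concrete with a minimal sketch. The data here is hypothetical (square footage as the single "x" feature, sale price as the "y" label), and the closed-form least-squares fit stands in for whatever model a real project would train:

```python
# Each example pairs a feature value ("x") with a label ("y").
# Hypothetical data: square footage (feature) vs. sale price in $k (label).
features = [1000.0, 1500.0, 2000.0, 2500.0]  # x: input variable
labels = [200.0, 300.0, 400.0, 500.0]        # y: value the model learns to predict

# Ordinary least squares for a single feature (closed form).
n = len(features)
mean_x = sum(features) / n
mean_y = sum(labels) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(features, labels))
         / sum((x - mean_x) ** 2 for x in features))
intercept = mean_y - slope * mean_x

# Use the fitted line to predict the label for a new feature value.
predicted = slope * 1750.0 + intercept  # -> 350.0
```

A real project would have many feature columns per example rather than one, but the training loop's shape is the same: feature values in, label predictions out.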
Relying on reference architectures for feature reuse can help, but it introduces latency, complexity, and yet another data silo to manage. To simplify and speed up feature extraction and reuse, a new class of technology has emerged: the feature store, which handles the demanding data preparation that effective ML requires. No longer unique to large firms like Uber and Airbnb, feature stores transform raw data into feature values, store those features, and serve them for future training and analysis.
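The "transform raw data into feature values" step can be illustrated with a toy sketch. The event schema and feature names here are hypothetical, standing in for whatever source systems and transformations a real pipeline would use:

```python
from collections import defaultdict

# Hypothetical raw events; in practice these would stream from source systems.
raw_events = [
    {"customer": "a", "amount": 20.0},
    {"customer": "a", "amount": 40.0},
    {"customer": "b", "amount": 10.0},
]

def extract_features(events):
    """Abstract raw transaction rows into per-customer feature values."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        totals[event["customer"]] += event["amount"]
        counts[event["customer"]] += 1
    # Each customer becomes a basic object with simple numeric attributes,
    # ready to be stored and served for training or analysis.
    return {c: {"txn_count": counts[c], "avg_amount": totals[c] / counts[c]}
            for c in totals}

customer_features = extract_features(raw_events)
# customer_features["a"] -> {"txn_count": 2, "avg_amount": 30.0}
```

A production feature store would keep these values continuously updated as new events arrive, rather than recomputing from scratch.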
Importantly, feature stores do not remove the need for Snowflake or similar cloud solutions. Feature stores easily overlay with existing data infrastructures, enabling almost instantaneous updates, reducing risk, and scaling quickly with an enterprise. The graphic below compares feature stores to current approaches.
In short, all of an organization’s data can be converted to reusable features and analyzed with full fidelity, regardless of format or source location, for immediate analytics. Not only does this speed up a data scientist's development timeline, but it brings economies of scale to ML organizations by enabling collaboration. When a feature is registered in a feature store, it becomes available for immediate reuse by other models across the organization.
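The register-once, reuse-everywhere pattern can be sketched in miniature. This toy in-memory class and its feature name are illustrative assumptions, not any vendor's API, but it shows why a shared registry enables collaboration: every model resolves the same name to the same feature definition:

```python
class FeatureStore:
    """Toy in-memory feature store: register a feature once, reuse it anywhere."""

    def __init__(self):
        self._features = {}  # feature name -> transformation function

    def register(self, name, transform):
        # Registering makes the feature discoverable by every team.
        self._features[name] = transform

    def serve(self, name, raw_value):
        # Any model requests the feature by name and gets values
        # computed by the single shared definition.
        return self._features[name](raw_value)

store = FeatureStore()
store.register("word_count", lambda text: len(text.split()))

# Two different models reuse the same registered feature definition.
training_value = store.serve("word_count", "feature stores enable easy reuse")
scoring_value = store.serve("word_count", "reuse cuts duplicate work")
```

Because both calls go through one registered definition, a fix or improvement to the feature propagates to every model that consumes it.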
One leading provider of feature store capabilities is Molecula, which recently completed a $17.6 million Series A round with Tensility as a participant. Molecula leaves data at its source and continuously extracts and updates only features into a centralized feature store. This process eliminates the need to copy, move, or pre-aggregate data, reduces the data footprint by 60-90 percent, and provides a secure data format for sharing.
Whether through Molecula or another organization, feature stores pair with current systems to enable prescriptive analytics while reducing complexity, costs, and risk. We are excited at the capability they provide to data engineers and data scientists to improve the data pipeline and leverage all of a company's data for better business outcomes.