h2o-3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
H2O is an open-source machine learning platform that trains models on large datasets across distributed clusters. Its AutoML feature trains, tunes, and ranks candidate models automatically, so no manual tuning is required.
H2O is an open-source machine learning platform built for speed and scale. It runs in memory across distributed clusters, so it can handle large datasets that would be slow or impractical to process on a single machine. Python and R users can install it with a single command (pip or the R package installer), and it also supports Scala, Java, and JSON interfaces, as well as a browser-based notebook called Flow.
The platform includes a wide set of machine learning algorithms: regression models, gradient boosting, XGBoost, random forests, deep neural networks, k-means clustering, principal component analysis, stacked ensembles, naive Bayes, and others. For users who do not want to choose and tune algorithms manually, H2O AutoML automates the entire process: it trains multiple models across different algorithms, tunes their settings, and produces a ranked leaderboard so you can pick the best result without needing to understand each algorithm in detail.
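The AutoML workflow described above can be sketched in Python roughly as follows. This is a minimal illustration, not the project's own example: the function names `predictor_columns` and `run_automl` are hypothetical, the CSV path and target column are supplied by the caller, and the sketch assumes a classification problem (the target is converted to a categorical column).

```python
def predictor_columns(columns, target):
    """Return every column name except the target, preserving order."""
    return [name for name in columns if name != target]


def run_automl(csv_path, target, max_models=10, seed=1):
    """Train several models with H2O AutoML and return (leaderboard, leader).

    Hypothetical helper: assumes a classification task and a CSV file
    that the H2O instance can read.
    """
    # Imported lazily so predictor_columns() works without H2O installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start a local H2O instance, or attach to one already running
    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()  # treat the target as categorical

    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(
        x=predictor_columns(frame.columns, target),
        y=target,
        training_frame=frame,
    )
    # leaderboard is an H2OFrame ranking the trained models; leader is the top one.
    return aml.leaderboard, aml.leader
```

Calling `run_automl("train.csv", "label")` (both names are placeholders) would return the ranked leaderboard and the top model. Capping the run with `max_models` keeps the sketch bounded; a time budget could be used instead.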
H2O is designed to integrate with existing big data infrastructure. It works alongside Hadoop and Apache Spark, and there is a dedicated Sparkling Water project for deeper Spark integration. Models trained in H2O can be saved and reloaded, or exported to lightweight formats called POJO and MOJO that can run in production environments without depending on the full H2O platform.
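As a rough sketch of the save, reload, and export options mentioned above, the helpers below wrap the relevant Python calls. The function names `export_mojo` and `save_and_reload` and the output directory are hypothetical, and `model` is assumed to be an already trained H2O model whose algorithm supports MOJO export.

```python
import os


def export_mojo(model, out_dir):
    """Write the model as a MOJO zip (plus the scoring jar) into out_dir.

    Returns the path reported by H2O for the written MOJO file.
    """
    os.makedirs(out_dir, exist_ok=True)
    # get_genmodel_jar=True also downloads h2o-genmodel.jar, which is what
    # a JVM application needs in order to score with the MOJO.
    return model.download_mojo(path=out_dir, get_genmodel_jar=True)


def save_and_reload(model, out_dir):
    """Save the model in H2O's binary format, then load it back."""
    # Imported lazily so export_mojo() can be used without importing h2o here.
    import h2o

    os.makedirs(out_dir, exist_ok=True)
    saved_path = h2o.save_model(model=model, path=out_dir, force=True)
    return h2o.load_model(saved_path)
```

The two paths serve different purposes. A binary model is reloaded into a running H2O cluster, and is expected to be loaded by the same H2O version that saved it. A MOJO is intended for scoring outside the cluster.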
The codebase is extensible, meaning developers can write custom data transformations and algorithms and access them through the same interfaces. Pre-built packages are available via PyPI, Anaconda, and CRAN. The project has a full documentation site, Stack Overflow presence, GitHub discussions, and a Gitter chat channel for community support.
Where it fits
- Use AutoML to automatically train and compare dozens of models on your dataset and get a ranked leaderboard without needing to pick or tune algorithms yourself.
- Train models on a dataset too large for a single machine by running H2O across a distributed cluster alongside existing Hadoop or Spark infrastructure.
- Export a trained H2O model as a MOJO file to deploy it in production without requiring the full H2O platform to be running.
- Explore and build models interactively through the browser-based Flow notebook without writing any code.
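For the distributed-cluster case in the list above, a Python client typically attaches to a cluster that is already running rather than starting its own. The sketch below assumes such a cluster is reachable over HTTP; the helper names `cluster_url` and `attach_and_load`, the host name, and the data location are all hypothetical. Port 54321 is H2O's default.

```python
def cluster_url(host, port=54321):
    """Build the HTTP URL of an H2O node (54321 is H2O's default port)."""
    return f"http://{host}:{port}"


def attach_and_load(host, data_uri, port=54321):
    """Connect to a running H2O cluster and import a dataset into it.

    data_uri must be readable by the cluster nodes themselves (for example
    a shared path or an HDFS location), because import_file is executed on
    the server side rather than uploading from the client.
    """
    # Imported lazily so cluster_url() works without H2O installed.
    import h2o

    h2o.connect(url=cluster_url(host, port))
    return h2o.import_file(data_uri)
```

For example, `attach_and_load("h2o-node-1", "hdfs://namenode/data/train.csv")` (placeholder host and path) would return an H2OFrame whose data lives in the cluster's memory, not in the Python process.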