xgboost
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
XGBoost is a fast, accurate machine learning library for making predictions from structured data like spreadsheets. It builds sequences of small decision trees where each one corrects the previous one's mistakes.
XGBoost (short for eXtreme Gradient Boosting) is a machine learning library for making accurate predictions from tabular data such as spreadsheets, database tables, or other structured records. It uses gradient boosting, a technique that builds many small decision trees (branching "if this, then that" rules) in sequence, with each new tree fitted to correct the errors left by the trees before it. The final prediction combines the outputs of all the trees.
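To make the "each new tree corrects the previous ones" idea concrete, here is a minimal from-scratch sketch of gradient boosting for squared-error regression, using one-split trees (stumps) on a single feature. It is an illustration only and not XGBoost's implementation, which adds regularization, second-order gradients, and many optimizations. The function names, the toy data, and the `n_trees` and `learning_rate` values are all assumptions chosen for the example.

```python
# Illustrative gradient boosting with decision stumps (not XGBoost's code).

def fit_stump(xs, residuals):
    """Pick the threshold split on xs that best fits the residuals."""
    best = None
    # Every distinct x except the largest is a candidate split point,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Fit stumps in sequence, each one to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: one feature, a roughly step-shaped target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.1, 4.9]

baseline = (sum(ys) / len(ys), 0.3, [])   # mean-only model, no trees
model = boost(xs, ys)
print("baseline MSE:", mse(baseline, xs, ys))
print("boosted MSE: ", mse(model, xs, ys))
```

Each pass fits a stump to whatever error remains, so the training error shrinks as trees are added. The learning rate scales down each tree's contribution, which is a common way to trade more trees for smoother fitting.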
The library is designed to scale to very large datasets: its README says it can handle problems with billions of examples. It runs on a single machine for smaller tasks and integrates with distributed computing systems such as Hadoop, Spark, Dask, and Kubernetes when the data must be processed across many machines.
XGBoost provides interfaces for Python, R, Java, Scala, and C++, so data scientists and engineers can work in the environment they know best. It is commonly used in data science competitions and in real-world prediction tasks such as forecasting sales, detecting fraud, or classifying records.
You'd reach for XGBoost when you have labeled training data (examples with known answers) and want a model that predicts outcomes for new data. It is especially useful when speed and accuracy on structured data matter. The core library is written in C++, which keeps it fast, with language bindings layered on top. It is licensed under Apache 2.0.
Where it fits
- Train a model to predict customer churn from a spreadsheet of account features and historical behavior.
- Build a fraud detection classifier on transaction records using XGBoost's Python interface.
- Scale a prediction job to billions of rows by connecting XGBoost to a Spark or Dask distributed cluster.
- Enter a Kaggle competition on structured data, a setting where XGBoost is commonly used.