albedo
A recommender system for discovering GitHub repos, built with Apache Spark
Albedo is a research project that recommends GitHub repositories to users based on their past starring and following behavior. It looks at the repos a person has starred, then suggests others they might like, using patterns found across many users' data.
The system works in several stages. First, it collects data from GitHub using the API, pulling information about which users starred which repos. Then it builds profiles for each user and each repo, capturing features that might predict interest. From there, it trains multiple machine learning models and compares how well each approach recommends repos that users actually care about.
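As a rough illustration of the profile-building stage, the sketch below groups raw star records into per-user repo sets and per-repo star counts. It is a minimal example in plain Python, not the project's Spark code; the `stars` records and the `build_profiles` helper are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical (user, repo) star records; real data would come from the GitHub API.
stars = [
    ("alice", "apache/spark"),
    ("alice", "scala/scala"),
    ("bob", "apache/spark"),
    ("bob", "elastic/elasticsearch"),
    ("carol", "apache/spark"),
    ("carol", "apache/spark"),  # duplicate record, counted once
]


def build_profiles(star_pairs):
    """Group star records into per-user repo sets and per-repo star counts."""
    user_repos = defaultdict(set)
    repo_stars = defaultdict(int)
    for user, repo in star_pairs:
        if repo not in user_repos[user]:  # skip duplicate records
            user_repos[user].add(repo)
            repo_stars[repo] += 1
    return dict(user_repos), dict(repo_stars)


user_repos, repo_stars = build_profiles(stars)
```

A real profile would carry many more features (languages, topics, account age, and so on); the two dictionaries here only stand in for the simplest user and repo signals.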
The project tries out several different recommendation strategies. A simple baseline recommends whatever is most popular overall. A collaborative filtering approach, alternating least squares (ALS), looks for patterns across users to infer what a given person might like based on what similar users have starred. A content-based approach uses text similarity to find repos whose descriptions and topics resemble ones the user has already starred. A logistic regression model then ranks the candidates generated by those other methods. The README includes accuracy scores for each approach so you can see how they compare.
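To make two of these strategies concrete, here is a small sketch of a popularity baseline and a content-based recommender that uses cosine similarity over word counts. It is plain Python rather than the project's Scala and Spark code, and the repo names, descriptions, and function names (`recommend_popular`, `recommend_similar`) are all hypothetical. ALS and the logistic regression ranker are not shown.

```python
import math
from collections import Counter

# Hypothetical toy data: which repos each user starred, and repo descriptions.
user_repos = {
    "alice": {"spark", "flink"},
    "bob": {"spark", "react"},
    "carol": {"spark", "react", "vue"},
}
descriptions = {
    "spark": "distributed data processing engine",
    "flink": "stream data processing engine",
    "react": "javascript ui library",
    "vue": "javascript ui framework",
    "beam": "unified batch and stream data processing",
}


def recommend_popular(user_repos, user, k=2):
    """Popularity baseline: most-starred repos the user has not starred yet."""
    counts = Counter(repo for repos in user_repos.values() for repo in repos)
    seen = user_repos.get(user, set())
    # Sort by star count (descending), then by name for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [repo for repo, _ in ranked if repo not in seen][:k]


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_similar(descriptions, user_repos, user, k=2):
    """Content-based: rank unseen repos by text similarity to the user's stars."""
    seen = user_repos.get(user, set())
    profile = Counter()
    for repo in seen:
        profile.update(descriptions.get(repo, "").lower().split())
    scored = [
        (cosine(profile, Counter(text.lower().split())), repo)
        for repo, text in descriptions.items()
        if repo not in seen
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Drop repos with no textual overlap at all.
    return [repo for score, repo in scored if score > 0][:k]


popular_for_alice = recommend_popular(user_repos, "alice")
similar_for_alice = recommend_similar(descriptions, user_repos, "alice")
```

Note that the content-based recommender can surface a repo nobody in the toy data has starred (`beam`), which the popularity baseline cannot do. That difference is one reason to combine several candidate generators before a final ranking step.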
The technical stack is Scala and Apache Spark for the heavy computation, Python for data collection and syncing to Elasticsearch, and MySQL for storage. Running the project requires Docker to set up the environment and a GitHub personal access token to pull data. The author also published several blog posts walking through each stage of the system for anyone who wants to learn how recommender systems work in practice.