Hall of Fame

A common task in big data is determining the similarity between two objects of the same type. Baseball Reference has a formula for computing the similarity between players, found here.

The aim of this project was to use Lahman’s Baseball Database to find players who should be in the Hall of Fame but are not. We say a player should be in the Hall of Fame if at least 3 of the 5 most similar players are in the Hall of Fame. This amounts to a k-nearest-neighbors (kNN) problem with k = 5, using similarity as the distance function.
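The rule above can be sketched in a few lines of plain Python. This is only an illustration, not the project's actual code: the `similarity` function is a simplified placeholder (it is not the Baseball Reference formula), the field names `hits` and `home_runs` are assumed, and the player records are made-up toy data rather than rows from Lahman’s Baseball Database.

```python
from typing import Dict, List, Set

# Hypothetical toy records; the real project reads Lahman's Baseball Database.
Player = Dict[str, float]


def similarity(a: Player, b: Player) -> float:
    """Placeholder score: start at 1000 and subtract weighted stat differences.

    The weights here are assumptions chosen for illustration only.
    """
    score = 1000.0
    score -= abs(a["hits"] - b["hits"]) / 15
    score -= abs(a["home_runs"] - b["home_runs"]) / 2
    return score


def most_similar(target_id: str, players: Dict[str, Player], k: int = 5) -> List[str]:
    """Return the ids of the k players most similar to the target (highest score first)."""
    target = players[target_id]
    others = [pid for pid in players if pid != target_id]
    others.sort(key=lambda pid: similarity(target, players[pid]), reverse=True)
    return others[:k]


def should_be_in_hall(target_id: str, players: Dict[str, Player],
                      hall_of_fame: Set[str], k: int = 5, threshold: int = 3) -> bool:
    """Apply the rule: at least `threshold` of the k most similar players are inducted."""
    neighbours = most_similar(target_id, players, k)
    votes = sum(1 for pid in neighbours if pid in hall_of_fame)
    return votes >= threshold


def overlooked(players: Dict[str, Player], hall_of_fame: Set[str]) -> List[str]:
    """List players who are not inducted but satisfy the 3-of-5 rule."""
    return sorted(pid for pid in players
                  if pid not in hall_of_fame
                  and should_be_in_hall(pid, players, hall_of_fame))


# Made-up example data (not real players or statistics).
players = {
    "a": {"hits": 3000, "home_runs": 500},
    "b": {"hits": 2950, "home_runs": 480},
    "c": {"hits": 3050, "home_runs": 510},
    "d": {"hits": 2900, "home_runs": 520},
    "e": {"hits": 1000, "home_runs": 50},
    "f": {"hits": 1100, "home_runs": 60},
    "x": {"hits": 2980, "home_runs": 495},
}
hall_of_fame = {"a", "b", "c"}

print(overlooked(players, hall_of_fame))
```

Because the score is a similarity rather than a distance, the neighbours are the players with the highest scores; sorting in descending order plays the role that sorting by ascending distance plays in a textbook kNN.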

Why I did it:

  • I built this project as an assignment for a Cloud Computing course while studying at Baylor University. I enjoyed solving these problems with new technologies.

When I did it:

  • During my first semester (Spring 2021) at Baylor University.

Technical details:

  • I solved the problem three separate times on the public Lahman’s Baseball dataset: with PySpark (RDD, Python), with Dask (Python, Dask Bag), and with MongoDB (file system, MongoDB Query Language (MQL)).
Tonni Das Jui
PhD student at Baylor University