Chapter 4 Data Cleaning

Initially provided as XML files, the datasets were converted to CSV files and then merged into a final dataset with 132,000 observations and 98 features. Because this project focuses primarily on passing, the data for each game was then converted into network data. Each game is represented as an array of matrices, one per possession, whose entries count the passes between each pair of players during that possession.

Below is an example of a 10x10 matrix for a single possession. Each row corresponds to a passer, and each column corresponds to a receiver.

        100023  100283  839023  456782  222789  134783  111124  098783  352671  213416
100023       0       1       0       3       0       0       0       0       0       0
100283       0       0       0       0       0       0       0       0       0       0
839023       0       1       0       0       0       0       0       0       0       0
456782       0       0       0       0       0       0       0       0       0       0
222789       0       0       0       0       0       0       0       0       0       0
134783       0       0       0       0       0       0       0       0       0       0
111124       0       0       0       0       0       0       0       0       0       0
098783       0       0       0       0       0       0       0       0       0       0
352671       0       0       0       0       0       0       0       0       0       0
213416       0       0       0       0       0       0       0       0       0       0
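
A matrix of this kind can be built from a list of pass events. The following is a minimal sketch rather than the code used in this project: the function name passing_matrix, the (passer, receiver) pair format, and the treatment of the row and column labels as player IDs stored as strings are all assumptions made for illustration.

```python
def passing_matrix(passes, players):
    """Count passes between players for one possession.

    passes  : iterable of (passer_id, receiver_id) pairs (hypothetical format)
    players : ordered list of player IDs; fixes the row and column order
    """
    index = {pid: i for i, pid in enumerate(players)}
    n = len(players)
    matrix = [[0] * n for _ in range(n)]
    for passer, receiver in passes:
        # Row = passer, column = receiver.
        matrix[index[passer]][index[receiver]] += 1
    return matrix


# IDs are kept as strings so that leading zeros (e.g. "098783") are preserved.
players = ["100023", "100283", "839023", "456782", "222789",
           "134783", "111124", "098783", "352671", "213416"]

# Pass events chosen to reproduce the example matrix above.
passes = [
    ("100023", "100283"),
    ("100023", "456782"),
    ("100023", "456782"),
    ("100023", "456782"),
    ("839023", "100283"),
]

matrix = passing_matrix(passes, players)
for pid, row in zip(players, matrix):
    print(pid, *row)
```

Applying the function to every possession in a game and collecting the results in a list would give the array of matrices described above.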

4.1 Changes in Shot Clock Time

As college basketball is a constantly changing sport, the NCAA changed the rules of play for the 2013-2014 college basketball season, replacing the 35-second shot clock with a 30-second shot clock. Since this work does not have a temporal component, the rule change should not drastically affect the results of model building. However, the extra five seconds available under the 35-second shot clock may have allowed players to pass the ball more frequently during a possession, which would affect the passing matrices.