MLB Extended Gameday Pitch Logs

Alan M. Nathan

August 6, 2007

I. How to Download

 

(a)    Downloading the data: Go to the web site http://gd2.mlb.com/components/game/mlb/. Click on the year, then on the month; on the next page click on the day; on the next page click on the specific game; on the next page click on pbp; on the next page click on pitchers. For the Baltimore vs. Boston game played on August 1, 2007, the full link is as follows: http://gd2.mlb.com/components/game/mlb/year_2007/month_08/day_01/gid_2007_08_01_balmlb_bosmlb_1/pbp/pitchers/
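
The click path above follows a regular pattern, so the link can also be assembled directly. The following Python sketch is only an illustration: the function name pitcher_log_url is hypothetical, and the rule for forming the game id (away team code, then home team code, each followed by "mlb", then a game number) is inferred from the single example link above and may not hold for every game.

```python
def pitcher_log_url(year, month, day, away, home, game_number=1):
    """Build the pitcher-log directory link for one game.

    Assumption: the game id is gid_YYYY_MM_DD_<away>mlb_<home>mlb_<n>,
    a pattern inferred from the Baltimore vs. Boston example only.
    """
    base = "http://gd2.mlb.com/components/game/mlb"
    date_path = f"year_{year:04d}/month_{month:02d}/day_{day:02d}"
    gid = (
        f"gid_{year:04d}_{month:02d}_{day:02d}"
        f"_{away}mlb_{home}mlb_{game_number}"
    )
    return f"{base}/{date_path}/{gid}/pbp/pitchers/"


# Baltimore at Boston, August 1, 2007
print(pitcher_log_url(2007, 8, 1, "bal", "bos"))
```

Appending a pitcher file name such as 122201.xml to the result gives the link for one pitcher's log.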

 

(b)   The above steps take you to a page with a bunch of links that are of the form zzzzzz.xml, where zzzzzz is a six-digit code for a specific pitcher (see section III). For the above game, click on 122201.xml, which will get you to the pitch logs of Paul Shuey, who pitched to two batters in the 7th inning.

 

(c)    You will then see a lot of numbers on the screen. Use whatever tools you have with your browser (e.g., "save page as") to save it as 122201.xml in some convenient folder.

 

(d)   Now launch Excel. From the File menu, open the file you just saved. An Open XML box will pop up. Check the As an XML list box, then click OK, and the file will load. You should see columns A through AK (37 columns total) filled, with headers in the first row. Immediately save it as an Excel file. The number of columns may change depending on when the file was written, and there is no guarantee that it will remain the same in the future. However, the header names will hopefully stay constant. In the next section, I will discuss the meaning of the important parameters in the database.
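
For readers who would rather not use Excel, the saved file can also be read with a short script. The Python sketch below rests on an assumption that is not stated above: that each pitch is stored as an XML element named pitch whose attributes are the column headers. The function name read_pitches and the sample text are hypothetical, and the sample values are placeholders, not real data.

```python
import xml.etree.ElementTree as ET


def read_pitches(xml_text, tag="pitch"):
    """Return one dict of attribute name -> string value per pitch.

    Assumption: every pitch is an element named `tag` and its columns
    are stored as XML attributes. Adjust `tag` if the file differs.
    """
    root = ET.fromstring(xml_text.strip())
    return [dict(elem.attrib) for elem in root.iter(tag)]


# Placeholder sample with made-up values, for illustration only.
SAMPLE = """
<player id="122201">
  <atbat batter="000000">
    <pitch start_speed="90.0" end_speed="83.0" pfx_x="-5.0" pfx_z="9.0"/>
    <pitch start_speed="80.0" end_speed="74.0" pfx_x="4.0" pfx_z="-3.0"/>
  </atbat>
</player>
"""

rows = read_pitches(SAMPLE)
for row in rows:
    # Attribute values arrive as strings; convert before doing arithmetic.
    print(float(row["start_speed"]), float(row["end_speed"]))
```

For a saved file, read its contents with open("122201.xml").read() and pass the text to read_pitches. Looking columns up by header name rather than by position keeps the script working if the number of columns changes.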

 

II. What do the columns mean?

 

Only the columns that are not obvious from the context will be described. Before proceeding, some definitions are needed. A coordinate system is used in which the origin is at the point of home plate. The y-axis points from catcher to pitcher; the z-axis points vertically up; the x-axis points to the catcher's right. Using this notation, the pitched baseball is traveling in the -y direction. Due to gravity, it will fall so that z generally decreases along the trajectory.

 

id: six-digit code identifying the pitcher (see section III).

 

batter: six-digit code identifying the batter (see section III).

 

pitcher: same as id.

 

x, y: x and z (note the change of notation: the parameter y refers to the z coordinate) location of the pitch as it crosses the front of home plate. The units are camera pixels and are therefore not very useful. I recommend ignoring these two parameters and instead using p_x and p_z, defined below.

 

start_speed: speed of ball in mph at the starting position (defined below).

 

end_speed: speed of ball in mph as it crosses the front of home plate, located 1.417 ft from the point of home plate (i.e., at the coordinate y=1.417). Note the end_speed is less than start_speed due to the effect of air resistance.

 

sz_top, sz_bottom: lines of constant z (in ft) defining the upper and lower limits, respectively, of the strike zone. That is, these are the heights above home plate of the top and bottom of the strike zone. Currently, these parameters are set for each batter by the operator by visually observing the image from the center-field camera.

 

pfx_x, pfx_z: The deviation (in inches) of the pitch trajectory from a straight line in the x (horizontal) and z (vertical) directions between y=40 ft and the front edge of home plate, y=1.417 ft. It is important to note two things. First, the initial value is y=40 ft, regardless of the value of the initial position y0 (defined below). If the pitcher's release point had been used (approximately y=55 ft), then the deviation would have been nearly twice as large. Second, the effect of gravity has been removed from pfx_z, so that both parameters are the "break" of the pitch due to the Magnus force on a spinning baseball. Note that the online Gameday reports the quantity pfx, which is presumably sqrt(pfx_x^2 + pfx_z^2). Given our sign conventions, a positive value of pfx_x corresponds to a deviation to the catcher's right and a negative value to the catcher's left. Similarly, a positive value of pfx_z is a pitch that drops less than it would from gravity alone (most pitches fall in this category), whereas a negative value is a pitch that drops more than from gravity alone (e.g., a "12-6" curveball).
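
As a small illustration of the combined quantity and the sign conventions just described, here is a Python sketch. The function names are hypothetical, and the formula for pfx is the presumed one given above, not a confirmed definition.

```python
import math


def pfx_magnitude(pfx_x, pfx_z):
    """Presumed Gameday 'pfx': sqrt(pfx_x^2 + pfx_z^2), in inches."""
    return math.hypot(pfx_x, pfx_z)


def describe_pfx(pfx_x, pfx_z):
    """Put the sign conventions of pfx_x and pfx_z into words."""
    if pfx_x > 0:
        horizontal = "toward the catcher's right"
    elif pfx_x < 0:
        horizontal = "toward the catcher's left"
    else:
        horizontal = "no horizontal deviation"
    if pfx_z > 0:
        vertical = "drops less than from gravity alone"
    elif pfx_z < 0:
        vertical = "drops more than from gravity alone"
    else:
        vertical = "drops as from gravity alone"
    return horizontal + "; " + vertical


# Illustrative values only, not taken from a real pitch.
print(pfx_magnitude(3.0, 4.0))
print(describe_pfx(3.0, 4.0))
```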

 

p_x, p_z: location of pitch in the x and z coordinates, respectively, as it crosses the front of home plate, in units of ft. When you watch Gameday, this is the location of the dot on the screen that appears for each pitch. It is computed from the tracked trajectory. There is a one-to-one correlation between p_x and x and between p_z and y (see above description of x and y).

 

x0, y0, z0, vx0, vy0, vz0, ax, ay, az: These parameters are the most important ones in the database, since all others are computed using these. The parameters represent the result of making a least-squares fit to the measured trajectory assuming constant acceleration, for each of the three dimensions. The first three parameters are the initial positions in ft, the next three are the initial velocities in ft/s, and the last three are the accelerations (assumed constant) in ft/s^2. To calculate the full trajectory, use the formula x(t) = x0 + vx0*t + 0.5*ax*t^2, and a similar formula for y(t) and z(t), where t is the time. Note that y0, which is the distance from the point of home plate where the tracking begins, is 50 ft for the file discussed above, although earlier files used 55 ft or 40 ft. This point is the location where the start_speed is determined and from which the break parameters (see below) are calculated. Note that start_speed is just the square root of vx0^2 + vy0^2 + vz0^2, converted to mph.
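
The constant-acceleration formulas above translate directly into code. The Python sketch below is a minimal illustration; the function names and the dictionary layout are hypothetical, and the numbers in EXAMPLE_FIT are made up to have plausible signs and sizes, not taken from a real pitch.

```python
import math

FT_PER_S_TO_MPH = 3600.0 / 5280.0  # 1 ft/s expressed in mph
Y_PLATE_FRONT = 1.417              # front of home plate, in ft


def position(fit, t):
    """(x, y, z) in ft at time t (s): p(t) = p0 + v0*t + 0.5*a*t^2."""
    return tuple(
        fit[axis + "0"] + fit["v" + axis + "0"] * t + 0.5 * fit["a" + axis] * t * t
        for axis in "xyz"
    )


def start_speed_mph(fit):
    """Speed at the starting position, converted from ft/s to mph."""
    speed = math.sqrt(fit["vx0"] ** 2 + fit["vy0"] ** 2 + fit["vz0"] ** 2)
    return speed * FT_PER_S_TO_MPH


def time_to_plate(fit, y_end=Y_PLATE_FRONT):
    """Time (s) at which y(t) first reaches y_end, for a ball moving in -y."""
    a, b, c = 0.5 * fit["ay"], fit["vy0"], fit["y0"] - y_end
    if a == 0:
        return -c / b
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


# Made-up fit parameters, for illustration only.
EXAMPLE_FIT = {
    "x0": -2.0, "y0": 50.0, "z0": 6.0,
    "vx0": 5.0, "vy0": -130.0, "vz0": -5.0,
    "ax": -10.0, "ay": 25.0, "az": -20.0,
}

t_plate = time_to_plate(EXAMPLE_FIT)
print(start_speed_mph(EXAMPLE_FIT))
print(position(EXAMPLE_FIT, t_plate))
```

Evaluating position at the time returned by time_to_plate gives the x and z coordinates of the pitch at y=1.417 ft, which is the same quantity described above under p_x and p_z.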

 

break_y, break_angle, break_length: These quantities refer to a different definition of "break" than the quantities pfx_x, pfx_z defined above. They are arrived at as follows. A straight line is drawn from the starting location x0, y0, z0 to the final location defined by p_x, p_z and y=1.417 ft. Such a line determines a straight-line trajectory from starting point to ending point. That trajectory is then compared to the actual trajectory determined by the constant-acceleration fit to the data. The quantity break_length is the largest deviation, in inches, of the actual from the straight-line trajectory. The quantity break_y is the y-distance from home plate where the maximum deviation occurs. The quantity break_angle is the direction of the deviation, with the convention that a pitch that breaks away from or toward a RHH has a negative or positive angle, respectively; a break_angle of 0 is a pitch with no horizontal break and is typical of a straight fastball. Most pitches will have a break_angle between about -50° and +50°. An angle greater than 90° in absolute value is almost impossible, since that would imply an upward break, defeating gravity (the myth of the "rising fastball"). Both break_length and break_angle are shown on the Gameday screen. Also, the break_angle is indicated with an arrow that points straight down for 0°, down and to the left for a positive angle, or down and to the right for a negative angle. You can use the break_angle to find the x and z components of break_length as follows: break_x = break_length*sin(break_angle) and break_z = break_length*cos(break_angle). Note that break_x has the opposite sign from pfx_x because of the convention used to define the angle. That is, a pitch that breaks away from a RHH (i.e., toward the catcher's right) has a positive pfx_x but a negative break_x.
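
The decomposition of break_length into components can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes break_angle is given in degrees, so it is converted to radians before the trigonometric functions are applied.

```python
import math


def break_components(break_length, break_angle_deg):
    """Split break_length (inches) into (break_x, break_z).

    Assumption: break_angle_deg is in degrees, measured with the
    convention described above (0 means no horizontal break).
    """
    angle = math.radians(break_angle_deg)
    break_x = break_length * math.sin(angle)
    break_z = break_length * math.cos(angle)
    return break_x, break_z


# Illustrative values only: a 6-inch break at -30 degrees.
bx, bz = break_components(6.0, -30.0)
print(bx, bz)
```

A negative break_angle gives a negative break_x, which by the convention above is a break away from a RHH.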

 

 

 

III. Identifying pitchers and batters

 

If you go to the site with the specific game whose pitch logs you are downloading, http://gd2.mlb.com/components/game/mlb/year_2007/month_08/day_01/gid_2007_08_01_balmlb_bosmlb_1/, you will see a list of links. Click on players and you will see a players.xml file containing the names of the players on the rosters of each team and their associated six-digit numerical codes. You should "save page as", then load it into Excel and save it for future reference.
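
The lookup from code to name can also be done with a script. The Python sketch below assumes, without confirmation from the text above, that players.xml holds one element named player per person, with attributes id, first and last; the function name and the sample text are hypothetical. Check the attribute names against the file you actually saved.

```python
import xml.etree.ElementTree as ET


def player_names(xml_text):
    """Map each six-digit code to 'First Last'.

    Assumption: each person is a <player> element with attributes
    id, first and last. Elements missing an id are skipped.
    """
    root = ET.fromstring(xml_text.strip())
    names = {}
    for player in root.iter("player"):
        code = player.get("id")
        if code is None:
            continue
        first = player.get("first", "")
        last = player.get("last", "")
        names[code] = (first + " " + last).strip()
    return names


# Hypothetical sample in the assumed layout, using the pitcher from section I.
SAMPLE_PLAYERS = """
<game>
  <team>
    <player id="122201" first="Paul" last="Shuey"/>
  </team>
</game>
"""

names = player_names(SAMPLE_PLAYERS)
print(names["122201"])
```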