Automated Story Selection for Color Commentary in Sports

Greg Lee, Vadim Bulitko, and Elliot A. Ludvig

Abstract—Automated sports commentary is a form of automated narrative. Sports commentary exists to keep the viewer informed and entertained. One way to entertain the viewer is by telling brief stories relevant to the game in progress. We present a system called the sports commentary recommendation system (SCoReS) that can automatically suggest stories for commentators to tell during games. Through several user studies, we compared commentary using SCoReS to three other types of commentary and show that SCoReS adds significantly to the broadcast across several enjoyment metrics. We also collected interview data from professional sports commentators who positively evaluated a demonstration of the system. We conclude that SCoReS can be a useful broadcast tool, effective at selecting stories that add to the enjoyment and watchability of sports. SCoReS is a step toward automating sports commentary and, thus, automating narrative.

Index Terms—Artificial intelligence, automated narrative, information retrieval.

I. INTRODUCTION

SPORTS broadcasting is a billion dollar industry. Most professional sports are broadcast to the public on television, reaching millions of homes. The television experience differs in many ways from the live viewing experience, most significantly through its commentary.

Much research has been done into the importance of commentary during sports broadcasting. When watching a game on television, "the words of the commentator are often given most attention" [10]. The commentary has the effect of drawing the attention of the viewer to the parts of the picture that merit closer attention [16], an effect called italicizing [3]. Commentary can also set a mood during a broadcast. A commentator who creates a hostile atmosphere during a broadcast often makes the viewing experience more enjoyable for the viewer [6]. The descriptions given in a broadcast are so useful that fans often bring radios to live games in order to listen to the interpretations of the commentators [24]. Also, some sporting venues now support a handheld video device that provides the spectator with in-game commentary [30].

The purpose of commentators is to help the viewer follow the game and to add to its entertainment value. One way to add entertainment to a broadcast is to tell interesting, relevant stories from the sport's past. The sport of baseball is particularly suited to storytelling. Baseball is one of the oldest professional sports in North America, existing since 1876. This longevity provides a rich history from which to draw interesting stories. The more popular commentators are known as "storytellers" [33], as they augment the games they call by adding stories that connect baseball's past to its present. One of these "storytellers," Vin Scully, has been commentating for Brooklyn/Los Angeles Dodgers games since 1950 and has been voted the "most memorable personality in the history of the franchise."

A typical baseball game lasts about three hours, but contains only ten minutes of action, where the ball is live and something is happening on the playing field. This leaves two hours and fifty minutes of time where little is happening on the playing field, and it is the job of the commentators to entertain the viewer. This downtime is a good time to tell stories [31]. Baseball is also known for being a statistically rich league, and being able to match statistics from the current game state to a past situation in baseball adds to the commentators' ability to entertain the viewer.

To illustrate, consider the case where one baseball team is trailing by four runs in the ninth inning. As this is not a particularly interesting situation, it may be a good time for a story. An appropriate story might be that of the Los Angeles Dodgers, who on September 18, 2006, were also trailing by four runs in the bottom of the ninth inning. The Dodgers hit four consecutive home runs to tie the game. The broadcast team could tell this story to the viewer, because the situations are similar. Thus, what is needed is a mapping from a game state (a particular point in a game) to an appropriate story.

Sports storytelling is a form of narrative discourse. Narrative discourse is a creative activity that involves relaying a series of events in an interesting and entertaining manner. It is a recounting of contingent events from the past with one or more main characters. Automating narrative discourse is a challenging problem for AI and a subject of much recent research [36]. Our work focuses on one aspect of computational narrative—that of selecting the narrative components to include based on the current context. We feel the solution supplied here provides an important tool to the larger problem of computational narrative.

Manuscript received September 21, 2012; revised March 30, 2013; accepted June 10, 2013. Date of publication July 30, 2013; date of current version June 12, 2014. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and by the International Council for Open Research and Open Education (iCORE).

G. Lee was with the Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8 Canada. He is now with the NICHE Research Group, Dalhousie University, Halifax, NS B3H 3J5 Canada (e-mail: gmlee@ualberta.ca). V. Bulitko is with the Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8 Canada (e-mail: bulitko@ualberta.ca). E. A. Ludvig is with the Princeton Neuroscience Institute, Princeton University, Princeton, NJ 08540 USA (e-mail: eludvig@princeton.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCIAIG.2013.2275199

1943-068X © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

LEE et al.: AUTOMATED STORY SELECTION FOR COLOR COMMENTARY IN SPORTS 145

Our hypothesis is that sports story selection can be automated with AI. Specifically, we set out to test whether an AI approach can be developed that maps game states to relevant stories, thereby significantly increasing audiences’ enjoyment of the broadcast. To achieve this goal, we develop an AI system that tells stories in the context of baseball. The sports commentary recommendation system (SCoReS), equipped with some scored examples of story-game state pairs, learns offline to connect sports stories to game states. This learned mapping is then used during baseball games to suggest relevant stories to a (human) broadcast team or, in the absence of a broadcast team (e.g., in a sports video game), to autonomously output a relevant story to the audience. In the case of suggesting stories to commentators, SCoReS is an example of human–computer interaction [9], as it is a computer system that communicates to humans information that will improve their performance.

This research makes the following contributions to computational narrative. First, we formalize story-based sports commentary as a mathematical problem. Second, we use machine-learning methods and information retrieval techniques to solve the problem. This solution is for the specific domain of sports story selection, but is also general enough to be used in other domains involving story selection. Third, we implement the approach in the domain of baseball, evaluate the resulting AI system by observing feedback from human participants, and show that it is effective in performing two separate tasks: 1) automating sports commentary, and thus automating narrative in a special case; and 2) assisting human commentators. That is, we show that our combination of information retrieval techniques is able to map previously unseen baseball game states to stories in a sufficiently effective manner to improve the enjoyability of baseball broadcasts, increase interest in watching baseball, and suggest stories to professional commentators that they would tell during a live game.

This paper is an extension of [21]. Beyond the above additions of describing the connection to narrative discourse and SCoReS's contributions to AI, it provides full detail on each component of the SCoReS algorithm and the features used in experiments. It also provides a detailed example displaying the operation of SCoReS, a full related research section, and a complete list of parameters and results for the experiments. Finally, the future research and applications sections have been expanded, including the addition of possible solutions to challenges faced thus far in the development of SCoReS.

The rest of the paper is organized as follows. First, we describe color commentary in detail, and formulate the problem of mapping sports game states to interesting stories. This is followed by a review of research related to story-based commentary. Next, we describe information retrieval techniques in detail and describe our approach to mapping game states to stories. This approach combines information retrieval techniques designed to rank stories based on a given game state, and others that ensure the higher ranked stories are indeed relevant to said game state. We then describe empirical work performed to choose a story ranker and use this ranker to select stories for baseball broadcasts. The quality of the story mapping is evaluated with user studies and demonstrations to professional sports commentators. We conclude with a discussion of lessons

learned, future research and applications, and a summary of the contributions of this paper.

II. PROBLEM FORMULATION

Commentating in sports generally involves two people—a play-by-play commentator and a color commentator. Play-by-play commentating involves relaying to the audience what is actually happening on the field of play. Beyond reporting the actions of the players as they happen, the play-by-play commentator typically mentions such facts as the score of the game, upcoming batters and statistics for the teams and players involved in the game. Color commentary, on the other hand, is much more subjective and broad, with the purpose being to add entertainment (i.e., "color") to the broadcast. This can be done in several ways, as we describe below.

After a play is over, color commentators tend to analyze what has happened beyond the surface. For instance, if a player swings awkwardly at a pitch and misses, the color commentator may point out that the reason for the hitch in his swing is that he has an ankle injury and is unable to plant his foot when swinging. This gives the viewer some extra information beyond what he or she can see, or is told by the play-by-play commentator.

Yet another manner in which the color commentator adds to a broadcast is by giving background information on the players involved in the game. Whereas play-by-play commentators can provide a player's statistics, color commentators tend to add more personal information about a player, possibly including their own interactions with that player. Passing on their expertise in the sport at hand is one more way color commentators add to the broadcast.

Another rich aspect of color commentary is storytelling, which is our focus. Effective storytelling in sports broadcasts involves telling a story that is interesting to the audience, and that is related to what is actually happening in the game being broadcast. While play-by-play commentators are generally trained journalists, color commentators are typically former professional athletes or coaches. As former members of the league being shown, they are thought to bring a level of expertise to the broadcast. Because they actually played or coached in the league, color commentators tend to tell stories from their own experiences in the game. This gives the audience a firsthand account of snippets of baseball history, which can be quite entertaining. Unfortunately, color commentators do not have firsthand knowledge of most of baseball history. Their knowledge of stories, however vast for human beings, is still limited relative to the total set of baseball stories available, and they cannot necessarily connect the stories they do know to the game state at hand. This is where AI can be of assistance. Computers can both store many more stories than a human brain and quickly compute the quality of a match between each story in the library and the game state at hand. The problem we address in this paper is finding and matching interesting stories to serve as live commentary for a sports game.
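The selection task just described amounts to scanning a story library for the best-matching story. The following is a minimal illustration; the match-quality function here is a hypothetical placeholder standing in for the learned mapping developed later in the paper, and the feature names and stories are invented for the example.

```python
# Hypothetical sketch: pick the best-matching story for the current game state.
# match_quality is a toy stand-in for the learned 0-4 match score.

def match_quality(game_state: dict, story: dict) -> float:
    """Toy score: count the features on which the story agrees with the game."""
    return sum(1.0 for k in game_state if story.get(k) == game_state[k])

def best_story(game_state: dict, library: list) -> dict:
    """Scan every story in the library and return the closest match."""
    return max(library, key=lambda story: match_quality(game_state, story))

library = [
    {"inning": 9, "trailing_by": 4, "text": "2006 Dodgers four straight homers"},
    {"inning": 1, "trailing_by": 0, "text": "an opening-day anecdote"},
]
state = {"inning": 9, "trailing_by": 4}
print(best_story(state, library)["text"])  # -> 2006 Dodgers four straight homers
```

Unlike a human commentator, this exhaustive scan scales to arbitrarily large libraries; the hard part, addressed in Section IV, is learning a match-quality function that reflects human judgments.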

III. RELATED WORK

146 IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 6, NO. 2, JUNE 2014

In this section, we review existing research relevant to the problem we are solving. Some approaches deliver stories to nonsports video games (Section III-A), and others deliver live play-by-play commentary, with some added color (Section III-B). None of the existing work delivers story-based color commentary to a live game.

A. Storytelling in an Unpredictable Environment

Narrative generation explicitly attempts to automate storytelling by having computers create a story with an entertaining plot and characters. The ability to generate effective narrative is important in different applications, such as entertainment, training, and education [28], and this area has been studied for over 20 years [14]. There has been much work on automated storytelling in nonsports video games [29]. Some systems generate or adapt a story due to actions taken by the player, creating a so-called interactive drama. Automated story director (ASD) [27] is an experience manager that accepts two inputs: an exemplar narrative that encodes all the desired experiences for the player and a domain theory that encodes the rules of the environment. As the player can influence the narrative, ASD adapts the exemplar narrative to ensure coherence of the story and a feeling of agency for the player. These adaptations are preplanned based on the domain theory. Systems such as ASD actively change the course of the game based on user actions. As a commentary system cannot alter what happens in a live sports game (real or video game), these systems are not directly applicable to our problem.

Human commentators try to weave a narrative with a coherent plot through unpredictable sports games [31]. Plot is a difficult narrative dimension for the broadcast team, as they do not know what the end result of the game will be until it concludes. They can, however, attempt to predict what will happen and begin to build up a plot that fits with the predicted ending. This brings into play different themes that the broadcaster can choose to follow, and he or she can follow more than one at a time (hedging his or her bets, so to speak). In the broad scheme of narratives, the range of themes spans all of human experience. In baseball, these themes are limited, and they include: spirit, human interest, pity, gamesmanship, overcoming bad luck, old-college-try, and glory [31], [7]. Matching game states and stories thematically (or categorically) can be used to help gauge similarity between the two, and we investigate this in our approach.

B. Automated Commentating in Sports

StatSheet [2] and Narrative Science [11] automatically write previews for sports games that have not yet happened, and summaries about sports games that have already happened. For summaries, they are provided with statistics from a completed game and compose a narrative about the game, with the goal being to provide an interesting summary of game events. For previews, they are provided with statistics from past games, and compose a preview for the game that should entice the reader to watch said game. Neither StatSheet nor Narrative Science operates with live game data and both are unable to solve our problem of providing live stories during a game. They may, however, provide an additional potential database of stories about past games to augment our current system.

Modern commercial sports video games typically employ professional broadcast teams from television to provide commentary for the action in the game. This involves prerecording large amounts of voice data that will be reusable for many different games. As an example, a clip to the effect of "Now that was a big out!" could be recorded and used in multiple situations where one team was in dire need of retiring a batter or a runner. Recorded clips often use pronouns (i.e., "he" and "they") so that they are not specific to any particular player or team. This makes the commentary generic, which reduces the amount of recording required, as recording voice data can be time consuming and expensive. Unfortunately, generic commentary is less exciting to the audience. We would like our AI system to deliver colorful commentary tailored to the current game by mapping the current game state to a story chosen specifically for said game state.

The MLB: The Show [32] suite of games is popular among baseball video games [15]. Recorded clips of Matt Vasgersian of the Major League Baseball (MLB) network act as the play-by-play commentary while Dave Campbell of ESPN's Baseball Tonight is the color commentator. Most of the color commentary involves analysis of what has happened on the field and indirect suggestions to the player as to how to improve their play. As far as we have seen, there is no storytelling in these games. The problem we are trying to solve, automated story selection for sports, could provide an additional angle to the color commentary in these games.

In MLB 2K7 [1], another prominent baseball video game, Jon Miller and Joe Morgan (formerly of ESPN: Sunday Night Baseball) provide the play-by-play and color commentary, respectively. There is at least one instance of relating a game state to a story from baseball's past, occurring when a third strike is dropped by the catcher and he has to throw to first base to record the out. Here, Miller reminds us of a 1941 World Series game where Mickey Owens committed an error on a similar play, changing the course of the series. There appear to be a very limited number of stories told in the game, however, and a single story is used repeatedly. The mapping of the game state to the story appears to be hand-coded and based on just one feature (a dropped third strike).

Robot World Cup Soccer (RoboCup) is a research testbed involving robots playing soccer [17]. There is also a RoboCup simulation league, where the games are not physically played, but are simulated on a computer. Both the physical and simulation leagues provide researchers with a standard testbed in which to evaluate their AI strategies for various goals. In this context, previous work in automated commentary has focused primarily on automated play-by-play commentary. Byrne, Rocco, and MIKE [5] are three systems that produce automated play-by-play commentary for RoboCup simulator league games. The three systems obtain their data from the Soccer Server [17], which summarizes the gameplay's main features—the player locations and orientations, the ball location, and the score. Each system generates natural language templates, filling in player and team names where appropriate, then uses text-to-speech software to verbalize the derived commentary. Dynamic engaging intelligent reporter agent (DEIRA) is a similar system to Byrne, Rocco, and MIKE, as it performs the same task, but in the sport of horse racing.

There are some attempts within these systems to provide color commentary, but none go as far as to try to incorporate storytelling. That is, these systems tackle a problem that is different from what our work is aiming to solve—they automate factual commentary with some bias added, but do not implement color commentary via stories. Our system could be used in conjunction with each of these automated play-by-play systems to create fully automated commentary, featuring both play-by-play and color.

Within its live online game summaries, MLB uses a system called SCOUT [18] that provides textual analysis of the current game. The viewer is shown information such as the type of pitches thrown during an at-bat and the tendencies of the batter with respect to the pitcher. While SCOUT provides some color, it is mostly a statistical summary and currently does not tell stories. Our system could extend SCOUT by adding stories to MLB online game summaries.

Freytag and MacEwan [13] identify a five-stage pyramid of dramatic structure: exposition, rising action, climax, falling action, and denouement. Rhodes et al. [26] describe a system that follows this pyramid and adds dramatic commentary to sports video games by using language with different levels of emotional connotation depending upon the game situation. Values for the different motifs (themes) [7] are stored, as well as a vocabulary for each that changes as their level of intensity increases. Each theme has a set of actions that can occur in the game that either increase its intensity (Freytag's rising action) or decrease its intensity (Freytag's falling action). One theme within the system is "urgency," with some level 1 phrases being "looking shaky" and "fortune not on their side," and level 3 phrases being "doomed" and "beyond salvation." Lexicalized tree-adjoining grammar is used to generate comments. While drama is added to the commentary with this program, it is an augmentation to the play-by-play commentary more than an addition of color commentary. Our system could further augment the dramatic commentary by adding relevant stories. Using this dynamic as a feature could also help choose which stories are most relevant to the grand narrative arc.

IV. PROPOSED APPROACH

In this section, we present an AI approach to solving the problem of delivering story-based color commentary to a live baseball game. We start by framing the problem as an information retrieval problem (Section IV-A). We then describe the machine-learning techniques used (Sections IV-B and IV-C) and finally combine these techniques in Section IV-D.

A. Information Retrieval Framework

We approach the problem of automated story selection in sports as an information retrieval (IR) problem. In the context of sports, the game state for which we are seeking an appropriate story is treated as the query, while the candidate stories returned by the system are the documents. Our system's goal then is, given a game state, to return to the user a ranked list of stories, based on how appropriate they are for the game state. We assume that the game state is available live, as is the case for MLB [23]. We also assume that a database of stories has previously been collected.

Thus, the problem is to retrieve stories most appropriate for game states during live sports broadcasts. Once a story database has been obtained, a system must learn to match the stories to game states. The broader the story database, the more likely an appropriate story can be found that matches any given game state. As "being interesting" is an informal and subjective measure, we evaluate the quality of the mapping by incorporating the selected stories into a simulated broadcast and test the enjoyment of viewers and the interest of professional commentators.

To compare stories to game states, we extract features from both, such as the score, the teams involved, and what type of action is happening on the field (e.g., a home run). Formally, the game state s is a vector of n numeric features: s = (s_1, ..., s_n). To illustrate, binary feature s_1 may be 1 if in game state s there is a runner on first base. Integer feature s_2 can be the current inning. Similarly, a baseball story t can be described with a vector of m numeric features: t = (t_1, ..., t_m). Binary feature t_1 can be 1 if the story involves a runner on first base and integer feature t_2 can be the inning number mentioned in the story. The task is then to map s to a relevant and interesting t.

The match quality q(s, t) between a game state s and story t is scored on a five-point integer scale ranging from 0 for a completely inappropriate match to 4 for a perfect match. Thus, the problem is, given a game state s, to retrieve a story t of the highest possible match quality q(s, t).

B. Training Data

We solve this problem by using IR techniques and machine learning. Rather than separately feeding the game state and story features to IR algorithms, we make the connection between corresponding features more explicit. A similarity vector x is computed for a game state specified by feature vector s and a story specified by feature vector t. Each component of vector x is the result of comparing one or more features of s to one or more relevant features of t. Logical connectives are used to compare binary features. For instance, to match the runner on first base features of s and t, the biconditional over the corresponding features is used: x_1 = (s_1 <-> t_1). For nonbinary features, feature-specific functions are used. For instance, when comparing the current inning number s_2 and the inning number t_2 involved in a story, the similarity feature x_2 is calculated as a normalized function of the difference between the two inning numbers, where values closer to 1 indicate a closer pairing of inning features. Another example is the marquee matchup feature, valued between 0 and 1. It indicates how well the story and game state match in terms of a strong hitter and a strong pitcher being involved. This feature is calculated by combining several statistical features for the batter and the pitcher in s with a story category feature in t [20].

The similarity vector x indicates how related s and t are, but does not provide a scalar value. What is needed is a way to map x to q—the five-point-scale quality of the match between s and t. Machine-learning techniques are used to create this mapping from training data.

To build this training data set, a set of game state vectors is taken to form a set of similarity vectors for all game state vectors in the set and all story vectors from our story vector library. Each similarity vector is then labeled with the ground-truth value of the quality of the match between the corresponding game state and the story.
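The per-feature comparisons described above can be sketched as follows. The feature names and the inning formula are illustrative assumptions, not the paper's exact feature set; the biconditional for binary features reduces to an equality test.

```python
# Sketch of the similarity-vector computation: compare a game state s to a
# story t one component at a time. Feature layouts are illustrative only.

def similarity_vector(game_state: dict, story: dict) -> list:
    """Build the similarity vector x for a (game state, story) pair."""
    # Binary features: biconditional (1 when both agree, 0 otherwise).
    x1 = 1 if game_state["runner_on_first"] == story["runner_on_first"] else 0
    x2 = 1 if game_state["strikeout"] == story["strikeout"] else 0
    # Nonbinary features use feature-specific functions; here, inning
    # closeness is mapped into [0, 1] (an assumed formula, since the
    # paper's exact function is not reproduced in this excerpt).
    x3 = 1.0 - abs(game_state["inning"] - story["inning"]) / 9.0
    return [x1, x2, x3]

s = {"runner_on_first": 1, "strikeout": 0, "inning": 9}
t = {"runner_on_first": 1, "strikeout": 0, "inning": 9}
print(similarity_vector(s, t))  # -> [1, 1, 1.0]
```

Each component of the resulting vector is already a "how well do these agree" score, which is what lets the downstream learner treat the pair as a single example.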


Fig. 1. SCoReS chooses a story to output to a commentator.

Mathematically, x^{i,j} denotes the similarity vector for game state s^i and story t^j. For simplicity's sake in the rest of the paper, we refer to the story vector library simply as the story library, and to the set of game state vectors as the game state library. Also, as game states and stories are the equivalent of queries and documents in this work, we will use the former terms in the rest of the paper.

C. SCoReS Offline

IR algorithms are generally divided into three groups: pointwise, pairwise, and listwise [22]. Pointwise algorithms are regression and classification algorithms, with mean squared error (MSE) typically used as an error function. Pairwise algorithms perform an incomplete ordering on the data. Listwise algorithms make direct use of IR metrics to search for a good ranking of documents.

Our approach to selecting stories for game states is a hybrid approach that uses a listwise algorithm to rank stories and a pointwise algorithm to evaluate the top-ranked story. This two-step process involves a "ranker" and an "evaluator," as shown in Fig. 1. In this section, we describe the offline training and creation of the two elements of SCoReS, and then describe the online operation in detail (Section IV-D).

1) Machine Learning a Ranker: For the ranking algorithm, we adapted AdaRank [35], a listwise algorithm based on AdaBoost [12]. AdaRank forms a "strong" ranker by iteratively selecting and combining "weak" rankers. A weak ranker uses a single component of the similarity vector to rank (i.e., sort) the training data T. Each weak ranker has a "vote" on the final ranking, based on how its ordering of the training data was scored according to a chosen IR scoring metric.

SCoReS AdaRank (Algorithm 1) accepts as input a set of training data T, the number of game states G in T, an IR scoring function E, the number of weak rankers K to compose the strong ranker, and the number of tie-breaking features z to use. In line 1, SCoReS AdaRank first partitions T by its G game states. This is done because stories can be meaningfully sorted by the match quality values q only for a given game state. The ground-truth rankings are then calculated (line 2) for possible use in evaluating weak rankers (line 12). All weights are initialized to 1/G (line 3) as all game states are initially equally important. The main ranker R and its corresponding confidence values A are initialized to be empty sets (line 4). The set of feature combinations C to be considered for use in weak rankers is calculated based on the number of features in each similarity vector in T, and the number of features to use for tie-breaking (line 5).

Algorithm 1: SCoReS AdaRank. Input: T: training data; G: number of game states; E: IR scoring function; K: number of iterations; z: number of tie-breaking features. Output: R: ranker.

1: partition T by game state: T_1, ..., T_G
2: sort each T_i by its q values, yielding ground truths pi*_1, ..., pi*_G
3: initialize each weight w_i to 1/G
4: initialize ranker R and weak ranker confidences A to empty sets
5: get all combinations C of length z + 1 from the features of the similarity vectors
6: for each iteration up to K do
7:    best mean score <- 0
8:    for each combination c in C do
9:       if the first feature of c has not yet been used in R then
10:         for each game state i do
11:            sort T_i by c, with remaining ties broken randomly, yielding pi_i
12:            compute the weighted score w_i * E(pi_i, pi*_i)
13:         if the mean weighted score exceeds the best mean score then
14:            best mean score <- mean weighted score
15:            set this iteration's weak ranker to c
16:   add this iteration's weak ranker to R
17:   calculate its confidence alpha
18:   add alpha to A
19:   update the weights w

At each iteration of SCoReS AdaRank, elements of C whose first elements have not yet been used in R are considered as possible weak rankers (lines 6–15). Thus, each feature of the similarity vector may only be used once as the main sorter in a weak ranker. For each game state, the weighted score of sorting T_i by c (with any remaining ties broken randomly) is calculated, using the current weights (lines 10–12). The arguments to E's constituent scoring function vary based on which scoring metric is used, but all metrics we consider accept pi_i, the match qualities for the ordering of game state i by c, as input. If the mean weighted score for c is greater than the maximum encountered so far in this iteration, the feature combination to be used in the weak ranker for this iteration is set to c (lines 13–15). After evaluating all valid elements of C, the best weak ranker for this iteration is added to the main ranker (line 16) and A and w are updated (lines 18–19) as in [35]. The data are reweighted after each iteration so that game states for which stories have been poorly ordered are given more weight (line 19).
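The boosting loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: weak rankers here sort by a single similarity feature (tie-breaking features are omitted), NDCG plays the role of the scoring function E, and the data are invented for the example.

```python
# Simplified AdaRank-style loop: pick the single feature that best ranks
# stories for the weighted game states, then reweight poorly ranked states.
import math

def ndcg(quals):
    """NDCG of a list of match qualities, in the order the ranker produced."""
    dcg = sum((2 ** q - 1) / math.log2(i + 2) for i, q in enumerate(quals))
    ideal = sum((2 ** q - 1) / math.log2(i + 2)
                for i, q in enumerate(sorted(quals, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

def adarank(game_states, n_features, n_iters):
    """game_states: one list per state of (similarity_vector, match_quality)."""
    G = len(game_states)
    weights = [1.0 / G] * G          # all game states equally important at first
    ranker, alphas, used = [], [], set()
    for _ in range(n_iters):
        best_f, best_scores, best_mean = None, None, -1.0
        for f in range(n_features):
            if f in used:            # each feature is the main sorter only once
                continue
            scores = []
            for pairs in game_states:
                ordering = sorted(pairs, key=lambda p: p[0][f], reverse=True)
                scores.append(ndcg([q for _, q in ordering]))
            mean = sum(w * s for w, s in zip(weights, scores))
            if mean > best_mean:
                best_f, best_scores, best_mean = f, scores, mean
        used.add(best_f)
        ranker.append(best_f)
        # Weak ranker confidence (AdaRank's alpha); guard against division by 0.
        num = sum(w * (1 + s) for w, s in zip(weights, best_scores))
        den = sum(w * (1 - s) for w, s in zip(weights, best_scores))
        alphas.append(0.5 * math.log(num / max(den, 1e-12)))
        # Reweight: game states ranked poorly this round gain weight.
        exps = [math.exp(-s) for s in best_scores]
        total = sum(exps)
        weights = [e / total for e in exps]
    return ranker, alphas

game_states = [
    [([1.0, 0.0], 4), ([0.0, 1.0], 0)],   # feature 0 orders this state perfectly
    [([0.9, 0.2], 3), ([0.1, 0.8], 1)],
]
ranker, alphas = adarank(game_states, n_features=2, n_iters=1)
print(ranker)  # -> [0]
```

Feature 0 wins the first iteration because sorting both toy game states by it reproduces the ground-truth ordering, giving it the higher weighted NDCG.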

ranker (line 16), and the corresponding confidence value and the game-state weights are updated (lines 18–19) as in [35]. The data are reweighted after each iteration so that game states for which stories have been poorly ordered are given more weight (line 19).

As a small, concrete example, let the scoring metric be normalized discounted cumulative gain (NDCG) [35]. Let the similarity vectors consist of the following features: the runner on first base feature, the strikeout feature, the inning feature, and the marquee matchup feature. The first two are binary: they are 1 if both the story and game state involve a runner on first base (or both involve a strikeout), and 0 otherwise. The marquee matchup and inning features are computed as previously described. Table I shows possible training data split into its three constituent game states.

TABLE I
UNORDERED TRAINING DATA FOR SCORES ADARANK

The weight of each game state is initially set to 1/3. The best feature combination for the first iteration is the runner on first base feature used as the main sorter, with the strikeout feature as a tiebreaker. Sorting by this feature combination yields the orderings shown in Table II. Assuming we consider the top three ranking positions relevant, the NDCG scores for each game state in this ordering would be (0.97, 0.59, 0.89), with the weighted mean equal to the simple mean (as all weights are equal).

TABLE II
THE ORDERING OF THE TRAINING DATA AFTER BEING SORTED BY THE CHOSEN FEATURE COMBINATION. RANDOM TIE-BREAKING WAS NECESSARY FOR GAME STATE 2 (BETWEEN STORIES 1 AND 2)

The weak ranker for this iteration would thus be this feature combination, and it would not be considered as a main sorter in weak rankers in later iterations (as it is the main sorter here). The weak ranker's confidence alpha_t is calculated with the following formula (from [35]):

  alpha_t = (1/2) ln( sum_i P_t(i)(1 + E(pi_i)) / sum_i P_t(i)(1 − E(pi_i)) )

where pi_i is the weak ranker's ordering of game state i's stories, E(pi_i) compares the vector of match qualities for that ordering against the vector of match qualities for the ground truth, and P_t(i) is the weight of game state i at iteration t. The confidence for this iteration would thus be 1.13. The weights for each game state are then updated with the following formula (from [35]):

  P_{t+1}(i) = exp(−E(pi_i)) / sum_j exp(−E(pi_j))

giving the new weight for each game state. On the second iteration, game state 2 has more weight, as the weak (and main) ranker failed to sort it as well as it did game states 1 and 3. This yields a second weak ranker with its own confidence, which is added to the main ranker. After normalizing the confidences, the weak ranker using the inning feature to sort and the marquee matchup feature to break ties gets 53% of the vote, and the runner on first base feature combined with the marquee matchup feature tiebreaker gets 47% of the vote for the ranking output by SCoReS AdaRank.

2) Machine Learning an Evaluator: Though the ranker output by SCoReS AdaRank provides a ranked list of stories, it does not provide a value for the match quality of these stories. Thus, while there is always a top-ranked story for a game state, SCoReS AdaRank provides no indication as to how "good" the top-ranked story is. We added an evaluator to provide an estimate of the match quality. The evaluator can then be used as a threshold to ensure that the top-ranked SCoReS AdaRank story is worth telling.

In principle, it is possible to use the evaluator on its own to rank stories. We do not do this because of its pointwise nature: it uses MSE as a scoring metric, which treats each datum equally, with no preference for accuracy toward the top of a ranked list. Thus, accuracy at the bottom of the ranked list is regarded as being as important as accuracy at the top, making the evaluator alone less desirable as a ranker than the ranker–evaluator combination.

D. SCoReS Online

Our SCoReS system thus consists of a ranker (learned by SCoReS AdaRank) and an evaluator (learned by a pointwise algorithm), as shown in Algorithm 2.
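Before turning to Algorithm 2, the boosting arithmetic from the worked AdaRank example above can be sketched as follows. This is our own minimal illustration of the confidence and re-weighting formulas from [35], applied to the example's NDCG scores; the variable names are ours, and the small difference from the 1.13 reported in the text comes from rounding of the NDCG inputs.

```python
import math

# NDCG@3 scores E for the weak ranker's ordering of the three training
# game states, as in the worked example above.
E = [0.97, 0.59, 0.89]
P = [1 / 3, 1 / 3, 1 / 3]  # initial (uniform) game-state weights

# Confidence of the chosen weak ranker (AdaRank-style, from [35]).
num = sum(p * (1 + e) for p, e in zip(P, E))
den = sum(p * (1 - e) for p, e in zip(P, E))
alpha = 0.5 * math.log(num / den)

# Re-weight the game states: poorly ordered states gain weight.
expo = [math.exp(-e) for e in E]
P_next = [x / sum(expo) for x in expo]

print(round(alpha, 2))                 # ~1.15 with these rounded inputs
print([round(p, 2) for p in P_next])   # game state 2 now has the most weight
```

Consistent with the example, game state 2 (the state with the lowest NDCG, 0.59) receives the largest updated weight, so the next iteration focuses on the state that was sorted worst.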

150 IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 6, NO. 2, JUNE 2014

Algorithm 2: SCoReS. Input: the game states for a live game; our story library; the ranker from SCoReS AdaRank; the weak ranker confidences; the evaluator; and a threshold for the evaluator.

1: for each game state in the live game do
2:   if a story should be told in this game state then
3:     create similarity vectors, comparing the game state to each story in the library
4:     rank the similarity vectors with the ranker and confidences
5:     extract the top-ranked story
6:     if the evaluator scores the top-ranked story above the threshold then
7:       output the story to the broadcast team (or viewer)

SCoReS processes each game state in a given (novel) game (line 1). If the current game state is appropriate for a story, we create similarity vectors comparing it to each story in the library (line 3). Generally, any game state is appropriate for suggesting a story to human commentators, but when SCoReS operates autonomously (as in video games), we limit the number of stories output to the viewer by only allowing stories to be told after an appropriate "downtime" (typically 10–20 pitches). The similarity vectors are sorted with the ranker and confidences learned by SCoReS AdaRank offline (line 4). The top-ranked story is extracted in line 5 and then scored by the evaluator in line 6. If the story passes the provided threshold, and SCoReS is operating autonomously, the story is output to the viewer (line 7). If a broadcast team is using SCoReS, the top few stories can be output.

In order to build training data for SCoReS, we first downloaded MLB game statistics from MLB's XML site [23], with permission from MLB. For all experiments, the game state and story features were kept constant. The feature vector for each game state included statistics such as the score, the inning, the date, the teams involved, and various statistics for the current batter and pitcher (such as home runs and wins). Statistics for the previous pitch's batter and pitcher were also present, because stories are often told in reference to an event preceding the current game state. Details are found in [20].
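The online loop of Algorithm 2 can be sketched as follows. This is a minimal illustration only: the function names, the toy similarity feature, and the stand-in scoring are ours, not the paper's notation or implementation.

```python
from typing import Callable, Optional, Sequence

def suggest_story(game_state: dict,
                  stories: Sequence[dict],
                  similarity: Callable[[dict, dict], tuple],
                  rank: Callable[[list], list],
                  evaluate: Callable[[tuple], float],
                  threshold: float) -> Optional[dict]:
    """Sketch of Algorithm 2: rank stories for a state, then gate on quality."""
    vectors = [(similarity(game_state, s), s) for s in stories]  # line 3
    ordered = rank(vectors)                                      # line 4
    top_vec, top_story = ordered[0]                              # line 5
    if evaluate(top_vec) >= threshold:                           # line 6
        return top_story                                         # line 7
    return None  # top-ranked story judged not worth telling

# Toy stand-ins: one binary feature (shared "runner on first base"),
# best-first sorting, and a pretend match-quality score on a 0-4 scale.
similarity = lambda g, s: (int(g["runner_on_first"] == s["runner_on_first"]),)
rank = lambda vecs: sorted(vecs, key=lambda pair: pair[0], reverse=True)
evaluate = lambda vec: 4.0 * vec[0]

state = {"runner_on_first": True}
library = [{"name": "story A", "runner_on_first": False},
           {"name": "story B", "runner_on_first": True}]
best = suggest_story(state, library, similarity, rank, evaluate, threshold=3.0)
print(best["name"])  # story B
```

Raising the threshold above any attainable score makes the gate reject the top-ranked story, mirroring the folds in which the evaluator suppresses SCoReS AdaRank's choice.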

V. EMPIRICAL EVALUATION

To evaluate the quality of SCoReS, we conducted a series of empirical studies where we asked potential users to evaluate the system. We examined several possible applications of SCoReS: providing generic commentary, adding stories to existing commentary, picking context-appropriate stories, and assisting professional commentators in selecting stories. These results are a more comprehensive exposition of the results presented in [21].

We conducted several user studies to evaluate whether SCoReS improved broadcast quality in any of these applications. The first user study was conducted to ensure commentary was beneficial within our game library before we even tried to improve said commentary. Subsequent user studies tested whether adding stories to a broadcast within our game library helps, and whether intelligently selecting stories based on the game context works better than simply adding any story to the broadcast. A demonstration of the SCoReS system to professional commentators was also carried out. Their feedback is an estimate of whether SCoReS can be successfully deployed in a professional broadcast setting.

These experiments evaluated SCoReS in both its modes of operation: as an autonomous commentary tool within the user studies, and as an assistant to a human color commentator in the interviews with professional commentators. In the user studies, we inserted the top-ranked story from SCoReS into video clips from actual baseball games. We chose actual games because watching clips from a video game would likely not interest the participants, inducing boredom. Note that improving commentary from actual games appears to be a harder problem than improving sports video game commentary, as commentary in video games is prerecorded and generic, whereas commentary during an actual game is tailored to that particular game.

One hundred and ten stories were gathered from Rob Neyer's Big Book of Baseball Legends [25], Baseball Eccentrics [19], and Wikipedia. Stories ranged in year from 1903 to 2005. An example of a story from Rob Neyer's Big Book of Baseball Legends is the following, concerning a matchup between two baseball stars, Greg Maddux and Jeff Bagwell, in the 1990s:

"Leading 8-0 in a regular-season game against the Astros, Maddux threw what he had said he would never throw to Jeff Bagwell, a fastball in. Bagwell did what Maddux wanted him to do—he homered. So two weeks later, when Maddux was facing Bagwell in a close game, Bagwell was looking for a fastball in, and Maddux fanned him on a changeup away."

This story would have a high match quality with game situations that involve a large difference in scores (eight runs in the story) and a matchup between star players. Feature selection and categorization of the stories were done by hand. Story features were similar to game state features, but also included the category. Using our baseball expertise, and based on the features available from MLB's XML site, we chose the ten categories listed in Table III. While these categories are not those used in other work, we thought they were appropriate given the available data (i.e., it would be difficult to categorize a game state as "the Lucky Break Victory" as in [31] without video of the game, using only selected statistics).

TABLE III
THE STORY AND GAME STATE CATEGORIES USED IN OUR EXPERIMENTS

Building similarity vectors created a new feature set, consisting mostly of binary features (as described in Section IV-B): home run, sacrifice, single, double, triple, double play, strikeout, fly out, pop out, ground out, walk, intentional walk, hit by pitch, substitution, runner on first, runner on second, runner on third, one team (do the game state and story have at least one team in common?), and two teams. Confidence similarity features (how well do the game state and story match in terms of a category?) included marquee


matchup, great statistics for batter or pitcher, bad statistics for batter, bad statistics for pitcher, opening of inning, important games from history, big finish, blow out, and home run hitter in a one-run game. Other similarity features, scaled between 0 and 1, included balls, strikes, outs, inning, run difference, and month. We refer the reader to [20] for complete detail on the feature vectors; together, the full sets of game, story, and similarity features provided SCoReS with what we believed would be a sufficient number of features to choose stories.

A. Choosing a Ranker

In order to choose a ranker and an evaluator for SCoReS, we performed a leave-one-out cross-validation experiment. Training data consisted of 40 randomly selected game states from the 2008 MLB season and the 110 stories gathered from various sources. At each cross-validation fold, 4290 data (39 game states × 110 stories) were used for training, while 110 data (1 game state × 110 stories) were used for testing. The same story database was used in training and testing. Candidate evaluators were trained on the training data and then used within SCoReS at each fold.

For IR scoring metrics, we considered the following: NDCG, mean average precision (MAP), winner takes all (WTA) [35], expected reciprocal rank (ERR) [8], and return score (RS), a metric that simply returns the match quality of the top-ranked story. We tested the cross product of this set of scoring metrics and candidate parameter values, and also tested higher parameter values with a reduced set of scoring metrics. Cross-validation chooses the set of parameters that is best at predicting the score of unseen game state–story pairs. The best-performing instantiation of SCoReS had a decision tree as the evaluator, and the best set of parameters for AdaRank found using this process used NDCG as the scoring metric. To choose the actual ranker for SCoReS, we provided all 40 training data to SCoReS AdaRank. This produced the ranker shown in Table IV.

TABLE IV
RANKER PRODUCED BY SCORES ADARANK WITH 40 TRAINING DATA

This combination of ranker and evaluator output stories that averaged a 3.35 match quality. Stories were only output for 23 of the 40 folds because, in 17 folds, the evaluator (a decision tree) deemed the story chosen by SCoReS AdaRank to be of insufficient quality to output. As a comparison, a perfect selector (based on the ground-truth labelings, done by hand by a domain expert) would output stories with a 3.7 average match quality, outputting a story for all 40 folds. The decision tree, operating on its own, output stories that averaged a 2.3 match quality, outputting a story for all 40 folds. Thus, it does not provide as good a ranking as the ranker, but it does provide a scalar value within the hybrid approach that is used as a threshold to determine which stories are output by SCoReS.

These experiments show that SCoReS performs well with respect to outputting a story with a high match quality, which is one of the goals of the system. Related metrics for evaluating SCoReS are the enjoyment of viewers watching games containing commentary augmented by SCoReS and how useful professional commentators think SCoReS would be to them while they are commentating on live games. To evaluate SCoReS in these ways, we made the transition from cross-validation experiments to user studies and interviews with professional commentators.

B. User Studies

The true measure of success for SCoReS is how viewers perceive the stories it selected in the context of the game. We conducted two user studies to test whether commentary improves a broadcast, whether inserting stories into a broadcast makes it more enjoyable, and whether the added stories need to be in the proper context to add to the broadcast. Each user study involved participants watching video clips from two AAA (minor league) baseball games: the July 15, 2009, AAA All-Star game between the International League and the Pacific Coast League, and the April 7, 2011, game between the Buffalo Bisons and the Syracuse Chiefs. Each study involved three different types of commentary, depending upon which hypothesis we were testing. Each video clip was between three and six minutes in length, and the clips were always shown in chronological order. The order of the commentary, however, varied. After each clip, participants answered questions related to their enjoyment of the clip. At the end of the session, participants completed a background questionnaire.

1) User Study I—The Need for Commentary: In the first user study, we compared "SCoReS commentary" to two different types of commentary. For "no commentary," we removed the commentary from the broadcast and left the crowd noise.¹ The "original commentary" had voiceovers of the original professional commentary, with no stories inserted or present in the original commentary. Voicing over the commentary was necessary because we needed to insert stories into some types of commentary and, for consistency, needed the stories read in the same voice as the other commentary. Otherwise, any difference in preference might reflect a preference for certain voices, and not for the addition of the selected stories. The "SCoReS commentary" contained a story selected by our system.

We recruited 16 participants from the local community. To measure the quality of the different types of commentary, we evaluated participants' answers to the eight questions listed in Table V. For both games, each participant saw one clip each with "original commentary," "SCoReS commentary," and "no commentary." Thus, each participant saw six video clips in total.

¹The crowd noise was artificial, as we could not remove commentary without removing the crowd noise. After removing all the sound from the original broadcast, we added an audio file of crowd noise. This held for all types of commentary.
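The leave-one-out setup of Section V-A can be sketched as follows; this is our own illustration of the fold construction (40 game states × 110 stories, holding out one game state's pairs per fold), not the authors' code.

```python
# Leave-one-out folds over 40 game states x 110 stories: each fold trains
# on 39 x 110 = 4290 game-state/story pairs and tests on the held-out
# game state's 110 pairs.
game_states = list(range(40))
stories = list(range(110))

folds = []
for held_out in game_states:
    train = [(g, s) for g in game_states if g != held_out for s in stories]
    test = [(held_out, s) for s in stories]
    folds.append((train, test))

print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 40 4290 110
```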


TABLE V
USER STUDY QUESTIONS ASKED TO PARTICIPANTS IN USER STUDY I. PARTICIPANTS WERE ASKED TO RATE EACH QUESTION FROM 1 TO 7 (STRONGLY DISAGREE–STRONGLY AGREE)

For this experiment, SCoReS was trained on 35 game states and 88 stories (3080 data), as we did not yet have our full training data. The parameters for this experiment are shown in Table VI, with the ranker being that shown in Table VII. Jim Prime, who did the play-by-play for this study, is an author of several baseball books, including Ted Williams' Hit List, a book he coauthored with Ted Williams of the Boston Red Sox [34]. SCoReS had a database of 88 stories from which to choose stories for each game state.

TABLE VI
PARAMETERS FOR USER STUDY I

TABLE VII
THE RANKER PRODUCED BY SCORES ADARANK WITH 3080 DATA

Fig. 2 shows that "SCoReS commentary" ranked significantly higher than "no commentary" across all metrics except "Viewing this clip made me more interested in participating in sports." There were no significant differences between "SCoReS commentary" and "original commentary" in this experiment.

Fig. 2. Mean (± standard error of the mean) difference between "SCoReS commentary" and "no commentary." *** indicates a statistically significant difference.

2) User Study II—Baseball Fans Prefer SCoReS: Having collected evidence that commentary itself adds entertainment to a game broadcast, we changed the user study format, replacing "no commentary" with "mismatch commentary." For "mismatch commentary," we inserted a story in the same place as we would with SCoReS, but instead of the story being selected for that game state, it was actually a story chosen by SCoReS for a different game state, in the other game the participant saw. This manipulation kept the story pool consistent across the conditions, thereby controlling for the overall level of interest.

User Study II differed further from User Study I in the following ways. An extra clip was added before the clips of interest because pilot data indicated that the first clip in a sequence was always rated more poorly. Participants were screened with the question "Are you a baseball fan?" As baseball fans actually enjoy baseball games, we hypothesized that they would be likely to notice differences in commentary. Also, SCoReS is built to improve the enjoyment of a sport for a fan; if someone does not enjoy a sport to begin with, it may be difficult to change his or her mind. SCoReS had access to the full 40 training data, and output the ranker shown in Table IV. Parameters for this study are given in Table IX. Len Hawley, the play-by-play commentator for the Acadia University varsity hockey team (Wolfville, NS, Canada), did the play-by-play reading for these experiments. Clip ordering was counterbalanced across subjects, ensuring that each type of commentary appeared in each chronological position the same number of times for each game.

We recruited 39 students from the University of Alberta (Edmonton, AB, Canada) who were self-described baseball fans. For each of the two games, each participant saw one clip each with "original commentary," "SCoReS commentary," and "mismatch commentary." Fig. 3 shows the mean difference between "SCoReS commentary" and both "original commentary" and "mismatch commentary" for the second user study. "SCoReS commentary" was ranked significantly higher than "original commentary" on the "Viewing this clip made me more interested in watching baseball" metric. This shows that adding stories to commentary can improve a broadcast.
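The summary statistic plotted in Figs. 2 and 3 — a mean difference in ratings with its standard error — can be sketched as below. The ratings here are hypothetical, for illustration only; they are not data from the studies.

```python
import math
import statistics

# Hypothetical 1-7 ratings from five participants for two commentary types;
# the studies summarize such data as mean difference +/- SEM.
scores_a = [6, 5, 7, 6, 5]   # e.g., "SCoReS commentary"
scores_b = [4, 5, 5, 3, 4]   # e.g., "no commentary"

diffs = [a - b for a, b in zip(scores_a, scores_b)]
mean_diff = statistics.mean(diffs)
sem = statistics.stdev(diffs) / math.sqrt(len(diffs))

print(mean_diff, round(sem, 2))  # 1.6 0.51
```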