
ADVISE: Symbolism and External Knowledge for Decoding Advertisements
Keren Ye
Adriana Kovashka
Department of Computer Science
University of Pittsburgh

arXiv:1711.06666v1 [cs.CV] 17 Nov 2017

{yekeren, kovashka}@cs.pitt.edu

Abstract
In order to convey the most content in their limited space,
advertisements embed references to outside knowledge via
symbolism. For example, a motorcycle stands for adventure
(a positive property the ad wants associated with the product being sold), and a gun stands for danger (a negative
property to dissuade viewers from undesirable behaviors).
We show how to use symbolic references to better understand the meaning of an ad. We further show how anchoring ad understanding in general-purpose object recognition
and image captioning can further improve results. We formulate the ad understanding task as matching the ad image
to human-generated statements that describe the action that
the ad prompts and the rationale it provides for this action.
We greatly outperform the state of the art in this task. We also show additional applications of our learned representations for ranking the slogans of ads, and clustering ads according to their topic.

[Figure 1 image: training ads A (gun, symbol “danger”), B (motorbike, symbol “cool”), and C (bottle, symbol “cool”), and test ad D paired with the statement “I should buy this drink because it’s exciting.”]

Figure 1. Our key idea: Use symbolic associations shown in yellow (a gun symbolizes danger; a motorcycle symbolizes coolness)
and recognized objects shown in red, to learn an image-text space
where each ad maps to the correct statement that describes the
message of the ad. The symbol “cool” brings images B and C
closer together in the learned space, and further from image A and
its associated symbol “danger.” At test time (shown in orange), we
use the learned image-text space to retrieve a matching statement
for test image D. At test time, the symbol labels are not provided.

1. Introduction
Advertisements are a powerful tool for affecting human behavior. Product ads convince us to make large purchases,
e.g. for cars and home appliances, or small but recurrent purchases, e.g. for laundry detergent. Public service
announcements (PSAs) encourage different behaviors, e.g.
combating domestic violence or driving safely. To stand out
from the rest, ads have to be both eye-catching and memorable [74], while also conveying the information that the ad
designer wants to impart. All of this must be done in limited space (one image) and time (however many seconds the
viewer is willing to spend looking at an ad).
How can ads get the most “bang for their buck”? One
technique is to make references to knowledge that viewers
already have, in the form of e.g. cultural knowledge, conceptual mappings, and symbols that humans have learned
[58, 39, 61, 38]. These symbolic mappings might come
from literature (e.g. a snake symbolizes evil or danger),
movies (e.g. motorcycles symbolize adventure or coolness), common sense (a flexed arm symbolizes strength), or even
pop culture (Usain Bolt symbolizes speed).
In this paper, we describe how to use symbolic mappings
to predict the messages of advertisements. On one hand, we
use the symbol bounding boxes and labels from the Ads
Dataset of [23] as visual anchors to ideas outside the image. On the other hand, we use knowledge sources external to the main task, such as object detection, to better relate
ad images to their corresponding messages. These are both
forms of using outside knowledge which boil down to learning links between objects and symbolic concepts. We use
each type of knowledge in two ways, as a constraint or as
an additive component for the learned image representation.
We focus on the following multiple-choice task: Given
an image and several statements, the system must identify
the correct statement to pair with the ad. For example, for
test image D in Fig. 1, the system might predict the message
is “Buy this drink because it’s exciting.” Our method learns a joint image-text embedding that associates ads with their
corresponding messages. The method has three components: (1) an image embedding which takes into account individual regions in the image, (2) constraints on the learned
space computed from symbol labels and object predictions,
and (3) an additive expansion of the image representation
using a symbol distribution.
In more detail, we first use the symbol bounding boxes
from the Ads Dataset [23], without labels, to learn a region
proposal network. For each image, we compute its representation as a weighted average of the representations of its
important regions. It is this representation that we embed in the joint image-text space. Second, we constrain the learned
space using the sparse ground-truth symbol labels from the
Ads Dataset, and the predictions from a generic captioning
method not based on ads [29]. Images that have similar
symbol labels and similar predicted captions should project
close by in the learned space. Third, we add an adaptive additive refinement to our image representation that brings the
image representation closer to its corresponding statement.
Both the constraints and the additive component depend on
external knowledge in the form of symbols and object predictions, i.e. we show two ways to learn an embedding for
ads that rely on outside knowledge. We call our method
ADVISE: ADs VIsual Semantic Embedding.
We focus on public service announcements, rather than
product (commercial) ads. PSAs tend to be more conceptual and challenging, often involving multiple steps of reasoning. Quantitatively, 59% of the product ads in the dataset
of [23] are straightforward, i.e. would be nearly solved with
traditional recognition advancements. In contrast, only 33%
of PSAs use straightforward strategies, while the remaining 67% use a number of challenging non-literal approaches
to convey their message. Our method outperforms several
relevant baselines, including prior visual-semantic embeddings [35] and methods for understanding ads [23].
In addition to showing how to use external knowledge
to solve ads, we also demonstrate how recent advances in object recognition help with the ad-understanding task.
While [23] evaluates basic techniques for ad-understanding,
it does not make use of recent advances in computer vision, e.g. region proposals [18, 55, 42, 15], attention
[8, 73, 70, 60, 72, 54, 51, 43, 13, 79, 52], or image-text
embeddings [35, 6, 7, 11, 16, 53].
Note that symbols can be culture-dependent, and the
messages of ads might be interpreted differently by different
viewers. We do not deal with these nuisance factors. Symbols were annotated by workers in the United States, and
from the point of view of ad designers, a single message is
encoded. It is this message, and this US-based symbolism,
that we are interested in predicting.
To summarize, our contributions are as follows:
• We show how to effectively use symbolism to better understand ads.
• We show how to make use of noisy caption predictions
to bridge the gap between the abstract task of predicting
the message of an ad, and more accessible information such
as the objects present in the image. Detected objects are
mapped to symbols via a domain-specific knowledge base.

• We improve the state of the art in understanding ads
by 35% in the case of public service announcements, and
30% in the case of product ads.
• We show that for the “abstract” PSAs, conceptual
knowledge in the form of symbols helps more, while for the
more “straightforward” product ads, use of general-purpose
object recognition techniques is more helpful.
The remainder of the paper is organized as follows.
We briefly discuss related work in Sec. 2. In Sec. 3.1,
we describe the retrieval task on which we focus, and in
Sec. 3.2, we describe standard triplet embedding using the
Ads Dataset. In Sec. 3.3, we discuss the representation of
an image as a weighted combination of region representations, weighed by their importance via an attention model.
In Sec. 3.4, we describe how we use external knowledge to
constrain the learned space. In Sec. 3.5, we develop an optional additive refinement of the image representation, again
using external knowledge and symbols. In Sec. 4, we compare our method to state of the art methods, and conduct
extensive ablation studies. We conclude in Sec. 5.

2. Related Work
Advertisements and multimedia. The most related work to ours is [23], which proposes the problem of decoding ads,
formulated as answering the question “Why should I [action]?” where [action] is what the ad suggests the viewer
should do, e.g. buy something or behave a certain way
(e.g. help prevent domestic violence). The dataset contains 64,832 image ads. Annotations available include the
topic (product or subject) of the ad, sentiments and actions the ad prompts, rationales provided for why the action
should be done, symbolic mappings (signifier-signified, e.g.
motorcycle-adventure), etc. Considering the media domain
more broadly, [30] analyze in what light a photograph portrays a politician, and [31] analyze how the facial features
of a political candidate determine the outcome of an election. Also related is work in parsing infographics, charts
and comics [4, 33, 26]. In contrast to these, our interest is
analyzing the implicit references ads were created to make.
Vision and language and image-text embeddings. In recent years, there has been great interest in joint vision-language
tasks, e.g. image and video captioning [67, 32, 10, 29, 2, 73,
66, 65, 77, 71, 14, 52, 59, 9, 36], visual question answering
[3, 76, 44, 72, 60, 69, 63, 80, 81, 21, 68, 28, 64], and cross-domain retrieval [7, 6, 77, 40]. The latter often makes use of learned joint image-text embeddings, as we also do in this work. [35] uses a triplet loss where an image and its corresponding human-provided caption should be closer in the
learned embedding space than image/caption pairs that do
not match. [11] propose a bi-directional network to maximize correlation between matching images and text, akin to CCA [19]. [16] utilize the images and texts in Wikipedia
articles for self-supervision. While these achieve excellent
results, none of them consider images with implicit or explicit persuasive intent, as we do.
External knowledge for vision-language tasks. We propose to use external knowledge for decoding ads, in two
ways: via symbols that inherently refer to outside knowledge, and by using outside knowledge to learn to detect
symbols. [69, 68, 28, 81, 64] examine the use of knowledge bases for answering visual questions. [66] use external
sources to diversify their image captioning language model.
[49] learn to compose object classifiers by relating similarity in semantics to visual similarity. [45, 17] use knowledge
graphs or hierarchies to aid in object recognition. These
works all use mappings that are objectively/scientifically
grounded, i.e. lions are related to cats, lions are a type of
cat, etc. In contrast, we use cultural associations that arose
in the media/literature and are internalized by humans, e.g.
motorcycles are associated with adventure.
Region proposals and attention. We make use of region
proposals [18, 55, 42, 15] and attention [8, 73, 70, 60, 72,
54, 51, 43, 13, 79, 52]. Region proposals focus the job of
an object detector to regions likely to contain objects. Attention helps focus learning and prediction tasks on regions
likely to be relevant to the task; these regions are learned using backpropagation. We show that the regions over which
we compute attention must be specific to the domain of ads.

Modeling of human attention [25, 5, 27, 37, 78] and memorability [24, 34] are also relevant since ads are created to
draw and hold the viewer’s attention [74].

3. Approach
We learn a joint image-text embedding space where we
can evaluate the similarity between ad images and ad messages. We use symbols and external knowledge to constrain and refine this space in three ways. First, rather than
consider ad images as a whole, we represent an ad image
as a weighted average of the representations of its regions
(Sec. 3.3). Second, we enforce that images that have the
same symbol labels, or the same detected objects [29], map
close by in the learned space (Sec. 3.4). Finally, we propose
an additive refinement of the image representation via an
attention-masked symbol distribution (Sec. 3.5). In Sec. 4
we demonstrate the utility of each component.

3.1. Task and dataset
Our goal is to develop a method for understanding advertisements. Concretely, we consider the task of question-answering in the dataset of [23]. The authors formulated ad understanding as answering the question “Q: Why should I [action]? A: [one-word reason].” An example question-answer pair is “Q: Why should I speak up about domestic violence? A: bad.” The one-word reason is picked from a full-sentence reason, also available in the dataset. We believe that using a single word is insufficient to capture the
rhetoric of complex ads, so we slightly modify their task.
Rather than a softmax over 1000 words, we ask the system to pick which statement is most appropriate for the image. We retrieve statements in the format: “I should [action]
because [reason].” Using the same example, the statement
would be “I should speak up about domestic violence because being quiet is as bad as committing violence yourself.” Given an image, we rank 50 statements (3 related and
47 unrelated) based on their similarity to the image, in the
learned feature space.
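As a concrete illustration of the retrieval protocol, the sketch below ranks candidate statements by their distance to an image in a shared embedding space. It is a minimal, hypothetical example: the random vectors stand in for learned 200-D embeddings, and the function name rank_statements is ours, not from the paper.

```python
import numpy as np

def rank_statements(image_vec, statement_vecs):
    """Rank candidate statements by L2 distance to the image embedding.

    image_vec: (d,) L2-normalized image embedding.
    statement_vecs: (num_statements, d) L2-normalized statement embeddings.
    Returns statement indices sorted from best to worst match.
    """
    dists = np.linalg.norm(statement_vecs - image_vec, axis=1)
    return np.argsort(dists)

# Toy usage: 50 candidates (3 related, 47 unrelated) in a 200-D space.
rng = np.random.default_rng(0)
img = rng.normal(size=200)
img /= np.linalg.norm(img)
stmts = rng.normal(size=(50, 200))
stmts /= np.linalg.norm(stmts, axis=1, keepdims=True)
print(rank_statements(img, stmts)[:3])  # top-3 retrieved statement indices
```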

3.2. Basic image-text triplet embedding
As a foundation for our approach, we first directly learn
an embedding that optimizes for the desired task. We require that in the feature space, the distance between an image and its corresponding statement should be smaller than
the distance between that image and any other statement, or
between other images and that statement. In other words,
the loss that is being minimized is
L(v, t; \theta) = \sum_{i=1}^{N} \big[ \|v_i^a - t_i^p\|_2^2 - \|v_i^a - t_i^n\|_2^2 + \beta \big]_+ + \sum_{i=1}^{N} \big[ \|t_i^a - v_i^p\|_2^2 - \|t_i^a - v_i^n\|_2^2 + \beta \big]_+    (1)

where v indicates the visual embedding we are learning and t indicates the text embedding; v_i^a, v_i^p, t_i^a, t_i^p correspond to the same ad, and v_i^n, t_i^n correspond to a different ad.
In order to extract visual embedding v from image x, we
first extract the image’s CNN feature (1536-D) from [62],
then use a fully connected layer to project it to the 200-D
joint embedding feature space. Given w ∈ R^{200×1536}, the parameter of the fully connected layer,

v = w \cdot \mathrm{CNN}(x)    (2)

The text embedding vector t is a summation of individual word embedding vectors [47, 48, 46]. Both the image and the text features are L2-normalized. We set the hyperparameter β to 0.4 in Eq. (1) using the results of preliminary experiments. To make the training process converge faster, we use a twist on the hard negative mining approach of [12, 57].
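To make Eq. (1) concrete, here is a minimal NumPy sketch of the bidirectional hinge loss with β = 0.4. It assumes a batch of matched image/statement embedding pairs and draws negatives in-batch by rolling the pairing, which is a simplification of the hard-negative mining of [12, 57]; the helper names are ours.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def triplet_loss(v, t, beta=0.4):
    """Bidirectional hinge loss in the spirit of Eq. (1).

    v: (N, 200) image embeddings; t: (N, 200) statement embeddings;
    row i of v matches row i of t. Negatives are taken in-batch by
    rolling the pairing (a simplification of hard-negative mining).
    """
    v, t = l2_normalize(v), l2_normalize(t)
    t_neg = np.roll(t, 1, axis=0)   # a mismatched statement for each image
    v_neg = np.roll(v, 1, axis=0)   # a mismatched image for each statement
    d = lambda a, b: np.sum((a - b) ** 2, axis=1)
    img_to_txt = np.maximum(0.0, d(v, t) - d(v, t_neg) + beta)
    txt_to_img = np.maximum(0.0, d(t, v) - d(t, v_neg) + beta)
    return np.sum(img_to_txt + txt_to_img)
```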

[Figure 2 diagram: the main branch performs region proposal and attention weighing to produce a 200-D image embedding; the knowledge branch performs knowledge inference and symbol embedding via a knowledge base (KB); the two are summed and used for triplet training.]
Figure 2. Our image embedding model with knowledge branch. In the main branch (top), multiple image regions are proposed by the region proposal network. Attention weighing is then applied on these regions and the embedding of the image is computed as a weighted sum of the regions’ embedding vectors. The knowledge branch (bottom) predicts the existence of symbols, maps these to 200-D, and adds them to the image embedding (top).

3.3. Embedding using symbol regions
Since ads are carefully designed, they may involve complex narratives with several distinct components, i.e. several
regions in the ad might need to be interpreted individually
first, before we can reason about them jointly to infer the
message of the ad. Thus, we represent an image as a collection of its constituent important regions, using an attention
module to learn the weights for each region.
We learn a region proposal network using the symbol bounding boxes of [23]. The idea is that ads draw
the viewer’s attention in a particular way, and the symbol
bounding boxes without symbol labels can be used to approximate this. This label-agnostic method is a new use of
[23]’s symbolism data that has not been explored before.
We use a pre-trained network [42, 22, 20] and fine-tune it
using the symbol bounding box annotations. We show in
Sec. 4 that this fine-tuning is crucial.
To further model the viewer’s attention, we also incorporate the bottom-up attention mechanism [64, 1], which is a
weighing among region proposals.
In more detail, we extract CNN features for each detected ad region x_i, i ∈ {1, . . . , K = 10}, in the image x. We then use a fully connected layer to project each region-based feature to: 1) a 200-D embedding vector v_i (Eq. 3), and 2) a confidence score a_i saying how much the region should contribute to the final representation (Eq. 4). In Eq. (3) and Eq. (4), w ∈ R^{200×1536} and w_a ∈ R^{1×1536}. The final embedding vector z is a weighted sum of these region-based vectors, weighed by their confidence scores (Eq. 5):

v_i = w \cdot \mathrm{CNN}(x_i)    (3)

a_i = w_a \cdot \mathrm{CNN}(x_i), \quad \alpha = \mathrm{softmax}(a)    (4)

z = \sum_{i=1}^{K} \alpha_i v_i    (5)

This intuitive idea was also used in [64] for visual question answering, but in our case, the attention distribution does not depend on questions. In Fig. 2, we show how we use bottom-up attention to weigh the different regions.
The loss used to learn the image-text embedding is the same as in Eq. (1), but defined using the region-based image representation z instead of v: L(z, t; θ).
We demonstrate that (1) learning a region proposal network with attention, and (2) learning from symbol bounding boxes, greatly help the statement retrieval task. In particular, statement ranking results are worse if we use a generic pre-trained region proposal network. We argue that general-purpose object detection models cannot capture nuance in ads since they ignore uncommon or abstract objects.
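A minimal sketch of Eqs. (3)-(5), assuming the K = 10 region CNN features (1536-D) have already been extracted; the random matrices below merely stand in for the learned projections w and w_a.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def region_embedding(region_feats, w, w_a):
    """Attention-weighted image embedding of Eqs. (3)-(5).

    region_feats: (K, 1536) CNN features of the K proposed ad regions.
    w: (200, 1536) projection into the joint space (Eq. 3).
    w_a: (1, 1536) projection to per-region confidence scores (Eq. 4).
    """
    v = region_feats @ w.T                 # (K, 200) per-region embeddings
    a = (region_feats @ w_a.T).ravel()     # (K,) attention logits
    alpha = softmax(a)                     # Eq. (4)
    z = alpha @ v                          # Eq. (5): weighted sum, (200,)
    return z / np.linalg.norm(z)           # L2-normalized, as in Sec. 3.2

# Toy usage with random matrices standing in for learned parameters.
rng = np.random.default_rng(0)
K, D = 10, 1536
z = region_embedding(rng.normal(size=(K, D)),
                     rng.normal(size=(200, D)) * 0.01,
                     rng.normal(size=(1, D)) * 0.01)
```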

3.4. Constraints via symbols and captions
In Sec. 3.2, we described how we learn a joint image-text space using triplet loss defined over pairs of images
and their corresponding reason statements. Since symbols
provide additional information that humans use to decode
ads, we now propose additional constraints to our triplet
loss such that two images (and their statements) that were
annotated with the same symbol are closer in the learned
space than images (and statements) annotated with different symbols. The extra loss term we use to constrain the
training process is shown in Eq (6) where s is the 200-D
embedding of symbol labels.
L_{sym}(s; \theta) = \sum_{i=1}^{N} \big[ \|s_i^a - z_i^p\|_2^2 - \|s_i^a - z_i^n\|_2^2 + \beta \big]_+ + \sum_{i=1}^{N} \big[ \|s_i^a - t_i^p\|_2^2 - \|s_i^a - t_i^n\|_2^2 + \beta \big]_+    (6)

By applying the new constraints, the model converges
faster and the training process becomes more stable. At the
same time, we explicitly embed symbol labels in the same
feature space as images and statements. These symbol embedding vectors serve as entry points for external knowledge and will be discussed further in Sec. 3.5.
Further, note that there is some regularity in terms of
the objects that ads with similar rhetoric portray. For example, environment ads often feature animals, safe driving
ads feature cars, beauty ads feature faces, drink ads feature
bottles, etc. The Ads Dataset contains insufficient data to
properly learn about object categories. Thus, to ground the
embeddings that our model learns, we use DenseCap [29]
to “annotate” the images with captions, then we create additional constraints out of these “annotations.” If two images/statements have similar DenseCap predicted captions,
they should be closer than images/statements with different captions. The extra loss term we use to constrain the training process is shown in Eq. (7).
L_{obj}(c; \theta) = \sum_{i=1}^{N} \big[ \|c_i^a - z_i^p\|_2^2 - \|c_i^a - z_i^n\|_2^2 + \beta \big]_+ + \sum_{i=1}^{N} \big[ \|c_i^a - t_i^p\|_2^2 - \|c_i^a - t_i^n\|_2^2 + \beta \big]_+    (7)

Note that the object/DenseCap embedding model does
not share weights with the statements’ embedding model
since the meaning of the same surface words may vary in
these two different domains.
The same object can be used to make different points,
e.g. faces in beauty ads vs domestic violence ads, cars in car
ads vs safety ads, etc. Similarly, symbol labels in isolation
do not tell the full story of the ad. Thus, we reduce the
impact of the symbol-based and object-based constraints by
weighing the corresponding loss by 0.1.
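The constraint terms of Eqs. (6) and (7) share the hinge form of Eq. (1), with a symbol (or caption) embedding as the anchor. A minimal sketch of that shared form is below; the helper names are ours, and the 0.1 down-weighting is applied when the terms are added to the main loss (Eq. 8).

```python
import numpy as np

def hinge_sum(anchor, pos, neg, beta=0.4):
    d = lambda a, b: np.sum((a - b) ** 2, axis=1)
    return np.sum(np.maximum(0.0, d(anchor, pos) - d(anchor, neg) + beta))

def constraint_loss(anchor_emb, z_pos, z_neg, t_pos, t_neg, beta=0.4):
    """Shared form of Eqs. (6) and (7).

    anchor_emb: (N, 200) symbol embeddings s (Eq. 6) or caption
    embeddings c (Eq. 7); z_*: image embeddings; t_*: statement
    embeddings. Positives share the anchor's symbol/caption label,
    negatives do not.
    """
    return (hinge_sum(anchor_emb, z_pos, z_neg, beta)
            + hinge_sum(anchor_emb, t_pos, t_neg, beta))

# total_loss = main_triplet_loss + 0.1 * L_sym + 0.1 * L_obj
```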
Note that [50] also propose to use labels to constrain an
embedding space. However, we show in our experiments
that it is not sufficient to use any type of label in the domain
of interest. We experiment with another type of label from
[23]’s dataset, namely the topic of the ad (e.g. what product
it is selling), and show that symbols give a greater benefit.

3.5. Additive external knowledge
In this section, we describe how to make use of external
knowledge via a symbol representation that is adaptively
added to the image representation to compensate for inadequacies of the image embedding. This external knowledge can
take the form of a mapping between physical objects and
implicit concepts, or a classifier mapping pixels to concepts.
Assume we are viewing a challenging ad whose meaning is not immediately obvious. The only thing we can do
is to use our human experience to find some evidence. For
example, do the visual cues remind us of concepts we have
seen in other ads? This is how external knowledge helps us
to decode the ad. Our model is able to interpret ads in the
same way: based on an external knowledge base, it infers the abstract symbols. Because the model knows the exact meaning of these symbols (it already knows their embedding vectors, see Sec. 3.4), it is able to
reconstruct the image representation using these symbols’
embedding vectors by weighing them. Fig. 2 (bottom)
shows the general idea of the external knowledge branch.
It is worth mentioning that our model only uses external
knowledge to compensate for its own lack of knowledge, and it assigns small weights to uninformative symbols.
We propose two ways to additively expand the image
representation with external knowledge. Both ways are a
form of knowledge base (KB) mapping physical evidence
to concepts. The first is to directly train classifiers to link certain visuals to the symbolic concept. More specifically, we use the 53-way multilabel symbol classifier u_symb from [23] as 53 individual classifiers. We obtain a symbol distribution y_symb = sigmoid(u_symb · x). We learn a weight α_j^symb for each of the j ∈ {1, . . . , C = 53} classifiers, denoting the confidence of that classifier. Therefore, the learned model of the knowledge branch is an attention model weighing the 53 symbol classifiers. The attention weights are adjusted depending on whether a particular symbol is helping in the statement matching task.
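A minimal sketch of this first variant, assuming the 53-way classifier weights u_symb from [23] act linearly on the 1536-D image feature, plus a learned per-classifier attention vector and a 53-to-200-D projection (the projection matrix is our notation; the paper only states that the masked 53-D distribution is projected to 200-D):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classifier_knowledge_branch(x, u_symb, alpha_symb, w_proj):
    """First additive-knowledge variant: mask the 53-D symbol
    distribution with learned per-classifier attention, then project
    it into the 200-D joint space.

    x: (1536,) image CNN feature; u_symb: (53, 1536) symbol classifier
    weights; alpha_symb: (53,) learned classifier confidences;
    w_proj: (200, 53) projection into the embedding space.
    """
    y_symb = sigmoid(u_symb @ x)     # 53-D symbol distribution
    masked = alpha_symb * y_symb     # attention mask over the classifiers
    return w_proj @ masked           # 200-D term added to the image embedding
```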
The second method is to learn associations between actual objects in the image (surface words for detected objects) and abstract concepts of symbols. For example, what
type of ad might I see a “car” in? What about a “rock” or
“animal”? We first construct a knowledge base associating
object words to symbol words. We compute the similarity
in the learned text embedding space between symbol words
and DenseCap words, then create a mapping rule (object
implies symbol) for each symbol and its five most similar DenseCap words. This results in a 53 × V matrix u_obj, where V is the size of the DenseCap vocabulary. Each row contains 5 entries of 1 denoting the mapping rule, and V − 5
entries of 0. Examples of learned mappings are shown in
Table 6.
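A sketch of how the 53 × V rule matrix u_obj could be assembled from the learned text embeddings. The cosine similarity and the helper name are our assumptions; the paper only specifies that each symbol is linked to its five most similar DenseCap words.

```python
import numpy as np

def build_object_symbol_kb(symbol_embs, densecap_embs, top_k=5):
    """Build the binary 53 x V rule matrix u_obj: for each symbol word,
    mark its top_k nearest DenseCap words in the learned text space.

    symbol_embs: (53, 200) embeddings of the symbol words.
    densecap_embs: (V, 200) embeddings of the DenseCap vocabulary.
    """
    s = symbol_embs / np.linalg.norm(symbol_embs, axis=1, keepdims=True)
    d = densecap_embs / np.linalg.norm(densecap_embs, axis=1, keepdims=True)
    sim = s @ d.T                                  # (53, V) cosine similarities
    u_obj = np.zeros_like(sim)
    nearest = np.argsort(-sim, axis=1)[:, :top_k]  # top_k DenseCap words per symbol
    np.put_along_axis(u_obj, nearest, 1.0, axis=1)
    return u_obj
```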
For a given image, we use [29] to predict the 3 most probable words in the DenseCap vocabulary, and put the results in a multi-hot vector y_obj ∈ R^{V×1}. We then matrix-multiply to accumulate evidence for the presence of all symbols using the detected objects: y_symb = u_obj · y_obj. We associate a weight α_jl^symb with each rule in the KB, j ∈ {1, . . . , C = 53}, l ∈ {1, . . . , V}, which encodes the importance of the rule, i.e. to what extent the l-th word in the DenseCap vocabulary symbolizes the j-th symbol (e.g. to what extent “rock” is used to illustrate “natural”).
For both methods, we first use the attention weights α^symb as a mask, then project the 53-D symbol distribution y_symb into 200-D, and add it to the image embedding.
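Continuing the sketch, the object-based branch accumulates symbol evidence from the detected DenseCap words through u_obj and the per-rule weights; the exact way the per-rule mask and the 200-D projection are combined is our assumption.

```python
import numpy as np

def object_knowledge_branch(detected_word_ids, u_obj, alpha_rules, w_proj):
    """Second additive-knowledge variant: accumulate symbol evidence
    from detected object words via the rule matrix u_obj.

    detected_word_ids: indices of the 3 most probable DenseCap words.
    u_obj: (53, V) binary rule matrix; alpha_rules: (53, V) learned
    per-rule weights; w_proj: (200, 53) projection into the joint space.
    """
    V = u_obj.shape[1]
    y_obj = np.zeros(V)
    y_obj[detected_word_ids] = 1.0            # multi-hot object vector
    y_symb = (alpha_rules * u_obj) @ y_obj    # weighted symbol evidence, (53,)
    return w_proj @ y_symb                    # 200-D term added to the embedding
```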
This additive branch is most helpful when the information it contains is not already contained in the main image
embedding branch. We find that this happens when the discovered symbols are rare. This poses a learning challenge for our object-to-symbol mapping method. In order
to learn attention weights on the full 53×V matrix, we must
have enough data, but if we have enough data, the additional
branch is not likely to be active. Breaking this dependency
is the subject of our future work.

3.6. ADVISE: our final model
Our final ADs VIsual Semantic Embedding loss combines the losses from Sec. 3.2, 3.3, 3.4, and 3.5:
L_{final}(z, s, c, t; \theta) = L\Big( \sum_{i=1}^{K} \alpha_i v_i + y_{symb},\ t; \theta \Big) + 0.1\, L_{sym}(s; \theta) + 0.1\, L_{obj}(c; \theta)    (8)

[Figure 3 examples. First example: VSE ON ADS retrieves “I should wear Revlon makeup because it will make me more attractive”; ADVISE (ours) retrieves “I should stop smoking because it doesn’t make me pretty.” Second example: VSE ON ADS retrieves “I should buy Ben & Jerry’s ice cream because they treat the cows that give the milk for their ice cream well”; ADVISE (ours) retrieves “I should report domestic abuse because ignoring the problem will not make anything better.”]
Figure 3. The performance of our ADVISE method compared to the strongest baseline, VSE ON ADS. In the first example, the baseline is tricked into thinking this is a beauty ad, as is the intent of the ad designer for creating a more dramatic effect. In the second example, the baseline may have gotten confused by the purple colors often used in Ben & Jerry’s ads.

4. Experimental Validation
We evaluate to what extent our proposed method is able
to match an ad to its intended message (see Sec. 3.1). We
compare our method to the following approaches from recent literature:
• ONE-WORD, the QA method from [23] which uses
symbols in a different, less effective way. The goal is to
predict a one-word answer to the question “Why should the
viewer [action]?”, e.g. “Why should the viewer buy this
car?” An answer might be e.g. “reliable” or “fast”. The
method combines three features: the VGG embedding of
the image, an LSTM embedding of the question, and a distribution over 53 symbols using a symbol classifier. In order
to adapt this method to our task, we take the predicted one
word, and use Glove similarity to rank the statement options
in terms of their similarity to this word.
• the Visual-Semantic Embedding (VSE) from [35],
trained using Flickr30K [75] and COCO [41]. Note that
more recent image-text joint embedding methods exist, but
these use complex architectures [11] or are specialized to
particular applications [16, 6, 7]. We focus on VSE as a
simple general-purpose embedding.
• VSE ON ADS uses the same method as [35] but trains
it on around 39,000 images from the Ads Dataset and more
than 111,000 associated statements (training set size varies
for different folds), as described in Sec. 3.2.
We compute two metrics: Recall@3, which denotes the number of true statements ranked within the Top-3, and Rank, which is the average ranking value of the highest-ranked true matching statement (the best possible rank is 0, i.e. the statement is ranked first). We expect a good model to have a high Recall@3 and a low Rank score.
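For clarity, a sketch of how the two metrics could be computed per test image, under our reading of the definitions above (Recall@3 counts the true statements in the top 3; Rank is the 0-indexed position of the best-ranked true statement):

```python
def eval_example(ranked_ids, true_ids):
    """Metrics for one test image, following the definitions in Sec. 4.

    ranked_ids: statement indices sorted best-to-worst by the model.
    true_ids: indices of the 3 related statements.
    Returns (recall_at_3, rank); rank 0 means the best-ranked true
    statement was retrieved first.
    """
    true = set(true_ids)
    recall3 = sum(1 for s in ranked_ids[:3] if s in true)
    rank = min(i for i, s in enumerate(ranked_ids) if s in true)
    return recall3, rank

# Averaging over all test images gives the Recall@3 and Rank of Table 1.
```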
We use five random splits of the dataset into
train/validation/test sets, and show mean results and standard error over a total of 62,468 test cases.
We show the improvement that our method produces
over state of the art methods, in Table 1. Since public service announcements (e.g. domestic violence campaigns and
anti-bullying campaigns) typically use different strategies

and sentiments than product ads (e.g. ads for cars and coffee), we separately show the result for PSAs and products.
We observe that our method greatly outperforms the prior
relevant research. PSAs in general appear harder than product ads, consistent with our argument in Sec. 1. Compared
to VSE [35], our method improves Recall@3 and Rank by
five times for PSAs, and three/five times for products. Compared to the strongest baseline, VSE ON ADS, we improve
Rank by 35% for PSAs, and 30% for product ads. Note that
in Table 1 we show the better of the two alternative methods
in Sec. 3.5, namely the symbol classifier. Qualitative results
are shown in Fig. 3.
We also conduct ablation studies to verify the benefit of
the components of our method.
• GENERIC REGION embedding using image regions
from a generic region proposal network [22] trained on the
COCO [41] detection dataset.
• SYMBOL BOX embedding and ATTENTION (Sec. 3.3)
• SYMBOL/OBJECT constraints (Sec. 3.4)
• additive knowledge (Sec. 3.5), first predicting objects
and mapping to symbols (KB OBJECTS) or directly predicting symbols via training data as a KB (KB SYMBOLS)
The results are shown in Table 2 for PSAs, and Table 3
for products. We also show percent improvement of each
new component. Improvement is computed with respect to
the previous row, except for KB OBJECTS and KB SYMBOLS, whose improvement is computed with respect to the
third-to-last row, i.e. the method on which both KB methods are based. The largest increase in performance comes
from focusing on individual regions within the image to understand the ad’s story. This makes sense because ads are
carefully designed and multiple elements work together to
convey the message. Qualitative examples showing the impact of regions are shown in Fig. 4. Especially for PSAs,
these regions must be learned specifically on the ads domain to further increase performance substantially.
Beyond this, the story that the results tell differs between
PSAs and products. Our key idea to rely on external knowledge and symbols is more helpful for the challenging, abstract PSAs that are the focus of our work. In contrast,
general-purpose computer vision techniques help more for
product ads that rely on straightforward strategies.
Interestingly, attention helps for product ads, but slightly
hurts for PSAs. It appears PSAs tell their “story” holistically, so subselecting individual regions is detrimental.
Finally, the additive inclusion of external information
helps more when we directly predict the symbols, but also
when we first extract objects and map these to symbols.
Note that given the plethora of object recognition resources,
KB OBJECTS is much cheaper in terms of human effort, as
KB SYMBOLS required over 64,000 symbol labels. In contrast, KB OBJECTS simply relies on mappings between

Method           Recall@3 ↑ (PSA)   Recall@3 ↑ (Product)   Rank ↓ (PSA)     Rank ↓ (Product)
VSE [35]         0.313 ± 0.010      0.504 ± 0.003          10.817 ± 0.177   7.394 ± 0.036
ONE-WORD [23]    0.697 ± 0.017      0.653 ± 0.004          7.934 ± 0.208    6.336 ± 0.036
VSE ON ADS       1.220 ± 0.018      1.511 ± 0.004          3.139 ± 0.095    2.112 ± 0.019
ADVISE (OURS)    1.507 ± 0.018      1.726 ± 0.004          2.032 ± 0.076    1.474 ± 0.016

Table 1. Our main result. We observe our method greatly outperforms three recent methods in retrieving matching statements for each ad. Recall@3: higher is better; Rank: lower is better.

Method            Rec@3 ↑   Rank ↓   % impr. Rec@3   % impr. Rank
VSE ON ADS        1.220     3.139
GENERIC REGION    1.384     2.414    13              23
SYMBOL BOX        1.452     2.159    5               11
+ ATTENTION       1.450     2.237    0               -4
+ SYMBOL/OBJECT   1.487     2.128    3               5
+ KB OBJECTS      1.488     2.102    0               1
+ KB SYMBOLS      1.507     2.032    1               5

Table 2. Ablation study on PSAs. All external knowledge components (i.e. all except attention) give a boost over the naive VSE.

Method            Rec@3 ↑   Rank ↓   % impr. Rec@3   % impr. Rank
VSE ON ADS        1.511     2.112
GENERIC REGION    1.668     1.649    9               22
SYMBOL BOX        1.694     1.549    2               6
+ ATTENTION       1.725     1.491    2               4

Table 3. Ablation study on products. General-purpose recognition approaches, e.g. regions and attention, produced the main boost. The symbol-based method components in Sec. 3.4 and Sec. 3.5 produced small or no improvements.

                 PSA                   Product
Method           Rec@3 ↑   Rank ↓      Rec@3 ↑   Rank ↓
Symbol labels    2         4           0         1
Topic labels     1         4           0         0

Table 4. % improvement for different types of labels as constraints.

[Figure 4 image: region proposals from our ads-trained network (“Ours”) and from the generic COCO-trained network (“Generic”).]
Figure 4. Visualization of region proposals for PSAs. Note how our proposals based on ads focus on relevant regions of the image, e.g. the smoke (which can often be a symbol) and the tip of the cigarette (left), the wound (middle), and the region of damage in the forest (right). The generic COCO boxes straddle the boundaries of meaningful regions in the ad.

Method           Hard stmt (↓)   Slogan (↓)   Clustering (↑)
VSE [35]         9.676           9.564        0.173
ONE-WORD [23]    8.725           7.365        N/A
VSE ON ADS       4.642           3.108        0.293
ADVISE (OURS)    3.835           2.336        0.356

Table 5. Other tasks that our learned image-text embedding helps with. We show Rank for the first two tasks (lower is better), and homogeneity score [56] for the third task (higher is better). N/A indicates the method does not learn an embedding.

object and symbol words, which can be obtained much more
efficiently as they are not image-dependent. Thus, KB OB JECTS would likely generalize better to a new domain of
ads (or ads in a different culture) where the symbol training
data from the Ads Dataset is not available.
In Table 4, we show that not just any type of label suffices as a constraint. In particular, even though the Ads Dataset
includes 6 times more topic labels that could be used as
constraints compared to symbol labels, symbol labels give
much greater benefit. Thus, [50]’s approach is not enough;
the type of labels must be carefully chosen.
In Table 5, we demonstrate the versatility of our learned
embedding, compared to the baselines from Table 1. None
of the methods were retrained, i.e. we simply used the pre-trained embedding used for the results in Table 1. First, we again perform a statement retrieval task, but make the task
harder. In particular, all statements that are to be ranked
are from the same topic (i.e. all statements are about car
safety or about beauty products). The second task uses creative captions that MTurk workers were asked to write for
2000 ads in [23]. We perform a retrieval task among these
slogans, using an image as the query. Finally, we check
how well an embedding clusters ad images with respect to a
ground-truth “clustering” defined by the topics of ads. For
example, if two images show faces but one is a domestic
violence ad while the other is a beauty ad, would an embedding accurately place these ads in different clusters despite
their visual similarity? We see that our method greatly outperforms all other embeddings.
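As an illustration of the clustering evaluation, the sketch below clusters image embeddings with k-means and scores the result against the topic labels using scikit-learn's homogeneity score. The choice of k-means and of one cluster per topic are our assumptions; the paper only names the homogeneity measure [56].

```python
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score

def topic_clustering_score(image_embs, topic_labels, seed=0):
    """Cluster ad embeddings and score them against topic labels,
    as in the third column of Table 5."""
    n_clusters = len(set(topic_labels))  # assumed: one cluster per topic
    preds = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(image_embs)
    return homogeneity_score(topic_labels, preds)
```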
To summarize, the main takeaways from our quantitative
experiments are as follows:

• Our ADVISE method greatly exceeds the state of the
art for the task of retrieving statements that describe the ad’s
message correctly (VSE [35] and ONE-WORD [23]).
• The region embedding greatly improves upon the traditional embedding, even though the latter is trained directly
for the task of interest.
• Relying on symbol boxes always helps as symbol
boxes indicate how ads draw the viewer’s attention.
• For PSAs, symbol labels as additional constraints help
further; importantly, they help not just because they are
proxies for the metric learning [50], but because they capture the idea of the ads. In contrast, another type of label
that is also from the Ads Dataset but much coarser (topics) does not help as much. Regularizing with external info,
e.g. predictions about objects, also provides a benefit.
Finally, in Table 6, we show some qualitative results
demonstrating the utility of anchoring our learned space
with DenseCap predictions. We consider three vocabularies: the 53 symbol words from [23], the 27,999 unique
words from the action/reason question-answers, and the
823 unique words from the DenseCap annotations. In the
learned space, we compute the nearest neighbors for each
symbol, statement, and DenseCap word, to establish rough
synonymy. This is the knowledge base used in Sec. 3.5.
These discovered results can be used as a “dictionary”
showing the meaning of ads. In other words, if I see a given
object, what should I predict the message of the ad is?
We begin with an intuitive result in the first triplet (ID
1). When an ad designer wants to allude to comfort, they
might use objects such as a soft bed, where “soft” is a statement word (“Why should I buy this? Because it’s soft.”) In
ID 3-6, we demonstrate the evidence in terms of different
symbols and different objects, for statements that are given.
The first column essentially tells us the meaning of a visual,
and the last column tells us how a concept is illustrated. To
illustrate “coolness”, one might show sunglasses, and these
might refer to adventure. If the statement contains the word
“driving,” then perhaps this is a safe driving ad, where visuals in the ad allude to safety, danger, or speed, while physically the visuals contain cars, tires, and windshields. Ads
about “home” show houses and kitchens, but these refer to
safety, family, and comfort. Freedom, family and relaxation
are “American” concepts alluded to by flags. In ID 6, we
see further evidence that PSAs (on health, art and violence)
are more conceptual (the viewer needs to think).
In IDs 8-10, we see the intuitive context and symbolism associated with “meat” and “rocks,” and the double role
that “faces” can play (in beauty and domestic violence ads).
Finally, observe the different role of “tomato” (ID 11) vs
“ketchup” (ID 7): the former symbolizes health while the
latter is associated with flavor and hotness.

ID | Symbol | Statement | DenseCap
1 | comfort | couch, sofa, soft, bed, comfy | pillow, bed, blanket, couch, rug
2 | speed, excitement, adventure, power, fashion | cool | sunglasses, sleeve, jacket, carrying, scarf
3 | safety, danger, injury, speed, death | driving | car, windshield, van, tire, license
4 | comfort, relaxation, christmas, safety, family | home | house, cabinet, kitchen, bush, sink
5 | freedom, vacation, relaxation, sex, family | american | flag, persons, mustard, papers, striped
6 | health, death, art, injury, violence | think | kites, lamp, art, bike, design
7 | delicious, hot, food, strong, hunger | ketchup | beer, pepper, sauce, jar, juice
8 | hunger, food, delicious, hot, desire | food, meal, steak, meals, roast | meat
9 | environment, nature, adventure, travel, strong | wilderness, outdoors, terrain, rugged, rover | rock
10 | violence, humor, love, desire, strong | make-up, makeup, maybelline, eyeliner, covergirl | face
11 | food, healthy, hunger, delicious, variety | salads, food, salad, menu, toppings | tomato

Table 6. Discovered synonym triplets between symbol words, action/reason words, and DenseCap words. The single word (the word appearing alone in a table cell) is the query.

5. Conclusion
We presented a method for matching image advertisements to statements which describe the idea of the ad. Our
method uses external knowledge in the form of symbols
and predicted objects in two ways, as constraints for a joint
image-text embedding space, and as an additive component
for the image representation. We also verify the effect of
state of the art computer vision techniques in the form of
region proposals and attention for the task of automatically
understanding ads. Our method outperforms existing techniques by a large margin. In the future, we will investigate
further external resources for decoding ads, such as predictions about the memorability or human attention over ads,
and use our object-symbol mappings to analyze the variability that the same object category exhibits when used for
different ad topics.

References
[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson,
S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. CoRR, abs/1707.07998,
2017. 4
[2] L. Anne Hendricks, S. Venugopalan, M. Rohrbach,
R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without
paired training data. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2016. 2
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra,
C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question
answering. In The IEEE International Conference on Computer Vision (ICCV), December 2015. 2
[4] Z. Bylinskii, S. Alsheikh, S. Madan, A. Recasens, K. Zhong,
H. Pfister, F. Durand, and A. Oliva. Understanding infographics through textual and visual tag prediction. arXiv
preprint arXiv:1709.09215, 2017. 2
[5] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba,
and F. Durand. Where should saliency models look next? In
European Conference on Computer Vision, pages 809–824.
Springer, 2016. 3
[6] Y. Cao, M. Long, J. Wang, and S. Liu. Deep visual-semantic
quantization for efficient image retrieval. In CVPR, 2017. 2,
6
[7] K. Chen, T. Bui, C. Fang, Z. Wang, and R. Nevatia. Amc:
Attention guided multi-modal correlation learning for image
search. In CVPR, 2017. 2, 6
[8] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation.
In Computer Vision and Pattern Recognition (CVPR). IEEE,
2016. 2, 3
[9] T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and
M. Sun. Show, adapt and tell: Adversarial training of crossdomain image captioner. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. 2
[10] J. Donahue, L. Anne Hendricks, S. Guadarrama,
M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual
recognition and description. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June
2015. 2
[11] A. Eisenschtat and L. Wolf. Linking image and text with
2-way nets. In CVPR, 2017. 2, 3, 6
[12] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler. Vse++:
Improved visual-semantic embeddings.
arXiv preprint
arXiv:1707.05612, 2017. 3
[13] J. Fu, H. Zheng, and T. Mei. Look closer to see better: recurrent attention convolutional neural network for fine-grained
image recognition. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2017. 2, 3

[14] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng. Stylenet:
Generating attractive visual captions with styles. In The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 2
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic
segmentation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 580–587,
2014. 2, 3
[16] L. Gomez, Y. Patel, M. Rusinol, D. Karatzas, and C. V. Jawahar. Self-supervised learning of visual features through embedding images into text topic spaces. In CVPR, 2017. 2, 3,
6
[17] W. Goo, J. Kim, G. Kim, and S. J. Hwang. Taxonomyregularized semantic deep convolutional neural networks. In
European Conference on Computer Vision, pages 86–101.
Springer, 2016. 3
[18] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn.
In The IEEE International Conference on Computer Vision
(ICCV), Oct 2017. 2, 3
[19] H. Hotelling. Relations between two sets of variates.
Biometrika, 28(3/4):321–377, 1936. 3
[20] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 4
[21] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko.
Learning to reason: End-to-end module networks for visual
question answering. In The IEEE International Conference
on Computer Vision (ICCV), Oct 2017. 2
[22] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara,
A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and
K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 4,
6
[23] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas,
Z. Agha, N. Ong, and A. Kovashka. Automatic understanding of image and video advertisements. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
July 2017. 1, 2, 3, 4, 5, 6, 7, 8
[24] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What
makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1469–1482,
2014. 3
[25] L. Itti, C. Koch, and E. Niebur. A model of saliency-based
visual attention for rapid scene analysis. IEEE Transactions
on pattern analysis and machine intelligence, 20(11):1254–
1259, 1998. 3
[26] M. Iyyer, V. Manjunatha, A. Guha, Y. Vyas, J. Boyd-Graber,
H. Daume, III, and L. S. Davis. The amazing mysteries of the
gutter: Drawing inferences between panels in comic book
narratives. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), July 2017. 2
[27] M. Jiang, S. Huang, J. Duan, and Q. Zhao. Salicon: Saliency
in context. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1072–1080,
2015. 3

[28] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman,
L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and executing programs for visual reasoning. In ICCV, 2017. 2,
3
[29] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully
convolutional localization networks for dense captioning.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016. 2, 3, 4, 5
[30] J. Joo, W. Li, F. F. Steen, and S.-C. Zhu. Visual persuasion: Inferring communicative intents of images. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2014. 2
[31] J. Joo, F. F. Steen, and S.-C. Zhu. Automated facial trait judgment and election outcome prediction: Social dimensions of
face. In Proceedings of the IEEE International Conference
on Computer Vision, pages 3712–3720, 2015. 2
[32] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In The IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), June 2015. 2
[33] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi,
and A. Farhadi. A diagram is worth a dozen images. In
European Conference on Computer Vision, pages 235–251.
Springer, 2016. 2
[34] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva. Understanding and predicting image memorability at a large
scale. In Proceedings of the IEEE International Conference
on Computer Vision, pages 2390–2398, 2015. 3
[35] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying
visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014. 2,
3, 6, 7, 8
[36] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. In The IEEE
International Conference on Computer Vision (ICCV), Oct
2017. 2
[37] M. Kümmerer, T. S. Wallis, and M. Bethge. Deepgaze ii:
Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563, 2016. 3
[38] J. H. Leigh and T. G. Gabel. Symbolic interactionism: its effects on consumer behaviour and implications for marketing
strategy. Journal of Services Marketing, 6(3):5–16, 1992. 1
[39] S. J. Levy. Symbols for sale. Harvard business review,
37(4):117–124, 1959. 1
[40] X. Li, D. Hu, and X. Lu. Image2song: Song retrieval via
bridging image content and lyric words. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
2
[41] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B.
Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and
C. L. Zitnick. Microsoft COCO: common objects in context.
CoRR, abs/1405.0312, 2014. 6
[42] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector.
In European conference on computer vision, pages 21–37.
Springer, 2016. 2, 3, 4

[43] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when
to look: Adaptive attention via a visual sentinel for image
captioning. In CVPR, 2017. 2, 3
[44] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about
images. In The IEEE International Conference on Computer
Vision (ICCV), December 2015. 2
[45] K. Marino, R. Salakhutdinov, and A. Gupta. The more
you know: Using knowledge graphs for image classification.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), July 2017. 3
[46] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient
estimation of word representations in vector space. arXiv
preprint arXiv:1301.3781, 2013. 3
[47] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and
J. Dean. Distributed representations of words and phrases
and their compositionality. In C. J. C. Burges, L. Bottou,
M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26,
pages 3111–3119. Curran Associates, Inc., 2013. 3
[48] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities
in continuous space word representations. In HLT-NAACL,
pages 746–751, 2013. 3
[49] I. Misra, A. Gupta, and M. Hebert. From red wine to red
tomato: Composition with context. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), July
2017. 3
[50] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and
S. Singh. No fuss distance metric learning using proxies. In
ICCV, 2017. 5, 7, 8
[51] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for
multimodal reasoning and matching. In CVPR, 2017. 2, 3
[52] M. Pedersoli, T. Lucas, C. Schmid, and J. Verbeek.