Creating a Fine-Grained Corpus for Chinese Sentiment Analysis


Yanyan Zhao, Bing Qin, and Ting Liu, Harbin Institute of Technology


Existing corpora are almost all sentence-level works that ignore important global sentiment information in other sentences. Given the rise of advanced applications, more fine-grained corpora are needed, especially for Chinese.

The Web holds a considerable amount of user-generated content describing the opinions of customers on products and services through reviews, blogs, tweets, and so on. These reviews are valuable for customers making purchasing decisions and for companies guiding business activities. However, browsing the extensive collection of reviews and finding useful information is a time-consuming and tedious task. Consequently, sentiment analysis and opinion mining have attracted significant attention in recent years, paving the way for automatic analysis of reviews and extraction of the information most relevant to users.

Sentiment analysis entails several interesting and challenging tasks. A fundamental one is polarity classification (determining the polarity of a sentence or a document), but this sort of task is coarse-grained and can’t provide detailed information.1–3 Recently, there has been a shift toward fine-grained tasks that not only search for opinionated text but also analyze its polarity (positive, neutral, negative) and intensity (weak, medium, strong, extreme), identifying the associated source or opinion holder, as well as the topic, target entity, or aspect of the opinion.4–7

Many of these tasks are based on statistical and machine learning algorithms, making annotated, fine-grained corpora necessary for measuring algorithm performance and as training data for supervised machine learning.
Some public corpora exist for fine-grained tasks.7,9–11 Although they provide more detailed sentiment information than their coarse-grained counterparts (which contain polarity labels only), their annotation schemes are generally at the sentence or expression level. But not all useful information co-occurs in the same sentence, meaning these schemes ignore important global sentiment information. For example, in the sentiment sentence, “The image quality is so good,” “image quality” can be labeled as an aspect, but the owner of this “image quality” isn’t mentioned. Such information is more important in practical applications and might be found in other sentences in the document.
Unfortunately, some useful sentence-level sentiment information also hasn’t drawn special attention or been annotated in the existing public corpora. For example, most of the current corpora annotate the aspect-opinion pair, but more than 25 percent of aspects can’t be matched with a corresponding opinion word. Sentences containing these aspects might then be discarded in the annotation process or incorrectly treated as non-sentiment. Take the sentiment sentence, “The lens is a piece of work,” as an example: it has no aspect-opinion pair but clearly shows an opinion on the aspect “lens.” Quite a few researchers have mentioned this problem, but they haven’t treated it as an independent task, and the proposed methods are very simple and not ideal.12

In this article, we discuss the problems underlying existing product review corpora, briefly survey the research area, and propose a new annotation scheme that’s more fine-grained than existing ones and can provide global sentiment information that’s more useful and important in advanced applications. More importantly, from analyzing our annotated corpus, we can explore vital but ignored tasks to provide useful hints about future research directions. As a case study, we present a Chinese corpus on two kinds of products: a digital camera and a mobile phone.

January/February 2015

Analysis of Existing Sentiment Analysis Corpora
Coarse-grained tasks have many public corpora. Cornell Movie Review Data
pabo/movie-review-data), for example, is a commonly used sentiment analysis corpus that includes three datasets (sentiment polarity datasets, sentiment scale datasets, and subjectivity datasets) that are all used for sentiment classification tasks. Stanford’s Large Movie Review Dataset (LMRD; http://
is used for binary sentiment classification and contains substantially more data than previous benchmark datasets. Models for coarse-grained sentiment analysis tasks, such as classifying a sentence or document into several polarities or ratings, are generally trained on these corpora. However, many practical applications pay more attention to the product or the particular aspect to which the polarity is linked. This necessitates fine-grained tasks and corresponding fine-grained corpora.

Table 1. Problems in the existing corpora.*

Target entity ignoring: (The appearance is very beautiful,) (and the 3-inch screen looks great.)
Implicit polarity: (The color fits my taste.)
Implicit aspect: (So expensive!)

*Red text indicates aspect, whereas blue text is the polarity word.

A basic fine-grained task is to extract an aspect in a text and identify its
polarity. In the sentence, “The picture
quality is good,” for example, we can
extract “picture quality” as the aspect
and recognize “positive” as the polarity tag. Product Review Data (PRD)7
includes several product review datasets that are annotated with the product
aspect and its polarity and intensity in
every sentence. Another Movie Review
Data9 is labeled with aspect-opinion
pairs (such as ).
Another corpus in the literature10 is annotated at both sentence and expression levels, filtering individual sentences
according to whether they’re an opinion and identifying opinion expressions
including the opinion holder, modifiers,
and so on at the expression level. Differing from product or movie reviews,

the multiperspective question-answering (MPQA;
opinion corpus contains news articles
from a wide variety of sources manually annotated for opinions and other
private states (beliefs, emotions, sentiments, speculations, and so on). This
dataset annotates the agent, expressive subjectivity, target entity, attitude, and
other fine-grained elements.
Compared with these abundant English resources, Chinese sentiment analysis corpora are limited. The most popular one is from the Chinese Opinion Analysis Evaluation (COAE),11 which contains product, movie, and finance reviews and is annotated with aspect-opinion pairs and their polarities. Although these fine-grained corpora can provide more detailed sentiment information than the coarse-grained ones, their annotation schemes are generally at the sentence or expression level.

A primary problem is thus the omission of important information outside the sentence. In most cases, the aspect’s target entity isn’t in the same sentence. We call this the “target entity ignoring” problem. For example, in the first line of Table 1, each of the two sentiment sentences is annotated with an aspect-opinion pair, but the target entity that corresponds to the aspect (“appearance” or “screen”) isn’t annotated. Target entity is clearly an important element for sentiment analysis and opinion mining and should be included in the annotation scheme, yet statistics indicate that about 90 percent of aspects don’t appear with the target entity in the same sentence, and that two to three target entities are present in each review. Hence, recognizing the corresponding target entity for each aspect is a necessary task that also requires appropriate corpora.

Moreover, several useful sentence-level sentiment elements haven’t drawn enough attention, nor have they been annotated in the existing public corpora. We illustrate some representative ones as follows.
Implicit Polarity

Existing corpora are mostly annotated with aspect-opinion pairs. However, among the sentiment sentences, not all aspects are modified by polarity words. The second line in Table 1 is an example: the sentence doesn’t contain an aspect-opinion pair, but it actually expresses a “positive” opinion on the aspect “color.”

As mentioned earlier, more than 25 percent of aspects can’t be matched with corresponding opinion words and will be discarded in the annotation procedure; they’re also often ignored in practical applications. Moreover, sentiment sentences with this problem are often incorrectly considered non-sentiment sentences, which might affect sentiment analysis performance. Quite a few researchers have mentioned this problem12 but haven’t treated it as an independent task. Besides, no corpora annotate these kinds of elements.

Here, we call the aforementioned problem implicit polarity. This term refers to an aspect that isn’t modified by a corresponding polarity word but that still shows polarity. We call the sentiment sentences that have this problem implicit sentiment sentences. Given the proportion of implicit sentiment sentences in the corpus, this element should be annotated.

Implicit Aspect

An implicit aspect doesn’t explicitly occur in the sentence but is implied by the polarity word. In the third line of Table 1, for example, the polarity word “expensive” implies the aspect “price.” Another similar example is “quick” implying “speed.” Although this problem has aroused the interest of a few researchers,7,13 no existing corpora annotate this kind of element to provide a platform for research.

Annotation Guidelines
To make the corpus adequate for solving the problems presented in Table 1, we designed a new annotation scheme. Although it’s also organized in sentence units, it can’t simply be treated as a sentence-level annotation because it includes cross-sentence and global information. It also enhances the existing annotation scheme by proposing several useful elements in sentences.

Inspired by practical applications, this annotation guideline is suitable for many common sentiment analysis tasks. For a given sentence, it not only classifies the polarity but also extracts a more complete and fine-grained structured representation. The new elements in this scheme can also generate new and interesting sentiment analysis tasks.

Figure 1. An example XML representation for several sentiment sentences.

Based on these principles, we designed the following sentiment elements for annotation:

• Target entity refers to the main topic. In product reviews, it might be the product or brand discussed in a given sentence. In Figure 1, for example, the target entity for each sentence is “NEX-5N.” However, the target entity doesn’t always appear with the aspect in the same sentence, as in the second and third sentences in Figure 1. Hence, recognizing the corresponding target entity for each aspect in a given sentence is a new, challenging, and meaningful task in sentiment analysis.
• Aspect refers to a component or attribute of a certain product. In the first sentence of Figure 1, “appearance” is tagged as an aspect. Aspect recognition is a hot research topic in sentiment analysis. Here, an aspect is one that explicitly appears in the given sentence.
• Implicit aspect is explained in the preceding section. Unlike a common aspect, it’s implied and hence doesn’t occur in the given sentence; see, for example, the third sentence of Figure 1, where the implicit aspect is “price.” This kind of aspect is always implied in the polarity words. Recognizing implicit aspects is an interesting task, the results of which can enhance the findings of sentiment analysis.
• Polarity expression is the word or phrase that modifies the aspect and indicates sentiment orientation, such as the word “beautiful” in the first sentence of Figure 1. More specifically, in this article, polarity expression refers to the polarity word, which is tagged during annotation.
• Modifier refers to the word modifying the polarity word, such as “very” in the first sentence of Figure 1.
• Negation is the word that can reverse the polarity of the polarity word, such as “no” or “not.” It plays a special role in sentiment analysis.
• Polarity refers to the sentiment orientation of a given target entity, aspect, or implicit aspect. In this article, we consider only three polarity tags, namely, “positive,” “negative,” and “neutral.”
• Transition words are always located at the beginning of the sentence, such as the word “but” in the third sentence of Figure 1. A sentence containing a transition word can show the opposite polarity to the sentence before or after it.
• Compare indicates whether the given sentence is comparative. A comparative sentence contains comparative words, such as the word “than” in the third sentence of Figure 1. Comparative sentiment analysis can be considered a distinctive task and has recently gained extensive attention.

Each sentiment sentence is annotated in XML to represent every element. Figure 1 shows an XML representation for several sentiment sentences extracted from the digital camera domain.

Three main elements are most relevant in sentiment analysis tasks: object, description, and polarity. Object refers to the main and complete topic; description refers to the words or phrases describing the object; and polarity refers to the sentiment orientation of the described object. In our annotation scheme, the combination of elements (1)–(3) (target entity, aspect, and implicit aspect) is the complete object, that of elements (4)–(6) (polarity expression, modifier, and negation) is the description of the object, and elements (7)–(9) (polarity, transition word, and compare) can be used to compute polarity.

Among the aforementioned elements, “target entity” and “polarity” can’t be tagged with “NULL” under any condition. Other elements can be tagged with “NULL” in the absence of an appropriate word or phrase. If a sentence contains more than one target entity or aspect, all of them are annotated, and a separate unit is constructed for each.

Table 2. The corpus statistics.
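
To make the nine elements and the “NULL” rule more concrete, here is a minimal Python sketch of one annotation unit. It is an illustration only: the class name `SentimentUnit`, the field names, and the XML tag names are assumptions made for this example, and they are not the tags used in the authors’ Figure 1. The example values come from the first sentence described for Figure 1 (“NEX-5N,” “appearance,” “very,” “beautiful”).

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

NULL = "NULL"
POLARITIES = {"positive", "negative", "neutral"}


@dataclass
class SentimentUnit:
    """One annotation unit; field and tag names are hypothetical."""
    # Elements (1)-(3): the complete object
    target_entity: str = NULL
    aspect: str = NULL
    implicit_aspect: str = NULL
    # Elements (4)-(6): the description of the object
    polarity_expression: str = NULL
    modifier: str = NULL
    negation: str = NULL
    # Elements (7)-(9): used to compute polarity
    polarity: str = NULL
    transition_word: str = NULL
    compare: str = NULL

    def validate(self) -> None:
        # "target entity" and "polarity" can't be NULL under any condition.
        if self.target_entity == NULL:
            raise ValueError("target entity can't be tagged with NULL")
        if self.polarity not in POLARITIES:
            raise ValueError(f"polarity must be one of {sorted(POLARITIES)}")

    def to_xml(self) -> ET.Element:
        # Serialize every element, including the ones tagged NULL.
        self.validate()
        unit = ET.Element("unit")
        for name, value in vars(self).items():
            ET.SubElement(unit, name).text = value
        return unit


unit = SentimentUnit(
    target_entity="NEX-5N",
    aspect="appearance",
    polarity_expression="beautiful",
    modifier="very",
    polarity="positive",
)
print(ET.tostring(unit.to_xml(), encoding="unicode"))
```

A sentence with several target entities or aspects would simply produce several such units, one per object.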

A Fine-Grained Corpus and Analysis
We present the Chinese product
review corpus as a case study for the
issues proposed in previous sections.
Data Collection

We manually collected online customer
reviews from several famous Chinese forum sites—namely, and for the digital camera, and and http:// for the mobile phone.
The raw corpus contained 400 documents: 200 for the digital camera and 200 for the mobile phone. Every document included the post’s main body and title. We asked two experts to annotate the documents manually, which came to 8,042 sentences for the digital camera domain and 9,530 sentences for the mobile phone domain. Agreement measured by Cohen’s κ was satisfactory: κ = 0.71 for the digital camera domain and κ = 0.73 for the mobile phone domain. To extend our corpus, inspired by previous works,14,15 we applied a third independent annotation where inconsistency was detected. Cases where inconsistency persisted, such as when the experts selected different tags, were discarded as too ambiguous to be annotated. Ultimately, 2,020 and 2,144 sentences were annotated for the digital camera and mobile phone domains, respectively. Table 2 presents the corpus statistics, which we discuss in the following subsection.
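
For readers unfamiliar with the agreement measure, Cohen’s κ compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: κ = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences. The toy labels are invented for illustration, and the article doesn’t state over which units or tags its κ values were computed, so this isn’t a reproduction of the reported 0.71 and 0.73.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Invented toy annotations (POS / NEG / NONE tags for eight sentences).
annotator_1 = ["POS", "POS", "NEG", "NONE", "NONE", "NONE", "NEG", "POS"]
annotator_2 = ["POS", "NEG", "NEG", "NONE", "NONE", "POS", "NEG", "POS"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```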
This corpus is domain-oriented and not very large because the annotation procedure is complex and manual: building a large corpus that covers all general domains with the proposed annotation scheme would be time-consuming. In the future, to enlarge the corpus and to build corpora for other domains more easily, we can use a semiautomatic annotation procedure. First, we can automatically annotate some elements according to existing sentiment resources. Of course, automatic annotation isn’t perfect and can produce incorrect annotations or miss some elements, so we then need to manually correct some of the annotations. Still, this semiautomatic annotation will clearly require less work than fully manual annotation.
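
One way to picture such a semiautomatic pass is a dictionary-based pre-annotation whose output is then reviewed by hand. The sketch below is a rough illustration under several assumptions: the lexicons are tiny hypothetical stand-ins for real sentiment resources, the whitespace tokenizer works only on the English glosses used here (Chinese text would need a word segmenter), and the target entity is guessed with a most-frequent-candidate heuristic. The function name `pre_annotate` and the `needs_review` flag are invented for this example.

```python
from collections import Counter

# Hypothetical toy lexicons standing in for existing sentiment resources.
TARGET_ENTITIES = {"NEX-5N", "GF3"}
ASPECTS = {"appearance", "screen", "color"}
POLARITY_WORDS = {"beautiful", "great", "expensive"}
MODIFIERS = {"very", "so"}
NEGATIONS = {"no", "not"}


def tokenize(sentence):
    # Naive tokenizer for the English glosses only.
    for mark in ",.!?":
        sentence = sentence.replace(mark, " ")
    return sentence.split()


def first_match(tokens, lexicon):
    return next((t for t in tokens if t in lexicon), "NULL")


def pre_annotate(document):
    """Draft annotations for a list of sentences; humans correct them later."""
    tokenized = [tokenize(s) for s in document]
    counts = Counter(t for toks in tokenized for t in toks if t in TARGET_ENTITIES)
    # Heuristic: the most frequent target entity labels every sentence.
    doc_target = counts.most_common(1)[0][0] if counts else "NULL"
    drafts = []
    for sentence, toks in zip(document, tokenized):
        lowered = [t.lower() for t in toks]
        aspect = first_match(lowered, ASPECTS)
        polarity_word = first_match(lowered, POLARITY_WORDS)
        drafts.append({
            "sentence": sentence,
            "target_entity": doc_target,
            "aspect": aspect,
            "polarity_expression": polarity_word,
            "modifier": first_match(lowered, MODIFIERS),
            "negation": first_match(lowered, NEGATIONS),
            # Incomplete drafts are queued for manual correction.
            "needs_review": aspect == "NULL" or polarity_word == "NULL",
        })
    return drafts


review = [
    "I bought the NEX-5N last week.",
    "The appearance is very beautiful.",
    "The color fits my taste.",
    "So expensive!",
]
for draft in pre_annotate(review):
    print(draft)
```

Note that the implicit polarity and implicit aspect sentences end up flagged for review, which is where manual effort would concentrate.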
We can use the most frequent target entity in a document as the “target entity” for each sentence and annotate the “aspect” or “polarity expression” by matching against an aspect or polarity expression dictionary. Similar annotations, such as “modifier,” “negation,” “transition word,” and “compare word,” can also be produced using corresponding dictionaries.

Analysis and Exploitation for the Corpus

To obtain relevant ideas for the potential future use of the corpus, we performed in-depth analysis and exploitation from which we could mine new sentiment analysis tasks.

Polarity distribution in the sentiment analysis corpus. Table 2 shows that each review contains an average of 40 to 50 sentences, of which approximately 10 to 11 are sentiment sentences. Hence, the ratio between sentiment and non-sentiment sentences is roughly 1:3, with non-sentiment sentences making up the main part. Figure 2, which shows the polarity distribution statistics separately for the digital camera and mobile phone domains, also reflects this phenomenon.

Classifying a sentence as either sentiment or non-sentiment is a necessary step in sentiment analysis. However, at present, not much work focuses on this traditional task except for some methods proposed16,17 in the preliminary stage of sentiment analysis. Therefore, we should pay more attention to this old but important task in the future.

Figure 2 also shows the distribution of the three polarities; neutral sentences make up only a minimal part of the sentiment sentences and can simply be ignored. Note that neutral sentences are different from non-sentiment sentences here because they still show the sentiment orientation “neutral,” although neutral and non-sentiment sentences are always joined in one class in previous work.18

Figure 2. Statistics for polarity distribution. “POS,” “NEG,” and “NEU” represent three polarity tags, and “NONE” represents the non-sentiment sentences for (a) digital camera and (b) mobile phone domains.
Values shown in the figure: (a) 1,342 (16.7%), 6,022 (74.9%), 655 (8.1%); (b) 1,234 (12.9%), 7,386 (77.5%), 863 (9.1%), 47 (0.5%).

Target entity distribution in the sentiment analysis corpus. Target entity is an important element in sentiment analysis.19–21 Without recognizing the target entity, other fine-grained elements, such as aspect or polarity, become almost useless in practical applications. Table 3 shows target entity distribution statistics.

Table 3. Target entity distribution statistics for the digital camera and mobile phone domains.*

Columns: Digital camera; Mobile phone.
A-1. Ratio indicating that target entity and aspect co-occur in the same sentence. Example: √: GF3 (the appearance of GF3 is very beautiful); (the appearance is very
A-2. Ratio of the reviews that only contain one target entity.
A-3. Average no. of target entities for each review.

*The green text indicates target entity.

The low ratios (especially for the mobile phone domain) shown in row A-1 illustrate that only a few aspects co-occur with their corresponding target entities in the same sentence. The ratios in A-2 indicate that most reviews have more than one target entity, and A-3 shows that each review contains approximately two to three target entities. These statistics imply that for most aspects, we must look for the target entity in other sentences and choose the proper one from several candidates.

However, most public product review corpora ignore the annotation of the target entity, and few studies focus on target entity recognition, especially the extraction of “target entity-aspect” pairs (“target-aspect” for short in the following sections). Thus, by analyzing the target entity distribution statistics, we can propose a new and important task, target-aspect pair extraction, to solve the “target entity ignoring” problem in Table 1. Moreover, developing sentiment analysis corpora with target entity tags is vital, as they can be used for algorithm research on these tasks.
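
To give a feel for what target-aspect pair extraction involves, the following sketch implements a deliberately naive baseline: each aspect is paired with the most recently mentioned target entity. This heuristic is an assumption made for illustration, not a method proposed in this article, and it would fail on comparative sentences that mention two target entities. The function name, the substring matching, and the example review are all invented for the example.

```python
def pair_aspects_with_targets(sentences, target_entities, aspects):
    """Pair each aspect mention with the nearest preceding target entity."""
    current_target = "NULL"
    pairs = []
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        # Target entities mentioned in this sentence, with their positions.
        mentioned = [
            (lowered.find(t.lower()), t)
            for t in target_entities
            if t.lower() in lowered
        ]
        if mentioned:
            # Keep the first-mentioned target as the current candidate.
            current_target = min(mentioned)[1]
        for aspect in aspects:
            if aspect.lower() in lowered:
                pairs.append((current_target, aspect, index))
    return pairs


review = [
    "I compared the GF3 with other cameras.",
    "The appearance is very beautiful.",
    "Then I tried the NEX-5N.",
    "The screen looks great.",
]
print(pair_aspects_with_targets(review, ["GF3", "NEX-5N"], ["appearance", "screen"]))
```

With two to three target entities per review, choosing among candidates is the hard part; a corpus with target entity tags is what would let such heuristics be evaluated and replaced by learned models.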
Aspect-opinion pair distribution in the sentiment analysis corpus. Aspect-opinion pair extraction is one of the most important tasks in sentiment analysis.7,22,23 Table 4 illustrates the statistics for aspect-opinion pair distribution for the two domains. The ratio of aspect-opinion pairs is 60.94 percent for the camera domain and 63.48 percent for the phone domain (from B-1), indicating that most aspects are modified by corresponding polarity words. This finding also demonstrates the importance of the aspect-opinion pair recognition task.

IEEE INTELLIGENT SYSTEMS

Table 4. Aspect-opinion pair distribution statistics for the digital camera and mobile phone domains.

B-1. Ratio indicating that aspect and polarity word co-occur in the same sentence. Example: (the appearance is very beautiful); aspect-opinion pair: <appearance, beautiful, positive>. Digital camera: 60.94%; mobile phone: 63.48%.
B-2. Ratio indicating that only the aspect, but no polarity word, occurs in the given sentence. Example: (The color fits my taste.); <color, NONE, positive>. Digital camera: 26.83%; mobile phone: 25.75%.
B-3. Ratio indicating that only the polarity word, but no aspect, occurs in the given sentence; the aspect is implied in the polarity word. Example: (So expensive!); <NONE, expensive, negative>. Digital camera: 6.58%; mobile phone: 5.78%.
B-4. Ratio indicating that the polarity word directly modifies the target entity, but not the aspect: <NO aspect (target entity), polarity word>. Example: √: GF3 (GF3 is great!); <NONE, great, positive>. Digital camera: 5.64%; mobile phone: 4.99%.

B-2 in Table 4 shows that the ratios of aspects without a corresponding polarity word but with sentiment orientation aren’t low: 26.83 percent for the digital camera and 25.75 percent for the mobile phone. Many researchers have studied algorithms for the aspect-opinion pair extraction task, but most of them have ignored aspects without corresponding polarity words. Sentences with this problem of implicit polarity are often confused with non-sentiment sentences because neither contains polarity words. The difference, however, is that implicit polarity sentences actually show polarities. Because the ratio of this kind of aspect isn’t minimal, we should focus on the problem of implicit polarity. Here, we propose a new task, implicit polarity recognition, that can distinguish implicit sentiment sentences from non-sentiment sentences as well as improve aspect-opinion recognition results.
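
The confusion described here is easy to reproduce with a lexicon lookup. The sketch below sorts sentences into rough categories using hypothetical aspect and polarity word lists; the category names and the second example sentence (“the color is set in the menu”) are invented for illustration. It shows that an implicit sentiment sentence and a non-sentiment sentence mentioning the same aspect look identical to such a lookup, which is why implicit polarity recognition needs more than dictionaries.

```python
# Hypothetical toy lexicons (English glosses).
ASPECTS = {"appearance", "screen", "color"}
POLARITY_WORDS = {"beautiful", "great", "expensive"}


def categorize(tokens, aspects=ASPECTS, polarity_words=POLARITY_WORDS):
    """Rough lexicon-based category for one tokenized sentence."""
    has_aspect = any(t in aspects for t in tokens)
    has_polarity_word = any(t in polarity_words for t in tokens)
    if has_aspect and has_polarity_word:
        return "explicit aspect-opinion pair"
    if has_aspect:
        # Could be implicit polarity OR a plain non-sentiment sentence.
        return "implicit polarity candidate"
    if has_polarity_word:
        return "polarity word without aspect"
    return "no lexicon evidence"


examples = [
    "the appearance is very beautiful",
    "the color fits my taste",        # implicit polarity (positive)
    "the color is set in the menu",   # invented non-sentiment sentence
    "so expensive",
]
for text in examples:
    print(f"{text!r}: {categorize(text.split())}")
```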
Figure 3. Example product review. The author fully recommends this product despite numerous negative blue-colored sentences containing aspect-opinion pairs.

Table 4 shows that the ratios of sentiment sentences containing aspects are about 88 percent (60.94 + 26.83) for the digital camera domain and 89 percent (63.48 + 25.75) for the mobile phone domain. Thus, more than 10 percent of sentiment sentences don’t contain aspects in each domain. These sentences fall into two cases: the implicit aspect is implied in the polarity word (B-3), or the polarity word directly modifies the target entity rather than an aspect (B-4).
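
The arithmetic behind these figures can be checked directly from the four ratios quoted in the text for rows B-1 to B-4. The short sketch below does only that; the dictionary layout and the function name are choices made for this example.

```python
# Ratios (percent) quoted in the text for Table 4, rows B-1 to B-4.
RATIOS = {
    "digital camera": {"B-1": 60.94, "B-2": 26.83, "B-3": 6.58, "B-4": 5.64},
    "mobile phone": {"B-1": 63.48, "B-2": 25.75, "B-3": 5.78, "B-4": 4.99},
}


def aspect_coverage(rows):
    """Return (share with an aspect, share without an aspect) in percent."""
    with_aspect = rows["B-1"] + rows["B-2"]
    without_aspect = rows["B-3"] + rows["B-4"]
    return with_aspect, without_aspect


for domain, rows in RATIOS.items():
    with_aspect, without_aspect = aspect_coverage(rows)
    print(
        f"{domain}: with aspect {with_aspect:.2f}%, "
        f"without aspect {without_aspect:.2f}%, "
        f"total {with_aspect + without_aspect:.2f}%"
    )
```

The two shares come to 87.77 and 12.22 percent for the digital camera and 89.23 and 10.77 percent for the mobile phone, so the four rows account for essentially all sentiment sentences in each domain.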
Table 4 shows that the implicit aspect ratio is 6.58 percent for the camera domain and 5.78 percent for the mobile phone domain, which is a notable proportion. This analysis suggests that implicit aspect recognition can be treated as an important task. B-4 shows that the ratio of polarity words directly modifying the target entity is 5.64 percent for the camera and 4.99 percent for the phone, which is also a notable proportion. In previous studies, researchers treated target entity and aspect as the same thing and used similar approaches to recognize them. However, target-opinion and aspect-opinion pairs can be used in different applications. The former can be considered the conclusion about a product, whereas the latter is a detailed attribute description of a product.
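
For the implicit aspect case (B-3), the simplest conceivable recognizer is a lookup from polarity word to implied aspect. The sketch below uses only the two mappings given in this article (“expensive” implies “price,” “quick” implies “speed”); the function name and the rule of skipping sentences that already contain an explicit aspect are assumptions made for illustration, and a real system would need a far larger, domain-specific mapping.

```python
# The two word-to-aspect mappings mentioned in the article.
IMPLICIT_ASPECTS = {"expensive": "price", "quick": "speed"}


def infer_implicit_aspect(tokens, explicit_aspects):
    """Return the aspect implied by a polarity word, or "NULL"."""
    if any(t in explicit_aspects for t in tokens):
        return "NULL"  # an explicit aspect is present; nothing to infer
    for token in tokens:
        if token in IMPLICIT_ASPECTS:
            return IMPLICIT_ASPECTS[token]
    return "NULL"


print(infer_implicit_aspect(["so", "expensive"], {"appearance", "screen"}))
```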
For example, in Figure 3, the italicized sentence, “Canon 600D is great!”
containing a target-opinion pair is
a conclusion for the product “Canon
600D.” The author fully recommends this product despite numerous
negative underlined sentences containing aspect-opinion pairs. Therefore,


extracting this kind of conclusion sentences with target-opinion pairs is important, as they’re useful in product
We also analyzed comparative sentiment sentences. Statistics indicate that
comparative sentences occupy a certain proportion: 8.27 percent for the
camera domain and 6.95 percent for
the mobile phone domain. Given that
comparative sentences always compare
two target entities or one aspect of two
target entities, they entail numerous interesting applications.


From our observations of the fine-grained corpus, we propose two new sentiment analysis tasks, thus generating relevant ideas for future directions: target-aspect pair extraction and implicit polarity recognition.

To explore a completely structured representation of each sentiment sentence, a new scheme is necessary that can encompass global information and so solve the problems of the original sentence-level scheme. Target entity is one of the most important pieces of global information. Recognizing the corresponding target entity for each aspect is necessary in practical applications; thus, target-aspect pair extraction can be treated as a new sentiment analysis task and should receive additional attention. Similarly, a certain percentage of aspects lack corresponding polarity words. Sentiment sentences containing these kinds of aspects are often confused with non-sentiment sentences. Recognizing this kind of aspect and determining its polarity are important and interesting future tasks as well.

In the future, we’ll annotate more customer reviews to extend our corpus. We’ll also explore algorithms for the new sentiment analysis tasks proposed in this article and, with inspiration from the corpus, improve existing algorithms for the established sentiment analysis tasks.



