The Analysis of Subscriber in Postpaid Mobile Communication Industry (Case Study: Indosat Matrix)

THE ANALYSIS OF SUBSCRIBER
IN POSTPAID MOBILE TELECOMMUNICATION INDUSTRY
( Case Study: Indosat Matrix )

NUR ANDI SETIABUDI

DEPARTMENT OF STATISTICS
FACULTY OF MATHEMATICS AND NATURAL SCIENCES
BOGOR AGRICULTURAL UNIVERSITY
2009

Dedicated to Imih and Bapa back home in the village


ABSTRACT
NUR ANDI SETIABUDI. The Analysis of Subscriber in Postpaid Mobile Telecommunication
Industry (Case Study: Indosat Matrix ). Advised by ASEP SAEFUDDIN and WISHNU SUBEKTI.
Accurate segmentation, profiling and churn analysis are appropriate ways to address customer issues and to face business competition in the mobile telecommunication industry. This research was a comprehensive study of segmentation, profiling and churn analysis for Indosat Matrix’s subscribers, using statistics and data mining tools while taking business capabilities into account. Segmentation was built with the K-means clustering algorithm based on subscribers’ values. The number of segments, five, was decided by business experts. The results showed that K-means was able to define the segments. The profile of each segment was then visualized on a two-dimensional plot called a biplot. Each segment had usage characteristics different from the others. Churn analysis was performed by binary logistic regression, and the analysis was carried out twice. The first analysis estimated a model of churn based on invoice and tenure. Several statistical tests suggested that this model was able to predict churn events. By fitting current invoice and tenure to the estimated model of churn, subscribers could be classified into an ‘at risk’ or ‘at safe’ group according to their estimated probability of churn. Once this classification was obtained, the second model related ‘at risk’ status to variables derived from invoice, called features usage, as explanatory variables. This model was also considered able to discriminate excellently between ‘at risk’ and ‘at safe’ subscribers.
Key words : telecommunication, segmentation, churn analysis, K-means clustering, biplot, logistic regression

THE ANALYSIS OF SUBSCRIBER
IN POSTPAID MOBILE TELECOMMUNICATION INDUSTRY
( Case Study: Indosat Matrix )

NUR ANDI SETIABUDI
G14052422

Research Report
to complete the requirement for graduation of Bachelor Degree in Statistics

at Department of Statistics
Faculty of Mathematics and Natural Sciences
Bogor Agricultural University

DEPARTMENT OF STATISTICS
FACULTY OF MATHEMATICS AND NATURAL SCIENCES
BOGOR AGRICULTURAL UNIVERSITY
2009

Title  : The Analysis of Subscriber in Postpaid Mobile Telecommunication Industry
         (Case Study: Indosat Matrix)
Author : Nur Andi Setiabudi
NIM    : G14052422

Approved by :

Advisor I  : Dr. Asep Saefuddin, M.Sc.
             NIP. 195703161981031004

Advisor II : Wishnu Subekti, ST, MM.
             NIK. 75013679

Acknowledged by :
Dean of Faculty of Mathematics and Natural Sciences
Bogor Agricultural University

Dr. drh. Hasim, DEA
NIP. 196103281986011002

Graduation date:

BIOGRAPHY
Nur Andi Setiabudi was born in Cilacap on the first of September, 1987, as the son of Duryat and Irun. He has a brother and two sisters.
He finished his education at SD Negeri Dayeuhluhur 03 in 1999 and graduated from SLTP Negeri 1 Dayeuhluhur in 2002. After graduating from SMA Negeri 1 Dayeuhluhur in 2005, he continued his studies at Bogor Agricultural University through USMI. A year later, he took Statistics as his major in the Department of Statistics, and also chose Consumer Sciences in the Department of Family and Consumer Sciences as supporting courses.
During his studies, he was active in college organizations, especially the Statistics Student Association, Gamma Sigma Beta. Besides serving as Administration staff in 2007, he was also mandated as Department Head of Survey and Research in 2008. In February 2008, he had the opportunity to take an internship at the Research Institute for Tea and Cinchona, Bandung. In July 2009, he had the privilege of performing his final research at PT. Indosat, Jakarta.

ACKNOWLEDGEMENTS

Alhamdulillah, I am very grateful to Allah SWT, The Most Merciful, Who gave me the chance, spirit, health, and capability to finish my research.
This paper is the representation of my research in customer relationship management. It was performed to complete a requirement for graduation with a Bachelor Degree in Statistics at the Department of Statistics, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University.
I admit that the completion of my research would not have been possible without help from many people, from the time the research was planned until it was finished. A thousand appreciations are offered for their ideas, criticism, and improvements during the process. I would like to express my sincere gratitude to my advisors, Mr. Asep Saefuddin for his expert guidance and suggestions for this research, and Mr. Wishnu Subekti for enlightening discussions. Thanks go to Mr. Fahar Yuhandi for his valuable help in providing data. I also wish to thank all my friends in ‘Statistika 42’ and ‘Pondok Assalam’ for their togetherness in seeking knowledge and for true friendship. I give my special thanks to ‘my special’, Widya Ningsih, for sharing the nice days. I am especially grateful to my beloved family, Imih, Bapa, Ibu, Riska, Kang Ofik, Teh Enci, Pa Ridwan, Tegar and Kia for their never-ending love and support.
Finally, I hope my little work will be useful for all.

Bogor, September 2009

Nur Andi Setiabudi

CONTENT

LIST OF FIGURE
LIST OF TABLE
LIST OF APPENDIX

INTRODUCTION
   Background
   Objective

LITERATURE REVIEW
   Cluster Analysis
   Biplot Analysis
   Binary Logistic Regression

METHODOLOGY
   Source of Data
   Method

RESULT AND DISCUSSION
   Segmentation and Profiling
      Profile of Sample
      Segments and Profiles
   Churn Analysis
      Model of Churn
      Model of ‘At Risk’

CONCLUSION
RECOMMENDATION
REFERENCE

LIST OF FIGURE

Figure 1  Plot of tenure vs. invoice by segment
Figure 2  Biplot for segment and feature usage

LIST OF TABLE

Table 1  Descriptive statistic of sample
Table 2  Invoice usage of sample by feature
Table 3  Evaluation of segmentation result
Table 4  Segment summary information
Table 5  Description of sample for churn analysis
Table 6  Summary of logistic regression analysis of churn
Table 7  Description of sample for ‘at risk’ analysis
Table 8  Summary of logistic regression analysis of ‘at risk’

LIST OF APPENDIX

Appendix 1     Description of variables for analysis
Appendix 2.A   Histogram of tenure on standardized data
Appendix 2.B   Histogram of invoice on standardized data
Appendix 3.A   Segment summary
Appendix 3.B   Mean and standard deviation of invoice and tenure by segment on standardized data
Appendix 4     Percentage of invoice by feature and segment
Appendix 5     Definition of categorical and dummy variables for churn and ‘at risk’ analysis
Appendix 6.A   Bar chart of subscriber churn and stay by tenure
Appendix 6.B   Bar chart of subscriber churn and stay by invoice
Appendix 7.A   ROC curve for model of churn
Appendix 7.B   Plot of sensitivity and specificity versus all possible cut off points in the model of churn
Appendix 8.A   Probability of churn by all possible categories of tenure
Appendix 8.B   Probability of churn by all possible categories of invoice
Appendix 9.A   Bar chart of subscriber ‘at risk’ and ‘at safe’ by tenure and invoice
Appendix 9.B   Bar chart of subscriber ‘at risk’ and ‘at safe’ by features usage
Appendix 10.A  ROC curve for model of ‘at risk’
Appendix 10.B  Plot of sensitivity and specificity versus all possible cut off points in the model of ‘at risk’
Appendix 11.A  Probability to be categorized ‘at risk’ by all possible categories of voice domestic usage
Appendix 11.B  Probability to be categorized ‘at risk’ by all possible categories of SMS usage

INTRODUCTION
Background
The mobile telecommunication industry has been developing dynamically over the years. It continues to create new opportunities and to become more profitable for business. Therefore, many companies have been interested in joining this sector, which has resulted in tight competition. Providers compete fiercely with each other in acquiring new subscribers and retaining existing ones to raise profitability.
A provider has to offer the best services, but the biggest challenge comes from customer issues. Understanding subscribers is becoming increasingly important in facing the business environment, and a provider must know its subscribers well. To make this easier, it is necessary to classify subscribers into several segments and to profile them according to desirable criteria. Moreover, the provider also has to recognize high-risk subscribers in order to prevent them from churning.
Segmentation is the process of dividing subscribers into homogeneous groups or classes, called segments, based on similar characteristics such as value and usage behavior. Using segmentation, a provider can channel resources and discover opportunities more effectively (Jansen 2007). Accurate, verifiable segmentation gives decision makers the information to evaluate and execute strategies for improving subscribers’ profitability and campaign efficiency.
Profiling is describing subscribers, and the subscribers within an associated segment, by their attributes. Knowing the profile of each customer, the provider can treat the customer according to what they need in order to increase lifetime value (Boundsaythip & Runsala 2001).
The term churn refers to attrition, or a decline in the number of subscribers. Three kinds of churn are described in the literature: involuntary churn or forced attrition, voluntary churn, and unavoidable or expected churn (Berry & Linoff 2004; Yang & Chiu 2006).
Predicting customer churn is critical because it helps the provider identify signals of churn. The likelihood or probability of churn can be analyzed using subscribers’ call records, which are generated and stored in the data warehouse system. Once indications of churn are detected, the provider can determine what incentives should be offered to subscribers in the risk group in order to improve retention and extend loyalty.
As one of the mobile telecommunication service providers in Indonesia, Indosat is also interested in segmentation, profiling and churn analysis for its subscribers. So far, for its postpaid service, Matrix, segmentation and profiling have been performed subjectively based on invoice, dividing subscribers into two segments: regular and VIP. No churn model has yet been developed (Subekti 2009; personal communication).
This research was a comprehensive study of segmentation, profiling and churn analysis for a million Matrix subscribers, using statistics and data mining tools while taking business capabilities into account. Segments were defined by the K-means clustering algorithm based on subscribers’ values, namely invoice and tenure. A biplot was used to visualize the profile of each segment according to features usage. Churn analysis was performed by binary logistic regression, using invoice by features usage as the explanatory variables.
Related work has already been carried out. Jansen (2007) defined segments for Vodafone’s subscribers based on usage behavior. Several clustering techniques were adopted, and the results of each technique were evaluated and compared with each other. Jansen then profiled each segment according to demographic data and analyzed the relation between segments and profiles. Lin (2007) performed segmentation of a mobile operator’s subscribers based on call detail records, using K-means. Mozer et al (2000) predicted churn using data from call detail records, constructing models by logistic regression, decision tree and neural network. In addition, a case study of churn analysis applying logistic regression was carried out by Mutanen (2006).
Objective
The main objectives of this research were:
1. To define segments of subscribers based on their values, represented by tenure and invoice.
2. To describe the profile of each segment according to feature usage.
3. To identify factors affecting churn based on tenure and invoice, and to recognize subscribers ‘at risk’ according to feature usage.

BRIEF THEORETICAL REVIEW
Cluster Analysis
The objective of cluster analysis is to organize n objects into K clusters (K < n) according to the similarities among them, such that objects within a cluster are more similar to each other than to objects in other clusters. From a business perspective, clustering has a similar meaning to segmentation.
The principle of cluster analysis is to measure the similarity of objects based on their variables. This similarity can be expressed as a distance measure. There are many ways to measure similarity; Euclidean distance is recommended as a measure of similarity for clustering. Euclidean distance is appropriate for uncorrelated variables. If the variables are correlated, the data should be transformed using principal component analysis.
The Euclidean distance between two p-dimensional objects, x = [x1, x2, …, xp]' and y = [y1, y2, …, yp]', is

$$ d_E(x, y) = \sqrt{(x - y)'(x - y)} \tag{1} $$

If the condition for using Euclidean distance cannot be satisfied, then Mahalanobis distance has to be used. Mahalanobis distance is Euclidean distance weighted by the inverse of the covariance matrix. The Mahalanobis distance between two objects, x and y, is defined as

$$ d_M(x, y) = \sqrt{(x - y)' S^{-1} (x - y)} \tag{2} $$

where the matrix S contains the sample variances and covariances. However, without prior knowledge of the distinct groups, S cannot be computed. For this reason, Euclidean distance is often preferred for clustering (Johnson & Wichern 1998).
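The two distances in equations (1) and (2) can be illustrated with a minimal NumPy sketch. The function names and the example vectors are hypothetical and chosen only for illustration; they are not taken from the research data.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance, equation (1): sqrt((x - y)'(x - y))."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ d))

def mahalanobis(x, y, S):
    """Mahalanobis distance, equation (2): sqrt((x - y)' S^{-1} (x - y))."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve S z = d instead of forming the inverse of S explicitly.
    return float(np.sqrt(d @ np.linalg.solve(S, d)))

# Hypothetical two-variable objects (illustration only).
x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

d_e = euclidean(x, y)
# With S equal to the identity matrix the two distances coincide.
d_m_identity = mahalanobis(x, y, np.eye(2))
# With unequal variances each coordinate is rescaled by its variance.
d_m_scaled = mahalanobis(x, y, np.diag([9.0, 16.0]))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean one, which is why Euclidean distance is a reasonable choice for standardized, weakly correlated variables.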
Basically, there are two methods of cluster analysis: the hierarchical method and the non-hierarchical or partitional method. The hierarchical method permits a cluster to have sub-clusters, and the result is often organized in a dendrogram. The hierarchical method is appropriate if the data set is not very large and the number of clusters is not known. A partitional method is simply a division of the objects in the data set into non-overlapping clusters such that each object is in exactly one cluster. It makes it possible to cluster a very large data set.
There is so far no generally accepted procedure for determining the number of clusters to be formed. A usual approach is to plot the scores of the first two principal components. The decision should be guided by theory and by the practicality of the result.

K-means is the best-known algorithm for partitional clustering. The algorithm was first published by McQueen (1967). His K-means concept represents a generalization of the ordinary sample mean. The process appears to give partitions which are reasonably efficient in the sense of within-cluster variance. K-means allocates each object to one of the K clusters so as to minimize the within-cluster sum of squares:

$$ SSE = \sum_{i=1}^{K} \sum_{x \in C_i} d(c_i, x)^2 \tag{3} $$

and

$$ c_i = \frac{1}{m_i} \sum_{x \in C_i} x \tag{4} $$

where x = an object, C_i = the ith cluster, c_i = the cluster center (centroid) of C_i, m_i = the number of objects in the ith cluster, K = the number of clusters and d = Euclidean distance.
The basic K-means algorithm is as follows:
1. Select K data points as the initial centroids.
2. Assign each object to the nearest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids do not change.
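The four steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in this research: the function name `kmeans`, the random choice of initial centroids, the iteration cap and the toy data are all hypothetical.

```python
import numpy as np

def _nearest(X, centroids):
    """Index of the nearest centroid (Euclidean distance) for each object."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means; returns labels, centroids and the SSE of equation (3)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: select k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the nearest centroid.
        labels = _nearest(X, centroids)
        # Step 3: recompute each centroid as the mean of its members,
        # equation (4); an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = _nearest(X, centroids)
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Toy data: two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids, sse = kmeans(X, k=2)
```

On the toy data the first three points and the last three points end up in separate clusters. In general the result of K-means depends on the initial centroids, so in practice the algorithm is often run from several starting points.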
One approach to evaluating a clustering is the root mean square standard deviation (RMS). It provides a measure of the average spread of the objects within a cluster around their centroid. The RMS of a cluster C_i is

$$ RMS = \sqrt{\frac{\sum_{x \in C_i} d(c_i, x)^2}{v\,(m_i - 1)}} \tag{5} $$

where v = the number of variables. Well-separated clusters comprising homogeneous objects will have a small RMS.
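As a small worked illustration of equation (5), the sketch below computes the RMS of a single cluster. The function name and the three example points are hypothetical.

```python
import numpy as np

def cluster_rms(members):
    """Root mean square standard deviation of one cluster, equation (5)."""
    members = np.asarray(members, dtype=float)
    m, v = members.shape                 # m objects, v variables
    centroid = members.mean(axis=0)      # centroid, equation (4)
    ss = ((members - centroid) ** 2).sum()
    return float(np.sqrt(ss / (v * (m - 1))))

# Hypothetical cluster with three objects and two variables.
cluster = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
rms = cluster_rms(cluster)   # sum of squares 4/3, divisor 2 * (3 - 1) = 4
```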
Biplot Analysis
A biplot is a graphical representation of the information in an n x p data matrix. The bi- refers to the two kinds of information contained in a data matrix: the information in the rows pertains to objects and that in the columns pertains to variables (Johnson & Wichern 1998).
A biplot is very useful for visualizing objects and variables simply on the same graph. It is widely used for finding further information in principal component analysis, for displaying row and column factors in correspondence analysis of two-way contingency tables, and for detecting interaction in two-way analysis of variance.

The biplot analysis is based on the singular value decomposition (SVD). Consider an n x p matrix X of rank r, where r ≤ p ≤ n. The matrix may then be decomposed as

$$ X = U L A' \tag{6} $$

where U (n x r) and A (p x r) are matrices of singular vectors and L (r x r) is a diagonal matrix of the singular values of X. The columns of U are the r orthonormal eigenvectors of XX' associated with its nonzero eigenvalues, and the columns of A are the corresponding orthonormal eigenvectors of X'X. The singular values are the positive square roots of the nonzero eigenvalues of X'X.
Equation (6) can be written as

$$ X = U L^{\alpha} L^{1-\alpha} A' \tag{7} $$

where 0 ≤ α ≤ 1. If G = U L^α and H = L^{1-α} A', then the (i,j)th element of matrix X can be expressed as

$$ x_{ij} = g_i' h_j \tag{8} $$

where i = 1, 2, ..., n and j = 1, 2, ..., p, the g_i' are the rows of G, and the h_j are the columns of H (Sartono et al 2003).
Although many values of α are possible, three are commonly used: 1, ½, and 0. When the value 1 is selected, the result is called a row metric preserving biplot. This display preserves the distances between pairs of rows and is useful for studying objects. When the value 0 is selected, the result is a column metric preserving biplot. This display preserves distances between the columns and is useful for interpreting variances and relationships between variables. The other value of α, ½, gives equal scaling or weight to the rows and columns. It is useful for interpreting interaction between two factors (Lipkovich & Smith 2002).
The ability of a two-dimensional biplot to represent the variation in the original data can be computed as follows:

$$ \rho^2 = \frac{\lambda_1 + \lambda_2}{\sum_{k=1}^{r} \lambda_k} \tag{9} $$

where:
λ1 = the largest eigenvalue of X'X
λ2 = the second largest eigenvalue of X'X
λk = the kth eigenvalue of X'X
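Equations (6) to (9) can be sketched with NumPy's SVD routine. The sketch below is illustrative only: the function name, the column centring of the data matrix before decomposition (an assumption, since the text does not state how the matrix is preprocessed) and the small example matrix are all hypothetical.

```python
import numpy as np

def biplot_coordinates(X, alpha=0.5):
    """Two-dimensional biplot markers G (rows) and H (columns), and the fit."""
    X = np.asarray(X, dtype=float)
    # U: left singular vectors, s: singular values, At: A' of equation (6).
    U, s, At = np.linalg.svd(X, full_matrices=False)
    G = U[:, :2] * s[:2] ** alpha                       # n x 2, G = U L^alpha
    H = (s[:2] ** (1 - alpha))[:, None] * At[:2, :]     # 2 x p, H = L^(1-alpha) A'
    eig = s ** 2                                        # eigenvalues of X'X
    fit = float(eig[:2].sum() / eig.sum())              # equation (9)
    return G, H, fit

# Hypothetical 4 x 3 matrix whose third column is the sum of the first two.
raw = np.array([[1.0, 2.0, 3.0],
                [2.0, 1.0, 3.0],
                [3.0, 5.0, 8.0],
                [4.0, 4.0, 8.0]])
Xc = raw - raw.mean(axis=0)    # assumption: columns are centred first
G, H, fit = biplot_coordinates(Xc, alpha=0.5)
```

Because the centred example matrix has rank two, the product of G and H reproduces it exactly and the fit equals one. For real data the fit is below one and indicates how much information the two-dimensional display retains.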
Binary Logistic Regression
Binary logistic regression is a form of regression used for a binary response variable, such as ‘event’ (y=1) and ‘nonevent’ (y=0), or ‘churn’ and ‘stay’. Suppose there is a single explanatory variable x. The logistic regression model has a linear form for the logit of the probability of the event at value x, π(x), as follows:

$$ \text{logit}[\pi(x)] = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \alpha + \beta x \tag{10} $$

Then, the odds of an event are

$$ \frac{\pi(x)}{1 - \pi(x)} = \exp(\alpha + \beta x) = e^{\alpha} (e^{\beta})^{x} \tag{11} $$

This exponential relationship provides an interpretation of β: the odds multiply by e^β for every 1-unit increase in x (Agresti 2007). Suppose x1 and x2 are two values of x. The odds ratio of x1 to x2, θ, is

$$ \theta = \frac{\pi(x_1) / [1 - \pi(x_1)]}{\pi(x_2) / [1 - \pi(x_2)]} = \frac{\exp(\alpha + \beta x_1)}{\exp(\alpha + \beta x_2)} = \exp[\beta (x_1 - x_2)] \tag{12} $$

or

$$ \ln(\theta) = \beta (x_1 - x_2) \tag{13} $$

By equation (12), the odds ratio is θ = exp(β) when x1 = 1 and x2 = 0.
The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values. It plays an important role in logistic regression, where it is often used for drawing conclusions from the model.
The significance of the explanatory variables in the logistic model in relation to the response variable can be assessed by the G test statistic and the Wald test. The G test statistic is a likelihood ratio test used to measure the overall significance of the parameters in the model. If there are p explanatory variables in the logistic regression model, the G test statistic can be expressed as

$$ G = -2 \ln\left[\frac{L_0}{L_p}\right] \tag{14} $$

where L0 = the likelihood without explanatory variables and Lp = the likelihood with p explanatory variables, with the hypotheses:
H0 : β1 = β2 = … = βp = 0, versus
H1 : at least one βi ≠ 0; where i = 1, 2, …, p
Under the null hypothesis, the G test statistic follows a chi-square distribution with p degrees of freedom.
If the null hypothesis of the G test is rejected, the Wald test can be used to assess the significance of each βi individually. The Wald test statistic is

$$ W = \frac{\hat{\beta}_i}{\widehat{SE}(\hat{\beta}_i)} \tag{15} $$

with the hypotheses:
H0 : βi = 0, versus
H1 : βi ≠ 0; where i = 1, 2, …, p
Under the null hypothesis, the W statistic follows a standard normal distribution (Hosmer & Lemeshow 2000).
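To make equations (10) to (15) concrete, the following sketch fits a one-variable logistic regression by Newton-Raphson and then computes the odds ratio, the G statistic and the Wald statistic. It is a minimal illustration, not the software used in this research; the function name `fit_logistic` and the small data set (10 events out of 40 when x = 0, 20 events out of 40 when x = 1) are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-10):
    """Maximum likelihood fit by Newton-Raphson.

    X holds the explanatory variables without an intercept column.
    Returns coefficients (intercept first), standard errors and the
    maximized log-likelihood.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    X = np.column_stack([np.ones(len(X)), X])
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        info = X.T @ (X * (p * (1.0 - p))[:, None])   # information matrix
        step = np.linalg.solve(info, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    loglik = float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
    return beta, se, loglik

# Hypothetical data: binary x, 40 subscribers in each group.
x = np.repeat([0, 1], 40)
y = np.concatenate([np.repeat([1, 0], [10, 30]), np.repeat([1, 0], [20, 20])])

beta, se, loglik_p = fit_logistic(x, y)

# Log-likelihood of the model without explanatory variables (intercept only).
p0 = y.mean()
loglik_0 = float(np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0)))

G = -2.0 * (loglik_0 - loglik_p)      # equation (14) on the log scale
wald = beta[1] / se[1]                # equation (15)
odds_ratio = float(np.exp(beta[1]))   # equation (12) with x1 = 1, x2 = 0
```

In this example the odds of the event are 10/30 when x = 0 and 20/20 when x = 1, so the estimated odds ratio is 3. With one explanatory variable, G is compared with a chi-square distribution with one degree of freedom, whose 5% critical value is 3.841.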
The accuracy of a logistic regression model is evaluated by a classification table, which relies on a single cut off point to classify the results. It gives information about the correct classification rate (CCR), sensitivity and specificity. Sensitivity measures the proportion of correctly classified events, whereas specificity measures the proportion of correctly classified nonevents (Peng, Lee & Ingersoll 2002). Hosmer & Lemeshow (2000) underlined that a classification table is most appropriate when classification is a stated goal of the analysis.
A more complete description of classification accuracy is given by the area under the ROC (receiver operating characteristic) curve, commonly called the C-statistic. Suppose there is a total of t pairs with different responses, of which nc are concordant, nd are discordant, and t–nc–nd are tied. The C-statistic is then expressed as (SAS Institute 2003):

$$ C = \frac{n_c + 0.5\,(t - n_c - n_d)}{t} \tag{16} $$

The ROC curve plots the sensitivity against 1 – specificity over the entire range of possible cut off points. The area under the ROC curve, which ranges from zero to one, provides a measure of the model’s ability to discriminate between those subjects who experience the event of interest and those who do not. As a general rule, C = 0.5 suggests no discrimination; 0.7 ≤ C < 0.8 is considered acceptable discrimination; 0.8 ≤ C < 0.9 is considered excellent discrimination; and C ≥ 0.9 is considered outstanding discrimination (Hosmer & Lemeshow 2000).
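The classification table measures and the C-statistic of equation (16) can be sketched as below. The function names, the six example subscribers and the cut off point of 0.5 are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def classification_summary(y, prob, cutoff):
    """CCR, sensitivity and specificity for a single cut off point."""
    y = np.asarray(y)
    pred = (np.asarray(prob, dtype=float) > cutoff).astype(int)
    tp = np.sum((pred == 1) & (y == 1))   # events classified as events
    tn = np.sum((pred == 0) & (y == 0))   # nonevents classified as nonevents
    ccr = (tp + tn) / len(y)
    sensitivity = tp / np.sum(y == 1)
    specificity = tn / np.sum(y == 0)
    return float(ccr), float(sensitivity), float(specificity)

def c_statistic(y, prob):
    """C-statistic from concordant, discordant and tied pairs, equation (16)."""
    y = np.asarray(y)
    prob = np.asarray(prob, dtype=float)
    # One pair for every (event, nonevent) combination.
    diff = prob[y == 1][:, None] - prob[y == 0][None, :]
    t = diff.size
    nc = np.sum(diff > 0)    # event has the higher estimated probability
    nd = np.sum(diff < 0)    # event has the lower estimated probability
    return float((nc + 0.5 * (t - nc - nd)) / t)

# Hypothetical responses and estimated probabilities.
y = np.array([1, 1, 1, 0, 0, 0])
prob = np.array([0.9, 0.6, 0.4, 0.6, 0.3, 0.2])

ccr, sens, spec = classification_summary(y, prob, cutoff=0.5)
c = c_statistic(y, prob)
```

Among the nine pairs in this example, seven are concordant, one is discordant and one is tied, so C = (7 + 0.5) / 9. Unlike the classification table, this value does not depend on the choice of cut off point.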
METHODOLOGY
Source of Data
For segmentation and profiling, this research used a sample of approximately 3% of the total MSISDNs that had been active for three consecutive months, drawn at random from the data warehouse. Segmentation was performed based on tenure and invoice. Feature usage was adopted to profile subscribers and their associated segments. Appendix 1 provides the list of those variables.
The churn analysis used somewhat more data than described above. It involved a sample of about 6% of the MSISDNs that were active three months earlier. Invoice and tenure were used as explanatory variables of churn. To estimate the model of ‘at risk’, about 2.8% of the currently active MSISDNs were used. In this analysis, the explanatory variables were features usage. All variables are described in Appendix 1. Only MSISDNs with invoices greater than zero were used.
Method
This research was divided into two sections. The first performed segmentation and profiling, and the second analyzed subscriber churn and subscribers ‘at risk’. The methodology of this research is summarized as follows:
Section I : Segmentation and profiling
1. Preparing the data, which included exploring the characteristics of the data, transforming the data into z values to adjust the scale, and stripping out outliers.
2. Calculating the Pearson correlation coefficient between variables.
3. Running the K-means clustering algorithm to form segments based on tenure and invoice, and reviewing the result.
4. Constructing the biplot.
5. Finding the profile of each segment.
6. Interpreting the results.
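Steps 1 and 2 of Section I can be sketched as follows. This is only an illustration under assumptions: the function name, the order of the operations (outliers removed before standardization), the 99.9th percentile threshold applied to invoice, and the simulated data are hypothetical and do not reproduce the actual data preparation.

```python
import numpy as np

def prepare(tenure, invoice, pct=99.9):
    """Strip invoice outliers, standardize to z values, return z and Pearson r."""
    tenure = np.asarray(tenure, dtype=float)
    invoice = np.asarray(invoice, dtype=float)
    # Assumption: objects above the given invoice percentile are outliers.
    keep = invoice <= np.percentile(invoice, pct)
    data = np.column_stack([tenure[keep], invoice[keep]])
    # z = (x - mean) / standard deviation, column by column.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    r = float(np.corrcoef(z, rowvar=False)[0, 1])
    return z, r

# Simulated tenure (months) and invoice (rupiah), for illustration only.
rng = np.random.default_rng(1)
tenure = rng.uniform(1, 177, size=2000)
invoice = rng.lognormal(mean=11.4, sigma=1.0, size=2000)

z, r = prepare(tenure, invoice)
```

After standardization both variables have zero mean and unit standard deviation, so neither dominates the Euclidean distance used by K-means.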
Section II : Churn Analysis
1. Preparing the data, which included classifying the values of the explanatory variables into categories.
2. Representing the categories of each explanatory variable as dummy variables. Appendix 5 displays information about the dummy variables.
3. Finding the main characteristics of the data.
4. Running the binary logistic regression to build the model of churn.
5. Classifying subscribers into the ‘at risk’ or ‘at safe’ group by fitting the invoice and tenure of the current month to the model of churn from step 4 with reference to the optimum cut off point: a subscriber is ‘at risk’ (risk=1) if the probability of churn exceeds the cut off point, otherwise ‘at safe’ (risk=0).
6. Building the logistic model of subscribers ‘at risk’.
7. Checking and evaluating each logistic regression model.
8. Interpreting the results from a business perspective.
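Steps 1, 2 and 5 of Section II can be illustrated with the sketch below. The category boundaries, the choice of the first category as the reference, the probabilities and the cut off point are all hypothetical; the actual categories are those defined in Appendix 5.

```python
import numpy as np

def to_dummies(values, edges):
    """Bin values into len(edges) + 1 categories and dummy-code them.

    The first category is taken as the reference (an assumption), so the
    result has one column per remaining category.
    """
    cat = np.digitize(values, edges)    # category index 0 .. len(edges)
    n_cat = len(edges) + 1
    return (cat[:, None] == np.arange(1, n_cat)[None, :]).astype(int)

def flag_at_risk(prob, cutoff):
    """risk = 1 if the estimated churn probability exceeds the cut off point."""
    return (np.asarray(prob, dtype=float) > cutoff).astype(int)

# Hypothetical tenure values (months) and category boundaries.
tenure = np.array([3, 14, 30, 70])
dummies = to_dummies(tenure, edges=[12, 24, 60])

# Hypothetical estimated churn probabilities and cut off point.
risk = flag_at_risk([0.10, 0.35, 0.80], cutoff=0.30)
```

The resulting ‘at risk’ flag is the binary response used for the second logistic model in step 6.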

RESULT AND DISCUSSION

Segmentation and Profiling
Profile of Sample
Among the sampled MSISDNs, about 32 that exceeded the 99.9th percentile of invoice were stripped out because they were indicated as outliers.
Segmentation was performed based on customer value using two variables: tenure, or duration of subscription, and invoice (the average invoice over three months). Derived variables of invoice, called features usage, were then used to profile the segments.

Table 1 Descriptive statistic of sample
Statistic         Tenure (Month)   Invoice (Rp.)
Minimum                     1.03               0
Maximum                   177.00       3,998,803
Mean                       52.56         177,341
Median                     44.20          91,920
Std. deviation             41.57         279,377

Descriptive statistics of tenure and invoice for the sample are summarized in Table 1, and the distributions can be seen in Appendix 2.A and 2.B. According to Table 1, the sample had very high variance, especially for invoice. The minimum tenure was 1 month, the maximum was 177 months, and the mean was 53 months. The mean invoice was Rp 177,341, but about 50% of subscribers had an invoice of less than Rp 92,000.

Table 2 Invoice usage of sample by feature
Feature                     Percentage of invoice
Voice
   Domestic                                 59.88
   International                             4.06
   VoIP                                      2.05
   3G                                        0.09
Value Added
   SMS                                      16.97
   MMS                                       0.10
   GPRS                                      1.44
   3G Data                                   0.55
International Roaming
   Voice                                     8.98
   SMS                                       3.69
   GPRS                                      2.19
Invoice                                    100.00

The largest parts of invoice were spent on domestic voice, SMS and international roaming voice. Subscribers spent over 85% of invoice on those three features. In contrast, subscribers spent less than 1% of invoice on 3G features. Table 2 displays the invoice usage of the sample by feature.

Segments and Profiles
Because tenure and invoice were on different scales, the data were first standardized into z values with zero mean and unit standard deviation, following the formula $z = (x - \bar{x}) / s$, where $\bar{x}$ and $s$ were the mean and standard deviation of the sample. Since the Pearson correlation between tenure and invoice was only about 0.27, indicating no strong correlation between the variables, Euclidean distance was used when executing the K-means clustering algorithm. The number of segments to be identified was determined by business experts. In this case, Indosat decided to have five segments (Subekti 2009, personal communication).

[Figure 1: scatter plot by segment; axes: Tenure (month) and Invoice (Rp 000)]
Figure 1 Plot of tenure vs. invoice by segment
Table 3 Evaluation of segmentation result
Segment   RMS    Dist. to Nearest Segment   Dist. Ratio
A         0.30                       1.10          3.65
B         0.39                       1.10          2.80
C         0.45                       1.76          3.89
D         0.91                       2.27          2.48
E         1.72                       4.88          2.83

Figure 1 shows that K-means allocated the subscribers into non-overlapping segments, so that each subscriber was in exactly one segment. The root mean square standard deviation of each segment is shown in the second column of Table 3. The distance to the nearest segment provides a measure of the separation between centroids. The distance ratio was calculated by dividing the distance to the nearest segment by the root mean square standard deviation, RMS. According to Table 3, segments D and E had very large within-segment variation, with segment E having the largest. This is also clearly visible in Figure 1. However, the distance ratios were large enough for the result to be considered satisfactory. Appendix 3.A and 3.B provide more information about the K-means clustering result.
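The distance ratio can be recomputed from the rounded values printed in Table 3, as in the short sketch below. The variable names are hypothetical, and small differences from the published ratios are expected because the table reports RMS and distance to two decimals only.

```python
# (RMS, distance to nearest segment, published ratio) from Table 3.
table3 = {
    "A": (0.30, 1.10, 3.65),
    "B": (0.39, 1.10, 2.80),
    "C": (0.45, 1.76, 3.89),
    "D": (0.91, 2.27, 2.48),
    "E": (1.72, 4.88, 2.83),
}

# Distance ratio = distance to nearest segment / RMS.
ratios = {seg: dist / rms for seg, (rms, dist, _) in table3.items()}

# Largest gap between a recomputed ratio and the published one.
max_gap = max(abs(ratios[seg] - table3[seg][2]) for seg in table3)
```

Every segment has a ratio well above one, meaning the nearest neighbouring centroid is more than two RMS units away.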
Referring again to Figure 1, each of the five segments contained homogeneous subscribers, and the segments were clearly different from each other. The first three segments (A, B, and C) had the lowest invoices but were separated by tenure. The last two segments (D and E) had higher invoices than A, B, and C. The tenure of segments D and E was widely scattered, covering a very large range.
Table 4 Segment summary information
Segment   Subsc.    Mean Tenure (Month)   Mean Invoice (Rupiah)
A         12,418    15.33                 73,865
B         12,562    59.97                 137,674
C         4,064     132.63                200,978
D         1,965     69.43                 767,615
E         271       92.00                 2,123,224
ALL       31,280    52.56                 177,341

Profiling was carried out by exploring the size
and the mean tenure and invoice of each
segment. These profiles are displayed in
Table 4. In addition, profiling was also
based on the percentage of invoice spent on
each feature. Information about feature
usage is summarized in Appendix 4 and
visualized by the biplot in Figure 2. Using the SAS
macro written by Friendly
(1998), the biplot was constructed with α=½ and
captured about 99.6% of the information.
Finally, according to Table 4 and Figure 2,
the characteristics of the segments are described as
follows:
1. Segment A :
Segment A contained 39.7% of
subscribers. New
subscribers with the lowest invoices
belonged to this segment. The
largest part of their invoice was spent on
domestic voice and SMS features.
Interestingly, segment A used
secondary features, such as voice 3G,
GPRS, MMS and data 3G, more than the
other segments did.
2. Segment B :
Segment B contained 40.2% of
subscribers, making it the largest
segment. On average, subscribers in
segment B had held their
subscription for about five years. However,
they spent only a small amount per invoice.
The largest part of the invoice in this
segment went to domestic voice calls
and SMS.

Figure 2 Biplot for segment and feature usage (Dimension 1: 98.1%; Dimension 2: 1.5%; blue: segment; red: feature usage)

3. Segment C :
Segment C contained only
13.0% of subscribers. Although the
subscribers in this segment contributed
low invoices, they were considered to have a
high level of loyalty, as indicated by
tenures of over 10 years. Subscribers
within segment C were associated with the domestic
voice feature.
4. Segment D :
Segment D contained only 6.3%
of subscribers. The mean invoice of
subscribers within segment D was greater
than that of subscribers overall. Additionally,
the variance of tenure was very large, which
indicates that this segment included both very
short and very long tenures.
Compared to other segments, subscribers
in segment D were VoIP users. They
also spent more than the
first three segments on features for
international connection purposes, such as
international voice and roaming.
5. Segment E :
Segment E contained only 0.9%
of subscribers. It was the smallest segment
formed. The subscribers in segment E
were considered the most profitable,
as indicated by their very high invoices.
As in segment D, the variance of tenure
was very large. Most of the invoice was
allocated to international features,
e.g. international roaming and voice
dialing.
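
The segment shares quoted in the list above, and the overall mean tenure in Table 4, follow directly from the segment sizes. The short Python check below recomputes them; the variable names are chosen only for this example.

```python
# Segment sizes and mean tenure (months) transcribed from Table 4.
subscribers = {"A": 12418, "B": 12562, "C": 4064, "D": 1965, "E": 271}
mean_tenure = {"A": 15.33, "B": 59.97, "C": 132.63, "D": 69.43, "E": 92.00}

total = sum(subscribers.values())

# Share of each segment in percent, rounded to one decimal as in the text.
share = {seg: round(100 * n / total, 1) for seg, n in subscribers.items()}

# The overall mean tenure is the size-weighted mean of the segment means.
overall_tenure = sum(subscribers[s] * mean_tenure[s] for s in subscribers) / total

print(total, share, round(overall_tenure, 2))
```

The weighted mean reproduces the ALL row of Table 4, which suggests that the segment rows and the total row are mutually consistent.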
According to the biplot, segment A
behaved similarly to segment B in spending
invoice. Both were dominated by basic
feature users, who also spent part of their invoice on
several secondary features, although in
small amounts. Likewise, segment C
behaved similarly to segment D in spending invoice.
Segment E was very different from the other
segments in spending invoice. Its
subscribers needed international
connectivity services.
The biplot in Figure 2 also provides
important information about feature usage.
According to the biplot, SMS usage and domestic
voice usage were uncorrelated. There was a
positive correlation among SMS, GPRS,
data 3G, MMS and voice 3G usage, but these
were negatively correlated with SMS international
roaming and VoIP usage. Voice international
roaming was also positively correlated with voice
international and GPRS international roaming
usage, but these were negatively correlated with
domestic voice usage.
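
For reference, the general construction behind a biplot with α=½ can be written as below. This is the standard singular value decomposition form and not a transcription of the macro's internals. The symbols X (the suitably centered segment-by-feature table), U, Λ, V, G and H are notation assumed for this sketch.

```latex
X = U \Lambda V^{\top}, \qquad
G = U \Lambda^{\alpha}, \qquad
H = V \Lambda^{1-\alpha}, \qquad
X = G H^{\top}

\text{fit of the first two dimensions}
  = \frac{\lambda_1^{2} + \lambda_2^{2}}{\sum_{k} \lambda_k^{2}}
  = 0.981 + 0.015 = 0.996
```

Rows of G give the segment points and rows of H give the feature vectors, plotted on the first two dimensions. With α=½ the singular values are split evenly between the two sets, so the angles between feature vectors reflect correlations only approximately.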

Churn Analysis
Model of Churn
A logistic regression model was used to
analyze the churn of subscribers. The
explanatory variables were the categories of
tenure and invoice from three months earlier. These
two explanatory variables were fitted to the
active status in the current month: ‘churn’
(churn=1) or ‘stay’ (churn=0). The analysis
involved 60,000 randomly selected
MSISDNs. A description of the sample is provided in
Table 5, and for additional information,
Appendix 6.A and 6.B provide the distribution of
the sample by categories of tenure and invoice.
Table 5 Description of sample for churn analysis
Churn    Frequency   Percentage
0        57653       96.1
1        2347        3.9
Total    60000       100.0


A summary of the logistic regression analysis of
churn is presented in Table 6.
The overall model was evaluated
using the likelihood ratio (G) test. This test
indicated that the logistic model of churn was
significant; hence tenure and invoice from
three months earlier were useful for estimating
churn events.
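
To show how the G statistic translates into a significance statement, the Python sketch below evaluates the upper tail of the chi-square distribution using the closed form that holds for even degrees of freedom. The function name chi2_sf_even_df is an assumption of this example. The values G = 3873.261 and df = 8 are those reported in Table 6, and 15.507 is the usual 5% critical value for 8 degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Sum of (x/2)^i / i! for i = 0 .. df/2 - 1.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

g_statistic = 3873.261   # likelihood ratio statistic from Table 6
df = 8                   # 5 tenure parameters + 3 invoice parameters

p_value = chi2_sf_even_df(g_statistic, df)   # underflows to 0.0 in double precision
p_at_critical = chi2_sf_even_df(15.507, df)  # close to 0.05

print(p_value, round(p_at_critical, 3))
```

Since G is far above the critical value, the model with tenure and invoice fits significantly better than the intercept-only model.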
Model performance was measured by the area
under the ROC curve, or C-statistic. For this model,
the C-statistic was about 0.841. This means that
for 84% of all possible pairs of subscribers in which
one churned and the other stayed, the model
correctly assigned a higher probability to
the one who churned. This was considered
excellent discrimination between churn and stay. For
further information, Appendix 7.A provides
the ROC curve for the model of churn.
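
The pairwise interpretation of the C-statistic can be made concrete with a small Python sketch. The function name c_statistic and the predicted probabilities are hypothetical and are not data from this study. Ties are counted as half a concordant pair, which is a common convention assumed here.

```python
def c_statistic(churn_scores, stay_scores):
    """Share of (churn, stay) pairs in which the churner has the higher score."""
    concordant = 0.0
    for c in churn_scores:
        for s in stay_scores:
            if c > s:
                concordant += 1.0
            elif c == s:
                concordant += 0.5  # a tie counts as half a concordant pair
    return concordant / (len(churn_scores) * len(stay_scores))

# Hypothetical predicted churn probabilities.
churn_scores = [0.30, 0.12, 0.08]       # subscribers who churned
stay_scores = [0.02, 0.05, 0.10, 0.01]  # subscribers who stayed

c = c_statistic(churn_scores, stay_scores)
print(round(c, 3))
```

A value of 0.5 would correspond to random ordering and 1.0 to perfect separation of churners from stayers.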
To assess the accuracy of the model, an optimal
cut-off point should be chosen so that it
maximizes sensitivity and specificity. In this
case, the optimal cut-off point was 0.05
(Appendix 7.B), which yielded a correct
classification rate of 78%, sensitivity of 76%, and
specificity of 78%. This result also suggested
that the model gave satisfactory results in
predicting churn events.
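
The three accuracy measures can be computed from predicted probabilities and observed outcomes as sketched below in Python. The function name classification_rates, the toy probabilities, and the rule that a probability at or above the cut-off is classified as churn are assumptions of this example. The final lines check that the reported rates are mutually consistent, given the churn counts in Table 5.

```python
def classification_rates(probs, labels, cutoff):
    """Return (sensitivity, specificity, correct classification rate)."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if p >= cutoff and y == 1)
    fn = sum(1 for p, y in pairs if p < cutoff and y == 1)
    tn = sum(1 for p, y in pairs if p < cutoff and y == 0)
    fp = sum(1 for p, y in pairs if p >= cutoff and y == 0)
    sensitivity = tp / (tp + fn)   # churners correctly flagged
    specificity = tn / (tn + fp)   # stayers correctly left unflagged
    correct = (tp + tn) / len(pairs)
    return sensitivity, specificity, correct

# Hypothetical toy data: predicted probabilities and observed churn (1) or stay (0).
probs = [0.30, 0.12, 0.04, 0.02, 0.06, 0.01, 0.03, 0.20]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
sens, spec, correct = classification_rates(probs, labels, cutoff=0.05)

# Consistency check with the reported rates and the sample in Table 5.
churn_share = 2347 / 60000
overall = churn_share * 0.76 + (1 - churn_share) * 0.78
print(round(overall, 3))
```

Because only about 4% of the sample churned, the correct classification rate is driven almost entirely by specificity.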
The probability of churn for any given tenure
and invoice can be illustrated as
follows. A subscriber with tenure=1 (less
than 3 months) and invoice=1 (less than Rp
25,000) has an estimated probability of churn of:
$$\frac{\exp(-5.708 + 3.710 + 0.816)}{1 + \exp(-5.708 + 3.710 + 0.816)} \approx 0.235$$
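
The same calculation can be reproduced in a few lines of Python. The function name churn_probability is an assumption of this sketch, and the coefficients are those used in the worked example above. The last line also checks that the θ column of Table 6 appears to be the exponentiated coefficient, that is, an odds ratio.

```python
import math

def churn_probability(intercept, *effects):
    """Inverse logit of the linear predictor: intercept plus category effects."""
    eta = intercept + sum(effects)
    return 1.0 / (1.0 + math.exp(-eta))

# Intercept, tenure=1 effect and invoice=1 effect from the worked example.
p = churn_probability(-5.708, 3.710, 0.816)

# Odds ratio for tenure=1 relative to the reference tenure category.
odds_ratio_tenure_1 = math.exp(3.710)

print(round(p, 3), round(odds_ratio_tenure_1, 2))
```

The odds ratio differs slightly from the tabulated 40.842 because the published coefficient is rounded to three decimals.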


Table 6 Summary of logistic regression analysis of churn
Explanatory variables   β̂        SE(β̂)   Wald’s χ2   df   θ̂
Constant                –5.708   0.107    2848.500    1
Tenure                                    2344.989    5
  tenure=1              3.710    0.109    1151.502    1    40.842
  tenure=2              3.038    0.113    726.426     1    20.855
  tenure=3              2.147    0.121    313.251     1    8.560
  tenure=4              2.481    0.113    481.646     1    11.957
  tenure=5              0.724    0.119    37.098      1    2.063
Invoice                                   511.372     3
  invoice=1             0.816    0.097    70.343      1    2.262
  invoice=2             1.299    0.062    437.354     1    3.666
  invoice=3             0.399    0.063    40.470      1    1.491

Model evaluation
Test                        df   χ2         p-value
Likelihood ratio (G) test   8    3873.261
Wald test                   8    2807.530