Big Data Paper

Big Data
Amanda Marquardt
March 26, 2017
ACG4401C
Professor Robert Tennant

Executive Summary
Big Data is a growing topic that brings many challenges and opportunities for
information technology advancement. Today, data grows every second from
everything we do, and the systems that support it are becoming increasingly
capable of handling larger data sets and returning more useful information.
Storage of data is now relatively cheap, and parallel processing, which uses
multiple servers to process data at the same time, is one of several
technologies that let companies process Big Data effectively, efficiently, and
at an affordable cost.
As data continues to grow and Big Data analytics gains popularity, the legal
issues surrounding Big Data also become a concern. The legal system does not
provide clear boundaries concerning ownership, contractual rights, or privacy
standards, which poses a challenge for Big Data analytics. Implementing laws
and regulations could help avoid future issues in Big Data development.

Big Data is not a clearly defined term, and what qualifies as “Big Data” is not held to
a fixed standard; however, it is generally agreed that Big Data consists of both
structured and unstructured data. A large majority of existing data is unstructured,
with social media the leading contributor today. Much of the world now uses some
form of social media, making the data it produces of extreme value to businesses and
individuals in business, finance, and many other areas. Big Data is a big topic with
many contributing factors that bring both opportunities and challenges. Developing
information technology to support the massive amounts of data being created is
essential to gaining the most benefit from that data and to giving companies
competitive advantages.

Introduction
When hearing the term “Big Data”, many people would assume it simply means
large amounts of data. Technically, there is still no standard
definition of Big Data; however, Big Data in short consists of data sets that are
too large to analyze using ordinary information system algorithms, creating a
demand for more complex systems to manage these overwhelming data sets.
Until recently, data was produced only from human input, but now data grows
constantly both from data manually entered by humans and from data created by
computers.
The components that differentiate Big Data from ordinary data have been
referred to as the 4 V’s: volume, velocity, veracity, and variety. IBM has estimated
that, as of 2017, 2.5 quintillion bytes of data are created every day from
everything we do. This estimate illustrates the high volume and high
velocity of today’s data. Veracity, the trustworthiness of data, is also a concern
because those 2.5 quintillion bytes come from a multitude of sources such as
email, social media, and online shopping. The variety component refers to the
presence of both structured and unstructured data. Structured data is data
already organized into a logical view that makes information easily accessible,
such as Excel worksheets. Unstructured data consists of text and multimedia
content, such as photographs and music files, that is not easy to organize or
access.
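
The difference between structured and unstructured data can be illustrated with a minimal Python sketch. The sample records, the sample sentence, and the dollar-amount pattern are all made up for illustration; real unstructured data is far messier than this.

```python
import csv
import io
import re

# Structured data: every record has the same named fields, so a value
# can be looked up directly by its column name.
structured = io.StringIO(
    "customer,item,amount\n"
    "Ana,laptop,899.00\n"
    "Ben,phone,499.00\n"
)
rows = list(csv.DictReader(structured))
total_sales = sum(float(row["amount"]) for row in rows)

# Unstructured data: free text with no fixed fields. The amount has to be
# searched for, here with a simple (and fragile) pattern for dollar values.
unstructured = "Just bought a new laptop for $899, best purchase this year!"
amounts = [float(m) for m in re.findall(r"\$(\d+(?:\.\d+)?)", unstructured)]

print(total_sales)  # 1398.0
print(amounts)      # [899.0]
```

The structured records answer the question with a single column lookup, while the free text needs a hand-written pattern that would miss amounts written in any other way.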

As the four V’s continue to grow, so do the challenges and opportunities involved
in keeping up with this constant expansion. Considering that data
is constantly growing from everything we do, the opportunities for
data management seem limitless. We live in a data-booming generation, but having
massive amounts of data is useless if it cannot be transformed into information we
can make sense of to draw useful conclusions. It is essential that Information
Technology continues to improve so that we gain the most benefit from the massive
amounts of data available to us. The ability to organize and analyze the massive
loads of data existing today can provide useful information to every profession in
every field, from helping marketing specialists analyze and interpret sales trends to
helping medical professionals around the world share their knowledge to cure
diseases and save lives. There is no question that Big Data is here, but the only
way to benefit from it is by having access to affordable systems to manage it all.

Parallel Processing
Before the term Big Data came about, the main source of data was the data
entered by employees. As technology evolved and the internet became
accessible to everyone, not only employees but also users and computers
themselves started generating data, causing a huge rise in data accumulation
and bringing us the term “Big Data”. Users have been the biggest contributor to the
expansion of data. Through Facebook posts, Tweets, and online shopping, many people
create data without even knowing they are doing so.

When data sets were much smaller and entered only by employees, companies used
relational databases in which the data was brought to a single processor. Now that
there is so much more data, it is overwhelming for a single server to process. It was
clear that more storage and processing capacity would be necessary as we evolved
into a world of “Big Data”, which led to advanced systems consisting of multiple
processing servers capable of storing more data.
Companies used to pay database vendors such as Oracle and IBM to manage
their data. Eventually, Google’s data became too large for its vendor to manage,
leading the company to create MapReduce. MapReduce is a programming model that
breaks a large data set into smaller parts that are processed on multiple
servers at the same time. This allowed more data to be stored and processed at a
faster rate. This was the start of one method used today, known as parallel processing.
Parallel processing is a popular processing method used today. It processes data
by bringing multiple processors to the data, as opposed to the earlier approach in
which data was brought to a single server. Companies now provide services to help
businesses implement parallel processing infrastructures.
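
The idea of splitting work into parts, processing the parts at the same time, and then combining the results can be sketched in a few lines of Python. This is only a toy illustration of the map and reduce steps, not Google's or Hadoop's implementation: threads on one machine stand in for separate servers, and the function names (map_chunk, reduce_counts, word_count) and the sample posts are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, workers=3):
    # Split the data set into one chunk per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Each worker processes its own chunk independently of the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


posts = [
    "big data is big",
    "data is growing",
    "big data needs parallel processing",
]
result = word_count(posts)
print(result["big"], result["data"])  # 3 3
```

Because each chunk is counted without reference to the others, more workers can be added as the data grows, which is the property that makes this style of processing attractive for very large data sets.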
Hadoop is a popular framework that makes parallel processing available to
companies. Typically, companies using over 10 terabytes of data receive the
greatest net benefit from Hadoop. Hadoop is an open source platform,
meaning the software itself does not cost anything to use, but experts will more
than likely be needed to manage the system.

Social Big Data
The variety component of Big Data has become the most challenging issue
in analyzing Big Data because unstructured data is difficult to break down
and organize due to all the different sources of data it contains. Approximately
90% of a company’s data consists of unstructured data. As mentioned before,
unstructured data includes files containing text and multimedia components
such as emails, photographs, music files, and websites. Although unstructured
data makes up the majority of all existing data, it has presented the most difficulty
for the IT community in designing software to analyze it. The challenge lies in
filtering relevant data from multiple mediums and then organizing it into
useful information. The challenges associated with analyzing unstructured data put
a limit on the information available to companies, which in turn limits the
availability of decision-making tools.
Social media is a popular and fast-growing form of unstructured data. Facebook
currently has over a billion users, active Twitter accounts number in the
many millions, and there are over 400 million profiles on LinkedIn, just to name a
few. Social media is constantly being used for a multitude of reasons, such as
networking, marketing, or personal use. These social media websites are
equipped to manage and store this big data, and the volume of data
produced by social media has the potential to yield a great deal of beneficial
information for companies.
In the early 2000s, one method for predicting stock performance was to evaluate
the volume of messages on financial blogs mentioning specific companies.
The idea was that more messages posted about a specific company would
lead to a rise in its stock price the next day. This idea led to a study started in
2012 (Sanger & Warin) of the relationship between stock prices and Tweets. The
study used 71 companies of the S&P 500, gathering the number of times the name of
each company was mentioned on Twitter and the number of financial Tweets (“$”
before the ticker, e.g., $GOOG) posted about the companies, and compared these
counts to intraday and overnight stock returns. After a year’s worth of data was
collected, it was concluded that the number of financial Tweets has a negative
correlation with overnight stock returns. Data from Twitter has been used to
study numerous areas, but the research involved in gathering and analyzing the data
is difficult and time consuming without software to assist with filtering and
compiling the data needed to conduct the study.
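
The kind of filtering and comparison described above can be sketched in Python. This is a simplified illustration and not the method used by Sanger & Warin: the Tweets, the daily counts, the return figures, and the helper names (count_cashtags, pearson) are all made up, and the toy numbers are chosen only to show what a negative correlation looks like.

```python
import re
from math import sqrt

# A financial Tweet is marked by "$" before the ticker, e.g. $GOOG.
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")


def count_cashtags(tweets, ticker):
    """Count the Tweets that mention the ticker as a cashtag."""
    return sum(1 for tweet in tweets if ticker in CASHTAG.findall(tweet))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up Tweets for a single day; only two use the $GOOG cashtag.
tweets = ["Watching $GOOG today", "$GOOG and $AAPL both up", "I love Google"]
print(count_cashtags(tweets, "GOOG"))  # 2

# Made-up daily series: financial Tweet counts and overnight returns (%).
daily_tweet_counts = [10, 20, 30, 40]
overnight_returns = [0.4, 0.3, 0.2, 0.1]
r = pearson(daily_tweet_counts, overnight_returns)
print(r < 0)  # True: more Tweets go with lower returns in this toy data
```

Even in this small form, the filtering step shows why unstructured data is hard to work with: the Tweet that mentions the company by name but without the cashtag is not counted as a financial Tweet.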
The Tweet counts collected by Sanger & Warin were obtained from a
website called PeopleBrowsr. PeopleBrowsr is the largest Social Intelligence
Platform in the world. This website allows companies to instantly create large
networks via social media as well as compile data from their networks into queries
to filter useful information. Social media analytics tools such as PeopleBrowsr are
becoming more popular and have been a huge step in the right direction for
addressing the challenges associated with unstructured data.

The Law & Big Data Analytics
In 2013, PeopleBrowsr, the website mentioned above, encountered legal issues
when Twitter stated that it wanted to stop allowing PeopleBrowsr access to its
data. PeopleBrowsr was paying Twitter $1 million a year for access to its
database; however, Twitter wanted control of its data and wanted to offer it
exclusively to other companies. PeopleBrowsr filed a complaint against Twitter for
violation of state common law and California’s Unfair Competition Law, claiming
that data obtained from Twitter was the main source of its business and that being
refused access would cause the business to cease. The case was moved to federal
court on the grounds that the alleged violation was more closely aligned with
federal law, the Sherman Act. The parties settled, allowing PeopleBrowsr access to
Twitter’s data for the rest of that year, after which Twitter took full control of
the data, requiring PeopleBrowsr to purchase the data from other companies at a much
higher cost than the $1 million per year it had been paying directly to Twitter.
Big Data is a valuable commodity that can make or break a company. As seen in
the case of PeopleBrowsr and Twitter, there is a grey area regarding the legal
rights to access data. Twitter was accused of violating the Sherman
Act, which raises the issue of monopolizing Big Data. This case also raises
contractual issues with Big Data. Twitter originally gave PeopleBrowsr licensing
rights to its data, but the duration of the agreement became an issue. When
data holders provide data services for compensation, there are no clear legal
requirements regarding contracts and no regulation enforcing contracts associated
with Big Data.
In 2015, Radio Shack filed for bankruptcy, and the company’s assets were
liquidated to pay its debt. Included in Radio Shack’s assets were its customer
records, consisting of over 100 million records. The customers on record had
agreed to privacy policies, which led some of these customers to contest the sale of
their personal information. The bankruptcy court allowed General Wireless to
purchase the customer records, but with restrictions. The limitations
included giving customers notice of the sale with an option to opt out and
requiring that the data be used in the same line of business as Radio Shack.
Privacy is another issue that comes into play, particularly regarding Big Data
analytics. Larger quantities of data bring greater value and more analysis
opportunities, but some people are reluctant to share data because they fear
exposure of their personal information. Although data analysts typically anonymize
their results, Big Data still becomes Big Data through input from sources
everywhere and from everyone. Big Data is of great value, which raises the
issue of whether people can seek compensation for their valuable
contribution. The government has an ethical responsibility to protect citizens’
privacy, but it lacks a definitive line as to what constitutes an “invasion of
privacy” in Big Data analytics. Big Data analyses can provide great public good
without causing harm to citizens, but the law lacks a definitive line regarding the
extent to which the legal system will protect Big Data analytics compared with the
protection of citizens’ personal data.
These are only a few of the issues regarding the law and Big Data, particularly Big
Data analytics. As information systems become more advanced and more data
can be processed into useful information, the legal issues will also continue to
grow. We are becoming increasingly in control of data produced by
everyone all around the world as Information Technology continues to advance.
For Big Data analysis to continue to progress, it is essential for more legal
regulation to be put into effect.

Conclusion
Big Data has given this generation new and exciting challenges to face. We have
been able to evolve from relational databases that brought data to a single
processor to parallel processing systems that coordinate multiple servers
to store and process more data at higher speeds. Social media has proven to be of
extreme value in today’s Big Data analyses. Social media data is so widely available
that it has been used in many areas of research, including stock returns;
however, unstructured data such as social media has proven to be a
challenge in Big Data analytics. Because unstructured data comes from so many
sources, it is difficult to filter all relevant data effectively. Big Data has also
brought concern regarding the legal system’s involvement with Big Data. We have seen
that there are many issues at hand with Big Data and the legal system that leave
grey areas. Ownership of data, contractual requirements, and privacy issues are
a few concerns that have required legal action. The cases mentioned in this
paper have been decided on a case-by-case basis; however, whether
the legal system should enact clear guidelines regarding Big Data and its
technicalities to protect the progression of Big Data analytics is worth
consideration.

References

1. Arthur, L (2013), “What is Big Data”, Forbes Magazine,
https://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/#122b6075c85b
2. Talluri, Sushma. "Big Data using Cloud Technologies." Global Journal of
Computer Science and Technology 16.2 (2016).
3. Mishra, Devendra Kumar. "CHALLENGES WITH UNSTRUCTURED BIG DATA
ANALYSIS USING MACHINE LEARNING APPROACH: A REVIEW." Futuristic
Trends in Engineering, Science, Humanities, and Technology FTESHT-16 (2016):
130.
4. Šebalj, Dario, Ana Živković, and Kristina Hodak. "Big data: Changes in data
management." Ekonomski vjesnik/Econviews-Review of Contemporary Business,
Entrepreneurship and Economic Issues 29.2 (2016): 487-499.
5. Sanger, William, and Thierry Warin. "High Frequency and Unstructured Data in
Finance: An Exploratory Study of Twitter." Journal of Global Research in
Computer Science 7.4 (2016).
6. Allen, Anita L. "Protecting One's Own Privacy in a Big Data Economy." (2016).
7. Brooker, Phillip, Julie Barnett, and Timothy Cribbin. "Doing social media
analytics." Big Data & Society 3.2 (2016): 2053951716658060.
8. Kitchin, Rob, and Gavin McArdle. "What makes Big Data, Big Data? Exploring
the ontological characteristics of 26 datasets." Big Data & Society 3.1 (2016):
2053951716631130.
9. Zeno-Zencovich, Vincenzo, and Giorgio Giannone Codiglione. "Ten Legal
Perspectives on the Big Data Revolution." (2016).
10. Pradhananga, Yanish, Shridevi Karande, and Chandraprakash Karande. "High
Performance Analytics of Big Data with Dynamic and Optimized Hadoop Cluster."
Advanced Communication Control and Computing Technologies (ICACCCT),
2016 International Conference on. IEEE, 2016.
11. Hillam, Jared. “What is Hadoop?” YouTube, 14 July 2012. Web. 26 Mar. 2017.

12. Carter, Edward L., and Laurie Thomas Lee. "Information Access and Control in
an Age of Big Data." Journalism & Mass Communication Quarterly 93.2 (2016):
269-272.
13. McAfee, David. “Twitter, PeopleBrowsr Settle Dispute Over Data Access.”
Law360.com
14. Che, Dunren, Mejdl Safran, and Zhiyong Peng. "From big data to big data
mining: challenges, issues, and opportunities." International Conference on
Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2013.
15. Jagadish, H. V., et al. "Big data and its technical challenges." Communications of
the ACM 57.7 (2014): 86-94.
