Last updated on Jan. 27, 2021, 12:26 p.m. by mayank
All about the Oil of the 21st Century!
Tech folks, and anyone even remotely associated with technology, must have heard data described as the oil of the 21st century, taking the powerful position that oil held in the 20th. The 21st century is flooded with data from all over the Earth. To make the most of the data generated by literally everything on this planet, we need people who specialize in data, technology and statistics. Data Scientist, long one of the hottest jobs of the 21st century, is still attractive to new graduates as well as undergraduates, especially those from an engineering background. Big Tech companies like Facebook, Google and Apple are frequently in the news for data breaches, data leaks and the misuse of consumer data. But that doesn’t mean they are always doing something fishy with the data collected from their consumers. They also use it for social good, research and marketing, which can ultimately benefit the consumers themselves.
Now that we are aware of the status quo of Big Data, one would definitely be curious to know what data is being collected and how exactly it is being used to our advantage. So let’s dive in to understand how Big Data works. Let us start with the basics and first look at the definition of data.
Oxford Dictionary defines data as -
“a fact or piece of information”
So basically, anything that tells us something is data! But Big Tech can’t always make use of this data in its raw form. It first has to be processed into something meaningful, because only meaningful data can be put to good use. A lot of data is available out there, some of it structured and some unstructured. Structured data is, in most cases, easy to use as it is. The main problem lies with unstructured data. Petabytes of unstructured data are generated every minute, and large companies and organizations are constantly trying to make the best use of it.
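To make the jump from raw to meaningful data concrete, here is a minimal sketch in Python. The log line and its format are made up for this illustration (they are not taken from any real system), and the `parse_line` helper is hypothetical. It simply shows one way a line of free text can be turned into a structured record with named fields.

```python
import re

# Hypothetical raw log line; the format is invented for this example.
RAW_LINE = "2021-01-23 18:56:02 user=alice action=search query=big data"

# Named groups give each piece of the line a field name.
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"user=(?P<user>\S+) action=(?P<action>\S+) query=(?P<query>.+)"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # The structured record can now be filtered, counted or stored in a table.
    print(parse_line(RAW_LINE))
```

Real unstructured data (images, audio, free-form text) is of course much harder to handle than this toy line, but the idea is the same: extract fields that a program can work with.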
But first, let us understand the origin of such a large amount of data: where does Big Data come from? Short answer: Big Data is generated from three sources, namely machine data, organizational data and data generated by people. Long answer: well, the next section is devoted to that.
Diving into the list of data generation sources, first comes machine data: the data generated by machines all over the world. Machines here range from the small sensors embedded in your Fitbit to the Large Hadron Collider (LHC) at CERN. Let us look at how machine-generated data is collected and used. Fitbits, which many people around the world use, store basic health information such as heart rate and sleep activity. This data can be used for medical diagnosis and can thereby help detect early symptoms of some chronic illnesses. One of the best examples of this is how Google leverages AI and data to detect medical emergencies well ahead of time! Similarly, self-driving cars are being developed using data from sensors and IoT devices fitted in certain vehicles. These first collect enormous amounts of data on how the owners drive, using motion detectors, cameras and other sensors, and specialized AI and ML algorithms then use that data to push autonomous driving forward. Apple is using cutting-edge technology to put its fully autonomous vehicle on the road.
The second source of Big Data is organizational data. This is the traditional way of collecting data. Organizations collect large amounts of data as part of their daily routine as well as for future research. This data is highly structured and well stored. Good examples are the employee and customer records kept by any huge organization or company such as Microsoft, or the customer records, employee records and other internal data stored by banks.
The third category is the data collected from people. Every day, people all around the globe knowingly or unknowingly add to this vast pool of data while going about their daily lives, through simple tasks such as performing a Google search, visiting a website or travelling by public transport. Every like we give on Facebook and each Google search we perform contributes to the user databases of these companies. They then use this user-generated data to target ads at specific people and thereby show them relevant products and services. Personalized marketing of this kind adds value to the economy overall, which makes user data valuable to both the advertiser (the one whose ads are shown) and the ad network (the facilitator, like Facebook or Google). This is how user-generated data helps Big Tech make billions of dollars in revenue while delivering more relevant ads to the consumer.
These three sources generate seemingly unlimited volumes of data. To put this in perspective, it is often said that 90% of the data that has ever existed was generated in the last two years. Now that we know how this data is generated, let us see what the features of the data generated by these sources are and why they matter in modelling and computation. Following the usual terminology, there are mainly 5 Vs of Big Data (well, I have also got a 6th one for you): volume, variety, velocity, veracity, valence and value.
Having seen the various aspects of Big Data and its use cases, we now look at a practical example to gain a better perspective on Data Science. Let us take the case study of PUBG (PlayerUnknown’s Battlegrounds).
PUBG - case study
PUBG, a very popular multiplayer game, is played by millions of gamers across the world. (Fun fact: all the people involved with this site are also good PUBG players.) Naturally, due to its large customer base, PUBG has a huge volume of data arriving at its doorstep every second. This includes player_id, device information, profile picture, the time for which the app was used, in-app purchases, region and location, friends list, other gaming activity details, and much more. As these are all different from each other, they may come in the form of audio, strings, images, etc., i.e. a huge variety of data. Because of the large number of users and the large amount of time they spend in the game, this data can be said to have a high velocity. We can also say that the veracity, i.e. the quality, of the data is only moderately good, as there are a lot of bots and other malpractices that players use in order to win the game. One important feature is the valence, or interconnectedness, of the data. The pieces of user data are generally connected with each other. For example, the in-app purchases a user has made are connected with the amount of time the user spends playing as well as with the rating of the player, i.e. a better player might spend more on in-app purchases in order to upgrade, or maybe the reverse happens. In-app purchases are also influenced by demographics and location.
All in all, the data collected from users helps the company make better decisions about various aspects of the game, i.e. it creates value for its consumers. One example of this is how the game provides feedback using the game statistics, timestamps and scores of a player. Another example is abnormal behaviour detection, or identifying cheating, although this is sometimes done manually. The game uses video clips of the gameplay, sounds, etc. to detect cheating and thereby delivers more value by providing a fair gaming environment.
Location-based data and device information such as version, battery level, WiFi strength, available space, network type, OS version, platform, carrier, country code, etc. help improve the feed as well as the game’s functionality. A detailed list of how PUBG collects and uses various types of user data can be found on their website.
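The interconnectedness between playing time and in-app purchases mentioned in the case study can be sketched with a simple correlation. This is only a toy illustration: the numbers below are made up, they are not real PUBG data, and nothing here reflects how the game itself analyses its users. The `pearson` helper is a plain implementation of the Pearson correlation coefficient, where values near 1 mean the two quantities rise together and values near -1 mean one falls as the other rises.

```python
from math import sqrt

# Made-up numbers for illustration only; not real player data.
hours_played = [2.0, 5.0, 1.0, 8.0, 4.0]
purchase_usd = [0.0, 10.0, 0.0, 25.0, 5.0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    print(f"correlation: {pearson(hours_played, purchase_usd):.2f}")
```

A strong correlation would only tell us that the two quantities move together. As the case study notes, it would not tell us whether better players spend more or whether spending makes players better.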
The above case study helps us understand the purpose of Big Data, how it is collected, and the way it is stored and used. Data helps us every day, often without our knowing it, making our lives easier and smoother. The piece you just read was the first blog of the ‘Data Science’ series. Stay tuned for more amazing content.
Until then, share with your inquisitive friends, family members and acquaintances.
Jan. 23, 2021, 6:56 p.m.