

Data: It is getting bigger!

Last updated on Jan. 27, 2021, 12:26 p.m. by mayank

Dive into the world of data science with machine learning and artificial intelligence. The first blog of the Data Science series

All about the Oil of the 21st Century!

All the tech folks out there, as well as people remotely associated with technology, must have heard that data is to the 21st century what oil was to the 20th. The 21st century is flooded with data from all over the Earth. To make the most of the large amount of data generated by literally everything on this planet, we need people specialized in data, technology and statistics. Data Scientist, often called one of the hottest jobs of the 21st century, is still attractive to new graduates as well as undergraduates, especially those from an engineering background. Big Tech companies such as Facebook, Google and Apple are frequently in the news for data breaches, data leaks and the misuse of consumer data. But that doesn’t mean they are always doing something fishy with the data collected from their consumers. They also use it for social good, research and marketing, which can ultimately benefit the consumers themselves.

Now, being aware of the status quo of Big Data, one would naturally be curious to know what data is being collected and how exactly it is being used to our advantage. So let’s dive in to understand how Big Data works. Let us start from the basics and first look at the definition of data.

Oxford Dictionary defines data as - 

“a fact or piece of information”

So basically anything which tells us something is data! But Big Tech can’t always make use of this data as it is, in its raw form. They need to process the data and make something meaningful out of it before it can be used to their advantage. A lot of data is available out there, some in a structured and some in an unstructured format. Structured data can be used as it is in most cases. The main problem lies with unstructured data. Petabytes of unstructured data are generated every minute, and large companies and organizations are constantly trying to make the best use of it.
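To make the difference between structured and unstructured data concrete, here is a minimal Python sketch. The record, the sentence and the `extract_record` helper are all made up for illustration; real unstructured data is far messier and usually needs much more than one regular expression.

```python
import re

# Structured: fields are named and typed, so the record can be used as it is.
structured_record = {"user_id": 101, "country": "IN", "purchase_amount": 4.99}

# Unstructured: the same facts buried in free text (hypothetical sentence).
unstructured_text = "User 101 from IN just bought an item for 4.99 dollars."


def extract_record(text):
    """Pull the three fields out of free text; return None if nothing matches."""
    pattern = r"User (\d+) from ([A-Z]{2}) .* for (\d+\.\d+) dollars"
    match = re.search(pattern, text)
    if match is None:
        return None
    return {
        "user_id": int(match.group(1)),
        "country": match.group(2),
        "purchase_amount": float(match.group(3)),
    }


extracted = extract_record(unstructured_text)
print(extracted)
print(extracted == structured_record)
```

The structured record needed no work at all, while the free text needed a hand-written pattern that breaks as soon as the wording changes. That extra processing step is why unstructured data is the harder problem.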


But first let us understand the origin of such a large amount of data: where does Big Data come from? Short answer: Big Data is generated from three sources, namely machine data, organizational data and the data generated by people. Long answer: well, the next section is devoted to that.

First on the list of data generation sources is machine data: the data generated by machines all over the world. Machines range from the small sensors embedded in your Fitbit to the Large Hadron Collider (LHC) at CERN. Let us look at how machine-generated data is collected and used. Fitbits, which many people around the world use, record basic health information such as heart rate, sleep activity, etc. This data can be used for medical diagnosis and can thereby help detect early symptoms of some chronic illnesses. One of the best examples of this is how Google leverages AI and data to detect medical emergencies well ahead of time! Similarly, self-driving cars are being developed using data from the sensors and IoT devices present in certain vehicles. Motion detectors, cameras and other sensors first collect enormous amounts of data on how car owners drive, and specialized AI and ML algorithms then use this data to advance autonomous driving. Apple is also reported to be using cutting-edge technology to develop a fully autonomous vehicle.


The second source of Big Data is organizational data. This is the traditional method of collecting data. Large amounts of data are collected by organizations as part of their daily routine as well as for future research purposes. This data is highly structured and well stored. Good examples are the employee and customer records kept by any large organization or company such as Microsoft, or the customer records, employee records and other internal data stored by banks.

The third category is the data collected from people. Every day, people all around the globe knowingly or unknowingly contribute to the vast scale of data while going about their daily chores, which include simple tasks such as performing a Google search, visiting a website or travelling by public transport. Every like we give on Facebook and each Google search we perform contributes to the user databases of these companies. They then use this user-generated data to target ads at specific people and thereby show them relevant products and services. Personalized marketing adds overall value to the economy, which makes this user data valuable to both the advertiser (the one whose ads are shown) and the ad network (the facilitator, like Facebook or Google). This is how user-generated data helps Big Tech make billions of dollars in revenue while delivering more relevant ads to consumers.

These three sources generate seemingly unlimited volumes of data. To put this in perspective, an often-quoted estimate is that 90% of the data that has ever existed was generated in the last two years. Now that we know how this data is generated, let us look at the features of data generated by various sources and their importance in modelling and computation. In common internet terminology there are mainly 5 Vs of Big Data (well, I have also got a 6th one for you).

  1. Volume: Starting with the simplest one, volume. The enormous amount of data generated is beneficial mainly because of its sheer volume: a huge volume of data confirms patterns, establishes what normal behaviour looks like and tells us about outliers. As stated previously, this volume is growing exponentially day by day. While increasing volume helps us analyse data more reliably and thereby make better decisions, it also brings issues such as storage and accessibility.
  2. Variety: Variety indicates the various forms in which data is available, like text, images, geospatial data, videos, audio, etc. Since we can’t handle all the data types with one model, we need data pre-processing and manipulation to bring the different forms of data to a common ground, then feed them into our model and derive valuable insights from it.
  3. Velocity: In this fast-moving world where everything is accessible within a click, the velocity with which data moves can feel comparable to the speed of light (not literally!). Velocity is the rate at which data is generated. This velocity is very high for large data collection systems like Google, Facebook, etc., since every user activity gets recorded and hundreds of millions of users use such applications tens of times a day.
  4. Veracity: Data veracity refers to data quality. The quality of the data generated is very important, as better quality ensures more effective understanding and better analytics. Quality must be checked in the incoming data because not all the data collected is valid. One major source of worthless data is fake IDs: many people create fake IDs and give fake information to websites to gain particular access. This creates tons of data which is practically useless for the majority of purposes. Hence one needs to make sure the quality and correctness of the data is appropriate for use.
  5. Valence: Valence (or valency) is a crucial concept in both chemistry and Data Science. We know valency in chemistry helps us determine the chemical bonding of atoms. Similarly, valence in data science helps us determine the connectedness of data. The data generated should show inter-connectivity between its data points, and related data points should share features. This inter-connectedness helps us connect the dots and derive useful results from data, which helps in decision making.
  6. Central to them all, Value: The five Vs mentioned above all feed into the Value created through our product. Here product refers to the model developed and the analytics produced from Big Data. The value created through the project determines its success and its applied usage in the industry.
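The Volume point above says that plenty of data shows us what normal behaviour looks like and which points are outliers. Here is a minimal Python sketch of that idea. The session lengths are made-up numbers, and the z-score rule with a threshold of 2.0 is just one simple choice assumed for illustration, not a method prescribed by this blog.

```python
from statistics import mean, pstdev

# Hypothetical daily session lengths in minutes (made-up numbers).
session_minutes = [50, 52, 49, 51, 48, 50, 53, 47, 500]


def find_outliers(values, threshold=2.0):
    """Return the values that lie more than `threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - centre) / spread > threshold]


print(find_outliers(session_minutes))
```

With only nine points, the single 500-minute session drags the mean and the standard deviation a long way, which is exactly why more volume helps: the more normal sessions we have, the more stable the baseline and the clearer the outliers.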


Having seen the various aspects of Big Data and its use cases, let us now look at a practical example to gain a better perspective on Data Science. Let us take the case study of PUBG: PlayerUnknown’s Battlegrounds.


PUBG - case study

PUBG, a very popular multiplayer game, is played by millions of gamers across the world. (Fun fact: all the people involved with this site are also good PUBG players.) Naturally, due to its large player base, PUBG has a huge volume of data arriving at its doorstep every second. This includes player_id, device information, profile picture, the time for which the app was used, in-app purchases, region and location, friends list, other gaming activity details, and much more. As these are all different from each other, they may come in the form of audio, strings, images, etc., i.e. a huge variety of data. Due to the large amount of time users spend on the game and the large number of users, this data can be said to have a high velocity. We can also say that the veracity, i.e. the quality, of the data is only moderately good, as there are a lot of bots and other malpractices that players use in order to win the game. One important feature in this case is the valence, or interconnectedness, of the data. Pieces of user data are generally interconnected, e.g. the in-app purchases a user has made are connected with the amount of time the user spends on the game as well as with the rating of the player: a better player might spend more on in-app purchases in order to upgrade, or maybe the reverse might happen. In-app purchases are also influenced by demographics and location. All in all, the data collected from users helps the company make better decisions about various aspects of the game, i.e. it creates value for its consumers. One example of this is how the game provides feedback using the game statistics, timestamps and scores of a player. Another example is abnormal behaviour detection, or identifying cheating, although this is sometimes done manually. The game uses video clips of the gameplay, sounds, etc. to detect cheating and thereby delivers more value by providing a fair gaming environment.
Location-based data and device information such as version, battery level, WiFi strength, available space, network type, OS version, platform, carrier, country code, etc. help improve the feed as well as game functionality. A detailed list of how PUBG collects and uses various types of user data can be found on their website.
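The valence idea from the case study, that time spent in the game and in-app purchases are connected, can be checked with a simple correlation. Below is a minimal Python sketch. The player records are entirely made up (they are not real PUBG data), the field names `hours_played` and `purchases_usd` are hypothetical, and Pearson correlation is just one assumed way to measure connectedness.

```python
from math import sqrt

# Hypothetical per-player records (made-up numbers, not real PUBG data).
players = [
    {"player_id": "p1", "hours_played": 5, "purchases_usd": 0.0},
    {"player_id": "p2", "hours_played": 20, "purchases_usd": 4.0},
    {"player_id": "p3", "hours_played": 40, "purchases_usd": 9.0},
    {"player_id": "p4", "hours_played": 60, "purchases_usd": 15.0},
    {"player_id": "p5", "hours_played": 80, "purchases_usd": 22.0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


hours = [p["hours_played"] for p in players]
spend = [p["purchases_usd"] for p in players]
r = pearson(hours, spend)
print(f"correlation between hours played and spend: {r:.2f}")
```

A value close to 1 means the two fields rise together in this toy data set. It does not say which one drives the other, which matches the point above that the relationship might run either way.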


The above case study helps us understand the purpose of Big Data, the process of collecting it, and the way it is stored and used. Data helps us every day, perhaps often without our knowing, making our lives easier and smoother. The piece you just read was the first blog of the ‘Data Science’ series. Stay tuned for more amazing content.


Until then, share with your inquisitive friends, family members and acquaintances. 



by mayank
KJ Somaiya College of Engineering Mumbai

Hi, I am a TY student