For budding professionals and newcomers alike who are thinking of diving into the booming world of data science, we have compiled a quick cheat sheet to brush you up on the basics and methodologies that underpin this field.

**Data Science – The Basics**

The data generated in our world exists in a raw form: numbers, codes, words, sentences, etc. Data science takes this raw data and processes it with scientific methods, transforming it into meaningful forms that yield knowledge and insights.

**Data**

Before we dive into the tenets of data science, let’s talk a bit about data, its types, and data processing.

**Types of Data**

**Structured** – Data that is stored in a tabulated format in databases. It can be either numeric or text

**Unstructured** – Data that cannot be tabulated with any definitive structure to speak of is called unstructured data

**Semi-structured** – Mixed data with traits of both structured and unstructured data

**Quantitative** – Data with definite numeric values that can be quantified

**Big Data** – Data stored in huge databases spanning multiple computers or server farms is called Big Data. Biometric data, social media data, etc. are considered Big Data. Big Data is characterised by the 4 V’s: Volume, Velocity, Variety, and Veracity

**Data Preprocessing**

**Data Classification** – The process of categorizing or labeling data into classes such as numerical, text, image, or video.

**Data Cleansing** – Weeding out missing, inconsistent, or incompatible data, or replacing it using one of the following methods:

- Interpolation
- Heuristic
- Random Assignment
- Nearest Neighbour

**Data Masking** – Hiding or masking confidential data to maintain the privacy of sensitive information while still being able to process it.

**What is Data Science Made of?**

**Concepts of Statistics**

**Regression**

**Linear Regression**

Linear Regression is used to establish a relationship between two variables such as supply and demand, price and consumption, etc. It models one variable Y as a linear function of another variable x as follows:

Y = f(x) or Y = mx + c, where m = slope (coefficient) and c = intercept
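As a minimal sketch, the slope m and intercept c can be fitted by ordinary least squares in plain Python (the data points below are made up purely for illustration):

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    c = mean_y - m * mean_x
    return m, c

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x
m, c = fit_line(xs, ys)
print(m, c)                        # slope close to 2, intercept close to 0
```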

**Logistic regression**

Logistic regression establishes a probabilistic relationship rather than a linear one between variables. The predicted label is either 0 or 1: we compute a probability p, and the resulting curve is S-shaped (a sigmoid).

If p < 0.5, the prediction is 0; otherwise it is 1

Formula:

**Y = e^ (b0 + b1x) / (1 + e^ (b0 +b1x))**

where b0 = bias and b1 = coefficient
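The formula above is algebraically the same as the sigmoid 1 / (1 + e^−(b0 + b1x)). A minimal sketch in plain Python, assuming coefficients b0 and b1 have already been fitted:

```python
import math

def predict_proba(x, b0, b1):
    """Sigmoid: probability that the outcome is class 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def predict(x, b0, b1):
    """Hard 0/1 label using the p = 0.5 cut-off."""
    return 1 if predict_proba(x, b0, b1) >= 0.5 else 0

print(predict_proba(0, 0, 1))        # 0.5, right on the decision boundary
print(predict(2, 0, 1), predict(-2, 0, 1))
```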

**Probability**

Probability helps to predict the likeliness of occurrence of an event. Some terminologies:

**Sample Space:** The set of all possible outcomes

**Event:** It is a subset of the sample space

**Random Variable:** Random variables map the likely outcomes in a sample space to numbers on a line

**Probability Distributions**

**Discrete Distributions:** Gives the probability at a set of discrete values (integers)

P[X=x] = p(x)

**Continuous Distributions:** Gives the probability over a number of continuous points or intervals instead of discrete values. Formula:

P[a ≤ X ≤ b] = ∫ₐᵇ f(x) dx, where a and b are the endpoints of the interval
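To make the integral concrete, here is a small sketch in plain Python that approximates P[a ≤ X ≤ b] with a midpoint Riemann sum; the density f(x) = 2x on [0, 1] is made up for illustration:

```python
def prob_interval(pdf, a, b, steps=100000):
    """Approximate P[a <= X <= b] = integral of f(x) over [a, b]."""
    width = (b - a) / steps
    # Midpoint Riemann sum over `steps` narrow slices
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

pdf = lambda x: 2 * x              # a valid density on [0, 1]
p = prob_interval(pdf, 0.5, 1.0)
print(round(p, 4))                 # close to 0.75
```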

**Correlation and Covariance**

**Standard Deviation:** The variation or deviation of a given dataset from its mean value

σ = √{ Σᵢ₌₁ᴺ (xᵢ – x̄)² / (N – 1) }
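The formula can be checked in a few lines of Python (the sample values below are arbitrary):

```python
import math

def sample_std(data):
    """Sample standard deviation with the (N - 1) denominator."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

print(sample_std([2, 4, 4, 4, 5, 5, 7, 9]))   # about 2.14
```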

**Covariance**

It defines the extent to which random variables X and Y deviate together from their respective means.

Cov(X, Y) = σXY = E[(X − μX)(Y − μY)] = E[XY] − μXμY

**Correlation**

Correlation defines the extent of a linear relationship between variables along with its direction, positive or negative.

ρXY = σXY / (σX * σY)
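Both quantities are easy to compute directly. The sketch below, in plain Python with toy data, uses the population form of the formulas above:

```python
import math

def covariance(xs, ys):
    """Population covariance: E[(X - mu_X)(Y - mu_Y)]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """rho_XY = sigma_XY / (sigma_X * sigma_Y)."""
    sx = math.sqrt(covariance(xs, xs))   # sigma_X
    sy = math.sqrt(covariance(ys, ys))   # sigma_Y
    return covariance(xs, ys) / (sx * sy)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]                        # perfectly linear in xs
print(round(correlation(xs, ys), 6))     # 1.0 for a perfect positive relationship
```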

**Artificial Intelligence**

The ability of machines to acquire knowledge and make decisions based on inputs is called Artificial Intelligence or simply AI.

**Types**

- Reactive Machines: Reactive machine AI works by learning to react to predefined scenarios by narrowing down to the fastest and best options. They lack memory and are best for tasks with a defined set of parameters. Highly reliable and consistent.
- Limited Memory: This AI has some real-world observational and legacy data fed to it. It can learn and make decisions based on the given data but cannot gain new experiences.
- Theory of Mind: It is an interactive AI that can make decisions based on the behaviour of the surrounding entities.
- Self Awareness: This AI is aware of its existence and functioning apart from the surroundings. It can develop cognitive abilities and understand and evaluate the impacts of its own actions on the surroundings.

**AI terms**

**Neural Networks**

Neural networks are networks of interconnected nodes that relay data and information through a system. NNs are modeled to mimic the neurons in our brains and can make decisions by learning and predicting.

**Heuristics**

Heuristics is the ability to predict quickly using approximations, estimates, and prior experience in situations where the available information is patchy. It’s fast but not always accurate or precise.

**Case-Based Reasoning**

The ability to learn from previous problem-solving cases and apply them in current situations to arrive at an acceptable solution

**Natural Language Processing**

It’s simply the ability of a machine to understand and interact directly in human speech or text. For example, voice commands in a car.

**Machine Learning**

Machine Learning is simply an application of AI using various models and algorithms to predict and solve problems.

**Types**

**Supervised**

This method relies on labeled input data associated with known output data. The machine is provided with a set of target variables Y and must arrive at them from a set of input variables X under the supervision of an optimization algorithm. Examples of supervised learning methods are neural networks, random forests, deep learning, support vector machines, etc.

**Unsupervised**

In this method, input variables have no labeling or association, and algorithms work to find patterns and clusters resulting in new knowledge and insights.

**Reinforcement**

Reinforcement learning focuses on iterative improvement to sharpen the learned behaviour. It is a reward-based method where the machine gradually improves its strategy to maximise a target reward.

**Modeling Methods**

**Regression**

Regression models always output numbers, obtained by interpolating or extrapolating over continuous data.

**Classification**

Classification models produce a class or label as output and are better at predicting discrete outcomes like ‘what kind’

Both regression and classification are supervised models.

**Clustering**

Clustering is an unsupervised model that identifies clusters based on traits, attributes, features, etc.

**ML Algorithms**

**Decision Trees**

Decision trees use a binary approach, arriving at a solution through successive questions at each stage such that each outcome is one of two possibilities, like ‘Yes’ or ‘No’. Decision trees are simple to implement and interpret.
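As an illustration, a decision tree can be written as nothing more than nested yes/no questions. The ‘play outside?’ tree below is invented for demonstration, with splits chosen by hand rather than learned from data:

```python
def play_outside(raining, temperature_c):
    """A hand-built two-level decision tree: each node asks a yes/no question."""
    if raining:                 # first split
        return "No"
    if temperature_c < 10:      # second split, on the dry branch
        return "No"
    return "Yes"                # leaf: dry and warm enough

print(play_outside(False, 22))  # Yes
print(play_outside(True, 22))   # No
```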

**Random Forest or Bagging**

Random Forest is an ensemble built on decision trees. It uses a large number of decision trees, which makes the structure dense and complex like a forest. It aggregates the outcomes of many trees (hence ‘bagging’), leading to more accurate results and better performance.

**K-Nearest Neighbours (KNN)**

KNN uses the proximity of existing data points on a plot relative to a new data point to predict which category it falls in. The new data point is assigned to the category with the highest number of neighbours among the k nearest.

k = number of nearest neighbours
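A minimal KNN sketch in plain Python; the points and labels below are made up for illustration:

```python
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Label `query` by majority vote among its k nearest points."""
    # Sort point indices by squared Euclidean distance to the query
    nearest = sorted(
        range(len(points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], query)),
    )
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))  # A
print(knn_predict(points, labels, (8, 7)))  # B
```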

**Naïve Bayes**

Naïve Bayes rests on two pillars: first, that the features of the data points are independent and unrelated to each other, i.e., unique; and second, Bayes’ theorem, which predicts outcomes based on a condition or hypothesis.

Bayes Theorem:

P(X|Y) = {P(Y|X) * P(X)} / P(Y)

Where P(X|Y) = Conditional probability of X given occurrence of Y

P(Y|X) = Conditional probability of Y given occurrence of X

P(X), P(Y) = Probability of X and Y individually
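Plugging hypothetical numbers into Bayes’ theorem makes it concrete; the spam-filter figures below are invented for illustration, not real statistics:

```python
def bayes(p_y_given_x, p_x, p_y):
    """P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y

# Made-up numbers: 1% of mail is spam (X), 90% of spam contains
# the word "offer" (Y given X), and 5% of all mail contains it (Y).
p_spam_given_offer = bayes(0.9, 0.01, 0.05)
print(round(p_spam_given_offer, 2))  # 0.18
```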

**Support Vector Machines**

This algorithm tries to segregate data in space using boundaries, which can be a line or a plane. The boundary is called a ‘hyperplane’ and is defined by the nearest data points of each class, which in turn are called ‘support vectors’. The distance between the support vectors on either side of the hyperplane is called the margin, and SVM seeks to maximise it.

**Neural Networks**

**Perceptron**

The perceptron is the fundamental neural network unit: it takes weighted inputs and produces an output based on a threshold value.
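A perceptron fits in a few lines of Python. The sketch below wires one up as an AND gate, with the weights and threshold chosen by hand rather than learned:

```python
def perceptron(inputs, weights, threshold):
    """Fire (1) if the weighted sum of inputs reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# An AND gate: only fires when both inputs are on.
weights, threshold = [1.0, 1.0], 1.5
print(perceptron([1, 1], weights, threshold))  # 1
print(perceptron([1, 0], weights, threshold))  # 0
```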

**Feed Forward Neural Network**

The FFN is the simplest network; it transmits data in only one direction and may or may not have hidden layers.

**Convolutional Neural Networks**

A CNN uses convolution layers to process parts of the input data in batches, followed by pooling layers that condense the result before producing the output.

**Recurrent Neural Networks**

An RNN consists of recurrent layers between the I/O layers that can store ‘historic’ data. Outputs loop back into the recurrent layers, so past information is fed forward to improve predictions.

**Deep Neural Networks and Deep Learning**

DNN is a network with multiple hidden layers between I/O layers. The hidden layers apply successive transformations to the data before sending it to the output layer.

**‘Deep Learning’** is facilitated through DNNs; thanks to their multiple hidden layers, they can handle huge amounts of complex data and achieve high accuracy.

**Conclusion**

Data science is a vast field that runs through many streams but comes across as both a revolution and a revelation. It is booming and will change how our systems work and feel in the future.

If you are curious to learn about data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.