Data Science Interview Questions and Answers - Part 3

Data science interviews can be challenging, requiring a blend of technical expertise, problem-solving skills, and business acumen. Whether you’re a beginner or an experienced data scientist, preparing for common interview questions is essential to showcase your ability to analyze data, build models, and derive meaningful insights.

This page is designed to help you navigate data science interviews by covering key topics such as statistics, machine learning, Python, SQL, and data analysis. Expect questions on probability, hypothesis testing, feature engineering, model evaluation, and deep learning concepts.

Employers often look for candidates who can apply theoretical knowledge to real-world scenarios. By exploring this guide, you’ll gain insights into common interview questions and sample answers to tackle tricky questions. With thorough preparation, you can confidently approach your data science interview and increase your chances of landing your dream role.

Why is R used in data visualization?

Answer:

R is widely used for data visualization for the following reasons:

  • We can create almost any type of graph using R.
  • R has many visualization libraries, such as lattice, ggplot2, and leaflet, along with a rich set of built-in plotting functions.
  • Graphics are often easier to customize in R than in Python.
  • R is also used for exploratory data analysis and feature engineering.

What are the types of bias that can occur during sampling?

Answer:

The following are three types of bias that can occur during sampling:

  1. Selection bias
  2. Undercoverage bias
  3. Survivorship bias

What is survivorship bias?

Answer:

Survivorship bias is the logical error of focusing on the aspects that survived some selection process and overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways. A classic example is studying only the bullet holes on aircraft that returned from combat while ignoring the planes that never made it back.

When is resampling done?

Answer:

Resampling is done in any of these cases:

  • Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points
  • Substituting labels on data points when performing significance tests
  • Validating models by using random subsets (bootstrapping, cross-validation)
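
As a minimal sketch of the first case (assuming NumPy and a toy sample), the bootstrap below estimates the standard error of the sample mean by drawing repeatedly with replacement:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=200)   # toy sample

# Bootstrap: draw with replacement, recompute the statistic each time
boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
]

print("Sample mean:", data.mean())
print("Bootstrap standard error of the mean:", np.std(boot_means))
```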

What is dimensionality reduction and what are its benefits?

Answer:

Dimensionality reduction is the process of reducing the number of features in a dataset in order to avoid overfitting and reduce variance. It has four main advantages (see the PCA sketch after the list):

  • This reduces the storage space and time for model execution.
  • Removes the issue of multi-collinearity thereby improving the parameter interpretation of the ML model.
  • Makes it easier for visualizing data when the dimensions are reduced.
  • Avoids the curse of dimensionality.
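
As an illustration, here is a short sketch (assuming scikit-learn and its bundled digits dataset) in which PCA compresses 64 pixel features into 10 components while retaining most of the variance:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 64 features per image

pca = PCA(n_components=10)                   # keep 10 principal components
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)            # (1797, 64)
print("Reduced shape:", X_reduced.shape)     # (1797, 10)
print("Variance retained:", pca.explained_variance_ratio_.sum())
```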

Why is a random forest better than multiple decision trees?

Answer:

A random forest is better than individual decision trees because it is more robust, more accurate, and less prone to overfitting: as an ensemble method, it trains many weak decision trees on random subsets of the data and features and combines their predictions into one strong learner.
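
A quick comparison sketch (assuming scikit-learn and its breast-cancer dataset) that contrasts a single decision tree with a random forest:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Cross-validated accuracy: the ensemble is typically higher and more stable
print("Single tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("Random forest:", cross_val_score(forest, X, y, cv=5).mean())
```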

What is the difference between a test set and a validation set?

Answer:

The test set is used to test or evaluate the performance of the trained model. It evaluates the predictive power of the model.

The validation set is a portion of the training data that is held out to tune hyperparameters and detect overfitting before the model is evaluated on the test set.
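
A minimal sketch (assuming scikit-learn, with arbitrary toy data and split sizes) of carving a dataset into training, validation, and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First carve out the test set, then split the remainder into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 / 200 / 200
```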

How are missing values handled during data analysis?

Answer:

The impact of missing values can be assessed after identifying which kinds of variables contain them.

  • If the data analyst finds a pattern in the missing values, there is a chance of uncovering meaningful insights.
  • If no pattern is found, the missing values can either be ignored or replaced with default values such as the mean, minimum, maximum, or median.
  • If the missing values belong to categorical variables, they are assigned a default value. If the data has a normal distribution, the mean is assigned to the missing values.
  • If around 80% of the values are missing, it is up to the analyst to either replace them with default values or drop the variable.
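
A short pandas sketch (toy DataFrame with hypothetical column names) illustrating mean imputation for a numeric column and mode imputation for a categorical one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 35, np.nan],
    "city": ["Pune", "Delhi", None, "Pune", "Pune"],
})

# Numeric column: fill with the mean (or the median for skewed data)
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: fill with a default value (here, the most frequent category)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```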

What is ensemble learning?

Answer:

Ensemble learning is the technique of combining a diverse set of learners (individual models) to improve the stability and predictive power of the overall model.
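
A minimal sketch (assuming scikit-learn and its iris dataset) that combines three diverse learners with a voting ensemble:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each base learner votes; the ensemble prediction is the majority class
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
])

print("Ensemble accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```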

What is a cost function?

Answer:

Also referred to as “loss” or “error,” the cost function is a measure of how well your model performs. It is used to compute the error of the output layer during backpropagation; that error is then pushed backwards through the neural network and used to update the weights during training.
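
As a small illustration (plain NumPy, not tied to any particular framework), mean squared error is one common cost function:

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.5, 2.0])
print(mse_cost(y_true, y_pred))   # 0.1666...
```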

What do the terms epoch, batch, and iteration mean in deep learning?

Answer:

  • Epoch – represents one pass over the entire training dataset (everything put into the training model).
  • Batch – when we cannot pass the entire dataset into the neural network at once, we divide it into several batches.
  • Iteration – one update step on a single batch. If we have 10,000 images as data and a batch size of 200, then an epoch runs 50 iterations (10,000 divided by 200).
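
A toy sketch of how the three terms relate (the training step itself is left as a hypothetical placeholder):

```python
n_samples = 10_000
batch_size = 200
n_epochs = 3

iterations_per_epoch = n_samples // batch_size   # 50
print("Iterations per epoch:", iterations_per_epoch)

for epoch in range(n_epochs):                 # one epoch = one full pass over the data
    for i in range(iterations_per_epoch):     # one iteration = one parameter update on one batch
        start, end = i * batch_size, (i + 1) * batch_size
        # train_step(data[start:end])         # hypothetical training step on this batch
        pass
```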

What are the different layers in a CNN?

Answer:

There are four types of layers in a CNN:

  • Convolutional Layer – the layer that performs a convolutional operation, creating several smaller picture windows to go over the data.
  • ReLU Layer – it brings non-linearity to the network and converts all the negative pixels to zero. The output is a rectified feature map.
  • Pooling Layer – pooling is a down-sampling operation that reduces the dimensionality of the feature map.
  • Fully Connected Layer – this layer recognizes and classifies the objects in the image.
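
A minimal sketch (assuming PyTorch and 28x28 grayscale inputs with 10 output classes) showing the four layer types in order:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3),   # convolutional layer: 28x28 -> 26x26 feature maps
    nn.ReLU(),                         # ReLU layer: negative activations become zero
    nn.MaxPool2d(2),                   # pooling layer: down-samples 26x26 -> 13x13
    nn.Flatten(),
    nn.Linear(32 * 13 * 13, 10),       # fully connected layer: classifies into 10 classes
)

x = torch.randn(8, 1, 28, 28)          # a batch of 8 dummy images
print(model(x).shape)                  # torch.Size([8, 10])
```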

What is an LSTM network and how does it work?

Answer:

Long Short-Term Memory (LSTM) is a special kind of recurrent neural network capable of learning long-term dependencies; remembering information for long periods is its default behaviour. There are three steps in an LSTM network:

Step 1: The network decides what to forget and what to remember.

Step 2: It selectively updates cell state values.

Step 3: The network decides what part of the current state makes it to the output.
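
A brief PyTorch sketch of an LSTM layer processing a toy sequence (shapes only, no training; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(4, 10, 16)          # batch of 4 sequences, 10 time steps, 16 features
output, (h_n, c_n) = lstm(x)        # the cell state c_n carries the long-term memory

print(output.shape)                 # torch.Size([4, 10, 32]) - output at every time step
print(h_n.shape, c_n.shape)         # torch.Size([1, 4, 32]) each - final hidden/cell state
```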

What are the variants of backpropagation?

Answer:

Following are the variants of Back Propagation:

  • Stochastic Gradient Descent: the gradient is calculated and the parameters are updated using a single training example at a time.
  • Batch Gradient Descent: the gradient is calculated for the whole dataset and the update is performed at each iteration.
  • Mini-batch Gradient Descent: one of the most popular optimization algorithms. It is a variant of stochastic gradient descent in which a mini-batch of samples is used instead of a single training example.
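
A plain-NumPy sketch of mini-batch gradient descent on a synthetic linear-regression problem; setting the batch size to 1 gives stochastic gradient descent, and setting it to the full dataset gives batch gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))                  # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # gradient of MSE on the mini-batch
        w -= lr * grad                             # one parameter update per iteration

print(w)   # close to [2.0, -1.0, 0.5]
```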

What is a ROC curve?

Answer:

ROC curve stands for Receiver Operating Characteristic curve, which graphically represents the performance of a binary classifier at all classification thresholds. The curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) at different threshold points.
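
A short scikit-learn sketch (breast-cancer dataset, logistic regression) that computes the (FPR, TPR) pairs and the area under the ROC curve:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))
```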

What is the curse of dimensionality?

Answer:

While analyzing a dataset, there are instances where the number of variables or columns is excessive, yet we are required to extract only the significant ones. For example, a dataset may contain a thousand features when we only need a handful of significant ones. This problem of having numerous features where only a few are needed is called the ‘curse of dimensionality’.

There are various algorithms for dimensionality reduction like PCA (Principal Component Analysis).

Why is the pickle module used in Python?

Answer:

For serializing and de-serializing an object in Python, we make use of the pickle module. To save an object to disk, we use pickle, which converts the object structure into a byte stream.
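
A minimal example (the file name is just an illustration) of pickling an object to disk and loading it back:

```python
import pickle

model_params = {"n_estimators": 200, "max_depth": 5}   # any picklable object

# Serialize: convert the object into a byte stream and write it to disk
with open("model_params.pkl", "wb") as f:
    pickle.dump(model_params, f)

# De-serialize: read the byte stream back into an equivalent object
with open("model_params.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == model_params)   # True
```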

What are support vectors in an SVM?

Answer:

Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation. Using these support vectors, we maximise the margin of the classifier; deleting a support vector would change the position of the hyperplane. These are the points that help us build our SVM.
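
A small scikit-learn sketch (iris dataset, linear kernel); the fitted SVC exposes its support vectors directly:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC(kernel="linear").fit(X, y)

print("Support vectors per class:", clf.n_support_)
print("Support vectors shape:", clf.support_vectors_.shape)
```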

What is univariate analysis?

Answer:

An analysis that is applied to one attribute at a time is known as univariate analysis. The boxplot is a widely used univariate visualization.
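
A one-variable example (assuming pandas and matplotlib, with a toy column containing one outlier):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"salary": [42, 45, 47, 50, 52, 55, 58, 95]})  # toy data, one outlier

df["salary"].plot(kind="box")     # univariate analysis: one attribute at a time
plt.title("Boxplot of salary")
plt.show()
```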

What is multicollinearity and how can you overcome it?

Answer:

In a multiple regression model, a single dependent variable depends on several independent variables. When these independent variables are found to be highly correlated with each other, the model is said to exhibit multicollinearity.

One can overcome multicollinearity in their model by removing a few highly correlated variables from the regression equation.
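
A simple sketch (synthetic data, pandas only) that flags highly correlated predictor pairs so that one variable from each pair can be dropped:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),   # nearly a copy of x1
    "x3": rng.normal(size=200),
})

corr = df.corr().abs()
print(corr)

# Flag predictor pairs whose absolute correlation exceeds a threshold (e.g. 0.8);
# dropping one variable from each flagged pair reduces multicollinearity
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.8]
print("Highly correlated pairs:", high)
```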