
Data Science Interview Questions and Answers

Data Science Interview Questions

This set of questions covers a range of topics that are commonly asked in data science interviews. They are meant to give you an idea of the types of questions that may be asked, as well as to help you prepare for data science interviews.

1. What Do You Essentially Mean By Data Science?

Data science is a field of study that combines mathematics, statistics, computer science, and business to extract insights from data. Data scientists use their skills to solve problems and make decisions by analyzing data. They work with large data sets to find trends and patterns, and they use this information to make predictions about the future.

An example of data science in action is the work that was done by a team of researchers at Stanford University to improve the accuracy of predictions made by a popular machine learning algorithm. The team was able to improve the accuracy of predictions made by the algorithm by more than 15% by incorporating new data into their analysis.

2. What Do You Understand By Selection Bias And Mention Its Types?

Selection bias occurs when the way participants or data are chosen for a study affects the results, so that the sample no longer represents the population of interest. Common types include:

  • Sampling bias: Some members of the population are more likely to be included than others, for example through self-selection or convenience sampling.
  • Time interval bias: A study is stopped early, for example when the results happen to support the desired conclusion.
  • Data bias: Specific subsets of data are chosen or rejected to support a conclusion, for example through selective reporting.
  • Attrition bias: Participants who drop out of a study differ systematically from those who remain.

3. Differences Between Supervised And Unsupervised Learning

Supervised learning is a type of machine learning where each training example comes with a known answer, or label. The model learns the mapping from inputs to labels and can then make predictions on new, unlabeled data. Typical tasks are classification and regression.

Unsupervised learning works with data that has no labels. The model has to learn the structure or patterns in the data itself, for example by grouping similar examples (clustering) or by reducing the number of dimensions.

4. What Is The Primary Goal Of A/B Testing In Data Science?

A/B Testing is a technique used in data science to compare the performance of two versions of a process or system. The goal of A/B Testing is to identify which version performs better so that the better version can be implemented.

A/B testing is a scientific way of comparing two versions of a web page or app to see which one performs better. The primary goal of A/B testing is to improve the user experience and increase conversions.
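Deciding which version "performs better" usually involves a statistical test. The sketch below uses a two-proportion z-test, which is one common choice and not the only one. The visitor and conversion counts are hypothetical, and the function name is made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: A converts 200 of 5000 visitors, B converts 260 of 5000
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference between the versions is unlikely to be noise alone. The normal approximation assumes reasonably large samples.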

5. How Do You Build A Random Forest Model?

A random forest is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the training data (rows drawn at random with replacement), and at each split the tree considers only a random subset of the input variables, or “features.” The final prediction combines the trees: a majority vote for classification, or an average for regression.

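The code below follows the usage pattern shown in the scikit-learn documentation on random forests, applied to a toy data set:

    from sklearn.ensemble import RandomForestClassifier

    # Two toy samples with three features each, and one label per sample
    X = [[1, 2, 3], [4, 5, 6]]
    y = [1, 2]

    clf = RandomForestClassifier(n_estimators=100, max_depth=3, min_samples_split=2)
    clf.fit(X, y)  # train the forest before predicting
    print(clf.predict([[7, 8, 9]]))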

6. What Are Dimensionality Reduction And Its Benefits?

Dimensionality reduction is the process of reducing the number of dimensions in a dataset. This can be done for several reasons: to make the data easier to work with, to improve performance when using machine learning algorithms, or to reduce the size of the dataset. There are a number of different dimensionality reduction techniques, each with its own benefits and drawbacks. Some common dimensionality reduction techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Singular Value Decomposition (SVD).
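To make the idea concrete, here is a minimal sketch of PCA that reduces two-dimensional points to one dimension. It uses the closed-form eigenvalue of a symmetric 2x2 matrix, so it only works for two features; a real project would normally use a library. The data points are hypothetical.

```python
import math

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]  # hypothetical 2-D data

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 covariance matrix [[a, b], [b, c]]
a = sum(x * x for x, _ in centered) / n
b = sum(x * y for x, y in centered) / n
c = sum(y * y for _, y in centered) / n

# Largest eigenvalue and its eigenvector (closed form; assumes b != 0)
largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
direction = (b, largest - a)
length = math.hypot(*direction)
direction = (direction[0] / length, direction[1] / length)

# Project each 2-D point onto the first principal component: one number per point
projected = [x * direction[0] + y * direction[1] for x, y in centered]
explained = largest / (a + c)  # share of total variance kept

print(projected)
print(f"variance explained: {explained:.3f}")
```

Because the points lie close to a line, one component keeps almost all of the variance.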

7. How Can You Avoid Overfitting Your Model?

There are a few ways you can avoid overfitting your model:

  • Use more data: With more training examples, the model is less likely to memorize noise.
  • Use a simpler model: A model with fewer parameters or fewer features has less capacity to fit noise in the training data.
  • Use cross-validation: Evaluating on held-out folds shows how well the model generalizes, which makes overfitting visible.
  • Regularize your model: Techniques such as L1 (lasso) or L2 (ridge) penalties discourage overly complex fits.
  • Tune your hyperparameters: Settings such as tree depth or regularization strength should be chosen on validation data, not on the training data.

8. What Are The Feature Selection Methods Used To Select The Right Variables?

The feature selection methods used to select the right variables are:

Correlation: This measures the linear relationship between two variables. The closer the absolute value of the correlation is to 1, the stronger the linear relationship. Features that correlate strongly with the target are candidates to keep, and features that correlate strongly with each other may be redundant.

Coefficient of Determination: This measures how much of the variance in one variable is explained by another variable or by a model. The closer the value is to 1, the more of the variance is explained.

Principal Component Analysis: Strictly speaking, this is a feature extraction technique rather than a selection method. It creates new variables, called principal components, that are linear combinations of the original variables and capture as much of the variation in the data as possible.

The most popular feature selection methods are:

  • Fisher’s Score
  • Correlation coefficient
  • Dispersion ratio
  • Chi-squared statistic
  • Akaike information criterion (AIC)
  • Bayesian information criterion (BIC)
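As an illustration of the correlation coefficient method, the sketch below ranks features by the absolute value of their Pearson correlation with the target. The feature names and values are hypothetical, and this filter only detects linear relationships.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical features and target (exam score)
features = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [9, 7, 10, 8, 9, 8],
}
target = [52, 55, 61, 64, 70, 74]

# Strongest linear relationship with the target first
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], target)),
    reverse=True,
)
print(ranked)
```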

9. How Should You Maintain A Deployed Model?

There are a few key things to remember when maintaining a deployed model:

  • Monitor: Track the model’s performance in production so that a drop in accuracy is noticed quickly.
  • Evaluate: Compare the current model’s metrics on recent data to decide whether a new model is needed.
  • Test changes in development before redeploying, to ensure that the updated model works as expected.
  • If there are any issues with the deployed model, troubleshoot and fix them as soon as possible, and keep the previous version available so that you can roll back.

10. What Is The Significance Of P-Value?

The p-value is the probability of observing results at least as extreme as the ones actually observed, assuming that the null hypothesis is true. It is not the probability that the results are due to chance, and it is not the probability that the null hypothesis is true.

A low p-value means the observed data would be unlikely if the null hypothesis held, which counts as evidence against the null hypothesis. By convention, a p-value of 0.05 or less is often treated as statistically significant, although the threshold should be chosen before the analysis.
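A small worked example can make the definition concrete. Suppose the null hypothesis is that a coin is fair, and we observe 9 heads in 10 flips. The sketch below computes the exact one-sided p-value, that is, the probability of seeing 9 or more heads if the coin really is fair. The scenario and function name are made up for illustration.

```python
from math import comb

def one_sided_p_value(heads, flips):
    """Probability of at least `heads` heads in `flips` tosses of a fair coin."""
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips

p = one_sided_p_value(9, 10)
print(round(p, 4))  # 0.0107
```

Here 11 of the 1024 equally likely outcomes have 9 or more heads, so the p-value is about 0.011, which is below the usual 0.05 threshold.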

11. Write A Basic Sql Query That Lists All Orders With Customer Information

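SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id;

To return only selected columns, name them explicitly. The customer_name column here is an assumption about the schema:

SELECT orders.order_id, orders.shipping_address, customers.customer_id, customers.customer_name
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id;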

12. You Are Given A Dataset on Cancer Detection. You Have Built A Classification Model And Achieved An Accuracy of 96%. Why Shouldn’t You Be Happy With Your Model Performance? What Can You Do About It?

Accuracy alone is misleading here because cancer data sets are usually imbalanced. If only 4% of patients have cancer, a model that always predicts “no cancer” would also reach 96% accuracy while missing every real case. In cancer detection, a missed case (false negative) is far more costly than a false alarm.

There are a few things you can do about it:

  • Look at the confusion matrix and use metrics that focus on the rare class, such as recall (sensitivity), precision, F1 score, and ROC-AUC.
  • Address the imbalance, for example by oversampling the minority class, undersampling the majority class, or using class weights.
  • Adjust the decision threshold to trade false positives for fewer false negatives.
  • Check your data for accuracy and completeness, and compare the model’s predictions to a gold standard or another independently developed model.
  • Try a different machine learning algorithm or collect more data on the rare class to see if you can get better results.
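One reason to be cautious about a 96% accuracy figure is class imbalance. The sketch below uses hypothetical labels for 100 patients, of whom 4 have cancer, and a "model" that always predicts the majority class. It still reaches 96% accuracy, but its recall on the cancer class is zero.

```python
# 1 = cancer, 0 = no cancer; 4 positives among 100 hypothetical patients
y_true = [1] * 4 + [0] * 96
# A trivial "model" that always predicts the majority class
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```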

13. What Are The Feature Vectors? What Are The Steps In Making A Decision Tree?

Feature vectors are arrays of numerical values that represent the features of a particular object or event. They can be used to train machine learning algorithms to make decisions based on patterns in the data. Decision trees are a type of machine learning algorithm that can be used to predict the outcome of events based on past data.

The steps involved in making a decision tree are:

  1. Take the entire data set as input.
  2. Calculate the impurity of the target variable (for example, entropy or Gini impurity) and, for each candidate feature, the information gain of splitting on it.
  3. Split the data on the feature with the highest information gain; this becomes the root node.
  4. Repeat steps 2 and 3 on each branch until the nodes are pure or a stopping rule (such as a maximum depth) is reached.
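The core operation when growing a decision tree is choosing the split. The sketch below shows one common way to do that: try each feature and threshold, and keep the split with the lowest weighted Gini impurity. The rows (age, owns_car) and labels are hypothetical, and a full tree would apply this function recursively to each branch.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best

# Hypothetical feature vectors: [age, owns_car]
rows = [[25, 0], [32, 1], [47, 0], [51, 1], [62, 0]]
labels = ["no", "no", "yes", "yes", "yes"]

print(best_split(rows, labels))  # (0, 32, 0.0)
```

Splitting on age at 32 separates the two classes perfectly, so the weighted impurity is zero.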

14. What Is Root Cause Analysis? What Is Logistic Regression?

Root cause analysis (RCA) is a problem-solving technique used to identify the underlying cause of a problem, rather than its symptoms, in order to find a lasting solution.

Logistic regression is a statistical model used to predict the probability of a binary outcome, based on a set of observed variables. It passes a linear combination of the inputs through the logistic (sigmoid) function, so the output always lies between 0 and 1. For example, it can be used to predict the likelihood of a person being infected with a disease.
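To show how logistic regression turns observed variables into a probability, here is a minimal sketch of the prediction step only. The weights and bias are hypothetical values chosen for illustration; in practice they are learned from training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical model parameters, not fitted to any real data
weights = [0.8, -0.5]
bias = -1.0

prob = predict_probability([2.0, 1.0], weights, bias)
print(f"{prob:.3f}")
```

A typical decision rule is to predict the positive class when the probability exceeds 0.5, although the threshold can be moved depending on the cost of each type of mistake.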

15. What Are Recommender Systems?

Recommender systems are a type of artificial intelligence that are used to predict what a user might want to buy or watch. They are used to recommend items to users based on their past behavior.

There are many different types of recommender systems, but they all have the same goal: to recommend items that users may be interested in. In some cases, the items are products that a user can purchase; in other cases, they are articles or other content that a user can consume.

One of the most common types is the collaborative filtering system, which relies on feedback data from many users. For example, if a user likes item A, and other users who liked item A also liked items B, C, and D, the system will assume that the user may like those items too. This type of system is often used to recommend products based on past purchases.

16. What Is Cross-Validation?

Cross-validation is a technique for estimating how well a predictive model will perform on data it has not seen. The simplest approach is to divide the data (not the model) into two sets: a training set used to build the model and a test set used to evaluate it.

In k-fold cross-validation, the data is divided into k folds of roughly equal size. The model is trained k times, each time holding out a different fold for testing and training on the remaining folds, and the k scores are averaged. This uses every observation for both training and testing and gives a more stable estimate than a single split.
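One common form of cross-validation is k-fold, where the train/test split is repeated k times so that every observation is used for testing exactly once. The sketch below only generates the index splits; the function name is made up, and real projects usually shuffle the data first and use a library helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be trained on each `train` list and scored on the matching `test` list, and the scores would then be averaged.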

17. What Is Collaborative Filtering?

Collaborative filtering is a technique used by recommender systems to make recommendations based on the behavior of many users. It relies on the idea that users who agreed in the past are likely to agree in the future: if two users rated the same items similarly, items that one of them liked can be recommended to the other. It needs only feedback data (e.g., ratings or purchases), not descriptions of the items themselves.
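A minimal sketch of user-based collaborative filtering follows. It finds the most similar other user by cosine similarity over the items both have rated, then suggests that user's items the target user has not seen. The user names and ratings are hypothetical, and production systems use far larger data and more robust similarity measures.

```python
import math

# Hypothetical ratings on a 1-5 scale
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "D": 4},
    "carol": {"A": 1, "C": 5, "E": 4},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Suggest items rated by the most similar user that `user` has not rated."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine_similarity(ratings[user], ratings[name]))
    return sorted(item for item in ratings[nearest] if item not in ratings[user])

print(recommend("alice"))  # ['D']
```

Alice's ratings agree with Bob's much more than with Carol's, so she is recommended item D, which Bob rated and she has not.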

18. Do Gradient Descent Methods Always Converge To Similar Points?

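No, gradient descent methods do not always converge to the same point. Gradient descent follows the slope downhill from its starting point, so on a function with several local minima the result depends on the initialization and the learning rate, and it may settle in a local minimum rather than the global one.

Only when the function is convex, so that every local minimum is also a global minimum, can gradient descent with a suitable learning rate be expected to reach the same optimal value regardless of the starting point.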
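A quick way to check how gradient descent behaves is to run it from different starting points on a function with more than one minimum. The sketch below uses f(x) = x^4 - 3x^2 + x, chosen purely for illustration, with an assumed learning rate of 0.01.

```python
def gradient(x):
    """Derivative of f(x) = x**4 - 3*x**2 + x."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=1000):
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

left = gradient_descent(-2.0)
right = gradient_descent(2.0)
print(f"start=-2.0 -> {left:.2f}")
print(f"start= 2.0 -> {right:.2f}")
```

The two runs end near x = -1.30 and x = 1.13, which are different local minima of the same function. Only the one near -1.30 is the global minimum.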

19. What Are The Drawbacks Of The Linear Model?

There are several drawbacks to the linear model. First, it assumes a linear relationship between the independent variables and the dependent variable, so it cannot capture non-linear patterns or interactions unless those terms are added explicitly. Second, it assumes that the errors are independent and have constant variance, which real data often violates. Third, it is sensitive to outliers and to multicollinearity among the independent variables. Finally, it is not suited to binary or count outcomes, for which models such as logistic or Poisson regression are typically used.

As a result, linear regression models should be used with caution when making predictions, and their assumptions should be checked against the data.

20. What Is The Law Of Large Numbers?

The law of large numbers is a theorem stating that, as the number of independent, identical trials grows, the average of the results converges to the expected value of a single trial.
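A short simulation illustrates the theorem. The expected value of a fair six-sided die is 3.5, and the average of many simulated rolls should move closer to it as the number of rolls grows. The seed is fixed only to make the run repeatable.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def average_of_rolls(n):
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, round(average_of_rolls(n), 3))
```

Small samples can land far from 3.5, while the largest sample should be very close to it.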

21. What Are The Confounding Variables?

Confounding variables are factors that are not the focus of a study but that influence both the independent variable and the dependent variable. For example, in a study on the effects of caffeine on heart rate, age could be a confounding variable if older participants both drink more coffee and have different resting heart rates. Confounding variables make it difficult to determine the true effect of the independent variable, because an observed association may be caused by the confounder instead.

22. What Is Star Schema?

Star schema is a data modeling technique used in business intelligence and data warehousing. It uses a central fact table, which holds the measurements or events, surrounded by multiple dimension tables, which hold descriptive attributes. The star schema gets its name because, when drawn in a diagram, the fact table sits at the center with the dimension tables radiating out from it like the points of a star.

23. What Are Eigenvalue And Eigenvector?

Eigenvalues and eigenvectors are important concepts in linear algebra. For a square matrix A, a non-zero vector v is an eigenvector if multiplying it by A only scales it (possibly reversing it) without rotating it off its line: Av = λv. The scalar λ is the eigenvalue that corresponds to v.

In data science they appear, for example, in Principal Component Analysis, where the eigenvectors of the covariance matrix give the directions of greatest variance and the eigenvalues give the amount of variance along each direction.
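The defining relationship, Av = λv, can be checked directly on a small example. The matrix below is chosen for illustration; its eigenvalues are 3 and 1.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

v = [1, 1]   # eigenvector with eigenvalue 3
lam = 3

print(mat_vec(A, v))         # [3, 3]
print([lam * x for x in v])  # [3, 3]
```

Multiplying v by A gives the same result as multiplying v by 3, so v keeps its direction and is only stretched. The vector [1, -1] is the other eigenvector, with eigenvalue 1.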

24. Why Is Resampling Done?

In data science, resampling means repeatedly drawing samples from the available data. It is done for several reasons: to estimate the uncertainty of a statistic by drawing with replacement from the data (bootstrapping), to validate a model on subsets of the data (cross-validation), to test hypotheses by shuffling labels (permutation tests), and to balance classes by over- or under-sampling. In signal processing the term has a different meaning: changing the sampling rate or resolution of a signal, for example to decrease the size of a file or to increase the number of points for further processing.
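In a statistical setting, a widely used form of resampling is the bootstrap: draw many samples with replacement from the observed data and look at how a statistic varies across them. The sketch below estimates the standard error of the mean this way. The data values are hypothetical and the seed is fixed only for repeatability.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

sample = [12, 15, 9, 20, 14, 17, 11, 16]  # hypothetical observations

def bootstrap_means(data, n_resamples):
    """Means of n_resamples samples drawn with replacement from data."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))
        means.append(statistics.mean(resample))
    return means

means = bootstrap_means(sample, 2000)
standard_error = statistics.stdev(means)

print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"bootstrap standard error: {standard_error:.2f}")
```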

25. How Regularly Must An Algorithm Be Updated?

The frequency of updating an algorithm depends on the algorithm and the context in which it is being used. Generally, a model should be updated when the underlying data changes, when its performance in production degrades, or when new information about the problem or context becomes available.

26. What Is Selection Bias?

Selection bias is the tendency of a sample to be unrepresentative of the population from which it is drawn.

One example of selection bias is self-selection. This occurs when individuals who are more likely to exhibit a particular behavior are the ones who are more likely to participate in a study or survey on that behavior. For example, if researchers were studying the effects of a new drug on heart health, they might find that people who take the drug have better heart health than those who do not take the drug. This might be because healthier people are more likely to take medication than people who are less healthy. As a result, the researchers would be mistakenly attributing the good heart health of the people who took the drug to the drug itself, when it may actually be due to the fact that healthier people are more likely to take medication in general.

27. What Are The Types Of Biases That Can Occur During Sampling?

There are many types of biases that can occur during sampling. Some examples are:

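  • Selection bias: When the way the sample is chosen makes it unrepresentative of the population.
  • Undercoverage bias: When some groups in the population are left out of the sample or are underrepresented.
  • Self-selection bias: When participants decide for themselves whether to take part, as in voluntary surveys.
  • Survivorship bias: When only the cases that “survived” some process are sampled.

Sampling methods can introduce these biases. Convenience sampling, where the sample is selected because it is easy to obtain, is especially prone to them. Cluster sampling, where the sample is selected from groups (clusters) of people, is a legitimate method but can be biased if the chosen clusters differ from the rest of the population.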

28. What Is Survivorship Bias?

Survivorship bias is a cognitive bias that occurs when people focus on the survivors in a group while ignoring the non-survivors. This can lead to an inaccurate assessment of a situation because the survivors may be unrepresentative of the whole group.

Survivorship bias is the logical error of concentrating on the people or things that “survived” some process and inadvertently overlooking those that did not because they are no longer visible.

29. Why Is R Used In Data Visualization?

There are a few reasons why R is often used in data visualization. First, R is open source and free to use, which makes it a popular choice for researchers and data analysts. It also has a wide variety of built-in functions and libraries that make it easy to visualize data. Additionally, R is a relatively easy language to learn, which makes it a popular choice for beginning data analysts.

30. What Is The Difference Between An Error And A Residual Error?

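An error is the difference between an observed value and the true value, which is usually unknown because it depends on the whole population. A residual is the difference between an observed value and the value estimated by a model fitted to the sample. Residuals can be computed from the data and serve as estimates of the unobservable errors.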

31. Difference Between Normalisation And Standardization

Normalisation, in the context of preparing features for a model, rescales values to a fixed range, usually 0 to 1, by subtracting the minimum and dividing by the range (maximum minus minimum). It is useful when features have different units, but it is sensitive to outliers. (In database design, normalisation means something else: organising tables and the relationships between them to reduce redundancy.)

Standardization rescales values so that they have a mean of 0 and a standard deviation of 1, by subtracting the mean and dividing by the standard deviation. It does not bound values to a fixed range and is commonly used for algorithms that assume centered data, such as PCA or regularized regression.
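In the feature-scaling sense of these terms, normalisation usually means min-max scaling to the range 0 to 1, and standardization means converting values to z-scores. A minimal sketch of both, using hypothetical height values:

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

heights_cm = [150, 160, 170, 180, 190]  # hypothetical feature values

print(min_max_normalize(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 2) for z in standardize(heights_cm)])
```

Both functions assume the values are not all identical, since that would cause a division by zero.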

32. Difference Between Point Estimates And Confidence Interval

Point estimates give a single number as an estimate of some population parameter. A confidence interval, on the other hand, is a range of numbers that is likely to include the population parameter. The width of the confidence interval depends on the size of the sample and the level of confidence chosen by the researcher.
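The sketch below computes both for a small hypothetical sample: the sample mean as the point estimate, and a 95% confidence interval around it. It uses the normal approximation for simplicity; with only ten observations, a t-distribution critical value would normally be preferred and would give a slightly wider interval.

```python
import math
import statistics

sample = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.1, 4.9]  # hypothetical measurements

# Point estimate: a single number for the population mean
point_estimate = statistics.mean(sample)

# Standard error of the mean, using the sample standard deviation
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Critical value for 95% confidence under the normal approximation (about 1.96)
z = statistics.NormalDist().inv_cdf(0.975)

lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error

print(f"point estimate: {point_estimate:.2f}")
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

A larger sample shrinks the standard error, and a higher confidence level raises the critical value, which is why the width of the interval depends on both.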


The purpose of this article was to provide an overview of data science interview questions and to serve as a guide for those preparing for a data science interview. The questions covered in this article are representative of the types of questions that may be asked in a data science interview.

However, it is important to note that the specific questions asked during an interview will vary depending on the company and the role.
