Friday, May 15, 2020

Most Frequently Asked Data Science Interview Questions

Photo credit: Pexels.com

Data science is one of the highest-paid jobs in the IT industry. You may be surprised to learn that data scientists earn around $120,000 a year. But such a career also demands a very strong skill set.

Do you have any idea how difficult it is to face data science interviews? For this reason, I have put together some of the most frequently asked questions for data science candidates.

1. What is root cause analysis?

Root cause analysis (RCA) is a method for finding the root causes of problems. A causal factor is different: it does affect an event's outcome, but it is not the root cause. Industrial accidents, software testing, healthcare, and project management are the main areas that use this analysis.

2. What is a resampling method and why is it useful?

Classical parametric tests compare observed statistics with theoretical distributions. Resampling is a data-driven methodology based on repeated sampling within the same sample.

Resampling refers to methods for achieving the following:

1) Exchanging labels on data points when performing significance tests such as randomization tests and re-randomization tests.
2) Estimating the precision of sample statistics.

[...]

A non-random sample of a population introduces error, which is a real problem. For example, if a sample of 100 test cases is made up of a 55/25/15/5 split of four case types that actually occur in equal numbers in the population, a model would most likely make the false assumption that probability is the deciding predictive factor.

Avoiding non-random samples altogether is the best way to deal with this bias. When that is not possible, strategies such as boosting, resampling, and weighting can help.

5. Why is A/B testing used?

A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. It is useful for reducing bounce rates, improving customer engagement, simplifying analysis, increasing conversion values, and so on.

6. What are the benefits of using random forest?

This technique combines several weak learners to produce a strong learner.

1) The same random forest algorithm can be used for both regression and classification tasks.
2) Using a random forest for classification reduces the risk of overfitting compared with a single decision tree.
3) The algorithm can be used in feature engineering: out of all the available features, it can identify the most important ones in the dataset.

7. What is logistic regression?

Also known as the logit model, this method is used to predict a binary outcome from a linear combination of predictor variables. It is popular because the results are easy to interpret.

8. What do you know about feature vectors?

A feature vector is an n-dimensional vector of numerical features that represents an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics, called features, of an object in a mathematical, easily analyzable way.

9. Do you have an idea about statistical power?

Statistical power, or sensitivity, of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true.

[...]

1) The test data used for performance comparison should not have selection bias.
2) One has to verify whether the results reflect local maxima/minima or global maxima/minima.
3) The test data should have enough variety that it closely matches real-life data.
4) When comparing performance, the test environment should be the same for the original algorithm and the new algorithm.
5) If the tests are repeated, the results should be similar.

11. Which is better: a huge number of false positives or a huge number of false negatives?

It depends on the problem. In medical testing, a false negative may give patients and the doctor the incorrect message that the disease is absent when it is actually present. This puts the patient in potential danger because of inadequate treatment. So in this case, a large number of false positives is preferable to a large number of false negatives.

In spam filtering, a false positive occurs when the filter wrongly classifies a genuine email message as spam and therefore blocks its delivery. In this case, false negatives are preferred over false positives.

12. What are the assumptions required for linear regression?

There are four major assumptions:

1) The residuals are normally distributed and independent of each other.
2) There is a linear relationship between the regressors and the dependent variable. This is another way of saying that your model actually fits the data.
3) Homoscedasticity, which means the variance around the regression line is the same for all values of the predictor variable.
4) There is minimal multicollinearity between the explanatory variables.

You can check out this resource for more Data Science Interview questions.
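
To make question 5 (A/B testing) a little more concrete, here is a minimal sketch of one common way to compare two variants: a two-sided, two-proportion z-test on conversion counts. This is only one of several possible tests, and it assumes large, independent, randomly assigned groups. The function name two_proportion_z_test and the visitor and conversion counts are hypothetical, made up purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: number of conversions seen in variants A and B.
    n_a, n_b: number of visitors randomly assigned to A and B.
    Returns the z statistic and the two-sided p-value.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Probability of a result at least this extreme if the null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, invented for this example only.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone. The significance threshold (often 0.05) should be chosen before the experiment is run.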
