
JD for this Position

Candidate should have implemented at least one / two projects


Should have experience in R Programming / Python
Should have experience in SAS / SPSS
Should have experience with machine learning algorithms
Should have worked on at least 3-4 models
Should have exposure to / knowledge of data visualization tools such as
Tableau and JavaScript libraries like D3.js, Chart.js, or Angular.js
Should be a very good storyteller and a very good team player

Mandatory Questions for Screening:

What analytics products have you developed so far?


What is the current model you're working on?
Is it a POC / POV, or a product that has moved to production?
What business problem is it trying to address?
What are the business benefits?
What model did you build, and how did you build it?
What data sets did you consider for this?
What is the size of the data sets?
How many data sets were used to train the model?
What is the accuracy of the model?
What programming language did you use for the model? If R, the candidate
should be able to write a code snippet during the interview process.
What visualization tools did you use for this project?
Finally, the candidate should demo the models if required.

Technical Sample Questions:


1) Pig:
1.1) What are the different Pig data types?
The following data types are supported by the Pig Latin language:
i) Primitive data types (int, long, float, double, chararray, bytearray)
ii) Complex data types (map, tuple, bag)
1.2) What is the difference between Pig Latin and HiveQL?
Pig Latin:
Pig Latin is a procedural language
Nested relational data model
Schema is optional
HiveQL:
HiveQL is a declarative language
Flat relational data model
Schema is required
1.3) Explain the difference between the COUNT_STAR and COUNT functions
in Apache Pig.
The COUNT function does not include NULL values when counting the number of
elements in a bag, whereas the COUNT_STAR() function includes NULL values while
counting.

2) Sqoop:
2.1) What is the usage of the --split-by parameter?
Using the --split-by parameter, we specify the column name based on which
Sqoop will divide the data to be imported into multiple chunks to be run in
parallel.
2.2) What are the two file formats supported by sqoop for import?
Text and Sequence Files.
2.3) When the source data keeps getting updated frequently, what is
the approach to keep it in sync with the data in HDFS imported by
Sqoop?
Sqoop offers two approaches:
a) Use the --incremental parameter with the append option, where the value of a
check column is examined and only rows with values greater than the last imported
value are imported as new rows.
b) Use the --incremental parameter with the lastmodified option, where a date
column in the source is checked for records that have been updated after the
last import.
2.4) How will you implement an all-or-nothing load using Sqoop?

Using the --staging-table option, we first load the data into a staging table and then
load it into the final target table only if the staging load is successful.
2.5) What is the merge tool in Sqoop?
The merge tool allows you to combine two datasets where entries in one dataset
should overwrite entries of an older dataset. For example, an incremental import
run in last-modified mode will generate multiple datasets in HDFS where
successively newer data appears in each dataset. The merge tool will flatten
two datasets into one, taking the newest available records for each primary key.

3) Oozie:
3.1) How to kill the running oozie job?
oozie job -kill [jobid]
3.2) How to execute the oozie actions parallel?
Using fork and join
3.3) What is the difference between an Oozie workflow, coordinator and
bundle?
Workflow: A sequence of actions. It is written in XML, and the actions can be
MapReduce, Hive, Pig, etc.
Coordinator: A program that triggers actions (commonly workflow jobs) when
a set of conditions is met. Conditions can be a time frequency, other external
events, etc.
Bundle: A higher-level Oozie abstraction that batches a set of
coordinator jobs. We can also specify the time for the bundle job to start.
4) Hive:

4.1) What are the different file formats Hive supports?
Text, ORC, Parquet, JSON, Avro, etc.
4.2) Which file format is best for improving query performance?
ORC
4.3) What is the importance of the .hiverc file?
It is a file containing a list of commands that need to run when the Hive CLI
starts, for example setting strict mode to true.
4.4) Is there a way to update records in Hive?
1. Overwriting the partitions
2. Using JOIN and UNION

5) Tableau:
5.1) What is the benefit of a Tableau extract file over a live connection?
An extract can be used anywhere without any connection, and you can build your
own visualizations without connecting to the database.
5.2) What is the difference between a heat map and a tree map?
A heat map is a great way to compare categories using colour and size; in it,
you can compare two different measures. A tree map is a very powerful
visualization, particularly for illustrating hierarchical (tree-structured) data and
part-to-whole relationships.

5.3) In how many ways can we use parameters in Tableau?


We can use parameters with filters, calculated fields, actions, measure-swap,
changing views and auto updates.
5.4) What is the use of a custom SQL query in Tableau?
A custom SQL query is written after connecting to the data source to pull the
data in a structured view. One simple example: a table has 50 columns, but we
need only 10 of them; instead of taking all 50 columns, you can write a SQL
query that selects just the 10 you need, and performance will improve.
5.5) How do you display the top 5 and bottom 5 sales in the same view?
Using filters or calculated fields, we can display the top 5 and bottom 5 sales
in the same view.
5.6) How can we combine database and flat file data in Tableau
Desktop?
Connect to the data twice, once for the database tables and once for the flat
file. Then, via Data -> Edit Relationships, give a join condition on a common
column between the database tables and the flat file.

6) NoSQL:

6.1) What is the CAP theorem?
The CAP theorem states that a distributed data store can provide at most two of
the following three guarantees: Consistency (C), Availability (A), and Partition
tolerance (P).

6.2) What are the main components of the Cassandra data model?
The main components of the Cassandra data model are the cluster, keyspace,
column family, and column.

6.3) What is compaction in HBase?
As more and more data is written to HBase, many HFiles get created. Compaction
is the process of merging these HFiles into one file; after the merged file is
created successfully, the old files are discarded.

6.4) What is the role of the Master server in HBase?
The Master server assigns regions to region servers and handles load balancing
in the cluster.

7) Machine Learning Basic Questions:

7.1) What is overfitting in machine learning?
In machine learning, overfitting occurs when a statistical model describes
random error or noise instead of the underlying relationship. Overfitting is
normally observed when a model is excessively complex, because it has too many
parameters relative to the number of training data points. A model that has
been overfit exhibits poor predictive performance on new data.
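
As a minimal, illustrative sketch (not part of the original question set, and assuming scikit-learn and NumPy are installed), the Python snippet below shows overfitting directly: a degree-15 polynomial fits the noisy training points almost perfectly but does far worse on held-out data than simpler models.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy observations of a smooth underlying signal.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree 15 has far too many parameters for ~30 training points: the training
# error shrinks while the held-out error grows, the signature of overfitting.
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(mean_squared_error(y_train, model.predict(X_train)), 3),
          round(mean_squared_error(y_test, model.predict(X_test)), 3))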

7.2) What are Bayesian Networks (BN)?
A Bayesian Network is a graphical model that represents the probabilistic
relationships among a set of variables.
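
As a small illustration (with made-up probabilities, not taken from this document), the classic Rain / Sprinkler / WetGrass network below shows how a Bayesian network factorises a joint distribution, P(R, S, W) = P(R) P(S | R) P(W | R, S), written as a plain Python sketch:

# All numbers are illustrative assumptions.
P_R = {True: 0.2, False: 0.8}                              # P(Rain)
P_S_given_R = {True: {True: 0.01, False: 0.99},            # P(Sprinkler | Rain)
               False: {True: 0.40, False: 0.60}}
P_W_given_RS = {(True, True): 0.99, (True, False): 0.80,   # P(WetGrass=True | Rain, Sprinkler)
                (False, True): 0.90, (False, False): 0.00}

# Marginal probability of wet grass, obtained by summing the joint over Rain and Sprinkler.
p_wet = sum(P_R[r] * P_S_given_R[r][s] * P_W_given_RS[(r, s)]
            for r in (True, False) for s in (True, False))
print(round(p_wet, 4))  # ~0.448 with the numbers above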

7.3) What are PCA, KPCA and ICA used for?
PCA (Principal Component Analysis), KPCA (Kernel-based Principal Component
Analysis) and ICA (Independent Component Analysis) are important feature
extraction techniques used for dimensionality reduction.
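
A minimal sketch (assuming scikit-learn is installed; the Iris data set is used purely for illustration) showing all three techniques reducing the same data from four features to two:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA, FastICA

X = load_iris().data  # 150 samples, 4 original features

X_pca = PCA(n_components=2).fit_transform(X)                       # linear projection
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)  # non-linear (kernel) projection
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)   # statistically independent components

print(X_pca.shape, X_kpca.shape, X_ica.shape)  # each reduced to 2 dimensions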

7.4) What is dimension reduction in machine learning?
In machine learning and statistics, dimension reduction is the process of
reducing the number of random variables under consideration, and it can be
divided into feature selection and feature extraction.
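
A minimal sketch (again assuming scikit-learn) contrasting the two families named above: feature selection keeps a subset of the original columns, while feature extraction derives new ones.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)  # feature selection: keep 2 original columns
X_extracted = PCA(n_components=2).fit_transform(X)            # feature extraction: derive 2 new components

print(X_selected.shape, X_extracted.shape)  # both reduced from 4 to 2 features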

Other Random Questions Covering Data Science and Machine Learning Algorithms
(No Answers Provided)

What type of problem does the model try to solve?

Is it prone to over-fitting? If so what can be done about this?

Does the model make any important assumptions about the data? When
might these be unrealistic? How do we examine the data to test whether these
assumptions are satisfied?

Does the model have convergence problems? Does it have a random component or
will the same training data always generate the same model? How do we deal with
random effects in training?

What types of data (numerical, categorical etc.) can the model handle?

Can the model handle missing data? What could we do if we find missing
fields in our data?

How interpretable is the model?

What alternative models might we use for the same type of problem that
this one attempts to solve, and how does it compare to those?

Can we update the model without retraining it from the beginning?

How fast is prediction compared to other models? How fast is training compared
to other models?

Does the model have any meta-parameters and thus require tuning? How
do we do this?

What is the EM algorithm? Give a couple of applications.

What is deep learning, and what are some of the main characteristics that
distinguish it from traditional machine learning?

What is linear in a generalized linear model?


What is a probabilistic graphical model? What is the difference between
Markov networks and Bayesian networks?

Give an example of an application of non-negative matrix factorization.

On what type of ensemble technique is a random forest based? What particular
limitation does it try to address?

What methods for dimensionality reduction do you know and how do they
compare with each other?

What are some good ways for performing feature selection that do not
involve exhaustive search?

How would you evaluate the quality of the clusters that are generated by a
run of K-means?

Do you have any research experience in machine learning or a related field? Do
you have any publications?

What tools and environments have you used to train and assess models?

Do you have experience with Spark ML or another platform for building machine
learning models using very large datasets?
