2) Sqoop:
2.1) What is the usage of the --split-by parameter?
The --split-by parameter specifies the column on which Sqoop divides the data
to be imported into multiple chunks, which are then imported in parallel by
the mappers.
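A minimal sketch (the connection string, table, and column names are
placeholders); here Sqoop splits the orders table on order_id across four
parallel mappers:
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --split-by order_id \
  --num-mappers 4 \
  --target-dir /data/orders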
2.2) What are the two file formats supported by Sqoop for import?
Text files (the default) and SequenceFiles.
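The format is selected with a flag, for example (connection and table are
placeholders):
sqoop import --connect jdbc:mysql://dbhost/sales --table orders --as-sequencefile
(--as-textfile is the default.)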
2.3) When the source data keeps getting updated frequently, what is
the approach to keep it in sync with the data in HDFS imported by
Sqoop?
Sqoop offers two approaches:
a) Use the --incremental parameter with the append option: a check column
(typically an auto-incrementing key) is compared against the last imported
value, and only rows with a greater value are imported as new rows.
b) Use the --incremental parameter with the lastmodified option: a timestamp
column in the source is checked for records that have been updated after the
last import. Sketches of both are shown below.
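Minimal sketches (connection details, table, and column names are
placeholders):
sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
  --incremental append --check-column order_id --last-value 10000
sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
  --incremental lastmodified --check-column updated_at \
  --last-value "2020-01-01 00:00:00" --merge-key order_id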
2.4) How will you implement an all-or-nothing load using Sqoop?
Using the --staging-table option (with sqoop export): the data is first
loaded into a staging table and is moved to the final target table only if
the staging load succeeds, so a failed load leaves the target untouched.
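A minimal sketch (connection details and table names are placeholders):
sqoop export --connect jdbc:mysql://dbhost/sales --table orders \
  --staging-table orders_stage --clear-staging-table \
  --export-dir /data/orders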
2.5) What is the merge tool in Sqoop?
The merge tool allows you to combine two datasets where entries in one dataset
should overwrite entries of an older dataset. For example, an incremental import
run in last-modified mode will generate multiple datasets in HDFS where
successively newer data appears in each dataset. The merge tool will flatten
two datasets into one, taking the newest available records for each primary key.
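A minimal sketch (paths and names are placeholders; orders.jar and the
orders class are the record class produced by a previous sqoop import's
codegen step):
sqoop merge --new-data /data/orders_inc --onto /data/orders_base \
  --target-dir /data/orders_merged --merge-key order_id \
  --jar-file orders.jar --class-name orders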
3) Oozie:
3.1) How do you kill a running Oozie job?
oozie job -kill [jobid]
3.2) How do you execute Oozie actions in parallel?
Using fork and join control nodes: a fork splits the execution path into
multiple concurrent paths, and the matching join waits until every forked
path has completed before the workflow continues.
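A minimal workflow fragment (node names are placeholders; the action bodies
are elided):
<fork name="fork-steps">
    <path start="step-a"/>
    <path start="step-b"/>
</fork>
<action name="step-a">
    <!-- e.g. a hive or map-reduce action -->
    <ok to="join-steps"/>
    <error to="fail"/>
</action>
<action name="step-b">
    <!-- runs concurrently with step-a -->
    <ok to="join-steps"/>
    <error to="fail"/>
</action>
<join name="join-steps" to="next-step"/>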
3.3) What is the difference between an Oozie workflow, coordinator and
bundle?
Workflow: a sequence of actions arranged as a directed acyclic graph. It is
written in XML, and the actions can be MapReduce, Hive, Pig, etc.
Coordinator: a program that triggers actions (commonly workflow jobs) when a
set of conditions is met. Conditions can be a time frequency, data
availability, or other external events.
Bundle: a higher-level Oozie abstraction that batches a set of coordinator
jobs. We can also specify the time for the bundle job to start.
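A minimal coordinator sketch (the name, dates, and paths are placeholders)
that runs a workflow once a day:
<coordinator-app name="daily-import" frequency="${coord:days(1)}"
                 start="2020-01-01T00:00Z" end="2020-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${nameNode}/apps/import-wf</app-path>
        </workflow>
    </action>
</coordinator-app>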
4) Hive:
4.1) What are the different file formats Hive supports?
Text, ORC, Parquet, JSON, Avro, etc.
4.2) Which file format is best for improving query performance?
ORC (Optimized Row Columnar), thanks to its columnar layout, lightweight
indexes, and compression.
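A minimal sketch (table and column names are placeholders):
CREATE TABLE orders_orc (order_id BIGINT, amount DOUBLE)
STORED AS ORC;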
4.3) What is the importance of the .hiverc file?
It contains a list of commands that run automatically when the Hive CLI
starts. For example, it can turn strict mode on.
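A sample .hiverc (the settings shown are illustrative):
set hive.mapred.mode=strict;
set hive.cli.print.header=true;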
4.4) Is there a way to update records in Hive?
1. Overwriting the affected partitions (sketched below)
2. Rebuilding the table with a JOIN and UNION of the old and changed records
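A minimal sketch of the partition-overwrite approach (table, partition, and
column names are placeholders):
INSERT OVERWRITE TABLE orders PARTITION (dt='2020-01-01')
SELECT order_id, amount FROM orders_staging WHERE dt='2020-01-01';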
5) Tableau:
5.1) What is the benefit of a Tableau extract file over a live connection?
An extract can be used anywhere without a connection, so you can build your
own visualizations without connecting to the database.
5.2) What is the difference between a heat map and a tree map?
A heat map is a great way to compare categories using colour and size; it
lets you compare two different measures at once. A tree map is a very
powerful visualization, particularly for illustrating hierarchical
(tree-structured) data and part-to-whole relationships.
6) NoSQL:
6.2) Mention the main components of the Cassandra data model.
The main components of the Cassandra data model are: Cluster, Keyspace,
Column Family, and Column.
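A minimal CQL sketch showing a keyspace and a column family/table (the names
and replication settings are placeholders):
CREATE KEYSPACE shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE shop.orders (order_id bigint PRIMARY KEY, amount double);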
7) Machine Learning:
7.1) Does the model make any important assumptions about the data? When
might these be unrealistic? How do we examine the data to test whether these
assumptions are satisfied?
7.2) What types of data (numerical, categorical etc.) can the model handle?
7.3) Can the model handle missing data? What could we do if we find missing
fields in our data?
7.4) What alternative models might we use for the same type of problem that
this one attempts to solve, and how does it compare to those?
7.5) Does the model have any meta-parameters and thus require tuning? How do
we do this?
7.6) What is deep learning, and what are some of the main characteristics
that distinguish it from traditional machine learning?
7.7) What methods for dimensionality reduction do you know, and how do they
compare with each other?
7.8) What are some good ways of performing feature selection that do not
involve exhaustive search?
7.9) How would you evaluate the quality of the clusters that are generated
by a run of K-means?
7.10) What tools and environments have you used to train and assess models?