
HBlocks: A Hadoop Subsystem for Iterative Data Engineering

Eric Czech
Next Big Sound, Inc.
December 30, 2013
Abstract
The effective design of data-driven systems is often achieved through iterative approaches. Frameworks for Agile programming, A/B testing, continuous integration, and source code control allow incremental changes to surface new insights and knowledge that can be applied to existing systems in a way that minimizes risk and maximizes progress. These techniques are undoubtedly useful, but they only support the development of software, not the evolution of data itself. In this paper, we describe the architecture of HBlocks, a logical abstraction for Apache Hadoop used to create iterative data integration and storage processes that generate and maintain large, volatile datasets. This is accomplished using a novel data state and version management framework that synchronizes the evolution of the software underlying these processes with the results they produce.

HBlocks is deployed within Next Big Sound as the primary information management system, and we will also discuss our experience using it to enable the rapid, anxiety-free construction of a comprehensive data aggregation platform for the music industry (sourcing billions of streams, purchases, views, and more from providers like Spotify, iTunes, YouTube, Facebook, Twitter, etc.).
1 Introduction
Distributed databases and data processing frameworks like Hadoop have made the ability to aggregate and maintain large amounts of information far more ubiquitous in recent years. They emphasize the fast accumulation of new data, and with any MapReduce platform, transforming and summarizing new information is not difficult. However, those summarizations often involve processes with some asymptotic level of accuracy (e.g. IP/location geocoding, name-based gender inference, entity matching) or that are subject to sources of human error like software bugs/malfunctions and misinterpretations. These errors can manifest as large or small inaccuracies in any resulting dataset, and their overall impact can be uncertain.

When inaccuracies affect a large portion or all of any one dataset, the corrective action is usually obvious: erase everything and try again. However, our experience shows that this is often not the case; data quality issues arise in much subtler ways, where the action to be taken is more ambiguous. In these situations, being able to minimize the scope of what needs to be recalculated while effectively managing the availability of results is very important. HBlocks provides this capability by allowing for an arbitrary number of versions of any one block of data, as well as different states that can be associated with each of those versions. These states can then be used to modulate access to different versions of data blocks based on any criteria. Some practical applications include:
- Older, stable data blocks with small inaccuracies are placed in a state that makes them available to end users of a service, while developers can see newer, less stable version(s) mid-reconstruction.

- New, unprocessed blocks of (raw) data are associated with a "ready" or "pending" state, implying that they have not yet passed through an ETL pipeline or other process responsible for extracting the necessary summary forms from the raw form. This is used to automate the processing of new data on a recurring basis.

- Flawed or otherwise unwanted blocks of data are immediately hidden by applying an "obsolete" or "deleted" state. Query clients ignore records associated with this state by default, and a secondary process physically removes this data at a later time, after it becomes nearly certain that no reversion will be necessary (i.e. states are used to implement a "trash").
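The state-based visibility described by these applications can be sketched as a simple filter. The following is illustrative Python, not part of HBlocks itself; the audience-to-state mapping is a hypothetical access policy, though the state names follow the examples above.

```python
# Hypothetical access policy: which version states each audience may see.
# OBSOLETE and DELETED appear in no audience's set, so they are hidden by
# default (the "trash" behavior described above).
VISIBLE_STATES = {
    "end_user": {"ACTIVE"},
    "developer": {"ACTIVE", "BUILDING", "CANDIDATE"},
}

def visible_versions(versions, audience):
    """Filter (block_id, version, state) tuples to those the audience may see."""
    allowed = VISIBLE_STATES[audience]
    return [v for v in versions if v[2] in allowed]
```

With this policy, a block mid-reconstruction is invisible to end users but inspectable by developers, which is exactly the modulation of access the states are meant to provide.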
For the remainder of this paper, we will discuss the mechanics of these features, how they work within the context of the Hadoop ecosystem, and some details of our experience with them. The key contributions of this paper are:

- To the best of our knowledge, this is the first paper to present a system capable of providing revision control for large volumes of data at a practical granularity within Apache Hadoop. Other similar solutions do exist, but they generally only function within smaller (sub-terabyte) or non-distributed environments [16, 7, 17].

- We present solutions, using only open-source tools, to the challenges inherent in assembling a cohesive, well standardized dataset composed of data from an arbitrary number of contributing sources. These solutions are centered around enabling an iterative, incremental approach to data integration, processing, and storage. This is in contrast to the more common brute-force or one-off approaches often taken with Hadoop.

- The solutions detailed in this paper have been implemented and are deployed in a production environment. We will present the key design trade-offs that have proven most relevant through this experience, as well as other lessons learned.
2 Design Principles
The basic function of HBlocks is to act as a metadata system used to relate any piece of raw input data to any other derived or altered forms in which it may exist. Many data integration systems for Hadoop start with a raw, immutable input dataset and process that information via MapReduce to produce some summary or view. HBlocks associates this raw input data with the results produced in a way that allows the properties of that association to be manipulated to affect result presentation.

HBlocks stores this metadata in a relational database which is intended to be very small compared to the associated data in HDFS/HBase [19, p. 41, 2]. This is accomplished by grouping related input data files into blocks, where altering the state of an input data block also affects any data produced using it.

Derived forms of input data blocks are always persisted with an identifier that indicates which block they belong to, so that the state of those blocks can later be referenced when querying the derived data. For example, raw HBlocks data stored on HDFS, processed via MapReduce, and then inserted into HBase would contain the block identifier in the row key or qualifier of the inserted record. This identifier would later be parsed out when HBase is queried and referenced against the HBlocks metadata to determine whether that particular record is associated with a block that should or should not be visible to the user submitting the query. Figure 1 illustrates this relationship within the context of a conventional Hadoop workflow.
Figure 1: HBlocks usage in Hadoop data processing pipeline
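The read path just described can be roughed out as follows. This is illustrative Python; the trailing '<field>:<block_id>' qualifier layout and the `block_states` lookup are simplifying assumptions, not the actual HBlocks wire format.

```python
def parse_qualifier(qualifier):
    """Split an HBase qualifier of the assumed form '<field>:<block_id>'.

    The trailing-identifier layout here is a simplification of the scheme
    the paper describes, not the actual HBlocks encoding.
    """
    field, _, block_id = qualifier.rpartition(":")
    return field, int(block_id)

def visible_cells(cells, block_states, visible=("ACTIVE",)):
    """Keep only cells whose block is in a state visible to the caller."""
    kept = []
    for qualifier, value in cells:
        field, block_id = parse_qualifier(qualifier)
        if block_states.get(block_id) in visible:
            kept.append((field, value))
    return kept
```

The essential point is that the visibility decision happens at query time against the small metadata store, so no bulk data needs to be rewritten to hide or reveal a block.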
2.1 Example Data Revisions
To demonstrate this design with an example, consider a small dataset consisting of user visits to a particular website. Assume that several files are generated at different times containing records of these visits (by user) and that these files are stored on HDFS. An aggregation process then calculates the overall number of visits for each user and writes that result to a distributed database like HBase. HBlocks maintains the relationship between the data in these two places as follows:
Figure 2: Example aggregation process with read path
In this example, the contribution of any raw data block to an aggregated result is controlled by manipulating the state associated with that block. Data from the DELETED block (block 3) is assumed to be flawed in some way; either the raw data file associated with it is inaccurate or it was processed incorrectly by the ETL pipeline. This state change could, however, be reversed, and any such change would be reflected very quickly in query results.

As a more practical use case, this example can be extended to consider how states, in conjunction with block versions, can be used to correct processing errors or data corruption. Data block versions are omitted from the preceding example for the sake of simplicity but can be applied as part of a corrective action for this hypothetical situation. In Figure 3, a new version of block 3 is created with a state that only provides access to privileged users while the original version is still permanently hidden. Once it is determined that the newly generated CANDIDATE result is correct, the state of block 3 would be updated to ACTIVE to make it available to all users.
Figure 3: Example usage of versions for revision control
The interpretation of different versions and states outlined in this example can be extended to support a variety of behaviors (usage at Next Big Sound is much more complex), but enabling seamless data revisions is their core purpose. This feature greatly minimizes the risk inherent in correcting or integrating large amounts of summary data by aligning the stability of that data with the expectations of the user viewing it. HBlocks attempts to treat data as a dynamic entity that can exist in any number of states, whereas conventional systems tend to treat data in one of two ways: either it exists or it does not. This is analogous to concepts employed in software development that allow several viable versions of a codebase to exist at any one time (e.g. branching [20]).
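The revision flow described in this section can be modeled as a handful of operations on (block, version) pairs. The sketch below is a toy in-memory model for illustration only; the real system keeps these records in a relational database behind the Java API.

```python
class BlockVersions:
    """Toy in-memory model of the state/version revision flow."""

    def __init__(self):
        self.state = {}  # (block_id, version) -> state

    def new_version(self, block_id, state="CANDIDATE"):
        """Create the next version of a block, by default visible to
        privileged users only (the CANDIDATE stage of the example)."""
        version = 1 + max((v for b, v in self.state if b == block_id), default=0)
        self.state[(block_id, version)] = state
        return version

    def promote(self, block_id, version):
        """Make a candidate version ACTIVE; obsolete any other active one."""
        for key, s in self.state.items():
            if key[0] == block_id and key[1] != version and s == "ACTIVE":
                self.state[key] = "OBSOLETE"
        self.state[(block_id, version)] = "ACTIVE"
```

Replaying the Figure 3 scenario: the flawed version of block 3 is marked DELETED, a replacement version is built as a CANDIDATE, and only after verification is it promoted to ACTIVE for all users.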
2.2 Applicability
HBlocks can be applied as part of many different types of workflows, but it does introduce a non-trivial storage and performance overhead. It is not appropriate for small datasets or those for which quality is rarely a problem. It is also not necessary for large, error-prone datasets when other brute-force approaches can be used instead. These simpler architectures generally involve continuous, resource-intensive recomputations of entire datasets, where newly generated results are swapped out with old results on completion [18, 15, p. 16]. HBlocks allows for this same behavior, if need be, but also provides more targeted, economical mechanisms to accomplish the same thing.
3 Architecture
HBlocks consists of four primary components:

1. Relational data model (currently only implemented for MySQL [14])
2. Java API
3. Command-line application for manipulating data states and versions
4. Integration with other Hadoop services and tools

Each of these components is used to augment other Hadoop subsystems in order to build an end-to-end data management pipeline. We will only discuss this relationship for HDFS, HBase, Hive [3], Pig [5], and Oozie [4], but the application of HBlocks is not limited to these subsystems alone. Future work could involve integrations with others such as Flume [1] or HCatalog [10].
3.1 Relational Data Model
The data model used by HBlocks defines two major entities, hblocks and hblock_versions. The former is used to represent groups of raw input files as named blocks (referred to as hblocks). The latter represents the different versions of each hblock as well as their current state.

Figure 4: Database table schema representing the relationship between hblocks and their corresponding versions

Figure 4 shows the structure of the entities used by HBlocks. Each field can have several purposes, but their primary functions are as follows:
Table 1: Field definitions for hblocks

  Field    Description
  id       Auto-incrementing identifier assigned to each hblock
  source   Data source associated with the hblock, used for operations on entire datasets
  name     Name of the hblock, used for human-readable references and search features

Table 2: Field definitions for hblock_versions

  Field      Description
  hblock_id  Identifier of the hblock associated with the version
  version    Auto-incrementing version identifier
  state      Current state of this version
Most of the fields in this model have a simple interpretation, but there are others that bear further explanation. First, the hblocks.name field is used to enforce uniqueness amongst hblocks for a particular hblocks.source. The name assigned to any hblock is usually equivalent to the name of a single file associated with it, but as many files can be associated with one hblock, using a more generalized name may be appropriate. The name is also commonly used to search for particular hblocks, via regular expression, before executing some operation on them.

Second, the hblock_versions.state field can have many values that correspond to a variety of behaviors. Typically, states move through a standard progression as part of an unattended workflow, but manual changes are necessary to delete versions or queue them for reprocessing. Figure 5 illustrates the common transitions and the interpretations of each state.
Figure 5: HBlock version lifecycle
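A lifecycle like the one in Figure 5 is naturally encoded as a transition table that automated workflows and manual tools can both validate against. The table below is one plausible reading of the lifecycle (illustrative Python; the paper does not enumerate the legal transitions exhaustively, so this set is an assumption).

```python
# Assumed transition table for the version lifecycle; the exact edge set
# is an illustration, not the paper's definitive state machine.
TRANSITIONS = {
    "READY": {"BUILDING"},
    "BUILDING": {"ACTIVE", "CANDIDATE", "READY"},  # builds can be retried
    "CANDIDATE": {"ACTIVE", "OBSOLETE"},
    "ACTIVE": {"OBSOLETE", "DELETED"},
    "OBSOLETE": {"ACTIVE", "DELETED"},  # reversion remains possible
    "DELETED": set(),                   # terminal; awaiting physical removal
}

def transition(current, new):
    """Validate a manual or automated state change against the lifecycle."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Keeping DELETED terminal (until a secondary process physically removes the data) is what gives the "trash" semantics described in section 1.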
3.1.1 Field Mutability
An important aspect of this model to note is that, excepting only the hblock_versions.state field, all fields are generally treated as immutable. API operations (discussed in section 3.2) exist to completely remove hblocks and their versions, but these are not normally necessary. Instead, an append-only approach is used, where new versions of any hblock are created while the state of old versions is modified to phase them out more gracefully.
3.1.2 Constraints and Limitations
In most situations, a very large number of hblocks would not be practical. The data model outlined above allows for the possibility of 2^64 hblocks, but for realistic applications this count would not normally exceed tens or hundreds of thousands. This is generally not an issue, though, as the number and size of files associated with any one hblock is limited only by the capacity and addressable namespace allotted by HDFS. This implies that as data size increases, the cumulative size of all files associated with each hblock would likely increase and keep the overall count relatively stable.

Similarly, the number of versions for each hblock would usually be much lower than the maximum possible in this model. As discussed later in sections 3.1.3 and 3.4.4, the version identifier is stored as part of HBase qualifiers/columns, so row sizes would become prohibitively large if the number of persisted versions per hblock is too high. The maximum value of this ratio is certainly dependent on the application, but we have found that using no more than 3-5 versions per hblock provides sufficient versatility with minimal impact on query performance. A much larger number of versions can be used, but some care must be taken to ensure that old versions are physically removed from HBase before new versions create bloated, problematic rows.
3.1.3 Propagation
While the entities in this model are relatively simple, some of the fields in these tables are persisted and used by other Hadoop services to create more extensive relationships. The hblocks.id field is used as part of a file naming convention within HDFS to preserve the relationship between the two (section 3.4.1). Pig selects these files for processing based on the state of the associated hblock_versions (section 3.4.2) and persists results produced with this data in HBase with both the hblocks.id and hblock_versions.version fields as qualifier/column suffixes (section 3.4.4). Query clients then match rows returned from HBase to HBlocks state information to generate the results most appropriate for the executing user. The propagation of fields in this way is necessary for HBlocks to control data flow through each part of the Hadoop stack.
3.2 Java API
All systems that interact with HBlocks do so using the CRUD [21] semantics provided by the Java API. This API acts as the interface to the data model described in section 3.1 and is responsible for translating requests into atomic operations for a supporting relational database backend. It is frequently used by integration libraries for Hadoop subsystems as a means of unattended change, whereas manual changes are generally made using the command-line interface (CLI) discussed in section 3.3 (a Java-based wrapper for the same API).
Table 3: Java API Outline

  Behavior          Description
  Creation          New hblocks can be created with a given name and data source
  Search            Existing hblocks can be searched by name, data source, or version state
  Deletion          Existing hblocks can be deleted to remove relational and HDFS data
  Version Creation  New versions can be created for any hblock
  Version Update    Existing versions can be updated to have any state
Table 3 outlines the basic behaviors provided by the API. There are more behaviors, but these comprise the bulk of the functionality required by higher-level architectural components.
3.3 Command-Line Interface
The HBlocks command-line interface facilitates manual operations made through the Java API (discussed previously in section 3.2). These operations are carried out by systems operators or developers to control many aspects of the data lifecycle. Figure 6 illustrates this lifecycle with CLI command examples controlling movement through it.

The CLI would not necessarily be the ideal interface for each of these steps and would likely not be preferred over the Java API when the quantity or complexity of operations is too great. In practical applications, usage of the CLI is limited to revisions/deletions, and several of the steps outlined in Figure 6 are substituted for more scalable, automated processes. These processes combine the HBlocks functionality with other Hadoop subsystems as part of an integration discussed in section 3.4.
1. Files uploaded to create a new hblock with a single READY version

   > hblocks upload -file mydata.csv

2. READY hblocks selected for processing; state becomes BUILDING

   > hblocks list -states READY

3. Processed hblock versions made ACTIVE

   > hblocks update versions -states BUILDING -newstate ACTIVE

4. Optional: Invalid result versions made obsolete (or available only to devs)

   > hblocks update versions -regex mydata.* -newstate OBSOLETE

5. Optional: New versions created as replacements

   > hblocks rebuild -regex mydata.*

6. Optional: Hblocks deleted to remove/hide all associated data

   > hblocks delete -regex mydata.*

Figure 6: Example CLI usage controlling the data lifecycle (other operations do exist and some details of those above are omitted for brevity).
3.4 Hadoop Integration
The integration of HBlocks with other Hadoop components is intended to be as superficial and lightweight as possible. In most cases, identifiers from the HBlocks metadata database are simply passed around and stored in a way that minimizes overhead. At Next Big Sound, this propagation includes HDFS, Pig, Hive, and HBase, but future work could involve other systems as well.
3.4.1 HDFS
The HBlocks data model defines a directory structure and file naming convention for HDFS. This organization maintains the relationship between hblocks and raw files while separating them in a way that makes their use via Pig and Hive as convenient as possible.

Figure 7: Generalized HDFS naming convention.

Figure 7 shows the general form of this scheme. Note that many files can be associated with any one hblock id in this model and that the original file name and extension are always preserved. The name of the hblock itself will often match that of a single associated file but is usually abstracted appropriately when several files are likely to be necessary.

Compression types for each file do not have to be consistent within a data source (or hblock). Figure 8 shows an example of mixed file types like this; it is often ideal to apply different compression codecs based on file size.

While compression types can be mixed, the format of files should be consistent. Data for each source and hblock is expected to be structurally identical and conform to some pre-determined schema. For example, the Hive integration (discussed further in section 3.4.3) attaches external tables [19, p. 381] to the directory for each data source and assumes that all contained data will match the same form. Using mixed formats is not strictly enforced or impossible, but doing so would make later-stage processing more complicated.
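A convention like this is easy to parse back out of a path. The sketch below is illustrative Python that assumes the layout visible in the Figure 13 examples, i.e. /hblocks/data/<source>/block_<id>_<original_file_name>; the exact convention in Figure 7 may differ in detail.

```python
import os

def parse_block_file(path):
    """Recover (source, hblock id, original file name) from an HDFS path.

    Assumes paths of the form
    /hblocks/data/<source>/block_<id>_<original_file_name>, which matches
    the examples shown later for the Wikipedia dataset.
    """
    source = os.path.basename(os.path.dirname(path))
    name = os.path.basename(path)
    _, block_id, original = name.split("_", 2)
    return source, int(block_id), original
```

Because the hblock id travels inside the file name, any downstream consumer (Pig, Hive, or an ad-hoc job) can relate a file back to its metadata without consulting anything but the path itself.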
3.4.2 Pig
Pig is used to process data, as part of an Oozie workflow, by first selecting HDFS files associated with hblocks that match certain criteria. A custom LoadFunc [8, p. 146] implementation enables initial script clauses to specify the raw data to load. Figure 9 shows an example of this, where a hypothetical dataset containing two fields, a user and a date, is loaded into Pig with the necessary HBlocks identifiers attached.

Figure 8: HDFS file examples matching the generalized convention.

The custom loader implementation uses the identifiers in the file name (discussed in section 3.4.1) to search an in-memory representation of HBlocks relational data (discussed in section 3.1) and produce the version and hblock identifier on a per-tuple basis. By default, this loader will only select files for hblocks with a most recent version in the READY state.
%DECLARE HBLOCK_FIELDS 'hblock_id:long, hblock_version:long'
%DECLARE SCHEMA_FIELDS 'user:chararray, date:chararray'

-- Files at hdfs:/hblocks/data/user_events
raw = LOAD 'source=user_events,state=READY'
    USING HBlockLoader()
    AS ($HBLOCK_FIELDS, $SCHEMA_FIELDS);
Figure 9: Pig data loading example
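The loader's default selection rule (only hblocks whose most recent version is READY) can be expressed compactly. The following is illustrative Python, not the actual Java LoadFunc implementation.

```python
def files_to_load(files_by_block, versions):
    """Pick HDFS files whose hblock's most recent version is READY.

    `files_by_block` maps hblock id -> list of file paths;
    `versions` maps (hblock id, version) -> state, mirroring the
    relational model of section 3.1.
    """
    selected = []
    for block_id, files in files_by_block.items():
        history = sorted(v for (b, v), s in versions.items() if b == block_id)
        if history and versions[(block_id, history[-1])] == "READY":
            selected.extend(files)
    return selected
```

Checking only the most recent version means a block already promoted to ACTIVE (or one mid-build) is silently skipped, which is what makes the recurring Oozie workflow safe to run unattended.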
All transformations of raw data carried out by Pig are done without loss of the corresponding HBlocks identifiers, and the resulting tuples are persisted to HBase with those identifiers still intact. An example of this can be seen in Figure 10, where the count per user and month is calculated before being stored.
transformed = FOREACH raw GENERATE *,
    GetMonth(ToDate(date, 'yyyy-MM-dd')) AS month;

grouped = GROUP transformed BY (
    hblock_id, hblock_version,
    user, month
);

result = FOREACH grouped GENERATE
    group, COUNT(transformed);

STORE result
    INTO 'table=metrics'
    USING HBlockStorage();
Figure 10: Pig data processing example
3.4.3 Hive
Applications of Hive within the HBlocks model may intersect or complement those of Pig. We have found, however, that it is better suited as an exploratory tool supporting ad-hoc questions about raw datasets or as a debugging tool for developing complex data pipelines with Pig. It has no formal integration with the HBlocks system but is still used in a supplementary fashion.

As discussed in section 3.4.1, files in HDFS are separated into directories specific to data sources. These directories are registered within Hive as external tables, and the format of all contained files is assumed to be consistent (e.g. tabular). This allows for entire datasets to be queried, or for data to be limited to only the subsets specific to certain hblocks, using the Virtual Columns feature [6, p. 142] (i.e. queries may include criteria for identifiers in file names).
3.4.4 HBase
Use of the HBlocks model within HBase consists of two things: a persistence model and query semantics applied by a Java framework. The persistence model details how HBase rows, qualifiers, and timestamps are structured to include abstract data forms while maintaining their relationship to HBlocks entities. The query semantics given by the Java framework determine how queried HBase data should be interpreted based on the associated HBlocks metadata. Both of these components attempt to abstract the logic underlying the treatment of HBlocks data away from business data via Java interface implementations. This separation allows the types of data stored with HBlocks to be fairly unrestricted.

The persistence model for HBase generally assumes that many dimension values, or fields, will be stored per record and that those fields have some precedence over one another. These fields are composed into composite keys much like those often advised for HBase data models (or seen in RDBMS systems) [9, p. 362] and imply a common order in which each field is accessed. For example, two fields age and username could be used to create a composite index where username values are typically searched for by a particular age value or range. In this case, age is the field with the highest precedence since it will be used more often as a search criterion.
As a generalized model, assume that n fields are to be stored and that each field f_i, i ∈ [1..n], has a precedence P(f_i) where P(f_x) > P(f_y) for all x < y. Also, assume that the HBlocks identifiers for ids and versions are denoted f_h_id and f_h_ver, respectively. Figure 11 shows a potential application of these fields as HBase records, concatenating them in decreasing order of precedence to create each row component.
  Key:       f_1 f_2 ... f_a                          (1)
  Qualifier: f_(a+1) f_(a+2) ... f_n f_h_id f_h_ver   (2)

Figure 11: HBase persistence model
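Composing a row from fields ordered by precedence, with the HBlocks identifiers trailing the qualifier, can be sketched as follows. This is illustrative Python; the separator and string encoding are assumptions, since the real model works at the byte level.

```python
def compose_row(fields, split, hblock_id, hblock_version, sep=":"):
    """Concatenate fields in decreasing precedence into a key and qualifier.

    `fields` are already ordered by precedence (f_1 first); the first
    `split` fields form the row key and the rest, plus the HBlocks
    identifiers, form the qualifier. The ':' separator is an illustrative
    choice, not the paper's byte-level encoding.
    """
    key = sep.join(str(f) for f in fields[:split])
    qualifier = sep.join(
        str(f) for f in list(fields[split:]) + [hblock_id, hblock_version])
    return key, qualifier
```

Placing the identifiers last keeps them out of the scannable prefix, matching the intent that they filter and merge results client-side rather than drive the access path.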
The key structure in Figure 11 (1) would likely include the fields or dimensions in the dataset most likely to be used as a primary means of access. These values should have high cardinality, and partial-key scans [9, p. 360] using these primary values should result in a relatively small number of rows. Partial scans are not strictly necessary and other query mechanisms could be used, but together the query method and persistence model must not allow for results that are larger than the volume of physical memory available.

The qualifier structure in Figure 11 (2) can support more fields as an extension of the key, but functions primarily as a representation of hblock identifiers and version numbers. These identifiers are not favored in the access path and cannot be searched for directly, as intended. They are used only to merge and filter records pertaining to some set of fields after they are retrieved (i.e. client-side).

The HBase timestamp, value, and family structures are not necessarily included as part of the HBlocks integration and can be used in any way that best suits the dataset being supported. We present an example implementation using these row components, as well as keys and qualifiers, in section 4.1.2.
As discussed earlier in section 2, persisted HBlocks identifiers are parsed out of database records and matched against hblock version and state data to produce flexible result sets. An HBlocks query framework performs this matching by maintaining an in-memory replica of all relational data and applying filter or merge operations as quickly as possible to the corresponding HBase records. The filter semantics include removing HBase cells with hblock versions that are not in an ACTIVE state, or in one otherwise not available to the user. Merge operations then group the remaining records by identical field values, excluding the HBlocks identifiers, and reduce the values associated with those field groups to a single value through some commutative, associative function (e.g. summing is used for numeric count data).
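The filter-then-merge semantics just described can be condensed into a few lines. The sketch below is illustrative Python, not the actual Java query framework; the cell tuple layout is an assumed simplification.

```python
from collections import defaultdict

def merge_results(cells, states, visible=("ACTIVE",), reduce_fn=sum):
    """Filter and merge HBase cells as the query framework does.

    Each cell is (field_key, hblock_id, hblock_version, value). Cells whose
    version state is not visible are dropped; the rest are grouped by field
    key (HBlocks identifiers excluded) and reduced with a commutative,
    associative function, here summation for count data.
    """
    groups = defaultdict(list)
    for field_key, hblock_id, version, value in cells:
        if states.get((hblock_id, version)) in visible:
            groups[field_key].append(value)
    return {k: reduce_fn(v) for k, v in groups.items()}
```

Because the reducing function is commutative and associative, the result is independent of the order in which HBase returns cells, which is what makes the client-side merge safe.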
4 Experience
At Next Big Sound, we provide analytical services for the music industry by measuring engagement within online streaming music/video services, digital content stores, terrestrial radio, and social media platforms. We receive data feeds from over 50 sources with billions of sales transactions, content streams/video views, page likes, etc. occurring on platforms like iTunes, Amazon, Spotify, YouTube, Facebook, and Twitter. These feeds take many forms, and the types and formats of data contained vary even more widely, but our objective with each is constant: compute the number of different event occurrences over time and aggregate those counts to any arbitrary set of available dimensions (e.g. Spotify stream counts broken down by day, song, location, and age/gender).

The collection of this data, often via third-party services, as well as its standardization and processing, has proven very problematic due to the number of sources and the highly variable reliability and quality of each. Raw data we collect may have subtle errors that only affect certain time ranges, or events/transactions in that data may be invalid only for certain entities (e.g. artists or songs). Similarly, the processing of this data might produce aggregations that are only fractionally correct or, worse, open to interpretation.

We have found that these errors occur frequently enough to require an infrastructure capable of supporting regular changes to existing data, and we built HBlocks for that purpose.
4.1 HBlocks Implementation
We receive data on a daily, or more frequent, basis from most of our sources. This data is delivered as CSV, XML, or JSON files, or it is collected from various APIs and converted to that form. As these files come in, we associate them with existing hblocks or create new ones for them, with versions in a READY state.
> hblocks list -source wikipedia
+---------------------------------------------+
| id | name | version:1 | version:2 |
+---------------------------------------------+
| 295 | data_20130101 | OBSOLETE | ACTIVE |
| 296 | data_20130102 | ACTIVE | |
| 297 | data_20130103 | BUILDING | |
| 298 | data_20130104 | READY | |
+---------------------------------------------+
Figure 12: HBlocks metadata for Wikipedia page visit log files from Jan. 1, 2013 to Jan. 4, 2013
Figure 12 shows file groups registered in HBlocks as collected from Wikipedia server log dumps [13]. This figure also shows the different versions created for each hblock and the current state of those versions. Note that the hblock data_20130101 (id 295) has already undergone a single revision, where its first version was placed in an OBSOLETE state after the second version finished processing (and was made ACTIVE). Aggregations computed for this hblock would still potentially exist for both versions, but clients reading that data would only acknowledge records for version 2.
4.1.1 HDFS
Wikipedia log files are separated by hour, and for our purposes, 24 one-hour files are grouped together for each day to define an hblock. Figure 13 shows the associated HDFS files for hblock data_20130101 (id 295). These files are named according to the convention described in section 3.4.1.
> hblocks hdfs_files -source wikipedia -ids 295
+----------------------------------------+
| hblock_id | hdfs_file_name |
+----------------------------------------+
| 295 | block_295_views-hour-00.gz |
| 295 | block_295_views-hour-01.gz |
...
| 295 | block_295_views-hour-23.gz |
+----------------------------------------+
Figure 13: Wikipedia log files for hblock data_20130101
HDFS files associated with hblocks are read into Pig applications like those detailed in section 3.4.2 and, upon completion of processing, the resulting aggregations are written to HBase before the state of all hblock versions used is updated to ACTIVE.
4.1.2 HBase
HBase records written from Pig are constructed in a form similar to that used by OpenTSDB [11]. All information stored is in timeseries form and placed into index HBase tables with the row structure described in Figure 14.

In this figure, entity values are identifiers specific to internal objects (e.g. artists, albums, tracks). Metric values are identifiers for different event types (e.g. visits, sales, page likes, streams), and location values are concatenations of ISO 3166 geographic subdivision and country codes [22].

The hblock id and hblock version values are stored as variable-length integers appended to the end of the qualifier, as discussed in section 3.4.4.
Figure 14: HBase row structure
Timestamps used to represent dates are stored as microseconds, where each value is broken into three parts. timestamp_0 represents the quotient of the original value and some large primary divisor. timestamp_1 represents the remainder of the original value divided by a second, smaller divisor, and timestamp_2.i values represent the offsets needed to reconstruct the original timestamps. Together, any combination of timestamp_0,1 and timestamp_2.i encodes a single microsecond value. Timestamps are separated in this way so that the divisors can be selected, based on the expected distribution of timestamp values, to keep individual rows from being too wide or too narrow [9, p. 359].
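One consistent interpretation of this three-part encoding is successive quotient/remainder splits against the two divisors. The sketch below is illustrative Python under that assumption; the actual divisor values used at Next Big Sound are not given in the paper, so the constants here are placeholders.

```python
# Placeholder divisors, D1 > D2; in practice these would be tuned to the
# expected distribution of timestamp values to control row width.
D1 = 10**12  # large primary divisor (coarse, row-level component)
D2 = 10**6   # smaller secondary divisor (finer, qualifier-level component)

def encode(t_micros):
    """Split a microsecond timestamp into (timestamp_0, timestamp_1, timestamp_2)."""
    t0, rest = divmod(t_micros, D1)
    t1, t2 = divmod(rest, D2)
    return t0, t1, t2

def decode(t0, t1, t2):
    """Reassemble the original microsecond value from its three parts."""
    return t0 * D1 + t1 * D2 + t2
```

Because the split is lossless, any combination of the parts round-trips to the original value, while the choice of D1 and D2 controls how many cells land in a single row.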
4.1.3 Query Clients
HBase clients at Next Big Sound are served from Finagle service wrappers [12]. A Thrift service accepts queries that declare entity, metric, and location values (other, unlisted dimensions are also possible), and an HBlocks execution engine fetches all HBase records for these values. These records are then transformed by first determining the set of hblock version states available to the executing user. These states are defined by implicit privileges, or explicit preferences, for each user and are matched against the state of the hblock version in the qualifier of each HBase cell. Cells that do not match are discarded and the values of the remaining cells are merged, producing results that no longer contain any HBlocks metadata.
4.2 Conclusion
At Next Big Sound, software requirements and information demands shift rapidly. All system design is approached iteratively to facilitate the natural evolution of our products, and while this has proven easy for software development, extending the same approach to data management has not. The productionisation of HBlocks now allows us to align changes in system design throughout the entire stack without requiring excessive computing resources. In conjunction with Hadoop, this has allowed us to easily scale the number of data sources we support, the volume of data we receive, and the variety of users on our platform.
References
[1] Apache Flume. 2009. URL: http://flume.apache.org/.
[2] Apache HBase. 2008. URL: http://hbase.apache.org/.
[3] Apache Hive. 2009. URL: http://hive.apache.org/.
[4] Apache Oozie. 2011. URL: http://oozie.apache.org/.
[5] Apache Pig. 2007. URL: http://pig.apache.org/.
[6] Edward Capriolo, Dean Wampler, and Jason Rutherglen. Programming Hive. O'Reilly Media, Inc., 2012. ISBN: 9781449319335. URL: http://books.google.com/books?id=NS8ABbm3MDEC.
[7] Exversion. Version Control For Data. 2013. URL: http://exversiondata.wordpress.com/2013/08/27/version-control-for-data/.
[8] Alan Gates. Programming Pig. O'Reilly Media, Inc., 2011. ISBN: 9781449302641. URL: http://www.amazon.com/Programming-Pig-Alan-Gates/dp/1449302645.
[9] Lars George. HBase: The Definitive Guide. O'Reilly Media, Inc., 2011. ISBN: 9781449396107. URL: http://books.google.com/books?id=Ytbs4fLHDakC.
[10] HCatalog. 2012. URL: http://hive.apache.org/docs/hcat_r0.5.0/.
[11] StumbleUpon Inc. OpenTSDB Schema. 2010.
[12] Twitter Inc. Finagle. 2011. URL: http://twitter.github.io/finagle/.
[13] Domas Mituzas. Page view statistics for Wikimedia projects. 2007. URL: http://dumps.wikimedia.org/other/pagecounts-raw/.
[14] MySQL. 2009. URL: http://www.mysql.com/.
[15] Nathan Marz and James Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Company, 2013. ISBN: 9781617290343. URL: http://books.google.com/books?id=HW-kMQEACAAJ.
[16] Rufus Pollock. We Need Distributed Revision/Version Control for Data. 2010. URL: http://blog.okfn.org/2010/07/12/we-need-distributed-revisionversion-control-for-data/.
[17] Victor Stanciu. dbv.php. 2013. URL: http://dbv.vizuina.com/.
[18] Roshan Sumbaly. Serving Large-scale Batch Computed Data with Project Voldemort. 2012. URL: http://engineering.linkedin.com/voldemort/serving-large-scale-batch-computed-data-project-voldemort.
[19] Tom White. Hadoop: The Definitive Guide, Second Edition. O'Reilly Media, Inc., 2011. ISBN: 9781449389734. URL: http://books.google.com/books?id=yHS5mAEACAAJ.
[20] Wikipedia. Branching (revision control). Wikipedia, The Free Encyclopedia. [Online; accessed 5-December-2013]. 2013. URL: http://en.wikipedia.org/w/index.php?title=Branching_(revision_control)&oldid=554941549.
[21] Wikipedia. Create, read, update and delete. Wikipedia, The Free Encyclopedia. [Online; accessed 6-December-2013]. 2013. URL: http://en.wikipedia.org/w/index.php?title=Create,_read,_update_and_delete&oldid=581478195.
[22] Wikipedia. ISO 3166. Wikipedia, The Free Encyclopedia. [Online; accessed 12-December-2013]. 2013. URL: http://en.wikipedia.org/w/index.php?title=ISO_3166&oldid=585445792.
