Register
Submit a solution
The challenge is finished.

Challenge Overview

We're starting a new challenge series for building a tool to migrate HBase tables. In previous challenges we have built a tool to create sample tables, generate sample data and a first version of the tool to migrate smaller amounts of data. In this challenge we will improve the tool so it can be used to migrate larger amounts of data.

The basic idea for improving the performance is to parallelize data migration. You can use either Apache Spark or MapReduce. The data migration job will be run on a Cloudera cluster (Cloudera quickstart docker/VM during review) and the core requirement in this challenge is extracting as much performance out of the used resources as possible. Performance will be 70% of the total score during review. We strongly suggest reading 
hbase performance tuning chapter of the Apache HBase book. Especially pay attention to Tuning read and write performance section.

In the forums you will find two tools: a tool to generate sample data in HBase and the Migration tool that will be used as the base code. 

The time series data (metric id, timestamp, value) that should be migrated is relevant to a specific tables structure. There are two tables. One is the meta table to store metric id and device sensor path, and data type information. The other is the data table to store the actual time series data.
The following is the HBASE meta table schema:
The same data are stored in both I-Row and P-Row, because the data can be queried via both the integer metric id and path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.
I-Row Schema: Stores mapping from the metric id to the device sensor path
  • Rowkey: 'i' + 4-byte metric id integer.
  • Columns (ColumnFamily:ColumnQualifier):
    • m:x  (ASCII String) Stores the path string (deviceUUID\channelId\trait). e.g. 8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
    • m:p  (1 byte) Stores the prefix 'A', 'B', ... to help split data among multiple regions.
    • m:d  (1 byte) Stores the data type (0: UNKNOWN, 1: FLOAT, 2: DOUBLE, 3: LONG, 4: BOOLEAN, 5: STRING).
    • m:v  (1 byte) Stores the schema version (always set to 5)
P-Row Schema: Stores mapping from the path (device\channel\triat) to metric id
  • Rowkey: 'p' + string. e.g. p8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
  • Columns (ColumnFamily:ColumnQualifier):
    • m:x  (4 byte integer) Stores the metric id integer.
    • m:p  Same as above.
    • m:d  Same as above.
    • m:v  Same as above.
Custom Attributes
  • IN_MEMORY => 'true'
  • COMPRESSION => 'SNAPPY'  -  (or LZO)
Notes
  • Row Timestamp/Version: milliseconds since epoch
  • The numeric values are all Big-Endian encoded.
The following is the HBASE data table schema: Each row represents data from a metric id for a given UTC day. This is called a day bucket.
Rowkey: 7 bytes
  • Prefix (1 byte): The ascii code for A, B, C, ..., Z, a randomly assigned prefix to avoid hot spot. The same prefix is always assigned to the same metricId.
  • MetricId (4-byte): An integer generated to represent a path (e.g. deviceUUID\channelId\trait)
  • DaysSinceEpoch (2-byte): Days index since 1/1/1970 UTC.
Columns: A wide columns design is used to store values within the day bucket.
  • t:offset  The offset is a 4-byte integer representing the seconds offset during the current UTC day.
  • Row Timestamp/Version: milliseconds since epoch (not exposed in the query)
Custom Table Attributes:  MAX_VERSIONS=>'1'
Notes
  • The numeric values are all Big-Endian encoded.
Example (5 minutes interval trends, the interval can be flexible, and sometimes irregular)
  • Rowkey: A+9303+17295, Column:300, Column:600, Column:900, …, Column:second offset
  • Values: float, float, float, …
  • Note: The A is the prefix. The 9303 is the metric id. The 17295 daysSinceEpoch is 5/9/2017 UTC.
Note that the meta table is very small so focus on speeding up migration of the data table.
There will be two modes for the tool: export and import (same as in the base migration tool). In export mode, the tool will dump the table data from HBASE table to a CSV file and an Apache Avro file, or a set of CSV and/or Avro files. Both CSV and Apache Avro file schemas should be well designed to be as generic and reusable as possible, and clearly documented. The size of the generated files need to be manageable. For example, it can be opened in Excel or be read by some program without freezing the user session. Each file size should not be too big. The reason to support both CSV and Avro file format is to satisfy both readability/portability and the performance reasons for different use cases.

In import mode, user can choose one of the two formats as the input. User can then import data from the files back to the same or new HBASE table. User can define a list of devices UUIDs and/or a list of channelIds, and a time range to the utility. The utility shall manage the very lengthy operation and be able to resume the task if some failure happens during hours of executions. These exceptions can happen and the utility shall handle them and be able to resume the work automatically or manually:
  • The HBASE is too busy, so that the write operation can fail.
  • The network can be interrupted.
  • The program can crash/shut down.
The utility can use external files to store the progress if needed. However, no external database shall be used, because we do not want to introduce another unnecessary dependency to run the utility.
Sample shell commands to create both tables are available in the data generation tool docs.
Spark/MapReduce jobs will produce proper logs, and no regressions should be introduced (comparing to the base data migration tool).

The tool will be benchmarked against the CDH quickstart HBASE table. You should document the amount of time used to migrate X number of devices, with Y number of channels, with Z number of values per day, and D days, with ranges X [1,1000], Y [1-100], Z[1-500], D[1-365].

Acceptance criteria: The tool must migrate proper amount of data for the specified input parameters


Final Submission Guidelines

Submit the complete source code for the tool
Submit a deployment guide.
Verification guide and benchmark results.
Submit a short demonstration video (unlisted Youtube link)

ELIGIBLE EVENTS:

2018 Topcoder(R) Open

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30061151