Challenge Overview
We're starting a new challenge series to build a tool for migrating HBase tables. In a previous challenge we built a tool to create sample tables and generate sample data; we will use that data generation tool here to produce the input for the benchmarks.
In this first challenge we will focus on migrating data from a specific table with a known structure. The migration tool will be implemented in Java. The following challenges will improve the tool by parallelizing the workload and adding support for migrating arbitrary tables. Please keep this in mind and isolate any part of the code that makes assumptions about the specific table structure, so it can be refactored easily later.
The time series data (metric id, timestamp, value) that we will be migrating has a specific table structure. There are two tables: a meta table that stores the metric id, the device sensor path, and the data type information, and a data table that stores the actual time series data.
The following is the HBase meta table schema (a rowkey-encoding sketch in Java follows the schema notes below):
The same data are stored in both the I-Row and the P-Row because the data can be queried via both the integer metric id and the path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.
I-Row Schema: Stores the mapping from the metric id to the device sensor path
- Rowkey: 'i' + 4-byte metric id integer.
- Columns (ColumnFamily:ColumnQualifier):
- m:x (ASCII String) Stores the path string (deviceUUID\channelId\trait). e.g. 8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- m:p (1 byte) Stores the prefix 'A', 'B', ... to help split data among multiple regions.
- m:d (1 byte) Stores the data type (0: UNKNOWN, 1: FLOAT, 2: DOUBLE, 3: LONG, 4: BOOLEAN, 5: STRING).
- m:v (1 byte) Stores the schema version (always set to 5)
P-Row Schema: Stores the mapping from the device sensor path to the metric id
- Rowkey: 'p' + path string. e.g. p8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- Columns (ColumnFamily:ColumnQualifier):
- m:x (4 byte integer) Stores the metric id integer.
- m:p Same as above.
- m:d Same as above.
- m:v Same as above.
- IN_MEMORY => 'true'
- COMPRESSION => 'SNAPPY' (or LZO)
- Row Timestamp/Version: milliseconds since epoch
- The numeric values are all Big-Endian encoded.
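As a concrete illustration of the meta table encoding above, here is a minimal sketch in Java. The class and method names are hypothetical, and the HBase client's Bytes and Put classes are assumed to be available; this is not a mandated implementation.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical helpers for encoding meta table rows; names are illustrative only. */
public class MetaRowCodec {

    private static final byte[] CF_M = Bytes.toBytes("m");

    /** I-Row key: 'i' + 4-byte big-endian metric id (5 bytes total). */
    public static byte[] iRowKey(int metricId) {
        byte[] key = new byte[5];
        key[0] = 'i';
        System.arraycopy(Bytes.toBytes(metricId), 0, key, 1, 4); // Bytes.toBytes(int) is big-endian
        return key;
    }

    /** P-Row key: 'p' + path string, e.g. "p" + "deviceUUID\channelId\trait". */
    public static byte[] pRowKey(String path) {
        return Bytes.toBytes("p" + path);
    }

    /** Builds the Put for an I-Row with the m:x, m:p, m:d and m:v columns. */
    public static Put iRowPut(int metricId, String path, byte prefix, byte dataType) {
        Put put = new Put(iRowKey(metricId));
        put.addColumn(CF_M, Bytes.toBytes("x"), Bytes.toBytes(path));     // path string
        put.addColumn(CF_M, Bytes.toBytes("p"), new byte[] { prefix });   // region split prefix 'A'..'Z'
        put.addColumn(CF_M, Bytes.toBytes("d"), new byte[] { dataType }); // data type code 0..5
        put.addColumn(CF_M, Bytes.toBytes("v"), new byte[] { 5 });        // schema version, always 5
        return put;
    }
}
```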
The following is the HBase data table schema (a rowkey-encoding sketch in Java follows the example below):
Rowkey: 7 bytes
- Prefix (1 byte): The ASCII code for one of A, B, C, ..., Z, a randomly assigned prefix to avoid hot-spotting. The same prefix is always assigned to the same metricId.
- MetricId (4 bytes): An integer generated to represent a path (e.g. deviceUUID\channelId\trait).
- DaysSinceEpoch (2 bytes): The day index since 1/1/1970 UTC.
Columns (ColumnFamily:ColumnQualifier):
- t:offset The qualifier offset is a 4-byte integer representing the seconds offset within the current UTC day.
- Row Timestamp/Version: milliseconds since epoch (not exposed in the query)
Notes
- The numeric values are all Big-Endian encoded.
- Example row: Rowkey A+9303+17295, with columns t:300, t:600, t:900, …, t:<seconds offset>
- Values: float, float, float, …
- In this example, A is the prefix, 9303 is the metric id, and daysSinceEpoch 17295 is 5/9/2017 UTC.
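Similarly, a minimal sketch (hypothetical names again) of building the 7-byte data rowkey and one t:offset column:

```java
import java.time.Instant;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical helpers for encoding data table rows; names are illustrative only. */
public class DataRowCodec {

    private static final byte[] CF_T = Bytes.toBytes("t");

    /** 7-byte rowkey: prefix (1 byte) + metric id (4 bytes, big-endian) + days since epoch (2 bytes, big-endian). */
    public static byte[] dataRowKey(byte prefix, int metricId, short daysSinceEpoch) {
        byte[] key = new byte[7];
        key[0] = prefix;
        System.arraycopy(Bytes.toBytes(metricId), 0, key, 1, 4);
        System.arraycopy(Bytes.toBytes(daysSinceEpoch), 0, key, 5, 2);
        return key;
    }

    /** Builds the Put for one float value at the given UTC instant. */
    public static Put valuePut(byte prefix, int metricId, Instant ts, float value) {
        long epochSecond = ts.getEpochSecond();
        short daysSinceEpoch = (short) (epochSecond / 86400); // days since 1/1/1970 UTC
        int secondsOffset = (int) (epochSecond % 86400);      // seconds offset within the UTC day
        Put put = new Put(dataRowKey(prefix, metricId, daysSinceEpoch));
        // Column qualifier is the 4-byte big-endian seconds offset, e.g. t:300 for 00:05:00 UTC.
        put.addColumn(CF_T, Bytes.toBytes(secondsOffset), Bytes.toBytes(value));
        return put;
    }
}
```

For the example above, prefix 'A' with metric id 9303 and daysSinceEpoch 17295 (5/9/2017 UTC) produces the rowkey shown, and the qualifier 300 holds the value recorded 300 seconds into that day.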
The goal is to create a Java-based tool, with a standard Unix-style command line interface, that migrates the data from one HBase table to another. There will be two modes for the tool: export and import.
In export mode, the tool will dump the data from an HBase table to a CSV file and an Apache Avro file, or to a set of CSV and/or Avro files. Both the CSV and the Avro file schemas should be designed to be as generic and reusable as possible, and should be clearly documented. The size of the generated files needs to be manageable: each file should be small enough to open in Excel or to be read by another program without freezing the user's session. Both formats are supported because CSV favors readability and portability while Avro favors performance, which serves different use cases.
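The exact file layouts are left to the implementer; as one possible generic design (the field names here are illustrative, not mandated), each exported record could carry the metric id, path, data type, timestamp, and value. A minimal Avro writer sketch using the Avro Java library:

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

/** Sketch of one possible export record schema; the field names are assumptions. */
public class AvroExportSketch {

    static final Schema RECORD = SchemaBuilder.record("TimeSeriesPoint")
            .fields()
            .requiredInt("metricId")
            .requiredString("path")      // deviceUUID\channelId\trait
            .requiredInt("dataType")     // 0..5, same codes as the meta table
            .requiredLong("timestamp")   // milliseconds since epoch, UTC
            .requiredDouble("value")     // stored as double for simplicity in this sketch
            .endRecord();

    public static void main(String[] args) throws IOException {
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(RECORD))) {
            writer.create(RECORD, new File("export-part-0001.avro"));
            GenericRecord rec = new GenericData.Record(RECORD);
            rec.put("metricId", 9303);
            rec.put("path", "8bedf2c8-a164-4f94-9d04-93d88a01d87b\\237\\v");
            rec.put("dataType", 1);
            rec.put("timestamp", 17295L * 86400_000L + 300_000L); // day 17295, offset 300 s
            rec.put("value", 1.25);
            writer.append(rec);
        }
    }
}
```

A CSV file for the same layout could simply start with a header line such as metricId,path,dataType,timestamp,value, with one record per line, and the exporter could roll to a new file after a configurable number of records to keep each file small.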
In import mode, the user can choose either of the two formats as input and import the data from the files back into the same or a new HBase table. The user can pass a list of device UUIDs and/or a list of channelIds, plus a time range, to the utility. The utility shall manage this potentially lengthy operation and be able to resume the task if a failure happens during hours of execution. The following exceptions can happen; the utility shall handle them and be able to resume the work, either automatically or manually (see the checkpoint sketch after this list):
- HBase is too busy, so a write operation fails.
- The network is interrupted.
- The program crashes or is shut down.
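One possible way to satisfy the resume requirement (a sketch only; the checkpoint file name, format, and batch granularity are assumptions) is to persist a small checkpoint after every batch that was durably written to HBase and to skip already-imported records on restart:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Minimal checkpoint sketch: remembers how many input records were durably written to HBase. */
public class ImportCheckpoint {

    private final Path file;

    public ImportCheckpoint(String path) {
        this.file = Paths.get(path);
    }

    /** Returns the record index to resume from, or 0 when no checkpoint exists (fresh run). */
    public long load() throws IOException {
        if (!Files.exists(file)) {
            return 0L;
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    /** Records that all records before nextIndex were imported successfully. */
    public void save(long nextIndex) throws IOException {
        // Write to a temp file first, then rename, so a crash mid-save cannot corrupt the checkpoint.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(nextIndex).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

On restart the importer loads the checkpoint, skips that many input records, and continues; transient HBase or network errors can be retried in a bounded loop around each batch before the checkpoint is advanced.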
The tool should run on both CentOS and Windows. The latest CDH (Cloudera QuickStart VM/Docker) should be used to set up the HBase environment. A standard open source command line library and Log4j shall be used in the utility.
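The command line library is not named in the spec; assuming Apache Commons CLI is an acceptable choice, option parsing for the two modes could look like the following sketch (the option names and defaults are illustrative only):

```java
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.log4j.Logger;

/** Illustrative CLI entry point; option names are assumptions, not part of the spec. */
public class MigrationCli {

    private static final Logger LOG = Logger.getLogger(MigrationCli.class);

    public static void main(String[] args) {
        Options options = new Options();
        options.addOption("m", "mode", true, "export or import");
        options.addOption("f", "format", true, "csv or avro");
        options.addOption("d", "devices", true, "comma-separated device UUIDs (import filter)");
        options.addOption("c", "channels", true, "comma-separated channelIds (import filter)");
        options.addOption("s", "start", true, "start of time range, e.g. 2017-05-09T00:00:00Z");
        options.addOption("e", "end", true, "end of time range");
        options.addOption("o", "output", true, "directory for exported CSV/Avro files");

        try {
            CommandLine cmd = new DefaultParser().parse(options, args);
            String mode = cmd.getOptionValue("mode", "export");
            LOG.info("Starting migration tool in " + mode + " mode");
            // ... dispatch to the export or import implementation here
        } catch (ParseException e) {
            new HelpFormatter().printHelp("hbase-migrate", options);
        }
    }
}
```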
The utility will be benchmarked against the CDH QuickStart HBase table. Document the time taken to export and import X metric ids, over Y days, with Z values per day, where X is in [1, 1000], Y is in [1, 365], and Z is in [1, 500].
Acceptance criteria: The tool must export and import the data without losing any of the records.
Final Submission Guidelines
Submit the complete source code for the tool.
Submit a deployment guide.
Submit a verification guide and benchmark results.
Submit a short demonstration video (unlisted YouTube link).