Topcoder Challenge | Topcoder Community

Challenge Overview

We're starting a new challenge series to build a tool for migrating HBase tables. In a previous challenge we built a tool that creates sample tables and generates sample data. In this challenge we will improve that tool so it can generate larger amounts of data.

The basic idea for improving performance is to parallelize data generation. You can use either Apache Spark or MapReduce. The data generation job will run on a Cloudera cluster (the Cloudera quickstart Docker image or VM during review), and the core requirement of this challenge is to extract as much performance as possible from the available resources. Performance will account for 70% of the total score during review. We strongly suggest reading the performance tuning chapter of the Apache HBase book, paying particular attention to the section on tuning write performance.
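As a rough illustration of the parallelization idea, the sketch below splits a range of metric ids into contiguous chunks so that each task generates its own slice. It is plain Python and does not depend on Spark or MapReduce; the function name `split_ranges` and the choice to partition by metric id are assumptions made for this example, not requirements of the challenge.

```python
def split_ranges(total, parts):
    """Split range(total) into at most `parts` contiguous (start, end) ranges.

    `end` is exclusive. Sizes differ by at most one, so tasks get even work.
    """
    if total < 0 or parts < 1:
        raise ValueError("total must be >= 0 and parts >= 1")
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more parts than items; remaining ranges would be empty
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # Hypothetical example: 10 metric ids spread over 3 tasks.
    for start, end in split_ranges(10, 3):
        print(f"task generates metric ids {start}..{end - 1}")
```

With Spark, each range could plausibly become one partition; with MapReduce, one input split. Either way, the mapping from ranges to tasks is left to the implementation.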

The time series data to generate (metric id, timestamp, value) must fit a specific table structure. There are two tables. The meta table stores the metric id, the device sensor path, and data type information. The data table stores the actual time series data.
The following is the HBase meta table schema:
The same data are stored in both I-Row and P-Row, because the data can be queried via both the integer metric id and path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.
I-Row Schema: Stores mapping from the metric id to the device sensor path

Rowkey: 'i' + 4-byte metric id integer.
Columns (ColumnFamily:ColumnQualifier):
- m:x (ASCII String) Stores the path string (deviceUUID\channelId\trait). e.g. 8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- m:p (1 byte) Stores the prefix 'A', 'B', ... to help split data among multiple regions.
- m:d (1 byte) Stores the data type (0: UNKNOWN, 1: FLOAT, 2: DOUBLE, 3: LONG, 4: BOOLEAN, 5: STRING).
- m:v (1 byte) Stores the schema version (always set to 5)

P-Row Schema: Stores mapping from the path (device\channel\trait) to metric id

Rowkey: 'p' + string. e.g. p8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
Columns (ColumnFamily:ColumnQualifier):
- m:x (4 byte integer) Stores the metric id integer.
- m:p Same as above.
- m:d Same as above.
- m:v Same as above.
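The I-Row and P-Row layouts above can be sketched as byte encodings. This is a minimal sketch using only Python's `struct` module, with no HBase client involved; the function name `meta_rows`, the dictionary representation of cells, and the use of a signed 4-byte integer for the metric id are assumptions for illustration.

```python
import struct

# Data type codes as listed for m:d.
DATA_TYPES = {"UNKNOWN": 0, "FLOAT": 1, "DOUBLE": 2,
              "LONG": 3, "BOOLEAN": 4, "STRING": 5}
SCHEMA_VERSION = 5  # m:v is always 5


def meta_rows(metric_id, device_uuid, channel_id, trait, prefix, data_type):
    """Return (i_row, p_row); each is (rowkey_bytes, {column: value_bytes})."""
    # Path string: deviceUUID\channelId\trait, stored as ASCII.
    path = "\\".join([device_uuid, str(channel_id), trait]).encode("ascii")
    # Big-endian 4-byte integer (signed is an assumption of this sketch).
    metric_bytes = struct.pack(">i", metric_id)
    shared = {
        b"m:p": prefix.encode("ascii"),           # 1-byte region prefix
        b"m:d": bytes([DATA_TYPES[data_type]]),   # 1-byte data type
        b"m:v": bytes([SCHEMA_VERSION]),          # 1-byte schema version
    }
    i_row = (b"i" + metric_bytes, {b"m:x": path, **shared})
    p_row = (b"p" + path, {b"m:x": metric_bytes, **shared})
    return i_row, p_row


if __name__ == "__main__":
    i_row, p_row = meta_rows(
        9303, "8bedf2c8-a164-4f94-9d04-93d88a01d87b", 237, "v", "A", "FLOAT")
    print(i_row[0].hex(), p_row[0])
```

Both rows share the m:p, m:d, and m:v cells, so only the rowkey and m:x differ between them.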

Custom Attributes

IN_MEMORY => 'true'
COMPRESSION => 'SNAPPY' - (or LZO)

Notes

Row Timestamp/Version: milliseconds since epoch
The numeric values are all Big-Endian encoded.

The following is the HBase data table schema. Each row holds the data of one metric id for a given UTC day; this is called a day bucket.
Rowkey: 7 bytes

Prefix (1 byte): The ASCII code for A, B, C, ..., Z, a randomly assigned prefix to avoid hot spots. The same prefix is always assigned to the same metricId.
MetricId (4-byte): An integer generated to represent a path (e.g. deviceUUID\channelId\trait)
DaysSinceEpoch (2-byte): Days index since 1/1/1970 UTC.
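The 7-byte rowkey described above can be sketched with Python's `struct` module. The function names here are hypothetical, the prefix is taken as an input because the assignment scheme is not specified, and the 2-byte day index is packed as unsigned, which is an assumption (signed and unsigned encodings are identical for day values below 32768).

```python
import struct
from datetime import date

EPOCH = date(1970, 1, 1)


def days_since_epoch(day):
    """Number of days between 1/1/1970 and the given UTC calendar day."""
    return (day - EPOCH).days


def data_row_key(prefix, metric_id, day):
    """Build the rowkey: 1-byte prefix + 4-byte metric id + 2-byte day index.

    ">" selects big-endian with no padding, so the result is exactly 7 bytes.
    """
    return struct.pack(">ciH", prefix.encode("ascii"), metric_id,
                       days_since_epoch(day))


if __name__ == "__main__":
    key = data_row_key("A", 9303, date(2017, 5, 9))
    print(len(key), key.hex())
```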

Columns: A wide columns design is used to store values within the day bucket.

t:offset - The column qualifier (offset) is a 4-byte integer giving the offset in seconds within the current UTC day.
Row Timestamp/Version: milliseconds since epoch (not exposed in the query)

Custom Table Attributes: MAX_VERSIONS=>'1'
Notes

The numeric values are all Big-Endian encoded.

Example (5-minute interval trends; the interval can be flexible and is sometimes irregular)

Rowkey: A+9303+17295, Column:300, Column:600, Column:900, …, Column:second offset
Values: float, float, float, …
Note: A is the prefix, 9303 is the metric id, and 17295 is the daysSinceEpoch value for 5/9/2017 UTC.
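To make the example concrete, the sketch below encodes the cells of one day bucket and converts a daysSinceEpoch value back to a calendar date. The sample values (21.5, 21.75, 22.0) are invented for illustration, the names `encode_cells` and `bucket_day` are hypothetical, and encoding values as 4-byte big-endian floats assumes the FLOAT data type.

```python
import struct
from datetime import date, timedelta

SECONDS_PER_DAY = 86400


def encode_cells(samples):
    """Map {second_offset: value} to {(family, qualifier): value_bytes}.

    The qualifier is the 4-byte big-endian offset within the UTC day;
    the value is a 4-byte big-endian float.
    """
    cells = {}
    for offset, value in sorted(samples.items()):
        if not 0 <= offset < SECONDS_PER_DAY:
            raise ValueError(f"offset {offset} is outside the UTC day")
        cells[(b"t", struct.pack(">i", offset))] = struct.pack(">f", value)
    return cells


def bucket_day(days_since_epoch):
    """Convert a daysSinceEpoch index back to its UTC calendar day."""
    return date(1970, 1, 1) + timedelta(days=days_since_epoch)


if __name__ == "__main__":
    # Hypothetical readings at 5-minute offsets 300, 600, 900.
    cells = encode_cells({300: 21.5, 600: 21.75, 900: 22.0})
    for (family, qualifier), value in cells.items():
        print(family, qualifier.hex(), value.hex())
    print(bucket_day(17295))
```

Because the offsets are passed in explicitly, irregular intervals need no special handling.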

Sample shell commands to create both tables are available in the data generation tool docs.
The Spark/MapReduce jobs must produce proper logs, and no regressions should be introduced (compared to the base data generation tool).

The tool will be benchmarked against the CDH quickstart HBase table. You should document the time taken to generate data for X devices, with Y channels each, Z values per day, and D days, with ranges X [1-1000], Y [1-100], Z [1-500], D [1-365].
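For planning benchmark runs, the expected data volume for a given (X, Y, Z, D) can be estimated as below. This sketch assumes one metric per device channel (a single trait per channel), which the description does not state; if several traits are generated per channel, the counts scale accordingly. The function name `estimate_volume` is hypothetical.

```python
# Allowed input ranges from the benchmark description (inclusive).
RANGES = {"devices": (1, 1000), "channels": (1, 100),
          "values_per_day": (1, 500), "days": (1, 365)}


def estimate_volume(devices, channels, values_per_day, days):
    """Estimate row and cell counts, assuming one metric per device channel."""
    params = {"devices": devices, "channels": channels,
              "values_per_day": values_per_day, "days": days}
    for name, value in params.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be in [{low}, {high}], got {value}")
    metrics = devices * channels
    return {
        "metrics": metrics,
        "meta_rows": 2 * metrics,          # one I-Row and one P-Row per metric
        "data_rows": metrics * days,       # one day bucket per metric per day
        "data_cells": metrics * days * values_per_day,
    }


if __name__ == "__main__":
    print(estimate_volume(1000, 100, 500, 365))
```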

Acceptance criteria: The tool must generate the correct amount of data for the specified input fields.

Final Submission Guidelines

Submit the complete source code for the tool.
Submit a deployment guide.
Submit a verification guide and benchmark results.
Submit a short demonstration video (unlisted YouTube link).

Data migration series - Scaling the data generation tool

ID: 30060804