Topcoder Challenge | Topcoder Community

Challenge Overview

We're starting a new challenge series to build a tool for migrating HBase tables. The migration tool itself will be built in subsequent challenges; this fast challenge will build a tool that creates the sample tables and generates sample data. The test data generated here will be used as input for benchmarks in the subsequent challenges.

Please note that this challenge has a short timeline (48 hours for submission, 24 hours for review, 12 hours for appeals).

The time series data (metric id, timestamp, value) to be generated follows a specific table structure. There are two tables. One is the meta table, which stores the metric id, the device sensor path, and data type information. The other is the data table, which stores the actual time series data.
The following is the HBase meta table schema:
The same data is stored in both the I-Row and the P-Row because the data can be queried by either the integer metric id or the path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.
I-Row Schema: Stores mapping from the metric id to the device sensor path

Rowkey: 'i' + 4-byte metric id integer.
Columns (ColumnFamily:ColumnQualifier):
- m:x (ASCII String) Stores the path string (deviceUUID\channelId\trait). e.g. 8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- m:p (1 byte) Stores the prefix 'A', 'B', ... to help split data among multiple regions.
- m:d (1 byte) Stores the data type (0: UNKNOWN, 1: FLOAT, 2: DOUBLE, 3: LONG, 4: BOOLEAN, 5: STRING).
- m:v (1 byte) Stores the schema version (always set to 5)

P-Row Schema: Stores mapping from the path (device\channel\trait) to metric id

Rowkey: 'p' + string. e.g. p8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
Columns (ColumnFamily:ColumnQualifier):
- m:x (4 byte integer) Stores the metric id integer.
- m:p Same as above.
- m:d Same as above.
- m:v Same as above.
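
The two meta row layouts above can be sketched with plain byte buffers. This is a minimal illustration, not part of the challenge material: the class and method names (MetaRowCodec, iRowKey, pRowKey, and so on) are hypothetical, and it assumes the data type codes follow the order listed for m:d. It relies on java.nio.ByteBuffer being Big-Endian by default, which matches the encoding note for this table.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of the I-Row / P-Row encodings (all names here are hypothetical). */
final class MetaRowCodec {
    /** Codes stored in m:d; the ordinal matches the code listed in the schema. */
    enum DataType { UNKNOWN, FLOAT, DOUBLE, LONG, BOOLEAN, STRING }

    /** m:v is always set to 5 according to the schema. */
    static final byte SCHEMA_VERSION = 5;

    /** Builds the path string deviceUUID\channelId\trait. */
    static String path(String deviceUuid, int channelId, String trait) {
        return deviceUuid + "\\" + channelId + "\\" + trait;
    }

    /** I-Row key: 'i' followed by the 4-byte Big-Endian metric id. */
    static byte[] iRowKey(int metricId) {
        return ByteBuffer.allocate(5).put((byte) 'i').putInt(metricId).array();
    }

    /** P-Row key: 'p' followed by the ASCII path string. */
    static byte[] pRowKey(String path) {
        byte[] p = path.getBytes(StandardCharsets.US_ASCII);
        return ByteBuffer.allocate(1 + p.length).put((byte) 'p').put(p).array();
    }

    /** m:x value in the P-Row: the 4-byte Big-Endian metric id. */
    static byte[] metricIdValue(int metricId) {
        return ByteBuffer.allocate(4).putInt(metricId).array();
    }

    /** m:d value: a single byte holding the data type code. */
    static byte[] dataTypeValue(DataType type) {
        return new byte[] { (byte) type.ordinal() };
    }

    public static void main(String[] args) {
        String p = path("8bedf2c8-a164-4f94-9d04-93d88a01d87b", 237, "v");
        System.out.println("P-Row key: " + new String(pRowKey(p), StandardCharsets.US_ASCII));
        System.out.println("I-Row key length: " + iRowKey(9303).length);
    }
}
```

In the I-Row, m:x would hold the ASCII bytes of the same path string that forms the P-Row key after the 'p' marker.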

Custom Attributes

IN_MEMORY => 'true'
COMPRESSION => 'SNAPPY' - (or LZO)

Notes

Row Timestamp/Version: milliseconds since epoch
The numeric values are all Big-Endian encoded.

The following is the HBase data table schema. Each row holds the data of one metric id for a given UTC day. This is called a day bucket.
Rowkey: 7 bytes

Prefix (1 byte): The ASCII code for A, B, C, ..., Z. The prefix is assigned randomly to avoid region hot spots, and the same prefix is always assigned to the same metricId.
MetricId (4-byte): An integer generated to represent a path (e.g. deviceUUID\channelId\trait)
DaysSinceEpoch (2-byte): Days index since 1/1/1970 UTC.
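
As a rough sketch of the 7-byte rowkey above: the class below picks a random prefix the first time a metric id is seen and reuses it afterwards, then packs prefix, metric id, and day index in Big-Endian order. The class name DataRowKey, the in-memory map, and the seeded Random are assumptions made for illustration; a real tool would presumably persist the chosen prefix in the meta table's m:p column.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Sketch of the data table rowkey: prefix (1) + metricId (4) + daysSinceEpoch (2). */
final class DataRowKey {
    // Hypothetical in-memory registry so a metric id always gets the same prefix.
    private final Map<Integer, Byte> prefixes = new HashMap<>();
    private final Random random;

    DataRowKey(long seed) {
        this.random = new Random(seed);
    }

    /** Returns a prefix in 'A'..'Z', chosen randomly once per metric id. */
    byte prefixFor(int metricId) {
        return prefixes.computeIfAbsent(metricId, id -> (byte) ('A' + random.nextInt(26)));
    }

    /** Packs the three fields into 7 bytes; ByteBuffer is Big-Endian by default. */
    static byte[] encode(byte prefix, int metricId, int daysSinceEpoch) {
        if (daysSinceEpoch < 0 || daysSinceEpoch > 0xFFFF) {
            throw new IllegalArgumentException("daysSinceEpoch does not fit in 2 bytes: " + daysSinceEpoch);
        }
        return ByteBuffer.allocate(7)
                .put(prefix)
                .putInt(metricId)
                .putShort((short) daysSinceEpoch)
                .array();
    }

    public static void main(String[] args) {
        DataRowKey keys = new DataRowKey(1L);
        byte prefix = keys.prefixFor(9303);
        byte[] rowKey = encode(prefix, 9303, 17295);
        System.out.println("prefix=" + (char) prefix + " length=" + rowKey.length);
    }
}
```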

Columns: A wide columns design is used to store values within the day bucket.

t:offset The offset is a 4-byte integer representing the seconds offset during the current UTC day.
Row Timestamp/Version: milliseconds since epoch (not exposed in the query)
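
The column layout can be illustrated in the same way. The sketch below encodes a qualifier in the t family as a 4-byte Big-Endian seconds offset and a FLOAT sample as 4 Big-Endian bytes. The names DataCell, qualifier, and floatValue are hypothetical, and only the FLOAT data type is shown; other types from the meta table would need their own encoders.

```java
import java.nio.ByteBuffer;

/** Sketch of one cell in a day bucket: qualifier = seconds offset, value = float. */
final class DataCell {
    static final int SECONDS_PER_DAY = 86_400;

    /** Column qualifier: seconds since the start of the UTC day, as 4 Big-Endian bytes. */
    static byte[] qualifier(int secondsOffset) {
        if (secondsOffset < 0 || secondsOffset >= SECONDS_PER_DAY) {
            throw new IllegalArgumentException("offset outside the UTC day: " + secondsOffset);
        }
        return ByteBuffer.allocate(4).putInt(secondsOffset).array();
    }

    /** Reads the seconds offset back from a qualifier. */
    static int decodeQualifier(byte[] qualifier) {
        return ByteBuffer.wrap(qualifier).getInt();
    }

    /** Cell value for the FLOAT data type, as 4 Big-Endian bytes. */
    static byte[] floatValue(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    public static void main(String[] args) {
        byte[] q = qualifier(300);
        System.out.println("offset read back: " + decodeQualifier(q));
    }
}
```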

Custom Table Attributes: MAX_VERSIONS=>'1'
Notes

The numeric values are all Big-Endian encoded.

Example (5-minute interval trends; the interval can be flexible and is sometimes irregular)

Rowkey: A+9303+17295, Column:300, Column:600, Column:900, …, Column:second offset
Values: float, float, float, …
Note: A is the prefix, 9303 is the metric id, and 17295 is the daysSinceEpoch value for 5/9/2017 (May 9, 2017) UTC.
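
The numbers in this example can be reproduced with java.time. The sketch below splits a UTC timestamp into the day bucket index and the seconds offset within that day; the class name DayBucket and its method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDate;

/** Sketch: split a UTC timestamp into (daysSinceEpoch, secondsOffset). */
final class DayBucket {
    private static final long SECONDS_PER_DAY = 86_400L;

    /** Day index since 1/1/1970 UTC (the DaysSinceEpoch part of the rowkey). */
    static int daysSinceEpoch(Instant t) {
        return (int) Math.floorDiv(t.getEpochSecond(), SECONDS_PER_DAY);
    }

    /** Seconds elapsed since the start of that UTC day (the column qualifier). */
    static int secondsOffset(Instant t) {
        return (int) Math.floorMod(t.getEpochSecond(), SECONDS_PER_DAY);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2017-05-09T00:05:00Z");
        System.out.println("day=" + daysSinceEpoch(t) + " offset=" + secondsOffset(t));
        System.out.println("date of bucket: " + LocalDate.ofEpochDay(daysSinceEpoch(t)));
    }
}
```

For the timestamp used in main, the day index is 17295 and the offset is 300, which matches the first column of the example row.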

Sample shell commands to create both tables are given in the forums.

The goal is to create a Java- or Scala-based tool with a standard Unix-style command line interface that generates a user-defined amount of data. The user will specify the time range (month, year, etc.), the number of devices, and the number of channels per device. The tool should run on CentOS and Windows. The latest CDH (Cloudera QuickStart VM/Docker image) should be used to set up the HBase environment. A standard open source command line library and Log4j shall be used in the utility.

Acceptance criteria: The tool must generate the proper amount of data for the specified input fields.
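
One way to check the "proper amount" is to compute the expected row and cell counts from the inputs up front. The sketch below is only an illustration under stated assumptions: the class ExpectedVolume, the traitsPerChannel parameter, and the fixed sampling interval are not defined by the challenge, which only names the time range, device count, and channels per device as inputs.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Sketch: expected data volume for a given set of generator inputs. */
final class ExpectedVolume {
    final long metrics;   // one metric id per device/channel/trait path
    final long metaRows;  // one I-Row plus one P-Row per metric
    final long dataRows;  // one day bucket per metric per UTC day
    final long cells;     // one column per sample inside each day bucket

    private ExpectedVolume(long metrics, long days, long samplesPerDay) {
        this.metrics = metrics;
        this.metaRows = 2 * metrics;
        this.dataRows = metrics * days;
        this.cells = this.dataRows * samplesPerDay;
    }

    /** traitsPerChannel and intervalSeconds are assumed inputs, not challenge requirements. */
    static ExpectedVolume of(LocalDate start, LocalDate endExclusive, int devices,
                             int channelsPerDevice, int traitsPerChannel, int intervalSeconds) {
        if (intervalSeconds <= 0 || !endExclusive.isAfter(start)) {
            throw new IllegalArgumentException("invalid interval or time range");
        }
        long days = ChronoUnit.DAYS.between(start, endExclusive);
        long metrics = (long) devices * channelsPerDevice * traitsPerChannel;
        long samplesPerDay = 86_400L / intervalSeconds;
        return new ExpectedVolume(metrics, days, samplesPerDay);
    }

    public static void main(String[] args) {
        ExpectedVolume v = of(LocalDate.of(2017, 5, 1), LocalDate.of(2017, 6, 1), 2, 3, 4, 300);
        System.out.println("metaRows=" + v.metaRows + " dataRows=" + v.dataRows + " cells=" + v.cells);
    }
}
```

Comparing these figures with row counts taken from HBase after a run gives a simple acceptance check.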

Final Submission Guidelines

Submit the complete source code for the tool.
Submit a deployment guide.
Submit a short demonstration video (unlisted YouTube link).

Data migration series - Sample data generation tool

Learn

Review style

Final Review

Approval

ID: 30060641