
Challenge Overview

We're starting a new challenge series to build a tool for migrating HBase tables. In a previous challenge we built a tool to create sample tables and generate sample data. In this challenge we are building a Python library for read-only access to that data.

The Python data access library will be used in future Python-based projects. The goal is to abstract away the actual HBase schema, so that if the schema changes in the future, programs using this library do not need to change.

The time series data (metric id, timestamp, value) produced by the data generation tool follows a specific table structure. There are two tables. The meta table stores the metric id, the device sensor path, and data type information. The data table stores the actual time series data.

The following is the HBase meta table schema:

The same data is stored in both the I-Row and the P-Row, because the data can be queried by either the integer metric id or the path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.

I-Row Schema: Stores the mapping from the metric id to the device sensor path

Rowkey: 'i' + 4-byte metric id integer.
Columns (ColumnFamily:ColumnQualifier):
- m:x (ASCII String) Stores the path string (deviceUUID\channelId\trait). e.g. 8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- m:p (1 byte) Stores the prefix 'A', 'B', ... to help split data among multiple regions.
- m:d (1 byte) Stores the data type (0: UNKNOWN, 1: FLOAT, 2: DOUBLE, 3: LONG, 4: BOOLEAN, 5: STRING).
- m:v (1 byte) Stores the schema version (always set to 5)
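
As an illustration of the I-Row layout above, here is a minimal sketch that builds an I-Row key and decodes the row's columns with Python's struct module. The function names are hypothetical. It assumes the metric id is a signed 4-byte integer (as a Java int would be) and that m:d and m:v hold raw byte values rather than ASCII digits; the schema description does not state either point. The row is expected as a dict of 'family:qualifier' byte strings to raw bytes, which is the shape HappyBase returns.

```python
import struct

# Data type codes as listed for the m:d column.
DATA_TYPES = {0: 'UNKNOWN', 1: 'FLOAT', 2: 'DOUBLE',
              3: 'LONG', 4: 'BOOLEAN', 5: 'STRING'}


def i_row_key(metric_id):
    """Build an I-Row key: ASCII 'i' followed by the 4-byte big-endian metric id."""
    # Assumption: the metric id is a signed 32-bit integer.
    return b'i' + struct.pack('>i', metric_id)


def decode_i_row(row):
    """Decode an I-Row given as a dict of b'family:qualifier' -> raw bytes."""
    path = row[b'm:x'].decode('ascii')
    device_uuid, channel_id, trait = path.split('\\')
    return {
        'path': path,
        'device_uuid': device_uuid,
        'channel_id': int(channel_id),
        'trait': trait,
        'prefix': row[b'm:p'].decode('ascii'),
        # Assumption: m:d and m:v are stored as raw one-byte numbers.
        'data_type': DATA_TYPES[struct.unpack('>B', row[b'm:d'])[0]],
        'schema_version': struct.unpack('>B', row[b'm:v'])[0],
    }
```

For example, i_row_key(9303) yields the five bytes 'i', 0x00, 0x00, 0x24, 0x57.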

P-Row Schema: Stores the mapping from the path (device\channel\trait) to the metric id

Rowkey: 'p' + string. e.g. p8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
Columns (ColumnFamily:ColumnQualifier):
- m:x (4 byte integer) Stores the metric id integer.
- m:p Same as above.
- m:d Same as above.
- m:v Same as above.
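
The P-Row lookup can be sketched in the same way. The helper names below are hypothetical, and the sketch again assumes the metric id stored in m:x is a signed 4-byte big-endian integer.

```python
import struct


def p_row_key(device_uuid, channel_id, trait):
    """Build a P-Row key: ASCII 'p' followed by deviceUUID\\channelId\\trait."""
    path = '\\'.join([device_uuid, str(channel_id), trait])
    return b'p' + path.encode('ascii')


def decode_metric_id(row):
    """Read the metric id from the m:x column of a P-Row."""
    # Assumption: signed 32-bit big-endian integer.
    return struct.unpack('>i', row[b'm:x'])[0]
```

With the sample path from the schema, p_row_key('8bedf2c8-a164-4f94-9d04-93d88a01d87b', 237, 'v') produces the rowkey shown in the example above.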

Custom Attributes

IN_MEMORY => 'true'
COMPRESSION => 'SNAPPY' - (or LZO)

Notes

Row Timestamp/Version: milliseconds since epoch
The numeric values are all Big-Endian encoded.

The following is the HBase data table schema. Each row represents the data of one metric id for a given UTC day. This is called a day bucket.
Rowkey: 7 bytes

Prefix (1 byte): The ASCII code for A, B, C, ..., Z. The prefix is randomly assigned to avoid hot spots, and the same prefix is always assigned to the same metricId.
MetricId (4-byte): An integer generated to represent a path (e.g. deviceUUID\channelId\trait)
DaysSinceEpoch (2-byte): Days index since 1/1/1970 UTC.
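
The 7-byte rowkey can be packed and unpacked with struct, as in the sketch below. The function names are hypothetical, and the signedness of the fields is an assumption: the metric id is treated as a signed 32-bit integer and DaysSinceEpoch as an unsigned 16-bit integer.

```python
import struct


def data_row_key(prefix, metric_id, days_since_epoch):
    """Build the 7-byte data rowkey: prefix + metric id + days since epoch."""
    if len(prefix) != 1:
        raise ValueError('prefix must be exactly one byte')
    # '>' selects big-endian with no padding, so 'iH' packs to 4 + 2 bytes.
    return prefix + struct.pack('>iH', metric_id, days_since_epoch)


def parse_data_row_key(key):
    """Split a data rowkey into (prefix, metric_id, days_since_epoch)."""
    if len(key) != 7:
        raise ValueError('data rowkeys are 7 bytes long')
    metric_id, days = struct.unpack('>iH', key[1:])
    # Slice instead of indexing so the prefix stays a byte string on Python 3.
    return key[0:1], metric_id, days
```

Calling data_row_key(b'A', 9303, 17295) gives a key that parse_data_row_key turns back into (b'A', 9303, 17295).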

Columns: A wide columns design is used to store values within the day bucket.

t:offset The column family is t. The column qualifier (offset) is a 4-byte integer giving the offset in seconds from the start of the current UTC day.
Row Timestamp/Version: milliseconds since epoch (not exposed in the query)
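
A day bucket can be turned back into (timestamp, value) pairs roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, and the value encodings (4-byte float, 8-byte double, 8-byte long) are assumed, since the schema only says that numeric values are big-endian. The row is expected as a dict of b't:<qualifier>' to raw bytes.

```python
import struct

SECONDS_PER_DAY = 86400
# Assumed value encodings per data type; not confirmed by the schema text.
VALUE_FORMATS = {'FLOAT': '>f', 'DOUBLE': '>d', 'LONG': '>q'}


def decode_day_bucket(days_since_epoch, row, data_type='FLOAT'):
    """Return sorted (epoch_seconds, value) pairs for one day-bucket row."""
    fmt = VALUE_FORMATS[data_type]
    day_start = days_since_epoch * SECONDS_PER_DAY
    points = []
    for column, raw in row.items():
        # Split on the first colon only: the binary qualifier may itself
        # contain the byte 0x3A (':'), for example at offset 58.
        family, qualifier = column.split(b':', 1)
        if family != b't':
            continue
        offset = struct.unpack('>i', qualifier)[0]
        points.append((day_start + offset, struct.unpack(fmt, raw)[0]))
    return sorted(points)
```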

Custom Table Attributes: MAX_VERSIONS => '1'

Notes

The numeric values are all Big-Endian encoded.

Example (5-minute interval trends; the interval can be flexible and is sometimes irregular)

Rowkey: A+9303+17295, Column:300, Column:600, Column:900, …, Column:second offset
Values: float, float, float, …
Note: A is the prefix, 9303 is the metric id, and 17295 is the daysSinceEpoch value for May 9, 2017 (UTC).
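
The day index in the example can be checked with a few lines of date arithmetic. The helper names are hypothetical; split_timestamp shows how an epoch timestamp in seconds maps to a day bucket and a column offset.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
SECONDS_PER_DAY = 86400


def days_since_epoch(day):
    """Number of whole days between 1970-01-01 and the given UTC date."""
    return (day - EPOCH).days


def date_from_days(days):
    """Inverse of days_since_epoch."""
    return EPOCH + datetime.timedelta(days=days)


def split_timestamp(epoch_seconds):
    """Split epoch seconds into (days_since_epoch, seconds offset in that day)."""
    return divmod(epoch_seconds, SECONDS_PER_DAY)


print(days_since_epoch(datetime.date(2017, 5, 9)))  # 17295
print(date_from_days(17295))                        # 2017-05-09
```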

Sample shell commands to create both tables are available in the data generation tool docs.

Write a Python 2.7 data access library, using HappyBase, that provides read-only access to both the meta table and the data table. It should support at least these operations:

Read meta data based on metric id.
Read data based on metric id and time range.
Read meta data based on device sensor path.
Read data based on the device sensor path and time range.
Read data based on device UUID and time range, filtered by specific traits.
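
One possible shape for these five operations is sketched below. It is not a reference design: the class name MetricReader and its method names are hypothetical, the time range is assumed to be integer epoch seconds with start inclusive and end exclusive, and the value encodings are assumed. The two constructor arguments only need the row() and scan() methods of a happybase.Table, so the class can be exercised with stand-in objects in unit tests.

```python
import struct

SECONDS_PER_DAY = 86400
# Assumed value encodings per m:d code (1: FLOAT, 2: DOUBLE, 3: LONG).
VALUE_FORMATS = {1: '>f', 2: '>d', 3: '>q'}


class MetricReader(object):
    """Hypothetical read-only facade over the meta and data tables."""

    def __init__(self, meta_table, data_table):
        self._meta = meta_table
        self._data = data_table

    def _meta_from_row(self, metric_id, path, row):
        return {'metric_id': metric_id, 'path': path,
                'prefix': row[b'm:p'],
                'data_type': struct.unpack('>B', row[b'm:d'])[0]}

    def meta_by_id(self, metric_id):
        row = self._meta.row(b'i' + struct.pack('>i', metric_id))
        if not row:  # HappyBase returns an empty dict for a missing row
            return None
        return self._meta_from_row(metric_id, row[b'm:x'].decode('ascii'), row)

    def meta_by_path(self, path):
        row = self._meta.row(b'p' + path.encode('ascii'))
        if not row:
            return None
        return self._meta_from_row(struct.unpack('>i', row[b'm:x'])[0], path, row)

    def data_by_id(self, metric_id, start, end):
        """Return sorted (epoch_seconds, value) pairs with start <= t < end."""
        meta = self.meta_by_id(metric_id)
        return self._read(meta, start, end) if meta else []

    def data_by_path(self, path, start, end):
        meta = self.meta_by_path(path)
        return self._read(meta, start, end) if meta else []

    def data_by_device(self, device_uuid, start, end, traits):
        """Map each path of the device whose trait is in `traits` to its points."""
        result = {}
        prefix = b'p' + device_uuid.encode('ascii') + b'\\'
        for key, row in self._meta.scan(row_prefix=prefix):
            path = key[1:].decode('ascii')
            if path.rsplit('\\', 1)[-1] in traits:
                metric_id = struct.unpack('>i', row[b'm:x'])[0]
                meta = self._meta_from_row(metric_id, path, row)
                result[path] = self._read(meta, start, end)
        return result

    def _read(self, meta, start, end):
        if end <= start:
            return []
        fmt = VALUE_FORMATS[meta['data_type']]
        head = meta['prefix'] + struct.pack('>i', meta['metric_id'])
        first_day = start // SECONDS_PER_DAY
        last_day = (end - 1) // SECONDS_PER_DAY
        # row_stop is exclusive, so stop one day past the last bucket.
        rows = self._data.scan(row_start=head + struct.pack('>H', first_day),
                               row_stop=head + struct.pack('>H', last_day + 1),
                               columns=[b't'])
        points = []
        for key, columns in rows:
            day = struct.unpack('>H', key[5:7])[0]
            for column, raw in columns.items():
                offset = struct.unpack('>i', column.split(b':', 1)[1])[0]
                t = day * SECONDS_PER_DAY + offset
                if start <= t < end:
                    points.append((t, struct.unpack(fmt, raw)[0]))
        return sorted(points)
```

Because callers only see metric ids, paths, and (timestamp, value) pairs, rowkey and column details stay inside the class, which is the abstraction goal stated in the overview. In a real deployment the two tables would come from something like happybase.Connection(host).table(name), with the host and table names supplied by configuration.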

Read performance is very important for the library, so we suggest reading the performance tuning chapter of the Apache HBase book, paying particular attention to the section on tuning read performance.
Unit tests are required.

Create a sample project that uses the library. Assume the DAL library is installed as a pip package; do not use a relative path to include the module.

The tool will be benchmarked against the CDH quickstart HBASE table. Use the data generation tool to generate test data.
Document the time taken to read data from both tables when they contain data for X devices, each with Y channels, Z values per day, and D days of data, where X is in [1, 1000], Y in [1, 100], Z in [1, 500], and D in [1, 365]. The time range for data reads should vary between 1 and 100 days.
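
The timings could be collected with a small harness along these lines. This is only a sketch: time_read and benchmark_record are hypothetical names, and taking the best of several repeats is one reasonable convention rather than a requirement of the challenge. timeit.default_timer is used because it is available on both Python 2.7 and Python 3.

```python
from timeit import default_timer


def time_read(read_fn, repeats=3):
    """Call read_fn() `repeats` times; return (best_seconds, last_result)."""
    best = None
    result = None
    for _ in range(repeats):
        started = default_timer()
        result = read_fn()
        elapsed = default_timer() - started
        if best is None or elapsed < best:
            best = elapsed
    return best, result


def benchmark_record(devices, channels, values_per_day, days, range_days, seconds):
    """Bundle one measurement with the X, Y, Z, D parameters it was taken at."""
    return {'X': devices, 'Y': channels, 'Z': values_per_day, 'D': days,
            'range_days': range_days, 'seconds': seconds}
```

A measurement would then wrap a single library call, for example time_read(lambda: reader.data_by_id(metric_id, start, end)), where reader, metric_id, start, and end stand for whatever objects the library under test provides.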

Acceptance criteria: The tool must provide the necessary methods to read the required data.

Final Submission Guidelines

Submit the complete source code for the library and the test project.
Submit a deployment guide.
Submit a verification guide and benchmark results.
Submit a short demonstration video (unlisted YouTube link).

Data migration series - Data access library

ID: 30060805