Challenge Overview
We're starting a new challenge series to build a tool for migrating HBase tables. The migration tool itself will be built in later challenges; this fast challenge builds a utility that creates sample tables and generates sample data. The test data generated here will be used as input for benchmarks in the following challenges.
Please note that this challenge has a short timeline (48 hrs submission, 24 hrs review, 12 hrs appeals).
The time series data to be generated (metric id, timestamp, value) follows a specific table structure. There are two tables. One is the meta table, which stores the metric id, the device sensor path, and data type information. The other is the data table, which stores the actual time series data.
The following is the HBase meta table schema:
The same data are stored in both I-Row and P-Row, because the data can be queried via both the integer metric id and path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.
I-Row Schema: Stores mapping from the metric id to the device sensor path
- Rowkey: 'i' + 4-byte metric id integer.
- Columns (ColumnFamily:ColumnQualifier):
  - m:x (ASCII String) Stores the path string (deviceUUID\channelId\trait). e.g. 8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
  - m:p (1 byte) Stores the prefix 'A', 'B', ... to help split data among multiple regions.
  - m:d (1 byte) Stores the data type (0: UNKNOWN, 1: FLOAT, 2: DOUBLE, 3: LONG, 4: BOOLEAN, 5: STRING).
  - m:v (1 byte) Stores the schema version (always set to 5)
P-Row Schema: Stores mapping from the device sensor path to the metric id
- Rowkey: 'p' + path string. e.g. p8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- Columns (ColumnFamily:ColumnQualifier):
  - m:x (4-byte integer) Stores the metric id integer.
  - m:p Same as above.
  - m:d Same as above.
  - m:v Same as above.
- IN_MEMORY => 'true'
- COMPRESSION => 'SNAPPY' (or LZO)
- Row Timestamp/Version: milliseconds since epoch
- The numeric values are all Big-Endian encoded.
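As a rough illustration of the meta table layout above, the following Java sketch builds the I-Row and P-Row keys and the 4-byte m:x value using only the JDK. The class and method names (MetaRowKeys, iRowKey, pRowKey, path, metricIdValue) are hypothetical and not part of the challenge. ByteBuffer writes big-endian by default, which matches the encoding note. Writing the bytes to HBase is deliberately left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative only, not defined by the challenge.
final class MetaRowKeys {

    /** I-Row key: ASCII 'i' followed by the 4-byte big-endian metric id (5 bytes). */
    public static byte[] iRowKey(int metricId) {
        return ByteBuffer.allocate(5).put((byte) 'i').putInt(metricId).array();
    }

    /** Path string in the form deviceUUID\channelId\trait. */
    public static String path(String deviceUuid, int channelId, String trait) {
        return deviceUuid + "\\" + channelId + "\\" + trait;
    }

    /** P-Row key: ASCII 'p' followed by the ASCII path string. */
    public static byte[] pRowKey(String path) {
        return ("p" + path).getBytes(StandardCharsets.US_ASCII);
    }

    /** Value of the P-Row m:x column: the metric id as a 4-byte big-endian integer. */
    public static byte[] metricIdValue(int metricId) {
        return ByteBuffer.allocate(4).putInt(metricId).array();
    }

    public static void main(String[] args) {
        String p = path("8bedf2c8-a164-4f94-9d04-93d88a01d87b", 237, "v");
        System.out.println(p);
        System.out.println("I-Row key length: " + iRowKey(9303).length);
        System.out.println("P-Row key length: " + pRowKey(p).length);
    }
}
```

The same metric id and path pair would be written twice, once under each key, so that lookups work in both directions.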
The following is the HBase data table schema:
Rowkey: 7 bytes
- Prefix (1 byte): The ASCII code for A, B, C, ..., Z; a randomly assigned prefix to avoid hot spots. The same prefix is always assigned to the same metricId.
- MetricId (4 bytes): An integer generated to represent a path (e.g. deviceUUID\channelId\trait)
- DaysSinceEpoch (2 bytes): Day index since 1/1/1970 UTC.
- Column t:offset: The column qualifier (offset) is a 4-byte integer representing the offset in seconds within the current UTC day.
- Row Timestamp/Version: milliseconds since epoch (not exposed in the query)
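The 7-byte rowkey and the 4-byte offset qualifier described above could be encoded along these lines. This is a minimal sketch using only the JDK; the class DataRowKey and its methods are hypothetical names. Treating the 2-byte day index as unsigned (0 to 65535) and encoding float values as 4-byte big-endian IEEE 754 are assumptions, since the schema only states the field widths and the big-endian rule.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: names are illustrative only, not defined by the challenge.
final class DataRowKey {

    /** Rowkey = prefix (1 byte) + metric id (4 bytes) + days since epoch (2 bytes), big-endian. */
    public static byte[] encode(char prefix, int metricId, int daysSinceEpoch) {
        if (prefix < 'A' || prefix > 'Z') {
            throw new IllegalArgumentException("prefix must be in A..Z: " + prefix);
        }
        // Assumption: the 2-byte day index is unsigned.
        if (daysSinceEpoch < 0 || daysSinceEpoch > 0xFFFF) {
            throw new IllegalArgumentException("daysSinceEpoch out of range: " + daysSinceEpoch);
        }
        return ByteBuffer.allocate(7)
                .put((byte) prefix)
                .putInt(metricId)
                .putShort((short) daysSinceEpoch)
                .array();
    }

    /** Column qualifier: seconds offset within the UTC day as a 4-byte big-endian integer. */
    public static byte[] qualifier(int secondsOfDay) {
        return ByteBuffer.allocate(4).putInt(secondsOfDay).array();
    }

    /** Assumption: FLOAT cell values are stored as 4-byte big-endian IEEE 754. */
    public static byte[] floatValue(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode('A', 9303, 17295)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints "41 00 00 24 57 43 8F"
    }
}
```

Because the prefix is the first byte, rows for one metric sort together within a prefix, ordered by day.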
Notes
- The numeric values are all Big-Endian encoded.
- Example rowkey: A+9303+17295, with columns Column:300, Column:600, Column:900, …, Column:second offset
- Values: float, float, float, …
- Note: A is the prefix, 9303 is the metric id, and the daysSinceEpoch value 17295 corresponds to 5/9/2017 (May 9, 2017) UTC.
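The day index and second offset in the example above can be derived from a UTC timestamp with java.time. The sketch below is one possible way to do it; the class DayIndex and its method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDate;

// Hypothetical helper: names are illustrative only, not defined by the challenge.
final class DayIndex {

    private static final long SECONDS_PER_DAY = 86_400L;

    /** Whole days elapsed since 1970-01-01 UTC. */
    public static int daysSinceEpoch(Instant t) {
        return (int) Math.floorDiv(t.getEpochSecond(), SECONDS_PER_DAY);
    }

    /** Seconds elapsed since the start of the UTC day (0 to 86399). */
    public static int secondsOfDay(Instant t) {
        return (int) Math.floorMod(t.getEpochSecond(), SECONDS_PER_DAY);
    }

    public static void main(String[] args) {
        Instant sample = Instant.parse("2017-05-09T00:05:00Z");
        System.out.println(daysSinceEpoch(sample));                   // prints 17295
        System.out.println(secondsOfDay(sample));                     // prints 300
        System.out.println(LocalDate.of(2017, 5, 9).toEpochDay());    // prints 17295
    }
}
```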
The goal is to create a Java- or Scala-based tool that generates a user-defined amount of data through a standard Unix-style command line interface. The user will specify the time range (month, year, etc.), the number of devices, and the number of channels per device. The tool should run on CentOS and Windows. The latest CDH (Cloudera QuickStart VM/Docker) should be used to set up the HBase environment. A standard open source command line library and Log4j shall be used in the utility.
Acceptance criteria: The tool must generate the proper amount of data for the specified input fields.
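One way to check the generated amount against the inputs is to compute the expected counts up front. The sketch below assumes one data row per metric per day (which follows from the rowkey layout) and two meta rows per metric (I-Row and P-Row). The number of traits per channel and the sampling interval are not specified in this challenge, so traitsPerChannel and intervalSeconds are illustrative parameters, as are the class and method names.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: names and the extra parameters are assumptions for illustration.
final class ExpectedVolume {

    /** One metric per (device, channel, trait) combination. */
    public static long metrics(int devices, int channelsPerDevice, int traitsPerChannel) {
        return (long) devices * channelsPerDevice * traitsPerChannel;
    }

    /** Number of UTC days in the half-open range [startInclusive, endExclusive). */
    public static long days(LocalDate startInclusive, LocalDate endExclusive) {
        return ChronoUnit.DAYS.between(startInclusive, endExclusive);
    }

    /** The rowkey contains metric id and day, so there is one data row per metric per day. */
    public static long dataRows(long metrics, long days) {
        return Math.multiplyExact(metrics, days);
    }

    /** Cells in the data table, assuming one sample every intervalSeconds for the full day. */
    public static long cells(long dataRows, int intervalSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("intervalSeconds must be positive");
        }
        return Math.multiplyExact(dataRows, 86_400L / intervalSeconds);
    }

    /** Each metric has an I-Row and a P-Row in the meta table. */
    public static long metaRows(long metrics) {
        return Math.multiplyExact(2L, metrics);
    }

    public static void main(String[] args) {
        long m = metrics(10, 4, 4); // e.g. traits min, max, avg, v
        long d = days(LocalDate.of(2017, 5, 1), LocalDate.of(2017, 6, 1));
        long rows = dataRows(m, d);
        System.out.println(m + " metrics, " + d + " days, " + rows + " data rows, "
                + cells(rows, 300) + " cells, " + metaRows(m) + " meta rows");
    }
}
```

With 10 devices, 4 channels, 4 traits, May 2017, and a 300-second interval, this gives 160 metrics, 4960 data rows, 1428480 cells, and 320 meta rows.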
Final Submission Guidelines
Submit the complete source code for the tool.
Submit a deployment guide.
Submit a short demonstration video (unlisted YouTube link).