Challenge Overview
We're starting a new challenge series to build a tool for migrating HBase tables. In a previous challenge we built a tool that creates sample tables and generates sample data. In this challenge we will improve that tool so it can generate larger amounts of data.
The basic idea for improving performance is to parallelize data generation. You can use either Apache Spark or MapReduce. The data generation job will run on a Cloudera cluster (the Cloudera quickstart Docker image/VM during review), and the core requirement of this challenge is to extract as much performance as possible from the available resources. Performance accounts for 70% of the total score during review. We strongly suggest reading the performance tuning chapter of the Apache HBase book, paying particular attention to the section on tuning write performance.
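One way to picture the parallelization is to split the device range into independent chunks, one per task. The following is only a minimal sketch under that assumption; `split_devices` and its parameters are hypothetical names, not part of the existing tool. In a Spark job each chunk could become one element of a distributed collection, and in MapReduce one input split.

```python
def split_devices(num_devices, num_tasks):
    """Split device indexes 0..num_devices-1 into contiguous, near-equal ranges.

    Hypothetical helper: each returned range could be handed to one
    Spark task or one MapReduce mapper, which then generates and writes
    the data for only those devices.
    """
    # Never create more tasks than devices, and always at least one task.
    num_tasks = max(1, min(num_tasks, num_devices))
    base, extra = divmod(num_devices, num_tasks)
    ranges = []
    start = 0
    for task in range(num_tasks):
        # The first `extra` tasks take one additional device each.
        size = base + (1 if task < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for chunk in split_devices(1000, 8):
        print(chunk.start, chunk.stop)
```

Contiguous ranges keep each task's work self-contained, so tasks need no coordination beyond agreeing on how metric ids are assigned.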
The time series data to be generated (metric id, timestamp, value) follows a specific table structure. There are two tables. One is the meta table, which stores the metric id, the device sensor path, and data type information. The other is the data table, which stores the actual time series data.
The following is the HBase meta table schema:
The same data are stored in both I-Row and P-Row, because the data can be queried via both the integer metric id and path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.
I-Row Schema: Stores mapping from the metric id to the device sensor path
- Rowkey: 'i' + 4-byte metric id integer.
- Columns (ColumnFamily:ColumnQualifier):
- m:x (ASCII String) Stores the path string (deviceUUID\channelId\trait). e.g. 8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- m:p (1 byte) Stores the prefix 'A', 'B', ... to help split data among multiple regions.
- m:d (1 byte) Stores the data type (0: UNKNOWN, 1: FLOAT, 2: DOUBLE, 3: LONG, 4: BOOLEAN, 5: STRING).
- m:v (1 byte) Stores the schema version (always set to 5)
P-Row Schema: Stores mapping from the device sensor path to the metric id
- Rowkey: 'p' + path string. e.g. p8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- Columns (ColumnFamily:ColumnQualifier):
- m:x (4 byte integer) Stores the metric id integer.
- m:p Same as above.
- m:d Same as above.
- m:v Same as above.
- IN_MEMORY => 'true'
- COMPRESSION => 'SNAPPY' (or LZO)
- Row Timestamp/Version: milliseconds since epoch
- The numeric values are all Big-Endian encoded.
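To make the meta table layout concrete, here is a minimal sketch of how one metric could be encoded into its I-Row and P-Row. The function name `meta_rows` and the plain dictionaries standing in for HBase puts are illustrative assumptions. The metric id is packed as a signed 32-bit big-endian integer, which is also an assumption (the schema only says "4-byte integer").

```python
import struct

# Data type codes as listed for the m:d column.
DATA_TYPES = {"UNKNOWN": 0, "FLOAT": 1, "DOUBLE": 2, "LONG": 3, "BOOLEAN": 4, "STRING": 5}
SCHEMA_VERSION = 5  # m:v is always set to 5


def meta_rows(metric_id, device_uuid, channel_id, trait, prefix, data_type):
    """Return (i_rowkey, i_columns, p_rowkey, p_columns) for one metric.

    Hypothetical helper: columns are plain dicts keyed by
    'family:qualifier', standing in for real HBase Put objects.
    """
    # Path string: deviceUUID\channelId\trait, stored as ASCII.
    path = "{}\\{}\\{}".format(device_uuid, channel_id, trait).encode("ascii")
    # Assumption: signed 32-bit big-endian integer.
    metric_bytes = struct.pack(">i", metric_id)
    shared = {
        "m:p": prefix.encode("ascii"),           # 1-byte region prefix
        "m:d": bytes([DATA_TYPES[data_type]]),   # 1-byte data type
        "m:v": bytes([SCHEMA_VERSION]),          # 1-byte schema version
    }
    i_columns = {"m:x": path, **shared}          # metric id -> path
    p_columns = {"m:x": metric_bytes, **shared}  # path -> metric id
    return b"i" + metric_bytes, i_columns, b"p" + path, p_columns


if __name__ == "__main__":
    i_key, i_cols, p_key, p_cols = meta_rows(
        9303, "8bedf2c8-a164-4f94-9d04-93d88a01d87b", 237, "v", "A", "FLOAT")
    print(i_key.hex(), p_key)
```

Both rows carry the same prefix, data type, and version cells; only the row key and the m:x value swap roles.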
The following is the HBase data table schema:
Rowkey: 7 bytes
- Prefix (1 byte): The ASCII code for A, B, C, ..., Z. This is a randomly assigned prefix that avoids hot spots. The same prefix is always assigned to the same metric id.
- MetricId (4-byte): An integer generated to represent a path (e.g. deviceUUID\channelId\trait)
- DaysSinceEpoch (2-byte): Number of days since 1/1/1970 UTC.
- t:offset The offset is a 4-byte integer giving the number of seconds elapsed in the current UTC day.
- Row Timestamp/Version: milliseconds since epoch (not exposed in the query)
Notes
- The numeric values are all Big-Endian encoded.
- Rowkey: A+9303+17295, Column:300, Column:600, Column:900, …, Column:second offset
- Values: float, float, float, …
- Note: A is the prefix, 9303 is the metric id, and 17295 is the daysSinceEpoch value for 5/9/2017 UTC.
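The sample row above can be reproduced with a short sketch. The helper names are hypothetical, the metric id and offset are packed as signed 32-bit big-endian integers, and the value is packed as a 4-byte IEEE 754 float; those widths and signs are assumptions where the schema does not spell them out. The date 5/9/2017 is read as May 9, 2017, which is the reading that yields 17295.

```python
import struct
from datetime import date

EPOCH = date(1970, 1, 1)
SECONDS_PER_DAY = 86400


def data_rowkey(prefix, metric_id, day):
    """Build the 7-byte row key: prefix (1) + metric id (4) + days since epoch (2)."""
    days_since_epoch = (day - EPOCH).days
    # '>' selects big-endian with no padding: 4-byte int then 2-byte unsigned short.
    return prefix.encode("ascii") + struct.pack(">iH", metric_id, days_since_epoch)


def offset_qualifier(seconds):
    """Encode the second-of-day offset used as the column qualifier."""
    if not 0 <= seconds < SECONDS_PER_DAY:
        raise ValueError("offset must fall within one UTC day")
    return struct.pack(">i", seconds)


def float_value(value):
    """Encode one sample (assumption: 4-byte big-endian IEEE 754 float)."""
    return struct.pack(">f", value)


if __name__ == "__main__":
    key = data_rowkey("A", 9303, date(2017, 5, 9))
    print(key.hex())                     # row key bytes in hex
    print(offset_qualifier(300).hex())   # qualifier for the 300-second offset
```

Because the prefix is the first byte of the key, rows for different prefixes sort into different key ranges, which is what lets the table be split across regions.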
Sample shell commands to create both tables are available in the data generation tool docs.
Spark/MapReduce jobs must produce proper logs, and no regressions should be introduced compared to the base data generation tool.
The tool will be benchmarked against the CDH quickstart HBase table. You should document the time taken to generate data for X devices, each with Y channels and Z values per day, over D days, with ranges X [1-1000], Y [1-100], Z [1-500], D [1-365].
Acceptance criteria: The tool must generate the correct amount of data for the specified input values.
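When verifying the generated volume, it can help to compute the expected counts up front. The sketch below assumes one metric per (device, channel, trait) combination, one data row per metric per day, and Z cells per row. The `traits` parameter and the function name are hypothetical, since the number of traits per channel is not fixed by the inputs above.

```python
def expected_counts(devices, channels, values_per_day, days, traits=1):
    """Estimate how many rows and cells a run should produce.

    Assumptions (not confirmed by the spec): one metric per
    (device, channel, trait), one data row per metric per day,
    and `values_per_day` cells in each data row.
    """
    metrics = devices * channels * traits
    return {
        "metrics": metrics,
        "meta_rows": 2 * metrics,         # one I-Row and one P-Row per metric
        "data_rows": metrics * days,      # one row per metric per UTC day
        "data_cells": metrics * days * values_per_day,
    }


if __name__ == "__main__":
    # Upper end of the documented ranges: X=1000, Y=100, Z=500, D=365.
    for name, count in expected_counts(1000, 100, 500, 365).items():
        print(name, count)
```

Comparing these figures with row counts taken from HBase after a run gives a quick check against the acceptance criterion.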
Final Submission Guidelines
Submit the complete source code for the tool.
Submit a deployment guide.
Submit a verification guide and benchmark results.
Submit a short demonstration video (unlisted YouTube link).