Challenge Overview
We're starting a new challenge series to build a tool for migrating HBase tables. In a previous challenge we built a tool to create sample tables and generate sample data; we will use that data generation tool here to produce the input for the benchmarks.
In this first challenge we will focus on migrating data from a specific table with a known structure. The migration tool will be implemented in Java. The following challenges will improve the tool by parallelizing the workload and adding support for migrating arbitrary tables. Please keep this in mind and isolate any part of the code that makes assumptions about the specific table structure, so it can be refactored easily later.
The time series data (metric id, timestamp, value) that we will be migrating has a specific table structure. There are two tables: a meta table that stores the metric id, the device sensor path, and the data type information, and a data table that stores the actual time series data.
The following is the HBase meta table schema (a rowkey-encoding sketch in Java follows the schema notes below):
The same data are stored in both the I-Row and the P-Row because the data can be queried via both the integer metric id and the path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.
I-Row Schema: Stores the mapping from the metric id to the device sensor path
- Rowkey: 'i' + 4-byte metric id integer.
- Columns (ColumnFamily:ColumnQualifier):
- m:x (ASCII String) Stores the path string (deviceUUID\channelId\trait). e.g. 8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- m:p (1 byte) Stores the prefix 'A', 'B', ... to help split data among multiple regions.
- m:d (1 byte) Stores the data type (0: UNKNOWN, 1: FLOAT, 2: DOUBLE, 3: LONG, 4: BOOLEAN, 5: STRING).
- m:v (1 byte) Stores the schema version (always set to 5)
P-Row Schema: Stores the mapping from the device sensor path to the metric id
- Rowkey: 'p' + path string. e.g. p8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- Columns (ColumnFamily:ColumnQualifier):
- m:x (4 byte integer) Stores the metric id integer.
- m:p Same as above.
- m:d Same as above.
- m:v Same as above.
- IN_MEMORY => 'true'
- COMPRESSION => 'SNAPPY' (or LZO)
- Row Timestamp/Version: milliseconds since epoch
- The numeric values are all Big-Endian encoded.
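As a concrete illustration of the meta table encoding above, here is a minimal sketch in Java. The class and method names are hypothetical, and the HBase client's Bytes and Put classes are assumed to be available; this is not a mandated implementation.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical helpers for encoding meta table rows; names are illustrative only. */
public class MetaRowCodec {

    private static final byte[] CF_M = Bytes.toBytes("m");

    /** I-Row key: 'i' + 4-byte big-endian metric id (5 bytes total). */
    public static byte[] iRowKey(int metricId) {
        byte[] key = new byte[5];
        key[0] = 'i';
        System.arraycopy(Bytes.toBytes(metricId), 0, key, 1, 4); // Bytes.toBytes(int) is big-endian
        return key;
    }

    /** P-Row key: 'p' + path string, e.g. "p" + "deviceUUID\channelId\trait". */
    public static byte[] pRowKey(String path) {
        return Bytes.toBytes("p" + path);
    }

    /** Builds the Put for an I-Row with the m:x, m:p, m:d and m:v columns. */
    public static Put iRowPut(int metricId, String path, byte prefix, byte dataType) {
        Put put = new Put(iRowKey(metricId));
        put.addColumn(CF_M, Bytes.toBytes("x"), Bytes.toBytes(path));     // path string
        put.addColumn(CF_M, Bytes.toBytes("p"), new byte[] { prefix });   // region split prefix 'A'..'Z'
        put.addColumn(CF_M, Bytes.toBytes("d"), new byte[] { dataType }); // data type code 0..5
        put.addColumn(CF_M, Bytes.toBytes("v"), new byte[] { 5 });        // schema version, always 5
        return put;
    }
}
```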
The following is the HBase data table schema (a rowkey-encoding sketch in Java follows the example below):
Rowkey: 7 bytes
- Prefix (1 byte): The ASCII code for one of A, B, C, ..., Z, a randomly assigned prefix to avoid hot-spotting. The same prefix is always assigned to the same metricId.
- MetricId (4 bytes): An integer generated to represent a path (e.g. deviceUUID\channelId\trait).
- DaysSinceEpoch (2 bytes): The day index since 1/1/1970 UTC.
Columns (ColumnFamily:ColumnQualifier):
- t:offset The qualifier offset is a 4-byte integer representing the seconds offset within the current UTC day.
- Row Timestamp/Version: milliseconds since epoch (not exposed in the query)
Notes
- The numeric values are all Big-Endian encoded.
- Example row: Rowkey A+9303+17295, with columns t:300, t:600, t:900, …, t:<seconds offset>
- Values: float, float, float, …
- In this example, A is the prefix, 9303 is the metric id, and daysSinceEpoch 17295 is 5/9/2017 UTC.
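Similarly, a minimal sketch (hypothetical names again) of building the 7-byte data rowkey and one t:offset column:

```java
import java.time.Instant;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical helpers for encoding data table rows; names are illustrative only. */
public class DataRowCodec {

    private static final byte[] CF_T = Bytes.toBytes("t");

    /** 7-byte rowkey: prefix (1 byte) + metric id (4 bytes, big-endian) + days since epoch (2 bytes, big-endian). */
    public static byte[] dataRowKey(byte prefix, int metricId, short daysSinceEpoch) {
        byte[] key = new byte[7];
        key[0] = prefix;
        System.arraycopy(Bytes.toBytes(metricId), 0, key, 1, 4);
        System.arraycopy(Bytes.toBytes(daysSinceEpoch), 0, key, 5, 2);
        return key;
    }

    /** Builds the Put for one float value at the given UTC instant. */
    public static Put valuePut(byte prefix, int metricId, Instant ts, float value) {
        long epochSecond = ts.getEpochSecond();
        short daysSinceEpoch = (short) (epochSecond / 86400); // days since 1/1/1970 UTC
        int secondsOffset = (int) (epochSecond % 86400);      // seconds offset within the UTC day
        Put put = new Put(dataRowKey(prefix, metricId, daysSinceEpoch));
        // Column qualifier is the 4-byte big-endian seconds offset, e.g. t:300 for 00:05:00 UTC.
        put.addColumn(CF_T, Bytes.toBytes(secondsOffset), Bytes.toBytes(value));
        return put;
    }
}
```

For the example above, prefix 'A' with metric id 9303 and daysSinceEpoch 17295 (5/9/2017 UTC) produces the rowkey shown, and the qualifier 300 holds the value recorded 300 seconds into that day.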
The goal is to create a Java-based tool, with a standard Unix-style command line interface, that migrates the data from one HBase table to another. There will be two modes for the tool: export and import.
In export mode, the tool will dump the data from an HBase table to a CSV file and an Apache Avro file, or to a set of CSV and/or Avro files. Both the CSV and the Avro file schemas should be designed to be as generic and reusable as possible, and should be clearly documented. The size of the generated files needs to be manageable: each file should be small enough to open in Excel or to be read by another program without freezing the user's session. Both formats are supported because CSV favors readability and portability while Avro favors performance, which serves different use cases.
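The exact file layouts are left to the implementer; as one possible generic design (the field names here are illustrative, not mandated), each exported record could carry the metric id, path, data type, timestamp, and value. A minimal Avro writer sketch using the Avro Java library:

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

/** Sketch of one possible export record schema; the field names are assumptions. */
public class AvroExportSketch {

    static final Schema RECORD = SchemaBuilder.record("TimeSeriesPoint")
            .fields()
            .requiredInt("metricId")
            .requiredString("path")      // deviceUUID\channelId\trait
            .requiredInt("dataType")     // 0..5, same codes as the meta table
            .requiredLong("timestamp")   // milliseconds since epoch, UTC
            .requiredDouble("value")     // stored as double for simplicity in this sketch
            .endRecord();

    public static void main(String[] args) throws IOException {
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(RECORD))) {
            writer.create(RECORD, new File("export-part-0001.avro"));
            GenericRecord rec = new GenericData.Record(RECORD);
            rec.put("metricId", 9303);
            rec.put("path", "8bedf2c8-a164-4f94-9d04-93d88a01d87b\\237\\v");
            rec.put("dataType", 1);
            rec.put("timestamp", 17295L * 86400_000L + 300_000L); // day 17295, offset 300 s
            rec.put("value", 1.25);
            writer.append(rec);
        }
    }
}
```

A CSV file for the same layout could simply start with a header line such as metricId,path,dataType,timestamp,value, with one record per line, and the exporter could roll to a new file after a configurable number of records to keep each file small.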
In import mode, the user can choose either of the two formats as input and import the data from the files back into the same or a new HBase table. The user can pass a list of device UUIDs and/or a list of channelIds, plus a time range, to the utility. The utility shall manage this potentially lengthy operation and be able to resume the task if a failure happens during hours of execution. The following exceptions can happen; the utility shall handle them and be able to resume the work, either automatically or manually (see the checkpoint sketch after this list):
- HBase is too busy, so a write operation fails.
- The network is interrupted.
- The program crashes or is shut down.
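One possible way to satisfy the resume requirement (a sketch only; the checkpoint file name, format, and batch granularity are assumptions) is to persist a small checkpoint after every batch that was durably written to HBase and to skip already-imported records on restart:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Minimal checkpoint sketch: remembers how many input records were durably written to HBase. */
public class ImportCheckpoint {

    private final Path file;

    public ImportCheckpoint(String path) {
        this.file = Paths.get(path);
    }

    /** Returns the record index to resume from, or 0 when no checkpoint exists (fresh run). */
    public long load() throws IOException {
        if (!Files.exists(file)) {
            return 0L;
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    /** Records that all records before nextIndex were imported successfully. */
    public void save(long nextIndex) throws IOException {
        // Write to a temp file first, then rename, so a crash mid-save cannot corrupt the checkpoint.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(nextIndex).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

On restart the importer loads the checkpoint, skips that many input records, and continues; transient HBase or network errors can be retried in a bounded loop around each batch before the checkpoint is advanced.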
The tool should run on both CentOS and Windows. The latest CDH (Cloudera QuickStart VM/Docker) should be used to set up the HBase environment. A standard open source command line library and Log4j shall be used in the utility.
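The command line library is not named in the spec; assuming Apache Commons CLI is an acceptable choice, option parsing for the two modes could look like the following sketch (the option names and defaults are illustrative only):

```java
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.log4j.Logger;

/** Illustrative CLI entry point; option names are assumptions, not part of the spec. */
public class MigrationCli {

    private static final Logger LOG = Logger.getLogger(MigrationCli.class);

    public static void main(String[] args) {
        Options options = new Options();
        options.addOption("m", "mode", true, "export or import");
        options.addOption("f", "format", true, "csv or avro");
        options.addOption("d", "devices", true, "comma-separated device UUIDs (import filter)");
        options.addOption("c", "channels", true, "comma-separated channelIds (import filter)");
        options.addOption("s", "start", true, "start of time range, e.g. 2017-05-09T00:00:00Z");
        options.addOption("e", "end", true, "end of time range");
        options.addOption("o", "output", true, "directory for exported CSV/Avro files");

        try {
            CommandLine cmd = new DefaultParser().parse(options, args);
            String mode = cmd.getOptionValue("mode", "export");
            LOG.info("Starting migration tool in " + mode + " mode");
            // ... dispatch to the export or import implementation here
        } catch (ParseException e) {
            new HelpFormatter().printHelp("hbase-migrate", options);
        }
    }
}
```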
The utility will be benchmarked against the CDH QuickStart HBase table. Document the time taken to export and import X metric ids, over Y days, with Z values per day, where X is in [1, 1000], Y is in [1, 365], and Z is in [1, 500].
Acceptance criteria: The tool must export and import the data without losing any of the records.
Final Submission Guidelines
Submit the complete source code for the tool.
Submit a deployment guide.
Submit a verification guide and benchmark results.
Submit a short demonstration video (unlisted YouTube link).