TetrysTables: Speeding up machine learning model development with shared datasets
March 14, 2022 | 8 min. read
Autonomous vehicles understand the world by processing enormous amounts of data acquired via sensors. We use this data to train machine learning (ML) models to detect objects surrounding the vehicles, forecast their future behaviors, and plan the autonomous vehicles’ next movements.
The more data we use for ML model training, the more accurate these models will be. For example, in order to produce a perception model that uses lidar, camera, and radar inputs to detect vehicles, pedestrians, and other road actors in all kinds of situations, we train it on tens of thousands of miles of diverse training data that include a wide variety of objects observed under different conditions.
Now, what if some of that same dataset would be useful for training another ML model developed by a different team? To allow multiple teams of engineers to experiment and improve the performance of their models, the data they use must be well organized, versioned, easily searchable, and shareable across teams. But wrangling hundreds of terabytes of autonomous vehicle sensor data is far from simple.
Instead of requiring each ML model development team to implement their own dataset extraction code, which would mean maintaining and running multiple compute-heavy solutions, we’ve turned our “heaviest” datasets into a shared resource. Since no off-the-shelf tool could support this level of collaboration, we implemented a custom solution designed to meet the specific needs of our engineering teams.
By dramatically reducing the time our model developers spend on dataset extraction, this combined, collaborative, self-driving-oriented enterprise data infrastructure has accelerated model development cycles, allowing our engineering teams to scale quickly and make rapid progress in our mission to deliver the benefits of self-driving technology safely, quickly, and broadly.
Training data for self-driving ML models
Figure 1. Components of the Aurora Driver communicate with each other.
While operating, each component of our self-driving vehicle software communicates with the rest of the stack by passing messages. After every mission—i.e., any time Aurora technology is used or tested, whether on public roads, on private tracks, for manually-driven data collection, or to deliver a load for a customer—we offload all of that information to a datacenter as a “log” of messages. The log is used to reconstruct and analyze the behavior of the entire self-driving system.
However, it's not efficient to train an ML model on raw logs. Each sensor modality emits information at a different frequency: for example, camera images may arrive at a lower frequency than lidar packets. Converting these messages into training samples on the fly during training is slow, computationally wasteful work.
Therefore, logs need to be converted into a set of uniform records before they can be used as training data, following standard ML development practices. Our implementation of ML dataset generation is shown in Figure 2.
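To make the conversion concrete, here is a minimal sketch of the alignment step, not Aurora's actual extraction code: it emits one uniform record per lidar sweep and attaches the nearest camera image and radar packet to each sweep's timestamp. All names and the record layout are illustrative.

import bisect

def nearest(stream, ts):
    # stream is a list of (timestamp, payload) tuples sorted by time;
    # return the payload whose timestamp is closest to ts.
    times = [t for t, _ in stream]
    i = bisect.bisect_left(times, ts)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(stream)]
    j = min(candidates, key=lambda k: abs(stream[k][0] - ts))
    return stream[j][1]

def to_records(lidar, camera, radar):
    # Emit one uniform record per lidar sweep, resampling the other
    # (differently clocked) sensor streams onto the sweep's timestamp.
    for ts, sweep in lidar:
        yield {
            "timestamp": ts,
            "lidar.sweep": sweep,
            "camera.image": nearest(camera, ts),
            "radar.packet": nearest(radar, ts),
        }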
When working on or proposing a new deep learning model for the self-driving system, a developer typically:
- Extracts a new dataset from stored logs that reflects the current state of the code stack (or references an existing dataset if appropriate);
- Establishes a baseline of model performance;
- Modifies the model until it reaches desirable performance characteristics on the dataset.
While searching for a solution, we asked whether the same dataset could be used by multiple teams to train different models (specifically, models that use sensor data as their input). And if it could, what would such a dataset look like?
- The dataset must store sensor data with minimal preprocessing, giving each model the freedom to perform a model-specific transformation, which is typically done on the fly as one of the first neural network layers.
- The dataset should support tensor types natively in order to work well with deep-learning frameworks.
- The dataset must have columnar organization, so that a model that uses a subset of sensor data does not waste resources loading columns it has no use for. Model A may use camera and lidar data, for example, while model B uses only camera data.
- The dataset should allow random row access, so that users can easily switch between different sampling strategies (e.g., stratified sampling) without needing to re-extract a dataset (see the sketch after this list).
- The dataset should support fast row lookup based on a query, to simplify data inspection and the implementation of sampling strategies.
- The dataset must be immutable, giving model developers confidence that the data will not change uncontrollably between experiments.
- The dataset needs to be able to store data on the order of petabytes, with tens of millions of rows, hundreds of columns, and fields ranging from a few bytes to over 10 megabytes: the typical scale for model training.
- The dataset must be quickly and easily accessible, with data loading performance good enough to ensure that GPUs are not idle during model training.
- The dataset must support reading a sequence of frames to account for the input's temporal features. For example, a model that takes several frames of a video excerpt as its input requires a dataset that can load the excerpt efficiently.
We evaluated numerous standard formats in search of a winning dataset structure, but, unfortunately, none of them could satisfy all of our needs.
So we decided to implement our own solution: TetrysTables.
TetrysTables (TT) library
We designed TetrysTables, a file format and software library, to support multipurpose datasets. It’s optimized for loading data from a subset of fields in a row or adjacent rows, which can appear as irregular shapes that look like blocks in a game of Tetris.
On-disk data is organized into two components: an index and a blob store. The index is stored as an Apache Parquet table and is loaded entirely into memory as a pandas dataframe when a user opens a table. The index contains a subset of table fields (to keep the RAM footprint manageable, we include only fields with small memory footprints) that can be used for fast, in-memory queries. The index also contains "service" fields with references to the table rows stored in the blob store.
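Conceptually, opening a table looks something like the sketch below; the on-disk path layout and the service column names are assumptions for illustration, not the library's real internals.

import pandas as pd

# The whole index fits in RAM as a pandas dataframe.
index = pd.read_parquet("s3://.../table1/index.parquet")  # path layout assumed

# Small user fields (e.g., log_id, timestamp, pedestrians_count) support
# fast in-memory queries; hypothetical service fields such as _blob_file,
# _blob_offset, and _blob_length tell the row loader where each row's
# remaining fields live in the blob store.
frames_of_interest = index[index.pedestrians_count > 3]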
A table is split into partitions, and the content of each partition is independent of the other partitions in the table. Independent partitions enable at-scale distributed processing and writing, since multiple machines in a compute cluster can process and write partitions in parallel.
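Because partitions are independent, writes parallelize trivially. Below is a sketch with a hypothetical per-partition writer; write_partition is illustrative rather than part of the API shown later, and a local process pool stands in for machines in a compute cluster.

from concurrent.futures import ProcessPoolExecutor

def write_partition(partition_id, rows):
    # Serialize this partition's rows into its own index shard and blob
    # files; no coordination with other partitions is required.
    return f"partition {partition_id}: {len(rows)} rows written"

if __name__ == "__main__":
    partitions = [["row-a", "row-b"], ["row-c"], ["row-d", "row-e"]]
    with ProcessPoolExecutor() as pool:
        for message in pool.map(write_partition, range(len(partitions)), partitions):
            print(message)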
Code examples
Reading a single row
When training or evaluating a model, or simply exploring the data, a user needs to read one or more fields from a row. Let's assume a user is training a model that uses lidar and radar data but does not need camera images. First, we load the table index; then we create a row loader object to access data from the blob store; finally, we load all lidar and radar fields from the row identified by its row position (the row_pos argument of the get_row function). The library automatically chooses which column-groups to load from the blob store, based on its knowledge of the table's physical layout and statistical information about data sizes (the same field may be stored in one or more column-groups to reduce the number of storage read requests):
index = tt.read_index("s3://.../table1")   # load the Parquet index into memory
row_loader = tt.row_loader(index)          # provides access to the blob store
# Read all lidar and radar fields from one row
row = row_loader.get_row(row_pos, columns=["lidar.*", "radar.*"])
If a model makes use of temporal information in sensor data, it will use a set of consecutive frames as its input. In this case, we can use the row loader's get_rows method, which reads data from multiple consecutive rows with a single input-output (IO) request. Here, we use the offsets=range(-10, 0) argument to request that a "history" of ten rows be included in the result.
rows = row_loader.get_rows(row_pos, columns=["lidar.*", "radar.*"], offsets=range(-10, 0))
Joining two tables
Sometimes we might want to join two tables (e.g., joining an existing sensor table with a new labels table would allow us to upgrade labels while preserving the same sensor data). To do this, we mimic the pandas API with a tt.merge(...) function. The rest of the code can transparently use the joint index and perform any table operations (including additional tt.merge calls).
sensors_index = tt.read_index("s3://.../sensors")
labels_index = tt.read_index("s3://.../labels_v2")
# Join the two indexes on shared key columns, pandas-style
merged_index = tt.merge(sensors_index, labels_index, on=["log_id", "timestamp"])
row_loader = tt.row_loader(merged_index)
row = row_loader.get_row(row_pos, columns=["lidar.*", "radar.*", "labels.*"])
Figure 7 illustrates the "secret sauce" behind the tt.merge(...) implementation: the ability to merge the index "service" fields in a way that makes the merged index reference data from multiple blob stores.
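Without reproducing the real implementation, the idea can be sketched with plain pandas: join the two in-memory indexes on their user key columns while keeping each side's service fields distinct, so the merged index can address both blob stores. The _blob_ref column name is hypothetical.

import pandas as pd

sensors = pd.DataFrame({
    "log_id": ["a", "a"], "timestamp": [1, 2],
    "_blob_ref": ["sensors/blob0:0", "sensors/blob0:1"],
})
labels = pd.DataFrame({
    "log_id": ["a", "a"], "timestamp": [1, 2],
    "_blob_ref": ["labels/blob0:0", "labels/blob0:1"],
})

# Keep both sides' service fields so each row still points into its
# original blob store.
merged = sensors.merge(labels, on=["log_id", "timestamp"],
                       suffixes=("_sensors", "_labels"))
# merged now carries _blob_ref_sensors and _blob_ref_labels: a row loader
# can fetch lidar.* from one store and labels.* from the other.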
Selecting rows based on a query
Selecting a row is easy and efficient as long as the selection is based on one of the user columns in the index. When a table is created, we add a number of fields to the index that we expect to be useful for querying the table (e.g., if a model developer needs to train their model on scenes with a specific number of pedestrians, we add a pedestrian count). The following example shows how we load a frame that has more than three pedestrians.
sensors_index = tt.read_index("s3://.../sensors_data")
labels_index = tt.read_index("s3://.../labels_v2")
merged_index = tt.merge(sensors_index, labels_index, on=["log_id", "timestamp"])
merged_index = merged_index[merged_index.pedestrians_count > 3]
row_loader = tt.row_loader(merged_index)
row = row_loader.get_row(0, columns=[".*", "labels.*"])
Sharing storage across multiple tables
Working with immutable, versioned data stores gives model developers confidence in established baselines and is essential for experiment reproducibility. However, storing multiple copies of the data (with each copy taking up hundreds of terabytes) is costly. When a table is written to disk, we can optionally specify an existing table to use as a reference. If a particular chunk already exists in the referenced table, it won't be stored again. By storing a hash value of every chunk in the index, we can efficiently look up data in the referenced table. Data consumers (our model developers) still see an immutable table, but the TetrysTables library will transparently read the shared data from the referenced table.
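A minimal sketch of that sharing scheme, assuming chunk hashes are stored in the index: before writing a chunk, check whether the referenced table already contains it, and if so record a reference instead of storing the bytes again. The hash algorithm and the in-memory stand-ins for the blob store are assumptions.

import hashlib

def write_chunk(chunk, blob_store, referenced_hashes):
    digest = hashlib.sha256(chunk).hexdigest()  # hash algorithm assumed
    if digest in referenced_hashes:
        return ("ref", digest)    # reader fetches this chunk from the referenced table
    blob_store[digest] = chunk    # new data: store it with this table's blobs
    return ("local", digest)

referenced_hashes = {hashlib.sha256(b"old lidar sweep").hexdigest()}
blob_store = {}
print(write_chunk(b"old lidar sweep", blob_store, referenced_hashes))   # ('ref', ...)
print(write_chunk(b"new camera image", blob_store, referenced_hashes))  # ('local', ...)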
With TetrysTables and other tools, all of the software responsible for model input composition and dataset extraction across multiple teams has been consolidated; it is now managed by one team of Aurora engineers who specialize in dataset extraction, effectively streamlining model development.
After multiple years of development and intensive usage, we’re confident that this methodology, novel file format, and library are delivering real value to our organization. TetrysTables is an important component in our collaborative, organization-wide data environment tailored for a scalable self-driving business. Today, teams across the company use TetrysTables tools not just for model training, but also for data science, label curation, and more.
The Machine Learning Platform (MLP) team, which specializes in maintaining the APIs, tables, and tools, continues to improve the robustness and reduce the operational costs of our ETL pipelines, add new data features, and manage and track dataset versions. Areas of future refinement include improving performance for smaller row sizes so that TetrysTables can serve additional applications, reducing the index footprint, and rethinking our approach to memory loading to improve the user experience.
It will be exciting to see how TetrysTables evolves as we inevitably collect more and more data in pursuit of our mission to deliver self-driving technology to the world safely, quickly, and broadly. Look forward to new developments in our future updates.
Software Engineering, Perception