It's tough to make predictions, especially about the future.
—Yogi Berra
The cascaded approach
One of the greatest challenges in self-driving is interaction with other actors on the road—whether drivers, cyclists, or pedestrians. If the technical challenges of interaction were not so significant, AVs would likely already be ubiquitous.
The ability to forecast¹ other actors well enables safe, interpretable, and responsive driving actions. Forecasting has traditionally been treated as a step to be cascaded after perception and before motion planning, providing key input to the decision-making sub-system.
Figure 1.
The traditional approach dates back to, and perhaps is best exemplified by, the pioneering Urban Challenge era self-driving vehicles and their progeny.
Figure 2. Forecasting image from Carnegie Mellon’s Urban Challenge entry, showing both on-road and unstructured parking lot forecasts. Image courtesy of Aurora CEO and former Carnegie Mellon University Urban Challenge lead, Chris Urmson.
More recent works have expanded on this cascaded approach and allowed backpropagation between modules, particularly to improve the joint perception and forecasting stack.
Here we see state-of-the-art cascaded forecasting systems with uncertainty estimates for multiple actors in a scene:
Figure 3. Left: Example output of a cascaded model demonstrating forecasts with probabilistic uncertainty of an actor in the scene to aid decision making.
These approaches demonstrate the importance of reasoning about multi-modality and uncertainty when generating forecasts for actors with multiple possible futures. However, the motion of other actors often depends crucially on the actions of the AV, and cascaded forecasting cannot reason about these interactions.
Aurora’s clean-sheet approach to self-driving has given us the space to identify a better way: integrating forecasting within our decision making architecture (i.e., motion planning) to enable reasoning that factors in the impact of the AV’s decisions on the motions of other actors.
Cascaded forecasting failures
Let’s start by understanding what kinds of failures we’re likely to encounter in a cascaded system. Consider Figure 4, one of a wide variety of interactions where the AV’s decision affects another actor’s behavior.
Figure 4.
During planning, the AV forward simulates what will happen when it makes decisions and estimates how good or bad those outcomes will be. A cascaded autonomy system will conclude that if it makes the turn, the actor “Alice” in the figure above is likely to rear-end it, even when in reality this would only happen if Alice were being truly reckless!
Why does the AV think this? If the AV’s planner doesn’t reason that Alice will slow gently in response to the AV’s decision to pull out into traffic, it will incorrectly believe that a collision is inevitable and will thus decide that a legal and, in reality, very safe maneuver should be disallowed.
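To make this failure concrete, here is a minimal, self-contained sketch. All the numbers, kinematic models, and names are our own illustrative choices, not Aurora's system: it simply compares a forecast of Alice that ignores the AV's decision against one that lets her ease off once the AV pulls out.

```python
# Illustrative only: made-up distances, speeds, and reaction model.
DT, STEPS = 0.5, 10            # 5-second lookahead in 0.5 s steps
TOO_CLOSE = 5.0                # gap (metres) below which the planner flags a conflict

def av_position(t):
    """AV pulls out 20 m ahead of Alice, moving at 5 m/s and accelerating at 2 m/s^2."""
    return 20.0 + 5.0 * t + 1.0 * t * t

def alice_unconditioned(t):
    """Cascaded forecast: Alice holds 15 m/s no matter what the AV does."""
    return 15.0 * t

def alice_conditioned(t):
    """Interaction-aware forecast: Alice eases off at a gentle 2 m/s^2 (never stopping
    within this horizon)."""
    return 15.0 * t - 1.0 * t * t

def min_gap(alice_model):
    """Smallest AV-to-Alice gap over the forward simulation."""
    return min(av_position(k * DT) - alice_model(k * DT) for k in range(STEPS + 1))

for name, model in [("cascaded", alice_unconditioned),
                    ("interaction-aware", alice_conditioned)]:
    gap = min_gap(model)
    verdict = "conflict predicted" if gap < TOO_CLOSE else "clear"
    print(f"{name:18s} min gap = {gap:5.1f} m -> {verdict}")
```

Under the unconditioned forecast the gap goes negative and the planner rejects the turn as a guaranteed rear-end collision; under the reactive model, a gentle brake from Alice keeps the very same maneuver comfortably clear.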
The result is that a naive implementation of a cascaded system will typically be ineffective in its decision making and unable to drive in normal traffic conditions. It will likely be less safe than a good driver because the actions it takes are surprising to other drivers on the road. There are, of course, ways to patch up such forecasts post hoc, but this should make us question whether forecasts are actually useful when they don't consider the context of the AV's decisions. When we attempt to build a forecasting system that runs prior to, instead of within, the decision-making loop, what, exactly, is it learning to forecast?
Well, if it’s an algorithm based on data (it’s difficult to imagine building a modern AI system without a strong data-driven approach), then it must attempt to predict what other actors are likely to do in the context of the decisions the driver made when the data was collected. This, of course, creates a chicken-or-the-egg feedback problem: we must hope that the driver was an expert, or the predictions we make about other actors will be biased. For example, if the AV is trained on data collected with an unusually aggressive human driver behind the wheel, it is likely to learn, incorrectly, that Alice, Bob, and other drivers will always move to accommodate it. In reality, at the time the data was collected, those drivers were forced to behave unnaturally in response to aggressive actions.
Lost in the modes…
Now, what this means is that any forecasts we have about other actors must also implicitly reason about the correct actions of an expert driver. This takes the motion planning system almost completely out of the driver’s seat and relegates it to trying to back out what the forecasting system has already decided.⁴
Consider for a moment a scenario like that depicted here, where a good driver can choose to move in front of the merging actor Alice or, instead, to yield to her. Let’s say that for similar scenarios, our expert vehicle operator demonstrations might establish that 80% of the time the AV should choose to take the lead position in the merge while 20% of the time it should instead choose to yield. Our predictions, then, if doing a good job, will be multi-modal as shown below.
Multi-modal forecasts are common and critical for self-driving—consider reasoning about a car that might make a turn or go straight at an intersection as in Figure 3. But these two particular modes illustrated in Figure 5 are the direct result of there being multiple good options for the AV. That fact is completely obscured by providing only the “marginal” forecasts of the actors themselves. Imagine now what a planner would have to do with an impoverished representation relying on marginal forecasts: the planner foresees some significant probability of collision if it chooses either to yield or to go first. Both options look “expensive”—and unsafe—to the reasoning system, even though in reality both are reasonable choices if executed correctly.
We can make a toy “matrix game” to capture what’s going on here. Let’s model a too-close interaction between the AV and Alice as “costing” 100⁵ and a proper interaction, where the vehicles merge together cleanly, as costing 0. The “real” situation leads to a good outcome when the actors coordinate, while an a priori (cascaded) forecast makes all the reasonable decisions seem untenable. This leads to poor driving decisions. Good merging requires both decisive action to signal intent and responsiveness to the other actors to fold together smoothly. But an AV with a cascaded forecasting system is forced to hedge indecisively against multiple outcomes that will not—and cannot—both happen.
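The arithmetic of this toy game is worth spelling out. The sketch below is ours, not Aurora's: it reuses the 80/20 split from the demonstrations above as Alice's marginal forecast, and it idealizes a "responsive Alice" who always complements whatever the AV commits to.

```python
# Toy matrix game with illustrative numbers: too-close interaction costs 100,
# a clean merge costs 0. Rows are the AV's options, columns are Alice's.
COST = {
    ("lead",  "yield"): 0,    # coordinated: AV goes first, Alice folds in behind
    ("lead",  "lead"):  100,  # both try to take the gap
    ("yield", "yield"): 100,  # both hesitate over the same gap
    ("yield", "lead"):  0,    # coordinated: Alice goes first, AV folds in behind
}

# Cascaded view: a marginal forecast of Alice that cannot see the AV's decision
# (the 0.8 / 0.2 split mirrors the expert demonstrations discussed above).
marginal_alice = {"yield": 0.8, "lead": 0.2}
for av_choice in ("lead", "yield"):
    expected = sum(p * COST[(av_choice, a)] for a, p in marginal_alice.items())
    print(f"marginal forecast,    AV {av_choice:5s}: expected cost = {expected:4.0f}")

# Interleaved view: Alice's behaviour conditioned on the AV's decision,
# assuming she is responsive (an idealization for illustration).
conditional_alice = {"lead":  {"yield": 1.0, "lead": 0.0},
                     "yield": {"yield": 0.0, "lead": 1.0}}
for av_choice in ("lead", "yield"):
    dist = conditional_alice[av_choice]
    expected = sum(p * COST[(av_choice, a)] for a, p in dist.items())
    print(f"conditional forecast, AV {av_choice:5s}: expected cost = {expected:4.0f}")
```

Under the marginal forecast, taking the lead prices out at an expected cost of 20 and yielding at 80, so both look risky; conditioned on the AV's decision, either coordinated choice costs 0.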
This reliance on marginal forecasts continues to be problematic for a very wide range of probabilities. Truncating the probabilities when they get small leads to over-confident behavior where the AV won’t correctly consider outcomes that are unlikely but important. There are cases where the AV can do valuable forecasting prior to planning—for instance, on a short time scale of less than, perhaps, two seconds, and for reasoning about intentions of other actors that are independent of the AV’s decisions. But it is precisely where our software would benefit most from understanding the likely positions of actors in the future—in the complex dance of interactions between the AV and other actors during tricky merges, lane changes, and interactions with pedestrians and cyclists—that the cascade approach breaks down.
Interleaved forecasting
So if a cascaded system is an engineering cul-de-sac, what works better? Interleaved forecasting. Here, the AV conditions forecasts on its future options using a form of causal reasoning⁶ to answer interventional questions: “If I were to make this decision, what are the possible—and probable—outcomes from other actors?”
The AV can evaluate those conditional outcomes and choose the one it deems to be overall best for its own safety and progress, and for acting as an ideal citizen of the road.
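In schematic terms, the planning loop looks roughly like the sketch below. The names, types, and signatures are hypothetical placeholders of ours, not Aurora's architecture: the point is simply that forecasts are queried conditioned on each candidate decision rather than computed once up front.

```python
from typing import Callable, List, Tuple

# A conditional forecast maps (scene, candidate AV plan) to weighted futures
# of the other actors: a list of (probability, predicted future) pairs.
ConditionalForecast = Callable[[dict, str], List[Tuple[float, dict]]]

def plan(scene: dict,
         candidate_plans: List[str],
         forecast: ConditionalForecast,
         cost: Callable[[dict, str, dict], float]) -> str:
    """Pick the candidate plan with the lowest expected cost under forecasts that
    are conditioned on that plan (the interventional "if I were to do this...")."""
    best_plan, best_cost = None, float("inf")
    for av_plan in candidate_plans:
        futures = forecast(scene, av_plan)          # P(other actors | this AV decision)
        expected = sum(p * cost(scene, av_plan, future) for p, future in futures)
        if expected < best_cost:
            best_plan, best_cost = av_plan, expected
    return best_plan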
Now building such a system is complex and computationally demanding; it keeps our engineering and operations teams running hard. But it’s the right approach to deliver the benefits of self-driving safely, quickly, and broadly. Training our AV to reason about the network of interdependencies among all of the actors in a scene, including itself, empowers us to focus on the goal of making the correct decisions, rather than forecasting other agents’ marginal probabilities of future positions. We’re not interested in how accurately we can forecast another actor’s actions in the abstract, only in how it leads the Aurora Driver to make safe, expert-driver-like decisions.
In future articles, we’ll dive deeper into our approach to learned interleaved forecasting and interactive decision-making. Stay tuned.