As the VP of Data at Ripple, I was lucky to attend KDD 2019 in Anchorage, Alaska this summer with Jen Xia, a Data Scientist on my team.
As usual, KDD lived up to the hype, providing a good mix of academia and industry. You can easily go to a talk where you can't follow the math past slide 4 and afterwards attend a session about the future of autonomous vehicles. Humbling and interesting! I was also asked to sit on a few panels and give one of the keynotes at the blockchain breakout workshop.
Here are a few sessions that stood out to me:
- DiDi: transforming transportation
- LinkedIn: building toward unbiased employment with differential-privacy-enabled data mining
- Intuit: building better financial predictions for small businesses
- Lyft: building better driverless cars
- Image recognition to combat rhino poaching
- Facebook disaster maps
As a Data Scientist at Ripple on Matt Curcio's team, I was also lucky to attend KDD 2019 in Anchorage this summer.
Here are a few recaps of the talks I found most interesting:
- Friends Don’t Let Friends Deploy Black-Box Models: The Importance of Intelligibility in Machine Learning
This was my favorite talk at KDD. Rich Caruana from Microsoft spoke about how, many years ago, he had trained a neural net to predict the risk of pneumonia in patients and wanted to understand whether this model would be safe to use on patients. Digging in a bit further, he realized that one of the features in his model, whether a patient did or did not have asthma, was producing some counterintuitive results: patients with asthma were predicted to have a lower risk for pneumonia, when common sense should indicate the opposite to be true.
What the neural net had captured was that patients with asthma would be more likely to go to the doctor for lung-related medical problems, where pneumonia could be prevented or caught early. But allowing a neural net to make the prediction that people with asthma were at lower risk for pneumonia would reverse this (positive) effect!
Caruana introduced GA2M models (generalized additive models with pairwise interactions) as a more interpretable alternative. Although these models tend to be somewhat less accurate than neural nets, they are far more intelligible, which is invaluable for a model with a real-world use case like this one. They have the added benefit of being editable, meaning he could adjust the model so that it would predict a higher pneumonia risk for patients with asthma.
The model is straightforward enough: a function of single features plus a small number of pairwise interactions. I also appreciated the message Caruana closed the presentation with: the “correctness” of any model depends entirely on the use case. The neural net he started with could have been “correct” for an actuarial use case, but the GA2M model was the “correct” one for assessing patient risk.
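For reference, the general form Caruana described can be written as follows (the notation is mine, following the standard GA2M formulation rather than his slides):

```latex
g\bigl(\mathbb{E}[y]\bigr) \;=\; \beta_0 \;+\; \sum_i f_i(x_i) \;+\; \sum_{(i,j)} f_{ij}(x_i, x_j)
```

where $g$ is a link function, each $f_i$ is a learned shape function of a single feature, and the $f_{ij}$ are a small set of pairwise interaction terms. Because each $f_i$ can be plotted (and, as Caruana noted, edited) on its own, the model's learned behavior for a feature like asthma is directly visible.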
- Communicating Machine Learning Results About the Flint Water Crisis to City Residents at Scale
I also attended a number of discussions from the Social Impact Workshop. In this talk, professors from the University of Michigan and Georgia Tech presented their work helping the city of Flint, Michigan locate lead service lines and minimize recovery costs.
The team used ensembles of tree models to predict the probability that each service line contained lead, and built an interactive map that presented those probabilities as narrative text anyone could understand. I appreciated their thoughtfulness about how best to communicate the risk of lead contamination to residents in a way that would inspire them to take action rather than panic.
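As a rough illustration of the probability-to-narrative idea, a minimal sketch might look like the following. The thresholds and wording here are invented for illustration; the talk did not share the actual mapping.

```python
def risk_narrative(p_lead: float) -> str:
    """Turn a model probability into resident-facing narrative text.

    Cut-off points and phrasing are hypothetical, chosen only to show
    the pattern of translating a score into plain language.
    """
    if p_lead >= 0.7:
        return "Your home's service line is very likely to contain lead."
    if p_lead >= 0.4:
        return "Your home's service line may contain lead."
    return "Your home's service line is unlikely to contain lead."

print(risk_narrative(0.85))
```

The design point is that residents never see a raw probability like 0.85; they see a sentence calibrated to prompt the right level of concern.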
- Their Futures Matter Family Investment Model
Peter Mulquiney, Principal at the consulting and analytics firm Taylor Fry, presented the work he did with Their Futures Matter, an organization whose goal is to improve life outcomes for vulnerable children. His task was to help them prioritize investments in different programs and interventions by predicting the life pathways and outcomes of different individuals.
To assemble the data, he needed to link datasets across different agencies to reconstruct individuals' historical pathways, a major challenge in itself given the data silos and privacy concerns involved. In building the model, he chose logistic regression over a more complex machine learning model: even though predictive performance was worse, explainability was much better.
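That explainability argument can be made concrete: each logistic regression coefficient converts directly into an odds ratio that non-technical stakeholders can read. A minimal sketch, with invented feature names and coefficient values (the talk did not share the actual model):

```python
import math

# Hypothetical fitted coefficients for a logistic regression predicting
# long-term human-services usage. Values are illustrative only.
coefficients = {
    "prior_out_of_home_care": 1.2,
    "parental_incarceration": 0.7,
    "school_suspensions": 0.4,
}

def odds_ratio(coef: float) -> float:
    """A one-unit increase in the feature multiplies the odds by e^coef."""
    return math.exp(coef)

for feature, coef in coefficients.items():
    print(f"{feature}: odds ratio = {odds_ratio(coef):.2f}")
```

A statement like "this factor multiplies the odds of heavy service usage by about 3" is far easier to defend to agencies and caseworkers than a feature attribution from a black-box model.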
The model estimated the future usage of human services by different individuals and revealed some striking statistics. For example, 7% of people in the dataset accounted for 50% of service usage, and young women who fell into this group were 10 times more likely to have children who would end up in foster care, meaning that failing to intervene early would let problems propagate through generations. By identifying and focusing on the individuals in the long tail of these distributions, the organization could have an outsized impact.