Project DataFuel
Enabling Data-driven Innovation with Synthetic Data
Our conversations with leading enterprise AI vendors across market verticals (e.g., security, telemetry, finance) tell us that, at every step along the way, lack of access to realistic and diverse data from multiple deployments hampers innovation. For example, products are trained on data that is not representative of customer environments; there is no way to quantitatively assess products; machine learning workflows experience data drift; and product feedback is not quantitative. The result today is poor products, a lack of transparency, considerable effort spent on debugging, reproduction, and resolution, and no way to share insights across customers.
As part of the DataFuel project, we have been leading research that demonstrates the feasibility of using synthetic data generated by Generative Adversarial Networks (GANs) to address these pain points for various tasks (e.g., telemetry, anomaly detection, model training). We have identified and addressed key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By combining domain-specific insights with recent advances in machine learning and privacy, we identify design choices that tackle these challenges.
- SIGCOMM
- AAAI. RareGAN: Generating Samples for Rare Classes. In AAAI 2022.
- ICML. Pareto GAN: Extending the Representational Power of GANs to Heavy-Tailed Distributions. In Proc. ICML 2021.
- AISTATS
- IMC. Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions. In IMC ’20: ACM Internet Measurement Conference, Virtual Event, USA, October 27-29, 2020.
- arXiv. Why Spectral Normalization Stabilizes GANs: Analysis and Improvements. CoRR 2020.