Project DataFuel

Enabling Data-driven Innovation with Synthetic Data

Our conversations with leading enterprise AI vendors across market verticals (e.g., security, telemetry, finance) tell us that, at every step along the way, lack of access to realistic and diverse data from multiple deployments hampers innovation: products are trained on data that is not representative of customer environments, there is no way to quantitatively assess products, machine learning workflows experience data drift, and product feedback is not quantitative. The result today is poor products, a lack of transparency, considerable effort spent on debugging, reproduction, and resolution, and no way to share insights across customers.

As part of the DataFuel project, we have been leading research demonstrating the feasibility of using synthetic data generated by Generative Adversarial Networks (GANs) to address these pain points for various tasks (e.g., telemetry, anomaly detection, model training). We have identified and addressed key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges.

Code on GitHub

  1. SIGCOMM
    Practical GAN-based Synthetic IP Header Trace Generation using NetShare
    Yin, Yucheng, Lin, Zinan, Jin, Minhao, Fanti, Giulia, and Sekar, Vyas
    In SIGCOMM 2022
  2. AAAI
    RareGAN: Generating Samples for Rare Classes
    Lin, Zinan, Liang, Hao, Fanti, Giulia, and Sekar, Vyas
    In AAAI 2022
  3. ICML
    Pareto GAN: Extending the Representational Power of GANs to Heavy-Tailed Distributions
    Huster, Todd, Cohen, Jeremy, Lin, Zinan, Chan, Kevin, Kamhoua, Charles, Leslie, Nandi O., Chiang, Cho-Yu, and Sekar, Vyas
    In Proc. ICML 2021
  4. AISTATS
    On the Privacy Properties of GAN-generated Samples
    Lin, Zinan, Fanti, Giulia, and Sekar, Vyas
    In Proc. AISTATS 2021
  5. IMC
    Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
    Lin, Zinan, Jain, Alankar, Wang, Chen, Fanti, Giulia C., and Sekar, Vyas
    In IMC ’20: ACM Internet Measurement Conference, Virtual Event, USA, October 27-29, 2020
  6. arXiv
    Why Spectral Normalization Stabilizes GANs: Analysis and Improvements
    Lin, Zinan, Sekar, Vyas, and Fanti, Giulia C.
    CoRR 2020