The center is fortunate to have a rich collection of problems thanks to strong relationships with practitioners and subject matter experts, and to have stable funding thanks to the vision and consideration of the department. But the third leg of that stool – data – is much trickier, and it is critical to what we do: the CINA portfolio is anchored by efforts to discover and model networks from data, and we can’t succeed without good data.
Given ethical and privacy concerns, and the relevance of so much open-source data, how do we provide researchers with data that is ethically collected, privacy-preserving, accurate, and useful? The answer is to understand exactly what the researchers need (and don’t need). For example, some data sources are loaded with personal information that is not critical to the research; those sources can be safely collected, cleansed of that information, and then shared. In other cases, the collected and possibly cleansed data can’t be openly shared but can be shared on a limited basis.
Such solutions require forethought, clear and documented policies and procedures, and technology, all of which we know how to provide. In still other cases, the necessary data just isn’t available for technical, legal, or practical reasons. In these cases, we consider proxy data (an alternate data set that mimics the critical properties of the desired data set) or synthetic data (machine-generated data that mimics those same properties).
Researchers explore, develop, and test tools and methods on the proxy or synthetic data; the findings can then be shared, tested, and applied in operational environments that do have access to the real data. In other words, we can have our cake and eat it too, but we may have to bake the cake ourselves.