Christoph Klemenjak, Andreas Reinhardt, Lucas Pereira, Mario Berges, Stephen Makonin, Wilfried Elmenreich
Proceedings of ACM BuildSys 2019, New York, NY, USA
Publication year: 2019


Real-world data sets are crucial for developing and testing signal processing and machine learning algorithms that solve energy-related problems.
Their scope and data resolution are, however, often limited to what is required to fulfill the experimenters’ objectives, and are moreover governed by personal experience, budgetary and time constraints, and the availability of equipment.
As a result, numerous differences between data sets can be observed, e.g., regarding their sampling rates, the number of sensors deployed, their amplitude resolutions, storage formats, or the availability and extent of ground-truth annotations.
This heterogeneity poses a significant problem for researchers who intend to use data sets comparatively, because of the data conversion, re-sampling, and adaptation steps this requires.
In short, there is a lack of widely agreed-upon best practices for designing, deploying, and operating electrical data collection systems.
We address this limitation by dissecting the collection methodologies used in existing data sets.
By offering recommendations for data collection, data storage, and data provision, we intend to foster the creation of data sets with increased usability and comparability, and thus a greater benefit to the community.