Non-Intrusive Load Monitoring (NILM) techniques are becoming a key instrument for identifying the power consumption of individual appliances from a single metering point, and deep learning models in particular are attracting growing interest for this task. However, the challenges posed by NILM datasets and the absence of common experimental guidelines compromise comparability, research transparency, and replicability. The limited adoption of efficient research tools and the lack of best-practice guidelines contribute substantially to this problem, since no features are offered that encourage standardized formats for benchmarking and sharing results. To address these issues, we first present a brief overview of recent best practices for Deep Learning (DL) and highlight how deep NILM research can benefit from them. We then propose Deep-NILMTK, a novel open-source toolkit that leverages these practices. The toolkit offers a common test bed for NILM algorithms, independent of the underlying deep learning framework, with a modular NILM pipeline that can easily be customized. Deep-NILMTK also introduces the concept of Experiment Templating, which provides pre-designed experiments to improve research efficiency. Leveraging this concept and DL best practices, we present a case study on creating an online NILM benchmark repository.