Abstract
Advances in GPUs have facilitated the design and execution of complex, computation-intensive deep learning models. As model complexity increases, so does the risk of encountering problems caused by very large model sizes, large individual tensors, Not a Number (NaN) values, and memory leaks. Left untreated, these problems lead to substantially longer execution times, unpredictable results, and memory leak exceptions. In this paper, we address these problems, in particular large tensor support, C++ kernel changes, and recompilation of the TensorFlow framework. We also explore issues that arise when debugging NaN values with existing debugging toolkits, as well as solutions that alleviate memory leaks. Based on the experience gained from our analysis, we propose solutions involving better tensor dimension sanity checks, alternative tensor loop procedures, different ways of applying kernels to tensors, a method for filtering debug trace files, and ways to resolve memory leak exceptions. Although these problems and solutions may apply to any complex, computation-intensive deep learning model, we describe how we encountered them in a specific use case: a deep learning model for activity and gesture recognition from radio data, designed to mitigate the domain shift problem.
Original language | English |
---|---|
Title of host publication | 2023 10th International Conference on Wireless Networks and Mobile Communications (WINCOM) |
Editors | Khalil Ibrahimi, Mohamed El Kamili, Abdellatif Kobbane, Ibraheem Shayea |
Publisher | Institute of Electrical and Electronics Engineers |
Number of pages | 7 |
ISBN (Electronic) | 979-8-3503-2967-4 |
ISBN (Print) | 979-8-3503-2968-1 |
Publication status | Published - 22 Nov 2023 |
Event | International Conference on Wireless Networks and Mobile Communications, Istanbul, Turkey. Duration: 26 Oct 2023 → 28 Oct 2023. Conference number: 10. https://www.wincom-conf.org/WINCOM_2023/ |
Conference
Conference | International Conference on Wireless Networks and Mobile Communications |
---|---|
Abbreviated title | WINCOM |
Country/Territory | Turkey |
City | Istanbul |
Period | 26/10/23 → 28/10/23 |
Keywords
- deep learning
- high performance computing
- resource complexity
- kernel function
- exception analysis