With the growing interest in human-machine interfaces (HMI), increasing effort is being devoted to always-on, low-cost touchless control solutions for Internet-of-Things (IoT) edge devices. In this study, we explore near-audio ultrasound for in-air ultrasonic gesture design and recognition. We propose beamforming followed by stages of feature extraction and a Temporal Convolutional Network (TCN) for classification. The study is applied to a small-form-factor concentric hexagonal array of 7 microphones, where a beamforming stage is leveraged for spatial feature extraction and for fusing ultrasonic gesture features from different angles. With such a limited number of microphones, we show that a customized Filter-and-Sum (FaS) beamformer with a set of 5-tap filters is well suited to this application. We optimize the beamformer by fitting a fixed beam in the ultrasonic frequency domain, making the ultrasonic band of interest (18 kHz - 24 kHz) available for use. As a hand gesture is performed near the array, the beamformer generates parallel readings in time of Doppler shifts from a set of assigned angles. A TCN of only 10k parameters classifies these parallel readings into predefined symbols to build a gesture alphabet. The TCN operates in two modes that share the same structure, with the option to switch between them by loading different sets of coefficients. Features from concatenated beamformed frequency points are learned, achieving a per-symbol classification accuracy in the range of 92%-100% on a test set, visualized as a normalized confusion matrix. The proposed system gives users a degree of flexibility: gesture diversity can be obtained by grouping the trained symbols from the built alphabet in a post-training design stage. This paves the way for flexible, intuitive, and easy-to-remember gestures.
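To make the FaS stage concrete, the following is a minimal NumPy sketch of a Filter-and-Sum beamformer for a 7-microphone array with 5-tap per-channel FIR filters, matching the dimensions described above. The function name, the random placeholder data, and the placeholder filter coefficients are illustrative assumptions; in the actual system the taps would be fitted offline to shape a fixed beam over the 18 kHz - 24 kHz band.

```python
import numpy as np

def filter_and_sum(frames, filters):
    """Filter-and-Sum beamformer sketch: each microphone channel is
    passed through its own short FIR filter and the filtered channels
    are summed into a single beamformed output.

    frames  : (M, N) array, M microphone channels of N samples
    filters : (M, L) array, one L-tap FIR filter per microphone
              (here M = 7 mics and L = 5 taps, as in the text)
    Returns a beamformed signal of length N.
    """
    M, N = frames.shape
    out = np.zeros(N)
    for m in range(M):
        # mode="same" keeps the output aligned with the input length
        out += np.convolve(frames[m], filters[m], mode="same")
    return out

# Toy usage with placeholder data and coefficients (hypothetical values,
# not the fitted taps from the paper).
rng = np.random.default_rng(0)
frames = rng.standard_normal((7, 256))   # 7 mics, 256 samples
filters = rng.standard_normal((7, 5))    # 5-tap FIR filter per mic
y = filter_and_sum(frames, filters)
print(y.shape)  # (256,)
```

Per assigned steering angle, a different set of fitted taps would be loaded and the same sum performed, yielding the parallel angular readings that feed the feature-extraction and TCN stages.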