We present a design methodology towards minimum-area maximum-performance designs in sub-/ near-threshold operation. Our methodology is based on a new metric called performance-per-area. Unlike conventional gate sizing, we use forward body biasing at synthesis time to render faster, smaller and more energy-efficient circuits. Our theory introduces body biasing into delay and energy models in the form of nonlinear derating functions that can easily be fitted to a technology node. The methodology is validated using an industrial microprocessor consisting of approximately 31 K gates and 3.7 K flip-flops in CMOS 90 nm. We obtain 4.2x better EDP, 3.8x higher speed and 9% smaller area than the non-body-biased counterpart.