Modern digital communication relies heavily on cryptographic protection to ensure data integrity and privacy. In order to deploy state-of-the art cryptographic primitives and protocols in real-world scenarios, one needs to highly optimize software for both speed and security. This requires careful choices of high-level cryptographic parameters, low-level optimization of software on the assembly level for a given microarchitecture and the subtle interactions between high-level and low-level optimizations. This thesis considers three examples of cryptographic primitives and describes software implementations of these primitives that set new speed records. The Advanced Encryption Standard (AES) is one of the most widely used symmetric cryptographic primitives. The traditional implementation approach for AES is based on table lookups. While software based on this approach still achieves best performance for a variety of 32-bit and 64-bit architectures, it is usually vulnerable to cache-timing attacks. Another implementation approach for AES is the bitslicing technique. Not only is software based on this approach inherently protected against cache-timing attacks, on some microarchitectures it even achieves better performance. Elliptic-curve cryptography is the current state of the art of asymmetric cryptography. For elliptic-curve Di??e-Hellman key exchange, Bernstein proposed the Curve25519 function. Several speed-record-setting implementations of this function have been developed for a variety of architectures. Optimizing Curve25519 software for the Synergistic Processor Units of the Cell Broadband Engine is a particularly interesting challenge because the small integer multipliers of this architecture do not seem to make it the best-suited platform for public-key cryptography. Another use of elliptic curves in cryptography is in the construction of cryptographic pairings. In order to make pairings fast and secure, very special elliptic curves|so-called pairing-friendly curves|are required. For cryptographic pairings on the 128-bit security level Barreto-Naehrig curves of size about 256 bits are the best choice. Optimizing pairing software is more complex than optimizing, e.g., elliptic-curve Di??e-Hellman key-exchange software. The reason is that pairings involve multiple computation steps and multiple mathematical structures. A choice of parameters considering only some of these steps or structures is likely to incur high performance penalties in the other steps or structures. Evaluating the security of cryptographic primitives requires cryptanalytic effort. In many ways optimizing cryptanalytic algorithms is similar to optimizing cryptographic primitives: Careful choices of high-level algorithmic parameters such as a Pollard rho iteration function need to be combined with low-level software optimization. A major di??erence when optimizing cryptanalytic algorithms is the high degree of parallelism that requires additional under-standing in optimizing parallel algorithms and in network protocols. This thesis considers two cryptanalytical applications with very di??erent performance bottlenecks and optimization requirements. Pollard's rho algorithm is the best known algorithm to solve the elliptic-curve discrete-logarithm problem for most prime-order elliptic curve groups. Large instances of this problem, such as Certicom's challenge ECC2K130 considered in this thesis, are usually solved using a parallel version of Pollard's rho algorithm, which uses a client-server approach. The e??ciency of this approach is mainly determined by the speed of the iteration function, which runs on all clients independently in parallel. A cryptanalytical algorithm with signi??cantly more complex parallelization requirements isWagner's tree algorithm. This algorithm involves a huge amount of data for cryptographically relevant inputs; each byte of data needs to be loaded and stored several times. In a parallel environment with distributed storage data cannot be kept local: each byte also needs to be sent various times over the network. The implementation of Wagner's tree algorithm to find a collision in the toy version FSB48 of the SHA-3 round-1 candidate FSB on a cluster of 8 computers with a total of 5.5 TB of distributed storage demonstrates techniques to apply this algorithm in storage-restricted environments.
|Qualification||Doctor of Philosophy|
|Award date||24 Jan 2011|
|Place of Publication||Eindhoven|
|Publication status||Published - 2011|