Description
The following changes to big_mont_mul() (the Montgomery multiplication routine) may lead to better performance.
Currently (snv 111):
    for (i = 0; i < nlen; i++) {
        digit = rr[i] * n0;
        //digit = digit * n0;
        c = BIG_MUL_ADD_VEC(rr + i, nn, nlen, digit);
        j = i + nlen;
        rr[j] += c;
        while (rr[j] < c) {
            rr[j++ + 1] += 1;
            //j++;
            c = 1;
        }
    }
Suggested:
    BIG_CHUNK_TYPE c[BIGTMPSIZE];

    for (i = 0; i < nlen; i++) {
        //j = i + nlen;
        temp = rr + i;
        digit = *temp * n0;
        c[i] = BIG_MUL_ADD_VEC(temp, nn, nlen, digit);
    }
    for (i = 0; i < nlen; i++) {
        j = i + nlen;
        rr[j] += c[i];
        while (rr[j] < c[i]) {
            rr[j++ + 1] += 1;
            //j++;
            c[i] = 1;
        }
    }
This change reduces the dependency between computing c (with big_mul_add_vec) and adding the carry bits into rr, which should improve pipelining.
With the suggested code, building with the -fast compiler option in Sun Studio 12 (SS12) was found to give better performance.
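The sketch below is one way to check that deferring the carries does not change the result. It is not the libbignum code: the chunk type is assumed to be 32 bits, and mul_add_vec, add_carry, mont_n0, MAX_LEN and reductions_agree are hypothetical stand-ins written only for this illustration (mul_add_vec plays the role of BIG_MUL_ADD_VEC, MAX_LEN the role of BIGTMPSIZE). The reordering looks safe because carry i is only ever added at rr[i + nlen] and above, while every digit is computed from rr[0] .. rr[nlen - 1].

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_LEN 8   /* hypothetical bound, stands in for BIGTMPSIZE */

/* Stand-in for BIG_MUL_ADD_VEC: r[0..len-1] += a[0..len-1] * digit,
 * returning the carry out of the top word. */
static uint32_t mul_add_vec(uint32_t *r, const uint32_t *a, int len,
    uint32_t digit)
{
    uint64_t cy = 0;
    for (int i = 0; i < len; i++) {
        uint64_t t = (uint64_t)a[i] * digit + r[i] + cy;
        r[i] = (uint32_t)t;
        cy = t >> 32;
    }
    return (uint32_t)cy;
}

/* Add c into rr[j] and ripple any overflow upward (same loop as above). */
static void add_carry(uint32_t *rr, int j, uint32_t c)
{
    rr[j] += c;
    while (rr[j] < c) {
        rr[j++ + 1] += 1;
        c = 1;
    }
}

/* Current ordering: each carry is added right after it is produced. */
static void reduce_interleaved(uint32_t *rr, const uint32_t *nn, int nlen,
    uint32_t n0)
{
    for (int i = 0; i < nlen; i++) {
        uint32_t digit = rr[i] * n0;
        add_carry(rr, i + nlen, mul_add_vec(rr + i, nn, nlen, digit));
    }
}

/* Suggested ordering: collect all carries first, then add them. */
static void reduce_deferred(uint32_t *rr, const uint32_t *nn, int nlen,
    uint32_t n0)
{
    uint32_t c[MAX_LEN];
    assert(nlen <= MAX_LEN);    /* c[] must be able to hold nlen carries */
    for (int i = 0; i < nlen; i++) {
        uint32_t digit = rr[i] * n0;
        c[i] = mul_add_vec(rr + i, nn, nlen, digit);
    }
    for (int i = 0; i < nlen; i++)
        add_carry(rr, i + nlen, c[i]);
}

/* n0 = -n^(-1) mod 2^32 for odd n, by Newton iteration. */
static uint32_t mont_n0(uint32_t n)
{
    uint32_t inv = n;           /* correct to 3 bits for odd n */
    for (int k = 0; k < 5; k++)
        inv *= 2u - n * inv;    /* each step doubles the correct bits */
    return 0u - inv;
}

/* Returns 1 if both orderings leave rr identical with zero low words. */
int reductions_agree(void)
{
    const uint32_t nn[3] = { 0xfffffffbu, 0xffffffffu, 0xffffffffu };
    uint32_t a[7] = { 0xffffffffu, 0xffffffffu, 0xffffffffu,
                      0xffffffffu, 0xffffffffu, 0xffffffffu, 0 };
    uint32_t b[7];
    uint32_t n0 = mont_n0(nn[0]);

    memcpy(b, a, sizeof (a));
    reduce_interleaved(a, nn, 3, n0);
    reduce_deferred(b, nn, 3, n0);
    return memcmp(a, b, sizeof (a)) == 0 &&
        a[0] == 0 && a[1] == 0 && a[2] == 0;
}
```

Calling reductions_agree() from a small test driver exercises both orderings on the same input. The assert in reduce_deferred marks the one extra requirement the suggested version appears to introduce: the carry array has to be at least nlen entries long.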