Blake Tree Hashing

One feature Skein has, but Blake is lacking is tree hashing. The kind of tree I’ll describe here has continuous leaves, so it is suitable for incremental updates/verification. For parallel hashing other constructions, such as interleaving, might be superior.

I’m using the Blake compression function without any changes, but I change the counter scheme. The counter scheme combines ideas from my earlier Alternative Padding suggestion and the UTF-8 encoding. I’ll only describe the Blake256 variant here, but generalization to Blake512 is straightforward.


The tree hash takes three parameters:

  • The LeafSize. A power of two, at least 4.
  • The FanOut. A power of two, at least 2
  • The maximum tree high MaxHeight, at least 2.

As with my earlier proposal, bit 63 marks the final block of a layer, and bit 62 indicates the presence of bit padding. The remaining 62 bits serve as a combined layer indication and position.

The number of leading 1 bits indicates the layer. No leading 1 bit corresponds to leaves etc. The remaining bits count the length of the message in bytes up to and including the current block.

For the leaf layer divide the message into LeafSize * HashSize pieces. For the higher layers divide the message into FanOut * HashSize pieces.
Hash each piece, producing one hash that’ll become part of the next layer. Don’t reset the counter when going to the next piece, only reset the chaining value to the IV.

If you reach MaxHeight, hash the whole lower layer in one go, instead of splitting it into pieces.

Repeat until the length of a layer is equal to the HashSize. That layer is the root hash.


TreeHash(Message, LeafSize, FanOut, MaxHeight)
    requires (LeafSize >= 4) and IsPowerOfTwo(LeafSize)
    requires (FanOut >= 2) and IsPowerOfTwo(FanOut)
    requires MaxHeight >= 2

    const HashSizeInBytes = 32
    const BlockSizeInBytes = HashSizeInBytes * 2
    const CounterSizeInBits = 64

    CurrentLayer = Message
    if Message.LengthInBits==0
        return BlakeCompress(IV, BlockSizeInBytes zero-bytes, 2**(CounterSizeInBits-1))
    LayerIndex = 0
        NextLayer = ""
        if LayerIndex == 0
            PieceSizeInBytes = LeafSize * HashSizeInBytes
            PieceSizeInBytes = FanOut * HashSizeInBytes
        if (PieceSizeInBytes<CurrentLayer.LenghInBytes) or (LayerIndex == MaxHeight - 1)
            PieceSizeInBytes = CurrentLayer.LengthInBytes
        for(pieceIndex=0; pieceIndex < Ceiling(CurrentLayer.LengthInBits/(PieceSizeInBytes*8)); pieceIndex++)
            State = IV
            Piece = CurrentLayer[pieceIndex*PieceSize, PieceSize]
            if CurrentLayer.LengthInBits % 8 != 0 //Add bit padding
            while Piece.LengthInBits % (BlockSizeInBytes * 8) != 0 //Pad to block size
            for(blockIndex=0; blockIndex<Piece.LengthInBytes/BlockSizeInBytes; blockIndex++)
                Counter = Min(
                    pieceIndex*PieceSize + (blockIndex + 1) * BlockSizeInBytes,
                if Counter == Ceiling(CurrentLayer.LengthInBits/8)//Last block of layer
                    Counter.SetBit(CounterSizeInBits - 1)
                    if CurrentLayer.LengthInBits % 8 != 0 //Bit padding was used
                        Counter.SetBit(CounterSizeInBits - 2)
                for(i=0; i< LayerIndex; i++)//Set the `LayerIndex` most significant bits in Counter
                    Counter.SetBit(CounterSizeInBits - (3 + i));
                State = BlakeCompress(State, Piece[blockIndex * HashSizeInBytes, HashSizeInBytes], Counter)
        CurrentLayer = NextLayer
    until CurrentLayer.LengthInBytes == HashSizeInBytes)
    return CurrentLayer

(The division sign / denotes exact division, not c style truncating division, = is assignment, and == equality test)


HashSizeInBytes = 32
FanOut = 2
LeafSize = 4
MaxHeight = 255
Message.LengthInBits = 2500

Each block corresponds to a compression function call, the number in it is the counter as a hexadecimal number.

| B000 ... 0040 |
        |                                                                               |
+---------------+                                                               +---------------+
| 2000 ... 0040 |                                                               | A000 ... 0060 |
+---------------+                                                               +---------------+
        |                                                                               |
        +-------------------+-------------------+-------------------+                   +
        |                   |                   |                   |                   |
+---------------+   +---------------+   +---------------+   +---------------+   +---------------+
| 0000 ... 0040 |   | 0000 ... 0080 |   | 0000 ... 00C0 |   | 0000 ... 0100 |   | C000 ... 0139 |
+---------------+   +---------------+   +---------------+   +---------------+   +---------------+

Properties/Open Questions

  • The Blake256 based variant can hash 2^64-8 bits or 2^61-1 bytes
  • I didn’t describe a configuration block yet. Without such a block, trees with different parameters might interact in undesirable ways.
  • Should we reduce the minimum LeafSize to 2 * HashSize? That’d reduce the maximum input size or adds additional complications.
  • Can it be extended to include interleaved hashing in an elegant way?

Alternative Blake Padding

BLAKE is one of the SHA-3 finalists, and IMO the one most likely to win. So I’ve been looking into it, to see how well it fits my usecases.

I like the design of its compression function, but I don’t like how it handles padding. Its padding adds at least 9 bytes to each message, which can cause an additional compression function call. Most annoyingly for me, message sizes that are powers of two always cause this additional call. This is particularly annoying in tree-hashing, a binary tree with Blake256 in my case, where inner nodes have a size of 64 bytes, and leaves have a slightly bigger power-of-two size, such as 1024. Now without padding the overhead of tree-hashing is one compression function call per leaf. With the original padding it’s three calls, tripling the overhead from around 6% to around 19%.

Definition of the alternative Padding

I’ll suggest an alternative padding, that’s more efficient for certain message lengths, and IMO simpler. I’m not a good cryptographer, so it’s very well possible that this padding is totally insecure. If so, please tell me ;)

Blake256 uses a 64 bit counter variable, that counts the bits of the message up to the current block. Now my suggestion is turning this into a 62 bit counter that counts the bytes of the message and using the other two bits as flags. Bit 63 marks the final block and bit 62 on the final block indicates that bit padding was used.

For each block except the final block, set the counter to the number of bytes up to this point, i.e. to (i+1)*64. For the final block there are three different cases:

  1. If the message length is zero, set the block data to all zero, and the counter to 2^63. This is similar to case 2), except that the padding consists of 512 bits, instead of 0 bits.
  2. If the message length is an integral number of bytes, set the counter to 2^63 | MessageLengthInBytes and pad the message to the next multiple of 512 bits. No padding required if it’s already a multiple of 512 bits.
  3. If the message length isn’t an integral number of bytes, append a single 1 bit, set the counter to 2^63 | 2^62 | MessageLengthInBytes and pad the message to the next multiple of 512 bits. The partial byte counts as a full byte in MessageLengthInBytes.

Then hash the final block.

Consequences of the alternative Padding

  1. The counter values are still unique
  2. The final block is clearly tagged, preventing length extension
  3. The only case where more than Ceiling(MessageLength/BlockSize) compression function calls are needed is the empty message.
  4. The padding is simpler for the common case, where the message consists of bytes. Simply pad with zeros, and set a bit in the counter. It isn’t much more complicated in the bitstream case.
  5. The maximal message length is slightly larger, 2^62-1 bytes = 2^65-8 bits instead of 2^64-1 bits. That difference is irrelevant in practice, but it obviously still fulfills the NIST length requirement.

It’s trivial to transfer this padding to the 128 bit counter Blake512 uses. Just use a 126 bit byte-counter, and use bits 126 and 127 as the flags.

CurveCP Review – Part 1 Crypto

CurveCP Review – Part 1 Crypto

CurveCP is a secure transport protocol, that can be used instead of the common TCP+TLS combination. This article assumes familiarity with CurveCP. You should at least read the overview and the packets description.

In this post I’ll look only at the cryptographic part of CurveCP, including the implementation hints.
But I’ll ignore engineering issues, such as the choice of UDP over TCP, congestion control and the per-message overhead. I’ll probably talk about these in a later post.

I’m not a professional cryptographer, just a programmer with an interest in crypto.
So it’s easily possible that my “improvement” suggestions actually make the protocol less secure.

I like the basic design of CurveCP a lot:

  • It’s much simpler than TLS
  • It always offers forward secrecy
  • It’s fast without needing to resort to ugly hacks like session resumption. Each connection costs the server four Curve25519 operations.
  • It doesn’t use certificates, but assumes that you obtained the public key through a different channel, such as DNS. This fits my usage, but might be a problem for others.

    Adding an additional step, where the client requests the server’s certificate is easy. This can happen in the clear before the actual CurveCP connection is established. This step would add one round-trip latency, but can be cached. It’s stateless for the server.

  • Client authentication is done using asymmetric crypto, where it keeps the client’s public key confidential. (TLS on the other hand sends the client certificate in plain, unless you use renegotiation.)
  • Connections are not bound to an IP. For example they’ll survive switching between W-LAN and mobile networks without connection interruptions.

Issue 1: Nonces

CurveCP relies heavily on nonces. It uses a “box” operation that combines the stream cipher XSalsa20 with the Poly1305 MAC. Poly1305 is a Wegman-Carter style MAC, i.e. it is based on universal hashing.

As the name implies, a nonce should only be used once for each key. One interesting question here is what happens if a nonce is reused by accident:

  • The stream cipher will leak the xor of the plaintexts. In many cases this allows the reconstruction of both plaintexts.
  • The MAC is broken as well, since universal hashing relies on unique nonces

CurveCP uses nonces in two contexts:

  • 64 bit counters with short term keys. If distinct short term keys are used, nonce reuse seems very unlikely. The damage is limited to the confidentiality and integrity of the sessions with reused keys. It’s possible that confidential data from such a compromized session can be used in further attacks on higher level protocols, or thanks to Issue 2 even on CurveCP itself.
  • 128 bit nonces with long term keys.

For the long term nonces one obvious approach is using random 128 bit values. I see no problems with that approach, provided the RNG works correctly. A good RNG is already required for the short-term keys, so relying on a RNG for nonces doesn’t seem to make the situation much worse.

The CurveCP website on the other hand, recommends using a persistent counter. IMO that’s a very fragile approach, and it causes considerable annoyances in deployment. For example:

  • When sharing a key between multiple computers, you need to implement nonce-separation. i.e. you need to make sure that the counters work on disjoint intervals
  • Restoring from backup, or cloning a virtual machine can cause counter-rollbacks.

Thus I believe random nonces are preferable. I’d also consider switching to a nonce-less MAC(such as HMAC) for authentication with a long-term key.

Edit: I couldn’t find any practical attack, even if nonce reuse happens in the context of the CurveCP handshake. In particular I found no way to impersonate the server in such a situation, since to do that you’d need to know S', the server’s short-term publickey. If the server reuses both short-term key and nonce, impersonation becomes possible.

Issue 2: Client Authentication

The client authenticates by sending a value V = Box[C'](C->S) to the server. I think that this is a bad choice.

Problem 1: If an attacker learns client short term secret c’ of a single session, he will be able to impersonate that user to that server at any point in the future. This shouldn’t happen normally, but I think higher robustness for this failure case would be nice.

The attacker just opens a connection, using the same C' and V. Since V does not depend on S', it’s still valid.

One scenario where this is a relevant attack is keeping the long term key in a secure hardware module, but handling the actual session from a lower trust computer.

This problem could be fixed by making the authenticated message dependent on the server’s short term key, which cannot be chosen by the attacker:
V = Box[C',S'](C->S). This adds a per-connection overhead of 32 bytes, which is negligible. It’s also very unlikely that this introduces any vulnerabilities not present in the original protocol.

Problem 2: If the server’s long term key gets compromized, it’s possible to impersonate arbitrary clients to it.

Anybody who knows s, can trivially calculate the key C->S, and impersonate arbitrary clients. Once again this should not happen, but makes a failure much bigger than it should be.

To fix both problems at the same time, we could use S' instead of S as the second key. i.e. V = Box[C',S](C->S') or V = Box[C'](C->S').

Personally I’d like to include all four public keys and the servername into this authentication, and I’d replace crypto_box with a simple MAC, such as HMAC with KeyExchange(C->S') as key. Switching to HMAC avoids nonce issues, and it also reduces the size of V to 16 bytes. Including additional information, such as the servername allows a secure hardware device to ask a question like “Do you want to authenticate to ServerX?”. If ServerX is your bank’s website when you only want to check your mails, you notice that something is wrong.

Interestingly, in some contexts, such as an OTR like protocol, using C->S' instead of C->S is undesirable, since it weakens deniability for the client. But I still think it’s preferable in most situations.

Issue 3: Replay Attacks

This is an implementation pitfall, not an inherent problem of the protocol.

A naive server will accept an Initiate-Packet, as long as the associated minute-key is still valid, allowing an attacker to replay a full connection. To avoid this, you need to put C' on a blacklist when initiating a connection. You can expire a blacklist entry when the associated minute key expires. It’s not enough to just prevent connections with a client short-term key that’s already in use (You need to do that too, since C' doubles as connection identifier).

Edit: Locking at the implementation in NaCl-20110221 (an alpha version), it seems to be vulnerable to this. A related code section has the comment /* XXX: cache cookie if it's recent */, which when implemented would probably fix the issue.


My conclusion is that CurveCP has one actual undesirable property in the protocol: The weak client authentication. The other issues can be avoided by a careful implementation.