Note: This page has been temporarily removed from the main index since it will remain in "rewrite pending" state for the foreseeable future because I have no time to re-think and rewrite it.
[2004-04-17] I have started adding clarification and doing some random corrections. Obviously, a number of things have changed over the last four months that I have not updated this. Now, MT is fairly mature network-wise, I have had many insights from practical situations and will soon proceed with this overdue rewrite, most likely including some useful extension proposals from other sources along the way. (But not before the first build of my next series is out.)
While the BitTorrent protocol is simple and gets the job done, it has one major drawback: at high download rates and relatively small piece sizes, it can become woefully inefficient and wasteful. The changes below are aimed at creating a better behaved and more efficient file distribution protocol based on the BT protocol and also provide developpers some means of implementing experimental extensions with no risk of breaking or clashing with the standard protocol.
2003-12-22: Now that MoonlightTorrent(.com) is working and that the stuff below has been discussed to some extent on IRC, I have concluded that most of the stuff below will not really be necessary so this page will be rewritten in the next week or in January...
- If the compressed stream is implemented, most of the redundancy will be compressed away.
- If compressed messages is implemented, most of the same effect as full stream compression can be achieved.
What I think the BT protocol needs the most:
- Compressed socket stream - compress everything to save as much bandwidth as possible. It also provides weak encryption that would make it harder for ISPs to detect BT streams since the compressed stream would not contain any readily identifiable patterns, this should help people stuck with ISPs that do protocol-based throttling.
- Compressed messages - allows client to selectively compress messages by encapsulating them in a compressed message.
- Pack messages - hold small messages up to some number of seconds and send them all at once to reduce the amount of wasted bandwidth due to TCP/IP packet overhead when sending frequent small messages, this alone can quadruple the outbound 'chatter' traffic efficiency and does not require any special support at the other end.
- Note: Many OSes actually delay transmissions by a few miliseconds and merge consecutive small packets that are sent within that time window so the actual overhead from multiple small packets should generally be less than the worst case scenario used here.
- Reduced message sizes - remove the redundant 'size' field since all messages other than Piece have known fixed lengths.
- Notes:
- Without compression, the reduced message size can reach chatter efficiency above 70%
- With compression, most of the redundancy is compressed away, making the reduced message sizes mostly irrelevant.
- Full-stream compression would make it possible to transfer a file using less bandwidth than the torrent's size itself, potentially leading to better-than-100% overall network bandwidth efficiency.
My main gripe with the BT protocol has to do with the following from the BT protocol specifications at http://bitconjurer.org/BitTorrent/protocol.html
The 'have' message's payload is a single number, the index which that downloader just completed and checked the hash of.
Why is this a problem? Well, for one thing, Have messages are by far the most frequently sent messages and when they are the only thing being sent in one TCP packet, that packet will contain:
- IP header, 20 bytes (more if options are used)
- TCP header, 20 bytes (more if options are used)
- BT message length, 4 bytes
- BT message type, 1 byte
- BT piece index, 4 bytes
This means each individual Have message carrying a 5 bytes payload generates up to 44 bytes of overhead, this is just over 10% in network traffic efficiency.
A single Have packet every now and then is fine... now, consider the case, where you have 100 peers while downloading at 360KB/s and the torrents have standard 256KB pieces...
- With an upstream capable of about 25KB/s max, TCP send timeout should be around 60ms
- 360KB/s / 256KB/piece = 1.4 piece/s = 710ms between Have
- 1.4 piece/s * 100 peers to send Have messages to = an average of 140 Have messages per second
- If there is nothing else to send to these 100 peers, each Have message will be in its own TCP packet, which means they will individually come with the full 40 bytes TCP/IP overhead (more if the OS uses some optional TCP/IP fields like QoS and TimeStamp)
- 140 packets/s * (40 bytes TCP/IP overhead per packet + 9 bytes actual payload) = 6.8KB/s in Have traffic alone
- 100 Clients sending 1 Have per second = 100 TCP ACK packets to send, each worth another 40 bytes a piece if nothing else needs to be sent. That is another ~4KB/s worth of potential TCP/IP overhead.
- So, it is very possible to waste 10KB/s or more in Have and ACKs alone by implementing the BT protocol 'by the specs'. (3Mbps downstream cable internet access is standard fare in my area... has been for the last six years. However, reliable upstream bandwidth beyond 256kbps is a different story and even 160kbps sometimes is too much to ask for.)
2003-12-21: If you think this theory looks bad, it actually gets worse in practice. I have started using Performance Monitor to keep an eye on NIC traffic while transferring files using BT. It turned out that actual overhead in a case similar to the above can reach 20KB/s, meaning my assumption that Haves accounted for most of the 'waste' may be marginally accurate since they account for only half the measured overhead.
Behavioral Changes:
The MoonlightTorrent Protocol - A More Efficient Torrent Protocol:
These are meant to avoid unnecessary messages, reduce message frequency and reduce the frequency at which the TCP/IP bandwidth 'tax' is paid without changing the protocol definition.
Some clients send KeepAlive messages within seconds of sending something else. Since receiving any valid message from a peer can be considered as an implicit KeepAlive, the extra KeepAlive is very much unnecessary. Thankfully, even with 500 connected clients sending KeepAlives every minute, this still accounts for less than 0.5KB/s including TCP/IP overhead. However, it does not make these any less unnecessary and wasteful. Since the clients are also supposed to keep advertising their [dis]interest, there is hardly any justification to using the KeepAlive message.
- Note:
- [2004-04-17] One exception to this is when waiting for a slo/no-poke to send overdue requested data. Since I have added KeepAlive for this condition (every 10 seconds after waiting for a chunk more than 30 seconds), I have not noticed any effect on no-senders so I conclude KeepAlives are mostly useless.
Most OSes' TCP implementation will buffer small packets for a few miliseconds and merge them until a full-size packet can be formed or the TCP window times out. By holding small messages for up to 10 seconds and sending them all at once, the TCP overhead would be incurred once every ~10 messages instead of every single one of them, increasing the network efficiency from 10% to about 40% and the overhead bandwidth would drop from 6.8KB/s to a much more reasonable 1.8KB/s. - Note:
- The potential savings are less while downloading from multiple torrents since the download bandwidth is spread across more files so each individual torrent's completion rate will be lower than the 1.4 pieces/second used above.
[2004-04-17] The above (and pretty much everything else on this page actually) was written in January while I was more concerned with making MT work without crashing every few minutes than trying to get stable upload/download speeds. Since then, I have noticed that many BT clients already do withold small messages then send them in bursts to use the TCP/IP's small packet coalescing.
As a variant of witholding messages and pack them in one TCP packet, Have messages could be trickled.
- Send at most X Have messages, at most every Y seconds
- Send Have messages only for pieces the peer does not appear to have, start with the rarer pieces
- Send the messages in bursts to benefit from TCP coalescing, thereby reducing the amount of TCP/IP overhead generated
- Send a Field when the backlogged 'Have' messages size exceeds the Field message's size
- A typical bitfield is ~800 bits so typically 100 bytes. This is roughly 11 Have messages without considering TCP/IP overhead.
- Ensures the Field method will never generate more traffic than the packed Have messages it replaces
- Sending at most one Field message every minute ensures even higher potential economy and thorough piece status update including pieces the other end already possessed thus were not sent as Haves
- Advantages:
- Eliminates the unnecessary overhead of telling peers about pieces they already have
- Reduces overhead by ensuring that multiple Have messages get packed in one TCP/IP packet
- Ensures that peers that appear to need available pieces get notified about these newly acquired pieces regularly
- Ensures that every peer is eventually fully briefed on the client's current pieces status
- Ensures the amount of data transfered for piece status (Have + BitField) will be less than Haves alone
- Drawbacks:
- Have messages cannot be used to estimate peer bandwidth
- Work-Around: Increase the averaging period to ~2 minutes and take received Field differences into account
- Extra delays are introduced in the notification process that may slow down piece spreading
- Work-Around: None should be necessary - once the first Have messages have been sent, the upstream will be maxed out soon enough trying to keep up with requests and further Have and Field messages will ensure that this situation is maintained
- Notes:
- The official client and its derivatives disconnect clients that send a Field in mid-session so convincing the devs that mid-session fields can be a good thing as stated above is necessary before this can be used.
- Note:
- [2004-04-17] Alternately, it would be possible to probe the other client when the first extra bitfield is sent and flag it as non-multifield-friendly if this results in a disconnect.
Setup:
- 256MB file
- 256KB pieces (128 bytes field size)
- 256KB/s download speed
- 100 peers
- 10 seconds maximum between trickled Haves, only one Have per trickle
Case:
- This means 10 completed pieces per 10 second period
- This means a backlog increase of 9 (or 10 if the peer already has all the recently finished pieces) each time
- After a minute, the Have backlog is 9 bytes per message * 6 periods * 9 backlogged Haves/period = 486 bytes per peer
- With individual Have messages to each peer in real-time, the average Have overhead would be 4.8KB/s with 10% efficiency.<b
- Rate = (HaveMsg + TCP/IP) * 100 clients
- (9 + 40) * 100 = 4900 bytes/s = 4.8KB/s
- 5 bytes useful payload / (40 bytes TCP/IP overhead + 4 bytes message length + 5 bytes payload) = 10.2%
- With the witheld Have messages, the average Have overhead would be a much more reasonable 1.3KB/s with 38% efficiency
- Rate = (6 * (10 * HaveMsg + TCP/IP)) * 100 clients / 60s/min
- (6 * (10 * 9 + 40)) * 100 / 60 = 1300bytes/s = 1.3KB/s
- 10 messages/packet * 5 bytes payload / (40 + 10 * (4 bytes length + 5 bytes payload)) = 38.5%
- With the mixed witheld Have/Field messages, the pieces overhead would be a very reasonble 0.7KB/s with 38% efficiency
- Rate = (5 * HaveMsg + 1 * FieldMsg + 6 * TCP/IP) * 100 clients / 60s/min
- (5 * 9 + 128+5 + 6 * 40) * 100 / 60 = 697 bytes/s
- 5 Haves/minute + 1 Field/Minute = (5 * 5 + 128+1) / (6 * 40 + 5 * 9 bytes/Have + (128+5) bytes/Field) = 37.5%
- When no completed piece is needed by the peers over a full minute, the Field method's overhead is 0.3KB/s with 75% efficiency
- Rate = (FieldMsg + TCP/IP) * 100 clients / 60s/min
- (128+5 + 40) * 100 / 60 = 288 bytes/s = 0.3KB/s
- 129 bytes payload / (40 bytes TCP/IP + 133 bytes FieldMsg) = 74.6%
- Note:
- If you were about to write to me "You're dreaming, this case will rarely if ever happen", think about how many active clients you usually have on your list VS how many clients can end up on it. Having 100 clients on-list with only 10 active clients means 90 clients are still sent Have packets with nothing expected to go/come to/from them. All these clients need to be kept up to date but they certainly do not need to be updated in real-time.
Why is there a "Message Length" field in the BT protocol? The only reason I can think of is to increase overhead... :)
- Choke/Unchoke/Interested/Not-Interested messages all have a 1 byte payload which is the message type itself
- Request and Cancel messages are always 13 bytes
- The only message type that could need a length field to maintain stream sync would be the Piece message - pieces are expected to have the exact same size as the originally requested size so the piece size could be looked up in the requests list using the Index/Offset pair.
Also, since the small messages usually should be grouped together before sending, compression could be of great help in improving network bandwidth efficiency since much of these small packets' data is highly redundant. This compression would either take the form of a compressed socket stream or a 'compressed messages' message.
When I first considered proposing compression in BT, I was thinking about the Piece messages' payload. Much later in MT's development (roughly a month later), I started looking into effective ways of reducing bandwidth and after doing the maths from the previous section, reducing message overhead on a global scale seemed like a good idea and the simplest way of doing this is to compress the socket stream itself and here is why:
- BT message headers contain the following data:
- DWORD Message Length = 0x0000????, that's at least 16 bits that can always be compressed away right here.
- CHAR TypeID < 0x20, that's another two bits that can always be removed
- Piece messages are usually expected to be the least compressible but are almost always compressible to some extent
- Have messages are by far the most common wasteful and highly compressible small messages when packed by the dozen: Out of the 72bits that make up a standard BT Have message, at least 60 practically never change and could be typically reduced by 80% or more.
- Length = 0x00000005, 32 bits that never ever change.
- TypeID = 0x04, 8 bits that never change.
- Piece = 0x00000???, 12bits is enough for a 1GB torrent with 256KB pieces
- Fields are also reasonably good candidates for compression if they are fairly large:
- Young torrents have mostly cleared fields and therefore highly compressible.
- Mature torrents have mostly set fields and therefore highly compressible.
- Ramping torrents have quasi-random fields which may be more or less compressible but that's ok:
- The compressed field will rarely be larger than the uncompressed one
- We're not sending fields unless they appear economical compared to sending packed Haves (without considering TCP/IP overhead)
- Nothing forces a MT-supporting coder to do it this way - one could both compress the field and Haves then send whichever is smaller, preferably the field in case of a tie to ensure that the peers have an accurate pieces status.
- Advantages:
- Message overhead is compressed away, a saving of 1KB/s in the case with witheld Have messages sent in bursts on top of the already substantial 3.5KB/s reduction from simply sending them in bursts.
- Maximized use of limited available upload capacity (256kbps, often lower) of most Cable/DSL residential internet access packages.
- More upstream bandwidth left to upload faster, connect to more peers, download/seed more torrents.
- Minimal CPU usage for relatively slow links - compressing a 30KB/s stream takes hardly any CPU time.
- Stream compression also acts as weak encryption, making it harder for ISPs to detect BT traffic since packets beyond the handshake would not contain readily identifiable patterns.
- Drawbacks:
- Higher RAM usage - need extra buffers for the compression/decompression.
- Higher CPU usage - but not that much for slow links. Very fast uploaders for whom this may be an issue could leave upstream compression off.
For those who think full-stream compression is too much, MT will also have a "Compressed Messages" message. This one is specifically targetted at packing multiple specific messages then compress them as one larger message. This could be used to compress everything but Piece messages for those who think compressing Piece messages is an indisputable waste of time. Message compression will be less effective since each compressed message will be independant... unless they are made to act like a sort of sub-stream, in which case the control message compression (as compressed messages) could be better than it would be with full-stream compression.
This is fairly straightforward:
- Durring handshake, a reserved bit is set to signal MT protocol capability and the following extra handshake steps occur after BT-style handshaking is completed:
- Extended handshaking occurs only on the first connection with a given PeerID
- All extended capabilities are initially disabled
- The initiator sends a list of supported MT message types using the "Capabilities" message.
- The receiver replies with the Capabilities subset from the received list it also supports
- Successfully negotiated extended capabilities (supported by both sides) are enumerated and enabled
- The received list from the receiver on the initiator's side
- The list sent to the initiator on the receiver's side
- Messages beging with the message type ID byte
- Message data (if any necessary) follows
- Changed messages:
- MT Piece messages have four fields: (Note: Request and Cancel messages are Data-less Piece headers)
- DWORD Piece Number <-- Must match a pending request
- DWORD Data Offset <-- " "
- DWORD Data Length <-- MT-specific, "
- CHAR[] Data
- New messages: (MT-specific messages start at 0x40)
- MT Get Extended Capabilities (MsgID: 0x40), should be used only during the first handshake with any given PeerID.
- DWORD MTProtoVersion = 0x00000000 <-- Nobody should play with this, only for official MT protocol extensions. Unofficial and experimental extensions should use the "Extended Capabilities Negociation/Enumeration" system which allows nearly FFA collision-free extensibility.
- DWORD CapListLength
- CHAR[] szCapTag_0
- CHAR[] szCapTag_1
- ...
- CHAR[] szCapTag_n
- MT Get Client Name String (MsgID:0x41), should usually be used only once per PeerID per Torrent.
- CHAR Name Length
- CHAR[] Name
- MT Compressed Message (MsgID:0x42), contains a block of compressed data. Decompress and re-insert on top of the input stream. This may be used to compress multiple messages together when socket stream compression is disabled, most likely Fields and Haves
- DWORD Compressed block length
- CHAR[] Compressed block data
- MT Compression Set (MsgID:0x44), change upstream compression status. The upload stream should be suspended until the remote end sends a matching Ack. Should not be sent again until the Ack is received.
- CHAR Enabled (non-zero if enabled)
- MT Compression Ack (MsgID:0x45), acknowledges a change in stream compression and resume the download stream.
- CHAR Enabled (non-zero if enabled)
- MT Bad Message Type (MsgID:0x48), sent whenever an unknown message type or a bad message is received before disconnecting.
- CHAR Bad message's type ID
- Extended Capabilities messages: (Extension enumeration should start at 0x80)
This protocol extension is specifically for unofficial and/or experimental protocol extensions. Sample extension:
- Chatter (Tag: "Chat"), sends messages to a peer
- INT16 Message Length
- CHAR[] Message
- When peers handshake, "Chat" is sent in the capabilities list and if both sides support/enabled the Chatter extension, it will be assigned a protocol extension enumeration number and the two clients will be able to use the Chatter extension. Peers sending messages with invalid message types should be disconnected after sending a "Bad Message Type" message.
Setup:
- Same as the previous example.
- After a minute, the Have backlog is 5 bytes per message * 6 periods * 9 backlogged Haves/period = 270 bytes per peer due to reduced message sizes.
Case:
- With individual Have messages to each peer in real-time, the average Have overhead would be 4.4KB/s with 11% efficiency.
- Rate = (HaveMsg + TCP/IP) * 100 clients
- (5 + 40) * 100 = 4500 bytes/s = 4.4KB/s
- 5 bytes useful payload / (40 bytes TCP/IP overhead + 5 bytes payload) = 11.1%
- With the witheld Have messages, the average Have overhead would be a much more reasonable 0.9KB/s with 55.6% efficiency
- Rate = (6 * (10 * HaveMsg + TCP/IP)) * 100 clients / 60s/min
- (6 * (10 * 5 + 40)) * 100 / 60 = 900bytes/s = 0.9KB/s
- 10 messages/packet * 5 bytes payload / (40 + 10 * 5 bytes payload) = 55.6%
- With the mixed witheld Have/Field messages, the pieces overhead would be a very reasonble 0.7KB/s with 39.1% efficiency
- Rate = (5 * HaveMsg + 1 * FieldMsg + 6 * TCP/IP) * 100 clients / 60s/min
- (5 * 5 + 128+1 + 6 * 40) * 100 / 60 = 656 bytes/s
- 5 Haves/minute + 1 Field/Minute = (5 * 5 + 128+1) / (6 * 40 + 5 * 5 bytes/Have + (128+1) bytes/Field) = 39.1%
- When no completed piece is needed by the peers over a full minute, the Field method's overhead is 0.3KB/s with 76% efficiency
- Rate = (FieldMsg + TCP/IP) * 100 clients / 60s/min
- (128+1 + 40) * 100 / 60 = 284 bytes/s = 0.3KB/s
- 129 bytes payload / (40 bytes TCP/IP + 129 bytes FieldMsg) = 76.3%
Of course, with compression as mentionned previously, most of the management messages could be compressed to near nothingness and network bandwidth efficiency could certainly be much higher.
Hits since December 5, 2003:
Generated on Tue Aug 24 23:57:31 2004 for MoonlightTorrent(.com) by
1.3.8