Analysis of the communication protocol between the Hadoop client and the datanode

  • 2020-04-01 01:10:14
  • OfStack

This article analyzes how the Hadoop client reads and writes blocks, the protocol used between the client and the datanode, and the format of the data stream.

The Hadoop client communicates with the namenode through RPC, but it does not use RPC to communicate with the datanode. Instead, it uses sockets directly, and the read protocol differs from the write protocol. This article analyzes the principle and protocol of client-datanode communication in Hadoop 0.20.2 (which is the same in 0.19). Note that in 0.23 and later versions, the communication protocol between the client and the datanode was changed to use protobuf for serialization.

Writing a block

1. The client first asks the namenode to create the file via namenode.create, and then starts the dataStreamer thread.

2. The client uses three threads. The main thread reads the local data into memory, wraps it in Packet objects, and puts them in the dataQueue.

3. The dataStreamer thread checks whether the dataQueue contains a packet. If it does, the thread first creates a BlockOutPutStream object (created once per block; a block may contain multiple packets). A third thread (ResponseProcessor in the 0.20.2 source) receives the ack acknowledgements from the datanode and handles errors.

4. The dataStreamer thread takes Packet objects from the dataQueue and sends them to the datanode.
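The three roles described above can be sketched with two queues. This is a minimal illustration, not the DFSClient code: the class WritePipelineSketch, its Packet type, the ackQueue, and the END sentinel are hypothetical names, and the "send" step just hands the packet to the responder instead of writing to a datanode socket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class WritePipelineSketch {
    // Hypothetical stand-in for the client's packet: a sequence number plus payload.
    static final class Packet {
        final long seqno;
        final byte[] data;
        Packet(long seqno, byte[] data) { this.seqno = seqno; this.data = data; }
    }

    /** Runs the three roles and returns the acknowledged sequence numbers in order. */
    static List<Long> run(int packetCount) {
        BlockingQueue<Packet> dataQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Packet> ackQueue = new LinkedBlockingQueue<>();
        List<Long> acked = new ArrayList<>();
        Packet END = new Packet(-1, new byte[0]); // sentinel marking the end of the stream

        // Streamer role: take packets off dataQueue and "send" them.
        Thread streamer = new Thread(() -> {
            try {
                while (true) {
                    Packet p = dataQueue.take();
                    ackQueue.put(p); // a real client would write p to the datanode socket here
                    if (p == END) return;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Responder role: record an acknowledgement for every packet that was sent.
        Thread responder = new Thread(() -> {
            try {
                while (true) {
                    Packet p = ackQueue.take();
                    if (p == END) return;
                    acked.add(p.seqno);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        streamer.start();
        responder.start();
        try {
            // Main-thread role: package the data and enqueue it.
            for (long i = 0; i < packetCount; i++) {
                dataQueue.put(new Packet(i, new byte[512]));
            }
            dataQueue.put(END);
            streamer.join();
            responder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while streaming", e);
        }
        return acked;
    }

    public static void main(String[] args) {
        System.out.println(run(3)); // prints [0, 1, 2]
    }
}
```

Because each queue has a single producer and a single consumer, packets are acknowledged in the order they were enqueued. Error handling and retransmission are left out of the sketch.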

The following figure shows the flow of the write block.

(link: http://images.cnblogs.com/cnblogs_com/shenh062326/201211/201211201330081274.png)

The following figure shows the format of the message

(link: http://images.cnblogs.com/cnblogs_com/shenh062326/201211/201211201330148587.png)

Reading a block

This is mainly implemented in the BlockReader class.

When newBlockReader initializes the reader:

1. Create a new SocketOutputStream(socket, timeout) from the sock parameter that was passed in, and then write the request header. This header is not the same as the header used when writing a block.

    // write the header.
    out.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
    out.write(DataTransferProtocol.OP_READ_BLOCK);
    out.writeLong(blockId);
    out.writeLong(genStamp);
    out.writeLong(startOffset);
    out.writeLong(len);
    Text.writeString(out, clientName);
    out.flush();

2. Create a new SocketInputStream(socket, timeout)

3. Check the returned status: the read fails if in.readShort() != DataTransferProtocol.OP_STATUS_SUCCESS

4. Create the checksum from the input stream: DataChecksum checksum = DataChecksum.newDataChecksum(in)

5. Read the offset of the first chunk: long firstChunkOffset = in.readLong()

Note: a 4-byte checksum is calculated for each 512-byte chunk.

6. Next, read the specific data in BlockReader's read method: result = readBuffer(buf, off, realLen)

7. Read chunk by chunk:

    int packetLen = in.readInt();
    long offsetInBlock = in.readLong();
    long seqno = in.readLong();
    boolean lastPacketInBlock = in.readBoolean();
    int dataLen = in.readInt();
    IOUtils.readFully(in, checksumBytes.array(), 0,
                      checksumBytes.limit());
    IOUtils.readFully(in, buf, offset, chunkLen);

8. Verify the checksum after reading the data: FSInputChecker.verifySum(chunkPos)
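To make the byte layout in the steps above concrete, the following sketch serializes a read request and a packet header into byte arrays using only the JDK. It is an illustration under assumptions, not Hadoop code: ReadBlockSketch and its methods are hypothetical, the two constants are placeholders rather than the real DataTransferProtocol values, the client name is written with a simplified one-byte length prefix standing in for Text.writeString, and checksumBytes assumes one checksum per chunk as described in the note.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReadBlockSketch {
    // Placeholder values for illustration only, NOT the real DataTransferProtocol constants.
    static final short ASSUMED_VERSION = 1;
    static final byte ASSUMED_OP_READ_BLOCK = 2;

    /** Packet header fields, in the order step 7 reads them. */
    static final class PacketHeader {
        final int packetLen;
        final long offsetInBlock;
        final long seqno;
        final boolean lastPacketInBlock;
        final int dataLen;

        PacketHeader(int packetLen, long offsetInBlock, long seqno,
                     boolean lastPacketInBlock, int dataLen) {
            this.packetLen = packetLen;
            this.offsetInBlock = offsetInBlock;
            this.seqno = seqno;
            this.lastPacketInBlock = lastPacketInBlock;
            this.dataLen = dataLen;
        }
    }

    /** Serializes the read request in the field order listed in step 1. */
    static byte[] encodeReadRequest(long blockId, long genStamp, long startOffset,
                                    long len, String clientName) {
        byte[] name = clientName.getBytes(StandardCharsets.UTF_8);
        if (name.length > 127) {
            throw new IllegalArgumentException("client name too long for this sketch");
        }
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeShort(ASSUMED_VERSION);
            out.write(ASSUMED_OP_READ_BLOCK);
            out.writeLong(blockId);
            out.writeLong(genStamp);
            out.writeLong(startOffset);
            out.writeLong(len);
            out.writeByte(name.length); // simplified stand-in for Text.writeString
            out.write(name);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a packet header the way a sender would, so the decoder can be exercised. */
    static byte[] encodePacketHeader(PacketHeader h) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(h.packetLen);
            out.writeLong(h.offsetInBlock);
            out.writeLong(h.seqno);
            out.writeBoolean(h.lastPacketInBlock);
            out.writeInt(h.dataLen);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the header fields back in the same order as step 7. */
    static PacketHeader decodePacketHeader(byte[] raw) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
            int packetLen = in.readInt();
            long offsetInBlock = in.readLong();
            long seqno = in.readLong();
            boolean lastPacketInBlock = in.readBoolean();
            int dataLen = in.readInt();
            return new PacketHeader(packetLen, offsetInBlock, seqno, lastPacketInBlock, dataLen);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Checksum bytes that accompany dataLen bytes, assuming one checksum per chunk. */
    static int checksumBytes(int dataLen, int bytesPerChecksum, int checksumSize) {
        int chunks = (dataLen + bytesPerChecksum - 1) / bytesPerChecksum; // round up
        return chunks * checksumSize;
    }

    public static void main(String[] args) {
        byte[] request = encodeReadRequest(1L, 2L, 0L, 1024L, "client");
        System.out.println(request.length); // 2 + 1 + 4*8 + 1 + 6 = 42
        PacketHeader h = decodePacketHeader(
                encodePacketHeader(new PacketHeader(1049, 0L, 7L, true, 1024)));
        System.out.println(h.seqno + " " + h.lastPacketInBlock + " "
                + checksumBytes(h.dataLen, 512, 4)); // prints "7 true 8"
    }
}
```

The fixed part of the packet header is 25 bytes (4 + 8 + 8 + 1 + 4), which is why the reader can parse it with five sequential reads before it reaches the checksums and the data.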

