An in-depth understanding of bufio.SplitFunc in Golang

  • 2020-06-19 10:29:03
  • OfStack

Preface

The bufio package is part of the Go standard library. It implements buffered I/O, that is, a cache that sits in front of reads and writes. Several other standard library packages that deal with I/O build on it: for example, the http package uses bufio to read and write network data, and the zip package for compressed files uses bufio to read and write file data.

SplitFunc is an important part of Go's bufio package, and one that is hard to understand. This article explains how SplitFunc works and how to implement your own, using simple examples.

An example

The bufio package defines some commonly used tools, such as Scanner. A typical use is reading what the user types on standard input. For example, here is an echo program that reads each line of user input and prints it back:


package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// ScanLines is already the default split function; it is set explicitly here for illustration.
	scanner.Split(bufio.ScanLines)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

os.Stdin implements the io.Reader interface. We create a Scanner from this reader, set its split function to bufio.ScanLines, and then loop, printing the text each time a line is read. As the saying goes, a sparrow may be small but it has all its vital organs: this little program is simple, yet it leads to today's subject, bufio.SplitFunc. Its definition looks like this:


package bufio

type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)

The official Go documentation describes it like this:

[

SplitFunc is the signature of the split function used to tokenize the input. The arguments are an initial substring of the remaining unprocessed data and a flag, atEOF, that reports whether the Reader has no more data to give. The return values are the number of bytes to advance the input and the next token to return to the user, if any, plus an error, if any.

Scanning stops if the function returns an error, in which case some of the input may be discarded.

Otherwise, the Scanner advances the input. If the token is not nil, the Scanner returns it to the user. If the token is nil, the Scanner reads more data and continues scanning; if there is no more data--if atEOF was true--the Scanner returns. If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, a SplitFunc can return (0, nil, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.

The function is never called with an empty data slice unless atEOF is true. If atEOF is true, however, data may be non-empty and, as always, holds unprocessed text.

]

All in English! So many parameters! So many return values! What a headache! I wonder whether any of you feel the same way when you come across documentation like this. That is why I decided to write this article: to explain how SplitFunc actually works, with concrete examples and in plain language, in the hope that it helps readers.
All right, enough chatter, let's get down to business!

Working mechanism of Scanner and SplitFunc


package bufio

type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)

Scanner is buffered: it maintains an underlying slice that stores the data already read from the Reader. Scanner calls the SplitFunc we set, passing it the buffer contents (data) and whether the input has been exhausted (atEOF) as arguments. The SplitFunc's job is to use these two arguments to return how many bytes the Scanner should advance (advance), the token split off from the data (token), and any error (err).

This is a two-way exchange. Scanner tells our SplitFunc what data it has scanned so far and whether it has reached the end. Based on that information, our SplitFunc returns to Scanner the result of the split and the position from which the next scan should start. An example illustrates this:


package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	// This split function only reports what it was given; it never produces a token.
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		fmt.Printf("%t\t%d\t%s\n", atEOF, len(data), data)
		return 0, nil, nil
	}
	scanner.Split(split)
	// Start with a 2-byte buffer so that the growth is easy to observe.
	buf := make([]byte, 2)
	scanner.Buffer(buf, bufio.MaxScanTokenSize)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
}

The output:

[

false 2 ab
false 4 abcd
false 8 abcdefgh
false 12 abcdefghijkl
true 12 abcdefghijkl

]

Here we set the initial size of the buffer to 2. When the buffer is too small, it is doubled, up to a maximum of bufio.MaxScanTokenSize. So the first scan reads 2 bytes, which fills the buffer. The reader has not yet reached EOF, so the split function runs and prints:

[

false 2 ab

]

The function then returns 0, nil, nil. This return value tells Scanner that there is not enough data yet, that the next scan should still start at position 0, and that it needs to keep reading from the reader. Because the buffer is full at this point, its capacity is doubled to 2 * 2 = 4. The reader still has not reached EOF, so the output is:

[

false 4 abcd

]

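These steps repeat until all of the content has been read and atEOF becomes true. Note that the split function sees all 12 bytes once with atEOF still false, because Scanner only learns that the input is finished on the following read. Since the split function never returns a token, scanner.Scan() then returns false and the loop body never prints anything.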

[

true 12 abcdefghijkl

]

After seeing this process, do you have a better idea of how SplitFunc works? If you reread the official Go documentation now, does it make a bit more sense? Below is the implementation of bufio.ScanLines, so you can explore for yourself how this function works.

ScanLines in the standard library


func ScanLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	// At EOF with no data left: there is nothing more to scan.
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	// Find the position of '\n'.
	if i := bytes.IndexByte(data, '\n'); i >= 0 {
		// Advance i + 1 bytes so that the next scan starts after the newline.
		return i + 1, dropCR(data[0:i]), nil
	}
	// The reader is exhausted but data is not empty, so return the remaining data as the final token.
	if atEOF {
		return len(data), dropCR(data), nil
	}
	// No complete line yet: ask the Scanner to read more data from the Reader.
	return 0, nil, nil
}

Reference

In-depth introduction to bufio.Scanner in Golang
