How to read file contents in Go


This article is intended to provide a quick introduction to the many options available in the Go standard library for reading files.

In Go (as in most low-level languages and some dynamic ones, such as Node), file reads return byte streams. The advantage of not automatically converting everything to a string is that it avoids expensive string allocations, which would increase pressure on the GC.

To keep this article simple, I'll use string(arrayOfBytes) to convert a byte slice to a string. However, this should not be taken as a general recommendation for production code.

1. Read the entire file into memory

First, the standard library provides a variety of functions and utilities for reading file data. We'll start with the basics provided in the os package. This comes with two prerequisites:

1. The file must fit in memory.
2. We need to know the size of the file in advance in order to allocate a buffer large enough to hold it.

With a handle to an os.File object, we can query the size and allocate a byte slice of that size.


package main

import (
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    fileinfo, err := file.Stat()
    if err != nil {
        fmt.Println(err)
        return
    }

    filesize := fileinfo.Size()
    buffer := make([]byte, filesize)

    bytesread, err := file.Read(buffer)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("bytes read: ", bytesread)
    fmt.Println("bytestream to string: ", string(buffer))
}
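
One caveat with the example above: a single Read call is not required to fill the whole buffer, even when the file is big enough. If the buffer must be filled completely, one option is io.ReadFull, which keeps reading until the buffer is full or an error occurs. A minimal sketch of that variant, assuming the same filetoread.txt:


package main

import (
    "fmt"
    "io"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    fileinfo, err := file.Stat()
    if err != nil {
        fmt.Println(err)
        return
    }

    buffer := make([]byte, fileinfo.Size())

    // io.ReadFull keeps calling Read until the buffer is full. It returns
    // io.ErrUnexpectedEOF if the file turns out shorter than the buffer.
    bytesread, err := io.ReadFull(file, buffer)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("bytes read: ", bytesread)
}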

2. Read the file in chunks

Although in most cases you can read a file all at once, sometimes you want a more memory-efficient method: read the file in chunks of some fixed size, process each chunk, and repeat until finished. In the example below, the chunk size is 100 bytes.


package main

import (
    "fmt"
    "io"
    "os"
)

const BufferSize = 100

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    buffer := make([]byte, BufferSize)

    for {
        bytesread, err := file.Read(buffer)
        if err != nil {
            if err != io.EOF {
                fmt.Println(err)
            }
            break
        }
        fmt.Println("bytes read: ", bytesread)
        fmt.Println("bytestream to string: ", string(buffer[:bytesread]))
    }
}

The main differences compared to reading the file in full are as follows:

- We read until we get the EOF marker, so we add a specific check for err == io.EOF.
- We define the buffer size, so we can control the size of each "chunk". If the operating system caches the file being read, this can improve performance when used correctly.
- If the file size is not an integer multiple of the buffer size, the last iteration returns only the remaining bytes, hence the buffer[:bytesread] slicing. Under normal circumstances, bytesread equals the buffer size.

On each iteration of the loop, the internal file pointer is updated. The next Read returns the data from that offset, up to the buffer size. The pointer is not a construct of the language, but of the operating system. On Linux, this pointer is an attribute of the file descriptor. All read/Read calls (in Ruby/Go, respectively) are translated internally into system calls and sent to the kernel, and the kernel manages this pointer.
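
If you want to observe this pointer moving, a relative Seek of zero bytes reports the current offset without changing it. A small sketch, again assuming filetoread.txt exists:


package main

import (
    "fmt"
    "io"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    buffer := make([]byte, 100)
    if _, err := file.Read(buffer); err != nil && err != io.EOF {
        fmt.Println(err)
        return
    }

    // Seek with offset 0 relative to the current position returns the
    // offset the kernel is tracking for this file descriptor.
    pos, err := file.Seek(0, io.SeekCurrent)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("file pointer is now at byte: ", pos) // 100 for files of at least 100 bytes
}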

3. Read file chunks concurrently

What if we want to speed up the processing of the chunks above? One way is to use multiple goroutines! Compared to reading chunks serially, the extra work is that we need to know the offset for each goroutine. Note that ReadAt behaves slightly differently from Read when the target buffer is larger than the number of bytes remaining.

Also note that I do not limit the number of goroutines here; it is determined only by the file and buffer sizes. In practice there will likely be an upper limit; see the semaphore sketch after the example.


package main

import (
    "fmt"
    "os"
    "sync"
)

const BufferSize = 100

type chunk struct {
    bufsize int
    offset  int64
}

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    fileinfo, err := file.Stat()
    if err != nil {
        fmt.Println(err)
        return
    }

    filesize := int(fileinfo.Size())
    // Number of goroutines we need to spawn.
    concurrency := filesize / BufferSize
    // Buffer sizes that each of the goroutines below should use. ReadAt
    // returns an error if the buffer size is larger than the bytes returned
    // from the file.
    chunksizes := make([]chunk, concurrency)

    // All buffer sizes are the same in the normal case. Offsets depend on
    // the index. The second goroutine should start at 100, for example,
    // given our buffer size of 100.
    for i := 0; i < concurrency; i++ {
        chunksizes[i].bufsize = BufferSize
        chunksizes[i].offset = int64(BufferSize * i)
    }

    // Check for any leftover bytes. Add the residual number of bytes as the
    // last chunk size.
    if remainder := filesize % BufferSize; remainder != 0 {
        c := chunk{bufsize: remainder, offset: int64(concurrency * BufferSize)}
        concurrency++
        chunksizes = append(chunksizes, c)
    }

    var wg sync.WaitGroup
    wg.Add(concurrency)

    for i := 0; i < concurrency; i++ {
        go func(chunksizes []chunk, i int) {
            defer wg.Done()

            chunk := chunksizes[i]
            buffer := make([]byte, chunk.bufsize)
            bytesread, err := file.ReadAt(buffer, chunk.offset)

            if err != nil {
                fmt.Println(err)
                return
            }

            fmt.Println("bytes read, string(bytestream): ", bytesread)
            fmt.Println("bytestream to string: ", string(buffer))
        }(chunksizes, i)
    }

    wg.Wait()
}

This approach requires considerably more work than any of the previous ones:

- We create a specific number of goroutines, depending on the file size and the buffer size (100 in this case).
- We need a way to make sure we wait for all goroutines to finish. In this example, I use a sync.WaitGroup.
- Instead of breaking out of a loop, each goroutine signals that it is done. Because we defer the call to wg.Done(), it runs only when the goroutine returns.
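
As for the missing upper limit on goroutines: one common way to impose one, sketched here in isolation rather than wired into the example above, is a buffered channel used as a semaphore. maxConcurrency is a hypothetical value chosen for illustration:


package main

import (
    "fmt"
    "sync"
)

func main() {
    const tasks = 20
    // maxConcurrency is a hypothetical cap chosen for illustration.
    const maxConcurrency = 4

    sem := make(chan struct{}, maxConcurrency)
    var wg sync.WaitGroup
    wg.Add(tasks)

    for i := 0; i < tasks; i++ {
        go func(i int) {
            defer wg.Done()
            sem <- struct{}{}        // acquire: blocks while maxConcurrency slots are taken
            defer func() { <-sem }() // release the slot on return

            fmt.Println("processing chunk", i) // stand-in for the ReadAt work above
        }(i)
    }
    wg.Wait()
}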

Note: always check the number of bytes returned, and reslice the output buffer accordingly.

Reading files with Read() can take you a long way, but sometimes you need more convenience. Ruby's IO functions, such as each_line, each_char, and each_codepoint, are often used for this. We can achieve a similar purpose with the Scanner type and its associated split functions in the bufio package.

The bufio.Scanner type implements a "split" function and advances a pointer based on it. For example, on each iteration the built-in bufio.ScanLines split function moves the pointer forward past the next line break.

At each step, the type also exposes methods for retrieving the byte slice/string between the start and end positions.


package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)

    // Scan returns a boolean based on whether there's a next instance of the
    // `\n` character in the IO stream. This step also advances the internal
    // pointer to the next position (after '\n') if it did find that token.
    for {
        read := scanner.Scan()
        if !read {
            break
        }
        fmt.Println("read byte array: ", scanner.Bytes())
        fmt.Println("read string: ", scanner.Text())
    }
}

Therefore, to read the entire file line by line in this manner, you can use something like this:


package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)

    // This is our buffer now.
    var lines []string

    for scanner.Scan() {
        lines = append(lines, scanner.Text())
    }

    fmt.Println("read lines:")
    for _, line := range lines {
        fmt.Println(line)
    }
}
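
One detail the loop above glosses over: Scan returns false both at EOF and on a real error, so it is idiomatic to check scanner.Err() once the loop exits (it returns nil at a clean EOF). A minimal variant:


package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
    // Scan returns false on both EOF and real errors; Err() distinguishes
    // the two, returning nil at a clean EOF.
    if err := scanner.Err(); err != nil {
        fmt.Println(err)
    }
}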

4. Scanning word by word

The bufio package contains these basic predefined split functions:

- ScanLines (the default)
- ScanWords
- ScanRunes (useful for iterating over UTF-8 code points rather than bytes)
- ScanBytes

So, to read the file and create a list of words in the file, you can use something like this:


package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanWords)

    var words []string

    for scanner.Scan() {
        words = append(words, scanner.Text())
    }

    fmt.Println("word list:")
    for _, word := range words {
        fmt.Println(word)
    }
}

The ScanBytes split function provides the same output as the earlier Read() example. One major difference between the two is that the scanner dynamically allocates memory each time it appends to the byte/string slice. This can be mitigated by techniques such as preinitializing the buffer to a specific length and growing it only when that limit is reached. Using the same example as above:


package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanWords)

    // Initial size of our word list.
    bufferSize := 50
    words := make([]string, bufferSize)
    pos := 0

    for scanner.Scan() {
        if err := scanner.Err(); err != nil {
            // This error is a non-EOF error. End the iteration if we
            // encounter one.
            fmt.Println(err)
            break
        }

        words[pos] = scanner.Text()
        pos++

        if pos >= len(words) {
            // Grow the word list by another bufferSize entries.
            newbuf := make([]string, bufferSize)
            words = append(words, newbuf...)
        }
    }

    fmt.Println("word list:")
    // We iterate only up to "pos" because the buffer may be longer than the
    // number of words (we grow it by a constant amount), or the scanner loop
    // might have terminated prematurely due to an error. In either case
    // "pos" holds the index just past the last successful update.
    for _, word := range words[:pos] {
        fmt.Println(word)
    }
}

As a result, the slice has to "grow" far fewer times, but depending on the buffer size and the number of words in the file we may end up with empty slots at the end; this is a tradeoff.
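
An alternative sketch that avoids both the manual bookkeeping and most reallocations: allocate the slice with zero length but nonzero capacity, and let append fill it. append only reallocates once the capacity is exhausted:


package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanWords)

    // Zero length, capacity 50: append reuses this backing array until more
    // than 50 words have been read, then grows it automatically.
    words := make([]string, 0, 50)
    for scanner.Scan() {
        words = append(words, scanner.Text())
    }

    fmt.Println("word count: ", len(words))
}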

5. Break long strings into words

bufio.NewScanner takes as its parameter any type that satisfies the io.Reader interface, which means it will work with any type that defines a Read method. One of the string utility functions in the standard library that returns such a reader is strings.NewReader. To read the words from a string, we can combine the two:


package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    longstring := "This is a very long string. Not."
    var words []string

    scanner := bufio.NewScanner(strings.NewReader(longstring))
    scanner.Split(bufio.ScanWords)

    for scanner.Scan() {
        words = append(words, scanner.Text())
    }

    fmt.Println("word list:")
    for _, word := range words {
        fmt.Println(word)
    }
}

6. Scan for comma-separated strings

Manually parsing a CSV file/string via the basic Read() or Scanner type is complicated, because the bufio.ScanWords split function defines a "word" as a run of runes delimited by Unicode spaces. Reading individual runes while keeping track of buffer sizes and positions (the sort of work done in lexical analysis) is too much effort.

But it can be avoided. We can define a new split function that reads characters until the reader encounters a comma, and then returns that chunk on calls to Text() or Bytes(). The signature of a bufio.SplitFunc is as follows:


type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)

For simplicity, I'll show an example that reads from a string rather than a file. A simple reader for a CSV string using the above signature could be:


package main

import (
    "bufio"
    "bytes"
    "fmt"
    "strings"
)

func main() {
    csvstring := "name, age, occupation"

    // An anonymous function declaration to avoid repeating main().
    ScanCSV := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        commaidx := bytes.IndexByte(data, ',')
        if commaidx > 0 {
            // We need to return the next position.
            buffer := data[:commaidx]
            return commaidx + 1, bytes.TrimSpace(buffer), nil
        }

        // If we are at the end of the string, just return the entire buffer,
        // but only when there is some data. Otherwise this might mean we've
        // reached the end of our input CSV string.
        if atEOF && len(data) > 0 {
            return len(data), bytes.TrimSpace(data), nil
        }

        // Returning 0, nil, nil signals the interface to read more data from
        // the input reader. In this case the input is a string reader, so
        // this will pretty much never happen.
        return 0, nil, nil
    }

    scanner := bufio.NewScanner(strings.NewReader(csvstring))
    scanner.Split(ScanCSV)

    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
}
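
For real-world CSV data (quoted fields, embedded commas, multiple records), the standard library's encoding/csv package is the safer choice; the split function above is best treated as a demonstration. A minimal sketch using csv.NewReader on the same string:


package main

import (
    "encoding/csv"
    "fmt"
    "strings"
)

func main() {
    csvstring := "name, age, occupation"

    reader := csv.NewReader(strings.NewReader(csvstring))
    // TrimLeadingSpace drops the space after each comma, mirroring the
    // bytes.TrimSpace calls in the hand-written split function above.
    reader.TrimLeadingSpace = true

    record, err := reader.Read()
    if err != nil {
        fmt.Println(err)
        return
    }
    for _, field := range record {
        fmt.Println(field)
    }
}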

7. ioutil

We've seen multiple ways of reading files. But what if you just want to read a file into a buffer?
ioutil is a standard library package with functions that make some of this a one-liner.

Read the entire file


package main

import (
    "fmt"
    "io/ioutil"
)

func main() {
    bytes, err := ioutil.ReadFile("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("bytes read: ", len(bytes))
    fmt.Println("bytestream to string: ", string(bytes))
}


This is closer to what we see in high-level scripting languages.
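
A side note for newer Go versions: as of Go 1.16, the io/ioutil functions have been superseded by equivalents elsewhere in the standard library (ioutil.ReadFile is now a thin wrapper around os.ReadFile), so the same one-liner reads:


package main

import (
    "fmt"
    "os"
)

func main() {
    // os.ReadFile (Go 1.16+) replaces ioutil.ReadFile.
    bytes, err := os.ReadFile("filetoread.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("bytes read: ", len(bytes))
}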

Read an entire directory of files

Needless to say, do not run this on a directory containing large files.


package main

import (
    "fmt"
    "io/ioutil"
)

func main() {
    filelist, err := ioutil.ReadDir(".")
    if err != nil {
        fmt.Println(err)
        return
    }
    for _, fileinfo := range filelist {
        if fileinfo.Mode().IsRegular() {
            bytes, err := ioutil.ReadFile(fileinfo.Name())
            if err != nil {
                fmt.Println(err)
                return
            }
            fmt.Println("bytes read: ", len(bytes))
            fmt.Println("bytestream to string: ", string(bytes))
        }
    }
}

