Define a Schema with Python and Generate a Parquet File

  • 2021-12-04 10:44:36
  • OfStack

Contents:
1. Simple field definitions
  1. Define the schema and generate a Parquet file
  2. Verify the Parquet data file
2. Nested field definitions
  1. Verify the Parquet data file

In earlier articles, Java and Python converted Avro to Parquet format, so the schema was defined in Avro. What we will try here is how to define the Parquet schema directly in Python, populate the data accordingly, and generate Parquet files.

1. Simple field definition

1. Define Schema and generate Parquet file


import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Define the schema
schema = pa.schema([
    ('id', pa.int32()),
    ('email', pa.string())
])

# Prepare the data
ids = pa.array([1, 2], type=pa.int32())
emails = pa.array(['first@example.com', 'second@example.com'], pa.string())

# Generate the Parquet data
batch = pa.RecordBatch.from_arrays(
    [ids, emails],
    schema=schema
)
table = pa.Table.from_batches([batch])

# Write the Parquet file plain.parquet
pq.write_table(table, 'plain.parquet')

2. Verify the Parquet data file

We can use the parquet-tools utility to view the data and schema of the plain.parquet file:


$ parquet-tools schema plain.parquet
message schema {
    optional int32 id;
    optional binary email (STRING);
}

$ parquet-tools cat --json plain.parquet
{"id":1,"email":"first@example.com"}
{"id":2,"email":"second@example.com"}


No problem, this matches our expectation. We can also use Python code to get the schema and data:


schema = pq.read_schema('plain.parquet')
print(schema)

df = pd.read_parquet('plain.parquet')
print(df.to_json())

The output is:


id: int32
  -- field metadata --
  PARQUET:field_id: '1'
email: string
  -- field metadata --
  PARQUET:field_id: '2'
{"id":{"0":1,"1":2},"email":{"0":"first@example.com","1":"second@example.com"}}

2. Nested field definitions

The schema below adds a nested object: address contains the sub-fields email_address and post_address. The code to define the schema and generate the Parquet file is as follows:


import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Inner (nested) fields
address_fields = [
    ('email_address', pa.string()),
    ('post_address', pa.string()),
]

# Define the Parquet schema; address nests address_fields
schema = pa.schema([
    ('id', pa.int32()),
    ('address', pa.struct(address_fields))
])

# Prepare the data
ids = pa.array([1, 2], type=pa.int32())
addresses = pa.array(
    [('first@example.com', 'city1'), ('second@example.com', 'city2')],
    pa.struct(address_fields)
)

# Generate the Parquet data
batch = pa.RecordBatch.from_arrays(
    [ids, addresses],
    schema=schema
)
table = pa.Table.from_batches([batch])

# Write the Parquet data to a file
pq.write_table(table, 'nested.parquet')

1. Verify the Parquet data file

Use the same parquet-tools to inspect the nested.parquet file:


$ parquet-tools schema nested.parquet
message schema {
    optional int32 id;
    optional group address {
        optional binary email_address (STRING);
        optional binary post_address (STRING);
    }
}

$ parquet-tools cat --json nested.parquet
{"id":1,"address":{"email_address":"first@example.com","post_address":"city1"}}
{"id":2,"address":{"email_address":"second@example.com","post_address":"city2"}}


In the schema printed by parquet-tools there is no struct keyword, but the nesting relationship between address and its child attributes is expressed as a group.

Use Python code to see what the schema and data of nested.parquet look like:


schema = pq.read_schema("nested.parquet")
print(schema)

df = pd.read_parquet('nested.parquet')
print(df.to_json())

Output:


id: int32
  -- field metadata --
  PARQUET:field_id: '1'
address: struct<email_address: string, post_address: string>
  child 0, email_address: string
    -- field metadata --
    PARQUET:field_id: '3'
  child 1, post_address: string
    -- field metadata --
    PARQUET:field_id: '4'
  -- field metadata --
  PARQUET:field_id: '2'
{"id":{"0":1,"1":2},"address":{"0":{"email_address":"first@example.com","post_address":"city1"},"1":{"email_address":"second@example.com","post_address":"city2"}}}

The data is, of course, identical; the schema display differs slightly: address is identified as struct<email_address: string, post_address: string>, clearly indicating that it is a struct type rather than merely showing the nesting level.

