Define a Schema with Python and Generate a Parquet File: Detailed Explanation
- 2021-12-04 10:44:36
- OfStack
When converting Avro to the Parquet format in Java and Python, the schema was defined in Avro. What we try here is how to define a Parquet schema directly, then populate data according to it and generate a Parquet file.
1. Simple field definition
1. Define Schema and generate Parquet file
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Define the schema
schema = pa.schema([
    ('id', pa.int32()),
    ('email', pa.string())
])

# Prepare the data
ids = pa.array([1, 2], type=pa.int32())
emails = pa.array(['first@example.com', 'second@example.com'], pa.string())

# Build the Parquet data
batch = pa.RecordBatch.from_arrays(
    [ids, emails],
    schema=schema
)
table = pa.Table.from_batches([batch])

# Write the Parquet file plain.parquet
pq.write_table(table, 'plain.parquet')
2. Verify the Parquet data file
We can use the parquet-tools utility to view the data and schema of the plain.parquet file:
$ parquet-tools schema plain.parquet
message schema {
  optional int32 id;
  optional binary email (STRING);
}
$ parquet-tools cat --json plain.parquet
{"id":1,"email":"first@example.com"}
{"id":2,"email":"second@example.com"}
No problem; this matches our expectations. You can also use Python code to get the schema and data:
schema = pq.read_schema('plain.parquet')
print(schema)
df = pd.read_parquet('plain.parquet')
print(df.to_json())
The output reports the same two fields, id as int32 and email as string, and the same two rows of data.
2. Include nested field definitions
The schema below adds a nested object, address, which is split into email_address and post_address. The code for defining the schema and generating the Parquet file is as follows:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Inner fields of the nested object
address_fields = [
    ('email_address', pa.string()),
    ('post_address', pa.string()),
]

# Define the Parquet schema; address nests address_fields
schema = pa.schema([
    ('id', pa.int32()),
    ('address', pa.struct(address_fields))
])

# Prepare the data
ids = pa.array([1, 2], type=pa.int32())
addresses = pa.array(
    [('first@example.com', 'city1'), ('second@example.com', 'city2')],
    pa.struct(address_fields)
)

# Build the Parquet data
batch = pa.RecordBatch.from_arrays(
    [ids, addresses],
    schema=schema
)
table = pa.Table.from_batches([batch])

# Write the Parquet data to a file
pq.write_table(table, 'nested.parquet')
1. Verify the Parquet data file
Likewise, let's use parquet-tools to take a look at the nested.parquet file:
$ parquet-tools schema nested.parquet
message schema {
  optional int32 id;
  optional group address {
    optional binary email_address (STRING);
    optional binary post_address (STRING);
  }
}
$ parquet-tools cat --json nested.parquet
{"id":1,"address":{"email_address":"first@example.com","post_address":"city1"}}
{"id":2,"address":{"email_address":"second@example.com","post_address":"city2"}}
The schema shown by parquet-tools does not contain the word struct, but it does reflect the nesting relationship between address and its child attributes.
Now let's use Python code to see what the schema and data of the nested.parquet file look like:
schema = pq.read_schema("nested.parquet")
print(schema)
df = pd.read_parquet('nested.parquet')
print(df.to_json())
Output:
id: int32
  -- field metadata --
  PARQUET:field_id: '1'
address: struct<email_address: string, post_address: string>
  child 0, email_address: string
    -- field metadata --
    PARQUET:field_id: '3'
  child 1, post_address: string
    -- field metadata --
    PARQUET:field_id: '4'
  -- field metadata --
  PARQUET:field_id: '2'
{"id":{"0":1,"1":2},"address":{"0":{"email_address":"first@example.com","post_address":"city1"},"1":{"email_address":"second@example.com","post_address":"city2"}}}
The data is, of course, the same. What differs slightly is the displayed schema, in which address is identified as struct<email_address: string, post_address: string>, clearly indicating that it is a struct type rather than just showing the nesting level.