Grouping data in Groupby mode by time stamp using Pandas and Numpy
- 2021-07-24 11:24:42
- OfStack
First of all, I need to group the data in minutes, and then output the data in every minute as one line. Because the amount of data is different at different times, all the data shall be subject to the longest set of data, and the insufficient data shall be supplemented by their last data.
After that, I will introduce my data source in 1. The previously useless data columns have been removed. I only leave the data data column and the timestamp time column to be used. The timestamp is measured in seconds. You can see that 1 is 407,454 rows in total.
data time
0 6522.50 1.530668e+09
1 6522.66 1.530668e+09
2 6523.79 1.530668e+09
3 6523.79 1.530668e+09
4 6524.82 1.530668e+09
5 6524.35 1.530668e+09
6 6523.66 1.530668e+09
7 6522.64 1.530668e+09
8 6523.25 1.530668e+09
9 6523.88 1.530668e+09
10 6525.30 1.530668e+09
11 6525.70 1.530668e+09
... ... ...
407443 6310.69 1.531302e+09
407444 6310.55 1.531302e+09
407445 6310.42 1.531302e+09
407446 6310.40 1.531302e+09
407447 6314.03 1.531302e+09
407448 6314.04 1.531302e+09
407449 6312.84 1.531302e+09
407450 6312.57 1.531302e+09
407451 6312.56 1.531302e+09
407452 6314.04 1.531302e+09
407453 6314.04 1.531302e+09
[407454 rows x 2 columns]
Start data processing, define 1 function, input 1 DataFrame and name of time column.
def getdata_time(dataframe,name):
dataframe[name] = dataframe[name]/60 # Convert time to minutes
dataframe[name] = dataframe[name].astype('int64')
datalen = dataframe.groupby(name).count().max() # Get the maximum length of data
timeframe = dataframe.groupby(name).count().reset_index()# Convert the post-grouped time to get the time DataFrame
timeseries = timeframe['time']
array = [] # Establish 1 An empty array to store values
for time, group in dataframe.groupby(name):
tmparray = numpy.array(group['data']) # Will series Convert to an array and add to the totals group
array.append(tmparray)
notimedata = pandas.DataFrame(array)
notimedata = notimedata.fillna(method='ffill',axis = 1,limit=datalen[0]) # Complete missing values
notimedata[datalen[0]+1] = timeseries # Add time to the end 1 Column
return notimedata
The following will be analyzed line by line. First, it should be grouped according to each minute. Then, the timestamp of the second meter is divided by 60 to become minutes, and it is converted to int type for the convenience of observation (it is not clear whether changing the type will lead to the lack of data accuracy and affect the results. If those who know it see it, welcome to point out, thank you).
datalen is the maximum data length per minute we need to use, which is used as the basis for alignment. DataFrame. groupby. count () displays the number of data in each group, Instead of showing how many groups there are, If you want to get the index of every group after grouping, you need to use the reset_index method in the next row. The reason why reset_index is not directly used but called after count () method is that the result of groupby grouping is not an DataFrame, but after count () (not only count, but also the method of grouping data can be operated, as long as the result corresponds to index11 of each group), an DataFrame with index as one column and the other column is count result can be obtained. The following is the error report for reset_index operation directly:
AttributeError: Cannot access callable attribute 'reset_index' of 'DataFrameGroupBy' objects, try using the 'apply' method
The following is the result of reset_index method after count operation. It can be seen that 1 is divided into 10397 groups:
time data
0 25511135 33
1 25511136 18
2 25511137 25
3 25511138 42
4 25511139 36
5 25511140 7
6 25511141 61
7 25511142 45
8 25511143 46
9 25511144 19
10 25511145 21
... ... ...
10387 25521697 3
10388 25521698 9
10389 25521699 16
10390 25521700 13
10391 25521701 4
10392 25521702 34
10393 25521703 124
10394 25521704 302
10395 25521705 86
10396 25521706 52
[10397 rows x 2 columns]
The extracted timeseries will be used in the final data consolidation. Now start extracting each set of data, First, an empty array is established for storage, and then the information of each group is obtained by for loop. time is the grouped index, and group is the content of each group. The data is taken out from group ['data'] and added to the previously established empty array. After the loop operation, it is converted to DataFrame. Of course, this DataFrame contains a large number of missing values, because its column number is based on the longest data. As follows:
0 1 2 3 ... 1143 1144 1145 1146
0 6522.50 6522.66 6523.79 6523.79 ... NaN NaN NaN NaN
1 6523.95 6524.90 6525.00 6524.35 ... NaN NaN NaN NaN
2 6520.87 6520.00 6520.45 6520.46 ... NaN NaN NaN NaN
3 6516.34 6516.26 6516.21 6516.21 ... NaN NaN NaN NaN
4 6513.28 6514.00 6514.00 6514.00 ... NaN NaN NaN NaN
5 6511.98 6511.98 6511.99 6513.00 ... NaN NaN NaN NaN
6 6511.00 6511.00 6511.00 6511.00 ... NaN NaN NaN NaN
7 6511.70 6511.78 6511.99 6511.99 ... NaN NaN NaN NaN
8 6509.51 6510.00 6510.80 6510.80 ... NaN NaN NaN NaN
9 6511.36 6510.00 6510.00 6510.00 ... NaN NaN NaN NaN
10 6507.00 6507.00 6507.00 6507.00 ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
10386 6333.77 6331.31 6331.30 6333.19 ... NaN NaN NaN NaN
10387 6331.68 6331.30 6331.68 NaN ... NaN NaN NaN NaN
10388 6331.30 6331.30 6331.00 6331.00 ... NaN NaN NaN NaN
10389 6330.93 6330.92 6330.92 6330.93 ... NaN NaN NaN NaN
10390 6330.83 6330.83 6330.90 6330.80 ... NaN NaN NaN NaN
10391 6327.57 6326.00 6326.00 6325.74 ... NaN NaN NaN NaN
10392 6327.57 6329.70 6328.85 6328.85 ... NaN NaN NaN NaN
10393 6323.54 6323.15 6323.15 6322.77 ... NaN NaN NaN NaN
10394 6311.00 6310.83 6310.83 6310.50 ... NaN NaN NaN NaN
10395 6311.45 6311.32 6310.01 6310.01 ... NaN NaN NaN NaN
10396 6310.46 6310.46 6310.56 6311.61 ... NaN NaN NaN NaN
[10397 rows x 1147 columns]
It can be seen that the number of rows is the number of groups, and 1 total of 1147 columns is also the largest set of data lengths.
After that, we populate the missing value by calling fillna method. method = 'ffill' means based on the first data of the missing value, axis = 1 means in line units, and limit means the maximum fill length. Finally, we add the timeseries we got earlier to the last column, and we get the final result of the requirements.
0 1 2 ... 1145 1146 1148
0 6522.50 6522.66 6523.79 ... 6522.14 6522.14 25511135
1 6523.95 6524.90 6525.00 ... 6520.00 6520.00 25511136
2 6520.87 6520.00 6520.45 ... 6517.00 6517.00 25511137
3 6516.34 6516.26 6516.21 ... 6514.00 6514.00 25511138
4 6513.28 6514.00 6514.00 ... 6511.97 6511.97 25511139
5 6511.98 6511.98 6511.99 ... 6511.00 6511.00 25511140
6 6511.00 6511.00 6511.00 ... 6510.90 6510.90 25511141
7 6511.70 6511.78 6511.99 ... 6512.09 6512.09 25511142
8 6509.51 6510.00 6510.80 ... 6512.09 6512.09 25511143
9 6511.36 6510.00 6510.00 ... 6507.04 6507.04 25511144
10 6507.00 6507.00 6507.00 ... 6508.57 6508.57 25511145
11 6507.16 6507.74 6507.74 ... 6506.35 6506.35 25511146
... ... ... ... ... ... ... ...
10388 6331.30 6331.30 6331.00 ... 6331.00 6331.00 25521698
10389 6330.93 6330.92 6330.92 ... 6330.99 6330.99 25521699
10390 6330.83 6330.83 6330.90 ... 6327.58 6327.58 25521700
10391 6327.57 6326.00 6326.00 ... 6325.74 6325.74 25521701
10392 6327.57 6329.70 6328.85 ... 6325.00 6325.00 25521702
10393 6323.54 6323.15 6323.15 ... 6311.00 6311.00 25521703
10394 6311.00 6310.83 6310.83 ... 6315.00 6315.00 25521704
10395 6311.45 6311.32 6310.01 ... 6310.00 6310.00 25521705
10396 6310.46 6310.46 6310.56 ... 6314.04 6314.04 25521706
[10397 rows x 1148 columns]