Grouping data by timestamp with Pandas and NumPy groupby

  • 2021-07-24 11:24:42
  • OfStack

First of all, I need to group the data by minute and then output each minute's data as one row. Because the amount of data differs from minute to minute, every row is aligned to the length of the longest group, and the shorter groups are padded with their own last value.

Next, a brief introduction to my data source. The columns that are not needed have already been removed, leaving only the data column and the time column, a timestamp measured in seconds. As you can see, there are 407,454 rows in total.


     data     time
0    6522.50 1.530668e+09
1    6522.66 1.530668e+09
2    6523.79 1.530668e+09
3    6523.79 1.530668e+09
4    6524.82 1.530668e+09
5    6524.35 1.530668e+09
6    6523.66 1.530668e+09
7    6522.64 1.530668e+09
8    6523.25 1.530668e+09
9    6523.88 1.530668e+09
10   6525.30 1.530668e+09
11   6525.70 1.530668e+09
...     ...      ...
407443 6310.69 1.531302e+09
407444 6310.55 1.531302e+09
407445 6310.42 1.531302e+09
407446 6310.40 1.531302e+09
407447 6314.03 1.531302e+09
407448 6314.04 1.531302e+09
407449 6312.84 1.531302e+09
407450 6312.57 1.531302e+09
407451 6312.56 1.531302e+09
407452 6314.04 1.531302e+09
407453 6314.04 1.531302e+09
 
[407454 rows x 2 columns]
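For anyone who wants to follow along without the original file, a tiny DataFrame with the same two-column layout can be built by hand. This is only an illustrative stand-in: the values below are made up, and only the structure (a data column plus a time column of Unix timestamps in seconds) matches the real data set.


import pandas

# Hypothetical stand-in for the real data: a few price-like values and
# Unix timestamps in seconds, spanning a couple of minutes.
df = pandas.DataFrame({
    'data': [6522.50, 6522.66, 6523.79, 6523.79, 6524.82, 6524.35],
    'time': [1530668105, 1530668110, 1530668130, 1530668165, 1530668170, 1530668230],
})
print(df)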

Now start processing the data: define a function that takes a DataFrame and the name of its time column.


import numpy
import pandas


def getdata_time(dataframe, name):
    dataframe[name] = dataframe[name] / 60            # convert the timestamp from seconds to minutes
    dataframe[name] = dataframe[name].astype('int64')

    datalen = dataframe.groupby(name).count().max()   # length of the longest group

    timeframe = dataframe.groupby(name).count().reset_index()  # turn the group keys (minutes) into a column
    timeseries = timeframe['time']

    array = []                                        # empty list to collect each group's values
    for time, group in dataframe.groupby(name):
        tmparray = numpy.array(group['data'])         # convert the Series to an array and append it to the list
        array.append(tmparray)

    notimedata = pandas.DataFrame(array)
    notimedata = notimedata.fillna(method='ffill', axis=1, limit=datalen[0])  # forward-fill missing values row by row
    notimedata[datalen[0] + 1] = timeseries           # append the minute timestamps as the last column

    return notimedata
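Assuming the full data set has been loaded into a DataFrame named df like the one above (the variable name is only an example), the function is called with the name of the time column:


result = getdata_time(df, 'time')
print(result)

Note that the function modifies the time column of the DataFrame that is passed in, since the conversion to minutes is done in place rather than on a copy.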

The function will now be analyzed line by line. First, since the grouping is per minute, the timestamp in seconds is divided by 60 to turn it into minutes, and it is then cast to the int type for easier observation (I am not sure whether this type conversion loses precision and affects the results; if anyone who knows sees this, please point it out, thank you).
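As a quick check of those two lines, here is the conversion applied to a single timestamp. For second-resolution timestamps of this size the floating-point division error is far smaller than one minute, so the cast to int should not change which minute bucket a record falls into; integer floor division (ts // 60) would give the same bucket directly.


ts = 1530668130        # a Unix timestamp in seconds
minute = int(ts / 60)  # same as the two lines above: divide by 60, then cast to int
print(minute)          # 25511135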

datalen is the maximum number of records in any one minute, and it is used as the target length for aligning the rows. DataFrame.groupby(name).count() returns the number of records in each group, not the number of groups. To get each group's key as a regular column, reset_index() is called on the result in the next line. The reason reset_index() is not called on the groupby object directly, but only after count(), is that the groupby result itself is not a DataFrame; only after an aggregation such as count() (it does not have to be count, any method that produces one value per group works) do we get a DataFrame whose index is the group key and whose other column holds the aggregated result. Calling reset_index() directly on the groupby object raises the error below:

AttributeError: Cannot access callable attribute 'reset_index' of 'DataFrameGroupBy' objects, try using the 'apply' method

Below is the result of calling the reset_index method after count(). You can see that the data is divided into 10,397 groups:


      time data
0   25511135  33
1   25511136  18
2   25511137  25
3   25511138  42
4   25511139  36
5   25511140   7
6   25511141  61
7   25511142  45
8   25511143  46
9   25511144  19
10   25511145  21
...     ...  ...
10387 25521697   3
10388 25521698   9
10389 25521699  16
10390 25521700  13
10391 25521701   4
10392 25521702  34
10393 25521703  124
10394 25521704  302
10395 25521705  86
10396 25521706  52
 
[10397 rows x 2 columns]
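To make the count()/reset_index() behaviour described above concrete, here is a minimal self-contained sketch (the values are made up and unrelated to the real data set):


import pandas

df = pandas.DataFrame({
    'data': [1.0, 2.0, 3.0, 4.0, 5.0],
    'time': [100, 100, 100, 101, 101],  # already converted to minutes
})

counts = df.groupby('time').count()  # indexed by the minute; the 'data' column holds each group's size
print(counts)
#       data
# time
# 100      3
# 101      2

print(counts.reset_index())  # the minute index becomes an ordinary column
#    time  data
# 0   100     3
# 1   101     2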

The extracted timeseries will be used when the data is merged at the end. Now extract the values of each group: first create an empty list for storage, then iterate over the groups with a for loop, where time is the group key and group is the group's contents. The values are taken from group['data'], converted to an array, and appended to the list. After the loop, the list is converted to a DataFrame. This DataFrame naturally contains a large number of missing values, because its number of columns is determined by the longest group. It looks like this:


     0    1    2    3   ...  1143 1144 1145 1146
0   6522.50 6522.66 6523.79 6523.79 ...  NaN  NaN  NaN  NaN
1   6523.95 6524.90 6525.00 6524.35 ...  NaN  NaN  NaN  NaN
2   6520.87 6520.00 6520.45 6520.46 ...  NaN  NaN  NaN  NaN
3   6516.34 6516.26 6516.21 6516.21 ...  NaN  NaN  NaN  NaN
4   6513.28 6514.00 6514.00 6514.00 ...  NaN  NaN  NaN  NaN
5   6511.98 6511.98 6511.99 6513.00 ...  NaN  NaN  NaN  NaN
6   6511.00 6511.00 6511.00 6511.00 ...  NaN  NaN  NaN  NaN
7   6511.70 6511.78 6511.99 6511.99 ...  NaN  NaN  NaN  NaN
8   6509.51 6510.00 6510.80 6510.80 ...  NaN  NaN  NaN  NaN
9   6511.36 6510.00 6510.00 6510.00 ...  NaN  NaN  NaN  NaN
10   6507.00 6507.00 6507.00 6507.00 ...  NaN  NaN  NaN  NaN
...    ...   ...   ...   ... ...  ...  ...  ...  ...
10386 6333.77 6331.31 6331.30 6333.19 ...  NaN  NaN  NaN  NaN
10387 6331.68 6331.30 6331.68   NaN ...  NaN  NaN  NaN  NaN
10388 6331.30 6331.30 6331.00 6331.00 ...  NaN  NaN  NaN  NaN
10389 6330.93 6330.92 6330.92 6330.93 ...  NaN  NaN  NaN  NaN
10390 6330.83 6330.83 6330.90 6330.80 ...  NaN  NaN  NaN  NaN
10391 6327.57 6326.00 6326.00 6325.74 ...  NaN  NaN  NaN  NaN
10392 6327.57 6329.70 6328.85 6328.85 ...  NaN  NaN  NaN  NaN
10393 6323.54 6323.15 6323.15 6322.77 ...  NaN  NaN  NaN  NaN
10394 6311.00 6310.83 6310.83 6310.50 ...  NaN  NaN  NaN  NaN
10395 6311.45 6311.32 6310.01 6310.01 ...  NaN  NaN  NaN  NaN
10396 6310.46 6310.46 6310.56 6311.61 ...  NaN  NaN  NaN  NaN
 
[10397 rows x 1147 columns]

You can see that the number of rows equals the number of groups, and the 1,147 columns correspond to the length of the longest group.
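The NaN padding comes directly from the DataFrame constructor: when it receives a list of arrays of different lengths, each array becomes one row and the shorter rows are padded with NaN up to the length of the longest one. A minimal sketch:


import numpy
import pandas

rows = [numpy.array([1.0, 2.0, 3.0]),
        numpy.array([4.0, 5.0])]
print(pandas.DataFrame(rows))
#      0    1    2
# 0  1.0  2.0  3.0
# 1  4.0  5.0  NaN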

After that, we fill in the missing values by calling the fillna method. method='ffill' means each gap is filled with the last valid value before it, axis=1 means the fill runs along each row, and limit is the maximum number of consecutive values to fill. Finally, the timeseries obtained earlier is appended as the last column, which gives the final result required (a small self-contained sketch of this row-wise fill follows the output below).


     0    1    2    ...    1145   1146   1148
0   6522.50 6522.66 6523.79  ...   6522.14 6522.14 25511135
1   6523.95 6524.90 6525.00  ...   6520.00 6520.00 25511136
2   6520.87 6520.00 6520.45  ...   6517.00 6517.00 25511137
3   6516.34 6516.26 6516.21  ...   6514.00 6514.00 25511138
4   6513.28 6514.00 6514.00  ...   6511.97 6511.97 25511139
5   6511.98 6511.98 6511.99  ...   6511.00 6511.00 25511140
6   6511.00 6511.00 6511.00  ...   6510.90 6510.90 25511141
7   6511.70 6511.78 6511.99  ...   6512.09 6512.09 25511142
8   6509.51 6510.00 6510.80  ...   6512.09 6512.09 25511143
9   6511.36 6510.00 6510.00  ...   6507.04 6507.04 25511144
10   6507.00 6507.00 6507.00  ...   6508.57 6508.57 25511145
11   6507.16 6507.74 6507.74  ...   6506.35 6506.35 25511146
...    ...   ...   ...  ...     ...   ...    ...
10388 6331.30 6331.30 6331.00  ...   6331.00 6331.00 25521698
10389 6330.93 6330.92 6330.92  ...   6330.99 6330.99 25521699
10390 6330.83 6330.83 6330.90  ...   6327.58 6327.58 25521700
10391 6327.57 6326.00 6326.00  ...   6325.74 6325.74 25521701
10392 6327.57 6329.70 6328.85  ...   6325.00 6325.00 25521702
10393 6323.54 6323.15 6323.15  ...   6311.00 6311.00 25521703
10394 6311.00 6310.83 6310.83  ...   6315.00 6315.00 25521704
10395 6311.45 6311.32 6310.01  ...   6310.00 6310.00 25521705
10396 6310.46 6310.46 6310.56  ...   6314.04 6314.04 25521706
 
[10397 rows x 1148 columns]
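As promised above, here is a small self-contained sketch of the last two steps, the row-wise forward fill and the appended time column, using the same toy rows as before (the minute values 100 and 101 are made up):


import numpy
import pandas

rows = [numpy.array([1.0, 2.0, 3.0]),
        numpy.array([4.0, 5.0])]
notimedata = pandas.DataFrame(rows)

notimedata = notimedata.fillna(method='ffill', axis=1)  # each NaN takes the last valid value to its left
notimedata[3] = [100, 101]                              # append each group's minute as the final column
print(notimedata)
#      0    1    2    3
# 0  1.0  2.0  3.0  100
# 1  4.0  5.0  5.0  101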
