How to Use Python Data Processing to Resolve Data Conflicts and Select Samples

  • 2021-11-24 02:06:58
  • OfStack

Contents:

  • Content introduction
  • Actual business data conflict: common types, common causes, common processing methods
  • Selection of samples: common sampling methods
  • Collinearity of data: common causes, five common methods to solve collinearity

Content introduction

This article summarizes ways of handling the data conflicts and sample selection problems encountered in day-to-day work, covering actual business data conflicts, sample selection, and data collinearity. It will be updated over time.

Actual business data conflict

A conflict between business data sources arises when data from multiple systems, environments, platforms, or tools follows the same business logic but yields different results.

Conflicts show up in different forms.

Common types of data conflict:

  • Data type: the same field holds data in different formats. For example, a registration date field contains strings alongside date values.
  • Data structure: the same data subject is described with conflicting structures.
  • Record granularity: the same records are stored at different levels of detail. For example, an order may be stored as a single row keyed by ID, or at some other level of detail.
  • Data range definition: the extracted fields carry conflicting meanings from one source to another.
  • Data values: the values themselves differ. This is usually a formatting problem.
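
The data type conflict in the first bullet can be sketched as follows. This is a minimal illustration, not a complete cleaning routine: the `to_date` helper, the `KNOWN_FORMATS` tuple, and the sample values are all assumptions made for the example, and it uses only the standard library.

```python
from datetime import date, datetime

# Hypothetical formats; extend the tuple to match what your sources actually emit.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")

def to_date(value):
    """Coerce a date, datetime, or date-like string to datetime.date.

    Returns None when the value cannot be interpreted as a date.
    """
    if isinstance(value, datetime):  # check datetime first: it is a subclass of date
        return value.date()
    if isinstance(value, date):
        return value
    if isinstance(value, str):
        text = value.strip()
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(text, fmt).date()
            except ValueError:
                continue  # try the next known format
    return None

# Made-up registration dates with mixed types, as described above.
raw_registration_dates = [
    date(2021, 11, 24), "2021-11-24", "2021/11/24", "20211124", "not a date", None,
]
cleaned = [to_date(v) for v in raw_registration_dates]
unparseable = [v for v, c in zip(raw_registration_dates, cleaned) if c is None]
```

Values that fail to parse are collected in `unparseable` rather than dropped silently, so they can be reviewed by hand.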

Common causes of data conflict:

A typical case is a conflict between internal tools and third-party tools.

Why does the data we collect differ from the advertising data reported by agencies or ad media, sometimes by a wide margin?

Some discrepancy between the data collected by web analytics tools and the data reported by ad media and agencies is unavoidable.

It comes from differences in metric definitions, collection logic, system filtering rules, update times, monitoring locations, and so on.

Common data processing methods:

There is currently no unified standard, so handle conflicts according to actual needs.

  • Form a single version of the data: if you need an overall summary figure, eliminate the conflicts in some way so that one consistent set of data can be reported.
  • Do not eliminate the conflicts: use all of the conflicting data instead. When an end-to-end analysis draws on different data from different business processes, letting each process keep its own indicators can describe channel conversion better.

In either case, make sure the differences that remain after processing can be explained and are objective and stable.
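
As a rough sketch of the two approaches, the snippet below builds a single value per day from a priority order, and separately keeps both values while measuring the gap between them. The source names, the click counts, and the priority order are invented for illustration; which source to trust is a business decision, not something the code can settle.

```python
# Hypothetical daily click counts for the same campaign from two sources.
internal_tool = {"2021-11-22": 980, "2021-11-23": 1010, "2021-11-24": None}
ad_platform = {"2021-11-22": 1000, "2021-11-23": 1100, "2021-11-24": 1200}

# Assumed priority: trust the internal tool first, fall back to the ad platform.
SOURCE_PRIORITY = [("internal_tool", internal_tool), ("ad_platform", ad_platform)]

def single_value(day):
    """Approach 1: return one (value, source_name) pair using the priority order."""
    for name, source in SOURCE_PRIORITY:
        value = source.get(day)
        if value is not None:
            return value, name
    return None, None

def relative_gap(day):
    """Approach 2: keep both values and measure the gap relative to the ad platform."""
    a, b = internal_tool.get(day), ad_platform.get(day)
    if a is None or b is None or b == 0:
        return None
    return (a - b) / b

days = sorted(ad_platform)
unified = {day: single_value(day) for day in days}
gaps = {day: relative_gap(day) for day in days}
```

Recording the source name next to each unified value keeps the result explainable, and tracking `gaps` over time is one way to check whether the difference stays stable.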

Selection of samples

Sampling works from the data that already exists. Ideally we would use the complete data set, but that is rarely practical, so we rely on statistical methods to draw samples.

Common data sampling methods:

Sampling methods are usually divided into non-probability sampling and probability sampling. Non-probability sampling does not follow the principle of equal probability; it relies on subjective experience and judgment. Probability sampling is grounded in probability theory and draws samples according to the principle of randomness.

  • Simple random sampling: draw n samples directly from the population with equal probability. The method is simple and easy to carry out, but it does not guarantee that the sample represents the population well. It suits evenly distributed data.
  • Systematic (equidistant) sampling: number every individual in the population, calculate the sampling interval, and then pick individuals at that fixed interval. It suits data that are evenly distributed and show no obvious trend or periodic pattern.
  • Stratified sampling: divide all individuals into several categories according to some characteristic, then select individuals from each category by random or systematic sampling to form the sample. It suits data that carry features such as attributes or category labels.
  • Cluster sampling: divide all individuals into several groups, then randomly select a few whole groups to represent the population. It suits cases where the groups differ little from one another, and it places high demands on how the groups are defined.
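
The four probability sampling methods above can be sketched with the standard library alone. The population, its `channel` and `region` fields, and the function names are made up for the example, and the seed is fixed only to make the sketch repeatable.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 users, each tagged with a channel and a region.
population = [
    {"id": i, "channel": ("search", "social", "email")[i % 3], "region": i // 100}
    for i in range(1000)
]

def simple_random_sample(items, n):
    """Draw n items with equal probability, without replacement."""
    return rng.sample(items, n)

def systematic_sample(items, n):
    """Take every k-th item after a random start, where k = len(items) // n."""
    step = len(items) // n
    start = rng.randrange(step)
    return items[start::step][:n]

def stratified_sample(items, key, fraction):
    """Group items by key, then draw the same fraction from every group."""
    strata = defaultdict(list)
    for item in items:
        strata[item[key]].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

def cluster_sample(items, key, n_clusters):
    """Pick whole groups at random and keep every item in the chosen groups."""
    clusters = sorted({item[key] for item in items})
    chosen = set(rng.sample(clusters, n_clusters))
    return [item for item in items if item[key] in chosen]

srs = simple_random_sample(population, 100)
sys_sample = systematic_sample(population, 100)
strat = stratified_sample(population, "channel", 0.1)
clus = cluster_sample(population, "region", 2)
```

Note that the stratified sample keeps every channel in proportion to its size, while the cluster sample keeps all users from only two regions, which is why cluster sampling depends so heavily on how the groups are defined.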

Points to watch:

  • The sample must reflect the operating context of the business.
  • The sample must be free of problems with business randomness and with the feasibility of the business data.
  • Most importantly, the sample must meet the needs of the data analysis and modeling.

Collinearity of data

Collinearity (also called multicollinearity) means that the input independent variables are highly linearly correlated. It can greatly reduce the stability and accuracy of a regression model. Typical pairs of variables with obvious collinearity are visits and page views, page views and time on site, and order quantity and sales.

Common causes:

  • The data sample is too small. This is really the effect of missing data on modeling.
  • Several variables share the same trend, or opposite trends, over time.
  • Several variables are clearly related but do not occur at the same point in time; taken as a whole, their trends are still consistent.
  • Several variables are approximately linearly related, which can be understood simply as a relationship of the form y = ax + b.

Checking for collinearity: collinearity is usually diagnosed with statistics such as tolerance, the variance inflation factor (VIF), and eigenvalues.
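
A minimal sketch of these three diagnostics with NumPy follows. The data are simulated: `page_views` is deliberately built as a near multiple of `visits`, and every variable name here is an assumption made for the example. The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the others, and tolerance is its reciprocal.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Simulated data (an assumption for this sketch, not real traffic figures).
rng = np.random.default_rng(0)
visits = rng.normal(1000, 100, size=500)
page_views = 3 * visits + rng.normal(0, 30, size=500)  # nearly a multiple of visits
order_count = rng.normal(50, 10, size=500)             # unrelated to the other two

X = np.column_stack([visits, page_views, order_count])
vif = variance_inflation_factors(X)
tolerance = 1.0 / vif

# Eigenvalues of the correlation matrix, in ascending order.
# A value close to zero signals a near-linear dependency among the columns.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
```

A widely used rule of thumb, which is not taken from this article, treats a VIF above about 10 (tolerance below 0.1) as a sign of serious collinearity.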

Five common methods to solve collinearity:

Increase the sample size:

Adding samples can eliminate incidental collinearity caused by insufficient data. It may not solve the problem, though, because the collinearity may genuinely exist among the variables.

Ridge regression:

Ridge regression is a biased-estimation regression method designed for collinearity problems. In essence, it is a modified form of least squares estimation.
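
To make the idea concrete, here is a small closed-form sketch with NumPy. It solves (XᵀX + αI)w = Xᵀy on centered data, so setting `alpha` to 0 reduces to ordinary least squares. The function name `ridge_fit`, the simulated data, and the choice `alpha=1.0` are assumptions for illustration; in practice `alpha` is usually chosen by cross-validation, and a library implementation such as scikit-learn's `Ridge` is the more common route.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit on centered data; the intercept is not penalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    # Solve (Xc'Xc + alpha * I) w = Xc'yc; alpha = 0 gives ordinary least squares.
    weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ weights
    return weights, intercept

# Simulated collinear inputs (an assumption for this sketch).
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # almost a copy of x1
y = 3 * x1 + 3 * x2 + rng.normal(scale=0.5, size=200)
X = np.column_stack([x1, x2])

ols_weights, _ = ridge_fit(X, y, alpha=0.0)
ridge_weights, _ = ridge_fit(X, y, alpha=1.0)
```

With two almost identical inputs, the least squares weights can swing far apart from each other, while the ridge penalty pulls them toward similar values. That stability is bought with a small bias, which is what "biased estimation" refers to.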

Stepwise regression:

Independent variables are introduced one at a time and tested statistically. Each time another variable is added, the regression coefficients of all variables in the model are tested again.
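
The sketch below shows only the forward half of this procedure, and it swaps the statistical test for a simpler stopping rule: a variable is added only if it reduces the residual sum of squares by at least `min_gain` (5% by default). That threshold, the function names, and the simulated data are assumptions for illustration. A fuller implementation would use F-tests or p-values and would also re-test, and possibly remove, variables already in the model.

```python
import numpy as np

def residual_sum_of_squares(X, y, columns):
    """Fit least squares on the chosen columns (plus intercept) and return the RSS."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)

def forward_select(X, y, min_gain=0.05):
    """Add one variable per round; stop when the best candidate cuts RSS by < min_gain."""
    selected = []
    remaining = list(range(X.shape[1]))
    current_rss = residual_sum_of_squares(X, y, selected)
    while remaining:
        trial = {c: residual_sum_of_squares(X, y, selected + [c]) for c in remaining}
        best = min(trial, key=trial.get)
        if (current_rss - trial[best]) / current_rss < min_gain:
            break  # no remaining variable helps enough
        selected.append(best)
        remaining.remove(best)
        current_rss = trial[best]
    return selected

# Simulated data (an assumption): column 1 is almost a copy of column 0.
rng = np.random.default_rng(2)
x0 = rng.normal(size=200)
x1 = x0 + rng.normal(scale=0.01, size=200)
x2 = rng.normal(size=200)
y = 2 * x0 + 3 * x2 + rng.normal(scale=0.5, size=200)
X = np.column_stack([x0, x1, x2])

selected = forward_select(X, y)
```

Because columns 0 and 1 carry nearly the same information, once one of them is in the model the other adds almost nothing and is left out, which is how stepwise selection sidesteps the collinear pair.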

Principal component regression:

Running the regression on principal components instead of the original variables avoids collinearity while keeping most of the important features of the data.
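
A minimal NumPy sketch of the idea: center the inputs, take the leading principal directions from an SVD, regress on the resulting scores, then map the coefficients back to the original variables. The function name `pcr_fit` and the simulated data are assumptions for this example. The inputs are only centered here, which assumes they are on comparable scales; otherwise standardize them first.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components].T  # shape: (n_features, n_components)
    scores = Xc @ components          # component scores are mutually uncorrelated
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    weights = components @ gamma      # coefficients on the original variables
    intercept = y_mean - x_mean @ weights
    explained = singular_values**2 / np.sum(singular_values**2)
    return weights, intercept, explained[:n_components]

# Simulated collinear inputs (an assumption for this sketch).
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # almost a copy of x1
y = 3 * x1 + 3 * x2 + rng.normal(scale=0.5, size=200)
X = np.column_stack([x1, x2])

weights, intercept, explained = pcr_fit(X, y, n_components=1)
```

Here a single component captures nearly all of the variance in the two inputs, so dropping the second component discards the unstable direction while losing very little information.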

Manual removal:

If the methods above seem like too much trouble, you can simply drop the collinear variables based on experience.

Collinearity can never be eliminated completely, because real-world variables are always related to some degree.

In practice, the goal is to deal with serious collinearity only, not with every instance of it.
