The first step in determining the effectiveness of artificial intelligence and machine learning
Time:2021-05-27
Once an automated pipeline is built, its algorithms handle most of the work, and the data collection process rarely needs updating. But a finished pipeline is not one that can run unattended forever. We need to keep exploring and analyzing the underlying data over time, watching for the kinds of drift that gradually erode pipeline performance.
The good news is that the data team can reduce this risk, at the cost of some time and effort. To keep the automated pipeline performing well, we must run exploratory data analysis (EDA) regularly and confirm that the whole system is still producing accurate results.
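One way to make the recurring check concrete is to compare a recent sample of a feature against a reference sample taken when the pipeline was built. The sketch below uses the population stability index (PSI), which is one common drift measure and not something the article prescribes. The function name, the bin count, and the 1e-6 floor are illustrative assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of one numeric feature; larger values mean more drift.

    Hypothetical helper for illustration. Bins are equal-width over the
    range of the reference sample.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor (an assumption) avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    baseline = [float(x) for x in range(100)]
    shifted = [x + 50.0 for x in baseline]
    print("no drift:", population_stability_index(baseline, baseline))
    print("shifted: ", population_stability_index(baseline, shifted))
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions and should be tuned to the data at hand.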
What is exploratory data analysis?
EDA is the first step toward successful AI and ML. Before looking at the algorithms themselves, we first need to understand what the data contains. Data quality ultimately determines how well downstream analysis pipelines perform. Done properly, EDA helps users identify unwanted patterns and noise in the data, and guides companies toward choosing appropriate algorithms.
During EDA, we need to actively question the data to confirm that it behaves as expected. Start with the following ten questions, each of which deserves careful analysis:
1. Do you have sufficient data points?
2. Do the measures of central tendency and dispersion meet expectations?
3. How many data points are of good quality and can be used for actual analysis?
4. Are there missing values? Do they make up a significant share of the data?
5. What is the empirical distribution of the data? Does the data follow a normal distribution?
6. Are there special clusters or groupings in the values?
7. Are there any outliers? How should they be handled?
8. Are there correlations between different dimensions?
9. Does the data need to be transformed or reformatted for downstream analysis and interpretation?
10. If the data is high-dimensional, can the dimensionality be reduced without losing too much information? Are some of the dimensions just noise?
These questions will lead to more questions. The list is not exhaustive; it is only a starting point. Ultimately, I hope everyone can build a better understanding of the existing data, and then process it correctly and choose the most suitable algorithm.
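Several of the questions above (1 to 4, 7 and 8) can be answered with a few lines of standard-library Python. The following is a minimal sketch, not a complete EDA toolkit. The function names `profile_column` and `pearson` are hypothetical, and the 1.5 × IQR outlier fence is one common convention chosen here as an assumption.

```python
import math
import statistics
from typing import Optional, Sequence


def profile_column(values: Sequence[Optional[float]]) -> dict:
    """Summarise one numeric column; None and NaN are counted as missing."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    if len(present) < 2:
        raise ValueError("need at least two non-missing values")
    # Quartiles for the interquartile range (IQR).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Assumed convention: flag points more than 1.5 * IQR outside the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),                                # question 1
        "mean": statistics.fmean(present),                   # question 2
        "median": statistics.median(present),                # question 2
        "stdev": statistics.stdev(present),                  # question 2
        "usable": len(present),                              # question 3
        "missing_fraction": 1 - len(present) / len(values),  # question 4
        "outliers": [v for v in present if v < low or v > high],  # question 7
    }


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long columns (question 8)."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (sx * sy)


if __name__ == "__main__":
    readings = [10.0, 11.0, 12.0, 13.0, 14.0, None, 100.0]
    print(profile_column(readings))
    print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

Questions about distribution shape, clustering, and dimensionality reduction (5, 6 and 10) usually call for plots or dedicated libraries, so they are left out of this sketch.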