What are the steps of data analysis?

1. Problem definition

A typical scenario is analyzing enterprise data such as sales data, user data, operations data, and production data. What useful information do you need from these data to guide strategy? If the task is market research or industry analysis, for example, you need to know what information you want to obtain about that industry.

First, you need to determine what the problem is. What conclusion do you want to draw?

For example, what is the changing trend of air quality in a certain area?

What is the user portrait of Honor of Kings players? What kind of players spend a lot of money?

What are the key factors affecting the company's sales growth?

What are the core indicators that affect productivity and quality in the production process?

How can user portraits be analyzed to support precision marketing?

How can user behavior over some future period be predicted from historical data?

These questions may come from your existing experience and knowledge. For example, you already know that users buy different quantities on different days of the week, so you can quantify the relationship between sales volume and time through analysis and stock up accordingly. Or you know that the air quality in Beijing has been getting worse in recent years, and the possible factors include factory emissions, sandstorms, residential emissions, and weather. So when defining a problem, you need to think clearly about which factors the analysis should consider.
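The weekday example above can be sketched in a few lines. This is only an illustration: the `orders` list and its numbers are made up, and a real analysis would read the records from the company's own data.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical daily order counts: (day, quantity sold).
orders = [
    (date(2017, 5, 1), 120),   # Monday
    (date(2017, 5, 6), 210),   # Saturday
    (date(2017, 5, 8), 130),   # Monday
    (date(2017, 5, 10), 100),  # Wednesday
    (date(2017, 5, 13), 230),  # Saturday
]

# Group quantities by day of the week.
by_weekday = defaultdict(list)
for day, qty in orders:
    by_weekday[WEEKDAYS[day.weekday()]].append(qty)

# Average quantity sold on each weekday that appears in the data.
avg_by_weekday = {name: sum(q) / len(q) for name, q in by_weekday.items()}
print(avg_by_weekday)
```

With enough history, averages like these suggest how much stock to prepare for each day of the week.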

Some questions are less clear-cut. For example, what are the core indicators that affect quality in the production process? Is it the raw materials? The level of the equipment? The skill of the workers? The weather? The complexity of a process? The number of times an operation has to be repeated? These may not be obvious. If you are entering a new field and lack specialist knowledge, the question you define may need to be broader and cover more possibilities.

Defining the problem may require you to understand the core knowledge of the business and to build up experience that helps your analysis. To some extent, this is what we often call data thinking. Data analysis can often reveal correlations that are not easy to see otherwise, but an accurate definition of the problem greatly improves the efficiency of the analysis.

How to define the problem better?

This requires developing a feel for data through long-term practice. At the beginning, you may be handed a very large data set with a large number of fields and feel overwhelmed. Where should you start?

It is much easier if you have some experience. For example, if you want to study the physical factors that affect a runner's speed, you might study the athlete's height, leg length, weight, and even heart rate, blood pressure, and arm length, but you would not study the athlete's armpit hair length. That choice is based on existing knowledge. As another example, if you want to analyze the factors that influence housing prices in a place, common sense suggests urban population, geographical location, GDP, land price, and price level, and beyond those perhaps industrial structure, cultural conditions, and climate, but you generally would not study how good-looking the city's residents are.

The more questions you analyze, the more sensitive to data you become, and you form the habit of letting data analysis speak. At that point you can even make preliminary judgments and predictions from partial data and your own experience (although this cannot replace an accurate prediction based on complete samples). By then, you basically have data thinking.

2. Data collection

Once you have a specific question, you need to obtain the relevant data. For example, to explore the trend of air quality in Beijing, you may need to collect several years of air quality data, weather data, and even factory data, gas emission data, and schedules of major events in Beijing. To analyze the key factors that affect the company's sales, you need to retrieve the company's historical sales data, user portrait data, advertising data, and so on.

There are many ways to obtain data.

First, the company's sales and user data can be retrieved directly from the enterprise database, so you need SQL skills for tasks such as data extraction. For example, you can extract all the sales data for 2017, the data for the top 50 products sold this year, or the consumption data of users in Shanghai and Guangdong, as your needs require. SQL can complete these tasks with simple commands.
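A minimal sketch of such queries, using Python's built-in `sqlite3` module and an in-memory database so that it runs anywhere. The `sales` table, its columns, and its rows are assumptions made for illustration; a real enterprise database will have its own schema and SQL dialect.

```python
import sqlite3

# Hypothetical in-memory table standing in for an enterprise sales database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_date TEXT, product TEXT, region TEXT, amount REAL)"
)
rows = [
    ("2016-12-30", "kettle", "Shanghai", 120.0),
    ("2017-01-05", "kettle", "Shanghai", 150.0),
    ("2017-03-12", "lamp", "Guangdong", 80.0),
    ("2017-07-21", "lamp", "Beijing", 95.0),
    ("2017-11-11", "desk", "Guangdong", 400.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# All sales records from 2017 (ISO-formatted dates compare correctly as text).
sales_2017 = conn.execute(
    "SELECT * FROM sales WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31'"
).fetchall()

# Products ranked by total revenue; LIMIT 50 mirrors the "top 50" case.
top_products = conn.execute(
    "SELECT product, SUM(amount) AS total FROM sales "
    "GROUP BY product ORDER BY total DESC LIMIT 50"
).fetchall()

# Records for users in Shanghai and Guangdong.
regional = conn.execute(
    "SELECT * FROM sales WHERE region IN ('Shanghai', 'Guangdong')"
).fetchall()

print(len(sales_2017), top_products, len(regional))
```

Each extraction is a single `SELECT` statement; only the `WHERE`, `GROUP BY`, and `ORDER BY` clauses change.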

The second is to obtain external public data sets. Some research institutions, enterprises, and governments publish data, which you download from their websites. These data sets are usually relatively complete and of relatively high quality. This approach has a drawback: the data are usually published with some delay. They are still of great value because of their objectivity and authority.

The third is to write a web crawler and collect data online. For example, with a crawler you can collect the job postings for a position on a recruitment website, the rental listings for a city on a rental website, the list of movies with the highest Douban ratings, the like counts on Zhihu, and the comment lists on NetEase Cloud Music. With data captured from the Internet, you can analyze a particular industry or group of people, which is a practical way to do market research and competitor analysis.
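The extraction step of a crawler can be sketched with Python's standard `html.parser` module. The page structure below (the `job`, `title`, and `city` class names) is hypothetical, and the HTML is hard-coded so the example runs offline; a real crawler would first download the page and would need to respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical listing page; a real crawler would download this HTML
# (for example with urllib.request) instead of hard-coding it.
SAMPLE_HTML = """
<ul>
  <li class="job"><span class="title">Data Analyst</span><span class="city">Beijing</span></li>
  <li class="job"><span class="title">BI Engineer</span><span class="city">Shanghai</span></li>
</ul>
"""


class JobParser(HTMLParser):
    """Collects the text of <span class="title"> and <span class="city">."""

    def __init__(self):
        super().__init__()
        self.jobs = []       # one dict per <li class="job">
        self._field = None   # span class currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "job":
            self.jobs.append({})
        elif tag == "span" and attrs.get("class") in ("title", "city"):
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that appears inside a span we care about.
        if self._field and self.jobs:
            self.jobs[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = JobParser()
parser.feed(SAMPLE_HTML)
jobs = parser.jobs
print(jobs)
```

The result is a list of dictionaries, one per posting, ready for counting or grouping.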

Of course, the catch is that you usually cannot get all the data you need, which will affect your analysis results to some extent. Even so, you can still extract useful information from the limited data that is available.

3. Data preprocessing

In the real world, most data are dirty: incomplete and inconsistent. Analyzing them directly is either impossible or gives unsatisfactory results. There are many methods of data preprocessing, including data cleaning, data integration, data transformation, and data reduction. Only after the problems that affect the analysis have been handled can you get more accurate results.

Take air quality data as an example: many days have no readings because of equipment problems, some records are duplicated, and some values are invalid because the equipment failed.

We then need an appropriate method for each problem. For incomplete data, for instance, should we remove the record outright or fill it in from adjacent values? These are all issues that need to be considered.

This stage may also include grouping data, calculating basic descriptive statistics, drawing basic statistical charts, converting data values, normalizing data, and so on. These steps help us grasp the distribution of the data and form the basis for further in-depth analysis and modeling.
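The two options for incomplete data can be compared on a small example. The readings below are invented, and treating `None` as "not monitored" and `-1` as "equipment failure" is an assumption made for this sketch.

```python
# Hypothetical daily air-quality readings: (date, AQI).
# None marks a day the monitor did not report; -1 marks a device fault.
readings = [
    ("2017-01-01", 80),
    ("2017-01-02", None),
    ("2017-01-02", None),   # duplicated record
    ("2017-01-03", 95),
    ("2017-01-04", -1),     # invalid value from a failed device
    ("2017-01-05", 110),
]

# 1. Remove exact duplicates while keeping the original order.
deduped = list(dict.fromkeys(readings))

# 2. Treat invalid values as missing.
cleaned = [(d, v if v is not None and v >= 0 else None) for d, v in deduped]

# Option A: drop the days with missing values.
dropped = [(d, v) for d, v in cleaned if v is not None]

# Option B: fill each gap with the most recent valid value (forward fill).
filled = []
last = None
for d, v in cleaned:
    if v is None:
        v = last    # stays None if there is no earlier valid value
    else:
        last = v
    filled.append((d, v))

print(dropped)
print(filled)
```

Dropping keeps only observed values but shortens the series; filling keeps every day but introduces values that were never measured. Which is preferable depends on the analysis.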
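As a small illustration of descriptive statistics and normalization, the sketch below uses Python's standard `statistics` module on made-up values. Min-max scaling and z-scores are two common choices of normalization, picked here as examples rather than taken from the text.

```python
import statistics

values = [80, 95, 110, 150, 65]  # hypothetical AQI readings

# Basic descriptive statistics.
summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),  # sample standard deviation
    "min": min(values),
    "max": max(values),
}

# Min-max normalization rescales every value into the range [0, 1].
lo, hi = summary["min"], summary["max"]
normalized = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: zero mean, unit sample standard deviation.
zscores = [(v - summary["mean"]) / summary["stdev"] for v in values]

print(summary)
print(normalized)
```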

4. Data analysis and modeling

This stage requires you to understand the basic data analysis methods and data mining algorithms, and to know which scenarios and problems each method suits. Abuse and misuse of statistical methods should be avoided. They mostly arise from not being clear about which problems a method can solve, what assumptions it rests on, and what it requires of the data.

It is also extremely important to apply several statistical methods when exploring the data and to repeat the analysis. Each method has its own characteristics and limitations, so it is generally necessary to confirm a finding with several methods. Drawing conclusions from the result of a single method is unscientific.
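One way to see why a second method helps is to compute two correlation measures on the same data. The sketch below implements Pearson and Spearman correlation by hand on invented data; the choice of these two measures is an illustration, and the simple ranking function assumes there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks from 1 to n; assumes no tied values for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y always grows with x, but not along a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r_linear = pearson(x, y)
r_rank = spearman(x, y)
print(round(r_linear, 3), round(r_rank, 3))
```

Here the rank-based measure reports a perfect monotonic relationship while the linear measure is noticeably lower, a gap that would go unnoticed if only one of them were computed.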

For example, if you find that under certain conditions the sales volume is directly proportional to the price, you can build a linear regression model on that basis. If you find that the relationship between price and advertising is nonlinear, a straight-line fit is not appropriate, and you can start with a nonlinear model such as logistic regression, which suits outcomes that are binary or follow an S-shaped curve.
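A simple linear regression can be fitted with ordinary least squares in a few lines. The price and sales figures below are hypothetical, and `fit_line` is a helper written for this sketch, not a library function.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical observations: price and units sold.
price = [10, 12, 14, 16, 18]
sales = [52, 61, 69, 82, 91]

slope, intercept = fit_line(price, sales)

# Use the fitted line to estimate sales at a price of 20.
predicted = slope * 20 + intercept
print(slope, intercept, predicted)
```

The slope estimates how much sales change per unit of price within the observed range; extrapolating far beyond that range is not supported by the data.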

Generally speaking, regression analysis can meet a large share of analysis needs. You can also learn data mining algorithms and feature extraction methods to optimize your model and get better results.

5. Data visualization and data report writing

The most direct output of an analysis is the description and display of statistics.

For example, from the distribution of the data we can find the five cities with the highest wages, the current popularity ranking of programming languages, the trend of air quality in Beijing in recent years, and the regional distribution of condom consumption. These are results that simple data analysis and visualization can show.

Other results require exploring internal relationships, such as the key indicators that affect product quality. You need to analyze the correlation between each indicator and product quality in order to draw a correct conclusion. Similarly, if you need to predict product sales over some future period, you need to model the historical data in order to predict the future more accurately.

A data analysis report is not only a direct presentation of the analysis results but also a comprehensive account of the situation behind them. Industry analysis reports often examine various relationships from different angles, so you need the logic of a story. Learning to move from a broad question down to each aspect of the problem and arrive at convincing results takes constant practice.

To sum up, the general process of data analysis consists of these steps: problem definition, data collection, data preprocessing, data analysis and modeling, and data visualization and report writing.