Part 1 : Core
Business understanding:
-Background: "Overfishing" means that human fishing removes fish from the ocean faster than the remaining population can breed and replenish itself. The marine life captured by modern fisheries exceeds what the ecosystem can sustain, so the whole marine system degrades ecologically. The ocean provides the largest space for biological growth, and the fishing that humans have practised since ancient times has today been replaced by large-scale industrial fishery production. As humans push further into the ocean, their demands on it keep growing; when those demands exceed the ocean's carrying capacity, marine fishery resources begin to shrink and species may eventually go extinct. In recent years, with the rapid growth of the world's population, world fishery production has also expanded rapidly, and many fishing areas have been overfished.
-Business objective: My aim is to determine whether overfishing happened in New Zealand.
-Data mining goal: I want to use data to show whether people's fishing activities have affected fish stocks in New Zealand, and whether the government has taken appropriate measures to address this issue.
Data understanding:
Source of the dataset:
My dataset is the Environmental-economic accounts. I got the dataset from this link on data.govt.nz: https://catalogue.data.govt.nz/dataset/environmental-economic-accounts-2019-tables/resource/9c0fdf15-1d92-4163-a32c-0e2f98e75b76
From this series of datasets I chose "Fish monetary stock account, 1996–2018 – CSV".
-Description of the dataset :
This dataset shows catch and fishing trends for a number of fish species from 1996 to 2018, the annual government limits on fishing each species, and the economic benefits from these fish. Environmental-economic accounts show how our environment contributes to our economy, the impacts of economic activity on our environment, and how we respond to environmental issues.
-When I opened the dataset for the first time, I found many missing values; the dataset is also too large (6,322 rows) for my analysis purposes and too complicated (too many species), which may mask the important features I want to identify. So the first thing to do after loading the dataset is to filter the data.
Data Preparation:
This dataset has several flaws, so I prepared it as follows:
1. Missing values: I removed rows with a missing value, because the number of rows is large enough (data_value, attribute index 6, is the only attribute that may be missing). I used Weka's RemoveWithValues filter for this.
2. I kept only the top ten fish species by amount of catch.
3. I removed Asset_value (unit: dollars) from the variable attribute because its unit is not consistent with the other two (Catch and TACC, in tons), and I do not actually need this kind of data.
4. I removed the All_species value from the variable attribute because, as an aggregate of all species, it has a great (distorting) impact on the result.
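The preparation steps above can be sketched in pandas instead of Weka. The column names ("species", "variable", "data_value") and the tiny sample rows below are hypothetical placeholders for illustration, not the real file contents:

```python
import pandas as pd

# Hypothetical miniature of the stock-account table.
df = pd.DataFrame({
    "species":    ["Snapper", "Snapper", "Hoki", "All_species"],
    "variable":   ["Catch", "Asset_value", "Catch", "Catch"],
    "data_value": [100.0, None, 250.0, 400.0],
})

# 1. Remove rows with a missing data_value (like RemoveWithValues).
df = df.dropna(subset=["data_value"])

# 2. Keep only the top ten species by total catch.
top = df.groupby("species")["data_value"].sum().nlargest(10).index
df = df[df["species"].isin(top)]

# 3. Drop Asset_value rows: dollars are not comparable with tons.
df = df[df["variable"] != "Asset_value"]

# 4. Drop the All_species aggregate so it cannot dominate the result.
df = df[df["species"] != "All_species"]

print(df["species"].tolist())  # ['Snapper', 'Hoki']
```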
Modeling:
Before modelling, the techniques I decided to try are classification, clustering, and regression.
Pipeline:
Evaluation:
Result of Pipeline:
Linear regression:
Since irrelevant attributes were filtered out during preparation, the data for the linear regression algorithm retains only the year, the type of fish, and data_value (the annual catch in tons). Because data_value is continuous, linear regression can be used. The purpose of the experiment is to predict future trends from known data, which matches the purpose of linear regression: for my dataset, it predicts future catch from the existing annual catch figures. The correlation coefficient shows that the results are very satisfactory, so linear regression is well suited here. Linear regression describes the relationship between attributes with a straight line, so when new data appears a single value can be predicted.
Logistic regression:
Logistic regression differs from linear regression. Despite its name, logistic regression is essentially a classification algorithm; it is not intended to predict a continuous value, so its result takes the same form as a classifier's. It is used for classifying discrete variables: the range of its output y is a discrete set, it is mainly used for class discrimination, and its output represents membership of a certain class.
Logistic regression is mainly used for classification problems and is often used to predict probabilities. For example, knowing a person's age, weight, height, blood pressure and other information, it can predict the probability of suffering from heart disease. Classic logistic regression handles the two-class problem (labels 0 and 1 only).
Logistic function:
The logistic (sigmoid) function is y = 1 / (1 + e^(-x)); for any x value, the corresponding y value lies within the interval (0, 1).
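A minimal sketch of the logistic (sigmoid) function in Python:

```python
import math

def logistic(x: float) -> float:
    # Sigmoid: maps any real x into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0))  # 0.5
print(0 < logistic(-10) < logistic(10) < 1)  # True
```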
IBk:
Because IBk (k-nearest neighbours) is a classification algorithm, its result is the number of correctly classified instances; but in fact the purpose of this experiment does not require classifying the dataset.
SimpleKMeans:
I think clustering is not applicable to this experiment. The main reason is that the data already have a label, and we do not want to divide the data into unlabelled groups; that would deviate from the experimental purpose.
Differences between these algorithms:
Linear regression and Logistic regression:
First, consider the two regression algorithms mentioned above: linear regression and logistic regression. Ordinary linear regression is mainly used for predicting continuous variables; that is, the output y of linear regression ranges over the whole real line (y ∈ R), so it suits my data (data_value).
Logistic regression is used for classifying discrete variables: the range of its output y is a discrete set, it is mainly used for class discrimination, and its output value y represents the probability of belonging to a certain class. This is different from my experimental purpose.
SimpleKMeans and IBk:
Clustering matters because it discovers the intrinsic grouping in unlabelled data. K-means is the simplest unsupervised learning algorithm that solves the clustering problem: it partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the cluster's prototype. However, it is not suitable for my purpose.
IBk is likewise not applicable: it is a classification method, and as noted above this experiment does not need classification.
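The nearest-mean assignment that k-means performs can be sketched in a few lines of plain Python. The 1-D values and initial centroids below are made up; Weka's SimpleKMeans does the same thing in more dimensions with smarter seeding:

```python
# Made-up 1-D values forming two obvious groups, with k = 2.
points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centroids = [0.0, 10.0]  # initial centroid guesses

for _ in range(10):  # a few Lloyd iterations are enough here
    clusters = [[] for _ in centroids]
    for p in points:
        # Assign each point to the cluster with the nearest mean.
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Recompute each centroid as the mean of its assigned points.
    centroids = [sum(c) / len(c) for c in clusters]

print([round(c, 2) for c in centroids])  # [1.03, 8.07]
```

The centroids settle on the means of the two groups; note that the species labels in my dataset play no role at all here, which is exactly why clustering does not serve this experiment's purpose.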
Conclusion:
All in all, linear regression has the best performance (correlation coefficient 0.9341); the coefficient is quite high, which shows the attributes have a strong relationship. Linear regression describes the relationship between the data with a straight line, so a value can be predicted when new data appears; thus it can be said that the catch increases with the year. But I do not yet have enough evidence to show whether we overfished or not; the only thing I can conclude is that the catch is positively related to the year.
At the same time, I made a diagram of the TACC and Catch attributes to show the relationship between them intuitively, and we can see that only two species' catch values exceed the TACC.
So the questions from business understanding can now be partially answered:
-Business objective: My aim is to determine whether overfishing happened in New Zealand.
-Data mining goal: I want to use data to show whether people's fishing activities have affected fish stocks in New Zealand.
For these two questions, based on the diagrams above, we did overfish two species, Snapper and Silver Warehou, but we cannot ignore that the majority of species were not overfished. So we can only say that overfishing happened but did not cause a collapse of fish stocks, because Snapper and Silver Warehou fluctuate yet both stay within a range.
However, we cannot draw a definitive conclusion from these results, because my data is not good enough: it contains only a government regulation, the Total Allowable Commercial Catch (TACC), and no data about the biological (ecological-balance) limit of the catch. So my next step is to restart CRISP-DM and try to find more evidence and a more useful dataset.
When I do CRISP-DM again, I need to keep tracking the following new questions in business understanding:
Question 1: Is there any evidence of fish stocks collapsing in NZ waters?
In other words: how much has human fishing negatively affected fish in NZ waters, has it caused fish stocks to collapse, and has fishing caused a decrease in fish?
Question 2: Does fishing exceed the biological limit?
The TACC is set by the government, and I am really curious about how they decide the TACC value.
Question 3: Is the definition of the TACC related to the biological limit of fishing?
After posing Question 2, I began to wonder whether the TACC might be related to the biological limit of fishing.
Part 2 : Completion
The question I chose is Question 3: Is the definition of the TACC related to the biological limit of fishing?
The reason this is an interesting question is that the dataset I used before contains no reference indicator to compare my data against, so this is a good and essential question. As mentioned before, a single dataset is not enough to prove whether we are overfishing in NZ, because the TACC is a man-made regulation; I want to know whether something has been influencing the TACC. Therefore I need an indicator to compare against my data, so I introduce another two datasets: the soft limit of fishing and the hard limit of fishing. These datasets contain the performance of assessed fish stocks against each limit in different years. Explanation of the soft limit and hard limit:
Picture from https://www.mpi.govt.nz/growing-and-harvesting/fisheries/fisheries-management/fish-stock-status/
The soft limit dataset I got from:
https://data.mfe.govt.nz/table/53467-performance-of-assessed-fish-stock-in-relation-to-the-soft-limit-200915/data/
The hard limit dataset I got from:
https://data.mfe.govt.nz/table/53469-performance-of-assessed-fish-stock-in-relation-to-the-hard-limit-200915/data/
Data preparation:
What I want to do is build a dataset that retains only the values of TACC, soft limit, and hard limit, and then see what relationship holds between them (e.g. whether TACC grows as the soft and hard limits grow). First, I keep only the values of landings from stocks above the hard and soft limits, because fish actually landed are the real impact on the stock. Moreover, I found that the landings from stocks above the hard limit are very close to 100%, which means we have not done anything devastating, so I decided to use only the soft limit dataset.
1. My question is whether the definition of the TACC is related to the biological limit of fishing, so I removed the Catch values from my dataset and focused on the TACC and the new data.
2. I found that the units are not uniform: landings_from_stocks_above_soft_limit is a percentage, while data_value is in tons. So I needed to convert one to the other: I multiplied the All_species data_value from my original dataset (the aggregate I had removed before) by the percentage, which gives how many tons of fish were caught above the soft limit.
3. Merge: I then merged my original dataset with the soft limit dataset.
4. Dimensionality reduction: after merging the two datasets, we can see that source, flag, magnitude, units, soft_limit, and hard_limit have no relation to data_value, so we can remove these irrelevant attributes to make the result more reliable.
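The steps above can be sketched in pandas. All column names and numbers here are hypothetical placeholders, not the real attribute names or figures:

```python
import pandas as pd

# Hypothetical miniature of the stock account (tons)...
stock = pd.DataFrame({
    "year": [2009, 2010],
    "tacc": [500.0, 520.0],               # TACC, tons
    "all_species_catch": [450.0, 480.0],  # All_species data_value, tons
})
# ...and of the soft-limit table (a percentage, not tons).
soft = pd.DataFrame({
    "year": [2009, 2010],
    "landings_above_soft_limit": [96.0, 92.0],  # percent
})

# 3. Merge the two tables on year.
merged = stock.merge(soft, on="year")

# 2. Convert the percentage into tons caught above the soft limit.
merged["value_above"] = (
    merged["all_species_catch"] * merged["landings_above_soft_limit"] / 100
)

# 4. Keep only the attributes relevant to the question.
merged = merged[["year", "tacc", "value_above"]]
print(merged["value_above"].tolist())  # [432.0, 441.6]
```

The resulting table pairs each year's TACC with the tons landed above the soft limit, which is exactly what the regression in the next step compares.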
Modelling :
Evaluation:
Linear regression again performs very well (correlation coefficient 0.9714); the coefficient is quite high, showing that the attributes have a strong relationship. Since linear regression describes the relationship between the data with a straight line, a value can be predicted when new data appears; so it can be said that the value of the TACC is related to value_above (the value above the soft limit).
So now I can be fairly confident that when the government sets the TACC, it is partially based on the soft limit value. (The question I posed before can be answered.)
I drew a line graph below to show the relation between them:
From it we can see the relationship more clearly: the higher the value above the soft limit, the more fish we can catch; conversely, when part of it falls below the line, the government decreases the TACC to protect the fish.
Conclusion: Soft limit value →TACC →How much we can catch
This is a reasonable procedure for judging how many fish we can catch per year, and it is why widespread overfishing did not happen in New Zealand.
Part 3 : Challenge
Graph I got from the pipeline: