
SML-Mini Project Solved

1        Problem and Data: actor classification
The technical problem is to tell which of two actors is male and which is female based on various properties of a film. Although we are predicting the gender of two actors, this is a binary classification problem: if actor 1 is male then actor 2 is female, and vice versa. For the context of this study, please first read this article: https://pudding.cool/2017/03/film-dialogue/. In this article, the authors looked at the amount of speaking in films by male and female actors in order to detect gender bias. It turns out that, in children’s movies in particular, it is male characters who do most of the talking (figure 1).

You are looking at this data from another perspective: measuring whether the gender of the lead role is predictable from the amount of dialogue the actors have, the year the film was made, how much money it made, and so on.

The training data set training.csv contains one output variable:

Lead Is either ’Female’ or ’Male’.

The lead is assumed to be the person who speaks most in the film (says the most words). The co-lead is assumed to be of the opposite gender: male if the lead is female, and female if the lead is male.

The following input variables are provided

Year That the film was released.

Number of female actors With major speaking roles.

Number of male actors With major speaking roles.

Gross Profits made by film.

Total words Total number of words spoken in the film.

Number of words male Number of words spoken by all other male actors in the film (excluding lead and co-lead)

Number of words female Number of words spoken by all other female actors in the film (excluding lead and co-lead)

Number of words lead Number of words spoken by lead.

Difference in words lead and co-lead Difference in number of words by lead and the actor of opposite gender who speaks most.

Lead Age Age of lead actor.

Co-lead Age Age of co-lead actor.

Mean Age Male Mean age of all male characters.

Mean Age Female Mean age of all female characters.
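As a minimal sketch of how the output variable could be turned into a binary target, the snippet below reads CSV text with the Python standard library and encodes Lead as 1 for ’Female’ and 0 for ’Male’. The two data rows are made up for illustration, the header names are assumed to match the variable names listed above (the real training.csv may spell them differently), and the helper name load_rows is hypothetical.

```python
import csv
import io

# Two made-up rows standing in for training.csv; the values are illustrative only.
# Header names are assumed to follow the variable list in this document.
SAMPLE_CSV = """Year,Gross,Total words,Number of words lead,Lead
1995,142.0,6394,2251,Female
2001,37.0,8780,2020,Male
"""


def load_rows(handle, label_column="Lead"):
    """Return (features, labels): numeric feature dicts and 0/1 labels (1 = Female)."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        label = row.pop(label_column).strip()
        if label not in ("Female", "Male"):
            raise ValueError(f"unexpected label: {label!r}")
        labels.append(1 if label == "Female" else 0)
        # Every remaining column is treated as a numeric input variable.
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


features, labels = load_rows(io.StringIO(SAMPLE_CSV))
print(labels)               # [1, 0]
print(features[0]["Year"])  # 1995.0
```

To read the real file, the same function could be called with an open file handle, for example `open("training.csv", newline="")`, instead of the in-memory sample.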

You are expected to use all the knowledge that you have acquired in the course about classification algorithms, to come up with one algorithm that you think is suited for this problem and which you decide to put ‘in production’. This algorithm will then be tested against a test set made available after peer review.

 

Figure 1: Gender bias in speaking roles in Hollywood films

2        Training
2.1     Methods to explore

The course has (so far) covered the following five ‘families’ of classification methods:

(i)        logistic regression

(ii)       discriminant analysis: LDA, QDA

(iii)     K-nearest neighbor

(iv)     Tree-based methods: classification trees, random forests, bagging

(v)      Boosting

In this project, you choose at least as many ‘families’ as you have group members, and within each ‘family’ at least one method to explore. To be clear, each group member should independently implement and write about one method. Who implemented which method should later be clearly stated in the contribution statement. All group members should be able to answer for all sections of the report.

2.2     What to do with each method

For each method you decide to explore, you should do the following:

(a)    Implement the method. We suggest that you use Python, and you may write your own code or use packages (the material from the problem solving sessions can be useful).

(b)    Tune the method to perform well.

(c)    Evaluate its performance using, e.g., cross validation.

Exactly how to carry out this evaluation is up to you to decide.
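Steps (a) to (c) can be sketched as follows for one method, K-nearest neighbor. This is only an illustration under stated assumptions, not the required approach: it uses the standard library only, the data are synthetic (two Gaussian clusters standing in for the real inputs), and the function names knn_predict and cross_val_error as well as the candidate values of k are hypothetical choices.

```python
import random


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to query (Euclidean)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0  # a tie (even k) falls back to class 0


def cross_val_error(xs, ys, k, n_folds=5, seed=0):
    """Mean misclassification rate of k-NN over n_folds held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        wrong = sum(knn_predict(train_x, train_y, xs[i], k) != ys[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / n_folds


# Synthetic stand-in data: class 1 is centred at (1, 1), class 0 at (-1, -1).
rng = random.Random(1)
xs, ys = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    center = 1.0 if label else -1.0
    xs.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
    ys.append(label)

# Tune k by comparing the cross-validated error of each candidate.
cv_errors = {k: cross_val_error(xs, ys, k) for k in (1, 5, 15)}
best_k = min(cv_errors, key=cv_errors.get)
print(cv_errors, best_k)
```

On the real data the inputs have very different scales (years, word counts, money), so a distance-based method such as this one would normally need the inputs standardized first. The same fold split can be reused for every method so that the cross-validated errors are comparable.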

Once you have completed the aforementioned tasks, you should select, with a good motivation (hint: cross validation), the method you will use ‘in production’ on a test set that will be made available later. Work on this part of the project together and write up the results together.

3        Feature importance
Some input variables make better predictions than others. In this task, investigate how important the following aspects are in predicting the gender of the lead:

•    Words spoken by males and females

•    Year of release

•    Money made by film

To do this, fit your model both with and without each of these variables, and also look at models that include just one variable. Which features are most important in getting a good prediction? Here you can use misclassification error, false positives, false negatives and ROC/AUC to see how well models with or without certain variables perform. For logistic regression, you can also use the Akaike Information Criterion (https://machinelearningmastery.com/probabilistic-model-selection-measures/) to test your fit. Do this work together as a group.
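The evaluation measures named above can be computed as in the following sketch, which uses only the standard library. The labels and predicted probabilities are made up for illustration, class 1 is assumed to be the positive class, the 0.5 threshold is an arbitrary choice, and the function names are hypothetical. The AIC function assumes a model that outputs a probability for class 1, such as logistic regression, and uses the usual definition 2k - 2 log L with k the number of fitted parameters.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for 0/1 labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn


def misclassification_error(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    _, fp, _, fn = confusion_counts(y_true, y_pred)
    return (fp + fn) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def aic(y_true, probs, n_params):
    """AIC = 2k - 2 log L for a Bernoulli model; probs are predicted P(class 1)."""
    log_lik = sum(math.log(p if t == 1 else 1.0 - p) for t, p in zip(y_true, probs))
    return 2 * n_params - 2 * log_lik


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]
y_pred = [int(p >= 0.5) for p in probs]

print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 1)
print(misclassification_error(y_true, y_pred))
print(roc_auc(y_true, probs))
print(aic(y_true, probs, n_params=3))
```

Computing these numbers for a model fitted with a variable and again without it gives a direct comparison of how much that variable contributes. A lower AIC indicates a better trade-off between fit and number of parameters.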

Then answer the following questions based on your analysis:

•    Do men or women dominate speaking roles in Hollywood movies?

•    Has gender balance in speaking roles changed over time (i.e. years)?

•    Do films in which men do more speaking make a lot more money than films in which women speak more?

Write one paragraph in answer to each question.

Finally, discuss your results about gender and film together in an open and free discussion. No point of view is considered unreasonable in this discussion, and you are free to say what you think in the context of the data. After your group discussion, write a joint two-paragraph reflection on what you have collectively learnt from this analysis. Again, the conclusions should be based on the data analysis done here, but there is no single correct answer.

