The main objective of this laboratory is to put into practice what you have learned about regression techniques. You will work on a tabular dataset. In particular, you will build a regression model that predicts the price of an Airbnb apartment from the information attached to its listing.
Important note. You are encouraged to upload your results for this laboratory to the competition we launched on our platform, even though the submission will not count towards your final exam mark. Use the same personal key you already used for Lab 5. If you do not have a key yet, please write to giuseppe.attanasio@polito.it. Refer to Section 3 to read more about the competition.
1 Preliminary steps
1.1 Datasets
In this laboratory, you will use a publicly available dataset. Public Domain Dedication datasets constitute an extremely valuable asset for the data science community. If you want to know more about how they are distributed, refer to the CC0 licence.
1.1.1 New York City Airbnb Open Data
This public dataset contains Airbnb data, and the original source can be found on Inside Airbnb.
Each row of the dataset corresponds to an Airbnb listing in New York City for the year 2019. As in the previous competition, the dataset has been divided into a Development set and an Evaluation set. You will find more about them later in the document.
Each file has an initial header line, containing the names of attributes at your disposal:
• id: a unique identifier of the listing
• name
• host_id: a unique identifier of the host
• host_name
• neighborhood_group: neighborhood location in the city
• neighborhood: name of the neighborhood
• latitude: coordinate expressed as floating point number
• longitude: coordinate expressed as floating point number
• room_type
• price: price per night expressed in dollars
• minimum_nights: minimum nights requested by the host
• number_of_reviews
• last_review: date of the last review expressed as YYYY-MM-DD
• reviews_per_month: average number of reviews per month
• calculated_host_listings_count: number of listings owned by the host
• availability_365: number of days when the listing is available for booking
You can download the dataset at: https://github.com/dbdmg/data-science-lab/raw/master/datasets/NYC_Airbnb.zip
1.1.2 Dataset tree hierarchy
The listings have been split by uniform sampling into two separate collections, each stored in a different file. The dataset archive is organized as follows:
• development.csv (Development set): a collection of listings with the price column. This collection of data has to be used during the development of the regression model.
• evaluation.csv (Evaluation set): a collection of listings without the price column. This collection of data has to be used to produce the submission file.
• sample_submission.csv: a sample submission file.
By now, you should be used to working with training, validation and test sets while developing your models. In this case, the Development data must be used to train your model and tune its hyper-parameters, while you should consider the Evaluation portion as the actual test set.
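The following is a minimal sketch of how the two files could be loaded and how a validation portion could be held out from the Development set. It assumes pandas and scikit-learn are installed and that development.csv and evaluation.csv sit directly in the folder passed as root. The function names load_sets and split_development, the 80/20 split and the seed are illustrative choices, not part of the assignment.

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def load_sets(root="."):
    """Read the Development and Evaluation sets from `root` (hypothetical location)."""
    root = Path(root)
    # last_review is documented as a YYYY-MM-DD date, so parse it as a datetime.
    development = pd.read_csv(root / "development.csv", parse_dates=["last_review"])
    evaluation = pd.read_csv(root / "evaluation.csv", parse_dates=["last_review"])
    return development, evaluation


def split_development(development, valid_size=0.2, seed=42):
    """Hold out part of the Development set for validation (assumed 80/20 split)."""
    features = development.drop(columns=["price"])
    target = development["price"]
    # Returns X_train, X_valid, y_train, y_valid.
    return train_test_split(features, target, test_size=valid_size, random_state=seed)
```

The Evaluation set is only read here: it has no price column, so it cannot take part in the split and is kept aside until the final prediction.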
2 Exercises
In this laboratory, you have a single regression task to carry out.
2.1 NYC Airbnb listing price regression
In this exercise, you will try to predict the price of an Airbnb listing in NYC, published in 2019, using several pieces of contextual information. To do so, your primary goal will be to model, through a regression-based pipeline, the relationship between the information on the listing (e.g. its geographical location, the reviews it received, or any other metric you might derive) and the price itself.
Once your model is complete, you will predict, for a set of listings whose price is unknown, how much it would cost you to spend one night at each of them.
Finally, you will be able to upload your regression results and participate in the lab competition.
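As a rough illustration of what such a regression-based pipeline might look like, the sketch below one-hot encodes two categorical attributes, passes a few numerical ones through unchanged, and tunes a regressor with a cross-validated grid search. Everything in it is an assumption made for the example: the column names are copied from the header list in Section 1.1.1 (adjust them if the CSV spells them differently), the random forest and its parameter grid are arbitrary choices, and build_search and fit_and_predict are hypothetical helper names. Columns that may contain missing values are deliberately left out.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed column names, taken from the attribute list of the handout.
CATEGORICAL = ["neighborhood_group", "room_type"]
NUMERICAL = [
    "latitude", "longitude", "minimum_nights",
    "number_of_reviews", "calculated_host_listings_count", "availability_365",
]


def build_search(param_grid=None, cv=5):
    """Grid search over a preprocessing + regression pipeline, scored with R2."""
    preprocess = ColumnTransformer([
        # Categories unseen during fit are encoded as all zeros instead of failing.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("numerical", "passthrough", NUMERICAL),
    ])
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestRegressor(random_state=0)),
    ])
    if param_grid is None:
        # Illustrative grid only; widen or change it for real experiments.
        param_grid = {"model__n_estimators": [50, 100], "model__max_depth": [None, 10]}
    return GridSearchCV(pipeline, param_grid, scoring="r2", cv=cv)


def fit_and_predict(development, evaluation, **search_kwargs):
    """Tune on the Development set, then predict prices for the Evaluation set."""
    search = build_search(**search_kwargs)
    search.fit(development[CATEGORICAL + NUMERICAL], development["price"])
    # GridSearchCV refits the best configuration on all the data it was given.
    predicted = search.predict(evaluation[CATEGORICAL + NUMERICAL])
    return search, pd.Series(predicted, index=evaluation.index, name="price")
```

Keeping the encoder inside the pipeline means it is fitted only on the training folds of each cross-validation round, so no information from the validation folds leaks into the preprocessing.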
1. Load the dataset from the root folder.
2. Focus now on the data preparation step. You should have noticed that the attributes that describe each listing are heterogeneous, both in their source (e.g. geographical, related to the host, related to Airbnb, etc.) and in their type (e.g. numerical, categorical, date, etc.). Before continuing, take your time to answer these questions:
• which attribute (or set of attributes) do you think could drive the price per night the most?
• can you detect any irregularity in any attribute distribution?
• if your regression model will fit on numerical data only, how could you handle categorical attributes?
Transform your initial dataset according to the ideas you came up with.
3. Once you have your final dataset representation, choose one of the regression models you know. Then, run the classic training-validation pipeline on the Development dataset to identify the best set of hyper-parameters for your model. As you can read in Section 3.3, we will evaluate your results with the R2 score (the coefficient of determination). Hence, optimizing it on the Development set is a reasonable choice.
4. Assign a price value to each listing in the Evaluation set.
5. Define a function to generate a 2D scatterplot of the prices. The chart must be drawn as a heatmap: use the latitude and longitude coordinates along the axes and the price value to assign a color to each point. Then, apply the function to the prices from the Development set and to the ones you predicted for the Evaluation set. From Section 1.1, you know that Development and Evaluation were generated by uniform sampling from the initial listings. So, what should you expect to see on the map if your regression were correct?
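One possible shape for the plotting function of step 5 is sketched below, assuming matplotlib is available. The name plot_price_map, the colormap, the marker size and the optional vmin/vmax arguments are illustrative choices rather than requirements of the exercise.

```python
import matplotlib.pyplot as plt


def plot_price_map(latitude, longitude, price, title="", vmin=None, vmax=None):
    """Scatter listings by coordinates and color each point by its price."""
    fig, ax = plt.subplots(figsize=(7, 6))
    points = ax.scatter(
        longitude, latitude,      # x axis: longitude, y axis: latitude
        c=price, cmap="viridis",  # color encodes the price
        vmin=vmin, vmax=vmax,     # optional fixed color range
        s=4, alpha=0.6,
    )
    fig.colorbar(points, ax=ax, label="price per night (USD)")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    ax.set_title(title)
    return fig, ax
```

Passing the same vmin and vmax to both calls (for instance 0 and a high percentile of the Development prices) keeps the two color scales comparable, so the Development map and the predicted Evaluation map can be read side by side. Call plt.show() afterwards to display the figures.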