Airbnb NYC Data Analysis
Analyze NYC Airbnb listings using Python (Pandas) to uncover pricing trends.
Project Goal
The purpose of this project is to understand what factors affect Airbnb prices in NYC. We explore how neighborhood, room type, and availability influence pricing patterns.
Key Questions
- Which neighborhoods have the highest or lowest Airbnb prices?
- How does room type influence price across different neighborhoods?
Tools & Technologies
- Python (Pandas, Matplotlib/Seaborn): Data analysis and visualization
- Markdown / GitHub: Project documentation and portfolio showcase
Data Source
- Dataset: NYC Airbnb listings
- Source: Inside Airbnb
- File used: listings.csv
Methodology
- Loaded the raw CSV into Pandas and cleaned the data (handled missing values, removed outliers).
- Aggregated and analyzed listings by neighborhood and room type using Pandas.
- Created visualizations (bar charts, heat maps) using Matplotlib/Seaborn.
- Compiled insights into a report for quick understanding.
import pandas as pd
df = pd.read_csv(r"C:\Users\Patrice Davis\Desktop\Projects\airbnb-nyc-analysis\data\AB_NYC_2019.csv")
df.head()
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
df.info()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              48895 non-null  int64
 1   name                            48879 non-null  object
 2   host_id                         48895 non-null  int64
 3   host_name                       48874 non-null  object
 4   neighbourhood_group             48895 non-null  object
 5   neighbourhood                   48895 non-null  object
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object
 9   price                           48895 non-null  int64
 10  minimum_nights                  48895 non-null  int64
 11  number_of_reviews               48895 non-null  int64
 12  last_review                     38843 non-null  object
 13  reviews_per_month               38843 non-null  float64
 14  calculated_host_listings_count  48895 non-null  int64
 15  availability_365                48895 non-null  int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
| | id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.889500e+04 | 4.889500e+04 | 48895.000000 | 48895.000000 | 48895.000000 | 48895.000000 | 48895.000000 | 38843.000000 | 48895.000000 | 48895.000000 |
| mean | 1.901714e+07 | 6.762001e+07 | 40.728949 | -73.952170 | 152.720687 | 7.029962 | 23.274466 | 1.373221 | 7.143982 | 112.781327 |
| std | 1.098311e+07 | 7.861097e+07 | 0.054530 | 0.046157 | 240.154170 | 20.510550 | 44.550582 | 1.680442 | 32.952519 | 131.622289 |
| min | 2.539000e+03 | 2.438000e+03 | 40.499790 | -74.244420 | 0.000000 | 1.000000 | 0.000000 | 0.010000 | 1.000000 | 0.000000 |
| 25% | 9.471945e+06 | 7.822033e+06 | 40.690100 | -73.983070 | 69.000000 | 1.000000 | 1.000000 | 0.190000 | 1.000000 | 0.000000 |
| 50% | 1.967728e+07 | 3.079382e+07 | 40.723070 | -73.955680 | 106.000000 | 3.000000 | 5.000000 | 0.720000 | 1.000000 | 45.000000 |
| 75% | 2.915218e+07 | 1.074344e+08 | 40.763115 | -73.936275 | 175.000000 | 5.000000 | 24.000000 | 2.020000 | 2.000000 | 227.000000 |
| max | 3.648724e+07 | 2.743213e+08 | 40.913060 | -73.712990 | 10000.000000 | 1250.000000 | 629.000000 | 58.500000 | 327.000000 | 365.000000 |
Exploring the Dataset
In cell 5, we used df.info() and df.describe() to get an overview of the Airbnb dataset's structure, quality, and numeric characteristics.
Using df.info()
df.info() provides a concise summary of the dataset. Here's what it reveals:
- Checks for missing values: The output shows the number of non-null entries for each column against the total number of rows (48,895), which identifies columns with missing data that may need to be cleaned or removed. In this dataset, last_review and reviews_per_month are each missing 10,052 values (about 20.6% of rows), while host_name is missing only 21. None of the three is essential to the pricing analysis, so we excluded them (see cells 6–8).
- Displays data types: Each column's data type (int64, float64, object, etc.) is shown. Knowing whether a column is numeric or categorical matters because it determines which operations and visualizations are appropriate. Price and most other measures here are integers or floats, so histograms and box plots suit their distributions, and bar charts suit group averages.
- Estimates memory usage: The summary reports the dataset's memory usage (6.0+ MB here), which is useful when working with large files to ensure efficient performance and avoid potential memory issues.
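The missing-value shares can be computed directly instead of being read off the non-null counts. The following is a minimal sketch on a made-up toy frame (the `toy` data and the variable names are illustrative assumptions, not part of the notebook), followed by the same arithmetic applied to the counts that df.info() reports for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room", "Flat"],
    "price": [120, 80, 95, 60, 200],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2, 0.1],
})

# isnull() marks missing cells; the column mean of those booleans
# is the fraction of rows that are missing.
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)

# The same arithmetic using the non-null counts from the df.info() output.
total_rows = 48895
reviews_missing_pct = round((total_rows - 38843) / total_rows * 100, 1)
host_name_missing = total_rows - 48874
print(reviews_missing_pct, host_name_missing)
```

On the real data, replacing `toy` with `df` would give the per-column percentages in one call.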
Using df.describe()
df.describe() provides a statistical summary of the numeric columns. Here's what it tells us:
- Count of non-null values: Confirms the number of entries present for each numeric column.
- Measures of central tendency: Includes the mean and the median (50th percentile), giving an idea of typical values in each column. For price, the mean (about $153) sits well above the median ($106), a sign of right skew.
- Measures of spread: Includes the standard deviation (std), minimum (min), maximum (max), and quartiles (25th and 75th percentiles), which help identify the variability in the data and detect potential outliers. For price, the maximum ($10,000) lies far beyond the 75th percentile ($175), and the minimum is $0.
- Informs visualizations and analysis: Understanding the distribution of numeric data guides the choice of charts and statistical methods. For example, skewed distributions are better visualized with histograms or box plots, while evenly distributed data can be summarized with bar charts.
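One common way to turn the quartiles into an outlier check is Tukey's 1.5 × IQR rule. The sketch below applies it to the price quartiles reported by df.describe(); it is a rule of thumb shown for illustration, not the method this notebook uses (the analysis applies a fixed $1,000 cutoff instead).

```python
# Quartiles, mean, and median of `price` as reported by df.describe().
q1, q3 = 69.0, 175.0
mean_price, median_price = 152.720687, 106.0

iqr = q3 - q1                   # interquartile range
upper_fence = q3 + 1.5 * iqr    # prices above this would be flagged as high outliers
lower_fence = q1 - 1.5 * iqr    # negative here, so no price can be a low outlier

# A mean above the median is a quick indicator of right skew.
right_skewed = mean_price > median_price

print(iqr, lower_fence, upper_fence, right_skewed)
```

Under this rule any listing priced above $334 would count as an outlier, which is much stricter than the $1,000 cutoff and would discard many ordinary high-end listings.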
df.isnull().sum()
df = df.drop(columns=["last_review", "reviews_per_month"])
df = df.drop(columns=["host_name"])
In cells 6–8, I counted the missing values in each column and determined which features were essential for analyzing listing prices. The columns "last_review", "reviews_per_month", and "host_name" were not relevant to the project goal, so I removed them to create a cleaner, more focused dataset for analysis.
df['name'] = df['name'].fillna("Unknown Property")
In cell 9, I identified that keeping the property names was important. Since there were only 16 missing values, I chose to fill them with “Unknown Property” to maintain data integrity.
# Average price by borough
neigh_group_price = df.groupby("neighbourhood_group")["price"].mean().sort_values(ascending=False)
print(neigh_group_price)
# Bar chart
import matplotlib.pyplot as plt
neigh_group_price.plot(kind="bar", title="Average Airbnb Price by Neighborhood Group")
plt.ylabel("Average Price ($)")
plt.show()
neighbourhood_group
Manhattan        196.875814
Brooklyn         124.383207
Staten Island    114.812332
Queens            99.517649
Bronx             87.496792
Name: price, dtype: float64
In cell 10, I used Matplotlib to create a bar chart visualizing the average Airbnb price by neighborhood group, helping to identify pricing trends across areas.
This bar chart shows that Manhattan has the highest average Airbnb price, close to $200, while the Bronx has the lowest, averaging under $100. Brooklyn and Staten Island fall in the middle range, followed by Queens.
Key Insight: Location is a major factor influencing Airbnb prices, with Manhattan's average price (about $197) more than double the Bronx average (about $87).
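The size of the gap between boroughs can be checked from the printed means. This sketch rebuilds the series by hand from the output above (rounded to two decimals) and expresses each borough as a multiple of the cheapest one; the name `ratio_to_cheapest` is illustrative.

```python
import pandas as pd

# Borough means copied from the printed output, rounded to 2 decimals.
neigh_group_price = pd.Series(
    {
        "Manhattan": 196.88,
        "Brooklyn": 124.38,
        "Staten Island": 114.81,
        "Queens": 99.52,
        "Bronx": 87.50,
    },
    name="price",
)

# Divide by the smallest mean so the cheapest borough becomes 1.0.
ratio_to_cheapest = (neigh_group_price / neigh_group_price.min()).round(2)
print(ratio_to_cheapest)
```

In the notebook itself the same line would work on the `neigh_group_price` series produced by the groupby, without retyping the values.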
# Average price by room type
room_type_price = df.groupby("room_type")["price"].mean().sort_values(ascending=False)
print(room_type_price)
# Bar chart
room_type_price.plot(kind="bar", title="Average Airbnb Price by Room Type")
plt.ylabel("Average Price ($)")
plt.show()
room_type
Entire home/apt    211.794246
Private room        89.780973
Shared room         70.127586
Name: price, dtype: float64
In cell 11, I used Matplotlib to create a bar chart showing the average Airbnb price by room type, highlighting pricing trends across different types of accommodations.
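The two groupings above look at borough and room type separately. To see how room type affects price within each borough, one option is a pivot table with both dimensions. The following is a sketch on made-up toy data that reuses the dataset's column names; `toy` and `price_grid` are illustrative names, and the values are not from the real listings.

```python
import pandas as pd

# Toy listings (made-up values) with the same column names as the dataset.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [250, 110, 210, 160, 70, 80],
})

# Rows are boroughs, columns are room types, cells are mean prices.
price_grid = toy.pivot_table(
    index="neighbourhood_group",
    columns="room_type",
    values="price",
    aggfunc="mean",
)
print(price_grid)
```

A grid like this could then be passed to seaborn's `heatmap` (for example with `annot=True`) to produce the kind of heat map mentioned in the methodology.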
#--- Distribution of prices ---
plt.figure(figsize=(12,6))
df['price'].hist(bins=50)
plt.title("Distribution of Airbnb Prices")
plt.xlabel("Price ($)")
plt.ylabel("Number of Listings")
plt.show()
This chart shows the overall distribution of listing prices. The majority of Airbnb rentals fall well below $2,000, with a sharp concentration under $500. The y-axis highlights just how many listings exist in these lower price ranges: tens of thousands compared to only a handful at higher prices.
Key Insight: Most Airbnb rentals are budget-friendly, with only a small fraction priced near the upper end of the scale. This heavy skew toward lower prices emphasizes the importance of filtering out extreme outliers when analyzing trends.
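The concentration at low prices can also be stated as a share of listings rather than read off the histogram. A minimal sketch, using a made-up price series (the values and thresholds below are illustrative assumptions):

```python
import pandas as pd

# Toy prices (made up), skewed like the histogram: many low values, one extreme.
prices = pd.Series([40, 55, 60, 75, 90, 120, 150, 180, 450, 5000])

# Averaging a boolean mask gives the fraction of listings under each threshold.
for threshold in (100, 500, 2000):
    share = (prices < threshold).mean()
    print(f"under ${threshold}: {share:.0%}")

share_under_500 = (prices < 500).mean()
```

Running the same loop on `df['price']` would put exact percentages behind the statement that most listings fall under $500.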
# --- Boxplots to visualize price outliers ---
import seaborn as sns
# Plot the raw data first so the extreme outliers are visible
plt.figure(figsize=(10,6))
sns.boxplot(x="room_type", y="price", data=df)
plt.title("Price Distribution by Room Type (Raw Data)")
plt.show()
# Keep only listings priced under $1,000; the box plots that follow use df_clean
df_clean = df[df['price'] < 1000]
Cell 21: I used a box plot to identify extreme outliers in the dataset. Many listings were priced far above the typical range, which risked distorting the analysis. To address this, I applied a cutoff at $1,000, focusing on the majority of rentals where meaningful pricing patterns emerge. This filtering improves the reliability of insights while still capturing the vast majority of real-world listings.
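Before relying on a cutoff, it helps to measure how much data it removes. The sketch below wraps the filter in a hypothetical helper, `apply_price_cutoff`, that also returns the share of rows kept. The `drop_zero` option is an assumption added here because df.describe() reports a minimum price of 0; the notebook itself does not filter those rows.

```python
import pandas as pd

def apply_price_cutoff(frame, cap=1000, drop_zero=False):
    """Keep rows priced below `cap`; optionally also drop zero-priced rows.

    Returns the filtered frame and the fraction of rows that were kept.
    """
    mask = frame["price"] < cap
    if drop_zero:
        mask &= frame["price"] > 0
    kept = frame[mask]
    return kept, len(kept) / len(frame)

# Toy prices (made up) spanning zero, typical, and extreme values.
toy = pd.DataFrame({"price": [0, 80, 150, 300, 999, 1000, 2500, 10000]})

df_capped, share_kept = apply_price_cutoff(toy)
df_positive, share_positive = apply_price_cutoff(toy, drop_zero=True)
print(len(df_capped), share_kept)
print(len(df_positive), share_positive)
```

Note that the filter uses a strict `<`, matching the notebook's `df['price'] < 1000`, so a listing priced at exactly $1,000 is excluded.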
# Boxplot by neighborhood group
plt.figure(figsize=(10,6))
sns.boxplot(x='neighbourhood_group', y='price', data=df_clean)
plt.title("Price Distribution by Neighborhood Group (under $1000)")
plt.show()
# Boxplot by room type
plt.figure(figsize=(10,6))
sns.boxplot(x='room_type', y='price', data=df_clean)
plt.title("Price Distribution by Room Type (under $1000)")
plt.show()
Price Distributions by Location and Room Type (under $1000)
The filtered box plots reveal two major factors shaping Airbnb pricing in New York City:
- Location: Manhattan and Brooklyn consistently show higher median prices and wider spreads than Queens, Staten Island, and the Bronx, underscoring the premium associated with popular areas.
- Room Type: Entire homes/apartments command significantly higher prices than private or shared rooms, with more variability due to the presence of luxury listings.
Key Insight: Both where an Airbnb is located and what type of room it offers are critical determinants of price. Premium neighborhoods and full-property rentals not only drive higher median prices but also contribute to greater price variability, reflecting a mix of standard listings and high-end outliers.
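The medians and spreads that the box plots show visually can also be tabulated. This is a sketch on made-up toy data; `toy` and `summary` are illustrative names, and on the real data `toy` would be replaced with `df_clean`.

```python
import pandas as pd

# Toy listings (made-up prices) for two room types.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "price": [120, 180, 220, 400, 50, 70, 90, 110],
})

# Quartiles per group; unstack() turns the quantile level into columns.
summary = toy.groupby("room_type")["price"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q1", "median", "q3"]

# The interquartile range is the height of each box in a box plot.
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

Grouping by "neighbourhood_group" instead of "room_type" would give the matching table for the borough box plot.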
Final Summary
Through cleaning and visualizing the data, I identified two major drivers of Airbnb pricing in New York: location and room type.
- Manhattan and Brooklyn consistently command higher prices with greater variability than other boroughs.
- Entire homes/apartments are priced significantly above private or shared rooms.
- Most listings are budget-friendly, concentrated under $500, with outliers skewing the raw distribution.
By filtering extreme outliers and focusing on the bulk of listings, the analysis provides a clearer view of typical pricing trends. These findings align with market expectations: popular locations and full-property rentals command premiums, while affordability is concentrated in private and shared rooms.