Airbnb NYC Data Analysis¶

Analyze NYC Airbnb listings using Python (Pandas) to uncover pricing trends.

Project Goal¶

The purpose of this project is to understand what factors affect Airbnb prices in NYC. We explore how neighborhood, room type, and availability influence pricing patterns.

Key Questions¶

  1. Which neighborhoods have the highest or lowest Airbnb prices?
  2. How does room type influence price across different neighborhoods?

Tools & Technologies¶

  • Python (Pandas, Matplotlib/Seaborn): Data analysis and visualization
  • Markdown / GitHub: Project documentation and portfolio showcase

Data Source¶

  • Dataset: NYC Airbnb listings
  • Source: Inside Airbnb
  • File used: AB_NYC_2019.csv (the file loaded in cell 4)

Methodology¶

  1. Loaded the raw CSV into Pandas and cleaned the data (handled missing values, removed outliers).
  2. Aggregated and analyzed listings by neighborhood and room type using Pandas.
  3. Created visualizations (bar charts, heat maps) using Matplotlib/Seaborn.
  4. Compiled insights into a report for quick understanding.
In [3]:
import pandas as pd
In [4]:
df = pd.read_csv(r"C:\Users\Patrice Davis\Desktop\Projects\airbnb-nyc-analysis\data\AB_NYC_2019.csv")
df.head()
Out[4]:
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
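
The absolute Windows path in cell 4 only resolves on one machine. A minimal sketch of a more portable load, assuming the notebook is run from the project root and the CSV sits in a `data` folder (both are assumptions read off the path above; `DATA_PATH` and `load_listings` are hypothetical names, not part of the original notebook):

```python
from pathlib import Path

import pandas as pd

# Assumed layout: <project root>/data/AB_NYC_2019.csv
DATA_PATH = Path("data") / "AB_NYC_2019.csv"


def load_listings(path: Path = DATA_PATH) -> pd.DataFrame:
    """Read the listings CSV, failing with a clear message if it is missing."""
    if not path.exists():
        raise FileNotFoundError(f"Expected the dataset at {path.resolve()}")
    return pd.read_csv(path)
```

Calling `load_listings()` would then stand in for the hard-coded `pd.read_csv(...)` call, provided the working directory matches the assumed layout.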
In [5]:
df.info()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     38843 non-null  object 
 13  reviews_per_month               38843 non-null  float64
 14  calculated_host_listings_count  48895 non-null  int64  
 15  availability_365                48895 non-null  int64  
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
Out[5]:
| | id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.889500e+04 | 4.889500e+04 | 48895.000000 | 48895.000000 | 48895.000000 | 48895.000000 | 48895.000000 | 38843.000000 | 48895.000000 | 48895.000000 |
| mean | 1.901714e+07 | 6.762001e+07 | 40.728949 | -73.952170 | 152.720687 | 7.029962 | 23.274466 | 1.373221 | 7.143982 | 112.781327 |
| std | 1.098311e+07 | 7.861097e+07 | 0.054530 | 0.046157 | 240.154170 | 20.510550 | 44.550582 | 1.680442 | 32.952519 | 131.622289 |
| min | 2.539000e+03 | 2.438000e+03 | 40.499790 | -74.244420 | 0.000000 | 1.000000 | 0.000000 | 0.010000 | 1.000000 | 0.000000 |
| 25% | 9.471945e+06 | 7.822033e+06 | 40.690100 | -73.983070 | 69.000000 | 1.000000 | 1.000000 | 0.190000 | 1.000000 | 0.000000 |
| 50% | 1.967728e+07 | 3.079382e+07 | 40.723070 | -73.955680 | 106.000000 | 3.000000 | 5.000000 | 0.720000 | 1.000000 | 45.000000 |
| 75% | 2.915218e+07 | 1.074344e+08 | 40.763115 | -73.936275 | 175.000000 | 5.000000 | 24.000000 | 2.020000 | 2.000000 | 227.000000 |
| max | 3.648724e+07 | 2.743213e+08 | 40.913060 | -73.712990 | 10000.000000 | 1250.000000 | 629.000000 | 58.500000 | 327.000000 | 365.000000 |

Exploring the Dataset¶

In cell 5, we used df.info() and df.describe() to get an overview of the Airbnb dataset's structure, quality, and numeric characteristics.

Using df.info()¶

df.info() provides a concise summary of the dataset. Here's what it reveals:

  • Checks for missing values: The output shows the number of non-null entries for each column compared to the total number of rows (48,895). This helps identify columns with missing data that may need to be cleaned or removed. In this dataset, last_review and reviews_per_month are each missing 10,052 values (about 21%), while host_name is missing only 21. We chose to exclude all three since they were not essential to the overall analysis (see cells 6–8).

  • Displays data types: Each column’s data type (int64, float64, object, etc.) is shown. Understanding whether a column is numeric or categorical is important because it determines which operations and visualizations are appropriate. Since most of our data consisted of floats and integers, box plots, histograms, and bar graphs were suitable for illustrating the findings.

  • Estimates memory usage: The summary includes the dataset’s memory usage, which is useful when working with large files to ensure efficient performance and avoid potential memory issues.
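
The missing-value share per column can be computed directly rather than read off the df.info() counts. The sketch below uses a tiny hand-made frame (`sample`, a hypothetical stand-in with illustrative values) so it runs without the CSV:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are illustrative only)
sample = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "reviews_per_month": [0.21, np.nan, np.nan, 4.64],
    "price": [149, 225, 150, 89],
})

# isna() marks missing cells; the column mean of those booleans is the missing share
missing_pct = (sample.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

The same arithmetic applied to the df.info() output above gives (48,895 − 38,843) / 48,895 ≈ 20.6% for last_review and reviews_per_month.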

Using df.describe()¶

df.describe() provides a statistical summary of the numeric columns. Here's what it tells us:

  • Count of non-null values: Confirms the number of entries present for each numeric column.

  • Measures of central tendency: Includes the mean and median (50th percentile), giving an idea of typical values in each column.

  • Measures of spread: Includes standard deviation (std), minimum (min), maximum (max), and quartiles (25th and 75th percentiles), which help identify the variability in the data and detect potential outliers.

  • Informs visualizations and analysis: Understanding the distribution of numeric data guides the choice of charts and statistical methods. For example, skewed distributions such as price are better visualized with histograms or box plots, while per-category averages can be summarized with bar charts.
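
One common way to turn the quartiles from df.describe() into an outlier check is the 1.5 × IQR rule of thumb. This is a sketch of that general technique on made-up prices, not the cutoff the notebook itself uses later:

```python
import pandas as pd

# Illustrative prices only; the last value plays the role of a luxury outlier
prices = pd.Series([60, 69, 80, 106, 120, 175, 200, 10000], name="price")

q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # Tukey's rule of thumb for high outliers

outliers = prices[prices > upper_fence]
print(f"mean={prices.mean():.1f}, median={prices.median():.1f}")
print(f"upper fence={upper_fence:.1f}, outliers={outliers.tolist()}")
```

With the quartiles reported above for price (25% = 69, 75% = 175), the same rule would put the upper fence at 175 + 1.5 × 106 = 334. The mean sitting far above the median, as in the toy data, is a quick sign of right skew.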

In [ ]:
df.isnull().sum()
In [ ]:
df = df.drop(columns=["last_review", "reviews_per_month"])
In [8]:
df = df.drop(columns=["host_name"])

From cells 6–8, I examined the dataset to identify columns with the most missing values and determined which features were essential for analyzing unit prices. Columns such as "last_review", "reviews_per_month", and "host_name" were not relevant to the project goal, so I removed them to create a cleaner, more focused dataset for analysis.
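
Dropping columns in separate cells raises a KeyError if a cell is re-run after the column is already gone. A minimal sketch of a re-runnable alternative, using a toy frame (`listings`, hypothetical) with the same column names:

```python
import pandas as pd

# Toy frame with the same column names as the listings data
listings = pd.DataFrame({
    "price": [149, 225],
    "host_name": ["John", "Jennifer"],
    "last_review": ["2018-10-19", "2019-05-21"],
    "reviews_per_month": [0.21, 0.38],
})

cols_to_drop = ["last_review", "reviews_per_month", "host_name"]

# errors="ignore" skips names that are already gone, so the cell can be re-run
trimmed = listings.drop(columns=cols_to_drop, errors="ignore")
trimmed = trimmed.drop(columns=cols_to_drop, errors="ignore")  # second run is a no-op
print(trimmed.columns.tolist())
```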

In [9]:
df['name'] = df['name'].fillna("Unknown Property")

In cell 9, I identified that keeping the property names was important. Since there were only 16 missing values, I chose to fill them with “Unknown Property” to maintain data integrity.

In [10]:
# Average price by borough
neigh_group_price = df.groupby("neighbourhood_group")["price"].mean().sort_values(ascending=False)
print(neigh_group_price)

# Bar chart
import matplotlib.pyplot as plt

neigh_group_price.plot(kind="bar", title="Average Airbnb Price by Neighborhood Group")
plt.ylabel("Average Price ($)")
plt.show()
neighbourhood_group
Manhattan        196.875814
Brooklyn         124.383207
Staten Island    114.812332
Queens            99.517649
Bronx             87.496792
Name: price, dtype: float64
[Figure: bar chart "Average Airbnb Price by Neighborhood Group"; y-axis: Average Price ($)]

In cell 10, I used Matplotlib to create a bar chart visualizing the average Airbnb price by neighborhood group, helping to identify pricing trends across areas.

This bar chart shows that Manhattan has the highest average Airbnb price, close to $200, while the Bronx has the lowest, averaging under $100. Brooklyn and Staten Island fall in the middle range, followed by Queens.

Key Insight: Location is a major factor influencing Airbnb prices, with Manhattan's average price (about $197) more than double the Bronx average (about $87).
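
The size of the gap between boroughs can be checked from the printed averages. The sketch below copies the cell 10 output into a Series (`mean_price` and `ratio_to_cheapest` are names introduced here for illustration) and expresses each borough relative to the cheapest one:

```python
import pandas as pd

# Borough means copied from the cell 10 output above
mean_price = pd.Series({
    "Manhattan": 196.875814,
    "Brooklyn": 124.383207,
    "Staten Island": 114.812332,
    "Queens": 99.517649,
    "Bronx": 87.496792,
})

# Divide every borough by the lowest average to get a relative price level
ratio_to_cheapest = (mean_price / mean_price.min()).round(2)
print(ratio_to_cheapest)
```

Because a mean is pulled upward by a few very expensive listings, it may be worth checking `df.groupby("neighbourhood_group")["price"].median()` alongside it.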

In [11]:
# Average price by room type
room_type_price = df.groupby("room_type")["price"].mean().sort_values(ascending=False)
print(room_type_price)

# Bar chart
room_type_price.plot(kind="bar", title="Average Airbnb Price by Room Type")
plt.ylabel("Average Price ($)")
plt.show()
room_type
Entire home/apt    211.794246
Private room        89.780973
Shared room         70.127586
Name: price, dtype: float64
[Figure: bar chart "Average Airbnb Price by Room Type"; y-axis: Average Price ($)]

In cell 11, I used Matplotlib to create a bar chart showing the average Airbnb price by room type. Entire homes/apartments average about $212, more than double the average for private rooms (about $90) and roughly three times the average for shared rooms (about $70).

In [12]:
#--- Distribution of prices ---
plt.figure(figsize=(12,6))
df['price'].hist(bins=50)
plt.title("Distribution of Airbnb Prices")
plt.xlabel("Price ($)")
plt.ylabel("Number of Listings")
plt.show()
[Figure: histogram "Distribution of Airbnb Prices"; x-axis: Price ($), y-axis: Number of Listings]

This chart shows the overall distribution of listing prices. The majority of Airbnb rentals fall well below $2,000, with a sharp concentration under $500; the df.describe() output puts the median at $106 and the 75th percentile at $175, against a maximum of $10,000. The y-axis highlights just how many listings exist in these lower price ranges: tens of thousands compared to only a handful at higher prices.

Key Insight: Most Airbnb rentals are budget-friendly, with only a small fraction priced near the upper end of the scale. This heavy skew toward lower prices emphasizes the importance of filtering out extreme outliers when analyzing trends.
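
The share of listings under a given price can be quantified instead of estimated from the histogram. A minimal sketch on made-up prices (the `prices` Series is illustrative, not drawn from the dataset):

```python
import pandas as pd

# Illustrative prices only, not taken from the dataset
prices = pd.Series([45, 69, 80, 106, 150, 175, 300, 480, 1200, 10000])

# Comparing to a threshold gives booleans; their mean is the share below it
for threshold in (200, 500, 1000):
    share = (prices < threshold).mean()
    print(f"under ${threshold}: {share:.0%}")

share_under_500 = (prices < 500).mean()
```

On the real data the same expression would be `(df['price'] < 500).mean()`.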

In [13]:
# --- Filtered copy used by the box plots below ---
# Keep only listings priced under $1000 for readability
df_clean = df[df['price'] < 1000]
In [14]:
import seaborn as sns
In [21]:
plt.figure(figsize=(10,6))
sns.boxplot(x="room_type", y="price", data=df)
plt.title("Price Distribution by Room Type (Raw Data)")
plt.show()
[Figure: box plot "Price Distribution by Room Type (Raw Data)"; x-axis: room_type, y-axis: price]

Cell 21: I used a box plot to identify extreme outliers in the dataset. Many listings were priced far above the typical range, which risked distorting the analysis. To address this, I applied a cutoff at $1,000, focusing on the majority of rentals where meaningful pricing patterns emerge. This filtering improves the reliability of insights while still capturing the vast majority of real-world listings.
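
It can help to report how many rows a cutoff removes, and to check what it leaves behind. The sketch below applies the same `< 1000` rule to a toy frame (`listings`, hypothetical); the zero-price check is an extra step suggested here, not something the notebook does:

```python
import pandas as pd

# Toy data; the real frame is `df` with a `price` column
listings = pd.DataFrame({"price": [0, 69, 106, 175, 650, 999, 1000, 10000]})

cutoff = 1000
kept = listings[listings["price"] < cutoff]  # same rule as cell 13

removed = len(listings) - len(kept)
print(f"kept {len(kept)} of {len(listings)} rows, removed {removed}")

# The describe() output shows a minimum price of 0; such rows survive this filter
zero_priced = int((kept["price"] == 0).sum())
print(f"zero-priced rows still present: {zero_priced}")
```

Note that a strict `<` comparison also excludes listings priced at exactly $1,000.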

In [18]:
# Boxplot by neighborhood group

plt.figure(figsize=(10,6))
sns.boxplot(x='neighbourhood_group', y='price', data=df_clean)
plt.title("Price Distribution by Neighborhood Group (under $1000)")
plt.show()

# Boxplot by room type
plt.figure(figsize=(10,6))
sns.boxplot(x='room_type', y='price', data=df_clean)
plt.title("Price Distribution by Room Type (under $1000)")
plt.show()
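[Figure: box plot "Price Distribution by Neighborhood Group (under $1000)"; x-axis: neighbourhood_group, y-axis: price]
[Figure: box plot "Price Distribution by Room Type (under $1000)"; x-axis: room_type, y-axis: price]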

Price Distributions by Location and Room Type (under $1000)¶

The filtered box plots reveal two major factors shaping Airbnb pricing in New York City:

  • Location: Manhattan and Brooklyn consistently show higher median prices and wider spreads than Queens, Staten Island, and the Bronx, underscoring the premium associated with popular areas.
  • Room Type: Entire homes/apartments command significantly higher prices than private or shared rooms, with more variability due to the presence of luxury listings.

Key Insight: Both where an Airbnb is located and what type of room it offers are critical determinants of price. Premium neighborhoods and full-property rentals not only drive higher median prices but also contribute to greater price variability, reflecting a mix of standard listings and high-end outliers.
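
To look at location and room type together, one option is a pivot table of median price by borough and room type. This is a sketch on toy rows that reuse the dataset's column names (`listings` and `median_grid` are hypothetical names and the prices are illustrative):

```python
import pandas as pd

# Toy rows using the dataset's column names; prices are illustrative
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Bronx"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [250, 200, 120, 160, 75, 60],
})

# Rows: borough, columns: room type, cells: median price
median_grid = listings.pivot_table(
    index="neighbourhood_group",
    columns="room_type",
    values="price",
    aggfunc="median",
)
print(median_grid)
```

A grid like this could be passed to `sns.heatmap(median_grid, annot=True)` to produce the kind of heat map mentioned in the Methodology. Combinations with no listings show up as NaN.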

Final Summary¶

Through cleaning and visualizing the data, I identified two major drivers of Airbnb pricing in New York: location and room type.

Manhattan and Brooklyn consistently command higher prices with greater variability than other boroughs.

Entire homes/apartments are priced significantly above private or shared rooms.

Most listings are budget-friendly, concentrated under $500, with outliers skewing the raw distribution.

By filtering extreme outliers and focusing on the bulk of listings, the analysis provides a clearer view of typical pricing trends. These findings align with market expectations: popular locations and full-property rentals command premiums, while affordability is concentrated in private and shared rooms.