Airbnb NYC Data Analysis¶

Analyze NYC Airbnb listings using Python (Pandas) to uncover pricing trends.

Project Goal¶

The purpose of this project is to understand what factors affect Airbnb prices in NYC. We explore how neighborhood, room type, and availability influence pricing patterns.

Key Questions¶

Which neighborhoods have the highest or lowest Airbnb prices?
How does room type influence price across different neighborhoods?

Tools & Technologies¶

Python (Pandas, Matplotlib/Seaborn): Data analysis and visualization
Markdown / GitHub: Project documentation and portfolio showcase

Data Source¶

Dataset: NYC Airbnb listings
Source: Inside Airbnb
File used: AB_NYC_2019.csv

Methodology¶

Loaded the raw CSV into Pandas and cleaned the data (handled missing values, removed outliers).
Aggregated and analyzed listings by neighborhood and room type using Pandas.
Created visualizations (bar charts, a histogram, and box plots) using Matplotlib/Seaborn.
Compiled insights into a report for quick understanding.

In [3]:

import pandas as pd

In [4]:

df = pd.read_csv(r"C:\Users\Patrice Davis\Desktop\Projects\airbnb-nyc-analysis\data\AB_NYC_2019.csv")
df.head()

Out[4]:

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365
0	2539	Clean & quiet apt home by the park	2787	John	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	1	9	2018-10-19	0.21	6	365
1	2595	Skylit Midtown Castle	2845	Jennifer	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225	1	45	2019-05-21	0.38	2	355
2	3647	THE VILLAGE OF HARLEM....NEW YORK !	4632	Elisabeth	Manhattan	Harlem	40.80902	-73.94190	Private room	150	3	0	NaN	NaN	1	365
3	3831	Cozy Entire Floor of Brownstone	4869	LisaRoxanne	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	1	270	2019-07-05	4.64	1	194
4	5022	Entire Apt: Spacious Studio/Loft by central park	7192	Laura	Manhattan	East Harlem	40.79851	-73.94399	Entire home/apt	80	10	9	2018-11-19	0.10	1	0

In [5]:

df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     38843 non-null  object 
 13  reviews_per_month               38843 non-null  float64
 14  calculated_host_listings_count  48895 non-null  int64  
 15  availability_365                48895 non-null  int64  
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB

Out[5]:

	id	host_id	latitude	longitude	price	minimum_nights	number_of_reviews	reviews_per_month	calculated_host_listings_count	availability_365
count	4.889500e+04	4.889500e+04	48895.000000	48895.000000	48895.000000	48895.000000	48895.000000	38843.000000	48895.000000	48895.000000
mean	1.901714e+07	6.762001e+07	40.728949	-73.952170	152.720687	7.029962	23.274466	1.373221	7.143982	112.781327
std	1.098311e+07	7.861097e+07	0.054530	0.046157	240.154170	20.510550	44.550582	1.680442	32.952519	131.622289
min	2.539000e+03	2.438000e+03	40.499790	-74.244420	0.000000	1.000000	0.000000	0.010000	1.000000	0.000000
25%	9.471945e+06	7.822033e+06	40.690100	-73.983070	69.000000	1.000000	1.000000	0.190000	1.000000	0.000000
50%	1.967728e+07	3.079382e+07	40.723070	-73.955680	106.000000	3.000000	5.000000	0.720000	1.000000	45.000000
75%	2.915218e+07	1.074344e+08	40.763115	-73.936275	175.000000	5.000000	24.000000	2.020000	2.000000	227.000000
max	3.648724e+07	2.743213e+08	40.913060	-73.712990	10000.000000	1250.000000	629.000000	58.500000	327.000000	365.000000

Exploring the Dataset¶

In cell 5, I used df.info() and df.describe() to get an overview of the Airbnb dataset's structure, quality, and numeric characteristics.

Using `df.info()`¶

df.info() provides a concise summary of the dataset. Here's what it reveals:

Checks for missing values: The output shows the number of non-null entries for each column compared to the total number of rows (48,895). This helps identify columns with missing data that may need to be cleaned or removed. In this dataset, last_review and reviews_per_month were each missing 10,052 values (about 21%), while host_name was missing only 21. None of the three was essential to the price analysis, so I excluded them (see cells 6–8).
Displays data types: Each column’s data type (int64, float64, object, etc.) is shown. Knowing whether a column is numeric or categorical matters because it determines which operations and visualizations are appropriate. Here, 10 of the 16 columns are numeric (7 int64, 3 float64), so box plots, histograms, and bar charts were suitable for illustrating the findings.
Estimates memory usage: The summary includes the dataset’s memory usage, which is useful when working with large files to ensure efficient performance and avoid potential memory issues.
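
The non-null counts above translate directly into missing-value shares. The sketch below recomputes them from the figures printed by df.info(); the variable names are illustrative and not part of the notebook. On the live DataFrame, `df.isnull().mean() * 100` should give the same percentages.

```python
import pandas as pd

# Figures copied from the df.info() output above
total_rows = 48895
non_null = pd.Series(
    {
        "name": 48879,
        "host_name": 48874,
        "last_review": 38843,
        "reviews_per_month": 38843,
    }
)

# Missing count and missing share (in percent) per column
missing = total_rows - non_null
missing_pct = (missing / total_rows * 100).round(2)

summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary.sort_values("missing", ascending=False))
```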

Using `df.describe()`¶

df.describe() provides a statistical summary of the numeric columns. Here's what it tells us:

Count of non-null values: Confirms the number of entries present for each numeric column.
Measures of central tendency: Includes the mean and median (50th percentile), giving an idea of typical values in each column.
Measures of spread: Includes standard deviation (std), minimum (min), maximum (max), and quartiles (25th and 75th percentiles), which help identify the variability in the data and detect potential outliers.
Informs visualizations and analysis: Understanding the distribution of numeric data guides the choice of charts and statistical methods. For example, a skewed column such as price (mean $152.72 against a median of $106, with a maximum of $10,000) is better visualized with a histogram or box plot, while averages per category can be compared with bar charts.
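
One common way to turn the quartiles into an outlier threshold is Tukey's 1.5 × IQR rule. The sketch below applies it to the price statistics printed by df.describe(). It only illustrates that rule; it is not the cutoff used later in this notebook, which is a flat $1,000.

```python
# Price statistics copied from the df.describe() output above
q1, median, q3 = 69.0, 106.0, 175.0
mean, maximum = 152.720687, 10000.0

# Interquartile range and Tukey's fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A mean well above the median is a quick sign of right skew
right_skewed = mean > median

print(f"IQR: {iqr}, fences: [{lower_fence}, {upper_fence}]")
print(f"Right-skewed: {right_skewed}")
print(f"Maximum is {maximum / upper_fence:.1f}x the upper fence")
```

The lower fence is negative here, so for prices only the upper fence is informative.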

In [ ]:

df.isnull().sum()

In [ ]:

df = df.drop(columns=["last_review", "reviews_per_month"])

In [8]:

df = df.drop(columns=["host_name"])

In cells 6–8, I checked which columns had missing values and decided which features were essential for analyzing listing prices. The columns last_review, reviews_per_month, and host_name were not relevant to the project goal, so I removed them to create a cleaner, more focused dataset for analysis.

In [9]:

df['name'] = df['name'].fillna("Unknown Property")

In cell 9, I decided to keep the property names. Since only 16 values were missing, I filled them with “Unknown Property” rather than dropping those rows.

In [10]:

# Average price by borough
neigh_group_price = df.groupby("neighbourhood_group")["price"].mean().sort_values(ascending=False)
print(neigh_group_price)

# Bar chart
import matplotlib.pyplot as plt

neigh_group_price.plot(kind="bar", title="Average Airbnb Price by Neighborhood Group")
plt.ylabel("Average Price ($)")
plt.show()

neighbourhood_group
Manhattan        196.875814
Brooklyn         124.383207
Staten Island    114.812332
Queens            99.517649
Bronx             87.496792
Name: price, dtype: float64

[Figure: bar chart titled "Average Airbnb Price by Neighborhood Group"; y-axis: Average Price ($)]

In cell 10, I used Matplotlib to create a bar chart visualizing the average Airbnb price by neighborhood group, helping to identify pricing trends across areas.

This bar chart shows that Manhattan has the highest average Airbnb price, close to $200, while the Bronx has the lowest, averaging under $100. Brooklyn and Staten Island fall in the middle range, followed by Queens.

Key Insight: Location is a major factor influencing Airbnb prices, with Manhattan commanding more than double the average price of listings in the Bronx (about $197 versus $87).
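
The size of the gap between boroughs can be quantified from the averages printed in cell 10. A small sketch, with the values copied from that output and `ratio_to_cheapest` as an illustrative name:

```python
import pandas as pd

# Borough averages copied from the cell 10 output above
neigh_group_price = pd.Series(
    {
        "Manhattan": 196.875814,
        "Brooklyn": 124.383207,
        "Staten Island": 114.812332,
        "Queens": 99.517649,
        "Bronx": 87.496792,
    },
    name="price",
)

# Express each borough's average as a multiple of the cheapest borough
ratio_to_cheapest = (neigh_group_price / neigh_group_price.min()).round(2)
print(ratio_to_cheapest)
```

Because these are means, a few very expensive listings can pull a borough's figure upward; repeating the comparison with `.median()` on the full DataFrame would be a reasonable robustness check.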

In [11]:

# Average price by room type
room_type_price = df.groupby("room_type")["price"].mean().sort_values(ascending=False)
print(room_type_price)

# Bar chart
room_type_price.plot(kind="bar", title="Average Airbnb Price by Room Type")
plt.ylabel("Average Price ($)")
plt.show()

room_type
Entire home/apt    211.794246
Private room        89.780973
Shared room         70.127586
Name: price, dtype: float64

In cell 11, I used Matplotlib to create a bar chart showing the average Airbnb price by room type. Entire homes/apartments average about $212, more than double private rooms (about $90) and roughly three times shared rooms (about $70).

In [12]:

#--- Distribution of prices ---
plt.figure(figsize=(12,6))
df['price'].hist(bins=50)
plt.title("Distribution of Airbnb Prices")
plt.xlabel("Price ($)")
plt.ylabel("Number of Listings")
plt.show()

This chart shows the overall distribution of listing prices. Almost all listings fall well below $2,000, with a sharp concentration under $500; per df.describe(), the median price is $106 and 75% of listings cost $175 or less. The y-axis shows how lopsided the counts are: tens of thousands of listings in the lowest price ranges compared to only a handful at higher prices.

Key Insight: Most Airbnb rentals are budget-friendly, with only a small fraction priced near the upper end of the scale. This heavy skew toward lower prices emphasizes the importance of filtering out extreme outliers when analyzing trends.
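
The compressed look of the histogram follows from the bin width. A rough sketch, assuming `hist(bins=50)` splits the range between the minimum and maximum price into 50 equal-width bins (the default behavior of the underlying Matplotlib/NumPy histogram):

```python
# Price range and 75th percentile copied from the df.describe() output
price_min, price_max = 0.0, 10000.0
q3 = 175.0
n_bins = 50

# Equal-width bins between the minimum and maximum price
bin_width = (price_max - price_min) / n_bins
first_bin_upper_edge = price_min + bin_width

# If the 75th percentile sits inside the first bin, so do at least 75% of listings
q3_in_first_bin = q3 < first_bin_upper_edge

print(f"Bin width: ${bin_width:.0f}")
print(f"75th percentile inside first bin: {q3_in_first_bin}")
```

Plotting a narrower range, for example `df.loc[df['price'] < 500, 'price'].hist(bins=50)`, would spread the bulk of listings across more bars.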

In [13]:

# --- Filter used by the boxplots below ---
# Keep listings priced under $1000 so the boxes stay readable;
# .copy() makes df_clean independent of df
df_clean = df[df['price'] < 1000].copy()

In [14]:

import seaborn as sns

In [21]:

plt.figure(figsize=(10,6))
sns.boxplot(x="room_type", y="price", data=df)
plt.title("Price Distribution by Room Type (Raw Data)")
plt.show()

In cell 21, I used a box plot of the raw data to identify extreme outliers. Many listings were priced far above the typical range, which risked distorting the analysis. To address this, I applied a cutoff at $1,000 (the df_clean filter from cell 13), focusing on the majority of rentals where meaningful pricing patterns emerge. This filtering keeps the plots readable while still retaining the vast majority of listings.
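
How much data a cutoff discards is worth checking explicitly. The helper below is a hypothetical sketch (`filter_by_price` is not part of the notebook) that applies a price cutoff and reports the share of rows kept; the toy prices are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def filter_by_price(listings: pd.DataFrame, cutoff: float) -> tuple[pd.DataFrame, float]:
    """Return the rows priced strictly under `cutoff` and the share of rows kept."""
    kept = listings[listings["price"] < cutoff].copy()
    share_kept = len(kept) / len(listings)
    return kept, share_kept


# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({"price": [60, 89, 149, 225, 999, 1000, 10000]})

toy_clean, share = filter_by_price(toy, cutoff=1000)
print(f"Kept {len(toy_clean)} of {len(toy)} rows ({share:.0%})")
```

On the notebook's data, `filter_by_price(df, 1000)` would return the same rows as df_clean along with the fraction of listings retained.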

In [18]:

# Boxplot by neighborhood group

plt.figure(figsize=(10,6))
sns.boxplot(x='neighbourhood_group', y='price', data=df_clean)
plt.title("Price Distribution by Neighborhood Group (under $1000)")
plt.show()

# Boxplot by room type
plt.figure(figsize=(10,6))
sns.boxplot(x='room_type', y='price', data=df_clean)
plt.title("Price Distribution by Room Type (under $1000)")
plt.show()

Price Distributions by Location and Room Type (under $1000)¶

The filtered box plots reveal two major factors shaping Airbnb pricing in New York City:

Location: Manhattan and Brooklyn consistently show higher median prices and wider spreads than Queens, Staten Island, and the Bronx, underscoring the premium associated with popular areas.
Room Type: Entire homes/apartments command significantly higher prices than private or shared rooms, with more variability due to the presence of luxury listings.

Key Insight: Both where an Airbnb is located and what type of room it offers are critical determinants of price. Premium neighborhoods and full-property rentals not only drive higher median prices but also contribute to greater price variability, reflecting a mix of standard listings and high-end outliers.
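
The box plots look at location and room type separately. One way to view the two factors jointly, which speaks to the second key question, is a pivot table of median price per borough and room type. The sketch below uses a few made-up rows for illustration only; with the notebook's data, df_clean would take the place of `toy`.

```python
import pandas as pd

# Toy rows for illustration only; the real analysis would use df_clean
toy = pd.DataFrame(
    {
        "neighbourhood_group": [
            "Manhattan", "Manhattan", "Manhattan",
            "Brooklyn", "Brooklyn", "Brooklyn",
        ],
        "room_type": [
            "Entire home/apt", "Entire home/apt", "Private room",
            "Entire home/apt", "Private room", "Private room",
        ],
        "price": [225, 175, 150, 89, 149, 60],
    }
)

# Median price for each borough / room-type combination
median_price = toy.pivot_table(
    index="neighbourhood_group",
    columns="room_type",
    values="price",
    aggfunc="median",
)
print(median_price)
```

The median is used here instead of the mean because it is less sensitive to the high-priced outliers discussed above.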

Final Summary¶

Through cleaning and visualizing the data, I identified two major drivers of Airbnb pricing in New York: location and room type.

Manhattan and Brooklyn consistently command higher prices with greater variability than other boroughs.

Entire homes/apartments are priced significantly above private or shared rooms.

Most listings are budget-friendly, concentrated under $500, with outliers skewing the raw distribution.

By filtering extreme outliers and focusing on the bulk of listings, the analysis provides a clearer view of typical pricing trends. These findings align with market expectations: popular locations and full-property rentals command premiums, while affordability is concentrated in private and shared rooms.