Project Tutorial: Cleaning and Analyzing Used Car Listings from eBay Kleinanzeigen
Our take

Real-world data is messy. Anyone who has worked with scraped web data has seen this pattern: prices saved as text, impossible year values, columns with no variation, and numeric fields that repeat in suspiciously limited ways. Cleaning that data is where most of the real analytical work actually happens.
In this project, we'll work through a realistic data cleaning and analysis workflow using a dataset of 50,000 used car listings scraped from eBay Kleinanzeigen, the classifieds section of the German eBay site. We’ll step into the role of a data analyst helping a used car classifieds service understand how brand and mileage affect car prices.
What You'll Learn
By the end of this tutorial, you'll know how to:
- Load data with encoding issues and understand why they occur
- Clean column names from camelCase to snake_case
- Identify and drop useless or redundant columns
- Strip formatting characters from numeric strings and convert data types
- Detect and remove price outliers using judgment-based thresholds
- Filter invalid registration years
- Group and aggregate data by brand to uncover pricing trends
Before You Start
You'll need Python 3.8+, Jupyter Notebook, pandas, and NumPy. You can install them with `pip install pandas numpy` if needed.
Familiarity with Python lists, dictionaries, loops, and basic pandas operations will help you follow along. If you want to review first, the Introduction to Pandas and NumPy for Data Analysis course covers the core concepts. Access the full project in the Dataquest app and the solution notebook on GitHub.
The Dataset
The data was originally scraped from eBay Kleinanzeigen and uploaded to Kaggle. We're working with a 50,000-row sample prepared by Dataquest that simulates a less-cleaned version of the original. Each row represents a single used car listing, with columns covering the car's name, price, brand, model, mileage, registration year, fuel type, and various metadata about the listing itself.
Our goal is to clean this data and analyze pricing trends by brand.
Step 1: Loading the Data
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
autos.head()
If you try loading the file without specifying an encoding, you'll get a UnicodeDecodeError. This happens because the dataset contains non-English characters (German text and special characters) that the default UTF-8 decoder can't handle. When you see this error, encoding='Latin-1' is a common fix: Latin-1 covers the Western European character set, and because every byte value maps to some character, it never raises a decoding error.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dateCrawled 50000 non-null object
1 name 50000 non-null object
...
4 price 50000 non-null object
...
11 odometer 50000 non-null object
...
A few things stand out from info(). Most columns are object type (strings in pandas). Notably, price and odometer are both objects when they should be numeric — glancing at the actual data values immediately reveals why.
price: $5,000 odometer: 150,000km
Both columns contain non-numeric characters: dollar signs and commas in price, and "km" plus commas in odometer. Those need to be stripped before we can do any math with these columns.
Learning Insight: The info() and head() methods together are your first line of defense in any new dataset. info() tells you data types and null counts; head() shows you the actual values so you can see why those types are what they are. Together they surface many of the cleaning tasks you'll need to do.
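When the right encoding isn't known up front, a small try/except loop can probe the usual suspects instead of guessing. This is a general-purpose sketch rather than part of the tutorial's notebook; the helper name read_csv_with_fallback and the candidate list are our own.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Try each encoding in order; return (DataFrame, encoding that worked)."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding can't decode the bytes; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

Usage: `df, enc = read_csv_with_fallback("autos.csv")`. Because Latin-1 accepts any byte, it acts as a catch-all at the end of the list.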
Step 2: Renaming Columns
The original column names use camelCase (dateCrawled, vehicleType, yearOfRegistration). Python convention is snake_case, so we'll rename them all at once.
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
'last_seen']
This makes the columns easier to work with and aligns with pandas best practices. Note that we've also taken the opportunity to rename a few columns more descriptively: yearOfRegistration becomes registration_year, notRepairedDamage becomes unrepaired_damage, and nrOfPictures becomes num_photos.
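If you'd rather not hand-type every name, camelCase can be converted to snake_case mechanically with a small regex helper (camel_to_snake below is our own, not part of the notebook). Note that the manual list above does something automation can't: it also reorders words, so yearOfRegistration becomes registration_year rather than year_of_registration.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each interior uppercase letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(camel_to_snake("dateCrawled"))         # date_crawled
print(camel_to_snake("yearOfRegistration"))  # year_of_registration
```

On the real DataFrame you could apply it with `autos.columns = [camel_to_snake(c) for c in autos.columns]`, then rename the handful of columns you want to reword.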
Step 3: Dropping Useless Columns
The describe(include='all') method gives us summary statistics for every column, including string columns.
autos.describe(include='all')
| | date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
| unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
| top | 2016-03-10 15:36:24 | Ford_Fiesta | privat | Angebot | \$0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
| freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
A few things immediately stand out. The seller column has only two unique values, and 49,999 out of 50,000 are the same one. offer_type is identical for nearly every row. These columns carry essentially no information.
The num_photos column is even more clear-cut:
autos["num_photos"].value_counts()
0 50000
Name: num_photos, dtype: int64
Every single row is zero. The column is useless. We drop all three.
autos = autos.drop(["num_photos", "seller", "offer_type"], axis=1)
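The eyeballing we just did can also be automated. As a rough screen, any column where a single value covers nearly every row carries almost no information. near_constant_columns below is a hypothetical helper, shown on a toy frame rather than the real data; the 0.99 threshold is a judgment call.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    """Return columns whose most frequent value covers >= threshold of rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts descending, so iloc[0] is the top value's share
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

toy = pd.DataFrame({
    "seller": ["privat"] * 99 + ["gewerblich"],  # 99% one value
    "price": list(range(100)),                   # all distinct
})
print(near_constant_columns(toy))  # ['seller']
```

Running the same helper on autos would surface seller, offer_type, and num_photos, matching the manual inspection above.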
Step 4: Cleaning the Price and Odometer Columns
Now for the core cleaning work. Both price and odometer are stored as strings with formatting characters that prevent numeric conversion.
autos["price"] = (autos["price"]
.str.replace("$", "")
.str.replace(",", "")
.astype(int)
)
autos["price"].head()
0 5000
1 8500
2 8990
3 4350
4 1350
Name: price, dtype: int64
This is method chaining: three operations applied one after another in a single readable block. We strip the dollar sign and the comma (used as a thousands separator in these values), then convert to integer. Passing regex=False makes the characters literal, which matters for "$" because older pandas versions treat the pattern as a regular expression by default, where "$" is an anchor. The same approach works for odometer.
autos["odometer"] = (autos["odometer"]
.str.replace("km", "")
.str.replace(",", "")
.astype(int)
)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
We also rename the column to odometer_km so it's clear the unit is kilometers, since we're removing "km" from the values themselves.
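Since the price and odometer chains differ only in which characters they strip, the pattern can be factored into one small reusable function. strip_to_int is our own name for this sketch, not something from the notebook:

```python
import pandas as pd

def strip_to_int(series, junk=("$", ",", "km")):
    """Remove literal formatting substrings from a string Series, then cast to int."""
    for ch in junk:
        series = series.str.replace(ch, "", regex=False)
    return series.astype(int)

print(strip_to_int(pd.Series(["$5,000", "$8,500"])).tolist())       # [5000, 8500]
print(strip_to_int(pd.Series(["150,000km", "90,000km"])).tolist())  # [150000, 90000]
```

With this helper, the two cleaning steps collapse to `autos["price"] = strip_to_int(autos["price"])` and likewise for the odometer column.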
Step 5: Exploring and Cleaning Price
Before filtering, let's understand what we're working with.
print(autos["price"].unique().shape)
print(autos["price"].describe())
autos["price"].value_counts().sort_index(ascending=False).head(20)
(2357,)
count 5.000000e+04
mean 9.840044e+03
std 4.811044e+05
min 0.000000e+00
25% 1.100000e+03
50% 2.950000e+03
75% 7.200000e+03
max 1.000000e+08
Name: price, dtype: float64
The minimum price is $0 and the maximum is $100,000,000. Both are suspicious. Looking at the actual high-end values:
99999999 1
27322222 1
12345678 3
These prices are far beyond what even the most expensive vehicles sell for. Looking at the low end:
0 1421
1 156
There are 1,421 cars listed at $0. While that's a small fraction of 50,000 rows, $0 doesn't represent a real sale price. We'll keep listings priced from $1 to $350,000; above that point, prices jump from plausible high-end values to a small set of obvious outliers.
autos = autos[autos["price"].between(1, 351000)]
autos["price"].describe()
count 48565.000000
mean 5888.935591
min 1.000000
max 350000.000000
We retain 48,565 rows — a small reduction that removes noise without discarding meaningful data.
Learning Insight: Deciding where to draw the outlier threshold is one of the judgment calls in data cleaning. There's often no single correct answer. We chose $350,000 because that's where prices in the data jump from gradual increases to extreme values. Document your reasoning wherever you make these calls, especially if you're sharing your analysis with a stakeholder.
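One concrete way to find that jump is to look at high quantiles and the top of the sorted values. The toy series below stands in for the real price column; on the real data you'd run the same lines against autos["price"]:

```python
import pandas as pd

prices = pd.Series([100, 500, 1200, 3000, 7500, 12000,
                    99000, 350000, 12345678, 99999999])

# High quantiles show how quickly the tail explodes past plausible values.
print(prices.quantile([0.50, 0.90, 0.99]))

# Sorted descending, the gap between the plausible maximum and the junk is visible.
print(prices.sort_values(ascending=False).head(4).tolist())
# [99999999, 12345678, 350000, 99000]
```

Here the ratio between consecutive top values (over 35x between 12,345,678 and 350,000) is the kind of discontinuity that justifies a cutoff.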
Step 6: Exploring Odometer Values
autos["odometer_km"].value_counts()
150000 32424
125000 5170
100000 2169
...
Name: odometer_km, dtype: int64
Interestingly, there are only 13 distinct values, despite this being a continuous numeric variable. This happens because eBay's listing interface likely had a dropdown selector rather than a free text field, so sellers chose from pre-set mileage bands. That's not an error — it just means we have less granularity than we might expect. The 13 distinct values are clean and interpretable.
Step 7: Cleaning Registration Year
autos["registration_year"].describe()
min 1000.000000
max 9999.000000
Year 1000 predates the automobile by about 900 years. Year 9999 is 8,000 years into the future. We need to filter to a realistic range.
(~autos["registration_year"].between(1900, 2016)).sum() / autos.shape[0]
0.038793
Only about 4% of rows fall outside our valid range of 1900 to 2016 (the dataset's collection year). That's small enough to remove without concern.
autos = autos[autos["registration_year"].between(1900, 2016)]
autos["registration_year"].value_counts(normalize=True).head(10)
2000 0.067608
2005 0.062895
1999 0.062060
The distribution looks exactly as expected for a used car market: most vehicles were registered in the past 20 years.
Step 8: Analyzing Price by Brand
With the data cleaned, we can now compare the most common car brands in the listings and see how their average prices differ.
autos["brand"].value_counts(normalize=True)
volkswagen 0.211264
bmw 0.110045
opel 0.107581
mercedes_benz 0.096463
audi 0.086566
ford 0.069900
...
Volkswagen dominates with about 21% of listings. German manufacturers (Volkswagen, BMW, Opel, Mercedes-Benz, Audi) account for roughly half of all listings combined.
Many brands in the tail have less than 1% representation. For a meaningful price analysis, we'll focus on brands with more than 5% of total listings.
brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
Now we calculate mean price for each of these brands.
brand_mean_prices = {}
for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_price = brand_only["price"].mean()
    brand_mean_prices[brand] = int(mean_price)
brand_mean_prices
{'volkswagen': 5402,
'bmw': 8332,
'opel': 2975,
'mercedes_benz': 8628,
'audi': 9336,
'ford': 3749}
A clear pattern emerges: Audi, BMW, and Mercedes-Benz cluster in the $8,000–$9,300 range. Ford and Opel sit in the $3,000–$3,700 range. Volkswagen lands in between at $5,400.
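For reference, the explicit loop above can be collapsed into a single groupby. The toy frame below just demonstrates the equivalence; on the real data, autos.groupby("brand")["price"].mean() produces the same figures as the dictionary approach:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["audi", "audi", "ford", "ford"],
    "price": [10000, 8000, 4000, 3000],
})

# One line replaces the filter-and-average loop.
mean_prices = toy.groupby("brand")["price"].mean()
print(mean_prices)
# brand
# audi    9000.0
# ford    3500.0
```

The loop version is worth knowing because it makes every step explicit, but groupby is the idiomatic pandas way to aggregate per group.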
Step 9: Comparing Mileage Across Brands
Does a higher average price correspond to higher mileage? Let's find out.
brand_mean_mileage = {}
for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_mileage = brand_only["odometer_km"].mean()
    brand_mean_mileage[brand] = int(mean_mileage)

mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)
brand_info = pd.DataFrame(mean_mileage, columns=['mean_mileage'])
brand_info["mean_price"] = mean_prices
brand_info
mean_mileage mean_price
bmw 132572 8332
mercedes_benz 130788 8628
opel 129310 2975
audi 129157 9336
volkswagen 128707 5402
ford 124266 3749
Mileage across brands falls within a fairly narrow range (roughly 124,000 to 133,000 km), while prices vary by more than 3x between the cheapest (Opel at ~$3,000) and the most expensive (Audi at ~$9,300). BMW and Mercedes-Benz have the highest mileage and also command high prices, suggesting buyers pay a premium for these brands regardless of wear.
Learning Insight: When comparing multiple metrics across groups, putting them side by side in a single DataFrame makes patterns much easier to spot than reading two separate outputs. Here, the combination reveals that mileage isn't the primary driver of the price gap between German luxury brands and budget options.
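The same side-by-side table can also be built in one step with groupby plus named aggregation, skipping the intermediate dictionaries and Series. Again a toy frame for illustration; the real-data equivalent would swap in autos and its columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel"],
    "odometer_km": [150000, 125000, 150000, 100000],
    "price": [9000, 7000, 3000, 2500],
})

# Named aggregation: each keyword becomes an output column.
brand_info = toy.groupby("brand").agg(
    mean_mileage=("odometer_km", "mean"),
    mean_price=("price", "mean"),
).sort_values("mean_price", ascending=False)
print(brand_info)
```

Named aggregation requires pandas 0.25 or newer; on older versions you'd call .agg with a dict and rename afterward.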
Summary of Findings
After cleaning the dataset from 50,000 rows down to roughly 46,700 valid listings, the key takeaways are:
The used car market on eBay Kleinanzeigen is dominated by German brands, with Volkswagen alone accounting for 21% of listings. Among common brands, there's a distinct price gap: Audi, BMW, and Mercedes-Benz average $8,000–$9,300 per vehicle, while Ford and Opel average $3,000–$3,700. Volkswagen sits in between, which may help explain its popularity as a middle-market option.
Average mileage doesn't vary much across brands (all within 10% of each other), which means mileage alone doesn't explain the price differences. Brand prestige and the typical market segment these manufacturers occupy are stronger drivers.
Next Steps
There's a lot more this dataset can support:
Analyze model-level pricing. We looked at brand, but the model column lets you go deeper. Does the price vary significantly between a BMW 3-series and a BMW 7-series? Are certain Volkswagen models clustered at higher or lower price points?
Segment by mileage bands. The odometer already gives us 13 clean buckets. Do average prices drop consistently as mileage increases, or is the relationship brand-dependent?
Explore unrepaired damage. The unrepaired_damage column tells us whether a car has damage that hasn't been fixed. Are damaged cars priced significantly lower? What fraction of each brand's listings have damage?
Clean and use the date columns. We skipped date_crawled, ad_created, and last_seen for this walkthrough. Analyzing how long listings stayed active before being removed could reveal which brands or price ranges sell fastest.
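As a starting point for that last idea: in this dataset the date columns are strings like "2016-03-10 15:36:24", so the first ten characters are the date portion and pd.to_datetime can take it from there. A minimal sketch on toy values:

```python
import pandas as pd

crawled = pd.Series(["2016-03-10 15:36:24", "2016-04-07 06:17:27"])

# Slice off the time portion, then parse the remaining YYYY-MM-DD strings.
dates = pd.to_datetime(crawled.str[:10])
print(dates.dt.month.tolist())  # [3, 4]
```

From there, dates.dt gives you year, month, and day accessors for grouping, and subtracting parsed last_seen from ad_created dates yields listing lifetimes.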
Resources
- Project in the Dataquest app
- Solution notebook on GitHub
- Introduction to Pandas and NumPy for Data Analysis
- Dataquest Community Forum
Share your extended analysis in the community and tag @Anna_strahl. There are several interesting directions to take this project, and seeing what patterns others find in the data is always worthwhile.