Project Tutorial: Cleaning and Analyzing Used Car Listings from eBay Kleinanzeigen
Our take

Real-world data is messy. Anyone who has worked with scraped web data has seen this pattern: prices saved as text, impossible year values, columns with no variation, and numeric fields that repeat in suspiciously limited ways. Cleaning that data is where most of the real analytical work actually happens.
In this project, we'll work through a realistic data cleaning and analysis workflow using a dataset of 50,000 used car listings scraped from eBay Kleinanzeigen, the classifieds section of the German eBay site. We’ll step into the role of a data analyst helping a used car classifieds service understand how brand and mileage affect car prices.
What You'll Learn
By the end of this tutorial, you'll know how to:
- Load data with encoding issues and understand why they occur
- Clean column names from camelCase to snake_case
- Identify and drop useless or redundant columns
- Strip formatting characters from numeric strings and convert data types
- Detect and remove price outliers using judgment-based thresholds
- Filter invalid registration years
- Group and aggregate data by brand to uncover pricing trends
Before You Start
You'll need Python 3.8+, Jupyter Notebook, pandas, and NumPy. You can install them with `pip install pandas numpy` if needed.
Familiarity with Python lists, dictionaries, loops, and basic pandas operations will help you follow along. If you want to review first, the Introduction to Pandas and NumPy for Data Analysis course covers the core concepts. Access the full project in the Dataquest app and the solution notebook on GitHub.
The Dataset
The data was originally scraped from eBay Kleinanzeigen and uploaded to Kaggle. We're working with a 50,000-row sample prepared by Dataquest that simulates a less-cleaned version of the original. Each row represents a single used car listing, with columns covering the car's name, price, brand, model, mileage, registration year, fuel type, and various metadata about the listing itself.
Our goal is to clean this data and analyze pricing trends by brand.
Step 1: Loading the Data
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
autos.head()
If you try loading the file without specifying an encoding, you'll get a UnicodeDecodeError. This happens because the dataset contains non-English characters (German text and special characters) that the default UTF-8 decoder can't handle. When you see this error, encoding='Latin-1' is a common fix: Latin-1 covers the Western European character set, and because every byte value maps to some character, it never raises a decoding error.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dateCrawled 50000 non-null object
1 name 50000 non-null object
...
4 price 50000 non-null object
...
11 odometer 50000 non-null object
...
A few things stand out from info(). Most columns are object type (strings in pandas). Notably, price and odometer are both objects when they should be numeric — glancing at the actual data values immediately reveals why.
price: $5,000 odometer: 150,000km
Both columns contain non-numeric characters: dollar signs and commas in price, and "km" plus commas in odometer. Those need to be stripped before we can do any math with these columns.
Learning Insight: The info() and head() methods together are your first line of defense in any new dataset. info() tells you data types and null counts; head() shows you the actual values so you can see why those types are what they are. Together they surface many of the cleaning tasks you'll need to do.
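When the right encoding isn't known up front, a small try/except loop can probe the usual suspects instead of guessing. This is a general-purpose sketch rather than part of the tutorial's notebook; the helper name read_csv_with_fallback and the candidate list are our own.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Try each encoding in order; return (DataFrame, encoding that worked)."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding can't decode the bytes; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

Usage: `df, enc = read_csv_with_fallback("autos.csv")`. Because Latin-1 accepts any byte, it acts as a catch-all at the end of the list.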
Step 2: Renaming Columns
The original column names use camelCase (dateCrawled, vehicleType, yearOfRegistration). Python convention is snake_case, so we'll rename them all at once.
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
'last_seen']
This makes the columns easier to work with and aligns with pandas best practices. Note that we've also taken the opportunity to rename a few columns more descriptively: yearOfRegistration becomes registration_year, notRepairedDamage becomes unrepaired_damage, and nrOfPictures becomes num_photos.
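If you'd rather not hand-type every name, camelCase can be converted to snake_case mechanically with a small regex helper (camel_to_snake below is our own, not part of the notebook). Note that the manual list above does something automation can't: it also reorders words, so yearOfRegistration becomes registration_year rather than year_of_registration.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each interior uppercase letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(camel_to_snake("dateCrawled"))         # date_crawled
print(camel_to_snake("yearOfRegistration"))  # year_of_registration
```

On the real DataFrame you could apply it with `autos.columns = [camel_to_snake(c) for c in autos.columns]`, then rename the handful of columns you want to reword.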
Step 3: Dropping Useless Columns
The describe(include='all') method gives us summary statistics for every column, including string columns.
autos.describe(include='all')
| | date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
| unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
| top | 2016-03-10 15:36:24 | Ford_Fiesta | privat | Angebot | \$0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
| freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
A few things immediately stand out. The seller column has only two unique values, and 49,999 out of 50,000 are the same one. offer_type is identical for nearly every row. These columns carry essentially no information.
The num_photos column is even more clear-cut:
autos["num_photos"].value_counts()
0 50000
Name: num_photos, dtype: int64
Every single row is zero. The column is useless. We drop all three.
autos = autos.drop(["num_photos", "seller", "offer_type"], axis=1)
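The eyeballing we just did can also be automated. As a rough screen, any column where a single value covers nearly every row carries almost no information. near_constant_columns below is a hypothetical helper, shown on a toy frame rather than the real data; the 0.99 threshold is a judgment call.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    """Return columns whose most frequent value covers >= threshold of rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts descending, so iloc[0] is the top value's share
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

toy = pd.DataFrame({
    "seller": ["privat"] * 99 + ["gewerblich"],  # 99% one value
    "price": list(range(100)),                   # all distinct
})
print(near_constant_columns(toy))  # ['seller']
```

Running the same helper on autos would surface seller, offer_type, and num_photos, matching the manual inspection above.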
Step 4: Cleaning the Price and Odometer Columns
Now for the core cleaning work. Both price and odometer are stored as strings with formatting characters that prevent numeric conversion.
autos["price"] = (autos["price"]
.str.replace("$", "")
.str.replace(",", "")
.astype(int)
)
autos["price"].head()
0 5000
1 8500
2 8990
3 4350
4 1350
Name: price, dtype: int64
This is method chaining: three operations applied one after another in a single readable block. We strip the dollar sign and the comma (used as a thousands separator in these values), then convert to integer. Passing regex=False makes the characters literal, which matters for "$" because older pandas versions treat the pattern as a regular expression by default, where "$" is an anchor. The same approach works for odometer.
autos["odometer"] = (autos["odometer"]
.str.replace("km", "")
.str.replace(",", "")
.astype(int)
)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
We also rename the column to odometer_km so it's clear the unit is kilometers, since we're removing "km" from the values themselves.
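Since the price and odometer chains differ only in which characters they strip, the pattern can be factored into one small reusable function. strip_to_int is our own name for this sketch, not something from the notebook:

```python
import pandas as pd

def strip_to_int(series, junk=("$", ",", "km")):
    """Remove literal formatting substrings from a string Series, then cast to int."""
    for ch in junk:
        series = series.str.replace(ch, "", regex=False)
    return series.astype(int)

print(strip_to_int(pd.Series(["$5,000", "$8,500"])).tolist())       # [5000, 8500]
print(strip_to_int(pd.Series(["150,000km", "90,000km"])).tolist())  # [150000, 90000]
```

With this helper, the two cleaning steps collapse to `autos["price"] = strip_to_int(autos["price"])` and likewise for the odometer column.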
Step 5: Exploring and Cleaning Price
Before filtering, let's understand what we're working with.
print(autos["price"].unique().shape)
print(autos["price"].describe())
autos["price"].value_counts().sort_index(ascending=False).head(20)
(2357,)
count 5.000000e+04
mean 9.840044e+03
std 4.811044e+05
min 0.000000e+00
25% 1.100000e+03
50% 2.950000e+03
75% 7.200000e+03
max 1.000000e+08
Name: price, dtype: float64
The minimum price is $0 and the maximum is $100,000,000. Both are suspicious. Looking at the actual high-end values:
99999999 1
27322222 1
12345678 3
These prices are far beyond what even the most expensive vehicles sell for. Looking at the low end:
0 1421
1 156
There are 1,421 cars listed at $0. While that's a small fraction of 50,000 rows, $0 doesn't represent a real sale price. We'll keep listings priced from $1 to $350,000; above that point, prices jump from plausible high-end values to a small set of obvious outliers.
autos = autos[autos["price"].between(1, 351000)]
autos["price"].describe()
count 48565.000000
mean 5888.935591
min 1.000000
max 350000.000000
We retain 48,565 rows — a small reduction that removes noise without discarding meaningful data.
Learning Insight: Deciding where to draw the outlier threshold is one of the judgment calls in data cleaning. There's often no single correct answer. We chose $350,000 because that's where prices in the data jump from gradual increases to extreme values. Document your reasoning wherever you make these calls, especially if you're sharing your analysis with a stakeholder.
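One concrete way to find that jump is to look at high quantiles and the top of the sorted values. The toy series below stands in for the real price column; on the real data you'd run the same lines against autos["price"]:

```python
import pandas as pd

prices = pd.Series([100, 500, 1200, 3000, 7500, 12000,
                    99000, 350000, 12345678, 99999999])

# High quantiles show how quickly the tail explodes past plausible values.
print(prices.quantile([0.50, 0.90, 0.99]))

# Sorted descending, the gap between the plausible maximum and the junk is visible.
print(prices.sort_values(ascending=False).head(4).tolist())
# [99999999, 12345678, 350000, 99000]
```

Here the ratio between consecutive top values (over 35x between 12,345,678 and 350,000) is the kind of discontinuity that justifies a cutoff.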
Step 6: Exploring Odometer Values
autos["odometer_km"].value_counts()
150000 32424
125000 5170
100000 2169
...
Name: odometer_km, dtype: int64
Interestingly, there are only 13 distinct values, despite this being a continuous numeric variable. This happens because eBay's listing interface likely had a dropdown selector rather than a free text field, so sellers chose from pre-set mileage bands. That's not an error — it just means we have less granularity than we might expect. The 13 distinct values are clean and interpretable.
Step 7: Cleaning Registration Year
autos["registration_year"].describe()
min 1000.000000
max 9999.000000
Year 1000 predates the automobile by about 900 years. Year 9999 is 8,000 years into the future. We need to filter to a realistic range.
(~autos["registration_year"].between(1900, 2016)).sum() / autos.shape[0]
0.038793
Only about 4% of rows fall outside our valid range of 1900 to 2016 (the dataset's collection year). That's small enough to remove without concern.
autos = autos[autos["registration_year"].between(1900, 2016)]
autos["registration_year"].value_counts(normalize=True).head(10)
2000 0.067608
2005 0.062895
1999 0.062060
The distribution looks exactly as expected for a used car market: most vehicles were registered in the past 20 years.
Step 8: Analyzing Price by Brand
With the data cleaned, we can now compare the most common car brands in the listings and see how their average prices differ.
autos["brand"].value_counts(normalize=True)
volkswagen 0.211264
bmw 0.110045
opel 0.107581
mercedes_benz 0.096463
audi 0.086566
ford 0.069900
...
Volkswagen dominates with about 21% of listings. German manufacturers (Volkswagen, BMW, Opel, Mercedes-Benz, Audi) account for roughly half of all listings combined.
Many brands in the tail have less than 1% representation. For a meaningful price analysis, we'll focus on brands with more than 5% of total listings.
brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
Now we calculate mean price for each of these brands.
brand_mean_prices = {}
for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_price = brand_only["price"].mean()
    brand_mean_prices[brand] = int(mean_price)
brand_mean_prices
{'volkswagen': 5402,
'bmw': 8332,
'opel': 2975,
'mercedes_benz': 8628,
'audi': 9336,
'ford': 3749}
A clear pattern emerges: Audi, BMW, and Mercedes-Benz cluster in the $8,000–$9,300 range. Ford and Opel sit in the $3,000–$3,700 range. Volkswagen lands in between at $5,400.
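For reference, the explicit loop above can be collapsed into a single groupby. The toy frame below just demonstrates the equivalence; on the real data, autos.groupby("brand")["price"].mean() produces the same figures as the dictionary approach:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["audi", "audi", "ford", "ford"],
    "price": [10000, 8000, 4000, 3000],
})

# One line replaces the filter-and-average loop.
mean_prices = toy.groupby("brand")["price"].mean()
print(mean_prices)
# brand
# audi    9000.0
# ford    3500.0
```

The loop version is worth knowing because it makes every step explicit, but groupby is the idiomatic pandas way to aggregate per group.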
Step 9: Comparing Mileage Across Brands
Does a higher average price correspond to higher mileage? Let's find out.
brand_mean_mileage = {}
for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_mileage = brand_only["odometer_km"].mean()
    brand_mean_mileage[brand] = int(mean_mileage)

mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)
brand_info = pd.DataFrame(mean_mileage, columns=['mean_mileage'])
brand_info["mean_price"] = mean_prices
brand_info
mean_mileage mean_price
bmw 132572 8332
mercedes_benz 130788 8628
opel 129310 2975
audi 129157 9336
volkswagen 128707 5402
ford 124266 3749
Mileage across brands falls within a fairly narrow range (roughly 124,000 to 133,000 km), while prices vary by more than 3x between the cheapest (Opel at ~$3,000) and the most expensive (Audi at ~$9,300). BMW and Mercedes-Benz have the highest mileage and also command high prices, suggesting buyers pay a premium for these brands regardless of wear.
Learning Insight: When comparing multiple metrics across groups, putting them side by side in a single DataFrame makes patterns much easier to spot than reading two separate outputs. Here, the combination reveals that mileage isn't the primary driver of the price gap between German luxury brands and budget options.
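The same side-by-side table can also be built in one step with groupby plus named aggregation, skipping the intermediate dictionaries and Series. Again a toy frame for illustration; the real-data equivalent would swap in autos and its columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel"],
    "odometer_km": [150000, 125000, 150000, 100000],
    "price": [9000, 7000, 3000, 2500],
})

# Named aggregation: each keyword becomes an output column.
brand_info = toy.groupby("brand").agg(
    mean_mileage=("odometer_km", "mean"),
    mean_price=("price", "mean"),
).sort_values("mean_price", ascending=False)
print(brand_info)
```

Named aggregation requires pandas 0.25 or newer; on older versions you'd call .agg with a dict and rename afterward.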
Summary of Findings
After cleaning the dataset from 50,000 rows down to roughly 46,700 valid listings, the key takeaways are:
The used car market on eBay Kleinanzeigen is dominated by German brands, with Volkswagen alone accounting for 21% of listings. Among common brands, there's a distinct price gap: Audi, BMW, and Mercedes-Benz average $8,000–$9,300 per vehicle, while Ford and Opel average $3,000–$3,700. Volkswagen sits in between, which may help explain its popularity as a middle-market option.
Average mileage doesn't vary much across brands (all within 10% of each other), which means mileage alone doesn't explain the price differences. Brand prestige and the typical market segment these manufacturers occupy are stronger drivers.
Next Steps
There's a lot more this dataset can support:
Analyze model-level pricing. We looked at brand, but the model column lets you go deeper. Does the price vary significantly between a BMW 3-series and a BMW 7-series? Are certain Volkswagen models clustered at higher or lower price points?
Segment by mileage bands. The odometer already gives us 13 clean buckets. Do average prices drop consistently as mileage increases, or is the relationship brand-dependent?
Explore unrepaired damage. The unrepaired_damage column tells us whether a car has damage that hasn't been fixed. Are damaged cars priced significantly lower? What fraction of each brand's listings have damage?
Clean and use the date columns. We skipped date_crawled, ad_created, and last_seen for this walkthrough. Analyzing how long listings stayed active before being removed could reveal which brands or price ranges sell fastest.
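As a starting point for that last idea: in this dataset the date columns are strings like "2016-03-10 15:36:24", so the first ten characters are the date portion and pd.to_datetime can take it from there. A minimal sketch on toy values:

```python
import pandas as pd

crawled = pd.Series(["2016-03-10 15:36:24", "2016-04-07 06:17:27"])

# Slice off the time portion, then parse the remaining YYYY-MM-DD strings.
dates = pd.to_datetime(crawled.str[:10])
print(dates.dt.month.tolist())  # [3, 4]
```

From there, dates.dt gives you year, month, and day accessors for grouping, and subtracting parsed last_seen from ad_created dates yields listing lifetimes.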
Resources
- Project in the Dataquest app
- Solution notebook on GitHub
- Introduction to Pandas and NumPy for Data Analysis
- Dataquest Community Forum
Share your extended analysis in the community and tag @Anna_strahl. There are several interesting directions to take this project, and seeing what patterns others find in the data is always worthwhile.