leo.blog();

Data science

Data science is the practice of extracting useful, actionable information from structured and unstructured data.

Exploratory Data Analysis (EDA)

When you get a dataset, it’s a set of rows and columns. If it’s a supervised learning task, there are labels as well. But before you jump into modeling, you should familiarize yourself with the data.

Oftentimes, an hour spent looking at the data is more useful than an hour spent tweaking the model. After all, it’s garbage in, garbage out, so what you feed the model should be as clean as possible.
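To make "looking at the data" a bit more concrete, here is a minimal sketch of a first pass with pandas. The tiny DataFrame, its column names, and the choice of checks are all made up for illustration; a real dataset will call for different questions.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are invented for this example.
df = pd.DataFrame(
    {
        "age": [34, 51, None, 29, 51],
        "city": ["Berlin", "Paris", "Paris", None, "Paris"],
        "label": [0, 1, 1, 0, 1],
    }
)

# How big is the dataset, and what type did pandas infer for each column?
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
print(df.dtypes)

# How many values are missing in each column?
missing = df.isna().sum()
print(missing)

# How many rows are exact copies of an earlier row?
n_duplicates = int(df.duplicated().sum())
print(f"{n_duplicates} duplicated rows")

# For a supervised task: how balanced are the labels?
label_counts = df["label"].value_counts()
print(label_counts)

# Basic summary statistics for the numeric columns.
print(df.describe())
```

Even on a toy example like this, the missing values and the duplicated row show up right away, and those are exactly the kind of things that are cheaper to find before modeling than after.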

ydata-profiling

This is both a Python library and a command-line tool. The Python library can analyse Pandas dataframes, and the command-line tool can analyse CSV files.

The tool is fairly slow, and the generated reports make your browser use a lot of RAM, but the analysis is very good and helpful.

You can run it on a CSV like this:

uv run --python cpython-3.12.10-linux-x86_64-gnu --with ydata-profiling --with setuptools -- ydata_profiling data.csv report.html

This will read data.csv and output a report.html.

For some reason I needed to run it with Python 3.12; it didn’t work with Python 3.14. This will probably be fixed in the future.
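The library side can produce the same kind of report from Python. Below is a small sketch of how that might look, assuming pandas and ydata-profiling are installed. The helper name profile_csv and its parameters are my own for this example, not part of the library. ProfileReport and to_file are the library's calls, and minimal=True turns off some of the more expensive computations, which may help with the slowness.

```python
def profile_csv(csv_path: str, html_path: str, minimal: bool = False) -> None:
    """Read a CSV file and write a ydata-profiling HTML report.

    Hypothetical helper for illustration; the name and signature are not
    part of ydata-profiling itself.
    """
    # Imported inside the function because both packages are heavy to import.
    import pandas as pd
    from ydata_profiling import ProfileReport

    # Load the CSV into a DataFrame, which is what the library analyses.
    df = pd.read_csv(csv_path)

    # Build the report; minimal=True skips some expensive computations.
    report = ProfileReport(df, title=f"Profile of {csv_path}", minimal=minimal)

    # Write a standalone HTML file that can be opened in a browser.
    report.to_file(html_path)
```

Calling profile_csv("data.csv", "report.html") should then give roughly the same result as the command above, with the advantage that you can filter or clean the DataFrame before profiling it.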

EDA link dump
