How data cleaning works

Posted: Tue Feb 18, 2025 5:20 am
by rifat28dddd
spelling errors - a word is written incorrectly, for example "Sanktpeterburg" instead of "Sankt Petersburg";
synonymy - the same meaning is written differently in different places - for example, "nurse" and "nursing sister";
anomalous values - the information in the attribute cannot be real - for example, the age of a living person is given as 271 years, or the date as March 34;
word reversal - the words of a value appear in a different order in different places - for example, "building material" and "material, building";
nesting of values - one feature contains several values - say, the city "Perm, Penza".
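A minimal sketch of how some of the error types listed above might be flagged automatically, using only the Python standard library. The record fields (city, age), the reference list KNOWN_CITIES, the similarity cutoff and the 0-120 age range are all assumptions made for illustration, not something the post prescribes.

```python
import difflib

# Hypothetical reference list of valid city names (an assumption for this sketch).
KNOWN_CITIES = ["Sankt Petersburg", "Perm", "Penza"]


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    city = record.get("city", "")
    age = record.get("age")

    # Nesting of values: several cities packed into one field.
    if "," in city:
        problems.append("nested values in city: " + city)
    elif city not in KNOWN_CITIES:
        # Spelling errors: look for a close match in the reference list.
        # The 0.8 cutoff is an arbitrary choice for this example.
        match = difflib.get_close_matches(city, KNOWN_CITIES, n=1, cutoff=0.8)
        if match:
            problems.append("possible misspelling of " + match[0] + ": " + city)
        else:
            problems.append("unknown city: " + city)

    # Anomalous values: an age that cannot be real (the range is an assumption).
    if age is not None and not 0 <= age <= 120:
        problems.append("anomalous age: " + str(age))

    return problems
```

Calling check_record({"city": "Sanktpeterburg", "age": 271}) reports both a possible misspelling and an anomalous age. Synonyms and reversed word order are harder to catch this way and usually need a hand-made mapping of equivalent values.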
Data recorded by devices also contains noise: interference such as rustling on an audio track or stripes in a video. And if the information was collected from different sources, the same kind of value may be stored in different formats: in one place the date is written as April 7, and in another as 07.04.
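The date-format problem can be sketched like this: every known format is tried in turn and the result is reduced to one common form. The function name normalize_day and the list DATE_FORMATS are hypothetical, and the sketch assumes English month names (the %B directive depends on the locale).

```python
from datetime import datetime

# Formats taken from the example above; extend the list as needed (assumption).
DATE_FORMATS = ["%B %d", "%d.%m"]


def normalize_day(text):
    """Return (month, day) for a date string without a year, or None if invalid."""
    for fmt in DATE_FORMATS:
        try:
            # A fixed leap year is appended so that February 29 parses and
            # strptime never has to guess the year.
            parsed = datetime.strptime(text.strip() + " 2000", fmt + " %Y")
        except ValueError:
            continue  # this format did not fit, try the next one
        return (parsed.month, parsed.day)
    return None
```

With this sketch, normalize_day("April 7") and normalize_day("07.04") both give (4, 7), while an impossible date such as "March 34" gives None and can be flagged as an anomalous value.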

If errors remain in the sample, the model may interpret the data incorrectly and produce wrong answers later. For example, it may treat "Sanktpeterburg" as a separate city, unrelated to Sankt Petersburg. Or it may learn that March has 34 days.

Samples used for analytics or model training are huge. Removing "garbage" from hundreds of thousands of values manually is difficult, and sometimes impossible, so the process is usually automated.

Let's talk about what "cleaning data" is from a technical point of view. There are three main approaches to cleaning.