How data cleaning works
Posted: Tue Feb 18, 2025 5:20 am
Raw data usually contains several kinds of errors:
spelling errors - a word is written incorrectly, for example "Sanktpeterburg" instead of "Sankt Petersburg";
synonymy - the same value is named differently in different records, for example "nurse" and "nurse sister";
anomalous values - the value of an attribute cannot be real, for example a living person's age given as 271 years or a date given as March 34;
word reversal - the words of a value appear in a different order in different places, for example "building material" and "material, building";
nesting of values - one feature holds several values at once, say the city "Perm, Penza".
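Several of the error types listed above can be caught with very small checks. The following is only a rough sketch in Python, not a method described in this post: the function names, the KNOWN_VARIANTS table and the age limit of 130 are all assumptions made for illustration.

```python
import calendar

# Hypothetical table of known misspellings; a real one depends on the domain.
KNOWN_VARIANTS = {"sanktpeterburg": "Sankt Petersburg"}


def fix_spelling(value):
    """Replace a known misspelling with its canonical form."""
    cleaned = value.strip()
    return KNOWN_VARIANTS.get(cleaned.lower(), cleaned)


def is_plausible_age(age, max_age=130):
    """Flag anomalous ages; max_age is an assumed upper bound."""
    return 0 <= age <= max_age


def is_real_date(month, day, year=2025):
    """Check that the day exists in the given month of the given year."""
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of the first day, number of days in month)
    return 1 <= day <= calendar.monthrange(year, month)[1]


def word_order_key(value):
    """Build a key that ignores word order, case and commas."""
    words = value.lower().replace(",", " ").split()
    return " ".join(sorted(words))


def split_nested(value, sep=","):
    """Split one cell that holds several values into a list."""
    return [part.strip() for part in value.split(sep) if part.strip()]
```

With these helpers, an age of 271 and the date March 34 are both rejected, "building material" and "material, building" get the same key, and "Perm, Penza" becomes two separate values. Synonyms such as "nurse" and "nurse sister" are not covered here; they generally need a hand-made mapping, much like the spelling table.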
Data that records device readings also contains noise - interference such as rustling on an audio track or stripes in a video. And if the information was collected from different sources, the formats may not match: in one place the date is written as April 7, and in another as 07.04.
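The mismatch between date formats can be handled by trying each known format in turn and writing the result in one common form. This is a minimal sketch under stated assumptions: the KNOWN_FORMATS list, the normalize_day name and the default year are invented for the example, and month names are parsed as English, which depends on the default locale.

```python
from datetime import datetime

# Assumed input formats: "April 7" and "07.04" (day.month).
KNOWN_FORMATS = ("%B %d", "%d.%m")


def normalize_day(text, year=2025):
    """Return the date as "MM-DD", or None if no known format matches.

    The sources give no year, so an assumed one is appended before parsing.
    """
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(f"{text.strip()} {year}", f"{fmt} %Y")
        except ValueError:
            continue  # wrong format or impossible date, try the next one
        return parsed.strftime("%m-%d")
    return None
```

Both "April 7" and "07.04" come out as "04-07", while an impossible date such as "March 34" returns None and can be sent for manual review instead of being silently kept.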
If errors remain in the sample, the model may interpret the data incorrectly and later produce wrong answers. For example, it may treat "Sanktpeterburg" as a separate city, unrelated to Sankt Petersburg. Or it may learn that March has 34 days.
Datasets for analytics or model training are huge. Removing "garbage" from hundreds of thousands of values by hand is difficult and sometimes impossible, so the process is usually automated.
Let's look at what "cleaning data" means from a technical point of view. There are three main approaches to cleaning.