Data Preprocessing (Machine Learning)

Vatsal Sharma
3 min read · Jun 15, 2021

Data Preprocessing is considered one of the most important steps in making a Machine Learning model function properly.

We can easily get tons of Data in the form of various Datasets, but making that data fit for deriving insights requires a lot of observation, modification, manipulation and numerous other steps.

What is it?

When we freshly download a Dataset for our project or some other work, the Data it contains is messy (most of the time), i.e. not arranged or not filled in the way we need it to be.

Sometimes, it might have, among other things:

  • NULL Values
  • Unnecessary Features
  • Datatypes not in a proper format

So, to treat all these shortcomings, we go through a process which is popularly known as “Data Preprocessing”.
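A quick way to spot these shortcomings is to count the NULL values and look at the datatype of each column. The snippet below is only a minimal sketch on a made-up DataFrame; the column names (date, price, row_id) are assumptions for illustration, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
df = pd.DataFrame({
    "date": ["15/06/2021", "16/06/2021", None],  # contains a NULL value
    "price": ["100", "250", "175"],              # numbers stored as text
    "row_id": [1, 2, 3],                         # probably an unnecessary feature
})

# Number of NULL values in each column.
null_counts = df.isna().sum()
print(null_counts)

# Datatype of each column; text columns show up as object/string.
print(df.dtypes)
```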

Applications

Data Preprocessing, in one way or another, is used in almost every Machine Learning problem, so it has a very wide range of applications.

How do we do Preprocessing?

There are numerous ways to preprocess the data. Which ones we use depends upon our need.

Example 1- If we have NULL values in our Dataset.

  • We can simply drop the rows that contain NULL values if there aren’t many of them & if dropping them won’t affect our dataset.
  • We can also treat NULL values by replacing them with the Mean, Median or Mode of that column. It depends on our need (for a text column, only the Mode applies).
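Both options above can be sketched with dropna() and fillna(). This is a minimal example on made-up data; the columns age and city are assumptions for illustration, and whether to use the mean, median or mode still depends on the actual dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
df = pd.DataFrame({
    "age": [25.0, None, 31.0, 40.0],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# Option 1: drop every row that contains at least one NULL value.
dropped = df.dropna()

# Option 2: keep all rows and fill the NULLs with a per-column statistic.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())      # numeric column: median (or mean)
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # text column: most frequent value
```

Dropping is the simpler choice, but here it throws away half of the rows, which is why filling is often preferred on small datasets.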

Example 2- If we do not have date & time in the correct format.

We can use pd.to_datetime() for this.

It is advised to declare a function to avoid typing the same code for other columns again and again.
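Such a helper function could look roughly like the sketch below. The function name to_datetime_column, the column names and the day/month/year format are all assumptions made for this example; adjust the format string to whatever the real data uses.

```python
import pandas as pd

# Hypothetical toy data; dates are stored as day/month/year text.
df = pd.DataFrame({
    "order_date": ["15/06/2021", "01/07/2021"],
    "ship_date": ["18/06/2021", "not a date"],
})

def to_datetime_column(frame, column, fmt=None):
    """Convert one column of `frame` to datetime in place.

    Values that cannot be parsed become NaT instead of raising an error.
    """
    frame[column] = pd.to_datetime(frame[column], format=fmt, errors="coerce")

# Reuse the same helper for every date column.
for col in ["order_date", "ship_date"]:
    to_datetime_column(df, col, fmt="%d/%m/%Y")
```

Using errors="coerce" is a design choice: bad values turn into NaT, so they can be handled later like any other NULL value.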

Example 3- If we want to replace useless string values in a column.

We can use replace() for this. After replacing them with proper numeric values, our dataset will be more useful.
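As a rough sketch, suppose a column stores counts as text labels. The column name stops and its labels are assumptions for illustration. Here the labels are first replaced with digit strings and then cast to integers, which is one way (not the only one) to end up with a numeric column.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
df = pd.DataFrame({"stops": ["non-stop", "1 stop", "2 stops", "1 stop"]})

# Map every text label to the digit it stands for.
label_to_digit = {"non-stop": "0", "1 stop": "1", "2 stops": "2"}

# Replace the labels, then convert the column to an integer type.
df["stops"] = df["stops"].replace(label_to_digit).astype(int)
```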

Example 4- If we have a useless column in our dataset.

Here, we can simply use drop() with the required parameters, i.e. the column name together with axis=1 (or the columns= argument).
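A minimal sketch of dropping a column, assuming a made-up DataFrame in which the notes column carries no useful information:

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [10.5, 20.0, 7.25],
    "notes": ["n/a", "n/a", "n/a"],  # same value everywhere, so not useful
})

# drop() returns a new DataFrame without the listed columns.
# Equivalent spelling: df.drop("notes", axis=1)
df = df.drop(columns=["notes"])
```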

Example 5- If we want to convert a String type (text) column to Numeric type. This is done so that ML algorithms can be properly applied to our Dataset.

We first build a mini dataset of 0/1 indicator columns, one for each unique value of the text column (here, ‘Source’). Now, we can concatenate this mini dataset to our actual dataset & drop its ‘Source’ column. This is also referred to as One Hot Encoding.
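The steps above can be sketched as follows. This assumes pd.get_dummies() is used to build the mini dataset, which is a common choice but not necessarily the original one, and the values in the ‘Source’ and ‘Price’ columns are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the 'Source' column name comes from the text.
df = pd.DataFrame({
    "Source": ["Delhi", "Kolkata", "Delhi", "Mumbai"],
    "Price": [100, 200, 150, 300],
})

# Build the mini dataset: one 0/1 column per unique value of 'Source'.
source_dummies = pd.get_dummies(df["Source"], prefix="Source", dtype=int)

# Concatenate it to the actual dataset and drop the original text column.
df = pd.concat([df, source_dummies], axis=1).drop(columns=["Source"])
```

After this, every row has exactly one of the Source_ columns set to 1, which is what makes the encoding "one hot".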

These were some examples of Data Preprocessing in Machine Learning. We can preprocess our data in many other ways too, as per our need.

Happy Learning Folks!

Follow me on LinkedIn- https://www.linkedin.com/in/imvat18/
