In this article, we'll take a hands-on route of collecting data ourselves to learn the steps of data preprocessing, which forms the base for any NLP model. We'll use web scraping to gather data from a website and store it in a CSV file. So, it'll be a mixture of web scraping and data preprocessing.

7 Steps of Data Preprocessing

  1. Gathering the data
  2. Importing the data and libraries
  3. Dividing the dataset into dependent & independent variables
  4. Checking for missing values
  5. Checking for categorical values
  6. Splitting the dataset into training and test sets
  7. Feature Scaling

1.) Gathering the data

We are going to collect data through web scraping. We could also have downloaded a ready-made CSV file and skipped straight to the second step.

Procedure

Generally, web scraping is split into two parts:

  • Fetching data by making an HTTP request
  • Extracting important data by parsing the HTML DOM
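In code, this fetch-then-parse pattern looks roughly like the minimal sketch below (the URL is a placeholder; the real scraper later in the article follows the same two steps):

import requests
from bs4 import BeautifulSoup

# Step 1: fetch the raw HTML with an HTTP GET request
html = requests.get('https://example.com').text

# Step 2: parse the HTML DOM so we can extract the parts we need
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)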

Libraries & Tools

  • Beautiful Soup is a Python library for pulling data out of HTML and XML files.
  • Requests allows you to send HTTP requests very easily.
  • Pandas provides fast, flexible, and expressive data structures.

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup, requests & pandas. To create the folder and install the libraries, type the commands given below. I'm assuming you already have Python 3.x installed.

mkdir scraper
pip install beautifulsoup4
pip install requests
pip install pandas

Now, create a file inside that folder with any name you like. I'm using dataprocess.py. Then just import Beautiful Soup, requests, and pandas like below.

from bs4 import BeautifulSoup
import requests
import pandas as pd

Preparing the Food

We are going to scrape the table from this website and then store the data in a CSV file using Pandas.

r = requests.get('https://milindjagre.co/2018/03/10/post-3-ml-data-preprocessing-part-1/').text
soup = BeautifulSoup(r, 'html.parser')
u = list()
l = {}

We are going to scrape this table using BeautifulSoup:

table = soup.find("table", {"class": "js-csv-data csv-data js-file-line-container"})
tr = table.find_all("tr", {"class": "js-file-line"})

We'll run a for loop to reach each and every "td" tag inside the rows.

for i in range(0, len(tr)):
    # each row keeps its cells in "td" tags; td[0] holds the line number
    # on this page, so the actual data starts at td[1]
    td = tr[i].find_all("td")
    try:
        l["Country"] = td[1].text
    except IndexError:
        l["Country"] = None
    try:
        l["Age"] = td[2].text
    except IndexError:
        l["Age"] = None
    try:
        l["Salary"] = td[3].text
    except IndexError:
        l["Salary"] = None
    try:
        l["Purchased"] = td[4].text
    except IndexError:
        l["Purchased"] = None
    u.append(l)
    l = {}

From here, we have to use pandas to create a data frame from the list above so that we can save the data into a CSV file.

df = pd.json_normalize(u)
df.to_csv('data.csv', index=False, encoding='utf-8')

This saves the data to a CSV file, and with that our data gathering process is finished.
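As an optional sanity check, you can read the file back and inspect the first rows:

check = pd.read_csv('data.csv')
print(check.head())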

2.) Importing the data and libraries

Libraries are tools that you can use to do a specific job. They make programming much simpler: you only have to provide the input, and the library responds with the result you're expecting.

Three essential libraries we are going to use in data preprocessing:

  • Numpy is the fundamental package for array computing with Python.
  • Matplotlib.pyplot is a plotting package for Python.
  • Pandas provides powerful data structures for data analysis and statistics. It's also very helpful for importing data.
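Our scraper file already imports pandas, so we only need to add numpy (we'll need np.nan later for imputation) and, optionally, matplotlib:

import numpy as np                 # array computing; needed later for np.nan
import matplotlib.pyplot as plt    # plotting, if you want to visualise the data
import pandas as pd                # data import and manipulation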

Now, importing the data is extremely simple; we'll use pandas for it. We are going to import the data.csv file which we created while gathering the data.

datasets = pd.read_csv('data.csv')

3.) Dependent & Independent Variables

Now, we need to distinguish the matrix of features from the dependent variable vector. So we are going to create the matrix of features first; it is made of the three independent variables.

X = datasets.iloc[:, :-1].values

Here, the ':' on the left of the comma means we take all the rows into consideration, and ':-1' on the right means we take all the columns except the last one, which is Purchased.

Now, we'll make the dependent variable vector.

y = datasets.iloc[:, 3].values

The 3 on the right of the comma means the fourth column of the table, which is the last one, Purchased.
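You can verify the shapes of both (the output here is illustrative; the exact numbers depend on the scraped table):

print(X.shape)  # e.g. (10, 3) -- all rows, the Country/Age/Salary columns
print(y.shape)  # e.g. (10,)   -- the Purchased column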

4.) Missing Values

As you can see, there are two missing values in the table: one in the Age column and the other in the Salary column.

To solve this problem, we could simply remove the complete row. But that would be very dangerous if the row contains crucial information.
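For completeness, dropping incomplete rows would be a one-liner in pandas, though we won't take that route here:

# the option we are NOT taking: discard every row that has a missing value
datasets_without_nan = datasets.dropna()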

Another idea is to take the mean of the column. We'll use scikit-learn for this task; its SimpleImputer class will do the work for us.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
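As a quick check (illustrative), no NaN should remain in the Age and Salary columns afterwards:

print(np.isnan(X[:, 1:3].astype(float)).any())  # expected: False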

5.) Categorical Variables

As you can see, we have two categorical variables: Country and Purchased. Country has three categories and Purchased has two.

Machine learning models are based on mathematical equations, so keeping text in those equations would create problems. Therefore, we have to encode these variables.

from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

In the first line, as usual, we create an object of the class LabelEncoder.

In the second line, we use the fit_transform method to fit the label encoder and return the encoded variable.

But something is wrong with this encoding. In the resulting matrix, France is denoted by 0, Germany by 1, and Spain by 2. This is a situation where our machine learning model will think that Spain is greater than Germany and Germany is greater than France, when in reality you can't compare these three countries that way. To solve this problem, we are going to use dummy variables.

Dummy Variable

We'll split the Country column into three columns, since it has three categories. We are going to use ColumnTransformer and OneHotEncoder to do the splitting; OneHotEncoder will create a separate column for each category.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
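To see the result (the values shown are illustrative), the first three columns are now the one-hot country flags, followed by Age and Salary, which remainder='passthrough' keeps untouched:

print(X[0])  # e.g. [1.0 0.0 0.0 44.0 72000.0]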

Dependent Variable

Now, we'll encode the dependent variable y.

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

Be relieved: we won't need to use OneHotEncoder here, just a label encoder. Since this is the dependent variable, the machine learning model will know that it's a category and that there's no order between the two values.
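You can inspect the mapping if you're curious; LabelEncoder assigns the integers in alphabetical order, so 'No' becomes 0 and 'Yes' becomes 1 here:

print(labelencoder_y.classes_)  # e.g. ['No' 'Yes']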

6.) Splitting the dataset into training and test sets

The question is: why do we need to do this? Well, take a step back and focus on the words "machine learning" themselves. There is a machine that is going to learn something. In our case, there's an algorithm that is going to learn from your data to make predictions or reach some other machine learning goal. We don't want our algorithm to learn the data by heart; otherwise, our ML model would just reproduce what it memorised instead of generalising to new datasets.

On the training set we build the machine learning model, and on the test set we check its performance. You should also keep one thing in mind: the performance on the test set shouldn't differ much from the performance on the training set.

We are going to use train_test_split to split our dataset. As you can see, the names in this library are quite intuitive.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

test_size is a float between 0 and 1 and represents the proportion of the dataset to include in the test split, while random_state controls the shuffling applied to the data before the split.

X_train is the training set of variable X

X_test is the test set of variable X

y_train is the training set of variable y

y_test is the test set of variable y
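Checking the shapes is a good habit (illustrative output, assuming 10 rows and the 5 columns we have after one-hot encoding):

print(X_train.shape, X_test.shape)  # e.g. (8, 5) (2, 5)
print(y_train.shape, y_test.shape)  # e.g. (8,) (2,)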

7.) Feature Scaling

Age and Salary values aren't in the same range, and this can cause issues in our ML model. The issue arises because most ML models are based on Euclidean distance (think back to high school geometry). Since the Salary column has a much wider range of values, from 0 to 100k, the Euclidean distance will be dominated by Salary compared to the Age column.

We are going to use StandardScaler to rescale the values so that we end up with a solid ML model. StandardScaler will standardise every value. There are mainly two ways to do this kind of rescaling: standardisation and normalisation.
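For reference, here is what the two formulas do, side by side, with made-up ages just to illustrate:

ages = np.array([27.0, 38.0, 44.0, 50.0])  # made-up example values

# standardisation (what StandardScaler does): zero mean, unit variance
standardised = (ages - ages.mean()) / ages.std()

# normalisation (what MinMaxScaler would do): rescale into the [0, 1] range
normalised = (ages - ages.min()) / (ages.max() - ages.min())

print(standardised)
print(normalised)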

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

Conclusion

In this article, we saw how to scrape data using Python & BeautifulSoup and then perform data preprocessing using several important machine learning libraries.
