Cleaning data with pandas

Cleaning with a vacuum
Photo by The Creative Exchange on Unsplash.

Data is rarely clean enough for analysis. A huge part of data analysis involves cleaning the data because real-life data is “dirty.” Below are some useful techniques that can be used to get rid of the dirt in our data.

Eliminating Extra Characters

While analyzing data, it is important to pay close attention to whether certain variables contain extra characters.

An extra character can change the nature of our dataset. For example, it can turn a numeric value into a string.

While 1000 is a numeric value, $1000 would be interpreted by most programming languages and software as a string, making it impossible…
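As a minimal sketch of this idea in pandas, the snippet below strips a currency symbol and thousands separators from a column and converts the result to numbers. The DataFrame, the column name `price`, and the sample values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical data: prices stored as text because of "$" and ","
df = pd.DataFrame({"price": ["$1000", "$2,500", "$300"]})

# Remove the extra characters, then convert the cleaned text to numbers
cleaned = df["price"].str.replace(r"[$,]", "", regex=True)
df["price_clean"] = pd.to_numeric(cleaned)

print(df.dtypes)
print(df)
```

Keeping the cleaned values in a new column leaves the raw text available for checking. `pd.to_numeric` raises an error on values it cannot parse unless `errors="coerce"` is passed, which makes leftover dirt easy to spot.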


1. Anscombe’s Quartet

Anscombe’s quartet was constructed by the statistician Francis Anscombe in 1973. It comprises four datasets that have nearly identical summary statistics, yet appear very different when graphed.

Anscombe’s Quartet Dataset

In each of the four datasets above, the mean of the X values is approximately 9 and the mean of the Y values is approximately 7.5. Likewise, the standard deviation of the X values is approximately 3.16 and that of the Y values is approximately 1.94 …
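These figures can be checked with a short script. The sketch below types in the quartet's values and uses Python's standard `statistics` module. It assumes the population standard deviation (dividing by n), since that is what reproduces 3.16 and 1.94; the sample standard deviation would give about 3.32 and 2.03 instead.

```python
from statistics import mean, pstdev

# Anscombe's quartet: datasets I-III share the same X values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Collect the mean and population standard deviation of X and Y per dataset
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "sd_x": pstdev(xs),
        "sd_y": pstdev(ys),
    }
    print(name, {k: round(v, 2) for k, v in summary[name].items()})
```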


Source: dev.to

SQL (Structured Query Language) is a language used for extracting and organizing data stored in a relational database.

SQL commands fall into five categories:

  • DDL (Data Definition Language).
  • DML (Data Manipulation Language).
  • DQL (Data Query Language).
  • DCL (Data Control Language).
  • TCL (Transaction Control Language).
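To make the categories concrete, here is a small sketch that runs one statement from most of them against an in-memory SQLite database through Python's built-in `sqlite3` module. The table `shoes` and its rows are hypothetical. SQLite has no user accounts, so DCL statements such as GRANT and REVOKE appear only as a comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure of a table
cur.execute("CREATE TABLE shoes (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# DML: add rows to the table
cur.execute("INSERT INTO shoes (name, price) VALUES (?, ?)", ("sneaker", 59.99))
cur.execute("INSERT INTO shoes (name, price) VALUES (?, ?)", ("boot", 120.0))

# TCL: make the inserts permanent
conn.commit()

# DML followed by TCL: delete every row, then undo the deletion
cur.execute("DELETE FROM shoes")
conn.rollback()

# DQL: read the data back
rows = cur.execute("SELECT name, price FROM shoes ORDER BY price").fetchall()
print(rows)

# DCL (GRANT, REVOKE) manages user permissions and is not supported by SQLite
conn.close()
```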


Concept:

Linear Regression is a supervised Machine Learning algorithm used to predict a continuous quantity that varies linearly with its inputs. For instance, suppose Mrs Johnson wants to predict the price of a pair of shoes. Several factors can affect the price: the material the shoes are made of, where they are made, their color, and current fashion. Linear Regression can help with this prediction.

If the data is linear, then a line of best fit can be drawn through it. Linear Regression follows from the equation of a line from basic algebra, given by: y=mx+c…
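As an illustration of fitting y=mx+c, the sketch below estimates the slope m and intercept c with ordinary least squares, one common way to obtain a line of best fit. The data points are made up for the example, and the formula shown is the standard closed-form solution for a single input variable.

```python
# Hypothetical data that roughly follows a straight line
x = [1, 2, 3, 4, 5]
y = [2.9, 5.1, 7.0, 9.2, 10.8]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Slope: co-variation of x and y divided by the variation of x
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
m = s_xy / s_xx

# Intercept: the fitted line passes through the point of means
c = y_mean - m * x_mean

def predict(new_x):
    """Return the value of the fitted line y = m*x + c at new_x."""
    return m * new_x + c

print(f"m = {m:.2f}, c = {c:.2f}, prediction at x = 6: {predict(6):.2f}")
```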

Margaret Awojide

Statistician | Mathematician | Data Scientist
