Data is rarely clean enough for analysis. A huge part of data analysis involves cleaning the data because real-life data is “dirty.” Below are some useful techniques that can be used to get rid of the dirt in our data.
While analyzing data, it is important to pay close attention to whether certain variables contain extra characters.
An extra character can change the data type of a variable. For example, it could turn a numeric value into a string.
1,000 is a numeric value, whereas $1000 would be interpreted by most programming languages and software as a string, making it impossible…
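As a rough illustration of stripping extra characters, here is a minimal Python sketch. The helper name to_number and the sample values are hypothetical, and the sketch assumes that "." is the decimal separator and that commas are only thousands separators; data that uses other conventions would need different rules.

```python
import re

def to_number(raw):
    """Remove everything except digits, '.', and '-', then parse as a float.

    Assumes '.' is the decimal separator and ',' only groups thousands.
    """
    cleaned = re.sub(r"[^0-9.\-]", "", raw)
    return float(cleaned)

# Hypothetical raw values, as they might appear in a messy price column.
prices = ["$1000", "$2,500.75", "USD 45"]
cleaned_prices = [to_number(p) for p in prices]
print(cleaned_prices)  # [1000.0, 2500.75, 45.0]
```

Once the values are numeric, operations such as sums and averages work as expected.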
Anscombe's quartet was developed by the statistician Francis Anscombe in 1973. It comprises four datasets that have nearly identical summary statistics, yet appear very different when graphed.
When we compute the mean of the X values in each of the datasets above, it is approximately 9, and the mean of the Y values is approximately 7.5. Likewise, the standard deviation of the X values is approximately 3.16 and the standard deviation of the Y values is approximately 1.94 …
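These summary statistics can be checked with a short Python sketch. It uses the commonly published values of the quartet and assumes the population standard deviation (statistics.pstdev), which is the formula that yields roughly 3.16 and 1.94; the sample formula gives slightly larger numbers.

```python
from statistics import mean, pstdev

# Commonly published values of Anscombe's quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I, II, III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean and population standard deviation of X and Y for each dataset.
summary = {
    name: (mean(xs), mean(ys), pstdev(xs), pstdev(ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, sx, sy) in summary.items():
    print(f"{name}: mean x={mx:.2f}, mean y={my:.2f}, sd x={sx:.2f}, sd y={sy:.2f}")
```

The four rows print almost the same numbers, which is why plotting the data matters before trusting summary statistics alone.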
SQL (Structured Query Language) is a language used for extracting and organizing data stored in a relational database.
There are five commands in SQL, and they are:
Linear regression is a supervised machine learning algorithm used to predict a continuous quantity whose relationship with the input variables is approximately linear. For instance, suppose Mrs Johnson wants to predict the price of a pair of shoes. Several factors can affect the price: the material used, where the shoes are made, the color, and the style. Linear regression can help with this prediction.
If the data is linear, a line of best fit can be drawn through it. Linear regression follows from the equation of a line from basic algebra: y=mx+c…
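To make the line of best fit concrete, here is a minimal Python sketch that estimates m and c in y=mx+c using ordinary least squares for a single input variable. The function name fit_line and the toy data are hypothetical, chosen so that the points lie exactly on y=2x+1; real data would scatter around the line.

```python
def fit_line(xs, ys):
    """Return the least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    c = mean_y - m * mean_x
    return m, c

# Toy data lying exactly on y = 2x + 1 (illustrative only).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, c = fit_line(xs, ys)
print(m, c)  # 2.0 1.0

# Predict y for a new x using the fitted line.
predicted = m * 5 + c
```

With m and c in hand, predicting a new value is just a matter of plugging x into the equation.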
Statistician | Mathematician | Data Scientist