Chapter 1:

A little background:

  • Data jargon - “Data Science”, “Data Analytics”, “Machine Learning”, “Python”, “R”...
  • Where to Start?

A little about me:

  • I am Rakesh!
  • Coder and manager, working in the field of analytics
  • I like solving analytical problems, hiking, Xbox, photography, teaching, etc.
  • My website: www.rakeshgopal.com
  • Some feedback from my previous courses.

What is the BI Stack?

  • Databases and querying concepts

(SQL Engineer, SQL/DB Developer, BI Developer)

  • Transferring and manipulating data from different sources

(ETL Developer / BI Developer)

  • Visualizing the data (i.e., extracting information out of data)

(Report / Data Analyst / BI Developer)

  • Predicting and analyzing patterns in the data

(Data Scientist / ML Scientist / ML Engineer)

About this course:

  • Keep it simple!
  • Provide enough information for people to start creating reports.
  • Incorporate feedback from previous courses.
  • I will be uploading this document.
  • It includes reading materials.
  • Ask questions!

Reading and resources:



Chapter 2 - What is Machine Learning?

https://en.wikipedia.org/wiki/Machine_learning

  • The inputs need to be relevant to your algorithm
  • You need to choose the correct algorithm for optimal results.
  • There are 2 types of algorithms:
    • Supervised Learning: Inputs and outputs are supplied to the algorithm to study patterns.

Example: Email Spam Classification.

You can feed the following inputs and output to the supervised learning algorithm and let it study them. The first five columns are the inputs; Spam is the output.

Email_Message | From             | To               | SpellingErrors | No_of_links | Spam
Some text     | [email protected] | [email protected] | 25             | 4           | Yes
Some text     | [email protected] | [email protected] | 0              | 0           | No
Some text     | [email protected] | [email protected] | 2              | 4           | Yes

  • Unsupervised Learning: Only inputs are supplied to the algorithm; there are no outputs. The algorithm studies the inputs and either groups them into clusters or tells us how the inputs are associated with one another.
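
The supervised spam example above can be sketched in a few lines of code. This is only an illustration: it uses just the two numeric columns from the table (SpellingErrors and No_of_links), encodes Spam as 1 for Yes and 0 for No, and picks a nearest-neighbour classifier as an assumed algorithm. The new email at the end is a made-up input.

```python
from sklearn.neighbors import KNeighborsClassifier

# Inputs taken from the table above: [SpellingErrors, No_of_links]
X = [[25, 4], [0, 0], [2, 4]]
# Known outputs, encoded as numbers: 1 = spam (Yes), 0 = not spam (No)
y = [1, 0, 1]

# Assumed choice of algorithm, used here only to illustrate the idea
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)  # the algorithm studies the inputs together with the outputs

# Hypothetical new email: 20 spelling errors and 3 links
new_email = [[20, 3]]
prediction = model.predict(new_email)
print(prediction)
```

With only three rows this is a toy, but the flow is the same for real data: supply inputs and known outputs, then ask for a prediction on a new input.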

Further Readings:




Chapter 3 - Getting Started with Python and Scikit-Learn

Goal: To learn about the tools needed for this course and how to set them up.

  • So far, we have learnt about machine learning and about supervised and unsupervised learning.
  • Tool and environment setup.
  • We will use Python.
  • What is Python? “It is a programming language.”
  • What is Scikit-learn? Scikit-learn is a package (library) for Python that helps perform machine learning tasks and input data manipulation.
  • Environment setup:
    • Install the Anaconda distribution of Python
    • Install Scikit-learn
    • And we are ready to write some code.

Further Reading:

  • What is Python?

https://en.wikipedia.org/wiki/Python_(programming_language)

https://www.python.org/

  • What is Scikit-learn?

http://scikit-learn.org/stable/

  • Scikit-Learn Tutorials & Cheat Sheets:

http://scikit-learn.org/stable/tutorial/basic/tutorial.html

http://know.anaconda.com/rs/387-XNW-688/images/2017-08_Anaconda_Starter_Guide_CheatSheet_Web.pdf?content=button2&mkt_tok=eyJpIjoiTW1Sa05UVmtaVEZqTnpjMiIsInQiOiJoZGx6ODRGY2hlNFkyZE9YbEdIUURFYWJMN2FHZm83ajBnMTlDdFlUOHloNmo0OGxGTTJPXC9NRms1enJRaUpvQ2w4Mkw2VE9SaXFtcE9hTE11TWpzZEFvanZWSnFtU3V0NlVhbGlmOUcwbFRyc1hJTWh5VXVrU21rTHN0Vjl1dDAifQ%3D%3D

To learn Python:

Chapter 4 - Types of Machine learning algorithms

Goal: How machine learning works

  • Let’s take a very simple dataset:

Your problem statement = “Use Machine Learning to predict profit per sale”

  • Your inputs are also called features.



Sales ($) (X) - features/inputs | Profit ($) (y) - outputs/predictions
30                              | 3
40                              | 4
50                              | 5
70                              | 7

Now, if you supply this to a machine learning algorithm, it will learn the relationship between X and y and output:

y = 0.1*X => This output is called a machine learning model.

Think about it:

Sales ($) (x) | Profit ($) (y)
30            | 3 (y = 0.1*x → 0.1*30 = 3)
40            | 4 (y = 0.1*x → 0.1*40 = 4)
50            | 5 (y = 0.1*x → 0.1*50 = 5)
70            | 7 (y = 0.1*x → 0.1*70 = 7)

For example, if sales were 100, can you predict the profit? Sure you can.

y = 0.1*x → 0.1*100 = 10, hence the predicted profit = 10
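
The same idea can be tried in code. The sketch below is an illustration only: it assumes scikit-learn's LinearRegression as the algorithm (the notes do not say which one produced y = 0.1*X) and feeds it the four rows from the table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features/inputs (Sales) as a 2-D array: one row per observation, one column per feature
X = np.array([[30], [40], [50], [70]])
# Outputs (Profit) as a 1-D array
y = np.array([3, 4, 5, 7])

model = LinearRegression()
model.fit(X, y)  # the algorithm studies the relationship between X and y

slope = model.coef_[0]  # the multiplier the model learned for Sales
predicted_profit = model.predict(np.array([[100]]))[0]  # prediction for sales = 100

print(round(slope, 2), round(predicted_profit, 2))
```

Because the data is perfectly linear, the learned multiplier comes out at approximately 0.1 and the prediction for sales of 100 at approximately 10, matching the hand calculation.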

So what are the most common algorithms?

  1. Supervised (you supply the inputs and outputs to the algorithm and let it study the categories/patterns, so that when a similar input arrives in the future, it can classify or predict the output based on what it studied before)
    • Linear Regression
    • Logistic Regression
    • Classification
    • Decision Trees
    • KNN
    • Naive Bayes

  2. Unsupervised (the dataset has only inputs, no outputs; the algorithm studies the inputs and forms associations, clusters or groups among them)
    • Clustering algorithms:
      • Centroid-based algorithms
      • Connectivity-based algorithms
      • Density-based algorithms
      • Probabilistic
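
To make the unsupervised case concrete, here is a minimal sketch of a centroid-based clustering algorithm. K-Means is an assumed choice (the notes only name the family), and the six points are made-up data: note that no outputs are supplied, only inputs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up inputs: three points near (1, 1) and three points near (9, 9)
X = np.array([[1, 1], [1, 2], [2, 1], [9, 9], [9, 10], [10, 9]])

# Ask for 2 clusters; n_init and random_state are set only to make the run repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # no y is passed: the algorithm groups the inputs itself

print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.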

Further Readings:

  1. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
  2. https://en.wikipedia.org/wiki/Linear_regression
  3. https://en.wikipedia.org/wiki/Logistic_regression
  4. https://en.wikipedia.org/wiki/Statistical_classification
  5. https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
  6. https://en.wikipedia.org/wiki/Naive_Bayes_classifier
  7. https://en.wikipedia.org/wiki/Decision_tree

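Chapter 5: Playing around with Anaconda and Jupyter

  • Anaconda Navigator
  • Project paths and documentation
  • Jupyter web-based notebook
  • Press H (in command mode) to see the list of other keyboard shortcuts.
  • print(‘Hello World’) - note that Python is case sensitive, so it must be print, not PRINT.
  • Markup languages (Markdown cells in the notebook).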

Additional Reading:

https://jupyter.readthedocs.io/en/latest/

https://docs.python.org/3/tutorial/modules.html

http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

Chapter 6: Playing with some code.

Iris Data-set: https://en.wikipedia.org/wiki/Iris_flower_data_set

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

Input terminologies:

                    | Features (columns)                                       | Target (values or predictions)
                    | Sepal length | Sepal width | Petal length | Petal width | Species
Observations (rows) | 5.1          | 3.5         | 1.4          | 0.2         | I. setosa
                    | 4.9          | 3.0         | 1.4          | 0.2         | I. setosa
                    | 4.7          | 3.2         | 1.3          | 0.2         | I. setosa
                    | 4.6          | 3.1         | 1.5          | 0.2         | I. setosa

Thus the data set will have:

  • Your input data (Sepal length, Sepal width, Petal length and Petal width), which you supply to the algorithm. The column names are also called feature names.
  • Your output data (Species), which you need the algorithm to output or predict. The possible values are also called target names. In the above example, I. setosa, I. versicolor and I. virginica are the target names.

You need to follow just 2 simple steps to create a machine learning model in Python:

  1. Pass your input (data) and your output (targets) as separate objects (NumPy arrays). NumPy is a Python library (included with Anaconda) that packages data as arrays for faster processing.
  2. Make sure your inputs and outputs are numerical.

iris.data (let's call this X [uppercase])                | iris.target (let’s call this y [lowercase])
Sepal length | Sepal width | Petal length | Petal width | Species
5.1          | 3.5         | 1.4          | 0.2         | I. setosa
4.9          | 3.0         | 1.4          | 0.2         | I. setosa
4.7          | 3.2         | 1.3          | 0.2         | I. setosa
4.6          | 3.1         | 1.5          | 0.2         | I. setosa
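
A quick way to see these rows in code is sketched below. It relies on scikit-learn's bundled copy of the Iris data set (load_iris, linked in the references); the variable names are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Column names of the inputs (feature names)
print(iris.feature_names)

# First four observations (rows) of the inputs
print(iris.data[:4])

# The matching outputs are stored as numbers...
print(iris.target[:4])

# ...and each number maps to a species name (target names)
first_species = [iris.target_names[i] for i in iris.target[:4]]
print(first_species)
```

Notice that the targets are already numerical, which is what step 2 above asks for.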

Next video =>

  1. We will use an algorithm to study these patterns (input that we provided) - KNN



Additional Reading and References:

  1. https://en.wikipedia.org/wiki/Iris_flower_data_set
  2. http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
  3. Take a guess at why we used an uppercase X and a lowercase y. Submit your answers in the forums.
  4. Deep dive on numpy - https://www.youtube.com/watch?v=gtejJ3RCddE

Chapter 7 : “Fitting a Machine Learning model - KNN algorithm”

Rewind:

  1. We saw how to read data in Python.
  2. Fitting a model (passing input to an algorithm) comprises 2 main steps:
    1. Pass your input (data) and your output (targets) as separate objects (NumPy arrays). NumPy is a Python library (included with Anaconda) that packages data as arrays for faster processing.
    2. Make sure your inputs and outputs are numerical.

Let’s go ahead and fit a model for this problem. By looking at the outputs (a fixed set of species), you can tell that this is a classification problem.

  • Bunch is a special data type in Scikit-learn used to store its data sets.
  • A Bunch has many attributes:
    • data (stores all the features)
    • target (stores the target values to predict)
    • feature_names
    • target_names
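
The attributes listed above can be inspected directly. This is a small sketch using load_iris from scikit-learn; it only prints what is already stored in the Bunch.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The object returned by load_iris is a Bunch
print(type(iris))

# Names of the input columns and of the possible outputs
print(iris.feature_names)
print(iris.target_names)

# data and target are stored as NumPy arrays
print(type(iris.data), type(iris.target))
```

Attribute names are lowercase in code (iris.data, iris.target, iris.feature_names, iris.target_names).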

iris.data (let's call this X [uppercase])                                                                     | iris.target (let’s call this y [lowercase])
Sepal length (feature_name) | Sepal width (feature_name) | Petal length (feature_name) | Petal width (feature_name) | Species
5.1                         | 3.5                        | 1.4                         | 0.2                        | I. setosa (target_name)
4.9                         | 3.0                        | 1.4                         | 0.2                        | I. setosa (target_name)
4.7                         | 3.2                        | 1.3                         | 0.2                        | I. setosa (target_name)
4.6                         | 3.1                        | 1.5                         | 0.2                        | I. setosa (target_name)



Let’s make sure we satisfy rules 1 and 2:

  1. Pass your input (data) and your output (targets) as separate objects (NumPy arrays).
  2. Make sure your inputs and outputs are numerical.



Now we have the data. Let’s go ahead and store it in X and y.

X because it’s a matrix, y because it’s a vector.


X = iris.data

y = iris.target
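
Put together as a runnable sketch, the two assignments look like this, followed by a check of rules 1 and 2. The checks are an illustrative addition, not part of the lecture code.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data    # inputs (features)
y = iris.target  # outputs (targets)

# Rule 1: inputs and outputs are separate NumPy arrays
print(isinstance(X, np.ndarray), isinstance(y, np.ndarray))

# Rule 2: inputs and outputs are numerical
print(np.issubdtype(X.dtype, np.number), np.issubdtype(y.dtype, np.number))

# X has two dimensions (a matrix), y has one (a vector)
print(X.ndim, y.ndim)
```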

Further readings:

  1. How do you determine the shape of the ndarray for your data and target? What is the shape of X and y? Submit your answers in the forums.
  2. Read in detail about the K-NN algorithm: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

Chapter 8 : “Fitting a Machine Learning model - 2 (KNN algorithm)”

Rewind:

  • We saw how to define X (uppercase) and y (lowercase) in the previous video.

This lecture:

  • In this lecture, we will see how to fit a model.
  • 3 steps:
    • Import the class
    • Instantiate
    • Fit the model

  • Let’s look at K-NearestNeighbor (KNN) Classifier with some code.

Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
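
The three steps (import the class, instantiate, fit the model) can be sketched as follows. This is a minimal version assuming X and y are taken from load_iris as in the previous chapter; the two new observations are the example inputs used in the tables of this chapter.

```python
# Step 1: import the class
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Step 2: instantiate (here with K = 1 neighbour)
knn = KNeighborsClassifier(n_neighbors=1)

# Step 3: fit the model with the inputs and outputs
knn.fit(X, y)

# Predict the species for new observations:
# [sepal length, sepal width, petal length, petal width]
new_observations = [[2, 4, 3, 1], [4, 6, 5, 3]]
prediction = knn.predict(new_observations)

print(prediction)                     # numeric labels
print(iris.target_names[prediction])  # the same labels as species names
```

To try another K value, change n_neighbors and repeat steps 2 and 3.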







Each input is a row of iris.data (let's call this X [uppercase]) in the order: Sepal length, Sepal width, Petal length, Petal width (feature_name).

Input     | PREDICTION: KNN (n=1)
[2,4,3,1] | [0] - setosa

Input     | PREDICTION: KNN (n=1)
[2,4,3,1] | [0] - setosa
[4,6,5,3] | [2] - virginica

Let’s try another K value. The principle remains the same:

Input     | PREDICTION: KNN (n=1) | PREDICTION: KNN (n=5)
[2,4,3,1] | [0] - setosa          | [0] - setosa
[4,6,5,3] | [2] - virginica       | [1] - versicolor


Input     | PREDICTION: KNN (n=1) | PREDICTION: KNN (n=5) | PREDICTION: KNN (n=8)
[2,4,3,1] | ??                    | ??                    | ??
[4,6,5,3] | ??                    | ??                    | ??



Further Reading:

  1. http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
  2. Quiz: can you predict the species using the KNN algorithm for k = 1, 2, 3, 4, ..., 30? Try to use a Python looping construct to achieve the result. Learn more on looping: https://www.learnpython.org/en/Loops
  3. Read about another algorithm, say Logistic Regression, implement it for the same inputs and note the results as below.

Each input is a row of iris.data (let's call this X [uppercase]) in the order: Sepal length, Sepal width, Petal length, Petal width (feature_name).

Input     | KNN (n=1)       | KNN (n=5)        | KNN (n=8)       | LogisticRegression
[2,4,3,1] | [0] - setosa    | [0] - setosa     | [0] - setosa    | ??
[4,6,5,3] | [2] - virginica | [1] - versicolor | [2] - virginica | ??



Post answers for #2 and #3 in the forums or comments.






Chapter 9 : Logistic Regression

Rewind:

We saw how to define X and y and pass them to the K-Nearest Neighbors algorithm with K = 1, 5, 8, etc.

This lecture:

We will do the same thing with another algorithm, i.e. take our X and y and pass them to an algorithm called Logistic Regression.








The overall logic remains the same.



http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
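
Since the overall logic remains the same, the Logistic Regression version follows the same import, instantiate, fit pattern. This is a sketch under two assumptions: X and y come from load_iris, and max_iter=1000 is set only to give the solver enough iterations to converge on this data (it is not a setting from the lecture).

```python
# Step 1: import the class
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Step 2: instantiate (max_iter is an assumed setting, see the note above)
logisticreg = LogisticRegression(max_iter=1000)

# Step 3: fit the model
logisticreg.fit(X, y)

# Predict for the same example inputs used with KNN
prediction_lr = logisticreg.predict([[2, 4, 3, 1], [4, 6, 5, 3]])

print(prediction_lr)
print(iris.target_names[prediction_lr])
```

Results can differ slightly between scikit-learn versions because the default solver settings have changed over time.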

Each input is a row of iris.data (let's call this X [uppercase]) in the order: Sepal length, Sepal width, Petal length, Petal width (feature_name).

Input     | KNN (n=1)       | KNN (n=5)        | KNN (n=8)       | LogisticRegression
[2,4,3,1] | [0] - setosa    | [0] - setosa     | [0] - setosa    | [2] - virginica
[4,6,5,3] | [2] - virginica | [1] - versicolor | [2] - virginica | [2] - virginica



iris.data (let's call this X [uppercase])                                                                     | iris.target (let’s call this y [lowercase])
Sepal length (feature_name) | Sepal width (feature_name) | Petal length (feature_name) | Petal width (feature_name) | Species
5.1                         | 3.5                        | 1.4                         | 0.2                        | I. setosa (target_name)
4.9                         | 3.0                        | 1.4                         | 0.2                        | I. setosa (target_name)
4.7                         | 3.2                        | 1.3                         | 0.2                        | I. setosa (target_name)
4.6                         | 3.1                        | 1.5                         | 0.2                        | I. setosa (target_name)
6.3                         | 3.3                        | 4.7                         | 1.6                        | Versicolor



Each input is a row of iris.data (let's call this X [uppercase]) in the order: Sepal length, Sepal width, Petal length, Petal width (feature_name).

Input             | KNN (n=1) | KNN (n=5) | KNN (n=8) | LogisticRegression
[5.1,3.5,1.4,0.2] | ??        | ??        | ??        | ??
[6.3,3.3,4.7,1.6] | ??        | ??        | ??        | ??



prediction = knn.predict([[5.1,3.5,1.4,0.2],[6.3,3.3,4.7,1.6]])
print(prediction)

prediction5 = knn5.predict([[5.1,3.5,1.4,0.2],[6.3,3.3,4.7,1.6]])
print(prediction5)

prediction_lr = logisticreg.predict([[5.1,3.5,1.4,0.2],[6.3,3.3,4.7,1.6]])
print(prediction_lr)
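
The lines above rely on knn, knn5 and logisticreg having been fitted earlier in the notebook. A self-contained sketch is given below; it assumes knn uses K = 1 and knn5 uses K = 5 (inferred from the names), and that max_iter=1000 is acceptable for Logistic Regression.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Fit the three models on the same X and y
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)    # assumed K = 1
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # assumed K = 5
logisticreg = LogisticRegression(max_iter=1000).fit(X, y)

# Two observations taken from the data set itself
samples = [[5.1, 3.5, 1.4, 0.2], [6.3, 3.3, 4.7, 1.6]]

prediction = knn.predict(samples)
prediction5 = knn5.predict(samples)
prediction_lr = logisticreg.predict(samples)

# Print the species name each model predicts for each observation
for name, pred in [("KNN (n=1)", prediction),
                   ("KNN (n=5)", prediction5),
                   ("LogisticRegression", prediction_lr)]:
    print(name, pred, iris.target_names[pred])
```

Both observations are rows the models have already seen during fitting, which is one reason a more systematic check is needed.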

Further Readings:

https://en.wikipedia.org/wiki/Supervised_learning

https://www.youtube.com/watch?v=qSTHZvN8hzs

Chapter 10: What is the Train and Test Approach?

Rewind:

  1. You saw KNN with K = 1, 5, 8 and Logistic Regression. How do you know which algorithm is better?
  2. So far, we only checked a few inputs at random.

This lecture:

  1. We will check the results more systematically.
  2. We will call this ‘splitting the data into train and test’.

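iris.data (let's call this X [uppercase]) is split into X_train and X_test; iris.target is split into y_train and y_test.

Split   | Sepal length (feature name) | Sepal width (feature name) | Petal length (feature name) | Petal width (feature name) | Species
X_train | 5.1                         | 3.5                        | 1.4                         | 0.2                        | I. setosa (y_train)
X_train | 4.9                         | 3.0                        | 1.4                         | 0.2                        | I. setosa (y_train)
X_train | 4.7                         | 3.2                        | 1.3                         | 0.2                        | I. setosa (y_train)
X_test  | 4.6                         | 3.1                        | 1.5                         | 0.2                        | I. setosa (y_test)
X_test  | 4.7                         | 3.2                        | 1.3                         | 0.2                        | I. setosa (y_test)
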




http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
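
The train and test approach can be sketched with the two functions linked above. The split settings test_size=0.4 and random_state=4 are the ones used in these notes; max_iter=1000 and the dictionary of models are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Hold back 40% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN (n=1)": KNeighborsClassifier(n_neighbors=1),
    "KNN (n=5)": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)      # learn only from the training rows
    y_pred = model.predict(X_test)   # predict rows the model has not seen
    scores[name] = accuracy_score(y_test, y_pred)  # fraction predicted correctly
    print(name, scores[name])
```

The exact scores depend on the split and on the scikit-learn version, so they may not match the table below digit for digit.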

Algorithm           | Accuracy Score | Accuracy Score (test_size = 0.4, random_state=4)
Logistic Regression | 97.777%        | 95%
KNN (n=1)           | 95.55%         | 95%
KNN (n=5)           | 97.77%         | 96.66%
KNN (n=8)           | ??             |

Now let’s change to test_size = 0.4, random_state=4 and repeat.




Further Reading / Quiz:

  1. KNN (n = 1 to 30) -> can you find, which ‘n’ provides the highest accuracy?
  2. Can you plot a simple graph, where K values (1-30) form your x-axis and accuracy score 0-1 form your y-axis? Hint : read - https://matplotlib.org/users/pyplot_tutorial.html
  3. Read: (K fold cross validation) http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
  4. What are some disadvantages of the train and test method?

Hint: use different values for random_state and check the accuracy variance.

Chapter 11 - Finalizing your optimum algorithm

Rewind:

  1. We defined X and y.
  2. We saw how to split X and y into X_train, X_test, y_train and y_test.
  3. We tried supplying the inputs to KNN (n=1,5,8) and Logistic Regression and calculated the accuracy scores.
  4. We saw that the accuracy scores change when the training set and its size change.
  5. How do we choose the optimal algorithm?
  6. K-fold cross validation.



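  • What is K-fold cross validation?

It is simply a way to rotate your training and testing sets and calculate accuracy scores: the data is split into K parts (folds), each fold is used once as the test set while the remaining folds are used for training, and the K accuracy scores are averaged.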

https://en.wikipedia.org/wiki/Cross-validation_(statistics)

http://scikit-learn.org/stable/modules/cross_validation.html

  • Let’s look at some code

Once you have chosen the best algorithm, make sure you retrain the final model with all the data.
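
The workflow described above (cross-validate the candidates, choose the best, retrain on all the data) can be sketched as follows. The choice of cv=10 folds, the candidate list and max_iter=1000 are assumptions made for illustration, not settings from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

candidates = {
    "KNN (n=1)": KNeighborsClassifier(n_neighbors=1),
    "KNN (n=5)": KNeighborsClassifier(n_neighbors=5),
    "KNN (n=8)": KNeighborsClassifier(n_neighbors=8),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

mean_scores = {}
for name, model in candidates.items():
    # 10 folds: each fold is the test set once, the other 9 are used for training
    fold_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    mean_scores[name] = fold_scores.mean()
    print(name, mean_scores[name])

# Pick the candidate with the highest average accuracy
best_name = max(mean_scores, key=mean_scores.get)
print("Best:", best_name)

# Retrain the chosen algorithm on all the data before using it for predictions
final_model = candidates[best_name]
final_model.fit(X, y)
```

Averaging over the folds reduces the dependence on any single train and test split, which was the weakness seen in the previous chapter.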

Further Reading:

  1. K-Fold Cross Validation - https://en.wikipedia.org/wiki/Cross-validation_(statistics)
  2. https://matplotlib.org/users/pyplot_tutorial.html

Chapter 12 - Data science and Analytics - Next Steps

Rewind:

  1. We saw what data science is.
  2. We explored the concepts of supervised and unsupervised learning.
  3. We looked at the Anaconda distribution of Python and worked with Jupyter notebooks.
  4. We saw how to choose inputs, supply them to an algorithm and get some predictions.
  5. We looked at the general framework of training a model: import the class, instantiate, fit the model.
  6. We learnt how to create X, y and X_train, X_test, y_train, y_test.
  7. We learnt how to choose parameters for an algorithm.

What’s next?

  • We have simply scratched the surface.
  • Consider this course as Module # 1 (Introduction to Data Science using Python).
  • Future courses will be split into modules, with incremental complexity.
    • Future Modules:
      • Pandas and data Manipulation
      • Statistics (linear regression, Logistic regression, Decision Trees, Random forests etc.)
      • Application of these statistics using Python. Solve some problems from www.kaggle.com
      • Focussed courses - Example: Text Analytics.

How does this fit into Business Intelligence and Analytics? (All links are on my Udemy page or www.rakeshgopal.com)

  1. All about querying: SQL basic and Advanced courses.
  2. All about data movement: SQL Integration services
  3. All about visualizing your data and telling a wonderful story: SQL reporting services and Tableau
  4. All about data science and analysis: Multiple modules of Data science courses.

Keep yourself informed on new courses:

  1. Visit www.rakeshgopal.com and sign up for the newsletter.
  2. Visit my Udemy page
  3. Connect with me via social media (all links in my Udemy Profile)

Drop me a note:

If you like this course, please drop me a note with your valuable comments and reviews.

All the best!