Chapter # 1 - Introduction

A little background:

  • This is module # 2 of the Data Science track.
  • Module # 1
    • Absolute basics of data science and scikit-learn
    • We walked through simple concepts:
      • What is data science and machine learning.
      • How to build a career in machine learning
      • What are scikit-learn, Anaconda and Jupyter.
      • How to set up your environment.
      • Basics of pandas and input manipulation.
      • Supervised and unsupervised learning.
      • The KNN algorithm and its usage.
      • Fitting a model and fine-tuning algorithms.
      • And many more.
  • Module # 2
    • Is all about input data cleaning and analysis!
    • We will use Pandas.

A little about me!

  • I am Rakesh!
  • Coder and manager; I work in the field of analytics.
  • I like solving analytical problems, hiking, Xbox, photography, teaching, etc.
  • My website: www.rakeshgopal.com
  • Some feedback from my previous courses.

What is the BI Stack?

  • Databases and Querying concepts

(SQL Engineer, SQL/DB Developer, BI Developer)

  • Transferring and manipulating data from different sources

(ETL Developer / BI Developer)

  • Visualizing the data (i.e., extracting information out of data)

(Report / Data Analyst / BI Developer)

  • Predicting and analyzing patterns in the data

(Data Scientist / ML Scientist / ML Engineer)

About this course:

  • Keep it simple!
  • Provide enough information for people to start creating reports.
  • Incorporate feedback from previous courses.
  • I will be uploading this document.
  • It has reading materials.
  • Ask questions!

Chapter # 2:

Goal: To learn more about Pandas and how to install it.

  • “pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.”
  • https://pandas.pydata.org/
  • Comes installed with the Anaconda distribution of Python.
  • Provides easy and fast techniques for data manipulation in Python.
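
A quick way to confirm that Pandas is available in your environment is to import it and print its version. This is only a minimal sketch; if the import fails outside Anaconda, installing with `pip install pandas` or `conda install pandas` is the usual fix.

```python
import pandas as pd

# If this line prints a version number, pandas is installed and importable.
print(pd.__version__)
```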

Chapter # 3:

Goal: Tools and Techniques to read data from files.

  • You will always need to read data first in order to analyze it.
  • Pandas provides powerful tools for this, such as read_csv, read_table and read_json.
  • Various data formats that you may encounter:
    • CSV
    • Tab Separated
    • Pipe Delimited
    • JSON
    • And many more.
  • Let’s look at a few examples.
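
Here is a minimal sketch of reading the formats listed above. To keep it runnable without downloading anything, it reads from small in-memory strings instead of real files; the sample rows (names and ages) are made up for illustration. With a real file, you would pass the file path instead of the `io.StringIO(...)` object.

```python
import io
import pandas as pd

# Made-up sample data standing in for real files on disk.
csv_text = "Name,Age\nAiri,33\nBruno,38\n"
tsv_text = "Name\tAge\nAiri\t33\nBruno\t38\n"
pipe_text = "Name|Age\nAiri|33\nBruno|38\n"
json_text = '[{"Name": "Airi", "Age": 33}, {"Name": "Bruno", "Age": 38}]'

# CSV: read_csv uses a comma as the default separator.
df_csv = pd.read_csv(io.StringIO(csv_text))

# Tab separated: same function, with sep set to a tab character.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Pipe delimited: same function, with sep set to "|".
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep="|")

# JSON: a list of records becomes one row per record.
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.head())
```

All four calls should produce the same two-row table, which is a handy sanity check when you are unsure which delimiter a file uses.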



Resources:

The data files used in this lecture:

  1. https://datatables.net/extensions/buttons/examples/flash/tsv.html
  2. https://data-gov.tw.rpi.edu/wiki/CSV_files_use_delimiters_other_than_commas

(Note: if the file links change, feel free to use any pipe-delimited or tab-separated file.)

read_table documentation: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.read_table.html

Quiz / exercise:

  1. Search for some sample data in JSON format and try reading it into a DataFrame. Post your sample file link and code in the Q&A section.

Hint: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

  2. How will you drop multiple columns? Post your code in the Q&A section.

  3. Can you try loading some data from SQL Server? If so, please post the code and a short commentary.



Chapter # 4:

Goal: To understand some basic syntax to gather metadata before you actually start your data analysis.

  • Select a few columns
  • Select a few rows
  • Describe
  • Dtypes
  • Shape
  • Type
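
The checks listed above could look roughly like this. The small DataFrame is made up for illustration; in the lecture you would run the same calls on the data you loaded from a file.

```python
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [30, 40, 50],
    "Salary": [1000.0, 2000.0, 3000.0],
})

two_columns = df[["Name", "Age"]]  # select a few columns by name
first_rows = df.head(2)            # select the first two rows
summary = df.describe()            # summary statistics for numeric columns

print(summary)
print(df.dtypes)   # data type of each column
print(df.shape)    # (number of rows, number of columns)
print(type(df))    # the Python type of the object itself
```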

Quiz/Exercise:

Can you think of any other interesting checks you might run on the data before you start your analysis? If so, please post the syntax and ideas in the Q&A section.

Chapter # 5:

Goal: Column manipulation, sorting and filtering.

To learn about:

  • Concatenating columns & adding new columns
  • Rename existing columns
  • Sorting values, just like ORDER BY in SQL
  • Filtering, just like a WHERE clause in SQL
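
A minimal sketch of the four operations above, using a made-up DataFrame. The column names (`First`, `Last`, `Age`, `AgeYears`) are assumptions chosen for the example, not the columns of the lecture's data file.

```python
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "First": ["Airi", "Bruno", "Ashton"],
    "Last": ["Satou", "Nash", "Cox"],
    "Age": [33, 38, 66],
})

# Concatenate two string columns into a new column.
df["Name"] = df["First"] + " " + df["Last"]

# Rename an existing column; rename returns a new DataFrame.
df = df.rename(columns={"Age": "AgeYears"})

# Sort by a column, like ORDER BY AgeYears DESC in SQL.
by_age = df.sort_values("AgeYears", ascending=False)

# Filter rows with a boolean condition, like WHERE AgeYears > 35 in SQL.
over_35 = df[df["AgeYears"] > 35]

print(by_age)
print(over_35)
```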



Quiz/Exercise:

  1. Can you filter for rows where Name is either Airi Satou, Brenden Wagner or Bruno Nash? Post your code in the Q&A section.
  2. Can you think of more column renaming techniques? If so, please post your code and a short commentary in the Q&A section.
  3. If you need to replace 500 column names in one shot, how would you do that? Post your code. (Hint: use columns.str.replace.)

Chapter # 6:

Goal: To learn about data aggregation and string manipulation.

  • Mean
  • String manipulation techniques:
    • Upper and lower
    • String replace
  • Groupby operations (like GROUP BY aggregations in SQL)
    • Min
    • Max
    • Agg
    • etc.
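
The topics above can be sketched as follows. The data is made up for illustration, and the new column names (`OfficeUpper`, `OfficeLower`, `PositionShort`) are just assumptions for the example.

```python
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "Office": ["Tokyo", "London", "Tokyo", "London"],
    "Position": ["Accountant", "Developer", "Developer", "Accountant"],
    "Salary": [100, 200, 300, 400],
})

# Mean of a numeric column.
avg_salary = df["Salary"].mean()

# String manipulation through the .str accessor.
df["OfficeUpper"] = df["Office"].str.upper()
df["OfficeLower"] = df["Office"].str.lower()
df["PositionShort"] = df["Position"].str.replace("Developer", "Dev", regex=False)

# Groupby aggregations, like GROUP BY Office in SQL.
min_by_office = df.groupby("Office")["Salary"].min()
max_by_office = df.groupby("Office")["Salary"].max()

# agg computes several aggregations in one call.
stats = df.groupby("Office")["Salary"].agg(["min", "max", "mean"])

print(avg_salary)
print(stats)
```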

Quiz / exercises:

Chapter # 7:

Goal: To explore loc and dropna.

  • Loc (to select specific rows and columns)
  • Dropna (to drop rows with missing values)
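
A minimal sketch of both tools on a made-up DataFrame with a few missing values. Note that `loc` slices by label and includes the end label, so `0:2` returns three rows here.

```python
import pandas as pd

# Made-up sample data with some missing values, for illustration only.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [33, float("nan"), 66, 38],
    "Office": ["Tokyo", "London", None, "London"],
})

# loc with a row label range and a list of column names.
subset = df.loc[0:2, ["Name", "Age"]]

# loc with a boolean condition on rows and a single column.
older = df.loc[df["Age"] > 35, "Name"]

# dropna removes every row that has at least one missing value.
complete = df.dropna()

# Restrict the check to certain columns with the subset parameter.
age_known = df.dropna(subset=["Age"])

print(subset)
print(complete)
```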

Quiz/Exercise: We looked at dropping rows with missing values. Can you explore what ‘fillna’ does? Post some sample code in the Q&A section.

Chapter # 8:

Goal: To learn how to work with plots, merge data-frames and explore data pivoting and pivot tables.

  • Creating plots
  • Merging data-frames
    • Inner joins
    • Left and right joins
    • Outer joins.
  • Data pivoting and pivot tables
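
Here is a rough sketch of the merge types and a basic pivot table. The two small tables (`employees` and `sales`) and their column names are made up for illustration. Plotting is left out of this sketch because it also needs matplotlib installed.

```python
import pandas as pd

# Made-up sample tables for illustration only.
employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["A", "B", "C"]})
sales = pd.DataFrame({
    "emp_id": [1, 1, 2, 4],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "amount": [10, 20, 30, 40],
})

# Inner join: only emp_id values present in both tables.
inner = employees.merge(sales, on="emp_id", how="inner")

# Left join: every employee, with missing values where there are no sales.
left = employees.merge(sales, on="emp_id", how="left")

# Right join: every sale, with missing values where there is no employee.
right = employees.merge(sales, on="emp_id", how="right")

# Outer join: everything from both tables.
outer = employees.merge(sales, on="emp_id", how="outer")

# Pivot table: one row per emp_id, one column per quarter, summed amounts.
pivot = sales.pivot_table(index="emp_id", columns="quarter",
                          values="amount", aggfunc="sum")

print(outer)
print(pivot)
```

Comparing the row counts of the four merge results is a quick way to see how each join type treats unmatched keys.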

Quiz/Reading:

  • If you are not familiar with the concept of joins, please watch a lecture on joins - Link & Link
  • Data set link: https://datamarket.com/data/list/?q=provider%3Atsdl
  • Can you display the “Grand total” in your pivot table? If so, please paste the code in the Q&A section
  • Try out some plots like histograms and bar charts. Paste your code in the Q&A section.

Chapter # 9 - Tips and tricks in Pandas

Goal: To learn about some useful tips and tricks.

  • Shift
  • Write to a CSV file.
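
A minimal sketch of both tips, on made-up data. It writes the CSV to an in-memory buffer so it runs without touching the disk; with a real file you would pass a path such as "output.csv" (a hypothetical name) to `to_csv` instead.

```python
import io
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({"day": [1, 2, 3], "sales": [100, 120, 90]})

# shift(1) moves values down one row, so each row sees the previous value.
df["prev_sales"] = df["sales"].shift(1)

# A common use: the change compared with the previous row.
df["change"] = df["sales"] - df["prev_sales"]

# Write to CSV; index=False leaves out the row index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(df)
print(csv_text)
```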



Quiz/Exercise:

  1. Explore the read_excel option. Paste some sample code in the Q&A section.
  2. I am attaching 2 links that might be helpful:
    1. https://rakeshgopal.teachable.com/blog/1266369/interview-tips-and-preparation-guidelines
    2. https://rakeshgopal.teachable.com/blog/1065862/where-do-i-start-to-learn-sql-server-databases