Chapter # 1 - Introduction
A little background:
- This is module # 2 of the Data Science track.
- Module # 1
- Absolute basics of data science and scikit-learn
- We walked through simple concepts:
- What is data science and machine learning.
- How to build a career in machine learning
- What scikit-learn, Anaconda and Jupyter are.
- How to set up your environment.
- Basics of pandas and input manipulation.
- Supervised and unsupervised learning.
- KNN algorithm and the usage.
- Fitting a model and fine-tuning algorithms.
- And many more.
- Module # 2
- Is all about input data cleaning and analysis!!!
- We will use Pandas.
A little about me!
- I am Rakesh!
- Coder and manager; I work in the field of analytics.
- I like solving analytical problems, hiking, Xbox, photography, teaching, etc.
- My website: www.rakeshgopal.com
- Some feedback from my previous courses.
What is the BI Stack
- Databases and Querying concepts
(SQL Engineer, SQL/DB Developer, BI Developer)
- Transferring and manipulating data from different sources
(ETL Developer / BI Developer)
- Visualizing the data (i.e., extracting information out of data)
(Report/ Data Analyst/ BI Developer)
- Predicting and Pattern Analyzing the data
(Data Scientist / ML Scientist / ML engineer)
About this course:
- Keep it simple!
- Provide enough information for people to start creating reports.
- Incorporate feedback from previous courses.
- I will be uploading this document
- Has reading materials.
- Ask questions!
Chapter # 2:
Goal: To learn more about pandas and how to install it.
- “pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.”
- Comes installed with the Anaconda distribution of Python.
- Provides easy and fast techniques for data manipulation using Python.
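A quick way to confirm that pandas is available in your environment is to import it and print its version. This is only a minimal sketch: the `pd` alias is a common convention rather than a requirement, and the tiny table is made-up data.

```python
import pandas as pd

# Print the installed pandas version to confirm that the import works.
print(pd.__version__)

# Build a tiny DataFrame from a dictionary (made-up sample data).
df = pd.DataFrame({"name": ["A", "B"], "score": [10, 20]})
print(df)
```

If the import fails, pandas is probably not installed in the Python environment you are running.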
Chapter # 3:
Goal: Tools and Techniques to read data from files.
- You always need to read data first in order to analyze it.
- Pandas provides powerful tools for this.
- Various data formats that you may encounter:
- Tab Separated
- Pipe Delimited
- And many more.
- Let’s look at a few examples.
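As a minimal sketch of reading the two formats listed above, the example below uses `pd.read_csv` with the `sep` argument. The data is made up and kept in memory with `io.StringIO`, so the sketch does not depend on any file link; with a real file you would pass its path or URL instead.

```python
import io

import pandas as pd

# Made-up sample data; in practice you would pass a file path or URL.
tab_text = "name\tage\nAsha\t31\nBen\t45\n"
pipe_text = "name|age\nAsha|31\nBen|45\n"

# sep tells read_csv which character separates the columns.
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")
pipe_df = pd.read_csv(io.StringIO(pipe_text), sep="|")

print(tab_df)
print(pipe_df)
```

`pd.read_table` works the same way, except that it assumes a tab separator by default.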
The data files used in this lecture:
(Note: if the file links change, please feel free to use any pipe-delimited or tab-separated file.)
read_table documentation: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.read_table.html
Quiz / exercise:
1. Search for some sample data in JSON format. Try reading it into a dataframe. Post your sample file link and code in the Q&A section.
2. How will you drop multiple columns? Post your code in the Q&A section.
3. Can you try loading some data from SQL Server? If so, please post the code and a short commentary.
Chapter # 4:
Goal: To understand some basic syntax to gather metadata before you actually start your data analysis.
- Select a few columns
- Select a few rows
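Here is a minimal sketch of the kind of metadata checks and selections this chapter covers. The table and its column names are made up for illustration.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "age": [31, 45, 28, 39],
    "city": ["Pune", "Oslo", "Xian", "Lima"],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
print(df.head(2))  # first two rows

# Select a few columns by passing a list of column names.
few_columns = df[["name", "age"]]

# Select a few rows by position: rows 1 and 2 (the end of the slice is excluded).
few_rows = df.iloc[1:3]

print(few_columns)
print(few_rows)
```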
Can you think of any other interesting checks you might run on the data before you start your analysis? If so, please post the syntax and ideas in the Q&A section.
Chapter # 5:
Goal: Column manipulation, sorting and filtering.
To learn about:
- Concatenating columns & adding new columns
- Rename existing columns
- Sorting values - Just like order by in SQL
- Filtering - Just like where clause in SQL
- Can you filter for rows where Name is either Airi Satou or Brende Wagner or Bruno Nash? Post your code in the Q&A section
- Can you think of more column renaming techniques? If so, please post your code and a short commentary in the Q&A section.
- If you need to replace 500 column names in one shot, how would you do that? Post your code. (Hint: use columns.str.replace.)
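The four techniques listed under this chapter's goal can be sketched as follows. The table and column names are made up, and this is just one simple way to do each step.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({
    "first": ["Asha", "Ben", "Chen"],
    "last": ["Rao", "Berg", "Li"],
    "salary": [300, 100, 200],
})

# Add a new column by concatenating two string columns.
df["full_name"] = df["first"] + " " + df["last"]

# Rename an existing column; rename returns a new DataFrame.
df = df.rename(columns={"salary": "annual_salary"})

# Sort by a column, like ORDER BY ... DESC in SQL.
sorted_df = df.sort_values("annual_salary", ascending=False)

# Filter rows with a boolean condition, like a WHERE clause in SQL.
high_paid = df[df["annual_salary"] > 150]

print(sorted_df)
print(high_paid)
```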
Chapter # 6
Goal: To learn about data manipulation and aggregation.
- String manipulation techniques:
- Upper and lower
- String replace
- Groupby clauses (like you use in SQL Aggregations)
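A minimal sketch of these techniques, using made-up data: the `.str` accessor applies a string method to every value in a column, and `groupby` plays the role of GROUP BY in SQL.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({
    "dept": ["sales", "Sales", "it", "It"],
    "office": ["new_york", "new_york", "london", "london"],
    "salary": [100, 200, 300, 500],
})

# Upper and lower: normalize the case so that "sales" and "Sales" match.
df["dept"] = df["dept"].str.upper()
df["office_lower"] = df["office"].str.lower()

# String replace: swap the underscore for a space (plain text, not a regex).
df["office"] = df["office"].str.replace("_", " ", regex=False)

# Group by department and sum the salaries,
# like SELECT dept, SUM(salary) ... GROUP BY dept in SQL.
totals = df.groupby("dept")["salary"].sum()

print(df)
print(totals)
```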
Quiz / exercises:
- Take any data of your choice that has at least 2-3 numerical columns. Can you calculate the mean row-wise and column-wise? (Hint: read about axis = 0 and axis = 1.)
- Complete this entire tutorial: https://pandas.pydata.org/pandas-docs/stable/text.html
Chapter # 7:
Goal: Exploring loc and dropna
- loc (to select rows and columns by label)
- dropna (to drop rows with missing values)
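Both tools can be sketched on a small made-up table. The row labels and column names here are hypothetical.

```python
import pandas as pd

# Made-up sample data with some missing values (None becomes NaN in "age").
df = pd.DataFrame(
    {"age": [31, None, 28], "city": ["Pune", "Oslo", None]},
    index=["asha", "ben", "chen"],
)

# loc selects by label; both ends of a label slice are included.
subset = df.loc["asha":"ben", ["age"]]

# dropna returns a new DataFrame without the rows that contain missing values.
# The original df is left unchanged.
complete = df.dropna()

print(subset)
print(complete)
```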
Quiz / exercise: We looked at dropping rows with missing values. Can you explore what ‘fillna’ does? Post some sample code in the Q&A section.
Chapter # 8:
Goal: To learn how to work with plots, merge data-frames and explore data pivoting and pivot tables.
- Creating plots
- Merging data-frames
- Inner joins
- Left and right joins
- Outer joins.
- Data pivoting and pivot tables
- If you are not aware of the concept of joins, please watch a lecture on join - Link & Link
- Data set link: https://datamarket.com/data/list/?q=provider%3Atsdl
- Can you display the “Grand total” in your pivot table? If so, please paste the code in the Q&A section
- Try out some plots like histograms and bar charts. Paste your code in the Q&A section.
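The join types and the pivot table covered in this chapter can be sketched as follows. The two tables are made up; `how` picks the join type, just like the join keyword in SQL.

```python
import pandas as pd

# Made-up sample data: employee 3 has no sales, and sale emp_id 4 has no employee.
employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
sales = pd.DataFrame({
    "emp_id": [1, 1, 2, 4],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "amount": [10, 20, 30, 40],
})

# Inner join: only emp_id values present in both tables.
inner = employees.merge(sales, on="emp_id", how="inner")

# Left join: every employee, with missing values where there are no sales.
left = employees.merge(sales, on="emp_id", how="left")

# Right join: every sale, with missing values where there is no employee.
right = employees.merge(sales, on="emp_id", how="right")

# Outer join: everything from both tables.
outer = employees.merge(sales, on="emp_id", how="outer")

# Pivot table: one row per emp_id, one column per quarter, summing the amounts.
pivot = sales.pivot_table(
    index="emp_id", columns="quarter", values="amount", aggfunc="sum"
)

print(len(inner), len(left), len(right), len(outer))
print(pivot)
```

Comparing the row counts of the four results is a quick way to see how the join types differ.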
Chapter 9 - Tips and tricks in Pandas.
Goal: To learn some useful tips and tricks.
- Write to a CSV file.
- Explore the read_excel option. Paste some sample code in the Q&A section.
- I am attaching 2 links that might be helpful:
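For the "Write to a CSV file" tip above, here is a minimal sketch. The data, the file name `people.csv` and the temporary folder are made up for illustration; in practice you would write to a path of your choice.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"name": ["Asha", "Ben"], "age": [31, 45]})

# Write to a temporary folder so the example does not clutter your working directory.
out_path = os.path.join(tempfile.mkdtemp(), "people.csv")

# index=False leaves the row index out of the file.
df.to_csv(out_path, index=False)

# Read the file back to check the round trip.
round_trip = pd.read_csv(out_path)
print(round_trip)
```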