Data Analysis and Machine Learning with Python
Python for Data Analysis
For many people, the Python language is easy to fall in love with. Since its first appearance in 1991, Python has become one of the most popular dynamic programming languages, along with Perl, Ruby, and others.
Python and Ruby have become especially popular in recent years for building websites using their numerous web frameworks, like Rails (Ruby) and Django (Python). Such languages are often called scripting languages as they can be used to write quick-and-dirty small programs, or scripts.
I don’t like the term “scripting language” as it carries a connotation that such languages cannot be used for building mission-critical software. Among interpreted languages, Python is distinguished by its large and active scientific computing community.
Adoption of Python for scientific computing in both industry applications and academic research has increased significantly since the early 2000s.
For data analysis and interactive, exploratory computing and data visualization, Python will inevitably draw comparisons with the many other domain-specific open source and commercial programming languages and tools in wide use, such as R, MATLAB, SAS, Stata, and others. In recent years, Python’s improved library support (primarily pandas) has made it a strong alternative for data manipulation tasks.
Combined with its strength in general-purpose programming, this makes Python an excellent choice as a single language for building data-centric applications.
Python for Machine Learning
Machine learning (ML) teaches machines how to carry out tasks by themselves. It is that simple. The complexity comes with the details, and that is most likely the reason you are reading this book. Maybe you have too much data and too little insight, and you hoped that using machine learning algorithms would help you solve this challenge.
So you started to dig into random algorithms. But after some time you were puzzled: which of the myriad of algorithms should you actually choose? Or maybe you are broadly interested in machine learning and have been reading a few blogs and articles about it for some time.
The goal of machine learning is to teach machines to carry out tasks by providing them with a couple of examples (how to do or not do a task). Let us assume that each morning when you turn on your computer, you perform the same task of moving e-mails around so that only those e-mails belonging to a particular topic end up in the same folder.
After some time, you feel bored and think of automating this chore. One way would be to start analyzing your brain and writing down all the rules your brain processes while you are shuffling your e-mails. However, this will be quite cumbersome and always imperfect.
While you will miss some rules, you will over-specify others. A better and more future-proof way would be to automate this process by choosing a set of e-mail meta information and body/folder name pairs and letting an algorithm come up with the best rule set.
The pairs would be your training data, and the resulting rule set (also called model) could then be applied to future e-mails that we have not yet seen. This is machine learning in its simplest form. Of course, machine learning (often also referred to as data mining or predictive analysis) is not a brand new field in itself.
Quite the contrary, its success over recent years can be attributed to the pragmatic way of using rock-solid techniques and insights from other successful fields; for example, statistics. There, the purpose is for us humans to get insights into the data by learning more about the underlying patterns and relationships.
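The e-mail sorting idea described above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the training pairs, the folder names, and the `train` and `predict` helpers are all hypothetical, and the "rule set" here is a simple per-folder word count rather than any particular algorithm covered later.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (e-mail text, folder name) pairs.
training_pairs = [
    ("invoice for your recent order", "billing"),
    ("payment received for invoice", "billing"),
    ("team meeting moved to friday", "work"),
    ("project meeting agenda attached", "work"),
]

def train(pairs):
    """Learn a toy 'rule set' (model): word counts per folder."""
    model = defaultdict(Counter)
    for text, folder in pairs:
        model[folder].update(text.lower().split())
    return model

def predict(model, text):
    """Pick the folder whose training words overlap most with the e-mail."""
    words = text.lower().split()
    scores = {
        folder: sum(counts[word] for word in words)
        for folder, counts in model.items()
    }
    return max(scores, key=scores.get)

model = train(training_pairs)

# Apply the learned model to e-mails it has not seen before.
print(predict(model, "new invoice and payment details"))
print(predict(model, "agenda of the next meeting"))
```

The point of the sketch is the division of labor: you supply example pairs, and the rules are derived from them instead of being written down by hand.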
As you read more and more about successful applications of machine learning (you have checked out kaggle.com already, haven't you?), you will see that applied statistics is a common field among machine learning experts. As you will see later, the process of coming up with a decent ML approach is never a waterfall-like process.
Instead, you will find yourself going back and forth in your analysis, trying out different versions of your input data on diverse sets of ML algorithms. It is this explorative nature that lends itself perfectly to Python. Because Python is an interpreted high-level programming language, it may seem that it was designed specifically for the process of trying out different things.
What is more, it does this very fast. Sure enough, it is slower than C or similar statically-typed programming languages; nevertheless, with a myriad of easy-to-use libraries that are often written in C, you don't have to sacrifice speed for agility.
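The speed argument can be tried out with a small experiment. The sketch below compares a hand-written Python loop with the built-in `sum`, which is implemented in C in the standard CPython interpreter; libraries such as NumPy rely on the same idea of pushing the inner loop into compiled code. The helper name `python_loop_sum` and the input size are assumptions chosen for illustration, and the measured times will vary from machine to machine.

```python
import timeit

values = list(range(100_000))

def python_loop_sum(numbers):
    """Add the numbers one by one in interpreted Python code."""
    total = 0
    for n in numbers:
        total += n
    return total

# Both approaches compute the same result.
loop_result = python_loop_sum(values)
builtin_result = sum(values)

# Time 20 repetitions of each approach.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)

print(f"Python loop:  {loop_time:.4f} s")
print(f"Built-in sum: {builtin_time:.4f} s")
```

On a typical CPython installation the built-in version tends to finish noticeably faster, although the exact ratio depends on the interpreter version and hardware.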
Table of Contents
- The tasks to do in this course
- Install Development Environment
2. Interactive Computing with IPython
- IPython Basics
- The commands in IPython
- Interacting with the OS
- Debug with pdb
- Advanced IPython Features
3. Arrays and Vectorized Computation
- Introduction to NumPy
- Multidimensional Array Object
- Fast Element-wise Array Functions
- Data Processing Using Arrays
- File Input and Output with Arrays
- Linear Algebra
- Random Number Generation
4. Data Analysis with pandas
- Introduction to pandas Data Structures
- Essential Functionality
- Summarizing and Computing Descriptive Statistics
- Handling Missing Data
- Hierarchical Indexing
- Advanced pandas
5. Data Loading, Storage, and File Formats
- Reading and Writing Data in Text Format
- Binary Data Formats
- Interacting with HTML and Web APIs
- Interacting with Databases
6. Data Wrangling
- Combining and Merging Data Sets
- Reshaping and Pivoting
- Data Transformation
- String Manipulation
7. Plotting and Visualization
- Matplotlib APIs
- Plotting Functions in pandas
- Example: Visualizing Earthquake Crisis Data
- Visualization Tool Ecosystem
8. Data Aggregation and Group Operations
- GroupBy Mechanics
- Data Aggregation
- Group-wise Operations and Transformations
- Pivot Tables and Cross-Tabulation
9. Time Series
- Date and Time Data Types
- Time Series Basics
- Date Ranges, Frequencies, and Shifting
- Time Zone Handling
- Periods and Period Arithmetic
- Resampling and Frequency Conversion
- Time Series Plotting
- Moving Window Functions
- Performance and Memory Usage
10. Financial and Economic Data
- Time Series and Cross-Section Alignment
- Operations with Time Series of Different Frequencies
- Time of Day and Data Selection
- Splicing Together Data Sources
- Return Indexes and Cumulative Returns
- Group Transforms and Analysis
11. Advanced NumPy
- ndarray Object Internals
- Advanced Array Manipulation
- Structured and Record Arrays
- NumPy Matrix Class
- Advanced Array I/O
12. Big Data with Python
- Introducing big data
- Hadoop for big data
- Apache Hadoop
- Example in Hadoop
- Hadoop for finance
- Introducing NoSQL
- MongoDB and PyMongo
13. Getting Started with Python Machine Learning
- Machine learning and Python
- A simple machine learning example
- Linear regression algorithm
- Training a linear regression model
- Recursive polynomial algorithm
- Training a recursive polynomial model
- Support Vector Machine Regression
- The Decision Tree Algorithm
- Random forest algorithm
14. Classification in Machine Learning
- Logistic Regression
- K-Nearest Neighbor Classifier
- Support Vector Machine
- Kernel Support Vector Machine
- Naive Bayes Classifier
- Tree Based Algorithms
- Random Forest Classifier
- K-means clustering
- Hierarchical Clustering in Python
15. Artificial Neural Networks
- Introduction to ANN
- Mathematical basis of ANN
- Perceptron neural network
- The Backpropagation Algorithm
- Building an ANN
- Training an ANN
16. TensorFlow Framework
- Introduction to TensorFlow
- TensorFlow APIs
- Building an ANN with TensorFlow
- Training an ANN with TensorFlow
17. Practical Projects
- Handwriting Recognition with Python
- Image Recognition with Python
- Natural Language Processing with Python