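Description
Machine learning and predictive analytics are transforming the way businesses and other organizations operate.
Python Machine Learning (reprint edition) takes you into the world of predictive analytics and demonstrates why Python is one of the world's leading data science languages. If you want to ask deeper questions of your data, or to improve and extend the capabilities of your machine learning systems, this practical book is invaluable.
The book covers a wide range of powerful Python libraries, including scikit-learn, Theano, and Keras, along with how-to guidance and tips on topics ranging from sentiment analysis to neural networks, so you will soon be able to answer the most important questions facing you and your organization.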
About the Author
Sebastian Raschka is a PhD student at Michigan State University who develops new computational methods in the field of computational biology. He has been ranked as the number one most influential data scientist on GitHub by Analytics Vidhya. He has a yearlong experience in Python programming and he has conducted several seminars on the practical applications of data science and machine learning. Talking and writing about data science, machine learning, and Python really motivated Sebastian to write this book in order to help people develop data-driven solutions without necessarily needing to have a machine learning background. He has also actively contributed to open source projects, and methods that he implemented are now successfully used in machine learning competitions, such as Kaggle. In his free time, he works on models for sports predictions, and if he is not in front of the computer, he enjoys playing sports.
Table of Contents
Preface
Chapter 1: Giving Computers the Ability to Learn from Data
Building intelligent machines to transform data into knowledge
The three different types of machine learning
Making predictions about the future with supervised learning
Classification for predicting class labels
Regression for predicting continuous outcomes
Solving interactive problems with reinforcement learning
Discovering hidden structures with unsupervised learning
Finding subgroups with clustering
Dimensionality reduction for data compression
An introduction to the basic terminology and notations
A roadmap for building machine learning systems
Preprocessing - getting data into shape
Training and selecting a predictive model
Evaluating models and predicting unseen data instances
Using Python for machine learning
Installing Python packages
Summary
Chapter 2: Training Machine Learning Algorithms for Classification
Artificial neurons - a brief glimpse into the early history of machine learning
Implementing a perceptron learning algorithm in Python
Training a perceptron model on the Iris dataset
Adaptive linear neurons and the convergence of learning
Minimizing cost functions with gradient descent
Implementing an Adaptive Linear Neuron in Python
Large scale machine learning and stochastic gradient descent
Summary
Chapter 3: A Tour of Machine Learning Classifiers Using Scikit-learn
Choosing a classification algorithm
First steps with scikit-learn
Training a perceptron via scikit-learn
Modeling class probabilities via logistic regression
Logistic regression intuition and conditional probabilities
Learning the weights of the logistic cost function
Training a logistic regression model with scikit-learn
Tackling overfitting via regularization
Maximum margin classification with support vector machines
Maximum margin intuition
Dealing with the nonlinearly separable case using slack variables
Alternative implementations in scikit-learn
Solving nonlinear problems using a kernel SVM
Using the kernel trick to find separating hyperplanes in higher dimensional space
Decision tree learning
Maximizing information gain - getting the most bang for the buck
Building a decision tree
Combining weak to strong learners via random forests
K-nearest neighbors - a lazy learning algorithm
Summary
Chapter 4: Building Good Training Sets - Data Preprocessing
Dealing with missing data
Eliminating samples or features with missing values
Imputing missing values
Understanding the scikit-learn estimator API
Handling categorical data
Mapping ordinal features
Encoding class labels
Performing one-hot encoding on nominal features
Partitioning a dataset in training and test sets
Bringing features onto the same scale
Selecting meaningful features
Sparse solutions with L1 regularization
Sequential feature selection algorithms
Assessing feature importance with random forests
Summary
Chapter 5: Compressing Data via Dimensionality Reduction
Unsupervised dimensionality reduction via principal component analysis
Total and explained variance
Feature transformation
Principal component analysis in scikit-learn
Supervised data compression via linear discriminant analysis
Computing the scatter matrices
Selecting linear discriminants for the new feature subspace
Projecting samples onto the new feature space
LDA via scikit-learn
Using kernel principal component analysis for nonlinear mappings
Kernel functions and the kernel trick
Implementing a kernel principal component analysis in Python
Example 1 - separating half-moon shapes
Example 2 - separating concentric circles
Projecting new data points
Kernel principal component analysis in scikit-learn
Summary
Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Tuning
Streamlining workflows with pipelines
Loading the Breast Cancer Wisconsin dataset
Combining transformers and estimators in a pipeline
Using k-fold cross-validation to assess model performance
The holdout method
K-fold cross-validation
Debugging algorithms with learning and validation curves
Diagnosing bias and variance problems with learning curves
Addressing overfitting and underfitting with validation curves
Fine-tuning machine learning models via grid search
Tuning hyperparameters via grid search
Algorithm selection with nested cross-validation
Looking at different performance evaluation metrics
Reading a confusion matrix
Optimizing the precision and recall of a classification model
Plotting a receiver operating characteristic
The scoring metrics for multiclass classification
Summary
Chapter 7: Combining Different Models for Ensemble Learning
Learning with ensembles
Implementing a simple majority vote classifier
Combining different algorithms for classification with majority vote
Evaluating and tuning the ensemble classifier
Bagging - building an ensemble of classifiers from bootstrap samples
Leveraging weak learners via adaptive boosting
Summary
Chapter 8: Applying Machine Learning to Sentiment Analysis
Obtaining the IMDb movie review dataset
Introducing the bag-of-words model
Transforming words into feature vectors
Assessing word relevancy via term frequency-inverse document frequency
Cleaning text data
Processing documents into tokens
Training a logistic regression model for document classification
Working with bigger data - online algorithms and out-of-core learning
Summary
Chapter 9: Embedding a Machine Learning Model into a Web Application
Serializing fitted scikit-learn estimators
Setting up a SQLite database for data storage
Developing a web application with Flask
Our first Flask web application
Form validation and rendering
Turning the movie classifier into a web application
Deploying the web application to a public server
Updating the movie review classifier
Summary
Chapter 10: Predicting Continuous Target Variables with Regression Analysis
Introducing a simple linear regression model
Exploring the Housing Dataset
Visualizing the important characteristics of a dataset
Implementing an ordinary least squares linear regression model
Solving regression for regression parameters with gradient descent
Estimating the coefficient of a regression model via scikit-learn
Fitting a robust regression model using RANSAC
Evaluating the performance of linear regression models
Using regularized methods for regression
Turning a linear regression model into a curve - polynomial regression
Modeling nonlinear relationships in the Housing Dataset
Dealing with nonlinear relationships using random forests
Decision tree regression
Random forest regression
Summary
Chapter 11: Working with Unlabeled Data - Clustering Analysis
Grouping objects by similarity using k-means
K-means++
Hard versus soft clustering
Using the elbow method to find the optimal number of clusters
Quantifying the quality of clustering via silhouette plots
Organizing clusters as a hierarchical tree
Performing hierarchical clustering on a distance matrix
Attaching dendrograms to a heat map
Applying agglomerative clustering via scikit-learn
Locating regions of high density via DBSCAN
Summary
Chapter 12: Training Artificial Neural Networks for Image Recognition
Modeling complex functions with artificial neural networks
Single-layer neural network recap
Introducing the multi-layer neural network architecture
Activating a neural network via forward propagation
Classifying handwritten digits
Obtaining the MNIST dataset
Implementing a multi-layer perceptron
Training an artificial neural network
Computing the logistic cost function
Training neural networks via backpropagation
Developing your intuition for backpropagation
Debugging neural networks with gradient checking
Convergence in neural networks
Other neural network architectures
Convolutional Neural Networks
Recurrent Neural Networks
A few last words about neural network implementation
Summary
Chapter 13: Parallelizing Neural Network Training with Theano
Building, compiling, and running expressions with Theano
What is Theano?
First steps with Theano
Configuring Theano
Working with array structures
Wrapping things up - a linear regression example
Choosing activation functions for feedforward neural networks
Logistic function recap
Estimating probabilities in multi-class classification via the softmax function
Broadening the output spectrum by using a hyperbolic tangent
Training neural networks efficiently using Keras
Summary
Index
Foreword
We live in the midst of a data deluge. According to recent estimates, 2.5 quintillion (10^18) bytes of data are generated on a daily basis. This is so much data that over 90 percent of the information that we store nowadays was generated in the past decade alone. Unfortunately, most of this information cannot be used by humans. Either the data is beyond the means of standard analytical methods, or it is simply too vast for our limited minds to even comprehend.
Through Machine Learning, we enable computers to process, learn from, and draw actionable insights out of the otherwise impenetrable walls of big data. From the massive supercomputers that support Google's search engines to the smart phones that we carry in our pockets, we rely on Machine Learning to power most of the world around us - often, without even knowing it.
As modern pioneers in the brave new world of big data, it then behooves us to learn more about Machine Learning. What is Machine Learning and how does it work? How can I use Machine Learning to take a glimpse into the unknown, power my business, or just find out what the Internet at large thinks about my favorite movie? All of this and more will be covered in the following chapters authored by my good friend and colleague, Sebastian Raschka. When away from taming my otherwise irascible pet dog, Sebastian has tirelessly devoted his free time to the open source Machine Learning community. Over the past several years, Sebastian has developed dozens of popular tutorials that cover topics in Machine Learning and data visualization in Python. He has also developed and contributed to several open source Python packages, several of which are now part of the core Python Machine Learning workflow.
Owing to his vast expertise in this field, I am confident that Sebastian's insights into the world of Machine Learning in Python will be invaluable to users of all experience levels. I wholeheartedly recommend this book to anyone looking to gain a broader and more practical understanding of Machine Learning.
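Python Data Science in Practice: From Fundamentals to Advanced (《Python 數據科學實戰:從基礎到進階》)
An in-depth exploration of data analysis and machine learning to guide your data-driven decisions.

In this age of information explosion, data has become a core driver of change. From business insight to scientific research, from user-experience optimization to social welfare, the importance of understanding and working with data is self-evident. Yet faced with massive, complex data, how to extract value, identify patterns, and even predict the future is a challenge for every data practitioner and for every reader hoping to make a mark in the field. This book was written to meet that challenge, taking you on a systematic, comprehensive, and hands-on journey through data science.

Rather than merely listing techniques and tools, the book looks at the whole data science life cycle, from data collection, cleaning, and exploratory analysis through model building, evaluation, and deployment. We chose Python as the core language because of its rich and powerful ecosystem, including NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, TensorFlow, and PyTorch, which supports every stage of data science work.

What the book covers:

Part 1: Data Science Foundations and Setting Up the Python Environment
Before setting out to explore data, it is essential to build a solid foundation. This part starts from scratch and builds the data science knowledge and working environment you will need.
Python, from introductory to advanced: For newcomers to Python, we give clear explanations of the syntax, including data types, control flow, functions, and object-oriented programming, along with common techniques for data science. For readers with some Python background, we go deeper into advanced topics such as generators, decorators, and context managers to improve code efficiency and readability.
Overview of the core data science libraries:
  NumPy: Understand the central role of the multidimensional array (ndarray) and master array creation, indexing, slicing, arithmetic, and broadcasting, the foundation of efficient numerical computing.
  Pandas: Study the two core data structures, Series and DataFrame, and master reading data (CSV, Excel, SQL, and so on), cleaning (handling missing values, detecting outliers), transforming, merging, and grouping and aggregating. Numerous practical cases show how to use Pandas to process real-world datasets efficiently.
  Matplotlib and Seaborn: Visualization is key to understanding data and presenting insights. We start with basic plots (line charts, scatter plots, bar charts, pie charts), move on to more complex statistical charts (box plots, violin plots, heat maps, distribution plots), and learn how to customize charts to make them more expressive. Seaborn makes it easy to draw attractive, informative statistical graphics.
Jupyter Notebook/Lab and IDEs: Get familiar with interactive environments such as Jupyter Notebook and Jupyter Lab, which are well suited to data exploration, prototyping, and presenting results. We also introduce the use of mainstream IDEs such as VS Code for Python data science development.

Part 2: Data Preprocessing and Exploratory Data Analysis (EDA)
Raw data is often messy, incomplete, and noisy. This part focuses on turning raw data into clean data that is ready for analysis, and on discovering the patterns and insights hidden in it.
Data cleaning techniques:
  Handling missing values: Several strategies, such as deletion and imputation (mean, median, mode, model-based prediction), with an analysis of the strengths and weaknesses of each.
  Detecting and handling outliers: Identify outlying points and choose a suitable treatment for the business scenario (remove, transform, or keep).
  Data type conversion and normalization: Make sure data types are correct, and deal with text encoding problems, date and time formats, and similar issues.
  Handling duplicate data: Identify and remove duplicates effectively.
Feature engineering basics:
  Feature creation: Derive new features from existing ones, such as date decomposition and text feature extraction (bag-of-words model, TF-IDF).
  Feature encoding: Handle categorical variables with one-hot encoding and label encoding.
  Feature scaling: Understand the principles and use cases of standardization and normalization in preparation for model training.
Exploratory data analysis (EDA):
  Descriptive statistics: Compute the mean, variance, quantiles, skewness, kurtosis, and other statistics to understand how the data is distributed.
  Correlation analysis: Compute correlation coefficients between variables (Pearson, Spearman) to identify potential linear or monotonic relationships.
  Visualization-driven insight: Use charts to show data distributions, relationships between variables, and differences between groups, and to reveal patterns, trends, and anomalies. For example, use a scatter plot to examine the relationship between two numeric variables, box plots to compare numeric distributions across groups, and a heat map to display the correlation matrix between features.

Part 3: Building and Evaluating Machine Learning Models
This is the core of the book. We systematically study the mainstream machine learning algorithms and learn how to apply them to practical problems.
Supervised learning:
  Regression models:
    Linear regression: Understand how the model works, and learn about polynomial regression and regularization (Lasso, Ridge).
    Decision tree regression: Learn how a tree is grown, and understand overfitting and pruning.
    Ensemble learning (regression): Learn how bagging (random forests) and boosting (Gradient Boosting, XGBoost, LightGBM) work and why they predict so well.
  Classification models:
    Logistic regression: Understand its probabilistic model and decision boundary.
    K-nearest neighbors (KNN): Learn distance-based classification.
    Support vector machines (SVM): Master the use of the kernel trick for nonlinear classification.
    Naive Bayes: Understand its probabilistic reasoning and its application to text classification.
    Decision tree classification.
    Ensemble learning (classification): Random forests, XGBoost, and LightGBM applied to classification tasks.
Unsupervised learning:
  Clustering algorithms:
    K-Means: Learn how to discover clusters in data.
    DBSCAN: Identify clusters of arbitrary shape.
    Hierarchical clustering: Build a hierarchy of classes.
  Dimensionality reduction algorithms:
    Principal component analysis (PCA): Understand how it finds the directions of greatest variance in the data in order to reduce dimensionality.
    t-SNE: Learn how it is used to reduce high-dimensional data for visualization.
Model evaluation and selection:
  Regression metrics: MSE, RMSE, MAE, R-squared.
  Classification metrics: Accuracy, precision, recall, F1-score, the ROC curve and AUC, and the confusion matrix.
  Cross-validation: Understand k-fold cross-validation and related methods for checking that a model generalizes.
  Hyperparameter tuning: Grid search, random search, and other methods.
Scikit-learn in practice: Scikit-learn provides a unified API that makes all of the models above easy to use. We demonstrate how to load data, preprocess it, train a model, make predictions, and evaluate the results.

Part 4: Deep Learning Foundations and Applications
Deep learning has achieved breakthroughs in fields such as image recognition and natural language processing. This part introduces how it works.
Neural network basics:
  The perceptron and the multilayer perceptron (MLP): Understand how a neuron works and the role of activation functions.
  The backpropagation algorithm: Master the core mechanism of model training.
  Loss functions and optimizers: Learn how to measure model error and update the weights.
Mainstream deep learning frameworks:
  TensorFlow and Keras: Use the concise Keras API to build and train neural networks quickly.
  PyTorch: Learn about PyTorch's dynamic computation graph and flexibility.
Common deep learning models:
  Convolutional neural networks (CNN): Especially suited to image processing tasks; learn about convolutional layers, pooling layers, and more.
  Recurrent neural networks (RNN) and their variants (LSTM, GRU): Suited to sequence data such as text and time series.
Deep learning case studies: Practical cases, such as image classification and text sentiment analysis, show what deep learning can do.

Part 5: Model Deployment and Hands-On Projects
The ultimate goal of learning is to solve real problems. This part shows how to deploy a trained model in a real application and provides comprehensive projects that draw on the whole book.
Model persistence: Learn how to save a trained model (for example, with `pickle` or `joblib`).
Model deployment basics: An introduction to integrating a model into a web application (for example, with Flask or Django) or deploying it as an API service.
Hands-on projects:
  Case 1: House price prediction. Use linear regression, ensemble learning, and other models to predict prices from second-hand housing transaction data.
  Case 2: Customer churn prediction. Use logistic regression, SVM, random forests, and other models to identify customers who are likely to leave.
  Case 3: Image recognition (such as MNIST handwritten digit recognition). Use a CNN model for image classification.
  Case 4: Text sentiment analysis. Use naive Bayes, RNN, or Transformer models to analyze the sentiment of user reviews.

Features of the book:
Equal weight on theory and practice: Algorithms are explained accessibly, and many code examples and projects help readers turn theory into practical skills.
Gradual increase in difficulty: The book moves from Python basics and an introduction to data science to advanced models and deep learning, so it suits readers at different levels.
Close to real needs: Real-world datasets and application scenarios make the learning focused and practical.
Rich visualization: Charts are used throughout to explain concepts and show insights from data, making the material more intuitive.
High-quality code: The code provided is carefully designed and tested, and is easy to understand and reuse.

Who should read it:
Beginners who want to master the core skills of data science.
Students who want to study Python data analysis and machine learning systematically.
Software engineers and data analysts who want to improve their data processing and modeling skills.
Business analysts and product managers who want to apply data science to business scenarios.
Anyone interested in artificial intelligence and machine learning.

Python Data Science in Practice: From Fundamentals to Advanced is meant to be a lasting companion on your path through data science. By working through it, you will not only become proficient in using Python for data analysis and modeling but, more importantly, develop the ability to solve complex data problems independently and gain an edge in a data-driven future. Start your data exploration journey now!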
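
As a small illustration of the feature scaling topic listed above (standardization versus normalization), the following sketch uses only the Python standard library. It is not taken from the book: the function names `standardize` and `min_max_normalize` and the sample values are hypothetical, chosen only to show the two formulas side by side.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no range; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column (for example, floor areas in square meters).
areas = [50.0, 80.0, 120.0, 200.0]

z_scores = standardize(areas)
scaled = min_max_normalize(areas)

print("standardized:", [round(z, 3) for z in z_scores])
print("min-max normalized:", [round(s, 3) for s in scaled])
```

Standardization centers the column and puts it on a common spread, which tends to suit gradient-based and distance-based models, while min-max normalization keeps every value inside a fixed range of 0 to 1. In practice a library such as scikit-learn would usually do this work; the plain-Python version here only makes the arithmetic visible.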