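About the Book
Big Data analytics is the process of examining large and complex data sets that often exceed the computational capabilities at your disposal. R, a leading programming language of data science, includes many powerful functions for tackling the problems that arise in Big Data processing.
Big Data Analytics with R (《大数据分析:R语言实现》, English-language reprint edition) begins with a brief account of the Big Data field and its current industry standards. It then introduces the R language, covering its development, structure, real-world applications, and shortcomings, and follows with a review of the major R functions for data management and transformation. Readers are introduced to cloud-based Big Data solutions (for example, Amazon EC2 instances and Amazon RDS, or Microsoft Azure and its HDInsight clusters) and learn how to connect R to relational databases and to non-relational databases such as MongoDB and HBase. The book goes on to cover Big Data tools such as Apache Hadoop, HDFS, and MapReduce, as well as other R-compatible tools such as Apache Spark, its machine learning library Spark MLlib, and H2O.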
About the Author
Simon Walkowiak is a cognitive neuroscientist and a managing director of Mind Project Ltd - a Big Data and Predictive Analytics consultancy based in London, United Kingdom. As a former data curator at the UK Data Service (UKDS, University of Essex) - Europe's largest socio-economic data repository - Simon has extensive experience in processing and managing large-scale datasets such as censuses, sensor and smart meter data, telecommunication data, and well-known governmental and social surveys such as the British Social Attitudes survey, Labour Force surveys, Understanding Society, National Travel survey, and many other socio-economic datasets collected and deposited by Eurostat, World Bank, Office for National Statistics, Department of Transport, NatCen and International Energy Agency, to mention just a few. Simon has delivered numerous data science and R training courses at public institutions and international companies. He has also taught a course in Big Data Methods in R at major UK universities and at the prestigious Big Data and Analytics Summer School organized by the Institute of Analytics and Data Science (IADS).
Table of Contents
Preface
Chapter 1: The Era of Big Data
Big Data - The monster re-defined
Big Data toolbox - dealing with the giant
Hadoop - the elephant in the room
Databases
Hadoop Spark-ed up
R - The unsung Big Data hero
Summary
Chapter 2: Introduction to R Programming Language and Statistical Environment
Learning R
Revisiting R basics
Getting R and RStudio ready
Setting the URLs to R repositories
R data structures
Vectors
Scalars
Matrices
Arrays
Data frames
Lists
Exporting R data objects
Applied data science with R
Importing data from different formats
Exploratory Data Analysis
Data aggregations and contingency tables
Hypothesis testing and statistical inference
Tests of differences
Independent t-test example (with power and effect size estimates)
ANOVA example
Tests of relationships
An example of Pearson's r correlations
Multiple regression example
Data visualization packages
Summary
Chapter 3: Unleashing the Power of R from Within
Traditional limitations of R
Out-of-memory data
Processing speed
To the memory limits and beyond
Data transformations and aggregations with the ff and ffbase packages
Generalized linear models with the ff and ffbase packages
Logistic regression example with ffbase and biglm
Expanding memory with the bigmemory package
Parallel R
From bigmemory to faster computations
An apply() example with the big.matrix object
A for() loop example with the ffdf object
Using apply() and for() loop examples on a data.frame
A parallel package example
A foreach package example
The future of parallel processing in R
Utilizing Graphics Processing Units with R
Multi-threading with Microsoft R Open distribution
Parallel machine learning with H2O and R
Boosting R performance with the data.table package and other tools
Fast data import and manipulation with the data.table package
Data import with data.table
Lightning-fast subsets and aggregations on data.table
Chaining, more complex aggregations, and pivot tables with data.table
Writing better R code
Summary
Chapter 4: Hadoop and MapReduce Framework for R
Hadoop architecture
Hadoop Distributed File System
MapReduce framework
A simple MapReduce word count example
Other Hadoop native tools
Learning Hadoop
A single-node Hadoop in Cloud
Deploying Hortonworks Sandbox on Azure
A word count example in Hadoop using Java
A word count example in Hadoop using the R language
RStudio Server on a Linux RedHat/CentOS virtual machine
Installing and configuring RHadoop packages
HDFS management and MapReduce in R - a word count example
HDInsight - a multi-node Hadoop cluster on Azure
Creating your first HDInsight cluster
Creating a new Resource Group
Deploying a Virtual Network
Creating a Network Security Group
Setting up and configuring an HDInsight cluster
Starting the cluster and exploring Ambari
Connecting to the HDInsight cluster and installing RStudio Server
Adding a new inbound security rule for port 8787
Editing the Virtual Network's public IP address for the head node
Smart energy meter readings analysis example - using R on HDInsight cluster
Summary
Chapter 5: R with Relational Database Management Systems (RDBMSs)
Relational Database Management Systems (RDBMSs)
A short overview of used RDBMSs
Structured Query Language (SQL)
SQLite with R
Preparing and importing data into a local SQLite database
Connecting to SQLite from RStudio
MariaDB with R on an Amazon EC2 instance
Preparing the EC2 instance and RStudio Server for use
Preparing MariaDB and data for use
Working with MariaDB from RStudio
PostgreSQL with R on Amazon RDS
Launching an Amazon RDS database instance
Preparing and uploading data to Amazon RDS
Remotely querying PostgreSQL on Amazon RDS from RStudio
Summary
Chapter 6: R with Non-Relational (NoSQL) Databases
Introduction to NoSQL databases
Review of leading non-relational databases
MongoDB with R
Introduction to MongoDB
MongoDB data models
Installing MongoDB with R on Amazon EC2
Processing Big Data using MongoDB with R
Importing data into MongoDB and basic MongoDB commands
MongoDB with R using the rmongodb package
MongoDB with R using the RMongo package
MongoDB with R using the mongolite package
HBase with R
Azure HDInsight with HBase and RStudio Server
Importing the data to HDFS and HBase
Reading and querying HBase using the rhbase package
Summary
Chapter 7: Faster than Hadoop - Spark with R
Spark for Big Data analytics
Spark with R on a multi-node HDInsight cluster
Launching HDInsight with Spark and R/RStudio
Reading the data into HDFS and Hive
Getting the data into HDFS
Importing data from HDFS to Hive
Bay Area Bike Share analysis using SparkR
Summary
Chapter 8: Machine Learning Methods for Big Data in R
What is machine learning?
Supervised and unsupervised machine learning methods
Classification and clustering algorithms
Machine learning methods with R
Big Data machine learning tools
GLM example with Spark and R on the HDInsight cluster
Preparing the Spark cluster and reading the data from HDFS
Logistic regression in Spark with R
Naive Bayes with H2O on Hadoop with R
Running an H2O instance on Hadoop with R
Reading and exploring the data in H2O
Naive Bayes on H2O with R
Neural Networks with H2O on Hadoop with R
How do Neural Networks work?
Running Deep Learning models on H2O
Summary
Chapter 9: The Future of R - Big, Fast, and Smart Data
The current state of Big Data analytics with R
Out-of-memory data on a single machine
Faster data processing with R
Hadoop with R
Spark with R
R with databases
Machine learning with R
The future of R
Big Data
Fast data
Smart data
Where to go next
Summary
Index
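Exploring the Depths of Data: Unlocking the Secrets of Big Data Analytics
In an era of information explosion, data has become a core force driving innovation, better decision-making, and even social change. From massive volumes of sensor readings to complex social network interactions to precise scientific experimental results, large and messy data sets hold the patterns, insights, and value we hope to discover. Yet extracting meaningful information from this flood of "Big Data" and turning it into actionable knowledge is a challenging task. This book aims to give readers a systematic and practical methodology for mastering the key skills needed to handle Big Data and uncover its value.
This book is not a dry theoretical manual but a hands-on journey of exploration. It takes you deep into the core ideas of Big Data analytics and focuses on one of today's most popular and powerful open-source data science languages: R. With its rich statistical libraries, strong visualization capabilities, and active community support, R has become one of the tools of choice in Big Data analytics. Built around R, the book guides you step by step through a complete analysis workflow, from data acquisition and preprocessing to modeling and evaluation.
Data is the starting point of your journey. The early chapters focus on where data comes from and what form it takes. You will learn how to retrieve raw data efficiently from databases, from file formats such as CSV, JSON, and XML, and from web APIs. More importantly, you will learn how to explore and understand that data: examining its structure, identifying missing values and outliers, running descriptive statistics to summarize its distribution, and using visualization techniques such as histograms, scatter plots, and box plots to show the relationships within it. A thorough understanding of your data is the foundation of successful analysis.
Data cleaning and transformation are the necessary road to the truth. Raw data is often "dirty": full of errors, inconsistencies, and incomplete information. The book devotes considerable space to data cleaning and preprocessing techniques. You will learn how to handle missing values (for example, through imputation, deletion, or model-based prediction), how to detect and correct outliers, how to convert data types, how to merge, join, and reshape data sets, and how to encode categorical variables. The book also introduces feature engineering, including how to create new, more meaningful features from existing data to improve model performance. You will learn to use R's data manipulation packages, such as dplyr and tidyr, to turn tedious data operations into concise, elegant code.
Model building is the core of Big Data analytics. Once the data has been cleaned and prepared, you can begin building models to explore patterns in the data and make predictions. The book covers a range of modeling techniques commonly used in Big Data analytics, from classical statistical models to more modern machine learning algorithms.
Supervised learning: You will learn how to build regression models (such as linear regression, ridge regression, and Lasso regression) to predict continuous values, and classification models (such as logistic regression, decision trees, random forests, support vector machines, and k-nearest neighbors) to predict discrete classes. The book examines each model's principles, assumptions, strengths, and weaknesses, as well as its implementation in R.
Unsupervised learning: For data without an explicit target variable, unsupervised learning offers useful tools. You will learn cluster analysis (such as K-Means and hierarchical clustering) to discover natural groupings in the data, and dimensionality reduction techniques (such as principal component analysis (PCA) and t-SNE) to simplify high-dimensional data and reveal its underlying structure.
Time series analysis: For data with temporal dependence, such as stock prices, sales figures, or sensor readings, time series analysis is essential. The book introduces classical time series models such as ARIMA and exponential smoothing, and shows how to use R for time series forecasting and anomaly detection.
Model evaluation and optimization are key to a reliable analysis. Building a model is only the first step; evaluating its performance and improving it matter just as much. The book describes model evaluation metrics in detail, including accuracy, precision, recall, F1 score, the ROC curve, AUC, mean squared error (MSE), and root mean squared error (RMSE). You will learn how to use techniques such as cross-validation to obtain more reliable evaluation results and to guard against overfitting and underfitting. The book also discusses optimization strategies such as hyperparameter tuning and feature selection.
Visualization is the soul of Big Data analytics. Even the most sophisticated model and the deepest insight lose much of their value if they cannot be communicated clearly to others. Visualization is the key to turning data into a story that is easy to understand. The book focuses on R's visualization tools, such as ggplot2. You will learn how to create many types of charts, including scatter plots, line charts, bar charts, pie charts, heat maps, and geospatial maps, and how to adjust color, shape, size, and labels to convey information effectively and highlight key findings. With high-quality visualization you can let the data "speak," making it easier to share your findings with others and to drive decisions.
Practical applications and case studies are the real test of what you have learned. Theory needs practice to take hold. The book draws on a series of practical scenarios and uses concrete case studies to show how R can be applied to real-world Big Data analytics problems. These cases may include:
Marketing analytics: analyzing customer purchasing behavior, segmenting customers, and predicting customer churn.
Financial risk management: building credit scoring models, detecting fraud, and analyzing stock market trends.
Healthcare: analyzing disease incidence, predicting patient risk, and optimizing treatment plans.
Social media analytics: analyzing user sentiment, discovering trending topics, and predicting trends.
Internet of Things data analytics: monitoring device status in real time, predicting failures, and optimizing resource use.
Through these cases you will see first-hand how the knowledge and skills from the book apply to real work, and you will gain valuable practical experience.
The book is written for a broad readership. Beginners hoping to enter the field of Big Data analytics, working professionals who want to strengthen their R data analysis skills, and students interested in using data to solve complex problems will all benefit. Readers are assumed to know basic programming concepts but need no in-depth knowledge of R. The book starts from the basics and leads you step by step through the applications of R in data analysis.
To master Big Data analytics is to hold a key to understanding and shaping the future. This book offers you a key to the treasure held in data, guiding you through the fog of data to discover hidden patterns, see into the nature of things, and ultimately make wiser, more impactful decisions. Join us, begin your journey into Big Data analytics with R, and let data become your most powerful ally!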