Big Data Analysis with R
When your data grows in size, it is important to find the solution that best suits your needs. In this course we start by using databases to analyze data much larger than what plain R can hold in memory, and we progressively increase the data size until we understand how distributed architectures work and how to use Apache Hadoop and Apache Spark from R with sparklyr.
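As a first taste of the database-backed workflow covered early in the course, here is a minimal sketch of lazy, on-disk data manipulation with dplyr. It assumes the DBI, RSQLite, and dbplyr packages are installed and uses the built-in mtcars dataset purely for illustration:

```r
library(DBI)
library(dplyr)
# dbplyr must also be installed: it translates dplyr verbs into SQL

# An on-disk SQLite database stands in for a larger database engine
con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))
dbWriteTable(con, "mtcars", mtcars)

# Lazy operation: dplyr builds a SQL query, the data stays on disk
heavy_cars <- tbl(con, "mtcars") %>%
  filter(wt > 3) %>%
  select(mpg, cyl, wt)

show_query(heavy_cars)  # inspect the generated SQL
collect(heavy_cars)     # only now is the query executed and pulled into R

dbDisconnect(con)
```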
Topics include
- Introduction to data lakes
  - The fundamentals of database theory
  - Main technologies for data lakes
  - Manipulating data on disk
  - Lazy operations
  - Data Import and Manipulation
- Distributed Architectures
  - Principles
  - Cloudera HDFS
  - Hive and Apache Spark
  - sparklyr
  - Standalone testing mode and distributed mode
  - Data Import and Distributed Manipulation
  - Relational data and Joins
  - In-memory Caching
  - In-memory Distributed computation
  - Machine Learning Pipelines
  - Introduction to the main Machine Learning Algorithms
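Several of these topics come together in a typical sparklyr session. The following is a minimal sketch, assuming a local Spark installation (the standalone testing mode above) and using the built-in mtcars dataset for illustration:

```r
library(sparklyr)
library(dplyr)

# Local mode for standalone testing; in production, master would point
# at a cluster resource manager such as YARN
sc <- spark_connect(master = "local")

# Copy a small built-in dataset to Spark; real projects would read
# from HDFS or Hive instead
cars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Lazy operation: this builds a Spark SQL query but computes nothing yet
summary_tbl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# In-memory caching of the table for repeated queries
tbl_cache(sc, "mtcars")

# collect() triggers the distributed computation and returns the result to R
collect(summary_tbl)

spark_disconnect(sc)
```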
What you will be able to do
- Select the architecture that best suits your needs, choosing the right type of database or distributed architecture
- Import and manipulate data within that architecture
- Understand the fundamentals and limitations of the architecture, in terms of both the data and the kinds of analysis it supports
- Use the main Machine Learning algorithms on a distributed architecture
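As an illustration of the last point, here is a minimal sketch of a Spark ML pipeline driven from R with sparklyr; the local connection, the iris dataset, and the choice of logistic regression are illustrative assumptions, not a prescription:

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# copy_to() replaces dots in column names with underscores
# (Petal.Length becomes Petal_Length)
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# A Spark ML pipeline: formula-driven feature assembly, then a classifier
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ Petal_Length + Petal_Width) %>%
  ml_logistic_regression()

# The model is fitted on the Spark side, not in R's memory
model <- ml_fit(pipeline, iris_tbl)

spark_disconnect(sc)
```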
Duration
2 days.
Prerequisites
- Basics of R programming and the tidyverse
- Nice to have: dplyr or SQL basics
Audience
This course provides the foundations for analyzing and working with big datasets. It is also recommended for anyone who needs to understand how the size of the data affects the analysis process and its results, and consequently what benefits these technologies bring to the business.