High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark

Language: English

Authors: Holden Karau, Rachel Warren

Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.

With this book, you’ll explore:
. How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure
. The choice between data joins in Core Spark and Spark SQL
. Techniques for getting the most out of standard RDD transformations
. How to work around performance issues in Spark’s key/value pair paradigm
. Writing high-performance Spark code without Scala or the JVM
. How to test for functionality and performance when applying suggested improvements
. Using Spark MLlib and Spark ML machine learning libraries
. Spark’s Streaming components and external community packages

Chapter 1 - Introduction to High Performance Spark
. What Is Spark and Why Performance Matters
. What You Can Expect to Get from This Book
. Spark Versions
. Why Scala?
. Conclusion

Chapter 2 - How Spark Works
. How Spark Fits into the Big Data Ecosystem
. Spark Model of Parallel Computing: RDDs
. Spark Job Scheduling
. The Anatomy of a Spark Job
. Conclusion

Chapter 3 - DataFrames, Datasets, and Spark SQL
. Getting Started with the SparkSession (or HiveContext or SQLContext)
. Spark SQL Dependencies
. Basics of Schemas
. DataFrame API
. Data Representation in DataFrames and Datasets
. Data Loading and Saving Functions
. Datasets
. Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
. Query Optimizer
. Debugging Spark SQL Queries
. JDBC/ODBC Server
. Conclusion

Chapter 4 - Joins (SQL and Core)
. Core Spark Joins
. Spark SQL Joins
. Conclusion

Chapter 5 - Effective Transformations
. Narrow Versus Wide Transformations
. What Type of RDD Does Your Transformation Return?
. Minimizing Object Creation
. Iterator-to-Iterator Transformations with mapPartitions
. Set Operations
. Reducing Setup Overhead
. Reusing RDDs
. Conclusion

Chapter 6 - Working with Key/Value Data
. The Goldilocks Example
. Actions on Key/Value Pairs
. What’s So Dangerous About the groupByKey Function
. Choosing an Aggregation Operation
. Multiple RDD Operations
. Partitioners and Key/Value Data
. Dictionary of OrderedRDDOperations
. Secondary Sort and repartitionAndSortWithinPartitions
. Straggler Detection and Unbalanced Data
. Conclusion

Chapter 7 - Going Beyond Scala
. Beyond Scala within the JVM
. Beyond Scala, and Beyond the JVM
. Calling Other Languages from Spark
. The Future
. Conclusion

Chapter 8 - Testing and Validation
. Unit Testing
. Getting Test Data
. Property Checking with ScalaCheck
. Integration Testing
. Verifying Performance
. Job Validation
. Conclusion

Chapter 9 - Spark MLlib and ML
. Choosing Between Spark MLlib and Spark ML
. Working with MLlib
. Working with Spark ML
. General Serving Considerations
. Conclusion

Chapter 10 - Spark Components and Packages
. Stream Processing with Spark
. GraphX
. Using Community Packages and Libraries
. Conclusion

Appendix - Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist
. Spark Tuning and Cluster Sizing
. Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?
. Serialization Options

Paperback

Publication date: 06-2017

358 pages

18.1x23.3 cm

Available from the publisher (supply lead time: 12 days).

List price: €44.97
