Using Flume: Flexible, Scalable, and Reliable Data Streaming
Hari Shreedharan · O'Reilly Media, Incorporated; John Wiley & Sons, Limited [distributor] · O'Reilly Media, Beijing, 2014
English [en] · PDF · 3.9MB · 2014 · 📘 Book (non-fiction) · 🚀/lgli/lgrs/nexusstc/upload/zlib
description
How can you get your data from frontend servers to Hadoop in near real time? With this complete reference guide, you'll learn Flume's rich set of features for collecting, aggregating, and writing large amounts of streaming data to the Hadoop Distributed File System (HDFS), Apache HBase, SolrCloud, Elastic Search, and other systems. Using Flume shows operations engineers how to configure, deploy, and monitor a Flume cluster, and teaches developers how to write Flume plugins and custom components for their specific use cases. You'll learn about Flume's design and implementation, as well as the features that make it highly scalable, flexible, and reliable. Code examples and exercises are available on GitHub.
  • Learn how Flume provides a steady rate of flow by acting as a buffer between data producers and consumers
  • Dive into key Flume components, including sources that accept data and sinks that write and deliver it
  • Write custom plugins to customize the way Flume receives, modifies, formats, and writes data
  • Explore APIs for sending data to Flume agents from your own applications
  • Plan and deploy Flume in a scalable and flexible way, and monitor your cluster once it's running
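The source-channel-sink model the description refers to can be sketched with a minimal agent configuration in Flume's standard properties format. This is an illustrative sketch, not an excerpt from the book; the agent name, component names, and port are hypothetical:

```properties
# Hypothetical minimal Flume agent: netcat source -> memory channel -> logger sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = snk1

# Source: accepts newline-separated events over TCP
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer between data producer and consumer
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: logs each event, handy for smoke-testing a flow
agent1.sinks.snk1.type = logger
agent1.sinks.snk1.channel = ch1
```

Started with something like `flume-ng agent --name agent1 --conf-file agent1.properties`, the channel absorbs bursts from the source so the sink can drain at its own steady rate, which is the buffering behavior the description highlights.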
Alternative filename
nexusstc/Using Flume Flexible, Scalable, and Reliable Data Streaming/9f366d6e0cfab4fe804d7e201b48a6cb.pdf
Alternative filename
lgli/Hari Shreedharan;Using Flume Flexible, Scalable, and Reliable Data Streaming;;;O'Reilly Media;2014;;;English.pdf
Alternative filename
lgrsnf/Hari Shreedharan;Using Flume Flexible, Scalable, and Reliable Data Streaming;;;O'Reilly Media;2014;;;English.pdf
Alternative filename
zlib/no-category/Hari Shreedharan/Using Flume Flexible, Scalable, and Reliable Data Streaming_3337956.pdf
Alternative title
Using Flume : Stream Data into Hdfs and Hbase
Alternative title
Using Flume, First Edition
Alternative author
Shreedharan, Hari
Alternative publisher
Oreilly & Associates Inc
Alternative edition
United States, United States of America
Alternative edition
First edition, Sebastopol, CA, 2015
Alternative edition
Cambridge, Chichester, 2014
Alternative edition
1st ed, Beijing, 2015
Alternative edition
1, FR, 2014
metadata comments
lg2096071
metadata comments
producers:
calibre 2.26.0 [http://calibre-ebook.com]
metadata comments
{"publisher":"O'Reilly Media"}
metadata comments
Type: Book
metadata comments
Publication date: 2014.10
Alternative description
Looking to use Apache Flume to stream data to Hadoop? This complete reference guide shows operations engineers how to configure, deploy, and monitor a Flume cluster, and teaches developers how to write Flume plugins and custom components for their specific use cases.
Foreword 7
Preface 10
Conventions Used in This Book 11
Using Code Examples 12
Safari® Books Online 14
How to Contact Us 15
Acknowledgments 16
1. Apache Hadoop and Apache HBase: An Introduction 18
HDFS 19
HDFS Data Formats 21
Processing Data on HDFS 22
Apache HBase 23
Summary 25
References 26
2. Streaming Data Using Apache Flume 28
The Need for Flume 29
Is Flume a Good Fit? 31
Inside a Flume Agent 32
Configuring Flume Agents 35
Getting Flume Agents to Talk to Each Other 40
Complex Flows 40
Replicating Data to Various Destinations 44
Dynamic Routing 45
Flume’s No Data Loss Guarantee, Channels, and Transactions 46
Transactions in Flume Channels 48
Agent Failure and Data Loss 51
The Importance of Batching 54
What About Duplicates? 56
Running a Flume Agent 57
Summary 60
References 60
3. Sources 62
Lifecycle of a Source 63
Sink-to-Source Communication 65
Avro Source 66
Thrift Source 71
Failure Handling in RPC Sources 72
HTTP Source 73
Writing Handlers for the HTTP Source* 75
Spooling Directory Source 79
Reading Custom Formats Using Deserializers* 82
Spooling Directory Source Performance 87
Syslog Sources 87
Exec Source 91
JMS Source 94
Converting JMS Messages into Flume Events* 97
Writing Your Own Sources* 98
Event-Driven and Pollable Sources 99
Developing pollable sources 100
Building event-driven sources 102
Summary 105
References 106
4. Channels 108
Transaction Workflow 109
Channels Bundled with Flume 112
Memory Channel 113
File Channel 116
Design and implementation of the File Channel* 121
Summary 122
References 123
5. Sinks 125
Lifecycle of a Sink 126
Optimizing the Performance of Sinks 128
Writing to HDFS: The HDFS Sink 129
Understanding Buckets 131
Configuring the HDFS Sink 134
Controlling the Data Format Using Serializers* 142
HBase Sinks 147
Translating Flume Events to HBase Puts and Increments Using Serializers* 150
RPC Sinks 154
Avro Sink 155
Thrift Sink 158
Morphline Solr Sink 159
Elastic Search Sink 162
Customizing the Data Format* 164
Other Sinks: Null Sink, Rolling File Sink, Logger Sink 167
Writing Your Own Sink* 169
Summary 173
References 173
6. Interceptors, Channel Selectors, Sink Groups, and Sink Processors 175
Interceptors 176
Timestamp Interceptor 177
Host Interceptor 178
Static Interceptor 179
Regex Filtering Interceptor 180
Morphline Interceptor 181
UUID Interceptor 182
Writing Interceptors* 184
Channel Selectors 188
Replicating Channel Selector 189
Multiplexing Channel Selector 190
Custom Channel Selectors* 193
Sink Groups and Sink Processors 195
Load-Balancing Sink Processor 197
Writing sink selectors* 200
Failover Sink Processor 200
Summary 204
References 204
7. Getting Data into Flume* 206
Building Flume Events 207
Flume Client SDK 209
Building Flume RPC Clients 210
RPC Client Interface 211
Configuration Parameters Common to All RPC Clients 212
Default RPC Client 217
Load-Balancing RPC Client 221
Writing your own host selector* 224
Failover RPC Client 224
Thrift RPC Client 225
Embedded Agent 226
Configuring an Embedded Agent 230
log4j Appenders 234
Load-Balancing log4j Appender 236
Summary 237
References 238
8. Planning, Deploying, and Monitoring Flume 240
Planning a Flume Deployment 241
Time to Repair 242
How Much Capacity Do I Need in My Flume Channels? 244
How Many Tiers? 245
How do you know if Flume is not scaling or if the destination storage system or index is slow? 248
Sending Data over Cross–Data Center Links 248
Sharding Tiers 250
Deploying Flume 252
Deploying Custom Code 253
Monitoring Flume 254
Reporting Metrics from Custom Components 257
Summary 258
References 259
Index 261
Alternative description
How can you get your data from frontend servers to Hadoop in near real time? With this complete reference guide, you'll learn Flume's rich set of features for collecting, aggregating, and writing large amounts of streaming data to the Hadoop Distributed File System (HDFS), Apache HBase, SolrCloud, Elastic Search, and other systems. Using Flume shows operations engineers how to configure, deploy, and monitor a Flume cluster, and teaches developers how to write Flume plugins and custom components for their specific use cases. You'll learn about Flume's design and implementation, as well as the features that make it highly scalable, flexible, and reliable. Code examples and exercises are available on GitHub. Learn how Flume provides a steady rate of flow by acting as a buffer between data producers and consumers. Dive into key Flume components, including sources that accept data and sinks that write and deliver it. Write custom plugins to customize the way Flume receives, modifies, formats, and writes data. Explore APIs for sending data to Flume agents from your own applications. Plan and deploy Flume in a scalable and flexible way, and monitor your cluster once it's running.
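The "APIs for sending data to Flume agents from your own applications" that the description mentions include Flume's HTTP source, whose default JSON handler accepts a JSON array of events, each with string headers and a string body. A small Python helper can build such a payload; the host, port, and header values below are hypothetical, and this is a sketch rather than code from the book:

```python
import json

def build_flume_events(bodies, headers=None):
    """Build the JSON payload accepted by Flume's HTTP source
    (default JSON handler): a list of events, each a dict with
    string-valued "headers" and a string "body"."""
    headers = headers or {}
    return json.dumps([{"headers": headers, "body": b} for b in bodies])

# POSTing the payload to a running agent's HTTP source (hypothetical
# address) would look like:
#   req = urllib.request.Request(
#       "http://flume-agent:5140", data=payload.encode("utf-8"),
#       headers={"Content-Type": "application/json"})
#   urllib.request.urlopen(req)
payload = build_flume_events(["event one", "event two"], {"host": "web01"})
```

Batching several bodies into one request matters here: as the book's chapter on batching discusses, each request is committed to the channel as one transaction, so larger batches amortize the per-transaction cost.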
date open sourced
2017-08-26