Chapter 13. Creating Big Data Pipelines with Spring Batch and Spring Integration



contents of a local directory get filled up with rolled-over logfiles, usually following a

naming convention such as myapp-timestamp.log. It is also common that logfiles are

being continuously created on remote machines, such as a web farm, and need to be

transferred to a separate machine and loaded into HDFS. We can implement these use

cases by using Spring Integration in combination with Spring for Apache Hadoop.

In this section, we will provide a brief introduction to Spring Integration and then

implement an application for each of the use cases just described. In addition, we will

show how Spring Integration can be used to process and load into HDFS data that

comes from an event stream. Lastly, we will show the features available in Spring Integration that enable rich runtime management of these applications through JMX

(Java management extensions) and over HTTP.



An Introduction to Spring Integration

Spring Integration is an open source Apache 2.0 licensed project, started in 2007, that

supports writing applications based on established enterprise integration patterns.

These patterns provide the key building blocks to develop integration applications that

tie new and existing systems together. The patterns are based upon a messaging model

in which messages are exchanged within an application as well as between external

systems. Adopting a messaging model brings many benefits, such as logical as well as physical decoupling between components; the consumer of messages

does not need to be directly aware of the producer. This decoupling makes it easier to

build integration applications, as they can be developed by assembling individual

building blocks together. The messaging model also makes it easier to test the application, since individual blocks can be tested first in isolation from other components.

This allows bugs to be found earlier in the development process rather than later during

distributed system testing, where tracking down the root cause of a failure can be very

difficult. The key building blocks of a Spring Integration application and how they

relate to each other are shown in Figure 13-1.



Figure 13-1. Building blocks of a Spring Integration application



Endpoints are producers or consumers of messages that are connected through channels. A message is a simple data structure that contains key/value pairs in a header and

an arbitrary object type for the payload. Endpoints can be adapters that communicate

with external systems such as email, FTP, TCP, JMS, RabbitMQ, or syslog, but can

also be operations that act on a message as it moves from one channel to another.
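To make the message structure concrete, here is a minimal sketch, not taken from the book's sample code, that builds and inspects a message using Spring Integration's MessageBuilder. The package names assume the Spring Integration 2.x line current when this chapter was written, and the "source" header is purely illustrative.

import org.springframework.integration.Message;
import org.springframework.integration.support.MessageBuilder;

public class MessageExample {

    public static void main(String[] args) {
        // A message pairs an arbitrary payload object with key/value headers.
        // The "source" header is a made-up example, not a standard header.
        Message<String> message = MessageBuilder
                .withPayload("/opt/application/logs/file_1.txt")
                .setHeader("source", "file-poller")
                .build();

        // Consumers read the payload and headers without knowing the producer.
        System.out.println("payload = " + message.getPayload());
        System.out.println("source  = " + message.getHeaders().get("source"));
    }
}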




Common messaging operations that are supported in Spring Integration are routing to

one or more channels based on the headers of a message, transforming the payload from

a string to a rich data type, and filtering messages so that only those that pass the filter

criteria are passed along to a downstream channel. Figure 13-2 is an example taken

from a joint Spring/C24 project in the financial services industry that shows the type

of data processing pipelines that can be created with Spring Integration.



Figure 13-2. A Spring Integration processing pipeline



This diagram shows financial trade messages being received on the left via three RabbitMQ adapters that correspond to three external sources of trade data. The messages

are then parsed, validated, and transformed into a canonical data format. Note that

this format is not required to be XML and is often a POJO. The message header is then

enriched, and the trade is stored into a relational database and also passed into a filter.

The filter selects only high-value trades, which are subsequently placed into a GemFire-based data grid where real-time processing can occur. We can define this processing pipeline declaratively using XML or Scala; while most of the application can be configured declaratively, any components that you need to write yourself are POJOs that
can be easily unit-tested.
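To illustrate that point, a component such as the high-value trade filter from Figure 13-2 could be written as a plain POJO along the following lines. This is a hypothetical sketch rather than code from the book's sample project; the Trade class and the threshold are assumptions made only to keep the example self-contained.

import java.math.BigDecimal;

// Hypothetical trade type, defined here only to keep the example self-contained.
class Trade {
    private final BigDecimal notional;
    Trade(BigDecimal notional) { this.notional = notional; }
    BigDecimal getNotional() { return notional; }
}

// A POJO filter that could be referenced from <int:filter ref="..." method="accept"/>.
// It has no dependency on Spring Integration and is trivial to unit-test.
public class HighValueTradeFilter {

    private final BigDecimal threshold;

    public HighValueTradeFilter(BigDecimal threshold) {
        this.threshold = threshold;
    }

    public boolean accept(Trade trade) {
        return trade.getNotional().compareTo(threshold) > 0;
    }
}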

In addition to endpoints, channels, and messages, another key component of Spring

Integration is its management functionality. You can easily expose all components in

a data pipeline via JMX, where you can perform operations such as stopping and starting adapters. The control bus component allows you to send in small fragments of

code—for example, using Groovy—that can take complex actions to modify the state




of the system, such as changing filter criteria or starting and stopping adapters. The

control bus is then connected to a middleware adapter so it can receive code to execute;

HTTP and message-oriented middleware adapters are common choices.

We will not be able to dive into the inner workings of Spring Integration in great depth,

nor cover every feature of the adapters that are used, but you should end up with a

good feel for how you can use Spring Integration in conjunction with Spring for Apache

Hadoop to create very rich data pipeline solutions. The example applications developed

here contain some custom code for working with HDFS that is planned to be incorporated into the Spring Integration project. For additional information on Spring Integration, consult the project website, which contains links to extensive reference documentation, sample applications, and several books on Spring Integration.



Copying Logfiles

Copying logfiles into Hadoop as they are continuously generated is a common task.

We will create two applications that continuously load generated logfiles into HDFS.

One application will use an inbound file adapter to poll a directory for files, and the

other will poll an FTP site. The outbound adapter writes to HDFS, and its implementation uses the FsShell class provided by Spring for Apache Hadoop, which was described in “Scripting HDFS on the JVM” on page 187. The diagram for this data pipeline

is shown in Figure 13-3.



Figure 13-3. A Spring Integration data pipeline that polls a directory for files and copies them into

HDFS



The file inbound adapter is configured with the directory to poll for files as well as the

filename pattern that determines what files will be detected by the adapter. These values

are externalized into a properties file so they can easily be changed across different

runtime environments. The adapter uses a poller to check the directory since the filesystem is not an event-driven source. There are several ways you can configure the

poller, but the most common are to use a fixed delay, a fixed rate, or a cron expression.

In this example, we do not make use of any additional operations in the pipeline that

would sit between the two adapters, but we could easily add that functionality if required. The configuration file for this data pipeline is shown in Example 13-1.






Example 13-1. Defining a data pipeline that polls for files in a directory and loads them into HDFS



<context:property-placeholder location="hadoop.properties,polling.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
</hdp:configuration>

<int-file:inbound-channel-adapter
    channel="filesIn"
    directory="${polling.directory}"
    filename-pattern="${polling.fileNamePattern}">
    <int:poller fixed-delay="${polling.fixedDelay}"/>
</int-file:inbound-channel-adapter>

<int:channel id="filesIn"/>

<int:outbound-channel-adapter
    channel="filesIn"
    ref="fsShellWritingMessagingHandler"/>

<bean id="fsShellWritingMessagingHandler"
    class="com.oreilly.springdata.hadoop.filepolling.FsShellWritingMessageHandler">
    <constructor-arg value="${polling.destinationHdfsDirectory}"/>
    <constructor-arg ref="hadoopConfiguration"/>
</bean>

The relevant configuration parameters for the pipeline are externalized in the polling.properties file, as shown in Example 13-2.

Example 13-2. The externalized properties for polling a directory and loading them into HDFS

polling.directory=/opt/application/logs

polling.fixedDelay=5000

polling.fileNamePattern=*.txt

polling.destinationHdfsDirectory=/data/application/logs



This configuration will poll the directory /opt/application/logs every five seconds and

look for files that match the pattern *.txt. By default, duplicate files are prevented when

we specify a filename-pattern; the state is kept in memory. A future enhancement of

the file adapter is to persistently store this application state. The FsShellWritingMessageHandler class is responsible for copying the file into HDFS using FsShell's copyFromLocal method. If you want to remove the files from the polling directory after the transfer, set the deleteSourceFiles property on FsShellWritingMessageHandler to true. You can also lock files to prevent them from being picked up concurrently when more than one process is reading from the same directory. See the Spring Integration reference guide for more information.
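FsShellWritingMessageHandler itself is custom sample code rather than part of Spring Integration, so the following is only a simplified sketch of how such a handler might be written; the constructor arguments and field names are assumptions made for illustration.

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.springframework.data.hadoop.fs.FsShell;
import org.springframework.integration.Message;
import org.springframework.integration.core.MessageHandler;

public class FsShellWritingMessageHandler implements MessageHandler {

    private final String destinationDirectory;
    private final FsShell fsShell;
    private volatile boolean deleteSourceFiles = false;

    // Assumed constructor shape: a destination HDFS directory plus the Hadoop configuration.
    public FsShellWritingMessageHandler(String destinationDirectory, Configuration configuration) {
        this.destinationDirectory = destinationDirectory;
        this.fsShell = new FsShell(configuration);
    }

    public void setDeleteSourceFiles(boolean deleteSourceFiles) {
        this.deleteSourceFiles = deleteSourceFiles;
    }

    public void handleMessage(Message<?> message) {
        // The file inbound adapter produces messages with a java.io.File payload.
        File file = (File) message.getPayload();
        String destination = destinationDirectory + "/" + file.getName();

        // Copy the local file into HDFS using FsShell's copyFromLocal.
        fsShell.copyFromLocal(file.getAbsolutePath(), destination);

        // Optionally remove the source file once it has been transferred.
        if (deleteSourceFiles) {
            file.delete();
        }
    }
}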

To build and run this application, use the commands shown in Example 13-3.






Example 13-3. Command to build and run the file polling example

$ cd hadoop/file-polling

$ mvn clean package appassembler:assemble

$ sh ./target/appassembler/bin/filepolling



The relevant parts of the output are shown in Example 13-4.

Example 13-4. Output from running the file polling example

03:48:44.187 [main] INFO

c.o.s.hadoop.filepolling.FilePolling - File Polling Application Running

03:48:44.191 [task-scheduler-1] DEBUG o.s.i.file.FileReadingMessageSource - \

Added to queue: [/opt/application/logs/file_1.txt]

03:48:44.215 [task-scheduler-1] INFO o.s.i.file.FileReadingMessageSource - \

Created message: [[Payload=/opt/application/logs/file_1.txt]

03:48:44.215 [task-scheduler-1] DEBUG o.s.i.e.SourcePollingChannelAdapter - \

Poll resulted in Message: [Payload=/opt/application/logs/file_1.txt]

03:48:44.215 [task-scheduler-1] DEBUG o.s.i.channel.DirectChannel - \

preSend on channel 'filesIn', message: [Payload=/opt/application/logs/file_1.txt]

03:48:44.310 [task-scheduler-1] INFO c.o.s.h.f.FsShellWritingMessageHandler - \

sourceFile = /opt/application/logs/file_1.txt

03:48:44.310 [task-scheduler-1] INFO c.o.s.h.f.FsShellWritingMessageHandler - \

resultFile = /data/application/logs/file_1.txt

03:48:44.462 [task-scheduler-1] DEBUG o.s.i.channel.DirectChannel - \

postSend (sent=true) on channel 'filesIn', \

message: [Payload=/opt/application/logs/file_1.txt]

03:48:49.465 [task-scheduler-2] DEBUG o.s.i.e.SourcePollingChannelAdapter - \

Poll resulted in Message: null

03:48:49.465 [task-scheduler-2] DEBUG o.s.i.e.SourcePollingChannelAdapter - \

Received no Message during the poll, returning 'false'

03:48:54.466 [task-scheduler-1] DEBUG o.s.i.e.SourcePollingChannelAdapter - \

Poll resulted in Message: null

03:48:54.467 [task-scheduler-1] DEBUG o.s.i.e.SourcePollingChannelAdapter - \

Received no Message during the poll, returning 'false'



In this log, we can see that the first time around the poller detects the one file that was

in the directory and then afterward considers it processed, so the file inbound adapter

does not process it a second time. FsShellWritingMessageHandler has additional options to generate an extra directory path containing an embedded date or a UUID (universally unique identifier). To give the output an additional dated directory path using the default path format (year/month/day/hour/minute/second), set the generateDestinationDirectory property to true. With generateDestinationDirectory set to true, the file is written into HDFS as shown in Example 13-5.

Example 13-5. Partial output from running the file polling example with generateDestinationDirectory set to true

03:48:44.187 [main] INFO c.o.s.hadoop.filepolling.FilePolling - \

File Polling Application Running

...

04:02:32.843 [task-scheduler-1] INFO c.o.s.h.f.FsShellWritingMessageHandler - \






sourceFile = /opt/application/logs/file_1.txt

04:02:32.843 [task-scheduler-1] INFO c.o.s.h.f.FsShellWritingMessageHandler - \

resultFile = /data/application/logs/2012/08/09/04/02/32/file_1.txt



Another way to move files into HDFS is to collect them via FTP from remote machines,

as illustrated in Figure 13-4.



Figure 13-4. A Spring Integration data pipeline that polls an FTP site for files and copies them into

HDFS



The configuration in Example 13-6 is similar to the one for file polling; only the configuration of the inbound adapter is changed.

Example 13-6. Defining a data pipeline that polls for files on an FTP site and loads them into HDFS



<context:property-placeholder location="hadoop.properties,ftp.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
</hdp:configuration>

<bean id="ftpClientFactory"
    class="org.springframework.integration.ftp.session.DefaultFtpSessionFactory">
    <property name="host" value="${ftp.host}"/>
    <property name="port" value="${ftp.port}"/>
    <property name="username" value="${ftp.username}"/>
    <property name="password" value="${ftp.password}"/>
</bean>

<int-ftp:inbound-channel-adapter
    channel="filesIn"
    cache-sessions="false"
    session-factory="ftpClientFactory"
    filename-pattern="*.txt"
    auto-create-local-directory="true"
    delete-remote-files="false"
    remote-directory="${ftp.remoteDirectory}"
    local-directory="${ftp.localDirectory}">
    <int:poller fixed-delay="${ftp.fixedDelay}"/>
</int-ftp:inbound-channel-adapter>

<int:channel id="filesIn"/>

<int:outbound-channel-adapter
    channel="filesIn" ref="fsShellWritingMessagingHandler"/>

<bean id="fsShellWritingMessagingHandler"
    class="com.oreilly.springdata.hadoop.ftp.FsShellWritingMessageHandler">
    <constructor-arg value="${ftp.destinationHdfsDirectory}"/>
    <constructor-arg ref="hadoopConfiguration"/>
</bean>












You can build and run this application using the commands shown in Example 13-7.

Example 13-7. Commands to build and run the FTP example

$ cd hadoop/ftp

$ mvn clean package appassembler:assemble

$ sh ./target/appassembler/bin/ftp



The configuration assumes there is a testuser account on the FTP host machine. Once

you place a file in the outgoing FTP directory, you will see the data pipeline in action,

copying the file to a local directory and then copying it into HDFS.



Event Streams

Streams are another common source of data that you might want to store in HDFS, optionally performing real-time analysis as the data flows into the system. To meet this need, Spring Integration provides several inbound adapters that we can use to process streams of data. Once inside a Spring Integration application, the data can be passed through a processing

chain and stored into HDFS. The pipeline can also take parts of the stream and write

data to other databases, both relational and NoSQL, in addition to forwarding the

stream to other systems using one of the many outbound adapters. Figure 13-2 showed

one example of this type of data pipeline. Next, we will use the TCP (Transmission

Control Protocol) and UDP (User Datagram Protocol) inbound adapters to consume

data produced by syslog and then write the data into HDFS.

The configuration that sets up a TCP-syslog-to-HDFS processing chain is shown in

Example 13-8.

Example 13-8. Defining a data pipeline that receives syslog data over TCP and loads it into HDFS





<context:property-placeholder location="hadoop.properties,streaming.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
</hdp:configuration>

<int-ip:tcp-connection-factory id="syslogListener"
    type="server"
    port="${syslog.tcp.port}"
    deserializer="lfDeserializer"/>

<bean id="lfDeserializer"
    class="com.oreilly.springdata.integration.ip.syslog.ByteArrayLfSerializer"/>

<int-ip:tcp-inbound-channel-adapter
    channel="syslogChannel"
    connection-factory="syslogListener"/>

<int:channel id="syslogChannel"/>

<int:chain input-channel="syslogChannel">
    <int:transformer ref="syslogToMapTransformer"/>
    <int:object-to-string-transformer/>
    <int:outbound-channel-adapter ref="hdfsWritingMessageHandler"/>
</int:chain>

<bean id="syslogToMapTransformer"
    class="com.oreilly.springdata.integration.ip.syslog.SyslogToMapTransformer"/>

<bean id="hdfsWritingMessageHandler"
    class="com.oreilly.springdata.hadoop.streaming.HdfsWritingMessageHandler">
    <constructor-arg>
        <bean class="com.oreilly.springdata.hadoop.streaming.HdfsTextFileWriterFactory">
            <constructor-arg ref="hadoopConfiguration"/>
            <property name="basePath" value="${syslog.hdfs.basePath}"/>
            <property name="baseFilename" value="${syslog.hdfs.baseFilename}"/>
            <property name="fileSuffix" value="${syslog.hdfs.fileSuffix}"/>
            <property name="rolloverThresholdInBytes"
                value="${syslog.hdfs.rolloverThresholdInBytes}"/>
        </bean>
    </constructor-arg>
</bean>

The relevant configuration parameters for the pipeline are externalized in the streaming.properties file, as shown in Example 13-9.

Example 13-9. The externalized properties for streaming data from syslog into HDFS

syslog.tcp.port=1514

syslog.udp.port=1513

syslog.hdfs.basePath=/data/

syslog.hdfs.baseFilename=syslog

syslog.hdfs.fileSuffix=log

syslog.hdfs.rolloverThresholdInBytes=500



The diagram for this data pipeline is shown in Figure 13-5.

This configuration will create a connection factory that listens for an incoming TCP

connection on port 1514. The serializer segments the incoming byte stream based on

the newline character in order to break up the incoming syslog stream into events. Note

that this lower-level serializer configuration will be encapsulated in a syslog XML

namespace in the future so as to simplify the configuration. The inbound channel

adapter takes the syslog message off the TCP data stream and parses it into a byte array,

which is set as the payload of the incoming message.
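The ByteArrayLfSerializer referenced in the configuration is custom code from the sample application. A minimal sketch of the deserialization side of that idea, reading bytes until a newline and returning them as one event, might look like the following; the class shown here is hypothetical.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.springframework.core.serializer.Deserializer;

public class ByteArrayLfDeserializer implements Deserializer<byte[]> {

    // Read bytes from the TCP stream until a newline is seen, then return the
    // accumulated bytes as one syslog event.
    public byte[] deserialize(InputStream inputStream) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int b;
        while ((b = inputStream.read()) != -1) {
            if (b == '\n') {
                return buffer.toByteArray();
            }
            buffer.write(b);
        }
        if (buffer.size() == 0) {
            throw new IOException("Stream closed before a complete event was read");
        }
        return buffer.toByteArray();
    }
}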






Figure 13-5. A Spring Integration data pipeline that streams data from syslog into HDFS



Spring Integration’s chain component groups together a sequence of endpoints without

our having to explicitly declare the channels that connect them. The first element in

the chain parses the byte[] array and converts it to a java.util.Map containing the key/

value pairs of the syslog message. At this stage, you could perform additional operations

on the data, such as filtering, enrichment, real-time analysis, or routing to other databases. In this example, we have simply transformed the payload (now a Map) to a

String using the built-in object-to-string transformer. This string is then passed into

the HdfsWritingMessageHandler, which writes the data into HDFS. HdfsWritingMessageHandler lets you configure the HDFS directory to write the files to, the file naming policy, and the file size rollover policy. In this example, the rollover threshold was set artificially low (500 bytes versus the 10 MB default) to highlight the rollover capabilities in a simple test use case.
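The SyslogToMapTransformer is likewise custom sample code. A deliberately simplified sketch of the transformation step, handling only the priority, tag, and message portions of a syslog line (the parsing details are assumptions, and host and timestamp handling are omitted), could look like this:

import java.util.LinkedHashMap;
import java.util.Map;

public class SimpleSyslogToMapTransformer {

    // Turns a raw line such as "<30>TESTING: Test Syslog Message" into a Map of
    // fields. Real syslog lines also carry a timestamp and host, which the actual
    // transformer extracts but this sketch ignores.
    public Map<String, Object> transform(byte[] payload) {
        String line = new String(payload);
        Map<String, Object> event = new LinkedHashMap<String, Object>();

        int end = line.indexOf('>');
        if (line.startsWith("<") && end > 1) {
            int priority = Integer.parseInt(line.substring(1, end));
            event.put("FACILITY", priority >> 3);  // high bits encode the facility
            event.put("SEVERITY", priority & 0x7); // low three bits encode the severity
            line = line.substring(end + 1);
        }

        int colon = line.indexOf(':');
        if (colon > 0) {
            event.put("TAG", line.substring(0, colon).trim());
            event.put("MESSAGE", line.substring(colon + 1).trim());
        } else {
            event.put("MESSAGE", line.trim());
        }
        return event;
    }
}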

To build and run this application, use the commands shown in Example 13-10.

Example 13-10. Commands to build and run the Syslog streaming example

$ cd hadoop/streaming

$ mvn clean package appassembler:assemble

$ sh ./target/appassembler/bin/streaming



To send a test message, use the logger utility demonstrated in Example 13-11.

Example 13-11. Sending a message to syslog

$ logger -p local3.info -t TESTING "Test Syslog Message"



Since we set HdfsWritingMessageHandler’s rolloverThresholdInBytes property so low,

after sending a few of these messages or just waiting for messages to come in from the

operating system, you will see inside HDFS the files shown in Example 13-12.

Example 13-12. Syslog data in HDFS

$ hadoop dfs -ls /data
-rw-r--r--   3 mpollack supergroup   711 2012-08-09 13:19 /data/syslog-0.log
-rw-r--r--   3 mpollack supergroup   202 2012-08-09 13:22 /data/syslog-1.log
-rw-r--r--   3 mpollack supergroup   240 2012-08-09 13:22 /data/syslog-2.log
-rw-r--r--   3 mpollack supergroup   119 2012-08-09 15:04 /data/syslog-3.log






$ hadoop dfs -cat /data/syslog-2.log

{HOST=ubuntu, MESSAGE=Test Syslog Message, SEVERITY=6, FACILITY=19, \

TIMESTAMP=Thu Aug 09 13:22:44 EDT 2012, TAG=TESTING}

{HOST=ubuntu, MESSAGE=Test Syslog Message, SEVERITY=6, FACILITY=19, \

TIMESTAMP=Thu Aug 09 13:22:55 EDT 2012, TAG=TESTING}



To use UDP instead of TCP, remove the TCP-related definitions and add the configuration shown in Example 13-13.

Example 13-13. Configuration to use UDP to consume syslog data


channel="syslogChannel" port="${syslog.udp.port}"/>



Event Forwarding

When you need to process a large amount of data from several different machines, it

can be useful to forward the data from where it is produced to another server (as opposed to processing the data locally). The TCP inbound and outbound adapters can

be paired together in an application so that they forward data from one server to another. The channel that connects the two adapters can be backed by several persistent

message stores. Message stores are represented in Spring Integration by the interface

MessageStore, and implementations are available for JDBC, Redis, MongoDB, and

GemFire. Pairing inbound and outbound adapters together in an application affects

the message processing flow such that the message is persisted in the message store of

the producer application before the message is sent to the consumer application. The

message is removed from the producer’s message store once the acknowledgment from

the consumer is received. The consumer sends its acknowledgment once it has successfully put the received message in its own message-store-backed channel. This

configuration provides an additional "store and forward" guarantee over plain TCP, of the kind normally found in messaging middleware such as JMS or RabbitMQ.

Example 13-14 is a simple demonstration of forwarding TCP traffic and using Spring’s

support to easily bootstrap an embedded HSQL database to serve as the message store.

Example 13-14. Store and forwarding of data across processes using TCP adapters

<jdbc:embedded-database id="messageStoreDataSource" type="HSQL"/>

<bean id="messageStore"
    class="org.springframework.integration.jdbc.JdbcMessageStore">
    <constructor-arg ref="messageStoreDataSource"/>
</bean>

<int:channel id="dataChannel">
    <int:queue message-store="messageStore"/>
</int:channel>

<int-ip:tcp-inbound-channel-adapter
    channel="dataChannel" port="${syslog.tcp.in.port}"/>

<int-ip:tcp-outbound-channel-adapter
    channel="dataChannel" port="${syslog.tcp.out.port}"/>






Management

Spring Integration provides two key features that let you manage data pipelines at runtime: the exporting of channels and endpoints to JMX and a control bus. Much like

JMX, the control bus lets you invoke operations and view metric information related

to each component, but it is more general-purpose because it allows you to run small

programs inside the running application to change its state and behavior.

Exporting channels and endpoints to JMX is as simple as adding the lines of XML

configuration shown in Example 13-15.

Example 13-15. Exporting channels and endpoints to JMX

<context:mbean-server/>

<int-jmx:mbean-export default-domain="spring.application" server="mbeanServer"/>
Running the TCP streaming example in the previous section and then starting JConsole

shows the JMX metrics and operations that are available (Figure 13-6). Some examples

are to start and stop the TCP adapter, and to get the min, max, and mean duration of

processing in a MessageHandler.



Figure 13-6. Screenshots of the JConsole JMX application showing the operations and properties

available on the TcpAdapter, channels, and HdfsWritingMessageHandler



A control bus can execute Groovy scripts or Spring Expression Language (SpEL) expressions, allowing you to manipulate the state of components inside the application



