contents of a local directory get filled up with rolled-over logfiles, usually following a
naming convention such as myapp-timestamp.log. It is also common that logfiles are
being continuously created on remote machines, such as a web farm, and need to be
transferred to a separate machine and loaded into HDFS. We can implement these use
cases by using Spring Integration in combination with Spring for Apache Hadoop.
In this section, we will provide a brief introduction to Spring Integration and then
implement an application for each of the use cases just described. In addition, we will
show how Spring Integration can be used to process and load into HDFS data that
comes from an event stream. Lastly, we will show the features available in Spring Integration that enable rich runtime management of these applications through JMX
(Java Management Extensions) and over HTTP.
An Introduction to Spring Integration
Spring Integration is an open source Apache 2.0 licensed project, started in 2007, that
supports writing applications based on established enterprise integration patterns.
These patterns provide the key building blocks to develop integration applications that
tie new and existing systems together. The patterns are based upon a messaging model
in which messages are exchanged within an application as well as between external
systems. Adopting a messaging model brings many benefits, such as the logical decoupling between components as well as physical decoupling; the consumer of messages
does not need to be directly aware of the producer. This decoupling makes it easier to
build integration applications, as they can be developed by assembling individual
building blocks together. The messaging model also makes it easier to test the application, since individual blocks can be tested first in isolation from other components.
This allows bugs to be found earlier in the development process rather than later during
distributed system testing, where tracking down the root cause of a failure can be very
difficult. The key building blocks of a Spring Integration application and how they
relate to each other are shown in Figure 13-1.
Figure 13-1. Building blocks of a Spring Integration application
Endpoints are producers or consumers of messages that are connected through channels. A message is a simple data structure that contains key/value pairs in a header and
an arbitrary object type for the payload. Endpoints can be adapters that communicate
with external systems such as email, FTP, TCP, JMS, RabbitMQ, or syslog, but can
also be operations that act on a message as it moves from one channel to another.
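As a sketch of this model (plain Java, not the Spring Integration API; the class names are illustrative), a message can be modeled as an immutable header map plus an arbitrary payload, and a channel as a queue that decouples the producer endpoint from the consumer endpoint:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the messaging model: a message is a header map plus an
// arbitrary payload, and endpoints exchange messages through a channel
// without being directly aware of each other.
public class MessagingSketch {

    public static final class Message<T> {
        private final Map<String, Object> headers;
        private final T payload;

        public Message(Map<String, Object> headers, T payload) {
            this.headers = Collections.unmodifiableMap(new HashMap<>(headers));
            this.payload = payload;
        }
        public Map<String, Object> getHeaders() { return headers; }
        public T getPayload() { return payload; }
    }

    // A channel decouples producer from consumer: the producer only sees
    // send(), the consumer only sees receive().
    public static final class Channel {
        private final BlockingQueue<Message<?>> queue = new LinkedBlockingQueue<>();
        public void send(Message<?> m) { queue.add(m); }
        public Message<?> receive() throws InterruptedException { return queue.take(); }
    }

    public static void main(String[] args) throws InterruptedException {
        Channel filesIn = new Channel();
        Map<String, Object> headers = new HashMap<>();
        headers.put("file_name", "file_1.txt");
        filesIn.send(new Message<>(headers, "log line"));
        Message<?> received = filesIn.receive();
        System.out.println(received.getHeaders().get("file_name")); // prints file_1.txt
    }
}
```

Because each side depends only on the channel, either endpoint can be replaced or tested in isolation, which is the decoupling benefit described above.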
220 | Chapter 13: Creating Big Data Pipelines with Spring Batch and Spring Integration
Common messaging operations that are supported in Spring Integration are routing to
one or more channels based on the headers of a message, transforming the payload from
a string to a rich data type, and filtering messages so that only those that pass the filter
criteria are passed along to a downstream channel. Figure 13-2 is an example taken
from a joint Spring/C24 project in the financial services industry that shows the type
of data processing pipelines that can be created with Spring Integration.
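The three operations just named (routing, transforming, and filtering) can be sketched as plain functions over a header/payload pair; the Msg class and the channel names below are illustrative, not Spring Integration types:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of the three common messaging operations, modeled as plain
// functions; class and channel names are illustrative.
public class OperationsSketch {

    public static final class Msg {
        public final Map<String, Object> headers;
        public final Object payload;
        public Msg(Map<String, Object> headers, Object payload) {
            this.headers = headers;
            this.payload = payload;
        }
    }

    // Router: pick a downstream channel name from a message header.
    public static String route(Msg m) {
        return "high".equals(m.headers.get("priority")) ? "urgentChannel" : "normalChannel";
    }

    // Filter: only messages passing the criteria continue downstream.
    public static List<Msg> filter(List<Msg> in, Predicate<Msg> criteria) {
        return in.stream().filter(criteria).collect(Collectors.toList());
    }

    // Transformer: convert the payload (e.g., a String) into a richer type.
    public static Msg transform(Msg m, Function<Object, Object> f) {
        return new Msg(m.headers, f.apply(m.payload));
    }
}
```

Each operation takes messages in and hands messages (or a channel choice) out, which is why they compose into pipelines like the one in Figure 13-2.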
Figure 13-2. A Spring Integration processing pipeline
This diagram shows financial trade messages being received on the left via three RabbitMQ adapters that correspond to three external sources of trade data. The messages
are then parsed, validated, and transformed into a canonical data format. Note that
this format is not required to be XML and is often a POJO. The message header is then
enriched, and the trade is stored into a relational database and also passed into a filter.
The filter selects only high-value trades that are subsequently placed into a GemFire-based data grid where real-time processing can occur. We can define this processing
pipeline declaratively using XML or Scala, but while most of the application can be
declaratively configured, any components that you may need to write are POJOs that
can be easily unit-tested.
In addition to endpoints, channels, and messages, another key component of Spring
Integration is its management functionality. You can easily expose all components in
a data pipeline via JMX, where you can perform operations such as stopping and starting adapters. The control bus component allows you to send in small fragments of
code—for example, using Groovy—that can take complex actions to modify the state
of the system, such as changing filter criteria or starting and stopping adapters. The
control bus is then connected to a middleware adapter so it can receive code to execute;
HTTP and message-oriented middleware adapters are common choices.
Collecting and Loading Data into HDFS | 221
We will not be able to dive into the inner workings of Spring Integration in great depth,
nor cover every feature of the adapters that are used, but you should end up with a
good feel for how you can use Spring Integration in conjunction with Spring for Apache
Hadoop to create very rich data pipeline solutions. The example applications developed
here contain some custom code for working with HDFS that is planned to be incorporated into the Spring Integration project. For additional information on Spring Integration, consult the project website, which contains links to extensive reference documentation, sample applications, and links to several books on Spring Integration.
Copying Logfiles
Copying logfiles into Hadoop as they are continuously generated is a common task.
We will create two applications that continuously load generated logfiles into HDFS.
One application will use an inbound file adapter to poll a directory for files, and the
other will poll an FTP site. The outbound adapter writes to HDFS, and its implementation uses the FsShell class provided by Spring for Apache Hadoop, which was described in “Scripting HDFS on the JVM” on page 187. The diagram for this data pipeline
is shown in Figure 13-3.
Figure 13-3. A Spring Integration data pipeline that polls a directory for files and copies them into
HDFS
The file inbound adapter is configured with the directory to poll for files as well as the
filename pattern that determines what files will be detected by the adapter. These values
are externalized into a properties file so they can easily be changed across different
runtime environments. The adapter uses a poller to check the directory since the filesystem is not an event-driven source. There are several ways you can configure the
poller, but the most common are to use a fixed delay, a fixed rate, or a cron expression.
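The fixed-delay strategy can be sketched with the JDK scheduler: scheduleWithFixedDelay waits the configured interval after each poll completes, whereas scheduleAtFixedRate would fire at a constant rate regardless of how long a poll takes (a cron trigger has no direct JDK equivalent). The class below is an illustrative sketch, not the Spring Integration poller:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative fixed-delay poller: runs the poll task a given number of
// times, waiting delayMillis after each completion before the next poll.
public class PollerSketch {

    public static CountDownLatch pollWithFixedDelay(Runnable poll, long delayMillis, int polls) {
        CountDownLatch latch = new CountDownLatch(polls);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            poll.run();          // e.g., scan the directory for matching files
            latch.countDown();
            if (latch.getCount() == 0) {
                scheduler.shutdown();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        return latch;
    }
}
```

A fixed delay of 5000 ms corresponds to the polling.fixedDelay property used later in this example.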
In this example, we do not make use of any additional operations in the pipeline that
would sit between the two adapters, but we could easily add that functionality if required. The configuration file to configure this data pipeline is shown in Example 13-1.
Example 13-1. Defining a data pipeline that polls for files in a directory and loads them into HDFS
<!-- element names are reconstructed; the printed text shows only the
     attribute values -->
<int-file:inbound-channel-adapter
    channel="filesIn"
    directory="${polling.directory}"
    filename-pattern="${polling.fileNamePattern}">
  <int:poller fixed-delay="${polling.fixedDelay}"/>
</int-file:inbound-channel-adapter>

<int:outbound-channel-adapter
    channel="filesIn"
    ref="fsShellWritingMessagingHandler"/>

<bean id="fsShellWritingMessagingHandler"
    class="com.oreilly.springdata.hadoop.filepolling.FsShellWritingMessageHandler">
  <constructor-arg value="${polling.destinationHdfsDirectory}"/>
</bean>
The relevant configuration parameters for the pipeline are externalized in the
polling.properties file, as shown in Example 13-2.
Example 13-2. The externalized properties for polling a directory and loading them into HDFS
polling.directory=/opt/application/logs
polling.fixedDelay=5000
polling.fileNamePattern=*.txt
polling.destinationHdfsDirectory=/data/application/logs
This configuration will poll the directory /opt/application/logs every five seconds and
look for files that match the pattern *.txt. By default, duplicate files are prevented when
we specify a filename-pattern; the state is kept in memory. A future enhancement of
the file adapter is to persistently store this application state. The
FsShellWritingMessageHandler class is responsible for copying the file into HDFS
using FsShell's copyFromLocal method. If you want to remove the files from the
polling directory after the transfer, set the property deleteSourceFiles on
FsShellWritingMessageHandler to true. You can also lock files to prevent them from being picked up concurrently if more
than one process is reading from the same directory. See the Spring Integration reference
guide for more information.
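A rough sketch of this handler's behavior, using the local filesystem in place of FsShell's copyFromLocal; the class name and constructor flag below mirror the text but are not the book's actual implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;

// Illustrative stand-in for FsShellWritingMessageHandler: copies a detected
// file to a destination directory, prevents duplicates with in-memory state,
// and optionally deletes the source file after the transfer.
public class CopyingHandlerSketch {

    private final Path destinationDirectory;
    private final boolean deleteSourceFiles;
    private final Set<Path> seen = new HashSet<>(); // duplicate prevention, in memory only

    public CopyingHandlerSketch(Path destinationDirectory, boolean deleteSourceFiles) {
        this.destinationDirectory = destinationDirectory;
        this.deleteSourceFiles = deleteSourceFiles;
    }

    /** Returns the destination path, or null if the file was already processed. */
    public Path handle(Path sourceFile) throws IOException {
        if (!seen.add(sourceFile)) {
            return null; // already processed; state is lost on restart, as the text notes
        }
        Files.createDirectories(destinationDirectory);
        Path result = destinationDirectory.resolve(sourceFile.getFileName());
        Files.copy(sourceFile, result, StandardCopyOption.REPLACE_EXISTING);
        if (deleteSourceFiles) {
            Files.delete(sourceFile);
        }
        return result;
    }
}
```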
To build and run this application, use the commands shown in Example 13-3.
Example 13-3. Command to build and run the file polling example
$ cd hadoop/file-polling
$ mvn clean package appassembler:assemble
$ sh ./target/appassembler/bin/filepolling
The relevant parts of the output are shown in Example 13-4.
Example 13-4. Output from running the file polling example
03:48:44.187 [main] INFO
c.o.s.hadoop.filepolling.FilePolling - File Polling Application Running
03:48:44.191 [task-scheduler-1] DEBUG o.s.i.file.FileReadingMessageSource - \
Added to queue: [/opt/application/logs/file_1.txt]
03:48:44.215 [task-scheduler-1] INFO o.s.i.file.FileReadingMessageSource - \
Created message: [[Payload=/opt/application/logs/file_1.txt]
03:48:44.215 [task-scheduler-1] DEBUG o.s.i.e.SourcePollingChannelAdapter - \
Poll resulted in Message: [Payload=/opt/application/logs/file_1.txt]
03:48:44.215 [task-scheduler-1] DEBUG o.s.i.channel.DirectChannel - \
preSend on channel 'filesIn', message: [Payload=/opt/application/logs/file_1.txt]
03:48:44.310 [task-scheduler-1] INFO c.o.s.h.f.FsShellWritingMessageHandler - \
sourceFile = /opt/application/logs/file_1.txt
03:48:44.310 [task-scheduler-1] INFO c.o.s.h.f.FsShellWritingMessageHandler - \
resultFile = /data/application/logs/file_1.txt
03:48:44.462 [task-scheduler-1] DEBUG o.s.i.channel.DirectChannel - \
postSend (sent=true) on channel 'filesIn', \
message: [Payload=/opt/application/logs/file_1.txt]
03:48:49.465 [task-scheduler-2] DEBUG o.s.i.e.SourcePollingChannelAdapter - \
Poll resulted in Message: null
03:48:49.465 [task-scheduler-2] DEBUG o.s.i.e.SourcePollingChannelAdapter - \
Received no Message during the poll, returning 'false'
03:48:54.466 [task-scheduler-1] DEBUG o.s.i.e.SourcePollingChannelAdapter - \
Poll resulted in Message: null
03:48:54.467 [task-scheduler-1] DEBUG o.s.i.e.SourcePollingChannelAdapter - \
Received no Message during the poll, returning 'false'
In this log, we can see that the first time around the poller detects the one file that was
in the directory and then afterward considers it processed, so the file inbound adapter
does not process it a second time. There are additional options in
FsShellWritingMessageHandler to enable the generation of an additional directory path
that contains an embedded date or a UUID (universally unique identifier). To enable
the output to have an additional dated directory path using the default path format
(year/month/day/hour/minute/second), set the property generateDestinationDirectory to
true. Setting generateDestinationDirectory to true results in the file being written
into HDFS as shown in Example 13-5.
Example 13-5. Partial output from running the file polling example with
generateDestinationDirectory set to true
03:48:44.187 [main] INFO c.o.s.hadoop.filepolling.FilePolling - \
File Polling Application Running
...
04:02:32.843 [task-scheduler-1] INFO c.o.s.h.f.FsShellWritingMessageHandler - \
sourceFile = /opt/application/logs/file_1.txt
04:02:32.843 [task-scheduler-1] INFO c.o.s.h.f.FsShellWritingMessageHandler - \
resultFile = /data/application/logs/2012/08/09/04/02/32/file_1.txt
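The dated path seen in the resultFile line above can be reproduced with a formatter for the default year/month/day/hour/minute/second layout; this is an illustrative sketch, not the handler's actual code:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative sketch of the generateDestinationDirectory layout: the default
// path format inserts year/month/day/hour/minute/second between the base
// directory and the filename.
public class DatedPathSketch {

    private static final DateTimeFormatter DEFAULT_FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH/mm/ss");

    public static String destinationPath(String baseDirectory, String fileName,
                                         LocalDateTime timestamp) {
        return baseDirectory + "/" + DEFAULT_FORMAT.format(timestamp) + "/" + fileName;
    }
}
```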
Another way to move files into HDFS is to collect them via FTP from remote machines,
as illustrated in Figure 13-4.
Figure 13-4. A Spring Integration data pipeline that polls an FTP site for files and copies them into
HDFS
The configuration in Example 13-6 is similar to the one for file polling; only the configuration of the inbound adapter is changed.
Example 13-6. Defining a data pipeline that polls for files on an FTP site and loads them into HDFS
<!-- element names are reconstructed; the printed text shows only the
     attribute values -->
<bean id="ftpClientFactory"
    class="org.springframework.integration.ftp.session.DefaultFtpSessionFactory">
  <!-- host, port, and testuser credentials omitted -->
</bean>

<int-ftp:inbound-channel-adapter
    channel="filesIn"
    cache-sessions="false"
    session-factory="ftpClientFactory"
    filename-pattern="*.txt"
    auto-create-local-directory="true"
    delete-remote-files="false"
    remote-directory="${ftp.remoteDirectory}"
    local-directory="${ftp.localDirectory}">
  <!-- poller configuration omitted -->
</int-ftp:inbound-channel-adapter>

<int:outbound-channel-adapter
    channel="filesIn" ref="fsShellWritingMessagingHandler"/>

<bean id="fsShellWritingMessagingHandler"
    class="com.oreilly.springdata.hadoop.ftp.FsShellWritingMessageHandler">
  <!-- destination HDFS directory argument omitted -->
</bean>
You can build and run this application using the commands shown in Example 13-7.
Example 13-7. Commands to build and run the FTP example
$ cd hadoop/ftp
$ mvn clean package appassembler:assemble
$ sh ./target/appassembler/bin/ftp
The configuration assumes there is a testuser account on the FTP host machine. Once
you place a file in the outgoing FTP directory, you will see the data pipeline in action,
copying the file to a local directory and then copying it into HDFS.
Event Streams
Streams are another common source of data that you might want to store in HDFS,
optionally performing real-time analysis as the data flows into the system. To meet this need,
Spring Integration provides several inbound adapters that we can use to process streams
of data. Once inside a Spring Integration pipeline, the data can be passed through a processing
chain and stored into HDFS. The pipeline can also take parts of the stream and write
data to other databases, both relational and NoSQL, in addition to forwarding the
stream to other systems using one of the many outbound adapters. Figure 13-2 showed
one example of this type of data pipeline. Next, we will use the TCP (Transmission
Control Protocol) and UDP (User Datagram Protocol) inbound adapters to consume
data produced by syslog and then write the data into HDFS.
The configuration that sets up a TCP-syslog-to-HDFS processing chain is shown in
Example 13-8.
Example 13-8. Defining a data pipeline that receives syslog data over TCP and loads it into HDFS
<hdp:configuration>
fs.default.name=${hd.fs}
</hdp:configuration>

<!-- element names and ids below are reconstructed; the printed text shows
     only the attribute values -->
<int-ip:tcp-connection-factory id="syslogListener"
    type="server"
    port="${syslog.tcp.port}"
    deserializer="lfDeserializer"/>

<bean id="lfDeserializer"
    class="com.oreilly.springdata.integration.ip.syslog.ByteArrayLfSerializer"/>

<int-ip:tcp-inbound-channel-adapter
    channel="syslogChannel"
    connection-factory="syslogListener"/>

<int:chain input-channel="syslogChannel">
  <int:transformer>
    <bean class="com.oreilly.springdata.integration.ip.syslog.SyslogToMapTransformer"/>
  </int:transformer>
  <int:object-to-string-transformer/>
  <int:outbound-channel-adapter>
    <bean class="com.oreilly.springdata.hadoop.streaming.HdfsWritingMessageHandler">
      <constructor-arg>
        <bean class="com.oreilly.springdata.hadoop.streaming.HdfsTextFileWriterFactory">
          <property name="rolloverThresholdInBytes"
              value="${syslog.hdfs.rolloverThresholdInBytes}"/>
        </bean>
      </constructor-arg>
    </bean>
  </int:outbound-channel-adapter>
</int:chain>
The relevant configuration parameters for the pipeline are externalized in the
streaming.properties file, as shown in Example 13-9.
Example 13-9. The externalized properties for streaming data from syslog into HDFS
syslog.tcp.port=1514
syslog.udp.port=1513
syslog.hdfs.basePath=/data/
syslog.hdfs.baseFilename=syslog
syslog.hdfs.fileSuffix=log
syslog.hdfs.rolloverThresholdInBytes=500
The diagram for this data pipeline is shown in Figure 13-5.
This configuration will create a connection factory that listens for an incoming TCP
connection on port 1514. The serializer segments the incoming byte stream based on
the newline character in order to break up the incoming syslog stream into events. Note
that this lower-level serializer configuration will be encapsulated in a syslog XML
namespace in the future so as to simplify the configuration. The inbound channel
adapter takes the syslog message off the TCP data stream and parses it into a byte array,
which is set as the payload of the incoming message.
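The newline-based segmentation can be sketched with a plain loopback socket: the server reads the raw TCP byte stream and treats each newline-terminated segment as one event. This is illustrative only; the real deserializer produces raw byte arrays rather than strings:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of newline framing: one '\n'-terminated segment of the
// TCP byte stream becomes one event payload.
public class LfFramingSketch {

    /** Reads newline-delimited events from a socket until the peer closes it. */
    public static List<String> readEvents(Socket socket) throws IOException {
        List<String> events = new ArrayList<>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            events.add(line); // one segment of the byte stream = one event
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            int port = server.getLocalPort();
            Thread client = new Thread(() -> {
                try (Socket s = new Socket("localhost", port);
                     OutputStream out = s.getOutputStream()) {
                    out.write("event one\nevent two\n".getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            client.start();
            try (Socket connection = server.accept()) {
                System.out.println(readEvents(connection)); // prints [event one, event two]
            }
            client.join();
        }
    }
}
```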
Collecting and Loading Data into HDFS | 227
Figure 13-5. A Spring Integration data pipeline that streams data from syslog into HDFS
Spring Integration’s chain component groups together a sequence of endpoints without
our having to explicitly declare the channels that connect them. The first element in
the chain parses the byte[] payload and converts it to a java.util.Map containing the key/
value pairs of the syslog message. At this stage, you could perform additional operations
on the data, such as filtering, enrichment, real-time analysis, or routing to other databases. In this example, we have simply transformed the payload (now a Map) to a
String using the built-in object-to-string transformer. This string is then passed into
the HdfsWritingMessageHandler that writes the data into HDFS. HdfsWritingMessageHandler
lets you configure the HDFS directory to write the files, the file naming policy,
and the file size rollover policy. In this example, the rollover threshold was set artificially
low (500 bytes versus the 10 MB default) to highlight the rollover capabilities in a simple
test case.
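The rollover behavior can be sketched against the local filesystem; the naming pattern below (baseFilename-N.fileSuffix under a base path) matches the files shown later in Example 13-12, but the class itself is illustrative, not HdfsWritingMessageHandler:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative size-based rollover writer: appends records to the current
// file and rolls to a new numbered file once the byte threshold is crossed.
public class RolloverWriterSketch {

    private final Path basePath;
    private final String baseFilename;
    private final String fileSuffix;
    private final long rolloverThresholdInBytes;
    private int fileIndex = 0;
    private long bytesWritten = 0;

    public RolloverWriterSketch(Path basePath, String baseFilename, String fileSuffix,
                                long rolloverThresholdInBytes) {
        this.basePath = basePath;
        this.baseFilename = baseFilename;
        this.fileSuffix = fileSuffix;
        this.rolloverThresholdInBytes = rolloverThresholdInBytes;
    }

    public Path currentFile() {
        return basePath.resolve(baseFilename + "-" + fileIndex + "." + fileSuffix);
    }

    /** Appends one record, rolling to a new file once the threshold is crossed. */
    public Path write(String record) throws IOException {
        if (bytesWritten >= rolloverThresholdInBytes) {
            fileIndex++;      // roll over: syslog-1.log, syslog-2.log, ...
            bytesWritten = 0;
        }
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        Path target = currentFile();
        Files.write(target, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        bytesWritten += bytes.length;
        return target;
    }
}
```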
To build and run this application, use the commands shown in Example 13-10.
Example 13-10. Commands to build and run the Syslog streaming example
$ cd hadoop/streaming
$ mvn clean package appassembler:assemble
$ sh ./target/appassembler/bin/streaming
To send a test message, use the logger utility demonstrated in Example 13-11.
Example 13-11. Sending a message to syslog
$ logger -p local3.info -t TESTING "Test Syslog Message"
Since we set HdfsWritingMessageHandler’s rolloverThresholdInBytes property so low,
after sending a few of these messages or just waiting for messages to come in from the
operating system, you will see inside HDFS the files shown in Example 13-12.
Example 13-12. Syslog data in HDFS
$ hadoop dfs -ls /data
-rw-r--r--   3 mpollack supergroup   711 2012-08-09 13:19 /data/syslog-0.log
-rw-r--r--   3 mpollack supergroup   202 2012-08-09 13:22 /data/syslog-1.log
-rw-r--r--   3 mpollack supergroup   240 2012-08-09 13:22 /data/syslog-2.log
-rw-r--r--   3 mpollack supergroup   119 2012-08-09 15:04 /data/syslog-3.log
…
$ hadoop dfs -cat /data/syslog-2.log
{HOST=ubuntu, MESSAGE=Test Syslog Message, SEVERITY=6, FACILITY=19, \
TIMESTAMP=Thu Aug 09 13:22:44 EDT 2012, TAG=TESTING}
{HOST=ubuntu, MESSAGE=Test Syslog Message, SEVERITY=6, FACILITY=19, \
TIMESTAMP=Thu Aug 09 13:22:55 EDT 2012, TAG=TESTING}
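A parser producing the key/value pairs shown above can be sketched for the classic BSD syslog layout. The leading PRI value encodes facility * 8 + severity, so a local3.info message arrives as <158>. The regex and class name are illustrative assumptions, not SyslogToMapTransformer's actual code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the classic BSD syslog layout:
// <PRI>TIMESTAMP HOST TAG: MESSAGE
public class SyslogParseSketch {

    private static final Pattern BSD_SYSLOG = Pattern.compile(
            "<(\\d+)>(\\w{3}\\s+\\d+ \\d{2}:\\d{2}:\\d{2}) (\\S+) ([^:]+): (.*)");

    public static Map<String, Object> toMap(String rawEvent) {
        Matcher m = BSD_SYSLOG.matcher(rawEvent);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a BSD syslog event: " + rawEvent);
        }
        int priority = Integer.parseInt(m.group(1));
        Map<String, Object> event = new HashMap<>();
        event.put("FACILITY", priority / 8);  // e.g., local3 = 19
        event.put("SEVERITY", priority % 8);  // e.g., info = 6
        event.put("TIMESTAMP", m.group(2));
        event.put("HOST", m.group(3));
        event.put("TAG", m.group(4));
        event.put("MESSAGE", m.group(5));
        return event;
    }
}
```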
To use UDP instead of TCP, remove the TCP-related definitions and add the configuration
shown in Example 13-13.
Example 13-13. Configuration to use UDP to consume syslog data
<!-- element name reconstructed from the attributes shown in the printed text -->
<int-ip:udp-inbound-channel-adapter channel="syslogChannel" port="${syslog.udp.port}"/>
Event Forwarding
When you need to process a large amount of data from several different machines, it
can be useful to forward the data from where it is produced to another server (as opposed to processing the data locally). The TCP inbound and outbound adapters can
be paired together in an application so that they forward data from one server to another. The channel that connects the two adapters can be backed by several persistent
message stores. Message stores are represented in Spring Integration by the interface
MessageStore, and implementations are available for JDBC, Redis, MongoDB, and
GemFire. Pairing inbound and outbound adapters together in an application affects
the message processing flow such that the message is persisted in the message store of
the producer application before the message is sent to the consumer application. The
message is removed from the producer’s message store once the acknowledgment from
the consumer is received. The consumer sends its acknowledgment once it has successfully put the received message in its own message-store-backed channel. This
configuration adds, over plain TCP, a level of "store and forward" guarantee
normally found in messaging middleware such as JMS or RabbitMQ.
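The acknowledgment protocol just described can be sketched with an in-memory store standing in for the JDBC or Redis MessageStore; this is illustrative, not Spring Integration's implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Illustrative store-and-forward sketch: persist before sending, delete only
// after the consumer acknowledges, so an in-flight message is never lost.
public class StoreAndForwardSketch {

    private final Deque<String> messageStore = new ArrayDeque<>(); // stand-in for a persistent store

    public void produce(String message) {
        messageStore.addLast(message); // persist first, before any delivery attempt
    }

    /**
     * Attempts delivery; the consumer returns true (an ack) once it has
     * durably stored the message on its own side.
     */
    public boolean forward(Predicate<String> consumer) {
        String message = messageStore.peekFirst();
        if (message == null) {
            return false;
        }
        boolean acknowledged = consumer.test(message);
        if (acknowledged) {
            messageStore.removeFirst(); // safe to delete only after the ack
        }
        return acknowledged;
    }

    public int pending() {
        return messageStore.size();
    }
}
```

If delivery fails, the message simply stays in the producer's store and is retried, which is exactly the guarantee the paired adapters provide.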
Example 13-14 is a simple demonstration of forwarding TCP traffic and using Spring’s
support to easily bootstrap an embedded HSQL database to serve as the message store.
Example 13-14. Store and forwarding of data across processes using TCP adapters
channel="dataChannel" port="${syslog.tcp.in.port}"/>
channel="dataChannel" port="${syslog.tcp.out.port}"/>
Management
Spring Integration provides two key features that let you manage data pipelines at runtime: the exporting of channels and endpoints to JMX and a control bus. Much like
JMX, the control bus lets you invoke operations and view metric information related
to each component, but it is more general-purpose because it allows you to run small
programs inside the running application to change its state and behavior.
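What JMX export provides can be sketched with the JDK's own javax.management API: registering a component as an MBean makes its attributes and start/stop operations invokable from tools like JConsole. The object name and interface below are illustrative, not what Spring Integration actually registers:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustrative JMX sketch: a component exposes a running flag plus
// start/stop operations as a standard MBean on the platform MBean server.
public class JmxSketch {

    public interface AdapterMBean {
        boolean isRunning();
        void stop();
        void start();
    }

    public static class Adapter implements AdapterMBean {
        private volatile boolean running = true;
        public boolean isRunning() { return running; }
        public void stop() { running = false; }
        public void start() { running = true; }
    }

    public static ObjectName register(Adapter adapter) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("com.example:type=Adapter,name=tcpAdapter");
        server.registerMBean(adapter, name);
        return name;
    }
}
```

Once registered, the same operations a JConsole user clicks on can be invoked programmatically through the MBean server.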
Exporting channels and endpoints to JMX is as simple as adding the lines of XML
configuration shown in Example 13-15.
Example 13-15. Exporting channels and endpoints to JMX
Running the TCP streaming example in the previous section and then starting JConsole
shows the JMX metrics and operations that are available (Figure 13-6). Examples include
starting and stopping the TCP adapter and retrieving the min, max, and mean duration of
processing in a MessageHandler.
Figure 13-6. Screenshots of the JConsole JMX application showing the operations and properties
available on the TcpAdapter, channels, and HdfsWritingMessageHandler
A control bus can execute Groovy scripts or Spring Expression Language (SpEL) expressions, allowing you to manipulate the state of components inside the application