Discriminators For Use In Flow-Based Classification

1. Introduction

Over the past two decades, airborne mapping light detection and ranging (lidar), also known as airborne laser scanning (ALS), has become one of the prime remote sensing technologies for sampling the Earth’s surface and land cover in three dimensions (3D), especially in areas covered by vegetation canopies [1]. In addition to the range (spatial) information derived from time-of-flight (ToF) measurements, pulsed airborne lidar sensors deliver an arbitrarily scaled measure of the strength of the optical backscattered signal that is proportional to the radiance incident on the detector, typically referred to as intensity. Intensity is correlated with a target’s reflectance at the given laser wavelength, making it useful in the interpretation of the lidar spatial information [2] or as a standalone data source for identifying the characteristics of the likely backscattering surface for each return [3]. The intensity also depends on other target characteristics such as roughness and the lidar cross-section, and on sensor-target geometry parameters including range and incidence angle. Currently, the general practice is to digitally scale the “intensity” signal to normalize it to a given range. However, the quality of intensity information can be improved through geometric and radiometric corrections for other factors such as incidence angle and atmospheric attenuation [4,5,6]. Taken a step further, radiometric calibration methods transform intensities to physical quantities such as target reflectance [7], allowing interpretation with respect to known material spectra.
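As a simple illustration of the geometric corrections mentioned above, the following is a minimal sketch assuming the common range-squared and cosine incidence-angle model for extended targets; the reference range and exponent are illustrative choices, and operational calibration as in [4,5,6,7] is considerably more involved:

```python
import math

def normalize_intensity(intensity, range_m, incidence_deg,
                        ref_range_m=1000.0, range_exponent=2.0):
    """Normalize a raw return intensity to a reference range and zero incidence.

    Assumes an extended Lambertian target, for which received power falls
    off roughly as 1/R^2 and scales with the cosine of the incidence angle.
    """
    range_term = (range_m / ref_range_m) ** range_exponent
    angle_term = math.cos(math.radians(incidence_deg))
    return intensity * range_term / angle_term
```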

In the past, intensity information has been used for various applications, including land cover classification [3,5]; enhancement of lidar ground return classification [8,9]; fusion with multispectral and hyperspectral data to enable a better characterization of terrain or land cover [10]; derivation of forest parameters and tree species mapping [11,12,13]; and production of greyscale stereo-pair imagery to generate breaklines traditionally used in photogrammetry through a technique named lidargrammetry [14]. However, the usefulness of lidar intensity information has been fundamentally limited because it provides a measure of backscatter at a single, narrow laser wavelength band. This is usually a near-infrared (NIR) wavelength (1064 or 1550 nm) for topographic lidar systems and a green-blue wavelength [15] for bathymetric lidar systems. Currently, the second harmonic of neodymium-doped yttrium aluminum garnet (Nd:YAG) lasers, 532 nm, is a common choice for bathymetric lidars [16]. This single wavelength spectral limitation has been previously recognized and several experiments have attempted to mitigate it by combining data obtained from individual sensors that operate at different laser wavelengths [17,18] or by observations from prototype multispectral lidar systems [19,20,21]. These multispectral lidar experiments are testament to the potential of this newly developing remote sensing technique, and as with any other technology, the potential is coupled with challenges. Some of these challenges are related to physical principles (atmospheric transparency, background solar radiation, etc.) and hardware limitations (available laser wavelengths, eye safety) [22], while other limitations are related to software and algorithms (e.g., radiometric calibration of the raw lidar intensities) [17].

This paper presents a general overview of the design and performance of the first operational multispectral airborne laser scanner that collects range, intensity and optionally full waveform return data at three different laser wavelengths (1550, 1064, 532 nm) through a single scanning mechanism. This airborne multispectral lidar scanner is the Teledyne Optech Titan MW (multi-wavelength) which was developed based on the specifications and technical requirements of the National Science Foundation (NSF) National Center for Airborne Laser Mapping (NCALM). The system was delivered to the University of Houston (UH) in October of 2014. Since then, the sensor has undergone significant testing, improvement and fine tuning in a wide range of environments [23].

This paper is intended to highlight the flexibility of the Titan to perform 3D and active multispectral mapping for different applications (bathymetry, urban mapping, ground cover classification, forestry, archeology, etc.) without delving too deeply into one specific application. Individual papers that provide more details for specific applications are currently in preparation. The basic research question addressed in this paper relates to the performance of the Titan system, which, given its operational flexibility, has to be assessed using a variety of metrics that vary according to the application. The performance metrics discussed within this work are: (a) the accuracy of experimental ground cover classification based on un-calibrated multispectral lidar return intensity and structural metrics in an urban environment (Houston, TX, USA); (b) maximum water penetration; (c) accuracy of the measured water depths under ideal bathymetric mapping conditions (Destin, FL, USA, and San Salvador Island, Bahamas); (d) canopy penetration; (e) range resolution in tropical rain forests (Guatemala, Belize and Mexico); and (f) precision and accuracy of topographically derived elevations.

It is important to clarify that since the delivery of NCALM’s original Titan system, the manufacturer has produced other Titan sensors; however, not all units have the same engineering specifications or system design. The design of the sensor is flexible, allowing the exact configuration and performance of the unit to be adapted to the specific needs of a customer. The discussion presented below is specific to the design and performance of the NCALM unit and may not be applicable or reproducible for other Titan units, even though they carry the same make and model designation.

This paper is structured as follows: Section 2 presents a high-level description of the Titan system design and operational characteristics; Section 3 presents the results and discussion of performance tests related to (a) ground cover classification based on multispectral intensity; (b) bathymetric capabilities; (c) canopy penetration and characterization; (d) redundancy and diversity design schemes; and (e) vertical positional precision and accuracy; finally, conclusions are presented in Section 4.

2. Titan Instrument Description

The first Titan MW lidar sensor (Serial number 14SEN/CON340) was developed to meet operational specifications and requirements established by NCALM. The specifications called for a multipurpose integrated multichannel/multispectral airborne mapping lidar unit, with an integrated high resolution digital camera that could seamlessly map terrain and shallow water bathymetry environments from flying heights between 300 and 2000 m above ground level (AGL). NCALM’s operational experience with airborne lidar units operating at 1064 and 532 nm wavelengths, and the requirement to perform simultaneous terrestrial and bathymetry mapping, determined two of the three laser wavelengths. There were several laser wavelengths considered for the third channel, including 950 and 1550 nm; however, the 1550 nm option was selected because of the ease in complying with eye safety regulations, the proven reliability of the laser sources, and because it results in nearly equally spaced (500 nm) spectral sampling wavelengths when combined with the 1064 and 532 nm wavelengths (Figure 1).

NCALM’s Titan has two fiber laser sources. The first source “Laser A” has a primary output at 1064 nm. Part of the 1064 nm output is directly used as channel two for the lidar and another part of the output is passed through a frequency-doubling crystal to obtain 532 nm wavelength pulses for channel three of the lidar unit. The second laser source, “Laser B”, has its output at the 1550 nm wavelength and is used as the source for lidar channel one. Both lasers are synchronized and can produce pulse rates between 50 and 300 kHz, programmable at 25 kHz intervals. The laser output of the Titan corresponds to Class IV as per United States Food and Drug Administration, 21 Code of Federal Regulations 1040.10 and 1040.11; International Electrotechnical Commission 60825-1. The characteristics of the individual laser sources as well as other characteristics of each lidar channel required for assessing system behavior and performance are presented in Table 1.

Other equipment providers offer single-pass multispectral lidar systems by mounting, either in tandem or side by side, individual sensor heads operating at different wavelengths [22]. However, for the Titan system, we required the operation of the different wavelengths’ lasers through a single scanning mechanism to provide near-identical swaths from the three wavelength channels. The channels are arranged such that the 1064 nm channel points at the nadir, and the 1550 and 532 nm channels are pointed 3.5° and 7° forward of the nadir, respectively. The primary reason for this configuration was to minimize returns from the water surface and maximize the probability of water penetration for the 532 nm pulses. A secondary reason was to maximize correlation with legacy lidar datasets collected with a 1064 channel pointing to the nadir. The Titan scanner has a ±30° field of scan and a maximum scanner product (half scan angle × scan frequency) of 800 degrees-Hertz. The beam divergence values are close to 0.3 milliradians for the 1064 and 1550 nm channels and one milliradian for the 532 nm channel (see Table 1). Boresight parameters for each channel are currently determined independently, and combined with the sensor model in the manufacturer’s proprietary software to obtain a geometrically correct and consistent point cloud. Individual point cloud files are generated for each flight line and channel.
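The scanner-product limit implies a direct trade-off between field of scan and scan rate. As a quick illustration (simple arithmetic on the published 800 degrees-Hertz figure, not manufacturer planning software):

```python
SCANNER_PRODUCT = 800.0  # degrees * Hz, maximum for the Titan scanner

def max_scan_frequency(half_angle_deg: float) -> float:
    """Maximum scan frequency (Hz) permitted at a given half scan angle."""
    return SCANNER_PRODUCT / half_angle_deg

print(max_scan_frequency(30.0))  # full +/-30 deg field of scan -> ~26.7 Hz
print(max_scan_frequency(10.0))  # narrow +/-10 deg field of scan -> 80 Hz
```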

The laser return signal is analyzed in real time by an analogue constant fraction discriminator (CFD) [24,25], which detects and records discrete laser returns (range and intensities) at all pulse repetition frequencies. While the system can detect a large number of returns per pulse, it records up to four returns for each outgoing laser pulse (first, second, third and last). In addition to the analogue CFD, the outgoing and return waveforms of all the channels can optionally be digitized with 12-bit amplitude quantization at a rate of 1 gigasample/s. Currently, digitization of every outgoing pulse and its return waveforms is only possible up to a maximum PRF of 100 kHz; at higher PRFs, full waveform digitization is performed only for a decimated subset of the emitted pulses.

The Titan is capable of ranging beyond the single pulse-in-the-air limit (range ambiguity), meaning that it can obtain accurate ranges at high PRFs even when several laser pulses from each channel are in the air simultaneously before a return from the first emitted pulse is received [26,27]. The Titan measures in a fixed multi-pulse mode, which means the sensor must know how many pulses are planned to be in the air simultaneously in order to compute accurate ranges; this is determined in the planning phase of the data collection. If, for some reason, the actual number of pulses-in-the-air (PIA) differs from the planned value, the system will produce erroneous range values. With this fixed multi-pulse capability there are certain combinations of ranges and PRFs at which the sensor cannot resolve the range ambiguities, which the manufacturer calls “blind zones”. The PRFs and range regions where the Titan can work without suffering range ambiguity are displayed in Figure 2 as the white regions between the colored bands, labeled according to the number of pulses-in-the-air (PIA) at a given time.

Technically, the range ambiguity only occurs at a specific range or multiples of this range value. However, due to the wide scan angle and the forward-looking channels on the Titan (channel 1 and channel 3), the specific range at which the ambiguity occurs turns into a band of range values as it applies to the entire sensor. These blind zones, or, more accurately, ambiguity zones, are depicted as the solid color bands in Figure 2. Figure 2 also illustrates that as the PRF increases, the different PIA regions of operation get smaller. Another way of visualizing these regions of operation is to relate them to how much range variation the sensor can experience within a single flight line as the result of terrain variation or elevation of manmade structures. The higher the PRF, the less terrain relief can be tolerated by the sensor without entering into the range ambiguity regions depicted in the figure.
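For a single nadir-pointing channel, the relationship between PRF, range and the number of pulses in the air can be sketched as follows. This is a simplified illustration that ignores the forward tilt of channels 1 and 3 and the scan geometry, which are what widen each ambiguity range into the bands of Figure 2:

```python
C = 299_792_458.0  # speed of light in m/s

def unambiguous_range(prf_hz: float) -> float:
    """Range interval spanned by one pulse period (one-way distance)."""
    return C / (2.0 * prf_hz)

def pulses_in_air(slant_range_m: float, prf_hz: float) -> int:
    """Number of pulses simultaneously in flight for a given slant range."""
    return int(slant_range_m // unambiguous_range(prf_hz)) + 1

# At 300 kHz one pulse period covers ~500 m of range, so a sensor ranging
# to targets 1200 m away has 3 pulses in the air at any instant.
print(unambiguous_range(300e3))      # ~499.7 m
print(pulses_in_air(1200.0, 300e3))  # 3
```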

The laser shot densities obtainable with the Titan are mainly a function of instrument parameters (laser PRF, scan angle, scan frequency), and flying parameters (ground speed and flying height above terrain). However, eye safety regulations and range ambiguity also limit the maximum measurement density obtainable for a specific flying height. Figure 3 illustrates the laser shot density operational envelope of the Titan as a function of the flying height for a single channel and single pass (i.e., no swath overlap). The figure is for reference only and is based on the specific assumptions described below; density values outside the envelope can be obtained under specific circumstances but are not considered to be normal mapping operations.

In Figure 3, the lower limit of the envelope is obtained by assuming the lowest PRF of 50 kHz, a ground speed of 150 knots and a scanner operating at 25 Hz at the maximum field of view (±30°). At lower flight heights the maximum density is limited by eye safety considerations. Each laser PRF has a nominal ocular hazard distance (NOHD) that increases as the PRF increases. While it is technically possible to operate at higher PRFs at low altitudes, doing so does not comply with applicable eye safety regulations, so it is necessary to use the maximum PRF for which the flying height still exceeds the NOHD (see Figure 3). The upper limit of the envelope is obtained by assuming the highest possible PRF for a given flying height (taking into account the range ambiguity regions), a ground speed of 150 knots, and a scanner operating at 70 Hz with a very narrow field of view (±10°). The peaks and valleys in the upper envelope limit are caused by the range ambiguity regions (Figure 2).
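As a rough cross-check of Figure 3 under similar assumptions (an illustrative calculation only, ignoring the eye-safety and range-ambiguity limits discussed above):

```python
import math

KNOTS_TO_MS = 0.514444  # knots to metres per second

def shot_density(prf_hz, height_m, half_angle_deg, speed_knots=150.0):
    """Approximate single-channel, single-pass laser shot density (shots/m^2)."""
    swath_m = 2.0 * height_m * math.tan(math.radians(half_angle_deg))
    speed_ms = speed_knots * KNOTS_TO_MS
    return prf_hz / (swath_m * speed_ms)

# Near the lower envelope: 50 kHz, +/-30 deg at 1000 m AGL -> ~0.6 shots/m^2
print(shot_density(50e3, 1000.0, 30.0))
# Near the upper envelope: 300 kHz, +/-10 deg at 500 m AGL -> ~22 shots/m^2
print(shot_density(300e3, 500.0, 10.0))
```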

1. Introduction

Network traffic classification [1] refers to the classification of the bidirectional TCP/UDP flows generated by network communications according to application type, such as WWW, P2P, FTP, and ATTACK. As an important preprocessing step for traffic classification, feature selection aims to select a feature subset that has the lowest dimension while retaining classification accuracy, so as to optimize a specific evaluation criterion. Because of its significant effect on the performance of network traffic classification in the current big data environment, feature selection has become a key research area in fields such as machine learning, data mining and pattern recognition [2].

Current feature selection methods can be categorized into two groups, according to whether the subset evaluation criterion depends on the subsequent classification algorithm: filter and wrapper. Filter methods treat the feature selection process and the subsequent machine learning method separately, making a selection on the basis of the characteristics of the dataset itself. They run fast but tend to select weaker feature subsets. Wrapper methods use the subsequent classification algorithm as the evaluation index of the feature subset; they achieve higher classification accuracy, but at the cost of low efficiency and a large amount of computation. Thus, how to integrate the two methods to bridge the gap between accuracy and execution efficiency is a central challenge in feature selection research [3].

With network bandwidth and the number of users growing continuously, the size of network data has been increasing rapidly; improving the efficiency of feature selection methods has therefore become an urgent demand. Some researchers have parallelized and improved traditional feature selection methods using the Hadoop-MapReduce framework, enabling them to deal with very large-scale data. However, the computational complexity of such solutions cannot be ignored. A growing body of research shows that the MapReduce computational model is suitable for jobs with large amounts of data but uncomplicated computation; for complex computations, such as wrapper methods with many iterations, the MapReduce framework has inferior execution efficiency.

To address these problems, this paper puts forward a feature selection method based on Spark (FSMS). In our approach, a filter method, the Fisher score, is first applied to the complete feature set to eliminate features with low discriminative power. We then adopt a wrapper method in which sequential forward search is used as the search strategy, classification accuracy is used as the evaluation criterion, and the feature combination with the strongest classification ability is iteratively selected using the Spark computing framework. Finally, the optimal feature subset for traffic classification is obtained.

2. Related Work

The evolution of Internet applications has made traditional methods for classifying network traffic progressively less effective [4]. Port-based methods easily misclassify traffic flows, mostly because new applications select port numbers at random. Current research has therefore converged on machine learning methods that classify network traffic based on flow statistical features [5]. Combination algorithms have been shown to further improve classification accuracy in multi-class settings [6], so it is worthwhile to use them to select the most significant and effective features.

Filter feature selection methods measure features by analyzing their internal characteristics. Classic methods include information gain [7], Fisher score [8], and Relief-F [9]. Wrapper methods adopt different search strategies over the feature space but a single evaluation criterion on the candidate subsets. Rodrigues et al. [10] used the bat algorithm (BA) as the search guidance and the optimum-path forest (OPF) classifier as the evaluation function to select the optimal feature subset. Chuang et al. [11] obtained better classification results by combining binary particle swarm optimization (BPSO) with a genetic algorithm (GA), using k-Nearest Neighbors (k-NN) as the classifier for feature selection. The filter and wrapper approaches have complementary advantages and disadvantages. Peng et al. [12] integrated the two into a sequential searching approach to improve classification performance. Hybrid feature selection methods generally first filter out unrelated or noisy features with a filter method, and subsequently select the optimal feature subset with a wrapper method.

With the data scale increasing rapidly, how to carry out feature selection for big data has become a critical issue. Appropriate and novel designs for highly parallel low-cost architectures promise significant scalability improvements [13]. Feature selection and traffic classification systems must be redesigned to run on multicore hardware, targeting low-cost but highly parallel architectures [14]. Initially, the Hadoop-MapReduce framework was the research focus. Sun et al. [15] introduced the joint mutual information method into feature selection; experiments demonstrated the method's efficiency and scalability. Based on the cloud model, Li et al. [16] proposed an improved SVM-BPSO feature selection method that adopted the wrapper evaluation strategy, reached a local optimal solution with a faster convergence rate, and thus obtained better results. Long [17] parallelized and improved a feature selection method using the Hadoop-MapReduce framework to enable it to deal with large-scale data. However, Srirama et al. [18,19] pointed out that the MapReduce model is suitable for jobs with large amounts of data but uncomplicated computing, and that it has lower execution efficiency for complex iterative computations. Spark [20] is a new-generation memory-based parallel computing framework that outperforms MapReduce on iterative computation. Currently, there are few research results on Spark in the field of network traffic feature selection. This paper puts forward a feature selection method for network traffic based on Spark and investigates its performance and effect.

3. The Parallel Computing Framework

3.1. Hadoop-MapReduce

Hadoop is a distributed open-source software framework for clustered environments and large-scale dataset processing. It provides MapReduce as the core programming interface, together with the Hadoop Distributed File System (HDFS), and is designed to scale across very large clusters and handle datasets up to the zettabyte (ZB) scale. MapReduce is a computing framework for large-scale data processing in which parallel computation is abstracted into two stages, Map and Reduce, that process data by mapping and reduction. With advantages such as simple programming, ease of extension and good fault tolerance, MapReduce has become one of the most popular parallel processing strategies.

However, much research has shown that the MapReduce computing framework is suited to jobs with large amounts of data but uncomplicated core computation. When faced with complex iterative computations, the following problems arise: (1) it provides only the two operators, Map and Reduce, which makes it hard to fully describe complex algorithms; (2) although the algorithm is the same at each iteration, a new MapReduce job must be launched for every iteration, which incurs unnecessary system overhead; (3) even when the input data varies only slightly between iterations, MapReduce reads the complete data again from HDFS, which consumes enormous amounts of time, CPU resources, and network bandwidth.

3.2. Spark

Spark is a parallel computing framework for big data based on in-memory computation. Its framework adopts the master-slave model of distributed computing: the master is the node running the master process, and the slaves are nodes running worker processes. Acting as the controller of the cluster, the master node is responsible for the normal operation of the whole cluster. A worker node, which is equivalent to a computing node, receives commands from the master node and reports its status. Executors are responsible for task execution. The client submits applications on behalf of the user, and the driver controls the execution of an application.

The advantages of Spark for iterative computations are mainly as follows. First, Spark operations are based on the resilient distributed dataset (RDD), which not only provides the map and reduce operators of MapReduce but also offers roughly ten times as many operators, enough to fully express iterative algorithms. Second, several rounds of iteration can be integrated into one Spark job: at the beginning of the job the raw data is read into memory, and at each subsequent iteration the corresponding data is accessed directly from memory, so the data is read once and reused many times. Third, intermediate results are also kept in memory rather than written to disk, avoiding inefficient disk I/O and improving execution efficiency significantly.
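A minimal PySpark sketch of this pattern (a toy iterative computation, not the paper's code): the data is materialized once, cached in memory, and reused on every pass.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()
sc = spark.sparkContext

# Load (here: generate) the data once and pin the RDD in executor memory;
# every later action reuses the cached partitions instead of re-reading.
values = sc.parallelize(range(1, 1001)).map(float).cache()

estimate = 0.0
for _ in range(20):
    # Toy iterative algorithm: sign-gradient steps toward the median.
    # Each pass launches new tasks over the same cached RDD.
    grad = values.map(lambda x, e=estimate: 1.0 if e > x else -1.0).mean()
    estimate -= 50.0 * grad

print(estimate)  # drifts toward the median (~500) of the cached values
spark.stop()
```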

4. FSMS

4.1. The Filter Methods for Feature Selection Based on Fisher Score

The filter-based feature selection method treats the feature selection process and the subsequent machine learning method separately and selects the feature subset based on the characteristics of the data itself. The Fisher score is an efficient supervised filter method that, following the criteria of minimum intra-class distance and maximum inter-class distance, evaluates and ranks features by the internal properties of each single feature. The score of the $i$-th feature is defined as

$$F(i) = \frac{\sum_{k=1}^{K} n_k \left(\mu_{i,k} - \mu_i\right)^2}{\sum_{k=1}^{K} n_k \, \sigma_{i,k}^2}, \quad i = 1, \ldots, m,$$

where $m$ is the number of features, $k$ is the class index, $K$ is the total number of classes, $n_k$ is the number of samples in class $k$, $\mu_{i,k}$ is the mean value of the $i$-th feature in class $k$, $\mu_i$ is the mean value of the $i$-th feature over all samples, and $\sigma_{i,k}^2$ is the variance of the $i$-th feature in class $k$. The bigger $F(i)$ is, the better the discernibility of the $i$-th feature.

Network traffic data traditionally has at most a few hundred candidate features, but the computation is still too large when iterative (wrapper) evaluation is applied directly to the whole feature set. With the Fisher score, the features with low discernibility and poor identification performance are eliminated from the complete feature set, yielding the initial feature subset; the amount of computation in the subsequent wrapper stage is thus significantly reduced. Because its calculation is simple and fast, the Fisher score is well suited to dimension reduction of large-scale, high-dimensional data.
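A minimal NumPy sketch of this scoring step (illustrative, not the paper's implementation):

```python
import numpy as np

def fisher_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fisher score for every feature column of X given class labels y.

    Implements F(i) = sum_k n_k (mu_ik - mu_i)^2 / sum_k n_k sigma_ik^2,
    matching the definition above; larger scores mean better discernibility.
    """
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        n_k = Xk.shape[0]
        between += n_k * (Xk.mean(axis=0) - overall_mean) ** 2
        within += n_k * Xk.var(axis=0)
    return between / within

# Keep, say, the 20 best-scoring features as the initial subset:
# initial_subset = np.argsort(fisher_scores(X, y))[::-1][:20]
```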

4.2. The Wrapper Feature Selection Methods Based on the SFS Strategy

The wrapper feature selection method consists of two parts, the search strategy and the evaluation strategy. In this paper, sequential forward search (SFS), a heuristic search, is used as the search strategy, and classification accuracy is used as the evaluation index.

The selection process is described as follows:

The initial feature subset is defined as $F = \{f_1, f_2, \ldots, f_m\}$, the optimal feature subset $S$ is initially empty, and the classification model is denoted $C$.

  1. The classification accuracy of each feature in $F$ is obtained separately with the classification model $C$.

  2. The feature achieving the highest accuracy is added to $S$ and eliminated from $F$.

  3. Each feature remaining in $F$ is combined with all the features in $S$ separately, and the classification accuracy of each combination is obtained with the classification model $C$.

  4. The feature in $F$ whose combination achieves the highest accuracy is added to $S$ and eliminated from $F$.

  5. Steps 3 and 4 are repeated until the stopping criterion is reached.

Note the threshold $\delta$ for the stopping criterion adopted in this paper: if the accuracy obtained in a round decreases compared with the previous round, and the decrease is larger than $\delta$, the feature selection process stops. In this case, the features already selected constitute the optimal feature subset available for later classification operations; a minimal sketch of the procedure is given below.
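The following single-machine sketch implements the SFS loop with this stopping rule; the k-NN classifier and 5-fold cross-validation are stand-in assumptions, since the paper does not fix the classification model $C$ here:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sequential_forward_search(X, y, delta=0.01):
    """SFS with the stopping rule above: stop when accuracy drops by > delta."""
    remaining = list(range(X.shape[1]))  # initial feature subset F
    selected = []                        # optimal feature subset S
    best_so_far = 0.0
    while remaining:
        # Score every candidate feature joined with the current subset S.
        scored = [(cross_val_score(KNeighborsClassifier(),
                                   X[:, selected + [f]], y, cv=5).mean(), f)
                  for f in remaining]
        acc, best_f = max(scored)
        if best_so_far - acc > delta:    # accuracy fell by more than delta
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_so_far = max(best_so_far, acc)
    return selected
```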

4.3. The Realization of FSMS
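The description above implies that the expensive step to distribute is the candidate-evaluation loop of the wrapper: every feature still in $F$ can be scored independently on a different Spark executor. The following is a hedged PySpark sketch of that idea, broadcasting the data to the executors; the function names, the k-NN stand-in classifier, and the cross-validation settings are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np
from pyspark.sql import SparkSession
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

spark = SparkSession.builder.appName("fsms-sketch").getOrCreate()
sc = spark.sparkContext

def best_candidate(X, y, selected, remaining):
    """Score all candidate features in parallel; return (accuracy, feature)."""
    bX, by = sc.broadcast(X), sc.broadcast(y)
    sel = list(selected)

    def evaluate(f):
        cols = sel + [f]
        acc = cross_val_score(KNeighborsClassifier(),
                              bX.value[:, cols], by.value, cv=5).mean()
        return (acc, f)

    # One partition per candidate, so each executor scores one feature.
    return sc.parallelize(remaining, len(remaining)).map(evaluate).max()
```

The driver then applies the same $\delta$ stopping rule as in the serial version, adding the winning feature to $S$ and repeating until the accuracy degrades.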
