Elasticsearch Server, 2nd Edition. This book begins by introducing the most commonly used Elasticsearch server functionalities, from creating your own index. Throughout the book, you'll follow a problem-based approach to learn why, when, and several tutorials to help beginners using the server. Zach is a developer. Leverage Elasticsearch to create a robust, fast, and flexible search solution with ease.
And last but not least, Beats are lightweight agents that are installed on edge hosts to collect different types of data for forwarding into the stack. Together, these different components are most commonly used for monitoring, troubleshooting and securing IT environments, though there are many more use cases for the ELK Stack, such as business intelligence and web analytics.
Beats and Logstash take care of data collection and processing, Elasticsearch indexes and stores the data, and Kibana provides a user interface for querying the data and visualizing it. The ELK Stack is popular because it fulfills a need in the log management and analytics space. Monitoring modern applications and the IT infrastructure they are deployed on requires a log management and analytics solution that enables engineers to overcome the challenge of monitoring what are highly distributed, dynamic and noisy environments.
The ELK Stack helps by providing users with a powerful platform that collects and processes data from multiple data sources, stores that data in one centralized data store that can scale as data grows, and that provides a set of tools to analyze the data.
Of course, the ELK Stack is open source. With IT organizations favoring open source products, this alone could explain the popularity of the stack. Using open source means organizations can avoid vendor lock-in and onboard new talent much more easily. Everyone knows how to use Kibana, right? Open source also means a vibrant community constantly driving new features and innovation and helping out in case of need. Sure, Splunk has long been a market leader in the space.
But its numerous functionalities are increasingly not worth the expensive price, especially for smaller companies such as SaaS products and tech startups. ELK might not have all of the features of Splunk, but it does not need those analytical bells and whistles. ELK is a simple but robust log management and analytics platform that costs a fraction of the price. Performance issues can damage a brand and in some cases translate into a direct revenue loss.
For the same reason, organizations cannot afford to be compromised as well, and not complying with regulatory standards can result in hefty fines and damage a business just as much as a performance issue. To ensure apps are available, performant and secure at all times, engineers rely on the different types of data generated by their applications and the infrastructure supporting them. This data, whether event logs or metrics, or both, enables monitoring of these systems and the identification and resolution of issues should they occur.
Logs have always existed and so have the different tools available for analyzing them. What has changed, though, is the underlying architecture of the environments generating these logs. Architecture has evolved into microservices, containers and orchestration infrastructure deployed on the cloud, across clouds or in hybrid environments.
Not only that, the sheer volume of data generated by these environments is constantly growing and constitutes a challenge in itself. Long gone are the days when an engineer could simply SSH into a machine and grep a log file. This cannot be done in environments consisting of hundreds of containers generating TBs of log data a day.
This is where centralized log management and analytics solutions such as the ELK Stack come into the picture, allowing engineers, whether DevOps, IT Operations or SREs, to gain the visibility they need and ensure apps are available and performant at all times. Modern log management and analysis solutions include the following key capabilities:
Aggregation: the ability to collect and ship logs from multiple data sources.
Processing: the ability to transform log messages into meaningful data for easier analysis.
Storage: the ability to store data for extended time periods to allow for monitoring, trend analysis, and security use cases.
Analysis: the ability to dissect the data by querying it and creating visualizations and dashboards on top of it.
The various components in the ELK Stack were designed to interact and play nicely with each other without too much extra configuration.
One example of a character mapper is the removal of HTML tags.

Indexing and querying
We may wonder how all of that affects indexing and querying when using Lucene and all the software that is built on top of it. During indexing, Lucene will use the analyzer of your choice to process the contents of your document. Of course, a different analyzer can be used for different fields, so the title field of your document can be analyzed differently compared to the description field.
During query time, if you use one of the provided query parsers, your query will be analyzed. However, you can also choose the other path and not analyze your queries. This is crucial to remember, because some of the ElasticSearch queries are analyzed and some are not. For example, the prefix query is not analyzed and the match query is analyzed.
What you should remember about indexing and querying analysis is that the terms in the index must match the terms produced from the query. If they don't match, Lucene won't return the desired documents. For example, if you are using stemming and lowercasing during indexing, you need to be sure that the terms in the query are also lowercased and stemmed, or your queries will return no results at all.
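The need for matching index-time and query-time analysis can be sketched with a toy analyzer. This is only an illustration in Python, not Lucene code: the whitespace tokenizer, the naive "strip a trailing s" stemmer, and the sample documents are all assumptions made for the example.

```python
def analyze(text, lowercase=True, stem=True):
    """Toy analyzer: whitespace tokenizer plus optional filters (illustrative only)."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if stem:
        # Naive stand-in for a stemmer: strip a trailing "s" from longer tokens.
        tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens


def build_index(docs):
    """Map each analyzed term to the set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical documents, analyzed with lowercasing and stemming at index time.
docs = {1: "Mastering Search Engines", 2: "Lucene basics"}
index = build_index(docs)

# Query analyzed the same way as the documents: "Engines" becomes "engine".
analyzed_hits = index.get(analyze("Engines")[0], set())
# Query left unanalyzed: the raw string "Engines" is not a term in the index.
raw_hits = index.get("Engines", set())
```

Here the analyzed query finds document 1, while the unanalyzed query finds nothing, even though the word appears in the original text.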
Lucene query language
Some of the query types provided by ElasticSearch support the Apache Lucene query parser syntax. Because of that, let's go deeper into the Lucene query language and describe it. A term, in Lucene, can be a single word or a phrase (a group of words surrounded by double quote characters). If the query is set to be analyzed, the defined analyzer will be used on each of the terms that form the query. A query can also contain Boolean operators that connect terms to each other, forming clauses. The list of Boolean operators is as follows:
AND: the given two terms (the left and right operand) need to match in order for the clause to be matched. For example, we would run a query such as apache AND lucene to match documents with both the apache and lucene terms in a document.
OR: any of the given terms may match in order for the clause to be matched. For example, we would run a query such as apache OR lucene to match documents with apache or lucene (or both) terms in a document.
NOT: in order for the document to be considered a match, the term appearing after the NOT operator must not match. For example, we would run the query lucene NOT elasticsearch to match documents that contain the lucene term, but not the elasticsearch term.
In addition to that, we may use the following operators:
+: the given term needs to be matched in order for the document to be considered a match.
-: the given term can't be matched in order for the document to be considered a match.
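The semantics of these operators can be sketched with set operations over a toy inverted index. This is an illustration in Python rather than the way Lucene evaluates queries internally, and the index contents are hypothetical.

```python
# Toy inverted index: term -> ids of documents containing it (hypothetical data).
index = {
    "apache": {1, 2, 3},
    "lucene": {2, 3, 4},
    "elasticsearch": {3, 5},
}


def docs_with(term):
    """Return the ids of documents containing the term (empty set if unknown)."""
    return index.get(term, set())


# apache AND lucene: both terms must match.
and_hits = docs_with("apache") & docs_with("lucene")
# apache OR lucene: either term (or both) may match.
or_hits = docs_with("apache") | docs_with("lucene")
# lucene NOT elasticsearch: lucene must match, elasticsearch must not.
not_hits = docs_with("lucene") - docs_with("elasticsearch")
# +apache -elasticsearch: a required term combined with a prohibited term.
plus_minus_hits = docs_with("apache") - docs_with("elasticsearch")
```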
When none of the previous operators is specified, the default OR operator will be used. In addition to all these, there is one more thing: you can use parentheses to group clauses together, for example, with something like this: elasticsearch AND (mastering OR book)

Querying fields
Of course, just like in ElasticSearch, in Lucene all your data is stored in fields that build the document.
In order to run a query against a field, you need to provide the field name, add the colon character, and provide the clause that should be run against that field. For example, if you would like to match documents with the term elasticsearch in the title field, you would run a query like this: title:elasticsearch
You can also group multiple clauses for a single field. For example, if you would like your query to match all the documents having the elasticsearch term and the mastering book phrase in the title field, you could run a query like this: title:(+elasticsearch +"mastering book")
The most common modifiers, which you are surely familiar with, are wildcards. Lucene supports two wildcards: ? and *. The first one matches any single character and the second one matches multiple characters. Please note that by default these wildcard characters can't be used as the first character in a term, for performance reasons.
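The behavior of the two wildcards can be approximated with Python's shell-style matching from the standard library. This is only a stand-in to illustrate the semantics, not Lucene's implementation, and the term list is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical terms from an index.
terms = ["elastic", "elastik", "elasticsearch", "plastic"]


def wildcard_match(pattern, candidates):
    """Return the candidate terms matching a ?/* wildcard pattern."""
    return [t for t in candidates if fnmatchcase(t, pattern)]


# ? stands for exactly one character.
single = wildcard_match("elasti?", terms)
# * stands for any run of characters, including none.
multi = wildcard_match("elastic*", terms)
```

Both patterns keep the wildcard away from the first position, in line with the default restriction mentioned above.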
Lucene also supports fuzzy and proximity searches through the ~ modifier. When used with a single-word term, it means that we want to search for terms that are similar to the one we've modified (a so-called fuzzy search). When used with a phrase, it controls how far apart the words of the phrase may be. For example, let's take the following query: title:"mastering elasticsearch"
It would match the document with the title field containing mastering elasticsearch, but not mastering book elasticsearch.
However, if we ran a query, such as title:
A boost lower than one decreases the importance of a term, a boost higher than one increases it, and the default boost value is 1. Please refer to the Default Apache Lucene scoring explained section in Chapter 2, Power User Query DSL, for more on what boosting is and how it is taken into consideration during document scoring.
In addition to all these, we can use square and curly brackets to allow range searching. Square brackets denote inclusive bounds, and range searches can be run on numeric fields as well as on string-based fields. If you would like your range bound or bounds to be exclusive, use curly brackets instead of the square ones.
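The difference between inclusive and exclusive bounds can be sketched as follows. This is a plain Python illustration of the comparison logic only; the function name, the price values, and the names are hypothetical and do not come from the text.

```python
def in_range(value, lower, upper, include_lower=True, include_upper=True):
    """Check a value against a range with inclusive or exclusive bounds."""
    above = value >= lower if include_lower else value > lower
    below = value <= upper if include_upper else value < upper
    return above and below


prices = [10.0, 12.5, 15.0]  # hypothetical numeric field values

# Square brackets on both sides: both bounds are inclusive.
inclusive = [p for p in prices if in_range(p, 10.0, 15.0)]
# Square bracket on the left, curly bracket on the right: upper bound excluded.
half_open = [p for p in prices if in_range(p, 10.0, 15.0, include_upper=False)]
# String values are compared lexicographically.
names = [n for n in ["Adam", "Adel", "Bob"] if in_range(n, "Adam", "Adria")]
```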
For example, in order to find documents with the price field between
For example, to search for abc"efg term you need to do something like this:

Introducing ElasticSearch
If you hold this book in your hands, you are probably familiar with ElasticSearch, at least the core concepts and basic usage. However, in order to fully understand how this search engine works, let's discuss it briefly.
As you probably know, ElasticSearch is production-ready software for building search-oriented applications. It was originally started by Shay Banon and published in February. After that it rapidly gained popularity within just a few years, and became an important alternative to other open source and commercial solutions.
It is one of the most downloaded open source projects, hitting more than , downloads a month.

Basic concepts
Let's go through the basic concepts of ElasticSearch and its features.

Index
ElasticSearch stores its data in one or more indices. Using analogies from the SQL world, an index is something similar to a database. It is used to store the documents and read them from it. As we already mentioned, under the hood, ElasticSearch uses the Apache Lucene library to write and read the data from the index.
What one should remember is that a single ElasticSearch index may be built of more than a single Apache Lucene index, by using shards and replicas.
Document
A document is the main entity in the ElasticSearch world, and also in the Lucene world. In the end, all use cases of ElasticSearch can be brought to a point where it is all about searching for documents. A document consists of fields, and each field has a name and one or many values (in this case, the field is called multi-valued). Each document may have a different set of fields; there is no schema or imposed structure. It should look familiar; these are the same rules as for Lucene documents.
In fact, ElasticSearch documents are stored as Lucene documents.

Mapping
As you already read in the Introducing Apache Lucene section, all documents are analyzed before being stored.
We can configure how the input text is divided into tokens, which tokens should be filtered out, or what additional processing, such as removing HTML tags, is needed. In addition, various features offered by ElasticSearch, such as sorting, need information about field contents.
This is where mapping comes into play. Although ElasticSearch can automatically discover a field type by looking at its value, sometimes (in fact, usually) we will want to configure the mappings ourselves to avoid unpleasant surprises.
Type
Each document in ElasticSearch has its type defined. This allows us to store various document types in one index and have different mappings for different document types.

Node
A single instance of the ElasticSearch server is called a node. A single-node ElasticSearch deployment can be sufficient for many simple use cases, but when you have to think about fault tolerance or you have lots of data that cannot fit on a single server, you should think about a multi-node ElasticSearch cluster.

Cluster
A cluster is a set of ElasticSearch nodes that work together to handle a load bigger than a single instance can handle (both in terms of handling queries and documents). This is also the solution that allows the application to work without interruption even if several machines (nodes) are not available due to an outage or administration tasks, such as an upgrade.
ElasticSearch provides clustering almost seamlessly. In our opinion, this is one of the major advantages over the competition; setting up a cluster in the ElasticSearch world is really easy.

Shard
As we said previously, clustering allows us to store information volumes that exceed the abilities of a single server.
To achieve this, ElasticSearch spreads data across several physical Lucene indices. Those Lucene indices are called shards, and the process of spreading the data is called sharding.
ElasticSearch can do this automatically, and all parts of the index (shards) are visible to the user as one big index. Note that despite this automation, it is crucial to tune this mechanism for a particular use case, because the number of shards an index is built of is configured during index creation and cannot be changed later, at least currently.
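One way to see why the shard count is fixed at index creation is to sketch hash-based routing of documents to shards. This is a simplified illustration that assumes routing of the form hash(id) mod number_of_shards; the crc32 hash is a stand-in chosen for the example, not the hash ElasticSearch uses, and the document ids are hypothetical.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard by hashing the document id (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


doc_ids = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6"]

# Placement with five shards, and what it would be with six.
placement_5 = {d: shard_for(d, 5) for d in doc_ids}
placement_6 = {d: shard_for(d, 6) for d in doc_ids}

# Documents whose shard would change if the shard count changed.
moved = [d for d in doc_ids if placement_5[d] != placement_6[d]]
```

Under this assumption, changing the number of shards changes where existing documents are expected to live, so a lookup by id could go to the wrong shard unless the data is reindexed.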
Replica
Sharding allows us to push more data into ElasticSearch than is possible for a single node to handle. Replicas can help where load increases and a single node is not able to handle all the requests.
The idea is simple: create an additional copy of a shard, which can be used for queries just like the original, primary shard. Note that we get safety for free. If the server with the shard is gone, ElasticSearch can use the replica and no data is lost. Replicas can be added and removed at any time, so you can adjust their numbers when needed.

Gateway
During its work, ElasticSearch collects various information about the cluster state, index settings, and so on. This data is persisted in the gateway.

Key concepts behind ElasticSearch architecture
ElasticSearch was built with a few concepts in mind.
The development team wanted to make it easy to use and scalable, and these core features are visible in every corner of ElasticSearch. From the architectural perspective, the main features are:
Built-in discovery (for example, of field types) and auto configuration.
Nodes assume that they are, or will be, part of a cluster, and during setup nodes try to automatically join the cluster.
Nodes automatically connect to other machines in the cluster for data interchange and mutual monitoring. This covers automatic replication of shards.
This allows users to adjust to an existing data model. As we noted in the type description, ElasticSearch supports multiple data types in a single index, and adjustment to the business model includes handling relations between documents (although this functionality is rather limited).
Because of the distributed nature of ElasticSearch, there is no possibility of avoiding delays and temporary differences between data located on different nodes. ElasticSearch tries to reduce these issues and provides additional mechanisms such as versioning.

The bootstrap process
When the ElasticSearch node starts, it uses multicast (or unicast, if configured) to find the other nodes in the same cluster (the key here is the cluster name defined in the configuration) and connect to them.
You can see the process illustrated in the following figure:
In the cluster, one of the nodes is elected as the master node.
This node is responsible for managing the cluster state and the process of assigning shards to nodes in reaction to changes in cluster topology. Note that a master node in ElasticSearch has no importance from the user perspective, which is different from other available systems (such as databases).
In practice you do not need to know which node is the master node; all operations can be sent to any node, and internally ElasticSearch will do all the magic. If necessary, any node can send subqueries in parallel to other nodes and merge the responses to return the full response to the user. All of this is done without accessing the master node (nodes operate in a peer-to-peer architecture).
The master node reads the cluster state and, if necessary, goes into the recovery process. During this state, it checks which shards are available and decides which shards will be the primary shards.
After this, the whole cluster enters a yellow state. This means that the cluster is able to run queries, but full throughput and all possibilities are not achieved yet (it basically means that all primary shards are allocated, but replicas are not). The next thing to do is to find duplicated shards and treat them as replicas. When a shard has too few replicas, the master node decides where to put the missing shards, and additional replicas are created based on a primary shard.
If everything went well, the cluster enters a green state, which means that all primary shards and replicas are allocated.

Failure detection
During normal cluster work, the master node monitors all the available nodes and checks if they are working.
If any of them is not available for a configured amount of time, the node is treated as broken and the process of handling the failure starts. This may mean rebalancing of the cluster: shards that were present on the broken node are gone, and for each such shard other nodes have to take responsibility.
In other words, for every lost primary shard, a new primary shard should be elected from the remaining replicas of this shard. The whole process of placing new shards and replicas can (and usually should) be configured to match our needs. More information about it can be found in Chapter 4, Index Distribution Architecture.
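The yellow and green states and the promotion of a replica after a node failure can be sketched with a small model. This is a simplification in Python under stated assumptions: the dictionary layout, the function names, and the node names are hypothetical, and the "red" value for a shard with no primary is an addition not described in the text above.

```python
def cluster_health(shards):
    """Derive a cluster state from shard allocation.

    Each shard is a dict with 'primary' (node name or None),
    'replicas' (list of node names) and 'wanted_replicas' (int).
    """
    if any(s["primary"] is None for s in shards):
        return "red"  # assumption: some primary shard is not allocated
    if any(len(s["replicas"]) < s["wanted_replicas"] for s in shards):
        return "yellow"  # all primaries allocated, some replicas missing
    return "green"  # all primaries and replicas allocated


def handle_node_failure(shards, failed_node):
    """Drop shard copies on the failed node and promote replicas where needed."""
    for s in shards:
        s["replicas"] = [n for n in s["replicas"] if n != failed_node]
        if s["primary"] == failed_node:
            # Promote the first remaining replica, if any, to primary.
            s["primary"] = s["replicas"].pop(0) if s["replicas"] else None


shards = [
    {"primary": "node1", "replicas": ["node2"], "wanted_replicas": 1},
    {"primary": "node2", "replicas": ["node3"], "wanted_replicas": 1},
]
before = cluster_health(shards)
handle_node_failure(shards, "node2")
after = cluster_health(shards)
```

In this model the cluster starts green, and after node2 fails it drops to yellow: the second shard gets a new primary on node3, but both shards are now missing a replica until new copies are created.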
Just to illustrate how it works, let's take an example of a three-node cluster: there will be a single master node and two data nodes. The master node will send ping requests to the other nodes and wait for the response. If the response doesn't come (how many ping requests may fail depends on the configuration), such a node will be removed from the cluster.
It is worth mentioning that the Java API is also used internally by ElasticSearch itself to do all the node-to-node communication.
Note that we treat this as a little reminder; this book assumes that you have used these elements already. If not, we strongly suggest reading about this; for example, our ElasticSearch Server book covers all this information.

Indexing data
ElasticSearch has four ways of indexing data. The easiest way is using the index API, which allows you to send one document to a particular index. For example, by using the curl tool see http:
The difference between methods is the connection type.
This is faster but not so reliable. The last method uses plugins, called rivers. The river runs on the ElasticSearch node and is able to fetch data from external systems.
One thing to remember is that indexing only takes place on the primary shard, not on the replica. If the indexing request is sent to a node that doesn't have the correct shard or contains a replica, it will be forwarded to the primary shard.
In general, the process can be divided into two phases, the scatter phase and the gather phase. The scatter phase is about querying all the relevant shards of your index. The gather phase is about gathering the results from the relevant shards, combining them, sorting, processing, and returning to the client.
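The two phases can be sketched as a per-shard top-N query followed by a merge. This is a minimal illustration in Python, not how ElasticSearch implements its search types; the shard names, scores, and document ids are hypothetical.

```python
import heapq

# Hypothetical per-shard matches for one query: (score, doc_id) pairs.
shard_results = {
    "shard0": [(0.9, "a"), (0.4, "b"), (0.2, "c")],
    "shard1": [(0.8, "d"), (0.7, "e")],
    "shard2": [(0.5, "f")],
}


def scatter(shards, size):
    """Scatter phase: ask every relevant shard for its own top `size` hits."""
    return [heapq.nlargest(size, hits) for hits in shards.values()]


def gather(partial_results, size):
    """Gather phase: combine the per-shard lists and keep the global top `size`."""
    merged = [hit for hits in partial_results for hit in hits]
    return heapq.nlargest(size, merged)


top = gather(scatter(shard_results, size=2), size=2)
```

Each shard only needs to return its own best hits, because a document outside a shard's top two cannot be in the global top two.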
You can control the scatter and gather phases by setting the search type to one of the six values currently exposed by ElasticSearch.

Index configuration
We already talked about automatic index configuration and the ability to guess document field types and structure. Of course, ElasticSearch gives us the possibility to alter this behavior. We may, for example, configure our own document structure with the use of mappings, set the number of shards and replicas the index will be built of, configure the analysis process, and so on.
Administration and monitoring
The administration and monitoring part of the API allows us to change the cluster settings, for example, to tune the discovery mechanism or change the index placement strategy. You can find various information about the cluster state or statistics regarding each node and index.
The API for cluster monitoring is very comprehensive, and example usage will be discussed in Chapter 5, ElasticSearch Administration.

Summary
In this chapter we've looked at the general architecture of Apache Lucene, how it works, how the analysis process is done, and how to use the Apache Lucene query language.
In addition to that we've discussed the basic concepts of ElasticSearch, its architecture, and internal communication. In the next chapter you'll learn about the default scoring formula Apache Lucene uses, what the query rewrite process is, and how it works. In addition to that we'll discuss some of the ElasticSearch functionality, such as query rescore, multi near real-time get, and bulk search operations.
We'll also see how to use the update API to partially update our documents, how to sort our data, and how to use filtering to improve performance of our queries. Finally, we'll see how we can leverage the use of filters and scopes in the faceting mechanism.
In addition to that we've seen what the Lucene query language is and how to use it. We also discussed ElasticSearch, its architecture, and core concepts. We will first go through how the Lucene scoring formula works before turning to advanced queries.

Power User Query DSL

Default Apache Lucene scoring explained
One important thing when talking about query relevance is how the score of the document is calculated for a query. What is the score? The score is a parameter that describes how well the document matched the query.
In this section, we'll look at the default Apache Lucene scoring mechanism. Knowing how this works is valuable when designing complicated queries and choosing which query parts should be more relevant than others.

When a document is matched
When a document is returned by Lucene, it means that it matched the query we sent. In this case, the document is given a score. The higher the score value, the more relevant the document is, at least at the Apache Lucene level and from the scoring formula point of view.
Naturally, the score calculated for the same document on two different queries will be different and comparing scores between queries usually doesn't make much sense. One should remember that not only should we avoid comparing the scores of individual documents returned by different queries, but we should also avoid comparing the maximum score calculated for different queries.
This is because the score depends on multiple factors: not only the boosts and query structure, but also how many terms were matched, in which fields, the type of matching that was used, query normalization, and so on. In extreme cases, a similar query may result in totally different scores for a document, only because we've used a custom score query or the number of matched terms increased dramatically.
For now, let's get back to the scoring. In order to calculate the score property for a document, multiple factors are taken into account:
Document boost: the boost value given for a document during indexing.
Field boost: the boost value given for a field during querying.
Coord: the coordination factor that is based on the number of query terms the document has. It is responsible for giving more value to the documents that contain more search terms compared to other documents.
Inverse document frequency: a term-based factor telling the scoring formula how rare the given term is. The higher the inverse document frequency is, the rarer the term is. The scoring formula uses this factor to boost documents that contain rare terms.
Length norm: a field-based factor for normalization based on the number of terms a given field contains (calculated during indexing and stored in the index). The longer the field, the smaller the boost this factor will give, which means that the Apache Lucene scoring formula will favor documents with fields containing fewer terms.
Term frequency: a term-based factor describing how many times the given term occurs in a document. The higher the term frequency, the higher the score of the document will be.
Query norm: a query-based normalization factor that is calculated as the sum of the squared weights of each of the query terms. Query norm is used to allow score comparison between queries, which, as we said, is not always easy or possible.

Keep in mind that in order to adjust your query relevance, you don't need to understand that, but it is very important to at least know how it works. The previously presented formula is a representation of the Boolean model of Information Retrieval combined with the Vector Space Model of Information Retrieval.
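The direction in which several of these factors move can be sketched with a few small functions. The formulas below follow the classic defaults of Lucene's TF/IDF similarity as commonly documented (square root of the frequency, a logarithmic inverse document frequency, and an inverse square root length norm); they are an assumption added for illustration and are not taken from the text above.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the number of occurrences in a document."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: larger for terms found in fewer documents."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Length norm: fields with fewer terms get a larger value."""
    return 1.0 / math.sqrt(num_terms)


def coord(overlap, max_overlap):
    """Coordination factor: fraction of the query terms found in the document."""
    return overlap / max_overlap


# Hypothetical collection of 1000 documents.
rare_term_idf = idf(doc_freq=1, num_docs=1000)
common_term_idf = idf(doc_freq=500, num_docs=1000)
```

With these numbers the rare term receives a clearly higher inverse document frequency than the common one, and a field of 4 terms receives a higher length norm than a field of 100 terms.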
Let's not discuss it and just jump into the practical formula, which is implemented by Apache Lucene and is actually used. The Boolean model and the Vector Space Model of Information Retrieval are far beyond the scope of this book. If you would like to read more about them, start with http:
As you may be able to see, the score factor for the document is a function of query q and document d.
There are two factors that are not dependent directly on query terms, the coord and queryNorm.
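For reference, the practical scoring function can be written in the form given in the documentation of Lucene's TF/IDF similarity. It is assumed here, not confirmed by the text, that this is the formula the section refers to; boost(t) stands for the query-time boost of term t, and norm(t,d) combines the index-time boosts with the length norm.

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \Big)
```

In this form, coord and queryNorm sit outside the sum, which is why they do not depend directly on the individual query terms.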