## Technology that supports "Stanby", Japan's largest job search engine
https://speakerdeck.com/marevol/ri-ben-zui-da-ji-falseqiu-ren-jian-suo-enzin-sutanbai-wozhi-eruji-shu
### About the Stanby search function
- Stanby: search more than 8 million job listings of many different types at once
- The search servers form an Elasticsearch cluster (plus plugins)
  - Elasticsearch was chosen because cluster node management is easy and it can be extended with plugins
  - Data nodes (hold the indices)
  - Search API ⇒ coordinating nodes (for search)
  - Spark / update batch ⇒ coordinating nodes (for updates)
  - Because the update frequency is high, search and update traffic are separated so that heavy updates cannot drag search down with them
- Search types
  - Keyword search: full-text search over titles, descriptions, working hours, etc.
  - Work-location search: geo search after converting each work location to latitude and longitude
- In Lucene, the Analyzer configuration is the key to search quality, since it drives inverted-index creation (see the sketch after this list)
  - CharFilter: converts character by character (example: ① ⇒ 1)
  - Tokenizer: splits text into words
  - TokenFilter: converts word by word (e.g., trimming long vowels, dropping particles)
- Where Analyzers are used
  - At both search time and index time, but the two need not be the same Analyzer
- Related series: "Introduction to the OSS full-text search server Fess"
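
The CharFilter → Tokenizer → TokenFilter pipeline above maps directly onto Lucene's Analyzer API. A minimal sketch of such an analyzer (illustrative, not Stanby's actual configuration; the Kuromoji tokenizer and the two Japanese filters are assumptions chosen to match the examples in the bullets):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseKatakanaStemFilter;
import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public class JobAnalyzer extends Analyzer {

    private final NormalizeCharMap charMap;

    public JobAnalyzer() {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("①", "1"); // CharFilter stage: per-character conversion
        this.charMap = builder.build();
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // CharFilters run before tokenization
        return new MappingCharFilter(charMap, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenizer: split into words by morphological analysis (Kuromoji)
        Tokenizer tokenizer =
            new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        // TokenFilters: per-word conversion (trim long vowels, drop particles)
        TokenStream stream = new JapaneseKatakanaStemFilter(tokenizer);
        stream = new JapanesePartOfSpeechStopFilter(
            stream, JapaneseAnalyzer.getDefaultStopTags());
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```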
### Stanby index and query
- Goals: score appropriately and avoid missed hits
  - Indices are built three ways: morphological analysis, yomigana (readings), and bi-grams
  - copy_to is used to index the same text with the other analyzers (multi-fields would also work, but copy_to avoids adding extra fields)
- Query (see the sketch after this list)
  - OR search
  - A phrase multi_match (preserves word order, fans out to multiple fields) over job_title, job_content, bigram_content, and reading_content, each weighted (e.g., job_title: 0.8)
  - minimum_should_match: also hits when word order differs or only some terms match (so even a whole-sentence query is picked up and 0-hit results are avoided); weighted at 0.01 so the phrase match takes priority
- Elasticsearch extensions let you add an entry point that accepts new paths
  - They depend on the Elasticsearch version, though, so they break as-is on the next version...
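
Roughly what such a query looks like with Elasticsearch's Java QueryBuilders. The field names and the 0.8 / 0.01 weights come from the notes above; everything else (the minimum_should_match threshold in particular) is an illustrative assumption, not Stanby's production query:

```java
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class JobQueryFactory {

    public static BoolQueryBuilder build(String userQuery) {
        // Phrase multi_match: preserves word order, fans out to several
        // fields, with per-field weights (job_title at 0.8).
        MultiMatchQueryBuilder phrase = QueryBuilders
            .multiMatchQuery(userQuery)
            .field("job_title", 0.8f)
            .field("job_content")
            .field("bigram_content")
            .field("reading_content")
            .type(MultiMatchQueryBuilder.Type.PHRASE);

        // OR match with minimum_should_match: still hits when word order
        // differs or only some terms match, avoiding 0-hit results.
        // Boosted far below the phrase query so phrase matches rank first.
        MultiMatchQueryBuilder loose = QueryBuilders
            .multiMatchQuery(userQuery)
            .field("job_title")
            .field("job_content")
            .field("bigram_content")
            .field("reading_content")
            .minimumShouldMatch("2<70%") // hypothetical threshold
            .boost(0.01f);

        return QueryBuilders.boolQuery().should(phrase).should(loose);
    }
}
```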
### Search issues and responses
- Index backup (sketched after this list)
  - Uses the snapshot mechanism; the restored environment is used for analysis and learning (currently no aggregation needs real-time data)
  - Snapshots are kept hourly for one day, daily for two weeks, and thinned to one per month beyond that (stored in S3)
- Custom Similarity (sketched after this list)
  - They want to lower the score of SEO-gamed postings (e.g., documents where the matching keywords appear unnaturally often)
  - Built by inheriting BM25Similarity
- Distributing dictionary files
  - NFS mounts would work
  - They wanted to distribute files through a REST-like API, so they introduced the ConfigSync plugin
- Search templates
  - They want to manage business logic and search queries separately
  - They also want a little logic in the templates, so they introduced the Script-based Search Template plugin with Velocity templates
- They want to reload dictionary files without restarting
- Sorting search results
  - They want to keep listings with the same job title or from the same source medium from appearing side by side
  - This also improves the top of the search results
  - Introduced the DynaRank plugin: re-sorts the top N hits
  - Introduced the Minhash plugin: indexes a bit string per job title and judges similarity between bit strings when sorting
- Changing index settings and mappings without downtime (alias switch sketched after this list)
  - An Indexing Proxy writes each update request to a file and applies it to the index on a separate thread
  - To reindex, create the new index, copy the data over, then replay the buffered updates from the file into the new index
  - Finally switch the alias to the new index and delete the old one
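
For the backup bullet, the snapshot calls might be driven like this from an hourly job. A sketch using the low-level REST client's Request class (available from Elasticsearch 6.4); the repository name, bucket, and snapshot naming are assumptions, not the talk's actual setup:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class SnapshotJob {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {

            // One-time setup: register an S3 repository
            // (requires the repository-s3 plugin)
            Request repo = new Request("PUT", "/_snapshot/backup_repo");
            repo.setJsonEntity(
                "{\"type\":\"s3\",\"settings\":{\"bucket\":\"my-es-backups\"}}");
            client.performRequest(repo);

            // Hourly cron: take a snapshot named after the current hour
            client.performRequest(new Request(
                "PUT", "/_snapshot/backup_repo/hourly-2018-07-01-12"));

            // Thinning (hourly -> daily -> monthly) is just deleting old ones
            client.performRequest(new Request(
                "DELETE", "/_snapshot/backup_repo/hourly-2018-06-30-12"));
        }
    }
}
```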
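
For the custom Similarity bullet, a minimal sketch of the idea of inheriting BM25Similarity (not the speaker's actual code): lowering k1 makes repeated occurrences of a keyword saturate quickly, which blunts keyword-stuffed postings.

```java
import org.apache.lucene.search.similarities.BM25Similarity;

// Sketch: BM25 with aggressive term-frequency saturation. With a small k1,
// the 10th occurrence of a keyword adds almost nothing over the 2nd, so
// keyword-stuffed documents stop outranking normal ones. The real plugin
// presumably overrides more than the constructor parameters.
public class AntiStuffingSimilarity extends BM25Similarity {
    public AntiStuffingSimilarity() {
        super(0.3f, 0.75f); // k1 = 0.3 (default 1.2), b = 0.75 (default)
    }
}
```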
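
For the zero-downtime mapping change, the final alias switch is atomic via the _aliases endpoint; a sketch with illustrative index and alias names:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class AliasSwap {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // Atomically repoint the "jobs" alias from the old index to
            // the new one, so searches never see a gap
            Request swap = new Request("POST", "/_aliases");
            swap.setJsonEntity(
                "{\"actions\":["
                + "{\"remove\":{\"index\":\"jobs_v1\",\"alias\":\"jobs\"}},"
                + "{\"add\":{\"index\":\"jobs_v2\",\"alias\":\"jobs\"}}"
                + "]}");
            client.performRequest(swap);

            // Then the old index can be deleted
            client.performRequest(new Request("DELETE", "/jobs_v1"));
        }
    }
}
```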
### Machine learning
- Job type / industry estimation
  - Job postings have no dedicated job type / industry field
  - Features are extracted from each posting with natural language processing, and a model trained on them estimates the job type; Chainer is used
- Annual income estimation
  - Some postings do not disclose annual income, so features are extracted from the posting and a model trained on them estimates the income
### Going forward
- Scaling the number of nodes up and down with auto scaling
- Optimizing search results with Learning to Rank
- Applying Word2Vec and similar techniques
### Q&A
- How do you evaluate query tuning?
  - We run searches and inspect the results by hand; if a change is bad, the CTR drops.
## Java 10 summary and what will happen in Java 11
https://www.slideshare.net/nowokay/java10-and-11
### Release model and support
- The year-and-month versioning idea is gone; the major version is simply incremented as before
- Maintenance releases bump the revision (10.0.1 in April, 10.0.2 in July); the minor version is always 0
- JDK 11, due 2018/9, is an LTS release ("LTS" is appended to the version string)
- Java SE 8 support has been extended to 2019/1 (three months after the JDK 11 release); for individual users it continues until 2020/12
- From JDK 11 onward
  - Applets and Web Start are no longer supported
  - JavaFX is no longer bundled (its development seems to continue); it was never bundled with OpenJDK in the first place
  - AWT / Swing still see active commits for JDK 11 (bug fixes rather than enhancements)
### JDK 10
- The most visible change is Local-Variable Type Inference (see the sketch after this list)
  - var is not a keyword but a reserved type name (you can still define a variable or method named var, but no longer a class named var)
  - Usage examples: assigning an anonymous class to a variable, or when the type name already appears on the right-hand side as in new ArrayList<>() (it seems better not to use var for wrapper types such as Optional / Flux)
- Java-Based JIT Compiler: Project Metropolis (Graal)
- Parallel Full GC for G1
  - Full GC should not be happening in the first place, but...
- Heap Allocation on Alternative Memory Devices
  - The heap can now be placed on non-volatile memory such as 3D XPoint
- OpenJDK now ships root certificates as well
- API changes
  - java.io.Reader#transferTo(Writer)
  - The process ID can now be obtained
  - java.util.List/Map/Set copyOf(Collection)
  - toUnmodifiableList/Set/Map collectors
  - These reduce the need to pull in Guava
- The JVM now reads the CPU count and memory size correctly (important in containers)
- The 32-bit JDK is no longer shipped
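
A few of the JDK 10 items above in one small program (the anonymous-class assignment mirrors the var use case mentioned; ProcessHandle itself dates from JDK 9):

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

public class Jdk10Demo {
    public static void main(String[] args) throws Exception {
        // var: the type is already visible on the right-hand side
        var names = new ArrayList<String>();
        names.add("java");

        // var with an anonymous class keeps its extra members accessible
        var greeter = new Object() {
            String greet() { return "hello, " + ProcessHandle.current().pid(); }
        };
        System.out.println(greeter.greet()); // also shows the process id

        // List.copyOf: unmodifiable copy (reduces the need for Guava)
        List<String> frozen = List.copyOf(names);

        // Reader#transferTo(Writer), new in JDK 10
        Reader in = new StringReader("payload");
        var out = new StringWriter();
        in.transferTo(out);

        System.out.println(frozen + " / " + out);
    }
}
```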
### JDK 11
- Launch Single-File Source-Code Programs
  - A single source file can be run without compiling it first
  - java Hello.java
- Raw String Literals
  - How to handle indentation is still under discussion; the feature may slip if that is not settled
- Switch Expressions
  - A case label can also list multiple values
- Local-Variable Syntax for Lambda Parameters
- HTTP Client standardized (see the sketch after this list)
- Epsilon: A No-Op Garbage Collector
  - A baseline for evaluating GC performance; may also suit serverless workloads
- The Java EE and CORBA modules are removed
- Flight Recorder is now open source
- Unicode 9 and Unicode 10 are supported
- Nestmates
  - A Project Valhalla deliverable
- String
  - repeat(): allocates the memory for the repeated characters up front
  - strip(): also trims full-width spaces
  - lines()
    - Needed for the Raw String Literals implementation
- Predicate::not
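
A small JDK 11 program touching the standardized HTTP client, the new String methods, and Predicate::not; thanks to single-file source launch it can be run directly with java Jdk11Demo.java:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Predicate;

public class Jdk11Demo {
    public static void main(String[] args) throws Exception {
        // java.net.http: incubating since JDK 9, standardized in JDK 11
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request =
                HttpRequest.newBuilder(URI.create("https://example.com/")).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // strip() also removes the full-width space U+3000, unlike trim()
        System.out.println("[" + "\u3000hello\u3000".strip() + "]"); // [hello]
        System.out.println("-".repeat(20)); // buffer allocated up front

        // lines() + Predicate.not: count the non-blank lines of the body
        long nonBlank = response.body().lines()
                .filter(Predicate.not(String::isBlank))
                .count();
        System.out.println(nonBlank + " non-blank lines");
    }
}
```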
### Support
- Oracle JDK
  - Very expensive for web service operators
  - 100 servers on AWS → roughly 100 million yen (at list price)
- OpenJDK
  - Each release is supported for only six months (security patches provided)
  - Mark Reinhold has said OpenJDK may get LTS too, but there is no official announcement, and even with LTS the supported releases may not overlap
- AdoptOpenJDK
  - Sponsored by IBM and the London JUG
  - Provides four years of LTS support
- Zulu
  - 100 servers: $28,750/year
  - Unlimited servers: $258,750/year
## I want a microservices architecture even with old frameworks
https://docs.google.com/presentation/d/1OZFgxuJQacfTc-3SY-ldxEE4OM3KUaUocdwIdkmy1z8/edit#slide=id.g3b5fd37ef4_0_83
- They want to move old services toward microservices
  - To reduce operational load (by moving them onto a common platform)
  - To split services so that replacement becomes easier
### Spring Cloud Config / Netflix Eureka
- Netflix Eureka: service discovery (an internal DNS)
- Spring Cloud Config: stores DB connection strings and similar settings
- They want older, non-Spring-Boot services to load Spring Cloud Config at startup too
- RibbonClient
  - Caches application metadata held by Eureka
  - Load-balances access to services
- Eureka Client
  - An Eureka client is embedded in the Ribbon client
  - Minimal use: no load balancing, just resolve a service name to the location of a target server
  - Used with eureka.registration.enabled=false
  - Problem: the Eureka client inside the Ribbon client used by the existing framework stopped refreshing application metadata (making scale-in impossible)
  - Updating it meant touching two singleton objects
  - Fetching it from the DI container did not work either: the war mixes Seasar and Spring DI containers, and it could not be obtained from the Seasar side
- Final approach: skip the Eureka client and call Eureka's REST API directly with an HTTP client (see the sketch after this list)
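
The workaround in the last bullet might look like this (a sketch; the Eureka host, port, and service name are placeholders, and the real code would parse the returned instance list to pick a target server):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EurekaLookup {
    // GET /eureka/apps/{APP_NAME} returns the instances registered under
    // that service name (XML by default, JSON with this Accept header)
    public static String fetchInstances(String appName) throws Exception {
        URL url = new URL("http://eureka.internal:8761/eureka/apps/" + appName);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestProperty("Accept", "application/json");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
            // Parse out instance/hostName and port here to resolve a target
            return body.toString();
        }
    }
}
```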
### Spring Cloud Stream (Apache Kafka)
- They want to add post-processing to flows in an old service
  - Push notifications and email sending
  - They do not want to rewrite processing that other services already use; it should be shared across services
  - The modules must come in a form the older services can consume
  - The processing should run asynchronously
- Apache Kafka is used (1 source, 2 sinks); see the sketch after this list
- Topic design
  - Example: a push notification after some action
  - Make "what happened" the topic, e.g., Like / Follow / Post (do not make "push YYY to XXX" a topic)
- Changing the message format
  - Better to create a new topic
  - Once the source side starts sending messages to the new topic, the sink side switches over to it
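
With Spring Cloud Stream's classic annotation model, the source and one of the sinks for such a topic might look roughly like this (the binding setup, payload format, and the Like event are illustrative, not the talk's actual code):

```java
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Sink;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.messaging.support.MessageBuilder;

// Source side: publish "what happened" (a Like), not "send a push to XXX"
@EnableBinding(Source.class)
class LikeEventPublisher {
    private final Source source;

    LikeEventPublisher(Source source) {
        this.source = source;
    }

    void publish(String likedPostId) {
        source.output().send(
            MessageBuilder.withPayload("liked:" + likedPostId).build());
    }
}

// Sink side: one of the consumers (push notification); email sending
// would be a second sink reading the same topic
@EnableBinding(Sink.class)
class PushNotificationSink {
    @StreamListener(Sink.INPUT)
    void handle(String event) {
        // send the push notification asynchronously here
        System.out.println("push for " + event);
    }
}
```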
## Using gRPC in an advertising bidding system that handles 160,000 requests per second (Logicad)
https://www.slideshare.net/hiroiso/logicad16grpc
- So-net Media Networks: runs a DSP business (Logicad)
- Real-Time Bidding
  - An SSP manages website ad inventory
  - DSP companies submit bids
  - A bid that times out cannot join the auction (no sale) ⇒ reduce latency and improve throughput
- Latency
  - A response must come back within 100 ms at most
  - Network latency (Tokyo-Tokyo 1-2 ms, Tokyo-Taiwan 65 ms) plus bid-processing latency
  - Logicad bids in 3 ms on average
  - Throughput is also about 160,000 requests per second
- Architecture
  - Dozens of nginx and bidding servers in a mesh structure
  - Ad product information (image URL, size, landing page, etc.) is held locally because of latency
  - Aerospike (KVS): user information, ad budget consumption
  - AWS RDS: ad campaign information (predefined; data that needs to be loaded periodically)
  - Redis: ad fraud information (bot detection; data provided by a third party)
- Before adopting gRPC
  - Every server held the same ad product information, so servers had to be scaled up as the data grew
  - TB-class data (on the order of 100 million products) exceeded what each server could hold
- Standing up an ad product information server while keeping latency and throughput
  - Load balancing is also required
  - The bidding servers act as gRPC clients, and a clustered ad product information service is built as the gRPC server (Java 8)
### gRPC
- Candidates considered
  - Redis: single-threaded operation affects throughput (requests block)
  - Aerospike: data is sharded, which affects latency (multiple round trips occur, and multithreading increases connections)
  - gRPC + local DB: adopted
- gRPC
  - A fast RPC framework developed by Google
  - Speeds things up with HTTP/2 and Protocol Buffers
  - Used by Google, Netflix, Docker, Cisco, etc.
- Strengths of gRPC here
  - Lower latency: HTTP/2 (binary framing, header compression via indexing, can run without TLS) and Protocol Buffers (compact encoding)
  - Higher throughput: HTTP/2 multiplexing plus client-side load balancing
- h2c
  - HTTP/2 over plain TCP: browsers cannot use it, but on an internal-only network you can choose communication without TLS
- Multiplexing
  - A single TCP connection can carry tens of thousands of parallel calls; overhead such as 3-way handshakes is reduced, and requests do not block one another
- Protocol Buffers
  - Google's serializer
  - IDL-based
  - The compression rate improves if fields are identified by tag numbers rather than carrying field names as data, and if values are designed as numeric types rather than strings
- Client-side load balancing (see the sketch after this list)
  - The gRPC client's NameResolver (DNS) is extended to follow gRPC servers being added and removed (the server list is refreshed via an API call or on a schedule)
  - No external load balancer such as nginx is needed, which keeps operations simple
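
A sketch of what the client side of this setup can look like with recent grpc-java (the target address is a placeholder, and the custom NameResolver from the bullet above is not shown):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class BidderChannelFactory {
    // A channel to the ad product information cluster. A custom NameResolver
    // (not shown) would supply the server list and refresh it as servers are
    // added or removed, as described in the talk.
    public static ManagedChannel create() {
        return ManagedChannelBuilder
            .forTarget("ad-products.internal:50051")   // hypothetical address
            .usePlaintext()                            // h2c: HTTP/2 without TLS
            .defaultLoadBalancingPolicy("round_robin") // client-side LB
            .build();
    }
}
```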
### Results of actually applying gRPC
- Benchmarks
  - Stress test of the ad product information server alone (maximum throughput; behavior as the connection count changes)
  - Measurement of latency and throughput in a load-test environment (the full bidding flow)
- JMeter can generate load even against gRPC
  - Implement a class based on JavaSamplerClient and you can issue requests in any protocol (see the sketch after this list)
- Load test
  - Java 8, 128 GB RAM, 8 cores / 16 threads
  - 100 million records, 1.6 TB
  - Tens of thousands of requests per second were handled without problems ⇒ 3-4 ad product information servers are enough
  - The connection count stayed constant rather than growing with throughput (multiplexing handles the load over few TCP connections)
- Despite moving local processing onto an external server, the difference was only about 10% (within 0.1 ms)
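
A sketch of the JMeter approach: extend AbstractJavaSamplerClient (the convenience base class for JavaSamplerClient) and issue the gRPC call inside runTest. The gRPC stub and channel setup are elided here:

```java
import org.apache.jmeter.protocol.java.sampler.AbstractJavaSamplerClient;
import org.apache.jmeter.protocol.java.sampler.JavaSamplerContext;
import org.apache.jmeter.samplers.SampleResult;

// Custom JMeter sampler: lets JMeter apply load over an arbitrary
// protocol, here a gRPC call to the ad product information server
public class GrpcSampler extends AbstractJavaSamplerClient {
    @Override
    public SampleResult runTest(JavaSamplerContext context) {
        SampleResult result = new SampleResult();
        result.sampleStart();
        try {
            // stub.lookupProducts(request); // hypothetical gRPC call
            result.setSuccessful(true);
        } catch (Exception e) {
            result.setSuccessful(false);
        }
        result.sampleEnd();
        return result;
    }
}
```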
## The story of building a server application with DDD and clean architecture
- Rushed construction does not work
  - The outline of the spec was fixed and parts of the core functions already existed, so they wanted something visibly working as soon as possible
  - Prototyping proceeded while the functional spec was still being written
  - Under schedule pressure the prototype nearly became the production code ⇒ it could not be released
- Lessons from that
  - Clients downloaded everything from the server and processed it locally (abnormal traffic volume, growing client-side processing load)
  - Spec changes at external integration points rippled through every component
  - No component owned the business logic, so none could concentrate on core logic and each carried conversion work
### Design
- Design order
  - Define the elevator pitch
  - Define functional specs and usage scenarios (non-functional requirements too)
  - Robustness analysis
  - Extract bounded contexts
  - Create a context map
- Robustness analysis
  - Create use-case scenarios beforehand
  - Think through how to build functions that satisfy each scenario
  - Express them as boundaries (screens, cron), entities (data to manage), and controls (processing: user authentication, value retrieval, etc.)
  - Drawn by hand in notebooks, written and erased; two people analyzed separately and the better-looking result was adopted
- Context map
  - Group similar functions from the robustness diagrams and define each group as a bounded context
  - Write → fix → write → fix
  - Relationships between contexts (partnership, customer/supplier, conformist) were judged carefully
- Ubiquitous language
  - Establish terms specific to each context
  - The system is easier to understand when terms are unified across contexts and the same term is never used with different meanings in different contexts
  - The same names were carried into the program → this worked well
### Things to prepare before implementation
- Rules
  - Git repository workflow, coding standards, package layout, REST API conventions, log levels
  - Exceptions are defined in a class close to where they occur
- Code review checklist
  - Can someone else maintain it, is it extensible, is there waste, is the range of influence too wide? ⇒ written up in the wiki
  - Safety (are instance variables of @Autowired classes being rewritten; is cache expiry taken into account?)
### Architecture in each context
- Interface layer
  - Controller: requests
  - Presenter: responses
  - Gateway: storage and external services
  - Each has a translator (translates between the domain's words and the outside world's words)
- Usecase layer
- Domain layer
  - Entity, ValueObject, Service, Repository
- Published Language: defines each context's input/output data; other contexts use the classes under the callee's pl package
- library: code worth sharing across contexts that has no involvement in business logic
- Repository implementation (see the sketch after this list)
  - Only the interface is defined in Domain; the actual processing lives in Gateway
  - The Domain repository is deliberately not defined 1:1 against tables (it stays unaware of where data is actually stored)
  - Gateway defines the physical side
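
A minimal sketch of the repository split described above (all names illustrative): the Domain layer sees only the interface, while the Gateway supplies the physical implementation; an in-memory map stands in for the real storage:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Minimal domain types for the sketch
final class UserId {
    final String value;
    UserId(String value) { this.value = value; }
}

final class User {
    final UserId id;
    final String name;
    User(UserId id, String name) { this.id = id; this.name = name; }
}

// Domain layer: the repository speaks only in domain terms. It deliberately
// does not map 1:1 onto a table and says nothing about where data lives.
interface UserRepository {
    Optional<User> findBy(UserId id);
    void store(User user);
}

// Gateway layer: owns the physical side (RDB, cache, external service, ...)
class UserGateway implements UserRepository {
    private final Map<String, User> storage = new HashMap<>();

    @Override
    public Optional<User> findBy(UserId id) {
        return Optional.ofNullable(storage.get(id.value));
    }

    @Override
    public void store(User user) {
        storage.put(user.id.value, user);
    }
}
```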
### What I found after implementing it
- You can focus on your own concerns
  - The scope you must reason about for @Transactional can stay narrow
  - Because terms are unified, external specs connect cleanly to the code
- The processing flow is patterned, so it is easy to follow
- A storm of conversions, though...
- Good things
  - Freed from the monolith and shared RDB, each person can focus on the business logic of their own part, with clear responsibility boundaries thanks to DB/cache separation (the DB can be treated as your own)
  - Trying new technology: Docker, AWS ECS (they wanted Fargate, but it had not come to Japan yet and 200-300 ms of latency forced them to give it up), GitLab CI
- Room for improvement
  - Enforcing the rules more thoroughly