This is a framework that lets you easily use the Map/Reduce model in C++. Since C++ is a compiled language, spreading code among multiple computers is not straightforward; this framework does it by means of a shared library and polymorphism, so the user only needs to implement the map and reduce functions and the serialization of the used types (all primitive types are available out of the box). Moreover, data is processed concurrently on each computer.
To use this utility you need the following third-party libraries:
- Libssh
- Boost (System, Filesystem, Program options)
On Ubuntu you can easily install them with the following command:
```
sudo apt install libssh-dev libboost-all-dev
```

Also, a Dockerfile with the libraries preinstalled is provided. You can build the image with:

```
docker build . -t milishchuk/mapreduce
```

To build and install the framework:

```
mkdir build
cd build
cmake ..
make
make install
```

This repo contains an example which you can use for better understanding. The implementation can be divided into the following phases:
If you need a specific type for your data, implement the KeyValueType interface with the `void parse(const std::string&)` and `std::string to_string() const` methods, or use one of the implemented primitives (char, int, double, long). Also implement a custom KeyValueTypeFactory for this type.
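As a sketch of what a custom type might look like — note that the `KeyValueType` base class below is a minimal stand-in assumed for illustration; the real declaration lives in the framework's headers and may differ in detail:

```cpp
#include <sstream>
#include <string>

// Minimal stand-in for the library's KeyValueType interface (assumed shape).
struct KeyValueType {
    virtual void parse(const std::string&) = 0;
    virtual std::string to_string() const = 0;
    virtual ~KeyValueType() = default;
};

// A hypothetical custom type: a 2D point serialized as "x,y".
struct Point : KeyValueType {
    double x = 0, y = 0;

    void parse(const std::string& s) override {
        std::istringstream in(s);
        char sep;
        in >> x >> sep >> y;  // expects the "x,y" format produced below
    }

    std::string to_string() const override {
        std::ostringstream out;
        out << x << ',' << y;
        return out.str();
    }
};
```

The only contract is that `parse` can read back whatever `to_string` produces, so the framework can move values between nodes as strings.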
Implement Map and Reduce classes inheriting from the map_base and reduce_base interfaces.
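A word-count job is the classic illustration. The `map_base`/`reduce_base` shapes below are assumptions specialized to string keys and int values purely for readability; the real interfaces work through the KeyValueType machinery described above:

```cpp
#include <string>
#include <utility>
#include <vector>

// Assumed minimal stand-ins for the framework's interfaces.
struct map_base {
    virtual std::pair<std::string, int> map(const std::string& key,
                                            const std::string& value) = 0;
    virtual ~map_base() = default;
};

struct reduce_base {
    virtual int reduce(const std::string& key,
                       const std::vector<int>& values) = 0;
    virtual ~reduce_base() = default;
};

// Word count: map emits (word, 1); reduce sums the counts for one word.
struct WordCountMap : map_base {
    std::pair<std::string, int> map(const std::string&,
                                    const std::string& word) override {
        return {word, 1};
    }
};

struct WordCountReduce : reduce_base {
    int reduce(const std::string&, const std::vector<int>& counts) override {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }
};
```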
Implement the function `std::shared_ptr<job_config> get_config()`, which returns a config with the map and reduce functions and factories for key_in, key_out, value_in, value_out, and value_res. The following diagram shows how the types flow through the pipeline:
```
                      map                              groupby                      reduce
key_in, value_in  ==>  key_out, value_out
key_in, value_in  ==>  key_out, value_out
key_in, value_in  ==>  key_out, value_out  ======>  key_out, [value_out]  =====>  key_out, value_res
key_in, value_in  ==>  key_out, value_out
key_in, value_in  ==>  key_out, value_out
```
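The entry point the framework looks for might be shaped roughly like this; `job_config`'s actual members are defined in the library's headers (see the example in this repo), so the field here is a placeholder:

```cpp
#include <memory>

// Stand-in for the library's job_config; the real struct holds the map/reduce
// instances and the five type factories described above.
struct job_config {
    int reducers_num = 1;  // illustrative field only
};

// The framework loads your shared library and calls this function to obtain
// the job description.
std::shared_ptr<job_config> get_config() {
    return std::make_shared<job_config>();
}
```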
Next, you need to build a shared library exposing the `std::shared_ptr<job_config> get_config()` function.
You can use the blocking run_task_blocking function or the non-blocking run_task, which returns a std::future. All data for the map nodes is read from files on the corresponding computers.
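The future-based pattern works along these lines; `run_task` is stubbed here with std::async purely to illustrate the calling convention, since the real function dispatches the job to the configured nodes:

```cpp
#include <future>
#include <string>

// Stand-in for the library's non-blocking run_task.
std::future<std::string> run_task() {
    return std::async(std::launch::async, [] { return std::string("done"); });
}
```

Usage: `auto fut = run_task();`, do other work while the job runs, then `fut.get()` blocks until the result arrives at the master node.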
For quick environment setup you can use the run_example.sh script, which launches 4 map nodes, a reduce node, and a master node to which the result is returned.
Roman Milishchuk @Midren