This is the article for day 10 of the Nextremer Advent Calendar 2016. At university I study a field that combines **machine learning and materials development**. I wrote this article hoping that **people who love machine learning will get a feel for what it is like to be a materials researcher**. (Actually, I also want materials researchers to get interested in AI and incorporate machine learning into materials work more.)
First of all, what exactly is materials development, and what counts as a "material"? As the name suggests, there are many kinds: ceramics, polymers, metals, and so on.
Take the iPhone, for example. Inside it are hundreds of tiny ceramic capacitors. To make a high-performance capacitor good enough to be chosen by Apple, you have to solve difficult problems such as:

- Which elements should be combined?
- What kind of process should be used to make them?

This is just one example, but broadly speaking, materials research means making great materials by making full use of process optimization, measurement, theory, analysis, exploration, and so on. It is a messy business, but that is the general picture.
Developing materials with machine learning means, of course, learning features from data and then predicting and searching for materials with good physical properties.
However, the reality is that AI has not yet produced many results in materials research, largely because there is simply not enough data.
There are two main types of materials data:

- Experimental measurement data
- Data from computational analysis
The experimental measurement data is:
- in papers
- gathering dust in lab notebooks
- scattered across laboratories and companies around the world
That is the situation. I would love to throw the failure data that never makes it into papers at Deep Learning, but it just sits in lab notebooks gathering dust. I would like one huge database, but that seems difficult because everyone wants to keep their important results to themselves.
As for computational data, there is an organization called the Materials Project that makes its database freely available. The data published there is computed with an intimidating-sounding method called **first-principles calculation**, and it can easily be retrieved in REST format from https://materialsproject.org/!
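For instance, pymatgen (used later in this article) ships a client for that REST API. The following is only a minimal sketch: "YOUR_API_KEY" is a placeholder for a key issued by the Materials Project, and the import path and method names may differ between pymatgen versions.

```python
# Minimal sketch: query the Materials Project REST API through pymatgen's
# MPRester client. "YOUR_API_KEY" is a placeholder; import paths and method
# names may differ between pymatgen versions.
from pymatgen import MPRester

mpr = MPRester("YOUR_API_KEY")
# get_data() returns a list of dicts of computed properties for a formula
for entry in mpr.get_data("Fe2O3"):
    print(entry["material_id"], entry.get("band_gap"))
```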
For those who are interested:

What is a first-principles calculation? A general term for calculation methods carried out on the basis of first principles. (from [Wikipedia](https://ja.wikipedia.org/wiki/%E7%AC%AC%E4%B8%80%E5%8E%9F%E7%90%86%E8%A8%88%E7%AE%97))
So it says.
What are first principles? In the natural sciences, first principles are the most fundamental laws, containing no approximations or empirical parameters, on the premise that natural phenomena can be explained starting from them. (from [Wikipedia](https://ja.wikipedia.org/wiki/%E7%AC%AC%E4%B8%80%E5%8E%9F%E7%90%86#.E8.87.AA.E7.84.B6.E7.A7.91.E5.AD.A6.E3.81.AB.E3.81.8A.E3.81.91.E3.82.8B.E7.AC.AC.E4.B8.80.E5.8E.9F.E7.90.86))
Put simply, it means computing the behavior of very complicated substances using only the fundamental physical laws governing electrons and atoms. However,
The number of atoms that the so-called first-principles electronic-structure methods can handle is still only about 100 to 1000 as of 2003, far below the Avogadro constant. That is roughly the level at which a protein (or amino acid) with the simplest structure, on the order of 1000 atoms, might finally be treated. (from [Wikipedia](https://ja.wikipedia.org/wiki/%E7%AC%AC%E4%B8%80%E5%8E%9F%E7%90%86%E8%A8%88%E7%AE%97))
Things have progressed tremendously by 2016, but the reality is still that calculating complicated materials and properties takes a great deal of computing cost and know-how. So if you want to collect a large amount of data, you either grind through the calculations on your own machines or take them from the Materials Project.
Still, this is the era of AI and machine learning. Making materials with AI would go something like:

**① Predict the physical properties of a material from its chemical formula with AI → ② Search the enormous space of atomic combinations with AI → ③ Have AI optimize the process and control robots to actually make the material**
At present, not even **①** is anywhere near solved. So, finally getting to the main topic: let's try **physical property prediction** together and get a feel for being a materials researcher armed with **machine learning**!
There is actually a company called Citrine Informatics that offers a service supporting materials development with AI, and they have published a nice tutorial, so let's follow it. Source: MACHINE LEARNING FOR THE MATERIALS SCIENTIST
**Purpose**
We will machine-learn the **band gap** of materials. The band gap represents the size of the barrier electrons face when they move. If the band gap is wide, electricity does not flow (an insulator). If the band gap is narrow or absent, electricity flows (a metal). If it is somewhere in between, it conducts under some conditions and not others (a semiconductor).
**What you need**

- Python 3.5 (the 2.x series would probably work too, but I haven't tried it)
**Install**
```
pip install scikit-learn
pip install numpy
pip install pymatgen
```
**Try machine learning with only the chemical formula as the explanatory variable**
First, get the data: download the file bandgapDFT.csv from here. (Incidentally, DFT stands for Density Functional Theory, a first-principles calculation method.) Now let's import the required libraries and load the downloaded CSV.
```python:bandgap.py
from pymatgen import Composition, Element
from numpy import zeros, mean

trainFile = open("bandgapDFT.csv", "r").readlines()
```
Looking at the data, it should look something like this:

```
LiH,2.981
BeH2,5.326
B9H11,2.9118
B2H5,6.3448
BH3,5.3234
B5H7,3.5551
H34C19,5.4526
H3N,4.3287
```

(the rest is omitted)
The first column is the chemical formula and the second column is the band gap (eV). Next, let's create a function that turns a chemical formula into a fixed-length vector for machine learning. The `composition` that appears there is a pymatgen `Composition` object, from which you can get the constituent atoms, their composition ratios, and so on; a quick illustration follows, and then the function itself.
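As a small aside, here is a quick look at what a `Composition` object exposes (a minimal illustration; BeH2 is taken from the data above, and the attributes used are pymatgen's documented ones):

```python
from pymatgen import Composition

comp = Composition("BeH2")
for element in comp:
    # element symbol, atomic number Z, atomic fraction in the compound,
    # and Pauling electronegativity X
    print(element.symbol, element.Z,
          comp.get_atomic_fraction(element), element.X)
```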
```python:bandgap.py
# input: pymatgen Composition object
# output: composition feature vector
def naiveVectorize(composition):
    vector = zeros((MAX_Z))
    for element in composition:
        # element is an atom; fraction is that atom's share of the composition
        fraction = composition.get_atomic_fraction(element)
        vector[element.Z - 1] = fraction
    return(vector)
```
Now read the chemical formulas and band gaps from the CSV, and build the explanatory-variable vectors from the chemical formulas with the function above (i.e., generate the training data).
```python:bandgap.py
materials = []
bandgaps = []
naiveFeatures = []

MAX_Z = 100  # maximum length of the feature vector

for line in trainFile:
    split = str.split(line, ',')
    material = Composition(split[0])
    materials.append(material)                      # chemical formula
    naiveFeatures.append(naiveVectorize(material))  # generate the feature vector
    bandgaps.append(float(split[1]))                # read the band gap
```
The feature vector is extremely sparse, like this:
```
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
  0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
  0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
  0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
  0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
  0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
  0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
  0. , 0. , 0. , 0. , 0. , 0.5, 0. , 0. , 0. , 0. , 0. ,
  0. , 0. , 0. , 0. , 0. , 0.5, 0. , 0. , 0. , 0. , 0. ,
  0. ]
```
With that, the training data is ready. Next we will do the machine learning using scikit-learn's convenient modules. But before that, let's compute a simple baseline, the mean absolute deviation from the mean band gap, against which to judge prediction accuracy.
```python:bandgap.py
baselineError = mean(abs(mean(bandgaps) - bandgaps))
print("Mean Absolute Error : " + str(round(baselineError, 3)) + " eV")
```
This should come out to about 0.728 eV. Now let's build a band-gap predictor with a random forest.
```python:bandgap.py
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
from sklearn import linear_model, metrics, ensemble

# scikit-learn random forest regressor
rfr = ensemble.RandomForestRegressor(n_estimators=10)

# cross-validation
cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
scores_composition = cross_val_score(rfr, naiveFeatures, bandgaps,
                                     cv=cv, scoring='neg_mean_absolute_error')

print("Mean Absolute Error by Random Forest with composition data: "
      + str(round(abs(mean(scores_composition)), 3)) + " eV")
```
If the mean error comes out to around 0.36 eV, you're done. Compared with the 0.728 eV baseline, the model seems to have actually learned something.
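As a quick extra check (my own addition, not part of the original tutorial), you can also fit the forest on all of the composition data and predict the band gap of an arbitrary formula; ZnO below is just an illustrative example.

```python
# Quick extra check (not in the original tutorial): fit on all the data and
# predict the band gap of an arbitrary formula. "ZnO" is just an example.
rfr.fit(naiveFeatures, bandgaps)
prediction = rfr.predict([naiveVectorize(Composition("ZnO"))])
print("Predicted band gap for ZnO: " + str(round(prediction[0], 3)) + " eV")
```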
**Let's use physical values as explanatory variables**

So far we have used only which atoms a material contains as the explanatory variables. Next, let's use physical values such as the composition ratio, electronegativity, and periodic-table group of the atoms as explanatory variables. The vector is only 4-dimensional; for example, BeH2 becomes [2.0, 0.63, 1.0, 2.0].
```python:bandgap.py
physicalFeatures = []

for material in materials:
    theseFeatures = []
    fraction = []
    atomicNo = []
    eneg = []
    group = []

    for element in material:
        fraction.append(material.get_atomic_fraction(element))
        atomicNo.append(float(element.Z))
        eneg.append(element.X)
        group.append(float(element.group))

    # order the two elements so that the more abundant one comes first
    mustReverse = False
    if fraction[1] > fraction[0]:
        mustReverse = True

    for features in [fraction, atomicNo, eneg, group]:
        if mustReverse:
            features.reverse()

    theseFeatures.append(fraction[0] / fraction[1])  # composition ratio
    theseFeatures.append(eneg[0] - eneg[1])          # electronegativity difference
    theseFeatures.append(group[0])                   # group of the first element
    theseFeatures.append(group[1])                   # group of the second element
    physicalFeatures.append(theseFeatures)
```
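As a quick sanity check (again my own addition), assuming BeH2 is the second row of the CSV as in the sample above, its features should be close to [2.0, 0.63, 1.0, 2.0], up to floating-point noise:

```python
# Sanity check (not in the original tutorial): the second row of the CSV is
# BeH2, so its physical features should be roughly [2.0, 0.63, 1.0, 2.0].
print(materials[1], physicalFeatures[1])
```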
Let's train the random forest in the same way, this time using the physicalFeatures explanatory variables we just created.
```python:bandgap.py
scores_physical = cross_val_score(rfr, physicalFeatures, bandgaps,
                                  cv=cv, scoring='neg_mean_absolute_error')

print("Mean Absolute Error by Random Forest with physical data: "
      + str(round(abs(mean(scores_physical)), 3)) + " eV")
```
This should give something like: Mean Absolute Error by Random Forest with physical data: 0.267 eV
To compare:

- Random forest with chemical formula only: 0.362 eV
- Random forest with physical quantities: 0.267 eV
So using physical quantities as explanatory variables turns out to be more accurate (if you are curious which of the four features the forest relies on most, see the sketch below). None of this is practically useful yet, but I hope it gave you some feel for being a materials researcher working with machine learning.
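One rough way to see which of the four physical features matters most (not in the original tutorial) is to fit the same forest on all of the data and inspect scikit-learn's feature_importances_; the feature names below are just my own labels, and the exact numbers vary from run to run.

```python
# Rough look at feature importances (not in the original tutorial).
rfr.fit(physicalFeatures, bandgaps)
names = ["composition ratio", "electronegativity difference",
         "group of element 1", "group of element 2"]
for name, importance in zip(names, rfr.feature_importances_):
    print(name + ": " + str(round(importance, 3)))
```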
In fact, there are already many papers proposing physical-property prediction with neural networks and decision trees, and materials-search methods based on genetic algorithms. They are not yet applicable at a practical level, but I believe materials research using AI will surely take off in the future. That said, I feel Japan lags far behind the United States and other countries in this field. If you specialize in materials or physical properties, I would love for you to pursue research that incorporates machine learning.
Thank you very much!