Python continues to lead the way in solving data science tasks and challenges.

What makes Python so popular is its large number of libraries. This post outlines the most helpful Python libraries for data science in 2018. Our selection actually covers more than 20 libraries, because some of them are interchangeable and solve the same problem, so we placed them in the same group.

Core libraries and statistics

1. NumPy (Commits: 17911, Contributors: 641)

Official website: http://www.numpy.org/

NumPy is one of the fundamental packages for scientific computing in Python. It handles large multidimensional arrays and matrices, and it provides a large collection of high-level mathematical functions for operating on these objects.
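As a minimal sketch of what working with such arrays looks like, the snippet below builds a small matrix and applies element-wise arithmetic, a reduction along an axis, and a matrix product. The variable names are illustrative only.

```python
import numpy as np

# A 2-D array (matrix) with 3 rows and 4 columns holding the values 0..11.
a = np.arange(12).reshape(3, 4)

# Element-wise arithmetic applies to the whole array at once, no loop needed.
doubled = a * 2

# Reduction along an axis: axis=0 collapses the rows, giving one mean per column.
col_means = a.mean(axis=0)

# Matrix product of a (3x4) with its transpose (4x3) yields a 3x3 matrix.
gram = a @ a.T

print(doubled.shape, col_means, gram.shape)
```

The same operations written with plain Python lists would need explicit nested loops, which is the main convenience NumPy removes.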

2. SciPy (Commits: 19150, Contributors: 608)

Official website: https://scipy.org/scipylib/

SciPy is another core library for scientific computing. It is built on NumPy and extends its functionality. The main SciPy data structure is again the multidimensional array, implemented by NumPy. The package contains tools for linear algebra, probability theory, integral calculus, and many other tasks. In addition, recent SciPy releases wrap many new BLAS and LAPACK functions.
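To illustrate two of the areas mentioned here, the following sketch integrates a function numerically and solves a small linear system. The system and the integrand are arbitrary examples chosen so the results are easy to check by hand.

```python
import numpy as np
from scipy import integrate, linalg

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2.
# quad returns the estimate together with an estimate of the absolute error.
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Solve the linear system A x = b:
#   3*x0 + 1*x1 = 9
#   1*x0 + 2*x1 = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

print(area, x)
```

Both inputs and outputs are ordinary NumPy arrays, which is what building on NumPy means in practice.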

3. Pandas (Commits: 17144, Contributors: 1165)

Official website: https://pandas.pydata.org/

Pandas is a Python library that provides high-level data structures and a variety of analysis tools. Its main strength is the ability to express fairly complex data operations in one or two commands. Pandas includes many built-in methods for grouping, filtering, and combining data, as well as time series functionality.
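A short sketch of the grouping, filtering, and combining operations mentioned here, each in a single command. The two tables and their column names are made up for the example.

```python
import pandas as pd

# Hypothetical sales records and a lookup table mapping cities to regions.
sales = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome", "Rome"],
    "amount": [10, 20, 5, 15, 40],
})
regions = pd.DataFrame({
    "city": ["Oslo", "Rome"],
    "region": ["North", "South"],
})

# Filtering: keep only the rows where the amount is at least 15.
large = sales[sales["amount"] >= 15]

# Grouping: total amount per city.
totals = sales.groupby("city")["amount"].sum()

# Combining: attach the region to every sale with a left join on "city".
merged = sales.merge(regions, on="city", how="left")

print(large, totals, merged, sep="\n\n")
```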

4. StatsModels (Commits: 10067, Contributors: 153)

Official website: http://www.statsmodels.org/devel/

Statsmodels is a Python module that offers many tools for statistical analysis, such as estimating statistical models and performing statistical tests. With its help, you can implement many machine learning methods and explore different plotting options.

The library continues to evolve and gain new features. This year brought time series improvements and new count models, namely GeneralizedPoisson, zero-inflated models, and NegativeBinomialP, as well as new multivariate methods: factor analysis, multivariate analysis of variance (MANOVA), and repeated measures ANOVA.

Visualization

5. Matplotlib (Commits: 25747, Contributors: 725)

Official website: https://matplotlib.org/index.html

Matplotlib is a low-level library for creating 2D charts and graphs. With its help, you can build a variety of charts, from histograms and scatter plots to plots in non-Cartesian coordinates. In addition, many popular plotting libraries are designed to work in conjunction with matplotlib.
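As a rough sketch of two of the chart types named here, the snippet below draws a histogram and a scatter plot side by side from random data. The non-interactive "Agg" backend and the in-memory buffer are assumptions made so the script runs without a display or a writable directory; in a notebook you would normally just show the figure.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# 200 random values from a standard normal distribution (seeded for repeatability).
rng = np.random.default_rng(0)
values = rng.normal(size=200)

# One figure with two axes: a histogram on the left, a scatter plot on the right.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# hist returns the bin counts and bin edges along with the drawn bars.
counts, bin_edges, _ = ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

# Plot each value against the next one in the sequence.
ax_scatter.scatter(values[:-1], values[1:], s=8)
ax_scatter.set_title("Scatter plot")

# Render the figure as PNG into an in-memory buffer instead of a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```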

6. Seaborn (Commits: 2044, Contributors: 83)

Official website: https://seaborn.pydata.org/

Seaborn is essentially a high-level API built on the matplotlib library. Its default settings are better suited to working with charts. It also offers a rich gallery of visualizations, including complex types such as time series, joint plots (jointplots), and violin plots.

7. Plotly (Commits: 2906, Contributors: 48)

Official website: https://plot.ly/python/

Plotly is a popular library that makes it easy to build sophisticated graphics. It is suited to interactive web applications and supports visualizations such as contour plots, ternary plots, and 3D charts.

8. Bokeh (Commits: 16983, Contributors: 294)

Official website: https://bokeh.pydata.org/en/latest/

The Bokeh library uses JavaScript widgets to create interactive and scalable visualizations in the browser. It provides a versatile collection of charts, styling options, interaction through linked plots, widgets, and callbacks, and many more useful features.

9. Pydot (Commits: 169, Contributors: 12)

Official website: https://pypi.org/project/pydot/

Pydot is a library for generating complex directed and undirected graphs. It is an interface to Graphviz written in pure Python. With its help, you can display the structure of a graph, which is often needed when building neural networks and decision tree-based algorithms.

Machine learning

10. Scikit-learn (Commits: 22753, Contributors: 1084)

Official website: http://scikit-learn.org/stable/

This module, built on NumPy and SciPy, is one of the best libraries for working with data. It provides algorithms for many standard machine learning and data mining tasks such as clustering, regression, classification, dimensionality reduction, and model selection. Improve your skills with Data Science School.

Data Science School: http://datascience-school.com/
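To make the scikit-learn description concrete, here is a minimal classification sketch on the iris dataset that ships with the library. The choice of logistic regression, the 25% test split, and the random seed are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the bundled iris dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to evaluate the model on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier; max_iter is raised so the solver has room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Mean accuracy on the held-out test set.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Other estimators follow the same fit and score pattern, so swapping in a different algorithm usually means changing only the line that creates the model.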

11. XGBoost / LightGBM / CatBoost (Commits: 3277 / 1083 / 1509, Contributors: 280 / 79 / 61)

Official websites:

http://xgboost.readthedocs.io/en/latest/

http://lightgbm.readthedocs.io/en/latest/Python-Intro.html

https://github.com/catboost/catboost

Gradient boosting is one of the most popular machine learning algorithms. It builds an ensemble of successively refined elementary models, namely decision trees. Specialized libraries have been designed to implement this method quickly and conveniently, and we think XGBoost, LightGBM, and CatBoost deserve special attention. They are competitors that solve a common problem and are used in much the same way. These libraries provide highly optimized, scalable, and fast gradient boosting implementations, which makes them very popular among data scientists and Kaggle competitors, many of whom have won contests with the help of these algorithms.

12. Eli5 (Commits: 922, Contributors: 6)

Official website: https://eli5.readthedocs.io/en/latest/

Often the predictions of a machine learning model are not entirely clear, and this is the problem Eli5 helps to address. It is a package for visualizing and debugging machine learning models and for tracing the work of an algorithm step by step. It supports the scikit-learn, XGBoost, LightGBM, lightning, and sklearn-crfsuite libraries and performs different tasks for each of them.

Deep learning

13. TensorFlow (Commits: 33339, Contributors: 1469)

Official website: https://www.tensorflow.org/

TensorFlow is a popular deep learning and machine learning framework developed by Google Brain. It lets you work with artificial neural networks on multiple data sets. Object identification and speech recognition are among the most popular TensorFlow applications. There are also various layer helpers built on top of regular TensorFlow, such as tflearn, tf-slim, and skflow.

14. PyTorch (Commits: 11306, Contributors: 635)

Official website: https://pytorch.org/

PyTorch is a large framework for performing tensor computations with GPU acceleration, creating dynamic computational graphs, and automatically calculating gradients. On top of this, PyTorch provides a rich API for neural network applications. The library is based on Torch, an open source deep learning library implemented in C.

15. Keras (Commits: 4539, Contributors: 671)

Official website: https://keras.io/

Keras is a high-level library for working with neural networks. It runs on top of TensorFlow and Theano, and with the new releases it can also use CNTK and MxNet as backends. It simplifies many specific tasks and greatly reduces the amount of boilerplate code. However, it may not be suitable for some complicated tasks.

Distributed deep learning

16. Dist-keras / elephas / spark-deep-learning (Commits: 1125 / 170 / 67, Contributors: 5 / 13 / 11)

Official websites:

http://joerihermans.com/work/distributed-keras/

https://pypi.org/project/elephas/

https://databricks.github.io/spark-deep-learning/site/index.html

Deep learning problems are becoming more and more important as more and more use cases demand considerable effort and time. However, a distributed computing system such as Apache Spark makes it much easier to process large amounts of data, which in turn extends the possibilities for deep learning. As a result, dist-keras, elephas, and spark-deep-learning are rapidly gaining popularity and developing quickly, and it is difficult to single out one library because they are all designed to solve a common task. These packages let you train neural networks built with Keras directly with the help of Apache Spark. Spark-deep-learning also provides tools for creating pipelines with Python neural networks.

Natural language processing

17. NLTK (Commits: 13041, Contributors: 236)

Official website: https://www.nltk.org/

NLTK is a set of libraries that forms a complete platform for natural language processing. With NLTK, you can process and analyze text in a variety of ways, tokenize and tag it, extract information, and more. NLTK is also used for prototyping and building research systems.

18. SpaCy (Commits: 8623, Contributors: 215)

Official website: https://spacy.io/

SpaCy is a natural language processing library with excellent examples, API documentation, and demo applications. The library is written in Cython, a C extension of Python. It supports nearly 30 languages and provides easy deep learning integration, aiming for robustness and high accuracy. Another important feature of SpaCy is an architecture designed to process entire documents without breaking them into phrases.

19. Gensim (Commits: 3603, Contributors: 273)

Official website: https://radimrehurek.com/gensim/

Gensim is a Python library for robust semantic analysis, topic modeling, and vector space modeling, built on top of NumPy and SciPy. It provides implementations of popular NLP algorithms such as word2vec. Although gensim has its own models.wrappers.fasttext implementation, the fasttext library can also be used to learn word representations efficiently.

Data collection

20. Scrapy (Commits: 6625, Contributors: 281)

Official website: https://scrapy.org/

Scrapy is a library for creating web crawlers that scan web pages and collect structured data. In addition, Scrapy can extract data from APIs. Thanks to its scalability and portability, the library is very convenient to use.

Conclusion

The list above is our selection of Python libraries for data science in 2018. Compared to the previous year, some new, modern libraries are gaining popularity, while the libraries that have become classics for data science tasks keep improving.