
Deriving actionable insights from your data with essential data mining techniques.

Businesses today have access to massive amounts of data, typically collected and stored in both structured and unstructured forms and gleaned from sources such as customer records, transactions, and third-party vendors. Making sense of this data, however, is challenging and requires the right skills, tools, and techniques to extract meaningful information from it. This is where data mining plays its role: extracting information from a given dataset and identifying trends, patterns, and useful data.

Data mining refers to the use of refined data analysis tools to discover previously unknown, valid patterns and relationships in huge datasets. It integrates statistical models, machine learning techniques, and mathematical algorithms, such as neural networks, to derive insight. To make sense of their data, businesses should therefore consider the following data mining techniques.

Here is a look at the top data mining techniques that can help extract optimal results.

Data Cleaning

Because businesses often gather raw data, that data must be analyzed and formatted correctly before use. Proper data cleaning lets businesses understand and prepare the data for different analytic methods. Typically, data cleaning and preparation involve elements of data modeling, transformation, data migration, ETL, ELT, data integration, and aggregation.
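As a sketch of what this looks like in practice, the snippet below (pure Python, with invented field names and records) deduplicates rows and imputes a default for missing values:

```python
def clean(records, default_age=0):
    """Drop exact duplicate records and fill missing 'age' values."""
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec.get("name"), rec.get("age"))
        if key in seen:
            continue  # skip duplicate rows
        seen.add(key)
        rec = dict(rec)  # copy so the raw input is left untouched
        if rec.get("age") is None:
            rec["age"] = default_age  # impute a default for missing values
        cleaned.append(rec)
    return cleaned

raw = [
    {"name": "Ann", "age": 34},
    {"name": "Ann", "age": 34},   # duplicate
    {"name": "Bob", "age": None},  # missing value
]
print(clean(raw))  # two records remain; Bob's age imputed
```

In real pipelines the same steps are usually done with a library such as pandas (`drop_duplicates`, `fillna`), but the logic is the same.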


Association

Association identifies patterns across transactions: it indicates that certain data, or events found in data, are related to other data or data-driven events. This technique is used to conduct market basket analysis, which finds the products that customers regularly buy together. It is useful for understanding customers' shopping behavior, giving businesses the opportunity to study past sales data and then predict future buying trends.
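A minimal illustration of the market basket idea, over made-up baskets: counting how often each pair of products is bought together is the first step toward association rules.

```python
from itertools import combinations
from collections import Counter

def pair_counts(baskets):
    """Count how often each pair of products appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted() makes ("bread", "butter") and ("butter", "bread") one key
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "bread"],
]
counts = pair_counts(baskets)
print(counts[("bread", "butter")])  # bought together in 2 of 3 baskets
```

Full association-rule miners (e.g. Apriori) extend this counting to larger itemsets and derive support and confidence from it.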


Clustering

Clustering is the process of finding groups in the data such that the degree of association between two objects is highest if they belong to the same group and lowest otherwise. Unlike classification, which puts objects into predefined classes, clustering derives the classes from the data itself. Clustering results are often presented graphically, using position and color to show how the data is distributed relative to different metrics.
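As a rough sketch of a clustering mechanism, here is a plain k-means loop on a handful of invented 2-D points; real work would use a library such as scikit-learn, but the assign-then-update logic is the same:

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from fixed initial centroids."""
    for _ in range(iterations):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                + (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)  # [(1.25, 1.5), (8.5, 8.5)]
```

With these toy points the two groups separate after a single iteration; on real data the initial centroids are usually chosen randomly or via k-means++.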


Classification

This technique assigns data records to predefined classes. It resembles clustering in that it also segments data records, but unlike clustering, analysts performing classification know the classes in advance, and they apply algorithms to determine how new data should be classified.
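A toy example of classifying new data into known classes, here a one-nearest-neighbour rule on invented labelled values:

```python
def nearest_neighbor(train, x):
    """Classify x with the label of the closest training point (1-NN)."""
    label, _ = min(
        ((lbl, abs(pt - x)) for pt, lbl in train),
        key=lambda t: t[1],  # pick the smallest distance
    )
    return label

# labelled training data: (value, class) -- the classes are known up front
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
print(nearest_neighbor(train, 1.4))  # "low"
print(nearest_neighbor(train, 8.6))  # "high"
```

Production classifiers (decision trees, logistic regression, and so on) are more sophisticated, but all share this shape: learn from labelled examples, then assign labels to new records.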

Outlier Detection

Simply finding patterns in data may not give businesses the clear understanding they want. Outlier analysis, or outlier mining, is a crucial data mining technique that helps organizations find anomalies in datasets. Outlier detection refers to observing data items in a dataset that do not match an expected pattern or expected behavior. Once businesses find deviations in their data, it becomes easier to understand the reason for the anomalies and to prepare for future occurrences in pursuit of business objectives.
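A simple, common way to flag such deviations is a z-score rule: mark any value more than a few standard deviations from the mean. A sketch, over an invented series of sensor readings:

```python
from statistics import mean, stdev

def outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 55]  # 55 is the anomaly
print(outliers(readings))  # [55]
```

The z-score rule assumes roughly bell-shaped data; heavily skewed datasets call for robust alternatives such as median/IQR rules or isolation forests.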


Regression

This technique detects and analyzes the relationships between variables in a dataset. Regression analysis helps businesses understand how the value of the dependent variable changes when any one of the independent variables is varied. It is primarily a form of planning and modeling and can be used to project costs based on factors such as availability, consumer demand, and competition.
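The simplest instance is an ordinary least-squares fit of one dependent variable against one independent variable; the data below is invented so the fit comes out exact:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope: covariance of x and y over variance of x
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # intercept
    return a, b

# e.g. projected cost (y) as a function of consumer demand (x)
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 1 + 2x
a, b = linear_fit(xs, ys)
print(a, b)  # 1.0 2.0
```

With more than one independent variable the same idea generalises to multiple regression, usually solved with a linear-algebra library rather than by hand.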

Sequential Patterns

This technique is particularly useful for mining transactional data and focuses on uncovering series of events that take place in sequence. It involves discovering interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured by criteria such as length and occurrence frequency. Once a company understands sequential patterns, it can recommend additional items to customers to spur sales.


Data Visualization

Data visualization is an effective data mining technique in its own right: it gives users insight into data through what they can see directly. Visualizations can also be delivered through dashboards to surface insights. Instead of relying solely on the numerical outputs of statistical models, an enterprise can build dashboards around different metrics and use visualizations to highlight patterns in the data.
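Even without a charting library, the idea can be illustrated with a tiny text bar chart over made-up survey responses:

```python
from collections import Counter

def text_histogram(values):
    """Render a quick textual bar chart, one bar per distinct value."""
    counts = Counter(values)
    return "\n".join(f"{v}: {'#' * n}" for v, n in sorted(counts.items()))

responses = ["yes", "no", "yes", "yes", "maybe"]
print(text_histogram(responses))
# maybe: #
# no: #
# yes: ###
```

Dashboard tools and libraries such as Matplotlib or Seaborn produce the polished equivalents, but the principle is the same: turn counts and metrics into shapes the eye can compare at a glance.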

Originally published by
Vivek Kumar | September 18, 2020
Analytics Insight


Data science is a blend of various tools and algorithms

Data science is a trending topic in Artificial Intelligence (AI). When the tech era began, its major burden was to store data and make good use of it. Data science builds on the growing utilisation of such stored data.

Technologies associated with AI, such as big data and Hadoop, house large amounts of data in encrypted and open-source formats. That data is valued as an asset to organisations for the content it holds, but on its own it makes no profit for the company. Data yields profit and attention only when technologies like data science are applied to it.

Data science did not gain traction until the 1990s, but since then the field has become one of AI's most attractive areas. Harvard has called the data scientist profession the 'sexiest' of all. However, the job of a data scientist is not as easy as it sounds: it involves being an expert in everything.

What is Data Science?

Data science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data. The technology is primarily used to make decisions and predictions through predictive causal analytics, prescriptive analytics, and machine learning. It applies statistical methods to large sets of data to extract trends, patterns, and other relevant information.

A data scientist usually explains what is going on by processing the history of the data. Data scientists come from many diverse educational and work experience backgrounds, but they should be strong in the four pillars of the technology.

• Programming – Data scientists should understand data hierarchies and datasets well enough to code algorithms and develop models.

• Mathematics – Data science involves many mathematical structures that a data scientist will encounter; mathematics is essential for modelling and experimental design.

• Computer science – Basic knowledge of computer science is essential, as the field incorporates coding and systems design.

• Communication – Reaching the audience is a major task. Even the wisest work and effort count for little unless the data scientist makes them accessible to every audience, telling the story through the right visuals and facts to convey the importance of the work.

Coding as a key feature of data science

It may sound odd to associate data scientists with coding, but that is how data science works. Coding is a mandatory task whenever data science is on the table, and it appears at every step of the process. Here is a step-by-step look at how coding untangles data science problems.

Knowing the tools and addressing the problem: Data scientists should understand the problem they are going to tackle before starting to program a data science function. This also involves knowing the tools, software, and data to be used throughout the process. This first step reveals the preplanning a data scientist does.

Filtering the essential data: Data is the raw material of analysis, but it is vast, largely unorganised, and mixed, which confuses systems, delays solutions, or produces wrong predictions. A Forbes report suggests that humans create 2.5 quintillion bytes of data per day. Problems range from duplicate or missing datasets and values to inconsistent, mis-entered, or outdated data. A data scientist should therefore first pull out the data deemed necessary and build the analysis from it.

Analysing data with the proper application: The major task of data science is to analyse the prepared, uniform data. The process involves languages popular in the field such as Python, R, and MATLAB. Although R and MATLAB have a steeper learning curve than Python, they are useful for an aspiring data scientist because they are widely used.

Attracting the audience through creative visualisation: The value of completed work depends on how well the presenter conveys it to viewers, and the same holds for a data scientist's analysis. Visualisation is a vital medium data scientists use to communicate their analysis, established through graphs, charts, and other easy-to-read visuals that help the audience grasp the concept. Python, the field's widely used language, offers packages such as Seaborn and prettyplotlib that help data scientists build visualisations.

Programming languages used in Data science

• Python – Python can be used to obtain, clean, analyse, and visualise data. It is considered the programming language that serves as the foundation of data science.

• NumPy and pandas – The NumPy and pandas packages handle complex calculations on matrices of data, letting data scientists focus on solutions instead of mathematical formulas and algorithms.

• Java – Java is used in a vast number of workplaces; remarkably, much of the big data ecosystem is written in Java.

Data science involves programming at every step, and taking the coding head-on is what yields well-analysed predictions and solutions. An aspiring data scientist therefore needs to be well aware of coding systems and their features, and should be confident in all these aspects before starting a career on a data science platform.

Originally published by
Adilin Beatrice |  September 15, 2020
Analytics Insight


Image: Joshua Sortino - Unsplash

The lack of trust in AI systems comes after a number of bad algorithm-driven decisions.

The UK government's recent technological mishaps have seemingly left a bitter taste in the mouths of many British citizens. A new report from the British Computer Society (BCS), the Chartered Institute for IT, has now revealed that more than half of UK adults (53%) don't trust organisations that use algorithms to make decisions about them.

The survey, conducted among more than 2,000 respondents, comes in the wake of a tumultuous summer, shaken by student uproar after it emerged that the exam regulator Ofqual used an unfair algorithm to predict A-level and GCSE results, after the COVID-19 pandemic prevented exams from taking place.

Ofqual's algorithm effectively based predictions on schools' previous performances, leading to significant downgrades in results that particularly affected state schools, while favoring private schools. 

The government promptly backtracked and allowed students to adopt teacher-predicted grades rather than algorithm-based results. It might have been too little, too late: only 7% of respondents surveyed by the BCS said that they trusted the algorithms used specifically in the education sector. 

The percentage is joint lowest, along with the level of trust placed in algorithms used by social services and the armed forces; and stands even lower than that of respondents who reported trusting social media companies' algorithms to serve content and direct user experience (8%).

Bill Mitchell, director of policy at BCS, told ZDNet that recent events have "seriously" knocked back people's trust in the way algorithms are used to make decisions about them, and that this will have long-term consequences.

"But at the same time, it has actually raised in people's mind the fact that algorithms are ubiquitous," added Mitchell. "Algorithms are always there, people are realising that is the case, and they are asking: 'Why should I trust your algorithm?'"

"That's spot on, it's just what people should be asking, and the rest of us involved in designing and deploying those algorithms should be ready to explain why a given algorithm will work to people's advantage and not be used to do harm."

The prevalence of hidden AI systems in delivering critical public services was signaled by the UK's committee on standards in public life last February, in a report that stressed the lack of openness and transparency from the government in its use of the technology.

One of the main issues identified by the report at the time was that no one knows exactly where the government currently uses AI. At the same time, public services are increasingly looking at deploying AI to high-impact decision-making processes in sectors like policing, education, social care, and health.

With the lack of clarity surrounding the use of algorithms in areas that can have huge impacts on citizens' lives, the public's mistrust of some technologies used in government services shouldn't come as a surprise – nor should attempts to reverse the damaging effects of a biased algorithm be ignored.

"What we've seen happening in schools shows that when the public wants to, they can very clearly take ownership," said Mitchell, "but I'm not sure we want to be in a situation where if there is any problem with an algorithm, we end up with riots in the streets."

Instead, argued Mitchell, there should be a systematic way of engaging with the public before algorithms are launched, to clarify exactly who the technology will be affecting, what data will be used, who will be accountable for results and how the system can be fixed if anything goes wrong.

In other words, it's not only about making sure that citizens know when decisions are made by an AI system, but also about implementing rigorous standards in the actual making of the algorithm itself. 

"If you ask me to prove that you can trust my algorithm," said Mitchell, "as a professional I need to be able to show you – the person this algorithm is affecting – that yes, you can trust me as a professional."

Embedding those standards in the design and development phases of AI systems is a difficult task, because there are many layers of choices made by different people at different times throughout the life cycle of an algorithm. But to regain the public's trust, argued Mitchell, it is necessary to make data science a trusted profession – as trusted as the profession of doctor or lawyer.

The BCS's latest report, in fact, showed that the NHS was the organization that citizens trusted most when it comes to decisions generated by algorithms. Up to 17% of respondents said they had faith in automated decision-making in the NHS, and the number jumped to 30% among 18-24 year-olds.

"People trust the NHS because they trust doctors and nurses. They are professionals that must abide by the right standards, and if they don't, they get thrown out," said Mitchell. "In the IT profession, we don't have the same thing, and yet we are now seeing algorithms being used in incredibly high-stake situations."

Will the public ever trust data scientists like they trust their doctor? The idea might seem incongruous. But with AI permeating more aspects of citizens' lives every day, getting the public on board is set to become a priority for the data science profession as a whole.

Originally published by
Daphne Leprince-Ringuet | September 8, 2020

MIT Associate Professor Sarah Williams conducts data-heavy urban research, which can then be expressed in striking visualizations, ideally generating public interest. She has also worked with other scholars on topics including criminal justice, the environment, and housing. Photo: Adam Glanzman
“We hear big data is going to change the world, but I don’t believe it will unless we synthesize it into tools with a public benefit,” Sarah Williams says.

Lacking a strong public transit system, residents of Nairobi, Kenya, often get around the city using “matatus” — group taxis following familiar routes. This informal method of transportation is essential to people’s lives: About 3.5 million people in Nairobi regularly use matatus.

Some years ago, around 2012, Sarah Williams became interested in mapping Nairobi’s matatus. Now an associate professor in MIT’s Department of Urban Studies and Planning (DUSP), she helped develop an app that collected data from the vehicles as they circulated around Nairobi, then collaborated with matatu owners and drivers to map the entire network. By 2014, Nairobi’s leaders liked the map so much they started using Williams’ design themselves.

“The city took it on and made it the official [transit] map for the city,” Williams says. Indeed, the Nairobi matatu map is now a common sight — a distant cousin of the London Underground map. “An image has a long life if it’s impactful,” she adds.

That project was a rapid success story — from academic research effort to mass-media use in a couple of years — but for Williams, her work in this area was just getting started. Cities from Amman, Jordan, to Managua, Nicaragua, have been inspired by the project and mapped their own networks, and Williams created a resource center so that even more places could do the same, from the Dominican Republic to Addis Ababa, Ethiopia.

“We’re trying to build a network that supports this work,” says Williams, who is contemplating ways to make the effort its own MIT-based project. “All these people in the network can help each other. But I think it really needs more support. It probably needs to be a full-time nonprofit with a director who is really doing outreach.”

The matatu project hardly exhausts Williams’ interests. As a scholar in DUSP, her forte is conducting data-heavy urban research, which can then be expressed in striking visualizations, ideally generating public interest. Over her career, she has worked with other scholars on an array of topics, including criminal justice, the environment, and housing. 

Notably, Williams was part of the "Million Dollar Blocks" project (along with researchers from Columbia University and the Justice Mapping Center), which mapped the places where residents had been incarcerated, and noted the costs of incarceration. That project helped lend support to the Criminal Justice Reinvestment Act of 2010, which allocated funding for job-training programs for former prisoners; the maps themselves were exhibited at New York's Museum of Modern Art.

Williams’ “Ghost Cities in China” project shed new light on the country’s urban geography by examining places where the Chinese government had over-developed. By scraping web data and mapping the information, Williams was able to identify areas without amenities — which indicated that they were notably underinhabited. Doing that helped engender new dialogue among international experts about China’s growth and planning practices.

“It is about using data for the public good,” Williams says. “We hear big data is going to change the world, but I don’t believe it will unless we synthesize it into tools with a public benefit. Visualization communicates the insights of data very quickly. The reason I have such a diversity of projects is because I’m interested in how we can bring data into action in multiple areas.”

Williams also has a book coming out in November, “Data Action,” examining these topics as well. “The book brings all these diverse projects into a kind of manifesto for those who want to use data to generate civic change,” Williams says. And she is expanding her teaching portfolio into areas that include ethics and data. For her research and teaching, Williams received tenure from MIT in 2019.

“I was actually doing planning”

Williams grew up in Washington and studied geography and history as an undergraduate at Clark University. That interest has sustained itself throughout her career. It also led to a significant job for her after college, working for one of the pioneering firms developing Geographic Information System (GIS) tools.

“I got them to hire me to pack boxes, and when I left I was a programmer,” Williams recounts.

Still, Williams had other intellectual interests she wanted to pursue as well. “I was always really, really interested in design,” she says. That manifested itself in the form of landscape architecture. Williams initially pursued a master’s degree in the field at the University of Pennsylvania.

Still, there was one problem: A lot of professional opportunities for landscape architects come from private clients, whereas Williams was mostly interested in public-scale projects. She got a job with the city of Philadelphia, in the Office of Watersheds, working on water mitigation designs for public areas — that is, trying to use the landscape to absorb water and prevent harmful runoff on city properties.

Eventually, Williams says, “I realized I was actually doing planning. I realized what planning was, and the impact I wanted to have in communities. So I went to planning school.”

Williams enrolled at MIT, where she received her master’s in city planning from DUSP, and linked together all the elements of her education and work experience.

“I always had this programmer side of me, and the design part of me, and I realized I could have an impact through doing data analysis, and visualizing it and communicating it,” Williams says. “That percolated while I was here.”

After graduation, Williams was hired on the faculty at Columbia University. She joined the MIT faculty in 2014.

Ethics and computing

At MIT, Williams has taught an array of classes about data, design, and planning — and her teaching has branched out recently as well. Last spring, Williams and Eden Medina, an associate professor in the MIT Program in Science, Technology, and Society, team-taught a new course, 11.155J / STS.005J (Data and Society), about the ethics and social implications of data-rich research and business practices.

“I’m really excited about it, because we’re talking about issues of data literacy, privacy, consent, and biases,” Williams says. “Data has a context, always — how you collect it and who you collect it from really tells you what the data is. We want to tell our undergrads that your data, and how you analyze data, has an effect on society.”

That said, Williams has also found that in any course, including material on ethical issues is a crucial part of contemporary pedagogy.

“I try to teach ethics in all my classes,” she says. And with the development of the new MIT Stephen A. Schwarzman College of Computing, Williams’ research and her teaching might appeal to new students who are receptive to an interdisciplinary, data-driven way of examining urban issues.

“I’m so excited about the College of Computing, because it’s about how you bring computing into different fields,” Williams says. “I’m a geographer, I’m an architect, an urban planner, and I’m a data scientist. I mash up these fields together in order to create new insights and try to create an impact on the world.”

Originally published by
  MIT News Office | September 8, 2020
MIT News


The term ‘big data’ represents unmanageably large datasets

Big data is the major asset of today’s tech world. When the Covid-19 pandemic hit the economy and the workplace and pushed everyone into remote work, it was big data that stood as a complement, paving the way and keeping working strategies moving without pause.

Large datasets that need to be gathered, organised, and processed are informally called big data. The problem of data overload is not new, but technology has brought a solution to the increasing chaos in the computing sector.

What is Big Data?

Big data refers to a large dataset, or to the category of computing strategies and technologies used to handle large datasets. It covers both the structured and unstructured data that inundate a business on an everyday basis. Big data represents a company's potential to use insight and analysis to predict the future, detect accurate solutions and answers, and take apt decisions.

The overflowing data is stored across many computers. How datasets are stored differs between organisations, depending on their capacity and their strategy for maintaining the data.

History of big data

The term ‘big data’ represents large datasets that are unmanageable. Remarkably, it is not the sheer amount of data that matters when an AI mechanism values it; the worth of the data is determined by the techniques employees use and the technology applied to extract a profitable outcome.

The concept of big data gained wide recognition in the early 2000s, when Gartner industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s, differentiating big data from other data processing.


Volume

Data is collected from many sources, including business transactions, smart (IoT) devices, industrial equipment, videos, images, social media content, and much more. Since this storage burden is heavy, pooling, allocating, and coordinating resources across groups of computers becomes a challenge. This is where cluster technologies became noticeable, breaking large data into small pieces that algorithms can manage.


Velocity

The addition of data never stops. Every day, millions of data inputs are added to a stream, which is then massaged, processed, and analysed to keep up with the influx of new information and to surface valuable information early, when it is most important.

Time and speed of data input play an important role. Organisations expect data in real time to gain insights and update their current understanding of the system. But to cope with the fast inflow, an organisation needs robust systems with highly available components and storage to guard against failures along the data pipeline.
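One minimal pattern for keeping up with a fast stream is a sliding-window aggregate that updates as each item arrives; this sketch (with invented readings) maintains a rolling average over the last few inputs:

```python
from collections import deque

class SlidingAverage:
    """Maintain a rolling average over the last `size` stream items."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def add(self, value):
        self.window.append(value)  # oldest item drops out automatically
        return sum(self.window) / len(self.window)

avg = SlidingAverage(size=3)
for reading in [10, 20, 30, 40]:
    print(avg.add(reading))  # 10.0, 15.0, 20.0, 30.0
```

Stream-processing frameworks (Kafka Streams, Flink, Spark Streaming) apply the same windowing idea at cluster scale, with the fault tolerance the paragraph above calls for.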


Variety

Data inputs come in all kinds of formats. A drawback of big data is that the data being processed, and its relative quality, are mixed. Data comes from sources such as applications, server logs, social media feeds, and external APIs such as physical device sensors, as well as other providers. It arrives as unstructured text documents, emails, videos, audio, stock ticker data, and financial transactions. A text file is stored in much the same way as a high-quality image, and almost all transformations and changes to the raw data happen in memory at processing time.

After the three V’s were formulated, various organisations found that there is more to big data and added further dimensions to its usage.

Variability – Data flows are often unpredictable, changing and varying across the wide range of forms they take. An additional dimension is needed to diagnose and filter low-quality data and process it separately.

Veracity – Veracity refers to the quality of data input in real time. Data comes from various sources, and it is difficult to link, match, cleanse, and transform data across systems. Cleaning and sorting the data matters because it affects the outcome of the analysis: poor data ruins employees' efforts to obtain predictions from it.

Value – Extracting accurate, valuable results from data is a struggle when the input is unorganised, and the complexity of the systems and processes involved adds to that struggle.

Why is big data important?

Data gains importance based on how much is stored and how it is utilised. Big data is particularly known for efficiencies such as:

• Cost reduction

• Time reduction

• New product development and optimised offerings based on stored data

• Smart, accurate decision-making

Big data is a cycle process

Most big data solutions employ cluster computing, which is where the technology enters the life cycle of big data analysis.

Cluster computing

Because data arrives from many disparate sources, cluster computing plays a major role in filling the gap. Individual computers would struggle to sort the data on their own, so companies turn to computing clusters, in which software combines the resources of many small machines to provide several benefits.

Resource pooling – Clusters combine and share CPU, memory, and storage for large data. A large dataset cannot be held in a single place, and it would be inadequate to try.

High availability – Sharing data across machines for storage protects against hardware and software failures, which would otherwise cut off access to the data and its processing and kill the promise of real-time analytics.

Easy scalability – When scaling is done horizontally, the system can react to changing resource requirements without expanding the physical resources of any one machine.

The general flow of data through such a system, and its processing, can be divided into four stages.

Ingesting data into the system

The first step is data ingestion: taking raw data and adding it to the system. Obstacles the system encounters at this stage include the format and quality of the data sources; dedicated ingestion tools can be used to sort out such trouble.

Persisting the data in storage

Persistence means leveraging a distributed file system for raw data storage, managing the data after ingestion so that it sits on reliable disk. The volume of incoming data, the requirements for availability, and the distributed computing layer all make more complex storage systems necessary.

Computing and analyzing data

The most important processing takes place when computing over and analysing the data to produce an outcome. The computing layer is the most diverse part of the system, as requirements and the best approach vary; detailed analysis here yields more accurate answers.

Visualizing the result

Presenting the data in an easily digestible, attractive way leads to better understanding. Recognising trends and changes in data over time is often more important than the values themselves, and visualisation is the final touch that completes the big data cycle.

Many organisations are adopting big data for certain types of workloads and using it to supplement their existing analysis and business tools to maximise revenue. Even where big data does not suit every working style, it is still worth gathering and storing the data by all means: maybe not now, but one day that stored data will prove an invaluable asset.

Originally published 
by Adilin Beatrice | September 2, 2020
Analytics Insight


Image: Unsplash - Fabio

Amid the ongoing and wall-to-wall coverage of the Covid-19 pandemic, you may have missed an important piece of news. An academic paper published last month by Australian climate scientist Steven Sherwood and a team of global colleagues is arguably one of the most important – and one of the most terrifying – pieces of climate change research to emerge in recent years.

The paper raises the estimated increase in world temperatures over the next century to between 2.5 and 4 degrees Celsius. This is significantly above the 2-degree threshold enshrined in the Paris Agreement, and is extremely bad news for the sustainability of our food production system.

As such, the paper has also brought renewed focus on ways to cut carbon emissions, and some analysts believe that big data is key in this effort. In this article, we’ll explore why.

Information and Efficiency

The link between big data and climate change has long been noted, but to date the use of big data in climate science has largely been limited to assessing the damage done by pollutants and greenhouse gases. As we’ve previously noted, big data coupled with advanced earth observation is one of the ways in which this is being implemented.

There is a growing consensus, however, that big data has the potential to make our economy more green in a more fundamental way. 

The logic goes like this. Markets, in order to operate effectively, need as much information as possible about the products that are traded on them. At the moment, commodity production companies – whether they are drilling for oil or producing wheat – have vast amounts of information on the origins of their products, and how they were produced. Unfortunately, very little of this data is available to investors in these products, and still less to their final consumers.

By making the data available, we would allow investors and consumers to make better choices when it comes to investing in and purchasing goods. That, eventually, could mean greener products.

Tracking Petroleum

To take an example of how this would work, consider a tank of gas. The petroleum extraction industry is one of the most technologically advanced in our economy, and individual companies collect huge amounts of data on every barrel of crude that they extract and sell. At the moment, however, none of this data makes it to the stock market, where crude oil is bought and sold as generic, interchangeable barrels.

Gasoline refined from Oil Sands oil generates almost twice the level of greenhouse gas emissions as does North Sea oil. But at the moment, investors and consumers have no way of knowing that, and therefore no way of expressing a preference for the greener version. For companies, investors, and consumers who are increasingly basing their purchasing choices on the ecological impact of products, this is a huge problem. 

The most frustrating element of this is that much of the data that would allow the market to make this kind of informed decision already exists, having been collected by vast IoT networks that cover almost every sector of our manufacturing and food production industries. It’s just that it never leaves the networks of the companies that collected it in the first place.

The Future is Now

There remain, of course, huge challenges in making these data available to consumers and investors. One is technological. Despite many food production companies possessing sophisticated IoT networks, it’s difficult to see how small-scale producers in developing countries can share this kind of data when only 58.8% of the world’s population has access to the internet. 

That said, there are some companies already pioneering ways of leveraging big data in pursuit of greener markets. The carbon credit registries that have sprung up in the USA recently are a great example of this. The technologies being developed by Oxy Low Carbon ventures, which envisage a world where carbon credits are traded just as commodities are now, are another example.

In addition, there is a precedent for this kind of market-driven change in a major industry: the music business. Sales of CDs peaked in 1999, the same year that Napster started offering P2P downloads of MP3s. Critically, it was not just the lower cost of tracks from Napster that allowed it to start the digital music revolution: it was also the fact that consumers had far more information about the tracks and albums that they were downloading. 

Instead of trading 500 barrels of crude, investors of the future could have access to extremely detailed information on where this oil was extracted from, and the ecological cost of this. For companies and investors looking to prove their green credentials, this could make all the difference.

Transparency and Profit

Of course, a cynical reader might note that it is not – at least currently – in the interests of huge manufacturing and mining companies to make the data they collect available to the public. Recording the true ecological cost of the way in which we produce food, for example, might make these companies fearful of a public backlash.

In reality, however, this is unlikely to make much of a difference, given the big data revolution that is now upon us. Investors are going to start to demand such data before too long, since they know that companies already collect it. The first company to make it available could expect a significant boost to its public perception, but more importantly it would also be changing the world for the better.

Originally published by
Bernard Brode | August 23, 2020
for inside BigData
Bernard Brode has spent a lifetime delving into the inner workings of cryptography and now explores the confluence of nanotechnology, AI/ML, and cybersecurity.



Among everything else going on in the world, big data is another controversial topic, and the conversations are all over the place: forums, social media networks, articles, and blogs.  

That is because big data is really important. 

I’m not saying this only as someone who works in the industry, but as someone who understands the disconnects between what goes on behind the scenes and what’s out there in the media. It’s no secret that quite often big data has a bad reputation, but I don’t think it's the fault of the data so much as how it’s being used.

The internet is the biggest source of data, and what organizations do with it is what matters most. While data can be analyzed for insights that lead to more strategic business decisions, it can also be stolen from social networks and used for political purposes. Among its almost infinite uses, big data can make our world a better place and this article is going to clear up any misconceptions and hopefully convince you that big data is a force for good. 

What is Big Data, Really?

Most of us know what big data is, but I think a quick summary is essential here. We’ve all observed how industry pundits and business leaders have demonized big data, but that’s like demonizing a knife. A minority of people may use a knife for nefarious purposes while the overwhelming majority of people would have a hard time feeding themselves without one. 

It’s all about context.

A simple explanation I would give anyone outside the industry is that big data refers to data whose size, speed, and complexity make it difficult or even impossible to process using traditional methods.

Doug Laney, a thought leader, consultant, and author, initially framed the term as a function of three concepts referred to as “the three V’s”:

  • Volume: Part of the “big” in big data involves the large amounts of information collected through a range of sources, including business transactions, smart (IoT) devices, and social media networks
  • Velocity: Big data arrives fast, streaming in from RFID tags, smart meters, and sensors that require the information to be handled just as quickly
  • Variety: Big data is diverse, ranging from the structured numeric data found in databases to unstructured data such as emails, financial transactions, audio/video files, and text documents of all types

Surveillance Capitalism: Why Some People Hate Big Data

Social networks, government bodies, corporations, developer applications, along with a plethora of organizations of all types are interested in what you do, whether you are asleep or awake. 

Everything is being surveyed and collected and this has resulted in an entire business sprouting up around the collection of big data referred to as surveillance capitalism.

I think this is the aspect of big data that concerns everyone. So concerned in fact, that many use the terms interchangeably.

Originally coined by Harvard professor Shoshana Zuboff, surveillance capitalism describes the business of purchasing data from companies that offer “free” services via applications. Users willingly use these services while the companies collect the data and then access to the data is sold to third parties. 

In essence, it's the commodification of a person’s data with the sole purpose of selling it for a profit, making data the most valuable resource on earth according to some analysts. The data collected and sold enables advertising companies, political parties, and other players to perform a wide range of functions that can include specifically targeting people for the sale of goods and services, improving existing products or services, or gauging opinion for political purposes, among many other uses. 

But that’s only part of the story...

Data collection may have various advantages for some individuals and society as a whole. Consider sites like Skyscanner, Google Shopping, Expedia, and Amazon Sponsored Products. 

Just a few short years ago comparison shopping required clicking between several sites. Today, with a visit to a single site, we can get price comparisons on almost every type of product or service. All these sites were built around data collection and represent an example of a service some would say is essential to the ecommerce experience.  

How Big Data is Obtained

Data can be obtained in many ways. One common method is to purchase it from developers of applications or to collect it from a social network. The latter is usually restricted to the owners or stakeholders of the application.

Another way is called “web scraping”. This involves the creation of a script that analyzes a page and collects public information. After collecting the information, the scraped data is then compiled and delivered in a spreadsheet format to the end user for analysis. Referred to as the mining process, this is the stage where the data is analyzed and valuable information is extracted, similar to panning for gold among rocks. 
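As an illustrative sketch of such a script (the page markup and class names here are invented, and a real scraper would fetch pages over HTTP), a parser built on Python's standard library can pull name/price pairs out of HTML:

```python
from html.parser import HTMLParser

# A hardcoded page keeps this sketch self-contained; a real scraper
# would download the HTML first.
PAGE = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects (name, price) records from spans tagged name/price."""
    def __init__(self):
        super().__init__()
        self.field = None   # which span we are currently inside, if any
        self.current = {}   # partially built record
        self.rows = []      # finished records, spreadsheet-style

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field:
            self.current[self.field] = data.strip()
            if self.field == "price":   # a price closes the record
                self.rows.append((self.current["name"],
                                  float(self.current["price"])))
                self.current = {}
            self.field = None

scraper = PriceScraper()
scraper.feed(PAGE)
print(scraper.rows)  # [('Widget', 9.99), ('Gadget', 24.5)]
```

The "mining" stage described above would then operate on `scraper.rows`, the compiled, spreadsheet-like output.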

Specific Web Scraping Examples

Just about any website with publicly available data can be scraped. Some of the most beneficial uses people may be familiar with include:

Price aggregator websites

Whether it’s to book flights, hotel rooms, buy cars or other consumer goods, web scraping is a useful tool for businesses that want to stay price-competitive. The largest benefits accrue to the end-users that are able to source out the lowest prices. 

Tracking World News & Events

Web scraping can be used to extract information and statistics for a variety of world events that include the news, financial market information, and the spread of communicable diseases. 

My company partnered with university students in the United States and Switzerland to support the TrackCorona and CoronaMapper websites that used scraped information from various sources to provide COVID-related statistics.

Tracking Fake News

“Fake News” seems to be everywhere and can spread like wildfire on social networks. Several startups are working to combat the problem of misinformation in the news through the use of machine learning algorithms.

Through processes that can analyze and compare large amounts of data, stories can be evaluated to detect their accuracy. While many of these projects are currently in development, they represent innovative solutions to the issue of false information by tracking it from its source. 

Search Engine Optimization (SEO)

Small businesses and new startups looking to get ranked in search engines are in for an uphill battle with the major players dominating page one. Since SEO can be very challenging, web scraping can be leveraged to research specific search terms, title tags, targeted keywords, and backlinks for use in an effective strategy that can help smaller players beat the competition. 

Academic Research

The internet provides an almost unlimited source of data that can be used by research professionals, academics, and students for papers and studies. Web scraping can be a useful tool to obtain data from public sites in a wide array of areas, providing timely, accurate data on almost any subject. 


Cybersecurity

Cybersecurity is a growing field that spans a variety of areas involving the security of computer systems, networking systems, and online surveillance. Besides corporate and government concerns, cybersecurity also spans email security, social network monitoring/listening, and other forms of tracking that ensure the safety of systems stays intact.

Ethical Web Scraping

Big data is always changing as it grows and evolves, and part of the evolution should include the formation of some generally accepted ethical practices to keep the space free of corruption and mismanagement. 

At Oxylabs, we feel that there are ethical ways to scrape data off the web that don’t compromise the privacy of users or the website servers providing them services. 

The guidelines for scraping publicly available data should be based on respect for the intellectual property of third parties and sensitivity to privacy issues. It is equally important to employ practices that protect servers from being overloaded with requests.  

Scraping publicly available data with the intent to add value is another guideline, one that can enrich both the data landscape and the end user’s experience. 

The Bottom Line

Big data has received a terrible reputation thanks to negative perceptions created by the media with respect to recent scandals. The truth is that this is a very narrow definition of what big data is all about. Big data simply refers to the handling of large streams of diverse data that traditional systems could not process. 

Big data has almost unlimited uses with some of the most positive involving optimization strategies that can improve us personally and improve society as a whole. For this reason, factual information should be open and available for everyone. 

At the end of the day, it’s about how the data is used, and as an executive of one of the largest proxy providers in the world, I can attest to the fact that there are many innovative players in the world today that are using big data as a force for good.

Originally published by
Julius Cerniauskas | August 11, 2020
Data Science Central



Data systems that learn to be better - MIT

Storage tool developed at MIT CSAIL adapts to what its datasets’ users want to search.

One of the biggest challenges in computing is handling a staggering onslaught of information while still being able to efficiently store and process it.

Big data has gotten really, really big: By 2025, all the world’s data will add up to an estimated 175 trillion gigabytes. For a visual, if you stored that amount of data on DVDs, it would stack up tall enough to circle the Earth 222 times. 

One of the biggest challenges in computing is handling this onslaught of information while still being able to efficiently store and process it. A team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) believes that the answer rests with something called “instance-optimized systems.”  

Traditional storage and database systems are designed to work for a wide range of applications because of how long it can take to build them — months or, often, several years. As a result, for any given workload such systems provide performance that is good, but usually not the best. Even worse, they sometimes require administrators to painstakingly tune the system by hand to provide even reasonable performance. 

In contrast, the goal of instance-optimized systems is to build systems that optimize and partially re-organize themselves for the data they store and the workload they serve. 

“It’s like building a database system for every application from scratch, which is not economically feasible with traditional system designs,” says MIT Professor Tim Kraska. 

As a first step toward this vision, Kraska and colleagues developed Tsunami and Bao. Tsunami uses machine learning to automatically re-organize a dataset’s storage layout based on the types of queries that its users make. Tests show that it can run queries up to 10 times faster than state-of-the-art systems. What’s more, its datasets can be organized via a series of "learned indexes" that are up to 100 times smaller than the indexes used in traditional systems. 

Kraska has been exploring the topic of learned indexes for several years, going back to his influential work with colleagues at Google in 2017. 
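The learned-index idea can be sketched in miniature: fit a simple model that predicts a key's position in a sorted array, then search only a small window around the prediction. The toy below uses linear least squares and is illustrative only, not the actual MIT implementation:

```python
import bisect

def build_learned_index(keys):
    """Fit position ~ slope*key + intercept over the sorted keys and
    record a safe bound on the worst-case prediction error."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = sum((k - mean_k) * (i - mean_p)
                for i, k in enumerate(keys)) / var
    intercept = mean_p - slope * mean_k
    max_err = max(abs(i - (slope * k + intercept))
                  for i, k in enumerate(keys))
    return slope, intercept, int(max_err) + 2

def lookup(keys, model, key):
    slope, intercept, err = model
    guess = int(slope * key + intercept)
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    # Only the small window around the model's prediction is searched.
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

keys = sorted(x * x for x in range(1, 101))  # a skewed key distribution
model = build_learned_index(keys)
print(lookup(keys, model, 49 * 49))  # 48
```

The "index" here is just three numbers, which is the intuition behind learned indexes being far smaller than traditional tree structures; real systems use much richer models and layouts.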

Harvard University Professor Stratos Idreos, who was not involved in the Tsunami project, says that a unique advantage of learned indexes is their small size, which, in addition to space savings, brings substantial performance improvements.

“I think this line of work is a paradigm shift that’s going to impact system design long-term,” says Idreos. “I expect approaches based on models will be one of the core components at the heart of a new wave of adaptive systems.”

Bao, meanwhile, focuses on improving the efficiency of query optimization through machine learning. A query optimizer rewrites a high-level declarative query into a query plan, which can actually be executed over the data to compute the query’s result. There is often more than one query plan that can answer a given query, however, and picking the wrong one can cause a query to take days rather than seconds to compute its answer. 

Traditional query optimizers take years to build, are very hard to maintain, and, most importantly, do not learn from their mistakes. Bao is the first learning-based approach to query optimization that has been fully integrated into the popular database management system PostgreSQL. Lead author Ryan Marcus, a postdoc in Kraska’s group, says that Bao produces query plans that run up to 50 percent faster than those created by the PostgreSQL optimizer, meaning that it could help to significantly reduce the cost of cloud services, like Amazon’s Redshift, that are based on PostgreSQL.

By fusing the two systems together, Kraska hopes to build the first instance-optimized database system that can provide the best possible performance for each individual application without any manual tuning. 

The goal is to not only relieve developers from the daunting and laborious process of tuning database systems, but to also provide performance and cost benefits that are not possible with traditional systems.

Traditionally, the systems we use to store data are limited to only a few storage options and, because of it, they cannot provide the best possible performance for a given application. What Tsunami can do is dynamically change the structure of the data storage based on the kinds of queries that it receives and create new ways to store data, which are not feasible with more traditional approaches.

Johannes Gehrke, a managing director at Microsoft Research who also heads up machine learning efforts for Microsoft Teams, says that this work opens up many interesting applications, such as doing so-called “multidimensional queries” in main-memory data warehouses. Harvard’s Idreos also expects the project to spur further work on how to maintain the good performance of such systems when new data and new kinds of queries arrive.

Bao is short for “bandit optimizer,” a play on words related to the so-called “multi-armed bandit” analogy where a gambler tries to maximize their winnings at multiple slot machines that have different rates of return. The multi-armed bandit problem is commonly found in any situation that has tradeoffs between exploring multiple different options, versus exploiting a single option — from risk optimization to A/B testing.
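A minimal illustration of the bandit idea (not Bao's actual algorithm, and with invented plan latencies) is an epsilon-greedy loop that mostly runs the query plan with the lowest observed latency while occasionally exploring the others:

```python
import random

def choose_plans(latencies, rounds=2000, eps=0.1, seed=0):
    """Epsilon-greedy bandit over query plans: exploit the plan with the
    lowest observed average latency, explore a random plan eps of the time."""
    rng = random.Random(seed)
    counts = [0] * len(latencies)   # how often each plan was run
    means = [0.0] * len(latencies)  # running average observed latency
    for _ in range(rounds):
        if rng.random() < eps:
            plan = rng.randrange(len(latencies))               # explore
        else:
            plan = min(range(len(latencies)), key=means.__getitem__)  # exploit
        observed = rng.gauss(latencies[plan], 0.1)  # noisy latency sample
        counts[plan] += 1
        means[plan] += (observed - means[plan]) / counts[plan]
    return counts, means

# Three hypothetical plans with true average latencies of 3s, 1s, and 2s.
counts, means = choose_plans([3.0, 1.0, 2.0])
print(counts.index(max(counts)))  # 1: the fastest plan wins most rounds
```

The tradeoff is exactly the one the analogy describes: every round spent probing a slow plan is a round not spent on the best one found so far.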

“Query optimizers have been around for years, but they often make mistakes, and usually they don’t learn from them,” says Kraska. “That’s where we feel that our system can make key breakthroughs, as it can quickly learn for the given data and workload what query plans to use and which ones to avoid.”

Kraska says that in contrast to other learning-based approaches to query optimization, Bao learns much faster and can outperform open-source and commercial optimizers with as little as one hour of training time. In the future, his team aims to integrate Bao into cloud systems to improve resource utilization in environments where disk, RAM, and CPU time are scarce resources.

“Our hope is that a system like this will enable much faster query times, and that people will be able to answer questions they hadn’t been able to answer before,” says Kraska.

A related paper about Tsunami was co-written by Kraska, PhD students Jialin Ding and Vikram Nathan, and MIT Professor Mohammad Alizadeh. A paper about Bao was co-written by Kraska, Marcus, PhD students Parimarjan Negi and Hongzi Mao, visiting scientist Nesime Tatbul, and Alizadeh.

The work was done as part of the Data System and AI Lab (DSAIL@CSAIL), which is sponsored by Intel, Google, Microsoft, and the U.S. National Science Foundation. 

Originally published by
Adam Conner-Simons | MIT CSAIL
August 10, 2020


Photo: metamorworks, Getty Images

Data can allow us to improve our public health response to the pandemic, but only if we enable data scientists with the right tools to harness those datasets.

With a global health crisis such as the Covid-19 pandemic comes an enormous amount of rapidly changing, important healthcare data – from the number of confirmed cases by region to hospital ventilator and PPE inventory. Leveraging data-driven insights in real-time is imperative for leaders making critical decisions at the frontlines of this pandemic, but many healthcare organizations are struggling to rapidly and efficiently harness the overwhelming streams of data to meet the demands placed on the healthcare industry.

In response to this challenge, data teams worldwide are mobilizing to help solve the most pressing problems of the pandemic. Modern data platforms are providing powerful processing tools that enable researchers, clinicians and administrators at hospitals, government and pharmaceutical organizations to aggregate and analyze diverse datasets to provide actionable insights for decision-makers.

Hospital systems
Digital transformation has been a slow burn for most hospital systems, but Covid-19 has ignited an accelerated effort, especially in the move toward analytics of consolidated health records. Hospitals use multiple different Electronic Health Record (EHR) systems that have complex data storage and analytics architectures. These complex architectures mean that these systems can’t easily interact, making it difficult to collect all the data that will provide a complete picture of a patient, which is necessary for building accurate machine learning models. Ultimately, data scientists working with legacy EHR architectures spend more time freeing data from their EHR, and less time building innovative models that can improve patient outcomes.

Data scientists are helping push healthcare systems to open interoperable systems that enable seamless analytics across hospitals. Experts advocate for hospital data teams to load EHR data using the open HL7 and FHIR APIs into open source technologies built for analytics like Apache Spark and Delta Lake. As data from EHRs flows more instantaneously, hospitals can build applications to streamline and even automate processes based on real-time information. For example, the data team at a hospital in South Carolina has harnessed the power of AI to build an app that helps caregivers predict a patient’s risk for sepsis and treat them accordingly.
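As an illustrative sketch of that first step, freeing records from their nested source format, the code below flattens a hardcoded FHIR-style bundle into analytics-ready rows. A real pipeline would pull such bundles through the FHIR APIs into Spark or Delta Lake; the sample values here are invented:

```python
import json

# An abbreviated FHIR-style bundle; a real pipeline would fetch this
# from an EHR's FHIR API rather than hardcoding it.
bundle = json.loads("""
{
  "resourceType": "Bundle",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1",
                  "birthDate": "1960-04-02"}},
    {"resource": {"resourceType": "Observation",
                  "subject": {"reference": "Patient/p1"},
                  "code": {"text": "heart rate"},
                  "valueQuantity": {"value": 118, "unit": "beats/min"}}}
  ]
}
""")

def flatten(bundle):
    """Turn nested FHIR resources into flat rows ready for analytics."""
    rows = []
    for entry in bundle["entry"]:
        r = entry["resource"]
        if r["resourceType"] == "Observation":
            rows.append({
                "patient": r["subject"]["reference"].split("/")[1],
                "measure": r["code"]["text"],
                "value": r["valueQuantity"]["value"],
                "unit": r["valueQuantity"]["unit"],
            })
    return rows

print(flatten(bundle))
```

Once observations are flat rows rather than nested documents, feeding them into a sepsis-risk model or an overcrowding dashboard becomes a conventional analytics job.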

This approach can be generalized to a number of use cases: other hospitals use streaming data from EHRs to predict patient surges, ER overcrowding, ventilator inventory and other important operational considerations. By using a unified data platform to blend EHR data with staffing data, a major multi-state hospital system is producing overcrowding statistics by department within less than five minutes of patient intake, which has been crucial in handling patient surges caused by Covid-19.

As the novel coronavirus first began to surge in the U.S., one of our partners noticed that most hospitals didn’t know whether a patient surge was likely, or whether they had enough supplies to care for a sudden influx of patients. Data engineers created a specialized AI live streaming app now in use at several hospitals that takes EHR data and builds out predictive dashboards showing how many patients they were likely to receive, ventilator capacity and peak ventilator usage. As healthcare system needs evolve throughout the pandemic, so will the volume of data being generated and how it can be utilized to meet challenges head-on.

Government agencies
Data analytics technology is also helping government agencies at the national level rapidly generate up-to-date data sets and run predictive models. This allows them to optimally allocate resources, provide data to public health research efforts and, ultimately, curb the spread of Covid-19. Because much of the data important to government response is being generated in hospitals, the data must be easily communicated between healthcare systems and governments for policy development.

Originally published by
Frank Nothaft | August 10, 2020
MedCity News


Researchers at the McKelvey School of Engineering at Washington University in St. Louis have developed a new algorithm for solving a common class of problem -- known as linear inverse problems -- by breaking them down into smaller tasks, each of which can be solved in parallel on standard computers. (Image: Shutterstock)

A computational framework for solving linear inverse problems takes a parallel computing approach

In this era of big data, there are some problems in scientific computing that are so large, so complex and contain so much information that attempting to solve them would be too big of a task for most computers.

Now, researchers at the McKelvey School of Engineering at Washington University in St. Louis have developed a new algorithm for solving a common class of problem — known as linear inverse problems — by breaking them down into smaller tasks, each of which can be solved in parallel on standard computers.

The research, from the lab of Jr-Shin Li, professor in the Preston M. Green Department of Electrical & Systems Engineering, was published July 30 in the journal Scientific Reports.

In addition to providing a framework for solving this class of problems, the approach, called Parallel Residual Projection (PRP), also delivers enhanced security and mitigates privacy concerns.

Linear inverse problems are those that attempt to take observational data and try to find a model that describes it. In their simplest form, they may look familiar: 2x+y = 1, x-y = 3. Many a high school student has solved for x and y without the help of a supercomputer.

And as more researchers in different fields collect increasing amounts of data in order to gain deeper insights, these equations continue to grow in size and complexity.

“We developed a computational framework to solve for the case when there are thousands or millions of such equations and variables,” Li said.

This project was conceived while working on research problems from other fields involving big data. Li’s lab had been working with a biologist researching the network of neurons that deal with the sleep-wake cycle.

“In the context of network inference, looking at a network of neurons, the inverse problem looks like this,” said Vignesh Narayanan, a research associate in Li’s lab:

Given the data recorded from a bunch of neurons, what is the ‘model’ that describes how these neurons are connected with each other?

“In an earlier work from our lab, we showed that this inference problem can be formulated as a linear inverse problem,” Narayanan said.

If the system has a few hundred nodes — in this case, the nodes are the neurons — the matrix which describes the interaction among neurons could be millions by millions; that’s huge.

“Storing this matrix itself exceeds the memory of a common desktop,” said Wei Miao, a PhD student in Li’s lab.

Add to that the fact that such complex systems are often dynamic, as is our understanding of them. “Say we already have a solution, but now I want to consider interaction of some additional cells,” Miao said. Instead of starting a new problem and solving it from scratch, PRP adds flexibility and scalability. “You can manipulate the problem any way you want.”

Even if you do happen to have a supercomputer, Miao said, “There is still a chance that by breaking down the big problem, you can solve it faster.”

In addition to breaking down a complex problem and solving in parallel on different machines, the computational framework also, importantly, consolidates results and computes an accurate solution to the initial problem.

An unintentional benefit of PRP is enhanced data security and privacy. When credit card companies use algorithms to research fraud, or a hospital wants to analyze its massive database, “No one wants to give all of that access to one individual,” Narayanan said.

“This was an extra benefit that we didn’t even strive for,” Narayanan said.

Originally published by
Brandie Jefferson
 August 4, 2020
Washington University in St. Louis


The McKelvey School of Engineering at Washington University in St. Louis promotes independent inquiry and education with an emphasis on scientific excellence, innovation and collaboration without boundaries. McKelvey Engineering has top-ranked research and graduate programs across departments, particularly in biomedical engineering, environmental engineering and computing, and has one of the most selective undergraduate programs in the country. With 140 full-time faculty, 1,387 undergraduate students, 1,448 graduate students and 21,000 living alumni, we are working to solve some of society’s greatest challenges; to prepare students to become leaders and innovate throughout their careers; and to be a catalyst of economic development for the St. Louis region and beyond.

Source: Markus Spiske from Pexels

Not long ago, people using Microsoft Word would check for spelling errors by specifically telling the software to run “Spell Check.” The check took a few seconds to do, and users could then go in and fix their typos. Nowadays, Spell Check runs automatically as users write — as I write this story. 

Microsoft Word and its constant running of Spell Check is a basic example of “concurrent” programming – a form of computing in which an executable runs simultaneously with other programs and computations. Most programs today are concurrent programs, ranging from your operating system to the many applications, from word processing to web browsing, that people use on a daily basis.

“When you have multiple things happening at the same time, you need some way of coordinating between them to make sure they’re not stomping on each other,” says CyLab’s Bryan Parno, an associate professor in Electrical and Computer Engineering and the Computer Science Department. “Historically, this has been a very buggy process.”

Parno and a team of researchers recently published a new programming language and tool for high-performance concurrent programs that ensures programs are provably correct – that is, the code is mathematically proven to compute correctly. The language and tool, named Armada, was presented at this year’s Conference on Programming Language Design and Implementation, and the paper received a Distinguished Paper award.


“What’s novel about Armada is that it’s designed to be extremely flexible so you can write the code the way you want so it’ll go as fast as it can,” says Parno. “But you’ll still get strong assurance that it’s going to do the right thing and not mess anything up on the back end.”

Parno likens the complexity of concurrent programs – and their susceptibility to bugs – to an auction. Typically, one auctioneer receives bids from lots of people. It may take a long time to get to the highest bid with so many people and one auctioneer. If you split everyone up into, say, ten rooms, each with its own auctioneer, that would speed things up, but it would be very difficult for the auctioneers to stay coordinated; there would be lots of room for error.


“There needs to be a way for all of those auctioneers to talk to each other while simultaneously working towards the highest bid amongst all of the rooms,” says Parno. “It can get very complicated, which is why you don’t usually see auctions run in this way.”

Parno believes that Armada will benefit anyone writing concurrent programs, which span a huge range in applications. 

“From payroll systems to hospital records to any kind of e-commerce – they are all backed by databases, and databases are always going to be backed by concurrent software,” says Parno. “Aside from simple programs, these days almost everything has some sort of concurrency to it.”
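Parno's auction analogy maps onto a classic coordination pattern. As a minimal sketch in ordinary threaded code (a hypothetical illustration, not Armada itself), several "auctioneer" threads propose bids against a shared maximum, and a lock prevents the lost updates he describes:

```python
import threading

highest_bid = 0
lock = threading.Lock()

def auctioneer(bids):
    """One 'room' of the auction: propose each bid against the shared max."""
    global highest_bid
    for bid in bids:
        # Without the lock, two threads could read the same stale value and
        # one update would be lost -- the "stomping" Parno describes.
        with lock:
            if bid > highest_bid:
                highest_bid = bid

rooms = [[100, 250, 180], [90, 300, 120], [210, 150, 260]]
threads = [threading.Thread(target=auctioneer, args=(room,)) for room in rooms]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(highest_bid)  # 300
```

Armada's contribution is mechanically proving that coordination like this is correct; in the sketch above, the lock's sufficiency is simply asserted by the programmer.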

Originally published by
Daniel Tkacik | JUL 29, 2020
Carnegie Mellon University - Security & Privacy Institute

Read more…
Bronze Level Contributor

Financial crime fighting platform Quantexa has raised $64.7 million in funding with support from HSBC and ABN Amro Ventures.

The round was led by Evolution Equity Partners, and also included backing from Dawn Capital, AlbionVC, British Patient Capital and Accenture Ventures.

The company says it will use the capital injection to push into new vertical industries such as the public sector, while developing more platform applications across financial services.

This new round of funding follows a $22.7 million Series B round secured in August 2018, bringing total funds raised to date to $90 million.

Using the latest advancements in big data and AI, Quantexa’s platform uncovers hidden risk and new opportunities by providing a contextual view of internal and external data, which can be interrogated in a single place to solve major challenges across financial crime, customer intelligence, credit risk, and fraud.

Big ticket clients include HSBC, Standard Chartered Bank and Accenture.

Vishal Marria, CEO at Quantexa, comments: “We are seeing a huge demand for our platform to support multiple applications across our core markets in financial services and within new industry sectors. This investment will accelerate our product innovation roadmap and enable us to invest further into Europe, North America and Asia Pacific regions, as well as expand into new locations.”
Originally published by
Finextra | July 23, 2020
Read more…
Gold Level Contributor

Telefonica expands IoT, big data partnership

Telefonica predicted an acceleration in deployment of IoT and big data telemetry solutions after expanding a partnership with Spanish specialist Erictel.

The pair plan to focus on developing and launching new services covering asset tracking and field management. In a statement, Telefonica explained the agreement builds on M2M-focused partnerships with Erictel spanning a number of years and expands the geographic reach to cover all of the operator’s markets.

Elena Gil, product and business operations director for IoT and Big Data at the operator’s Telefonica Tech division, said the agreement enhanced its capabilities “at a time when collaboration with our partners is more important than ever in order to support our B2B customers in their path towards digitisation, automation and data-driven decisions”.

Telefonica noted Covid-19 (coronavirus) had highlighted the importance of automation and digitalisation, showing “how data-based decision-making can help businesses stay competitive”.

The operator recently bolstered IoT security by deepening ties with cybersecurity specialists Nozomi Networks and Fortinet.

Originally published by
Manny Pham | July 15, 2020
Mobile World Live

Read more…
Gold Level Contributor

Image credit: Pixabay

Data science's impact on our everyday lives has become even more profound as the technology develops. Big Data, alongside artificial intelligence and machine learning, forms a permanent part of our online lives. Smart Data Collective even states that artificial intelligence will form an integral part of the future of social media. At the core of data science is artificial intelligence, which allows businesses to gain insight from the data they gather through user interactions (or buy from social media providers). In this article, we'll explore how data science impacts our social visibility.

Self-Improving Applications

AI is one of the most amazing and confusing modern fields currently in existence. The ability to create applications that can make their own decisions and learn from user input is a revolutionary feat. The complicated part comes from how AI thinks. As The BBC notes, even the people who design AI systems don't always know how they come to their conclusions. However, despite this shortcoming, AI is still able to make some remarkably accurate predictions, given enough data. AI has been used in business before, and most companies that have an online presence have invested in an AI bot to interact with their customers when they can't have a human doing so. This interaction also opens the door to another benefit of artificial intelligence to business.

Personalized Service

What if a business could customize every aspect of its service so that each consumer felt unique? Forbes mentions that personalized service tends to result in higher revenue for businesses. With AI working on the back end, companies could theoretically develop customized service for each of their customers. For example, AI could detect customers' locations (so long as they're not using a VPN to hide their address) and customize the landing page to their local language. It might seem like a small and insignificant detail but could have an enormous impact on how the customer views the business.

Data Science and AI's Role in Social Media

By now, it's evident that social media networks have invested heavily in the development of AI to help them manage their business. However, aside from developing AI agents that can successfully moderate platforms, social media networks have also developed systems that aid users. Most people who use Facebook, for example, will be familiar with its facial recognition system that suggests automatic picture tags once it spots your face. Advertisers can take this a step further by attaching a face to a user profile.

Social media also offers a unique and unprecedented look at how consumers think. Twitter's hashtags are a perfect example of this. By simply looking at the trending hashtags, businesses can develop a picture of the current social issues facing their customer base. Since activism-based marketing is such a popular path these days, companies can use data science to refine these trending tags to understand the politics of their core demographic. As Impact points out, political stances could be leveraged to increase a business's overall sales numbers.
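The hashtag-mining idea can be sketched in a few lines (the sample posts below are invented): count tag frequencies across a batch of posts and surface the most common ones to profile what a customer base is discussing.

```python
from collections import Counter

# Invented sample posts standing in for a Twitter data feed.
posts = [
    "Loving the new #ClimateAction pledge",
    "#ClimateAction and #FairTrade matter to me",
    "Just bought local! #FairTrade",
]

def trending(posts, top=2):
    """Return the most frequent hashtags across a batch of posts."""
    tags = Counter(
        word.lower()
        for post in posts
        for word in post.split()
        if word.startswith("#")
    )
    return [tag for tag, _count in tags.most_common(top)]

print(trending(posts))
```

A production pipeline would add normalization, spam filtering, and time-windowing, but the core signal is just frequency counting like this.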

The Lure of Easy Solutions

AI presents a handy tool, but it's no silver bullet. Effectively utilized, AI can provide a valuable advantage to businesses and even push them towards being more profitable. This result depends on the type of data that the business collects and how useful it is in defining the business's core demographic. As more data comes in, companies have the chance to hyper-focus on their consumers' needs. Artificial intelligence and data science do provide a way to become more socially visible through the clever use of trending data and the knowledge of what a business's core audience likes. As more users flock to social media, these targeting methods will only get better.

Originally published by
Steve Jones | July 14, 2020

Read more…
Gold Level Contributor

Image:  Alexandra Gorn | Unsplash

One way to explore what trends may be emerging is to talk to people about what has them most excited or worried about the future. In the analytics and data science space, a recurring theme among experienced leaders is the concern of not being able to keep up with all of the rapid change taking place – both individually and as a team. New algorithms, platforms, data, business partners, and more are constantly challenging analytics leaders’ ability to stay current on everything they oversee.

The Rise of Complexity and Disruption

Until well into the 2000s, the number of tools and platforms for performing analytics was relatively small. Virtually all analytic logic was coded using SAS, SQL, or (sometimes) SPSS. Most data used for analysis was stored in a relational database or (sometimes) a mainframe. The majority of analytics pursued at major corporations involved classic statistical and forecasting models. Nothing was easy, but skill needs were concentrated in a few core areas. Analytics generalists ruled the day, and generalists filled roles from the bottom to the top of the analytics organization.

Given the past stability of the space, even executives who had not done hands-on work for a number of years were still mostly current (if rusty) and could still review and understand the code and analytical logic being created by their teams. Leaders were comfortable that they could stay on top of the details of what was going on. They could even jump in and get their hands dirty if they needed to! The generalist skills that leaders grew up with still represented the bulk of their team’s skill sets.

As we neared 2010, an explosion of complexity hit through a combination of, among other things, big data, open source, the cloud, and artificial intelligence. Suddenly there was more data, more algorithms, more tools, and more platforms than ever before. They were all evolving rapidly, and many were not mature. The analytics space was, and continues to be, disrupted heavily while simultaneously analytics was being used to disrupt business models.

This poses tremendous opportunities for analytics organizations, but also tremendous challenges. No individual, whether entry level data scientist or senior leader, can possibly keep up with it all from a technical perspective. There just isn’t enough time in a day to become an expert on all the data, tools, and technologies that were sprung upon us so quickly and surround us today.

The Impact on Analytics and Data Science Organizations

As a result of the complexity and disruption, analytics leaders began hiring larger teams, with a broader range of skills, and a lot of specialists. The breadth and complexity of the analytics and data science processes being built and deployed have evolved far beyond anything in the past. While productivity is enhanced by all of the pre-packaged functionality now readily available, deploying and scaling processes requires many pieces working well together, and those individual pieces are often understood and managed by different people.

This causes major stress for analytics and data science leaders. They are now responsible for many varied and complex analytical processes. At the same time, there may be nobody on the executive’s team who truly understands how all of the technical details for a given process work from start to finish. Instead, different people understand distinct pieces of the process. For example, a data engineer might make available a data pipeline that a data scientist can then make use of. The engineer and scientist may not understand the details of what the other is doing, but simply understand what handoffs are required.

What Keeps Analytics And Data Science Executives Up At Night?

For the typical analytics executive who is detail oriented and likes a sense of control, the lack of end to end understanding is disconcerting, and it brings us to the question in the title of this blog – what keeps analytics executives up at night? Time and again, when comfortable that they can speak freely and in confidence, analytics executives confess their insecurity as it relates to keeping on top of everything their team is doing. They simply aren’t up to speed on all the latest tools, platforms, protocols, and techniques. Sure, they understand the concepts and know how things work at a high level – at a generalist level. However, they are no longer able to jump in and personally quality check all the work their team has done, nor are they any longer able to do most of the work themselves if they had to. There is simply too much specialist work now being incorporated. As a result, analytics and data science executives now must fully rely on and trust their teams.

While this level of trust may be common for some executive roles (most notably CEOs who can’t possibly know the details of what everyone does within a large organization), it was not common for analytics executives until recently. It is a very big adjustment because it is one thing to lead a team as an “Alpha” generalist resource who, like a good drill sergeant, everyone knows can still jump in and show folks how it is done. It is another thing to lead a team as a guide, coach, and mentor who helps the team go the right direction and set the right priorities, but who everyone knows can’t jump into the trenches to help do the dirty work.

Many analytics executives still long for the ability to learn everything their team is doing and to be current on all of the details. However, the smart ones have realized that this isn’t possible. Furthermore, it isn’t desirable. An executive is being paid to lead and that is where their focus should be. It isn’t a bad thing to understand many of the details, but that is what the team is there for.

To be successful today, an analytics executive should hire a good team and let them do what they do best. In turn, the executive must focus on guiding the strategy, managing the politics, and selling the team’s capabilities to the organization. The requirement today is no longer for an executive who is a technical star and also has some leadership skills. Rather, the requirement is for a leadership star who also has some technical skills. Failure to recognize and adapt for this difference will cause both an executive and an organization to pay a price.

Originally published by the International Institute for Analytics
Bill Franks | July 9, 2020

Read more…
Bronze Level Contributor

Image: Alex Machado - Unsplash

Cloud computing has come to occupy a central place in the data management ecosystem, much more so than it did even a couple of months ago. Amid pronounced economic instability, global health concerns, and an unparalleled need for remote access, many organizations are struggling simply to keep the lights on and retain what customers they still have.

Advanced analytics only helps so much with the necessities of reducing costs and provisioning IT resources in immensely distributed settings—which is the crux of the requirements for maintaining operations in such an arduous business climate.

Although Artificial Intelligence will likely always be considered “cool”, the cloud—and not AI—is the indisputably pragmatic means of staying in business in an era in which budget slashing and layoffs (even of IT personnel) are a disturbingly familiar reality.

The cloud is the single most effective way to address contemporary concerns of:

  • Overhead: Most cloud manifestations dramatically decrease costs for securing and accessing IT resources, which for most organizations is simply “an enabler, it’s always overhead,” divulged Denodo CMO Ravi Shankar. “A CEO of any reasonable company will try to cut the overhead as much as possible.”
  • Remote Access: The cloud’s collaboration benefits are critical to working in decentralized settings (including from home or anywhere else) and support the escalating need for services like mobile banking or telemedicine.
  • IT: Cloud architecture enables organizations to outsource the difficulty of modern IT needs to specialists, so companies can focus on mission critical activities central to revenue generation.

What were once incentives for cloud migration are rapidly becoming mandates for contemporary IT needs. However, “the cloud… is fairly complex, [there are] a lot of cloud services, a lot of moving parts,” reflected Privacera CEO Balaji Ganesan. By understanding the options available for overcoming the inherent complexities of the cloud—involving data governance and security, integration, and data orchestration—organizations can perfect this paradigm to thrive in the subsequent days of economic uncertainties.

Public Cloud Strengths

Most companies know that the three main public cloud providers are Amazon, Azure, and Google, and that these provide the foundation for serverless computing. Fewer realize each has distinctive strengths that can be decisive when selecting a provider.

  • Azure: According to Shankar, many larger enterprises gravitate towards Microsoft Azure, which excels in “office productivity applications and BI.” Oracle is also increasing market share among larger organizations, particularly those investing in its applications.
  • Google: For organizations in which machine learning and cognitive computing applications are core to their business, Google Cloud—which focuses on these areas—is a natural fit.
  • Amazon: Amazon Web Services is the incumbent among public cloud providers and resonates with small and mid-sized businesses because of perceived pricing advantages, its capability to enable smaller retailers to reach global audiences, and its “marketplace is bigger than Microsoft’s or other marketplaces simply for those reasons: people can find and use these services,” Shankar commented.

Continue reading

Originally published by
Jelani Harper | Inside Big Data | July 7, 2020

About the author: Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.

Read more…
Gold Level Contributor

A virtual "datathon" organized by MIT to bring fresh insights to the Covid-19 pandemic drew 300 participants and 44 mentors from around the world. Here, mentors who volunteered to judge the final projects meet on Zoom to select the top 10 projects. Image: Leo Anthony Celi

Uncertainty about the course of the Covid-19 pandemic continues, with more than 2,500,000 known cases and 126,000 deaths in the United States alone. How to contain the virus, limit its damage, and address the deep-rooted health and racial inequalities it has exposed are now urgent topics for policymakers. Earlier this spring, 300 data scientists and health care professionals from around the world joined the MIT Covid-19 Datathon to see what insights they might uncover.

“It felt important to be a part of,” says Ashley O’Donoghue, an economist at the Center for Healthcare Delivery Science at Beth Israel Deaconess Medical Center. “We thought we could produce something that might make a difference.”

Participants were free to explore five tracks: the epidemiology of Covid-19, its policy impacts, its disparate health outcomes, the pandemic response in New York City, and the wave of misinformation Covid-19 has spawned. After splitting into teams, participants were set loose on 20 datasets, ranging from county-level Covid-19 cases compiled by The New York Times to a firehose of pandemic-related posts released by Twitter. 

The participants, and the dozens of mentors who guided them, hailed from 44 countries and every continent except for Antarctica. To encourage the sharing of ideas and validation of results, the event organizers — MIT Critical Data, MIT Hacking Medicine, and the Martin Trust Center for MIT Entrepreneurship — required that all code be made available. In the end, 47 teams presented final projects, and 10 were singled out for recognition by a panel of judges. Several teams are now writing up their results for peer-reviewed publication, and at least one team has posted a paper.

“It’s really hard to find research collaborators, especially during a crisis,” says Marie-Laure Charpignon, a PhD student with MIT’s Institute for Data, Systems, and Society, who co-organized the event. “We’re hoping that the teams and mentors that found each other will continue to explore these questions.”

In a pre-print on medRxiv, O’Donoghue and her teammates identify the businesses most at risk for seeding new Covid-19 infections in New York, California, and New England. Analyzing location data from SafeGraph, a company that tracks commercial foot traffic, the team built a transmission-risk index for businesses that in the first five months of this year drew the most customers, for longer periods of time, and in more crowded conditions, due to their modest size. 

Comparing this risk index to new weekly infections, the team classified 16.3 percent of countywide businesses as “superspreaders,” most of which were restaurants and hotels. A 1 percent increase in the density of super-spreader businesses, they found, was linked to a 5 percent jump in Covid-19 cases. The team is now extending its analysis to all 50 states, drilling down to ZIP code-level data, and building a decision-support tool to help several hospitals in their sample monitor risk as communities reopen. The tool will also let policymakers evaluate a wide range of statewide reopening policies.
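A hypothetical reconstruction of such an index (not the team's actual formula, and with invented numbers) might score each business by visit volume, dwell time, and crowding:

```python
def risk_index(visits, avg_minutes, area_sqft):
    """Toy transmission-risk score: more visitors, longer stays, and
    higher crowding (visits per square foot) all raise the score."""
    density = visits / area_sqft  # crowding proxy
    return visits * avg_minutes * density

# Invented example businesses, not SafeGraph data.
scores = {
    "restaurant": risk_index(visits=900, avg_minutes=45, area_sqft=800),
    "bank":       risk_index(visits=300, avg_minutes=10, area_sqft=2000),
    "hotel":      risk_index(visits=600, avg_minutes=120, area_sqft=5000),
}

# Flag the highest-scoring business; the study instead flagged the top
# slice of a countywide score distribution.
cutoff = max(scores.values())
superspreaders = [name for name, s in scores.items() if s >= cutoff]
```

The real analysis then correlated such scores against weekly case counts; this sketch only shows how the three ingredients might combine.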

“If we see a second wave of infections, we can determine which policies actually worked,” says O’Donoghue.

The datathon model for collaborative research is the brainchild of Leo Anthony Celi, a researcher at MIT and staff physician at Beth Israel Deaconess Medical Center. The events are usually coffee-fueled weekend affairs. But this one took place over a work week, and amid a global lockdown, with teammates having to meet and collaborate over Slack and Zoom.

With no coffee breaks or meals, they had fewer chances to network, says Celi. But the virtual setting allowed more people to join, especially mentors, who could participate without taking time off to travel. It also may have made teams more efficient, he says. 

Read more

Originally published by
Kim Martineau | MIT Quest for Intelligence
July 1, 2020
MIT News

Read more…
Gold Level Contributor


It’s a conundrum that many database startups have yet to solve: Why does Oracle continue to dominate the space if its relational database is so out of step with the times? The software vendor that cracks the code will be extremely wealthy. The distributed SQL database company Yugabyte will take its best shot under former Pivotal co-founder and president Bill Cook, who recently joined the company.

At Pivotal, Cook and his colleagues helped usher in a new way of approaching application development. “[We were] helping the community and large-scale clients move to this new cloud-native world” by “showing people how to build applications in a new CI/CD, cloud-native microservices way,” Cook explained recently to Datanami.

As an EMC/VMware spinout, Pivotal was quite successful in redefining the creation of the application layer of the modern cloud stack, and Cook stayed with the company until he ferried it through its IPO and subsequent re-acquisition by VMware in 2019. At that point, with 35 years in the IT saddle, Cook could have hung up his spurs. But something was bugging him.

“I really thought the data side of that story has been underserved, meaning these applications are being moved, people [are] embracing all these technologies and new ways of building software to drive business results, but the data tier was still a bit problematic,” Cook says.

What the market demands, Cook says, is a geographically distributed SQL database that’s rock-solid, feels familiar to developers, and doesn’t lock them in. In short, it borrows the positive attributes of the eponymous Oracle database, but without the vendor lock-in, lack of horizontal scalability, and DevOps complexity.

As Cook got to know the founders of Yugabyte–Kannan Muthukkaruppan, Karthik Ranganathan, and Mikhail Bautin–he came to the conclusion that YugabyteDB was the right database to tackle the challenges posed by the modern cloud platform.

According to Muthukkaruppan, there were three main design points that he and his fellow founders strived to meet as they were developing Yugabyte. The first design point was developer familiarity.

“The relational database paradigm, with SQL as the language, is something developers are really comfortable with,” Muthukkaruppan says. “We took that to heart. YugabyteDB…is a distributed SQL database, but it’s completely PostgreSQL compatible. The compatibility is not only at the language level, but the ecosystem and the driver level.”

Read more

Originally published by
Alex Woodie | June 30, 2020


Read more…
Gold Level Contributor

Did Big Data Fail Us During COVID-19?

The COVID-19 pandemic has been something of a proving ground for tech. Industry professionals have upheld the value of tools like big data for years, and now they have a chance to prove it. With the outbreak still raging on, you may wonder if big data has been all that helpful.

Governments and organizations across the world have employed big data to respond to the crisis. Some continue to sing its praises, but should that be the case? How has the world of big data affected the fight against coronavirus?

How Big Data Has Helped

Big data has undoubtedly been useful amid the pandemic. To determine the best strategies for addressing the outbreak, you must know how widespread the problem is. Without big data, tracking infection rates would be a far more challenging task.

Researchers have used various data, from GPS readings to temperatures, to help understand the disease. The scope of big data enables you to look at infection rates through a range of different lenses. As scientists still don’t know much about the virus, this versatility has proved useful.

The perspective that big data provides on the virus could also help find a treatment. Some organizations are using big data to model coronavirus so that they can run simulations of how it would interact with different medications. That way, they can find potential treatments to focus on in real-world lab tests.

Concerns Over Big Data in the Pandemic

You’ve probably noticed that despite these advantages, the world still seems to have little control over the situation. While you can’t blame that entirely on big data, there are some shortcomings to note. Most notably, privacy concerns may cause people to avoid testing and treatment.

When big data becomes part of healthcare, it raises concerns over patient confidentiality. If people know the government is tracking them, they may avoid doing anything that would create more data points. They may then stay away from vital services, like filing for ACA coverage online or logging into a health screening website.

As a result of these privacy fears, big data-based tracking may prove counter-effective. If you’re worried about privacy, you’ll avoid creating more data, which would skew results. More severely, you may not seek out treatment in fear of authorities overstepping their boundaries and looking at more of your information, like criminal records.

Using big data to fuel AI in healthcare could also misdirect efforts to fight the virus. AI tends to exaggerate human biases, which could lead to unreliable and even unjust results. Any misdiagnoses or false trends could cause these programs to direct officials away from the actual source of the problem.

Has Big Data Failed?

Big data has undoubtedly helped track the outbreak, but these results may not be accurate. It would also only take a few missteps for big data to misdirect anti-pandemic efforts. With these concerns in mind, has big data failed?

It’s difficult to say. Because of the nature of these drawbacks, you can’t be sure if they’re actual problems or just theoretical. At the same time, the benefits of big data amid the pandemic have yet to produce many verifiable results.

It remains unclear whether big data has helped or hindered the world’s coronavirus response. You may not know until after the pandemic is over, and you can look back on how it played out. Until then, it seems like big data is still a useful tool, but organizations should be careful with it.

Health and Safety in the Digital Age

The world is becoming more digital by the day. All of this data can be a tremendous resource for fighting disease, but there are also some noteworthy drawbacks. As healthcare and data become more intertwined, concerns over patient confidentiality are more prevalent than ever.

The marriage of medicine and big data is all but inevitable. Depending on how you look at it, that could be either a considerable benefit or a cause for concern. As this new age begins, we should probably proceed with caution.

Originally published by
Caleb Danziger - Read more from Caleb on The Byte Beat, his tech blog.
Inside Big Data | June 26, 2020

Read more…
Gold Level Contributor

Which Are The Real Benefits of Big Data?

Big Data essentially refers to an incredible amount of data and information that keeps on growing exponentially with time, and needs to be properly analyzed and processed in order to uncover valuable information that will eventually benefit organizations and businesses.

While Big Data is quite famous right now, the truth is that it remains largely unknown to many people around the world, even as its influence has grown so much that most analysts believe it will eventually expand into our everyday lives.

So, in case you don't know much about Big Data, here are the most important benefits this tech innovation provides to the businesses and organizations that make use of it.

The Benefits of Big Data

The first main benefit Big Data gives us is so-called predictive analysis, the feature that allows its analytics tools to predict outcomes accurately. Naturally, this is a dream come true for most businesses, considering that predictive analysis allows them to make better decisions and even reduce risk by optimizing their operational efficiency.
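As a toy illustration of what predictive analysis means in practice (made-up numbers, not any specific vendor's tool), a least-squares trend line fitted to past sales can project the next period:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

months = [1, 2, 3, 4]          # hypothetical past months
sales  = [10, 12, 14, 16]      # hypothetical monthly sales

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept  # projected sales for month 5
print(forecast)  # 18.0
```

Real predictive analytics layers far richer models on top, but the principle is the same: learn a pattern from historical data and extrapolate it.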

Another outstanding benefit of Big Data is the ability to harness data from the different social media platforms in the market using analytics tools. This way, businesses around the world can easily streamline their digital marketing strategies to improve the consumer experience. After all, this tech innovation helps you identify customer pain points, one of the most valuable marketing insights anyone can use today.

Additionally, Big Data combines relevant information from numerous sources to produce actionable insights, typically by segmenting the data. The reason this is such a benefit is that it lets companies and organizations filter out so-called "garbage information," which ultimately saves them a lot of money.

Want more? Then let’s say that what most analysts point out it’s simply the greatest benefit of Big Data is the fact that this one helps companies and organizations to increase their sales leads, which traduce in a significant revenue boost. This usually happens because the analytics tool of this tech innovation can determine the way certain services and products are performing in the market and how clients around the world are responding to these.

Finally, Big Data allows companies to check not only the market but also how the competition is performing, by showing the different promotions being offered to customers. What makes this incredible is that Big Data will let you know whether or not customers are attracted to these promotions.

Originally published by
Big Data Analytics News | June 26, 2020

Read more…

JAAGNet Big Data Feed

AI & Big Data Expo Europe 2020 - Postponed to November 24 - 25, 2020

DATA 2021 - Conference

  • Description:

    The purpose of the International Conference on Data Science, Technology and Applications (DATA) is to bring together researchers, engineers and practitioners interested in databases, big data, data mining, data management, data security and other aspects of information systems and technology involving advanced applications of data.


  • Created by: Kathy Jones
  • Tags: data, big data, data science, paris, france

